# 2D pattern matching
Zadanie dotyczy wyszukiwania wzorców dwuwymiarowych.

1. Zaimplementuj algorytm wyszukiwania wzorca 2-wymiarowego
2. Znajdź w załączonym pliku "haystack.txt" wszyskie sytuacje, gdy taka sama litera występuje na tej samej pozycji w dwóch kolejnych linijkach. Zwróć uwagę, na nierówną długość linii w pliku.
3. Znajdź wszystkie wystąpienia "th" oraz "t h" w dwóch kolejnych liniach na tej samej pozycji.
4. Wybierz przynajmniej 4 litery (małe). Znajdź wszystkie wystąpienia tej litery w załączonym pliku "haystack.png"
5. Znajdź wszystkie wystąpienia słowa "p a t t e r n" w haystack.png.
6. Porównaj czas budowania automatu i czas wyszukiwania dla różnych rozmiarów wzorca
7. Podziel plik na 2, 4 i 8 fragmentów (w poziomie) i porównaj czas przeszukiwania
---

In [1]:
import numpy as np
import pandas as pd
from PIL import Image

__Exercise 1__

Zaimplementuj algorytm wyszukiwania wzorca 2-wymiarowego

In [2]:
def get_diff_columns(pattern):
    columns = []
    indexes = []
    alphabet = set()
    for i in range(len(pattern[0])):
        col = []
        for j in range(len(pattern)):
            col.append(pattern[j][i])
            alphabet.add(pattern[j][i])

        if col in columns:
            idx = columns.index(col)
            indexes.append(idx)
        else:
            columns.append(col)
            indexes.append(len(columns) - 1)
    return columns, indexes, alphabet

In [3]:
def vertical_automaton(columns, letters):
    tt = [{}]
    words = [[]]
    states = [0] * len(columns)

    for i in range(len(columns[0])):
        for j in range(len(columns)):
            if columns[j][i] in tt[states[j]]:
                states[j] = tt[states[j]][columns[j][i]]
            else:
                tt[states[j]][columns[j][i]] = len(tt)
                words.append(words[states[j]] + [columns[j][i]])
                states[j] = len(tt)
                tt.append({})

    for i in range(len(tt)):
        for l in letters:
            if l not in tt[i]:
                suffix = (words[i] + [l])[1:]
                state = 0
                for s in suffix:
                    if s in tt[state]:
                        state = tt[state][s]
                    else:
                        state = 0

                tt[i][l] = state
    return tt, states

In [4]:
def horizontal_automaton(pattern, letters):
    result = []
    for state in range(len(pattern) + 1):
        result.append({})
        for l in letters:
            next_state = min(len(pattern), state + 1)
            while True:
                if pattern[:next_state] == (pattern[:state] + [l])[state - next_state + 1:state + 1]:
                    break
                next_state -= 1
            result[state][l] = next_state
    return result

In [5]:
def main_automaton(pattern):
    columns, indexes, letters = get_diff_columns(pattern)
    vertical_tt, vertical_states = vertical_automaton(columns, letters)

    new_pattern = [vertical_states[indexes[i]] for i in range(len(indexes))]
    horizontal_tt = horizontal_automaton(new_pattern, vertical_states)
    horizontal_state = len(horizontal_tt) - 1
    return vertical_tt, horizontal_tt, horizontal_state

In [6]:
def pattern_matching_2d(text, pattern, automaton=None):
    if automaton is None:
        vertical_tt, horizontal_tt, horizontal_state = main_automaton(pattern)
    else:
        vertical_tt, horizontal_tt, horizontal_state = automaton

    result = []
    vertical_states = []
    for i in range(len(text)):
        if len(text[i]) < len(vertical_states):
            vertical_states = vertical_states[:len(text[i])]
        elif len(vertical_states) < len(text[i]):
            vertical_states = vertical_states + [0] * (len(text[i]) - len(vertical_states))

        new_horizontal_state = 0
        for j in range(len(text[i])):
            if text[i][j] in vertical_tt[vertical_states[j]]:
                vertical_states[j] = vertical_tt[vertical_states[j]][text[i][j]]
            else:
                vertical_states[j] = 0
            if vertical_states[j] in horizontal_tt[new_horizontal_state]:
                new_horizontal_state = horizontal_tt[new_horizontal_state][vertical_states[j]]
                if new_horizontal_state == horizontal_state:
                    result.append((i - len(pattern) + 1, j - len(pattern[0]) + 1))
            else:
                new_horizontal_state = 0
    return result

__Exercise 2__

Znajdź w załączonym pliku "haystack.txt" wszyskie sytuacje, gdy taka sama litera występuje na tej samej pozycji w dwóch kolejnych linijkach. Zwróć uwagę, na nierówną długość linii w pliku.

In [7]:
with open("haystack.txt") as f:
    text = f.readlines()

In [8]:
for i in range(ord("a"), ord("z") + 1):
    pattern = [chr(i), chr(i)]
    result = pattern_matching_2d(text, pattern)
    print(f"PATTERN:\n{pattern}")
    print(f"FOUND INDEXES:\n{result}")
    print("#############################################")

PATTERN:
['a', 'a']
FOUND INDEXES:
[(0, 82), (3, 30), (5, 60), (6, 63), (20, 6), (28, 69), (31, 50), (31, 73), (33, 66), (37, 4), (52, 12), (53, 12), (53, 48), (56, 11), (57, 36), (58, 36), (59, 24), (64, 2), (64, 14), (64, 22), (65, 35), (69, 35), (76, 21), (76, 74), (77, 42), (77, 61), (78, 59), (79, 37)]
#############################################
PATTERN:
['b', 'b']
FOUND INDEXES:
[]
#############################################
PATTERN:
['c', 'c']
FOUND INDEXES:
[(3, 54), (10, 45), (13, 10), (41, 0), (68, 0), (82, 41)]
#############################################
PATTERN:
['d', 'd']
FOUND INDEXES:
[(37, 19)]
#############################################
PATTERN:
['e', 'e']
FOUND INDEXES:
[(0, 63), (1, 8), (4, 77), (7, 65), (10, 1), (10, 64), (14, 2), (15, 43), (17, 6), (18, 27), (20, 10), (21, 61), (22, 53), (24, 3), (24, 65), (28, 67), (28, 73), (29, 38), (29, 43), (37, 48), (40, 11), (40, 26), (41, 57), (42, 36), (42, 48), (46, 52), (47, 50), (51, 31), (57, 54), (58, 50), (58

__Exercise 3__

Znajdź wszystkie wystąpienia "th" oraz "t h" w dwóch kolejnych liniach na tej samej pozycji.

In [9]:
pattern = ["th", "th"]
result = pattern_matching_2d(text, pattern)
result

[]

In [10]:
pattern = ["t h", "t h"]
result = pattern_matching_2d(text, pattern)
result

[(37, 0)]

__Exercise 4__

Wybierz przynajmniej 4 litery (małe). Znajdź wszystkie wystąpienia tej litery w załączonym pliku "haystack.png"

In [11]:
def convert_image(file_name):
    image = Image.open(file_name)
    pixels = list(image.getdata())
    width, height = image.size
    text = []
    i = width
    for pixel in pixels:
        if i == width:
            i = 0
            text.append([])
        text[-1].append(pixel[0])
        i += 1
    return text

In [12]:
text = convert_image("haystack.png")

![a.png](attachment:a.png)

In [13]:
a = convert_image("pictures/a.png")
a

[[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255],
 [255, 253, 177, 74, 25, 8, 45, 149, 255, 255, 255],
 [255, 203, 0, 0, 0, 0, 0, 0, 129, 255, 255],
 [255, 206, 87, 193, 236, 241, 186, 35, 10, 242, 255],
 [255, 255, 255, 255, 255, 255, 255, 175, 0, 188, 255],
 [255, 255, 159, 60, 16, 1, 0, 0, 0, 162, 255],
 [255, 138, 0, 0, 0, 0, 0, 0, 0, 155, 255],
 [255, 38, 23, 189, 238, 252, 255, 177, 0, 155, 255],
 [255, 38, 27, 201, 249, 238, 178, 31, 0, 155, 255],
 [255, 120, 0, 0, 0, 0, 0, 89, 0, 155, 255],
 [255, 249, 123, 31, 7, 41, 152, 199, 0, 155, 255],
 [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255]]

In [14]:
a_matched = pattern_matching_2d(text, a)
print(f"Liczba wystąpień a to: {len(a_matched)}.")

Liczba wystąpień a to: 356.


![l.png](attachment:l.png)

In [15]:
l = convert_image("pictures/l.png")
l

[[255, 255, 255, 255, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 179, 0, 175, 255],
 [255, 255, 255, 255, 255]]

In [16]:
l_matched = pattern_matching_2d(text, l)
print(f"Liczba wystąpień l to: {len(l_matched)}.")

Liczba wystąpień l to: 169.


![m.png](attachment:m.png)

In [17]:
m = convert_image("pictures/m.png")
m

[[163, 0, 187, 151, 36, 9, 59, 209, 255, 208, 68, 11, 25, 135, 255],
 [163, 0, 86, 0, 0, 0, 0, 24, 209, 14, 0, 0, 0, 0, 171],
 [163, 0, 7, 153, 236, 232, 87, 0, 16, 85, 218, 244, 167, 2, 63],
 [163, 0, 121, 255, 255, 255, 225, 0, 31, 249, 255, 255, 255, 65, 23],
 [163, 0, 179, 255, 255, 255, 254, 0, 83, 255, 255, 255, 255, 94, 1],
 [163, 0, 187, 255, 255, 255, 255, 0, 91, 255, 255, 255, 255, 95, 0],
 [163, 0, 187, 255, 255, 255, 255, 0, 91, 255, 255, 255, 255, 95, 0],
 [163, 0, 187, 255, 255, 255, 255, 0, 91, 255, 255, 255, 255, 95, 0],
 [163, 0, 187, 255, 255, 255, 255, 0, 91, 255, 255, 255, 255, 95, 0],
 [163, 0, 187, 255, 255, 255, 255, 0, 91, 255, 255, 255, 255, 95, 0]]

In [18]:
m_matched = pattern_matching_2d(text, m)
print(f"Liczba wystąpień m to: {len(m_matched)}.")

Liczba wystąpień m to: 131.


![s.png](attachment:s.png)

In [19]:
s = convert_image("pictures/s.png")
s

[[249, 129, 40, 10, 25, 75, 194, 255],
 [107, 0, 0, 0, 0, 0, 7, 255],
 [28, 28, 196, 245, 235, 183, 70, 255],
 [48, 70, 255, 255, 255, 255, 255, 255],
 [194, 23, 46, 115, 166, 234, 255, 255],
 [255, 246, 181, 124, 71, 9, 110, 254],
 [255, 255, 255, 255, 255, 179, 0, 171],
 [58, 164, 220, 249, 228, 107, 0, 138],
 [0, 0, 0, 0, 0, 0, 9, 216],
 [198, 95, 35, 7, 28, 88, 208, 255]]

In [20]:
s_matched = pattern_matching_2d(text, s)
print(f"Liczba wystąpień s to: {len(s_matched)}.")

Liczba wystąpień s to: 334.


__Exercise 5__

Znajdź wszystkie wystąpienia słowa "p a t t e r n" w haystack.png.

![pattern.png](attachment:pattern.png)

In [21]:
pattern = convert_image("pictures/pattern.png")
pattern_matched = pattern_matching_2d(text, pattern)
print(f"Liczba wystąpień pattern to: {len(pattern_matched)}.")

Liczba wystąpień pattern to: 5.


__Exercise 6__

Porównaj czas budowania automatu i czas wyszukiwania dla różnych rozmiarów wzorca

In [22]:
from time import perf_counter
from random import randint

In [23]:
with open("haystack.txt") as f:
    text = f.readlines()

In [24]:
def building_times(text_size):
    building_times = []
    for i in text_size:
        pattern = [[chr(randint(ord('a'), ord('z'))) for _ in range(i)] for _ in range(i)]
        start = perf_counter()
        main_automaton(pattern)
        end = perf_counter()
        building_times += [i, end - start]
    df = pd.DataFrame(data={"text size": building_times[::2],
                            "building time [s]": building_times[1::2]})
    return df

In [25]:
text_size = [i for i in range(10, 260, 20)]
df_1 = building_times(text_size)
df_1

Unnamed: 0,text size,building time [s]
0,10,0.00251
1,30,0.046743
2,50,0.1775
3,70,0.477154
4,90,1.056525
5,110,1.863254
6,130,3.136854
7,150,4.957588
8,170,7.361527
9,190,10.530337


In [26]:
def searching_times(text, text_size, path_size):
    pattern = [line[:path_size] for line in text[:path_size]]
    automaton = main_automaton(pattern)
    searching_times = []

    for i in text_size:
        text = [line[:i] for line in text[:i]]
        start = perf_counter()
        pattern_matching_2d(text, pattern, automaton)
        end = perf_counter()
        searching_times += [i, end - start]
    df = pd.DataFrame(data={"text size": searching_times[::2],
                            "searching time [s]": searching_times[1::2]})
    return df

In [27]:
text = convert_image("haystack.png")
text_size = [i for i in range(500, 10001, 500)]
path_size = 25
df_2 = searching_times(text, text_size, path_size)
df_2

Unnamed: 0,text size,searching time [s]
0,500,0.067165
1,1000,0.061193
2,1500,0.063023
3,2000,0.058339
4,2500,0.058512
5,3000,0.05834
6,3500,0.058099
7,4000,0.063418
8,4500,0.062535
9,5000,0.066019


__Exercise 7__

Podziel plik na 2, 4 i 8 fragmentów (w poziomie) i porównaj czas przeszukiwania

In [28]:
from time import perf_counter

In [29]:
def divide_and_measure(text, path_size):
    pattern = [line[:path_size] for line in text[:path_size]]
    result = []

    for div in [2, 4, 8]:
        length = len(text) // div
        intervals = [text[i * length:(i + 1) * length] for i in range(div)]
        start = perf_counter()
        for i in intervals:
            pattern_matching_2d(i, pattern)
        end = perf_counter()
        result += [div, end - start]
    df = pd.DataFrame(data={"part": result[::2],
                            "time [s]": result[1::2]})
    return df

In [30]:
path_size = 25
df_3 = divide_and_measure(text, path_size)
df_3

Unnamed: 0,part,time [s]
0,2,0.466174
1,4,0.435927
2,8,0.435606
