# Edit distance  

Note that the visualization is an animation, so the static output below shows only its final frame. 
To watch it, run the cells containing the `import` and the `visualize` call.

In [1]:
from edit import edit_table, edit_distance, visualize

x, y = 'ala', 'alicja'
edit_table(x, y)

array([[0., 1., 2., 3., 4., 5., 6.],
       [1., 0., 1., 2., 3., 4., 5.],
       [2., 1., 0., 1., 2., 3., 4.],
       [3., 2., 1., 1., 2., 3., 3.]])

In [2]:
edit_distance(x,y)

3.0
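
The source of `edit.py` is not shown in this notebook, so here is a minimal sketch of how such a table can be filled in, assuming unit costs for insertion, deletion and substitution. The names `edit_table_sketch` and `edit_distance_sketch` are my own (hypothetical), and plain lists of ints stand in for the NumPy array of floats returned above.

```python
def edit_table_sketch(x, y):
    """Return the (len(x)+1) x (len(y)+1) table of prefix edit distances."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # turn x[:i] into '' with i deletions
    for j in range(cols):
        table[0][j] = j  # turn '' into y[:j] with j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match (0) or substitution (1)
            )
    return table


def edit_distance_sketch(x, y):
    """The distance is the bottom-right cell of the table."""
    return edit_table_sketch(x, y)[-1][-1]


print(edit_distance_sketch('ala', 'alicja'))  # 3
```

Each cell depends only on its left, upper and upper-left neighbours, so the table is filled row by row in O(len(x) * len(y)) time.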

In [3]:
visualize(x, y, sleep_for=1.2)

------------------------
alicja
      ↑
alicja
      ↑
------------------------
DONE.. edit distance: 3
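
Since only the last frame of the animation survives in a static export, here is a rough sketch of how the individual steps could be recovered as text by walking the distance table backwards. This is not how `visualize` is implemented (its source is not shown here); `edit_script_sketch` is a hypothetical helper and unit costs are assumed.

```python
def edit_script_sketch(x, y):
    """Return a list of (operation, source_char, target_char) turning x into y."""
    rows, cols = len(x) + 1, len(y) + 1
    # Fill the table of prefix edit distances (unit costs assumed).
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, undoing one operation per step.
    ops = []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            ops.append(('keep', x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            ops.append(('substitute', x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and table[i][j] == table[i][j - 1] + 1:
            ops.append(('insert', '', y[j - 1]))
            j -= 1
        else:
            ops.append(('delete', x[i - 1], ''))
            i -= 1
    ops.reverse()
    return ops


print(edit_script_sketch('los', 'kloc'))
# [('insert', '', 'k'), ('keep', 'l', 'l'), ('keep', 'o', 'o'), ('substitute', 's', 'c')]
```

The number of operations other than `keep` equals the edit distance. When several optimal scripts exist, the order of the checks above decides which one is returned.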


## Examples

In [4]:
x, y = 'los', 'kloc'
visualize(x, y)

------------------------
kloc
    ↑
kloc
    ↑
------------------------
DONE.. edit distance: 2


In [5]:
x, y = 'Łódź', 'Lodz'
visualize(x, y)

------------------------
Lodz
    ↑
Lodz
    ↑
------------------------
DONE.. edit distance: 3


In [6]:
x, y = 'kwintesencja', 'quintessence'
visualize(x, y)

------------------------
quintessence
            ↑
quintessence
            ↑
------------------------
DONE.. edit distance: 5


In [8]:
x, y = 'ATGAATCTTACCGCCTCG', 'ATGAGGCTCTGGCCCCTG'
visualize(x, y)

------------------------
ATGAGGCTCTGGCCCCTG
                  ↑
ATGAGGCTCTGGCCCCTG
                  ↑
------------------------
DONE.. edit distance: 7


# Longest common subsequence

Example from Wikipedia.

In [2]:
from lcs import lcs_table, lcs, diff_line

x, y = 'XMJYAUZ', 'MZJAWXU'
lcs_table(x, y)

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 2., 2., 2., 2., 2.],
       [0., 1., 1., 2., 2., 2., 2., 2.],
       [0., 1., 1., 2., 3., 3., 3., 3.],
       [0., 1., 1., 2., 3., 3., 3., 4.],
       [0., 1., 2., 2., 3., 3., 3., 4.]])

In [3]:
lcs(x, y, join = True)

'MJAU'
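
For reference, a minimal sketch of the usual dynamic-programming approach to this problem: fill a table of LCS lengths for all prefix pairs, then walk it backwards to read off one common subsequence. The source of `lcs.py` is not shown here, so `lcs_table_sketch` and `lcs_sketch` are hypothetical names, and the tie-breaking rule in the backward walk is my own choice.

```python
def lcs_table_sketch(x, y):
    """table[i][j] is the LCS length of x[:i] and y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_sketch(x, y):
    """Return one longest common subsequence of x and y as a list."""
    table = lcs_table_sketch(x, y)
    i, j = len(x), len(y)
    common = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            common.append(x[i - 1])  # this element is part of the LCS
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1                   # ties are resolved by moving up
        else:
            j -= 1
    common.reverse()
    return common


print(''.join(lcs_sketch('XMJYAUZ', 'MZJAWXU')))  # MJAU
```

Because the functions only index and compare elements, they work on lists of lines as well as on strings.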

#### Line diff

This one is just for fun.

In [4]:
print(diff_line(x, y))

<XMJYAUZ
>MZJAWXU
 XMYAUZJWXUAU
 - ---- +++  



# diff

- The format of my diff follows the one described at https://en.wikipedia.org/wiki/Diff#Usage.
- Note that I use ANSI escape codes to color the output, so if the terminal (or whatever reads stdout) does not handle them, the output might look messy.
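
A small sketch of how such coloring can be done, and switched off when stdout is not a terminal. The helper `colorize` is hypothetical and not part of `lcs.py`. The escape sequences use the standard SGR codes 32 (green), 31 (red) and 0 (reset).

```python
import sys

GREEN = '\x1b[32m'  # ESC [ 32 m: green foreground
RED = '\x1b[31m'    # ESC [ 31 m: red foreground
RESET = '\x1b[0m'   # ESC [ 0 m: back to the default style


def colorize(text, color, enabled=None):
    """Wrap text in an ANSI color; by default only when stdout is a terminal."""
    if enabled is None:
        enabled = sys.stdout.isatty()
    return f'{color}{text}{RESET}' if enabled else text


print(colorize('> new', GREEN))
print(colorize('< outdated', RED))
```

When the output is redirected to a file or a pipe, `sys.stdout.isatty()` returns `False` and the text is printed without escape codes.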

Below is a simple example.

In [5]:
x = ['common', 'outdated', 'more_common']
y = ['new', 'common', 'more_common', 'more_new']

lcs(x, y)

['common', 'more_common']
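
To show how an LCS leads to a diff, here is a rough sketch that labels every line as kept (`' '`), removed (`'-'`) or added (`'+'`) by walking the LCS table backwards. It only illustrates the idea: `diff_ops_sketch` is a hypothetical name, and the sketch does not try to reproduce the line-number headers that `diff` prints.

```python
def diff_ops_sketch(old, new):
    """Return a list of (marker, line) pairs describing how old becomes new."""
    # table[i][j] is the LCS length of old[:i] and new[:j].
    table = [[0] * (len(new) + 1) for _ in range(len(old) + 1)]
    for i in range(1, len(old) + 1):
        for j in range(1, len(new) + 1):
            if old[i - 1] == new[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    ops = []
    i, j = len(old), len(new)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            ops.append((' ', old[i - 1]))  # line belongs to the LCS
            i, j = i - 1, j - 1
        elif j > 0 and (i == 0 or table[i][j - 1] >= table[i - 1][j]):
            ops.append(('+', new[j - 1]))  # line exists only in new
            j -= 1
        else:
            ops.append(('-', old[i - 1]))  # line exists only in old
            i -= 1
    ops.reverse()
    return ops


x = ['common', 'outdated', 'more_common']
y = ['new', 'common', 'more_common', 'more_new']
for marker, line in diff_ops_sketch(x, y):
    print(marker, line)
```

Lines that belong to the LCS are kept; everything else in `x` is a deletion and everything else in `y` is an addition.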

In [1]:
from lcs import diff

x = ['common', 'outdated', 'more_common']
y = ['new', 'common', 'more_common', 'more_new']

print(diff(x, y))

0a1[32m
> new
[0m2d3[31m
< outdated
[0m3a4[32m
> more_new
[0m


### Example from Wikipedia
https://en.wikipedia.org/wiki/Diff#Usage

There is one subtle difference:
- I don't use 'c' (change); instead, a changed line is reported as a 'd' (delete) followed by an 'a' (append).
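
The same convention can be illustrated with Python's standard `difflib`, which is used here instead of the `lcs` module: every `replace` opcode is split into a delete followed by an append. The helper name `delete_then_append` and the `'unchanged line'` filler are made up for this sketch.

```python
import difflib


def delete_then_append(old, new):
    """List ('d', lines) and ('a', lines) operations, never a combined change."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('delete', 'replace'):
            ops.append(('d', old[i1:i2]))  # lines removed from old
        if tag in ('insert', 'replace'):
            ops.append(('a', new[j1:j2]))  # lines added in new
    return ops


old = ['unchanged line', 'check this dokument. On']
new = ['unchanged line', 'check this document. On']
print(delete_then_append(old, new))
# [('d', ['check this dokument. On']), ('a', ['check this document. On'])]
```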

The program can also be used from the command line: 
- `python lcs.py "resources/original.txt" "resources/new.txt"`

In [1]:
from lcs import diff_files

original = 'resources/original.txt'
new = 'resources/new.txt'

diff_files(original, new)

0a1,6[32m
> This is an important
> notice! It should
> therefore be located at
> the beginning of this
> document!
> 
[0m11,15d17[31m
< This paragraph contains
< text that is outdated.
< It will be deleted in the
< near future.
< 
[0m17d18[31m
< check this dokument. On
[0m16a18[32m
> check this document. On
[0m24a26,29[32m
> 
> This paragraph contains
> important new additions
> to this document.
[0m


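## Romeo and Juliet

To keep things simple, I split the text into lines and then delete approximately 3% of the tokens. Doing this twice gives two slightly different copies of the play, which are then compared with `diff_files`.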

In [2]:
from spacy.tokenizer import Tokenizer
from spacy.lang.pl import Polish
from numpy.random import rand
import spacy

filename = 'resources/romeo-i-julia.txt'
lines = None

with open(filename, 'r', encoding = 'UTF-8') as file:
    lines = file.read().splitlines()

nlp = Polish()
tokenizer = Tokenizer(nlp.vocab)

In [3]:
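def remove_rand_tokens(lines):
    """Return a copy of `lines` in which each token is dropped with probability of about 3%."""
    text = []
    for doc in tokenizer.pipe(lines):
        # rand() is uniform on [0, 1), so a token is kept about 97% of the time
        text.append(' '.join([token.text for token in doc if rand() <= .97]))

    return text

def save_lines(lines, filename):
    """Write `lines` to `filename` as UTF-8, separated by newlines."""
    with open(filename, 'w', encoding = 'UTF-8') as file:
        file.write('\n'.join(lines))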

In [4]:
lines1 = remove_rand_tokens(lines)
file1  = 'resources/f1.txt'
save_lines(lines1, file1)

lines2 = remove_rand_tokens(lines)
file2 = 'resources/f2.txt'
save_lines(lines2, file2)
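
Because `rand()` is not seeded, every run produces different files. A reproducible variant could look like the sketch below. It is an alternative, not the code used above: it relies only on the standard library, splits on whitespace (which I assume is close to what a bare `Tokenizer(nlp.vocab)` does), and the name `drop_tokens` is hypothetical.

```python
import random


def drop_tokens(line, keep_probability=0.97, rng=None):
    """Keep each whitespace-separated token with the given probability."""
    if rng is None:
        rng = random.Random()
    kept = [token for token in line.split() if rng.random() <= keep_probability]
    return ' '.join(kept)


line = 'Poczciwi ludzie, nie ma tu co robić.'
# Two generators with the same seed drop exactly the same tokens.
first = drop_tokens(line, 0.5, random.Random(42))
second = drop_tokens(line, 0.5, random.Random(42))
print(first == second)  # True
```

Passing differently seeded generators for the two copies keeps the experiment repeatable while still producing two distinct files.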

In [5]:
diff_files(file1, file2)

podobno schować dudy w miech i wynieść się za drzwi.
[0m6062a6063[32m
> Trzeba podobno schować dudy w miech i wynieść się za drzwi.
[0m6069d6069[31m
< Poczciwi ludzie, nie ma tu co robić.
[0m6068a6069[32m
> Poczciwi ludzie, nie ma tu robić.
[0m6076d6076[31m
< / Wchodzi Piotr. /
[0m6075a6076[32m
> / Piotr. /
[0m6081d6081[31m
< Zagrajcie mi na basetli, panowie zagrajcie mi na basetli, jeżeli mi dobrze życzycie.
[0m6080a6081[32m
> Zagrajcie mi na basetli, panowie muzykanci, zagrajcie mi na basetli, jeżeli mi dobrze życzycie.
[0m6136d6136[31m
< Schowaj, waćpan, swój rożen, a wydobądź lepiej swój dowcip.
[0m6135a6136[32m
> Schowaj, waćpan, swój rożen, a wydobądź swój dowcip.
[0m6141d6141[31m
< Strzeżcie się ostrza mego dowcipu, bo was przeszyje na wylot. Baczność!
[0m6140a6141[32m
> Strzeżcie się ostrza mego dowcipu, was przeszyje na wylot. Baczność!
[0m6143d6143[31m
< śpiewa /
[0m6142a6143[32m
> / śpiewa /
[0m6150d6150[31m
< Dlaczego srebrny dźwięk? Dlaczego muz