I was studying German the other day and stumbled upon a typo that led me to an interesting observation about these two words:

- [anschließen](https://en.wiktionary.org/wiki/anschlie%C3%9Fen#German) (to connect)
- [anschließend](https://en.wiktionary.org/wiki/anschlie%C3%9Fend#German) (following, afterwards)

They are very "similar", and I would like them to be connected in [wilhelmlang.com](https://wilhelmlang.com/), a platform that helps language learners study multiple languages through a knowledge graph.

We define the similarity of two words in this context as follows:

___Two words are similar either structurally or semantically___.

For example:

- __anschließen__ and __anschließend__ are structurally similar because they differ by just one character (the trailing __d__).
- __anschließend__ and [__nachher__](https://en.wiktionary.org/wiki/nachher#German) are semantically similar because both mean __afterwards__ when used as adverbs.
- Some pairs are both. For instance, [__das Theater__](https://en.wiktionary.org/wiki/Theater#German) (the theater) and [__das Theaterstück__](https://en.wiktionary.org/wiki/Theaterst%C3%BCck#German) (the drama) are similar both semantically and structurally.

### Levenshtein Distance

The first idea was to calculate the similarity between two words.

The closest existing metric is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (also popularly called the _edit distance_).

> In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.


In [1]:
import nltk
nltk.edit_distance("anschließen", "anschließend")

1

The code above returns 1, because the two words differ by a single letter. The Levenshtein distance is good for spotting the __anschließen-anschließend__ case.
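
To make the metric less of a black box, here is a minimal from-scratch sketch of the same idea using the standard dynamic-programming approach. The function name `levenshtein` is my own choice for illustration, and this is not how `nltk.edit_distance` is implemented internally; it only assumes unit cost for every insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    # previous[j] = distance between the processed prefix of `a` and b[:j].
    # Before any character of `a` is read, that is j insertions.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("anschließen", "anschließend"))  # 1
print(levenshtein("anschließend", "nachher"))      # much larger than 1
```

Only two rows of the table are kept at any time, so memory grows with the length of the second word rather than with the product of both lengths.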

The __anschließend-nachher__ case won't work well with the edit distance, though, because the two words share a meaning but not a spelling. We need a different metric.

### Cosine Similarity

In [2]:
import spacy
spacy.cli.download('de_core_news_sm')

nlp = spacy.load('de_core_news_sm') 
  
text1 = 'anschließend'
text2 = 'nachher'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))

spaCy : 0.33368661999702454
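
As far as I understand the spaCy documentation, `similarity` defaults to the cosine similarity of the two vectors. The sketch below shows what that measure computes, using plain Python lists. The function name `cosine_similarity` and the toy vectors are made up for illustration; they are not the vectors spaCy assigns to these words.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: treat a zero vector as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical): the first pair points in the same direction,
# the second pair is orthogonal.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they are unrelated. Note that small pipelines such as `de_core_news_sm` do not ship static word vectors, so a larger model may give more meaningful scores for the __anschließend-nachher__ pair.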


