# Words Apart
## Levenshtein Distance

## How far do you say?
### Measuring the gaps between words

Bioinformatics studies the building blocks of life. To examine a few more Python tools, we will look at the most basic building block of bioinformatics - [edit distance](https://en.wikipedia.org/wiki/Edit_distance).

This is how you can measure the separation between two strings of letters. If you remember the sci-fi film Gattaca (about a world gone genetic-testing crazy), or even school biology, you may be aware that our genes are made up of long DNA sequences of nucleotides, each carrying one of four _bases_: `G`, `A`, `T` and `C`. Guess where the film name comes from...

This means you can understand a lot about genetic similarity by comparing long sequences and working out how far apart they are - perhaps which genetic mutations would get you from one version to another - to understand which animals, plants or bacteria are related and why they differ.

GGACTATCTACTACCATACGGACTATCTACTACCATACGGACTATCTACTACCATACGGACTATCTACTACCATACGGACTATCTACTACCATAC...
GAGCTATCTACCTAGCATTCGACTAACTACTACCATTCGGACTATCTACTACCATACGGACTATCTACTACCATACGGACTATCTACTACCATAC...

You can see these are similar but not quite the same, and much more alike than either is to:
GGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCC...

How do we quantify that? One way:

**Levenshtein Distance**: (roughly) the minimum number of single-character edits (insertions, deletions or substitutions) needed to get from the first string to the second
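
To make the definition concrete, here is a minimal sketch that checks a hand-written chain of edits rather than computing the distance itself. The helper names `is_single_edit` and `count_edit_steps` are illustrative and not part of the course files. A valid chain only shows that the distance is at most that many edits; the Levenshtein distance is the length of the shortest such chain.

```python
def is_single_edit(before: str, after: str) -> bool:
    """Return True if the strings are exactly one insertion, deletion or substitution apart."""
    if len(before) == len(after):
        # Same length: only a substitution fits, so exactly one position differs.
        return sum(a != b for a, b in zip(before, after)) == 1
    shorter, longer = sorted((before, after), key=len)
    if len(longer) - len(shorter) != 1:
        return False
    # Insertion or deletion: dropping one character from the longer string gives the shorter.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def count_edit_steps(path: list[str]) -> int:
    """Check that each consecutive pair is one edit apart and return the number of edits."""
    for before, after in zip(path, path[1:]):
        if not is_single_edit(before, after):
            raise ValueError(f"{before!r} -> {after!r} is not a single-character edit")
    return len(path) - 1


# Two substitutions turn the first sequence into the second.
print(count_edit_steps(["GGACT", "GAACT", "GAGCT"]))  # 2
# Two substitutions followed by one insertion.
print(count_edit_steps(["kitten", "sitten", "sittin", "sitting"]))  # 3
```

For the first pair no single edit can work (the strings have equal length and differ in two positions), so 2 is in fact the Levenshtein distance there.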

In this exercise, we will touch on some of the tools that will become more foundational in the IDE setting, with VSCode.

## pip

`pip` is one of a wide range of tools constituting the Python packaging ecosystem. That ecosystem is hugely fragmented compared to most languages, but `pip` is a relatively simple and standard way of installing tools. You may also come across the Anaconda distribution, and its tool `conda`, which works very similarly. We need additional libraries for this exercise - you can install them as follows:

In [15]:
!pip install python-Levenshtein pytest-benchmark



Good practice is to have a requirements file, where version limits can be set for each package to avoid accidental breaking upgrades. A common standard is to pin the major version number since, under semantic versioning, minor and patch releases should not introduce breaking changes. Then, when you clone the repository, one way of installing dependencies is:

In [27]:
!pip install -r requirements.txt

Collecting Levenshtein==0.27.1 (from python-levenshtein>=0.20.0->-r requirements.txt (line 4))
  Using cached levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Using cached levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (159 kB)
Installing collected packages: Levenshtein
Successfully installed Levenshtein-0.27.1


In [28]:
!cat requirements.txt

numpy
pytest==8.4.1
pytest-benchmark==5.1.0
python-levenshtein>=0.20.0


You can see here that `pytest` and `pytest-benchmark` are pinned to exact versions, `python-levenshtein` only has a lower bound, and `numpy` is unconstrained. Ideally you are generous down the way but do not give _too_ much flexibility up the way (you don't know if a breaking change will come in a dependency's new version); this helps ensure your dependencies can be met while reducing the risk of third-party breakages. In production deployment, exact pinning and dedicated repositories are strongly recommended - some language tools do this in a more streamlined way (`package-lock.json` for npm, for example). However... narrowly pinning development code can prevent security patches being brought in, or cause conflicts between dependency versions when your module is later used with other code.
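
As a rough illustration of the different kinds of constraint, the sketch below classifies each line of the requirements file shown above. It is a deliberately simplified reading of the format, not how `pip` parses requirements, and the function name `describe_constraint` is made up for this example.

```python
import re

# One comparison operator followed by a version, e.g. "==8.4.1" or ">=0.20.0".
CLAUSE = re.compile(r"(==|~=|!=|>=|<=|>|<)\s*([^\s,;]+)")


def describe_constraint(line: str) -> str:
    """Classify a requirement line by the version operators it uses (simplified)."""
    # Ignore trailing comments and environment markers for this sketch.
    spec = line.split("#", 1)[0].split(";", 1)[0]
    operators = {operator for operator, _ in CLAUSE.findall(spec)}
    if not operators:
        return "unconstrained"
    if operators == {"=="}:
        return "exact pin"
    has_lower = bool(operators & {">=", ">"})
    has_upper = bool(operators & {"<=", "<"})
    if has_lower and has_upper:
        return "bounded range"
    if has_lower:
        return "lower bound only"
    if has_upper:
        return "upper bound only"
    return "other constraint"


requirements = """numpy
pytest==8.4.1
pytest-benchmark==5.1.0
python-levenshtein>=0.20.0"""

for line in requirements.splitlines():
    name = re.split(r"[=<>~!\s;]", line, maxsplit=1)[0]
    print(f"{name}: {describe_constraint(line)}")
```

A line such as `python-levenshtein>=0.20.0,<1.0` (a hypothetical upper limit, chosen only for illustration) would be reported as a bounded range, which is the "generous down, cautious up" shape described above.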

## pytest

This is our first contact with `pytest`. It automatically seeks out files named `test_*.py` (or `*_test.py`) -- I'll walk you through `test_levenshtein.py` now...
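
The contents of `test_levenshtein.py` are not reproduced here. As a hedged sketch of the general shape `pytest` expects, the example below tests a hypothetical stand-in function, `count_mismatches`, rather than the course's `calculate_levenshtein`. Test functions are plain functions whose names start with `test_`, and a failing `assert` is reported as a failed test.

```python
# Hypothetical stand-in for the function under test; the course's real tests
# target calculate_levenshtein in my_levenshtein.py instead.
def count_mismatches(first: str, second: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(first, second))


# pytest collects these because their names start with `test_`.
def test_identical_sequences_have_no_mismatches():
    assert count_mismatches("GATTACA", "GATTACA") == 0


def test_each_differing_position_is_counted():
    # Positions 1 and 2 differ: G/A and A/G.
    assert count_mismatches("GGACT", "GAGCT") == 2
```

Keeping each test to a single, clearly named behaviour makes the failure report easier to read when something breaks.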

You can run these tests from the command line:

In [36]:
!pytest test_levenshtein.py

platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-5.1.0, langsmith-0.4.14, typeguard-4.4.4, anyio-4.10.0
collecting ... collected 5 items

test_levenshtein.py .....                                                [100%]



### Exercise: Gene Hacking

There is a link there to Wikipedia, which has a standard textbook description of two algorithms - we will test which is faster. Can you fill in `my_levenshtein.py` to implement the _recursive_ `calculate_levenshtein` algorithm? Re-run the `pytest` command above until it passes!

### Extension: Gene Wilder

Going one further, can you implement the matrix version, using numpy, in the routine below? (algorithm also available from the same source)

### Extension: Gene E Us

Can you write a version that passes all the tests, but does not work in general? Can you add a test to catch your "mistake"? How much more robust can you make the testing?

This highlights a challenge with testing numerical or ML algorithms - enumerating all possible cases is not necessarily possible. How might you break up your testing, or your algorithm, to be able to test it more reliably?
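
One possible direction, sketched below under stated assumptions, is to check general properties that any correct Levenshtein implementation must satisfy over many generated inputs, instead of listing expected values one by one. The function name `find_property_violations` and the wrong-on-purpose `length_gap` are invented for this example. Passing these checks is necessary but not sufficient: an incorrect function can still satisfy all four.

```python
import random
from typing import Callable


def find_property_violations(
    distance: Callable[[str, str], int], trials: int = 200, seed: int = 0
) -> list[str]:
    """Return the properties of Levenshtein distance that `distance` breaks."""
    rng = random.Random(seed)
    # A few fixed pairs first, then random short DNA strings.
    pairs = [("", ""), ("GATC", "GATC"), ("GATC", "GATT"), ("GATC", "GA")]
    for _ in range(trials):
        first = "".join(rng.choice("GATC") for _ in range(rng.randint(0, 8)))
        second = "".join(rng.choice("GATC") for _ in range(rng.randint(0, 8)))
        pairs.append((first, second))

    violations = set()
    for first, second in pairs:
        result = distance(first, second)
        if (result == 0) != (first == second):
            violations.add("distance is zero exactly when the strings are equal")
        if result != distance(second, first):
            violations.add("distance is symmetric")
        if result < abs(len(first) - len(second)):
            violations.add("distance is at least the difference in lengths")
        if result > max(len(first), len(second)):
            violations.add("distance is at most the longer length")
    return sorted(violations)


# A deliberately wrong stand-in that only compares lengths.
def length_gap(first: str, second: str) -> int:
    return abs(len(first) - len(second))


print(find_property_violations(length_gap))
```

Here `length_gap` is caught because it reports zero for `"GATC"` and `"GATT"`, which differ.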

## pylint

Pylint exists to help make sure your code is compliant with the PEP 8 style guide (and a few others), and it also flags likely mistakes such as unused imports - you can run it like so:

In [38]:
!pip install pylint

Collecting pylint
  Downloading pylint-3.3.8-py3-none-any.whl.metadata (12 kB)
Collecting astroid<=3.4.0.dev0,>=3.3.8 (from pylint)
  Downloading astroid-3.3.11-py3-none-any.whl.metadata (4.4 kB)
Collecting isort!=5.13,<7,>=4.2.5 (from pylint)
  Downloading isort-6.0.1-py3-none-any.whl.metadata (11 kB)
Collecting mccabe<0.8,>=0.6 (from pylint)
  Downloading mccabe-0.7.0-py2.py3-none-any.whl.metadata (5.0 kB)
Downloading pylint-3.3.8-py3-none-any.whl (523 kB)
Downloading astroid-3.3.11-py3-none-any.whl (275 kB)
Downloading isort-6.0.1-py3-none-any.whl (94 kB)
Downloading mccabe-0.7.0-py2.py3-none-any.whl (7.3 kB)
Ins

In [39]:
!pylint my_levenshtein.py

************* Module my_levenshtein
my_levenshtein.py:35:0: C0116: Missing function or method docstring (missing-function-docstring)
my_levenshtein.py:9:0: W0611: Unused import python_course_levenshtein_py (unused-import)

-----------------------------------
Your code has been rated at 9.20/10



It's a pain, but once your code passes it, it's a breeze! It is recommended to include it in a continuous integration pipeline, just like testing - some developers include it in their Git hooks (automated scripts that run when code is either committed locally or pushed to a remote).

### Exercise: Code Hy-gene

Try to adapt your code until it passes. Are all the checks useful? Are there some you would switch off?
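
As a small sketch of the two usual responses to a message like `missing-function-docstring`, the hypothetical module below fixes it in one function by adding a docstring and switches it off for another with an inline `# pylint: disable=` comment. The function names are invented for illustration and are not part of `my_levenshtein.py`.

```python
"""Hypothetical module used only to illustrate responding to pylint messages."""


def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


# The inline comment silences the docstring check for this one function only.
def complement(sequence: str) -> str:  # pylint: disable=missing-function-docstring
    pairs = {"G": "C", "C": "G", "A": "T", "T": "A"}
    return "".join(pairs[base] for base in sequence)
```

Disabling a check inline keeps the exception visible next to the code it applies to, which tends to be easier to review than switching the check off globally.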