# 1: Distance measures

### 1. Hamming distance 

Start by implementing the Hamming distance function below. The function takes two strings `s1` and `s2` as input and returns their Hamming distance. When implemented correctly, the output should be the following:

```
hamming("dog", "dgo") == 2
hamming("blouse", "house") == 6
```

In case the input strings are of different lengths, we should pad the shorter string **at the end** with padding symbols `"#"`.

In [1]:
def hamming(s1, s2):

    if len(s1) < len(s2):
        return hamming(s2,s1)
    else:
        s2 += "#" * (len(s1) - len(s2))
        return sum([c1 != c2 for c1, c2 in zip(s1, s2)])
    
print(hamming("dog","dgo"))
print(hamming("house", "blouse"))

2
6
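As an alternative sketch of the same padding idea, `itertools.zip_longest` can supply the `"#"` symbol for the shorter string, so no recursion or explicit padding is needed. The name `hamming_padded` and the `pad` parameter are illustrative assumptions, not part of the exercise skeleton.

```python
from itertools import zip_longest

def hamming_padded(s1, s2, pad="#"):
    # zip_longest pairs up the symbols position by position and fills
    # the missing positions of the shorter string with the pad symbol,
    # which is equivalent to padding the shorter string at the end.
    return sum(c1 != c2 for c1, c2 in zip_longest(s1, s2, fillvalue=pad))

print(hamming_padded("dog", "dgo"))
print(hamming_padded("house", "blouse"))
```

As with the version above, a literal `"#"` inside an input string would be treated as equal to the padding symbol.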


### 2. Comparing to minimum edit distance

We'll then compare Hamming distance and minimum edit distance (also called Levenshtein distance). Start by installing the module [Levenshtein](https://maxbachmann.github.io/Levenshtein/installation.html).

We can then apply the `Levenshtein.distance` function. It should give the following output:
```
Levenshtein.distance("dog", "dgo") == 2
Levenshtein.distance("blouse", "house") == 2
```

In [2]:
import Levenshtein

print(Levenshtein.distance("dog", "dgo"))
print(Levenshtein.distance("blouse", "house"))

2
2


### 3. Our own version of minimum edit distance

We'll now implement our own version of the minimum edit distance algorithm `lev`. 

For input strings `s1 = "blouse"` and `s2 = "house"`, the skeleton code initializes a matrix with `len(s1) + 1` rows and `len(s2) + 1` columns:

$${\rm matrix} = \begin{bmatrix}
0 & 1 & 2 & 3 & 4 & 5\\
1 & 0 & 0 & 0 & 0 & 0\\
2 & 0 & 0 & 0 & 0 & 0\\
3 & 0 & 0 & 0 & 0 & 0\\
4 & 0 & 0 & 0 & 0 & 0\\
5 & 0 & 0 & 0 & 0 & 0\\
6 & 0 & 0 & 0 & 0 & 0\\
\end{bmatrix}$$

The element `matrix[i,j]` corresponds to the substrings `s1[:i]` and `s2[:j]`. Now we fill in the entries in this matrix correctly using the update rule: 

1. Test if one of the strings is empty. In that case, we need to insert all symbols in the other string;
2. Check if we can copy, i.e. the current symbols `s1[i-1]` and `s2[j-1]` are the same. Otherwise, the cost will be at least 1;
3. Check which produces the lowest overall cost: substitution (or copying), insertion or deletion;
4. Return the lowest overall cost.
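
The steps above can be written as a recurrence over the matrix. This is a sketch that assumes a unit cost for insertion, deletion and substitution, which is consistent with the expected outputs of this exercise. The helper symbol $c(i,j)$ is introduced here only for illustration.

$${\rm matrix}[i,0] = i, \qquad {\rm matrix}[0,j] = j$$

$${\rm matrix}[i,j] = \min \begin{cases}
{\rm matrix}[i-1,j] + 1 & \text{(deletion)}\\
{\rm matrix}[i,j-1] + 1 & \text{(insertion)}\\
{\rm matrix}[i-1,j-1] + c(i,j) & \text{(substitution or copying)}
\end{cases}$$

Here $c(i,j) = 0$ if `s1[i-1] == s2[j-1]` (copying) and $c(i,j) = 1$ otherwise (substitution).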


You should then return the last element of the last row of `matrix`. This is the minimum edit distance between `s1` and `s2`.

The function should return the following result:

```
lev("dog", "dgo") == 2
lev("house", "blouse") == 2
lev("mitten", "sitting") == 3
```

> **Hint:** If you encounter problems, it's a good idea to print matrix after the for loops and check that the entries in the matrix agree with your understanding of the update rule.

In [3]:
import numpy as np

def lev(s1, s2):
    matrix = np.zeros((len(s1) + 1, len(s2) + 1))
    
    # We initialize the first row and column of the
    # distance matrix
    matrix[:,0] = np.arange(0,len(s1)+1)
    matrix[0,:] = np.arange(0,len(s2)+1)
    
    for col in range(1,len(s2) + 1):
        for row in range(1,len(s1) + 1):
            # Please use the distance update rule from the lecture
            # to fill in cell matrix[row][col]. Remember to check
            # whether we're copying or substituting.
            
            subst_cost = 0 if s1[row-1] == s2[col-1] else 1 
            matrix[row,col] = min(matrix[row-1,col] + 1,
                                  matrix[row, col-1] + 1,
                                  matrix[row-1, col-1] + subst_cost)

    # You should now return the last entry in the last 
    # row. 
        
    return matrix[-1,-1]

print(lev("dog", "dgo"))
print(lev("house", "blouse"))
print(lev("mitten", "sitting"))

2.0
2.0
3.0
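
Following the hint above, it can help to look at the whole table rather than only the final cell. The sketch below is a plain-Python variant (nested lists instead of NumPy, so the entries stay integers) that returns the full table. The name `lev_table` is a hypothetical helper, not part of the exercise skeleton.

```python
def lev_table(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column and first row: distance to the empty string.
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # Copying is free, substitution costs 1.
            subst_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,               # deletion
                              table[i][j - 1] + 1,               # insertion
                              table[i - 1][j - 1] + subst_cost)  # copy or substitution
    return table

# Print the table row by row; the distance is the last entry of the last row.
for row in lev_table("dog", "dgo"):
    print(row)
```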


### 4. Comparing document vectors

Let's start by compiling a few document vectors for Brown corpus categories. The Brown corpus contains text from 15 categories:

```
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
```

We'll need to encode these into ID numbers:

In [6]:
from nltk.corpus import brown
cat2i = {cat:i for i, cat in enumerate(brown.categories())}

We will first represent each category as a bag-of-words containing all the tokens from documents in that category.

In [7]:
import nltk
from nltk.corpus import brown,stopwords
from collections import Counter

nltk.download("brown")
nltk.download("stopwords")

en_stopwords = stopwords.words("english")

def preprocess(words):
    '''lower case and remove stopwords'''
    return [word.lower() for word in words if word.lower() not in en_stopwords and word.isalpha()]

raw_feature_dicts = []
for category in brown.categories():
    # Represent each category as a counter over word types
    raw_feature_dicts.append(Counter(preprocess(brown.words(categories=category))))
    
print(len(raw_feature_dicts))
print(raw_feature_dicts[cat2i["mystery"]].most_common(10))
print(raw_feature_dicts[cat2i["government"]].most_common(10))

[nltk_data] Downloading package brown to /Users/lxy/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/lxy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


15
[('said', 204), ('would', 189), ('one', 172), ('back', 158), ('could', 145), ('like', 139), ('man', 107), ('get', 99), ('two', 89), ('know', 86)]
[('state', 196), ('year', 183), ('states', 180), ('may', 179), ('united', 155), ('new', 150), ('development', 125), ('one', 125), ('would', 120), ('made', 118)]


We will then use `DictVectorizer` from sklearn to transform our word counters into real-valued vectors.

`DictVectorizer` will associate each word type with a unique dimension and store the counts for this word type in each category under this dimension.
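
To make the idea concrete, here is a small pure-Python sketch of the mapping from counters to count vectors. It only illustrates the concept and is not how sklearn implements it: `DictVectorizer` returns a sparse matrix, whereas the hypothetical helper `vectorize` below builds dense nested lists from toy data.

```python
from collections import Counter

def vectorize(counters):
    # One dimension per word type, in sorted order.
    vocab = sorted(set().union(*counters))
    # One row per counter; words missing from a counter get count 0.
    matrix = [[counter.get(word, 0) for word in vocab] for counter in counters]
    return vocab, matrix

# Toy data standing in for two categories.
toy_counters = [Counter(["said", "man", "said"]), Counter(["state", "said"])]
vocab, matrix = vectorize(toy_counters)
print(vocab)
print(matrix)
```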

In [8]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
sparse_matrix = vectorizer.fit_transform(raw_feature_dicts)

Let's print the shape of the resulting matrix. We have one row for each of the 15 categories and around 40k feature dimensions corresponding to word types.

In [9]:
sparse_matrix.shape

(15, 40097)

To use distance metrics, we'll need to convert this sparse matrix into an `np.array`.

In [10]:
full_matrix = sparse_matrix.toarray()

Let's now use `scipy.spatial.distance.cosine` to compare vectors for different categories. This function actually returns the cosine **distance** rather than the cosine similarity. These are related via the formula:
$${\rm cosdist}(x,y) = 1 - {\rm cossim}(x,y)$$
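
As a minimal sketch of this relation, cosine distance can be computed by hand from the dot product and the vector norms. The function name `cosine_distance` is an illustrative assumption, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_distance(x, y):
    # Cosine similarity is the dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumption: both norms are non-zero, otherwise this raises ZeroDivisionError.
    return 1.0 - dot / (norm_x * norm_y)

print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_distance([1, 0], [1, 0]))  # identical vectors
```

Vectors that point in the same direction have distance close to 0 regardless of their length, which is why raw counts from categories of different sizes can still be compared.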

Print the cosine distance between the category *adventure* and all other categories in the Brown corpus. What is the closest category to *adventure*? Which one is furthest away?

In [11]:
from scipy.spatial.distance import cosine

for cat in brown.categories():
    print(cat, cosine(full_matrix[cat2i["adventure"]], full_matrix[cat2i[cat]]))

adventure 0
belles_lettres 0.31761452179089555
editorial 0.4106633168160483
fiction 0.11781160852800698
government 0.6846370760021299
hobbies 0.4495174612745898
humor 0.24818366256735258
learned 0.5719927326485756
lore 0.32945460871737575
mystery 0.11043616570333414
news 0.3883890659297321
religion 0.5140895334150573
reviews 0.43939022838544683
romance 0.12038926342301437
science_fiction 0.29095478019263565


In the next notebook, we will cluster data points. For this purpose, we often need to know the distance between every pair of points, e.g. category vectors in our case. It is very slow to compute this individually for each pair. Instead we can use the [`scipy.spatial.distance.pdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) function. `pdist` returns a so-called condensed distance matrix, a flat array that holds only the entries above the diagonal of the full distance matrix. You can convert this to a regular square matrix using the function [`scipy.spatial.distance.squareform`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform).

Read the documentation of the functions and use them to generate an array `m`, where `m[cat2i["adventure"], cat2i["mystery"]]` gives you the cosine distance between these two categories.

Test a few categories to make sure that your results agree with the distances we computed using a loop above.  

In [12]:
from scipy.spatial.distance import pdist, squareform


m = squareform(pdist(full_matrix, "cosine"))
print(m.shape)
print(m[cat2i["adventure"], cat2i["mystery"]])
print(m[cat2i["adventure"], cat2i["government"]])

(15, 15)
0.11043616570333403
0.6846370760021299
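
To see how the condensed form relates to the square form, here is a pure-Python sketch of the same layout. The helpers `condensed_distances` and `to_square` are hypothetical stand-ins written for illustration, not the SciPy implementations, and the toy points are plain numbers compared by absolute difference.

```python
from itertools import combinations

def condensed_distances(points, dist):
    # One entry per unordered pair (i, j) with i < j, in the order
    # (0,1), (0,2), ..., (0,n-1), (1,2), ...
    n = len(points)
    return [dist(points[i], points[j]) for i, j in combinations(range(n), 2)]

def to_square(condensed, n):
    # Mirror each pair distance into a symmetric n x n table
    # with zeros on the diagonal.
    square = [[0.0] * n for _ in range(n)]
    for value, (i, j) in zip(condensed, combinations(range(n), 2)):
        square[i][j] = value
        square[j][i] = value
    return square

points = [0, 1, 3]
condensed = condensed_distances(points, lambda a, b: abs(a - b))
print(condensed)
print(to_square(condensed, len(points)))
```

For `n` points the condensed form has `n * (n - 1) / 2` entries, so the 15 Brown categories give 105 pairwise distances.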
