# Find Similarities among Words
<p>This Python program loads a CSV file of text and measures the character-level similarity between a chosen root word and every other word in the file.</p>

## Load a List of Words in CSV Format

In [18]:
import re
file__IO = input('What is the filename: ')  # the user enters a path such as F:\INSY5336\Shakespeare.txt
with open(file__IO, 'r') as f:  # open the file and read its full text
    data = f.read()
    line = data.splitlines()  # lines: character sequences separated by newline (\n) characters
    word = data.split()  # words: tokens separated by one or more whitespace characters (spaces, tabs, or newlines)

What is the filename: test_doc.csv
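
The cell above splits the raw text on whitespace, which works when words are separated by spaces or newlines. If the file holds comma-separated cells instead, a sketch along the following lines might be closer to what is wanted. It assumes the standard-library `csv` module, and the helper name `read_words_from_csv` is hypothetical, not part of the original notebook.

```python
import csv
import io

def read_words_from_csv(handle):
    """Return every whitespace-separated word found in any cell of a CSV stream."""
    words = []
    for row in csv.reader(handle):
        for cell in row:
            # A cell may hold several words, so split it on whitespace
            words.extend(cell.split())
    return words

# An in-memory stand-in for a file opened with open(path, newline='')
sample = io.StringIO("I am Jason,I like to have fun\nDo you program?\n")
print(read_words_from_csv(sample))
```

Punctuation such as the trailing `?` is left in place here; the notebook strips it in a later cell.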


## Pick a Root Word to Compare Against the Other Words in the CSV

In [19]:
root_term = input("Pick a root word: ")

Pick a root word: Believe


## Analyzing Term Similarity
<p>This function takes a list of words or terms and returns the corresponding character vectors. It first converts every term to lower case, then splits each term into an array of its characters, and finally replaces each character with its integer Unicode code point before returning the list of arrays.</p>

In [20]:
import numpy as np
import pandas as pd
import copy

def vectorize_terms(terms):
    terms = [term.lower() for term in terms]
    terms = [np.array(list(term)) for term in terms]
    terms = [np.array([ord(char) for char in term]) for term in terms]
    #ord() returns the integer Unicode code point of a character
    return terms
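
As a quick illustration of the mapping that `vectorize_terms` relies on, the following plain-Python sketch converts one word to its code points and back. The variable names `codes` and `restored` are illustrative only.

```python
term = "Believe"

# Lower-case the word, then map each character to its Unicode code point
codes = [ord(char) for char in term.lower()]
print(codes)  # [98, 101, 108, 105, 101, 118, 101]

# chr() is the inverse of ord(), so the lower-cased word can be recovered
restored = "".join(chr(code) for code in codes)
print(restored)  # believe
```

Because the word is lower-cased first, the original capitalization cannot be recovered from the vector.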

<p>We use the words from the CSV to measure character similarity against the root word.</p>

In [21]:
#strip punctuation from each word read from the CSV
words = []
for i in word:
    words.append(re.sub(r'[^\w\s]+', '', i))

#the terms that will be compared against the root word
other_terms = list(words)

#the root word followed by every other term
terms = [root_term] + other_terms

<p>Here we perform character vectorization on each of these strings and view the result as a data frame. The data frame is built from two arguments: (1) the arrays of integer Unicode code points and (2) the original words, which become the row labels, known as the index (not to be confused with the first column of the data frame). Because the words differ in length, shorter rows are padded with NaN, and any column that contains NaN is displayed as floats.</p>

In [22]:
#Character vectorization
term_vectors = vectorize_terms(terms)

#show vector representations
vec_df = pd.DataFrame(term_vectors, index=terms)
print(vec_df)

             0      1      2      3      4      5      6      7      8
Believe     98  101.0  108.0  105.0  101.0  118.0  101.0    NaN    NaN
I          105    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
am          97  109.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
Jason      106   97.0  115.0  111.0  110.0    NaN    NaN    NaN    NaN
I          105    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
like       108  105.0  107.0  101.0    NaN    NaN    NaN    NaN    NaN
to         116  111.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
have       104   97.0  118.0  101.0    NaN    NaN    NaN    NaN    NaN
fun        102  117.0  110.0    NaN    NaN    NaN    NaN    NaN    NaN
Do         100  111.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
you        121  111.0  117.0    NaN    NaN    NaN    NaN    NaN    NaN
program    112  114.0  111.0  103.0  114.0   97.0  109.0    NaN    NaN
I          105    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
progra

<p>Next we store the vector for the root term and the vectors for the other terms against which it will be compared. Dropping the NaN padding leaves only each word's own character codes.</p>

In [23]:
root_term_vec = vec_df[vec_df.index == root_term].dropna(axis=1).values[0]
other_term_vecs = [vec_df[vec_df.index == term].dropna(axis=1).values[0] for term in other_terms]
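
To make the lookup pattern above concrete, here is a small self-contained sketch on a two-row frame. The toy labels `"be"` and `"i"` and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Rows of unequal length are padded with NaN when the frame is built
df = pd.DataFrame([[98, 101], [105]], index=["be", "i"])
print(df)

# Select the row labelled "i", drop the columns that hold NaN,
# and take the first matching row as a NumPy array
short_vec = df[df.index == "i"].dropna(axis=1).values[0]
long_vec = df[df.index == "be"].dropna(axis=1).values[0]
print(short_vec.tolist())
print(long_vec.tolist())
```

When a label occurs more than once in the index, as `I` does in the output above, `.values[0]` returns only the first matching row.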

In [24]:
def levenshtein_distance(u, v):
    #convert to lower case
    u = u.lower()
    v = v.lower()
    #list that collects each completed row of the distance matrix
    edit_matrix = []
    #initialize two rows of distances (previous and current)
    du = [0] * (len(v) + 1)
    dv = [0] * (len(v) + 1)
    #du: the previous row of distances
    for i in range(len(du)):
        du[i] = i
    #dv: the current row of distances
    for i in range(len(u)):
        dv[0] = i + 1
        #compute cost as per algorithm
        for j in range(len(v)):
            if u[i] == v[j]:
                cost = 0
            else:
                cost = 1
            dv[j + 1] = min(dv[j] + 1, du[j + 1] + 1, du[j] + cost)
        #assign dv to du for next iteration
        for j in range(len(du)):
            du[j] = dv[j]
        #copy dv to the matrix
        edit_matrix.append(copy.copy(dv))
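    #the final distance is the last entry of the last computed row
    distance = dv[len(v)]
    #transpose so rows follow v and columns follow u, then drop the base-case row
    edit_matrix = np.array(edit_matrix)
    edit_matrix = edit_matrix.T
    edit_matrix = edit_matrix[1:,]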
    edit_matrix = pd.DataFrame(data=edit_matrix, 
                               index=list(v), 
                               columns=list(u))
    return distance, edit_matrix
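
The function above keeps only two rows at a time and fills each cell as the minimum of an insertion, a deletion, and a substitution (`dv[j] + 1`, `du[j + 1] + 1`, `du[j] + cost`). As a cross-check, here is a minimal full-matrix sketch of the same recurrence. The name `full_matrix_distance` is hypothetical and not part of the original notebook.

```python
def full_matrix_distance(u, v):
    """Levenshtein distance computed with the full (len(u)+1) x (len(v)+1) table."""
    u, v = u.lower(), v.lower()
    rows, cols = len(u) + 1, len(v) + 1
    d = [[0] * cols for _ in range(rows)]
    # Base cases: turning a prefix into the empty string costs its length
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1]

print(full_matrix_distance("Believe", "I"))       # 6
print(full_matrix_distance("Believe", "Jason"))   # 7
print(full_matrix_distance("kitten", "sitting"))  # 3
```

The full table uses more memory than the two-row version, but its explicit base cases also cover empty strings.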

In [25]:
for term in other_terms:
    edit_d, edit_m = levenshtein_distance(root_term, term)
    print('Computing distance between root: {} and term: {}'.format(root_term, term))
    print('Levenshtein distance is {}'.format(edit_d))
    print('The complete distance matrix is depicted below')
    print(edit_m)
    print('-' * 30)

Computing distance between root: Believe and term: I
Levenshtein distance is 6
The complete distance matrix is depicted below
   b  e  l  i  e  v  e
i  1  2  3  3  4  5  6
------------------------------
Computing distance between root: Believe and term: am
Levenshtein distance is 7
The complete distance matrix is depicted below
   b  e  l  i  e  v  e
a  1  2  3  4  5  6  7
m  2  2  3  4  5  6  7
------------------------------
Computing distance between root: Believe and term: Jason
Levenshtein distance is 7
The complete distance matrix is depicted below
   b  e  l  i  e  v  e
j  1  2  3  4  5  6  7
a  2  2  3  4  5  6  7
s  3  3  3  4  5  6  7
o  4  4  4  4  5  6  7
n  5  5  5  5  5  6  7
------------------------------
Computing distance between root: Believe and term: I
Levenshtein distance is 6
The complete distance matrix is depicted below
   b  e  l  i  e  v  e
i  1  2  3  3  4  5  6
------------------------------
Computing distance between root: Believe and term: like
Levenshtein 