## Comparing strings

Credit: this notebook builds on Dario Radecic's [Calculating String Similarity](https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a).

In [13]:
import string
import Levenshtein
import re

## Comparing email and name with Levenshtein Distance

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
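To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. The function name `levenshtein` is only illustrative; it is not the `Levenshtein` package used in this notebook, although it should agree with it on these inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))   # 3
print(levenshtein('Michael', 'michael'))  # 1: upper and lower case count as different
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.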

##### Note
Levenshtein distance is case sensitive, so lowercase both strings with `lower()` and strip out any punctuation before comparing them.

In [56]:
name = 'Michael Taverner'
email = 'michael.taverner1-2@hotmail.com'

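lower_name = name.lower()
# Keep only the local part of the address (everything before the '@')
lower_email = email.split('@')[0].lower()
# The hyphen is escaped: an unescaped ')-=' is read as a character range that also matches digits
nopunc_email = re.sub(r'[!@#$%^&*()\-=+.,]', ' ', lower_email)
nonum_email = re.sub(r'[0-9]+', '', nopunc_email).strip()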

distance = Levenshtein.distance(lower_name,nonum_email)
print(distance)
print(lower_name)
print(nonum_email)

0
michael taverner
michael taverner


1. Assigned email and name variables
2. Lower cased the name
3. Lower cased the email, split it on the @ symbol and selected the first element
4. Replaced common punctuation in emails with a space
5. Replaced any numeric characters with an empty string and stripped the whitespace
6. Calculated the Levenshtein distance, which measures the number of changes required to get from one string to the other

As hoped, the resulting distance is 0: no edits are needed to turn the lower-cased name into the cleaned local part of the email.
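The cleaning steps above can also be sketched as a standalone helper that uses only the standard library. The name `normalise_local_part` and the `example.com` address are hypothetical, and the final whitespace-collapsing step is an extra assumption that the cells above do not perform.

```python
import re


def normalise_local_part(email: str) -> str:
    """Lowercase the part before '@', turn punctuation into spaces, drop digits."""
    local = email.split('@')[0].lower()
    # Hyphen escaped so it is a literal '-' rather than a range operator
    local = re.sub(r'[!@#$%^&*()\-=+.,]', ' ', local)
    local = re.sub(r'[0-9]+', '', local)
    # Assumption: collapse the runs of spaces left behind by removed characters
    return re.sub(r'\s+', ' ', local).strip()


print(normalise_local_part('michael.taverner1-2@hotmail.com'))  # michael taverner
print(normalise_local_part('J.Smith99@example.com'))            # j smith
```

Collapsing repeated spaces matters because every extra space would otherwise count as one edit in the Levenshtein distance.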


### Function

In [71]:
def compare_email_name(email, name):
    """Return the Levenshtein distance between a name and the cleaned local part of an email."""
    lower_name = name.lower()
    lower_email = email.split('@')[0].lower()
    # Escape the hyphen so ')-=' is not treated as a character range that also matches digits
    nopunc_email = re.sub(r'[!@#$%^&*()\-=+.,]', ' ', lower_email)
    nonum_email = re.sub(r'[0-9]+', '', nopunc_email).strip()

    distance = Levenshtein.distance(lower_name, nonum_email)
    return distance

In [72]:
compare_email_name(name = 'Michael Taverner', email = 'michael.taverner1-2@hotmail.com')

0

## Cosine Similarity

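Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them. A value of 1 means the vectors point in the same direction; a value of 0 means they are orthogonal, which for word-count vectors means the two texts share no words.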
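As a rough illustration, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The sketch below computes this by hand on small made-up vectors; the helper name `cosine` is hypothetical and is not the scikit-learn function used later.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


print(cosine([1, 2, 0], [2, 4, 0]))  # same direction: approximately 1.0
print(cosine([1, 0, 0], [0, 3, 0]))  # orthogonal: 0.0
```

Because the result depends only on direction, doubling every count in a vector leaves the similarity unchanged.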

In [74]:
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

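# Requires a one-time nltk.download('stopwords').
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list of words.
stopwords = stopwords.words('english')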

In [92]:
sentences = ['This is a foo bar sentence.',
             'This sentence is similar to a foo bar sentence.',
             'This is another string, but it is not quite similar to the previous ones.',
             'This is just another string.']

In [93]:
def clean_string(text):
    # Drop punctuation characters, lowercase, then remove English stopwords
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords])

    return text

In [94]:
cleaned = list(map(clean_string, sentences))
cleaned

['foo bar sentence',
 'sentence similar foo bar sentence',
 'another string quite similar previous ones',
 'another string']

In [95]:
# fit_transform returns a sparse document-term matrix, not the vectorizer itself
counts = CountVectorizer().fit_transform(cleaned)
vectors = counts.toarray()
vectors

array([[0, 1, 1, 0, 0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 0, 2, 1, 0],
       [1, 0, 0, 1, 1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)
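To see which column belongs to which word, the count matrix can be rebuilt by hand. This is only a sketch of the idea, assuming one column per distinct word in alphabetical order; `CountVectorizer` has its own tokenisation rules that this does not replicate, although for these cleaned sentences the rows come out the same as the array above.

```python
from collections import Counter

cleaned = ['foo bar sentence',
           'sentence similar foo bar sentence',
           'another string quite similar previous ones',
           'another string']

# Assumption: one column per distinct word, in alphabetical order
vocabulary = sorted({word for doc in cleaned for word in doc.split()})

rows = []
for doc in cleaned:
    counts = Counter(doc.split())  # missing words count as 0
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Reading the second row against the vocabulary shows the 2 sitting in the `sentence` column, because that word appears twice in the second cleaned sentence.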

In [96]:
csim = cosine_similarity(vectors)
csim

array([[1.        , 0.87287156, 0.        , 0.        ],
       [0.87287156, 1.        , 0.15430335, 0.        ],
       [0.        , 0.15430335, 1.        , 0.57735027],
       [0.        , 0.        , 0.57735027, 1.        ]])
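Entry `[i][j]` of this matrix is the similarity between sentence `i` and sentence `j`. As a small sketch of how it might be read, the block below copies the printed values into a plain list (rather than reusing the `csim` array) and ranks the distinct sentence pairs from most to least similar.

```python
from itertools import combinations

csim = [[1.0,        0.87287156, 0.0,        0.0],
        [0.87287156, 1.0,        0.15430335, 0.0],
        [0.0,        0.15430335, 1.0,        0.57735027],
        [0.0,        0.0,        0.57735027, 1.0]]

# The matrix is symmetric with 1.0 on the diagonal, so only pairs with i < j matter
pairs = sorted(combinations(range(len(csim)), 2),
               key=lambda ij: csim[ij[0]][ij[1]],
               reverse=True)

for i, j in pairs[:3]:
    print(f'sentences {i} and {j}: {csim[i][j]:.3f}')
```

The first two sentences come out on top at about 0.873, followed by the last two at about 0.577.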

In [97]:
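def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2D input, so reshape each 1D vector into a single row
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)

    # The result is a 1x1 matrix; return its only entry
    return cosine_similarity(vec1, vec2)[0][0]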

In [98]:
cosine_sim_vectors(vectors[0], vectors[1])

0.8728715609439696

### Concise

In [104]:
sentences = ['This is a foo bar sentence.',
             'This sentence is similar to a foo bar sentence.']

def clean_string(text):
    # Drop punctuation characters, lowercase, then remove English stopwords
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords])

    return text

def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2D input, so reshape each 1D vector into a single row
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)

    return cosine_similarity(vec1, vec2)[0][0]

cleaned = list(map(clean_string, sentences))
vectors = CountVectorizer().fit_transform(cleaned).toarray()

cosine_sim_vectors(vectors[0], vectors[1])

0.8728715609439696