**Tf-Idfs**

In this notebook, we calculate the similarity between different pieces of text.
To do this, we first convert each document into a vector of tf-idf values.

A tf-idf (term frequency-inverse document frequency) score measures how important a
word is to a text relative to the rest of the corpus. For example, if a text uses the
word "dinosaur" many times and few other texts do, that word is probably significant to
its content. Each document's vector therefore holds one importance score per vocabulary
word. Similar texts will share words with similar importance.

This means that we can calculate the cosine similarity between these vectors to see how
similar the texts are.

In [108]:
import numpy as np

In [109]:
corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat meowed at the dog",
    "dogs and cats are friends",
    "the mat was sat on by the cat"
]

To calculate the tf-idf, we first need to calculate the tfs (term frequencies).
A term frequency is the number of times a word occurs in a document divided by the
total number of words in that document.

In [110]:
tfs = []

for document in corpus:
    doc_tf = {}
    words = document.lower().split(' ')
    for word in words:
        # accumulate this word's share of the document in doc_tf
        if word in doc_tf:
            doc_tf[word] += 1 / len(words)
        else:
            doc_tf[word] = 1 / len(words)
    tfs.append(doc_tf)

for doc_tf in tfs:
    print(doc_tf)

{'the': 0.3333333333333333, 'cat': 0.16666666666666666, 'sat': 0.16666666666666666, 'on': 0.16666666666666666, 'mat': 0.16666666666666666}
{'the': 0.3333333333333333, 'dog': 0.16666666666666666, 'barked': 0.16666666666666666, 'at': 0.16666666666666666, 'cat': 0.16666666666666666}
{'the': 0.3333333333333333, 'cat': 0.16666666666666666, 'meowed': 0.16666666666666666, 'at': 0.16666666666666666, 'dog': 0.16666666666666666}
{'dogs': 0.2, 'and': 0.2, 'cats': 0.2, 'are': 0.2, 'friends': 0.2}
{'the': 0.25, 'mat': 0.125, 'was': 0.125, 'sat': 0.125, 'on': 0.125, 'by': 0.125, 'cat': 0.125}
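
The same term frequencies can be sketched more compactly with `collections.Counter`. The helper name `term_frequencies` is an assumption introduced here for illustration, and `split()` with no argument is used so that repeated spaces do not produce empty tokens.

```python
from collections import Counter

def term_frequencies(document):
    """Return {word: relative frequency} for one document (hypothetical helper)."""
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    # Divide each raw count by the document length to get a relative frequency.
    return {word: count / total for word, count in counts.items()}

print(term_frequencies("the cat sat on the mat"))
```

Dividing the final count once, rather than adding `1 / len(words)` per occurrence, also avoids accumulating small floating-point errors in long documents.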


We then calculate the idf (inverse document frequency). This describes how
rare the word is across the documents: it is low if the word appears in many
documents and high if it appears in only a few. This lets us down-weight words
that are too common to carry much of the text's meaning.

In [111]:
idfs = {}

for tf in tfs:
    for word in tf.keys():
        if word in idfs:
            idfs[word] += 1
        else:
            idfs[word] = 1

for idf in idfs.keys():
    idfs[idf] = float(np.log(len(corpus) / idfs[idf]))

print(idfs)

{'the': 0.22314355131420976, 'cat': 0.22314355131420976, 'sat': 0.9162907318741551, 'on': 0.9162907318741551, 'mat': 0.9162907318741551, 'dog': 0.9162907318741551, 'barked': 1.6094379124341003, 'at': 0.9162907318741551, 'meowed': 1.6094379124341003, 'dogs': 1.6094379124341003, 'and': 1.6094379124341003, 'cats': 1.6094379124341003, 'are': 1.6094379124341003, 'friends': 1.6094379124341003, 'was': 1.6094379124341003, 'by': 1.6094379124341003}
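
As a spot check on the output above, a few idf values can be recomputed by hand. This is a minimal sketch: the document counts are tallied manually from the corpus, and the names `n_documents`, `document_counts`, and `spot_idfs` are assumptions introduced for illustration.

```python
import math

n_documents = 5  # size of the corpus above

# Number of documents containing each word, counted by hand from the corpus.
document_counts = {"the": 4, "sat": 2, "barked": 1}

# idf = natural log of (total documents / documents containing the word)
spot_idfs = {
    word: round(math.log(n_documents / df), 4)
    for word, df in document_counts.items()
}
print(spot_idfs)
```

A word found in four of the five documents ("the") scores far lower than a word found in only one ("barked").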


The tf-idfs are now just the tfs multiplied by the idfs.
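
Written out, the quantities computed in this notebook are as follows. The symbols are notation assumed here, not taken from the code: $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of words in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. The logarithm is the natural log, matching `np.log`.

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$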

In [112]:
tf_idfs = []

for tf in tfs:
    tf_idf = {}
    for word in tf.keys():
        tf_idf[word] = round(tf[word] * idfs[word],4)
    tf_idfs.append(tf_idf)

for item in tf_idfs:
    print(item)

{'the': 0.0744, 'cat': 0.0372, 'sat': 0.1527, 'on': 0.1527, 'mat': 0.1527}
{'the': 0.0744, 'dog': 0.1527, 'barked': 0.2682, 'at': 0.1527, 'cat': 0.0372}
{'the': 0.0744, 'cat': 0.0372, 'meowed': 0.2682, 'at': 0.1527, 'dog': 0.1527}
{'dogs': 0.3219, 'and': 0.3219, 'cats': 0.3219, 'are': 0.3219, 'friends': 0.3219}
{'the': 0.0558, 'mat': 0.1145, 'was': 0.2012, 'sat': 0.1145, 'on': 0.1145, 'by': 0.2012, 'cat': 0.0279}


We convert these dictionaries into a matrix containing the tf-idf vectors.
If a word is not in a text, then its value is 0.

In [123]:
vocab = idfs.keys()
vectors = []

for doc in tf_idfs:
    vector = []
    for word in vocab:
        if word in doc:
            vector.append(doc[word])
        else: 
            vector.append(0)
    vectors.append(np.array(vector))

for item in vectors:
    print(item)

[0.0744 0.0372 0.1527 0.1527 0.1527 0.     0.     0.     0.     0.
 0.     0.     0.     0.     0.     0.    ]
[0.0744 0.0372 0.     0.     0.     0.1527 0.2682 0.1527 0.     0.
 0.     0.     0.     0.     0.     0.    ]
[0.0744 0.0372 0.     0.     0.     0.1527 0.     0.1527 0.2682 0.
 0.     0.     0.     0.     0.     0.    ]
[0.     0.     0.     0.     0.     0.     0.     0.     0.     0.3219
 0.3219 0.3219 0.3219 0.3219 0.     0.    ]
[0.0558 0.0279 0.1145 0.1145 0.1145 0.     0.     0.     0.     0.
 0.     0.     0.     0.     0.2012 0.2012]
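
The nested loop above can also be sketched as a single comprehension that builds a 2-D array directly. The toy dictionaries and vocabulary below are made-up values for illustration only, not the corpus results.

```python
import numpy as np

# Two toy tf-idf dictionaries (made-up values for illustration only).
toy_tf_idfs = [{"cat": 0.5, "mat": 0.2}, {"dog": 0.4, "cat": 0.1}]
toy_vocab = ["cat", "mat", "dog"]

# dict.get(word, 0.0) supplies 0 for words that are absent from a document.
matrix = np.array(
    [[doc.get(word, 0.0) for word in toy_vocab] for doc in toy_tf_idfs]
)
print(matrix.shape)
```

Each row is one document and each column is one vocabulary word, so a fixed vocabulary order is what keeps the vectors comparable.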


Now we can calculate the cosine similarities.

In [132]:
def cosine(A, B):
    return float(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))
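
One edge case is worth noting: a document whose words all appear in every document gets an idf of log(1) = 0 for each word, so its vector is all zeros and the division above is undefined. A minimal guarded sketch follows. The name `safe_cosine` and the choice to return 0.0 in that case are assumptions, not part of the original notebook.

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / norm_product)

print(safe_cosine(np.array([0.0, 0.0]), np.array([1.0, 2.0])))
```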

In [136]:
similarities = []
for A in vectors:
    row = []
    for B in vectors:
        row.append(round(cosine(A,B), 4))
    similarities.append(row)

for similarity in similarities:
    print(similarity)

[1.0, 0.0704, 0.0704, 0.0, 0.59]
[0.0704, 1.0, 0.4268, 0.0, 0.0416]
[0.0704, 0.4268, 1.0, 0.0, 0.0416]
[0.0, 0.0, 0.0, 1.0, 0.0]
[0.59, 0.0416, 0.0416, 0.0, 1.0]
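
To read the matrix programmatically, the most similar pair of distinct documents can be located as sketched below. The matrix is copied from the output above, and the variable name `off_diagonal` is an assumption introduced for illustration.

```python
import numpy as np

# Similarity matrix copied from the output above.
similarities = np.array([
    [1.0, 0.0704, 0.0704, 0.0, 0.59],
    [0.0704, 1.0, 0.4268, 0.0, 0.0416],
    [0.0704, 0.4268, 1.0, 0.0, 0.0416],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.59, 0.0416, 0.0416, 0.0, 1.0],
])

off_diagonal = similarities.copy()
np.fill_diagonal(off_diagonal, -1.0)  # ignore each document's match with itself

# argmax works on the flattened array; unravel_index converts back to (row, col).
i, j = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)
print(int(i), int(j), off_diagonal[i, j])
```

The closest pair is documents 0 and 4 ("the cat sat on the mat" and "the mat was sat on by the cat"). Document 3 scores 0 against every other document because, with a plain whitespace split, "dogs" and "cats" are different tokens from "dog" and "cat".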
