# Content Similarity

First, let's import all the modules we need.
gensim is needed for word embeddings. It provides a method to load Google's pre-trained 300-dimensional Word2Vec vectors.
preprocess holds methods for the various steps required to cleanse the text.
os is used to iterate over the files.
numpy is used for matrix calculations.

In [1]:
from gensim import models, matutils
from preprocess import punctuation, stopword
from os import walk
import numpy



Let's define all our variables.

In [2]:
embedding_score = []
cos_sim_matrix = numpy.zeros((17, 17))
unique_id = 0
id_lookup = {}



In [3]:
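# Load the pre-trained Google News Word2Vec vectors (binary format).
# The raw string keeps the backslash in the Windows path from being read as an escape.
model = models.KeyedVectors.load_word2vec_format(
    r'Resources\GoogleNews-vectors-negative300.bin', binary=True)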

This is our method to generate document embeddings. It parses each text file and stores all content words in a list. We get the vector for each word by a simple lookup in the Word2Vec model. We then take the mean of all the word vectors and normalize it to a unit vector for the document.

In [4]:
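def generate_score(filepath):
    with open(filepath, 'r') as f:
        words = []
        for line in f:
            # Cleanse each line with the preprocess helpers, then collect its tokens.
            line = punctuation(line)
            tokens = stopword(line)
            words.extend(tokens)
    # Keep only words the model knows. `word in model` works across gensim
    # versions, whereas `model.vocab` was removed in gensim 4.
    vectors = [model[word] for word in words if word in model]
    # Average the word vectors and scale the result to unit length.
    embedding_score.append(matutils.unitvec(numpy.mean(vectors, axis=0)))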
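The averaging and normalization step can be tried in isolation. The following is a minimal sketch, not the notebook's method: `toy_model` is a made-up dictionary of 3-dimensional vectors standing in for the 300-dimensional Word2Vec model, and `document_vector` is a hypothetical helper name. It also shows one way to handle a document with no known words, a case where the mean of an empty list would otherwise produce `nan`.

```python
import numpy

# Hypothetical toy "embeddings" standing in for the Word2Vec model (assumption).
toy_model = {
    'lion': numpy.array([1.0, 0.0, 0.0]),
    'mouse': numpy.array([0.0, 1.0, 0.0]),
}

def document_vector(words, lookup):
    """Average the vectors of the known words and scale the mean to unit length."""
    vectors = [lookup[word] for word in words if word in lookup]
    if not vectors:
        return None  # no known words, so there is nothing to average
    mean = numpy.mean(vectors, axis=0)
    norm = numpy.linalg.norm(mean)
    return mean / norm if norm > 0 else mean

# 'unknownword' is skipped because it is not in the lookup table.
doc = document_vector(['lion', 'mouse', 'unknownword'], toy_model)
print(doc, numpy.linalg.norm(doc))
```

Returning `None` for an empty document is only one possible choice; the caller then has to decide whether to skip such a file or report it.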


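This is our method to generate the cosine similarity matrix for all the document vectors. Since the number of documents is low, we can build the full square matrix; with more documents, we should compute only a triangular matrix, because cosine similarity is symmetric and the other half would repeat the same computations. Since we built unit vectors while generating scores, the cosine similarity reduces to a plain dot product (each norm is 1). The diagonal is set to 0 so that a document is never its own best match.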

In [5]:
def generate_cos_sim_matrix():
    for i in range(0, len(embedding_score)):
        for j in range(0, len(embedding_score)):
            cos_sim_matrix[i][j] = numpy.dot(embedding_score[i], embedding_score[j])
        cos_sim_matrix[i][i] = 0
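As a sketch of the two options mentioned above, the square matrix can be computed with a single matrix product, and the triangular variant computes each pair once and mirrors it. Both function names and the toy vectors are assumptions made for illustration; they are not part of the notebook.

```python
import numpy

def cosine_similarity_matrix(unit_vectors):
    """Full square matrix: all pairwise dot products, self-similarity zeroed."""
    stacked = numpy.vstack(unit_vectors)   # shape: (documents, dimensions)
    sims = stacked @ stacked.T
    numpy.fill_diagonal(sims, 0.0)
    return sims

def cosine_similarity_upper(unit_vectors):
    """Triangular variant: compute each pair (i, j) with i < j once, then mirror it."""
    n = len(unit_vectors)
    sims = numpy.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sims[i, j] = sims[j, i] = numpy.dot(unit_vectors[i], unit_vectors[j])
    return sims

# Three toy unit vectors (assumption): the third lies between the first two.
toy = [numpy.array([1.0, 0.0]), numpy.array([0.0, 1.0]), numpy.array([0.6, 0.8])]
full = cosine_similarity_matrix(toy)
half = cosine_similarity_upper(toy)
print(numpy.allclose(full, half))
```

Both versions rely on the inputs already being unit vectors; with unnormalized vectors the dot products would have to be divided by the product of the norms.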


This is our main script, where we iterate over the documents to generate the scores.

In [6]:
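# The raw string keeps the backslash in the Windows path literal;
# `dirs` avoids shadowing the built-in `dir`.
for root, dirs, files in walk(r'Resources\Literature'):
    for file in files:
        path = root + '\\' + file
        generate_score(path)
        id_lookup[unique_id] = file
        unique_id += 1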
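The loop above builds paths with a Windows separator and assigns ids in whatever order the directory walk yields the files. A portable variant, sketched here under the assumption that a stable ordering is wanted, joins paths with `os.path.join` and sorts folders and files before assigning ids. `list_documents` is a hypothetical helper, not part of the notebook.

```python
import os

def list_documents(folder):
    """Return {id: (file name, full path)} for every file under folder, in sorted order."""
    lookup = {}
    next_id = 0
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # sorting in place makes os.walk visit subfolders in a stable order
        for name in sorted(files):
            lookup[next_id] = (name, os.path.join(root, name))
            next_id += 1
    return lookup
```

Each full path could then be passed to `generate_score`, with the file name stored in `id_lookup` as before.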

In [7]:
generate_cos_sim_matrix()


# Analysis

Let's look at the cosine similarity matrix.

In [8]:
cos_sim_matrix

array([[0.        , 0.77064266, 0.77659589, 0.79999454, 0.78937563,
        0.78857102, 0.80348056, 0.79743656, 0.81331713, 0.79645459,
        0.8065689 , 0.78682124, 0.8167036 , 0.79087126, 0.76666235,
        0.79662652, 0.77760814],
       [0.77064266, 0.        , 0.74282799, 0.79431756, 0.72912796,
        0.73679943, 0.74710037, 0.71924784, 0.72122472, 0.72521052,
        0.67605161, 0.73303178, 0.79902668, 0.82031451, 0.80092176,
        0.80825141, 0.81728713],
       [0.77659589, 0.74282799, 0.        , 0.81813557, 0.794951  ,
        0.77714194, 0.7907143 , 0.798927  , 0.7943737 , 0.79066378,
        0.80476037, 0.80650089, 0.78892165, 0.74407457, 0.72517892,
        0.77115472, 0.7518591 ],
       [0.79999454, 0.79431756, 0.81813557, 0.        , 0.79885263,
        0.83165865, 0.79442899, 0.78285104, 0.77747958, 0.80550096,
        0.80502909, 0.79503664, 0.8119385 , 0.7845675 , 0.7699288 ,
        0.79338902, 0.78225501],
       [0.78937563, 0.72912796, 0.794951  , 0.798852

Let's also display the key for our notations.

In [9]:
id_lookup

{0: 'The_Ass_and_the_Lapdog.txt',
 1: 'The_Cock_and_the_Pearl.txt',
 2: 'The_Dog_and_the_Shadow.txt',
 3: 'The_Fox_and_the_Crow.txt',
 4: 'The_Frogs_Desiring_a_King.txt',
 5: 'The_Lions_Share.txt',
 6: 'The_Lion_and_the_Mouse.txt',
 7: 'The_Man_and_the_Serpent.txt',
 8: 'The_Sick_Lion.txt',
 9: 'The_Town_Mouse_and_the_Country_Mouse.txt',
 10: 'The_Wolf_and_the_Crane.txt',
 11: 'The_Wolf_and_the_Lamb.txt',
 12: 'Coriolanus.txt',
 13: 'HenryV.txt',
 14: 'King_Lear.txt',
 15: 'Othello.txt',
 16: 'Tempest.txt'}

Now let's take a look at which document our system recommends for each document.

In [10]:
for i in range(0, len(cos_sim_matrix)):
    print(id_lookup[i],': ', id_lookup[numpy.argmax(cos_sim_matrix[i])])


The_Ass_and_the_Lapdog.txt :  Coriolanus.txt
The_Cock_and_the_Pearl.txt :  HenryV.txt
The_Dog_and_the_Shadow.txt :  The_Fox_and_the_Crow.txt
The_Fox_and_the_Crow.txt :  The_Lions_Share.txt
The_Frogs_Desiring_a_King.txt :  Coriolanus.txt
The_Lions_Share.txt :  The_Sick_Lion.txt
The_Lion_and_the_Mouse.txt :  The_Lions_Share.txt
The_Man_and_the_Serpent.txt :  Coriolanus.txt
The_Sick_Lion.txt :  The_Lions_Share.txt
The_Town_Mouse_and_the_Country_Mouse.txt :  Coriolanus.txt
The_Wolf_and_the_Crane.txt :  The_Lion_and_the_Mouse.txt
The_Wolf_and_the_Lamb.txt :  Coriolanus.txt
Coriolanus.txt :  Othello.txt
HenryV.txt :  King_Lear.txt
King_Lear.txt :  Othello.txt
Othello.txt :  King_Lear.txt
Tempest.txt :  Othello.txt


Of the 12 fables, 6 have a Shakespeare work as recommended reading and the other 6 have another fable.
For the Shakespearean works, the recommended reading is always another Shakespearean work.
Let's take a closer look.

Let's look at the fables that recommend plays. Below, for each of them, we display the documents that have a cosine similarity score of 0.8 or greater. Note that the list is not ordered.

In [11]:
for i in range(0, len(cos_sim_matrix)):
    if i in [0, 1, 4, 7, 9, 11]:
        top_recos = []
        for j in range(0, len(cos_sim_matrix[i])):
            if cos_sim_matrix[i][j] >= 0.8:
                top_recos.append(id_lookup[j] + ': ' + str(cos_sim_matrix[i][j]))
        print(id_lookup[i], ': ', top_recos)


The_Ass_and_the_Lapdog.txt :  ['The_Lion_and_the_Mouse.txt: 0.8034805610123193', 'The_Sick_Lion.txt: 0.813317131726151', 'The_Wolf_and_the_Crane.txt: 0.8065689029178493', 'Coriolanus.txt: 0.8167035976513876']
The_Cock_and_the_Pearl.txt :  ['HenryV.txt: 0.8203145100031514', 'King_Lear.txt: 0.8009217594756634', 'Othello.txt: 0.8082514090624006', 'Tempest.txt: 0.8172871338534662']
The_Frogs_Desiring_a_King.txt :  ['The_Lions_Share.txt: 0.8053968619625487', 'The_Lion_and_the_Mouse.txt: 0.8588294030632264', 'The_Man_and_the_Serpent.txt: 0.8083861512482955', 'The_Town_Mouse_and_the_Country_Mouse.txt: 0.8474198539400096', 'The_Wolf_and_the_Lamb.txt: 0.8169049883762372', 'Coriolanus.txt: 0.8822527641991893', 'HenryV.txt: 0.8462804674431661', 'King_Lear.txt: 0.8395884951413659', 'Othello.txt: 0.8615419842491271', 'Tempest.txt: 0.8314010119411108']
The_Man_and_the_Serpent.txt :  ['The_Frogs_Desiring_a_King.txt: 0.8083861512482955', 'The_Lion_and_the_Mouse.txt: 0.8243399859310384', 'The_Sick_Lion

Points to note:

The_Ass_and_the_Lapdog.txt is almost as similar to The_Sick_Lion.txt as it is to Coriolanus.txt.

The other 5 fables have (4, 5, 5, 3, 5) plays with a score of 0.8 or higher, in order.

The fable The_Cock_and_the_Pearl.txt does not have a single other fable with a score of 0.8 or higher.

3 fables have another fable in their top 2. 1 fable, The_Frogs_Desiring_a_King.txt, has a fable in its top 3.
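
Statements about a document's top 2 or top 3 need the neighbours in ranked order, which the unordered lists above do not give directly. A minimal sketch of such a ranking follows; `top_k`, the toy names and the toy matrix are all made up for illustration and are not taken from the results above.

```python
import numpy

def top_k(sim_matrix, names, row, k=3):
    """Return the k most similar documents to `row`, best match first."""
    order = numpy.argsort(sim_matrix[row])[::-1][:k]   # indices by descending score
    return [(names[j], float(sim_matrix[row][j])) for j in order]

# Toy data (assumption): a symmetric matrix with a zeroed diagonal.
toy_names = {0: 'fable_a', 1: 'fable_b', 2: 'play_a', 3: 'play_b'}
toy_matrix = numpy.array([
    [0.00, 0.81, 0.88, 0.79],
    [0.81, 0.00, 0.70, 0.75],
    [0.88, 0.70, 0.00, 0.90],
    [0.79, 0.75, 0.90, 0.00],
])
print(top_k(toy_matrix, toy_names, row=0, k=2))
```

Applied to `cos_sim_matrix` and `id_lookup`, the same function would list each document's nearest neighbours in order, so the top-2 and top-3 observations can be read off directly.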