# Importing the libraries

In [1]:
import pandas as pd
import glob
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

We have two folders, one containing Shakespeare's plays and one containing short stories. We will work with the Shakespeare plays first and produce recommendations for each of them.

The code below reads every text file in the current folder (in this case "Shakespeare"). Each text file corresponds to one play. The contents are written into a DataFrame, with the file names (minus the extension) used as the Title column.

In [2]:
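# Prepare the data: read every .txt file in the current folder

# Sort the paths so the row order is reproducible (glob returns them in arbitrary order)
file_list = sorted(glob.glob(os.path.join(os.getcwd(), "*.txt")))
corpus = []
title = []
for file_path in file_list:
    with open(file_path) as f_input:
        # Full text of the file
        corpus.append(f_input.read())
        # File name without its extension, used as the title
        title.append(os.path.splitext(os.path.basename(file_path))[0])

df = pd.DataFrame({'overview': corpus, 'Title': title})
df.head()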

,Title,overview
0,Coriolanus,ACT I\nSCENE I. Rome. A street.\nEnter a compa...
1,HenryV,SCENE I. London. An ante-chamber in the KING'S...
2,King_Lear,KENT\nI thought the king had more affected the...
3,Othello,SCENE I. Venice. A street.\nEnter RODERIGO and...
4,Tempest,SCENE I. On a ship at sea: a tempestuous noise...


Each row of the DataFrame above holds the text of one of Shakespeare's plays and its title. The data is now ready.

# Analysing data

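We will now compute the cosine similarity between each play and every other play. Cosine similarity measures how alike two texts are: a value close to 1 means their TF-IDF vectors point in nearly the same direction, and 0 means the texts share no terms. (Cosine distance is 1 minus this similarity.)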
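
To make the measure concrete, here is a minimal sketch on three toy sentences (hypothetical, not taken from the plays). It computes the cosine similarity by hand and checks it against scikit-learn. It relies on `TfidfVectorizer` L2-normalising each row by default, which is why a plain dot product (`linear_kernel`) gives the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents (hypothetical, for illustration only)
docs = [
    "the king speaks to the army",
    "the king speaks to the court",
    "a storm wrecks the ship at sea",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cosine similarity written out: dot product divided by the product of the norms
a = tfidf_matrix[0].toarray().ravel()
b = tfidf_matrix[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rows are already unit length, so the dot product equals the cosine similarity
dot = linear_kernel(tfidf_matrix, tfidf_matrix)
cos = cosine_similarity(tfidf_matrix)

print(round(manual, 4), round(dot[0, 1], 4))  # the two values agree
print(np.allclose(dot, cos))                  # both functions give the same matrix
print(dot[0, 2])                              # no shared terms, so the similarity is 0
```

The first two sentences share "king" and "speaks", so their similarity lies between 0 and 1. The third shares no non-stop-word with the first, so that entry is 0.
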

In [6]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
print(tfidf_matrix.shape)

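# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the dot product computed by linear_kernel equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

# Map each title to its row index in df, so a play can be looked up by name
indices = pd.Series(df.index, index=df['Title']).drop_duplicates()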


(5, 3911)
[[1.         0.12389193 0.11738927 0.13646613 0.11251281]
 [0.12389193 1.         0.29409406 0.13379463 0.14639482]
 [0.11738927 0.29409406 1.         0.15575269 0.17578303]
 [0.13646613 0.13379463 0.15575269 1.         0.17776455]
 [0.11251281 0.14639482 0.17578303 0.17776455 1.        ]]


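The matrix above shows the pairwise cosine similarity between the five plays. Rows and columns follow the order of the DataFrame, so the entry in row 1, column 2 (0.294) is the similarity between HenryV and King_Lear.
The diagonal elements are the similarity of each play with itself, which is always 1.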
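
The raw array is easier to read with the titles attached. The following is a small sketch, not part of the original notebook: it copies the printed values and the title order from the outputs above into a labelled DataFrame (`sim_df`, `masked` and `closest` are names introduced here for illustration) and looks up the closest other play for each title.

```python
import numpy as np
import pandas as pd

# Titles in DataFrame order and similarity values copied from the printed output
titles = ["Coriolanus", "HenryV", "King_Lear", "Othello", "Tempest"]
cosine_sim = np.array([
    [1.0, 0.12389193, 0.11738927, 0.13646613, 0.11251281],
    [0.12389193, 1.0, 0.29409406, 0.13379463, 0.14639482],
    [0.11738927, 0.29409406, 1.0, 0.15575269, 0.17578303],
    [0.13646613, 0.13379463, 0.15575269, 1.0, 0.17776455],
    [0.11251281, 0.14639482, 0.17578303, 0.17776455, 1.0],
])

# Label rows and columns with the titles
sim_df = pd.DataFrame(cosine_sim, index=titles, columns=titles)
print(sim_df.round(3))

# Hide the diagonal (self-similarity), then take the best match in each row
masked = sim_df.mask(np.eye(len(titles), dtype=bool))
closest = masked.idxmax(axis=1)
print(closest)
```
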
Now that we have the similarity scores, we will define a function that takes a title as input and returns the most similar titles as a list.

For a start, we will return only the 3 most similar titles.

In [7]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the text that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all texts with that text
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the texts by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 3 most similar texts; position 0 is the text itself (score 1), so skip it
    sim_scores = sim_scores[1:4]

    # Get the text indices
    text_indices = [i[0] for i in sim_scores]

    # Return the titles of the 3 most similar texts
    return df['Title'].iloc[text_indices].tolist()

The implementation is now complete. To get the 3 recommended plays for a particular play, we only need to call the function above with its title.

In [8]:
get_recommendations('HenryV')

['King_Lear', 'Tempest', 'Othello']

In [9]:
get_recommendations('Othello')

['Tempest', 'King_Lear', 'Coriolanus']

In [10]:
get_recommendations('King_Lear')

['HenryV', 'Tempest', 'Othello']
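
Since 3 was chosen only as a starting point, one possible extension is to make the number of results a parameter and return the scores with the titles. The sketch below is a hypothetical variant, not the notebook's own function: `get_recommendations_scored` and its `n` parameter are names introduced here, and the matrix is copied from the printed output so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Titles and similarity values copied from the outputs above
titles = ["Coriolanus", "HenryV", "King_Lear", "Othello", "Tempest"]
cosine_sim = np.array([
    [1.0, 0.12389193, 0.11738927, 0.13646613, 0.11251281],
    [0.12389193, 1.0, 0.29409406, 0.13379463, 0.14639482],
    [0.11738927, 0.29409406, 1.0, 0.15575269, 0.17578303],
    [0.13646613, 0.13379463, 0.15575269, 1.0, 0.17776455],
    [0.11251281, 0.14639482, 0.17578303, 0.17776455, 1.0],
])
indices = pd.Series(range(len(titles)), index=titles)


def get_recommendations_scored(title, n=3, cosine_sim=cosine_sim):
    """Hypothetical variant: return the n most similar titles with their scores."""
    if title not in indices:
        raise KeyError(f"Unknown title: {title!r}")
    idx = indices[title]

    # One score per title, labelled by title
    scores = pd.Series(cosine_sim[idx], index=titles)

    # Remove the query by name instead of assuming it sorts into first place
    scores = scores.drop(title)

    return scores.sort_values(ascending=False).head(n)


print(get_recommendations_scored("HenryV"))
print(get_recommendations_scored("Othello", n=1))
```

Dropping the query title by label avoids relying on the self-similarity of 1 always sorting first, and the explicit check gives a clearer error for a misspelled title.
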