# Importing the libraries

In [1]:
import pandas as pd
import glob
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

We have two folders: one containing Shakespeare's plays and one containing short stories. We will first work with the Shakespeare texts and build recommendations for each of them.

The code below reads every text file in the "Shakespeare" folder under the current working directory. Each text file holds one play. The texts are written into a DataFrame, with the file names (minus the .txt extension) used as the Title column.

In [2]:
# Prepare the data

file_list = glob.glob(os.path.join(os.getcwd(), "Shakespeare", "*.txt"))
corpus = []
title = []
for file_path in file_list:
    with open(file_path) as f_input:
        corpus.append(f_input.read())
        title.append(os.path.splitext(os.path.basename(file_path))[0])

df = pd.DataFrame({'overview': corpus, 'Title': title})
df.head()

Unnamed: 0,Title,overview
0,Coriolanus,ACT I\nSCENE I. Rome. A street.\nEnter a compa...
1,HenryV,SCENE I. London. An ante-chamber in the KING'S...
2,King_Lear,KENT\nI thought the king had more affected the...
3,Othello,SCENE I. Venice. A street.\nEnter RODERIGO and...
4,Tempest,SCENE I. On a ship at sea: a tempestuous noise...


Each row of the DataFrame above holds the text of one of Shakespeare's plays and its title. The data is now ready.

# Analysing data

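We will now compute the cosine similarity of each play with every other play. Cosine similarity measures how closely the TF-IDF vectors of two texts point in the same direction: it ranges from 0 (no shared terms) to 1 (identical term weighting). Because TfidfVectorizer L2-normalises each row by default, the plain dot product computed by linear_kernel equals the cosine similarity.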
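As a minimal sketch of why linear_kernel can stand in for cosine similarity here, the example below compares the two on a toy corpus. The three sentences are invented for illustration and are not taken from the Shakespeare folder.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents, invented for illustration only
docs = [
    "the king speaks to the army",
    "the king leads the army to war",
    "a storm wrecks the ship at sea",
]

# Rows of the TF-IDF matrix are L2-normalised by default (norm='l2')
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Dot product of unit-length rows versus explicit cosine similarity
dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix, toy_matrix)

# The two matrices agree up to floating-point error
print(np.allclose(dot, cos))

# The two "king" sentences share terms; the "storm" sentence shares none with the first
print(dot[0, 1] > dot[0, 2])
```

On unit-length vectors the dot product is cheaper to compute than a full cosine similarity, because no further normalisation step is required.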

In [6]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
print(tfidf_matrix.shape)

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
indices = pd.Series(df.index, index=df['Title']).drop_duplicates()


(5, 3911)
[[1.         0.12389193 0.11738927 0.13646613 0.11251281]
 [0.12389193 1.         0.29409406 0.13379463 0.14639482]
 [0.11738927 0.29409406 1.         0.15575269 0.17578303]
 [0.13646613 0.13379463 0.15575269 1.         0.17776455]
 [0.11251281 0.14639482 0.17578303 0.17776455 1.        ]]


The matrix above holds the pairwise cosine similarity of the five plays.
The diagonal elements are the similarity of each play with itself, which is always 1.
Now that we have the similarity scores, we will define a function that takes a title as input and returns the most similar texts as a list.

For starters, we will return only the 3 most similar titles.
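As a quick way to read the printed matrix, the sketch below copies its values and finds the most similar pair of distinct texts. It assumes the rows follow the order of the titles shown in the DataFrame output earlier; the name `masked` is introduced here only for illustration.

```python
import numpy as np

# Titles in the row order of the DataFrame shown earlier (assumed to match the matrix rows)
titles = ["Coriolanus", "HenryV", "King_Lear", "Othello", "Tempest"]

# Values copied from the printed cosine similarity matrix
sim = np.array([
    [1.0, 0.12389193, 0.11738927, 0.13646613, 0.11251281],
    [0.12389193, 1.0, 0.29409406, 0.13379463, 0.14639482],
    [0.11738927, 0.29409406, 1.0, 0.15575269, 0.17578303],
    [0.13646613, 0.13379463, 0.15575269, 1.0, 0.17776455],
    [0.11251281, 0.14639482, 0.17578303, 0.17776455, 1.0],
])

# Hide the diagonal so a text is never matched with itself
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)

# Position of the largest off-diagonal entry
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(titles[i], titles[j], sim[i, j])
```

The largest off-diagonal value is 0.29409406, between HenryV and King_Lear, which is why each of the two appears first in the other's recommendations.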

In [12]:
def get_recommendations(title, sim_matrix=None):
    # Look up the global cosine_sim at call time rather than binding it as a
    # default at definition time, so the function follows later reassignments
    if sim_matrix is None:
        sim_matrix = cosine_sim

    # Get the index of the text that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all texts with that text
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Sort the texts by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the text itself) and keep the 3 most similar texts
    sim_scores = sim_scores[1:4]

    # Get the text indices
    text_indices = [i[0] for i in sim_scores]

    # Return the titles of the 3 most similar texts
    return df['Title'].iloc[text_indices].tolist()

The implementation is now complete. To get the 3 recommended texts for a particular play, we only need to call the function above with its title.

In [8]:
get_recommendations('HenryV')

['King_Lear', 'Tempest', 'Othello']

In [9]:
get_recommendations('Othello')

['Tempest', 'King_Lear', 'Coriolanus']

In [10]:
get_recommendations('King_Lear')

['HenryV', 'Tempest', 'Othello']

Let's do the same for the short stories.

In [8]:
# Prepare the data

file_list = glob.glob(os.path.join(os.getcwd(), "Fables", "*.txt"))
corpus = []
title = []
for file_path in file_list:
    with open(file_path) as f_input:
        corpus.append(f_input.read())
        title.append(os.path.splitext(os.path.basename(file_path))[0])

df = pd.DataFrame({'overview': corpus, 'Title': title})
df

Unnamed: 0,Title,overview
0,The_Ass_and_the_Lapdog,A Farmer one day came to the stables to see to...
1,The_Cock_and_the_Pearl,A cock was once strutting up and down the farm...
2,The_Dog_and_the_Shadow,It happened that a Dog had got a piece of meat...
3,The_Fox_and_the_Crow,A Fox once saw a Crow fly off with a piece of ...
4,The_Frogs_Desiring_a_King,The Frogs were living as happy as could be in ...
5,The_Lions_Share,The Lion went once a-hunting along with the Fo...
6,The_Lion_and_the_Mouse,Once when a Lion was asleep a little Mouse beg...
7,The_Man_and_the_Serpent,A Countryman's son by accident trod upon a Ser...
8,The_Sick_Lion,A Lion had come to the end of his days and lay...
9,The_Town_Mouse_and_the_Country_Mouse,Now you must know that a Town Mouse once upon ...


This folder contains 12 short stories, which matches the first dimension of the TF-IDF matrix shape printed below (the listing above shows ten of them). Again, for each story we will compute its similarity score with the rest of the stories.

In [9]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
print(tfidf_matrix.shape)

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
indices = pd.Series(df.index, index=df['Title']).drop_duplicates()

(12, 507)
[[1.         0.01647195 0.         0.0304829  0.04200542 0.02647074
  0.02089552 0.10256527 0.07171731 0.01259892 0.00930028 0.0138581 ]
 [0.01647195 1.         0.01404645 0.01573528 0.02780893 0.00462484
  0.01817086 0.         0.         0.01038527 0.01305219 0.01603284]
 [0.         0.01404645 1.         0.13014916 0.00659072 0.
  0.02417539 0.01739183 0.02075071 0.00569457 0.04538892 0.05637561]
 [0.0304829  0.01573528 0.13014916 1.         0.02691441 0.06565164
  0.05704785 0.03868229 0.01322035 0.05375994 0.0394276  0.04144642]
 [0.04200542 0.02780893 0.00659072 0.02691441 1.         0.03320781
  0.09499263 0.01587927 0.03746393 0.04097817 0.00777447 0.04379759]
 [0.02647074 0.00462484 0.         0.06565164 0.03320781 1.
  0.16071774 0.03585887 0.14799073 0.02976599 0.0450469  0.02849021]
 [0.02089552 0.01817086 0.02417539 0.05704785 0.09499263 0.16071774
  1.         0.0691921  0.1512687  0.18082082 0.02695763 0.09631125]
 [0.10256527 0.         0.01739183 0.03868229 0

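We can reuse the get_recommendations function here, because it reads df and indices from the notebook's global scope when it is called. One caveat: a default argument value is bound when a function is defined, so if the function relies on such a default for the similarity matrix, re-run its defining cell after rebuilding cosine_sim or pass the new matrix explicitly as the second argument.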

In [13]:
get_recommendations('The_Wolf_and_the_Lamb')

['The_Wolf_and_the_Crane', 'The_Lion_and_the_Mouse', 'The_Dog_and_the_Shadow']

In [14]:
get_recommendations('The_Lion_and_the_Mouse')

['The_Town_Mouse_and_the_Country_Mouse', 'The_Lions_Share', 'The_Sick_Lion']

In [15]:
get_recommendations('The_Lions_Share')

['The_Lion_and_the_Mouse', 'The_Sick_Lion', 'The_Fox_and_the_Crow']

In the first output, the other story about a Wolf, The_Wolf_and_the_Crane, ranks first in the recommended list.
The next two outputs show the same pattern: the recommendations for The_Lion_and_the_Mouse and The_Lions_Share are dominated by other Mouse and Lion stories.

So we have built a simple recommendation engine that finds the texts most closely related to a given one. The function above can be edited to return any number N of titles.
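One possible way to make N configurable is sketched below. The helper name `recommend_top_n`, its `top_n` parameter and the toy corpus are hypothetical and not part of the notebook above. The helper takes the DataFrame and similarity matrix as explicit arguments, so it does not depend on which corpus was loaded last.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend_top_n(title, df, sim_matrix, top_n=3):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    # Row position of the requested title (assumes titles are unique)
    pos = df["Title"].tolist().index(title)
    # Row positions ordered from most to least similar
    order = np.argsort(-sim_matrix[pos])
    # Drop the query text itself, then keep the first top_n positions
    order = [i for i in order if i != pos][:top_n]
    return df["Title"].iloc[order].tolist()


# Toy corpus, invented for illustration only
toy_df = pd.DataFrame({
    "Title": ["Lion_A", "Lion_B", "Frog_A", "Frog_B"],
    "overview": [
        "lion hunts mouse",
        "lion spares mouse",
        "frogs want king",
        "frogs fear king",
    ],
})
toy_matrix = TfidfVectorizer().fit_transform(toy_df["overview"])
toy_sim = linear_kernel(toy_matrix, toy_matrix)

# Ask for a single recommendation, then for two
print(recommend_top_n("Lion_A", toy_df, toy_sim, top_n=1))
print(recommend_top_n("Frog_A", toy_df, toy_sim, top_n=2))
```

Excluding the query by position, instead of slicing off the first sorted entry, keeps the result correct even when another text has a similarity of exactly 1 with the query.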
