# Case Study: Book Recommendations from Charles Darwin
### Data
Charles Darwin is one of the most famous scientists in history. 
Beyond his best-known work, “On the Origin of Species”, he wrote 
many other books on a wide range of topics, including geology, 
plants, and his personal life. In this project, we will develop a 
content-based book recommendation system that determines which 
books are close to each other based on how similar their topics 
are. Let’s take a look at the books we will use. 

In [1]:
import glob  # glob finds file paths that match a Unix shell-style wildcard pattern 
folder = "datasets/" 
files = glob.glob(folder + "*.txt") 
files.sort() 

### Text Preprocessing
As the first step, we load the content of each book and use a 
regular expression to replace every run of non-alphanumeric 
characters with a single space, which simplifies the later steps. 
We call such a collection of texts a corpus.

In [2]:
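import re, os 
txts = [] 
titles = [] 
 
for n in files: 
    # utf-8-sig drops a leading byte-order mark if the file has one 
    with open(n, encoding='utf-8-sig') as f: 
        # Collapse each run of non-alphanumeric characters (and underscores) into one space 
        data = re.sub(r'[\W_]+', ' ', f.read()) 
    txts.append(data) 
    # The title is the file name without its .txt extension 
    titles.append(os.path.basename(n).replace('.txt', '')) 
[len(t) for t in txts]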

[]
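To make the cleaning step concrete, here is a minimal sketch that applies the same pattern to a single made-up sentence (the sample string is an assumption for illustration, not a line from the dataset). Note that `\W` does not match the underscore, which is why the pattern lists `_` explicitly.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Natural-selection; or, the _preservation_ of favoured races (1859)!"

# Same pattern as in the loading loop: each run of non-word characters
# or underscores is replaced by one space
cleaned = re.sub(r'[\W_]+', ' ', sample)
print(repr(cleaned))
```

Digits survive the substitution, and trailing punctuation leaves a trailing space, which the later `split()` call ignores.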

For consistency, we will use Darwin’s most famous book, “On the 
Origin of Species”, as the reference book when checking the 
results. We first look up its index in the list of titles.

In [4]:
for i in range(len(titles)): 
    if titles[i] == 'OriginofSpecies': 
        ori = i 
        print(ori)      # Index = 15

As the next step, we tokenize the corpus: each text is lower-cased, 
split into words, and stripped of common stop words.

In [7]:
stoplist = set('for a of the and to in to be which some is at that we i who whom show via may my our might as well'.split()) 
 
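txts_lower_case = [txt.lower() for txt in txts] 
# Split the lower-cased texts, so that capitalized stop words such as "The" are also removed 
txts_split = [txt.split() for txt in txts_lower_case] 
 
texts = [[word for word in txt if word not in stoplist] for txt in txts_split] 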
texts[15][0:20]

IndexError: list index out of range
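As a small illustration of the tokenization step, the sketch below runs the same lower-case, split, and filter sequence on one made-up sentence. The shortened stop list and the sample string are assumptions chosen for readability.

```python
# Shortened, hypothetical stop list (the notebook uses a longer one)
toy_stoplist = set('for a of the and to in'.split())

# Hypothetical sample text
sample = "The Origin of Species by Means of Natural Selection"

# Lower-case first, then split on whitespace, then drop stop words
toy_tokens = [w for w in sample.lower().split() if w not in toy_stoplist]
print(toy_tokens)
```

Lower-casing before filtering matters here: without it, "The" would not match the stop word "the" and would stay in the token list.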

For the next step of text preprocessing, we use stemming, which 
groups together the inflected forms of a word so they can be 
analyzed as a single item: the stem. Stemming the whole corpus is 
slow, so to speed things up we will directly load the final results 
from a pickle file and review the method used to generate it.

In [8]:
import pickle 
# Load the precomputed stems; the with-block closes the file afterwards 
with open('datasets/texts_stem.p', 'rb') as f: 
    texts_stem = pickle.load(f) 
texts_stem[15][0:20] 

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/texts_stem.p'
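The stemmer that produced the pickle file is not shown here, so the following is only a toy sketch of the idea: a crude suffix stripper written for illustration. It is not the Porter algorithm or any library's stemmer, and `toy_stem` and its suffix list are hypothetical names. A real pipeline would more likely rely on a library implementation such as NLTK's `PorterStemmer`.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical tokens: several inflected forms should collapse to one stem
toy_words = ["selected", "selecting", "selects", "species", "variation"]
toy_stems = [toy_stem(w) for w in toy_words]
print(toy_stems)
```

The three forms of "select" end up as the same item, which is the property the later bag-of-words counts rely on.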

## Text Vectorization
### Bag-of-Words Models (BoW)
First, we need to create a universe of all words contained in our 
corpus of Charles Darwin’s books, which we call a dictionary. 
Then, using the stemmed tokens and the dictionary, we will create 
bag-of-words models (BoW) that represent each book as a list of all 
the unique tokens it contains, each paired with its number of 
occurrences.

In [9]:
from gensim import corpora 
dictionary = corpora.Dictionary(texts_stem) 
bows = [dictionary.doc2bow(text) for text in texts_stem] 
print(bows[15][:5])

NameError: name 'texts_stem' is not defined
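To show what a bag-of-words looks like without the corpus at hand, here is a minimal pure-Python sketch on two made-up token lists. The ids are assigned in order of first appearance, which is an arbitrary choice for this example and need not match the ids gensim's `Dictionary` would assign.

```python
from collections import Counter

# Two hypothetical, already stemmed documents
toy_docs = [["speci", "select", "speci"], ["plant", "speci"]]

# Build the dictionary: each unique token gets an integer id
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

toy_bows = [to_bow(d) for d in toy_docs]
print(token2id)
print(toy_bows)
```

Each document becomes a sparse list of pairs: word order is lost, and tokens that do not occur in a document are simply absent from its list.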

In order to better understand the model, we will transform it into 
a DataFrame and display the 10 most common stems for the book 
“On the Origin of Species”.

In [None]:
import pandas as pd 
df_bow_origin = pd.DataFrame() 
df_bow_origin['index'] = [i[0] for i in bows[15]] 
df_bow_origin['occurrences'] = [i[1] for i in bows[15]] 
df_bow_origin['token'] = [dictionary[index] for index in df_bow_origin['index']] 
# Sort whole rows so that each token is shown next to its count 
df_bow_origin.sort_values('occurrences', ascending=False).head(10)

### TF-IDF Model
Next, we will use a TF-IDF model to weight each word by how 
frequent it is in a given text and how rare it is across the other 
texts. As a result, a high TF-IDF score for a word indicates that 
this word is specific to this text.

In [None]:
from gensim.models import TfidfModel 
model = TfidfModel(bows) 
model[bows[15]]
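The weighting can be worked through by hand on a tiny example. The sketch below uses one common TF-IDF variant (raw count times log2 of total documents over document frequency, followed by L2 normalization). gensim's `TfidfModel` has its own defaults and options, so treat this as an illustration of the idea rather than a reproduction of its output; `toy_bows` and `tfidf` are hypothetical names.

```python
import math

# Three hypothetical documents as (token_id, count) pairs
toy_bows = [[(0, 2), (1, 1), (3, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 3)]]
n_docs = len(toy_bows)

# Document frequency: in how many documents each token appears
doc_freq = {}
for bow in toy_bows:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(bow):
    """Weight counts by log2(N / df), drop zero weights, L2-normalize."""
    weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights if w > 0]

print(tfidf(toy_bows[0]))
```

Token 0 occurs in every document, so its weight is zero and it drops out, while token 3, which occurs only in the first document, receives the largest weight there.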

Once again, in order to better understand the model, we will 
transform it into a DataFrame and display the 10 most specific 
words for the “On the Origin of Species” book.

In [None]:
df_tfidf = pd.DataFrame() 
df_tfidf['id'] = [i[0] for i in model[bows[15]]] 
df_tfidf['score'] = [i[1] for i in model[bows[15]]] 
df_tfidf['token'] = [dictionary[index] for index in df_tfidf['id']] 
# Sort whole rows so that each token is shown next to its score 
df_tfidf.sort_values('score', ascending=False).head(10) 

### Recommendation
Now that we have a TF-IDF model that scores how specific each word 
is to each book, we can measure how related the books are to each 
other. To do so, we will use cosine similarity and display the 
results as a similarity matrix.

In [None]:
from gensim import similarities 

sims = similarities.MatrixSimilarity(model[bows]) 
sim_df = pd.DataFrame(list(sims)) 
sim_df.columns = titles  
sim_df.index = titles 
sim_df
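For reference, cosine similarity between two weight vectors is their dot product divided by the product of their lengths. Below is a minimal sketch on made-up sparse vectors stored as dicts; the function name `cosine` and the vectors are assumptions for illustration and are independent of gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {token_id: weight}."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF vectors for three books
book_a = {1: 0.6, 3: 0.8}
book_b = {1: 0.8, 2: 0.6}
book_c = {2: 1.0}

print(cosine(book_a, book_b))  # share token 1 only
print(cosine(book_a, book_c))  # share no tokens
```

Two books that share no weighted tokens score 0, and a book compared with itself scores 1, which is why the diagonal of the similarity matrix is all ones.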

### Conclusion 
We now have a matrix containing the similarity measure for every 
pair of books by Charles Darwin! We can use barh() to display a 
horizontal bar plot showing which books are the most similar to 
“On the Origin of Species.”

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt 
 
v = sim_df['OriginofSpecies'] 
v_sorted = v.sort_values() 
v_sorted.plot.barh() 
plt.xlabel('Similarity')
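The same column of similarities can also be turned into a short recommendation list. The sketch below ranks titles by score and skips the query book itself. The titles and scores are invented placeholders, not results from the corpus, and `most_similar` is a hypothetical helper.

```python
def most_similar(scores, query, n=3):
    """Return the n highest-scoring (title, score) pairs, excluding the query."""
    others = [(title, s) for title, s in scores.items() if title != query]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:n]

# Made-up similarity scores for illustration only
toy_scores = {"BookA": 1.0, "BookB": 0.42, "BookC": 0.77, "BookD": 0.05}

top = most_similar(toy_scores, "BookA", n=2)
print(top)
```

With the real data, `sim_df['OriginofSpecies'].to_dict()` could be passed as `scores`, assuming the similarity DataFrame was built as in the cell above.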

However, we also want a better understanding of the big picture, 
to see how Darwin’s books relate to each other in general. To this 
end, we will represent the whole similarity matrix as a 
dendrogram, which is a standard tool for displaying such data.

In [None]:
from scipy.cluster import hierarchy 
 
Z = hierarchy.linkage(sim_df, 'ward') 
chart = hierarchy.dendrogram(Z, leaf_font_size=8, 
labels=sim_df.index, orientation="left") 

Finally, based on the dendrogram above, we can conclude that 
“The Variation of Animals and Plants under Domestication” is the 
book most closely related to “On the Origin of Species.”