# Part 3: ML Recommender System

In [13]:
import pandas as pd
import numpy as np
import pickle
import self_created_functions as scf

In [14]:
df = pd.read_csv("./cleaned_datasets/books_clean.csv")
df.shape
df.head(3)

Unnamed: 0,author,avg_rating,genres,language,num_pages,num_ratings,num_reviews,title,url
0,jon krakauer,4.0,environment travel survival biography memoir a...,English,215.0,983231.0,24367.0,Into the Wild,https://www.goodreads.com/book/show/1845.Into_...
1,bell hooks,4.14,social movement politic sociology race women s...,English,123.0,18885.0,1586.0,Feminism Is for Everybody: Passionate Politics,https://www.goodreads.com/book/show/168484.Fem...
2,mark bowden,4.28,war africa north american cultural politic mil...,English,386.0,59451.0,1727.0,Black Hawk Down: A Story of Modern War,https://www.goodreads.com/book/show/55403.Blac...


## Using MinMaxScaler and a ball tree nearest-neighbors model
- avg_rating
- language
- num_reviews
- num_pages

The ball tree algorithm finds the nearest neighbors of each data point and supports a variety of distance metrics, which lets us use either Euclidean or haversine distance.
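
As a rough illustration of the idea above, the sketch below scales two numeric columns and queries a ball tree with scikit-learn. It is not the code inside `scf.train_ball_tree`, whose implementation is not shown here; the toy values, the variable names, and the city coordinates are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: [num_reviews, num_pages] for five hypothetical books.
X = np.array([
    [24367.0, 215.0],
    [1586.0, 123.0],
    [1727.0, 386.0],
    [25000.0, 220.0],
    [1500.0, 130.0],
])

# Scale every column to [0, 1] so num_reviews does not dominate the distance.
X_scaled = MinMaxScaler().fit_transform(X)

# Build the tree with a Euclidean metric and ask for each row's 2 nearest rows.
tree = BallTree(X_scaled, metric="euclidean")
dist, idx = tree.query(X_scaled, k=2)

# Column 0 is the query row itself (distance 0); column 1 is its closest other row.
nearest_other = idx[:, 1]

# Haversine expects exactly two columns, [latitude, longitude], in radians.
# Hypothetical coordinates: Paris, London, New York.
coords = np.radians([[48.8566, 2.3522], [51.5074, -0.1278], [40.7128, -74.0060]])
geo_tree = BallTree(coords, metric="haversine")
geo_dist, geo_idx = geo_tree.query(coords[:1], k=2)
```

Haversine only applies to latitude/longitude pairs, so for the book features listed above the Euclidean metric is the natural choice.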

In [15]:
# Preprocessing

# Standardize title
df['title'] = df['title'].apply(lambda x: x.title())

# Feature 1: Group the ratings into five bands
ratings = ['very low','low','neutral','high','very high']

def rating_band(x):
    # Bands: (-inf, 1], (1, 2], (2, 3], (3, 4], and everything else
    if x <= 1:
        return ratings[0]
    if x <= 2:
        return ratings[1]
    if x <= 3:
        return ratings[2]
    if x <= 4:
        return ratings[3]
    return ratings[4]

df['avg_rating'] = df['avg_rating'].apply(rating_band)

# Feature 2: Group the languages
languages = ['English','German','Spanish','French','Dutch']

df['language'] = df['language'].apply(lambda x: 'others' if x not in languages else x)

# One hot encode
rating_df = pd.get_dummies(df['avg_rating'])
language_df = pd.get_dummies(df['language'])
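
The rating bands in the cell above could also be expressed with `pd.cut`. The following is a minimal sketch on made-up ratings, not a drop-in replacement: the sample values and the names `binned` and `rating_dummies` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

ratings = ['very low', 'low', 'neutral', 'high', 'very high']

# Hypothetical ratings, including values that sit exactly on a bin edge.
avg_rating = pd.Series([0.5, 1.0, 2.5, 3.0, 4.0, 4.28])

# right=True (the default) makes each bin (low, high], matching the <= comparisons.
binned = pd.cut(avg_rating, bins=[-np.inf, 1, 2, 3, 4, np.inf], labels=ratings)

# One-hot encode; every label gets a column even if no row falls in that band.
rating_dummies = pd.get_dummies(binned)
```

One difference to keep in mind: `pd.cut` leaves missing ratings as missing, whereas the chained comparisons in the cell above send a `NaN` rating to the last band, because every comparison with `NaN` is false.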

In [16]:
# Train the model (Recommender System)
features_1 = pd.concat([rating_df, 
                      language_df, 
                      df['num_reviews'], 
                      df['num_pages']], axis=1)

scf.train_ball_tree(features_1)

Input "Y" to train model else any other letter to skip.
n
You've chosen not to train the model.


In [17]:
# Testing the recommender system
with open("./models/ball_tree_1","rb") as f:
    model_1 = pickle.load(f)
recommendations = scf.ball_tree_recommender("Harry Potter And The Deathly Hallows",df=df,id_list=model_1[1])

Book Recommendations:
1 Six Of Crows
  Author: Leigh Bardugo 

2 The Lightning Thief
  Author: Rick Riordan 

3 The Girl With The Dragon Tattoo
  Author: Stieg Larsson 

4 Wonder
  Author: R J Palacio 

5 The Giver
  Author: Lois Lowry 



#### Using the genre features on their own
Using the same technique as above, we train a second model on a different feature: genres.
Because of limited memory (RAM), we were unable to combine the genre features with the first feature set.
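
The genre encoding can be sketched on toy data with `Series.str.get_dummies`, which splits each string on a separator and builds one indicator column per distinct token. The genre strings below are invented for the example.

```python
import pandas as pd

# Hypothetical genre strings in the same space-separated style as df['genres'].
genres = pd.Series(["fantasy magic young-adult", "martial history", "art history"])

# Split on spaces and build one 0/1 indicator column per distinct token.
genre_dummies = genres.str.get_dummies(sep=" ")

# Whole-token matching: "art" is set only for the third row.
# A substring test such as ("art" in "martial history") would also flag the second row.
art_flags = genre_dummies["art"].tolist()
```

Matching whole tokens matters here because short genre names can appear inside longer ones.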

In [18]:
# Preprocessing

# Drop duplicate words in each cell
df['genres'] = df['genres'].apply(scf.remove_duplicate_text)

# One-hot encode: one column per distinct genre token
columns = list(set(" ".join(x for x in df['genres']).split(" ")))
features_2 = df[['genres']].copy()
for col in columns:
    # Match whole tokens; a plain substring test would also match e.g. "art" inside "martial"
    features_2[col] = features_2['genres'].apply(lambda x: 1 if col in x.split(" ") else 0)
features_2.drop(columns=['genres'], inplace=True)



In [19]:
# Training the 2nd model (Recommender System)
scf.train_ball_tree(features_2)

Input "Y" to train model else any other letter to skip.
n
You've chosen not to train the model.


In [20]:
# Testing the 2nd model
with open("./models/ball_tree_2","rb") as f:
    model_2 = pickle.load(f)
recommendations_2 = scf.ball_tree_recommender("Harry Potter And The Deathly Hallows",df=df,id_list=model_2[1])

Book Recommendations:
1 Harry Potter Ja Surma Vägised
  Author: J K Rowling 

2 Harry Potter Series Box Set
  Author: J K Rowling 

3 Harry Potter And The Goblet Of Fire
  Author: J K Rowling 

4 Harry Potter And The Order Of The Phoenix
  Author: J K Rowling 

5 Harry Potter And The Half-Blood Prince
  Author: J K Rowling 



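Using genres alone appears to give more relevant recommendations: all five results are other Harry Potter titles, whereas the first model returned books by other authors.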

## Using Sentence Transformer and Cosine Similarity

In [21]:
train_data = np.array(df.genres)

scf.generating_cosine_similarity(train_data)

Input "Y" to run else any other letter to skip.
n
You've chosen not to generate the generator.
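
For orientation, here is a minimal NumPy sketch of ranking items by cosine similarity. It does not reproduce `scf.generating_cosine_similarity`, whose code is not shown here; the function name `top_k_cosine` and the toy vectors are assumptions, and real embeddings would come from a sentence encoder. The sketch scores one query row against all rows on demand, which avoids holding a full N x N similarity matrix in memory.

```python
import numpy as np

def top_k_cosine(embeddings, query_index, k=5):
    """Return indices and scores of the k rows most similar to embeddings[query_index]."""
    # L2-normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Similarity of the query row against every row: one vector of length N.
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query itself
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy 3-dimensional "embeddings" standing in for sentence-encoder output.
emb = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
top_idx, top_scores = top_k_cosine(emb, query_index=0, k=2)
```

Here row 1 points in almost the same direction as row 0, so it ranks first, followed by row 3 at a 45 degree angle.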


In [22]:
# Load the cosine similarity (cs) model
# Note: cosine_similarity.pkl is 11GB in size, so it is not uploaded to GitHub
# Run scf.generating_cosine_similarity to generate the pickle file on your local machine
cs_df= pd.read_pickle("./models/cosine_similarity.pkl")

# Testing cs model
cs_recommendations = scf.cosine_similarity_recommender("Harry Potter and the Deathly Hallows",df=df,cs_df=cs_df)

Book Recommendations:
1 Harry Potter And The Goblet Of Fire
  Author: J k rowling 

2 Harry Potter And The Half-Blood Prince
  Author: J k rowling 

3 Harry Potter Series Box Set
  Author: J k rowling 

4 Harry Potter And The Order Of The Phoenix
  Author: J k rowling 

5 Harry Potter Ja Surma Vägised
  Author: J k rowling 



## Combining all three models to generate a set of unique recommendations

In [23]:
book_title = "Harry Potter and the Deathly Hallows"
test = scf.combined_recommender(book_title=book_title,id_list1=model_1[1],id_list2=model_2[1],df=df,cs_df=cs_df)
test

Model 1
Book Recommendations:
1 Six Of Crows
  Author: Leigh Bardugo 

2 The Lightning Thief
  Author: Rick Riordan 

3 The Girl With The Dragon Tattoo
  Author: Stieg Larsson 

4 Wonder
  Author: R J Palacio 

5 The Giver
  Author: Lois Lowry 

Model 2
Book Recommendations:
1 Harry Potter Ja Surma Vägised
  Author: J K Rowling 

2 Harry Potter Series Box Set
  Author: J K Rowling 

3 Harry Potter And The Goblet Of Fire
  Author: J K Rowling 

4 Harry Potter And The Order Of The Phoenix
  Author: J K Rowling 

5 Harry Potter And The Half-Blood Prince
  Author: J K Rowling 

Model 3
Book Recommendations:
1 Harry Potter And The Goblet Of Fire
  Author: J k rowling 

2 Harry Potter And The Half-Blood Prince
  Author: J k rowling 

3 Harry Potter Series Box Set
  Author: J k rowling 

4 Harry Potter And The Order Of The Phoenix
  Author: J k rowling 

5 Harry Potter Ja Surma Vägised
  Author: J k rowling 



Unnamed: 0,Recommendations,Author
1,Six Of Crows,leigh bardugo
2,The Lightning Thief,rick riordan
3,The Girl With The Dragon Tattoo,stieg larsson
4,Wonder,r j palacio
5,The Giver,lois lowry
6,Harry Potter Ja Surma Vägised,j k rowling
7,Harry Potter Series Box Set,j k rowling
8,Harry Potter And The Goblet Of Fire,j k rowling
9,Harry Potter And The Order Of The Phoenix,j k rowling
10,Harry Potter And The Half-Blood Prince,j k rowling
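
The merge step can be pictured as concatenating the three lists in model order and keeping the first occurrence of each title. The sketch below is an assumption about that behaviour based on the table above, not the body of `scf.combined_recommender`; the helper name `combine_unique` and the shortened input lists are hypothetical.

```python
import pandas as pd

def combine_unique(*recommendation_lists):
    """Concatenate (title, author) lists in order, keeping the first occurrence of each title."""
    rows = [row for recs in recommendation_lists for row in recs]
    combined = pd.DataFrame(rows, columns=["Recommendations", "Author"])
    combined = combined.drop_duplicates(subset="Recommendations").reset_index(drop=True)
    combined.index = combined.index + 1  # 1-based ranks
    return combined

# Shortened, hypothetical outputs of three models.
recs_1 = [("Six Of Crows", "leigh bardugo"), ("Wonder", "r j palacio")]
recs_2 = [("Harry Potter Series Box Set", "j k rowling")]
recs_3 = [("Harry Potter Series Box Set", "j k rowling"), ("Wonder", "r j palacio")]

unique_recs = combine_unique(recs_1, recs_2, recs_3)
```

Because duplicates are dropped after concatenation, a title suggested by several models keeps the position given by the earliest model that proposed it.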


# Conclusion

I built three content-based recommender systems that suggest books in a cold-start scenario, then combined their outputs into a single list of unique titles. Similarity between books is measured by distance: two models use a ball tree nearest-neighbor search, and the third uses cosine similarity.

Another approach is the multi-armed bandit method, in which books are first recommended at random to collect user feedback. When the user rates a book positively, the recommender generates a new list of recommendations. Because of time constraints, I did not explore this method.
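
For readers curious what such a method might look like, below is a small epsilon-greedy sketch, one common multi-armed bandit strategy. It was not part of this project; the class name, the reward convention (1 for a positive rating, 0 otherwise), and the sample titles are all assumptions.

```python
import random

class EpsilonGreedyRecommender:
    """Minimal epsilon-greedy bandit over a fixed list of titles."""

    def __init__(self, titles, epsilon=0.1, seed=None):
        self.titles = list(titles)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {t: 0 for t in self.titles}
        self.mean_reward = {t: 0.0 for t in self.titles}

    def recommend(self):
        # Explore a random title with probability epsilon, otherwise exploit the best so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.titles)
        return max(self.titles, key=lambda t: self.mean_reward[t])

    def update(self, title, reward):
        # Incremental mean of the observed rewards (1 = liked, 0 = disliked).
        self.counts[title] += 1
        self.mean_reward[title] += (reward - self.mean_reward[title]) / self.counts[title]

# epsilon=0.0 disables exploration so the example is deterministic.
bandit = EpsilonGreedyRecommender(["The Giver", "Wonder", "Six Of Crows"], epsilon=0.0, seed=0)
bandit.update("Wonder", 1)
bandit.update("The Giver", 0)
pick = bandit.recommend()
```

In practice a small positive `epsilon` would be used so that unrated books still get shown occasionally.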

One recurring issue was the lack of memory (RAM) to run and test different models and ideas, which restricted my ability to run more tests and produce visualisations. Working in a cloud environment would be one way around this, but time and financial constraints ruled it out. It would nonetheless be a good way to further enhance the recommender system.