# Book-Oracle: Basic Recommendation System

- Develop a basic Recommendation System
- 26.11.2023
- Janina, Oliwia, Neha, Nina

## Import Libraries

In [59]:
import pandas as pd
import numpy as np
import requests

#Modelling
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, OneHotEncoder

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, roc_curve, confusion_matrix, make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from scipy.sparse import csr_matrix, hstack
from sklearn.neighbors import NearestNeighbors

#NLP
import nltk

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
#Display floats with a thousands separator and two decimals
pd.options.display.float_format = "{:,.2f}".format

RSEED = 42

import warnings
warnings.filterwarnings('ignore')

## Import Data

In [60]:
df = pd.read_csv('data/kaggle_full_df.csv')
#Assign the result back instead of using inplace=True on a column selection
df['country'] = df['country'].fillna('unknown')
df.head(3)

Unnamed: 0,book_title,book_author,year_of_publication,publisher,image_url_m,common_identifier,user_id,isbn,book_rating,age,city,country
0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,1,2,195153448,0,18,stockton,usa
1,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,1,269782,801319536,7,30,edmonton,canada
2,Pay It Forward: A Novel,Catherine Ryan Hyde,2000,Simon &amp; Schuster,http://images.amazon.com/images/P/0684862719.0...,2392,269782,684862719,8,30,edmonton,canada


In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1005487 entries, 0 to 1005486
Data columns (total 12 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   book_title           1005487 non-null  object
 1   book_author          1005487 non-null  object
 2   year_of_publication  1005487 non-null  object
 3   publisher            1005487 non-null  object
 4   image_url_m          1005487 non-null  object
 5   common_identifier    1005487 non-null  int64 
 6   user_id              1005487 non-null  int64 
 7   isbn                 1005487 non-null  object
 8   book_rating          1005487 non-null  int64 
 9   age                  1005487 non-null  int64 
 10  city                 1005209 non-null  object
 11  country              1005487 non-null  object
dtypes: int64(4), object(8)
memory usage: 92.1+ MB


##  <span style="color: red;">TO DOs before Modelling</span>

1. Resolve users who rated the same book more than once: if all of their ratings for the book are identical, keep just one; if the ratings differ, remove them.
2. Resolve NaN values in `country` and `city`.
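
A minimal sketch of step 1, run on a toy frame that reuses the column names of `df`. The helper name `resolve_duplicate_ratings` and the toy values are assumptions for illustration, not part of the project code.

```python
import pandas as pd

def resolve_duplicate_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Keep one row where a user rated a book consistently; drop conflicting ratings."""
    keys = ['book_title', 'book_author', 'user_id']
    # Number of distinct ratings each user gave to each book
    n_distinct = ratings.groupby(keys)['book_rating'].transform('nunique')
    consistent = ratings[n_distinct == 1]
    # Identical repeated ratings collapse to a single row
    return consistent.drop_duplicates(subset=keys).reset_index(drop=True)

# Hypothetical example: user 1 rated book A twice with 8 and book B with 5 and 9
toy = pd.DataFrame({
    'book_title':  ['A', 'A', 'B', 'B', 'C'],
    'book_author': ['x', 'x', 'y', 'y', 'z'],
    'user_id':     [1, 1, 1, 1, 2],
    'book_rating': [8, 8, 5, 9, 7],
})
cleaned = resolve_duplicate_ratings(toy)
print(cleaned)
```

Book A keeps a single row, book B is removed because its two ratings disagree, and book C is untouched.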

In [62]:
# example of users rating the same book multiple times
df.groupby(['book_title', 'book_author', 'user_id']).size().reset_index(name='Count').sort_values(by='Count', ascending=False).query("Count > 1").head(10)

Unnamed: 0,book_title,book_author,user_id,Count
551152,"Phonics Fun: Reading Program, Pack 4 (Clifford...",Francie Alexander,185233,12
578426,Ranma 1/2 (Ranma 1/2),Rumiko Takahashi,156111,6
272856,Flame Of Recca (Flame Of Recca),Nobuyuki Anzai,10354,5
42106,Adventures Of Huckleberry Finn,Mark Twain,240258,5
424530,Life And Teaching Of The Masters Of The Far Ea...,Baird T. Spalding,187763,4
145978,Chobits (Chobits),Clamp,9227,4
145980,Chobits (Chobits),Clamp,38023,4
145984,Chobits (Chobits),Clamp,196160,4
145985,Chobits (Chobits),Clamp,224904,4
431933,Little Women,Louisa May Alcott,203240,4


## Recommendation System

There are two main types of recommender systems: collaborative filtering and content-based filtering. Hybrid systems, which combine elements of both, are also common.

Let's create the most basic recommendation system, based on EXPLICIT ratings (1-10) from readers in the USA and Canada:
- we will start with Collaborative Filtering - Item based

<img src="https://miro.medium.com/v2/resize:fit:1064/1*mz9tzP1LjPBhmiWXeHyQkQ.png" alt="Alt text" width="500"/>

#### Subset data
- only EXPLICIT ratings and users from USA & Canada

In [63]:
#Only Rating above 0
df = df[df['book_rating']>0]

#Only users from US or Canada
df = df[df['country'].str.contains("usa|canada")]

df.shape

(303032, 12)

#### Create a new variable: Rating Count

In [64]:
#Add a new column with the total rating count for each book (grouped by title and author)
df['rating_count'] = df.groupby(['book_title', 'book_author'])['book_rating'].transform('count')

#Show a list of books that got the highest rating count, group by title and author to show unique books

df.groupby(['book_title', 'book_author', 'rating_count']).size().reset_index(name='Count').sort_values(by='rating_count', ascending=False).head(5)

Unnamed: 0,book_title,book_author,rating_count,Count
86042,The Lovely Bones: A Novel,Alice Sebold,614,614
79268,The Da Vinci Code,Dan Brown,420,420
91346,The Secret Life Of Bees,Sue Monk Kidd,387,387
103977,Wild Animus,Rich Shapero,352,352
90242,The Red Tent (Bestselling Backlist),Anita Diamant,351,351


In [65]:
df.head(3)

Unnamed: 0,book_title,book_author,year_of_publication,publisher,image_url_m,common_identifier,user_id,isbn,book_rating,age,city,country,rating_count
1,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,1,269782,801319536,7,30,edmonton,canada,1
2,Pay It Forward: A Novel,Catherine Ryan Hyde,2000,Simon &amp; Schuster,http://images.amazon.com/images/P/0684862719.0...,2392,269782,684862719,8,30,edmonton,canada,26
3,Watership Down,Richard Adams,1976,Avon,http://images.amazon.com/images/P/0380002930.0...,3172,269782,140039589,10,30,edmonton,canada,99


#### Define book popularity threshold

In [66]:
popularity_threshold = 50
df = df[df['rating_count'] >= popularity_threshold]
df.shape

(51972, 13)

#### Define user activity threshold

In [67]:
#Keep only users with at least 30 ratings

user_rating_counts = df['user_id'].value_counts()
df = df[df['user_id'].isin(user_rating_counts[user_rating_counts >= 30].index)]
df.shape

(4430, 13)

In [68]:
df['book_title'].nunique()

526

In [69]:
#Read the tab-separated book summaries file (no header row) and name its columns
#The second column is assumed to be the Freebase ID of the CMU Book Summary Dataset

df_cmu = pd.read_csv('data/booksummaries.txt', sep='\t', header=None)
df_cmu.columns = ['book_id', 'freebase_id', 'book_title', 'book_author', 'year_of_publication', 'tags', 'description']

df_cmu.head()

#Merge df with df_cmu on book_title and book_author

df = pd.merge(df, df_cmu[['book_title', 'book_author', 'tags', 'description']], on=['book_title', 'book_author'], how='left')
df.head(3)

Unnamed: 0,book_title,book_author,year_of_publication,publisher,image_url_m,common_identifier,user_id,isbn,book_rating,age,city,country,rating_count,tags,description
0,The Kitchen God'S Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,6,110912,080410753X,9,36,milpitas,usa,99,,
1,Pigs In Heaven,Barbara Kingsolver,1993,Harpercollins,http://images.amazon.com/images/P/0060168013.0...,40,110912,0060922532,8,36,milpitas,usa,96,,
2,The Five People You Meet In Heaven,Mitch Albom,2003,Hyperion,http://images.amazon.com/images/P/0786868716.0...,108,110912,0786868716,10,36,milpitas,usa,231,,


In [70]:
df['description'].nunique()

135

In [71]:
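#Load small_df before inspecting it so the notebook runs from top to bottom
small_df = pd.read_csv('data/df_w_description.csv')
small_df.head()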

Unnamed: 0,book_title,book_author,year_of_publication,publisher,image_url_m,common_identifier,user_id,isbn,book_rating,age,city,country,rating_count,description
0,The Kitchen God'S Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,6,110912,080410753X,9,36,milpitas,usa,99,An absorbing narrative of Winnie Louie's life.
1,Pigs In Heaven,Barbara Kingsolver,1993,Harpercollins,http://images.amazon.com/images/P/0060168013.0...,40,110912,0060922532,8,36,milpitas,usa,96,Generic description
2,The Five People You Meet In Heaven,Mitch Albom,2003,Hyperion,http://images.amazon.com/images/P/0786868716.0...,108,110912,0786868716,10,36,milpitas,usa,231,Eddie dies on his eighty-third birthday in a t...
3,Angels &Amp; Demons,Dan Brown,2001,Pocket Star,http://images.amazon.com/images/P/0671027360.0...,119,110912,0671027360,9,36,milpitas,usa,274,A novel about a symbologist who discovers the ...
4,Little Altars Everywhere: A Novel,Rebecca Wells,1996,Perennial,http://images.amazon.com/images/P/0060976845.0...,135,110912,0060976845,8,36,milpitas,usa,170,"Don't miss Little Altars Everywhere, the New Y..."


In [72]:
small_df = pd.read_csv('data/df_w_description.csv')
small_df.head()

#Merge df with small_df on common_identifier
#Note: if small_df holds several rows per common_identifier, this merge repeats rating rows; deduplicate small_df first in that case

df = pd.merge(df, small_df[['common_identifier', 'description']], on='common_identifier', how='left')

#df already had a description column from df_cmu, so the merge yields description_x (from df_cmu) and description_y (from small_df)
#Where small_df only holds the placeholder 'Generic description', fall back to description_x

df['description'] = np.where(df['description_y'] == 'Generic description', df['description_x'], df['description_y'])



In [73]:
df.nunique()

book_title             526
book_author            253
year_of_publication     61
publisher              119
image_url_m            527
common_identifier      527
user_id                101
isbn                   936
book_rating             10
age                     33
city                    91
country                  2
rating_count           147
tags                    87
description_x          135
description_y          324
description            380
dtype: int64

## I. Collaborative Filtering

Collaborative filtering is based on the idea that users who agreed in the past tend to agree in the future.
It makes automatic predictions about the interests of a user by collecting preferences from many users (collaborating). There are two main types of collaborative filtering: item-based and user-based.

### I. Collaborative Filtering - item based

This method recommends items similar to those the user has liked in the past. The steps involve:

1. Calculate similarity between items.
2. Identify items similar to those the user has liked.
3. Recommend items that are similar to the user's preferences.
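
The three steps can be sketched on a toy book-by-user matrix. The ratings and book names below are made up, and plain NumPy cosine similarity is used here as one possible similarity measure.

```python
import numpy as np
import pandas as pd

# Toy book x user rating matrix (0 = not rated); all values are made up
ratings = pd.DataFrame(
    [[8, 0, 7, 0],
     [9, 0, 8, 0],
     [0, 6, 0, 9]],
    index=['Book A', 'Book B', 'Book C'],
    columns=['u1', 'u2', 'u3', 'u4'],
    dtype=float,
)

# Step 1: cosine similarity between every pair of books
unit = ratings.values / np.linalg.norm(ratings.values, axis=1, keepdims=True)
item_sim = pd.DataFrame(unit @ unit.T, index=ratings.index, columns=ratings.index)

# Steps 2-3: for a book the user liked, rank the remaining books by similarity
liked = 'Book A'
recommendations = item_sim[liked].drop(liked).sort_values(ascending=False)
print(recommendations)
```

Book B is rated by the same users as Book A and ranks first; Book C shares no raters with Book A and gets a similarity of 0.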



#### Create a Pivot Matrix

In [14]:
movie_features_df = df.pivot_table(index='book_title',columns='user_id',values='book_rating').fillna(0)
movie_features_df.head()

book_title \ user_id,4017,6251,6575,7346,8454,13552,16795,21014,22625,23872,...,234828,235105,235282,236283,240567,241980,242083,255489,258534,270713
1984,0.0,0.0,0.0,8.0,0.0,0.0,8.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,8.0,0.0,7.0,0.0,0.0,0.0
1St To Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
2Nd Chance,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
A Bend In The Road,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Case Of Need,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Visualisation of corresponding matrix for user-based collaborative filtering
<img src="https://developers.google.com/machine-learning/recommendation/images/1Dmatrix.svg" alt="Alt text" width="800"/>

#### Train a KNN model for item-based collaborative filtering

<img src = "https://dataconomy.com/wp-content/uploads/2015/04/Five-most-popular-similarity-measures-implementation-in-python-4-620x475.png" alt="Alt text" width="400"/>

In [26]:
#Convert our table to a matrix
movie_features_df_matrix = csr_matrix(movie_features_df.values)


#Instantiate a model - cosine distance as the metric
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')

#Fit the model to our data
model_knn.fit(movie_features_df_matrix)

#### Assess predictions of the model

In [29]:
#Choose a random book from our dataset
query_index = np.random.choice(movie_features_df.shape[0])
print("Query index: {}".format(query_index))

#Find the 6 nearest neighbors based on cosine distance (the first one is the query book itself)
distances, indices = model_knn.kneighbors(movie_features_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)

#Print predicted books

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(movie_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, movie_features_df.index[indices.flatten()[i]], distances.flatten()[i]))

Query index: 69
Recommendations for Body Of Lies:

1: Final Target, with distance of 0.36686957215430427:
2: The Blue Nowhere : A Novel, with distance of 0.4896841453718712:
3: Deception Point, with distance of 0.5010881197077026:
4: Red Rabbit, with distance of 0.5124989420538478:
5: City Of Bones, with distance of 0.5218999520055406:
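
To query by title instead of a random index, the lookup above can be wrapped in a small helper. This is a sketch under assumptions: the function name `recommend_similar_books` and the toy matrix are hypothetical, and the model is refitted on every call for simplicity.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend_similar_books(features: pd.DataFrame, title: str, n: int = 5) -> pd.Series:
    """Return the n books closest to `title`, as a Series of cosine distances."""
    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(csr_matrix(features.values))
    # Ask for one extra neighbour: the query book is normally its own nearest neighbour
    k = min(n + 1, len(features))
    distances, indices = knn.kneighbors(features.loc[[title]].values, n_neighbors=k)
    result = pd.Series(distances.flatten(), index=features.index[indices.flatten()])
    return result.drop(title, errors='ignore').head(n)

# Toy book x user matrix with made-up ratings (0 = not rated)
toy = pd.DataFrame(
    [[8, 0, 7, 0], [9, 0, 8, 0], [0, 6, 0, 9], [7, 1, 0, 0]],
    index=['Book A', 'Book B', 'Book C', 'Book D'],
    columns=[1, 2, 3, 4],
    dtype=float,
)
recs = recommend_similar_books(toy, 'Book A', n=2)
print(recs)
```

With the real data, the call would look like `recommend_similar_books(movie_features_df, 'Body Of Lies')`.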


## Content-based filtering

In [30]:
df.head()

Unnamed: 0,book_title,book_author,year_of_publication,publisher,image_url_m,common_identifier,user_id,isbn,book_rating,age,city,country,rating_count
2671,The Kitchen God'S Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,6,110912,080410753X,9,36,milpitas,usa,99
2673,Pigs In Heaven,Barbara Kingsolver,1993,Harpercollins,http://images.amazon.com/images/P/0060168013.0...,40,110912,0060922532,8,36,milpitas,usa,96
2679,The Five People You Meet In Heaven,Mitch Albom,2003,Hyperion,http://images.amazon.com/images/P/0786868716.0...,108,110912,0786868716,10,36,milpitas,usa,231
2681,Angels &Amp; Demons,Dan Brown,2001,Pocket Star,http://images.amazon.com/images/P/0671027360.0...,119,110912,0671027360,9,36,milpitas,usa,274
2683,Little Altars Everywhere: A Novel,Rebecca Wells,1996,Perennial,http://images.amazon.com/images/P/0060976845.0...,135,110912,0060976845,8,36,milpitas,usa,170


In [34]:
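#Connect to Google Books API
#Read the key from an environment variable instead of hard-coding it in the notebook (the variable name is an assumption)
import os
api_key = os.environ.get("GOOGLE_BOOKS_API_KEY")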

def get_book_description(isbn, api_key):
    base_url = "https://www.googleapis.com/books/v1/volumes"
    params = {
        'q': f"isbn:{isbn}",
        'key': api_key,
    }

    # A timeout keeps a single slow request from blocking the whole loop
    response = requests.get(base_url, params=params, timeout=10)

    if response.status_code == 200:
        data = response.json()
        if 'items' in data:
            # Extract relevant information from the response
            book_info = data['items'][0]['volumeInfo']
            description = book_info.get('description', 'Description not available')
            return description
        else:
            return "Book not found."
    else:
        return f"Error: {response.status_code}"

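# Create a new column 'description' and fill it with the book descriptions
# Query each unique ISBN once instead of once per rating row to limit the number of API calls
isbn_to_description = {isbn: get_book_description(isbn, api_key) for isbn in df['isbn'].unique()}
df['description'] = df['isbn'].map(isbn_to_description)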

KeyboardInterrupt: 

## <span style="color: green;">Next Steps</span>

#### 0. <span style="color: blue;">Baseline model - Item based:</span>

**TO DOs** Always recommend the most popular book.
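
One possible shape for this baseline, assuming "most popular" means "most ratings", with ties broken by mean rating. The helper name `most_popular_books` and the toy data are hypothetical.

```python
import pandas as pd

def most_popular_books(ratings: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank books by number of ratings, breaking ties by mean rating."""
    stats = (ratings.groupby(['book_title', 'book_author'])['book_rating']
                    .agg(rating_count='count', mean_rating='mean')
                    .reset_index())
    return stats.sort_values(['rating_count', 'mean_rating'], ascending=False).head(n)

# Toy ratings with the same column names as df; values are made up
toy = pd.DataFrame({
    'book_title':  ['A', 'A', 'A', 'B', 'B', 'C'],
    'book_author': ['x', 'x', 'x', 'y', 'y', 'z'],
    'book_rating': [8, 9, 7, 10, 9, 6],
})
baseline = most_popular_books(toy, n=2)
print(baseline)
```

Every user receives the same list, which makes this a useful lower bound when evaluating the personalised models.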

#### I. <span style="color: blue;">Collaborative Filtering - Item based:</span>

**TO DOs**

expand on the above code:
- (implement a possibility of choosing 2+ books and get recommendations based on them)
- implement meta-data to books (author, genre, etc., "suitable for kids") and use it to get recommendations
- explore different libraries (see Resources below)
- evaluate the model
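
For the first item, one simple option is to score every candidate by its mean cosine similarity to all chosen books. The sketch below assumes a book-by-user matrix shaped like `movie_features_df`; the function name `recommend_for_books` and the toy ratings are hypothetical.

```python
import numpy as np
import pandas as pd

def recommend_for_books(features: pd.DataFrame, titles: list, n: int = 5) -> pd.Series:
    """Score every book by its mean cosine similarity to the chosen titles."""
    values = features.values.astype(float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    sim = pd.DataFrame(unit @ unit.T, index=features.index, columns=features.index)
    # Average the similarity to each chosen book, then drop the chosen books themselves
    scores = sim.loc[titles].mean(axis=0)
    return scores.drop(titles).sort_values(ascending=False).head(n)

# Toy book x user matrix with made-up ratings (0 = not rated)
toy = pd.DataFrame(
    [[8, 0, 7, 0], [9, 0, 8, 0], [0, 6, 0, 9], [7, 1, 0, 0]],
    index=['Book A', 'Book B', 'Book C', 'Book D'],
    columns=[1, 2, 3, 4],
    dtype=float,
)
multi_recs = recommend_for_books(toy, ['Book A', 'Book D'], n=2)
print(multi_recs)
```

Averaging treats all chosen books equally; weighting by the user's rating of each book would be a natural refinement.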

#### II. <span style="color: blue;"> Collaborative Filtering - User-based:</span>

This approach recommends products or items that users with similar tastes have liked in the past. The basic steps include:

- Calculate similarity between users (e.g., cosine similarity).
- Identify users with similar preferences.
- Recommend items that these similar users have liked **but the target user hasn't**.
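
The steps above can be sketched as follows, on a toy user-by-book matrix (the transpose of the pivot used for the item-based model). The function name `recommend_user_based`, the parameters `k` and `n`, and all ratings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def recommend_user_based(user_item: pd.DataFrame, user_id, k: int = 2, n: int = 3) -> pd.Series:
    """Score unseen books by the similarity-weighted ratings of the k most similar users."""
    values = user_item.values.astype(float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.where(norms == 0, 1.0, norms)
    # Step 1: cosine similarity between users
    sim = pd.DataFrame(unit @ unit.T, index=user_item.index, columns=user_item.index)
    # Step 2: the k most similar users, excluding the target user
    neighbours = sim[user_id].drop(user_id).nlargest(k)
    # Step 3: weighted sum of the neighbours' ratings, restricted to books the target has not rated
    scores = user_item.loc[neighbours.index].mul(neighbours, axis=0).sum(axis=0)
    unseen = user_item.loc[user_id] == 0
    return scores[unseen].sort_values(ascending=False).head(n)

# Toy user x book matrix with made-up ratings (0 = not rated)
toy_users = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 7, 0],
     [0, 0, 2, 9]],
    index=['u1', 'u2', 'u3'],
    columns=['B1', 'B2', 'B3', 'B4'],
    dtype=float,
)
user_recs = recommend_user_based(toy_users, 'u1', k=2)
print(user_recs)
```

User `u2` shares the taste of `u1`, so the book only `u2` has rated (`B3`) ranks first, while books `u1` already rated are never returned.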


**TO DOs**

- research!
- implement a basic user-based collaborative filtering model 
- compare results to the item-based collaborative filtering model, then improve it:
- Auto-Generate Embeddings: https://developers.google.com/machine-learning/recommendation/collaborative/basics, https://www.youtube.com/watch?v=v90un9ALRzw&list=PLQY2H8rRoyvy2MiyUBz5RWZr5MPFkV3qz&index=2
- use implicit vs. explicit rating
- (explore different libraries (see Resources below))
- evaluate the model

#### III. <span style="color: blue;"> Content-based Filtering:</span>

Content-based filtering recommends items based on their features and the user's past preferences. It's built on the idea of understanding the properties of items and a profile of the user's preferences. The steps include:

- Create a profile of the user's preferences based on items they have liked.
- Recommend items that match the user's profile.
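
A minimal sketch of both steps, assuming book descriptions are the only item feature and TF-IDF is used to vectorise them. The three books and their descriptions are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; titles and descriptions are made up
books = pd.DataFrame({
    'book_title': ['Space War', 'Galaxy Patrol', 'Garden Cookbook'],
    'description': [
        'A starship crew fights an alien fleet in deep space.',
        'Alien invaders threaten a starship patrol in space.',
        'Seasonal vegetable recipes from the kitchen garden.',
    ],
}).set_index('book_title')

# Item features: TF-IDF vectors of the descriptions
tfidf = TfidfVectorizer(stop_words='english')
item_vectors = tfidf.fit_transform(books['description'])

# User profile: mean TF-IDF vector of the books the user liked
liked = ['Space War']
rows = [books.index.get_loc(title) for title in liked]
profile = np.asarray(item_vectors[rows].mean(axis=0)).reshape(1, -1)

# Recommend the books whose vectors are closest to the profile
scores = cosine_similarity(profile, item_vectors).ravel()
ranking = pd.Series(scores, index=books.index).drop(liked).sort_values(ascending=False)
print(ranking)
```

Unlike collaborative filtering, this needs no ratings from other users, so it also works for books that nobody has rated yet.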

**TO DOs**

- research about this type of filtering (check existing notebooks which use it in the context of books or movies)
- Add meta-data to books - genre & book description
- Perform modelling
- explore different libraries (see Resources below)
- evaluate the model

#### Key Concepts & Components

- **User/item matrix:** A matrix where rows represent users, columns represent items, and entries contain user-item interactions (e.g., ratings, clicks).

- **Sparsity:** The user-item matrix is often sparse, meaning most users have not interacted with most items. Recommender systems must deal with missing data.

- **Similarity metrics:** For collaborative filtering, calculating similarity between users or items is crucial. Common metrics include:
    - cosine similarity
    - Pearson correlation

- **Recommendation algorithms:** Most common modelling techniques used for collaborative and content-based filtering:
    - k-nearest neighbors
    - matrix factorization
    - deep learning
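
Sparsity and the two similarity metrics can be illustrated on a toy matrix. The values are made up, and for simplicity the metrics are computed over all entries, including the zeros that stand for missing ratings; in practice Pearson correlation is often restricted to co-rated items.

```python
import numpy as np

# Toy user x item matrix; 0 marks a missing rating (values are made up)
R = np.array([
    [8, 0, 7, 0, 9],
    [9, 0, 8, 0, 10],
    [0, 6, 0, 9, 0],
], dtype=float)

# Sparsity: share of user-item pairs without an interaction
sparsity = 1.0 - np.count_nonzero(R) / R.size

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    return cosine(a - a.mean(), b - b.mean())

print(f"sparsity: {sparsity:.2f}")
print(f"cosine(u0, u1):  {cosine(R[0], R[1]):.3f}")
print(f"pearson(u0, u2): {pearson(R[0], R[2]):.3f}")
```

Users 0 and 2 have no items in common, so their cosine similarity is 0, while Pearson correlation turns negative because centring treats the missing ratings as below-average scores.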

#### Resources

Popular libraries for building recommender systems in Python include:

- **Surprise**: A Python library specifically designed for building and analyzing recommender systems.

- **scikit-learn**: General-purpose machine learning library that includes tools for building collaborative and content-based models.

- **LightFM**: A hybrid recommender library that supports both collaborative and content-based filtering.


## Advanced:

#### IV. <span style="color: blue;"> Neural Networks:</span>

TensorFlow is an open-source machine learning library developed by the Google Brain team. It provides a flexible and efficient platform for building and deploying machine learning models, including recommender systems, and its tools and modules make it suitable for developing complex recommendation algorithms. The resources below show how it can be used for building recommender systems:

- Exemplary Code & Implementation step-by-step: https://www.tensorflow.org/recommenders
- YT video tutorial playlist: https://www.youtube.com/playlist?list=PLQY2H8rRoyvy2MiyUBz5RWZr5MPFkV3qz




## Pipeline Architecture

## Sample Size

## Modelling

## Evaluation

## Error Analysis