<a href="https://colab.research.google.com/github/JoshAmpofo/Zummit_Africa_Fellowship/blob/main/RecommenderSystems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Synopsis**
Develop a book recommender system that generates **personalized book recommendations** for users based on their **past ratings**, using **collaborative filtering** techniques.

- Implement collaborative filtering, specifically focusing on user-based collaborative filtering.
 - *Tips: Maintain a clean and organized code structure with clear comments explaining each step.*
- Use descriptive variable names throughout your code for better readability.
- Consider incorporating error handling to address potential issues during data processing or model training.

**Dataset: [Book Recommendation](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/data)**


# **Load Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from math import sqrt

# **Load Dataset**

In [2]:
from google.colab import drive
drive.mount('/content/drive/')
# load all csv files
books = pd.read_csv('/content/drive/My Drive/Zummit_Datasets/Books.csv')
ratings = pd.read_csv('/content/drive/My Drive/Zummit_Datasets/Ratings.csv')
user_info = pd.read_csv('/content/drive/My Drive/Zummit_Datasets/Users.csv')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
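The synopsis suggests adding error handling around data processing. A minimal sketch of a guarded loader follows; the helper name `load_csv` is hypothetical and not part of the original notebook. Passing `dtype={'ISBN': str}` is an assumption that keeps ISBNs such as `034545104X` or `0155061224` as text so leading zeros survive.

```python
import pandas as pd

def load_csv(path, **read_kwargs):
    """Read a CSV file, returning None (with a message) if it cannot be loaded."""
    try:
        # low_memory=False lets pandas infer each column's dtype from the whole file
        return pd.read_csv(path, low_memory=False, **read_kwargs)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {path}")
    return None
```

It could be called as `books = load_csv('/content/drive/My Drive/Zummit_Datasets/Books.csv', dtype={'ISBN': str})`, followed by a check that the result is not `None` before continuing.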




# **Investigate Datasets**

In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
# make a copy of books
bookcpy = books.copy()
# drop unwanted columns
drop_cols = ['Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'Book-Author']
bookcpy = bookcpy.drop(drop_cols, axis=1)
bookcpy.head()

Unnamed: 0,ISBN,Book-Title,Year-Of-Publication,Publisher
0,0195153448,Classical Mythology,2002,Oxford University Press
1,0002005018,Clara Callan,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,1999,W. W. Norton & Company


In [5]:
bookcpy['Book-Title'].unique()

array(['Classical Mythology', 'Clara Callan', 'Decision in Normandy', ...,
       'Lily Dale : The True Story of the Town that Talks to the Dead',
       "Republic (World's Classics)",
       "A Guided Tour of Rene Descartes' Meditations on First Philosophy with Complete Translations of the Meditations by Ronald Rubin"],
      dtype=object)

In [6]:
# check for null or missing values
bookcpy.isna().sum()

ISBN                   0
Book-Title             0
Year-Of-Publication    0
Publisher              2
dtype: int64

In [7]:
# check dtypes of column values
bookcpy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 4 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Year-Of-Publication  271360 non-null  object
 3   Publisher            271358 non-null  object
dtypes: object(4)
memory usage: 8.3+ MB
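The `info()` output above reports `Year-Of-Publication` as `object` rather than a numeric type. If the year is needed for analysis, one option is to coerce it with `pd.to_numeric`. The sketch below uses an invented three-row frame (the "Hypothetical Bad Row" entry is made up for illustration) and is not a step the original notebook performs.

```python
import pandas as pd

# toy stand-in for bookcpy; the non-numeric entry is invented for illustration
toy_books = pd.DataFrame({
    "Book-Title": ["Classical Mythology", "Clara Callan", "Hypothetical Bad Row"],
    "Year-Of-Publication": ["2002", 2001, "not a year"],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
toy_books["Year-Of-Publication"] = pd.to_numeric(
    toy_books["Year-Of-Publication"], errors="coerce"
)

print(toy_books["Year-Of-Publication"].dtype)         # float64, because NaN is present
print(toy_books["Year-Of-Publication"].isna().sum())  # rows that failed to parse
```

Rows that become `NaN` can then be inspected or dropped, depending on how much they matter downstream.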


In [8]:
# investigate ratings
# make a copy of ratings
ratingscpy = ratings.copy()
ratingscpy.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [9]:
# check for null values
ratingscpy.isna().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [10]:
# check dtypes
ratingscpy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [11]:
# investigate user_info
usercpy = user_info.copy()
usercpy.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [12]:
# check for null values
usercpy.isna().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

In [13]:
# use mean imputation to fill in NaN values in Age column
usercpy['Age'] = usercpy['Age'].fillna(usercpy['Age'].mean())
usercpy.isna().sum()

User-ID     0
Location    0
Age         0
dtype: int64

In [14]:
# drop location column from user_info
usercpy = usercpy.drop('Location', axis=1)
usercpy

Unnamed: 0,User-ID,Age
0,1,34.751434
1,2,18.000000
2,3,34.751434
3,4,17.000000
4,5,34.751434
...,...,...
278853,278854,34.751434
278854,278855,50.000000
278855,278856,34.751434
278856,278857,34.751434


# Model Building: **Collaborative Filtering**

In [15]:
# import necessary libraries
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

In [16]:
# select a subset of the dataset for model building:
# flag users who have rated more than 300 books (the name ratings500 does not match the threshold)
ratings500 = ratingscpy['User-ID'].value_counts() > 300
ratings500.head()

User-ID
11676     True
198711    True
153662    True
98391     True
35859     True
Name: count, dtype: bool

In [17]:
# create an index of the active users' IDs and select the dataset based on it
ratindx = ratings500[ratings500].index
ratindx

Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352, 110973,
       235105,
       ...
       224646,  63394,  85701, 106816,  15418,  82511,  92853,  62895,  37567,
       263163],
      dtype='int64', name='User-ID', length=559)
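The two cells above build a boolean mask from `value_counts()` and then keep the index entries where the mask is `True`. A minimal sketch of the same pattern on invented data may make the mechanics easier to follow; the user IDs, ISBN labels, and the `min_ratings` threshold of 2 are all assumptions for illustration.

```python
import pandas as pd

# toy ratings; user IDs, ISBN labels and the threshold are invented for illustration
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "Book-Rating": [5, 0, 8, 7, 0, 10],
})
min_ratings = 2  # the notebook uses 300

counts = toy_ratings["User-ID"].value_counts()       # ratings per user
active_users = counts[counts > min_ratings].index    # users above the threshold
toy_active = toy_ratings[toy_ratings["User-ID"].isin(active_users)]

print(list(active_users))  # [1]
print(len(toy_active))     # 3
```

Only user 1 has more than two ratings, so only that user's three rows are kept.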

In [18]:
new_ratings = ratingscpy[ratingscpy['User-ID'].isin(ratindx)] # the final dataset to use

In [19]:
new_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8


In [20]:
# merge ratings with books (use ISBN)
ratings_with_books = pd.merge(new_ratings, bookcpy, on='ISBN')
ratings_with_books.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Year-Of-Publication,Publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc


In [21]:
# check number of times a particular book is rated, create a column for it
number_of_ratings = ratings_with_books.groupby('Book-Title')['Book-Rating'].count().reset_index()
# rename Book-rating column
number_of_ratings.rename(columns={'Book-Rating':'Num_of_Ratings'},inplace=True)
number_of_ratings.head()

Unnamed: 0,Book-Title,Num_of_Ratings
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


In [22]:
# add the Num_of_Ratings column from number_of_ratings to ratings_with_books
book_ratings = ratings_with_books.merge(number_of_ratings, on='Book-Title')
book_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Year-Of-Publication,Publisher,Num_of_Ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
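The count-then-merge steps above attach a per-title rating count to every row. As an alternative sketch (not the notebook's approach), `groupby(...).transform('count')` produces the same column in one step without a second DataFrame. The toy rows below are invented for illustration.

```python
import pandas as pd

# toy stand-in for ratings_with_books
toy = pd.DataFrame({
    "User-ID": [1, 2, 3, 1],
    "Book-Title": ["1984", "1984", "1984", "4 Blondes"],
    "Book-Rating": [9, 8, 0, 0],
})

# transform("count") returns one value per original row, aligned to the row index
toy["Num_of_Ratings"] = toy.groupby("Book-Title")["Book-Rating"].transform("count")

print(toy["Num_of_Ratings"].tolist())  # [3, 3, 3, 1]
```

Either form works; the merge version keeps the intermediate table around for inspection, while `transform` avoids the extra join.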


In [23]:
# keep one rating per (user, title) pair so the pivot below has unique entries
book_ratings.drop_duplicates(['User-ID', 'Book-Title'], inplace=True)
book_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Year-Of-Publication,Publisher,Num_of_Ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64


In [24]:
# select all books with at least 50 ratings
book_ratings = book_ratings[book_ratings['Num_of_Ratings'] >= 50]
book_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Year-Of-Publication,Publisher,Num_of_Ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,1994,John Wiley & Sons Inc,64


In [25]:
# generate a book-by-user interaction matrix (pivot table): rows are titles, columns are users
user_item_matrix = book_ratings.pivot(columns='User-ID', index='Book-Title', values='Book-Rating')

In [26]:
user_item_matrix.head()

Book-Title \ User-ID,254,2276,3363,3757,4385,6251,6543,6575,7158,7346,...,270713,271284,273979,274004,274061,274301,274308,275970,277427,278418
1984,9.0,,,,,,,,,8.0,...,,,,,,,,0.0,,
1st to Die: A Novel,,,,,,,9.0,,0.0,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,0.0,,,,...,,,,,,,0.0,,,
4 Blondes,,,,,,0.0,,,,,...,,,,,,,,,,
A Bend in the Road,0.0,,,,,,,1.0,,,...,,,0.0,,,,,,,


In [27]:
# replace NaN with zeros
user_item_matrix.fillna(0, inplace=True)
user_item_matrix.head()

Book-Title \ User-ID,254,2276,3363,3757,4385,6251,6543,6575,7158,7346,...,270713,271284,273979,274004,274061,274301,274308,275970,277427,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
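`DataFrame.pivot` needs each (title, user) pair to appear at most once, which is why duplicates are dropped before the pivot. The sketch below, on invented rows, shows the failure and the fix. Note also that after `fillna(0)` a missing rating and a recorded rating of 0 look identical in the matrix.

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "Book-Title": ["1984", "1984", "4 Blondes"],  # user 1 rated "1984" twice
    "Book-Rating": [9, 7, 0],
})

try:
    toy.pivot(index="Book-Title", columns="User-ID", values="Book-Rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True  # duplicate (title, user) pairs cannot be reshaped

# drop_duplicates keeps the first rating for each (user, title) pair
deduped = toy.drop_duplicates(["User-ID", "Book-Title"])
matrix = deduped.pivot(
    index="Book-Title", columns="User-ID", values="Book-Rating"
).fillna(0)

print(pivot_failed)
print(matrix)
```

If averaging duplicate ratings is preferable to keeping the first one, `pivot_table(..., aggfunc='mean')` is a possible alternative.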


In [28]:
# convert df to a sparse matrix for memory efficiency
user_item_sparse_matrix = csr_matrix(user_item_matrix.values)
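To get a feel for why a sparse representation helps, the sketch below measures how many entries of a small invented matrix are actually non-zero. `csr_matrix` stores only those entries; `nnz` and `shape` are standard attributes of SciPy sparse matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# invented 3 x 4 rating matrix with mostly zero entries
dense = np.array([
    [9.0, 0.0, 0.0, 8.0],
    [0.0, 0.0, 9.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

# fraction of cells that hold a non-zero value
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

print(sparse.nnz)         # 4 stored non-zero entries
print(round(density, 3))  # 0.333
```

The same two lines applied to `user_item_sparse_matrix` would report the density of the real matrix.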

In [29]:
# fit a k-nearest-neighbors model on the book vectors (cosine distance compares rating patterns)
from sklearn.neighbors import NearestNeighbors
# initialize model
model = NearestNeighbors(algorithm='brute', metric='cosine')
model.fit(user_item_sparse_matrix)
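To illustrate what `metric='cosine'` measures, here is a minimal sketch on an invented 3 x 3 matrix. The helper `cosine_distance` is hypothetical and written out only to compare against the distances that `NearestNeighbors` returns.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# rows are hypothetical books, columns are hypothetical users
toy_matrix = np.array([
    [9.0, 0.0, 8.0],   # book 0
    [8.0, 0.0, 9.0],   # book 1: rated by the same users as book 0
    [0.0, 10.0, 0.0],  # book 2: rated by a different user
])

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

toy_model = NearestNeighbors(algorithm="brute", metric="cosine").fit(toy_matrix)
distances, indices = toy_model.kneighbors(toy_matrix[[0]], n_neighbors=3)

print(indices[0])    # the query book itself comes first, at distance ~0
print(distances[0])
print(cosine_distance(toy_matrix[0], toy_matrix[1]))
```

Because the query row is part of the fitted data, it is returned as its own nearest neighbor, which is why a recommendation function has to ask for one more neighbor than it intends to show.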

In [30]:
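# define the recommendation function
def recommend_books(book_name, n_recommendations=5):
    """Print the books whose rating vectors are closest to `book_name`."""
    # guard against titles that were filtered out or misspelled
    if book_name not in user_item_matrix.index:
        print(f"'{book_name}' is not in the rating matrix.")
        return

    # get the row position of the book and reshape its vector for the model
    book_id = user_item_matrix.index.get_loc(book_name)
    book_vector = user_item_matrix.iloc[book_id, :].values.reshape(1, -1)

    # ask for one extra neighbor: the query book is its own nearest neighbor
    distances, suggestions = model.kneighbors(book_vector, n_neighbors=n_recommendations + 1)

    print(f"You searched for '{book_name}'\n")
    print("We also recommend these books:\n")
    titles = [t for t in user_item_matrix.index[suggestions[0]] if t != book_name]
    for title in titles[:n_recommendations]:
        print(title)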

In [31]:
recommend_books("2nd Chance")

You searched for '2nd Chance'

We also recommend these books:

The Next Accident
Violets Are Blue
The Blue Nowhere : A Novel
Four Blind Mice
1st to Die: A Novel


In [32]:
recommend_books('A Bend in the Road')

You searched for 'A Bend in the Road'

We also recommend these books:

A Walk to Remember
The Last Time They Met : A Novel
Angels
Blue Diary
Good in Bed


In [33]:
recommend_books('1st to Die: A Novel')

You searched for '1st to Die: A Novel'

We also recommend these books:

Pop Goes the Weasel
Violets Are Blue
Two for the Dough
2nd Chance
On the Street Where You Live
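The synopsis asks for user-based collaborative filtering, while the model above compares book vectors to each other. As a rough sketch of the user-based variant (an assumption about how it could be done, not part of the original notebook), the matrix can be transposed so that rows are users, similar users found with `cosine_similarity`, and unrated books scored by those neighbors' ratings. The function name `recommend_for_user`, the toy data, and the choice to treat 0 as "not rated" are all illustrative.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# toy book x user matrix in the same orientation as user_item_matrix
toy_matrix = pd.DataFrame(
    [[9, 8, 0],
     [0, 7, 0],
     [8, 9, 0],
     [0, 0, 10]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103],
    dtype=float,
)

def recommend_for_user(matrix, user_id, n_neighbors=1, n_books=3):
    """Rank a user's unrated titles by similarity-weighted neighbor ratings."""
    user_vectors = matrix.T  # rows become users
    sims = pd.DataFrame(
        cosine_similarity(user_vectors),
        index=user_vectors.index,
        columns=user_vectors.index,
    )
    # most similar other users
    neighbors = sims[user_id].drop(user_id).nlargest(n_neighbors)
    # weight each neighbor's ratings by their similarity to the target user
    scores = user_vectors.loc[neighbors.index].mul(neighbors, axis=0).sum(axis=0)
    unrated = user_vectors.loc[user_id] == 0  # assumption: 0 means "not rated"
    ranked = scores[unrated].sort_values(ascending=False)
    return ranked[ranked > 0].head(n_books).index.tolist()

print(recommend_for_user(toy_matrix, 101))  # ['Book B']
```

User 102 rates Book A and Book C much like user 101 does, so the one book 102 has rated that 101 has not, Book B, is suggested.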
