
# Book recommendation engine using KNN

## Importing necessary packages

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

## Getting the datasets

In [2]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2023-05-15 06:11:18--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.3.33, 104.26.2.33, 172.67.70.149, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.3.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip.2’


2023-05-15 06:11:19 (165 MB/s) - ‘book-crossings.zip.2’ saved [26085508/26085508]

Archive:  book-crossings.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

## Loading the datasets into dataframes

In [3]:
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

Let's look at the `df_ratings` and `df_books` dataframes we have just loaded.

In [4]:
df_ratings.head()

|  | user | isbn | rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0.0 |
| 1 | 276726 | 0155061224 | 5.0 |
| 2 | 276727 | 0446520802 | 0.0 |
| 3 | 276729 | 052165615X | 3.0 |
| 4 | 276729 | 0521795028 | 6.0 |


In [5]:
df_books.head()

|  | isbn | title | author |
|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber |


## Preprocessing the data

Both `df_ratings` and `df_books` have an `isbn` column in common, so we will merge the two dataframes using `isbn` as the key. Using `how="right"` keeps every rating, even when its ISBN has no match in `df_books`.

In [6]:
df_train = df_books.merge(df_ratings, on='isbn',how="right")
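
As a minimal sketch of what the right merge does, the toy frames below (made-up values, not taken from the dataset) stand in for `df_books` and `df_ratings`. The names `books`, `ratings`, and `merged` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for df_books and df_ratings (hypothetical values).
books = pd.DataFrame({"isbn": ["A", "B"], "title": ["Book A", "Book B"]})
ratings = pd.DataFrame({
    "user": [1, 2, 3],
    "isbn": ["A", "A", "C"],
    "rating": [5.0, 0.0, 7.0],
})

# how="right" keeps every row of `ratings`; an ISBN that is missing
# from `books` (here "C") ends up with a NaN title.
merged = books.merge(ratings, on="isbn", how="right")
print(merged)
```

Rows whose title is NaN can still be counted per user, but they carry no title to recommend.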

Now we will keep only the books with at least 100 ratings and the users who have rated at least 200 books.

In [7]:
# Counting the rating by user and rating by isbn
ratingByUser = df_train['user'].value_counts()
ratingByISBN = df_train['isbn'].value_counts()

# Keeping users with at least 200 ratings and books with at least 100 ratings
df_train = df_train[df_train['user'].isin(ratingByUser[ratingByUser>=200].index)]
df_train = df_train[df_train['isbn'].isin(ratingByISBN[ratingByISBN>=100].index)]
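
The same counting-and-filtering pattern can be tried on a tiny frame. This is only a sketch: the thresholds `min_user_ratings` and `min_book_ratings` and the toy data are assumptions chosen so the effect is visible.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "isbn": ["A", "B", "C", "A", "B", "A"],
})

# Hypothetical thresholds, much smaller than the 200 / 100 used above.
min_user_ratings = 2
min_book_ratings = 2

# Count ratings per user and per book before any filtering.
user_counts = toy["user"].value_counts()
isbn_counts = toy["isbn"].value_counts()

active_users = user_counts[user_counts >= min_user_ratings].index
popular_books = isbn_counts[isbn_counts >= min_book_ratings].index

# Keep rows that satisfy both conditions.
filtered = toy[toy["user"].isin(active_users) & toy["isbn"].isin(popular_books)]
print(filtered)
```

Both counts are computed on the unfiltered frame, as in the cell above, so removing inactive users does not change which books count as popular.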

Our data may contain duplicate entries, so we remove them with the dataframe's `.drop_duplicates()` method, keeping one row per (`title`, `user`) pair.

In [8]:
df_train = df_train.drop_duplicates(['title','user'])
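
A small sketch of the effect, using made-up rows: when the same user has rated the same title twice, only the first row survives. This matters for the pivot step later in the notebook, because `pd.pivot_table` averages duplicate entries by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "user": [1, 1, 1],
    "rating": [8.0, 0.0, 5.0],
})

# Rows are compared on (title, user) only; the first occurrence is kept.
deduped = toy.drop_duplicates(["title", "user"])
print(deduped)
```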

## Creating Training set

The KNN model cannot take the dataframe in its current long format, so we first convert it into a 2D matrix.

We will create a 2D rating matrix with one row per book title and one column per user, filling missing ratings with 0.

In [9]:
df_train_2D_rating = pd.pivot_table(data=df_train, values='rating', index='title', columns='user').fillna(0)
df_train

|  | isbn | title | author | user | rating |
|---|---|---|---|---|---|
| 1456 | 002542730X | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 277427 | 10.0 |
| 1469 | 0060930535 | The Poisonwood Bible: A Novel | Barbara Kingsolver | 277427 | 0.0 |
| 1471 | 0060934417 | Bel Canto: A Novel | Ann Patchett | 277427 | 0.0 |
| 1474 | 0061009059 | One for the Money (Stephanie Plum Novels (Pape... | Janet Evanovich | 277427 | 9.0 |
| 1484 | 0140067477 | The Tao of Pooh | Benjamin Hoff | 277427 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 1147304 | 0804111359 | Secret History | DONNA TARTT | 275970 | 0.0 |
| 1147436 | 140003065X | A Fine Balance | Rohinton Mistry | 275970 | 0.0 |
| 1147439 | 1400031346 | The No. 1 Ladies' Detective Agency | Alexander McCall Smith | 275970 | 0.0 |
| 1147440 | 1400031354 | Tears of the Giraffe (No.1 Ladies Detective Ag... | Alexander McCall Smith | 275970 | 0.0 |
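
The cell above displays `df_train`, not the pivoted matrix. To see what the pivot itself produces, here is a minimal sketch on made-up ratings; the names `toy` and `matrix` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "user": [1, 2, 1],
    "rating": [8.0, 5.0, 10.0],
})

# One row per title, one column per user; a missing (title, user)
# combination becomes NaN and is then filled with 0.
matrix = pd.pivot_table(
    data=toy, values="rating", index="title", columns="user"
).fillna(0)
print(matrix)
```

User 2 never rated "Book B", so that cell is 0 after `fillna(0)`.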


We will convert the 2D rating matrix into a `csr_matrix` (Compressed Sparse Row matrix), which will be fed to the model.

In [10]:
df_train_csr_matrix = csr_matrix(df_train_2D_rating.values)
df_train_csr_matrix

<673x888 sparse matrix of type '<class 'numpy.float32'>'
	with 12425 stored elements in Compressed Sparse Row format>
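
The output above reports 12425 stored elements in a 673x888 matrix, which is roughly 2% of the entries. The sketch below, on a small made-up array, shows how such a density can be computed; `dense`, `sparse`, and `density` are names assumed for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[8.0, 0.0, 0.0],
                  [0.0, 0.0, 5.0]], dtype=np.float32)

# Only the non-zero entries are stored in the CSR representation.
sparse = csr_matrix(dense)

# Fraction of cells that actually hold a rating.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)
```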

## Training the model

We will use the `NearestNeighbors` model from scikit-learn.

In [11]:
model = NearestNeighbors(algorithm='auto', metric='cosine')

Now we fit the model on the `csr_matrix` generated from the dataframe.

In [12]:
model.fit(df_train_csr_matrix)
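
With `metric='cosine'`, the distance between two rating vectors is 1 minus the cosine of the angle between them, so vectors pointing in the same direction have distance 0 and vectors with no users in common have distance 1. The following is a sketch on made-up vectors; the helper `cosine_distance` and the array `ratings` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

ratings = np.array([
    [10.0, 0.0, 8.0],
    [5.0, 0.0, 4.0],   # same direction as row 0, half the magnitude
    [0.0, 9.0, 0.0],   # shares no rated column with row 0
])

def cosine_distance(a, b):
    """Hypothetical helper: 1 - cos(angle between a and b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

knn = NearestNeighbors(metric="cosine").fit(ratings)
distances, indices = knn.kneighbors(ratings[[0]], n_neighbors=3)

print(cosine_distance(ratings[0], ratings[1]))  # close to 0
print(cosine_distance(ratings[0], ratings[2]))  # 1.0
print(distances, indices)
```

Because only the direction matters, a book rated 10 and 8 by two users is treated as identical to one rated 5 and 4 by the same users.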

## Recommendation function

We will implement a recommendation function that takes a `book` title as its parameter and returns a list containing the title and its recommended books, each paired with its distance from the input book.

In [13]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):

    # Rating vector (one row of the title x user matrix) for the requested book
    book_ratings = df_train_2D_rating.loc[book]

    # Distances and indices of the 8 nearest neighbours;
    # position 0 is normally the requested book itself (distance ~0)
    distances, indices = model.kneighbors([book_ratings.values], n_neighbors=8, return_distance=True)

    recommended_books = []

    # Collecting neighbours 1 to 5 as [title, distance] pairs, farthest first
    for x in reversed(range(1,6)):
        bookrecommended = [df_train_2D_rating.index[indices.flatten()[x]], distances.flatten()[x]]
        recommended_books.append(bookrecommended)

    # Returning the data in the format expected by the testing function
    recommended_books = [book, recommended_books]

    return recommended_books

### Pre-Testing our function

In [14]:
get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")

[[5.9604645e-08 5.1784116e-01 5.3763384e-01 7.3450685e-01 7.4486566e-01
  7.9398352e-01 7.9571664e-01 8.0523217e-01]] [[567 610 599 251 617 100 276 508]]


['The Queen of the Damned (Vampire Chronicles (Paperback))',
 [['Catch 22', 0.7939835],
  ['The Witching Hour (Lives of the Mayfair Witches)', 0.74486566],
  ['Interview with the Vampire', 0.73450685],
  ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.53763384],
  ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.51784116]]]
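
To see how `reversed(range(1,6))` turns the raw neighbour arrays into the list above, the sketch below replays the selection on the distances and indices printed for this query. The variable `picked` is an illustrative name; the numbers are copied from the output.

```python
import numpy as np

# Values copied from the printed output for "The Queen of the Damned".
distances = np.array([[5.9604645e-08, 0.51784116, 0.53763384, 0.73450685,
                       0.74486566, 0.79398352, 0.79571664, 0.80523217]])
indices = np.array([[567, 610, 599, 251, 617, 100, 276, 508]])

# Position 0 is the query book itself, so positions 5, 4, 3, 2, 1
# are taken in that order: farthest of the five first, nearest last.
picked = [(int(indices.flatten()[x]), float(distances.flatten()[x]))
          for x in reversed(range(1, 6))]
print(picked)
```

Positions 6 and 7 are computed by `kneighbors` but never used, so requesting 6 neighbours would yield the same recommendations.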

## Testing by freeCodeCamp

In [15]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

[[0.         0.7234864  0.7677075  0.7699411  0.77085835 0.8016211
  0.802759   0.8060607 ]] [[654 539 240 597 614 243 481 627]]
["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7234864]]]
[[0.         0.7234864  0.7677075  0.7699411  0.77085835 0.8016211
  0.802759   0.8060607 ]] [[654 539 240 597 614 243 481 627]]
You passed the challenge! 🎉🎉🎉🎉🎉
