

This is my solution to the freeCodeCamp Machine Learning certification project "Book Recommendation Engine using KNN".

I took the liberty of modifying a few things (such as some variable names) to make the project clearer to read.

The first cell imports everything we need to build a K-Nearest Neighbors (KNN) model that suggests books to read based on a chosen title.

In [None]:
# 1

import numpy
import pandas

from sklearn.neighbors import NearestNeighbors

The second cell downloads our data: one file containing the list of books used for the project, and another containing the users' ratings of those same books.

In [None]:
# 2

!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2024-05-18 16:44:01--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip’


2024-05-18 16:44:02 (84.4 MB/s) - ‘book-crossings.zip’ saved [26085508/26085508]

Archive:  book-crossings.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


The third cell turns the imported data into two dataframes and makes a copy of each.

In [None]:
# 3

book_list = pandas.read_csv(
    books_filename,
    encoding = 'ISO-8859-1',
    sep = ';',
    header = 0,
    names = ['ISBN', 'Title', 'Author'],
    usecols = ['ISBN', 'Title', 'Author'],
    dtype = {'ISBN': 'str', 'Title': 'str', 'Author': 'str'})

book_list_for_processing = book_list.copy()

rating_list = pandas.read_csv(
    ratings_filename,
    encoding = 'ISO-8859-1',
    sep = ';',
    header = 0,
    names = ['User', 'ISBN', 'Rating'],
    usecols = ['User', 'ISBN', 'Rating'],
    dtype = {'User': 'int32', 'ISBN': 'str', 'Rating': 'float32'})

rating_list_for_processing = rating_list.copy()

*Here are samples of each dataframe:*

In [None]:
book_list

Unnamed: 0,ISBN,Title,Author
0,0195153448,Classical Mythology,Mark P. O. Morford
1,0002005018,Clara Callan,Richard Bruce Wright
2,0060973129,Decision in Normandy,Carlo D'Este
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,0393045218,The Mummies of Urumchi,E. J. W. Barber
...,...,...,...
271374,0440400988,There's a Bat in Bunk Five,Paula Danziger
271375,0525447644,From One to One Hundred,Teri Sloat
271376,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker
271377,0192126040,Republic (World's Classics),Plato


In [None]:
rating_list

Unnamed: 0,User,ISBN,Rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0
...,...,...,...
1149775,276704,1563526298,9.0
1149776,276706,0679447156,0.0
1149777,276709,0515107662,10.0
1149778,276721,0590442449,10.0
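
To make the `read_csv` options from cell 3 easier to follow, here is a minimal sketch on a tiny in-memory file. The header names and the two data rows are made up for illustration and are not taken from the real dataset. It shows that `header = 0` together with `names` replaces the file's own header row, and that reading `ISBN` as `str` keeps leading zeros and the trailing `X`.

```python
import io
import pandas

# Hypothetical miniature of a ';'-separated ratings file (header names are invented).
raw_csv = io.StringIO(
    'user_id;isbn;book_rating\n'
    '276725;034545104X;0\n'
    '276726;0155061224;5\n')

toy_rating_list = pandas.read_csv(
    raw_csv,
    sep = ';',
    header = 0,                             # the first line is a header row...
    names = ['User', 'ISBN', 'Rating'],     # ...which we replace with our own names
    usecols = ['User', 'ISBN', 'Rating'],
    dtype = {'User': 'int32', 'ISBN': 'str', 'Rating': 'float32'})

print(toy_rating_list)
print(toy_rating_list.dtypes)
```

Had `ISBN` been read as a number, the first value would fail to parse because of the `X` and the second would lose its leading zero, which would break the merge on `ISBN` later on.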


We start cleaning the data in cell 4 by removing from the rating list all users and books that fall outside the project parameters: users with fewer than 200 ratings and books with fewer than 100 ratings.

In [None]:
# 4

number_of_reviews_per_user = rating_list_for_processing['User'].value_counts()
number_of_reviews_per_book = rating_list_for_processing['ISBN'].value_counts()

users_to_remove = number_of_reviews_per_user[number_of_reviews_per_user < 200]
books_to_remove = number_of_reviews_per_book[number_of_reviews_per_book < 100]

updated_rating_list_for_processing = rating_list_for_processing[
    ~ rating_list_for_processing['User'].isin(users_to_remove.index) &
    ~ rating_list_for_processing['ISBN'].isin(books_to_remove.index)]
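
The filtering pattern in cell 4 can be sketched on a toy rating list. The data below is invented, and the thresholds are lowered to 2 (instead of 200 and 100) so that both filters actually remove something.

```python
import pandas

# Hypothetical miniature rating list, for illustration only.
toy_ratings = pandas.DataFrame({
    'User':   [1, 1, 2, 3, 3, 3],
    'ISBN':   ['A', 'B', 'A', 'A', 'B', 'C'],
    'Rating': [5.0, 0.0, 7.0, 9.0, 3.0, 8.0]})

toy_reviews_per_user = toy_ratings['User'].value_counts()
toy_reviews_per_book = toy_ratings['ISBN'].value_counts()

# Toy thresholds: fewer than 2 ratings means "remove".
toy_users_to_remove = toy_reviews_per_user[toy_reviews_per_user < 2]   # user 2
toy_books_to_remove = toy_reviews_per_book[toy_reviews_per_book < 2]   # book 'C'

# Keep only the rows whose user AND book both pass the thresholds.
toy_kept = toy_ratings[
    ~ toy_ratings['User'].isin(toy_users_to_remove.index) &
    ~ toy_ratings['ISBN'].isin(toy_books_to_remove.index)]

print(toy_kept)
```

Note that both counts are taken from the unfiltered list, so a book that survives the filter can end up with fewer ratings than the threshold once the sparse users are gone.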

After removing irrelevant data from the rating list, cell 5 merges it with the book list and drops any repeated rating of the same title by the same user, which completes the cleaning part of the project.

In [None]:
# 5

book_list_with_ratings = pandas.merge(book_list_for_processing, updated_rating_list_for_processing, on = 'ISBN')

book_list_with_ratings.drop_duplicates(['Title', 'User'], inplace = True)

*This is what the updated list looks like:*

In [None]:
book_list_with_ratings

Unnamed: 0,ISBN,Title,Author,User,Rating
0,0440234743,The Testament,John Grisham,277478,0.0
1,0440234743,The Testament,John Grisham,2977,0.0
2,0440234743,The Testament,John Grisham,3363,0.0
3,0440234743,The Testament,John Grisham,7346,9.0
4,0440234743,The Testament,John Grisham,9856,0.0
...,...,...,...,...,...
49512,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,236283,0.0
49513,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,251613,0.0
49514,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,252071,0.0
49515,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,256407,0.0
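
Here is a small sketch of what the merge and the `drop_duplicates` call in cell 5 do, using invented books and ratings. Two ISBNs share one title (think of two editions of the same book), and one rating points at an ISBN that is missing from the book list.

```python
import pandas

# Hypothetical data: ISBNs '111' and '222' are two editions of the same title.
toy_books = pandas.DataFrame({
    'ISBN':   ['111', '222', '333'],
    'Title':  ['Same Title', 'Same Title', 'Other Title'],
    'Author': ['A. Author', 'A. Author', 'B. Author']})

toy_ratings = pandas.DataFrame({
    'User':   [1, 1, 2, 2],
    'ISBN':   ['111', '222', '333', '999'],
    'Rating': [8.0, 6.0, 7.0, 5.0]})

# Default inner join: the rating for the unknown ISBN '999' disappears.
toy_merged = pandas.merge(toy_books, toy_ratings, on = 'ISBN')

# User 1 rated both editions of 'Same Title'; only the first row is kept.
toy_deduplicated = toy_merged.drop_duplicates(['Title', 'User'])

print(toy_deduplicated)
```

Dropping duplicates on `['Title', 'User']` matters for the next cells: `pivot` raises a `ValueError` when the same (index, column) pair appears more than once.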


The cleaned-up data can be displayed as a spreadsheet, which is what cell 6 does.

I personally prefer displaying the book titles horizontally, hence the way the spreadsheet is formatted. However, since the KNN needs one row (and therefore one index) per book, we will display the book titles vertically when building it.

In [None]:
# 6

spreadsheet = book_list_with_ratings.pivot(index = 'User', columns = 'Title', values = 'Rating').fillna('-')

*The spreadsheet:*

In [None]:
spreadsheet

Title,1984,1st to Die: A Novel,2nd Chance,4 Blondes,A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash,A Bend in the Road,A Case of Need,"A Child Called \It\"": One Child's Courage to Survive""",A Civil Action,A Confederacy of Dunces (Evergreen Book),...,Wicked: The Life and Times of the Wicked Witch of the West,Wifey,Wild Animus,Winter Moon,Wish You Well,Without Remorse,Year of Wonders,You Belong To Me,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,"\O\"" Is for Outlaw"""
User,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
254,9.0,-,-,-,-,-,-,-,-,-,...,-,-,-,-,-,-,-,-,-,-
2276,-,-,10.0,-,-,-,-,-,-,-,...,-,-,-,-,-,-,-,-,-,-
2766,-,-,-,-,-,7.0,0.0,-,-,-,...,-,-,6.0,-,-,-,-,-,-,-
2977,-,-,-,-,-,-,-,-,-,-,...,-,-,0.0,-,-,-,7.0,-,-,-
3363,-,-,-,-,-,-,-,-,0.0,-,...,0.0,-,0.0,-,-,-,-,-,0.0,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275970,0.0,-,-,-,-,-,-,-,-,-,...,-,-,-,-,-,-,0.0,-,0.0,-
277427,-,-,-,-,-,-,-,-,-,-,...,-,-,0.0,-,-,-,-,-,-,-
277478,-,-,-,-,-,-,-,-,-,-,...,-,-,0.0,-,-,-,-,-,-,-
277639,-,-,0.0,-,-,-,-,-,-,-,...,-,-,-,-,-,-,-,-,-,-
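
The two orientations mentioned above (titles as columns for reading, titles as rows for the KNN) can be compared on a toy table. The three ratings below are invented for illustration.

```python
import pandas

# Hypothetical ratings: user 2 has not rated 'Book B'.
toy_list = pandas.DataFrame({
    'Title':  ['Book A', 'Book A', 'Book B'],
    'User':   [1, 2, 1],
    'Rating': [9.0, 0.0, 7.0]})

# Titles as columns, one row per user (the spreadsheet layout).
toy_by_user = toy_list.pivot(index = 'User', columns = 'Title', values = 'Rating')

# Titles as rows, one column per user (the layout the KNN is fitted on).
toy_by_title = toy_list.pivot(index = 'Title', columns = 'User', values = 'Rating').fillna(0)

print(toy_by_user)
print(toy_by_title)
```

Before any filling, a missing rating shows up as `NaN`. Once the gaps are filled with 0, a book a user never rated looks the same as a book the user explicitly rated 0, as with users 1 and 2 on 'Book A' and 'Book B' here.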


Cell 7 is where we build the KNN model and fit it to our data.

In [None]:
# 7

formated_book_list_with_ratings = book_list_with_ratings.pivot(index = 'Title', columns = 'User', values = 'Rating').fillna(0)

formated_book_list_with_ratings_as_nparray = formated_book_list_with_ratings.to_numpy()

model = NearestNeighbors(metric = 'cosine',  n_neighbors = 6)

model.fit(formated_book_list_with_ratings_as_nparray)
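
To get a feel for what the cosine metric measures, here is a minimal sketch with three made-up "books" rated by three users. Cosine distance is 1 minus the cosine of the angle between two rating vectors, so it compares who rated the books rather than how many ratings there are.

```python
import numpy
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating matrix: one row per book, one column per user.
toy_matrix = numpy.array([
    [10.0, 0.0, 8.0],    # book 0
    [10.0, 0.0, 0.0],    # book 1: shares one reader with book 0
    [ 0.0, 9.0, 0.0]])   # book 2: no reader in common with book 0

toy_model = NearestNeighbors(metric = 'cosine', n_neighbors = 3)
toy_model.fit(toy_matrix)

# Query with book 0 itself, reshaped to a single-row 2D array as in cell 8.
toy_distances, toy_indexes = toy_model.kneighbors(toy_matrix[0].reshape(1, -1))

print(toy_indexes.flatten())
print(toy_distances.flatten())
```

The neighbors come back sorted from nearest to farthest. Because the queried book is part of the fitted data, it is returned as its own nearest neighbor with a distance of (approximately) 0, and a book with no reader in common sits at distance 1.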

Cell 8 is the function we'll use to pick a book and then find its nearest neighbors.

In [None]:
# 8

def get_recommends(chosen_book = ''):

  chosen_book_and_recommended_books = [chosen_book, []]

  # Look up the row position of the chosen book in the title index.
  # get_loc raises a KeyError if the title is not in the cleaned-up data.

  chosen_book_index = formated_book_list_with_ratings.index.get_loc(chosen_book)

  # Once we have the index of the chosen book, we can use the KNN to find its nearest neighbors.

  distances, indexes = model.kneighbors(formated_book_list_with_ratings.iloc[chosen_book_index].values.reshape(1, -1))

  recommended_books_indexes = indexes.flatten()
  recommended_books_distances = distances.flatten()

  # The KNN gives us 6 neighbors.
  # We add them together as lists [title, distance] to the list (B) inside the 'chosen_book_and_recommended_books' list (A).
  # We do so by iterating over either the distances or indexes list.

  for index in range(2, len(recommended_books_indexes)):
    chosen_book_and_recommended_books[1].insert(   # list A, with the chosen book at index 0 and list B index 1
        0,                                         # each new [title, distance] list is added at index 0, more on that below *
        [formated_book_list_with_ratings.index[recommended_books_indexes[index]], # the book title
        recommended_books_distances[index]])       # the distance to the chosen book

  # * The test module in cell 9 lists the recommended books from farthest to nearest.
  # This is why each new book is inserted at index 0: the last neighbor ends up first and the first ends up last.
  # To match that expected list of 4 books, the KNN needs to find 6 neighbors and we start iterating at position 2,
  # skipping position 0 (normally the chosen book itself, at distance 0) and position 1.

  return chosen_book_and_recommended_books
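
As an aside, the reversed list built by the loop in cell 8 can also be sketched with a slice. The titles and distances below are placeholders, not real model output. The sketch checks that inserting at index 0 while iterating from position 2 gives the same result as walking the neighbors backwards with `[:1:-1]`.

```python
# Hypothetical stand-ins for the title index and the flattened kneighbors output.
toy_titles = ['chosen', 'n1', 'n2', 'n3', 'n4', 'n5']
toy_neighbor_indexes = [0, 1, 2, 3, 4, 5]
toy_neighbor_distances = [0.0, 0.70, 0.76, 0.77, 0.78, 0.80]

# Same pattern as cell 8: start at position 2 and insert each pair at the front.
looped = []
for position in range(2, len(toy_neighbor_indexes)):
    looped.insert(0, [toy_titles[toy_neighbor_indexes[position]], toy_neighbor_distances[position]])

# Alternative: the slice [:1:-1] walks from the last element back down to position 2.
sliced = [
    [toy_titles[index], distance]
    for index, distance in zip(toy_neighbor_indexes[:1:-1], toy_neighbor_distances[:1:-1])]

print(looped)
print(looped == sliced)
```

Both versions keep 4 of the 6 neighbors, ordered from farthest to nearest. The loop is the one used in this notebook; the slice is only shown as an equivalent way to read it.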

Cell 9 is the final cell with the test module provided by freeCodeCamp.

In [None]:
# 9

books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")

print(books)

###############################################################################

def test_book_recommendation():

  test_pass = True

  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")

  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False

  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]

  for i in range(2):
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False

  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075]]]
You passed the challenge! 🎉🎉🎉🎉🎉
