#Book Recommendation System

##### Project Objective: build a recommendation system in which the user enters a book title and the system returns the 10 books most similar to it.


#####Dataset Link: https://www.kaggle.com/datasets/imtkaggleteam/book-recommendation-good-book-api

In [1]:
# Installs
!pip install -q -U watermark
!pip install -q -U nltk

In [2]:
# Imports
import ast
import nltk
import sklearn
import pandas as pd
import numpy as np
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.options.mode.chained_assignment = None

##1. Loading and Understanding Data

In [3]:
# Loading the dataset
movies_df = pd.read_csv('books_data.csv')

In [4]:
# First records
movies_df.head(5)

Unnamed: 0.1,Unnamed: 0,Book,Author,Description,Genres,Avg_Rating,Num_Ratings,URL
0,0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",4.27,5691311,https://www.goodreads.com/book/show/2657.To_Ki...
1,1,Harry Potter and the Philosopher’s Stone (Harr...,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',...",4.47,9278135,https://www.goodreads.com/book/show/72193.Harr...
2,2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical...",4.28,3944155,https://www.goodreads.com/book/show/1885.Pride...
3,3,The Diary of a Young Girl,Anne Frank,Discovered in the attic in which she spent the...,"['Classics', 'Nonfiction', 'History', 'Biograp...",4.18,3488438,https://www.goodreads.com/book/show/48855.The_...
4,4,Animal Farm,George Orwell,Librarian's note: There is an Alternate Cover ...,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",3.98,3575172,https://www.goodreads.com/book/show/170448.Ani...


In [5]:
# Shape
movies_df.shape

(10000, 8)

In [6]:
movies_df.columns

Index(['Unnamed: 0', 'Book', 'Author', 'Description', 'Genres', 'Avg_Rating',
       'Num_Ratings', 'URL'],
      dtype='object')

In [7]:
# Filtering columns
# The columns 'Unnamed: 0', 'Avg_Rating', 'Num_Ratings' and 'URL' are not important for the
# analysis, so we will remove them
movies_df = movies_df.drop(['Unnamed: 0', 'Avg_Rating', 'Num_Ratings', 'URL'], axis=1)

In [8]:
movies_df.head(5)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ..."
1,Harry Potter and the Philosopher’s Stone (Harr...,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',..."
2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical..."
3,The Diary of a Young Girl,Anne Frank,Discovered in the attic in which she spent the...,"['Classics', 'Nonfiction', 'History', 'Biograp..."
4,Animal Farm,George Orwell,Librarian's note: There is an Alternate Cover ...,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',..."


In [9]:
# Function for checking for missing values in the DataFrame passed as an argument
def missing_values(df):
  missing = df.isnull().sum()
  percent = ((missing / len(df)) * 100).map('{:.2f}%'.format)
  return pd.DataFrame({'Missing Values': missing,
                       'Percentage': percent})

In [10]:
missing_values(movies_df)

Unnamed: 0,Missing Values,Percentage
Book,0,0.00%
Author,0,0.00%
Description,77,0.77%
Genres,0,0.00%


In [11]:
# Checking which books have no description

# Display all rows and columns when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Print
print(movies_df['Book'][movies_df['Description'].isnull()])

1066    A Decade of Desire: Erotic Tales from the Char...
1224                     Complicated Moonlight (DCYE, #2)
1745                                  Mad Love (DCYE, #3)
2104    The Spirit of Prayer: The Believer's Authority...
2270    Lift: Five Practices Great Managers Do Consist...
2279    Glucose Control Eating: Lose Weight Stay Slimm...
2362                                      SHADOW PANTHEON
2728                 Counter Identity (Remmich/Miller #2)
2845    Dear Su Yen: A Young Woman From Taiwan Discove...
2850    Sex: How to Get More of It: A guy's roadmap to...
2863    Journey to the West: A Long March from Eastern...
3214                                  The Chinaberry Tree
3334                            Packfire (Simon Pack, #9)
3347    When the Children Fight Back (Children of the ...
3512                         The Zombie Wizards of Ala-ka
3553                                                 Guts
3721            Home Made Pirates - A Story from the Seas
3749          

In [12]:
# We don't want to lose these books by deleting the rows with missing values, so we
# will replace the descriptions that are null with "No description"
movies_df['Description'] = movies_df['Description'].fillna('No description')

In [13]:
missing_values(movies_df)

Unnamed: 0,Missing Values,Percentage
Book,0,0.00%
Author,0,0.00%
Description,0,0.00%
Genres,0,0.00%


In [14]:
# Checking for duplicate rows
movies_df.duplicated().sum()

29

In [15]:
# Showing the rows that are duplicated
movies_df[movies_df.duplicated()]

Unnamed: 0,Book,Author,Description,Genres
1345,For the Love of Armin,Michael G. Kramer,"In September of the year 9 A.D., the young Ger...","['Fiction', 'Adventure', 'Historical Fiction',..."
1929,The Last Lecture,Randy Pausch,A lot of professors give talks titled 'The Las...,"['Nonfiction', 'Memoir', 'Biography', 'Self He..."
2030,Room,Emma Donoghue,"To five-year-old-Jack, Room is the world....To...","['Fiction', 'Contemporary', 'Adult', 'Thriller..."
2183,"Missing Wings (Aranysargas, #1)",Andrea Luhman,Born with an ability the Veilede people of Mad...,"['Fantasy', 'Fantasy Romance', 'Romance', 'Hig..."
2649,"Females of Valor (The Viking's Kurdish Love, #2)",Widad Akreyi,"Love in crisis. When life gives you lemons, wh...",[]
2655,Zoroastrians' Fight for Survival (The Viking's...,Widad Akreyi,An epic tale of romance and reminiscence. A me...,['Historical Fiction']
3033,"Hometown Girl After All (Hometown, #2)",Kirsten Fullmer,Julia lost everything while she was ill. Self-...,"['Contemporary', 'Young Adult', 'New Adult', '..."
3459,"The Threat Below (Brathius History, #1)",Jason Latshaw,"Three hundred years ago, something terrifying ...","['Fantasy', 'Young Adult', 'Dystopia', 'Fictio..."
3461,Nothing to Envy: Ordinary Lives in North Korea,Barbara Demick,Nothing to Envy follows the lives of six North...,"['Nonfiction', 'History', 'Politics', 'Asia', ..."
3946,Between the World and Me,Ta-Nehisi Coates,"“This is your country, this is your world, thi...","['Nonfiction', 'Memoir', 'Race', 'Audiobook', ..."


In [16]:
# Now we will delete the duplicate rows
movies_df.drop_duplicates(inplace=True)

In [17]:
movies_df.duplicated().sum()

0
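One detail worth keeping in mind after this step: `drop_duplicates` keeps the original index labels, so the labels no longer match row positions. This matters whenever a label is later used to index a NumPy array by position. The sketch below uses a made-up four-row DataFrame (`toy` and its values are assumptions for illustration only) to show the gap and one way to close it.

```python
import pandas as pd

# Toy frame with one duplicated row (made-up data)
toy = pd.DataFrame({'Book': ['A', 'B', 'B', 'C']})

deduped = toy.drop_duplicates()
print(list(deduped.index))            # [0, 1, 3]  (label 2 is gone)

# 'C' now sits at position 2, but its label is still 3
label = deduped[deduped['Book'] == 'C'].index[0]
position = deduped.index.get_loc(label)
print(label, position)                # 3 2

# reset_index(drop=True) makes labels and positions agree again
reset = deduped.reset_index(drop=True)
print(list(reset.index))              # [0, 1, 2]
```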

In [18]:
# Now we will remove any text in parentheses from the Book column (for example, series
# information), as it does not add important information.
movies_df['Book'] = movies_df['Book'].str.replace(r'\s*\(.*\)', '', regex=True)
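To see what this pattern does on a single string, here is a minimal sketch using the standard `re` module, which follows the same matching rules. The first title comes from the dataset output above. The second title and the narrower pattern `pattern_narrow` are assumptions added for illustration, showing that the greedy `.*` removes everything between the first `(` and the last `)`.

```python
import re

pattern_greedy = r'\s*\(.*\)'      # pattern used in the cell above
pattern_narrow = r'\s*\([^)]*\)'   # hypothetical alternative: one group at a time

title = 'Mad Love (DCYE, #3)'
clean_title = re.sub(pattern_greedy, '', title)
print(clean_title)                 # Mad Love

tricky = 'Part (One) of the Story (Series, #2)'  # made-up title
greedy_result = re.sub(pattern_greedy, '', tricky)
narrow_result = re.sub(pattern_narrow, '', tricky)
print(greedy_result)               # Part
print(narrow_result)               # Part of the Story
```

For titles with a single trailing parenthetical both patterns behave the same, so the choice only matters if a title contains more than one pair of parentheses.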

In [19]:
movies_df.head(3)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ..."
1,Harry Potter and the Philosopher’s Stone,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',..."
2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical..."


##2. Text Processing with Abstract Syntax Trees

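#####The ast module parses Python source code into abstract syntax trees. Here we only need ast.literal_eval, which safely evaluates a string containing a Python literal, to convert the Genres column from strings such as "['Classics', 'Fiction']" back into real lists.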

In [20]:
movies_df.head(1)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ..."


In [21]:
# Converting strings to lists using literal_eval
movies_df['Genres'] = movies_df['Genres'].apply(ast.literal_eval)
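As a small illustration of what `ast.literal_eval` does to each cell, the sketch below parses one genre string. The sample string `raw_genres` is made up to mimic the CSV content; only the function call mirrors the cell above.

```python
import ast

# Hypothetical sample mimicking how a genre list is stored as text in the CSV
raw_genres = "['Classics', 'Fiction', 'Historical Fiction']"

# literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code
genres = ast.literal_eval(raw_genres)

print(type(genres).__name__)  # list
print(genres[0])              # Classics
```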

In [22]:
movies_df.head(1)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"[Classics, Fiction, Historical Fiction, School..."


In [23]:
# Initializing an empty list
new_columns = []

# Loop to iterate through each description and add the list of words to the new list
for description in movies_df['Description']:
  words_list = description.split() # Splitting the description into a list of words
  new_columns.append(words_list) # Adding the word list to the new column

# Replacing the original column with the new list of lists
movies_df['Description'] = new_columns

In [24]:
# Initializing an empty list
new_column = []

# Loop to wrap each author name in a single-element list
for author in movies_df['Author']:
  new_column.append([author])

movies_df['Author'] = new_column

In [25]:
movies_df.head(1)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,[Harper Lee],"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, Historical Fiction, School..."


##3. Data Cleaning

In [26]:
# Removing spaces inside each item so that multi-word names become a single token (e.g. 'Harper Lee' -> 'HarperLee')
movies_df['Author'] = movies_df['Author'].apply(lambda x:[i.replace(' ', '') for i in x])
movies_df['Genres'] = movies_df['Genres'].apply(lambda x:[i.replace(' ', '') for i in x])
movies_df['Description'] = movies_df['Description'].apply(lambda x:[i.replace(' ', '') for i in x])

In [27]:
# Removing dots from author names
movies_df['Author'] = movies_df['Author'].apply(lambda x:[i.replace('.', '') for i in x])

In [28]:
movies_df.head(1)

Unnamed: 0,Book,Author,Description,Genres
0,To Kill a Mockingbird,[HarperLee],"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, HistoricalFiction, School,..."


##4. Preparing DataFrame for Vectorization

In [29]:
# Create the Tags column by concatenating the Author, Genres and Description lists of each book
movies_df['Tags'] = movies_df['Author'] + \
                    movies_df['Genres'] + \
                    movies_df['Description']

In [30]:
movies_df['Tags'].head()

Unnamed: 0,Tags
0,"[HarperLee, Classics, Fiction, HistoricalFicti..."
1,"[JKRowling, Fantasy, Fiction, YoungAdult, Magi..."
2,"[JaneAusten, Classics, Fiction, Romance, Histo..."
3,"[AnneFrank, Classics, Nonfiction, History, Bio..."
4,"[GeorgeOrwell, Classics, Fiction, Dystopia, Fa..."


In [31]:
# Adding an ID column
movies_df['Book_id'] = movies_df.index

In [32]:
# Creating the final dataframe
final_movies_df = pd.DataFrame({'book_id': movies_df['Book_id'],
                                'book': movies_df['Book'],
                                'tags': movies_df['Tags']})

In [33]:
final_movies_df.head()

Unnamed: 0,book_id,book,tags
0,0,To Kill a Mockingbird,"[HarperLee, Classics, Fiction, HistoricalFicti..."
1,1,Harry Potter and the Philosopher’s Stone,"[JKRowling, Fantasy, Fiction, YoungAdult, Magi..."
2,2,Pride and Prejudice,"[JaneAusten, Classics, Fiction, Romance, Histo..."
3,3,The Diary of a Young Girl,"[AnneFrank, Classics, Nonfiction, History, Bio..."
4,4,Animal Farm,"[GeorgeOrwell, Classics, Fiction, Dystopia, Fa..."


In [34]:
# Join each list of tags into a single space-separated string
final_movies_df['tags'] = final_movies_df['tags'].apply(lambda x: ' '.join(x))

In [35]:
# Put everything in lowercase to avoid upper/lower case differences
final_movies_df['tags'] = final_movies_df['tags'].apply(lambda x: x.lower())

In [36]:
final_movies_df.head()

Unnamed: 0,book_id,book,tags
0,0,To Kill a Mockingbird,harperlee classics fiction historicalfiction s...
1,1,Harry Potter and the Philosopher’s Stone,jkrowling fantasy fiction youngadult magic chi...
2,2,Pride and Prejudice,janeausten classics fiction romance historical...
3,3,The Diary of a Young Girl,annefrank classics nonfiction history biograph...
4,4,Animal Farm,georgeorwell classics fiction dystopia fantasy...


##5. Parse and Vectorization

##### Stemming is the process of reducing a word to its stem, the base form to which suffixes and prefixes attach. Unlike a lemma (the dictionary form of a word), a stem does not have to be a real word. Stemming is an important preprocessing step in natural language understanding and natural language processing.
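A quick way to get a feel for stemming is to run the Porter stemmer on a few words. The sketch below uses three words that also appear in the tags of this dataset; the variable names are illustrative only.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the genre tags seen earlier in the notebook
words = ['classics', 'fantasy', 'romance']
stems = [stemmer.stem(w) for w in words]

print(stems)  # ['classic', 'fantasi', 'romanc']
```

Note that 'fantasi' and 'romanc' are not dictionary words. That is fine for this use case, because the stems only need to be consistent across books, not readable.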

In [37]:
# Create the parser (Porter stemmer)
parser_ps = PorterStemmer()

In [38]:
# Stemming function
def stem(text):

    # Creates an empty list called y to store the stemmed words.
    y = []

    # Splits the input string 'text' into words and iterates over them.
    for i in text.split():

        # Performs stemming on the current word 'i' and adds the result to the list y.
        y.append(parser_ps.stem(i))

    # Returns the processed words as a string, joining them with spaces.
    return " ".join(y)

In [39]:
final_movies_df['tags'] = final_movies_df['tags'].apply(stem)

In [40]:
final_movies_df.head(3)

Unnamed: 0,book_id,book,tags
0,0,To Kill a Mockingbird,harperle classic fiction historicalfict school...
1,1,Harry Potter and the Philosopher’s Stone,jkrowl fantasi fiction youngadult magic childr...
2,2,Pride and Prejudice,janeausten classic fiction romanc historicalfi...


#####Vectorization in the context of natural language processing (NLP) is the process of converting text into a numerical representation, typically in the form of vectors. This process is crucial for machine learning algorithms to work with textual data, as they require numerical input.
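The idea behind count-based vectorization can be sketched in a few lines of plain Python. This is a simplified illustration of the bag-of-words concept, not how scikit-learn implements it, and it leaves out stop-word removal and the feature limit. The two tag strings in `docs` are made up.

```python
from collections import Counter

docs = ['fantasi magic fiction', 'classic fiction romanc fiction']  # made-up tags

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One count vector per document, with one slot per vocabulary entry
toy_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    toy_vectors.append([counts[token] for token in vocabulary])

print(vocabulary)   # ['classic', 'fantasi', 'fiction', 'magic', 'romanc']
print(toy_vectors)  # [[0, 1, 1, 1, 0], [1, 0, 2, 0, 1]]
```

Each book becomes a row of counts, and books that share many tokens end up with vectors pointing in similar directions.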

In [41]:
# Create the vectorizer with a maximum of 15000 features
# stop_words='english' removes common English words, such as 'the', 'be', 'and', etc.
cv = CountVectorizer(max_features=15000, stop_words='english')

In [42]:
# Creating the vectors for the tags
vectors = cv.fit_transform(final_movies_df['tags']).toarray()

In [43]:
len(cv.get_feature_names_out())

15000

In [44]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#####For more information on how CountVectorizer works, see below.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

##6. Distance between Vectors

In [45]:
# Calculating the pairwise cosine similarity between the tag vectors (values closer to 1 mean more similar)
similarity = cosine_similarity(vectors)
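Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their magnitude. The sketch below computes it by hand with NumPy for three made-up count vectors; the helper name `cosine` is an assumption for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

book_a = np.array([0, 1, 1, 1, 0])
book_b = np.array([1, 0, 2, 0, 1])
book_c = np.array([0, 2, 2, 2, 0])  # same direction as book_a, twice the counts

sim_ab = cosine(book_a, book_b)
sim_ac = cosine(book_a, book_c)

print(round(sim_ab, 3))  # 0.471
print(round(sim_ac, 3))  # 1.0
```

Because only the direction counts, a long description that repeats the same words scores the same as a short one with the same word proportions.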

##7. Building the Recommendation System

In [46]:
# Function for the recommendation system
def recommendation_system(book):

    # Stop early if the title is not in the dataset
    if book not in final_movies_df['book'].values:
        print(f"The book '{book}' was not found in the database.")
        return

    # Row position of the first book with this title. We need the position rather
    # than the index label because the rows of `similarity` are positional, and the
    # labels have gaps after drop_duplicates()
    index = np.flatnonzero(final_movies_df['book'].values == book)[0]

    # Sort all books by their similarity to the chosen book, highest first
    distances = sorted(list(enumerate(similarity[index])), reverse = True, key = lambda x: x[1])

    # Skip the first entry (the book itself) and print the 10 most similar books
    for i in distances[1:11]:
        print(final_movies_df.iloc[i[0]]['book'])
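If the recommendations need to be reused rather than printed, the ranking step can be isolated in a small helper that returns row positions. The function `top_k_similar` and the 4x4 matrix below are hypothetical, shown as one possible sketch. It removes the query row explicitly instead of assuming it always sorts first, which could fail when another book has an identical tag vector.

```python
import numpy as np

def top_k_similar(similarity_matrix, position, k=5):
    """Return positions of the k rows most similar to `position`, excluding itself."""
    scores = similarity_matrix[position]
    order = np.argsort(scores)[::-1]      # highest score first
    order = order[order != position]      # drop the query book explicitly
    return order[:k].tolist()

# Made-up symmetric similarity matrix for four books
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.8, 0.3, 1.0],
])

print(top_k_similar(toy_similarity, position=0, k=2))  # [2, 3]
```

The returned positions can then be passed to `.iloc` on the DataFrame to look up titles.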


##8. Applying the Recommendation System

In [47]:
recommendation_system('The Hobbit')

The Hobbit
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings
The Hobbit, Part One
The Hunting of the Snark
Lad: A Dog
The Story of Ferdinand
The Last Battle
Unfinished Tales of Númenor and Middle-Earth
The Complete Adventures of Curious George
Worm Holes


In [48]:
recommendation_system('Harry Potter and the Philosopher’s Stone')

Harry Potter and the Goblet of Fire
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Chamber of Secrets
Harry Potter and the Cursed Child: Parts One and Two
Harry Potter and the Half-Blood Prince
Harry Potter and the Order of the Phoenix
The Magic Faraway Tree
Harry Potter and the Deathly Hallows
Harry Potter and the Order of the Phoenix
The Princess and the Goblin


In [49]:
recommendation_system('The Story of Ferdinand')

The Sword in the Stone
The Jungle
Oliver Twist
The Dark Diamonds
Where's Waldo?
The Poky Little Puppy
The Hunting of the Snark
The Story of Babar
Stellaluna
The Complete Tales


##9. System and Package Versions

In [50]:
%reload_ext watermark
%watermark -v -m
%watermark --iversions

Python implementation: CPython
Python version       : 3.11.11
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

numpy  : 1.26.4
sklearn: 1.6.1
nltk   : 3.9.1
pandas : 2.2.2

