# Introduction 

In this notebook I show how to build a content-based recommendation system for books.
I use two similarity measures:
* Cosine Similarity
* Jaccard Similarity

If you are not familiar with these names, do not worry: I explain both in detail later, so follow the notebook without skipping anything.

# Import Necessary Libraries

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import json
import matplotlib.pyplot as plt
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Data Preparation and EDA

For the book recommendation system I use the Goodreads_BestBooksEver_1-10000 dataset, which contains data on 10,000 books.

In [2]:
df=pd.read_csv("/kaggle/input/goodreads-best-books-ever-with-recommendations/Goodreads_BestBooksEver_1-10000.csv")

In [3]:
df.shape

(10000, 12)

In [4]:
df.head()

Unnamed: 0,url,bookTitle,bookImage,bookAuthors,bookDesc,bookRating,ratingCount,reviewCount,bookPages,bookGenres,bookISBN,recommendations
0,https://www.goodreads.com/book/show/2767052-th...,The Hunger Games,https://i.gr-assets.com/images/S/compressed.ph...,Suzanne Collins,"Could you survive on your own in the wild, wit...",4.32,6717635,176054,374 pages,"Young Adult/31,498|Fiction/17,878|Science Fict...",9780439000000.0,"['Divergent (Divergent, #1)|https://www.goodre..."
1,https://www.goodreads.com/book/show/2.Harry_Po...,Harry Potter and the Order of the Phoenix,https://i.gr-assets.com/images/S/compressed.ph...,"J.K. Rowling,Mary GrandPré",There is a door at the end of a silent corrido...,4.5,2668409,45724,870 pages,"Fantasy/1,797|Young Adult/15,961|Fiction/14,15...",,['Harry Potter and the Cursed Child: Parts One...
2,https://www.goodreads.com/book/show/2657.To_Ki...,To Kill a Mockingbird,https://i.gr-assets.com/images/S/compressed.ph...,Harper Lee,The unforgettable novel of a childhood in a sl...,4.28,4772918,95595,324 pages,"Classics/47,203|Fiction/23,575|Historical-Hist...",,['The Great Gatsby|https://www.goodreads.com/b...
3,https://www.goodreads.com/book/show/1885.Pride...,Pride and Prejudice,https://i.gr-assets.com/images/S/compressed.ph...,"Jane Austen,Anna Quindlen",Alternate cover edition of ISBN 9780679783268S...,4.27,3206070,74020,279 pages,"Classics/52,699|Fiction/15,730|Romance/12,874|...",,['Jane Eyre|https://www.goodreads.com/book/sho...
4,https://www.goodreads.com/book/show/41865.Twil...,Twilight,https://i.gr-assets.com/images/S/compressed.ph...,Stephenie Meyer,About three things I was absolutely positive.F...,3.61,5231000,107619,501 pages,"Young Adult/19,982|Fantasy/19,312|Romance/12,0...",9780316000000.0,"['The Hunger Games (The Hunger Games, #1)|http..."


In [5]:
df.columns

Index(['url', 'bookTitle', 'bookImage', 'bookAuthors', 'bookDesc',
       'bookRating', 'ratingCount', 'reviewCount', 'bookPages', 'bookGenres',
       'bookISBN', 'recommendations'],
      dtype='object')

For this task we mainly need bookTitle and bookGenres (bookAuthors is kept as well). We recommend books that share genres with the input book, on the assumption that a reader will probably like other books from similar genres.

In [6]:
columns_to_drop=['url','bookImage','bookDesc','bookRating','ratingCount','reviewCount','bookPages','bookISBN','recommendations']

In [7]:
df.drop(columns=columns_to_drop, inplace=True) 
df.head()

Unnamed: 0,bookTitle,bookAuthors,bookGenres
0,The Hunger Games,Suzanne Collins,"Young Adult/31,498|Fiction/17,878|Science Fict..."
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling,Mary GrandPré","Fantasy/1,797|Young Adult/15,961|Fiction/14,15..."
2,To Kill a Mockingbird,Harper Lee,"Classics/47,203|Fiction/23,575|Historical-Hist..."
3,Pride and Prejudice,"Jane Austen,Anna Quindlen","Classics/52,699|Fiction/15,730|Romance/12,874|..."
4,Twilight,Stephenie Meyer,"Young Adult/19,982|Fantasy/19,312|Romance/12,0..."


In [8]:
df.isna().sum()

bookTitle        0
bookAuthors      0
bookGenres     100
dtype: int64

There are 100 null values in the bookGenres column. We simply drop those rows, because 100 out of 10,000 (1%) is a small loss.

In [9]:
df.dropna(inplace=True)

In [10]:
df.drop_duplicates(inplace=True)

In [11]:
df.shape

(9681, 3)

In [12]:
df['bookGenres'][0]

'Young Adult/31,498|Fiction/17,878|Science Fiction-Dystopia/16,665|Fantasy/14,057|Science Fiction/10,807|Romance/4,067|Adventure/3,496|Young Adult-Teen/1,906|Apocalyptic-Post Apocalyptic/1,658|Action/1,375'

As you can see, the genres of a book are packed into one string of the form Genre/number|Genre/number|..., so we need to extract the genre names and store them separately. I use a Python set to hold the genres of each book.

In [13]:
def extract_genres(input_string):
    genres_data = input_string.split('|')
    extracted_genres = set()
    for genre_entry in genres_data:
        genre_parts = genre_entry.split('/')
        if len(genre_parts) >= 2:
            genre_name = genre_parts[0]  
            extracted_genres.add(genre_name)    
    return extracted_genres

df['cleaned_bookGenres'] = df["bookGenres"].apply(extract_genres)

In [14]:
df['bookTitle'][0]

'The Hunger Games'

In [15]:
df['bookGenres'][0]

'Young Adult/31,498|Fiction/17,878|Science Fiction-Dystopia/16,665|Fantasy/14,057|Science Fiction/10,807|Romance/4,067|Adventure/3,496|Young Adult-Teen/1,906|Apocalyptic-Post Apocalyptic/1,658|Action/1,375'

In [16]:
df['cleaned_bookGenres'][0]

{'Action',
 'Adventure',
 'Apocalyptic-Post Apocalyptic',
 'Fantasy',
 'Fiction',
 'Romance',
 'Science Fiction',
 'Science Fiction-Dystopia',
 'Young Adult',
 'Young Adult-Teen'}

In [17]:
df.columns

Index(['bookTitle', 'bookAuthors', 'bookGenres', 'cleaned_bookGenres'], dtype='object')

Now that the genres are cleaned and stored in cleaned_bookGenres, the original bookGenres column is no longer needed, so I drop it.

In [18]:
df.drop(['bookGenres'],inplace=True,axis=1)

In [19]:
df.head()

Unnamed: 0,bookTitle,bookAuthors,cleaned_bookGenres
0,The Hunger Games,Suzanne Collins,"{Science Fiction-Dystopia, Apocalyptic-Post Ap..."
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling,Mary GrandPré","{Childrens-Middle Grade, Adventure, Classics, ..."
2,To Kill a Mockingbird,Harper Lee,"{Historical-Historical Fiction, Novels, Classi..."
3,Pride and Prejudice,"Jane Austen,Anna Quindlen","{Historical-Historical Fiction, Novels, Romanc..."
4,Twilight,Stephenie Meyer,"{Paranormal-Vampires, Fantasy-Supernatural, Fa..."


# A little Extra

Here I want to do one extra thing: I want the recommended books to be in the same language as the input book, because most readers prefer to keep reading in one language. The dataset has no language column, so I add an extra column, bookLang, whose value is guessed from the book title with langid.

In [20]:
import langid
import re

In [21]:
def detect_lang(input_string):
    cleaned_text = re.sub(r'[^a-zA-Z]', ' ', input_string)
    language, confidence = langid.classify(cleaned_text)
    return language

df['bookLang'] = df["bookTitle"].apply(detect_lang)
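
One thing to keep in mind about detect_lang: the regular expression replaces every character outside a-z and A-Z with a space, so accented letters are dropped and a title written in a non-Latin script reaches langid as blank text. The sketch below uses only the standard re module to show that effect. keep_unicode_letters is a hypothetical alternative, not part of the original notebook, that removes digits and punctuation but keeps letters from any script.

```python
import re

def ascii_only(title):
    # Same substitution as in detect_lang: anything outside a-z/A-Z becomes a space.
    return re.sub(r'[^a-zA-Z]', ' ', title)

def keep_unicode_letters(title):
    # Hypothetical alternative: collapse runs of non-letters (punctuation, digits,
    # underscores, whitespace) into one space and keep letters from any script.
    return re.sub(r'[\W\d_]+', ' ', title).strip()

for title in ["Les Misérables", "جاناتان، مرغ دریایی"]:
    print(repr(ascii_only(title)), "|", repr(keep_unicode_letters(title)))
```

Whether langid guesses better on this dataset with the second form would need to be checked; the point is only that a blank input cannot produce a meaningful language label.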

In [22]:
df.head()

Unnamed: 0,bookTitle,bookAuthors,cleaned_bookGenres,bookLang
0,The Hunger Games,Suzanne Collins,"{Science Fiction-Dystopia, Apocalyptic-Post Ap...",en
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling,Mary GrandPré","{Childrens-Middle Grade, Adventure, Classics, ...",en
2,To Kill a Mockingbird,Harper Lee,"{Historical-Historical Fiction, Novels, Classi...",en
3,Pride and Prejudice,"Jane Austen,Anna Quindlen","{Historical-Historical Fiction, Novels, Romanc...",en
4,Twilight,Stephenie Meyer,"{Paranormal-Vampires, Fantasy-Supernatural, Fa...",en


# Cosine Similarity

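Cosine similarity measures how similar two numeric vectors are: it is the cosine of the angle between them, so two vectors pointing in the same direction score 1 and two vectors with no non-zero component in common score 0. We therefore first need to convert each book's data into a vector of numbers, and for that purpose we use TF-IDF.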
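
To make the definition concrete, here is a minimal pure-Python sketch of the formula, dot(u, v) / (|u| * |v|). The three toy vectors are made up for illustration and are not TF-IDF values from the dataset; the notebook itself uses scikit-learn's cosine_similarity.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors: one column per (hypothetical) genre token.
book_a = [1, 1, 0]
book_b = [1, 0, 0]
book_c = [0, 0, 1]

print(cosine(book_a, book_b))  # one shared component out of two: about 0.71
print(cosine(book_a, book_c))  # nothing in common: 0.0
```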

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,euclidean_distances

For TF-IDF we first need to join each book's genres and its language into a single string, so let's do that.

In [24]:
x=df.iloc[0]
x

bookTitle                                              The Hunger Games
bookAuthors                                             Suzanne Collins
cleaned_bookGenres    {Science Fiction-Dystopia, Apocalyptic-Post Ap...
bookLang                                                             en
Name: 0, dtype: object

In [25]:
result_string = " ".join(x.cleaned_bookGenres)
result_string=result_string+" "+x.bookLang
result_string

'Science Fiction-Dystopia Apocalyptic-Post Apocalyptic Adventure Romance Fiction Science Fiction Action Fantasy Young Adult Young Adult-Teen en'

Now we create a function that does this for the whole dataset.

In [26]:
def get_string(row):
    result_string = " ".join(row.cleaned_bookGenres)
    result_string=result_string+" "+row.bookLang
    return  result_string
df['string'] = df.apply(get_string,axis=1)

In [27]:
df.head()

Unnamed: 0,bookTitle,bookAuthors,cleaned_bookGenres,bookLang,string
0,The Hunger Games,Suzanne Collins,"{Science Fiction-Dystopia, Apocalyptic-Post Ap...",en,Science Fiction-Dystopia Apocalyptic-Post Apoc...
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling,Mary GrandPré","{Childrens-Middle Grade, Adventure, Classics, ...",en,Childrens-Middle Grade Adventure Classics Fant...
2,To Kill a Mockingbird,Harper Lee,"{Historical-Historical Fiction, Novels, Classi...",en,Historical-Historical Fiction Novels Classics ...
3,Pride and Prejudice,"Jane Austen,Anna Quindlen","{Historical-Historical Fiction, Novels, Romanc...",en,Historical-Historical Fiction Novels Romance C...
4,Twilight,Stephenie Meyer,"{Paranormal-Vampires, Fantasy-Supernatural, Fa...",en,Paranormal-Vampires Fantasy-Supernatural Fanta...


In [28]:
tfidf=TfidfVectorizer(max_features=3000)

In [29]:
vector=tfidf.fit_transform(df['string'])
vector.shape

(9681, 703)

We need a lookup from each book title to its index in the dataset.

In [30]:
movie2idx=pd.Series(df.index,index=df['bookTitle'])
movie2idx

bookTitle
The Hunger Games                                0
Harry Potter and the Order of the Phoenix       1
To Kill a Mockingbird                           2
Pride and Prejudice                             3
Twilight                                        4
                                             ... 
Civil War: A Marvel Comics Event             9995
Peter the Great: His Life and World          9996
Owl at Home (I Can Read, Level 2)            9997
The People in the Trees                      9998
Half Girlfriend                              9999
Length: 9681, dtype: int64

In [31]:
idx=movie2idx['The Hunger Games']
idx

0

Now we look up the TF-IDF vector of that book by its index.

In [32]:
query=vector[idx]
query

<1x703 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>
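
The 13 stored elements are worth a second look, because 'The Hunger Games' has only 10 genres plus a language. A likely explanation is that TfidfVectorizer, with its default settings, lowercases the text and splits it into word tokens of two or more characters, so a genre such as 'Science Fiction-Dystopia' becomes three separate tokens. The sketch below mimics that default tokenization with the standard re module; it is an approximation for illustration, not a call into scikit-learn.

```python
import re

# Pattern equivalent to TfidfVectorizer's default token_pattern: word tokens
# of at least two characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_style_tokens(text):
    # Lowercase first, as the vectorizer does by default, then extract tokens.
    return TOKEN_PATTERN.findall(text.lower())

hunger_games = ("Science Fiction-Dystopia Apocalyptic-Post Apocalyptic Adventure "
                "Romance Fiction Science Fiction Action Fantasy Young Adult "
                "Young Adult-Teen en")

unique_tokens = sorted(set(default_style_tokens(hunger_games)))
print(len(unique_tokens), unique_tokens)
```

One consequence is that 'Fiction' and 'Science Fiction' share the token 'fiction', so books can look similar because of a shared word rather than a shared genre.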

Now we calculate the cosine similarity between this book and every book in the dataset.

In [33]:
scores=cosine_similarity(query,vector)
scores

array([[1.        , 0.28324503, 0.12138773, ..., 0.01049705, 0.15175197,
        0.09581961]])

In [34]:
scores=scores.flatten()

Now we sort the scores and select the most similar books. argsort sorts in ascending order, so we negate the scores to get descending order. We then skip the first position, which is expected to be the query book itself with a similarity of 1, and take the next 5.

In [35]:
recommended_idx=(-scores).argsort()[1:6]
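
The slice [1:6] relies on the query book being ranked first. If another book has exactly the same features, it also scores 1.0 and the query may not end up at position 0. The following sketch, on a made-up score array, shows the negation trick and a more defensive variant that removes the query position explicitly; query_pos and the toy scores are assumptions for illustration.

```python
import numpy as np

scores = np.array([1.0, 0.2, 1.0, 0.7, 0.1])  # toy scores; rows 0 and 2 tie at 1.0
query_pos = 2                                  # position of the query book

# Negating the scores turns argsort's ascending order into descending order.
order = np.argsort(-scores, kind="stable")

naive_top3 = order[1:4]                        # assumes the query is at position 0
safe_top3 = order[order != query_pos][:3]      # drops the query wherever it landed

print(naive_top3.tolist())
print(safe_top3.tolist())
```

With these toy values the naive slice still contains the query position, while the filtered version does not.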

In [36]:
df['bookTitle'].iloc[recommended_idx]

1518    Blood Red Road
243      Catching Fire
156          Insurgent
4160         Article 5
1370    The Kill Order
Name: bookTitle, dtype: object

So these are the books recommended for readers who love 'The Hunger Games'.
Now we make a function that takes a book title as input and returns the recommendations. I also want it to handle titles that are slightly changed or misspelled: rather than just reporting that the book does not exist, it checks whether any title in the dataset is similar to the one given.
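
The misspelling fallback can be tried without any third-party package. This sketch swaps fuzzywuzzy for difflib from the standard library, so its scores are not the same as fuzzywuzzy's and the 0.8 cutoff is only an analogue of the threshold of 80 used in this notebook. The titles list and the name suggest_title are made up for illustration.

```python
import difflib

titles = ["Article 5", "Twilight", "The Hunger Games"]  # toy catalogue

def suggest_title(query, candidates, cutoff=0.8):
    # Return the closest candidate whose similarity ratio reaches the cutoff,
    # or None when nothing is close enough.
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_title("Article", titles))
print(suggest_title("Qqqq", titles))
```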

In [37]:
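def recommended_movies_cosine(title):
    try:
        idx=movie2idx[title]
    except KeyError:
        # Title not found: look for the closest title instead of failing outright
        matches = process.extract(title, df['bookTitle'].tolist(), limit=1)
        if matches and matches[0][1] >= 80:  
            similar_name = matches[0][0]
            return f"Did you mean '{similar_name}'?"
        return f"'{title}' does not exist in the dataset."
    # Several books can share a title; use the first one
    if isinstance(idx, pd.Series):
        idx=idx.iloc[0]
    print(idx)
    # NOTE: idx is an index label, but the rows of vector are positional; the two
    # only agree while df still has its default 0..n-1 index
    query=vector[idx]
    scores=cosine_similarity(query,vector)
    scores=scores.flatten()
    # Negate for descending order and skip the first hit (assumed to be the query itself)
    recommended_idx=(-scores).argsort()[1:6]
    return df['bookTitle'].iloc[recommended_idx]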

In [38]:
recommended_movies_cosine('Article')

"Did you mean 'Article 5'?"

In [39]:
recommended_movies_cosine('Article 5')

4160


3167            Assuming Names: A Con Artist's Masquerade
4684    The Stranger Beside Me: Ted Bundy: The Shockin...
2030                                                Lucky
5450                                              Wiseguy
9770                                       My Dark Places
Name: bookTitle, dtype: object
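
These recommendations look unrelated to 'Article 5', a book that itself appeared among the 'The Hunger Games' recommendations. A likely cause is an indexing mismatch: movie2idx stores the DataFrame's index labels, while the rows of vector are numbered by position. After dropna and drop_duplicates the labels have gaps (the last label is 9999 but there are only 9681 rows), so vector[idx] can point at a different book. The toy example below is a sketch of that effect and of one possible fix; the frame and the names label_map and pos_map are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "bookTitle": ["A", "B", "C", "D"],
    "bookGenres": ["x", None, "y", "z"],
})
toy = toy.dropna()  # row "B" is removed; the remaining labels are 0, 2, 3

# Label-based lookup, built the same way as movie2idx in this notebook.
label_map = pd.Series(toy.index, index=toy["bookTitle"])
# Position-based lookup: matches the row order of a matrix built from toy.
pos_map = pd.Series(range(len(toy)), index=toy["bookTitle"])

print(label_map["D"], pos_map["D"])           # label and position differ
print(toy["bookTitle"].iloc[pos_map["D"]])    # the position finds the right row
```

In the notebook itself, calling df.reset_index(drop=True) right after dropping rows would make labels and positions coincide again.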

# Jaccard Similarity

Jaccard similarity is another measure of how similar two things are. The idea is simple: instead of a string, each book is represented by a set containing all the important information, in this case its genres and language. For two books we divide the size of the intersection of their sets by the size of their union. The resulting score lies between 0 and 1, and the closer it is to 1, the more the two books have in common.
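
As a small worked example of the formula |A ∩ B| / |A ∪ B|, the sketch below compares three made-up sets. They are simplified for illustration and are not the full genre sets from the dataset.

```python
def jaccard(a, b):
    # Size of the intersection divided by the size of the union.
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

book_one = {"Young Adult", "Fantasy", "Romance", "en"}
book_two = {"Young Adult", "Fantasy", "Romance", "Paranormal-Vampires", "en"}
cookbook = {"Cooking", "en"}

print(jaccard(book_one, book_two))  # 4 shared items out of 5 distinct items
print(jaccard(book_one, cookbook))  # only the language is shared: 1 out of 5
```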

In [40]:
x=df.iloc[0]
x

bookTitle                                              The Hunger Games
bookAuthors                                             Suzanne Collins
cleaned_bookGenres    {Science Fiction-Dystopia, Apocalyptic-Post Ap...
bookLang                                                             en
string                Science Fiction-Dystopia Apocalyptic-Post Apoc...
Name: 0, dtype: object

In [41]:
strlist=set(x['cleaned_bookGenres'])  # copy, so the cleaned_bookGenres set is not modified in place
strlist.add(x.bookLang)
strlist

{'Action',
 'Adventure',
 'Apocalyptic-Post Apocalyptic',
 'Fantasy',
 'Fiction',
 'Romance',
 'Science Fiction',
 'Science Fiction-Dystopia',
 'Young Adult',
 'Young Adult-Teen',
 'en'}

Now create a function that does this for the whole dataset.

In [42]:
def convert_into_set(x):
    # Copy the genre set so adding the language does not modify cleaned_bookGenres
    strlist=set(x['cleaned_bookGenres'])
    strlist.add(x.bookLang)
    return strlist

In [43]:
df['set']=df.apply(convert_into_set,axis=1)


In [44]:
df.head()

Unnamed: 0,bookTitle,bookAuthors,cleaned_bookGenres,bookLang,string,set
0,The Hunger Games,Suzanne Collins,"{Science Fiction-Dystopia, Apocalyptic-Post Ap...",en,Science Fiction-Dystopia Apocalyptic-Post Apoc...,"{Science Fiction-Dystopia, Apocalyptic-Post Ap..."
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling,Mary GrandPré","{Childrens-Middle Grade, Adventure, Classics, ...",en,Childrens-Middle Grade Adventure Classics Fant...,"{Childrens-Middle Grade, Adventure, Classics, ..."
2,To Kill a Mockingbird,Harper Lee,"{Historical-Historical Fiction, Novels, Classi...",en,Historical-Historical Fiction Novels Classics ...,"{Historical-Historical Fiction, Novels, Classi..."
3,Pride and Prejudice,"Jane Austen,Anna Quindlen","{Historical-Historical Fiction, Novels, Romanc...",en,Historical-Historical Fiction Novels Romance C...,"{Historical-Historical Fiction, Novels, Romanc..."
4,Twilight,Stephenie Meyer,"{Paranormal-Vampires, Fantasy-Supernatural, Fa...",en,Paranormal-Vampires Fantasy-Supernatural Fanta...,"{Paranormal-Vampires, Fantasy-Supernatural, Fa..."


# Algorithm 

I did not find a ready-made implementation of Jaccard similarity that works directly on Python sets, so I am implementing it myself.

In [45]:
def calculate_score(row,inputset):
    intersection=len(row['set'].intersection(inputset))
    union = len(row['set'].union(inputset))
    return intersection/union

In [46]:
x['bookTitle']

'The Hunger Games'

In [47]:
name='The Hunger Games'
inputset=df.loc[df['bookTitle'] == name, 'set'].iloc[0]
temp=df[df['bookTitle']!=name].copy()  # copy, so adding the score column does not touch df
temp['score']=temp.apply(lambda row: calculate_score(row, inputset), axis=1)
temp=temp.sort_values(by='score', ascending=False)
top_5_rows = temp.iloc[:5, :]
top_5_names = top_5_rows['bookTitle'].tolist()
top_5_names=set(top_5_names)
top_5_names

{'Article 5', 'Blood Red Road', 'Catching Fire', 'Insurgent', 'Once'}

Now we create a function that takes a title as input and returns the recommendations as output.

In [48]:
def Find_recommendation_jaccof(name):
    if name in df['bookTitle'].tolist():
        inputset=df.loc[df['bookTitle'] == name, 'set'].iloc[0]
        temp=df[df['bookTitle']!=name].copy()  # copy, so adding the score column does not touch df
        temp['score']=temp.apply(lambda row: calculate_score(row, inputset), axis=1)
        temp=temp.sort_values(by='score', ascending=False)
        top_5_rows = temp.iloc[:5, :]
        top_5_recommendation = top_5_rows['bookTitle'].tolist()
        return top_5_recommendation
    matches = process.extract(name, df['bookTitle'].tolist(), limit=1)
    if matches and matches[0][1] >= 80:  
        similar_name = matches[0][0]
        return f"Did you mean '{similar_name}'?"
    return f"'{name}' does not exist in the dataset."

In [49]:
Find_recommendation_jaccof("The Alchemist")

['One',
 'Illusions: The Adventures of a Reluctant Messiah',
 'جاناتان، مرغ دریایی',
 'Jonathan Livingston Seagull',
 'The Prophet']

In [50]:
recommended_movies_cosine("The Alchemist")

22


4901                                       The Alchemist
5046                                                 One
780     Illusions: The Adventures of a Reluctant Messiah
6513                                 جاناتان، مرغ دریایی
182                          Jonathan Livingston Seagull
Name: bookTitle, dtype: object

I hope this is helpful.
