## Importing libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
import warnings
warnings.filterwarnings('ignore')

## Importing the datasets

In [2]:
books_df=pd.read_csv('data.csv')

In [3]:
bk_df=books_df.copy()

In [4]:
bk_df.head()

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,9780006163831,6163831,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,9780006178736,6178731,Rage of angels,,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,9780006280897,6280897,The Four Loves,,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0


## Data preprocessing

#### Checking the shape of the dataset

In [5]:
print('Books dataset shape: ',bk_df.shape)

Books dataset shape:  (6810, 12)


#### Checking the dataset for null values

In [6]:
bk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6810 entries, 0 to 6809
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   isbn13          6810 non-null   int64  
 1   isbn10          6810 non-null   object 
 2   title           6810 non-null   object 
 3   subtitle        2381 non-null   object 
 4   authors         6738 non-null   object 
 5   categories      6711 non-null   object 
 6   thumbnail       6481 non-null   object 
 7   description     6548 non-null   object 
 8   published_year  6804 non-null   float64
 9   average_rating  6767 non-null   float64
 10  num_pages       6767 non-null   float64
 11  ratings_count   6767 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 638.6+ KB


Every column except isbn13, isbn10 and title has missing data.

#### Checking the percentage of missing data in each column to decide whether to drop or retain it

In [7]:
bk_df.isnull().sum() / len(bk_df)*100

isbn13             0.000000
isbn10             0.000000
title              0.000000
subtitle          65.036711
authors            1.057269
categories         1.453744
thumbnail          4.831131
description        3.847283
published_year     0.088106
average_rating     0.631424
num_pages          0.631424
ratings_count      0.631424
dtype: float64

#### Up to 20-25% missing data in a column is treated here as acceptable and can be imputed
About 65% of the values in the 'subtitle' column are missing, well above that threshold. The 'subtitle' column is also not a significant feature for recommending a book. <br>
So, we drop this column.
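
The threshold rule above can be sketched in code. This is a minimal illustration on a made-up toy frame, not the notebook's data: `toy_df`, `MISSING_THRESHOLD` and `cols_to_drop` are hypothetical names, and the 25% cut-off is an assumption taken from the upper end of the range quoted above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for bk_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "subtitle": [np.nan, np.nan, np.nan, "A Novel"],
    "authors": ["X", np.nan, "Y", "Z"],
})

MISSING_THRESHOLD = 25.0  # percent; assumed cut-off, not a fixed rule

# Share of missing values per column, in percent
missing_pct = toy_df.isnull().mean() * 100

# Columns whose missing share exceeds the threshold are candidates for dropping
cols_to_drop = missing_pct[missing_pct > MISSING_THRESHOLD].index.tolist()
reduced_df = toy_df.drop(columns=cols_to_drop)

print(missing_pct)
print("Dropped:", cols_to_drop)
```

Computing the list of columns from the percentages keeps the decision reproducible if the dataset is refreshed, instead of relying on a column name typed by hand.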

#### Deciding which columns are significant for recommending a book

In [8]:
# Going through the columns one by one to decide whether each is significant in recommending a book
# The significant variables are:
# 1.isbn13 - required for deployment purpose
# 2.isbn10 - required for deployment purpose
# 3.title
# 4.authors
# 5.categories
# 6.description

In [9]:
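# .copy() makes bk_filtered an independent frame, so the column assignments below do not act on a slice of bk_df
bk_filtered=bk_df[['isbn13','isbn10','title','authors','categories','description']].copy()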
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,A NOVEL THAT READERS and critics have been eag...
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,A new 'Christie for Christmas' -- a full-lengt...
2,9780006163831,6163831,The One Tree,Stephen R. Donaldson,American fiction,Volume Two of Stephen Donaldson's acclaimed se...
3,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,"A memorable, mesmerizing heroine Jennifer -- b..."
4,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,Lewis' work on the nature of love divides love...


#### Handling the null values

Of the columns we kept, 'authors', 'categories' and 'description' have missing data (as checked previously, every column except isbn13, isbn10 and title does). <br> <br>
We cannot impute these values, because the author, category and description are specific to each book. So, we keep the null values as they are.

In [10]:
bk_filtered.isnull().sum()

isbn13           0
isbn10           0
title            0
authors         72
categories      99
description    262
dtype: int64

#### Converting the values to a list and removing the spaces in between

In [11]:
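def lst(text):
    # Returns a one-element list holding 'text' reduced to letters and commas
    lst1=[]
    str1=''
    # NaN values are floats, not strings, so they are skipped and give an empty list
    if isinstance(text, str):
        for i in text:
            if i.isalpha() or i==',':
                str1+=i
            elif i==';':
                str1+=',' # multiple names are separated by ';' in the raw data
        lst1.append(str1)
    return lst1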
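
To make the effect of `lst()` concrete, here is a small standalone sketch. It repeats the same logic in compact form so that it runs on its own, and adds `split_names()`, a hypothetical variant (not part of the notebook) that returns one list element per name instead of a single comma-joined string.

```python
def lst(text):
    """Same behaviour as the notebook's lst(), repeated so the sketch runs on its own."""
    if not isinstance(text, str):
        return []  # NaN is a float, so missing values give an empty list
    kept = (ch for ch in text if ch.isalpha() or ch in ",;")
    return ["".join("," if ch == ";" else ch for ch in kept)]


def split_names(text):
    """Hypothetical variant: one list element per name instead of one joined string."""
    joined = lst(text)
    return [name for name in joined[0].split(",") if name] if joined else []


print(lst("Charles Osborne;Agatha Christie"))          # ['CharlesOsborne,AgathaChristie']
print(lst("Stephen R. Donaldson"))                     # ['StephenRDonaldson']
print(lst(float("nan")))                               # []
print(split_names("Charles Osborne;Agatha Christie"))  # ['CharlesOsborne', 'AgathaChristie']
```

Removing the spaces turns each full name into a single token, so that two authors who only share a first name are not treated as related.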

In [12]:
bk_filtered['authors']=bk_filtered['authors'].apply(lst)

In [13]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description
0,9780002005883,2005883,Gilead,[MarilynneRobinson],Fiction,A NOVEL THAT READERS and critics have been eag...
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie]",Detective and mystery stories,A new 'Christie for Christmas' -- a full-lengt...
2,9780006163831,6163831,The One Tree,[StephenRDonaldson],American fiction,Volume Two of Stephen Donaldson's acclaimed se...
3,9780006178736,6178731,Rage of angels,[SidneySheldon],Fiction,"A memorable, mesmerizing heroine Jennifer -- b..."
4,9780006280897,6280897,The Four Loves,[CliveStaplesLewis],Christian life,Lewis' work on the nature of love divides love...


In [14]:
bk_filtered['categories']=bk_filtered['categories'].apply(lst)

In [15]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description
0,9780002005883,2005883,Gilead,[MarilynneRobinson],[Fiction],A NOVEL THAT READERS and critics have been eag...
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie]",[Detectiveandmysterystories],A new 'Christie for Christmas' -- a full-lengt...
2,9780006163831,6163831,The One Tree,[StephenRDonaldson],[Americanfiction],Volume Two of Stephen Donaldson's acclaimed se...
3,9780006178736,6178731,Rage of angels,[SidneySheldon],[Fiction],"A memorable, mesmerizing heroine Jennifer -- b..."
4,9780006280897,6280897,The Four Loves,[CliveStaplesLewis],[Christianlife],Lewis' work on the nature of love divides love...


The columns 'authors' and 'categories' now hold one string per row, with the spaces removed and multiple names separated by commas. The 'description' column holds entire sentences, which are harder to convert into tokens. So, we split each description into a list of words, which makes the token conversion easier.

In [16]:
# Converting the 'description' column to string datatype
# Note: a missing description (NaN) becomes the literal string 'nan'
def to_str(text):
    return str(text)

In [17]:
bk_filtered['description']=bk_filtered['description'].apply(to_str)
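
One side effect of `to_str()` is worth seeing in isolation: `str()` applied to a missing value yields the text `'nan'`, which a vectorizer is then likely to count as an ordinary word. The sketch below shows this and a possible alternative; `to_str_or_empty` is a hypothetical helper, not part of the notebook, and using it would change the results shown further down.

```python
import math


def to_str_or_empty(text):
    """Hypothetical variant of to_str(): missing values become '' instead of 'nan'."""
    if isinstance(text, float) and math.isnan(text):
        return ""
    return str(text)


missing = float("nan")
print(repr(str(missing)))              # 'nan'
print(repr(to_str_or_empty(missing)))  # ''
print(to_str_or_empty("A new novel").split())
print(to_str_or_empty(missing).split())  # an empty description gives an empty word list
```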

In [18]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description
0,9780002005883,2005883,Gilead,[MarilynneRobinson],[Fiction],A NOVEL THAT READERS and critics have been eag...
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie]",[Detectiveandmysterystories],A new 'Christie for Christmas' -- a full-lengt...
2,9780006163831,6163831,The One Tree,[StephenRDonaldson],[Americanfiction],Volume Two of Stephen Donaldson's acclaimed se...
3,9780006178736,6178731,Rage of angels,[SidneySheldon],[Fiction],"A memorable, mesmerizing heroine Jennifer -- b..."
4,9780006280897,6280897,The Four Loves,[CliveStaplesLewis],[Christianlife],Lewis' work on the nature of love divides love...


In [19]:
def sent_to_words(text):
    return text.split() # splitting the entire sentence to list of words

In [20]:
bk_filtered['description']=bk_filtered['description'].apply(sent_to_words)

In [21]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description
0,9780002005883,2005883,Gilead,[MarilynneRobinson],[Fiction],"[A, NOVEL, THAT, READERS, and, critics, have, ..."
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie]",[Detectiveandmysterystories],"[A, new, 'Christie, for, Christmas', --, a, fu..."
2,9780006163831,6163831,The One Tree,[StephenRDonaldson],[Americanfiction],"[Volume, Two, of, Stephen, Donaldson's, acclai..."
3,9780006178736,6178731,Rage of angels,[SidneySheldon],[Fiction],"[A, memorable,, mesmerizing, heroine, Jennifer..."
4,9780006280897,6280897,The Four Loves,[CliveStaplesLewis],[Christianlife],"[Lewis', work, on, the, nature, of, love, divi..."


#### Merging authors, categories and description into a single 'context' column. A user searches by title, so only four columns remain for encoding: isbn13, isbn10, title and context (the merged columns)

In [22]:
bk_filtered['context']=bk_filtered['authors']+bk_filtered['categories']+bk_filtered['description']

In [23]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,description,context
0,9780002005883,2005883,Gilead,[MarilynneRobinson],[Fiction],"[A, NOVEL, THAT, READERS, and, critics, have, ...","[MarilynneRobinson, Fiction, A, NOVEL, THAT, R..."
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie]",[Detectiveandmysterystories],"[A, new, 'Christie, for, Christmas', --, a, fu...","[CharlesOsborne,AgathaChristie, Detectiveandmy..."
2,9780006163831,6163831,The One Tree,[StephenRDonaldson],[Americanfiction],"[Volume, Two, of, Stephen, Donaldson's, acclai...","[StephenRDonaldson, Americanfiction, Volume, T..."
3,9780006178736,6178731,Rage of angels,[SidneySheldon],[Fiction],"[A, memorable,, mesmerizing, heroine, Jennifer...","[SidneySheldon, Fiction, A, memorable,, mesmer..."
4,9780006280897,6280897,The Four Loves,[CliveStaplesLewis],[Christianlife],"[Lewis', work, on, the, nature, of, love, divi...","[CliveStaplesLewis, Christianlife, Lewis', wor..."


In [24]:
bk_filtered.drop(['authors','categories','description'],axis=1,inplace=True)

In [25]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,context
0,9780002005883,2005883,Gilead,"[MarilynneRobinson, Fiction, A, NOVEL, THAT, R..."
1,9780002261982,2261987,Spider's Web,"[CharlesOsborne,AgathaChristie, Detectiveandmy..."
2,9780006163831,6163831,The One Tree,"[StephenRDonaldson, Americanfiction, Volume, T..."
3,9780006178736,6178731,Rage of angels,"[SidneySheldon, Fiction, A, memorable,, mesmer..."
4,9780006280897,6280897,The Four Loves,"[CliveStaplesLewis, Christianlife, Lewis', wor..."


#### The 'context' column holds a list of values for each book. We now join the values with spaces into a single string, so that when the text is converted into tokens, each book's context can be matched as a whole against the contexts of the other books

In [26]:
def space_grant(text):
    return " ".join(text)

In [27]:
bk_filtered['context']=bk_filtered['context'].apply(space_grant)
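
The merge and join steps can be traced on a single record. This is a minimal sketch with hand-written values modelled on the second row of the dataset; the variable names are chosen for illustration only.

```python
# One record after the list conversion steps (values modelled on the second row)
authors = ["CharlesOsborne,AgathaChristie"]
categories = ["Detectiveandmysterystories"]
description = "A new 'Christie for Christmas'".split()

# '+' on lists concatenates them, which is what the column-wise addition does per row
context_tokens = authors + categories + description

# Joining with a space turns the list back into one string for the vectorizer
context = " ".join(context_tokens)
print(context)

# A book with missing authors has an empty list there, and the merge still works
print(" ".join([] + categories + description))
```

Because `lst()` returns an empty list for missing values, rows with null authors or categories need no special handling at this step.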

In [28]:
bk_filtered.head()

Unnamed: 0,isbn13,isbn10,title,context
0,9780002005883,2005883,Gilead,MarilynneRobinson Fiction A NOVEL THAT READERS...
1,9780002261982,2261987,Spider's Web,"CharlesOsborne,AgathaChristie Detectiveandmyst..."
2,9780006163831,6163831,The One Tree,StephenRDonaldson Americanfiction Volume Two o...
3,9780006178736,6178731,Rage of angels,"SidneySheldon Fiction A memorable, mesmerizing..."
4,9780006280897,6280897,The Four Loves,CliveStaplesLewis Christianlife Lewis' work on...


### Feature extraction (encoding) using the Bag of Words (BoW) method

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
cntvec=CountVectorizer(stop_words='english',max_features=7000)
bow=cntvec.fit_transform(bk_filtered['context']).toarray()
print(bow)
print('Shape of the encoded "context" column:',bow.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Shape of the encoded "context" column: (6810, 7000)
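
To show what a bag-of-words encoding contains, the following sketch builds one by hand with the standard library. It is a simplified illustration and not scikit-learn's implementation: it assumes lowercasing and a "two or more word characters" token rule, and it leaves out stop-word removal and the `max_features` cap. `tokenize` and `bag_of_words` are hypothetical helper names.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    """Lowercase the text and extract its tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count rows): one row per document, one column per term."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # a Counter returns 0 for terms the document lacks
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = [
    "SidneySheldon Fiction A memorable heroine",
    "MarilynneRobinson Fiction A novel novel",
]
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row is mostly zeros because a single book uses only a small part of the whole vocabulary, which is why the printed matrix above shows zeros at its corners.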


## Model building
#### The content-based recommendation system is built on a similarity metric (cosine similarity)

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

In [32]:
cos_sim=cosine_similarity(bow)
cos_sim

array([[1.        , 0.00800641, 0.        , ..., 0.        , 0.00836242,
        0.        ],
       [0.00800641, 1.        , 0.02886751, ..., 0.        , 0.00870388,
        0.02886751],
       [0.        , 0.02886751, 1.        , ..., 0.        , 0.03015113,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.00836242, 0.00870388, 0.03015113, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.02886751, 0.        , ..., 0.        , 0.        ,
        1.        ]])
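
Each entry of this matrix is the cosine of the angle between two count vectors, that is, their dot product divided by the product of their lengths. A minimal hand-written version, using made-up count vectors, looks like this; `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


book_a = [1, 1, 0, 1]  # made-up term counts
book_b = [1, 0, 2, 0]

print(cosine(book_a, book_b))  # one shared term: 1 / (sqrt(3) * sqrt(5)), about 0.258
print(cosine(book_a, book_a))  # a vector compared with itself: 1, up to rounding
```

This is why the diagonal of the matrix is 1 and why books that share no terms score 0.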

In [33]:
def recommendation(book):
    ind=bk_filtered[bk_filtered['title']==book].index[0]
#     'ind' stores the index of the first row whose title matches the user's input
#     (an IndexError is raised if the title is not in the dataset)
    dist=sorted(list(enumerate(cos_sim[ind])),reverse=True,key=lambda x : x[1])
#     each element is an (index, similarity) pair, so x[1] is the similarity score;
#     a greater cosine similarity means the books are more similar, so we sort in
#     descending order to get the top 'n' (our choice) books
    for i in dist[1:6]: # dist[0] is skipped because the cos_sim of a book with itself is 1,
#     so the input book would otherwise be recommended first
        print(bk_filtered.iloc[i[0]].title)
#     printing the titles of the 5 books that are most similar to the input book
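
The same ranking logic can be written so that it returns a list and tolerates unknown titles. The sketch below is a hypothetical variant on plain Python lists with made-up similarity values, not the notebook's function; `recommend`, `titles` and `sim_matrix` are names assumed for illustration.

```python
def recommend(title, titles, sim_matrix, n=5):
    """Return up to n titles most similar to `title`, or [] if the title is unknown."""
    if title not in titles:
        return []
    ind = titles.index(title)  # first match, as with .index[0] in the notebook
    ranked = sorted(enumerate(sim_matrix[ind]), key=lambda pair: pair[1], reverse=True)
    # Skip the input book by its index rather than by its position in the ranking
    return [titles[i] for i, _ in ranked if i != ind][:n]


titles = ["Gilead", "John Adams", "Tolkien"]
sim_matrix = [  # made-up symmetric similarity scores
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

print(recommend("Gilead", titles, sim_matrix, n=2))   # ['John Adams', 'Tolkien']
print(recommend("Unknown", titles, sim_matrix, n=2))  # []
```

Filtering on the index instead of slicing from position 1 keeps the input book out of the results even when another book ties with it at a similarity of 1, for example a duplicate entry.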

#### Getting the book recommendations based on the content of the book 'Gilead'

In [34]:
recommendation('Gilead')

Go Tell it on the Mountain
The Last Eyewitness
Children of the Alley
John Adams
The Chaneysville Incident


#### Getting the book recommendations based on the content of the book 'The diaries of Franz Kafka'

In [35]:
recommendation('The diaries of Franz Kafka')

A twist of Lennon
Winston S. Churchill
Soldier, general of the army, president-elect, 1890-1952
The Art of Maurice Sendak
Tolkien


#### Getting the book recommendations based on the content of the book 'The Very Persistent Gappers of Frip'

In [36]:
recommendation('The Very Persistent Gappers of Frip')

Pippi Goes on Board
That was Then, this is Now
The World of Peter Rabbit
The People of Sparks
The Woman in the Dunes
