Importing packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Reading our datasets

In [3]:
df = pd.read_csv('C:/Users/Nam/Desktop/DoAnHoanThien/original-data/books.csv')
df.head()

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,9780006163831,6163831,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,9780006178736,6178731,Rage of angels,,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,9780006280897,6280897,The Four Loves,,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0


Drop the 'isbn13' and 'isbn10' columns :

In [4]:
df.drop("isbn13", inplace = True, axis = 1)

In [5]:
df.drop("isbn10", inplace = True, axis = 1)

In [6]:
df.head()

Unnamed: 0,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,Rage of angels,,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,The Four Loves,,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0
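The two drops above could also be written as a single call. This is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative, not part of the books data), assuming a reassigned result is acceptable in place of `inplace = True`.

```python
import pandas as pd

# Toy frame standing in for the books data (values are made up).
toy = pd.DataFrame({
    "isbn13": [1, 2],
    "isbn10": [3, 4],
    "title": ["A", "B"],
})

# Drop both identifier columns in one call and keep the returned frame.
toy = toy.drop(columns=["isbn13", "isbn10"])
print(toy.columns.tolist())
```

Passing a list to `columns=` avoids repeating `axis = 1` for every column.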


Flag missing values in each feature :

In [7]:
missing_data = df.isnull()

Summarize missing values in each feature :

In [8]:
for column in missing_data.columns.values.tolist():
    print(missing_data[column].value_counts())
    print(" ")

title
False    6810
Name: count, dtype: int64
 
subtitle
True     4429
False    2381
Name: count, dtype: int64
 
authors
False    6738
True       72
Name: count, dtype: int64
 
categories
False    6711
True       99
Name: count, dtype: int64
 
thumbnail
False    6481
True      329
Name: count, dtype: int64
 
description
False    6548
True      262
Name: count, dtype: int64
 
published_year
False    6804
True        6
Name: count, dtype: int64
 
average_rating
False    6767
True       43
Name: count, dtype: int64
 
num_pages
False    6767
True       43
Name: count, dtype: int64
 
ratings_count
False    6767
True       43
Name: count, dtype: int64
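A more compact view of the same information is possible by counting missing entries per column. The sketch below uses a small made-up frame (`toy`, `summary` and the values are assumptions for illustration) and is one option among several, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "subtitle": [np.nan, "A Novel", np.nan],
    "num_pages": [100.0, np.nan, 250.0],
})

# isna() gives a boolean mask; summing it counts the True entries per column.
missing_counts = toy.isna().sum()
# The mean of a boolean mask is the fraction of missing entries per column.
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

One row per column makes it easier to compare features such as 'subtitle' and 'published_year' at a glance.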
 


Data Wrangling :

Replacing missing values in 'subtitle' column with its most frequent value

In [9]:
df['subtitle'] = df['subtitle'].fillna(df['subtitle'].value_counts().idxmax())
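How the most-frequent-value fill behaves can be seen on a small example. This is a minimal sketch on made-up data (the `toy` frame and `most_frequent` name are illustrative); it assigns the filled column back to the frame instead of calling an in-place method on a column selection.

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (values are made up).
toy = pd.DataFrame({"subtitle": ["A Novel", np.nan, "A Novel", "Poems", np.nan]})

# value_counts() ignores NaN by default, so idxmax() returns the most
# frequent non-missing value.
most_frequent = toy["subtitle"].value_counts().idxmax()

# Assign the result back to the column rather than modifying a selection in place.
toy["subtitle"] = toy["subtitle"].fillna(most_frequent)
print(toy["subtitle"].tolist())
```

Note that when most of a column is missing, as with 'subtitle' here, the most frequent value ends up on the majority of rows.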


Replacing missing values in 'description' column with 'No Description For that book'

In [10]:
df['description'] = df['description'].fillna("No Description For that book")

Replacing missing values in 'authors' column with 'No Author For that book'



In [11]:
df['authors'] = df['authors'].fillna("No Author For that book")

Replacing missing values in 'categories' column with its most frequent value

In [12]:
df['categories'] = df['categories'].fillna(df['categories'].value_counts().idxmax())


Drop 'thumbnail' column :



In [13]:
df.drop("thumbnail", inplace = True, axis = 1)

Replacing missing values in 'published_year' column with its most frequent value :

In [14]:
df['published_year'] = df['published_year'].fillna(df['published_year'].value_counts().idxmax())


Replacing missing values in 'average_rating' column with its mean :

In [15]:
average_rating_mean = df['average_rating'].astype(float).mean(axis = 0)
print("The average rating mean = ", average_rating_mean)

The average rating mean =  3.933283582089552


In [16]:
df['average_rating'] = df['average_rating'].fillna(average_rating_mean)


Replacing missing values in 'num_pages' column with its mean :

In [17]:
num_pages_mean = df['num_pages'].astype(float).mean(axis = 0)
print("The number of pages mean = ", num_pages_mean)

The number of pages mean =  348.1810255652431


In [18]:
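# Truncated mean shown for reference only; the result is not stored,
# so the fill below uses the float mean.
num_pages_mean.astype(int)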

348

In [19]:
df['num_pages'] = df['num_pages'].fillna(num_pages_mean)


Replacing missing values in 'ratings_count' column with its mean :

In [20]:
ratings_count_mean = df['ratings_count'].astype(float).mean(axis = 0)
print("The Rating count mean =", ratings_count_mean)

The Rating count mean = 21069.09989655682


In [21]:
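# Truncated mean shown for reference only; the result is not stored,
# so the fill below uses the float mean.
ratings_count_mean.astype(int)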

21069

In [22]:
df['ratings_count'] = df['ratings_count'].fillna(ratings_count_mean)
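The three mean fills for 'average_rating', 'num_pages' and 'ratings_count' follow the same pattern, so they could be grouped. The sketch below shows one way to do that on made-up data (`toy`, `numeric_cols` and `column_means` are illustrative names), assuming the column mean is the intended fill value for each.

```python
import numpy as np
import pandas as pd

# Toy numeric columns, each with one gap (values are made up).
toy = pd.DataFrame({
    "average_rating": [4.0, np.nan, 3.0],
    "num_pages": [100.0, 300.0, np.nan],
    "ratings_count": [np.nan, 10.0, 30.0],
})

numeric_cols = ["average_rating", "num_pages", "ratings_count"]

# mean() skips NaN by default, giving one mean per column.
column_means = toy[numeric_cols].mean()

# fillna() with a Series fills each column with the value under its own label.
toy[numeric_cols] = toy[numeric_cols].fillna(column_means)
print(toy)
```

Because the mean is sensitive to extreme values, a median fill may be worth considering for a skewed column such as 'ratings_count'; that is a design choice, not something the notebook does.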


Checking the missing values again :

In [23]:
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(missing_data[column].value_counts())
    print(" ")

title
False    6810
Name: count, dtype: int64
 
subtitle
False    6810
Name: count, dtype: int64
 
authors
False    6810
Name: count, dtype: int64
 
categories
False    6810
Name: count, dtype: int64
 
description
False    6810
Name: count, dtype: int64
 
published_year
False    6810
Name: count, dtype: int64
 
average_rating
False    6810
Name: count, dtype: int64
 
num_pages
False    6810
Name: count, dtype: int64
 
ratings_count
False    6810
Name: count, dtype: int64
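The same check can be expressed as a single assertion, which fails loudly if any gap remains. This is a minimal sketch on made-up data (`toy` and `remaining` are illustrative names).

```python
import numpy as np
import pandas as pd

# Toy frame with one gap, filled the same way as above (values are made up).
toy = pd.DataFrame({"title": ["A", "B"], "num_pages": [100.0, np.nan]})
toy["num_pages"] = toy["num_pages"].fillna(toy["num_pages"].mean())

# Total number of missing cells across the whole frame.
remaining = int(toy.isna().sum().sum())
assert remaining == 0, f"{remaining} missing values remain"
print("No missing values left")
```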
 


Getting info about our Dataset :

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6810 entries, 0 to 6809
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           6810 non-null   object 
 1   subtitle        6810 non-null   object 
 2   authors         6810 non-null   object 
 3   categories      6810 non-null   object 
 4   description     6810 non-null   object 
 5   published_year  6810 non-null   float64
 6   average_rating  6810 non-null   float64
 7   num_pages       6810 non-null   float64
 8   ratings_count   6810 non-null   float64
dtypes: float64(4), object(5)
memory usage: 479.0+ KB


In [25]:
df.head()

Unnamed: 0,title,subtitle,authors,categories,description,published_year,average_rating,num_pages,ratings_count
0,Gilead,A Novel,Marilynne Robinson,Fiction,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,The One Tree,A Novel,Stephen R. Donaldson,American fiction,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,Rage of angels,A Novel,Sidney Sheldon,Fiction,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,The Four Loves,A Novel,Clive Staples Lewis,Christian life,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0


Saving Cleaned Dataset :

In [26]:
df.to_csv('C:/Users/Nam/Desktop/DoAnHoanThien/Clean_Data/cleaned_dataset.csv')
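By default `to_csv` also writes the row index, which comes back as an extra 'Unnamed: 0' column when the file is read again. The sketch below illustrates the difference on made-up data using in-memory buffers instead of files (all names are illustrative); whether to pass `index=False` depends on whether later steps expect that column.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({"title": ["Gilead", "The One Tree"], "num_pages": [247.0, 479.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```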