In [7]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/goodreads-books/cleaned_description.csv
/kaggle/input/goodreads-books/cleaned_books.csv.bz2
/kaggle/input/goodreads-books/full_df.csv.bz2


## Data Science Problem 

Goodreads is a popular online platform where readers can discover new books, read reviews, and connect with other readers. However, with millions of books available on the platform, it can be overwhelming for users to find books that match their preferences. In this project, we aim to build a recommendation system for Goodreads users that suggests books based on their reading history, preferences, and ratings. Our goal is to provide a personalized and intuitive experience for users, helping them discover new books that they are likely to enjoy. Using supervised learning and NLP techniques, we will build a model that can predict the likelihood of a user liking a book based on features such as book title, author, genre, description, user reviews, and book ratings. The model will be trained on a subset of the Goodreads dataset and evaluated based on accuracy, precision, recall, and F1-score. The final output will be a recommendation engine that suggests books to users based on their input and history on the platform.

Data extracted from the UCSD Book Graph (Goodreads) dataset: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

## Objectives

In the preprocessing step, we aim to clean and transform the raw data into a format that is suitable for machine learning models.
Here are some of the questions we aim to answer during the preprocessing step in this notebook:

Data Cleaning:
- Are there any missing values in the dataset?
- Are there any duplicate entries in the dataset?
- Are there any irrelevant features in the dataset that can be removed?
- Are there any inconsistencies in the data that need to be corrected?

Data Transformation:
- How can we extract relevant features from the dataset, such as book title, author, genre, description, user reviews, and book ratings?
- How can we preprocess the text data to make it suitable for machine learning models, such as tokenization, removing stop words, stemming, and lemmatization?
- How can we convert the text data into numerical features that can be used in machine learning models, such as TF-IDF, Bag of Words, or Word2Vec?

Exploratory Data Analysis:
- What is the distribution of ratings in the dataset?
- What are the most popular genres and authors in the dataset?
- Are there any correlations between different features in the dataset?

Data Preparation:
- How can we split the dataset into training and testing sets?
- How can we balance the dataset to handle class imbalance?
- How can we encode categorical variables into numerical variables?

Answering these questions during the preprocessing step is crucial in building an accurate and robust recommendation system that can provide personalized recommendations to Goodreads users.

In [8]:
df = pd.read_csv('/kaggle/input/goodreads-books/full_df.csv.bz2', compression='bz2')

In [9]:
df.shape

(1115445, 16)

In [10]:
df.head()

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
0,1333909,743509986.0,626222,Anita Diamant,Good Harbor,"Anita Diamant's international bestseller ""The ...",Simon & Schuster Audio,"['to-read', 'fiction', 'currently-reading', 'c...",3.23,10,0,2001.0,,"['8709549', '17074050', '28937', '158816', '22...",https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...
1,7327624,,10333,Barbara Hambly,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",Omnibus book club edition containing the Ladie...,"Nelson Doubleday, Inc.","['to-read', 'fantasy', 'fiction', 'owned', 'ha...",4.03,140,600,1987.0,eng,"['19997', '828466', '1569323', '425389', '1176...",https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
2,6066819,743294297.0,9212,Jennifer Weiner,Best Friends Forever,Addie Downs and Valerie Adler were eight when ...,Atria Books,"['to-read', 'chick-lit', 'currently-reading', ...",3.49,51184,368,2009.0,eng,"['6604176', '6054190', '2285777', '82641', '75...",https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...
3,287141,1599150603.0,3041852,Alfred J. Church,The Aeneid for Boys and Girls,"Relates in vigorous prose the tale of Aeneas, ...",Yesterday's Classics,"['to-read', 'currently-reading', 'history', 'c...",4.13,46,162,2006.0,,[],https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...
4,6066812,1934876569.0,19158,Rachel Roberts,All's Fairy in Love and War (Avalon: Web of Ma...,"To Kara's astonishment, she discovers that a p...",Seven Seas,"['to-read', 'fantasy', 'owned', 'books-i-own',...",4.22,98,216,2009.0,,"['948696', '439885', '274955', '12978730', '37...",https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...


In [11]:
df.columns

Index(['book_id', 'isbn', 'author_id', 'authors', 'title', 'description',
       'publisher', 'genres', 'avg_rating', 'ratings_count', 'num_pages',
       'pub_year', 'language_code', 'similar_books', 'url', 'cover_image'],
      dtype='object')

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115445 entries, 0 to 1115444
Data columns (total 16 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   book_id        1115445 non-null  int64  
 1   isbn           760981 non-null   object 
 2   author_id      1115445 non-null  int64  
 3   authors        1115444 non-null  object 
 4   title          1115442 non-null  object 
 5   description    1115435 non-null  object 
 6   publisher      863332 non-null   object 
 7   genres         1115445 non-null  object 
 8   avg_rating     1115445 non-null  float64
 9   ratings_count  1115445 non-null  int64  
 10  num_pages      1115445 non-null  int64  
 11  pub_year       896977 non-null   float64
 12  language_code  522531 non-null   object 
 13  similar_books  1115445 non-null  object 
 14  url            1115445 non-null  object 
 15  cover_image    1115445 non-null  object 
dtypes: float64(2), int64(4), object(10)
memory usage: 136.

## Number Of Missing Values By Column

In [13]:
missing = pd.concat([df.isna().sum(), 100* df.isna().mean()], axis = 1)
missing.columns = ['count','%']
missing.sort_values(by = 'count')

Unnamed: 0,count,%
book_id,0,0.0
author_id,0,0.0
genres,0,0.0
avg_rating,0,0.0
ratings_count,0,0.0
num_pages,0,0.0
similar_books,0,0.0
url,0,0.0
cover_image,0,0.0
authors,1,9e-05


In [14]:
# Let's look closely at the "language_code" and "isbn" columns
df['language_code'].value_counts()


eng    522531
Name: language_code, dtype: int64

In [15]:
df['language_code'].isna().sum()

592914
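The two checks above (value counts, then the number of missing entries) can also be combined into one view. The following is a minimal sketch on made-up toy data, not the Goodreads frame; the names `lang`, `counts`, and `share_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a column such as language_code (values are made up).
lang = pd.Series(["eng", np.nan, "eng", np.nan, np.nan])

# dropna=False keeps NaN as its own row in the frequency table,
# so missing and non-missing values are visible side by side.
counts = lang.value_counts(dropna=False)

# Fraction of missing entries in the column.
share_missing = lang.isna().mean()

print(counts)
print(f"missing share: {share_missing:.1%}")
```

Applied to the real column, the same two calls would show the single `eng` value next to the missing count in one table.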

In [16]:
# Drop the language_code column: about 53% of its values are missing, and the only
# language it records is English, so it provides no insight for the recommender system.
df = df.drop('language_code', axis = 1)

In [17]:
# Drop isbn as well (about 32% of its values are missing)
df = df.drop('isbn', axis =1)

## Categorical Features

In [18]:
df.select_dtypes('object')

Unnamed: 0,authors,title,description,publisher,genres,similar_books,url,cover_image
0,Anita Diamant,Good Harbor,"Anita Diamant's international bestseller ""The ...",Simon & Schuster Audio,"['to-read', 'fiction', 'currently-reading', 'c...","['8709549', '17074050', '28937', '158816', '22...",https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...
1,Barbara Hambly,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",Omnibus book club edition containing the Ladie...,"Nelson Doubleday, Inc.","['to-read', 'fantasy', 'fiction', 'owned', 'ha...","['19997', '828466', '1569323', '425389', '1176...",https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
2,Jennifer Weiner,Best Friends Forever,Addie Downs and Valerie Adler were eight when ...,Atria Books,"['to-read', 'chick-lit', 'currently-reading', ...","['6604176', '6054190', '2285777', '82641', '75...",https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...
3,Alfred J. Church,The Aeneid for Boys and Girls,"Relates in vigorous prose the tale of Aeneas, ...",Yesterday's Classics,"['to-read', 'currently-reading', 'history', 'c...",[],https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...
4,Rachel Roberts,All's Fairy in Love and War (Avalon: Web of Ma...,"To Kara's astonishment, she discovers that a p...",Seven Seas,"['to-read', 'fantasy', 'owned', 'books-i-own',...","['948696', '439885', '274955', '12978730', '37...",https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...
...,...,...,...,...,...,...,...,...
1115440,Trish Morey,The Spaniard's Blackmailed Bride,"Blackmailed into marriage to save her family, ...",Harlequin,"['to-read', 'harlequin', 'harlequin-presents',...","['2200344', '695337', '10333421', '1934240', '...",https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...
1115441,Christopher Lee,"This Sceptred Isle, Vol. 10: The Age of Victor...","The award-winning story of Britain, from the a...",BBC Audiobooks,"['to-read', 'non-fiction', 'audiobooks', 'hist...",[],https://www.goodreads.com/book/show/3084038-th...,https://images.gr-assets.com/books/1494763458m...
1115442,Arthur Conan Doyle,Sherlock Holmes and the July Crisis,Sir Arthur Conan Doyle is brought back to life...,MX Publishing,"['to-read', 'mystery', 'giveaways', 'baker-str...","['12064253', '25017213', '571796', '27306126',...",https://www.goodreads.com/book/show/26168430-s...,https://images.gr-assets.com/books/1440592011m...
1115443,Nicola Baxter,The Children's Classic Poetry Collection,"Gathers poems by William Blake, Emily Bronte, ...",Smithmark Publishers,"['to-read', 'poetry', 'default', 'currently-re...",[],https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...


In [19]:
df = df.drop('cover_image', axis =1)

In [20]:
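# Note: value_counts().sum() gives the number of non-null book_id entries;
# on its own it does not show whether the IDs are unique.
df['book_id'].value_counts().sum()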

1115445
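Summing the value counts returns the number of non-null rows, so it matches the row count whether or not IDs repeat. If the intent is to confirm that every `book_id` is distinct, a sketch along these lines may help; the toy series `ids` is made up for illustration.

```python
import pandas as pd

# Toy ID column with one repeated value (made-up data).
ids = pd.Series([10, 11, 11, 12])

non_null_rows = ids.value_counts().sum()  # counts rows, repeats included
distinct_ids = ids.nunique()              # counts distinct values
all_unique = ids.is_unique                # True only if no value repeats

print(non_null_rows, distinct_ids, all_unique)
```

On the real frame, comparing `df['book_id'].nunique()` with `len(df)` (or checking `df['book_id'].is_unique`) would answer the uniqueness question directly.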

In [21]:
df['author_id'].value_counts().head()

123715     1988
3780       1762
3389       1516
5158478    1275
947        1271
Name: author_id, dtype: int64

In [22]:
df['title'].value_counts().head()

Pride and Prejudice    246
Jane Eyre              221
Wuthering Heights      205
Selected Poems         183
Dracula                183
Name: title, dtype: int64

In [23]:
df[['title','authors']].nunique()

title      818982
authors    271339
dtype: int64

In [24]:
(df['title'] + ', ' + df['authors']).value_counts().head()

Pride and Prejudice, Jane Austen             238
Jane Eyre, Charlotte Bronte                  216
Wuthering Heights, Emily Bronte              200
Dracula, Bram Stoker                         167
Frankenstein, Mary Wollstonecraft Shelley    152
dtype: int64

In [25]:
df[df[['title', 'authors']].duplicated()]

Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
351,820229,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,82,0,,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820229.Sec...
356,820227,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Hodder,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,24,420,2007.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820227.Sec...
483,35593693,822613,Pierre Lemaitre,Three Days and a Life,"""In 1999, in the small provincial town of Beau...",Maclehose Press Quercus,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,50,0,2017.0,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/35593693-t...
511,8037412,3420,Elizabeth Enright,The Saturdays,Saturdays can make dreams come true when the M...,,"['to-read', 'childrens', 'fiction', 'children'...",4.14,6,0,,"['7926', '42337', '7904', '7932', '377889', '3...",https://www.goodreads.com/book/show/8037412-th...
619,25580382,5194,Michael Crichton,"Jurassic Park (Jurassic Park, #1)",This is an alternate cover edition for ISBN ....,Ballantine Books,"['to-read', 'science-fiction', 'fiction', 'sci...",3.97,294,448,2015.0,"['117710', '19691', '136641', '611637', '35449...",https://www.goodreads.com/book/show/25580382-j...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1115412,24784578,5234671,S.M. Blooding,Whiskey Witches - Complete Season 1,Detective Paige Whiskey comes from a long line...,,"['to-read', 'currently-reading', 'kindle', 'fa...",3.96,127,0,,"['13505327', '1752135', '12591482', '18770492'...",https://www.goodreads.com/book/show/24784578-w...
1115418,7925060,1221698,Neil Gaiman,Instructions,"""A perfect reminder to always be on the lookou...",HarperCollins,"['to-read', 'fantasy', 'picture-books', 'child...",4.29,40,40,2010.0,"['7552359', '7493149', '9762805', '8409657', '...",https://www.goodreads.com/book/show/7925060-in...
1115425,25308738,13786119,Alan Joshua,The SHIVA Syndrome,"Goodreads Readers:\n""The Shiva Syndrome is in ...",,"['to-read', 'paranormal', 'netgalley', 'to-rev...",4.32,31,526,,"['27391087', '25253442', '23441596', '28182084...",https://www.goodreads.com/book/show/25308738-t...
1115427,7195902,1455,Ernest Hemingway,The Short Happy Life of Francis Macomber,"""The Short Happy Life of Francis Macomber"" is ...",,"['to-read', 'short-stories', 'fiction', 'class...",4.07,290,0,,"['978053', '425481', '361551', '255045', '2581...",https://www.goodreads.com/book/show/7195902-th...


In [26]:
df.duplicated(subset=['title','authors']).value_counts()

False    869631
True     245814
dtype: int64
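By default, `duplicated` marks only the second and later occurrences of each (title, authors) pair, so the first copy of every group is not flagged. The sketch below, on made-up toy rows, contrasts the default with `keep=False`, which flags every member of a duplicated group.

```python
import pandas as pd

# Toy frame: three editions of one book plus one unrelated book (IDs are made up).
books = pd.DataFrame({
    "book_id": [1, 2, 3, 4],
    "title": ["Second Glance", "Second Glance", "Dracula", "Second Glance"],
    "authors": ["Jodi Picoult", "Jodi Picoult", "Bram Stoker", "Jodi Picoult"],
})

# Default keep="first": the first occurrence of each pair is NOT flagged.
later_copies = books.duplicated(subset=["title", "authors"])

# keep=False: every row that belongs to a duplicated pair is flagged.
all_copies = books.duplicated(subset=["title", "authors"], keep=False)

print(later_copies.sum(), all_copies.sum())
```

This is why a filter on the default mask can omit one edition that a direct filter on the title still returns.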

In [27]:
df[df['title']=='Second Glance']

Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
350,820228,7128,Jodi Picoult,Second Glance,"Ross, a suicidal drifter and ghost hunter, is ...",,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,70,0,2006.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820228.Sec...
351,820229,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,82,0,,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820229.Sec...
356,820227,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Hodder,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,24,420,2007.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820227.Sec...
113311,6397205,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Atria Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,1200,494,2003.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/6397205-se...
175578,10911,7128,Jodi Picoult,Second Glance,"When odd, supernatural events plague the town ...",,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,33808,0,,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/10911.Seco...
451391,2355102,7128,Jodi Picoult,Second Glance,- From the New York Times best-selling author ...,Recorded Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,22,0,2008.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/2355102.Se...
512842,11214862,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,32,0,,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/11214862-s...
630403,526465,7128,Jodi Picoult,Second Glance,"""Sometimes I wonder....Can a ghost find you, i...",Atria Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,157,432,2003.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/526465.Sec...
665498,3104040,7128,Jodi Picoult,Second Glance,"""Sometimes I wonder....Can a ghost find you, i...",Washington Square Press,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,928,420,2008.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/3104040-se...


In [28]:
df[df['title']=='Three Days and a Life']

Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
482,35593692,822613,Pierre Lemaitre,Three Days and a Life,"""In 1999, in the small provincial town of Beau...",Maclehose Press Quercus,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,15,0,2017.0,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/35593692-t...
483,35593693,822613,Pierre Lemaitre,Three Days and a Life,"""In 1999, in the small provincial town of Beau...",Maclehose Press Quercus,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,50,0,2017.0,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/35593693-t...
800570,34974147,822613,Pierre Lemaitre,Three Days and a Life,"In 1999, in the small provincial town of Beauv...",,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,14,0,,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/34974147-t...
1004886,32941442,822613,Pierre Lemaitre,Three Days and a Life,"Three days at the edge of the new millennium, ...",MacLehose Press,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,19,320,2017.0,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/32941442-t...
1062714,35849902,822613,Pierre Lemaitre,Three Days and a Life,"""In 1999, in the small provincial town of Beau...",,"['to-read', 'thriller', 'french', 'suspense', ...",3.57,8,0,,"['6650065', '394154', '12398221', '232396', '4...",https://www.goodreads.com/book/show/35849902-t...


In [29]:
len(df['authors'].unique())


271340

In [30]:
len(df['title'].unique())

818983
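These counts are each one higher than the `nunique()` results shown earlier (271339 authors, 818982 titles). The difference comes from missing values: `unique()` returns NaN as one of its values, while `nunique()` drops NaN by default. A minimal sketch on made-up toy data:

```python
import numpy as np
import pandas as pd

# Toy author column with one missing value (made-up data).
authors = pd.Series(["Jane Austen", "Bram Stoker", np.nan, "Jane Austen"])

n_with_nan = len(authors.unique())                # NaN counted as a value
n_without_nan = authors.nunique()                 # NaN dropped by default
n_including_nan = authors.nunique(dropna=False)   # NaN counted again

print(n_with_nan, n_without_nan, n_including_nan)
```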

## Imputing the NaN Values and Removing the Duplicates

In [31]:
from sklearn.impute import SimpleImputer

In [32]:
# create the imputer object with most_frequent strategy
imputer = SimpleImputer(strategy = 'most_frequent')
# fit the imputer to the data
imputer.fit(df)
# impute missing values in each column
data_imputed = imputer.transform(df)

# convert the imputed data back to a pandas dataframe
data_imputed_df = pd.DataFrame(data_imputed, columns=df.columns)

# print the first 5 rows of the imputed dataframe
print(data_imputed_df.head())



   book_id author_id           authors  \
0  1333909    626222     Anita Diamant   
1  7327624     10333    Barbara Hambly   
2  6066819      9212   Jennifer Weiner   
3   287141   3041852  Alfred J. Church   
4  6066812     19158    Rachel Roberts   

                                               title  \
0                                        Good Harbor   
1  The Unschooled Wizard (Sun Wolf and Starhawk, ...   
2                               Best Friends Forever   
3                      The Aeneid for Boys and Girls   
4  All's Fairy in Love and War (Avalon: Web of Ma...   

                                         description               publisher  \
0  Anita Diamant's international bestseller "The ...  Simon & Schuster Audio   
1  Omnibus book club edition containing the Ladie...  Nelson Doubleday, Inc.   
2  Addie Downs and Valerie Adler were eight when ...             Atria Books   
3  Relates in vigorous prose the tale of Aeneas, ...    Yesterday's Classics   
4  To Kara

In [33]:
# Spot-check the imputation on the "Second Glance" rows
data_imputed_df[data_imputed_df['title']=='Second Glance']

Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
350,820228,7128,Jodi Picoult,Second Glance,"Ross, a suicidal drifter and ghost hunter, is ...",Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,70,0,2006.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820228.Sec...
351,820229,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,82,0,2014.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820229.Sec...
356,820227,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Hodder,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,24,420,2007.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/820227.Sec...
113311,6397205,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Atria Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,1200,494,2003.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/6397205-se...
175578,10911,7128,Jodi Picoult,Second Glance,"When odd, supernatural events plague the town ...",Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,33808,0,2014.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/10911.Seco...
451391,2355102,7128,Jodi Picoult,Second Glance,- From the New York Times best-selling author ...,Recorded Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,22,0,2008.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/2355102.Se...
512842,11214862,7128,Jodi Picoult,Second Glance,From the moment Ross's fiancee Aimee was kille...,Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,32,0,2014.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/11214862-s...
630403,526465,7128,Jodi Picoult,Second Glance,"""Sometimes I wonder....Can a ghost find you, i...",Atria Books,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,157,432,2003.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/526465.Sec...
665498,3104040,7128,Jodi Picoult,Second Glance,"""Sometimes I wonder....Can a ghost find you, i...",Washington Square Press,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,928,420,2008.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/3104040-se...
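In the rows above, every missing `publisher` became "Createspace Independent Publishing Platform" and every missing `pub_year` became 2014.0. That is what the `most_frequent` strategy does: each gap is filled with the most common value of its whole column, regardless of which book the row belongs to. The sketch below reproduces the behaviour on a made-up toy frame; the names `toy` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps in a text column and a numeric column (made-up data).
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "publisher": ["P1", np.nan, "P2", "P2"],
    "pub_year": [2003.0, np.nan, 2014.0, 2014.0],
})

# most_frequent fills each column's gaps with that column's mode.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(filled)
```

Because the mixed-type frame passes through a single array, numeric columns may need converting back afterwards (for example with `pd.to_numeric`) before they are used in calculations.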


In [34]:
# sort the dataset by ratings_count in descending order
data_sorted = data_imputed_df.sort_values('ratings_count', ascending=False)

# drop duplicates based on author and title, and keep the first occurrence (which has the highest ratings_count)
data_unique = data_sorted.drop_duplicates(subset=['authors', 'title'], keep='first')

# print the first 5 rows of the unique dataframe
print(data_unique.head())

         book_id author_id              authors  \
242287   2767052    153394      Suzanne Collins   
748769         3   1077326         J.K. Rowling   
673991      2657      1825           Harper Lee   
394846      4671      3190  F. Scott Fitzgerald   
933517  11870085   1406384           John Green   

                                                    title  \
242287            The Hunger Games (The Hunger Games, #1)   
748769  Harry Potter and the Sorcerer's Stone (Harry P...   
673991                              To Kill a Mockingbird   
394846                                   The Great Gatsby   
933517                             The Fault in Our Stars   

                                              description  \
242287  Winning will make you famous.\nLosing means ce...   
748769  Harry Potter's life is miserable. His parents ...   
673991  The unforgettable novel of a childhood in a sl...   
394846  THE GREAT GATSBY, F. Scott Fitzgerald's third ...   
933517  There is an a

In [35]:
data_unique[data_unique['title']=='Second Glance']

Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
175578,10911,7128,Jodi Picoult,Second Glance,"When odd, supernatural events plague the town ...",Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'j...",3.79,33808,0,2014.0,"['8359929', '723742', '297130', '7570244', '39...",https://www.goodreads.com/book/show/10911.Seco...


We have successfully removed all duplicates based on author and title, keeping only the entry with the highest ratings count.
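The sort-then-deduplicate pattern used here can be seen in isolation on a made-up toy frame. Sorting by `ratings_count` in descending order puts the most-rated edition first within each (authors, title) group, so `keep='first'` retains it. The names `editions` and `best` are illustrative.

```python
import pandas as pd

# Toy frame: three editions of one title and one other book (made-up data).
editions = pd.DataFrame({
    "authors": ["A", "A", "A", "B"],
    "title": ["T1", "T1", "T1", "T2"],
    "ratings_count": [70, 33808, 1200, 5],
})

# Sort so the most-rated edition comes first, then keep one row per pair.
best = (
    editions.sort_values("ratings_count", ascending=False)
    .drop_duplicates(subset=["authors", "title"], keep="first")
)

print(best)
```

The original index labels are preserved, which makes it possible to trace each surviving row back to the unsorted frame.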

### Genres

In [36]:
data_unique[data_unique['authors']== 'Jodi Picoult']


Unnamed: 0,book_id,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,similar_books,url
175575,10917,7128,Jodi Picoult,My Sister's Keeper,"Anna is not sick, but she might as well be. By...",Washington Square Press,"['to-read', 'fiction', 'currently-reading', 'f...",4.06,876319,423,2005.0,"['1472878', '5161', '8492768', '5166', '760226...",https://www.goodreads.com/book/show/10917.My_S...
1030252,14866,7128,Jodi Picoult,Nineteen Minutes,"In nineteen minutes, you can mow the front law...",Atria Books,"['currently-reading', 'fiction', 'to-read', 'c...",4.1,239763,440,2007.0,"['217433', '6251052', '5355136', '228128', '91...",https://www.goodreads.com/book/show/14866.Nine...
914495,14864,7128,Jodi Picoult,Plain Truth,"The small town of Paradise, Pennsylvania, is a...",Atria Books,"['to-read', 'currently-reading', 'fiction', 'b...",3.97,129447,405,2004.0,"['161084', '1268348', '297134', '781910', '637...",https://www.goodreads.com/book/show/14864.Plai...
156983,15753740,7128,Jodi Picoult,The Storyteller,Sage Singer befriends an old man who's particu...,Atria,"['to-read', 'currently-reading', 'fiction', 'f...",4.26,111927,460,2013.0,"['14759321', '15793165', '17125479', '2485785'...",https://www.goodreads.com/book/show/15753740-t...
1024817,10909,7128,Jodi Picoult,The Tenth Circle,Fourteen-year-old Trixie Stone is in love for ...,Allen & Ulwin,"['to-read', 'fiction', 'jodi-picoult', 'books-...",3.48,100103,416,2006.0,"['6081394', '3975186', '15850', '8359929', '14...",https://www.goodreads.com/book/show/10909.The_...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
361403,28820307,7128,Jodi Picoult,A l'intérieur,Quand votre fils ne vous regarde jamais dans l...,Createspace Independent Publishing Platform,"['currently-reading', 'jodi-picoult', 'favorit...",4.01,11,0,2014.0,"['2589061', '1470232', '6335026', '4071018', '...",https://www.goodreads.com/book/show/28820307-a...
988419,1021437,7128,Jodi Picoult,Die Macht Des Zweifels,"In the course of her everyday work, career-dri...",Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'books-i-own'...",3.93,10,0,2014.0,"['895617', '8359929', '1152201', '14959', '147...",https://www.goodreads.com/book/show/1021437.Di...
576868,34773147,7128,Jodi Picoult,La Tristesse des éléphants,"La mere de Jenna, Alice, a disparu lorsque cel...",Createspace Independent Publishing Platform,"['to-read', 'currently-reading', 'fiction', 'f...",3.94,10,0,2014.0,"['827361', '18371384', '18722887', '18144049',...",https://www.goodreads.com/book/show/34773147-l...
10581,28189591,7128,Jodi Picoult,Harvesting The Heart,A young woman who was abandoned by her mother ...,Allen & Unwin,"['to-read', 'jodi-picoult', 'currently-reading...",3.59,9,453,2009.0,"['8359929', '297130', '895617', '64694', '3323...",https://www.goodreads.com/book/show/28189591-h...


In [37]:
data_unique.to_csv('Removed_Duplicats_df.csv', index = False)

In [38]:
!pwd

/kaggle/working


In [39]:
df = pd.read_csv('Removed_Duplicats_df.csv')

## Cleaning the Genres Column

In [40]:
data_unique.loc[175575, 'genres']

"['to-read', 'fiction', 'currently-reading', 'favorites', 'books-i-own', 'jodi-picoult', 'chick-lit', 'book-club', 'contemporary', 'owned', 'drama', 'young-adult', 'adult', 'adult-fiction', 'realistic-fiction', 'contemporary-fiction', 'favourites', 'general-fiction', 'owned-books', 'family', 'novels', 'contempor\\u200bary', 'ya', 'made-me-cry', 'my-books', 'bookclub', 'tear-jerker', 'library', 'cancer', 'i-own', 'to-buy', 'default', 'rory-gilmore-reading-challenge', 'favorite-books', 'kindle', 'shelfari-favorites', 'all-time-favorites', 'sisters', 'novel', 'picoult', 'death', 'favorite', 'my-library', 'audiobook', 'book-club-books', 'sad', 'borrowed', 'movies', 'my-favorites', '5-stars', 'read-in-2009', 'tear-jerkers', 'audiobooks', 'romance', 'medical', 'movie', 'stand-alone', 'rory-gilmore-challenge', 'own-it', 'family-relationships', 'book-to-movie', 'coming-of-age', 'realistic', 'book-group', 'audio', 'wish-list', 'books', 'illness', 'relationships', 'literature', 'ebook', 'chickli

In [41]:
data_unique[data_unique['book_id']== 10917]['genres']

175575    ['to-read', 'fiction', 'currently-reading', 'f...
Name: genres, dtype: object

In [42]:
# Strip surrounding whitespace and convert all text to lower case.
# Work on an explicit copy so pandas does not emit a SettingWithCopyWarning.
data_unique = data_unique.copy()
data_unique['genres'] = data_unique['genres'].str.strip().str.lower()


In [43]:
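# Try to split the genres column into multiple columns on ';'.
# Note: each value is a stringified list whose items are separated by ', ',
# so a ';' separator is unlikely to split anything.
genres = data_unique['genres'].str.split(';', expand=True)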

In [44]:
# Stack the columns into a single column and count the frequencies
genre_counts = genres.stack().value_counts()

In [45]:
# Get the top 10 genres
top_genres = genre_counts.head(10).index.tolist()

In [46]:
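# Count the frequencies of the values in the genres column.
# Each value is still a single string (a stringified list), so explode() leaves it
# unchanged and whole lists are counted rather than individual labels.
genre_counts = data_unique['genres'].explode().value_counts()

# Print the 10 most frequent values
print(genre_counts.head(10))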


['to-read']                                   2588
['to-read', 'currently-reading']              1451
['currently-reading', 'to-read']               362
[]                                             239
['to-read', 'poetry']                          164
['currently-reading']                           87
['to-read', 'favorites']                        78
['to-read', 'fiction']                          63
['to-read', 'poetry', 'currently-reading']      41
['to-read', 'kindle']                           41
Name: genres, dtype: int64
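The output above counts whole list strings such as `['to-read', 'poetry']` rather than individual labels, because each `genres` value is a string that only looks like a list. One possible way to get per-label counts is to parse each string into a real list first and then explode it. The sketch below uses made-up toy strings; treating `to-read` and `currently-reading` as shelf names to filter out is an assumption for illustration, not a step taken in this notebook.

```python
import ast

import pandas as pd

# Toy stand-in for the genres column: stringified Python lists (made-up data).
raw = pd.Series([
    "['to-read', 'fiction', 'currently-reading']",
    "['to-read', 'poetry']",
    "[]",
])

# Parse each string into an actual list of labels.
parsed = raw.apply(ast.literal_eval)

# One row per label; empty lists become NaN and are dropped.
labels = parsed.explode().dropna()
label_counts = labels.value_counts()

# Hypothetical filter: drop shelf names that are not genres.
shelf_labels = {"to-read", "currently-reading"}
genre_only_counts = label_counts[~label_counts.index.isin(shelf_labels)]

print(label_counts)
print(genre_only_counts)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.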




