# DATA CLEANING (BOOK DATASET)

### DATA SCRAPED FROM GOOD READS.

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as ply
import seaborn as sns

# Dataset Overview

The dataset contains top books of the 21st century, spanning from the 2000s to the present day. The data is scraped from a popular book website, Goodreads. Some notable books in the dataset include the Harry Potter series, A Thousand Splendid Suns, The Kite Runner, and The Fault in Our Stars.

The dataset consists of a total of 84,033 books and comprises 16 columns.

## Column Description:

- **Title:** The name of the book.
- **Authors:** Author name.
- **Avg Ratings:** Needs to be converted to **Total_ratings**.(**Total_Rating**): No of users who gave  ratings to a book
- **Rating:** needs to be removed same as reviews column
- **First Published:** Year in which the book was first published.
- **Description:** Description about the story in the book.
- **Total Pages:** Number of pages in the book.
- **Genre:** The genre the book belongs to (a style or category of art, music, or literature).
- **Author Followers:** Number of followers of the author on the website.
- **No. of Reviews:** Total number of reviews given to the book.
- **No. of Books by Author:** Number of books written by that author.
- **One Star Ratings:** Number of 1-star ratings out of 5 (Worse).
- **Two Star Ratings:** Number of 2-star ratings out of 5 (Bad).
- **Three Star Ratings:** Number of 3-star ratings out of 5 (Average).
- **Four Star Ratings:** Number of 4-star ratings out of 5 (Best).
- **Five Star Ratings:** Number of 5-star ratings out of 5 (Excellent).

## Additional Information

It is a list of top 21st-century books, so you wouldn't find any classics from the 18th or 19th century here.


## Issues with the data

**1. Dirty Data**
- Remove null values from the datset
- **first_published column**: remove first published text and change its format to datetime.
- **ratings column**: remove ratings text from infront of the number.
- **Change** all columns to their **respective formats.**
- Change syntax of column names
- Remove text from the column **author_followers and no_of_reviews**
- Remove text from **Total_pages** column.
- Remove (,) and (1%) part from **one_star,two_star**,etc columns.
- change column  **avg_ratings to toTal_ratings**
- Drop column **Rating**
- Some values in column **Author_followers**has inconsistent values like 1boo576,1boo890 etc 
- Two columns **Author_followers and Reviews ** has invisible special charaters The values are strings containing numeric characters along with non-printable characters like \xa0, which is the Unicode representation of a non-breaking space.**
- Remove books with page less than 20.
- Multiplying author folowers with 1000 so to convert 'k' into numeric form.




In [3]:
df = pd.read_csv(r"C:\Users\91971\books_new.csv", encoding='latin1', error_bad_lines=False)
df

b'Skipping line 2522: expected 16 fields, saw 37\nSkipping line 3694: expected 16 fields, saw 27\nSkipping line 6593: expected 16 fields, saw 37\nSkipping line 8060: expected 16 fields, saw 19\n'


Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,"ratings84,560","First published July 21, 2007","Harry has been burdened with a dark, dangerous...","759 pages, Hardcover",Fantasy,225k followers,"84,560 reviews",451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,"ratings216,194","First published September 14, 2008","Could you survive on your own in the wild, wit...","374 pages, Hardcover",Young Adult,99.4k followers,"216,194 reviews",71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,"ratings96,940","First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,"371 pages, Paperback",Fiction,154k followers,"96,940 reviews",43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,"ratings143,507","First published September 1, 2005",Librarian's note: An alternate cover edition c...,"592 pages, Hardcover",Historical Fiction,39.2k followers,"143,507 reviews",23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,"ratings56,890","First published July 16, 2005","It is the middle of the summer, but there is a...","652 pages, Paperback",Fantasy,225k followers,"56,890 reviews",451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8321,She of the Mountains,Vivek Shraya,3100,ratings470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...","128 pages, Paperback",Fiction,"1,010 follower",470 reviews,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8322,The Book of Lost Friends,Lisa Wingate,99711,"ratings10,571",07-Apr-20,A new novel inspired by historical events: a s...,"388 pages, Hardcover",Historical Fiction,10.8k follower,"10,571 reviews",49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8323,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,"ratings2,752",05-Aug-14,"On July 8, 1879, Captain George Washington De ...","454 pages, Hardcover",Nonfiction,"1,200 follower","2,752 reviews",23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8324,The Philosophy of Modern Song,Bob Dylan,2980,ratings612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,"352 pages, Hardcover",Music,"1,383 follower",612 reviews,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [4]:
# making a copy of the data
df_copy = df.copy()

## Automatic Assessment 

In [5]:
df_copy.head()

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,"ratings84,560","First published July 21, 2007","Harry has been burdened with a dark, dangerous...","759 pages, Hardcover",Fantasy,225k followers,"84,560 reviews",451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,"ratings216,194","First published September 14, 2008","Could you survive on your own in the wild, wit...","374 pages, Hardcover",Young Adult,99.4k followers,"216,194 reviews",71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,"ratings96,940","First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,"371 pages, Paperback",Fiction,154k followers,"96,940 reviews",43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,"ratings143,507","First published September 1, 2005",Librarian's note: An alternate cover edition c...,"592 pages, Hardcover",Historical Fiction,39.2k followers,"143,507 reviews",23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,"ratings56,890","First published July 16, 2005","It is the middle of the summer, but there is a...","652 pages, Paperback",Fantasy,225k followers,"56,890 reviews",451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"


In [6]:
df.tail()

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
8321,She of the Mountains,Vivek Shraya,3100,ratings470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...","128 pages, Paperback",Fiction,"1,010 follower",470 reviews,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8322,The Book of Lost Friends,Lisa Wingate,99711,"ratings10,571",07-Apr-20,A new novel inspired by historical events: a s...,"388 pages, Hardcover",Historical Fiction,10.8k follower,"10,571 reviews",49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8323,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,"ratings2,752",05-Aug-14,"On July 8, 1879, Captain George Washington De ...","454 pages, Hardcover",Nonfiction,"1,200 follower","2,752 reviews",23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8324,The Philosophy of Modern Song,Bob Dylan,2980,ratings612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,"352 pages, Hardcover",Music,"1,383 follower",612 reviews,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)
8325,The City in the Middle of the Night,Charlie Jane Anders,11630,"ratings2,060",12-Feb-19,ASIN B07FJZDJ8Y moved to the more recent editi...,"348 pages, Kindle Edition",Science Fiction,"3,846 follower","2,060 reviews",147,477 (4%),"1,572 (13%)","3,435 (29%)","3,909 (33%)","2,237 (19%)"


In [7]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8326 entries, 0 to 8325
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               8326 non-null   object
 1   Authors             8326 non-null   object
 2   Avg Ratings         8326 non-null   object
 3   Rating              8326 non-null   object
 4   first_published     8326 non-null   object
 5   Description         8320 non-null   object
 6   Total_Pages         8326 non-null   object
 7   Genre               8326 non-null   object
 8   author_followers    8326 non-null   object
 9   no_of_reviews       8326 non-null   object
 10  no_book_by_author   8326 non-null   object
 11  one_star_ratings    8326 non-null   object
 12  two_star_ratings    8326 non-null   object
 13  three_star_ratings  8326 non-null   object
 14  four_star_ratings   8326 non-null   object
 15  five_star_ratings   8326 non-null   object
dtypes: object(16)
memory usa

**Observation: Every column is in object format.**

In [8]:
df_copy.isnull().sum()

Title                 0
Authors               0
Avg Ratings           0
Rating                0
first_published       0
Description           6
Total_Pages           0
Genre                 0
author_followers      0
no_of_reviews         0
no_book_by_author     0
one_star_ratings      0
two_star_ratings      0
three_star_ratings    0
four_star_ratings     0
five_star_ratings     0
dtype: int64

In [9]:
# remove null 
df_copy.dropna(inplace=True)

In [10]:
# Reset the index to start from 0
df_copy.reset_index(drop=True, inplace=True)

In [11]:
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,"ratings84,560","First published July 21, 2007","Harry has been burdened with a dark, dangerous...","759 pages, Hardcover",Fantasy,225k followers,"84,560 reviews",451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,"ratings216,194","First published September 14, 2008","Could you survive on your own in the wild, wit...","374 pages, Hardcover",Young Adult,99.4k followers,"216,194 reviews",71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,"ratings96,940","First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,"371 pages, Paperback",Fiction,154k followers,"96,940 reviews",43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,"ratings143,507","First published September 1, 2005",Librarian's note: An alternate cover edition c...,"592 pages, Hardcover",Historical Fiction,39.2k followers,"143,507 reviews",23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,"ratings56,890","First published July 16, 2005","It is the middle of the summer, but there is a...","652 pages, Paperback",Fantasy,225k followers,"56,890 reviews",451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,ratings470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...","128 pages, Paperback",Fiction,"1,010 follower",470 reviews,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,"ratings10,571",07-Apr-20,A new novel inspired by historical events: a s...,"388 pages, Hardcover",Historical Fiction,10.8k follower,"10,571 reviews",49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,"ratings2,752",05-Aug-14,"On July 8, 1879, Captain George Washington De ...","454 pages, Hardcover",Nonfiction,"1,200 follower","2,752 reviews",23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,ratings612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,"352 pages, Hardcover",Music,"1,383 follower",612 reviews,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [31]:
df_copy.isnull().sum()

Title                 0
Authors               0
Avg Ratings           0
Rating                0
first_published       0
Description           0
Total_Pages           0
Genre                 0
Author_followers      0
no_of_reviews         0
no_book_by_author     0
one_star_ratings      0
two_star_ratings      0
three_star_ratings    0
four_star_ratings     0
five_star_ratings     0
dtype: int64

### define -> define the problem
### code -> appropriate code for the solution 
### test -> testing if we got the desired output

In [13]:
# Define: Clean the Rating column

In [14]:
# Code
df_copy["Rating"] = df_copy["Rating"].str.replace("ratings","")

In [15]:
df_copy["Rating"] = df_copy["Rating"].str.replace(",","")

In [16]:
# Test
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...","759 pages, Hardcover",Fantasy,225k followers,"84,560 reviews",451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...","374 pages, Hardcover",Young Adult,99.4k followers,"216,194 reviews",71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,"371 pages, Paperback",Fiction,154k followers,"96,940 reviews",43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,"592 pages, Hardcover",Historical Fiction,39.2k followers,"143,507 reviews",23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...","652 pages, Paperback",Fantasy,225k followers,"56,890 reviews",451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...","128 pages, Paperback",Fiction,"1,010 follower",470 reviews,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,"388 pages, Hardcover",Historical Fiction,10.8k follower,"10,571 reviews",49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...","454 pages, Hardcover",Nonfiction,"1,200 follower","2,752 reviews",23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,"352 pages, Hardcover",Music,"1,383 follower",612 reviews,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [17]:
#code
df_copy["Total_Pages"] = df_copy["Total_Pages"].str.split().str[0]


In [18]:
#test
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225k followers,"84,560 reviews",451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4k followers,"216,194 reviews",71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154k followers,"96,940 reviews",43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2k followers,"143,507 reviews",23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225k followers,"56,890 reviews",451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,"1,010 follower",470 reviews,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8k follower,"10,571 reviews",49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,"1,200 follower","2,752 reviews",23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,"1,383 follower",612 reviews,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [19]:
# define: remove the word ' k followers' from the column author_followers

In [32]:
# code
df_copy["author_followers"] = df_copy["author_followers"].str.replace("followers","")
df_copy["author_followers"] = df_copy["author_followers"].str.replace("follower","")
df_copy["author_followers"] = df_copy["author_followers"].str.replace("k","")
df_copy["author_followers"] = df_copy["author_followers"].str.replace(",","")


KeyError: 'author_followers'

In [33]:
#test
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [34]:
# define: add "(in k)" in the column name author_followers

In [35]:
# code
df_copy.rename(columns={'author_followers': 'Author_followers'}, inplace=True)

In [36]:
#test Author_followers
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [37]:
# define: remove word reviews and comma from column no_of_reviews.

In [38]:
# code

df_copy["no_of_reviews"] = df_copy["no_of_reviews"].str.replace("reviews","")
df_copy["no_of_reviews"] = df_copy["no_of_reviews"].str.replace(",","")

In [39]:
# test
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,no_book_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451 books225k followers,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71 books99.4k followers,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43 books154k followers,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23 books39.2k followers,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451 books225k followers,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [40]:
# define: correcting the name of the column with getting only the number of books in column no_book_by_author

In [41]:
# code

df_copy.rename(columns={"no_book_by_author":"No_of_books_by_author"},inplace=True)


In [42]:
df_copy["No_of_books_by_author"] = df_copy["No_of_books_by_author"].str.split().str[0]

In [43]:
# test 
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,"30,370 (<1%)","43,345 (1%)","206,753 (5%)","722,799 (19%)","2,671,161 (72%)"
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,"120,406 (1%)","218,605 (2%)","981,400 (11%)","2,604,669 (30%)","4,652,606 (54%)"
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,"46,881 (1%)","83,304 (2%)","323,768 (10%)","971,488 (31%)","1,696,823 (54%)"
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,"33,196 (1%)","65,113 (2%)","249,718 (9%)","712,353 (28%)","1,471,207 (58%)"
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,"17,279 (<1%)","35,392 (1%)","216,854 (6%)","763,716 (23%)","2,209,796 (68%)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20 (<1%),119 (3%),574 (18%),"1,282 (41%)","1,105 (35%)"
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734 (<1%),"2,750 (2%)","15,790 (15%)","41,264 (41%)","39,173 (39%)"
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332 (1%),567 (2%),"2,576 (11%)","8,155 (36%)","10,511 (47%)"
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61 (2%),234 (7%),784 (26%),"1,052 (35%)",849 (28%)


In [44]:
# define : removing percentages from 5 columns
# one_star_ratings	two_star_ratings	three_star_ratings	four_star_ratings	five_star_ratings

In [45]:
# code
df_copy["one_star_ratings"] = df_copy["one_star_ratings"].str.split('(', n=1).str[0]
df_copy["two_star_ratings"] = df_copy["two_star_ratings"].str.split('(', n=1).str[0]
df_copy["three_star_ratings"] = df_copy["three_star_ratings"].str.split('(', n=1).str[0]
df_copy["four_star_ratings"] = df_copy["four_star_ratings"].str.split('(', n=1).str[0]
df_copy["five_star_ratings"] = df_copy["five_star_ratings"].str.split('(', n=1).str[0]


In [46]:
# test
df_copy

Unnamed: 0,Title,Authors,Avg Ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [47]:
# define: remove comma from the same 5 columns and also from column Avg Ratings also change it to Total_Ratings.

In [48]:
# code
df_copy["one_star_ratings"] = df_copy["one_star_ratings"].str.replace(",","")
df_copy["two_star_ratings"] = df_copy["two_star_ratings"].str.replace(",","")
df_copy["three_star_ratings"] = df_copy["three_star_ratings"].str.replace(",","")
df_copy["four_star_ratings"] = df_copy["four_star_ratings"].str.replace(",","")
df_copy["five_star_ratings"] = df_copy["five_star_ratings"].str.replace(",","")
df_copy["Avg Ratings"] = df_copy["Avg Ratings"].str.replace(",","")

In [49]:
df_copy.rename(columns = {"Avg Ratings":"Avg_ratings"},inplace=True)

In [50]:
df_copy.rename(columns = {"Avg_ratings":"Total_ratings"},inplace=True)

In [51]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,Rating,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,84560,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,216194,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,96940,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,143507,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,56890,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,470,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,10571,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2752,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,612,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [52]:
#define: drop column Ratings

In [53]:
# code
# Drop column 'B'
df_copy = df_copy.drop(columns=['Rating'])

In [54]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,"First published July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,"First published September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,"First published May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,"First published September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,"First published July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [55]:
# define: remove word first published from column first_published and convert it into date time format 

In [56]:
# code
df_copy["first_published"] = df_copy["first_published"].str.replace("First published","")
df_copy["first_published"] = df_copy["first_published"].str.replace("Published","")
df_copy["first_published"] = df_copy["first_published"].str.replace("Audio CD","")

In [57]:
df_copy_1 = df_copy

In [58]:
# Define a regular expression pattern to match 'paperback' or 'hardcover'
#pattern = r'paperback|hardcover|\d+\s+pages,\s+Kindle\s+Edition'
pattern = r'\d+\s+pages.*?(ebook|hardcover|paperback|Kindle)'

# Filter rows containing 'paperback' or 'hardcover' in the 'first_published' column using regex
filtered_df = df_copy_1[~df_copy_1['first_published'].str.contains(pattern, case=False, na=False)]

# Display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)

Filtered DataFrame:
                                                  Title              Authors  \
0                  Harry Potter and the Deathly Hallows         J.K. Rowling   
1                                      The Hunger Games      Suzanne Collins   
2                                       The Kite Runner      Khaled Hosseini   
3                                        The Book Thief         Markus Zusak   
4                Harry Potter and the Half-Blood Prince         J.K. Rowling   
...                                                 ...                  ...   
8315                               She of the Mountains         Vivek Shraya   
8316                           The Book of Lost Friends         Lisa Wingate   
8317  In the Kingdom of Ice: The Grand and Terrible ...        Hampton Sides   
8318                      The Philosophy of Modern Song            Bob Dylan   
8319                The City in the Middle of the Night  Charlie Jane Anders   

     Total_ratings 

  return func(self, *args, **kwargs)


In [59]:
filtered_df

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,"July 21, 2007","Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,"September 14, 2008","Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,"May 29, 2003",1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,"September 1, 2005",Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,"July 16, 2005","It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,14-Oct-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,07-Apr-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,05-Aug-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,01-Nov-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [60]:
df_copy = filtered_df

In [61]:
# Convert 'date' column to datetime, handle errors by setting invalid dates to NaT (Not a Time)
df_copy['first_published'] = pd.to_datetime(df_copy['first_published'], errors='coerce')

# Drop rows with NaT values
df_copy = df_copy.dropna()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['first_published'] = pd.to_datetime(df_copy['first_published'], errors='coerce')


In [62]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-09-01,Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,2020-04-07,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-08-05,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,2022-11-01,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [63]:
# checkpoint save it
df_copy_1 = df_copy

In [64]:
# Convert the 'date' column to datetime format
df_copy['first_published'] = pd.to_datetime(df_copy['first_published'])

# Convert the datetime format to the desired format (DD-MM-YY)
df_copy['first_published'] = df_copy['first_published'].dt.strftime('%d-%m-%y')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['first_published'] = pd.to_datetime(df_copy['first_published'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['first_published'] = df_copy['first_published'].dt.strftime('%d-%m-%y')


In [65]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,21-07-07,"Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,14-09-08,"Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,29-05-03,1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,01-09-05,Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,16-07-05,"It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,14-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,07-04-20,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,05-08-14,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,01-11-22,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [66]:
# chevkpoint 2 
df_copy_1 = df_copy

In [67]:
df_copy['first_published'] = pd.to_datetime(df_copy['first_published'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['first_published'] = pd.to_datetime(df_copy['first_published'], errors='coerce')


In [68]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759,Fantasy,225,84560,451,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374,Young Adult,99.4,216194,71,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371,Fiction,154,96940,43,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-01-09,Librarian's note: An alternate cover edition c...,592,Historical Fiction,39.2,143507,23,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652,Fantasy,225,56890,451,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128,Fiction,1010,470,13,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,2020-07-04,A new novel inspired by historical events: a s...,388,Historical Fiction,10.8,10571,49,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-05-08,"On July 8, 1879, Captain George Washington De ...",454,Nonfiction,1200,2752,23,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,2022-01-11,The Philosophy of Modern Song is Bob Dylans f...,352,Music,1383,612,509,61,234,784,1052,849


In [69]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8281 entries, 0 to 8319
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Title                  8281 non-null   object        
 1   Authors                8281 non-null   object        
 2   Total_ratings          8281 non-null   object        
 3   first_published        8281 non-null   datetime64[ns]
 4   Description            8281 non-null   object        
 5   Total_Pages            8281 non-null   object        
 6   Genre                  8281 non-null   object        
 7   Author_followers       8281 non-null   object        
 8   no_of_reviews          8281 non-null   object        
 9   No_of_books_by_author  8281 non-null   object        
 10  one_star_ratings       8281 non-null   object        
 11  two_star_ratings       8281 non-null   object        
 12  three_star_ratings     8281 non-null   object        
 13  fou

In [70]:
# chevkpoint 3
df_copy_1 = df_copy

In [71]:
# define: convert multiple object data type coluns to numeric


In [72]:
# code
# Convert columns to numeric type
df_copy[['Total_ratings', 'one_star_ratings', 'two_star_ratings','three_star_ratings','four_star_ratings','five_star_ratings','Total_Pages','No_of_books_by_author']] = df_copy[['Total_ratings','one_star_ratings', 'two_star_ratings','three_star_ratings','four_star_ratings','five_star_ratings','Total_Pages','No_of_books_by_author']].apply(pd.to_numeric, errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [73]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759.0,Fantasy,225,84560,451.0,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374.0,Young Adult,99.4,216194,71.0,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371.0,Fiction,154,96940,43.0,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-01-09,Librarian's note: An alternate cover edition c...,592.0,Historical Fiction,39.2,143507,23.0,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652.0,Fantasy,225,56890,451.0,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128.0,Fiction,1010,470,13.0,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,2020-07-04,A new novel inspired by historical events: a s...,388.0,Historical Fiction,10.8,10571,49.0,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-05-08,"On July 8, 1879, Captain George Washington De ...",454.0,Nonfiction,1200,2752,23.0,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,2022-01-11,The Philosophy of Modern Song is Bob Dylans f...,352.0,Music,1383,612,509.0,61,234,784,1052,849


In [74]:
# define:  The values are strings containing numeric characters 
# along with non-printable characters like \xa0, which is the Unicode representation of a non-breaking space.
unique_values = df_copy['no_of_reviews'].unique()
print(unique_values)

['84560\xa0' '216194\xa0' '96940\xa0' ... '470\xa0' '10571\xa0' '2060\xa0']


In [75]:
# Remove non-numeric characters and convert the column to numeric type
df_copy['no_of_reviews'] = df_copy['no_of_reviews'].str.replace('\xa0', '')

df_copy['no_of_reviews'] = df_copy['no_of_reviews'].str.replace('1review', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['no_of_reviews'] = df_copy['no_of_reviews'].str.replace('\xa0', '')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['no_of_reviews'] = df_copy['no_of_reviews'].str.replace('1review', '')


In [76]:
df_copy['no_of_reviews'] = pd.to_numeric(df_copy['no_of_reviews'], errors='ignore')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['no_of_reviews'] = pd.to_numeric(df_copy['no_of_reviews'], errors='ignore')


In [77]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8281 entries, 0 to 8319
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Title                  8281 non-null   object        
 1   Authors                8281 non-null   object        
 2   Total_ratings          8281 non-null   int64         
 3   first_published        8281 non-null   datetime64[ns]
 4   Description            8281 non-null   object        
 5   Total_Pages            8138 non-null   float64       
 6   Genre                  8281 non-null   object        
 7   Author_followers       8281 non-null   object        
 8   no_of_reviews          8221 non-null   float64       
 9   No_of_books_by_author  8099 non-null   float64       
 10  one_star_ratings       8281 non-null   int64         
 11  two_star_ratings       8281 non-null   int64         
 12  three_star_ratings     8281 non-null   int64         
 13  fou

In [78]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759.0,Fantasy,225,84560.0,451.0,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374.0,Young Adult,99.4,216194.0,71.0,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371.0,Fiction,154,96940.0,43.0,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-01-09,Librarian's note: An alternate cover edition c...,592.0,Historical Fiction,39.2,143507.0,23.0,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652.0,Fantasy,225,56890.0,451.0,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8315,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128.0,Fiction,1010,470.0,13.0,20,119,574,1282,1105
8316,The Book of Lost Friends,Lisa Wingate,99711,2020-07-04,A new novel inspired by historical events: a s...,388.0,Historical Fiction,10.8,10571.0,49.0,734,2750,15790,41264,39173
8317,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-05-08,"On July 8, 1879, Captain George Washington De ...",454.0,Nonfiction,1200,2752.0,23.0,332,567,2576,8155,10511
8318,The Philosophy of Modern Song,Bob Dylan,2980,2022-01-11,The Philosophy of Modern Song is Bob Dylans f...,352.0,Music,1383,612.0,509.0,61,234,784,1052,849


In [79]:
# define: Some values in column Author_followers has inconsistent values like 1boo576,1boo890 etc and convert it to numeric column

In [80]:
unique_values = df_copy['Author_followers'].unique()
print(unique_values)

['225\xa0' '99.4\xa0' '154\xa0' ... '795\xa0' '4741\xa0' '1\xa0boo98\xa0']


In [81]:
# Some values in column Author_followers has inconsistent values like 1boo576,1boo890 etc 

# Filter out rows containing 'boo' in the specified column
df_copy = df_copy[~df_copy['Author_followers'].str.contains('boo')]

# Reset the index if needed
df_copy.reset_index(drop=True, inplace=True)

# Verify the DataFrame after deletion
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759.0,Fantasy,225,84560.0,451.0,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374.0,Young Adult,99.4,216194.0,71.0,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371.0,Fiction,154,96940.0,43.0,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-01-09,Librarian's note: An alternate cover edition c...,592.0,Historical Fiction,39.2,143507.0,23.0,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652.0,Fantasy,225,56890.0,451.0,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8167,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128.0,Fiction,1010,470.0,13.0,20,119,574,1282,1105
8168,The Book of Lost Friends,Lisa Wingate,99711,2020-07-04,A new novel inspired by historical events: a s...,388.0,Historical Fiction,10.8,10571.0,49.0,734,2750,15790,41264,39173
8169,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-05-08,"On July 8, 1879, Captain George Washington De ...",454.0,Nonfiction,1200,2752.0,23.0,332,567,2576,8155,10511
8170,The Philosophy of Modern Song,Bob Dylan,2980,2022-01-11,The Philosophy of Modern Song is Bob Dylans f...,352.0,Music,1383,612.0,509.0,61,234,784,1052,849


In [82]:
# Remove non-numeric characters and convert the column to numeric type
df_copy['Author_followers'] = df_copy['Author_followers'].str.replace('\xa0', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['Author_followers'] = df_copy['Author_followers'].str.replace('\xa0', '')


In [83]:
df_copy['Author_followers'] = pd.to_numeric(df_copy['Author_followers'], errors='ignore')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['Author_followers'] = pd.to_numeric(df_copy['Author_followers'], errors='ignore')


In [90]:
df_copy['Author_followers'] = df_copy['Author_followers'] * 1000

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['Author_followers'] = df_copy['Author_followers'] * 1000


In [91]:
df_copy

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
0,Harry Potter and the Deathly Hallows,J.K. Rowling,3674428,2007-07-21,"Harry has been burdened with a dark, dangerous...",759.0,Fantasy,225000.0,84560.0,451.0,30370,43345,206753,722799,2671161
1,The Hunger Games,Suzanne Collins,8577686,2008-09-14,"Could you survive on your own in the wild, wit...",374.0,Young Adult,99400.0,216194.0,71.0,120406,218605,981400,2604669,4652606
2,The Kite Runner,Khaled Hosseini,3122264,2003-05-29,1970s Afghanistan: Twelve-year-old Amir is des...,371.0,Fiction,154000.0,96940.0,43.0,46881,83304,323768,971488,1696823
3,The Book Thief,Markus Zusak,2531587,2005-01-09,Librarian's note: An alternate cover edition c...,592.0,Historical Fiction,39200.0,143507.0,23.0,33196,65113,249718,712353,1471207
4,Harry Potter and the Half-Blood Prince,J.K. Rowling,3243037,2005-07-16,"It is the middle of the summer, but there is a...",652.0,Fantasy,225000.0,56890.0,451.0,17279,35392,216854,763716,2209796
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8167,She of the Mountains,Vivek Shraya,3100,2014-10-14,"Finalist, Lambda Literary AwardIn the beginnin...",128.0,Fiction,1010000.0,470.0,13.0,20,119,574,1282,1105
8168,The Book of Lost Friends,Lisa Wingate,99711,2020-07-04,A new novel inspired by historical events: a s...,388.0,Historical Fiction,10800.0,10571.0,49.0,734,2750,15790,41264,39173
8169,In the Kingdom of Ice: The Grand and Terrible ...,Hampton Sides,22141,2014-05-08,"On July 8, 1879, Captain George Washington De ...",454.0,Nonfiction,1200000.0,2752.0,23.0,332,567,2576,8155,10511
8170,The Philosophy of Modern Song,Bob Dylan,2980,2022-01-11,The Philosophy of Modern Song is Bob Dylans f...,352.0,Music,1383000.0,612.0,509.0,61,234,784,1052,849


In [92]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8172 entries, 0 to 8171
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Title                  8172 non-null   object        
 1   Authors                8172 non-null   object        
 2   Total_ratings          8172 non-null   int64         
 3   first_published        8172 non-null   datetime64[ns]
 4   Description            8172 non-null   object        
 5   Total_Pages            8032 non-null   float64       
 6   Genre                  8172 non-null   object        
 7   Author_followers       8172 non-null   float64       
 8   no_of_reviews          8119 non-null   float64       
 9   No_of_books_by_author  7990 non-null   float64       
 10  one_star_ratings       8172 non-null   int64         
 11  two_star_ratings       8172 non-null   int64         
 12  three_star_ratings     8172 non-null   int64         
 13  fou

In [93]:
df_copy[df_copy["Total_Pages"] < 20]

Unnamed: 0,Title,Authors,Total_ratings,first_published,Description,Total_Pages,Genre,Author_followers,no_of_reviews,No_of_books_by_author,one_star_ratings,two_star_ratings,three_star_ratings,four_star_ratings,five_star_ratings
53,The Devil in the White City,Erik Larson,676909,2003-11-02,"Murder, Magic, and Madness at the Fair That Ch...",15.0,Nonfiction,66400.0,39983.0,40.0,16149,36601,128842,245717,249600
526,Saving CeeCee Honeycutt,Beth Hoffman,83962,2010-12-01,Twelve-year-old CeeCee is in trouble. For year...,10.0,Fiction,1463000.0,9350.0,15.0,1202,4089,18800,33296,26575
1214,Without Fail,Lee Child,104889,2002-05-13,"Skilled, cautious, and anonymous, Jack Reacher...",14.0,Thriller,30600.0,3240.0,307.0,637,2241,17396,43854,40761
1382,Rhett Butler's People,Donald McCaig,20324,2007-01-01,Fully authorized by the Margaret Mitchell esta...,18.0,Historical Fiction,115000.0,1925.0,52.0,910,1711,4771,6039,6893
1459,The Only Plane in the Sky: An Oral History of ...,Garrett M. Graff,34373,2019-10-09,"15 hours, 54 minutesRead by a 45-person cast, ...",16.0,Nonfiction,421000.0,5698.0,18.0,55,154,1136,6586,26442
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7679,"Motherland: A Memoir of Love, Loathing, and Lo...",Elissa Altman,912,2019-06-08,"A richly textured, multilayered story about mo...",7.0,Memoir,94000.0,144.0,10.0,14,61,264,325,248
7727,A Collegiate Casting-Out of Devilish Devices,Terry Pratchett,1045,2005-05-13,Free online fiction.Wizards at Terry Pratchett...,6.0,Fantasy,42500.0,72.0,488.0,7,62,321,360,295
7825,Sadie,Courtney Summers,117124,2018-04-09,An innovative audiobook production featuring m...,7.0,Young Adult,7474000.0,22740.0,14.0,1347,4687,21151,49130,40809
7884,What My Bones Know: A Memoir of Healing from C...,Stephanie Foo,40324,2022-02-22,A searing memoir of reckoning and healing from...,11.0,Memoir,1005000.0,5302.0,2.0,166,568,2986,11248,25356


In [94]:
# Drop rows where Total_Pages < 20
df_copy = df_copy.drop(df_copy[df_copy["Total_Pages"] < 20].index)

In [95]:
df_copy.to_csv("Cleaned_book_dataset.csv",index=False)