--- Kheire --- 

Work in progress. 

Data Cleaning Phase 1 is ready. 

Data Cleaning Phase 2 is work in progress. It is done on the scraped data. 

This Notebook illustrates the different steps taken to do the data cleaning and feature engineering.

In [4]:
# import necessary libraries
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
from utils import *
import re

# Data Cleaning Phase 1

This the first phase of data cleaning. It is very basic and is done on the data format and columns names. 

The first thing noticed was that when reading the original csv file "books.csv" some contents of sepicific sections in specific rows were separated by commas. This caused an error, because pd.read_csv considered them as content belonging to different sections (i.e. columns), resulting in different number of columns for several rows. Luckily, these content were separate by a comma and space ", ", which made it easy to solve the problem by replacing the ", " by "/" as seen below. "books_updated.csv" is the new updated csv which is read by pandas with success.

In [3]:
### COMMENT THE FOLLOWING CELL IF YOU HAVE ALREADY RUN IT BEFORE AND SAVED THE books_updated.csv ###
# Specify the filename
filename = 'books.csv'
updated_filename = "books_updated.csv"

## In some cases in certain fields such us author the authors are separated by ", " 
## to avoid problems when reading the csv directly using pandas and to retain all rows the following steps done

# Open the file and read lines
with open(filename, 'r', encoding='utf-8') as file: # utf-8 encoding to support all languages since there are non-english content
    lines = file.readlines()

# Process the lines to handle unwanted delimiters
cleaned_lines = []
is_firstline = True # used to avoid updating the first line

for line in lines:
    
    if is_firstline: # if it is the first line i.e. Header do not update it
        is_firstline = False
        cleaned_lines.append(line)
        continue
    cleaned_line = line.replace(', ', '/')

    # Append the cleaned line
    cleaned_lines.append(cleaned_line)

# rejoin all the lines to create a new updated csv
with open(updated_filename, 'w', encoding='utf-8') as output_file:
    for cleaned_line in cleaned_lines:
        output_file.write(cleaned_line + '\n')


In [4]:
# Read the cleaned data into a DataFrame
df = pd.read_csv(updated_filename, delimiter = ",", encoding = 'utf-8', index_col=False) 

# Look at the first 5 rows of the DataFrame
df.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


Looking closer at the column names, we notice that in one of the column '  num_pages' the name starts with space. For ease of use, it is better to remove this unnecessary space.

In [5]:
df.columns

Index(['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
       'language_code', '  num_pages', 'ratings_count', 'text_reviews_count',
       'publication_date', 'publisher'],
      dtype='object')

In [6]:
# Remove the space before the column num_pages, for ease of use
df.rename(columns={'  num_pages': '  num_pages'.replace(' ', '')}, inplace=True)
df.columns

Index(['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
       'language_code', 'num_pages', 'ratings_count', 'text_reviews_count',
       'publication_date', 'publisher'],
      dtype='object')

Examining the column types, it is noticed that some of the columns that are supposed to be numerical types are object types

In [7]:
types_columns=df.dtypes
number_lines,number_columns=df.shape
print(types_columns)

bookID                 object
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                 object
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count     object
publication_date       object
publisher              object
dtype: object


In [8]:
# investigate if their are non-numerical non-date types in supposedly numerical and date columns

# text_reviews_count 
# Filter the DataFrame to get non-numerical values in the specified column
def check_non_numerical_date(columns_list: list, intended_type: str):
    
    if intended_type == "numerical":
        for column in columns_list:
            non_numerical_values = df.loc[~df[column].apply(pd.to_numeric, errors='coerce').notna()]
            print("The non_numerical_values in column {} : ".format(column), non_numerical_values[column])

    elif intended_type == "date":
        for column in columns_list:
            non_numerical_values = df.loc[~df[column].apply(pd.to_datetime, errors='coerce').notna()]
            print("The non_date_values in column {} : ".format(column), non_numerical_values[column])
    
# Display the non-numerical and non_date values of the selected columns
columns = ["isbn", "isbn13", "text_reviews_count"]
check_non_numerical_date(columns, "numerical")
check_non_numerical_date(["publication_date"], "date")

The non_numerical_values in column isbn :  3        043965548X
12       076790818X
16       076790382X
27       097669400X
40       006076273X
            ...    
11090    030727411X
11101    074347788X
11106    057305133X
11110    843221728X
11115    972233168X
Name: isbn, Length: 985, dtype: object
The non_numerical_values in column isbn13 :  1847    en-US
Name: isbn13, dtype: object
The non_numerical_values in column text_reviews_count :  1847    9/1/2003
Name: text_reviews_count, dtype: object
The non_date_values in column publication_date :  1847      MTV Books
8180     11/31/2000
11098     6/31/1982
Name: publication_date, dtype: object


From the above investigation one can notice the following : 
- isbn is not numerical since some of the isbn codes end with X 
- we notice that at index location 1847 the data is shifted to the left
- dates at index locations 8180 and 11098 were not considered as dates since they are wrong. June and November are 31 days, but in reality they are only 30 days.

Also, as seen below there is one empty cell in publisher column, which is on index location 1847.

In [9]:
df.isna().sum()

bookID                0
title                 0
authors               0
average_rating        0
isbn                  0
isbn13                0
language_code         0
num_pages             0
ratings_count         0
text_reviews_count    0
publication_date      0
publisher             1
dtype: int64

In [10]:
df["publisher"][df["publisher"].isna()==True]

1847    NaN
Name: publisher, dtype: object

In [11]:
df.iloc[1847]

bookID                6549/ said the shotgun to the head.
title                                       Saul Williams
authors                                              4.22
average_rating                                743470796.0
isbn                                        9780743470797
isbn13                                              en-US
language_code                                         192
num_pages                                            2762
ratings_count                                         214
text_reviews_count                               9/1/2003
publication_date                                MTV Books
publisher                                             NaN
Name: 1847, dtype: object

One can notice that there is a / in the bookID which separates the BookID from the book title. This must be a result from updating automatically the csv files at the beginning. 

Below is the code to fix this unwanted mistake

In [12]:
bookid_title = df["bookID"].iloc[1847].split('/')


# content of the remaining columns
remaining_columns = df[['title', 'authors', 'average_rating', 'isbn', 'isbn13',
                                   'language_code', 'num_pages', 'ratings_count', 
                                   'text_reviews_count', 'publication_date']].iloc[1847]


# rearrange cells content for each column
df.iloc[1847, 2:] = remaining_columns
df["bookID"].iloc[1847]= bookid_title[0]
df["title"].iloc[1847]= bookid_title[1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["bookID"].iloc[1847]= bookid_title[0]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["title"].iloc[1847]= bookid_title[1]


In [13]:
# rechek the content 
df.iloc[1847]

bookID                                          6549
title                  said the shotgun to the head.
authors                                Saul Williams
average_rating                                  4.22
isbn                                     743470796.0
isbn13                                 9780743470797
language_code                                  en-US
num_pages                                        192
ratings_count                                   2762
text_reviews_count                               214
publication_date                            9/1/2003
publisher                                  MTV Books
Name: 1847, dtype: object

In [14]:
# Check the types of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   bookID              11127 non-null  object
 1   title               11127 non-null  object
 2   authors             11127 non-null  object
 3   average_rating      11127 non-null  object
 4   isbn                11127 non-null  object
 5   isbn13              11127 non-null  object
 6   language_code       11127 non-null  object
 7   num_pages           11127 non-null  object
 8   ratings_count       11127 non-null  int64 
 9   text_reviews_count  11127 non-null  object
 10  publication_date    11127 non-null  object
 11  publisher           11127 non-null  object
dtypes: int64(1), object(11)
memory usage: 1.0+ MB


For some reasons the supposedly numerical columns became objects 
Re-investigate again if they contain non-numerical content

In [15]:
# re-investigate if there is non-numerical content in supposedly numerical columns
columns = ["average_rating", "isbn13", "text_reviews_count", "num_pages", "ratings_count", "text_reviews_count"]
check_non_numerical_date(columns, "numerical")

The non_numerical_values in column average_rating :  Series([], Name: average_rating, dtype: object)
The non_numerical_values in column isbn13 :  Series([], Name: isbn13, dtype: object)
The non_numerical_values in column text_reviews_count :  Series([], Name: text_reviews_count, dtype: object)
The non_numerical_values in column num_pages :  Series([], Name: num_pages, dtype: object)
The non_numerical_values in column ratings_count :  Series([], Name: ratings_count, dtype: int64)
The non_numerical_values in column text_reviews_count :  Series([], Name: text_reviews_count, dtype: object)


They do not contain non-numerical content, so will transform them to numerical

In [16]:
# convert to numerical
def convert_to_numerical(columns_list: list):
    
    for column in columns_list:
        df[column] = df[column].apply(pd.to_numeric)

columns = ["average_rating", "isbn13", "text_reviews_count", "num_pages", "ratings_count", "text_reviews_count"]

convert_to_numerical(columns)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11127 non-null  object 
 1   title               11127 non-null  object 
 2   authors             11127 non-null  object 
 3   average_rating      11127 non-null  float64
 4   isbn                11127 non-null  object 
 5   isbn13              11127 non-null  int64  
 6   language_code       11127 non-null  object 
 7   num_pages           11127 non-null  int64  
 8   ratings_count       11127 non-null  int64  
 9   text_reviews_count  11127 non-null  int64  
 10  publication_date    11127 non-null  object 
 11  publisher           11127 non-null  object 
dtypes: float64(1), int64(4), object(7)
memory usage: 1.0+ MB


In [17]:
# fix the dates by replacing 31 with 30 in June and November
# convert the date column to type pd.datetime
df["publication_date"].iloc[8180] = '11/30/2000'
df["publication_date"].iloc[11098] = '6/30/1982'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["publication_date"].iloc[8180] = '11/30/2000'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["publication_date"].iloc[11098] = '6/30/1982'


Resave the DataFrame to avoid repeating the process each time we want to use the df

In [18]:
df.to_csv("books_updated.csv", index_label=False) # index_label = False so that it does not add another index label to the DataFrame

# Data Cleaning Phase 2

This data cleaning is done on the data set after doing webscraping. Through webscraping extra features were extracted which are: 

- first_publish : this is the date a book was first published (this will help distinguish two different books with the same name)
- book_format : this is the format of the book (some books are present in different formats (paperback, Audio CD, hardcover))
- new_publisher: this is because it was noticed that in the original data some books had wrong publishers
- edition_avgRating: this is the actaul average rating of each edition
- added_toShelves: this is the number of users that added a book to shelves

*For more details about the process of scraping, please refer to scraper/scraper.py*

### Read the Data

In [195]:
# read the scraped data
df_scraped = pd.read_csv("scraper/booksRating_extraFeats.csv")

In [196]:
df_scraped.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,2006-09-16,Scholastic Inc.,"July 16, 2005",Paperback,Scholastic Inc,4.57,4405980.0
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,2004-09-01,Scholastic Inc.,"June 21, 2003",Paperback,Scholastic Inc.,4.5,4518536.0
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,2003-11-01,Scholastic,"July 2, 1998",Hardcover,,4.05,7469.0
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,2004-05-01,Scholastic Inc.,"July 8, 1999",Mass Market Paperback,Scholastic Inc.,4.57,5223956.0
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,2004-09-13,Scholastic,"October 1, 2003",Paperback,Scholastic,4.72,172736.0


In [197]:
df_scraped.tail()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
11122,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,eng,512,156,20,2004-12-21,Da Capo Press,"December 1, 2004",Paperback,Da Capo Press,4.06,552.0
11123,45633,You Bright and Risen Angels,William T. Vollmann,4.08,140110879,9780140110876,eng,635,783,56,1988-12-01,Penguin Books,"June 27, 1987",Paperback,Penguin Books,4.04,4269.0
11124,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,140131965,9780140131963,eng,415,820,95,1993-08-01,Penguin Books,"January 1, 1990",Paperback,Penguin Publishing Group,3.98,4686.0
11125,45639,Poor People,William T. Vollmann,3.72,60878827,9780060878825,eng,434,769,139,2007-02-27,Ecco,"January 1, 2007",Hardcover,Ecco,3.75,2948.0
11126,45641,Las aventuras de Tom Sawyer,Mark Twain,3.91,8497646983,9788497646987,spa,272,113,12,2006-05-28,Edimat Libros,"June 1, 1876",Paperback,,3.72,228.0


In [198]:
df_scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11127 non-null  int64  
 1   title               11127 non-null  object 
 2   authors             11127 non-null  object 
 3   average_rating      11127 non-null  float64
 4   isbn                11127 non-null  object 
 5   isbn13              11127 non-null  int64  
 6   language_code       11127 non-null  object 
 7   num_pages           11127 non-null  int64  
 8   ratings_count       11127 non-null  int64  
 9   text_reviews_count  11127 non-null  int64  
 10  publication_date    11127 non-null  object 
 11  publisher           11127 non-null  object 
 12  first_published     11125 non-null  object 
 13  book_format         11126 non-null  object 
 14  new_publisher       8360 non-null   object 
 15  edition_avgRating   11124 non-null  float64
 16  adde

### Fill Missing Data

In the new dataset, there is 2 missing data in the first_published, 1 missing value in the book_format, 3 missing values in the edition_avgRating and added_toShelves. These were values that the scraper failed to get. Since they are very few, they will be filled manually. 

In [199]:
df_scraped[df_scraped[["first_published", "book_format", "edition_avgRating", "added_toShelves"]].isna().any(axis=1)]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
2255,8077,Animales No Se Visten Los (Animals Should Def...,Judi Barrett/Ron Barrett,4.11,1595191356,9781595191359,eng,32,0,0,1991-10-30,Live Oak Media,"January 1, 1970",Paperback,Atheneum Books for Young Readers,,
6396,24062,The Deep (Dive Trilogy),Gordon Korman,3.78,613674839,9780613674836,eng,148,0,0,2003-07-01,Turtleback Books,,,,,
8476,32552,Essential Tales and Poems,Edgar Allan Poe/Benjamin F. Fisher,4.36,1593080646,9781593080648,en-US,688,66382,109,2004-10-25,Barnes Noble Classics,,Paperback,Barnes & Noble Classics,4.36,92622.0
8520,32703,The Diary of Ellen Rimbauer: My Life at Rose Red,Joyce Reardon/Steven Rimbauer/Ridley Pearson,3.67,786890436,9780786890439,eng,277,7852,352,2001-04-29,Hyperion,"January 1, 2001",Mass Market Paperback,Hyperion,,


In [200]:
df_scraped.loc[2255, ["edition_avgRating", "added_toShelve"]] = [4.11, 5402]
# df_scraped.loc[6396, ["first_published", "book_format", "new_publisher", "edition_avgRating", "added_toShelve"]] ### !!!! Couldn'y find exact book edition online !!!!
df_scraped.loc[8476, ["first_published"]] = ["January 1, 1843"]
df_scraped.loc[8520, ["edition_avgRating", "added_toShelve"]] = [3.7, 15942]

One of the books at index 6396 was not found on good reads even when searching by book ID, isbn and isbn13. Therefore, the edition_avgRating will be the same as the average_rating. added_toShelves will be equal to the mean of the added_toShelves of the other editions of the book; if other editions are not found the added_toShelves will be equal to the mode of the books having same ratings_count and text_reviews_count. The first_published will be the same as the publication_date and the book_format will be the mode of the books of approximately the same size. 

In [201]:
df_scraped[df_scraped['title'] == df_scraped.loc[6396, 'title']]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves,added_toShelve
6396,24062,The Deep (Dive Trilogy),Gordon Korman,3.78,613674839,9780613674836,eng,148,0,0,2003-07-01,Turtleback Books,,,,,,


In [202]:
df_scraped.loc[df_scraped['title'].str.contains('dive', case=False)]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves,added_toShelve
2021,7271,Marine Conservation Biology: The Science of Ma...,Elliott A. Norse/Larry B. Crowder/Michael E. S...,4.59,1559636629,9781559636629,eng,496,16,2,2005-05-09,Island Press,"May 9, 2005",Paperback,Island Press,4.5,175.0,
4458,16059,The Dive From Clausen's Pier,Ann Packer,3.42,375727132,9780375727139,eng,432,19844,1747,2003-04-08,Vintage,"January 1, 2002",Paperback,Vintage,3.44,34745.0,
4459,16063,The Dive from Clausen's Pier,Ann Packer,3.42,749933631,9780749933630,eng,368,60,8,2003-03-27,Piatkus,"January 1, 2002",Paperback,,3.15,90.0,
6385,24037,The Deep (Dive #2),Gordon Korman,3.78,439507235,9780439507233,eng,148,1436,55,2003-07-01,Scholastic,"July 1, 2003",Paperback,Scholastic,3.78,2837.0,
6386,24040,The Discovery (Dive #1),Gordon Korman,3.72,439507227,9780439507226,eng,141,1858,135,2003-06-01,Apple Paperbacks (Scholastic),"September 1, 2005",Paperback,Scholastic,3.71,3896.0,
6387,24043,New York City's Best Dive Bars: Drinking and D...,Wendy Mitchell/June Kim,3.75,970312539,9780970312532,eng,160,16,2,2003-07-01,Gamble Guides,"December 1, 2002",Paperback,Gamble Guides,3.71,32.0,
6396,24062,The Deep (Dive Trilogy),Gordon Korman,3.78,613674839,9780613674836,eng,148,0,0,2003-07-01,Turtleback Books,,,,,,


-- Clemence --

I found the book on GoodReads : https://www.goodreads.com/book/show/24037.The_Deep

As the information on the goodreads page are not exactly the same as the ones in line 6385, I filled the information manually based on the GoddReads page (the title, num_pages and average rating where the same in the database and the link to goodread).

In [203]:
df_scraped.loc[6396, ["ratings_count", "text_reviews_coun", "first_published","book_format","new_publisher","edition_avgRating","added_toShelves"]] = [1659,66,"July 1, 2003","Paperback","Scholastic",3.78,3004]

In [204]:
df_scraped.iloc[6396]

bookID                                  24062
title                 The Deep (Dive Trilogy)
authors                         Gordon Korman
average_rating                           3.78
isbn                               0613674839
isbn13                          9780613674836
language_code                             eng
num_pages                                 148
ratings_count                            1659
text_reviews_count                          0
publication_date                   2003-07-01
publisher                    Turtleback Books
first_published                  July 1, 2003
book_format                         Paperback
new_publisher                      Scholastic
edition_avgRating                        3.78
added_toShelves                        3004.0
added_toShelve                            NaN
text_reviews_coun                        66.0
Name: 6396, dtype: object

For the new_publisher, the scraper failed to get a lot of values, this is because of how the publisher is saved in the html. Due to constraint of time, and since there is already a publisher column in the original dataset. In the cases where there are NaN the old publisher will be put. The new publisher will be taken, only in the cases where the new publisher is different from the old publisher in value.

In [205]:
# Let's create the function for it. This function will only be use when the text format will be editing, 
#in the subtitle "publisher"

def update_publishers(df):
    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        # Check if the new publisher is different from the old publisher
        if pd.notna(row['new_publisher']) and row['new_publisher'] != row['publisher']:
            # Update the publisher with the new publisher value
            df.at[index, 'publisher'] = row['new_publisher']
        else:
            # Use the old publisher if the new publisher is NaN or the same as the old publisher
            df.at[index, 'publisher'] = row['publisher']
    return df

### Set text columns to lower text

When dealing with text data it always safer to have them all in the same case either lower or upper 

In [206]:
def to_lower(df: pd.DataFrame, columns: list):
    """Function to convert text columns to lowercase"""
    for col in columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.lower()
    return df

In [207]:
# get the text columns
text_columns = df_scraped.select_dtypes(include=['object']).columns

# Convert to lowercase
df_scraped = to_lower(df_scraped, text_columns)

In [208]:
df_scraped.sample(5)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves,added_toShelve,text_reviews_coun
10027,40364,the martians (mars trilogy #3.5),kim stanley robinson,3.56,553574019,9780553574012,eng,434,1495,92,2000-10-03,spectra books,"january 1, 1999",paperback,spectra books,3.59,5175.0,,
7007,26578,the eye of charon (age of conan: hyborian adve...,richard a. knaak,3.91,441014453,9780441014453,en-gb,288,43,4,2006-09-26,ace,"september 26, 2006",mass market paperback,ace,3.96,140.0,,
7070,26973,their eyes were watching god,zora neale hurston/ruby dee,3.91,60776536,9780060776534,en-us,7,421,99,2004-11-23,caedmon,"january 1, 1937",audio cd,,4.06,1379.0,,
7422,28533,well of darkness (sovereign stone #1),margaret weis/tracy hickman,3.67,61020575,9780061020575,eng,562,2049,49,2001-09-04,harpertorch,"august 22, 2000",mass market paperback,harpertorch,3.68,4225.0,,
508,1585,aristophanes and athens: an introduction to th...,douglas m. macdowell,4.07,198721595,9780198721598,eng,376,14,3,1995-10-01,oxford university press,"january 1, 1995",paperback,oxford university press,3.94,58.0,,


### Take First Author Name and Create num_contributors Column

We have decided to take only the name of the first author, since usually first authors are the main authors. Another column will be added to indicate the total number of authors/contributors to the book.

In [209]:
# take the first author in the authors column
df_scraped["first_author"] = df_scraped["authors"].apply(lambda x: x.split("/")[0].strip())


In [210]:
df_scraped["first_author"].sample(5)

4241             barry hughart
1583              barbara park
254     gabriel garcía márquez
6037          jean baudrillard
1680             shiva naipaul
Name: first_author, dtype: object

In [211]:
df_scraped["num_contributors"] = df_scraped["authors"].apply(lambda x: len(x.split("/")))

In [212]:
df_scraped[["authors","num_contributors"]].sample(5)

Unnamed: 0,authors,num_contributors
4812,richard p. feynman,1
9830,howard chaykin,1
10421,eleanor estes/edward ardizzone,2
5466,jerome preisler/tom clancy/martin greenberg,3
5415,christopher janaway,1


### Clean Text Columns

Remove punctuation, and extra white spaces from all text columns.

Better clean the text in publisher column, to avoid having same publishers represented differently. 
Example: indexes 5993, 5365 and 9753, W.W. Norton & Company, W. W. Norton and Company and W. W. Norton  Company represent the same publisher

In [213]:
general_replacements = [
    (r'[^\w\s]', ''), # remove punctuation
    (r'\s\s+', " "), # remove double spaces and more   
]

columns_ = ["title", "first_author"]

df_scraped = sub_text(df_scraped, columns_, general_replacements)

In [214]:
df_scraped[["title", "first_author"]].sample(5)

Unnamed: 0,title,first_author
3272,crime novels american noir of the 1950s,robert polito
8581,the house on mango street,sandra cisneros
7602,a certain justice adam dalgliesh 10,pd james
5405,the book of illusions,paul auster
4138,the lady chosen bastion club 1,stephanie laurens


In [215]:
specific_replacements = [
    ("inc.", ""),
    ("llc", ""),
    ("ltd", ""),
    ("w. w.", "ww"), 
    ("&", " and "),
    (r'\bbooks?\b', ""),  
    (r'\bclassics?\b', ""),
    (r'\bpublishers?\b', ""),
    (r'\bpress\b', ""),
    (r'\bpublishing\b', "")
]

df_scraped = sub_text(df_scraped, ["publisher"], specific_replacements)

### Analyse the "new_publisher" column

In [216]:
#Use update_publishers function to be sure about our "publisher" column : 

In [217]:
df_scraped[['publisher', 'new_publisher']].tail(50)

Unnamed: 0,publisher,new_publisher
11077,debols!llo,
11078,montena,
11079,montena,
11080,listening library,
11081,listening library (audio),listening library (audio)
11082,alfred a. knopf for young readers,
11083,alfred a. knopf for young readers,
11084,listening library,
11085,alfred a. knopf,
11086,ediciones b,


Based on the sample,it's better to only keep the "publisher" column : there is more informations and they are harmonized.

### Add "size_of_publisher" column

In [218]:
# Adding the new columns that allows to see if a publisher is often cited
df_scraped['publisher_count'] = df_scraped.groupby('publisher')['publisher'].transform('count')

In [219]:
# The exact number of time a publisher name is in the database is not as relevant as the number of time an author or a title is
# in it.
# To simplify this information, we can, instead of the count, create a column that categorize the publisher :
# - Small publisher(1) = only 1 time in the database
# - Medium publisher(2) = from 2 to 10 times
# - big publisher(3) = more that 10 times

# Create a function to determine the size of the publisher based on the number of times it is mentioned
def determine_size(publisher_count):
    if publisher_count == 1:
        return 1
    elif publisher_count < 10:
        return 2
    else:
        return 3

# Apply the function to the 'publisher_count' column to create the new 'size_of_publisher' column
df_scraped['size_of_publisher'] = df_scraped['publisher_count'].apply(lambda x: determine_size(x))

In [220]:
print('Number of line with small publisher',len(df_scraped[df_scraped['size_of_publisher'] == 1]))
print('Number of line with medium publisher',len(df_scraped[df_scraped['size_of_publisher'] == 2]))
print('Number of line with big publisher',len(df_scraped[df_scraped['size_of_publisher'] == 3]))

Number of line with small publisher 1254
Number of line with medium publisher 2647
Number of line with big publisher 7226


In [221]:
#Now that we have the "size_of_publisher" column, let's delete the "publisher_count" used to create it 
df_scraped.drop(columns=['publisher_count'], inplace=True)

### Create "num_book_per_author" column

In [222]:
# Check for duplicates in a specific column using value_counts
num_book_per_author = df_scraped["first_author"].value_counts()

# Display values with count greater than 1 (indicating duplicates)
print(num_book_per_author[num_book_per_author > 1])

william shakespeare    88
stephen king           82
jrr tolkien            51
pg wodehouse           46
agatha christie        45
                       ..
paul farmer             2
amy sedaris             2
ruby ann boxcar         2
anthony loyd            2
james lee burke         2
Name: first_author, Length: 1471, dtype: int64


In [223]:
# Adding the new columns that allows to see if an author is often cited
df_scraped['num_book_per_author'] = df_scraped.groupby('first_author')['first_author'].transform('count')

In [224]:
df_scraped.sample(10)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,book_format,new_publisher,edition_avgRating,added_toShelves,added_toShelve,text_reviews_coun,first_author,num_contributors,size_of_publisher,num_book_per_author
5689,21194,the stranger,caroline b. cooney,3.53,0590456806,9780590456807,eng,198,55,11,...,paperback,,3.37,140.0,,,caroline b cooney,1,3,6
7968,30536,beauty and the contemporary sublime,jeremy gilbert-rolfe,3.56,1581150377,9781581150377,eng,180,39,5,...,paperback,allworth,3.59,130.0,,,jeremy gilbertrolfe,1,1,1
8443,32453,the servants of twilight,leigh nichols/dean koontz,3.83,0747236380,9780747236382,eng,499,15666,209,...,paperback,headline feature,3.87,25741.0,,,leigh nichols,2,1,3
6730,25242,reason in history,georg wilhelm friedrich hegel/robert s. hartman,3.65,0023513209,9780023513206,eng,95,451,17,...,paperback,pearson,3.66,1817.0,,,georg wilhelm friedrich hegel,2,3,14
1366,4767,star wars episode 1 the phantom menace illustr...,george lucas,3.92,0345431103,9780345431103,eng,150,259,9,...,paperback,del rey,3.94,582.0,,,george lucas,1,3,3
5036,18240,war and peace,leo tolstoy,4.11,3895086908,9783895086908,en-us,1500,126,9,...,hardcover,,3.94,185.0,,,leo tolstoy,1,1,23
5750,21534,trunk music harry bosch 5 harry bosch universe 6,michael connelly/dick hill,4.18,1423323386,9781423323389,eng,13,126,26,...,audio cd,,3.88,179.0,,,michael connelly,2,3,17
1375,4808,latitude and longitude rookie readabout geography,rebecca aberg/jeanne clidas,3.53,0516277650,9780516277653,eng,32,1,0,...,paperback,,4.0,9.0,,,rebecca aberg,2,1,1
5707,21290,spider mountain cam richter 2,p.t. deutermann,3.92,031233379x,9780312333799,eng,309,461,41,...,hardcover,st. martin's press,3.89,806.0,,,pt deutermann,1,3,3
4353,15651,inferno la divina commedia 1,dante alighieri/ronald l. martinez/robert m. d...,4.0,0195087445,9780195087444,eng,672,468,50,...,paperback,,4.28,1416.0,,,dante alighieri,4,3,20


### Create "is_english" column

In [225]:
# Add a new column 'is_english' with 1 for English (en, en-CA, en-GB,...) and 0 for non-English
df_scraped['is_english'] = np.where(df_scraped['language_code'].str.contains('en', case=False), 1, 0)

In [226]:
df_scraped.sample(10)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,new_publisher,edition_avgRating,added_toShelves,added_toShelve,text_reviews_coun,first_author,num_contributors,size_of_publisher,num_book_per_author,is_english
267,816,cryptonomicon,neal stephenson,4.25,0060512806,9780060512804,eng,1139,83184,4249,...,avon,4.24,194821.0,,,neal stephenson,1,3,14,1
7473,28698,to green angel tower part 2 memory sorrow and ...,tad williams,4.2,0886776066,9780886776060,eng,815,20025,215,...,daw fantasy,4.22,27904.0,,,tad williams,1,1,11,1
9076,35350,what the body remembers,shauna singh baldwin,3.89,0385496052,9780385496056,eng,471,2340,153,...,anchor,3.91,8636.0,,,shauna singh baldwin,1,3,1,1
462,1504,euripides medea,william allan,4.04,071563187x,9780715631874,eng,160,23,1,...,bristol classical press,3.9,151.0,,,william allan,1,2,1,1
976,3311,self,yann martel,3.43,0571219764,9780571219766,eng,331,2359,146,...,faber & faber,3.46,6081.0,,,yann martel,1,3,4,1
3823,13666,wonderful alexander and the catwings,ursula k. le guin/s.d. schindler,4.13,053106851x,9780531068519,eng,42,21,1,...,,4.4,58.0,,,ursula k le guin,2,2,16,1
2702,9924,the grass harp including a tree of night and o...,truman capote,4.01,0679745572,9780679745570,eng,272,4533,188,...,vintage,4.02,10844.0,,,truman capote,1,3,11,1
9935,40020,a patchwork planet,anne tyler,3.79,080411918x,9780804119184,eng,320,11355,571,...,ballantine books,3.8,19759.0,,,anne tyler,1,3,10,1
3652,13177,private parts,howard stern,3.77,0671009443,9780671009441,eng,660,3878,151,...,pocket,3.79,6620.0,,,howard stern,1,3,1,1
6915,26064,kissing in manhattan,david schickler,3.61,0385335679,9780385335676,eng,288,2665,304,...,dial press trade paperback,3.6,4623.0,,,david schickler,1,3,1,1


### Create "book_count" column

In [227]:
# Adding the new columns that allows to see if a book is often cited
df_scraped['book_count'] = df_scraped.groupby('title')['title'].transform('count')

In [228]:
df_scraped.query("title == 'the brothers karamazov'")

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,edition_avgRating,added_toShelves,added_toShelve,text_reviews_coun,first_author,num_contributors,size_of_publisher,num_book_per_author,is_english,book_count
1406,4933,the brothers karamazov,fyodor dostoyevsky/constance garnett/manuel ko...,4.32,0451527348,9780451527349,eng,736,983,91,...,4.27,1638.0,,,fyodor dostoyevsky,4,3,37,1,9
1407,4934,the brothers karamazov,fyodor dostoyevsky/fyodor dostoevsky/richard p...,4.32,0374528373,9780374528379,eng,796,191531,6795,...,4.35,855582.0,,,fyodor dostoyevsky,4,3,37,1,9
1408,4935,the brothers karamazov,fyodor dostoyevsky/david mcduff,4.32,0140449248,9780140449242,eng,1013,1673,184,...,4.44,11774.0,,,fyodor dostoyevsky,2,3,37,1,9
1409,4936,the brothers karamazov,fyodor dostoyevsky/richard pevear/larissa volo...,4.32,0679729259,9780679729259,eng,796,617,80,...,4.51,1381.0,,,fyodor dostoyevsky,3,3,37,1,9
1410,4938,the brothers karamazov,fyodor dostoyevsky/simon vance/thomas r. beyer...,4.32,1596440791,9781596440791,eng,16,20,2,...,4.52,164.0,,,fyodor dostoyevsky,3,2,37,1,9
1411,4940,the brothers karamazov,fyodor dostoyevsky/constance garnett/maire jaanus,4.32,159308045x,9781593080457,eng,720,1089,202,...,4.38,3421.0,,,fyodor dostoyevsky,3,3,37,1,9
1634,5691,the brothers karamazov,fyodor dostoyevsky/richard pevear/larissa volo...,4.32,0099922800,9780099922803,eng,796,443,55,...,4.45,1657.0,,,fyodor dostoyevsky,3,3,37,1,9
1990,7135,the brothers karamazov,fyodor dostoyevsky/andrew r. macandrew/konstan...,4.32,0553212168,9780553212167,eng,1072,1022,154,...,4.37,3236.0,,,fyodor dostoyevsky,3,3,37,1,9
9356,37058,the brothers karamazov,fyodor dostoyevsky/thomas r. beyer jr./simon v...,4.32,1596440783,9781596440784,eng,16,3,1,...,4.67,9.0,,,fyodor dostoyevsky,3,2,37,1,9


### Create a "is_serie" column

In [229]:
# Create a regular expression pattern to match titles containing numbers
pattern = r'\b\d{1,2}\b'  # This pattern matches 1 or 2 digits, if a number has more that 2 digits it's not a number of books in a serie

# Use the str.contains() method with the regex pattern to filter the DataFrame
books_with_number = df_scraped[df_scraped['title'].str.contains(pattern)]

# Print or further process the extracted books
books_with_number['title'].sample(20)

2440    return to the planet of the apes 2 escape from...
8587    the mystery at the mosscovered mansion nancy d...
3277                                 inversions culture 6
5898                                  count zero sprawl 2
7480               four twenty blackbirds bardic voices 4
2596               cast in shadow chronicles of elantra 1
3921                 by slanderous tongues doubled edge 3
9236                        warrior mackenzieblackthorn 5
9675                   starting over sweet valley high 33
5661                      crown of stars crown of stars 7
9457                       beyond seduction beyond duet 2
6095    the gap into power a dark and hungry god arise...
3056                                vengeance joe kurtz 1
2551                   shopaholic and sister shopaholic 4
1054                             pat of silver bush pat 1
3254           industrial magic women of the otherworld 4
4814              the feynman lectures on physics vols 56
1563      ambe

In [230]:
# Add a new column 'is_serie' with 1 for books with a number in the title and 0 for books without numbers
df_scraped['is_serie'] = np.where(df_scraped['title'].str.contains(pattern), 1, 0)

In [231]:
df_scraped.sample(10)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,added_toShelves,added_toShelve,text_reviews_coun,first_author,num_contributors,size_of_publisher,num_book_per_author,is_english,book_count,is_serie
9993,40264,the nature of play great apes and humans,anthony d. pellegrini/peter k. smith,4.33,1593851170,9781593851170,eng,308,3,1,...,46.0,,,anthony d pellegrini,2,2,1,1,1,0
7463,28670,perils gate wars of light shadow 6 arc 3 allia...,janny wurts,4.16,61054674,9780061054679,en-us,940,521,8,...,758.0,,,janny wurts,1,3,7,1,1,1
938,3100,em forster critical guidebook,lionel trilling/e.m. forster,3.51,811202100,9780811202107,eng,208,6,0,...,19.0,,,lionel trilling,2,3,1,1,1,0
8621,33157,a midsummer nights dream sparknotes literature...,sparknotes/william shakespeare,3.96,1586634046,9781586634049,eng,64,19,2,...,67.0,,,sparknotes,2,3,8,1,1,0
4769,17152,my antonia great plains trilogy 3,willa cather/alyssa harad,3.79,743487699,9780743487696,eng,314,5985,507,...,9015.0,,,willa cather,2,3,2,1,1,1
5553,20239,watchers of time inspector ian rutledge 5,charles todd,3.99,553583166,9780553583168,eng,421,3194,326,...,8185.0,,,charles todd,1,3,9,1,1,1
5358,19354,the gilgamesh epic and old testament parallels,alexander heidel,3.95,226323986,9780226323985,eng,280,58,8,...,272.0,,,alexander heidel,1,1,1,1,1,0
1985,7118,the karamazov brothers,fyodor dostoyevsky/ignat avsey,4.32,192835092,9780192835093,eng,1054,235,26,...,511.0,,,fyodor dostoyevsky,2,3,37,1,1,0
2991,11019,jane eyre,charlotte brontë/richard j. dunn,4.12,393975428,9780393975420,eng,534,1475,156,...,2505.0,,,charlotte brontë,2,3,6,1,6,0
9001,35002,the one minute minute sales person,spencer johnson,3.78,7104847,9780007104840,eng,109,68,7,...,143.0,,,spencer johnson,1,3,2,1,1,0


In [232]:
#Let's see if we have to add some books in "is_serie" based on word in there title
df_scraped[(df_scraped['is_serie'] == '0') & (df_scraped['title'].str.contains('trilogy|tome|chronicles|series', case=False))]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,added_toShelves,added_toShelve,text_reviews_coun,first_author,num_contributors,size_of_publisher,num_book_per_author,is_english,book_count,is_serie


### Convert the first_published date into 3 columns : published_year, published_month, published_day

By looking at the data, we notice that for some line, the first_published date didn't extracted well, and it's written "ary" instead of january or february.

In [233]:
df_scraped[['title','first_published']][df_scraped['first_published'].str.contains(r'ary', case=False) & ~df_scraped['first_published'].str.contains(r'january|february', case=False)]

Unnamed: 0,title,first_published
56,simply beautiful beaded jewelry,"ary 28, 2006"
77,the power of one the solo play for playwrights...,"ary 7, 2000"
124,tropic of capricorn,"ary 1, 1939"
176,gravitys rainbow,"ary 28, 1973"
179,gravitys rainbow,"ary 28, 1973"
...,...,...
11067,the home front,"ary 23, 1989"
11090,la conspiración de los alquimistas,"ary 1, 1999"
11092,the call of the mall how we shop,"ary 2, 2004"
11101,undaunted courage the pioneering first mission...,"ary 15, 1996"


In [234]:
#We will replace the "ary" in those line by "february" 
# Filter rows containing "ary" but not "january" or "february"
filtered_rows = df_scraped['first_published'].str.contains(r'ary', case=False) & ~df_scraped['first_published'].str.contains(r'january|february', case=False)

# Replace the filtered results with "february"
df_scraped.loc[filtered_rows, 'first_published'] = df_scraped.loc[filtered_rows, 'first_published'].str.replace(r'ary', 'february', case=False)

In [235]:
df_scraped[['title','first_published']][df_scraped['first_published'].str.contains('published', case=False)]

Unnamed: 0,title,first_published
133,love letters,"published january 1, 1999"
201,timbuktu leviathan moon palace,"published october 31, 2002"
203,the coming economic collapse how you can thriv...,"published february 21, 2006"
229,guidebook to zen and the art of motorcycle mai...,"published november 19, 1990"
263,best of london lonely planet best of,"published january 1, 2004"
...,...,...
10881,the best american travel writing 2006,"published october 11, 2006"
10898,poems between women four centuries of love rom...,"published april 15, 1999"
10937,un amour de swann à la recherche du temps perd...,"published january 1, 2006"
10947,poetry and prose of alexander pope riverside e...,"published january 2, 1968"


In [236]:
#Let's delete the word "published" from the first_published column
df_scraped['first_published'] = df_scraped['first_published'].str.replace('published ', '')

By trying to convert the "first_published" column into date, we notice several errors due to date before 1677. Indeed, the smallest date that pandas can handle is January 1, 1677. As we are looking for an average rating given on a website, there is probably not a big difference between books written in 1524 or 1678, as they are considered "old". By lack of time, we'll change all year below 1677 by 1678.

In [237]:
# The smallest date that pandas can handle is January 1, 1677. Let's remove all date below 1677 and change the year by 1678
for i, date_str in enumerate(df_scraped['first_published']):
    # Use regular expression to find the year in the string
    match = re.search(r'\b\d{1,4}\b$', date_str)
    if match:
        year = int(match.group())
        if year < 1677:
            # Replace the year with 1678
            modified_date_str = re.sub(r'\b\d{1,4}\b$', '1678', date_str)
            df_scraped.at[i, 'first_published'] = modified_date_str

In [238]:
df_scraped['first_published'] = pd.to_datetime(df_scraped['first_published'], format='%B %d, %Y')

In [239]:
# Convert the first_published date into 3 columns : published_year, published_month, published_day
df_scraped['published_year'] = df_scraped['first_published'].dt.year
df_scraped['published_month'] = df_scraped['first_published'].dt.month
df_scraped['published_day'] = df_scraped['first_published'].dt.day


In [240]:
df_scraped.sample(5)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,first_author,num_contributors,size_of_publisher,num_book_per_author,is_english,book_count,is_serie,published_year,published_month,published_day
344,1169,monkey business,sarah mlynowski,3.67,373250711,9780373250714,eng,392,3379,67,...,sarah mlynowski,1,2,6,1,2,0,2004,1,1
2155,7742,ahabs wife or the stargazer,sena jeter naslund,4.03,60838744,9780060838744,eng,704,40049,2391,...,sena jeter naslund,1,3,1,1,1,0,1999,9,22
1575,5417,carrie salems lot the shining,stephen king,4.54,517219026,9780517219027,eng,1096,13137,61,...,stephen king,1,3,82,1,1,0,1983,1,1
1703,5931,the essential neruda selected poems,pablo neruda/mark eisner/lawrence ferlinghetti...,4.46,872864286,9780872864283,eng,200,5149,210,...,pablo neruda,10,2,9,1,1,0,1979,1,1
908,2988,louisa may alcotts christmas treasury,louisa may alcott/c. michael dudash/stephen w....,3.96,1589199502,9781589199507,eng,282,715,44,...,louisa may alcott,3,1,15,1,1,0,2002,1,1


### Categorization then encoding of Book_format

In [241]:
df_scraped['book_format'].unique()

array(['paperback', 'hardcover', 'mass market paperback', 'leather bound',
       'audio cd', 'library binding', 'paper', 'product bundle',
       'perfect paperback', 'imitation leather', 'hardback', 'cloth',
       'mp3 cd', 'softcover', 'ebook', 'board book', 'unknown binding',
       'textbook binding', 'pocket book', 'audiobook', 'audio cassette',
       '240 pages', 'taschenbuch', 'wireless phone accessory',
       "publisher's binding", 'cd-rom', 'audio', 'コミック',
       'trade paperback', 'unbound', 'spiral-bound',
       'school & library binding', 'diary', 'kindle edition', 'comic',
       'slipcased hardcover', 'poche', 'unknown', 'dvd and book',
       'flexibound', '156 pages', 'comics', 'paperback manga',
       'staple bound', 'bath book', 'capa mole'], dtype=object)

In [242]:
#There seems to be mistakes for some lines, with "book format" = "xx pages", let's look at one of those lines
df_scraped.loc[df_scraped['book_format'] == '156 pages'].transpose()

Unnamed: 0,7724
bookID,29776
title,the dying animal
authors,philip roth
average_rating,3.63
isbn,0099422697
isbn13,9780099422693
language_code,eng
num_pages,156
ratings_count,5910
text_reviews_count,437


In [243]:
#Let's create a new columns to translate those multiples book_format into less category :
# First we define the categories
categories = {
    'hardcove': ['hardcover', 'leather bound', 'library binding', 'hardback', "publisher's binding", 'slipcased hardcover','board book','textbook binding'],
    'paperback': ['paperback', 'mass market paperback', 'perfect paperback', 'softcover', 'trade paperback', 'paperback manga', 'taschenbuch','poche','flexibound', 'capa mole','pocket book'],
    'audio': ['audio cd', 'mp3 cd', 'audiobook', 'audio cassette', 'cd-rom','audio'],
    'ebook': ['ebook', 'kindle edition'],
    'books': ['paper','unbound', 'spiral-bound','240 pages','school & library binding','156 pages','staple bound','diary','imitation leather'],
    'comics': ['コミック','comic','comics'],
    'other': ['unknown binding', 'unknown', 'product bundle', 'cloth', 'wireless phone accessory', 'dvd and book', 'bath book']
}

# Function to map the value of book_format to the corresponding category
def map_category(format):
    for category, formats in categories.items():
        if format in formats:
            return category
    return None

# Create a new column "category" by applying the map_category function to the "book_format" column
df_scraped['category'] = df_scraped['book_format'].apply(map_category)

In [244]:
df_scraped[['book_format','category']].sample(5)

Unnamed: 0,book_format,category
4273,paperback,paperback
2087,hardcover,hardcove
428,paperback,paperback
4932,paperback,paperback
707,paperback,paperback


In [245]:
df_scraped['category'].isnull().any()

False

In [246]:
#Let's create the function to do the one-hot encoding 
def one_hot_encode(df, column_name):
    encoded_columns = pd.get_dummies(df[column_name], prefix=column_name)
    
    # Concatenate the encoded columns to the original DataFrame
    df = pd.concat([df, encoded_columns], axis=1)
    
    # Drop the original column
    df.drop(column_name, axis=1, inplace=True)
    
    return df

In [247]:
#We will do this encoding on the category column 
df_scraped = one_hot_encode(df_scraped, 'category')

In [248]:
df_scraped.sample(5)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,...,published_year,published_month,published_day,category_audio,category_books,category_comics,category_ebook,category_hardcove,category_other,category_paperback
5152,18638,le sens du vent,iain m. banks,4.2,2221095529,9782221095522,fre,404,7,0,...,2000,8,1,0,0,0,0,0,0,1
4144,14839,last man standing,david baldacci,4.04,446611778,9780446611770,eng,640,41611,727,...,2001,11,6,0,0,0,0,0,0,1
4422,15901,friends lovers chocolate isabel dalhousie 2,alexander mccall smith,3.62,375422994,9780375422997,eng,261,9901,747,...,2005,9,20,0,0,0,0,1,0,0
3223,11779,them wonderland quartet 3,joyce carol oates/elaine showalter,3.71,345484401,9780345484406,eng,576,2686,230,...,1969,1,1,0,0,0,0,0,0,1
868,2880,bleach volume 01,tite kubo,4.22,1591164419,9781591164418,eng,200,140403,1063,...,2002,1,5,0,0,0,0,0,0,1


In [249]:
df_scraped.sample(3).transpose()

Unnamed: 0,7806,10145,10424
bookID,30095,41054,42376
title,gloriana,the fish kisser,theater shoes
authors,michael moorcock,james hawkins,noel streatfeild/diane goode
average_rating,3.65,3.4,4.02
isbn,0446691402,0888822405,0613013379
isbn13,9780446691406,9780888822406,9780613013376
language_code,eng,eng,en-us
num_pages,496,360,252
ratings_count,1742,9,85
text_reviews_count,64,0,11


### Removal of text columns and unnecessary columns

In [250]:
df_forML = df_scraped.copy()

# List of columns to remove
columns_to_remove = ['title', 'authors', 'isbn', 'language_code', 'publication_date', 'publisher', 
                     'first_published', 'book_format', 'new_publisher', 'added_toShelve', 
                     'text_reviews_coun', 'first_author']

# Remove the specified columns from df_forML
df_forML.drop(columns=columns_to_remove, inplace=True)

In [251]:
df_forML.sample(3).transpose()

Unnamed: 0,710,7384,5568
bookID,2299.0,28417.0,20343.0
average_rating,4.07,4.06,3.71
isbn13,9780447000000.0,9782070000000.0,9780440000000.0
num_pages,200.0,462.0,448.0
ratings_count,1827.0,6293.0,825.0
text_reviews_count,154.0,152.0,98.0
edition_avgRating,4.09,4.02,3.75
added_toShelves,5854.0,17428.0,2573.0
num_contributors,2.0,2.0,1.0
size_of_publisher,3.0,3.0,3.0


In [252]:
#We can now save this dataframe in a csv file in order to use it in the notebook for Machine Learning
df_forML.to_csv("dataframe_forML.csv", index_label=False)