--- Kheire --- 

Work in progress. 

Data Cleaning Phase 1 is ready. 

Data Cleaning Phase 2 is a work in progress. It is done on the scraped data. 

This notebook illustrates the steps taken for data cleaning and feature engineering.

In [16]:
# import necessary libraries
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
from utils import *
import re

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Data Cleaning Phase 1

This is the first phase of data cleaning. It is very basic and addresses the data format and column names. 

The first thing noticed when reading the original csv file "books.csv" was that, in some rows, the content of certain fields (for example, authors) contains commas. This caused an error, because pd.read_csv treated the pieces as belonging to different columns, resulting in a different number of columns for several rows. Luckily, these pieces are separated by a comma followed by a space (", "), whereas the column delimiter is a bare comma, so the problem is easy to solve by replacing ", " with "/" as seen below. "books_updated.csv" is the updated csv, which pandas reads successfully.

In [2]:
### COMMENT THE FOLLOWING CELL IF YOU HAVE ALREADY RUN IT BEFORE AND SAVED THE books_updated.csv ###
# Specify the filename
filename = 'books.csv'
updated_filename = "books_updated.csv"

## In some fields, such as authors, the values are separated by ", " 
## to avoid problems when reading the csv directly with pandas and to retain all rows, the following steps are done

# Open the file and read lines
with open(filename, 'r', encoding='utf-8') as file: # utf-8 encoding to support all languages since there are non-english content
    lines = file.readlines()

# Process the lines to handle unwanted delimiters
cleaned_lines = []
is_firstline = True # used to avoid updating the first line

for line in lines:
    
    if is_firstline: # if it is the first line i.e. Header do not update it
        is_firstline = False
        cleaned_lines.append(line)
        continue
    cleaned_line = line.replace(', ', '/')

    # Append the cleaned line
    cleaned_lines.append(cleaned_line)

# write all the lines to create the new updated csv
# (each line read by readlines() already ends with '\n', so no extra newline is appended)
with open(updated_filename, 'w', encoding='utf-8') as output_file:
    output_file.writelines(cleaned_lines)
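
Before rewriting the file, it can help to see which rows are actually affected. The following is a minimal sketch that is not part of the original notebook: the helper name `find_malformed_rows` and the three-column sample text are made up for illustration. It counts the fields of every row with the standard `csv` module and reports the rows whose count differs from the header.

```python
import csv
import io


def find_malformed_rows(csv_text, delimiter=","):
    """Return (line_number, field_count) for rows whose field count differs from the header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    # the header is line 1, so the first data row is line 2
    return [
        (line_number, len(row))
        for line_number, row in enumerate(reader, start=2)
        if row and len(row) != expected
    ]


# hypothetical sample: the second data row has two authors separated by ", "
sample = (
    "bookID,title,authors\n"
    "1,Book A,Author One\n"
    "2,Book B,Author Two, Author Three\n"
)
print(find_malformed_rows(sample))  # [(3, 4)]
```

On the real file, the same function could be fed the text read from "books.csv" to list the rows that need the ", " replacement.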


In [3]:
# Read the cleaned data into a DataFrame
df = pd.read_csv(updated_filename, delimiter = ",", encoding = 'utf-8', index_col=False) 

# Look at the first 5 rows of the DataFrame
df.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


Looking closer at the column names, we notice that one of them, '  num_pages', starts with spaces. For ease of use, it is better to remove these unnecessary spaces.

In [4]:
df.columns

Index(['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
       'language_code', '  num_pages', 'ratings_count', 'text_reviews_count',
       'publication_date', 'publisher'],
      dtype='object')

In [5]:
# Remove the spaces before the column name num_pages, for ease of use
df.rename(columns={'  num_pages': 'num_pages'}, inplace=True)
df.columns

Index(['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
       'language_code', 'num_pages', 'ratings_count', 'text_reviews_count',
       'publication_date', 'publisher'],
      dtype='object')

Examining the column types shows that some columns that are supposed to be numerical have the object type.

In [6]:
types_columns=df.dtypes
number_lines,number_columns=df.shape
print(types_columns)

bookID                 object
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                 object
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count     object
publication_date       object
publisher              object
dtype: object


In [7]:
# investigate if there are non-numerical or non-date values in supposedly numerical and date columns

def check_non_numerical_date(columns_list: list, intended_type: str):
    """Print the values of each column that cannot be parsed as the intended type."""

    if intended_type == "numerical":
        for column in columns_list:
            # values that pd.to_numeric cannot parse become NaN
            non_numerical_values = df.loc[df[column].apply(pd.to_numeric, errors='coerce').isna()]
            print("The non_numerical_values in column {} : ".format(column), non_numerical_values[column])

    elif intended_type == "date":
        for column in columns_list:
            # values that pd.to_datetime cannot parse become NaT
            non_date_values = df.loc[df[column].apply(pd.to_datetime, errors='coerce').isna()]
            print("The non_date_values in column {} : ".format(column), non_date_values[column])
    
# Display the non-numerical and non_date values of the selected columns
columns = ["isbn", "isbn13", "text_reviews_count"]
check_non_numerical_date(columns, "numerical")
check_non_numerical_date(["publication_date"], "date")

The non_numerical_values in column isbn :  3        043965548X
12       076790818X
16       076790382X
27       097669400X
40       006076273X
            ...    
11090    030727411X
11101    074347788X
11106    057305133X
11110    843221728X
11115    972233168X
Name: isbn, Length: 985, dtype: object
The non_numerical_values in column isbn13 :  1847    en-US
Name: isbn13, dtype: object
The non_numerical_values in column text_reviews_count :  1847    9/1/2003
Name: text_reviews_count, dtype: object
The non_date_values in column publication_date :  1847      MTV Books
8180     11/31/2000
11098     6/31/1982
Name: publication_date, dtype: object


From the above investigation, one can notice the following: 
- isbn is not numerical, since some of the isbn codes end with X 
- at index location 1847, the data is shifted to the left
- the dates at index locations 8180 and 11098 were not recognized as dates because they are invalid: they use day 31 for November and June, which have only 30 days.
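
The trailing X in some isbn values is not a data error: in the 10-digit ISBN format the last character is a check digit, and X stands for the value 10. The following is a small sketch, not part of the original notebook (the function name `is_valid_isbn10` is made up here), that verifies the ISBN-10 checksum on two values from the table above.

```python
def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum; a trailing 'X' stands for the value 10."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn.upper()):
        if char == "X" and position == 9:
            value = 10
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # weights run from 10 (first character) down to 1 (check digit)
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("043965548X"))  # True
print(is_valid_isbn10("0439785960"))  # True
```

This is why the isbn column is better kept as text rather than converted to a number.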

Also, as seen below, there is one empty cell in the publisher column, at index location 1847.

In [8]:
df.isna().sum()

bookID                0
title                 0
authors               0
average_rating        0
isbn                  0
isbn13                0
language_code         0
num_pages             0
ratings_count         0
text_reviews_count    0
publication_date      0
publisher             1
dtype: int64

In [9]:
df.loc[df["publisher"].isna(), "publisher"]

1847    NaN
Name: publisher, dtype: object

In [10]:
df.iloc[1847]

bookID                6549/ said the shotgun to the head.
title                                       Saul Williams
authors                                              4.22
average_rating                                743470796.0
isbn                                        9780743470797
isbn13                                              en-US
language_code                                         192
num_pages                                            2762
ratings_count                                         214
text_reviews_count                               9/1/2003
publication_date                                MTV Books
publisher                                             NaN
Name: 1847, dtype: object

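One can notice that there is a / in the bookID which separates the bookID from the book title. This is most likely a side effect of the automatic replacement of ", " by "/" done on the csv file at the beginning: the comma between the bookID and the title was followed by a space, so it was replaced as well. 

Below is the code to fix this row.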

In [11]:
row = 1847
# split "bookID/title" once; the rest of the row is shifted one column to the left
bookid_title = df.loc[row, "bookID"].split('/', 1)

# content of the remaining columns (copied as an array so it is assigned by position)
remaining_columns = df.loc[row, ['title', 'authors', 'average_rating', 'isbn', 'isbn13',
                                 'language_code', 'num_pages', 'ratings_count',
                                 'text_reviews_count', 'publication_date']].to_numpy()

# rearrange cells content for each column
# indexing df directly with .loc/.iloc avoids the chained indexing that triggers SettingWithCopyWarning
df.iloc[row, 2:] = remaining_columns
df.loc[row, "bookID"] = bookid_title[0]
df.loc[row, "title"] = bookid_title[1]


In [12]:
# recheck the content 
df.iloc[1847]

bookID                                          6549
title                  said the shotgun to the head.
authors                                Saul Williams
average_rating                                  4.22
isbn                                     743470796.0
isbn13                                 9780743470797
language_code                                  en-US
num_pages                                        192
ratings_count                                   2762
text_reviews_count                               214
publication_date                            9/1/2003
publisher                                  MTV Books
Name: 1847, dtype: object

In [13]:
# Check the types of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   bookID              11127 non-null  object
 1   title               11127 non-null  object
 2   authors             11127 non-null  object
 3   average_rating      11127 non-null  object
 4   isbn                11127 non-null  object
 5   isbn13              11127 non-null  object
 6   language_code       11127 non-null  object
 7   num_pages           11127 non-null  object
 8   ratings_count       11127 non-null  int64 
 9   text_reviews_count  11127 non-null  object
 10  publication_date    11127 non-null  object
 11  publisher           11127 non-null  object
dtypes: int64(1), object(11)
memory usage: 1.0+ MB


After the fix, the supposedly numerical columns became object columns, most likely because the values moved into row 1847 were strings. 
Re-investigate whether they contain non-numerical content.

In [14]:
# re-investigate if there is non-numerical content in supposedly numerical columns
columns = ["average_rating", "isbn13", "text_reviews_count", "num_pages", "ratings_count"]
check_non_numerical_date(columns, "numerical")

The non_numerical_values in column average_rating :  Series([], Name: average_rating, dtype: object)
The non_numerical_values in column isbn13 :  Series([], Name: isbn13, dtype: object)
The non_numerical_values in column text_reviews_count :  Series([], Name: text_reviews_count, dtype: object)
The non_numerical_values in column num_pages :  Series([], Name: num_pages, dtype: object)
The non_numerical_values in column ratings_count :  Series([], Name: ratings_count, dtype: int64)


They do not contain non-numerical content, so they will be converted to numerical types.

In [15]:
# convert to numerical
def convert_to_numerical(columns_list: list):
    """Convert the given columns of df to numeric dtypes in place."""
    for column in columns_list:
        df[column] = pd.to_numeric(df[column])

columns = ["average_rating", "isbn13", "text_reviews_count", "num_pages", "ratings_count"]

convert_to_numerical(columns)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11127 non-null  object 
 1   title               11127 non-null  object 
 2   authors             11127 non-null  object 
 3   average_rating      11127 non-null  float64
 4   isbn                11127 non-null  object 
 5   isbn13              11127 non-null  int64  
 6   language_code       11127 non-null  object 
 7   num_pages           11127 non-null  int64  
 8   ratings_count       11127 non-null  int64  
 9   text_reviews_count  11127 non-null  int64  
 10  publication_date    11127 non-null  object 
 11  publisher           11127 non-null  object 
dtypes: float64(1), int64(4), object(7)
memory usage: 1.0+ MB


In [16]:
# fix the dates by replacing 31 with 30 in June and November
# (indexing df directly with .loc avoids the chained indexing that triggers SettingWithCopyWarning)
# note: the column is not converted to datetime in this cell; it keeps the object type
df.loc[8180, "publication_date"] = '11/30/2000'
df.loc[11098, "publication_date"] = '6/30/1982'
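
The cell above only corrects the two invalid strings. As a minimal sketch of the conversion step itself, assuming the month/day/year format seen in the data (the three-value series `dates` below is made up for illustration), `pd.to_datetime` with an explicit format turns valid strings into timestamps and, with `errors="coerce"`, turns invalid ones into NaT.

```python
import pandas as pd

# hypothetical sample: the second value is the invalid date found at index 8180
dates = pd.Series(["9/16/2006", "11/31/2000", "6/30/1982"])

# explicit format avoids day/month ambiguity; invalid dates become NaT
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed.isna().tolist())  # [False, True, False]
```

The same call could be applied to df["publication_date"] once the two wrong dates are fixed.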


Resave the DataFrame to avoid repeating the process each time we want to use the df.

In [17]:
df.to_csv("books_updated.csv", index_label=False) # index_label = False so that it does not add another index label to the DataFrame

# Data Cleaning Phase 2

This data cleaning is done on the dataset obtained after web scraping. Through web scraping, the following extra features were extracted: 

- first_published: the date a book was first published (this helps distinguish two different books with the same name)
- book_format: the format of the book (some books are available in different formats, e.g. paperback, audio CD, hardcover)
- new_publisher: the publisher as scraped, added because some books in the original data had wrong publishers
- edition_avgRating: the actual average rating of each edition
- added_toShelves: the number of users that added a book to their shelves

*For more details about the process of scraping, please refer to scraper/scraper.py*

### Read the Data

In [2]:
# read the scraped data
df_scraped = pd.read_csv("scraper/booksRating_extraFeats.csv")

In [60]:
df_scraped.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,2006-09-16,Scholastic Inc.,"July 16, 2005",Paperback,Scholastic Inc,4.57,4405980.0
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,2004-09-01,Scholastic Inc.,"June 21, 2003",Paperback,Scholastic Inc.,4.5,4518536.0
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,2003-11-01,Scholastic,"July 2, 1998",Hardcover,,4.05,7469.0
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,2004-05-01,Scholastic Inc.,"July 8, 1999",Mass Market Paperback,Scholastic Inc.,4.57,5223956.0
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,2004-09-13,Scholastic,"October 1, 2003",Paperback,Scholastic,4.72,172736.0


In [61]:
df_scraped.tail()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
11122,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,eng,512,156,20,2004-12-21,Da Capo Press,"December 1, 2004",Paperback,Da Capo Press,4.06,552.0
11123,45633,You Bright and Risen Angels,William T. Vollmann,4.08,140110879,9780140110876,eng,635,783,56,1988-12-01,Penguin Books,"June 27, 1987",Paperback,Penguin Books,4.04,4269.0
11124,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,140131965,9780140131963,eng,415,820,95,1993-08-01,Penguin Books,"January 1, 1990",Paperback,Penguin Publishing Group,3.98,4686.0
11125,45639,Poor People,William T. Vollmann,3.72,60878827,9780060878825,eng,434,769,139,2007-02-27,Ecco,"January 1, 2007",Hardcover,Ecco,3.75,2948.0
11126,45641,Las aventuras de Tom Sawyer,Mark Twain,3.91,8497646983,9788497646987,spa,272,113,12,2006-05-28,Edimat Libros,"June 1, 1876",Paperback,,3.72,228.0


In [3]:
df_scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11127 non-null  int64  
 1   title               11127 non-null  object 
 2   authors             11127 non-null  object 
 3   average_rating      11127 non-null  float64
 4   isbn                11127 non-null  object 
 5   isbn13              11127 non-null  int64  
 6   language_code       11127 non-null  object 
 7   num_pages           11127 non-null  int64  
 8   ratings_count       11127 non-null  int64  
 9   text_reviews_count  11127 non-null  int64  
 10  publication_date    11127 non-null  object 
 11  publisher           11127 non-null  object 
 12  first_published     11125 non-null  object 
 13  book_format         11126 non-null  object 
 14  new_publisher       8360 non-null   object 
 15  edition_avgRating   11124 non-null  float64
 16  adde

### Fill Missing Data

In the new dataset, there are 2 missing values in first_published, 1 missing value in book_format, and 3 missing values in each of edition_avgRating and added_toShelves. These are values that the scraper failed to get. Since they are very few, they will be filled manually. 

For new_publisher, the scraper failed to get a lot of values because of how the publisher is stored in the html. Due to time constraints, and since the original dataset already has a publisher column, the old publisher will be used wherever new_publisher is NaN. The new publisher will be used only where it differs from the old publisher.
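
The rule for combining the two publisher columns can be sketched as follows. This is only an illustration on a made-up three-row frame, not the notebook's final implementation, and the output column name `publisher_final` is an assumption. Taking new_publisher where it exists and the old publisher otherwise gives the same result as "use the new value only where it differs", since equal values are interchangeable.

```python
import pandas as pd

# hypothetical toy data with the two publisher columns
toy = pd.DataFrame({
    "publisher": ["scholastic", "penguin books", "ecco"],
    "new_publisher": ["scholastic inc", None, "ecco"],
})

# prefer the scraped publisher; fall back to the original one where the scraper returned NaN
toy["publisher_final"] = toy["new_publisher"].fillna(toy["publisher"])

print(toy["publisher_final"].tolist())  # ['scholastic inc', 'penguin books', 'ecco']
```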

In [55]:
df_scraped[df_scraped[["first_published", "book_format", "edition_avgRating", "added_toShelves"]].isna().any(axis=1)]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves,first_author,num_contributors
2255,8077,animales no se visten los animals should defin...,judi barrettron barrett,4.11,1595191356,9781595191359,eng,32,0,0,1991-10-30,live oak media,"january 1, 1970",paperback,atheneum books for young readers,,,judi barrett,2
6396,24062,the deep dive trilogy,gordon korman,3.78,613674839,9780613674836,eng,148,0,0,2003-07-01,turtleback,,,,,,gordon korman,1
8476,32552,essential tales and poems,edgar allan poebenjamin f fisher,4.36,1593080646,9781593080648,en-us,688,66382,109,2004-10-25,barnes noble,,paperback,barnes & noble classics,4.36,92622.0,edgar allan poe,2
8520,32703,the diary of ellen rimbauer my life at rose red,joyce reardonsteven rimbauerridley pearson,3.67,786890436,9780786890439,eng,277,7852,352,2001-04-29,hyperion,"january 1, 2001",mass market paperback,hyperion,,,joyce reardon,3


In [58]:
# column name is added_toShelves (with a trailing "s"), as shown by df_scraped.info()
df_scraped.loc[2255, ["edition_avgRating", "added_toShelves"]] = [4.11, 5402]
# df_scraped.loc[6396, ["first_published", "book_format", "new_publisher", "edition_avgRating", "added_toShelves"]] ### !!!! Couldn't find exact book edition online !!!!
df_scraped.loc[8476, ["first_published"]] = ["January 1, 1843"]
df_scraped.loc[8520, ["edition_avgRating", "added_toShelves"]] = [3.7, 15942]

The book at index 6396 was not found on Goodreads, even when searching by book ID, isbn and isbn13. Therefore, its edition_avgRating will be set to its average_rating. Its added_toShelves will be the mean of the added_toShelves of the other editions of the book; if no other editions are found, it will be the mode of added_toShelves among the books having the same ratings_count and text_reviews_count. Its first_published will be the same as its publication_date, and its book_format will be the mode among books of approximately the same size. 
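
The fallback rules above can be sketched as a small function. This is a rough illustration on made-up data and rests on several assumptions that the notebook does not fix: other editions are identified by an identical title, "approximately the same size" is read as num_pages within a hypothetical `page_tolerance`, and the function name `fill_missing_edition` is invented here.

```python
import pandas as pd


def fill_missing_edition(df, idx, page_tolerance=20):
    """Fill the scraped fields of row `idx` using the fallback rules described above."""
    row = df.loc[idx]
    others = df.drop(index=idx)

    # edition rating falls back to the overall average rating
    df.loc[idx, "edition_avgRating"] = row["average_rating"]
    # first publication date falls back to this edition's publication date
    df.loc[idx, "first_published"] = row["publication_date"]

    # added_toShelves: mean over other editions (assumed: same title), otherwise the
    # mode over books with identical ratings_count and text_reviews_count
    editions = others.loc[others["title"] == row["title"], "added_toShelves"].dropna()
    if not editions.empty:
        df.loc[idx, "added_toShelves"] = editions.mean()
    else:
        similar = others.loc[
            (others["ratings_count"] == row["ratings_count"])
            & (others["text_reviews_count"] == row["text_reviews_count"]),
            "added_toShelves",
        ].dropna()
        if not similar.empty:
            df.loc[idx, "added_toShelves"] = similar.mode().iloc[0]

    # book_format: mode among books with a similar page count (assumed meaning of "size")
    close = others.loc[
        (others["num_pages"] - row["num_pages"]).abs() <= page_tolerance, "book_format"
    ].dropna()
    if not close.empty:
        df.loc[idx, "book_format"] = close.mode().iloc[0]
    return df


# hypothetical toy data: row 0 plays the role of the book that could not be scraped
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "average_rating": [3.78, 4.0, 3.5, 4.2],
    "publication_date": ["2003-07-01", "2001-01-01", "1999-05-05", "2010-10-10"],
    "num_pages": [148, 150, 140, 900],
    "ratings_count": [0, 0, 0, 10],
    "text_reviews_count": [0, 0, 0, 2],
    "first_published": [None, "january 1, 2001", "may 5, 1999", "october 10, 2010"],
    "book_format": [None, "paperback", "paperback", "hardcover"],
    "edition_avgRating": [None, 4.0, 3.5, 4.2],
    "added_toShelves": [None, 5.0, 5.0, 300.0],
})
toy = fill_missing_edition(toy, 0)
print(toy.loc[0, ["first_published", "book_format", "edition_avgRating", "added_toShelves"]])
```

In the toy data no other edition shares the title, so added_toShelves comes from the two books with the same ratings_count and text_reviews_count, and book_format comes from the two books with a similar page count.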

--- Kheirie --- 

**NOTE**

To continue, as noted above: 
- combine publisher and new_publisher
- fill the NaN values of the book at index 6396

The parts below work properly; they only need to be adapted to the updated df_scraped after doing the two points above.

-----------------------

### Set text columns to lower text

When dealing with text data, it is always safer to have it all in the same case, either lower or upper. 

In [6]:
# get the text columns
text_columns = df_scraped.select_dtypes(include=['object']).columns

# Convert to lowercase
df_scraped = to_lower(df_scraped, text_columns)
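
`to_lower` comes from the project's utils module, which is not shown in this notebook. As a rough idea of what such a helper might look like (an assumption, not the actual implementation; the name `to_lower_sketch` and the toy frame are made up), the pandas `.str.lower()` accessor lowercases each string and leaves NaN untouched.

```python
import pandas as pd


def to_lower_sketch(df, columns):
    """Return a copy of df with the given string columns lowercased (NaN stays NaN)."""
    out = df.copy()
    for column in columns:
        out[column] = out[column].str.lower()
    return out


# hypothetical toy data with one missing title
toy = pd.DataFrame({"title": ["Harry Potter", None], "num_pages": [652, 10]})
lowered = to_lower_sketch(toy, ["title"])
print(lowered["title"].tolist())
```

Working on a copy keeps the original frame unchanged, which matches the way the result is assigned back to df_scraped above.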

In [7]:
df_scraped.sample(5)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,first_published,book_format,new_publisher,edition_avgRating,added_toShelves
1270,4398,cliffsnotes on steinbeck's the grapes of wrath,kelly mcgrath vlcek/cliffsnotes/john steinbeck,3.81,0764585967,9780764585968,eng,112,30,5,2000-06-05,cliffs notes,"june 5, 2000",paperback,cliffs notes,3.77,79.0
7907,30352,led astray (hellraisers #1),erin st. claire/sandra brown,3.76,0778321584,9780778321583,eng,240,2647,143,2005-02-22,mira books,"october 1, 1985",hardcover,mira,3.71,5712.0
1295,4524,the shadow of the wind,carlos ruiz zafón/lucia graves,4.26,0753819317,9780753819319,eng,403,1278,175,2004-10-28,phoenix,"may 1, 2001",paperback,,4.1,2257.0
8457,32506,the poet (jack mcevoy #1; harry bosch univers...,michael connelly,4.2,0446690457,9780446690454,eng,510,64309,2025,2002-07-01,grand central publishing,"january 28, 1996",paperback,grand central publishing,4.2,127064.0
7342,28272,legacy of the darksword (the darksword #4),margaret weis/tracy hickman,3.47,055357812x,9780553578126,eng,400,138,4,1998-06-01,spectra,"july 1, 1997",mass market paperback,,3.52,478.0


### Take First Author Name and Create num_contributors Column

We have decided to take only the name of the first author, since usually first authors are the main authors. Another column will be added to indicate the total number of authors/contributors to the book.

In [8]:
# take the first author in the authors column
df_scraped["first_author"] = df_scraped["authors"].apply(lambda x: x.split("/")[0].strip())


In [9]:
df_scraped["first_author"].sample(5)

1156        ruth prawer jhabvala
10031       edgar rice burroughs
6594              bill watterson
8994              lester del rey
2274     barbara taylor bradford
Name: first_author, dtype: object

In [10]:
df_scraped["num_contributors"] = df_scraped["authors"].apply(lambda x: len(x.split("/")))

In [12]:
df_scraped[["authors","num_contributors"]].sample(5)

Unnamed: 0,authors,num_contributors
3288,paul bowles/francine prose,2
9065,michael munn,1
5055,georg wilhelm friedrich hegel/james black baillie,2
6277,rumiko takahashi,1
3994,philip k. dick/john brunner,2


### Clean Text Columns

Remove punctuation and extra white spaces from text columns.

The text in the publisher column needs more careful cleaning, to avoid having the same publisher represented in different ways. 
Example: at indexes 5993, 5365 and 9753, "W.W. Norton & Company", "W. W. Norton and Company" and "W. W. Norton  Company" represent the same publisher.

In [28]:
general_replacements = [
    (r'[^\w\s]', ''), # remove punctuation
    (r'\s\s+', " "), # remove double spaces and more   
]

columns_ = ["title", "first_author"]

df_scraped = sub_text(df_scraped, columns_, general_replacements)
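
`sub_text` is also imported from utils and is not shown here. The following is a minimal sketch of how such a helper could work, assuming each (pattern, replacement) pair is applied in order as a regular expression; the name `sub_text_sketch` and the toy title are made up for illustration.

```python
import pandas as pd


def sub_text_sketch(df, columns, replacements):
    """Apply each (regex pattern, replacement) pair, in order, to the given columns."""
    out = df.copy()
    for column in columns:
        for pattern, repl in replacements:
            out[column] = out[column].str.replace(pattern, repl, regex=True)
    return out


general = [
    (r'[^\w\s]', ''),  # remove punctuation
    (r'\s\s+', ' '),   # collapse runs of whitespace into one space
]

# hypothetical toy title with commas and a double space
toy = pd.DataFrame({"title": ["the poet, the warrior,  the prophet"]})
cleaned = sub_text_sketch(toy, ["title"], general)
print(cleaned.loc[0, "title"])  # the poet the warrior the prophet
```

The order of the pairs matters: punctuation is removed first, so that any double spaces it leaves behind are collapsed by the second pattern.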

In [49]:
df_scraped[["title", "first_author"]].sample(5)

Unnamed: 0,title,first_author
6476,his dark materials trilogy northern lights the...,philip pullman
6990,seven gothic tales,isak dinesen
9803,brand new justice how branding places and prod...,simon anholt
6253,the perfect london walk,roger ebert
5227,the poet the warrior the prophet,rubem a alves


In [52]:
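# assuming sub_text applies these as regular expressions (as in general_replacements),
# literal dots are escaped and abbreviations are matched as whole words
specific_replacements = [
    (r'\binc\b\.?', ""),
    (r'\bllc\b', ""),
    (r'\bltd\b', ""),
    (r'w\. w\.', "ww"), 
    ("&", " and "),
    (r'\bbooks?\b', ""),  
    (r'\bclassics?\b', ""),
    (r'\bpublishers?\b', ""),
    (r'\bpress\b', ""),
    (r'\bpublishing\b', "")
]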

df_scraped = sub_text(df_scraped, ["publisher"], specific_replacements)
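
To check that publisher variants really collapse to one value, a normalization can be tried on a few known examples first. The sketch below is an illustration only, using the three W. W. Norton spellings mentioned earlier; the step list, the choice to drop the word "and", and the name `normalize_publisher` are assumptions and not the notebook's final rules.

```python
import re

# hypothetical normalization steps, applied in order
NORMALIZATION_STEPS = [
    (r"&", " and "),        # spell out the ampersand
    (r"[^\w\s]", ""),       # remove punctuation
    (r"\band\b", ""),       # drop "and" so variants with and without it match
    (r"\bw\s+w\b", "ww"),   # join the spaced initials "w w"
    (r"\s\s+", " "),        # collapse runs of whitespace
]


def normalize_publisher(name):
    """Lowercase a publisher name and apply the normalization steps in order."""
    name = name.lower()
    for pattern, repl in NORMALIZATION_STEPS:
        name = re.sub(pattern, repl, name)
    return name.strip()


variants = ["W.W. Norton & Company", "W. W. Norton and Company", "W. W. Norton  Company"]
print({normalize_publisher(v) for v in variants})  # {'ww norton company'}
```

A set with a single element means the three spellings are now treated as the same publisher; the same check can be repeated for other groups of variants found in the data.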