 ## ML-Project
 - **Author(s)**: 
     - Hugues Delattre (DS)
     - Souheil Maatoug (DS)
 
 
### Description:
- Using real user-experience data about books (title, author, year, ...), this project aims to build an end-to-end machine learning pipeline that can <mark>**predict a book's rating**</mark>.
 
- This notebook presents an exploration of the available dataset with some analytics.

### Dataset
- The dataset is a collection of user-experience data taken from the social cataloging website [Goodreads](https://www.goodreads.com). It contains about 11,000 rows (11,127 once loaded), which should be sufficient for an ML project.

- The dataset contains the following columns: *bookID, title, authors, average_rating, isbn, isbn13, language_code, num_pages, ratings_count, text_reviews_count, publication_date, publisher*
 
- Optional additional dataset: if needed, we can use another database from the same source (Goodreads) that contains more information (genres, reviews, ...).
 
---
 
 - exploratory data analysis (EDA)
 - find duplicate values
 - find representative data
 - drop values
 - add values
 - replace empty values
 - compute stats
 - view the distribution of the different values
 - view the features that would be used in the model
 - check whether these features are representative enough to solve the problem

### 1. Getting to know the data

In [36]:
# headers / column names
with open("../dataset/books.csv", "r", encoding="utf-8") as f:
    header_line = f.readline()
    header_list = [item.strip() for item in header_line.rstrip("\n").split(",")]

In [37]:
print(len(header_list), "columns: ")
print(header_list)

12 columns: 
['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13', 'language_code', 'num_pages', 'ratings_count', 'text_reviews_count', 'publication_date', 'publisher']


- **bookID**: a unique identifier in the csv file
- **title**: the title of the book
- **authors**: the authors of the book
- **average_rating**: the average rating of the book, on a scale from 1 to 5 (a value of 0 also appears, e.g. for a book with no ratings)
- **isbn**: the isbn code (unique identifier) of the book
- **isbn13** : a 13 digit isbn code
- **language_code**: the language code of the book
- **num_pages**: the number of pages of the book
- **ratings_count**: the number of ratings of the book
- **text_reviews_count**: the count of reviews of the book
- **publication_date**: the date of publication of the book
- **publisher**: the name of the publisher

In [38]:
# check for incorrect structure in the dataset file
with open("../dataset/books.csv", "r", encoding="utf-8") as f:
    for line in f:
        line_list = line.rstrip("\n").split(",") 
        if len(line_list) != len(header_list):
            print(len(line_list), "columns")
            print(line_list, "\n")        

13 columns
['12224', 'Streetcar Suburbs: The Process of Growth in Boston  1870-1900', 'Sam Bass Warner', ' Jr./Sam B. Warner', '3.58', '0674842111', '9780674842113', 'en-US', '236', '61', '6', '4/20/2004', 'Harvard University Press'] 

13 columns
['16914', "The Tolkien Fan's Medieval Reader", 'David E. Smith (Turgon of TheOneRing.net', ' one of the founding members of this Tolkien website)/Verlyn Flieger/Turgon (=David E. Smith)', '3.58', '1593600119', '9781593600112', 'eng', '400', '26', '4', '4/6/2004', 'Cold Spring Press'] 

13 columns
['22128', 'Patriots (The Coming Collapse)', 'James Wesley', ' Rawles', '3.63', '156384155X', '9781563841552', 'eng', '342', '38', '4', '1/15/1999', 'Huntington House Publishers'] 

13 columns
['34889', "Brown's Star Atlas: Showing All The Bright Stars With Full Instructions How To Find And Use Them For Navigational Purposes And Department Of Trade Examinations.", 'Brown', ' Son & Ferguson', '0.00', '0851742718', '9780851742717', 'eng', '49', '0', '0',

Four lines of the dataset file have an incorrect structure (13 columns instead of 12) because a comma, which is also the field separator, appears inside the `authors` field. We can fix these lines manually:

- line 3350: the authors field is `Sam Bass Warner, Jr./Sam B. Warner`, which names the same person twice. We can replace that by `Sam Bass Warner Jr.`
- line 4704: the authors are `David E. Smith (Turgon of TheOneRing.net, one of the founding members of this Tolkien website)/Verlyn Flieger/Turgon (=David E. Smith)`. We can replace that by: `David E. Smith/Verlyn Flieger`
- line 5879: the authors are `James Wesley, Rawles`. We can replace that by `James Wesley/Rawles`
- line 8981: the authors are `Brown, Son & Ferguson`. We can replace that by `Brown/Son & Ferguson`
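
The manual corrections listed above could also be applied programmatically. The following is a minimal sketch, not the method actually used to produce `books_clean.csv`: the helper `repair_row` and the mapping `author_fixes` are hypothetical names, and the sketch assumes the stray comma always falls inside the `authors` field (index 2), as in the four lines found above.

```python
def repair_row(fields, fixes, n_cols=12, author_idx=2):
    """Return a row with n_cols fields, merging a split authors field when a fix is known."""
    if len(fields) == n_cols + 1 and fields[0] in fixes:
        # The stray comma split the authors field in two: replace both halves
        # with the manually chosen value.
        return fields[:author_idx] + [fixes[fields[0]]] + fields[author_idx + 2:]
    return fields


# Hypothetical mapping: bookID -> corrected authors (values taken from the list above)
author_fixes = {
    "16914": "David E. Smith/Verlyn Flieger",
    "22128": "James Wesley/Rawles",
    "34889": "Brown/Son & Ferguson",
}

raw = (
    "22128,Patriots (The Coming Collapse),James Wesley, Rawles,3.63,"
    "156384155X,9781563841552,eng,342,38,4,1/15/1999,Huntington House Publishers"
)
fixed = repair_row(raw.split(","), author_fixes)
print(len(fixed), fixed[2])
```

Rows that already have 12 fields, or whose `bookID` is not in the mapping, are returned unchanged, so the function can be applied to every line of the file.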

In [None]:
import pandas as pd

In [65]:
books_df = pd.read_csv("../dataset/books_clean.csv", sep=",", encoding="utf-8")
books_df.head()

| | bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harry Potter and the Half-Blood Prince (Harry ... | J.K. Rowling/Mary GrandPré | 4.57 | 0439785960 | 9780439785969 | eng | 652 | 2095690 | 27591 | 9/16/2006 | Scholastic Inc. |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling/Mary GrandPré | 4.49 | 0439358078 | 9780439358071 | eng | 870 | 2153167 | 29221 | 9/1/2004 | Scholastic Inc. |
| 2 | 4 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | 4.42 | 0439554896 | 9780439554893 | eng | 352 | 6333 | 244 | 11/1/2003 | Scholastic |
| 3 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling/Mary GrandPré | 4.56 | 043965548X | 9780439655484 | eng | 435 | 2339585 | 36325 | 5/1/2004 | Scholastic Inc. |
| 4 | 8 | Harry Potter Boxed Set Books 1-5 (Harry Potte... | J.K. Rowling/Mary GrandPré | 4.78 | 0439682584 | 9780439682589 | eng | 2690 | 41428 | 164 | 9/13/2004 | Scholastic |


In [50]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11127 non-null  int64  
 1   title               11127 non-null  object 
 2   authors             11127 non-null  object 
 3   average_rating      11127 non-null  float64
 4   isbn                11127 non-null  object 
 5   isbn13              11127 non-null  int64  
 6   language_code       11127 non-null  object 
 7     num_pages         11127 non-null  int64  
 8   ratings_count       11127 non-null  int64  
 9   text_reviews_count  11127 non-null  int64  
 10  publication_date    11127 non-null  object 
 11  publisher           11127 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.0+ MB


In [46]:
books_df.describe()

| | bookID | average_rating | isbn13 | num_pages | ratings_count | text_reviews_count |
|---|---|---|---|---|---|---|
| count | 11127.0 | 11127.0 | 11127.0 | 11127.0 | 11127.0 | 11127.0 |
| mean | 21310.938887 | 3.933631 | 9759888000000.0 | 336.376921 | 17936.41 | 541.854498 |
| std | 13093.358023 | 0.352445 | 442896400000.0 | 241.127305 | 112479.4 | 2576.176608 |
| min | 1.0 | 0.0 | 8987060000.0 | 0.0 | 0.0 | 0.0 |
| 25% | 10287.0 | 3.77 | 9780345000000.0 | 192.0 | 104.0 | 9.0 |
| 50% | 20287.0 | 3.96 | 9780586000000.0 | 299.0 | 745.0 | 46.0 |
| 75% | 32104.5 | 4.135 | 9780873000000.0 | 416.0 | 4993.5 | 237.5 |
| max | 45641.0 | 5.0 | 9790008000000.0 | 6576.0 | 4597666.0 | 94265.0 |


In [68]:
print("finding duplicates: \n")
for col_value in books_df.columns:
    print(col_value, ": ", books_df[col_value].duplicated().any())

finding duplicates: 

bookID :  False
title :  True
authors :  True
average_rating :  True
isbn :  False
isbn13 :  False
language_code :  True
  num_pages :  True
ratings_count :  True
text_reviews_count :  True
publication_date :  True
publisher :  True
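
The column-wise check above only tells us whether a value repeats within a single column, which is expected for most of them. To look for books that may appear more than once, it can help to check a combination of columns instead. Here is a minimal sketch in plain Python: the helper `duplicated_keys` is a hypothetical name, the choice of `title` plus `authors` as the key is an assumption, and the sample rows are placeholders rather than rows from the dataset.

```python
from collections import Counter


def duplicated_keys(rows, key_fields=("title", "authors")):
    """Return the key combinations that occur more than once, with their counts."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder rows, for illustration only
sample_rows = [
    {"title": "Book A", "authors": "Author X"},
    {"title": "Book A", "authors": "Author X"},
    {"title": "Book A", "authors": "Author Y"},
]
print(duplicated_keys(sample_rows))
```

On the DataFrame itself, `books_df.duplicated(subset=["title", "authors"], keep=False)` gives a comparable row-level mask.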


#### Comments
- `title` is a long string and is not always unique
- `authors` can be a single name or several names separated by a "/" (to be verified)
- `average_rating` is the target attribute. Its values should be between 1 and 5, yet the minimum observed is 0
- `isbn` and `isbn13` are unique 
- `language_code` is a categorical value
- `num_pages` should be strictly positive, yet the minimum observed is 0. The column name is also read with leading spaces (`  num_pages`) and should be stripped
- `ratings_count` and `text_reviews_count` can be 0 (should we keep those rows?)
- `publisher` is the name of the publisher. The same publisher sometimes appears under slightly different names, such as `Scholastic` and `Scholastic Inc.`
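
The expectations in these comments can be turned into simple counts. The sketch below is one possible way to do it in plain Python, assuming each row is a dict of strings (for example as produced by `csv.DictReader`). The function name `sanity_report` is hypothetical. Header names are stripped first because `num_pages` shows up with leading spaces in the outputs above. The two sample rows reuse values that appear earlier in this notebook.

```python
def sanity_report(rows):
    """Count rows that violate the expectations listed in the comments."""
    report = {
        "rating_out_of_range": 0,
        "non_positive_pages": 0,
        "zero_ratings_count": 0,
        "zero_text_reviews": 0,
    }
    for raw in rows:
        # Normalise header names such as "  num_pages"
        row = {key.strip(): value for key, value in raw.items()}
        if not 1 <= float(row["average_rating"]) <= 5:
            report["rating_out_of_range"] += 1
        if int(row["num_pages"]) <= 0:
            report["non_positive_pages"] += 1
        if int(row["ratings_count"]) == 0:
            report["zero_ratings_count"] += 1
        if int(row["text_reviews_count"]) == 0:
            report["zero_text_reviews"] += 1
    return report


sample_rows = [
    {"average_rating": "4.57", "  num_pages": "652",
     "ratings_count": "2095690", "text_reviews_count": "27591"},
    {"average_rating": "0.00", "  num_pages": "49",
     "ratings_count": "0", "text_reviews_count": "0"},
]
report = sanity_report(sample_rows)
print(report)
```

The counts only flag suspicious rows. Whether to drop or keep them remains an open question.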

### title

### authors

In [70]:
books_df[["authors"]]

| | authors |
|---|---|
| 0 | J.K. Rowling/Mary GrandPré |
| 1 | J.K. Rowling/Mary GrandPré |
| 2 | J.K. Rowling |
| 3 | J.K. Rowling/Mary GrandPré |
| 4 | J.K. Rowling/Mary GrandPré |
| ... | ... |
| 11122 | William T. Vollmann/Larry McCaffery/Michael He... |
| 11123 | William T. Vollmann |
| 11124 | William T. Vollmann |
| 11125 | William T. Vollmann |


In [76]:
authors_list = books_df["authors"].to_list()

In [84]:
authors_list

['J.K. Rowling/Mary GrandPré',
 'J.K. Rowling/Mary GrandPré',
 'J.K. Rowling',
 'J.K. Rowling/Mary GrandPré',
 'J.K. Rowling/Mary GrandPré',
 'W. Frederick Zimmerman',
 'J.K. Rowling',
 'Douglas Adams',
 'Douglas Adams',
 'Douglas Adams',
 'Douglas Adams/Stephen Fry',
 'Douglas Adams',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'Bill Bryson',
 'J.R.R. Tolkien',
 'J.R.R. Tolkien',
 'J.R.R. Tolkien',
 'J.R.R. Tolkien/Alan  Lee',
 'Chris   Smith/Christopher  Lee/Richard Taylor',
 'Jude Fisher',
 'Dave Thomas/David Heinemeier Hansson/Leon Breedt/Mike Clark/Thomas  Fuchs/Andreas  Schwarz',
 'Gary Paulsen',
 'Donna Ickes/Edward Sciranko/Keith Vasconcelles',
 'Gary Paulsen',
 'Molly Hatchet',
 'Dale Peck',
 'Angela Knight/Sahara Kelly/Judy Mays/Marteeka Karland/Kate Douglas/Shelby Morgen/Lacey Savage/Kate Hill/Willa Okati',
 'Delia Sherman',
 'Patricia A. McKillip',
 'Zilpha Keatley Snyder',
 'Kate Horsley',

In [90]:
# counting multiple authors
count_mult_authors = 0
num_mult_authors = []
for item in authors_list:
    if "/" in item:
        count_mult_authors += 1
        num_mult_authors.append(len(item.split("/")))

In [91]:
count_mult_authors

4566
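
Beyond the total count, the distribution of the number of authors per book may be informative. The following sketch assumes that "/" is the only separator between names (still to be verified, as noted in the comments). It also collapses the repeated spaces visible in some names, such as `Alan  Lee`. The helpers `split_authors` and `authors_per_book` are hypothetical names, and the sample is taken from the list printed above.

```python
from collections import Counter


def split_authors(field):
    """Split an authors field on "/" and collapse repeated whitespace in each name."""
    return [" ".join(name.split()) for name in field.split("/") if name.strip()]


def authors_per_book(authors_list):
    """Map a number of authors to the number of books having that many authors."""
    return Counter(len(split_authors(item)) for item in authors_list)


sample = [
    "J.K. Rowling/Mary GrandPré",
    "J.K. Rowling",
    "Douglas Adams/Stephen Fry",
    "Chris   Smith/Christopher  Lee/Richard Taylor",
]
print(split_authors(sample[3]))
print(sorted(authors_per_book(sample).items()))
```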

### average rating

### language code

### number of pages

### ratings count

### text reviews count

### publication date
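
`publication_date` is loaded as a string (`object` dtype), and the values shown above, such as `9/16/2006`, suggest a month/day/year format. A minimal parsing sketch under that assumption follows. The function name `parse_publication_date` is hypothetical, and returning `None` for values that cannot be parsed is a design choice of this sketch so that such rows can be inspected later.

```python
from datetime import datetime


def parse_publication_date(text):
    """Parse a month/day/year string into a date, or return None if it is invalid."""
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


print(parse_publication_date("9/16/2006"))
print(parse_publication_date("not a date"))
```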

### publisher
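
As noted in the comments, the same publisher can appear under slightly different names (`Scholastic` and `Scholastic Inc.`). One possible way to group such variants is to compare a normalised key. The sketch below is only an illustration: the function name `normalize_publisher` and the list of company suffixes it strips are assumptions, not something derived from the dataset.

```python
import re

# Assumed list of trailing company suffixes, to be adapted to the data
_SUFFIX = re.compile(r"[\s,]+(inc|ltd|llc|co)\.?$", re.IGNORECASE)


def normalize_publisher(name):
    """Return a comparison key: collapsed whitespace, no company suffix, case-folded."""
    cleaned = " ".join(name.split())
    cleaned = _SUFFIX.sub("", cleaned)
    return cleaned.casefold()


for name in ["Scholastic", "Scholastic Inc.", "Harvard University Press"]:
    print(name, "->", normalize_publisher(name))
```

The key is meant for grouping only. The original spelling can be kept in the `publisher` column.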

### Notes
- It might be better to include an additional dataset containing more information (like genres, keywords of summary, ...)