# Goodreads-books


<img src="Books/book1.jpg">

The basic idea behind analysing the Goodreads dataset is to understand the relationships between the attributes a book might have, such as the aggregate rating of each book, how authors trend over the years, and which books appear in several languages. Some books have well over a hundred thousand ratings and seem to grow more popular with each passing day.

We've always considered the magical persona books seem to hold, and with this notebook we set out to see what kind of books really drive people to read in this era of modern smart devices.

With such a vast number of factors to consider, we'll focus on the following questions:

* Is there a relationship between a book's average rating and the total number of ratings it has received?
* Where do the majority of books lie in terms of ratings? Does reading a book really bias the rating given?
* Do authors tend to perform consistently over time with their newer books, or do they just fizzle out?
* Does the number of pages have an impact on reading styles, ratings and popularity?
* Can books be recommended based on ratings? Is that a factor that can work?

## 1).Import Module

* **Packages Imported** :-

    1. sys: access to system parameters
    2. numpy : for numerical computation
    3. pandas: for data manipulation and analysis
    4. matplotlib: plotting library
    5. seaborn: plotting library based on matplotlib
    6. sklearn: machine learning library

In [None]:
# load packages
import numpy as np
import pandas as pd

In [2]:
#import visual modules
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")

In [3]:
## version of packages used
print(f'Numpy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')
#print(f'Matplotlib version: {plt.__version__}')
print(f'Seaborn version: {sns.__version__}')

Numpy version: 1.15.4
Pandas version: 0.23.4
Seaborn version: 0.9.0


## 2).The Data

**The data contains the following information :-**

* **bookID :** Unique identification number for each book.
* **title :** The name under which the book was published.
* **authors :** Names of the authors of the book. Multiple authors are delimited with `/`.
* **average_rating :** The average rating of the book received in total.
* **isbn :** Another unique number to identify the book, the International Standard Book Number.
* **isbn13 :** A 13-digit ISBN to identify the book, instead of the older 10-digit ISBN.
* **language_code :** The primary language of the book. For instance, eng stands for English.
* **num_pages :** Number of pages the book contains.
* **ratings_count :** Total number of ratings the book received.
* **text_reviews_count :** Total number of written text reviews the book received.
* **publication_date :** Date when the book was first published.
* **publisher :** The name of the publisher. 


**Read the data from the books.csv file in the Books folder**

In [79]:
#save the data from books.csv in books dataframe
books=pd.read_csv("Books/books.csv",error_bad_lines=False)

b'Skipping line 3350: expected 12 fields, saw 13\nSkipping line 4704: expected 12 fields, saw 13\nSkipping line 5879: expected 12 fields, saw 13\nSkipping line 8981: expected 12 fields, saw 13\n'
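
The `error_bad_lines=False` argument matches the pandas 0.23.4 used in this notebook. It was deprecated in pandas 1.3 and removed in 2.0 in favour of `on_bad_lines="skip"`. Below is a minimal sketch of a reader that should work on both, assuming a hypothetical helper named `read_books` and a tiny in-memory CSV standing in for `Books/books.csv`:

```python
import io
import pandas as pd

def read_books(source):
    """Read a CSV, skipping rows that have more fields than the header.

    `read_books` is a hypothetical helper, not part of the original notebook.
    """
    try:
        # pandas >= 1.3 spells the option `on_bad_lines`
        return pd.read_csv(source, on_bad_lines="skip")
    except TypeError:
        # older pandas (such as 0.23.4) only knows `error_bad_lines`
        return pd.read_csv(source, error_bad_lines=False)

# Toy data: the third line has one field too many, like the skipped lines above
sample = io.StringIO("bookID,title\n1,Book A\n2,Book B,extra\n3,Book C\n")
demo = read_books(sample)
print(demo)
```

Either way the malformed rows are dropped, so it is worth keeping the list of skipped line numbers in case those books matter later.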


Use the head function to display the first rows of books

In [81]:
#displaying the first 5 rows of the books dataset
books.head()

| | bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harry Potter and the Half-Blood Prince (Harry ... | J.K. Rowling/Mary GrandPré | 4.57 | 0439785960 | 9780439785969 | eng | 652 | 2095690 | 27591 | 9/16/2006 | Scholastic Inc. |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling/Mary GrandPré | 4.49 | 0439358078 | 9780439358071 | eng | 870 | 2153167 | 29221 | 9/1/2004 | Scholastic Inc. |
| 2 | 4 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | 4.42 | 0439554896 | 9780439554893 | eng | 352 | 6333 | 244 | 11/1/2003 | Scholastic |
| 3 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling/Mary GrandPré | 4.56 | 043965548X | 9780439655484 | eng | 435 | 2339585 | 36325 | 5/1/2004 | Scholastic Inc. |
| 4 | 8 | Harry Potter Boxed Set Books 1-5 (Harry Potte... | J.K. Rowling/Mary GrandPré | 4.78 | 0439682584 | 9780439682589 | eng | 2690 | 41428 | 164 | 9/13/2004 | Scholastic |


In [82]:
#using the shape attribute to get the number of rows and columns of the data
books.shape

(11123, 12)

There are 11123 rows and 12 columns in the books dataframe.


In [83]:
#using the describe function to view summary statistics such as percentiles, mean and std
books.describe()

| | bookID | average_rating | isbn13 | num_pages | ratings_count | text_reviews_count |
|---|---|---|---|---|---|---|
| count | 11123.0 | 11123.0 | 11123.0 | 11123.0 | 11123.0 | 11123.0 |
| mean | 21310.856963 | 3.934075 | 9759880000000.0 | 336.405556 | 17942.85 | 542.048099 |
| std | 13094.727252 | 0.350485 | 442975800000.0 | 241.152626 | 112499.2 | 2576.619589 |
| min | 1.0 | 0.0 | 8987060000.0 | 0.0 | 0.0 | 0.0 |
| 25% | 10277.5 | 3.77 | 9780345000000.0 | 192.0 | 104.0 | 9.0 |
| 50% | 20287.0 | 3.96 | 9780582000000.0 | 299.0 | 745.0 | 47.0 |
| 75% | 32104.5 | 4.14 | 9780872000000.0 | 416.0 | 5000.5 | 238.0 |
| max | 45641.0 | 5.0 | 9790008000000.0 | 6576.0 | 4597666.0 | 94265.0 |


~From this summary we get various statistics. For example, the maximum average_rating is 5.0, which is consistent with ratings being given out of 5.
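
The `describe()` output also shows a minimum of 0 for `average_rating`, `num_pages` and `ratings_count`, which may point to placeholder or incomplete rows. A minimal sketch of how such rows could be counted, using a made-up three-row frame (`demo`) rather than the real `books` data:

```python
import pandas as pd

# Toy frame standing in for `books`; the values are illustrative only
demo = pd.DataFrame({
    "average_rating": [4.57, 0.0, 3.96],
    "num_pages": [652, 0, 299],
    "ratings_count": [2095690, 0, 745],
})

# Count how many rows hold an exact zero in each numeric column
zero_counts = (demo[["average_rating", "num_pages", "ratings_count"]] == 0).sum()
print(zero_counts)
```

Running the same expression on `books` would show how many rows are affected before deciding whether to drop or keep them.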

In [86]:
#using the info function to get the concise summary of the dataframe
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
bookID                11123 non-null int64
title                 11123 non-null object
authors               11123 non-null object
average_rating        11123 non-null float64
isbn                  11123 non-null object
isbn13                11123 non-null int64
language_code         11123 non-null object
  num_pages           11123 non-null int64
ratings_count         11123 non-null int64
text_reviews_count    11123 non-null int64
publication_date      11123 non-null object
publisher             11123 non-null object
dtypes: float64(1), int64(5), object(6)
memory usage: 1.0+ MB
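
The `info()` output above lists one column as `  num_pages`, with leading spaces, which suggests the header in the CSV carries extra whitespace. If so, `books["num_pages"]` would raise a `KeyError`. A small sketch of one way to normalise the names, shown on a toy frame (`demo`) that mimics the padded header:

```python
import pandas as pd

# Toy frame; the padded column name imitates the one in the info() output
demo = pd.DataFrame({"bookID": [1, 2], "  num_pages": [652, 870]})

# Strip leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```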


 **We get the info that:-**
   * bookID, isbn13, num_pages, ratings_count and text_reviews_count are of datatype int64
   * title, authors, isbn, language_code, publication_date and publisher are of datatype object
   * average_rating is of datatype float64
   
   ~Also, the memory required for the books dataframe is reported as 1.0+ MB. The `+` means this is a lower bound, because the contents of the object columns are not counted.
   


**To make our data smaller and faster to work with, we need to reduce the memory it uses**
For that we need to know exactly how much memory each column takes and how much the dataframe takes as a whole.

In [72]:
#to get the exact memory usage of the books dataframe
books.info(memory_usage = "deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
bookID                11123 non-null int64
title                 11123 non-null object
authors               11123 non-null object
average_rating        11123 non-null float64
isbn                  11123 non-null object
isbn13                11123 non-null int64
language_code         11123 non-null object
  num_pages           11123 non-null int64
ratings_count         11123 non-null int64
text_reviews_count    11123 non-null int64
publication_date      11123 non-null object
publisher             11123 non-null object
dtypes: float64(1), int64(5), object(6)
memory usage: 5.2 MB


~The total memory usage is 5.2 MB once the string contents of the object columns are counted

In [87]:
#to get the memory usage of each individual column
books.memory_usage(deep = True)

Index                      80
bookID                  88984
title                 1044646
authors                925543
average_rating          88984
isbn                   745240
isbn13                  88984
language_code          670637
  num_pages             88984
ratings_count           88984
text_reviews_count      88984
publication_date       731048
publisher              806283
dtype: int64
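
In the figures above, the six object columns account for most of the memory, while each int64 or float64 column costs a flat 88984 bytes (8 bytes × 11123 rows). A minimal sketch of a helper that ranks columns by their share, assuming a hypothetical function name `memory_report` and a toy frame in place of `books`:

```python
import pandas as pd

def memory_report(df):
    """Return per-column deep memory usage, largest first (hypothetical helper)."""
    usage = df.memory_usage(deep=True, index=False)
    report = pd.DataFrame({"bytes": usage, "dtype": df.dtypes.astype(str)})
    # Share of the total column memory, as a percentage
    report["share_pct"] = (100 * report["bytes"] / report["bytes"].sum()).round(1)
    return report.sort_values("bytes", ascending=False)

# Toy frame: one string column and one integer column
toy = pd.DataFrame({"title": ["x" * 60] * 50, "num_pages": range(50)})
report = memory_report(toy)
print(report)
```

Sorting this way makes it easier to see which columns are worth converting first.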

**Making the books dataframe fast and small**
* Converting publication_date from object to datetime datatype
* Converting language_code from object to category 

In [88]:
#changing publication_date column from object to dateTime 
books["publication_date"] = pd.to_datetime(books["publication_date"],  errors='coerce')
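
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`. The later `info()` output shows 11121 non-null dates out of 11123, so two rows appear to have been coerced. A minimal sketch of how the offending values could be listed, assuming the `M/D/YYYY` layout seen in the `head()` output and using made-up sample strings:

```python
import pandas as pd

# Toy series; the middle value is not a real calendar date (November has 30 days)
raw = pd.Series(["9/16/2006", "11/31/2000", "5/1/2004"])

# An explicit format is an assumption based on the dates shown by head()
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Keep the original strings for every row that failed to parse
bad = raw[parsed.isna()]
print(bad)
```

Applying the same mask to `books` before overwriting the column would reveal which two dates need fixing by hand.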

In [89]:
#changing language_code column from object to category
books["language_code"] = books["language_code"].astype("category")

In [76]:
#viewing the integer codes that pandas stores in place of each language string
books["language_code"].cat.codes.head()

0    5
1    5
2    5
3    5
4    5
dtype: int8
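
A category column stores each distinct string once and keeps a small integer code per row, which is why the codes above are `int8`. The sketch below illustrates this on a made-up series of language codes (not the real column) and compares its memory footprint before and after conversion:

```python
import pandas as pd

# Toy series with a few repeated language codes
langs = pd.Series(["eng", "en-US", "fre", "eng", "spa", "eng"] * 1000)
as_cat = langs.astype("category")

# Mapping from integer code to the label it stands for
code_to_label = dict(enumerate(as_cat.cat.categories))
print(code_to_label)

# Deep memory usage in bytes, before and after the conversion
before = langs.memory_usage(deep=True)
after = as_cat.memory_usage(deep=True)
print(before, after)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.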

**Why can we remove the isbn column from the books dataframe?**

There are two columns, isbn and isbn13, in the books dataframe. isbn holds the older 10-character ISBN assigned to books (the last character can be X, as in 043965548X). Since 2007, books have been assigned 13-digit ISBNs. Both columns identify the same book, so we can remove one of them. We drop the isbn column because isbn13 is the format now in use.
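
The two identifiers are closely related: an ISBN-13 with the 978 prefix can be built from the ISBN-10 by dropping the old check character and computing a new check digit. The sketch below uses the standard ISBN-13 check-digit rule (alternating weights 1 and 3) and a hypothetical function name, `isbn10_to_isbn13`. The sample values come from the `head()` output above.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-character ISBN to its 978-prefixed 13-digit form."""
    # Drop the ISBN-10 check character and add the 978 prefix
    core = "978" + isbn10[:9]
    # Weights alternate 1, 3, 1, 3, ... over the twelve digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

print(isbn10_to_isbn13("0439785960"))
print(isbn10_to_isbn13("043965548X"))
```

Because isbn can be recovered from isbn13 in this way for 978-prefixed books, dropping it loses little information.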

In [77]:
#dropping the isbn column.
books.drop(["isbn"],axis = 1,inplace = True)

In [78]:
books.info(memory_usage = "deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 11 columns):
bookID                11123 non-null int64
title                 11123 non-null object
authors               11123 non-null object
average_rating        11123 non-null float64
isbn13                11123 non-null int64
language_code         11123 non-null category
  num_pages           11123 non-null int64
ratings_count         11123 non-null int64
text_reviews_count    11123 non-null int64
publication_date      11121 non-null datetime64[ns]
publisher             11123 non-null object
dtypes: category(1), datetime64[ns](1), float64(1), int64(5), object(3)
memory usage: 3.3 MB


Now we can see that the memory usage has dropped from 5.2 MB to 3.3 MB.
This is beneficial because the books dataframe
* takes up less memory.
* is generally faster to work with, since there is less data to process.
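
The drop can be checked by hand from the per-column figures printed by `books.memory_usage(deep = True)`. The sketch below assumes the category column costs roughly one byte per row (its `int8` codes, ignoring the small table of labels) and the datetime column eight bytes per row. The "before" figures are copied from the output above.

```python
MIB = 1024 ** 2          # pandas reports "MB" in units of 1024 * 1024 bytes
ROWS = 11123
TOTAL_BEFORE = 5457381   # sum of all figures printed by memory_usage(deep=True)

# Bytes per affected column before the changes (from the output above)
before = {"isbn": 745240, "language_code": 670637, "publication_date": 731048}
# Approximate bytes afterwards: dropped, int8 codes, datetime64[ns]
after = {"isbn": 0, "language_code": ROWS * 1, "publication_date": ROWS * 8}

saved = {col: before[col] - after[col] for col in before}
total_saved = sum(saved.values())

print(saved)
print(round(TOTAL_BEFORE / MIB, 1), round((TOTAL_BEFORE - total_saved) / MIB, 1))
```

Under these assumptions each of the three changes saves a similar amount, between about 640 KB and 750 KB.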