![](https://scontent.fjed4-5.fna.fbcdn.net/v/t1.0-9/101887167_10158651456902028_6551300404716503040_n.jpg?_nc_cat=103&_nc_sid=dd9801&_nc_ohc=yxIA4q5SKPsAX_FERau&_nc_ht=scontent.fjed4-5.fna&oh=fcdcb5f76c2a3046273c9fa0cc3d8f03&oe=5F3C6B0A)

# Table of Contents
*   [Cleaning Data](#Cleaning-Data)
*   [Exploring data](#Exploring-data)
    *   [Univariate exploratory analysis](#Univariate-exploratory-analysis)
    *   [Bivariate exploratory analysis](#Bivariate-exploratory-analysis)
    *   [Multivariate exploratory analysis](#Multivariate-exploratory-analysis)
*   [Summary](#Summary)

           
  




[Goodreads](https://www.goodreads.com/) is the world’s largest site for readers and book recommendations. On the Goodreads website, individuals can add books to their bookshelves and rate, recommend, and review books.

This dataset contains a comprehensive list of 11127 books listed on Goodreads. It includes the bookID, title, authors, average_rating, language_code, number of pages, ratings_count, text_reviews_count, publication_date and publisher.

 

<a id="Cleaning-Data"></a>
# Cleaning Data 

In [None]:
# importing libraries
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import scipy.cluster.hierarchy as shc
%matplotlib inline

### Reading Data 

In [None]:
#Reading dataset with pandas
books= pd.read_csv('../input/goodreads-bookscsv/goodreads_books.csv')
books.head()

### Assessing

#### The structure of the dataset


In [None]:
books.shape

In [None]:
books.info()

In [None]:
books.nunique()

### Cleaning

The data is `tidy`, but there are a few `quality` issues that need to be fixed:
 - Some columns have an incorrect type.
 - Rows with missing values or a zero ratings count need to be removed.
 - The 'isbn', 'isbn13', and 'Unnamed: 12' columns are unnecessary.

 

#### Making copy of the data 

In [None]:
#make copy of the data to start cleaning
df= books.copy()

#### Correcting column types

In [None]:
# convert the average rating and page count columns to numeric types;
# values that cannot be parsed become NaN
df['average_rating']= pd.to_numeric(df.average_rating, errors='coerce')
df['  num_pages']= pd.to_numeric(df['  num_pages'], errors='coerce')
# check the resulting dtypes
df.info()
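
As a minimal sketch of what `errors='coerce'` does, the toy Series below (hypothetical values, not taken from the dataset) contains one entry that cannot be parsed as a number. `pd.to_numeric` turns it into `NaN` instead of raising an error, which is why missing values need to be handled afterwards.

```python
import pandas as pd

# hypothetical raw column where one value is not a number
raw = pd.Series(['4.57', '3.90', 'not a rating'])

# unparseable entries become NaN instead of raising a ValueError
clean = pd.to_numeric(raw, errors='coerce')

print(clean.dtype)
print(clean.isna().sum())
```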

#### Finding rows with missing values

In [None]:
# count the missing (NaN) values in each column
df.isnull().sum()
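
Note that `isnull()` counts missing values (`NaN`), not zeros. The sketch below uses a hypothetical toy frame (the values are made up for illustration) to show how both could be counted separately.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one missing value and one zero
toy = pd.DataFrame({
    'average_rating': [4.1, np.nan, 3.8],
    'ratings_count': [120, 15, 0],
})

missing_per_column = toy.isnull().sum()  # NaN entries per column
zeros_per_column = (toy == 0).sum()      # entries equal to zero per column

print(missing_per_column)
print(zeros_per_column)
```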

#### Dropping unnecessary columns and rows

In [None]:
# drop unnecessary columns
df.drop(['isbn','isbn13','Unnamed: 12'],axis =1,inplace=True)
# drop rows with a zero ratings count (copy to avoid modifying a view)
df = df[df.ratings_count!= 0].copy()
# drop rows with missing values
df.dropna(inplace=True)

#### Testing the cleaning process

In [None]:
# check the data after cleaning

df.isnull().sum()

In [None]:
df.info()

##### To make this analysis more interesting, a rating column can be added that describes the meaning of the average rating.
> On Goodreads, a book can be rated out of 5 stars. Personally, I interpret the ratings as:
 - 1 star  is disappointed
 - 2 stars is ok
 - 3 stars is good
 - 4 stars is very good 
 - 5 stars is highly recommended
 
> This data contains average values, so the interpretation is adjusted to ranges of the average rating.

In [None]:
# Create a list to store the data
rating = []

for x in df['average_rating']: 
    if x >= 2.5 and x < 3.5:
        rating.append('Ok')
    elif x >= 3.5 and  x <3.9:
        rating.append('GOOD')
    elif x >= 3.9  and x < 4.2:
        rating.append('Very Good')
    elif x >= 4.2 :
        rating.append('Highly Recommended')
    else :
        rating.append('Disappointed')
# Create a column for the list
df['rating']= rating
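
The same binning could also be sketched with `pandas.cut`. The example below is an alternative under stated assumptions, not the method used in this notebook: the sample ratings are hypothetical, and the bin edges simply mirror the thresholds of the loop above.

```python
import numpy as np
import pandas as pd

# hypothetical average ratings used only for illustration
avg = pd.Series([2.0, 2.5, 3.49, 3.5, 3.9, 4.2, 4.8])

# bin edges mirror the thresholds of the loop; right=False makes
# each interval closed on the left and open on the right: [low, high)
edges = [-np.inf, 2.5, 3.5, 3.9, 4.2, np.inf]
labels = ['Disappointed', 'Ok', 'GOOD', 'Very Good', 'Highly Recommended']

categories = pd.cut(avg, bins=edges, labels=labels, right=False)
print(categories.tolist())
```

One difference to keep in mind: the loop sends missing values to `'Disappointed'`, while `pd.cut` leaves them as missing.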


In [None]:
#testing the change
df.head()

### Creating a column that contains the ratio of text reviews to ratings count

In [None]:
# ratio of text reviews count to ratings count, as a percentage
df['ratio']= df['text_reviews_count']/df['ratings_count']*100

In [None]:
df['ratio'].describe()

> The cleaning process is complete at this point. 

<a id="Exploring-data"></a>
## Exploring data 

<a id="Univariate-exploratory-analysis"></a>

## Univariate exploratory analysis
> Univariate analysis involves the analysis of a single variable. In this section, we will explore one variable at a time.

#### Descriptive statistics for each variable

In [None]:
# descriptive statistics for each numerical variable
df.describe()

### The distribution of ratings count, text reviews count, and Number of pages

> A histogram is used to plot the distribution of a numeric variable. 

In [None]:
# distribution of ratings_count, text_reviews_count, and num_pages
# suppress divide-by-zero warnings raised by log10 of zero counts
np.seterr(divide = 'ignore')
# left plot: hist of ratings count
plt.figure(figsize = [12, 8])
plt.subplot(1, 3, 1)
log_data = np.log10(df['ratings_count']) # data transform
log_bin_edges = np.arange(0, log_data.max()+0.25,0.25)
plt.hist(log_data, bins = log_bin_edges)
plt.xlabel('log(ratings count)')

# central plot: hist of text reviews count
plt.subplot(1, 3, 2)
log_data = np.log10(df['text_reviews_count']) # direct data transform
log_bin_edges = np.arange(0, log_data.max()+0.25,0.25)
plt.hist(log_data, bins = log_bin_edges)
plt.xlabel('log(text reviews count)')

# right plot: # of pages 
plt.subplot(1, 3, 3)
plt.hist(df['  num_pages'], bins = 100)
plt.xlabel('Number of pages')
plt.xlim([50,1500]) # limit the x-axis because most books have fewer than 1500 pages
 

> The distributions of ratings count and text reviews count appear approximately normal after applying a logarithmic (base 10) transform to the data. For the number of pages, most books have between 250 and 500 pages.
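
As an alternative to suppressing the warning, zero counts could be dropped before the transform, since `log10` is undefined at zero. This is a minimal sketch with hypothetical counts and NumPy only; the 0.25 bin width follows the histograms in this section.

```python
import numpy as np

# hypothetical counts spanning several orders of magnitude, including a zero
counts = np.array([0, 3, 12, 150, 900, 4000, 25000, 1200000])

# log10 is undefined at zero, so keep only the positive counts
positive = counts[counts > 0]
log_counts = np.log10(positive)

# bin edges every 0.25 in log10 units
edges = np.arange(0, log_counts.max() + 0.25, 0.25)
hist, _ = np.histogram(log_counts, bins=edges)

print(len(positive), hist.sum())
```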

### The average rating distribution

In [None]:
#average_rating distribution 
plt.hist(df['average_rating'],bins=60)
plt.xlabel('Average rating')
plt.ylabel('Count')
plt.xlim([1,5])

> The mean of the average ratings is around 4. 

### What are the top 20 publishers on Goodreads?

In [None]:
#Top 20 publishers
publishers = df.groupby('publisher')['bookID'].count().sort_values(ascending=False).head(20)


In [None]:
# plot the top 20 publishers in the Goodreads data
# set the color
base_color = sn.color_palette()[0]
# set the plot to the size of A4 paper
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
# plot (recent seaborn versions require x and y as keyword arguments)
sn.barplot(x=publishers.values, y=publishers.index, color = base_color)
plt.title('Top 20 Publishers')
plt.xlabel('Counts')
plt.ylabel(' ');

### What are the 10 most popular books on Goodreads?

In [None]:
# the 10 books with the most ratings
ratings_count=df.groupby('title')['ratings_count'].sum().sort_values(ascending=False).head(10)
# plot (recent seaborn versions require x and y as keyword arguments)
fig, ax = plt.subplots()
fig.set_size_inches(10, 6)
sn.barplot(x=ratings_count.values, y=ratings_count.index, color="salmon")
plt.title('The 10 most rated books')
plt.xlabel('Ratings Count')
plt.ylabel('-');

### What are the top 10 authors on Goodreads?

In [None]:
# top 10 authors by number of books
authors= df['authors'].value_counts().head(10)
# plot (recent seaborn versions require x and y as keyword arguments)
fig, ax = plt.subplots()
fig.set_size_inches(10, 5)
sn.barplot(x=authors.values, y=authors.index,color = base_color)
plt.title('Authors')
plt.xlabel('Book Counts')
plt.ylabel('-');

<a id="Bivariate-exploratory-analysis"></a>

## Bivariate exploratory analysis

> Bivariate exploratory analysis involves the analysis of two variables to determine the empirical relationship between them.

### Pairs plot
> A pairs plot shows the pairwise relationships between variables.

In [None]:
df_numeric = ['average_rating' ,'  num_pages' ,'ratings_count','text_reviews_count','ratio']
sn.pairplot(df[df_numeric], diag_kind='kde');

> We can see that there is a relationship between the ratings count and the text reviews count.

### The relationship between rating count and text reviews count

> A linear regression is used to model the relationship between the ratings count and the text reviews count. 

In [None]:
# relationship between ratings count and text reviews count
# sklearn.linear_model.LinearRegression
x = df['ratings_count']
y = df['text_reviews_count']
x= x.values.reshape(-1, 1)
y= y.values.reshape(-1, 1)
# fit the model once and predict on the same data
model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
r_sq = model.score(x, y) 
r_sq # the coefficient of determination (R^2)

In [None]:
# visualizing the relationship between ratings count and text reviews count
# scatter plot of the data with the fitted regression line in red
fig, ax = plt.subplots()
fig.set_size_inches(10, 8)
plt.scatter(x, y)
plt.plot(x, y_pred, color='red');
plt.title('Ratings Count Vs Text Reviews Count')
plt.xlabel('Ratings Count')
plt.ylabel('Text Reviews Count')
plt.xlim([0,3e6]);

> The coefficient of determination (R²) of this regression is 0.76, which corresponds to a correlation coefficient of about 0.87 and indicates a strong linear relationship. Books with a higher ratings count also tend to have more text reviews; this finding is reasonable and not surprising. 
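
`LinearRegression.score` returns the coefficient of determination R². For a simple linear regression with one predictor and an intercept, R² equals the square of the Pearson correlation coefficient r. The sketch below illustrates this with synthetic data (the values and the random seed are assumptions made for illustration) using NumPy only.

```python
import numpy as np

# hypothetical data with a noisy linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 0.05 * x + rng.normal(0, 1, size=200)

# least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

# coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

print(r_squared, r ** 2)  # the two values agree up to rounding
```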

<a id="Multivariate-exploratory-analysis"></a>
## Multivariate exploratory analysis

### What are the distributions of ratings count with respect to the rating categories? 

> In this section, we will create a subset dataframe with the 200 most rated books and visualize their statistics to answer this question.


#### Creating a subset with the 200 most rated books

In [None]:
# create a subset with the 200 most rated books
df_highest = df.nlargest(200,['ratings_count'])
df_highest.head()

##### Creating box plots to show distributions with respect to the categories.

In [None]:
# creating two box plots to show the distributions of ratings count and text reviews count

# left plot: box plot of ratings count
plt.figure(figsize = [12, 6])
plt.subplot(1, 2, 1)
sn.boxplot(x="rating", y="ratings_count", data=df_highest)
plt.ylim([0,2e6])
plt.xlabel('Rating')
plt.ylabel('Ratings Count')
plt.xticks(rotation=90,fontsize = 12);
# right plot: box plot of text reviews count
plt.subplot(1, 2, 2)
sn.boxplot(x="rating", y="text_reviews_count", data=df_highest)
plt.ylim([0,4e4])
plt.xlabel('Rating')
plt.ylabel('Text Reviews Count')
plt.xticks(rotation=90,fontsize = 12);

### Which publishers have the highest ratings average?


In this part, I filter the data to publishers with more than 100 books. Then, the distributions of their average ratings are visualized using violin plots.

> A violin plot shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.


In [None]:
df_publisher = df[df.groupby('publisher')['publisher'].transform('size') > 100]
df_publisher
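
The filter above relies on `groupby(...).transform('size')`, which returns the size of each row's group aligned with the original rows. A minimal sketch on a hypothetical toy frame (made-up publishers and a threshold of 2 instead of 100):

```python
import pandas as pd

# hypothetical toy data: publisher 'A' has 3 books, 'B' has 1
toy = pd.DataFrame({
    'publisher': ['A', 'A', 'B', 'A'],
    'average_rating': [4.0, 3.5, 4.2, 3.9],
})

# transform('size') returns, for every row, the size of its group
group_size = toy.groupby('publisher')['publisher'].transform('size')

# keep only rows whose publisher has more than 2 books
frequent = toy[group_size > 2]
print(frequent)
```

Because the result has the same index as the original frame, it can be used directly as a boolean mask without merging the counts back in.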


In [None]:
plt.figure(figsize = [10, 8])
sn.violinplot(data = df_publisher, x = 'publisher', y = 'average_rating')
plt.xlabel(' ')
plt.ylabel('Average Rating')
plt.xticks(rotation=90,fontsize = 14);

> In a violin plot, wider sections represent a higher probability that members of the population take the given value; skinnier sections represent a lower probability. Therefore, the average ratings of Penguin Classics and Pocket Books are concentrated around the mean, which is close to 4 stars. For HarperCollins, the widest section lies above the center, which indicates that a large number of its books are rated above 4 stars.

### The distribution of ratings by author

Here, I look at one author at a time. Only authors with more than 25 books are considered. I select three of them and visualize kernel density estimates of the ratio of text reviews count to ratings count, grouped by rating category.

> `seaborn.kdeplot` fits and plots a univariate or bivariate kernel density estimate. It represents the probability distribution of the data values as the area under the plotted curve. It is useful for visualizing the shape of the data.
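
To make the idea concrete, here is a minimal sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The helper `gaussian_kde_1d`, the sample values, and the fixed bandwidth of 0.8 are assumptions made for illustration; seaborn chooses its bandwidth automatically, so this is not what `kdeplot` does internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# hypothetical review-to-rating ratios (in percent)
samples = np.array([1.2, 2.5, 2.8, 3.1, 4.0, 6.5])
grid = np.linspace(-10, 20, 3001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# the area under the estimated curve should be close to 1
dx = grid[1] - grid[0]
area = density.sum() * dx
print(area)
```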

In [None]:
#looking for authors with more than 25 books
df_author = df[df.groupby('authors')['authors'].transform('size') > 25]
df_author['authors'].unique()

> This gives a list of 9 authors, from which Stephen King, P.G. Wodehouse, and Agatha Christie are selected. 

In [None]:
# creating a dataframe for Stephen King books
df_king=df[df['authors']=='Stephen King']
df_king.rating.unique()

In [None]:
# creating a dataframe for P.G. Wodehouse books
df_Wodehouse=df[df['authors']=='P.G. Wodehouse']
df_Wodehouse.rating.unique()

In [None]:
# creating a dataframe for Agatha Christie books
df_Christie = df[df['authors']=='Agatha Christie']
df_Christie.rating.unique()

In [None]:
# creating three plots for the selected authors
# setting the size of the plots
# note: fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
plt.figure(figsize = [16, 8])
# first plot: Stephen King books
plt.subplot(1, 3, 1)
sn.kdeplot(df_king.ratio[df_king['rating'] == 'Ok'], fill=True, color="deeppink", label="Ok", alpha=.7)
sn.kdeplot(df_king.ratio[df_king['rating'] == 'Very Good'], fill=True, color="g", label="Very Good", alpha=.7)
sn.kdeplot(df_king.ratio[df_king['rating'] == 'GOOD'], fill=True, color="orange", label="Good", alpha=.7)
sn.kdeplot(df_king.ratio[df_king['rating'] == 'Highly Recommended'], fill=True, color="grey", label="Highly Recommended", alpha=.7)
plt.title('Stephen King')
plt.legend()
# second plot: P.G. Wodehouse books
plt.subplot(1, 3, 2)
sn.kdeplot(df_Wodehouse.ratio[df_Wodehouse['rating'] == 'Very Good'], fill=True, color="g", label="Very Good", alpha=.7)
sn.kdeplot(df_Wodehouse.ratio[df_Wodehouse['rating'] == 'GOOD'], fill=True, color="orange", label="Good", alpha=.7)
sn.kdeplot(df_Wodehouse.ratio[df_Wodehouse['rating'] == 'Highly Recommended'], fill=True, color="grey", label="Highly Recommended", alpha=.7)
plt.title('P.G. Wodehouse')
plt.legend()
# third plot: Agatha Christie books
plt.subplot(1, 3, 3)
sn.kdeplot(df_Christie.ratio[df_Christie['rating'] == 'Ok'], fill=True, color="deeppink", label="Ok", alpha=.7)
sn.kdeplot(df_Christie.ratio[df_Christie['rating'] == 'Very Good'], fill=True, color="g", label="Very Good", alpha=.7)
sn.kdeplot(df_Christie.ratio[df_Christie['rating'] == 'GOOD'], fill=True, color="orange", label="Good", alpha=.7)
sn.kdeplot(df_Christie.ratio[df_Christie['rating'] == 'Highly Recommended'], fill=True, color="grey", label="Highly Recommended", alpha=.7)
plt.title('Agatha Christie')
plt.legend()


In [None]:
df.groupby('authors')['bookID'].count().sort_values(ascending=False).head(10)


### Most popular copy of Crime and Punishment

In [None]:

df_f = df[df['authors'].str.contains("Dostoyevsky")]
df_crime = df_f[df_f['title']== 'Crime and Punishment']
rating_crime=df_crime.groupby('publisher')['ratings_count'].sum().sort_values(ascending=False)
rating_crime

The Penguin edition, translated by David McDuff, has the most ratings.

### Most popular copy of Anna Karenina

In [None]:
df_t = df[df['authors'].str.contains("Tolstoy")]
df_anna = df_t[df_t['title']== 'Anna Karenina']
rating_anna=df_anna.groupby('publisher')['ratings_count'].sum().sort_values(ascending=False)
rating_anna


The Signet Classics edition of Anna Karenina, translated by David Magarshack, has the most ratings.

<a id="Summary"></a>
## Summary

The Goodreads data is interesting, and we can perform a variety of analyses to gain insights. In this project we looked at different aspects of the data.

*     There is a correlation between the ratings count and the text reviews count. The coefficient of determination (R²) of the linear regression is 0.76, which shows a strong linear relationship.

*     The average ratings of Penguin Classics and Pocket Books are close to 4 stars. For HarperCollins, a large number of books are rated above 4 stars.

*     We found that quantity does not equal quality. For example, Stephen King is the most popular author on Goodreads, but his books are rated in the range of Ok and Good.

