# Introduction

In this notebook we analyze a books dataset and explore the relationships between the attributes a book can have: the aggregate rating of each book, trends among authors, the languages books appear in, and which books have become popular over time.
<hr> 

# Content
<font color='green'>
* [Overview](#11)
* [Import libraries](#1)
* [Data analysis](#2)
* [EDA](#3)
* [Modelling](#4)
* [Conclusion](#5)

# <span id='11'></span>Overview
## Column descriptions
* **bookID** : A unique ID for each book or series
* **title** : The title of the book
* **authors** : The author(s) of the book
* **average_rating** : The average rating given to the book by users
* **isbn** : The 10-digit ISBN, which encodes information about the book such as its edition and publisher
* **isbn13** : The 13-digit ISBN format, in use since 2007
* **language_code** : The language of the book
* **# num_pages** : The number of pages in the book
* **ratings_count** : The number of ratings the book has received
* **text_reviews_count** : The number of text reviews left by users
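
To make the link between the two ISBN columns concrete, here is a small sketch of the standard ISBN-10 to ISBN-13 conversion: prefix `978`, keep the first nine digits, and recompute the check digit with alternating weights 1 and 3. The function name `isbn10_to_isbn13` and the sample ISBN are illustrative assumptions and do not come from the dataset.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 string to its ISBN-13 form (hypothetical helper)."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN")
    # Drop the old check digit and add the 978 prefix
    core = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


# Illustrative ISBN, not a row from books.csv
example = isbn10_to_isbn13("0306406152")
print(example)
```

Since both columns identify the same edition, keeping only one of them (or neither, as this notebook does later) loses no information for the analysis.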

# <span id='1'></span>Import libraries

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
import warnings
warnings.filterwarnings('ignore')

## Getting Basic Ideas

In [None]:
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' replaces it
df = pd.read_csv('../input/books.csv', on_bad_lines='skip')
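
As a minimal illustration of what skipping bad lines means, the sketch below parses a tiny in-memory CSV in which one row has an extra field. The rows are made up and are not taken from `books.csv`, and the example assumes pandas 1.3 or later, where the `on_bad_lines` argument is available.

```python
import io

import pandas as pd

# Hypothetical three-column CSV; the second data row has one field too many
raw = io.StringIO(
    "bookID,title,authors\n"
    "1,Book A,Author A\n"
    "2,Book B,Author B,unexpected extra field\n"
    "3,Book C,Author C\n"
)

# Rows that cannot be parsed into three columns are dropped instead of raising
sample = pd.read_csv(raw, on_bad_lines="skip")
print(sample)
```

Skipped rows are lost silently, so it can be worth comparing the row count against the raw file when the number of malformed lines matters.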

In [None]:
df.index = df['bookID']

In [None]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns")

In [None]:
df.head()

Here we see that the same author appears under two different names: "J.K. Rowling-Mary GrandPré" and "J.K. Rowling". We treat both as J.K. Rowling.

In [None]:
df.replace(to_replace="J.K. Rowling-Mary GrandPré", value='J.K. Rowling', inplace=True)

In [None]:
df.head()

We can now drop the bookID column, since it is the same as the index.

In [None]:
df=df.drop('bookID', axis=1)

In [None]:
df['title'].nunique()

In [None]:
df['authors'].nunique()

I won't use the ISBN columns in this analysis, so I drop them as well.

In [None]:
df = df.drop(['isbn', 'isbn13'], axis=1)

In [None]:
df.dtypes

In [None]:
df.count()

Top 20 most frequently occurring books

# <span id='3'></span>EDA 

In [None]:
sns.set_context('poster')
plt.figure(figsize=(20,15))
book = df['title'].value_counts()[:20]
sns.barplot(x=book, y=book.index, palette='Set3')
plt.title('Most Occurring Books')
plt.ylabel("Books")
plt.xlabel("Number of occurrences");

We can see that *'Salem's Lot* and *One Hundred Years of Solitude* occur most often, which means these titles appear in the dataset again and again as different publications of the same book.
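
The counting behind this chart can be sketched on a toy frame. The titles and ISBN values below are made up for illustration; the point is only that `value_counts` counts rows per title, so a title with several editions is counted once per edition.

```python
import pandas as pd

# Hypothetical rows: each row is one edition, identified by a made-up ISBN
toy = pd.DataFrame({
    "title": ["'Salem's Lot", "'Salem's Lot", "Dune", "'Salem's Lot", "Dune", "Emma"],
    "isbn": ["a1", "a2", "b1", "a3", "b2", "c1"],
})

# Number of rows (editions) per title, most frequent first
counts = toy["title"].value_counts()

# Keep only the titles that occur more than once
repeated = counts[counts > 1]
print(repeated)
```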

Top ten authors with the most books

In [None]:
sns.set_context('paper')
plt.figure(figsize=(20,15))
author = df.groupby('authors')['title'].count().reset_index().sort_values('title', ascending=False).head(10).set_index('authors')
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=author['title'], y=author.index, palette='Set3')
plt.title('Top 10 authors')
plt.ylabel('Authors')
plt.xlabel('Total number of books');

In [None]:
df.head(10)

Let's look at the relation between average_rating and ratings_count.

In [None]:
plt.figure(figsize=(15,10))
rating = df[['average_rating', 'ratings_count']]
rating = rating.sort_values('average_rating')
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.regplot(x=rating['ratings_count'], y=rating['average_rating'])
plt.title('Average rating vs. ratings count');

In [None]:
fig, ax=plt.subplots(1,2, figsize=(10,10))

sns.boxplot(y=df['average_rating'], data=df, ax=ax[0], color='g')
ax[0].set_title('Average_rating')


sns.boxplot(y=df['# num_pages'], data=df, ax=ax[1], color='g')
ax[1].set_title('Number of pages')
plt.show()

We see that there are large outliers in both the number of pages and the average ratings.
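
One common way to put a number on "outlier" is the 1.5 × IQR rule, which is also what seaborn's box plot uses by default for its whiskers. The sketch below is an assumption about how one might flag such rows, not something this notebook applies; the helper name `iqr_outliers` and the page counts are made up.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical page counts with one extreme value
pages = pd.Series([200, 250, 300, 320, 350, 400, 6576])
mask = iqr_outliers(pages)
print(pages[mask])
```

On the real data the same mask could be built from `df['# num_pages']` and used either to inspect the extreme rows or to exclude them before plotting.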

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,10))
sns.distplot(df['average_rating'], ax=ax[0], color='g')
ax[0].set_title('Average Rating')
sns.distplot(df['# num_pages'], ax=ax[1], color='r')
ax[1].set_title('Number of pages')
plt.show()

In [None]:
plt.figure(figsize=(10,10))
# Newer seaborn versions take x/y keywords; fill and thresh replace shade and shade_lowest
sns.kdeplot(x=df['average_rating'], y=df['# num_pages'], cmap='Blues', fill=True, thresh=0)
plt.show()

Correlation between average rating, number of pages, ratings count, and text reviews count

In [None]:
correlation = df[['average_rating','# num_pages','ratings_count','text_reviews_count']].corr()
sns.heatmap(correlation, annot=True, vmax=1, vmin=-1, center=0)
plt.show()
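
For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below recomputes it by hand on made-up numbers, as a rough check of what each cell of the heatmap represents; the values are hypothetical and unrelated to the dataset.

```python
import pandas as pd

# Hypothetical counts for four books
toy = pd.DataFrame({
    "ratings_count": [10.0, 200.0, 3000.0, 45000.0],
    "text_reviews_count": [1.0, 15.0, 250.0, 3000.0],
})

x = toy["ratings_count"]
y = toy["text_reviews_count"]

# Pearson r: covariance of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# The same value as reported by pandas
pandas_r = toy.corr().loc["ratings_count", "text_reviews_count"]
print(manual_r, pandas_r)
```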

In [None]:
df['language_code'].unique()

In [None]:
freq_table_lang = pd.DataFrame(df.language_code.value_counts())
freq_table_lang

In [None]:
sns.set_context('poster')
plt.figure(figsize=(15,10))
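# Pass x/y as keywords and pick the count column by position, since its name differs between pandas versions
sns.barplot(x=freq_table_lang.index, y=freq_table_lang.iloc[:, 0])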
plt.xticks(rotation=90)
plt.title('Based on Language')
plt.ylabel('Frequency Distribution')
plt.xlabel('Languages');

In [None]:
freq_table_lang[:7].plot(kind='pie', subplots=True, figsize=(10,10))
plt.show()

Top 10 highest-rated books

In [None]:
top=df[df['average_rating']==5.0]
top[['title','authors', 'average_rating']]

In [None]:
plt.figure(figsize=(10,15))
top = top.sort_values('ratings_count', ascending=False).head(10)
sns.barplot(x=top['ratings_count'], y=top.title, palette='Set3');

Here we can see that the books with the highest average_rating cannot be considered the best books, since they have only a few ratings each.
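
One possible remedy, which this notebook does not use, is a weighted (Bayesian-style) rating that pulls books with few ratings toward the overall mean. The sketch below is only an illustration under stated assumptions: the helper name `weighted_rating`, the `min_votes` threshold of 500, and the three toy books are all made up.

```python
import pandas as pd


def weighted_rating(frame: pd.DataFrame, min_votes: int) -> pd.Series:
    """Shrink each average_rating toward the global mean by its ratings_count."""
    global_mean = frame["average_rating"].mean()
    votes = frame["ratings_count"]
    rating = frame["average_rating"]
    weight = votes / (votes + min_votes)
    return weight * rating + (1 - weight) * global_mean


# Hypothetical books: a perfect score from 2 ratings vs. a strong score from 50,000
toy = pd.DataFrame({
    "title": ["Obscure", "Popular", "Average"],
    "average_rating": [5.0, 4.5, 3.5],
    "ratings_count": [2, 50000, 1000],
})

toy["weighted"] = weighted_rating(toy, min_votes=500)  # 500 is an arbitrary choice
best = toy.sort_values("weighted", ascending=False).iloc[0]["title"]
print(toy[["title", "weighted"]])
```

With this weighting the book rated 5.0 by only two users no longer ranks first, which matches the observation above.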

Now we do the opposite: we take the books with the highest ratings counts and plot their average ratings.

In [None]:
top = df.sort_values('ratings_count', ascending=False).head(10)
plt.figure(figsize=(20,10))
sns.barplot(x='average_rating', y=top.title, data=top);

Authors of the ten most-rated books

In [None]:
plt.figure(figsize=(20,15))
author = top.groupby('authors')['title'].count().reset_index().sort_values('title', ascending=False).set_index('authors')
sns.barplot(x=author['title'],y=author.index, palette='Set3');
plt.xticks([0,1,2,3,4,5])
plt.title('authors of top books');


Is it true that the average rating is related to text_reviews_count?

In [None]:
top=top.sort_values('text_reviews_count', ascending=False)
# lmplot is figure-level and creates its own figure, so size it with height instead of plt.figure
sns.lmplot(x='average_rating', y='text_reviews_count', data=top, height=10)
plt.title("text reviews VS average rating")
plt.show()

Books by the top authors
* Stephen King
* Agatha Christie
* Dan Brown
* J.K. Rowling

In [None]:
#Stephen King
authors = ['Stephen King', 'Agatha Christie', 'Dan Brown', 'J.K. Rowling']
authors = df[df['authors']==authors[0]]
authors = authors[authors['language_code']=='eng']
plt.figure(figsize=(20,15))
sns.barplot(x=authors['title'], y=authors['average_rating'], palette='Set3')
plt.xticks(rotation=90)
plt.title('Stephen King\'s Books');

In [None]:
#Agatha Christie
authors = ['Stephen King', 'Agatha Christie', 'Dan Brown', 'J.K. Rowling']
authors = df[df['authors']==authors[1]]
authors = authors[authors['language_code']=='eng']
plt.figure(figsize=(20,20))
sns.barplot(x=authors['title'], y=authors['average_rating'], palette='Set3')
plt.xticks(rotation=90)
plt.title('Agatha Christie\'s Books');

In [None]:
#Dan Brown
authors = ['Stephen King', 'Agatha Christie', 'Dan Brown', 'J.K. Rowling']
authors = df[df['authors']==authors[2]]
authors = authors[authors['language_code']=='eng']
plt.figure(figsize=(20,15))
sns.barplot(x=authors['title'], y=authors['average_rating'], palette='Set3')
plt.xticks(rotation=90)
plt.title('Dan Brown\'s Books');

In [None]:
#J.K. Rowling
authors = ['Stephen King', 'Agatha Christie', 'Dan Brown', 'J.K. Rowling']
authors = df[df['authors']==authors[3]]
authors = authors[authors['language_code']=='eng']
plt.figure(figsize=(20,15))
sns.barplot(x=authors['title'], y=authors['average_rating'], palette='Set3')
plt.xticks(rotation=90)
plt.title('J.K. Rowling\'s Books');
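
The four cells above repeat the same filter with a different author each time. A small helper can express that filter once; the sketch below shows only the filtering step on a toy frame, and both the helper name `english_books_by` and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def english_books_by(frame: pd.DataFrame, author: str) -> pd.DataFrame:
    """Return title and average_rating for one author's English-language books."""
    mask = (frame["authors"] == author) & (frame["language_code"] == "eng")
    return frame.loc[mask, ["title", "average_rating"]]


# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "authors": ["Stephen King", "Stephen King", "Dan Brown", "Stephen King"],
    "language_code": ["eng", "spa", "eng", "eng"],
    "average_rating": [4.1, 3.9, 3.8, 4.3],
})

# One loop replaces the copy-pasted cells; a plot call could go inside it
subsets = {name: english_books_by(toy, name) for name in ["Stephen King", "Dan Brown"]}
for name, subset in subsets.items():
    print(name, len(subset))
```

On the real data, each subset could be passed to `sns.barplot(x=subset['title'], y=subset['average_rating'])` inside the loop to reproduce the four charts.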

# To be continued