## Introduction:
This kernel contains the **Exploratory Data Analysis** of the **GoodReads** dataset.

## Importing the basic Libraries:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

## Importing the dataset:

In [None]:
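df = pd.read_csv('../input/books.csv', on_bad_lines='skip')  # skip malformed rows; requires pandas >= 1.3

The dataset contains some malformed lines, so we tell pandas to skip them with `on_bad_lines='skip'`. Older pandas versions (before 1.3) used `error_bad_lines=False` for the same purpose; that argument was removed in pandas 2.0.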
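As a minimal sketch of what skipping bad lines means, the snippet below parses a small in-memory CSV in which one row has an extra field. The sample rows are made up for illustration and are not taken from `books.csv`; it assumes pandas 1.3 or later, where the option is spelled `on_bad_lines='skip'`.

```python
import io

import pandas as pd

# Hypothetical sample data: the row with bookID 2 has one field too many,
# which is what a malformed line typically looks like to the parser.
raw = io.StringIO(
    "bookID,title,authors\n"
    "1,Book A,Author X\n"
    "2,Book B,Author Y,unexpected extra field\n"
    "3,Book C,Author Z\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising an error.
sample = pd.read_csv(raw, on_bad_lines="skip")
print(sample)
```

Only the well-formed rows survive, so it is worth comparing the row count against the raw file to see how much data was dropped.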

### Let's have an Overview of the Data:

In [None]:
df.head()

### Brief Description:

In [None]:
df.describe(include='all')

### Brief Information of various variables:

In [None]:
df.info()

In [None]:
df.replace(to_replace='J.K. Rowling-Mary GrandPré', value = 'J.K. Rowling', inplace=True)

This normalizes the combined author entry `J.K. Rowling-Mary GrandPré` to `J.K. Rowling`.

## Finding total number of authors in Record

In [None]:
df['authors'].nunique()

## Top Authors:

In [None]:
plt.figure(1, figsize=(15, 7))
plt.title("Which author wrote the most books")
sns.countplot(x = "authors", order=df['authors'].value_counts().index[0:10] ,data=df)

The graph above shows the top 10 authors by number of books in the dataset.
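The `order` argument in the plot comes from `value_counts()`, which sorts categories by frequency in descending order. A small sketch on made-up data (the `toy` frame and its values are hypothetical) shows how the top entries are picked:

```python
import pandas as pd

# Hypothetical author column: A appears 3 times, B twice, C once.
toy = pd.DataFrame({"authors": ["A", "B", "A", "C", "A", "B"]})

# value_counts() returns counts sorted from most to least frequent;
# slicing its index gives the top-N category labels.
counts = toy["authors"].value_counts()
top2 = counts.index[:2].tolist()
print(top2)
```

The same slice, with `[0:10]`, is what restricts the count plot to the ten most frequent authors.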

## Most Occurring Books in Record:

In [None]:
plt.figure(1, figsize=(25,7))
plt.title("Most Occurring Books")
sns.countplot(x = "title", order=df['title'].value_counts().index[0:10] ,data=df)

#### The graph above shows the most frequently occurring books in the list. Most of these are old all-time classics.

## Most Used Languages:

In [None]:
plt.figure(1, figsize=(25,10))
plt.title("language_codes")
sns.countplot(x = "language_code", order=df['language_code'].value_counts().index[0:10] ,data=df)

We infer that **English** and **United States English** are the languages most used.

## Most Rated Books:

In [None]:
most_rated = df.sort_values('ratings_count', ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
sns.barplot(x=most_rated['ratings_count'], y=most_rated.index, palette='Accent')  # keyword x/y, required by recent seaborn

We see that **Harry Potter and the Sorcerer's Stone** is the most rated book, and its ratings count is significantly higher than that of the other parts of the series.

Correcting the **num_pages** column name.

In [None]:
# Rename the '# num_pages' column to 'num_pages'
most_rated = most_rated.rename(columns={'# num_pages': 'num_pages'})

## Relation between num_pages and ratings_count:

In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x=most_rated['num_pages'], y=most_rated['ratings_count'], palette='Accent')  # keyword x/y, required by recent seaborn

## Relation between Average Rating and num_pages:

In [None]:
plt.figure(figsize=(10,5))
most_rated.groupby(['average_rating','title']).num_pages.sum().nlargest(10).plot(kind='barh',color='b')

The plot above shows the number of pages for each of the most rated books, labelled by average rating and title.

## Listing the top books with Least No. of Pages:

In [None]:
df_pages=df['# num_pages'] > 5
df_rating=df['average_rating'] > 4.5
req=pd.DataFrame(df[df_pages & df_rating].sort_values('# num_pages', ascending = True).head(5))
req.head()

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=req['# num_pages'], y=req['title'], palette='Accent')  # keyword x/y, required by recent seaborn

We see that **The Feynman Lectures on Physics** has the fewest pages among books with an average rating above 4.5 (ignoring entries with 5 pages or fewer).
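The selection above combines two boolean masks with `&` and then sorts by page count. A minimal sketch on made-up rows (the titles and numbers in `toy` are hypothetical; the column names match the dataset) makes the logic easier to follow:

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "title": ["T1", "T2", "T3", "T4"],
    "# num_pages": [3, 120, 40, 300],
    "average_rating": [4.9, 4.6, 4.7, 4.2],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
long_enough = toy["# num_pages"] > 5
highly_rated = toy["average_rating"] > 4.5

# Sort the surviving rows by page count and keep the shortest ones.
shortest = toy[long_enough & highly_rated].sort_values("# num_pages").head(2)
print(shortest["title"].tolist())
```

Here `T1` is dropped for having too few pages and `T4` for its rating, so the page-count threshold acts as a guard against placeholder entries.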

## Text Review Count

In [None]:
most_rated = df.sort_values('text_reviews_count', ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
sns.barplot(x=most_rated['text_reviews_count'], y=most_rated.index, palette='Accent')  # keyword x/y, required by recent seaborn

We see that **Twilight** leads this segment, with **Harry Potter** left far behind.

## Rating Count vs Average Rating

In [None]:
req = df.groupby(pd.cut(df['average_rating'], [0,1,2,3,4,5]))
req = req[['ratings_count']]
req.sum().reset_index()

#### We observe that books with a low average rating have far fewer ratings in total.
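To make the binning step concrete, here is a small sketch of `pd.cut` followed by a grouped sum, using made-up values (the `toy` frame is hypothetical). Note that with the default settings the intervals are open on the left, so a book with an average rating of exactly 0 would fall outside every bin.

```python
import pandas as pd

# Hypothetical ratings data.
toy = pd.DataFrame({
    "average_rating": [0.5, 2.5, 3.2, 3.9, 4.1, 4.8],
    "ratings_count": [2, 10, 100, 400, 5000, 900],
})

# pd.cut assigns each rating to an interval (0, 1], (1, 2], ..., (4, 5].
bins = pd.cut(toy["average_rating"], [0, 1, 2, 3, 4, 5])

# observed=False keeps empty bins in the result, with a sum of 0.
totals = toy.groupby(bins, observed=False)["ratings_count"].sum()
print(totals)
```

Summing counts per bin shows where the ratings concentrate; dividing by the number of books per bin would be an alternative if the goal is a per-book average.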
