In [None]:
import numpy as np # number processing 
import pandas as pd # data processing 
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
import os # directory access

# return list of files in directory 'input'
print(os.listdir('../input')) 
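# load dataset, skipping malformed rows
# (on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../input/books.csv', on_bad_lines='skip')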

In [None]:
# return number of rows and columns
df.shape 

In [None]:
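# count non-null values per column; a count below the number of rows means missing values
df.count()

Every column's non-null count matches the number of rows, so there are no missing values in the dataset.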
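`df.count()` only reports non-null counts, so the conclusion relies on comparing each count with the row count by eye. A more direct check is sketched below on a small made-up frame; `toy` and its values are hypothetical and only stand in for the books data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real books data
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'avg_rating': [4.1, np.nan, 3.8],
})

# number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only when no cell in the frame is missing
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

Running the same two lines on `df` should show zeros in every column if the dataset really has no missing values.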

In [None]:
# check each column's data type
df.dtypes

In [None]:
# summary of statistics
df.describe()

In [None]:
# return first 5 rows
df.head() 

In [None]:
# rename columns
df.rename(columns={'average_rating':'avg_rating',
                   '# num_pages':'num_pages',
                   'language_code':'lang_code'},inplace=True) 
df.columns

## Analyzing Books By Language

In [None]:
# list the distinct language codes and count them
print(df['lang_code'].unique())
print('\n Total language codes:', df['lang_code'].nunique())

In [None]:
# top 10 languages for books
langs = df['lang_code'].value_counts().head(10)
plt.figure(figsize=(15,6))
sns.barplot(x=langs, y=langs.index) # horizontal bar plot
sns.despine(left=True, bottom=True) # remove all four borders (top and right are removed by default)
plt.title('Top 10 Languages By Number of Books Written', fontsize=20, fontweight='bold')
plt.xlabel('Number of Books', fontsize=12, fontstyle='italic') 

Unsurprisingly, English is the most common language books are written in. The different variants of English, such as American English and British English, are treated as separate language categories. To get a better idea of how dominant the English language is, we'll combine all the different variants into one.

In [None]:
# books written in all variants of English
eng_books = df[df['lang_code'].isin(['eng', 'en-US', 'en-GB', 'en-CA'])]

# plot a pie chart to show the percentage of English books out of all books
# the two slices must not overlap: English books vs. every other book
sizes = [eng_books.shape[0], df.shape[0] - eng_books.shape[0]]
labels = ['English', 'Other Languages']
colors = ['lightblue', 'lightcoral']
explode = (0.1, 0) # explode the first slice of the pie
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
        textprops=dict(fontsize=16), autopct='%1.0f%%', shadow=True, startangle=90)
plt.title('English vs Other Languages', fontsize=20, fontweight='bold')
plt.axis('equal')

From the pie chart we can see that books written in English (any variant) make up the large majority of the dataset, roughly nine out of every ten books.
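The English share can also be computed directly, which avoids reading it off a chart. The sketch below uses a made-up column of language codes; `toy` and its contents are hypothetical, and only the four English codes come from the analysis above. The key point is that the two groups must be disjoint so that their counts add up to the total.

```python
import pandas as pd

# Hypothetical toy language codes, not the real dataset
toy = pd.DataFrame({'lang_code': ['eng', 'en-US', 'fre', 'eng', 'spa',
                                  'en-GB', 'eng', 'en-CA', 'eng', 'ger']})

english_codes = ['eng', 'en-US', 'en-GB', 'en-CA']
is_english = toy['lang_code'].isin(english_codes)  # boolean mask, one entry per book

english_count = int(is_english.sum())     # books in any English variant
other_count = int((~is_english).sum())    # every remaining book
english_share = is_english.mean()         # fraction of True values in the mask

print(english_count, other_count, f'{english_share:.0%}')
```

Passing `[english_count, other_count]` to `plt.pie` gives slices that sum to the whole dataset.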

## Analyzing Books By Rating

In [None]:
# plot the distribution of average ratings
# (histplot replaces the deprecated distplot; needs seaborn >= 0.11)
sns.histplot(x=df['avg_rating'])
sns.despine(left=True, bottom=True)

The majority of books appear to have average ratings between 3.5 and 4.5, which would seem to indicate favourable quality overall. However, the above chart does not yet factor in the number of ratings and text reviews, which are far more telling indicators of a book's true quality.

In [None]:
# relationship between average rating, ratings count, and text reviews count
sns.set_style('whitegrid')
sns.scatterplot(x=df['avg_rating'], y=df['ratings_count'], hue=df['text_reviews_count'])
sns.despine(left=True, bottom=True)

The scatterplot reveals that some books have an average rating higher than zero despite not having been rated yet, which points to inaccuracies in the dataset. There is still a large proportion of books with a ratings count at least in the hundreds, and the average ratings for these fall within the 3.5 to 4.5 range, which bodes well for overall quality. Books with a higher ratings count tend to have more text reviews; this makes sense, as the more readers a book has, the larger the pool of potential raters and reviewers.
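The inconsistent rows mentioned above can be listed explicitly instead of being inferred from the plot. This is a minimal sketch on made-up data; `toy` and its values are hypothetical, while the column names follow the renamed dataset.

```python
import pandas as pd

# Hypothetical toy frame, not the real books data
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'avg_rating': [4.2, 3.9, 0.0, 5.0],
    'ratings_count': [1200, 0, 0, 0],
})

# books that carry a positive average rating although nobody has rated them
unrated_but_scored = toy[(toy['ratings_count'] == 0) & (toy['avg_rating'] > 0)]
print(unrated_but_scored)
```

Applying the same filter to `df` would show how many rows are affected, which helps decide whether to drop them before drawing conclusions about quality.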

Two outliers can be detected immediately in the plot; both have a ratings count above 4 million, while the next highest ratings counts are around the 2.5 million mark.

In [None]:
# find outliers
df[df['ratings_count'] > 4000000]

Both outliers turn out to be titles from phenomenally bestselling series, which accounts for the high readership. It is noteworthy that both books also happen to be the first instalment in their respective series, with no sequels coming close in terms of readership numbers. One hypothesis is that the hype surrounding the series led people to read the first book before deciding they didn't find it engaging enough to continue on to the second. A second hypothesis is that the sequels are not included in the dataset.

In [None]:
# find Twilight sequels
df[df['authors'] == 'Stephenie Meyer'].sort_values(['ratings_count', 'text_reviews_count'], ascending=False)

With the Twilight series, it is the second hypothesis that holds true: the sequels do not appear in the dataset.

In [None]:
# find Harry Potter sequels
df[df['authors'].isin(['J.K. Rowling-Mary GrandPré', 'J.K. Rowling'])].sort_values(
    ['ratings_count', 'text_reviews_count'], ascending=False)

With the Harry Potter series, the first hypothesis seems more likely, especially given that the difference in ratings count between the two most-rated books is a startling 3.5 million.
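The gap between an author's two most-rated books can be computed rather than read off the sorted table. The sketch below assumes a substring match on the `authors` column, so that co-credited entries (for example an author plus an illustrator) are picked up as well; `toy`, the author names, and the counts are all hypothetical.

```python
import pandas as pd

# Hypothetical toy frame, not the real books data
toy = pd.DataFrame({
    'title': ['Book 1', 'Book 2', 'Book 3', 'Unrelated'],
    'authors': ['Author X-Illustrator Y', 'Author X', 'Author X', 'Author Z'],
    'ratings_count': [4_500_000, 1_000_000, 900_000, 50],
})

# keep every row whose authors field mentions the author (plain substring, no regex)
by_author = toy[toy['authors'].str.contains('Author X', regex=False)]

# rank by readership and take the difference between the top two books
ranked = by_author.sort_values('ratings_count', ascending=False)
gap = int(ranked['ratings_count'].iloc[0] - ranked['ratings_count'].iloc[1])
print(gap)
```

A substring match is looser than the exact comparison used in the cells above, so it may also return boxed sets or companion titles; those would need to be filtered out before comparing a first book with its sequels.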