# AUDIBLE CATALOG EDA

Importing the important libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
print("Setup complete")

Importing the datasets. We use ```read_csv``` for the same.

In [None]:
audiob = pd.read_csv('../input/audible-complete-catalog/Audible_Catlog.csv')
audiob_adv = pd.read_csv('../input/audible-complete-catalog/Audible_Catlog_Advanced_Features.csv')

# Sneak Peek of the dataset

Let us see what our dataset looks like.
<br> For this, the ```head()``` function is used.

In [None]:
audiob.head(5)

In [None]:
audiob_adv.head(5)

My first intuition on looking at the data was that there is some error in the price, so I cross-checked it.

The price of ['Think Like a Monk'](https://www.audible.in/pd/Think-Like-A-Monk-Audiobook/B07YSQ8GT5?plink=rJ6ts4I7l48a8A1E&ref=a_hp_c4_adblp13nbssx_1_1&pf_rd_p=b1ea9011-ff9b-499a-a399-9843ffcbd0f5&pf_rd_r=MAH06E8SRNDH717BXZHZ) and ['Subtle art...'](https://www.audible.in/pd/The-Subtle-Art-of-Not-Giving-a-F-ck-Audiobook/B079BC54JT?plink=LLoP79JfMWa0iHlG&ref=a_pd_Think-_c2_adblp13npsbx_1_3&pf_rd_p=a55f3f73-7c6c-4aac-8ad0-05045d19c470&pf_rd_r=28HQJKEDV597MNA1RDTS) is actually 10k+

😬

As we can see, the Listening Time column is not suitable for computation. We will extract the numbers and convert them to minutes. For this I am using regex.

## RegEx
RegEx is short for Regular Expression. RegEx helps in finding patterns in a string, which is used to locate, manage and match text. The image below from [Computer Hope](https://www.computerhope.com) can help us in understanding.
![image.png](https://www.computerhope.com/jargon/r/regular-expression.gif)


[Cheatsheet for RegEx](https://www.computerhope.com/jargon/r/regex.htm)

In [None]:
temp_df = audiob_adv['Listening Time'].str.extract(r'(\d+)[^\d]+(\d+)').astype('float64')
audiob_adv["Time"] = temp_df.iloc[:,0]*60 + temp_df.iloc[:,1]
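
To make the extraction concrete, here is a minimal sketch on a few made-up strings. The sample values, and the assumption that the column looks like `10 hrs and 54 mins`, are illustrative and not taken from the dataset. The pattern needs two numbers, so a value with only one number does not match and ends up as NaN.

```python
import pandas as pd

# Hypothetical sample values; the real column format is assumed, not verified
sample = pd.Series(["10 hrs and 54 mins", "1 hr and 5 mins", "45 mins"])

# Two capture groups: first number (hours), second number (minutes)
parts = sample.str.extract(r'(\d+)[^\d]+(\d+)').astype('float64')
minutes = parts.iloc[:, 0] * 60 + parts.iloc[:, 1]

print(minutes.tolist())
```

Rows where only one number is present stay NaN after this step, which is why the Time column can contain missing values afterwards.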

Dropping the columns which we don't want:

Listening Time -> We have extracted the time out of it

Ranks and Genre -> We won't be using that in our analysis

In [None]:
# drop returns a new DataFrame, so the result is assigned back
audiob_adv = audiob_adv.drop(['Listening Time','Ranks and Genre'], axis = 1)
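
A small sketch of how ```drop``` behaves, using a made-up frame (the `toy` and `trimmed` names are hypothetical). By default ```drop``` returns a new DataFrame and leaves the original untouched, so the result has to be kept for the columns to actually disappear.

```python
import pandas as pd

# Toy frame, not the Audible data
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Listening Time": ["1 hr and 0 mins", "2 hrs and 0 mins"],
    "Time": [60.0, 120.0],
})

toy.drop(["Listening Time"], axis=1)      # result is discarded, toy is unchanged
print(list(toy.columns))

trimmed = toy.drop(columns=["Listening Time"])  # keep the returned frame
print(list(trimmed.columns))
```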

Since ```audiob_adv``` has the columns present in ```audiob``` along with others, we will proceed with ```audiob_adv```.

# Cleaning the dataset

Checking for any missing data. To check this, I am using the ```isnull``` function

In [None]:
audiob_adv.isnull().values.any()

We can see that we have certain missing data.

So, let us see where we have the missing data

To visualise where the missing data are, I will use ```heatmap``` from the seaborn library

In [None]:
sns.heatmap(audiob_adv.isnull(), cbar=False)
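
As a numeric complement to the heatmap, a per-column count of missing values can be useful. A minimal sketch on a made-up frame (column names mirror the dataset, values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps
toy = pd.DataFrame({
    "Rating": [4.5, np.nan, 3.0],
    "Time": [np.nan, np.nan, 90.0],
    "Price": [100.0, 200.0, 300.0],
})

# Count nulls per column and show only the columns that have any
missing = toy.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```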

To deal with the missing values, I am using the ```fillna``` function, and filling up the missing values of Number of Reviews with 0.

I am filling the null values of Time, with 10 minutes.

In [None]:
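# Assign the filled columns back instead of relying on inplace on a column selection
audiob_adv["Number of Reviews"] = audiob_adv["Number of Reviews"].fillna(0)
audiob_adv["Time"] = audiob_adv["Time"].fillna(10)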
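
The same per-column fill can also be written in one call by passing a dictionary to ```fillna```. A sketch on made-up data; the fill values 0 and 10 are the ones used above, and columns not listed in the dictionary are left as they are.

```python
import numpy as np
import pandas as pd

# Toy frame, not the Audible data
toy = pd.DataFrame({
    "Number of Reviews": [12.0, np.nan],
    "Time": [np.nan, 95.0],
    "Rating": [np.nan, 4.0],
})

# One fill value per column; Rating is not listed, so its NaN stays
filled = toy.fillna({"Number of Reviews": 0, "Time": 10})
print(filled)
```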

Let us look at the heatmap again

In [None]:
sns.heatmap(audiob_adv.isnull(), cbar=False)

## Outliers

Renaming columns for ease of computation

In [None]:
audiob_adv.rename(columns = {"Number of Reviews": "Number_of_Reviews"},  
                             inplace = True)

Here I am changing the type of the Price and Number of Reviews columns. For this, I am using the ```astype``` function

In [None]:
audiob_adv["Number_of_Reviews"] = audiob_adv.Number_of_Reviews.astype(float)
audiob_adv["Price"] = audiob_adv.Price.astype(float)


### Price

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=audiob_adv['Price'],palette = 'colorblind')

We can see that the price ranges from 0 to 17500. But as seen earlier, these are not wrong values, and dropping them can affect the analysis. [This article](https://medium.com/@Ipshita/outlier-what-to-do-439e21899a98) gives a brief introduction to outliers.
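
For reference, the points a box plot draws beyond its whiskers are, by default, those further than 1.5 times the interquartile range from the quartiles. A minimal sketch of that rule on made-up prices, flagging values without dropping them:

```python
import pandas as pd

# Invented prices with one extreme value
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 17500], dtype="float64")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, rather than remove, the values outside the whisker range
flagged = prices[(prices < lower) | (prices > upper)]
print(flagged.tolist())
```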

### Rating

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=audiob_adv['Rating'],palette = 'colorblind')

The point at -1 is impossible, so we will drop it.

In [None]:
audiob_adv = audiob_adv[~(audiob_adv['Rating']<=0)]
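
One detail of this kind of filter: negating `<= 0` is not quite the same as testing `> 0` when missing values are present, because comparisons with NaN are always False. A sketch on a made-up series:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, -1.0, np.nan, 3.0])

kept_not_le = ratings[~(ratings <= 0)]  # NaN row is kept
kept_gt = ratings[ratings > 0]          # NaN row is dropped

print(len(kept_not_le), len(kept_gt))
```

So the negated form used above removes only the impossible ratings and leaves rows with a missing rating in place.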

Replotting

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=audiob_adv['Rating'],palette = 'colorblind')

Better. 

### Time

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=audiob_adv['Time'],palette = 'colorblind')

The longest running book is more than 7000 minutes, which is approximately 5 days. 
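
As a quick check of that conversion, taking 7000 minutes as the round figure from the text:

```python
minutes = 7000
hours = minutes / 60   # minutes to hours
days = hours / 24      # hours to days

print(round(hours, 1), round(days, 2))
```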

# Exploratory Data Analysis

## Author Feature

### Most Frequent Authors

In [None]:
sns.set_context('talk')
plt.figure(figsize=(20,10))
cnt = audiob_adv['Author'].value_counts().head(20)
# Plot the counts directly so the code does not depend on the column name that to_frame() produces
sns.barplot(x=cnt.values, y=cnt.index, palette='deep', orient='h')
plt.title('Distribution of Audio Books of Top 20 Authors');

We can see **Harvard Business Review** at the top, followed by **Devdutt Pattanaik**

### Most Expensive books by different Authors

In [None]:
plt.figure(figsize=(16,8))

cnt = audiob_adv.groupby(['Author'])['Price'].max().sort_values(ascending=False).to_frame()[:20]
g2 = sns.barplot(x = cnt['Price'], y = cnt.index)
g2.set_title('Most expensive book by Author')
g2.set_ylabel('Author')
g2.set_xlabel('')

**Benjamin Graham**, the *greatest* investment advisor, has the most expensive book in the dataset.

### Highest Rated Author

In [None]:
plt.figure(figsize=(16,8))

cnt = audiob_adv.groupby(['Author'])['Rating'].max().sort_values(ascending=False).to_frame()[:20]
g2 = sns.barplot(x = cnt['Rating'], y = cnt.index)
g2.set_title('Highest Rated book by Author')
g2.set_ylabel('Author')
g2.set_xlabel('')

We can see quite a lot of authors with perfect 5 ⭐ rating

Being a Bengali, I had to update Satyajit Ray's name. 😁 

In [None]:
audiob_adv.replace(to_replace='Satyajit Rai', value = 'Satyajit Ray', inplace=True)

### Lowest Rated Author

In [None]:
plt.figure(figsize=(16,8))

cnt = audiob_adv.groupby(['Author'])['Rating'].max().sort_values(ascending=True).to_frame()[:20]
g2 = sns.barplot(x = cnt['Rating'], y = cnt.index)
g2.set_title('Lowest Rated book by Author')
g2.set_ylabel('Author')
g2.set_xlabel('')

Seeing D.H. Lawrence at the top was a real shock. On looking at the main site, I could see that the low rating was because of a corrupted file.
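
Note that the chart above ranks authors by their *maximum* rating in ascending order, that is, authors whose best book is rated lowest. Ranking by the minimum instead answers a slightly different question: which authors have the single lowest-rated book. A sketch of the difference on made-up data:

```python
import pandas as pd

# Toy ratings: author A has both a 5-star and a 1-star book
toy = pd.DataFrame({
    "Author": ["A", "A", "B", "C"],
    "Rating": [5.0, 1.0, 3.0, 4.0],
})

best = toy.groupby("Author")["Rating"].max().sort_values()   # best book per author
worst = toy.groupby("Author")["Rating"].min().sort_values()  # worst book per author

print(best.index[0], worst.index[0])
```

The same distinction applies to any ranking built from `groupby(...).max()` sorted in ascending order.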

### Authors with shortest audio books

In [None]:
trial = audiob_adv[~(audiob_adv['Time']<20)]

In [None]:
plt.figure(figsize=(16,8))

cnt = trial.groupby(['Author'])['Time'].max().sort_values(ascending=True).to_frame()[:20]
g2 = sns.barplot(x = cnt['Time'], y = cnt.index)
g2.set_title('Shortest book by Author (in minutes)')
g2.set_ylabel('Author')
g2.set_xlabel('')

We can see **Roberta Edwards** with an audiobook of approximately 1 hour. Roberta Edwards publishes short "Who was..." audiobooks.

## Rating

### Average rating of books

In [None]:
plt.figure(figsize=(10,10))
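rating = audiob_adv.Rating.astype(float)
# histplot replaces the deprecated distplot; kde and density keep a similar look
sns.histplot(rating, bins=20, kde=True, stat="density")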

We can see that most of the ratings are between 4 and 5

### Time X Rating

In [None]:
sns.set_context('paper')
# jointplot creates its own figure, so the size is set through height
ax = sns.jointplot(x="Rating", y="Time", data = audiob_adv, color = 'crimson', height=10)
ax.set_axis_labels("Rating", "Time")
plt.show()

We can see that most of the 5 star ratings are for books with approximately 35 hours of listening time.

### Rating X Price

In [None]:
sns.set_context('paper')
# jointplot creates its own figure, so the size is set through height
ax = sns.jointplot(x="Rating", y="Price", data = audiob_adv, color = 'crimson', height=10)
ax.set_axis_labels("Rating", "Price")
plt.show()

The outliers in the dataset are skewing the data.

Taking a trial dataframe containing records with a price of at most 3000

In [None]:
trial = audiob_adv[~(audiob_adv['Price']>3000)]

In [None]:
sns.set_context('paper')
# jointplot creates its own figure, so the size is set through height
ax = sns.jointplot(x="Rating", y="Price", data = trial, color = 'crimson', height=10)
ax.set_axis_labels("Rating", "Price")
plt.show()

We cannot see any linear relationship between the two. But we can see a cluster of data with ratings from 4 to 5 and prices ranging from 0 to 1500.

### Rating X Reviews

In [None]:
ax = sns.jointplot(x="Rating", y="Number_of_Reviews", data = audiob_adv)
ax.set_axis_labels("Rating", "Number of Reviews")

A rating of 4.5 is given most often. We can see that people give ratings of 1, 2 and 3 very rarely.

## Price

Let us see the correlations between different variables

In [None]:
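# Restrict to numeric columns; text columns such as Author cannot be correlated
sns.heatmap(audiob_adv.select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True);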

We cannot see any strong correlation between the variables. The highest correlation we can see is **0.079**, between *Number of Reviews* and *Price*, which is weakly positive.
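
For readers unfamiliar with the numbers in the heatmap, each cell is a Pearson correlation coefficient between two numeric columns, ranging from -1 to 1. A minimal sketch on a made-up frame, where the text column is excluded before computing the matrix:

```python
import pandas as pd

# Toy frame: Time rises in step with Price, Rating does not
toy = pd.DataFrame({
    "Author": ["A", "B", "C", "D"],
    "Price": [100.0, 200.0, 300.0, 400.0],
    "Time": [60.0, 120.0, 180.0, 240.0],
    "Rating": [4.0, 5.0, 4.0, 5.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))
```

In this toy example Price and Time are perfectly correlated, while Price and Rating are only moderately so; values close to 0, like those in the heatmap above, indicate little linear relationship.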

In [None]:
sns.set_context('paper')
# jointplot creates its own figure, so the size is set through height
ax = sns.jointplot(x="Price", y="Time", data = audiob_adv, color = 'crimson', height=10)
ax.set_axis_labels("Price", "Time")
plt.show()

Because of the outliers, the analysis is highly skewed. That is why we will take a temporary df without outliers.

In [None]:
trial = audiob_adv[~(audiob_adv['Price']>3000)]

In [None]:
sns.set_context('paper')
# jointplot creates its own figure, so the size is set through height
ax = sns.jointplot(x="Price", y="Time", data = trial, color = 'crimson', height=10)
ax.set_axis_labels("Price", "Time")
plt.show()

Most of the books cost less than 1200 and run for less than ~34 hours

In [None]:
plt.figure(figsize=(10,10))
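time = audiob_adv.Time.astype(float)
# histplot replaces the deprecated distplot; kde and density keep a similar look
sns.histplot(time, bins=20, kde=True, stat="density")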

Most of the books are less than ~34 hours
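
To put a number on "most", one option is to compute the share of rows under each threshold. A minimal sketch on made-up values; the 1200 price and 34-hour thresholds come from the text above, the data do not.

```python
import pandas as pd

# Toy frame: Time is in minutes, as in the Time column built earlier
toy = pd.DataFrame({
    "Price": [300.0, 500.0, 800.0, 1000.0, 2500.0],
    "Time": [60.0, 300.0, 600.0, 1200.0, 4000.0],
})

# The mean of a boolean mask is the fraction of rows where it is True
share_cheap = (toy["Price"] < 1200).mean()
share_short = (toy["Time"] < 34 * 60).mean()

print(share_cheap, share_short)
```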

# Thank you