### 1. Data Wrangling of Audible Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
pd.options.display.max_rows = 50000

In [2]:
#import ydata_profiling as yp

In [12]:
#read data
audible_data = pd.read_csv("data/audible_dataset/audible_uncleaned.csv")

Below, I tried out the **`ydata-profiling`** package to generate a basic overview report on the data file. It highlights many properties of the dataset, such as the number of duplicates, missing values, encoding, and language consistency.  
[This](https://www.blog.datahut.co/post/data-cleaning-techniques) is a great blog post on cleaning scraped data.

In [4]:
#original_report = yp.ProfileReport(audible_data, title = "Prelim Analysis")
#original_report.to_file("audible_prelim_report.html")

In [4]:
audible_data.sample(4)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
82306,The White Silence,Writtenby:JackLondon,Narratedby:JohnChatty,24 mins,25-06-07,English,Not rated yet,65.0
10069,Ivy,Writtenby:KatherineCoville,"Narratedby:CarmenVivianoCrafts,CynthiaBishop,H...",2 hrs and 3 mins,21-06-21,English,Not rated yet,305.0
81870,Open Mic,Writtenby:MitaliPerkins,"Narratedby:MitaliPerkins,ToddHaberkorn,JDJackson,",2 hrs and 48 mins,10-09-13,English,Not rated yet,352.0
5885,The Negro Leagues,Writtenby:MattDoeden,Narratedby:BookBuddyDigitalMedia,1 hr and 11 mins,06-05-21,English,Not rated yet,469.0


In [13]:
audible_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         87489 non-null  object
 1   author       87489 non-null  object
 2   narrator     87489 non-null  object
 3   time         87489 non-null  object
 4   releasedate  87489 non-null  object
 5   language     87489 non-null  object
 6   stars        87489 non-null  object
 7   price        87489 non-null  object
dtypes: object(8)
memory usage: 5.3+ MB


In [15]:
audible_data.describe()

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
count,87489,87489,87489,87489,87489,87489,87489,87489.0
unique,82767,48374,29717,2284,5058,36,665,1011.0
top,The Art of War,"Writtenby:矢島雅弘,石橋遊",Narratedby:anonymous,2 mins,16-05-18,English,Not rated yet,586.0
freq,20,874,1034,372,773,61884,72417,5533.0


From the `describe()` output, we can see that the dataset has 87,489 rows but only 82,767 unique book names. Some books appear in multiple rows, possibly because of a different publication language or a different narrator.  
Below, I have listed the books with 10 or more entries in the dataset.

In [16]:
#check the number of occurrences of the same book name
book_counts = audible_data['name'].value_counts()
book_counts[book_counts >= 10]

The Art of War                 20
Sterling Biographies           19
The Odyssey                    16
Sterling Point Books           16
Hamlet                         15
The Prophet                    14
Pride and Prejudice            14
A Christmas Carol              14
The Iliad                      13
As a Man Thinketh              13
The Science of Getting Rich    13
The Picture of Dorian Gray     12
Abraham Lincoln                12
Meditations                    11
The Richest Man in Babylon     11
The Raven                      11
The Prince                     11
Unstoppable                    10
Name: name, dtype: int64
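
To see why a title repeats, one option is to count the distinct narrators and languages per book name. The snippet below is a minimal sketch on made-up toy rows (the `toy` frame and its values are assumptions for illustration, not rows from the dataset); the same `groupby` could be run on `audible_data`.

```python
import pandas as pd

# Toy rows standing in for audible_data; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["The Art of War", "The Art of War", "The Art of War", "Hamlet"],
    "narrator": ["Narratedby:A", "Narratedby:B", "Narratedby:A", "Narratedby:C"],
    "language": ["English", "English", "spanish", "English"],
})

# For each title: number of rows, distinct narrators and distinct languages.
repeat_summary = (
    toy.groupby("name")
       .agg(rows=("narrator", "size"),
            narrators=("narrator", "nunique"),
            languages=("language", "nunique"))
       .sort_values("rows", ascending=False)
)
print(repeat_summary)
```

A title whose `narrators` or `languages` count is above 1 is repeated for one of the reasons guessed above rather than being a plain duplicate.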

The check below finds no fully duplicated rows.

In [17]:
#check for duplicate rows
audible_data.duplicated().sum()

0

In [18]:
#select rows whose book name contains any of these special characters: @ # $ % + / *
booknames_special_chars = audible_data[audible_data['name'].str.contains(r'[@#$%+/*]')].drop_duplicates()
booknames_special_chars.sample(4)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
13355,Ecos Audio - Communicación virtual. 2/2021,Writtenby:CovadongaJiménez,Narratedby:div.,58 mins,02-02-21,german,Not rated yet,501.0
61523,Toddlers Are A**holes,Writtenby:BunmiLaditan,Narratedby:BahniTurpin,3 hrs and 24 mins,02-06-15,English,4 out of 5 stars2 ratings,502.0
42538,"The $60,000 Dog",Writtenby:LaurenSlater,Narratedby:CassandraCampbell,11 hrs and 31 mins,20-11-12,English,Not rated yet,836.0
68887,花粉症の原型を見つけて、77%緩和するためのNLP瞑想,Writtenby:志麻絹依,Narratedby:志麻絹依,23 mins,06-11-17,japanese,Not rated yet,781.0
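
Inside a regex character class such as `[@#$%+/*]`, the characters `$`, `+` and `*` are matched literally, so no escaping is needed. The sketch below shows this on a few titles (three taken from the sample above plus "Hamlet" as a non-matching case) and uses `str.findall` to list which characters triggered each match; the variable names are assumptions for illustration.

```python
import pandas as pd

titles = pd.Series([
    "The $60,000 Dog",
    "Toddlers Are A**holes",
    "Hamlet",
    "Ecos Audio - Communicación virtual. 2/2021",
])

special = r"[@#$%+/*]"

# True where the title contains at least one of the listed characters.
mask = titles.str.contains(special, regex=True)

# The matched characters themselves, one list per title.
found = titles.str.findall(special)

print(pd.DataFrame({"title": titles, "has_special": mask, "matched": found}))
```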


In [19]:
#Author column
#remove the literal prefix "Writtenby:"
audible_data['author'] = audible_data['author'].str.replace(pat = "Writtenby:", repl = "", regex = False)

In [20]:
audible_data.author

0        GeronimoStilton
1            RickRiordan
2             JeffKinney
3            RickRiordan
4            RickRiordan
              ...       
87484       ChrisStewart
87485      StephenO'Shea
87486          MarkTwain
87487     LaurenceSterne
87488      MarkKurlansky
Name: author, Length: 87489, dtype: object

This Stack Overflow thread explains the regex used below: [SO](https://stackoverflow.com/questions/199059/a-pythonic-way-to-insert-a-space-before-capital-letters)

In [21]:
#Add a space between the first, middle and last names of authors.
#e.g. JaneAusten becomes Jane Austen
audible_data['author'] = audible_data['author'].str.replace(pat = r"(\w)([A-Z])", repl = r"\1 \2", regex = True)
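
The pattern `(\w)([A-Z])` inserts a space between any word character and a following capital letter, which also splits runs of initials (the sampled output shows "D K"). The sketch below compares it with a lookaround variant that only splits where a lowercase letter meets a capital. The variant is an alternative for comparison, not what this notebook uses, and the example names are chosen for illustration.

```python
import pandas as pd

names = pd.Series(["JaneAusten", "StephenO'Shea", "JDJackson", "DK"])

# Pattern used in the notebook: word character followed by a capital letter.
as_in_notebook = names.str.replace(r"(\w)([A-Z])", r"\1 \2", regex=True)

# Variant: zero-width match between a lowercase letter and a capital letter,
# so consecutive capitals (initials) stay together.
lowercase_boundary = names.str.replace(r"(?<=[a-z])(?=[A-Z])", " ", regex=True)

print(pd.DataFrame({"raw": names,
                    "notebook": as_in_notebook,
                    "variant": lowercase_boundary}))
```

Neither pattern is strictly better: the notebook's version separates initials only partially ("JDJackson" becomes "J DJackson"), while the variant leaves them untouched.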

Some books have multiple authors, so the 'author' column is split below into separate columns (author0, author1, author2, ...), each holding a single name.  
The separator "," is used to split the column.

In [23]:
audible_data2 = pd.concat( [audible_data['name'], 
                            audible_data['author'].str.split(',', expand = True).add_prefix('author'),
                            audible_data.loc[:,['narrator', 'time', 'releasedate', 'language', 'stars', 'price']]], 
                            axis = 1)

In [26]:
audible_data2.sample(4)

Unnamed: 0,name,author0,author1,author2,author3,narrator,time,releasedate,language,stars,price
87462,Up with the Larks,Tessa Hainsworth,,,,Narratedby:AnnaBentinck,10 hrs and 4 mins,14-06-10,English,Not rated yet,531.0
2293,P'tit Loup ne veut pas dormir,Orianne Lallemand,,,,Narratedby:WillProduction,3 mins,28-07-21,french,Not rated yet,74.0
602,Los Atrevidos,Elsa Punset,,,,"Narratedby:OliviaVives,SilviaGómezLasil",4 hrs and 57 mins,23-09-21,spanish,Not rated yet,268.0
5237,The Vampire Book,D K,,,,Narratedby:BethEyre,1 hr and 43 mins,29-10-20,English,Not rated yet,410.0
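
How `str.split(..., expand=True)` followed by `add_prefix` produces the `author0`, `author1`, ... columns can be seen on a tiny example. This is a sketch with two invented author strings (assumed values, not rows from the dataset).

```python
import pandas as pd

# Invented author strings: one single author, one with three authors.
authors = pd.Series(["Mitali Perkins", "Jane Doe,John Roe,Ann Poe"])

# expand=True returns one column per piece, labelled 0, 1, 2, ...
# add_prefix turns those labels into author0, author1, author2.
split = authors.str.split(",", expand=True).add_prefix("author")
print(split)

# Number of authors per row, without creating extra columns.
author_count = authors.str.split(",").str.len()
print(author_count)
```

Rows with fewer authors than the maximum are padded with missing values, which is why most of the extra columns end up empty.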


In the following code, I check how many of the new author columns (author1, author2, author3) contain null values.  
If most of the rows are empty, there is little point in creating the additional columns.

In [33]:
#number of missing values in author columns
audible_data2.loc[:,['author0','author1','author2','author3']].isnull().sum()

author0        0
author1    73762
author2    85135
author3    86713
dtype: int64
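
Expressing these counts as a share of the 87,489 rows makes the sparsity easier to judge. The sketch below recomputes the percentages from the counts printed above; on the real dataframe, `isnull().mean()` on the same columns should give the equivalent fractions directly.

```python
import pandas as pd

total_rows = 87489  # from audible_data.info()

# Missing-value counts copied from the output above.
missing = pd.Series({"author0": 0, "author1": 73762,
                     "author2": 85135, "author3": 86713})

# Percentage of rows with no value in each author column.
missing_pct = (missing / total_rows * 100).round(1)
print(missing_pct)
```

Roughly 84% of rows have no second author, and about 97% and 99% have no third or fourth.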

In [34]:
#remove the literal prefix "Narratedby:" from the narrator column
audible_data2['narrator'] = audible_data2['narrator'].str.replace(pat = "Narratedby:", repl = "", regex = False)

In [35]:
#add a space between the first and last name of the narrator
#e.g. JaneAusten becomes Jane Austen
audible_data2['narrator'] = audible_data2['narrator'].str.replace(pat = r"(\w)([A-Z])", repl = r"\1 \2", regex = True)

In [36]:
audible_data2.sample(5)

Unnamed: 0,name,author0,author1,author2,author3,narrator,time,releasedate,language,stars,price
33347,RX 17 Series: Stop Drinking,Dick Sutphen,,,,Dick Sutphen,59 mins,01-05-21,English,Not rated yet,258.0
11144,Speech Police,David Kaye,,,,Andrew Eiden,3 hrs and 49 mins,04-06-19,English,Not rated yet,502.0
40538,God's Mighty Hand,"Richard""Little Bear""Wheeler",,,,Jim Hodges,12 hrs and 47 mins,20-06-17,English,Not rated yet,1003.0
55079,Lulu,Nancy Friday,,,,Karen White,6 hrs and 4 mins,01-05-18,English,Not rated yet,668.0
24609,Miracle for Jen,Linda Barrick,,,,Kirsten Potter,7 hrs and 33 mins,01-03-12,English,Not rated yet,586.0


#### Time column

The 'time' column contains values in formats such as:  
1. 7 hrs and 54 mins
2. 9 hrs
3. 7 mins

In [37]:
#make a copy of the time column to inspect all the formats in which the data is present
time_column = audible_data2['time'].copy()

In [91]:
#replace all numbers with blanks
time_column = time_column.str.replace(pat = r'[0-9]', repl = '', regex = True)
time_column.sample(10)


6663               mins
70945     hrs and  mins
74027     hrs and  mins
86279     hrs and  mins
46503     hrs and  mins
87296      hr and  mins
56718              mins
3048               mins
66370              mins
13629      hr and  mins
Name: time, dtype: object

Now, keep only the unique patterns in the **`time_column`** object.  
I did this to see every wording that occurs in the 'time' column of the original dataframe.  
As we can see, some rows use the **x hrs and y mins** format, others use **x hr and y mins** (note the singular "hr"), and so on.  
The intention is to convert these strings to the form **hh:mm**. 

In [92]:
#keep only unique patterns
time_column = time_column.drop_duplicates()
time_column

0           hrs and  mins
4                     hrs
12           hrs and  min
29           hr and  mins
53                   mins
227                    hr
255           hr and  min
1203                  min
1401    Less than  minute
Name: time, dtype: object
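
One possible way to reach the **hh:mm** form for all nine patterns listed above is to extract the hour and minute numbers with two regexes and rebuild the string. This is a sketch under assumptions, not necessarily the approach taken later in this notebook: the helper name `to_hhmm` is hypothetical, and "Less than 1 minute" is treated as 1 minute.

```python
import pandas as pd

def to_hhmm(times: pd.Series) -> pd.Series:
    """Convert strings like '7 hrs and 54 mins' to 'hh:mm' (hypothetical helper)."""
    # Number in front of "hr"/"hrs"; missing when the string has no hour part.
    hours = pd.to_numeric(
        times.str.extract(r"(\d+)\s*hrs?\b", expand=False), errors="coerce"
    ).fillna(0).astype(int)
    # Number in front of "min"/"mins"/"minute"; missing when there is no minute part.
    # Assumption: "Less than 1 minute" therefore counts as 1 minute.
    minutes = pd.to_numeric(
        times.str.extract(r"(\d+)\s*min", expand=False), errors="coerce"
    ).fillna(0).astype(int)
    total = hours * 60 + minutes
    hh = (total // 60).astype(str).str.zfill(2)
    mm = (total % 60).astype(str).str.zfill(2)
    return hh + ":" + mm

sample = pd.Series(["7 hrs and 54 mins", "9 hrs", "7 mins",
                    "1 hr and 1 min", "Less than 1 minute"])
converted = to_hhmm(sample)
print(converted)
```

Keeping the total in minutes as an integer column alongside the formatted string may be more convenient for later numeric analysis.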

In [87]:
audible_data2['time'].sample(50)

59945     7 hrs and 54 mins
37645     6 hrs and 34 mins
8889      2 hrs and 47 mins
3648       1 hr and 13 mins
48675                 9 hrs
58188      13 hrs and 1 min
6637      2 hrs and 17 mins
67209     6 hrs and 14 mins
49235     2 hrs and 29 mins
4047                 7 mins
46524     8 hrs and 37 mins
76973     4 hrs and 15 mins
40488     6 hrs and 44 mins
9386       3 hrs and 6 mins
72133     6 hrs and 58 mins
68725               22 mins
14976                7 mins
64196     7 hrs and 45 mins
18165               46 mins
68426     8 hrs and 24 mins
82085      4 hrs and 3 mins
70099    14 hrs and 19 mins
14871               58 mins
50994      1 hr and 58 mins
74554     9 hrs and 22 mins
18040     6 hrs and 30 mins
43299     9 hrs and 58 mins
78901      5 hrs and 3 mins
20484    13 hrs and 40 mins
4046                 6 mins
53701               36 mins
74950     7 hrs and 10 mins
43910     10 hrs and 3 mins
79644    14 hrs and 19 mins
62477     3 hrs and 32 mins
37479    15 hrs and 