### 1. Data Wrangling of Audible Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
pd.options.display.max_rows = 50000

In [2]:
#import ydata_profiling as yp

In [3]:
#read data
audible_data = pd.read_csv("data/audible_dataset/audible_uncleaned.csv")

Below, I tried the **`ydata-profiling`** package to generate a basic overview report of the data file. The report highlights dataset characteristics such as the number of duplicates, missing values, encoding, and language consistency.  
[This](https://www.blog.datahut.co/post/data-cleaning-techniques) is a great blog on cleaning scraped data.

In [4]:
#original_report = yp.ProfileReport(audible_data, title = "Prelim Analysis")
#original_report.to_file("audible_prelim_report.html")

In [4]:
audible_data.sample(4)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
16274,新刊ラジオ第2部プレミアム　第281回,"Writtenby:矢島雅弘,石橋遊","Narratedby:矢島雅弘,石橋遊",21 mins,16-05-18,japanese,Not rated yet,139.0
59890,331 metri al secondo,Writtenby:RosannaRubino,Narratedby:PaoloCarenzo,6 hrs and 29 mins,07-10-21,italian,Not rated yet,305.0
15787,第711回 新刊ラジオ第2部プレミアム,"Writtenby:矢島雅弘,石橋遊","Narratedby:矢島雅弘,石橋遊",19 mins,15-05-18,japanese,Not rated yet,139.0
53350,The Conference of the Birds,"Writtenby:Attar,SholehWolpé-Translatedby",Narratedby:FajerAl-Kaisi,8 hrs and 8 mins,28-09-21,English,Not rated yet,586.0


In [13]:
audible_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         87489 non-null  object
 1   author       87489 non-null  object
 2   narrator     87489 non-null  object
 3   time         87489 non-null  object
 4   releasedate  87489 non-null  object
 5   language     87489 non-null  object
 6   stars        87489 non-null  object
 7   price        87489 non-null  object
dtypes: object(8)
memory usage: 5.3+ MB


In [15]:
audible_data.describe()

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
count,87489,87489,87489,87489,87489,87489,87489,87489.0
unique,82767,48374,29717,2284,5058,36,665,1011.0
top,The Art of War,"Writtenby:矢島雅弘,石橋遊",Narratedby:anonymous,2 mins,16-05-18,English,Not rated yet,586.0
freq,20,874,1034,372,773,61884,72417,5533.0


From the `describe()` output, we can see that the dataset has 87,489 rows but only 82,767 unique book names. Some titles appear in multiple rows, possibly because of, for example, a different language of publication or a different narrator.  
Below, I have listed the titles with 10 or more entries in the dataset.

In [5]:
#checking number of occurrences of the same book name
book_counts = audible_data['name'].value_counts()
book_counts[book_counts >= 10]

The Art of War                 20
Sterling Biographies           19
The Odyssey                    16
Sterling Point Books           16
Hamlet                         15
The Prophet                    14
Pride and Prejudice            14
A Christmas Carol              14
The Iliad                      13
As a Man Thinketh              13
The Science of Getting Rich    13
The Picture of Dorian Gray     12
Abraham Lincoln                12
Meditations                    11
The Richest Man in Babylon     11
The Raven                      11
The Prince                     11
Unstoppable                    10
Name: name, dtype: int64
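
One way to check why a title repeats is to count, per title, how many distinct narrators and languages its rows cover. The snippet below is a minimal sketch on a made-up toy frame; the rows and the names `toy` and `variety` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic the shape of the Audible data
toy = pd.DataFrame({
    "name": ["The Art of War", "The Art of War", "The Art of War", "Hamlet"],
    "narrator": ["Narrator A", "Narrator B", "Narrator B", "Narrator C"],
    "language": ["English", "English", "german", "English"],
})

# Per title: number of rows, distinct narrators, and distinct languages
variety = toy.groupby("name").agg(
    rows=("narrator", "size"),
    narrators=("narrator", "nunique"),
    languages=("language", "nunique"),
)
print(variety)
```

Running the same aggregation on `audible_data` would show whether repeated titles differ mostly by narrator, by language, or by neither.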

The check below finds no fully duplicated rows.

In [17]:
#check for duplicate rows
audible_data.duplicated().sum()

0

In [6]:
#filter out all book names containing any special characters
booknames_special_chars = audible_data[audible_data.name.str.contains(r'[@#$%+/*]')].drop_duplicates()
#number of books that contain special characters- 592 books
booknames_special_chars.shape

(592, 8)

In [8]:
booknames_special_chars.sample(4)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
12694,English Grammar Master - New Edition - Grammar...,"Writtenby:DorotaGuzik,DominikaTkaczyk","Narratedby:LaraKalenik,TadeuszZ.Wolański,Maybe...",3 hrs and 38 mins,27-03-18,English,5 out of 5 stars1 rating,401.0
47675,Die Maske des roten Todes / Die schwarze Katze,Writtenby:EdgarAllanPoe,Narratedby:ArndtSchmöle,51 mins,21-09-21,german,Not rated yet,166.0
87121,高橋御山人の百社巡礼/其之十一 愛媛・石鎚山 西日本最高峰に 究極の天狗あり,Writtenby:高橋御山人,Narratedby:高橋御山人,26 mins,15-05-18,japanese,Not rated yet,418.0
12579,Business Spotlight Audio - Elon Musk: a contro...,Writtenby:MelitaCameron-Wood,"Narratedby:DavidIngram,ElisaMoolecherry,Melita...",1 hr and 4 mins,30-03-22,german,Not rated yet,501.0


In [9]:
#Author column
#remove the phrase "Writtenby:"
audible_data['author'] = audible_data['author'].str.replace(pat = "Writtenby:", repl = "")

In [10]:
audible_data.author

0        GeronimoStilton
1            RickRiordan
2             JeffKinney
3            RickRiordan
4            RickRiordan
              ...       
87484       ChrisStewart
87485      StephenO'Shea
87486          MarkTwain
87487     LaurenceSterne
87488      MarkKurlansky
Name: author, Length: 87489, dtype: object

This Stack Overflow thread explains the regex used below: [SO](https://stackoverflow.com/questions/199059/a-pythonic-way-to-insert-a-space-before-capital-letters)

In [11]:
#Add space between the first, middle and last names of Authors.
#e.g. JaneAusten becomes Jane Austen
audible_data['author'] = audible_data['author'].str.replace(pat = r"(\w)([A-Z])", repl = r"\1 \2", regex = True)
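
One caveat with the pattern above: `(\w)([A-Z])` consumes two characters per match, so a run of consecutive capitals is only split once, and a capital that follows a period is not split at all. The sketch below compares it with a zero-width lookaround variant. The name `LOOKAROUND` and the sample strings are illustrative assumptions; this is not the pattern used in the notebook.

```python
import re

# Pattern used in the notebook: consumes one word character plus one capital
ORIGINAL = re.compile(r"(\w)([A-Z])")
# Hypothetical alternative: matches the empty position before a capital that
# follows a word character or a period, so matches cannot overlap-block each other
LOOKAROUND = re.compile(r"(?<=[\w.])(?=[A-Z])")

samples = ["MarkTwain", "StephenO'Shea", "JKRowling", "CharlesA.Lockwood"]

for raw in samples:
    old = ORIGINAL.sub(r"\1 \2", raw)
    new = LOOKAROUND.sub(" ", raw)
    print(f"{raw!r}: original -> {old!r}, lookaround -> {new!r}")
```

The same lookaround pattern can be passed to `Series.str.replace(..., regex=True)`. Whether the extra splits are wanted (for example, turning initials into separate tokens) depends on how the names are used later.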

Some books have multiple authors, so below the 'author' column is split into the columns author0, author1, author2, and author3, each holding a single name.  
The separator "," is used to split the column.

In [12]:
audible_data2 = pd.concat( [audible_data['name'], 
                            audible_data['author'].str.split(',', expand = True).add_prefix('author'),
                            audible_data.loc[:,['narrator', 'time', 'releasedate', 'language', 'stars', 'price']]], 
                            axis = 1)

In [13]:
audible_data2.sample(4)

Unnamed: 0,name,author0,author1,author2,author3,narrator,time,releasedate,language,stars,price
83472,Gumdrop Angel,Scott Cawthon,Andrea Waggener,,,Narratedby:SuzanneEliseFreeman,6 hrs and 26 mins,04-05-21,English,Not rated yet,492.0
32821,Organflüstern - Wie wir verdauen,Ewald Kliegel,Abbas Schirmohammadi,,,Narratedby:EwaldKliegel,59 mins,11-10-21,german,Not rated yet,233.0
31571,Vietnam - Kultur und Kommunikation,Frank Brinkmann,Ulrich Leifeld,,,Narratedby:AndreasHerrler,36 mins,04-04-19,german,Not rated yet,65.0
21741,Sink ‘Em All,Charles A.Lockwood,,,,Narratedby:EricMartin,16 hrs and 5 mins,03-07-18,English,Not rated yet,1003.0


The following code checks how many values in the new author columns (author0, author1, author2, author3) are null.  
If most of the rows are empty, there is little point in keeping the additional columns.

In [14]:
#number of missing values in author columns
audible_data2.loc[:,['author0','author1','author2','author3']].isnull().sum()

author0        0
author1    73762
author2    85135
author3    86713
dtype: int64
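
To put these counts in proportion, they can be divided by the total number of rows (87,489, from `info()`). The sketch below re-enters the counts printed above by hand; the names `missing` and `missing_share` are illustrative.

```python
import pandas as pd

total_rows = 87489  # row count reported by audible_data.info()

# Null counts copied from the output above
missing = pd.Series(
    {"author0": 0, "author1": 73762, "author2": 85135, "author3": 86713}
)

# Share of rows with no value in each author column
missing_share = (missing / total_rows).round(3)
print(missing_share)
```

Roughly 84% of rows have no second author, and about 97% and 99% have no third or fourth. On the real frame, `audible_data2[['author0', 'author1', 'author2', 'author3']].isnull().mean()` should give the same shares directly.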

In [15]:
#remove "Narratedby:" from Narrator column
audible_data2['narrator'] = audible_data2['narrator'].str.replace(pat = "Narratedby:", repl = "")

In [16]:
#add space between first and last name of the Narrator
#e.g. JaneAusten becomes Jane Austen
audible_data2['narrator'] = audible_data2['narrator'].str.replace(pat = r"(\w)([A-Z])", repl = r"\1 \2", regex = True)

In [17]:
audible_data2.sample(5)

Unnamed: 0,name,author0,author1,author2,author3,narrator,time,releasedate,language,stars,price
30026,Crypto Economy,Aries Wang,,,,Mike Lenz,3 hrs and 45 mins,03-08-21,English,4 out of 5 stars1 rating,422.0
40527,A More Perfect Union,Adam Russell Taylor,John Lewis-foreword,,,Terrence Kidd,10 hrs and 1 min,19-10-21,English,Not rated yet,586.0
24578,Love Isn't Supposed to Hurt,Christi Paul,,,,Christi Paul,6 hrs and 55 mins,18-06-12,English,Not rated yet,797.0
6235,A Day at the Beach,Lissa Rovetch,,,,Highlightsfor Children,1 min,20-08-18,English,Not rated yet,46.0
10518,A Prescription for Change,Michael Kinch,,,,William Hughes,13 hrs and 27 mins,07-11-16,English,5 out of 5 stars2 ratings,820.0


#### Time column

In [25]:
#make a copy of the time column to explore all the formats in which the data is present
time_column = audible_data2['time']
time_column_copy = time_column.copy()

Since the 'time' column is stored as strings (dtype `object`), we first need to see all the distinct formats in which time is written.  
For example, some rows might have the format '7 hrs 22 mins' while others have '7 hr 22 mins' (note the missing 's' in 'hr'), '9 hrs', and so on.  
Based on the formats found, we can decide which steps are needed to clean this column.  
The objective is to convert these values to the form **hh:mm**. 

In [26]:
#replace all numbers with blanks
time_column_copy = time_column_copy.str.replace(pat = r'[0-9]', repl = '', regex = True)
time_column_copy.sample(3)

33256              mins
38150     hrs and  mins
45664     hrs and  mins
Name: time, dtype: object

In [27]:
#keep only unique patterns
time_column_copy = time_column_copy.drop_duplicates()
time_column_copy

0           hrs and  mins
4                     hrs
12           hrs and  min
29           hr and  mins
53                   mins
227                    hr
255           hr and  min
1203                  min
1401    Less than  minute
Name: time, dtype: object
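
All nine patterns above could be handled by extracting the hour and minute numbers separately with two small regexes. The sketch below is one possible approach, not necessarily the one used later in this notebook. The names `HOURS`, `MINUTES`, and `to_hhmm` are hypothetical, and mapping "Less than 1 minute" to `00:01` is an assumption.

```python
import re

# Optional trailing "s" covers both "hr"/"hrs" and "min"/"mins";
# "(?:ute)?" additionally covers "minute"
HOURS = re.compile(r"(\d+)\s*hrs?\b")
MINUTES = re.compile(r"(\d+)\s*min(?:ute)?s?\b")

def to_hhmm(text: str) -> str:
    """Convert strings such as '7 hrs and 54 mins' to 'hh:mm'."""
    hours = HOURS.search(text)
    minutes = MINUTES.search(text)
    h = int(hours.group(1)) if hours else 0
    # Assumption: "Less than 1 minute" is read as 1 minute
    m = int(minutes.group(1)) if minutes else 0
    return f"{h:02d}:{m:02d}"

samples = ["7 hrs and 54 mins", "9 hrs", "1 hr and 1 min", "21 mins", "Less than 1 minute"]
converted = [to_hhmm(s) for s in samples]
print(converted)
```

On the real column this could be applied with `audible_data2['time'].map(to_hhmm)`.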

The 'time' column contains the following formats:  
1. 7 hrs and 54 mins
2. 9 hrs
3. 7 mins