Audible is a popular audiobook and spoken-word entertainment service owned by Amazon. It offers a vast library of audiobooks, podcasts, and other audio content that users can purchase or subscribe to. Audible allows users to listen to books on various devices, such as smartphones, tablets, and computers, making it convenient for people to enjoy literature and other content while on the go or at home. It also provides features like offline listening, bookmarking, and speed adjustment for a personalized listening experience.

## Understanding the data

- `name` — The name of the audiobook.
- `author` — The audiobook’s author.
- `narrator` — The audiobook’s narrator.
- `time` — The audiobook’s duration, in hours and minutes.
- `releasedate` — The date the audiobook was published.
- `language` — The audiobook’s language.
- `stars` — The average number of stars (out of 5) and the number of ratings (if available).
- `price` — The audiobook’s price in INR (Indian Rupee).

## Problems with the dataset:
### `Dirty Data`
- `name`:
    - Some names contain mis-encoded characters, and some names appear more than once, e.g.
        - â€™, Ã¤, Ã¼ values `validity`
        - é«˜æ©‹å¾¡å±±äººã®ç™¾ç¤¾å·¡ç¤¼ï¼å…¶ä¹‹äº”æ‹¾å£±ã€€æ»‹è³€ãƒ»æ°¸æºå¯ºã€€æœ¨åœ°å¸«ã‚’æŸã­ã‚‹ã€Œãƒ•ãƒªãƒ¼ãƒ¡ãƒ¼ã‚½ãƒ³ã€ `validity`
        - Duplicate name found `validity`


- `author` & `narrator`:
    - Every `author` value starts with "Writtenby:" (and every `narrator` value with "Narratedby:"), e.g., "Writtenby:GeronimoStilton" `accuracy`
    - Some values also contain mis-encoded characters, e.g., "Writtenby:FranciscoDÃ­azValladares" `validity`
    - The first and last names are not separated by a space, e.g., "Writtenby:NicolasGorny". `accuracy`
    - Some values include additional information, such as a role or several names in one field, e.g., "Writtenby:AndrewPeterson-editor,JonathanRogers,N.D.Wilson," `validity`
    - Some narrator values are not proper names, e.g., "Narratedby:uncredited". `completeness`


- `time`:
    - The values are free-text combinations of hours and minutes, e.g., "2 hrs and 20 mins", "10 hrs", "22 mins" `validity`


- `releasedate`:
    - There are two kinds of entries: four-digit years ("08-04-2008") and two-digit years ("13-01-10") `validity`
    - The dtype is `object`; converting the column to `datetime` resolves the mixed formats. `validity`


- `language`
    - [x] Some values are capitalized and some are in lower case, e.g. "English" and "german". `consistency`


- `stars`
    - Replace "Not rated yet" with 0, NaN, or another placeholder value `completeness`

- `price`
    - [x] There is a non-numeric value, "Free", which also forces the column's dtype to `object`. `completeness`

### `Messy Data`
- `time`:
    - Combine the hour and minute parts into a single column of total minutes
- `stars`
    - Separate "5 out of 5 stars34 ratings" into two columns: 5 (the star rating) in one and 34 (the number of ratings) in the other

### Data Cleaning Order

1. Quality -> Completeness
2. Tidiness (messy data)
3. Quality -> Validity
4. Quality -> Accuracy
5. Quality -> Consistency

In [1]:
import numpy as np
import pandas as pd
from unidecode import unidecode

In [2]:
# Read the raw CSV file; here I am using a local path, so adjust it to wherever your copy is stored
df = pd.read_csv(r"D:\CampMain\DSMP 1.0\2) Python\3) Data analysis Process\02) week 12\01) Data Assessing and Cleaning\audible\audible_uncleaned(in).csv")
df.head()

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,Writtenby:GeronimoStilton,Narratedby:BillLobely,2 hrs and 20 mins,08-04-2008,English,5 out of 5 stars34 ratings,468
1,The Burning Maze,Writtenby:RickRiordan,Narratedby:RobbieDaymond,13 hrs and 8 mins,05-01-2018,English,4.5 out of 5 stars41 ratings,820
2,The Deep End,Writtenby:JeffKinney,Narratedby:DanRussell,2 hrs and 3 mins,11-06-2020,English,4.5 out of 5 stars38 ratings,410
3,Daughter of the Deep,Writtenby:RickRiordan,Narratedby:SoneelaNankani,11 hrs and 16 mins,10-05-2021,English,4.5 out of 5 stars12 ratings,615
4,"The Lightning Thief: Percy Jackson, Book 1",Writtenby:RickRiordan,Narratedby:JesseBernstein,10 hrs,13-01-2010,English,4.5 out of 5 stars181 ratings,820


In [3]:
audible = df.copy()

# Completeness

In [4]:
audible.shape

(87489, 8)

In [5]:
def check_missing(df):
    X = (df.isnull().sum() / df.shape[0] * 100).sort_values(ascending=False).reset_index().rename(columns={'index':'feature',0:'missing_percentage'})
    return X[X['missing_percentage']>0]

check_missing(audible)

Unnamed: 0,feature,missing_percentage


In [6]:
audible.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         87489 non-null  object
 1   author       87489 non-null  object
 2   narrator     87489 non-null  object
 3   time         87489 non-null  object
 4   releasedate  87489 non-null  object
 5   language     87489 non-null  object
 6   stars        87489 non-null  object
 7   price        87489 non-null  object
dtypes: object(8)
memory usage: 5.3+ MB


In [10]:
audible['price'].apply(lambda x: x if x.isalpha() else np.nan).dropna(axis=0).unique()

array(['Free'], dtype=object)

In [12]:
audible['price']=audible['price'].str.replace('Free','0').str.replace(',','').astype(float)
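The conversion above can be checked on a few hand-made values. This is a minimal sketch: the sample strings are assumptions chosen to mirror the formats handled by the cell (plain numbers, a thousands separator, and "Free"), not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values mirroring the formats in the `price` column.
sample = pd.Series(['468', '1,256.00', 'Free'])

# Map "Free" to 0, drop thousands separators, then cast to float.
cleaned = (
    sample.str.replace('Free', '0', regex=False)
          .str.replace(',', '', regex=False)
          .astype(float)
)
print(cleaned.tolist())
```

Treating "Free" as a price of 0 keeps those rows in the column instead of turning them into missing values.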

In [19]:
audible['stars'] = audible['stars'].replace('Not rated yet',np.nan)

# Tidiness

In [24]:
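# Raw strings avoid invalid-escape warnings for \d; expand=False returns a Series instead of a one-column DataFrame.
audible['rating_stars'] = audible['stars'].str.extract(r'^([\d.]+)', expand=False).astype(float)
audible['n_ratings'] = audible['stars'].str.replace(',', '').str.extract(r'(\d+) rating', expand=False).astype(float)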
audible.drop(columns=['stars'],inplace=True)
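To see what the two regular expressions capture, here is a small sketch on hand-written strings. The second and third values (a thousands separator and a singular "rating") are assumptions about formats that may occur, not values copied from the dataset.

```python
import pandas as pd

# Hypothetical sample values in the style of the `stars` column.
sample = pd.Series([
    '5 out of 5 stars34 ratings',
    '4.5 out of 5 stars1,181 ratings',
    '4 out of 5 stars1 rating',
])

# The leading number is the star rating.
rating_stars = sample.str.extract(r'^([\d.]+)', expand=False).astype(float)

# After removing commas, the digits right before " rating" are the rating count.
n_ratings = (
    sample.str.replace(',', '', regex=False)
          .str.extract(r'(\d+) rating', expand=False)
          .astype(float)
)
print(rating_stars.tolist(), n_ratings.tolist())
```

Matching on `' rating'` rather than `' ratings'` covers both the singular and the plural form.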

In [25]:
check_missing(audible)

Unnamed: 0,feature,missing_percentage
0,rating_stars,82.772691
1,n_ratings,82.772691


- We could turn the NaN values to 0 or another numeric value, or we could keep them. It depends on our use case.

- If we want to plot the ratings distribution, it can make sense to drop audiobooks with no ratings. But if we need to use the distribution of prices for our analysis, then removing audiobooks with no ratings will bias our results (since unrated audiobooks are likely more niche and might have a different pricing structure than rated audiobooks). Instead of guessing a value, we can add a missing-value indicator column (the idea behind scikit-learn's `MissingIndicator`) that records which audiobooks had no rating.
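The trade-off described above can be sketched on toy data. Everything in this block is hypothetical (the frame `toy`, its values, and the column name `has_rating`); it only illustrates keeping the NaN values while flagging them, and comparing prices across the two groups before deciding to drop anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rated and two unrated audiobooks.
toy = pd.DataFrame({
    'price': [468.0, 820.0, 0.0, 615.0],
    'rating_stars': [5.0, 4.5, np.nan, np.nan],
})

# 1 if the audiobook has a rating, 0 otherwise; the NaN values themselves stay untouched.
toy['has_rating'] = toy['rating_stars'].notna().astype(int)

# Compare mean price between rated and unrated titles.
price_by_group = toy.groupby('has_rating')['price'].mean()
print(price_by_group)
```

If the two group means differ noticeably on the real data, dropping unrated rows would shift any price statistic computed afterwards.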

# Validity

## Define
- Convert `releasedate` to datetime
- Convert the `time` column to a single duration in minutes

In [26]:
audible['releasedate'].head(10)

0    08-04-2008
1    05-01-2018
2    11-06-2020
3    10-05-2021
4    13-01-2010
5      30-10-18
6      25-11-14
7    05-02-2017
8    05-02-2017
9      24-09-19
Name: releasedate, dtype: object

In [136]:
pd.to_datetime(audible['releasedate'], errors='coerce').head(10)

0   2008-08-04
1   2018-05-01
2   2020-11-06
3   2021-10-05
4          NaT
5          NaT
6          NaT
7   2017-05-02
8   2017-05-02
9          NaT
Name: releasedate, dtype: datetime64[ns]

In [52]:
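import re

# Remember the most recent four-digit year so that its century can be reused
# for the two-digit years that follow (assumes the first row has a four-digit year).
previous_val = None

def date_fix(x):
    global previous_val
    # The last captured group is the year part of the dd-mm-yy(yy) string.
    v = re.findall(r'(\d+).(\d+).(\d+)', x)[0][-1]
    if len(v) == 4:
        previous_val = v
        return x
    # Expand only the trailing year; x.replace(v, ...) would also rewrite
    # a matching day or month (e.g. "10-10-10").
    i = x.rfind(v)
    return x[:i] + previous_val[:2] + x[i:]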

In [49]:
audible['releasedate'].apply(date_fix).head(10)

0    08-04-2008
1    05-01-2018
2    11-06-2020
3    10-05-2021
4    13-01-2010
5    30-10-2018
6    25-11-2014
7    05-02-2017
8    05-02-2017
9    24-09-2019
Name: releasedate, dtype: object

In [50]:
pd.to_datetime(audible['releasedate'], format='mixed').head(10)

0   2008-08-04
1   2018-05-01
2   2020-11-06
3   2021-10-05
4   2010-01-13
5   2018-10-30
6   2014-11-25
7   2017-05-02
8   2017-05-02
9   2019-09-24
Name: releasedate, dtype: datetime64[ns]

In [51]:
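# The values look day-first (e.g. 13-01-2010), so pass dayfirst=True; without it,
# ambiguous values such as 08-04-2008 are read month-first, as in the preview above.
audible['releasedate'] = pd.to_datetime(audible['releasedate'], format='mixed', dayfirst=True)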
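A stateless alternative for the date handling is sketched below. It rests on two assumptions that the data should be checked against: every date is day-first (suggested by values such as "13-01-2010" and "30-10-18"), and every two-digit year belongs to the 2000s. The sample values are taken from the `head(10)` output above.

```python
import pandas as pd

# Values copied from the releasedate preview above.
sample = pd.Series(['08-04-2008', '13-01-2010', '30-10-18', '24-09-19'])

# Expand a trailing two-digit year to four digits (assumption: all are 20xx).
expanded = sample.str.replace(r'-(\d{2})$', r'-20\1', regex=True)

# An explicit day-first format removes any day/month guessing.
parsed = pd.to_datetime(expanded, format='%d-%m-%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

With an explicit format, "08-04-2008" is read as 8 April 2008, and any value that does not fit the pattern raises an error instead of being silently misread.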

In [53]:
audible['time'].sample(10)

74020    13 hrs and 16 mins
72811     5 hrs and 28 mins
76132     6 hrs and 52 mins
54166               30 mins
3974                 6 mins
80796     8 hrs and 49 mins
18548     8 hrs and 38 mins
15296               18 mins
11171       9 hrs and 1 min
75111     26 hrs and 4 mins
Name: time, dtype: object

In [70]:
def time_handler(x):
    # Convert "2 hrs and 20 mins", "10 hrs" or "22 mins" to total minutes.
    # Either part may be absent, so default each one to 0.
    hrs = re.findall(r'(\d+)\shr', x)
    mins = re.findall(r'(\d+)\smin', x)
    return (int(hrs[0]) * 60 if hrs else 0) + (int(mins[0]) if mins else 0)

In [75]:
# expand=False returns a Series, so the arithmetic below yields a Series as well.
hours = audible['time'].str.extract(r'(\d+)\shr', expand=False).fillna(0).astype(int)
minutes = audible['time'].str.extract(r'(\d+)\smin', expand=False).fillna(0).astype(int)
audible['time_mins'] = hours * 60 + minutes
audible.drop(columns=['time'], inplace=True)
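The same two regular expressions can be checked on the duration formats listed in the assessment. This is a small sketch on hand-picked strings; it uses `pd.to_numeric` for the conversion, which is a substitution made here for illustration rather than the call used in the cell above.

```python
import pandas as pd

# Formats taken from the assessment notes and the sample output above.
sample = pd.Series(['2 hrs and 20 mins', '10 hrs', '22 mins', '9 hrs and 1 min'])

# Missing parts come back as NaN and are treated as 0.
hours = pd.to_numeric(sample.str.extract(r'(\d+)\shr', expand=False)).fillna(0).astype(int)
minutes = pd.to_numeric(sample.str.extract(r'(\d+)\smin', expand=False)).fillna(0).astype(int)

time_mins = hours * 60 + minutes
print(time_mins.tolist())
```

Because the patterns match `hr` and `min` without the trailing "s", singular forms such as "1 min" are handled too.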

In [76]:
subset_cols=['name','author','narrator','time_mins','price']
audible.duplicated(subset=subset_cols).sum()

70

In [77]:
# Sort by release date first so that keep='last' retains the most recent release of each duplicate.
audible = audible.sort_values('releasedate').drop_duplicates(subset=subset_cols, keep='last').sort_index()

In [78]:
# unidecode transliterates non-ASCII characters to their closest ASCII equivalents, which makes the strings easier to work with
audible['name'] = audible['name'].apply(unidecode)
audible['author'] = audible['author'].apply(unidecode)
audible['narrator'] = audible['narrator'].apply(unidecode)
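Note that `unidecode` transliterates whatever characters are in the string, including the garbled ones. Sequences such as "â€™" and "Ã¤" are typical of UTF-8 text that was decoded as Windows-1252, so one option is to try reversing that step before transliterating. The helper below is a hedged sketch: `fix_mojibake` is a hypothetical name, and the cp1252 assumption may not hold for every value (strings that cannot be round-tripped are returned unchanged).

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mis-decoded as cp1252; return it unchanged on failure."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not cp1252 mojibake (or not recoverable this way): leave as-is.
        return text

# "FranciscoDÃ­azValladares" from the assessment, written with explicit escapes.
garbled = 'FranciscoD\u00c3\u00adazValladares'
print(fix_mojibake(garbled))
print(fix_mojibake('Geronimo Stilton'))
```

It could be applied with `audible['author'].apply(fix_mojibake)` before the `unidecode` step, after checking the result on a sample of affected rows.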

# Accuracy

In [86]:
audible['author']   = audible['author'].str.replace('Writtenby:','')
audible['narrator'] = audible['narrator'].str.replace('Narratedby:','')
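Removing the prefix still leaves names run together ("GeronimoStilton") and several names packed into one value. One possible follow-up is sketched below; `split_names` is a hypothetical helper, and the lowercase-to-uppercase rule is a heuristic that would wrongly split names such as "McDonald".

```python
import re

def split_names(value):
    """Split a comma-separated credit string into names with spaces restored."""
    # Drop empty pieces left by trailing commas.
    names = [part for part in value.split(',') if part]
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one.
    return [re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name) for name in names]

print(split_names('GeronimoStilton'))
print(split_names('AndrewPeterson-editor,JonathanRogers,N.D.Wilson,'))
```

Role suffixes such as "-editor" are left in place here; whether to strip them depends on how the author field will be used.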

# Consistency

In [87]:
audible['language'] = audible['language'].str.capitalize()

In [89]:
audible.drop(columns=['n_ratings'], inplace=True)

In [98]:
audible['rating_indicator'] = np.where(audible['rating_stars'].isnull(), 0, 1)

In [101]:
audible.to_pickle('audible_cleaned.pkl')