In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## Task 1: Loading and Inspecting the Data

We will be working with a dataset of audiobooks listed on audible.in, with release dates ranging from 1998 to 2025 (the later dates are pre-planned releases). [Source](https://www.kaggle.com/datasets/snehangsude/audible-dataset)

The first thing we will do is load the raw audible data.

### Instructions:
* Using pandas, read the `audible_raw.csv` file from the working directory. Assign to `audible`.
* Show the first few rows of the `audible` data frame.

In [2]:
# Load the audible_raw.csv file
audible = pd.read_csv('audible_raw.csv')
# View the first rows of the dataframe
audible.head()

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,Writtenby:GeronimoStilton,Narratedby:BillLobely,2 hrs and 20 mins,04-08-08,English,5 out of 5 stars34 ratings,468.0
1,The Burning Maze,Writtenby:RickRiordan,Narratedby:RobbieDaymond,13 hrs and 8 mins,01-05-18,English,4.5 out of 5 stars41 ratings,820.0
2,The Deep End,Writtenby:JeffKinney,Narratedby:DanRussell,2 hrs and 3 mins,06-11-20,English,4.5 out of 5 stars38 ratings,410.0
3,Daughter of the Deep,Writtenby:RickRiordan,Narratedby:SoneelaNankani,11 hrs and 16 mins,05-10-21,English,4.5 out of 5 stars12 ratings,615.0
4,"The Lightning Thief: Percy Jackson, Book 1",Writtenby:RickRiordan,Narratedby:JesseBernstein,10 hrs,13-01-10,English,4.5 out of 5 stars181 ratings,820.0


### 💾 The data

- "name" - The name of the audiobook.
- "author" - The audiobook's author.
- "narrator" - The audiobook's narrator.
- "time" -  The audiobook's duration, in hours and minutes.
- "releasedate" -  The date the audiobook was published.
- "language" -  The audiobook's language.
- "stars" -  The average number of stars (out of 5) and the number of ratings (if available).
- "price" -  The audiobook's price in INR (Indian Rupee).

We can use the `.info()` method to inspect the data types of the columns:

In [3]:
# Inspect the columns' data types
audible.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         87489 non-null  object
 1   author       87489 non-null  object
 2   narrator     87489 non-null  object
 3   time         87489 non-null  object
 4   releasedate  87489 non-null  object
 5   language     87489 non-null  object
 6   stars        87489 non-null  object
 7   price        87489 non-null  object
dtypes: object(8)
memory usage: 5.3+ MB


## Task 2: Clean text data in Author and Narrator columns

We will start cleaning some of the text columns like `author` and `narrator`. We can remove the `Writtenby:` and `Narratedby:` portions of the text in those columns.

For this, we will use the `.str.replace()` method

### Instructions:
* Remove 'Writtenby:' from the `author` column
* Remove 'Narratedby:' from the `narrator` column
* Check the results

In [4]:
# Remove Writtenby: from the author column
audible['author'] = audible['author'].str.replace('Writtenby:', '')
# Remove Narratedby: from the narrator column
audible['narrator'] = audible['narrator'].str.replace('Narratedby:', '')
# Check the results
audible.head(10)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,GeronimoStilton,BillLobely,2 hrs and 20 mins,04-08-08,English,5 out of 5 stars34 ratings,468.0
1,The Burning Maze,RickRiordan,RobbieDaymond,13 hrs and 8 mins,01-05-18,English,4.5 out of 5 stars41 ratings,820.0
2,The Deep End,JeffKinney,DanRussell,2 hrs and 3 mins,06-11-20,English,4.5 out of 5 stars38 ratings,410.0
3,Daughter of the Deep,RickRiordan,SoneelaNankani,11 hrs and 16 mins,05-10-21,English,4.5 out of 5 stars12 ratings,615.0
4,"The Lightning Thief: Percy Jackson, Book 1",RickRiordan,JesseBernstein,10 hrs,13-01-10,English,4.5 out of 5 stars181 ratings,820.0
5,The Hunger Games: Special Edition,SuzanneCollins,TatianaMaslany,10 hrs and 35 mins,30-10-18,English,5 out of 5 stars72 ratings,656.0
6,Quest for the Diamond Sword,WinterMorgan,LukeDaniels,2 hrs and 23 mins,25-11-14,English,5 out of 5 stars11 ratings,233.0
7,The Dark Prophecy,RickRiordan,RobbieDaymond,12 hrs and 32 mins,02-05-17,English,5 out of 5 stars50 ratings,820.0
8,Merlin Mission Collection,MaryPopeOsborne,MaryPopeOsborne,10 hrs and 56 mins,02-05-17,English,5 out of 5 stars5 ratings,1256.0
9,The Tyrant’s Tomb,RickRiordan,RobbieDaymond,13 hrs and 22 mins,24-09-19,English,5 out of 5 stars58 ratings,820.0
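
The prefix removal above can be tried on a small hand-made frame. This is a minimal sketch, assuming pandas 1.4 or later for `str.removeprefix`; the `sample` frame is invented for illustration. Passing `regex=False` to `str.replace` makes the literal match explicit, since the default for that argument has differed between pandas versions.

```python
import pandas as pd

# Hypothetical two-row sample mimicking the raw author/narrator format
sample = pd.DataFrame({
    "author": ["Writtenby:RickRiordan", "Writtenby:JeffKinney"],
    "narrator": ["Narratedby:RobbieDaymond", "Narratedby:DanRussell"],
})

# Literal (non-regex) replacement, same effect as the cell above
sample["author"] = sample["author"].str.replace("Writtenby:", "", regex=False)
# removeprefix strips the text only when it appears at the start of the string
sample["narrator"] = sample["narrator"].str.removeprefix("Narratedby:")

print(sample)
```

`str.removeprefix` is the stricter of the two: it would leave a value untouched if `Narratedby:` appeared anywhere other than the beginning.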


## Task 3: Extract number of stars and ratings from Stars column.

The `stars` column combines the number of stars and the number of ratings. Let's turn this into numbers and split it into two columns: `rating_stars` and `n_rating`.

First we will use the `.sample()` method to get a glimpse at the type of entries in that column.

In [5]:
# Get a glimpse of the stars column
audible.stars.sample(n = 10)

62171               Not rated yet
49402               Not rated yet
6207                Not rated yet
81982               Not rated yet
18397               Not rated yet
20443               Not rated yet
60283               Not rated yet
19393               Not rated yet
61252               Not rated yet
30244    5 out of 5 stars1 rating
Name: stars, dtype: object

Since there are many instances of `Not rated yet`, let's filter them out and sample again:

In [6]:
# Explore the values of the star column that are not 'Not rated yet'
audible[audible.stars != 'Not rated yet'].sample(n =10)

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
32976,The Open-Focus Brain,"LesFehmi,JimRobbins",ArthurMorey,6 hrs and 10 mins,30-07-19,English,3 out of 5 stars1 rating,837.00
28135,Gandhi CEO,AlanAxelrod,DonHagen,6 hrs and 12 mins,17-05-10,English,5 out of 5 stars1 rating,820.00
83655,You'll Be the Death of Me,KarenM.McManus,"RachelL.Jacobs,AnthonyReyPerez,MaxMeyers,",9 hrs and 51 mins,02-12-21,English,4.5 out of 5 stars3 ratings,615.00
38941,Shattered Sword,"JonathanParshall,AnthonyTully",TomPerkins,24 hrs and 44 mins,16-04-19,English,5 out of 5 stars1 rating,820.00
71456,Tell Me Your Dreams,ShanayaTaneja,IrisVelvet,25 mins,02-06-20,English,3 out of 5 stars10 ratings,Free
86753,The Road to Oxiana,RobertByron,BarnabyEdwards,11 hrs and 46 mins,13-05-19,English,5 out of 5 stars1 rating,727.00
60837,The Incredible Banker,RaviSubramanian,SartajGarewal,8 hrs and 33 mins,01-01-17,English,4.5 out of 5 stars34 ratings,284.00
18575,The Private World of Georgette Heyer,JaneAikenHodge,PhyllidaNash,7 hrs and 2 mins,10-12-12,English,4 out of 5 stars1 rating,607.00
28485,How to Sell Your Way Through Life,NapoleonHill,A.C.Fellner,10 hrs and 54 mins,09-07-20,English,5 out of 5 stars1 rating,836.00
39789,On a Knife’s Edge,PritButtar,RogerClark,22 hrs and 7 mins,29-10-19,English,5 out of 5 stars1 rating,820.00


As a first step, we can replace the instances of `Not rated yet` with `NaN`

In [7]:
# Replace 'Not rated yet' with NaN
audible['stars'] = audible['stars'].replace('Not rated yet', np.nan)

We can use `.str.extract()` to get the number of stars and the number of ratings into their own columns.

### Instructions:
* Extract the number of stars into the `rating_stars` column
* Extract the number of ratings into the `n_rating` column
* Convert `rating_stars` to float (`n_rating` is left as text in this cell)

In [8]:
# Extract number of stars into rating_stars and turn into float
audible['rating_stars'] = audible['stars'].str.extract(r'^([\d.]+)', expand=False).astype(float)
# Remove thousands separators, then extract the number of ratings into n_rating.
# This pattern only matches the plural "ratings", so entries such as "1 rating" become NaN.
audible['n_rating'] = audible['stars'].str.replace(',', '').str.extract(r'(\d+) ratings', expand=False)

# Examine the new rating_stars and n_rating columns
audible[['rating_stars', 'n_rating']]

Unnamed: 0,rating_stars,n_rating
0,5.0,34
1,4.5,41
2,4.5,38
3,4.5,12
4,4.5,181
...,...,...
87484,,
87485,,
87486,,
87487,,
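
As a variation on the extraction step, the sketch below pulls both numbers out with a single pattern that also accepts the singular `1 rating` and converts both results to float. It is a sketch on made-up values, not the notebook's own cell; the `stars` series and the named groups are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values covering plural, thousands separator, singular, and unrated
stars = pd.Series([
    "5 out of 5 stars34 ratings",
    "4.5 out of 5 stars1,181 ratings",
    "5 out of 5 stars1 rating",
    "Not rated yet",
])

stars = stars.replace("Not rated yet", np.nan)

# Named groups become column names; "ratings?" matches both "rating" and "ratings"
parsed = stars.str.replace(",", "", regex=False).str.extract(
    r"^(?P<rating_stars>[\d.]+) out of 5 stars(?P<n_rating>\d+) ratings?$"
).astype(float)

print(parsed)
```

With this pattern, a row that has a star value also has a rating count, so the two columns end up with the same number of non-null entries.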


As a last step, let's delete the `stars` column using the `.drop()` method:

In [9]:
# Drop the stars column
audible.drop(columns=['stars'], inplace=True)
# Check the results
audible.head()

Unnamed: 0,name,author,narrator,time,releasedate,language,price,rating_stars,n_rating
0,Geronimo Stilton #11 & #12,GeronimoStilton,BillLobely,2 hrs and 20 mins,04-08-08,English,468.0,5.0,34
1,The Burning Maze,RickRiordan,RobbieDaymond,13 hrs and 8 mins,01-05-18,English,820.0,4.5,41
2,The Deep End,JeffKinney,DanRussell,2 hrs and 3 mins,06-11-20,English,410.0,4.5,38
3,Daughter of the Deep,RickRiordan,SoneelaNankani,11 hrs and 16 mins,05-10-21,English,615.0,4.5,12
4,"The Lightning Thief: Percy Jackson, Book 1",RickRiordan,JesseBernstein,10 hrs,13-01-10,English,820.0,4.5,181


## Task 4: Change data types

Another important step is to store each column in the correct data type.

### Instructions:
* Set `price` to float
* Turn `rating_stars` to category
* Convert `releasedate` to datetime

In [10]:
# Explore the price column
audible.price.head(10)

0      468.00
1      820.00
2      410.00
3      615.00
4      820.00
5      656.00
6      233.00
7      820.00
8    1,256.00
9      820.00
Name: price, dtype: object

We need to get rid of the comma and the word 'Free' before we can convert the data.

In [11]:
# Replace the comma with ''
audible['price'] = audible['price'].str.replace(',', '')
# Replace 'Free' with 0
audible['price'] = audible['price'].apply(lambda x: 0.0 if x == 'Free' else x)
# Turn price to float
audible['price'] = audible.price.astype(float)
#audible.price.dtype
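
The same price cleanup can be expressed without `apply`. This is a minimal sketch on invented values, assuming every entry is either a number with optional thousands separators or the word `Free`.

```python
import pandas as pd

# Hypothetical raw price strings
prices = pd.Series(["468.00", "1,256.00", "Free"])

# Strip the thousands separator, map 'Free' to '0', then parse as numbers
clean = prices.str.replace(",", "", regex=False).replace("Free", "0")
clean = pd.to_numeric(clean)

print(clean)
```

`pd.to_numeric` raises on any value it cannot parse, which helps surface unexpected entries early.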

* Turn `rating_stars` to category

Since `rating_stars` can only take a small number of discrete values, the best data type for the column is `category`.

Let's first look at the unique values in that column to confirm:

In [12]:
# Look at the unique values in the rating_stars column
audible.rating_stars.unique()

array([5. , 4.5, 4. , nan, 3.5, 3. , 1. , 2. , 2.5, 1.5])

We can now use `.astype` to change the data type.

In [13]:
# Turn rating_stars to category
audible['rating_stars'] = audible.rating_stars.astype('category')

audible.rating_stars.dtype

CategoricalDtype(categories=[1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0], ordered=False)

* Convert `releasedate` to datetime

Here we will use the `to_datetime()` function to turn the dates into datetime objects:

In [14]:
# Convert releasedate to datetime; the raw values are day-month-year with a two-digit year (e.g. 13-01-10)
audible['releasedate'] = pd.to_datetime(audible.releasedate, format='%d-%m-%y')
# Inspect the dataframe 
audible.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   name          87489 non-null  object        
 1   author        87489 non-null  object        
 2   narrator      87489 non-null  object        
 3   time          87489 non-null  object        
 4   releasedate   87489 non-null  datetime64[ns]
 5   language      87489 non-null  object        
 6   price         87489 non-null  float64       
 7   rating_stars  15072 non-null  category      
 8   n_rating      9250 non-null   object        
dtypes: category(1), datetime64[ns](1), float64(1), object(6)
memory usage: 5.4+ MB
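
The raw dates shown earlier (for example `13-01-10` and `30-10-18`) put the day first, so it is worth checking how they are parsed. A minimal sketch on three of those values, assuming every row follows the same day-month-two-digit-year layout:

```python
import pandas as pd

# Three raw values taken from the first rows of the dataset
raw = pd.Series(["04-08-08", "13-01-10", "30-10-18"])

# An explicit format avoids any month/day ambiguity
parsed = pd.to_datetime(raw, format="%d-%m-%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Here `04-08-08` is read as 4 August 2008 rather than 8 April 2008.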


## Task 5: Extract hours and minutes from the `time` column

The `time` column combines the number of hours and minutes into one column. We want to transform and consolidate the information into a new `time_in_minutes` column.

In [15]:
# Explore the values in the time column
audible.time.sample(n = 10)

22443     6 hrs and 54 mins
51476      8 hrs and 3 mins
38759     11 hrs and 4 mins
12373      1 hr and 20 mins
52645     2 hrs and 37 mins
9937      2 hrs and 58 mins
17244    11 hrs and 51 mins
61926     4 hrs and 33 mins
2338                33 mins
67636     6 hrs and 36 mins
Name: time, dtype: object

Let's see what other ways they have encoded `min` or `minutes`:

In [16]:
# Search the entries in the time column for different spellings of min. Let's try min, mins, minutes
print(audible.time[audible.time.str.contains('minute')].sample(n = 10))
#print(audible.time[audible.time.str.contains('hr')].sample(n = 10).head(10))


7170     Less than 1 minute
87058    Less than 1 minute
87100    Less than 1 minute
67108    Less than 1 minute
6243     Less than 1 minute
87059    Less than 1 minute
87148    Less than 1 minute
10895    Less than 1 minute
55648    Less than 1 minute
69297    Less than 1 minute
Name: time, dtype: object


We can see that we need to fix the following:
* hr, hrs -> consolidate as `hr`
* min, mins -> consolidate as `min`
* Less than 1 minute -> round to 1 min

In [17]:
# Replace hrs, mins, and 'Less than 1 minute'
audible['time'] = audible['time'].str.replace('hrs', 'hr')
audible['time'] = audible['time'].str.replace('mins', 'min')
audible['time'] = audible['time'].str.replace('Less than 1 minute', '1 min')

Let's see how it looks now:

In [18]:
# Check the results
audible[['time']].head()

Unnamed: 0,time
0,2 hr and 20 min
1,13 hr and 8 min
2,2 hr and 3 min
3,11 hr and 16 min
4,10 hr


The next step is to extract the number of hours and minutes from the text, then combine them in a new `time_in_minutes` column.

### Instructions: 
* Extract the number of hours from `time`. Assign to the `hours` variable.
* Extract the number of minutes from `time`. Assign to the `mins` variable.
* Create the `time_in_minutes` column combining hours and minutes.

In [24]:
# Extract the number of hours, turn to integer
hours = audible['time'].str.extract(r'(\d+) hr', expand=False).fillna(0).astype(int)
# Extract the number of minutes, turn to integer
mins = audible['time'].str.extract(r'(\d+) min', expand=False).fillna(0).astype(int)
# Combine hours and minutes into the time_in_minutes column
audible['time_in_minutes'] = (hours * 60) + mins
# Check the results
audible[['time_in_minutes']].head()

Unnamed: 0,time_in_minutes
0,140
1,788
2,123
3,676
4,600
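
To check the edge cases (hours only, minutes only, singular `hr`, and `Less than 1 minute`), the same steps can be run on a few hand-picked strings. This is a sketch on made-up input; it uses `pd.to_numeric` before `fillna`, which is an assumption chosen here so the conversion does not depend on how pandas stores the extracted text.

```python
import pandas as pd

# Hypothetical durations covering the formats seen in the time column
time = pd.Series([
    "2 hrs and 20 mins",
    "10 hrs",
    "33 mins",
    "1 hr and 20 mins",
    "Less than 1 minute",
])

# Normalise the unit spellings
time = (
    time.str.replace("hrs", "hr", regex=False)
        .str.replace("mins", "min", regex=False)
        .str.replace("Less than 1 minute", "1 min", regex=False)
)

# Missing parts (no hours or no minutes) become 0
hours = pd.to_numeric(time.str.extract(r"(\d+) hr", expand=False)).fillna(0).astype(int)
mins = pd.to_numeric(time.str.extract(r"(\d+) min", expand=False)).fillna(0).astype(int)

time_in_minutes = hours * 60 + mins
print(time_in_minutes.tolist())
```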


And as a final step, let's delete the column we no longer need:

In [26]:
# Drop the time column
audible.drop(columns=['time'], inplace=True)

Here is how our dataframe looks now:

In [27]:
# Inspect the dataframe 
audible.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   name             87489 non-null  object        
 1   author           87489 non-null  object        
 2   narrator         87489 non-null  object        
 3   releasedate      87489 non-null  datetime64[ns]
 4   language         87489 non-null  object        
 5   price            87489 non-null  float64       
 6   rating_stars     15072 non-null  category      
 7   n_rating         9250 non-null   object        
 8   time_in_minutes  87489 non-null  int32         
dtypes: category(1), datetime64[ns](1), float64(1), int32(1), object(5)
memory usage: 5.1+ MB


## Task 6: Check data ranges

Another important step is to confirm that the values in our columns are in the expected ranges and that we don't have out-of-range values.

Let's create a histogram of the numeric columns to visually inspect the ranges and the shape of the distribution:

In [None]:
# Plot histograms of all the numerical columns
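
One way to fill in this cell is `DataFrame.hist()`, which draws one histogram per numeric column. The sketch below runs on an invented four-row frame; the `Agg` backend line is an assumption that lets it run without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with the two numeric columns built so far
sample = pd.DataFrame({
    "price": [468.0, 820.0, 0.0, 1256.0],
    "time_in_minutes": [140, 788, 1, 656],
})

# One subplot per numeric column
axes = sample.hist(figsize=(8, 4), bins=10)
plt.tight_layout()
```

On the full data frame the call would be `audible.hist(...)` followed by `plt.show()`.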


Additionally, we can use `.describe()` to look at a summary of our data

In [None]:
# Look at the numeric columns
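
For the numeric summary, `.describe()` with no arguments reports count, mean, standard deviation, quartiles, minimum and maximum for each numeric column. A minimal sketch on invented values, with a simple range check added as an illustration:

```python
import pandas as pd

# Hypothetical sample of the numeric columns
sample = pd.DataFrame({
    "price": [468.0, 820.0, 0.0, 1256.0],
    "time_in_minutes": [140, 788, 1, 656],
})

summary = sample.describe()
print(summary.loc[["min", "max"]])

# Example range check: prices and durations should never be negative
has_negative = bool((sample < 0).any().any())
print(has_negative)
```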


In [None]:
# Look at the non numeric columns
audible.describe(exclude=[np.number])

We will transform the prices in `price` to USD for this exercise. We can use the exchange rate of 1 INR = 0.012 USD:

In [None]:
# Transform prices to USD (multiply times 0.012)

# Check the results
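
The conversion is a single multiplication by the number of dollars per rupee. A minimal sketch on invented prices; `INR_TO_USD` is a name introduced here, and rounding to two decimals is an assumption rather than something the exercise requires.

```python
import pandas as pd

INR_TO_USD = 0.012  # rate quoted in the text, treated as fixed for this exercise

# Hypothetical prices in INR
prices_inr = pd.Series([468.0, 820.0, 0.0, 1256.0], name="price")

# Multiply by the rate and round to cents
prices_usd = (prices_inr * INR_TO_USD).round(2)
print(prices_usd.tolist())
```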


The values in the `language` column have different capitalization. Let's fix that.

In [None]:
# Inspect the language column before making changes


In [None]:
# Update capitalization in the language column

# Check the results
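
A sketch of the inspect-then-fix pattern for the two cells above, using made-up language values. `str.capitalize()` is one reasonable choice (first letter upper case, the rest lower case); `str.title()` or `str.lower()` would work equally well as long as one convention is applied throughout.

```python
import pandas as pd

# Hypothetical values with inconsistent capitalization
languages = pd.Series(["English", "english", "Hindi", "spanish", "mandarin_chinese"])

# Inspect before making changes
print(languages.unique())

# Normalise: first character upper case, everything else lower case
languages = languages.str.capitalize()

# Check the results
print(languages.unique())
```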


## Task 7: Checking for duplicates

How many duplicates do we have?

As a first step look for duplicates using `.duplicated()` and `.sum()`:

In [None]:
# Look for duplicate rows
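
`.duplicated()` marks every row that is an exact repeat of an earlier row, and `.sum()` counts the `True` values. A minimal sketch on an invented frame:

```python
import pandas as pd

# Hypothetical sample where the second row repeats the first exactly
sample = pd.DataFrame({
    "name": ["Book A", "Book A", "Book B"],
    "author": ["X", "X", "Y"],
    "price": [100.0, 100.0, 200.0],
})

# The first occurrence is not flagged, only the repeats
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```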


It is useful to look for duplicates only using a subset of the columns that make sense. We will use the following subset of columns:
* name
* author
* narrator
* time_in_minutes
* price

Here we use `.duplicated()` again, but with our subset of columns.

In [None]:
# Create a list of our subset columns and assign to subset_cols

# Check for duplicates using our subset of columns


Let's look at those values (use `keep=False`) and see what is going on:

In [None]:
# Check the duplicated rows keeping the duplicates and order by the name column
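
The subset check and the `keep=False` view can be sketched together on a small invented frame. The duration column is named `time_in_minutes` here, matching the column created in Task 5; all values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: "Book A" appears twice with different release dates
sample = pd.DataFrame({
    "name": ["Book B", "Book A", "Book A", "Book C"],
    "author": ["Y", "X", "X", "Z"],
    "narrator": ["N2", "N1", "N1", "N3"],
    "time_in_minutes": [200, 120, 120, 90],
    "price": [5.0, 9.84, 9.84, 0.0],
    "releasedate": pd.to_datetime(["2019-01-01", "2020-06-01", "2018-03-15", "2021-01-01"]),
})

subset_cols = ["name", "author", "narrator", "time_in_minutes", "price"]

# Count rows that repeat an earlier row on the subset columns
n_subset_duplicates = int(sample.duplicated(subset=subset_cols).sum())

# keep=False flags every member of a duplicate group, not just the repeats
dupes = sample[sample.duplicated(subset=subset_cols, keep=False)].sort_values("name")
print(n_subset_duplicates)
print(dupes)
```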


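We can see that the duplicates are audiobooks listed with different release dates. We can decide to keep the record with the last release date.

We can use the `drop_duplicates()` method with the same subset and `keep='last'`. Note that `keep='last'` keeps the row that appears last in the data frame, so it retains the latest release date only if the rows are ordered by `releasedate`.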

In [None]:
# Drop duplicated rows keeping the last release date


We can check again for duplicates:

In [None]:
# Check again for duplicates using our subset of columns
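
A sketch of the drop-and-recheck steps on an invented frame. Sorting by `releasedate` first is an assumption added here so that `keep='last'` reliably retains the most recent release; `sort_index()` then restores the original row order.

```python
import pandas as pd

# Hypothetical sample: the later "Book A" release appears before the earlier one
sample = pd.DataFrame({
    "name": ["Book B", "Book A", "Book A", "Book C"],
    "author": ["Y", "X", "X", "Z"],
    "narrator": ["N2", "N1", "N1", "N3"],
    "time_in_minutes": [200, 120, 120, 90],
    "price": [5.0, 9.84, 9.84, 0.0],
    "releasedate": pd.to_datetime(["2019-01-01", "2020-06-01", "2018-03-15", "2021-01-01"]),
})
subset_cols = ["name", "author", "narrator", "time_in_minutes", "price"]

# Sort by date so the last row in each duplicate group is the latest release
deduped = (
    sample.sort_values("releasedate")
          .drop_duplicates(subset=subset_cols, keep="last")
          .sort_index()
)

# Check again for duplicates on the same subset
remaining = int(deduped.duplicated(subset=subset_cols).sum())
print(deduped)
print(remaining)
```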


## Task 8: Dealing with missing data

Before we finish, let's take a look at missing data in our columns. We can use the `.isna()` method and chain it with `.sum()` to get the total:

In [None]:
# Check for null values
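
`.isna()` returns a boolean frame and `.sum()` counts the `True` values per column. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample where two audiobooks have no ratings
sample = pd.DataFrame({
    "price": [468.0, 0.0, 820.0],
    "rating_stars": [5.0, np.nan, np.nan],
    "n_rating": [34.0, np.nan, np.nan],
})

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)
```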


We could turn the **NaN** values to 0 or another numeric value, or we could keep them. It depends on our use case.

If we want to plot the ratings distribution, it can make sense to drop audiobooks with no ratings. But if we need to use the distribution of prices for our analysis, then removing audiobooks with no ratings will bias our results (since unrated audiobooks are likely more niche and might have a different pricing structure than rated audiobooks).

We will keep the unrated audiobooks for now.

## Task 9: Save the cleaned data set

We can use the `.to_csv()` method to save the clean file. We include `index=False` so that the index is not written to the destination file as an extra column.

In [None]:
# Save the dataframe to a new file: 'audible_clean.csv'
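
A sketch of the save step, written against a temporary directory so it can be run anywhere. The two-row frame is invented; in the notebook the call would be made on `audible` with the desired output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned sample
sample = pd.DataFrame({"name": ["Book A", "Book B"], "price": [9.84, 0.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "audible_clean.csv"
    # index=False leaves the row index out of the file
    sample.to_csv(out_path, index=False)
    # Read the file back to confirm only the data columns were written
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Without `index=False`, reading the file back would produce an extra `Unnamed: 0` column like the one visible in the raw data at the top of this notebook.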
