# 🎧 Audible Dataset Cleaning Notebook

This notebook documents the **step-by-step cleaning process** applied to the `audible_uncleaned.csv` dataset to produce the cleaned `audible_cleaned.csv`. It covers removing duplicates, checking for missing values, parsing fields, and normalizing formats.

- Raw dataset shape: **87489 rows × 8 columns**
- Cleaned dataset shape: **87489 rows × 9 columns**

In [None]:

import pandas as pd
import numpy as np

raw_df = pd.read_csv("data/raw/audible_uncleaned.csv")
cleaned_df = pd.read_csv("data/processed/audible_cleaned.csv")

print("Raw dataset shape:", raw_df.shape)
print("Cleaned dataset shape:", cleaned_df.shape)
raw_df.head()


## Step 1: Inspect Raw Dataset

In [None]:

raw_df.info()
raw_df.describe(include='all').T.head(15)


## Step 2: Remove Duplicates

In [None]:

raw_df = raw_df.drop_duplicates()
print("After removing duplicates:", raw_df.shape)
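
Counting the duplicates before dropping them makes the effect of this step easier to audit. The following is a minimal sketch on a toy frame; the column names and values are hypothetical stand-ins, not taken from the Audible data.

```python
import pandas as pd

# Hypothetical toy frame standing in for raw_df (illustrative columns only).
toy = pd.DataFrame({
    "name": ["Book A", "Book A", "Book B"],
    "author": ["X", "X", "Y"],
})

# duplicated() flags each row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and rebuild a contiguous index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape after dropping:", deduped.shape)
```

If the reported count is zero, the shape printed after `drop_duplicates()` will match the original shape.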


## Step 3: Handle Missing Values

In [None]:

raw_df.isnull().sum()
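
The cell above only counts nulls per column. One possible way to act on those counts is sketched below, under an assumed policy: drop rows that lack a title, fill missing text fields with a placeholder, and leave numeric gaps as `NaN`. The column names (`name`, `narrator`, `stars`) and the policy itself are assumptions for illustration, not the notebook's documented choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are assumptions for illustration.
toy = pd.DataFrame({
    "name": ["Book A", None, "Book C"],
    "narrator": ["N1", "N2", None],
    "stars": [4.5, np.nan, 3.0],
})

# Per-column null counts, as in the cell above.
missing_counts = toy.isnull().sum()

# Assumed policy: a row without a title is unusable, so drop it.
handled = toy.dropna(subset=["name"]).copy()

# Assumed policy: fill missing text with an explicit placeholder.
handled["narrator"] = handled["narrator"].fillna("Unknown")

print(missing_counts)
print(handled)
```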


## Step 4: Parse Price Column

In [None]:

# Strip currency symbols and thousands separators, then convert to float.
# The raw string (r'...') avoids an invalid-escape warning for \$.
raw_df['Price'] = raw_df['Price'].replace(r'[\$,]', '', regex=True).astype(float)
raw_df['Price'].head()
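
`astype(float)` raises an error if any value is still non-numeric after the symbols are stripped. A more tolerant variant, assuming the column may contain labels such as `"Free"` (an assumption, not confirmed by this notebook), is to coerce unparseable entries to `NaN` with `pd.to_numeric`. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample values; "Free" is an assumed non-numeric label.
prices = pd.Series(["$1,234.50", "468.00", "Free", None])

# Strip currency symbols and thousands separators.
stripped = prices.str.replace(r"[\$,]", "", regex=True)

# Convert to float; anything that cannot be parsed becomes NaN.
parsed = pd.to_numeric(stripped, errors="coerce")

print(parsed)
print("Unparseable or missing prices:", int(parsed.isna().sum()))
```

Checking the `NaN` count afterwards shows how many entries need a follow-up decision, for example mapping a free title to `0.0`.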


## Step 6: Save Cleaned Dataset

In [None]:

raw_df.to_csv("data/processed/audible_cleaned_generated.csv", index=False)
print("Saved cleaned dataset to data/processed/audible_cleaned_generated.csv")
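
Because the generated file is written next to the reference `audible_cleaned.csv`, a quick comparison can show whether the two agree in rows and columns. The sketch below uses small hypothetical frames and an in-memory buffer rather than the real files; in the notebook, `reloaded` and `reference` would correspond to the generated CSV and `cleaned_df`.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the generated output and the reference file.
generated = pd.DataFrame({"name": ["Book A", "Book B"], "Price": [468.0, 820.0]})
reference = pd.DataFrame(
    {"name": ["Book A", "Book B"], "Price": [468.0, 820.0], "extra": [1, 2]}
)

# Round-trip through CSV in memory, mirroring to_csv(..., index=False).
buffer = io.StringIO()
generated.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Columns present in the reference but absent from the generated output.
missing_columns = sorted(set(reference.columns) - set(reloaded.columns))
same_row_count = len(reloaded) == len(reference)

print("Columns missing from generated output:", missing_columns)
print("Row counts match:", same_row_count)
```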
