# Data Cleaning and Preprocessing

This notebook loads, cleans, and preprocesses the Spotify streaming and listener preferences datasets so they are ready for further analysis.

## Objectives
- Load the raw datasets
- Explore the data structure and identify issues
- Clean and preprocess the data
- Save the cleaned datasets for further analysis

## 1. Setup and Imports

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Add the project root to the path so the src package can be imported
sys.path.append('..')
from src import data_loader

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Set plotting style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Load the Raw Datasets

In [None]:
# Load Spotify streaming data
spotify_df = data_loader.load_spotify_data()

# Load listener preferences data
listener_df = data_loader.load_listener_preferences()

print(f"Spotify dataset shape: {spotify_df.shape}")
print(f"Listener preferences dataset shape: {listener_df.shape}")

## 3. Explore the Spotify Streaming Dataset

In [None]:
# Display the first few rows
spotify_df.head()

In [None]:
# Check column names and data types
spotify_df.info()

In [None]:
# Check for missing values
spotify_missing = spotify_df.isnull().sum()
print("Columns with missing values:")
print(spotify_missing[spotify_missing > 0])
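
Raw counts are easier to judge next to the share of rows they represent. The sketch below is a minimal illustration on toy data; `missing_report` and the column names are hypothetical and not part of `data_loader`.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have gaps."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for spotify_df (column names are illustrative only)
toy = pd.DataFrame({
    "Track": ["a", "b", "c", "d"],
    "Streams": [100.0, np.nan, 300.0, np.nan],
    "Artist": ["x", None, "y", "z"],
})
print(missing_report(toy))
```

The same helper could be applied to `spotify_df` or `listener_df` to decide whether a column is worth imputing or should be dropped.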

In [None]:
# Check for duplicate rows
print(f"Number of duplicate rows: {spotify_df.duplicated().sum()}")
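
`duplicated()` with no arguments only counts rows that match an earlier row in every column. As a hedged sketch on toy data (the column names are assumptions, not the real schema), the following shows two further views that can help: listing every copy in a duplicate group, and checking for duplicates on a subset of key columns.

```python
import pandas as pd

# Toy data; the column names are illustrative, not taken from the real dataset
toy = pd.DataFrame({
    "Track": ["Song A", "Song A", "Song B", "Song A"],
    "Artist": ["X", "X", "Y", "X"],
    "Streams": [10, 10, 20, 15],
})

# Exact duplicates: every column matches an earlier row
exact_dupes = toy.duplicated().sum()

# keep=False flags every member of a duplicate group, which is handy for inspection
all_copies = toy[toy.duplicated(keep=False)]

# Key-based duplicates: same track and artist, even if other columns differ
key_dupes = toy.duplicated(subset=["Track", "Artist"]).sum()

print(f"Exact duplicates: {exact_dupes}")
print(f"Rows involved in exact duplication: {len(all_copies)}")
print(f"Duplicates on (Track, Artist): {key_dupes}")
```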

In [None]:
# Basic statistics for numeric columns
spotify_df.describe()

## 4. Explore the Listener Preferences Dataset

In [None]:
# Display the first few rows
listener_df.head()

In [None]:
# Check column names and data types
listener_df.info()

In [None]:
# Check for missing values
listener_missing = listener_df.isnull().sum()
print("Columns with missing values:")
print(listener_missing[listener_missing > 0])

In [None]:
# Check for duplicate rows
print(f"Number of duplicate rows: {listener_df.duplicated().sum()}")

In [None]:
# Basic statistics for numeric columns
listener_df.describe()

## 5. Clean the Spotify Streaming Dataset

In [None]:
# Make a copy to avoid modifying the original
spotify_clean = spotify_df.copy()

# Remove duplicate rows if any
spotify_clean = spotify_clean.drop_duplicates()

# Check for string columns that should be numeric
# For example, some numeric columns might have quotes or thousands separators
object_cols = spotify_clean.select_dtypes(include=['object']).columns

for col in object_cols:
    try:
        # Strip quotes and commas on a temporary copy so that genuine text
        # columns (e.g. names containing commas) are not altered
        stripped = (
            spotify_clean[col]
            .str.replace('"', '', regex=False)
            .str.replace(',', '', regex=False)
        )
        converted = pd.to_numeric(stripped)
    except (ValueError, TypeError, AttributeError):
        # Not a numeric column: leave the original values untouched
        continue

    spotify_clean[col] = converted
    print(f"Converted {col} to numeric")
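
To see why the stripped values should only be written back when conversion succeeds, here is a small self-contained sketch on toy data. The helper `to_numeric_if_possible` and the column names are hypothetical; the point is that a text column containing commas should come out unchanged.

```python
import pandas as pd


def to_numeric_if_possible(series: pd.Series) -> pd.Series:
    """Strip quotes and commas, returning a numeric Series only if every value parses."""
    stripped = (
        series.str.replace('"', "", regex=False)
              .str.replace(",", "", regex=False)
    )
    try:
        return pd.to_numeric(stripped)
    except (ValueError, TypeError):
        # Not a numeric column: hand back the original, commas intact
        return series


# Toy data: one numeric-looking column and one genuine text column
toy = pd.DataFrame({
    "Streams": ["1,234,567", '"89,000"', "42"],
    "Artist": ["Tyler, The Creator", "Adele", "Drake"],
})
cleaned = toy.apply(to_numeric_if_possible)
print(cleaned.dtypes)
print(cleaned)
```

Stripping commas in place before attempting the conversion would have turned the first artist name into "Tyler The Creator" even though that column can never be numeric.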

In [None]:
# Handle missing values
spotify_clean = data_loader.clean_spotify_data(spotify_clean)

# Check if all missing values have been handled
print("Columns with missing values after cleaning:")
spotify_missing_after = spotify_clean.isnull().sum()
print(spotify_missing_after[spotify_missing_after > 0])
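
The internals of `data_loader.clean_spotify_data` are not shown in this notebook. Purely as an illustration of one common strategy, and not a description of what that function does, the sketch below fills numeric columns with the median and all other columns with a placeholder. The helper `fill_missing` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical imputation: median for numeric columns, 'Unknown' for the rest."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # Non-numeric columns are assumed here to hold text
            out[col] = out[col].fillna("Unknown")
    return out


toy = pd.DataFrame({
    "Streams": [100.0, np.nan, 300.0],
    "Artist": ["x", None, "y"],
})
filled = fill_missing(toy)
print(filled)
```

Median imputation is robust to the heavy right skew that stream counts often show, but whether it suits a given column is a judgment call that depends on how the column is used later.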

In [None]:
# Check the data types after cleaning
spotify_clean.info()

## 6. Clean the Listener Preferences Dataset

In [None]:
# Make a copy to avoid modifying the original
listener_clean = listener_df.copy()

# Remove duplicate rows if any
listener_clean = listener_clean.drop_duplicates()

# Handle missing values
listener_clean = data_loader.clean_listener_preferences(listener_clean)

# Check if all missing values have been handled
print("Columns with missing values after cleaning:")
listener_missing_after = listener_clean.isnull().sum()
print(listener_missing_after[listener_missing_after > 0])

In [None]:
# Check the data types after cleaning
listener_clean.info()

## 7. Additional Data Preprocessing

In [None]:
# Create age groups in the listener dataset
if 'Age' in listener_clean.columns:
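    # right=False makes each bin include its lower edge, so 18 falls in '18-24'
    # and 65 in '65+'; np.inf keeps ages of 100 and above in the last group
    bins = [0, 18, 25, 35, 45, 55, 65, np.inf]
    labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']
    listener_clean['Age Group'] = pd.cut(listener_clean['Age'], bins=bins, labels=labels, right=False)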
    
    # Check the distribution of age groups
    print("Age group distribution:")
    print(listener_clean['Age Group'].value_counts().sort_index())
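
Bin edges deserve a quick check, because `pd.cut` closes each interval on the right by default. The sketch below uses made-up ages to compare the default with `right=False`, which is the setting whose intervals match labels such as '18-24'.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on the bin edges
ages = pd.Series([17, 18, 24, 25, 64, 65, 100])
bins = [0, 18, 25, 35, 45, 55, 65, np.inf]
labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Default (right=True): intervals are (0, 18], (18, 25], ... so 18 is labelled '<18'
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 25), ... which matches the label text
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    "age": ages,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```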

In [None]:
# Create a release year column in the Spotify dataset if it has a release date
if 'Release Date' in spotify_clean.columns:
    # Check if Release Date is already a datetime
    if pd.api.types.is_datetime64_any_dtype(spotify_clean['Release Date']):
        spotify_clean['Release Year'] = spotify_clean['Release Date'].dt.year
    else:
        # Try to convert to datetime
        try:
            spotify_clean['Release Date'] = pd.to_datetime(spotify_clean['Release Date'])
            spotify_clean['Release Year'] = spotify_clean['Release Date'].dt.year
        except (ValueError, TypeError):
            print("Could not convert Release Date to datetime")
    
    # Check the distribution of release years
    if 'Release Year' in spotify_clean.columns:
        print("Release year distribution:")
        print(spotify_clean['Release Year'].value_counts().sort_index())
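
If only a few release dates are malformed, an all-or-nothing conversion discards the whole column. One alternative, sketched here on made-up values, is to pass `errors="coerce"` so that unparseable entries become `NaT` and can be counted and reviewed separately.

```python
import pandas as pd

# Made-up values: two valid dates, one malformed entry, one missing entry
raw = pd.Series(["2021-03-05", "2019-11-22", "not a date", None])

# errors="coerce" turns anything unparseable into NaT rather than raising
parsed = pd.to_datetime(raw, errors="coerce")

# .dt.year yields floats when NaT is present; the nullable Int64 dtype keeps whole years
years = parsed.dt.year.astype("Int64")

print(f"Unparseable or missing dates: {parsed.isna().sum()}")
print(years)
```

The trade-off is that coercion hides bad input unless the `NaT` count is checked, so it is worth printing that count whenever this option is used.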

## 8. Save the Cleaned Datasets

In [None]:
# Create a processed data directory if it doesn't exist
processed_dir = os.path.join('..', 'data', 'processed')
os.makedirs(processed_dir, exist_ok=True)

# Save the cleaned Spotify dataset
spotify_clean_path = os.path.join(processed_dir, 'spotify_clean.csv')
spotify_clean.to_csv(spotify_clean_path, index=False)
print(f"Saved cleaned Spotify dataset to {spotify_clean_path}")

# Save the cleaned listener preferences dataset
listener_clean_path = os.path.join(processed_dir, 'listener_clean.csv')
listener_clean.to_csv(listener_clean_path, index=False)
print(f"Saved cleaned listener preferences dataset to {listener_clean_path}")
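
CSV stores only text, so datetime and categorical dtypes created during preprocessing are not preserved on disk. The following sketch uses a toy frame and a temporary file (both illustrative, not the real outputs) to show the dtypes after a plain reload and one way to restore them at read time.

```python
import os
import tempfile

import pandas as pd

# Toy frame combining a datetime column and a categorical column
df = pd.DataFrame({
    "Release Date": pd.to_datetime(["2021-03-05", "2019-11-22"]),
    "Age Group": pd.Categorical(["18-24", "25-34"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    df.to_csv(path, index=False)

    # A plain read loses the datetime and categorical dtypes
    plain = pd.read_csv(path)

    # Passing parse_dates and dtype restores them at load time
    restored = pd.read_csv(
        path,
        parse_dates=["Release Date"],
        dtype={"Age Group": "category"},
    )

print(plain.dtypes)
print(restored.dtypes)
```

Note that reloading with `dtype="category"` does not recover the category order produced by `pd.cut`, so a downstream notebook that relies on ordered age groups would need to set that order again.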

## 9. Summary and Next Steps

### Summary of Data Cleaning

In this notebook, we have:
1. Loaded the raw Spotify streaming and listener preferences datasets
2. Explored the data structure and identified issues
3. Cleaned and preprocessed the data, including:
   - Handling missing values
   - Converting data types
   - Removing duplicates
   - Creating derived features (age groups, release years)
4. Saved the cleaned datasets for further analysis

### Next Steps

The cleaned datasets are now ready for exploratory analysis in the next notebook:
- `02_exploratory_analysis.ipynb`: Descriptive statistics and simple visualizations

This will help us understand the distributions, patterns, and relationships in the data before proceeding to more advanced statistical analysis.