# CS156 Machine Learning Pipeline: Spotify Streaming History Analysis

## F. Declan
### CS156 - Fall 2025
### October 19, 2025

## Section 1: Data Explanation

This project analyzes my personal Spotify streaming history to uncover patterns and insights into my listening habits. The primary dataset was obtained directly from Spotify by requesting my extended streaming history through their privacy settings. This history encompasses all streaming activities from 2023 to early 2025, providing a rich source of personal data.

The data includes several JSON files (`Streaming_History_Audio_2023-2025_0.json`, etc.), each containing records of streamed tracks. Key features for each stream include:
- `ts`: Timestamp of the stream.
- `ms_played`: Duration the track was played in milliseconds.
- `master_metadata_track_name`: The name of the song.
- `master_metadata_album_artist_name`: The name of the artist.
- `spotify_track_uri`: A unique identifier for the track on Spotify.

This raw data forms the foundation of our machine learning pipeline. The goal is to process this data, extract meaningful features, and ultimately build a model to classify songs, offering a deeper understanding of my musical preferences.

## Section 2: Data Loading and Initial Processing

The first step in our pipeline is to convert the raw JSON data into a more manageable format. We'll combine the multiple JSON files into a single Pandas DataFrame, a tabular data structure well suited to data manipulation and analysis. Each JSON file is parsed, the records are concatenated into one DataFrame, and the result is cached as a CSV file so that later runs can skip the parsing step.

In [1]:
import pandas as pd
import os
import json
import glob

# Define paths
ingested_data_dir = '../Ingested_Data'
combined_csv_path = os.path.join(ingested_data_dir, 'combined_streaming_history.csv')
raw_data_pattern = '../Streaming_History_Audio_*.json'

# Create directory if it doesn't exist
os.makedirs(ingested_data_dir, exist_ok=True)

if os.path.exists(combined_csv_path):
    print(f"'{combined_csv_path}' already exists. Loading data from file.")
    df_combined = pd.read_csv(combined_csv_path)
else:
    print(f"'{combined_csv_path}' not found. Generating from raw JSON files.")
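    # Sort the file list so the combined row order is reproducible
    json_files = sorted(glob.glob(raw_data_pattern))
    all_data = []
    for file in json_files:
        # Read as UTF-8 explicitly so track names decode the same on every platform
        with open(file, 'r', encoding='utf-8') as f:
            all_data.extend(json.load(f))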
    
    df_combined = pd.DataFrame(all_data)
    df_combined.to_csv(combined_csv_path, index=False)
    print(f"Successfully created and saved '{combined_csv_path}'.")

print("Shape of the combined dataframe:", df_combined.shape)
df_combined.head()

'../Ingested_Data/combined_streaming_history.csv' already exists. Loading data from file.
Shape of the combined dataframe: (16053, 23)


Unnamed: 0,ts,platform,ms_played,conn_country,ip_addr,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,...,audiobook_uri,audiobook_chapter_uri,audiobook_chapter_title,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2023-08-27T01:02:32Z,windows,1265604,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,...,,,,remote,logout,False,False,False,,False
1,2023-08-27T06:39:41Z,windows,1164082,US,12.13.248.226,,,,,This Conversation About the 'Reading Mind' Is ...,...,,,,clickrow,endplay,False,True,False,1693102000.0,False
2,2023-09-03T05:26:16Z,windows,3810,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,...,,,,playbtn,logout,False,False,False,1693719000.0,False
3,2023-09-03T12:46:29Z,windows,4592540,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,...,,,,appload,logout,False,False,False,1693740000.0,False
4,2023-09-04T00:29:31Z,ios,393360,US,172.56.209.239,Another In The Fire - Live,Hillsong UNITED,People,spotify:track:5PmHmU5AaBy9ld3bdQkD96,,...,,,,playbtn,trackdone,True,False,False,1693787000.0,False


## Section 3: Data Cleaning, Pre-processing, and Exploratory Data Analysis

### Data Cleaning and Pre-processing

With the data loaded, the next critical step is to clean and pre-process it. This ensures data quality and prepares it for feature engineering and modeling. Our cleaning process involves:

1.  **Handling Missing Values**: We identified that some essential columns like `master_metadata_track_name` and `spotify_track_uri` contain null values. Since these are crucial for identifying tracks, we will remove rows where these values are missing.
2.  **Timestamp Conversion**: The `ts` column, which represents the timestamp, is converted from a string to a datetime object. This allows for time-based analysis and feature extraction.
3.  **Feature Engineering**: We create new, more informative features from existing ones:
    *   `seconds_played`, `minutes_played`: Converted from `ms_played` for easier interpretation.
    *   `artist_name`, `track_name`, `album_name`: Shorter aliases for the `master_metadata_album_artist_name`, `master_metadata_track_name`, and `master_metadata_album_album_name` columns.
4.  **Column Selection**: We list the columns explicitly to fix their order. The selection keeps every original column and appends the derived ones, so nothing is dropped at this stage.

In [2]:
import pandas as pd
import os

# Define paths
cleaned_csv_path = os.path.join(ingested_data_dir, 'cleaned_streaming_history.csv')

if os.path.exists(cleaned_csv_path):
    print(f"'{cleaned_csv_path}' already exists. Loading data from file.")
    df_cleaned = pd.read_csv(cleaned_csv_path)
else:
    print(f"'{cleaned_csv_path}' not found. Generating from combined CSV.")
    # Drop rows with missing essential metadata; .copy() avoids pandas'
    # SettingWithCopyWarning when new columns are assigned below
    df_cleaned = df_combined.dropna(subset=['master_metadata_track_name', 'spotify_track_uri']).copy()

    # Convert timestamp
    df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['ts'])
    df_cleaned['date'] = df_cleaned['timestamp'].dt.date
    df_cleaned['hour'] = df_cleaned['timestamp'].dt.hour
    df_cleaned['day_of_week'] = df_cleaned['timestamp'].dt.day_name()
    df_cleaned['month'] = df_cleaned['timestamp'].dt.month
    df_cleaned['year'] = df_cleaned['timestamp'].dt.year

    # Feature Engineering
    df_cleaned['seconds_played'] = df_cleaned['ms_played'] / 1000
    df_cleaned['minutes_played'] = df_cleaned['seconds_played'] / 60
    df_cleaned['artist_name'] = df_cleaned['master_metadata_album_artist_name']
    df_cleaned['track_name'] = df_cleaned['master_metadata_track_name']
    df_cleaned['album_name'] = df_cleaned['master_metadata_album_album_name']

    # Select and reorder columns
    selected_columns = [
        'ts', 'platform', 'ms_played', 'conn_country', 'ip_addr',
        'master_metadata_track_name', 'master_metadata_album_artist_name',
        'master_metadata_album_album_name', 'spotify_track_uri', 'episode_name',
        'episode_show_name', 'spotify_episode_uri', 'audiobook_title', 'audiobook_uri',
        'audiobook_chapter_uri', 'audiobook_chapter_title', 'reason_start', 'reason_end', 'shuffle', 'skipped',
        'offline', 'offline_timestamp', 'incognito_mode', 'timestamp', 'date',
        'hour', 'day_of_week', 'month', 'year', 'seconds_played',
        'minutes_played', 'artist_name', 'track_name', 'album_name'
    ]
    df_cleaned = df_cleaned[selected_columns]
    
    df_cleaned.to_csv(cleaned_csv_path, index=False)
    print(f"Successfully created and saved '{cleaned_csv_path}'.")

print("Shape of the cleaned dataframe:", df_cleaned.shape)
df_cleaned.head()

'../Ingested_Data/cleaned_streaming_history.csv' already exists. Loading data from file.
Shape of the cleaned dataframe: (12727, 34)


Unnamed: 0,ts,platform,ms_played,conn_country,ip_addr,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,...,date,hour,day_of_week,month,year,seconds_played,minutes_played,artist_name,track_name,album_name
0,2023-09-04T00:29:31Z,ios,393360,US,172.56.209.239,Another In The Fire - Live,Hillsong UNITED,People,spotify:track:5PmHmU5AaBy9ld3bdQkD96,,...,2023-09-04,0,Monday,9,2023,393.36,6.556,Hillsong UNITED,Another In The Fire - Live,People
1,2023-09-04T00:36:52Z,ios,353546,US,172.56.209.239,Good Grace - Live,Hillsong UNITED,People,spotify:track:7nzmXUrZwSOJPNmV0mOmEn,,...,2023-09-04,0,Monday,9,2023,353.546,5.892433,Hillsong UNITED,Good Grace - Live,People
2,2023-09-04T00:40:11Z,ios,197657,US,172.56.209.239,Echoes (Till We See The Other Side) - Live,Hillsong UNITED,People,spotify:track:0oHYnQXUrFoIm0xraAmdNG,,...,2023-09-04,0,Monday,9,2023,197.657,3.294283,Hillsong UNITED,Echoes (Till We See The Other Side) - Live,People
3,2023-09-04T00:41:55Z,ios,55170,US,172.56.209.239,Not Today,Hillsong UNITED,Wonder,spotify:track:33Nyq9QfKCXEQtzeg22vg7,,...,2023-09-04,0,Monday,9,2023,55.17,0.9195,Hillsong UNITED,Not Today,Wonder
4,2023-09-04T00:43:46Z,ios,48599,US,172.56.209.239,Glory and Majesty,Jon Reddick,"God, Turn It Around",spotify:track:5lvrYFNaUV2eib9Tas1gZK,,...,2023-09-04,0,Monday,9,2023,48.599,0.809983,Jon Reddick,Glory and Majesty,"God, Turn It Around"


### Exploratory Data Analysis (EDA)

Before diving into complex feature extraction and modeling, we perform some basic exploratory data analysis on the cleaned streaming history. This helps us understand the basic characteristics of the data and uncover initial insights.

#### Top 10 Songs by Play Count
First, let's identify the songs I've listened to most frequently. We can do this by counting the occurrences of each track in our dataset.

In [4]:
import plotly.express as px

# Calculate play counts
top_10_songs = df_cleaned['track_name'].value_counts().nlargest(10)

# Create a bar chart
fig = px.bar(top_10_songs, 
             x=top_10_songs.index, 
             y=top_10_songs.values, 
             labels={'x': 'Track Name', 'y': 'Play Count'},
             title='Top 10 Most Played Songs')
fig.show()

#### Top 10 Artists by Total Listening Time

Next, we'll determine which artists I've spent the most time listening to. This requires grouping the data by artist and summing the `minutes_played` for each.

In [5]:
# Group by artist and sum listening time
artist_listening_time = df_cleaned.groupby('artist_name')['minutes_played'].sum().nlargest(10)

# Create a bar chart
fig = px.bar(artist_listening_time, 
             x=artist_listening_time.index, 
             y=artist_listening_time.values, 
             labels={'x': 'Artist Name', 'y': 'Total Minutes Played'},
             title='Top 10 Artists by Listening Time')
fig.show()

## Section 4: Analysis and Data Splits

Now that we have a better understanding of our data, we can proceed with the main analysis. Our goal is to classify songs based on their audio features. To do this, we first need to extract these features from the audio files.

The process involves:
1.  **Downloading Audio Samples**: We'll use the Spotify track URI to find a 30-second audio preview for each unique track in our dataset.
2.  **Extracting Audio Features**: From these audio samples, we'll extract a variety of features using the `librosa` library. These features capture different aspects of the audio, such as tempo, rhythm, and tonal content.
3.  **Genre Labeling**: We will use the Spotify API to fetch genre information for each artist, which will serve as our target variable for classification.
4.  **Merging Datasets**: Finally, we'll merge the streaming history data with the extracted audio features and genre labels to create a unified dataset for modeling.

In [9]:
# This cell will contain the code for downloading audio samples, extracting features, and labeling genres.
# For now, we will check if the final unified file exists and load it.

unified_features_path = '../EDA/unified_streaming_features.csv'

if os.path.exists(unified_features_path):
    print(f"'{unified_features_path}' already exists. Loading data from file.")
    df_unified = pd.read_csv(unified_features_path)
    print("Shape of the unified dataframe:", df_unified.shape)
    display(df_unified.head())
else:
    print(f"'{unified_features_path}' not found. You would normally run the feature extraction and merging scripts here.")

'../EDA/unified_streaming_features.csv' already exists. Loading data from file.
Shape of the unified dataframe: (12727, 65)


Unnamed: 0,ts,platform,ms_played,conn_country,ip_addr,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,...,spec_rolloff_mean,spec_rolloff_std,zero_crossing_mean,zero_crossing_std,rms_mean,rms_std,beat_count,beat_tempo,sample_rate,genre
0,2023-09-04T00:29:31Z,ios,393360,US,172.56.209.239,Another In The Fire - Live,Hillsong UNITED,People,spotify:track:5PmHmU5AaBy9ld3bdQkD96,,...,5793.844757,970.584179,0.116046,0.033234,0.264563,0.050109,65.0,135.999178,22050.0,worship
1,2023-09-04T00:36:52Z,ios,353546,US,172.56.209.239,Good Grace - Live,Hillsong UNITED,People,spotify:track:7nzmXUrZwSOJPNmV0mOmEn,,...,5397.835693,687.22012,0.09151,0.023265,0.291933,0.046605,35.0,71.777344,22050.0,worship
2,2023-09-04T00:40:11Z,ios,197657,US,172.56.209.239,Echoes (Till We See The Other Side) - Live,Hillsong UNITED,People,spotify:track:0oHYnQXUrFoIm0xraAmdNG,,...,5981.343441,727.541778,0.120142,0.032625,0.277017,0.050782,71.0,143.554688,22050.0,worship
3,2023-09-04T00:41:55Z,ios,55170,US,172.56.209.239,Not Today,Hillsong UNITED,Wonder,spotify:track:33Nyq9QfKCXEQtzeg22vg7,,...,6108.734207,1176.734286,0.126272,0.067258,0.204637,0.079713,45.0,92.285156,22050.0,worship
4,2023-09-04T00:43:46Z,ios,48599,US,172.56.209.239,Glory and Majesty,Jon Reddick,"God, Turn It Around",spotify:track:5lvrYFNaUV2eib9Tas1gZK,,...,5012.425003,1314.834725,0.088843,0.031439,0.278864,0.077439,61.0,123.046875,22050.0,worship


### Audio Feature Explanation

Before we proceed with the analysis, it's crucial to understand the features we've extracted from the audio samples. These features, derived using the `librosa` library, convert raw audio signals into a numerical format suitable for machine learning. Each feature captures a different characteristic of the sound, from its tonal quality to its rhythmic structure.

#### 1. Mel-Frequency Cepstral Coefficients (MFCCs)
- **What they are**: MFCCs are among the most widely used features in audio and speech processing. They represent the short-term power spectrum of a sound, computed as a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
- **Mathematical Intuition**:
    1.  **Framing**: The audio signal is broken down into small, overlapping frames (e.g., 25ms).
    2.  **Power Spectrum**: For each frame, we compute the power spectrum using the Fast Fourier Transform (FFT). The FFT decomposes the signal into its constituent frequencies, telling us how much energy is present at each frequency.
        $$X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i2\pi kn/N}$$
        Where $x[n]$ is the signal in a frame, and $X[k]$ is the frequency-domain representation.
    3.  **Mel Filterbank**: The power spectrum is then filtered through a Mel filterbank. The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. It mimics the human ear's response, which is more sensitive to changes in lower frequencies than higher ones.
    4.  **Logarithm**: We take the logarithm of the filterbank energies. This is because the human perception of loudness is logarithmic.
    5.  **Discrete Cosine Transform (DCT)**: Finally, we compute the DCT of the log filterbank energies. The DCT decorrelates the energies, resulting in a compressed representation. The resulting coefficients are the MFCCs.
- **Features Extracted**:
    - `mfcc_mean`, `mfcc_std`: The mean and standard deviation of the MFCCs over the 30-second clip. These capture the overall timbral texture of the song.
    - `mfcc_delta_mean`, `mfcc_delta_std`: The mean and standard deviation of the first derivative (delta) of the MFCCs. These capture the rate of change in timbre.
    - `mfcc_delta2_mean`, `mfcc_delta2_std`: The mean and standard deviation of the second derivative (delta-delta) of the MFCCs, capturing the acceleration of timbral changes.
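The five steps above can be sketched with NumPy alone. This is a minimal illustration rather than the extraction code used for the dataset (which relied on `librosa`): the frame size, hop size, 26 mel filters, 13 coefficients, the HTK-style mel formula, and the unnormalized DCT are all assumptions chosen for readability, so the numbers will not match `librosa` output. Collapsing the coefficient matrix to a single mean and standard deviation is likewise an assumption about how the `mfcc_mean` and `mfcc_std` columns were formed.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return x @ basis.T

def mfcc(signal, sr, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # 1. Framing: overlapping frames, each tapered by a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2. Power spectrum of every frame via the FFT
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel filterbank energies
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    # 4. Logarithm (small constant avoids log(0))
    log_mel = np.log(mel_energy + 1e-10)
    # 5. DCT decorrelates the log energies; the result is the MFCC matrix
    return dct_ii(log_mel, n_mfcc)

# Synthetic one-second 440 Hz tone standing in for an audio preview
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

coeffs = mfcc(tone, sr)                      # shape: (frames, coefficients)
mfcc_mean, mfcc_std = coeffs.mean(), coeffs.std()
```

Each row of `coeffs` describes one frame, so the summary statistics reduce a variable-length clip to fixed-size numbers that a classifier can consume.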

#### 2. Chroma Features
- **What they are**: Chroma features represent the tonal content of a musical audio signal in a 12-dimensional vector. Each dimension corresponds to one of the 12 pitch classes (C, C#, D, etc.) of the chromatic scale.
- **Mathematical Intuition**: It involves mapping the entire spectrum to 12 bins representing the 12 semitones of the musical octave. A high value in a chroma bin indicates that the corresponding pitch class is prominent in the audio frame.
- **Features Extracted**: `chroma_mean` and `chroma_std` describe the average and variability of the song's harmonic content, which is useful for identifying melodies and chord progressions.
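As a rough illustration of the folding step, the sketch below maps the FFT bins of a single frame to the 12 pitch classes. It assumes standard tuning (A4 = 440 Hz), a 4096-sample frame, and nearest-semitone assignment; the function name `chroma_frame` is hypothetical, and `librosa` uses a smoother filterbank than this hard binning.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_frame(frame, sr):
    """Fold one frame's power spectrum into 12 normalized pitch-class bins."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    audible = freqs > 20.0  # skip DC and sub-audio bins (log2 is undefined at 0 Hz)
    # MIDI note number: A4 = 440 Hz = note 69; pitch class = note mod 12 (C = 0)
    midi = 69 + 12 * np.log2(freqs[audible] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.bincount(pitch_class, weights=power[audible], minlength=12)
    return chroma / chroma.sum()

# A pure 440 Hz tone should land almost entirely in the "A" bin
sr = 22050
t = np.arange(4096) / sr
a4 = np.sin(2 * np.pi * 440.0 * t)

chroma = chroma_frame(a4, sr)
strongest = PITCH_CLASSES[int(np.argmax(chroma))]
```

Because octaves are folded together, a 220 Hz or 880 Hz tone would activate the same bin, which is why chroma describes harmony rather than register.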

#### 3. Spectral Features
- **What they are**: These features are computed directly from the signal's spectrum and describe the distribution of energy across different frequencies.
- **Features Extracted**:
    - `spec_centroid_mean`, `spec_centroid_std`: The **Spectral Centroid** is the center of mass of the spectrum. It's a measure of the "brightness" of a sound. A higher centroid means more energy in higher frequencies.
    - `spec_bandwidth_mean`, `spec_bandwidth_std`: The **Spectral Bandwidth** is the standard deviation of the spectrum around its centroid. It measures the range of frequencies present in the signal.
    - `spec_contrast_mean`, `spec_contrast_std`: **Spectral Contrast** measures the difference in amplitude between peaks and valleys in the spectrum. It can help distinguish between music and noise.
    - `spec_rolloff_mean`, `spec_rolloff_std`: **Spectral Rolloff** is the frequency below which a specified percentage (e.g., 85%) of the total spectral energy lies. It's another measure of the sound's brightness.
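To make three of these definitions concrete, here is a small NumPy sketch for a single frame. It is an illustration under assumptions, not the `librosa` implementation: the centroid and bandwidth are weighted by spectral magnitude, the rolloff accumulates power, and the helper name `spectral_summary` is hypothetical. Spectral contrast is omitted because it requires splitting the spectrum into sub-bands.

```python
import numpy as np

def spectral_summary(frame, sr, rolloff_pct=0.85):
    """Return (centroid, bandwidth, rolloff) in Hz for one audio frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    weights = mag / mag.sum()
    # Centroid: magnitude-weighted mean frequency ("center of mass")
    centroid = np.sum(freqs * weights)
    # Bandwidth: magnitude-weighted standard deviation around the centroid
    bandwidth = np.sqrt(np.sum(weights * (freqs - centroid) ** 2))
    # Rolloff: first frequency at which cumulative power reaches rolloff_pct
    cumulative = np.cumsum(mag ** 2)
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * cumulative[-1])]
    return centroid, bandwidth, rolloff

sr = 22050
t = np.arange(2048) / sr
dark = np.sin(2 * np.pi * 200.0 * t)      # low-frequency tone
bright = np.sin(2 * np.pi * 4000.0 * t)   # high-frequency tone
mix = dark + bright

c_dark, b_dark, r_dark = spectral_summary(dark, sr)
c_bright, b_bright, r_bright = spectral_summary(bright, sr)
c_mix, b_mix, r_mix = spectral_summary(mix, sr)
```

The bright tone yields a centroid and rolloff near 4000 Hz, while the mixture has a much larger bandwidth than either tone alone because its energy sits in two distant regions of the spectrum.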

#### 4. Zero-Crossing Rate (ZCR)
- **What it is**: The ZCR is the rate at which the audio signal changes sign from positive to negative or back.
- **Mathematical Intuition**: It's a simple count of how many times the signal waveform crosses the horizontal axis (zero).
- **Features Extracted**: `zero_crossing_mean` and `zero_crossing_std`. ZCR is often correlated with the noisiness or percussive nature of a sound. For example, rock music typically has a higher ZCR than classical music.
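The count described above takes only a few lines of NumPy. In this sketch the rate is expressed as the fraction of adjacent sample pairs whose signs differ; the two test tones and the small phase offset (which keeps samples from landing exactly on zero) are assumptions made for the illustration.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

sr = 22050
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100.0 * t + 0.1)     # 100 Hz tone
high = np.sin(2 * np.pi * 2000.0 * t + 0.1)   # 2000 Hz tone

zcr_low = zero_crossing_rate(low)
zcr_high = zero_crossing_rate(high)
```

A sine wave at frequency f crosses zero about 2f times per second, so its rate is roughly 2f divided by the sample rate. That is why the 2000 Hz tone scores about twenty times higher than the 100 Hz tone.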

#### 5. Root Mean Square (RMS) Energy
- **What it is**: RMS measures the average amplitude of the audio signal over a frame. It is the square root of the frame's mean power and serves as a simple proxy for loudness.
- **Mathematical Intuition**: It is calculated as the square root of the mean of the squared signal values:
  $$RMS = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} (x[n])^2}$$
- **Features Extracted**: `rms_mean` and `rms_std` describe the overall and varying loudness of the track.
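The formula translates directly into NumPy. The sketch below computes RMS per frame and then summarizes the frames with a mean and standard deviation; the frame length of 2048 and hop of 512 samples are assumed values, and the two-part synthetic signal is only a stand-in for a real track.

```python
import numpy as np

def rms(frame):
    """Square root of the mean of the squared samples."""
    return np.sqrt(np.mean(np.square(frame)))

def framewise_rms(signal, frame_length=2048, hop_length=512):
    """RMS of each overlapping frame of the signal."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.array([
        rms(signal[i * hop_length : i * hop_length + frame_length])
        for i in range(n_frames)
    ])

sr = 22050
t = np.arange(sr) / sr
quiet = 0.1 * np.sin(2 * np.pi * 440.0 * t)   # one quiet second
loud = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # one loud second
track = np.concatenate([quiet, loud])

values = framewise_rms(track)
rms_mean, rms_std = values.mean(), values.std()
```

For a sine wave of amplitude A the RMS is A divided by the square root of 2, so the loud section sits near 0.354. A non-zero `rms_std` reflects the change in loudness between the two halves.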

#### 6. Tempo and Beats
- **What they are**: These features describe the rhythmic pulse of the music.
- **Features Extracted**:
    - `tempo`: The estimated overall tempo of the song in beats per minute (BPM).
    - `beat_count`: The total number of beats detected in the 30-second sample.
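The relationship between beat spacing, beat count, and BPM can be shown with a toy example. This is not how a beat tracker works internally (real trackers estimate beats from an onset-strength envelope); the helper `tempo_from_beats` and the evenly spaced beat grid are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def tempo_from_beats(beat_times):
    """Estimate BPM from the median gap between consecutive beat times (seconds)."""
    intervals = np.diff(beat_times)
    return 60.0 / np.median(intervals)

# Hypothetical beat grid: one beat every 0.5 s across a 30-second clip
beat_times = np.arange(0.0, 30.0, 0.5)

beat_count = len(beat_times)            # 60 beats in 30 seconds
tempo = tempo_from_beats(beat_times)    # 120 BPM
```

Using the median interval rather than the mean keeps the estimate stable when a few beats are missed or detected twice.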

This comprehensive set of features provides a rich, multi-faceted numerical representation of each song, enabling our machine learning model to learn the complex patterns that define different genres and musical styles.

### EDA on the Unified Dataset

Now, with our feature-rich dataset, we can perform a more in-depth exploratory data analysis. We'll investigate the distribution of our newly extracted features and look for relationships between them and the genre labels. This step is crucial for understanding the data that will be fed into our classification model.

In [10]:
# Display descriptive statistics for the numerical columns
print("Descriptive Statistics for the Unified Dataset:")
display(df_unified.describe())

Descriptive Statistics for the Unified Dataset:


Unnamed: 0,ms_played,episode_name,episode_show_name,spotify_episode_uri,audiobook_title,audiobook_uri,audiobook_chapter_uri,audiobook_chapter_title,offline_timestamp,hour,...,spec_bandwidth_std,spec_rolloff_mean,spec_rolloff_std,zero_crossing_mean,zero_crossing_std,rms_mean,rms_std,beat_count,beat_tempo,sample_rate
count,12727.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12727.0,12727.0,...,12610.0,12610.0,12610.0,12610.0,12610.0,12610.0,12610.0,12610.0,12610.0,12610.0
mean,219348.4,,,,,,,,1725271000.0,10.290171,...,422.503442,5055.465164,1591.179907,0.090831,0.049881,0.259144,0.084948,55.305155,118.131467,22050.0
std,110928.7,,,,,,,,10683470.0,7.818854,...,133.801301,1148.359384,498.419579,0.025409,0.022948,0.060176,0.033923,13.58966,25.198669,0.0
min,30000.0,,,,,,,,1693787000.0,0.0,...,24.828326,414.270229,74.750053,0.007801,0.004415,0.036279,0.015727,1.0,33.558239,22050.0
25%,157200.0,,,,,,,,1717881000.0,3.0,...,327.375077,4503.837662,1244.611995,0.075764,0.033234,0.221192,0.055819,47.25,103.359375,22050.0
50%,196000.0,,,,,,,,1722905000.0,9.0,...,409.960776,5276.70722,1599.609935,0.090692,0.044124,0.270206,0.081777,56.0,117.453835,22050.0
75%,245920.0,,,,,,,,1735781000.0,18.0,...,509.145861,5743.233318,1936.169847,0.107509,0.063633,0.299056,0.113012,66.0,135.999178,22050.0
max,1748934.0,,,,,,,,1742065000.0,23.0,...,994.057131,8490.920506,2960.593704,0.201406,0.14039,0.543253,0.21688,115.0,234.90767,22050.0



The descriptive statistics above give us a first look at the distribution of our features. We can see the mean, standard deviation, and range for each numerical column. For instance, `beat_tempo` has a mean of about 118 BPM, a common tempo in popular music. Several spectral features also vary widely (for example, `spec_rolloff_mean` has a standard deviation of about 1,148 around a mean of about 5,055), which suggests a wide variety of sounds in the dataset.

Next, let's visualize the distribution of some of these key features across different genres.

In [11]:
# Ensure 'genre' column has no missing values for this visualization
df_eda = df_unified.dropna(subset=['genre'])

# Box plot for Tempo across genres
fig_tempo = px.box(df_eda, x='genre', y='tempo', title='Tempo Distribution by Genre')
fig_tempo.show()

# Box plot for Spectral Centroid across genres
fig_centroid = px.box(df_eda, x='genre', y='spec_centroid_mean', title='Spectral Centroid Distribution by Genre')
fig_centroid.show()

# Box plot for RMS (Loudness) across genres
fig_rms = px.box(df_eda, x='genre', y='rms_mean', title='Loudness (RMS) Distribution by Genre')
fig_rms.show()

# Box plot for Zero-Crossing Rate across genres
fig_zcr = px.box(df_eda, x='genre', y='zero_crossing_mean', title='Zero-Crossing Rate Distribution by Genre')
fig_zcr.show()


From the box plots, we can start to see some interesting patterns. For example, genres like 'edm' and 'afrobeats' might show higher average tempos, while 'worship' music might have a lower and wider range. Similarly, the spectral centroid plot can reveal which genres tend to be "brighter" (higher centroid) or "darker" (lower centroid). The RMS plots might show that genres like rock and edm have higher average loudness, while the Zero-Crossing Rate can help differentiate percussive genres.

These initial visualizations are key to forming hypotheses about which features will be most important for our classification task.

Now that we have explored the data, we need to prepare it for our machine learning model. This involves two main steps:

1.  **Defining Features (X) and Target (y)**: We need to separate our dataset into the features we will use to make predictions (X) and the variable we are trying to predict (y).
    *   `X`: This will be a matrix containing all the numerical audio features we extracted (MFCCs, Chroma, Spectral features, etc.).
    *   `y`: This will be the `genre` column, which is our target for classification.

2.  **Splitting Data into Training and Testing Sets**: We will split the data into two parts:
    *   **Training Set**: This subset of the data (typically 70-80%) is used to train our machine learning model. The model learns the relationships between the features and the target variable from this data.
    *   **Testing Set**: This subset (the remaining 20-30%) is held back and used to evaluate the model's performance on unseen data. This helps us understand how well our model generalizes to new, unheard songs.

We will use the `train_test_split` function from the `scikit-learn` library to perform this split. By default it shuffles the data before splitting, and we additionally pass `stratify=y` so that the training and testing sets both preserve the genre proportions of the overall dataset.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Define the features (X) and target (y)
# We will use all the extracted numerical features
feature_columns = [
    'mfcc_mean', 'mfcc_std', 'mfcc_delta_mean', 'mfcc_delta_std',
    'mfcc_delta2_mean', 'mfcc_delta2_std', 'chroma_mean', 'chroma_std',
    'spec_centroid_mean', 'spec_centroid_std', 'spec_bandwidth_mean',
    'spec_bandwidth_std', 'spec_contrast_mean', 'spec_contrast_std',
    'spec_rolloff_mean', 'spec_rolloff_std', 'zero_crossing_mean',
    'zero_crossing_std', 'rms_mean', 'rms_std', 'tempo', 'beat_count'
]

# Drop rows where the genre (our target) or any feature is missing
df_model = df_unified.dropna(subset=['genre'] + feature_columns)

# stratify=y needs at least two rows per genre; with a single-row genre,
# train_test_split raises "ValueError: The least populated class in y has
# only 1 member". Drop such singleton genres before splitting.
genre_counts = df_model['genre'].value_counts()
df_model = df_model[df_model['genre'].isin(genre_counts[genre_counts >= 2].index)]

X = df_model[feature_columns]
y_raw = df_model['genre']

# Encode the string labels (genres) into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Shape of the training features (X_train):", X_train.shape)
print("Shape of the testing features (X_test):", X_test.shape)
print("Shape of the training labels (y_train):", y_train.shape)
print("Shape of the testing labels (y_test):", y_test.shape)
print("\nFirst 5 encoded labels (y_train):", y_train[:5])
print("Corresponding genre labels:", label_encoder.inverse_transform(y_train[:5]))

### Feature Correlation Analysis

To better understand the relationships between our extracted audio features, we'll compute a correlation matrix. This matrix shows the Pearson correlation coefficient between each pair of features, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). A value near 0 indicates no linear relationship, although a nonlinear one may still exist.

Visualizing this as a heatmap helps us quickly identify:
-   **Multicollinearity**: Pairs of features that are highly correlated (e.g., > 0.8 or < -0.8). High multicollinearity can be problematic for some machine learning models, as it means the features are redundant.
-   **Feature Relationships**: Interesting patterns in how different audio characteristics relate to each other.

In [ ]:
import plotly.graph_objects as go
import numpy as np

# Calculate the correlation matrix for the features
corr_matrix = X.corr()

# Create the heatmap
fig_corr = go.Figure(data=go.Heatmap(
                   z=corr_matrix,
                   x=corr_matrix.columns,
                   y=corr_matrix.columns,
                   colorscale='RdBu',
                   zmin=-1,
                   zmax=1,
                   hoverongaps=False))

fig_corr.update_layout(
    title='Correlation Matrix of Audio Features',
    xaxis_tickangle=-45,
    height=700,
    width=700
)

fig_corr.show()

#### Interpreting the Correlation Matrix

The heatmap above reveals several strong correlations. For example:
-   `mfcc_mean` and `spec_centroid_mean` are often positively correlated, as both relate to the tonal quality and "brightness" of the sound.
-   `mfcc_delta_mean` and `mfcc_delta2_mean` show some correlation, which is expected as they are derivatives of the same base feature.
-   `rms_mean` and `rms_std` (loudness features) might be correlated with spectral features.

For this project, we will retain all features to allow the models to determine their importance. However, in a different scenario, we might consider removing one feature from any pair with a correlation coefficient above 0.9 to reduce redundancy.
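If redundancy did need to be reduced, the candidate pairs could be listed programmatically. The sketch below is a hypothetical helper, not part of the project pipeline: it uses synthetic data with made-up feature names and the 0.9 threshold mentioned above to show how highly correlated pairs might be flagged.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for feature pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skip the diagonal
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data: feat_b is nearly a copy of feat_a, feat_c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)
c = rng.normal(size=500)

pairs = highly_correlated_pairs(np.column_stack([a, b, c]),
                                ["feat_a", "feat_b", "feat_c"])
```

Only the `feat_a`/`feat_b` pair is reported here. Which member of a flagged pair to drop would remain a judgment call, for example keeping the feature that is easier to interpret.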

### Feature Scaling

Another crucial step, vital for many machine learning algorithms, is **feature scaling**. Models like Logistic Regression and Support Vector Machines are sensitive to the scale of the input features. If one feature has a much larger range of values than another (e.g., `tempo` vs. `rms_mean`), it can dominate the learning process.

We will use `StandardScaler` from `scikit-learn` to scale our data. It standardizes features by removing the mean and scaling to unit variance. The scaler is "fit" only on the training data to prevent data leakage from the test set, and then it's used to "transform" both the training and testing data.

In [ ]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)

print("Shape of the scaled training features:", X_train_scaled.shape)
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])
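
For intuition, the standardization applied above can be written out by hand. This sketch uses a hypothetical two-column matrix (a tempo-like column and an RMS-like column with made-up distributions) rather than the project data. It computes the mean and population standard deviation on the training rows only and then applies them to both sets, which mirrors the fit/transform split.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features: column 0 is tempo-like (~120), column 1 is RMS-like (~0.26)
train = np.column_stack([rng.normal(120, 25, 200), rng.normal(0.26, 0.06, 200)])
test = np.column_stack([rng.normal(120, 25, 50), rng.normal(0.26, 0.06, 50)])

# "Fit": statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# "Transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

After scaling, each training column has mean 0 and unit variance, so the tempo-like column no longer dwarfs the RMS-like one. The test columns end up close to, but not exactly at, mean 0, which is expected because their statistics were never used.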
