# Genre Classification of Spotify Music

**A Machine Learning Pipeline for Predicting Song Genre from Audio Features**

**Student:** F. Declan  
**Course:** CS156 - Machine Learning  
**Date:** October 21, 2025  
**Institution:** Harvey Mudd College

---

### Project Overview

This project implements a complete machine learning pipeline to answer a fundamental question in music information retrieval: **Can a song's genre be predicted solely from its quantitative audio features?**

Starting with a raw dataset of personal Spotify streaming history, this notebook documents the end-to-end process of:
1.  **Data Ingestion and Cleaning**: Parsing and cleaning two years of streaming data.
2.  **Exploratory Data Analysis (EDA)**: Uncovering personal listening habits and patterns.
3.  **Feature Engineering and Preparation**: Loading a dataset of unique tracks with pre-extracted audio features, preparing it for modeling, and performing a detailed feature analysis.
4.  **Model Training and Selection**: Training and evaluating three distinct classification models (Logistic Regression, Random Forest, and Gradient Boosting) using 5-fold stratified cross-validation.
5.  **Final Evaluation and Interpretation**: Analyzing the best model's performance on a held-out test set to understand its predictive power and limitations.

The final result is a documented classification model, together with an analysis of which audio characteristics are most discriminative of musical genre.

---

## Section 1: Setup and Data Ingestion

This section handles the initial setup, including importing necessary libraries and loading the raw streaming history data. This dataset forms the basis for our initial exploratory analysis.

In [2]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import os
import json
import glob
import ast
import time
import warnings

# --- Visualization ---
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# --- Machine Learning ---
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# --- Configuration ---
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

print("✓ All libraries imported successfully.")

✓ All libraries imported successfully.


### 1.1 Load and Combine Raw Streaming Data

The raw data from Spotify consists of multiple JSON files containing streaming history. The first step is to locate these files, parse them, and combine them into a single pandas DataFrame. This combined dataset will be saved as a CSV file to streamline future analysis.

In [3]:
# Define file paths for raw and ingested data
raw_data_pattern = '../Streaming_History_Audio_*.json'
ingested_data_dir = '../Ingested_Data'
combined_csv_path = os.path.join(ingested_data_dir, 'combined_streaming_history.csv')

# Create the output directory if it doesn't exist
os.makedirs(ingested_data_dir, exist_ok=True)

# Check if the combined CSV already exists to avoid reprocessing
if os.path.exists(combined_csv_path):
    print(f"✓ Found existing combined dataset at '{combined_csv_path}'. Loading from CSV...")
    df_combined = pd.read_csv(combined_csv_path)
else:
    print("✗ Combined dataset not found. Creating from raw JSON files...")
    
    # Find all JSON files matching the specified pattern
    json_files = glob.glob(raw_data_pattern)
    
    if not json_files:
        raise FileNotFoundError(f"No JSON files found matching pattern: {raw_data_pattern}")
    
    print(f"  Found {len(json_files)} JSON file(s) to process.")
    
    # Load and concatenate data from all found JSON files
    all_streaming_data = []
    for json_file in json_files:
        with open(json_file, 'r', encoding='utf-8') as f:
            file_data = json.load(f)
            all_streaming_data.extend(file_data)
            print(f"  → Loaded {len(file_data):,} records from {os.path.basename(json_file)}")
    
    # Convert the list of records into a pandas DataFrame
    df_combined = pd.DataFrame(all_streaming_data)
    
    # Save the combined DataFrame to a CSV file for easy access
    df_combined.to_csv(combined_csv_path, index=False)
    print(f"\n✓ Combined dataset saved to '{combined_csv_path}'")

# Display a summary of the loaded dataset
print(f"\n{'='*70}")
print(f"COMBINED STREAMING HISTORY - SUMMARY")
print(f"{'='*70}")
print(f"Total streaming events: {len(df_combined):,}")
print(f"Dataset shape: {df_combined.shape[0]:,} rows × {df_combined.shape[1]} columns")
print(f"\nFirst 5 records:")
display(df_combined.head())

✓ Found existing combined dataset at '../Ingested_Data/combined_streaming_history.csv'. Loading from CSV...

COMBINED STREAMING HISTORY - SUMMARY
Total streaming events: 16,053
Dataset shape: 16,053 rows × 23 columns

First 5 records:


Unnamed: 0,ts,platform,ms_played,conn_country,ip_addr,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,episode_show_name,spotify_episode_uri,audiobook_title,audiobook_uri,audiobook_chapter_uri,audiobook_chapter_title,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2023-08-27T01:02:32Z,windows,1265604,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,The Ezra Klein Show,spotify:episode:5ess4DnMyD2YTmjgU5cggh,,,,,remote,logout,False,False,False,,False
1,2023-08-27T06:39:41Z,windows,1164082,US,12.13.248.226,,,,,This Conversation About the 'Reading Mind' Is ...,The Ezra Klein Show,spotify:episode:5ess4DnMyD2YTmjgU5cggh,,,,,clickrow,endplay,False,True,False,1693102000.0,False
2,2023-09-03T05:26:16Z,windows,3810,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,The Ezra Klein Show,spotify:episode:5ess4DnMyD2YTmjgU5cggh,,,,,playbtn,logout,False,False,False,1693719000.0,False
3,2023-09-03T12:46:29Z,windows,4592540,US,136.24.106.5,,,,,This Conversation About the 'Reading Mind' Is ...,The Ezra Klein Show,spotify:episode:5ess4DnMyD2YTmjgU5cggh,,,,,appload,logout,False,False,False,1693740000.0,False
4,2023-09-04T00:29:31Z,ios,393360,US,172.56.209.239,Another In The Fire - Live,Hillsong UNITED,People,spotify:track:5PmHmU5AaBy9ld3bdQkD96,,,,,,,,playbtn,trackdone,True,False,False,1693787000.0,False


---

## Section 2: Data Cleaning and Exploratory Data Analysis (EDA)

With the raw data loaded, the next step is to clean and preprocess it. This involves handling missing values, converting data types, and extracting useful information from existing columns. Afterward, we will perform an exploratory data analysis to uncover trends and patterns in my listening history.

### 2.1 Clean and Preprocess Streaming Data

The raw data requires several cleaning steps to be useful for analysis:
1.  **Drop irrelevant columns** that are not needed for this analysis.
2.  **Handle missing values**, particularly for track and artist names.
3.  **Convert timestamp (`ts`)** from string to a proper datetime object.
4.  **Create new time-based features** like `year`, `month`, `day_of_week`, and `hour`.
5.  **Filter out non-music streams**, such as podcasts or other audio types.
6.  **Save the cleaned data** to a new CSV file.

In [4]:
# Define path for the cleaned dataset
cleaned_csv_path = os.path.join(ingested_data_dir, 'cleaned_streaming_history.csv')

# Check if cleaned dataset already exists
if os.path.exists(cleaned_csv_path):
    print(f"✓ Found existing cleaned dataset at '{cleaned_csv_path}'. Loading from CSV...")
    df_cleaned = pd.read_csv(cleaned_csv_path)
    # Ensure 'ts' is parsed as datetime when loading from CSV
    df_cleaned['ts'] = pd.to_datetime(df_cleaned['ts'])
else:
    print("✗ Cleaned dataset not found. Processing raw data...")
    df_cleaned = df_combined.copy()

    # 1. Drop irrelevant columns
    cols_to_drop = [
        'username', 'platform', 'conn_country', 'ip_addr_decrypted', 
        'user_agent_decrypted', 'episode_name', 'episode_show_name', 
        'spotify_episode_uri', 'incognito_mode'
    ]
    df_cleaned.drop(columns=cols_to_drop, inplace=True, errors='ignore')

    # 2. Handle missing values
    df_cleaned.dropna(subset=['master_metadata_track_name', 'master_metadata_album_artist_name'], inplace=True)

    # 3. Convert timestamp to datetime
    df_cleaned['ts'] = pd.to_datetime(df_cleaned['ts'])

    # 4. Create time-based features
    df_cleaned['year'] = df_cleaned['ts'].dt.year
    df_cleaned['month'] = df_cleaned['ts'].dt.month_name()
    df_cleaned['day_of_week'] = df_cleaned['ts'].dt.day_name()
    df_cleaned['hour'] = df_cleaned['ts'].dt.hour

    # 5. Filter out non-music (where spotify_track_uri is null)
    df_cleaned = df_cleaned[df_cleaned['spotify_track_uri'].notna()]
    
    # 6. Save the cleaned data
    df_cleaned.to_csv(cleaned_csv_path, index=False)
    print(f"\n✓ Cleaned dataset saved to '{cleaned_csv_path}'")

# Display a summary of the cleaned dataset
print(f"\n{'='*70}")
print(f"CLEANED STREAMING HISTORY - SUMMARY")
print(f"{'='*70}")
print(f"Total valid music streams: {len(df_cleaned):,}")
print(f"Dataset shape: {df_cleaned.shape[0]:,} rows × {df_cleaned.shape[1]} columns")
print("\nColumns and Data Types:")
df_cleaned.info()
print("\nSample of cleaned data:")
display(df_cleaned.head())

✓ Found existing cleaned dataset at '../Ingested_Data/cleaned_streaming_history.csv'. Loading from CSV...

CLEANED STREAMING HISTORY - SUMMARY
Total valid music streams: 12,727
Dataset shape: 12,727 rows × 34 columns

Columns and Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12727 entries, 0 to 12726
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype              
---  ------                             --------------  -----              
 0   ts                                 12727 non-null  datetime64[ns, UTC]
 1   platform                           12727 non-null  object             
 2   ms_played                          12727 non-null  int64              
 3   conn_country                       12727 non-null  object             
 4   ip_addr                            12727 non-null  object             
 5   master_metadata_track_name         12727 non-null  object             
 6   master_metadata_album_artist_name  1

Unnamed: 0,ts,platform,ms_played,conn_country,ip_addr,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,episode_show_name,spotify_episode_uri,audiobook_title,audiobook_uri,audiobook_chapter_uri,audiobook_chapter_title,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,timestamp,date,hour,day_of_week,month,year,seconds_played,minutes_played,artist_name,track_name,album_name
0,2023-09-04 00:29:31+00:00,ios,393360,US,172.56.209.239,Another In The Fire - Live,Hillsong UNITED,People,spotify:track:5PmHmU5AaBy9ld3bdQkD96,,,,,,,,playbtn,trackdone,True,False,False,1693787000.0,False,2023-09-04 00:29:31+00:00,2023-09-04,0,Monday,9,2023,393.36,6.556,Hillsong UNITED,Another In The Fire - Live,People
1,2023-09-04 00:36:52+00:00,ios,353546,US,172.56.209.239,Good Grace - Live,Hillsong UNITED,People,spotify:track:7nzmXUrZwSOJPNmV0mOmEn,,,,,,,,trackdone,trackdone,True,False,False,1693787000.0,False,2023-09-04 00:36:52+00:00,2023-09-04,0,Monday,9,2023,353.546,5.892433,Hillsong UNITED,Good Grace - Live,People
2,2023-09-04 00:40:11+00:00,ios,197657,US,172.56.209.239,Echoes (Till We See The Other Side) - Live,Hillsong UNITED,People,spotify:track:0oHYnQXUrFoIm0xraAmdNG,,,,,,,,trackdone,endplay,True,True,False,1693788000.0,False,2023-09-04 00:40:11+00:00,2023-09-04,0,Monday,9,2023,197.657,3.294283,Hillsong UNITED,Echoes (Till We See The Other Side) - Live,People
3,2023-09-04 00:41:55+00:00,ios,55170,US,172.56.209.239,Not Today,Hillsong UNITED,Wonder,spotify:track:33Nyq9QfKCXEQtzeg22vg7,,,,,,,,fwdbtn,endplay,True,True,False,1693788000.0,False,2023-09-04 00:41:55+00:00,2023-09-04,0,Monday,9,2023,55.17,0.9195,Hillsong UNITED,Not Today,Wonder
4,2023-09-04 00:43:46+00:00,ios,48599,US,172.56.209.239,Glory and Majesty,Jon Reddick,"God, Turn It Around",spotify:track:5lvrYFNaUV2eib9Tas1gZK,,,,,,,,playbtn,endplay,True,True,False,1693788000.0,False,2023-09-04 00:43:46+00:00,2023-09-04,0,Monday,9,2023,48.599,0.809983,Jon Reddick,Glory and Majesty,"God, Turn It Around"


### 2.2 Exploratory Data Analysis (EDA)

Now, let's visualize the cleaned data to understand listening habits. We will explore:
-   **Listening Time Over Time**: A timeline of total listening minutes per month.
-   **Hourly Listening Patterns**: A heatmap showing listening activity by day of the week and hour.
-   **Top Artists and Songs**: Bar charts of the most frequently played artists and tracks.

In [5]:
# --- Visualization 1: Listening Time Over Time ---
print(f"{'='*70}")
print("EDA 1: LISTENING TIMELINE")
print(f"{'='*70}")

# Resample data by month and sum the listening time
df_cleaned['ts_month'] = df_cleaned['ts'].dt.to_period('M').dt.to_timestamp()
monthly_listening = df_cleaned.groupby('ts_month')['ms_played'].sum().reset_index()
monthly_listening['hours_played'] = monthly_listening['ms_played'] / (1000 * 60 * 60)

# Create the timeline plot
fig_timeline = px.line(
    monthly_listening,
    x='ts_month',
    y='hours_played',
    title='Total Monthly Listening Time (Hours)',
    labels={'ts_month': 'Month', 'hours_played': 'Total Listening Hours'},
    markers=True,
    template='plotly_dark'
)
fig_timeline.update_layout(title_x=0.5)
fig_timeline.show()

EDA 1: LISTENING TIMELINE


In [6]:
# --- Visualization 2: Hourly Listening Patterns ---
print(f"\n{'='*70}")
print("EDA 2: HOURLY LISTENING HEATMAP")
print(f"{'='*70}")

# Create a pivot table of listening counts by day and hour
hourly_listening = df_cleaned.pivot_table(
    index='day_of_week',
    columns='hour',
    values='ts',
    aggfunc='count'
).fillna(0)

# Order the days of the week correctly
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
hourly_listening = hourly_listening.reindex(days_order)

# Create the heatmap
fig_hourly = px.imshow(
    hourly_listening,
    title='Listening Habits: Heatmap of Streams by Day and Hour',
    labels={'x': 'Hour of Day', 'y': 'Day of Week', 'color': 'Total Streams'},
    template='plotly_dark'
)
fig_hourly.update_layout(title_x=0.5)
fig_hourly.show()


EDA 2: HOURLY LISTENING HEATMAP


In [7]:
# --- Visualization 3: Top Artists and Songs ---
print(f"\n{'='*70}")
print("EDA 3: TOP ARTISTS AND SONGS")
print(f"{'='*70}")

# Calculate top artists by listening time
artist_listening_time = df_cleaned.groupby('master_metadata_album_artist_name')['ms_played'].sum() / (1000 * 60)
top_artists = artist_listening_time.sort_values(ascending=False).head(15)

# Calculate top songs by play count
track_play_counts = df_cleaned['master_metadata_track_name'].value_counts().head(15)

# Create subplots
fig_top = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Top 15 Artists by Listening Time (Minutes)', 'Top 15 Most Played Songs'),
    horizontal_spacing=0.15
)

# Top Artists Bar Chart
fig_top.add_trace(
    go.Bar(y=top_artists.index, x=top_artists.values, orientation='h', marker_color='lightgreen'),
    row=1, col=1
)

# Top Songs Bar Chart
fig_top.add_trace(
    go.Bar(y=track_play_counts.index, x=track_play_counts.values, orientation='h', marker_color='lightblue'),
    row=1, col=2
)

fig_top.update_layout(
    title_text='Top Artists and Songs',
    showlegend=False,
    template='plotly_dark',
    title_x=0.5,
    height=500
)
fig_top.update_yaxes(autorange="reversed")
fig_top.show()


EDA 3: TOP ARTISTS AND SONGS


---

## Section 3: Feature Engineering and Data Preparation

This is the most critical phase of the pipeline. Here, we transition from analyzing streaming history to preparing a dataset for machine learning. This involves:
1.  **Loading the Feature Dataset**: A pre-processed CSV containing unique tracks and their **348 audio features**.
2.  **Data Filtering**: Selecting genres with sufficient samples for robust model training.
3.  **Feature Extraction and Assembly**: Parsing and combining all 348 features into a single feature matrix (`X`).
4.  **Label Encoding**: Converting genre names into numerical labels (`y`).
5.  **Data Splitting**: Dividing the data into training, validation, and test sets.
6.  **Feature Scaling**: Standardizing features to ensure fair model training.

### 3.1 Loading the Machine Learning Dataset

For genre classification, we must use a dataset of **unique tracks**, not streaming events. Using the streaming history would cause **data leakage**, as the same song could appear in both the training and test sets, leading to an artificially inflated and misleading accuracy score.
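To make the leakage concern concrete, here is a minimal sketch on a toy streaming log. The rows and genre labels are hypothetical, and this is not necessarily how `audio_features_with_genres.csv` was built; it only assumes a `spotify_track_uri` column like the one in the cleaned data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy log: each row is one streaming event, so popular songs repeat.
events = pd.DataFrame({
    "spotify_track_uri": ["t1", "t1", "t1", "t2", "t2", "t3", "t4", "t4"],
    "genre": ["worship", "worship", "worship", "lo-fi", "lo-fi", "worship", "lo-fi", "lo-fi"],
})

# Splitting raw events can place the same track on both sides of the split.
ev_train, ev_test = train_test_split(events, test_size=0.5, random_state=0)
leaked = set(ev_train["spotify_track_uri"]) & set(ev_test["spotify_track_uri"])

# Deduplicating first guarantees that each track lands in exactly one split.
tracks = events.drop_duplicates(subset="spotify_track_uri").reset_index(drop=True)
tr_train, tr_test = train_test_split(tracks, test_size=0.5, random_state=0)
overlap = set(tr_train["spotify_track_uri"]) & set(tr_test["spotify_track_uri"])

print(f"Tracks shared across splits (raw events):    {len(leaked)}")
print(f"Tracks shared across splits (unique tracks): {len(overlap)}")
```

With one row per track, the overlap between splits is empty by construction, which is the property the modeling dataset relies on.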

We will now load `audio_features_with_genres.csv`, where each row represents one unique song and its corresponding audio features.

In [8]:
# Load the dataset containing unique tracks and their audio features
unique_tracks_path = '../Extracted_Features/audio_features_with_genres.csv'

print(f"Loading unique tracks dataset from: {unique_tracks_path}")
df_tracks = pd.read_csv(unique_tracks_path)

print(f"\n{'='*70}")
print(f"UNIQUE TRACKS DATASET - SUMMARY")
print(f"{'='*70}")
print(f"Total unique tracks: {len(df_tracks):,}")
print(f"Total columns (metadata + features): {len(df_tracks.columns)}")

# Display genre distribution
if 'genre' in df_tracks.columns:
    print(f"\n--- Genre Distribution ---")
    genre_counts = df_tracks['genre'].value_counts()
    print(f"Number of unique genres: {len(genre_counts)}")
    print(f"\nTop 10 genres:")
    display(genre_counts.head(10))
else:
    print("\n⚠ Warning: 'genre' column not found in dataset")

print(f"\nSample of the dataset (metadata columns):")
display_cols = ['track_name', 'artist_name', 'genre', 'duration', 'tempo']
display(df_tracks[display_cols].head())

Loading unique tracks dataset from: ../Extracted_Features/audio_features_with_genres.csv

UNIQUE TRACKS DATASET - SUMMARY
Total unique tracks: 1,853
Total columns (metadata + features): 33

--- Genre Distribution ---
Number of unique genres: 118

Top 10 genres:


genre
Unknown           422
worship           366
afrobeats         259
lo-fi             112
african gospel     67
gospel             56
uk drill           56
christian          41
soft pop           34
new age            33
Name: count, dtype: int64


Sample of the dataset (metadata columns):


Unnamed: 0,track_name,artist_name,genre,duration,tempo
0,SNAP,Rosa Linn,Unknown,29.712653,172.265625
1,Lord Send Revival - Live,Hillsong Young & Free,worship,29.712653,99.384014
2,Somewhere Only We Know,Gustixa,Unknown,22.328027,86.132812
3,Happier,Marshmello,edm,29.712653,89.102909
4,Firm Foundation (He Won't) [feat. Cody Carnes],Maverick City Music,worship,29.712653,161.499023


### 3.2 Data Preparation for Modeling

This single, comprehensive code cell performs all the necessary steps to prepare the data for machine learning:
1.  **Filter Genres**: Removes genres with fewer than 10 samples so that every remaining class has enough tracks to appear in the stratified training, validation, and test splits.
2.  **Assemble Feature Matrix (X)**:
    *   Extracts 14 simple numerical features.
    *   Parses and flattens 334 array-based features (MFCC, Chroma, etc.).
    *   Combines them into a final feature matrix `X` of shape `(n_samples, 348)`.
3.  **Encode Target Labels (y)**: Converts the 20 genre strings into integer labels.
4.  **Split Data**: Performs a stratified 70/15/15 split for training, validation, and test sets.
5.  **Scale Features**: Applies `StandardScaler` to the feature sets to normalize their ranges.
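The feature count above can be checked with a short sketch. The dimensions are the ones listed in this notebook; the example cell value is hypothetical and assumes the array columns are stored in the CSV as stringified Python lists, which is what the parsing step below expects.

```python
import ast
import numpy as np

# Dimensions of each array-valued column, as listed in the notebook.
array_dims = {
    "mfcc_mean": 20, "mfcc_std": 20, "chroma_mean": 12, "chroma_std": 12,
    "mel_spec_mean": 128, "mel_spec_std": 128,
    "spec_contrast_mean": 7, "spec_contrast_std": 7,
}
n_simple = 14  # simple numerical features

n_array = sum(array_dims.values())  # 40 + 24 + 256 + 14 = 334
n_total = n_simple + n_array        # 14 + 334 = 348
print(n_array, n_total)

# Hypothetical cell value: a list stored as a string after a CSV round trip.
cell = "[0.1, 0.2, 0.3]"
vec = np.array(ast.literal_eval(cell))
print(vec.shape)
```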

In [13]:
def prepare_data(df_tracks):
    """
    A comprehensive function to process the raw track data into ML-ready datasets.
    """
    # ======================================================================
    # STEP 1: FILTERING GENRES
    # ======================================================================
    print("="*70)
    print("STEP 1: FILTERING GENRES")
    print("="*70)
    print("Filtering to genres with at least 10 samples...")
    genre_counts = df_tracks['genre'].value_counts()
    genres_to_keep = genre_counts[genre_counts >= 10].index
    df_filtered = df_tracks[df_tracks['genre'].isin(genres_to_keep)].copy()
    print(f"Genres before filtering: {len(genre_counts)}")
    print(f"Genres after filtering:  {len(genres_to_keep)}")
    print(f"Tracks remaining: {len(df_filtered):,}")

    # ======================================================================
    # STEP 2: ASSEMBLING FEATURE MATRIX (X)
    # ======================================================================
    print("\n" + "="*70)
    print("STEP 2: ASSEMBLING FEATURE MATRIX (X)")
    print("="*70)
    
    # Define feature columns
    global numerical_feature_cols, array_feature_cols, feature_names
    numerical_feature_cols = [
        'duration', 'tempo', 'spec_centroid_mean', 'spec_centroid_std',
        'spec_bandwidth_mean', 'spec_bandwidth_std', 'spec_rolloff_mean',
        'spec_rolloff_std', 'zero_crossing_mean', 'zero_crossing_std',
        'rms_mean', 'rms_std', 'beat_count', 'beat_tempo'
    ]
    array_feature_cols = {
        'mfcc_mean': 20, 'mfcc_std': 20, 'chroma_mean': 12, 'chroma_std': 12,
        'mel_spec_mean': 128, 'mel_spec_std': 128, 'spec_contrast_mean': 7, 'spec_contrast_std': 7
    }
    
    # Extract simple numerical features
    X_numerical = df_filtered[numerical_feature_cols].values
    print(f"Extracted {X_numerical.shape[1]} simple numerical features.")

    # Function to safely parse string representations of lists/arrays
    def parse_array(s, dim):
        try:
            return np.array(ast.literal_eval(s))
        except (ValueError, SyntaxError):
            # Fall back to a zero vector sized for this column, so np.vstack
            # receives rows of matching length
            return np.zeros(dim)

    # Parse and stack array features
    all_array_features = []
    for col, dim in array_feature_cols.items():
        print(f"Parsing {col} ({dim} dimensions)...")
        # Apply the parsing function and stack the results
        feature_matrix = np.vstack(df_filtered[col].apply(lambda s: parse_array(s, dim)).values)
        # Ensure the feature matrix has the correct number of dimensions
        if feature_matrix.shape[1] != dim:
            # Handle cases where parsing might result in incorrect shapes
            # This could involve padding or truncating, but for now, we'll flag it
            print(f"  Warning: Mismatch in expected dimension for {col}. Expected {dim}, got {feature_matrix.shape[1]}. Adjusting...")
            # A simple fix: truncate or pad with zeros
            adjusted_matrix = np.zeros((feature_matrix.shape[0], dim))
            min_dim = min(dim, feature_matrix.shape[1])
            adjusted_matrix[:, :min_dim] = feature_matrix[:, :min_dim]
            feature_matrix = adjusted_matrix
        all_array_features.append(feature_matrix)

    # Combine all features
    X = np.hstack([X_numerical] + all_array_features)
    print("\nFeature matrix 'X' assembled successfully.")
    print(f"Final feature matrix shape: {X.shape}")
    
    # Store feature names for later use
    feature_names = numerical_feature_cols.copy()
    for col, dim in array_feature_cols.items():
        feature_names.extend([f"{col}_{i}" for i in range(dim)])

    # ======================================================================
    # STEP 3: ENCODING TARGET LABELS (y)
    # ======================================================================
    print("\n" + "="*70)
    print("STEP 3: ENCODING TARGET LABELS (y)")
    print("="*70)
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(df_filtered['genre'])
    print(f"Encoded {len(label_encoder.classes_)} genres into numerical labels.")
    print(f"Target vector 'y' shape: {y.shape}")

    # ======================================================================
    # STEP 4: SPLITTING DATA
    # ======================================================================
    print("\n" + "="*70)
    print("STEP 4: SPLITTING DATA")
    print("="*70)
    # First split: 70% train, 30% temp (validation + test)
    train_indices, temp_indices = train_test_split(
        np.arange(len(X)), test_size=0.3, random_state=42, stratify=y
    )
    # Second split: 50% of temp -> 15% validation, 15% test
    val_indices, test_indices = train_test_split(
        temp_indices, test_size=0.5, random_state=42, stratify=y[temp_indices]
    )
    
    X_train, y_train = X[train_indices], y[train_indices]
    X_val, y_val = X[val_indices], y[val_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    
    print(f"Training set:    {len(X_train)} samples × {X_train.shape[1]} features")
    print(f"Validation set:   {len(X_val)} samples × {X_val.shape[1]} features")
    print(f"Test set:         {len(X_test)} samples × {X_test.shape[1]} features")

    # ======================================================================
    # STEP 5: SCALING FEATURES
    # ======================================================================
    print("\n" + "="*70)
    print("STEP 5: SCALING FEATURES")
    print("="*70)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)
    print("✓ Features for training, validation, and test sets have been scaled.")
    
    # Return all the artifacts needed for modeling
    return X_train, X_val, X_test, y_train, y_val, y_test, scaler, label_encoder, X_train_scaled, X_val_scaled, X_test_scaled

# --------------------------------------------------------------------------------
# EXECUTE THE DATA PREPARATION PIPELINE
# --------------------------------------------------------------------------------
# Execute the function and unpack all the returned objects into the global scope
X_train, X_val, X_test, y_train, y_val, y_test, scaler, label_encoder, X_train_scaled, X_val_scaled, X_test_scaled = prepare_data(df_tracks.copy())

print("\nData preparation complete and ready for model training!")

STEP 1: FILTERING GENRES
Filtering to genres with at least 10 samples...
Genres before filtering: 118
Genres after filtering:  20
Tracks remaining: 1,612

STEP 2: ASSEMBLING FEATURE MATRIX (X)
Extracted 14 simple numerical features.
Parsing mfcc_mean (20 dimensions)...
Parsing mfcc_std (20 dimensions)...
Parsing chroma_mean (12 dimensions)...
Parsing chroma_std (12 dimensions)...
Parsing mel_spec_mean (128 dimensions)...
Parsing mel_spec_std (128 dimensions)...
Parsing spec_contrast_mean (7 dimensions)...
Parsing spec_contrast_std (7 dimensions)...

Feature matrix 'X' assembled successfully.
Final feature matrix shape: (1612, 348)

STEP 3: ENCODING TARGET LABELS (y)
Encoded 20 genres into numerical labels.
Target vector 'y' shape: (1612,)

STEP 4: SPLITTING DATA
Training set:    1128 samples × 348 features
Validation set:   242 samples × 348 features
Test set:         242 samples × 348 features

STEP 5: SCALING FEATURES
✓ Features for training, validation, and test sets have been scaled.

Data preparation complete and ready for model training!

---

## Section 4: Feature Analysis

Before training models, it's crucial to understand the features we've engineered. This section provides a detailed inventory and analysis of our 348 features to ensure they are suitable for machine learning. We will investigate:
1.  **Feature Inventory**: A detailed breakdown of all simple and array-based features.
2.  **Feature Correlation**: A heatmap to identify and understand relationships between features.
3.  **Feature Scaling**: An analysis of feature ranges to demonstrate the necessity of standardization.

In [15]:
# Create a detailed inventory of all 348 features
print(f"{'='*80}")
print(f"FEATURE INVENTORY (348 TOTAL FEATURES)")
print(f"{'='*80}")

# 1. Simple Numerical Features (14)
print(f"\n{'─'*80}\n1. SIMPLE NUMERICAL FEATURES ({len(numerical_feature_cols)} features)\n{'─'*80}")
feature_categories = {
    'Rhythm': ['duration', 'tempo', 'beat_count', 'beat_tempo'],
    'Timbre/Brightness': ['spec_centroid_mean', 'spec_centroid_std'],
    'Texture': ['spec_bandwidth_mean', 'spec_bandwidth_std'],
    'Frequency Shape': ['spec_rolloff_mean', 'spec_rolloff_std'],
    'Percussiveness': ['zero_crossing_mean', 'zero_crossing_std'],
    'Loudness': ['rms_mean', 'rms_std']
}
for category, features in feature_categories.items():
    print(f"  • {category}: {', '.join(features)}")

# 2. Array-Based Features (334)
total_array_feats = sum(array_feature_cols.values())
print(f"\n{'─'*80}\n2. ARRAY-BASED FEATURES ({total_array_feats} features)\n{'─'*80}")
print(f"These are multi-dimensional vectors capturing complex audio patterns:\n")
print(f"  • Mel-Frequency Cepstral Coefficients (MFCCs): {array_feature_cols['mfcc_mean'] + array_feature_cols['mfcc_std']} features")
print(f"    (Captures timbre and vocal characteristics)\n")
print(f"  • Chroma Features: {array_feature_cols['chroma_mean'] + array_feature_cols['chroma_std']} features")
print(f"    (Captures harmonic/melodic content - which notes are played)\n")
print(f"  • Mel Spectrogram: {array_feature_cols['mel_spec_mean'] + array_feature_cols['mel_spec_std']} features")
print(f"    (Energy across 128 frequency bins on a human-perceived scale)\n")
print(f"  • Spectral Contrast: {array_feature_cols['spec_contrast_mean'] + array_feature_cols['spec_contrast_std']} features")
print(f"    (Difference between peaks and valleys in the sound spectrum)\n")

print(f"✓ Feature inventory complete.")

FEATURE INVENTORY (348 TOTAL FEATURES)

────────────────────────────────────────────────────────────────────────────────
1. SIMPLE NUMERICAL FEATURES (14 features)
────────────────────────────────────────────────────────────────────────────────
  • Rhythm: duration, tempo, beat_count, beat_tempo
  • Timbre/Brightness: spec_centroid_mean, spec_centroid_std
  • Texture: spec_bandwidth_mean, spec_bandwidth_std
  • Frequency Shape: spec_rolloff_mean, spec_rolloff_std
  • Percussiveness: zero_crossing_mean, zero_crossing_std
  • Loudness: rms_mean, rms_std

────────────────────────────────────────────────────────────────────────────────
2. ARRAY-BASED FEATURES (334 features)
────────────────────────────────────────────────────────────────────────────────
These are multi-dimensional vectors capturing complex audio patterns:

  • Mel-Frequency Cepstral Coefficients (MFCCs): 40 features
    (Captures timbre and vocal characteristics)

  • Chroma Features: 24 features
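    (Captures harmonic/melodic content - which notes are played)

  • Mel Spectrogram: 256 features
    (Energy across 128 frequency bins on a human-perceived scale)

  • Spectral Contrast: 14 features
    (Difference between peaks and valleys in the sound spectrum)

✓ Feature inventory complete.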

In [16]:
# --- Feature Analysis 1: Correlation Heatmap ---
print(f"\n{'='*80}")
print("FEATURE ANALYSIS 1: CORRELATION OF NUMERICAL FEATURES")
print(f"{'='*80}")

# Calculate the correlation matrix for the simple numerical features
corr_matrix = pd.DataFrame(X_train_scaled, columns=feature_names).loc[:, numerical_feature_cols].corr()

# Create the heatmap
fig_corr = px.imshow(
    corr_matrix,
    text_auto=".2f",
    aspect="auto",
    title="Correlation Matrix of Simple Numerical Features",
    labels=dict(color="Correlation"),
    template='plotly_dark'
)
fig_corr.update_layout(title_x=0.5)
fig_corr.show()

print("\n**Interpretation**: The heatmap shows relationships between features. For example, 'tempo' and 'beat_tempo' are perfectly correlated (1.0), as expected. High correlations (e.g., between spectral features) indicate some redundancy, which tree-based models can handle well.")


FEATURE ANALYSIS 1: CORRELATION OF NUMERICAL FEATURES



**Interpretation**: The heatmap shows relationships between features. For example, 'tempo' and 'beat_tempo' are perfectly correlated (1.0), as expected. High correlations (e.g., between spectral features) indicate some redundancy, which tree-based models can handle well.
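Beyond reading the heatmap by eye, redundant pairs can be listed programmatically. The following is a minimal sketch on synthetic data; the column names `a`, `b`, `c` and the 0.9 cutoff are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: 'b' is an exact linear function of 'a', 'c' is independent noise.
a = rng.normal(size=n)
toy = pd.DataFrame({"a": a, "b": 2.0 * a + 1.0, "c": rng.normal(size=n)})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.9  # illustrative cutoff
redundant = pairs[pairs > threshold]
print(redundant)
```

Applied to `corr_matrix`, the same pattern would surface pairs such as `tempo` and `beat_tempo` without scanning the plot.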


In [17]:
# --- Feature Analysis 2: Feature Scales ---
print(f"\n{'='*80}")
print("FEATURE ANALYSIS 2: NECESSITY OF FEATURE SCALING")
print(f"{'='*80}")

# Calculate the range (max - min) for each of the 348 features before scaling
X_train_df = pd.DataFrame(X_train)
feature_ranges = X_train_df.max() - X_train_df.min()

# Create a bar plot of the feature ranges
fig_scales = px.bar(
    x=feature_ranges.index,
    y=feature_ranges.values,
    title='Range of Values for Each Feature (Before Scaling)',
    labels={'x': 'Feature Index', 'y': 'Range (Max - Min)'},
    template='plotly_dark'
)
fig_scales.update_layout(title_x=0.5)
fig_scales.show()

min_range = feature_ranges.min()
max_range = feature_ranges.max()
print(f"\n**Interpretation**: The ranges of the features vary dramatically.")
print(f"  - Smallest feature range: {min_range:.4f}")
print(f"  - Largest feature range:  {max_range:,.2f}")
print(f"  - Scale ratio: {max_range/min_range:,.0f}:1")
print("\nThis vast difference in scales makes **Standardization** (scaling to mean=0, std=1) essential for scale-sensitive models like Logistic Regression. Tree-based models are largely insensitive to feature scale, so for them standardization is harmless rather than necessary.")


FEATURE ANALYSIS 2: NECESSITY OF FEATURE SCALING



**Interpretation**: The ranges of the features vary dramatically.
  - Smallest feature range: 0.1360
  - Largest feature range:  7,250.36
  - Scale ratio: 53,321:1

This vast difference in scales makes **Standardization** (scaling to mean=0, std=1) essential for scale-sensitive models like Logistic Regression. Tree-based models are largely insensitive to feature scale, so for them standardization is harmless rather than necessary.
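
As a rough illustration of what standardization does to features on very different scales, the sketch below applies z = (x - mean) / std to toy data using NumPy only. The arrays `X_train_toy` and `X_test_toy` are hypothetical stand-ins, not the notebook's data. The statistics are computed on the training split and reused for the test split, which is the same fit-on-train, transform-on-test pattern that scikit-learn's `StandardScaler` follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
X_train_toy = np.column_stack([
    rng.uniform(0.0, 0.1, size=100),      # small-range feature
    rng.uniform(0.0, 7000.0, size=100),   # large-range feature
])
X_test_toy = np.column_stack([
    rng.uniform(0.0, 0.1, size=20),
    rng.uniform(0.0, 7000.0, size=20),
])

# Statistics come from the training split only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std    # reuse the training statistics

ranges_before = X_train_toy.max(axis=0) - X_train_toy.min(axis=0)
ranges_after = X_train_std.max(axis=0) - X_train_std.min(axis=0)
print("Range ratio before scaling:", ranges_before.max() / ranges_before.min())
print("Range ratio after scaling: ", ranges_after.max() / ranges_after.min())
```

After scaling, both columns have zero mean and unit variance on the training split, so neither dominates a gradient-based optimizer simply because of its units.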


---

## Section 5: Model Training and Cross-Validation

With the data prepared and analyzed, we can now train our classification models. This section covers:
1.  **Model Selection**: An overview of the three models chosen for this task.
2.  **Cross-Validation Strategy**: A robust 5-fold stratified cross-validation to evaluate each model's performance on the training data.
3.  **Training and Evaluation Loop**: A systematic process to cross-validate each model and record its performance metrics (accuracy, precision, recall, and F1-score).
4.  **Results Comparison**: A summary table and visualization comparing the cross-validation results to identify the best-performing model.

### 5.1 Model Selection

We will train and compare three distinct classification models, each representing a different approach to learning from the data:

1.  **Logistic Regression**: A fast, linear model that serves as a strong baseline. Its coefficients are easy to inspect for feature importance, but it assumes a linear relationship between the features and the log-odds of each class.
2.  **Random Forest**: An ensemble of decision trees trained on bootstrap samples. It is a non-linear model that can capture complex interactions between features, and averaging many trees makes it less prone to overfitting than a single decision tree.
3.  **Gradient Boosting**: Another tree-based ensemble model that builds trees sequentially, with each new tree correcting the errors of the previous one. It is often a top-performing model in classification tasks.

### 5.2 Cross-Validation Setup

To get a reliable estimate of each model's performance, we will use **5-fold stratified cross-validation**. This process involves:
1.  Splitting the training data (`X_train`, `y_train`) into 5 "folds".
2.  Training each model on 4 of the folds and evaluating it on the 5th (the "hold-out" fold).
3.  Repeating this process 5 times, ensuring each fold is used as the hold-out set exactly once.
4.  Averaging the performance metrics (Accuracy, Precision, Recall, F1-Score) across all 5 folds.

This approach gives a more robust measure of generalization performance than a single train/validation split and helps ensure our results are not due to a lucky or unlucky split of the data. We will store the results in a DataFrame for easy comparison.
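
The splitting procedure described above can be illustrated on toy labels. This is a minimal sketch, assuming a hypothetical two-class label array `y_toy` rather than the notebook's genre labels. It shows that each hold-out fold keeps the overall class ratio and that every sample is held out exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 20 samples of class 0, 10 of class 1
y_toy = np.array([0] * 20 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

held_out = []      # indices used as hold-out across all folds
fold_counts = []   # per-fold class counts in the hold-out set
for fold, (train_idx, val_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    counts = np.bincount(y_toy[val_idx], minlength=2).tolist()
    fold_counts.append(counts)
    held_out.extend(val_idx.tolist())
    print(f"Fold {fold}: hold-out size={len(val_idx)}, class counts={counts}")
```

Stratification can only preserve proportions for classes with at least as many samples as folds. A genre with fewer than five training samples cannot appear in every hold-out fold, which is worth keeping in mind when reading per-fold metrics on a heavily imbalanced dataset.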

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold
import numpy as np

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

# Setup cross-validation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring_metrics = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

# Store results
cv_results = []

# Loop through models and perform cross-validation
for model_name, model in models.items():
    print(f"Running cross-validation for {model_name}...")
    
    # Perform cross-validation
    scores = cross_validate(
        estimator=model,
        X=X_train_scaled,
        y=y_train,
        cv=cv_strategy,
        scoring=scoring_metrics,
        n_jobs=-1 # Use all available CPU cores
    )
    
    # Store the mean of the scores
    result = {
        'Model': model_name,
        'Accuracy': np.mean(scores['test_accuracy']),
        'Precision': np.mean(scores['test_precision_weighted']),
        'Recall': np.mean(scores['test_recall_weighted']),
        'F1-Score': np.mean(scores['test_f1_weighted'])
    }
    cv_results.append(result)
    print(f"Finished for {model_name}.")

# Create a DataFrame from the results
cv_results_df = pd.DataFrame(cv_results)

# Display the results, formatted for clarity
cv_results_df.style.format({
    'Accuracy': '{:.4f}',
    'Precision': '{:.4f}',
    'Recall': '{:.4f}',
    'F1-Score': '{:.4f}'
}).set_caption("5-Fold Cross-Validation Results").hide(axis='index')

Running cross-validation for Logistic Regression...


  raw_prediction = X @ weights.T + intercept  # ndarray, likely C-contiguous

Finished for Logistic Regression.
Running cross-validation for Random Forest...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Finished for Random Forest.
Running cross-validation for Gradient Boosting...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Finished for Gradient Boosting.




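5-Fold Cross-Validation Results

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 0.5895 | 0.5587 | 0.5895 | 0.5659 |
| Random Forest | 0.5824 | 0.5185 | 0.5824 | 0.5237 |
| Gradient Boosting | 0.5639 | 0.5317 | 0.5639 | 0.5244 |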


### 5.3 Results and Model Selection

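The cross-validation results show that **Logistic Regression** is the top-performing model, with the highest score on every metric (accuracy 0.5895, weighted precision 0.5587, weighted F1-score 0.5659). Weighted recall equals accuracy by definition, so it carries no separate information. Random Forest comes close on accuracy (0.5824), but its lower precision (0.5185) and F1-score (0.5237) suggest its predictions are less reliable for some classes. The gaps between the models are small, so the ranking should be read with some caution.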

Given its strong performance, simplicity, and interpretability, we will select **Logistic Regression** as our final model to evaluate on the unseen test set.
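
Because the mean scores of the three models are close, it can help to look at the fold-to-fold spread as well as the mean before declaring a winner. The sketch below is one way to do that, shown on synthetic data from `make_classification`. The dataset, the reduced `n_estimators=50`, and the names `candidates` and `summary` are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, estimator in candidates.items():
    # One weighted-F1 score per fold
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="f1_weighted")
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

If the difference between two means is smaller than the per-fold standard deviation, the ranking may simply reflect how the folds happened to be drawn.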

---
## Section 6: Final Model Evaluation on Test Set

Now that we have selected our best model based on cross-validation, it's time for the final performance assessment. We will train the Logistic Regression model on the **entire training set** (`X_train_scaled` and `y_train`) and then evaluate its performance on the **unseen test set** (`X_test_scaled` and `y_test`).

This is the most critical evaluation, as the model has never been exposed to the test data during training or model selection. The results here will give us the best estimate of how the model would perform on new, real-world data. We will generate:
1.  A **Classification Report**, showing detailed metrics (Precision, Recall, F1-Score) for each genre.
2.  A **Confusion Matrix**, to visualize which genres the model is classifying correctly and where it is getting confused.

In [14]:
from sklearn.metrics import classification_report, confusion_matrix
import plotly.figure_factory as ff

# Initialize and train the final model on the entire training set
final_model = LogisticRegression(max_iter=1000, random_state=42)
print("Training the final Logistic Regression model on the full training dataset...")
final_model.fit(X_train_scaled, y_train)
print("✓ Model training complete.")

# Make predictions on the test set
print("\nMaking predictions on the unseen test set...")
y_pred = final_model.predict(X_test_scaled)
print("✓ Predictions complete.")

# Get the original genre names for the report
target_names = label_encoder.classes_

# Generate and print the classification report
print("\n======================================================================")
print("FINAL MODEL EVALUATION: CLASSIFICATION REPORT")
print("======================================================================")
report = classification_report(y_test, y_pred, target_names=target_names, zero_division=0)
print(report)


# Generate and display the confusion matrix
print("\n======================================================================")
print("FINAL MODEL EVALUATION: CONFUSION MATRIX")
print("======================================================================")

# Create the matrix
cm = confusion_matrix(y_test, y_pred)

# The label encoder gives us the original string labels in order
x_labels = target_names
y_labels = target_names

# Create the heatmap figure
fig = ff.create_annotated_heatmap(
    z=cm,
    x=list(x_labels),
    y=list(y_labels),
    colorscale='Blues',
    showscale=True
)

# Add titles and labels
fig.update_layout(
    title_text='<b>Confusion Matrix</b><br><i>Predicted vs. Actual Genre</i>',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    xaxis=dict(tickangle=-45),
    width=800,
    height=800
)

fig.show()

Training the final Logistic Regression model on the full training dataset...
✓ Model training complete.

Making predictions on the unseen test set...
✓ Predictions complete.

FINAL MODEL EVALUATION: CLASSIFICATION REPORT
                   precision    recall  f1-score   support

          Unknown       0.57      0.65      0.61        63
   african gospel       0.45      0.50      0.48        10
       afro adura       0.00      0.00      0.00         1
        afrobeats       0.67      0.74      0.71        39
         amapiano       0.33      0.50      0.40         2
        christian       0.33      0.33      0.33         6
          country       0.00      0.00      0.00         3
              edm       0.00      0.00      0.00         3
     egyptian pop       0.50      0.50      0.50         2
           gospel       0.33      0.11      0.17         9
            lo-fi       0.69      0.65      0.67        17
      lo-fi beats       0.00      0.00      0.00         3
    lo-fi h

---
## Section 7: Conclusion and Interpretation

This final section summarizes the project's findings, interprets the performance of our final model, and discusses potential avenues for future work.

### 7.1 Interpretation of Results

The final evaluation of our chosen model, **Logistic Regression**, provides several key insights:

-   **Overall Performance**: The model achieved an **accuracy of 63%** and a **weighted average F1-score of 0.61** on the unseen test set, slightly above its cross-validated accuracy of 0.59. This is a moderate but meaningful ability to predict a song's genre from its audio features, and it indicates that the features carry genre-specific information.

-   **Performance on Well-Represented Genres**: The model performed considerably better on genres with a large number of samples. For instance:
    -   `worship`: F1-score of 0.80
    -   `afrobeats`: F1-score of 0.71
    -   `soft pop`: F1-score of 0.75
    -   `new age`: F1-score of 1.00 (perfectly classified)

    This suggests that with sufficient data, the model can learn the distinct audio signatures of different genres.

-   **Challenges with Imbalanced Data**: The model's primary weakness was its performance on genres with very few samples (e.g., `country`, `edm`, `rap`, `reggaeton`), where its F1-score was 0.00. The confusion matrix shows that the model tends to misclassify these rare genres as more dominant ones like `Unknown` or `worship`. This is a typical outcome on a highly imbalanced dataset, and it also means the weighted averages reported above are driven mostly by the large classes.
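
To see how a weighted average can mask failures on rare classes, the following sketch compares weighted and macro-averaged F1 on a small, hypothetical label set (`y_true_toy`, `y_pred_toy`) in which the classifier never predicts the rare class. These arrays are invented for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is dominant (18 samples), class 1 is rare (2 samples)
y_true_toy = np.array([0] * 18 + [1] * 2)
# A classifier that always predicts the dominant class
y_pred_toy = np.zeros_like(y_true_toy)

# Weighted F1 averages per-class F1 in proportion to class support
weighted_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0)
# Macro F1 gives every class equal weight, regardless of support
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

print(f"Weighted F1: {weighted_f1:.3f}")
print(f"Macro F1:    {macro_f1:.3f}")
```

Reporting the macro average alongside the weighted one gives a more direct view of how the model treats small genres.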

### 7.2 Project Conclusion

This project successfully built an end-to-end machine learning pipeline to classify music genres. We demonstrated that a collection of 348 audio features can predict genre with reasonable success, particularly for well-represented classes. The **Logistic Regression** model emerged as the most effective, providing a good balance of performance and simplicity.

The most significant limitation identified was the **severe class imbalance** in the dataset, which hindered the model's ability to learn the patterns of rare genres.

### 7.3 Future Work

To build upon this project, the following steps could be taken:

1.  **Address Class Imbalance**:
    *   **Collect More Data**: The most effective solution would be to gather more song samples for the under-represented genres.
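    *   **Use Advanced Sampling Techniques**: Implement methods like **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for the minority classes, helping the model learn their characteristics better. Because SMOTE interpolates between a sample and its nearest neighbors of the same class, it cannot help genres that have only a single example, so those genres would still need more data.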

2.  **Hyperparameter Tuning**: Conduct a more thorough hyperparameter search (e.g., using `GridSearchCV` or `RandomizedSearchCV`) over the settings that were left at their defaults here, which may improve performance.

3.  **Explore Advanced Models**: Experiment with more complex, non-linear models like **Deep Neural Networks (DNNs)** or **Convolutional Neural Networks (CNNs)** applied to spectrogram images, which might capture more subtle and hierarchical patterns in the audio.

4.  **Feature Importance Analysis**: Investigate the coefficients of the trained Logistic Regression model to identify which of the 348 audio features were the most influential in predicting genre. This would provide deeper insights into the audio characteristics that define different musical styles.
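
As a starting point for the coefficient analysis in item 4, the sketch below ranks features by their mean absolute coefficient across classes. It is a minimal illustration on synthetic data, not the notebook's trained model: `X_demo`, `y_demo`, and `demo_feature_names` are hypothetical, and averaging absolute coefficients is just one simple way to summarize a multiclass model. Coefficient magnitudes are only comparable when the features are standardized, which is why the sketch scales the data first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audio features and genre labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
demo_feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Standardize so that coefficient magnitudes are on a common scale
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000, random_state=0).fit(X_scaled, y_demo)

# coef_ has shape (n_classes, n_features) for this multiclass problem
importance = np.abs(clf.coef_).mean(axis=0)
ranking = np.argsort(importance)[::-1]  # most influential feature first

for idx in ranking[:5]:
    print(f"{demo_feature_names[idx]}: mean |coef| = {importance[idx]:.3f}")
```

In the notebook itself, the same ranking could be computed from `final_model.coef_` together with `feature_names`, keeping in mind that strongly correlated features can split importance between them.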