## Finally! 🥳

### Saving the Preprocessing Pipeline

We have now invested considerable effort in building and fitting our preprocessing pipeline: handling missing values, scaling numeric features, and encoding categorical variables.

To make sure we can reproduce exactly the same preprocessing steps later, even after closing this notebook or when working in a different script, we can **save the fitted pipeline** to disk.

This allows us to:
- Reuse the exact same transformations on new data (e.g. test data or data in production)
- Ensure full reproducibility of our workflow
- Experiment with different preprocessing strategies and compare them fairly
- Keep preprocessing and modeling steps **consistent and versioned**

In short, saving the pipeline makes our entire data preparation process **reusable, consistent, and shareable**, which is an essential part of any professional machine learning workflow.

(From Notes)
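
As a minimal sketch of the saving step, assuming the pipeline is a scikit-learn `ColumnTransformer` persisted with `joblib.dump`: the toy data, the column split, and the file name `preprocessor.joblib` below are illustrative and not taken from this project. The sketch fits a small preprocessor, writes it to a temporary directory, reloads it, and checks that the reloaded object produces the same output.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values, for illustration only)
X_train = pd.DataFrame({
    "tempo": [120.0, 90.0, np.nan, 140.0],
    "energy": [0.8, 0.4, 0.6, 0.9],
    "duration_category": ["short", "medium", "medium", "long"],
})

numeric_cols = ["tempo", "energy"]
categorical_cols = ["duration_category"]

# Impute and scale numeric columns, one-hot encode the categorical one
preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

expected = preprocessor.fit_transform(X_train)

# Save the fitted preprocessor and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "preprocessor.joblib"
    joblib.dump(preprocessor, path)
    reloaded = joblib.load(path)

# The reloaded object should apply exactly the same transformation
roundtrip = reloaded.transform(X_train)
roundtrip_matches = np.allclose(expected, roundtrip)
print(roundtrip_matches)
```

A file saved this way is generally only safe to load with the same (or a compatible) scikit-learn version, so it can help to record the library version alongside the file.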


In [None]:
import joblib
import pandas as pd
# scikit-learn is not imported here, but it must be installed:
# joblib needs it in the background to reconstruct the saved objects.
import numpy as np
import json

In [None]:
# 1. Load the preprocessor and feature set list
loaded_preprocessor = joblib.load("../models/standard_scaler.joblib")
with open('../models/feature_sets.json', 'r') as f:
    feature_sets = json.load(f)

# 2. Load raw data
df_new = pd.read_csv("../data/dataset.csv")

# 3. RECREATE ENGINEERED FEATURES 
# Math features
df_new['duration_min'] = df_new['duration_ms'] / 60000
df_new['energy_x_danceability'] = df_new['energy'] * df_new['danceability']
df_new['loudness_x_energy'] = df_new['loudness'] * df_new['energy']
df_new['valence_x_danceability'] = df_new['valence'] * df_new['danceability']
df_new['tempo_log'] = np.log1p(df_new['tempo'])

# Logical/categorical features (these must exist before step 4, otherwise the column selection raises a KeyError)
df_new['is_instrumental'] = df_new['instrumentalness'] > 0.5
df_new['has_vocals'] = df_new['instrumentalness'] < 0.5
df_new['is_speech_heavy'] = df_new['speechiness'] > 0.66

# Duration category (example bins; make sure they match the ones used in the first notebook)
df_new['duration_category'] = pd.cut(df_new['duration_min'], 
                                     bins=[0, 2, 4, 10, 100], 
                                     labels=['short', 'medium', 'long', 'very_long'])

# 4. NOW SELECT THE FEATURES
X_new = df_new[feature_sets['full']]
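
The selection in step 4 raises a `KeyError` as soon as one of the listed features has not been recreated. A small check before selecting can name the missing columns explicitly. The helper `find_missing_columns` and the two example lists below are hypothetical stand-ins, not part of the original notebook.

```python
def find_missing_columns(required, available):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Hypothetical stand-ins for feature_sets['full'] and df_new.columns
required_features = ["duration_min", "tempo_log", "is_instrumental"]
available_columns = ["duration_ms", "duration_min", "tempo", "tempo_log"]

missing = find_missing_columns(required_features, available_columns)
if missing:
    print(f"Missing engineered features: {missing}")
```

With the objects from this notebook, the call would be `find_missing_columns(feature_sets['full'], df_new.columns)`.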


In [None]:
X_new.head()
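
Because the engineered features have to match the first notebook exactly, one option is to keep them in a single function that both notebooks import. The sketch below mirrors step 3. The function name `add_engineered_features` and the two sample rows are made up for illustration, and the bins are the example values used above.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with the engineered columns from step 3 added."""
    out = df.copy()
    # Math features
    out["duration_min"] = out["duration_ms"] / 60000
    out["energy_x_danceability"] = out["energy"] * out["danceability"]
    out["loudness_x_energy"] = out["loudness"] * out["energy"]
    out["valence_x_danceability"] = out["valence"] * out["danceability"]
    out["tempo_log"] = np.log1p(out["tempo"])
    # Logical features
    out["is_instrumental"] = out["instrumentalness"] > 0.5
    out["has_vocals"] = out["instrumentalness"] < 0.5
    out["is_speech_heavy"] = out["speechiness"] > 0.66
    # Duration category (example bins, in minutes)
    out["duration_category"] = pd.cut(
        out["duration_min"],
        bins=[0, 2, 4, 10, 100],
        labels=["short", "medium", "long", "very_long"],
    )
    return out


# Two made-up tracks, for illustration only
raw = pd.DataFrame({
    "duration_ms": [90_000, 210_000],
    "energy": [0.5, 0.8],
    "danceability": [0.6, 0.7],
    "loudness": [-10.0, -5.0],
    "valence": [0.3, 0.9],
    "tempo": [100.0, 128.0],
    "instrumentalness": [0.9, 0.0],
    "speechiness": [0.05, 0.7],
})

features = add_engineered_features(raw)
print(features[["duration_min", "duration_category", "is_instrumental"]])
```

Working on a copy leaves the raw DataFrame untouched, so the same input can be passed through different feature variants for comparison.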