# Train Song Classification Model

This notebook walks through the complete workflow for training a deep learning model that classifies songs from pre-extracted audio features: data preprocessing, model construction, training, evaluation, and export to CoreML for deployment.

Make sure the `features.csv` file exists at the path defined in the File Paths section. It must contain the pre-extracted audio features along with a `filename` and a `label` column.

## Imports and Dependencies

This section imports all required libraries. TensorFlow/Keras is used for model building, `scikit-learn` for preprocessing and dataset splitting, `joblib` for saving scalers, `pandas` for data handling, and `coremltools` for CoreML conversion.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import joblib
import json
import coremltools as ct
import os


## File Paths

Define all file paths used in the project: the saved model, the scaler, the label mapping, the CoreML export, and the feature CSV. Keeping them in one cell makes the notebook easier to maintain and to adapt to another environment.

In [None]:
MODEL_PATH = "/app/song-storage/model.keras"
SCALER_PATH = "/app/song-storage/scaler.pkl"
LABELS_PATH = "/app/song-storage/label_order.json"
COREML_PATH = "/app/song-storage/model.mlmodel"
FEATURES_CSV = "/app/song-storage/features.csv"
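
If the storage directory differs between machines, one option is to derive every path from a single base directory. The sketch below assumes a hypothetical `SONG_STORAGE_DIR` environment variable and a hypothetical `storage_path` helper, neither of which is part of this notebook. It falls back to the directory used above.

```python
import os

# Hypothetical environment variable (an assumption, not part of the notebook);
# falls back to the directory hard-coded in the cell above.
STORAGE_DIR = os.environ.get("SONG_STORAGE_DIR", "/app/song-storage")


def storage_path(filename):
    """Hypothetical helper: join a file name onto the storage directory."""
    return os.path.join(STORAGE_DIR, filename)


MODEL_PATH = storage_path("model.keras")
SCALER_PATH = storage_path("scaler.pkl")
LABELS_PATH = storage_path("label_order.json")
COREML_PATH = storage_path("model.mlmodel")
FEATURES_CSV = storage_path("features.csv")
```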


## Load Dataset

We load the CSV file containing pre-extracted audio features.

- `feature_columns` are all columns except `"filename"` and `"label"`.

- `X` contains the feature vectors and `y` the categorical labels.

Printing the shapes is a quick check that the file loaded with the expected number of rows and feature columns.

In [None]:
df = pd.read_csv(FEATURES_CSV)
feature_columns = [col for col in df.columns if col not in ["filename", "label"]]
X = df[feature_columns].values
y = df["label"].values
print("Features shape:", X.shape)
print("Labels shape:", y.shape)


## Standardize Features

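Feature scaling helps gradient-based training converge.
We use `StandardScaler` to standardize each feature to zero mean and unit variance.
The scaler is saved with `joblib` so that the same transformation can be applied at inference time.
Note that the scaler here is fit on the full dataset before the train/test split, so statistics from the test rows influence the transformation.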

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
joblib.dump(scaler, SCALER_PATH)
print("Scaler saved.")
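
To make the transformation concrete, here is a minimal pure-Python sketch of what standardization computes and why the fitted statistics need to be kept. `fit_standardizer` and `standardize` are hypothetical helpers written for illustration, and the feature rows are made up. The notebook itself relies on `StandardScaler`, which likewise uses the population standard deviation.

```python
import statistics


def fit_standardizer(rows):
    """Return per-column (means, stds) using the population std (ddof=0)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns get a scale of 1.0 so the division below is safe.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up feature rows for illustration only.
train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
new_row = [[4.0, 12.0]]

means, stds = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, means, stds)
# New data is transformed with the stored statistics, never refit.
new_scaled = standardize(new_row, means, stds)
```

Fitting the statistics on the training rows only, as in this sketch, is one way to keep test data from influencing the transformation.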


## Encode Labels

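Convert the categorical string labels into one-hot encoded vectors.
This allows the model to output probabilities across multiple dance classes.
We also build a `label_to_index` mapping from the sorted labels. The sorted label list itself is written to disk in the Save Model and Labels section so that predictions can be decoded consistently.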

In [None]:
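# Sort the labels so that class indices are deterministic across runs
unique_labels = sorted(df["label"].unique())
label_to_index = {label: idx for idx, label in enumerate(unique_labels)}
# Map each string label to its index, then one-hot encode the indices
y = to_categorical([label_to_index[label] for label in y], num_classes=len(unique_labels))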
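
As a rough illustration of the encoding step, the following pure-Python sketch builds the same kind of mapping and one-hot vectors by hand. The label names are made up, and `one_hot` is a hypothetical stand-in for `to_categorical`, not its implementation.

```python
def one_hot(index, num_classes):
    """Hypothetical helper: a vector of zeros with a single 1.0 at `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


# Made-up labels for illustration only.
labels = ["waltz", "tango", "waltz", "salsa"]

# Sorting fixes the index of each class regardless of row order.
unique_labels = sorted(set(labels))
label_to_index = {label: idx for idx, label in enumerate(unique_labels)}

encoded = [one_hot(label_to_index[label], len(unique_labels)) for label in labels]
```

Because the labels are sorted before indexing, the same set of labels always yields the same class indices.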


## Split Dataset

Split the data into training and test sets using an 80/20 split.
The model is trained on 80% of the rows, and the remaining 20% is held out to measure generalization on unseen data. Setting `random_state=42` makes the split reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
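
The training cell later holds out a further 20% of the training rows for validation (`validation_split=0.2`), so fewer rows are used for weight updates than the 80% figure suggests. The arithmetic below is a sketch for a hypothetical dataset of 1000 rows. `split_sizes` is an illustrative helper, and the libraries' exact rounding may differ by a row for sizes that do not divide evenly.

```python
def split_sizes(n_rows, test_pct=20, val_pct=20):
    """Hypothetical helper: row counts for the fit / validation / test subsets."""
    n_test = n_rows * test_pct // 100
    n_train_full = n_rows - n_test
    n_val = n_train_full * val_pct // 100
    n_fit = n_train_full - n_val
    return n_fit, n_val, n_test


# Assumed dataset size, for illustration only.
n_fit, n_val, n_test = split_sizes(1000)
print(f"fit: {n_fit}, validation: {n_val}, test: {n_test}")
```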


## Build Model

Construct a fully connected feedforward neural network with:

- Input Layer: Size matches the number of features

- Hidden Layers: Two dense layers (128 and 64 units) with ReLU activation and L2 regularization (factor 0.01)

- Dropout: Rate of 0.5 after each hidden layer to reduce overfitting

- Output Layer: Softmax with units equal to the number of labels

Compile with the `adam` optimizer and categorical cross-entropy loss, tracking accuracy as a metric.

In [None]:
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.5),
    Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.5),
    Dense(len(unique_labels), activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
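
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. The sketch below assumes a hypothetical 40 input features and 5 classes. The real values depend on `features.csv`.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


def count_params(n_features, n_classes):
    """Total trainable parameters for the 128 -> 64 -> n_classes stack."""
    layer_sizes = [n_features, 128, 64, n_classes]
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Assumed dimensions, for illustration only.
total = count_params(n_features=40, n_classes=5)
print(total)
```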


## Train Model

Training uses early stopping to limit overfitting:

- `monitor='val_loss'` with `patience=10` stops training once the validation loss has not improved for 10 consecutive epochs.

- `restore_best_weights=True` restores the weights from the epoch with the lowest validation loss.

`validation_split=0.2` holds out the last 20% of the training data for validation, so the model is monitored on rows it is not trained on.

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=32, callbacks=[early_stopping])
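
The patience logic can be sketched in a few lines of plain Python. This is a simplified simulation, not Keras's implementation: it assumes any decrease in validation loss counts as an improvement, and the loss values are made up. `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both zero-based."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting the patience budget.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses; a patience of 3 keeps the example short.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_epoch, best_epoch)
```

Here training stops three epochs after the lowest loss, and the weights from that best epoch are the ones `restore_best_weights=True` would bring back.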


## Save Model and Labels

Save the trained Keras model and the sorted list of labels.
This allows the same model to be loaded for inference and ensures labels remain consistent.

In [None]:
model.save(MODEL_PATH)
print("Model saved.")

with open(LABELS_PATH, "w") as f:
    json.dump(unique_labels, f)
print("Label order saved.")
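
The saved label order is what turns a row of predicted probabilities back into a label at inference time. The sketch below shows the round trip with a temporary file. The labels and probabilities are made up, and loading the actual model and scaler is left out.

```python
import json
import os
import tempfile

# Made-up, already sorted labels for illustration only.
unique_labels = ["salsa", "tango", "waltz"]

with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = os.path.join(tmp_dir, "label_order.json")
    # Write the label order the same way the cell above does.
    with open(labels_path, "w") as f:
        json.dump(unique_labels, f)
    # Read it back, as an inference script would.
    with open(labels_path) as f:
        loaded_labels = json.load(f)

# Made-up softmax output for a single song.
probabilities = [0.15, 0.05, 0.80]
predicted_index = probabilities.index(max(probabilities))
predicted_label = loaded_labels[predicted_index]
print(predicted_label)
```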


## Convert to CoreML

Optionally, convert the Keras model to CoreML for deployment on iOS devices.
We make the batch dimension flexible with `ct.RangeDim()` and set the minimum deployment target to iOS 14, which prepares the model for use in mobile applications.

In [None]:
coreml_model = ct.convert(
    model,
    source="tensorflow",
    inputs=[ct.TensorType(shape=(ct.RangeDim(), X_train.shape[1]))],
    minimum_deployment_target=ct.target.iOS14
)
coreml_model.save(COREML_PATH)
print("CoreML model saved.")


## Evaluate Model

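Finally, evaluate the trained model on the test set to report the final loss and accuracy.
This estimates the model’s predictive performance on held-out data. Because the scaler was fit on the full dataset before the split, the estimate may be slightly optimistic.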

In [None]:
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Evaluation - Loss: {loss:.4f}, Accuracy: {acc:.4f}")
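
For intuition about the two reported numbers, the sketch below computes categorical cross-entropy and accuracy by hand on made-up predictions. Both helpers are hypothetical. The loss that Keras reports also includes the L2 penalty terms from the regularized layers, so it will not match a plain cross-entropy computed this way.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(t * log(p)); eps guards against log(0)."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-probability class matches the label."""
    hits = sum(
        true_row.index(max(true_row)) == pred_row.index(max(pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Made-up one-hot labels and softmax outputs for three songs.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2]]

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"Loss: {loss:.4f}, Accuracy: {acc:.4f}")
```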
