# Music Genre Classification using Transfer Learning (EfficientNetB0)

**Course Project: Advanced Neural Networks**  
**Model Architecture:** EfficientNetB0 (Compound Scaling)  
**Dataset:** GTZAN Genre Collection  
**Dataset Link:** [Kaggle: GTZAN Dataset](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification/data)

---

## 1. Introduction
This notebook classifies music genres from audio using **Transfer Learning**: an **EfficientNetB0** model pre-trained on ImageNet is fine-tuned on Mel-Spectrograms derived from the audio samples.

### Key Technical Features:
- **Compound Scaling:** Balanced adjustment of network depth, width, and resolution.
- **Audio Slicing:** Augmenting data by dividing each 30-second track into ten 3-second segments.
- **Mel-Spectrograms:** Converting audio into time-frequency maps so that a CNN can process sound as "images".
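
As a rough illustration of compound scaling, the sketch below computes the depth, width, and resolution multipliers for a compound coefficient `phi`, using the base coefficients reported in the EfficientNet paper (alpha = 1.2, beta = 1.1, gamma = 1.15). The function names are hypothetical, and the released larger variants use rounded values, so treat the numbers as approximate. This notebook uses B0, which corresponds to `phi = 0`.

```python
# Base coefficients from the EfficientNet paper (alpha * beta**2 * gamma**2 is roughly 2).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    d, w, r = compound_scaling(phi)
    return d * w ** 2 * r ** 2

for phi in range(3):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_factor(phi):.2f}")
```

Each increment of `phi` roughly doubles the compute budget while growing all three dimensions together instead of only one of them.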

## 2. Configuration & Dependencies
We define our hyper-parameters and environment constants, ensuring a fixed input shape for the EfficientNet backbone.

In [None]:
import os
import zipfile
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout, Rescaling
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
import traceback
import gc

# --- CONFIGURATION ---
SAMPLE_RATE = 22050
DURATION = 30
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION
INPUT_SHAPE = (224, 224, 3)
NUM_CLASSES = 10
BATCH_SIZE = 32
EPOCHS = 15
SLICES_PER_TRACK = 10

## 3. Dataset Extraction
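We unzip the GTZAN archive if it has not been extracted yet, then walk the extracted tree to find the folder that directly contains the genre subdirectories (such as `blues`). This keeps the path correct even when the archive's nesting differs, for example in cloud environments like Google Colab.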
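
The directory search can be exercised without the real archive. The sketch below builds a throwaway nested folder tree and applies the same `os.walk` logic to it; the helper name `find_genre_root`, its `marker` parameter, and the `Data/genres_original` layout are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

def find_genre_root(extract_path, marker="blues"):
    """Return the first directory under extract_path that directly contains `marker`."""
    for root, dirs, _files in os.walk(extract_path):
        if marker in dirs:
            return root
    return extract_path  # fall back to the top level, as the notebook does

with tempfile.TemporaryDirectory() as tmp:
    # Simulate an archive that unpacks into Data/genres_original/<genre>/
    nested = os.path.join(tmp, "Data", "genres_original")
    for genre in ("blues", "jazz"):
        os.makedirs(os.path.join(nested, genre))
    found = find_genre_root(tmp)
    print(os.path.relpath(found, tmp))
```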

In [None]:
ZIP_PATH = '/content/genres_original.zip'
EXTRACT_PATH = '/content/dataset_extracted'

if not os.path.exists(EXTRACT_PATH):
    if os.path.exists(ZIP_PATH):
        with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
            zip_ref.extractall(EXTRACT_PATH)
        print("Unzip complete.")
    else:
        print("Dataset zip not found! Ensure it is uploaded to /content/.")

DATASET_PATH = EXTRACT_PATH
for root, dirs, files in os.walk(EXTRACT_PATH):
    if 'blues' in dirs:
        DATASET_PATH = root
        break

## 4. Exploratory Data Analysis (EDA)
In this section, we explore the GTZAN dataset by analyzing a single audio track. We visualize its waveform and demonstrate the **3-second slicing logic** used during training to augment our dataset from 1,000 to 10,000 samples.
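
The slicing arithmetic can be checked on a synthetic signal. This sketch assumes the notebook's settings (22,050 Hz, 30-second tracks, 10 slices) and uses an array of zeros in place of real audio; `slice_track` is a hypothetical helper that derives the slice length from the signal itself rather than from `SAMPLES_PER_TRACK`.

```python
import numpy as np

SAMPLE_RATE = 22050
DURATION = 30
SLICES_PER_TRACK = 10

def slice_track(signal, slices_per_track=SLICES_PER_TRACK):
    """Split a 1-D signal into equal, non-overlapping slices (any remainder is dropped)."""
    samples_per_slice = len(signal) // slices_per_track
    return [signal[i * samples_per_slice:(i + 1) * samples_per_slice]
            for i in range(slices_per_track)]

track = np.zeros(SAMPLE_RATE * DURATION, dtype=np.float32)  # stand-in for one GTZAN track
slices = slice_track(track)

print(len(slices), "slices of", len(slices[0]), "samples each")
print("seconds per slice:", len(slices[0]) / SAMPLE_RATE)
print("dataset size:", 1000 * SLICES_PER_TRACK, "examples from 1,000 tracks")
```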

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd
import os
import cv2

# Search for a sample file
sample_file = 'genres_original/blues/blues.00000.wav' # Local path
if not os.path.exists(sample_file):
    # Try looking in Colab extracted path if it exists
    colab_path = '/content/dataset_extracted/genres_original/blues/blues.00000.wav'
    if os.path.exists(colab_path): sample_file = colab_path

if os.path.exists(sample_file):
    # 1. Load Audio
    y, sr = librosa.load(sample_file, sr=22050)
    
    # 2. Plot Full Waveform
    plt.figure(figsize=(14, 5))
    librosa.display.waveshow(y, sr=sr, color='blue')
    plt.title('Full 30-Second Waveform (Blues)')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.show()
    
    # Play the audio
    print("Playing Sample Audio:")
    ipd.display(ipd.Audio(sample_file))
else:
    print('Sample file not found. Please ensure dataset is extracted.')

### Demonstrating 3-Second Slicing & Mel-Spectrogram Extraction
To give the model more training examples, we slice each 30-second track into ten 3-second segments. Each segment is then converted into a **Mel-Spectrogram**, which shows how signal energy is distributed across frequency over time. Frequencies are mapped onto the perceptual Mel scale and power is converted to decibels (a log scale), both of which approximate how human hearing responds.
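
For intuition about the Mel axis, one widely used conversion is the HTK formula `m = 2595 * log10(1 + f / 700)`. The sketch below implements it in plain Python with hypothetical helper names. Note that `librosa` uses the Slaney variant of the scale unless `htk=True` is passed, so its bin edges differ somewhat from these values.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK-style Hz -> Mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# 11025 Hz is the Nyquist frequency at a 22,050 Hz sample rate
for f in (100, 1000, 4000, 11025):
    print(f"{f:>6} Hz -> {hz_to_mel_htk(f):7.1f} mel")
```

The mapping is close to linear at low frequencies and compresses high ones, so a fixed number of Mel bands spends more resolution where pitch differences are easier to hear.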

In [None]:
if os.path.exists(sample_file):
    # Define slicing parameters
    duration = 30
    slices_per_track = 10
    slice_duration = duration / slices_per_track
    samples_per_slice = int(slice_duration * sr)

    plt.figure(figsize=(18, 8))
    plt.suptitle('Data Augmentation: 3-Second Mel-Spectrogram Slices', fontsize=16)

    # Visualize specific slices (e.g., first 4 slices)
    for i in range(4):
        start_sample = i * samples_per_slice
        end_sample = start_sample + samples_per_slice
        chunk = y[start_sample:end_sample]
        
        # Generate Mel-Spectrogram
        mel = librosa.power_to_db(librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128))
        
        plt.subplot(2, 2, i + 1)
        librosa.display.specshow(mel, sr=sr, x_axis='time', y_axis='mel')
        plt.title(f'Slice {i+1} (Seconds {i*3}-{(i+1)*3})')
        plt.colorbar(format='%+2.0f dB')

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

## 5. RAM-Optimized Data Loading & Preprocessing
Efficiency is critical when handling spectrogram data. We pre-allocate memory using `uint8` to save RAM and use `librosa` for high-quality Mel-Spectrogram extraction.
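
A minimal sketch of the two ideas in this section, assuming 10,000 slices at 224×224×3: min-max scaling a decibel-valued spectrogram into `uint8`, and comparing the resulting array size with `float32`. The helper `to_uint8` and the random stand-in data are illustrative; the notebook's loader performs the equivalent scaling inline.

```python
import numpy as np

def to_uint8(spec_db):
    """Min-max scale an array to the 0-255 range and cast to uint8."""
    lo, hi = float(spec_db.min()), float(spec_db.max())
    if hi <= lo:  # constant input, e.g. digital silence
        return np.zeros(spec_db.shape, dtype=np.uint8)
    return ((spec_db - lo) / (hi - lo) * 255.0).astype(np.uint8)

# Random decibel values standing in for a resized Mel-Spectrogram
rng = np.random.default_rng(0)
spec_db = rng.uniform(-80.0, 0.0, size=(224, 224)).astype(np.float32)
img = to_uint8(spec_db)
print(img.dtype, img.min(), img.max())

# Memory needed for the full dataset, computed without allocating it
n_values = 10_000 * 224 * 224 * 3
for dtype in (np.uint8, np.float32):
    gib = n_values * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.2f} GiB")
```

Under these assumptions the `uint8` array needs about 1.4 GiB, a quarter of the roughly 5.6 GiB that `float32` would take. The trade-off is that each slice is quantized to 256 levels and loses its absolute loudness.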

In [None]:
def load_data_sliced_optimized(dataset_path, slices_per_track=10):
    # Genre folders, sorted so that label ids are stable across runs
    genres = sorted([d for d in os.listdir(dataset_path) if os.path.isdir(os.path.join(dataset_path, d))])
    genre_to_id = {genre: i for i, genre in enumerate(genres)}
    total_files = sum([len([f for f in os.listdir(os.path.join(dataset_path, g)) if f.endswith('.wav')]) for g in genres])
    total_slices = total_files * slices_per_track

    # Pre-allocate as uint8 (1 byte per value) rather than float32 (4 bytes) to save RAM
    X = np.zeros((total_slices, 224, 224, 3), dtype=np.uint8)
    y = np.zeros((total_slices,), dtype=np.int32)

    SAMPLES_PER_SLICE = int(SAMPLES_PER_TRACK / slices_per_track)
    current_idx = 0

    for genre in genres:
        genre_path = os.path.join(dataset_path, genre)
        for filename in sorted(os.listdir(genre_path)):  # sorted for a reproducible sample order
            if filename.endswith('.wav'):
                try:
                    signal, sr = librosa.load(os.path.join(genre_path, filename), sr=SAMPLE_RATE, duration=DURATION)
                    for s in range(slices_per_track):
                        start, end = s * SAMPLES_PER_SLICE, (s+1) * SAMPLES_PER_SLICE
                        # Mel-Spectrogram of this slice, in decibels
                        mel = librosa.power_to_db(librosa.feature.melspectrogram(y=signal[start:end], sr=sr, n_mels=128))
                        # Resize to the network input size and replicate to 3 channels
                        resized = cv2.resize(mel, (224, 224))
                        rgb = np.stack([resized] * 3, axis=-1)
                        # Min-max scale each slice to 0-255
                        min_v, max_v = np.min(rgb), np.max(rgb)
                        norm = (rgb - min_v) / (max_v - min_v) * 255.0 if max_v > min_v else np.zeros_like(rgb)
                        X[current_idx] = norm.astype(np.uint8)
                        y[current_idx] = genre_to_id[genre]
                        current_idx += 1
                except Exception as e:
                    # Skip files that fail to load or process, but report them instead of failing silently
                    print(f"Skipping {filename}: {e}")
                    continue
        gc.collect()
    # Trim rows left unused by skipped files
    return X[:current_idx], y[:current_idx], genres

## 6. Model Architecture (Transfer Learning)
We load the EfficientNetB0 backbone with weights pre-trained on ImageNet. Critical architectural decisions include:
- **Freezing BatchNormalization:** Keeps the pre-trained moving mean and variance from being overwritten by the statistics of small fine-tuning batches.
- **Custom Head:** Includes a GlobalAveragePooling2D layer, a dense layer for non-linear mapping, and a Dropout layer to reduce overfitting.
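
To see why the BatchNormalization layers are frozen, the NumPy sketch below contrasts the two modes of a batch-norm layer. It is a simplified stand-in (no learned scale or offset, hypothetical function name `batch_norm`, made-up data), not the Keras implementation: in inference mode the layer normalizes with stored moving statistics, while in training mode it uses the statistics of the current batch.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, eps=1e-3):
    """Normalize each feature column; gamma and beta are omitted for brevity."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)  # statistics of this batch only
    else:
        mean, var = moving_mean, moving_var        # statistics stored during pre-training
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Pretend the pre-trained layer saw features with mean 0 and variance 1 ...
moving_mean, moving_var = np.zeros(4), np.ones(4)
# ... while a small fine-tuning batch happens to be shifted and narrow.
batch = rng.normal(loc=3.0, scale=0.5, size=(8, 4))

frozen = batch_norm(batch, moving_mean, moving_var, training=False)
unfrozen = batch_norm(batch, moving_mean, moving_var, training=True)

print("frozen mean per feature:  ", frozen.mean(axis=0).round(2))
print("unfrozen mean per feature:", unfrozen.mean(axis=0).round(2))
```

The frozen layer passes the shift through unchanged, which is what the pre-trained convolution weights downstream were calibrated against. The unfrozen layer re-centres every batch, and in Keras it would also update the stored moving averages.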

In [None]:
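def create_fixed_model(input_shape, num_classes):
    inputs = tf.keras.Input(shape=input_shape)
    # Identity rescaling: Keras EfficientNet expects raw 0-255 pixel values
    # and applies its own normalization internally.
    x = Rescaling(1.0)(inputs)
    base_model = EfficientNetB0(weights='imagenet', include_top=False, input_tensor=x)

    # Fine-tune the whole backbone except its BatchNormalization layers,
    # which stay frozen so they keep their pre-trained statistics.
    base_model.trainable = True
    for layer in base_model.layers:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = False

    # Custom classification head
    x = GlobalAveragePooling2D()(base_model.output)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=outputs)
    # Small learning rate for fine-tuning; sparse loss because labels are integer ids
    model.compile(optimizer=Adam(learning_rate=1e-4), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model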

## 7. Training Visualization Logic
A helper that plots training and validation accuracy and loss per epoch from the Keras `History` object.

In [None]:
def plot_history(history):
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Train')
    plt.plot(history.history['val_accuracy'], label='Val')
    plt.title('Accuracy'); plt.legend()
    
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Train')
    plt.plot(history.history['val_loss'], label='Val')
    plt.title('Loss'); plt.legend()
    plt.show()

## 8. Model Execution
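We load the sliced dataset, hold out 20% of the slices with a fixed random seed, and train for 15 epochs. The held-out set is passed as `validation_data` during training and is also used for the final evaluation.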
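
Because every track contributes ten slices, a purely random split can place slices of one recording in both sets, which tends to inflate held-out accuracy. One alternative is to split by track. The sketch below is an assumption layered on top of the notebook, not what the training cell does: it presumes each track occupies `slices_per_track` consecutive rows of `X`, it does not stratify by genre, and `split_by_track` is a hypothetical helper.

```python
import numpy as np

def split_by_track(n_slices, slices_per_track=10, test_fraction=0.2, seed=42):
    """Return boolean (train_mask, test_mask) so that all slices of a track stay together."""
    track_ids = np.arange(n_slices) // slices_per_track  # row index -> track index
    unique_tracks = np.unique(track_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_tracks)
    n_test = int(round(len(unique_tracks) * test_fraction))
    test_mask = np.isin(track_ids, unique_tracks[:n_test])
    return ~test_mask, test_mask

train_mask, test_mask = split_by_track(n_slices=10_000)
print(train_mask.sum(), "training slices,", test_mask.sum(), "held-out slices")
# With the notebook's arrays this would be applied as X[train_mask], y[train_mask], and so on.
```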

In [None]:
X, y, genres = load_data_sliced_optimized(DATASET_PATH, slices_per_track=SLICES_PER_TRACK)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = create_fixed_model(INPUT_SHAPE, NUM_CLASSES)
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), 
                    epochs=EPOCHS, batch_size=BATCH_SIZE)

## 9. Final Evaluation & Export
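We print the per-genre classification report, plot the training curves, and save the model in three forms: the native `.keras` format, legacy HDF5 without optimizer state, and a weights-only file.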
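
The per-class numbers in a classification report derive from the confusion matrix. As a small illustration on made-up labels (three classes rather than ten, and a hypothetical helper `confusion_counts`), the sketch below builds the matrix with NumPy and reads precision and recall off it.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate one count per (true, predicted) pair
    return cm

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

cm = confusion_counts(y_true, y_pred, num_classes=3)
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class

print(cm)
print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
```

Off-diagonal cells show which genres are confused with each other, which the summary accuracy alone does not reveal.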

In [None]:
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred, target_names=genres))
plot_history(history)

model.save("EfficientNet_Model.keras")
model.save("EfficientNet_Model.h5", include_optimizer=False)
model.save_weights("EfficientNet_Model.weights.h5")