# Deep Neural Network for Voice Gender Classification: Application

Welcome to the advanced audio classification lab! Building on what you learned in Lab 1 (binary speech vs music classification), you'll now use deep neural networks to classify voice recordings as male or female.

**After this assignment you will be able to:**

- Build and train a deep L-layer neural network for binary voice classification
- Apply deep learning to real-world voice analysis tasks
- Compare shallow vs deep network performance on audio data
- Understand why deeper networks work better for extracting voice features
- Work with real audio data from TTS systems

Let's get started!

## Important Note on Submission to the AutoGrader

Before submitting your assignment to the AutoGrader, please check each of the following:

1. You have not added any _extra_ `print` statement(s) in the assignment.
2. You have not added any _extra_ code cell(s) in the assignment.
3. You have not changed any of the function parameters.
4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.
5. You are not changing the assignment code where it is not required, like creating _extra_ variables.

## Table of Contents
- [1 - Packages](#1)
- [2 - Load and Process the Dataset](#2)
- [3 - Model Architecture](#3)
    - [3.1 - 2-layer Neural Network](#3-1)
    - [3.2 - L-layer Deep Neural Network](#3-2)
    - [3.3 - General Methodology](#3-3)
- [4 - Two-layer Neural Network](#4)
    - [Exercise 1 - two_layer_model](#ex-1)
    - [4.1 - Train the model](#4-1)
- [5 - L-layer Neural Network](#5)
    - [Exercise 2 - L_layer_model](#ex-2)
    - [5.1 - Train the model](#5-1)
- [6 - Results Analysis](#6)
- [7 - Test with your own audio (optional/ungraded exercise)](#7)

<a name='1'></a>
## 1 - Packages

Begin by importing all the packages you'll need during this assignment. 

- [numpy](https://www.numpy.org/) is the fundamental package for scientific computing with Python.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- [librosa](https://librosa.org/) is a library for audio analysis.
- `dnn_app_utils_v3` provides the functions implemented in the "Building your Deep Neural Network: Step by Step" assignment to this notebook.
- `audio_utils` provides functions to load and process audio data.
- `np.random.seed(1)` is used to keep all the random function calls consistent. It helps grade your work - so please don't change it!

In [None]:
import time
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from dnn_app_utils_v3 import *
from audio_utils import *
from public_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)

<a name='2'></a>
## 2 - Load and Process the Dataset

You'll be using real voice recordings for binary gender classification. In Lab 1, you classified audio as speech or music. Now you'll classify voice recordings as male or female:

- **0: Male** - Male voice recordings
- **1: Female** - Female voice recordings

**Problem Statement**: You are given a dataset containing:

- a training set of `m_train` voice clips labeled by gender (0 or 1)
- a test set of `m_test` voice clips labeled by gender (0 or 1)
- each audio clip converted to a mel-spectrogram of shape `(n_mels, time_steps)`

**Data Source**: We'll use real voice samples from TTS systems (like the speaker samples from Lab 3).

Let's get more familiar with the dataset. Load the data by running the cell below.

In [None]:
train_x_orig, train_y, test_x_orig, test_y, classes = load_audio_dataset()

The following code will show you a spectrogram in the dataset. Feel free to change the index and re-run the cell multiple times to check out other spectrograms.

In [None]:
# Example of a spectrogram
index = 200
plt.figure(figsize=(10, 4))
librosa.display.specshow(train_x_orig[index], x_axis='time', y_axis='mel', sr=22050)
plt.colorbar(format='%+2.0f dB')
plt.title(f"Mel-Spectrogram - Gender: {classes[int(train_y[0,index])]}")
print ("y = " + str(train_y[0,index]) + ". It's a '" + classes[int(train_y[0,index])] +  "' voice.")
plt.show()

In [None]:
# Explore your dataset 
m_train = train_x_orig.shape[0]
n_mels = train_x_orig.shape[1]
time_steps = train_x_orig.shape[2]
m_test = test_x_orig.shape[0]
num_classes = len(classes)

print ("Number of training examples: " + str(m_train))
print ("Number of testing examples: " + str(m_test))
print ("Number of classes: " + str(num_classes))
print ("Each spectrogram is of size: (" + str(n_mels) + ", " + str(time_steps) + ")")
print ("train_x_orig shape: " + str(train_x_orig.shape))
print ("train_y shape: " + str(train_y.shape))
print ("test_x_orig shape: " + str(test_x_orig.shape))
print ("test_y shape: " + str(test_y.shape))
print ("\nGender classes: " + str(classes))

As usual, you reshape and standardize the spectrograms before feeding them to the network. The code is given in the cell below.

**Note**: Just like images are flattened from (height, width, channels) to vectors, spectrograms are flattened from (n_mels, time_steps) to vectors.
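
The flattening step can be checked on a small synthetic array. The sketch below uses made-up sizes (`m`, `n_mels`, `time_steps` are hypothetical, not the lab's actual dimensions) and random values as a stand-in for `train_x_orig`.

```python
import numpy as np

# Hypothetical sizes for illustration only; the real dataset may differ.
m, n_mels, time_steps = 4, 8, 5
rng = np.random.default_rng(0)
specs = rng.normal(size=(m, n_mels, time_steps))  # stand-in for train_x_orig

# (m, n_mels, time_steps) -> (m, n_mels * time_steps) -> (n_mels * time_steps, m)
flat = specs.reshape(m, -1).T

print(flat.shape)  # (40, 4)

# Column i is example i read row by row: all time steps of mel bin 0, then bin 1, ...
column_matches = np.array_equal(flat[:, 2], specs[2].ravel())
print(column_matches)  # True
```

Each column of the result is one example, which matches the `(n_x, number of examples)` layout the models in this notebook expect.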

In [None]:
# Reshape the training and test examples 
train_x_flatten = train_x_orig.reshape(train_x_orig.shape[0], -1).T   # The "-1" makes reshape flatten the remaining dimensions
test_x_flatten = test_x_orig.reshape(test_x_orig.shape[0], -1).T

# Standardize data to have feature values between 0 and 1.
train_x = train_x_flatten/255.
test_x = test_x_flatten/255.

print ("train_x's shape: " + str(train_x.shape))
print ("test_x's shape: " + str(test_x.shape))
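
The cell above scales by a fixed constant. `load_audio_dataset` is not shown here, so the value range of its output is an assumption; if the spectrograms are stored in decibels (as the colorbar in the plot suggests), dividing by 255 would not map them into [0, 1]. One possible alternative, sketched below on synthetic data, is min-max scaling with statistics taken from the training set only. The helper name `min_max_scale` is hypothetical and not part of `audio_utils`.

```python
import numpy as np

def min_max_scale(train, test):
    """Scale both sets using the training set's min and max only."""
    lo, hi = train.min(), train.max()
    scale = hi - lo if hi > lo else 1.0  # guard against constant data
    return (train - lo) / scale, (test - lo) / scale

# Synthetic dB-like values in an assumed range of -80..0; not the lab's data.
rng = np.random.default_rng(0)
train_db = rng.uniform(-80.0, 0.0, size=(40, 6))
test_db = rng.uniform(-80.0, 0.0, size=(40, 2))

train_scaled, test_scaled = min_max_scale(train_db, test_db)
print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
```

Test values can fall slightly outside [0, 1] because the test set is scaled with the training set's statistics.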

**Note**:
The input size is the number of mel-frequency bins multiplied by the number of time steps in the spectrogram.

<a name='3'></a>
## 3 - Model Architecture

Now that you're familiar with the dataset, it's time to build a deep neural network to classify voice recordings by gender!

<a name='3-1'></a>
### 3.1 - 2-layer Neural Network

You're going to build two different models:

- A 2-layer neural network
- An L-layer deep neural network

Then, you'll compare the performance of these models, and try out some different values for $L$. 

The 2-layer model architecture:

**INPUT (Mel-Spectrogram) -> LINEAR -> RELU -> LINEAR -> SIGMOID -> OUTPUT (Gender Probability)**

<u><b>Detailed Architecture</b></u>:
- The input is a mel-spectrogram which is flattened to a vector of size $(n\_mels \times time\_steps, 1)$
- The vector is multiplied by the weight matrix $W^{[1]}$ of size $(n^{[1]}, input\_size)$
- Add a bias term and take its relu to get the hidden layer activations
- Multiply by $W^{[2]}$ and add bias
- Apply sigmoid to get probability of female voice (1) vs male voice (0)
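
The steps above can be sketched in plain NumPy. This is a minimal illustration with made-up layer sizes and locally defined `relu` and `sigmoid`; it is not the helper code from `dnn_app_utils_v3`.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, n_y, m = 12, 4, 1, 3  # hypothetical sizes

X = rng.normal(size=(n_x, m))             # m flattened spectrograms as columns
W1 = rng.normal(size=(n_h, n_x)) * 0.01   # shape (n^[1], input_size)
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

A1 = relu(W1 @ X + b1)      # hidden layer activations
A2 = sigmoid(W2 @ A1 + b2)  # probability of class 1 (female) per example

print(A1.shape, A2.shape)  # (4, 3) (1, 3)
```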

<a name='3-2'></a>
### 3.2 - L-layer Deep Neural Network

For a deeper network:

**[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID**

<u><b>Detailed Architecture</b></u>:
- The input is a mel-spectrogram flattened to a vector
- The vector is multiplied by $W^{[1]}$ and then you add the intercept $b^{[1]}$
- Take the relu activation
- This process repeats for each $(W^{[l]}, b^{[l]})$ layer
- Finally, apply sigmoid to get probability of female voice
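
As a rough sketch, the repeated LINEAR -> RELU pattern is a loop over all layers except the last. The function name `forward_sketch` and the layer sizes are hypothetical; the graded exercise uses the course's own helpers instead.

```python
import numpy as np

def forward_sketch(X, weights, biases):
    """[LINEAR -> RELU] for every layer but the last, then LINEAR -> SIGMOID."""
    A = X
    for W, b in zip(weights[:-1], biases[:-1]):
        A = np.maximum(0, W @ A + b)
    ZL = weights[-1] @ A + biases[-1]
    return 1.0 / (1.0 + np.exp(-ZL))

layer_sizes = [12, 5, 4, 3, 1]  # hypothetical: input size, three hidden layers, one output
rng = np.random.default_rng(2)
weights = [rng.normal(size=(layer_sizes[l], layer_sizes[l - 1])) * 0.01
           for l in range(1, len(layer_sizes))]
biases = [np.zeros((layer_sizes[l], 1)) for l in range(1, len(layer_sizes))]

AL = forward_sketch(rng.normal(size=(12, 6)), weights, biases)
print(AL.shape)  # (1, 6)
```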

<a name='3-3'></a>
### 3.3 - General Methodology

As usual, you'll follow the Deep Learning methodology to build the model:

1. Initialize parameters / Define hyperparameters
2. Loop for num_iterations:
    a. Forward propagation
    b. Compute cost function
    c. Backward propagation
    d. Update parameters (using parameters, and grads from backprop) 
3. Use trained parameters to predict labels
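
The three steps above can be seen end to end on a toy problem. The sketch below applies them to a single sigmoid unit (logistic regression) on synthetic 2-D data, so the gradients fit in two lines; the data, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))            # toy inputs: 2 features, 200 examples
Y = (X[0:1] + X[1:2] > 0).astype(float)  # toy labels, shape (1, 200)

# 1. Initialize parameters / define hyperparameters
W, b = np.zeros((1, 2)), 0.0
learning_rate, num_iterations = 0.5, 200
costs = []

# 2. Loop for num_iterations
for i in range(num_iterations):
    A = sigmoid(W @ X + b)                                    # a. forward propagation
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # b. cost
    dZ = A - Y                                                # c. backward propagation
    dW = dZ @ X.T / X.shape[1]
    db = float(np.mean(dZ))
    W = W - learning_rate * dW                                # d. update parameters
    b = b - learning_rate * db
    costs.append(cost)

# 3. Use trained parameters to predict labels
predictions = (sigmoid(W @ X + b) > 0.5).astype(float)
accuracy = float(np.mean(predictions == Y))
print(f"first cost {costs[0]:.3f}, last cost {costs[-1]:.3f}, accuracy {accuracy:.2f}")
```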

Now go ahead and implement those two models!

<a name='4'></a>
## 4 - Two-layer Neural Network

<a name='ex-1'></a>
### Exercise 1 - two_layer_model 

Use the helper functions you have implemented in the previous assignment to build a 2-layer neural network with the following structure: *LINEAR -> RELU -> LINEAR -> SIGMOID*. 

**Note**: For binary classification (male vs female), we use sigmoid in the output layer with a single output neuron.
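
With a sigmoid output and binary cross-entropy, the `dA2` expression that starts backpropagation in `two_layer_model` combines with the sigmoid derivative into the simple form `A - Y`. The sketch below checks this numerically on random values; it is a sanity check, not part of the graded code.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(1, 5))
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])
A = 1.0 / (1.0 + np.exp(-Z))  # sigmoid output

# Derivative of the per-example cross-entropy loss with respect to A
dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))

# Chain rule through the sigmoid: dA/dZ = A * (1 - A)
dZ = dA * A * (1 - A)

print(np.allclose(dZ, A - Y))  # True
```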

In [None]:
### CONSTANTS DEFINING THE MODEL ####
n_x = train_x.shape[0]  # input size (n_mels * time_steps)
n_h = 20                 # hidden layer size
n_y = 1                  # output size (1 for binary classification)
layers_dims = (n_x, n_h, n_y)
learning_rate = 0.0075

In [None]:
# GRADED FUNCTION: two_layer_model

def two_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):
    """
    Implements a two-layer neural network: LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (n_x, number of examples)
    Y -- true "label" vector (0=male, 1=female), of shape (1, number of examples)
    layers_dims -- dimensions of the layers (n_x, n_h, n_y)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- If set to True, this will print the cost every 100 iterations 
    
    Returns:
    parameters -- a dictionary containing W1, W2, b1, and b2
    costs -- list of the cost values recorded every 100 iterations
    """
    
    np.random.seed(1)
    grads = {}
    costs = []                              # to keep track of the cost
    m = X.shape[1]                           # number of examples
    (n_x, n_h, n_y) = layers_dims
    
    # Initialize parameters dictionary
    #(≈ 1 line of code)
    # parameters = ...
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    
    # Get W1, b1, W2 and b2 from the dictionary parameters.
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    
    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> SIGMOID
        #(≈ 2 lines of code)
        # A1, cache1 = ...
        # A2, cache2 = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
        
        # Compute cost
        #(≈ 1 line of code)
        # cost = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
        
        # Initializing backward propagation
        # For sigmoid + binary cross-entropy
        dA2 = - (np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        
        # Backward propagation
        #(≈ 2 lines of code)
        # dA1, dW2, db2 = ...
        # dA0, dW1, db1 = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
        
        # Set grads
        grads['dW1'] = dW1
        grads['db1'] = db1
        grads['dW2'] = dW2
        grads['db2'] = db2
        
        # Update parameters
        #(approx. 1 line of code)
        # parameters = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE

        # Retrieve parameters
        W1 = parameters["W1"]
        b1 = parameters["b1"]
        W2 = parameters["W2"]
        b2 = parameters["b2"]
        
        # Print the cost every 100 iterations
        if print_cost and (i % 100 == 0 or i == num_iterations - 1):
            print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
        if i % 100 == 0:
            costs.append(cost)
            
    return parameters, costs

def plot_costs(costs, learning_rate=0.0075):
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

In [None]:
parameters, costs = two_layer_model(train_x, train_y, layers_dims = (n_x, n_h, n_y), num_iterations = 2, print_cost=False)

print("Cost after first iteration: " + str(costs[0]))

two_layer_model_test(two_layer_model)

**Expected output:**

```
Cost after first iteration: ~0.69 (around -log(0.5) for random binary initialization)
```
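
The value 0.69 is what binary cross-entropy gives when every prediction is 0.5, regardless of the labels. A short check with made-up labels:

```python
import numpy as np

Y = np.array([[0.0, 1.0, 1.0, 0.0]])  # arbitrary labels
A = np.full_like(Y, 0.5)              # an untrained model that outputs 0.5 everywhere

cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(round(float(cost), 4))  # 0.6931
```

A first-iteration cost far from this value usually points to a problem in initialization or in the forward pass.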

<a name='4-1'></a>
### 4.1 - Train the model 

If your code passed the previous cell, run the cell below to train your parameters. 

- The cost should decrease on every iteration. 
- It may take up to 5 minutes to run 2000 iterations.

In [None]:
parameters, costs = two_layer_model(train_x, train_y, layers_dims = (n_x, n_h, n_y), num_iterations = 2000, print_cost=True)
plot_costs(costs, learning_rate)

Now, you can use the trained parameters to classify voice recordings from the dataset. To see your predictions on the training and test sets, run the cell below.

In [None]:
predictions_train = predict(train_x, train_y, parameters)

In [None]:
predictions_test = predict(test_x, test_y, parameters)
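
The implementation of `predict` lives in `dnn_app_utils_v3` and is not shown in this notebook. Assuming it thresholds the sigmoid output at 0.5 and reports the fraction of matching labels, the core of it would look roughly like this (probabilities and labels below are made up):

```python
import numpy as np

probs = np.array([[0.91, 0.12, 0.55, 0.40, 0.73]])  # made-up sigmoid outputs
labels = np.array([[1, 0, 0, 0, 1]])                # 0 = male, 1 = female

preds = (probs > 0.5).astype(int)            # threshold at 0.5 (assumed)
accuracy = float(np.mean(preds == labels))   # fraction of correct predictions

print(preds)     # [[1 0 1 0 1]]
print(accuracy)  # 0.8
```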

**Note**: You may notice that running the model on fewer iterations (say 1500) gives better accuracy on the test set. This is called "early stopping" and is a way to prevent overfitting.
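
Early stopping is normally decided on a held-out validation set rather than on the test set, so that the test accuracy stays an unbiased estimate. A minimal sketch of the stopping rule, using a hypothetical helper `best_iteration` and made-up validation costs:

```python
def best_iteration(val_costs, patience=3):
    """Return the index of the lowest validation cost seen before
    `patience` consecutive checks pass without improvement."""
    best_i, best_cost = 0, float("inf")
    for i, cost in enumerate(val_costs):
        if cost < best_cost:
            best_i, best_cost = i, cost
        elif i - best_i >= patience:
            break  # no improvement for `patience` checks: stop training
    return best_i

# Made-up validation costs, one per checkpoint (e.g. every 100 iterations)
val_costs = [0.69, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.44]
print(best_iteration(val_costs))  # 3
```

Here training stops at checkpoint 6, so the later value 0.44 is never reached; a larger `patience` trades extra training time for a chance to find such late improvements.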

<a name='5'></a>
## 5 - L-layer Neural Network

<a name='ex-2'></a>
### Exercise 2 - L_layer_model 

Use the helper functions you implemented previously to build an $L$-layer neural network with the following structure: *[LINEAR -> RELU]$\times$(L-1) -> LINEAR -> SIGMOID*.
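
For a given `layers_dims`, layer $l$ has a weight matrix of shape $(n^{[l]}, n^{[l-1]})$ and a bias of shape $(n^{[l]}, 1)$. The sketch below lists these shapes for a 4-layer configuration; `parameter_shapes` is a hypothetical helper and `n_x = 8192` is an assumed input size, not the lab's actual value.

```python
def parameter_shapes(layers_dims):
    """Map each parameter name to its expected shape."""
    shapes = {}
    for l in range(1, len(layers_dims)):
        shapes[f"W{l}"] = (layers_dims[l], layers_dims[l - 1])
        shapes[f"b{l}"] = (layers_dims[l], 1)
    return shapes

n_x = 8192  # assumed input size (e.g. 128 mel bins x 64 time steps)
shapes = parameter_shapes([n_x, 25, 15, 10, 1])
total = sum(rows * cols for rows, cols in shapes.values())

print(shapes["W1"], shapes["W4"])  # (25, 8192) (1, 10)
print(total)                       # 205386
```

Almost all parameters sit in the first layer, because the flattened spectrogram is much larger than any hidden layer.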

In [None]:
### CONSTANTS ###
layers_dims = [n_x, 25, 15, 10, n_y] #  4-layer model

In [None]:
# GRADED FUNCTION: L_layer_model

def L_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (n_x, number of examples)
    Y -- true "label" vector (0=male, 1=female), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    costs -- list of the cost values recorded every 100 iterations
    """

    np.random.seed(1)
    costs = []                         # keep track of cost
    
    # Parameters initialization
    #(≈ 1 line of code)
    # parameters = ...
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    
    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        #(≈ 1 line of code)
        # AL, caches = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
        
        # Compute cost
        #(≈ 1 line of code)
        # cost = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
    
        # Backward propagation
        #(≈ 1 line of code)
        # grads = ...    
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
 
        # Update parameters
        #(≈ 1 line of code)
        # parameters = ...
        # YOUR CODE STARTS HERE

        # YOUR CODE ENDS HERE
                
        # Print the cost every 100 iterations
        if print_cost and (i % 100 == 0 or i == num_iterations - 1):
            print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
        if i % 100 == 0:
            costs.append(cost)
    
    return parameters, costs

In [None]:
parameters, costs = L_layer_model(train_x, train_y, layers_dims, num_iterations = 1, print_cost = False)

print("Cost after first iteration: " + str(costs[0]))

L_layer_model_test(L_layer_model)

<a name='5-1'></a>
### 5.1 - Train the model 

If your code passed the previous cell, run the cell below to train your model as a 4-layer neural network. 

- The cost should decrease on every iteration. 
- It may take up to 5 minutes to run 2000 iterations.

In [None]:
parameters, costs = L_layer_model(train_x, train_y, layers_dims, num_iterations = 2000, print_cost = True)

In [None]:
pred_train = predict(train_x, train_y, parameters)

In [None]:
pred_test = predict(test_x, test_y, parameters)

### Congrats! It seems that your 4-layer neural network has better performance than your 2-layer neural network on the same voice gender classification task!

This demonstrates how deeper networks can learn more complex patterns in audio data.

<a name='6'></a>
##  6 - Results Analysis

First, take a look at some audio clips the L-layer model labeled incorrectly. This will show a few misclassified spectrograms.

In [None]:
print_mislabeled_audio(classes, test_x, test_y, pred_test)
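
`print_mislabeled_audio` comes from `audio_utils` and its internals are not shown here. Finding the misclassified examples themselves only requires comparing predictions with labels, as in this sketch with made-up arrays and assumed class names:

```python
import numpy as np

y_true = np.array([[0, 1, 1, 0, 1, 0]])  # made-up labels
y_pred = np.array([[0, 1, 0, 0, 1, 1]])  # made-up predictions
class_names = ["male", "female"]         # assumed names; the dataset's `classes` may differ

# Indices where prediction and label disagree
mislabeled = np.flatnonzero(y_true[0] != y_pred[0])
print(mislabeled)  # [2 5]

for i in mislabeled:
    print(f"example {i}: predicted {class_names[y_pred[0, i]]}, "
          f"actual {class_names[y_true[0, i]]}")
```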

**A few types of audio the model tends to misclassify:** 
- Mixed audio (speech with background music)
- Low-quality recordings with noise
- Voices with a pitch in the range where typical male and female voices overlap
- Very short or truncated clips
- Unusual or atypical examples within a class

**Why deeper networks help:**
- More layers can learn hierarchical audio features
- Early layers detect low-level patterns (frequencies, rhythms)
- Later layers combine these into high-level voice characteristics
- Better capacity to separate complex, overlapping patterns

### Congratulations on finishing this assignment! 

You just built and trained a deep L-layer neural network for voice gender classification! 

You've seen how:
- Deep networks can outperform shallow ones on complex audio patterns
- The same principles from image classification apply to audio spectrograms
- Network depth is crucial for learning hierarchical representations

If you'd like to test your model with your own audio, there's an optional ungraded exercise below.

<a name='7'></a>
## 7 - Test with your own audio (optional/ungraded exercise)

From this point, if you so choose, you can use your own audio to test the output of your model. To do that follow these steps:

1. Add your audio file to the "data/" folder
2. Change the audio filename in the following code
3. Run the code and check if the algorithm correctly classifies it!

In [None]:
## START CODE HERE ##
my_audio = "my_voice_sample.wav" # change this to the name of your audio file 
my_label_y = [1] # the true class of your audio (0=male, 1=female)
## END CODE HERE ##

# Load and process your audio
audio_path = "data/" + my_audio
spectrogram = load_and_process_audio(audio_path)

# Visualize the spectrogram
plt.figure(figsize=(10, 4))
librosa.display.specshow(spectrogram, x_axis='time', y_axis='mel', sr=22050)
plt.colorbar(format='%+2.0f dB')
plt.title('Your Audio - Mel-Spectrogram')
plt.show()

# Flatten and normalize
spec_flatten = spectrogram.reshape((1, -1)).T
spec_normalized = spec_flatten / 255.

# Predict
my_predicted_gender = predict(spec_normalized, my_label_y, parameters)

print ("y = " + str(np.squeeze(my_predicted_gender)) + ", your L-layer model predicts a \"" + classes[int(np.squeeze(my_predicted_gender))] +  "\" voice.")

**Comparison with Lab 1:**

| Aspect | Lab 1 (Logistic Regression) | Lab 4 (Deep Neural Network) |
|--------|------------------------------|------------------------------|
| **Task** | Binary (Speech vs Music) | Binary (Male vs Female voice) |
| **Model** | Single layer | L-layer (up to 4+ layers) |
| **Activation** | Sigmoid only | ReLU + Sigmoid |
| **Complexity** | Linear decision boundary | Non-linear hierarchical |
| **Accuracy** | ~70-75% (binary) | ~75-80% (binary) |
| **Learning** | Simple patterns | Complex audio features |