[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Deep Learning Methods

## Deep Learning for Computer Vision - 1D Convolution Net for Audio Classification

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 21/12/2025 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0085DeepLearning1DConvFreqEst.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Deep Learning
import torch
import torch.nn            as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torchmetrics.classification import MulticlassAccuracy
import torchinfo
import torchvista

# Miscellaneous
from concurrent.futures import ThreadPoolExecutor
import os
from platform import python_version
import random

# Typing
from typing import Callable, Dict, Generator, List, Literal, Optional, Self, Sequence, Set, Tuple, Union
from numpy.typing import NDArray
from torch import Tensor

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

# Improve performance by benchmarking
torch.backends.cudnn.benchmark = True

# Reproducibility
# torch.manual_seed(seedNum)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark     = False

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

DATA_FOLDER_NAME           = 'DataSets'
TENSOR_BOARD_FOLDER_NAME   = 'TB'

BASE_FOLDER_NAME = 'FixelCourses'
BASE_FOLDER_PATH = os.getcwd()[:(len(os.getcwd()) - (os.getcwd()[::-1].lower().find(BASE_FOLDER_NAME.lower()[::-1])))]

In [None]:
# Courses Packages

from DataManipulation import DownloadUrl
from DataVisualization import PlotConfusionMatrix
from DeepLearningPyTorch import GenDataLoaders, TrainModel

In [None]:
# General Auxiliary Functions

# PyTorch Data Loader
class AudioMNISTDataset(Dataset):
    def __init__( self, dfData: pd.DataFrame, targetCol: Literal['Digit', 'Speaker', 'Accent', 'Gender', 'NativeSpeaker'], hMap: Optional[Callable[[Union[int, str]], int]], *, hTransform: Optional[Callable] = None, padSize: int = 0, seedNum: Optional[int] = None ) -> None:
        """
        PyTorch Dataset class for the AudioMNIST dataset.
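        Input:
            - dfData (pd.DataFrame): The AudioMNIST data. Meta data columns followed by the signal samples columns ('0', '1', ...).
            - targetCol (str): The target column to be used for classification.
                               Options are 'Digit', 'Speaker', 'Accent', 'Gender', 'NativeSpeaker'.
            - hMap (Callable[[Union[int, str]], int]): A mapping function to convert target values to integer labels.
            - hTransform (Optional[Callable]): A transform applied to the embedded signal.
            - padSize (int): Number of samples added to the signal length to set the window size.
            - seedNum (Optional[int]): Seed for the random shift and the train / validation split.
        """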

        lCol = dfData.columns.tolist()
        # Find the index of '0' in `lCol`
        signalStartIdx = lCol.index('0')
        
        self._dfData         = dfData.copy()
        self._targetCol      = targetCol
        self._numSamples     = len(dfData)
        self._hTransform     = hTransform
        self._hMap           = hMap
        self._signalStartIdx = signalStartIdx
        self._winSize        = (dfData.shape[1] - signalStartIdx) + padSize #<! Window size to embed the signals into (Signal columns + padding)
        self._oRng           = random.Random(seedNum)

    def __len__( self: Self ) -> int:
        
        return self._numSamples

    def __getitem__( self: Self, idx: int ) -> Tuple[Union[NDArray, Tensor], int]:
        # Extract the signal from the Data Frame
        # Embeds it into a window of size `self._winSize` with random shift
        
        numSamples = self._dfData.loc[idx, 'NumSamples']
        vA = self._dfData.iloc[idx, self._signalStartIdx:(self._signalStartIdx + numSamples)].to_numpy() #<! Int16 Values
        vA = vA.astype(np.int16)
        vX = np.zeros(self._winSize, dtype = np.float32)
        maxShift = self._winSize - len(vA)
        
        # Randomly shift the signal within the window
        rndShift = self._oRng.randint(0, maxShift)
        vX[rndShift:(rndShift + len(vA))] = vA.astype(np.float32)
        vX = np.expand_dims(vX, axis = 0) #<! Add channel dimension

        valY = self._hMap(self._dfData.loc[idx, self._targetCol])

        if self._hTransform is not None:
            vX = self._hTransform(vX)

        return vX, valY
    
    def GenTrainValSplit( self: Self, valFraction: float ) -> Tuple[Sequence[int], Sequence[int]]:
        """
        Generates training and validation datasets from the current dataset.
        It uses the 'Speaker' column to ensure that samples from the same speaker are not split between training and validation sets.
        
        Input:
            - valFraction (float): Fraction of the speakers to be used for validation.
        Output:
            Tuple[Sequence[int], Sequence[int]]: Training and validation indices.
        """
        
        lSpeakerIdx = self._dfData['Speaker'].unique().tolist()
        numSpeakers = len(lSpeakerIdx)
        
        numValSpeakers   = int(np.floor(valFraction * numSpeakers))
        numTrainSpeakers = numSpeakers - numValSpeakers

        lTrainSpkIdx = self._oRng.sample(lSpeakerIdx, k = numTrainSpeakers)
        lValSpkIdx   = [spk for spk in lSpeakerIdx if spk not in lTrainSpkIdx]

        # Get indices of the samples for training and validation sets
        lTrainIdx = self._dfData[self._dfData['Speaker'].isin(lTrainSpkIdx)].index.tolist()
        lValIdx   = self._dfData[self._dfData['Speaker'].isin(lValSpkIdx)].index.tolist()

        return lTrainIdx, lValIdx
    
    def GetWinSize( self: Self ) -> int:
        """
        Returns the window size used for embedding the signals.
        """
        return self._winSize
    

def DownloadUrlFile( fileUrl: str, destFolderPath: str, fileName: str ) -> str:
    """
    Downloads a file from a URL to a destination folder. 
    If the destination folder does not exist, it is created.
    If the file already exists in the destination folder, it is not downloaded again.

    Input:
        - fileUrl (str): The URL of the file to download.
        - fileName (str): The name to save the file as.
        - destFolderPath (str): The destination folder path.

    Output:
        str: The path to the downloaded file.
    """

    if not os.path.isdir(destFolderPath):
        os.makedirs(destFolderPath)

    filePath = os.path.join(destFolderPath, fileName)
    filePath = DownloadUrl(fileUrl, filePath)

    return filePath

def DecimateData( mD: NDArray, vN: NDArray, decFactor: int, winSize: int ) -> Tuple[NDArray, NDArray]:
    """
    Decimates the input data matrix by the specified factor.  
    Works on each signal (Row) individually based on its valid length.
    Input:
        - mD (NDArray): Input data matrix.
        - vN (NDArray): Vector of valid lengths for each signal.
        - decFactor (int): Decimation factor.
        - winSize (int): Window size for the decimated signals.
    Output:
        - mDDec (NDArray): Decimated data matrix.
        - vNDec (NDArray): Vector of valid lengths for each decimated signal.
    """
    
    numSignals = mD.shape[0]
    mDDec = np.zeros((numSignals, winSize))
    vNDec = np.zeros(numSignals, dtype = np.int32)

    for ii in range(numSignals):
        vA = mD[ii, :vN[ii]]
        vADec = sp.signal.decimate(vA, decFactor)
        numSamplesDec = len(vADec)
        mDDec[ii, :numSamplesDec] = vADec
        vNDec[ii]    = numSamplesDec

    return mDDec, vNDec


def DecimateDataThread( ii: int, mDDec: NDArray, vNDec: NDArray, mD: NDArray, vN: NDArray, decFactor: int, winSize: int ) -> None:
    
    vA    = mD[ii, :vN[ii]]
    vADec = sp.signal.decimate(vA, decFactor)

    numSamplesDec = len(vADec)

    mDDec[ii, :numSamplesDec] = vADec
    vNDec[ii]                 = numSamplesDec


def DecimateDataParallel( mD: NDArray, vN: NDArray, decFactor: int, winSize: int ) -> Tuple[NDArray, NDArray]:
    """
    Decimates the input data matrix by the specified factor.  
    Works on each signal (Row) individually based on its valid length.
    Input:
        - mD (NDArray): Input data matrix.
        - vN (NDArray): Vector of valid lengths for each signal.
        - decFactor (int): Decimation factor.
        - winSize (int): Window size for the decimated signals.
    Output:
        - mDDec (NDArray): Decimated data matrix.
        - vNDec (NDArray): Vector of valid lengths for each decimated signal.
    """
    
    numSignals = mD.shape[0]
    mDDec = np.zeros((numSignals, winSize))
    vNDec = np.zeros(numSignals, dtype = np.int32)

    hDecimateDataThread = lambda ii: DecimateDataThread(ii, mDDec, vNDec, mD, vN, decFactor, winSize)

    with ThreadPoolExecutor() as hExe: #<! Optimized for IO bound tasks
        list(hExe.map(hDecimateDataThread, range(numSignals))) #<! Consume the iterator so exceptions raised in the workers surface

    return mDDec, vNDec

## Audio (Speech) Classification with 1D Convolution Model in PyTorch

This notebook **recognizes the spoken digit** of a given set of audio samples from [`AudioMNIST`](https://github.com/soerenab/AudioMNIST).

The notebook presents:

 * Use of convolution layers in PyTorch.
 * Use of pool layers in PyTorch.
 * Use of adaptive pool layer in PyTorch.  
   The motivation is to set a constant output size regardless of input dimensions.
 * Use the model for inference on the test data.

</br>

 * <font color='brown'>(**#**)</font> While the layers are called _Convolution Layer_ they actually implement correlation.  
   Since the weights are learned, in practice it makes no difference as _Correlation_ is convolution with a flipped kernel.

* <font color='red'>(**?**)</font> What kind of a problem is the _recognition_ of a single word audio?
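
The note above on correlation can be verified numerically. The following is a minimal sketch, using a toy signal and kernel which are not part of the notebook's data. It compares `torch.nn.functional.conv1d()` to NumPy's correlation and convolution:

```python
import numpy as np
import torch
import torch.nn.functional as F

vX = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype = np.float32) #<! Toy signal (Assumption)
vW = np.array([1.0, 2.0, -1.0], dtype = np.float32)          #<! Toy asymmetric kernel (Assumption)

tX = torch.from_numpy(vX).view(1, 1, -1) #<! (Batch, Channels, Length)
tW = torch.from_numpy(vW).view(1, 1, -1) #<! (Out Channels, In Channels, Kernel Length)

vTorch      = F.conv1d(tX, tW).flatten().numpy()          #<! PyTorch "convolution"
vCorr       = np.correlate(vX, vW, mode = 'valid')        #<! Correlation
vConvFlip   = np.convolve(vX, vW[::-1], mode = 'valid')   #<! Convolution with a flipped kernel
vConvNoFlip = np.convolve(vX, vW, mode = 'valid')         #<! Convolution with the kernel as is

print(f'PyTorch Conv1d             : {vTorch}')
print(f'Correlation                : {vCorr}')
print(f'Convolution (Flipped)      : {vConvFlip}')
print(f'Convolution (Not Flipped)  : {vConvNoFlip}')
```

The first three results match while the last one differs, since the kernel is not symmetric.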

In [None]:
# Parameters

# Data
dataSetName     = 'AudioMNIST'
datasetFileUrl  = r'https://huggingface.co/datasets/Royi/AudioMNIST/resolve/main/AudioMNIST.parquet'
datasetFileName = 'AudioMNIST.parquet'
targetCol       = 'Digit' #<! Target Column for Classification
numCls          = 10

decFactor = 8 #<! Decimation Factor from 48 [kHz] to 6 [kHz]
winSize   = 6_000 #<! Window Size after Decimation

# Dataset
padSize     = 400 #<! Padding Size to embed the signals into (In addition to `winSize`)
valSetRatio = 0.2 #<! Fraction of the speakers held out for validation

# Model
dropP = 0.1 #<! Dropout Layer

# Training
batchSize     = 512
numWork       = 2 #<! Number of workers
nEpochs       = 20
learningRate = 5e-4
weightDecay  = 5e-5
tuβ          = (0.9, 0.99)

# Visualization
numSigPlot = 5

## Generate / Load Data

This section downloads the `AudioMNIST` dataset as a Parquet file and loads it into a data frame.

In [None]:
# Download Data

datasetFolderPath = os.path.join(BASE_FOLDER_PATH, DATA_FOLDER_NAME, dataSetName)
datasetFilePath = DownloadUrlFile(datasetFileUrl, datasetFolderPath, datasetFileName)

In [None]:
# Generate / Load Data

dfData = pd.read_parquet(datasetFilePath)
dfData

### Pre Process Data

In [None]:
# Decimation of Data
# Data is sampled at 48 [kHz] which is overkill for digit classification.
# The data is decimated by `decFactor`. This improves performance and reduces memory consumption.

mData = dfData.iloc[:, dfData.columns.tolist().index('0'):].to_numpy()
vN = dfData['NumSamples'].to_numpy()
mDataDec, vNDec = DecimateData(mData, vN, decFactor = decFactor, winSize = winSize)
# mDataDec, vNDec = DecimateDataParallel(mData, vN, decFactor = decFactor, winSize = winSize)
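
The decimation step can be illustrated on a single synthetic signal. The following is a minimal sketch, assuming a 1 second pure tone at 440 [Hz] (An illustrative choice, not taken from the dataset). It uses `scipy.signal.decimate()`, which applies an anti aliasing low pass filter before the down sampling:

```python
import numpy as np
from scipy.signal import decimate

fsIn      = 48_000 #<! Input sampling rate [Hz]
decFactor = 8      #<! Same factor as in the notebook's parameters
fsOut     = fsIn // decFactor #<! Output sampling rate [Hz]

vT = np.arange(fsIn) / fsIn           #<! 1 [Sec] time grid (Assumption)
vA = np.sin(2 * np.pi * 440.0 * vT)   #<! Synthetic tone (Assumption)

vADec = decimate(vA, decFactor) #<! Low pass filter + keep every `decFactor`-th sample

print(f'Input : {len(vA)} samples at {fsIn} [Hz]')
print(f'Output: {len(vADec)} samples at {fsOut} [Hz]')
```

After the decimation, a 1 second signal fits within the `winSize` of 6,000 samples.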

In [None]:
# Build the Data DataFrame

dfAudioDec = pd.DataFrame(mDataDec, columns = [str(ii) for ii in range(mDataDec.shape[1])])
dfData = pd.concat((dfData.iloc[:, :dfData.columns.tolist().index('0')].copy(), dfAudioDec), axis = 1)
dfData['NumSamples'] = vNDec
dfData

In [None]:
# Mapping invalid values

dfData['Accent'] = dfData['Accent'].replace({'german': 'German', 'Egyptian_American?': 'Egyptian/American'})
dfData['Age']    = dfData['Age'].replace({1234: 34})
dfData['Continent'] = dfData['Continent'].replace({'South-America': 'America'})

In [None]:
# Generate / Load Data

print(f'The number of samples : {dfData.shape[0]}')
print(f'The number of speakers: {dfData["Speaker"].nunique()}') #<! Different quotes inside the f-string (Same quotes require Python 3.12+)

* <font color='red'>(**?**)</font> What is the content of `dfData` above? Explain its shape.

### Plot Data

In [None]:
# Plot the Data

# Distribution of Digits
hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(data = dfData, x = 'Digit', bins = 10, discrete = True, kde = False, ax = hA)
hA.set_title('Distribution of Digits in AudioMNIST Dataset')
hA.set_xlabel('Digit')
hA.set_ylabel('Count');

In [None]:
# Number of Samples per Speaker
hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(data = dfData, x = 'Speaker', bins = 10, discrete = True, kde = False, ax = hA)
hA.set_title('Distribution of Speakers in AudioMNIST Dataset')
hA.set_xlabel('Speaker')
hA.set_ylabel('Count');

In [None]:
# Gender Distribution
hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(data = dfData, x = 'Gender', bins = 10, discrete = True, kde = False, ax = hA)
hA.set_title('Distribution of Genders in AudioMNIST Dataset')
hA.set_xlabel('Gender')
hA.set_ylabel('Count');

In [None]:
# Age Distribution
# See https://github.com/soerenab/AudioMNIST/issues/10
hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(data = dfData, x = 'Age', kde = False, ax = hA)
hA.set_title('Distribution of Ages in AudioMNIST Dataset')
hA.set_xlabel('Age')
hA.set_ylabel('Count');

In [None]:
# Number of Samples Distribution
hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(data = dfData, x = 'NumSamples', kde = False, ax = hA)
hA.set_title('Distribution of Number of Samples in AudioMNIST Dataset')
hA.set_xlabel('Number of Samples')
hA.set_ylabel('Count');

## Model Input Data

Each signal has a different length.  
In order to wrap them into a single batch one must apply some kind of transformation.  
Common transformations are:

1. Padding with zeros.
2. Feature extraction.

In this case padding with zeros is used.  
To prevent the model from treating the padding zeros as a feature, the signal is randomly shifted within the window.

![](https://i.imgur.com/6iPXza5.png)
<!-- ![](https://i.postimg.cc/nL5XpXNz/Diagrams-Pad-Signal-Zeros-Rnd-Shift.png) -->

The length of the signal in the batch is constant.  
Yet the location of the start of the signal within the window is random.  
Hence the model must become insensitive to the padding of the signal and its length.
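
The embedding of a signal into a zero padded window with a random shift can be sketched as follows. This is a minimal illustration and not the dataset class itself: the function `EmbedSignal()` is a hypothetical helper and it uses NumPy's random generator instead of Python's `random` module.

```python
import numpy as np

def EmbedSignal( vA: np.ndarray, winSize: int, oRng: np.random.Generator ) -> tuple[np.ndarray, int]:
    # Hypothetical helper: embeds `vA` into a zero window of length `winSize` at a random offset
    maxShift = winSize - len(vA)
    if maxShift < 0:
        raise ValueError('The signal is longer than the window')
    shiftIdx = int(oRng.integers(0, maxShift + 1)) #<! Uniform over [0, maxShift] (Inclusive)
    vX = np.zeros(winSize, dtype = np.float32)
    vX[shiftIdx:(shiftIdx + len(vA))] = vA
    return vX, shiftIdx

oRng    = np.random.default_rng(0)
vA      = np.array([3.0, -1.0, 2.0], dtype = np.float32) #<! Toy signal (Assumption)
winSize = 8                                              #<! Toy window size (Assumption)

lEmbed = [EmbedSignal(vA, winSize, oRng) for _ in range(3)]
for vX, shiftIdx in lEmbed:
    print(f'Shift {shiftIdx}: {vX}')
```

Each call yields the same samples at a different location, so the position of the zeros carries no information about the label.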

* <font color='red'>(**?**)</font> One may see the random shift as an augmentation. What other augmentation can fit this task?

In [None]:
# Set the Dataset

# The signals are embedded into windows of size `winSize` + `padSize`
dsData = AudioMNISTDataset(dfData, targetCol = 'Digit', hMap = lambda x: int(x), hTransform = None, padSize = padSize, seedNum = seedNum)

### Train / Validation Split Methodology

The dataset consists of 60 speakers, each with 50 samples per digit.  
One may want to ensure each speaker is either in the _Train Set_ or in the _Validation Set_ in order to prevent overfitting to a speaker.  

 * <font color='brown'>(**#**)</font> A deeper discussion should take into account the real world scenario the model should solve.
 * <font color='brown'>(**#**)</font> One may look at [Data Leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).
 * <font color='brown'>(**#**)</font> One may use a finer granularity by only ensuring each digit of a specified speaker is either in the _Train Set_ or in the _Validation Set_.
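
The speaker wise split can be sketched on a toy data frame. This is a minimal illustration under assumptions: the data frame `dfToy` and its values are made up, and only the `Speaker` column name follows the dataset.

```python
import random
import pandas as pd

# Toy data: 5 speakers, 2 samples each (Assumption)
dfToy = pd.DataFrame({
    'Speaker': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'Digit'  : [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

valFraction = 0.2
oRng        = random.Random(512)

lSpeaker       = dfToy['Speaker'].unique().tolist()
numValSpeakers = int(valFraction * len(lSpeaker))       #<! Split by speakers, not by samples
lValSpk        = oRng.sample(lSpeaker, k = numValSpeakers)

vValMask  = dfToy['Speaker'].isin(lValSpk)
lTrainIdx = dfToy.index[~vValMask].tolist()
lValIdx   = dfToy.index[vValMask].tolist()

sTrainSpk = set(dfToy.loc[lTrainIdx, 'Speaker'])
sValSpk   = set(dfToy.loc[lValIdx, 'Speaker'])

print(f'Train speakers     : {sorted(sTrainSpk)}')
print(f'Validation speakers: {sorted(sValSpk)}')
```

No speaker appears in both sets, hence the validation score measures generalization to unseen speakers.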

In [None]:
# Train / Validation Split

lTrainIdx, lValIdx = dsData.GenTrainValSplit(valSetRatio)
dsTrain = torch.utils.data.Subset(dsData, lTrainIdx)
dsVal   = torch.utils.data.Subset(dsData, lValIdx)

 * <font color='brown'>(**#**)</font> There are tricks to wrap a `Subset` in order to allow different transforms for train and test sets.  
   See [StackOverflow - Use a Different Data Augmentation for Train and Validation `Subset`](https://stackoverflow.com/questions/51782021), [PyTorch Forum - Use a Different Data Augmentation for Train and Validation `Subset`](https://discuss.pytorch.org/t/32209).  
   An alternative is to use the course's `SubSet` class in `DeepLearningPyTorch.py`.

In [None]:
# Plot Signals

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
for ii in range(numSigPlot):
    signalIdx = random.randint(0, len(dsData) - 1)
    vX, valY = dsData[signalIdx]
    hA.plot(vX.squeeze(), lw = 2, alpha = 0.75, label = f'Signal {ii:05d} - Digit {valY}')

hA.set_title('Audio Signals from AudioMNIST Dataset')
hA.set_xlabel('Sample Index')
hA.set_ylabel('Amplitude')
hA.legend();

In [None]:
# Set the Data Loaders

# dlTrain, dlVal = GenDataLoaders(dsTrain, dsVal, batchSize, numWorkers = numWork, persWork = True)
dlTrain, dlVal = GenDataLoaders(dsTrain, dsVal, batchSize, numWorkers = 0) #<! Prevents issues with multiple workers in some environments

In [None]:
# Iterate on the Loader
# The first batch.
tX, vY = next(iter(dlTrain)) #<! PyTorch Tensors

print(f'The batch features dimensions: {tX.shape}')
print(f'The batch features data type : {tX.dtype}')
print(f'The batch labels dimensions  : {vY.shape}')
print(f'The batch labels data type   : {vY.dtype}')

## Define the Model

The model is defined as a sequential model.


In [None]:
# Model
# Defining a sequential model.

oModel = nn.Sequential(
    nn.Identity(),
        
    nn.Conv1d(in_channels = 1,   out_channels = 16,  kernel_size = 21), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 16,  out_channels = 32,  kernel_size = 19), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 32,  out_channels = 48,  kernel_size = 17), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 48,  out_channels = 64,  kernel_size = 15), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 64,  out_channels = 96,  kernel_size = 13), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 96,  out_channels = 128, kernel_size = 11), nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 128, out_channels = 160, kernel_size = 9) , nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
    nn.Conv1d(in_channels = 160, out_channels = 192, kernel_size = 7) , nn.MaxPool1d(kernel_size = 2), nn.ReLU(),
                
    nn.AdaptiveAvgPool1d(output_size = 1), nn.ReLU(),
    nn.Flatten          (),
    nn.Linear           (in_features = 192, out_features = 128), nn.ReLU(),
    nn.Dropout          (p = dropP),
    nn.Linear           (in_features = 128, out_features = 64),  nn.ReLU(),
    nn.Dropout          (p = dropP),
    nn.Linear           (in_features = 64, out_features = numCls),
)

* <font color='brown'>(**#**)</font> The [`torch.nn.AdaptiveAvgPool1d`](https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool1d.html) layer yields the same output shape regardless of the input length.
* <font color='red'>(**?**)</font> What is the role of the [`torch.nn.Flatten`](https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html) layer?
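
The effect of the adaptive pooling can be sketched by feeding inputs of different lengths. This is a minimal illustration with random tensors, where the lengths are arbitrary choices. With `output_size = 1` the layer is equivalent to the mean over the time axis.

```python
import torch
import torch.nn as nn

oPool = nn.AdaptiveAvgPool1d(output_size = 1)

lShape  = []
lIsMean = []
for sigLen in (17, 100, 6_400): #<! Arbitrary lengths (Assumption)
    tX = torch.randn(2, 192, sigLen) #<! (Batch, Channels, Length)
    tY = oPool(tX)
    lShape.append(tuple(tY.shape))
    lIsMean.append(torch.allclose(tY, tX.mean(dim = 2, keepdim = True), atol = 1e-5))
    print(f'Input length: {sigLen:5d}, Output shape: {tuple(tY.shape)}')
```

Hence the input size of the first linear layer depends only on the number of channels.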

In [None]:
# Model Summary

torchinfo.summary(oModel, tX.shape, col_names = ['kernel_size', 'input_size', 'output_size', 'num_params'], device = 'cpu')

* <font color='brown'>(**#**)</font> Pay attention: the parameter `p` of PyTorch's dropout layer is the probability of zeroing an element (Not the probability of keeping it).
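
The meaning of the dropout probability can be checked empirically. The following is a minimal sketch on a vector of ones (An illustrative input). In training mode about `p` of the elements are zeroed and the rest are scaled by `1 / (1 - p)`. In evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropP = 0.1
oDrop = nn.Dropout(p = dropP)
tX    = torch.ones(10_000) #<! Illustrative input (Assumption)

oDrop.train() #<! Training mode: random zeroing + scaling
tYTrain   = oDrop(tX)
zeroRatio = (tYTrain == 0).float().mean().item()
vKept     = tYTrain[tYTrain != 0].unique() #<! Values of the elements which were kept

oDrop.eval() #<! Evaluation mode: identity
tYEval = oDrop(tX)

print(f'Ratio of zeroed elements: {zeroRatio:0.3f}')
print(f'Value of kept elements  : {vKept}')
```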

In [None]:
# Plot the Model Graph  

torchvista.trace_model(oModel, tX)

In [None]:
# Run Model
# Apply a test run.

tXX = torch.randn(batchSize, 1, tX.shape[2])
with torch.inference_mode():
    vYHat = oModel(tXX)

print(f'The input dimensions: {tXX.shape}')
print(f'The output dimensions: {vYHat.shape}')
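
The length of the signal along the convolution blocks can be traced by hand. The following is a minimal sketch, assuming an input length of 6,400 samples (`winSize + padSize`) and the default layer settings: stride 1 with no padding for `Conv1d`, and a stride equal to the kernel size for `MaxPool1d`. The function `CalcOutLen()` is a hypothetical helper.

```python
def CalcOutLen( inLen: int, lKernel: list[int], poolSize: int = 2 ) -> list[int]:
    # Hypothetical helper: length after each (Conv1d -> MaxPool1d) block
    lLen   = []
    sigLen = inLen
    for kernelSize in lKernel:
        sigLen = sigLen - kernelSize + 1 #<! Conv1d: Stride 1, no padding, no dilation
        sigLen = sigLen // poolSize      #<! MaxPool1d: Stride equals the kernel size (Floor)
        lLen.append(sigLen)
    return lLen

lKernel = [21, 19, 17, 15, 13, 11, 9, 7] #<! Kernel sizes of the model's convolution layers
lLen    = CalcOutLen(6_400, lKernel)      #<! Input length is an assumption

print(lLen) #<! [3190, 1586, 785, 385, 186, 88, 40, 17]
```

Under these assumptions the adaptive pooling layer averages 17 samples per channel. A different input length changes this number, yet it does not change the shape of the model's output.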

## Training the Model

Use the training and validation samples.  

In [None]:
# Check GPU Availability

runDevice = torch.device('cuda:0' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')) #<! The 1st CUDA device
oModel    = oModel.to(runDevice) #<! Transfer model to device

In [None]:
# Set the Loss & Score

hL = nn.CrossEntropyLoss()
hS = MulticlassAccuracy(num_classes = numCls, average = 'micro') #<! See documentation for `macro` vs. `micro`
hS = hS.to(runDevice)

In [None]:
# Define Optimizer

oOpt = torch.optim.AdamW(oModel.parameters(), lr = learningRate, betas = tuβ, weight_decay = weightDecay) #<! Define optimizer
oSch = torch.optim.lr_scheduler.OneCycleLR(oOpt, max_lr = learningRate, total_steps = nEpochs)

In [None]:
# Train the Model

_, lTrainLoss, lTrainScore, lValLoss, lValScore, lLearnRate = TrainModel(oModel, dlTrain, dlVal, oOpt, nEpochs, hL, hS, oSch = oSch)

In [None]:
# Plot Results
hF, vHa = plt.subplots(nrows = 1, ncols = 3, figsize = (12, 5))
vHa = vHa.flat

hA = vHa[0]
hA.plot(lTrainLoss, lw = 2, label = 'Train')
hA.plot(lValLoss, lw = 2, label = 'Validation')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();


hA = vHa[1]
hA.plot(lTrainScore, lw = 2, label = 'Train')
hA.plot(lValScore, lw = 2, label = 'Validation')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();

hA = vHa[2]
hA.plot(lLearnRate, lw = 2)
hA.grid()
hA.set_title('Learn Rate Scheduler')
hA.set_xlabel('Epoch')
hA.set_ylabel('Learn Rate');

* <font color='brown'>(**#**)</font> Results are of the last epoch of the model, as the best weights are not reloaded.
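
Keeping the best weights can be sketched as follows. This is a minimal illustration and not the behavior of the course's `TrainModel()`: the model is a stand in linear layer, the validation scores are made up, and the "training" is a dummy update of the weights.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

oNet      = nn.Linear(4, 2)      #<! Stand in model (Assumption)
lValScore = [0.60, 0.85, 0.80]   #<! Made up validation scores (Assumption)

bestScore  = float('-inf')
bestEpoch  = -1
dBestState = None

for epochIdx, valScore in enumerate(lValScore):
    with torch.no_grad():
        oNet.weight.add_(1.0) #<! Dummy stand in for a training epoch
    if valScore > bestScore:
        bestScore  = valScore
        bestEpoch  = epochIdx
        dBestState = copy.deepcopy(oNet.state_dict()) #<! Deep copy, as `state_dict()` references the live parameters

tLastW = oNet.weight.detach().clone() #<! Weights of the last epoch
oNet.load_state_dict(dBestState)      #<! Reload the weights of the best epoch

print(f'Best epoch: {bestEpoch}, Best score: {bestScore}')
```

Without the deep copy the stored state would follow the parameters as they keep updating.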

## Results Analysis 

This section runs the model on the train data and analyzes the results.

In [None]:
# Train Data Loader 
dlTrain = torch.utils.data.DataLoader(dsTrain, batch_size = 4 * batchSize, shuffle = False, num_workers = 0, drop_last = False, persistent_workers = False)

In [None]:
# Run on Train Data

lYY     = []
lYYHat  = []
with torch.inference_mode():
    for tXX, vYY in dlTrain:
        tXX = tXX.to(runDevice)
        lYY.append(vYY)
        lYYHat.append(oModel(tXX))

vYY    = torch.cat(lYY, dim = 0).cpu().numpy()
tYYHat = torch.cat(lYYHat, dim = 0).cpu().numpy()
vYYHat = np.argmax(tYYHat, axis = 1)

In [None]:
dfRes = dfData.iloc[lTrainIdx].copy()
dfRes = dfRes[['Digit', 'Speaker', 'NumSamples', 'Accent', 'Age', 'Gender', 'NativeSpeaker']]
dfRes['Prediction'] = vYYHat
dfRes['Correct'] = (dfRes['Digit'].to_numpy() == vYYHat)
# np.all(dfRes['Digit'].to_numpy() == vYY) #<! Verify labels match
dfRes

In [None]:
# Show Accuracy per Speaker

dfResGrpSpkr = dfRes.groupby('Speaker')
dsSpkrAcc    = dfResGrpSpkr['Correct'].mean()

hF, hA = plt.subplots(figsize = (14, 4))
sns.barplot(x = dsSpkrAcc.index, y = dsSpkrAcc, ax = hA)
hA.set_title('Accuracy per Speaker')
hA.set_xlabel('Speaker')
hA.set_ylabel('Accuracy');

In [None]:
# Confusion Matrix per Digit

hF, hA = plt.subplots(figsize = (7, 6))

hA, _ = PlotConfusionMatrix(vYY, vYYHat, hA = hA)
hA.set_title(f'Train Data, Accuracy {np.mean(vYY == vYYHat): 0.2%}');

## Validation Data Analysis 

This section runs the model on the validation data and analyzes the results.  
Results should be comparable to AudioNet in the original paper ([AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark](https://www.sciencedirect.com/science/article/pii/S0016003223007536)):

![](https://i.imgur.com/qmvjLlZ.png)
<!-- ![](https://i.postimg.cc/HLvbjxYQ/image.png) -->

In [None]:
# Run on Validation Data
lYY     = []
lYYHat  = []
with torch.inference_mode():
    for tXX, vYY in dlVal:
        tXX = tXX.to(runDevice)
        lYY.append(vYY)
        lYYHat.append(oModel(tXX))

vYY    = torch.cat(lYY, dim = 0).cpu().numpy()
tYYHat = torch.cat(lYYHat, dim = 0).cpu().numpy()
vYYHat = np.argmax(tYYHat, axis = 1)

In [None]:
dfRes = dfData.iloc[lValIdx].copy()
dfRes = dfRes[['Digit', 'Speaker', 'NumSamples', 'Accent', 'Age', 'Gender', 'NativeSpeaker']]
dfRes['Prediction'] = vYYHat
dfRes['Correct'] = (dfRes['Digit'].to_numpy() == vYYHat)
# np.all(dfRes['Digit'].to_numpy() == vYY) #<! Verify labels match
dfRes

In [None]:
# Show Accuracy per Speaker

dfResGrpSpkr = dfRes.groupby('Speaker')
dsSpkrAcc    = dfResGrpSpkr['Correct'].mean()

hF, hA = plt.subplots(figsize = (6, 4))
sns.barplot(x = dsSpkrAcc.index, y = dsSpkrAcc, ax = hA)
hA.set_title('Accuracy per Speaker')
hA.set_xlabel('Speaker')
hA.set_ylabel('Accuracy');

In [None]:
# Confusion Matrix per Digit

hF, hA = plt.subplots(figsize = (7, 6))

hA, _ = PlotConfusionMatrix(vYY, vYYHat, hA = hA)
hA.set_title(f'Validation Data, Accuracy {np.mean(vYY == vYYHat): 0.2%}');

* <font color='red'>(**?**)</font> Can you find where the model struggles?
* <font color='green'>(**@**)</font> Optimize the _Hyper Parameters_ (`learningRate`, `weightDecay`) to improve the model performance.
* <font color='red'>(**?**)</font> Augmentation is changing the features during training in order to improve the model generalization.  
  Think of possible augmentations that can be applied to this problem.