# HW1: Frame-Level Speech Recognition

## Overview
This project builds a frame-level speech recognition model from Mel-Frequency Cepstral Coefficient (MFCC) features. Each audio frame is described by 28 MFCC features, and the task is to predict which phoneme is spoken in that frame. The notebook approaches this with a multilayer perceptron (MLP) that classifies each frame from a window of surrounding context frames; recurrent neural networks (RNNs) or convolutional neural networks (CNNs) are possible alternatives.

## Dataset
The dataset provided contains the following:
- **MFCC Features**: 28 features representing different characteristics of the audio signal at each time frame.
- **Phoneme Labels**: A corresponding phoneme label for each time frame, which represents the spoken sound at that point in time.

### Data Preprocessing
The preprocessing steps include:
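1. **Loading the MFCC data**: Each utterance is loaded with `np.load` as an array of shape `(T, 28)`, where `T` is the number of frames.
2. **Normalization**: Cepstral mean and variance normalization is applied per utterance, so each of the 28 coefficients has zero mean and unit variance across that utterance's frames.
3. **Data Partitions**: The data comes pre-split into `train-clean-100` (training), `dev-clean` (validation), and `test-clean` (unlabeled test) partitions.
4. **Context Padding**: All utterances are concatenated, and zero frames are padded at the start and end so that every frame can be paired with a fixed-size window of neighboring frames.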
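
As a minimal sketch of the normalization step, the snippet below applies per-utterance cepstral mean and variance normalization to a synthetic `(T, 28)` array. The function name `cepstral_normalize` and the random data are illustrative assumptions; the epsilon of `1e-5` mirrors the value used in the notebook's dataset class.

```python
import numpy as np

def cepstral_normalize(mfcc, eps=1e-5):
    """Normalize each MFCC coefficient to zero mean and near-unit variance over time."""
    mfcc = mfcc - mfcc.mean(axis=0, keepdims=True)          # subtract per-coefficient mean
    return mfcc / (mfcc.std(axis=0, keepdims=True) + eps)   # eps guards against division by zero

# Synthetic utterance: 100 frames x 28 coefficients (a stand-in for a real MFCC file)
rng = np.random.default_rng(0)
utterance = rng.normal(loc=5.0, scale=3.0, size=(100, 28))
normalized = cepstral_normalize(utterance)

print(normalized.shape)
print(np.abs(normalized.mean(axis=0)).max())  # very close to 0
print(normalized.std(axis=0).min())           # very close to 1
```

Normalizing along axis 0 (time) treats each coefficient independently, which removes per-recording offsets such as channel or microphone effects.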

## Objective
The primary goal is to develop a machine learning model that accurately predicts the phoneme label for each time frame based on the input MFCC features. The challenge is to:
- Handle sequential data.
- Make accurate frame-wise predictions.
- Optimize the model’s performance through hyperparameter tuning and training.

## Model Architecture
The notebook implements a deep learning model to recognize phonemes. Its key components are:
1. **Input Layer**: Accepts a flattened window of `2 * context + 1` frames, each with 28 MFCC features (1,428 values for the default context of 25).
2. **Hidden Layers**: A stack of fully connected layers, each followed by batch normalization, a ReLU activation, and dropout. Alternatives you may experiment with include:
   - **Recurrent Neural Networks (RNNs)** or **LSTMs** for sequential data modeling.
   - **Convolutional Neural Networks (CNNs)** to detect local patterns in the input features.
3. **Output Layer**: A linear layer that outputs one raw score (logit) per phoneme label. The softmax is applied inside the loss function rather than in the network.

The loss function is categorical cross-entropy (`torch.nn.CrossEntropyLoss`), and the weights are optimized with the Adam optimizer.
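
The notebook's network returns raw logits and leaves the softmax to the loss. As a rough illustration of what a cross-entropy loss over logits computes, here is a NumPy sketch: a log-softmax followed by the negative log-likelihood of the true class. The helper name and toy values are assumptions for illustration; in the notebook this is handled by `torch.nn.CrossEntropyLoss`.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy over a batch of raw logits and integer class targets."""
    # Subtract the row-wise max for numerical stability; this does not change the softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for every row
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy batch: 2 frames, 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
targets = np.array([0, 2])
loss = cross_entropy_from_logits(logits, targets)
print(loss)
```

With all-zero logits the prediction is uniform, so the loss equals `log(num_classes)`; for the 42 phoneme labels that is about 3.74, a useful sanity check for the first training step.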

## Workflow

### 1. Libraries
The project relies on several key libraries for building and training the model:
- **PyTorch**: For defining and training the neural network.
- **NumPy**: For numerical operations.
- **Matplotlib**: For plotting performance metrics.
- **scikit-learn**: For computing evaluation metrics like accuracy.

### 2. Data Loading and Preprocessing
The notebook starts by loading the MFCC feature data and performing the necessary preprocessing steps:
- **Normalizing the data** per utterance.
- **Padding context frames** so that frames at the edges of the data still have a full window.
- **Creating batches** for efficient training and validation.

### 3. Model Definition
The model is defined in PyTorch. Key elements include:
- Input features (a flattened context window with 28 MFCC values per frame).
- Fully connected hidden layers with batch normalization, ReLU, and dropout.
- A linear output layer that produces logits for multi-class classification (predicting phoneme labels).

### 4. Model Training
The model is trained using backpropagation and gradient descent. Key training details include:
- **Loss Function**: Cross-entropy loss for classification.
- **Optimizer**: Adam with an initial learning rate of `1e-3`, paired with a `ReduceLROnPlateau` scheduler.
- **Epochs**: The model is trained for several passes over the training data (30 in the default configuration).
- **Batch Size**: You can tune this hyperparameter to balance training speed and model accuracy.

### 5. Model Evaluation
The model’s performance is evaluated using:
- **Accuracy**: The percentage of correctly classified phoneme labels.
- **Confusion Matrix**: Visualizing how well the model distinguishes between different phonemes.
- **Loss Curves**: Plots showing training and validation loss over epochs.
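
As a small illustration of the first two metrics, the sketch below builds a confusion matrix from integer labels with NumPy and derives accuracy from its diagonal. The toy labels and the helper name are assumptions for illustration, not output from the notebook.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # unbuffered add handles repeated index pairs
    return cm

# Toy example with 3 classes and 6 frames
y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

Off-diagonal entries show which phonemes are confused with each other, which is often more informative than the single accuracy number.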

### 6. Visualizations
- **Prediction Plotting**: Visualize how the model's predictions align with the true labels.
- **Confusion Matrix**: Analyze errors and misclassifications.

## Requirements
To run this notebook, ensure you have the following installed:
- Python 3.x
- The following Python packages:
  ```bash
  pip install numpy torch matplotlib scikit-learn pandas tqdm wandb torchsummaryX==1.1.0
  ```

## How to Run
1. **Clone or download the notebook** to your local environment or directly upload it to Google Colab.
2. **Ensure the dataset is correctly placed** and accessible in the specified file path within the notebook. If you are using Google Colab, upload the dataset using the file upload option or mount Google Drive.
3. **Run all cells sequentially** by clicking the "Run" button for each cell or by using "Run All" from the `Runtime` menu. Make sure to execute the following:
   - **Data loading and preprocessing cells**: These cells load and preprocess the MFCC dataset.
   - **Model definition**: Defines the deep learning model for phoneme recognition.
   - **Training the model**: This section will train the model on the training dataset.
   - **Model evaluation**: The evaluation section will compute accuracy and display other evaluation metrics like confusion matrix and loss curves.

Ensure you monitor the outputs for each cell to check for errors or performance issues.

## Results
Once the model is trained and evaluated, you will obtain the following:

- **Accuracy**: A metric that tells you how often the model correctly predicted the phoneme labels for each time frame in the test dataset.
- **Loss Curves**: Graphical plots of training and validation loss over time (epochs). These help monitor the model's learning process and can indicate whether the model is overfitting or underfitting.

These results will be displayed in the output sections following the evaluation.

## Hyperparameter Tuning
To improve the model’s performance, you can fine-tune the following hyperparameters:
- **Learning Rate**: This controls the step size of the model's optimization process. A smaller learning rate may lead to better convergence but slower training, while a larger learning rate might speed up training but risk overshooting the optimal solution.
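- **Batch Size**: This determines how many samples are processed before the model updates its weights. Larger batches make better use of the GPU and give smoother gradient estimates but need more memory; smaller batches produce noisier updates, which can make training less stable.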
- **Number of Epochs**: This refers to how many complete passes through the training data the model will perform. More epochs may lead to better training but can also cause overfitting if not monitored carefully.
- **Model Architecture**: You can experiment with different architectures, such as increasing the number of layers or hidden units, to potentially improve performance.

## Conclusion
This project demonstrates how to build a deep learning model for frame-level speech recognition using MFCC features. It covers essential steps such as data preprocessing, sequence modeling, and evaluation. By applying techniques like hyperparameter tuning and monitoring training progress through loss curves, you can improve the model's ability to recognize phonemes in audio data.



# HW1: Frame-Level Speech Recognition

In this homework, you will be working with MFCC data consisting of 28 features at each time step/frame. Your model should be able to recognize the phoneme that occurred in that frame.

# Libraries

In [None]:
!pip install torchsummaryX==1.1.0 wandb --quiet

In [None]:
import torch
import numpy as np
from torchsummaryX import summary
import sklearn
import gc
import zipfile
import pandas as pd
from tqdm.auto import tqdm
import os
import datetime
import wandb
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


In [None]:
# If you are using Colab, you can mount Google Drive to save model checkpoints in a folder.
# To use it, uncomment the two lines below.
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
### PHONEME LIST
PHONEMES = [
            '[SIL]',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',
            'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
            'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
            'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
            'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
            'V',     'W',     'Y',     'Z',     'ZH',    '[SOS]', '[EOS]']

# Kaggle

This section contains code that helps you install Kaggle's API and create `kaggle.json` with your username and API key. Make sure to fill those in so that you can download the competition data successfully.
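
To keep credentials out of the notebook itself, one option is to read them from environment variables and write `kaggle.json` from Python. The sketch below is an assumption-laden illustration: the helper `write_kaggle_credentials` is hypothetical, the placeholder values are not real credentials, and the demo writes to a temporary directory rather than `/root/.kaggle`. `KAGGLE_USERNAME` and `KAGGLE_KEY` are the variable names the Kaggle client itself recognizes.

```python
import json
import os
import stat
import tempfile

def write_kaggle_credentials(username, key, directory):
    """Write kaggle.json into `directory` and restrict it to the current user."""
    os.makedirs(directory, exist_ok=True)  # no error if the directory already exists
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return path

# Demo: placeholders are used when the environment variables are not set
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(
    os.environ.get("KAGGLE_USERNAME", "your-username"),
    os.environ.get("KAGGLE_KEY", "your-api-key"),
    demo_dir,
)
print(demo_path)
```

In Colab you would pass `/root/.kaggle` as the directory; the point of the design is that the key never appears in a saved cell or its output.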

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir -p /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    # Put your Kaggle username and key here; do not share a notebook that contains real credentials
    f.write('{"username":"<your-kaggle-username>","key":"<your-kaggle-api-key>"}')

!chmod 600 /root/.kaggle/kaggle.json

Collecting kaggle==1.5.8
  Using cached kaggle-1.5.8-py3-none-any.whl
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.8
    Uninstalling kaggle-1.5.8:
      Successfully uninstalled kaggle-1.5.8
Successfully installed kaggle-1.5.8


In [None]:
# commands to download data from kaggle
!kaggle competitions download -c 11785-hw1p2-f24

!unzip -qo /content/11785-hw1p2-f24.zip -d '/content'

11785-hw1p2-f24.zip: Skipping, found more recently modified local copy (use --force to force download)


# Dataset

This section covers the dataset/dataloader class for speech data. You will have to spend time writing code to create this class successfully. We have given you a lot of comments guiding you on what code to write at each stage, from top to bottom of the class. Please try and take your time figuring this out, as it will immensely help in creating dataset/dataloader classes for future homeworks.

Before running the following cells, please take some time to analyse the structure of the data. Try loading a single MFCC and its transcript, then print out the shapes and the values. Do the transcripts look like phonemes?
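
A minimal sketch of that inspection step follows. Because the real file names are not shown here, the demo writes small synthetic `.npy` files to a temporary directory; the helper `inspect_utterance`, the file names, and the toy transcript are all assumptions for illustration. With the real data you would pass one path from the `mfcc` folder and the matching path from the `transcript` folder.

```python
import os
import tempfile
import numpy as np

def inspect_utterance(mfcc_path, transcript_path):
    """Load one MFCC array and its transcript, and print their shapes and a few labels."""
    mfcc = np.load(mfcc_path)
    transcript = np.load(transcript_path)
    print("MFCC shape      :", mfcc.shape)        # expected (T, 28)
    print("Transcript shape:", transcript.shape)  # expected (T + 2,) with [SOS] and [EOS]
    print("First labels    :", transcript[:5])
    return mfcc, transcript

# Synthetic stand-ins for one utterance with T = 6 frames
tmp_dir = tempfile.mkdtemp()
mfcc_path = os.path.join(tmp_dir, "utt_mfcc.npy")
transcript_path = os.path.join(tmp_dir, "utt_transcript.npy")
np.save(mfcc_path, np.zeros((6, 28), dtype=np.float32))
np.save(transcript_path,
        np.array(["[SOS]", "[SIL]", "HH", "HH", "AH", "AH", "[SIL]", "[EOS]"]))

mfcc, transcript = inspect_utterance(mfcc_path, transcript_path)
```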

In [None]:
# Dataset class to load train and validation data

class AudioDataset(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, context=25, partition= "train-clean-100"): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes

        # TODO: MFCC directory - use partition to access train/dev directories from kaggle data using root
        self.mfcc_dir       = os.path.join(root, partition, 'mfcc')

        # TODO: Transcripts directory - use partition to access train/dev directories from kaggle data using root
        self.transcript_dir = os.path.join(root, partition, 'transcript')

        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        self.mfcc_names          = sorted(os.listdir(self.mfcc_dir))
        # TODO: List files in self.transcript_dir using os.listdir in sorted order
        self.transcript_names    = sorted(os.listdir(self.transcript_dir))

        # Making sure that we have the same no. of mfcc and transcripts
        assert len(self.mfcc_names) == len(self.transcript_names)

        self.mfccs, self.transcripts = [], []

        # TODO: Iterate through mfccs and transcripts
        for i in range(len(self.mfcc_names)):
            # Load a single mfcc
            mfcc        = np.load(os.path.join(self.mfcc_dir, self.mfcc_names[i]))

            # Do Cepstral Normalization of mfcc (explained in writeup):
            # normalize each of the 28 coefficients over time (axis 0), per utterance.
            # The small epsilon guards against division by zero.
            mfcc = mfcc - mfcc.mean(axis=0, keepdims=True)
            mfcc = mfcc / (mfcc.std(axis=0, keepdims=True) + 1e-5)

            # Load the corresponding transcript
            transcript  = np.load(os.path.join(self.transcript_dir, self.transcript_names[i]))
            # Remove [SOS] and [EOS] from the transcript. [SOS] is always the first
            # entry and [EOS] the last, so slicing avoids traversing the transcript.
            transcript  = transcript[1:-1]

            # Append each mfcc to self.mfccs, transcript to self.transcripts
            self.mfccs.append(mfcc)
            self.transcripts.append(transcript)

        # NOTE:
        # Each mfcc is of shape T1 x 28, T2 x 28, ...
        # Each transcript is of shape (T1+2), (T2+2) before removing [SOS] and [EOS]

        # TODO: Concatenate all mfccs in self.mfccs such that
        # the final shape is T x 28 (Where T = T1 + T2 + ...)
        self.mfccs          =  np.concatenate(self.mfccs, axis=0)

        # TODO: Concatenate all transcripts in self.transcripts such that
        # the final shape is (T,) meaning, each time step has one phoneme output
        self.transcripts    =  np.concatenate(self.transcripts, axis=0)
        # Hint: Use numpy to concatenate

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Take some time to think about what we have done.
        # self.mfcc is an array of the format (Frames x Features).
        # Our goal is to recognize phonemes of each frame
        # We can introduce context by padding zeros on top and bottom of self.mfcc
        self.mfccs = np.pad(self.mfccs,((context,context),(0,0)), mode='constant', constant_values=0) # TODO

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such.
        # Hence, we map these phonemes to integers

        # TODO: Map the phonemes to their corresponding list indexes in self.phonemes
        phoneme_to_idx = {phoneme: idx for idx, phoneme in enumerate(self.phonemes)}
        self.transcripts = np.array([phoneme_to_idx[p] for p in self.transcripts])
        # Now, if an element in self.transcripts is 0, it means that it is '[SIL]' (as per the above example)

    def __len__(self):
        return self.length

    def __getitem__(self, ind):


        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind : ind+2*self.context+1]
        # After slicing, you get an array of shape 2*context+1 x 28. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # TODO: Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors
        phonemes    = torch.tensor(self.transcripts[ind])

        return frames, phonemes

In [None]:
class AudioTestDataset(torch.utils.data.Dataset):
    def __init__(self, root, phonemes = PHONEMES, context=25, partition= "test-clean"):

      self.context = context
      self.phonemes = phonemes

      self.mfcc_dir = os.path.join(root, partition,'mfcc')
      self.mfcc_names = sorted(os.listdir(self.mfcc_dir))

      self.mfccs = []

      for mfc in self.mfcc_names:
        mfcc = np.load(os.path.join(self.mfcc_dir,mfc))
        mfcc = mfcc - mfcc.mean(axis=0, keepdims=True)
        mfcc = mfcc / (mfcc.std(axis=0, keepdims=True) + 1e-5)
        self.mfccs.append(mfcc)

      self.mfccs = np.concatenate(self.mfccs, axis=0)
      self.length = len(self.mfccs)

      self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), mode='constant', constant_values=0)

    def __len__(self):
      return self.length

    def __getitem__(self, ind):
      frames = self.mfccs[ind:ind+2*self.context+1]
      frames = frames.flatten()

      frames = torch.FloatTensor(frames)

      return frames


    # TODO: Create a test dataset class similar to the previous class but you dont have transcripts for this
    # Imp: Read the mfccs in sorted order, do NOT shuffle the data here or in your dataloader.
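
To make the indexing in `__getitem__` concrete, here is a toy sketch of the padding-and-slicing idea with a context of 2 and 3 features per frame (the real notebook uses a context of 25 and 28 features). The helper `get_window` and the toy sizes are assumptions for illustration. After padding `context` zero frames on each side, the window starting at padded index `ind` is centered on original frame `ind`.

```python
import numpy as np

context = 2
num_frames, num_features = 5, 3

# Toy "MFCC" matrix: frame i holds the values 3*i, 3*i + 1, 3*i + 2
frames = np.arange(num_frames * num_features, dtype=np.float32).reshape(num_frames, num_features)

# Pad `context` zero frames before the first and after the last frame
padded = np.pad(frames, ((context, context), (0, 0)), mode="constant", constant_values=0)

def get_window(padded, ind, context):
    """Return the (2*context + 1) frames centered on original frame `ind`."""
    return padded[ind : ind + 2 * context + 1]

window = get_window(padded, 0, context)
print(window.shape)            # (5, 3)
print(window[context])         # the center row is original frame 0
print(window.flatten().shape)  # (15,), the 1D vector an MLP would receive
```

Because the padding shifts every frame down by `context` rows, no offset arithmetic is needed in `__getitem__`: slicing from `ind` already centers the window.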

# Parameters Configuration

Storing your parameters and hyperparameters in a single configuration dictionary makes it easier to keep track of them during each experiment. It can also be used with weights and biases to log your parameters for each experiment and keep track of them across multiple experiments.

In [None]:
config = {
    'epochs'        : 30,
    'batch_size'    : 2048, #1024,
    'context'       : 25,
    'init_lr'       : 1e-3,
    'architecture'  : 'High-cut-off',
    'dropout'       : 0.35
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}


# Create Datasets

In [None]:
#TODO: Create a dataset object using the AudioDataset class for the training data
train_data = AudioDataset(root = './11785-f24-hw1p2/', partition = 'train-clean-100' )

# TODO: Create a dataset object using the AudioDataset class for the validation data
val_data = AudioDataset(root = './11785-f24-hw1p2/', partition='dev-clean')



In [None]:
test_data = AudioTestDataset(root = './11785-f24-hw1p2/',partition='test-clean')

In [None]:
# Define dataloaders for train, val and test datasets
# Dataloaders will yield a batch of frames and phonemes of given batch_size at every iteration
# We shuffle train dataloader but not val & test dataloader. Why?

train_loader = torch.utils.data.DataLoader(
    dataset     = train_data,
    num_workers = 4,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = True
)

val_loader = torch.utils.data.DataLoader(
    dataset     = val_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)


print("Batch size     : ", config['batch_size'])
print("Context        : ", config['context'])
print("Input size     : ", (2*config['context']+1)*28)
print("Output symbols : ", len(PHONEMES))

print("Train dataset samples = {}, batches = {}".format(len(train_data), len(train_loader)))
print("Validation dataset samples = {}, batches = {}".format(len(val_data), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(len(test_data), len(test_loader)))

Batch size     :  2048
Context        :  25
Input size     :  1428
Output symbols :  42
Train dataset samples = 36091157, batches = 17623
Validation dataset samples = 1928204, batches = 942
Test dataset samples = 1934138, batches = 945
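
The numbers printed above can be checked by hand. The input size is the window length times the feature count, and since the loaders keep the final partial batch (the `DataLoader` default), the batch count is the sample count divided by the batch size, rounded up. The short check below uses only the values shown in the output.

```python
import math

batch_size = 2048
context = 25
num_features = 28

# (2 * 25 + 1) frames per window, 28 features per frame
input_size = (2 * context + 1) * num_features
print(input_size)  # 1428

# Sample counts copied from the cell output above
samples = {"train": 36091157, "val": 1928204, "test": 1934138}
batches = {name: math.ceil(n / batch_size) for name, n in samples.items()}
print(batches)  # {'train': 17623, 'val': 942, 'test': 945}
```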


In [None]:
# Testing code to check if your data loaders are working
for i, data in enumerate(train_loader):
    frames, phoneme = data
    print(frames.shape, phoneme.shape)
    break

torch.Size([2048, 1428]) torch.Size([2048])


# Network Architecture


## Network Architecture Overview

This class `Network` is a fully connected feedforward neural network (also known as a Multi-Layer Perceptron, or MLP) built using the PyTorch library. The architecture includes multiple linear layers, batch normalization, ReLU activations, and dropout for regularization.

### 1. Input Layer
- **Input Size**: The network takes an input of size `input_size`, the dimensionality of the flattened feature vector. In this notebook that is `(2 * context + 1) * 28 = 1428` for a context of 25 frames.

### 2. Hidden Layers
The network consists of several hidden layers, each comprising the following components:

- **Linear Layers**:
  - The data flows through a sequence of linear transformations. The dimensions change as follows:
    - `input_size` → 512 → 1024 → 2048 → 2048 → 2048 → 2048 → 1024 → 512 → `output_size`.
  - The hidden layers widen toward the middle of the network (up to 2048 units) and narrow again toward the output, giving the model more capacity to capture complex features.

- **Batch Normalization** (`torch.nn.BatchNorm1d`):
  - Applied after each hidden linear layer, batch normalization helps stabilize and speed up training by normalizing that layer's outputs across the batch, reducing issues such as vanishing or exploding gradients.

- **ReLU Activation** (`torch.nn.ReLU()`):
  - A non-linear activation function applied after batch normalization. ReLU sets negative values to zero and keeps positive values unchanged, introducing non-linearity into the model.

- **Dropout** (`torch.nn.Dropout(p=config['dropout'])`):
  - Dropout is used for regularization to prevent overfitting. It randomly sets a fraction `p` of neurons to zero during training. The dropout rate is controlled by `config['dropout']`.

### 3. Output Layer
- **Linear Layer (512 → `output_size`)**:
  - The final linear layer reduces the hidden dimension from 512 to the number of output classes, `output_size`. Here that is the number of entries in `PHONEMES` (42, including `[SOS]` and `[EOS]`).
  - No activation (such as softmax) is applied in the final layer, so the raw logits are returned. `torch.nn.CrossEntropyLoss` applies the log-softmax internally.
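
Since the homework limits models to 20 million parameters, it helps to estimate the size of this architecture by hand. The sketch below counts parameters in plain Python from the layer widths in the `Network` class, assuming an input size of 1428 and 42 output classes: each linear layer contributes `n_in * n_out + n_out`, and each `BatchNorm1d` contributes a learnable scale and shift per unit. The helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count trainable parameters of an MLP with BatchNorm1d after every hidden linear layer."""
    total = 0
    last = len(layer_sizes) - 2  # index of the output layer, which has no batch norm
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        total += n_in * n_out + n_out  # linear weights and biases
        if i < last:
            total += 2 * n_out         # batch norm scale and shift
    return total

# Widths read off the Network class: input, eight hidden layers, output
layer_sizes = [1428, 512, 1024, 2048, 2048, 2048, 2048, 1024, 512, 42]
total = count_parameters(layer_sizes)
print(f"{total:,}")  # 18,612,266
```

Under these assumptions the model sits at roughly 18.6 million parameters, just below the limit, so adding another wide layer would exceed it.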


This section defines your network architecture for the homework. We have given you a sample architecture that can easily clear the very low cutoff for the early submission deadline.

In [None]:
class Network(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(Network, self).__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),
            torch.nn.Linear(512, 1024),
            torch.nn.BatchNorm1d(1024),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(1024, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(2048, 1024),
            torch.nn.BatchNorm1d(1024),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(1024, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=config['dropout']),

            torch.nn.Linear(512, output_size)
        )

    def forward(self, x):
        out = self.model(x)
        return out


# Define Model, Loss Function and Optimizer

Here we define the model, loss function, optimizer and optionally a learning rate scheduler.

In [None]:
INPUT_SIZE  = (2*config['context'] + 1) * 28 # Why is this the case?
model       = Network(INPUT_SIZE, len(train_data.phonemes)).to(device)
summary(model, frames.to(device))
# Check number of parameters of your network
# Remember, you are limited to 20 million parameters for HW1 (including ensembles)

----------------------------------------------------------------------------------------------------
Layer                   Kernel Shape         Output Shape         # Params (K)      # Mult-Adds (M)
0_Linear                 [1428, 512]          [2048, 512]               731.65                 0.73
1_BatchNorm1d                  [512]          [2048, 512]                 1.02                 0.00
2_ReLU                             -          [2048, 512]                    -                    -
3_Dropout                          -          [2048, 512]                    -                    -
4_Linear                 [512, 1024]         [2048, 1024]               525.31                 0.52
5_BatchNorm1d                 [1024]         [2048, 1024]                 2.05                 0.00
6_ReLU                             -         [2048, 1024]                    -                    -
7_Dropout                          -         [2048, 1024]                    -                    -

In [None]:
criterion = torch.nn.CrossEntropyLoss() # Defining Loss function.
# We use CE because the task is multi-class classification

optimizer = torch.optim.Adam(model.parameters(), lr= config['init_lr']) #Defining Optimizer
# Recommended : Define Scheduler for Learning Rate,
# including but not limited to StepLR, MultiStep, CosineAnnealing, CosineAnnealingWithWarmRestarts, ReduceLROnPlateau, etc.
# You can refer to Pytorch documentation for more information on how to use them.
# Is your training time very high?
# Look into mixed precision training if your GPU (Tesla T4, V100, etc) can make use of it
# Refer - https://pytorch.org/docs/stable/notes/amp_examples.html

In [None]:
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5, verbose=True)
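
To build intuition for the scheduler settings above, here is a simplified pure-Python sketch of plateau-based learning rate decay. It is not PyTorch's implementation: it ignores cooldown and minimum learning rate, and it assumes the default relative improvement threshold of `1e-4`. The function name and the toy loss sequence are illustrative assumptions.

```python
def plateau_schedule(val_losses, init_lr=1e-3, factor=0.1, patience=5, threshold=1e-4):
    """Return the learning rate in effect after each epoch's validation loss."""
    lr = init_lr
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1 - threshold):  # a meaningful improvement in 'min' mode
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:          # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Loss improves twice, then stalls for six epochs
losses = [1.0, 0.9] + [0.9] * 6
history = plateau_schedule(losses)
print(history)
```

With `patience=5`, the learning rate drops on the sixth consecutive epoch without improvement, so in a 30-epoch run only a few reductions can occur.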



# Training and Validation Functions

This section covers the training and validation functions for each epoch of running your experiment with a given model architecture. The code has been provided to you, but we recommend going through the comments to understand the workflow so that you can write these loops for future HWs.

In [None]:
torch.cuda.empty_cache()
gc.collect()

30

In [None]:
def train_given(model, dataloader, optimizer, criterion):

    model.train()
    tloss, tacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        ### Forward Propagation
        logits  = model(frames)

        ### Loss Calculation
        loss    = criterion(logits, phonemes)

        ### Backward Propagation
        loss.backward()

        ### Gradient Descent
        optimizer.step()

        tloss   += loss.item()
        tacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))),
                              acc="{:.04f}%".format(float(tacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    tloss   /= len(dataloader)
    tacc    /= len(dataloader)

    return tloss, tacc

In [None]:
#Train with AMP
def train(model, dataloader, optimizer, criterion):
    model.train()
    tloss, tacc = 0, 0  # Monitoring loss and accuracy
    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    # Initialize GradScaler for AMP
    scaler = torch.cuda.amp.GradScaler()

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames = frames.to(device)
        phonemes = phonemes.to(device)

        # Use AMP autocast for the forward pass to utilize mixed precision
        with torch.cuda.amp.autocast():
            ### Forward Propagation
            logits = model(frames)

            ### Loss Calculation
            loss = criterion(logits, phonemes)

        ### Backward Propagation using scaled loss
        scaler.scale(loss).backward()

        ### Gradient Descent (unscaled)
        scaler.step(optimizer)

        # Update the scaler for the next iteration
        scaler.update()

        tloss += loss.item()
        tacc += torch.sum(torch.argmax(logits, dim=1) == phonemes).item() / logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))),
                              acc="{:.04f}%".format(float(tacc * 100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    tloss /= len(dataloader)
    tacc /= len(dataloader)

    return tloss, tacc


In [None]:
def eval(model, dataloader):

    model.eval() # set model in evaluation mode
    vloss, vacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Move data to device (ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        # makes sure that there are no gradients computed as we are not training the model now
        with torch.inference_mode():
            ### Forward Propagation
            logits  = model(frames)
            ### Loss Calculation
            loss    = criterion(logits, phonemes)

        vloss   += loss.item()
        vacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        # Do you think we need loss.backward() and optimizer.step() here?

        batch_bar.set_postfix(loss="{:.04f}".format(float(vloss / (i + 1))),
                              acc="{:.04f}%".format(float(vacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    vloss   /= len(dataloader)
    vacc    /= len(dataloader)

    return vloss, vacc
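
One detail of the loops above is worth noting: they average per-batch accuracies, so a smaller final batch counts as much as a full one. With hundreds of batches of 2048 frames the effect is negligible, but the toy sketch below, with made-up batch sizes and counts, shows how the two ways of averaging can differ. Both helper names are hypothetical.

```python
def mean_of_batch_accuracies(correct_per_batch, batch_sizes):
    """Average the accuracy of each batch, as the training and validation loops do."""
    return sum(c / n for c, n in zip(correct_per_batch, batch_sizes)) / len(batch_sizes)

def frame_weighted_accuracy(correct_per_batch, batch_sizes):
    """Divide total correct frames by total frames."""
    return sum(correct_per_batch) / sum(batch_sizes)

# Two full batches of 4 frames and a final partial batch of 2 frames
batch_sizes = [4, 4, 2]
correct = [4, 2, 0]

print(mean_of_batch_accuracies(correct, batch_sizes))  # 0.5
print(frame_weighted_accuracy(correct, batch_sizes))   # 0.6
```

If exact frame-level accuracy matters, for example when comparing against a leaderboard score, accumulating correct and total counts is the safer choice.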

# Weights and Biases Setup

This section enables logging metrics and files with Weights and Biases. Please refer to the wandb documentation and recitation 0, which cover the use of Weights and Biases for logging, hyperparameter tuning, and monitoring your runs for your homeworks. Using this tool makes it very easy to show results when submitting your code and models for homeworks, and it is also extremely useful for study groups to organize and run ablations under a single team in wandb.

We have written code for you to make use of it out of the box, so that you start using wandb for all your HWs from the beginning.

In [None]:
wandb.login(key="4f73ce31d6dbaa074483cb06e3099a23ae82ee7b") #API Key is in your wandb account, under settings (wandb.ai/settings)

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33msbhoir[0m ([33msbhoir-carnegie-mellon-university[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Create your wandb run
run = wandb.init(
    name    = "sixth-run", ### Wandb creates random run names if you skip this field; we recommend giving runs useful names
    reinit  = True, ### Allows reinitializing runs when you re-run this cell
    id      = "y28t31", ### Insert a specific run id here if you want to resume a previous run
    # resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using it
    project = "hw1p2", ### Project should be created in your wandb account
    config  = config ### Wandb Config for your run
)

In [None]:
### Save your model architecture as a string with str(model)
model_arch  = str(model)

### Save it in a txt file
with open("model_arch.txt", "w") as arch_file:
    arch_file.write(model_arch)

### log it in your wandb run with wandb.save()
wandb.save('model_arch.txt')

['/content/wandb/run-20240920_235242-y28t31/files/model_arch.txt']

# Experiment

Now, it is time to finally run your ablations! Have fun!

In [None]:
# Iterate over number of epochs to train and evaluate your model
torch.cuda.empty_cache()
gc.collect()
wandb.watch(model, log="all")

for epoch in range(config['epochs']):

    print("\nEpoch {}/{}".format(epoch+1, config['epochs']))

    curr_lr                 = float(optimizer.param_groups[0]['lr'])
    train_loss, train_acc   = train(model, train_loader, optimizer, criterion)
    val_loss, val_acc       = eval(model, val_loader)
    scheduler.step(val_loss)

    print("\tTrain Acc {:.04f}%\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_acc*100, train_loss, curr_lr))
    print("\tVal Acc {:.04f}%\tVal Loss {:.04f}".format(val_acc*100, val_loss))

    ### Log metrics at each epoch in your run
    # Optionally, you can log at each batch inside train/eval functions
    # (explore wandb documentation/wandb recitation)
    wandb.log({'train_acc': train_acc*100, 'train_loss': train_loss,
               'val_acc': val_acc*100, 'valid_loss': val_loss, 'lr': curr_lr})

    ### Highly Recommended: Save checkpoint in drive and/or wandb if accuracy is better than your current best



Epoch 1/30
	Train Acc 78.1244%	Train Loss 0.6659	 Learning Rate 0.0010000
	Val Acc 82.8413%	Val Loss 0.5059

Epoch 2/30
	Train Acc 78.1419%	Train Loss 0.6652	 Learning Rate 0.0010000
	Val Acc 82.8970%	Val Loss 0.5042

Epoch 3/30
	Train Acc 78.1588%	Train Loss 0.6646	 Learning Rate 0.0010000
	Val Acc 82.9093%	Val Loss 0.5041

Epoch 4/30
	Train Acc 78.1724%	Train Loss 0.6640	 Learning Rate 0.0010000
	Val Acc 82.9041%	Val Loss 0.5044

Epoch 5/30
	Train Acc 78.1927%	Train Loss 0.6633	 Learning Rate 0.0010000
	Val Acc 82.9144%	Val Loss 0.5033

Epoch 6/30
	Train Acc 78.2082%	Train Loss 0.6628	 Learning Rate 0.0010000
	Val Acc 82.9522%	Val Loss 0.5028

Epoch 7/30
	Train Acc 78.2235%	Train Loss 0.6623	 Learning Rate 0.0010000
	Val Acc 82.9692%	Val Loss 0.5027

Epoch 8/30
	Train Acc 78.2513%	Train Loss 0.6615	 Learning Rate 0.0010000
	Val Acc 82.9458%	Val Loss 0.5028

Epoch 9/30
	Train Acc 78.2537%	Train Loss 0.6612	 Learning Rate 0.0010000
	Val Acc 82.9579%	Val Loss 0.5027

Epoch 10/30
	Train Acc 78.2697%	Train Loss 0.6607	 Learning Rate 0.0010000
	Val Acc 82.9735%	Val Loss 0.5018

Epoch 11/30
	Train Acc 78.2785%	Train Loss 0.6603	 Learning Rate 0.0010000
	Val Acc 82.9968%	Val Loss 0.5009

Epoch 12/30
	Train Acc 78.2963%	Train Loss 0.6597	 Learning Rate 0.0010000
	Val Acc 82.9966%	Val Loss 0.5005

Epoch 13/30
	Train Acc 78.3134%	Train Loss 0.6593	 Learning Rate 0.0010000
	Val Acc 82.9813%	Val Loss 0.5009

Epoch 14/30
	Train Acc 78.3338%	Train Loss 0.6586	 Learning Rate 0.0010000
	Val Acc 83.0124%	Val Loss 0.5005

Epoch 15/30
	Train Acc 78.3397%	Train Loss 0.6582	 Learning Rate 0.0010000
	Val Acc 83.0071%	Val Loss 0.5003

Epoch 16/30
	Train Acc 78.3512%	Train Loss 0.6576	 Learning Rate 0.0010000
	Val Acc 83.0559%	Val Loss 0.4989

Epoch 17/30
	Train Acc 78.3657%	Train Loss 0.6575	 Learning Rate 0.0010000
	Val Acc 83.0441%	Val Loss 0.4990

Epoch 18/30
	Train Acc 78.3796%	Train Loss 0.6569	 Learning Rate 0.0010000
	Val Acc 83.0546%	Val Loss 0.4991

Epoch 19/30
	Train Acc 78.4049%	Train Loss 0.6564	 Learning Rate 0.0010000
	Val Acc 83.0795%	Val Loss 0.4982

Epoch 20/30
	Train Acc 78.4059%	Train Loss 0.6561	 Learning Rate 0.0010000
	Val Acc 83.0757%	Val Loss 0.4980

Epoch 21/30
	Train Acc 78.4197%	Train Loss 0.6556	 Learning Rate 0.0010000
	Val Acc 83.1047%	Val Loss 0.4979

Epoch 22/30
	Train Acc 78.4321%	Train Loss 0.6552	 Learning Rate 0.0010000
	Val Acc 83.1185%	Val Loss 0.4975

Epoch 23/30
	Train Acc 78.4519%	Train Loss 0.6547	 Learning Rate 0.0010000
	Val Acc 83.1013%	Val Loss 0.4970

Epoch 24/30
	Train Acc 78.4548%	Train Loss 0.6544	 Learning Rate 0.0010000
	Val Acc 83.1281%	Val Loss 0.4972

Epoch 25/30
	Train Acc 78.4667%	Train Loss 0.6541	 Learning Rate 0.0010000
	Val Acc 83.1428%	Val Loss 0.4960

Epoch 26/30
	Train Acc 78.4838%	Train Loss 0.6536	 Learning Rate 0.0010000
	Val Acc 83.1299%	Val Loss 0.4960

Epoch 27/30
	Train Acc 78.4826%	Train Loss 0.6531	 Learning Rate 0.0010000
	Val Acc 83.1199%	Val Loss 0.4962

Epoch 28/30
	Train Acc 78.5060%	Train Loss 0.6529	 Learning Rate 0.0010000
	Val Acc 83.1808%	Val Loss 0.4947

Epoch 29/30
	Train Acc 78.4986%	Train Loss 0.6526	 Learning Rate 0.0010000
	Val Acc 83.1701%	Val Loss 0.4951

Epoch 30/30
	Train Acc 78.5146%	Train Loss 0.6522	 Learning Rate 0.0010000
	Val Acc 83.1842%	Val Loss 0.4947
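
The training loop leaves checkpointing as a recommendation. One way to structure it is a small helper that remembers the best validation accuracy and calls a save function only on improvement. This is a sketch under assumptions: `BestCheckpoint` and `save_fn` are hypothetical names, and in the notebook `save_fn` would typically wrap `torch.save` on the model and optimizer state dicts.

```python
class BestCheckpoint:
    """Track the best validation accuracy and save only when it improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn          # hypothetical hook: callable(epoch, val_acc)
        self.best_acc = float("-inf")

    def update(self, epoch, val_acc):
        """Return True (and call save_fn) if val_acc beats the best seen so far."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.save_fn(epoch, val_acc)
            return True
        return False


saved = []  # stands in for checkpoint files written to disk
tracker = BestCheckpoint(lambda epoch, acc: saved.append((epoch, acc)))

# Validation accuracies from epochs 1-4 of the log above.
for epoch, val_acc in enumerate([82.8413, 82.8970, 82.9093, 82.9041], start=1):
    tracker.update(epoch, val_acc)

print(saved)  # epoch 4 is skipped because it did not improve on epoch 3
```

Inside the epoch loop this would reduce to a single `tracker.update(epoch, val_acc)` call after validation.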


# Testing and submission to Kaggle

Before running the following code, look at the submission format given in *sample_submission.csv*. Then fill in the following function to run inference on the test data. Refer to the eval function from the previous cells for an idea of how to complete it.

In [None]:
def test(model, test_loader):
    ### Put the model in evaluation mode for inference
    model.eval()

    ### List to store predicted phonemes of test data
    test_predictions = []
    ### Map each class index back to its phoneme string
    idx_to_phoneme = {idx: phoneme for idx, phoneme in enumerate(PHONEMES)}

    ### Disable gradient tracking during inference
    with torch.no_grad():

        for mfccs in tqdm(test_loader):

            mfccs   = mfccs.to(device)

            logits  = model(mfccs)

            ### Get most likely predicted phoneme with argmax
            predicted_idx = torch.argmax(logits, dim=1).cpu().numpy()
            predicted_phonemes = [idx_to_phoneme[idx].replace("'", "") for idx in predicted_idx]

            ### Append this batch's predictions to the running list
            test_predictions.extend(predicted_phonemes)

    return test_predictions
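
The decoding step inside `test` amounts to an argmax over each row of logits followed by a lookup into the label list. Here is a framework-free sketch with a made-up three-phoneme vocabulary; the real `PHONEMES` list is defined elsewhere in the notebook and is not reproduced here, and `decode` is a hypothetical helper.

```python
# Hypothetical vocabulary for illustration only.
TOY_PHONEMES = ["[SIL]", "AA", "AE"]


def decode(logit_rows, vocab):
    """Map each row of scores to the label with the highest score."""
    labels = []
    for row in logit_rows:
        best_idx = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        labels.append(vocab[best_idx])
    return labels


toy_logits = [
    [0.1, 2.3, -1.0],   # highest score at index 1
    [1.5, 0.2, 0.3],    # highest score at index 0
    [-0.4, 0.0, 0.9],   # highest score at index 2
]
print(decode(toy_logits, TOY_PHONEMES))  # ['AA', '[SIL]', 'AE']
```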

In [None]:
# Get a single batch from the test_loader
test_iter = iter(test_loader)
frames = next(test_iter)

# Print the size of the data point
print("Size of frames (input):", frames.size())  # For input features (frames)
# The test loader yields inputs only, so there are no target labels (phonemes) to print


Size of frames (input): torch.Size([2048, 1428])
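
The second dimension, 1428, is consistent with 28 MFCC coefficients per frame stacked over a window of 51 frames, that is, 25 frames of context on each side of the target frame. That reading is an assumption, since the dataset class is not shown here; the arithmetic can be checked with a small hypothetical helper:

```python
def infer_context(flat_dim, n_mfcc=28):
    """Recover the per-side context from a flattened (2*context + 1) * n_mfcc input."""
    if flat_dim % n_mfcc != 0:
        raise ValueError("flat_dim is not a multiple of n_mfcc")
    window = flat_dim // n_mfcc           # number of frames per example
    if window % 2 == 0:
        raise ValueError("window is not symmetric around a centre frame")
    return (window - 1) // 2


print(infer_context(1428))  # 25
```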


Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7c57e553ef80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.10/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

In [None]:
predictions = test(model, test_loader)

  0%|          | 0/945 [00:00<?, ?it/s]


In [None]:
### Create CSV file with predictions
with open("./submission.csv", "w") as f:
    f.write("id,label\n")
    for i, label in enumerate(predictions):
        f.write("{},{}\n".format(i, label))
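
A slightly more defensive variant uses the standard `csv` module and reads the file back to confirm the header and row count before uploading. This is a sketch: `write_submission` and `check_submission` are hypothetical helpers, and the `id,label` header is taken from the cell above.

```python
import csv
import os
import tempfile


def write_submission(predictions, path):
    """Write predictions as (id, label) rows under an 'id,label' header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["id", "label"])
        for i, label in enumerate(predictions):
            writer.writerow([i, label])


def check_submission(path, expected_rows):
    """Re-read the file and verify the header and the number of data rows."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0] == ["id", "label"] and len(rows) - 1 == expected_rows


# Toy predictions; the real list comes from test(model, test_loader).
toy_predictions = ["AA", "AE", "[SIL]"]
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(toy_predictions, out_path)
print(check_submission(out_path, len(toy_predictions)))  # True
```

Comparing the row count against the expected number of test frames catches a truncated or partially written file before it is submitted.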

In [None]:
### Finish your wandb run
run.finish()


Run history:
lr,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_acc,▁▁▂▃▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████
train_loss,▆█▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_acc,▃▃▁▂▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇███████████████
valid_loss,█▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
lr,0.001
train_acc,78.51456
train_loss,0.65225
val_acc,83.18423
valid_loss,0.4947


In [None]:
### Submit to the Kaggle competition using the Kaggle API
!kaggle competitions submit -c 11785-hw1p2-f24 -f ./submission.csv -m "Test Submission"

### However, it's always safer to download the csv file and then upload it to Kaggle

100% 19.3M/19.3M [00:01<00:00, 19.7MB/s]
Successfully submitted to 11785 HW1P2 Fall 2024