
# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint
### Assignment: Speech and Audio Processing

## Learning Objectives

At the end of the experiment, you will be able to:

* extract features from audio samples
* implement a Convolutional Neural Network (CNN) model to classify speech as dysarthric or non-dysarthric
* evaluate the trained CNN model on the test set

### Introduction

Dysarthria is a motor speech disorder in which the muscles used for speech become weak, making it difficult to articulate speech that is otherwise linguistically normal. This assignment focuses on detecting dysarthria automatically from speech recordings, which could assist physicians and speech specialists in diagnosis.

The TORGO Database of Dysarthric Articulation was developed by the University of Toronto's departments of Computer Science and Speech-Language Pathology in collaboration with the Holland-Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains approximately 23 hours of English speech data, with accompanying transcripts and documentation, from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and from 7 speakers (4 males, 3 females) in a non-dysarthric control group.

### Dataset

The TORGO database of dysarthric articulation consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), two of the most prevalent causes of speech disability (Kent and Rosen, 2004), and from matched controls. The database is the result of a collaboration between the departments of Computer Science and Speech-Language Pathology at the University of Toronto and the Holland-Bloorview Kids Rehabilitation Hospital in Toronto.

**Speakers:** Both CP and ALS result in dysarthria, which is caused by disruptions in the neuro-motor interface. These disruptions distort motor commands to the vocal articulators, resulting in atypical and relatively unintelligible speech in most cases (Kent, 2000). This unintelligibility can significantly diminish the usefulness of traditional automatic speech recognition (ASR) software. The inability of modern ASR to understand dysarthric speech effectively is a major problem, since the more general physical disabilities often associated with the condition can make other forms of computer input, such as keyboards or touch screens, especially difficult (Hosom et al., 2003).

### Importing required packages

In [1]:
!pip install huggingface_hub

import os
import warnings
from pathlib import Path

import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")



### Download the TORGO speech dataset

In [None]:
from huggingface_hub import hf_hub_download

# `force_filename` is deprecated in huggingface_hub; `local_dir` places
# the file directly under the given directory instead.
hf_hub_download(
    repo_id='viks66/torgo_speech',
    filename='torgo.zip',
    repo_type='dataset',
    local_dir='./',
)




'./torgo.zip'

In [None]:
!unzip -q torgo.zip

In [None]:
data_path = 'torgo/'

In [None]:
def get_files(path, extension='.wav'):
    return list(path.rglob(f'*{extension}'))

In [None]:
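all_files = get_files(Path(data_path))
# Skip zero-byte files.
all_files = [f for f in all_files if os.path.getsize(str(f)) != 0]
# The speaker ID is the fourth path component from the end.
speakers = set(str(f).split('/')[-4] for f in all_files)
# Control speakers have a 'C' in their ID (e.g. FC01, MC01): label 0 = control, 1 = dysarthric.
labels = {str(f): 0 if 'C' in str(f).split('/')[-4] else 1 for f in all_files}
print(len(speakers), speakers)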

15 {'MC03', 'F03', 'M01', 'M05', 'MC02', 'M02', 'M03', 'MC01', 'FC01', 'F01', 'MC04', 'FC03', 'M04', 'FC02', 'F04'}
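
Both the speaker ID and the label are read off the file path. The sketch below shows that derivation on two made-up paths. The directory layout (`torgo/<speaker>/<session>/<microphone>/<file>.wav`) is an assumption inferred from the `[-4]` index used above, and the helper names `speaker_of` and `label_of` are hypothetical.

```python
from pathlib import PurePosixPath


def speaker_of(path):
    """Return the speaker directory, four components from the end of the path."""
    return PurePosixPath(path).parts[-4]


def label_of(path):
    """Return 0 for control speakers (ID contains 'C', e.g. FC01), else 1."""
    return 0 if 'C' in speaker_of(path) else 1


# Hypothetical paths, for illustration only.
examples = [
    'torgo/FC01/Session1/wav_arrayMic/0001.wav',
    'torgo/M05/Session2/wav_headMic/0042.wav',
]
for p in examples:
    print(speaker_of(p), label_of(p))
```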


In [None]:
test_speakers = ['F04', 'FC03', 'M05', 'MC04']
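
Holding out whole speakers keeps the evaluation speaker-independent: no voice heard during training should appear at test time. A minimal sketch of such a split follows, assuming the speaker ID is the fourth path component from the end (as in the `speakers` computation). The file paths and the helper `split_by_speaker` are hypothetical.

```python
def split_by_speaker(paths, held_out):
    """Partition file paths into (train, test) lists by speaker directory."""
    held_out = set(held_out)
    train, test = [], []
    for p in paths:
        speaker = p.split('/')[-4]
        (test if speaker in held_out else train).append(p)
    return train, test


# Hypothetical file list, for illustration only.
paths = [
    'torgo/F01/Session1/wav_arrayMic/0001.wav',
    'torgo/F04/Session1/wav_arrayMic/0001.wav',
    'torgo/MC04/Session2/wav_headMic/0007.wav',
    'torgo/MC01/Session1/wav_arrayMic/0003.wav',
]
train_paths, test_paths = split_by_speaker(paths, ['F04', 'FC03', 'M05', 'MC04'])

# The two sets of speakers must not overlap.
train_ids = {p.split('/')[-4] for p in train_paths}
test_ids = {p.split('/')[-4] for p in test_paths}
assert train_ids.isdisjoint(test_ids)
print(len(train_paths), len(test_paths))
```

A check like the final `assert` is cheap to run on the real file list and guards against accidentally indexing the wrong path component.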

In [None]:
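class DysarthricDataset(Dataset):
    """MFCC features and binary labels (0 = control, 1 = dysarthric) for TORGO utterances."""

    def __init__(self, mode, test_speakers, labels, num_val=200):
        # The speaker ID is the fourth path component from the end,
        # matching how `speakers` was computed.
        def speaker(path):
            return path.split('/')[-4]

        if mode in ('train', 'val'):
            label_names = sorted(p for p in labels if speaker(p) not in test_speakers)
        elif mode == 'test':
            label_names = sorted(p for p in labels if speaker(p) in test_speakers)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        # The first `num_val` sorted training files form the validation set.
        if mode == 'val':
            label_names = label_names[:num_val]
        elif mode == 'train':
            label_names = label_names[num_val:]
        self.label_names = label_names
        self.label_dict = labels

    def __len__(self):
        return len(self.label_names)

    def __getitem__(self, idx):
        # librosa resamples to 22,050 Hz by default; the MFCC matrix is transposed to (frames, 13).
        y, sr = librosa.load(self.label_names[idx])
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
        return torch.from_numpy(mfcc), self.label_dict[self.label_names[idx]]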

class BatchPadCollafeFn():
    """Zero-pad the MFCC sequences in a batch to the length of the longest one."""

    def __call__(self, batch):
        # Sort by sequence length, longest first.
        input_lengths, ids_sorted_decreasing = torch.sort(
            torch.LongTensor([len(x[0]) for x in batch]),
            dim=0, descending=True)
        max_input_len = int(input_lengths[0])
        n_feats = batch[ids_sorted_decreasing[0]][0].shape[-1]
        # MFCCs are real-valued, so the padded buffer must be float, not integer.
        mfcc_padded = torch.zeros(len(batch), max_input_len, n_feats, dtype=torch.float32)
        labels = torch.zeros(len(batch), dtype=torch.long)
        for i, idx in enumerate(ids_sorted_decreasing):
            mfcc, label = batch[idx]
            mfcc_padded[i, :mfcc.shape[0], :] = mfcc
            labels[i] = label
        return mfcc_padded, labels
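
The collate function pads every sequence in a batch to the length of the longest one so that they can be stacked into a single tensor. The idea can be sketched without PyTorch. `pad_batch` is a hypothetical helper, not part of the notebook, that works on nested lists of shape (frames, features).

```python
def pad_batch(sequences, pad_value=0.0):
    """Zero-pad variable-length (frames x features) sequences to a common length.

    Returns the padded batch and the original lengths, longest sequence first.
    """
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]), reverse=True)
    max_len = len(sequences[order[0]])
    n_feats = len(sequences[order[0]][0])
    padded, lengths = [], []
    for i in order:
        seq = sequences[i]
        pad_rows = [[pad_value] * n_feats for _ in range(max_len - len(seq))]
        padded.append([list(row) for row in seq] + pad_rows)
        lengths.append(len(seq))
    return padded, lengths


# Two toy "MFCC" sequences with 2 features per frame: 1 frame and 3 frames.
batch = [
    [[0.5, -1.2]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]
padded, lengths = pad_batch(batch)
print(lengths)    # [3, 1]
print(padded[1])  # [[0.5, -1.2], [0.0, 0.0], [0.0, 0.0]]
```

Returning the original lengths alongside the padded batch, as this sketch does, would let a model ignore the padded frames if that were ever needed.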

In [None]:
traindataset = DysarthricDataset(mode='train', test_speakers=test_speakers, labels=labels)
valdataset = DysarthricDataset(mode='val', test_speakers=test_speakers, labels=labels)
testdataset = DysarthricDataset(mode='test', test_speakers=test_speakers, labels=labels)
batch_size = 20
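# The file list is sorted by path, so shuffle to mix speakers and classes within each batch.
trainloader = DataLoader(traindataset, batch_size=batch_size, shuffle=True, collate_fn=BatchPadCollafeFn())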
valloader = DataLoader(valdataset, batch_size=batch_size, collate_fn=BatchPadCollafeFn())
testloader = DataLoader(testdataset, batch_size=batch_size, collate_fn=BatchPadCollafeFn())

### Define the CNN model

In [None]:
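class Model(nn.Module):
    """Three 1-D convolutions over time, global average pooling, and a linear classifier."""

    def __init__(self, in_channel=13):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channel, 32, 3)
        self.conv2 = nn.Conv1d(32, 64, 3)
        self.conv3 = nn.Conv1d(64, 128, 3)
        self.dense = nn.Linear(128, 2)

    def forward(self, x):
        # (batch, frames, features) -> (batch, features, frames), as Conv1d expects.
        x = x.permute(0, 2, 1)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        # Average over the time axis (padded frames included) to get one vector per clip.
        x = torch.mean(x, -1)
        return self.dense(x)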
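
Each `nn.Conv1d` layer above uses kernel size 3 with the default stride of 1 and no padding, so it shortens the time axis by 2 frames. The short calculation below applies the standard output-length formula for a 1-D convolution. The 100-frame input is an arbitrary example, and `conv1d_out_len` is a hypothetical helper.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


frames = 100  # arbitrary example length
lengths = [frames]
for _ in range(3):  # conv1, conv2, conv3, all with kernel_size=3
    lengths.append(conv1d_out_len(lengths[-1], kernel_size=3))
print(lengths)  # [100, 98, 96, 94]
```

By the same arithmetic, a padded batch needs at least 7 frames for the third convolution to produce any output (7 → 5 → 3 → 1).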

In [None]:
def train(loader):
    """Run one training epoch; return the mean loss and per-class recall."""
    model.train()
    n_classes = 2
    lossfn = nn.CrossEntropyLoss()
    # Rows are true classes, columns are predicted classes.
    confusion_matrix = torch.zeros(n_classes, n_classes)
    losses = []
    for data, label in tqdm(loader):
        data, label = data.to(device), label.to(device)
        out = model(data.float())
        loss = lossfn(out, label)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        losses.append(loss.item())
        _, preds = torch.max(out, 1)
        for t, p in zip(label.view(-1), preds.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1
    return sum(losses)/len(losses), confusion_matrix.diag()/confusion_matrix.sum(1)


def val(loader):
    """Evaluate without updating weights; return the mean loss and per-class recall."""
    model.eval()
    n_classes = 2
    lossfn = nn.CrossEntropyLoss()
    confusion_matrix = torch.zeros(n_classes, n_classes)
    losses = []
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, label in tqdm(loader):
            data, label = data.to(device), label.to(device)
            out = model(data.float())
            loss = lossfn(out, label)
            losses.append(loss.item())
            _, preds = torch.max(out, 1)
            for t, p in zip(label.view(-1), preds.view(-1)):
                confusion_matrix[t.long(), p.long()] += 1
    return sum(losses)/len(losses), confusion_matrix.diag()/confusion_matrix.sum(1)
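
The second value returned by `train` and `val` is the diagonal of the confusion matrix divided by its row sums, which is per-class recall rather than a single overall accuracy. A small worked example with made-up counts (rows are true classes, columns are predictions) shows the difference. `per_class_recall` is a hypothetical pure-Python stand-in for the tensor expression.

```python
def per_class_recall(confusion):
    """Recall per class: correct predictions divided by the number of true samples."""
    recalls = []
    for cls, row in enumerate(confusion):
        total = sum(row)
        # A class with no samples has undefined recall; report NaN,
        # which is also what the 0/0 tensor division yields.
        recalls.append(row[cls] / total if total else float('nan'))
    return recalls


# Made-up counts: 50 control clips (40 correct), 100 dysarthric clips (90 correct).
confusion = [
    [40, 10],
    [10, 90],
]
recalls = per_class_recall(confusion)
overall = sum(confusion[i][i] for i in range(2)) / sum(map(sum, confusion))
print(recalls)            # [0.8, 0.9]
print(round(overall, 3))  # 0.867
```

If a split contains only one class, which can happen when the validation set is taken from the first files of a sorted list, the entry for the missing class is NaN.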

In [None]:
# Fall back to the CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
lr = 0.0001
model = Model().to(device).float()
optimiser = torch.optim.Adam(model.parameters(), lr=lr)

In [None]:
num_epochs = 10
trainloss, trainaccs, valloss, valaccs = [], [], [], []
for ep in range(num_epochs):
    loss, accs = train(trainloader)
    trainloss.append(loss)
    trainaccs.append(accs)
    loss, accs = val(valloader)
    valloss.append(loss)
    valaccs.append(accs)
    print(trainloss[-1], valloss[-1])
    print(trainaccs[-1], valaccs[-1])
