<a href="https://colab.research.google.com/github/JungCesar/bscaithesis/blob/master/bsc_ai_thesis_wav2vec_dysarthria.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dysarthria Classification Algorithm

In this notebook, I will create a dysarthria classification algorithm for my bachelor's thesis in Artificial Intelligence at the University of Amsterdam. Dysarthria is a speech disorder that occurs when the muscles used for speech are weak or difficult to control. The title of my thesis is "Investigating Pre-Trained Self-Supervised Deep Learning Models for Disease Recognition". The idea is to apply the algorithm created here to other datasets as well, for example audio files from the DementiaBank databases.

The [Torgo](http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html) dataset fits this purpose well. It consists of recordings of dysarthric male and female speakers, whose dysarthria is caused by either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), and of their non-dysarthric counterparts, also known as the healthy control group.

I will fine-tune a pre-trained self-supervised deep-learning model, namely Facebook's [wav2vec](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/), on our specific training data. Then I will extract the fine-tuned wav2vec features from our test data. Finally, these features will be used as input for the classifier to distinguish between the two categories.

I will take a bottom-up approach, try to be as operating system-independent as possible and explain the steps in detail below. The structure will be as follows:

1.   Installing and Explaining Necessary Packages
2.   Downloading, Storing, and Accessing (Loading) the Dataset
3.   Exploratory Data Analysis (EDA)
4.   Dataset Preprocessing
5.   Dataset Splitting (into a train- and test set)
6.   Fine-tuning Wav2vec
7.   Feature Extraction
8.   Training Classifier
9.   Model Validation
10.   Model Evaluation

Notes:

*   The above link to TORGO points to the Computational Linguistics website of the University of Toronto, where an in-depth explanation of the dataset can be found. The actual dataset used in this notebook comes from Kaggle. [Kaggle's version of Torgo](https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection) is an already preprocessed form of the original one.
*   Wav2vec has been made publicly available by Facebook on Hugging Face. It can be found [here](https://huggingface.co/facebook/wav2vec2-base).

## 1. Installing Necessary Packages and Explaining Them
Hugging Face's Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models, like wav2vec. Datasets gives access to the datasets hosted on their platform; in this notebook it is also used to wrap the local audio files. Evaluate is a package for easily evaluating machine learning models and datasets.

Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! In short, training and inference at scale made simple, efficient and adaptable.

Uninstalling and reinstalling transformers and accelerate resolved a bug I encountered.

In [1]:
%%capture
# !pip install --upgrade pip
!pip install transformers==4.28.0 datasets evaluate accelerate
# !pip uninstall -y transformers
# !pip install transformers accelerate
!pip install torchaudio
# !pip install numba==0.48
!pip install librosa

In [2]:
from sklearn.model_selection import train_test_split

import os
import sys
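
Because one of the packages above is pinned (`transformers==4.28.0`) and reinstalling packages was needed to work around a bug, it can help to record which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is my own and not part of any of these packages.

```python
from importlib import metadata

# Packages installed in the cell above; the list is taken from this notebook
PACKAGES = ["transformers", "datasets", "evaluate", "accelerate", "torchaudio", "librosa"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>12}: {version or 'not installed'}")
```

Running this right after the install cell makes it easier to reproduce the environment later.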

## 2. Downloading, Storing, and Accessing (Loading) the Dataset

There are two obvious ways to proceed: downloading the dataset and storing it locally, or uploading it to Google Drive and working from there. The latter is especially useful when working in Google Colab, because you then don't need to upload the whole dataset every time you start a new session. I will show both ways for clarity and operating system independence, but one of them will be commented out in the final version.

After downloading the TORGO wav files from Kaggle [here](https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection), I will record the name, path, and label of each wav file in a Python dictionary and collect all of those dicts in a list called **data**.

**Note: The Google Drive method should only be used when working in a Google Colab environment. In that case, run the cell below to mount your Drive; otherwise, skip it.**

In [3]:
# Get access to Google Drive 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
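
To avoid commenting and uncommenting paths by hand, the choice between the Drive folder and the local folder could be made automatically. This is a minimal sketch, assuming that the `google.colab` module is already loaded in a Colab kernel (a commonly used check, but an assumption here); the helper names `in_colab` and `pick_data_folder` are hypothetical.

```python
import sys


def in_colab() -> bool:
    """True when the google.colab module is already loaded (assumed to hold in Colab)."""
    return "google.colab" in sys.modules


def pick_data_folder(colab_path: str, local_path: str) -> str:
    """Choose the dataset folder based on the environment."""
    return colab_path if in_colab() else local_path


# Both paths are the ones used elsewhere in this notebook
folder_path = pick_data_folder(
    "/content/drive/MyDrive/bsc-ai-thesis/torgo_data",
    "C:/Users/Gebruiker/Documents/bsc-ai-thesis/torgo_data",
)
print(folder_path)
```

The same check could guard the `drive.mount` call so the notebook also runs outside Colab.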


In [4]:
# str(path) returns something like: /content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0002.wav
# tqdm is used to create a smart progress bar for the loops, for example it shows loading time
import glob
import torchaudio
from pathlib import Path
from tqdm import tqdm

# Specify where the folder with all audio data is stored
# For Google Drive, use the first, for local, use the second
folder_path = "/content/drive/MyDrive/bsc-ai-thesis/torgo_data"
# folder_path = "C:/Users/Gebruiker/Documents/bsc-ai-thesis/torgo_data"

data = []
for path in tqdm(Path(folder_path).glob("**/*.wav")):
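    # Normalize Windows separators so the same splitting works on every OS
    path_str = str(path).replace('\\', '/')
    name = path_str.split('/')[-1].split('.')[0]  # file name without extension
    label = path_str.split('/')[-2]               # parent folder name is the class label

    try:
        # Some files are broken; loading serves only as a validity check
        torchaudio.load(path)
        data.append({
            "filename": name,
            "path": path_str,
            "disease_class": label
        })

    except Exception as e:
        # Report the broken file and skip it
        print(str(path), e)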

139it [00:11, 75.77it/s]

/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav Failed to open the input "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav" (Invalid data found when processing input).


2000it [00:36, 54.48it/s] 
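
The loading cell derives the file name and the label by splitting the path string on `/`. As an alternative sketch, `pathlib` can return both pieces directly, independent of the operating system's separator. The helper name `describe` is hypothetical, and the two example paths are illustrative combinations of the folder names used in this notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath


def describe(path):
    """Return (file name without extension, name of the parent folder used as label)."""
    return path.stem, path.parent.name


colab = PurePosixPath(
    "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0002.wav"
)
local = PureWindowsPath(
    r"C:\Users\Gebruiker\Documents\bsc-ai-thesis\torgo_data\dysarthria_female\F01_Session1_0002.wav"
)

print(describe(colab))
print(describe(local))
```

Both calls yield the same name and label, which is the property the manual `replace('\\', '/')` in the loading cell is there to guarantee.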


## 3. Exploratory Data Analysis (EDA)

1.   Convert the previously assembled list of dicts into a Pandas DataFrame
2.   Show the distribution of samples over the different categories
3.   Load and play one random audio sample

In [5]:
# Show what the Pandas DataFrame currently looks like
import pandas as pd
df = pd.DataFrame(data)
df.head()

Unnamed: 0,filename,path,disease_class
0,F01_Session1_0006,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
1,F01_Session1_0038,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
2,F01_Session1_0015,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
3,F01_Session1_0024,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
4,F01_Session1_0053,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female


In [6]:
# Show the distribution over the different categories or labels
df.groupby("disease_class").count()[["path"]]

Unnamed: 0_level_0,path
disease_class,Unnamed: 1_level_1
dysarthria_female,499
dysarthria_male,500
non_dysarthria_female,500
non_dysarthria_male,500


Let's play a random sample from the dataset. Running the cell a couple of times gives a feeling for the audio and the dysarthria label.

In [7]:
import torchaudio
import librosa
import IPython.display as ipd
import numpy as np

idx = np.random.randint(0, len(df))
sample = df.iloc[idx]
path = sample["path"]
label = sample["disease_class"]

print(f"ID Location: {idx}")
print(f"      Label: {label}")
print()

speech, sr = torchaudio.load(path)
speech = speech[0].numpy().squeeze()
speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=16_000)
ipd.Audio(data=np.asarray(speech), autoplay=True, rate=16000)

ID Location: 73
      Label: dysarthria_female



## 4. Dataset Preprocessing

The data may contain inconsistencies, and the Python packages I use require specific input formats.

While initially loading the data locally, not in Google Colab, I already received the following error:

>C:\Users\Gebruiker\Documents\bsc-ai-thesis\torgo_data\dysarthria_female\F01_Session1_0068.wav Error opening 'C:\\Users\\Gebruiker\\Documents\\bsc-ai-thesis\\torgo_data\\dysarthria_female\\F01_Session1_0068.wav': File contains data in an unknown format.

This file was already filtered out during loading, but it is worth checking again that all the extracted paths really exist.

In [8]:
import os

# Filter out broken and non-existent paths
print(f"Step 0: {len(df)}")

# Mark each path that exists on disk; missing paths get None and are dropped below
df["status"] = df["path"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])
df = df.drop(columns='status')

print(f"Step 1: {len(df)}")

df = df.sample(frac=1)
df = df.reset_index(drop=True)

Step 0: 1999
Step 1: 1999


Since my goal is to train a model to recognize the presence or absence of dysarthria in speech, it is appropriate to combine the four data folders into two categories: patients *with* the disease and patients *without* the disease. This simplifies the training process and ensures that the model focuses on recognizing the presence of dysarthria instead of recognizing gender. Let's see how many audio files each of the two remaining categories contains.

Note that one audio file was filtered out previously, specifically an instance of 'dysarthria'. The 1000 samples per class were therefore reduced to 999 for 'dysarthria'.

In [9]:
# Eliminate the difference between male and female and print the distribution
df = df.replace({'disease_class' : {'dysarthria_female': 'dysarthria', 'dysarthria_male': 'dysarthria', 'non_dysarthria_female': 'non_dysarthria', 'non_dysarthria_male': 'non_dysarthria'}})
print("Labels: ", df["disease_class"].unique())
print()
df.groupby("disease_class").count()[["path"]]

Labels:  ['non_dysarthria' 'dysarthria']



Unnamed: 0_level_0,path
disease_class,Unnamed: 1_level_1
dysarthria,999
non_dysarthria,1000
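
The merge of the four folder labels into two classes can also be checked by hand. The sketch below strips the gender suffix from each folder label and adds up the per-folder counts reported by the first groupby output in this notebook; the helper name `to_binary_label` is hypothetical.

```python
from collections import Counter


def to_binary_label(folder_label: str) -> str:
    """Strip the trailing gender suffix, e.g. 'dysarthria_male' -> 'dysarthria'."""
    label, _, gender = folder_label.rpartition("_")
    if gender not in {"male", "female"}:
        raise ValueError(f"unexpected label: {folder_label!r}")
    return label


# Per-folder counts as reported by the groupby output earlier in this notebook
folder_counts = {
    "dysarthria_female": 499,
    "dysarthria_male": 500,
    "non_dysarthria_female": 500,
    "non_dysarthria_male": 500,
}

binary_counts = Counter()
for folder_label, count in folder_counts.items():
    binary_counts[to_binary_label(folder_label)] += count

print(dict(binary_counts))
```

The totals agree with the table above: 999 dysarthric and 1000 non-dysarthric samples.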


Hugging Face models require their input in a specific format, so I convert the Pandas DataFrame to a Hugging Face [dataset](https://huggingface.co/docs/datasets/index) here. The samples in our dataset are already at 16 kHz (16,000 Hz), but to be sure I include a resampling step. I also convert the 'path' column to a Hugging Face [audio feature](https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/main_classes#datasets.Audio).

In [10]:
from datasets import Dataset, Audio
from datasets import ClassLabel

# Convert Pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Cast 'path' column to Hugging Face Audio feature and resample to 16kHz
dataset = dataset.cast_column("path", Audio(sampling_rate=16_000))
dataset = dataset.rename_column("path", "audio")

# Cast the 'disease_class' column to a Hugging Face ClassLabel (0 = dysarthria, 1 = non_dysarthria)
disease_class = ClassLabel(num_classes = 2, names=["dysarthria", "non_dysarthria"])
dataset = dataset.cast_column("disease_class", disease_class)

dataset

Casting the dataset:   0%|          | 0/1999 [00:00<?, ? examples/s]

Dataset({
    features: ['filename', 'audio', 'disease_class'],
    num_rows: 1999
})

In [11]:
dataset.features

{'filename': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'disease_class': ClassLabel(names=['dysarthria', 'non_dysarthria'], id=None)}

In [12]:
dataset.features['audio']

Audio(sampling_rate=16000, mono=True, decode=True, id=None)

In [13]:
dataset.features['disease_class']

ClassLabel(names=['dysarthria', 'non_dysarthria'], id=None)

## 5. Dataset Splitting (into a train- and test set)

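Difference between sklearn.model_selection.train_test_split and cross-validation:

A single train/test split holds out one fixed part of the data for evaluation. Cross-validation repeats the split several times and averages the results, which is mainly worthwhile for smaller datasets where a single held-out split may not be statistically representative.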

In [14]:
# Since the dataset is not perfectly balanced after filtering, I could choose to stratify here
# shuffle is set to False to make a more honest comparison (original: shuffle=True)
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["disease_class"])
dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

dataset

# # We could save the split dataset on Google Drive here
# save_path = "/content/drive/MyDrive/bsc-ai-thesis/torgo_data"
# train_df = dataset["train"].reset_index(drop=True)
# test_df = dataset["test"].reset_index(drop=True)
# train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
# test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)

DatasetDict({
    train: Dataset({
        features: ['filename', 'audio', 'disease_class'],
        num_rows: 1599
    })
    test: Dataset({
        features: ['filename', 'audio', 'disease_class'],
        num_rows: 400
    })
})
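
The comment in the cell above mentions stratifying as an option. To illustrate what stratification does, here is a plain-Python sketch that splits each class separately so both splits keep the class proportions. It is an illustration of the idea, not the implementation used by scikit-learn or 🤗 Datasets; the function name `stratified_split` and the seed are my own choices. Recent versions of 🤗 Datasets also expose a `stratify_by_column` argument on `train_test_split`, so check the documentation of the installed version before relying on it.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=101):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class sizes taken from the label counts in this notebook
labels = ["dysarthria"] * 999 + ["non_dysarthria"] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these class sizes, each class contributes 200 samples to the test set, so the split sizes match the 1599/400 split shown above while the class balance in the test set is guaranteed rather than left to chance.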

In [15]:
dataset = dataset.remove_columns(["filename"])
dataset["train"][0]

{'audio': {'path': '/content/drive/MyDrive/bsc-ai-thesis/torgo_data/non_dysarthria_male/MC03_Session1_0130.wav',
  'array': array([-0.00634766,  0.00900269,  0.01443481, ...,  0.00683594,
          0.01138306,  0.02127075]),
  'sampling_rate': 16000},
 'disease_class': 1}

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [16]:
labels = dataset["train"].features["disease_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [17]:
id2label[str(1)]

'non_dysarthria'

In [18]:
# Loading the created dataset using datasets
# from datasets import load_dataset, load_metric

# data_files = {
#     "train": "/content/drive/MyDrive//bsc-ai-thesis/torgo_data/train.csv", 
#     "validation": "/content/drive/MyDrive//bsc-ai-thesis/torgo_data/test.csv",
# }

# dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
# train_dataset = dataset["train"]
# eval_dataset = dataset["validation"]

# print(train_dataset)
# print(eval_dataset)

In [19]:
# # We need to specify the input and output column
# input_column = "path"
# output_column = "disease_class"

In [20]:
# # we need to distinguish the unique labels in our Dysarthria dataset
# label_list = train_dataset.unique(output_column)
# label_list.sort()  # Let's sort it for determinism
# num_labels = len(label_list)
# print(f"A classification problem with {num_labels} classes: {label_list}")

## Preprocess Data

So far, we downloaded, loaded, and split the dysarthria dataset into train and test sets.

Now, we need to turn the raw audio into the input tensors that the model expects, so that the classification model can determine the presence of dysarthria in the speech.

Therefore, the next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [21]:
# from transformers import Wav2Vec2FeatureExtractor

# feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

Downloading (…)rocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]



In [22]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset["train"][0]

{'audio': {'path': '/content/drive/MyDrive/bsc-ai-thesis/torgo_data/non_dysarthria_male/MC03_Session1_0130.wav',
  'array': array([-0.00634766,  0.00900269,  0.01443481, ...,  0.00683594,
          0.01138306,  0.02127075]),
  'sampling_rate': 16000},
 'disease_class': 1}

Now create a preprocessing function that:

1. Calls the `audio` column to load and, if necessary, resample the audio file (the files are already at the required sampling rate).
2. Checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base).
3. Sets a maximum input length (`max_length=16000` samples, i.e. one second at 16 kHz): longer inputs are truncated to this length and shorter ones are padded to a common length so they can be batched.

In [23]:
# !!! Read more about maximum length, truncation, padding and the feature_extractor
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True, padding=True
    )
    return inputs
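
To make the combination of `max_length`, `truncation=True`, and `padding=True` concrete, here is a plain-Python sketch of the idea on toy lists. It is an illustration under my own assumptions, not the feature extractor's actual implementation; the function name `truncate_and_pad` is hypothetical.

```python
def truncate_and_pad(batch, max_length, padding_value=0.0):
    """Cut each sequence to max_length, then pad all to the longest remaining length."""
    truncated = [seq[:max_length] for seq in batch]
    target = max(len(seq) for seq in truncated)
    return [seq + [padding_value] * (target - len(seq)) for seq in truncated]


SAMPLING_RATE = 16_000  # Hz, as used throughout this notebook
MAX_LENGTH = 16_000     # samples, as passed to the feature extractor above
print(f"max_length covers {MAX_LENGTH / SAMPLING_RATE:.1f} s of audio")

# Toy batch with max_length=4 instead of 16000 to keep the output readable
batch = [[0.1, 0.2], [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]]
print(truncate_and_pad(batch, max_length=4))
```

The practical consequence is that only the first second of each recording reaches the model, which is worth keeping in mind when interpreting the results.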

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `disease_class` to `label` because that's the name the model expects:

In [24]:
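import warnings

# Treat NumPy's VisibleDeprecationWarning as an error so malformed batches fail loudly.
# The standard-library warnings module is used because np.warnings is gone in newer NumPy versions.
warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)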

encoded_torgo = dataset.map(preprocess_function, remove_columns="audio", batched=True)
encoded_torgo = encoded_torgo.rename_column("disease_class", "label")

Map:   0%|          | 0/1599 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [25]:
import evaluate

accuracy = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) the accuracy:

In [26]:
import numpy as np

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
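
For clarity, this is what the metric boils down to: take the highest-scoring class per example and count how often it matches the reference label. The sketch below uses plain Python and made-up logits; the helper names are hypothetical and the numbers are not model output.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


# Made-up logits for four clips; columns are (dysarthria, non_dysarthria)
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
label_ids = [0, 1, 1, 1]
print(accuracy_from_logits(logits, label_ids))
```

Three of the four toy predictions match their labels, so the printed accuracy is 0.75.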

## Train

Now, I am ready to start training my model! I load Wav2Vec2 with AutoModelForAudioClassification along with the number of expected labels, and the label mappings:

In [27]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)



Downloading pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForSequenceClassification: ['quantizer.codevectors', 'quantizer.weight_proj.weight', 'project_q.bias', 'project_q.weight', 'project_hid.weight', 'project_hid.bias', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You can push the model to the HuggingFace Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face with your token to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, the train and evaluation datasets, the feature extractor (passed through the `tokenizer` argument), and the `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [28]:
training_args = TrainingArguments(
    output_dir="my_awesome_torgo_model_non_shuffled",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)
  
# Note: the held-out 'test' split is used as the evaluation set during training,
# so the terms 'test' and 'evaluation' are used interchangeably here and earlier
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_torgo["train"],
    eval_dataset=encoded_torgo["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
0,0.6761,0.700923,0.5425
2,0.5879,0.624355,0.69
2,0.5254,0.516715,0.775
4,0.5104,0.573275,0.755
4,0.456,0.484953,0.815
6,0.4096,0.427611,0.8525
6,0.4467,0.421833,0.8375
8,0.3421,0.411524,0.84
8,0.3125,0.406416,0.84
9,0.3355,0.442094,0.8275


TrainOutput(global_step=120, training_loss=0.45850373307863873, metrics={'train_runtime': 525.4663, 'train_samples_per_second': 30.43, 'train_steps_per_second': 0.228, 'total_flos': 1.3936608965664e+17, 'train_loss': 0.45850373307863873, 'epoch': 9.6})
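
The `global_step=120` and `'epoch': 9.6` values in the output follow from the training arguments. The arithmetic below is a sketch under two assumptions: training runs on a single device, and a leftover partial gradient-accumulation group at the end of an epoch is not counted as an optimizer update. Both assumptions are consistent with the logged numbers.

```python
import math

num_train_samples = 1599          # rows in the train split
per_device_batch_size = 32
gradient_accumulation_steps = 4
num_train_epochs = 10

# Mini-batches per epoch (the last, smaller batch is kept)
batches_per_epoch = math.ceil(num_train_samples / per_device_batch_size)
# Optimizer updates per epoch; ASSUMPTION: a partial accumulation group is not counted
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = num_train_epochs * updates_per_epoch
# Number of samples contributing to each optimizer update (single device assumed)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
# Fraction of the data actually seen, expressed in epochs
epochs_covered = total_updates * gradient_accumulation_steps / batches_per_epoch

print(batches_per_epoch, updates_per_epoch, total_updates)
print(effective_batch_size, epochs_covered)
```

This gives 50 mini-batches and 12 updates per epoch, 120 updates in total with an effective batch size of 128, covering 9.6 epochs of data.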

Once training is completed, I can share the model on the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use it:

In [29]:
# trainer.push_to_hub()

## Inference

Great, now that the model is fine-tuned, I can use it for inference!

I load an audio file that I'd like to run inference on. 

Remember to resample the sampling rate of the audio file to match the sampling rate of the model if you need to!

In [30]:
# from datasets import load_dataset, Audio

# dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
# sampling_rate = dataset.features["audio"].sampling_rate
# audio_file = dataset[0]["audio"]["path"]

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for audio classification with your model, and pass your audio file to it:

In [31]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="Juardo/my_awesome_torgo_model")
classifier("/content/drive/MyDrive/bsc-ai-thesis/OSR_us_000_0010_8k.wav")

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.50k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/378M [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

[{'score': 0.8873335123062134, 'label': 'dysarthria'},
 {'score': 0.11266651749610901, 'label': 'non_dysarthria'}]
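
The pipeline returns a list of dictionaries, one per label, each with a score. A small helper can turn that into a single decision. This is a sketch only: the function name `top_prediction` and the `threshold` parameter are hypothetical additions, not part of the pipeline API.

```python
def top_prediction(results, threshold=0.5):
    """Return (label, score) of the best-scoring entry, or (None, score) below threshold."""
    best = max(results, key=lambda item: item["score"])
    label = best["label"] if best["score"] >= threshold else None
    return label, best["score"]


# Output copied from the pipeline call above
results = [
    {"score": 0.8873335123062134, "label": "dysarthria"},
    {"score": 0.11266651749610901, "label": "non_dysarthria"},
]
label, score = top_prediction(results)
print(f"{label} ({score:.1%})")
```

Raising the threshold would let the helper abstain on uncertain clips instead of forcing a label.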

In [32]:
audio_file = dataset["test"][0]["audio"]["array"]

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("Juardo/my_awesome_torgo_model")
inputs = feature_extractor(audio_file, sampling_rate=16_000, return_tensors="pt")

In [33]:
# See what the outcome of the following lines should be: 
dataset["test"].features["disease_class"].int2str(dataset["test"][0]["disease_class"])

'non_dysarthria'

In [34]:
import torch
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained("Juardo/my_awesome_torgo_model")
with torch.no_grad():
    logits = model(**inputs).logits

In [35]:
# Take the argmax over the class dimension (the batch holds a single clip here)
predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label

'non_dysarthria'
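
The logits themselves are unnormalized scores. To report a confidence next to the predicted label, they can be converted to probabilities with a softmax. The sketch below uses plain Python and made-up logits, since the actual logit values are not shown in this notebook.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for one clip; index 0 = dysarthria, index 1 = non_dysarthria
id2label = {0: "dysarthria", 1: "non_dysarthria"}
example_logits = [-1.2, 0.9]

probabilities = softmax(example_logits)
predicted_id = max(range(len(probabilities)), key=probabilities.__getitem__)
print(id2label[predicted_id], [round(p, 3) for p in probabilities])
```

With a real model, the same step would be applied to the logits tensor along its last dimension.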