# Disease Classification Algorithm

I will try to build a classification algorithm for a specific disease. It is based on a disease dataset consisting of speech data from positive patients and a healthy control group. A pre-trained self-supervised deep-learning model (Wav2Vec) extracts features from this speech data, and a classification algorithm then distinguishes between the two (or eventually maybe more) classes.

I will approach this bottom-up, step by step. For that reason, the necessary libraries are not all loaded at the beginning but only when they are first used. In a later version, moving all installs and imports to the top would be cleaner.

## Accessing Data

Since the dataset is hosted on [Kaggle](https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection), I downloaded it and uploaded it to my Google Drive to make it easily accessible here.

The Torgo database consists of four folders, and looks as follows:

```bash
.
├── data.csv
├── dysarthria female
│   ├── F01_Session1_0001.wav
│   ├── F01_Session1_0002.wav
│   ├── ...
├── dysarthria male
│   ├── M01_Session1_0005.wav
│   ├── M01_Session1_00011.wav
│   ├── ...
├── non dysarthria female
│   ├── FC01_Session1_0008.wav
│   ├── FC01_Session1_00011.wav
│   ├── ...
├── non dysarthria male
│   ├── MC01_Session1_0005.wav
│   ├── MC01_Session1_0022.wav
│   ├── ...

4 directories, 1999 files
```

Below, I mount my Google Drive so that I can access the Torgo database stored there.

In [1]:
# Get access to personal Google Drive account
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Log in to the Hugging Face account so that the model can be saved there later.

In [2]:
%%capture
!pip install huggingface_hub

In [3]:
from huggingface_hub import notebook_login

notebook_login()


We need Hugging Face's Transformers library, which provides Wav2Vec2, so we install `transformers`. The `datasets` library is used further down to build the dataset and cast its audio column. `evaluate` is a library for easily evaluating machine learning models and datasets.



In [4]:
%%capture
!pip install transformers==4.28.0 datasets evaluate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

In [5]:
import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

## Explore and Prepare Data



1.   Convert to Python list of dicts
2.   Convert to Pandas DataFrame
3.   Print and check for inconsistencies 
4.   Filter out inconsistencies
5.   Show new distribution
6.   Split dataset into: train, validation and test subsets


To start, I want to know what the data looks and sounds like. Then I need to find out whether there are inconsistencies in the data and whether I need to adapt it.

To get a clear view and an easy structure to work with, I will first load the data into a Python list of dictionaries and then into a Pandas DataFrame:

In [6]:
# str(path) returns something like: /content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0002.wav
# tqdm is used to create a smart progress bar for the loops, for example it shows loading time
from pathlib import Path
from tqdm import tqdm

data = []

for path in tqdm(Path("/content/drive/MyDrive/bsc-ai-thesis/torgo_data").glob("**/*.wav")):
    name = str(path).split('/')[-1].split('.')[0]
    label = str(path).split('/')[-2]
    
    try:
        # There are some broken files
        s = torchaudio.load(path)
        data.append({
            "filename": name,
            "path": str(path),
            "disease_class": label
        })

    except Exception as e:
        print(str(path), e)
        pass

147it [00:00, 266.01it/s]

/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav Failed to open the input "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav" (Invalid data found when processing input).


2000it [00:09, 207.79it/s]
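
The filename and label in the loop above are obtained by splitting the path string on `/` and `.`. As a minimal sketch, the same fields can be read from `pathlib` attributes instead. The helper name `parse_audio_path` is my own and not part of the notebook, and `PurePosixPath` is used only so the example runs without the files being present:

```python
from pathlib import PurePosixPath

def parse_audio_path(path):
    """Split a dataset path into filename (without extension) and folder label."""
    p = PurePosixPath(path)
    return {
        "filename": p.stem,              # file name without the .wav suffix
        "path": str(p),
        "disease_class": p.parent.name,  # name of the containing folder
    }

record = parse_audio_path(
    "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0002.wav"
)
print(record["filename"], record["disease_class"])
# prints: F01_Session1_0002 dysarthria_female
```

Inside the loading loop, `path.stem` and `path.parent.name` can be used directly, because `glob` already yields `Path` objects.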


In [7]:
# Show what the Pandas DataFrame currently looks like
import pandas as pd
df = pd.DataFrame(data)
df.head()

Unnamed: 0,filename,path,disease_class
0,F01_Session1_0006,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
1,F01_Session1_0038,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
2,F01_Session1_0015,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
3,F01_Session1_0024,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
4,F01_Session1_0053,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female


In [8]:
# Show the distribution over the different categories or labels
df.groupby("disease_class").count()[["path"]]

Unnamed: 0_level_0,path
disease_class,Unnamed: 1_level_1
dysarthria_female,499
dysarthria_male,500
non_dysarthria_female,500
non_dysarthria_male,500


While loading the data into the first data structure, I received the following error:

>1498it [00:14, 362.78it/s]/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav Failed to open the input "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav" (Invalid data found when processing input).

Therefore I filter the dataset for these types of errors:

In [9]:
# Filter out broken and non-existent paths
print(f"Step 0: {len(df)}")

# Mark paths that exist; rows whose file is missing get None and are dropped
df["status"] = df["path"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])
df = df.drop(columns='status')

print(f"Step 1: {len(df)}")

df = df.sample(frac=1)
df = df.reset_index(drop=True)

Step 0: 1999
Step 1: 1999
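
The filter above only checks that each path exists. The unreadable file was already excluded earlier, when `torchaudio.load` failed in the loading loop. As a rough sketch of a decode check that does not depend on torchaudio, the standard-library `wave` module can be used instead. This is a swapped-in technique, not what the notebook does: the helper `is_readable_wav` is hypothetical, and `wave` only understands uncompressed PCM WAV files.

```python
import wave

def is_readable_wav(path):
    """Return True if `path` exists and parses as a WAV file with at least one frame."""
    try:
        with wave.open(str(path), "rb") as wav_file:
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # wave.Error: not a valid WAV header; EOFError: file too short;
        # OSError: missing or unreadable file
        return False
```

Applied to the `path` column, a function like this would flag both missing and corrupt files in a single pass.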


Since the goal is to train a model to recognize the presence or absence of disease in speech, it would be appropriate to combine the four data folders into two categories: patients with the disease and patients without the disease. This will simplify the training process and ensure that the model is focused on recognizing the disease, rather than gender.

Now, let's explore how many audio files (examples of dysarthric or healthy speech) each class contains.

Note that one audio file was already excluded while loading, specifically an instance of 'dysarthria'.

In [10]:
# Eliminate the difference between male and female and print the distribution
df = df.replace({'disease_class' : {'dysarthria_female': 'dysarthria', 'dysarthria_male': 'dysarthria', 'non_dysarthria_female': 'non_dysarthria', 'non_dysarthria_male': 'non_dysarthria'}})
print("Labels: ", df["disease_class"].unique())
print()
df.groupby("disease_class").count()[["path"]]

Labels:  ['non_dysarthria' 'dysarthria']



Unnamed: 0_level_0,path
disease_class,Unnamed: 1_level_1
dysarthria,999
non_dysarthria,1000
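
The merge above lists all four folder names explicitly. A minimal alternative sketch strips the gender suffix instead, which would also cover additional folders that follow the same naming pattern. The function name `collapse_gender` is my own and is only an illustration:

```python
def collapse_gender(label):
    """Drop a trailing '_female' or '_male' from a folder label."""
    for suffix in ("_female", "_male"):
        if label.endswith(suffix):
            return label[: -len(suffix)]
    return label  # labels without a gender suffix are left unchanged

folder_labels = [
    "dysarthria_female",
    "dysarthria_male",
    "non_dysarthria_female",
    "non_dysarthria_male",
]
print(sorted({collapse_gender(label) for label in folder_labels}))
# prints: ['dysarthria', 'non_dysarthria']
```

With Pandas this could be applied as `df["disease_class"].map(collapse_gender)`.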


Let's display a random sample of the dataset and run it a couple of times to get a feeling for the audio and the dysarthria label.

In [11]:
import torchaudio
import librosa
import IPython.display as ipd
import numpy as np

idx = np.random.randint(0, len(df))
sample = df.iloc[idx]
path = sample["path"]
label = sample["disease_class"]

print(f"ID Location: {idx}")
print(f"      Label: {label}")
print()

speech, sr = torchaudio.load(path)
speech = speech[0].numpy().squeeze()
speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=16_000)
ipd.Audio(data=np.asarray(speech), autoplay=True, rate=16000)

ID Location: 1004
      Label: dysarthria



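Difference between `sklearn.model_selection.train_test_split` and cross-validation:

`train_test_split` creates a single hold-out split, whereas cross-validation trains and evaluates on several different splits of the same data. Cross-validation is mainly useful for smaller datasets, where a single split may not leave statistically representative samples in each subset.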
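
To make the contrast concrete, here is a minimal pure-Python sketch of stratified k-fold cross-validation: every example is used for testing exactly once, and each fold keeps roughly the same class balance. The function `stratified_k_fold` is hypothetical and only illustrates the idea; in practice `sklearn.model_selection.StratifiedKFold` would be the usual choice.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal the shuffled indices of each class round-robin over the folds
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        test_idx = sorted(folds[held_out])
        train_idx = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train_idx, test_idx

# Same class counts as the filtered dataset above
labels = ["dysarthria"] * 999 + ["non_dysarthria"] * 1000
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    print(len(train_idx), len(test_idx))
```

Each of the five iterations would then train a fresh model on `train_idx` and evaluate it on `test_idx`, which costs five training runs instead of one.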

Hugging Face models require tensors as input.

In [12]:
display(df)

Unnamed: 0,filename,path,disease_class
0,MC02_Session1_0375,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria
1,F04_Session2_0096,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
2,F03_Session2_0063,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
3,M01_Session2_3_0171,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
4,F04_Session2_0186,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
...,...,...,...
1994,FC02_Session3_0474,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria
1995,FC02_Session3_0831,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria
1996,MC03_Session1_0016,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria
1997,MC04_Session1_0372,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria


In [13]:
from datasets import Dataset, Audio

# load Dataset from Pandas DataFrame
dataset = Dataset.from_pandas(df)

# add audio info to dataset: array and sample rate
paths = df["path"].values
audio_info = Dataset.from_dict({"audio": paths}).cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.add_column("audio", audio_info)

# Check to see if it worked out
dataset

Dataset({
    features: ['filename', 'path', 'disease_class', 'audio'],
    num_rows: 1999
})

Add [ClassLabels](https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650) to the dataset

In [14]:
from datasets import ClassLabel

# Cast the label column to a ClassLabel feature (label strings become integer class ids)
disease_class = ClassLabel(num_classes = 2, names=["dysarthria", "non_dysarthria"])
dataset = dataset.cast_column("disease_class", disease_class)

Casting the dataset:   0%|          | 0/1999 [00:00<?, ? examples/s]

In [15]:
# # split the dataset into train and test subset and save them on Google Drive
# save_path = "/content/drive/MyDrive/bsc-ai-thesis/torgo_data"

# Since the dataset is not perfectly balanced after filtering, I could stratify the split here
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["disease_class"])
dataset = dataset.train_test_split(test_size=0.2)

# train_df = dataset["train"].reset_index(drop=True)
# test_df = dataset["test"].reset_index(drop=True)

# train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
# test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)

dataset

DatasetDict({
    train: Dataset({
        features: ['filename', 'path', 'disease_class', 'audio'],
        num_rows: 1599
    })
    test: Dataset({
        features: ['filename', 'path', 'disease_class', 'audio'],
        num_rows: 400
    })
})
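
The split above is random and not stratified. As a sketch of what stratifying would mean here, the pure-Python function below takes the same fraction of every class for the test set. The name `stratified_split` and the seed are my own choices for illustration and are not taken from the `datasets` API:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=101):
    """Return (train_indices, test_indices) with test_size of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["dysarthria"] * 999 + ["non_dysarthria"] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
# prints: 1599 400
```

The subset sizes match the random split above, but here the test set is guaranteed to contain 200 examples of each class.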

In [16]:
dataset = dataset.remove_columns(["filename", "path"])
dataset["train"][0]

{'disease_class': 1,
 'audio': {'audio': {'array': [0.0074462890625,
    0.01593017578125,
    0.0177001953125,
    0.020111083984375,
    0.0257568359375,
    0.0172119140625,
    0.01556396484375,
    0.013580322265625,
    0.015289306640625,
    0.019317626953125,
    0.0205078125,
    0.016143798828125,
    0.0125732421875,
    0.014312744140625,
    0.01422119140625,
    0.020233154296875,
    0.0155029296875,
    0.012237548828125,
    0.0133056640625,
    0.005615234375,
    0.009857177734375,
    0.013580322265625,
    0.009368896484375,
    0.00885009765625,
    0.008544921875,
    0.0081787109375,
    0.008819580078125,
    0.005828857421875,
    0.003204345703125,
    0.00634765625,
    0.0074462890625,
    0.008575439453125,
    0.00567626953125,
    0.001068115234375,
    0.002655029296875,
    -0.000457763671875,
    -0.001678466796875,
    -0.000457763671875,
    -0.003204345703125,
    -0.00921630859375,
    -0.012115478515625,
    -0.0118408203125,
    -0.0139770507812

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [17]:
# FUTURE WORK: ClassLabel
labels = dataset["train"].features["disease_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [18]:
id2label[str(1)]

'non_dysarthria'
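
Because the mappings above use string ids, an integer class index (for example the position of the largest logit) has to be converted with `str()` before the lookup. A minimal self-contained sketch, with the label list written out by hand and a hypothetical helper `decode_prediction`:

```python
labels = ["dysarthria", "non_dysarthria"]

# Same structure as in the notebook: ids are stored as strings
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

def decode_prediction(class_index):
    """Map an integer class index to its label name."""
    return id2label[str(class_index)]

print(decode_prediction(0), decode_prediction(1))
# prints: dysarthria non_dysarthria
```

Looking up `id2label[1]` with an integer key would raise a `KeyError` with these dictionaries.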

In [19]:
# Loading the created dataset using datasets
# from datasets import load_dataset, load_metric

# data_files = {
#     "train": "/content/drive/MyDrive//bsc-ai-thesis/torgo_data/train.csv", 
#     "validation": "/content/drive/MyDrive//bsc-ai-thesis/torgo_data/test.csv",
# }

# dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
# train_dataset = dataset["train"]
# eval_dataset = dataset["validation"]

# print(train_dataset)
# print(eval_dataset)

In [20]:
# # We need to specify the input and output column
# input_column = "path"
# output_column = "disease_class"

In [21]:
# # we need to distinguish the unique labels in our Dysarthria dataset
# label_list = train_dataset.unique(output_column)
# label_list.sort()  # Let's sort it for determinism
# num_labels = len(label_list)
# print(f"A classification problem with {num_labels} classes: {label_list}")

## Preprocess Data

So far, we downloaded, loaded, and split the dysarthria dataset into train and test sets.

Now, we need to extract features from the audio in the form of contextual representation tensors and feed them into our classification model to determine the presence of dysarthria in the speech.

Therefore, the next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [22]:
# from transformers import Wav2Vec2FeatureExtractor

# feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")



Now create a preprocessing function that:

1. Calls the `audio` column to load and, if necessary, resample the audio file. (Here the column was already cast to the required sampling rate of 16 kHz.)
2. Checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base).
3. Sets a maximum input length (`max_length=16000`, i.e. one second at 16 kHz). With `truncation=True`, longer inputs are cut down to that length so they can be batched.

In [23]:
def preprocess_function(examples):
    # The "audio" column built above is nested one level deep
    # (a record looks like {'audio': {'audio': {'array': ...}}}),
    # so indexing x["array"] directly raises a KeyError.
    audio_arrays = [x["audio"]["array"] for x in examples["audio"]]
    # Truncate every clip to at most 16,000 samples (1 second at 16 kHz)
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Then rename `disease_class` to `label` because that's the name the model expects:

In [None]:
dataset["train"]["audio"]

In [24]:
encoded_torgo = dataset.map(preprocess_function, batched=True)
encoded_torgo = encoded_torgo.rename_column("disease_class", "label")
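
To illustrate what `max_length=16000` with `truncation=True` means for the waveforms, here is a small pure-Python sketch. It is not the feature extractor's implementation: the function `truncate_and_pad` is hypothetical, and the padding step is added only to show why clips need a common length before they can be stacked into one batch.

```python
def truncate_and_pad(waveforms, max_length=16000, padding_value=0.0):
    """Cut each waveform to max_length samples, then right-pad to the longest one."""
    truncated = [list(w[:max_length]) for w in waveforms]
    target = max((len(w) for w in truncated), default=0)
    return [w + [padding_value] * (target - len(w)) for w in truncated]

# One clip longer than a second and one shorter, at 16 kHz
batch = truncate_and_pad([[0.1] * 20000, [0.2] * 12000])
print([len(w) for w in batch])
# prints: [16000, 16000]
```

Note that with these settings everything after the first second of a recording is discarded, which may matter for longer utterances.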

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
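
As a sanity check of what this metric computes, the same logic (argmax over the class scores, then the fraction of matches) can be written in plain Python. The helper names and the logits below are made up for illustration and are not model output:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy_from_logits(logits, references):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    if len(logits) != len(references):
        raise ValueError("logits and references must have the same length")
    if not references:
        return 0.0
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Four made-up examples with two classes (0 = dysarthria, 1 = non_dysarthria)
example_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9]]
example_labels = [0, 1, 1, 1]
print(accuracy_from_logits(example_logits, example_labels))
# prints: 0.75
```

With a nearly balanced dataset such as this one, accuracy is a reasonable first metric; a classifier that always predicts one class would score about 0.5.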

## Train

Now, I am ready to start training my model! I load Wav2Vec2 with AutoModelForAudioClassification along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, feature extractor (passed via the `tokenizer` argument), and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_torgo_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_torgo["train"],
    eval_dataset=encoded_torgo["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()
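
The batch settings above interact: with gradient accumulation, the optimizer only updates after several forward passes. A short arithmetic sketch, assuming a single device and the 1599 training examples from the split above. The helper `batch_arithmetic` is my own, and the exact number of optimizer steps the Trainer reports may differ by library version:

```python
import math

def batch_arithmetic(num_examples, per_device_batch_size, accumulation_steps, num_devices=1):
    """Return (effective batch size, number of forward batches per epoch)."""
    effective = per_device_batch_size * accumulation_steps * num_devices
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return effective, batches

effective, batches = batch_arithmetic(1599, 32, 4)
print(effective, batches)
# prints: 128 50
```

So each optimizer update sees 128 examples, and one epoch is only about 12 to 13 updates. Over 10 epochs with `warmup_ratio=0.1`, the learning-rate warmup covers roughly the first epoch.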

Once training is completed, I share the model on the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use it:

In [None]:
trainer.push_to_hub()

## Inference

Great, now that the model is finetuned, I can use it for inference!

I load an audio file that I'd like to run inference on. 

Remember to resample the sampling rate of the audio file to match the sampling rate of the model if you need to!
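
To show what resampling does to a signal, here is a naive linear-interpolation resampler in plain Python. It is only a conceptual sketch: the function `resample_linear` is hypothetical, it applies no anti-aliasing filter, and for real audio the `librosa.resample` call used earlier in this notebook is the better choice.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample by linear interpolation between neighbouring samples."""
    if orig_sr == target_sr or not samples:
        return list(samples)
    n_out = round(len(samples) * target_sr / orig_sr)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr  # fractional index into the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Going from 8 kHz to 16 kHz doubles the number of samples
upsampled = resample_linear([0.0, 1.0, 0.0, -1.0], orig_sr=8000, target_sr=16000)
print(len(upsampled))
# prints: 8
```

The duration of the clip stays the same; only the number of samples per second changes.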

In [None]:
# from datasets import load_dataset, Audio

# dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
# sampling_rate = dataset.features["audio"].sampling_rate
# audio_file = dataset[0]["audio"]["path"]

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for audio classification with your model, and pass your audio file to it:

In [None]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="Juardo/my_awesome_torgo_model")
classifier("/content/drive/MyDrive/bsc-ai-thesis/OSR_us_000_0010_8k.wav")
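
An audio-classification pipeline returns a list of dictionaries with a `score` and a `label` per class. Conceptually, this comes from applying a softmax to the model's logits and mapping class indices to names. The sketch below mimics that last step in plain Python; the function names are my own, and the logits are made up for illustration, not output of the finetuned model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, id2label):
    """Return score/label dictionaries sorted from most to least likely."""
    probabilities = softmax(logits)
    results = [
        {"score": p, "label": id2label[str(i)]} for i, p in enumerate(probabilities)
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)

id2label = {"0": "dysarthria", "1": "non_dysarthria"}
ranked = rank_labels([0.2, 1.7], id2label)  # made-up logits
print(ranked[0]["label"])
# prints: non_dysarthria
```

The first entry of the result is the predicted class; the remaining scores show how confident the model is relative to the alternative.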