<a href="https://colab.research.google.com/github/JungCesar/bscaithesis/blob/master/bscaithesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Google Drive Mount

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/Github/bscaithesis

/content/drive/MyDrive/Github/bscaithesis


In [3]:
# Git configuration

!git config --global user.name JungCesar
!git config --global user.email julius.bijkerk@icloud.com
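# Do not hard-code credentials in the notebook; prompt for the token at runtime.
# (Git has no user.password setting, so that line is dropped.)
from getpass import getpass

username = 'JungCesar'
git_token = getpass('GitHub personal access token: ')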
repository = 'bscaithesis'

In [4]:
%ls -a

.git/  torgo_data/


In [5]:
%%capture

!pip install transformers datasets evaluate

##Prepare Data

The Torgo database consists of four folders, one per combination of condition (dysarthria or non-dysarthria) and gender:

```bash
.
├── data.csv
├── dysarthria_female
│   ├── F01_Session1_0001.wav
│   ├── F01_Session1_0002.wav
│   ├── ...
├── dysarthria_male
│   ├── M01_Session1_0005.wav
│   ├── M01_Session1_00011.wav
│   ├── ...
├── non_dysarthria_female
│   ├── FC01_Session1_0008.wav
│   ├── FC01_Session1_00011.wav
│   ├── ...
├── non_dysarthria_male
│   ├── MC01_Session1_0005.wav
│   ├── MC01_Session1_0022.wav
│   ├── ...

4 directories, 1999 files
```

The dataset is hosted on [Kaggle](https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection) and cannot be accessed directly from this notebook, so I uploaded it to my Google Drive to make it available here.

In [6]:
# Dataset has been downloaded from Kaggle: https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection

import numpy as np
import pandas as pd

from pathlib import Path
from tqdm import tqdm

import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

In [7]:
# str(path) returns: /content/drive/MyDrive/Github/bscaithesis/torgo_data/dysarthria_female/F01_Session1_0002.wav
# tqdm is used to create a smart progress bar for the loops. You just need to wrap tqdm on any iterable

data = []

for path in tqdm(Path("/content/drive/MyDrive/Github/bscaithesis/torgo_data").glob("**/*.wav")):
    name = path.stem          # file name without extension, e.g. F01_Session1_0002
    label = path.parent.name  # folder name, e.g. dysarthria_female

    try:
        # Some files are broken: loading them raises, so they are skipped
        torchaudio.load(str(path))
        data.append({
            "name": name,
            "path": str(path),
            "dysarthria": label
        })
    except Exception:
        # Unreadable audio file: leave it out of the dataset
        continue

2000it [00:39, 50.80it/s] 


In [8]:
df = pd.DataFrame(data)
df.head()

index,name,path,dysarthria
0,F01_Session1_0001,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
1,F01_Session1_0002,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
2,F01_Session1_0004,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
3,F01_Session1_0006,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
4,F01_Session1_0007,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female


In [9]:
# Filter out broken and non-existent paths

print(f"Step 0: {len(df)}")

# Mark rows whose file exists, then drop the rows where the marker is missing
df["status"] = df["path"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])
df = df.drop(columns='status')

print(f"Step 1: {len(df)}")

df = df.sample(frac=1)
df = df.reset_index(drop=True)
df.head()

Step 0: 1999
Step 1: 1999


index,name,path,dysarthria
0,FC03_Session2_0205,/content/drive/MyDrive/Github/bscaithesis/torg...,non_dysarthria_female
1,FC03_Session2_0304,/content/drive/MyDrive/Github/bscaithesis/torg...,non_dysarthria_female
2,F03_Session1_0006,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
3,F03_Session2_0107,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female
4,F04_Session2_0033,/content/drive/MyDrive/Github/bscaithesis/torg...,dysarthria_female


Let's check how many audio files (examples of dysarthric or healthy speech) each folder contains.

In [10]:
print("Labels: ", df["dysarthria"].unique())
print()
df.groupby("dysarthria").count()[["path"]]

Labels:  ['non_dysarthria_female' 'dysarthria_female' 'dysarthria_male'
 'non_dysarthria_male']



dysarthria,path
dysarthria_female,499
dysarthria_male,500
non_dysarthria_female,500
non_dysarthria_male,500


The goal is to train a model that recognizes the presence or absence of dysarthria in speech, so the four folders are merged into two classes: speakers with dysarthria and speakers without. This simplifies training and keeps the model focused on detecting dysarthria rather than on the speaker's gender.

In [18]:
# Merge the four folder labels into two classes, ignoring gender
df = df.replace({'dysarthria': {
    'dysarthria_female': 'dysarthria',
    'dysarthria_male': 'dysarthria',
    'non_dysarthria_female': 'non_dysarthria',
    'non_dysarthria_male': 'non_dysarthria',
}})
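
The same merge can also be expressed as a rule rather than an explicit table: strip the trailing gender suffix from each folder label. The sketch below is only an illustration of that idea; the helper name `to_binary_label` is hypothetical and does not appear in the notebook.

```python
def to_binary_label(folder_label: str) -> str:
    """Drop a trailing gender suffix, e.g. 'dysarthria_male' -> 'dysarthria'."""
    for suffix in ("_female", "_male"):
        if folder_label.endswith(suffix):
            return folder_label[: -len(suffix)]
    # Already binary or an unexpected label: return it unchanged
    return folder_label


# The four folder labels found in the dataset
folder_labels = [
    "dysarthria_female",
    "dysarthria_male",
    "non_dysarthria_female",
    "non_dysarthria_male",
]
binary_labels = sorted({to_binary_label(label) for label in folder_labels})
print(binary_labels)
```

Applied to the DataFrame, this would presumably look like `df["dysarthria"].map(to_binary_label)`. The explicit dictionary used in the cell above has the advantage that an unexpected folder name is easier to spot.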

In [19]:
print("Labels: ", df["dysarthria"].unique())
print()
df.groupby("dysarthria").count()[["path"]]

Labels:  ['non_dysarthria' 'dysarthria']



dysarthria,path
dysarthria,999
non_dysarthria,1000


Let's play a random sample from the dataset. Running this cell a few times gives a feel for the audio and the corresponding dysarthria label.

In [20]:
import torchaudio
import librosa
import IPython.display as ipd
import numpy as np

idx = np.random.randint(0, len(df))
sample = df.iloc[idx]
path = sample["path"]
label = sample["dysarthria"]


print(f"ID Location: {idx}")
print(f"      Label: {label}")
print()

speech, sr = torchaudio.load(path)
speech = speech[0].numpy().squeeze()
speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=16_000)
ipd.Audio(data=np.asarray(speech), autoplay=True, rate=16000)

ID Location: 452
      Label: dysarthria



In [21]:
save_path = "/content/drive/MyDrive/Github/bscaithesis/torgo_data"

train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["dysarthria"])

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)


print(train_df.shape)
print(test_df.shape)

(1599, 3)
(400, 3)


##Prepare Data for Training

In [22]:
# Loading the created dataset using datasets
from datasets import load_dataset, load_metric


data_files = {
    "train": "/content/drive/MyDrive/Github/bscaithesis/torgo_data/train.csv", 
    "validation": "/content/drive/MyDrive/Github/bscaithesis/torgo_data/test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

Dataset({
    features: ['name', 'path', 'dysarthria'],
    num_rows: 1599
})
Dataset({
    features: ['name', 'path', 'dysarthria'],
    num_rows: 400
})


In [23]:
# We need to specify the input and output column
input_column = "path"
output_column = "dysarthria"

In [24]:
# we need to distinguish the unique labels in our Dysarthria dataset
label_list = train_dataset.unique(output_column)
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

A classification problem with 2 classes: ['dysarthria', 'non_dysarthria']
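
A classifier works with integer class ids rather than strings, so the sorted label list is typically turned into a pair of lookup tables. The following is a minimal sketch of that step; the names `label2id` and `id2label` are assumptions for illustration and are not defined in the cells above.

```python
# Labels as printed above; sorting keeps the ids stable across runs
label_list = sorted(["non_dysarthria", "dysarthria"])

# Hypothetical lookup tables: string label -> integer id, and back
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
```

Because the list is sorted before the ids are assigned, `dysarthria` always maps to 0 and `non_dysarthria` to 1, regardless of the order in which the rows were shuffled.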


##Preprocess Data

So far, we downloaded, loaded, and split the dysarthria dataset into train and test sets.

Now we need to load the audio behind each path, convert it into contextual representation tensors, and feed those into our classification model to determine whether dysarthria is present in the speech.
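
Before any model sees the audio, the waveforms usually have to be brought to a common scale and length. The sketch below shows one plausible version of that step using NumPy only: zero-mean, unit-variance normalization followed by padding with an attention mask. It is an assumption about what the preprocessing might look like, not the notebook's actual feature extractor; `normalize` and `pad_batch` are hypothetical helpers, synthetic noise stands in for real recordings, and the 16 kHz rate is taken from the playback cell earlier.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the notebook resamples to for playback


def normalize(waveform: np.ndarray) -> np.ndarray:
    """Scale a mono waveform to zero mean and unit variance."""
    waveform = waveform.astype(np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-7)


def pad_batch(waveforms, max_len):
    """Truncate or right-pad each waveform to max_len samples.

    Returns the padded inputs and a mask with 1 for real samples, 0 for padding.
    """
    inputs = np.zeros((len(waveforms), max_len), dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int64)
    for i, w in enumerate(waveforms):
        n = min(len(w), max_len)
        inputs[i, :n] = w[:n]
        mask[i, :n] = 1
    return inputs, mask


# Two synthetic clips (0.5 s and 1.0 s of noise) stand in for Torgo recordings
rng = np.random.default_rng(0)
clips = [
    normalize(rng.normal(size=TARGET_SR // 2)),
    normalize(rng.normal(size=TARGET_SR)),
]
inputs, mask = pad_batch(clips, max_len=TARGET_SR)
labels = np.array([0, 1])  # integer class ids, e.g. 0 = dysarthria

print(inputs.shape, mask.sum(axis=1), labels)
```

The mask lets a model ignore the padded tail of shorter clips, which matters here because the recordings differ in length.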