<a href="https://colab.research.google.com/github/JungCesar/bscaithesis/blob/master/bsc_ai_thesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Disease Classification Algorithm

I will try to build a classification algorithm for a specific disease. The dataset consists of speech recordings from patients with the disease and from a healthy control group. A pre-trained self-supervised deep-learning model (Wav2Vec) extracts features from this speech data, and a classification algorithm then distinguishes between the two classes.

I will approach this bottom-up, step by step. For that reason, not all necessary libraries are loaded at the beginning; each one is loaded when it is first used. In a later version, loading everything at the beginning might be preferable, since it looks cleaner.

##Accessing Data

Since the dataset is hosted on [Kaggle](https://www.kaggle.com/datasets/iamhungundji/dysarthria-detection), I downloaded it and uploaded it to my Google Drive to make it easily accessible here.

The Torgo database consists of four folders, and looks as follows:

```bash
.
├── data.csv
├── dysarthria female
│   ├── F01_Session1_0001.wav
│   ├── F01_Session1_0002.wav
│   ├── ...
├── dysarthria male
│   ├── M01_Session1_0005.wav
│   ├── M01_Session1_00011.wav
│   ├── ...
├── non dysarthria female
│   ├── FC01_Session1_0008.wav
│   ├── FC01_Session1_00011.wav
│   ├── ...
├── non dysarthria male
│   ├── MC01_Session1_0005.wav
│   ├── MC01_Session1_0022.wav
│   ├── ...

4 directories, 1999 files
```
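
To verify that a local copy matches the layout above, the `.wav` files per folder can be counted. The following is a minimal sketch, assuming the class folders sit directly under one root directory; the function name `count_wavs_per_folder` and the commented-out example path are illustrative and not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def count_wavs_per_folder(root):
    """Return a Counter mapping each folder name to its number of .wav files."""
    root = Path(root)
    # rglob walks the tree recursively; parent.name is the containing folder
    return Counter(p.parent.name for p in root.rglob("*.wav"))


# Hypothetical usage; adjust the path to wherever the dataset is stored:
# print(count_wavs_per_folder("/content/drive/MyDrive/bsc-ai-thesis/torgo_data"))
```

Files such as `data.csv` are ignored because only the `.wav` extension is matched.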

Below, a connection to my Google Drive is made so that I can access the Torgo database stored there.

In [1]:
# Get access to personal Google Drive account
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We need Hugging Face's `transformers` library, which provides the Wav2Vec model, so we install it. The `datasets` library gives access to the datasets available on their platform and is also used below to wrap our own data. `evaluate` is a library for easily evaluating machine learning models and datasets.



In [2]:
%%capture
!pip install transformers datasets evaluate

In [3]:
import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

##Explore and Prepare Data



1.   Convert to Python list of dicts
2.   Convert to Pandas DataFrame
3.   Print and check for inconsistencies
4.   Filter out inconsistencies
5.   Show new distribution
6.   Split dataset into: train, validation and test subsets


To start, I want to know what the data looks and sounds like. Then I need to find out whether there are inconsistencies in the data and whether I need to make adjustments to it.

To get a clear view and an easy structure to work with, I will first load the data into a Python list of dictionaries and then into a Pandas DataFrame:

In [4]:
# str(path) returns something like: /content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0002.wav
# tqdm is used to create a smart progress bar for the loops, for example it shows loading time
from pathlib import Path
from tqdm import tqdm

data = []

for path in tqdm(Path("/content/drive/MyDrive/bsc-ai-thesis/torgo_data").glob("**/*.wav")):
    name = path.stem          # file name without extension, e.g. F01_Session1_0002
    label = path.parent.name  # name of the containing folder, e.g. dysarthria_female

    try:
        # There are some broken files; loading fails for those and they are skipped
        torchaudio.load(str(path))
        data.append({
            "filename": name,
            "path": str(path),
            "class": label
        })

    except Exception as e:
        print(str(path), e)

157it [00:01, 177.37it/s]

/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav Failed to open the input "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav" (Invalid data found when processing input).


2000it [00:14, 139.94it/s]


In [5]:
# Show what the Pandas DataFrame currently looks like
import pandas as pd
df = pd.DataFrame(data)
df.head()

Unnamed: 0,filename,path,class
0,F01_Session1_0006,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
1,F01_Session1_0038,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
2,F01_Session1_0015,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
3,F01_Session1_0024,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female
4,F01_Session1_0053,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria_female


In [6]:
# Show the distribution over the different categories or labels
df.groupby("class").count()[["path"]]

Unnamed: 0_level_0,path
class,Unnamed: 1_level_1
dysarthria_female,499
dysarthria_male,500
non_dysarthria_female,500
non_dysarthria_male,500


While loading the data into the first data structure, I received the following error:

> 1498it [00:14, 362.78it/s]/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav Failed to open the input "/content/drive/MyDrive/bsc-ai-thesis/torgo_data/dysarthria_female/F01_Session1_0068.wav" (Invalid data found when processing input).

Therefore I filter the dataset for these types of errors:

In [7]:
# Filter out broken and non-existent paths
print(f"Step 0: {len(df)}")

# Mark rows whose file exists; rows with a missing file get None and are dropped
df["status"] = df["path"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])
df = df.drop(columns='status')

print(f"Step 1: {len(df)}")

# Shuffle the rows and rebuild a clean 0..n-1 index
df = df.sample(frac=1)
df = df.reset_index(drop=True)

Step 0: 1999
Step 1: 1999
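
The existence check above cannot detect files that exist but fail to decode, such as the one reported during loading. As a hedged sketch of an explicit readability check, the standard-library `wave` module can try to parse each file's header. The helper name `is_readable_wav` is hypothetical, and the check assumes uncompressed PCM WAV files; the notebook itself relies on `torchaudio.load`, which supports more formats.

```python
import wave
from pathlib import Path


def is_readable_wav(path):
    """Return True if the file exists and its WAV header can be parsed."""
    path = Path(path)
    if not path.is_file():
        return False
    try:
        with wave.open(str(path), "rb") as wav_file:
            # A file without any audio frames is treated as unusable
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError):
        # wave.Error: malformed or unsupported header; EOFError: truncated file
        return False


# Possible use with the DataFrame from this notebook (not executed here):
# df = df[df["path"].apply(is_readable_wav)]
```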


Since the goal is to train a model to recognize the presence or absence of the disease in speech, it is appropriate to combine the four data folders into two categories: speakers with the disease and speakers without it. This simplifies the training process and ensures that the model focuses on recognizing the disease rather than gender.

Now, let's explore how many audio files (examples of dysarthric or healthy speech) each category contains.

Note that one audio file was skipped earlier while loading, specifically an instance of 'dysarthria'.

In [8]:
# Eliminate the difference between male and female and print the distribution
df = df.replace({'class' : {'dysarthria_female': 'dysarthria', 'dysarthria_male': 'dysarthria', 'non_dysarthria_female': 'non_dysarthria', 'non_dysarthria_male': 'non_dysarthria'}})
print("Labels: ", df["class"].unique())
print()
df.groupby("class").count()[["path"]]

Labels:  ['non_dysarthria' 'dysarthria']



Unnamed: 0_level_0,path
class,Unnamed: 1_level_1
dysarthria,999
non_dysarthria,1000
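
The merge of the four folder labels into two classes can also be written as a small function instead of an explicit dictionary. This is only a sketch: the name `to_binary_label` is hypothetical, and it assumes every folder label ends in `_female` or `_male`, as in the table above.

```python
def to_binary_label(folder_label):
    """Drop the trailing gender suffix, e.g. 'dysarthria_female' -> 'dysarthria'."""
    base, _, suffix = folder_label.rpartition("_")
    if suffix not in {"female", "male"}:
        # Fail loudly instead of silently producing a wrong class name
        raise ValueError(f"Unexpected label: {folder_label!r}")
    return base


# Possible use with the DataFrame from this notebook (not executed here):
# df["class"] = df["class"].map(to_binary_label)
```

Raising on an unknown suffix makes a misnamed folder visible immediately, whereas a dictionary-based `replace` would leave such a label unchanged.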


Let's display a random sample of the dataset and run it a couple of times to get a feeling for the audio and the dysarthria label.

In [9]:
import torchaudio
import librosa
import IPython.display as ipd
import numpy as np

idx = np.random.randint(0, len(df))
sample = df.iloc[idx]
path = sample["path"]
label = sample["class"]

print(f"ID Location: {idx}")
print(f"      Label: {label}")
print()

speech, sr = torchaudio.load(path)
speech = speech[0].numpy().squeeze()
speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=16_000)
ipd.Audio(data=np.asarray(speech), autoplay=True, rate=16000)

ID Location: 767
      Label: non_dysarthria



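Difference between `sklearn.model_selection.train_test_split` and cross-validation:

`train_test_split` produces a single hold-out split, whereas cross-validation rotates the validation role across several folds, so that every sample is used for both training and validation. Cross-validation is mainly worthwhile for smaller datasets, where a single split may not yield statistically representative subsets.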
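
To make the contrast concrete, here is a minimal pure-Python sketch of how k-fold cross-validation partitions sample indices. The function name `k_fold_indices` and its parameters are illustrative assumptions; in practice, scikit-learn's `KFold` or `StratifiedKFold` would normally be used instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Every sample lands in exactly one validation fold
for train, validation in k_fold_indices(n_samples=10, k=3):
    print(len(train), len(validation))
```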

In [10]:
print(df)
df.count()

                 filename                                               path  \
0      FC03_Session3_0095  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
1       F01_Session1_0039  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
2     M01_Session2_3_0168  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
3       M01_Session1_0085  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
4       M04_Session2_0148  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
...                   ...                                                ...   
1994    M01_Session1_0064  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
1995    M02_Session2_0125  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
1996    M03_Session2_0326  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
1997    F04_Session2_0249  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   
1998   MC01_Session2_0278  /content/drive/MyDrive/bsc-ai-thesis/torgo_dat...   

               class  
0     non_dysart

filename    1999
path        1999
class       1999
dtype: int64

Hugging Face's models require tensors as input.

In [11]:
display(df)

Unnamed: 0,filename,path,class
0,FC03_Session3_0095,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,non_dysarthria
1,F01_Session1_0039,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
2,M01_Session2_3_0168,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
3,M01_Session1_0085,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
4,M04_Session2_0148,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
...,...,...,...
1994,M01_Session1_0064,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
1995,M02_Session2_0125,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
1996,M03_Session2_0326,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria
1997,F04_Session2_0249,/content/drive/MyDrive/bsc-ai-thesis/torgo_dat...,dysarthria


In [12]:
from datasets import Dataset, Audio

# load Dataset from Pandas DataFrame
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['filename', 'path', 'class'],
    num_rows: 1999
})

In [13]:
# add audio info to dataset
paths = df["path"].values
audio_info = Dataset.from_dict({"audio": paths}).cast_column("audio", Audio())
dataset = dataset.add_column("audio", audio_info)
dataset

Dataset({
    features: ['filename', 'path', 'class', 'audio'],
    num_rows: 1999
})

In [14]:
save_path = "/content/drive/MyDrive/bsc-ai-thesis/torgo_data"

# Since the dataset is not perfectly balanced after filtering, I could choose to stratify here
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["class"])
dataset = dataset.train_test_split(test_size=0.2)

# # # convert to Huggingface type of dataset
# # torgo_train = Dataset.from_pandas(df[0])
# # print(torgo)
# # print(type(torgo))

# # train_df = train_df.reset_index(drop=True)
# # test_df = test_df.reset_index(drop=True)

# # train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
# # test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)

# print(torgo)
# print(type(torgo))

# print(train_df.shape)
# print(test_df.shape)
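
The stratification mentioned in the comment above keeps the class proportions equal in both subsets. A minimal sketch of the idea in plain Python follows; `stratified_split` is a hypothetical helper, not the scikit-learn or `datasets` implementation, and the label counts in the usage lines mirror the distribution printed earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=101):
    """Return (train_indices, test_indices) with roughly test_size of each class in test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


labels = ["dysarthria"] * 999 + ["non_dysarthria"] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

As far as I know, `Dataset.train_test_split` in the `datasets` library also offers a `stratify_by_column` argument, which expects the label column to be of type `ClassLabel`.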

In [23]:
dataset["train"][0]

{'name': 'FC03_Session3_0048',
 'path': '/content/drive/MyDrive/bsc-ai-thesis/torgo_data/non_dysarthria_female/FC03_Session3_0048.wav',
 'dysarthria': 'non_dysarthria'}

##Prepare Data for Training

In [17]:
# Loading the created dataset using datasets
from datasets import load_dataset, load_metric

data_files = {
    "train": "/content/drive/MyDrive/Github/bscaithesis/torgo_data/train.csv", 
    "validation": "/content/drive/MyDrive/Github/bscaithesis/torgo_data/test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-72181da9212b0971/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-72181da9212b0971/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Dataset({
    features: ['name', 'path', 'dysarthria'],
    num_rows: 1599
})
Dataset({
    features: ['name', 'path', 'dysarthria'],
    num_rows: 400
})


In [18]:
# We need to specify the input and output column.
# The CSV files loaded above expose the label under "dysarthria" (see the printed features).
input_column = "path"
output_column = "dysarthria"

In [19]:
# we need to distinguish the unique labels in our Dysarthria dataset
label_list = train_dataset.unique(output_column)
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

A classification problem with 2 classes: ['dysarthria', 'non_dysarthria']


To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [20]:
# Build both mappings from the sorted label_list defined above
label2id, id2label = dict(), dict()
for i, label in enumerate(label_list):
    label2id[label] = str(i)
    id2label[str(i)] = label

##Preprocess Data

So far, we downloaded, loaded, and split the dysarthria dataset into train and test sets.

Now, we need to turn the audio behind each path into contextual representation tensors and feed them into our classification model to determine the presence of dysarthria in the speech.

Therefore, the next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [21]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

# from transformers import AutoFeatureExtractor

# feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
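
To give an intuition for the `do_normalize=True` and `padding_value=0.0` settings above, the following NumPy sketch normalizes each waveform to zero mean and unit variance and right-pads a batch to equal length. It is an illustration of the idea under my own assumptions (the function name `normalize_and_pad` and the epsilon value are mine), not the library's actual implementation.

```python
import numpy as np


def normalize_and_pad(waveforms, padding_value=0.0):
    """Normalize each waveform to zero mean and unit variance, then right-pad to equal length."""
    arrays = [np.asarray(w, dtype=np.float32) for w in waveforms]
    # The small epsilon avoids division by zero for constant (silent) signals
    normalized = [(w - w.mean()) / np.sqrt(w.var() + 1e-7) for w in arrays]

    max_len = max(len(w) for w in normalized)
    batch = np.full((len(normalized), max_len), padding_value, dtype=np.float32)
    for row, w in enumerate(normalized):
        batch[row, : len(w)] = w
    return batch


batch = normalize_and_pad([[1.0, 2.0, 3.0, 4.0], [0.5, -0.5]])
print(batch.shape)
```

Padding is what allows recordings of different durations to be stacked into a single rectangular batch.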