# Data Preparation
This notebook prepares the Italian_Parkinsons_Voice_and_Speech dataset, downloaded from https://huggingface.co/datasets/birgermoell/Italian_Parkinsons_Voice_and_Speech


## imports

In [1]:
import json
import os
from collections import Counter

import pandas as pd
import torchaudio
from sklearn.model_selection import train_test_split
from speechbrain.utils.data_utils import get_all_files

from data_prepare_funs import extract_metadata, split_by_speaker, df_to_json, load_paths, load_speakers, calculate_class_samples, df_to_small_json

seed_value = 1986  # seed passed to the shuffling and cross-validation calls below


## Full Dataset

|  | # wav files | # speakers |
| --- | --- | --- |
| Dataset | 831 | 61 |
| Train Data | 649 | 48 |
| Valid Data | 96 | 6 |
| Test Data | 86 | 7 |

### get audio files:

In [2]:

data_files = get_all_files('/home/ulaval.ca/maelr5/scratch/parkinsons', match_and=['.wav'])

print('data size= ', len(data_files))



data size=  831


In [3]:
type(data_files), data_files[0], data_files[500]

(list,
 '/home/ulaval.ca/maelr5/scratch/parkinsons/15 Young Healthy Control/Daniele R/B1LBULCAAS94M100120171057.wav',
 "/home/ulaval.ca/maelr5/scratch/parkinsons/28 People with Parkinson's disease/17-28/Nicola M/VE1NMIICNOO52M100220171138.wav")

In [4]:
data_files[0].split(os.sep), data_files[500].split(os.sep)

(['',
  'home',
  'ulaval.ca',
  'maelr5',
  'scratch',
  'parkinsons',
  '15 Young Healthy Control',
  'Daniele R',
  'B1LBULCAAS94M100120171057.wav'],
 ['',
  'home',
  'ulaval.ca',
  'maelr5',
  'scratch',
  'parkinsons',
  "28 People with Parkinson's disease",
  '17-28',
  'Nicola M',
  'VE1NMIICNOO52M100220171138.wav'])

### extract metadata

In [5]:

df = extract_metadata(data_files)


In [6]:
print(len(df))
df.head(2)


831


Unnamed: 0,filename,full_path,speaker_id,label
0,B1LBULCAAS94M100120171057.wav,/home/ulaval.ca/maelr5/scratch/parkinsons/15 Y...,Daniele R,HC
1,B2LBULCAAS94M100120171057.wav,/home/ulaval.ca/maelr5/scratch/parkinsons/15 Y...,Daniele R,HC


In [7]:
df.tail(2)


Unnamed: 0,filename,full_path,speaker_id,label
829,VE2lbuairgo52M1606161815.wav,/home/ulaval.ca/maelr5/scratch/parkinsons/28 P...,Luigi B,PD
830,FB1lbuairgo52M1606161825.wav,/home/ulaval.ca/maelr5/scratch/parkinsons/28 P...,Luigi B,PD
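
`extract_metadata` is imported from `data_prepare_funs` and is not shown in this notebook. As a rough sketch of what such a helper could do, the version below assumes that the parent folder is the speaker and that only the Parkinson's group folder contains the word "Parkinson" (both read off the path listings above). The name `extract_metadata_sketch` is hypothetical, and it returns plain dictionaries; `pd.DataFrame(rows)` would turn them into a table like the one shown.

```python
import os


def extract_metadata_sketch(paths):
    """Derive filename, speaker and label from each path (illustrative only)."""
    rows = []
    for path in paths:
        parts = path.split(os.sep)
        filename = parts[-1]
        speaker_id = parts[-2]  # assumption: the parent folder names the speaker
        # assumption: only the Parkinson's group folder mentions "Parkinson"
        label = "PD" if any("Parkinson" in part for part in parts[:-2]) else "HC"
        rows.append(
            {
                "filename": filename,
                "full_path": path,
                "speaker_id": speaker_id,
                "label": label,
            }
        )
    return rows
```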


### split data into train/valid/test sets **by speaker**:

80% training, 10% validation, 10% test

Splitting **by speaker** means that no person's recordings appear in more than one of the train, validation, and test sets. Splitting **by recording** instead causes **data leakage**: the model can learn to recognize the person rather than Parkinson's symptoms, which makes the results unreliable [1].

[1] Iswarya Kannoth Veetil, Sowmya V., Juan Rafael Orozco-Arroyave, E.A. Gopalakrishnan. Robust language independent voice data driven Parkinson’s disease detection. Engineering Applications of Artificial Intelligence, Volume 129, 2024, 107494. ISSN 0952-1976. https://doi.org/10.1016/j.engappai.2023.107494 (https://www.sciencedirect.com/science/article/pii/S0952197623016780)


In [8]:
df['speaker_id'].unique()[0]


'Daniele R'

In [9]:

train_df, valid_df, test_df = split_by_speaker(df)
print('*****************************************')
print('train wavfiles size= ', len(train_df))
print('valid wavfiles size= ', len(valid_df))
print('test wavfiles size= ', len(test_df))


data speakers size=  61
train speakers size=  48
valid speakers size=  6
test speakers size=  7
*****************************************
train wavfiles size=  649
valid wavfiles size=  96
test wavfiles size=  86
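
`split_by_speaker` also lives in `data_prepare_funs`, so its exact logic is not visible here. The idea of a speaker-level split can be sketched with the standard library alone: shuffle the unique speakers, cut the list at 80% and 90%, and let every recording follow its speaker. The function name, the rounding rule, and the toy `recordings` list below are assumptions for illustration; the real helper may use `train_test_split` and round differently.

```python
import random


def split_speakers_sketch(speaker_ids, train_frac=0.8, valid_frac=0.1, seed=1986):
    """Partition unique speakers into train/valid/test sets (illustrative only)."""
    speakers = sorted(set(speaker_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * train_frac)
    n_valid = int(len(speakers) * valid_frac)
    train = set(speakers[:n_train])
    valid = set(speakers[n_train:n_train + n_valid])
    test = set(speakers[n_train + n_valid:])  # whatever remains goes to test
    return train, valid, test


# Toy data: 10 hypothetical speakers with 4 recordings each.
recordings = [("spk%02d" % (i % 10), "rec%03d.wav" % i) for i in range(40)]
train_spk, valid_spk, test_spk = split_speakers_sketch(s for s, _ in recordings)

# Every recording follows its speaker into exactly one partition.
train_files = [f for s, f in recordings if s in train_spk]
valid_files = [f for s, f in recordings if s in valid_spk]
test_files = [f for s, f in recordings if s in test_spk]
```

Because the cut is made on speakers rather than files, the file-level proportions only approximate 80/10/10 when speakers have different numbers of recordings.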


#### Note: some speakers have more recordings than others

### create json files

Classes are balanced by downsampling the majority class to the size of the minority class. The alternative, oversampling, repeats samples from the minority class and raises the risk of overfitting to them.

In [11]:
os.path.splitext('B1LBULCAAS94M100120171057.wav')[0]

'B1LBULCAAS94M100120171057'

### auxiliary functions

In [18]:

df_to_json(train_df, "train.json", shuffle=True, balance_classes=False, seed=seed_value)
df_to_json(valid_df, "valid.json", shuffle=True, balance_classes=False, seed=seed_value)
df_to_json(test_df, "test.json", shuffle=True, balance_classes=False, seed=seed_value)  # validation and test sets are usually left untouched


#### The JSON files are formatted as follows:

test.json:
```
{
    "VA1GCIALSDA52F170320171127": {
        "path": "/home/ulaval.ca/maelr5/scratch/parkinsons/22 Elderly Healthy Control/GILDA C/VA1GCIALSDA52F170320171127.wav",
        "spk_id": "GILDA C",
        "length": 11.15625,
        "detection": "HC"
    },
    "VO1cdaopmoe67M2605161911": {
        "path": "/home/ulaval.ca/maelr5/scratch/parkinsons/28 People with Parkinson's disease/1-5/Domenico C/VO1cdaopmoe67M2605161911.wav",
        "spk_id": "Domenico C",
        "length": 18.500725623582767,
        "detection": "PD"
    },
    ...
```
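
The `length` field is the recording duration in seconds, that is, the number of frames divided by the sample rate. The sketch below builds one entry in this layout using only the standard-library `wave` module, which is an assumption made to keep the example dependency-free: it handles plain PCM WAV files only. The function name `manifest_entry_sketch` is hypothetical.

```python
import os
import wave


def manifest_entry_sketch(path, spk_id, label):
    """Build one (utt_id, entry) pair in the layout shown above (illustrative only)."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()  # seconds
    utt_id = os.path.splitext(os.path.basename(path))[0]  # file name without ".wav"
    entry = {"path": path, "spk_id": spk_id, "length": duration, "detection": label}
    return utt_id, entry
```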

#### check class statistics for each set

In [30]:

print("train:-")
calculate_class_samples("train.json")
print("valid:-")
calculate_class_samples("valid.json")
print("test:-")
calculate_class_samples("test.json")


train:-
PD: 325 samples
HC: 324 samples
valid:-
PD: 64 samples
HC: 32 samples
test:-
HC: 38 samples
PD: 48 samples
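
`calculate_class_samples` is another helper from `data_prepare_funs`. A minimal sketch of the same kind of count, assuming the JSON layout shown earlier with the class stored under `detection`, could look like this (the name `count_classes_sketch` is hypothetical):

```python
import json
from collections import Counter


def count_classes_sketch(json_path, field="detection"):
    """Count how many manifest entries carry each class label (illustrative only)."""
    with open(json_path) as f:
        manifest = json.load(f)
    return Counter(entry[field] for entry in manifest.values())


def print_class_counts(json_path):
    """Print the counts in the same 'LABEL: N samples' style as the output above."""
    for label, count in count_classes_sketch(json_path).items():
        print(f"{label}: {count} samples")
```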


#### Reasons for not balancing the classes:

- The train data is already balanced (325 PD vs. 324 HC).
- The validation and test sets are left unbalanced so that they reflect how the model would perform on real-world imbalanced data.


#### sanity check: confirm that there is no overlap between the train, validation, and test samples
(1) Compare by Audio Paths (Most Reliable)

In [31]:

train_paths = load_paths("train.json")
valid_paths = load_paths("valid.json")
test_paths = load_paths("test.json")

overlap = train_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between train and test: {len(overlap)}")
if overlap:
    print("Some overlapping files:")
    for p in list(overlap)[:10]:  # show first 10
        print("-", p)

overlap = train_paths.intersection(valid_paths)
print(f"\n Number of overlapping audio files between train and valid: {len(overlap)}")

overlap = valid_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between valid and test: {len(overlap)}")




 Number of overlapping audio files between train and test: 0

 Number of overlapping audio files between train and valid: 0

 Number of overlapping audio files between valid and test: 0


(2) Compare by Speaker

In [32]:

train_speakers = load_speakers("train.json")
valid_speakers = load_speakers("valid.json")
test_speakers = load_speakers("test.json")

overlap_speakers = train_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between train and test: {len(overlap_speakers)}")

overlap_speakers = train_speakers.intersection(valid_speakers)
print(f"\n🎙️ Overlapping speakers between train and valid: {len(overlap_speakers)}")

overlap_speakers = valid_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between valid and test: {len(overlap_speakers)}")

if overlap_speakers:
    print("Some shared speakers:")
    for s in list(overlap_speakers)[:10]:
        print("-", s)



🎙️ Overlapping speakers between train and test: 0

🎙️ Overlapping speakers between train and valid: 0

🎙️ Overlapping speakers between valid and test: 0
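
The two checks above print counts for every pair of sets but list the offending items for one pair only. A more compact variant, sketched below, checks every pair and fails loudly on the first overlap. It assumes the JSON layout shown earlier; `load_field_set` and `assert_disjoint` are hypothetical names, not the helpers from `data_prepare_funs`.

```python
import json
from itertools import combinations


def load_field_set(json_path, field):
    """Collect the distinct values of one field (e.g. 'path' or 'spk_id')."""
    with open(json_path) as f:
        return {entry[field] for entry in json.load(f).values()}


def assert_disjoint(manifests, field):
    """manifests maps a set name to its JSON path; raise if any two sets share a value."""
    sets = {name: load_field_set(path, field) for name, path in manifests.items()}
    for (name_a, set_a), (name_b, set_b) in combinations(sets.items(), 2):
        shared = set_a & set_b
        if shared:
            raise ValueError(
                f"{name_a} and {name_b} share {len(shared)} {field} value(s): "
                f"{sorted(shared)[:10]}"
            )
```

With the files written above, this would be called as `assert_disjoint({"train": "train.json", "valid": "valid.json", "test": "test.json"}, "spk_id")`, and once more with `"path"`.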


## Create small data for sanity check of the model:

The model should overfit this data after training for multiple epochs.

In [46]:
import os
import json
import torchaudio

df_to_small_json(train_df, "train-check.json", shuffle=True, balance_classes=True, samples_per_class=10)
df_to_small_json(valid_df, "valid-check.json", shuffle=True, balance_classes=True, samples_per_class=2)



In [47]:

print("train:-")
calculate_class_samples("train-check.json")
print("valid:-")
calculate_class_samples("valid-check.json")
print("test:-")
calculate_class_samples("test.json")


train:-
PD: 10 samples
HC: 10 samples
valid:-
PD: 2 samples
HC: 2 samples
test:-
HC: 38 samples
PD: 48 samples


## k-fold cross validation

Cross-validation procedure: stratified k-fold cross-validation (k = 5) is used to evaluate model performance while preserving the class distribution across folds. The speakers are split into five stratified folds, so each fold keeps approximately the original proportion of Parkinson's Disease (PD) and Healthy Control (HC) speakers, and no speaker appears in more than one fold. In each iteration, the model is trained on four folds and validated on the remaining one. To mitigate class imbalance during training, the majority class is randomly downsampled within the training set of each fold. The validation sets remain unbalanced to reflect the natural class distribution and to give a realistic assessment of model performance. The same predefined test set is used for the final evaluation.

In [24]:
len(df), len(train_df), len(valid_df), seed_value

(831, 649, 96, 1986)

In [33]:

from sklearn.model_selection import StratifiedKFold

# Save a DataFrame to JSON
def save_json(df, json_path):
    data = {}
    for _, row in df.iterrows():
        utt_id = os.path.splitext(row['filename'])[0]
        audioinfo = torchaudio.info(row['full_path'])
        duration = audioinfo.num_frames / audioinfo.sample_rate
        data[utt_id] = {
            "path": row['full_path'],
            "spk_id": row['speaker_id'],
            "length": duration,
            "detection": row['label']
        }
    with open(json_path, "w") as f:
        json.dump(data, f, indent=4)

def balance_classes(df, seed=42):
    pd_df = df[df['label'] == 'PD']
    hc_df = df[df['label'] == 'HC']
    min_size = min(len(pd_df), len(hc_df))
    
    pd_df = pd_df.sample(n=min_size, random_state=seed)
    hc_df = hc_df.sample(n=min_size, random_state=seed)
    
    return pd.concat([pd_df, hc_df]).sample(frac=1, random_state=seed).reset_index(drop=True)

# Perform K-Fold only on train_df (ignoring test_df)
def cross_val_on_train(train_df, k=5, seed=42, balance_train=True):
    # Group by speaker
    speaker_df = train_df.groupby("speaker_id").first().reset_index()[["speaker_id", "label"]]
    
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    
    # split by speaker
    for fold, (train_spk_idx, val_spk_idx) in enumerate(skf.split(speaker_df, speaker_df['label'])):
        train_speakers = speaker_df.iloc[train_spk_idx]['speaker_id']
        val_speakers = speaker_df.iloc[val_spk_idx]['speaker_id']
        
        fold_train = train_df[train_df['speaker_id'].isin(train_speakers)].reset_index(drop=True)
        fold_val = train_df[train_df['speaker_id'].isin(val_speakers)].reset_index(drop=True)
        if balance_train:
            # pass the seed through so the downsampling follows the requested seed
            fold_train = balance_classes(fold_train, seed=seed)

        save_json(fold_train, f"fold{fold}_train.json")
        save_json(fold_val, f"fold{fold}_valid.json")
        print(f"Fold {fold}: Train Speakers = {len(train_speakers)}, Val Speakers = {len(val_speakers)}...")

# Combine train + valid for cross-validation; the test set stays held out
full_train_df = pd.concat([train_df, valid_df]).reset_index(drop=True)

# Pass train_df instead of full_train_df to cross-validate on the training set only
cross_val_on_train(full_train_df, k=5, seed=seed_value, balance_train=True)



Fold 0: Train Speakers = 43, Val Speakers = 11...
Fold 1: Train Speakers = 43, Val Speakers = 11...
Fold 2: Train Speakers = 43, Val Speakers = 11...
Fold 3: Train Speakers = 43, Val Speakers = 11...
Fold 4: Train Speakers = 44, Val Speakers = 10...


In [34]:
for fold in range(5):
    print(f"fold{fold}_train:-")
    calculate_class_samples(f"fold{fold}_train.json")
    print(f"fold{fold}_valid:-")
    calculate_class_samples(f"fold{fold}_valid.json")


fold0_train:-
PD: 270 samples
HC: 270 samples
fold0_valid:-
HC: 86 samples
PD: 80 samples
fold1_train:-
HC: 296 samples
PD: 296 samples
fold1_valid:-
HC: 60 samples
PD: 88 samples
fold2_train:-
PD: 309 samples
HC: 309 samples
fold2_valid:-
HC: 47 samples
PD: 61 samples
fold3_train:-
PD: 289 samples
HC: 289 samples
fold3_valid:-
HC: 67 samples
PD: 80 samples
fold4_train:-
HC: 260 samples
PD: 260 samples
fold4_valid:-
HC: 96 samples
PD: 80 samples
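
Since the folds are stratified by speaker and not by recording, the class mix of the validation files varies from fold to fold. The short calculation below, using only the counts printed above, makes that visible. It also checks that the five validation sets together cover all 649 + 96 = 745 recordings that went into cross-validation.

```python
# Validation counts copied from the per-fold output above.
fold_valid_counts = {
    0: {"HC": 86, "PD": 80},
    1: {"HC": 60, "PD": 88},
    2: {"HC": 47, "PD": 61},
    3: {"HC": 67, "PD": 80},
    4: {"HC": 96, "PD": 80},
}

# Share of PD recordings in each validation fold.
pd_share = {
    fold: counts["PD"] / (counts["PD"] + counts["HC"])
    for fold, counts in fold_valid_counts.items()
}
for fold, share in pd_share.items():
    print(f"fold{fold}_valid: PD share = {share:.3f}")

# Each recording is validated exactly once, so the folds should add up to 745.
total_valid = sum(c["HC"] + c["PD"] for c in fold_valid_counts.values())
print("total validation recordings:", total_valid)
```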


In [35]:

for fold in range(5):
    train_paths = load_paths(f"fold{fold}_train.json")
    valid_paths = load_paths(f"fold{fold}_valid.json")
    test_paths = load_paths(f"test.json")
    
    overlap = train_paths.intersection(test_paths)
    print(f"\n Number of overlapping audio files between fold{fold}_train and test: {len(overlap)}")
    if overlap:
        print(f"Some overlapping fold{fold}_files:")
        for p in list(overlap)[:10]:  # show first 10
            print("-", p)
    
    overlap = train_paths.intersection(valid_paths)
    print(f"\n Number of overlapping audio files between fold{fold}_train and fold{fold}_valid: {len(overlap)}")
    
    overlap = valid_paths.intersection(test_paths)
    print(f"\n Number of overlapping audio files between fold{fold}_valid and test: {len(overlap)}")
    
    
    train_speakers = load_speakers(f"fold{fold}_train.json")
    valid_speakers = load_speakers(f"fold{fold}_valid.json")
    test_speakers = load_speakers(f"test.json")
    
    overlap_speakers = train_speakers.intersection(test_speakers)
    print(f"\n🎙️ Overlapping speakers between fold{fold}_train and test: {len(overlap_speakers)}")
    
    overlap_speakers = train_speakers.intersection(valid_speakers)
    print(f"\n🎙️ Overlapping speakers between fold{fold}_train and fold{fold}_valid: {len(overlap_speakers)}")
    
    overlap_speakers = valid_speakers.intersection(test_speakers)
    print(f"\n🎙️ Overlapping speakers between fold{fold}_valid and test: {len(overlap_speakers)}")
    
    if overlap_speakers:
        print(f"Some shared fold{fold}_speakers:")
        for s in list(overlap_speakers)[:10]:
            print("-", s)



 Number of overlapping audio files between fold0_train and test: 0

 Number of overlapping audio files between fold0_train and fold0_valid: 0

 Number of overlapping audio files between fold0_valid and test: 0

🎙️ Overlapping speakers between fold0_train and test: 0

🎙️ Overlapping speakers between fold0_train and fold0_valid: 0

🎙️ Overlapping speakers between fold0_valid and test: 0

 Number of overlapping audio files between fold1_train and test: 0

 Number of overlapping audio files between fold1_train and fold1_valid: 0

 Number of overlapping audio files between fold1_valid and test: 0

🎙️ Overlapping speakers between fold1_train and test: 0

🎙️ Overlapping speakers between fold1_train and fold1_valid: 0

🎙️ Overlapping speakers between fold1_valid and test: 0

 Number of overlapping audio files between fold2_train and test: 0

 Number of overlapping audio files between fold2_train and fold2_valid: 0

 Number of overlapping audio files between fold2_valid and test: 0

🎙️ Overlap

In [None]:
#**************************************

## Data statistics

### number of examples in each class 

Count the recordings per class to check the class balance ratio.

In [4]:
from speechbrain.utils.data_utils import get_all_files

young_healthy_files = get_all_files("/home/ulaval.ca/maelr5/scratch/parkinsons/15 Young Healthy Control", match_and=['.wav'])
elderly_healthy_files = get_all_files("/home/ulaval.ca/maelr5/scratch/parkinsons/22 Elderly Healthy Control", match_and=['.wav'])

print('15 young healthy data size= ', len(young_healthy_files))
print('22 elderly healthy data size= ', len(elderly_healthy_files))
print('37 candidates - Healthy data size= ', len(young_healthy_files) + len(elderly_healthy_files))


15 young healthy data size=  45
22 elderly healthy data size=  349
37 candidates - Healthy data size=  394


In [5]:
data_files = get_all_files("/home/ulaval.ca/maelr5/scratch/parkinsons/28 People with Parkinson's disease", match_and=['.wav'])

print("28 candidates - with Parkinson's disease data size= ", len(data_files))


28 candidates - with Parkinson's disease data size=  437
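
The balance ratio itself follows directly from the counts printed above; the lines below just spell out the arithmetic.

```python
# Counts copied from the outputs above.
hc_files = 45 + 349  # young + elderly healthy controls
pd_files = 437       # people with Parkinson's disease

total_files = hc_files + pd_files
print("total recordings:", total_files)
print(f"HC share: {hc_files / total_files:.3f}")
print(f"PD share: {pd_files / total_files:.3f}")
print(f"PD/HC ratio: {pd_files / hc_files:.3f}")
```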


In [None]:

# 39 Healthy (1) in test
# 45 PD (0) in test

# Reduce the data: use only the short sustained-vowel recordings

Sustained vowels: participants are asked to produce sustained phonations of the vowels 'a', 'e', 'i', 'o', 'u'. These recordings are particularly useful for analyzing fundamental frequency (F0) variations and other acoustic features that can indicate PD-related changes in voice production.


In [38]:
from speechbrain.utils.data_utils import get_all_files

data_files = get_all_files('/home/ulaval.ca/maelr5/scratch/parkinsons',
                           match_and=['.wav'],
                           match_or=['VA1','VA2','VE1','VE2','VI1','VI2','VO1','VO2','VU1','VU2'],
                          )

print('data size= ', len(data_files))

vowels_df = extract_metadata(data_files)
print(len(vowels_df))
# print(vowels_df.head(2))

vowels_train_df, vowels_valid_df, vowels_test_df = split_by_speaker(vowels_df, seed_value = 900)
print('*****************************************')
print('train wavfiles size= ', len(vowels_train_df))
print('valid wavfiles size= ', len(vowels_valid_df))
print('test wavfiles size= ', len(vowels_test_df))


data size=  495
495
data speakers size=  46
train speakers size=  36
valid speakers size=  5
test speakers size=  5
*****************************************
train wavfiles size=  385
valid wavfiles size=  60
test wavfiles size=  50
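
The task codes (`VA1`, `VE2`, ...) appear at the start of the file names in the listings above. If `match_or` matches these codes as substrings anywhere in the path, which is an assumption about its behavior, a code could in principle also match inside a folder name or further along a file name. A stricter filter that looks only at the start of the file name can be sketched as follows; `is_vowel_recording` and `partition_by_task` are hypothetical helpers.

```python
import os

# Task codes used for the sustained-vowel recordings in the cell above.
VOWEL_PREFIXES = ("VA1", "VA2", "VE1", "VE2", "VI1", "VI2", "VO1", "VO2", "VU1", "VU2")


def is_vowel_recording(path):
    """True if the file name (not the folders) starts with a vowel task code."""
    return os.path.basename(path).startswith(VOWEL_PREFIXES)


def partition_by_task(paths):
    """Split a list of paths into (vowel recordings, everything else)."""
    vowels = [p for p in paths if is_vowel_recording(p)]
    others = [p for p in paths if not is_vowel_recording(p)]
    return vowels, others
```

The second list returned by `partition_by_task` corresponds to the phrase recordings, so the same helper would cover both subsets.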


In [39]:
df_to_json(vowels_train_df, "train_vowels.json", shuffle=True, balance_classes=True, seed=seed_value)
df_to_json(vowels_valid_df, "valid_vowels.json", shuffle=True, balance_classes=False, seed=seed_value)
df_to_json(vowels_test_df, "test_vowels.json", shuffle=True, balance_classes=False, seed=seed_value)




In [40]:
vowels_train_json = "train_vowels.json"
vowels_valid_json = "valid_vowels.json"
vowels_test_json = "test_vowels.json"

print("train:-")
calculate_class_samples(vowels_train_json)
print("valid:-")
calculate_class_samples(vowels_valid_json)
print("test:-")
calculate_class_samples(vowels_test_json)


train_paths = load_paths(vowels_train_json)
valid_paths = load_paths(vowels_valid_json)
test_paths = load_paths(vowels_test_json)

overlap = train_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between train and test: {len(overlap)}")
if overlap:
    print("Some overlapping files:")
    for p in list(overlap)[:10]:  # show first 10
        print("-", p)

overlap = train_paths.intersection(valid_paths)
print(f"\n Number of overlapping audio files between train and valid: {len(overlap)}")

overlap = valid_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between valid and test: {len(overlap)}")



train_speakers = load_speakers(vowels_train_json)
valid_speakers = load_speakers(vowels_valid_json)
test_speakers = load_speakers(vowels_test_json)

overlap_speakers = train_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between train and test: {len(overlap_speakers)}")

overlap_speakers = train_speakers.intersection(valid_speakers)
print(f"\n🎙️ Overlapping speakers between train and valid: {len(overlap_speakers)}")

overlap_speakers = valid_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between valid and test: {len(overlap_speakers)}")

if overlap_speakers:
    print("Some shared speakers:")
    for s in list(overlap_speakers)[:10]:
        print("-", s)



train:-
HC: 170 samples
PD: 170 samples
valid:-
PD: 30 samples
HC: 30 samples
test:-
HC: 20 samples
PD: 30 samples

 Number of overlapping audio files between train and test: 0

 Number of overlapping audio files between train and valid: 0

 Number of overlapping audio files between valid and test: 0

🎙️ Overlapping speakers between train and test: 0

🎙️ Overlapping speakers between train and valid: 0

🎙️ Overlapping speakers between valid and test: 0


# Reduce the data: use only the recordings of short sentences or phrases (no vowels)
Phrases: short sentences or phrases are recorded to evaluate more complex speech patterns. These recordings help in assessing prosody, articulation, and other speech characteristics that may be affected by PD.


In [42]:
from speechbrain.utils.data_utils import get_all_files

data_files = get_all_files('/home/ulaval.ca/maelr5/scratch/parkinsons',
                           match_and=['.wav'],
                           exclude_or=['VA1','VA2','VE1','VE2','VI1','VI2','VO1','VO2','VU1','VU2'],
                          )

print('phrases data size= ', len(data_files))

phrases_df = extract_metadata(data_files)
print(len(phrases_df))

phrases_train_df, phrases_valid_df, phrases_test_df = split_by_speaker(phrases_df, seed_value = 900)
print('*****************************************')
print('train wavfiles size= ', len(phrases_train_df))
print('valid wavfiles size= ', len(phrases_valid_df))
print('test wavfiles size= ', len(phrases_test_df))


phrases data size=  336
336
data speakers size=  61
train speakers size=  48
valid speakers size=  6
test speakers size=  7
*****************************************
train wavfiles size=  265
valid wavfiles size=  38
test wavfiles size=  33


In [43]:
df_to_json(phrases_train_df, "train_phrases.json", shuffle=True, balance_classes=True, seed=seed_value)
df_to_json(phrases_valid_df, "valid_phrases.json", shuffle=True, balance_classes=False, seed=seed_value)
df_to_json(phrases_test_df, "test_phrases.json", shuffle=True, balance_classes=False, seed=seed_value)




In [44]:
phrases_train_json = "train_phrases.json"
phrases_valid_json = "valid_phrases.json"
phrases_test_json = "test_phrases.json"

print("train:-")
calculate_class_samples(phrases_train_json)
print("valid:-")
calculate_class_samples(phrases_valid_json)
print("test:-")
calculate_class_samples(phrases_test_json)


train_paths = load_paths(phrases_train_json)
valid_paths = load_paths(phrases_valid_json)
test_paths = load_paths(phrases_test_json)

overlap = train_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between train and test: {len(overlap)}")
if overlap:
    print("Some overlapping files:")
    for p in list(overlap)[:10]:  # show first 10
        print("-", p)

overlap = train_paths.intersection(valid_paths)
print(f"\n Number of overlapping audio files between train and valid: {len(overlap)}")

overlap = valid_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between valid and test: {len(overlap)}")



train_speakers = load_speakers(phrases_train_json)
valid_speakers = load_speakers(phrases_valid_json)
test_speakers = load_speakers(phrases_test_json)

overlap_speakers = train_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between train and test: {len(overlap_speakers)}")

overlap_speakers = train_speakers.intersection(valid_speakers)
print(f"\n🎙️ Overlapping speakers between train and valid: {len(overlap_speakers)}")

overlap_speakers = valid_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between valid and test: {len(overlap_speakers)}")

if overlap_speakers:
    print("Some shared speakers:")
    for s in list(overlap_speakers)[:10]:
        print("-", s)



train:-
HC: 121 samples
PD: 121 samples
valid:-
HC: 15 samples
PD: 23 samples
test:-
PD: 18 samples
HC: 15 samples

 Number of overlapping audio files between train and test: 0

 Number of overlapping audio files between train and valid: 0

 Number of overlapping audio files between valid and test: 0

🎙️ Overlapping speakers between train and test: 0

🎙️ Overlapping speakers between train and valid: 0

🎙️ Overlapping speakers between valid and test: 0


# Reduce the data: use only the short recordings of the vowel 'a'


In [45]:
from speechbrain.utils.data_utils import get_all_files

data_files = get_all_files('/home/ulaval.ca/maelr5/scratch/parkinsons',
                           match_and=['.wav'],
                           match_or=['VA1', 'VA2'],
                          )

print('vowel-a data size= ', len(data_files))

vowela_df = extract_metadata(data_files)
print(len(vowela_df))

vowela_train_df, vowela_valid_df, vowela_test_df = split_by_speaker(vowela_df, seed_value = 900)
print('*****************************************')
print('train wavfiles size= ', len(vowela_train_df))
print('valid wavfiles size= ', len(vowela_valid_df))
print('test wavfiles size= ', len(vowela_test_df))


vowel-a data size=  99
99
data speakers size=  46
train speakers size=  36
valid speakers size=  5
test speakers size=  5
*****************************************
train wavfiles size=  77
valid wavfiles size=  12
test wavfiles size=  10


In [46]:
df_to_json(vowela_train_df, "train_vowela.json", shuffle=True, balance_classes=True, seed=seed_value)
df_to_json(vowela_valid_df, "valid_vowela.json", shuffle=True, balance_classes=False, seed=seed_value)
df_to_json(vowela_test_df, "test_vowela.json", shuffle=True, balance_classes=False, seed=seed_value)




In [47]:
vowela_train_json = "train_vowela.json"
vowela_valid_json = "valid_vowela.json"
vowela_test_json = "test_vowela.json"

print("train:-")
calculate_class_samples(vowela_train_json)
print("valid:-")
calculate_class_samples(vowela_valid_json)
print("test:-")
calculate_class_samples(vowela_test_json)


train_paths = load_paths(vowela_train_json)
valid_paths = load_paths(vowela_valid_json)
test_paths = load_paths(vowela_test_json)

overlap = train_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between train and test: {len(overlap)}")
if overlap:
    print("Some overlapping files:")
    for p in list(overlap)[:10]:  # show first 10
        print("-", p)

overlap = train_paths.intersection(valid_paths)
print(f"\n Number of overlapping audio files between train and valid: {len(overlap)}")

overlap = valid_paths.intersection(test_paths)
print(f"\n Number of overlapping audio files between valid and test: {len(overlap)}")



train_speakers = load_speakers(vowela_train_json)
valid_speakers = load_speakers(vowela_valid_json)
test_speakers = load_speakers(vowela_test_json)

overlap_speakers = train_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between train and test: {len(overlap_speakers)}")

overlap_speakers = train_speakers.intersection(valid_speakers)
print(f"\n🎙️ Overlapping speakers between train and valid: {len(overlap_speakers)}")

overlap_speakers = valid_speakers.intersection(test_speakers)
print(f"\n🎙️ Overlapping speakers between valid and test: {len(overlap_speakers)}")

if overlap_speakers:
    print("Some shared speakers:")
    for s in list(overlap_speakers)[:10]:
        print("-", s)



train:-
PD: 34 samples
HC: 34 samples
valid:-
PD: 6 samples
HC: 6 samples
test:-
HC: 4 samples
PD: 6 samples

 Number of overlapping audio files between train and test: 0

 Number of overlapping audio files between train and valid: 0

 Number of overlapping audio files between valid and test: 0

🎙️ Overlapping speakers between train and test: 0

🎙️ Overlapping speakers between train and valid: 0

🎙️ Overlapping speakers between valid and test: 0
