
<h1><span style="color:green">UCLA, Fall 2024 </span> <span style="color:yellow"> ****</span></h1>
<h2><span style="color:blue">Professor Abeer Alwan, alwan@ee.ucla.edu</span> <span style="color:yellow"> ****</span></h2>
<h3><span style="color:blue">Dept. of Electrical Engineering EE214A: Digital Speech Processing</span> <span style="color:yellow"> ****</span></h3>



# ECE M214A Project: Speaker Region Identification



In this project, we'll train a machine learning model to classify speakers by regional dialect.  We will use speech samples from the Corpus of Regional African American Language (CORAAL - https://oraal.github.io/coraal), in which each speaker belongs to one of six different US cities: 1) Rochester, NY (ROC), 2) Lower East Side, Manhattan, NY (LES), 3) Washington DC (DCB), 4) Princeville, NC (PRV), 5) Valdosta, GA (VLD) or 6) Detroit, MI (DTA).

The project files can be downloaded from [this link](https://ucla.box.com/s/mohh4fnmgj3vekui8i8n28i02odleaop)

To do this, we will first extract features from the audio files and then train a classifier to predict the city of origin of the utterance's speaker.  The goal is to extract a feature that contains useful information about regional dialect characteristics.

## 1. Setting up the data directories and Google Colab

Store a copy of the project files in your Google Drive.

Make sure that the 'project_data' folder is stored in the top level of your Google Drive. Otherwise, you will need to change the corresponding paths in the remainder of the notebook.

Mount your Google Drive. This gives the notebook read/write access to the data stored there.  You can do this either in the file browser on the left side of this notebook or by running the code snippet below.

It is recommended that you use your UCLA Google account for this project, as it has more storage than a standard Google account.

# <span style="color:orange"> Required Common Imports - Packages/Modules/Libraries </span>

In [None]:
%pip install torchaudio
%pip install --upgrade pip
%pip install librosa
%pip install xgboost
%pip install tqdm
%pip install pandas
%pip install shap
%pip install matplotlib
%pip install opensmile
%pip install ipywidgets

# <span style="color:orange"> Print Environment Information </span>

In [None]:
import sys

def is_colab():
  return 'google.colab' in sys.modules

if is_colab():
  print("Running in Google Colab")
  !python --version
else:
  print("Not running in Google Colab")
  %pip --version
  !python3 --version

# <span style="color:yellow"> **** </span><span style="color:red"> Setting up environment based on Colab Vs Laptop </span> <span style="color:yellow"> ****</span>


In [None]:
if is_colab():
  print("*******  Running in Google Colab Environment *******")
  
  from google.colab import userdata
  gh_pat = userdata.get('gh_pat')
  gh_username = userdata.get('gh_username')


  # We have to 'mount' google drive to this notebook. This will allow us to access files from google drive here.
  # You will be asked for permission to access your drive. Click 'Connect to Google Drive' when prompted and select the appropriate Google account. Click Continue until the window closes.

  from google.colab import drive
  drive.mount('/content/drive')

  import sys, os
  parent_dir = os.path.dirname(os.path.realpath('drive/MyDrive/ece_m214/'))
  print (parent_dir)

  # Build the absolute project path from the parent directory
  project_dir = os.path.abspath(os.path.join(parent_dir, 'ece_m214/final_project'))

  sys.path.insert(0,project_dir)

  print ('Parent Directory Path:', parent_dir)
  print ('Project Path:', project_dir)

else:

  print("*********  Running in Non - Google Colab Environment *********")
  
  import sys, os

  parent_dir = os.path.dirname(os.path.realpath('/Users/parthakundu/GitHub/my_ucla_grad_projects/'))
  print (parent_dir)

  # Build the absolute project path from the parent directory
  # project_dir = os.path.abspath(os.path.join(parent_dir, 'drive/MyDrive/ece_m214/final_project/'))
  project_dir = os.path.abspath(os.path.join(parent_dir, "./my_ucla_grad_projects/ucla-ece-m214a-project-sri/"))

  sys.path.insert(0,project_dir)

  print ('Parent Directory Path:', parent_dir)
  print ('Project Path:', project_dir)


# <span style="color:red"> ***** </span><span style="color:blue"> Execute all the cells from this point on in all environments</span> <span style="color:green"> ***** </span>

In [None]:
import os

folder1_path = project_dir + "/dataFrame"
folder2_path = project_dir + "/images"
folder3_path = project_dir + "/project_data"
folder4_path = project_dir + "/zip_files"

os.makedirs(folder1_path, exist_ok=True)
os.makedirs(folder2_path, exist_ok=True)
os.makedirs(folder3_path, exist_ok=True)
os.makedirs(folder4_path, exist_ok=True)

# <span style="color:red"> ***** </span><span style="color:orange">Loading Initial/Given Dataset</span> <span style="color:green"> ***** </span>

In [None]:
# #  Uncomment and run the following lines if you need to unzip the provided project dataset package.

# import zipfile

# with zipfile.ZipFile(parent_dir + '/ece_m214/final_project/F24 ECE M214A Project.zip', 'r') as zip_ref:
#     zip_ref.extractall(parent_dir + '/ece_m214/final_project/')

To run this project on your local system, replace the corresponding file paths with the locations of the project files on your local machine.
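
Because a wrong path only shows up later as an empty file list, it can help to check the folder layout first. The following is a minimal sketch, not part of the provided project code: `check_project_layout` and `EXPECTED_SPLITS` are hypothetical names, and the subfolder names are assumed from the paths used later in this notebook.

```python
from pathlib import Path

# Subfolder names assumed from the paths used later in this notebook.
EXPECTED_SPLITS = ("train", "test_clean", "test_noisy")


def check_project_layout(project_dir):
    """Return a dict mapping each expected split to its number of .wav files.

    Raises FileNotFoundError if a split folder is missing, so that a wrong
    project_dir is caught before feature extraction starts.
    """
    data_dir = Path(project_dir) / "project_data"
    counts = {}
    for split in EXPECTED_SPLITS:
        split_dir = data_dir / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing folder: {split_dir}")
        counts[split] = len(list(split_dir.glob("*.wav")))
    return counts
```

Calling `check_project_layout(project_dir)` after the setup cells should return a non-zero count for each split if the data was copied to the expected place.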

## 2. Getting familiar with the data


Let's take a moment to understand the data.  The original CORAAL dataset consists of speakers from one of seven cities, split into 8 components.  The audio files are named with the convention: DCB_se1_ag1_f_03.  Here, DCB is the city code, se1 denotes the socioeconomic group of the speaker, ag1 denotes the age group of the speaker, f denotes female, and 03 denotes the participant number.  Each unique combination of these identifiers marks one speaker.

The dataset has been preprocessed to only include audio segments greater than 5 seconds in length. Those segments are numbered by appending the tag _{seg_number} to the name of each segment.
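
The naming convention above can be turned into structured fields with a few lines of Python. This is only a sketch: `parse_segment_name` is a hypothetical helper, and it assumes the first five underscore-separated fields follow the convention described above. Any trailing fields (the segment numbering) are returned as-is without interpretation.

```python
import os


def parse_segment_name(path):
    """Split a segment file name into the fields of the naming convention.

    The first five underscore-separated fields are city, socioeconomic group,
    age group, gender and participant number. Remaining fields are kept as a
    list of strings under "suffix".
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {path}")
    city, socio, age, gender, participant = parts[:5]
    return {
        "city": city,
        "socioeconomic_group": socio,
        "age_group": age,
        "gender": gender,
        "participant": participant,
        "speaker_id": "_".join(parts[:5]),
        "suffix": parts[5:],
    }


info = parse_segment_name("project_data/train/DCB_se1_ag2_m_01_1_11.wav")
print(info["city"], info["speaker_id"], info["suffix"])
```

The `speaker_id` field could be used, for example, to group segments by speaker when building a validation split.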

You can also try listening to any segment like this:

In [None]:
from IPython.display import Audio

sr = 44100

Audio(filename= project_dir + "/project_data/train/DCB_se1_ag2_m_01_1_11.wav", rate=sr)

The original dataset has also been split into a train and test set. The test set has been further split, with a portion corrupted with the addition of noise:

In [None]:
sr = 44100

Audio(filename= project_dir + "/project_data/test_clean/ROC_se0_ag2_f_01_1_396.wav", rate=sr)

In [None]:
sr = 44100

Audio(filename= project_dir + "/project_data/test_noisy/VLD_se0_ag4_m_02_1_64.wav", rate=sr)

## 3. Feature Extraction

As a baseline, we will use the MFCCs computed with the librosa Python library, averaged over time. Your job will be to choose better features to improve performance on both the clean and noisy data.

We first define a pair of functions to create features and labels for our classification model:


In [None]:
import librosa
import torchaudio
import numpy as np
from glob import glob
from tqdm import tqdm


def extract_feature(audio_file, n_mfcc=13):

  '''
  Function to extract features from a single audio file given its path
  Modify this function to extract your own custom features
  '''

  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)

  # replace the following features with your own
  mfccs = librosa.feature.mfcc(y=audio, sr=fs, n_mfcc=n_mfcc)
  feat_out = np.mean(mfccs,axis=1)

  return feat_out


def get_label(file_name):
  '''
  Function to retrieve output labels from filenames
  '''
  if 'ROC' in file_name:
    label=0
  elif 'LES' in file_name:
    label=1
  elif 'DCB' in file_name:
    label=2
  elif 'PRV' in file_name:
    label=3
  elif 'VLD' in file_name:
    label=4
  elif 'DTA' in file_name:
    label=5
  else:
    raise ValueError('invalid file name')
  return label
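
The baseline in `extract_feature` keeps only the time average of each coefficient. One simple direction, shown here as a sketch rather than a recommended feature set, is to also keep the spread over time. `summarize_frames` is a hypothetical helper that works on any (n_features, n_frames) matrix, such as the `mfccs` array above. The toy matrix stands in for real MFCCs.

```python
import numpy as np


def summarize_frames(frame_feats):
    """Collapse a (n_features, n_frames) matrix into one fixed-length vector.

    Concatenates the per-feature mean and standard deviation over time, so
    the output has length 2 * n_features.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    mean = frame_feats.mean(axis=1)
    std = frame_feats.std(axis=1)
    return np.concatenate([mean, std])


# Toy stand-in for an MFCC matrix: 3 coefficients x 4 frames.
toy = np.array([[1.0, 1.0, 1.0, 1.0],
                [0.0, 2.0, 0.0, 2.0],
                [1.0, 2.0, 3.0, 4.0]])
print(summarize_frames(toy))
```

Whether the added statistics help on the noisy test set is something to verify experimentally.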

Let us now call these functions to extract the features and labels from the train directory.

In [None]:

#First we obtain the list of all files in the train directory
train_files = glob(project_dir + '/project_data/train/*.wav')

#Let's sort it so that we're all using the same file list order
#and you can continue processing the features from a given file if it stops
#partway through running
train_files.sort()

train_feat=[]
train_label=[]

for wav in tqdm(train_files):

  train_feat.append(extract_feature(wav))
  train_label.append(get_label(wav))

In [None]:

print ("Length of train_feat:",len(train_feat))

In [None]:
#Now we obtain the list of all files in the test_clean directory
test_clean_files = glob(project_dir + '/project_data/test_clean/*.wav')

#Similar to above, we sort the files
test_clean_files.sort()

test_clean_feat=[]
test_clean_label=[]

for wav in tqdm(test_clean_files):

  test_clean_feat.append(extract_feature(wav))
  test_clean_label.append(get_label(wav))

In [None]:
print ("Length of test_clean_feat:",len(test_clean_feat))

In [None]:
#Finally we obtain the list of all files in the test_noisy directory
test_noisy_files = glob(project_dir + '/project_data/test_noisy/*.wav')

#Similar to above, we sort the files
test_noisy_files.sort()

test_noisy_feat=[]
test_noisy_label=[]

for wav in tqdm(test_noisy_files):

  test_noisy_feat.append(extract_feature(wav))
  test_noisy_label.append(get_label(wav))

In [None]:
print ("Length of test_noisy_feat:",len(test_noisy_feat))
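
Since the file lists are sorted, extraction can be resumed if it stops partway. One way to do that is a small on-disk cache, sketched below. `extract_with_cache`, its `save_every` parameter and the cache file are hypothetical and not part of the provided code. The cache should be deleted whenever `extract_feature` changes, otherwise stale features are reused.

```python
import os
import pickle


def extract_with_cache(files, extract_fn, cache_path, save_every=100):
    """Return features for `files` in order, skipping files already cached."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)

    def save():
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)

    new_items = 0
    for path in files:
        if path in cache:
            continue  # already extracted in an earlier run
        cache[path] = extract_fn(path)
        new_items += 1
        if new_items % save_every == 0:
            save()  # periodic checkpoint in case the run is interrupted
    save()
    return [cache[path] for path in files]
```

For example, `train_feat = extract_with_cache(train_files, extract_feature, project_dir + '/dataFrame/train_cache.pkl')` would replace the training loop above, where the `.pkl` file name is an arbitrary choice.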

## 4. Model Training and Predictions

Now we'll train the backend system to predict the regions from the input features.  We'll use a gradient-boosted decision tree classifier from XGBoost for this.  An advantage of this model is that we can also parse the decision trees and measure the impact of different features on the end result for explainability.

In [None]:
import xgboost
import numpy as np
import pandas as pd

#Format input data

#Edit this variable to create a list that contains your feature names
feat_names=['mfcc_' +str(n) for n in range(len(train_feat[0]))]

train_feat_df = pd.DataFrame(data=np.stack(train_feat), columns=feat_names)
y_train=np.stack(train_label)


test_clean_feat_df = pd.DataFrame(data=np.stack(test_clean_feat), columns=feat_names)
y_test_clean=np.stack(test_clean_label)


test_noisy_feat_df = pd.DataFrame(data=np.stack(test_noisy_feat), columns=feat_names)
y_test_noisy=np.stack(test_noisy_label)


#you could just pass in the matrix of features to xgboost
#but it looks prettier in the shap explainer if you format it
#as a dataframe.


model = xgboost.XGBClassifier()
model.fit(train_feat_df,y_train)

print("Train Clean Acc =", np.sum(y_train==model.predict(train_feat_df))/len(y_train))

print("Test Clean Acc =", np.sum(y_test_clean==model.predict(test_clean_feat_df))/len(y_test_clean))

print("Test Noisy Acc =", np.sum(y_test_noisy==model.predict(test_noisy_feat_df))/len(y_test_noisy))


To save the feature dataframes to CSV files, run the following block of code:

In [None]:
# train_feat_df.to_csv(project_dir + '/dataFrame/current_features.csv')
train_feat_df.to_csv(project_dir + '/dataFrame/myfeat_train.csv')
test_clean_feat_df.to_csv(project_dir + '/dataFrame/myfeat_test_clean.csv')
test_noisy_feat_df.to_csv(project_dir + '/dataFrame/myfeat_test_noisy.csv')

To load a preexisting dataframe of features (saved from a previous notebook), run the following cell and then train the model

In [None]:
# index_col=0 skips the row-index column that to_csv writes by default;
# without it, an extra 'Unnamed: 0' column would be read in as a feature.
train_feat_df = pd.read_csv(project_dir + '/dataFrame/myfeat_train.csv', index_col=0)
test_clean_feat_df = pd.read_csv(project_dir + '/dataFrame/myfeat_test_clean.csv', index_col=0)
test_noisy_feat_df = pd.read_csv(project_dir + '/dataFrame/myfeat_test_noisy.csv', index_col=0)

## 5. Interpreting Results and Explainability

To see the impact different features have on the model, we create a plot of the feature importances. The features are listed top to bottom in order of how important they were to the decision.

In [None]:
import shap

# Explain the model's predictions using SHAP by computing SHAP values
explainer = shap.Explainer(model)
shap_values = explainer.shap_values(train_feat_df)

#Convert the shap values for each class to a single list
shap_as_list=[]
# print (shap_as_list)


In [None]:
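# This assumes shap_values has shape (n_samples, n_features, n_classes),
# as returned by recent SHAP versions. Split it into one
# (n_samples, n_features) array per class. Rebuilding the list here keeps
# the cell safe to re-run.
shap_as_list = [shap_values[:, :, i] for i in range(shap_values.shape[2])]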

# Plot the SHAP values
shap.summary_plot(shap_as_list, train_feat_df, plot_type="bar")
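
The ranking in the bar plot can also be computed directly, which is handy for printing or logging it. The sketch below assumes a SHAP array of shape (n_samples, n_features, n_classes) and approximates what the plot displays: the mean absolute SHAP value per feature, summed over classes. `rank_features_by_shap` is a hypothetical helper, not a SHAP function.

```python
import numpy as np


def rank_features_by_shap(shap_array, feature_names):
    """Rank features by mean absolute SHAP value.

    shap_array: array of shape (n_samples, n_features, n_classes).
    Returns a list of (name, score) pairs, most important first.
    """
    shap_array = np.asarray(shap_array, dtype=float)
    # Average |SHAP| over samples, then sum the per-class contributions.
    per_feature = np.abs(shap_array).mean(axis=0).sum(axis=1)
    order = np.argsort(per_feature)[::-1]
    return [(feature_names[i], float(per_feature[i])) for i in order]
```

With the variables above, `rank_features_by_shap(shap_values, feat_names)[:5]` would list the five highest-ranked features.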

We can also plot a confusion matrix to see which cities the model confuses with one another, first for the clean test set and then for the noisy one.

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

confusion_matrix_clean = metrics.confusion_matrix(y_test_clean, model.predict(test_clean_feat_df))
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix_clean, display_labels = ['ROC','LES','DCB','PRV','VLD', 'DTA'])
cm_display.plot()
plt.show()

In [None]:

confusion_matrix_noisy = metrics.confusion_matrix(y_test_noisy, model.predict(test_noisy_feat_df))
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix_noisy, display_labels = ['ROC','LES','DCB','PRV','VLD', 'DTA'])
cm_display.plot()
plt.show()
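
Overall accuracy can hide cities that are rarely predicted correctly. As a sketch, the per-city recall can be read off a confusion matrix whose rows are the true labels, which is the layout `metrics.confusion_matrix` returns. `per_class_recall` is a hypothetical helper.

```python
import numpy as np


def per_class_recall(cm, labels):
    """Return {label: recall} from a confusion matrix with true labels as rows."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Diagonal entries are correct predictions; guard against empty classes.
    recall = np.divide(np.diag(cm), row_totals,
                       out=np.zeros_like(row_totals), where=row_totals > 0)
    return dict(zip(labels, recall.tolist()))
```

For example, `per_class_recall(confusion_matrix_noisy, ['ROC','LES','DCB','PRV','VLD','DTA'])` would show which cities suffer most from the added noise.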