# Machine Learning with audio data using pandas and scikit-learn
#### Nina Lopatina, Ph.D.   |  InQTel, Lab41  |  January 18 2019
In this workshop, we will learn how to classify a speaker's gender from realistic audio data using pandas and scikit-learn. We will go through a brief description of the VOiCES data, how to view and manipulate data in pandas, and how to train a simple model in scikit-learn. There are exercises throughout the workshop, with a longer exercise available at the end for those who wish to continue developing these skills or work with VOiCES data in the hackathon.


### Setup: 

See the directions in the repo README.

----------------------------

## Part 1: Intro to VOiCES

[VOiCES](https://voices18.github.io/) is an audio dataset put together in collaboration between Lab41 and SRI. 

Clean speech was recorded in rooms of different sizes, each having distinct room acoustic profiles, with background noise played concurrently. 

These recordings provide audio data that better represents real-use scenarios. 

These data are provided in wav files, but we have extracted some features from a subset of the data for this workshop.

----------------------------

## Setup to play example audio files:

In [None]:
import librosa as lr
import librosa.display
import IPython.display

filename_clean = 'data/Lab41-SRI-VOiCES-rm1-none-sp0083-ch003054-sg0005-mc01-stu-clo-dg090.wav'
filename_babb = 'data/Lab41-SRI-VOiCES-rm1-babb-sp0083-ch003054-sg0005-mc12-lav-wal-dg090.wav'
filename_tele = 'data/Lab41-SRI-VOiCES-rm1-tele-sp0083-ch003054-sg0005-mc03-stu-mid-dg090.wav'
filename_musi = 'data/Lab41-SRI-VOiCES-rm1-musi-sp0083-ch003054-sg0005-mc07-stu-beh-dg090.wav'

def player(fname):
    # Read in the signal at its native sample rate, mixed down to mono
    s0, sample_rate = lr.load(fname, sr=None, mono=True)
    IPython.display.display(IPython.display.Audio(data=s0, rate=sample_rate))

### Clean audio file

In [None]:
player(filename_clean)

### Same clip with television:

In [None]:
player(filename_tele)

### Same clip with music

In [None]:
player(filename_musi)

### Same clip with babble

In [None]:
player(filename_babb)

----------------------------

## Visualizing the data

### Waveform amplitude

In [None]:
import matplotlib.pyplot as plt
sp_x, sp_sr = lr.load(filename_tele) #Noisy
src_x, src_sr = lr.load(filename_clean) #Without noise

plt.figure(figsize = (10,5))
# Note: newer librosa releases replace waveplot with waveshow
lr.display.waveplot(src_x, sr = src_sr, color = 'blue', alpha = 0.6, label = 'Source')
lr.display.waveplot(sp_x, sr = sp_sr, alpha = 0.5, color = 'orange',label = 'Noisy Speech')
plt.legend()
plt.ylabel('Waveform amplitude', size = 16)
plt.xlabel('Time (seconds)', size = 16)
plt.title('Noisy and clean speech waveform amplitude', size = 20);

### Fourier transformed data

In [None]:
def plot_ft(x,sr,title):
    # Short-time Fourier transform, with magnitudes converted to decibels
    ft = lr.stft(x)
    db = lr.amplitude_to_db(abs(ft))
    plt.figure(figsize=(14, 4))
    plt.title(title, size = 20)
    lr.display.specshow(db, sr=sr, x_axis='time', y_axis='hz')
    plt.ylabel('Hz', size = 16)
    plt.xlabel('Time (seconds)', size = 16)
    clb = plt.colorbar()
    clb.set_label('Decibels')
    return db

src_db = plot_ft(src_x,src_sr,'Fourier transformed power spectrum')

### Mel spectrogram transformed data

Mel-frequency analysis of speech is based on human perception experiments:
* The human ear acts as a filter bank: it concentrates on only certain frequency components.

* These filters are non-uniformly spaced on the frequency axis: there are more filters in the low-frequency regions and fewer in the high-frequency regions.
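
The non-uniform spacing described above can be sketched with a few lines of numpy. This is only an illustration: it uses the common HTK-style mel formula, and the helper names `hz_to_mel` and `mel_to_hz` are defined here for the example (librosa ships its own converters, and its default mel scale differs slightly from this formula).

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Ten filter centers spaced evenly on the mel axis between 0 Hz and 8 kHz
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
centers_hz = mel_to_hz(mel_points)

# Gap between neighbouring centers, measured in Hz
spacing_hz = np.diff(centers_hz)

print(np.round(centers_hz))
print(np.round(spacing_hz))  # the gaps widen as frequency increases
```

Because the centers are evenly spaced in mels, they crowd together at low frequencies and spread out at high frequencies when viewed in Hz.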


In [None]:
def plot_ms(x,sr,title):
    # Mel-scaled power spectrogram, converted to decibels
    S = lr.feature.melspectrogram(y=x, sr=sr)
    db = lr.power_to_db(S)
    plt.figure(figsize=(14, 4))
    plt.title(title, size = 20)
    lr.display.specshow(db, sr=sr, x_axis='time', y_axis='mel')
    plt.ylabel('Hz', size = 16)
    plt.xlabel('Time (seconds)', size = 16)
    clb = plt.colorbar()
    clb.set_label('Decibels')
    return db  # returned so the caller can reuse the values, as in plot_ft
    
# Source
src_db = plot_ms(src_x,src_sr,'Mel transformed power spectrum')

----------------------------

## Part 2: Intro to Pandas

### Pandas use cases:

- Finding trends in data
- Business analytics
- Cleaning data
- Blending multiple data sources
- Easy data manipulation to make awesome models!

----------------------------

Let's import some packages -- these are modules with specific functionalities to make your life easier

In [None]:
import sys
import pandas as pd
import numpy as np

In [None]:
voices = pd.read_csv('./data/VOiCES_90deg_features.csv')

## Data Structures and Viewing Data

We will be using the DataFrame and Series data structures in pandas. You can think of a DataFrame as a spreadsheet or table, and of a Series as a single column.
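
As a small illustration of the two structures, the sketch below builds a toy table by hand. The column names echo the workshop data, but the values are made up for the example.

```python
import pandas as pd

# Toy table; values are invented for illustration
toy = pd.DataFrame({
    'noise': ['none', 'tele', 'musi'],
    'mic_id': [1, 3, 7],
    'variance': [0.12, 0.30, 0.25],
})

column = toy['variance']   # a single column is a Series
row = toy.loc[0]           # a single row is also returned as a Series

print(type(toy).__name__, type(column).__name__, type(row).__name__)
print(toy.shape)           # (rows, columns)
```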

### What types of data are in this data frame?

In [None]:
voices.dtypes

Data properties
* mic_id:           Microphone #
* mic_type:         studio or lavalier
* location:         Distance from subject, see https://voices18.github.io/Lab41-SRI-VOiCES_README/
* spk_angle:        all 90° here

Statistical values
* Centroid:        2D mean of audio data
* variance:        Dispersion of samples around the centroid
* skewness:        Measures the asymmetry of the probability density function (PDF) of the amplitude of a time series. Positive skewness indicates a longer tail toward large values.
* kurtosis:        Measures the peakedness of the PDF of a time series. A kurtosis value close to three indicates a Gaussian-like peakedness. PDFs with relatively sharp peaks have kurtosis greater than three. PDFs with relatively flat peaks have kurtosis less than three.
* roll_off_min, roll_off_max:    Minimum and maximum roll-off frequency

Mel Frequency Cepstral Coefficient (MFCC) transformed data
* mfcc 1-12: Features extracted from data generated into 12 Mel bands 
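
The skewness and kurtosis definitions above can be checked numerically. The sketch below is an illustration on synthetic samples, not the feature extraction used to build the workshop CSV; the helper functions are defined here for the example. Note that pandas' `Series.kurt()` and `scipy.stats.kurtosis` report excess kurtosis by default, so a Gaussian gives roughly 0 there instead of 3.

```python
import numpy as np

def skewness(x):
    """Third standardized moment: positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

def kurtosis(x):
    """Fourth standardized moment: about 3 for a Gaussian."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)          # symmetric reference distribution
right_tailed = rng.exponential(size=100_000) # long tail toward large values

print('gaussian:    ', round(skewness(gaussian), 2), round(kurtosis(gaussian), 2))
print('right-tailed:', round(skewness(right_tailed), 2), round(kurtosis(right_tailed), 2))
```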

In [None]:
# Most of the time the data we load is large, so we look at a subset
# Default is first 5 entries
voices.head()

### How many items?

In [None]:
len(voices)

### Column names

In [None]:
voices.columns

### Quick data statistics

In [None]:
voices.info()

In [None]:
voices.describe()

### Selecting Data

In [None]:
voices['noise']

## Exercise 1:
### Select the microphone id column

In [None]:

# Answer:
voices['mic_id']

### Find types of noise conditions

In [None]:
voices['noise'].unique()

## Exercise 2: 

### Find the types of microphones used:

In [None]:
#Answer:
voices['mic_type'].unique()

----------------------------

## Data visualization I

We want to get to know our data

### Plot all the features

In [None]:
features = voices.columns
r = 6
c = 5
label_size = 22
tick_size = 18
title_size = 28

fig = plt.figure(figsize = (5*c,5*r))

for i,var in enumerate(features):
    ax = fig.add_subplot(r,c,i+1)
    x = voices.index
    y = voices[var]
    ax.scatter(x,y,c='m')

    ax.set_title(var,size=title_size)
    ax.set_xlabel('speaker', size = label_size)
    ax.set_ylabel(var,size = label_size)
    ax.tick_params(labelsize=tick_size)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.2)

### Sorted plot

In [None]:
features = voices.columns
r = 6
c = 5
label_size = 22
tick_size = 18
title_size = 28

fig = plt.figure(figsize = (5*c,5*r))

for i,var in enumerate(features):
    ax = fig.add_subplot(r,c,i+1)
    x = voices.index
    y = voices.sort_values(by=var)[var] # Added sort_values()
    ax.scatter(x,y,c='m')

    ax.set_title(var,size=title_size)
    ax.set_xlabel('sample', size = label_size) # Changed speaker to sample b/c the # no longer corresponds
    ax.set_ylabel(var,size = label_size)
    ax.tick_params(labelsize=tick_size)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.2)

----------------------------

## Data visualization II

We want to build a model that will classify the gender of the speaker. First let's check that there are differences between male and female speakers that can be observed in the data.
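
Besides plotting, a quick numeric check is to compare per-class averages with `groupby`. The sketch below uses a made-up toy table (the values are not taken from VOiCES) to show the pattern.

```python
import pandas as pd

# Invented values, only to demonstrate the groupby pattern
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'mfcc1': [-310.0, -280.0, -305.0, -275.0, -315.0, -285.0],
    'variance': [0.20, 0.35, 0.22, 0.33, 0.18, 0.37],
})

# One row per gender, one column per feature
means = toy.groupby('Gender')[['mfcc1', 'variance']].mean()
print(means)
```

Features whose group means differ by much more than their within-group spread are good candidates for the classifier.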

### Split data by gender

In [None]:
# Create dataframes with Male & Female speakers
M_DF = voices[voices['Gender']=='M']
F_DF = voices[voices['Gender']=='F']

In [None]:
F_DF.Gender

## Exercise 3: 

### Create a dataframe with only samples with television background noise

In [None]:

# Answer:
tele = voices[voices['noise']=='tele']

### Important consideration: How many data points in each split?

In [None]:
# Print length of each dataframe
print(len(M_DF), " male speakers")
print(len(F_DF), " female speakers")

## Any concerns with the number of samples per class?













### Solution: Trim data frame to have even data for each label

In [None]:
M_DF = M_DF[:len(F_DF)]

In [None]:
print(len(M_DF), " male speakers")
print(len(F_DF), " female speakers")
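
Slicing keeps the first rows of the larger class, which may over-represent whatever happens to come first in the file. A possible alternative, sketched here on a toy table with made-up values, is to draw a random sample of equal size from each class (this assumes a pandas version that provides `groupby(...).sample`).

```python
import pandas as pd

# Imbalanced toy data: 7 male rows, 3 female rows
toy = pd.DataFrame({
    'Gender': ['M'] * 7 + ['F'] * 3,
    'value': range(10),
})
print(toy['Gender'].value_counts())

# Downsample every class to the size of the smallest one
n_min = toy['Gender'].value_counts().min()
balanced = toy.groupby('Gender').sample(n=n_min, random_state=0)
print(balanced['Gender'].value_counts())
```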

### Select features useful for classification

In [None]:
print(M_DF.columns)

In [None]:
features = M_DF.columns[9:]
print(features)

### Plot data split by gender

In [None]:
r = 4
c = 5
label_size = 22
tick_size = 18
title_size = 28

fig = plt.figure(figsize = (5*c,5*r))

for i,var in enumerate(features):
    ax = fig.add_subplot(r,c,i+1)
    x = range(len(M_DF))
    y = M_DF[var]
    ax.scatter(x,y,c='g')
    
    x = range(len(F_DF))
    y = F_DF[var]
    ax.scatter(x,y,c='r')
    
    ax.set_title(var + ' by gender',size=title_size)
    ax.set_xlabel('sample', size = label_size)
    ax.set_ylabel(var,size = label_size)
    ax.tick_params(labelsize=tick_size)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.2)

# Red is female, green is male

## Exercise 4: Modify the below block of code to more informatively display the grouped data

In [None]:
r = 4
c = 5
label_size = 22
tick_size = 18
title_size = 28

fig = plt.figure(figsize = (5*c,5*r))

for i,var in enumerate(features):
    ax = fig.add_subplot(r,c,i+1)
    x = range(len(M_DF))
    y = M_DF[var]
    ax.scatter(x,y,c='g')
    
    x = range(len(F_DF))
    y = F_DF[var]
    ax.scatter(x,y,c='r')
    
    ax.set_title(var + ' by gender',size=title_size)
    ax.set_xlabel('sample', size = label_size)
    ax.set_ylabel(var,size = label_size)
    ax.tick_params(labelsize=tick_size)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.2)

# Red is female, green is male

In [None]:
r = 4
c = 5
label_size = 22
tick_size = 18
title_size = 28

fig = plt.figure(figsize = (5*c,5*r))

for i,var in enumerate(features):
    ax = fig.add_subplot(r,c,i+1)
    x = range(len(M_DF))
    y = M_DF.sort_values(by=var)[var]
    ax.scatter(x,y,c='g')
    
    x = range(len(F_DF))
    y = F_DF.sort_values(by=var)[var]
    ax.scatter(x,y,c='r')
    
    ax.set_title(var + ' by gender',size=title_size)
    ax.set_xlabel('sample', size = label_size)
    ax.set_ylabel(var,size = label_size)
    ax.tick_params(labelsize=tick_size)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.2)

# Red is female, green is male

## Any guesses for which features would be most informative for classifying gender?

----------------------------

## Part 3: Gender classification with scikit-learn

----------------------------

In [None]:
# Import modules

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score, classification_report

### Combine data

In [None]:
# Interleave rows from the two tables (male, female, male, female, ...).
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.

# Give both tables a fresh 0..n-1 index and stack them
df = pd.concat([M_DF.reset_index(drop=True), F_DF.reset_index(drop=True)])

# A stable sort on the index places each male row directly before the female row with the same position
df = df.sort_index(kind='mergesort')

# Replace the duplicated index values with a clean running index
df = df.reset_index(drop=True)

In [None]:
# Split into data and label

X = df.drop('Gender',axis=1)
X = X[X.columns[8:]] # Data

y = df.Gender # Labels

In [None]:
X.columns

### Train test split

In [None]:
test_size = 0.3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,shuffle = False)
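
With `shuffle = False`, the split keeps the row order, so the interleaved table yields a test set that is still balanced between classes. The sketch below illustrates this on toy data and shows a shuffled alternative that uses `stratify` to preserve class proportions; the toy variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_toy = pd.DataFrame({'feature': np.arange(20, dtype=float)})
y_toy = pd.Series(['M', 'F'] * 10)  # interleaved labels, like the combined table

# shuffle=False: the first 70% of rows train, the last 30% test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=False)
print(list(X_te.index), y_te.value_counts().to_dict())

# Shuffled alternative that keeps the class proportions in both splits
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)
print(y_te_s.value_counts().to_dict())
```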

### Set up to track accuracy

In [None]:
df_acc = pd.DataFrame(columns = ['model','train_accuracy','test_accuracy'])
   
def get_accuracy(model,model_name,df_acc,X_train, y_train,X_test,y_test):
    mdl = model()
    mdl.fit(X_train, y_train)
    preds_test = mdl.predict(X_test)
    preds_train = mdl.predict(X_train)
    #add to next row:
    nr = len(df_acc)
    df_acc.loc[nr,'model'] = model_name
    df_acc.loc[nr,'test_accuracy'] = round(accuracy_score(y_test,preds_test)*100,1)
    df_acc.loc[nr,'train_accuracy'] = round(accuracy_score(y_train,preds_train)*100,1)
    return df_acc, mdl

### Model 1: Logistic regression

Logistic regression is one of the simplest supervised algorithms for binary classification.
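
As a rough, self-contained illustration of what the model produces, the sketch below fits logistic regression on synthetic data from `make_classification` (not the VOiCES features). Scaling the inputs first is an assumption of this sketch rather than something the workshop code does; it often helps the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class problem standing in for the audio features
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Standardize the features, then fit the linear classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)  # one probability per class for every test row
print('test accuracy:', round(model.score(X_te, y_te), 3))
```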

In [None]:
df_acc,logreg = get_accuracy(LogisticRegression,'LogisticRegression',df_acc,X_train, y_train,X_test,y_test)

In [None]:
df_acc

### Model 2: Random Forest Classifier

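A random forest is a supervised classification algorithm. It is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and combines their outputs to improve predictive accuracy and control overfitting. Conceptually, the output is the mode of the classes predicted by the individual trees; scikit-learn implements this by averaging the trees' class probabilities and picking the most probable class. Adding more trees makes the combined prediction more stable and does not by itself cause overfitting, although the forest can still overfit if the individual trees are very deep or the data are noisy. Lastly, a fitted RFC reports how much each feature contributed to its decisions, which helps identify the most important features in the training dataset.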
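
To make the ensemble idea concrete, the sketch below fits a small forest on synthetic data (not the VOiCES features) and reproduces the forest's class probabilities by averaging over its individual trees. It also prints the per-feature importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_syn, y_syn)

# Each fitted tree lives in estimators_; average their class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_syn) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

print(np.allclose(averaged, forest.predict_proba(X_syn)))
print(forest.feature_importances_.round(3))  # normalized to sum to 1
```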

In [None]:
df_acc, rdf = get_accuracy(RandomForestClassifier,'RandomForestClassifier',df_acc,X_train, y_train,X_test,y_test)

In [None]:
df_acc

### Improving accuracy with a hyperparameter search

In [None]:
# Convert categorical to one hot
y_encoded = pd.get_dummies(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=test_size,shuffle = False)

In [None]:
# Create param grid for Randomized Hyperparameter Search

param_grid = {'n_estimators': (5,10,50,100),
              'max_features': (5,8, 12, 16)
             }
clf = RandomForestClassifier()
grid = RandomizedSearchCV(clf, param_grid, cv=4)

In [None]:
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
y_hat = grid.predict(X_test)
accuracy = metrics.accuracy_score(y_test,y_hat)
print('Accuracy for gender id ', round(accuracy*100,2))
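
Note that `RandomizedSearchCV` only samples a subset of the parameter combinations (10 by default), so it may not visit every point of the grid. The sketch below, on synthetic data with a hypothetical smaller grid, makes the number of sampled combinations explicit through `n_iter` and fixes `random_state` for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_syn, y_syn = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)

# Hypothetical grid for illustration: 3 x 3 = 9 possible combinations
param_distributions = {
    'n_estimators': [5, 10, 50],
    'max_features': [2, 5, 8],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,         # try 5 of the 9 combinations
    cv=4,             # 4-fold cross-validation for each one
    random_state=0,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best setting
```

If the grid is small enough to try everything, `GridSearchCV` with the same grid evaluates all combinations instead.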


### Ideas for further improving accuracy?


### Classification report

In [None]:
clf_rep = classification_report(y_true = y_test, y_pred = y_hat)

#this report is just text, so needs to be converted to a more easily readable form
def to_table(clf_rep):
    report = clf_rep.splitlines()
    res = []
    res.append(['']+report[0].split())
    for row in report[2:-2]:
        res.append(row.split())
    lr = report[-1].split()
    res.append([' '.join(lr[:3])]+lr[3:])
    output = np.array(res)
    return output

cols = []
cols.append('Gender: 0F/1M')

output = to_table(clf_rep)
for item in output[0][1:]:
    cols.append(item)
    
out_df = pd.DataFrame(output[1:],columns = cols)

In [None]:
out_df
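
Parsing the text report depends on its exact layout, which can differ between scikit-learn versions. As a possible alternative, assuming a scikit-learn version that supports `output_dict=True`, the report can be requested as a dictionary and loaded into a DataFrame directly. The labels below are toy values for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels: 4 of the 6 predictions are correct
y_true_toy = ['F', 'M', 'F', 'M', 'F', 'M']
y_pred_toy = ['F', 'M', 'M', 'M', 'F', 'F']

report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Rows are classes and averages; columns are precision, recall, f1-score, support
report_df = pd.DataFrame(report).transpose()
print(report_df.round(2))
```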

In [None]:
# We know 0 is F and 1 is M because

y_test.sum()

### Can we interpret our model: How is the decision being made?

In [None]:
# Get feature importance
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train,y_train)
df_features = pd.DataFrame(columns = ['variable', 'contribution'])

df_features['contribution'] = pd.Series(clf.feature_importances_)
df_features['variable'] = pd.Series(X.columns)

# How much do each of the features contribute? 
df_features = df_features.sort_values(by='contribution',ascending = False)
p = df_features.plot.bar(x = 'variable',legend= False, color = 'c')

### Is it easier to identify gender for specific noise conditions?

In [None]:
# Accuracy for the clean data

# Create dataframe including noise
features = ['noise','Gender', 'Centroid', 'variance', 'skewness',
       'kurtosis', 'mfcc1', 'mfcc2', 'mfcc3', 'mfcc4', 'mfcc5', 'mfcc6',
       'mfcc7', 'mfcc8', 'mfcc9', 'mfcc10', 'mfcc11', 'mfcc12', 'roll_off_max',
       'roll_off_min']
X = df[features]
X = X.drop('Gender',axis=1)
X.columns

# Split
_, X_test_noise, _, _ = train_test_split(X, y, test_size=test_size, shuffle = False)

# Calculate accuracy
noise_type = 'none'
metrics.accuracy_score(y_test[X_test_noise['noise']==noise_type],y_hat[X_test_noise['noise']==noise_type])

## Exercise 5: Show accuracy by noise condition

In [None]:
# Answer:






## Exercise 6:  Can you improve on this model using sklearn or make any interesting inferences about the data using pandas?

Open ended or try one of the suggested exercises below

## 6a: Change hyperparameters of RFC from the below

In [None]:

'''
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
'''



## 6b: Analysis of incorrectly classified data points 
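
One way to start, sketched here on a made-up results table, is to put the true and predicted labels side by side, flag the mismatches, and look at where they concentrate. The column names `true`, `pred`, and `correct` are chosen for this example; in the workshop you would fill them from the test labels and model predictions.

```python
import pandas as pd

# Invented predictions, only to demonstrate the pattern
results = pd.DataFrame({
    'noise': ['none', 'tele', 'musi', 'babb', 'tele', 'babb'],
    'true':  ['F', 'M', 'F', 'M', 'F', 'M'],
    'pred':  ['F', 'M', 'M', 'M', 'M', 'F'],
})
results['correct'] = results['true'] == results['pred']

# Rows the model got wrong
errors = results[~results['correct']]
print(errors)

# Fraction of errors within each noise condition
error_rate = 1 - results.groupby('noise')['correct'].mean()
print(error_rate)
```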

----------------------------

# Addendum: Code for continuing to work with VOiCES data

----------------------------

## A. Speech preprocessing

To modify the preprocessing we worked with above to other data subsets or to change the preprocessing: 

git clone https://github.com/Lab41/VOiCES-subset

## B. Speaker id with torch

1.  Clone the cyphercat repo with 

git clone https://github.com/ninalopatina/cyphercat

2. pip install torch

3. Modify & insert the below code into a cell

sys.path.insert(0, '{path to cyphercat repo}')

import cyphercat as cc

import torch

4. Utilize the functions below

#### Let the Lab41 members know if you have any questions!

In [None]:
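import librosa as libr  # the transforms below refer to librosa as `libr`
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Assumed placeholder settings: the code below uses these names without defining them
number_of_mels = 128
batch_size = 32
n_epochs = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class tensorToMFCC:
    def __call__(self, y):
        # y is expected to have shape (1, n_samples); uncomment the next line if y is a torch tensor
#         y = y.numpy()
        dims = y.shape
        y = libr.feature.melspectrogram(y=np.reshape(y, (dims[1],)), sr=16000, n_mels=number_of_mels,
                               fmax=8000)
        y = libr.feature.mfcc(S = libr.power_to_db(y))
        y = torch.from_numpy(y)
        return y.float()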

class STFT:
    def __call__(self,y):
        dims = y.shape
        y = np.abs(libr.core.stft(np.reshape(y, (dims[1],))))
        y = torch.from_numpy(y).permute(1,0)
        return y.float()

transform_type = 'MFCC' # 'STFT' or 'MFCC'
if transform_type == 'STFT':
    target_net_type = cc.ft_cnn_classifer
    shadow_net_type = cc.ft_cnn_classifer
    in_size = 94 # 20 for MFCC, 94 for STFT
    transform  = STFT()
elif transform_type == 'MFCC':
    transform  = tensorToMFCC()
    target_net_type = cc.MFCC_cnn_classifier
    shadow_net_type = cc.MFCC_cnn_classifier
    in_size = 20
    
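# To load data:
subset = 'room-1'
# NOTE: `dfs` is used below but never assigned in this snippet; it should hold the
# train and test DataFrames (dfs[0] and dfs[1]) derived from the split loaded here.
[speaker_df, sample_df] = cc.Voices_preload_and_split(subset = subset)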

valid_sequence_train_target = cc.Voices_dataset(df=dfs[0], transform = transform)
valid_sequence_test_target = cc.Voices_dataset(df=dfs[1], transform = transform)

target_train_loader = DataLoader(valid_sequence_train_target,
                      batch_size=batch_size,
                      shuffle=True,
                      num_workers=8,
                    drop_last = True
                     # pin_memory=True # CUDA only
                     )


target_test_loader = DataLoader(valid_sequence_test_target,
                      batch_size=batch_size,
                      shuffle=True,
                      num_workers=8
                     # pin_memory=True # CUDA only
                     )
# Set up the model:

#in_size defined above
n_hidden = 512
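n_classes = valid_sequence_test_target.num_speakers
print(n_classes,' speakers')
# df.at[df_idx,'# speakers']=n_classes  # leftover results logging; df_idx is not defined in this snippet


target_net = target_net_type(n_classes).to(device)
# NOTE: `models` is not imported in this snippet; point it at the module that provides weights_init
target_net.apply(models.weights_init)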

target_loss = nn.CrossEntropyLoss()
target_optim = optim.Adam(target_net.parameters(), lr=.001)

# Train the model
train_accuracy, test_accuracy = cc.train(target_net, target_train_loader, target_test_loader, target_optim, target_loss, n_epochs, verbose = False) 