## Evaluating regression techniques for speaker characterization - Part II
### Laura Fernández Gallardo

In this notebook, I evaluate the performance of different regression techniques for characterizing speakers, using the data explored in Part I.

In [1]:
import io
import requests

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# fix random seed for reproducibility
seed = 2302
np.random.seed(seed)

In [None]:
# load speech features from male and from female speakers

path = "https://raw.githubusercontent.com/laufergall/ML_Speaker_Characteristics/master/data/"

# speech features of male speakers
url = path + "eGeMAPSv01a_88_malespk.csv"
s = requests.get(url).content
feats_m = pd.read_csv(io.StringIO(s.decode('utf-8')))

# speech features of female speakers
url = path + "eGeMAPSv01a_88_femalespk.csv"
s = requests.get(url).content
feats_f = pd.read_csv(io.StringIO(s.decode('utf-8')))


Pre-processing features:

* join males and females and add 1-hot encoded gender feature
* center and scale speech features of all data

Database partitions: consider train/dev partitions and cross-validation (no test data, since we have too few instances)

Within each fold of the cross-validation:

* Feature selection looking at importance measures - with cross-val (?)
* Feature selection removing multicollinearity:

    1. NO -> Performing PLS (partial least squares) - and the linear regression! :S
    2. Performing PCA (principal component analysis)
    3. Dropping features with high VIF (variance inflation factor) - to be calculated after train/dev/test 
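
Option 3 above can be sketched as follows. This is only a minimal illustration under my own assumptions: the helper names (`variance_inflation_factors`, `drop_high_vif`), the threshold of 10 (a common rule of thumb, not a value chosen in this notebook) and the synthetic `X_demo` matrix are all hypothetical. It relies on the fact that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. In practice it should be run on the training partition of each fold only.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X (rows: instances, columns: features)."""
    # correlation matrix of the predictors
    corr = np.corrcoef(X, rowvar=False)
    # diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the column with the largest VIF until all VIFs are <= threshold.

    Returns the indices of the columns that are kept.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vif = variance_inflation_factors(X[:, keep])
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        del keep[worst]
    return keep

# synthetic data (assumption, for illustration only): four independent features
# plus a fifth that is almost a linear combination of the first two
rng = np.random.default_rng(2302)
X_demo = rng.normal(size=(300, 4))
collinear = X_demo[:, 0] + X_demo[:, 1] + 0.01 * rng.normal(size=300)
X_demo = np.column_stack([X_demo, collinear])

kept = drop_high_vif(X_demo)
print("kept columns:", kept)
print("VIFs after dropping:", variance_inflation_factors(X_demo[:, kept]))
```

Dropping one feature at a time and recomputing the VIFs avoids removing several features that are only collinear with each other.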

In [None]:
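# all features: males and females, with a binary gender feature
feats_m['is_male'] = 1
feats_f['is_male'] = 0
# ignore_index avoids duplicated row labels after joining the two tables
feats = pd.concat([feats_m, feats_f], axis = 0, ignore_index = True)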

feats.describe()

In [None]:
feats.head()

In [None]:
# extract speaker ID from sample_heard

feats['speaker_ID'] = feats['sample_heard'].str.slice(1, 4)

In [None]:
feats.head()

In [None]:
# Standardize the speech features (zero mean, unit variance);
# the first column and the last two columns ('is_male', 'speaker_ID') are excluded

scaler = StandardScaler()
feats_s = scaler.fit_transform(feats.iloc[:,1:-2]) # numpy 300x88

Apply PCA to reduce the number of predictors.

In [None]:
pca = PCA()
pca.fit(feats_s)

In [None]:
feats_pca = pca.transform(feats_s)

In [None]:
feats_pca.shape # np array 300 x n_pcacomponents

In [None]:
pca.components_.shape

In [None]:
pca.explained_variance_ratio_
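
One way to read these ratios is to accumulate them and find how many components are needed to reach a given share of the total variance. The sketch below is only an illustration: the helper `n_components_for_variance`, the 95% target and the `ratios_demo` values are my assumptions, not results from this data. With the fitted model, `pca.explained_variance_ratio_` would be passed instead.

```python
import numpy as np

def n_components_for_variance(ratios, target=0.95):
    """Smallest number of leading components whose cumulative
    explained variance ratio reaches `target`."""
    cumulative = np.cumsum(ratios)
    # index of the first cumulative value >= target
    idx = int(np.searchsorted(cumulative, target))
    # clamp in case rounding keeps the total slightly below the target
    return min(idx + 1, len(cumulative))

# made-up ratios, for illustration only
ratios_demo = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
n_demo = n_components_for_variance(ratios_demo, target=0.95)
print(n_demo)
```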

Assign instances into 3 classes (1: low, 2: mid, 3: high) for each trait.

(To plot PCA components color-coded by speaker class)

In [None]:
# load ratings (averaged across listeners)
ratings_means = pd.read_csv("SC_ratings_means.csv")

ratings_class = pd.DataFrame(index = ratings_means.index, columns=ratings_means.columns)

# for each trait, assign instances into 3 classes
for i in ratings_means.columns[2:]:
    # percentiles to threshold
    th = np.percentile(ratings_means[i],[33,66])
    ratings_class.loc[ratings_means[i]<th[0],i] = 1 # low class
    ratings_class.loc[ratings_means[i]>=th[0],i] = 2 # mid class
    ratings_class.loc[ratings_means[i]>th[1],i] = 3 # high class
    
ratings_class.iloc[:,0:2] = ratings_means.iloc[:,0:2]     

In [None]:
# plot the first two PCA components, color-coded by class of the trait 'intelligent'

feats_pca_pd = pd.DataFrame(feats_pca, columns = np.char.mod('%d', np.arange(feats_pca.shape[1])))

plt.scatter(x="0", y="1", data = feats_pca_pd, c=ratings_class['intelligent'].astype(int))

 

Let us start by looking at the traits for which listeners had slightly higher agreement: 

* _intelligent_
* _ugly_
* _old_
* _modest_
* _incompetent_

In [None]:
# cross-validation to determine optimal number of components for PCA
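
A possible way to do this is a grid search over `n_components` inside a scikit-learn pipeline, so that scaling and PCA are refit on the training part of each fold. The sketch below is only an assumption about how this could look: the synthetic `X_demo` and `y_demo`, the linear regression model, the candidate grid and the 5-fold split are all illustrative choices, not decisions made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the features and one trait rating (assumption)
rng = np.random.default_rng(2302)
X_demo = rng.normal(size=(300, 20))
y_demo = X_demo[:, :3].sum(axis=1) + 0.5 * rng.normal(size=300)

# scaling and PCA are part of the pipeline, so they are fit within each fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# candidate numbers of components (hypothetical grid)
param_grid = {"pca__n_components": [2, 5, 10, 15, 20]}

cv = KFold(n_splits=5, shuffle=True, random_state=2302)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_n = search.best_params_["pca__n_components"]
print("best number of components:", best_n)
```

Keeping the scaler and the PCA inside the pipeline prevents information from the held-out fold from leaking into the preprocessing.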

In [None]:
# focus on the traits with the lowest standard deviation, averaged over all speakers