## Evaluating classification techniques for speaker characterization
### Laura Fernández Gallardo

After evaluating the binary classification of speakers' warmth-attractiveness (WAAT), I examine multilabel classification in this notebook, that is, predicting several traits attributed to speakers that are not mutually exclusive.

* For each perceptual interpersonal speaker dimension generated from [factor analysis](https://github.com/laufergall/Subjective_Speaker_Characteristics/tree/master/speaker_characteristics/factor_analysis), I threshold the scores based on percentiles to define 3 classes ("high", "mid", and "low") with approximately the same number of samples. These dimensions are: *warmth*, *attractiveness*, *confidence*, *compliance*, and *maturity*.
* Only the "high" and "low" classes are of interest -> I address **multilabel binary classification**.
* As the evaluation metric, I will consider the average per-class accuracy (the average of sensitivity and specificity).
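
The metric named in the last bullet can be sketched in a few lines. This is a minimal illustration, not the evaluation code of this notebook: the function name `average_per_class_accuracy` and the toy labels are assumptions made up for the example.

```python
def average_per_class_accuracy(y_true, y_pred, positive="high", negative="low"):
    """Mean of sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    # predictions for the instances whose true class is "high" / "low"
    pos_preds = [p for t, p in pairs if t == positive]
    neg_preds = [p for t, p in pairs if t == negative]
    sensitivity = sum(p == positive for p in pos_preds) / len(pos_preds)
    specificity = sum(p == negative for p in neg_preds) / len(neg_preds)
    return (sensitivity + specificity) / 2

# toy labels, invented for illustration only
toy_true = ["high", "high", "high", "high", "low", "low"]
toy_pred = ["high", "high", "high", "low", "low", "high"]
toy_score = average_per_class_accuracy(toy_true, toy_pred)
print(toy_score)
```

Here sensitivity is 3/4 and specificity is 1/2, so the score is 0.625, while plain accuracy would be 4/6. Unlike plain accuracy, this average is not inflated when one class has more samples than the other. For two classes, `sklearn.metrics.balanced_accuracy_score` should give the same quantity.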

In [1]:
import io
import requests

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Dimensions of interpersonal speaker characteristics

In [2]:
# load scores (averaged across listeners)

path = "https://raw.githubusercontent.com/laufergall/Subjective_Speaker_Characteristics/master/data/generated_data/"

url = path + "factorscores_malespk.csv"
s = requests.get(url).content
scores_m = pd.read_csv(io.StringIO(s.decode('utf-8')))

url = path + "factorscores_femalespk.csv"
s = requests.get(url).content
scores_f = pd.read_csv(io.StringIO(s.decode('utf-8')))

# rename dimensions
scores_m.columns = ['sample_heard', 'warmth', 'attractiveness', 'confidence', 'compliance', 'maturity']
scores_f.columns = ['sample_heard', 'warmth', 'attractiveness', 'compliance', 'confidence', 'maturity']
dim_names = ['warmth', 'attractiveness', 'confidence', 'compliance', 'maturity']

Thresholding each trait into "low", "mid", "high", separately for each speaker gender.

In [3]:
# for each trait, assign instances into 3 classes

classes_m = pd.DataFrame(data = scores_m['sample_heard'])
classes_f = pd.DataFrame(data = scores_f['sample_heard'])

# male speakers
for i in dim_names:
    th = np.percentile(scores_m[i],[33,66])
    classes_m.loc[scores_m[i]<th[0],i] = 'low'
    classes_m.loc[scores_m[i]>=th[0],i] = 'mid'
    classes_m.loc[scores_m[i]>th[1],i] = 'high'

# female speakers
for i in dim_names:
    th = np.percentile(scores_f[i],[33,66])
    classes_f.loc[scores_f[i]<th[0],i] = 'low'
    classes_f.loc[scores_f[i]>=th[0],i] = 'mid'
    classes_f.loc[scores_f[i]>th[1],i] = 'high'
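
The two loops above apply the same rule to each gender. As a minimal sketch, the rule can be wrapped in a helper and checked on synthetic scores. The name `to_three_classes` and the random scores are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def to_three_classes(values, percentiles=(33, 66)):
    """Label each value 'low', 'mid' or 'high' using percentile thresholds."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, percentiles)
    labels = np.full(values.shape, "mid", dtype=object)
    labels[values < lo] = "low"    # below the 33rd percentile
    labels[values > hi] = "high"   # above the 66th percentile
    return labels

# synthetic scores, not the factor scores used in this notebook
rng = np.random.RandomState(0)
toy_scores = rng.normal(size=300)
toy_labels = to_three_classes(toy_scores)
counts = {c: int(np.sum(toy_labels == c)) for c in ("low", "mid", "high")}
print(counts)
```

With continuous scores, the three counts come out close to one third of the samples each, which is the balance the thresholding is meant to achieve.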

In [4]:
# join male and female classes

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row index
classes = pd.concat([classes_m, classes_f])
classes['gender'] = classes['sample_heard'].str.slice(0,1)
classes['spkID'] = classes['sample_heard'].str.slice(1,4).astype('int')
classes.head()

| | sample_heard | warmth | attractiveness | confidence | compliance | maturity | gender | spkID |
|---|---|---|---|---|---|---|---|---|
| 0 | m004_linden_stimulus.wav | mid | mid | high | low | mid | m | 4 |
| 1 | m005_nicosia_stimulus.wav | mid | mid | high | low | high | m | 5 |
| 2 | m006_rabat_stimulus.wav | high | mid | mid | mid | low | m | 6 |
| 3 | m007_klaksvik_stimulus.wav | mid | high | high | mid | high | m | 7 |
| 4 | m016_beirut_stimulus.wav | high | high | low | low | low | m | 16 |


Stratified partition of speakers into train and test sets, according to gender.

In [8]:
# get stratified random partition for train and test

indexes = np.arange(0,len(classes))
train_i, test_i, train_y, test_y = train_test_split(indexes, 
                                                    indexes, # dummy classes
                                                    test_size=0.25, 
                                                    stratify = classes['gender'], 
                                                    random_state=2302)

classes_train = classes.iloc[train_i,:] # 225 instances
classes_test = classes.iloc[test_i,:] # 75 instances

# save these data for other evaluations
classes_train.to_csv(r'..\data\generated_data\classes_train.csv', index=False)
classes_test.to_csv(r'..\data\generated_data\classes_test.csv', index=False)
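
To illustrate what stratifying by gender does, here is a minimal sketch on made-up data. The counts of 120 and 180 speakers and the label `"f"` are assumptions for the example only. It also shows that `train_test_split` accepts a single array, so the dummy second argument used above is optional.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy gender labels with made-up counts (hypothetical, for illustration)
toy_gender = np.array(["m"] * 120 + ["f"] * 180)
toy_index = np.arange(len(toy_gender))

# split the row indexes only; stratify keeps the gender proportions
toy_train_i, toy_test_i = train_test_split(
    toy_index, test_size=0.25, stratify=toy_gender, random_state=2302
)

train_share_m = np.mean(toy_gender[toy_train_i] == "m")
test_share_m = np.mean(toy_gender[toy_test_i] == "m")
print(len(toy_train_i), len(toy_test_i), train_share_m, test_share_m)
```

The share of male speakers should be about the same in both partitions, which is the property the stratified split is used for here.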

## Speech features

Same feature set as in Part II.

In [6]:
path = "https://raw.githubusercontent.com/laufergall/ML_Speaker_Characteristics/master/data/extracted_features/"

url = path + "eGeMAPSv01a_semispontaneous_splitted.csv"  # path already ends with "/"
s = requests.get(url).content
feats = pd.read_csv(io.StringIO(s.decode('utf-8')), sep = ';') # shape: 3591, 89

In [7]:
# extract speaker ID from speech file name
feats['spkID'] = feats['name'].str.slice(2, 5).astype('int')

# appending multilabels to features
feats_class_train = pd.merge(feats, classes_train) # shape (2700, 97)
feats_class_test = pd.merge(feats, classes_test) # shape (891, 97)

# classes as categorical
for col in dim_names:
    feats_class_train[col]=feats_class_train[col].astype('category')
    feats_class_test[col]=feats_class_test[col].astype('category')
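
The merge step relies on the default behaviour of `pd.merge`: an inner join on the columns that both frames share. A minimal sketch with toy frames follows. The feature name `F0_mean` and all values are hypothetical.

```python
import pandas as pd

# several utterances per speaker (toy values; "F0_mean" is a made-up feature name)
toy_feats = pd.DataFrame({
    "spkID": [4, 4, 5, 5, 5, 6],
    "F0_mean": [110.0, 112.5, 98.3, 101.1, 99.7, 120.4],
})

# one row of labels per speaker in the training partition
toy_classes_train = pd.DataFrame({
    "spkID": [4, 5],
    "warmth": ["mid", "high"],
})

# inner join on the shared column "spkID"
toy_merged = pd.merge(toy_feats, toy_classes_train)
print(toy_merged)
```

Each utterance inherits the labels of its speaker, and utterances of speakers outside the partition (speaker 6 here) are dropped. Because the split was made at speaker level, no speaker contributes utterances to both train and test.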

In [8]:
# Standardize speech features  

dropcolumns = ['name','spkID','sample_heard', 'warmth', 'attractiveness', 'confidence', 'compliance', 'maturity', 'gender']

# learn transformation on training data
scaler = StandardScaler()
scaler.fit(feats_class_train.drop(dropcolumns, axis=1))

 
# numpy n_instances x n_feats
feats_s_train = scaler.transform(feats_class_train.drop(dropcolumns, axis=1))
feats_s_test = scaler.transform(feats_class_test.drop(dropcolumns, axis=1)) 
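
The effect of fitting the scaler on the training data only can be checked on synthetic features. This is a sketch under assumed data (random normal values), not the eGeMAPS features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic "features": 3 columns with mean 5 and standard deviation 2
rng = np.random.RandomState(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# learn mean and standard deviation on the training data only
toy_scaler = StandardScaler().fit(toy_train)
toy_train_s = toy_scaler.transform(toy_train)
toy_test_s = toy_scaler.transform(toy_test)

print(toy_train_s.mean(axis=0).round(3), toy_train_s.std(axis=0).round(3))
print(toy_test_s.mean(axis=0).round(3), toy_test_s.std(axis=0).round(3))
```

The training columns end up with zero mean and unit standard deviation. The test columns are only approximately standardized, which is expected: no statistics of the test set are used in the transformation.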

### Quick working example

#### Model training

In [9]:
X = feats_s_train
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
Y = feats_class_train[dim_names].apply(lambda x: x.cat.codes).to_numpy()

In [11]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X, Y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

#### Testing performance

In [13]:
Xt = feats_s_test
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
Yt = feats_class_test[dim_names].apply(lambda x: x.cat.codes).to_numpy()

In [15]:
Y_pred = model.predict(Xt)

In [29]:
Yt

array([[1, 2, 0, 1, 1],
       [1, 2, 0, 1, 1],
       [1, 2, 0, 1, 1],
       ..., 
       [2, 1, 2, 0, 2],
       [2, 1, 2, 0, 2],
       [2, 1, 2, 0, 2]], dtype=int8)

In [17]:
Y_pred

array([[2, 0, 0, 2, 2],
       [2, 0, 0, 2, 2],
       [0, 0, 0, 2, 2],
       ..., 
       [1, 2, 1, 2, 2],
       [0, 0, 0, 0, 0],
       [1, 1, 2, 0, 1]], dtype=int8)

In [23]:
def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    http://stackoverflow.com/q/32239577/395857

    Note: np.where() keeps the indices of nonzero entries, so this score
    assumes binary indicator rows. With the class codes used here, code 0
    counts as "label absent" and codes 1 and 2 are not distinguished.
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set(np.where(y_true[i])[0])  # indexes of nonzero true labels
        set_pred = set(np.where(y_pred[i])[0])  # indexes of nonzero predicted labels
        if len(set_true) == 0 and len(set_pred) == 0:
            tmp_a = 1  # both sets empty: count as perfect agreement
        else:
            # size of the intersection over size of the union
            tmp_a = len(set_true.intersection(set_pred)) / \
                    float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

In [26]:

score = hamming_score(Yt, Y_pred)
score

0.48466142910587351
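
Since `Yt` and `Y_pred` hold class codes (0, 1, 2) rather than 0/1 indicators, a simpler starting point for a metric may be an elementwise comparison per trait. The sketch below uses invented toy arrays, not the predictions above. It is one possible option, not the metric finally adopted.

```python
import numpy as np

# toy class codes (rows: instances, columns: traits); invented values
toy_true = np.array([[1, 2, 0],
                     [0, 1, 2],
                     [2, 2, 1],
                     [1, 0, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 1],
                     [2, 2, 1],
                     [0, 0, 2]])

hits = toy_true == toy_pred
per_trait_acc = hits.mean(axis=0)      # accuracy of each trait separately
exact_match = hits.all(axis=1).mean()  # share of rows with all traits correct
print(per_trait_acc, exact_match)
```

For these toy arrays the per-trait accuracies are 0.75, 0.75 and 0.5, and only one of the four rows is predicted correctly on all traits (0.25).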

TODO:

* define performance metric
* model tuning
* try different classifiers
    * Support multiclass-multioutput (see http://scikit-learn.org/stable/modules/multiclass.html):
        * sklearn.tree.DecisionTreeClassifier
        * sklearn.tree.ExtraTreeClassifier
        * sklearn.ensemble.ExtraTreesClassifier
        * sklearn.neighbors.KNeighborsClassifier
        * sklearn.neighbors.RadiusNeighborsClassifier
        * sklearn.ensemble.RandomForestClassifier
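
As a starting point for trying different classifiers, the following sketch fits several of the listed estimators on a 2-D label matrix and checks that each one predicts all traits at once. The data are random and the hyperparameters are arbitrary assumptions, so the predictions themselves carry no meaning.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# random toy data: 10 features, 5 traits with class codes 0..2
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(120, 10))
Y_toy = rng.randint(0, 3, size=(120, 5))
X_toy_test = rng.normal(size=(30, 10))

candidates = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=20, random_state=0),
}

pred_shapes = {}
for name, clf in candidates.items():
    clf.fit(X_toy, Y_toy)  # Y_toy has one column per trait
    pred_shapes[name] = clf.predict(X_toy_test).shape
print(pred_shapes)
```

Each estimator returns one predicted code per trait and instance, so the same loop could later be run with `X`, `Y` and `Xt` from this notebook and a proper performance metric.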