## Multilabel binary classification of speaker traits
### Laura Fernández Gallardo

After evaluating the binary classification of speakers' warmth-attractiveness (WAAT), I examine multilabel classification in this notebook, that is, predicting several traits attributed to speakers that are not mutually exclusive.

* For each perceptual interpersonal speaker dimension obtained from [factor analysis](https://github.com/laufergall/Subjective_Speaker_Characteristics/tree/master/speaker_characteristics/factor_analysis), I threshold the scores at percentiles to define 3 classes ("high", "mid", and "low") with approximately the same number of samples. These dimensions are: *warmth*, *attractiveness*, *confidence*, *compliance*, and *maturity*.
* Only the "high" and "low" classes are of interest, so I address **multilabel binary classification**.
* As the evaluation metric, I will consider the average per-class accuracy (the average of sensitivity and specificity).
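
The percentile thresholding described above can be sketched as follows. This is only an illustration on synthetic scores: the exact percentile cut points used to create the class files are not stated here, so tertiles are assumed from the "approximately the same number of samples" requirement, and `scores_demo` and `classes_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic factor scores for two of the dimensions (illustration only)
rng = np.random.RandomState(0)
scores_demo = pd.DataFrame({
    "warmth": rng.normal(size=300),
    "attractiveness": rng.normal(size=300),
})

# Assumption: tertile cut points, giving three classes of (almost) equal size
classes_demo = pd.DataFrame({
    name: pd.qcut(scores_demo[name], q=3, labels=["low", "mid", "high"])
    for name in scores_demo
})

# Number of samples per class for one dimension
warmth_counts = classes_demo["warmth"].value_counts()
```

With continuous scores, `pd.qcut` places each cut point between two neighbouring samples, so the three classes end up with the same number of samples.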

In [1]:
import io
import requests

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Load features and labels

In [7]:
path = 'https://raw.githubusercontent.com/laufergall/ML_Speaker_Characteristics/master/data/generated_data/'

url = path + "feats_ratings_scores_train.csv"
s = requests.get(url).content
feats_ratings_scores_train = pd.read_csv(io.StringIO(s.decode('utf-8')))

url = path + "feats_ratings_scores_test.csv"
s = requests.get(url).content
feats_ratings_scores_test = pd.read_csv(io.StringIO(s.decode('utf-8')))

with open(r'..\data\generated_data\feats_names.txt') as f:
    feats_names = f.readlines()
feats_names = [x.strip().strip('\'') for x in feats_names] 

with open(r'..\data\generated_data\items_names.txt') as f:
    items_names = f.readlines()
items_names = [x.strip().strip('\'') for x in items_names] 

with open(r'..\data\generated_data\traits_names.txt') as f:
    traits_names = f.readlines()
traits_names = [x.strip().strip('\'') for x in traits_names] 

# read speaker trait classes
path2 = 'https://raw.githubusercontent.com/laufergall/ML_Speaker_Characteristics/master/data/generated_data/'

url = path2 + "classes_train.csv"
s = requests.get(url).content
classes_train = pd.read_csv(io.StringIO(s.decode('utf-8')))

url = path2 + "classes_test.csv"
s = requests.get(url).content
classes_test = pd.read_csv(io.StringIO(s.decode('utf-8')))

In [10]:
# appending classes to features

dropcolumns = ['name','speaker_gender'] + items_names + traits_names # 'spkID' is kept for the merge
feats_train = feats_ratings_scores_train.drop(dropcolumns, axis=1) # shape (2700, 88)
feats_test = feats_ratings_scores_test.drop(dropcolumns, axis=1) # shape (891, 88)

feats_class_train = pd.merge(feats_train, classes_train.drop(['sample_heard','gender',], axis=1)) # shape (2700, 94)
feats_class_test = pd.merge(feats_test, classes_test.drop(['sample_heard','gender',], axis=1)) # shape (891, 94)

# classes as categorical
for col in traits_names:
    feats_class_train[col]=feats_class_train[col].astype('category')
    feats_class_test[col]=feats_class_test[col].astype('category')

In [15]:
# Standardize speech features  

dropcolumns2 = ['spkID'] + traits_names

# learn transformation on training data
scaler = StandardScaler()
scaler.fit(feats_class_train.drop(dropcolumns2, axis=1))

 
# numpy n_instances x n_feats
feats_s_train = scaler.transform(feats_class_train.drop(dropcolumns2, axis=1))
feats_s_test = scaler.transform(feats_class_test.drop(dropcolumns2, axis=1)) 

## Model tuning with feature selection

As done for binary classification, I perform nested hyperparameter tuning with feature selection.
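
A minimal sketch of what nested tuning with feature selection could look like in the multilabel setting, on synthetic data. It is not the procedure used in the earlier notebooks: wrapping a per-trait pipeline in `MultiOutputClassifier`, the `SelectKBest` selector, the parameter grid, and the fold counts are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data: 120 instances, 10 features, 2 binary traits
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 10))
Y_demo = np.column_stack([
    (X_demo[:, 0] > 0).astype(int),
    (X_demo[:, 1] > 0).astype(int),
])

# One pipeline (feature selection + classifier) is cloned and fit per trait
per_trait = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", KNeighborsClassifier()),
])
multi = MultiOutputClassifier(per_trait)

# Hypothetical grid; the same values are shared by all traits
param_grid = {
    "estimator__select__k": [2, 5],
    "estimator__clf__n_neighbors": [3, 7],
}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tuning folds
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # evaluation folds

# Inner loop: grid search; outer loop: score the tuned model on held-out folds
search = GridSearchCV(multi, param_grid, cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, Y_demo, cv=outer_cv)
```

Because no `scoring` is passed, each fold is scored with the default `MultiOutputClassifier.score`, which counts an instance as correct only if all traits are predicted correctly. A per-trait metric would need a custom scorer.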

### quick working example

#### model training

In [9]:
X = feats_s_train
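# assumption: dim_names (not defined in this notebook) refers to traits_names
# .values replaces as_matrix(), which was removed in pandas 1.0
Y = feats_class_train[traits_names].apply(lambda x: x.cat.codes).values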

In [11]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X, Y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

#### testing performance

In [13]:
Xt = feats_s_test
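# assumption: dim_names (not defined in this notebook) refers to traits_names
# .values replaces as_matrix(), which was removed in pandas 1.0
Yt = feats_class_test[traits_names].apply(lambda x: x.cat.codes).values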

In [15]:
Y_pred = model.predict(Xt)

In [29]:
Yt

array([[1, 2, 0, 1, 1],
       [1, 2, 0, 1, 1],
       [1, 2, 0, 1, 1],
       ..., 
       [2, 1, 2, 0, 2],
       [2, 1, 2, 0, 2],
       [2, 1, 2, 0, 2]], dtype=int8)
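
The integers in `Yt` come from `.cat.codes`. A small sketch of how pandas assigns these codes, assuming the class labels are stored as the strings "high", "mid", and "low" named in the introduction (the actual values in the class files are not shown in this notebook, and `labels_demo` is a hypothetical name):

```python
import pandas as pd

# Hypothetical label column; the real label strings are an assumption
labels_demo = pd.Series(["low", "high", "mid", "high"]).astype("category")

# Categories inferred from strings are sorted lexically
categories = list(labels_demo.cat.categories)

# Each code is the position of the label in the sorted category list
codes = labels_demo.cat.codes.tolist()

# Lookup table from integer code back to class label
code_to_label = dict(enumerate(categories))
```

Under this assumption, code 0 is "high", code 1 is "low", and code 2 is "mid", so rows with code 2 would have to be set aside before scoring high versus low.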

In [17]:
Y_pred

array([[2, 0, 0, 2, 2],
       [2, 0, 0, 2, 2],
       [0, 0, 0, 2, 2],
       ..., 
       [1, 2, 1, 2, 2],
       [0, 0, 0, 0, 0],
       [1, 1, 2, 0, 1]], dtype=int8)

In [23]:
def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    http://stackoverflow.com/q/32239577/395857

    Each row is read as a set of active labels (the indices of its nonzero
    entries), so the inputs are expected to be binary indicator matrices.
    `normalize` and `sample_weight` are accepted but not used.
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # both label sets are empty: count as a perfect match
            tmp_a = 1
        else:
            # overlap between true and predicted label sets (intersection over union)
            tmp_a = len(set_true.intersection(set_pred)) / float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

In [26]:

score = hamming_score(Yt, Y_pred)
score

0.48466142910587351
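
The introduction names the average per-class accuracy (the average of sensitivity and specificity) as the evaluation metric. A minimal sketch of computing it per trait on toy arrays follows. The code assignment (0 = "high", 1 = "low", 2 = "mid") is an assumption, and `average_per_class_accuracy` is a hypothetical helper, not a function defined elsewhere in this project.

```python
import numpy as np

def average_per_class_accuracy(y_true, y_pred, positive, negative):
    """Mean of sensitivity and specificity for one trait.

    Samples whose true class is neither `positive` nor `negative`
    (for example "mid") do not enter the computation.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # fraction of true positives that were predicted as positive
    sensitivity = np.mean(y_pred[y_true == positive] == positive)
    # fraction of true negatives that were predicted as negative
    specificity = np.mean(y_pred[y_true == negative] == negative)
    return (sensitivity + specificity) / 2.0

# Toy labels for 5 samples and 2 traits; assumed codes: 0=high, 1=low, 2=mid
Y_true_demo = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [2, 2]])
Y_pred_demo = np.array([[0, 1], [1, 0], [1, 0], [1, 2], [0, 1]])

# One score per trait (column)
trait_scores = [
    average_per_class_accuracy(Y_true_demo[:, j], Y_pred_demo[:, j],
                               positive=0, negative=1)
    for j in range(Y_true_demo.shape[1])
]
```

In this sketch a "mid" prediction for a true "high" or "low" sample counts as an error, which is one possible design choice. Restricting the classifier to the two classes of interest would be another.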

TODO:

* define performance metric
* model tuning
* try different classifiers
    * classifiers that support multiclass-multioutput (see http://scikit-learn.org/stable/modules/multiclass.html):
        * sklearn.tree.DecisionTreeClassifier
        * sklearn.tree.ExtraTreeClassifier
        * sklearn.ensemble.ExtraTreesClassifier
        * sklearn.neighbors.KNeighborsClassifier
        * sklearn.neighbors.RadiusNeighborsClassifier
        * sklearn.ensemble.RandomForestClassifier
    * try MLP multiclass
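
One way to start on the classifier comparison listed above is a simple loop over estimators that accept a two-dimensional label matrix directly. The sketch below uses random synthetic data and default hyperparameters, so the accuracies it produces carry no meaning. The train/test split and the names `candidates` and `per_trait_acc` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 instances, 8 features, 5 traits with 3 classes each
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 8))
Y_demo = rng.randint(0, 3, size=(200, 5))

# Estimators with native multiclass-multioutput support
candidates = {
    "tree": DecisionTreeClassifier(random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=20, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "knn": KNeighborsClassifier(),
}

per_trait_acc = {}
for name, clf in candidates.items():
    # fit on the first 150 instances, evaluate on the remaining 50
    clf.fit(X_demo[:150], Y_demo[:150])
    Y_hat = clf.predict(X_demo[150:])
    # plain accuracy per trait (one value per label column)
    per_trait_acc[name] = (Y_hat == Y_demo[150:]).mean(axis=0)
```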