# OpenSMILE Analysis
This notebook loads OpenSMILE CSV data, cleans it, and plots it.

## Import relevant libraries

In [1]:
import numpy as np
import pandas as pd
from os import listdir
import matplotlib.pyplot as plt
import itertools as it
from statsmodels.stats.multitest import multipletests
import statsmodels.api as sm
#import nltk
import scipy.stats as st
import statsmodels.formula.api as smf
import seaborn as sns
import Helper as hp

## Load .csv data with results of OpenSMILE Analysis
First we load the .csv data and clean it (removing NaNs). Then we store the information from all files in separate pandas DataFrames containing information about emotion, affect, level of interest and valence/arousal for all participants.

In [2]:
data = pd.read_csv("UIST_2019_short_samples_OpenSMILE.csv")

#Set Labels 
emotion_label = ['Anger', 'Boredom', 'Disgust', 'Fear', 'Happiness', 'Emo_Neutral', 'Sadness']
affect_label = ['Aggressiv', 'Cheerful', 'Intoxicated', 'Nervous', 'Aff_Neutral', 'Tired']
loi_label = ['Disinterest', 'Normal', 'High Interest']

# Select the relevant columns and store them in new data frames.
# We take explicit copies with .copy(deep=True) because a Char_ID column is added to each frame below;
# assigning to a plain slice of `data` could otherwise trigger pandas' SettingWithCopyWarning.
df_emotion = data[['Anger', 'Boredom', 'Disgust', 'Fear', 'Happiness', 'Emo_Neutral', 'Sadness', 'Filename']].copy(deep=True)
df_affect = data[['Aggressiv', 'Cheerful', 'Intoxicated', 'Nervous', 'Aff_Neutral', 'Tired', 'Filename']].copy(deep=True)
df_loi = data[['Disinterest', 'Normal', 'High Interest', 'Filename']].copy(deep=True)
df_ar_val = data[['Arousal', 'Valence', 'Filename']].copy(deep=True)
# For later use we append the character ID as a column. It is stored in the file name together with other information.
# Since we only want the digits, we remove all non-digit characters from the Filename column.
# The pattern is a raw string (r'\D+') so that Python does not treat '\D' as an invalid escape sequence.

df_emotion['Char_ID'] = df_emotion['Filename'].replace(r'\D+', '', regex=True)
df_affect['Char_ID'] = df_affect['Filename'].replace(r'\D+', '', regex=True)
df_loi['Char_ID'] = df_loi['Filename'].replace(r'\D+', '', regex=True)
df_ar_val['Char_ID'] = df_ar_val['Filename'].replace(r'\D+', '', regex=True)
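
The digit extraction above can be checked on a few file names before it is applied to the full table. The following is a minimal sketch with made-up file names (the real names in the OpenSMILE export may look different). It also shows one caveat: stripping all non-digits concatenates every digit in the name, so an alternative that captures only the leading digits may be safer if file names can contain other numbers.

```python
import pandas as pd

# Hypothetical file names following the "<speaker id>_<take>.csv" pattern
files = pd.Series(["0_a.csv", "12_b.csv", "3_take2.csv"], name="Filename")

# Same idea as in the cell above: drop every run of non-digit characters
stripped_id = files.str.replace(r"\D+", "", regex=True)

# Alternative: keep only the digits at the start of the name
leading_id = files.str.extract(r"^(\d+)", expand=False)

print(pd.DataFrame({"Filename": files, "stripped": stripped_id, "leading": leading_id}))
```

For the third name the two approaches disagree ("32" versus "3"), which is the case to watch for in real data.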

## Let's load information about the speakers
The speaker information is saved in a single .csv file containing four important columns: ID, Age, Sex and Academic Status. The OpenSMILE csv files loaded above are named using the corresponding speaker ID (e.g. the speaker with ID 0 has two files, 0_a.csv and 0_b.csv), so the two data sources can be linked.

In [3]:
char_data = pd.read_csv("UIST2019_CharacterData.csv")  

#Join above tables and Character Tables

#To Join DataFrames we have to cast the column on which we want to join to int, so that both columns have the same data type
char_data['ID'] = char_data['ID'].astype(int)
df_ar_val['Char_ID'] = df_ar_val['Char_ID'].astype(int)
df_emotion['Char_ID'] = df_emotion['Char_ID'].astype(int)
df_affect['Char_ID'] = df_affect['Char_ID'].astype(int)
df_loi['Char_ID'] = df_loi['Char_ID'].astype(int)

#Save the merged data frames
df_ar_val_char = df_ar_val.merge(char_data, how = 'left', left_on='Char_ID', right_on='ID')
df_emotion_char = df_emotion.merge(char_data, how = 'left', left_on='Char_ID', right_on= 'ID')
df_affect_char = df_affect.merge(char_data, how = 'left', left_on='Char_ID', right_on= 'ID')
df_loi_char = df_loi.merge(char_data, how = 'left', left_on='Char_ID', right_on= 'ID')
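
A left join keeps every feature row even when no speaker with that ID exists, in which case the speaker columns are filled with NaN. The sketch below, using small made-up tables rather than the real data, shows one way to detect such rows with the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for a feature table and the speaker table
features = pd.DataFrame({"Char_ID": [0, 0, 1, 7], "Arousal": [0.2, 0.4, 0.6, 0.8]})
speakers = pd.DataFrame({"ID": [0, 1], "Sex": ["f", "m"]})

# indicator=True adds a "_merge" column telling where each row came from
merged = features.merge(speakers, how="left", left_on="Char_ID", right_on="ID", indicator=True)

# Rows marked "left_only" have no matching speaker entry
unmatched = merged.loc[merged["_merge"] == "left_only", "Char_ID"].tolist()
print("Char_IDs without speaker data:", unmatched)
```

If the list is not empty, the affected rows would silently drop out of any later grouping by Sex, Age or Academic Status.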

## Chi-squared Test of Independence
We start with the characteristic sex. The null hypothesis states that the two categorical variables, sex and e.g. emotion, are independent.

We bin the values of each voice feature, e.g. the emotion anger, into four equal-width bins (x <= 0.25; 0.25 < x <= 0.50; 0.50 < x <= 0.75; 0.75 < x <= 1.0) and use the resulting tables as frequency tables for the chi-squared test.

CAREFUL! The results printed below cannot be used for evaluation: some frequency counts are zero, which would make the chi-squared computation fail with an error. To prevent that error, the hp.chi2 function contains the line 'tables += 5', so the statistics are computed on shifted counts.
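
The binning and test described above can be sketched as follows. This is not the code of `hp.chi2`, whose implementation is not shown in this notebook; `binned_chi2`, its `pseudo_count` parameter and the random demo data are assumptions made for illustration only. The sketch uses `pandas.cut`, `pandas.crosstab` and `scipy.stats.chi2_contingency`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def binned_chi2(df, feature, group, pseudo_count=5):
    """Hypothetical helper: bin `feature` into four equal-width bins and test it against `group`."""
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]
    # labels=False returns the integer bin index (0..3) for each value
    bin_index = pd.cut(df[feature], bins=edges, labels=False, include_lowest=True)
    # Count observations per (group, bin); reindex so empty bins appear as zero columns
    table = pd.crosstab(df[group], bin_index).reindex(columns=range(4), fill_value=0)
    # Shift every cell by a constant so that no expected frequency is zero
    # (assumed to mirror the 'tables += 5' line mentioned above)
    stat, p_value, dof, _ = chi2_contingency(table + pseudo_count)
    return table, stat, p_value, dof


# Made-up data in which no value falls into the top bin
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Anger": rng.uniform(0.0, 0.7, size=200),
    "Sex": rng.choice(["f", "m"], size=200),
})
table, stat, p_value, dof = binned_chi2(demo, "Anger", "Sex")
print(table)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")
```

Adding a constant to every cell avoids the error but also changes the statistic and the p-value, which is why results obtained this way should be read with caution.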

In [4]:
print('EMOTION\n')
emo_sex_chi2 = hp.chi2(df_emotion_char, emotion_label,'Sex',  True)
print('\nAFFECT\n')
aff_sec_chi2 = hp.chi2(df_affect_char, affect_label,'Sex',  True)
print('\nAROUSAL-VALENCE\n')
ar_val_sec_chi2 = hp.chi2(df_ar_val_char, ['Arousal', 'Valence'], 'Sex', True)
print('\nLEVEL OF INTEREST\n')
loi_sec_chi2 = hp.chi2(df_loi_char, loi_label, 'Sex', True)

EMOTION

Chi square of Anger : 14.284119897959181 with p-value of: 0.0025428510334059118
Chi square of Boredom : 12.689320112972528 with p-value of: 0.005358989190508994
Chi square of Disgust : 51.62577688903845 with p-value of: 3.598809706330085e-11
Chi square of Fear : 14.284119897959181 with p-value of: 0.0025428510334059118
Chi square of Happiness : 14.284119897959181 with p-value of: 0.0025428510334059118
Chi square of Emo_Neutral : 13.205567580567578 with p-value of: 0.00421249985180783
Chi square of Sadness : 49.523415466961985 with p-value of: 1.0092306115585427e-10

AFFECT

Chi square of Aggressiv : 38.34710794039838 with p-value of: 2.3864370734860672e-08
Chi square of Cheerful : 7.264610355568236 with p-value of: 0.0639252979705837
Chi square of Intoxicated : 14.519367784992784 with p-value of: 0.002277048718019142
Chi square of Nervous : 14.284119897959181 with p-value of: 0.0025428510334059118
Chi square of Aff_Neutral : 14.284119897959181 with p-value of: 0.00254285103340

Next we move on to academic status. The null hypothesis is that the variables academic status and e.g. emotion are independent.

In [5]:
print('EMOTION\n')
emo_aca_chi2 = hp.chi2(df_emotion_char, emotion_label,'Academic' , True)
print('\nAFFECT\n')
aff_aca_chi2 = hp.chi2(df_affect_char, affect_label,'Academic', True)
print('\nAROUSAL-VALENCE\n')
ar_val_aca_chi2 = hp.chi2(df_ar_val_char, ['Arousal', 'Valence'],  'Academic',True)
print('\nLEVEL OF INTEREST\n')
loi_aca_chi2 = hp.chi2(df_loi_char, loi_label,'Academic', True)

EMOTION

Chi square of Anger : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Boredom : 2.331135254688385 with p-value of: 0.5065824736987072
Chi square of Disgust : 0.19044788968218865 with p-value of: 0.979116402934949
Chi square of Fear : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Happiness : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Emo_Neutral : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Sadness : 2.0098521307088073 with p-value of: 0.5703643781225765

AFFECT

Chi square of Aggressiv : 1.8165118542960852 with p-value of: 0.6113483801758055
Chi square of Cheerful : 10.415518655673196 with p-value of: 0.015345071429159331
Chi square of Intoxicated : 1.4513113716098598 with p-value of: 0.6935523493276594
Chi square of Nervous : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Aff_Neutral : 2.2939556257285516 with p-value of: 0.513679837522993
Chi square of Tired : 3.167

Now let's check whether age and e.g. emotion, affect, arousal-valence or level of interest are independent.

In [6]:
print('EMOTION\n')
emo_age_chi2 = hp.chi2(df_emotion_char, emotion_label,'Age', True)
print('\nAFFECT\n')
aff_age_chi2 = hp.chi2(df_affect_char, affect_label, 'Age', True)
print('\nAROUSAL-VALENCE\n')
ar_val_age_chi2 = hp.chi2(df_ar_val_char, ['Arousal', 'Valence'],'Age' ,True)
print('\nLEVEL OF INTEREST\n')
loi_age_chi2 = hp.chi2(df_loi_char, loi_label, 'Age',  True)

EMOTION

Chi square of Anger : 34.877210348949795 with p-value of: 4.552098631951087e-06
Chi square of Boredom : 26.534228937441206 with p-value of: 0.00017698483795208845
Chi square of Disgust : 5.448623369907392 with p-value of: 0.48768789011227665
Chi square of Fear : 34.877210348949795 with p-value of: 4.552098631951087e-06
Chi square of Happiness : 34.877210348949795 with p-value of: 4.552098631951087e-06
Chi square of Emo_Neutral : 34.877210348949795 with p-value of: 4.552098631951087e-06
Chi square of Sadness : 3.0499184234787218 with p-value of: 0.8025559752759701

AFFECT

Chi square of Aggressiv : 26.61725366811749 with p-value of: 0.00017077283694653112
Chi square of Cheerful : 20.170246613771518 with p-value of: 0.0025826030217904567
Chi square of Intoxicated : 30.0345654075165 with p-value of: 3.871811344330645e-05
Chi square of Nervous : 34.877210348949795 with p-value of: 4.552098631951087e-06
Chi square of Aff_Neutral : 34.877210348949795 with p-value of: 4.5520986319510

Now let's look at the native-speaker status (column IsNativeSpeaker).

In [7]:
print('EMOTION\n')
# Use separate variable names so that the age results from the previous cell are not overwritten
emo_native_chi2 = hp.chi2(df_emotion_char, emotion_label, 'IsNativeSpeaker', True)
print('\nAFFECT\n')
aff_native_chi2 = hp.chi2(df_affect_char, affect_label, 'IsNativeSpeaker', True)
print('\nAROUSAL-VALENCE\n')
ar_val_native_chi2 = hp.chi2(df_ar_val_char, ['Arousal', 'Valence'], 'IsNativeSpeaker', True)
print('\nLEVEL OF INTEREST\n')
loi_native_chi2 = hp.chi2(df_loi_char, loi_label, 'IsNativeSpeaker', True)

EMOTION

Chi square of Anger : 8.6807867516953 with p-value of: 0.19234218415919735
Chi square of Boredom : 5.660209266347925 with p-value of: 0.4623085255238457
Chi square of Disgust : 5.018268769657805 with p-value of: 0.5414721528690005
Chi square of Fear : 8.6807867516953 with p-value of: 0.19234218415919735
Chi square of Happiness : 8.6807867516953 with p-value of: 0.19234218415919735
Chi square of Emo_Neutral : 8.390973984446658 with p-value of: 0.21083558773123384
Chi square of Sadness : 3.5054211314560453 with p-value of: 0.7432482982152953

AFFECT

Chi square of Aggressiv : 9.380555810445896 with p-value of: 0.15327980181710932
Chi square of Cheerful : 9.85823934059986 with p-value of: 0.1307502147949227
Chi square of Intoxicated : 9.780344357638318 with p-value of: 0.13421219890848646
Chi square of Nervous : 8.6807867516953 with p-value of: 0.19234218415919735
Chi square of Aff_Neutral : 8.6807867516953 with p-value of: 0.19234218415919735
Chi square of Tired : 8.067817683868

## Post-Hoc tests for age and native speaker, as they have three different groups

If a significant p-value is found for a characteristic with three groups, such as 'Age', we do not yet know which groups differ significantly from each other, so pairwise post-hoc tests with Bonferroni correction are run for this characteristic.

CAREFUL! The results printed below cannot be used for evaluation: some frequency counts are zero, which would make the chi-squared computation fail with an error. To prevent that error, the hp.chi2 function contains the line 'tables += 5', so the statistics are computed on shifted counts.
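
The idea behind the pairwise post-hoc test can be sketched as follows. This is not the code of `hp.chi2_post_hoc`, which is not shown here; the frequency table is invented, and the sketch only illustrates the general technique: run a chi-squared test for every pair of groups, then correct the resulting p-values with `multipletests` from statsmodels using the Bonferroni method.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Hypothetical frequency table: rows are age groups, columns are the four feature bins
table = pd.DataFrame(
    [[30, 20, 10, 5], [12, 18, 20, 15], [10, 17, 22, 16]],
    index=["Young", "Intermediate", "Old"],
)

# All unordered pairs of groups, in the same order as the printed "Combinations" lists
pairs = list(combinations(table.index, 2))

# One chi-squared test per pair, using only the two rows involved
raw_p = [chi2_contingency(table.loc[list(pair)])[1] for pair in pairs]

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for pair, rej, p in zip(pairs, reject, corrected_p):
    print(pair, "reject H0:", bool(rej), "corrected p:", p)
```

With three groups there are three comparisons, so each raw p-value is multiplied by three before it is compared with the significance level.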

In [8]:
print('EMOTION\n')
print('post-hoc emotions and different groups')
emo_reject_list, emo_corrected_p_vals, emo_combinations, emo_residuals= hp.chi2_post_hoc(df_emotion_char,emotion_label, 'Age', 'bonferroni', True, True)
print('\nAFFECT\n')
print('\n post-hoc affect and different groups')
aff_reject_list, aff_corrected_p_vals, aff_combinations, aff_residuals = hp.chi2_post_hoc(df_affect_char, affect_label, 'Age' ,'bonferroni', True, True)
print('\nAROUSAL-VALENCE\n')
print('\n post-hoc arousal-valence and different groups')
ar_val_reject_list, ar_val_corrected_p_vals, ar_val_combinations, ar_val_residuals = hp.chi2_post_hoc(df_ar_val_char, ['Arousal', 'Valence'], 'Age', 'bonferroni',True, True)
print('\nLEVEL OF INTEREST\n')
print('\n post-hoc level of interest and different groups')
loi_reject_list, loi_corrected_p_vals, loi_combinations, loi_residuals = hp.chi2_post_hoc(df_loi_char, loi_label, 'Age', 'bonferroni', True, True)

EMOTION

post-hoc emotions and different groups
Anger
Combinations: [('Young', 'Intermediate'), ('Young', 'Old'), ('Intermediate', 'Old')]
Reject List: [ True  True False]
Corrected p-values: [3.26765404e-03 1.55140776e-06 1.00000000e+00]
Boredom
Combinations: [('Young', 'Intermediate'), ('Young', 'Old'), ('Intermediate', 'Old')]
Reject List: [ True  True False]
Corrected p-values: [3.59129612e-02 6.21451554e-05 1.00000000e+00]
Disgust
Combinations: [('Young', 'Intermediate'), ('Young', 'Old'), ('Intermediate', 'Old')]
Reject List: [False False False]
Corrected p-values: [1.         0.44097776 1.        ]
Fear
Combinations: [('Young', 'Intermediate'), ('Young', 'Old'), ('Intermediate', 'Old')]
Reject List: [ True  True False]
Corrected p-values: [3.26765404e-03 1.55140776e-06 1.00000000e+00]
Happiness
Combinations: [('Young', 'Intermediate'), ('Young', 'Old'), ('Intermediate', 'Old')]
Reject List: [ True  True False]
Corrected p-values: [3.26765404e-03 1.55140776e-06 1.00000000e+00]
Em

In [9]:
print('EMOTION\n')
print('post-hoc emotions and different groups')
emo_reject_list, emo_corrected_p_vals, emo_combinations, emo_residuals= hp.chi2_post_hoc(df_emotion_char,emotion_label, 'IsNativeSpeaker', 'bonferroni', True, True)
print('\nAFFECT\n')
print('\n post-hoc affect and different groups')
aff_reject_list, aff_corrected_p_vals, aff_combinations, aff_residuals = hp.chi2_post_hoc(df_affect_char, affect_label, 'IsNativeSpeaker' ,'bonferroni', True, True)
print('\nAROUSAL-VALENCE\n')
print('\n post-hoc arousal-valence and different groups')
ar_val_reject_list, ar_val_corrected_p_vals, ar_val_combinations, ar_val_residuals = hp.chi2_post_hoc(df_ar_val_char, ['Arousal', 'Valence'], 'IsNativeSpeaker', 'bonferroni',True, True)
print('\nLEVEL OF INTEREST\n')
print('\n post-hoc level of interest and different groups')
loi_reject_list, loi_corrected_p_vals, loi_combinations, loi_residuals = hp.chi2_post_hoc(df_loi_char, loi_label, 'IsNativeSpeaker', 'bonferroni', True, True)

EMOTION

post-hoc emotions and different groups
Anger
Combinations: [('Asian Non-Native', 'Europ. Non-Native'), ('Asian Non-Native', 'Native Speaker'), ('Europ. Non-Native', 'Native Speaker')]
Reject List: [False False False]
Corrected p-values: [0.14843553 1.         0.47713622]
Boredom
Combinations: [('Asian Non-Native', 'Europ. Non-Native'), ('Asian Non-Native', 'Native Speaker'), ('Europ. Non-Native', 'Native Speaker')]
Reject List: [False False False]
Corrected p-values: [0.44664789 1.         1.        ]
Disgust
Combinations: [('Asian Non-Native', 'Europ. Non-Native'), ('Asian Non-Native', 'Native Speaker'), ('Europ. Non-Native', 'Native Speaker')]
Reject List: [False False False]
Corrected p-values: [1. 1. 1.]
Fear
Combinations: [('Asian Non-Native', 'Europ. Non-Native'), ('Asian Non-Native', 'Native Speaker'), ('Europ. Non-Native', 'Native Speaker')]
Reject List: [False False False]
Corrected p-values: [0.14843553 1.         0.47713622]
Happiness
Combinations: [('Asian Non-Nati