## <center>Classification</center>

<center> I confirm that I have not used any GPT-generated responses for any part of this assignment. </center>

Extract a set of acoustic-prosodic features using the openSMILE toolkit. Normalize the extracted 
features (as in Part 1, Feature Analysis) and use leave-one-speaker-out cross-validation to predict 
the emotion classes. Leave-one-speaker-out cross-validation means that, for each speaker S, we train 
on the other six speakers combined and test on S.
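
As a minimal sketch of the leave-one-speaker-out protocol described above, scikit-learn's `LeaveOneGroupOut` produces one fold per speaker when the speaker ID is passed as the group. The toy arrays and speaker names (`s1`, `s2`, `s3`) are made up for illustration and are not the assignment data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 6 utterances from 3 hypothetical speakers (not the real corpus).
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array(["happy", "sad", "happy", "sad", "happy", "sad"])
groups = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])

logo = LeaveOneGroupOut()
folds = []
for train_idx, test_idx in logo.split(X, y, groups=groups):
    held_out = set(groups[test_idx])
    # The held-out speaker must never appear in the training split.
    assert held_out.isdisjoint(groups[train_idx])
    folds.append((sorted(held_out), len(train_idx), len(test_idx)))

print(folds)
```

Each fold holds out exactly one speaker, so the number of folds equals the number of speakers (seven in this assignment).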

## 1. Extract features

### openSMILE features

In [1]:
import os
os.makedirs("opensmile_features", exist_ok=True)

In [2]:
import os
import glob

smile_bin = "opensmile/build/progsrc/smilextract/SMILExtract"
conf_path = "opensmile/config/is09-13/IS09_emotion.conf"
wav_dir = "hw3_speech_files"
out_dir = "opensmile_features"

os.makedirs(out_dir, exist_ok=True)

wav_paths = glob.glob(os.path.join(wav_dir, "*.wav"))
print("Number of wav files:", len(wav_paths))

for i, wav_path in enumerate(wav_paths):
    base = os.path.splitext(os.path.basename(wav_path))[0]
    out_csv = os.path.join(out_dir, base + ".csv")
    
    if os.path.exists(out_csv):
        continue

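    # With IS09_emotion.conf, -O writes ARFF-formatted output,
    # even though the file is saved with a .csv extension.
    cmd = (
        f'"{smile_bin}" -C "{conf_path}" -I "{wav_path}" -O "{out_csv}" '
        f'> /dev/null 2>&1'
    )
    status = os.system(cmd)
    if status != 0:
        # Output is silenced above, so report failures explicitly.
        print(f"openSMILE failed for {wav_path} (status {status})")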
    
    if (i + 1) % 500 == 0 or i == len(wav_paths) - 1:
        print(f"{i+1}/{len(wav_paths)} done")

Number of wav files: 2324
500/2324 done
1000/2324 done
1500/2324 done
2000/2324 done
2324/2324 done


In [4]:
import pandas as pd
import glob
import os

def load_opensmile_arff(path):
    attr_names = []
    data_rows = []
    in_data = False
    
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.lower().startswith('@attribute'):
                parts = line.split()
                attr_names.append(parts[1])
            elif line.lower().startswith('@data'):
                in_data = True
            elif in_data and not line.startswith('@'):
                data_rows.append(line.split(','))

    df = pd.DataFrame(data_rows, columns=attr_names)
    for col in df.columns:
        if col not in ['name', 'class']:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

feat_dir = "opensmile_features"
all_dfs = []

for path in glob.glob(os.path.join(feat_dir, "*.csv")):
    df_one = load_opensmile_arff(path)
    df_one = df_one.drop(columns=["name", "class"], errors="ignore")
    utt_id = os.path.splitext(os.path.basename(path))[0]
    df_one = df_one.assign(utt_id=utt_id)
    all_dfs.append(df_one)

opensmile_df = pd.concat(all_dfs, ignore_index=True)
opensmile_df.to_csv("opensmile_is09_features.csv", index=False)
print(opensmile_df.shape)
opensmile_df.head()

(2324, 385)


Unnamed: 0,pcm_RMSenergy_sma_max,pcm_RMSenergy_sma_min,pcm_RMSenergy_sma_range,pcm_RMSenergy_sma_maxPos,pcm_RMSenergy_sma_minPos,pcm_RMSenergy_sma_amean,pcm_RMSenergy_sma_linregc1,pcm_RMSenergy_sma_linregc2,pcm_RMSenergy_sma_linregerrQ,pcm_RMSenergy_sma_stddev,...,F0_sma_de_maxPos,F0_sma_de_minPos,F0_sma_de_amean,F0_sma_de_linregc1,F0_sma_de_linregc2,F0_sma_de_linregerrQ,F0_sma_de_stddev,F0_sma_de_skewness,F0_sma_de_kurtosis,utt_id
0,0.01076,6.4e-05,0.010696,111.0,165.0,0.002748,-8.324649e-06,0.003498,8e-06,0.002916,...,85.0,92.0,5.348673e-08,-0.033835,3.045167,1062.679,32.64666,-0.090631,4.245496,mm_001_happy_2353.51_three-hundred-nine
1,0.076285,3.4e-05,0.076251,32.0,134.0,0.011488,-0.0001106844,0.019789,0.000221,0.015626,...,85.0,68.0,2.589447e-08,-0.09788,7.340973,1683.387,41.25033,0.000596,4.717154,mm_001_panic_3395.30_one-thousand-three
2,0.020006,1e-05,0.019996,39.0,1853.0,0.000425,-2.645859e-07,0.000788,3e-06,0.001664,...,1909.0,617.0,-9.201901e-09,-2.5e-05,0.034964,213.4161,14.60878,-0.116606,17.21548,cc_001_panic_861.97_Two-thousand-five
3,0.015069,4.3e-05,0.015026,85.0,0.0,0.005252,2.256441e-05,0.004067,1.8e-05,0.004311,...,62.0,71.0,6.747695e-08,-0.029053,1.525273,188.6371,13.76326,-0.312293,6.684447,mf_001_contempt_3901.86_November-first
4,0.016625,0.000111,0.016514,55.0,137.0,0.002422,-3.115808e-05,0.004572,8e-06,0.00303,...,95.0,99.0,1.372193e-08,-0.001866,0.128773,39.98518,6.323827,-0.006288,25.1965,cl_001_interest_1035.82_Ten-thousand-one


### Adding other features

In addition to the IS09 features, we also include a small set of manually designed prosodic features that we used in Part 1 for feature analysis. Using Parselmouth, we compute utterance-level summary statistics of pitch and intensity: minimum, maximum, and mean F0, as well as minimum, maximum, and mean intensity.
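
The Part 1 extraction code is not shown here, so the following is only a sketch of how such utterance-level summaries could be computed from frame-level contours. The helper `summarize_contour` and the toy values are hypothetical. With Parselmouth, the contours might come from `Sound.to_pitch()` and `Sound.to_intensity()`; treating 0 Hz frames as unvoiced and using an arithmetic mean are assumptions, not necessarily what Part 1 did.

```python
import numpy as np

def summarize_contour(values, drop_nonpositive=False):
    """Return (min, max, mean) of a frame-level contour as floats."""
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    if drop_nonpositive:
        # Assumption: the pitch tracker reports 0 Hz for unvoiced frames.
        arr = arr[arr > 0]
    if arr.size == 0:
        return (float("nan"),) * 3
    return float(arr.min()), float(arr.max()), float(arr.mean())

# Made-up contours: F0 in Hz (0 = unvoiced), intensity in dB.
f0 = [0.0, 110.0, 120.0, 0.0, 130.0]
intensity = [50.0, 60.0, 70.0]

min_pitch, max_pitch, mean_pitch = summarize_contour(f0, drop_nonpositive=True)
min_intensity, max_intensity, mean_intensity = summarize_contour(intensity)
print(min_pitch, max_pitch, mean_pitch)
print(min_intensity, max_intensity, mean_intensity)
```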

In [5]:
pm_df = pd.read_csv("parselmouth_features.csv")
pm_df.head()

pm_df["utt_id"] = pm_df["file"].str.replace(".wav", "", regex=False)
pm_df.head()

pm_keep_cols = [
    "utt_id", "speaker", "emotion",
    "min_pitch", "max_pitch", "mean_pitch",
    "min_intensity", "max_intensity", "mean_intensity",
]

pm_small = pm_df[pm_keep_cols].copy()
pm_small.head()

Unnamed: 0,utt_id,speaker,emotion,min_pitch,max_pitch,mean_pitch,min_intensity,max_intensity,mean_intensity
0,cc_001_anxiety_910.77_May-twenty-third,cc,anxiety,76.85194,134.315515,102.734915,48.573983,80.261238,68.821829
1,cc_001_anxiety_916.11_Eight-hundred-eight,cc,anxiety,88.357876,470.846252,141.611068,51.952659,82.362566,68.802924
2,cc_001_anxiety_918.66_Eight-hundred-eight,cc,anxiety,77.727072,446.765057,137.602579,44.266131,81.408653,69.223235
3,cc_001_anxiety_928.48_Four-thousand-eight,cc,anxiety,93.122852,532.985706,159.949508,49.420874,85.87101,72.128965
4,cc_001_anxiety_934.73_Nine-thousand-six,cc,anxiety,81.534891,156.770768,106.813002,44.370192,77.682918,67.40195


In [6]:
full_df = opensmile_df.merge(pm_small, on="utt_id", how="inner")
print(full_df.shape)
full_df.head()

(2324, 393)


Unnamed: 0,pcm_RMSenergy_sma_max,pcm_RMSenergy_sma_min,pcm_RMSenergy_sma_range,pcm_RMSenergy_sma_maxPos,pcm_RMSenergy_sma_minPos,pcm_RMSenergy_sma_amean,pcm_RMSenergy_sma_linregc1,pcm_RMSenergy_sma_linregc2,pcm_RMSenergy_sma_linregerrQ,pcm_RMSenergy_sma_stddev,...,F0_sma_de_kurtosis,utt_id,speaker,emotion,min_pitch,max_pitch,mean_pitch,min_intensity,max_intensity,mean_intensity
0,0.01076,6.4e-05,0.010696,111.0,165.0,0.002748,-8.324649e-06,0.003498,8e-06,0.002916,...,4.245496,mm_001_happy_2353.51_three-hundred-nine,mm,happy,182.100595,438.882046,261.664025,27.908648,76.034085,56.297349
1,0.076285,3.4e-05,0.076251,32.0,134.0,0.011488,-0.0001106844,0.019789,0.000221,0.015626,...,4.717154,mm_001_panic_3395.30_one-thousand-three,mm,panic,380.564954,597.443171,497.159058,19.992874,82.878146,59.836422
2,0.020006,1e-05,0.019996,39.0,1853.0,0.000425,-2.645859e-07,0.000788,3e-06,0.001664,...,17.21548,cc_001_panic_861.97_Two-thousand-five,cc,panic,94.952739,314.859649,154.211373,7.121748,78.862817,28.01241
3,0.015069,4.3e-05,0.015026,85.0,0.0,0.005252,2.256441e-05,0.004067,1.8e-05,0.004311,...,6.684447,mf_001_contempt_3901.86_November-first,mf,contempt,118.925872,166.166038,144.59776,32.36015,75.987364,62.269368
4,0.016625,0.000111,0.016514,55.0,137.0,0.002422,-3.115808e-05,0.004572,8e-06,0.00303,...,25.1965,cl_001_interest_1035.82_Ten-thousand-one,cl,interest,85.481336,151.660137,108.604979,17.217389,68.549933,53.173287


### Normalize features

In [7]:
meta_cols = ["utt_id", "speaker", "emotion"]
feature_cols = [c for c in full_df.columns if c not in meta_cols]

def speaker_zscore(df, speaker_col, feature_cols):
    df_norm = df.copy()
    for spk, group in df.groupby(speaker_col):
        mu = group[feature_cols].mean()
        sigma = group[feature_cols].std(ddof=0).replace(0, 1.0) 
        df_norm.loc[group.index, feature_cols] = (group[feature_cols] - mu) / sigma
    return df_norm

full_df_norm = speaker_zscore(full_df, "speaker", feature_cols)
full_df_norm.head()

Unnamed: 0,pcm_RMSenergy_sma_max,pcm_RMSenergy_sma_min,pcm_RMSenergy_sma_range,pcm_RMSenergy_sma_maxPos,pcm_RMSenergy_sma_minPos,pcm_RMSenergy_sma_amean,pcm_RMSenergy_sma_linregc1,pcm_RMSenergy_sma_linregc2,pcm_RMSenergy_sma_linregerrQ,pcm_RMSenergy_sma_stddev,...,F0_sma_de_kurtosis,utt_id,speaker,emotion,min_pitch,max_pitch,mean_pitch,min_intensity,max_intensity,mean_intensity
0,-1.085203,-0.515961,-1.082779,1.430641,1.591199,-0.884629,0.174266,-0.730863,-0.568014,-0.96308,...,-0.572511,mm_001_happy_2353.51_three-hundred-nine,mm,happy,0.529118,0.863866,0.579475,-0.502331,-0.033875,-0.555149
1,1.01555,-0.628866,1.022714,-0.952728,1.091524,0.571155,-1.293282,1.288691,0.416932,0.875683,...,-0.508643,mm_001_panic_3395.30_one-thousand-three,mm,panic,3.522636,2.017552,3.500447,-1.471447,1.255247,-0.005646
2,-0.5226,-0.884185,-0.520079,-0.24681,14.711776,-1.359014,0.428145,-1.169856,-0.638895,-1.043284,...,1.22398,cc_001_panic_861.97_Two-thousand-five,cc,panic,-0.455608,0.205021,-0.294913,-2.772321,0.738522,-5.076999
3,-0.55919,-0.61591,-0.55642,1.209768,-1.099892,-0.33907,0.429121,-0.439372,-0.345017,-0.471005,...,-0.028862,mf_001_contempt_3901.86_November-first,mf,contempt,-0.151186,-0.878212,-0.638453,-0.599881,-0.064433,-0.183911
4,-0.483868,0.438624,-0.484371,-0.451632,0.626345,-0.407186,-0.353485,-0.286902,-0.166126,-0.476449,...,1.392326,cl_001_interest_1035.82_Ten-thousand-one,cl,interest,-0.372619,-0.443433,-0.51846,0.237007,-0.940565,0.562582
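
The per-speaker z-scoring can be sanity-checked on a toy frame. This sketch uses `groupby(...).transform` as an equivalent formulation; the speakers `a` and `b`, the helper `zscore`, and the pitch values are made up. Note that in the cell above the statistics are computed from each speaker's own utterances, so the held-out speaker's unlabeled features also contribute to that speaker's normalization.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "speaker": ["a", "a", "a", "b", "b", "b"],
    "mean_pitch": [100.0, 110.0, 120.0, 200.0, 220.0, 240.0],
})

def zscore(s):
    # Population standard deviation (ddof=0); guard against constant columns.
    sd = s.std(ddof=0)
    return (s - s.mean()) / (sd if sd > 0 else 1.0)

toy["mean_pitch_z"] = toy.groupby("speaker")["mean_pitch"].transform(zscore)

z_a = toy.loc[toy["speaker"] == "a", "mean_pitch_z"].to_numpy()
z_b = toy.loc[toy["speaker"] == "b", "mean_pitch_z"].to_numpy()
print(z_a, z_b)
```

Although the two toy speakers have very different raw pitch ranges, their z-scores coincide, which is the speaker-independence this normalization aims for.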


## 2. Train a multiclass classifier

We treat emotion recognition as a 15-way multiclass classification problem. All models are implemented in scikit-learn. For evaluation, we use `sklearn.metrics.classification_report` to obtain per-class precision, recall, and F1 scores, as well as macro and weighted averages, for each test speaker.
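
As a small illustration of the report, the sketch below runs `classification_report` on made-up labels. Passing `zero_division=0` is an optional choice, not something the experiments here are known to use: it scores a class that is never predicted as 0 without emitting an undefined-metric warning. `output_dict=True` returns the numbers as a nested dictionary.

```python
from sklearn.metrics import classification_report

# Made-up labels; "panic" is never predicted, so its precision is undefined.
y_true = ["happy", "happy", "sadness", "sadness", "panic"]
y_pred = ["happy", "sadness", "sadness", "sadness", "happy"]

report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)
print(report["panic"])
print(report["macro avg"]["f1-score"], report["weighted avg"]["f1-score"])
```

The macro average weights every class equally, while the weighted average weights each class by its support, so the two can diverge noticeably when class counts are unbalanced.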

### Try SVM
#### IS09 only


In [12]:
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, f1_score

speakers = sorted(full_df_norm["speaker"].unique())
print("Speakers:", speakers)

def loso_SVM_experiment(df, feature_cols, label_col="emotion", model_name=""):
    all_acc, all_f1, all_n = [], [], []
    
    print("\n==== experiment:", model_name, "====\n")
    
    for spk in speakers:
        train_df = df[df["speaker"] != spk]
        test_df  = df[df["speaker"] == spk]
        
        X_train = train_df[feature_cols].values
        y_train = train_df[label_col].values
        X_test  = test_df[feature_cols].values
        y_test  = test_df[label_col].values
        
        clf = SVC(kernel="rbf", C=10, gamma="scale", random_state=0)
        clf.fit(X_train, y_train)
        
        y_pred = clf.predict(X_test)
        
        print("=" * 50)
        print(f"Test speaker = {spk}")
        print(classification_report(y_test, y_pred, digits=3))
        
        acc = accuracy_score(y_test, y_pred)
        f1_w = f1_score(y_test, y_pred, average="weighted")
        n_i = len(y_test)
        
        all_acc.append(acc)
        all_f1.append(f1_w)
        all_n.append(n_i)
        
        print(f"Speaker {spk} accuracy: {acc:.4f}, weighted F1: {f1_w:.4f}, n = {n_i}\n")
    
    all_acc = np.array(all_acc)
    all_f1 = np.array(all_f1)
    all_n = np.array(all_n)
    
    agg_acc = (all_acc * all_n).sum() / all_n.sum()
    agg_f1  = (all_f1 * all_n).sum() / all_n.sum()
    
    print("\n==== Aggregated results for", model_name, "====")
    print(f"Aggregated average accuracy: {agg_acc:.4f}")
    print(f"Aggregated average weighted F1: {agg_f1:.4f}")
    
    return agg_acc, agg_f1, all_acc, all_f1, all_n

Speakers: ['cc', 'cl', 'gg', 'jg', 'mf', 'mk', 'mm']
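
The aggregation at the end of `loso_SVM_experiment` weights each speaker's score by that speaker's number of test utterances. A short sketch on made-up fold predictions (the speakers `s1`, `s2` and labels `a`, `b` are hypothetical) shows one consequence: the size-weighted mean of per-speaker accuracies equals the accuracy over all pooled predictions, whereas the size-weighted mean of per-speaker weighted F1 generally does not equal the pooled weighted F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up (y_true, y_pred) pairs for two hypothetical test speakers.
folds = {
    "s1": (["a", "a", "b", "b"], ["a", "b", "b", "b"]),
    "s2": (["a", "b"], ["a", "a"]),
}

accs, f1s, ns = [], [], []
pooled_true, pooled_pred = [], []
for y_true, y_pred in folds.values():
    accs.append(accuracy_score(y_true, y_pred))
    f1s.append(f1_score(y_true, y_pred, average="weighted", zero_division=0))
    ns.append(len(y_true))
    pooled_true += y_true
    pooled_pred += y_pred

accs, f1s, ns = np.array(accs), np.array(f1s), np.array(ns)
agg_acc = (accs * ns).sum() / ns.sum()
agg_f1 = (f1s * ns).sum() / ns.sum()

pooled_acc = accuracy_score(pooled_true, pooled_pred)
pooled_f1 = f1_score(pooled_true, pooled_pred, average="weighted", zero_division=0)
print(agg_acc, pooled_acc)
print(agg_f1, pooled_f1)
```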


In [13]:
is09_feature_cols = [c for c in opensmile_df.columns if c != "utt_id"]

len(is09_feature_cols), is09_feature_cols[:5]

(384,
 ['pcm_RMSenergy_sma_max',
  'pcm_RMSenergy_sma_min',
  'pcm_RMSenergy_sma_range',
  'pcm_RMSenergy_sma_maxPos',
  'pcm_RMSenergy_sma_minPos'])

In [14]:
agg_acc_A, agg_f1_A, all_acc_A, all_f1_A, all_n_A = loso_SVM_experiment(
    full_df_norm,
    feature_cols=is09_feature_cols,
    model_name="Model A: IS09 only"
)


==== experiment: Model A: IS09 only ====

Test speaker = cc
              precision    recall  f1-score   support

     anxiety      0.000     0.000     0.000        10
     boredom      0.074     0.133     0.095        15
  cold-anger      0.067     0.067     0.067        15
    contempt      0.308     0.364     0.333        22
     despair      0.125     0.333     0.182         9
     disgust      0.364     0.129     0.190        31
     elation      0.167     0.250     0.200        16
       happy      0.190     0.174     0.182        23
   hot-anger      0.333     0.500     0.400        14
    interest      0.200     0.235     0.216        17
     neutral      0.667     0.222     0.333        18
       panic      0.583     0.389     0.467        18
       pride      0.167     0.043     0.069        23
     sadness      0.308     0.308     0.308        13
       shame      0.250     0.143     0.182        21

    accuracy                          0.211       265
   macro avg      0



Test speaker = gg
              precision    recall  f1-score   support

     anxiety      0.528     0.633     0.576        30
     boredom      0.316     0.400     0.353        30
  cold-anger      0.234     0.407     0.297        27
    contempt      0.286     0.231     0.255        26
     despair      0.226     0.250     0.237        28
     disgust      0.474     0.176     0.257        51
     elation      0.304     0.500     0.378        28
       happy      0.231     0.400     0.293        30
   hot-anger      0.562     0.409     0.474        22
    interest      0.237     0.300     0.265        30
     neutral      1.000     0.111     0.200         9
       panic      0.583     0.519     0.549        27
       pride      0.278     0.200     0.233        25
     sadness      0.000     0.000     0.000        33
       shame      0.346     0.375     0.360        24

    accuracy                          0.326       420
   macro avg      0.374     0.327     0.315       420
weighted

#### Combine

In [15]:
extended_feature_cols = feature_cols  # 384 IS09 features + 6 Parselmouth features

agg_acc_B, agg_f1_B, all_acc_B, all_f1_B, all_n_B = loso_SVM_experiment(
    full_df_norm,
    feature_cols=extended_feature_cols,
    model_name="Model B: IS09 + Parselmouth prosodic features"
)


==== experiment: Model B: IS09 + Parselmouth prosodic features ====

Test speaker = cc
              precision    recall  f1-score   support

     anxiety      0.000     0.000     0.000        10
     boredom      0.074     0.133     0.095        15
  cold-anger      0.091     0.067     0.077        15
    contempt      0.357     0.455     0.400        22
     despair      0.087     0.222     0.125         9
     disgust      0.429     0.194     0.267        31
     elation      0.192     0.312     0.238        16
       happy      0.143     0.130     0.136        23
   hot-anger      0.400     0.571     0.471        14
    interest      0.278     0.294     0.286        17
     neutral      0.500     0.167     0.250        18
       panic      0.615     0.444     0.516        18
       pride      0.000     0.000     0.000        23
     sadness      0.231     0.231     0.231        13
       shame      0.200     0.143     0.167        21

    accuracy                          0.223   



Test speaker = gg
              precision    recall  f1-score   support

     anxiety      0.613     0.633     0.623        30
     boredom      0.333     0.400     0.364        30
  cold-anger      0.205     0.333     0.254        27
    contempt      0.235     0.308     0.267        26
     despair      0.233     0.250     0.241        28
     disgust      0.611     0.216     0.319        51
     elation      0.348     0.571     0.432        28
       happy      0.234     0.367     0.286        30
   hot-anger      0.625     0.455     0.526        22
    interest      0.225     0.300     0.257        30
     neutral      1.000     0.111     0.200         9
       panic      0.591     0.481     0.531        27
       pride      0.300     0.240     0.267        25
     sadness      0.000     0.000     0.000        33
       shame      0.346     0.375     0.360        24

    accuracy                          0.336       420
   macro avg      0.393     0.336     0.328       420
weighted

### Try MLP

In [18]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(full_df_norm["emotion"])
print(label_encoder.classes_) 

['anxiety' 'boredom' 'cold-anger' 'contempt' 'despair' 'disgust' 'elation'
 'happy' 'hot-anger' 'interest' 'neutral' 'panic' 'pride' 'sadness'
 'shame']


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
import numpy as np

def loso_MLP_experiment(df, feature_cols, label_col="emotion", model_name=""):
    all_acc, all_f1, all_n = [], [], []
    
    print("\n==== Experiment:", model_name, "====\n")
    
    for spk in speakers:
        train_df = df[df["speaker"] != spk]
        test_df  = df[df["speaker"] == spk]
        
        X_train = train_df[feature_cols].values
        y_train = train_df[label_col].values
        X_test  = test_df[feature_cols].values
        y_test  = test_df[label_col].values
        
        y_train_enc = label_encoder.transform(y_train)
        y_test_enc  = label_encoder.transform(y_test)
        
        clf = MLPClassifier(
            hidden_layer_sizes=(256, 64),
            activation="relu",
            alpha=1e-4,
            batch_size=64,
            learning_rate="adaptive",
            max_iter=100,
            early_stopping=True,
            random_state=0
        )
        clf.fit(X_train, y_train_enc)
        
        y_pred_enc = clf.predict(X_test)
        y_pred = label_encoder.inverse_transform(y_pred_enc)
        
        print("=" * 50)
        print(f"Test speaker = {spk}")
        print(classification_report(y_test, y_pred, digits=3))
        
        acc = accuracy_score(y_test, y_pred)
        f1_w = f1_score(y_test, y_pred, average="weighted")
        n_i = len(y_test)
        
        all_acc.append(acc)
        all_f1.append(f1_w)
        all_n.append(n_i)
        
        print(f"Speaker {spk} accuracy: {acc:.4f}, weighted F1: {f1_w:.4f}, n = {n_i}\n")
    
    all_acc = np.array(all_acc)
    all_f1 = np.array(all_f1)
    all_n = np.array(all_n)
    
    agg_acc = (all_acc * all_n).sum() / all_n.sum()
    agg_f1  = (all_f1 * all_n).sum() / all_n.sum()
    
    print("\n==== Aggregated results for", model_name, "====")
    print(f"Aggregated average accuracy: {agg_acc:.4f}")
    print(f"Aggregated average weighted F1: {agg_f1:.4f}")
    
    return agg_acc, agg_f1, all_acc, all_f1, all_n

#### IS09 only

In [20]:
# Use *_A2 names so the SVM results stored in *_A are not overwritten.
agg_acc_A2, agg_f1_A2, all_acc_A2, all_f1_A2, all_n_A2 = loso_MLP_experiment(
    full_df_norm,
    feature_cols=is09_feature_cols,
    model_name="Model A2: IS09 only"
)


==== Experiment: Model A2: IS09 only ====

Test speaker = cc
              precision    recall  f1-score   support

     anxiety      0.074     0.200     0.108        10
     boredom      0.100     0.200     0.133        15
  cold-anger      0.267     0.267     0.267        15
    contempt      0.226     0.318     0.264        22
     despair      0.158     0.333     0.214         9
     disgust      0.455     0.161     0.238        31
     elation      0.227     0.312     0.263        16
       happy      0.067     0.043     0.053        23
   hot-anger      0.316     0.429     0.364        14
    interest      0.214     0.176     0.194        17
     neutral      0.375     0.167     0.231        18
       panic      0.318     0.389     0.350        18
       pride      0.100     0.043     0.061        23
     sadness      0.133     0.154     0.143        13
       shame      0.143     0.048     0.071        21

    accuracy                          0.200       265
   macro avg      

#### Combine

In [23]:
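extended_feature_cols = feature_cols  # 384 IS09 features + 6 Parselmouth features

# Use *_B2 names so the SVM results stored in *_B are not overwritten.
agg_acc_B2, agg_f1_B2, all_acc_B2, all_f1_B2, all_n_B2 = loso_MLP_experiment(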
    full_df_norm,
    feature_cols=extended_feature_cols,
    model_name="Model B2: IS09 + Parselmouth prosodic features"
)


==== Experiment: Model B2: IS09 + Parselmouth prosodic features ====

Test speaker = cc
              precision    recall  f1-score   support

     anxiety      0.000     0.000     0.000        10
     boredom      0.071     0.133     0.093        15
  cold-anger      0.000     0.000     0.000        15
    contempt      0.382     0.591     0.464        22
     despair      0.133     0.222     0.167         9
     disgust      0.417     0.161     0.233        31
     elation      0.118     0.125     0.121        16
       happy      0.105     0.087     0.095        23
   hot-anger      0.333     0.571     0.421        14
    interest      0.182     0.235     0.205        17
     neutral      0.000     0.000     0.000        18
       panic      0.562     0.500     0.529        18
       pride      0.000     0.000     0.000        23
     sadness      0.158     0.231     0.188        13
       shame      0.333     0.238     0.278        21

    accuracy                          0.208  

The MLP consistently underperformed the SVM baseline (aggregated accuracy of roughly 23–24% versus 26–27% for the SVM), likely because of the limited amount of training data and the relatively high-dimensional feature space. We therefore report the SVM as our main classifier in the following analysis.
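
To put accuracies in this range in context, the sketch below computes two simple reference points from the support counts that the classification reports list for test speaker `cc`: uniform guessing over the 15 classes, and always predicting that speaker's most frequent class. These baselines are only for orientation; they were not run as part of the experiments above.

```python
# Support counts for test speaker cc, copied from the classification report.
support_cc = {
    "anxiety": 10, "boredom": 15, "cold-anger": 15, "contempt": 22,
    "despair": 9, "disgust": 31, "elation": 16, "happy": 23,
    "hot-anger": 14, "interest": 17, "neutral": 18, "panic": 18,
    "pride": 23, "sadness": 13, "shame": 21,
}

n_utts = sum(support_cc.values())
uniform_chance = 1 / len(support_cc)

# Accuracy of always predicting this speaker's most frequent class.
majority_label = max(support_cc, key=support_cc.get)
majority_rate = support_cc[majority_label] / n_utts

print(f"n = {n_utts}")
print(f"uniform chance: {uniform_chance:.3f}")
print(f"majority class ({majority_label}): {majority_rate:.3f}")
```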