# Group Conversation Analysis

## Datasets

Datasets used in the following experiments:

1) GAP corpus
2) UGI corpus

The GAP corpus carries more metadata than the UGI corpus. Most importantly, only the GAP corpus has the utterance-level 'Duration' and 'End' fields, which describe the temporal length of an utterance. Features that depend on this metadata are therefore exclusive to the GAP corpus.
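
One way to see which fields are exclusive to a corpus is to compare column sets. The sketch below uses toy frames rather than the real CSVs; apart from 'Duration' and 'End', the column names (`id`, `word_count`) are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-ins for the two corpora. Only 'Duration' and 'End' come from the
# text above; 'id' and 'word_count' are hypothetical placeholder columns.
gap_toy = pd.DataFrame({
    "id": [1, 2],
    "word_count": [12, 7],
    "Duration": [3.4, 1.9],
    "End": [10.2, 12.1],
})
ugi_toy = pd.DataFrame({
    "id": [1, 2],
    "word_count": [9, 15],
})

# Columns present in the GAP frame but absent from the UGI frame
gap_only = sorted(set(gap_toy.columns) - set(ugi_toy.columns))
print(gap_only)  # ['Duration', 'End']
```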

## Analysis

### Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.metrics import mean_squared_error, r2_score

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [2]:
csv_path = '../../csv'

gap_align_df = pd.read_csv(f'{csv_path}/gap-align.csv')
gap_dom_df = pd.read_csv(f'{csv_path}/gap-dom.csv')
gap_meta_df = pd.read_csv(f'{csv_path}/gap-meta.csv')
gap_polite_df = pd.read_csv(f'{csv_path}/gap-polite.csv')
gap_psych_df = pd.read_csv(f'{csv_path}/gap-psych.csv')

ugi_align_df = pd.read_csv(f'{csv_path}/ugi-align.csv')
ugi_dom_df = pd.read_csv(f'{csv_path}/ugi-dom.csv')
ugi_meta_df = pd.read_csv(f'{csv_path}/ugi-meta.csv')
ugi_polite_df = pd.read_csv(f'{csv_path}/ugi-polite.csv')
ugi_psych_df = pd.read_csv(f'{csv_path}/ugi-psych.csv')

### Combining Data Groups

Each data group from the GAP corpus is combined with its UGI counterpart. Analysis is performed on both the combined data groups and the individual (per-corpus) data groups, since some features are exclusive to the GAP corpus. Every dataset is evaluated with 10-fold cross-validation.
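
As a small illustration of 10-fold cross-validation, the sketch below splits a hypothetical 25-row dataset with scikit-learn's `KFold` and checks that every row lands in exactly one test fold. The dataset size is an assumption chosen for readability. With the default `shuffle=False`, folds follow row order, which may matter when rows from one corpus are stacked before the other.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 25  # hypothetical dataset size, chosen only for illustration
kf = KFold(n_splits=10)  # default shuffle=False: folds follow row order

test_counts = np.zeros(n_rows, dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(np.arange(n_rows)):
    test_counts[test_idx] += 1      # count how often each row is held out
    fold_sizes.append(len(test_idx))

print(fold_sizes)                       # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
print(bool((test_counts == 1).all()))   # True: each row is tested exactly once
```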

In [3]:
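# Note: the id column of meta_df does not necessarily hold unique values;
# rely on the row number (reset by ignore_index=True) to identify rows instead.
# dropna(axis=1) drops every column that contains a NaN, e.g. features present in only one corpus.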
align_df = pd.concat([gap_align_df, ugi_align_df], ignore_index=True).dropna(axis=1)
dom_df = pd.concat([gap_dom_df, ugi_dom_df], ignore_index=True).dropna(axis=1)
meta_df = pd.concat([gap_meta_df, ugi_meta_df], ignore_index=True).dropna(axis=1)
polite_df = pd.concat([gap_polite_df, ugi_polite_df], ignore_index=True).dropna(axis=1)
psych_df = pd.concat([gap_psych_df, ugi_psych_df], ignore_index=True).dropna(axis=1)
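
To make the effect of the cell above concrete, here is a minimal sketch on toy frames (the `feat` column and all values are hypothetical). It shows that a column missing from one corpus is filled with NaN by `pd.concat` and then removed by `dropna(axis=1)`, and that `ignore_index=True` yields a unique row index even when `id` values repeat.

```python
import pandas as pd

# Toy frames: 'Duration' exists only in the GAP-like frame; 'feat' is a placeholder.
gap_toy = pd.DataFrame({"id": [0, 1], "feat": [0.2, 0.5], "Duration": [3.4, 1.9]})
ugi_toy = pd.DataFrame({"id": [0, 1], "feat": [0.7, 0.1]})

stacked = pd.concat([gap_toy, ugi_toy], ignore_index=True)
combined = stacked.dropna(axis=1)  # drop columns containing any NaN

print(stacked.columns.tolist())    # ['id', 'feat', 'Duration']
print(combined.columns.tolist())   # ['id', 'feat']
print(combined.index.tolist())     # [0, 1, 2, 3]
print(combined["id"].is_unique)    # False: ids repeat across corpora
```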

### Predicting AGS

In [4]:
# Fits and evaluates four regressors on one data group.
# lr:   LinearRegression
# rfr:  RandomForestRegressor
# gbr:  GradientBoostingRegressor
# mlpr: MLPRegressor (trained on standardized features)
def data_group_regression(x, y, lr, rfr, gbr, mlpr, feat_percentage=None):
    l_preds = regression_preds(x, y, lr, feat_percentage)
    evaluate_preds('Linear Regression', y, l_preds)
    
    rf_preds = regression_preds(x, y, rfr, feat_percentage)
    evaluate_preds('Random Forest', y, rf_preds)
    
    gb_preds = regression_preds(x, y, gbr, feat_percentage)
    evaluate_preds('Gradient Boosting', y, gb_preds)
    
    mlp_preds = regression_preds(x, y, mlpr, feat_percentage, scaled=True)
    evaluate_preds('Neural Network', y, mlp_preds)


# Returns out-of-fold predictions from 10-fold cross-validation.
# To use less than 100% of the features, set feat_percentage to a value in (0, 100)
def regression_preds(x, y, regressor, feat_percentage=None, scaled=False):
    kf = KFold(n_splits=10)
    preds = []
    
    for train, test in kf.split(x):
        x_train = x.iloc[train]
        y_train = y.iloc[train]
        x_test = x.iloc[test]

        if scaled:
            scaler = StandardScaler().set_output(transform="pandas").fit(x_train)
            x_train = scaler.transform(x_train)
            x_test = scaler.transform(x_test)

        if feat_percentage:
            x_train, x_test = select_features(
                x_train, y_train, x_test, feat_percentage
            )
            
        regressor.fit(x_train, y_train)
        new_preds = regressor.predict(x_test)
        preds.extend(list(new_preds))

    return preds


def select_features(x_train, y_train, x_test, feat_percentage):
    if not (0 < feat_percentage < 100):
        raise ValueError(
            'Parameter feat_percentage must be a value in range (0, 100)'
        )

    # Fit the selector on the training fold only, then apply the same
    # column subset to both the training and the test fold
    selector = SelectPercentile(f_regression, percentile=feat_percentage)
    selector.fit(x_train, y_train)
    selected_feats = x_train.columns[selector.get_support()]

    return x_train[selected_feats], x_test[selected_feats]
        

def evaluate_preds(title, y, preds):
    mse = mean_squared_error(y, preds)
    r2 = r2_score(y, preds)
    print(title)
    print("\tMSE:", mse)
    print("\tR^2 score:", r2)
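
The per-fold scaling and feature selection above can also be expressed with a scikit-learn `Pipeline` and `cross_val_predict`. The sketch below is an alternative formulation, not the code used to produce the results in this notebook. It runs on synthetic data (`X_toy` and `y_toy` are made up), and unlike `regression_preds` it always scales before selecting.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 60 rows, 8 features, target driven by f0 and f3 plus noise
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(60, 8)), columns=[f"f{i}" for i in range(8)])
y_toy = pd.Series(2.0 * X_toy["f0"] - X_toy["f3"] + rng.normal(scale=0.1, size=60))

# cross_val_predict clones and refits the whole pipeline on each training fold,
# so the scaler and the selector never see the held-out rows
model = make_pipeline(
    StandardScaler(),
    SelectPercentile(f_regression, percentile=25),
    LinearRegression(),
)
preds = cross_val_predict(model, X_toy, y_toy, cv=KFold(n_splits=10))

print(len(preds))  # one out-of-fold prediction per row
print("MSE:", mean_squared_error(y_toy, preds))
print("R^2 score:", r2_score(y_toy, preds))
```

Keeping the preprocessing inside the pipeline gives the same leakage protection as fitting the scaler and selector inside the fold loop by hand.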

#### Exp 1 - Combined Align Data Group

In [21]:
df = align_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 25
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 488.07335617132907
	R^2 score: 0.044289437752003336
Random Forest
	MSE: 415.9189364399946
	R^2 score: 0.18557709498257946
Gradient Boosting
	MSE: 393.3041373864552
	R^2 score: 0.229859787420715
Neural Network
	MSE: 2575.4355122462184
	R^2 score: -4.043034802699356

Feature percentage = 50%

Linear Regression
	MSE: 441.87571977639675
	R^2 score: 0.13475036641214244
Random Forest
	MSE: 405.95595586719685
	R^2 score: 0.2050858955440097
Gradient Boosting
	MSE: 462.6408950728179
	R^2 score: 0.09408947577595195
Neural Network
	MSE: 2546.5738166984356
	R^2 score: -3.9865198814674754
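
A negative R^2 score, as reported for the neural network above, means the predictions have a larger squared error than always predicting the mean of `y`. The toy numbers below are made up purely to illustrate how `r2_score` behaves in that case.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # made-up targets

# Predicting the mean everywhere gives R^2 = 0 by definition
mean_baseline = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_baseline)
print(r2_mean)  # 0.0

# Systematically biased predictions: SS_res = 4 * 50^2 = 10000, SS_tot = 500
far_off = y_true + 50.0
r2_far = r2_score(y_true, far_off)
print(r2_far)  # -19.0, i.e. 1 - 10000 / 500
```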


#### Exp 2 - Individual Align Data Groups

In [24]:
gap_df = gap_align_df.join(gap_meta_df)
gap_X = gap_df.drop(['id', 'ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_align_df.join(ugi_meta_df)
ugi_X = ugi_df.drop(['id', 'ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': (gap_X, gap_y), 
    'UGI': (ugi_X, ugi_y)
}

for key, (X, y) in dataset_params.items():
    print('\n', key, 'DATASET', '\n')
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
    mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)
    
    feat_perc = 25
    print(f'\nFeature percentage = {feat_perc}%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


 GAP DATASET 

Feature percentage = 100%

Linear Regression
	MSE: 384.17913727114734
	R^2 score: -1.478187607439419
Random Forest
	MSE: 154.4126566445844
	R^2 score: 0.003945048014594832
Gradient Boosting
	MSE: 145.40477454944448
	R^2 score: 0.06205133128654616
Neural Network
	MSE: 3839.7331738182693
	R^2 score: -23.76859944769599

Feature percentage = 50%

Linear Regression
	MSE: 258.92866566390427
	R^2 score: -0.6702463726087999
Random Forest
	MSE: 157.9888473242889
	R^2 score: -0.019123543078703342
Gradient Boosting
	MSE: 155.79588931943746
	R^2 score: -0.0049776386710354
Neural Network
	MSE: 4321.4941665471215
	R^2 score: -26.87624899475019

 UGI DATASET 

Feature percentage = 100%

Linear Regression
	MSE: 1086.7202667476429
	R^2 score: -11.223961353208585
Random Forest
	MSE: 110.86205643891184
	R^2 score: -0.24703066181168798
Gradient Boosting
	MSE: 138.45492465131963
	R^2 score: -0.5574087461940758
Neural Network
	MSE: 689.852014397013
	R^2 score: -6.759793041000145

Feature per

#### Exp 3 - Combined Dom Data Group

In [57]:
df = dom_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 75
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 388.4520504363389
	R^2 score: 0.23936079928407106
Random Forest
	MSE: 351.59264830679354
	R^2 score: 0.31153626120579714
Gradient Boosting
	MSE: 354.12900153350654
	R^2 score: 0.3065697545573366
Neural Network
	MSE: 3476.0098361477285
	R^2 score: -5.8064754465272745

Feature percentage = 50%

Linear Regression
	MSE: 422.71457328759914
	R^2 score: 0.17227036182385003
Random Forest
	MSE: 471.1124085206859
	R^2 score: 0.0775011601508897
Gradient Boosting
	MSE: 651.1555915383343
	R^2 score: -0.2750466064809649
Neural Network
	MSE: 3257.6556403301797
	R^2 score: -5.378909777114367


#### Exp 4 - Individual Dom Data Groups

In [38]:
gap_df = gap_dom_df.join(gap_meta_df)
gap_X = gap_df.drop(['id', 'ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_dom_df.join(ugi_meta_df)
ugi_X = ugi_df.drop(['id', 'ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': {
        'data': (gap_X, gap_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(2,), max_iter=1000, activation='relu', random_state=0
        )
    },
    'UGI': {
        'data': (ugi_X, ugi_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0
        )
    }
}

for key, params in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()

    X, y = params['data']
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
    mlpr = params['mlpr']
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)
    
    feat_perc = 50
    print(f'\nFeature percentage = {feat_perc}%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


GAP DATASET

Feature percentage = 100%

Linear Regression
	MSE: 197.11177258787373
	R^2 score: -0.2714900542944487
Random Forest
	MSE: 183.5730041335352
	R^2 score: -0.18415681584258192
Gradient Boosting
	MSE: 173.6349375998323
	R^2 score: -0.12005028080096536
Neural Network
	MSE: 4577.586767729348
	R^2 score: -28.528201037525474

Feature percentage = 75%

Linear Regression
	MSE: 202.11850286994041
	R^2 score: -0.303786490344937
Random Forest
	MSE: 217.41893169508901
	R^2 score: -0.40248350281761236
Gradient Boosting
	MSE: 228.1693310258338
	R^2 score: -0.4718300753194751
Neural Network
	MSE: 4300.071504681372
	R^2 score: -26.738059879299612

UGI DATASET

Feature percentage = 100%

Linear Regression
	MSE: 133.35506192185505
	R^2 score: -0.5000429945658138
Random Forest
	MSE: 104.22976954939237
	R^2 score: -0.17242745333052656
Gradient Boosting
	MSE: 111.67308447424939
	R^2 score: -0.25615350203441234
Neural Network
	MSE: 1138.5884770091905
	R^2 score: -11.807400364238356

Feature perc

#### Exp 5 - Combined Polite Data Group

In [55]:
df = polite_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 25
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 324.83108385932957
	R^2 score: 0.36393885495800293
Random Forest
	MSE: 431.44767804518574
	R^2 score: 0.15516981668547958
Gradient Boosting
	MSE: 456.83569612211
	R^2 score: 0.1054568038281617
Neural Network
	MSE: 1624.6313183369473
	R^2 score: -2.1812375968920326

Feature percentage = 25%

Linear Regression
	MSE: 381.3619455851978
	R^2 score: 0.25324413876163654
Random Forest
	MSE: 390.03638437468527
	R^2 score: 0.23625846915303628
Gradient Boosting
	MSE: 478.9728511495369
	R^2 score: 0.062109399979289215
Neural Network
	MSE: 2996.0101571977852
	R^2 score: -4.866574185276956


#### Exp 6 - Combined Psych Data Group

In [49]:
df = psych_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(7,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 10
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 631.1033683199008
	R^2 score: -0.2357817679395957
Random Forest
	MSE: 502.7865893696937
	R^2 score: 0.015479030064928345
Gradient Boosting
	MSE: 611.5648091783905
	R^2 score: -0.19752274989130525
Neural Network
	MSE: 1053.288896310023
	R^2 score: -1.0624754672096093

Feature percentage = 10%

Linear Regression
	MSE: 427.89181385550495
	R^2 score: 0.16213265725242998
Random Forest
	MSE: 368.75830161305635
	R^2 score: 0.27792369873901135
Gradient Boosting
	MSE: 459.05185558683297
	R^2 score: 0.10111727785059921
Neural Network
	MSE: 2334.11868015326
	R^2 score: -3.570505330718696


#### Exp 7 - Combined Align-Dom Group

In [52]:
# 11 features

df = align_df.join(dom_df).join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(6,), max_iter=1000, activation='relu', random_state=3)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 25
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 503.37969696318845
	R^2 score: 0.014317648923169224
Random Forest
	MSE: 420.1252404943165
	R^2 score: 0.17734060929469675
Gradient Boosting
	MSE: 431.989368600223
	R^2 score: 0.15410911673459493
Neural Network
	MSE: 2073.267321411483
	R^2 score: -3.0597247368303746

Feature percentage = 10%

Linear Regression
	MSE: 422.71457328759914
	R^2 score: 0.17227036182385003
Random Forest
	MSE: 471.1124085206859
	R^2 score: 0.0775011601508897
Gradient Boosting
	MSE: 651.1555915383343
	R^2 score: -0.2750466064809649
Neural Network
	MSE: 3486.830018298541
	R^2 score: -5.827662758303721


#### Exp 8 - Combined Align-Polite Group

In [59]:
# 16 features

df = align_df.join(polite_df).join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(8,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 10
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 489.0783161658854
	R^2 score: 0.0423215965058259
Random Forest
	MSE: 432.2966307419384
	R^2 score: 0.1535074578435629
Gradient Boosting
	MSE: 413.9258992433045
	R^2 score: 0.18947971879054892
Neural Network
	MSE: 965.4339032838898
	R^2 score: -0.8904440630781665

Feature percentage = 25%

Linear Regression
	MSE: 436.07062542061846
	R^2 score: 0.14611748965399385
Random Forest
	MSE: 505.39748571392295
	R^2 score: 0.010366558380981994
Gradient Boosting
	MSE: 621.3136693668431
	R^2 score: -0.2166122751320818
Neural Network
	MSE: 2357.429825375033
	R^2 score: -3.616151558739233


#### Exp 9 - Combined Dom-Polite Group

In [64]:
# 9 features

df = dom_df.join(polite_df).join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 25
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 362.0303144192739
	R^2 score: 0.2910979651529927
Random Forest
	MSE: 444.88310727236336
	R^2 score: 0.12886151393059275
Gradient Boosting
	MSE: 484.8536282764667
	R^2 score: 0.05059408011319033
Neural Network
	MSE: 2224.968074788778
	R^2 score: -3.356774371829844

Feature percentage = 10%

Linear Regression
	MSE: 422.71457328759914
	R^2 score: 0.17227036182385003
Random Forest
	MSE: 471.1124085206859
	R^2 score: 0.0775011601508897
Gradient Boosting
	MSE: 651.1555915383343
	R^2 score: -0.2750466064809649
Neural Network
	MSE: 2417.506220749185
	R^2 score: -3.733788886970502


#### Exp 10 - Combined Align-Dom-Polite Group

In [66]:
# 17 features

df = align_df.join(dom_df).join(polite_df).join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(8,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 15
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 561.4961693600501
	R^2 score: -0.09948189741137314
Random Forest
	MSE: 439.75917193919895
	R^2 score: 0.1388948399793556
Gradient Boosting
	MSE: 437.17124701971187
	R^2 score: 0.1439623306517831
Neural Network
	MSE: 897.0960115457932
	R^2 score: -0.7566296597511943

Feature percentage = 5%

Linear Regression
	MSE: 422.71457328759914
	R^2 score: 0.17227036182385003
Random Forest
	MSE: 471.1124085206859
	R^2 score: 0.0775011601508897
Gradient Boosting
	MSE: 651.1555915383343
	R^2 score: -0.2750466064809649
Neural Network
	MSE: 2146.6137620689233
	R^2 score: -3.203346524730234


## Summary of Results

Summary