# Group Convo Analysis

## Datasets

Datasets used in the following experiments:

1) GAP corpus
2) UGI corpus

The GAP corpus carries more metadata than the UGI corpus. In particular, only the GAP corpus provides the utterance-level 'Duration' and 'End' metadata fields, which describe the temporal length of an utterance. Features that depend on these fields are therefore available only for the GAP corpus.
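One way to check which fields are exclusive to one corpus is to compare the column sets of the two metadata tables. The sketch below uses small stand-in frames; `exclusive_columns` is a hypothetical helper, and the toy values are invented for illustration, not taken from either corpus.

```python
import pandas as pd


def exclusive_columns(left: pd.DataFrame, right: pd.DataFrame) -> list:
    """Return the columns present in `left` but absent from `right`, in `left` order."""
    return [col for col in left.columns if col not in right.columns]


# Toy stand-ins for the two metadata tables (values are made up).
gap_demo = pd.DataFrame({'id': [1], 'ags': [70.0], 'Duration': [3.2], 'End': [10.5]})
ugi_demo = pd.DataFrame({'id': [1], 'ags': [65.0]})

gap_only = exclusive_columns(gap_demo, ugi_demo)
ugi_only = exclusive_columns(ugi_demo, gap_demo)
print('GAP-only columns:', gap_only)
print('UGI-only columns:', ugi_only)
```

Applied to the real tables, the same call would presumably be `exclusive_columns(gap_meta_df, ugi_meta_df)` once the CSV files are loaded.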

## Analysis

### Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [2]:
csv_path = '../../csv'

gap_align_df = pd.read_csv(f'{csv_path}/gap-align.csv')
gap_dom_df = pd.read_csv(f'{csv_path}/gap-dom.csv')
gap_meta_df = pd.read_csv(f'{csv_path}/gap-meta.csv')
gap_polite_df = pd.read_csv(f'{csv_path}/gap-polite.csv')
gap_psych_df = pd.read_csv(f'{csv_path}/gap-psych.csv')

ugi_align_df = pd.read_csv(f'{csv_path}/ugi-align.csv')
ugi_dom_df = pd.read_csv(f'{csv_path}/ugi-dom.csv')
ugi_meta_df = pd.read_csv(f'{csv_path}/ugi-meta.csv')
ugi_polite_df = pd.read_csv(f'{csv_path}/ugi-polite.csv')
ugi_psych_df = pd.read_csv(f'{csv_path}/ugi-psych.csv')

### Combining Data Groups

Each data group from the GAP corpus is combined with its UGI counterpart. Analysis is performed on both the combined data groups and the individual (per-corpus) data groups, since some features are exclusive to the GAP corpus. 10-fold cross-validation is used for each dataset.

In [3]:
# Note: the id column of meta_df does not necessarily hold unique values after concatenation,
# so rows are identified by their row number (the index reset by ignore_index=True) instead.
align_df = pd.concat([gap_align_df, ugi_align_df], ignore_index=True).dropna(axis=1)
dom_df = pd.concat([gap_dom_df, ugi_dom_df], ignore_index=True).dropna(axis=1)
meta_df = pd.concat([gap_meta_df, ugi_meta_df], ignore_index=True).dropna(axis=1)
polite_df = pd.concat([gap_polite_df, ugi_polite_df], ignore_index=True).dropna(axis=1)
psych_df = pd.concat([gap_psych_df, ugi_psych_df], ignore_index=True).dropna(axis=1)
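The effect of `pd.concat(...).dropna(axis=1)` can be seen on toy data: columns that exist in only one corpus become NaN for the other corpus's rows and are dropped. The frames below are invented stand-ins, not the real CSV contents.

```python
import pandas as pd

# Toy stand-ins: the GAP-like frame has a 'Duration' column, the UGI-like frame does not.
gap_demo = pd.DataFrame({'id': [1, 2], 'ags': [70.0, 55.0], 'Duration': [3.2, 1.1]})
ugi_demo = pd.DataFrame({'id': [1, 2], 'ags': [65.0, 80.0]})

# After concatenation, 'Duration' is NaN for the UGI rows.
combined = pd.concat([gap_demo, ugi_demo], ignore_index=True)

# dropna(axis=1) removes every column that contains at least one NaN.
shared = combined.dropna(axis=1)

print(list(shared.columns))    # columns that survive in the combined frame
print(shared['id'].is_unique)  # ids repeat across corpora
print(shared.index.is_unique)  # the reset row index is unique
```

Note that `dropna(axis=1)` drops any column with a missing value anywhere, so a column shared by both corpora would also be removed if it had a single NaN.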

### Predicting AGS

In [4]:
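# Run all four regressors on one data group and print their cross-validated scores.
# Parameter X = feature DataFrame, y = target Series
# Parameter lr = LinearRegression
# Parameter rfr = RandomForestRegressor
# Parameter gbr = GradientBoostingRegressor
# Parameter mlpr = MLPRegressor (fit on standardized features)
def data_group_regression(X, y, lr, rfr, gbr, mlpr):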
    l_preds = regression_preds(X, y, lr)
    evaluate_preds('Linear Regression', y, l_preds)
    
    rf_preds = regression_preds(X, y, rfr)
    evaluate_preds('Random Forest', y, rf_preds)
    
    gb_preds = regression_preds(X, y, gbr)
    evaluate_preds('Gradient Boosting', y, gb_preds)
    
    mlp_preds = scaled_regression_preds(X, y, mlpr)
    evaluate_preds('Neural Network', y, mlp_preds)


# Collect out-of-fold predictions from unshuffled 10-fold CV (returned in row order)
def regression_preds(X, y, regressor):
    kf = KFold(n_splits=10)
    preds = []
    
    for train, test in kf.split(X):
        X_train = X.iloc[train]
        y_train = y.iloc[train]
        X_test = X.iloc[test]
            
        regressor.fit(X_train, y_train)
        new_preds = regressor.predict(X_test)
        preds.extend(list(new_preds))

    return preds


# Same as regression_preds, but standardizes features using training-fold statistics only
def scaled_regression_preds(X, y, regressor):
    kf = KFold(n_splits=10)
    preds = []
    
    for train, test in kf.split(X):
        X_train = X.iloc[train]
        y_train = y.iloc[train]
        X_test = X.iloc[test]

        scaler = StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
            
        regressor.fit(X_train, y_train)
        new_preds = regressor.predict(X_test)
        preds.extend(list(new_preds))

    return preds


def evaluate_preds(title, y, preds):
    mse = mean_squared_error(y, preds)
    r2 = r2_score(y, preds)
    print(title)
    print("\tMSE:", mse)
    print("\tR^2 score:", r2)
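As a cross-check, the scaled cross-validation loop above can also be expressed with scikit-learn's `make_pipeline` and `cross_val_predict`. The sketch below is an alternative formulation, not the notebook's own method. It runs on synthetic data (`X_demo`, `y_demo` and the feature names are assumptions) and compares the two approaches using a deterministic regressor.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not from the GAP or UGI corpora).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['f1', 'f2', 'f3'])
y_demo = pd.Series(2.0 * X_demo['f1'].to_numpy() + rng.normal(scale=0.1, size=50))

# Pipeline version: the scaler is refit on each training fold automatically.
model = make_pipeline(StandardScaler(), LinearRegression())
cv_preds = cross_val_predict(model, X_demo, y_demo, cv=KFold(n_splits=10))

# Manual version, mirroring the loop in scaled_regression_preds.
manual_preds = []
for train, test in KFold(n_splits=10).split(X_demo):
    scaler = StandardScaler().fit(X_demo.iloc[train])
    reg = LinearRegression().fit(scaler.transform(X_demo.iloc[train]), y_demo.iloc[train])
    manual_preds.extend(reg.predict(scaler.transform(X_demo.iloc[test])))

print(np.allclose(cv_preds, manual_preds))
```

Because `KFold` is unshuffled in both versions, the out-of-fold predictions come back in row order and can be compared directly against `y`.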

#### Exp 1 - Combined Align Data Group

In [5]:
df = align_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)

# MLP results are highly dependent on random_state
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

data_group_regression(X, y, lr, rfr, gbr, mlpr)

Linear Regression
	MSE: 488.07335617132907
	R^2 score: 0.044289437752003336
Random Forest
	MSE: 415.9189364399946
	R^2 score: 0.18557709498257946
Gradient Boosting
	MSE: 393.3041373864552
	R^2 score: 0.229859787420715
Neural Network
	MSE: 2492.211383402689
	R^2 score: -3.880071227728612
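Several of the R^2 scores in these experiments are negative. R^2 is computed as 1 - SS_res / SS_tot, so it drops below zero whenever the predictions have a larger squared error than always predicting the mean of `y`. The toy numbers below are invented to illustrate this and are unrelated to the corpora.

```python
import numpy as np
from sklearn.metrics import r2_score


def manual_r2(y, preds):
    """R^2 = 1 - SS_res / SS_tot, matching sklearn's definition for a single target."""
    ss_res = np.sum((y - preds) ** 2)     # squared error of the predictions
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot


y_true = np.array([10.0, 20.0, 30.0, 40.0])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_preds = y_true[::-1].copy()                 # predictions worse than the mean

baseline_r2 = r2_score(y_true, mean_baseline)
reversed_r2 = r2_score(y_true, reversed_preds)
print(baseline_r2, reversed_r2)
```

A score of zero therefore corresponds to a model that is no better than the mean baseline, and a negative score to one that is worse.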


#### Exp 2 - Individual Align Data Groups

In [6]:
gap_df = gap_align_df.join(gap_meta_df)
gap_X = gap_df.drop(['id', 'ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_align_df.join(ugi_meta_df)
ugi_X = ugi_df.drop(['id', 'ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': (gap_X, gap_y), 
    'UGI': (ugi_X, ugi_y)
}

for key, (X, y) in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
    mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)
    
    data_group_regression(X, y, lr, rfr, gbr, mlpr)


GAP DATASET

Linear Regression
	MSE: 384.17913727114734
	R^2 score: -1.478187607439419
Random Forest
	MSE: 154.4126566445844
	R^2 score: 0.003945048014594832
Gradient Boosting
	MSE: 145.40477454944448
	R^2 score: 0.06205133128654616
Neural Network
	MSE: 3839.7331738182693
	R^2 score: -23.76859944769599

UGI DATASET

Linear Regression
	MSE: 1086.7202667476429
	R^2 score: -11.223961353208585
Random Forest
	MSE: 110.86205643891184
	R^2 score: -0.24703066181168798
Gradient Boosting
	MSE: 138.45492465131963
	R^2 score: -0.5574087461940758
Neural Network
	MSE: 689.852014397013
	R^2 score: -6.759793041000145


#### Exp 3 - Combined Dom Data Group

In [7]:
df = dom_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0)

data_group_regression(X, y, lr, rfr, gbr, mlpr)

Linear Regression
	MSE: 388.4520504363389
	R^2 score: 0.23936079928407106
Random Forest
	MSE: 351.59264830679354
	R^2 score: 0.31153626120579714
Gradient Boosting
	MSE: 354.12900153350654
	R^2 score: 0.3065697545573366
Neural Network
	MSE: 3476.0098361477285
	R^2 score: -5.8064754465272745


#### Exp 4 - Individual Dom Data Groups

In [8]:
gap_df = gap_dom_df.join(gap_meta_df)
gap_X = gap_df.drop(['id', 'ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_dom_df.join(ugi_meta_df)
ugi_X = ugi_df.drop(['id', 'ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': {
        'data': (gap_X, gap_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(2,), max_iter=1000, activation='relu', random_state=0
        )
    },
    'UGI': {
        'data': (ugi_X, ugi_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0
        )
    }
}

for key, params in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()

    X, y = params['data']
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
    mlpr = params['mlpr']
    
    data_group_regression(X, y, lr, rfr, gbr, mlpr)


GAP DATASET

Linear Regression
	MSE: 197.11177258787373
	R^2 score: -0.2714900542944487
Random Forest
	MSE: 183.5730041335352
	R^2 score: -0.18415681584258192
Gradient Boosting
	MSE: 173.6349375998323
	R^2 score: -0.12005028080096536
Neural Network
	MSE: 4577.586767729348
	R^2 score: -28.528201037525474

UGI DATASET

Linear Regression
	MSE: 133.35506192185505
	R^2 score: -0.5000429945658138
Random Forest
	MSE: 104.22976954939237
	R^2 score: -0.17242745333052656
Gradient Boosting
	MSE: 111.67308447424939
	R^2 score: -0.25615350203441234
Neural Network
	MSE: 1138.5884770091905
	R^2 score: -11.807400364238356


#### Exp 5 - Combined Polite Data Group

In [9]:
df = polite_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

data_group_regression(X, y, lr, rfr, gbr, mlpr)

Linear Regression
	MSE: 324.83108385932957
	R^2 score: 0.36393885495800293
Random Forest
	MSE: 431.44767804518574
	R^2 score: 0.15516981668547958
Gradient Boosting
	MSE: 456.83569612211
	R^2 score: 0.1054568038281617
Neural Network
	MSE: 1624.6313183369473
	R^2 score: -2.1812375968920326


#### Exp 6 - Combined Psych Data Group

In [10]:
df = psych_df.join(meta_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=6, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=4, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(7,), max_iter=1000, activation='relu', random_state=0)

data_group_regression(X, y, lr, rfr, gbr, mlpr)

Linear Regression
	MSE: 631.1033683199008
	R^2 score: -0.2357817679395957
Random Forest
	MSE: 502.7865893696937
	R^2 score: 0.015479030064928345
Gradient Boosting
	MSE: 611.5648091783905
	R^2 score: -0.19752274989130525
Neural Network
	MSE: 1053.288896310023
	R^2 score: -1.0624754672096093


## Summary of Results

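On the combined data groups, the best R^2 scores are 0.364 for Polite (Linear Regression), 0.312 for Dom (Random Forest), 0.230 for Align (Gradient Boosting), and 0.015 for Psych (Random Forest).

On the individual corpora, only the GAP Align data group reaches a non-negative R^2 (0.062 with Gradient Boosting and 0.004 with Random Forest). All models score below zero on the UGI Align data group and on both individual Dom data groups.

The Neural Network has a negative R^2 in every experiment.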