# Group Convo Analysis

## Datasets

Datasets used in the following experiments:

1) GAP corpus
2) UGI corpus

The GAP corpus differs from the UGI corpus in that it carries more metadata. In particular, only the GAP corpus includes the utterance-level 'Duration' and 'End' fields, which describe the temporal extent of each utterance. Features that depend on these fields are therefore available for the GAP corpus only.
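
As a rough illustration of how corpus-exclusive fields can be identified, the sketch below compares the columns of two toy utterance tables. Only 'Duration' and 'End' come from the description above; the other column names and all values are hypothetical stand-ins, not the real corpus schema.

```python
import pandas as pd

# Toy stand-ins for one GAP and one UGI utterance table. The column names
# other than 'Duration' and 'End' are hypothetical.
gap_utts = pd.DataFrame({
    "Speaker": ["A", "B"],
    "Sentence": ["hello", "hi there"],
    "Duration": [1.2, 0.8],
    "End": [1.2, 2.0],
})
ugi_utts = pd.DataFrame({
    "Speaker": ["C", "D"],
    "Sentence": ["good morning", "morning"],
})

# Fields present in GAP but absent from UGI; features built on these
# can only be computed for the GAP corpus.
gap_only = sorted(set(gap_utts.columns) - set(ugi_utts.columns))
shared = sorted(set(gap_utts.columns) & set(ugi_utts.columns))

print("GAP-only fields:", gap_only)
print("Shared fields:", shared)
```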

## Analysis

### Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.metrics import mean_squared_error, r2_score

# The MLPRegressor rarely converges within max_iter, so suppress its ConvergenceWarning messages
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [2]:
csv_path = '../../csv'

gap_align_df = pd.read_csv(f'{csv_path}/gap-align.csv')
gap_dom_df = pd.read_csv(f'{csv_path}/gap-dom.csv')
gap_meta_df = pd.read_csv(f'{csv_path}/gap-meta.csv')
gap_polite_df = pd.read_csv(f'{csv_path}/gap-polite.csv')
gap_psych_df = pd.read_csv(f'{csv_path}/gap-psych.csv')
gap_rhythm_df = pd.read_csv(f'{csv_path}/gap-rhythm.csv')

ugi_align_df = pd.read_csv(f'{csv_path}/ugi-align.csv')
ugi_dom_df = pd.read_csv(f'{csv_path}/ugi-dom.csv')
ugi_meta_df = pd.read_csv(f'{csv_path}/ugi-meta.csv')
ugi_polite_df = pd.read_csv(f'{csv_path}/ugi-polite.csv')
ugi_psych_df = pd.read_csv(f'{csv_path}/ugi-psych.csv')
ugi_rhythm_df = pd.read_csv(f'{csv_path}/ugi-rhythm.csv')

### Preprocessing Data

In [3]:
# The average AGS (a meta feature) differs between the GAP and UGI corpora.
# To combine the two datasets, AGS is min-max scaled to [0, 1] within each corpus.

scaler = MinMaxScaler()
# fit_transform refits the scaler on each call, so each corpus uses its own min and max
gap_meta_df['ags'] = scaler.fit_transform(gap_meta_df[['ags']])
ugi_meta_df['ags'] = scaler.fit_transform(ugi_meta_df[['ags']])
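
To make the effect of per-corpus scaling concrete, here is a minimal sketch on made-up AGS values (the numbers and the `gap_toy`/`ugi_toy` names are assumptions for illustration only). Each toy corpus is mapped onto [0, 1] using its own minimum and maximum, so the combined column no longer reflects the original difference in scale.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical AGS values on two different raw scales.
gap_toy = pd.DataFrame({"ags": [40.0, 55.0, 70.0]})
ugi_toy = pd.DataFrame({"ags": [2.0, 3.0, 6.0]})

for toy_df in (gap_toy, ugi_toy):
    # A fresh scaler per corpus: each one learns its own min and max.
    toy_df["ags"] = MinMaxScaler().fit_transform(toy_df[["ags"]]).ravel()

combined = pd.concat([gap_toy, ugi_toy], ignore_index=True)
print(combined)
```

One consequence worth keeping in mind is that, after this step, a scaled AGS of 1.0 means "best group within its own corpus" rather than the same absolute score in both corpora.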

### Combining Data Groups

Each data group from the GAP corpus is combined with its UGI counterpart. Analysis is performed on combined data groups and individual data groups (since some features are exclusive to the GAP corpus). 10-fold cross-validation is used for each dataset.
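
The sketch below illustrates what 10-fold cross-validation means here: every row is predicted exactly once by a model that never saw it during training. It uses synthetic data (`X_toy`, `y_toy` are hypothetical) and compares a manual fold loop with scikit-learn's `cross_val_predict`, which should agree for the simple case without scaling or feature selection.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data; the real features and AGS targets are not used here.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(X_toy["f1"] * 0.5 + rng.normal(scale=0.1, size=50))

kf = KFold(n_splits=10)  # no shuffling, so folds follow row order

# Manual loop: fit on nine folds, predict the held-out fold.
manual_preds = []
for train_idx, test_idx in kf.split(X_toy):
    model = LinearRegression().fit(X_toy.iloc[train_idx], y_toy.iloc[train_idx])
    manual_preds.extend(model.predict(X_toy.iloc[test_idx]))

# Built-in equivalent for the case without scaling or feature selection.
builtin_preds = cross_val_predict(LinearRegression(), X_toy, y_toy, cv=kf)

print(np.allclose(manual_preds, builtin_preds))
```

Because the folds are not shuffled, a combined dataset that stacks one corpus after the other will produce some test folds drawn from a single corpus.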

In [4]:
# The id column of meta_df is not guaranteed to hold unique values, so rows are matched by position (row index) instead
# Align feature values may legitimately be NaN, so dropna(axis=1) can't be used to drop the columns ugi_align_df is missing;
# the GAP-only column 'src-10' is dropped explicitly instead

align_df = pd.concat([gap_align_df, ugi_align_df], ignore_index=True).drop(['src-10'], axis=1)
dom_df = pd.concat([gap_dom_df, ugi_dom_df], ignore_index=True).dropna(axis=1)
meta_df = pd.concat([gap_meta_df, ugi_meta_df], ignore_index=True).dropna(axis=1)
polite_df = pd.concat([gap_polite_df, ugi_polite_df], ignore_index=True).dropna(axis=1)
psych_df = pd.concat([gap_psych_df, ugi_psych_df], ignore_index=True).dropna(axis=1)
rhythm_df = pd.concat([gap_rhythm_df, ugi_rhythm_df], ignore_index=True).dropna(axis=1)
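
A small sketch of why `concat` followed by `dropna(axis=1)` leaves only the shared features, using toy frames with hypothetical column names. A column that exists in only one corpus is filled with NaN for the other corpus's rows and is then dropped.

```python
import pandas as pd

# Toy frames; 'gap_only_feat' exists in one corpus only (hypothetical names).
gap_toy = pd.DataFrame({"shared_feat": [0.1, 0.2], "gap_only_feat": [1.0, 2.0]})
ugi_toy = pd.DataFrame({"shared_feat": [0.3, 0.4]})

stacked = pd.concat([gap_toy, ugi_toy], ignore_index=True)
# Rows from ugi_toy have NaN in 'gap_only_feat', so dropna(axis=1) removes it.
shared_only = stacked.dropna(axis=1)

print(stacked)
print(list(shared_only.columns))
```

The same call would also remove a shared column that contains a genuine NaN anywhere, which is the reason the align group avoids it.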

### Predicting AGS

In [5]:
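# Runs cross-validated regression with four models on one data group and prints their scores
# Parameter lr = LinearRegression instance
# Parameter rfr = RandomForestRegressor instance
# Parameter gbr = GradientBoostingRegressor instance
# Parameter mlpr = MLPRegressor instance (trained on standardized features)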
def data_group_regression(x, y, lr, rfr, gbr, mlpr, feat_percentage=None):
    l_preds = regression_preds(x, y, lr, feat_percentage)
    evaluate_preds('Linear Regression', y, l_preds)
    
    rf_preds = regression_preds(x, y, rfr, feat_percentage)
    evaluate_preds('Random Forest', y, rf_preds)
    
    gb_preds = regression_preds(x, y, gbr, feat_percentage)
    evaluate_preds('Gradient Boosting', y, gb_preds)
    
    mlp_preds = regression_preds(x, y, mlpr, feat_percentage, True)
    evaluate_preds('Neural Network', y, mlp_preds)


# Returns out-of-fold predictions from 10-fold cross-validation, in the row order of x
# To use less than 100% of features, set parameter feat_percentage
def regression_preds(x, y, regressor, feat_percentage=None, scaled=False):
    kf = KFold(n_splits=10)
    preds = []
    
    for train, test in kf.split(x):
        x_train = x.iloc[train]
        y_train = y.iloc[train]
        x_test = x.iloc[test]

        if scaled:
            scaler = StandardScaler().set_output(transform="pandas").fit(x_train)
            x_train = scaler.transform(x_train)
            x_test = scaler.transform(x_test)

        if feat_percentage:
            x_train, x_test = select_features(
                x_train, y_train, x_test, feat_percentage
            )
            
        regressor.fit(x_train, y_train)
        new_preds = regressor.predict(x_test)
        preds.extend(list(new_preds))

    return preds


def select_features(x_train, y_train, x_test, feat_percentage):
    # Keep only the top feat_percentage% of features, ranked by univariate F-score
    if not (0 < feat_percentage < 100):
        raise ValueError(
            'Parameter feat_percentage must be a value in range (0, 100)'
        )

    # Fit on the training fold only so the test fold does not influence selection
    selector = SelectPercentile(f_regression, percentile=feat_percentage)
    selector.fit(x_train, y_train)
    selected_cols = selector.get_support()
    selected_feats = x_train.columns.values[selected_cols]

    x_train_sub = x_train[selected_feats]
    x_test_sub = x_test[selected_feats]

    return x_train_sub, x_test_sub
        

def evaluate_preds(title, y, preds):
    mse = mean_squared_error(y, preds)
    r2 = r2_score(y, preds)
    print(title)
    print("\tMSE:", mse)
    print("\tR^2 score:", r2)
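
To show what percentile-based feature selection does in isolation, the following sketch runs `SelectPercentile` with `f_regression` on synthetic data. The column names `informative` and `noise` and the data itself are assumptions made for this example; with two features and a percentile of 50, only the higher-scoring feature is kept.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic data: the target depends strongly on one feature and not on the other.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "informative": rng.normal(size=100),
    "noise": rng.normal(size=100),
})
y_toy = 3.0 * X_toy["informative"] + rng.normal(scale=0.1, size=100)

# Rank features by univariate F-score and keep the top 50%.
selector = SelectPercentile(f_regression, percentile=50).fit(X_toy, y_toy)
kept = list(X_toy.columns[selector.get_support()])

print(kept)
```

Printing the kept column names inside each fold is one way to check which features a given experiment actually ends up using.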

#### Exp 1 - Combined Meta Data Group

In [6]:
# 2 features

X = meta_df.drop(['id', 'ags'], axis=1)
y = meta_df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

# Feature total_mins is the most useful
feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.07395253467500197
	R^2 score: -0.03468072377639286
Random Forest
	MSE: 0.0696869513448505
	R^2 score: 0.024999676182386743
Gradient Boosting
	MSE: 0.07852747128914231
	R^2 score: -0.09868933075589026
Neural Network
	MSE: 0.09968333986643133
	R^2 score: -0.3946841808021446

Feature percentage = 50%

Linear Regression
	MSE: 0.07050949657032586
	R^2 score: 0.013491325684680922
Random Forest
	MSE: 0.06615564653797393
	R^2 score: 0.0744066779776773
Gradient Boosting
	MSE: 0.0955269761777764
	R^2 score: -0.3365318888143882
Neural Network
	MSE: 0.08620357127386168
	R^2 score: -0.2060867678129612


#### Exp 2 - Combined Align Data Group (Excluding NaN Columns)

In [7]:
# 3 features

df = align_df.join(meta_df['ags']).dropna(axis=1)
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.09104352234755239
	R^2 score: -0.27380323083855407
Random Forest
	MSE: 0.08953790056539868
	R^2 score: -0.2527378563772389
Gradient Boosting
	MSE: 0.13504371030572676
	R^2 score: -0.8894162929591996
Neural Network
	MSE: 0.09234090513943966
	R^2 score: -0.2919551031445442

Feature percentage = 50%

Linear Regression
	MSE: 0.08316517257464988
	R^2 score: -0.16357608742806828
Random Forest
	MSE: 0.09407344356005863
	R^2 score: -0.31619530146762354
Gradient Boosting
	MSE: 0.1361011973681058
	R^2 score: -0.9042117490432273
Neural Network
	MSE: 0.0783306829686039
	R^2 score: -0.0959360365947064


#### Exp 3 - Combined Align Data Group (Excluding NaN Rows)

In [8]:
# 8 features

df = align_df.join(meta_df['ags']).dropna()
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.10253769540559236
	R^2 score: -0.3423241701824218
Random Forest
	MSE: 0.07947434314952079
	R^2 score: -0.04040110611999603
Gradient Boosting
	MSE: 0.115748448397532
	R^2 score: -0.5152665498335949
Neural Network
	MSE: 0.2055160135007113
	R^2 score: -1.690416545742805

Feature percentage = 50%

Linear Regression
	MSE: 0.08149091369612699
	R^2 score: -0.06680009407150256
Random Forest
	MSE: 0.08030585373786231
	R^2 score: -0.05128643717876846
Gradient Boosting
	MSE: 0.11985472295825858
	R^2 score: -0.5690219182419225
Neural Network
	MSE: 0.25858503519227377
	R^2 score: -2.385144764694314


#### Exp 4 - Individual Align Data Groups (Excluding NaN Columns)

In [9]:
# GAP = 4 features, UGI = 3 features

gap_df = gap_align_df.join(gap_meta_df['ags']).dropna(axis=1)
gap_X = gap_df.drop(['ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_align_df.join(ugi_meta_df['ags']).dropna(axis=1)
ugi_X = ugi_df.drop(['ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': (gap_X, gap_y),
    'UGI': (ugi_X, ugi_y)
}

for key, (X, y) in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
    mlpr = MLPRegressor(hidden_layer_sizes=(2,), max_iter=1000, activation='relu', random_state=0)
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)
    
    feat_perc = 50
    print(f'\nFeature percentage = {feat_perc}%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


GAP DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.09482456744768539
	R^2 score: -0.7181972256565372
Random Forest
	MSE: 0.04976166194124794
	R^2 score: 0.0983312469241564
Gradient Boosting
	MSE: 0.0463023639316272
	R^2 score: 0.16101285363044282
Neural Network
	MSE: 0.07628725193575217
	R^2 score: -0.3823057479411702

Feature percentage = 50%

Linear Regression
	MSE: 0.0709166381256623
	R^2 score: -0.28499158140241887
Random Forest
	MSE: 0.06039630224536123
	R^2 score: -0.0943657508919793
Gradient Boosting
	MSE: 0.06989555844647913
	R^2 score: -0.2664898753660092
Neural Network
	MSE: 0.1443133390305805
	R^2 score: -1.6149212743245394

UGI DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.3749657954297069
	R^2 score: -3.875775365020515
Random Forest
	MSE: 0.09943341322158811
	R^2 score: -0.2929578979067451
Gradient Boosting
	MSE: 0.11951465378234642
	R^2 score: -0.5540793634339953
Neural Network
	MSE: 0.3220560709764535
	R^2 score: -3.1877767950081255

Feature percentage = 50%

[output truncated]

#### Exp 5 - Individual Align Data Groups (Excluding NaN Rows)

In [10]:
# GAP = 9 features, UGI = 8 features

gap_df = gap_align_df.join(gap_meta_df['ags']).dropna()
gap_X = gap_df.drop(['ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_align_df.join(ugi_meta_df['ags']).dropna()
ugi_X = ugi_df.drop(['ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': (gap_X, gap_y), 
    'UGI': (ugi_X, ugi_y)
}

for key, (X, y) in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
    mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)
    
    feat_perc = 50
    print(f'\nFeature percentage = {feat_perc}%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


GAP DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.22865801111357653
	R^2 score: -2.723881733059376
Random Forest
	MSE: 0.06014012026721725
	R^2 score: 0.020569215150551212
Gradient Boosting
	MSE: 0.05696748319238145
	R^2 score: 0.0722381909763733
Neural Network
	MSE: 0.19003681675104775
	R^2 score: -2.0949041630405425

Feature percentage = 50%

Linear Regression
	MSE: 0.07815467216084747
	R^2 score: -0.2728124179671203
Random Forest
	MSE: 0.07443690835698023
	R^2 score: -0.21226560987748888
Gradient Boosting
	MSE: 0.08618064590660066
	R^2 score: -0.4035219298546302
Neural Network
	MSE: 0.14395955093366944
	R^2 score: -1.3444984035789695

UGI DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.3749657954297069
	R^2 score: -3.875775365020515
Random Forest
	MSE: 0.09943341322158811
	R^2 score: -0.2929578979067451
Gradient Boosting
	MSE: 0.11951465378234642
	R^2 score: -0.5540793634339953
Neural Network
	MSE: 0.39427695487133646
	R^2 score: -4.126883270389798

Feature percentage = 50%

[output truncated]

#### Exp 6 - Combined Dom Data Group

In [11]:
# 1 feature

df = dom_df.join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

# Only one feature in this data group - no need for feature selection
# feat_perc = 50
# print(f'\nFeature percentage = {feat_perc}%\n')
# data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.0726934259476935
	R^2 score: -0.017064349503219
Random Forest
	MSE: 0.07949499706864026
	R^2 score: -0.11222612537966725
Gradient Boosting
	MSE: 0.11984148163934925
	R^2 score: -0.676719689270527
Neural Network
	MSE: 0.08384967273143305
	R^2 score: -0.17315302918886077


#### Exp 7 - Individual Dom Data Groups

In [12]:
# GAP = 5 features, UGI = 1 feature

gap_df = gap_dom_df.join(gap_meta_df['ags'])
gap_X = gap_df.drop(['ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_dom_df.join(ugi_meta_df['ags'])
ugi_X = ugi_df.drop(['ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': {
        'data': (gap_X, gap_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(3,), max_iter=1000, activation='relu', random_state=0
        ),
        'feat_perc': 50
    },
    'UGI': {
        'data': (ugi_X, ugi_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(1,), max_iter=1000, activation='relu', random_state=0
        ),
        'feat_perc': None
    }
}

for key, params in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()

    X, y = params['data']
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
    mlpr = params['mlpr']
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)

    feat_perc = params['feat_perc']
    
    if feat_perc:
        print(f'\nFeature percentage = {feat_perc}%\n')
        data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


GAP DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.06160913134257484
	R^2 score: -0.11634190797993682
Random Forest
	MSE: 0.07548188669294821
	R^2 score: -0.36771273303931573
Gradient Boosting
	MSE: 0.08179184659854187
	R^2 score: -0.48204761371015437
Neural Network
	MSE: 0.31888671244203387
	R^2 score: -4.778146733112367

Feature percentage = 50%

Linear Regression
	MSE: 0.0680677711860879
	R^2 score: -0.23337083159470784
Random Forest
	MSE: 0.08423520306905906
	R^2 score: -0.5263206161812568
Gradient Boosting
	MSE: 0.12404981997777097
	R^2 score: -1.2477514241927778
Neural Network
	MSE: 0.09503588185273339
	R^2 score: -0.7220261893340678

UGI DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.07971960608133838
	R^2 score: -0.036614262362488326
Random Forest
	MSE: 0.08025321017087461
	R^2 score: -0.043552851711560425
Gradient Boosting
	MSE: 0.11325141747065386
	R^2 score: -0.47263691271964103
Neural Network
	MSE: 0.09384635743589918
	R^2 score: -0.2203079941

#### Exp 8 - Combined Polite Data Group

In [13]:
# 6 features

df = polite_df.join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(3,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.05481660954954535
	R^2 score: 0.23305280213926627
Random Forest
	MSE: 0.06620540932329208
	R^2 score: 0.07371043957345769
Gradient Boosting
	MSE: 0.08037705351000422
	R^2 score: -0.12456710599882914
Neural Network
	MSE: 0.11104649014670652
	R^2 score: -0.5536676775550875

Feature percentage = 50%

Linear Regression
	MSE: 0.061077136933215766
	R^2 score: 0.14546084828636252
Random Forest
	MSE: 0.06587315427961414
	R^2 score: 0.0783590684619988
Gradient Boosting
	MSE: 0.06559523659069316
	R^2 score: 0.08224745547654455
Neural Network
	MSE: 0.08925002169516945
	R^2 score: -0.248710101018782


#### Exp 9 - Combined Psych Data Group

In [14]:
# 13 features

df = psych_df.join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(6,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.07935142360289507
	R^2 score: -0.1102173680313876
Random Forest
	MSE: 0.08034156023789146
	R^2 score: -0.12407051443992967
Gradient Boosting
	MSE: 0.1005927527095222
	R^2 score: -0.40740790883710654
Neural Network
	MSE: 0.2210716960893296
	R^2 score: -2.0930464185090654

Feature percentage = 50%

Linear Regression
	MSE: 0.07131590979355545
	R^2 score: 0.0022086803887370055
Random Forest
	MSE: 0.0869490691179675
	R^2 score: -0.21651713713435106
Gradient Boosting
	MSE: 0.12557182162443262
	R^2 score: -0.7568937137215788
Neural Network
	MSE: 0.1867989260447697
	R^2 score: -1.6135310824713125


#### Exp 10 - Combined Rhythm Data Group

In [15]:
# 9 features

df = rhythm_df.join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.134964926817159
	R^2 score: -0.8883140216532712
Random Forest
	MSE: 0.07120351695456473
	R^2 score: 0.003781185029791745
Gradient Boosting
	MSE: 0.08444319878261519
	R^2 score: -0.181457138938663
Neural Network
	MSE: 0.27695515836084367
	R^2 score: -2.874920108766314

Feature percentage = 50%

Linear Regression
	MSE: 0.11417296609416322
	R^2 score: -0.5974106595962083
Random Forest
	MSE: 0.07136058105067358
	R^2 score: 0.0015836782998956966
Gradient Boosting
	MSE: 0.07687708692362734
	R^2 score: -0.0755985618278916
Neural Network
	MSE: 0.15666227560072885
	R^2 score: -1.1918848004247193


#### Exp 11 - Individual Rhythm Data Groups

In [16]:
# GAP = 13 features, UGI = 9 features

gap_df = gap_rhythm_df.join(gap_meta_df['ags'])
gap_X = gap_df.drop(['ags'], axis=1)
gap_y = gap_df['ags']

ugi_df = ugi_rhythm_df.join(ugi_meta_df['ags'])
ugi_X = ugi_df.drop(['ags'], axis=1)
ugi_y = ugi_df['ags']

dataset_params = {
    'GAP': {
        'data': (gap_X, gap_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(6,), max_iter=1000, activation='relu', random_state=0
        )
    },
    'UGI': {
        'data': (ugi_X, ugi_y),
        'mlpr': MLPRegressor(
            hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0
        )
    }
}

for key, params in dataset_params.items():
    print()
    print(key, 'DATASET')
    print()

    X, y = params['data']
    
    lr = LinearRegression()
    rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
    gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
    mlpr = params['mlpr']
    
    print('Feature percentage = 100%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr)

    feat_perc = 50
    print(f'\nFeature percentage = {feat_perc}%\n')
    data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)


GAP DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.23242054603535947
	R^2 score: -3.2114016244139476
Random Forest
	MSE: 0.05532443519011045
	R^2 score: -0.002464800138489931
Gradient Boosting
	MSE: 0.06503517110731065
	R^2 score: -0.17842088368426245
Neural Network
	MSE: 0.32272626149014044
	R^2 score: -4.847718392649527

Feature percentage = 50%

Linear Regression
	MSE: 0.1619953705823595
	R^2 score: -1.93531522258061
Random Forest
	MSE: 0.05719670583391045
	R^2 score: -0.03638987158825002
Gradient Boosting
	MSE: 0.0722421956528094
	R^2 score: -0.30901034918481685
Neural Network
	MSE: 1.6050972366235725
	R^2 score: -28.08395675410924

UGI DATASET

Feature percentage = 100%

Linear Regression
	MSE: 0.23598552216591723
	R^2 score: -2.068579613133758
Random Forest
	MSE: 0.08713182957174519
	R^2 score: -0.1329972848542742
Gradient Boosting
	MSE: 0.09723392022311436
	R^2 score: -0.26435733244662485
Neural Network
	MSE: 0.32714888178907614
	R^2 score: -3.253999906026662

Feature percentage = 50%

[output truncated]

#### Exp 12 - Combined Meta-Polite Group

In [17]:
# 8 features

df = meta_df.join(polite_df)
X = df.drop(['id', 'ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.06110900613183924
	R^2 score: 0.14501496166945582
Random Forest
	MSE: 0.06836401055455123
	R^2 score: 0.043509134180489295
Gradient Boosting
	MSE: 0.08291306763063218
	R^2 score: -0.16004885030103577
Neural Network
	MSE: 0.1391112818991941
	R^2 score: -0.9463262817626494

Feature percentage = 50%

Linear Regression
	MSE: 0.06189060976942264
	R^2 score: 0.13407943091320684
Random Forest
	MSE: 0.06579199162889658
	R^2 score: 0.07949462697337517
Gradient Boosting
	MSE: 0.07292864559019711
	R^2 score: -0.020355341908301794
Neural Network
	MSE: 0.09287497502203727
	R^2 score: -0.29942735294776934


#### Exp 13 - Combined Align-Polite Group

In [18]:
# 14 features

df = align_df.join(polite_df).join(meta_df['ags']).dropna()
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(7,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.07189293344388915
	R^2 score: 0.05884736490980946
Random Forest
	MSE: 0.07135815498965425
	R^2 score: 0.06584816634164214
Gradient Boosting
	MSE: 0.08086077121311624
	R^2 score: -0.058550879163187464
Neural Network
	MSE: 0.29968615285098565
	R^2 score: -2.9232007785004477

Feature percentage = 50%

Linear Regression
	MSE: 0.06403229188080312
	R^2 score: 0.1617512688988143
Random Forest
	MSE: 0.07019211827911173
	R^2 score: 0.08111278930483645
Gradient Boosting
	MSE: 0.07739211529279633
	R^2 score: -0.013142596272933904
Neural Network
	MSE: 0.26501782883770875
	R^2 score: -2.4693566670382037


#### Exp 14 - Combined Dom-Polite Group

In [19]:
# 7 features

df = dom_df.join(polite_df).join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.056616932151795626
	R^2 score: 0.20786422542159788
Random Forest
	MSE: 0.07041593237448274
	R^2 score: 0.014800396026894092
Gradient Boosting
	MSE: 0.08250981226257516
	R^2 score: -0.15440684549455264
Neural Network
	MSE: 0.08977889653698627
	R^2 score: -0.2561096662469786

Feature percentage = 50%

Linear Regression
	MSE: 0.06198269740559769
	R^2 score: 0.13279101933316861
Random Forest
	MSE: 0.06580581482038664
	R^2 score: 0.07930122468042222
Gradient Boosting
	MSE: 0.06903815130467783
	R^2 score: 0.03407713239267318
Neural Network
	MSE: 0.1549833841918077
	R^2 score: -1.1683951852849739


#### Exp 15 - Combined Psych-Polite Group

In [20]:
# 19 features

df = psych_df.join(polite_df).join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(9,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.08986235563891984
	R^2 score: -0.25727735474303715
Random Forest
	MSE: 0.07597140877683156
	R^2 score: -0.06292708647477041
Gradient Boosting
	MSE: 0.08488349640833304
	R^2 score: -0.18761740738729493
Neural Network
	MSE: 0.2866119739216339
	R^2 score: -3.010030027009468

Feature percentage = 50%

Linear Regression
	MSE: 0.06544867578407391
	R^2 score: 0.08429800914771324
Random Forest
	MSE: 0.0776493881368468
	R^2 score: -0.08640394100494109
Gradient Boosting
	MSE: 0.08627212633391887
	R^2 score: -0.20704593168030172
Neural Network
	MSE: 0.18909319208440573
	R^2 score: -1.6456304940311521


#### Exp 16 - Combined Rhythm-Polite Group

In [21]:
# 15 features

df = rhythm_df.join(polite_df).join(meta_df['ags'])
X = df.drop(['ags'], axis=1)
y = df['ags']

lr = LinearRegression()
rfr = RandomForestRegressor(max_depth=3, max_features=1/3, random_state=0)
gbr = GradientBoostingRegressor(max_depth=3, max_features=1/3, random_state=0)
mlpr = MLPRegressor(hidden_layer_sizes=(7,), max_iter=1000, activation='relu', random_state=0)

print('Feature percentage = 100%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr)

feat_perc = 50
print(f'\nFeature percentage = {feat_perc}%\n')
data_group_regression(X, y, lr, rfr, gbr, mlpr, feat_perc)

Feature percentage = 100%

Linear Regression
	MSE: 0.11398343633873705
	R^2 score: -0.594758921080641
Random Forest
	MSE: 0.06927899586318771
	R^2 score: 0.03070744096543876
Gradient Boosting
	MSE: 0.06473834145764949
	R^2 score: 0.09423640055265958
Neural Network
	MSE: 0.3747174552397324
	R^2 score: -4.242726696292023

Feature percentage = 50%

Linear Regression
	MSE: 0.11324124719525562
	R^2 score: -0.5843748442733869
Random Forest
	MSE: 0.06158652202497833
	R^2 score: 0.13833396700038614
Gradient Boosting
	MSE: 0.05258795763713256
	R^2 score: 0.2642341968529809
Neural Network
	MSE: 0.1914629892318055
	R^2 score: -1.6787866723616247


## Summary of Results

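Most experiments need no detailed summary because their R² scores are negative, meaning the model predicted AGS worse than a constant baseline that always outputs the mean AGS. In other words, the features (or at least the models as configured here) carry little predictive value. The experiments listed below reached an R² score above X on at least one model, with or without feature selection. Each experiment is listed once, under the highest threshold it exceeds: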

X = 0.05
- 1, 5, 15

X = 0.1
- 4, 12, 13

X = 0.2
- 8, 14, 16

The most predictive features are those of the polite data group: all three experiments with an R² score above 0.2 (8, 14, and 16) include them, as do experiments 12, 13, and 15.
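
For readers unfamiliar with negative R² scores, the sketch below uses made-up target values (not taken from the experiments) to show that a constant prediction at the mean of the targets scores 0, and that predictions further from the truth than that baseline score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets for illustration only.
y_true = np.array([0.1, 0.4, 0.5, 0.9])

# A constant prediction equal to the mean of y_true scores (approximately) 0.
mean_preds = np.full_like(y_true, y_true.mean())
# Predictions further from the truth than the mean baseline score below 0.
bad_preds = np.array([0.9, 0.5, 0.4, 0.1])

print(r2_score(y_true, mean_preds))
print(r2_score(y_true, bad_preds))
```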