# Sentential Relation Prediction
*LING 7800: Computational Models of Discourse*

This notebook tests the statistical significance of our findings.

In [45]:
import pandas as pd
import numpy as np
from util import *

In [46]:
%load_ext autoreload
%autoreload 2



### Testing Random Variation of PSRN

Each model in `variance_testing_runs` is a single epoch of PSRN with shuffled neighbors. We will compare the mean and standard deviation of this data set to those of the original PSRN data set (run for 10 epochs), as well as to the standard deviation of the EWN model (10 epochs).

NOTE: Sentence tags were not used in these models; each combined sentence is a concatenated string.

    Creating our data frame:

In [10]:
variance_testing_runs = [
    '../results/PSRN variance test 1.csv',
    '../results/PSRN variance test 2.csv',
    '../results/PSRN variance test 3.csv',
    '../results/PSRN variance test 4.csv',
    '../results/PSRN variance test 5.csv',
]

# Map the raw evaluation column names to shorter display names
rows = {
    'eval_loss': 'loss',
    'eval_accuracy': 'accuracy',
    'eval_precision': 'precision',
    'eval_recall': 'recall',
    'eval_macro_f1': 'macro f1',
    'eval_f1': 'f1',
}

my_data = []

for i, path in enumerate(variance_testing_runs):

    # Keep only rows that have evaluation metrics, rounded to 2 decimals
    df = pd.read_csv(path)[['eval_accuracy', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_macro_f1']].dropna().round(2)
    df.rename(columns=rows, inplace=True)
    df = df.transpose()

    df.columns = [f'test {i + 1}']
    my_data.append(df)

test_variance_df = pd.concat(my_data, axis=1)
test_variance_df

Unnamed: 0,test 1,test 2,test 3,test 4,test 5
accuracy,0.54,0.55,0.54,0.54,0.54
precision,0.42,0.58,0.42,0.42,0.42
recall,0.54,0.55,0.54,0.54,0.54
f1,0.44,0.46,0.45,0.45,0.45
macro f1,0.24,0.25,0.24,0.24,0.25


#### Calculate the mean and standard deviation of each metric (row) across the five runs

In [11]:
mean = test_variance_df.mean(axis=1)
std = test_variance_df.std(axis=1)

variance_metrics_df = pd.concat([mean, std], axis=1)
variance_metrics_df.columns = ['mean', 'stdv']
variance_metrics_df

Unnamed: 0,mean,stdv
accuracy,0.542,0.004472
precision,0.452,0.071554
recall,0.542,0.004472
f1,0.45,0.007071
macro f1,0.244,0.005477
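
The `stdv` column can be reproduced from the rounded table above. The sketch below assumes the displayed (rounded) values as input and shows that `DataFrame.std` returns the sample standard deviation (`ddof=1`) by default. The names `runs` and `summary` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Rounded per-run metrics copied from the table above (rows: metrics, columns: runs).
runs = pd.DataFrame(
    {
        'test 1': [0.54, 0.42, 0.54, 0.44, 0.24],
        'test 2': [0.55, 0.58, 0.55, 0.46, 0.25],
        'test 3': [0.54, 0.42, 0.54, 0.45, 0.24],
        'test 4': [0.54, 0.42, 0.54, 0.45, 0.24],
        'test 5': [0.54, 0.42, 0.54, 0.45, 0.25],
    },
    index=['accuracy', 'precision', 'recall', 'f1', 'macro f1'],
)

# axis=1 aggregates across the five runs, giving one value per metric (row).
summary = pd.DataFrame({
    'mean': runs.mean(axis=1),
    'stdv': runs.std(axis=1),              # sample standard deviation (ddof=1), the pandas default
    'stdv_pop': runs.std(axis=1, ddof=0),  # population standard deviation, for comparison
})
print(summary.round(6))
```

With only five runs, the choice of `ddof` changes the result noticeably, so it is worth stating which one a reported value uses.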


### Testing Effect of Sentence Tags
We ran three models with basic sentence concatenation and no delimiting sentence tags (\<s1></s1> and \<s2></s2>). To measure the effect of sentence markers, we ran the same three data sets (baseline, EWN, and PSRN) with the sentence tags included. For later comparison, we also ran a true direct-neighbors model with sentence tags for 10 epochs.

> NOTE: Model runs without sentence tags ran for 10 epochs each. Model runs with sentence tags ran for 5 epochs each, after we found that the models generally converged within 5 epochs. Because its data set is much smaller, we ran the TEWN model for 10 epochs with sentence tags.

    Creating our data frame:

In [22]:
metrics = ['accuracy', 'precision', 'recall', 'f1', 'macro f1']

# Load data from filepaths
without_tags = [
    '../results/Baseline.csv',
    '../results/PSRN.csv',
    '../results/EWN.csv',
]

with_tags = [
    '../results/EWN tagged.csv',
    '../results/PSRN tagged.csv',
    '../results/Baseline tagged.csv',
    '../results/DN tagged.csv',
]

# Create dataframes for without_tags and with_tags
df_without_tags = create_df(without_tags)
df_with_tags = create_df(with_tags)
df_with_tags.head()


Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
0,EWN tagged,1.0,0.563943,0.534387,0.563943,0.38467,0.518112
1,EWN tagged,2.0,0.569869,0.584768,0.569869,0.480812,0.56381
2,EWN tagged,3.0,0.567686,0.570109,0.567686,0.489462,0.56582
3,EWN tagged,4.0,0.561135,0.547905,0.561135,0.466463,0.552324
4,EWN tagged,5.0,0.560823,0.550473,0.560823,0.470197,0.554198
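
`create_df` is imported from `util`, and its source is not shown in this notebook. Judging only from the output above, it seems to stack one row per evaluated epoch and label each row with the file name. The following is a minimal sketch under those assumptions. The name `create_df_sketch`, the `epoch` column in the CSV, and the demo values are all hypothetical.

```python
import os
import tempfile

import pandas as pd

# Column renames, mirroring the `rows` mapping used earlier in this notebook.
RENAMES = {
    'eval_accuracy': 'accuracy',
    'eval_precision': 'precision',
    'eval_recall': 'recall',
    'eval_macro_f1': 'macro f1',
    'eval_f1': 'f1',
}

def create_df_sketch(filepaths):
    """Hypothetical stand-in for util.create_df: one row per evaluated epoch."""
    frames = []
    for path in filepaths:
        # Select the epoch and eval columns, drop rows lacking eval metrics, then rename.
        df = pd.read_csv(path)[['epoch', *RENAMES]].dropna().rename(columns=RENAMES)
        # Label each row with the file name minus its extension, e.g. 'EWN tagged'.
        model = os.path.splitext(os.path.basename(path))[0]
        df.insert(0, 'model', model)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Usage with a throwaway CSV (made-up values, not real results).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Demo tagged.csv')
    pd.DataFrame({
        'epoch': [1.0, 1.0, 2.0],
        'loss': [0.9, None, None],            # training-only row has no eval metrics
        'eval_accuracy': [None, 0.50, 0.60],
        'eval_precision': [None, 0.40, 0.50],
        'eval_recall': [None, 0.50, 0.60],
        'eval_macro_f1': [None, 0.30, 0.40],
        'eval_f1': [None, 0.45, 0.55],
    }).to_csv(path, index=False)
    demo = create_df_sketch([path])

print(demo)
```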


In [40]:
df_without_tags.head()

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
0,Baseline,1.0,0.572988,0.550824,0.572988,0.431567,0.541218
1,Baseline,2.0,0.578291,0.596606,0.578291,0.483279,0.571692
2,Baseline,3.0,0.568309,0.571203,0.568309,0.477157,0.565131
3,Baseline,4.0,0.548035,0.542355,0.548035,0.459205,0.540959
4,Baseline,5.0,0.566126,0.547418,0.566126,0.452455,0.548344


### Perform an independent two-sample t-test on the different data groups (Tagged v. Untagged)

Assuming the groups are independent and the data are approximately normally distributed, this test tells us whether the means of the two groups differ significantly.

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/)
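
The helpers `compare` and `significance` also come from `util` and are not shown here. As a rough sketch of what they might look like, the code below wraps `scipy.stats.ttest_ind` and labels the p-value against an assumed threshold of 0.05. That threshold is consistent with the printed labels in this notebook but is not confirmed by it. The `_sketch` names and the demo numbers are hypothetical.

```python
import pandas as pd
from scipy import stats

def compare_sketch(metric, df_a, df_b, equal_var=True):
    """Hypothetical stand-in for util.compare: two-sample t-test on one metric column."""
    result = stats.ttest_ind(df_a[metric], df_b[metric], equal_var=equal_var)
    return result.statistic, result.pvalue

def significance_sketch(metric, p_value, alpha=0.05):
    """Hypothetical stand-in for util.significance, with an assumed alpha of 0.05."""
    label = 'Significant!' if p_value < alpha else 'Insignificant!'
    return f'{metric}: P-value = {p_value} ({label})'

# Made-up numbers for illustration only (not taken from the results files).
group_a = pd.DataFrame({'accuracy': [0.54, 0.55, 0.54, 0.56, 0.55]})
group_b = pd.DataFrame({'accuracy': [0.58, 0.59, 0.57, 0.58, 0.60]})

t_stat, p_value = compare_sketch('accuracy', group_a, group_b)
print(significance_sketch('accuracy', p_value))
```

The sign of the t statistic shows which group has the higher mean. Here it is negative because `group_a` has the lower mean.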

In [50]:
for i in metrics:
    t_stat, p_value = compare(i, df_without_tags, df_with_tags)
    result = significance(i, p_value)
    print(result)

accuracy: P-value = 4.8874159350778227e-08 (Significant!)
precision: P-value = 0.05149532395032266 (Insignificant!)
recall: P-value = 4.8874159350778227e-08 (Significant!)
f1: P-value = 0.05940279330739174 (Insignificant!)
macro f1: P-value = 0.44422768008939884 (Insignificant!)


In [58]:
for i in metrics:
    if df_with_tags[i].mean() > df_without_tags[i].mean():
        print(f'{i}: mean is higher with tags by {(df_with_tags[i].mean() - df_without_tags[i].mean()).round(3)}!')
    else:
        print(f'{i}: mean is higher without tags by {(df_without_tags[i].mean() - df_with_tags[i].mean()).round(3)}!')

accuracy: mean is higher with tags by 0.026!
precision: mean is higher with tags by 0.019!
recall: mean is higher with tags by 0.026!
f1: mean is higher with tags by 0.015!
macro f1: mean is higher without tags by 0.013!
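
The same comparison can be done in one step by subtracting the two vectors of column means. This is a sketch with made-up rows standing in for `df_with_tags` and `df_without_tags`. The `_demo` frames and `mean_diff` are illustrative names.

```python
import pandas as pd

metrics = ['accuracy', 'precision', 'recall', 'f1', 'macro f1']

# Made-up rows standing in for df_with_tags / df_without_tags.
with_tags_demo = pd.DataFrame(
    [[0.58, 0.57, 0.58, 0.56, 0.47],
     [0.56, 0.55, 0.56, 0.55, 0.46]],
    columns=metrics,
)
without_tags_demo = pd.DataFrame(
    [[0.55, 0.54, 0.55, 0.54, 0.48],
     [0.54, 0.53, 0.54, 0.53, 0.47]],
    columns=metrics,
)

# Positive values mean the tagged group has the higher mean for that metric.
mean_diff = (with_tags_demo[metrics].mean() - without_tags_demo[metrics].mean()).round(3)
print(mean_diff)
```

Keeping the sign in a single Series makes it easy to line the differences up against the p-values for the same metrics.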


### Looking at highest performers across evaluation metrics (Tagged v. Untagged)

In [34]:
# combine dataframes 'df_with_tags' and 'df_without_tags'
tagged_df = pd.concat([df_with_tags, df_without_tags], axis=0).reset_index(drop=True)

# sort tagged_df by 'accuracy' descending
tagged_df.sort_values(by='accuracy', ascending=False, inplace=True)
pd.set_option('display.max_rows', None)
tagged_df[:15]

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
11,Baseline tagged,2.0,0.593575,0.601594,0.593575,0.49869,0.589199
15,DN tagged,1.0,0.586074,0.343482,0.586074,0.184756,0.433123
14,Baseline tagged,5.0,0.582034,0.576385,0.582034,0.491447,0.578516
17,DN tagged,3.0,0.580271,0.584971,0.580271,0.301162,0.516304
12,Baseline tagged,3.0,0.578603,0.586721,0.578603,0.505223,0.581387
26,Baseline,2.0,0.578291,0.596606,0.578291,0.483279,0.571692
16,DN tagged,2.0,0.576402,0.560016,0.576402,0.235416,0.473379
25,Baseline,1.0,0.572988,0.550824,0.572988,0.431567,0.541218
13,Baseline tagged,4.0,0.572676,0.571,0.572676,0.486132,0.570417
6,PSRN tagged,2.0,0.570805,0.57703,0.570805,0.465127,0.561614


### Testing Effect of Stop Words
After finding that sentence tags raised the mean of most prediction metrics (all except macro f1), we compared those models to ones with stop words removed. Our reasoning is that less context may be better than more. These experiments let us test more fully the impact of context on our prediction task.

> NOTE: Model runs with stop words included ran for 5 epochs each, after we found that the models generally converged within 5 epochs. Because its data set is much smaller, we ran the TEWN model for 10 epochs with sentence tags and stop words, and for an additional 10 epochs with sentence tags and stop words removed. Our other models with stop words removed are a 5-epoch run of the baseline data set, a 2-epoch run of EWN, and a 2-epoch run of PSRN.

    Creating our data frame:

In [5]:
# Load data from filepaths
removed_stopwords = [
    '../results/EWN tagged no stopwords.csv',
    '../results/PSRN tagged no stopwords.csv',
    '../results/DN tagged no stopwords.csv',
    '../results/Baseline tagged no stopwords.csv',
]

with_stopwords = [
    '../results/EWN tagged.csv',
    '../results/PSRN tagged.csv',
    '../results/Baseline tagged.csv',
    '../results/DN tagged.csv',
]

# Create dataframes for removed_stopwords and with_stopwords
df_removed_stopwords = create_df(removed_stopwords)
df_with_stopwords = create_df(with_stopwords)
df_removed_stopwords.head()

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
0,EWN tagged no stopwords,1.0,0.551154,0.510923,0.551154,0.320238,0.483158
1,EWN tagged no stopwords,2.0,0.546475,0.522325,0.546475,0.377935,0.511993
2,PSRN tagged no stopwords,1.0,0.533687,0.483026,0.533687,0.246652,0.43946
3,PSRN tagged no stopwords,2.0,0.550842,0.535048,0.550842,0.340141,0.510961
4,DN tagged no stopwords,1.0,0.611219,0.373588,0.611219,0.189676,0.463734


### Perform an independent two-sample t-test on the different data groups (With Stopwords v. Without Stopwords)

Assuming the groups are independent and the data are approximately normally distributed, this test tells us whether the means of the two groups differ significantly.

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/)
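
The two groups in this comparison differ in size, since some runs without stop words cover only 2 epochs. When group sizes and variances differ, Welch's t-test (`equal_var=False` in `scipy.stats.ttest_ind`) is a common alternative to the pooled-variance test. Whether `compare` already uses it is not shown in this notebook, so the sketch below only illustrates the difference on made-up samples.

```python
from scipy import stats

# Made-up samples of unequal size and spread, for illustration only.
small_group = [0.55, 0.61, 0.53, 0.59]
large_group = [0.56, 0.57, 0.56, 0.57, 0.56, 0.57, 0.56, 0.57, 0.56, 0.57]

student = stats.ttest_ind(small_group, large_group)                 # pooled variance (equal_var=True)
welch = stats.ttest_ind(small_group, large_group, equal_var=False)  # Welch's t-test

print(f'Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}')
print(f'Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}')
```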

In [75]:
for i in metrics:
    t_stat, p_value = compare(i, df_removed_stopwords, df_with_stopwords)
    result = significance(i, p_value)
    print(result)

accuracy: P-value = 0.7801148719952548 (Insignificant!)
precision: P-value = 0.06414336790087483 (Insignificant!)
recall: P-value = 0.7801148719952548 (Insignificant!)
macro f1: P-value = 0.009055062591456244 (Significant!)
f1: P-value = 0.13734329072868712 (Insignificant!)


In [76]:
for i in metrics:
    if df_with_stopwords[i].mean() > df_removed_stopwords[i].mean():
        print(f'{i}: mean is higher with stopwords by {(df_with_stopwords[i].mean() - df_removed_stopwords[i].mean()).round(3)}!')
    else:
        print(f'{i}: mean is higher without stopwords by {(df_removed_stopwords[i].mean() - df_with_stopwords[i].mean()).round(3)}!')

accuracy: mean is higher with stopwords by 0.001!
precision: mean is higher with stopwords by 0.026!
recall: mean is higher with stopwords by 0.001!
macro f1: mean is higher with stopwords by 0.068!
f1: mean is higher with stopwords by 0.016!


### Looking at highest performers across evaluation metrics (With Stopwords v. Without Stopwords)

In [35]:
stopwords_df = pd.concat([df_with_stopwords, df_removed_stopwords], axis=0).reset_index(drop=True)

# sort stopwords_df by 'accuracy' descending
stopwords_df.sort_values(by='accuracy', ascending=False, inplace=True)
pd.set_option('display.max_rows', None)
stopwords_df[:15]

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
30,DN tagged no stopwords,2.0,0.611219,0.456462,0.611219,0.193605,0.467716
29,DN tagged no stopwords,1.0,0.611219,0.373588,0.611219,0.189676,0.463734
32,DN tagged no stopwords,4.0,0.59381,0.53392,0.59381,0.318234,0.549818
11,Baseline tagged,2.0,0.593575,0.601594,0.593575,0.49869,0.589199
15,DN tagged,1.0,0.586074,0.343482,0.586074,0.184756,0.433123
31,DN tagged no stopwords,3.0,0.584139,0.521874,0.584139,0.294002,0.54764
14,Baseline tagged,5.0,0.582034,0.576385,0.582034,0.491447,0.578516
17,DN tagged,3.0,0.580271,0.584971,0.580271,0.301162,0.516304
12,Baseline tagged,3.0,0.578603,0.586721,0.578603,0.505223,0.581387
16,DN tagged,2.0,0.576402,0.560016,0.576402,0.235416,0.473379


### Testing Effect of Sentence Tags \& Stop Words
Our final comparison sets the models run without sentence tags and with stop words against those run with sentence tags and with stop words removed. This highlights the two extremes of our spectrum of models.

> NOTE: Model runs with stop words included and no sentence tags ran for 10 epochs each. Because its data set is much smaller, we ran the TEWN model for 10 epochs with sentence tags and stop words. Our other models with sentence tags and stop words removed are a 5-epoch run of the baseline data set, a 2-epoch run of EWN, and a 2-epoch run of PSRN.

    Creating our data frame:

In [7]:
# Load data from filepaths
removed_stopwords_tags = [
    '../results/EWN tagged no stopwords.csv',
    '../results/PSRN tagged no stopwords.csv',
    '../results/DN tagged no stopwords.csv',
    '../results/Baseline tagged no stopwords.csv',
]

with_stopwords = [
    '../results/Baseline.csv',
    '../results/PSRN.csv',
    '../results/EWN.csv',
]

# Create dataframes for removed_stopwords_tags and with_stopwords
removed_stopwords_tags = create_df(removed_stopwords_tags)
with_stopwords = create_df(with_stopwords)
with_stopwords.head()

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
0,Baseline,1.0,0.572988,0.550824,0.572988,0.431567,0.541218
1,Baseline,2.0,0.578291,0.596606,0.578291,0.483279,0.571692
2,Baseline,3.0,0.568309,0.571203,0.568309,0.477157,0.565131
3,Baseline,4.0,0.548035,0.542355,0.548035,0.459205,0.540959
4,Baseline,5.0,0.566126,0.547418,0.566126,0.452455,0.548344


### Perform an independent two-sample t-test on the different data groups (With Stopwords and No Tags v. Without Stopwords and With Tags)

Assuming the groups are independent and the data are approximately normally distributed, this test tells us whether the means of the two groups differ significantly.

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996580/)

In [79]:
for i in metrics:
    t_stat, p_value = compare(i, removed_stopwords_tags, with_stopwords)
    result = significance(i, p_value)
    print(result)

accuracy: P-value = 0.00015626210212820618 (Significant!)
precision: P-value = 0.5100838248912308 (Insignificant!)
recall: P-value = 0.00015626210212820618 (Significant!)
macro f1: P-value = 0.0004822452508571328 (Significant!)
f1: P-value = 0.8513120846905181 (Insignificant!)


In [78]:
for i in metrics:
    if with_stopwords[i].mean() > removed_stopwords_tags[i].mean():
        print(f'{i}: mean is higher with stopwords and no tags by {(with_stopwords[i].mean() - removed_stopwords_tags[i].mean()).round(3)}!')
    else:
        print(f'{i}: mean is higher with tags and without stopwords by {(removed_stopwords_tags[i].mean() - with_stopwords[i].mean()).round(3)}!')

accuracy: mean is higher with tags and without stopwords by 0.025!
precision: mean is higher with stopwords and no tags by 0.007!
recall: mean is higher with tags and without stopwords by 0.025!
macro f1: mean is higher with stopwords and no tags by 0.082!
f1: mean is higher with stopwords and no tags by 0.002!


### Looking at highest performers across evaluation metrics (With Stopwords and No Tags v. Without Stopwords and With Tags)

In [37]:
stopwords_tags_df = pd.concat([removed_stopwords_tags, with_stopwords], axis=0).reset_index(drop=True)

# sort stopwords_tags_df by 'accuracy' descending
stopwords_tags_df.sort_values(by='accuracy', ascending=False, inplace=True)
pd.set_option('display.max_rows', None)
stopwords_tags_df[:15]

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
4,DN tagged no stopwords,1.0,0.611219,0.373588,0.611219,0.189676,0.463734
5,DN tagged no stopwords,2.0,0.611219,0.456462,0.611219,0.193605,0.467716
7,DN tagged no stopwords,4.0,0.59381,0.53392,0.59381,0.318234,0.549818
6,DN tagged no stopwords,3.0,0.584139,0.521874,0.584139,0.294002,0.54764
20,Baseline,2.0,0.578291,0.596606,0.578291,0.483279,0.571692
9,DN tagged no stopwords,6.0,0.574468,0.514447,0.574468,0.298824,0.532813
19,Baseline,1.0,0.572988,0.550824,0.572988,0.431567,0.541218
21,Baseline,3.0,0.568309,0.571203,0.568309,0.477157,0.565131
18,Baseline tagged no stopwords,5.0,0.567686,0.556424,0.567686,0.458936,0.559883
23,Baseline,5.0,0.566126,0.547418,0.566126,0.452455,0.548344


## Creating a master data frame!

    Creating our data frame:

In [63]:
all_files = [
    '../results/Baseline.csv',
    '../results/PSRN.csv',
    '../results/EWN.csv',
    '../results/EWN tagged.csv',
    '../results/PSRN tagged.csv',
    '../results/Baseline tagged.csv',
    '../results/DN tagged.csv',
    '../results/EWN tagged no stopwords.csv',
    '../results/PSRN tagged no stopwords.csv',
    '../results/DN tagged no stopwords.csv',
    '../results/Baseline tagged no stopwords.csv',
]

all_files_df = create_df(all_files)
all_files_df.head()

# sort all_files_df by 'accuracy' descending
all_files_df.sort_values(by='accuracy', ascending=False, inplace=True)
pd.set_option('display.max_rows', None)
all_files_df

Unnamed: 0,model,epoch,accuracy,precision,recall,macro f1,f1
60,DN tagged no stopwords,2.0,0.611219,0.456462,0.611219,0.193605,0.467716
59,DN tagged no stopwords,1.0,0.611219,0.373588,0.611219,0.189676,0.463734
62,DN tagged no stopwords,4.0,0.59381,0.53392,0.59381,0.318234,0.549818
41,Baseline tagged,2.0,0.593575,0.601594,0.593575,0.49869,0.589199
45,DN tagged,1.0,0.586074,0.343482,0.586074,0.184756,0.433123
61,DN tagged no stopwords,3.0,0.584139,0.521874,0.584139,0.294002,0.54764
44,Baseline tagged,5.0,0.582034,0.576385,0.582034,0.491447,0.578516
47,DN tagged,3.0,0.580271,0.584971,0.580271,0.301162,0.516304
42,Baseline tagged,3.0,0.578603,0.586721,0.578603,0.505223,0.581387
1,Baseline,2.0,0.578291,0.596606,0.578291,0.483279,0.571692


## Compare the effects of different context strategies (Baseline, EWN, PSRN, DN)

In [65]:
methods = ["Baseline", "DN", "EWN", "PSRN"]
metrics = ['accuracy', 'precision', 'recall', 'macro f1', 'f1']

Baseline = all_files_df[all_files_df['model'].str.startswith("Baseline")]
DN = all_files_df[all_files_df['model'].str.startswith("DN")]
EWN = all_files_df[all_files_df['model'].str.startswith("EWN")]
PSRN = all_files_df[all_files_df['model'].str.startswith("PSRN")]

method_map = {
    "Baseline": Baseline,
    "DN": DN,
    "EWN": EWN,
    "PSRN": PSRN,
}
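
The four `startswith` filters and the nested index loops used for the pairwise comparisons can be written more compactly. The sketch below builds the same kind of mapping with a dict comprehension and enumerates each unordered pair once with `itertools.combinations`. The frame `demo_df` and its values are made up and only stand in for `all_files_df`.

```python
from itertools import combinations

import pandas as pd

methods = ['Baseline', 'DN', 'EWN', 'PSRN']

# Made-up rows standing in for all_files_df.
demo_df = pd.DataFrame({
    'model': ['Baseline', 'Baseline tagged', 'DN tagged', 'EWN', 'PSRN tagged no stopwords'],
    'accuracy': [0.57, 0.59, 0.58, 0.55, 0.54],
})

# One filtered frame per context strategy, keyed by the model-name prefix.
method_map = {m: demo_df[demo_df['model'].str.startswith(m)] for m in methods}

# combinations() yields each unordered pair once: 4 methods give 6 pairs.
pairs = list(combinations(methods, 2))
for method_1, method_2 in pairs:
    diff = method_map[method_1]['accuracy'].mean() - method_map[method_2]['accuracy'].mean()
    print(f'{method_1} vs {method_2}: mean accuracy difference = {diff:.3f}')
```

The pairs come out in the same order as the nested `range` loops would produce, so printed results stay comparable.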

In [73]:
for i in range(len(methods)):
    for j in range(i+1, len(methods)):
        method_1 = methods[i]
        method_2 = methods[j]
        
        df1 = method_map[method_1]
        df2 = method_map[method_2]
        
        for metric in metrics:
            t_stat, p_value = compare(metric, df1, df2)
            result = significance(metric, p_value)
            print(f"Comparing {method_1} vs {method_2}: {result}")
        print()

Comparing Baseline vs DN: accuracy: P-value = 0.3058034698159954 (Insignificant!)
Comparing Baseline vs DN: precision: P-value = 0.013406009319660992 (Significant!)
Comparing Baseline vs DN: recall: P-value = 0.3058034698159954 (Insignificant!)
Comparing Baseline vs DN: macro f1: P-value = 8.423218120791035e-09 (Significant!)
Comparing Baseline vs DN: f1: P-value = 0.001981387836402727 (Significant!)

Comparing Baseline vs EWN: accuracy: P-value = 7.407500120130044e-05 (Significant!)
Comparing Baseline vs EWN: precision: P-value = 0.0002804355419967234 (Significant!)
Comparing Baseline vs EWN: recall: P-value = 7.407500120130044e-05 (Significant!)
Comparing Baseline vs EWN: macro f1: P-value = 0.00386320997598372 (Significant!)
Comparing Baseline vs EWN: f1: P-value = 9.382519639023924e-05 (Significant!)

Comparing Baseline vs PSRN: accuracy: P-value = 3.2322172173166152e-06 (Significant!)
Comparing Baseline vs PSRN: precision: P-value = 4.763296472170122e-05 (Significant!)
Comparing B

In [74]:
for i in range(len(methods)):
    for j in range(i+1, len(methods)):
        method_1 = methods[i]
        method_2 = methods[j]
        
        df1 = method_map[method_1]
        df2 = method_map[method_2]
        
        print(f"Comparing {method_1} vs {method_2}:")
        for k in metrics:            
            mean_difference = (df1[k].mean() - df2[k].mean()).round(3)
            if mean_difference > 0:
                print(f"\t{k}: {method_1} mean is higher by {mean_difference}!")
            elif mean_difference < 0:
                print(f"\t{k}: {method_2} mean is higher by {-mean_difference}!")
            else:
                print(f"\t{k}: Both {method_1} and {method_2} have the same mean!")
        print()


Comparing Baseline vs DN:
	accuracy: DN mean is higher by 0.005!
	precision: Baseline mean is higher by 0.038!
	recall: DN mean is higher by 0.005!
	macro f1: Baseline mean is higher by 0.142!
	f1: Baseline mean is higher by 0.03!

Comparing Baseline vs EWN:
	accuracy: Baseline mean is higher by 0.023!
	precision: Baseline mean is higher by 0.027!
	recall: Baseline mean is higher by 0.023!
	macro f1: Baseline mean is higher by 0.039!
	f1: Baseline mean is higher by 0.03!

Comparing Baseline vs PSRN:
	accuracy: Baseline mean is higher by 0.027!
	precision: Baseline mean is higher by 0.035!
	recall: Baseline mean is higher by 0.027!
	macro f1: Baseline mean is higher by 0.06!
	f1: Baseline mean is higher by 0.039!

Comparing DN vs EWN:
	accuracy: DN mean is higher by 0.028!
	precision: EWN mean is higher by 0.011!
	recall: DN mean is higher by 0.028!
	macro f1: EWN mean is higher by 0.103!
	f1: Both DN and EWN have the same mean!

Comparing DN vs PSRN:
	accuracy: DN mean is higher by 0.0