# Relation Prediction in Argument Mining With Pre-trained Deep Bidirectional Transformers - Core

This notebook generates or shows all plots and tables used in the thesis. Additional material that is not mentioned or shown in the thesis can be found in the other notebook. To run this notebook, first make sure that all requirements are satisfied; the README describes in detail how to do so. Not every training setting has to be rerun before using this notebook: cells that belong to a training setting that has not been run print the command needed to reproduce it instead of the results. The order is the same as in the thesis.

## Table of Contents:
1. [Materials and Methods](#methods)
2. [Comparative Experiments](#results)
3. [Additional Experiments](#addexps)
4. [Critical Analysis](#critana)

## Setup

The following cells import and load the necessary data and libraries.

In [None]:
# Necessary imports and setups
# Basic python imports
import re
import sys 
import os

# For the sentiment baselines
import nltk
# Download Sentiment Lexicon
nltk.download('vader_lexicon')

# Datahandling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Statistics and other stuff
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, classification_report
from wordcloud import WordCloud, STOPWORDS
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from pytorch_pretrained_bert.tokenization import BertTokenizer


# Get the exact splits of the data
sys.path.append(os.path.abspath("../pytorch"))
from run_classifier_dataset_utils import processors

# Settings
# Do not hide rows in pandas
pd.set_option('display.max_rows', 164)
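# Use the fully qualified option name; the short form 'precision' is ambiguous in newer pandas versions
pd.set_option('display.precision', 2)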

In [None]:
# Read all necessary data
df = pd.read_csv('../data/complete_data.tsv', sep='\t').astype({"id": str})
data_stats_org = np.load('../data/stats/data_stats_org.npy', allow_pickle=True).item()
data_stats_resp = np.load('../data/stats/data_stats_resp.npy', allow_pickle=True).item()
data_stats_topic = np.load('../data/stats/data_stats_topic.npy', allow_pickle=True).item()
data_stats_author = pd.read_csv('../data/stats/data_stats_author.tsv', sep='\t')
data_nix_ken = pd.read_csv('../data/stats/data_nix_ken.tsv', sep='\t')

# Init the data processors and get the correct labels
node_pro = processors['node']('both')
_, node_test_df = node_pro.get_dev_examples('../data')
political_ru_pro = processors['political-ru']('both')
pol_ru_test_df = pd.concat(np.array(political_ru_pro.get_splits('../data'))[:,3])
political_as_pro = processors['political-as']('both')
pol_as_test_df = pd.concat(np.array(political_as_pro.get_splits('../data'))[:,3])
political_asu_pro = processors['political-asu']('both')
pol_asu_test_df = pd.concat(np.array(political_asu_pro.get_splits('../data'))[:,3])
agreement_pro = processors['agreement']('both')
ag_test_df = pd.concat(np.array(agreement_pro.get_splits('../data'))[:,3])
political_ru_topics_pro = processors['political-ru-topics']('both')
pol_ru_to_test_df = pd.concat(np.array(political_ru_topics_pro.get_splits('../data'))[:,3])
political_as_topics_pro = processors['political-as-topics']('both')
pol_as_to_test_df = pd.concat(np.array(political_as_topics_pro.get_splits('../data'))[:,3])
agreement_topics_pro = processors['agreement-topics']('both')
_, ag_to_test_df = agreement_topics_pro.get_dev_examples('../data')

# Init the sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Define helper functions
def get_major_acc(x, classes=['Unrelated', 'Attack/Disagreement', 'Support/Agreement']):
    """Returns the accuracy of the major class."""
    return np.divide(x[classes].max(), np.sum(x[classes]))

def get_major_class(x, classes=['Unrelated', 'Attack/Disagreement', 'Support/Agreement']):
    """Returns the name of the major class."""
    return x[classes].astype('float64').idxmax()

def disc_pol(x):
    """Discretize the float sentiment polarity."""
    if x >= 0.00:
        return 'positive'
    else:
        return 'negative'
    
def count_values(x, labels):
    """Count how many rows in x have a label in labels."""
    return x['label'].loc[x['label'].isin(labels)].count()

def convert_preds_topic(df_preds, df_res, replace_dict, count, print_clr=True):
    """Prints the classification reports for every topic."""
    labels = list(replace_dict.values())
    preds = pd.Series()
    for i, row in df_preds.iterrows():
        preds = preds.append(pd.Series(row.values[~row.str.contains('bert*', na=False, regex=True)]))
        if i == count:
            break
    preds = preds.dropna().astype(int)
    df_res['preds'] = preds.values
    df_res = df_res.replace(replace_dict)
    
    if print_clr:
        for name, group in df_res.groupby('topic'):
            print("Topic: {}".format(name))
            print(classification_report(group['label'], group['preds'], labels=labels))
    
    #display(pd.crosstab(df_res['topic'], [df_res['label'], df_res['preds']]))
    results_topic = pd.crosstab(df_res['topic'], [df_res['label'], df_res['preds']])
    results_topic['total'] = results_topic.agg([np.sum], axis=1)
    results_topic['correctness'] = results_topic.apply(
        lambda r: (r[labels[0], labels[0]]+r[labels[1], labels[1]])/r['total'], axis=1)
    display(results_topic.sort_values(by='correctness'))
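
The helpers `get_major_acc` and `get_major_class` above compute a majority-class baseline from per-class counts. The following is a minimal, self-contained sketch of the same idea on a plain list of labels. The function name `majority_baseline` and the toy labels are hypothetical and not part of the thesis code.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    major_label, major_count = counts.most_common(1)[0]
    return major_label, major_count / len(labels)

# Hypothetical toy labels, not taken from any of the datasets
toy_labels = ['Unrelated'] * 6 + ['Attack/Disagreement'] * 3 + ['Support/Agreement'] * 1
label, acc = majority_baseline(toy_labels)
print(label, acc)
```

Any classifier should be compared against this number, since a model that always predicts the largest class already reaches it.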


## Materials and Methods <a name="methods"></a>

Three datasets were used in this thesis. For each dataset, a link to the data and a link to the paper describing it are provided.
- NoDE:
    - Link: http://www-sop.inria.fr/NoDE/NoDE-xml.html
    - Paper: https://pdfs.semanticscholar.org/16d1/6b8a37c5313fa8c8430fddc011f2a98d20c5.pdf
- Political
    - Link: https://dh.fbk.eu/resources/political-argumentation
    - Paper: https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16393/16020
- Agreement
    - Link: https://dh.fbk.eu/resources/agreement-disagreement
    - Paper: https://www.aclweb.org/anthology/C16-1232

In [None]:
# NoDE (consisting of debatepedia train and test and procon)
# General statistics for the datasets
pd.concat((data_stats_topic['debate_train'], data_stats_topic['debate_test'], data_stats_topic['procon']), keys=['train', 'test', 'procon'])

In [None]:
# Example tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
text = "Violent games make youth more agressive/violent."
print(tokenizer.tokenize(text))
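
The cell above shows how BERT splits text into WordPiece tokens. As a rough illustration of the underlying idea, the sketch below implements greedy longest-match-first splitting of a single word against a vocabulary. The vocabulary `toy_vocab` is hypothetical and tiny, so the output of the real `bert-base-uncased` tokenizer may differ; lowercasing and punctuation splitting, which the real tokenizer performs first, are left out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than the real BERT vocabulary
toy_vocab = {"violent", "game", "##s", "ag", "##ress", "##ive"}
print(wordpiece_tokenize("games", toy_vocab))
print(wordpiece_tokenize("agressive", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Misspelled or rare words are therefore not discarded but broken into smaller known pieces, which is why argument lengths in this notebook are reported in WordPiece tokens rather than words.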

In [None]:
# Histogram of ingoing and outgoing links for the arguments in the NoDE debatepedia test set
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4), sharex=True)  # 1 row, 2 columns

data_set = 'debate_test'
df_plot = pd.concat((data_stats_org[data_set].iloc[:-1], 
                     data_stats_resp[data_set].iloc[:-1].rename(
                         columns={'Attacks': 'Attacked', 'Supports': 'Supported'})), keys=['Original', 'Response'])
df_plot.loc['Original'].hist(column=['Total pairs'], ax=ax1, bins=[1,2,3,4,5,6,7,8,9], align='left', ec='black')
df_plot.loc['Response'].hist(column=['Total pairs'], ax=ax2, bins=[1,2,3,4,5,6,7,8,9], align='left', ec='black')

ax1.set_title("Original")
ax2.set_title("Response")
ax1.set_ylabel("Count")
ax1.set_xlabel("Number of ingoing links")
ax2.set_xlabel("Number of outgoing links")
ax1.set_xticks([1,2,3,4,5,6,7,8])
ax1.grid(False)
ax2.set_xticks([1,2,3,4,5,6,7,8])
ax2.grid(False)

plt.tight_layout()
plt.savefig('../data/thesis/node_hist.pdf', bbox_inches='tight')

In [None]:
# Boxplot of the lengths of arguments in WordPiece-Tokens for the debatepedia train and test set
fig, ax = plt.subplots(1,2, figsize=(10,4), sharey=True)  # 1 row, 2 columns

#for data_set, ax in [('debate_train', ax1),('debate_test', ax2)]:
df_plot = pd.concat((df[df['org_dataset'] == 'debate_train'], df[df['org_dataset'] == 'debate_test']))
df_plot = df_plot.rename(columns={'org_len': 'Original', 'response_len': 'Response', 'complete_len': 'Combined'})
df_plot = df_plot.replace({'debate_test': 'Test', 'debate_train': 'Train'})
df_plot.groupby('org_dataset', sort=False).boxplot(ax=ax)
ax[0].set_ylabel('Length in WordPiece tokens')
plt.tight_layout()

plt.savefig('../data/thesis/node_length.pdf', bbox_inches='tight')

In [None]:
# Political
# General statistics for the dataset
data_stats_topic['political']

In [None]:
# Histogram of ingoing and outgoing links for the arguments in the Political dataset
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4), sharex=False, sharey=True)  # 1 row, 2 columns

data_set = 'political'
df_plot = pd.concat((data_stats_org[data_set].iloc[:-1], 
                     data_stats_resp[data_set].iloc[:-1].rename(
                         columns={'Attacks': 'Attacked', 'Supports': 'Supported'})), keys=['Original', 'Response'])
df_plot.loc['Original'].hist(column=['Total pairs'], bins=[1,2,3,4,5,6,7,8,9], ax=ax1, align='left', ec='black')
df_plot.loc['Response'].hist(column=['Total pairs'], bins=[1,2,3,4,5,6,7,8,9], ax=ax2, align='left', ec='black')

ax1.set_title("Original")
ax2.set_title("Response")
ax1.set_ylabel("Count")
ax1.set_xlabel("Number of ingoing links")
ax2.set_xlabel("Number of outgoing links")
ax1.set_xticks([1,2,3,4,5,6,7,8])
ax1.grid(False)
ax2.set_xticks([1,2,3,4,5,6,7,8])
ax2.grid(False)

plt.tight_layout()

plt.savefig('../data/thesis/political_hist.pdf', bbox_inches='tight')

In [None]:
# Boxplot of the lengths of arguments in WordPiece-Tokens for the Political dataset
fig, ax = plt.subplots(1,1, figsize=(5,4), sharey=True)  # 1 row, 1 column

#for data_set, ax in [('debate_train', ax1),('debate_test', ax2)]:
df_plot = df[df['org_dataset'] == 'political']
df_plot = df_plot.rename(columns={'org_len': 'Original', 'response_len': 'Response', 'complete_len': 'Combined'})
df_plot.groupby('org_dataset', sort=False).boxplot(ax=ax)
ax.set_title("")
ax.set_ylabel('Length in WordPiece tokens')
plt.tight_layout()

plt.savefig('../data/thesis/political_length.pdf', bbox_inches='tight')

In [None]:
# Agreement
# General statistics
data_stats_topic['agreement']

In [None]:
# Boxplot of the lengths of arguments in WordPiece-Tokens for the Agreement dataset
fig, ax = plt.subplots(1,1, figsize=(5,4), sharey=True)  # 1 row, 1 column

df_plot = df[df['org_dataset'] == 'agreement']
df_plot = df_plot.rename(columns={'org_len': 'Argument 1', 'response_len': 'Argument 2', 'complete_len': 'Combined'})
df_plot.groupby('org_dataset', sort=False).boxplot(ax=ax)
ax.set_title("")
ax.set_ylabel('Length in WordPiece tokens')
plt.tight_layout()

plt.savefig('../data/thesis/agreement_length.pdf', bbox_inches='tight')

## Comparative Experiments <a name="results"></a>

In the following, the results for the comparative experiments for the three datasets are presented.

In [None]:
# The comparative results on the NoDE dataset. The network is always trained on the train dataset 
# and evaluated on the test dataset. 

# Parameters:
# Fixed: input=both, seq_len=128, warmup_prop=0.1, seed=42-51
# Tested: model=base-uncased,large-uncased, epochs=3,4,5, batch_size=8,12,16, lr=2e-5, 3e-5, 5e-5
# Gradient accumulation: batch_size/4 for bert_large 

# Only run this cell, if the NoDE comparative training has been done
if os.path.isdir("../pytorch/res/node_both_paper"):
    # Read in all testing results
    eval_results = pd.read_csv('../pytorch/res/node_both_paper/eval_results.tsv', sep='\t')
    # Read in all testing predictions
    eval_preds = pd.read_csv('../pytorch/res/node_both_paper/eval_preds.csv')
    # Group all runs with the same settings (different seeds) together
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    # Aggregate mean, min, max, std and median
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    
    # Print the statistics reported in the thesis
    print("Best mean result with parameters and other statistics:\n", results.loc[results['mean'].idxmax()]) 
    print("\nOverall mean:\n", eval_results['acc'].mean())
    print("\nWorst (single) run:\n", eval_results['acc'].min())
    print("\nBest (single) run:\n", eval_results['acc'].max())
    print("\nResults grouped per model:\n", eval_results.groupby('_bert-model')['acc'].agg([np.mean])) 
    
    # Precision and recall for the best mean setting
    bmodel, bepochs, bb, bga, blr = results.loc[results['mean'].idxmax()].name
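    # Select exactly the runs of the best group (all grouping keys, including gradient accumulation)
    best_pred_ps = eval_results.loc[(eval_results['_bert-model'] == bmodel) & 
                           (eval_results['_num_epochs'] == bepochs) & (eval_results['_batch_size'] == bb) &
                           (eval_results['_gradient_acc'] == bga) & (eval_results['_learning_rate'] == blr)].index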
    res = pd.concat([node_test_df.reset_index(drop=True), 
                     eval_preds.iloc[best_pred_ps,:-1].transpose().reset_index(drop=True)], axis=1)
    res = res.replace({0: 'attack', 1: 'support'})
    re_pre_dict = {'Precision support' : [], 'Precision attack': [], 'Recall support': [], 'Recall attack': []}
    for ind in best_pred_ps:
        pre, rec, _, _ = precision_recall_fscore_support(res['label'], res[ind], labels=['support', 'attack'])
        re_pre_dict['Precision support'].append(pre[0])
        re_pre_dict['Precision attack'].append(pre[1])
        re_pre_dict['Recall support'].append(rec[0])
        re_pre_dict['Recall attack'].append(rec[1])
    
    print("\nPrecision and recall for the best mean setting:")
    display(pd.DataFrame(re_pre_dict).agg([np.mean, np.std]))
    
    # Formats and shows the complete result table
    results = results.reset_index()
    results = results.rename(columns={element: element.replace("_", "-") for element in results.columns.tolist()})
    results = results.rename(columns={element: re.sub("^-", "", element) for element in results.columns.tolist()})
    results = results.replace({"bert-base-uncased": 'bbu', "bert-large-uncased": 'blu'})
    results['batch-size'] = results['batch-size'] * results['gradient-acc']
    results = results.drop(columns=["gradient-acc"])
    results.iloc[:,:9].to_csv('../data/thesis/node_all_results.csv', index=False)
    print("\nComplete results table:")
    display(results.iloc[:,:9])
else:
    print("You have to first reproduce the results for the NoDE dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh comp")

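The cell above groups all runs that share the same hyperparameters (and differ only in the seed) and then aggregates their accuracies. The sketch below shows the same pattern without pandas. The run tuples and accuracy values are hypothetical and only illustrate the mechanics; as in the result table above, the batch size is multiplied by the gradient accumulation factor before it is used as part of the key.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical runs: (model, epochs, batch_size, gradient_acc, lr, seed, acc)
runs = [
    ("bert-base-uncased", 3, 8, 1, 2e-5, 42, 0.60),
    ("bert-base-uncased", 3, 8, 1, 2e-5, 43, 0.64),
    ("bert-large-uncased", 3, 4, 2, 2e-5, 42, 0.66),
    ("bert-large-uncased", 3, 4, 2, 2e-5, 43, 0.70),
]

# Collect the accuracies of all seeds under one key per setting
grouped = defaultdict(list)
for model, epochs, batch_size, grad_acc, lr, seed, acc in runs:
    grouped[(model, epochs, batch_size * grad_acc, lr)].append(acc)

# Aggregate per setting (stdev is the sample standard deviation)
summary = {
    key: {"mean": mean(accs), "std": stdev(accs), "min": min(accs), "max": max(accs)}
    for key, accs in grouped.items()
}

# The best setting is the one with the highest mean accuracy over seeds
best = max(summary, key=lambda key: summary[key]["mean"])
print(best, summary[best])
```

Reporting the mean and spread over seeds, rather than a single best run, makes the comparison between settings less sensitive to random initialization.
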
In [None]:
# The comparative results on the Political dataset (Task 1: Related/Unrelated). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 1 comparative training has been done
if os.path.isdir("../pytorch/res/pol_ru"):
    # Read the results
    eval_results = pd.read_csv('../pytorch/res/pol_ru/eval_results.tsv', sep='\t')
    # Read the predictions
    eval_preds = pd.read_csv('../pytorch/res/pol_ru/eval_preds.csv')
    
    # Calculate precision, recall and F1 for all Folds
    preds = pd.Series()
    re_pre_f1_dict = {'Precision related' : [], 'Precision unrelated': [], 
                      'Recall related': [], 'Recall unrelated': [], 'F1 related': [], 'F1 unrelated': []}
    scores_df = pd.DataFrame(re_pre_f1_dict)
    count = 0
    for i, row in eval_preds.iterrows():
        preds_split_i = pd.Series(row.values[~row.str.contains('bert*', na=False, regex=True)]).dropna().astype('int8')    
        preds_split_i = preds_split_i.replace({0: 'related', 1: 'unrelated'})
        preds = preds.append(preds_split_i)
        length = len(preds_split_i)
        pre, rec, f1, support = precision_recall_fscore_support(pol_ru_test_df.iloc[count:count+length]['label'], preds_split_i, labels=['related', 'unrelated'])
        count += length
        scores_df.loc[i] = np.array((pre,rec,f1)).reshape((1,-1))[0]
        if i == 9:
            break
    pol_ru_test_df['preds'] = preds.values

    # Calculate the macro-averages
    scores_df['Average Precision'] = (scores_df['Precision related']+scores_df['Precision unrelated'])/2
    scores_df['Average Recall'] = (scores_df['Recall related']+scores_df['Recall unrelated'])/2
    scores_df['Average F1'] = (scores_df['F1 related']+scores_df['F1 unrelated'])/2

    print("Precision, Recall, F1, Macro-averages (Mean + std) for Task 1 on all folds:")
    display(scores_df.agg([np.mean, np.std, np.min, np.max]))
    
    print('Scores calculated once on the pooled predictions of all folds:')
    print(classification_report(pol_ru_test_df['label'], pol_ru_test_df['preds'], labels=['related', 'unrelated']))

else:
    print('You have to first reproduce the results for the Political dataset Task 1.\n'
          'python run_classifier_ba.py  --task_name "political-ru" --output_dir res/pol_ru/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
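
The cell above reports per-class precision, recall and F1 together with their macro-averages. The sketch below spells out these definitions in plain Python for a single fold. The function names and the toy labels are hypothetical; the notebook itself relies on `precision_recall_fscore_support` from scikit-learn.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from scratch."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

# Hypothetical toy fold, not taken from the Political dataset
y_true = ['related', 'related', 'related', 'related', 'unrelated', 'unrelated']
y_pred = ['related', 'related', 'related', 'unrelated', 'unrelated', 'related']

scores = per_class_scores(y_true, y_pred, ['related', 'unrelated'])
macro = macro_average(scores)
print(scores)
print(macro)
```

Because the macro-average weights every class equally, a weak minority class lowers it more than it would lower plain accuracy.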

In [None]:
# The comparative results on the Political dataset (Task 2: Attack/Support). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 2 comparative training has been done
if os.path.isdir("../pytorch/res/pol_as"):
    # Read the results
    eval_results = pd.read_csv('../pytorch/res/pol_as/eval_results.tsv', sep='\t')
    # Read the predictions
    eval_preds = pd.read_csv('../pytorch/res/pol_as/eval_preds.csv')
    
    # Calculate precision, recall and F1 for all Folds
    preds = pd.Series()
    re_pre_f1_dict = {'Precision attack' : [], 'Precision support': [], 
                  'Recall attack': [], 'Recall support': [], 'F1 attack': [], 'F1 support': []}
    scores_df = pd.DataFrame(re_pre_f1_dict)
    count = 0
    for i, row in eval_preds.iterrows():
        preds_split_i = pd.Series(row.values[~row.str.contains('bert*', na=False, regex=True)]).dropna().astype('int8')    
        preds_split_i = preds_split_i.replace({0: 'attack', 1: 'support'})
        preds = preds.append(preds_split_i)
        length = len(preds_split_i)
        pre, rec, f1, support = precision_recall_fscore_support(pol_as_test_df.iloc[count:count+length]['label'], preds_split_i, labels=['attack', 'support'])
        count += length
        scores_df.loc[i] = np.array((pre,rec,f1)).reshape((1,-1))[0]
        if i == 9:
            break
    pol_as_test_df['preds'] = preds.values

    # Calculate the macro-averages
    scores_df['Average Precision'] = (scores_df['Precision attack']+scores_df['Precision support'])/2
    scores_df['Average Recall'] = (scores_df['Recall attack']+scores_df['Recall support'])/2
    scores_df['Average F1'] = (scores_df['F1 attack']+scores_df['F1 support'])/2

    print("Precision, Recall, F1, Macro-averages (Mean + std) for Task 2 on all folds:")
    display(scores_df.agg([np.mean, np.std, np.min, np.max]))
    
    print('Scores calculated once on the pooled predictions of all folds:')
    print(classification_report(pol_as_test_df['label'], pol_as_test_df['preds'], labels=['attack', 'support']))

else:
    print('You have to first reproduce the results for the Political dataset Task 2.\n'
          'python run_classifier_ba.py  --task_name "political-as" --output_dir res/pol_as/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
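
The Political experiments use stratified cross-validation, meaning that every fold keeps roughly the same label proportions as the full dataset. The sketch below shows one simple way to build such folds. It is an assumption-laden illustration: the function `stratified_folds` is hypothetical, it does not shuffle, and the actual splits in this notebook come from `get_splits` in `run_classifier_dataset_utils`, whose implementation is not shown here.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign example indices to folds so that label proportions stay roughly equal."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Deal the examples of each label round-robin over the folds
    for indices in by_label.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Hypothetical toy labels with a 60/40 class ratio
labels = ['attack'] * 6 + ['support'] * 4
folds = stratified_folds(labels, n_folds=2)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each fold then serves once as the test set while the remaining folds are used for training.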

In [None]:
# The comparative results on the Political dataset (Task 3: Attack/Support/Unrelated). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 3 comparative training has been done
if os.path.isdir("../pytorch/res/pol_asu"):
    # Read the results
    eval_results = pd.read_csv('../pytorch/res/pol_asu/eval_results.tsv', sep='\t')
    # Read the predictions
    eval_preds = pd.read_csv('../pytorch/res/pol_asu/eval_preds.csv')
    
    # Calculate precision, recall and F1 for all Folds
    preds = pd.Series()
    re_pre_f1_dict = {'Precision attack' : [], 'Precision support': [], 'Precision unrelated': [],
                  'Recall attack': [], 'Recall support': [], 'Recall unrelated': [],
                  'F1 attack': [], 'F1 support': [], 'F1 unrelated': []}
    scores_df = pd.DataFrame(re_pre_f1_dict)
    count = 0
    for i, row in eval_preds.iterrows():
        preds_split_i = pd.Series(row.values[~row.str.contains('bert*', na=False, regex=True)]).dropna().astype('int8')    
        preds_split_i = preds_split_i.replace({0: 'attack', 1: 'support', 2: 'unrelated'})
        preds = preds.append(preds_split_i)
        length = len(preds_split_i)
        pre, rec, f1, support = precision_recall_fscore_support(pol_asu_test_df.iloc[count:count+length]['label'], preds_split_i, labels=['attack', 'support', 'unrelated'])
        count += length
        scores_df.loc[i] = np.array((pre,rec,f1)).reshape((1,-1))[0]
        if i == 9:
            break
    pol_asu_test_df['preds'] = preds.values

    # Calculate the macro-averages
    scores_df['Average Precision'] = (scores_df['Precision attack']+scores_df['Precision support']+scores_df['Precision unrelated'])/3
    scores_df['Average Recall'] = (scores_df['Recall attack']+scores_df['Recall support']+scores_df['Recall unrelated'])/3
    scores_df['Average F1'] = (scores_df['F1 attack']+scores_df['F1 support']+scores_df['F1 unrelated'])/3

    print("Precision, Recall, F1, Macro-averages (Mean + std) for Task 3 on all folds:")
    display(scores_df.agg([np.mean, np.std, np.min, np.max]))
    
    print('Scores calculated once on the pooled predictions of all folds:')
    print(classification_report(pol_asu_test_df['label'], pol_asu_test_df['preds'], labels=['attack', 'support', 'unrelated']))
    
    print('Major class baseline:')
    print(classification_report(pol_asu_test_df['label'], ['unrelated']*len(pol_asu_test_df), labels=['attack', 'support', 'unrelated']))


else:
    print('You have to first reproduce the results for the Political dataset Task 3.\n'
          'python run_classifier_ba.py  --task_name "political-asu" --output_dir res/pol_asu/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

In [None]:
# The comparative results on the Agreement dataset. 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=3, batch_size=12, lr=2e-5

# Only run this cell, if the Agreement comparative training has been done
if os.path.isdir("../pytorch/res/agreement_new"):
    # Read the results
    eval_results = pd.read_csv('../pytorch/res/agreement_new/eval_results.tsv', sep='\t')
    # Read the predictions
    eval_preds = pd.read_csv('../pytorch/res/agreement_new/eval_preds.csv')
    
    # Calculate precision, recall and F1 for all Folds
    preds = pd.Series()
    re_pre_f1_dict = {'Precision agreement' : [], 'Precision disagreement': [], 
                  'Recall agreement': [], 'Recall disagreement': [], 'F1 agreement': [], 'F1 disagreement': [], "Accuracy": []}
    scores_df = pd.DataFrame(re_pre_f1_dict)
    count = 0
    for i, row in eval_preds.iterrows():
        preds_split_i = pd.Series(row.values[~row.str.contains('bert*', na=False, regex=True)]).dropna().astype('int8')    
        preds_split_i = preds_split_i.replace({0: 'agreement', 1: 'disagreement'})
        preds = preds.append(preds_split_i)
        length = len(preds_split_i)
        pre, rec, f1, support = precision_recall_fscore_support(ag_test_df.iloc[count:count+length]['label'], preds_split_i, labels=['agreement', 'disagreement'])
        acc = np.array([accuracy_score(ag_test_df.iloc[count:count+length]['label'], preds_split_i)])
        count += length
        scores_df.loc[i] = np.concatenate((pre,rec,f1,acc)).ravel()
        if i == 9:
            break
    ag_test_df['preds'] = preds.values

    print("Precision, Recall, F1, Accuracy (Mean + std) for Agreement on all folds:")
    display(scores_df.agg([np.mean, np.std, np.min, np.max]))


else:
    print('You have to first reproduce the results for the Agreement dataset.\n'
          'python run_classifier_ba.py  --task_name "agreement" --output_dir res/agreement_new/crossval1 --do_cross_val --do_lower_case --num_train_epochs 3 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
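
The cross-validation cells above all follow the same pattern: each row of the prediction file holds the predictions of one fold, and a running offset is used to cut the matching labels out of the concatenated test sets. The sketch below isolates that bookkeeping. It assumes a layout with one row per fold, padded with missing values and containing a string entry that names the run; the data and function names are hypothetical.

```python
import math

def is_prediction(value):
    """True for numeric entries that are not NaN (strings such as run names are skipped)."""
    return isinstance(value, (int, float)) and not math.isnan(value)

def fold_accuracies(pred_rows, pooled_labels):
    """Pair each fold's predictions with its slice of the pooled labels."""
    accuracies = []
    offset = 0
    for row in pred_rows:
        preds = [int(v) for v in row if is_prediction(v)]
        labels = pooled_labels[offset:offset + len(preds)]
        correct = sum(1 for p, t in zip(preds, labels) if p == t)
        accuracies.append(correct / len(preds))
        # Move the offset past this fold so the next fold gets the next slice
        offset += len(preds)
    return accuracies

# Hypothetical prediction rows: fold 1 has three examples, fold 2 has two plus padding
pred_rows = [
    [0, 1, 1, "bert-base-uncased"],
    [1, 0, math.nan, "bert-base-uncased"],
]
# Labels of all folds concatenated in fold order
pooled_labels = [0, 1, 0, 1, 0]

accs = fold_accuracies(pred_rows, pooled_labels)
print(accs)
```

The approach only works if the folds are concatenated in the same order in which their predictions were written, which is why the test sets are built with `pd.concat` over the splits.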

## Additional Experiments <a name="addexps"></a>

In the following, the results of the additional experiments are presented.

In [None]:
# The additional procon results on the NoDE dataset. The network is always trained on the train dataset + procon 
# and evaluated on the test dataset. 

# Parameters:
# Fixed: input=both, seq_len=128, warmup_prop=0.1, seed=42-71
# Tested: model=base-uncased, epochs=4,5, batch_size=12,16, lr=2e-5, 3e-5

# Only run this cell, if the NoDE additional procon training has been done
if os.path.isdir("../pytorch/res/node_both_procon"):
    # Read in all testing results
    eval_results = pd.read_csv('../pytorch/res/node_both_procon/eval_results.tsv', sep='\t')
    # Read in all testing predictions
    eval_preds = pd.read_csv('../pytorch/res/node_both_procon/eval_preds.csv')
    # Group all runs with the same settings (different seeds) together
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    # Aggregate mean, min, max, std and median
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    
    # Print the statistics reported in the thesis
    print("Best mean result with parameters and other statistics:\n", results.loc[results['mean'].idxmax()]) 
    print("\nOverall mean:\n", eval_results['acc'].mean())
    print("\nWorst (single) run:\n", eval_results['acc'].min())
    print("\nBest (single) run:\n", eval_results['acc'].max())
    
    # Formats and shows the complete result table
    results = results.reset_index()
    results = results.rename(columns={element: element.replace("_", "-") for element in results.columns.tolist()})
    results = results.rename(columns={element: re.sub("^-", "", element) for element in results.columns.tolist()})
    results = results.replace({"bert-base-uncased": 'bbu', "bert-large-uncased": 'blu'})
    results['batch-size'] = results['batch-size'] * results['gradient-acc']
    results = results.drop(columns=["gradient-acc"])
    results.iloc[:,:9].to_csv('../data/thesis/node_procon_results.csv', index=False)
    print("\nComplete results table:")
    display(results.iloc[:,:8])
else:
    print("You have to first reproduce the results for the NoDE procon dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh procon")

In [None]:
# The additional results on the Political dataset (Task 1+2). 
# 5-Fold leave-one-group-out (topics) cross-validation is used 

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 1 additional training has been done
if os.path.isdir("../pytorch/res/pol_ru_topics"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_ru_topics/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 1:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 1 additional dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-ru-topics" --output_dir res/pol_ru_topics/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
    
# Only run this cell, if the Political Task 2 additional training has been done
if os.path.isdir("../pytorch/res/pol_as_topics"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_as_topics/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 2:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 2 additional dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-as-topics" --output_dir res/pol_as_topics/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
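
In leave-one-group-out cross-validation, every topic is held out once as the test set while all other topics are used for training, so the model is always evaluated on a topic it has not seen. The sketch below illustrates the splitting logic. The function `leave_one_group_out` and the topic names are hypothetical; the splits actually used here are produced by the `political-ru-topics` and `political-as-topics` processors.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Hypothetical topic assignment for five argument pairs
topics = ['tax', 'tax', 'health', 'defense', 'health']
splits = list(leave_one_group_out(topics))
for held_out, train, test in splits:
    print(held_out, train, test)
```

The number of folds equals the number of distinct groups, which matches the 5 folds for the 5 topics mentioned in the cell above.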

In [None]:
# The additional results on the Agreement dataset. 
# Topic independent 80/20 train/test split

# Parameters:
# Fixed: input=both, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=3, batch_size=12, lr=2e-5

# Only run this cell, if the Agreement additional training has been done
if os.path.isdir("../pytorch/res/agreement_topics_new"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/agreement_topics_new/eval_results.tsv', sep='\t')
    # Aggregate and display
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Agreement additional dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "agreement-topics" --output_dir res/agreement_topics_new/train_test1 --do_train --do_eval --do_lower_case --num_train_epochs 3 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

In [None]:
# The additional only org results on the NoDE dataset. The network is always trained on the train dataset
# and evaluated on the test dataset. 

# Parameters:
# Fixed: input=org, seq_len=128, warmup_prop=0.1, seed=42-71
# Tested: model=base-uncased, epochs=4,5, batch_size=12,16, lr=2e-5, 3e-5

# Only run this cell, if the NoDE additional org training has been done
if os.path.isdir("../pytorch/res/node_org"):
    # Read in all testing results
    eval_results = pd.read_csv('../pytorch/res/node_org/eval_results.tsv', sep='\t')
    # Read in all testing predictions
    eval_preds = pd.read_csv('../pytorch/res/node_org/eval_preds.csv')
    # Group all runs with the same settings (different seeds) together
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    # Aggregate mean, min, max, std and median
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    
    # Print the statistics reported in the thesis
    print("Best mean result with parameters and other statistics:\n", results.loc[results['mean'].idxmax()]) 
    print("\nOverall mean:\n", eval_results['acc'].mean())
    print("\nWorst (single) run:\n", eval_results['acc'].min())
    print("\nBest (single) run:\n", eval_results['acc'].max())
    
    # Formats and shows the complete result table
    results = results.reset_index()
    results = results.rename(columns={element: element.replace("_", "-") for element in results.columns.tolist()})
    results = results.rename(columns={element: re.sub("^-", "", element) for element in results.columns.tolist()})
    results = results.replace({"bert-base-uncased": 'bbu', "bert-large-uncased": 'blu'})
    results['batch-size'] = results['batch-size'] * results['gradient-acc']
    results = results.drop(columns=["gradient-acc"])
    results.iloc[:,:9].to_csv('../data/thesis/node_onlyorg_results.csv', index=False)
    print("\nComplete results table:")
    display(results.iloc[:,:8])
else:
    print("You have to first reproduce the results for the NoDE org dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh only")

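The NoDE cells group all runs that share the same hyperparameters (and differ only in the seed) and then report mean, min, max, std and median of the accuracy. The following is a minimal standard-library sketch of that aggregation, not the code used for the thesis. The function name `aggregate_runs`, the key names and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def aggregate_runs(runs, keys, metric="acc"):
    """Group runs by their hyperparameter values and summarise `metric`."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in keys)].append(run[metric])

    summary = {}
    for setting, scores in groups.items():
        summary[setting] = {
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
            # Sample standard deviation (ddof=1), like pandas' `std`
            "std": stdev(scores) if len(scores) > 1 else float("nan"),
            "median": median(scores),
        }
    return summary


# Hypothetical toy runs: two settings, two seeds each
runs = [
    {"lr": 2e-5, "epochs": 4, "seed": 42, "acc": 0.60},
    {"lr": 2e-5, "epochs": 4, "seed": 43, "acc": 0.70},
    {"lr": 3e-5, "epochs": 5, "seed": 42, "acc": 0.50},
    {"lr": 3e-5, "epochs": 5, "seed": 43, "acc": 0.60},
]
summary = aggregate_runs(runs, keys=["lr", "epochs"])

# The "best mean result" is the setting with the highest mean over its seeds
best_setting = max(summary, key=lambda s: summary[s]["mean"])
print(best_setting, summary[best_setting])
```

Selecting the best setting by its mean over seeds, rather than by the best single run, keeps one lucky seed from deciding the reported configuration.
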
In [None]:
# The additional only org results on the Political dataset (Task 1+2). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=org, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 1 additional org training has been done
if os.path.isdir("../pytorch/res/pol_ru_org"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_ru_org/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 1:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 1 additional org dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-ru" --output_dir res/pol_ru_org/crossval1 --input_to_use "org" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
# Only run this cell, if the Political Task 2 additional org training has been done
if os.path.isdir("../pytorch/res/pol_as_org"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_as_org/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 2:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 2 additional org dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-as" --output_dir res/pol_as_org/crossval1 --input_to_use "org" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

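The Political results are obtained with 10-fold stratified cross-validation, meaning every fold keeps roughly the same label distribution as the full dataset. The exact splits used in the thesis come from the `processors` imported in the setup. The helper below (`stratified_folds`, a hypothetical name) only sketches the idea with the standard library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=42):
    """Assign example indices to k folds, keeping label proportions similar."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this label round-robin over the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Hypothetical toy labels: 30 attack and 20 support examples
labels = ["attack"] * 30 + ["support"] * 20
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    print(len(train_idx), len(test_idx))
```

Each fold serves once as the test set while the remaining nine folds form the training set.
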
In [None]:
# The additional only resp results on the NoDE dataset. The network is always trained on the train dataset
# and evaluated on the test dataset. 

# Parameters:
# Fixed: input=response, seq_len=128, warmup_prop=0.1, seed=42-71
# Tested: model=base-uncased, epochs=4,5, batch_size=12,16, lr=2e-5, 3e-5

# Only run this cell, if the NoDE additional resp training has been done
if os.path.isdir("../pytorch/res/node_resp"):
    # Read in all testing results
    eval_results = pd.read_csv('../pytorch/res/node_resp/eval_results.tsv', sep='\t')
    # Read in all testing predictions
    eval_preds = pd.read_csv('../pytorch/res/node_resp/eval_preds.csv')
    # Group all runs with the same settings (different seeds) together
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    # Aggregate mean, min, max, std and median
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    
    # Print the statistics reported in the thesis
    print("Best mean result with parameters and other statistics:\n", results.loc[results['mean'].idxmax()]) 
    print("\nOverall mean:\n", eval_results['acc'].mean())
    print("\nWorst (single) run:\n", eval_results['acc'].min())
    print("\nBest (single) run:\n", eval_results['acc'].max())
    
    # Formats and shows the complete result table
    results = results.reset_index()
    results = results.rename(columns={element: element.replace("_", "-") for element in results.columns.tolist()})
    results = results.rename(columns={element: re.sub("^-", "", element) for element in results.columns.tolist()})
    results = results.replace({"bert-base-uncased": 'bbu', "bert-large-uncased": 'blu'})
    results['batch-size'] = results['batch-size'] * results['gradient-acc']
    results = results.drop(columns=["gradient-acc"])
    results.iloc[:,:9].to_csv('../data/thesis/node_onlyresp_results.csv', index=False)
    print("\nComplete results table:")
    display(results.iloc[:,:8])
else:
    print("You have to first reproduce the results for the NoDE resp dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh only")

In [None]:
# The additional only resp results on the Political dataset (Task 1+2). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=response, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 1 additional resp training has been done
if os.path.isdir("../pytorch/res/pol_ru_resp"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_ru_resp/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 1:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 1 additional resp dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-ru" --output_dir res/pol_ru_resp/crossval1 --input_to_use "response" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
# Only run this cell, if the Political Task 2 additional resp training has been done
if os.path.isdir("../pytorch/res/pol_as_resp"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_as_resp/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 2:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 2 additional resp dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-as" --output_dir res/pol_as_resp/crossval1 --input_to_use "response" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

In [None]:
# The additional reversed results on the NoDE dataset. The network is always trained on the train dataset
# and evaluated on the test dataset. 

# Parameters:
# Fixed: input=resp-org, seq_len=128, warmup_prop=0.1, seed=42-71
# Tested: model=base-uncased, epochs=4,5, batch_size=12,16, lr=2e-5, 3e-5

# Only run this cell, if the NoDE additional resp-org training has been done
if os.path.isdir("../pytorch/res/node_both_reversed"):
    # Read in all testing results
    eval_results = pd.read_csv('../pytorch/res/node_both_reversed/eval_results.tsv', sep='\t')
    # Read in all testing predictions
    eval_preds = pd.read_csv('../pytorch/res/node_both_reversed/eval_preds.csv')
    # Group all runs with the same settings (different seeds) together
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    # Aggregate mean, min, max, std and median
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    
    # Print the statistics reported in the thesis
    print("Best mean result with parameters and other statistics:\n", results.loc[results['mean'].idxmax()]) 
    print("\nOverall mean:\n", eval_results['acc'].mean())
    print("\nWorst (single) run:\n", eval_results['acc'].min())
    print("\nBest (single) run:\n", eval_results['acc'].max())
    
    # Formats and shows the complete result table
    results = results.reset_index()
    results = results.rename(columns={element: element.replace("_", "-") for element in results.columns.tolist()})
    results = results.rename(columns={element: re.sub("^-", "", element) for element in results.columns.tolist()})
    results = results.replace({"bert-base-uncased": 'bbu', "bert-large-uncased": 'blu'})
    results['batch-size'] = results['batch-size'] * results['gradient-acc']
    results = results.drop(columns=["gradient-acc"])
    results.iloc[:,:9].to_csv('../data/thesis/node_resporg_results.csv', index=False)
    print("\nComplete results table:")
    display(results.iloc[:,:8])
else:
    print("You have to first reproduce the results for the NoDE resp-org dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh resporg")

In [None]:
# The additional reversed results on the Political dataset (Task 1+2). 
# 10-Fold stratified cross-validation is used 

# Parameters:
# Fixed: input=resp-org, seq_len=256, warmup_prop=0.1, seed=42
# model=base-uncased, epochs=5, batch_size=12, lr=2e-5

# Only run this cell, if the Political Task 1 additional resp-org training has been done
if os.path.isdir("../pytorch/res/pol_ru_resporg"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_ru_resporg/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 1:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 1 additional resp-org dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-ru" --output_dir res/pol_ru_resporg/crossval1 --input_to_use "response-org" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')
# Only run this cell, if the Political Task 2 additional resp-org training has been done
if os.path.isdir("../pytorch/res/pol_as_resporg"):
    # Load results
    eval_results = pd.read_csv('../pytorch/res/pol_as_resporg/eval_results.tsv', sep='\t')
    # Aggregate and display
    print("Task 2:")
    results = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate'])['f1'].agg([np.mean, np.min, np.max, np.std])
    display(results)
else:
    print('You have to first reproduce the results for the Political Task 2 additional resp-org dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-as" --output_dir res/pol_as_resporg/crossval1 --input_to_use "response-org" --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

## Critical Analysis  <a name="critana"></a>

Critical Analysis
- Duplicates in Agreement
- Author distribution + unique arguments per author
- WordClouds Author
- NoDE:
    - Major
    - Major by topic
    - Sent 1+2
    - Output with respect to the topics
- Pol:
    - RU:
        - Major
        - Major by topic
        - Major by org
        - Major by response
        - Major by author
        - Performance over topics (10-CV, 5-CV)
    - AS:
        - Major
        - Major by topic
        - Major by org
        - Major by response
        - Major by author
        - Sent 1+2
        - Performance over topics (10-CV, 5-CV)
- Agreement:
    - Major
    - Major by topic
    - Performance over topics (10-CV, train/test)

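The "Major" entries in the outline above refer to majority-class baselines: always predict the most frequent label and report the resulting accuracy. The notebook's own helpers `get_major_acc` and `get_major_class` are defined elsewhere and may differ in detail. The function below (`majority_baseline`, a hypothetical name) is only a minimal sketch of the idea.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    major_class, major_count = counts.most_common(1)[0]
    return major_class, major_count / len(labels)


# Hypothetical toy labels
labels = ["attack", "support", "attack", "attack"]
major_class, major_acc = majority_baseline(labels)
print(major_class, major_acc)  # attack 0.75
```

The "Major by topic", "Major by org", "Major by response" and "Major by author" variants apply the same rule within each group instead of over the whole dataset.
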
In [None]:
# The first few duplicates in the agreement Dataset (~4000 arg1,arg2 pairs occur twice, sometimes in different topics)
print("Agreement Duplicates:")
df_check = df[df['org_dataset'] == 'agreement']
display(df_check[df_check.duplicated(subset=['org', 'response'], keep=False)].head())

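With `keep=False`, `duplicated` marks every copy of a repeated (`org`, `response`) pair, not only the later occurrences. A minimal standard-library sketch of the same check is shown here. The function `duplicated_pairs` and the toy rows are hypothetical.

```python
from collections import Counter


def duplicated_pairs(rows, keys=("org", "response")):
    """Return every row whose key tuple occurs more than once (all copies kept)."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] > 1]


# Hypothetical toy rows: the first two share the same pair but differ in topic
rows = [
    {"org": "a", "response": "b", "topic": "t1"},
    {"org": "a", "response": "b", "topic": "t2"},
    {"org": "a", "response": "c", "topic": "t1"},
]
dups = duplicated_pairs(rows)
print(len(dups))  # 2
```
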
In [None]:
# Political by author
# Arguments by the same author mostly support each other
# Arguments by different authors mostly attack each other
# The dataset is heavily imbalanced with respect to the author: Kennedy occurs far more often
print("Number of unique arguments per author:")
display(data_nix_ken.groupby("author").nunique())
print("Label distribution for the different author combinations:")
data_stats_author.iloc[:,:-2].to_csv('../data/thesis/author_imbalance.csv', index=False)
display(data_stats_author.iloc[:,:-2].style.background_gradient(cmap='Blues'))

In [None]:
# Wordclouds for Kennedy and for Nixon
# Both often mention the other candidate by name; Nixon talks about President Eisenhower
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,10))  # 1 row, 2 columns

stopwords = set(STOPWORDS)  
wordcloud = WordCloud(
    stopwords=stopwords, random_state=5).generate(
    " ".join(text for text in data_nix_ken.loc[data_nix_ken["author"] == 'Nixon', 'text']))
ax1.imshow(wordcloud, interpolation="bilinear")
ax1.set_title("Nixon WordCloud")
ax1.set_axis_off()
wordcloud = WordCloud(
    stopwords=stopwords, random_state=5).generate(
    " ".join(text for text in data_nix_ken.loc[data_nix_ken["author"] == 'Kennedy', 'text']))
ax2.imshow(wordcloud, interpolation="bilinear")
ax2.set_title("Kennedy WordCloud")
ax2.set_axis_off()


plt.tight_layout()
plt.savefig('../data/thesis/authors_wordcloud.pdf', bbox_inches='tight')

In [None]:
# All baselines for the NoDE (test) dataset + predictions on NoDE with respect to the topics

# Major class baseline 
data = data_stats_topic['debate_test']
data['major_acc'] = data.apply(get_major_acc, args=[['Attack','Support']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['Attack','Support']], axis=1)
data = data.set_index('Topic')
print("Major class baseline NoDE test:")
display(data.iloc[-1:])

# Major class topic baseline
print("Major class by topic NoDE test:")
display(data)
print("Major by topic baseline overall NoDE test:\n")
node_test_df['major_topic_pred'] = node_test_df['topic'].apply(
    lambda r: data.loc[r, 'major_class']).replace({'Attack': 'attack', 'Support': 'support'})
print(classification_report(node_test_df['label'], node_test_df['major_topic_pred']))

# Sentiment baselines
node_test_df['org_polarity'] = node_test_df['org'].apply(lambda r: disc_pol(sid.polarity_scores(r)['compound']))
node_test_df['resp_polarity'] = node_test_df['response'].apply(lambda r: disc_pol(sid.polarity_scores(r)['compound']))
node_test_df['sent_both_baseline'] = node_test_df.apply(
    lambda r: 'attack' if r['org_polarity'] != r['resp_polarity'] else 'support', axis=1)
node_test_df['sent_resp_baseline'] = node_test_df.apply(
    lambda r: 'attack' if r['resp_polarity'] == 'negative' else 'support', axis=1)
print("\nSentiment 1 baseline (both arguments same sent==support, otherwise==attack):\n")
print(classification_report(node_test_df['label'], node_test_df['sent_both_baseline']))
print("\nSentiment 2 baseline (response negative sentiment==attack, otherwise==support):\n")
print(classification_report(node_test_df['label'], node_test_df['sent_resp_baseline']))

# Predictions with respect to the topics (only if NoDE comparative was trained)
if os.path.isdir("../pytorch/res/node_both_paper"):
    # Load the data
    eval_results = pd.read_csv('../pytorch/res/node_both_paper/eval_results.tsv', sep='\t')
    eval_results_grouped = eval_results.groupby(['_bert-model', '_num_epochs', '_batch_size','_gradient_acc' ,'_learning_rate' ])
    eval_preds = pd.read_csv('../pytorch/res/node_both_paper/eval_preds.csv')
    results = eval_results_grouped['acc'].agg([np.mean, np.min, np.max, np.std, np.median])
    bmodel, bepochs, bb, bga, blr = results.loc[results['mean'].idxmax()].name
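    # Select all runs of the best setting (same columns as used for the grouping above)
    best_pred_ps = eval_results.loc[(eval_results['_bert-model'] == bmodel) & 
                           (eval_results['_num_epochs'] == bepochs) & (eval_results['_batch_size'] == bb) &
                           (eval_results['_gradient_acc'] == bga) & (eval_results['_learning_rate'] == blr)].index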
    
    # Take the rounded mean prediction for all runs of the best setting
    res = pd.concat([node_test_df.reset_index(drop=True), eval_preds.iloc[best_pred_ps,:-1].transpose().reset_index(drop=True)], axis=1)
    res['Mean prediction'] = res[list(best_pred_ps)].mean(axis=1).round().values
    res = res.replace({0: 'attack', 1: 'support'})
    res = res.rename(columns={'label': 'Label'})

    print("\nPredictions of the best setting (rounded) with respect to the topics:")
    pd.crosstab(res['topic'], [res['Label'],res['Mean prediction']]).to_csv('../data/thesis/node_topics_preds.csv', index=True)
    display(pd.crosstab(res['topic'], [res['Label'],res['Mean prediction']]))
else:
    print("You have to first reproduce the results for the NoDE dataset.\n"
          "../code_relation_prediction/pytorch ./run_all_node.sh comp")

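The topic table for NoDE uses the rounded mean prediction over all runs of the best setting, which amounts to a majority vote over the 0/1 predictions. The sketch below shows the idea in plain Python. The function `mean_vote` and the toy predictions are hypothetical. Note that Python's `round`, like NumPy's rounding, rounds exact ties to the nearest even value, so a mean of exactly 0.5 becomes 0.

```python
def mean_vote(run_predictions):
    """Combine 0/1 predictions of several runs into one label per example."""
    n_runs = len(run_predictions)
    n_examples = len(run_predictions[0])
    combined = []
    for i in range(n_examples):
        mean_pred = sum(run[i] for run in run_predictions) / n_runs
        # round() uses round-half-to-even, so an exact tie (0.5) becomes 0
        combined.append(round(mean_pred))
    return combined


# Hypothetical toy predictions: one list per run, one entry per test example
runs = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
label_names = {0: "attack", 1: "support"}
print([label_names[p] for p in mean_vote(runs)])
# ['support', 'attack', 'attack', 'attack']
```

With an even number of runs, ties are possible and fall to label 0 (attack) under this rounding rule.
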
In [None]:
# All baselines for the Political (Task 1) dataset + predictions with respect to the topics

# Major class baseline 
data = data_stats_topic['political']
data['related'] = data.apply(lambda r: r['Attack'] + r['Support'], axis=1)
data['major_acc'] = data.apply(get_major_acc, args=[['related', 'Unrelated']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['related', 'Unrelated']], axis=1)
data = data.set_index('Topic')
pol_ru_test_df['major_class'] = 'related'
print("Major class Political RU:\n")
print(classification_report(pol_ru_test_df['label'], pol_ru_test_df['major_class']))

# Major class topic baseline
print("\nMajor class by topic Political RU:\n")
pol_ru_test_df['major_topic_pred'] = pol_ru_test_df['topic'].apply(
    lambda r: data.loc[r, 'major_class']).replace({'Unrelated': 'unrelated'})
print(classification_report(pol_ru_test_df['label'], pol_ru_test_df['major_topic_pred']))

# Author major class baseline
data = data_stats_author.copy()
data['related'] = data.apply(lambda r: r['Attack'] + r['Support'], axis=1)
data['major_acc'] = data.apply(get_major_acc, args=[['related', 'Unrelated']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['related', 'Unrelated']], axis=1)
data = data.set_index(['Author resp', 'Author org'])
pol_ru_test_df['major_author'] = pol_ru_test_df.apply(
    lambda r: data.loc[r['response_stance'], r['org_stance']]['major_class'], axis=1).replace({'Unrelated': 'unrelated'})
print("\nAuthor major class baseline RU:\n")
print(classification_report(pol_ru_test_df['label'], pol_ru_test_df['major_author']))
display(data)

# Major class per original argument baseline (excluding all arguments only occurring once)
df = pol_ru_test_df.copy()
data = df.groupby('org').apply(
    lambda r: pd.Series({'org': r['org'].iloc[0], 'related': count_values(r, ['related']),
                         'unrelated': count_values(r, ['unrelated'])}))
data['major_acc'] = data.apply(get_major_acc, args=[['related', 'unrelated']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['related', 'unrelated']], axis=1)
data = data.set_index('org')
data['total'] = data['related'] + data['unrelated']
# Drop all arguments only occurring once
orgs_only_once = data.loc[data['total'] == 1].index.to_list()
index = df[df['org'].isin(orgs_only_once)].index
df = df.drop(index)
df['major_org_pred'] = df['org'].apply(lambda r: data.loc[r, 'major_class'])
print("\nMajor class by original argument RU:\n")
print(classification_report(df['label'], df['major_org_pred']))

# Major class by response argument baseline (excluding all arguments only occurring once)
df = pol_ru_test_df.copy()
data = df.groupby('response').apply(
    lambda r: pd.Series({'resp': r['response'].iloc[0], 'related': count_values(r, ['related']),
                         'unrelated': count_values(r, ['unrelated'])}))
data['major_acc'] = data.apply(get_major_acc, args=[['related', 'unrelated']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['related', 'unrelated']], axis=1)
data = data.set_index('resp')
data['total'] = data['related'] + data['unrelated']
# Drop all arguments only occurring once
orgs_only_once = data.loc[data['total'] == 1].index.to_list()
index = df[df['response'].isin(orgs_only_once)].index
df = df.drop(index)
df['major_resp_pred'] = df['response'].apply(lambda r: data.loc[r, 'major_class'])
print("\nMajor class by response argument RU:\n")
print(classification_report(df['label'], df['major_resp_pred']))

# Predictions with respect to the topics (only if Political Task 1 comparative was trained)
if os.path.isdir("../pytorch/res/pol_ru"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/pol_ru/eval_preds.csv')
    print("\nPredictions with respect to the topics RU:\n")
    convert_preds_topic(eval_preds, pol_ru_test_df, {0: 'related', 1: 'unrelated'}, 9)
else:
    print('You have to first reproduce the results for the Political Task 1 dataset.\n'
          'python run_classifier_ba.py  --task_name "political-ru" --output_dir res/pol_ru/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')    

# Predictions with respect to the topics (only if Political Task 1 additional was trained)
if os.path.isdir("../pytorch/res/pol_ru_topics"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/pol_ru_topics/eval_preds.csv', names=list(range(0,315))).iloc[1:]
    print("\nPredictions with respect to the topics RU (independent version):\n")
    convert_preds_topic(eval_preds, pol_ru_to_test_df, {0: 'related', 1: 'unrelated'}, 5)
else:
    print('You have to first reproduce the results for the Political Task 1 (topic independent) dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-ru-topics" --output_dir res/pol_ru_topics/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

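The "major class per original argument" and "per response argument" baselines count the labels for each argument text on the evaluated data itself, drop all arguments that occur only once, and predict the most frequent label for each remaining argument. A minimal sketch of this procedure follows. The function `group_majority_predictions` is hypothetical, and its tie-breaking (first label seen) may differ from the notebook's `get_major_class`.

```python
from collections import Counter, defaultdict


def group_majority_predictions(rows, group_key, label_key="label", min_count=2):
    """Predict each row's label as the majority label of its group.

    Rows whose group occurs fewer than `min_count` times are excluded.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[label_key]] += 1

    kept, preds = [], []
    for row in rows:
        group_counts = counts[row[group_key]]
        if sum(group_counts.values()) < min_count:
            continue  # argument occurs only once: excluded from the baseline
        kept.append(row)
        preds.append(group_counts.most_common(1)[0][0])
    return kept, preds


# Hypothetical toy rows: argument "A" occurs three times, "B" only once
rows = [
    {"org": "A", "label": "related"},
    {"org": "A", "label": "related"},
    {"org": "A", "label": "unrelated"},
    {"org": "B", "label": "unrelated"},
]
kept, preds = group_majority_predictions(rows, group_key="org")
accuracy = sum(r["label"] == p for r, p in zip(kept, preds)) / len(kept)
print(len(kept), preds, accuracy)
```

Because the counts come from the same rows that are scored, this baseline shows how much of the label can be recovered from the identity of one argument alone.
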
In [None]:
# All baselines for the Political (Task 2) dataset + predictions with respect to the topics

# Major class baseline 
data = data_stats_topic['political']
data['major_acc'] = data.apply(get_major_acc, args=[['Attack', 'Support']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['Attack', 'Support']], axis=1)
data = data.set_index('Topic')
pol_as_test_df['major_class'] = 'attack'
print("Major class Political AS:\n")
print(classification_report(pol_as_test_df['label'], pol_as_test_df['major_class']))

# Major class topic baseline
print("\nMajor class by topic Political AS:\n")
pol_as_test_df['major_topic_pred'] = pol_as_test_df['topic'].apply(
    lambda r: data.loc[r, 'major_class']).replace({'Attack': 'attack', 'Support': 'support'})
print(classification_report(pol_as_test_df['label'], pol_as_test_df['major_topic_pred']))

# Author major class baseline
data = data_stats_author.copy()
data['major_acc'] = data.apply(get_major_acc, args=[['Attack', 'Support']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['Attack', 'Support']], axis=1)
data = data.set_index(['Author resp', 'Author org'])
pol_as_test_df['major_author'] = pol_as_test_df.apply(
    lambda r: data.loc[r['response_stance'], r['org_stance']]['major_class'], axis=1).replace(
    {'Attack': 'attack', 'Support': 'support'})
print("\nAuthor major class baseline AS:\n")
print(classification_report(pol_as_test_df['label'], pol_as_test_df['major_author']))
display(data)

# Major class per original argument baseline (excluding all arguments only occurring once)
df = pol_as_test_df.copy()
data = df.groupby('org').apply(
    lambda r: pd.Series({'org': r['org'].iloc[0], 'attack': count_values(r, ['attack']),
                         'support': count_values(r, ['support'])}))
data['major_acc'] = data.apply(get_major_acc, args=[['attack', 'support']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['attack', 'support']], axis=1)
data = data.set_index('org')
data['total'] = data['attack'] + data['support']
# Drop all arguments only occurring once
orgs_only_once = data.loc[data['total'] == 1].index.to_list()
index = df[df['org'].isin(orgs_only_once)].index
df = df.drop(index)
df['major_org_pred'] = df['org'].apply(lambda r: data.loc[r, 'major_class'])
print("\nMajor class by original argument AS:\n")
print(classification_report(df['label'], df['major_org_pred']))

# Major class by response argument baseline (excluding all arguments only occurring once)
df = pol_as_test_df.copy()
data = df.groupby('response').apply(
    lambda r: pd.Series({'resp': r['response'].iloc[0], 'attack': count_values(r, ['attack']),
                         'support': count_values(r, ['support'])}))
data['major_acc'] = data.apply(get_major_acc, args=[['attack', 'support']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['attack', 'support']], axis=1)
data = data.set_index('resp')
data['total'] = data['attack'] + data['support']
# Drop all arguments only occurring once
orgs_only_once = data.loc[data['total'] == 1].index.to_list()
index = df[df['response'].isin(orgs_only_once)].index
df = df.drop(index)
df['major_resp_pred'] = df['response'].apply(lambda r: data.loc[r, 'major_class'])
print("\nMajor class by response argument AS:\n")
print(classification_report(df['label'], df['major_resp_pred']))

# Sentiment baselines
df = pol_as_test_df.copy()
df['org_polarity'] = df['org'].apply(lambda r: disc_pol(sid.polarity_scores(r)['compound']))
df['resp_polarity'] = df['response'].apply(lambda r: disc_pol(sid.polarity_scores(r)['compound']))
df['sent_both_baseline'] = df.apply(
    lambda r: 'attack' if r['org_polarity'] != r['resp_polarity'] else 'support', axis=1)
df['sent_resp_baseline'] = df.apply(
    lambda r: 'attack' if r['resp_polarity'] == 'negative' else 'support', axis=1)
print("\nSentiment 1 baseline (both arguments same sent==support, otherwise==attack):\n")
print(classification_report(df['label'], df['sent_both_baseline']))
print("\nSentiment 2 baseline (response negative sentiment==attack, otherwise==support):\n")
print(classification_report(df['label'], df['sent_resp_baseline']))

# Predictions with respect to the topics (only if Political Task 2 comparative was trained)
if os.path.isdir("../pytorch/res/pol_as"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/pol_as/eval_preds.csv')
    print("\nPredictions with respect to the topics AS:\n")
    convert_preds_topic(eval_preds, pol_as_test_df, {0: 'attack', 1: 'support'}, 9)
else:
    print('You have to first reproduce the results for the Political Task 2 dataset.\n'
          'python run_classifier_ba.py  --task_name "political-as" --output_dir res/pol_as/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')    

# Predictions with respect to the topics (only if Political Task 2 additional was trained)
if os.path.isdir("../pytorch/res/pol_as_topics"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/pol_as_topics/eval_preds.csv', names=list(range(0,315))).iloc[1:]
    print("\nPredictions with respect to the topics AS (independent version):\n")
    convert_preds_topic(eval_preds, pol_as_to_test_df, {0: 'attack', 1: 'support'}, 5)
else:
    print('You have to first reproduce the results for the Political Task 2 (topic independent) dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "political-as-topics" --output_dir res/pol_as_topics/crossval1 --do_cross_val --do_lower_case --num_train_epochs 5 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')

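The two sentiment baselines first discretise the VADER compound score of each argument and then apply a simple rule. The notebook's `disc_pol` helper is defined elsewhere. The sketch below assumes the commonly used VADER thresholds of ±0.05, which may not match the thesis, and all function names are hypothetical. Compound scores are passed in directly, so the example runs without NLTK.

```python
def discretize_polarity(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a polarity class (assumed thresholds)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_both_baseline(org_compound, resp_compound):
    """Sentiment 1: same polarity for both arguments -> support, else attack."""
    same = discretize_polarity(org_compound) == discretize_polarity(resp_compound)
    return "support" if same else "attack"


def sentiment_resp_baseline(resp_compound):
    """Sentiment 2: negative response -> attack, otherwise support."""
    return "attack" if discretize_polarity(resp_compound) == "negative" else "support"


# Hypothetical compound scores for an original argument and its response
org_score, resp_score = 0.6, -0.4
print(sentiment_both_baseline(org_score, resp_score))  # attack
print(sentiment_resp_baseline(resp_score))             # attack
```
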
In [None]:
# All baselines for the Agreement dataset + predictions with respect to the topics

# Major class baseline 
data = data_stats_topic['agreement']
data['major_acc'] = data.apply(get_major_acc, args=[['Agreement', 'Disagreement']], axis=1)
data['major_class'] = data.apply(get_major_class, args=[['Agreement', 'Disagreement']], axis=1)
data = data.set_index('Topic')
ag_test_df['major_class'] = 'disagreement'
print("Major class Agreement:\n")
print(classification_report(ag_test_df['label'], ag_test_df['major_class']))

# Major class topic baseline
print("\nMajor class by topic Agreement:\n")
ag_test_df['major_topic_pred'] = ag_test_df['topic'].apply(
    lambda r: data.loc[r, 'major_class']).replace({'Agreement': 'agreement', 'Disagreement': 'disagreement'})
print(classification_report(ag_test_df['label'], ag_test_df['major_topic_pred']))


# Predictions with respect to the topics (only if Agreement comparative was trained)
if os.path.isdir("../pytorch/res/agreement_new"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/agreement_new/eval_preds.csv')
    print("\nPredictions with respect to the topics:\n")
    convert_preds_topic(eval_preds, ag_test_df, {0: 'agreement', 1: 'disagreement'}, 9, False)
else:
    print('You have to first reproduce the results for the Agreement dataset.\n'
          'python run_classifier_ba.py  --task_name "agreement" --output_dir res/agreement_new/crossval1 --do_cross_val --do_lower_case --num_train_epochs 3 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')    

# Predictions with respect to the topics (only if Agreement additional was trained)
if os.path.isdir("../pytorch/res/agreement_topics_new"):
    # Load the data
    eval_preds = pd.read_csv('../pytorch/res/agreement_topics_new/eval_preds.csv')
    print("\nPredictions with respect to the topics (independent version):\n")
    convert_preds_topic(eval_preds, ag_to_test_df, {0: 'agreement', 1: 'disagreement'}, 1, False)
else:
    print('You have to first reproduce the results for the Agreement (topic independent) dataset.\n'
          '../code_relation_prediction/pytorch python run_classifier_ba.py  --task_name "agreement_topics" --output_dir res/agreement_topics_new/crossval1 --do_cross_val --do_lower_case --num_train_epochs 3 --max_seq_length 256 --train_batch_size 12 --learning_rate 2e-5')