# Sentential Relation Prediction
*LING 7800: Computational Models of Discourse*

To incorporate context into sentential relation prediction, we offer two new approaches:
1. Expanded Window Neighbors (EWNs)
2. Part Smart Random Neighbors (PSRNs)

EWNs incorporate extra sentences before and after the target sentences. PSRNs ignore direct neighbors and instead draw a random selection of preceding and following sentences as context for predicting the target relation.
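The two strategies can be sketched on a plain list of sentences. This is only an illustration under stated assumptions, not the `EWN` or `PSRN` implementation from `util.py`: the function names `expanded_window` and `random_neighbors`, the argument `i` (index of the first target sentence, with the second target at `i + 1`), and the choice to exclude only the sentences immediately adjacent to the pair are all hypothetical.

```python
import random

def expanded_window(sentences, i, n=1):
    """Return up to n sentences before and after the target pair (i, i + 1)."""
    before = sentences[max(0, i - n):i]
    after = sentences[i + 2:i + 2 + n]
    return before, after

def random_neighbors(sentences, i, n=1, seed=42):
    """Sample up to n non-adjacent sentences on each side of the target pair."""
    rng = random.Random(seed)
    # Skip the direct neighbors (i - 1 and i + 2) and sample from the rest.
    preceding = list(range(0, max(0, i - 1)))
    following = list(range(i + 3, len(sentences)))
    before_idx = sorted(rng.sample(preceding, min(n, len(preceding))))
    after_idx = sorted(rng.sample(following, min(n, len(following))))
    return [sentences[j] for j in before_idx], [sentences[j] for j in after_idx]

sentences = [f"s{k}" for k in range(10)]
window_before, window_after = expanded_window(sentences, i=4, n=2)
random_before, random_after = random_neighbors(sentences, i=4, n=2)
```

With the target pair at positions 4 and 5, the window variant returns `s2`, `s3` and `s6`, `s7`, while the random variant can only draw from `s0` to `s2` and `s7` to `s9`.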

In [1]:
import torch
import pandas as pd
import numpy as np
import warnings

from util import *
from sklearn.model_selection import train_test_split
from gensim.parsing.preprocessing import remove_stopwords

from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

warnings.filterwarnings('ignore')

    Auto-reload the functions in util.py with the next cell:

In [2]:
%load_ext autoreload
%autoreload 2

### Data Pre-Processing

    The following commented-out cell converts the piped PDTB files into CSV format. The piped PDTB files are created with the convert.pl Perl script. The cell is no longer needed once you have pdtb2.csv in your working directory.

In [3]:
# import os
# import glob

# # Turn pipe files into CSV
# PATH = r"C:/Users/sucho/OneDrive - UCB-O365/Desktop/Spring 2023/LING 7800/LING-7800-Term-Project/pdtb pipe"
# pdtb_files = glob.glob(os.path.join(PATH, "*/*.pipe"))
# my_data = 'pdtb2.csv'

# with open(my_data, 'w', encoding='utf-8') as output:
#     for i in pdtb_files:
#         with open(i, 'r', encoding='ISO-8859-1') as _input:
#             for j in _input:
#                 output.write(j)
#             output.write('\n')

Load the Penn Discourse Treebank 2.0 from `my_data` into `data`, and assign column labels based on the `COLUMN_DEFINITIONS` document.

In [4]:
# (The 'pdtb2.csv' file is created with the previous cell)
my_data = '../data/pdtb2.csv'

# Column names were found in the COLUMN_DEFINITIONS file provided by the PDTB2
COLUMN_DEFINITIONS = [
    'Relation', 'SectionNumber', 'FileNumber', 'Connective_AltLex_SpanList', 'Connective_AltLex_GornAddressList',
    'Connective_AltLex_RawText', 'String_position', 'SentenceNumber', 'ConnHead', 'Conn1', 'Conn2',
    'ConnHeadSemClass1', '1st_Semantic_Class_1', '1st_Semantic_Class_2', '2nd_Semantic_Class_2',
    'Relation_Attribution_Source', 'Relation_Attribution_Type', 'Relation_Attribution_Polarity',
    'Relation_Attribution_Determinacy', 'Relation_Attribution_SpanList', 'Relation_Attribution_GornAddressList',
    'Relation_Attribution_RawText', 'Arg1_SpanList', 'Arg1_GornAddress', 'S1', 'Arg1_Attribution_Source',
    'Arg1_Attribution_Type', 'Arg1_Attribution_Polarity', 'Arg1_Attribution_Determinacy', 'Arg1_Attribution_SpanList',
    'Arg1_Attribution_GornAddressList', 'Arg1_Attribution_RawText', 'Arg2_SpanList', 'Arg2_GornAddress', 'S2',
    'Arg2_Attribution_Source', 'Arg2_Attribution_Type', 'Arg2_Attribution_Polarity', 'Arg2_Attribution_Determinacy',
    'Arg2_Attribution_SpanList', 'Arg2_Attribution_GornAddressList', 'Arg2_Attribution_RawText', 'Sup1_SpanList',
    'Sup1_GornAddress', 'Sup1_RawText', 'Sup2_SpanList', 'Sup2_GornAddress', 'Sup2_RawText'
]

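# read_csv() my_data into a pandas DataFrame (PDTB encodings are ISO-8859-1).
# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them; it replaces
# the deprecated error_bad_lines=False, which was removed in pandas 2.0.
data = pd.read_csv(my_data, sep='|', encoding='ISO-8859-1', names=COLUMN_DEFINITIONS, on_bad_lines='warn')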

b'Skipping line 20149: expected 48 fields, saw 58\n'
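The message above reports a row whose field count does not match the 48 column names. As a rough illustration of what skipping such rows means, the following standard-library sketch keeps only rows with the expected number of pipe-separated fields. The helper `keep_well_formed` and the three-column sample are made up for this example; pandas performs this filtering internally.

```python
import csv
import io

def keep_well_formed(text, expected_fields, delimiter="|"):
    """Split text into rows and drop rows with the wrong number of fields."""
    kept, skipped = [], []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        if len(row) == expected_fields:
            kept.append(row)
        else:
            # Remember which line was dropped and how many fields it had.
            skipped.append((line_number, len(row)))
    return kept, skipped

sample = "Implicit|0|3\nExplicit|0|3|extra|extra\nImplicit|0|4\n"
rows, bad = keep_well_formed(sample, expected_fields=3)
for line_number, seen in bad:
    print(f"Skipping line {line_number}: expected 3 fields, saw {seen}")
```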


    Uncomment the following cell to view full PDTB dataset:

In [5]:
# pd.options.display.max_columns = 48
# pd.set_option('display.max_colwidth', None)
# data.head()

### Filter Data to Only Include Relevant Columns

In [6]:
# We are interested in the following columns:
df = data[['Relation','SectionNumber','FileNumber','SentenceNumber','S1', 'S2', 'ConnHeadSemClass1']].dropna()

# 'ConnHeadSemClass1' contains relation labels between Arg1 and Arg2. The levels are separated by a period.
# Split the levels into separate columns (0, 1, ...) and append those columns to df.
df = pd.concat([df, df['ConnHeadSemClass1'].str.split(".", expand=True)], axis=1)
# Drop the last two characters of each sentence number (e.g. a trailing '.0' left by a float column).
df['SentenceNumber'] = df['SentenceNumber'].astype(str).str[:-2]

# Keep only the first two relation levels, and then set the column names with .columns.
df = df[['Relation','SectionNumber','FileNumber','SentenceNumber','S1', 'S2', 0, 1]]
df.columns = ['Relation','SectionNumber','FileNumber', 'SentenceNumber', 'S1', 'S2', 'Level 1', 'Level 2']

# Display the first 5 rows without any column width restrictions
pd.set_option('display.max_colwidth', None)
df[:5]

Unnamed: 0,Relation,SectionNumber,FileNumber,SentenceNumber,S1,S2,Level 1,Level 2
4,Implicit,0,3,5,This is an old story,We're talking about years ago before anyone heard of asbestos having any questionable properties,Expansion,Restatement
5,Implicit,0,3,6,We're talking about years ago before anyone heard of asbestos having any questionable properties,There is no asbestos in our products now,Expansion,Conjunction
6,Implicit,0,3,8,Neither Lorillard nor the researchers who studied the workers were aware of any research on smokers of the Kent cigarettes,We have no useful information on whether users are at risk,Contingency,Cause
9,Implicit,0,3,13,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number","Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer",Expansion,Conjunction
10,Implicit,0,3,14,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer","The total of 18 deaths from malignant mesothelioma, lung cancer and asbestosis was far higher than expected",Expansion,Conjunction
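The split on the period can be illustrated on single label strings. This is a small sketch rather than the notebook's code: the helper `split_levels` is hypothetical, and it pads missing levels with `None`, mirroring how `str.split(..., expand=True)` fills missing columns.

```python
def split_levels(label, depth=2):
    """Split a dotted relation label into exactly `depth` levels, padding with None."""
    parts = label.split(".")[:depth]
    return tuple(parts + [None] * (depth - len(parts)))

examples = ["Expansion.Restatement", "Contingency.Cause.reason", "Temporal"]
levels = [split_levels(label) for label in examples]
for label, pair in zip(examples, levels):
    print(label, "->", pair)
```

A label with only one level yields `None` as its second level, which is why `None` shows up among the unique Level 2 values.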


### Convert Labels to Integers

In [7]:
# print all the unique values in 'Level 1'
print(df['Level 1'].unique())

# print all the unique values in 'Level 2'
print(df['Level 2'].unique())

['Expansion' 'Contingency' 'Comparison' 'Temporal']
['Restatement' 'Conjunction' 'Cause' None 'Contrast' 'Instantiation'
 'Asynchronous' 'Alternative' 'Pragmatic cause' 'Concession' 'List'
 'Synchrony' 'Exception' 'Pragmatic contrast' 'Pragmatic concession'
 'Condition' 'Pragmatic condition']


In [8]:
level_1_map = {
    'Temporal': 0,
    'Contingency': 1,
    'Comparison': 2,
    'Expansion': 3
}

level_2_map = {
    'Restatement': 0,
    'Conjunction': 1,
    'Cause': 2,
    'Contrast': 3,
    'Instantiation': 4,
    'Asynchronous': 5,
    'Alternative': 6,
    'Pragmatic cause': 7,
    'Concession': 8,
    'List': 9,
    'Synchrony': 10,
    'Exception': 11,
    'Pragmatic contrast': 12,
    'Pragmatic concession': 13,
    'Condition': 14,
    'Pragmatic condition': 15,
    None: 16,
}

df['Level 1'] = df['Level 1'].map(level_1_map)
df['Level 2'] = df['Level 2'].map(level_2_map)
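When inspecting predictions later, the integer ids need to be turned back into label names. A minimal sketch, assuming the same Level 1 mapping as above (repeated here so the block is self-contained); the names `id_to_level_1` and `decode` are illustrative only.

```python
level_1_map = {'Temporal': 0, 'Contingency': 1, 'Comparison': 2, 'Expansion': 3}

# Invert the mapping: integer id -> label name
id_to_level_1 = {idx: name for name, idx in level_1_map.items()}

def decode(ids):
    """Translate a sequence of integer ids back into Level 1 label names."""
    return [id_to_level_1[i] for i in ids]

decoded = decode([3, 3, 1])
print(decoded)
```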

In [9]:
# Work on a copy so that later changes do not overwrite the original df
my_data_transformed = df.copy()
my_data_transformed[:5]

Unnamed: 0,Relation,SectionNumber,FileNumber,SentenceNumber,S1,S2,Level 1,Level 2
4,Implicit,0,3,5,This is an old story,We're talking about years ago before anyone heard of asbestos having any questionable properties,3,0
5,Implicit,0,3,6,We're talking about years ago before anyone heard of asbestos having any questionable properties,There is no asbestos in our products now,3,1
6,Implicit,0,3,8,Neither Lorillard nor the researchers who studied the workers were aware of any research on smokers of the Kent cigarettes,We have no useful information on whether users are at risk,1,2
9,Implicit,0,3,13,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number","Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer",3,1
10,Implicit,0,3,14,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer","The total of 18 deaths from malignant mesothelioma, lung cancer and asbestosis was far higher than expected",3,1


### Remove Stop Words (Optional)

In [10]:
# my_data_transformed['S1'] = my_data_transformed['S1'].apply(remove_stopwords)
# my_data_transformed['S2'] = my_data_transformed['S2'].apply(remove_stopwords)
my_data_transformed.head()

Unnamed: 0,Relation,SectionNumber,FileNumber,SentenceNumber,S1,S2,Level 1,Level 2
4,Implicit,0,3,5,This is an old story,We're talking about years ago before anyone heard of asbestos having any questionable properties,3,0
5,Implicit,0,3,6,We're talking about years ago before anyone heard of asbestos having any questionable properties,There is no asbestos in our products now,3,1
6,Implicit,0,3,8,Neither Lorillard nor the researchers who studied the workers were aware of any research on smokers of the Kent cigarettes,We have no useful information on whether users are at risk,1,2
9,Implicit,0,3,13,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number","Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer",3,1
10,Implicit,0,3,14,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer","The total of 18 deaths from malignant mesothelioma, lung cancer and asbestosis was far higher than expected",3,1


In [11]:
# drop 'Relation' column from my_data_transformed
my_data_transformed = my_data_transformed.drop(['Relation'], axis=1)

# re-index my_data_transformed
my_data_transformed = my_data_transformed.reset_index(drop=True)

my_data_transformed[:5]

Unnamed: 0,SectionNumber,FileNumber,SentenceNumber,S1,S2,Level 1,Level 2
0,0,3,5,This is an old story,We're talking about years ago before anyone heard of asbestos having any questionable properties,3,0
1,0,3,6,We're talking about years ago before anyone heard of asbestos having any questionable properties,There is no asbestos in our products now,3,1
2,0,3,8,Neither Lorillard nor the researchers who studied the workers were aware of any research on smokers of the Kent cigarettes,We have no useful information on whether users are at risk,1,2
3,0,3,13,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number","Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer",3,1
4,0,3,14,"Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer","The total of 18 deaths from malignant mesothelioma, lung cancer and asbestosis was far higher than expected",3,1


### Apply one of the transforms

In [12]:
# filtered_df = EWN(my_data_transformed, n=1)
# filtered_df = PSRN(my_data_transformed, n=1)
filtered_df = direct_neighbors(my_data_transformed, n=1)

In [13]:
len(filtered_df)

2583

In [14]:
filtered_df.head()

Unnamed: 0,SectionNumber,FileNumber,SentenceNumber,S1,S2,Level 1,Level 2
3,0,3,13,"<s1>Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number</s1>","<s2>Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer</s2> <s2>Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer</s2>",3,1
4,0,3,14,"<s1>Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer</s1> <s1>Among 33 men who worked closely with the substance, 28 have died -- more than three times the expected number.Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer</s1>","<s2>The total of 18 deaths from malignant mesothelioma, lung cancer and asbestosis was far higher than expected</s2>",3,1
23,0,12,4,<s1>Newsweek's ad rates would increase 5% in January</s1>,"<s2>A full, four-color page in Newsweek will cost $100,980</s2> <s2>Newsweek's ad rates would increase 5% in January.A full, four-color page in Newsweek will cost $100,980</s2>",3,0
24,0,12,5,"<s1>A full, four-color page in Newsweek will cost $100,980</s1> <s1>Newsweek's ad rates would increase 5% in January.A full, four-color page in Newsweek will cost $100,980</s1>","<s2>In mid-October, Time magazine lowered its guaranteed circulation rate base for 1990 while not increasing ad page rates; with a lower circulation base, Time's ad rate will be effectively 7.5% higher per subscriber; a full page in Time costs about $120,000</s2>",2,3
25,0,12,5,"<s1>In mid-October, Time magazine lowered its guaranteed circulation rate base for 1990 while not increasing ad page rates</s1>","<s2>with a lower circulation base, Time's ad rate will be effectively 7.5% higher per subscriber</s2> <s2>Newsweek's ad rates would increase 5% in January.A full, four-color page in Newsweek will cost $100,980.In mid-October, Time magazine lowered its guaranteed circulation rate base for 1990 while not increasing ad page rates; with a lower circulation base, Time's ad rate will be effectively 7.5% higher per subscriber; a full page in Time costs about $120,000</s2>",1,2
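In the output above, each sentence segment is wrapped in `<s1>` or `<s2>` tags, and context segments follow the target segment, separated by a space. The wrapping can be sketched as follows; `tag_segments` is a hypothetical helper written for illustration, not the function in `util.py`.

```python
def tag_segments(segments, tag):
    """Wrap each segment in <tag>...</tag> and join the results with spaces."""
    return " ".join(f"<{tag}>{segment}</{tag}>" for segment in segments)

s1 = tag_segments(["Newsweek's ad rates would increase 5% in January"], "s1")
s2 = tag_segments(["target sentence", "context sentence"], "s2")
print(s1)
print(s2)
```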


# Training Time!

Model: `distilbert-base-uncased-finetuned-sst-2-english`

per_device_train_batch_size: `16`

per_device_eval_batch_size: `16`

num_train_epochs: `10`

seed: `42`

learning_rate: `2e-5`
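These settings determine how many optimizer steps training takes. A back-of-the-envelope sketch, assuming the 2,583 rows reported by `len(filtered_df)`, the 80/20 split used below, a single device, and no gradient accumulation (the last two are assumptions, not stated in the notebook):

```python
import math

n_rows = 2583
test_size = 0.2
batch_size = 16
epochs = 10

n_test = math.ceil(n_rows * test_size)             # the test share is rounded up
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)  # the last partial batch still counts
total_steps = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_steps)  # 2066 517 130 1300
```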

In [19]:
# SententialNeighbors is a custom dataset modeled on
# https://huggingface.co/transformers/v3.2.0/custom_datasets.html
# It inherits from torch.utils.data.Dataset.
class SententialNeighbors(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        # encodings: one dict-like tokenizer output per example
        # labels: one integer label per example. They are stored as a plain list so that
        # indexing is positional even if a pandas Series with a shuffled index is passed in.
        self.encodings = encodings
        self.labels = list(labels)

    def __len__(self):
        # Number of examples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # Return the encoding tensors for example idx, plus its label under the 'labels' key
        item = {key: torch.tensor(self.encodings[idx][key]) for key in self.encodings[idx].keys()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

In [15]:
# Select our model and tokenizer
distilBERT = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(distilBERT)

# Split data into train and validation sets (80/20 split)
train, test = train_test_split(filtered_df, test_size=0.2, random_state=42)

# Encode the train and validation data
train_encodings = encode(train, tokenizer)
test_encodings = encode(test, tokenizer)

# Create our custom PyTorch dataset from the encodings we made (sentences and labels)
train_dataset = SententialNeighbors(train_encodings, train['Level 1'])
test_dataset = SententialNeighbors(test_encodings, test['Level 1'])
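One thing to keep in mind here: `train_test_split` keeps the original row labels, so the label column of `train` no longer has the index 0, 1, 2, and so on. Indexing such a Series with `[idx]` looks rows up by label, not by position. A small sketch with made-up values shows the difference and two positional alternatives:

```python
import pandas as pd

# Made-up labels with a shuffled, non-contiguous index, as left behind by a random split
labels = pd.Series([3, 1, 2], index=[23, 4, 3], name='Level 1')

by_label = labels[3]                # row whose index label is 3
by_position = labels.iloc[0]        # first row, regardless of its index label
as_list = labels.tolist()           # plain list, always positional
has_label_zero = 0 in labels.index  # labels[0] would raise a KeyError here

print(by_label, by_position, as_list, has_label_zero)
```

Passing `labels.tolist()` (or resetting the index) to a dataset that indexes by position avoids this mismatch.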

In [16]:
# Set up our model for a sequence classification task (we are using distilBERT)
model = AutoModelForSequenceClassification.from_pretrained(
    distilBERT, num_labels=4, ignore_mismatched_sizes=True
)

# Adapt the model to our task: the classification head is a Linear layer with 4 outputs, one per Level 1 label.
# Note that from_pretrained(num_labels=4, ignore_mismatched_sizes=True) already creates a 4-output head,
# so this line only re-initializes it.
# ref: https://discuss.huggingface.co/t/adding-linear-layer-to-transformer-model-save-pretrained-and-load-pretrained/15548
model.classifier = torch.nn.Linear(model.config.dim, 4)

# Set up the training arguments
training_args = TrainingArguments(
    output_dir="../results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    seed=42,
    learning_rate=2e-5,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([4, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([4]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=evaluate,
)

# Train the model
trainer.train()
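`Trainer` calls the `compute_metrics` function with the model's predictions and the gold labels at each evaluation. The `evaluate` function passed above lives in `util.py` and is not shown here, so the following is only a sketch of what such a function could look like, assuming it receives a `(logits, labels)` pair of NumPy arrays. The name `compute_metrics_sketch` and the choice of metrics are illustrative.

```python
import numpy as np

def compute_metrics_sketch(eval_pred):
    """Return accuracy and macro F1 (averaged over classes present in the gold labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # most likely class per example
    accuracy = float((preds == labels).mean())

    f1_scores = []
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}

# Toy inputs: 4 examples, 4 classes; the last example is misclassified as class 3
logits = np.array([[2.0, 0, 0, 0], [0, 3.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 5.0]])
labels = np.array([0, 1, 2, 2])
metrics = compute_metrics_sketch((logits, labels))
print(metrics)
```

The dictionary keys are reported by `Trainer` with an `eval_` prefix in its logs. The notebook already imports `accuracy_score` and `f1_score` from scikit-learn, which could replace the manual computation here.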

### Save results to .csv!

In [None]:
# Convert trainer.state.log_history to a DataFrame
results = pd.DataFrame(trainer.state.log_history)

# results to csv
results.to_csv('model_15_tags_baseline.csv')