# Exploring the Corpus

This notebook gives a brief overview of the data in the Corpus object, which is used to train the models.

## Loading the Corpus

The existing corpus objects can be loaded from the pickle files in the `src/corpus` folder. This folder contains three corpora:

- `full_corpus_ontonotes`
- `md_corpus_ontonotes`
- `sm_corpus_ontonotes`

The `full` corpus contains all the data, but training on it requires considerable computational power. For efficiency, `md_corpus_ontonotes` provides a smaller subset of the `full` corpus, generated so that its role distribution matches that of the `full` corpus. Training on either the `full` or the `md` corpus results in a comprehensive model.

The `sm_corpus_ontonotes` should not be used for training the models as the data is too limited. Nevertheless, it can be used for testing purposes.

In [1]:
import os #nopep8
import sys #nopep8
from pathlib import Path #nopep8

# Add the 'src' directory to sys.path so that the project modules can be imported
current_working_directory = os.getcwd()
src_path = Path(current_working_directory) / "src"
sys.path.append(str(src_path))

from corpus import Corpus, VerbAtlasFrames #nopep8

# Define the data directory path
data_dir = Path(current_working_directory) / "data"
corpus_data_dir = data_dir / "ontonotes_data"
verbatlas_data_dir = data_dir / "verbatlas_data"

# Define the paths to the corpus data files
train_corpus_file_path = corpus_data_dir / "ontonotes_train.csv"
dev_corpus_file_path = corpus_data_dir / "ontonotes_dev.csv"
test_corpus_file_path = corpus_data_dir / "ontonotes_test.csv"

pb2va_file_path = verbatlas_data_dir / "pb2va.tsv"
VA_frame_info_file_path = verbatlas_data_dir / "VA_frame_info.tsv"
verb_atlas_frames = VerbAtlasFrames(pb2va_file_path, VA_frame_info_file_path)

#corpus = Corpus(train_corpus_file_path, dev_corpus_file_path, test_corpus_file_path, verb_atlas_frames)

In [3]:
# define the corpus path
md_corpus_path = src_path / "corpus" / "md_corpus_ontonotes.pkl"

md_corpus_onto = Corpus.load_corpus(md_corpus_path)

# full_corpus_path = Path(src_path) / "corpus" / "full_corpus_ontonotes.pkl"

# full_corpus_onto = Corpus.load_corpus(full_corpus_path)

Corpus (non-SpaCy data) loaded from c:\Users\tommo\Documents\VS_Code\predicting-argument-modifiers\src\corpus\md_corpus_ontonotes.pkl
Corpus (SpaCy data) loaded from c:\Users\tommo\Documents\VS_Code\predicting-argument-modifiers\src\corpus\md_corpus_ontonotes.pkl.spacy
Corpus (non-SpaCy data) loaded from c:\Users\tommo\Documents\VS_Code\predicting-argument-modifiers\src\corpus\full_corpus_ontonotes.pkl
Corpus (SpaCy data) loaded from c:\Users\tommo\Documents\VS_Code\predicting-argument-modifiers\src\corpus\full_corpus_ontonotes.pkl.spacy
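
The load messages above suggest that each corpus is stored as two files: a pickle and a companion file with `.spacy` appended to its name. As a minimal sketch, the helper below builds both paths for one of the three corpus sizes. The name `corpus_paths` is hypothetical and not part of the project, and the `<size>_corpus_ontonotes.pkl` pattern is an assumption inferred from the `md` and `full` file names seen in this notebook.

```python
from pathlib import Path

# Corpus sizes listed in this notebook
CORPUS_SIZES = ("full", "md", "sm")


def corpus_paths(size, src_path=Path("src")):
    """Return (pickle_path, spacy_path) for a corpus size.

    Hypothetical helper: the '<size>_corpus_ontonotes.pkl' pattern is
    inferred from the file names seen in this notebook.
    """
    if size not in CORPUS_SIZES:
        raise ValueError(f"unknown corpus size: {size!r}")
    pickle_path = Path(src_path) / "corpus" / f"{size}_corpus_ontonotes.pkl"
    # The SpaCy data sits next to the pickle with '.spacy' appended
    spacy_path = pickle_path.with_name(pickle_path.name + ".spacy")
    return pickle_path, spacy_path


pickle_path, spacy_path = corpus_paths("md")
print(pickle_path.name)
print(spacy_path.name)
```

The pickle path is what would be passed to `Corpus.load_corpus`, as in the cell above.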


## Structure of the Corpus

The Corpus object is pre-split into a `train`, `development` (dev) and `test` set.

The Corpus object also contains some useful methods that can be used to access and visualize the data.

In [7]:
# accessing the full corpus
md_corpus_onto.sentences

# # accessing the train split from the corpus
# md_corpus_onto.train

# # accessing the dev split from the corpus
# md_corpus_onto.dev

# # accessing the test split from the corpus
# md_corpus_onto.test

{(0,
  11): Sentence(sentence_id=(0, 11), sentence_string="Not only are Japanese foodstuff shops and Japanese product counters at department stores quickly selling out of their regular shipments of these feline fortune finders , Maneki Nekos were even on sale in the run - up to New Year 's at the roadside stalls selling traditional Chinese holiday paraphernalia , such as lanterns and spring couplets on red paper .", frames=[Frame(frame_name='sell.01', lemma_name='sell', roles=[Role(sentence_id=(0, 11), role_id=110, role='ARG0', indices=[42, 43, 44], string='the roadside stalls', role_index=0, va_role=None, doc=the roadside stalls), Role(sentence_id=(0, 11), role_id=111, role='V', indices=[45], string='selling', role_index=1, va_role=None, doc=selling), Role(sentence_id=(0, 11), role_id=112, role='ARG1', indices=[46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59], string='traditional Chinese holiday paraphernalia , such as lanterns and spring couplets on red paper', role_index=2, v
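
The `indices` of a role appear to be positions in the whitespace-tokenized `sentence_string`. The sketch below checks this against the sentence printed above. The helper name `span_string` is hypothetical, and the whitespace-tokenization assumption is inferred from this one example rather than from the `Corpus` implementation.

```python
# Sentence (0, 11) as printed in the output above
sentence_string = (
    "Not only are Japanese foodstuff shops and Japanese product counters "
    "at department stores quickly selling out of their regular shipments "
    "of these feline fortune finders , Maneki Nekos were even on sale in "
    "the run - up to New Year 's at the roadside stalls selling "
    "traditional Chinese holiday paraphernalia , such as lanterns and "
    "spring couplets on red paper ."
)


def span_string(sentence, indices):
    """Join the whitespace tokens of `sentence` at the given positions."""
    tokens = sentence.split()
    return " ".join(tokens[i] for i in indices)


# Index lists of the 'sell.01' roles as shown in the output
arg0 = span_string(sentence_string, [42, 43, 44])
verb = span_string(sentence_string, [45])
arg1 = span_string(sentence_string, list(range(46, 60)))
print(arg0, "|", verb, "|", arg1)
```

For this sentence the reconstructed spans match the `string` attributes of the three roles in the output.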

In [8]:
# visualizing a sentence from the corpus with its gold standard frames and roles
# the VerbAtlas roles are not included in the visualization or corpus because they are not used in the current models
md_corpus_onto.visualize_sentence_displacy((1,106))

## Role Distribution in the Corpus

There are two general types of roles that occur in the corpus:

- Core Roles: V, ARG0, ARG1, ARG2, ARG3, ARG4, ARG5, ARGA
- Modifier Roles: TMP, ADV, MOD, DIS, LOC, MNR, NEG, ADJ, PRP, CAU, DIR, PRD, EXT, LVB, PRR, GOL, DSP, COM, CXN, REC

What these roles represent is documented by [the PropBank project](https://propbank.github.io/).

In [17]:
import pandas as pd

data = []

# Iterate over all sentences in the training corpus
for sentence_id, sentence in md_corpus_onto.train.items():
    # For each sentence, iterate over all frames
    for frame in sentence.frames:
        # For each frame, record every role it contains
        for role in frame.roles:
            data.append({
                "sentence_id": sentence_id,
                "role": role.role,
                "indices": role.indices,
                "string": role.string,
            })

# Convert the list to a DataFrame
df = pd.DataFrame(data)

role_distribution = df['role'].value_counts(normalize=True)

print(role_distribution)

# full corpus
# V       0.327361
# ARG1    0.234784
# ARG0    0.147282
# ARG2    0.080357
# TMP     0.045138
# ADV     0.026037
# MOD     0.025146
# DIS     0.022413
# LOC     0.016478
# MNR     0.016336
# NEG     0.013282
# ADJ     0.006962
# ARG3    0.005719
# PRP     0.005641
# CAU     0.004457
# ARG4    0.004306
# DIR     0.004086
# PRD     0.003957
# EXT     0.002763
# LVB     0.002178
# PRR     0.002171
# GOL     0.001357
# DSP     0.000764
# COM     0.000571
# CXN     0.000180
# REC     0.000172
# ARG5    0.000095
# ARGA    0.000006

V       0.326829
ARG1    0.234948
ARG0    0.147330
ARG2    0.080193
TMP     0.045769
ADV     0.025789
MOD     0.025292
DIS     0.022304
MNR     0.016459
LOC     0.016387
NEG     0.013206
ADJ     0.006858
ARG3    0.005703
PRP     0.005602
CAU     0.004645
ARG4    0.004380
DIR     0.004160
PRD     0.003974
EXT     0.002780
PRR     0.002069
LVB     0.002022
GOL     0.001351
DSP     0.000957
COM     0.000592
CXN     0.000154
REC     0.000145
ARG5    0.000101
ARGA    0.000003
Name: role, dtype: float64
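
To make the claim that the `md` corpus mirrors the `full` corpus concrete, the two role distributions listed in the cell above can be compared directly. The sketch below copies both sets of values into plain dictionaries and reports the largest absolute gap and the total variation distance. The variable names are illustrative.

```python
# Role distribution of the full corpus (from the comment block above)
full = {
    "V": 0.327361, "ARG1": 0.234784, "ARG0": 0.147282, "ARG2": 0.080357,
    "TMP": 0.045138, "ADV": 0.026037, "MOD": 0.025146, "DIS": 0.022413,
    "LOC": 0.016478, "MNR": 0.016336, "NEG": 0.013282, "ADJ": 0.006962,
    "ARG3": 0.005719, "PRP": 0.005641, "CAU": 0.004457, "ARG4": 0.004306,
    "DIR": 0.004086, "PRD": 0.003957, "EXT": 0.002763, "LVB": 0.002178,
    "PRR": 0.002171, "GOL": 0.001357, "DSP": 0.000764, "COM": 0.000571,
    "CXN": 0.000180, "REC": 0.000172, "ARG5": 0.000095, "ARGA": 0.000006,
}

# Role distribution of the md corpus (from the printed output above)
md = {
    "V": 0.326829, "ARG1": 0.234948, "ARG0": 0.147330, "ARG2": 0.080193,
    "TMP": 0.045769, "ADV": 0.025789, "MOD": 0.025292, "DIS": 0.022304,
    "MNR": 0.016459, "LOC": 0.016387, "NEG": 0.013206, "ADJ": 0.006858,
    "ARG3": 0.005703, "PRP": 0.005602, "CAU": 0.004645, "ARG4": 0.004380,
    "DIR": 0.004160, "PRD": 0.003974, "EXT": 0.002780, "PRR": 0.002069,
    "LVB": 0.002022, "GOL": 0.001351, "DSP": 0.000957, "COM": 0.000592,
    "CXN": 0.000154, "REC": 0.000145, "ARG5": 0.000101, "ARGA": 0.000003,
}

# Absolute difference per role
gaps = {role: abs(full[role] - md[role]) for role in full}

# Role with the largest gap, and half the summed gaps (total variation distance)
worst_role = max(gaps, key=gaps.get)
total_variation = 0.5 * sum(gaps.values())

print(worst_role, round(gaps[worst_role], 6))
print(round(total_variation, 6))
```

With these values the largest gap is for TMP and stays below 0.001, which is consistent with the `md` corpus having been sampled to preserve the role distribution.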


In [18]:
core_roles = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5", "ARGA", "V"}

# Create a new column 'role_type' which maps each role to its category
df['role_type'] = df['role'].apply(lambda x: 'core_role' if x in core_roles else 'modifier_role')

# Compute the relative frequency of each role type
role_type_distribution = df['role_type'].value_counts(normalize=True)

print(role_type_distribution)

# full corpus
# core_role        0.799904
# modifier_role    0.200096

core_role        0.799485
modifier_role    0.200515
Name: role_type, dtype: float64


## Extracting the Training Data from the Corpus

The models are trained to predict the modifier role of a phrase. The training data extracted from the corpus therefore consists of all modifier roles together with the strings of the phrases they label.

In [13]:
from models.framework import create_modifiers_df

# this function creates a dataframe with the modifiers and their corresponding roles
# it only includes modifier roles that account for at least 0.05 of all modifier roles in the corpus
# the modifier roles considered for this task are thus: TMP, ADV, MOD, DIS, LOC, MNR, NEG
modifiers_df = create_modifiers_df(md_corpus_onto)
modifiers_df.head(10)

| Unnamed: 0 | sentence_id | sentence_string | frame | role | role_id | indices | string | role_type |
|---|---|---|---|---|---|---|---|---|
| 4 | (1, 106) | You will find nothing about a prophet coming f... | find.01 | MOD | 11061 | [1] | will | modifier_role |
| 22 | (4, 33) | " If apartheid means you want cheap black labo... | go_on.15 | MOD | 4332 | [38] | ca | modifier_role |
| 23 | (4, 33) | " If apartheid means you want cheap black labo... | go_on.15 | NEG | 4333 | [39] | n't | modifier_role |
| 25 | (4, 33) | " If apartheid means you want cheap black labo... | go_on.15 | TMP | 4335 | [42] | forever | modifier_role |
| 30 | (4, 33) | " If apartheid means you want cheap black labo... | want.01 | DIS | 4331 | [20] | also | modifier_role |
| 33 | (4, 33) | " If apartheid means you want cheap black labo... | be.01 | ADV | 4330 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | If apartheid means you want cheap black labor ... | modifier_role |
| 34 | (4, 33) | " If apartheid means you want cheap black labo... | be.01 | ADV | 4331 | [32] | then | modifier_role |
| 41 | (6, 21) | By 1940 , China 's War of Resistance against J... | enter.01 | TMP | 6210 | [0, 1] | By 1940 | modifier_role |
| 47 | (7, 3) | " It would now be physically impossible to beg... | begin.01 | TMP | 732 | [10, 11] | in 1992 | modifier_role |
| 51 | (8, 27) | There 's little doubt that such a move would b... | challenge.01 | MOD | 8271 | [8] | would | modifier_role |
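
The 0.05 cut-off mentioned in the comments can be reproduced from the role distribution of the `md` corpus printed earlier in this notebook. The sketch below is not the implementation of `create_modifiers_df`, which is not shown here. It only illustrates the filtering rule under the assumption that the threshold is applied to each role's share among all modifier roles.

```python
# Relative frequencies of the modifier roles in the md corpus,
# copied from the role distribution printed earlier in this notebook
modifier_freqs = {
    "TMP": 0.045769, "ADV": 0.025789, "MOD": 0.025292, "DIS": 0.022304,
    "MNR": 0.016459, "LOC": 0.016387, "NEG": 0.013206, "ADJ": 0.006858,
    "PRP": 0.005602, "CAU": 0.004645, "DIR": 0.004160, "PRD": 0.003974,
    "EXT": 0.002780, "PRR": 0.002069, "LVB": 0.002022, "GOL": 0.001351,
    "DSP": 0.000957, "COM": 0.000592, "CXN": 0.000154, "REC": 0.000145,
}

# Assumed cut-off, taken from the comment in the cell above
THRESHOLD = 0.05

# Share of each role among all modifier roles
total = sum(modifier_freqs.values())
shares = {role: freq / total for role, freq in modifier_freqs.items()}

# Keep only the roles at or above the threshold
kept = [role for role, share in shares.items() if share >= THRESHOLD]

# Renormalize over the kept roles so that their shares sum to 1
kept_total = sum(modifier_freqs[role] for role in kept)
renormalized = {role: modifier_freqs[role] / kept_total for role in kept}

print(kept)
print({role: round(value, 6) for role, value in renormalized.items()})
```

Under this assumption exactly the seven roles named in the comment survive the filter, and the renormalized shares come out close to the `md` modifier distribution printed in the next section.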


## Preliminary Training Data Analysis

This part of the notebook shows some preliminary analysis of the distribution of the modifier roles in the training data. It is intended to provide some intuition for the data this project operates on.

In [14]:
# get the distribution of modifiers
modifier_distribution = modifiers_df['role'].value_counts(normalize=True)

print(modifier_distribution)

# full corpus (shares relative to all modifier roles, so they sum to less than 1)
# TMP     0.225580
# ADV     0.130124
# MOD     0.125671
# DIS     0.112013
# LOC     0.082351
# MNR     0.081642
# NEG     0.066376


TMP    0.277042
ADV    0.156104
MOD    0.153093
DIS    0.135004
MNR    0.099628
LOC    0.099190
NEG    0.079939
Name: role, dtype: float64


Broadly, three distinct types of modifier can be observed:

1. Single word modifiers: e.g. "usually" as in "He **usually** takes the bus home."
2. Prepositional phrase modifiers: e.g. "in LOC" as in "She had dinner **in Paris**."
3. Subclause modifiers: e.g. "while..." as in "He listened to the radio **while doing the dishes**."

In [21]:

# Classify a modifier by the surface form of its 'string' attribute.
# This is a rough heuristic: every multi-word string that does not start with
# one of the listed prepositions is counted as a subclause.
def classify_role(row):

    # A set of common English prepositions (matched case-sensitively)
    prepositions = {"in", "at", "by", "for", "from", "on", "to", "with", "about", "against", "among", "between"}

    words = row['string'].split()

    if len(words) == 1:
        return 'single_word'

    if words[0] in prepositions:
        return 'prepositional_phrase'

    return 'subclause'

# Apply the classification function row by row to the modifier rows of the DataFrame
modifiers_df['modifier_type'] = modifiers_df[modifiers_df['role_type'] == 'modifier_role'].apply(classify_role, axis=1)

# Compute the relative frequency of each modifier type
modifier_type_distribution = modifiers_df['modifier_type'].value_counts(normalize=True)

print(modifier_type_distribution)

# full corpus
# single_word             0.563476
# subclause               0.277419
# prepositional_phrase    0.159105

single_word             0.591594
subclause               0.267931
prepositional_phrase    0.140475
Name: modifier_type, dtype: float64
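
The heuristic in `classify_role` compares the first word against the preposition set exactly as written, so a sentence-initial phrase such as "By 1940" (one of the TMP examples in the table above) falls through to `subclause`. The sketch below is a standalone variant that works on plain strings, with a hypothetical `case_sensitive` flag, to show the effect of lowercasing the first word. It is not part of the project code, and the distributions above were produced by the original function.

```python
PREPOSITIONS = {"in", "at", "by", "for", "from", "on", "to", "with",
                "about", "against", "among", "between"}


def classify_string(string, case_sensitive=True):
    """Classify a modifier string by its surface form."""
    words = string.split()
    if len(words) == 1:
        return "single_word"
    # Guard against empty strings before looking at the first word
    first = words[0] if words else ""
    if not case_sensitive:
        first = first.lower()
    if first in PREPOSITIONS:
        return "prepositional_phrase"
    return "subclause"


examples = ["usually", "in Paris", "while doing the dishes", "By 1940"]
for text in examples:
    print(text, "->", classify_string(text),
          "/", classify_string(text, case_sensitive=False))
```

Only "By 1940" changes category between the two settings, which suggests that the case-sensitive heuristic assigns some capitalized prepositional phrases to the `subclause` category.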


In [22]:
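# Cross-tabulate roles against modifier types.
# normalize='columns' first scales every modifier-type column to sum to 1.
cross_tab = pd.crosstab(modifiers_df['role'], modifiers_df['modifier_type'], normalize='columns')

# Rescale each row to sum to 1. Because the columns were normalized first, these values
# are not the plain per-role shares; normalize='index' on its own would give those.
cross_tab = cross_tab.div(cross_tab.sum(axis=1), axis=0)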

print(cross_tab)


modifier_type  prepositional_phrase  single_word  subclause
role                                                       
ADV                        0.204741     0.212935   0.582323
DIS                        0.059447     0.757543   0.183010
LOC                        0.763001     0.051148   0.185851
MNR                        0.483793     0.220151   0.296057
MOD                        0.000000     0.999450   0.000550
NEG                        0.000996     0.984910   0.014094
TMP                        0.320321     0.202453   0.477226
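
Because the cell above normalizes by column and then by row, its figures are not the plain share of each modifier type within a role. The toy example below, with made-up counts and pure Python instead of pandas, contrasts the two computations. In pandas, row shares alone correspond to `pd.crosstab(..., normalize='index')`.

```python
# Made-up counts per role and modifier type (illustration only)
counts = {
    "LOC": {"prepositional_phrase": 50, "single_word": 30, "subclause": 20},
    "TMP": {"prepositional_phrase": 30, "single_word": 50, "subclause": 20},
    "MOD": {"prepositional_phrase": 0, "single_word": 100, "subclause": 0},
}
types = ["prepositional_phrase", "single_word", "subclause"]


def normalize_rows(table):
    """Scale every row so that its values sum to 1."""
    result = {}
    for role, row in table.items():
        row_total = sum(row.values())
        result[role] = {t: row[t] / row_total for t in types}
    return result


def normalize_columns(table):
    """Scale every column so that its values sum to 1."""
    col_totals = {t: sum(row[t] for row in table.values()) for t in types}
    return {role: {t: row[t] / col_totals[t] for t in types}
            for role, row in table.items()}


# Plain per-role shares versus the column-then-row scheme
row_shares = normalize_rows(counts)
columns_then_rows = normalize_rows(normalize_columns(counts))

for role in counts:
    print(role, row_shares[role], columns_then_rows[role])
```

In this toy table the dominant `single_word` column is down-weighted by the column step, so the single-word share of LOC drops below its plain row share of 0.3. The same effect may be at work in the table above, since single-word modifiers are the most frequent type in the corpus.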
