# Paragraph causal relation detection: does a paragraph contain a causal relation or not?

**Example**: '3-2: <span style="background-color: lightblue;">[concept Giving to the ECB the ultimate responsibility for supervision of banks in the euro area concept]</span> <span style="background-color: pink;">[explanation will decisively contribute to increase explanation]</span> <span style="background-color: lightblue;">[concept confidence between the banks concept]</span> <span style="background-color: pink;">[explanation and in this way increase explanation]</span> <span style="background-color: lightblue;">[concept the financial stability in the euro area concept]</span>. The euro area governments and the European institutions, including naturally the European Commission and the ECB, will do whatever is necessary to secure the financial stability of the euro area.\n'

## 1. Load data

This notebook expects three files in the subdirectory `csv`: `Map_Contents-20200726.csv`, `Speech_Contents-20210520.txt` and `Speeches-20210520.txt`. It looks for the speech files in the subdirectory `txt`. The name of each speech file is expected to start with the date, followed by a space and the surname of the speaker (currently restricted to one word, see function `get_speech_id`).
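
The file name convention can be illustrated with a small sketch. The helper `parse_speech_file_name` and its regular expression are hypothetical and are not the notebook's `get_speech_id`. They only assume that a name starts with a date written as `YYYY-MM-DD`, followed by a space and a one-word surname.

```python
import re

# Hypothetical pattern: date (YYYY-MM-DD), one space, one-word surname.
FILE_NAME_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) ([^\s.]+)")


def parse_speech_file_name(file_name):
    """Return (date, surname) for a matching file name, otherwise None."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_speech_file_name("2015-01-19 Merkel Bundesregerung ann g.txt"))
print(parse_speech_file_name("placeholder.txt"))
```

File names that do not follow the convention, such as `placeholder.txt`, yield `None` in this sketch.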

If you are just interested in learning what the code is doing, you can skip all code blocks with the commands `import` (load libraries), `assert` (perform tests) and `def` (define functions), and examine the other code blocks.

In [1]:
import os
from src.data.make_dataset import read_data_file

In [2]:
assert os.path.isdir("csv"), 'The directory "csv" does not exist!'
assert os.path.isdir("txt"), 'The directory "txt" does not exist!'

In [3]:
map_contents = read_data_file("csv/Map_Contents-20200726.csv")

In [4]:
speech_contents = read_data_file("csv/Speech_Contents-20210520.txt")

In [5]:
speeches = read_data_file("csv/Speeches-20210520.txt")

## 2. Predict presence of causal relations in paragraphs

Steps:

1. store the paragraphs in the data structure X (data) after separating punctuation from words and replacing upper case by lower case
2. create a data structure y (labels) with True for paragraphs with causal relations and False for others
3. predict a label for each paragraph with a machine learning model generated from the other paragraphs
4. evaluate the results

The code in this task uses the packages `fasttext` (for machine learning) and `nltk` (for language processing).

The task uses limited natural language processing to prepare the data for machine learning:

1. tokenization: separate punctuation from words
2. conversion of upper case characters to lower case

Other interesting natural language preprocessing steps:

3. part-of-speech tagging
4. full parsing (Stanford parser)
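
Steps 1 and 2 can be sketched with a regular expression. This is a simplified stand-in for `nltk.tokenize.word_tokenize`, which the notebook imports. The function name `tokenize_and_lowercase` is an assumption made for illustration, and the regular expression handles contractions and abbreviations differently from `nltk`.

```python
import re


def tokenize_and_lowercase(paragraph):
    """Separate punctuation from words and convert to lower case."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", paragraph.lower())
    return " ".join(tokens)


print(tokenize_and_lowercase("Confidence will increase, and stability follows."))
# prints: confidence will increase , and stability follows .
```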

In [11]:
from src.data.make_dataset import make_dataset

In [12]:
import fasttext
from langdetect import detect
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import re
import sklearn.model_selection
from termcolor import colored
from IPython.display import clear_output

In [13]:
def make_train_test(X, y, test_index=0):
    train_list = []
    test_list = []
    index = 0
    for key in sorted(X.keys()):
        if index == test_index:
            test_list.append(f"__label__{str(y[key])} {X[key]}")
        else:
            train_list.append(f"__label__{str(y[key])} {X[key]}")
        index += 1
    return train_list, test_list

In [14]:
def make_train_file(file_name, train_list):
    # The context manager closes the file even if writing fails
    with open(file_name, "w") as data_file:
        for line in train_list:
            print(line, file=data_file)

In [15]:
def decode_label(label):
    return re.sub("__label__", "", label)

In [16]:
def show_results(results):
    result_list = []
    for key in results:
        result_list.append({"paragraph": key})
        result_list[-1].update(results[key])
    return pd.DataFrame(result_list, index=[""]*len(result_list))

In [17]:
def evaluate_results(results):
    correct = 0
    found = 0
    target = 0
    accuracy_count = 0
    for key in results:
        if results[key]["predicted"] == str(True):
            found += 1
            if results[key]["predicted"] == str(results[key]["correct"]):
                correct += 1
        if str(results[key]["correct"]) == str(True):
            target += 1
        if results[key]["predicted"] == str(results[key]["correct"]):
            accuracy_count += 1
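    # Guard against division by zero when no positive labels are predicted or present
    precision = round(100 * correct / found, 1) if found else 0.0
    recall = round(100 * correct / target, 1) if target else 0.0
    f = round(2 * precision * recall / (precision + recall), 1) if precision + recall else 0.0
    accuracy = round(100 * accuracy_count / len(results), 1) if results else 0.0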
    print(f"precision: {precision}%; recall: {recall}%; F: {f}; accuracy: {accuracy}%")

In [18]:
def count_y_values(y):
    values = {}
    for key in y:
        if y[key] not in values:
            values[y[key]] = 0
        values[y[key]] += 1
    table = []
    for key in values:
        table.append({"label": key, "count": values[key], "percentage": f"{round(100*values[key]/len(y), 1)}%"})
    return pd.DataFrame(table, index=[""]*len(table))

In [19]:
def squeal(text):
    clear_output(wait=True)
    print(text)

In [20]:
def make_fasttext_data(data_in):
    data_out = []
    for paragraph, label in data_in:
        data_out.append("__label__" + str(label) + " " + paragraph)
    return data_out

In [21]:
def run_experiments(X, y, wordNgrams=1, pretrainedVectors=""):
    predicted_labels_all = []
    correct_labels = [ y[key] for key in X if key in y]
    counter = 0
    data = np.array([ (X[key], y[key]) for key in X if key in y])
    # Split the filtered array, so the indices stay valid if some keys of X are missing from y
    for train_items, test_items in sklearn.model_selection.KFold(n_splits=10).split(data):
        train_data = make_fasttext_data(data[train_items])
        make_train_file("train_file.txt", train_data)
        model = fasttext.train_supervised("train_file.txt", dim=300, pretrainedVectors=pretrainedVectors, wordNgrams=wordNgrams)
        test_data = make_fasttext_data(data[test_items])
        predicted_labels = model.predict(test_data)
        predicted_labels_all.extend(predicted_labels[0])
        counter += 1
        squeal(f"Ran experiment {counter} of 10")
    return { i: { "correct": correct_labels[i], "predicted": decode_label(predicted_labels_all[i][0]) } for i in range(0, len(predicted_labels_all)) }

The variable `X` in the next code block contains all the data (paragraphs) available for machine learning. The variable `y` contains all the associated labels, with values `True` or `False`.   

In [23]:
X, y = make_dataset(speeches, speech_contents, map_contents)

skipping file in language de: 2015-01-19 Merkel Bundesregerung ann g.txt
skipping file in language nl: 2011-09-27 Rutte Rijksoverheid ann.txt
skipping file in language nl: 2011-10-28 Knot dnb_01 ANN NL.txt
skipping file in language fr: 2013-04-17 Hollande SFM2020 ann fr.txt
skipping file in language fr: 2010-04-20 Barroso European Commission ann fr.txt
skipping file in language de: 2012-01-06 Rutte CSU klausurtagung ann G.txt
skipping file in language nl: 2011-04-06 Rutte FD evenement ann NL.txt
skipping file in language fr: 2011-01-13 Sarkozy gb ann.txt
skipping file in language fr: 2012-08-30 Hollande SFM2020 ann fr.txt
skipping file in language fr: 2009-12-01 Sarkozy Elysee (Economy) ann fr.txt
skipping file placeholder.txt
skipping file in language de: 2013-11-21 Merkel Bundesregerung ann g.txt
skipping file in language unk: 2012-07-26 Barroso European Commission.txt
skipping file in language fr: 2013-02-19 Hollande SFM2020 ann fr.txt
skipping file in language fr: 2009-12-14 Sarkoz

In [29]:
count_y_values(y)

| label | count | percentage |
| ----- | ----- | ---------- |
| True  | 660   | 73.6%      |
| False | 237   | 26.4%      |
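
For orientation, the label counts give a simple reference point: a hypothetical classifier that always predicts the most frequent label. This baseline is not part of the original notebook. It is a sketch computed from the counts shown above.

```python
# Label counts taken from the output of count_y_values(y).
label_counts = {True: 660, False: 237}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = round(100 * label_counts[majority_label] / total, 1)

print(f"always predicting {majority_label}: accuracy {baseline_accuracy}%")
# prints: always predicting True: accuracy 73.6%
```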


We evaluate with 10-fold cross-validation: the paragraphs are divided into ten parts, and we run ten machine learning experiments. In each experiment, one part is used as test data while the other nine parts are used for training a machine learning model that predicts the labels of the test paragraphs.
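
The function `run_experiments` splits the data with `KFold(n_splits=10)`. The plain-Python sketch below shows how unshuffled k-fold splitting divides item indices into consecutive folds. The helper `k_fold_indices` is hypothetical and only stands in for `sklearn.model_selection.KFold`. The item count 897 is the sum of the label counts shown earlier.

```python
def k_fold_indices(n_items, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    fold_sizes = [n_items // n_splits] * n_splits
    for i in range(n_items % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


folds = list(k_fold_indices(n_items=897, n_splits=10))
print([len(test) for _, test in folds])
# prints: [90, 90, 90, 90, 90, 90, 90, 89, 89, 89]
```

Every paragraph ends up in exactly one test fold, so each paragraph receives one prediction from a model that was trained without it.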

In [338]:
results = run_experiments(X, y, wordNgrams=3, pretrainedVectors="wiki.en.vec")

Ran experiment 10 of 10


In [339]:
evaluate_results(results)

precision: 77.0%; recall: 92.8%; F: 84.2; accuracy: 74.5%


| ngrams | language model | precision | recall | F    | accuracy |
| ------ | -------------- | --------- | ------ | ---- | -------- |
| 1      | yes            | 80.0%     | 85.5%  | 82.7 | 73.8%    |
| 2      | yes            | 78.1%     | 90.3%  | 83.8 | 74.5%    |
| 3      | yes            | 77.0%     | 92.8%  | 84.2 | 74.5%    |
| 1      | no             | 73.3%     | 100.0% | 84.6 | 73.3%    |
| 2      | no             | 73.0%     | 100.0% | 84.4 | 73.0%    |
| 3      | no             | 73.0%     | 100.0% | 84.4 | 73.0%    |
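
The function `evaluate_results` computes F from the rounded precision and recall percentages. As a sanity check, the sketch below recomputes the F column from the precision and recall values in the table. The helper `f_score` is introduced here for illustration only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, rounded to one decimal."""
    return round(2 * precision * recall / (precision + recall), 1)


# (precision, recall) pairs in percent, copied from the table above.
rows = [(80.0, 85.5), (78.1, 90.3), (77.0, 92.8), (73.3, 100.0), (73.0, 100.0)]
f_values = [f_score(precision, recall) for precision, recall in rows]
print(f_values)
# prints: [82.7, 83.8, 84.2, 84.6, 84.4]
```

In the rows without a language model, recall is 100% and precision equals accuracy, which is consistent with a model that predicts `True` for every paragraph.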