# Cognitive Mapping

The data consists of texts and relations. Each relation has three parts (Concept1, Explanation and Concept2), and each part corresponds to a multi-word phrase in a text. All relations and texts can be easily read into a Jupyter notebook.

The goal is to identify the three parts of the relation in a text automatically.

**Example**: '3-2: <span style="background-color: lightblue;">[concept Giving to the ECB the ultimate responsibility for supervision of banks in the euro area concept]</span> <span style="background-color: pink;">[explanation will decisively contribute to increase explanation]</span> <span style="background-color: lightblue;">[concept confidence between the banks concept]</span> <span style="background-color: pink;">[explanation and in this way increase explanation]</span> <span style="background-color: lightblue;">[concept the financial stability in the euro area concept]</span>. The euro area governments and the European institutions, including naturally the European Commission and the ECB, will do whatever is necessary to secure the financial stability of the euro area.\n'
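The bracket markup in the example can be turned into a list of labeled phrases. The following is a minimal sketch, assuming that every phrase is written as `[concept ... concept]` or `[explanation ... explanation]`, as in the example above; the names `extract_phrases`, `TAG_RE` and `PHRASE_RE` are hypothetical and not part of the notebook.

```python
import re

# Assumption: phrases are marked as "[concept ... concept]" or
# "[explanation ... explanation]"; HTML <span> tags only add colour.
TAG_RE = re.compile(r"<[^>]+>")
PHRASE_RE = re.compile(r"\[(concept|explanation) (.*?) \1\]")


def extract_phrases(annotated_text):
    """Return (label, phrase) pairs in reading order."""
    plain = TAG_RE.sub("", annotated_text)
    return [(match.group(1), match.group(2)) for match in PHRASE_RE.finditer(plain)]


# Shortened version of the example paragraph above
example = (
    '<span style="background-color: lightblue;">'
    "[concept confidence between the banks concept]</span> "
    '<span style="background-color: pink;">'
    "[explanation and in this way increase explanation]</span> "
    '<span style="background-color: lightblue;">'
    "[concept the financial stability in the euro area concept]</span>."
)

for label, phrase in extract_phrases(example):
    print(f"{label}: {phrase}")
```

The sketch does not handle nested brackets or phrases that themselves end in the word `concept` or `explanation`.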

## 1. Plan
    
1. classify paragraphs with machine learning: does a paragraph contain a causal relation or not?
2. find the phrases of relations in the text: label each phrase as a concept, an explanation, or not part of a relation
3. identify relations based on the recognized concept phrases and explanation phrases
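Step 3 could be approached along the following lines. This is only a sketch under two assumptions that the notebook does not confirm: tokens carry the labels `C`, `E` and `X` used in section 1.1, and a relation is any concept phrase followed directly by an explanation phrase and another concept phrase. The function names `group_phrases` and `build_relations` are hypothetical.

```python
from itertools import groupby


def group_phrases(tokens, labels):
    """Merge adjacent tokens with the same label into (label, phrase) pairs.

    Tokens labeled "X" (outside any relation) are dropped.
    """
    phrases = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "X":
            phrases.append((label, " ".join(token for token, _ in group)))
    return phrases


def build_relations(phrases):
    """Return (Concept1, Explanation, Concept2) for each C, E, C sequence."""
    relations = []
    for first, middle, last in zip(phrases, phrases[1:], phrases[2:]):
        if (first[0], middle[0], last[0]) == ("C", "E", "C"):
            relations.append((first[1], middle[1], last[1]))
    return relations


# Hypothetical labeled tokens, for illustration only
tokens = ["the", "ECB", "will", "increase", "confidence", "."]
labels = ["C", "C", "E", "E", "C", "X"]
print(build_relations(group_phrases(tokens, labels)))
```

Because the windows overlap, a concept that ends one relation can also start the next one, as in the chained example at the top of this notebook. Two concept phrases that touch each other would be merged into one, which a begin/inside labeling scheme could avoid.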

We need a tagger or an entity recognition program, for example Hugging Face transformers with BERT (https://github.com/huggingface/transformers) or spaCy.

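### 1.1. Data encoding for plan step 2

Each token is written on its own line, followed by its label: `C` (concept), `E` (explanation) or `X` (not part of a relation).
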
contribute E\
to E\
increase E\
confidence C\
between C\
the C\
banks C\
the X
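A token-per-line encoding like this can be derived from the bracket annotation. The sketch below assumes whitespace tokenization and the `[concept ... concept]` / `[explanation ... explanation]` markup of the example paragraph; `encode`, `SEGMENT_RE` and `CODES` are hypothetical names.

```python
import re

SEGMENT_RE = re.compile(r"\[(concept|explanation) (.*?) \1\]")
CODES = {"concept": "C", "explanation": "E"}


def encode(annotated_text):
    """Return (token, code) pairs: C = concept, E = explanation, X = other."""
    encoded = []
    position = 0
    for match in SEGMENT_RE.finditer(annotated_text):
        # Text between two bracketed phrases is outside any relation
        outside = annotated_text[position:match.start()]
        encoded += [(token, "X") for token in outside.split()]
        code = CODES[match.group(1)]
        encoded += [(token, code) for token in match.group(2).split()]
        position = match.end()
    # Remaining text after the last bracketed phrase
    encoded += [(token, "X") for token in annotated_text[position:].split()]
    return encoded


text = (
    "[explanation will decisively contribute to increase explanation] "
    "[concept confidence between the banks concept] . The euro area"
)
for token, code in encode(text):
    print(token, code)
```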

### 1.2. References

Hosseini, M. J., Chambers, N., Reddy, S., Holt, X. R., Cohen, S. B., Johnson, M., & Steedman, M. (2018). Learning Typed Entailment Graphs with Global Soft Constraints. Transactions of the Association for Computational Linguistics.

Sizov, G. & Ozturk, P. (2013). Automatic Extraction of Reasoning Chains from Textual Reports. In: Zornitsa Kozareva, Irina Matveeva, Gabor Melli, Vivi Nastase (eds.), Proceedings of TextGraphs-8 Workshop “Graph-based Methods for Natural Language Processing”, Empirical Methods in Natural Language Processing.

Noah Jadallah (2021). [Cause-Effect Detection for Software Requirements Based on Token Classification with BERT](https://colab.research.google.com/drive/14V9Ooy3aNPsRfTK88krwsereia8cfSPc?usp=sharing#scrollTo=H_kiqxjbW3lh). Seminar Natural Language Processing for Software Engineering, Winter-term 2020/2021, Technical University Munich.

Erik Tjong Kim Sang and Katja Hofmann (2009). [Lexical Patterns or Dependency Patterns: Which Is Better for Hypernym Extraction?](https://ifarm.nl/erikt/papers/conll2009.pdf) In: Proceedings of CoNLL-2009, Boulder, CO, USA, 2009, pages 174-182.

## 2. Load data

This notebook expects three files in a subdirectory `csv`: `Map_Contents-20200726.csv`, `Speech_Contents-20210520.txt` and `Speeches-20210520.txt`. It will look for files with the speeches in the subdirectory `txt`. The names of the speech files are expected to start with the date, followed by a space and the surname of the speaker (currently restricted to one word, see function `get_speech_id`).

In [1]:
import os
import pandas as pd

In [2]:
assert os.path.isdir("csv"), 'The directory "csv" does not exist!'
assert os.path.isdir("txt"), 'The directory "txt" does not exist!'

In [3]:
map_contents = pd.read_csv("csv/Map_Contents-20200726.csv", encoding="latin1")

In [4]:
speech_contents = pd.read_csv("csv/Speech_Contents-20210520.txt", encoding="latin1")

In [5]:
speeches = pd.read_csv("csv/Speeches-20210520.txt", encoding="latin1")

## 3. Task 1: Predict presence of causal relations in paragraphs

Steps:

1. store the paragraphs in the data structure X (data) after separating punctuation from words and replacing upper case by lower case
2. create a data structure y (labels) with True for paragraphs with causal relations and False for others
3. predict a label for each paragraph with a machine learning model generated from the other paragraphs
4. evaluate the results
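Step 3 is a leave-one-out setup: each paragraph is predicted by a model trained on all other paragraphs. The sketch below illustrates only the evaluation loop. It swaps in a trivial majority-label "model" instead of `fasttext`, and the keys and labels in `toy_y` are invented for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Stand-in model: predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def leave_one_out(y):
    """Hold out each item once and predict it from the remaining items."""
    results = {}
    for held_out in sorted(y):
        train_labels = [y[key] for key in y if key != held_out]
        results[held_out] = {
            "correct": y[held_out],
            "predicted": majority_label(train_labels),
        }
    return results


# Hypothetical labels: paragraph key -> contains a causal relation?
toy_y = {"1 1-1": True, "1 1-2": True, "1 2-1": False, "1 2-2": True}
toy_results = leave_one_out(toy_y)
correct = sum(r["correct"] == r["predicted"] for r in toy_results.values())
print(f"correct: {correct} of {len(toy_results)}")
```

With so few paragraphs, leave-one-out uses the data more fully than a single train/test split, at the cost of training one model per paragraph.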

The code in this task uses the packages `fasttext` (for machine learning) and `nltk` (for language processing).

The task uses limited natural language processing to prepare the data for machine learning:

1. tokenization: separate punctuation from words
2. conversion of upper case characters to lower case

Other interesting natural language preprocessing steps:

3. part-of-speech tagging
4. full parsing (Stanford parser)
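The effect of the two preprocessing steps used here (tokenization and lowercasing) can be illustrated without any external package. The sketch below uses a simple regular expression as a rough stand-in for `nltk`'s `word_tokenize`; it is not equivalent to it, and `simple_tokenize` and `preprocess` are hypothetical names.

```python
import re


def simple_tokenize(text):
    """Separate punctuation from words (rough stand-in for word_tokenize)."""
    # A word may contain internal hyphens or apostrophes;
    # any other non-space character becomes a token of its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


def preprocess(text):
    """Tokenize, join the tokens with single spaces, and lowercase."""
    return " ".join(simple_tokenize(text)).lower()


print(preprocess("The euro area governments and the ECB will do whatever is necessary."))
```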

In [6]:
import fasttext
from nltk.tokenize import word_tokenize
import re
from IPython.display import clear_output

In [7]:
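def get_speech_id(file_name, speeches):
    """Map a file name "<date> <surname>..." to its Speech_ID; return None if unknown."""
    try:
        file_name_parts = file_name.split()
        date = file_name_parts[0]
        # The surname is the part before the first underscore; capitalize its first letter
        speaker = file_name_parts[1].split("_")[0]
        speaker = speaker[0].upper() + speaker[1:]
        speech_identifier = f"{speaker} {date}"
        # Special case: this speech is stored under a different date in the metadata
        speech_identifier = speech_identifier.replace("Simor 2010-05-25", "Simor 2010-05-26")
        matches = speeches[speeches["Speech_Identifier"] == speech_identifier]["Speech_ID"]
        return int(matches.iloc[0])
    except (IndexError, KeyError, ValueError):
        return None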

In [8]:
def get_paragraph_ids(speech_id, speech_contents):
    paragraph_ids = {}
    try:
        for i, row in speech_contents[speech_contents["Speech_ID"] == speech_id].iterrows():
            paragraph_ids[row["Speech_Content_ID"]] = row["Speech_Content_Title"]
    except:
        pass
    return paragraph_ids

In [9]:
def check_paragraphs(speech_id, paragraph_ids, map_contents):
    paragraph_values = {}
    for i, row in map_contents[map_contents["Content_Speech_ID"] == speech_id].iterrows():
        if row["Content_Source_ID"] not in paragraph_ids:
            print(f'warning: unknown paragraph id {row["Content_Source_ID"]} for document {speech_id}')
        else:
            paragraph_values[f'{speech_id} {paragraph_ids[row["Content_Source_ID"]]}'] = True
    return paragraph_values

In [10]:
def read_paragraphs(file_name):
    # Read one paragraph per line, stripping surrounding whitespace
    with open(file_name, "r", encoding="latin1") as data_file:
        return [line.strip() for line in data_file]

In [11]:
def select_paragraphs(paragraph_list, paragraph_values, speech_id):
    paragraph_texts = {}
    for paragraph in paragraph_list:
        tokens = paragraph.split()
        if len(tokens) > 0 and re.search(r'^\d+-\d+:$', tokens[0]):
            key = re.sub(":", "", tokens[0])
            key = f"{speech_id} {key}" 
            paragraph_texts[key] = " ".join(word_tokenize(" ".join(tokens[1:]))).lower()
            if key not in paragraph_values:
                paragraph_values[key] = False
    return paragraph_texts

In [12]:
def read_data(speeches, speech_contents, map_contents):
    paragraph_texts_all = {}
    paragraph_values_all = {}
    files = os.listdir("txt")
    for file_name in files:
        speech_id = get_speech_id(file_name, speeches)
        if speech_id is None:
            print(f"skipping file {file_name}")
        else:
            paragraph_ids = get_paragraph_ids(speech_id, speech_contents)
            paragraph_values = check_paragraphs(speech_id, paragraph_ids, map_contents)
            paragraph_list = read_paragraphs(f"txt/{file_name}")
            paragraph_texts = select_paragraphs(paragraph_list, paragraph_values, speech_id)
            paragraph_texts_all.update(paragraph_texts)
            paragraph_values_all.update(paragraph_values)
    return paragraph_texts_all, paragraph_values_all

In [13]:
def make_train_test(X, y, test_index=0):
    train_list = []
    test_list = []
    index = 0
    for key in sorted(X.keys()):
        if index == test_index:
            test_list.append(f"__label__{str(y[key])} {X[key]}")
        else:
            train_list.append(f"__label__{str(y[key])} {X[key]}")
        index += 1
    return train_list, test_list

In [14]:
def make_train_file(file_name, train_list):
    # Write one labeled training example per line, as fasttext expects
    with open(file_name, "w") as data_file:
        for line in train_list:
            print(line, file=data_file)

In [15]:
def decode_label(label):
    return re.sub("__label__", "", label)

In [16]:
def show_results(results):
    return pd.DataFrame(list(results.values()), index=list(results.keys()))

In [17]:
def evaluate_results(results):
    correct_count = 0
    for key in results:
        if decode_label(results[key]["predicted"]) == str(results[key]["correct"]):
            correct_count += 1
    print(f"correct: {round(100*correct_count/len(results), 1)}%")

In [18]:
def count_y_values(y):
    values = {}
    for key in y:
        if y[key] not in values:
            values[y[key]] = 0
        values[y[key]] += 1
    for key in values:
        print(values[key], f"{round(100*values[key]/len(y), 1)}%", key)

In [19]:
def squeal(text):
    clear_output(wait=True)
    print(text)

In [20]:
X, y = read_data(speeches, speech_contents, map_contents)

skipping file placeholder.txt


In [21]:
count_y_values(y)

61 80.3% True
15 19.7% False


In [22]:
results = {}
keys = sorted(X.keys())
for i, key in enumerate(keys):
    # Leave-one-out: train on all paragraphs except paragraph i, then predict paragraph i
    train_list, test_list = make_train_test(X, y, i)
    make_train_file("train_file.txt", train_list)
    model = fasttext.train_supervised("train_file.txt")
    predicted_label = model.predict(test_list)
    results[key] = {"correct": y[key], "predicted": decode_label(predicted_label[0][0][0])}
    squeal(f"Running experiment {i+1} of {len(X)}")

Running experiment 76 of 76


In [23]:
evaluate_results(results)

correct: 80.3%
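An accuracy figure is easier to interpret next to a majority baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below computes that baseline from the label counts printed by `count_y_values` above, and adds per-label precision and recall. The `toy_results` dictionary is invented for illustration, and `majority_baseline` and `per_class_scores` are hypothetical helpers, not part of the notebook.

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


def per_class_scores(results, label):
    """Precision and recall for one label, comparing labels as strings."""
    tp = fp = fn = 0
    for entry in results.values():
        predicted = str(entry["predicted"]) == str(label)
        actual = str(entry["correct"]) == str(label)
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Label counts taken from the count_y_values output above
label_counts = {True: 61, False: 15}
print(f"majority baseline: {round(100 * majority_baseline(label_counts), 1)}%")

# Hypothetical results in the same format as the results dictionary
toy_results = {
    "a": {"correct": True, "predicted": "True"},
    "b": {"correct": False, "predicted": "True"},
    "c": {"correct": True, "predicted": "True"},
}
print(per_class_scores(toy_results, False))
```

If the accuracy equals the majority baseline, the scores for the `False` label show whether the classifier ever predicts the minority label correctly.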


In [24]:
show_results(results)

| | correct | predicted |
|---|---|---|
| 232 2-2 | True | True |
| 232 2-3 | True | True |
| 232 3-1 | True | True |
| 232 4-1 | True | True |
| 232 4-2 | True | True |
| ... | ... | ... |
| 536 1-2 | True | True |
| 536 2-3 | True | True |
| 536 3-1 | False | True |
| 536 3-2 | True | True |


## 4. Task 2: Find relevant phrases in text

In [None]:
from transformers import pipeline

### 4.1 Testing the pretrained Named Entity Recognition (NER) model

In [None]:
classifier = pipeline('ner')

In [None]:
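# paragraph_list is local to read_data(), so reload one speech file first (here: the first usable one)
paragraph_list = read_paragraphs("txt/" + next(f for f in sorted(os.listdir("txt")) if get_speech_id(f, speeches) is not None))
print(paragraph_list[0])
classifier(paragraph_list[0])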

### 4.2 Training a phrase recognition model

Source: https://huggingface.co/transformers/task_summary.html#named-entity-recognition

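The code below only applies the pretrained CoNLL-03 NER model. Defining `label_list` does not change the behaviour of the system, because the model still predicts its own labels (`model.config.id2label`). To train a model on concept and explanation labels, perhaps we need to start from https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py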

In [None]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
    "O",       # Outside of a phrase
    "I-CON",   # Concept
    "I-EXP",   # Explanation
]
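For fine-tuning, a token classification model needs a mapping between label names and integer ids. The sketch below builds these mappings for the three labels, assuming the `C`/`E`/`X` codes of section 1.1 correspond to `I-CON`/`I-EXP`/`O`. The commented call at the end shows how such mappings are typically passed to `from_pretrained`; it is an untested assumption here, not the notebook's method.

```python
label_list = ["O", "I-CON", "I-EXP"]

# Integer ids for the labels, in both directions
label2id = {label: index for index, label in enumerate(label_list)}
id2label = {index: label for label, index in label2id.items()}

# Assumed correspondence with the codes used in section 1.1
CODE_TO_LABEL = {"X": "O", "C": "I-CON", "E": "I-EXP"}


def encode_codes(codes):
    """Convert a sequence of C/E/X codes into integer label ids."""
    return [label2id[CODE_TO_LABEL[code]] for code in codes]


print(encode_codes(["E", "E", "C", "X"]))

# Assumption: a model with a fresh 3-label head could be created like this:
# model = AutoModelForTokenClassification.from_pretrained(
#     "bert-base-cased", num_labels=len(label_list),
#     id2label=id2label, label2id=label2id)
```

A model created this way would start with an untrained classification head, so it still needs training data in the token-per-line encoding before its predictions are meaningful.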

In [None]:
sequence = paragraph_list[0]

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)

In [None]:
predictions

In [None]:
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

In [None]:
paragraph_list
