# Assignment #5: Extraction of subject–verb–object triples
Author: Pierre Nugues

## Objectives

In this assignment, you will extract relations from a parsed sentence involving two words or entities. You will start with pairs of words, namely a subject and its verb, and then extend your programs to triples: subject, verb, and object. In the triples, the subject and the object are the entities, and the verb represents the relation. 

$$
\text{Subject} \xrightarrow[\text{}]{\text{Verb}} \text{Object}
$$

The overall work is inspired by the _Prismatic_ knowledge base used in the IBM Watson system, where the subject, verb, and object triples are a way to extract knowledge from text.  See <a href="http://www.aclweb.org/anthology/W/W10/W10-0915.pdf">this paper</a> for details. 

You will apply the extraction to multilingual texts: 
1. First, you will use a parsed corpus of Swedish;
2. Then, you will apply it to other languages.
            
The objectives of this assignment are to:
* Extract the subject–verb pairs from a parsed corpus
* Extend the extraction to subject–verb–object triples
* Understand how dependency parsing can help create a knowledge base
* Write a short report of 1 to 2 pages on the assignment

### Swedish

You will extract all the subject–verb pairs and the subject–verb–object triples from the Swedish _Talbanken_ training corpus. As a starting point, you can use the CoNLL-U reader available in the cells below; it also works for the other corpora. Alternatively, you can write a reader yourself, either from scratch or starting from the one you used to read the CoNLL 2000 format in the fourth lab.

The column names of the CoNLL-U corpora are:

In [285]:
column_names_u = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']

#### Functions to read the CoNLL-U files

In [286]:
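def read_sentences(file):
    """
    Creates a list of sentences from the corpus
    Each sentence is a string
    :param file: path to a CoNLL-U file
    :return: list of sentences
    """
    # The context manager closes the file once it has been read
    with open(file, encoding='utf-8') as f:
        text = f.read().strip()
    sentences = text.split('\n\n')
    return sentences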

In [287]:
def split_rows(sentences, column_names):
    """
    Creates a list of sentences where each sentence is a list of lines
    Each line is a dictionary of columns
    :param sentences:
    :param column_names:
    :return:
    """
    new_sentences = []
    root_values = ['0', 'ROOT', 'ROOT', 'ROOT', 'ROOT', 'ROOT', '0', 'ROOT', '0', 'ROOT']
    start = [dict(zip(column_names, root_values))]
    for sentence in sentences:
        rows = sentence.split('\n')
        # Skip the comment lines; the `row and` test avoids an IndexError on empty rows
        sentence = [dict(zip(column_names, row.split('\t'))) for row in rows if row and not row.startswith('#')]
        sentence = start + sentence
        new_sentences.append(sentence)
    return new_sentences
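
To make the row format concrete, here is a minimal sketch that applies the same splitting logic to a tiny in-memory sentence. The sentence is made up for illustration and is not taken from Talbanken; the `sample` and `rows` names are assumptions of this example.

```python
# Column names of the CoNLL-U format, as defined in the notebook
column_names_u = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS',
                  'HEAD', 'DEPREL', 'DEPS', 'MISC']

# A made-up two-word sentence in CoNLL-U format (tab-separated columns)
sample = (
    "# text = Hon sover\n"
    "1\tHon\thon\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsova\tVERB\t_\t_\t0\troot\t_\t_"
)

# Skip the comment lines and map each row to a dictionary of columns
rows = [dict(zip(column_names_u, line.split('\t')))
        for line in sample.split('\n')
        if line and not line.startswith('#')]

print(rows[0]['FORM'], rows[0]['DEPREL'], rows[0]['HEAD'])
```

Note that all the values, including `ID` and `HEAD`, are kept as strings.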

#### Reading the corpus

We load the Swedish Talbanken corpus.

In [288]:
sentences = read_sentences(path_sv)
formatted_corpus = split_rows(sentences, column_names_u)

#### Converting the lists into dictionaries

To ease the processing of some corpora, you will use a dictionary representation of the sentences. The keys will be the `ID` values. We do this because an `ID` is not necessarily a number: it can also be a range, such as `1-2`, for a token expanded into several words.

In [291]:
def convert_to_dict(formatted_corpus):
    """
    Converts each sentence from a list of words to a dictionary where the keys are id
    :param formatted_corpus:
    :return:
    """
    formatted_corpus_dict = []
    for sentence in formatted_corpus:
        sentence_dict = {}
        for word in sentence:
            sentence_dict[word['ID']] = word
        formatted_corpus_dict.append(sentence_dict)
    return formatted_corpus_dict
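
As a small illustration of why the dictionary form is convenient, the sketch below finds the head of a word with a single lookup on its `HEAD` value. The sentence is made up and only carries the columns needed here.

```python
# A made-up sentence in the list-of-dictionaries form, with the ROOT row first
sentence = [
    {'ID': '0', 'FORM': 'ROOT', 'HEAD': '0', 'DEPREL': 'ROOT'},
    {'ID': '1', 'FORM': 'Hon', 'HEAD': '2', 'DEPREL': 'nsubj'},
    {'ID': '2', 'FORM': 'sover', 'HEAD': '0', 'DEPREL': 'root'},
]

# Same conversion as in convert_to_dict(), for one sentence
sentence_dict = {word['ID']: word for word in sentence}

# The head of a word is reached directly from its HEAD value
subject = sentence_dict['1']
head = sentence_dict[subject['HEAD']]
print(subject['FORM'], '->', head['FORM'])
```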

### Extracting the subject-verb pairs

Now you will extract the subject-verb pairs and set the words in lowercase. In the second sentence of the corpus, this corresponds to `(beskattning, införs)`. You will call the function `extract_pairs(formatted_corpus_dict)` and you will store the results in a `pairs_sv` variable. All the corpora in the Universal Dependencies format use the same labels for the grammatical functions: `nsubj` for the subject and `obj` for the direct object.

You can use the algorithm you want. However, here are some hints on the results:
* You will extract all the subject-verb pairs in the corpus. In the extraction, just check the function between two words. Do not check if the part of speech is a verb or a noun in the pair. You will also ignore the possible function suffixes: for instance, you will treat `nsubj:pass`, where `pass` means passive, as `nsubj`.
* You will return the results as a Python dictionary, where the key will be the pair and the value, the count, as for instance `{('beskattning', 'införs'): 1}`. Be sure you understand the Python dictionaries and note that you can use tuples as keys.
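
The two hints above can be sketched as follows. The counts are made up for illustration, and the `deprel`, `is_subject`, and `base_function` names are assumptions of this example, not part of the assignment.

```python
from collections import Counter

pairs = Counter()
# A tuple is hashable, so it can serve as a dictionary key
pairs[('beskattning', 'införs')] += 1
pairs[('beskattning', 'införs')] += 1
pairs[('vi', 'har')] += 1

# Two possible ways to ignore a suffix such as :pass
deprel = 'nsubj:pass'
is_subject = deprel.startswith('nsubj')
base_function = deprel.split(':')[0]

print(pairs[('beskattning', 'införs')], is_subject, base_function)
```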

In [293]:
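from collections import Counter

def extract_pairs(corpus):
    """
    Counts the (subject, verb) pairs of the corpus, in lowercase
    :param corpus: list of sentences, each a dictionary keyed by ID
    :return: Counter mapping each pair to its frequency
    """
    pairs = []
    for sentence in corpus:
        for subject in sentence.values():
            # startswith() also accepts subtypes such as nsubj:pass
            if subject['DEPREL'].startswith('nsubj'):
                # The head of the subject is looked up directly by its ID
                verb = sentence.get(subject['HEAD'])
                if verb is not None:
                    pairs.append((subject['FORM'].lower(), verb['FORM'].lower()))
    return Counter(pairs)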

In [294]:
pairs_sv = extract_pairs(formatted_corpus_dict)

In all the experiments, we will keep the `nbest` most frequent items. We set `nbest` to 3 in the first experiments and to 5 in the last one.

In [296]:
nbest = 3

In [297]:
sorted_pairs = sorted(pairs_sv, key=lambda x: (-pairs_sv[x], x))
freq_pairs_sv = [(pair, pairs_sv[pair]) for pair in sorted_pairs][:nbest]

In [298]:
freq_pairs_sv

[(('som', 'har'), 45), (('du', 'får'), 19), (('vi', 'har'), 19)]

### Extracting the subject-verb-object triples

You will now extract all the subject–verb–object triples of the corpus. The object function uses the `obj` code.

In [299]:
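def extract_triples(corpus):
    """
    Counts the (subject, verb, object) triples of the corpus, in lowercase
    :param corpus: list of sentences, each a dictionary keyed by ID
    :return: Counter mapping each triple to its frequency
    """
    triples = []
    for sentence in corpus:
        for subject in sentence.values():
            if not subject['DEPREL'].startswith('nsubj'):
                continue
            # The verb is the head of the subject
            verb = sentence.get(subject['HEAD'])
            if verb is None:
                continue
            for obj in sentence.values():
                # Keep the objects that share the head of the subject
                if obj['DEPREL'].startswith('obj') and obj['HEAD'] == subject['HEAD']:
                    triples.append((subject['FORM'].lower(), verb['FORM'].lower(), obj['FORM'].lower()))
    return Counter(triples)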

In [300]:
triples_sv = extract_triples(formatted_corpus_dict)

In [302]:
triples_sv.most_common(3)

[(('man', 'vänder', 'sig'), 14),
 (('det', 'rör', 'sig'), 5),
 (('som', 'tar', 'barn'), 3)]

In [303]:
sorted_triples = sorted(triples_sv, key=lambda x: (-triples_sv[x], x))
freq_triples_sv = [(triple, triples_sv[triple]) for triple in sorted_triples][:nbest]

In [304]:
freq_triples_sv

[(('man', 'vänder', 'sig'), 14),
 (('det', 'rör', 'sig'), 5),
 (('man', 'söker', 'arbete'), 3)]
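
The third triple differs between the two outputs above, presumably because several triples share the count 3: `most_common()` keeps tied elements in the order they were first encountered, while the explicit sort breaks ties alphabetically. A minimal sketch with made-up counts shows the difference:

```python
from collections import Counter

# Made-up counts: two triples tie at 2, and 'b' is seen before 'a'
triples = Counter([('b', 'v', 'o'), ('a', 'v', 'o'),
                   ('b', 'v', 'o'), ('a', 'v', 'o'),
                   ('c', 'v', 'o')])

# Ties are kept in first-encountered order
by_first_seen = triples.most_common(2)

# Ties are broken by the alphabetical order of the triple
by_alphabet = sorted(triples.items(), key=lambda item: (-item[1], item[0]))[:2]

print(by_first_seen)
print(by_alphabet)
```

The explicit sort is the one that gives reproducible results regardless of the order of the sentences in the corpus.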

### Multilingual Corpora

Once your program is working on Swedish, you will apply it to all the other languages in Universal Dependencies. The code below returns all the files from a folder that end with a given suffix. Here we consider the training files only.

In [306]:
files = get_files(ud_path, 'train.conllu')
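
The definition of `get_files` is not shown in this section. As a hypothetical sketch only, assuming the function walks the folder recursively and returns the matching paths in sorted order, it could look like this:

```python
import os

def get_files(folder, suffix):
    """
    Hypothetical sketch: returns the paths of all the files under
    `folder`, subfolders included, whose names end with `suffix`.
    The sorted order is an assumption of this example.
    """
    files = []
    for root, _, names in os.walk(folder):
        for name in names:
            if name.endswith(suffix):
                files.append(os.path.join(root, name))
    return sorted(files)
```

The order of the returned list matters for the cells that select corpora by position, so the actual implementation may order the files differently.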

#### Dealing with the indices

Some corpora expand some tokens into multiwords. This is the case in French, Spanish, and German.
        The table below shows examples of such expansions.
        <table style="width:100%">
            <tr>
                <th>French</th>
                <th>Spanish</th>
                <th>German</th>
            </tr>
            <tr>
                <td><i>du</i>: de le
                </td>
                <td><i>del</i>: de el
                </td>
                <td><i>zur</i>: zu der
                </td>
            </tr>
            <tr>
                <td><i>des</i>: de les
                </td>
                <td><i>vámonos</i>: vamos nos
                </td>
                <td><i>im</i>: in dem
                </td>
            </tr>
        </table>
        The corpora contain both the original tokens, indexed with a range such as 1-2, and the words they expand to, as with <i>vámonos al mar</i>.
        <pre>
1-2 vámonos _
1 vamos ir
2 nos nosotros
3-4 al _
3 a a
4 el el
5 mar mar
</pre>Read the format description for the details: [<a
                href="http://universaldependencies.org/format.html">CoNLL-U format</a>].
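
As a sketch of one way to handle these indices, the example below stores <i>vámonos al mar</i> in a dictionary keyed by `ID` and drops the range entries, so that only the individual words remain. Only the `ID` and `FORM` columns are shown, and the `words_only` name is an assumption of this example.

```python
# The example sentence, keyed by ID; the ranges 1-2 and 3-4 hold the original tokens
sentence_dict = {
    '1-2': {'ID': '1-2', 'FORM': 'vámonos'},
    '1': {'ID': '1', 'FORM': 'vamos'},
    '2': {'ID': '2', 'FORM': 'nos'},
    '3-4': {'ID': '3-4', 'FORM': 'al'},
    '3': {'ID': '3', 'FORM': 'a'},
    '4': {'ID': '4', 'FORM': 'el'},
    '5': {'ID': '5', 'FORM': 'mar'},
}

# Keep the entries whose ID is not a range
words_only = {k: v for k, v in sentence_dict.items() if '-' not in k}

print([word['FORM'] for word in words_only.values()])
```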

#### Extracting the pairs and triples

Write a function `extract_pairs_and_triples(formatted_corpus_dict, nbest)` that extracts the `nbest` most frequent pairs and triples of a given corpus and returns two sorted lists of tuples: `frequent_pairs` and `frequent_triples`. You will sort them by frequency and then by alphabetical order of the pair or triple.

In [307]:
def extract_pairs_and_triples(corpus, nbest):
    pairs = extract_pairs(corpus)
    triples = extract_triples(corpus)
    
    sorted_pairs = sorted(pairs, key=lambda x: (-pairs[x], x))
    freq_pairs = [(pair, pairs[pair]) for pair in sorted_pairs][:nbest]
    
    sorted_triples = sorted(triples, key=lambda x: (-triples[x], x))
    freq_triples = [(triple, triples[triple]) for triple in sorted_triples][:nbest]
    return freq_pairs, freq_triples

Run your extractor on all the corpora. Note that some corpora have replaced the words with underscores, as is the case for one corpus in French. To obtain the words, you then need to contact the provider.

In [None]:
files = get_files(ud_path, 'train.conllu')

def clean_file(file):
    """
    Reads a CoNLL-U file and returns its sentences as dictionaries,
    without the range indices such as 1-2
    """
    sentences = read_sentences(file)
    formatted_corpus = split_rows(sentences, column_names_u)
    formatted_corpus_dict = convert_to_dict(formatted_corpus)
    # Drop the entries whose ID is a range and keep the individual words
    return [{k: v for k, v in sentence.items() if '-' not in k}
            for sentence in formatted_corpus_dict]

lang_info = {}
for path in files[2:5]:
    new_formatted_corpus_dict = clean_file(path)
    freq_pairs, freq_triples = extract_pairs_and_triples(new_formatted_corpus_dict, nbest=3)
    lang_info[path] = (freq_pairs, freq_triples)
    
lang_info

## Resolving the entities

You will now extract the relations involving named entities, that is where both the subject and the object are proper nouns. 

Write a function `extract_entity_triples(formatted_corpus_dict)` that will process the corpus and return a list of `(subject, verb, object)` triples. You will leave the case as it is in the form, for instance _United States_ and not _united states_.

In [317]:
def extract_entity_triples(corpus):
    triples = []
    for sentence in corpus:
        for subject in sentence.values():
            if subject['DEPREL'].startswith('nsubj') and 'PROPN' in subject['UPOS']:
                for obj in sentence.values():
                    if obj['DEPREL'].startswith('obj') and 'PROPN' in obj['UPOS']:
                        for verbidx, verb in sentence.items():
                            if subject['HEAD'] == obj['HEAD'] == verbidx:
                                triples.append((subject['FORM'], verb['FORM'], obj['FORM']))
    return Counter(triples)
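
To see the proper noun filter at work, here is a minimal self-contained sketch on a made-up sentence, "Anna met Peter". The `is_entity` helper is hypothetical and only condenses the two tests used in the function above.

```python
from collections import Counter

# A made-up sentence keyed by ID, with only the columns needed here
sentence = {
    '0': {'ID': '0', 'FORM': 'ROOT', 'UPOS': 'ROOT', 'HEAD': '0', 'DEPREL': 'ROOT'},
    '1': {'ID': '1', 'FORM': 'Anna', 'UPOS': 'PROPN', 'HEAD': '2', 'DEPREL': 'nsubj'},
    '2': {'ID': '2', 'FORM': 'met', 'UPOS': 'VERB', 'HEAD': '0', 'DEPREL': 'root'},
    '3': {'ID': '3', 'FORM': 'Peter', 'UPOS': 'PROPN', 'HEAD': '2', 'DEPREL': 'obj'},
}

def is_entity(word, function):
    # Hypothetical helper: the word has the given function and is a proper noun
    return word['DEPREL'].startswith(function) and word['UPOS'] == 'PROPN'

# Pair every proper noun subject with the proper noun objects of the same head
triples = Counter(
    (subj['FORM'], sentence[subj['HEAD']]['FORM'], obj['FORM'])
    for subj in sentence.values() if is_entity(subj, 'nsubj')
    for obj in sentence.values()
    if is_entity(obj, 'obj') and obj['HEAD'] == subj['HEAD']
)

print(triples)
```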


You will run the `extract_entity_triples()` function on the English EWT corpus. You will store the list in the `entity_relation_en` variable and you will sort it with `sorted()`. You will keep the first **five** triples.

In [318]:
nbest = 5

The first two triples are:
```
[('Baba', 'remember', 'George'),
 ('Beschta', 'told', 'Planet'),
...]
```
Note that this time, we keep the original case and the triples are in alphabetical order.

In [319]:
formatted_corpus_dict = clean_file(files[-1])
entity_relation_en = extract_entity_triples(formatted_corpus_dict)

In [320]:
entity_relation_en = sorted(entity_relation_en)[:nbest]
entity_relation_en

[('Baba', 'remember', 'George'),
 ('Beschta', 'told', 'Planet'),
 ('Boi', 'beat', 'Lopez'),
 ('Bush', 'mentioned', 'Arabia'),
 ('Bush', 'mentioned', 'Osama')]
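
Applying `sorted()` to a `Counter` iterates over its keys only, so the counts are dropped and each triple appears once. The tuples are compared element by element and, as the case is preserved, uppercase letters sort before lowercase ones. A minimal sketch with made-up triples:

```python
from collections import Counter

# Made-up triples, for illustration only; one of them occurs twice
triples = Counter([('Zoe', 'saw', 'Adam'), ('adam', 'saw', 'Zoe'),
                   ('Bob', 'met', 'Carl'), ('Bob', 'met', 'Carl')])

# sorted() returns the keys in lexicographic order, without the counts
ordered = sorted(triples)

print(ordered)
```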