# How to work with the PDTB

_based on the course "Computational Pragmatics" by Chris Potts_ 

Shared under a cc-by-nc-sa license.
https://creativecommons.org/licenses/by-nc-sa/3.0/

## Loading & accessing the corpus

We can access the corpus through its compiled CSV version (a tabular format with one relation per line): https://bit.ly/2TjfT5K (you should know the password).

The python package `pdtb` provides an iterator over the data points in the corpus.

In [None]:
# Needs NLTK

from collections import defaultdict
from pdtb import CorpusReader, Datum

from matplotlib import pyplot as plt


pdtb = CorpusReader('pdtb2.csv')


In [None]:
def relation_count():
    """Calculate and display the distribution of relations."""
    # Create a count dictionary of relations.
    d = defaultdict(int)
    for datum in pdtb.iter_data():
        # Skip rows without a relation type instead of stopping at the first one.
        if datum.Relation is None:
            continue
        d[datum.Relation] += 1
    # Print the results to standard output.
    for key, val in d.items():
        print(key, val)

# This will take a long time, so run with caution!
relation_count()
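
The same kind of tally can also be written with `collections.Counter`, which adds a `most_common()` method for sorted output. The following is only a minimal sketch: so that it runs without the corpus, it counts a hand-made list of labels (`toy_relations` and `count_labels` are invented for illustration), where the notebook would pass something like `(datum.Relation for datum in pdtb.iter_data())`.

```python
from collections import Counter

def count_labels(labels):
    """Tally labels, skipping missing (None) values."""
    return Counter(label for label in labels if label is not None)

# Invented stand-in data; these are not real corpus counts.
toy_relations = ["Explicit", "Implicit", None, "Explicit", "EntRel"]
counts = count_labels(toy_relations)

# Print the labels from most to least frequent.
for label, n in counts.most_common():
    print(label, n)
```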

## Exploring the corpus

* Example relation:

In [None]:
# for datum in pdtb.iter_data():
#     print(datum)
#     break
    
print(next(pdtb.iter_data()))


* What information do we have for each relation?

In [None]:
Datum.header

In [None]:
ex_item = next(x for i, x in enumerate(pdtb.iter_data()) if i == 3)  # Get the relation at index 3 (i.e. the fourth one)
print(ex_item.Relation, ex_item.ConnHead, ex_item.FullRawText)

print("Arg1 = " + ex_item.Arg1_RawText)
print("Arg2 = " + ex_item.Arg2_RawText)

## Semantic classes

We can look at the semantic classes present in the corpus.

*NOTE:* This is version 2.0 of the PDTB, so the semantic classes below come from the (now deprecated) PDTB 2.0 sense hierarchy. PDTB 3.0 has just been released.

In [None]:
def count_semantic_classes():
    """Count ConnHeadSemClass1 values."""
    d = defaultdict(int)
    for datum in pdtb.iter_data():
        sc = datum.ConnHeadSemClass1
        # Filter None values (should be just EntRel/NoRel data).
        if sc:
            d[sc] += 1
    return d

sem = count_semantic_classes()

In [None]:
%pylab inline
plt.rcParams['figure.figsize'] = (18, 5)

plt.bar(*zip(*sem.items()))
plt.xticks(rotation='vertical')

plt.show()

## Connectives

Looking at the connectives (only for *Explicit* relations):

In [None]:
def count_connectives():
    """Count the connective heads of all Explicit relations."""
    d = defaultdict(int)
    for datum in pdtb.iter_data():
        if datum.Relation == "Explicit":
            conn = datum.ConnHead
            d[conn] += 1
    return d

ALL_CONNECTIVES = count_connectives()

In [None]:
plt.bar(*zip(*ALL_CONNECTIVES.items()))
plt.xticks(rotation='vertical')

plt.show()
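
With many distinct connective heads, a bar chart in dictionary order is hard to read. One option is to sort the counts and keep only the most frequent entries before plotting. The helper below is a sketch, not part of the `pdtb` package: `sorted_for_plot` and `toy_counts` are names made up here, and the numbers are invented. Its result could be passed to `plt.bar(labels, values)`.

```python
def sorted_for_plot(counts, top_n=None):
    """Return (labels, values) sorted by descending count, optionally truncated."""
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        items = items[:top_n]
    if not items:
        return (), ()
    labels, values = zip(*items)
    return labels, values

# Invented counts for illustration only.
toy_counts = {"but": 3, "and": 10, "because": 1}
labels, values = sorted_for_plot(toy_counts, top_n=2)
print(labels, values)
```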

## Answering some questions about the data

### Disconnected argument spans

When are argument spans discontinuous, that is, when is something "missing" from within either argument?

In [None]:
i = 0
for datum in pdtb.iter_data():
    i+=1
    if i>200:
        break
    if datum.Relation=="Explicit":
        arg1spans = datum.Arg1_SpanList
        arg2spans = datum.Arg2_SpanList
        if len(arg1spans)+len(arg2spans)>2:
            print(datum.ConnHead, datum.Connective_SpanList,arg1spans,arg2spans)
            print(datum)
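
What "disconnected" means here can be made explicit with two small helpers. This sketch assumes that a span list is a list of `[start, end]` character-offset pairs, which is how the cell above treats `Arg1_SpanList` and `Arg2_SpanList`. The names `is_discontinuous` and `gaps`, and the sample spans, are made up for illustration.

```python
def is_discontinuous(span_list):
    """Return True if the argument is annotated as more than one span."""
    return len(span_list) > 1

def gaps(span_list):
    """Return the sizes of the gaps between consecutive spans, in offsets."""
    ordered = sorted(span_list)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]

# Hypothetical spans, not taken from the corpus.
single = [[10, 50]]
split = [[10, 30], [42, 60]]
print(is_discontinuous(single), gaps(single))
print(is_discontinuous(split), gaps(split))
```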
        

### Looking at syntactic trees

Potts has also included the syntactic trees of arguments (and connectives) in the same data structure. This lets us inspect them and see, for example, which types of arguments occur with certain connectives.

In [None]:
i = 0
for datum in pdtb.iter_data():
    i+=1
    if i>200:
        break
    if datum.Relation=="Implicit":
        print(datum)
        print(datum.Arg1_Trees)
        print(datum.Arg2_Trees)
        break
 

## Exercises

1. Which semantic senses occur in *Explicit* vs. *Implicit* relations? Construct a confusion matrix with the Relation types as rows, the ConnHeadSemClass1 values as columns, and each cell holding the number of times the corresponding row and column values occur together. Are there patterns here that we might take advantage of in experiments predicting Relation types or semantic coherence classes?
2. Find long-distance relations. These are relations where there is some extra material in between argument 1 and argument 2. For this, you may want to use functionality similar to the `adjacency_check` function below. When you find a long-distance relation, save what type of relation it is (should be mainly Explicit) and what the connective is (`ConnHead`). Further, `Datum` also provides a method called `relative_arg_order()`. The function `distribution_of_relative_arg_order()` defined below creates a simple tally of the relative argument orders (Arg1 before Arg2, Arg2 before Arg1, etc.).
3. How does the **size** of arguments correlate with connectives? Create a dictionary of connective heads and argument sizes. Plot the argument size distributions for a few connectives (or the means for all connectives).
4. What is the syntactic type of arguments? Which kinds of clauses can you find?
5. What is happening in "Attribution"s? E.g., use the function `print_attribution_texts()` below.
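
For Exercise 1, one possible starting point is a nested dictionary of counts. The sketch below works on invented `(Relation, ConnHeadSemClass1)` pairs so that it runs without the corpus. `cross_tabulate` and `toy_pairs` are hypothetical names, and on the real data the pairs would come from iterating over `pdtb.iter_data()`.

```python
from collections import defaultdict

def cross_tabulate(pairs):
    """Count how often each (row, column) combination occurs."""
    table = defaultdict(lambda: defaultdict(int))
    for row, col in pairs:
        table[row][col] += 1
    return table

# Invented (Relation, sense) pairs for illustration only.
toy_pairs = [
    ("Explicit", "Comparison.Contrast"),
    ("Explicit", "Expansion.Conjunction"),
    ("Implicit", "Expansion.Conjunction"),
    ("Explicit", "Comparison.Contrast"),
]
table = cross_tabulate(toy_pairs)

# Print one line per non-empty cell of the matrix.
for relation, senses in table.items():
    for sense, n in sorted(senses.items()):
        print(relation, sense, n)
```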

In [None]:
def adjacency_check(datum):
    """Return True if datum is of the form Arg1 (connective) Arg2, else False"""    
    if not datum.arg1_precedes_arg2():
        return False
    arg1_finish = max([x for span in datum.Arg1_SpanList for x in span])
    arg2_start = min([x for span in datum.Arg2_SpanList for x in span])    
    if datum.Relation == 'Implicit':
        if (arg2_start - arg1_finish) <= 3:
            return True
        else:
            return False
    else:
        conn_indices = [x for span in datum.Connective_SpanList for x in span]
        conn_start = min(conn_indices)
        conn_finish = max(conn_indices)
        if (conn_start - arg1_finish) <= 3 and (arg2_start - conn_finish) <= 3:
            return True
        else:
            return False        

In [None]:
adjacency_check(ex_item)
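
For Exercise 2, the distance between the arguments can be computed directly from the span lists. The following sketch covers only the simple case where Arg1 precedes Arg2 and no connective needs to be accounted for; for Explicit relations the connective offsets would have to be handled as in `adjacency_check`. The names `char_gap` and `is_long_distance` are invented here, the threshold of 3 is borrowed from `adjacency_check`, and the spans are hypothetical.

```python
def char_gap(arg1_spans, arg2_spans):
    """Offsets between the end of Arg1 and the start of Arg2 (Arg1 first)."""
    arg1_finish = max(x for span in arg1_spans for x in span)
    arg2_start = min(x for span in arg2_spans for x in span)
    return arg2_start - arg1_finish

def is_long_distance(arg1_spans, arg2_spans, max_gap=3):
    """Return True if more than max_gap offsets separate the two arguments."""
    return char_gap(arg1_spans, arg2_spans) > max_gap

# Hypothetical span lists, not taken from the corpus.
adjacent = ([[0, 20]], [[22, 40]])
distant = ([[0, 20]], [[95, 120]])
print(char_gap(*adjacent), is_long_distance(*adjacent))
print(char_gap(*distant), is_long_distance(*distant))
```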

In [None]:
from operator import itemgetter

def distribution_of_relative_arg_order():
    """Tally and print the relative argument orders, most frequent first."""
    d = defaultdict(int)
    pdtb = CorpusReader('pdtb2.csv')
    for datum in pdtb.iter_data(display_progress=True):
        d[datum.relative_arg_order()] += 1
    for order, count in sorted(list(d.items()), key=itemgetter(1), reverse=True):
        print(order, count)
    
distribution_of_relative_arg_order()

In [None]:
def print_attribution_texts():
    """Inspect the strings characterizing attribution values."""
    pdtb = CorpusReader('pdtb2.csv')
    for datum in pdtb.iter_data(display_progress=False):
        txt = datum.Attribution_RawText
        if txt:
            print(txt)

In [None]:
# The function prints as it goes and returns nothing, so just call it.
print_attribution_texts()