# 8: Predicting dependency labels

Today we assume that we already have some mechanism for building dependency graphs, either transition-based or graph-based. We **focus on the task of predicting dependency labels**, and we'll use a logistic regression model to do this.

![unk deps](unk_dep.png)


## 1. Labeled dependency treebank

Unfortunately, the dependency treebank distributed with NLTK does not contain actual dependency labels. You can, however, convert the phrase structure trees in the Penn Treebank into labeled dependency trees. You will need to install the `PyStanfordDependencies` toolkit first:
```
pip3 install PyStanfordDependencies
```

Here's code for converting the Penn Treebank into a **labeled** dependency treebank.

In [None]:
# Warning: it's going to take a long while to run this
import StanfordDependencies
from nltk.corpus import treebank

sd = StanfordDependencies.get_instance(backend='subprocess')
parse_trees = list(treebank.parsed_sents())

dep_treebank = []
for i,sent in enumerate(parse_trees):
    print(i)
    dep_treebank.append(sd.convert_tree(str(sent)))

Unfortunately, `PyStanfordDependencies` is quite slow, so I have preconverted the first 1000 sentences and stored them as a pickle file.

In [3]:
import pickle

# Open the file in a context manager so that it is closed after loading
with open("dep_treebank.pkl", "rb") as f:
    dep_treebank = pickle.load(f)

Let's start by examining the treebank. Each sentence is a list of entries like:
```
Token(index=1, form='Meanwhile', cpos='RB', pos='RB', head=7, deprel='advmod')
```
which specify a token in the sentence, its POS, head and the dependency relation between the token and head.

Indexing for word tokens starts from `1` and the syntactic head of the entire sentence has `head=0`. 

In [4]:
print(len(dep_treebank))

for token in dep_treebank[0]:
    print(token)

1000
Token(index=1, form='Meanwhile', cpos='RB', pos='RB', head=7, deprel='advmod')
Token(index=2, form=',', cpos=',', pos=',', head=7, deprel='punct')
Token(index=3, form='business', cpos='NN', pos='NN', head=6, deprel='compound')
Token(index=4, form='and', cpos='CC', pos='CC', head=3, deprel='cc')
Token(index=5, form='government', cpos='NN', pos='NN', head=3, deprel='conj')
Token(index=6, form='leaders', cpos='NNS', pos='NNS', head=7, deprel='nsubj')
Token(index=7, form='rebuked', cpos='VBD', pos='VBD', head=0, deprel='root')
Token(index=8, form='the', cpos='DT', pos='DT', head=10, deprel='det')
Token(index=9, form='computer', cpos='NN', pos='NN', head=10, deprel='compound')
Token(index=10, form='makers', cpos='NNS', pos='NNS', head=7, deprel='dobj')
Token(index=11, form=',', cpos=',', pos=',', head=7, deprel='punct')
Token(index=12, form='and', cpos='CC', pos='CC', head=7, deprel='cc')
Token(index=13, form='fretted', cpos='VBD', pos='VBD', head=7, deprel='conj')
Token(index=14, form

We can access the features of individual tokens in the following way:

In [5]:
print(dep_treebank[0][0].form,dep_treebank[0][0].pos)

Meanwhile RB
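Because token indices start from `1` while Python lists start from `0`, looking up a head always needs an offset of one. The sketch below illustrates this on a small hand-made sentence. The `Token` namedtuple here is a hypothetical stand-in with only the fields shown above (not the actual `PyStanfordDependencies` class), and `head_form` is an illustrative helper, so that the example runs without the pickle file.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries in dep_treebank (assumption:
# only the fields printed above are modelled here).
Token = namedtuple("Token", "index form cpos pos head deprel")

sent = [
    Token(1, "leaders", "NNS", "NNS", 2, "nsubj"),
    Token(2, "rebuked", "VBD", "VBD", 0, "root"),
    Token(3, "the", "DT", "DT", 4, "det"),
    Token(4, "makers", "NNS", "NNS", 2, "dobj"),
]

def head_form(sent, token):
    """Return the word form of the token's head, or 'ROOT' if head=0."""
    if token.head == 0:
        return "ROOT"
    # Token indices start from 1, list positions from 0.
    return sent[token.head - 1].form

# One (head form, label, dependent form) triple per token
arcs = [(head_form(sent, t), t.deprel, t.form) for t in sent]
for head, deprel, dep in arcs:
    print(f"{head} --{deprel}--> {dep}")
```

The same lookup, `sent[token.head - 1]`, should work on the sentences in `dep_treebank`, since they are lists indexed in the same way.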


We'll now create a function which splits our dataset into training, development and test sets. We'll use 10% of the data for testing and another 10% for development. The remaining 80% is used for training.

In [6]:
def split_data(data):
    train_set = []
    dev_set = []
    test_set = []
    
    for i, sent in enumerate(data):
        if i % 10 == 0:
            test_set.append(sent)
        elif i % 10 == 1:
            dev_set.append(sent)
        else:
            train_set.append(sent)
            
    return train_set, dev_set, test_set

train_set, dev_set, test_set = split_data(dep_treebank)

Let's check that we get the correct lengths:

In [7]:
print(len(train_set), len(dev_set), len(test_set))

800 100 100
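As a cross-check, the same partition can also be written with list slicing. This is only an alternative sketch, not a replacement for `split_data`; the name `split_by_slicing` is made up for illustration.

```python
def split_by_slicing(data):
    """Every 10th item goes to test, the following item to dev, the rest to train."""
    test_set = data[0::10]   # positions 0, 10, 20, ...
    dev_set = data[1::10]    # positions 1, 11, 21, ...
    train_set = [item for i, item in enumerate(data) if i % 10 > 1]
    return train_set, dev_set, test_set

# Toy data: integers standing in for sentences
toy_train, toy_dev, toy_test = split_by_slicing(list(range(20)))
print(len(toy_train), len(toy_dev), len(toy_test))
print(toy_dev, toy_test)
```

Because the split is deterministic and based on position only, every run produces the same three sets, which makes results comparable across experiments.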


## 2. Feature extraction

Next we'll build a feature extraction function which uses the following features to predict the label of the dependency arc between a dependent and its head:

* the word form of the dependent
* the POS tag of the dependent
* the word form of the head
* the POS tag of the head

The function also returns the gold-standard label for each dependency relation. Note that this first version does not yet include a dummy feature which is always active; that feature is added in the refined version in Section 4.

In [8]:
def extract_features(data):
    features = []
    labels = []
    
    for sent in data:
        for example in sent:
            dep_wf = example.form
            dep_pos = example.pos
            
            # Defaults, used when the token is the root of the sentence
            head_wf = "ROOT"
            head_pos = "TOP"
            if example.head > 0:   # the token has a real head word
                # Token indices start from 1, so subtract 1 to get the list position
                head_wf = sent[example.head-1].form
                head_pos = sent[example.head-1].pos
        
            features.append({"DEP_WF":dep_wf,
                             "DEP_POS":dep_pos,
                             "HEAD_WF": head_wf,
                             "HEAD_POS":head_pos})
            deprel = example.deprel
            labels.append(deprel)
    return features, labels
        
train_features, train_labels = extract_features(train_set)
dev_features, dev_labels = extract_features(dev_set)
test_features, test_labels = extract_features(test_set)

Let's print the sizes of the datasets to check that everything looks okay:

In [9]:
print(len(train_features),len(dev_features),len(test_features))  

19721 2591 2635


And the features for the first training example:

In [10]:
print(train_features[0])

{'DEP_WF': 'The', 'DEP_POS': 'DT', 'HEAD_WF': 'company', 'HEAD_POS': 'NN'}


Next we will need to convert our feature dictionaries into numerical feature vectors (of zeroes and ones). We will do this using the `sklearn` [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer) class.

In [11]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
vectorizer.fit(train_features)

# Transform all three sets with the same vectorizer, which was fitted on `train_features` only
train_X = vectorizer.transform(train_features)
dev_X = vectorizer.transform(dev_features)
test_X = vectorizer.transform(test_features)

Let's print the shapes of our output matrices to check that everything looks okay:

In [12]:
print(train_X.shape,dev_X.shape,test_X.shape)

(19721, 7477) (2591, 7477) (2635, 7477)
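To make the one-hot encoding concrete, here is a minimal sketch on hand-made feature dictionaries (the `toy_` names are illustrative only). It also shows what happens to a feature value that was never seen when fitting: it simply gets no column, so it contributes nothing to the vector.

```python
from sklearn.feature_extraction import DictVectorizer

toy_train = [{"DEP_POS": "DT", "HEAD_POS": "NN"},
             {"DEP_POS": "NN", "HEAD_POS": "VBD"}]
# "JJ" never occurs as DEP_POS in the toy training data
toy_dev = [{"DEP_POS": "JJ", "HEAD_POS": "NN"}]

toy_vectorizer = DictVectorizer()
toy_vectorizer.fit(toy_train)

# Each distinct "name=value" pair becomes one column
print(toy_vectorizer.feature_names_)

toy_train_matrix = toy_vectorizer.transform(toy_train).toarray()
toy_dev_matrix = toy_vectorizer.transform(toy_dev).toarray()
print(toy_train_matrix)
print(toy_dev_matrix)   # only the HEAD_POS=NN column is active
```

This is why all three matrices above have the same number of columns: the column inventory is fixed by the training data.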


We also need to numericalize our label vectors. We can do this using the `sklearn` [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class. 

In [13]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
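# Fit on the labels of all three sets so that a label which is missing from
# the training set can still be encoded
label_encoder.fit(train_labels+dev_labels+test_labels)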

train_y = label_encoder.transform(train_labels)
dev_y = label_encoder.transform(dev_labels)
test_y = label_encoder.transform(test_labels)

Each dependency label is now represented by an integer.

In [15]:
print(train_y)
print(len(train_y))

[18 31  3 ... 18 27 35]
19721
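The integers can be mapped back to label names with `inverse_transform`, which is handy when inspecting predictions later. A small sketch on toy labels (the `toy_` names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["det", "nsubj", "root", "det", "dobj"]

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)

print(toy_encoder.classes_)   # label names, sorted; the position is the integer code
print(toy_y)

# Map the integer codes back to label names
decoded = toy_encoder.inverse_transform(toy_y)
print(decoded)
```

In the notebook, the same call would presumably be `label_encoder.inverse_transform(train_y)`.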


## 3. Training a classifier

We are then ready to train a logistic regression model. We will use the `sklearn` [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class.

In [16]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0,max_iter=1000).fit(train_X,train_y)

Let's then apply the model to our development set:

In [17]:
sys_dev_y = lr.predict(dev_X)

In [19]:
sys_dev_y

array([31, 36, 31, ..., 12, 12, 35])

We can evaluate the accuracy of the dependency label classifier (it should be around 81%):

In [18]:
def get_acc(sys_y, gold_y):
    return (sys_y == gold_y).sum()/len(sys_y)

get_acc(sys_dev_y, dev_y)

0.8162871478193747
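To put an accuracy figure in context, it can help to compare it with a simple baseline which always predicts the most frequent training label. The following is a sketch on toy arrays; in the notebook one would pass `train_y` and `dev_y` instead. No baseline figure for the treebank is claimed here.

```python
import numpy as np

def get_acc(sys_y, gold_y):
    return (sys_y == gold_y).sum() / len(sys_y)

# Toy integer-encoded labels (illustrative only)
toy_train_y = np.array([0, 0, 0, 1, 2, 0])
toy_dev_y = np.array([0, 1, 0, 2])

# The most frequent label in the training data
majority_label = np.bincount(toy_train_y).argmax()

# Predict that label for every development example
baseline_y = np.full_like(toy_dev_y, majority_label)
baseline_acc = get_acc(baseline_y, toy_dev_y)
print(majority_label, baseline_acc)
```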

## 4. Refining the features

The features we used were quite simple. A nice boost in accuracy can be achieved by adding the POS tags of the words immediately before and after the dependent, because these often offer a clue about the role that the dependent plays in the sentence.

In [20]:
def extract_features(data):
    features = []
    labels = []
    
    for sent in data:
        for example in sent:
            dep_wf = example.form
            dep_pos = example.pos
            
            head_wf = "ROOT"
            head_pos = "TOP"
            if example.head > 0:
                head_wf = sent[example.head-1].form
                head_pos = sent[example.head-1].pos
        
            dep_pos_minus_1 = "NONE"
            if example.index > 1:  # if it's not the first word
                dep_pos_minus_1 = sent[example.index - 2].pos
            
            dep_pos_plus_1 = "NONE"
            if example.index != len(sent):  # if it's not the last word
                dep_pos_plus_1 = sent[example.index].pos
    
            features.append({"DEP_WF":dep_wf, 
                             "DEP_POS":dep_pos, 
                             "HEAD_WF":head_wf, 
                             "HEAD_POS":head_pos, 
                             "DEP_POS_MINUS_1":dep_pos_minus_1,
                             "DEP_POS_PLUS_1":dep_pos_plus_1,
                             "DUMMY":"DUMMY"})
            deprel = example.deprel
            labels.append(deprel)
    return features, labels
        
train_features, train_labels = extract_features(train_set)
dev_features, dev_labels = extract_features(dev_set)
test_features, test_labels = extract_features(test_set)

We will need to re-vectorize our features:

In [21]:
vectorizer = DictVectorizer()
vectorizer.fit(train_features)

train_X = vectorizer.transform(train_features)
dev_X = vectorizer.transform(dev_features)
test_X = vectorizer.transform(test_features)

And retrain the model:

In [22]:
lr = LogisticRegression(random_state=0,max_iter=1000).fit(train_X,train_y)

We can now see that accuracy improves substantially (it should be around 89%):

In [23]:
sys_dev_y = lr.predict(dev_X)
get_acc(sys_dev_y, dev_y)

0.8892319567734466
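Beyond a single accuracy number, it can be useful to see which labels are confused with each other. The sketch below counts (gold, predicted) pairs on toy label lists; in the notebook, one could first decode `dev_y` and `sys_dev_y` with `label_encoder.inverse_transform`. The toy labels are made up and say nothing about the actual errors of the model.

```python
from collections import Counter

# Toy gold and predicted labels (illustrative only)
toy_gold = ["det", "nsubj", "dobj", "nsubj", "compound", "dobj"]
toy_pred = ["det", "dobj", "dobj", "dobj", "compound", "nsubj"]

# Count only the pairs where the prediction is wrong
confusions = Counter((g, p) for g, p in zip(toy_gold, toy_pred) if g != p)

for (gold, pred), count in confusions.most_common(3):
    print(f"gold={gold} predicted={pred}: {count}")
```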

## 5. Practical work

A lot still remains to be done with regard to feature extraction. Other features you might want to try are:

* word forms and POS tags around the head word
* word forms and POS tags of the other dependents of both the head and the dependent
* the head word of the head of this relation (and its POS tag)

It's important that you leverage the graph structure when designing your features.
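As one possible starting point, the head-of-head feature from the list above could be sketched as follows. The `Token` namedtuple is again a hypothetical stand-in for the treebank entries, and the function name, the feature names `GRAND_WF` and `GRAND_POS` and the placeholder value `"NONE"` are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries in dep_treebank
Token = namedtuple("Token", "index form cpos pos head deprel")

sent = [
    Token(1, "leaders", "NNS", "NNS", 2, "nsubj"),
    Token(2, "rebuked", "VBD", "VBD", 0, "root"),
    Token(3, "the", "DT", "DT", 4, "det"),
    Token(4, "makers", "NNS", "NNS", 2, "dobj"),
]

def grandparent_features(sent, token):
    """Features describing the head of the token's head."""
    if token.head == 0:
        # The token is the root, so there is no head of the head
        return {"GRAND_WF": "NONE", "GRAND_POS": "NONE"}
    head = sent[token.head - 1]
    if head.head == 0:
        # The head is the root, so its head is the artificial ROOT node
        return {"GRAND_WF": "ROOT", "GRAND_POS": "TOP"}
    grand = sent[head.head - 1]
    return {"GRAND_WF": grand.form, "GRAND_POS": grand.pos}

for token in sent:
    print(token.form, grandparent_features(sent, token))
```

The returned dictionary could be merged into the feature dictionary built in `extract_features`, for example with `dict.update`.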

Of course you should also tune the hyperparameters of the `LogisticRegression` classifier on the development set and run a final test on the test set.
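A minimal sketch of such a tuning loop is shown below, assuming that the inverse regularization strength `C` is the hyperparameter of interest. It uses synthetic data from `make_classification` so that it runs on its own; the candidate values and the `toy_` names are arbitrary choices for illustration. In the notebook, `train_X`, `dev_X` and `test_X` with their label vectors would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the treebank features)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
toy_train_X, toy_train_y = X[:200], y[:200]
toy_dev_X, toy_dev_y = X[200:250], y[200:250]
toy_test_X, toy_test_y = X[250:], y[250:]

best_C, best_acc, best_model = None, -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, random_state=0, max_iter=1000)
    model.fit(toy_train_X, toy_train_y)
    acc = model.score(toy_dev_X, toy_dev_y)   # accuracy on the development set
    print(C, acc)
    if acc > best_acc:
        best_C, best_acc, best_model = C, acc, model

# Touch the test set only once, with the model selected on the development set
test_acc = best_model.score(toy_test_X, toy_test_y)
print("best C:", best_C, "test accuracy:", test_acc)
```

Selecting on the development set and evaluating once on the test set keeps the final test figure an honest estimate.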