# NLP Assignment #4
### by Prodromos Kampouridis MTN2203

#### IMPORTANT NOTE
##### *Because the code is lengthy, the answers to tasks 1-5 are also provided as markdown cells below.*


##### *For more detailed information, please refer to the report entitled PRODROMOS KAMPOURIDIS REPORT*.

## B. GRAPH-BASED DEPENDENCY PARSER


### Introduction

In this part of the assignment, we explore a graph-based dependency parser based on the work of Kiperwasser and Goldberg (2016). The model consists of a bidirectional LSTM (BiLSTM) encoder and two MLPs: one that predicts a score for each possible dependency, and one that predicts the type of the dependency. We explore several modifications to the model's architecture, including replacing the BiLSTM encoder with a BERT encoder, and measure the effect of these changes on performance in terms of the UAS and LAS evaluation metrics.

We have modified the `model.py` and `main.py` files to implement the changes described in the questions. Moreover, we have created a new file, `model_bert.py`, to implement the BERT-based model of question B4.
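
For reference, the two metrics used throughout can be sketched as follows. This is a minimal illustration, not the evaluation code of the assignment: the function name `attachment_scores` and the `(head_index, relation)` representation are assumptions made for this example. UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct relation label.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold, pred: lists with one (head_index, relation) pair per token.
    Hypothetical helper for illustration only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: only the head has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label have to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS: {uas:.2f}  LAS: {las:.2f}")  # UAS: 75.00  LAS: 50.00
```

Since every LAS hit is also a UAS hit, LAS can never exceed UAS, which matches the results reported below.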

## Answers

### 1.
First, we run the experiment with 1 BiLSTM layer by executing `main.py` with the argument `--n_lstm_layers 1`. To evaluate the model's performance on the test set, we then execute the `main.py` script with the `--do_eval` and `--model_dir` arguments. As the `model_dir`, we pass the path of the saved model.

Commands:
```
python main.py --n_lstm_layers 1
python main.py --n_lstm_layers 1 --do_eval --model_dir "results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_date=06_20_2023"
```

Results:

The model's performance in the test set is:

UAS: 93.56

LAS: 92.05

Compared to the best models of part A (A1 and A4), which both achieved a UAS of 89.19, this model performs significantly better.

In [None]:
!python main.py --n_lstm_layers 1

2023-06-20 14:01:35,116 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': None, 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': False, 'pretrained_emb': None, 'activation_function': 'tanh', 'encoder': 'lstm', 'experiment_dir': './results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_date=06_20_2023'}
2023-06-20 14:01:37,360 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-20 14:01:57,827 - INFO - -----------+-----------+-----------+-----------+-----------
2023-06-20 14:01:57,827 - INFO - Train epoch: 1
2023-06-20 14:36:00,557 - INFO - -----------+-----------+-----------+-----------+-----------
2023-06-20 14:36:00,558 - INFO 

In [None]:
!python main.py --n_lstm_layers 1 --do_eval --model_dir "results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_date=06_20_2023"

2023-06-21 21:14:41,626 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_date=06_20_2023', 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': True, 'pretrained_emb': None, 'activation_function': 'tanh', 'encoder': 'lstm', 'experiment_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_date=06_20_2023'}
2023-06-21 21:14:41,938 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-21 21:15:30,711 - INFO - -----------+-----------+-----------+-----------+-----------
2023-06-21 21:15:30,711 - INFO - test results:
2023-06-21 

### 2.
To replace the randomly initialized embeddings with pre-trained ones, we first added an extra command-line argument to `main.py`, named `--pretrained_emb`. Note that `main.py` already provides the `--ext_emb` argument; however, that argument does not replace the randomly initialized embeddings with those in the given file, but rather concatenates the given embeddings with the randomly initialized ones and the POS embeddings. We modified `main.py` so that, when `--pretrained_emb` is used, the `load_pretrained_word_embed` utility function is called with the given pre-trained embeddings file (`glove.6B.100d.txt`) and the already computed word-to-index mapping. We replace the previous `w2i` variable with the `w2i` value returned by the function, thereby restricting the vocabulary to the words with available embeddings, as follows.

```
pretrained_word_vectors = None
if args.pretrained_emb is not None:
    w2i, pretrained_word_vectors = load_pretrained_word_embed(args.pretrained_emb, w2i)
```
Then, we create the BISTParser with an extra parameter `pretrained_word_vectors` that contains the embeddings matrix. Finally, we modified the BISTParser so that when the `pretrained_word_vectors` is passed, the word_embedding layer is initialized with the pretrained embeddings rather than random ones:

```
# embedding layers initialization
if pretrained_word_vectors is None:
    self.word_embedding = nn.Embedding(len(w2i), w_emb_dim)
else:
    self.word_embedding = nn.Embedding.from_pretrained(pretrained_word_vectors, freeze=False)
```
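
The vocabulary restriction described above can be illustrated with a small, self-contained sketch. It is not the actual `load_pretrained_word_embed` utility, whose internals are not shown here: the function name `load_glove_subset` is hypothetical, the input is assumed to be in the plain-text GloVe format (`word v1 v2 ... vd` per line), and any handling of special tokens (unknown word, padding) is left out.

```python
import io


def load_glove_subset(lines, w2i):
    """Hypothetical helper: keep only vocabulary words that have a vector.

    lines: iterable of GloVe-format text lines ("word v1 v2 ... vd").
    w2i:   existing word -> index mapping built from the training data.
    Returns a re-indexed mapping and a list of vectors aligned with it.
    """
    new_w2i = {}
    vectors = []
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skip words outside the training vocabulary and duplicate entries.
        if word in w2i and word not in new_w2i:
            new_w2i[word] = len(vectors)
            vectors.append([float(v) for v in values])
    return new_w2i, vectors


# Tiny in-memory stand-in for an embeddings file (3-dimensional vectors).
sample_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\nzebra 0.7 0.8 0.9\n")
w2i = {"the": 0, "cat": 1, "parser": 2}
new_w2i, vectors = load_glove_subset(sample_file, w2i)
print(new_w2i)       # {'the': 0, 'cat': 1}
print(len(vectors))  # 2
```

In this sketch, "parser" is dropped because it has no pre-trained vector, and "zebra" is dropped because it is not in the training vocabulary. Before being passed to `nn.Embedding.from_pretrained`, the list of vectors would need to be converted to a float tensor.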

Commands:
```
python main.py --n_lstm_layers 1 --pretrained_emb ./glove.6B.100d.txt
python main.py --n_lstm_layers 1 --pretrained_emb ./glove.6B.100d.txt --do_eval --model_dir "results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=True_activation=tanh_encoder=lstm_date=06_21_2023"
```

The model's performance in the test set is:

UAS: 93.71

LAS: 92.22

Again, this model outperforms the best models of part A (UAS: 89.19), by an even larger margin.

In [None]:
!python main.py --n_lstm_layers 1 --pretrained_emb ./glove.6B.100d.txt

2023-06-21 11:53:18,451 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': None, 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': False, 'pretrained_emb': './glove.6B.100d.txt', 'activation_function': 'tanh', 'encoder': 'lstm', 'experiment_dir': './results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=True_activation=tanh_encoder=lstm_date=06_21_2023'}
2023-06-21 11:53:19,933 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-21 11:53:45,107 - INFO - -----------+-----------+-----------+-----------+-----------
2023-06-21 11:53:45,107 - INFO - Train epoch: 1
2023-06-21 12:28:58,237 - INFO - -----------+-----------+-

In [None]:
!python main.py --n_lstm_layers 1 --pretrained_emb ./glove.6B.100d.txt --do_eval --model_dir "results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=True_activation=tanh_encoder=lstm_date=06_21_2023"

2023-06-21 20:55:11,316 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=True_activation=tanh_encoder=lstm_date=06_21_2023', 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': True, 'pretrained_emb': './glove.6B.100d.txt', 'activation_function': 'tanh', 'encoder': 'lstm', 'experiment_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=True_activation=tanh_encoder=lstm_date=06_21_2023'}
2023-06-21 20:55:11,590 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-21 20:56:15,368 - INFO - ---

### 3.
We modified the `BISTParser` class in the `model.py` file to take `activation_function` as a constructor parameter. We then create the `self.slp_out_arc` and `self.slp_out_rel` layers with `nn.Tanh` if the name of the given activation function is `tanh`, or with `nn.ReLU` if the name is `relu`. We have also added an extra command-line argument, `--activation_function`, to the `main.py` file, which passes the given activation function name (default: `tanh`) to the `BISTParser` constructor during parser initialization. To train and evaluate the model with the ReLU activation function, we run the following commands:

```
python main.py --n_lstm_layers 1 --activation_function relu
python main.py --n_lstm_layers 1 --activation_function relu --do_eval --model_dir="results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=False_activation=relu_encoder=lstm_date=06_21_2023"
```

The model's performance in the test set is:

UAS: 93.66

LAS: 92.16

The model performs significantly better than the models of part A that achieved UAS: 89.19.
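
The activation switch described in this answer can be sketched with the standard library alone. This is an illustrative stand-in rather than the code in `main.py` or `model.py`: the `ACTIVATIONS` table, the `build_arg_parser` function and the use of `choices` are assumptions, and plain scalar functions replace `nn.Tanh` and `nn.ReLU` so that the example runs without PyTorch.

```python
import argparse
import math

# Plain-Python stand-ins for nn.Tanh / nn.ReLU (illustration only).
ACTIVATIONS = {
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
}


def build_arg_parser():
    """Hypothetical parser exposing only the activation option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--activation_function",
        choices=sorted(ACTIVATIONS),  # assumption: reject unknown names early
        default="tanh",
    )
    return parser


# Equivalent to: python main.py --activation_function relu
args = build_arg_parser().parse_args(["--activation_function", "relu"])
activation = ACTIVATIONS[args.activation_function]
print(activation(-2.0), activation(3.0))  # 0.0 3.0
```

Looking the function up by name keeps the model constructor free of command-line concerns: it only receives a string and maps it to the corresponding layer.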

In [None]:
!python main.py --n_lstm_layers 1 --activation_function relu

2023-06-21 18:04:10,082 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': None, 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': False, 'pretrained_emb': None, 'activation_function': 'relu', 'encoder': 'lstm', 'experiment_dir': './results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=False_activation=relu_encoder=lstm_date=06_21_2023'}
2023-06-21 18:04:12,971 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-21 18:04:26,383 - INFO - -----------+-----------+-----------+-----------+-----------
2023-06-21 18:04:26,383 - INFO - Train epoch: 1
2023-06-21 18:38:16,336 - INFO - -----------+-----------+-----------+-----

In [None]:
!python main.py --n_lstm_layers 1 --activation_function relu --do_eval --model_dir="results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=False_activation=relu_encoder=lstm_date=06_21_2023"

2023-06-21 21:07:24,123 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=False_activation=relu_encoder=lstm_date=06_21_2023', 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 1, 'no_cuda': False, 'log_interval': 2000, 'do_eval': True, 'pretrained_emb': None, 'activation_function': 'relu', 'encoder': 'lstm', 'experiment_dir': 'results/ds=ptb_epochs=5_lr=0.001_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=1_pretrained_emb=False_activation=relu_encoder=lstm_date=06_21_2023'}
2023-06-21 21:07:24,173 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
2023-06-21 21:08:10,535 - INFO - -----------+------

### 4.
To implement the BERT-based parser, we created a new file, `model_bert.py`, containing a `BertBISTParser` class. In the constructor we create a `BertTokenizerFast` and a `BertModel`, both based on the `bert-base-uncased` version of BERT.

```
# BERT initialization
self.tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
self.encoder = BertModel.from_pretrained("bert-base-uncased")
self.bert_hid_dim = self.encoder.config.hidden_size
```

We also modify the MLPs so that their input size matches the hidden size of BERT.

In the `forward` method, we tokenize the sentence (which is already split into words) using the BERT tokenizer:
```
encodings = self.tokenizer([w[0] for w in sentence], truncation=True, padding='max_length', is_split_into_words=True)
```
At this point, we also record the position of the first sub-word of each word, as follows:
```
word_ids = encodings.word_ids()
first_subword_inds = []
seen_ids = set()
for i, wid in enumerate(word_ids):
    if wid is None:
        continue
    if wid not in seen_ids:
        first_subword_inds.append(i)
    seen_ids.add(wid)
first_subword_inds = torch.LongTensor(first_subword_inds).to(self.device)
```
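
To make the selection step concrete, the following standalone sketch applies the same logic to a hand-written `word_ids` list. The list is a made-up example of what a fast tokenizer's `word_ids()` could return for a three-word sentence whose last word is split into three sub-words; `None` marks special tokens and padding. The function name `first_subword_indices` is introduced only for this illustration.

```python
def first_subword_indices(word_ids):
    """Return the position of the first sub-word of every word."""
    seen = set()
    indices = []
    for position, wid in enumerate(word_ids):
        if wid is None:
            # Special tokens ([CLS], [SEP]) and padding belong to no word.
            continue
        if wid not in seen:
            indices.append(position)
            seen.add(wid)
    return indices


# Positions: 0=[CLS], 1=word 0, 2=word 1, 3-5=word 2 (three sub-words),
# 6=[SEP], 7=[PAD]
word_ids = [None, 0, 1, 2, 2, 2, None, None]
print(first_subword_indices(word_ids))  # [1, 2, 3]
```

The result contains exactly one index per original word, so the hidden states gathered at these positions line up with the gold heads and relations of the sentence.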

After passing the encoded sentence through BERT, we keep the last hidden states that correspond to the first sub-word of each word:
```
hidden_vectors = hidden_vectors[:, first_subword_inds, :]
```
Finally, these hidden states pass through the classification MLPs.

To enable the BERT model, we added an `--encoder` argument to `main.py`, which selects the BERT-based parser when set to `bert`. We also set the learning rate to 1e-5, since the default one did not give good results.

Note: due to limitations in Google Colab GPU availability, the BERT-based model was trained for 2 epochs.

Commands:
```
python main.py --encoder bert --lr 1e-5
python main.py --encoder bert --do_eval --model_dir="results/ds=ptb_epochs=5_lr=1e-05_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=2_pretrained_emb=False_activation=tanh_encoder=bert_date=06_22_2023"
```

The model's performance in the test set is:

UAS: 96.19

LAS: 94.52

The graph-based dependency parser with the BERT encoder outperforms both the models of part A and the LSTM-based models of part B.

In [None]:
!python main.py --encoder bert --lr 1e-5

2023-06-22 18:27:08,990 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': None, 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 1e-05, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 2, 'no_cuda': False, 'log_interval': 2000, 'do_eval': False, 'pretrained_emb': None, 'activation_function': 'tanh', 'encoder': 'bert', 'experiment_dir': './results/ds=ptb_epochs=5_lr=1e-05_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=2_pretrained_emb=False_activation=tanh_encoder=bert_date=06_22_2023'}
2023-06-22 18:27:10,481 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', '

In [None]:
!python main.py --encoder bert --do_eval --model_dir="results/ds=ptb_epochs=5_lr=1e-05_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=2_pretrained_emb=False_activation=tanh_encoder=bert_date=06_22_2023"

2023-06-23 18:54:32,890 - INFO - Experiment Parameters - 
{'train_path': 'data/train.conll', 'dev_path': 'data/dev.conll', 'test_path': 'data/test.conll', 'ds_name': 'ptb', 'model_dir': 'results/ds=ptb_epochs=5_lr=1e-05_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=2_pretrained_emb=False_activation=tanh_encoder=bert_date=06_22_2023', 'ext_emb': None, 'seed': 1234, 'epochs': 5, 'lr': 0.001, 'alpha': 0.25, 'w_emb_dim': 100, 'pos_emb_dim': 25, 'lstm_hid_dim': 125, 'mlp_hid_dim': 100, 'n_lstm_layers': 2, 'no_cuda': False, 'log_interval': 2000, 'do_eval': True, 'pretrained_emb': None, 'activation_function': 'tanh', 'encoder': 'bert', 'experiment_dir': 'results/ds=ptb_epochs=5_lr=1e-05_seed=1234_extEmb=False_wDim=100_pDim=25_lstmDim=125_mlpDim=100_lstmN=2_pretrained_emb=False_activation=tanh_encoder=bert_date=06_22_2023'}
2023-06-23 18:54:33,598 - INFO - Vocab statistics: words - 34327 | relations - 40 | POS tags - 19
Downloading (…)okenizer_config.json: 100% 28.0/28.0

### 5.


We observe that the graph-based dependency parsers of part B outperformed the transition-based dependency parsers of part A in terms of UAS. Among the transition-based models, the initial model achieved the best performance (UAS: 89.19). Adding one extra layer to the MLP gave the same result, whereas the other changes performed worse. For the graph-based model with the LSTM encoder, using ReLU instead of tanh gave a small improvement over the model of B1 (UAS: 93.66 vs. 93.56), and using pre-trained embeddings instead of randomly initialized ones gave a slightly larger, but still small, improvement over B1 (UAS: 93.71 vs. 93.56). Finally, replacing the LSTM encoder with the BERT encoder had the largest impact on performance, achieving a UAS of 96.19 compared to 93.71 for the next best model.


Transition-based parsers predict a sequence of actions (transitions) from which a dependency tree is built. They have relatively low computational complexity and are therefore efficient, but they are prone to error propagation, since an early wrong prediction in a sentence can affect all subsequent predictions.
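
As an illustration of how a transition sequence yields a tree, the sketch below applies a list of transitions to a short sentence. It assumes the arc-standard transition system (`SHIFT`, `LEFT-ARC`, `RIGHT-ARC`), which may differ from the system used in part A; the function name and the example sentence are made up for this illustration, and the transitions are given by hand rather than predicted by a model.

```python
def apply_transitions(words, transitions):
    """Apply arc-standard transitions and return (head, dependent) arcs."""
    stack = ["ROOT"]
    buffer = list(words)
    arcs = []
    for transition in transitions:
        if transition == "SHIFT":
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif transition == "LEFT-ARC":
            # The second item on the stack becomes a dependent of the top.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif transition == "RIGHT-ARC":
            # The top of the stack becomes a dependent of the item below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {transition}")
    return arcs


arcs = apply_transitions(
    ["She", "reads", "books"],
    ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"],
)
print(arcs)  # [('reads', 'She'), ('reads', 'books'), ('ROOT', 'reads')]
```

Each step depends on the stack left behind by the previous ones, which is why a single wrong transition early in the sequence can corrupt the arcs that follow.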


Graph-based parsers, on the other hand, learn a scoring function over dependency trees and search for the highest-scoring tree for a sentence. They consider the whole sentence at once and can therefore capture more complex dependencies. However, this makes them more complex, and thus slower, than transition-based parsers.
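
The idea of scoring every possible arc can be sketched with a small score matrix. The example below uses greedy per-word head selection, which is a deliberate simplification: it simply picks the best head for each word and does not guarantee a well-formed tree, whereas a full graph-based parser would typically use a tree-constrained decoder such as Eisner's algorithm or Chu-Liu/Edmonds. The scores are invented for illustration, and the function name `greedy_heads` is hypothetical.

```python
def greedy_heads(scores):
    """Pick the highest-scoring head for every word.

    scores[h][d] is the score of the arc head h -> dependent d.
    Index 0 is the artificial ROOT node, which never gets a head.
    """
    n = len(scores)
    heads = []
    for dependent in range(1, n):
        # A word cannot be its own head.
        candidates = [h for h in range(n) if h != dependent]
        heads.append(max(candidates, key=lambda h: scores[h][dependent]))
    return heads


# Nodes: 0=ROOT, 1=She, 2=reads, 3=books (scores are made up).
scores = [
    [0.0, 0.1, 0.9, 0.2],  # ROOT as head
    [0.0, 0.0, 0.3, 0.1],  # "She" as head
    [0.0, 0.8, 0.0, 0.7],  # "reads" as head
    [0.0, 0.2, 0.4, 0.0],  # "books" as head
]
print(greedy_heads(scores))  # [2, 0, 2]
```

Because all arcs are scored before any decision is made, a poor score for one arc does not affect the others, in contrast to the step-by-step decisions of a transition-based parser.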