# DL4NLP SS17 Home Exercise 08
----------------------------------
**Due by Tuesday, 13.06. at 13:00**

## Task 1 Mandatory Reading (1P)
This week's mandatory reading is [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/pdf/1603.01354.pdf). Please answer the following question: What kind of error analysis was performed, and what was the main insight?

...

## Task 2 Sequence Tagging (1P)
Please state a reason why one would prefer recurrent neural network architectures over approaches with a fixed window size for sequence tagging tasks.

...

## Task 3 Detecting Argument Components with Recurrent Neural Networks (8P + 2P Bonus)
Deep neural networks can be used to predict the arguments in a text by treating the problem as a sequence tagging task. Arguments are composed of smaller units (argument components).

Example: "``The meaning to life and everything is 42, because it is the answer to 6*9.``"

The claim of this argument would be "``The meaning to life and everything is 42``", while "``it is the answer to 6*9``" represents the premise, supporting the claim. Similar to home exercise 5, we can represent this data in a BIO tagging scheme:

```
        The        B-Claim
        meaning    I-Claim
        to         I-Claim
        life       I-Claim
        and        I-Claim
        everything I-Claim
        is         I-Claim
        42         I-Claim
        ,          O
        ...        ...
```
        
The goal is to predict the label for each token in a given sequence. Since chains of arguments often span several sentences, it makes sense for this task to process the sequential information of several sentences (or, in this case, of a whole document) at once.
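
To make the BIO scheme concrete, the following minimal sketch groups a label sequence back into argument components. The function name `bio_to_spans` and the label `B-Premise` are assumptions made for illustration; they are not taken from the skeleton code or the dataset.

```python
def bio_to_spans(labels):
    """Group a BIO label sequence into (component_type, start, end) spans.

    `end` is exclusive. Hypothetical helper, not part of the skeleton code.
    """
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # An open span continues only on an "I-" label of the same type.
        continues = label.startswith("I-") and label[2:] == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if label.startswith("B-"):
            start, current = i, label[2:]
        # A stray "I-" without an open span is ignored in this sketch.
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Simplified label sequence; "B-Premise" is an assumed label name.
labels = ["B-Claim"] + ["I-Claim"] * 7 + ["O", "O", "B-Premise", "I-Premise"]
spans = bio_to_spans(labels)
```

Each `B-` label opens a new component, the following `I-` labels of the same type extend it, and anything else closes it.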

Your task is to implement and train a Bi-LSTM in Keras to detect these argument components.

#### Data
The persuasive-essays dataset (http://anthology.aclweb.org/C/C14/C14-1142.pdf and https://arxiv.org/abs/1604.07370) contains essays with several argument components.

In the `hex08_data` folder, you can find a training, a development and a test set. The data is already tokenized and formatted in the BIO tagging scheme. Each line in these datasets contains the token and the label, separated by a tab character (`\t`). Documents are separated by an empty line.
The embeddings folder contains a subsampled word embedding (``hex08.vec``) which you will use in this home exercise.
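
The file format described above can be read with a few lines of Python. This is only a sketch under the stated format (one `token<TAB>label` pair per line, empty line between documents); the function name `read_bio_documents` is hypothetical, and the skeleton code may already ship its own reader in the utils folder.

```python
def read_bio_documents(lines):
    """Yield one document at a time as a list of (token, label) pairs.

    `lines` can be an open file or any iterable of strings.
    Hypothetical helper, not part of the skeleton code.
    """
    document = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current document.
            if document:
                yield document
                document = []
            continue
        token, label = line.split("\t")
        document.append((token, label))
    # The last document may not be followed by an empty line.
    if document:
        yield document


# Tiny in-memory sample standing in for one of the dataset files.
sample = "The\tB-Claim\nmeaning\tI-Claim\n\nHello\tO\n"
docs = list(read_bio_documents(sample.splitlines()))
```

For a real file, the same function can be called on an open file handle, for example `with open(path, encoding="utf-8") as f: docs = list(read_bio_documents(f))`, where `path` points to one of the files in `hex08_data`.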

#### Skeleton code
For this home exercise, we provide you with skeleton code for an LSTM and a Bi-LSTM implementation (``lstm_unsolved.py`` and ``bilstm_unsolved.py``). The LSTM implementation is fully operational. Completing the Bi-LSTM implementation will be your task. The code uses the ``gensim``, ``argparse``, ``sklearn`` and ``h5py`` Python packages.

#### Hints on the Submission Format
* Please submit your Python code for all tasks where it is applicable. Make sure to include comments explaining complicated or non-obvious sections of your code.
* Please also submit a copy of the console output of your code execution. Your code might run in 10 minutes on your watercooled battlestation, but it might not run in 10 minutes for the person who corrects your home exercises. Thank you!
* **Please also submit your best model (found in results/model.dat) and the result file containing the predictions (found in results/bilstm.predicts).**

### Task 3.1 Data Analysis (1P)
Have a look at the essay data.

a) What kinds of argument components does this data contain?

b) For each kind of argument component, calculate the number of occurrences in the train/dev/test sets.

...

### Task 3.2 Implementation (2P)
Add a Bi-LSTM to ``bilstm_unsolved.py`` using Keras. For this, you only need to modify the code section marked with "``# TODO: Task 3.2 Implement Bi-LSTM``". Do not change the optimizer (``'adagrad'``) and the loss (``'categorical_crossentropy'``) of the model. The utils folder contains required helper functions.

**Hints:**
* Accumulate the results of the LSTM layers with ``'concat'``.
* Apply dropout before and after the LSTM layers.
* Hint on padding: Sequence tagging tasks usually work on input sequences of varying length. It is necessary to pad these sequences so that each input sequence has the same length. In Keras, one can do this by zero padding, i.e. by adding zeros at the start or the end to ensure that all examples have the same shape. When using padding, it is important that none of your class indexes is the same as the padding value, since this would distort the results. For zero padding, we can simply reserve 0 (zero) as the padding label.
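
The padding hint can be illustrated in plain Python. The sketch below maps labels to class indexes starting at 1 and pads at the end of each sequence with the reserved value 0. The function name `pad_label_sequences` and the label order are assumptions for illustration; the skeleton code may handle padding differently.

```python
def pad_label_sequences(sequences, label_names, pad_value=0):
    """Map labels to indexes and pad all sequences to the same length.

    Hypothetical helper, not part of the skeleton code.
    """
    # Class indexes start at 1 so that 0 stays reserved for padding.
    label_to_index = {name: i + 1 for i, name in enumerate(label_names)}
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        indexes = [label_to_index[label] for label in seq]
        # Pad at the end of the sequence ("post" padding).
        padded.append(indexes + [pad_value] * (max_len - len(indexes)))
    return padded, label_to_index


padded, label_to_index = pad_label_sequences(
    [["B-Claim", "I-Claim", "O"], ["O"]],
    label_names=["O", "B-Claim", "I-Claim"],
)
```

Because no real label is mapped to 0, padded positions can always be told apart from actual tokens.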

### Task 3.3 Intermediate Results (1P)
Train your model on the training data and observe the ``val_categorical_accuracy`` on the **development** set. Set the batch size to 10, dropout to 0.5, the number of hidden units to 100, and run for 10 epochs. Report your macro F1 score on the **test** set.

### Task 3.4 A Different Selection of the best Model (1P)
The current implementation stores the model with the best ``val_categorical_accuracy`` on the **development** set.

Complete the ``on_epoch_end(...)`` method in the skeleton of the ``F1Score`` class so that the macro F1 score is used as the indicator for the best model (look out for the line "``# TODO: Task 3.4 Implement callback on f1 score``"). Retrain the model with the parameters from task 3.3 and compare the macro F1 score on the **test** set with the previous approach.

**Hint**: You may use ``sklearn.metrics.f1_score(...)``.
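
For reference, the macro F1 score is the unweighted mean of the per-class F1 scores. The pure-Python sketch below spells out that definition; the function name `macro_f1` is hypothetical, and the result is expected (not guaranteed here) to agree with ``sklearn.metrics.f1_score(y_true, y_pred, average='macro')``. Whether padded positions are filtered out beforehand is left to the caller.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over all classes seen in either list.

    Hypothetical helper for illustration, not part of the skeleton code.
    """
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never matches.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels: per-class F1 is 2/3 for "B", 1 for "I" and 2/3 for "O".
score = macro_f1(["O", "O", "B", "I"], ["O", "B", "B", "I"])
```

Unlike accuracy, this score gives rare classes the same weight as frequent ones, which matters when most tokens share one label.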

### Task 3.5 Hyperparameter Optimization (1P)
Experiment with the following hyperparameters:
* hidden_units
* dropout
* batch_size

Report your three best hyperparameter sets ("best" according to the development set). Additionally report the macro F1 score of these three parameter sets on the test set. Please do not forget to also hand in your model (located in ``results/bilstm.model``) and the predictions (located in ``results/bilstm.predicts``).

**Hint:** There is no need to add dropout to the Bi-LSTM gates, but feel free to do so (if you do, please also report the values you used).

### Task 3.6 Bi-LSTM Masking (2P)
Set the ``mask_zero`` parameter in the Embedding layer to ``False`` and retrain the Bi-LSTM with a reasonable parameter set of your choice. How do the results change compared to leaving ``mask_zero = True``? Explain the change in one or two short sentences.

...

### Bonus Task 3.7 Error Analysis (1P Bonus)
Take a look at the results of your best performing Bi-LSTM model from task 3.5 (``results/bilstm.predicts``). Each line in the results has the following format:

    word(str)	predict_label(str)	true_label(str)
    
Discuss the quality of your model's predictions in two sentences.

...

### Bonus Task 3.8 Bi-LSTM vs. LSTM (1P Bonus)
``lstm_unsolved.py`` contains an implementation of an LSTM. Add your callback implementation of the macro F1 score and run the Bi-LSTM and the LSTM with the parameters from task 3.3. Which one reaches a higher macro F1 score and why?

Note: In case you did not finish task 3.4, you may determine the best model by using the accuracy instead of the macro F1 score. Still, use the macro F1 scores of both resulting models for comparison.

...