# Computational Semantics 2018 (LDA-T3101)
## Practical assignment 5: Paraphrase identification with neural networks
---

### Preliminaries

This assignment is shared as a [Jupyter Notebook](https://jupyter.org) document. A notebook is an interactive document that contains a mix of executable code and Markdown elements, among others. A notebook is divided into cells, and you can identify a cell by the bounded box that surrounds it when you select the cell. The cell you are reading right now is a Markdown cell. We will use Markdown cells to structure the assignment and give you the task descriptions.

The other type of cell we are interested in is the Code cell. The cell below that contains a `print`-command is a Code cell. You can run the cell by selecting it and pressing `Ctrl+Enter`. Run it and you should see the output below.

In [None]:
# This is a code cell
a = 1 + 2
print(a)

The code execution is handled by a Python kernel that runs in the background. When you run a code cell, any variables you create will be stored by the kernel. The whole notebook shares a single kernel, so you can reuse the variables in later cells:

In [None]:
print(a + 3)

This offers a convenient way of structuring the code. It is especially convenient when some of the code is very slow to run or only needs to be run once. This is the case for this assignment: for example, you will only need to prepare the data you will use once, but you probably need to change a function that defines your neural network model several times. Because preparing the data is rather slow, only running it once speeds up development.

Notice that this feature of the notebook can also lead to bugs: because variables are stored in the kernel, you can define a variable, run the code, delete the definition, and use the variable somewhere without the original definition being present. If you face hard-to-find bugs in your code, clearing the kernel of all variables is an option. You can restart the kernel by choosing `Kernel -> Restart` from the top menu.
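
The pitfall can be reproduced outside a notebook. The following is a minimal sketch, not a description of how Jupyter is implemented: it assumes the kernel can be modelled as one shared dictionary of variables (the names `kernel_namespace` and `restarted_namespace` are made up for this illustration) against which every cell is executed.

```python
# Model the kernel as one shared namespace (an illustrative simplification).
kernel_namespace = {}

# "Cell 1": define a variable. It is now stored in the kernel.
exec("a = 1 + 2", kernel_namespace)

# "Cell 2": reuse the variable. This keeps working even if the text of
# cell 1 is later deleted, because the kernel still holds 'a'.
exec("b = a + 3", kernel_namespace)

# Restarting the kernel corresponds to starting from an empty namespace.
restarted_namespace = {}
try:
    exec("b = a + 3", restarted_namespace)
    error_after_restart = None
except NameError as err:
    error_after_restart = str(err)

print(kernel_namespace["b"])   # computed with the stored 'a'
print(error_after_restart)     # the missing definition is exposed
```

After the simulated restart the second cell fails with a `NameError`, which is why restarting the kernel and running all cells from the top is a quick way to check that the notebook does not depend on deleted definitions.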

This should be enough to get you started. Jupyter Notebook contains a ton of advanced features, but you do not need them to complete the assignment. If you are interested in diving deeper, check out the [Jupyter website](https://jupyter.org) and the help menu.

---
**IMPORTANT:** The CSC Notebooks environments are destroyed after a set time period. This means that unless you complete the assignment in one sitting and within the allotted time, you need to download the notebook and upload it later. No changes to this notebook will be saved otherwise. You can download the notebook using the top menu: `File -> Download as -> Notebook (.ipynb)`. You can later continue the assignment by uploading the notebook to a new environment (the `upload` button in the upper right corner of the directory view when you open a new environment).

---
### Developing neural networks
---

One way of structuring the process of developing a neural network model for any task is the following:

1. Data preparation
2. Deciding the model architecture
3. Training the model
4. Evaluating the model

This assignment is structured similarly. We will start by importing the libraries we need. Press `Ctrl+Enter` in the code cell below and move on to Part 1 of this assignment.

In [None]:
# Install spaCy using workaround
# !pip install spacy --quiet
# !python -m spacy download en
# Install pydot for plotting model
# !pip install pydot --quiet

import random
import spacy
import utils

from IPython.display import Image
from keras import optimizers
from keras.layers import Input, LSTM, Embedding, Concatenate, Dense, Dropout, Masking
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils.vis_utils import plot_model
from spacy.lang.en import English

# DO NOT CHANGE THESE
from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(234)

# Load spaCy utilities for English
nlp = spacy.load('en_core_web_sm')

---
### Part 1: Data preparation (2 points)
---

The task you will perform in this assignment is *paraphrase identification*. As the name suggests, in the paraphrase identification task the model is given two sentences and it should decide whether the sentences are paraphrases or not. We will use the [**Quora question pairs**](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) corpus.

In this first part of the assignment we will load and preprocess the data so that we can actually feed it into our network. 

The code cell below reads the data file. Each line of the file contains a single sample as 6 different tab-separated fields. The `lines` variable will be a dictionary `lineID -> data`, where `lineID` is a unique identifier for each data sample and `data` is another dictionary with keys `srcID`, `tgtID`, `source`, `target`, and `label`.

* **srcID**: a unique identifier of the source sentence
* **tgtID**: a unique identifier of the target sentence
* **source**: the source sentence
* **target**: the target sentence
* **label**: a label indicating whether the sentences are paraphrases (1) or not (0)
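
To make the nested structure concrete, here is a minimal sketch that parses two invented rows into the same `lineID -> data` shape. The function `parse_lines` and the sample rows are hypothetical and only serve as an illustration; the course's `utils.load_lines` may work differently, for example in how it forms keys such as `ID0`.

```python
import csv
import io

# Two made-up rows in the tab-separated layout described above.
# The values are invented for illustration; they are not from the corpus.
sample_tsv = (
    "0\t1\t2\tHow do I learn Python ?\tWhat is the best way to learn Python ?\t1\n"
    "1\t3\t4\tWhy is the sky blue ?\tHow deep is the ocean ?\t0\n"
)

fields = "lineID srcID tgtID source target label".split()


def parse_lines(text, fields, delimiter="\t"):
    """Hypothetical stand-in for a loader: returns a dict lineID -> data."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    parsed = {}
    for row in reader:
        record = dict(zip(fields, row))
        line_id = record.pop("lineID")  # the identifier becomes the outer key
        parsed[line_id] = record
    return parsed


example_lines = parse_lines(sample_tsv, fields)
print(example_lines["0"])
```

Note that every value, including the label, is read as a string in this sketch, so it has to be converted with `int(...)` before it can be used as a numeric label.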

Run the code cell to read the data file. You do not need to change the parameters when calling the function.

In [None]:
# Do not change
lines = utils.load_lines(
    file_name="data/quora/quora_duplicate_questions.tsv",
    fields="lineID srcID tgtID source target label".split(),
    delimiter="\t",
    n=200000,
    skip_first=True
)
print("First sample in the data:")
print(lines["ID0"])

Now it's time to prepare the data for actual usage. The cell below tokenizes the data and gathers the source and target sentences into two separate lists. The correct labels will be gathered in a third list. You do not need to change this code but check it out to understand what is happening.

In [None]:
# Initialize a spaCy tokenizer (https://spacy.io)
tokenizer = English().Defaults.create_tokenizer(nlp)

# Initialize empty lists for source and target sentences as well as labels.
src_strings = []
tgt_strings = []
labels = []

# Go through the read lines
for line_id, line_obj in lines.items():
    # Tokenize the sentences
    src = " ".join([token.text for token in tokenizer(line_obj["source"])])
    tgt = " ".join([token.text for token in tokenizer(line_obj["target"])])
    # Extract label
    label = int(line_obj["label"])
    
    src_strings.append(src)
    tgt_strings.append(tgt)
    labels.append(label)
    
# Print examples of tokenized data
for i in range(10):
    print("{}\t{}\t{}".format(src_strings[i], tgt_strings[i], labels[i]))

At this step you have the tokenized data split into three lists. You now need to implement the following steps:

1. **Learn the vocabulary of the text and the mapping from tokens to integers:**
    
        "can" -> 2
        "you" -> 3
         etc.
      
2. **Map each word sequence into a sequence of integers:**

        "Can you fill the can ?" -> [2, 3, 14, 25, 2, 53]
        
        Note: This is simply a preprocessing step needed for feeding the sentence to the network. The resulting sequence is not a distributional representation of the sentence.
        
3. **Pad each sequence to length 'MAX_LEN' using the padding index 0:**

        [2, 3, 14, 25, 2, 53] -> [0, 0, 0, ..., 2, 3, 14, 25, 2, 53]
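
The three steps can be sketched in plain Python on two toy sentences. This is only an illustration of the idea under simplifying assumptions, not the Keras implementation: there is no lowercasing, no vocabulary limit and no out-of-vocabulary symbol, and the names `sentences`, `vocab`, `mapped` and `padded` are made up for this example, so the integers will differ from those Keras produces.

```python
from collections import Counter

sentences = ["can you fill the can ?", "can you help ?"]
max_len = 8  # toy value for this sketch

# Step 1: count the tokens and number them by descending frequency.
# Index 0 is reserved for padding, so numbering starts at 1.
counts = Counter(token for s in sentences for token in s.split())
vocab = {token: i
         for i, (token, _) in enumerate(counts.most_common(), start=1)}

# Step 2: replace every token with its integer.
mapped = [[vocab[token] for token in s.split()] for s in sentences]

# Step 3: left-pad every sequence with 0 up to max_len. Sequences longer
# than max_len keep only their last max_len items.
padded = [[0] * (max_len - len(seq)) + seq[-max_len:] for seq in mapped]

print(vocab)
print(mapped)
print(padded)
```

Both occurrences of "can" in the first sentence map to the same integer, and after padding every row has the same length, which is what allows the rows to be stacked into a matrix.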
        
In addition to implementing each of the three steps above, after each step print out at least a part of the data (for example a single sample) and explain as comments in the cell what the data looks like and why. The cell below contains some code that will get you started, as well as hints for each step.

In [None]:
N_WORDS = 50000  # Vocabulary size
MAX_LEN = 50  # Maximum length of sentence (in tokens)

# STEP 1. Learning the vocabulary
# Below we initialize a Keras Tokenizer (keras.io/preprocessing/text/) that you can
# use to learn the mapping from tokens to integers. It will automatically limit the
# vocabulary to the 'N_WORDS' most common tokens, lowercase the data, and replace 
# all out-of-vocabulary tokens (those outside the 'N_WORDS' most common) with a
# special symbol (<UNK>).
processor = Tokenizer(num_words=N_WORDS, lower=True, oov_token="<UNK>")

# TODO: the method 'fit_on_texts' does the actual learning. It takes in a list of strings
# (sentences) and constructs the vocabulary and mapping. Change the line below so that 
# it learns the vocabulary from the Quora data.
processor.fit_on_texts(["hello how are you ?"])

In [None]:
# STEP 2. Mapping the sentences
# TODO: Now that we have learned the mapping, we need to use it to map the sentences in our data
# to integer sequences. 'processor' has a method called 'texts_to_sequences' that takes in
# a list of strings and returns the mapped sentences. Create new lists 'src_mapped' and 'tgt_mapped'
# that contain the mapped source and target sentences respectively.
print(processor.texts_to_sequences(["hello how are you ?"]))

In [None]:
# STEP 3. Padding
# At this point 'src_mapped' and 'tgt_mapped' are lists of variable-length lists.
# In order to feed a batch of sentences into our model, we need to do so-called
# padding. In padding we extend the shorter sequences to a common length by
# using a special padding symbol, in this case the integer 0. You can do this
# with the function 'pad_sequences' (imported above), which takes in the mapped
# sentences and a keyword argument 'maxlen' which is the padding length.
# Sequences longer than 'maxlen' are truncated.
#
# The 'pad_sequences'-function also constructs a matrix out of the training sequences.
# At this step the lists 'src_mapped' and 'tgt_mapped' are turned into matrices with 
# dimensions (n_training_pairs x MAX_LEN).
#
# TODO: Create two variables 'src' and 'tgt' that contain the matrices with padded sequences.
# Make sure the matrices have the correct dimensions (attribute src.shape).
print(pad_sequences([[1, 2, 3, 4], [5, 6, 7]], maxlen=MAX_LEN))

---
### Part 2: Defining the model (1 point)
---
The function in the cell below defines our neural network model. Go through the code so that you understand how it works. The cell will also draw the model for you. Explain the structure of the model as comments in the cell. The comments contain some explanation, but you should also refer to the [Keras documentation](https://keras.io) (especially the sections `Models` and `Layers`).
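
One way to check your understanding of the structure is to estimate how many trainable parameters each layer has. The sketch below assumes the usual parameter-count formulas for embedding, LSTM and fully connected layers, and uses the baseline values from this notebook (vocabulary size 50000, embedding dimension 128, 128 LSTM units). The helper functions are made up for this illustration and are not part of Keras; you can compare the result with the output of `model_baseline.summary()`.

```python
def embedding_params(num_words, embedding_dim):
    # One vector of size embedding_dim per vocabulary index.
    return num_words * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


num_words, embedding_dim, lstm_units = 50000, 128, 128

# The embedding and the LSTM are applied to both sentences, but the same
# layer objects are reused, so their weights are counted only once.
layer_params = {
    "embedding (shared)": embedding_params(num_words, embedding_dim),
    "lstm (shared)": lstm_params(embedding_dim, lstm_units),
    "dense 1": dense_params(2 * lstm_units, 128),  # input: two encodings
    "dense 2": dense_params(128, 128),
    "prediction": dense_params(128, 1),
}
total_params = sum(layer_params.values())

for name, n in layer_params.items():
    print("{:20s}{:>10,d}".format(name, n))
print("{:20s}{:>10,d}".format("total", total_params))
```

Under these assumptions almost all of the parameters sit in the embedding layer, which is worth keeping in mind when you think about how the model behaves during training.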

In [None]:
def define_model(max_len, num_words, learning_rate, embedding_dim, lstm_units):
    # Define model inputs (in this case, two sequences of length 'MAX_LEN')
    src = Input(shape=(max_len,), dtype='int32')
    tgt = Input(shape=(max_len,), dtype='int32')
    
    # Mask the inputs so that we do not waste computation on padding
    src_masked = Masking(mask_value=0)(src)
    tgt_masked = Masking(mask_value=0)(tgt)

    # Define the embedding layer
    embed = Embedding(
        num_words, embedding_dim, 
        input_length=max_len, 
        mask_zero=True
    )
    # We use the 'functional' way of defining models in Keras so the embedding 
    # is done by calling the embedding layer on the sequences. Let's do it for 
    # both the source and the target.
    embedded_src = embed(src_masked)
    embedded_tgt = embed(tgt_masked)

    # Define the encoder, in this case a simple LSTM layer.
    encode = LSTM(units=lstm_units)

    # Call the LSTM layer, this time on the embedded inputs.
    # The outputs are the encoded sequences.
    encoded_src = encode(embedded_src)
    encoded_tgt = encode(embedded_tgt)

    # Concatenate the encoded sequences. This serves as an input
    # to the classification layer.
    concatenated = Concatenate()([encoded_src, encoded_tgt])

    # Classification layer 1
    out = Dense(128, activation="tanh")(concatenated)
    # Classification layer 2
    out = Dense(128, activation="tanh")(out)
    # Prediction layer
    predictions = Dense(1, activation="sigmoid")(out)
    
    # Initialize a model instance with the input layers 'src' and 'tgt'
    # And the output 'predictions'.
    model = Model(inputs=[src, tgt], outputs=predictions)
    
    # Initialize optimizer with the chosen learning rate.
    optim = optimizers.Adam(lr=learning_rate)
    # Compile the model.
    model.compile(optimizer=optim, loss='binary_crossentropy', metrics=['accuracy'])

    return model

# Here we set some hyperparameters.
EMB_BASELINE = 128  # Embedding dimensions
UNITS_BASELINE = 128  # Number of units in the LSTM layer
LEARNING_RATE_BASELINE = 0.005  # Learning rate for optimizer

N_SAMPLES = 75000  # Number of samples to use (must be less than what we loaded above)

# Create the model
model_baseline = define_model(MAX_LEN, N_WORDS, LEARNING_RATE_BASELINE, EMB_BASELINE, UNITS_BASELINE)

# Draw the model scheme
plot_model(model_baseline, to_file='model_baseline.png', show_shapes=True, show_layer_names=True)
Image("model_baseline.png")

---
### Part 3: Training the model (4 points)
---

The cell below creates a model and trains it with the given hyperparameters. This serves as a baseline for the assignment. Your task is to interpret the output of the training, evaluate the results and identify any problems with the model.
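
When reading the training log it helps to know how many samples end up in each part of the data. The arithmetic below is a small sketch using the values from this part (`N_SAMPLES = 75000`, a validation split of 0.02, batches of 128 and 4 epochs). It assumes that Keras holds out the last fraction of the samples for validation and processes the rest in batches; the variable names are chosen for this illustration.

```python
import math

n_samples = 75000        # N_SAMPLES
validation_split = 0.02  # fraction held out for validation
batch_size = 128
epochs = 4

# Samples used for training and for validation.
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# Number of batches (weight updates) per epoch; the last batch is smaller.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(n_train, n_val)
print(updates_per_epoch, total_updates)
```

The validation set is small compared to the training set, so validation figures can fluctuate more from epoch to epoch than the training figures do.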

In [None]:
# This is where the actual training happens
history_baseline = model_baseline.fit(
    x=[src[:N_SAMPLES], tgt[:N_SAMPLES]],  # Use N_SAMPLES first samples as data
    y=labels[:N_SAMPLES],  # Feed in correct labels
    batch_size=128,  # Process 128 pairs per batch
    epochs=4,  # Go through the data 4 times 
    validation_split=0.02  # use 2% of the data for validation
)

# TODO: The training (model_baseline.fit) logs some basic information as the training
# advances. Interpret the output with a few sentences as comments here.

# TODO: The function 'utils.plot_training(history_baseline)' plots some of the information
# for another view. Identify what is problematic with the training results. Based on the
# reading for this assignment, how would you alleviate the problem? Answer as comments.
# Also, change the model or the training procedure and see if you can get better results.
# (HINT: https://keras.io/layers/recurrent/#lstm, https://keras.io/regularizers/)
#
# You can copy the 'define_model' function to another cell or modify the existing code.
# Also, feel free to change the amount of data you use, or the hyperparameters for training
# if you think this is necessary. 
#
# Because training a model can take a very long time, we do not expect you to get spectacular
# results. You can get full points if you identify the problem, implement some way of
# alleviating it, and train a model with some sensible parameters, even if the results do not
# improve significantly.

utils.plot_training(history_baseline)

---
### Part 4: Evaluating the model (3 points)
---

In this last part you need to test the model on some of your own data. Usually at this point the model would be evaluated on a test set that has not been seen during the development. This time we will skip that, however.
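
The prediction layer of the model is a single sigmoid unit, so for each sentence pair the model outputs a score between 0 and 1. The sketch below shows one way of turning such scores into labels. The scores are invented and are not results from the model, the function `scores_to_labels` is hypothetical, and the threshold of 0.5 is a common default that is assumed here rather than prescribed by the assignment.

```python
# Invented scores, shaped like the output of a single sigmoid unit for
# three sentence pairs. These are NOT results from the model.
example_scores = [[0.91], [0.12], [0.50]]


def scores_to_labels(scores, threshold=0.5):
    """Map each score to 1 (paraphrase) or 0 (not a paraphrase)."""
    return [1 if row[0] >= threshold else 0 for row in scores]


predicted_labels = scores_to_labels(example_scores)
print(predicted_labels)
```

Looking at the raw scores as well as the thresholded labels is useful when you discuss the results, because a score close to the threshold indicates a much less confident decision than a score close to 0 or 1.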

In [None]:
# TODO: Figure out a way to test the model on your own sentences. (HINT: 
# https://keras.io/models/model/). Discuss how well the model performs.

test_pairs = [
    ("Why are carrots orange ?", "What is the reason for carrots' orange color ?"),
    # etc ..
]