# AMMI Deep Natural Language Processing: Lab 1

## 0. Introduction

In this tutorial we will train neural networks on the bAbI tasks using the ParlAI framework.  
This tutorial can be run either in Google Colab or on your computer.  
The solutions will be added during the lab [here](https://fburl.com/ammi_dnlp_lab1).  

We will cover the following:
0. Introduction
    - Introduction to ParlAI and installation
    - Introduction to the bAbI tasks
1. Exploring the data:
    - Compute some statistics (number of examples in train, valid, test, size of examples...)
    - Look at some examples
2. Choose the appropriate metrics
3. Baselines
    - Random baseline
    - Majority class baseline
    - Information retrieval baseline
4. More elaborate models
   - Generative model: Seq2Seq
   - Ranking model: Memory Network
5. To go further
    - Additional ideas to try if you want to dig deeper

### ParlAI
[ParlAI](https://github.com/facebookresearch/ParlAI/blob/master/README.md) (pronounced “par-lay”) is a framework for dialogue AI research, implemented in Python.

Its goal is to provide researchers:

* a unified framework for sharing, training and testing dialogue models
* many popular datasets available all in one place -- from open-domain chitchat to visual question answering.
* a wide set of reference models -- from retrieval baselines to Transformers.
* seamless integration of Amazon Mechanical Turk for data collection and human evaluation
* integration with Facebook Messenger to connect agents with humans in a chat interface

Documentation can be found [here](http://www.parl.ai/static/docs/). Parts of this tutorial are inspired by the ParlAI documentation, so feel free to go back and forth between the notebook and the documentation.


### Set up the notebook
If using Google Colab, make sure to use the TPU runtime by going to ***Runtime > Change runtime type > Hardware accelerator: TPU > Save***

### Install ParlAI

Start by installing ParlAI from GitHub. The ParlAI folder will be located in the home directory at `~/ParlAI/`.  
*Note: In a Jupyter notebook, you can run arbitrary bash commands by prefixing them with an exclamation mark, for example: `!echo "Hello World"`*

In [0]:
# Remove `> /dev/null` to see the output of commands
# !git clone https://github.com/facebookresearch/ParlAI.git ~/ParlAI  > /dev/null
# !cd ~/ParlAI && git checkout 6bd0e58692b3fd3a13b5f654944525ac1b7cd8e3
# !cd ~/ParlAI && python3 setup.py develop > /dev/null

Most of the scripts that we will use in ParlAI are located in the `~/ParlAI/examples` directory.  
Let's take a first look at the available scripts; we will come back to them later:

In [0]:
!ls ~/ParlAI/examples/

base_train.py	       eval_model.py		 remote.py
build_dict.py	       extract_image_feature.py  seq2seq_train_babi.py
build_pytorch_data.py  interactive.py		 train_model.py
display_data.py        profile_train.py
display_model.py       README.md


### The bAbI tasks
Many datasets and tasks are included in ParlAI; we will focus on the bAbI tasks.
The bAbI tasks, introduced by [Weston et al. ‘16](http://arxiv.org/abs/1502.05698), are 20 synthetic tasks that each test a unique aspect of text understanding and reasoning, and hence probe different capabilities of learning models.

---
**Question 0.**  
Open the bAbI [paper](https://arxiv.org/pdf/1502.05698.pdf) and read the abstract and section *"3 The Tasks"* (up to and including the paragraph **Two or Three Supporting Facts**).  
- **0.a.** Explain in your own words the motivations behind these tasks (in 2-3 sentences).

*The motivation behind these tasks is to provide a reliable way of measuring the performance of any machine reading system. Most of the tasks that researchers could use to evaluate their models are difficult and expensive. This motivates providing a set of distinct tasks, built on well-defined datasets, that make the results easy to interpret.*

---

These tasks can be downloaded and used directly from ParlAI.  
We will focus on tasks 1, 2 and 3, see examples below:


**Task 1: Single Supporting Fact**  
Mary went to the bathroom.  
John moved to the hallway.  
Mary travelled to the office.  
Where is Mary?  
**Answer: office**  


**Task 2: Two Supporting Facts**  
John is in the playground.  
John picked up the football.  
Bob went to the kitchen.  
Where is the football?  
**Answer: playground**


**Task 3: Three Supporting Facts**  
John picked up the apple.  
John went to the office.  
John went to the kitchen.  
John dropped the apple.   
Where was the apple before the kitchen?  
**Answer: office**



## 1. Exploring the data

First we need to download the data. We will run `build_dict.py` as a dummy job to trigger the download.

In [0]:
# Download the data silently
!python ~/ParlAI/examples/build_dict.py --task babi:task1k:1 --dict-file /tmp/babi1.dict
# Print a few examples
!head -n 30 ~/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt

[ Main ParlAI Arguments: ] 
[  batchsize: 1 ]
[  datapath: /root/ParlAI/data ]
[  datatype: train ]
[  download_path: /root/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task1k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: None ]
[  init_model: None ]
[  model: None ]
[  model_file: None ]
[ PytorchData Arguments: ] 
[  batch_length_range: 5 ]
[  batch_sort_cache_type: pop ]
[  batch_sort_field: text ]
[  numworkers: 4 ]
[  pytorch_context_length: -1 ]
[  pytorch_datapath: None ]
[  pytorch_include_labels: True ]
[  pytorch_preprocess: False ]
[  pytorch_teacher_batch_sort: False ]
[  pytorch_teacher_dataset: None ]
[  pytorch_teacher_task: None ]
[  shuffle: False ]
[ Dictionary Loop Arguments: ] 
[  dict_include_test: False ]
[  dict_include_valid: False ]
[  dict_maxexs: -1 ]
[  log_every_n_secs: 2 ]
[ Dictionary Arguments: ] 
[  bpe_debug: False ]
[  dict_endtoken: __end_

The bAbI tasks were downloaded to `~/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-nosf/`

In bAbI the data is organised as follows:
- **Dialog turn**: A dialog turn is a single utterance / statement. Each line in the file corresponds to one dialog turn.   
  Example: *"John went to the office."*
- **Sample (question)**: Every few dialog turns, a question is asked that the model has to answer; this constitutes a sample. The question is followed by its ground truth answer, separated by a tab.  
  Example: *"Where is John? `<tab>` bathroom"*
- **Episode**: A sequence of ordered, coherent dialog turns that are related to each other forms an episode. Each episode is independent of the others. Each line starts with the dialog turn number within the current episode.
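
The structure described above can be sketched with a small parser. This is only an illustration: the `SAMPLE` string and the `parse_babi` helper are hypothetical names made up for this example, and the sample assumes the layout described in the list (turn number, text, and a tab before the answer on question lines).

```python
from collections import Counter

# Hypothetical in-memory sample in the bAbI text format (not read from disk).
SAMPLE = (
    "1 Mary went to the bathroom.\n"
    "2 John moved to the hallway.\n"
    "3 Where is Mary? \tbathroom\n"
    "4 Daniel went to the office.\n"
    "5 Where is John? \thallway\n"
    "1 Sandra travelled to the garden.\n"
    "2 Where is Sandra? \tgarden\n"
)


def parse_babi(text):
    """Group lines into episodes; each turn is (text, answer or None)."""
    episodes = []
    for line in text.splitlines():
        number, _, rest = line.strip().partition(" ")
        if int(number) == 1:  # turn number 1 starts a new episode
            episodes.append([])
        fields = rest.split("\t")
        # Only question lines carry a second, tab-separated field.
        answer = fields[1].strip() if len(fields) > 1 else None
        episodes[-1].append((fields[0].strip(), answer))
    return episodes


episodes = parse_babi(SAMPLE)
answers = Counter(a for ep in episodes for _, a in ep if a is not None)
print(len(episodes), "episodes,", sum(answers.values()), "questions")
print(answers.most_common())
```

Stripping each field before counting avoids keeping trailing newlines or empty tokens in the counters.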


---
**Question 1.**
- **1.a.** Look at the training file of task 1 (`~/ParlAI/data/bAbI/tasks_1-20_v1-2/en/qa1_train.txt`) and compute the following information:
  - Number of episodes
  - Number of samples (questions)
  - Number of dialog turns per episode
  - How many different answers are there in the train set? How many times does each appear? (*hint: Use a Python [Counter](https://docs.python.org/3/library/collections.html#collections.Counter)*)
  - How many unique words appear in the training set? How many times does each appear? (*hint: Use the Counter `most_common()` method*)

*Print the answer in the following code cell*
  
  ---

In [0]:
# FILL THIS CELL
from collections import Counter


task_1_train_path = '/root/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt'

i = 0
n_episodes = 0
n_questions = 0
n_total_dialog_turns = 0
possible_answers = Counter()
possible_answers_dict = {}
vocabulary = Counter()
with open(task_1_train_path, 'r') as f:
    # FILL CODE HERE
    
    for line in f:
      dialog_turn = int(line.split(' ')[0])
      if dialog_turn == 1:
        n_episodes +=1
      # Remove the dialog turn num
      line = ' '.join(line.split(' ')[1:])
      fields = line.split('\t')
      if len(fields)>1:
        n_questions +=1
        possible_answers.update([fields[1]])
      vocabulary.update(fields[0].split(' '))
      n_total_dialog_turns +=1

print(f'Number of episodes: {n_episodes}')
print(f'Number of questions: {n_questions}')
print(f'Number of dialog turn per episode: {n_total_dialog_turns/n_episodes}')
print(f'Accuracy of a random model: {1/len(possible_answers): 4f}')
print(f'Possible answers: {possible_answers} ({len(possible_answers)})')
print(f'Vocabulary size: {len(vocabulary)}')
print(f'Most common word: {vocabulary.most_common()}')

Number of episodes: 1800
Number of questions: 9000
Number of dialog turn per episode: 15.0
Accuracy of a random model:  0.166667
Possible answers: Counter({'bathroom\n': 1564, 'hallway\n': 1517, 'garden\n': 1508, 'bedroom\n': 1473, 'kitchen\n': 1471, 'office\n': 1467}) (6)
Vocabulary size: 24
Most common word: [('to', 18000), ('the', 18000), ('Where', 9000), ('is', 9000), ('', 9000), ('went', 7225), ('Mary', 4535), ('Sandra', 4502), ('John', 4484), ('Daniel', 4479), ('journeyed', 3620), ('travelled', 3582), ('back', 3581), ('moved', 3573), ('bathroom.\n', 3070), ('hallway.\n', 3045), ('garden.\n', 2982), ('kitchen.\n', 2981), ('office.\n', 2963), ('bedroom.\n', 2959), ('John?', 2299), ('Mary?', 2265), ('Sandra?', 2244), ('Daniel?', 2192)]



- **1.b.** Use the appropriate script from `~/ParlAI/examples/` to take a quick look at examples of the first bAbI task.  
Do the numbers of episodes and examples match what you computed before? (*hint: you can use the argument `--task babi:task1k:1` to select the first bAbI task*)

In [0]:
# FILL THIS CELL
#Solution

!python ~/ParlAI/examples/display_data.py --task babi:task10k:1

[ optional arguments: ] 
[  display_ignore_fields: agent_reply ]
[  max_display_len: 1000 ]
[  num_examples: 10 ]
[ Main ParlAI Arguments: ] 
[  batchsize: 1 ]
[  datapath: /root/ParlAI/data ]
[  datatype: train:stream ]
[  download_path: /root/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task10k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: None ]
[  init_model: None ]
[  model: None ]
[  model_file: None ]
[ PytorchData Arguments: ] 
[  batch_length_range: 5 ]
[  batch_sort_cache_type: pop ]
[  batch_sort_field: text ]
[  numworkers: 4 ]
[  pytorch_context_length: -1 ]
[  pytorch_datapath: None ]
[  pytorch_include_labels: True ]
[  pytorch_preprocess: False ]
[  pytorch_teacher_batch_sort: False ]
[  pytorch_teacher_dataset: None ]
[  pytorch_teacher_task: None ]
[  shuffle: False ]
[ ParlAI Image Preprocessing Arguments: ] 
[  image_cropsize: 224 ]
[  image_size: 256 ]
[

## 2. Metrics

bAbI task 1 expects single-word answers drawn from a small set of possible answers.


---
**Question 2**  
- **2.a.** Which metrics do you think are appropriate for evaluating a model on this task?   
-  **2.b.**  What are their respective strengths?  
-  **2.c.** When do they fail? (find specific examples)  

**Answer 2** 
- **2.a.** Accuracy and BLEU are appropriate metrics for evaluating a model on this task.
-  **2.b.** Accuracy is easy to compute and to check: it is the percentage of correct answers out of the total number of questions.
BLEU is fast and easy to calculate, and it makes it easy to compare a model against a benchmark.
-  **2.c.** Accuracy alone does not guarantee the model's performance on unseen data, so it can mislead when it is measured on data the model has already seen.
BLEU can fail when the same content is phrased with different words, because it does not take meaning or sentence structure into account.
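
To make the comparison concrete, here is a minimal sketch of two word-level metrics. The helper names `exact_match` and `token_f1` are hypothetical and this is not ParlAI's metric code. It illustrates that, for single-word answers, exact-match accuracy and token F1 give the same value, and that they only diverge once a prediction contains extra words.

```python
from collections import Counter


def exact_match(prediction, label):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == label.strip().lower())


def token_f1(prediction, label):
    """Harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears.
    overlap = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


pairs = [("office", "office"), ("kitchen", "office"), ("the office", "office")]
for prediction, label in pairs:
    print(prediction, "|", label, "|", exact_match(prediction, label), token_f1(prediction, label))
```

The last pair shows a failure case of exact match: "the office" contains the right word but scores 0 on accuracy, while token F1 gives partial credit.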




---

## 3. Baseline



We now have a clearer idea of the data distribution and the metrics that we can use.  
The next step is to start solving the tasks with a simple baseline. This will allow us to compare more elaborate models against this baseline.  
Here are a few classical baselines:
- **Random model**: The model answers randomly among the set of possible answers for each question
-  **Majority class**: The model always answers with the most frequent answer in the training set (majority class)

We are going to implement our own versions of these baselines.  
Implementing a new model in ParlAI is detailed in the [tutorial](http://parl.ai/static/docs/seq2seq_tutorial.html) but for our simple baselines, we will only need to inherit the [Agent](https://github.com/facebookresearch/ParlAI/blob/6d246842d3f4e941dd3806f3d9fa62f607d48f59/parlai/core/agents.py#L50) class and override the `act()` method.

---
**Question 3**  
- **3.a.** What would be the accuracy of a model that chooses a random answer among the set of possible answers for each question? 

**Answer 3**
- **3.a.** A model that chooses uniformly at random among the possible answers is correct with probability 1/K, where K is the number of possible answers. Task 1 has 6 possible answers, so the expected accuracy is 1/6 ≈ 16.7%, which is low.

---

*Note: the `%%writefile` magic command in jupyter writes the content of the cell to a file at the given path.*

In [0]:
!mkdir -p ~/ParlAI/parlai/agents/baseline/
!touch ~/ParlAI/parlai/agents/baseline/random.py
!touch ~/ParlAI/parlai/agents/baseline/majorityclass.py

- **3.b.**  Design a baseline that answers with a random word from the set of possible answers (run it multiple times to observe the variance in the results).

In [0]:
%%writefile ~/ParlAI/parlai/agents/baseline/random.py
# FILL THIS CELL
import random

from parlai.core.agents import Agent


class RandomAgent(Agent):

    def act(self):
        # Solution
        # Without candidates there is nothing to choose from.
        if 'label_candidates' not in self.observation:
            return
        # label_candidates can be any iterable, so materialize it before sampling.
        candidates = list(self.observation['label_candidates'])
        # Pick one candidate uniformly at random.
        reply = {'text': random.choice(candidates)}
        return reply

Overwriting /root/ParlAI/parlai/agents/baseline/random.py


In [0]:
!python ~/ParlAI/examples/eval_model.py -t babi:task10k:1 -m baseline/random | grep accuracy -A 1

{'exs': 1000, 'accuracy': 0.17, 'f1': 0.17, 'bleu': 1.7e-10}


In [0]:
!python ~/ParlAI/examples/display_model.py -t babi:task10k:1 -m baseline/random -n 10 

[ optional arguments: ] 
[  display_ignore_fields:  ]
[  num_examples: 10 ]
[ Main ParlAI Arguments: ] 
[  batchsize: 1 ]
[  datapath: /root/ParlAI/data ]
[  datatype: valid ]
[  download_path: /root/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task10k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: None ]
[  init_model: None ]
[  model: baseline/random ]
[  model_file: None ]
[ ParlAI Image Preprocessing Arguments: ] 
[  image_cropsize: 224 ]
[  image_size: 256 ]
[creating task(s): babi:task10k:1]
[loading fbdialog data:/root/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
[babi:task10k:1]: Sandra travelled to the office.
Sandra went to the bathroom.
Where is Sandra?
[eval_labels: bathroom]
[label_candidates: hallway|bedroom|office|bathroom|kitchen|...and 1 more]
   garden
~~
[babi:task10k:1]: Mary went to the bedroom.
Daniel moved to the hallway.
Where is S

- **3.c.**  Design a baseline that answers the most common answer every time (majority class baseline).

In [0]:
%%writefile ~/ParlAI/parlai/agents/baseline/majorityclass.py
# FILL THIS CELL
from parlai.core.agents import Agent

# Most frequent answer in the task 1 training set (see the counts from section 1).
MAJORITY_ANSWER = 'bathroom'


class MajorityclassAgent(Agent):

    def act(self):
        # Solution
        if 'label_candidates' not in self.observation:
            return
        # Always reply with the majority class, whatever the question.
        reply = {'text': MAJORITY_ANSWER}
        return reply

Overwriting /root/ParlAI/parlai/agents/baseline/majorityclass.py


In [0]:
!python ~/ParlAI/examples/eval_model.py -t babi:task10k:1 -m baseline/majorityclass | grep accuracy -A 1

{'exs': 1000, 'accuracy': 0.169, 'f1': 0.169, 'bleu': 1.69e-10}


In [0]:
!python ~/ParlAI/examples/display_model.py -t babi:task10k:1 -m baseline/majorityclass -n 10

---
- **3.d.**  In which cases would the majority class baseline be better than the random baseline?

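*The majority class baseline is better when the answer distribution is imbalanced. Its accuracy equals the frequency of the most common answer, whereas the random baseline scores about 1/K for K possible answers, whatever the distribution.*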
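
As a rough check of this comparison, the sketch below computes both expected accuracies from answer counts. The `train_counts` values are the task 1 training-set counts printed in section 1; the `skewed` distribution and the `expected_accuracies` helper are hypothetical, added only for illustration.

```python
# Answer counts copied from the task 1 training-set statistics printed in section 1.
train_counts = {
    "bathroom": 1564, "hallway": 1517, "garden": 1508,
    "bedroom": 1473, "kitchen": 1471, "office": 1467,
}

# Hypothetical imbalanced distribution, for contrast.
skewed = {"yes": 80, "no": 15, "maybe": 5}


def expected_accuracies(counts):
    """Return (random baseline, majority baseline) accuracy on this distribution."""
    total = sum(counts.values())
    random_acc = 1 / len(counts)                   # uniform guess among K answers
    majority_acc = max(counts.values()) / total    # frequency of the top answer
    return random_acc, majority_acc


random_acc, majority_acc = expected_accuracies(train_counts)
print(f"bAbI task 1 (train): random={random_acc:.3f}, majority={majority_acc:.3f}")

skewed_random, skewed_majority = expected_accuracies(skewed)
print(f"skewed example: random={skewed_random:.3f}, majority={skewed_majority:.3f}")
```

On the nearly uniform task 1 answers the two baselines are almost tied, while on the skewed example the majority class baseline is far ahead.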

---

Another slightly more advanced baseline is implemented in ParlAI: the information retrieval baseline (`ir_baseline`).

---
- **3.e.** Look at the [implementation](https://github.com/facebookresearch/ParlAI/blob/53ea58acf389bffc79c85c43bcdd848eecdcecb4/parlai/agents/ir_baseline/ir_baseline.py#L211) of the IR baseline and explain in a few lines how it works (*hint: look at the following methods `act()` `rank_candidates()`  `score_match()`*)  

*The implementation can be explained through three of its methods. `rank_candidates()` returns the list of candidate answers sorted by score. `score_match()` computes the score of a candidate by summing the weights of the candidate's words that also appear in the query representation, then normalizing by the square root of the candidate length. `act()` builds the word representation of the observed text, ranks the candidates, and replies with the top-ranked one, i.e. the candidate with the highest score.*
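
The idea can be sketched with a simplified word-overlap ranker. This is not the ParlAI code: the functions below are hypothetical, every word gets the same weight, and the tokenizer is a plain regular expression, all of which are assumptions made to keep the example short.

```python
import math
import re


def tokenize(text):
    """Lowercase and keep only alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, candidate):
    """Number of distinct candidate words found in the query,
    divided by the square root of the number of distinct candidate words."""
    query_words = set(tokenize(query))
    cand_words = set(tokenize(candidate))
    if not cand_words:
        return 0.0
    return len(query_words & cand_words) / math.sqrt(len(cand_words))


def rank_candidates(query, candidates):
    """Sort candidates from highest to lowest score (ties keep input order)."""
    return sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)


candidates = ["bathroom", "office", "kitchen"]
easy = "Mary went to the bathroom. Where is Mary?"
hard = "Mary went to the bathroom. Mary travelled to the office. Where is Mary?"
print(rank_candidates(easy, candidates))
print([overlap_score(hard, c) for c in candidates])
```

In the second story the correct answer is "office", yet "bathroom" and "office" receive the same score because word overlap ignores the order of events. This may be part of why such a baseline stays well below perfect accuracy on these tasks.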


---

- **3.f.** Use the IR baseline and compare it with one of your baselines (random and/or majority) on bAbI tasks 1, 2 and 3.  
    (*hint: you can use the `!python ... -t babi:task10k:{i+1}` syntax to substitute the task number in a bash command from Jupyter*)


In [0]:
# FILL THIS CELL
for i in range(3):
    print(f'~ Task {i+1} ~')
    # FILL CODE HERE
    print(f'-Task {i+1}-')
    print('Majority class baseline')
    !python ~/ParlAI/examples/eval_model.py -t babi:task10k:{i+1} -m baseline/majorityclass | grep accuracy

    print('IR baseline')
    !python ~/ParlAI/examples/eval_model.py -t babi:task10k:{i+1} -m ir_baseline | grep accuracy

~ Task 1 ~
-Task 1-
Majority class baseline
{'exs': 1000, 'accuracy': 0.169, 'f1': 0.169, 'bleu': 1.69e-10}
IR baseline
{'exs': 1000, 'accuracy': 0.465, 'f1': 0.465, 'hits@1': 0.465, 'hits@5': 0.961, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 4.65e-10}
~ Task 2 ~
-Task 2-
Majority class baseline
{'exs': 1000, 'accuracy': 0.17, 'f1': 0.17, 'bleu': 1.7e-10}
IR baseline
{'exs': 1000, 'accuracy': 0.284, 'f1': 0.284, 'hits@1': 0.284, 'hits@5': 0.9, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 2.84e-10}
~ Task 3 ~
-Task 3-
Majority class baseline
{'exs': 1000, 'accuracy': 0.203, 'f1': 0.203, 'bleu': 2.03e-10}
IR baseline
{'exs': 1000, 'accuracy': 0.132, 'f1': 0.132, 'hits@1': 0.132, 'hits@5': 0.836, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 1.32e-10}


In [0]:
# SOLUTION For random baseline
for i in range(3):
    print(f'~ Task {i+1} ~')
    print('Random baseline:')
    !python ~/ParlAI/examples/eval_model.py -t babi:task10k:{i+1} -m baseline/random | grep accuracy
    print('IR baseline:')
    !python ~/ParlAI/examples/eval_model.py -t babi:task10k:{i+1} -m ir_baseline | grep accuracy

~ Task 1 ~
Random baseline:
{'exs': 1000, 'accuracy': 0.16, 'f1': 0.16, 'bleu': 1.6e-10}
IR baseline:
{'exs': 1000, 'accuracy': 0.465, 'f1': 0.465, 'hits@1': 0.465, 'hits@5': 0.961, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 4.65e-10}
~ Task 2 ~
Random baseline:
{'exs': 1000, 'accuracy': 0.159, 'f1': 0.159, 'bleu': 1.59e-10}
IR baseline:
{'exs': 1000, 'accuracy': 0.284, 'f1': 0.284, 'hits@1': 0.284, 'hits@5': 0.9, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 2.84e-10}
~ Task 3 ~
Random baseline:
{'exs': 1000, 'accuracy': 0.15, 'f1': 0.15, 'bleu': 1.5e-10}
IR baseline:
{'exs': 1000, 'accuracy': 0.132, 'f1': 0.132, 'hits@1': 0.132, 'hits@5': 0.836, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 1.32e-10}


## 4. More elaborate models



We can now move on to more elaborate models and evaluate their performance relative to the baselines.
We will use the `~/ParlAI/examples/train_model.py` script. Let's first take a look at its arguments:

In [0]:
!python ~/ParlAI/examples/train_model.py --help

usage: train_model.py [-h] [-v] [-t TASK]
                      [-dt {train,train:stream,train:ordered,train:ordered:stream,train:stream:ordered,train:evalmode,train:evalmode:stream,train:evalmode:ordered,train:evalmode:ordered:stream,train:evalmode:stream:ordered,valid,valid:stream,test,test:stream}]
                      [-nt NUMTHREADS] [-bs BATCHSIZE] [-dp DATAPATH] [-m MODEL] [-mf MODEL_FILE] [-im INIT_MODEL] [-et EVALTASK]
                      [-eps NUM_EPOCHS] [-ttim MAX_TRAIN_TIME] [-vtim VALIDATION_EVERY_N_SECS] [-stim SAVE_EVERY_N_SECS]
                      [-sval SAVE_AFTER_VALID] [-veps VALIDATION_EVERY_N_EPOCHS] [-vp VALIDATION_PATIENCE]
                      [-vmt VALIDATION_METRIC] [-vmm {max,min}] [-pyt PYTORCH_TEACHER_TASK] [-pytd PYTORCH_TEACHER_DATASET]

Train a model

optional arguments:
  -h, --help
        show this help message and exit

Main ParlAI Arguments:
  -v, --show-advanced-args
        Show hidden command line options (advanced users only) (default: Fa

We can train two types of models:
- **Generative models**: The model generates an answer from its vocabulary.
- **Ranking models**: The model is given a list of possible answers and has to choose the correct one. This is much easier for the model, since the list of possible answers is often much smaller than the vocabulary.


### Generative model: seq2seq with attention

The generative model we are going to train is a sequence to sequence model with attention based on [Sutskever et al. 2014](https://arxiv.org/abs/1409.3215) and [Bahdanau et al. 2014](https://arxiv.org/abs/1409.0473).
      
- **4.a.** Briefly explain how attention works in sequence to sequence neural networks.
- **4.b.** Do you think attention is useful for the babi tasks? How would you verify it experimentally?

*ANSWERS*
- **4.a.** Attention answers the question "which part of the input should we focus on?"
For every word, an attention vector is computed that captures the contextual relationship between that word and the rest of the input.
In single-head attention, three vectors (the query Q, the key K and the value V) extract different components of each input word. They are combined into the attention output with the formula $Z = \mathrm{softmax}\left(QK^\top / \sqrt{d_k}\right)V$, where $d_k$ is the dimension of the query and key vectors.
- **4.b.** I think attention would be useful for some bAbI tasks.
We could verify this by measuring its performance on each task and comparing it with the LSTM and the Memory Network.
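
The scaled dot-product formula quoted in answer 4.a can be sketched in plain Python for a single query. The toy vectors are hypothetical, and this is only one form of attention: in a seq2seq model the encoder states would typically play the role of keys and values and the decoder state the role of the query, and the attention used in ParlAI's seq2seq agent may be computed differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query scaled dot-product attention: softmax(q.K^T / sqrt(d)) V."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: weighted sum of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical toy inputs: three positions, two-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, context = attention([1.0, 0.0], keys, values)
print(weights)
print(context)
```

The positions whose key matches the query receive the largest weights, so the context vector is dominated by their values.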


---
- **4.c.** Train a seq2seq model on bAbI task 1 (10k) and compare its results to the baselines.
   (*hint: for faster training use the following arguments `--batchsize 32 --numthreads 1 --num-epochs 5 --hiddensize 64 --embeddingsize 64 --numlayers 1 --decoder shared`*)


In [0]:
# Solution
!python ~/ParlAI/examples/train_model.py --task babi:task10k:1 --model seq2seq  --model-file /tmp/babi_s2s --batchsize 32 --numthreads 1 --num-epochs 5 --hiddensize 64 --embeddingsize 64 --numlayers 1 --decoder shared

[ Main ParlAI Arguments: ] 
[  batchsize: 32 ]
[  datapath: /root/ParlAI/data ]
[  datatype: train ]
[  download_path: /root/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task10k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: parlai.core.dict:DictionaryAgent ]
[  init_model: None ]
[  model: seq2seq ]
[  model_file: /tmp/babi_s2s ]
[ Training Loop Arguments: ] 
[  dict_build_first: True ]
[  display_examples: False ]
[  eval_batchsize: None ]
[  evaltask: None ]
[  load_from_checkpoint: False ]
[  max_train_time: -1 ]
[  num_epochs: 5.0 ]
[  save_after_valid: False ]
[  save_every_n_secs: -1 ]
[  validation_cutoff: 1.0 ]
[  validation_every_n_epochs: -1 ]
[  validation_every_n_secs: -1 ]
[  validation_max_exs: -1 ]
[  validation_metric: accuracy ]
[  validation_metric_mode: None ]
[  validation_patience: 10 ]
[  validation_share_agent: False ]
[ Tensorboard Arguments: ] 
[  te

In [0]:
!python ~/ParlAI/examples/display_model.py --task babi:task10k:1 --model seq2seq --model-file /tmp/babi_s2s

[ optional arguments: ] 
[  display_ignore_fields:  ]
[  num_examples: 10 ]
[ Main ParlAI Arguments: ] 
[  batchsize: 1 ]
[  datapath: /root/ParlAI/data ]
[  datatype: valid ]
[  download_path: /root/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task10k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: parlai.core.dict:DictionaryAgent ]
[  init_model: None ]
[  model: seq2seq ]
[  model_file: /tmp/babi_s2s ]
[ ParlAI Image Preprocessing Arguments: ] 
[  image_cropsize: 224 ]
[  image_size: 256 ]
[ Seq2Seq Arguments: ] 
[  attention: none ]
[  attention_length: 48 ]
[  attention_time: post ]
[  bidirectional: False ]
[  decoder: same ]
[  dropout: 0.1 ]
[  embeddingsize: 128 ]
[  hiddensize: 128 ]
[  input_dropout: 0.0 ]
[  lookuptable: unique ]
[  numlayers: 2 ]
[  numsoftmax: 1 ]
[  rnn_class: lstm ]
[ Torch Generator Agent: ] 
[  beam_block_ngram: 0 ]
[  beam_dot_log: False ]
[

### Ranking model: memory network

We saw in class that Memory Networks ([Sukhbaatar et al. '15](https://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf)) rely on an explicit memory "database". This is especially well suited to tasks where a few useful memories are "hidden" among distractor memories.  
This type of network therefore works especially well on the bAbI tasks, using the previous dialog turns as memories and the question as the query.  
Here is an illustration of how a memory network works:

![Memory Network schema](https://raw.githubusercontent.com/louismartin/ammi-2019-bordes-DeepNLP/master/lab1/memory_network.png)


---
**Question 4**  
- **4.d.** Explain how hops work in a memory network (either with words or formulas using the notations of the above figure)
- **4.e.** How can a memory network be used to rank multiple candidates?  
  (*hint: you can look at the [implementation](https://github.com/facebookresearch/ParlAI/blob/6bd0e58692b3fd3a13b5f654944525ac1b7cd8e3/parlai/agents/memnn/modules.py#L22) of the memory network in ParlAI and especially the `_score()` method. Recall how the IR baseline worked.*)
  
*ANSWERS*

- **4.d.** A hop takes as input a set of inputs (the memories) and a query. Each input is embedded and converted into a memory vector, and together these vectors form an embedding matrix. Next, we compute the match between the query embedding and each memory (i.e. each column of the embedding matrix) by applying a softmax to their inner products. This gives a probability vector over the inputs.
The response vector is the sum of the corresponding output vectors, weighted by this probability vector. For the next hop, the response vector is added to the query embedding and the process is repeated.
  
- **4.e.** To explain how a memory network can rank multiple candidates, we focus on the `_score()` method. It first checks the number of dimensions of the candidate tensor before computing the scores. If the tensor has 2 dimensions, the scores are a matrix multiplication between the network output and the candidate matrix.
If it has 3 dimensions, the method computes a batched matrix multiplication and squeezes the extra dimension of the result. Any other number of dimensions raises an error. The candidates are then ranked by score, as in the IR baseline.
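
The two answers above can be tied together in a minimal sketch that runs two hops and then ranks candidates by dot product. The notation follows the End-to-End Memory Networks paper (memories `m_i`, output vectors `c_i`, query `u`). All vectors are hypothetical toy values, reusing the same memories at every hop is an assumption, and none of this is the ParlAI implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hop(u, memories, outputs):
    """One hop: attend over the memories m_i, read the outputs c_i, update u."""
    p = softmax([dot(u, m) for m in memories])            # p_i = softmax(u . m_i)
    o = [sum(p_i * c[j] for p_i, c in zip(p, outputs))    # o = sum_i p_i * c_i
         for j in range(len(u))]
    return [u_j + o_j for u_j, o_j in zip(u, o)], p       # next query: u + o


def rank(u, candidates):
    """Order candidate names by the dot product with the final state."""
    return sorted(candidates, key=lambda name: dot(u, candidates[name]), reverse=True)


# Hypothetical toy embeddings (3 dimensions, 2 memories, 2 candidates).
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
outputs = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
candidates = {"office": [0.0, 0.0, 1.0], "kitchen": [0.0, 1.0, 0.0]}

u = [2.0, 0.0, 0.0]
for _ in range(2):  # two hops
    u, p = hop(u, memories, outputs)
print(p)
print(rank(u, candidates))
```

The query matches the first memory most strongly, so its output vector dominates the final state and the candidate aligned with it is ranked first.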
 
---


- **4.f.** Using the ParlAI implementation, train a memory network on bAbI tasks 1, 2 and 3 (10k) and compare its results with the baselines.  
   (*hint: use 1 thread, a batch size of 32 and 5 epochs*)


In [0]:
# FILL CELL
for i in range(3):
    print(f'~ Task {i+1} ~')
    # solution
    !python ~/ParlAI/examples/train_model.py -t babi:task10k:{i+1} -m memnn -mf /tmp/babi{i+1}_memnn -bs 32 -eps 5 | grep "'accuracy':"

~ Task 1 ~
Building dictionary: 100% 9.00k/9.00k [00:00<00:00, 25.5kex/s]
  ''.format(mode)
  "Some training metrics are omitted for speed. Set the flag "
  ''.format(mode)
valid:{'exs': 1000, 'accuracy': 0.985, 'f1': 0.985, 'hits@1': 0.985, 'hits@5': 1.0, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 9.85e-10, 'lr': 1, 'num_updates': 1407, 'examples': 1000, 'loss': 55.54, 'mean_loss': 0.05554, 'mean_rank': 1.017}
test:{'exs': 1000, 'accuracy': 0.992, 'f1': 0.992, 'hits@1': 0.992, 'hits@5': 1.0, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 9.92e-10, 'lr': 1, 'num_updates': 1407, 'examples': 1000, 'loss': 38.47, 'mean_loss': 0.03847, 'mean_rank': 1.01}
~ Task 2 ~
Building dictionary: 100% 9.00k/9.00k [00:00<00:00, 18.3kex/s]
  ''.format(mode)
  "Some training metrics are omitted for speed. Set the flag "
  ''.format(mode)
valid:{'exs': 1000, 'accuracy': 0.207, 'f1': 0.207, 'hits@1': 0.207, 'hits@5': 0.847, 'hits@10': 1.0, 'hits@100': 1.0, 'bleu': 2.07e-10, 'lr': 1, 'num_updates': 1407, 'examples

## 5. To go further

If you want to go further you can try to do the following:

- Retrieve and plot the attention of the memory network for the different hops along the memories.
- For the seq2seq model, can you plot the training loss? The validation loss? Both on the same plot?
- Can you show an example of overfitting?
- Adapt the seq2seq model for ranking using the [torch ranker tutorial](http://www.parl.ai/static/docs/tutorial_torch_ranker_agent.html)
- Try multitasking bAbI and SQuAD. Does it improve the performance? (This will require more GPU power than what is available in Google Colab.)
- You can play around with other models and other tasks
- Try interfacing ParlAI with [Messenger](http://www.parl.ai/static/docs/tutorial_messenger.html)