# AMMI Deep Natural Language Processing: Lab 1

## 0. Introduction

**All questions should be answered in the separate google form to be graded.**


In this lab we will train neural networks on the bAbI tasks using the ParlAI framework.  
This lab can be run either in Google Colab or on your computer.   

We will cover the following:
0. Introduction
    - Introduction to ParlAI and installation
    - Introduction to the bAbI tasks
1. Exploring the data:
    - Compute some statistics (number of examples in train, valid, test, size of examples...)
    - Look at some examples
2. Choose the appropriate metrics
3. Baselines
    - Random baseline
    - Majority class baseline
    - Information retrieval baseline
4. More elaborate models
    - Generative model: Seq2Seq
    - Ranking model: Memory Network
5. To go further
    - Additional ideas to try if you want to dig deeper

### ParlAI
[ParlAI](https://github.com/facebookresearch/ParlAI/blob/master/README.md) (pronounced “par-lay”) is a framework for dialogue AI research, implemented in Python.

Its goal is to provide researchers:

* a unified framework for sharing, training and testing dialogue models
* many popular datasets available all in one place -- from open-domain chitchat to visual question answering.
* a wide set of reference models -- from retrieval baselines to Transformers.
* seamless integration of Amazon Mechanical Turk for data collection and human evaluation
* integration with Facebook Messenger to connect agents with humans in a chat interface

Documentation can be found [here](http://www.parl.ai/static/docs/). Parts of this tutorial are inspired by the ParlAI documentation, so feel free to go back and forth between the notebook and the documentation.


### Setup the notebook
If you are using Google Colab, make sure to use a TPU runtime by going to ***Runtime > Change runtime type > Hardware accelerator: TPU > Save***

### Install ParlAI

Start by installing ParlAI from GitHub. The ParlAI folder will be located in the home directory at `~/ParlAI/`.  
*Note: In a jupyter notebook, you can run arbitrary bash commands by prefixing them with an exclamation mark, for example: `!echo "Hello World"`*

In [1]:
!git clone https://github.com/facebookresearch/ParlAI.git ~/ParlAI
!cd ~/ParlAI && git checkout 6bd0e58692b3fd3a13b5f654944525ac1b7cd8e3
!cd ~/ParlAI; python3 setup.py develop

Cloning into '/Users/habibmbow/ParlAI'...
remote: Enumerating objects: 46750, done.
remote: Counting objects: 100% (654/654), done.
remote: Compressing objects: 100% (286/286), done.
remote: Total 46750 (delta 389), reused 586 (delta 347), pack-reused 46096
Receiving objects: 100% (46750/46750), 140.89 MiB | 472.00 KiB/s, done.
Resolving deltas: 100% (33194/33194), done.
Note: switching to '6bd0e58692b3fd3a13b5f654944525ac1b7cd8e3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 6bd0e586 light pape

Most of the scripts that we will use in ParlAI are located in the `~/ParlAI/examples` directory.  
Let's take a first look at the available scripts. We will come back to them later:

In [2]:
!ls ~/ParlAI/examples/

README.md                display_model.py         remote.py
base_train.py            eval_model.py            seq2seq_train_babi.py
build_dict.py            extract_image_feature.py train_model.py
build_pytorch_data.py    interactive.py
display_data.py          profile_train.py


### The bAbI tasks
Many datasets and tasks are included in ParlAI; we will focus on the bAbI tasks.
The bAbI tasks, introduced by [Weston et al. ‘16](http://arxiv.org/abs/1502.05698), are 20 synthetic tasks that each test a unique aspect of text and reasoning, and hence test different capabilities of learning models.

---
**Question 0.**  
Open the bAbI [paper](https://arxiv.org/pdf/1502.05698.pdf) and read the abstract and the section *"3 The Tasks"* (up to and including the paragraph **Two or Three Supporting Facts**).  
- **0.a.** Explain in your own words the motivations behind these tasks (in 2-3 sentences).

*ANSWER IN GOOGLE FORM*

---

These tasks can be downloaded and used directly from ParlAI.  
We will focus on tasks 1, 2 and 3, see examples below:


**Task 1: Single Supporting Fact**  
Mary went to the bathroom.  
John moved to the hallway.  
Mary travelled to the office.  
Where is Mary?  
**Answer: office**  


**Task 2: Two Supporting Facts**  
John is in the playground.  
John picked up the football.  
Bob went to the kitchen.  
Where is the football?  
**Answer: playground**


**Task 3: Three Supporting Facts**  
John picked up the apple.  
John went to the office.  
John went to the kitchen.  
John dropped the apple.   
Where was the apple before the kitchen?  
**Answer: office**



## 1. Exploring the data

First we need to download the data. We will run the `build_dict.py` script as a dummy task whose only purpose here is to trigger the download.

In [3]:
# Download the data silently
!python3 ~/ParlAI/examples/build_dict.py --task babi:task1k:1 --dict-file /tmp/babi1.dict
# Print a few examples
!head -n 30 ~/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt

  end_of_data = episode_done and self.next_episode is -1
[ Main ParlAI Arguments: ] 
[  batchsize: 1 ]
[  datapath: /Users/habibmbow/ParlAI/data ]
[  datatype: train ]
[  download_path: /Users/habibmbow/ParlAI/downloads ]
[  hide_labels: False ]
[  image_mode: raw ]
[  multitask_weights: [1] ]
[  numthreads: 1 ]
[  show_advanced_args: False ]
[  task: babi:task1k:1 ]
[ ParlAI Model Arguments: ] 
[  dict_class: None ]
[  init_model: None ]
[  model: None ]
[  model_file: None ]
[ PytorchData Arguments: ] 
[  batch_length_range: 5 ]
[  batch_sort_cache_type: pop ]
[  batch_sort_field: text ]
[  numworkers: 4 ]
[  pytorch_context_length: -1 ]
[  pytorch_datapath: None ]
[  pytorch_include_labels: True ]
[  pytorch_preprocess: False ]
[  pytorch_teacher_batch_sort: False ]
[  pytorch_teacher_dataset: None ]
[  pytorch_teacher_task: None ]
[  shuffle: False ]
[ Dictionary Loop Arguments: ] 
[  dict_include_test: False ]
[  dict_include_valid: False ]
[  dict_maxexs: -1 ]
[  log_every_n_secs

The bAbI tasks were downloaded to `~/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-nosf/`.

In bAbI the data is organised as follows:
- **Dialog turn**: A dialog turn is a single utterance / statement. Each line in the file corresponds to one dialog turn.   
  Example: *"John went to the office."*
- **Sample (question)**: Every few dialog turns, a question is asked that the model has to answer; this constitutes a sample. The question is followed by its ground truth answer, separated by a tab.  
  Example: *"Where is John? `<tab>` bathroom"*
- **Episode**: A sequence of ordered, coherent dialog turns that are related to each other forms an episode. Each new episode is independent of the others. Each line starts with the dialog turn number in the current episode.
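
The structure described above can be sketched as a small parser. This is only an illustration under stated assumptions: the `SAMPLE` string is a hand-written excerpt in the bAbI format rather than data read from disk, and `parse_babi` is a hypothetical helper name, not a ParlAI function.

```python
from collections import Counter

# Hand-written excerpt in the bAbI format (assumption: not read from the real file).
SAMPLE = (
    "1 Mary moved to the bathroom.\n"
    "2 John went to the hallway.\n"
    "3 Where is Mary? \tbathroom\n"
    "4 Daniel went back to the hallway.\n"
    "5 Where is Daniel? \thallway\n"
    "1 Sandra travelled to the office.\n"
    "2 Where is Sandra? \toffice\n"
)


def parse_babi(text):
    """Group lines into episodes; each turn is a (text, answer or None) pair."""
    episodes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        turn_id, content = line.split(" ", 1)
        if turn_id == "1":  # numbering restarts at 1: a new episode begins
            episodes.append([])
        fields = content.split("\t")
        # Only question lines carry a tab-separated answer.
        answer = fields[1].strip() if len(fields) > 1 else None
        episodes[-1].append((fields[0].strip(), answer))
    return episodes


episodes = parse_babi(SAMPLE)
answers = Counter(a for ep in episodes for _, a in ep if a is not None)
print("episodes:", len(episodes))
print("turns per episode:", [len(ep) for ep in episodes])
print("answer counts:", answers.most_common())
```

Detecting a new episode from the turn number resetting to 1 avoids relying on blank lines or any other separator, which the format does not provide.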


---
**Question 1.**
- **1.a.** Look at the training file of task 1 (`~/ParlAI/data/bAbI/tasks_1-20_v1-2/en/qa1_train.txt`) and compute the following information:
  - Number of episodes
  - Number of samples (questions)
  - Number of dialog turns per episode
  - How many different answers are there in the train set? How many times does each appear? (*hint: Use a python [counter](https://docs.python.org/3/library/collections.html#collections.Counter)*)
  - How many unique words appear in the training set? How many times does each appear? (*hint: Use the Counter `most_common()` method*)

*Print the answer in the following code cell*
  
  ---

In [4]:
# FILL THIS CELL
import os
from collections import Counter

# Expand "~" so the path works wherever ParlAI was cloned (local machine or Colab)
task_1_train_path = os.path.expanduser(
    '~/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt')

num_episodes = num_samples = num_turns = 0
answer_counts = Counter()
word_counts = Counter()
with open(task_1_train_path, 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            continue
        # Each line is "<turn number> <text>"; questions append "<tab><answer>"
        turn_id, _, text = line.partition(' ')
        fields = text.split('\t')
        num_turns += 1
        if turn_id == '1':  # turn numbering restarts at 1 for every new episode
            num_episodes += 1
        if len(fields) > 1:  # only question lines carry a tab-separated answer
            num_samples += 1
            answer_counts[fields[1].strip()] += 1
        # Count the words of the statement or question only (no turn numbers,
        # no answers), with trailing punctuation stripped
        word_counts.update(w.strip('.?') for w in fields[0].split())

print("Number of episodes:", num_episodes)
print("Number of samples (questions):", num_samples)
print("Number of dialog turns per episode:", num_turns / num_episodes)
print("Number of different answers:", len(answer_counts))
print("Occurrences of each answer:", answer_counts.most_common())
print("Number of unique words:", len(word_counts))
print("Occurrences of each word:", word_counts.most_common())


Use the appropriate script from `~/ParlAI/examples/` to take a quick look at examples of the first bAbI task.  
Do the numbers of episodes and examples match what you computed before? (*hint: you can use the argument `--task babi:task1k:1` to select the first bAbI task*)

In [None]:
# FILL THIS CELL
!python3 ~/ParlAI/examples/display_data.py --task babi:task1k:1


In [None]:
tab=[]
a="1 je suis etudiant je suis"
for t in a.split():
  tab.append(t)
print(tab)
print(set(tab))

['1', 'je', 'suis', 'etudiant', 'je', 'suis']
{'suis', '1', 'je', 'etudiant'}


## 2. Metrics

bAbI task 1 expects single-word answers drawn from a small set of possible answers.
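
As a concrete reference point for thinking about metrics, here is a minimal sketch of two common ones: exact-match accuracy and a token-level F1 score. The function names and the toy predictions are assumptions made for illustration; they are not ParlAI's metric implementation.

```python
from collections import Counter


def exact_match_accuracy(predictions, labels):
    """Fraction of predictions equal to their label (case-insensitive)."""
    pairs = list(zip(predictions, labels))
    correct = sum(p.strip().lower() == l.strip().lower() for p, l in pairs)
    return correct / len(pairs)


def token_f1(prediction, label):
    """Harmonic mean of token precision and recall between two strings."""
    p_tokens = prediction.lower().split()
    l_tokens = label.lower().split()
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(p_tokens) & Counter(l_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(l_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy predictions and labels (assumption: made up for the example).
preds = ["office", "kitchen", "garden", "hallway"]
labels = ["office", "bathroom", "garden", "bedroom"]
print("accuracy:", exact_match_accuracy(preds, labels))
print("F1 for a partially correct answer:", token_f1("the office", "office"))
```

Exact match gives no credit to the answer "the office" when the label is "office", whereas token F1 gives partial credit; which behaviour is preferable depends on the task.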


---
**Question 2**  
- **2.a.** Which metrics do you think are appropriate for evaluating a model on this task?   
- **2.b.** What are their respective strengths?  
- **2.c.** When do they fail? (find specific examples)  


*ANSWER IN GOOGLE FORM* 




---

## 3. Baseline



We now have a clearer idea of the data distribution and the metrics that we can use.  
The next step is to start solving the tasks with a simple baseline. This will allow us to compare more elaborate models against this baseline.  
Here are a few classical baselines:
- **Random model**: The model answers randomly among the set of possible answers for each question
-  **Majority class**: The model always answers with the most frequent answer in the training set (majority class)
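
The two baselines above can be sketched in plain Python, independently of ParlAI. The toy answer lists and the function names below are assumptions made for illustration; the ParlAI agents still have to be written in the cells that follow.

```python
import random
from collections import Counter

# Toy data (assumption: made up, not the real bAbI answers).
train_answers = ["office", "kitchen", "office", "garden", "office", "kitchen"]
test_answers = ["office", "garden", "kitchen", "office"]

# The set of possible answers and the majority class come from the training set.
candidates = sorted(set(train_answers))
majority_answer = Counter(train_answers).most_common(1)[0][0]


def random_baseline(rng):
    """Pick one of the possible answers uniformly at random."""
    return rng.choice(candidates)


def majority_baseline():
    """Always return the most frequent training answer."""
    return majority_answer


def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


rng = random.Random(0)  # fixed seed so that a run can be reproduced
random_preds = [random_baseline(rng) for _ in test_answers]
majority_preds = [majority_baseline() for _ in test_answers]
print("random baseline accuracy:", accuracy(random_preds, test_answers))
print("majority baseline accuracy:", accuracy(majority_preds, test_answers))
```

Changing the seed changes the random baseline's score, while the majority baseline is deterministic once the training set is fixed.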

We are going to implement these baselines ourselves.  
Implementing a new model in ParlAI is detailed in the [tutorial](http://parl.ai/static/docs/seq2seq_tutorial.html), but for our simple baselines we will only need to inherit from the [Agent](https://github.com/facebookresearch/ParlAI/blob/6d246842d3f4e941dd3806f3d9fa62f607d48f59/parlai/core/agents.py#L50) class and override the `act()` method.

---
**Question 3**  
- **3.a.** What would be the accuracy of a model that chooses a random answer among the set of possible answers for each question? 

*ANSWER IN GOOGLE FORM*

---

*Note: the `%%writefile` magic command in jupyter writes the content of the cell to a file at the given path.*

In [None]:
!mkdir -p ~/ParlAI/parlai/agents/baseline/
!touch ~/ParlAI/parlai/agents/baseline/random.py
!touch ~/ParlAI/parlai/agents/baseline/majorityclass.py

- **3.b.** Design a baseline that answers with a random word from the set of possible answers (run it multiple times to observe the variance in results).

In [None]:
%%writefile ~/ParlAI/parlai/agents/baseline/random.py
# FILL THIS CELL (the %%writefile magic has to be the first line of the cell)
import random

from parlai.core.agents import Agent


class RandomAgent(Agent):
  
    def act(self):
        # FILL CODE HERE

In [None]:
!python3 ~/ParlAI/examples/eval_model.py -t babi:task10k:1 -m baseline/random | grep accuracy -A 1

In [None]:
!python3 ~/ParlAI/examples/display_model.py -t babi:task10k:1 -m baseline/random -n 10 

- **3.c.** Design a baseline that always answers with the most common answer (majority class baseline).

In [None]:
%%writefile ~/ParlAI/parlai/agents/baseline/majorityclass.py
# FILL THIS CELL (the %%writefile magic has to be the first line of the cell)
import random

from parlai.core.agents import Agent


class MajorityclassAgent(Agent):
  
    def act(self):
        # FILL CODE HERE

In [None]:
!python3 ~/ParlAI/examples/eval_model.py -t babi:task10k:1 -m baseline/majorityclass | grep accuracy -A 1

In [None]:
!python3 ~/ParlAI/examples/display_model.py -t babi:task10k:1 -m baseline/majorityclass -n 10

---
- **3.d.**  In which cases would the majority class baseline be better than the random baseline?

*ANSWER IN GOOGLE FORM*

---

Another, slightly more advanced baseline is already implemented in ParlAI: the information retrieval baseline (`ir_baseline`).
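
To give a feel for what "information retrieval" means here, the sketch below ranks candidate answers by the words they share with the dialog context, giving rarer words more weight. It is a simplified illustration and not ParlAI's code: `tokenize`, `rank_by_overlap` and the weighting formula are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]


def rank_by_overlap(query, candidates, word_freq):
    """Sort candidates by the summed rarity of the words they share with the query."""
    query_words = set(tokenize(query))

    def score(candidate):
        shared = query_words & set(tokenize(candidate))
        # Rare words count more than frequent ones (hypothetical weighting).
        return sum(1.0 / (1.0 + math.log(1 + word_freq[w])) for w in shared)

    # sorted() is stable, so candidates with equal scores keep their input order.
    return sorted(candidates, key=score, reverse=True)


context = "Mary went to the bathroom. John moved to the hallway. Where is Mary?"
candidates = ["bathroom", "hallway", "garden"]
word_freq = Counter(tokenize(context))
ranked = rank_by_overlap(context, candidates, word_freq)
print(ranked)
```

In this toy context both "bathroom" and "hallway" appear once, so word overlap alone cannot tell them apart; only "garden", which never appears, is clearly ranked last.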

---
- **3.e.** Look at the [implementation](https://github.com/facebookresearch/ParlAI/blob/53ea58acf389bffc79c85c43bcdd848eecdcecb4/parlai/agents/ir_baseline/ir_baseline.py#L211) of the IR baseline and explain in a few lines how it works (*hint: look at the following methods: `act()`, `rank_candidates()` and `score_match()`*)  

*ANSWER IN GOOGLE FORM*


---

- **3.f.** Use the IR baseline and compare it with one of your baselines (random and/or majority) on bAbI tasks 1, 2 and 3.  
    (*hint: you can use the `!python3 ... -t babi:task10k:{i+1}` syntax to substitute the task number in a bash command from jupyter*)


In [None]:
# FILL THIS CELL
for i in range(3):
    print(f'~ Task {i+1} ~')
    # FILL CODE HERE

## 4. More elaborate models



We can now move on to more elaborate models and evaluate their performance relative to the baselines.
We will use the `~/ParlAI/examples/train_model.py` script. Let's first take a look at its arguments:

In [None]:
!python3 ~/ParlAI/examples/train_model.py --help

We can train two types of models:
- **Generative models**: The model generates an answer from its vocabulary.
- **Ranking models**: The model is given a list of possible answers and has to choose the correct one. This is much easier for the model, since the list of possible answers is often much smaller than the vocabulary.


### Generative model: seq2seq with attention

The generative model we are going to train is a sequence to sequence model with attention based on [Sutskever et al. 2014](https://arxiv.org/abs/1409.3215) and [Bahdanau et al. 2014](https://arxiv.org/abs/1409.0473).
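
As a rough illustration of the attention idea, the sketch below computes attention weights and a context vector for a single decoder step. It assumes simple dot-product scoring, which is a simplification: Bahdanau et al. use an additive, learned scoring function. The vectors are made-up toy values and `dot_attention` is a hypothetical name, not a ParlAI function.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step."""
    # One score per encoder position: similarity with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context


# Toy encoder states for a three-token input (assumption: made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_states)
print("attention weights:", weights)
print("context vector:", context)
```

The weights sum to one, and the encoder positions most similar to the current decoder state contribute most to the context vector.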
      
- **4.a.** Briefly explain how attention works in sequence to sequence neural networks.
- **4.b.** Do you think attention is useful for the bAbI tasks? How would you verify it experimentally?

*ANSWER IN GOOGLE FORM*

---
- **4.c.** Train a seq2seq on bAbI task 1 (10k) and compare its results to the baselines.
   (*hint: for faster training use the following arguments: `--batchsize 32 --numthreads 1 --num-epochs 5 --hiddensize 64 --embeddingsize 64 --numlayers 1 --decoder shared`*)


In [None]:
# FILL THIS CELL

In [None]:
!python3 ~/ParlAI/examples/display_model.py --task babi:task10k:1 --model seq2seq --model-file /tmp/babi_s2s

### Ranking model: memory network

We saw in class that Memory Networks ([Sukhbaatar et al. '15](https://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf)) rely on an explicit memory "database". This is especially well suited to tasks where a few useful memories are "hidden" among distractor memories.  
This type of network therefore works especially well on the bAbI tasks, using the previous dialog turns as memories and the question as the query.  
Here is an illustration of how a memory network works:

![Memory Network schema](https://raw.githubusercontent.com/louismartin/ammi-2019-bordes-DeepNLP/master/lab1/memory_network.png)
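
The memory read operation can be sketched numerically as follows. This is a toy illustration that assumes the formulation of the end-to-end memory network paper (attention over input memories, a weighted sum of output memories, then the result added to the query). The vectors are made up, the same memories are reused at every hop for simplicity, and the names may differ from the notation of the figure.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(query, input_memories, output_memories):
    """One hop: attend over the memories, then update the query."""
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in input_memories]
    probs = softmax(scores)
    # Read vector: attention-weighted sum of the output memories.
    read = [
        sum(p * c[i] for p, c in zip(probs, output_memories))
        for i in range(len(query))
    ]
    # The next query is the current query plus what was read from memory.
    return [q + o for q, o in zip(query, read)], probs


def run_hops(query, input_memories, output_memories, num_hops):
    attention_per_hop = []
    for _ in range(num_hops):
        query, probs = memory_hop(query, input_memories, output_memories)
        attention_per_hop.append(probs)
    return query, attention_per_hop


# Toy embeddings (assumption: made-up values for two memories).
input_memories = [[1.0, 0.0], [0.0, 1.0]]
output_memories = [[0.0, 1.0], [1.0, 0.0]]
question = [1.0, 0.0]
final_query, attention = run_hops(question, input_memories, output_memories, num_hops=2)
print("attention per hop:", attention)
print("final query:", final_query)
```

Because the query changes after each hop, the attention distribution of the second hop can differ from the first one, which is what lets the network chain several supporting facts.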


---
**Question 4**  
- **4.d.** Explain how hops work in a memory network (either with words or formulas using the notations of the above figure)
- **4.e.** How can a memory network be used to rank multiple candidates?  
  (*hint: you can look at the [implementation](https://github.com/facebookresearch/ParlAI/blob/6bd0e58692b3fd3a13b5f654944525ac1b7cd8e3/parlai/agents/memnn/modules.py#L22) of the memory network in ParlAI and especially the `_score()` method. Recall how the IR baseline worked.*)
  
*ANSWER IN GOOGLE FORM*
  
 
---


- **4.f.** Using the ParlAI implementation, train a memory network on bAbI tasks 1, 2 and 3 (10k) and compare its results with the baselines.  
   (*hint: use 1 thread, a batch size of 32 and 5 epochs*)


In [None]:
# FILL CELL
for i in range(3):
    print(f'~ Task {i+1} ~')
    # FILL CODE HERE

## 5. To go further

If you want to go further you can try to do the following:

- Retrieve and plot the attention of the memory network for the different hops along the memories.
- For the seq2seq model, can you plot the training loss? The validation loss? Both on the same plot?
- Can you show an example of overfitting?
- Adapt the seq2seq model for ranking using the [torch ranker tutorial](http://www.parl.ai/static/docs/tutorial_torch_ranker_agent.html)
- Try multitasking on bAbI and SQuAD. Does it improve the performance? (This will require more GPU power than what is available in Google Colab.)
- You can play around with other models and other tasks
- Try interfacing ParlAI with [messenger](http://www.parl.ai/static/docs/tutorial_messenger.html)