# Deep Learning Project - DrQA

This is a reproducibility challenge: we attempt to reproduce the methods, models,
and results presented in the following paper:

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. arXiv preprint arXiv:1704.00051.

The paper considers the problem of answering factoid questions in an open-domain setting using Wikipedia as the unique knowledge source, much as one does when looking up answers in an encyclopedia. Unlike knowledge bases (KBs), which are easier for computers to process but too sparsely populated for open-domain question answering (Miller et al., 2016), Wikipedia contains up-to-date knowledge that humans are interested in. It is designed, however, for humans, not machines, to read. Using Wikipedia articles as the knowledge source therefore combines the challenges of large-scale open-domain question answering with those of machine comprehension of text.

The authors' approach is generic and could be applied to other collections of documents, books, or even daily updated newspapers. It contrasts with large-scale question answering systems that answer by combining multiple sources (KBs, dictionaries, news articles, books, etc.) and rely on information redundancy among those sources to answer correctly. Having a single knowledge source forces the model to be very precise while searching for an answer, as the evidence might appear only once.

The system developed in the paper, DrQA, is a strong question answering system for Wikipedia composed of: (1) Document Retriever, a module using bigram hashing and TF-IDF matching that, given a question, efficiently returns a subset of relevant articles, and (2) Document Reader, a multi-layer recurrent neural network machine comprehension model trained to detect answer spans in those few returned documents.
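
To make the Document Retriever idea more concrete, here is a minimal sketch of TF-IDF matching over hashed unigrams and bigrams. It is not the authors' implementation: `zlib.crc32` stands in for the murmur3 hash described in the paper, the bucket count is much smaller than the paper's 2^24 bins, the smoothed IDF formula is one common variant, and all function names (`ngram_buckets`, `build_index`, `retrieve`) are hypothetical.

```python
import math
import re
import zlib
from collections import Counter

# Assumption: a small table for the demo; the paper hashes into 2**24 bins.
NUM_BUCKETS = 2 ** 16


def ngram_buckets(text):
    """Hash every unigram and bigram of `text` to a bucket id and count them."""
    tokens = re.findall(r"\w+", text.lower())
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # zlib.crc32 is a stand-in for the murmur3 hash used in the paper.
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def build_index(docs):
    """Return one sparse TF-IDF vector (bucket -> weight) per document and the IDF table."""
    counts = [ngram_buckets(doc) for doc in docs]
    doc_freq = Counter(bucket for c in counts for bucket in c)
    n_docs = len(docs)
    # Smoothed IDF, always positive because df <= n_docs.
    idf = {b: math.log((n_docs + 1) / (df + 0.5)) for b, df in doc_freq.items()}
    vectors = [{b: math.log1p(tf) * idf[b] for b, tf in c.items()} for c in counts]
    return vectors, idf


def retrieve(question, vectors, idf, k=1):
    """Return the indices of the k documents with the highest dot-product score."""
    q_counts = ngram_buckets(question)
    q_vec = {b: math.log1p(tf) * idf.get(b, 0.0) for b, tf in q_counts.items()}
    scores = [sum(w * doc.get(b, 0.0) for b, w in q_vec.items()) for doc in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


docs = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Python is a programming language.",
]
vectors, idf = build_index(docs)
print(retrieve("What is the capital of France?", vectors, idf))
```

Hashing keeps the feature table at a fixed size regardless of how many distinct bigrams the corpus contains, at the cost of occasional collisions between unrelated n-grams.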

In this reproduction we will focus mostly on the Document Reader, since the Document Retriever, though interesting and thought-provoking, is not relevant to the course material. For retrieval we will use either the authors' implementation or a standard search API, depending on the dataset. If time allows, we will try implementing it ourselves, since we deem retrieving and feeding information to the network an integral part of deep learning.

We will start the project by implementing the Document Reader and training it on the SQuAD dataset (Rajpurkar et al., 2016). We can then test it on the different open-domain question answering datasets used for benchmarking text comprehension systems.

**DrQA Implementation**


In [None]:
!bash download.sh

In [None]:
import torch
import numpy as np
import pandas as pd
import re, os, string, typing, gc, json, unicodedata, time
import spacy
from collections import Counter
import torchtext
from torch import nn
import torch.nn.functional as F
from nltk import word_tokenize
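# The 'en' shortcut only resolves on spaCy 2.x; on spaCy 3+ load a named
# pipeline instead, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')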
from utilis import *
from model import *
from SquadDS import *

In [None]:
# load SQuAD dataset json files

train_data = load_json('./SQuAD/train-v1.1.json')
valid_data = load_json('./SQuAD/dev-v1.1.json')

# parse the json structure to return the data as a list of dictionaries

train_list = parse_data(train_data)
valid_list = parse_data(valid_data)

# converting the lists into dataframes

train_df = pd.DataFrame(train_list)
valid_df = pd.DataFrame(valid_list)
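
The `parse_data` helper lives in a module that is not shown here. As a rough illustration of what flattening the SQuAD v1.1 JSON involves, the sketch below walks the `data` → `paragraphs` → `qas` → `answers` nesting of the official files. The function name `parse_squad` and the output keys `answer` and `label` are assumptions, not necessarily what the notebook's helper produces.

```python
def parse_squad(data):
    """Flatten SQuAD v1.1 JSON into one dict per (question, answer) pair."""
    rows = []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    rows.append({
                        "id": qa["id"],
                        "context": context,
                        "question": qa["question"],
                        "answer": answer["text"],
                        # Character offsets of the answer inside the context.
                        "label": [start, start + len(answer["text"])],
                    })
    return rows


# Tiny in-memory example with the same nesting as the real files.
sample = {
    "data": [{
        "title": "Demo",
        "paragraphs": [{
            "context": "DrQA was released in 2017.",
            "qas": [{
                "id": "q1",
                "question": "When was DrQA released?",
                "answers": [{"text": "2017", "answer_start": 21}],
            }],
        }],
    }]
}
rows = parse_squad(sample)
print(rows[0]["question"], "->", rows[0]["answer"])
```

A list of flat dictionaries like this converts directly into a DataFrame with `pd.DataFrame(rows)`.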

In [None]:
train_df.context = train_df.context.apply(normalize_spaces)
valid_df.context = valid_df.context.apply(normalize_spaces)

In [None]:
vocab_text = gather_text_for_vocab([train_df, valid_df])

word2idx, idx2word, word_vocab = build_word_vocab(vocab_text)

train_df['context_ids'] = train_df.context.apply(context_to_ids, word2idx=word2idx)
valid_df['context_ids'] = valid_df.context.apply(context_to_ids, word2idx=word2idx)

train_df['question_ids'] = train_df.question.apply(question_to_ids,  word2idx=word2idx)
valid_df['question_ids'] = valid_df.question.apply(question_to_ids,  word2idx=word2idx)
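
The vocabulary helpers used above are defined elsewhere. The following sketch shows one plausible way to build the word-to-index mapping and convert text to ids. It is an assumption-laden stand-in: whitespace splitting replaces the spaCy tokenizer, the `<pad>` and `<unk>` entries and their positions are choices made for the example, and `build_vocab` and `text_to_ids` are hypothetical names.

```python
from collections import Counter


def build_vocab(texts, tokenize=str.split):
    """Return (word2idx, idx2word) with the most frequent words first."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Assumption: index 0 is reserved for padding and index 1 for unknown words.
    idx2word = ["<pad>", "<unk>"] + [word for word, _ in counts.most_common()]
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return word2idx, idx2word


def text_to_ids(text, word2idx, tokenize=str.split):
    """Map each token to its index, falling back to <unk> for unseen words."""
    return [word2idx.get(tok, word2idx["<unk>"]) for tok in tokenize(text)]


word2idx_demo, idx2word_demo = build_vocab(["the cat sat", "the dog"])
print(text_to_ids("the bird", word2idx_demo))
```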

In [None]:
# get indices with tokenization errors and drop those indices 

train_err = get_error_indices(train_df, idx2word)
valid_err = get_error_indices(valid_df, idx2word)

train_df.drop(train_err, inplace=True)
valid_df.drop(valid_err, inplace=True)

In [None]:
# get start and end positions of answers from the context
# this is basically the label for training QA models

train_label_idx = train_df.apply(index_answer, axis=1, idx2word=idx2word)
valid_label_idx = valid_df.apply(index_answer, axis=1, idx2word=idx2word)

train_df['label_idx'] = train_label_idx
valid_df['label_idx'] = valid_label_idx
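
SQuAD gives answer positions as character offsets, while the reader predicts token positions, so the labels have to be converted. The sketch below illustrates that conversion and why some rows fail it: when an answer boundary falls inside a token, no token span matches. A regex tokenizer stands in for spaCy, and both function names are hypothetical rather than the notebook's `index_answer` or `get_error_indices`.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, start_char, end_char) triples; a regex stands in for spaCy."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(context, char_start, char_end):
    """Map a character span to inclusive (start_token, end_token), or None if misaligned."""
    start_tok = end_tok = None
    for i, (_, start, end) in enumerate(tokenize_with_offsets(context)):
        if start == char_start:
            start_tok = i
        if end == char_end:
            end_tok = i
    if start_tok is None or end_tok is None:
        return None  # an answer boundary falls inside a token
    return start_tok, end_tok


context = "DrQA was released in 2017."
print(char_span_to_token_span(context, 21, 25))  # span covering "2017"
print(char_span_to_token_span(context, 22, 25))  # starts mid-token
```

Rows for which the mapping returns `None` are the kind of tokenization errors that are dropped before training.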

In [None]:
train_dataset = SquadDataset(train_df, 32)
valid_dataset = SquadDataset(valid_df, 32)
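
`SquadDataset` is defined in a separate module; its second argument appears to be the batch size. Because contexts and questions vary in length, each batch has to be padded to a common length before it can be stacked into a tensor. The generator below is a dependency-free sketch of that step, under the assumption that index 0 is the padding id; `iter_padded_batches` is a hypothetical name.

```python
def iter_padded_batches(sequences, batch_size, pad_id=0):
    """Yield (padded_batch, lengths) for consecutive slices of `sequences`."""
    for i in range(0, len(sequences), batch_size):
        chunk = sequences[i:i + batch_size]
        max_len = max(len(seq) for seq in chunk)
        # Right-pad every sequence in the chunk to the longest one.
        padded = [seq + [pad_id] * (max_len - len(seq)) for seq in chunk]
        yield padded, [len(seq) for seq in chunk]


for batch, lengths in iter_padded_batches([[5, 6, 7], [8], [9, 10]], batch_size=2):
    print(batch, lengths)
```

Keeping the original lengths alongside the padded batch makes it possible to build a mask so that padded positions can be ignored.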

In [None]:
glove_dict = create_glove_matrix()

In [None]:
weights_matrix, words_found = create_word_embedding(glove_dict, word_vocab)
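
For readers without access to the helper module, the sketch below shows the general shape of the two GloVe steps: parsing the text format (one word followed by its vector components per line) and building a matrix with one row per vocabulary word. It uses plain lists to stay dependency-free, the random initialisation for missing words is an assumption, and `load_glove` and `build_embedding_matrix` are hypothetical names.

```python
import io
import random


def load_glove(lines):
    """Parse GloVe text format: a word followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, glove, dim, seed=0):
    """Return (matrix, found): one row per vocab word and the number of GloVe hits."""
    rng = random.Random(seed)
    matrix, found = [], 0
    for word in vocab:
        if word in glove:
            matrix.append(glove[word])
            found += 1
        else:
            # Assumption: words missing from GloVe get a small random vector.
            matrix.append([rng.gauss(0.0, 0.1) for _ in range(dim)])
    return matrix, found


# A two-line in-memory file replaces the real GloVe download.
glove_demo = load_glove(io.StringIO("the 0.1 0.2\ncat 0.3 0.4\n"))
matrix_demo, found_demo = build_embedding_matrix(["<pad>", "the", "dog"], glove_demo, dim=2)
print(found_demo, len(matrix_demo))
```

In a PyTorch model, a matrix like this would typically be converted to a tensor and copied into an `nn.Embedding` layer.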

In [None]:
# fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
HIDDEN_DIM = 128
EMB_DIM = 300
NUM_LAYERS = 3
NUM_DIRECTIONS = 2
DROPOUT = 0.3

model = DocumentReader(HIDDEN_DIM,
                       EMB_DIM, 
                       NUM_LAYERS, 
                       NUM_DIRECTIONS, 
                       DROPOUT, 
                       device, glove_dict, word_vocab).to(device)

In [None]:
optimizer = torch.optim.Adamax(model.parameters())

In [None]:
train_losses = []
valid_losses = []
ems = []
f1s = []
epochs = 40

for epoch in range(epochs):
    print(f"Epoch {epoch+1}")
    
    start_time = time.time()
    
    train_loss = train(model, train_dataset, device, optimizer)
    valid_loss, em, f1 = valid(model, valid_dataset, device, idx2word)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    ems.append(em)
    f1s.append(f1)
    
    print(f"Epoch train loss: {train_loss} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"Epoch valid loss: {valid_loss}")
    print(f"Epoch EM: {em}")
    print(f"Epoch F1: {f1}")
    print("====================================================================================")
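
The loop above reports exact match (EM) and F1, which are computed inside the `valid` helper. The sketch below shows how these two metrics are commonly computed for a single prediction, in the style of the official SQuAD evaluation script: both strings are normalised, then EM checks equality and F1 measures token overlap. Whether the notebook's helper follows exactly these steps is an assumption.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))
print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.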
    