Credits: the provided initial code is an adaptation of the [Starter code for Stanford CS224n default final project on SQuAD 2.0](https://github.com/chrischute/squad) which is shared under MIT License. 

This notebook does initial preprocessing for the SberQuAD dataset and will give you the starting point in this assignment. If it looks too complex and/or time/resourse-expensive, you may stick to homework05 as well.

In [3]:
from google.colab import drive
drive.mount('/drive/')


Drive already mounted at /drive/; to attempt to forcibly remount, call drive.mount("/drive/", force_remount=True).


In [4]:
!ls "/drive/My Drive/datasets/data"
!ls "/drive/My Drive/datasets"

char2idx.json  ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec
char_emb.json  train_eval.json
dev_eval.json  train.npz
dev_meta.json  train-v1.1.json
dev.npz        word2idx.json
dev-v1.1.json  word_emb.json
args.py    models.py					       setup.py
data	   __pycache__					       test.py
layers.py  README.md					       train.py
LICENSE    save						       util.py
model	   SberQuAD_preprocessing_and_problem_statement.ipynb  vocab.txt


### 1. Preprocessing
This code is a bit changed version of the code from `setup.py`. If you want to work with the SQuAD dataset, stick to the original instructions from the https://github.com/chrischute/squad repository.

In [5]:
# If running on Colab, uncomment the following lines 

!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/args.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/layers.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/models.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/setup.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/test.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/train.py -nc
!wget https://raw.githubusercontent.com/OlegPozovnoy/MailRuNLP/HW4/util.py -nc

File ‘args.py’ already there; not retrieving.

File ‘layers.py’ already there; not retrieving.

File ‘models.py’ already there; not retrieving.

File ‘setup.py’ already there; not retrieving.

File ‘test.py’ already there; not retrieving.

File ‘train.py’ already there; not retrieving.

File ‘util.py’ already there; not retrieving.



In [6]:
# If running on Colab, uncomment the following lines 

!pip install ujson
!pip install tensorboardX
!pip install pymorphy2==0.8



In [7]:
"""Train a model on SQuAD.

Author:
    Chris Chute (chute@stanford.edu)
"""

import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as sched
import torch.utils.data as data
import util

from args import get_train_args
from collections import OrderedDict
from json import dumps
from models import BiDAF
from tensorboardX import SummaryWriter
from tqdm import tqdm
from ujson import load as json_load
from util import collate_fn, SQuAD

In [8]:
from pathlib import Path
#Path("./data").mkdir(parents=True, exist_ok=True)
#Path("./save").mkdir(parents=True, exist_ok=True)

Downloading the SberQuAD data

In [9]:
#!wget http://files.deeppavlov.ai/datasets/sber_squad_clean-v1.1.tar.gz -nc -O ./data/sber_squad_clean-v1.1.tar.gz

In [10]:
#! tar -xzvf ./data/sber_squad_clean-v1.1.tar.gz
#! mv train-v1.1.json data
#! mv dev-v1.1.json data

Downloading the word vectors (this may take a while)

In [11]:
#! wget http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec -nc -O ./data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec

And finally the preprocessing for the SberQuAD dataset:

In [12]:
prefix = '/drive/My Drive/datasets'
train_file = prefix + '/data/train-v1.1.json'
dev_file = prefix + '/data/dev-v1.1.json'
glove_file = prefix + '/data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec'

In [13]:
from setup import *

In [14]:
# Uncomment this cell if needed
!pip install spacy



In [15]:
nlp = spacy.blank("ru")

The following cell may take a while (usually 10 minutes or less).

In [16]:
# Process training set and use it to decide on the word/character vocabularies
word_counter, char_counter = Counter(), Counter()
train_examples, train_eval = process_file(train_file, "train", word_counter, char_counter, nlp)
word_emb_mat, word2idx_dict = get_embedding(
    word_counter, 'word', emb_file=glove_file, vec_size=300, num_vectors=1560132)
char_emb_mat, char2idx_dict = get_embedding(
    char_counter, 'char', emb_file=None, vec_size=64)




Pre-processing train examples...


100%|██████████| 1/1 [02:41<00:00, 161.35s/it]
  0%|          | 0/1560132 [00:00<?, ?it/s]

45328 questions in total
Pre-processing word vectors...


100%|██████████| 1560132/1560132 [03:15<00:00, 7984.09it/s]


135451 / 156143 tokens have corresponding word embedding vector
Pre-processing char vectors...
701 tokens have corresponding char embedding vector


In [17]:
dev_examples, dev_eval = process_file(dev_file, "dev", word_counter, char_counter, nlp)

Pre-processing dev examples...


100%|██████████| 1/1 [00:16<00:00, 16.11s/it]

5036 questions in total





In [18]:
!pip uninstall -y tensorflow tensorflow-gpu
!pip install numpy scipy librosa unidecode inflect librosa transformers
!pip install deeppavlov
!python -m deeppavlov install squad_ru_rubert

Uninstalling tensorflow-1.15.2:
  Successfully uninstalled tensorflow-1.15.2
2020-07-07 13:34:54.80 INFO in 'deeppavlov.core.common.file'['file'] at line 32: Interpreting 'squad_ru_rubert' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/squad/squad_ru_rubert.json'
Collecting git+https://github.com/deepmipt/bert.git@feat/multi_gpu
  Cloning https://github.com/deepmipt/bert.git (to revision feat/multi_gpu) to /tmp/pip-req-build-wicov_23
  Running command git clone -q https://github.com/deepmipt/bert.git /tmp/pip-req-build-wicov_23
  Running command git checkout -b feat/multi_gpu --track origin/feat/multi_gpu
  Switched to a new branch 'feat/multi_gpu'
  Branch 'feat/multi_gpu' set up to track remote branch 'feat/multi_gpu' from 'origin'.
Building wheels for collected packages: bert-dp
  Building wheel for bert-dp (setup.py) ... [?25l[?25hdone
  Created wheel for bert-dp: filename=bert_dp-1.0-cp36-none-any.whl size=23581 sha256=aabff2af43756126d7534c5cfa37671e89b5966aa7092

In [19]:

from deeppavlov import build_model, configs
print('Start model load')
model_ru = build_model(configs.squad.squad_ru_rubert, download=True)
print('End model load')



Start model load


2020-07-07 13:35:48.8 INFO in 'deeppavlov.download'['download'] at line 132: Skipped http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz download because of matching hashes
2020-07-07 13:36:07.958 INFO in 'deeppavlov.download'['download'] at line 132: Skipped http://files.deeppavlov.ai/deeppavlov_data/squad_model_ru_rubert.tar.gz download because of matching hashes
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!












The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.


Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.





Instructions for updating:
Use standard file APIs to check for files with this prefix.


2020-07-07 13:36:36.607 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 51: [loading model from /root/.deeppavlov/models/squad_ru_bert/model_rubert]



INFO:tensorflow:Restoring parameters from /root/.deeppavlov/models/squad_ru_bert/model_rubert
End model load


In [29]:
i=0
contexts = []
questions = []
result = {}
for k, v in dev_eval.items():
  if i%25==0:
    ans = model_ru(contexts,questions)
    if i>0:
      for s in range(25):
        result[str(i-25+s)] = [ans[1], ans[2]]
    print(model_ru(contexts,questions))
    contexts = []
    questions = []
  else:
    contexts.append(dev_eval[k]['context'])
    questions.append(dev_eval[k]['question'])
  i=i+1


print(dev_eval['1'])

model_ru.pipe[2][2].bert.pooled_output

[[], [], []]
[['потоотделением', 'структуру хромосом', 'в Лондоне', 'в режиме офлайн', 'не находит', 'нервным и гнетущим', '1147 человек', 'пункт 2 статьи 434', 'купцами-новгородцами', 'в Европе, а затем и во всем мире', 'Операционные системы, следующие стандарту POSIX или опирающиеся на него, называют POSIX-совместимыми', 'четыре раза подряд', 'Игорь Дедков', 'на участках, где появляется вода', 'выступление на вечере талантов школы West', 'при 3000 K', 'поездку на поля Ватерлоо', 'трубка из кожи, набитая шерстью, или рулон нот', 'об убийстве, разбое и воровстве с поличным', 'значительно снизиться', 'Наш долгий национальный кошмар окончился', 'церковнославянский', 'нигилизм', 'Технические'], [132, 74, 103, 455, 368, 232, 21, 558, 245, 721, 0, 635, 140, 138, 409, 522, 428, 300, 318, 339, 383, 339, 202, 0], [2505075.75, 7764097.0, 3335143.25, 1839220.5, 170736.109375, 2318677.5, 1711473.5, 106775.0703125, 728885.125, 50803.4296875, 0.9381993412971497, 181829.125, 798465.0, 71.35090637207

KeyboardInterrupt: ignored

In [21]:

#from deeppavlov.models.bert.bert_squad import BertSQuADInferModel
#new_model = ((model_ru.pipe[0][2]))

print(model_ru([dev_eval['1']['context']],[dev_eval['1']['question']]))
print(dir(model_ru.pipe[2][2]))
print(model_ru.pipe[2][2].start_probs)


[['лихорадкой'], [28], [327802.53125]]
['__abstractmethods__', '__annotations__', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_build_feed_dict', '_clip_norm', '_fit_batch_size', '_fit_beta', '_fit_learning_rate', '_fit_learning_rate_div', '_fit_max_batches', '_fit_min_batches', '_get_best', '_get_saveable_variables', '_get_trainable_variables', '_init_graph', '_init_learning_rate_variable', '_init_momentum_variable', '_init_optimizer', '_init_placeholders', '_learning_rate_cur_div', '_learning_rate_cur_impatience', '_learning_rate_drop_div', '_learning_rate_drop_patience', '_learning_rate_last_i

In [39]:
from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor 

preprocessor = model_ru.pipe[0][2]
model = model_ru.pipe[2][2]
features = preprocessor([dev_eval['6']['context'],dev_eval['6']['question']])
model(features)
print(dir(model.start_pred))
print(model.end_pred)

print(model_ru([dev_eval['6']['context']],[dev_eval['6']['question']]))

['OVERLOADABLE_OPERATORS', '_USE_EQUALITY', '__abs__', '__add__', '__and__', '__array__', '__array_priority__', '__bool__', '__class__', '__copy__', '__delattr__', '__dict__', '__dir__', '__div__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__invert__', '__iter__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_as_node_def_input', '_as_tf_output', '_c_api_shape', '_consumers', '_disallow_bool_casting', '_disallow_in_graph_mode', '_disallow_iteration', '_disallow_when_autog

In [57]:
from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor  
from deeppavlov.models.preprocessors.squad_preprocessor import SquadBertMappingPreprocessor  
from deeppavlov.models.bert.bert_squad import BertSQuADModel 
from deeppavlov.models.preprocessors.squad_preprocessor import SquadBertAnsPostprocessor


bert_preprocessor =  model_ru.pipe[0][2]
squad_preprocessor = model_ru.pipe[1][2]
bert_squad = model_ru.pipe[2][2]
squad_preprocessor = model_ru.pipe[3][2]

question_raw, context_raw =  dev_eval['6']['question'], dev_eval['6']['context']

bert_features = bert_preprocessor([question_raw], [context_raw]) 
  #subtok2chars, char2subtoks = squad_preprocessor([context_raw], bert_features) 
ans_start_predicted, ans_end_predicted, logits, score = bert_squad(bert_features)
#print(dir(bert_squad)) 
#print((bert_squad.start_probs.eval(session=bert_squad.sess))) 
  #ans_predicted, ans_start_predicted, ans_end_predicted = squad_preprocessor(ans_start_predicted, ans_end_predicted, context_raw, bert_features, subtok2chars)
print(ans_start_predicted, ans_end_predicted, logits, score)

[81] [82] [170735.953125] [1.0]


In [28]:
print(model_ru.pipe)

[(([], ['question_raw', 'context_raw']), ['bert_features'], <deeppavlov.models.preprocessors.bert_preprocessor.BertPreprocessor object at 0x7f997ee740b8>), (([], ['context_raw', 'bert_features']), ['subtok2chars', 'char2subtoks'], <deeppavlov.models.preprocessors.squad_preprocessor.SquadBertMappingPreprocessor object at 0x7f997bd646d8>), (([], ['bert_features']), ['ans_start_predicted', 'ans_end_predicted', 'logits', 'score'], <deeppavlov.models.bert.bert_squad.BertSQuADModel object at 0x7f997bd64908>), (([], ['ans_start_predicted', 'ans_end_predicted', 'context_raw', 'bert_features', 'subtok2chars']), ['ans_predicted', 'ans_start_predicted', 'ans_end_predicted'], <deeppavlov.models.preprocessors.squad_preprocessor.SquadBertAnsPostprocessor object at 0x7f997bd0d160>)]


Now we have the preprocessed data:

In [None]:
train_record_file = prefix + '/data/train.npz'
dev_record_file = prefix + '/data/dev.npz'

In [None]:
from args import add_common_args, get_setup_args

In [None]:
# Retreiving the default arguments for the preprocessing script
_args = get_setup_args(bypass=True)

In [None]:
_args
_args.answer_file=prefix + '/data/answer.json'
_args.char2idx_file=prefix + '/data/char2idx.json' 
_args.char_emb_file=prefix + '/data/char_emb.json'
_args.dev_eval_file=prefix + '/data/dev_eval.json'
_args.dev_meta_file=prefix + '/data/dev_meta.json'
_args.dev_record_file=prefix + '/data/dev.npz'
_args.test_eval_file=prefix + '/data/test_eval.json'
_args.test_meta_file=prefix + '/data/test_meta.json'
_args.test_record_file=prefix + '/data/test.npz'
_args.train_eval_file=prefix + '/data/train_eval.json'
_args.train_record_file=prefix + '/data/train.npz'
_args.word2idx_file=prefix + '/data/word2idx.json'
_args.word_emb_file=prefix + '/data/word_emb.json'
_args

Namespace(ans_limit=30, answer_file='/drive/My Drive/datasets/data/answer.json', char2idx_file='/drive/My Drive/datasets/data/char2idx.json', char_dim=64, char_emb_file='/drive/My Drive/datasets/data/char_emb.json', char_limit=16, dev_eval_file='/drive/My Drive/datasets/data/dev_eval.json', dev_meta_file='/drive/My Drive/datasets/data/dev_meta.json', dev_record_file='/drive/My Drive/datasets/data/dev.npz', dev_url='https://github.com/chrischute/squad/data/dev-v2.0.json', glove_dim=300, glove_num_vecs=2196017, glove_url='http://nlp.stanford.edu/data/glove.840B.300d.zip', include_test_examples=True, para_limit=400, ques_limit=50, test_eval_file='/drive/My Drive/datasets/data/test_eval.json', test_meta_file='/drive/My Drive/datasets/data/test_meta.json', test_para_limit=1000, test_ques_limit=100, test_record_file='/drive/My Drive/datasets/data/test.npz', test_url='https://github.com/chrischute/squad/data/test-v2.0.json', train_eval_file='/drive/My Drive/datasets/data/train_eval.json', tra

In [None]:
build_features(_args, train_examples, "train", train_record_file, word2idx_dict, char2idx_dict)
dev_meta = build_features(_args, dev_examples, "dev", dev_record_file, word2idx_dict, char2idx_dict)


218it [00:00, 2172.80it/s]

Converting train examples to indices...


45328it [00:17, 2546.16it/s]
248it [00:00, 2478.10it/s]

Built 45213 / 45328 instances of features in total
Converting dev examples to indices...


5036it [00:02, 2439.80it/s]


Built 5022 / 5036 instances of features in total


In [None]:
save(_args.word_emb_file, word_emb_mat, message="word embedding")
save(_args.char_emb_file, char_emb_mat, message="char embedding")
save(_args.train_eval_file, train_eval, message="train eval")
save(_args.dev_eval_file, dev_eval, message="dev eval")
save(_args.word2idx_file, word2idx_dict, message="word dictionary")
save(_args.char2idx_file, char2idx_dict, message="char dictionary")
save(_args.dev_meta_file, dev_meta, message="dev meta")


Saving word embedding...
Saving char embedding...
Saving train eval...
Saving dev eval...
Saving word dictionary...
Saving char dictionary...
Saving dev meta...


### 2. The experiment

Now you are almost ready to go. You may follow these steps to begin (or just start your experiments here).

1. Try running the `train.py` script from the console (or via `!`) (default command-line arguments are ok for the start). If will run the BiDAF model on the preprocessed data. Set `--use_squad_v2` flag to False (SberQuAD is similar to SQuAD v1.1).

Example code (be careful with the path and the names of the variables):
```
python train.py --name first_run_on_sberquad --use_squad_v2 False
```

2. After if finishes (might take an 1-2-3 hours depending on the hardware), evaluate your model on the `dev` set and measure the quality.
Example code (be careful with the path and the names of the variables):
```
 python test.py --split dev --load_path ./save/train/first_run_on_sberquad-02/best.pth.tar --name best_evaluation_experiment
```
The result should be similar to the following:
```
>>> Dev NLL: 02.47, F1: 75.62, EM: 55.73, AvNA: 99.42
```

The [DeepPavlov's RuBERT](http://docs.deeppavlov.ai/en/master/features/models/squad.html) achieves $F1 = 84.60\pm0.11$ and $EM = 66.30\pm0.24$

In [None]:
%cd /drive/My Drive/datasets
!ls

/drive/My Drive/datasets
args.py    __pycache__					       test.py
data	   README.md					       train.py
layers.py  save						       util.py
LICENSE    SberQuAD_preprocessing_and_problem_statement.ipynb
models.py  setup.py


In [None]:
import gc
gc.collect()

87

In [None]:
!python train.py --name first_run_on_sberquad --use_squad_v2 False

[07.06.20 05:38:06] Args: {
    "batch_size": 64,
    "char_emb_file": "./data/char_emb.json",
    "dev_eval_file": "./data/dev_eval.json",
    "dev_record_file": "./data/dev.npz",
    "drop_prob": 0.2,
    "ema_decay": 0.999,
    "eval_steps": 50000,
    "gpu_ids": [
        0
    ],
    "hidden_size": 100,
    "l2_wd": 0,
    "load_path": null,
    "lr": 0.5,
    "max_ans_len": 15,
    "max_checkpoints": 5,
    "max_grad_norm": 5.0,
    "maximize_metric": true,
    "metric_name": "F1",
    "name": "first_run_on_sberquad",
    "num_epochs": 30,
    "num_visuals": 10,
    "num_workers": 4,
    "save_dir": "./save/train/first_run_on_sberquad-02",
    "seed": 224,
    "test_eval_file": "./data/test_eval.json",
    "test_record_file": "./data/test.npz",
    "train_eval_file": "./data/train_eval.json",
    "train_record_file": "./data/train.npz",
    "use_squad_v2": false,
    "word_emb_file": "./data/word_emb.json"
}
[07.06.20 05:38:06] Using random seed 224...
[07.06.20 05:38:06] Loading

#### Here comes your quest: try to improve the quality of this QA system. 

This is a very creative assignment. It is all about experimenting, trying different approaches (and a lot of computations). But if you wish to stick to some numbers, try to increase F1 at least by $5$ points.

Here are some ideas that might help you on your way:
* Try adapting the optimization hyperparameters/network structure to Russian language (the baseline is designed for English SQuAD dataset).
* Incorporating the additional information about the data (like PoS tags) might be a good idea.
* __Distilling the knowledge from a pre-trained RuBERT__ (e.g. try to use the predictions of the model we've discussed on `week10` as soft targets).
* Or anything else.


And, first of all, read the initial code carefully.


Good luck! Feel free to share your results :)