In [1]:
import json
import pandas as pd
import numpy as np
import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lats/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Stanford question answering dataset (SQuAD)

Today we are going to work with a popular NLP dataset.

Here is the description of the original problem:

```
Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
```


We are not going to solve it :) Instead, we will answer questions in a different way: given a question, we will find a **sentence** that contains the answer, searching not within the given context but across the **whole databank**.

Just watch closely.

In [2]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json

--2017-11-15 19:40:41--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 151.101.37.147, 2a04:4e42:9::403
Connecting to rajpurkar.github.io (rajpurkar.github.io)|151.101.37.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: 'train-v1.1.json'


2017-11-15 19:40:45 (7.48 MB/s) - 'train-v1.1.json' saved [30288272/30288272]



In [3]:
with open('train-v1.1.json') as f:
    data = json.load(f)

In [4]:
data['data'][0]

{'paragraphs': [{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
   'qas': [{'answers': [{'answer_start': 515,
       'text': 'Saint Bernadette Soubirous'}],
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
    {'answers': [{'answer_start': 188, 'text': 

The code here is very similar to `week5/`

In [5]:
from nltk.tokenize import RegexpTokenizer
from collections import Counter,defaultdict
tokenizer = RegexpTokenizer(r"\w+|\d+")

#Dictionary of tokens
token_counts = Counter()

def tokenize(value):
    return tokenizer.tokenize(value.lower())

for q in tqdm.tqdm_notebook(data['data']):
    for p in q['paragraphs']:
        token_counts.update(tokenize(p['context']))




In [6]:
min_count = 4

tokens = [w for w, c in token_counts.items() if c > min_count] 

In [7]:
dict_size = len(tokens)+2

token_to_id = {t: i + 2 for i,t in enumerate(tokens)}
id_to_token = {i + 2: t for i,t in enumerate(tokens)}

In [8]:
assert token_to_id['me'] != token_to_id['woods']
assert token_to_id[id_to_token[42]]==42
assert len(token_to_id)==len(tokens)
assert 0 not in id_to_token

In [9]:
from nltk.tokenize import sent_tokenize
def build_dataset(train_data):
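    '''Takes SQuAD data
    Returns a list of (question, positive sentence) pairs (q, a_+)
    '''
    data = []
    for q in tqdm.tqdm_notebook(train_data):
        for p in q['paragraphs']:
            offsets = []
            current_index = 0
            for sent in sent_tokenize(p['context']):
                # locate the sentence in the context so that the stored offset is
                # its true end position, whatever whitespace separates sentences
                start = p['context'].find(sent, current_index)
                current_index = (start if start >= 0 else current_index) + len(sent)
                offsets.append((current_index, sent))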
                
            for qa in p['qas']:
                answer = qa['answers'][0]
                found = False
                for o, sent in offsets:
                    if answer['answer_start']<o:
                        data.append((qa['question'], sent))
                        found = True
                        break
                assert found
    return data

In [10]:
from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(data['data'], test_size=0.1)

data_train = build_dataset(train_data)
data_val = build_dataset(val_data)







In [11]:
data_val[2]

('In what year did Northwestern University began teaching?',
 'Instruction began in 1855; women were admitted in 1869.')
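
The pairing step relies on mapping the character offset `answer_start` to the sentence that contains it. Below is a minimal sketch of that lookup on a toy context. The helper name `sentence_for_offset` and the hand-split sentences are assumptions made for illustration; the notebook itself uses `sent_tokenize` for the splitting.

```python
def sentence_for_offset(context, sentences, answer_start):
    """Return the sentence whose character span contains answer_start, or None."""
    cursor = 0
    for sent in sentences:
        start = context.find(sent, cursor)
        if start < 0:
            continue  # sentence not found after the cursor; skip it
        cursor = start + len(sent)  # true end offset of this sentence
        if answer_start < cursor:
            return sent
    return None

# Toy context, split by hand (a stand-in for sent_tokenize).
context = "Instruction began in 1855. Women were admitted in 1869."
sentences = ["Instruction began in 1855.", "Women were admitted in 1869."]

picked = sentence_for_offset(context, sentences, context.index("1869"))
print(picked)
```

Tracking the real end offset of every sentence keeps the lookup correct regardless of how many whitespace characters separate sentences.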

In [34]:
def vectorize(strings, token_to_id, UNK=1, PAD=0):
    '''This function takes an array of strings and transforms it into a padded token matrix
    Remember to:
     - Transform each string to a list of tokens
     - Transform each token to its id (if not in the dict, replace with UNK)
     - Pad each line to max_len'''
    token_matrix = []
    
    for s in strings:
        seq = [token_to_id.get(token,UNK) for token in tokenize(s)]
        token_matrix.append(seq)
    
    max_len = max(map(len,token_matrix))
        
    # handle empty batch
    if max_len == 0:
        max_len = 1
    
    for i in range(len(token_matrix)):
        while(len(token_matrix[i]) < max_len):
            token_matrix[i].append(PAD)
    
    return np.array(token_matrix, dtype='int32')

In [35]:
test = vectorize(["Hello, adshkjasdhkas, world", "data"], token_to_id, 1)
assert test.shape==(2,3)
assert (test[:,1]==(1,0)).all()
print("Correct!")

Correct!
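
To make the padding convention concrete, here is a small self-contained sketch with a made-up three-word vocabulary. The names `toy_vocab` and `toy_vectorize` are hypothetical, and whitespace splitting stands in for the `RegexpTokenizer` used above. Ids 0 and 1 stay reserved for PAD and UNK, as in `token_to_id`.

```python
PAD, UNK = 0, 1

# Hypothetical toy vocabulary; the real ids come from token_to_id above.
toy_vocab = {"hello": 2, "world": 3, "data": 4}

def toy_vectorize(strings, vocab):
    """Map strings to equal-length rows of token ids (plain lists, no numpy)."""
    rows = [[vocab.get(tok, UNK) for tok in s.lower().split()] for s in strings]
    # an empty batch or a batch of empty strings still gets width 1
    max_len = max((len(r) for r in rows), default=1) or 1
    return [r + [PAD] * (max_len - len(r)) for r in rows]

toy_matrix = toy_vectorize(["hello unknownword world", "data"], toy_vocab)
print(toy_matrix)  # [[2, 1, 3], [4, 0, 0]]
```

The unknown word maps to UNK (1), and the shorter row is padded with PAD (0) on the right, which is what the mask in the encoder later relies on.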


# Deep Learning

The beginning is the same as always.

In [36]:
import theano
import theano.tensor as T
import lasagne
from lasagne.layers import *

margin = 0.1

In [37]:
def build_encoder(lstm_size=50, embeddings_size=50, target_space_dim=50, PAD=0):
    '''
    Build a lasagne network that converts input sequence to a fixed-size vector.
    Must have a single input layer that accepts int32[batch,max_len]
    '''
    inp = InputLayer([None, None], dtype='int32')
    mask = ExpressionLayer(inp, lambda ix: T.neq(ix,PAD))

    net = EmbeddingLayer(inp, dict_size, embeddings_size)
    net = LSTMLayer(net, lstm_size, mask_input=mask, only_return_final=True)
    net = DenseLayer(net, target_space_dim)        

    return net

question_encoder = build_encoder()
answer_encoder = build_encoder()

We are going to use a single encoder for both positive and negative answers.

In [38]:
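def normalize(batch_vector):
    # scale each row to unit L2 norm; the small constant guards against division by zero
    return batch_vector / T.sqrt((batch_vector ** 2).sum(1, keepdims=True) + 1e-8)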

In [39]:
questions = T.imatrix(name="word_ids_questions")
answers_positive = T.imatrix(name="word_ids_answers_positive")
answers_negative = T.imatrix(name="word_ids_answers_negative")

positive_output = get_output(answer_encoder,answers_positive)
negative_output = get_output(answer_encoder, answers_negative)
anchor_output = get_output(question_encoder, questions)


In [50]:
# compute dot products to get similarity. Also: you can use T.batched_dot for speed
positive_dot = T.sum(anchor_output*positive_output, axis=1)
negative_dot = T.sum(anchor_output*negative_output, axis=1)


# compute triplet loss (pairwise hinge loss) as per formulae in the lecture.
# please use T.maximum and not T.max!
loss = T.mean(T.maximum(0, negative_dot - positive_dot + margin))

recall = T.mean(positive_dot > negative_dot)
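
As a sanity check of the hinge formula, here is a plain-Python sketch that evaluates the same expression, `max(0, <a, n> - <a, p> + margin)`, on two hand-made triplets. The vectors are invented for illustration and the helper names `dot` and `triplet_hinge` are not part of the notebook.

```python
margin = 0.1

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def triplet_hinge(anchor, positive, negative, margin):
    """max(0, <a, n> - <a, p> + margin) for a single triplet."""
    return max(0.0, dot(anchor, negative) - dot(anchor, positive) + margin)

anchor = [1.0, 0.0]
# The positive is far more similar than the negative: zero loss.
easy = triplet_hinge(anchor, positive=[0.9, 0.1], negative=[0.2, 0.5], margin=margin)
# The positive wins by less than the margin: small positive loss (about 0.05).
close = triplet_hinge(anchor, positive=[0.5, 0.0], negative=[0.45, 0.3], margin=margin)
print(easy, close)
```

The second triplet would count as a hit for `recall` (the positive still scores higher) yet it keeps contributing to the loss until the gap exceeds `margin`.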

In [51]:
allparams = get_all_params([answer_encoder,question_encoder],trainable=True)

updates = lasagne.updates.adam(loss, allparams)
train_op = theano.function([questions, answers_positive, answers_negative],
                           [loss, recall],
                           updates=updates)

validate_op = theano.function([questions, answers_positive, answers_negative], [loss, recall])

### Training on minibatches

In [52]:
batch_size = 200
def iterate_batches(data, only_positives=False):
    """Takes a dataset: a list of (question, positive sentence) pairs
    Yields dicts of vectorized batches, one entry per input type
    only_positives: if True, yield only the positive sentences (needed for the index)
    instead of full (anchor, positive, negative) triplets
    """

    i = 0
    while i < len(data):
        data_batch = data[i:i+batch_size]
        batch = {}
        batch['positive'] = vectorize([sample[1] for sample in data_batch], token_to_id)
        if not only_positives:
            batch['anchor'] = vectorize([sample[0] for sample in data_batch], token_to_id)
            batch['negative'] = vectorize([ data[np.random.randint(0, len(data))][1]  for j in range(len(data_batch))], \
                                          token_to_id, 1)
        
        yield batch
        i+=batch_size

In [53]:
def validate():
    total_loss, total_recall = 0, 0
    batches = 0
    for batch in  iterate_batches(data_val):
        batches+=1
        current_loss, current_recall =  validate_op(batch['anchor'],
                                                    batch['positive'],
                                                    batch['negative'])
        total_loss+=current_loss
        total_recall+=current_recall
        
    total_loss/=batches
    total_recall/=batches
    
    if total_recall > 0.9:
        print('Cool! If recall is right, you earned (3 pts)')
    return (total_loss, total_recall)

In [54]:
num_epochs = 100
step = 0
for j in range(num_epochs):
    for i, batch in  enumerate(iterate_batches(data_train)):
        current_loss, current_recall =  train_op(batch['anchor'],
                                                 batch['positive'],
                                                 batch['negative'])
        step+=1
        print("Current step: %s. Current loss is %s, Current recall is %s" % (step, current_loss, current_recall))
        if i%100==0:
            print("Validation. Loss: %s, Recall: %s" %validate())

Current step: 1. Current loss is 0.09263699469831793, Current recall is 0.6
Validation. Loss: 0.0939050048071, Recall: 0.598779028853
Current step: 2. Current loss is 0.0925891880238865, Current recall is 0.675
Current step: 3. Current loss is 0.09745501592745198, Current recall is 0.55
Current step: 4. Current loss is 0.09284730466874352, Current recall is 0.62
Current step: 5. Current loss is 0.09116292812449255, Current recall is 0.62
Current step: 6. Current loss is 0.08220451870338087, Current recall is 0.69
Current step: 7. Current loss is 0.08904215334327358, Current recall is 0.6
Current step: 8. Current loss is 0.07407578920112229, Current recall is 0.735
Current step: 9. Current loss is 0.08448092136482264, Current recall is 0.635
Current step: 10. Current loss is 0.07554864783596638, Current recall is 0.7
Current step: 11. Current loss is 0.06742075928304492, Current recall is 0.705
Current step: 12. Current loss is 0.08819032808740618, Current recall is 0.595
Current step: 

Current step: 105. Current loss is 0.046229878239207815, Current recall is 0.795
Current step: 106. Current loss is 0.05184521319420739, Current recall is 0.77
Current step: 107. Current loss is 0.07163887368391399, Current recall is 0.66
Current step: 108. Current loss is 0.07517893189643839, Current recall is 0.685
Current step: 109. Current loss is 0.07458762911714643, Current recall is 0.61
Current step: 110. Current loss is 0.06446430242853003, Current recall is 0.72
Current step: 111. Current loss is 0.02880496131508754, Current recall is 0.89
Current step: 112. Current loss is 0.08060044771370878, Current recall is 0.645
Current step: 113. Current loss is 0.08602884656086897, Current recall is 0.55
Current step: 114. Current loss is 0.07839199002627328, Current recall is 0.625
Current step: 115. Current loss is 0.07410847467071252, Current recall is 0.65
Current step: 116. Current loss is 0.0711300940338827, Current recall is 0.68
Current step: 117. Current loss is 0.06766193946

Current step: 208. Current loss is 0.04528846150700451, Current recall is 0.81
Current step: 209. Current loss is 0.06503749557518622, Current recall is 0.705
Current step: 210. Current loss is 0.05734807338619333, Current recall is 0.745
Current step: 211. Current loss is 0.03990980973576901, Current recall is 0.82
Current step: 212. Current loss is 0.039311998708477465, Current recall is 0.825
Current step: 213. Current loss is 0.024526323556272954, Current recall is 0.925
Current step: 214. Current loss is 0.03150716421603013, Current recall is 0.87
Current step: 215. Current loss is 0.06516465206383215, Current recall is 0.675
Current step: 216. Current loss is 0.06226994343165904, Current recall is 0.71
Current step: 217. Current loss is 0.07581175297434208, Current recall is 0.675
Current step: 218. Current loss is 0.04748083876190792, Current recall is 0.79
Current step: 219. Current loss is 0.07997821675567915, Current recall is 0.635
Current step: 220. Current loss is 0.072244

Current step: 311. Current loss is 0.06863526779389213, Current recall is 0.735
Current step: 312. Current loss is 0.08242795518775006, Current recall is 0.645
Current step: 313. Current loss is 0.08185140451541156, Current recall is 0.62
Current step: 314. Current loss is 0.08300554638872126, Current recall is 0.64
Current step: 315. Current loss is 0.07149622873736022, Current recall is 0.68
Current step: 316. Current loss is 0.07841086924522765, Current recall is 0.64
Current step: 317. Current loss is 0.06340608497394527, Current recall is 0.72
Current step: 318. Current loss is 0.06114422221747362, Current recall is 0.72
Current step: 319. Current loss is 0.07251573118547844, Current recall is 0.68
Current step: 320. Current loss is 0.07502872041934848, Current recall is 0.625
Current step: 321. Current loss is 0.0706532679315802, Current recall is 0.69
Current step: 322. Current loss is 0.0676419032972889, Current recall is 0.71
Current step: 323. Current loss is 0.07241847039543

Current step: 414. Current loss is 0.06225454870055144, Current recall is 0.73
Current step: 415. Current loss is 0.05353865589902797, Current recall is 0.795
Current step: 416. Current loss is 0.054629450806912165, Current recall is 0.76
Current step: 417. Current loss is 0.0464629257401007, Current recall is 0.835
Current step: 418. Current loss is 0.061479193616064555, Current recall is 0.765
Current step: 419. Current loss is 0.050530754716914726, Current recall is 0.77
Current step: 420. Current loss is 0.07036537491500844, Current recall is 0.665
Current step: 421. Current loss is 0.053304285984629676, Current recall is 0.805
Current step: 422. Current loss is 0.03778843350484899, Current recall is 0.88
Current step: 423. Current loss is 0.05640950423217662, Current recall is 0.78
Current step: 424. Current loss is 0.07265070259700056, Current recall is 0.68
Current step: 425. Current loss is 0.054084287836498636, Current recall is 0.77
Current step: 426. Current loss is 0.052899

Current step: 517. Current loss is 0.04394106240234575, Current recall is 0.785
Current step: 518. Current loss is 0.06288979338887024, Current recall is 0.725
Current step: 519. Current loss is 0.06160175581456653, Current recall is 0.75
Current step: 520. Current loss is 0.05744583002571062, Current recall is 0.76
Current step: 521. Current loss is 0.0641692333618015, Current recall is 0.705
Current step: 522. Current loss is 0.05468427908842993, Current recall is 0.765
Current step: 523. Current loss is 0.04250541159547207, Current recall is 0.81
Current step: 524. Current loss is 0.03159351149590786, Current recall is 0.87
Current step: 525. Current loss is 0.033359650822293696, Current recall is 0.875
Current step: 526. Current loss is 0.030139481480636488, Current recall is 0.865
Current step: 527. Current loss is 0.07440766875090474, Current recall is 0.69
Current step: 528. Current loss is 0.06679493971724605, Current recall is 0.735
Current step: 529. Current loss is 0.0592488

Current step: 620. Current loss is 0.05172026579400231, Current recall is 0.8
Current step: 621. Current loss is 0.04011500137429673, Current recall is 0.83
Current step: 622. Current loss is 0.05582643286972342, Current recall is 0.775
Current step: 623. Current loss is 0.05179047946203785, Current recall is 0.755
Current step: 624. Current loss is 0.05305571674655882, Current recall is 0.785
Current step: 625. Current loss is 0.06717094163603643, Current recall is 0.715
Current step: 626. Current loss is 0.054965862187561346, Current recall is 0.73
Current step: 627. Current loss is 0.05315051016224567, Current recall is 0.81
Current step: 628. Current loss is 0.046190162168725395, Current recall is 0.81
Current step: 629. Current loss is 0.05693007165716068, Current recall is 0.745
Current step: 630. Current loss is 0.05695418179379558, Current recall is 0.775
Current step: 631. Current loss is 0.045056581493102564, Current recall is 0.81
Current step: 632. Current loss is 0.0573086

Current step: 723. Current loss is 0.05954748069713755, Current recall is 0.765
Current step: 724. Current loss is 0.032924513316661234, Current recall is 0.865
Current step: 725. Current loss is 0.06082073444847752, Current recall is 0.755
Current step: 726. Current loss is 0.07255848802270597, Current recall is 0.685
Current step: 727. Current loss is 0.05965846464504282, Current recall is 0.74
Current step: 728. Current loss is 0.05294805588489968, Current recall is 0.79
Current step: 729. Current loss is 0.043783859754088474, Current recall is 0.82
Current step: 730. Current loss is 0.03623240522778856, Current recall is 0.845
Current step: 731. Current loss is 0.06955149089823047, Current recall is 0.66
Current step: 732. Current loss is 0.06196544677461601, Current recall is 0.715
Current step: 733. Current loss is 0.0409143433226082, Current recall is 0.815
Current step: 734. Current loss is 0.05793754273911593, Current recall is 0.745
Current step: 735. Current loss is 0.055608

Current step: 826. Current loss is 0.051230796974895076, Current recall is 0.78
Current step: 827. Current loss is 0.04013087931003195, Current recall is 0.83
Current step: 828. Current loss is 0.04376699330137634, Current recall is 0.81
Current step: 829. Current loss is 0.04484529979144388, Current recall is 0.82
Current step: 830. Current loss is 0.03326139529910332, Current recall is 0.86
Current step: 831. Current loss is 0.023683183378614737, Current recall is 0.91
Current step: 832. Current loss is 0.04257625512955471, Current recall is 0.8
Current step: 833. Current loss is 0.040847100275865016, Current recall is 0.82
Current step: 834. Current loss is 0.047462143186071, Current recall is 0.8
Current step: 835. Current loss is 0.03997281177007608, Current recall is 0.845
Current step: 836. Current loss is 0.02346070598739011, Current recall is 0.89
Current step: 837. Current loss is 0.03172021091131314, Current recall is 0.865
Current step: 838. Current loss is 0.05181843204227

Current step: 929. Current loss is 0.030712569999401754, Current recall is 0.865
Current step: 930. Current loss is 0.03385028767049078, Current recall is 0.875
Current step: 931. Current loss is 0.03295903534344163, Current recall is 0.865
Current step: 932. Current loss is 0.03398206511975172, Current recall is 0.845
Current step: 933. Current loss is 0.034988049732391564, Current recall is 0.86
Current step: 934. Current loss is 0.053527193655336196, Current recall is 0.78
Current step: 935. Current loss is 0.037714188962088155, Current recall is 0.83
Current step: 936. Current loss is 0.05599434437256205, Current recall is 0.72
Current step: 937. Current loss is 0.04578157093680867, Current recall is 0.805
Current step: 938. Current loss is 0.0437542864062903, Current recall is 0.85
Current step: 939. Current loss is 0.021906104864182537, Current recall is 0.92
Current step: 940. Current loss is 0.02548736947815884, Current recall is 0.915
Current step: 941. Current loss is 0.03459

Current step: 1031. Current loss is 0.03843821894874486, Current recall is 0.85
Current step: 1032. Current loss is 0.045465766966255365, Current recall is 0.81
Current step: 1033. Current loss is 0.04994521934299111, Current recall is 0.79
Current step: 1034. Current loss is 0.03565754055437531, Current recall is 0.845
Current step: 1035. Current loss is 0.04581867708387389, Current recall is 0.815
Current step: 1036. Current loss is 0.034389932075379895, Current recall is 0.855
Current step: 1037. Current loss is 0.03133606458117729, Current recall is 0.865
Current step: 1038. Current loss is 0.036662135126167575, Current recall is 0.8
Current step: 1039. Current loss is 0.03865843597260248, Current recall is 0.825
Current step: 1040. Current loss is 0.04390481193863766, Current recall is 0.83
Current step: 1041. Current loss is 0.04581598437115677, Current recall is 0.795
Current step: 1042. Current loss is 0.028821332646450246, Current recall is 0.88
Current step: 1043. Current los

Current step: 1132. Current loss is 0.044646324789751786, Current recall is 0.805
Current step: 1133. Current loss is 0.04778032957029553, Current recall is 0.79
Current step: 1134. Current loss is 0.04129932498052928, Current recall is 0.84
Current step: 1135. Current loss is 0.042157875775063074, Current recall is 0.825
Current step: 1136. Current loss is 0.04178947118006364, Current recall is 0.815
Current step: 1137. Current loss is 0.03763865037176276, Current recall is 0.825
Current step: 1138. Current loss is 0.04572274687320044, Current recall is 0.815
Current step: 1139. Current loss is 0.0418723779153147, Current recall is 0.8
Current step: 1140. Current loss is 0.030525740738536155, Current recall is 0.89
Current step: 1141. Current loss is 0.032845142318763586, Current recall is 0.86
Current step: 1142. Current loss is 0.04407756753569021, Current recall is 0.805
Current step: 1143. Current loss is 0.047697412155568115, Current recall is 0.82
Current step: 1144. Current los

Current step: 1233. Current loss is 0.04110043690586771, Current recall is 0.81
Current step: 1234. Current loss is 0.03182083054692368, Current recall is 0.885
Current step: 1235. Current loss is 0.0391935785049506, Current recall is 0.82
Current step: 1236. Current loss is 0.07622256110754538, Current recall is 0.68
Current step: 1237. Current loss is 0.054967365652757644, Current recall is 0.76
Current step: 1238. Current loss is 0.05576672971451006, Current recall is 0.775
Current step: 1239. Current loss is 0.029650815844904397, Current recall is 0.89
Current step: 1240. Current loss is 0.02796039473825843, Current recall is 0.885
Current step: 1241. Current loss is 0.03465303467694919, Current recall is 0.85
Current step: 1242. Current loss is 0.034023890498566585, Current recall is 0.885
Current step: 1243. Current loss is 0.029491609710714926, Current recall is 0.875
Current step: 1244. Current loss is 0.030594133220542736, Current recall is 0.865
Current step: 1245. Current lo

Current step: 1334. Current loss is 0.039096436551274795, Current recall is 0.87
Current step: 1335. Current loss is 0.04194497625905668, Current recall is 0.845
Current step: 1336. Current loss is 0.04158888167056398, Current recall is 0.84
Current step: 1337. Current loss is 0.03585483351602838, Current recall is 0.835
Current step: 1338. Current loss is 0.047936498674280875, Current recall is 0.78
Current step: 1339. Current loss is 0.03387356822928718, Current recall is 0.85
Current step: 1340. Current loss is 0.03665518104527872, Current recall is 0.83
Current step: 1341. Current loss is 0.03557290234423614, Current recall is 0.84
Current step: 1342. Current loss is 0.02949998487546883, Current recall is 0.87
Current step: 1343. Current loss is 0.03414129970515227, Current recall is 0.86
Current step: 1344. Current loss is 0.031198993319816824, Current recall is 0.875
Current step: 1345. Current loss is 0.03176348442782931, Current recall is 0.85
Current step: 1346. Current loss i

KeyboardInterrupt: 

In [None]:
# Compile the embedding functions once, reusing the symbolic inputs defined above
embed_answers = theano.function([answers_positive], positive_output)
embed_questions = theano.function([questions], anchor_output)

class Index(object):
    """Represents index of calculated embeddings"""
    def __init__(self, data):
        """Class constructor takes a dataset and stores all unique sentences and their embeddings"""
        self.vectors = []
        self.sent = []
        seen = set()  # set membership is O(1), unlike a scan over self.sent
        batch = []
        for i, (_, sentence) in enumerate(data):
            if sentence not in seen:
                seen.add(sentence)
                self.sent.append(sentence)
                batch.append(sentence)
            # flush a full batch, or whatever is left at the end of the data
            if len(batch) >= batch_size or (len(batch) > 0 and i + 1 == len(data)):
                vectorized_batch = vectorize(batch, token_to_id=token_to_id, UNK=1)
                self.vectors.extend(embed_answers(vectorized_batch))
                batch = []
        self.vectors = np.asarray(self.vectors)

    def predict(self, query, top_size=1):
        """
        Function takes:
         - query is a string, containing question
        Function returns:
         - a list with len of top_size, containing the closest answers from the index
        You may want to use np.argpartition
        """
        vectorized_batch = vectorize([query], token_to_id=token_to_id, UNK=1)
        embedding = embed_questions(vectorized_batch)[0]
        scores = 1 - self.vectors.dot(embedding)
        top_size = min(top_size, len(scores))
        # argpartition puts the top_size smallest scores first, in arbitrary order
        indices = np.argpartition(scores, top_size - 1)[:top_size]
        indices = sorted(indices, key=lambda ind: scores[ind])
        return [self.sent[i] for i in indices]
    
    def calculate_FHS(self, D):
        """Prototype for home assignment. Returns a float number"""
        raise NotImplementedError
        
        
        

In [None]:
index = Index(data_val)

In [None]:
assert len(index.vectors) == len(index.sent)
assert type(index.sent[1])==str
# the embedding size is the last dimension of the encoder output
assert index.vectors.shape == (len(index.sent), get_output_shape(answer_encoder)[-1])
p  = index.predict("Hey", top_size=3)
assert len(p) == 3
assert type(p[0])==str
assert index.predict("Hello", top_size=50)!=index.predict("Not Hello", top_size=50)
print("Ok (2 pts)")

In [None]:
index.predict('To show their strength in the international Communist movement, what did China do?', top_size=10)
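
For intuition, the retrieval behind such a query amounts to a brute-force scan: score every stored sentence vector against the question vector and keep the best few. The sketch below shows that idea with plain lists and `heapq`; the vectors, sentences and the name `top_k_by_dot` are invented for illustration and do not come from the trained model.

```python
import heapq

def top_k_by_dot(query_vec, vectors, sentences, k):
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(sum(q * v for q, v in zip(query_vec, vec)), sent)
              for vec, sent in zip(vectors, sentences)]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [sent for _, sent in best]

# Hypothetical 2-d embeddings for three indexed sentences.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentences = ["about x", "about y", "about both"]

hits = top_k_by_dot([1.0, 0.2], vectors, sentences, k=2)
print(hits)  # ['about x', 'about both']
```

The cost grows linearly with the number of indexed sentences, which is why a smarter nearest-neighbor structure becomes attractive for large databanks.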

In [None]:
data_val[np.random.randint(0, 100)]

# Home assignment
**Task 1.** (3 pts) Implement the **semihard** sampling strategy. Use **in-graph** sampling. You have a prototype above.
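
To illustrate the selection rule only, here is a plain-Python sketch that works outside the graph. It assumes the common definition of a semihard negative: one that scores below the positive but within `margin` of it. The function name, the score matrix and the fallback to the hardest remaining negative are assumptions made for this example. An in-graph version would express the same masks with tensor operations.

```python
def semihard_negative_indices(scores, margin):
    """scores[i][j] is the similarity of question i to answer j; answer i is the positive.

    Returns one negative index per row. Needs at least two answers per row.
    """
    chosen = []
    for i, row in enumerate(scores):
        pos = row[i]
        # semihard: below the positive, but closer to it than the margin
        candidates = [j for j, s in enumerate(row)
                      if j != i and pos - margin < s < pos]
        if not candidates:
            # assumed fallback: take the best-scoring negative of any kind
            candidates = [j for j in range(len(row)) if j != i]
        chosen.append(max(candidates, key=lambda j: row[j]))
    return chosen

# Hypothetical similarity matrix for a batch of three (question, answer) pairs.
scores = [
    [0.50, 0.45, 0.90],
    [0.20, 0.80, 0.30],
    [0.25, 0.10, 0.30],
]
picked = semihard_negative_indices(scores, margin=0.1)
print(picked)  # [1, 2, 0]
```

In the first row the negative scoring 0.90 is skipped even though it is the hardest one, because it already beats the positive and therefore is not semihard.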

**Task 2.1.** (1 pt) Calculate the **FHS** (First Hit Success) metric on the whole validation dataset (for each query, against the whole `data_val` index). The prototype of the function is in the `Index` class. Compare different models based on this metric. Add a table with FHS values to your report.
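
One possible reading of the metric, sketched below, is the share of queries whose top-1 retrieved sentence is exactly the ground-truth sentence. That interpretation is an assumption, as are the names `first_hit_success`, `toy_pairs` and `toy_predict`; the stub predictor only mimics the signature of `Index.predict`.

```python
def first_hit_success(pairs, predict):
    """Share of (question, sentence) pairs whose top-1 prediction equals the sentence."""
    if not pairs:
        return 0.0
    hits = sum(1 for question, sentence in pairs
               if predict(question, top_size=1)[0] == sentence)
    return hits / float(len(pairs))

# Stub standing in for Index.predict: a fixed lookup table with one mistake.
toy_pairs = [("q1", "s1"), ("q2", "s2"), ("q3", "s3"), ("q4", "s4")]
toy_answers = {"q1": "s1", "q2": "s9", "q3": "s3", "q4": "s4"}

def toy_predict(query, top_size=1):
    return [toy_answers[query]]

fhs = first_hit_success(toy_pairs, toy_predict)
print(fhs)  # 0.75
```

With a built index, the same helper could be called as `first_hit_success(data_val, index.predict)`.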

**Task 2.2.** Add calculation of other representative metrics. You may want to calculate different recalls on a mini-batch, or some ranking metrics.   
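
Two ranking metrics that are often reported alongside a top-1 score are recall@k and mean reciprocal rank (MRR). The sketch below computes both from ranked result lists; the function names and the toy rankings are made up for illustration.

```python
def recall_at_k(ranked_lists, targets, k):
    """Share of queries whose target appears among the first k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / float(len(targets))

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the target; a missing target contributes 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)

# Three toy queries that all look for sentence "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "d"]]
targets = ["a", "a", "a"]

r1 = recall_at_k(ranked, targets, k=1)
r2 = recall_at_k(ranked, targets, k=2)
mrr = mean_reciprocal_rank(ranked, targets)
print(r1, r2, mrr)
```

Here the target is ranked first, second and not at all, so recall@1 is 1/3, recall@2 is 2/3 and MRR is (1 + 1/2 + 0) / 3 = 0.5.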

**Task 3.** (2 pts) Experiment with deeper architectures and find the best one. Analyse your results and write a conclusion.

**describe your results here**

Bonus task 1. (2++ pts) Add manual negatives to the model. What can be a good manual negative in this case?

Bonus task 2. (2++ pts) Implement a more efficient nearest neighbor search method. How well does it perform on our dataset?



