# Creating Sentence Embeddings for SQuAD

In [1]:
import pandas as pd

## Converting Json to Pandas DataFrame for better Visualization

In [2]:
X = pd.read_json("data/train-v1.1.json")

We can see that the SQuAD dataset is structured and close-domained i.e. it only contains selected question answer pairs.
Context is the paragraph from the article within which the answer is located. This text is under the key 'paragraphs'. Every Context contains a set of questions and their respective answers that are placed under the key 'answers'. The dataset also provides the index of the character where the answer to the question starts. Let's go ahead and convert this information into an excel sheet for better and easier data handling.

In [10]:
X.iloc[6,0]['paragraphs'][56]

{'context': 'P. Christiaan Klieger, an anthropologist and scholar of the California Academy of Sciences in San Francisco, writes that the vice royalty of the Sakya regime installed by the Mongols established a patron and priest relationship between Tibetans and Mongol converts to Tibetan Buddhism. According to him, the Tibetan lamas and Mongol khans upheld a "mutual role of religious prelate and secular patron," respectively. He adds that "Although agreements were made between Tibetan leaders and Mongol khans, Ming and Qing emperors, it was the Republic of China and its Communist successors that assumed the former imperial tributaries and subject states as integral parts of the Chinese nation-state."',
 'qas': [{'answers': [{'answer_start': 304,
     'text': 'the Tibetan lamas and Mongol khans'}],
   'id': '56ce2752aab44d1400b884d2',
   'question': 'Who does P. Christiaan Klieger claim to have had a mutual role of religious prelate?'},
  {'answers': [{'answer_start': 534,
     'text': 

In [7]:
contexts = []
questions = []
answer_texts = []
starts_at = []

In [8]:
for data in range(X.shape[0]):
    title = X.iloc[data,0]['paragraphs']
    for subtitle in title:
        for qa in subtitle['qas']:
            questions.append(qa['question'])
            starts_at.append(qa['answers'][0]['answer_start'])
            answer_texts.append(qa['answers'][0]['text'])
            contexts.append(subtitle['context'])

Create a single Dataframe with all data

In [9]:
df = pd.DataFrame({"context":contexts, "question": questions, "answer_start": starts_at, "text": answer_texts})

In [14]:
df.to_csv("data/train.csv", index = None)

In [10]:
df.shape

(87599, 4)


## Using InferSent for creating embeddings and dumping data dictionary to pickle

In [16]:
uniquecontext = list(df['context'].drop_duplicates().reset_index(drop= True))

In [19]:
from textblob import TextBlob

tb = TextBlob(" ".join(uniquecontext))
sentences = [item.raw for item in tb.sentences]

#### Transfer Learning:  Import InferSent LSTM model with pretrained GloVe vectors for creating sentence embeddings

In [21]:
import sys
sys.path.append('InferSent')
import torch

from models import InferSent
model_version = 1
MODEL_PATH = "InferSent/encoder/infersent%s.pkl" % model_version
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))
use_cuda = True
model = model.cuda()

In [22]:
model.set_w2v_path("InferSent/glove/glove.840B.300d.txt")

In [23]:
model.build_vocab(sentences, tokenize=True)

Found 88993(/109718) words with w2v vectors
Vocab size : 88993


This Step is just for visualization purposes. If you don't have a GPU, this step will take a lot of time. Just load the pre created embeddings from embedding1.pkl and embedding2.pkl as specified below

In [None]:
with open("data/embedding1.pickle", "rb") as f:
    d1 = pickle.load(f)

with open("data/embedding2.pickle", "rb") as f:
    d2 = pickle.load(f)

In [None]:
dict_embeddings = {}
for i in range(len(sentences)):
    print(i)
    dict_embeddings[sentences[i]] = model.encode([sentences[i]], tokenize=True)

In [26]:
questions = list(df["question"])

In [27]:
len(questions)

87599

In [29]:
d1 = {key:dict_embeddings[key] for i, key in enumerate(dict_embeddings) if i % 2 == 0}
d2 = {key:dict_embeddings[key] for i, key in enumerate(dict_embeddings) if i % 2 == 1}

In [32]:
import pickle

with open('data/embedding1.pickle', 'wb') as handle:
    pickle.dump(d1, handle)
with open('data/embedding2.pickle', 'wb') as handle:
    pickle.dump(d2, handle)

In [33]:
X_train = pd.read_csv('data/train.csv')

In [34]:
X_train

Unnamed: 0,answer_start,context,question,text
0,515,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous
1,188,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ
2,279,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building
3,381,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection
4,92,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary
5,248,"As at most other universities, Notre Dame's st...",When did the Scholastic Magazine of Notre dame...,September 1876
6,441,"As at most other universities, Notre Dame's st...",How often is Notre Dame's the Juggler published?,twice
7,598,"As at most other universities, Notre Dame's st...",What is the daily student paper at Notre Dame ...,The Observer
8,126,"As at most other universities, Notre Dame's st...",How many student news papers are found at Notr...,three
9,908,"As at most other universities, Notre Dame's st...",In what year did the student paper Common Sens...,1987


In [35]:
dict_emb = dict(d1)
dict_emb.update(d2)

In [36]:
len(dict_emb)

179862

In [37]:
X_train.dropna(inplace=True)

In [38]:
X_train.shape

(87598, 4)

In [39]:
def get_target(x):
    idx = -1
    for i in range(len(x["sentences"])):
        if x["text"] in x["sentences"][i]: idx = i
    return idx

def process(train):
    
    train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])
    train["target"] = train.apply(get_target, axis = 1)
    train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in\
                                                           dict_emb else np.zeros(4096) for item in x])
    train['quest_emb'] = train['question'].apply(lambda x: dict_emb[x] if x in dict_emb else np.zeros(4096) )
    
    return train

In [40]:
import numpy as np

X_train = process(X_train)

In [41]:
X_train.head(5)

Unnamed: 0,answer_start,context,question,text,sentences,target,sent_emb,quest_emb
0,515,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,"[Architecturally, the school has a Catholic ch...",5,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.11010079, 0.11422941, 0.115608975, 0.05489..."
1,188,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ,"[Architecturally, the school has a Catholic ch...",2,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.10951651, 0.11030627, 0.052100062, 0.03053..."
2,279,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building,"[Architecturally, the school has a Catholic ch...",3,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.011956477, 0.14930707, 0.026600495, 0.0527..."
3,381,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,"[Architecturally, the school has a Catholic ch...",4,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.0711433, 0.05411832, -0.013959841, 0.05310..."
4,92,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,"[Architecturally, the school has a Catholic ch...",1,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.16133596, 0.1503958, 0.09225755, 0.0404580..."


In [42]:
def cosine_similarity(x):
    li = []
    for item in x["sent_emb"]:
        li.append(spatial.distance.cosine(item, x["quest_emb"][0]))
    return li

def pred_idx(distances):
    return np.argmin(distances)

def predictions(train):
    
    train["cosine_sim"] = train.apply(cosine_similarity, axis = 1)
    train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
    train["euclidean_dis"] = train["diff"].apply(lambda x: list(np.sum(x, axis = 1)))
    del train["diff"]
    
    train["pred_idx_cos"] = train["cosine_sim"].apply(lambda x: pred_idx(x))
    train["pred_idx_euc"] = train["euclidean_dis"].apply(lambda x: pred_idx(x))
    
    return train

In [43]:
from scipy import spatial

predicted = predictions(X_train)

  dist = 1.0 - uv / np.sqrt(uu * vv)


In [44]:
predicted.head(10)

Unnamed: 0,answer_start,context,question,text,sentences,target,sent_emb,quest_emb,cosine_sim,euclidean_dis,pred_idx_cos,pred_idx_euc
0,515,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,"[Architecturally, the school has a Catholic ch...",5,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.11010079, 0.11422941, 0.115608975, 0.05489...","[0.42473626136779785, 0.3640499711036682, 0.34...","[14.563859, 15.262213, 17.398178, 14.272491, 1...",5,5
1,188,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ,"[Architecturally, the school has a Catholic ch...",2,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.10951651, 0.11030627, 0.052100062, 0.03053...","[0.45407456159591675, 0.32262009382247925, 0.3...","[12.889506, 12.285219, 16.843704, 8.361172, 11...",3,3
2,279,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building,"[Architecturally, the school has a Catholic ch...",3,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.011956477, 0.14930707, 0.026600495, 0.0527...","[0.3958578109741211, 0.2917083501815796, 0.309...","[11.857297, 11.392319, 15.061656, 7.184714, 8....",3,3
3,381,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,"[Architecturally, the school has a Catholic ch...",4,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.0711433, 0.05411832, -0.013959841, 0.05310...","[0.49006974697113037, 0.4060605764389038, 0.45...","[13.317537, 15.017247, 20.81268, 10.511387, 10...",3,3
4,92,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,"[Architecturally, the school has a Catholic ch...",1,"[[0.05519996, 0.0501314, 0.047870375, 0.016248...","[[0.16133596, 0.1503958, 0.09225755, 0.0404580...","[0.4777514934539795, 0.2891119122505188, 0.341...","[15.0888195, 11.612734, 16.684145, 9.71824, 12...",3,3
5,248,"As at most other universities, Notre Dame's st...",When did the Scholastic Magazine of Notre dame...,September 1876,"[As at most other universities, Notre Dame's s...",2,"[[0.09720327, 0.09345725, 0.054660242, 0.04843...","[[0.016918724, 0.12084099, 0.013292058, 0.0587...","[0.2747580409049988, 0.3731493353843689, 0.280...","[11.473504, 16.305737, 14.419686, 11.785967, 1...",5,4
6,441,"As at most other universities, Notre Dame's st...",How often is Notre Dame's the Juggler published?,twice,"[As at most other universities, Notre Dame's s...",3,"[[0.09720327, 0.09345725, 0.054660242, 0.04843...","[[0.07944553, 0.11071574, 0.11615732, 0.045065...","[0.29136353731155396, 0.44691193103790283, 0.3...","[12.094654, 19.268333, 17.051125, 12.115431, 1...",5,4
7,598,"As at most other universities, Notre Dame's st...",What is the daily student paper at Notre Dame ...,The Observer,"[As at most other universities, Notre Dame's s...",9,"[[0.09720327, 0.09345725, 0.054660242, 0.04843...","[[0.0711433, 0.05411832, 0.02641398, 0.0866460...","[0.24287956953048706, 0.38149863481521606, 0.3...","[10.40575, 17.056553, 16.048374, 12.6742735, 1...",5,0
8,126,"As at most other universities, Notre Dame's st...",How many student news papers are found at Notr...,three,"[As at most other universities, Notre Dame's s...",9,"[[0.09720327, 0.09345725, 0.054660242, 0.04843...","[[0.06699271, 0.050647143, 0.118103534, 0.0667...","[0.18055570125579834, 0.3603665828704834, 0.34...","[7.8146877, 16.114155, 17.537537, 12.886263, 1...",0,0
9,908,"As at most other universities, Notre Dame's st...",In what year did the student paper Common Sens...,1987,"[As at most other universities, Notre Dame's s...",7,"[[0.09720327, 0.09345725, 0.054660242, 0.04843...","[[0.042654388, 0.13311043, 0.112292886, 0.0977...","[0.2252199649810791, 0.36460912227630615, 0.26...","[9.867832, 16.703403, 13.726372, 11.037147, 12...",5,0


In [45]:
predicted.to_csv("data/newdata.csv", index=None)

In [46]:
def accuracy(target, predicted):
    
    acc = (target==predicted).sum()/len(target)
    
    return acc

In [47]:
print(accuracy(predicted["target"], predicted["pred_idx_euc"]))

0.4471106646270463


In [48]:
print(accuracy(predicted["target"], predicted["pred_idx_cos"]))

0.6333477933286148


In [49]:
predicted.iloc[65000,:]

answer_start                                                   546
context          Socioeconomic factors, in combination with ear...
question         What has led to many tragic instances of event...
text                                                        Racism
sentences        [Socioeconomic factors, in combination with ea...
target                                                           3
sent_emb         [[0.098147884, 0.1247487, 0.14421512, 0.060331...
quest_emb        [[0.079051174, 0.08900199, 0.13616362, 0.06939...
cosine_sim       [0.2518765926361084, 0.24583172798156738, 0.24...
euclidean_dis           [11.059026, 11.683056, 11.02791, 3.251983]
pred_idx_cos                                                     3
pred_idx_euc                                                     3
Name: 65000, dtype: object

In [50]:
ct,k = 0,0
for i in range(predicted.shape[0]):
    if predicted.iloc[i,10] != predicted.iloc[i,5]:
        k += 1
        if predicted.iloc[i,11] == predicted.iloc[i,5]:
            ct += 1

ct, k

(5534, 32118)

In [51]:
label = []
for i in range(predicted.shape[0]):
    if predicted.iloc[i,10] == predicted.iloc[i,11]:
        label.append(predicted.iloc[i,10])
    else:
        label.append((predicted.iloc[i,10],predicted.iloc[i,10]))

In [52]:
ct = 0
for i in range(75206):
    item = predicted["target"][i]
    try:
        if label[i] == predicted["target"][i]: ct +=1
    except:
        if item in label[i]: ct +=1

# Combining Accuracy of features

In [53]:
ct/75206

0.6364385820280297