## Processing extracted data from the fine-tuned model

### Load predictions and validation data

In [1]:
import json
import pandas as pd

predictions = pd.read_json('./formatted_predictions.json')
actual = pd.read_json('./validation.json')

In [2]:
predictions

Unnamed: 0,id,prediction_text
0,56be4db0acb8001400a502ec,Denver Broncos
1,56be4db0acb8001400a502ed,Carolina Panthers
2,56be4db0acb8001400a502ee,"Santa Clara, California"
3,56be4db0acb8001400a502ef,Denver Broncos
4,56be4db0acb8001400a502f0,"golden anniversary"" with various gold"
...,...,...
10565,5737aafd1c456719005744fb,pound-force
10566,5737aafd1c456719005744fc,kilopond
10567,5737aafd1c456719005744fd,the metric slug
10568,5737aafd1c456719005744fe,the metric slug


In [3]:
actual

Unnamed: 0,id,title,context,question,answers
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth..."
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S..."
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta..."
...,...,...,...,...,...
10565,5737aafd1c456719005744fb,Force,"The pound-force has a metric counterpart, less...",What is the metric term less used than the New...,"{'text': ['kilogram-force', 'pound-force', 'ki..."
10566,5737aafd1c456719005744fc,Force,"The pound-force has a metric counterpart, less...",What is the kilogram-force sometimes reffered ...,"{'text': ['kilopond', 'kilopond', 'kilopond', ..."
10567,5737aafd1c456719005744fd,Force,"The pound-force has a metric counterpart, less...",What is a very seldom used unit of mass in the...,"{'text': ['slug', 'metric slug', 'metric slug'..."
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ..."


### Add predictions to dataframe

In [4]:
actual['prediction_text'] = predictions['prediction_text']
actual

Unnamed: 0,id,title,context,question,answers,prediction_text
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California"
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...","golden anniversary"" with various gold"
...,...,...,...,...,...,...
10565,5737aafd1c456719005744fb,Force,"The pound-force has a metric counterpart, less...",What is the metric term less used than the New...,"{'text': ['kilogram-force', 'pound-force', 'ki...",pound-force
10566,5737aafd1c456719005744fc,Force,"The pound-force has a metric counterpart, less...",What is the kilogram-force sometimes reffered ...,"{'text': ['kilopond', 'kilopond', 'kilopond', ...",kilopond
10567,5737aafd1c456719005744fd,Force,"The pound-force has a metric counterpart, less...",What is a very seldom used unit of mass in the...,"{'text': ['slug', 'metric slug', 'metric slug'...",the metric slug
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug
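
The assignment above aligns the two frames by row index, so it is only correct if both files list the questions in the same order. A minimal sketch of a join on the `id` column instead, using small made-up frames (`actual_demo`, `predictions_demo` and their rows are hypothetical, not taken from the real files):

```python
import pandas as pd

# Toy stand-ins for the validation set and the prediction file (hypothetical rows).
actual_demo = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "question": ["q1", "q2", "q3"],
})
predictions_demo = pd.DataFrame({
    "id": ["a3", "a1", "a2"],            # deliberately in a different order
    "prediction_text": ["p3", "p1", "p2"],
})

# Join on the question id; validate raises if an id is duplicated on either side.
merged = actual_demo.merge(
    predictions_demo, on="id", how="left", validate="one_to_one"
)
print(merged)
```

With a left join the row order of the validation frame is kept, and each prediction lands next to the question with the same `id` regardless of file order.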


In [5]:
actual['str_answers'] = actual['answers']

### Label data by whether they were correctly predicted

In [6]:
actual['prediction_text'] = actual['prediction_text'].astype('str')
actual['str_answers'] = actual['str_answers'].astype('str')

In [7]:
# Rows whose predicted text occurs in the stringified reference answers.
# .copy() makes df1 an independent frame, which avoids SettingWithCopyWarning.
df1 = actual[actual.apply(lambda x: x.prediction_text in x.str_answers, axis=1)].copy()
df1['ok'] = 'ok'
df1.to_json('ok.json')


In [8]:
# Rows whose predicted text does not occur in the stringified reference answers.
# .copy() makes df2 an independent frame, which avoids SettingWithCopyWarning.
df2 = actual[actual.apply(lambda x: x.prediction_text not in x.str_answers, axis=1)].copy()
df2['ok'] = 'nok'
df2.to_json('nok.json')


In [9]:
df3 = pd.concat([df1, df2])
df3.to_json('valid_pred_labeled.json')

In [10]:
df3

Unnamed: 0,id,title,context,question,answers,prediction_text,str_answers,ok
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok
...,...,...,...,...,...,...,...,...
10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok
10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok
10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok
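
The labels above come from a substring test against the stringified `answers` dict, so a prediction that is only a fragment of a reference answer is still labeled `ok`. A minimal sketch of a stricter comparison against the list of reference texts; the helper name `label_prediction` and the lower-casing and whitespace stripping are assumptions, not part of the original notebook:

```python
def label_prediction(prediction, reference_texts):
    """Return 'ok' if the prediction equals any reference answer, else 'nok'."""
    normalized = prediction.strip().lower()
    references = {text.strip().lower() for text in reference_texts}
    return "ok" if normalized in references else "nok"


references = ["gold", "gold", "gold"]

# Substring test on the stringified dict, as in the cells above.
substring_label = "ok" if "old" in str({"text": references}) else "nok"

# Exact comparison against each reference answer.
exact_label = label_prediction("old", references)

print(substring_label, exact_label)
```

Here the fragment `old` passes the substring test because it occurs inside `gold`, while the exact comparison rejects it.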


### Count words shared between question and context

In [11]:
data = pd.read_json('valid_pred_labeled.json')

In [12]:
data = data.reset_index()
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok
...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok


In [13]:
data.info()
data['prediction_text'] = data['prediction_text'].astype('str')
data['str_answers'] = data['str_answers'].astype('str')
data['question'] = data['question'].astype('str')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10570 entries, 0 to 10569
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            10570 non-null  int64 
 1   id               10570 non-null  object
 2   title            10570 non-null  object
 3   context          10570 non-null  object
 4   question         10570 non-null  object
 5   answers          10570 non-null  object
 6   prediction_text  10570 non-null  object
 7   str_answers      10570 non-null  object
 8   ok               10570 non-null  object
dtypes: int64(1), object(8)
memory usage: 743.3+ KB


In [None]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [29]:
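import nltk  # word_tokenize needs NLTK's Punkt tokenizer data

def count_similar_words_in_question_and_context(data):
    """Count distinct alphanumeric tokens shared by each question and its context."""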

    similar_words = []

    for i in range(len(data)):
        context1 = nltk.word_tokenize(data['context'][i])
        question1 = nltk.word_tokenize(data['question'][i])
        context_new = [word for word in context1 if word.isalnum()]
        question_new = [word for word in question1 if word.isalnum()]
        similar_words.append(len(set(context_new).intersection(set(question_new))))
        # similar_words.append(len(set(data['context'][i].split()).intersection(set(data['question'][i].split()))))

    return similar_words
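
For a single question and context pair, the overlap count can be sketched without NLTK. This version uses a `\w+` regular expression as the tokenizer, which is an assumption and may split some tokens differently from `nltk.word_tokenize`; `shared_word_count` is a hypothetical helper name:

```python
import re


def shared_word_count(context, question):
    """Count distinct alphanumeric tokens that occur in both strings."""
    token_pattern = re.compile(r"\w+")
    context_words = set(token_pattern.findall(context))
    question_words = set(token_pattern.findall(question))
    return len(context_words & question_words)


example_context = "Super Bowl 50 was an American football game."
example_question = "Where did Super Bowl 50 take place?"
shared = shared_word_count(example_context, example_question)
print(shared)
```

The comparison is case-sensitive, as in the function above, so `Where` and `where` count as different words.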

In [30]:
data['similar_words'] = count_similar_words_in_question_and_context(data)
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,kth_sentence,cosine_similarity,first_word
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,7,2,AFC,1,0.444912,Which
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,7,2,NFC,2,0.444996,Which
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,2,0.502001,Where
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,11,NFL,1,0.459421,Which
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,6,5,Bowl,3,0.490903,What
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,4,0.190650,What
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok,9,3,potential,2,0.381421,What
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,6,32,associated,1,0.138439,What
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000,3,0.364676,What


In [31]:
data['similar_words'].describe()

count    10570.000000
mean         6.047020
std          2.813025
min          0.000000
25%          4.000000
50%          6.000000
75%          8.000000
max         28.000000
Name: similar_words, dtype: float64

In [27]:
data['similar_words'].describe()

count    10570.000000
mean         6.254305
std          2.872910
min          0.000000
25%          4.000000
50%          6.000000
75%          8.000000
max         27.000000
Name: similar_words, dtype: float64

In [32]:
data['similar_words'].value_counts()

5     1635
6     1529
4     1387
7     1295
3     1084
8      998
9      699
2      554
10     431
11     280
1      222
12     163
13     110
14      61
0       34
15      29
16      26
17      14
19       6
18       5
20       3
21       2
28       1
22       1
23       1
Name: similar_words, dtype: int64

In [28]:
data['similar_words'].value_counts()

5     1569
6     1552
7     1330
4     1328
8     1036
3      991
9      746
2      487
10     485
11     326
1      193
12     189
13     125
14      75
15      41
0       29
16      25
17      15
18      14
19       6
20       4
22       2
21       1
27       1
Name: similar_words, dtype: int64

In [33]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Compute the distance from the answer to the closest question word in the context

In [None]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def count_lowest_position_of_word_from_question_in_context(data):
    # Load the stop-word list once instead of once per question word.
    stop_words = set(stopwords.words('english'))
    distances = []
    words = []
    for i in range(len(data)):
        # Strip basic punctuation, then split on whitespace.
        context_list = data['context'][i].replace(',', '').replace(':', '').replace('(', '').replace(')', '').split()
        question_list = data['question'][i].replace(',', '').replace(':', '').replace('(', '').replace(')', '').replace('?', '').split()
        # Slice the reference answers out of the stringified dict.
        answer_text = data['str_answers'][i][data['str_answers'][i].index('text\': [')+8 : data['str_answers'][i].index(']')].split(", ")

        # Locate the first word of the first reference answer in the context.
        if len(answer_text[0]) > 2 and answer_text[0][1:-1].split(', ')[0].split()[0] in context_list:
            answer_index = context_list.index(answer_text[0][1:-1].split(', ')[0].split()[0])
        else:
            distances.append(-1)
            words.append('None')
            continue

        filtered_words = [word for word in question_list if word not in stop_words]

        list_indexes = {}

        for word in filtered_words:
            if word in context_list:
                for j in range(len(context_list)):
                    if word == context_list[j]:
                        list_indexes[abs(j - answer_index)] = context_list[j]

        sort_orders = sorted(list_indexes.items(), key=lambda x: x[0], reverse=False)

        if len(sort_orders) == 0:
            distances.append(-1)
            words.append('None')
        else:
            distances.append(sort_orders[0][0])
            words.append(sort_orders[0][1])

    return distances, words

[nltk_data] Downloading package stopwords to /home/luki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
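
The function above recovers the reference answers by slicing the stringified dict at fixed offsets and splitting on `", "`, which also splits answers that contain a comma. A minimal sketch of parsing the same string with `ast.literal_eval` instead; `parse_answers` is a hypothetical helper and the `answer_start` values in the example are made up for illustration:

```python
import ast


def parse_answers(str_answers):
    """Turn the stringified answers dict back into (texts, starts)."""
    parsed = ast.literal_eval(str_answers)
    return parsed["text"], parsed["answer_start"]


# Illustrative string in the same shape as the str_answers column.
example = (
    "{'text': ['Santa Clara, California', \"Levi's Stadium\"], "
    "'answer_start': [403, 355]}"
)
texts, starts = parse_answers(example)
print(texts[0], starts[0])
```

Each answer text then arrives whole, with its quotes already removed, and the start offsets are real integers.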


In [17]:
data['distances'], data['closest_words'] = count_lowest_position_of_word_from_question_in_context(data)

In [18]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,5,5,Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000


In [26]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Identify which sentence contains the answer

In [12]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/luki/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
from nltk import tokenize

def identify_in_which_sentence_answer_is(data):
    sentence_indexes = []

    for i in range(len(data)):
        context1 = tokenize.sent_tokenize(data['context'][i])
        # context1 = data['context'][i].split('.')
        answer_start = data['answers'][i]['answer_start'][0]
        n = 0

        for sentence in context1:
            if answer_start - len(sentence) > 0:
                answer_start -= len(sentence)
                n += 1
        
        sentence_indexes.append(n)

    return sentence_indexes
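
The loop above subtracts sentence lengths from `answer_start` without counting the whitespace between sentences, and it keeps scanning after the first sentence that no longer fits, so the index can drift. A minimal sketch of an offset-based lookup, assuming each sentence returned by the splitter appears verbatim in the context; `sentence_index_for_offset` is a hypothetical helper and the sentences are passed in as a plain list rather than produced by `sent_tokenize`:

```python
def sentence_index_for_offset(context, sentences, answer_start):
    """Return the index of the sentence whose character span covers answer_start."""
    cursor = 0
    for index, sentence in enumerate(sentences):
        start = context.find(sentence, cursor)
        if start == -1:
            continue  # sentence text was altered by the splitter; skip it
        end = start + len(sentence)
        if start <= answer_start < end:
            return index
        cursor = end
    return -1  # offset fell in whitespace or outside every sentence


context = "Alpha wins. Beta loses badly. Gamma draws."
sentences = ["Alpha wins.", "Beta loses badly.", "Gamma draws."]
answer_start = context.index("Gamma")
kth = sentence_index_for_offset(context, sentences, answer_start)
print(kth)
```

Indexes here are zero-based, matching the counter `n` in the function above.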

In [18]:
data['kth_sentence'] = identify_in_which_sentence_answer_is(data)

In [19]:
data['kth_sentence'].describe()

count    10570.000000
mean         2.001987
std          1.893087
min          0.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         27.000000
Name: kth_sentence, dtype: float64

In [23]:
data['kth_sentence'].describe()

count    10570.000000
mean         3.345885
std          2.417245
min          0.000000
25%          2.000000
50%          3.000000
75%          4.000000
max         43.000000
Name: kth_sentence, dtype: float64

In [20]:
data['kth_sentence'].value_counts()

1     2768
2     2245
0     2236
3     1522
4      855
5      431
6      231
7      136
8       76
9       23
10      14
12       8
13       6
11       6
15       5
14       3
27       2
25       1
17       1
16       1
Name: kth_sentence, dtype: int64

In [24]:
data['kth_sentence'].value_counts()

2     2505
3     2051
1     1739
4     1607
5      938
6      553
7      352
0      276
8      203
9      120
10      78
11      48
12      25
13      19
14      14
16      11
15      10
18       5
17       4
19       4
21       2
24       1
41       1
43       1
42       1
26       1
31       1
Name: kth_sentence, dtype: int64

In [21]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Compute TF-IDF cosine similarity between contexts and questions

In [2]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [3]:
train_dataset = pd.read_json('./train.json')

In [4]:
train_dataset

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...
...,...,...,...,...,...
87594,5735d259012e2f140011a09d,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what US state did Kathmandu first establish...,"{'text': ['Oregon'], 'answer_start': [229]}"
87595,5735d259012e2f140011a09e,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",What was Yangon previously known as?,"{'text': ['Rangoon'], 'answer_start': [414]}"
87596,5735d259012e2f140011a09f,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",With what Belorussian city does Kathmandu have...,"{'text': ['Minsk'], 'answer_start': [476]}"
87597,5735d259012e2f140011a0a0,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what year did Kathmandu create its initial ...,"{'text': ['1975'], 'answer_start': [199]}"


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
model = vectorizer.fit(train_dataset['context']) # fit on all the data

def compute_similarity_between_context_and_question(data):
    similarities = []

    for i in tqdm(range(len(data))):
        context1 = vectorizer.transform([data['context'][i]])
        question1 = vectorizer.transform([data['question'][i]])
        similarities.append(cosine_similarity(context1, question1)[0][0])

    return similarities
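
To make the computation concrete, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is an illustration under assumptions, not a reimplementation of `TfidfVectorizer`: it uses a `\w+` tokenizer, no stop-word list, and a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and all names and the toy corpus are hypothetical:

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def fit_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector; terms unseen during fitting are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


corpus = [
    "the broncos won the game",
    "the panthers lost the game",
    "force is mass times acceleration",
]
idf = fit_idf(corpus)
question_vec = tfidf_vector("who won the game", idf)
related = cosine(tfidf_vector(corpus[0], idf), question_vec)
unrelated = cosine(tfidf_vector(corpus[2], idf), question_vec)
print(related, unrelated)
```

A question that shares no fitted vocabulary with its context scores exactly 0, which is how a minimum of 0 can arise in the summary statistics.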


In [9]:
data['cosine_similarity'] = compute_similarity_between_context_and_question(data)

In [10]:
data['cosine_similarity'].describe()

count    10570.000000
mean         0.306264
std          0.158483
min          0.000000
25%          0.187675
50%          0.296845
75%          0.418278
max          0.910359
Name: cosine_similarity, dtype: float64

In [29]:
data['cosine_similarity'].describe()

count    10570.000000
mean         0.351640
std          0.146940
min          0.000000
25%          0.242536
50%          0.345643
75%          0.452267
max          0.910465
Name: cosine_similarity, dtype: float64

In [11]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Extract first word from the question text

In [31]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [32]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,kth_sentence,cosine_similarity
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC,2,0.494975
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC,3,0.494975
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,3,0.547723
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL,2,0.513870
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,5,5,Bowl,4,0.547723
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,5,0.353209
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential,3,0.461880
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated,2,0.175412
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000,4,0.597614


In [33]:
def extract_first_word_from_question(data):
    first_words = []
    wh_words = ['What', 'Who', 'Which', 'How', 'Where', 'Why', 'When', 'Whose']

    for i in range(len(data['question'])):
        words = data['question'][i].split()
        if words[0] in wh_words:
            first_words.append(words[0])
        else:
            first_words.append('Otherone')
       

    return first_words
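
The function above matches the first word case-sensitively and assumes every question has at least one word. A minimal sketch of a variant that tolerates lower-case and empty questions and then tallies the categories; the lower-casing, the helper name `question_type`, and the sample questions are assumptions for illustration:

```python
from collections import Counter

WH_WORDS = {"what", "who", "which", "how", "where", "why", "when", "whose"}


def question_type(question):
    """Return the capitalised leading wh-word, or 'Otherone' if there is none."""
    words = question.split()
    if words and words[0].lower() in WH_WORDS:
        return words[0].capitalize()
    return "Otherone"


questions = ["Which NFL team won Super Bowl 50?", "what is a kip?", "In what year?", ""]
types = [question_type(q) for q in questions]
tally = Counter(types)
print(tally)
```

Questions where the wh-word is not in first position, such as `In what year?`, still fall into `Otherone`, as in the original function.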

In [34]:
data['first_word'] = extract_first_word_from_question(data)

In [35]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Split the data into two files by the distance between the answer and the closest question word

In [36]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [37]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,kth_sentence,cosine_similarity,first_word
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC,2,0.494975,Which
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC,3,0.494975,Which
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,3,0.547723,Where
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL,2,0.513870,Which
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,5,5,Bowl,4,0.547723,What
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,5,0.353209,What
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential,3,0.461880,What
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated,2,0.175412,What
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000,4,0.597614,What


In [38]:
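# Drop rows with a negative distance, then split the rest at distance 3 (lower: 0-3, higher: 4+).
data_distances = data[data.distances >= 0]
data_lower = data_distances[data_distances['distances'] <= 3]
data_higher = data_distances[data_distances['distances'] > 3]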
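
A threshold split of this kind is easy to sanity-check on a small frame. The sketch below uses a hypothetical `toy` frame (not the validation data) and assumes, as in the cell above, that negative distances are excluded before splitting at 3. It checks that the two parts are disjoint and together cover every remaining row.

```python
import pandas as pd

# Hypothetical toy data: one negative distance that should be excluded.
toy = pd.DataFrame({'distances': [2, 8, -1, 3, 11, 0]})

# Keep non-negative distances, then split at the threshold of 3.
valid = toy[toy['distances'] >= 0]
is_close = valid['distances'] <= 3
lower, higher = valid[is_close], valid[~is_close]

# The two parts must be disjoint and must cover every valid row.
assert len(lower) + len(higher) == len(valid)
assert lower.index.intersection(higher.index).empty

print(lower['distances'].tolist())   # distances 0 to 3
print(higher['distances'].tolist())  # distances 4 and above
```

Using an explicit boolean mask also avoids relying on the order in which `groupby` yields its `False` and `True` groups, and it still works when one of the two parts is empty.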

In [39]:
data_lower.reset_index().to_json('valid_data_lower_distance_than_4.json')

In [40]:
data_lower

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,kth_sentence,cosine_similarity,first_word
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC,2,0.494975,Which
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC,3,0.494975,Which
5,6,56be8e613aeaaa14008c90d2,Super_Bowl_50,Super Bowl 50 was an American football game to...,What day was the game played on?,"{'text': ['February 7, 2016', 'February 7', 'F...","February 7, 2016","{'text': ['February 7, 2016', 'February 7', 'F...",ok,4,2,played,3,0.279508,What
9,10,56bea9923aeaaa14008c91bb,Super_Bowl_50,Super Bowl 50 was an American football game to...,What day was the Super Bowl played on?,"{'text': ['February 7, 2016', 'February 7', 'F...","February 7, 2016","{'text': ['February 7, 2016', 'February 7', 'F...",ok,5,2,played,3,0.502079,What
12,14,56beace93aeaaa14008c91e2,Super_Bowl_50,Super Bowl 50 was an American football game to...,"If Roman numerals were used, what would Super ...","{'text': ['Super Bowl L', 'L', 'Super Bowl L']...",Super Bowl L,"{'text': ['Super Bowl L', 'L', 'Super Bowl L']...",ok,8,0,Super,4,0.530330,Otherone
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10561,10525,57379a4b1c456719005744cd,Force,The normal force is due to repulsive forces of...,What is the repulsive force of close range ato...,"{'text': ['normal force', 'normal force', 'nor...",The normal force,"{'text': ['normal force', 'normal force', 'nor...",nok,6,1,force,1,0.474579,What
10562,10528,57379a4b1c456719005744d0,Force,The normal force is due to repulsive forces of...,What is the force that causes rigid strength i...,"{'text': ['normal', 'normal force', 'normal fo...",repulsive forces,"{'text': ['normal', 'normal force', 'normal fo...",nok,5,1,force,3,0.664411,What
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,5,0.353209,What
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",conservative force,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential,3,0.461880,What


In [41]:
data_higher.reset_index().to_json('valid_data_higher_distance_than_4.json')

In [42]:
data_higher

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,kth_sentence,cosine_similarity,first_word
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,3,0.547723,Where
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL,2,0.513870,Which
4,5,56be8e613aeaaa14008c90d1,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,5,5,Bowl,4,0.547723,What
6,7,56be8e613aeaaa14008c90d3,Super_Bowl_50,Super Bowl 50 was an American football game to...,What is the AFC short for?,"{'text': ['American Football Conference', 'Ame...",American Football Conference,"{'text': ['American Football Conference', 'Ame...",ok,1,21,AFC,2,0.079057,What
7,8,56bea9923aeaaa14008c91b9,Super_Bowl_50,Super Bowl 50 was an American football game to...,What was the theme of Super Bowl 50?,"{'text': ['""golden anniversary""', 'gold-themed...",golden anniversary,"{'text': ['""golden anniversary""', 'gold-themed...",ok,5,5,Bowl,4,0.547723,What
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10558,10520,57379829c3c5551400e51f3d,Force,The weak force is due to the exchange of the h...,What does the W and Z boson exchange create?,"{'text': ['weak force', 'weak force', 'weak fo...",The weak force,"{'text': ['weak force', 'weak force', 'weak fo...",nok,5,6,exchange,1,0.121268,What
10563,10535,5737a0acc3c5551400e51f48,Force,Newton's laws and Newtonian mechanics in gener...,What didn't Newton's mechanics affext?,"{'text': ['three-dimensional objects', 'three-...",forces affect idealized point particles rather...,"{'text': ['three-dimensional objects', 'three-...",nok,2,16,mechanics,1,0.232845,What
10564,10540,5737a25ac3c5551400e51f52,Force,where is the relevant cross-sectional area fo...,What is used to calculate cross section area i...,"{'text': ['pressure terms', 'stress tensor', '...",stress-tensor,"{'text': ['pressure terms', 'stress tensor', '...",nok,6,11,volume,2,0.428393,What
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated,2,0.175412,What
