## Processing extracted data from the fine-tuned model

### Load predictions and validation data

In [1]:
import json
import pandas as pd

predictions = pd.read_json('./formatted_predictions.json')
actual = pd.read_json('./validation.json')

In [28]:
predictions

Unnamed: 0,id,prediction_text
0,56be4db0acb8001400a502ec,Denver Broncos
1,56be4db0acb8001400a502ed,Carolina Panthers
2,56be4db0acb8001400a502ee,"Santa Clara, California"
3,56be4db0acb8001400a502ef,Denver Broncos
4,56be4db0acb8001400a502f0,gold
...,...,...
10565,5737aafd1c456719005744fb,pound-force
10566,5737aafd1c456719005744fc,kilopond
10567,5737aafd1c456719005744fd,the metric slug
10568,5737aafd1c456719005744fe,the metric slug


In [29]:
actual

Unnamed: 0,id,title,context,question,answers
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth..."
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S..."
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta..."
...,...,...,...,...,...
10565,5737aafd1c456719005744fb,Force,"The pound-force has a metric counterpart, less...",What is the metric term less used than the New...,"{'text': ['kilogram-force', 'pound-force', 'ki..."
10566,5737aafd1c456719005744fc,Force,"The pound-force has a metric counterpart, less...",What is the kilogram-force sometimes reffered ...,"{'text': ['kilopond', 'kilopond', 'kilopond', ..."
10567,5737aafd1c456719005744fd,Force,"The pound-force has a metric counterpart, less...",What is a very seldom used unit of mass in the...,"{'text': ['slug', 'metric slug', 'metric slug'..."
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ..."


### Add predictions to dataframe

In [30]:
# Assignment aligns on the row index, so this assumes both files index the examples identically.
actual['prediction_text'] = predictions['prediction_text']
actual

Unnamed: 0,id,title,context,question,answers,prediction_text
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California"
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold
...,...,...,...,...,...,...
10565,5737aafd1c456719005744fb,Force,"The pound-force has a metric counterpart, less...",What is the metric term less used than the New...,"{'text': ['kilogram-force', 'pound-force', 'ki...",pound-force
10566,5737aafd1c456719005744fc,Force,"The pound-force has a metric counterpart, less...",What is the kilogram-force sometimes reffered ...,"{'text': ['kilopond', 'kilopond', 'kilopond', ...",kilopond
10567,5737aafd1c456719005744fd,Force,"The pound-force has a metric counterpart, less...",What is a very seldom used unit of mass in the...,"{'text': ['slug', 'metric slug', 'metric slug'...",the metric slug
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug
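
Because the assignment above relies on both files sharing the same row index, a join keyed on `id` is a more defensive alternative. The sketch below shows the idea with plain Python records. `attach_predictions` and the sample records are hypothetical; in pandas the equivalent would be a merge on the `id` column.

```python
def attach_predictions(examples, predictions):
    """Return copies of `examples` with `prediction_text` looked up by `id`.

    Raises KeyError if an example has no matching prediction, so a silent
    misalignment cannot happen.
    """
    by_id = {p["id"]: p["prediction_text"] for p in predictions}
    return [{**ex, "prediction_text": by_id[ex["id"]]} for ex in examples]


# Hypothetical records; the prediction order deliberately differs.
examples = [
    {"id": "a1", "question": "Which NFL team won Super Bowl 50?"},
    {"id": "b2", "question": "Where did Super Bowl 50 take place?"},
]
predictions = [
    {"id": "b2", "prediction_text": "Santa Clara, California"},
    {"id": "a1", "prediction_text": "Denver Broncos"},
]

joined = attach_predictions(examples, predictions)
```

Each example receives the prediction that carries its own `id`, regardless of the order in which the two files were written.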


In [31]:
actual['str_answers'] = actual['answers']

### Label whether each prediction was correct

In [32]:
actual['prediction_text'] = actual['prediction_text'].astype('str')
actual['str_answers'] = actual['str_answers'].astype('str')

In [33]:
# Substring match against the stringified answers dict (a loose check, not an exact match).
# .copy() makes df1 an independent frame, so the assignment below does not trigger SettingWithCopyWarning.
df1 = actual[actual.apply(lambda x: x.prediction_text in x.str_answers, axis=1)].copy()
df1['ok'] = 'ok'
df1.to_json('ok.json')

In [34]:
# Rows whose prediction does not occur in the stringified answers; .copy() avoids SettingWithCopyWarning.
df2 = actual[actual.apply(lambda x: x.prediction_text not in x.str_answers, axis=1)].copy()
df2['ok'] = 'nok'
df2.to_json('nok.json')


In [35]:
df3 = pd.concat([df1, df2])
df3.to_json('valid_pred_labeled.json')

In [36]:
df3

Unnamed: 0,id,title,context,question,answers,prediction_text,str_answers,ok
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok
...,...,...,...,...,...,...,...,...
10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok
10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok
10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok
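
The `in` test above is a substring check against the stringified `answers` dict, so a short prediction can match by accident. A stricter labeling compares the prediction with each reference answer after light normalization. The following is a minimal sketch, not the notebook's method: `normalize` and `label_prediction` are hypothetical helpers, and the normalization (lowercasing, dropping punctuation and English articles) is an assumption modeled on common SQuAD-style exact-match scoring.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and the articles a/an/the, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def label_prediction(prediction, answers):
    """Return 'ok' if the prediction equals any reference answer, else 'nok'."""
    references = {normalize(a) for a in answers["text"]}
    return "ok" if normalize(prediction) in references else "nok"


# Hypothetical reference answers in the same shape as the `answers` column.
answers = {"text": ["slug", "metric slug", "metric slug"]}
label_exact = label_prediction("the metric slug", answers)
label_miss = label_prediction("kip", answers)
```

Labels produced this way will generally differ from the `ok` column above, because exact match is stricter than substring containment.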


### Count shared words between question and context

In [37]:
data = pd.read_json('valid_pred_labeled.json')

In [38]:
data = data.reset_index()
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok
4,4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok
...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok


In [39]:
data.info()
data['prediction_text'] = data['prediction_text'].astype('str')
data['str_answers'] = data['str_answers'].astype('str')
data['question'] = data['question'].astype('str')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10570 entries, 0 to 10569
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            10570 non-null  int64 
 1   id               10570 non-null  object
 2   title            10570 non-null  object
 3   context          10570 non-null  object
 4   question         10570 non-null  object
 5   answers          10570 non-null  object
 6   prediction_text  10570 non-null  object
 7   str_answers      10570 non-null  object
 8   ok               10570 non-null  object
dtypes: int64(1), object(8)
memory usage: 743.3+ KB


In [40]:
def count_similar_words_in_question_and_context(data):

    similar_words = []

    for i in range(len(data)):
        similar_words.append(len(set(data['context'][i].split()).intersection(set(data['question'][i].split()))))

    return similar_words

In [41]:
data['similar_words'] = count_similar_words_in_question_and_context(data)
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2
4,4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok,6
...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok,8
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7
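
`str.split()` keeps case and attached punctuation, so `Bowl` and `bowl?` count as different words. Below is a sketch of a normalized overlap count, under the assumption that lowercased alphanumeric tokens are the unit of comparison. `shared_word_count` is a hypothetical helper, the sample texts are shortened, and its counts will differ from the `similar_words` column above.

```python
import re

TOKEN = re.compile(r"\w+")


def shared_word_count(context, question):
    """Count distinct lowercased tokens that occur in both texts."""
    context_tokens = set(TOKEN.findall(context.lower()))
    question_tokens = set(TOKEN.findall(question.lower()))
    return len(context_tokens & question_tokens)


# Shortened, illustrative texts.
context = "Super Bowl 50 was an American football game."
question = "Where did Super Bowl 50 take place?"
count = shared_word_count(context, question)
```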


### Compute the distance from the answer to the closest question word in the context

In [42]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def count_lowest_position_of_word_from_question_in_context(data):
    distances = []
    words = []
    for i in range(len(data)):
        context_list = data['context'][i].replace(',', '').replace(':', '').replace('(', '').replace(')', '').split()
        question_list = data['question'][i].replace(',', '').replace(':', '').replace('(', '').replace(')', '').replace('?', '').split()
        answer_text = data['str_answers'][i][data['str_answers'][i].index('text\': [')+8 : data['str_answers'][i].index(']')].split(", ")
        answer_start = data['str_answers'][i][data['str_answers'][i].index('answer_start\': [')+16 : data['str_answers'][i].index(']}')]

        if len(answer_text[0]) > 2 and answer_text[0][1:-1].split(', ')[0].split()[0] in context_list:
            answer_index = context_list.index(answer_text[0][1:-1].split(', ')[0].split()[0])
        else:
            distances.append(-1)
            words.append('None')
            continue

        stop_words = set(stopwords.words('english'))  # build the set once per row instead of once per word
        filtered_words = [word for word in question_list if word not in stop_words]

        list_indexes = {}

        for word in filtered_words:
            if word in context_list:
                for j in range(len(context_list)):
                    if word == context_list[j]:
                        list_indexes[abs(j - answer_index)] = context_list[j]

        sort_orders = sorted(list_indexes.items(), key=lambda x: x[0], reverse=False)

        if len(sort_orders) == 0:
            distances.append(-1)
            words.append('None')
        else:
            distances.append(sort_orders[0][0])
            words.append(sort_orders[0][1])

    return distances, words

[nltk_data] Downloading package stopwords to /home/luki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [43]:
data['distances'], data['closest_words'] = count_lowest_position_of_word_from_question_in_context(data)

In [44]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL
4,4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok,6,-1,
...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000
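
The function above recovers the answer text by slicing the stringified dict, which is fragile when an answer itself contains `]` or `, `. Below is a sketch of the same idea working on the parsed `answers` dict and token positions instead. `closest_question_word` is hypothetical, the stop-word set is passed in by the caller (for example `set(stopwords.words('english'))`), and the tokenization is a simplifying assumption, so results need not match the `distances` column.

```python
import re

# Tokens are runs of characters other than whitespace , : ( ) ?
TOKEN = re.compile(r"[^\s,:()?]+")


def closest_question_word(context, question, answers, stop_words):
    """Return (distance, word) for the question word nearest the answer.

    Distance is measured in tokens from the first token of the first
    reference answer; (-1, None) means no position could be found.
    """
    context_tokens = TOKEN.findall(context)
    answer_tokens = TOKEN.findall(answers["text"][0])
    if not answer_tokens or answer_tokens[0] not in context_tokens:
        return -1, None
    answer_index = context_tokens.index(answer_tokens[0])

    keywords = {w for w in TOKEN.findall(question) if w.lower() not in stop_words}
    candidates = [
        (abs(j - answer_index), token)
        for j, token in enumerate(context_tokens)
        if token in keywords
    ]
    return min(candidates) if candidates else (-1, None)


# Hypothetical, shortened example with a tiny hand-written stop-word set.
context = "The AFC champion Denver Broncos defeated the NFC champion Carolina Panthers"
question = "Which team represented the AFC?"
result = closest_question_word(
    context, question, {"text": ["Denver Broncos"]}, {"which", "the"}
)
```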


In [45]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Identify which sentence contains the answer

In [2]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [26]:
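def identify_in_which_sentence_answer_is(data):
    sentence_indexes = []

    for i in range(len(data)):
        # Naive sentence split: every '.' is a boundary (abbreviations and decimals included).
        context1 = data['context'][i].split('.')
        # Character offset of the first reference answer in the context.
        answer_start = data['answers'][i]['answer_start'][0]
        n = 0

        # NOTE: the split drops the '.' characters and the loop never breaks, so sentences
        # after the answer can still be counted; n is an approximate sentence index.
        for sentence in context1:
            if answer_start - len(sentence) > 0:
                answer_start -= len(sentence)
                n += 1

        sentence_indexes.append(n)

    return sentence_indexes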

In [28]:
data['kth_sentence'] = identify_in_which_sentence_answer_is(data)

In [30]:
data['kth_sentence'].describe()

count    10570.000000
mean         3.345885
std          2.417245
min          0.000000
25%          2.000000
50%          3.000000
75%          4.000000
max         43.000000
Name: kth_sentence, dtype: float64

In [7]:
data['kth_sentence'].value_counts()

2     2505
3     2051
1     1739
4     1607
5      938
6      553
7      352
0      276
8      203
9      120
10      78
11      48
12      25
13      19
14      14
16      11
15      10
18       5
17       4
19       4
21       2
31       1
42       1
41       1
43       1
24       1
26       1
Name: kth_sentence, dtype: int64
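
Because the loop above subtracts sentence lengths without the removed `.` characters and does not stop at the first sentence that contains the offset, the counts are approximate, which may contribute to outliers such as 43. Below is a sketch of an offset-based alternative, assuming a sentence ends at each `.`. `sentence_index_of_offset` is hypothetical and would produce a different distribution from the one shown above.

```python
def sentence_index_of_offset(context, answer_start):
    """Return the 0-based index of the '.'-delimited sentence holding the offset."""
    # Counting the periods before the offset gives the sentence index directly.
    return context.count(".", 0, answer_start)


# Illustrative context with three sentences.
context = "First sentence. Second one here. Third."
answer_start = context.index("Second")
index = sentence_index_of_offset(context, answer_start)
```

Applied per row, the call would take `data['context'][i]` and `data['answers'][i]['answer_start'][0]` as its two arguments.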

In [31]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Compute TF-IDF cosine similarity between contexts and questions

In [2]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)

def compute_similarity_between_context_and_question(data):
    similarities = []

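    for i in tqdm(range(len(data))):
        # The vectorizer is refit on each context alone, so every IDF weight is identical
        # and the score is effectively a term-frequency cosine similarity.
        context1 = vectorizer.fit_transform([data['context'][i]])
        question1 = vectorizer.transform([data['question'][i]])
        similarities.append(cosine_similarity(context1, question1)[0][0])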

    return similarities



In [4]:
data['cosine_similarity'] = compute_similarity_between_context_and_question(data)


In [5]:
data['cosine_similarity'].describe()

count    10570.000000
mean         0.351640
std          0.146940
min          0.000000
25%          0.242536
50%          0.345643
75%          0.452267
max          0.910465
Name: cosine_similarity, dtype: float64
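
With `TfidfVectorizer` fitted on a single context, the smoothed IDF term is ln((1+1)/(1+1)) + 1 = 1 for every word, so the score above reduces to a cosine similarity over term counts restricted to the context's vocabulary. The sketch below illustrates that reduction in plain Python. `term_count_cosine` is a hypothetical helper, and it skips the English stop-word filter for brevity, so its values are only comparable in spirit.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def term_count_cosine(context, question):
    """Cosine similarity of term counts, using only words seen in the context."""
    context_counts = Counter(TOKEN.findall(context.lower()))
    question_counts = Counter(
        t for t in TOKEN.findall(question.lower()) if t in context_counts
    )
    dot = sum(context_counts[t] * c for t, c in question_counts.items())
    norm = math.sqrt(sum(c * c for c in context_counts.values())) * math.sqrt(
        sum(c * c for c in question_counts.values())
    )
    return dot / norm if norm else 0.0


# Illustrative texts: only "force" is shared, and the context has five distinct words.
score = term_count_cosine("force equals mass times acceleration", "what is force")
```

To weight rare words more heavily, the vectorizer would have to be fitted once on all contexts rather than refitted per row.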

In [6]:
data.to_json('valid_pred_labeled_with_added_from_func.json')

### Extract first word from the question text

In [46]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [47]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL
4,4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok,6,-1,
...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000


In [48]:
def extract_first_word_from_question(data):
    first_words = []
    wh_words = ['What', 'Who', 'Which', 'How', 'Where', 'Why', 'When', 'Whose']

    for i in range(len(data['question'])):
        words = data['question'][i].split()
        if words[0] in wh_words:
            first_words.append(words[0])
        else:
            first_words.append('Otherone')
       

    return first_words

In [49]:
data['first_word'] = extract_first_word_from_question(data)

In [50]:
data.to_json('valid_pred_labeled_with_added_from_func.json')
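
The membership test above is case-sensitive and indexes `words[0]` directly, so a lowercase `what` falls into `Otherone` and an empty question would raise `IndexError`. Below is a sketch of a more tolerant variant, assuming case-insensitive matching is wanted. `classify_first_word` is hypothetical and keeps the notebook's `Otherone` fallback label.

```python
WH_WORDS = {"what", "who", "which", "how", "where", "why", "when", "whose"}


def classify_first_word(question):
    """Return the capitalized wh-word that opens the question, else 'Otherone'."""
    words = question.split()
    if words and words[0].lower() in WH_WORDS:
        return words[0].capitalize()
    return "Otherone"


# Illustrative questions, including a lowercase opener and an empty string.
questions = ["Which NFL team won?", "what day?", "In what year?", ""]
labels = [classify_first_word(q) for q in questions]
```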

### Split data into two files based on the distance from the answer to the closest question word

In [51]:
data = pd.read_json('valid_pred_labeled_with_added_from_func.json')

In [55]:
data

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,first_word
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC,Which
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC,Which
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,Where
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL,Which
4,4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",gold,"{'text': ['gold', 'gold', 'gold'], 'answer_sta...",ok,6,-1,,What
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,What
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential,What
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated,What
10568,10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",the metric slug,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ...",nok,7,4,1000,What


In [52]:
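# Keep rows with a computed distance (-1 means "not found"); the groupby key aligns on the
# index, and groups come back in key order: False (distance > 3), then True (distance <= 3).
data_distances = data[data.distances >= 0]
data_higher, data_lower = [x for _, x in data.groupby(data_distances['distances'] <= 3)]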

In [53]:
data_lower.reset_index().to_json('valid_data_lower_distance_than_4.json')

In [56]:
data_lower

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,first_word
0,0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,4,2,AFC,Which
1,1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth...",Carolina Panthers,"{'text': ['Carolina Panthers', 'Carolina Panth...",ok,4,2,NFC,Which
5,6,56be8e613aeaaa14008c90d2,Super_Bowl_50,Super Bowl 50 was an American football game to...,What day was the game played on?,"{'text': ['February 7, 2016', 'February 7', 'F...","February 7, 2016","{'text': ['February 7, 2016', 'February 7', 'F...",ok,4,2,played,What
8,10,56bea9923aeaaa14008c91bb,Super_Bowl_50,Super Bowl 50 was an American football game to...,What day was the Super Bowl played on?,"{'text': ['February 7, 2016', 'February 7', 'F...","February 7, 2016","{'text': ['February 7, 2016', 'February 7', 'F...",ok,5,2,played,What
12,14,56beace93aeaaa14008c91e2,Super_Bowl_50,Super Bowl 50 was an American football game to...,"If Roman numerals were used, what would Super ...","{'text': ['Super Bowl L', 'L', 'Super Bowl L']...",Super Bowl L,"{'text': ['Super Bowl L', 'L', 'Super Bowl L']...",ok,8,0,Super,Otherone
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10560,10533,57379ed81c456719005744d9,Force,Tension forces can be modeled using ideal stri...,What can increase the tension force on a load?,"{'text': ['movable pulleys', 'connecting the s...",ideal pulleys,"{'text': ['movable pulleys', 'connecting the s...",nok,7,3,tension,What
10562,10536,5737a0acc3c5551400e51f49,Force,Newton's laws and Newtonian mechanics in gener...,In what kind of fluid are pressure differences...,"{'text': ['extended', 'extended', 'extended'],...",extended fluids,"{'text': ['extended', 'extended', 'extended'],...",nok,4,3,forces,Otherone
10565,10548,5737a5931c456719005744e9,Force,"where is the mass of the object, is the velo...",What force changes an objects direction of tra...,"{'text': ['centripetal', 'unbalanced centripet...",radial (centripetal) force,"{'text': ['centripetal', 'unbalanced centripet...",nok,5,1,force,What
10566,10555,5737a7351c456719005744f5,Force,A conservative force that acts on a closed sys...,What is the force called rgarding a potential ...,"{'text': ['artifact', 'artifact of the potenti...",difference in potential energy,"{'text': ['artifact', 'artifact of the potenti...",nok,8,3,potential,What


In [54]:
data_higher.reset_index().to_json('valid_data_higher_distance_than_4.json')

In [57]:
data_higher

Unnamed: 0,index,id,title,context,question,answers,prediction_text,str_answers,ok,similar_words,distances,closest_words,first_word
2,2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S...","Santa Clara, California","{'text': ['Santa Clara, California', ""Levi's S...",ok,3,8,Super,Where
3,3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,11,NFL,Which
6,7,56be8e613aeaaa14008c90d3,Super_Bowl_50,Super Bowl 50 was an American football game to...,What is the AFC short for?,"{'text': ['American Football Conference', 'Ame...",American Football Conference,"{'text': ['American Football Conference', 'Ame...",ok,1,21,AFC,What
7,9,56bea9923aeaaa14008c91ba,Super_Bowl_50,Super Bowl 50 was an American football game to...,What does AFC stand for?,"{'text': ['American Football Conference', 'Ame...",American Football Conference,"{'text': ['American Football Conference', 'Ame...",ok,0,21,AFC,What
9,11,56beace93aeaaa14008c91df,Super_Bowl_50,Super Bowl 50 was an American football game to...,Who won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ...",Denver Broncos,"{'text': ['Denver Broncos', 'Denver Broncos', ...",ok,2,16,Super,Who
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10561,10535,5737a0acc3c5551400e51f48,Force,Newton's laws and Newtonian mechanics in gener...,What didn't Newton's mechanics affext?,"{'text': ['three-dimensional objects', 'three-...",idealized point particles,"{'text': ['three-dimensional objects', 'three-...",nok,2,16,mechanics,What
10563,10540,5737a25ac3c5551400e51f52,Force,where is the relevant cross-sectional area fo...,What is used to calculate cross section area i...,"{'text': ['pressure terms', 'stress tensor', '...",stress-tensor,"{'text': ['pressure terms', 'stress tensor', '...",nok,6,11,volume,What
10564,10546,5737a5931c456719005744e7,Force,"where is the mass of the object, is the velo...",Where does centripetal force go?,{'text': ['toward the center of the curving pa...,"where is the mass of the object, is the velo...",{'text': ['toward the center of the curving pa...,nok,2,8,force,Where
10567,10562,5737a9afc3c5551400e51f63,Force,The connection between macroscopic nonconserva...,What is the exchange of heat associated with?,"{'text': ['nonconservative forces', 'nonconser...",macroscopic closed systems,"{'text': ['nonconservative forces', 'nonconser...",nok,4,32,associated,What
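
The `groupby` call above keys `data` on a boolean Series built from the filtered frame, so rows with `distances == -1` receive no key and appear in neither output, and the unpacking order depends on `False` sorting before `True`. The same partition can be written as two explicit filters. Below is a sketch with plain Python records; `split_by_distance` and the sample rows are hypothetical.

```python
def split_by_distance(rows, threshold=3):
    """Split rows into (lower, higher) by distance, skipping rows marked -1."""
    found = [r for r in rows if r["distances"] >= 0]
    lower = [r for r in found if r["distances"] <= threshold]
    higher = [r for r in found if r["distances"] > threshold]
    return lower, higher


# Hypothetical rows; -1 marks "no distance could be computed".
rows = [
    {"id": "q1", "distances": 2},
    {"id": "q2", "distances": 8},
    {"id": "q3", "distances": -1},
    {"id": "q4", "distances": 3},
]
lower, higher = split_by_distance(rows)
```

Naming both conditions explicitly removes the reliance on group ordering and makes the exclusion of `-1` rows visible in the code.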
