# Getting started with dataset

Using pandas dataframes to load the dataset

In [2]:
import pandas as pd

Download the dataset from URL and read it as a tab separated file in a pandas dataframe.
The dataset is of web queries with ten probable passages that could be answers to the query.
But only one of those passages is the correct answer.

In [3]:
# location of the dataset in filesystem
dataset_location = 'C:\ds\msaic\Starting Kit\\answers\\data.tsv'

dataset = pd.read_csv(dataset_location, sep='\t', names = ["id","query","passage","label","index"])

Checking out the size of dataset reveals...

In [5]:
dataset.shape

(5241880, 5)

... it's huge.

5+ million rows.

Since each query has exactly 10 passages,
this would mean that it has 5241880/10 = 524188 queries.

<img src="https://media1.tenor.com/images/077a38dbc8acdf55ef54ba3fd909bbb6/tenor.gif?itemid=10323653" align="left" style="height:25%; width: 25%"></img>

Let's see how a query looks

In [6]:
# printing the first 10 rows
dataset[0:10] 

Unnamed: 0,id,query,passage,label,index
0,0,. what is a corporation?,A company is incorporated in a specific nation...,0,0
1,0,. what is a corporation?,"Today, there is a growing community of more th...",0,1
2,0,. what is a corporation?,"Corporation definition, an association of indi...",0,2
3,0,. what is a corporation?,Examples of corporation in a Sentence. 1 He w...,0,3
4,0,. what is a corporation?,1: a government-owned corporation (as a utilit...,0,4
5,0,. what is a corporation?,McDonald's Corporation is one of the most reco...,1,5
6,0,. what is a corporation?,Corporations are owned by their stockholders (...,0,6
7,0,. what is a corporation?,An Association is an organized group of people...,0,7
8,0,. what is a corporation?,B Corp certification shines a light on the com...,0,8
9,0,. what is a corporation?,LLCs offer greater flexibility when it comes t...,0,9


Every row of the dataset has five columns:
#### 1. Query ID
This is an ID presumably given to each query by the data curators. It would also help them in evaluating the results in contests.
#### 2. Query
This is the actual query that users have entered in search engines.
#### 3. Passage
The passage is the probable webpage that is the best match for answering the query.
#### 4. Label
Label indicates if the passage is the right answer or not. 0 for wrong and 1 for right.
#### 5. Index
This just denotes the index of the passage. It will range from 0 to 9 for each query.

# Cleaning the data

We can notice how the query above was followed by a random period (.)

Just to ensure this doesn't trouble us while parsing, we'll strip unnecessary characters from the front and back of the query.

In [7]:
dataset['query'] = dataset['query'].str.strip(".,?! ")
dataset[0:10]

Unnamed: 0,id,query,passage,label,index
0,0,what is a corporation,A company is incorporated in a specific nation...,0,0
1,0,what is a corporation,"Today, there is a growing community of more th...",0,1
2,0,what is a corporation,"Corporation definition, an association of indi...",0,2
3,0,what is a corporation,Examples of corporation in a Sentence. 1 He w...,0,3
4,0,what is a corporation,1: a government-owned corporation (as a utilit...,0,4
5,0,what is a corporation,McDonald's Corporation is one of the most reco...,1,5
6,0,what is a corporation,Corporations are owned by their stockholders (...,0,6
7,0,what is a corporation,An Association is an organized group of people...,0,7
8,0,what is a corporation,B Corp certification shines a light on the com...,0,8
9,0,what is a corporation,LLCs offer greater flexibility when it comes t...,0,9


# Dividing the dataset

It is not wise to run any algorithm over the complete dataset at first.

Let's work on only 100,000 rows and see how good is the learning.

For that it must be first divided into three parts(60-20-20):

- Training set (60,000 rows)

- Cross validation set (20,000 rows)

- Test set (20,000 rows)

In [6]:
training_set = dataset[0:60000]
cv_set = dataset[60001:80000]
test_set = dataset[80000:100000]

# Parsing sentences

For parsing I chose Stanford CoreNLP, because it's reliable and I've used it before.

In [8]:
from stanfordcorenlp import StanfordCoreNLP

Also, importing JSON to interpret the parsing result.

In [9]:
import json

Must load the parser first. I downloaded CoreNLP from URL. Unzipped it and using the location here for loading.

In [10]:
# location of the downloaded file after unzipping
coreNLP_location = 'C:\ds\phrasebase\stanford-corenlp-full-2018-02-27'

parser = StanfordCoreNLP(coreNLP_location, quiet=True)
props = {'annotators': 'parse', 'pipelineLanguage': 'en'}

Let's see how it works. Running this for the first time will load the pipeline, hence it could consume some time.

In [11]:
text = "Why did the chicken cross the road?"

result = json.loads(parser.annotate(text, properties=props))
parsed_tree_string = result['sentences'][0]['parse'];
print(parsed_tree_string)

(ROOT
  (SBARQ
    (WHADVP (WRB Why))
    (SQ (VBD did)
      (NP (DT the) (NN chicken))
      (VP (VB cross)
        (NP (DT the) (NN road))))
    (. ?)))


There we have it, the parsed tree.

### Extracting wh-phrases

Since we want to guess the expected entity type in answer for a given wh-question, we'll define a method to extract the wh-phrase from a query.

Learning and testing will only happen for those rows that have a wh- question in query.

It is better to represent parsed tree into a proper tree instead of using a JSON string. That way we can chunk phrases using NLTK's tree library.

In [12]:
from nltk.tree import *

Now defining a method to chunk the parsed tree and extract wh-phrases (if any)

- The method will accept a parsed tree (not parsed JSON string), along with a phrase as arguments.
- The method will return those branches of the tree that contain given phrase as root.

In [13]:
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if (myTree.label() == phrase):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases

Testing the extraction method

In [14]:
# converting JSON to tree using NLTK's tree
parsed_tree = Tree.fromstring(parsed_tree_string)

ExtractPhrases(parsed_tree,"WHADVP")

[Tree('WHADVP', [Tree('WRB', ['Why'])])]

So it returned the wh-adverb from our query text. To see a list of all the constituent tags refer [Penn Treebank](http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html).

# Named Entity Recognition

While CoreNLP does recognise Named Entities, it's not accurate and offers fewer Named Entities compared to SpaCy.

So let's try SpaCy.

In [15]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
ner = en_core_web_sm.load()

Let's see how it works

In [16]:
text = "Sachin Tendulkar is an Indian cricketer. He was born in Mumbai."
text_ner = ner(text)

# iterating over all entities and printing the entity with its ontology
for entity in text_ner.ents:
    print(entity,' - ',entity.label_)

Sachin Tendulkar  -  PERSON
Indian  -  NORP
Mumbai  -  GPE


Refer [SpaCy Entities](https://spacy.io/usage/linguistic-features#entity-types) to get a description of all entity types.

# Creating a dictionary

The aim is to record all the wh-phrases in a dictionary against all the named entities that occur in the answer passage for that wh-phrase.

So let's define a dict for that

In [17]:
qa_dict = {}

Next, we need a method that would extract wh-phrases from query. The method should also populate our dict with that particular wh-phrase alongside an empty dictionary. This dictionary will maintain a score for all named entities expected in the answer passage for the query.

`CheckWh` method will take `query` string and `qa_dict` as arguments
- This method returns 0 if no wh-phrase found in phrase
- otherwise updates `qa_dict` with wh-phrase found and returns its value (answer named entities score)

In [18]:
def CheckWh(query, qa_dict):
    query = str(query)
    
    # removing all round parenthesis from the query, so that is doesn't trouble while parsing
    query = query.replace("("," ")
    query = query.replace(")"," ")
    
    # parsing the query
    r = json.loads(parser.annotate(query, properties=props))
    
    # we want to extract three types of wh-phrase
    # WHADVP, WHADJP, and WHNP
    
    # since it is a query, expecting only 1 sentence
    pADV = ExtractPhrases(Tree.fromstring(r["sentences"][0]['parse']),"WHADVP")
    # if found WHADVP, make it the wh-phrase
    p = pADV
    
    pADJ = ExtractPhrases(Tree.fromstring(r["sentences"][0]['parse']),"WHADJP")
    # if found WHADJP...
    if(len(pADJ) != 0):
        # ...and WHADVP was not found then make WHADJP the wh-phrase
        if(len(p) == 0):
            p = pADJ
        # else choose the one with fewer branch count
        elif(len(str(pADJ[0])) < len(str(p[0]))):
            p = pADJ
            
    pNP = ExtractPhrases(Tree.fromstring(r["sentences"][0]['parse']),"WHNP")
    # if found WHNP...
    if(len(pNP) != 0):
        # ...and WHADVP/WHADJ was not found then make WHNP the wh-phrase
        if(len(p) == 0):
            p = pNP
        # else choose the one with fewer branch count
        elif (len(str(pNP[0])) < len(str(p[0]))):
            p = pNP
            
    # update qa_dict if any wh-phrase found
    if(len(p) > 0):
        qphrase = ' '.join(word for word in p[0].leaves())
        if qphrase in qa_dict:
            # update existing entry
            qa_dict[qphrase]['size'] += 1 
        else:
            # new entry
            qa_dict[qphrase] = {}
            qa_dict[qphrase]['size'] = 1 
            qa_dict[qphrase]['answer_ner'] = {} # create empty value
        return qa_dict[qphrase]
    else:
        return 0

Testing the method

In [19]:
CheckWh("what is cricket", qa_dict)
qa_dict

{'what': {'size': 1, 'answer_ner': {}}}

# Training 

The dataset has 5+ million rows, which is just...

<img src="https://media1.tenor.com/images/4d88ca458dd87a78979baff245791a8e/tenor.gif?itemid=7357158" alt="Trump Yuge" align="left"></img>

It would be heartbreaking to learn that the algorithm is not learning well after building the feature for 5 million rows.
So it's best we first build out dictionary on a smaller subset, say 1 million rows. It's still big, but remember not all queries in search engines have a wh-phrase so the actual count of queries to learn from, will be far less.

In [20]:
subset = dataset[0:1000000]

Now a method `TrainFeature` to accept a dataframe as argument.
- This method checks for wh-phrase in the query using `CheckWh`
    - if found, then for the 10 rows(passages), all the named entities found in answer passages are rewarded with good score and those from incorrect passages are punished.
        - We'll reward the named entities with +1
        - and punish with -(1/9). This is to equalize the score to 0 if the named entity is present in all the incorrect passages and in the correct passage. That would indicate the entity is pretty useless and nothing to learn from it.
    - also, in the `qa_dict` we'll save/update this wh-phrase with the named entity scores
- If query doesn't have a wh-phrase then skip next 10 rows
- Also record the no. of wh-phrases in `wh_query_count` 

In [26]:
# this is the dictionary we are going to populate
qa_dict = {}

def TrainFeature(t):
    
    # to save the no. of wh-phrase in subset
    #wh_query_count = 0
    
    # NOTE: It is not the most elegant solution to just iterate over the dataframe in Pandas.
    # I know I'm committing a sacrilege but desperate times call for desperate measures. 
    # If you could suggest a more efficient and beautiful way of building this feature without looping then feel free to help.

    # iterator
    i=0

    # let's loop over each row of the dataframe (sorry!)
    while i < t.id.size:
        trow = t.iloc[i]
        # Query of 1 in every 10 rows will be considered for checking wh-phrase
        if(i%10 == 0):
            # check for wh-phrase in query, and populate if it has any...
            qdict_entry = CheckWh(trow['query'],qa_dict)
            
            # in case no wh-phrase was found...
            if qdict_entry == 0:
                # ...skip the 10 rows (passages)
                i += 10
                continue
            #else:
             #   wh_query_count += 1

        # wh-phrase was found, so time to assign scores to the named entities
        
        # if the passage is correct answer...
        if trow['label'] == 1:
            # get its named entities...
            passage_ner = ner(str(trow['passage']))
            # ...iterate over each entity
            for X in passage_ner.ents:
                # either update the wh-phrase dictionary...
                if X.label_ in qdict_entry['answer_ner']:
                    # reward the named entity
                    qdict_entry['answer_ner'][X.label_] += 1
                # ...or create a new entry for the wh-phrase
                else:    
                    qdict_entry['answer_ner'][X.label_] = {}
                    # and then reward the named entity
                    qdict_entry['answer_ner'][X.label_] = 1
            #print("query in qdict: ",qdict_entry)
        
        # ...otherwise passage is incorrect for the query
        else:
            # still... get named entities for the passage
            passage_ner = ner(str(trow['passage']))
            # iterate over all entities
            for X in passage_ner.ents:
                # either update the wh-phrase dictionary...
                if X.label_ in qdict_entry['answer_ner']:
                    # punish the named entity by decreasing the score
                    qdict_entry['answer_ner'][X.label_] -= (1/9)
                # ...or create a new entry for the wh-phrase
                else:    
                    qdict_entry['answer_ner'][X.label_] = {}
                    # and then punish the named entity
                    qdict_entry['answer_ner'][X.label_] = (1/9)
        
        # increment row
        i+=1
    #return wh_query_count    
print(qa_dict)

{}


Let's test the  `TrainFeature` method on 10 queries

In [39]:
# this is the dictionary we are going to populate
qa_dict = {}

# to save the no. of wh-phrase in subset
#wh_query_count = 0

#wh_query_count = 
TrainFeature(subset[0:200])

Number of wh-phrases in ten queries

In [23]:
#wh_query_count

3

Our dictionary has recorded three wh-phrases from the ten queries.
`size` attribute indicates the no. of times that particular wh-phrase was found
All the negative scores simply mean that named entity is not to be trusted for the wh-phrase
Checking out all the scores of named entities reveals the scales are messed up.

In [40]:
qa_dict

{'what': {'size': 1,
  'answer_ner': {'DATE': 0.1111111111111111,
   'CARDINAL': -1.2222222222222225,
   'ORG': 1.0,
   'PERSON': 0.0,
   'PERCENT': 0.1111111111111111,
   'NORP': 0.1111111111111111}},
 'how often': {'size': 1,
  'answer_ner': {'CARDINAL': 2.5555555555555554,
   'ORG': 0.7777777777777778,
   'DATE': -0.555555555555556,
   'GPE': -0.1111111111111111,
   'PERSON': 0.0,
   'TIME': -0.5555555555555556,
   'QUANTITY': 0.0,
   'LAW': 0.1111111111111111}},
 'how many': {'size': 3,
  'answer_ner': {'DATE': 0.6666666666666643,
   'TIME': -0.5555555555555556,
   'GPE': -2.111111111111112,
   'CARDINAL': -1.4444444444444455,
   'ORG': -0.5555555555555562,
   'PERSON': -0.8888888888888902,
   'NORP': 0.4444444444444445,
   'PERCENT': 0.1111111111111111,
   'QUANTITY': 0.1111111111111111,
   'LOC': 0.1111111111111111,
   'WORK_OF_ART': 0.1111111111111111,
   'PRODUCT': 0.1111111111111111}},
 'how much': {'size': 1,
  'answer_ner': {'PERSON': 0.1111111111111111,
   'MONEY': 2.666666

We need to normalize the scores. And even remove negative ones since they don't come from correct passages.

In [33]:
def NormalizeScore(qa_dict):
    qdict_norm = {}
    for k1, v1 in qa_dict.items():
        d = {}
        for k2,v2 in v1['answer_ner'].items():
            if v2 > 0.1:
                d[k2] = v2
        l = [v for k,v in d.items()]

        if len(l) > 0:
            max_ = max(l)
            for k,v in d.items():
                d[k] = v/max_
            if k1 in qdict_norm:
                qdict_norm[k1]['norm_ner'] = d
            else:
                qdict_norm[k1] = {}
                qdict_norm[k1]['norm_ner'] = d
    return qdict_norm
            
qdict_norm = NormalizeScore(qa_dict)

for k,v in qdict_norm.items():
    print("wh-phrase: ",k)
    for k1,v1 in v['norm_ner'].items():
        print(k1," ",v1)
    print("\n")

wh-phrase:  what
DATE   0.1111111111111111
ORG   1.0
PERCENT   0.1111111111111111
NORP   0.1111111111111111


wh-phrase:  how often
CARDINAL   1.0
ORG   0.30434782608695654
LAW   0.043478260869565216


wh-phrase:  how many
DATE   1.0
TIME   0.08333333333333337
NORP   0.33333333333333354
PERCENT   0.08333333333333337
QUANTITY   0.08333333333333337
LOC   0.08333333333333337
WORK_OF_ART   0.08333333333333337




The results are not that bad.
- 'what' is answered highly using an ORG entity
- 'how often' is answered highly using a CARDINAL entity
- 'how many' is answered highly using a DATE, not quite correct... but ok.


Time to train it over 100,000 queries i.e. 1 million rows

In [None]:
# this is the dictionary we are going to populate
qa_dict = {}

# to save the no. of wh-phrase in subset
#wh_query_count = 0

#wh_query_count = 
TrainFeature(subset)
#qa_dict_norm = NormalizeScore(qa_dict)

Let's check how many from the 100,000 queries contained a wh-phrase

In [34]:
#wh_query_count

Running this took a lot of time, so to be safe I'll use pickle to dump this dictionary.

In [35]:
import pickle

Saving as a pickle

In [None]:
output = open('qa_dict_1mil_queries_norm.pkl', 'wb')
pickle.dump(qa_dict_norm, output)
output.close()

<img src="https://media1.tenor.com/images/1756eb5631ade0eb64d57d256a5847f2/tenor.gif?itemid=9423244" align="left">

Read that pickle again... we'll use this to build a feature for training samples. 

In [36]:
#pkl_file = open('wh_q_ans_dict_1mil_norm.pkl', 'rb')
pkl_file = open('qa_dict_10k_queries_norm.pkl', 'rb')
qa_ner_dict = pickle.load(pkl_file)
pkl_file.close()

Number of unique wh-phrase is...

In [37]:
len(qa_ner_dict)

6249

So of all the 100,000 queries we parsed, we found 6249 unique wh-phrase. Shoot! I forgot to save the actual no. of queries that had a wh-phrase. Nevermind, 

In [38]:
qa_ner_dict

{'what': {'norm_ner': {'DATE': 0.2418223643619053,
   'ORG': 0.27209224036509844,
   'PERSON': 0.2820793356910595,
   'NORP': 0.24185041800009802,
   'GPE': 1.0,
   'QUANTITY': 0.20821410537015195,
   'LOC': 0.14618751052058074,
   'MONEY': 0.16905122594456373,
   'FAC': 0.023256466363745502,
   'ORDINAL': 0.01660775402573,
   'EVENT': 0.013521853784470026,
   'LANGUAGE': 0.007041463277805692}},
 'how often': {'norm_ner': {'CARDINAL': 0.37242798353911627,
   'ORG': 0.38271604938273435,
   'DATE': 1.0,
   'GPE': 0.13374485596708402,
   'PERSON': 0.034979423868313805,
   'LAW': 0.026748971193416664,
   'ORDINAL': 0.03703703703703852,
   'PRODUCT': 0.026748971193416633,
   'LOC': 0.010288065843622051,
   'FAC': 0.022633744855967947,
   'NORP': 0.2222222222222315,
   'WORK_OF_ART': 0.010288065843621796,
   'LANGUAGE': 0.01440329218107052}},
 'how many': {'norm_ner': {'DATE': 0.35156555772983117,
   'TIME': 0.00900195694715671,
   'GPE': 0.3557729941289829,
   'CARDINAL': 1.0,
   'PERSON': 