# Mandatory Session 7
Using ***nltk.ne_chunk*** and ***CoreNLPParser***

In [1]:
import pandas as pd
import csv
from scipy.stats   import pearsonr
from copy          import deepcopy
from nltk.metrics  import jaccard_distance
from nltk          import ne_chunk, pos_tag, word_tokenize
from nltk.tree     import Tree
from nltk.parse    import CoreNLPParser

First read the data sets using pandas ...

In [2]:
trial_path    = 'data/trial/STS.input.txt'
trial_gs_path = 'data/trial/STS.gs.txt'
trial_df      = pd.read_csv(trial_path, sep='\t', lineterminator='\n', names=['sentence0','sentence1'], header=None, quoting=csv.QUOTE_NONE)
trial_gs      = pd.read_csv(trial_gs_path, sep='\t', lineterminator='\n', names=['labels'], header=None, quoting=csv.QUOTE_NONE).iloc[::-1]
trial_df

Unnamed: 0,sentence0,sentence1
id1,The bird is bathing in the sink.,Birdie is washing itself in the water basin.\r
id2,"In May 2010, the troops attempted to invade Ka...",The US army invaded Kabul on May 7th last year...
id3,John said he is considered a witness but not a...,"""He is not a suspect anymore."" John said.\r"
id4,They flew out of the nest in groups.,They flew into the nest together.\r
id5,The woman is playing the violin.,The young lady enjoys listening to the guitar.\r
id6,John went horse back riding at dawn with a who...,Sunrise at dawn is a magnificent view to take ...


The preprocessing function only applies one of the named entity parser functions to the data. Some *real* preprocessing should also be done here, for example removing stopwords...

In [3]:
def preprocessing(data, ne_parser_function): # Named entity parser function is passed as an argument
    """ This preprocessing only uses the named entity parser functions. Generally a more complete function will be used """
    data = data.fillna('')
    for column in data.columns:
        data[column] = data[column].apply(ne_parser_function)
    return data
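
As a rough illustration of the extra preprocessing mentioned above, stopword removal could be chained after any of the entity parser functions. The sketch below is an assumption, not part of the session code: ````STOPWORDS```` is a tiny hand-written set (a real run would more likely use a full list such as the one in *nltk.corpus.stopwords*), and *remove_stopwords* is a hypothetical helper.

```python
# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {'the', 'a', 'an', 'is', 'in', 'of', 'to', 'at'}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that are stopwords; grouped entities such as 'may 2010' are kept."""
    return [token for token in tokens if token not in stopwords]

tokens = ['the', 'bird', 'is', 'bathing', 'in', 'the', 'sink']
print(remove_stopwords(tokens))  # ['bird', 'bathing', 'sink']
```

Such a helper could be composed with the function passed to *preprocessing*, e.g. ````lambda s: remove_stopwords(ne_parser_function(s))````.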

The next two functions calculate the similarities using the Jaccard distance over lists of words (the same as in Session 2) and the Pearson correlation coefficient...

In [4]:
def lexical_simmilarity(df):
    """ Calculate the similarities using the Jaccard distance """
    guess = pd.DataFrame()
    for i in df.index:
        set1 = set(df.loc[i,'sentence0'])
        set2 = set(df.loc[i,'sentence1'])
        guess.loc[i,'labels'] = 1. - jaccard_distance(set1, set2)
    return guess

def analyzeResults(results):
    """ Print similarities and pearson correlation coefficient """
    guess_lex = lexical_simmilarity(results)
    pearson    = pearsonr(trial_gs['labels'], guess_lex['labels'])[0]
    for column in results.columns:
        print('\n', column)
        results[column].apply(print)
    print()
    print('Similarities:\n', guess_lex)
    print()
    print('Pearson correlation index:', pearson)
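
To make the measure concrete: the similarity of a pair is the size of the intersection of the two token sets divided by the size of their union, which is what ````1. - jaccard_distance(set1, set2)```` evaluates to. The sketch below recomputes it in plain Python for the first pair, using the tokens as printed later in this session; *jaccard_similarity* is a hypothetical helper, not part of the session code.

```python
def jaccard_similarity(tokens1, tokens2):
    """|A ∩ B| / |A ∪ B| for two token lists treated as sets.

    Assumes at least one token overall (an empty union would divide by zero).
    """
    set1, set2 = set(tokens1), set(tokens2)
    return len(set1 & set2) / len(set1 | set2)

sentence0 = ['the', 'bird', 'is', 'bathing', 'in', 'the', 'sink']
sentence1 = ['birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin']

# Shared tokens: 'the', 'is', 'in' (3); the union has 11 distinct tokens.
print(round(jaccard_similarity(sentence0, sentence1), 6))  # 0.272727
```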

Now, the two *Named Entity Parser* functions are defined. First, the nltk parser, which receives as an argument the tree obtained from *nltk.ne_chunk*. The tree (chunked in binary mode, so entities are only marked as *NE*) is transformed into a flat list, where the words under the same named entity branch are joined together in a single string...

In [5]:
def nltk_parser(tree):
    """ Parse a tree obtained using the nltk ne_chunk parser """
    parsed_array = []
    for chunk in tree:
        if isinstance(chunk, Tree):  # Named entity subtree: join its words in one string
            word = ' '.join(leaf[0] for leaf in chunk)
            parsed_array.append(word.lower())
        else:  # Plain (word, pos_tag) pair
            parsed_array.append(chunk[0].lower())

    # Remove non-alphanumeric tokens; spaces inside grouped entities are allowed
    return [i for i in parsed_array if i.replace(' ', '').isalnum()]
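
To see what the flattening does without running the chunker, a tree can be built by hand. The example below is only a sketch: the sentence and its tags are made up, and *flatten_entities* is a hypothetical stand-in for the grouping step alone (it skips the alphanumeric filter).

```python
from nltk.tree import Tree

# Hand-built stand-in for the output of ne_chunk(..., binary=True);
# the sentence and POS tags are invented for illustration.
tree = Tree('S', [
    Tree('NE', [('John', 'NNP'), ('Smith', 'NNP')]),
    ('visited', 'VBD'),
    Tree('NE', [('Kabul', 'NNP')]),
    ('.', '.'),
])

def flatten_entities(tree):
    """Join the words under each NE subtree; keep other tokens as they are."""
    flat = []
    for chunk in tree:
        if isinstance(chunk, Tree):
            flat.append(' '.join(word for word, _ in chunk).lower())
        else:
            flat.append(chunk[0].lower())
    return flat

print(flatten_entities(tree))  # ['john smith', 'visited', 'kabul', '.']
```

Note that a grouped string such as ````'john smith'```` contains a space, so ````str.isalnum()```` returns ````False```` for it; any filter applied to grouped tokens has to account for that.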

On the other hand, the parser from *CoreNLP* is first created using the constructor (the *tagtype* argument selects the named entity tagger). The parser needs a *CoreNLP* server to be up; in this case the server is assumed to be running on the local machine. The server can be launched using the following command:

````java -mx4g -cp path_to\stanford-corenlp-full-2018-10-05\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000````

The parser function expects a list created by the *CoreNLPParser.tag* method, and for each run of consecutive words sharing a tag other than ````'O'```` it is meant to generate a single lowercase string grouping them, e.g. ````[('John', 'PERSON'), ('Smith', 'PERSON')] => ['john smith']````

In [6]:
parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')

def nlp_parser(tagged):
    """ Parse a list obtained using the CoreNLPParser.tag parser """
    parsed      = []
    last_tag    = None
    start_index = 0
    for index, node in enumerate(tagged):
        tag = node[1]
        if (tag == 'O' or tag != last_tag) and (start_index != index): # If end of block (All 'O' blocks are separated)
            subset      = tagged[start_index:index]            # Get the block
            tokens      = [pair[0].lower() for pair in subset] # Get the block lowercase word
            token       = ' '.join(tokens)                     # Join the block in a single string
            last_tag    = tag                                  
            start_index = index
            parsed.append(token)
    return parsed
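
One detail of the loop above is that a block is appended only when the following token starts a new block. As a result, the last block of a sentence is never emitted (in the outputs below, this is why the final period is missing), and an entity that opens the sentence is split after its first word because ````last_tag```` is still ````None````. A possible variant, sketched here with *itertools.groupby* and not taken from the session code, groups consecutive tokens with the same non-````'O'```` tag and also keeps the last block; *group_entities* is a hypothetical name.

```python
from itertools import groupby

def group_entities(tagged):
    """Group consecutive (word, tag) pairs that share a non-'O' tag.

    Words tagged 'O' stay as separate lowercase tokens; each run of equally
    tagged entity words becomes one lowercase, space-joined string.
    """
    grouped = []
    for tag, block in groupby(tagged, key=lambda pair: pair[1]):
        words = [word.lower() for word, _ in block]
        if tag == 'O':
            grouped.extend(words)            # keep plain words one by one
        else:
            grouped.append(' '.join(words))  # one string per entity run
    return grouped

tagged = [('In', 'O'), ('May', 'DATE'), ('2010', 'DATE'),
          (',', 'O'), ('the', 'O'), ('troops', 'O')]
print(group_entities(tagged))  # ['in', 'may 2010', ',', 'the', 'troops']
```

Unlike the loop above, this variant keeps the sentence-final token, so punctuation would need to be filtered separately if it is not wanted.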

Now it's time to run both taggers and check the correlation obtained between the sentences. First, using the *nltk.ne_chunk* function...

In [7]:
def nltk_operation(sentence):
    tokens = word_tokenize(sentence)
    taggs  = pos_tag(tokens)
    nes    = ne_chunk(taggs, binary = True)
    return nltk_parser(nes)

analyzeResults(
    preprocessing(deepcopy(trial_df), nltk_operation)
)


 sentence0
['the', 'bird', 'is', 'bathing', 'in', 'the', 'sink']
['in', 'may', '2010', 'the', 'troops', 'attempted', 'to', 'invade', 'kabul']
['john', 'said', 'he', 'is', 'considered', 'a', 'witness', 'but', 'not', 'a', 'suspect']
['they', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups']
['the', 'woman', 'is', 'playing', 'the', 'violin']
['john', 'went', 'horse', 'back', 'riding', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friends']

 sentence1
['birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin']
['the', 'us', 'army', 'invaded', 'kabul', 'on', 'may', '7th', 'last', 'year', '2010']
['he', 'is', 'not', 'a', 'suspect', 'anymore', 'john', 'said']
['they', 'flew', 'into', 'the', 'nest', 'together']
['the', 'young', 'lady', 'enjoys', 'listening', 'to', 'the', 'guitar']
['sunrise', 'at', 'dawn', 'is', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', 'wake', 'up', 'early', 'enough', 'for', 'it']

Similarities:
        labels
id1  0.272727
id2  0.250000


Here, it can be seen that there was no change in the tokenization of the sentences. Either *ne_chunk* is not powerful enough to distinguish named entities in this context, or the sentences have too little context to extract that information.

In any case, the Pearson correlation index and the similarity values are the same as the ones obtained in Session 2 using only Jaccard distances and word tokenization.

Moreover, if *ne_chunk* had considered that, for example, ````the troops```` is a single named entity, the token set would become smaller. As long as the merged words are not among the matching ones, the matching items then carry more weight in the calculation and the Jaccard distance between the pair of sentences decreases. But the same effect, an increase in similarity, would also be obtained with completely different sentences. It is probably safe to assume that the named entity parser alone is not enough to obtain a better correlation; instead, a more complex method that uses its output should be implemented.

Now, using the CoreNLPParser.tag function...

In [8]:
def corenlp_operation(sentence):  # Same pipeline as before, but tagging with the CoreNLP server
    tokens = word_tokenize(sentence)
    nes    = parser.tag(tokens)
    return nlp_parser(nes)

analyzeResults(
    preprocessing(deepcopy(trial_df), corenlp_operation)
)


 sentence0
['the', 'bird', 'is', 'bathing', 'in', 'the', 'sink']
['in', 'may 2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'kabul']
['john', 'said', 'he', 'is', 'considered', 'a', 'witness', 'but', 'not', 'a', 'suspect']
['they', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups']
['the', 'woman', 'is', 'playing', 'the', 'violin']
['john', 'went', 'horse', 'back', 'riding', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friends']

 sentence1
['birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin']
['the', 'us', 'army', 'invaded', 'kabul', 'on', 'may 7th last year , 2010']
['``', 'he', 'is', 'not', 'a', 'suspect', 'anymore', '.', "''", 'john', 'said']
['they', 'flew', 'into', 'the', 'nest', 'together']
['the', 'young', 'lady', 'enjoys', 'listening', 'to', 'the', 'guitar']
['sunrise', 'at', 'dawn', 'is', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', 'wake', 'up', 'early', 'enough', 'for', 'it']

Similarities:
        labels
id1  0.272727
id2  

In this case, it can be seen that some named entities were found (tagged as dates):
* ````may 2010````
* ````may 7th last year , 2010````

Surprisingly, this makes the similarity for the pair with id2 worse (and a worse Pearson correlation index is obtained). The cause is that grouping shortens both sets while removing some of the matching words: ````may```` and ````2010```` are now part of two different date strings, which do not match each other, so the number of shared tokens decreases.
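
The drop can be checked by hand from the token lists printed above. The sketch below recomputes the Jaccard similarity for the id2 pair before and after the dates are grouped; *jaccard_similarity* is a hypothetical helper equivalent to ````1. - jaccard_distance(set1, set2)````, and the values are recomputed here rather than copied from the session output.

```python
def jaccard_similarity(tokens1, tokens2):
    """|A ∩ B| / |A ∪ B| for two token lists treated as sets."""
    set1, set2 = set(tokens1), set(tokens2)
    return len(set1 & set2) / len(set1 | set2)

# id2 with plain word tokens (ne_chunk run above)
plain0 = ['in', 'may', '2010', 'the', 'troops', 'attempted', 'to', 'invade', 'kabul']
plain1 = ['the', 'us', 'army', 'invaded', 'kabul', 'on', 'may', '7th', 'last', 'year', '2010']

# id2 with the dates grouped (CoreNLP run above)
grouped0 = ['in', 'may 2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'kabul']
grouped1 = ['the', 'us', 'army', 'invaded', 'kabul', 'on', 'may 7th last year , 2010']

print(jaccard_similarity(plain0, plain1))      # 4 shared / 16 distinct = 0.25
print(jaccard_similarity(grouped0, grouped1))  # 2 shared / 14 distinct, about 0.1429
```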

A solution for this may be to accept two tokens as a match when they share at least one word (excluding stopwords).
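
A rough sketch of that idea follows. It is only one possible reading of the proposal: two tokens count as matching when they are equal or share at least one alphanumeric, non-stopword word, and the number of matches is taken as the smaller of the matched counts on each side. ````STOPWORDS````, *content_words*, *tokens_match* and *soft_jaccard* are hypothetical names, and the stopword set is a tiny hand-written one.

```python
# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {'the', 'a', 'an', 'is', 'in', 'of', 'to', 'on', 'at'}

def content_words(token):
    """Alphanumeric, non-stopword words inside a (possibly multi-word) token."""
    return {w for w in token.split() if w.isalnum() and w not in STOPWORDS}

def tokens_match(a, b):
    """Equal tokens match; otherwise they must share a content word."""
    return a == b or bool(content_words(a) & content_words(b))

def soft_jaccard(tokens1, tokens2):
    """Jaccard-like similarity where partially overlapping tokens also match."""
    set1, set2 = set(tokens1), set(tokens2)
    matched1 = sum(any(tokens_match(a, b) for b in set2) for a in set1)
    matched2 = sum(any(tokens_match(a, b) for a in set1) for b in set2)
    shared = min(matched1, matched2)
    return shared / (len(set1) + len(set2) - shared)

grouped0 = ['in', 'may 2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'kabul']
grouped1 = ['the', 'us', 'army', 'invaded', 'kabul', 'on', 'may 7th last year , 2010']

# 'may 2010' now matches the longer date string: 3 matches, 9 + 7 - 3 = 13
print(soft_jaccard(grouped0, grouped1))  # 3 / 13, about 0.2308
```

For single-word tokens this reduces to the plain Jaccard similarity, so only pairs containing grouped entities are affected.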