# Wikification
This notebook contains code for doing wikification, as well as evaluating it.

## Overview of our Wikification Method
The following flowchart describes how our wikification method works without much technical detail.
<img src="https://docs.google.com/drawings/d/19fInwE2C_fsAFiMNnIPe0cFldIHbJTHrXrLlLg6nbwI/pub?w=728&h=600">
<center><strong>Figure 1.</strong> A flowchart describing our wikification method at a relatively basic level.</center>

## With More Detail
In reference to figure 1.:

### 1.  Input Some Text
Self explanatory, just feed into the wikifier some text that is desired to be wikified. In the evaluation part of our code, the text comes pre-split from the datasets. We either keep the text split to focus more on our wikification, or join the text with spaces to evaluate while taking our own mention extraction into account.

### 2. Tokenize Text
The text is tokenized by a [Solr](http://lucene.apache.org/solr/) extension called [Solr Text Tagger](https://github.com/OpenSextant/SolrTextTagger), this tokenizer returns all potential mentions that it detects in the text. Our code is configured so that the tokenizer returns all overlaps. So if given the text: 'The United States of America', the tokenizer would return all of 'The United States', 'The United States of America', 'United States', and 'United States of America'. These overlaps are undesirable for our wikification purposes. However we choose to enable the overlaps so that we can obtain more potential mentions that we can deal with later more intelligently than the tokenizer can (without configuring it deeply). The overlaps are dealt with in the next step, though future work may make it better to deal with them later in the process.

### 3. Remove Overlaps
This part as-is is a work in progress. Currently our method is to first group all overlapping mentions into what we call overlap sets. Each overlap set is comprised of overlapping mentions that start at the same letter. The mention 'probability' of each mention is calculated at this time. The mention 'probability' is not truly a probability, it is defined as the amount of times the mention text is a mention in Wikipedia divided by the amount of documents it shows up in Wikipedia (it would be preferable to have the denominator be the total amount of times the mention text shows up in Wikipedia (to be an actual probability)). The mention with the highest 'probability' in each overlap set is the sole mention that is kept.

There of-course may still be overlaps remaining at this point, now the residual overlaps are to be dealt with. It is important to note that for the following part, the mentions are stored in the order that they appear in the text, by their beginning letters'. When we say the first mention we mean the mention that appears first in the text, and by next mention we mean the mention that appears next in the text. To deal with the residual overlaps we call the first mention the anchor, and all of the next mentions that start before the anchor mention ends, all get grouped together with the anchor mention in an overlap set. Just like before the most 'probable' mention in this set is kept, all others are discarded from the original set. Once again the first mention that in the updated original set is selected as the anchor, the same process is repeated. If the overlap set only contains the anchor mention, the whole process is repeated on the next mention. This process is repeated until there is no next mention to go to.

This step needs more investigation, perhaps the first part does not even need to be done.

### 4. Filter with POS Tags and Mention Probability
We use [Natural Language Toolkit (NLTK)](http://www.nltk.org/) to tag all of the mentions (though we should tag all of the text together to get more accurate results (update to come)). Using the POS tags helps us filter out bad mentions. Approximately 99% of all mentions in our datasets where either any type of noun, an adjective, or a cardinal number. The tags are displayed as 'NN', 'NNS', 'NNP', and 'NNPS' for nouns, 'JJ' for adjectives, and 'CD' for cardinal numbers.

In addition to filtering with POS tags, we also filter out any mention that have a 'probability' of being a mention of less than 0.001.

### 5. Candidate Generation
Now that the mentions are all extracted, we must generate a list of possible entities that each mention can refer to, we call these, entity candidates. To select n candidates, we first try selecting n/2 entities (Wikipedia page) that the given mention refers to most on Wikipedia. We refer to this measure as popularity. Once the n/2 most popular are selected, the remaining n/2 entities are selected based on the context similarity with other mentions in the same sentence. If selecting most popular fails to return n/2 results, we try selecting however many we need based on context. So if popularity returns 0 entities, we select n from the most contextual. This method of mixing popular and contextual candidates is called the hybrid method.

The hybrid method of candidate generation scores best on average on each of our datasets. Selecting candidates based on popularity alone initially seemed more promising, because it scored better on the largest dataset (wiki5000), thus giving an overall higher recall. But on each independent dataset, the popularity method only scored better on wiki5000, by about one percent, whereas other datasets scored worse, from 3 to 13 percent.

### 6. Candidate Scoring
For each mention, all of the candidates must be scored on some metric. The candidate with the best score will be selected as the proposed entity for the mention. All of these methods rely on basic scores such as the popularity of an entity given the mention, or some measure of similarity from the context of a mention to the document of a candidate. Individually some of these methods perform well on select datasets, but combined together using machine learning gives the best results overall. Using a learning to rank algorithm ([LambdaMART](https://github.com/jma127/pyltr)), we achieve a score better than any best idividual on each of our datasets.

#### Popularity
This method simply chooses the most popular candidate (most popular as described in section 5. Candidate Generation). This method performs very well but is undesirable due to the fact that it is just blindly guessing, and could be horribly wrong in some cases. See [this comic](https://comic.hmp.is.it/comic/rainy-days/) for an example of someone who does not quite get this concept.

#### Context 1
For this method, the sentence that contains the mention is extracted and called the context. The mention is removed from this context. We then use Solr to search for the most similar document (Wikipedia page) by searching in the document text field for the context, as well as searching in the document title field for the mention text. The set of documents that we are searching through in Solr is of-course limited to those that are the candidates of the mention. We do what is called boosting to make the results more weighted by the title field, the results from this are boosted by 1.35 (multiplied) on each document. The document with the highest score is deemed the most similar and is selected as the proposed entity for the mention.

#### Context 2
This method is slightly similar to Context 1 as it also uses Solr and it uses the sentence as a context in the same way. The difference is that we use a different index for this method. The index for this method, rather than containing whole documents (Wikipedia pages), contains all instances of all mentions with the surrounding context of each mention as a record. For example, a record could be for the mention 'David', the record will also have n (5 in this example) words before the mention: 'is a soccer player named', n words after the mention: 'he played for Manchester United', the Wikipedia page that the mention is in, and the Wikipedia page that the mention refers to. Using this index we search in the collection of all records that have the mention refer our candidate, for each candidate. The n words before and after are searched in for our context sentence, whichever entity has the highest number of relevant examples is selected as the proposed entity for that mention.

#### Word2Vec
For this method we have Word2Vec create a vector space model of concepts from a Wikipedia corpus. The entities, as well as regular words all have their own vector representation. To use this method, we select n words before and after the mention, and get the vector representation of each of these words. All of these vectors are added together to become a context vector. This context vector is compared to the vector representation of each of the candidates. The candidate vector that is most similar (by cosine similarity) to the context vector is selected as the proposed entity for the mention.

#### Coherence
This method uses the reverse page rank algorithm to determine which combination of candidates from all mentions makes the most sense together. This method looks at the quality of all of the proposed entities from all mentions together, instead of individually selecting the proposed entity for each individual mention.

# Wikification Evaluation Code
The code in this cell is used to evaluate the precision and recall of the wikification code as well as other wikification methods.

## Datasets

### KORE
* 50 records.
* Relatively small pieces of text with the main goal of being tricky for wikification systems.

### Aquaint
* 50 records.
* News.

### MSNBC
* 20 records.
* News.

### Wiki[n]
* n records (we usually use 500 or 5000).
* Opening paragraph of a variety of randomly selected Wikipedia articles.

### nopop
* 2304 records.
* Comprised of subsets of the other datasets.
* Only contains records where the most popular candidate is not the correct entity.

In [2]:
%%writefile wikification_eval.py 

"""
This is for testing performance of different wikification methods (Macro).
"""

from wikification import *
from IPython.display import clear_output
import copy
from datetime import datetime
import tagme
import os
import json

tagme.GCUBE_TOKEN = "f6c2ba6c-751b-4977-a94c-c140c30e9b92-843339462"
    

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# many different option for combonations of datasets for smaller tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
#datasets = [{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]

# 'popular', 'context1', 'context2', 'word2vec', 'coherence', 'tagme', 'multi'
#methods = ['multi']
methods = ['abc', 'bgc', 'etc', 'gbc', 'rfc', 'lsvc', 'svc', 'lmart']
# 'lmart', 'gbr', 'etr', 'rfr'
mlModels = 'lmart'

if 'word2vec' in methods or 'multi' in methods or True:
    try:
        word2vec
    except:
        word2vec = gensim_loadmodel('/users/cs/amaral/cgmdir/WikipediaClean5Negative300Skip10.Ehsan/WikipediaClean5Negative300Skip10')

doSplit = True
doManual = False

verbose = True

maxCands = 20

performances = {}

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset
    dataFile = open(dataset['path'], 'r')
    dataLines = []
    
    # put in all lines that contain proper ascii
    for line in dataFile:
        dataLines.append(json.loads(line.decode('utf-8').strip()))
        
    print dataset['name'] + '\n'
    
    # run each method on the data set
    for mthd in methods:
        print mthd
        print str(datetime.now()) + '\n'
        
        # reset counters
        totalPrecS = 0
        totalPrecM = 0
        totalRecS = 0
        totalRecM = 0
        totalF1S = 0
        totalF1M = 0
        totalLines = 0
        
        # each method tests all lines
        for line in dataLines:
            if verbose:
                print str(totalLines + 1)
            
            # get absolute text indexes and entity id of each given mention
            trueEntities = mentionStartsAndEnds(copy.deepcopy(line), forTruth = True) # the ground truth
            
            oData = copy.deepcopy(line)
            
            # get results for pre split string
            if doSplit and mthd <> 'tagme': # presplit no work on tagme
                # original split string with mentions given
                resultS = wikifyEval(copy.deepcopy(line), True, hybridC = True, maxC = maxCands, method = 'multi', model = mthd)
                precS = precision(trueEntities, resultS) # precision of pre-split
                recS = recall(trueEntities, resultS) # recall of pre-split
                try:
                    f1S = (2*precS*recS)/(precS+recS)
                except:
                    f1S = 0
                
                if verbose:
                    print 'Split: ' + str(precS) + ', ' + str(recS) + ', ' + str(f1S)
                
                # track results
                totalPrecS += precS
                totalRecS += recS
                totalF1S += f1S
                
                """j = 0
                for mention in oData['mentions']:
                    try:
                        print oData['text'][mention[0]].encode('utf-8') + ':  ' + mention[1] + ' --> ' + id2title(resultS[j][2])
                    except:
                        pass
                    j += 1"""
                
            else:
                totalPrecS = 0
                totalRecS = 0
                totalF1S = 0
                
            # get results for manually split string
            if doManual:
                # tagme has separate way to do things
                if mthd == 'tagme':
                    antns = tagme.annotate(" ".join(line['text']))
                    resultM = []
                    for an in antns.get_annotations(0.005):
                        resultM.append([an.begin,an.end,title2id(an.entity_title)])
                else:
                    # unsplit string to be manually split and mentions found
                    resultM = wikifyEval(" ".join(line['text']), False, maxC = maxCands, method = mthd)
                
                precM = precision(trueEntities, resultM) # precision of manual split
                recM = recall(trueEntities, resultM) # recall of manual split
                try:
                    f1M = (2*precM*recM)/(precM+recM)
                except:
                    f1M = 0
                
                if verbose:
                    print 'Manual: ' + str(precM) + ', ' + str(recM) + ', ' + str(f1M)
                    
                # track results
                totalPrecM += precM
                totalRecM += recM
                totalF1M += f1M
            else:
                totalPrecM = 0
                totalRecM = 0
                totalF1M = 0
                
            totalLines += 1
        
        # record results for this method on this dataset
        # [avg precision split, avg precision manual, avg recall split, avg recall manual]
        performances[dataset['name']][mthd] = {'S Prec':totalPrecS/totalLines, 
                                               'M Prec':totalPrecM/totalLines,
                                              'S Rec':totalRecS/totalLines, 
                                               'M Rec':totalRecM/totalLines,
                                               'S F1':totalF1S/totalLines,
                                               'M F1':totalF1M/totalLines
                                              }

with open('/users/cs/amaral/wikisim/wikification/wikification_results.txt', 'a') as resultFile:
    resultFile.write('\nmaxC: ' + str(maxCands) + '\n' + str(datetime.now()) + '\n\n')
    resultFile.write('Doing hybrid candidate generation before hybrid training.\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            if doSplit and doManual:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    S F1 :' + str(performances[dataset['name']][mthd]['S F1'])
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M F1 :' + str(performances[dataset['name']][mthd]['M F1']) + '\n')
            elif doSplit:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec']) 
                       + '\n    S F1 :' + str(performances[dataset['name']][mthd]['S F1']) + '\n')
            elif doManual:
                resultFile.write(mthd + ':'
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M F1 :' + str(performances[dataset['name']][mthd]['M F1']) + '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

Overwriting wikification_eval.py


In [7]:
%%writefile wikification_eval_micro.py 

"""
This is for testing performance of different wikification methods (Micro).
"""

from wikification import *
from IPython.display import clear_output
import copy
from datetime import datetime
import tagme
import os
import json

tagme.GCUBE_TOKEN = "f6c2ba6c-751b-4977-a94c-c140c30e9b92-843339462"
    

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# many different option for combonations of datasets for smaller tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}, {'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
#datasets = [{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]

# 'popular', 'context1', 'context2', 'word2vec', 'coherence', 'tagme'
methods = ['abc', 'bgc', 'etc', 'gbc', 'rfc', 'lsvc', 'svc', 'lmart']

if 'word2vec' in methods or 'multi' in methods or True:
    try:
        word2vec
    except:
        word2vec = gensim_loadmodel('/users/cs/amaral/cgmdir/WikipediaClean5Negative300Skip10.Ehsan/WikipediaClean5Negative300Skip10')

doSplit = True
doManual = False

verbose = True

maxCands = 20

performances = {}

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset
    dataFile = open(dataset['path'], 'r')
    dataLines = []
    
    # put in all lines that contain proper ascii
    for line in dataFile:
        dataLines.append(json.loads(line.decode('utf-8').strip()))
        
    print dataset['name'] + '\n'
    
    # run each method on the data set
    for mthd in methods:
        print mthd
        print str(datetime.now()) + '\n'
        
        # reset counters
        totalMentions = 0
        totalRightS = 0
        totalRightM = 0
        totalLines = 0
        
        # each method tests all lines
        for line in dataLines:
            if verbose:
                print str(totalLines + 1)
            
            # get absolute text indexes and entity id of each given mention
            trueEntities = mentionStartsAndEnds(copy.deepcopy(line), forTruth = True) # the ground truth
            
            oData = copy.deepcopy(line)
            
            totalMentions += len(trueEntities)
            
            # get results for pre split string
            if doSplit and mthd <> 'tagme': # presplit no work on tagme
                # original split string with mentions given
                resultS = wikifyEval(copy.deepcopy(line), True, maxC = maxCands, method = 'multi', hybridC = True, model = mthd)
                totalRightS += precision(trueEntities, resultS) * len(trueEntities)
                
                if verbose:
                    print 'Split: ' + str(totalMentions) + ', ' + str(totalRightS)
                
            # get results for manually split string
            if doManual:
                # tagme has separate way to do things
                if mthd == 'tagme':
                    antns = tagme.annotate(" ".join(line['text']))
                    resultM = []
                    for an in antns.get_annotations(0.005):
                        resultM.append([an.begin,an.end,title2id(an.entity_title)])
                else:
                    # unsplit string to be manually split and mentions found
                    resultM = wikifyEval(" ".join(line['text']), False, maxC = maxCands, method = mthd)
                
                totalRightM += precision(trueEntities, resultM) * len(trueEntities)
                
                if verbose:
                    print 'Manual: ' + str(totalMentions) + ', ' + str(totalRightM)
                
            totalLines += 1
        
        # record results for this method on this dataset
        # [avg precision split, avg precision manual, avg recall split, avg recall manual]
        performances[dataset['name']][mthd] = {'S F1':totalRightS/totalMentions,
                                               'M F1':totalRightM/totalMentions
                                              }

with open('/users/cs/amaral/wikisim/wikification/wikification_results.txt', 'a') as resultFile:
    resultFile.write('\nmaxC: ' + str(maxCands) + '\n' + str(datetime.now()) + '\n\n')
    resultFile.write('Doing hybrid candidate generation with new hybrid trained models.\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            if doSplit and doManual:
                resultFile.write(mthd + ':'
                       + '\n    S Micro F1 :' + str(performances[dataset['name']][mthd]['S F1'])
                       + '\n    M Micro F1 :' + str(performances[dataset['name']][mthd]['M F1']) + '\n')
            elif doSplit:
                resultFile.write(mthd + ':'
                       + '\n    S Micro F1 :' + str(performances[dataset['name']][mthd]['S F1']) + '\n')
            elif doManual:
                resultFile.write(mthd + ':'
                       + '\n    M Micro F1 :' + str(performances[dataset['name']][mthd]['M F1']) + '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

Overwriting wikification_eval_micro.py


In [None]:
%%writefile wikification_eval_bot.py 

"""
This is for testing performance of different wikification methods using BOT F1 score
as described here: http://cogcomp.cs.illinois.edu/papers/RRDA11.pdf.
"""

from __future__ import division
from wikification import *
from IPython.display import clear_output
import copy
from datetime import datetime
import tagme
import os
import json
from sets import Set

tagme.GCUBE_TOKEN = "f6c2ba6c-751b-4977-a94c-c140c30e9b92-843339462"
    

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# many different option for combonations of datasets for smaller tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
#datasets = [{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]

# 'popular', 'context1', 'context2', 'word2vec', 'coherence', 'tagme'
methods = ['context2']

if 'word2vec' in methods:
    try:
        word2vec
    except:
        word2vec = gensim_loadmodel('/users/cs/amaral/cgmdir/WikipediaClean5Negative300Skip10.Ehsan/WikipediaClean5Negative300Skip10')

doSplit = True
doManual = False

verbose = True

maxCands = 20

performances = {}

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset
    dataFile = open(dataset['path'], 'r')
    dataLines = []
    
    # put in all lines that contain proper ascii
    for line in dataFile:
        dataLines.append(json.loads(line.decode('utf-8').strip()))
        
    print dataset['name'] + '\n'
    
    # run each method on the data set
    for mthd in methods:
        print mthd
        print str(datetime.now()) + '\n'
        
        # reset counters
        totalPrecS = 0
        totalPrecM = 0
        totalRecS = 0
        totalRecM = 0
        totalBotF1S = 0
        totalBotF1M = 0
        totalLines = 0
        
        # each method tests all lines
        for line in dataLines:
            if verbose:
                print str(totalLines + 1)
            
            # get absolute text indexes and entity id of each given mention
            trueEntities = mentionStartsAndEnds(copy.deepcopy(line), forTruth = True) # the ground truth
            trueSet = Set()
            for truEnt in trueEntities:
                trueSet.add(truEnt[2])
            
            # get results for pre split string
            if doSplit and mthd <> 'tagme': # presplit no work on tagme
                # original split string with mentions given
                resultS = wikifyEval(copy.deepcopy(line), True, maxC = maxCands, method = mthd)
                spltSet = Set()
                for res in resultS:
                    spltSet.add(res[2])
                
                precS = len(trueSet & spltSet)/len(spltSet)
                recS = len(trueSet & spltSet)/len(trueSet)
                try:
                    f1 = (2*precS*recS)/(precS+recS)
                except:
                    f1 = 0
                
                if verbose:
                    print 'Split: ' + str(precS) + ', ' + str(recS) + ', ' + str(f1)
                
                # track results
                totalPrecS += precS
                totalRecS += recS
                totalBotF1S += f1
            else:
                totalPrecS = 0
                totalRecS = 0
                totalBotF1S = 0
                
            # get results for manually split string
            if doManual:
                # tagme has separate way to do things
                if mthd == 'tagme':
                    antns = tagme.annotate(" ".join(line['text']))
                    resultM = []
                    for an in antns.get_annotations(0.005):
                        resultM.append([an.begin,an.end,title2id(an.entity_title)])
                else:
                    # unsplit string to be manually split and mentions found
                    resultM = wikifyEval(" ".join(line['text']), False, maxC = maxCands, method = mthd)
                
                manSet = Set()
                for res in resultM:
                    manSet.add(res[2])
                
                precM = len(trueSet & manSet)/len(manSet)
                recM = len(trueSet & manSet)/len(trueSet)
                try:
                    f1 = (2*precM*recM)/(precM+recM)
                except:
                    f1 = 0
                
                if verbose:
                    print 'Manual: ' + str(precM) + ', ' + str(recM) + ', ' + str(f1)
                
                # track results
                totalPrecM += precM
                totalRecM += recM
                totalBotF1M += f1
            else:
                totalPrecM = 0
                totalRecM = 0
                totalBotF1M = 0
                
            totalLines += 1
        
        # record results for this method on this dataset
        # [avg precision split, avg precision manual, avg recall split, avg recall manual]
        performances[dataset['name']][mthd] = {'S Prec':totalPrecS/totalLines, 
                                               'M Prec':totalPrecM/totalLines,
                                              'S Rec':totalRecS/totalLines, 
                                               'M Rec':totalRecM/totalLines,
                                               'S BOT F1':totalBotF1S/totalLines,
                                               'M BOT F1':totalBotF1M/totalLines
                                              }

with open('/users/cs/amaral/wikisim/wikification/wikification_results.txt', 'a') as resultFile:
    resultFile.write('\nmaxC: ' + str(maxCands) + '\n' + str(datetime.now()) + '\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            if doSplit and doManual:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    S BOT F1 :' + str(performances[dataset['name']][mthd]['S BOT F1'])
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M BOT F1 :' + str(performances[dataset['name']][mthd]['M BOT F1'])+ '\n')
            elif doSplit:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    S BOT F1 :' + str(performances[dataset['name']][mthd]['S BOT F1'])+ '\n')
            elif doManual:
                resultFile.write(mthd + ':'
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M BOT F1 :' + str(performances[dataset['name']][mthd]['M BOT F1'])+ '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

In [1]:
#%%writefile mention_extraction_eval.py 

"""
This evaluates the quality of mention extraction
"""

from __future__ import division
import requests
import json
import os
from wikification import *
from datetime import datetime

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# short for quick tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]

performances = {}

verbose = True

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset
    dataFile = open(dataset['path'], 'r')
    dataLines = []
    for line in dataFile:
        dataLines.append(json.loads(line.decode('utf-8').strip()))
    
    # reset counters
    totalPrec = 0
    totalRec = 0
    totalF1 = 0
    totalLines = 0

    # each method tests all lines
    for line in dataLines:

        if(verbose):
            print str(totalLines + 1)

        trueMentions = mentionStartsAndEnds(line, True)
        myMentions = mentionStartsAndEnds(mentionExtract(" ".join(line['text'])))
        
        # put in right format
        for mention in myMentions:
            mention[0] = mention[2]
            mention[1] = mention[3]
            
        prec = mentionPrecision(trueMentions, myMentions)
        rec = mentionRecall(trueMentions, myMentions)
        try:
            f1 = (2*prec*rec)/(prec+rec)
        except:
            f1 = 0
        
        if(verbose):
            print str(prec) + ' ' + str(rec) + ' ' + str(f1) + '\n'

        # track results
        totalPrec += prec
        totalRec += rec
        totalF1 += f1
        totalLines += 1

    # record results for this method on this dataset
    performances[dataset['name']] = {'Precision':totalPrec/totalLines, 
                                     'Recall':totalRec/totalLines,
                                     'F1':totalF1/totalLines}
            
with open('/users/cs/amaral/wikisim/wikification/mention_extraction_results.txt', 'a') as resultFile:
    resultFile.write(str(datetime.now()) + '\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            resultFile.write(mthd + ':'
                   + '\n    Prec :' + str(performances[dataset['name']][mthd]['Precision'])
                   + '\n    Rec :' + str(performances[dataset['name']][mthd]['Recall'])
                   + '\n    F1 :' + str(performances[dataset['name']][mthd]['F1']) + '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

1
[[u'startOffset', 0, u'endOffset', 5, u'ids', [u"David's", u'David\u2019s', u'David', u'"David', u'"David"', u'David)', u'D\xe1vid', u'Dav\xedd', u'Davi\xf0', u'Dav\xed\xf0']], [u'startOffset', 6, u'endOffset', 9, u'ids', [u'and', u'"+" and "\u2212"']], [u'startOffset', 10, u'endOffset', 18, u'ids', [u'Victoria', u'\u2018Victoria\u2019', u'"Victoria"', u'Victoria.', u"Victoria's", u'Victoria\u2019s']], [u'startOffset', 19, u'endOffset', 24, u'ids', [u'named']], [u'startOffset', 25, u'endOffset', 30, u'ids', [u'their', u'"their"']], [u'startOffset', 25, u'endOffset', 39, u'ids', [u'their children']], [u'startOffset', 31, u'endOffset', 39, u'ids', [u'children', u"/children's", u"children's", u'children\u2019s']], [u'startOffset', 40, u'endOffset', 48, u'ids', [u'Brooklyn', u'"Brooklyn"', u'Brooklyn-', u'"Brooklyn\'s"', u"Brooklyn's"]], [u'startOffset', 51, u'endOffset', 56, u'ids', [u'Romeo', u'"Romeo"', u'Romeo!', u'Rom\xe9o']], [u'startOffset', 59, u'endOffset', 63, u'ids', [u'Cruz']

KeyboardInterrupt: 

# The Main Wikification Code
The code in this cell contains all of the logic to do wikification.

In [2]:
%%writefile wikification.py 

from __future__ import division
import sys
import pickle
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier
sys.path.append('./pyltr/')
import pyltr
sys.path.append('../wikisim/')
from wikipedia import *
from operator import itemgetter
import requests
import json
import nltk
import scipy as sp
import scipy.sparse as sprs
import scipy.spatial
import scipy.sparse.linalg
from calcsim import *
sys.path.append('../')
from wsd.wsd import *
import numpy as np

MIN_MENTION_LENGTH = 3 # mentions must be at least this long
MIN_FREQUENCY = 20 # anchor with frequency below is ignored

with open('/users/cs/amaral/wikisim/wikification/pos-filter-out-nonmentions.txt', 'r') as srcFile:
    posFilter = srcFile.read().splitlines()

def get_solr_count(s):
    """ Gets the number of documents the string occurs 
        NOTE: Multi words should be quoted
    Arg:
        s: the string (can contain AND, OR, ..)
    Returns:
        The number of documents
    """

    q='+text:(\"%s\")'%(s,)
    qstr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'indent':'on', 'wt':'json', 'q':q, 'rows':0}
    r = requests.get(qstr, params=params)
    try:
        if 'response' not in r.json():
            return 0
        else:
            return r.json()['response']['numFound']
    except:
        return 0

def get_mention_count(s):
    """
    Description:
        Returns the amount of times that the given string appears as a mention in wikipedia.
    Args:
        s: the string (can contain AND, OR, ..)
    Return:
        The amount of times the given string appears as a mention in wikipedia
    """
    
    result = anchor2concept(s)
    rSum = 0
    for item in result:
        rSum += item[1]
        
    return rSum

def mentionProb(text):
    """
    Description:
        Returns the probability that the text is a mention in Wikipedia.
    Args:
        text: 
    Return:
        The probability that the text is a mention in Wikipedia.
    """
    
    totalMentions = get_mention_count(text)
    totalAppearances = get_solr_count(text.replace(".", ""))
    
    if totalAppearances == 0:
        return 0 # a mention never used probably is not a good link
    else:
        return totalMentions/totalAppearances
    
def normalize(nums):
    """Normalizes a list of nums to its sum + 1"""
    
    numSum = sum(nums) + 1 # get max
    
    # fill with normalized
    normNums = []
    for num in nums:
        normNums.append(num/numSum)
        
    return normNums

def destroyExclusiveOverlaps(textData):
    """
    Description:
        Removes all overlaps that start at same letter from text data, so that only the best mention in an
        overlap set is left.
    Args:
        textData: [[start, end, text, anchProb],...]
    Return:
        textData minus the unesescary elements that overlap.
    """
    
    newTextData = [] # textData minus the unesescary parts of the overlapping
    overlappingSets = [] # stores arrays of the indexes of overlapping items from textData
    
    # creates the overlappingSets array
    i = 0
    while i < len(textData)-1:
        # even single elements considered overlapping set
        # this is root of overlapping set
        overlappingSets.append([i])
        overlapIndex = len(overlappingSets) - 1
        theBegin = textData[i][0]
        
        # look at next words until not overlap
        for j in range(i+1, len(textData)):
            # if next word starts before endiest one ends
            if textData[j][0] == theBegin:
                overlappingSets[overlapIndex].append(j)
                i = j # make sure not to repeat overlap set
            else:
                # add final word
                if j == len(textData) - 1:
                    overlappingSets.append([j])
                break
        i += 1
                    
    # get only the best overlapping element of each set
    for oSet in overlappingSets:
        bestIndex = 0
        bestScore = -1
        for i in oSet:
            score = mentionProb(textData[i][2])
            if score > bestScore:
                bestScore = score
                bestIndex = i
        
        # put right item in new textData
        newTextData.append(textData[bestIndex])
        
    return newTextData

def destroyResidualOverlaps(textData):
    """
    Description:
        Removes all overlaps from text data, so that only the best mention in an
        overlap set is left.
    Args:
        textData: [[start, end, text, anchProb],...]
    Return:
        textData minus the unesescary elements that overlap.
    """
    
    newTextData = [] # to be returned
    oSet = [] # the set of current overlaps
    rootWIndex = 0 # the word to start looking from for finding root word
    rEnd = 0 # the end index of the root word
    
    # keep looping as long as overlaps
    while True:
        oSet = []
        oSet.append(textData[rootWIndex])
        for i in range(rootWIndex + 1, len(textData)):
            # if cur start before root end
            if textData[i][0] < textData[rootWIndex][1]:
                oSet.append(textData[i])
            else:
                break # have all overlap words

        
        bestIndex = 0
        # deal with the overlaps
        if len(oSet) > 1:
            bestProb = 0
            
            # choose the most probable
            i = 0
            for mention in oSet:
                prob = mentionProb(mention[2])
                if prob > bestProb:
                    bestProb = prob
                    bestIndex = i
                i += 1
        else:
            rootWIndex += 1 # move up one if no overlaps
                
        # remove from old text data all that is not best
        for i in range(0, len(oSet)):
            if i <> bestIndex:
                textData.remove(oSet[i])
                
        # add the best to new
        if not (oSet[bestIndex] in newTextData):
            newTextData.append(oSet[bestIndex])
            
        if rootWIndex >= len(textData):
            break
    
    return newTextData
    
def mentionStartsAndEnds(textData, forTruth = False):
    """
    Description:
        Takes in a list of mentions and turns each of its mentions into the form: [wIndex, start, end]. 
        Or if forTruth is true: [[start,end,entityId]]
    Args:
        textData: {'text': [w1,w2,w3,...] , 'mentions': [[wordIndex,entityTitle],...]}, to be transformed 
            as described above.
        forTruth: Changes form to use.
    Return:
        The mentions in the form [[wIndex, start, end],...]]. Or if forTruth is true: [[start,end,entityId]]
    """
    
    curWord = 0 
    curStart = 0
    for mention in textData['mentions']:
        while curWord < mention[0]:
            curStart += len(textData['text'][curWord]) + 1
            curWord += 1
            
        ent = mention[1] # store entity title in case of forTruth
        mention.pop() # get rid of entity text
        
        if forTruth:
            mention.pop() # get rid of wIndex too
            
        mention.append(curStart) # start of the mention
        mention.append(curStart + len(textData['text'][curWord])) # end of the mention
        
        if forTruth:
            mention.append(title2id(ent)) # put on entityId
    
    return textData['mentions']
     
def mentionExtract(text):
    """
    Description:
        Takes in a text and splits it into the different words/mentions.
    Args:
        phrase: The text to be split.
    Return:
        The text split it into the different words / mentions: 
        {'text':[w1,w2,...], 'mentions': [[wIndex,begin,end],...]}
    """
    
    addr = 'http://localhost:8983/solr/enwikianchors20160305/tag'
    params={'overlaps':'ALL', 'tagsLimit':'5000', 'fl':'id','wt':'json','indent':'on'}
    r = requests.post(addr, params=params, data=text.encode('utf-8'))
    textData0 = r.json()['tags']
    
    splitText = [] # the text now in split form
    mentions = [] # mentions before remove inadequate ones
    
    textData = [] # [[begin,end,word,anchorProb],...]
    
    print textData0
    
    i = 0 # for wordIndex
    # get rid of extra un-needed Solr data, and add in anchor probability
    for item in textData0:
        totalMentions = get_mention_count(text[item[1]:item[3]])
        totalAppearances = get_solr_count(text[item[1]:item[3]].replace(".", ""))
        if totalAppearances == 0:
            anchorProb = 0
        else:
            anchorProb = totalMentions/totalAppearances
        # put in the new clean textData
        textData.append([item[1], item[3], text[item[1]:item[3]], anchorProb, i])
        i += 1
        
        # also fill split text
        splitText.append(text[item[1]:item[3]])
    
    # get rid of overlaps
    textData = destroyExclusiveOverlaps(textData)
    textData = destroyResidualOverlaps(textData)
        
    # gets the POS labels for the words
    postrs = []
    for item in textData:
        postrs.append(item[2])
    postrs = nltk.pos_tag(postrs)
    
    for i in range(0,len(textData)):
        textData[i].append(postrs[i]) # [5][1] is index of type of word
    
    mentionPThrsh = 0.001 # for getting rid of unlikelies
    
    # put in only good mentions
    i = 0
    for item in textData:
        if i == 0:
            bef = 'NONE'
        else:
            bef = textData[i-1][5][1] # pos tag of before
        if i == len(textData) - 1:
            aft = 'NONE'
        else:
            aft = textData[i+1][5][1] # pos tag of after
        befaft = " : ".join([bef,aft])
        
        if (item[3] >= mentionPThrsh # if popular enough, and either some type of noun or JJ or CD
                and (item[5][1][0:2] == 'NN' or item[5][1] == 'JJ' or item[5][1] == 'CD')):
            mentions.append([item[4], item[0], item[1]]) # wIndex, start, end
        i += 1
    
    # get in same format as dataset provided data
    newTextData = {'text':splitText, 'mentions':mentions}
    
    return newTextData

def getMentionsInSentence(textData, mainWord):
    """
    Description:
        Finds all mentions that are in the same sentence as mainWord.
    Args:
        textData: A text in split form along with its suspected mentions.
        mainWord: The index of the word that is in the wanted sentence
    Return:
        A list of mention texts that are in the same sentence as mainWord
    """
    
    sents = nltk.sent_tokenize(" ".join(textData['text']))
    
    # start and end of sentences (absolute)
    sStart = 0
    sEnd = 0
    
    mentionStrs = [] # the mentions
    
    curEnd = 0
    for sent in sents:
        curEnd += len(sent)
        # if sentence ends after mention starts
        if curEnd > mainWord[1]:
            sEnd = curEnd
            sStart = sEnd - len(sent)
            mWIndex = textData['mentions'].index(mainWord) # index of mainWord
            
            # add every mention before main in sent to mentionsStr
            for i in range(mWIndex-1, -1, -1):
                if textData['mentions'][i][2] > sStart:
                    mentionStrs.append(textData['text'][textData['mentions'][i][0]])
                else:
                    break
                    
            # add every mention after main in sent to mentionsStr
            for i in range(mWIndex+1, len(textData['mentions'])):
                if textData['mentions'][i][1] < sEnd:
                    mentionStrs.append(textData['text'][textData['mentions'][i][0]])
                else:
                    break
            
            break
    
    return " ".join(mentionStrs).strip()

def generateCandidates(textData, maxC, hybrid = False):
    """
    Description:
        Generates up to maxC candidates for each possible mention word in phrase.
    Args:
        textData: A text in split form along with its suspected mentions.
        maxC: The max amount of candidates to accept.
        Hybrid: Whether to include best context fitting results too.
    Return:
        The top maxC candidates for each possible mention word in textData. Each 
        mentions has its candidates of the form: [(wikiId, popularity),...]
    """
    
    candidates = []
    
    ctxC0 = 0 # the amount of candidates to fill from best context.
    if hybrid == True:
        popC = int(maxC/2) + 1 # get ceil
        ctxC0 = maxC - popC
    else:
        popC = maxC
    
    for mention in textData['mentions']:
        resultT = sorted(anchor2concept(textData['text'][mention[0]]), key = itemgetter(1), 
                          reverse = True)[:popC]
        results = [list(item) for item in resultT]
        
        # get the right amount to fill with context 
        if len(results) < popC and hybrid == True:
            # fill in rest with context
            ctxC = maxC - len(results)
        elif hybrid == True:
            ctxC = ctxC0
        else:
            ctxC = 0
            
        # get some context results from solr
        if ctxC > 0:
            mentionStr = escapeStringSolr(textData['text'][mention[0]])
            ctxStr = escapeStringSolr(getMentionsInSentence(textData, mention))
            
            strIds = ['-id:' +  str(res[0]) for res in results]
            
            # select all the docs from Solr with the best scores, highest first.
            addr = 'http://localhost:8983/solr/enwiki20160305/select'
            
            if len(ctxStr) == 0:
                params={'fl':'id', 'indent':'on', 'fq':" ".join(strIds),
                        'q':'title:(' + mentionStr.encode('utf-8')+')^5',
                        'wt':'json', 'rows': str(ctxC)}
            else:
                params={'fl':'id', 'indent':'on', 'fq':" ".join(strIds),
                        'q':'title:(' + mentionStr.encode('utf-8') + ')^5'
                        + ' text:(' + ctxStr.encode('utf-8') + ')',
                        'wt':'json', 'rows':str(ctxC)}
            
            r = requests.get(addr, params = params)
            try:
                if ('response' in r.json() 
                        and 'docs' in r.json()['response']
                        and len(r.json()['response']['docs']) > 0):
                    for doc in r.json()['response']['docs']:
                        # get popularity of entity given the mention
                        popularity = 0
                        thingys = anchor2concept(textData['text'][mention[0]])
                        for thingy in thingys:
                            if thingy[0] == long(doc['id']):
                                popularity = thingy[1]
                                break
                        
                        results.append([long(doc['id']), popularity])
            except:
                pass
            
        candidates.append(results[:maxC]) # take up to maxC of the results
    
    return candidates

def precision(truthSet, mySet):
    """
    Description:
        Calculates the precision of mySet against the truthSet.
    Args:
        truthSet: The 'right' answers for what the entities are. [[start,end,id],...]
        mySet: My code's output for what it thinks the right entities are. [[start,end,id],...]
    Return:
        The precision: (# of correct entities)/(# of found entities)
    """
    
    numFound = len(mySet)
    numCorrect = 0 # incremented in for loop
    
    truthIndex = 0
    myIndex = 0
    
    while truthIndex < len(truthSet) and myIndex < len(mySet):
        if mySet[myIndex][0] < truthSet[truthIndex][0]:
            if mySet[myIndex][1] > truthSet[truthIndex][0]:
                # overlap with mine behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine not even reach truth
                myIndex += 1
                
        elif mySet[myIndex][0] == truthSet[truthIndex][0]:
            # same mention (same start atleast)
            if truthSet[truthIndex][2] == mySet[myIndex][2]:
                numCorrect += 1
                truthIndex += 1
                myIndex += 1
            elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                # truth ends first
                truthIndex += 1
            else:
                # mine ends first
                myIndex += 1
                  
        elif mySet[myIndex][0] > truthSet[truthIndex][0]:
            if mySet[myIndex][0] < truthSet[truthIndex][1]:
                # overlap with truth behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine beyond mention, increment truth
                truthIndex += 1

    #print 'correct: ' + str(numCorrect) + '\nfound: ' + str(numFound)
    if numFound == 0:
        return 0
    else:
        return (numCorrect/numFound)

def recall(truthSet, mySet):
    """
    Description:
        Calculates the recall of mySet against the truthSet.
    Args:
        truthSet: The 'right' answers for what the entities are. [[start,end,id],...]
        mySet: My code's output for what it thinks the right entities are. [[start,end,id],...]
    Return:
        The recall: (# of correct entities)/(# of actual entities)
    """
    
    numActual = len(truthSet)
    numCorrect = 0 # incremented in for loop)
    
    truthIndex = 0
    myIndex = 0
    
    while truthIndex < len(truthSet) and myIndex < len(mySet):
        if mySet[myIndex][0] < truthSet[truthIndex][0]:
            if mySet[myIndex][1] > truthSet[truthIndex][0]:
                # overlap with mine behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine not even reach truth
                myIndex += 1
                
        elif mySet[myIndex][0] == truthSet[truthIndex][0]:
            # same mention (same start atleast)
            if truthSet[truthIndex][2] == mySet[myIndex][2]:
                numCorrect += 1
                truthIndex += 1
                myIndex += 1
            elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                # truth ends first
                truthIndex += 1
            else:
                # mine ends first
                myIndex += 1
                  
        elif mySet[myIndex][0] > truthSet[truthIndex][0]:
            if mySet[myIndex][0] < truthSet[truthIndex][1]:
                # overlap with truth behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine beyond mention, increment truth
                truthIndex += 1
                
    if numActual == 0:
        return 0
    else:
        return (numCorrect/numActual)
    
def mentionPrecision(trueMentions, otherMentions):
    """
    Description:
        Calculates the precision of otherMentions against the trueMentions.
    Args:
        trueMentions: The 'right' answers for what the mentions are.
        otherMentions: Our mentions obtained through some means.
    Return:
        The precision: (# of correct mentions)/(# of found mentions)
    """
    
    numFound = len(otherMentions)
    numCorrect = 0 # incremented in for loop
    
    trueIndex = 0
    otherIndex = 0
    
    while trueIndex < len(trueMentions) and otherIndex < len(otherMentions):
        # if mentions start and end on the same
        if (trueMentions[trueIndex][0] == otherMentions[otherIndex][0]
               and trueMentions[trueIndex][1] == otherMentions[otherIndex][1]):
            #print ('MATCH: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <===> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            numCorrect += 1
            trueIndex += 1
            otherIndex += 1
        # if true mention starts before the other starts
        elif trueMentions[trueIndex][0] < otherMentions[otherIndex][0]:
            #print ('FAIL: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <XXX> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            trueIndex += 1
        # if other mention starts before the true starts (same doesnt matter)
        elif trueMentions[trueIndex][0] >= otherMentions[otherIndex][0]:
            #print ('FAIL: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <XXX> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            otherIndex += 1
        else:
            print 'AAAAAAAHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!'

    #print 'correct: ' + str(numCorrect) + '\nfound: ' + str(numFound)
    if numFound == 0:
        return 0
    else:
        return (numCorrect/numFound)

def mentionRecall(trueMentions, otherMentions):
    """
    Description:
        Calculates the recall of otherMentions against the trueMentions.
    Args:
        trueMentions: The 'right' answers for what the mentions are.
        otherMentions: Our mentions obtained through some means.
    Return:
        The recall: (# of correct entities)/(# of actual entities)
    """
    
    numActual = len(trueMentions)
    numCorrect = 0 # incremented in for loop)
    
    trueIndex = 0
    otherIndex = 0
    
    while trueIndex < len(trueMentions) and otherIndex < len(otherMentions):
        # if mentions start and end on the same
        if (trueMentions[trueIndex][0] == otherMentions[otherIndex][0]
               and trueMentions[trueIndex][1] == otherMentions[otherIndex][1]):
            numCorrect += 1
            trueIndex += 1
            otherIndex += 1
        # if true mention starts before the other starts
        elif trueMentions[trueIndex][0] < otherMentions[otherIndex][0]:
            trueIndex += 1
        # if other mention starts before the true starts (same doesnt matter)
        elif trueMentions[trueIndex][0] >= otherMentions[otherIndex][0]:
            otherIndex += 1
        
    print 'correct: ' + str(numCorrect) + '\nactual: ' + str(numActual)
    if numActual == 0:
        return 0
    else:
        return (numCorrect/numActual)
    
def getSurroundingWords(text, mIndex, window, asList = False):
    """
    Description:
        Returns the words surround the given mention. Expanding out window elements
        on both sides.
    Args:
        text: A list of words.
        mIndex: The index of the word that is the center of where to get surrounding words.
        window: The amount of words to the left and right to get.
        asList: Whether to return the words as a list, otherwise just a string.
    Return:
        The words that surround the given mention. Expanding out window elements
        on both sides.
    """
    
    imin = mIndex - window
    imax = mIndex + window + 1
    
    # fix extreme bounds
    if imin < 0:
        imin = 0
    if imax > len(text):
        imax = len(text)
        
    if asList == True:
        words = (text[imin:mIndex] + text[mIndex+1:imax])
    else:
        words = " ".join(text[imin:mIndex] + text[mIndex+1:imax])
    
    # return surrounding part of word minus the mIndex word
    return words

def getMentionSentence(text, mention, asList = False):
    """
    Description:
        Returns the sentence of the mention, minus the mention.
    Args:
        text: The text to get the sentence from.
        index: The mention.
        asList: Whether to return the words as a list, otherwise just a string.
    Return:
        The sentence of the mention, minus the mention.
    """
    
    # the start and end indexes of the sentence
    sStart = 0
    sEnd = 0
    
    # get sentences using nltk
    sents = nltk.sent_tokenize(text)
    
    # find sentence that mention is in
    curLen = 0
    for s in sents:
        curLen += len(s)
        # if greater than begin of mention
        if curLen > mention[1]:
            # remove mention from string to not get bias from self referencing article
            if asList == True:
                sentence = (s.replace(text[mention[1]:mention[2]],"")).split(" ")
            else:
                sentence = s.replace(text[mention[1]:mention[2]],"")
            
            return sentence
        
    # in case it missed
    if asList == True:
        return []
    else:
        return ""

def escapeStringSolr(text):
    """
    Description:
        Escapes a given string for use in Solr.
    Args:
        text: The string to escape.
    Return:
        The escaped text.
    """
    
    text = text.replace("\\", "\\\\\\")
    text = text.replace('+', r'\+')
    text = text.replace("-", "\-")
    text = text.replace("&&", "\&&")
    text = text.replace("||", "\||")
    text = text.replace("!", "\!")
    text = text.replace("(", "\(")
    text = text.replace(")", "\)")
    text = text.replace("{", "\{")
    text = text.replace("}", "\}")
    text = text.replace("[", "\[")
    text = text.replace("]", "\]")
    text = text.replace("^", "\^")
    text = text.replace("\"", "\\\"")
    text = text.replace("~", "\~")
    text = text.replace("*", "\*")
    text = text.replace("?", "\?")
    text = text.replace(":", "\:")
    
    return text

def getContext1Scores(mentionStr, context, candidates):
    """
    Description:
        Uses Solr to find the relevancy scores of the candidates based on the context.
    Args:
        mentionStr: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The score for each candidate in the same order as the candidates.
    """
    
    candScores = []
    for i in range(len(candidates)):
        candScores.append(0)
    
    # put text in right format
    context = escapeStringSolr(context)
    mentionStr = escapeStringSolr(mentionStr)
    
    strIds = ['id:' +  str(strId[0]) for strId in candidates]
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'fl':'id score', 'fq':" ".join(strIds), 'indent':'on',
            'q':'text:('+context.encode('utf-8')+')^1 title:(' + mentionStr.encode('utf-8')+')^1.35',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    try:
        # assign the scores
        for doc in r.json()['response']['docs']:
            # find candidate of doc
            i = 0
            for cand in candidates:
                if cand[0] == long(doc['id']):
                    candScores[i] = doc['score']
                    break
                i += 1
    except:
        # keep zero scores
        pass
            
    return candScores

def bestContext1Match(mentionStr, context, candidates):
    """
    Description:
        Uses Solr to find the candidate that gives the highest relevance when given the context.
    Args:
        mentionStr: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best relevance score from the context.
    """
    
    # put text in right format
    context = escapeStringSolr(context)
    mentionStr = escapeStringSolr(mentionStr)
    
    strIds = ['id:' +  str(strId[0]) for strId in candidates]
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'fl':'id score', 'fq':" ".join(strIds), 'indent':'on',
            'q':'text:('+context.encode('utf-8')+')^1 title:(' + mentionStr.encode('utf-8')+')^1.35',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    if 'response' not in r.json():
        return 0 # default to most popular
    
    if 'docs' not in r.json()['response']:
        return 0
    
    results = r.json()['response']['docs']
    if len(results) == 0:
        return 0 # default to most popular
    
    bestId = long(r.json()['response']['docs'][0]['id'])
    
    #for doc in r.json()['response']['docs']:
        #print '[' + id2title(doc['id']) + '] -> ' + str(doc['score'])
    
    # find which index has bestId
    bestIndex = 0
    for cand in candidates:
        if cand[0] == bestId:
            return bestIndex
        else:
            bestIndex += 1
            
    return bestIndex # in case it was missed

def getContext2Scores(mentionStr, context, candidates):
    """
    Description:
        Uses Solr to find the relevancy scores of the candidates based on the context.
    Args:
        mentionStr: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The score for each candidate in the same order as the candidates.
    """
    
    candScores = []
    for i in range(len(candidates)):
        candScores.append(0)
    
    # put text in right format
    context = escapeStringSolr(context)
    mentionStr = escapeStringSolr(mentionStr)
    
    strIds = ['entityid:' +  str(strId[0]) for strId in candidates]
    
    # dictionary to hold scores for each id
    scoreDict = {}
    for cand in candidates:
        scoreDict[str(cand[0])] = 0
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305_context/select'
    params={'fl':'entityid', 'fq':" ".join(strIds), 'indent':'on',
            'q':'_context_:('+context.encode('utf-8')+') entity:(' + mentionStr.encode('utf-8') + ')^1',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    try:
        # get count for each id
        for doc in r.json()['response']['docs']:
            scoreDict[str(doc['entityid'])] += 1
    except:
        # keep zero scores
        pass
    
    # give scores to each cand
    for j in range(0, len(candidates)):
        candScores[j] = scoreDict[str(candidates[j][0])]
            
    return candScores

def bestContext2Match(mentionStr, context, candidates):
    """
    Description:
        Uses Solr to find the candidate that gives the highest relevance when given the context.
    Args:
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best relevance score from the context.
    """
    
    # put text in right format
    context = escapeStringSolr(context)
    mentionStr = escapeStringSolr(mentionStr)
    strIds = ['entityid:' +  str(strId[0]) for strId in candidates]
    
    # dictionary to hold scores for each id
    scoreDict = {}
    for cand in candidates:
        scoreDict[str(cand[0])] = 0
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305_context/select'
    params={'fl':'entityid', 'fq':" ".join(strIds), 'indent':'on',
            'q':'_context_:('+context.encode('utf-8')+') entity:(' + mentionStr.encode('utf-8') + ')^1',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    if 'response' not in r.json():
        return 0 # default to most popular
    
    if 'docs' not in r.json()['response']:
        return 0
    
    results = r.json()['response']['docs']
    if len(results) == 0:
        return 0 # default to most popular
    
    for doc in r.json()['response']['docs']:
        scoreDict[str(doc['entityid'])] += 1
    
    # get the index that has the best score
    bestScore = 0
    bestIndex = 0
    curIndex = 0
    for cand in candidates:
        if scoreDict[str(cand[0])] > bestScore:
            bestScore = scoreDict[str(cand[0])]
            bestIndex = curIndex
        curIndex += 1
            
    return bestIndex

def getWord2VecScores(context, candidates):
    """
    Description:
        Uses word2vec to find the similarity scores of each mention to the context vector.
    Args:
        context: The words that surround the target word as a list.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The scores of eac candidate.
    """
    
    candScores = []
    for i in range(len(candidates)):
        candScores.append(0)
        
    ctxVec = pd.Series(sp.zeros(300)) # default zero vector
    # add all context words together
    for word in context:
        ctxVec += getword2vector(word)
        
    # compare context vector to each of the candidates
    i = 0
    for cand in candidates:
        eVec = getentity2vector(str(cand[0]))
        score = 1-sp.spatial.distance.cosine(ctxVec, eVec)
        if math.isnan(score):
            score = 0
        candScores[i] = score
        i += 1 # next index
        
    return candScores

def bestWord2VecMatch(context, candidates):
    """
    Description:
        Uses word2vec to find the candidate with the best similarity to the context.
    Args:
        context: The words that surround the target word as a list.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best similarity score with the context.
    """
    
    ctxVec = pd.Series(sp.zeros(300)) # default zero vector
    # add all context words together
    for word in context:
        ctxVec += getword2vector(word)
        
    # compare context vector to each of the candidates
    bestIndex = 0
    bestScore = 0
    i = 0
    for cand in candidates:
        eVec = getentity2vector(str(cand[0]))
        score = 1-sp.spatial.distance.cosine(ctxVec, eVec)
        #print '[' + id2title(cand[0]) + ']' + ' -> ' + str(score)
        # update score and index
        if score > bestScore: 
            bestIndex = i
            bestScore = score
            
        i += 1 # next index
            
    return bestIndex
    
def wikifyPopular(textData, candidates):
    """
    Description:
        Chooses the most popular candidate for each mention.
    Args:
        textData: A text in split form along with its suspected mentions.
        candidates: A list of list of candidates that each have the entity id and its frequency/popularity.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            topCandidates.append([mention[1], mention[2], candidates[i][0][0]])
        i += 1 # move to list of candidates for next mention
            
    return topCandidates

def wikifyContext(textData, candidates, oText, useSentence = False, window = 7, method2 = False):
    """
    Description:
        Chooses the candidate that has the highest relevance with the surrounding window words.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        oText: The original text to be used for getting sentence.
        useSentence: Whether to set use whole sentence as context, or just windowsize.
        window: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            if not useSentence:
                context = getSurroundingWords(textData['text'], mention[0], window)
            else:
                #context = getMentionSentence(oText, mention)
                context = getMentionsInSentence(textData, mention)
            #print '\nMention: ' + textData['text'][mention[0]]
            #print 'Context: ' + context
            if method2 == False:
                bestIndex = bestContext1Match(textData['text'][mention[0]], context, candidates[i])
            else:
                bestIndex = bestContext2Match(textData['text'][mention[0]], context, candidates[i])
            topCandidates.append([mention[1], mention[2], candidates[i][bestIndex][0]])
        i += 1 # move to list of candidates for next mention
        
    return topCandidates

def wikifyWord2Vec(textData, candidates, oText, useSentence = False, window = 5):
    """
    Description:
        Chooses the candidates that have the highest similarity to the context.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        oText: The original text to be used for getting sentence.
        useSentence: Whether to set use whole sentence as context, or just windowsize.
        window: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            if not useSentence:
                context = getSurroundingWords(textData['text'], mention[0], window, asList = True)
            else:
                context = getMentionSentence(oText, mention, asList = True)
            #print '\nMention: ' + textData['text'][mention[0]]
            #print 'Context: ' + " ".join(context)
            bestIndex = bestWord2VecMatch(context, candidates[i])
            topCandidates.append([mention[1], mention[2], candidates[i][bestIndex][0]])
        i += 1 # move to list of candidates for next mention
        
    return topCandidates

def wikifyCoherence(textData, candidates, ws = 5):
    """
    Description:
        Chooses the candidates that have the highest coherence according to rvs pagerank method.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        ws: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCands = [] # the top candidate from each candidate list
    candsScores = coherence_scores_driver(candidates, ws, method='rvspagerank', direction=DIR_BOTH, op_method="keydisamb")
    i = -1 # track what mention we are on
    for cScores in candsScores:
        i += 1
        
        if len(cScores) == 0:
            continue # nothing to do with this one
            
        bestScore = sorted(cScores, reverse = True)[0]
        curIndex = 0
        for score in cScores:
            if score == bestScore:
                topCands.append([textData['mentions'][i][1], textData['mentions'][i][2], candidates[i][curIndex][0]])
                break
            curIndex += 1
            
    return topCands

mlModels = {} # dictionary of different models
mlModelFiles = {
    'abc': '/users/cs/amaral/wikisim/wikification/ml-models/model-abc-10000-hyb.pkl',
    'bgc': '/users/cs/amaral/wikisim/wikification/ml-models/model-bgc-10000-hyb.pkl',
    'etc': '/users/cs/amaral/wikisim/wikification/ml-models/model-etc-10000-hyb.pkl',
    'gbc': '/users/cs/amaral/wikisim/wikification/ml-models/model-gbc-10000-hyb.pkl',
    'rfc': '/users/cs/amaral/wikisim/wikification/ml-models/model-rfc-10000-hyb.pkl',
    'lsvc': '/users/cs/amaral/wikisim/wikification/ml-models/model-lsvc-10000-hyb.pkl',
    'svc': '/users/cs/amaral/wikisim/wikification/ml-models/model-svc-10000-hyb.pkl',
    'lmart': '/users/cs/amaral/wikisim/wikification/ml-models/model-lmart-10000-hyb.pkl'}

def wikifyMulti(textData, candidates, oText, model, useSentence = True, window = 7):
    """
    Description:
        Disambiguates each of the mentions with their given candidates using the desired
        machine learned model.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        oText: The original text, unsplit.
        model: The machine learned model to use for disambiguation: 
            'gbc' (gradient boosted classifier), 'etr' (extra trees regression), 
            'gbr' (gradient boosted regression), 'lmart' (LambdaMART (a learning to rank method)),
            and 'rfr' (random forest regression).
        useSentence: Whether to use windo size of sentence (for context methods)
        window: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    mlModel = mlModels[model] # get reference to model
    
    # get score from coherence
    cohScores = coherence_scores_driver(candidates, 5, method='rvspagerank', direction=DIR_BOTH, op_method="keydisamb")
    
    i = 0
    # get scores from each disambiguation method for all mentions
    for mention in textData['mentions']:
        if len(candidates[i]) > -1: # stub
            # get the scores from each basic method.
            
            # normalize popularity scores
            cScrs = []
            for cand in candidates[i]:
                cScrs.append(cand[1])
            cScrs = normalize(cScrs)
            j = 0
            for cand in candidates[i]:
                cand[1] = cScrs[j]
                j += 1
            
            contextMInS = getMentionsInSentence(textData, textData['mentions'][i])
            contextS = getMentionSentence(oText, textData['mentions'][i], asList = True)
            
            # context 1 scores
            cScrs = getContext1Scores(textData['text'][mention[0]], contextMInS, candidates[i])
            cScrs = normalize(cScrs)
            # apply score to candList
            for j in range(0, len(candidates[i])):
                candidates[i][j].append(cScrs[j])
            
            # context 2 scores
            cScrs = getContext2Scores(textData['text'][mention[0]], contextMInS, candidates[i])
            cScrs = normalize(cScrs)
            # apply score to candList
            for j in range(0, len(candidates[i])):
                candidates[i][j].append(cScrs[j])

            # get score form word2vec
            cScrs = getWord2VecScores(contextS, candidates[i])
            #cScrs = normalize(cScrs)
            # apply score to candList
            for j in range(0, len(candidates[i])):
                candidates[i][j].append(cScrs[j])

            # get score from coherence
            for j in range(0, len(candidates[i])):
                candidates[i][j].append(cohScores[i][j])
            
        i += 1
    
    topCandidates = []
    
    i = 0
    # go through all mentions again to disambiguate with ml model
    for mention in textData['mentions']:
        try:
            Xs = [cand[1:] for cand in candidates[i]]
            if len(Xs) == 0:
                i += 1
                continue
            pred = mlModel.predict(Xs)
        except:
            try:
                Xs = [cand[1:] for cand in candidates[i]]
                pred = mlModel.predict(np.array(candidates[i][1:]).reshape(1, -1))
            except:
                i += 1
                continue
        cur = 0
        best = 0
        bestI = 0
        for j in range(len(pred)):
            if pred[j] > best:
                best = pred[j]
                bestI = j
        
        topCandidates.append([mention[1], mention[2], candidates[i][bestI][0]])
        
        i += 1
        
    return topCandidates

def wikifyEval(text, mentionsGiven, maxC = 20, method='popular', strict = False, hybridC = True, model = 'lmart'):
    """
    Description:
        Takes the text (maybe text data), and wikifies it for evaluation purposes using the desired method.
    Args:
        text: The string to wikify. Either as just the original string to be modified, or in the 
            form of: [[w1,w2,...], [[wid,entityId],...] if the mentions are given.
        mentionsGiven: Whether the mentions are given to us and the text is already split.
        maxC: The max amount of candidates to extract.
        method: The method used to wikify.
        strict: Whether to use such rules as minimum metion length, or minimum frequency of concept.
        hybridC: Whether to split generated candidates between best of most frequent of most context related.
        model: What model to use if using machine learning based method. LambdaMART as 'lmart' is default.
            Other options are: 'gbc' (gradient boosted classifier), 'etr' (extra trees regression), 
            'gbr' (gradient boosted regression), and 'rfr' (random forest regression).
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    if not(mentionsGiven): # if words are not in pre-split form
        textData = mentionExtract(text) # extract mentions from text
        oText = text # the original text
    else: # if they are
        textData = text
        textData['mentions'] = mentionStartsAndEnds(textData) # put mentions in right form
        oText = " ".join(text['text'])
    
    # get rid of small mentions
    if strict:
        textData['mentions'] = [item for item in textData['mentions']
                    if  len(textData['text'][item[0]]) >= MIN_MENTION_LENGTH]
    
    if method == 'popular':
        maxC = 1 # only need one cand for popular
    
    candidates = generateCandidates(textData, maxC, hybridC)
    
    if method == 'popular':
        wikified = wikifyPopular(textData, candidates)
    elif method == 'context1':
        wikified = wikifyContext(textData, candidates, oText, useSentence = True, window = 7)
    elif method == 'context2':
        wikified = wikifyContext(textData, candidates, oText, useSentence = True, window = 7, method2 = True)
    elif method == 'word2vec':
        wikified = wikifyWord2Vec(textData, candidates, oText, useSentence = False, window = 5)
    elif method == 'coherence':
        wikified = wikifyCoherence(textData, candidates, ws = 5)
    elif method == 'multi':
        if model not in mlModels:
            mlModels[model] = pickle.load(open(mlModelFiles[model], 'rb'))
        wikified = wikifyMulti(textData, candidates, oText, model, useSentence = True, window = 7)
    
    # get rid of very unpopular mentions
    if strict:
        wikified = [item for item in wikified
                    if item[3] >= MIN_FREQUENCY]
    
    return wikified

Overwriting wikification.py


In [None]:
from wikification import wikifyEval
from wikipedia import id2title, anchor2concept

the = wikifyEval('Tom Bombadil is a hobbit who lives in the Shire', False, method = 'context2', hybridC = False)
print the
for thing in the:
    print id2title(thing[2])

In [1]:
from wikification import mentionExtract

mentionExtract('Tom Bombadil is a hobbit who lives in the Shire')

[[u'startOffset', 0, u'endOffset', 3, u'ids', [u'Tom', u'T\xf4m', u"Tom's", u'Tom\u2019s']], [u'startOffset', 0, u'endOffset', 12, u'ids', [u'Tom Bombadil']], [u'startOffset', 4, u'endOffset', 12, u'ids', [u'Bombadil']], [u'startOffset', 13, u'endOffset', 15, u'ids', [u'is', u'.is', u'"is"', u'[is']], [u'startOffset', 13, u'endOffset', 17, u'ids', [u'is a', u'is-a']], [u'startOffset', 16, u'endOffset', 17, u'ids', [u'\u0259', u'a', u'/a/', u'.a', u'"a-"', u'"a"', u'"a"-', u'\\a', u'a :-)', u'a-', u'\xe1', u'\xe4', u'[\xe4', u'\xe3', u'\u0105', u'\u1e9a']], [u'startOffset', 18, u'endOffset', 24, u'ids', [u'hobbit', u'"hobbit"']], [u'startOffset', 25, u'endOffset', 28, u'ids', [u'who', u'"who"']], [u'startOffset', 29, u'endOffset', 34, u'ids', [u'lives']], [u'startOffset', 35, u'endOffset', 37, u'ids', [u'in', u'-in', u'.in', u"'in", u'in-', u'in.', u'in\xb2']], [u'startOffset', 35, u'endOffset', 41, u'ids', [u'in the']], [u'startOffset', 38, u'endOffset', 41, u'ids', [u'the', u'"the"']]

{'mentions': [[1, 0, 12], [6, 18, 24], [13, 42, 47]],
 'text': ['Tom',
  'Tom Bombadil',
  'Bombadil',
  'is',
  'is a',
  'a',
  'hobbit',
  'who',
  'lives',
  'in',
  'in the',
  'the',
  'the Shire',
  'Shire']}

In [13]:
#%%writefile word-pos-data.py
from __future__ import division
import os
import nltk
import json

"""
Gets stats on the POS tag data of mentions and non-mentions.
"""

pathStrt = '/users/cs/amaral/wsd-datasets'
dsPath = os.path.join(pathStrt,'wiki-mentions.30000.json')

with open(dsPath, 'r') as dataFile:
    dataLines = []
    skip = 0
    amount = 30000 # do 30000 for full
    i = 0
    for line in dataFile:
        if i >= skip:
            dataLines.append(json.loads(line.decode('utf-8').strip()))
        i += 1
        if i >= skip + amount:
            break
            
mentionB = {}
mentionC = {}
mentionA = {}
mentionBA = {}

nonmentionB = {}
nonmentionC = {}
nonmentionA = {}
nonmentionBA = {}

mentions = 0
nonmentions = 0

lnum = 0
for line in dataLines:
    lnum += 1
    print 'Line: ' + str(lnum)
    
    pos = nltk.pos_tag(line['text'])
    for i in range(len(line['text'])):
        # before
        if i == 0:
            keyB = 'NONE'
        else:
            keyB = pos[i-1][1]
            
        # current
        keyC = pos[i][1]
        
        # after
        if i == len(line['text']) - 1:
            keyA = 'NONE'
        else:
            keyA = pos[i+1][1]
        
        if i in [mnt[0] for mnt in line['mentions']]: # is mention
            mentions += 1
            # before
            try:
                mentionB[keyB][0] += 1
            except:
                mentionB[keyB] = [1]
            # current
            try:
                mentionC[keyC][0] += 1
            except:
                mentionC[keyC] = [1]
            # after
            try:
                mentionA[keyA][0] += 1
            except:
                mentionA[keyA] = [1]
            # before and after
            try:
                mentionBA[keyB + ' : ' + keyA][0] += 1
            except:
                mentionBA[keyB + ' : ' + keyA] = [1]
        else: # is nonmention
            nonmentions += 1
            # before
            try:
                nonmentionB[keyB][0] += 1
            except:
                nonmentionB[keyB] = [1]
            # current
            try:
                nonmentionC[keyC][0] += 1
            except:
                nonmentionC[keyC] = [1]
            # after
            try:
                nonmentionA[keyA][0] += 1
            except:
                nonmentionA[keyA] = [1]
            # before and after
            try:
                nonmentionBA[keyB + ' : ' + keyA][0] += 1
            except:
                nonmentionBA[keyB + ' : ' + keyA] = [1]
                
print 'Mentions', mentions
print 'Non-Mentions', nonmentions
                
# apply portion to each pos tag (mentions)
for key in mentionB.keys():
    mentionB[key].append(mentionB[key][0]/mentions)
for key in mentionC.keys():
    mentionC[key].append(mentionC[key][0]/mentions)
for key in mentionA.keys():
    mentionA[key].append(mentionA[key][0]/mentions)
for key in mentionBA.keys():
    mentionBA[key].append(mentionBA[key][0]/mentions)
# apply portion to each pos tag (nonmentions)
for key in nonmentionB.keys():
    nonmentionB[key].append(nonmentionB[key][0]/nonmentions)
for key in nonmentionC.keys():
    nonmentionC[key].append(nonmentionC[key][0]/nonmentions)
for key in nonmentionA.keys():
    nonmentionA[key].append(nonmentionA[key][0]/nonmentions)
for key in nonmentionBA.keys():
    nonmentionBA[key].append(nonmentionBA[key][0]/nonmentions)


""" Already have this data
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-mention-bef.tsv', 'w') as f:
    for key in mentionB.keys():
        f.write(key + '\t' + str(mentionB[key][0]) + '\t' + str(mentionB[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-mention-cur.tsv', 'w') as f:
    for key in mentionC.keys():
        f.write(key + '\t' + str(mentionC[key][0]) + '\t' + str(mentionC[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-mention-aft.tsv', 'w') as f:
    for key in mentionA.keys():
        f.write(key + '\t' + str(mentionA[key][0]) + '\t' + str(mentionA[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-nonmention-bef.tsv', 'w') as f:
    for key in nonmentionB.keys():
        f.write(key + '\t' + str(nonmentionB[key][0]) + '\t' + str(nonmentionB[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-nonmention-cur.tsv', 'w') as f:
    for key in nonmentionC.keys():
        f.write(key + '\t' + str(nonmentionC[key][0]) + '\t' + str(nonmentionC[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-nonmention-aft.tsv', 'w') as f:
    for key in nonmentionA.keys():
        f.write(key + '\t' + str(nonmentionA[key][0]) + '\t' + str(nonmentionA[key][1]) + '\n')
"""

with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-mention-befaft.tsv', 'w') as f:
    for key in mentionBA.keys():
        f.write(key + '\t' + str(mentionBA[key][0]) + '\t' + str(mentionBA[key][1]) + '\n')
        
with open('/users/cs/amaral/wikisim/wikification/pos-data/pos-nonmention-befaft.tsv', 'w') as f:
    for key in nonmentionBA.keys():
        f.write(key + '\t' + str(nonmentionBA[key][0]) + '\t' + str(nonmentionBA[key][1]) + '\n')

Line: 1
Line: 2
Line: 3
Line: 4
Line: 5
Line: 6
Line: 7
Line: 8
Line: 9
Line: 10
Line: 11
Line: 12
Line: 13
Line: 14
Line: 15
Line: 16
Line: 17
Line: 18
Line: 19
Line: 20
Line: 21
Line: 22
Line: 23
Line: 24
Line: 25
Line: 26
Line: 27
Line: 28
Line: 29
Line: 30
Line: 31
Line: 32
Line: 33
Line: 34
Line: 35
Line: 36
Line: 37
Line: 38
Line: 39
Line: 40
Line: 41
Line: 42
Line: 43
Line: 44
Line: 45
Line: 46
Line: 47
Line: 48
Line: 49
Line: 50
Line: 51
Line: 52
Line: 53
Line: 54
Line: 55
Line: 56
Line: 57
Line: 58
Line: 59
Line: 60
Line: 61
Line: 62
Line: 63
Line: 64
Line: 65
Line: 66
Line: 67
Line: 68
Line: 69
Line: 70
Line: 71
Line: 72
Line: 73
Line: 74
Line: 75
Line: 76
Line: 77
Line: 78
Line: 79
Line: 80
Line: 81
Line: 82
Line: 83
Line: 84
Line: 85
Line: 86
Line: 87
Line: 88
Line: 89
Line: 90
Line: 91
Line: 92
Line: 93
Line: 94
Line: 95
Line: 96
Line: 97
Line: 98
Line: 99
Line: 100
Line: 101
Line: 102
Line: 103
Line: 104
Line: 105
Line: 106
Line: 107
Line: 108
Line: 109
Line: 110
Line: 11