# Wikification
This notebook contains code for doing wikification, as well as evaluating it.

## Overview of our Wikification Method
The following flowchart describes how our wikification method works without much technical detail.
<img src="https://docs.google.com/drawings/d/19fInwE2C_fsAFiMNnIPe0cFldIHbJTHrXrLlLg6nbwI/pub?w=728&h=600">
<center><strong>Figure 1.</strong> A flowchart describing our wikification method at a relatively basic level.</center>

## With More Detail
The following sections expand on the steps shown in Figure 1.

### 1. Input Some Text
This step is self-explanatory: pass the wikifier the text that is to be wikified. In the evaluation part of our code, the text comes pre-split from the datasets. We either keep the text split, to focus on the wikification itself, or join the text with spaces, so that the evaluation also takes our own mention extraction into account.
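
As a rough illustration of the two input modes, the sketch below builds one record in the pre-split format used by the evaluation code (`'text'` is a list of tokens, `'mentions'` is a list of `[wordIndex, entityTitle]` pairs), joins it with spaces, and recovers the character span of each given mention. The record contents and the helper name `mention_offsets` are made up for this example.

```python
# Hypothetical record in the pre-split dataset format:
# 'text' is a list of tokens, 'mentions' is a list of [wordIndex, entityTitle].
record = {
    'text': ['David', 'played', 'for', 'Manchester United', '.'],
    'mentions': [[0, 'David_Beckham'], [3, 'Manchester_United_F.C.']],
}

def mention_offsets(record):
    """Return [start, end, entityTitle] character spans in the space-joined text."""
    starts, pos = [], 0
    for token in record['text']:
        starts.append(pos)
        pos += len(token) + 1  # +1 for the joining space
    return [[starts[i], starts[i] + len(record['text'][i]), title]
            for i, title in record['mentions']]

joined = " ".join(record['text'])  # the "unsplit" input
spans = mention_offsets(record)
for start, end, title in spans:
    print('%r -> %s' % (joined[start:end], title))
```

With the text kept split, the mentions are given; with the joined string, the wikifier has to find these spans itself.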

### 2. Tokenize Text
The text is tokenized by a [Solr](http://lucene.apache.org/solr/) extension called [Solr Text Tagger](https://github.com/OpenSextant/SolrTextTagger). This tokenizer returns all potential mentions that it detects in the text. Our code configures the tokenizer to return all overlaps, so given the text 'The United States of America', it returns 'The United States', 'The United States of America', 'United States', and 'United States of America'. These overlaps are undesirable in the final output. However, we enable them so that we obtain more potential mentions, which we can then resolve more intelligently than the tokenizer can without deep configuration. The overlaps are dealt with in the next step, though future work may show that it is better to resolve them later in the process.

### 3. Remove Overlaps
This part is a work in progress. Our current method first groups all overlapping mentions into what we call overlap sets. Each overlap set consists of overlapping mentions that start at the same character. The mention 'probability' of each mention is calculated at this time. It is not truly a probability: it is defined as the number of times the mention text is a mention in Wikipedia divided by the number of Wikipedia documents in which the text appears. (To be an actual probability, the denominator would have to be the total number of times the mention text appears in Wikipedia.) Only the mention with the highest 'probability' in each overlap set is kept.
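
The first pass can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the counts in `COUNTS` are invented, and `mention_prob` and `keep_best_per_start` are hypothetical names. In the notebook, the numerator comes from the anchor statistics and the denominator from a Solr document count.

```python
from itertools import groupby

# Invented counts: (times the text is a link anchor, documents containing the text).
COUNTS = {
    'The United States': (50, 1000),
    'The United States of America': (30, 200),
    'United States': (9000, 20000),
    'United States of America': (400, 1500),
}

def mention_prob(text):
    """Anchor count divided by document count; 0 when the text never appears."""
    anchors, docs = COUNTS.get(text, (0, 0))
    return float(anchors) / docs if docs else 0.0

def keep_best_per_start(mentions):
    """mentions: [start, end, text] lists sorted by start offset."""
    kept = []
    for _, group in groupby(mentions, key=lambda m: m[0]):
        # one overlap set per start offset; keep its most 'probable' member
        kept.append(max(group, key=lambda m: mention_prob(m[2])))
    return kept

mentions = [
    [0, 17, 'The United States'],
    [0, 28, 'The United States of America'],
    [4, 17, 'United States'],
    [4, 28, 'United States of America'],
]
best = keep_best_per_start(mentions)
print([m[2] for m in best])
```

With these invented counts, one mention survives per start offset, and the two survivors still overlap each other.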

Overlaps may of course remain at this point, and these residual overlaps are dealt with next. For the following part, the mentions are stored in the order in which they appear in the text, sorted by their starting character. The 'first mention' is the mention that appears first in the text, and the 'next mention' is the one that appears next. To resolve the residual overlaps, we call the first mention the anchor and group it with every following mention that starts before the anchor ends, forming an overlap set. As before, the most 'probable' mention in this set is kept and all others are discarded from the original list. The same process is then repeated with the surviving mention as the anchor. If the overlap set contains only the anchor, the process moves on to the next mention. This continues until there is no next mention.
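
The residual pass described above might look roughly like this. It is a sketch under assumptions: `resolve_residual_overlaps` is a hypothetical name, the scores are invented, and unlike the notebook's function it works on a copy rather than modifying its input.

```python
def resolve_residual_overlaps(mentions, prob):
    """mentions: [start, end, text] lists sorted by start; prob: text -> score."""
    remaining = list(mentions)  # work on a copy, leave the input untouched
    anchor = 0
    while anchor < len(remaining):
        anchor_end = remaining[anchor][1]
        group = [remaining[anchor]]
        for m in remaining[anchor + 1:]:
            if m[0] < anchor_end:   # starts before the anchor ends
                group.append(m)
            else:
                break
        if len(group) == 1:
            anchor += 1             # nothing overlaps the anchor; move on
            continue
        best = max(group, key=lambda m: prob(m[2]))
        # the group is contiguous, so replace it with its best member and
        # retry from the same position with the survivor as the new anchor
        remaining[anchor:anchor + len(group)] = [best]
    return remaining

# Invented scores for illustration only.
scores = {'The United States of America': 0.15, 'United States': 0.45, 'Canada': 0.6}
mentions = [
    [0, 28, 'The United States of America'],
    [4, 17, 'United States'],
    [33, 39, 'Canada'],
]
result = resolve_residual_overlaps(mentions, scores.get)
print([m[2] for m in result])
```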

This step needs more investigation; the first part may not even be necessary.

### 4. Filter with POS Tags and Mention Probability
We use the [Natural Language Toolkit (NLTK)](http://www.nltk.org/) to tag all of the mentions (we should instead tag the whole text at once to get more accurate tags; an update is to come). The POS tags help us filter out bad mentions. Approximately 99% of all mentions in our datasets were nouns of some type, adjectives, or cardinal numbers. The corresponding tags are 'NN', 'NNS', 'NNP', and 'NNPS' for nouns, 'JJ' for adjectives, and 'CD' for cardinal numbers.

In addition to filtering with POS tags, we also filter out any mention whose 'probability' of being a mention is less than 0.001.
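
A minimal sketch of the two filters combined is shown below. The tags are written by hand here so that the example runs without NLTK; in the notebook they come from the tagger. The function name `keep_mention`, the example values, and the rule that a multi-word mention passes if any of its tokens has an allowed tag are all assumptions made for illustration.

```python
ALLOWED_TAGS = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'CD'}
MIN_MENTION_PROB = 0.001

def keep_mention(tags, prob):
    """tags: POS tags of the mention's tokens; prob: its mention 'probability'.

    Assumption: a mention passes the POS filter if any token has an allowed tag.
    """
    return prob >= MIN_MENTION_PROB and any(t in ALLOWED_TAGS for t in tags)

# (mention text, hand-written POS tags, invented 'probability')
candidates = [
    ('United States', ['NNP', 'NNPS'], 0.45),
    ('of', ['IN'], 0.0004),
    ('quickly', ['RB'], 0.02),
    ('seven', ['CD'], 0.0005),
]
kept = [text for text, tags, prob in candidates if keep_mention(tags, prob)]
print(kept)
```

Here 'quickly' fails the POS filter, 'seven' fails the probability threshold, and 'of' fails both.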

### 5. Candidate Generation
Now that the mentions are all extracted, we must generate a list of possible entities that each mention can refer to. We call these entity candidates. To generate entity candidates, we select the entities (Wikipedia pages) that the given mention refers to most often on Wikipedia. We refer to this measure as popularity. The number of candidates selected is limited by a parameter passed to the wikification method. Of the candidate generation methods we have tried, this one performed best. Of the 27092 mentions in our datasets, this method puts the correct candidate in the first position for 22896 mentions (about 85%) when candidates are sorted in descending order of popularity. When selecting 20 candidates, 532 mentions do not have their correct entity among the candidates, which means that about 98% of the mentions have the correct entity in their candidate list.
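
The sketch below illustrates popularity-based candidate generation and then recomputes the two coverage figures from the counts quoted above. The `ANCHOR_COUNTS` table and its numbers are invented, and `generate_candidates` is a hypothetical name; only the totals 27092, 22896, and 532 come from our experiments.

```python
# Invented anchor statistics: mention text -> {entity title: link count}.
ANCHOR_COUNTS = {
    'Java': {
        'Java (programming language)': 6000,
        'Java': 4000,
        'Java (software platform)': 700,
    },
}

def generate_candidates(mention, max_cands):
    """Return up to max_cands (entity, count) pairs, most popular first."""
    counts = ANCHOR_COUNTS.get(mention, {})
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_cands]

top2 = generate_candidates('Java', 2)
print(top2)

# Coverage figures recomputed from the counts reported in the text.
total, top1_hits, missed_at_20 = 27092, 22896, 532
top1_rate = float(top1_hits) / total
recall_at_20 = float(total - missed_at_20) / total
print('top-1: %.1f%%, within 20 candidates: %.1f%%'
      % (100 * top1_rate, 100 * recall_at_20))
```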

### 6. Candidate Scoring
For each mention, all of the candidates must be scored on some metric. The candidate with the best score will be selected as the proposed entity for the mention. Currently, we are only working with individual elementary methods. By 'elementary methods', we mean that the methods are basic algorithmic processes and do not make any use of machine learning. In the future we will look at ways to combine the elementary methods together using a learning to rank algorithm, and also potentially look into deep learning for this problem.

#### Popularity
This method simply chooses the most popular candidate (popularity as defined in section 5, Candidate Generation). It performs very well, but it is undesirable because it ignores the context entirely and can be badly wrong in some cases. See [this comic](https://comic.hmp.is.it/comic/rainy-days/) for an example of someone who does not quite get this concept.

#### Context 1
For this method, the sentence that contains the mention is extracted and called the context. The mention is removed from this context. We then use Solr to find the most similar document (Wikipedia page) by searching the document text field for the context and the document title field for the mention text. The set of documents searched in Solr is of course limited to the candidates of the mention. We use boosting to weight the results more heavily toward the title field: the score from the title search is multiplied by 1.35 for each document. The document with the highest score is deemed the most similar and is selected as the proposed entity for the mention.
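
To make the effect of the title boost concrete, here is a toy stand-in that does not use Solr at all. Word overlap replaces Solr's relevance score, and the two field scores are simply added after boosting the title score; both choices are assumptions for illustration, as are the documents and the function names.

```python
TITLE_BOOST = 1.35  # the boost factor used in Context 1

def overlap(query, field):
    """Fraction of query words found in the field (toy relevance score)."""
    q = set(query.lower().split())
    f = set(field.lower().split())
    return float(len(q & f)) / len(q) if q else 0.0

def rank(mention, context, docs):
    """Score each candidate document and return (score, title), best first."""
    scored = []
    for doc in docs:
        score = (overlap(context, doc['text'])
                 + TITLE_BOOST * overlap(mention, doc['title']))
        scored.append((score, doc['title']))
    return sorted(scored, reverse=True)

# Invented candidate documents for the mention 'Java'.
docs = [
    {'title': 'Java', 'text': 'Java is an island of Indonesia'},
    {'title': 'Java (programming language)',
     'text': 'Java is a programming language used to write software'},
]
# The sentence with the mention removed.
context = 'she writes software in the programming language'
ranked = rank('Java', context, docs)
print(ranked[0][1])
```

Both titles match the mention equally, so the context match against the text field decides the ranking in this example.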

#### Context 2
This method is similar to Context 1 in that it also uses Solr and uses the sentence as the context in the same way. The difference is the index. Rather than containing whole documents (Wikipedia pages), the index for this method contains every instance of every mention, together with the surrounding context of each mention, as a record. For example, a record for the mention 'David' also holds the n (5 in this example) words before the mention ('is a soccer player named'), the n words after the mention ('he played for Manchester United'), the Wikipedia page that the mention is in, and the Wikipedia page that the mention refers to. For each candidate, we search the records whose mention refers to that candidate, matching our context sentence against the n words before and after. The entity with the highest number of relevant examples is selected as the proposed entity for that mention.
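
The record structure and the per-candidate counting can be sketched without Solr as follows. The records, the entity titles, and the rule that a record is "relevant" when it shares at least one word with the context are all assumptions for illustration; in the notebook, Solr decides relevance.

```python
# Invented records: one per occurrence of a mention in Wikipedia.
RECORDS = [
    {'mention': 'David', 'before': 'is a soccer player named',
     'after': 'he played for Manchester United',
     'source_page': 'Manchester_United_F.C.', 'entity': 'David_Beckham'},
    {'mention': 'David', 'before': 'the marble statue of',
     'after': 'was carved by Michelangelo in Florence',
     'source_page': 'Michelangelo', 'entity': 'David_(Michelangelo)'},
    {'mention': 'David', 'before': 'the England captain',
     'after': 'scored from a free kick',
     'source_page': 'England_national_football_team', 'entity': 'David_Beckham'},
]

def score_candidates(mention, context, candidates):
    """Count, per candidate, the records whose context shares a word with ours."""
    context_words = set(context.lower().split())
    scores = dict.fromkeys(candidates, 0)
    for rec in RECORDS:
        if rec['mention'] != mention or rec['entity'] not in scores:
            continue
        rec_words = set((rec['before'] + ' ' + rec['after']).lower().split())
        if context_words & rec_words:  # crude relevance test (assumption)
            scores[rec['entity']] += 1
    return scores

scores = score_candidates('David', 'scored twice for Manchester United',
                          ['David_Beckham', 'David_(Michelangelo)'])
best = max(scores, key=scores.get)
print(best)
```

Note that this crude test also counts matches on common words such as 'for', which a real index would normally down-weight.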

#### Word2Vec
For this method we have Word2Vec create a vector space model of concepts from a Wikipedia corpus. The entities, as well as regular words all have their own vector representation. To use this method, we select n words before and after the mention, and get the vector representation of each of these words. All of these vectors are added together to become a context vector. This context vector is compared to the vector representation of each of the candidates. The candidate vector that is most similar (by cosine similarity) to the context vector is selected as the proposed entity for the mention.
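
The scoring step can be sketched in plain Python as below. The three-dimensional vectors are invented stand-ins for a trained Word2Vec model, and the function names are hypothetical; only the procedure (sum the context word vectors, then pick the candidate with the highest cosine similarity) follows the description above.

```python
import math

# Tiny invented vectors standing in for a trained Word2Vec model,
# which holds both regular words and entities.
VECTORS = {
    'soccer': [0.9, 0.1, 0.0],
    'goal': [0.8, 0.2, 0.1],
    'statue': [0.0, 0.9, 0.3],
    'David_Beckham': [0.85, 0.1, 0.05],
    'David_(Michelangelo)': [0.05, 0.95, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def context_vector(words):
    """Sum the vectors of the context words that the model knows."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    return [sum(dims) for dims in zip(*vecs)] if vecs else None

def pick_entity(context_words, candidates):
    """Return the candidate most similar to the context, or None if no context."""
    ctx = context_vector(context_words)
    if ctx is None:
        return None
    return max(candidates, key=lambda c: cosine(ctx, VECTORS[c]))

chosen = pick_entity(['the', 'soccer', 'goal'],
                     ['David_Beckham', 'David_(Michelangelo)'])
print(chosen)
```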

#### Coherence
This method uses the reverse page rank algorithm to determine which combination of candidates from all mentions makes the most sense together. This method looks at the quality of all of the proposed entities from all mentions together, instead of individually selecting the proposed entity for each individual mention.

# Wikification Evaluation Code
The code in the following cells evaluates the precision and recall of our wikification code as well as of other wikification methods.
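
The second evaluation script below scores each record by comparing the set of true entity ids with the set of predicted entity ids (the BOT F1 measure it cites). A minimal sketch of that computation follows; the function name `bot_scores` and the example ids are made up, and the guards against empty sets are an addition of this sketch.

```python
def bot_scores(true_ids, predicted_ids):
    """Set-based precision, recall, and F1 over entity ids for one record."""
    true_set, pred_set = set(true_ids), set(predicted_ids)
    hits = len(true_set & pred_set)
    precision = float(hits) / len(pred_set) if pred_set else 0.0
    recall = float(hits) / len(true_set) if true_set else 0.0
    # with no hits, precision + recall is 0, so F1 is defined as 0 here
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

precision, recall, f1 = bot_scores([1, 2, 3], [2, 3, 4, 5])
print('%.3f %.3f %.3f' % (precision, recall, f1))
```

Because only the sets of ids are compared, the positions of the mentions in the text do not affect this score.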

## Datasets

### KORE
* 50 records.
* Relatively small pieces of text with the main goal of being tricky for wikification systems.

### Aquaint
* 50 records.
* News.

### MSNBC
* 20 records.
* News.

### Wiki[n]
* n records (we usually use 500 or 5000).
* Opening paragraph of a variety of randomly selected Wikipedia articles.

### nopop
* 2304 records.
* Made up of subsets of the other datasets.
* Only contains records where the most popular candidate is never the correct entity.

In [18]:
%%writefile wikification_eval.py 

"""
This is for testing performance of different wikification methods.
"""

from wikification import *
from IPython.display import clear_output
import copy
from datetime import datetime
import tagme
import os
import json

tagme.GCUBE_TOKEN = "f6c2ba6c-751b-4977-a94c-c140c30e9b92-843339462"
    

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# many different options for combinations of datasets for smaller tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
#datasets = [{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]

# 'popular', 'context1', 'context2', 'word2vec', 'coherence', 'tagme'
methods = ['popular']

if 'word2vec' in methods:
    try:
        word2vec
    except:
        word2vec = gensim_loadmodel('/users/cs/amaral/cgmdir/WikipediaClean5Negative300Skip10.Ehsan/WikipediaClean5Negative300Skip10')

doSplit = True
doManual = True

verbose = True

maxCands = 20

performances = {}

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset: one JSON record per line, decoded as UTF-8
    with open(dataset['path'], 'r') as dataFile:
        dataLines = [json.loads(line.decode('utf-8').strip()) for line in dataFile]
        
    print dataset['name'] + '\n'
    
    # run each method on the data set
    for mthd in methods:
        print mthd
        print str(datetime.now()) + '\n'
        
        # reset counters
        totalPrecS = 0
        totalPrecM = 0
        totalRecS = 0
        totalRecM = 0
        totalLines = 0
        
        # each method tests all lines
        for line in dataLines:
            if verbose:
                print str(totalLines + 1)
            
            # get absolute text indexes and entity id of each given mention
            trueEntities = mentionStartsAndEnds(copy.deepcopy(line), forTruth = True) # the ground truth
            
            # get results for pre split string
            if doSplit and mthd != 'tagme': # pre-split input does not work with tagme
                # original split string with mentions given
                resultS = wikifyEval(copy.deepcopy(line), True, maxC = maxCands, method = mthd)
                precS = precision(trueEntities, resultS) # precision of pre-split
                recS = recall(trueEntities, resultS) # recall of pre-split
                
                if verbose:
                    print 'Split: ' + str(precS) + ', ' + str(recS)
                
                # track results
                totalPrecS += precS
                totalRecS += recS
            else:
                totalPrecS = 0
                totalRecS = 0
                
            # get results for manually split string
            if doManual:
                # tagme has separate way to do things
                if mthd == 'tagme':
                    antns = tagme.annotate(" ".join(line['text']))
                    resultM = []
                    for an in antns.get_annotations(0.005):
                        resultM.append([an.begin,an.end,title2id(an.entity_title)])
                else:
                    # unsplit string to be manually split and mentions found
                    resultM = wikifyEval(" ".join(line['text']), False, maxC = maxCands, method = mthd)
                
                precM = precision(trueEntities, resultM) # precision of manual split
                recM = recall(trueEntities, resultM) # recall of manual split
                
                if verbose:
                    print 'Manual: ' + str(precM) + ', ' + str(recM)
                    
                # track results
                totalPrecM += precM
                totalRecM += recM
            else:
                totalPrecM = 0
                totalRecM = 0
                
            totalLines += 1
        
        # record results for this method on this dataset
        # per-line averages, keyed by S (pre-split) / M (manual split) and metric
        performances[dataset['name']][mthd] = {'S Prec':totalPrecS/totalLines, 
                                               'M Prec':totalPrecM/totalLines,
                                              'S Rec':totalRecS/totalLines, 
                                               'M Rec':totalRecM/totalLines
                                              }

with open('/users/cs/amaral/wikisim/wikification/wikification_results.txt', 'a') as resultFile:
    resultFile.write('\nmaxC: ' + str(maxCands) + '\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            if doSplit and doManual:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec']) + '\n')
            elif doSplit:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec']) + '\n')
            elif doManual:
                resultFile.write(mthd + ':'
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec']) + '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

Overwriting wikification_eval.py


In [20]:
%%writefile wikification_eval_bot.py 

"""
This is for testing performance of different wikification methods using BOT F1 score
as described here: http://cogcomp.cs.illinois.edu/papers/RRDA11.pdf.
"""

from __future__ import division
from wikification import *
from IPython.display import clear_output
import copy
from datetime import datetime
import tagme
import os
import json
from sets import Set

tagme.GCUBE_TOKEN = "f6c2ba6c-751b-4977-a94c-c140c30e9b92-843339462"
    

pathStrt = '/users/cs/amaral/wsd-datasets'
#pathStrt = 'C:\\Temp\\wsd-datasets'

# the data sets for performing on
datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')},
            {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')},
            {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},
            {'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]

# many different options for combinations of datasets for smaller tests
#datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]
#datasets = [{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
#datasets = [{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'kore', 'path':os.path.join(pathStrt,'kore.json')}, {'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}, {'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'wiki5000', 'path':os.path.join(pathStrt,'wiki-mentions.5000.json')},{'name':'nopop', 'path':os.path.join(pathStrt,'nopop.json')}]
#datasets = [{'name':'wiki500', 'path':os.path.join(pathStrt,'wiki-mentions.500.json')}]
datasets = [{'name':'MSNBC', 'path':os.path.join(pathStrt,'MSNBC.txt.json')},{'name':'AQUAINT', 'path':os.path.join(pathStrt,'AQUAINT.txt.json')}]

# 'popular', 'context1', 'context2', 'word2vec', 'coherence', 'tagme'
methods = ['popular']

if 'word2vec' in methods:
    try:
        word2vec
    except:
        word2vec = gensim_loadmodel('/users/cs/amaral/cgmdir/WikipediaClean5Negative300Skip10.Ehsan/WikipediaClean5Negative300Skip10')

doSplit = True
doManual = False

verbose = True

maxCands = 20

performances = {}

# for each dataset, run all methods
for dataset in datasets:
    performances[dataset['name']] = {}
    # get the data from dataset: one JSON record per line, decoded as UTF-8
    with open(dataset['path'], 'r') as dataFile:
        dataLines = [json.loads(line.decode('utf-8').strip()) for line in dataFile]
        
    print dataset['name'] + '\n'
    
    # run each method on the data set
    for mthd in methods:
        print mthd
        print str(datetime.now()) + '\n'
        
        # reset counters
        totalPrecS = 0
        totalPrecM = 0
        totalRecS = 0
        totalRecM = 0
        totalBotF1S = 0
        totalBotF1M = 0
        totalLines = 0
        
        # each method tests all lines
        for line in dataLines:
            if verbose:
                print str(totalLines + 1)
            
            # get absolute text indexes and entity id of each given mention
            trueEntities = mentionStartsAndEnds(copy.deepcopy(line), forTruth = True) # the ground truth
            trueSet = Set()
            for truEnt in trueEntities:
                trueSet.add(truEnt[2])
            
            # get results for pre split string
            if doSplit and mthd != 'tagme': # pre-split input does not work with tagme
                # original split string with mentions given
                resultS = wikifyEval(copy.deepcopy(line), True, maxC = maxCands, method = mthd)
                spltSet = Set()
                for res in resultS:
                    spltSet.add(res[2])
                
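                # guard against empty sets and against precision + recall == 0
                precS = len(trueSet & spltSet)/len(spltSet) if spltSet else 0
                recS = len(trueSet & spltSet)/len(trueSet) if trueSet else 0
                f1 = (2*precS*recS)/(precS+recS) if precS+recS > 0 else 0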
                
                if verbose:
                    print 'Split: ' + str(precS) + ', ' + str(recS) + ', ' + str(f1)
                
                # track results
                totalPrecS += precS
                totalRecS += recS
                totalBotF1S += f1
            else:
                totalPrecS = 0
                totalRecS = 0
                totalBotF1S = 0
                
            # get results for manually split string
            if doManual:
                # tagme has separate way to do things
                if mthd == 'tagme':
                    antns = tagme.annotate(" ".join(line['text']))
                    resultM = []
                    for an in antns.get_annotations(0.005):
                        resultM.append([an.begin,an.end,title2id(an.entity_title)])
                else:
                    # unsplit string to be manually split and mentions found
                    resultM = wikifyEval(" ".join(line['text']), False, maxC = maxCands, method = mthd)
                
                manSet = Set()
                for res in resultM:
                    manSet.add(res[2])
                
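                # guard against empty sets and against precision + recall == 0
                precM = len(trueSet & manSet)/len(manSet) if manSet else 0
                recM = len(trueSet & manSet)/len(trueSet) if trueSet else 0
                f1 = (2*precM*recM)/(precM+recM) if precM+recM > 0 else 0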
                
                if verbose:
                    print 'Manual: ' + str(precM) + ', ' + str(recM) + ', ' + str(f1)
                
                # track results
                totalPrecM += precM
                totalRecM += recM
                totalBotF1M += f1
            else:
                totalPrecM = 0
                totalRecM = 0
                totalBotF1M = 0
                
            totalLines += 1
        
        # record results for this method on this dataset
        # per-line averages, keyed by S (pre-split) / M (manual split) and metric
        performances[dataset['name']][mthd] = {'S Prec':totalPrecS/totalLines, 
                                               'M Prec':totalPrecM/totalLines,
                                              'S Rec':totalRecS/totalLines, 
                                               'M Rec':totalRecM/totalLines,
                                               'S BOT F1':totalBotF1S/totalLines,
                                               'M BOT F1':totalBotF1M/totalLines
                                              }

with open('/users/cs/amaral/wikisim/wikification/wikification_results.txt', 'a') as resultFile:
    resultFile.write('\nmaxC: ' + str(maxCands) + '\n\n')
    for dataset in datasets:
        resultFile.write(dataset['name'] + ':\n')
        for mthd in methods:
            if doSplit and doManual:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    S BOT F1 :' + str(performances[dataset['name']][mthd]['S BOT F1'])
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M BOT F1 :' + str(performances[dataset['name']][mthd]['M BOT F1'])+ '\n')
            elif doSplit:
                resultFile.write(mthd + ':'
                       + '\n    S Prec :' + str(performances[dataset['name']][mthd]['S Prec'])
                       + '\n    S Rec :' + str(performances[dataset['name']][mthd]['S Rec'])
                       + '\n    S BOT F1 :' + str(performances[dataset['name']][mthd]['S BOT F1'])+ '\n')
            elif doManual:
                resultFile.write(mthd + ':'
                       + '\n    M Prec :' + str(performances[dataset['name']][mthd]['M Prec'])
                       + '\n    M Rec :' + str(performances[dataset['name']][mthd]['M Rec'])
                       + '\n    M BOT F1 :' + str(performances[dataset['name']][mthd]['M BOT F1'])+ '\n')
                
    resultFile.write('\n' + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + '\n')

Overwriting wikification_eval_bot.py


# The Main Wikification Code
The code in this cell contains all of the logic to do wikification.

In [15]:
%%writefile wikification.py 

from __future__ import division
import sys
sys.path.append('../wikisim/')
from wikipedia import *
from operator import itemgetter
import requests
import json
import nltk
import scipy as sp
import scipy.sparse as sprs
import scipy.spatial
import scipy.sparse.linalg
from calcsim import *
sys.path.append('../')
from wsd.wsd import *

MIN_MENTION_LENGTH = 3 # mentions must be at least this long
MIN_FREQUENCY = 20 # anchor with frequency below is ignored

def get_solr_count(s):
    """ Gets the number of documents in which the string occurs
        NOTE: Multi words should be quoted
    Arg:
        s: the string (can contain AND, OR, ..)
    Returns:
        The number of documents
    """

    q='+text:(\"%s\")'%(s,)
    qstr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'indent':'on', 'wt':'json', 'q':q, 'rows':0}
    r = requests.get(qstr, params=params)
    try:
        if 'response' not in r.json():
            return 0
        else:
            return r.json()['response']['numFound']
    except:
        return 0

def get_mention_count(s):
    """
    Description:
        Returns the number of times that the given string appears as a mention in Wikipedia.
    Args:
        s: the mention text to look up
    Return:
        The number of times the given string appears as a mention in Wikipedia
    """
    
    result = anchor2concept(s)
    rSum = 0
    for item in result:
        rSum += item[1]
        
    return rSum

def mentionProb(text):
    """
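    Description:
        Returns the mention 'probability' of the text: the number of times it is a mention
        in Wikipedia divided by the number of Wikipedia documents it appears in.
    Args:
        text: the candidate mention text
    Return:
        The mention 'probability' of the text (0 if the text never appears).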
    """
    
    totalMentions = get_mention_count(text)
    totalAppearances = get_solr_count(text.replace(".", ""))
    
    if totalAppearances == 0:
        return 0 # a mention never used probably is not a good link
    else:
        return totalMentions/totalAppearances

def destroyExclusiveOverlaps(textData):
    """
    Description:
        Removes all overlaps that start at same letter from text data, so that only the best mention in an
        overlap set is left.
    Args:
        textData: [[start, end, text, anchProb],...]
    Return:
        textData minus the unnecessary elements that overlap.
    """
    
    newTextData = [] # textData minus the unnecessary parts of the overlapping
    overlappingSets = [] # stores arrays of the indexes of overlapping items from textData
    
    # creates the overlappingSets array
    if len(textData) == 1:
        overlappingSets.append([0]) # a lone mention is its own overlap set
    i = 0
    while i < len(textData)-1:
        # even single elements considered overlapping set
        # this is root of overlapping set
        overlappingSets.append([i])
        overlapIndex = len(overlappingSets) - 1
        theBegin = textData[i][0]
        
        # look at next words until not overlap
        for j in range(i+1, len(textData)):
            # if next word starts at the same position as the root of this set
            if textData[j][0] == theBegin:
                overlappingSets[overlapIndex].append(j)
                i = j # make sure not to repeat overlap set
            else:
                # add final word
                if j == len(textData) - 1:
                    overlappingSets.append([j])
                break
        i += 1
                    
    # get only the best overlapping element of each set
    for oSet in overlappingSets:
        bestIndex = 0
        bestScore = -1
        for i in oSet:
            score = mentionProb(textData[i][2])
            if score > bestScore:
                bestScore = score
                bestIndex = i
        
        # put right item in new textData
        newTextData.append(textData[bestIndex])
        
    return newTextData

def destroyResidualOverlaps(textData):
    """
    Description:
        Removes all overlaps from text data, so that only the best mention in an
        overlap set is left.
    Args:
        textData: [[start, end, text, anchProb],...]
    Return:
        textData minus the unnecessary elements that overlap.
    """
    
    newTextData = [] # to be returned
    oSet = [] # the set of current overlaps
    rootWIndex = 0 # the word to start looking from for finding root word
    rEnd = 0 # the end index of the root word
    
    # keep looping while there is a mention left to use as root (also handles empty input)
    while rootWIndex < len(textData):
        oSet = []
        oSet.append(textData[rootWIndex])
        for i in range(rootWIndex + 1, len(textData)):
            # if cur start before root end
            if textData[i][0] < textData[rootWIndex][1]:
                oSet.append(textData[i])
            else:
                break # have all overlap words

        
        bestIndex = 0
        # deal with the overlaps
        if len(oSet) > 1:
            bestProb = 0
            
            # choose the most probable
            i = 0
            for mention in oSet:
                prob = mentionProb(mention[2])
                if prob > bestProb:
                    bestProb = prob
                    bestIndex = i
                i += 1
        else:
            rootWIndex += 1 # move up one if no overlaps
                
        # remove from old text data all that is not best
        for i in range(0, len(oSet)):
            if i != bestIndex:
                textData.remove(oSet[i])
                
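        # add the best to new, but only once nothing overlaps it any more;
        # otherwise a later round could discard it after it was already kept
        if len(oSet) == 1 and not (oSet[bestIndex] in newTextData):
            newTextData.append(oSet[bestIndex])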
            
        if rootWIndex >= len(textData):
            break
    
    return newTextData
    
def mentionStartsAndEnds(textData, forTruth = False):
    """
    Description:
        Takes in a list of mentions and turns each of its mentions into the form: [wIndex, start, end]. 
        Or if forTruth is true: [[start,end,entityId]]
    Args:
        textData: {'text': [w1,w2,w3,...] , 'mentions': [[wordIndex,entityTitle],...]}, to be transformed 
            as described above.
        forTruth: Changes form to use.
    Return:
        The mentions in the form [[wIndex, start, end],...]. Or if forTruth is true: [[start, end, entityId],...]
    """
    
    curWord = 0 
    curStart = 0
    for mention in textData['mentions']:
        while curWord < mention[0]:
            curStart += len(textData['text'][curWord]) + 1
            curWord += 1
            
        ent = mention[1] # store entity title in case of forTruth
        mention.pop() # get rid of entity text
        
        if forTruth:
            mention.pop() # get rid of wIndex too
            
        mention.append(curStart) # start of the mention
        mention.append(curStart + len(textData['text'][curWord])) # end of the mention
        
        if forTruth:
            mention.append(title2id(ent)) # put on entityId
    
    return textData['mentions']
     
def mentionExtract(text):
    """
    Description:
        Takes in a text and splits it into the different words/mentions.
    Args:
        text: The text to be split.
    Return:
        The text split into the different words / mentions: 
        {'text':[w1,w2,...], 'mentions': [[wIndex,begin,end],...]}
    """
    
    addr = 'http://localhost:8983/solr/enwikianchors20160305/tag'
    params={'overlaps':'ALL', 'tagsLimit':'5000', 'fl':'id','wt':'json','indent':'on'}
    r = requests.post(addr, params=params, data=text.encode('utf-8'))
    textData0 = r.json()['tags']
    
    splitText = [] # the text now in split form
    mentions = [] # mentions before remove inadequate ones
    
    textData = [] # [[begin,end,word,anchorProb,wordIndex],...]
    
    i = 0 # for wordIndex
    # get rid of extra un-needed Solr data, and add in anchor probability
    for item in textData0:
        totalMentions = get_mention_count(text[item[1]:item[3]])
        totalAppearances = get_solr_count(text[item[1]:item[3]].replace(".", ""))
        if totalAppearances == 0:
            anchorProb = 0
        else:
            # float() keeps this a true ratio even under Python 2 integer division
            anchorProb = float(totalMentions)/totalAppearances
        # put in the new clean textData
        textData.append([item[1], item[3], text[item[1]:item[3]], anchorProb, i])
        i += 1
        
        # also fill split text
        splitText.append(text[item[1]:item[3]])
    
    # get rid of overlaps
    textData = destroyExclusiveOverlaps(textData)
    textData = destroyResidualOverlaps(textData)
        
    # gets the POS labels for the words
    postrs = []
    for item in textData:
        postrs.append(item[2])
    postrs = nltk.pos_tag(postrs)
    for i in range(0,len(textData)):
        textData[i].append(postrs[i]) # [5][1] is index of type of word
    
    mentionPThrsh = 0.001 # for getting rid of unlikelies
    
    # put in only good mentions
    for item in textData:
        if (item[3] >= mentionPThrsh # if popular enough, and either some type of noun or JJ or CD
                and (item[5][1][0:2] == 'NN' or item[5][1] == 'JJ' or item[5][1] == 'CD')):
            mentions.append([item[4], item[0], item[1]]) # wIndex, start, end
    
    # get in same format as dataset provided data
    newTextData = {'text':splitText, 'mentions':mentions}
    
    return newTextData

def generateCandidates(textData, maxC):
    """
    Description:
        Generates up to maxC candidates for each mention in textData.
    Args:
        textData: A text in split form along with its suspected mentions.
        maxC: The max amount of candidates to accept.
    Return:
        The top maxC candidates for each possible mention word in textData.
    """
    
    candidates = []
    
    for mention in textData['mentions']:
        results = sorted(anchor2concept(textData['text'][mention[0]]), key = itemgetter(1), 
                          reverse = True)
        candidates.append(results[:maxC]) # take up to maxC of the results
    
    return candidates

def precision(truthSet, mySet):
    """
    Description:
        Calculates the precision of mySet against the truthSet.
    Args:
        truthSet: The 'right' answers for what the entities are. [[start,end,id],...]
        mySet: My code's output for what it thinks the right entities are. [[start,end,id],...]
    Return:
        The precision: (# of correct entities)/(# of found entities)
    """
    
    numFound = len(mySet)
    numCorrect = 0 # incremented in for loop
    
    truthIndex = 0
    myIndex = 0
    
    while truthIndex < len(truthSet) and myIndex < len(mySet):
        if mySet[myIndex][0] < truthSet[truthIndex][0]:
            if mySet[myIndex][1] > truthSet[truthIndex][0]:
                # overlap with mine behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine not even reach truth
                myIndex += 1
                
        elif mySet[myIndex][0] == truthSet[truthIndex][0]:
            # same mention (same start at least)
            if truthSet[truthIndex][2] == mySet[myIndex][2]:
                numCorrect += 1
                truthIndex += 1
                myIndex += 1
            elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                # truth ends first
                truthIndex += 1
            else:
                # mine ends first
                myIndex += 1
                  
        elif mySet[myIndex][0] > truthSet[truthIndex][0]:
            if mySet[myIndex][0] < truthSet[truthIndex][1]:
                # overlap with truth behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine beyond mention, increment truth
                truthIndex += 1

    #print 'correct: ' + str(numCorrect) + '\nfound: ' + str(numFound)
    if numFound == 0:
        return 0
    else:
        return float(numCorrect)/numFound # float() avoids integer division under Python 2

def recall(truthSet, mySet):
    """
    Description:
        Calculates the recall of mySet against the truthSet.
    Args:
        truthSet: The 'right' answers for what the entities are. [[start,end,id],...]
        mySet: My code's output for what it thinks the right entities are. [[start,end,id],...]
    Return:
        The recall: (# of correct entities)/(# of actual entities)
    """
    
    numActual = len(truthSet)
    numCorrect = 0 # incremented in the while loop below
    
    truthIndex = 0
    myIndex = 0
    
    while truthIndex < len(truthSet) and myIndex < len(mySet):
        if mySet[myIndex][0] < truthSet[truthIndex][0]:
            if mySet[myIndex][1] > truthSet[truthIndex][0]:
                # overlap with mine behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine not even reach truth
                myIndex += 1
                
        elif mySet[myIndex][0] == truthSet[truthIndex][0]:
            # same mention (same start at least)
            if truthSet[truthIndex][2] == mySet[myIndex][2]:
                numCorrect += 1
                truthIndex += 1
                myIndex += 1
            elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                # truth ends first
                truthIndex += 1
            else:
                # mine ends first
                myIndex += 1
                  
        elif mySet[myIndex][0] > truthSet[truthIndex][0]:
            if mySet[myIndex][0] < truthSet[truthIndex][1]:
                # overlap with truth behind
                if truthSet[truthIndex][2] == mySet[myIndex][2]:
                    numCorrect += 1
                    truthIndex += 1
                    myIndex += 1
                elif truthSet[truthIndex][1] < mySet[myIndex][1]:
                    # truth ends first
                    truthIndex += 1
                else:
                    # mine ends first
                    myIndex += 1
            else:
                # mine beyond mention, increment truth
                truthIndex += 1
                
    if numActual == 0:
        return 0
    else:
        return float(numCorrect)/numActual # float() avoids integer division under Python 2
    
def mentionPrecision(trueMentions, otherMentions):
    """
    Description:
        Calculates the precision of otherMentions against the trueMentions.
    Args:
        trueMentions: The 'right' answers for what the mentions are.
        otherMentions: Our mentions obtained through some means.
    Return:
        The precision: (# of correct mentions)/(# of found mentions)
    """
    
    numFound = len(otherMentions)
    numCorrect = 0 # incremented in for loop
    
    trueIndex = 0
    otherIndex = 0
    
    while trueIndex < len(trueMentions) and otherIndex < len(otherMentions):
        # if mentions start and end on the same
        if (trueMentions[trueIndex][0] == otherMentions[otherIndex][0]
               and trueMentions[trueIndex][1] == otherMentions[otherIndex][1]):
            #print ('MATCH: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <===> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            numCorrect += 1
            trueIndex += 1
            otherIndex += 1
        # if true mention starts before the other starts
        elif trueMentions[trueIndex][0] < otherMentions[otherIndex][0]:
            #print ('FAIL: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <XXX> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            trueIndex += 1
        # otherwise the other mention starts first, or both start together but end differently
        else:
            #print ('FAIL: [' + str(trueMentions[trueIndex][0]) + ',' + str(trueMentions[trueIndex][1]) + ']' + trueMentions[trueIndex][2] 
            #       + ' <XXX> [' + str(otherMentions[otherIndex][0]) + ',' + str(otherMentions[otherIndex][1]) + ']' + otherMentions[otherIndex][2])
            otherIndex += 1

    #print 'correct: ' + str(numCorrect) + '\nfound: ' + str(numFound)
    if numFound == 0:
        return 0
    else:
        return float(numCorrect)/numFound # float() avoids integer division under Python 2

def mentionRecall(trueMentions, otherMentions):
    """
    Description:
        Calculates the recall of otherMentions against the trueMentions.
    Args:
        trueMentions: The 'right' answers for what the mentions are.
        otherMentions: Our mentions obtained through some means.
    Return:
        The recall: (# of correct mentions)/(# of actual mentions)
    """
    
    numActual = len(trueMentions)
    numCorrect = 0 # incremented in the while loop below
    
    trueIndex = 0
    otherIndex = 0
    
    while trueIndex < len(trueMentions) and otherIndex < len(otherMentions):
        # if mentions start and end on the same
        if (trueMentions[trueIndex][0] == otherMentions[otherIndex][0]
               and trueMentions[trueIndex][1] == otherMentions[otherIndex][1]):
            numCorrect += 1
            trueIndex += 1
            otherIndex += 1
        # if true mention starts before the other starts
        elif trueMentions[trueIndex][0] < otherMentions[otherIndex][0]:
            trueIndex += 1
        # if other mention starts before the true starts (or at the same place with a different end)
        elif trueMentions[trueIndex][0] >= otherMentions[otherIndex][0]:
            otherIndex += 1
        
    print('correct: ' + str(numCorrect) + '\nactual: ' + str(numActual))
    if numActual == 0:
        return 0
    else:
        return float(numCorrect)/numActual # float() avoids integer division under Python 2
    
def getSurroundingWords(text, mIndex, window, asList = False):
    """
    Description:
        Returns the words surrounding the given mention, expanding out window elements
        on both sides.
    Args:
        text: A list of words.
        mIndex: The index of the word that is the center of where to get surrounding words.
        window: The amount of words to the left and right to get.
        asList: Whether to return the words as a list, otherwise just a string.
    Return:
        The words that surround the given mention. Expanding out window elements
        on both sides.
    """
    
    imin = mIndex - window
    imax = mIndex + window + 1
    
    # fix extreme bounds
    if imin < 0:
        imin = 0
    if imax > len(text):
        imax = len(text)
        
    if asList == True:
        words = (text[imin:mIndex] + text[mIndex+1:imax])
    else:
        words = " ".join(text[imin:mIndex] + text[mIndex+1:imax])
    
    # return surrounding part of word minus the mIndex word
    return words

def getMentionSentence(text, mention, asList = False):
    """
    Description:
        Returns the sentence of the mention, minus the mention.
    Args:
        text: The text to get the sentence from.
        mention: The mention, in the form [wIndex, start, end].
        asList: Whether to return the words as a list, otherwise just a string.
    Return:
        The sentence of the mention, minus the mention.
    """
    
    # the start and end indexes of the sentence
    sStart = 0
    sEnd = 0
    
    # get sentences using nltk
    sents = nltk.sent_tokenize(text)
    
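    # find sentence that mention is in
    curLen = 0 # offset in text just past the current sentence
    for s in sents:
        # locate the sentence in the original text; summing len(s) alone would
        # drift, because sent_tokenize drops the whitespace between sentences
        sStart = text.find(s, curLen)
        if sStart == -1:
            sStart = curLen # fall back to the running offset
        curLen = sStart + len(s)
        # if the sentence ends after the begin of the mention
        if curLen > mention[1]: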
            # remove mention from string to not get bias from self referencing article
            if asList == True:
                sentence = (s.replace(text[mention[1]:mention[2]],"")).split(" ")
            else:
                sentence = s.replace(text[mention[1]:mention[2]],"")
            
            return sentence
        
    # in case it missed
    if asList == True:
        return []
    else:
        return ""

def escapeStringSolr(text):
    """
    Description:
        Escapes a given string for use in Solr.
    Args:
        text: The string to escape.
    Return:
        The escaped text.
    """
    
    # escape backslashes first, so the escapes added below are not doubled
    text = text.replace("\\", "\\\\") # one backslash becomes two
    text = text.replace('+', r'\+')
    text = text.replace("-", r"\-")
    text = text.replace("&&", r"\&&")
    text = text.replace("||", r"\||")
    text = text.replace("!", r"\!")
    text = text.replace("(", r"\(")
    text = text.replace(")", r"\)")
    text = text.replace("{", r"\{")
    text = text.replace("}", r"\}")
    text = text.replace("[", r"\[")
    text = text.replace("]", r"\]")
    text = text.replace("^", r"\^")
    text = text.replace("\"", "\\\"")
    text = text.replace("~", r"\~")
    text = text.replace("*", r"\*")
    text = text.replace("?", r"\?")
    text = text.replace(":", r"\:")
    text = text.replace("/", r"\/") # also special in the Solr query parser
    
    return text

def bestContext1Match(mentionStr, context, candidates):
    """
    Description:
        Uses Solr to find the candidate that gives the highest relevance when given the context.
    Args:
        mentionStr: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best relevance score from the context.
    """
    
    # put text in right format
    context = escapeStringSolr(context)
    mentionStr = escapeStringSolr(mentionStr)
    
    strIds = ['id:' +  str(strId[0]) for strId in candidates]
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'fl':'id score', 'fq':" ".join(strIds), 'indent':'on',
            'q':'text:('+context.encode('utf-8')+')^1 title:(' + mentionStr.encode('utf-8')+')^1.35',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    if 'response' not in r.json():
        return 0 # default to most popular
    
    if 'docs' not in r.json()['response']:
        return 0
    
    results = r.json()['response']['docs']
    if len(results) == 0:
        return 0 # default to most popular
    
    bestId = int(r.json()['response']['docs'][0]['id']) # int() works under Python 2 and 3
    
    #for doc in r.json()['response']['docs']:
        #print '[' + id2title(doc['id']) + '] -> ' + str(doc['score'])
    
    # find which index has bestId
    bestIndex = 0
    for cand in candidates:
        if cand[0] == bestId:
            return bestIndex
        else:
            bestIndex += 1
            
    return 0 # bestId was not among the candidates, default to most popular

def bestContext2Match(context, candidates):
    """
    Description:
        Uses Solr to find the candidate that gives the highest relevance when given the context.
    Args:
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best relevance score from the context.
    """
    
    # put text in right format
    context = escapeStringSolr(context)
    strIds = ['entityid:' +  str(strId[0]) for strId in candidates]
    
    # dictionary to hold scores for each id
    scoreDict = {}
    for cand in candidates:
        scoreDict[str(cand[0])] = 0
    
    # select all the docs from Solr with the best scores, highest first.
    addr = 'http://localhost:8983/solr/enwiki20160305_context/select'
    params={'fl':'entityid', 'fq':" ".join(strIds), 'indent':'on',
            'q':'_context_:('+context.encode('utf-8')+')',
            'wt':'json'}
    r = requests.get(addr, params = params)
    
    if 'response' not in r.json():
        return 0 # default to most popular
    
    if 'docs' not in r.json()['response']:
        return 0
    
    results = r.json()['response']['docs']
    if len(results) == 0:
        return 0 # default to most popular
    
    for doc in r.json()['response']['docs']:
        scoreDict[str(doc['entityid'])] += 1
    
    # get the index that has the best score
    bestScore = 0
    bestIndex = 0
    curIndex = 0
    for cand in candidates:
        if scoreDict[str(cand[0])] > bestScore:
            bestScore = scoreDict[str(cand[0])]
            bestIndex = curIndex
        curIndex += 1
            
    return bestIndex

def bestWord2VecMatch(context, candidates):
    """
    Description:
        Uses word2vec to find the candidate with the best similarity to the context.
    Args:
        context: The words that surround the target word as a list.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The index of the candidate with the best similarity score with the context.
    """
    
    ctxVec = pd.Series(sp.zeros(300)) # default zero vector
    # add all context words together
    for word in context:
        ctxVec += getword2vector(word)
        
    # compare context vector to each of the candidates
    bestIndex = 0
    bestScore = 0
    i = 0
    for cand in candidates:
        eVec = getentity2vector(str(cand[0]))
        score = 1-sp.spatial.distance.cosine(ctxVec, eVec)
        #print '[' + id2title(cand[0]) + ']' + ' -> ' + str(score)
        # update score and index
        if score > bestScore: 
            bestIndex = i
            bestScore = score
            
        i += 1 # next index
            
    return bestIndex
    
def wikifyPopular(textData, candidates):
    """
    Description:
        Chooses the most popular candidate for each mention.
    Args:
        textData: A text in split form along with its suspected mentions.
        candidates: A list of list of candidates that each have the entity id and its frequency/popularity.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            topCandidates.append([mention[1], mention[2], candidates[i][0][0]])
        i += 1 # move to list of candidates for next mention
            
    return topCandidates

def wikifyContext(textData, candidates, oText, useSentence = False, window = 7, method2 = False):
    """
    Description:
        Chooses the candidate that has the highest relevance with the surrounding window words.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        oText: The original text to be used for getting sentence.
        useSentence: Whether to use the whole sentence as context, or just the window.
        window: How many words on both sides of a mention to search for context.
        method2: Whether to score with bestContext2Match instead of bestContext1Match.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            if not useSentence:
                context = getSurroundingWords(textData['text'], mention[0], window)
            else:
                context = getMentionSentence(oText, mention)
            #print '\nMention: ' + textData['text'][mention[0]]
            #print 'Context: ' + context
            if method2 == False:
                bestIndex = bestContext1Match(textData['text'][mention[0]], context, candidates[i])
            else:
                bestIndex = bestContext2Match(context, candidates[i])
            topCandidates.append([mention[1], mention[2], candidates[i][bestIndex][0]])
        i += 1 # move to list of candidates for next mention
        
    return topCandidates

def wikifyWord2Vec(textData, candidates, oText, useSentence = False, window = 5):
    """
    Description:
        Chooses the candidates that have the highest similarity to the context.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        oText: The original text to be used for getting sentence.
        useSentence: Whether to use the whole sentence as context, or just the window.
        window: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCandidates = []
    i = 0 # track which mention's candidates we are looking at
    # for each mention choose the top candidate
    for mention in textData['mentions']:
        if len(candidates[i]) > 0:
            if not useSentence:
                context = getSurroundingWords(textData['text'], mention[0], window, asList = True)
            else:
                context = getMentionSentence(oText, mention, asList = True)
            #print '\nMention: ' + textData['text'][mention[0]]
            #print 'Context: ' + " ".join(context)
            bestIndex = bestWord2VecMatch(context, candidates[i])
            topCandidates.append([mention[1], mention[2], candidates[i][bestIndex][0]])
        i += 1 # move to list of candidates for next mention
        
    return topCandidates

def wikifyCoherence(textData, candidates, ws = 5):
    """
    Description:
        Chooses the candidates that have the highest coherence according to rvs pagerank method.
    Args:
        textData: A textData in split form along with its suspected mentions.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        ws: How many words on both sides of a mention to search for context.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    topCands = [] # the top candidate from each candidate list
    candsScores = coherence_scores_driver(candidates, ws, method='rvspagerank', direction=DIR_BOTH, op_method="keydisamb")
    i = -1 # track what mention we are on
    for cScores in candsScores:
        i += 1
        
        if len(cScores) == 0:
            continue # nothing to do with this one
            
        bestScore = sorted(cScores, reverse = True)[0]
        curIndex = 0
        for score in cScores:
            if score == bestScore:
                topCands.append([textData['mentions'][i][1], textData['mentions'][i][2], candidates[i][curIndex][0]])
                break
            curIndex += 1
            
    return topCands

def wikifyEval(text, mentionsGiven, maxC = 20, method='popular', strict = False):
    """
    Description:
        Takes the text (maybe text data), and wikifies it for evaluation purposes using the desired method.
    Args:
        text: The text to wikify. Either the original string, or, if the mentions are given,
            the form: {'text': [w1,w2,...], 'mentions': [[wordIndex,entityTitle],...]}.
        mentionsGiven: Whether the mentions are given to us and the text is already split.
        maxC: The max amount of candidates to extract.
        method: The method used to wikify.
        strict: Whether to use such rules as minimum mention length, or minimum frequency of concept.
    Return:
        All of the proposed entities for the mentions, of the form: [[start,end,entityId],...].
    """
    
    if not(mentionsGiven): # if words are not in pre-split form
        textData = mentionExtract(text) # extract mentions from text
        oText = text # the original text
    else: # if they are
        textData = text
        textData['mentions'] = mentionStartsAndEnds(textData) # put mentions in right form
        oText = " ".join(text['text'])
    
    # get rid of small mentions
    if strict:
        textData['mentions'] = [item for item in textData['mentions']
                    if  len(textData['text'][item[0]]) >= MIN_MENTION_LENGTH]
    
    candidates = generateCandidates(textData, maxC)
    
    if method == 'popular':
        wikified = wikifyPopular(textData, candidates)
    elif method == 'context1':
        wikified = wikifyContext(textData, candidates, oText, useSentence = True, window = 7)
    elif method == 'context2':
        wikified = wikifyContext(textData, candidates, oText, useSentence = True, window = 7, method2 = True)
    elif method == 'word2vec':
        wikified = wikifyWord2Vec(textData, candidates, oText, useSentence = False, window = 5)
    elif method == 'coherence':
        wikified = wikifyCoherence(textData, candidates, ws = 5)
    else:
        raise ValueError('unknown wikification method: ' + str(method))
    
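    # get rid of very unpopular mentions
    if strict:
        # the wikify* functions above return [start, end, entityId] with no frequency
        # at index 3, so only filter items that actually carry one
        wikified = [item for item in wikified
                    if len(item) <= 3 or item[3] >= MIN_FREQUENCY]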
        
    return wikified

Overwriting wikification.py
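
As a quick sanity check of the evaluation logic, the sketch below scores a toy prediction against a toy truth set. It is a simplified stand-in rather than the `precision` and `recall` functions written to `wikification.py`: the name `overlapScores` and the toy data are hypothetical. It assumes both lists are sorted by start offset and use the `[[start,end,entityId],...]` form returned by `wikifyEval`, and it counts a prediction as correct when it overlaps a true mention and carries the same entity id.

```python
def overlapScores(truthSet, mySet):
    """Return (precision, recall) for two lists of [start, end, entityId]."""
    numCorrect = 0
    truthIndex = 0
    myIndex = 0

    while truthIndex < len(truthSet) and myIndex < len(mySet):
        tStart, tEnd, tId = truthSet[truthIndex]
        mStart, mEnd, mId = mySet[myIndex]

        # two spans overlap when each one starts before the other ends
        overlaps = mStart < tEnd and tStart < mEnd

        if overlaps and tId == mId:
            numCorrect += 1
            truthIndex += 1
            myIndex += 1
        elif tEnd < mEnd:
            truthIndex += 1  # truth ends first, nothing later can match it
        else:
            myIndex += 1  # mine ends first (or both end together)

    # float() keeps the ratios fractional under Python 2 as well
    precision = float(numCorrect) / len(mySet) if mySet else 0
    recall = float(numCorrect) / len(truthSet) if truthSet else 0
    return precision, recall


# hypothetical toy data: three true mentions, four predicted ones
truth = [[0, 5, 11], [10, 20, 22], [30, 35, 33]]
mine = [[0, 5, 11], [12, 18, 99], [30, 36, 33], [40, 45, 44]]
p, r = overlapScores(truth, mine)
```

Here two of the four predictions are correct (the first exactly, the third by overlap with a matching id), so `p` is 0.5 and `r` is 2/3. Counting the matches once and deriving both ratios from the same count keeps the two metrics from drifting apart if the matching rule is changed later.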
