# Exploring wikifier ("From Text to Knowledge: The Information Extraction Pipeline")

This notebook explores the ``wikifier`` function, which returns the entities found in a text together with their corresponding Wikidata IDs. As input, I use a dummy text from English Wikipedia:

In [4]:
text = "François-Marie Arouet , known by his nom de plume Voltaire, was a French Enlightenment writer, historian, and philosopher famous for his wit, his criticism of Christianity—especially the Roman Catholic Church—as well as his advocacy of freedom of speech, freedom of religion, and separation of church and state.Voltaire's next play, Artémire, set in ancient Macedonia, opened on 15 February 1720. It was a flop and only fragments of the text survive. He instead turned to an epic poem about Henry IV of France that he had begun in early 1717."

In [9]:
print(text)  # check that the text was stored correctly

François-Marie Arouet , known by his nom de plume Voltaire, was a French Enlightenment writer, historian, and philosopher famous for his wit, his criticism of Christianity—especially the Roman Catholic Church—as well as his advocacy of freedom of speech, freedom of religion, and separation of church and state.Voltaire's next play, Artémire, set in ancient Macedonia, opened on 15 February 1720. It was a flop and only fragments of the text survive. He instead turned to an epic poem about Henry IV of France that he had begun in early 1717.


In [4]:
import spacy
# The model has to be downloaded once, e.g. from the Anaconda shell:
#   python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

Open the Anaconda shell and install the following packages:
```
conda install spacy
conda install neuralcoref
```

In [2]:
import spacy
import neuralcoref

# Load SpaCy
nlp =  spacy.load('en_core_web_md')
# Add neural coref to SpaCy's pipe
neuralcoref.add_to_pipe(nlp)

def coref_resolution(text):
    """Function that executes coreference resolution on a given text"""
    doc = nlp(text)
    # Collect every token together with its trailing whitespace
    tok_list = [token.text_with_ws for token in doc]
    for cluster in doc._.coref_clusters:
        # get tokens from representative cluster name
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            if coref != cluster.main:  # skip the representative mention itself
                # Replace only when the mention text differs from the representative text
                # and shares no words with it; this handles nested coreference scenarios.
                if coref.text != cluster.main.text and not set(coref.text.split(' ')).intersection(cluster_main_words):
                    tok_list[coref.start] = cluster.main.text + \
                        doc[coref.end-1].whitespace_
                    for i in range(coref.start+1, coref.end):
                        tok_list[i] = ""

    return "".join(tok_list)



In [None]:
coref_resolution(text)
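
The token-replacement step inside `coref_resolution` can be tried on its own, without loading the model. The following is only a minimal sketch. `replace_mentions` is a hypothetical helper, and the `(start, end)` token spans are written by hand instead of coming from `doc._.coref_clusters`.

```python
def replace_mentions(tokens_with_ws, mentions, main_text):
    """Replace each (start, end) token span with main_text.

    Hypothetical stand-in for the inner loop of coref_resolution:
    `end` is exclusive, as with spaCy spans, and the trailing whitespace
    of the last replaced token is preserved.
    """
    out = list(tokens_with_ws)
    for start, end in mentions:
        last = tokens_with_ws[end - 1]
        trailing_ws = last[len(last.rstrip()):]
        # The first token of the span carries the representative text ...
        out[start] = main_text + trailing_ws
        # ... and the remaining tokens of the span are blanked out.
        for i in range(start + 1, end):
            out[i] = ""
    return "".join(out)


# Hand-written tokens (not produced by spaCy); "He " is token 4.
tokens = ["Voltaire ", "was ", "French", ". ", "He ", "wrote ", "plays", "."]
resolved = replace_mentions(tokens, [(4, 5)], "Voltaire")
print(resolved)
```

Blanking the remaining tokens instead of deleting them keeps all token indices stable, so several spans can be replaced in one pass.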

In [36]:
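import json
import urllib.parse    # needed for urlencode; a bare "import urllib" does not guarantee the submodule is loaded
import urllib.request  # needed for Request and urlopen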

ENTITY_TYPES = ["human", "person", "company", "enterprise", "business", "geographic region",
                "human settlement", "geographic entity", "territorial entity type", "organization"]

def wikifier(text, lang="en", threshold=0.8):
    """Function that fetches entity linking results from wikifier.com API"""
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "tgbdmkpmkluegqfbawcwjywieevmza"),
        ("pageRankSqThreshold", "%g" %
         threshold), ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "100"), ("nWordsToIgnoreFromList", "100"),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"), ("minLinkFrequency", "2"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")
    ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    results = list()
    for annotation in response["annotations"]:
        # Filter out desired entity classes
        if ('wikiDataClasses' in annotation) and (any([el['enLabel'] in ENTITY_TYPES for el in annotation['wikiDataClasses']])):

            # Specify entity label
            if any([el['enLabel'] in ["human", "person"] for el in annotation['wikiDataClasses']]):
                label = 'Person'
            elif any([el['enLabel'] in ["company", "enterprise", "business", "organization"] for el in annotation['wikiDataClasses']]):
                label = 'Organization'
            elif any([el['enLabel'] in ["geographic region", "human settlement", "geographic entity", "territorial entity type"] for el in annotation['wikiDataClasses']]):
                label = 'Location'
            else:
                label = None

            results.append({'title': annotation['title'], 'wikiId': annotation['wikiDataItemId'], 'label': label,
                            'characters': [(el['chFrom'], el['chTo']) for el in annotation['support']]})
    return results
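
The label assignment in `wikifier` can be checked offline, without calling the API. The sketch below reproduces the same priority order (Person, then Organization, then Location) with a lookup table. `LABEL_BY_CLASS` and `label_for` are hypothetical names, and the sample annotations are hand-written, not real API responses.

```python
# Ordered like the if/elif chain in wikifier(): the first match wins.
LABEL_BY_CLASS = [
    ("Person", {"human", "person"}),
    ("Organization", {"company", "enterprise", "business", "organization"}),
    ("Location", {"geographic region", "human settlement",
                  "geographic entity", "territorial entity type"}),
]


def label_for(annotation):
    """Return the first matching label, or None if no class of interest is present."""
    class_labels = {el["enLabel"] for el in annotation.get("wikiDataClasses", [])}
    for label, wanted in LABEL_BY_CLASS:
        if class_labels & wanted:
            return label
    return None


# Hand-written sample annotation (assumed shape: a list of dicts with 'enLabel').
sample = {"title": "Voltaire",
          "wikiDataClasses": [{"enLabel": "human"}, {"enLabel": "writer"}]}
print(label_for(sample))
```

An annotation that matches several groups gets the label of the first group in the list, which is how the original chain behaves as well.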

In [37]:
entities = wikifier(text, lang="en", threshold=0.8)  # entities found in the text, with their Wikidata IDs
entities

[{'title': 'Voltaire',
  'wikiId': 'Q9068',
  'label': 'Person',
  'characters': [(0, 20), (50, 57), (50, 58)]},
 {'title': 'Christianity',
  'wikiId': 'Q5043',
  'label': 'Organization',
  'characters': [(159, 170)]},
 {'title': 'Catholic Church',
  'wikiId': 'Q9592',
  'label': 'Organization',
  'characters': [(183, 207),
   (187, 191),
   (187, 200),
   (187, 207),
   (193, 200),
   (193, 207),
   (202, 207),
   (266, 273),
   (294, 299)]},
 {'title': 'Henry IV of France',
  'wikiId': 'Q936976',
  'label': 'Person',
  'characters': [(491, 498), (491, 508)]}]
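
To see which surface strings the `characters` pairs refer to, the offsets can be mapped back onto the text. This is a small sketch that treats the second offset as inclusive. That is an assumption inferred from the output above, where `(0, 20)` covers the 21-character name "François-Marie Arouet". The helper name `mentions` is hypothetical.

```python
def mentions(text, entity):
    """Return the surface strings for an entity's (chFrom, chTo) pairs.

    Assumes chTo is inclusive, hence the `end + 1` in the slice.
    """
    return [text[start:end + 1] for start, end in entity["characters"]]


# Beginning of the dummy text and the first entry of the output above.
sample_text = ("François-Marie Arouet , known by his nom de plume Voltaire, "
               "was a French Enlightenment writer")
sample_entity = {"title": "Voltaire", "wikiId": "Q9068",
                 "characters": [(0, 20), (50, 57), (50, 58)]}
print(mentions(sample_text, sample_entity))
```

The third pair, `(50, 58)`, also covers the comma after "Voltaire", so overlapping pairs may need to be deduplicated or trimmed before further use.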

Reference: this notebook explores the pipeline described at https://towardsdatascience.com/from-text-to-knowledge-the-information-extraction-pipeline-b65e7e30273e. The test input is a dummy text taken from the Wikipedia article on Voltaire.