<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/ie_pipeline/SpaCy_informationextraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# !pip install crosslingual-coreference==0.2.3 spacy-transformers==1.1.5 wikipedia neo4j
# !pip install --upgrade google-cloud-storage
# !pip install transformers==4.18.0
# !python -m spacy download en_core_web_sm


# Restart runtime

Ever since I first dabbled in natural language processing, information extraction (IE) pipelines have had a special place in my heart. IE pipelines extract structured data from unstructured data such as text. The internet provides an abundance of information in the form of various articles and other content formats. However, while you might read the news or subscribe to multiple podcasts, it is virtually impossible to keep track of all the new information released daily. Even if you could manually read all the latest reports and articles, it would be incredibly tedious and labor-intensive to structure the data so that you can easily query and aggregate it with your preferred tools. I definitely wouldn't want to be doing that as my job. Luckily, we can use state-of-the-art NLP techniques to do the information extraction for us automatically.

# Developing IE pipeline in SpaCy
There has been much development around [SpaCy](https://spacy.io/) in the last couple of weeks, so I decided to try out the new plugins and use them to construct an information extraction pipeline.

## Coreference resolution
First off, we are going to be using the new [Crosslingual Coreference](https://spacy.io/universe/project/crosslingualcoreference) model contributed by [David Berenstein](https://www.linkedin.com/in/david-berenstein-1bab11105/) to the SpaCy Universe. SpaCy Universe is a collection of open-source plugins or addons for SpaCy. The cool thing about the SpaCy universe project is that it's straightforward to add the models to our pipeline.

## Relation extraction
You might wonder why we skipped the named entity recognition and linking step. The reason is that we will be using the [Rebel project](https://github.com/Babelscape/rebel), which recognizes both the entities and the relations in the text. If I understand correctly, the Rebel project was developed by [Pere-Lluís Huguet Cabot](https://www.linkedin.com/in/perelluis/) as a part of his PhD study. A massive shoutout to Pere for creating such an incredible library with state-of-the-art results for relation extraction. The Rebel model is available on Hugging Face as well as in the form of a SpaCy component.
However, the model doesn't do any entity linking, so we will implement our own version. We will simply search for entities on WikiData by calling its search entities API (`wbsearchentities`).
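
As a rough illustration of that lookup, the sketch below builds a `wbsearchentities` request URL and picks the first hit from a response. The helper names (`build_search_url`, `first_entity_id`) and the sample payload are illustrative assumptions: the payload is hand-written and trimmed to the fields used here, and no network call is made.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term: str, language: str = "en") -> str:
    """Build a wbsearchentities URL with the search term URL-encoded."""
    query = urlencode({
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    })
    return f"{WIKIDATA_API}?{query}"


def first_entity_id(payload: dict, default: str = "id-less") -> str:
    """Return the id of the first search hit, or `default` when there are no hits."""
    hits = payload.get("search", [])
    return hits[0]["id"] if hits else default


# Hand-written sample shaped like an API response (illustrative, not fetched live)
sample_payload = {"search": [{"id": "Q95", "label": "Google"}]}

print(build_search_url("Christian Drosten"))
print(first_entity_id(sample_payload))
print(first_entity_id({"search": []}))
```

Taking the first hit is the simplest possible disambiguation strategy, so ambiguous names can easily resolve to the wrong entity.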

I've been looking at how to improve entity linking and stumbled upon the [ExtEnd project](https://github.com/SapienzaNLP/extend). The ExtEnd project is a novel approach to entity disambiguation and is available as a demo on Hugging Face as well as a SpaCy component. I've played around with it a bit and managed to get it working using the WikiData API for candidates instead of the original AIDA candidates. However, when I tried to put all three projects (Coref, Rebel, ExtEnd) in the same pipeline, I ran into dependency issues because they use different versions of PyTorch, so I gave up for now. The ExtEnd code I developed is available on GitHub, and if someone wants to help me get it working, I am more than happy to accept pull requests.

In [None]:
import spacy
import crosslingual_coreference

Ok, so for now, we won't be using the ExtEnd project but will use a simplified version of entity linking by simply taking the first candidate fetched from the WikiData API. The only thing we need to do is incorporate our simplified entity linking solution in the Rebel pipeline. Since the Rebel component is not available directly as a SpaCy Universe project, we must copy the component definition from their repository manually. I've taken the liberty to implement my version of the set_annotations function in the Rebel SpaCy component, while the rest of the code is the same as the original.

In [None]:
# Add rebel component https://github.com/Babelscape/rebel/blob/main/spacy_component.py
import requests
import re
import hashlib
from spacy import Language
from typing import List

from spacy.tokens import Doc, Span

from transformers import pipeline

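def call_wiki_api(item):
    """Return the id of the first WikiData search hit for `item`, or 'id-less' on failure."""
    try:
        url = "https://www.wikidata.org/w/api.php"
        # Pass the query as params so that requests URL-encodes the search term
        params = {"action": "wbsearchentities", "search": item, "language": "en", "format": "json"}
        data = requests.get(url, params=params).json()
        # Return the first id (Could upgrade this in the future)
        return data['search'][0]['id']
    except Exception:
        return 'id-less'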

def extract_triplets(text):
    """
    Function to parse the generated text and extract the triplets
    """
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})

    return triplets


@Language.factory(
    "rebel",
    requires=["doc.sents"],
    assigns=["doc._.rel"],
    default_config={
        "model_name": "Babelscape/rebel-large",
        "device": 0,
    },
)
class RebelComponent:
    def __init__(
        self,
        nlp,
        name,
        model_name: str,
        device: int,
    ):
        assert model_name is not None, "model_name must be provided"
        self.triplet_extractor = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device=device)
        self.entity_mapping = {}
        # Register custom extension on the Doc
        if not Doc.has_extension("rel"):
            Doc.set_extension("rel", default={})

    def get_wiki_id(self, item: str):
        mapping = self.entity_mapping.get(item)
        if mapping:
            return mapping
        else:
            res = call_wiki_api(item)
            self.entity_mapping[item] = res
            return res

    
    def _generate_triplets(self, sent: Span) -> List[dict]:
        output_ids = self.triplet_extractor(sent.text, return_tensors=True, return_text=False)[0]["generated_token_ids"]["output_ids"]
        extracted_text = self.triplet_extractor.tokenizer.batch_decode(output_ids[0])
        extracted_triplets = extract_triplets(extracted_text[0])
        return extracted_triplets

    def set_annotations(self, doc: Doc, triplets: List[dict]):
        for triplet in triplets:

            # Remove self-loops (relationships that start and end at the entity)
            if triplet['head'] == triplet['tail']:
                continue

            # Use regex to search for entities; escape them so that characters
            # such as "(" or "." are matched literally rather than as regex syntax
            head_span = re.search(re.escape(triplet["head"]), doc.text)
            tail_span = re.search(re.escape(triplet["tail"]), doc.text)

            # Skip the relation if either the head or the tail entity is not present in the text
            # Sometimes the Rebel model hallucinates some entities
            if not head_span or not tail_span:
                continue

            index = hashlib.sha1("".join([triplet['head'], triplet['tail'], triplet['type']]).encode('utf-8')).hexdigest()
            if index not in doc._.rel:
                # Get wiki ids and store results
                doc._.rel[index] = {"relation": triplet["type"], "head_span": {'text': triplet['head'], 'id': self.get_wiki_id(triplet['head'])}, "tail_span": {'text': triplet['tail'], 'id': self.get_wiki_id(triplet['tail'])}}

    def __call__(self, doc: Doc) -> Doc:
        for sent in doc.sents:
            sentence_triplets = self._generate_triplets(sent)
            self.set_annotations(doc, sentence_triplets)
        return doc

The set_annotations function handles how we store the results back to SpaCy's Doc object. First, we ignore all the self-loops. Self-loops are relationships that start and end at the same entity. Next, we search for both the head and tail entities of the relation in the text using regex. I've noticed that the Rebel model sometimes hallucinates entities that are not in the original text. For that reason, I added a step that verifies that both entities are actually in the text before appending them to the results.
Lastly, we use the WikiData API to map extracted entities to WikiData ids. As mentioned, this is a simplified version of entity disambiguation and linking, and you can take a more novel approach like the ExtEnd model, for example.
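
To see the filtering steps in isolation, here is a minimal stand-alone sketch that runs without SpaCy or the Rebel model. The function name `filter_triplets` and the sample triplets are made up for illustration, and it uses a plain substring check instead of the regex search used in the component.

```python
import hashlib


def filter_triplets(text: str, triplets: list) -> dict:
    """Keep triplets whose head and tail both occur in `text`, keyed by a SHA-1 hash."""
    kept = {}
    for triplet in triplets:
        head, tail, rel = triplet["head"], triplet["tail"], triplet["type"]
        # Drop self-loops (relationships that start and end at the same entity)
        if head == tail:
            continue
        # Drop triplets that mention an entity missing from the text
        if head not in text or tail not in text:
            continue
        # Hash head, tail and relation type so that duplicates collapse into one entry
        key = hashlib.sha1("".join([head, tail, rel]).encode("utf-8")).hexdigest()
        kept.setdefault(key, triplet)
    return kept


text = "Christian Drosten works in Germany."
# Made-up candidate triplets, as the model might emit them
candidates = [
    {"head": "Christian Drosten", "type": "country of citizenship", "tail": "Germany"},
    {"head": "Germany", "type": "country", "tail": "Germany"},  # self-loop
    {"head": "Christian Drosten", "type": "employer", "tail": "Google"},  # tail not in text
    {"head": "Christian Drosten", "type": "country of citizenship", "tail": "Germany"},  # duplicate
]

result = filter_triplets(text, candidates)
for key, triplet in result.items():
    print(key[:8], triplet)
```

Only the first candidate survives: the self-loop, the triplet with an entity that does not appear in the text, and the duplicate are all dropped.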

Now that the Rebel SpaCy component is defined, we can create two SpaCy pipelines to handle coreference resolution as well as relation extraction and entity linking.

In [None]:
DEVICE = -1  # GPU index; use -1 to run on the CPU

# Add coreference resolution model
coref = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
coref.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})

# Define rel extraction model

rel_ext = spacy.load('en_core_web_sm', disable=['ner', 'lemmatizer', 'attribute_ruler', 'tagger'])
rel_ext.add_pipe("rebel", config={
    'device':DEVICE, # Number of the GPU, -1 if want to use CPU
    'model_name':'Babelscape/rebel-large'} # Model used, will default to 'Babelscape/rebel-large' if not given
    )



In [None]:
input_text = "Christian Drosten works in Germany. He likes to work for Google."

coref_text = coref(input_text)._.resolved_text

doc = rel_ext(coref_text)

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

The Rebel model extracted two relations from the text. For example, it recognized that Christian Drosten, with the WikiData id Q1079331, is employed by Google, which has an id Q95.
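
The stored results are plain nested dictionaries, so they are easy to reshape. The sketch below flattens a hand-written stand-in for `doc._.rel` into (head id, relation, tail id) tuples. The key `hash-1` is a placeholder for the real SHA-1 key, the relation label `employer` is an assumption about what the model returns for this sentence, and `to_triples` is a hypothetical helper.

```python
def to_triples(rel: dict) -> list:
    """Flatten a doc._.rel-style mapping into (head_id, relation, tail_id) tuples."""
    return [
        (entry["head_span"]["id"], entry["relation"], entry["tail_span"]["id"])
        for entry in rel.values()
    ]


# Hand-written stand-in for doc._.rel, using the ids mentioned in the text
sample_rel = {
    "hash-1": {
        "relation": "employer",
        "head_span": {"text": "Christian Drosten", "id": "Q1079331"},
        "tail_span": {"text": "Google", "id": "Q95"},
    }
}

for head_id, relation, tail_id in to_triples(sample_rel):
    print(f"({head_id})-[{relation}]->({tail_id})")
```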

# Storing the information extraction pipeline results
Whenever I hear about relations between entities, I think of a graph. A graph database is designed to store relations between entities, so what better fit for storing the information extraction pipeline results?
As you might know, I am biased towards Neo4j, but you can use whatever tool you like. Here, I will demonstrate how to store the results of my implementation of the information extraction pipeline into Neo4j. We will process a couple of Wikipedia summaries of famous women scientists and store the results as a graph.
If you want to follow along with the code examples, I suggest you create a [Blank Project in Neo4j Sandbox](https://sandbox.neo4j.com/?usecase=blank-sandbox) environment. After you have created the Neo4j Sandbox instance, you can copy the credentials to the code.

In [None]:
import pandas as pd
import wikipedia
from neo4j import GraphDatabase

# Define Neo4j connection
host = 'bolt://3.236.134.179:7687'
user = 'neo4j'
password = 'writer-calibers-steels'
driver = GraphDatabase.driver(host,auth=(user, password))

import_query = """
UNWIND $data AS row
MERGE (h:Entity {id: CASE WHEN NOT row.head_span.id = 'id-less' THEN row.head_span.id ELSE row.head_span.text END})
ON CREATE SET h.text = row.head_span.text
MERGE (t:Entity {id: CASE WHEN NOT row.tail_span.id = 'id-less' THEN row.tail_span.id ELSE row.tail_span.text END})
ON CREATE SET t.text = row.tail_span.text
WITH row, h, t
CALL apoc.merge.relationship(h, toUpper(replace(row.relation,' ', '_')),
  {},
  {},
  t,
  {}
)
YIELD rel
RETURN distinct 'done' AS result;
"""


def run_query(query, params=None):
    # Default to None rather than a shared mutable dict
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

def store_wikipedia_summary(page):
    try:
        input_text = wikipedia.page(page).summary
        coref_text = coref(input_text)._.resolved_text
        doc = rel_ext(coref_text)
        params = [rel_dict for value, rel_dict in doc._.rel.items()]
        run_query(import_query, {'data': params})
    except Exception as e:
        print(f"Couldn't parse text for {page} due to {e}")


We've used the wikipedia python library to help us fetch the summaries from Wikipedia. Next, we need to define the Cypher statement used to import the information extraction results. I won't go into details of Cypher syntax, but basically, we first merge the head and tail entities by their WikiData id and then use a procedure from the APOC library to merge the relationship. I recommend going through courses in the Neo4j Graph Academy if you are looking for resources to learn more about Cypher syntax.
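
If the Cypher is hard to read, the two transformations it applies to every row can be mirrored in Python. This is only a sketch of the logic in the import query, with hypothetical helper names (`node_key`, `relationship_type`); the actual work happens inside Neo4j.

```python
def node_key(span: dict) -> str:
    """Mirror the CASE expression: use the WikiData id, or fall back to the text."""
    return span["text"] if span["id"] == "id-less" else span["id"]


def relationship_type(relation: str) -> str:
    """Mirror toUpper(replace(row.relation, ' ', '_'))."""
    return relation.replace(" ", "_").upper()


print(node_key({"text": "Google", "id": "Q95"}))
print(node_key({"text": "Some Unknown Lab", "id": "id-less"}))
print(relationship_type("country of citizenship"))
```

Falling back to the text as the node key means that entities without a WikiData match are still merged with each other when their surface forms are identical.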

Now that we have everything ready, we can go ahead and parse a couple of Wikipedia summaries.

In [None]:
ladies = ["Jennifer Doudna", "Rachel Carson", "Sara Seager OC", "Gertrude Elion", "Rita Levi-Montalcini"]

for l in ladies:
    print(f"Parsing {l}")
    store_wikipedia_summary(l)

# Enriching the graph
Since we have mapped our entities to the WikiData ids, we can further use the WikiData API to enrich our graph. I will show you how to extract INSTANCE_OF relations from WikiData and store them to Neo4j with the help of the APOC library, which allows us to call web APIs and store results in the database.
To be able to call the WikiData API, you need to have a basic understanding of SPARQL syntax, but that is beyond the scope of this blog post. However, some time ago I wrote a [post that shows more SPARQL queries used to enrich a Neo4j graph and delves into SPARQL syntax](https://towardsdatascience.com/lord-of-the-wiki-ring-importing-wikidata-into-neo4j-and-analyzing-family-trees-da27f64d675e).
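
For readers less familiar with SPARQL, here is a rough Python sketch of the request that this post sends from Cypher for each entity. It only builds the query string and URL and parses a hand-written sample response, so nothing is fetched. The helper names and the sample values are illustrative assumptions.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_instance_of_query(entity_id: str) -> str:
    """Build a SPARQL query for the English label and instance-of (P31) labels of an entity."""
    return (
        "SELECT * WHERE { "
        "?item rdfs:label ?name . "
        f"filter (?item = wd:{entity_id}) "
        'filter (lang(?name) = "en") '
        "OPTIONAL { ?item wdt:P31 [rdfs:label ?label] . "
        'filter (lang(?label) = "en") } }'
    )


def build_request_url(entity_id: str) -> str:
    """URL-encode the SPARQL query into a GET request URL."""
    return f"{SPARQL_ENDPOINT}?{urlencode({'query': build_instance_of_query(entity_id)})}"


def parse_bindings(payload: dict) -> list:
    """Extract (name, class label) pairs; the label is None when the OPTIONAL part is empty."""
    rows = []
    for row in payload["results"]["bindings"]:
        rows.append((row["name"]["value"], row.get("label", {}).get("value")))
    return rows


# Hand-written sample in the SPARQL JSON results layout (values are made up)
sample = {"results": {"bindings": [
    {"name": {"value": "Example entity"}, "label": {"value": "example class"}},
    {"name": {"value": "Example entity"}},
]}}

print(build_instance_of_query("Q95"))
print(parse_bindings(sample))
```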
By executing the following query, we add the Class nodes to the graph and link them with the appropriate entities.

In [None]:
run_query("""
CALL apoc.periodic.iterate("
  MATCH (e:Entity)
  WHERE e.id STARTS WITH 'Q'
  RETURN e
","
  // Prepare a SparQL query
  WITH 'SELECT * WHERE{ ?item rdfs:label ?name . filter (?item = wd:' + e.id + ') filter (lang(?name) = \\\"en\\\") ' +
     'OPTIONAL {?item wdt:P31 [rdfs:label ?label] .filter(lang(?label)=\\\"en\\\")}}' AS sparql, e
  // make a request to Wikidata
  CALL apoc.load.jsonParams(
    'https://query.wikidata.org/sparql?query=' + 
      sparql,
      { Accept: 'application/sparql-results+json'}, null)
  YIELD value
  UNWIND value['results']['bindings'] as row
  SET e.wikipedia_name = row.name.value
  WITH e, row.label.value AS label
  MERGE (c:Class {id:label})
  MERGE (e)-[:INSTANCE_OF]->(c)
  RETURN distinct 'done'", {batchSize:1, retry:1})
""")

# Conclusion
I really like what SpaCy is doing lately and all the open-source projects around it. I noticed that various open-source projects are primarily standalone, and it can be tricky to combine multiple models into a single SpaCy pipeline. For example, you can see that we had to have two pipelines in this project, one for coreference resolution and one for relation extraction and entity linking.

As for the results of the IE pipeline, I am delighted with how well it turned out. As you can observe in the Rebel repository, their solution is state-of-the-art on many NLP datasets, so it is not a big surprise that the results are so good. The only weak link in my implementation is the Entity Linking step. As I said, it would probably greatly benefit from adding something like the ExtEnd library for more accurate entity disambiguation and linking. Perhaps that's something I'll do next time.

Try out the IE implementation, and please let me know what you think or if you have some ideas for improvements. There are a ton of opportunities to make this pipeline better!