# Working with Multi-Word Token Entities and RegEx in spaCy

## 9.1. Key Concepts in this Notebook

1. Working with Multi-Word Tokens and RegEx in spaCy 3.x
2. RegEx finditer 
3. Spans

## 9.2. Problems with Multi-Word Tokens in spaCy as Entities
As we saw in 01.03: Rules-Based NER, we can use spaCy’s Matcher to grab multi-word tokens, that is, entities that span more than one token. The main problem with this approach is that the matches are not placed into doc.ents, so we cannot access them the same way we would other entities. In this notebook, we will solve that problem with a simple workflow:

1. Extract multi-word tokens with re.finditer()
2. Reconstruct the spans in the spaCy doc
3. Give priority to longer spans (optional)
4. Inject the spans into doc.ents

We will cover each of these steps in turn.

## 9.3. Extract Multi-Word Tokens
First, we need to grab the multi-word tokens. In this notebook, we will try to grab one kind of multi-word token: the full name of a person whose first name is Paul. In the RegEx below, we look for the string “Paul” followed by a space and a capitalized letter, and then grab the rest of that second word.

In [2]:
import re 

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

pattern = r"Paul [A-Z]\w+"

matches = re.finditer(pattern, text)
for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


The regular expression r"Paul [A-Z]\w+" matches the literal “Paul”, followed by a space, then an uppercase letter from A to Z, and then one or more word characters (letters, digits, and underscores). Here’s a breakdown of the pattern:


* r: This denotes a raw string in Python, which tells the interpreter to treat backslashes as literal characters and not as escape characters.
* "Paul ": This matches the literal string “Paul” followed by a space.
* [A-Z]: This is a character class that matches any single uppercase letter from A to Z.
* \w+: Because this is a raw string, the single backslash reaches the RegEx engine unchanged and no doubling is needed. \w matches any word character (letters, digits, and the underscore; for Python 3 str patterns this includes Unicode letters, so it is broader than [a-zA-Z0-9_] unless re.ASCII is set), and the + means that \w must occur one or more times.

So, this regular expression will match any substring that consists of “Paul”, a space, an uppercase letter, and at least one more word character. For example, it would match “Paul Aardvark” but not “Paul aardvark” (due to the lowercase ‘a’) or “Paul” on its own (since no capitalized word follows it).
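
The breakdown above can be checked directly in Python. This is a small sketch using only the standard library; the candidate strings are illustrative and are not taken from the notebook's text. It uses re.fullmatch() so that each candidate must match the pattern in its entirety.

```python
import re

pattern = r"Paul [A-Z]\w+"

# A raw string keeps the backslash as-is, so r"\w" is the same two
# characters as the escaped, non-raw literal "\\w".
assert r"\w" == "\\w" and len(r"\w") == 2

# Illustrative candidates (assumed for this sketch).
candidates = ["Paul Aardvark", "Paul aardvark", "Paul", "Paul A"]

# fullmatch() succeeds only if the whole candidate fits the pattern.
results = {c: bool(re.fullmatch(pattern, c)) for c in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```

“Paul A” fails because [A-Z] consumes the “A” and \w+ still needs at least one more word character.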

Note that we have not grabbed the final “Paul” which is not followed by a last name. In this case, we are not interested in that Paul. Now that we know how to grab the multi-word tokens, we need to have a way to parse them in spaCy.
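
Each match object carries both the matched text and its character offsets, which is what we will need in the next step. As a minimal sketch, the loop below collects them into a list (the name found is only illustrative) and confirms that slicing the text with the offsets gives back the matched string.

```python
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"

found = []
for match in re.finditer(pattern, text):
    start, end = match.span()  # character offsets, end is exclusive
    # Slicing the original text with the span reproduces the match.
    assert text[start:end] == match.group()
    found.append((match.group(), start, end))

print(found)
# [('Paul Newman', 0, 11), ('Paul Hollywood', 39, 53)]
```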

## 9.4. Reconstruct Spans
This next stage is a bit more complicated, but works quite well once you understand the process. First, we need to import the libraries we will need. Note that we are also adding Span from spacy.tokens.

In [3]:
import re 
import spacy
from spacy.tokens import Span

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"

nlp = spacy.blank("en")
doc = nlp(text)

# doc.ents is empty here, so this step is not strictly necessary. It is still good practice:
# with a trained pipeline the doc will already have entities, and we need them in a list we can append to.
original_ents = list(doc.ents)


Here, we will create a blank spaCy English model and create the doc object of the text. It will have no entities in it because we are working with a blank model that does not have an “ner” component.
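
A quick way to confirm this is to inspect the pipeline and the doc. The sketch below assumes only that spaCy is installed; spacy.blank("en") does not require a downloaded model. The shortened sentence is used just to keep the output small.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# A blank pipeline has a tokenizer but no components such as "ner".
print(nlp.pipe_names)

# With no "ner" component, no entities are set on the doc.
print(doc.ents)

# The tokenizer still splits the text into tokens.
tokens = [token.text for token in doc]
print(tokens)
```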

Now, let’s iterate over the results from re.finditer(). In this cell, we are going to grab the start and end character offsets from each match. We will then create a temporary span with doc.char_span(), which maps those character offsets onto tokens in the doc object. This is important because character offsets do not always fall on token boundaries; when they do not, doc.char_span() returns None, which is why we check for it. Finally, we append the span’s token start, token end, and text to mwt_ents. The text is not necessary, but it helps with debugging.

In [4]:
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
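
To see why the None check matters, here is a small sketch of doc.char_span() with offsets that do and do not line up with token boundaries. The offsets are chosen by hand for illustration. The alignment_mode argument is part of spaCy’s API; whether expanding a partial match is appropriate depends on your data.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Characters 0-11 cover "Paul Newman" and end on a token boundary.
aligned = doc.char_span(0, 11)
print(aligned.text, aligned.start, aligned.end)  # token offsets 0 and 2

# Characters 0-8 cover "Paul New", which ends inside the token "Newman".
misaligned = doc.char_span(0, 8)
print(misaligned)  # None

# alignment_mode="expand" widens the span to the enclosing tokens.
expanded = doc.char_span(0, 8, alignment_mode="expand")
print(expanded.text)
```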

## 9.5. Inject the Spans into the doc.ents
With that data, we can iterate over each entity and identify where it begins and ends in spaCy. Note, we are using the spaCy Span class. This allows us to create a span object and assign it a custom label. With this data, we can append each Span to original_ents.

In [6]:
from spacy.tokens import Span

for ent in mwt_ents:
    start, end, name = ent 
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

Finally, we set doc.ents equal to original_ents. This effectively loads the spans back into the spaCy doc.ents.

In [7]:
doc.ents = original_ents

In [8]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


Note that these are now properly identified entities in our doc.ents.
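
The steps so far can be folded into one function. The helper below, add_regex_ents, is a hypothetical name introduced for this sketch and is not part of spaCy. It passes the label straight to doc.char_span(), which returns an already labelled Span, and it assumes that the new spans do not overlap any existing entities.

```python
import re
import spacy


def add_regex_ents(doc, pattern, label):
    """Hypothetical helper: add RegEx matches to doc.ents under `label`."""
    ents = list(doc.ents)  # keep whatever entities are already there
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        # char_span returns None if the offsets do not align with tokens.
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents  # raises ValueError if any spans overlap
    return doc


text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
nlp = spacy.blank("en")
doc = add_regex_ents(nlp(text), r"Paul [A-Z]\w+", "PERSON")

labelled = [(ent.text, ent.label_) for ent in doc.ents]
print(labelled)
```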

## 9.6. Give Priority to Longer Spans
Sometimes, the situation is not so neat. Sometimes our custom RegEx entities will overlap with spaCy’s entities.

In [13]:
import re 
import spacy

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


The code pattern = r"Hollywood" defines a regular expression pattern in Python that matches the exact string “Hollywood”. Here’s a breakdown of the code:

* pattern: This is a variable name that is being assigned the regular expression.
* r: This prefix before the string literal indicates a raw string in Python. In raw strings, backslashes are treated as literal characters and not as escape characters.
* "Hollywood": This is the string literal that specifies the pattern to match.

Let’s say that we create a new entity type, maybe for words associated with cinema, so we want to label Hollywood as “CINEMA”. Now, in the above text, Hollywood is clearly part of the name Paul Hollywood, but let’s imagine for a moment that it is not. Let’s try to run the same code as above with this pattern and label. If we do, we get an error.

The output “British NORP” means that “British” was classified as NORP, the label for nationalities or religious or political groups. It is one of the entity labels that the en_core_web_sm “ner” component predicts. So, when the pipeline outputs “British NORP”, it indicates that the word “British” has been recognized as referring to a nationality, here the people of the United Kingdom.
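
If a label in the output is unfamiliar, spaCy can describe it. The sketch below uses spacy.explain(), which looks the label up in spaCy’s built-in glossary and does not need a loaded model. The list of labels is simply the set that appears in the outputs in this notebook.

```python
import spacy

# Labels seen in the outputs of this notebook.
for label in ["PERSON", "NORP", "ORG", "GPE"]:
    print(label, "->", spacy.explain(label))

norp_description = spacy.explain("NORP")
```

spacy.explain() returns None for labels it does not know, such as a custom label like CINEMA.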

In [10]:
import re 
import spacy

text = "Shah Rukh Khan, also known by the initialism SRK, is an Indian actor and film producer who works in Hindi films."
pattern = r"Bollywood"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Shah Rukh Khan PERSON
SRK ORG
Indian NORP
Hindi GPE


In [25]:
from spacy.tokens import Span

mwt_ents = []
original_ents = list(doc.ents)  # entities already found by the "ner" component

# Map each RegEx match from character offsets to token offsets.
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

# Wrap each match in a Span labelled CINEMA and add it to the existing entities.
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

# Raises a ValueError if any new span overlaps an existing entity.
doc.ents = original_ents

ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

This error tells us that one of the spans from finditer() overlapped with one that our “ner” component found: in the Paul Hollywood text, token 9 is “Hollywood”, which is already part of the PERSON entity Paul Hollywood. This is a problem that can be rectified with spaCy’s filter_spans, which gives primacy to longer spans. In the output below, notice how the Paul Hollywood entity remains a PERSON rather than CINEMA. This is because Hollywood is shorter than Paul Hollywood.

The error ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside occurs in spaCy when you try to assign overlapping entities to a Doc object. Each token can belong to at most one entity in doc.ents, so two spans that include the same token cannot both be set.
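
The overlap can be reproduced without a trained model. In the sketch below, a blank pipeline is used and the two spans are built by hand with token offsets, standing in for what the “ner” component and the RegEx would each produce. That substitution is an assumption made to keep the example self-contained.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor, but Paul Hollywood is a British TV Host.")

# Hand-built stand-ins for the two competing entities.
person = Span(doc, 8, 10, label="PERSON")  # tokens 8-9: "Paul Hollywood"
cinema = Span(doc, 9, 10, label="CINEMA")  # token 9: "Hollywood"

error_message = None
try:
    # Token 9 would belong to two entities, which doc.ents does not allow.
    doc.ents = [person, cinema]
except ValueError as err:
    error_message = str(err)

print(error_message)
```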

### To resolve this error

In [27]:
from spacy.util import filter_spans

filtered = filter_spans(original_ents)
# print(filtered)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
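
To isolate what filter_spans does, the sketch below applies it to three hand-built spans on a blank pipeline. The spans are assumptions standing in for the “ner” and RegEx results. Where spans overlap, the longest one is kept, and the result comes back sorted by position in the doc.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor, but Paul Hollywood is a British TV Host.")

# Hand-built candidates, deliberately listed out of order.
cinema = Span(doc, 9, 10, label="CINEMA")  # "Hollywood"
person = Span(doc, 8, 10, label="PERSON")  # "Paul Hollywood"
newman = Span(doc, 0, 2, label="PERSON")   # "Paul Newman"

# The shorter CINEMA span overlaps the longer PERSON span and is dropped.
kept = filter_spans([cinema, person, newman])
doc.ents = kept

result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

Note that the shorter span is discarded entirely. If the RegEx label should win regardless of length, a different strategy is needed, such as removing the conflicting entity before appending the new span.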
