spaCy is one of the most popular Python libraries for natural language processing (NLP).

In [3]:
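# Load spaCy
import spacy
# Load a trained English pipeline (requires a separate model download).
# The 'en' shortcut works only in spaCy v2; spaCy v3 needs the full name, e.g. 'en_core_web_sm'.
# nlp = spacy.load('en')
# Or create a blank English pipeline with no trained components:
from spacy.lang.en import English
# Create the nlp object
nlp = English()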

Tokenization

Passing text to nlp returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't".

In [6]:
doc = nlp("I saw a saw that couldn't saw any other saw I ever saw")
for token in doc:
    print(token.text)
    # or print(token)

I
saw
a
saw
that
could
n't
saw
any
other
saw
I
ever
saw


Iterating through a document gives you token objects. Each token carries additional information; in most cases, the most useful attributes are token.lemma_ and token.is_stop.
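As a rough illustration of that additional information, the sketch below prints a few lexical attributes that are available even on a blank pipeline. The sample sentence and the variable names are made up for this example; token.i, token.is_alpha and token.is_punct are standard spaCy token attributes.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer and lexical attributes only
nlp = English()
sample_doc = nlp("Hello, world!")  # hypothetical sample text

# Collect (index, text, is_alpha, is_punct) for each token
token_info = [(token.i, token.text, token.is_alpha, token.is_punct) for token in sample_doc]
for row in token_info:
    print(row)
```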

Text preprocessing

One common preprocessing step is "lemmatizing." The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking", so lemmatizing "walking" converts it to "walk". Lemmatizing helps by combining multiple forms of the same word into one base form ("calming", "calms", and "calmed" would all change to "calm").

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

In [7]:
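# Note: in spaCy v3, a blank English() pipeline has no lemmatizer, so token.lemma_
# is empty unless you load a trained pipeline; the output shown appears to come from spaCy v2.
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{token.text}\t\t{token.lemma_}\t\t{token.is_stop}")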

Token 		Lemma 		Stopword
----------------------------------------
I		I		True
saw		saw		False
a		a		True
saw		saw		False
that		that		True
could		could		True
n't		not		True
saw		saw		False
any		any		True
other		other		True
saw		saw		False
I		I		True
ever		ever		True
saw		saw		False
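
Once each token carries an is_stop flag, dropping stopwords is a one-line filter. The following is a minimal sketch that reuses the example sentence; the variable name content_tokens is only illustrative.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I saw a saw that couldn't saw any other saw I ever saw")

# Keep only the tokens that are not flagged as stopwords
content_tokens = [token.text for token in doc if not token.is_stop]
print(content_tokens)
```

Consistent with the Stopword column in the table, only the occurrences of "saw" should survive the filter.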


Pattern Matching

To match individual tokens, you create a Matcher. When you want to match a list of terms, it's easier and more efficient to use a PhraseMatcher.
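
For comparison, a token-level Matcher takes patterns written as lists of per-token attribute dictionaries. The sketch below is not part of the original notebook: the pattern, the "PUT_UP" label and the sample sentence are invented for illustration, and the add call assumes the list-of-patterns signature used by spaCy v3 (older v2 releases expect matcher.add("PUT_UP", None, pattern) instead).

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
token_matcher = Matcher(nlp.vocab)

# One pattern = one dict per token; LOWER compares the lowercased token text
pattern = [{"LOWER": "put"}, {"LOWER": "up"}]
token_matcher.add("PUT_UP", [pattern])  # assumes the spaCy v3 signature

sample_doc = nlp("You gotta put up with the rain.")  # hypothetical sample text
matched_spans = [sample_doc[start:end].text for _, start, end in token_matcher(sample_doc)]
print(matched_spans)
```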

In [8]:
from spacy.matcher import PhraseMatcher
# attr='LOWER' compares lowercased token text, so matching is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model.

In [9]:
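terms = ['rainbow', 'rain', 'gotta', 'want']
# Convert each term into a document object to use as a pattern
patterns = [nlp(text) for text in terms]
# spaCy v2 signature; in spaCy v3 use: matcher.add("TerminologyList", patterns)
matcher.add("TerminologyList", None, *patterns)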

In [11]:
text_doc = nlp("The way I see it, if you want the rainbow, you gotta put up with the rain.")
matches = matcher(text_doc)
# Each match is a tuple of the match id and the start and end token indices of the phrase.
print(matches)

[(3766102292120407359, 8, 9), (3766102292120407359, 10, 11), (3766102292120407359, 13, 15), (3766102292120407359, 19, 20)]


In [12]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList want
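
To inspect every match rather than just the first, you can loop over the list and slice the document with each start and end index. This is a minimal sketch that assumes the spaCy v3 form of matcher.add (a list of patterns as the second argument); the variable name found is illustrative.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["rainbow", "rain", "gotta", "want"]
matcher.add("TerminologyList", [nlp(term) for term in terms])  # assumes the spaCy v3 signature

text_doc = nlp("The way I see it, if you want the rainbow, you gotta put up with the rain.")

# Sort by start index so matches come out in reading order
found = []
for match_id, start, end in sorted(matcher(text_doc), key=lambda m: m[1]):
    label = nlp.vocab.strings[match_id]  # map the hash back to the rule name
    found.append((label, text_doc[start:end].text))
    print(label, text_doc[start:end].text)
```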
