# Text Preprocessing

##### Author: Alex Sherman | alsherman@deloitte.com

#### Agenda

1. SpaCy
2. Text Tokenization, Tagging, Parsing, NER
3. Text Rule-based matching
4. Text Pipelines
5. Advanced SpaCy Examples

In [19]:
import os
from IPython.display import display, HTML
from configparser import ConfigParser, ExtendedInterpolation

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')
DB_PATH = config['DATABASES']['PROJECT_DB_PATH']

In [20]:
# confirm DB_PATH points to the correct database; otherwise the rest of the code will not work
DB_PATH

'sqlite:///C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\raw_data\\databases\\annual_report.db'
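For readers who do not have the original config.ini, here is a minimal sketch of how such a file could be laid out and read with ExtendedInterpolation. Only the [DATABASES] section and the PROJECT_DB_PATH key come from the notebook. The [PATHS] section, the project_dir key, and the /home/user/ml_guild path are hypothetical placeholders.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Hypothetical config.ini contents; [PATHS] and project_dir are assumptions
EXAMPLE_INI = """
[PATHS]
project_dir = /home/user/ml_guild

[DATABASES]
PROJECT_DB_PATH = sqlite:///${PATHS:project_dir}/raw_data/databases/annual_report.db
"""

example_config = ConfigParser(interpolation=ExtendedInterpolation())
example_config.read_string(EXAMPLE_INI)  # config.read(path) works the same way for a file

# ${PATHS:project_dir} is expanded when the value is looked up
example_db_path = example_config['DATABASES']['PROJECT_DB_PATH']
print(example_db_path)
```

Keeping the project directory in one place means only a single line has to change when the notebook runs on another machine.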

In [21]:
# check for the names of the tables in the database
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(DB_PATH)
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con=engine)

Unnamed: 0,name
0,DOCUMENTS
1,SECTIONS


In [22]:
# read the Southwest Airlines annual report documents
doc_df = pd.read_sql("SELECT * FROM Documents", con=engine)
doc_df

Unnamed: 0,document_id,path,filename,year,document_text,table_text,author,last_modified_by,created,revision,num_tables
0,1,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2012.docx,2012,SOUTHWEST AIRLINES CO. 2012 ANNUAL REPORT TO S...,2013 . . . . . . . . . . . . . . . . . . . . ....,,,2018-01-03 22:49:42,0,48
1,2,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2013.docx,2013,SOUTHWEST AIRLINES CO. 2013 ANNUAL REPORT TO S...,Period Dividend High Low 2013 1st Qua...,,,2018-01-03 22:50:40,0,45
2,3,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2014.docx,2014,SOUTHWEST AIRLINES CO. 2014 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:51:35,0,58
3,4,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2015.docx,2015,SOUTHWEST AIRLINES CO. 2015 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:52:25,0,53
4,5,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2016.docx,2016,SOUTHWEST AIRLINES CO. 2016 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:53:10,0,58


In [23]:
# read the Southwest Airlines annual report sections
df = pd.read_sql("SELECT * FROM Sections ", con=engine)
df.head(3)

Unnamed: 0,section_id,filename,section_name,criteria,section_text
0,1,southwest-airlines-co_annual_report_2012.docx,SOUTHWEST AIRLINES CO. 2012 ANNUAL REPORT TO ...,<function style at 0x00000227334AA048>,To our Shareholders: The year 2012 represented...
1,2,southwest-airlines-co_annual_report_2012.docx,AIRTRAN INTEGRATION: WE ARE ON TRACK WITH OUR ...,<function capitalization at 0x000002273349EF28>,"In December 2012, we announced new 2013 revenu..."
2,3,southwest-airlines-co_annual_report_2012.docx,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,<function style at 0x00000227334AA048>,Í ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d...


In [24]:
df[df.section_text.str.contains('fee')].section_name

15                                                AIRTRAN
20      SOUTHWEST’S ALL-NEW RAPID REWARDS FREQUENT FLY...
26      AGGRESSIVE PROMOTION OF THE COMPANY’S POINTS O...
28                            ANCILLARY SERVICES AND FEES
34      ECONOMIC AND OPERATIONAL REGULATION THE U.S. D...
35                                         AVIATION TAXES
38                                    SECURITY REGULATION
43                             PRICING AND COST STRUCTURE
54      THE COMPANY’S LOW-COST STRUCTURE HAS HISTORICA...
73      AIRTRAN IS CURRENTLY SUBJECT TO PENDING ANTITR...
79                         GROUND FACILITIES AND SERVICES
80                              ITEM 3. LEGAL PROCEEDINGS
92                                         YEAR IN REVIEW
94                                     OPERATING REVENUES
98      AVERAGE BRENT CRUDE OIL ESTIMATED DIFFERENCE I...
104                                         CHANGE CHANGE
107                   OBLIGATIONS BY PERIOD (IN MILLIONS)
109           
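The substring filter above is case-sensitive and also matches words that merely contain "fee". The sketch below, on a small made-up DataFrame (the rows are illustrative, not taken from the database), compares it with a word-boundary regular expression. The na=False argument is an assumption that some sections may have missing text.

```python
import pandas as pd

# Toy data for illustration only; not rows from annual_report.db
sections = pd.DataFrame({
    'section_name': ['ANCILLARY SERVICES AND FEES', 'IN-FLIGHT SERVICE',
                     'AVIATION TAXES', 'EMPTY SECTION'],
    'section_text': ['Fees for checked bags apply.', 'Free coffee is served.',
                     'A security fee applies.', None],
})

# Plain substring match: misses "Fees" (capital F) and hits "coffee"
naive_mask = sections.section_text.str.contains('fee', na=False)
naive_names = sections[naive_mask].section_name.tolist()

# Whole-word, case-insensitive match for "fee" or "fees"
word_mask = sections.section_text.str.contains(r'\bfees?\b', case=False, na=False)
word_names = sections[word_mask].section_name.tolist()

print(naive_names)
print(word_names)
```

Which variant is appropriate depends on whether partial matches count as hits for the analysis.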

In [388]:
# example text
text = df.section_text[776]
text

'A complaint alleging violations of federal antitrust laws and seeking certification as a class action was filed against Delta Air Lines, Inc. and AirTran in the United States District Court for the Northern District of Georgia in Atlanta on May 22, 2009. The complaint alleged, among other things, that AirTran attempted to monopolize air travel in violation of Section 2 of the Sherman Act, and conspired with Delta in imposing $15-per-bag fees for the first item of checked luggage in violation of Section 1 of the Sherman Act. The initial complaint sought treble damages on behalf of a putative class of persons or entities in the United States who directly paid Delta and/or AirTran such fees on domestic flights beginning December 5, 2008. After the filing of the May 2009 complaint, various other nearly identical complaints also seeking certification as class actions were filed in federal district courts in Atlanta, Georgia; Orlando, Florida; and Las Vegas, Nevada. All of the cases were co

### SpaCy

SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

SpaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

SpaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that SpaCy is integrated and opinionated. SpaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets SpaCy deliver generally better performance and developer experience.

#### Installation:
- Windows: Download Microsoft Visual C++: http://landinghub.visualstudio.com/visual-cpp-build-tools
- conda install -c conda-forge spacy
- python -m spacy download en

##### If you run into an error, try the following:
- python -m spacy link en_core_web_sm en
- SOURCE: https://github.com/explosion/spaCy/issues/950

##### Optional: install the large English model (includes word vectors):
- python -m spacy download en_core_web_lg


### SpaCy Features 

NAME |	DESCRIPTION |
:----- |:------|
Tokenization|Segmenting text into words, punctuation marks etc.|
Part-of-speech (POS) Tagging|Assigning word types to tokens, like verb or noun.|
Dependency Parsing|	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
Lemmatization|	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".|
Sentence Boundary Detection (SBD)|	Finding and segmenting individual sentences.|
Named Entity Recognition (NER)|	Labelling named "real-world" objects, like persons, companies or locations.|
Similarity|	Comparing words, text spans and documents and how similar they are to each other.|
Text Classification|	Assigning categories or labels to a whole document, or parts of a document.|
Rule-based Matching|	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|
Training|	Updating and improving a statistical model's predictions.|
Serialization|	Saving objects to files or byte strings.|

SOURCE: https://spacy.io/usage/spacy-101

In [389]:
# confirm which env you are using - make sure it is one with SpaCy installed
import sys
sys.executable

# if you have difficulty importing spacy try the following in git bash
# conda install ipykernel --name Python3

'C:\\Users\\alsherman\\AppData\\Local\\Continuum\\anaconda3\\envs\\guild\\python.exe'

In [27]:
import spacy
from spacy import displacy

In [28]:
# read in an English language model
#nlp = spacy.load('en')  # small model, via the 'en' shortcut link
nlp = spacy.load('en_core_web_lg')  # large model with word vectors

# another approach:
# import en_core_web_sm
# nlp = en_core_web_sm.load()
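If it is unclear which models are installed in the active environment, a loader with a fallback can save some trial and error. The sketch below is one possible approach. The function name load_english_model and the variable nlp_fallback are hypothetical, and the last resort is a blank English pipeline that only tokenizes.

```python
import spacy

def load_english_model(preferred=('en_core_web_lg', 'en_core_web_sm')):
    """Return the first installed model in `preferred`, else a blank English pipeline."""
    for name in preferred:
        try:
            return spacy.load(name)
        except OSError:
            # spacy.load raises OSError when the model package cannot be found
            continue
    # Tokenizer only: no tagger, parser or entity recognizer
    return spacy.blank('en')

nlp_fallback = load_english_model()
print(nlp_fallback.pipe_names)  # lists the pipeline components that were loaded
```

An empty pipe_names list indicates that the blank pipeline was used, in which case the tagging and parsing cells below will not produce annotations.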

In [390]:
# instantiate the document text
doc = nlp(text)

In [391]:
# view the text
doc

A complaint alleging violations of federal antitrust laws and seeking certification as a class action was filed against Delta Air Lines, Inc. and AirTran in the United States District Court for the Northern District of Georgia in Atlanta on May 22, 2009. The complaint alleged, among other things, that AirTran attempted to monopolize air travel in violation of Section 2 of the Sherman Act, and conspired with Delta in imposing $15-per-bag fees for the first item of checked luggage in violation of Section 1 of the Sherman Act. The initial complaint sought treble damages on behalf of a putative class of persons or entities in the United States who directly paid Delta and/or AirTran such fees on domestic flights beginning December 5, 2008. After the filing of the May 2009 complaint, various other nearly identical complaints also seeking certification as class actions were filed in federal district courts in Atlanta, Georgia; Orlando, Florida; and Las Vegas, Nevada. All of the cases were con

In [32]:
spacy_url = 'https://spacy.io/assets/img/pipeline.svg'
iframe = '<iframe src={} width=1000 height=200></iframe>'.format(spacy_url)
HTML(iframe)

### Tokenization

spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. 
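As a small illustration of these rules, the sketch below tokenizes the example sentence from the spaCy 101 guide with a blank English pipeline, which needs no downloaded model. The variable names are illustrative.

```python
import spacy

# Blank English pipeline: rule-based tokenizer only, no statistical model
tokenizer_nlp = spacy.blank('en')
example = tokenizer_nlp('Apple is looking at buying U.K. startup for $1 billion')

tokens = [token.text for token in example]
print(tokens)
```

In the printed list, "U.K." should appear as a single token, while "$" and "1" are split apart.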

In [33]:
tokenization_url = 'https://spacy.io/assets/img/tokenization.svg'
iframe = '<iframe src={} width=650 height=400></iframe>'.format(tokenization_url)
HTML(iframe)

### Part-of-speech (POS) Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Annotation | Description
:----- |:------|
Text |The original word text|
Lemma |The base form of the word.|
POS |The simple part-of-speech tag.|
Tag |The detailed part-of-speech tag.|
Dep |Syntactic dependency, i.e. the relation between tokens.|
Shape |The word shape – capitalisation, punctuation, digits.|
Is Alpha |Does the token consist of alphabetic characters?|
Is Stop |Is the token part of a stop list, i.e. the most common words of the language?|
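Some of these annotations are computed from the token text alone, whereas POS, Tag and Dep are predicted by the statistical model. The sketch below shows Shape and Is Alpha for a phrase from the example text, using a blank English pipeline so that no model download is assumed. The variable names are illustrative.

```python
import spacy

# Shape and is_alpha depend only on the characters of each token
lex_nlp = spacy.blank('en')
phrase = lex_nlp('Section 1 of the Sherman Act')

rows = [(token.text, token.shape_, token.is_alpha) for token in phrase]
for token_text, shape, is_alpha in rows:
    print('{:10} | {:8} | {}'.format(token_text, shape, is_alpha))
```

The shapes agree with the corresponding rows of the full output below, for example Xxxxx for "Sherman" and d for "1".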

In [75]:
print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'TEXT','LEMMA_','POS_','TAG_','DEP_','SHAPE_','IS_ALPHA','IS_STOP'))

for token in doc:
    print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
        token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
        token.shape_, token.is_alpha, token.is_stop))

TEXT            | LEMMA_          | POS_     | TAG_     | DEP_        | SHAPE_   | IS_ALPHA | IS_STOP  | 
A               | a               | DET      | DT       | det         | X        |        1 |        0 |
complaint       | complaint       | NOUN     | NN       | ROOT        | xxxx     |        1 |        0 |
alleging        | allege          | VERB     | VBG      | acl         | xxxx     |        1 |        0 |
violations      | violation       | NOUN     | NNS      | dobj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
federal         | federal         | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
antitrust       | antitrust       | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
laws            | law             | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
and             | and             | CCONJ    | CC     

15-per          | 15-per          | NUM      | CD       | nummod      | dd-xxx   |        0 |        0 |
-               | -               | PUNCT    | HYPH     | punct       | -        |        0 |        0 |
bag             | bag             | NOUN     | NN       | compound    | xxx      |        1 |        0 |
fees            | fee             | NOUN     | NNS      | dobj        | xxxx     |        1 |        0 |
for             | for             | ADP      | IN       | prep        | xxx      |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
first           | first           | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
item            | item            | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
checked         | check           | VERB     | VBN     

were            | be              | VERB     | VBD      | auxpass     | xxxx     |        1 |        0 |
filed           | file            | VERB     | VBN      | advcl       | xxxx     |        1 |        0 |
in              | in              | ADP      | IN       | prep        | xx       |        1 |        0 |
federal         | federal         | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
district        | district        | NOUN     | NN       | compound    | xxxx     |        1 |        0 |
courts          | court           | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
in              | in              | ADP      | IN       | prep        | xx       |        1 |        0 |
Atlanta         | atlanta         | PROPN    | NNP      | pobj        | Xxxxx    |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
Georgia         | georgia         | PROPN    | NNP     

of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
Section         | section         | NOUN     | NN       | pobj        | Xxxxx    |        1 |        0 |
1               | 1               | NUM      | CD       | nummod      | d        |        0 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
Sherman         | sherman         | PROPN    | NNP      | compound    | Xxxxx    |        1 |        0 |
Act             | act             | PROPN    | NNP      | pobj        | Xxx      |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
In              | in              | ADP      | IN       | prep        | Xx       |        1 |        0 |
addition        | addition        | NOUN     | NN      

the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
claims          | claim           | NOUN     | NNS      | dobj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
a               | a               | DET      | DT       | det         | x        |        1 |        0 |
conspiracy      | conspiracy      | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
with            | with            | ADP      | IN       | prep        | xxxx     |        1 |        0 |
respect         | respect         | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
to              | to              | ADP      | IN       | prep        | xx       |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
imposition      | imposition      | NOUN     | NN      

discovery       | discovery       | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
discovery       | discovery       | NOUN     | NN       | nsubj       | xxxx     |        1 |        0 |
has             | have            | VERB     | VBZ      | aux         | xxx      |        1 |        0 |
now             | now             | ADV      | RB       | advmod      | xxx      |        1 |        0 |
closed          | close           | VERB     | VBN      | conj        | xxxx     |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
On              | on              | ADP      | IN       | prep        | Xx       |        1 |        0 |
June            | june            | PROPN    | NNP     

entered         | enter           | VERB     | VBD      | ROOT        | xxxx     |        1 |        0 |
an              | an              | DET      | DT       | det         | xx       |        1 |        0 |
order           | order           | NOUN     | NN       | dobj        | xxxx     |        1 |        0 |
granting        | grant           | VERB     | VBG      | acl         | xxxx     |        1 |        0 |
class           | class           | NOUN     | NN       | compound    | xxxx     |        1 |        0 |
certification   | certification   | NOUN     | NN       | dobj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
which           | which           | ADJ      | WDT      | nsubjpass   | xxxx     |        1 |        0 |
was             | be              | VERB     | VBD      | auxpass     | xxx      |        1 |        0 |
vacated         | vacate          | VERB     | VBN     

                |                 | SPACE    |          |             |          |        0 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
experts         | expert          | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
those           | those           | DET      | DT       | det         | xxxx     |        1 |        0 |
motions         | motion          | NOUN     | NNS      | nsubjpass   | xxxx     |        1 |        0 |
have            | have            | VERB     | VBP      | aux         | xxxx     |        1 |        0 |
been            | be              | VERB     | VBN      | auxpass     | xxxx     |        1 |        0 |
submitted       | submit          | VERB     | VBN     

and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
documents       | document        | NOUN     | NNS      | conj        | xxxx     |        1 |        0 |
about           | about           | ADP      | IN       | prep        | xxxx     |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
Company         | company         | PROPN    | NNP      | poss        | Xxxxx    |        1 |        0 |
’s              | ’s              | PART     | POS      | case        | ’x       |        0 |        0 |
capacity        | capacity        | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
from            | from            | ADP      | IN       | prep        | xxxx     |        1 |        0 |
January         | january         | PROPN    | NNP      | pobj        | Xxxxx    |        1 |        0 |
2010            | 2010            | NUM      | CD      

.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
The             | the             | DET      | DT       | det         | Xxx      |        1 |        0 |
Company         | company         | PROPN    | NNP      | nsubj       | Xxxxx    |        1 |        0 |
is              | be              | VERB     | VBZ      | aux         | xx       |        1 |        0 |
cooperating     | cooperate       | VERB     | VBG      | ROOT        | xxxx     |        1 |        0 |
fully           | fully           | ADV      | RB       | advmod      | xxxx     |        1 |        0 |
with            | with            | ADP      | IN       | prep        | xxxx     |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
DOJ             | doj             | PROPN    | NNP      | compound    | XXX      |        1 |        0 |
CID             | cid             | PROPN    | NNP     

Sherman         | sherman         | PROPN    | NNP      | compound    | Xxxxx    |        1 |        0 |
Act             | act             | PROPN    | NNP      | pobj        | Xxx      |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
Since           | since           | ADP      | IN       | prep        | Xxxxx    |        1 |        0 |
then            | then            | ADV      | RB       | pcomp       | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
a               | a               | DET      | DT       | det         | x        |        1 |        0 |
number          | number          | NOUN     | NN       | nsubjpass   | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
similar         | similar         | ADJ      | JJ      

District        | district        | PROPN    | NNP      | conj        | Xxxxx    |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
Minnesota       | minnesota       | PROPN    | NNP      | pobj        | Xxxxx    |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
District        | district        | PROPN    | NNP      | conj        | Xxxxx    |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
New             | new             | PROPN    | NNP      | compound    | Xxx      |        1 |        0 |
Jersey          | jersey          | PROPN    | NNP      | pobj        | Xxxxx    |        1 |        0 |
,               | ,               | PUNCT    | ,       

and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
injunctive      | injunctive      | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
relief          | relief          | NOUN     | NN       | conj        | xxxx     |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
On              | on              | ADP      | IN       | ROOT        | Xx       |        1 |        0 |
October         | october         | PROPN    | NNP      | pobj        | Xxxxx    |        1 |        0 |
13              | 13              | NUM      | CD       | nummod      | dd       |        0 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
2015            | 2015            | NUM      | CD       | appos       | dddd     |        0 |        0 |
,               | ,               | PUNCT    | ,       

2015            | 2015            | NUM      | CD       | nummod      | dddd     |        0 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
Company         | company         | PROPN    | NNP      | nsubjpass   | Xxxxx    |        1 |        0 |
was             | be              | VERB     | VBD      | auxpass     | xxx      |        1 |        0 |
named           | name            | VERB     | VBN      | ROOT        | xxxx     |        1 |        0 |
as              | as              | ADP      | IN       | prep        | xx       |        1 |        0 |
a               | a               | DET      | DT       | det         | x        |        1 |        0 |
defendant       | defendant       | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
in              | in              | ADP      | IN      

Company         | company         | PROPN    | NNP      | nsubj       | Xxxxx    |        1 |        0 |
to              | to              | PART     | TO       | aux         | xx       |        1 |        0 |
respond         | respond         | VERB     | VB       | relcl       | xxxx     |        1 |        0 |
to              | to              | ADP      | IN       | prep        | xx       |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
complaints      | complaint       | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
has             | have            | VERB     | VBZ      | aux         | xxx      |        1 |        0 |
not             | not             | ADV      | RB       | neg         | xxx      |        1 |        0 |
yet             | yet             | ADV      | RB       | advmod      | xxx      |        1 |        0 |
expired         | expire          | VERB     | VBN     

the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
outcome         | outcome         | NOUN     | NN       | conj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
any             | any             | DET      | DT       | det         | xxx      |        1 |        0 |
proposed        | propose         | VERB     | VBN      | amod        | xxxx     |        1 |        0 |
adjustments     | adjustment      | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
presented       | present         | VERB     | VBN      | acl         | xxxx     |        1 |        0 |
to              | to              | ADP      | IN       | prep        | xx       |        1 |        0 |
date            | date            | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
by              | by              | ADP      | IN      

### Text Dependency Parsing

spaCy features a fast and accurate syntactic dependency parser and has a rich API for navigating the tree. The parser also powers sentence boundary detection and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value (spaCy v3 replaced this attribute with doc.has_annotation("DEP")). If this attribute is False, the default sentence iterator will raise an exception.
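To make the tree navigation concrete, here is a pure-Python sketch of how children and lefts follow from the head of each token. It is an illustration only, not spaCy's implementation. The head indices are transcribed from the first eight rows of the parser output in this notebook, and the helper names are hypothetical.

```python
# Each token stores the index of its head; the ROOT token points to itself
words = ['A', 'complaint', 'alleging', 'violations', 'of', 'federal', 'antitrust', 'laws']
heads = [1, 1, 1, 2, 3, 7, 7, 4]

def children(i):
    """Indices of tokens whose head is token i (the root's self-link is skipped)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]

def lefts(i):
    """Children of token i that appear before it in the sentence."""
    return [j for j in children(i) if j < i]

for i, word in enumerate(words):
    print('{:10} | head={:10} | children={} | lefts={}'.format(
        word, words[heads[i]],
        [words[j] for j in children(i)],
        [words[j] for j in lefts(i)]))
```

Because the fragment stops at "laws", tokens such as "complaint" and "alleging" show fewer children here than in the full parse.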

In [76]:
# column labels match the token attributes printed in each row
print('{:15} | {:10} | {:10} | {:10} | {:30} | {:25}'.format(
    'TEXT','DEP','HEAD_TEXT','HEAD_POS','CHILDREN','LEFTS'))

for token in doc:
    print('{:15} | {:10} | {:10} | {:10} | {:30} | {:25}'.format(token.text, token.dep_, token.head.text, token.head.pos_,
          str([child for child in token.children]),str([t.text for t in token.lefts])))

TEXT            | DEP        | HEAD_TEXT  | HEAD_POS   | CHILDREN                       | LEFTS                    
A               | det        | complaint  | NOUN       | []                             | []                       
complaint       | ROOT       | complaint  | NOUN       | [A, alleging, .]               | ['A']                    
alleging        | acl        | complaint  | NOUN       | [violations, and, seeking]     | []                       
violations      | dobj       | alleging   | VERB       | [of]                           | []                       
of              | prep       | violations | NOUN       | [laws]                         | []                       
federal         | amod       | laws       | NOUN       | []                             | []                       
antitrust       | amod       | laws       | NOUN       | []                             | []                       
laws            | pobj       | of         | ADP        | [federal, antit

,               | punct      | attempted  | VERB       | []                             | []                       
and             | cc         | attempted  | VERB       | []                             | []                       
conspired       | conj       | attempted  | VERB       | [with, in]                     | []                       
with            | prep       | conspired  | VERB       | [Delta]                        | []                       
Delta           | pobj       | with       | ADP        | []                             | []                       
in              | prep       | conspired  | VERB       | [imposing]                     | []                       
imposing        | pcomp      | in         | ADP        | [fees, for]                    | []                       
$               | nmod       | fees       | NOUN       | []                             | []                       
15-per          | nummod     | bag        | NOUN       | []             

the             | det        | complaint  | NOUN       | []                             | []                       
May             | nmod       | complaint  | NOUN       | [2009]                         | []                       
2009            | nummod     | May        | PROPN      | []                             | []                       
complaint       | pobj       | of         | ADP        | [the, May]                     | ['the', 'May']           
,               | punct      | seeking    | VERB       | []                             | []                       
various         | amod       | complaints | NOUN       | []                             | []                       
other           | amod       | complaints | NOUN       | []                             | []                       
nearly          | advmod     | identical  | ADJ        | []                             | []                       
identical       | amod       | complaints | NOUN       | [nearly]       

which           | nsubj      | broadened  | VERB       | []                             | []                       
broadened       | relcl      | February   | PROPN      | [which, allegations]           | ['which']                
the             | det        | allegations | NOUN       | []                             | []                       
allegations     | dobj       | broadened  | VERB       | [the, add]                     | ['the']                  
to              | aux        | add        | VERB       | []                             | []                       
add             | acl        | allegations | NOUN       | [to, claims]                   | ['to']                   
claims          | dobj       | add        | VERB       | [conspired]                    | []                       
that            | mark       | conspired  | VERB       | []                             | []                       
Delta           | nsubj      | conspired  | VERB       | [and, AirTran

as              | cc         | activities | NOUN       | [as, well]                     | ['as', 'well']           
attorneys’      | compound   | fees       | NOUN       | []                             | []                       
fees            | conj       | activities | NOUN       | [attorneys’]                   | ['attorneys’']           
.               | punct      | seeks      | VERB       | []                             | []                       
On              | prep       | dismissed  | VERB       | [August]                       | []                       
August          | pobj       | On         | ADP        | [2, ,, 2010]                   | []                       
2               | nummod     | August     | PROPN      | []                             | []                       
,               | punct      | August     | PROPN      | []                             | []                       
2010            | appos      | August     | PROPN      | []             

,               | punct      | class      | NOUN       | []                             | []                       
which           | dobj       | opposed    | VERB       | []                             | []                       
AirTran         | nsubj      | opposed    | VERB       | [and, Delta]                   | []                       
and             | cc         | AirTran    | PROPN      | []                             | []                       
Delta           | conj       | AirTran    | PROPN      | []                             | []                       
have            | aux        | opposed    | VERB       | []                             | []                       
opposed         | relcl      | class      | NOUN       | [which, AirTran, have]         | ['which', 'AirTran', 'have']
.               | punct      | filed      | VERB       | []                             | []                       
The             | det        | parties    | NOUN       | []          

Delta           | conj       | AirTran    | PROPN      | []                             | []                       
conspired       | acl        | claim      | NOUN       | [that, AirTran, reduce]        | ['that', 'AirTran']      
to              | aux        | reduce     | VERB       | []                             | []                       
reduce          | xcomp      | conspired  | VERB       | [to, capacity]                 | ['to']                   
capacity        | dobj       | reduce     | VERB       | []                             | []                       
.               | punct      | filed      | VERB       | []                             | []                       
On              | prep       | moved      | VERB       | [August]                       | []                       
August          | pobj       | On         | ADP        | [31, ,, 2012]                  | []                       
31              | nummod     | August     | PROPN      | []             

certification   | pobj       | on         | ADP        | [class, and, motion]           | ['class']                
and             | cc         | certification | NOUN       | []                             | []                       
AirTran         | poss       | motion     | NOUN       | [’s]                           | []                       
’s              | case       | AirTran    | PROPN      | []                             | []                       
motion          | conj       | certification | NOUN       | [AirTran, exclude]             | ['AirTran']              
to              | aux        | exclude    | VERB       | []                             | []                       
exclude         | acl        | motion     | NOUN       | [to, expert]                   | ['to']                   
plaintiffs’     | compound   | expert     | NOUN       | []                             | []                       
expert          | dobj       | exclude    | VERB       | [plaintif

decision        | pobj       | for        | ADP        | []                             | []                       
.               | punct      | submitted  | VERB       | []                             | []                       
AirTran         | nsubj      | denies     | VERB       | []                             | []                       
denies          | ROOT       | denies     | VERB       | [AirTran, allegations, ,, and, intends, .] | ['AirTran']              
all             | det        | allegations | NOUN       | []                             | []                       
allegations     | dobj       | denies     | VERB       | [all, of, ,, including]        | ['all']                  
of              | prep       | allegations | NOUN       | [wrongdoing]                   | []                       
wrongdoing      | pobj       | of         | ADP        | []                             | []                       
,               | punct      | allegations | NOUN       | 

capacity        | pobj       | about      | ADP        | [Company]                      | ['Company']              
from            | prep       | seeks      | VERB       | [January, to]                  | []                       
January         | pobj       | from       | ADP        | [2010]                         | []                       
2010            | nummod     | January    | PROPN      | []                             | []                       
to              | prep       | from       | ADP        | [present]                      | []                       
the             | det        | present    | ADJ        | []                             | []                       
present         | pobj       | to         | ADP        | [the, including]               | ['the']                  
including       | prep       | present    | ADJ        | [statements]                   | []                       
public          | amod       | statements | NOUN       | []             

the             | det        | present    | NOUN       | []                             | []                       
present         | pobj       | to         | ADP        | [the]                          | ['the']                  
.               | punct      | issued     | VERB       | []                             | []                       
The             | det        | Company    | PROPN      | []                             | []                       
Company         | nsubj      | cooperating | VERB       | [The]                          | ['The']                  
is              | aux        | cooperating | VERB       | []                             | []                       
cooperating     | ROOT       | cooperating | VERB       | [Company, is, fully, with, .]  | ['Company', 'is']        
fully           | advmod     | cooperating | VERB       | []                             | []                       
with            | prep       | cooperating | VERB       | [CID]     

maintain        | conj       | limit      | VERB       | [fares]                        | []                       
higher          | amod       | fares      | NOUN       | []                             | []                       
fares           | dobj       | maintain   | VERB       | [higher, in]                   | ['higher']               
in              | prep       | fares      | NOUN       | [violation]                    | []                       
violation       | pobj       | in         | ADP        | [of]                           | []                       
of              | prep       | violation  | NOUN       | [Section]                      | []                       
Section         | pobj       | of         | ADP        | [1, of]                        | []                       
1               | nummod     | Section    | NOUN       | []                             | []                       
of              | prep       | Section    | NOUN       | [Act]          

Northern        | compound   | District   | PROPN      | []                             | []                       
District        | conj       | District   | PROPN      | [the, Northern, of, ,, District] | ['the', 'Northern']      
of              | prep       | District   | PROPN      | [Illinois]                     | []                       
Illinois        | pobj       | of         | ADP        | []                             | []                       
,               | punct      | District   | PROPN      | []                             | []                       
the             | det        | District   | PROPN      | []                             | []                       
Southern        | compound   | District   | PROPN      | []                             | []                       
District        | conj       | District   | PROPN      | [the, Southern, of, ,, District] | ['the', 'Southern']      
of              | prep       | District   | PROPN      | [Indiana]  

,               | punct      | District   | PROPN      | []                             | []                       
and             | cc         | District   | PROPN      | []                             | []                       
the             | det        | District   | PROPN      | []                             | []                       
Eastern         | compound   | District   | PROPN      | []                             | []                       
District        | conj       | District   | PROPN      | [the, Eastern, of]             | ['the', 'Eastern']       
of              | prep       | District   | PROPN      | [Wisconsin]                    | []                       
Wisconsin       | pobj       | of         | ADP        | []                             | []                       
.               | punct      | filed      | VERB       | []                             | []                       
The             | det        | complaints | NOUN       | []             

not             | neg        | entered    | VERB       | []                             | []                       
yet             | advmod     | entered    | VERB       | []                             | []                       
entered         | ROOT       | entered    | VERB       | [Court, has, not, yet, order, .] | ['Court', 'has', 'not', 'yet']
a               | det        | order      | NOUN       | []                             | []                       
scheduling      | compound   | order      | NOUN       | []                             | []                       
order           | dobj       | entered    | VERB       | [a, scheduling, establishing]  | ['a', 'scheduling']      
establishing    | acl        | order      | NOUN       | [date, respond]                | []                       
a               | det        | date       | NOUN       | []                             | []                       
date            | dobj       | establishing | VERB       | [a]   

Airlines        | conj       | Lines      | PROPN      | [United]                       | ['United']               
colluded        | ccomp      | alleging   | VERB       | [that, Company, restrict, and, maintain] | ['that', 'Company']      
to              | aux        | restrict   | VERB       | []                             | []                       
restrict        | xcomp      | colluded   | VERB       | [to, capacity]                 | ['to']                   
capacity        | dobj       | restrict   | VERB       | []                             | []                       
and             | cc         | colluded   | VERB       | []                             | []                       
maintain        | conj       | colluded   | VERB       | [fares]                        | []                       
higher          | amod       | fares      | NOUN       | []                             | []                       
fares           | dobj       | maintain   | VERB       | [high

to              | prep       | from       | ADP        | [time]                         | []                       
time            | pobj       | to         | ADP        | []                             | []                       
subject         | amod       | from       | ADP        | [to]                           | []                       
to              | prep       | subject    | ADJ        | [proceedings]                  | []                       
various         | amod       | proceedings | NOUN       | []                             | []                       
legal           | amod       | proceedings | NOUN       | []                             | []                       
proceedings     | pobj       | to         | ADP        | [various, legal, and, claims, arising, ,, including] | ['various', 'legal']     
and             | cc         | proceedings | NOUN       | []                             | []                       
claims          | conj       | proceedings | NO

have            | ccomp      | expect     | VERB       | [that, outcome, ,, will, effect, ,] | ['that', 'outcome', ',', 'will']
a               | det        | effect     | NOUN       | []                             | []                       
material        | amod       | effect     | NOUN       | []                             | []                       
adverse         | amod       | effect     | NOUN       | []                             | []                       
effect          | dobj       | have       | VERB       | [a, material, adverse, on]     | ['a', 'material', 'adverse']
on              | prep       | effect     | NOUN       | [condition]                    | []                       
the             | det        | Company    | PROPN      | []                             | []                       
Company         | poss       | condition  | NOUN       | [the, ’s]                      | ['the']                  
’s              | case       | Company    | PROPN      | 

#### NOUN CHUNKS:

| **TERM** | Definition |
|:---|:---:|
| **Text** | The original noun chunk text |
| **Root text** | The original text of the word connecting the noun chunk to the rest of the parse |
| **Root dependency** | Dependency relation connecting the root to its head |
| **Root head text** | The text of the root token's head |

In [77]:
print('{:15} | {:10} | {:10} | {:40}'.format('ROOT_TEXT','ROOT_DEP','ROOT_HEAD','TEXT'))

for chunk in doc.noun_chunks:
    print('{:15} | {:10} | {:10} | {:40}'.format(
        chunk.root.text, chunk.root.dep_, chunk.root.head.text, chunk.text))

ROOT_TEXT       | ROOT_DEP   | ROOT_HEAD  | TEXT                                    
complaint       | ROOT       | complaint  | A complaint                             
violations      | dobj       | alleging   | violations                              
laws            | pobj       | of         | federal antitrust laws                  
certification   | dobj       | seeking    | certification                           
action          | nsubjpass  | filed      | a class action                          
Lines           | pobj       | against    | Delta Air Lines                         
Inc.            | conj       | Lines      | Inc.                                    
AirTran         | conj       | Inc.       | AirTran                                 
Court           | pobj       | in         | the United States District Court        
District        | pobj       | for        | the Northern District                   
Georgia         | pobj       | of         | Georgia              

motion          | dobj       | filed      | a motion                                
class           | dobj       | certify    | a class                                 
AirTran         | nsubj      | opposed    | AirTran                                 
Delta           | conj       | AirTran    | Delta                                   
parties         | nsubj      | submitted  | The parties                             
briefs          | dobj       | submitted  | briefs                                  
certification   | pobj       | on         | class certification                     
parties         | nsubj      | filed      | the parties                             
motions         | dobj       | filed      | motions                                 
opinions        | dobj       | exclude    | the class certification opinions        
expert          | pobj       | of         | each other’s expert                     
parties         | nsubj      | engaged    | The parties          

Company         | nsubj      | cooperating | The Company                             
CID             | pobj       | with       | the DOJ CID                             
inquiries       | conj       | CID        | these two state inquiries               
July            | pobj       | on         | July                                    
complaint       | nsubjpass  | filed      | a complaint                             
Court           | pobj       | in         | the United States District Court        
District        | pobj       | for        | the Southern District                   
York            | pobj       | of         | New York                                
behalf          | pobj       | on         | behalf                                  
classes         | pobj       | of         | putative classes                        
consumers       | pobj       | of         | consumers                               
collusion       | dobj       | alleging   | collusion           

citizens        | pobj       | for        | Canadian citizens                       
States          | pobj       | in         | the United States                       
travel          | pobj       | for        | travel                                  
States          | pobj       | between    | the United States                       
Canada          | conj       | States     | Canada                                  
lawsuits        | nsubjpass  | filed      | Similar lawsuits                        
Ontario         | pobj       | in         | Ontario                                 
Quebec          | conj       | Ontario    | Quebec                                  
Saskatchewan    | conj       | Quebec     | Saskatchewan                            
time            | nsubj      | expired    | The time                                
Company         | nsubj      | respond    | the Company                             
complaints      | pobj       | to         | the complaints       

In [41]:
# dependency visualization
# after you run this code, open another browser tab and go to http://localhost:5000
# when you are done (before you run the next cell in the notebook), stop this cell

displacy.serve(doc, style='dep')  # serve() takes no jupyter argument; that belongs to render()
# Another option: show the visualization inline in the Jupyter Notebook
# displacy.render(docs=doc, style='dep', jupyter=True)

### Named Entity Recognition (NER)

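A named entity is a "real-world object" that's assigned a name, for example a person, a country, a product, or a book title. spaCy can recognise various types of named entities in a document by asking the model for a prediction. Because the model is statistical, some predictions are wrong: in the output below, 'the United States District Court' is labelled GPE rather than ORG.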

In [79]:
print('{:10} | {:15} '.format('LABEL','ENTITY'))

for ent in doc.ents:
    print('{:10} | {:50} '.format(ent.label_, ent.text))

LABEL      | ENTITY          
ORG        | Delta Air Lines                                    
ORG        | AirTran                                            
GPE        | the United States District Court                   
LOC        | the Northern District                              
GPE        | Georgia                                            
GPE        | Atlanta                                            
DATE       | May 22, 2009                                       
ORG        | AirTran                                            
LAW        | Section 2                                          
LAW        | the Sherman Act                                    
ORG        | Delta                                              
ORDINAL    | first                                              
LAW        | Section 1                                          
LAW        | the Sherman Act                                    
GPE        | the United States                              

ORG        | the Judicial Panel                                 
GPE        | Multi-District                                     
ORG        | United States District Court                       
GPE        | the District of Columbia                           
ORG        | Court                                              
ORG        | Company                                            
DATE       | July 8, 2015                                       
ORG        | Company                                            
GPE        | British Columbia                                   
GPE        | Canada                                             
ORG        | the Company, Air Canada                            
ORG        | American Airlines                                  
ORG        | Delta Air  Lines                                   
ORG        | United Airlines                                    
NORP       | Canadian                                           
GPE        | the United S

In [78]:
# entity visualization
# after you run this code, open another browser tab and go to http://localhost:5000
# when you are done (before you run the next cell in the notebook), stop this cell

displacy.serve(doc, style='ent')


    Serving on port 5000...
    Using the 'ent' visualizer


    Shutting down server on port 5000.



In [398]:
# observe the named entities tagged as a law
print(set(ent.text for ent in doc.ents if 'LAW' in ent.label_))

{'Section 2', 'Section 2 of the Sherman Act', 'Stipulation and Order', 'Section 1', 'the Sherman Act'}


In [397]:
# observe the named entities tagged as a geopolitical entity (GPE)
print(set(ent.text for ent in doc.ents if 'GPE' in ent.label_))

{'Indiana', 'Louisiana', 'Nevada', 'Quebec', 'the United States District Courts', 'the United States', 'New York', 'the District of Oklahoma', 'Ontario', 'the District of New Jersey', 'Multi-District  ', 'Saskatchewan', 'the District of Vermont', 'Las Vegas', 'Texas', 'British Columbia', 'Canada', 'Florida', 'California', 'Atlanta', 'North Carolina', 'Wisconsin', 'the United States District Court', 'Connecticut', 'the District of Columbia', 'Georgia', 'Pennsylvania', 'Orlando', 'the District of Minnesota', 'Illinois'}


### Identify Relevant Text (Rule-based Matching)

Rule-based matching finds sequences of tokens based on their text and linguistic annotations, much as regular expressions find sequences of characters. We will use it to filter and extract relevant text.

In [80]:
rule_based_matching_url = 'https://spacy.io/usage/linguistic-features#rule-based-matching'
iframe = '<iframe src={} width=1000 height=700></iframe>'.format(rule_based_matching_url)
HTML(iframe)

In [93]:
# The Matcher identifies text based on rules we specify
from spacy.matcher import Matcher

In [97]:
# create a function to specify what to do with the text we collect

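def collect_sents(matcher, doc, i, matches):
    """Print each matched span with its token indices and match id.

    :param matcher: the Matcher instance that found the match
    :param doc: the full Doc being matched
    :param i: index of the current match in matches
    :param matches: list of (match_id, start, end) tuples for all matches in doc
    """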
    
    match_id, start, end = matches[i]  # indices of matched term
    span = doc[start : end]            # extract matched term
    
    print('span: {} | start_ind:{:5} | end_ind:{:5} | id:{}'.format(
        span, start, end, match_id))

In [98]:
# set a pattern of text to collect
# we can add complex rules to match
pattern = [{'LOWER':'fees'}]

# instantiate matcher
matcher = Matcher(nlp.vocab)

# add pattern (spaCy v2 signature; spaCy v3 expects matcher.add('fee', [pattern], on_match=collect_sents))
matcher.add('fee', collect_sents, pattern)

# pass the doc to the matcher to run the collect_sents function
matcher(doc)

span: fees | start_ind:   80 | end_ind:   81 | id:7125196598045271428
span: fees | start_ind:  125 | end_ind:  126 | id:7125196598045271428
span: fees | start_ind:  252 | end_ind:  253 | id:7125196598045271428
span: fees | start_ind:  281 | end_ind:  282 | id:7125196598045271428
span: fees | start_ind:  933 | end_ind:  934 | id:7125196598045271428


[(7125196598045271428, 80, 81),
 (7125196598045271428, 125, 126),
 (7125196598045271428, 252, 253),
 (7125196598045271428, 281, 282),
 (7125196598045271428, 933, 934)]

In [99]:
# change the function to print the sentence of the matched term (span)

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    print('SPAN: {}'.format(span))
    print('SENT: {}'.format(span.sent))
    print()

pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fees'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

SPAN: bag fees
SENT: The complaint alleged, among other things, that AirTran attempted to monopolize air travel in violation of Section 2 of the Sherman Act, and conspired with Delta in imposing $15-per-bag fees for the first item of checked luggage in violation of Section 1 of the Sherman Act.

SPAN: baggage fees
SENT: In addition to treble damages for the amount of first baggage fees paid to  AirTran and to Delta, the Consolidated Amended Complaint seeks injunctive relief against a broad range of alleged anticompetitive activities, as well as attorneys’ fees.

SPAN: attorneys’ fees
SENT: In addition to treble damages for the amount of first baggage fees paid to  AirTran and to Delta, the Consolidated Amended Complaint seeks injunctive relief against a broad range of alleged anticompetitive activities, as well as attorneys’ fees.

SPAN: attorneys’ fees
SENT: The complaints seek treble damages for periods that vary among the complaints, costs, attorneys’ fees, and injunctive relief.



[(7125196598045271428, 79, 81),
 (7125196598045271428, 251, 253),
 (7125196598045271428, 280, 282),
 (7125196598045271428, 932, 934)]
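
The pattern `[{'POS': 'NOUN', 'OP': '+'}, {'LOWER': 'fees'}]` reads as "one or more nouns, followed by the token *fees*". As a rough illustration of that rule only (this is not how SpaCy implements it), the sketch below applies the same idea to hand-tagged `(text, POS)` tuples. The function name `match_nouns_then_word` and the sample tokens are made up for this example.

```python
def match_nouns_then_word(tokens, word):
    """Return (start, end) token ranges where one or more NOUN tokens precede `word`."""
    matches = []
    for i, (text, pos) in enumerate(tokens):
        if text.lower() != word:
            continue
        start = i
        # walk left while the preceding token is a noun; every extension is a match
        while start > 0 and tokens[start - 1][1] == 'NOUN':
            start -= 1
            matches.append((start, i + 1))
    return matches

# hand-tagged tokens, standing in for a tagged SpaCy Doc
tokens = [('the', 'DET'), ('first', 'ADJ'), ('baggage', 'NOUN'),
          ('fees', 'NOUN'), ('and', 'CCONJ'), ('fees', 'NOUN')]

for start, end in match_nouns_then_word(tokens, 'fees'):
    print(' '.join(text for text, pos in tokens[start:end]))
```

The second *fees* is not reported because no noun precedes it, which is also why this pattern returns four matches above while the plain `{'LOWER': 'fees'}` pattern returned five.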

In [100]:
# change the function to collect sentences

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    # update matched data collections
    matched_sents.append(span.sent)
    
matched_sents = []
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fees'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

[(7125196598045271428, 79, 81),
 (7125196598045271428, 251, 253),
 (7125196598045271428, 280, 282),
 (7125196598045271428, 932, 934)]

In [101]:
# review matches
set(matched_sents)

{The complaint alleged, among other things, that AirTran attempted to monopolize air travel in violation of Section 2 of the Sherman Act, and conspired with Delta in imposing $15-per-bag fees for the first item of checked luggage in violation of Section 1 of the Sherman Act.,
 In addition to treble damages for the amount of first baggage fees paid to  AirTran and to Delta, the Consolidated Amended Complaint seeks injunctive relief against a broad range of alleged anticompetitive activities, as well as attorneys’ fees.,
 The complaints seek treble damages for periods that vary among the complaints, costs, attorneys’ fees, and injunctive relief.}

##### DefaultDict

Normally, a Python dictionary raises a KeyError if you try to get an item with a key that is not in the dictionary. A defaultdict instead creates any item you try to access that does not exist yet. To create the default item, it calls the callable passed to its constructor (any callable works, including functions and types). For example, defaultdict(int) creates default items with int(), which returns the integer 0, and defaultdict(list) creates them with list(), which returns a new empty list.

In [89]:
sentence = ['The','airline','baggage','fees','and','food','fees','are','outrageous']

In [90]:
# WRONG APPROACH - ERROR!
d = {}
for word in sentence:
    d[word] += 1

print(d)

KeyError: 'The'

In [91]:
from collections import defaultdict

d = defaultdict(int)
for word in sentence:
    d[word] += 1

print(d)

defaultdict(<class 'int'>, {'baggage': 1, 'food': 1, 'are': 1, 'and': 1, 'The': 1, 'airline': 1, 'outrageous': 1, 'fees': 2})
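
The same idea works with `list` as the default factory, which is handy for grouping values under a key. A minimal sketch, reusing the `sentence` list from above (the name `positions` is chosen for this example):

```python
from collections import defaultdict

sentence = ['The', 'airline', 'baggage', 'fees', 'and', 'food', 'fees', 'are', 'outrageous']

# map each word to every index where it occurs;
# a missing key starts out as a new empty list
positions = defaultdict(list)
for index, word in enumerate(sentence):
    positions[word].append(index)

print(positions['fees'])
```

With a plain `dict`, the first `append` for each word would raise a KeyError, just like the `+= 1` in the cell marked as the wrong approach.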


In [102]:
# change the function to count matches using defaultdict

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    # update matched data collections
    ent_count[span.text] += 1  # defaultdict keys must be span.text not span!

ent_count = defaultdict(int)
pattern = [{'LOWER':'fees'}]
matcher = Matcher(nlp.vocab)
matcher.add('fees', collect_sents, pattern)
matcher(doc)

ent_count

defaultdict(int, {'fees': 5})

In [186]:
# update the pattern to look for a noun describing the fee

ent_count = defaultdict(int)
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fees'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

ent_count

defaultdict(int, {})

# Pipeline

If you have a sequence of documents to process, you should use the Language.pipe() method. The method takes an iterator of texts and accumulates them in an internal buffer, which it works on in parallel. It then yields the processed documents in order, one by one.

- batch_size: number of docs to process per thread
- n_threads: number of threads to use (-1, the default, lets SpaCy decide)
- disable: Names of pipeline components to disable.
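
To make the buffering idea concrete, here is a small sketch in plain Python. It only illustrates "collect a batch, process it, yield results in order"; it is not SpaCy's implementation and does no parallel work. The names `batched` and `pipe_sketch` are hypothetical.

```python
from itertools import islice

def batched(texts, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def pipe_sketch(texts, process, batch_size=2):
    """Process texts batch by batch and yield results in the original order."""
    for batch in batched(texts, batch_size):
        # a real pipeline could hand the whole batch to worker threads here
        results = [process(text) for text in batch]
        for result in results:
            yield result

for result in pipe_sketch(['bag fee', 'change fee', 'no charge'], str.upper):
    print(result)
```

Because `pipe_sketch` is a generator, nothing is processed until you iterate over it, which is also how `nlp.pipe` is used in the loops below.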

In [156]:
from spacy.pipeline import Pipe

In [219]:
# get multiple sections with the term fee
# use SpaCy to determine what type of fee
texts = df[df['section_text'].str.contains('fee')]['section_text'].values[0:5]

In [226]:
%%time

ent_count = defaultdict(int) # reset defaultdict

for doc in nlp.pipe(texts):  # all components enabled: tagger, parser, ner
    matcher(doc) # match on your text

print(ent_count)

defaultdict(<class 'int'>, {'change fee': 8, 'bag fee': 4})
Wall time: 34.2 s


### SpaCy - Tips for faster processing

You can substantially reduce the time SpaCy takes to process a document by disabling components of the NLP pipeline that are not necessary for a given task.

- Disable options: **parser, tagger, ner**
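
A disabled component never writes its annotations, so anything downstream that relies on them silently finds nothing. The toy pipeline below shows that effect with plain dictionaries. Everything in it (`toy_tagger`, `toy_matcher`, `run_pipeline`, the `NOUNS` set) is made up for illustration and is far simpler than SpaCy's real components.

```python
NOUNS = {'bag', 'change', 'fee', 'fees'}  # hand-written stand-in for a tagger's knowledge

def toy_tagger(doc):
    # write one POS tag per token
    doc['pos'] = ['NOUN' if tok.lower() in NOUNS else 'X' for tok in doc['tokens']]

def toy_matcher(doc):
    # find "NOUN fee(s)" pairs; depends on the tags written by toy_tagger
    pos = doc.get('pos', [None] * len(doc['tokens']))
    return [
        ' '.join(doc['tokens'][i - 1:i + 1])
        for i, tok in enumerate(doc['tokens'])
        if i > 0 and tok.lower() in ('fee', 'fees') and pos[i - 1] == 'NOUN'
    ]

def run_pipeline(text, components, disable=()):
    doc = {'tokens': text.split()}
    for name, component in components:
        if name not in disable:  # skipped components never annotate the doc
            component(doc)
    return doc

components = [('tagger', toy_tagger)]
text = 'The bag fee and the change fee rose'

with_tagger = toy_matcher(run_pipeline(text, components))
without_tagger = toy_matcher(run_pipeline(text, components, disable=['tagger']))
print(with_tagger)
print(without_tagger)
```

So before disabling a component, check which annotations your patterns actually use: a `POS` pattern needs the tagger, while a pattern built only on `LOWER` does not.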

In [229]:
%%time

ent_count = defaultdict(int) # reset defaultdict

# disable the parser and ner, as we only use POS tagging in this example
# processing occurs ~5x faster
for doc in nlp.pipe(texts, batch_size=100, disable=['parser','ner']):  
    matcher(doc) # match on your text

print(ent_count)

defaultdict(<class 'int'>, {'change fee': 2, 'bag fee': 1})
Wall time: 7.27 s


In [230]:
%%time

ent_count = defaultdict(int) # reset defaultdict

# disable the parser, tagger, and ner
# processing occurs ~75x faster, but the matcher finds nothing because the POS pattern needs the tagger
for doc in nlp.pipe(texts, batch_size=100, disable=['parser','tagger','ner']):
    matcher(doc) # match on your text

print(ent_count)

defaultdict(<class 'int'>, {})
Wall time: 462 ms


### Analyze the different risk types by year

In [399]:
# get multiple sections with the term risk
texts = df[df['section_text'].str.contains('risk')][['filename','section_text']].values

In [400]:
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    ent_count[span.text] += 1

In [401]:
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'risk'}]
matcher = Matcher(nlp.vocab)
matcher.add('risk', collect_sents, pattern)

In [None]:
%%time

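years = defaultdict(dict)
# each row of texts is (filename, section_text); the filename identifies the report year
for year, text in texts:
    ent_count = defaultdict(int)               # reset ent_count for each section
    doc = nlp(text, disable=['parser','ner'])  # disable unnecessary components
    matcher(doc)                               # match on your text
    
    for key, val in ent_count.items():
        # add to any count already stored, so several sections from one file accumulate
        years[year][key] = years[year].get(key, 0) + val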

In [None]:
# view the risks by year
pd.DataFrame(years).T
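
Note that the first column selected above is `filename`, not a year. A minimal sketch of one way to derive the year and sum counts across sections, assuming filenames end in a four-digit year before the extension (as in the Documents table, e.g. `southwest-airlines-co_annual_report_2012.docx`). The helpers `year_from_filename` and `add_counts`, and the sample counts, are hypothetical and not part of the notebook:

```python
import re
from collections import Counter, defaultdict

def year_from_filename(filename):
    """Return the four-digit year just before the file extension, or None."""
    match = re.search(r"(\d{4})\.\w+$", filename)
    return int(match.group(1)) if match else None

def add_counts(years, filename, section_counts):
    """Add one section's match counts to the running total for its year."""
    year = year_from_filename(filename)
    years[year].update(section_counts)  # Counter.update adds to existing counts
    return years

years_by_report = defaultdict(Counter)
# Made-up counts: two sections of the 2012 report and one of the 2013 report.
add_counts(years_by_report, "southwest-airlines-co_annual_report_2012.docx",
           {"credit risk": 2})
add_counts(years_by_report, "southwest-airlines-co_annual_report_2012.docx",
           {"credit risk": 1, "market risk": 3})
add_counts(years_by_report, "southwest-airlines-co_annual_report_2013.docx",
           {"market risk": 1})

totals = {year: dict(counts) for year, counts in years_by_report.items()}
print(totals)
```

Converting to plain dicts first, as in the last step, keeps the hand-off to `pd.DataFrame` unambiguous.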

## Advanced SpaCy

##### Stop Words

In [338]:
# flag every word in STOP_WORDS as a stop word in the vocab
# (add custom words to STOP_WORDS beforehand to extend the list)

from spacy.lang.en.stop_words import STOP_WORDS
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

##### Text Matching

When using rule-based matching, SpaCy may match the same term more than once when one matched phrase is contained in a longer one, for instance 'integration services' inside 'system integration services'.

To avoid counting these terms multiple times, we can extend the collect_sents function to check whether each match is contained in the previous match.

In [364]:
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    sent = span.sent
    
    # lemmatize the matched spans
    entity = span.lemma_.lower()
            
    # explicitly add the first entity without checking if it matches other terms
    # as there is no previous span to check    
    if i == 0:
        ent_count[entity] += 1
        ent_sents[entity].append(sent)
        matched_sents.append(sent)
        return

    # get the span, entity, and sentence from the previous match
    # if more than one match exist
    last_match_id, last_start, last_end = matches[i-1]
    last_span = doc[last_start : last_end]
    last_entity = last_span.lemma_.lower()  # lemmatize too, so both entities are compared in the same form
    last_sent = last_span.sent

    # to avoid adding duplicates when one term is contained in another 
    # (e.g. 'integration services' in 'system integration services')
    # make sure new spans are unique
    distinct_entity = (entity not in last_entity) or (sent != last_sent)
    not_duplicate_entity = (entity != last_entity) or (sent != last_sent)
    
    # update collections for unique data
    if distinct_entity and not_duplicate_entity:
        ent_count[entity] += 1
        ent_sents[entity].append(sent)
        matched_sents.append(sent)
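
The function above compares each match only with the one immediately before it. As an alternative sketch (not the notebook's method), nested matches can be dropped by working on token offsets: sort the spans longest first and keep a span only if it does not overlap one already kept. The function name `drop_nested` and the sample offsets are hypothetical. Newer SpaCy releases ship a similar helper, `spacy.util.filter_spans`.

```python
def drop_nested(spans):
    """Keep the longest spans; drop any span overlapping one already kept.

    `spans` is a list of (start, end) token offsets, end exclusive,
    like the (start, end) part of a Matcher result.
    """
    kept = []
    seen_tokens = set()
    # Longest first; ties broken by earliest start.
    for start, end in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)

# 'system integration services' at tokens 3..5, the nested
# 'integration services' at tokens 4..5, and an unrelated match at 10..11.
sample_matches = [(3, 6), (4, 6), (10, 12)]
filtered = drop_nested(sample_matches)
print(filtered)
```

Because this looks at all kept spans rather than only the previous one, it does not depend on the order in which the matcher reports its matches.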

##### Multiple Patterns

SpaCy matchers can use multiple patterns. Each pattern can be added to the Matcher individually with matcher.add, each with its own callback function, or several patterns can be unpacked into a single matcher.add call with * so that they share one key and one callback.

In [369]:
matched_sents = []
ent_sents  = defaultdict(list)
ent_count = defaultdict(int)

# multiple patterns
pattern = [[{'POS': 'NOUN', 'OP': '+'},{'LOWER': 'fee'}]
           , [{'POS': 'NOUN', 'OP': '+'},{'LOWER': 'fees'}]]
matcher = Matcher(nlp.vocab)

# unpack the list with * to add multiple patterns with the same collect_sents function
matcher.add('all_fees', collect_sents, *pattern)

texts = df[df['section_text'].str.contains('fee')]['section_text'].values[0:5]
for doc in nlp.pipe(texts, batch_size=100, disable=['ner']):
    matches = matcher(doc) 
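
The `*pattern` in the `matcher.add` call above is ordinary Python argument unpacking: each inner list becomes its own positional argument, so the matcher receives two separate patterns under one key. A minimal sketch with a stand-in function (`record_add` is hypothetical and only mimics the argument shape of the `matcher.add(key, callback, *patterns)` call used in this notebook):

```python
def record_add(key, on_match, *patterns):
    """Stand-in for a matcher's add method; returns what it received."""
    return {"key": key, "on_match": on_match, "patterns": list(patterns)}

fee_patterns = [
    [{"POS": "NOUN", "OP": "+"}, {"LOWER": "fee"}],
    [{"POS": "NOUN", "OP": "+"}, {"LOWER": "fees"}],
]

# With *, the two inner lists arrive as two separate patterns.
unpacked = record_add("all_fees", None, *fee_patterns)
# Without *, the whole outer list arrives as a single argument.
not_unpacked = record_add("all_fees", None, fee_patterns)

print(len(unpacked["patterns"]), len(not_unpacked["patterns"]))
```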

#### Text Preprocessing Example

In [371]:
def clean_text(doc): 
    # Add named entities, but only if they span more than two tokens (len(ent) > 2).
    IGNORE_ENTS = ('QUANTITY','ORDINAL','CARDINAL','DATE'
                   ,'PERCENT','MONEY','TIME')
    ents = doc.ents
    ents = [ent for ent in ents if 
             (ent.label_ not in IGNORE_ENTS) and (len(ent) > 2)]
    
    # add underscores to combine words in entities
    ents = [str(ent).strip().replace(' ','_') for ent in ents]
 
    # Keep only alphabetic tokens (no numbers, no punctuation), drop stop words,
    # and lemmatize what remains.
    doc = [token.lemma_ for token in doc 
           if token.is_alpha and not token.is_stop]
    
    doc.extend([entity for entity in ents])
    
    return [str(term) for term in doc]
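
To make the entity filter concrete, here is a minimal sketch of the same rule applied to plain tuples instead of SpaCy spans. The `(text, label)` pairs and the helper `keep_entities` are made up for illustration, and token length is approximated by splitting on whitespace:

```python
IGNORE_ENTS = ("QUANTITY", "ORDINAL", "CARDINAL", "DATE",
               "PERCENT", "MONEY", "TIME")

def keep_entities(entities, min_tokens=3):
    """Keep entities with an allowed label and at least `min_tokens` tokens,
    joined with underscores (mirrors `len(ent) > 2` in clean_text)."""
    kept = []
    for text, label in entities:
        n_tokens = len(text.split())
        if label not in IGNORE_ENTS and n_tokens >= min_tokens:
            kept.append(text.strip().replace(" ", "_"))
    return kept

# Hypothetical entities; the labels are assumptions, not SpaCy output.
sample_entities = [
    ("AirTran Business Class", "ORG"),   # 3 tokens, allowed label: kept
    ("AirTran", "ORG"),                  # 1 token: dropped
    ("first or second", "ORDINAL"),      # ignored label: dropped
]
kept_entities = keep_entities(sample_entities)
print(kept_entities)
```

Lowering `min_tokens` to 2 would also keep two-word entities, which is what "a compound of more than one word" would imply.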

In [380]:
%%time
cleaned_text = []
for sent in matched_sents:
    sent = nlp(sent.text)
    text = clean_text(sent)
    cleaned_text.append(text)

Wall time: 5.31 s


In [385]:
print(matched_sents[0])

AirTran Business Class fares are refundable and changeable and include additional perks such as priority boarding, oversized seats with additional leg room, bonus frequent flyer credit, no first or second bag fees, and complimentary cocktails onboard.


In [386]:
print(cleaned_text[0])

['airtran', 'business', 'class', 'fare', 'refundable', 'changeable', 'include', 'additional', 'perk', 'priority', 'boarding', 'oversized', 'seat', 'additional', 'leg', 'room', 'bonus', 'frequent', 'flyer', 'credit', 'second', 'bag', 'fee', 'complimentary', 'cocktail', 'onboard', 'AirTran_Business_Class']
