# Text Preprocessing

##### Author: Alex Sherman | alsherman@deloitte.com

#### Agenda

1. SpaCy
2. Text Tagging
3. Text Identification
4. Text Preprocessing

In [31]:
import os
from IPython.core.display import display, HTML
from configparser import ConfigParser, ExtendedInterpolation

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')
DB_PATH = config['DATABASES']['PROJECT_DB_PATH']

In [32]:
# confirm DB_PATH is correct db directory, otherwise the rest of the code will not work
DB_PATH

'sqlite:///C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\raw_data\\databases\\annual_report.db'
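
The cells above assume a config.ini file with a DATABASES section that contains PROJECT_DB_PATH. As a minimal sketch of how ExtendedInterpolation resolves `${section:key}` references, the example below builds a comparable config from a string. The PATHS section, its PROJECT_DIR key, and the path value are hypothetical and are not taken from the actual project file.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Hypothetical config contents; only DATABASES / PROJECT_DB_PATH appear in the notebook.
example_ini = """
[PATHS]
PROJECT_DIR = /tmp/ml_guild

[DATABASES]
PROJECT_DB_PATH = sqlite:///${PATHS:PROJECT_DIR}/raw_data/databases/annual_report.db
"""

example_config = ConfigParser(interpolation=ExtendedInterpolation())
example_config.read_string(example_ini)

# ${PATHS:PROJECT_DIR} is expanded when the value is read
example_db_path = example_config['DATABASES']['PROJECT_DB_PATH']
print(example_db_path)
```

Keeping the project directory in one key means only that key needs to change when the notebook is run on another machine.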

In [33]:
# check for the names of the tables in the database
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(DB_PATH)
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con=engine)

Unnamed: 0,name
0,DOCUMENTS
1,SECTIONS


In [34]:
# read the annual report documents
doc_df = pd.read_sql("SELECT * FROM Documents", con=engine)
doc_df

Unnamed: 0,document_id,path,filename,year,document_text,table_text,author,last_modified_by,created,revision,num_tables
0,1,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2012.docx,2012,SOUTHWEST AIRLINES CO. 2012 ANNUAL REPORT TO S...,2013 . . . . . . . . . . . . . . . . . . . . ....,,,2018-01-03 22:49:42,0,48
1,2,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2013.docx,2013,SOUTHWEST AIRLINES CO. 2013 ANNUAL REPORT TO S...,Period Dividend High Low 2013 1st Qua...,,,2018-01-03 22:50:40,0,45
2,3,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2014.docx,2014,SOUTHWEST AIRLINES CO. 2014 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:51:35,0,58
3,4,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2015.docx,2015,SOUTHWEST AIRLINES CO. 2015 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:52:25,0,53
4,5,C:\Users\alsherman\Desktop\PycharmProjects\fir...,southwest-airlines-co_annual_report_2016.docx,2016,SOUTHWEST AIRLINES CO. 2016 ANNUAL REPORT TO S...,PART I Item 1. Business 1 Item 1A. Risk Fa...,,,2018-01-03 22:53:10,0,58


In [35]:
# read the annual report sections
df = pd.read_sql("SELECT * FROM Sections ", con=engine)
df.head(3)

Unnamed: 0,section_id,filename,section_name,criteria,section_text
0,1,southwest-airlines-co_annual_report_2012.docx,SOUTHWEST AIRLINES CO. 2012 ANNUAL REPORT TO ...,<function style at 0x00000227334AA048>,To our Shareholders: The year 2012 represented...
1,2,southwest-airlines-co_annual_report_2012.docx,AIRTRAN INTEGRATION: WE ARE ON TRACK WITH OUR ...,<function capitalization at 0x000002273349EF28>,"In December 2012, we announced new 2013 revenu..."
2,3,southwest-airlines-co_annual_report_2012.docx,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,<function style at 0x00000227334AA048>,Í ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d...


In [36]:
df[df.section_text.str.contains('fee')].section_name

15                                                AIRTRAN
20      SOUTHWEST’S ALL-NEW RAPID REWARDS FREQUENT FLY...
26      AGGRESSIVE PROMOTION OF THE COMPANY’S POINTS O...
28                            ANCILLARY SERVICES AND FEES
34      ECONOMIC AND OPERATIONAL REGULATION THE U.S. D...
35                                         AVIATION TAXES
38                                    SECURITY REGULATION
43                             PRICING AND COST STRUCTURE
54      THE COMPANY’S LOW-COST STRUCTURE HAS HISTORICA...
73      AIRTRAN IS CURRENTLY SUBJECT TO PENDING ANTITR...
79                         GROUND FACILITIES AND SERVICES
80                              ITEM 3. LEGAL PROCEEDINGS
92                                         YEAR IN REVIEW
94                                     OPERATING REVENUES
98      AVERAGE BRENT CRUDE OIL ESTIMATED DIFFERENCE I...
104                                         CHANGE CHANGE
107                   OBLIGATIONS BY PERIOD (IN MILLIONS)
109           
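
Note that a plain substring test for 'fee' also matches words such as "coffee" and misses capitalized forms such as "Fee". The sketch below, using only the standard re module and made-up sample strings, contrasts a substring check with a word-boundary pattern. The same regular expression could be passed to str.contains together with case=False.

```python
import re

# word-boundary pattern: matches "fee" or "fees" as whole words, any case
fee_pattern = re.compile(r'\bfees?\b', flags=re.IGNORECASE)

# illustrative strings only, not taken from the database
samples = ['no hidden fees', 'Change Fee waived', 'free coffee onboard']

whole_word_hits = [s for s in samples if fee_pattern.search(s)]
substring_hits = [s for s in samples if 'fee' in s]

print(whole_word_hits)
print(substring_hits)
```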

In [37]:
# example text
text = df.section_text[946]
text

'During 2016, the Company continued to aggressively market and benefit from Southwest’s points of differentiation from its competitors. For example, the Company’s TransfarencySM  campaign emphasizes Southwest’s approach to treating Customers fairly, honestly, and respectfully, with its low fares and no unexpected bag fees, change fees, or hidden fees. Southwest continues to be the only major U.S. airline that offers to all ticketed Customers up to two checked bags that fly free (weight and size limits apply). Through both its national and local marketing campaigns, Southwest has continued to aggressively promote this point of differentiation from its competitors with its “Bags Fly Free®” message. The Company believes its decision not to charge for first and second checked bags, as reinforced by the Company’s related marketing, has driven an increase in the Company’s market share and a resulting net increase in revenues. Southwest is also the only major U.S. airline that does not charge

### SpaCy

#### Installation:
- Download Microsoft Visual C++: http://landinghub.visualstudio.com/visual-cpp-build-tools
- conda install -c conda-forge spacy
- python -m spacy download en

##### if you run into an error try the following:
- python -m spacy link en_core_web_sm en
- SOURCE: https://github.com/explosion/spaCy/issues/950

##### Optional: install the larger convolutional neural network model:
- python -m spacy download en_core_web_lg

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

spaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

### SpaCy Features 

NAME |	DESCRIPTION |
:----- |:------|
Tokenization|Segmenting text into words, punctuation marks etc.|
Part-of-speech (POS) Tagging|Assigning word types to tokens, like verb or noun.|
Dependency Parsing|	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
Lemmatization|	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".|
Sentence Boundary Detection (SBD)|	Finding and segmenting individual sentences.|
Named Entity Recognition (NER)|	Labelling named "real-world" objects, like persons, companies or locations.|
Similarity|	Comparing words, text spans and documents and how similar they are to each other.|
Text Classification|	Assigning categories or labels to a whole document, or parts of a document.|
Rule-based Matching|	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|
Training|	Updating and improving a statistical model's predictions.|
Serialization|	Saving objects to files or byte strings.|

SOURCE: https://spacy.io/usage/spacy-101

In [38]:
import spacy
from spacy import displacy

In [39]:
# read in an English language model
#nlp = spacy.load('en')  # simple model
nlp = spacy.load('en_core_web_lg')  # cnn model

# another approach:
# import en_core_web_sm
# nlp = en_core_web_sm.load()

In [40]:
# instantiate the document text
doc = nlp(text)

In [41]:
# view the text
doc

During 2016, the Company continued to aggressively market and benefit from Southwest’s points of differentiation from its competitors. For example, the Company’s TransfarencySM  campaign emphasizes Southwest’s approach to treating Customers fairly, honestly, and respectfully, with its low fares and no unexpected bag fees, change fees, or hidden fees. Southwest continues to be the only major U.S. airline that offers to all ticketed Customers up to two checked bags that fly free (weight and size limits apply). Through both its national and local marketing campaigns, Southwest has continued to aggressively promote this point of differentiation from its competitors with its “Bags Fly Free®” message. The Company believes its decision not to charge for first and second checked bags, as reinforced by the Company’s related marketing, has driven an increase in the Company’s market share and a resulting net increase in revenues. Southwest is also the only major U.S. airline that does not charge 

In [42]:
spacy_url = 'https://spacy.io/assets/img/pipeline.svg'
iframe = '<iframe src={} width=1000 height=200></iframe>'.format(spacy_url)
HTML(iframe)

### Tokenization

spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. 
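
To make the idea of rule-based splitting with exceptions concrete, here is a toy sketch in plain Python. It is not spaCy's tokenizer and is far simpler than the real rules; the function name simple_tokenize and the EXCEPTIONS set are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical exception list: strings that must stay a single token
EXCEPTIONS = {"U.K.", "U.S."}

def simple_tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk
    unless the chunk is a known exception."""
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            tokens.append(chunk)
            continue
        # separate runs of word characters from individual punctuation marks
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

example_tokens = simple_tokenize("Southwest is a major U.S. airline, not a U.K. one.")
print(example_tokens)
```

The sketch only handles an exception when it stands alone between spaces (it would still split "U.K.," incorrectly), which is one reason a real tokenizer needs more elaborate, language-specific rules.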

In [43]:
tokenization_url = 'https://spacy.io/assets/img/tokenization.svg'
iframe = '<iframe src={} width=650 height=400></iframe>'.format(tokenization_url)
HTML(iframe)

### Part-of-speech (POS) Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Annotation | Description
:----- |:------|
Text |The original word text|
Lemma |The base form of the word.|
POS |The simple part-of-speech tag.|
Tag |The detailed part-of-speech tag.|
Dep |Syntactic dependency, i.e. the relation between tokens.|
Shape |The word shape – capitalisation, punctuation, digits.|
Is Alpha |Is the token an alpha character?|
Is Stop |Is the token part of a stop list, i.e. the most common words of the language?|

In [None]:
print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'text', 'lemma_', 'pos_', 'tag_', 'dep_', 'shape_', 'is_alpha', 'is_stop'))
print('_'*104)

for token in doc:
    print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
          token.text, token.lemma_, token.pos_, token.tag_, token.dep_
        , token.shape_, token.is_alpha, token.is_stop))

text            | lemma_          | pos_     | tag_     | dep_        | shape_   | is_alpha | is_stop  | 
________________________________________________________________________________________________________
During          | during          | ADP      | IN       | prep        | Xxxxx    |        1 |        0 |
2016            | 2016            | NUM      | CD       | pobj        | dddd     |        0 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
Company         | company         | PROPN    | NNP      | nsubj       | Xxxxx    |        1 |        0 |
continued       | continue        | VERB     | VBD      | ROOT        | xxxx     |        1 |        0 |
to              | to              | PART     | TO       | aux         | xx       |        1 |        0 |
aggressively    | aggressively    | ADV      | RB     

up              | up              | ADP      | IN       | prep        | xx       |        1 |        0 |
to              | to              | PART     | TO       | prep        | xx       |        1 |        0 |
two             | two             | NUM      | CD       | nummod      | xxx      |        1 |        0 |
checked         | check           | VERB     | VBN      | amod        | xxxx     |        1 |        0 |
bags            | bag             | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
that            | that            | ADJ      | WDT      | nsubj       | xxxx     |        1 |        0 |
fly             | fly             | VERB     | VBP      | relcl       | xxx      |        1 |        0 |
free            | free            | ADJ      | JJ       | advmod      | xxxx     |        1 |        0 |
(               | (               | PUNCT    | -LRB-    | punct       | (        |        0 |        0 |
weight          | weight          | NOUN     | NN      

Company         | company         | PROPN    | NNP      | poss        | Xxxxx    |        1 |        0 |
’s              | ’s              | PART     | POS      | case        | ’x       |        0 |        0 |
market          | market          | NOUN     | NN       | compound    | xxxx     |        1 |        0 |
share           | share           | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
a               | a               | DET      | DT       | det         | x        |        1 |        0 |
resulting       | result          | VERB     | VBG      | amod        | xxxx     |        1 |        0 |
net             | net             | ADJ      | JJ       | amod        | xxx      |        1 |        0 |
increase        | increase        | NOUN     | NN       | conj        | xxxx     |        1 |        0 |
in              | in              | ADP      | IN      

a               | a               | DET      | DT       | det         | x        |        1 |        0 |
change          | change          | NOUN     | NN       | compound    | xxxx     |        1 |        0 |
fee             | fee             | NOUN     | NN       | dobj        | xxx      |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
While           | while           | ADP      | IN       | mark        | Xxxxx    |        1 |        0 |
a               | a               | DET      | DT       | det         | x        |        1 |        0 |
Customer        | customer        | NOUN     | NN       | nsubj       | Xxxxx    |        1 |        0 |
may             | may             | VERB     | MD       | aux         | xxx      |        1 |        0 |
pay             | pay             | VERB     | VB       | advcl       | xxx      |        1 |        0 |
a               | a               | DET      | DT      

car             | car             | NOUN     | NN       | compound    | xxx      |        1 |        0 |
seat            | seat            | NOUN     | NN       | npadvmod    | xxxx     |        1 |        0 |
free            | free            | ADJ      | JJ       | conj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
charge          | charge          | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
in              | in              | ADP      | IN       | prep        | xx       |        1 |        0 |
addition        | addition        | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
to              | to              | ADP      | IN       | prep        | xx       |        1 |        0 |
the             | the             | DET      | DT      

trust           | trust           | NOUN     | NN       | conj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        0 |
belief          | belief          | NOUN     | NN       | conj        | xxxx     |        1 |        0 |
in              | in              | ADP      | IN       | prep        | xx       |        1 |        0 |
providing       | provide         | VERB     | VBG      | pcomp       | xxxx     |        1 |        0 |
exceptional     | exceptional     | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
Hospitality     | hospitality     | PROPN    | NNP      | dobj        | Xxxxx    |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
and             | and             | CCONJ    | CC      

Company         | company         | PROPN    | NNP      | nsubj       | Xxxxx    |        1 |        0 |
unveiled        | unveil          | VERB     | VBD      | ROOT        | xxxx     |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
next            | next            | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
phase           | phase           | NOUN     | NN       | dobj        | xxxx     |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        0 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        0 |
Heart           | heart           | PROPN    | NNP      | compound    | Xxxxx    |        1 |        0 |
brand           | brand           | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
with            | with            | ADP      | IN      

personal        | personal        | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
belongings      | belonging       | NOUN     | NNS      | pobj        | xxxx     |        1 |        0 |
.               | .               | PUNCT    | .        | punct       | .        |        0 |        0 |
In              | in              | ADP      | IN       | prep        | Xx       |        1 |        0 |
addition        | addition        | NOUN     | NN       | pobj        | xxxx     |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
in              | in              | ADP      | IN       | prep        | xx       |        1 |        0 |
mid-2017        | mid-2017        | NOUN     | NN       | pobj        | xxx-dddd |        0 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
front           | front           | ADJ      | JJ      

In [None]:
displacy.serve(doc, style='dep')

    Serving on port 5000...
    Using the 'dep' visualizer



### Named Entity Recognition (NER)

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. 

In [None]:
for ent in doc.ents:
    print('label: {:10} | entity: {:50} '.format(ent.label_, ent.text))

In [None]:
displacy.serve(doc, style='ent')

In [None]:
# observe the named entities tagged as PERSON
for ent in doc.ents:
    if 'PERSON' in ent.label_:
        print(ent)

In [None]:
# observe the named entities tagged as ORG (organization)
for ent in doc.ents:
    if 'ORG' in ent.label_:
        print(ent)

### Text Dependency Parsing

In [None]:
# column headers are ordered to match the values printed for each noun chunk
print('{:15} | {:10} | {:15} | {:40}'.format('Root Text','Root Dep','Root Head Text','Chunk Text'))
for chunk in doc.noun_chunks:
    print('{:15} | {:10} | {:15} | {:40}'.format(
        chunk.root.text, chunk.root.dep_, chunk.root.head.text, chunk.text))

### Identify Relevant Text (Rule-based Matching)

Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. We will use this to filter and extract relevant text.

In [None]:
rule_based_matching_url = 'https://spacy.io/usage/linguistic-features#rule-based-matching'
iframe = '<iframe src={} width=1000 height=700></iframe>'.format(rule_based_matching_url)
HTML(iframe)

In [None]:
# The Matcher identifies text based off rules we specify
from spacy.matcher import Matcher

In [None]:
# create a function to specify what to do with the text we collect

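def collect_sents(matcher, doc, i, matches):
    """Collect and transform the matched text.

    :param matcher: the Matcher instance that found the match
    :param doc: the full Doc that the matcher was run on
    :param i: the index of the current match in matches
    :param matches: list of (match_id, start, end) tuples for all matches
    """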
    
    match_id, start, end = matches[i]  # indices of matched term
    span = doc[start : end] # extract matched term
    
    print('span: {} | start:{:5} | end:{:5} | id:{}'.format(
        span, start, end, match_id))

In [None]:
# set a pattern of text to collect
# we can add complex rules to match
pattern = [{'LOWER':'fee'}]

# instantiate matcher
matcher = Matcher(nlp.vocab)

# add pattern
matcher.add('fee', collect_sents, pattern)

# pass the doc to the matcher to run the collect_sents function
matcher(doc)

In [None]:
# change the function to print the sentence of the matched term (span)

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    print('SPAN {}'.format(span))
    print('SENT: {}'.format(span.sent))
    print()

pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fee'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

In [None]:
# change the function to collect sentences

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    # update matched data collections
    matched_sents.append(span.sent)
    
matched_sents = []
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fee'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

In [None]:
# review matches
matched_sents

##### DefaultDict

Usually, a Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. The defaultdict in contrast will simply create any items that you try to access (provided of course they do not exist yet). To create such a "default" item, it calls the function object that you pass in the constructor (more precisely, it's an arbitrary "callable" object, which includes function and type objects). For the first example, default items are created using int(), which will return the integer object 0. For the second example, default items are created using list(), which returns a new empty list object.

In [None]:
from collections import defaultdict

s = 'mississippi'

d = defaultdict(int)
for k in s:
    d[k] += 1

sorted(d.items())
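
The second case described above, defaultdict(list), is not shown in the cell. A minimal sketch follows; the (colour, number) pairs are made up purely for illustration.

```python
from collections import defaultdict

# illustrative data only
pairs = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)  # a missing key starts out as an empty list

print(sorted(grouped.items()))
```

This grouping pattern is the same one used later to collect the sentences belonging to each matched term.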

In [None]:
# change the function to count matches using defaultdict

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    # update matched data collections
    ent_count[span.text] += 1  # key must be span.text not span!

ent_count = defaultdict(int)
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fee'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)
matcher(doc)

ent_count

In [None]:
# collect entity counts across all documents

ent_count = defaultdict(int)
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'fee'}]
matcher = Matcher(nlp.vocab)
matcher.add('fee', collect_sents, pattern)

for section in df['section_text'][0:10]:
    matcher(nlp(section)) # match on your text

ent_count

### Exercise 
Get all sentences containing the word "risk" for topic analysis.

In [None]:
df.head()

In [None]:
pattern = [{'POS': 'NOUN', 'OP': '+'},{'LOWER':'risk'}]  # match noun phrases ending in "risk"
matcher = Matcher(nlp.vocab)
matcher.add('risk', collect_sents, pattern)

years = {}
for ind, row in df.iterrows():
    if ind == 10:
        break
    ent_count = defaultdict(int)
    year = row['filename']
    text = row['section_text']
    doc = nlp(text)
    matcher(doc) # match on your text
    years[year] = ent_count

years

In [None]:
years

In [None]:
pd.DataFrame(years).T

## Advanced SpaCy

##### Stop Words

In [None]:
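from spacy.lang.en.stop_words import STOP_WORDS

# the loaded model does not flag stop words (is_stop is 0 for 'the' in the table above),
# so set the flag explicitly for every word in spaCy's English stop list
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True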

##### Text Matching

When using rule-based matching, spaCy may match the same term more than once when one matched phrase is contained in a longer one, for instance 'integration services' inside 'system integration services'.

To avoid counting these terms multiple times, we can extend the collect_sents function to check whether each term is contained in the previous match.

In [None]:
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    sent = span.sent

    # lemmatize the matched spans
    entity = span.lemma_.lower()
            
    # explicitly add the first entity without checking if it matches other terms
    # as there is no previous span to check
    if i == 0:
        ent_count[entity] += 1
        ent_sents[entity].append(sent)
        matched_sents.append(sent)
        return

    # get the span, entity, and sentence from the previous match
    # if more than one match exist
    last_match_id, last_start, last_end = matches[i-1]
    last_span = doc[last_start : last_end]
    last_entity = last_span.lemma_.lower()  # lemmatize so it is comparable with entity
    last_sent = last_span.sent

    # to avoid adding duplicates when one term is contained in another 
    # (e.g. 'integration services' in 'system integration services')
    # make sure new spans are unique
    distinct_entity = (entity not in last_entity) or (sent != last_sent)
    not_duplicate_entity = (entity != last_entity) or (sent != last_sent)
    
    # update collections for unique data
    if distinct_entity and not_duplicate_entity:
        ent_count[entity] += 1
        ent_sents[entity].append(sent)
        matched_sents.append(sent)
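
The check above only compares each match with the one immediately before it. As an alternative sketch that is not part of the original notebook, overlapping matches can be reduced to the longest span in a separate pass over the (match_id, start, end) tuples that the matcher returns. The helper name keep_longest_spans and the sample tuples are hypothetical.

```python
def keep_longest_spans(matches):
    """Return non-overlapping (match_id, start, end) tuples, preferring longer spans."""
    # longest first; ties broken by earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        positions = set(range(start, end))
        if positions & used:
            continue  # overlaps a span that was already kept
        kept.append((match_id, start, end))
        used |= positions
    # restore document order
    return sorted(kept, key=lambda m: m[1])

# made-up token indices: "system integration services" (4..7), the contained
# "integration services" (5..7), and an unrelated match at 10..12
sample_matches = [(1, 5, 7), (1, 4, 7), (1, 10, 12)]
filtered = keep_longest_spans(sample_matches)
print(filtered)
```

Running this on the list returned by matcher(doc), rather than inside the callback, keeps the callback simple at the cost of a second pass over the matches.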

##### Multiple Patterns

spaCy matchers can use multiple patterns. Each pattern can be added to the Matcher individually with matcher.add, each with its own callback function, or a list of patterns can be unpacked (e.g. *pattern) to add them all at once with the same collect_sents function.

In [None]:
matched_sents = []
ent_sents  = defaultdict(list)
ent_count = defaultdict(int)

# multiple patterns
pattern = [[{'POS': 'NOUN', 'OP': '+'},{'LOWER': 'fee'}]
           , [{'POS': 'NOUN', 'OP': '+'},{'LOWER': 'fees'}]]
matcher = Matcher(nlp.vocab)

# *patterns to add multiple patterns with the same collect_sents function
matcher.add('ProductTypes', collect_sents, *pattern)
matches = matcher(doc) 

### Text Preprocessing

In [None]:
def clean_text(doc): 
    # Keep named entities, but only multi-word ones
    # (the filter below requires more than two tokens).
    IGNORE_ENTS = ('QUANTITY','ORDINAL','CARDINAL','DATE'
                   ,'PERCENT','MONEY','TIME')
    ents = doc.ents
    ents = [ent for ent in ents if 
             (ent.label_ not in IGNORE_ENTS) and (len(ent) > 2)]
    
    # add underscores to combine words in entities
    ents = [str(ent).strip().replace(' ','_') for ent in ents]
 
    # Keep only words (no numbers, no punctuation).
    # Lemmatize tokens, remove punctuation and remove stopwords.
    doc = [token.lemma_ for token in doc 
           if token.is_alpha and not token.is_stop]
    
    doc.extend([entity for entity in ents])
    
    return [str(term) for term in doc]

In [None]:
%%time
cleaned_text = []
for sent in matched_sents:
    text = clean_text(sent)
    cleaned_text.append(text)

print(cleaned_text[0])