<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-pipeline-components-to-use-in-SpaCy-NLP-pipeline" data-toc-modified-id="Create-pipeline-components-to-use-in-SpaCy-NLP-pipeline-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create pipeline components to use in SpaCy NLP pipeline</a></span><ul class="toc-item"><li><span><a href="#Class-to-tag-employee-nouns-as-entities" data-toc-modified-id="Class-to-tag-employee-nouns-as-entities-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Class to tag employee nouns as entities</a></span></li><li><span><a href="#YearMatcher:-Add-custom-_.is_year-attribute-to-year-tokens-that-are-part-of-Date-entities" data-toc-modified-id="YearMatcher:-Add-custom-_.is_year-attribute-to-year-tokens-that-are-part-of-Date-entities-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>YearMatcher: Add custom <code>_.is_year</code> attribute to year tokens that are part of Date entities</a></span></li></ul></li><li><span><a href="#Code-for-relationship-extraction-and-helper-functions" data-toc-modified-id="Code-for-relationship-extraction-and-helper-functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Code for relationship extraction and helper functions</a></span><ul class="toc-item"><li><span><a href="#Function:-extract_emp_relations" data-toc-modified-id="Function:-extract_emp_relations-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Function: extract_emp_relations</a></span></li><li><span><a href="#Testing-handling-of-tuples-(sorting-to-match-years-with-cardinal-numbers)" data-toc-modified-id="Testing-handling-of-tuples-(sorting-to-match-years-with-cardinal-numbers)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Testing handling of tuples (sorting to match years with cardinal numbers)</a></span></li><li><span><a href="#Function:-print_doc_info" data-toc-modified-id="Function:-print_doc_info-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Function: 
print_doc_info</a></span></li></ul></li><li><span><a href="#Define-sentence-structure-types" data-toc-modified-id="Define-sentence-structure-types-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Define sentence structure types</a></span><ul class="toc-item"><li><span><a href="#[Company]-[employed-|-have]-[number]-[emp_noun]" data-toc-modified-id="[Company]-[employed-|-have]-[number]-[emp_noun]-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>[Company] [employed | have] [number] [emp_noun]</a></span></li><li><span><a href="#[number_noun]-[was]-[number]" data-toc-modified-id="[number_noun]-[was]-[number]-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>[number_noun] [was] [number]</a></span></li><li><span><a href="#[we]-[employed-|-have]-[number]-[emp_noun]" data-toc-modified-id="[we]-[employed-|-have]-[number]-[emp_noun]-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>[we] [employed | have] [number] [emp_noun]</a></span></li><li><span><a href="#[noun-phrases]-[employed-|-have]-[prep-phrase]--[number]-[emp_noun]" data-toc-modified-id="[noun-phrases]-[employed-|-have]-[prep-phrase]--[number]-[emp_noun]-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>[noun phrases] [employed | have] [prep phrase]  [number] [emp_noun]</a></span></li><li><span><a href="#[we]-[employed-|-have]-[number]-[emp_noun]" data-toc-modified-id="[we]-[employed-|-have]-[number]-[emp_noun]-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>[we] [employed | have] [number] [emp_noun]</a></span></li></ul></li><li><span><a href="#Testing-with-paragraphs" data-toc-modified-id="Testing-with-paragraphs-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Testing with paragraphs</a></span></li><li><span><a href="#Matcher-testing" data-toc-modified-id="Matcher-testing-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Matcher testing</a></span></li></ul></div>

In [1]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc, Span, Token
import pandas as pd
import re

from path import Path, getcwdu

In [2]:
subset_df = pd.read_excel('../data/subset_employee_count_paragraphs.xlsx')

In [3]:
nonum_pat = re.compile(r"^([^\d]|[$][\d]+)[^\d]*$", re.I)

train_df = pd.read_csv('data/classifier_input_train.csv', index_col=0)
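# Combine both conditions into one boolean mask; chaining two boolean
# selections makes pandas reindex the second mask and emit a UserWarning.
keep_train = (train_df['len'] > 19) & ~train_df.para_text.apply(lambda x: bool(nonum_pat.match(x)))
train_df = train_df[keep_train]
val_df = pd.read_csv('data/classifier_input_val.csv', index_col=0)
keep_val = (val_df['len'] > 19) & ~val_df.para_text.apply(lambda x: bool(nonum_pat.match(x)))
val_df = val_df[keep_val]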
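
As a quick check of what `nonum_pat` excludes, the sketch below runs the same pattern on a few made-up strings. The sample strings and the helper name `keeps_paragraph` are illustrative only and do not come from the data.

```python
import re

# Same pattern as above: it matches strings that contain no digits at all,
# apart from an optional leading dollar amount such as "$500".
nonum_pat = re.compile(r"^([^\d]|[$][\d]+)[^\d]*$", re.I)

def keeps_paragraph(text):
    """Hypothetical helper: True if the paragraph survives the filter."""
    return not nonum_pat.match(text)

samples = [
    "We have no full-time staff.",            # no digits -> dropped
    "$500 was paid to each associate.",       # only a leading dollar amount -> dropped
    "We had approximately 8,250 employees.",  # digits in the body -> kept
]
results = {s: keeps_paragraph(s) for s in samples}
for text, kept in results.items():
    print(kept, text)
```

Note that a dollar amount later in the paragraph still counts as a number, so such paragraphs are kept.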



## Create pipeline components to use in SpaCy NLP pipeline

### Class to tag employee nouns as entities

In [4]:
# Templated from: https://spacy.io/usage/processing-pipelines#custom-components 
class EmpNounRecognizer(object):
    """A spaCy v2.0 pipeline component that sets entity annotations
    based on list of terms. Terms are labelled as EMP_NOUN. Additionally,
    ._.has_emp_noun and ._.is_emp_noun are set on the Doc/Span and Token
    respectively."""
    name = 'employee_nouns'  # component name, will show up in the pipeline

    def __init__(self, nlp, terms=tuple(), label='EMP_NOUN'):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher – it can now take Doc objects as patterns,
        # so even if the list of terms is long, it's very efficient
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('EMP_NOUN', None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_emp_noun', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_emp_noun == True.
        Doc.set_extension('has_emp_noun', getter=self.has_emp_noun)
        Span.set_extension('has_emp_noun', getter=self.has_emp_noun)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_emp_noun', True)
            # Overwrite doc.ents and add entity – be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities – otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def has_emp_noun(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is an employee noun. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_emp_noun' attribute here,
        which is already set in the processing step."""
        return any([t._.get('is_emp_noun') for t in tokens])

In [5]:
emp_terms_list = ["associates", "employees", "individuals", "people", "persons", "team members"]
emp_terms_list = emp_terms_list + [s.title() for s in emp_terms_list] + [s.upper() for s in emp_terms_list]
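
Since the `PhraseMatcher` is used here with its default exact-text matching, only the three casings generated above are covered. A small check in plain Python (no spaCy needed) makes the gap visible; the sentence-case probe `"Team members"` is just an illustrative example.

```python
emp_terms = ["associates", "employees", "individuals", "people", "persons", "team members"]
expanded = emp_terms + [s.title() for s in emp_terms] + [s.upper() for s in emp_terms]

print(len(expanded))               # 18 variants in total
print("Team Members" in expanded)  # True: title case is generated
print("Team members" in expanded)  # False: sentence case is not
```

If sentence-initial multi-word terms matter, one option would be to add `s.capitalize()` variants as well.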

In [6]:
nlp = spacy.load('en_core_web_lg')

In [7]:
emp_noun_recognizer = EmpNounRecognizer(nlp, emp_terms_list)

In [8]:
nlp.add_pipe(emp_noun_recognizer, last=True) 

In [9]:
print('Pipeline', nlp.pipe_names) 

Pipeline ['tagger', 'parser', 'ner', 'employee_nouns']


### YearMatcher: Add custom `_.is_year` attribute to year tokens that are part of Date entities

In [10]:
class YearMatcher(object):
    name = 'year_matcher'
    
    def __init__(self, nlp, pattern_list, match_id='Year'):
        # register a new token extension to flag year tokens
        Token.set_extension('is_year', default=False)
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add(match_id, None, pattern_list)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []  # collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        for span in spans:
            # Flag the tokens first, then merge, so the span's indices are
            # still valid when the attribute is set.
            for token in span:
                token._.is_year = True  # mark token as a year
            span.merge()
        return doc

year_patterns = [{'ENT_TYPE': 'DATE', 'TAG' : 'CD', 'SHAPE' : 'dddd'}]
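
The `SHAPE` attribute abstracts a token's characters: digits become `d`, lowercase letters `x`, uppercase letters `X`, and other characters are kept as they are. The sketch below is a simplified re-implementation for illustration only. The function name `approx_shape` is made up, and it assumes spaCy's behaviour of truncating runs of the same character class after four characters.

```python
def approx_shape(text):
    """Approximate a spaCy-style word shape for a token's text."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            cls = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        # Track how long the current run of the same class is.
        if cls == last:
            run += 1
        else:
            run = 0
            last = cls
        # Keep at most four characters per run.
        if run < 4:
            shape.append(cls)
    return "".join(shape)

print(approx_shape("2016"))   # dddd
print(approx_shape("31"))     # dd
print(approx_shape("8,250"))  # d,ddd
```

Under that truncation assumption, a longer digit string would also have shape `dddd`, so the `ENT_TYPE` and `TAG` constraints in the pattern do part of the work of narrowing matches to years.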

In [11]:
year_matcher = YearMatcher(nlp, year_patterns)

nlp.add_pipe(year_matcher, last=True) 

In [12]:
print('Pipeline', nlp.pipe_names) 

Pipeline ['tagger', 'parser', 'ner', 'employee_nouns', 'year_matcher']


## Code for relationship extraction and helper functions

### Function: extract_emp_relations

In [83]:
def extract_emp_relations(doc):
    relations = []
    for word in filter(lambda w: w.ent_type_ == 'EMP_NOUN', doc):
        if word.dep_ in ('attr', 'dobj'):
            num_tok = [w for w in word.children if w.dep_ == 'nummod']
            subject = [w for w in word.head.lefts if w.dep_ == 'nsubj']
            if subject:
                subject = subject[0]
                #[print(str(p) + '  :  ' + str(p.dep_)) for p in subject.subtree]
                sub_poss = [p for p in subject.subtree if p.dep_ == 'poss']
                if sub_poss:
                    subject = sub_poss[0]
                relations.append((subject, word))
            if num_tok:
                num_tok = num_tok[0]
                relations.append((num_tok, word))
        elif word.dep_ == 'pobj' and word.head.dep_ == 'prep':
            subject = word.head.head 
            num_toks = [c for c in word.children if c.dep_ == 'nummod' and c.ent_type_ == 'CARDINAL']
            if num_toks:
                relations.append((num_toks[0], word))
            elif word.head.head.head.pos_ == 'VERB':
                cards = [c for c in word.head.head.head.rights if c.tag_== 'CD' and c.ent_type_ == 'CARDINAL']
                dates = [e for e in word.doc.ents if e.label_ == 'DATE']
                if cards:
                    relations.append((cards[0], word))
            relations.append((subject, word))
        elif word.dep_ == 'conj':
            child_num_toks = [w for w in word.children if w.dep_ == 'nummod']
            head_num_tok = [w for w in [word.head] if w.tag_ == 'CD']
            if child_num_toks and head_num_tok:
                # Pair years and counts by list position, then keep the count
                # for the most recent year. Years are compared by numeric
                # value; comparing Token objects would order them by position.
                years = [y for y in word.doc if y._.is_year]
                emp_counts = child_num_toks + head_num_tok
                year_emps = [(int(y.text), c) for y, c in zip(years, emp_counts)]
                if year_emps:
                    relations.append((max(year_emps, key=lambda p: p[0])[1], word))
            elif child_num_toks:
                # Guard against an empty list instead of raising IndexError.
                relations.append((child_num_toks[0], word))
    return relations


### Testing handling of tuples (sorting to match years with cardinal numbers)

In [14]:
iyears =[(13, 2015), (11, 2016)]

In [15]:
iemps = [(15, '43,246'), (12, '53,146')]

In [16]:
sorted(iemps)

[(12, '53,146'), (15, '43,246')]

In [17]:
year_sorted = sorted(iyears, key = lambda tup: tup[1], reverse=True)

In [18]:
year_sorted

[(11, 2016), (13, 2015)]

In [19]:
order_indices = [iyears.index(x) for x in year_sorted]

In [20]:
order_indices

[1, 0]

In [21]:
year_emps = [(iyears[i][1], iemps[i][1]) for i in order_indices]

In [22]:
year_emps

[(2016, '53,146'), (2015, '43,246')]

In [23]:
max(year_emps)[1]

'53,146'
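
The steps above can be condensed into a single helper. This is a sketch rather than the code used in `extract_emp_relations`; the name `latest_year_count` is made up, and it assumes that years and counts appear in matching order in the sentence (as in "... 2016 and 2015 ... 56,400 and 66,400 ..., respectively").

```python
def latest_year_count(years, counts):
    """Pair (index, year) with (index, count) by order of appearance
    and return the count that belongs to the most recent year."""
    years_in_order = sorted(years)    # tuples sort by token index first
    counts_in_order = sorted(counts)
    pairs = [(year, count)
             for (_, year), (_, count) in zip(years_in_order, counts_in_order)]
    if not pairs:
        return None
    # Pick the pair with the largest year and return its count.
    return max(pairs, key=lambda pair: pair[0])[1]

iyears = [(13, 2015), (11, 2016)]
iemps = [(15, '43,246'), (12, '53,146')]
print(latest_year_count(iyears, iemps))  # 53,146
```

Sorting both lists by token index first means the result does not depend on the order in which the tuples were collected.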

### Function: print_doc_info 

In [24]:
def print_doc_df(doc):
    """Return a dataframe showing attributes for each token in doc."""
    doc_dict = {'tok_ent' : [tok.ent_type_ for tok in doc], 
        'toks' : [tok for tok in doc], 
        'dep' : [tok.dep_ for tok in doc], 
        'pos' : [tok.pos_ for tok in doc], 
        'tag' : [tok.tag_ for tok in doc], 
        'tag_def' : [spacy.explain(tok.tag_) for tok in doc]}
    columns = ['toks', 'pos', 'dep', 'tag', 'tag_def', 'tok_ent' ]
    return pd.DataFrame(doc_dict, columns=columns)

def print_doc_info(doc):
    print("doc is: ")
    print(doc)
    print('-' * 50)
    print("Entities are: ")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    print('-' * 50)
    print("Noun chunks are: ")

    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.label_, chunk.root.text)
    print('-' * 50)
    print("Cardinal entities are: ")
    for cardinal in filter(lambda w: w.ent_type_ == 'CARDINAL', doc):
        print(cardinal)
        print("Cardinal.dep_ : " + str(cardinal.dep_))
        print("Cardinal.head : " + str(cardinal.head))
        print("Cardinal.head.dep_ : " + str(cardinal.head.dep_))

## Define sentence structure types

### [Company] [employed | have] [number] [emp_noun]

Examples: 


`"At December 31, 2016, Bio-Rad had approximately 8,250 employees."`

In [25]:
ex1 = nlp('At December 31, 2016, Bio-Rad had approximately 8,250 employees.')

In [26]:
displacy.render(ex1, style='ent', jupyter=True, options={'distance': 90})

In [27]:
displacy.render(ex1, style='dep', jupyter=True, options={'distance': 90})

In [28]:
print_doc_df(ex1)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | At | ADP | prep | IN | conjunction, subordinating or preposition | |
| 1 | December | PROPN | pobj | NNP | noun, proper singular | DATE |
| 2 | 31 | NUM | nummod | CD | cardinal number | DATE |
| 3 | , | PUNCT | punct | , | punctuation mark, comma | DATE |
| 4 | 2016 | NUM | nummod | CD | cardinal number | DATE |
| 5 | , | PUNCT | punct | , | punctuation mark, comma | |
| 6 | Bio | PROPN | compound | NNP | noun, proper singular | ORG |
| 7 | - | PUNCT | punct | HYPH | punctuation mark, hyphen | ORG |
| 8 | Rad | PROPN | nsubj | NNP | noun, proper singular | ORG |
| 9 | had | VERB | ROOT | VBD | verb, past tense | |


In [29]:
print_doc_info(ex1)

doc is: 
At December 31, 2016, Bio-Rad had approximately 8,250 employees.
--------------------------------------------------
Entities are: 
December 31, 2016 DATE
Bio-Rad ORG
approximately 8,250 CARDINAL
employees EMP_NOUN
--------------------------------------------------
Noun chunks are: 
December NP December
Bio-Rad NP Rad
approximately 8,250 employees NP employees
--------------------------------------------------
Cardinal entities are: 
approximately
Cardinal.dep_ : advmod
Cardinal.head : 8,250
Cardinal.head.dep_ : nummod
8,250
Cardinal.dep_ : nummod
Cardinal.head : employees
Cardinal.head.dep_ : dobj


In [30]:
ex1_emp_tok = ex1[12]

In [31]:
ex1_emp_tok.dep_

'dobj'

In [32]:
[x for x in ex1_emp_tok.children]

[8,250]

In [33]:
#[w for w in ex1_emp_tok.head.lefts if w.dep_ == 'nsubj']
print('Token head: ' + str(ex1_emp_tok.head) + '       head.dep_:'+ str(ex1_emp_tok.head.dep_) )
for w in ex1_emp_tok.head.lefts:
    print(str(w) + '       head.left.dep_:' + str(w.dep_))
    if w.dep_ == 'nsubj':
        print(str(w) + ' is nsubj. Subtree is:' )
        print(list(w.subtree))
print('Token ancestors: ')
for w in ex1_emp_tok.ancestors:
    print(str(w) + '       ancestor.dep_:' + str(w.dep_))
print('Token children: ')
for w in ex1_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token head: had       head.dep_:ROOT
At       head.left.dep_:prep
,       head.left.dep_:punct
Rad       head.left.dep_:nsubj
Rad is nsubj. Subtree is:
[Bio, -, Rad]
Token ancestors: 
had       ancestor.dep_:ROOT
Token children: 
8,250       child.dep_:nummod


In [34]:
extract_emp_relations(ex1)

[(Rad, employees), (8,250, employees)]
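
To make the `attr`/`dobj` branch concrete without loading a model, the sketch below rebuilds the relevant part of the parse shown above by hand. The `Tok` class and the `dobj_relations` helper are stand-ins invented for this illustration (not spaCy classes), and the helper is simplified: it looks at all children of the head rather than only `lefts`, and it omits the possessive lookup.

```python
class Tok:
    """Minimal stand-in for a parsed token (not a spaCy class)."""
    def __init__(self, text, dep, head=None):
        self.text = text
        self.dep = dep
        self.head = head
        self.children = []
        if head is not None:
            head.children.append(self)

# Hand-built copy of the dependencies printed for ex1.
had = Tok("had", "ROOT")
rad = Tok("Rad", "nsubj", head=had)
employees = Tok("employees", "dobj", head=had)
count = Tok("8,250", "nummod", head=employees)

def dobj_relations(emp_tok):
    """Mirror the attr/dobj branch: subject of the head verb, then the count."""
    relations = []
    if emp_tok.dep in ("attr", "dobj"):
        subjects = [t for t in emp_tok.head.children if t.dep == "nsubj"]
        counts = [t for t in emp_tok.children if t.dep == "nummod"]
        if subjects:
            relations.append((subjects[0].text, emp_tok.text))
        if counts:
            relations.append((counts[0].text, emp_tok.text))
    return relations

print(dobj_relations(employees))
# [('Rad', 'employees'), ('8,250', 'employees')]
```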

### [number_noun] [was] [number]

Examples:

`"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015."`

`"Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries."`

In [35]:
ex2 = nlp("The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.")

In [36]:
print_doc_df(ex2)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | The | DET | det | DT | determiner | |
| 1 | number | NOUN | nsubj | NN | noun, singular or mass | |
| 2 | of | ADP | prep | IN | conjunction, subordinating or preposition | |
| 3 | full | ADJ | amod | JJ | adjective | |
| 4 | - | PUNCT | punct | HYPH | punctuation mark, hyphen | |
| 5 | time | NOUN | compound | NN | noun, singular or mass | |
| 6 | employees | NOUN | pobj | NNS | noun, plural | EMP_NOUN |
| 7 | of | ADP | prep | IN | conjunction, subordinating or preposition | |
| 8 | the | DET | det | DT | determiner | |
| 9 | Company | PROPN | pobj | NNP | noun, proper singular | ORG |


In [37]:
print_doc_info(ex2)

doc is: 
The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.
--------------------------------------------------
Entities are: 
employees EMP_NOUN
Company ORG
approximately 31,800 CARDINAL
December 31, 2016 DATE
32,300 CARDINAL
December 31, 2015 DATE
--------------------------------------------------
Noun chunks are: 
The number NP number
full-time employees NP employees
the Company NP Company
December NP December
December NP December
--------------------------------------------------
Cardinal entities are: 
approximately
Cardinal.dep_ : advmod
Cardinal.head : 31,800
Cardinal.head.dep_ : attr
31,800
Cardinal.dep_ : attr
Cardinal.head : was
Cardinal.head.dep_ : ROOT
32,300
Cardinal.dep_ : conj
Cardinal.head : was
Cardinal.head.dep_ : ROOT


In [38]:
ex2_emp_tok = ex2[6]

In [39]:
ex2_emp_tok.dep_

'pobj'

In [64]:
print(ex2_emp_tok.head.dep_)
print(ex2_emp_tok.head.head)
print(ex2_emp_tok.head.head.dep_)
print(ex2_emp_tok.head.head.head)
print(ex2_emp_tok.head.head.head.dep_)
print(ex2_emp_tok.head.head.head.pos_)
print(ex2_emp_tok.head.head.head.tag_)
print([c for c in ex2_emp_tok.head.head.head.children])
print([c.dep_ for c in ex2_emp_tok.head.head.head.children])
print([c.tag_ for c in ex2_emp_tok.head.head.head.children])
print([r for r in ex2_emp_tok.head.head.head.rights if r.tag_== 'CD'])
print([c.ent_iob_ for c in ex2_emp_tok.head.head.head.children])
print([e for e in ex2.ents])
print([e.label_ for e in ex2.ents])
print([e for e in ex2.ents if e.label_ == 'DATE'])

prep
number
nsubj
was
ROOT
VERB
VBD
[number, 31,800, at, and, 32,300, .]
['nsubj', 'attr', 'prep', 'cc', 'conj', 'punct']
['NN', 'CD', 'IN', 'CC', 'CD', '.']
[31,800, 32,300]
['', 'I', '', '', 'B', '']
[employees, Company, approximately 31,800, December 31, 2016, 32,300, December 31, 2015]
['EMP_NOUN', 'ORG', 'CARDINAL', 'DATE', 'CARDINAL', 'DATE']
[December 31, 2016, December 31, 2015]


In [40]:
displacy.render(ex2, style='dep', jupyter=True, options={'distance': 110})

In [41]:
extract_emp_relations(ex2)

[(31,800, employees), (number, employees)]

In [42]:
ex3 = nlp("Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries.")

In [43]:
print_doc_df(ex3)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | Alcoa | PROPN | poss | NNP | noun, proper singular | ORG |
| 1 | 's | PART | case | POS | possessive ending | |
| 2 | total | ADJ | amod | JJ | adjective | |
| 3 | worldwide | ADJ | amod | JJ | adjective | |
| 4 | employment | NOUN | nsubj | NN | noun, singular or mass | |
| 5 | at | ADP | prep | IN | conjunction, subordinating or preposition | |
| 6 | the | DET | det | DT | determiner | DATE |
| 7 | end | NOUN | pobj | NN | noun, singular or mass | DATE |
| 8 | of | ADP | prep | IN | conjunction, subordinating or preposition | DATE |
| 9 | 2016 | NUM | pobj | CD | cardinal number | DATE |


In [44]:
print_doc_info(ex3)

doc is: 
Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries.
--------------------------------------------------
Entities are: 
Alcoa ORG
the end of 2016 DATE
approximately 14,000 CARDINAL
employees EMP_NOUN
15 CARDINAL
--------------------------------------------------
Noun chunks are: 
Alcoa's total worldwide employment NP employment
the end NP end
approximately 14,000 employees NP employees
15 countries NP countries
--------------------------------------------------
Cardinal entities are: 
approximately
Cardinal.dep_ : advmod
Cardinal.head : 14,000
Cardinal.head.dep_ : nummod
14,000
Cardinal.dep_ : nummod
Cardinal.head : employees
Cardinal.head.dep_ : attr
15
Cardinal.dep_ : nummod
Cardinal.head : countries
Cardinal.head.dep_ : pobj


In [45]:
ex3_emp_tok = ex3[13]

In [46]:
#[w for w in ex3_emp_tok.head.lefts if w.dep_ == 'nsubj']
print('Token dep_: ' + str(ex3_emp_tok.dep_))
print('Token head: ' + str(ex3_emp_tok.head) + '       head.dep_:'+ str(ex3_emp_tok.head.dep_) )
for w in ex3_emp_tok.head.lefts:
    print(str(w) + '       head.left.dep_:' + str(w.dep_))
    if w.dep_ == 'nsubj':
        print(str(w) + ' is nsubj. Subtree is:' )
        [print(x) for x in w.lefts]
        [print(s)for s in w.subtree if s.dep_ == 'poss']
print('Token ancestors: ')
for w in ex3_emp_tok.ancestors:
    print(str(w) + '       ancestor.dep_:' + str(w.dep_))
print('Token children: ')
for w in ex3_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token dep_: attr
Token head: was       head.dep_:ROOT
employment       head.left.dep_:nsubj
employment is nsubj. Subtree is:
Alcoa
total
worldwide
Alcoa
Token ancestors: 
was       ancestor.dep_:ROOT
Token children: 
14,000       child.dep_:nummod
in       child.dep_:prep


In [47]:
displacy.render(ex3, style='dep', jupyter=True, options={'distance': 90})

In [48]:
extract_emp_relations(ex3)

[(Alcoa, employees), (14,000, employees)]

### [we] [employed | have] [number] [emp_noun]

`"As of January 31, 2017, we employed 7,683 individuals."`

In [49]:
ex4 = nlp("As of January 31, 2017, we employed 7,683 individuals.")

In [50]:
print_doc_df(ex4)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | As | ADP | prep | IN | conjunction, subordinating or preposition | |
| 1 | of | ADP | prep | IN | conjunction, subordinating or preposition | |
| 2 | January | PROPN | pobj | NNP | noun, proper singular | DATE |
| 3 | 31 | NUM | nummod | CD | cardinal number | DATE |
| 4 | , | PUNCT | punct | , | punctuation mark, comma | DATE |
| 5 | 2017 | NUM | nummod | CD | cardinal number | DATE |
| 6 | , | PUNCT | punct | , | punctuation mark, comma | |
| 7 | we | PRON | nsubj | PRP | pronoun, personal | |
| 8 | employed | VERB | ROOT | VBD | verb, past tense | |
| 9 | 7,683 | NUM | nummod | CD | cardinal number | CARDINAL |


In [51]:
print_doc_info(ex4)

doc is: 
As of January 31, 2017, we employed 7,683 individuals.
--------------------------------------------------
Entities are: 
January 31, 2017 DATE
7,683 CARDINAL
individuals EMP_NOUN
--------------------------------------------------
Noun chunks are: 
January NP January
we NP we
7,683 individuals NP individuals
--------------------------------------------------
Cardinal entities are: 
7,683
Cardinal.dep_ : nummod
Cardinal.head : individuals
Cardinal.head.dep_ : dobj


In [52]:
ex4_emp_tok = ex4[10]

In [53]:
#[w for w in ex4_emp_tok.head.lefts if w.dep_ == 'nsubj']
print('Token dep_: ' + str(ex4_emp_tok.dep_))
print('Token head: ' + str(ex4_emp_tok.head) + '       head.dep_:'+ str(ex4_emp_tok.head.dep_) )
for w in ex4_emp_tok.head.lefts:
    print(str(w) + '       head.left.dep_:' + str(w.dep_))
    if w.dep_ == 'nsubj':
        print(str(w) + ' is nsubj. Subtree is:' )
        [print(x) for x in w.lefts]
        [print(s)for s in w.subtree if s.dep_ == 'poss']
print('Token ancestors: ')
for w in ex4_emp_tok.ancestors:
    print(str(w) + '       ancestor.dep_:' + str(w.dep_))
print('Token children: ')
for w in ex4_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token dep_: dobj
Token head: employed       head.dep_:ROOT
As       head.left.dep_:prep
,       head.left.dep_:punct
we       head.left.dep_:nsubj
we is nsubj. Subtree is:
Token ancestors: 
employed       ancestor.dep_:ROOT
Token children: 
7,683       child.dep_:nummod


In [54]:
extract_emp_relations(ex4)

[(we, individuals), (7,683, individuals)]

Example 5

### [noun phrases] [employed | have] [prep phrase]  [number] [emp_noun]

In [55]:
ex5 = nlp("As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.")

In [56]:
print_doc_df(ex5)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | As | ADP | prep | IN | conjunction, subordinating or preposition | |
| 1 | of | ADP | prep | IN | conjunction, subordinating or preposition | |
| 2 | December | PROPN | pobj | NNP | noun, proper singular | DATE |
| 3 | 31 | NUM | nummod | CD | cardinal number | DATE |
| 4 | , | PUNCT | punct | , | punctuation mark, comma | DATE |
| 5 | 2016 | NUM | nummod | CD | cardinal number | DATE |
| 6 | , | PUNCT | punct | , | punctuation mark, comma | |
| 7 | the | DET | det | DT | determiner | |
| 8 | subsidiaries | NOUN | nsubj | NNS | noun, plural | |
| 9 | of | ADP | prep | IN | conjunction, subordinating or preposition | |


In [57]:
print_doc_info(ex5)

doc is: 
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
--------------------------------------------------
Entities are: 
December 31, 2016 DATE
AEP ORG
17,634 CARDINAL
employees EMP_NOUN
--------------------------------------------------
Noun chunks are: 
December NP December
the subsidiaries NP subsidiaries
AEP NP AEP
a total NP total
17,634 employees NP employees
--------------------------------------------------
Cardinal entities are: 
17,634
Cardinal.dep_ : nummod
Cardinal.head : employees
Cardinal.head.dep_ : pobj


In [58]:
ex5_emp_tok = ex5[16]

ex5_emp_num_ent = ex5.ents[2]

In [59]:
ex5_emp_num_ent

17,634

In [60]:
ex5_emp_num_ent.root.ent_type_

'CARDINAL'

In [61]:
ex5[3].ent_type_

'DATE'

In [62]:
#[w for w in ex5_emp_tok.head.lefts if w.dep_ == 'nsubj']
print('Token dep_: ' + str(ex5_emp_tok.dep_))
print('Token head: ' + str(ex5_emp_tok.head) + '       head.dep_:'+ str(ex5_emp_tok.head.dep_) )
for w in ex5_emp_tok.head.lefts:
    print(str(w) + '       head.left.dep_:' + str(w.dep_))
    if w.dep_ == 'nsubj':
        print(str(w) + ' is nsubj. Subtree is:' )
        [print(x) for x in w.lefts]
        [print(s)for s in w.subtree if s.dep_ == 'poss']
print('Token ancestors: ')
for w in ex5_emp_tok.ancestors:
    print(str(w) + '       ancestor.dep_:' + str(w.dep_))
print('Token children: ')
for w in ex5_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token dep_: pobj
Token head: of       head.dep_:prep
Token ancestors: 
of       ancestor.dep_:prep
total       ancestor.dep_:dobj
had       ancestor.dep_:ROOT
Token children: 
17,634       child.dep_:nummod


In [63]:
displacy.render(ex5, style='dep', jupyter=True, options={'distance': 110})

In [64]:
print(ex5_emp_tok.head)
print(ex5_emp_tok.head.dep_)
print(ex5_emp_tok.head.head)
print(ex5_emp_tok.head.head.dep_)
print(ex5_emp_tok.head.head.head)
print(ex5_emp_tok.head.head.head.dep_)
print(ex5_emp_tok.head.head.head.pos_)
print(ex5_emp_tok.head.head.head.tag_)
print([c for c in ex5_emp_tok.head.head.head.children])
print([c.dep_ for c in ex5_emp_tok.head.head.head.children])
print([c.tag_ for c in ex5_emp_tok.head.head.head.children])
print([r for r in ex5_emp_tok.head.head.head.subtree if r.tag_== 'CD'])
print([c.ent_iob_ for c in ex5_emp_tok.head.head.head.children])
print([e for e in ex5.ents])
print([e.label_ for e in ex5.ents])
print([e for e in ex5.ents if e.label_ == 'DATE'])

of
prep
total
dobj
had
ROOT
VERB
VBD
[As, ,, subsidiaries, total, .]
['prep', 'punct', 'nsubj', 'dobj', 'punct']
['IN', ',', 'NNS', 'NN', '.']
[31, 2016, 17,634]
['', '', '', '', '']
[December 31, 2016, AEP, 17,634, employees]
['DATE', 'ORG', 'CARDINAL', 'EMP_NOUN']
[December 31, 2016]


In [65]:
extract_emp_relations(ex5)

[(17,634, employees), (total, employees)]
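
The `pobj` branch can be traced the same way on a hand-built copy of the relevant dependencies for this sentence. The `Tok` class and `pobj_relations` helper are invented for this sketch (not spaCy classes), and the helper leaves out the fallback that searches the verb's right-hand children for a cardinal.

```python
class Tok:
    """Minimal stand-in for a parsed token (not a spaCy class)."""
    def __init__(self, text, dep, head=None, ent_type=""):
        self.text = text
        self.dep = dep
        self.head = head
        self.ent_type = ent_type
        self.children = []
        if head is not None:
            head.children.append(self)

# had -> total (dobj) -> of (prep) -> employees (pobj) -> 17,634 (nummod)
had = Tok("had", "ROOT")
total = Tok("total", "dobj", head=had)
of = Tok("of", "prep", head=total)
employees = Tok("employees", "pobj", head=of, ent_type="EMP_NOUN")
count = Tok("17,634", "nummod", head=employees, ent_type="CARDINAL")

def pobj_relations(emp_tok):
    """Mirror the pobj branch: the count first, then the governing noun."""
    relations = []
    if emp_tok.dep == "pobj" and emp_tok.head.dep == "prep":
        governor = emp_tok.head.head  # the word the preposition attaches to
        counts = [c for c in emp_tok.children
                  if c.dep == "nummod" and c.ent_type == "CARDINAL"]
        if counts:
            relations.append((counts[0].text, emp_tok.text))
        relations.append((governor.text, emp_tok.text))
    return relations

print(pobj_relations(employees))
# [('17,634', 'employees'), ('total', 'employees')]
```

The trace also shows why the second tuple pairs `total` with `employees` rather than the company: the branch takes the preposition's governor as the subject, not the `nsubj` of the verb.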

### [we] [employed | have] [number] [emp_noun]

In [66]:
ex6 = nlp("At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.")

In [67]:
print_doc_df(ex6)

| | toks | pos | dep | tag | tag_def | tok_ent |
|---|---|---|---|---|---|---|
| 0 | At | ADP | prep | IN | conjunction, subordinating or preposition | |
| 1 | December | PROPN | pobj | NNP | noun, proper singular | DATE |
| 2 | 31 | NUM | nummod | CD | cardinal number | DATE |
| 3 | , | PUNCT | punct | , | punctuation mark, comma | DATE |
| 4 | 2016 | NUM | nummod | CD | cardinal number | DATE |
| 5 | and | CCONJ | cc | CC | conjunction, coordinating | DATE |
| 6 | 2015 | NUM | conj | CD | cardinal number | DATE |
| 7 | , | PUNCT | punct | , | punctuation mark, comma | |
| 8 | we | PRON | nsubj | PRP | pronoun, personal | |
| 9 | had | VERB | ROOT | VBD | verb, past tense | |


In [68]:
print_doc_info(ex6)

doc is: 
At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.
--------------------------------------------------
Entities are: 
December 31, 2016 and 2015 DATE
approximately 56,400 CARDINAL
66,400 CARDINAL
employees EMP_NOUN
--------------------------------------------------
Noun chunks are: 
December NP December
we NP we
approximately 56,400 and 66,400 employees NP employees
--------------------------------------------------
Cardinal entities are: 
approximately
Cardinal.dep_ : advmod
Cardinal.head : 56,400
Cardinal.head.dep_ : nummod
56,400
Cardinal.dep_ : nummod
Cardinal.head : employees
Cardinal.head.dep_ : dobj
66,400
Cardinal.dep_ : conj
Cardinal.head : 56,400
Cardinal.head.dep_ : nummod


In [69]:
ex6_emp_tok = ex6[14]

ex6_emp_num_tok = ex6[11]

In [70]:
ex6_emp_tok

employees

In [71]:
ex6_emp_tok.dep_

'dobj'

In [72]:
ex6_emp_num_tok

56,400

In [73]:
[c for c in ex6_emp_num_tok.conjuncts]

[66,400]

In [74]:
print('Token children: ')
for w in ex6_emp_num_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token children: 
approximately       child.dep_:advmod
and       child.dep_:cc
66,400       child.dep_:conj


In [75]:
#[w for w in ex6_emp_tok.head.lefts if w.dep_ == 'nsubj']
print('Token:  ' + str(ex6_emp_tok) )
for w in ex6_emp_tok.children:
    if w.dep_ == 'nummod':
        print("Token child: " + str(w) + "  has dep_ == 'nummod'")
        print("       Token.child.i : " + str(w.i))
        print('       child.dep_: ' + str(w.dep_))
        print("       child.tag_: '"+ str(w.tag_) + "'" )
        print("       child.pos_: '"+ str(w.pos_) + "'" )
if ex6_emp_tok.dep_ == 'conj':
    print('Token dep_ is conj.')
    if ex6_emp_tok.head.tag_ == 'CD':
        print("       token.head.tag_ == 'CD' ")
        print("       token.head.i : " + str(ex6_emp_tok.head.i))
    print('Token head is: ' + str(ex6_emp_tok.head) + '       head.dep_:  '+ str(ex6_emp_tok.head.dep_) )
    for i, d in enumerate(ex6_emp_tok.doc.ents):
        if d.label_ == 'DATE':
            print(str(i) + "  " + str(d))
            for tok in ex6_emp_tok.doc[d.start:d.end]:  # Span.end is exclusive
                if tok.tag_ == 'CD':
                    print("CD tok in date: " + str(tok))
#    if ex6_emp_tok.head.dep_ == ''

print('Token dep_: ' + str(ex6_emp_tok.dep_))
print('Token head: ' + str(ex6_emp_tok.head))
print('       head.dep_: ' + str(ex6_emp_tok.head.dep_))
print("       head.tag_: '"+ str(ex6_emp_tok.head.tag_) + "'" )
print("       head.pos_: '"+ str(ex6_emp_tok.head.pos_) + "'" )
print('Token head lefts: ' )
print('Token conjuncts: ' )
print(str([c for c in ex6_emp_tok.conjuncts]))
for w in ex6_emp_tok.head.rights:
    print(str(w) + '       head.left.dep_:' + str(w.dep_))
    if w.dep_ == 'nsubj':
        print(str(w) + ' is nsubj. Subtree is:' )
        [print(x) for x in w.lefts]
        [print(s)for s in w.subtree if s.dep_ == 'poss']
print('Token ancestors: ')
for w in ex6_emp_tok.ancestors:
    print(str(w) + '       ancestor.dep_:' + str(w.dep_))
print('Token children: ')
for w in ex6_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token:  employees
Token child: 56,400  has dep_ == 'nummod'
       Token.child.i : 11
       child.dep_: nummod
       child.tag_: 'CD'
       child.pos_: 'NUM'
Token dep_: dobj
Token head: had
       head.dep_: ROOT
       head.tag_: 'VBD'
       head.pos_: 'VERB'
Token head lefts: 
Token conjuncts: 
[]
employees       head.left.dep_:dobj
,       head.left.dep_:punct
respectively       head.left.dep_:advmod
.       head.left.dep_:punct
Token ancestors: 
had       ancestor.dep_:ROOT
Token children: 
56,400       child.dep_:nummod


In [76]:
displacy.render(ex6, style='dep', jupyter=True, options={'distance': 105})

In [77]:
print(ex6_emp_tok.head)
print(ex6_emp_tok.head.dep_)
print(ex6_emp_tok.head.head)
print(ex6_emp_tok.head.head.dep_)
print(ex6_emp_tok.head.head.head)
print(ex6_emp_tok.head.head.head.dep_)
print(ex6_emp_tok.head.head.head.pos_)
print(ex6_emp_tok.head.head.head.tag_)
print([c for c in ex6_emp_tok.head.head.head.children])
print([c.dep_ for c in ex6_emp_tok.head.head.head.children])
print([c.tag_ for c in ex6_emp_tok.head.head.head.children])
print([r for r in ex6_emp_tok.head.head.head.subtree if r.tag_== 'CD'])
print([c.ent_iob_ for c in ex6_emp_tok.head.head.head.children])
print([e for e in ex6.ents])
print([e.label_ for e in ex6.ents])
print([e for e in ex6.ents if e.label_ == 'DATE'])

had
ROOT
had
ROOT
had
ROOT
VERB
VBD
[At, ,, we, employees, ,, respectively, .]
['prep', 'punct', 'nsubj', 'dobj', 'punct', 'advmod', 'punct']
['IN', ',', 'PRP', 'NNS', ',', 'RB', '.']
[31, 2016, 2015, 56,400, 66,400]
['', '', '', 'B', '', '', '']
[December 31, 2016 and 2015, approximately 56,400, 66,400, employees]
['DATE', 'CARDINAL', 'CARDINAL', 'EMP_NOUN']
[December 31, 2016 and 2015]


In [78]:
extract_emp_relations(ex6)

[(we, employees), (56,400, employees)]
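
The result above keeps only 56,400, although the sentence lists two counts and two years joined by "respectively". One possible way to line them up is to pair years and counts in the order they appear. The sketch below works on plain strings that are assumed to have been extracted already; `pair_respectively` is a hypothetical helper, not part of `extract_emp_relations`.

```python
def pair_respectively(years, counts):
    """Pair each year with the count in the same position.

    Hypothetical helper: assumes both lists are in sentence order and that
    the sentence uses a "respectively" construction.
    """
    if len(years) != len(counts):
        raise ValueError("expected one count per year")
    return list(zip(years, counts))


years = ["2016", "2015"]        # year tokens from the DATE entity
counts = ["56,400", "66,400"]   # CD tokens belonging to the employee noun
pairs = pair_respectively(years, counts)
print(pairs)
```

Raising on a length mismatch is a deliberate choice here: a silent `zip` would drop the unmatched year or count.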

In [79]:
test_sents = ["As of September 30, 2016, we employed approximately 7,300 employees world-wide.", 
"As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.", 
"At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.", 
"At December 31, 2016, we had approximately 9,400 full-time employees.", 
"As of October 29, 2016, we employed approximately 10,000 individuals worldwide.", 
"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.", 
"As of December 31, 2016, we had 1,469 total employees.", 
"ADP employed approximately 57,000 persons as of June 30, 2016.", 
"At December 31, 2016, we employed approximately 26,400 employees.", 
"The Company and its subsidiaries employed 1,562 persons at December 31, 2016, 114 of whom are covered by a collective bargaining agreement with District 10 of the International Association of Machinists.", 
"As of December 31, 2016, the Company had 455 employees, an increase of 17 employees from the prior year end.", 
"As of December 31, 2016, we had approximately 17,500 employees worldwide.", 
"Based in Neenah, Wisconsin, at December 31, 2016, the Company employed approximately 17,500 individuals and had 59 manufacturing facilities.", 
"As of January 31, 2017, we employed 7,683 individuals.", 
"At December 31, 2016, Bio-Rad had approximately 8,250 employees.", 
"As of December 31, 2016, we had approximately 8,500 full-time employees and 600 contractors.", 
"Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries.", 
"As of December 31, 2016, we employed approximately 2,100 people.", 
"At December 31, 2016, the Company had approximately 11,500 employees.",
"As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.", 
"As of December 31, 2016, we had 699 full-time employees and 202 temporary employees.", 
"As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots."]

In [84]:
for i,t in enumerate(test_sents):
    print('Sentence index: ' + str(i))
    print(extract_emp_relations(nlp(t)))

Sentence index: 0
[(we, employees), (7,300, employees)]
Sentence index: 1
[(17,634, employees), (total, employees)]
Sentence index: 2
[(we, employees), (56,400, employees)]
Sentence index: 3
[(we, employees), (9,400, employees)]
Sentence index: 4
[(we, individuals), (10,000, individuals)]
Sentence index: 5
[(31,800, employees), (number, employees)]
Sentence index: 6
[(we, employees), (1,469, employees)]
Sentence index: 7
[(ADP, persons), (57,000, persons)]
Sentence index: 8
[(we, employees), (26,400, employees)]
Sentence index: 9
[(its, persons), (1,562, persons)]
Sentence index: 10
[(Company, employees), (455, employees), (17, employees), (increase, employees)]
Sentence index: 11
[(we, employees), (17,500, employees)]
Sentence index: 12
[(Company, individuals), (17,500, individuals)]
Sentence index: 13
[(we, individuals), (7,683, individuals)]
Sentence index: 14
[(Rad, employees), (8,250, employees)]
Sentence index: 15
[(we, employees), (8,500, employees)]
Sentence index: 16
[(Alcoa, 

## Testing with paragraphs

In [12]:
train_para_list = train_df.para_text.tolist()

In [13]:
for sent in nlp(train_para_list[0]).sents:
    print(sent)

As of September 30, 2016, we employed approximately 7,300 employees world-wide.
Approximately 860 of our employees in Mexico, 450 employees in Singapore, and 200 employees in Japan are covered by collective bargaining and other union agreements.


In [14]:
print([(token.text, token.tag_) for token in nlp(list(nlp(train_para_list[0]).sents)[0].text)])

[('As', 'IN'), ('of', 'IN'), ('September', 'NNP'), ('30', 'CD'), (',', ','), ('2016', 'CD'), (',', ','), ('we', 'PRP'), ('employed', 'VBD'), ('approximately', 'RB'), ('7,300', 'CD'), ('employees', 'NNS'), ('world', 'NN'), ('-', 'HYPH'), ('wide', 'RB'), ('.', '.')]


In [16]:
for ent in nlp(list(nlp(train_para_list[0]).sents)[0].text).ents:
    print(ent.text, ent.label_)

September 30, 2016 DATE
approximately 7,300 CARDINAL
employees EMP_NOUN
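
The CARDINAL entity text includes the hedge word ("approximately 7,300"), so it cannot be passed to `int()` directly. A minimal sketch of one way to get a number out of it, using only the standard library (`cardinal_to_int` is a hypothetical name, and only digit-and-comma forms are handled, not spelled-out numbers):

```python
import re


def cardinal_to_int(text):
    """Return the first integer found in text, ignoring thousands separators.

    Hypothetical helper; returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


value = cardinal_to_int("approximately 7,300")
print(value)
```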


In [47]:
test_list = ["We are a small company with approximately 61 employees.",
"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015. ",
"Total workforce level at December 31, 2016 was approximately 150,500",           
"Currently, the Company and its subsidiaries have an aggregate of 35 employees.",
"we employ only 31 employees", 
"We currently have 21 employees",
"We currently employ 26 full-time employees",
"Including our full and part-time personnel, we estimate that we have \
the equivalent of 12 full time employees.",
"As a REIT, we employ only 31 employees and have a cost-effective \
management structure."
            ]

In [57]:
doc1 = nlp(test_list[1])

In [50]:
doc7 = nlp(test_list[7])

In [60]:
displacy.render(doc7, style='dep', jupyter=True, options={'distance': 90})
displacy.render(doc7, style='ent', jupyter=True)  # 'distance' only applies to style='dep'

In [19]:
displacy.render(doc1, style='dep', jupyter=True)
displacy.render(doc1, style='ent', jupyter=True)  # 'distance' only applies to style='dep'

In [70]:
spacy.explain(doc1[0].tag_)

'determiner'

In [127]:
print_doc_info(doc1)

doc is: 
The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015. 
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: The | DT  : determiner
1: number | NN  : noun, singular or mass
2: of | IN  : conjunction, subordinating or preposition
3: full | JJ  : adjective
4: - | HYPH  : punctuation mark, hyphen
5: time | NN  : noun, singular or mass
6: employees | NNS  : noun, plural
7: of | IN  : conjunction, subordinating or preposition
8: the | DT  : determiner
9: Company | NNP  : noun, proper singular
10: was | VBD  : verb, past tense
11: approximately | RB  : adverb
12: 31,800 | CD  : cardinal number
13: at | IN  : conjunction, subordinating or preposition
14: December | NNP  : noun, proper singular
15: 31 | CD  : cardinal number
16: , | ,  : punctuation mark, comma
17: 2016 | CD  : cardinal number
18: and | CC  : conjunction, coordinating
19: 32,300 | C

In [98]:
emp_tok = doc1[6]
emp_num_tok = doc1[12]

In [121]:
print("Token dep_: " + emp_num_tok.dep_)
print("Token head: " + str(emp_num_tok.head))
print("Token head dep_: " + str(emp_num_tok.head.dep_))
for w in emp_num_tok.head.lefts:
    print(str(w) + ' : ' + str(w.dep_))

print("Token head head: " + str(emp_num_tok.head.head))
print("Token head head dep_: " + str(emp_num_tok.head.head.dep_))

Token dep_: attr
Token head: was
Token head dep_: ROOT
number : nsubj
Token head head: was
Token head head dep_: ROOT


In [122]:
print("Token dep_: " + emp_tok.dep_)
print("Token head: " + str(emp_tok.head))
print("Token head dep_: " + str(emp_tok.head.dep_))
print("Token head head: " + str(emp_tok.head.head))
print("Token head head dep_: " + str(emp_tok.head.head.dep_))

Token dep_: pobj
Token head: of
Token head dep_: prep
Token head head: number
Token head head dep_: nsubj


In [18]:
# Note: doc1 is rebound here to test_sents[1]; the cells above used doc1 = nlp(test_list[1]).
doc0 = nlp(test_sents[0])
doc1 = nlp(test_sents[1])
doc2 = nlp(test_sents[2])
doc3 = nlp(test_sents[3])
doc4 = nlp(test_sents[4])

In [188]:
displacy.render(doc1, style='ent', jupyter=True) 

In [187]:
print_doc_info(doc1)

doc is: 
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: As | IN  : conjunction, subordinating or preposition
1: of | IN  : conjunction, subordinating or preposition
2: December | NNP  : noun, proper singular
3: 31 | CD  : cardinal number
4: , | ,  : punctuation mark, comma
5: 2016 | CD  : cardinal number
6: , | ,  : punctuation mark, comma
7: the | DT  : determiner
8: subsidiaries | NNS  : noun, plural
9: of | IN  : conjunction, subordinating or preposition
10: AEP | NNP  : noun, proper singular
11: had | VBD  : verb, past tense
12: a | DT  : determiner
13: total | NN  : noun, singular or mass
14: of | IN  : conjunction, subordinating or preposition
15: 17,634 | CD  : cardinal number
16: employees | NNS  : noun, plural
17: . | .  : punctuation mark, sentence closer
--------------------------------------------------
Entities are: 
Decem

## Matcher testing

In [183]:
emp_ner_regex = re.compile(r"employees|individuals|people|persons|team members", re.I)

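# The flag function sees one token's text at a time, so the two-word
# alternative "team members" can never match here. fullmatch avoids
# prefix hits such as "peoples".
emp_flag = lambda text: bool(emp_ner_regex.fullmatch(text))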

IS_EMP = nlp.vocab.add_flag(emp_flag)

matcher = Matcher(nlp.vocab)

num_emp_pat = [{'ENT_TYPE' : 'CARDINAL', 'TAG' : 'CD'}, {IS_EMP : True}]

matcher.add('NUM_EMP', None, num_emp_pat)
matcher.add('EMP_NOUN', None, [{IS_EMP: True}])
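
Because a flag function is called with a single token's text, two details of the pattern matter: whether the match is anchored at both ends, and whether an alternative spans more than one token. The sketch below illustrates both with the standard `re` module only; the token list is made up for the illustration.

```python
import re

emp_regex = re.compile(r"employees|individuals|people|persons|team members", re.I)

# Token texts as a per-token predicate would see them.
tokens = ["Employees", "peoples", "Team", "Members"]

# re.match anchors only at the start, so "peoples" counts as a hit.
prefix_hits = [t for t in tokens if emp_regex.match(t)]
# re.fullmatch requires the whole token to match one alternative.
full_hits = [t for t in tokens if emp_regex.fullmatch(t)]

print(prefix_hits)
print(full_hits)
```

Neither "Team" nor "Members" is accepted in either list, so a multi-word noun such as "Team Members" would need a two-token Matcher pattern rather than a single-token flag.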

In [184]:
matches1 = matcher(doc1)

In [186]:
for match_id, start, end in matches1:
    # create a new Span for each match and use the match_id as the label
    span = Span(doc1, start, end, label=match_id)
    doc1.ents = list(doc1.ents) + [span]  # add span to doc.ents

In [164]:
def add_match_ents(doc, matches):
    """Add matches from Matcher instance to a document's entities."""
    for match_id, start, end in matches:
        # create a new Span for each match and use the match_id as the label
        span = Span(doc, start, end, label=match_id)
        doc.ents = list(doc.ents) + [span]  # add span to doc.ents
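
spaCy expects entity spans not to overlap, so appending a match that covers tokens already inside an entity (for example a CARDINAL) can make the `doc.ents` assignment fail. One way around this is to resolve overlaps before assigning. The sketch below shows the idea on plain `(start, end)` token offsets; `drop_overlaps` is a hypothetical helper, and the "longest span wins" rule is an assumption about what is wanted here.

```python
def drop_overlaps(spans):
    """Keep the longest spans first and skip any span that overlaps a kept one.

    `spans` is a list of (start, end) token offsets with an exclusive end.
    """
    kept = []
    taken = set()
    # Longest first; ties go to the span that starts earlier.
    for start, end in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if taken.isdisjoint(range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


# Token offsets from the AEP sentence: "17,634" is (15, 16), "employees" is
# (16, 17), and the two-token match "17,634 employees" is (15, 17).
result = drop_overlaps([(15, 16), (16, 17), (15, 17)])
print(result)
```

On real `Span` objects, `spacy.util.filter_spans` applies a similar longest-first rule in spaCy versions that include it.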

In [166]:
matcher = Matcher(nlp.vocab)
matcher.add('NumEmp', None, num_emp_pat)

In [128]:
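doc = nlp(test_sents[0])  # parse once and reuse the same Doc for text and spans
for match in re.finditer(emp_ner_regex, doc.text):
    start, end = match.span()         # get matched character indices
    span = doc.char_span(start, end)  # create Span from indices
    if span is not None:              # None if the match is not on token boundaries
        print(span.text)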

employees


In [189]:
for match_id, start, end in matches1:
    span = doc1[start:end]
    print(span)
    print(span.sent)

17,634 employees
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
