<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-entity-matcher-to-use-in-NLP-pipeline" data-toc-modified-id="Create-entity-matcher-to-use-in-NLP-pipeline-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create entity matcher to use in NLP pipeline</a></span></li><li><span><a href="#Define-sentence-structure-types" data-toc-modified-id="Define-sentence-structure-types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Define sentence structure types</a></span><ul class="toc-item"><li><span><a href="#[Company-|-we]-[employed-|-have]-[number]-[emp_noun]" data-toc-modified-id="[Company-|-we]-[employed-|-have]-[number]-[emp_noun]-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>[Company | we] [employed | have] [number] [emp_noun]</a></span></li><li><span><a href="#[number_noun]-[was]-[number]" data-toc-modified-id="[number_noun]-[was]-[number]-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>[number_noun] [was] [number]</a></span></li></ul></li></ul></div>

In [1]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc, Span, Token
import pandas as pd
import re

from path import Path, getcwdu

In [2]:
subset_df = pd.read_excel('../data/subset_employee_count_paragraphs.xlsx')

In [3]:
nonum_pat = re.compile(r"^([^\d]|[$][\d]+)[^\d]*$", re.I)

# Combine both conditions into a single boolean mask. Chaining two boolean
# indexers makes pandas reindex the second mask and emit a UserWarning.
train_df = pd.read_csv('data/classifier_input_train.csv', index_col=0)
train_df = train_df[(train_df['len'] > 19) & ~train_df.para_text.apply(lambda x: bool(nonum_pat.match(x)))]
val_df = pd.read_csv('data/classifier_input_val.csv', index_col=0)
val_df = val_df[(val_df['len'] > 19) & ~val_df.para_text.apply(lambda x: bool(nonum_pat.match(x)))]
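
To make the filter above easier to read, here is a minimal sketch of what `nonum_pat` accepts and rejects. The sample strings are made up for illustration and are not taken from the data files.

```python
import re

# Same pattern as in the cell above: it matches text that contains no digits,
# or whose only digits are a leading dollar amount such as "$5".
nonum_pat = re.compile(r"^([^\d]|[$][\d]+)[^\d]*$", re.I)

# Hypothetical sample paragraphs (assumptions, not rows from the dataset).
samples = [
    "We have no employees.",           # no digits: matches, so it is dropped
    "$5 par value common stock.",      # only a leading "$5": matches, dropped
    "We employed 7,683 individuals.",  # contains a count: no match, kept
]

# Keep only paragraphs the pattern does NOT match, as the DataFrame filter does.
kept = [s for s in samples if not nonum_pat.match(s)]
print(kept)
```

Only paragraphs that carry a number other than a leading dollar amount survive, which is the subset that can contain an employee count.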



### Create entity matcher to use in NLP pipeline

In [4]:
# Templated from: https://spacy.io/usage/processing-pipelines#custom-components 
class EmpNounRecognizer(object):
    """A spaCy v2.0 pipeline component that sets entity annotations
    based on list of terms. Terms are labelled as EMP_NOUN. Additionally,
    ._.has_emp_noun and ._.is_emp_noun are set on the Doc/Span and Token
    respectively."""
    name = 'employee_nouns'  # component name, will show up in the pipeline

    def __init__(self, nlp, terms=tuple(), label='EMP_NOUN'):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher – it can now take Doc objects as patterns,
        # so even if the list of terms is long, it's very efficient
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('EMP_NOUN', None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_emp_noun', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_emp_noun == True.
        Doc.set_extension('has_emp_noun', getter=self.has_emp_noun)
        Span.set_extension('has_emp_noun', getter=self.has_emp_noun)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_emp_noun', True)
            # Overwrite doc.ents and add entity – be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities – otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def has_emp_noun(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is an employee noun. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_emp_noun' attribute here,
        which is already set in the processing step."""
        return any([t._.get('is_emp_noun') for t in tokens])

In [5]:
emp_terms_list = ["associates", "employees", "individuals", "people", "persons", "team members"]

In [6]:
nlp = spacy.load('en_core_web_sm')

In [7]:
emp_noun_recognizer = EmpNounRecognizer(nlp, emp_terms_list)

In [8]:
nlp.add_pipe(emp_noun_recognizer, last=True) 

In [9]:
print('Pipeline', nlp.pipe_names) 

Pipeline ['tagger', 'parser', 'ner', 'employee_nouns']


### Define sentence structure types

#### [Company | we] [employed | have] [number] [emp_noun]

Examples: 

`"As of January 31, 2017, we employed 7,683 individuals."`

`"At December 31, 2016, Bio-Rad had approximately 8,250 employees."`
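
As a rough illustration of this sentence type, the pattern can be approximated with a plain regular expression. This is only a sketch and not the spaCy-based approach built in this notebook: the name `find_employee_count`, the verb list, and the allowance for up to two modifier words are assumptions, while the noun list mirrors `emp_terms_list`.

```python
import re

# Hypothetical regex approximation of
# [Company | we] [employed | have] [number] [emp_noun].
EMP_VERB_PAT = re.compile(
    r"\b(?:employ(?:s|ed)?|ha(?:s|d|ve))\b"  # employ/employs/employed, has/had/have
    r"[^.;]*?"                                # text between the verb and the count
    r"\b(\d{1,3}(?:,\d{3})*|\d+)\s+"          # the count, e.g. 7,683 or 455
    r"(?:[\w-]+\s+){0,2}?"                    # up to two modifiers, e.g. "full-time"
    r"(associates|employees|individuals|people|persons|team members)\b",
    re.I,
)


def find_employee_count(sentence):
    """Return (count, noun) for the first match, or None if nothing matches."""
    m = EMP_VERB_PAT.search(sentence)
    if m is None:
        return None
    return int(m.group(1).replace(",", "")), m.group(2).lower()


print(find_employee_count("As of January 31, 2017, we employed 7,683 individuals."))
print(find_employee_count("At December 31, 2016, Bio-Rad had approximately 8,250 employees."))
```

A regex like this returns only the first count in a sentence and knows nothing about syntax, which is one reason to prefer the dependency parse for sentences that report several figures.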

#### [number_noun] [was] [number]

Examples:

`"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015."`

`"Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries."`
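
The second sentence type can be sketched the same way. Again, this is a hedged regex approximation rather than the notebook's parser-based method: `find_reported_count` is a hypothetical helper, and the subject nouns and copular verbs it accepts are assumptions drawn from the example sentences.

```python
import re

# Hypothetical regex approximation of [number_noun] [was] [number].
NUM_NOUN_PAT = re.compile(
    r"\b(number|employment|workforce)\b"  # the number_noun (assumed list)
    r"[^.;]*?"                            # rest of the subject phrase
    r"\b(?:was|is|were|are)\s+"           # copular verb
    r"(?:approximately\s+)?"              # optional hedge word
    r"(\d{1,3}(?:,\d{3})*|\d+)\b",        # the count
    re.I,
)


def find_reported_count(sentence):
    """Return (number_noun, count) for the first match, or None."""
    m = NUM_NOUN_PAT.search(sentence)
    if m is None:
        return None
    return m.group(1).lower(), int(m.group(2).replace(",", ""))


print(find_reported_count(
    "The number of full-time employees of the Company was approximately "
    "31,800 at December 31, 2016 and 32,300 at December 31, 2015."
))
```

Note that only the first figure (31,800) is returned; the prior-year figure after "and" would need a separate rule.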

In [10]:
test_sents = ["As of September 30, 2016, we employed approximately 7,300 employees world-wide.", 
"As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.", 
"At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.", 
"At December 31, 2016, we had approximately 9,400 full-time employees.", 
"As of October 29, 2016, we employed approximately 10,000 individuals worldwide.", 
"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.", 
"As of December 31, 2016, we had 1,469 total employees.", 
"ADP employed approximately 57,000 persons as of June 30, 2016.", 
"At December 31, 2016, we employed approximately 26,400 employees.", 
"The Company and its subsidiaries employed 1,562 persons at December 31, 2016, 114 of whom are covered by a collective bargaining agreement with District 10 of the International Association of Machinists.", 
"As of December 31, 2016, the Company had 455 employees, an increase of 17 employees from the prior year end.", 
"As of December 31, 2016, we had approximately 17,500 employees worldwide.", 
"Based in Neenah, Wisconsin, at December 31, 2016, the Company employed approximately 17,500 individuals and had 59 manufacturing facilities.", 
"As of January 31, 2017, we employed 7,683 individuals.", 
"At December 31, 2016, Bio-Rad had approximately 8,250 employees.", 
"As of December 31, 2016, we had approximately 8,500 full-time employees and 600 contractors.", 
"Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries.", 
"As of December 31, 2016, we employed approximately 2,100 people.", 
"At December 31, 2016, the Company had approximately 11,500 employees.",
"As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.", 
"As of December 31, 2016, we had 699 full-time employees and 202 temporary employees.", 
"As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots."]

In [11]:
doc1b = nlp(test_sents[1])

In [20]:
print_doc_info(doc1b)

doc is: 
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: As | IN  : conjunction, subordinating or preposition
1: of | IN  : conjunction, subordinating or preposition
2: December | NNP  : noun, proper singular
3: 31 | CD  : cardinal number
4: , | ,  : punctuation mark, comma
5: 2016 | CD  : cardinal number
6: , | ,  : punctuation mark, comma
7: the | DT  : determiner
8: subsidiaries | NNS  : noun, plural
9: of | IN  : conjunction, subordinating or preposition
10: AEP | NNP  : noun, proper singular
11: had | VBD  : verb, past tense
12: a | DT  : determiner
13: total | NN  : noun, singular or mass
14: of | IN  : conjunction, subordinating or preposition
15: 17,634 | CD  : cardinal number
16: employees | NNS  : noun, plural
17: . | .  : punctuation mark, sentence closer
--------------------------------------------------
Entities are: 
Decem

In [None]:
class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc

nlp = spacy.load('en_core_web_sm')
terms = (u'cat', u'dog', u'tree kangaroo', u'giant sea spider')
entity_matcher = EntityMatcher(nlp, terms, 'ANIMAL')

nlp.add_pipe(entity_matcher, after='ner')

In [4]:
nlp = spacy.load('en_core_web_lg')

In [12]:
train_para_list = train_df.para_text.tolist()

In [13]:
for sent in nlp(train_para_list[0]).sents:
    print(sent)

As of September 30, 2016, we employed approximately 7,300 employees world-wide.
Approximately 860 of our employees in Mexico, 450 employees in Singapore, and 200 employees in Japan are covered by collective bargaining and other union agreements.


In [14]:
print([(token.text, token.tag_) for token in nlp(list(nlp(train_para_list[0]).sents)[0].text)])

[('As', 'IN'), ('of', 'IN'), ('September', 'NNP'), ('30', 'CD'), (',', ','), ('2016', 'CD'), (',', ','), ('we', 'PRP'), ('employed', 'VBD'), ('approximately', 'RB'), ('7,300', 'CD'), ('employees', 'NNS'), ('world', 'NN'), ('-', 'HYPH'), ('wide', 'RB'), ('.', '.')]


In [16]:
for ent in nlp(list(nlp(train_para_list[0]).sents)[0].text).ents:
    print(ent.text, ent.label_)

September 30, 2016 DATE
approximately 7,300 CARDINAL
employees EMP_NOUN


In [47]:
test_list = ["We are a small company with approximately 61 employees.",
"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015. ",
"Total workforce level at December 31, 2016 was approximately 150,500",           
"Currently, the Company and its subsidiaries have an aggregate of 35 employees.",
"we employ only 31 employees", 
"We currently have 21 employees",
"We currently employ 26 full-time employees",
"Including our full and part-time personnel, we estimate that we have \
the equivalent of 12 full time employees.",
"As a REIT, we employ only 31 employees and have a cost-effective \
management structure."
            ]

In [57]:
doc1 = nlp(test_list[1])

In [50]:
doc7 = nlp(test_list[7])

In [60]:
displacy.render(doc7, style='dep', jupyter=True, options={'distance': 90})
displacy.render(doc7, style='ent', jupyter=True, options={'distance': 90})

In [17]:
def print_doc_info(doc):
    print("doc is: ")
    print(doc)
    print('-' * 50)
    print("tokens are: ")
    print('-' * 50)
    for tid, token in enumerate(doc):
        #print(str(tid) + ' : ' + token.text)
        print(str(tid) + ': ' + token.text + ' | ' + token.tag_ + '  : ' + spacy.explain(token.tag_))
    print('-' * 50)
    print("Entities are: ")
    for ent in doc.ents:
        print(ent.text, ent.label_)
        
    print('-' * 50)
    print("Noun chunks are: ")

    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.label_, chunk.root.text)
    
    print('-' * 50)
    print("Cardinal entities are: ")
    for cardinal in filter(lambda w: w.ent_type_ == 'CARDINAL', doc):
        print(cardinal)
        print("Cardinal.dep_ : " + str(cardinal.dep_))
        print("Cardinal.head : " + str(cardinal.head))
        print("Cardinal.head.dep_ : " + str(cardinal.head.dep_))

In [19]:
displacy.render(doc1, style='dep', jupyter=True)
displacy.render(doc1, style='ent', jupyter=True, options={'distance': 90})

In [70]:
spacy.explain(doc1[0].tag_)

'determiner'

In [127]:
print_doc_info(doc1)

doc is: 
The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015. 
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: The | DT  : determiner
1: number | NN  : noun, singular or mass
2: of | IN  : conjunction, subordinating or preposition
3: full | JJ  : adjective
4: - | HYPH  : punctuation mark, hyphen
5: time | NN  : noun, singular or mass
6: employees | NNS  : noun, plural
7: of | IN  : conjunction, subordinating or preposition
8: the | DT  : determiner
9: Company | NNP  : noun, proper singular
10: was | VBD  : verb, past tense
11: approximately | RB  : adverb
12: 31,800 | CD  : cardinal number
13: at | IN  : conjunction, subordinating or preposition
14: December | NNP  : noun, proper singular
15: 31 | CD  : cardinal number
16: , | ,  : punctuation mark, comma
17: 2016 | CD  : cardinal number
18: and | CC  : conjunction, coordinating
19: 32,300 | C

In [98]:
emp_tok = doc1[6]
emp_num_tok = doc1[12]

In [121]:
print("Token dep_: " + emp_num_tok.dep_)
print("Token head: " + str(emp_num_tok.head))
print("Token head dep_: " + str(emp_num_tok.head.dep_))
for w in emp_num_tok.head.lefts:
    print(str(w) + ' : ' + str(w.dep_))

print("Token head head: " + str(emp_num_tok.head.head))
print("Token head head dep_: " + str(emp_num_tok.head.head.dep_))

Token dep_: attr
Token head: was
Token head dep_: ROOT
number : nsubj
Token head head: was
Token head head dep_: ROOT


In [122]:
print("Token dep_: " + emp_tok.dep_)
print("Token head: " + str(emp_tok.head))
print("Token head dep_: " + str(emp_tok.head.dep_))
print("Token head head: " + str(emp_tok.head.head))
print("Token head head dep_: " + str(emp_tok.head.head.dep_))

Token dep_: pobj
Token head: of
Token head dep_: prep
Token head head: number
Token head head dep_: nsubj


In [40]:
relations = []
for cardinal in filter(lambda w: w.ent_type_ == 'CARDINAL', nlp(list(doc.sents)[0].text)):
    if cardinal.dep_ in ('attr', 'dobj'):
        subject = [w for w in cardinal.head.lefts if w.dep_ == 'nsubj']
        if subject:
            subject = subject[0]
            relations.append((subject, cardinal))
    elif cardinal.dep_ == 'pobj' and cardinal.head.dep_ == 'prep':
        relations.append((cardinal.head.head, cardinal))
relations

[]

In [18]:
doc0 = nlp(test_sents[0]) 
doc1 = nlp(test_sents[1]) 
doc2 = nlp(test_sents[2]) 
doc3 = nlp(test_sents[3]) 
doc4 = nlp(test_sents[4]) 

In [188]:
displacy.render(doc1, style='ent', jupyter=True) 

In [187]:
print_doc_info(doc1)

doc is: 
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: As | IN  : conjunction, subordinating or preposition
1: of | IN  : conjunction, subordinating or preposition
2: December | NNP  : noun, proper singular
3: 31 | CD  : cardinal number
4: , | ,  : punctuation mark, comma
5: 2016 | CD  : cardinal number
6: , | ,  : punctuation mark, comma
7: the | DT  : determiner
8: subsidiaries | NNS  : noun, plural
9: of | IN  : conjunction, subordinating or preposition
10: AEP | NNP  : noun, proper singular
11: had | VBD  : verb, past tense
12: a | DT  : determiner
13: total | NN  : noun, singular or mass
14: of | IN  : conjunction, subordinating or preposition
15: 17,634 | CD  : cardinal number
16: employees | NNS  : noun, plural
17: . | .  : punctuation mark, sentence closer
--------------------------------------------------
Entities are: 
Decem

In [183]:
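# Regex over the employee nouns of interest.
emp_ner_regex = re.compile(r"employees|individuals|people|persons|team members", re.I)

# Lexeme-level flag: the getter receives the text of a single token, so the
# two-word alternative "team members" can never match here.
emp_flag = lambda text: bool(emp_ner_regex.match(text))

IS_EMP = nlp.vocab.add_flag(emp_flag)

matcher = Matcher(nlp.vocab)

# A cardinal number immediately followed by an employee noun.
num_emp_pat = [{'ENT_TYPE' : 'CARDINAL', 'TAG' : 'CD'}, {IS_EMP : True}]

matcher.add('NUM_EMP', None, num_emp_pat)
matcher.add('EMP_NOUN', None, [{IS_EMP: True}])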

In [184]:
matches1 = matcher(doc1)

In [186]:
for match_id, start, end in matches1:
    # create a new Span for each match and use the match_id as the label
    span = Span(doc1, start, end, label=match_id)
    doc1.ents = list(doc1.ents) + [span]  # add span to doc.ents

In [164]:
def add_match_ents(doc, matches):
    """Add matches from Matcher instance to a document's entities."""
    for match_id, start, end in matches:
        # create a new Span for each match and use the match_id as the label
        span = Span(doc, start, end, label=match_id)
        doc.ents = list(doc.ents) + [span]  # add span to doc.ents

In [166]:
matcher = Matcher(nlp.vocab)
matcher.add('NumEmp', None, num_emp_pat)

In [128]:
doc0 = nlp(test_sents[0])
for match in re.finditer(emp_ner_regex, doc0.text):
    start, end = match.span()          # get matched character indices
    span = doc0.char_span(start, end)  # create Span (None if not token-aligned)
    if span is not None:
        print(span.text)

employees


In [189]:
for match_id, start, end in matches1:
    span = doc1[start:end]
    print(span)
    print(span.sent)

17,634 employees
As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
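
A matched span such as `17,634 employees` still holds the count as text. A minimal sketch of turning it into an integer follows; `parse_count` is a hypothetical helper, and it assumes the span text starts with the number token, as the `NUM_EMP` pattern implies.

```python
def parse_count(span_text):
    """Return the leading number of a matched span as an int, or None."""
    parts = span_text.split()
    if not parts:
        return None
    digits = parts[0].replace(",", "")  # drop thousands separators
    return int(digits) if digits.isdigit() else None


print(parse_count("17,634 employees"))
```

Spans that do not begin with a number, such as a bare `employees` match from the `EMP_NOUN` rule, yield `None`.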


In [160]:
print_doc_info(doc4)

doc is: 
As of October 29, 2016, we employed approximately 10,000 individuals worldwide.
--------------------------------------------------
tokens are: 
--------------------------------------------------
0: As | IN  : conjunction, subordinating or preposition
1: of | IN  : conjunction, subordinating or preposition
2: October | NNP  : noun, proper singular
3: 29 | CD  : cardinal number
4: , | ,  : punctuation mark, comma
5: 2016 | CD  : cardinal number
6: , | ,  : punctuation mark, comma
7: we | PRP  : pronoun, personal
8: employed | VBD  : verb, past tense
9: approximately | RB  : adverb
10: 10,000 | CD  : cardinal number
11: individuals | NNS  : noun, plural
12: worldwide | RB  : adverb
13: . | .  : punctuation mark, sentence closer
--------------------------------------------------
Entities are: 
October 29, 2016 DATE
approximately 10,000 CARDINAL
--------------------------------------------------
Noun chunks are: 
October NP October
we NP we
approximately 10,000 individuals NP indiv

In [159]:
relations = []
for cardinal in filter(lambda w: w.ent_type_ == 'CARDINAL', doc4):
    if cardinal.dep_ in ('attr', 'dobj'):
        subject = [w for w in cardinal.head.lefts if w.dep_ == 'nsubj']
        if subject:
            subject = subject[0]
            relations.append((subject, cardinal))
    elif cardinal.dep_ == 'pobj' and cardinal.head.dep_ == 'prep':
        relations.append((cardinal.head.head, cardinal))
relations

[]

In [150]:
matches

[(9799849049383154009, 15, 16)]