This notebook contains the code that builds a facts dataframe from paragraphs of SEC filings. It is a work in progress and is still being updated.

In [1]:
import pandas as pd

#train_df = pd.read_csv('data/classifier_input_train.csv', index_col=0)
train_df = pd.read_csv('data/classifier_input_train3.csv', index_col=0)
#val_df = pd.read_csv('data/classifier_input_val.csv', index_col=0)
val_df = pd.read_csv('data/classifier_input_val3.csv', index_col=0)

## Create pipeline components to use in SpaCy NLP pipeline

In [2]:
import spacy
#from spacy.matcher import Matcher, PhraseMatcher
#from spacy.tokens import Doc, Span, Token
from spacy_pipes import pipeline_components 
from spacy_utils import helper_funcs as hf

In [3]:
import re

# Testing regex using examples from data
false_date_regex = re.compile(r"^([4-9][\d]|3[2-9]|(([0-9]{1,3},)*[0-9]{3}([.][0-9])?))$")
[True, True, False] == [bool(re.match(false_date_regex, s)) for s in ["375,000", "61", "2014"]]

True
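
To make the behaviour of `false_date_regex` easier to see, the sketch below runs the same pattern over a few more strings. The extra sample strings are illustrative and not taken from the data. The reading of the pattern is an interpretation: it accepts two-digit numbers from 32 to 99 (presumably because they are too large to be a day of the month), and numbers written as exactly three digits or as comma-grouped thousands with an optional single decimal digit. A bare four-digit string such as a year is rejected.

```python
import re

# Same pattern as in the cell above.
false_date_regex = re.compile(
    r"^([4-9][\d]|3[2-9]|(([0-9]{1,3},)*[0-9]{3}([.][0-9])?))$"
)

# Illustrative samples; only "375,000", "61" and "2014" come from the cell above.
samples = ["375,000", "61", "32", "31", "2014", "1,234.5", "375"]
results = {s: bool(false_date_regex.match(s)) for s in samples}

for s, matched in results.items():
    print(f"{s!r:>10} -> {matched}")
```

Note that a plain three-digit string such as "375" also matches, because the comma-separated group is optional.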

Custom pipeline components:
- EmpNounRecognizer: Tags employee nouns (e.g., "employees", "workforce") as `EMP_NOUN` entities
- EmpTypeRecognizer: Creates entities that mark employee types (e.g., full-time, part-time)
- NumberWordRecognizer: Flags number words (e.g., "thousand")
- YearMatcher: Adds a custom `_.is_year` attribute to year tokens that are part of DATE entities
- FalseDateMatcher: Flags numbers that have been incorrectly labeled as part of DATE entities

In [4]:
def get_case_combos(str_list, fast=False):
    """Return a list with original, lower, upper, and title case."""
    
    if not fast: # Preserve some rational ordering
        case_combos = [s.lower() for s in str_list] + [s.upper() for s in str_list] 
        case_combos = case_combos + [s.title() for s in str_list if s.title() not in case_combos] 
        case_combos = case_combos + [s for s in str_list if s not in case_combos]
        return case_combos
    
    case_combos = str_list + [s.lower() for s in str_list] + [s.upper() for s in str_list] + [s.title() for s in str_list]
    return list(set(case_combos))

emp_terms_list = ["associates", "employees", "equivalents", "FTEs", "FTE's", "headcount", "individuals", 
                  "people", "persons", "team members", "workers", "workforce"]
emp_terms_list = get_case_combos(emp_terms_list)

part_time_terms = get_case_combos(["half-time", "half time", "part-time", "part time"])
full_time_terms = get_case_combos(["full-time", "full time", "40-hour equivalent", "40 hour equivalent", "full-time equivalent", "full time equivalent", "full-"])

# The emp_noun_recognizer accepts a dictionary for each entity label type
emp_type_dict = {'PART_TIME': part_time_terms, 
                'FULL_TIME': full_time_terms}

singles_word_list = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
teens_word_list = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
tens_word_list = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
magnitude_word_list = ["hundred", "thousand", "million", "billion"]

teen_unit_combos = [x.join([y,z])  for x in [" ", "-"] for y in get_case_combos(tens_word_list) for z in get_case_combos(singles_word_list) ]
number_word_list = get_case_combos(singles_word_list) + get_case_combos(teens_word_list) + get_case_combos(tens_word_list) + get_case_combos(magnitude_word_list) + teen_unit_combos

year_patterns = [{'ENT_TYPE': 'DATE', 'TAG' : 'CD', 'SHAPE' : 'dddd'}]
false_date_patterns = [{'ENT_TYPE': 'DATE', 'TAG' : 'CD'}]
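
A quick check of what `get_case_combos` produces can help when reviewing the term lists. The sketch below copies the default (`fast=False`) branch so that it runs on its own; the two-item inputs are chosen for illustration only. It shows the ordering (lower, upper, title, then any original spelling not already present), a `str.title()` side effect on apostrophes, and the mixed-case pairs that the tens/units join generates.

```python
def get_case_combos(str_list):
    """Copy of the default (fast=False) branch defined above."""
    case_combos = [s.lower() for s in str_list] + [s.upper() for s in str_list]
    case_combos = case_combos + [s.title() for s in str_list if s.title() not in case_combos]
    case_combos = case_combos + [s for s in str_list if s not in case_combos]
    return case_combos

# str.title() capitalizes the letter after an apostrophe, so "FTE's" gains
# the variant "Fte'S" in addition to the spellings one might expect.
combos = get_case_combos(["FTE's", "full-time"])
print(combos)

# Joining every case variant of a tens word with every case variant of a
# units word also yields mixed-case pairs such as "TWENTY-one".
tens = get_case_combos(["twenty", "thirty"])
units = get_case_combos(["one", "two"])
pairs = [sep.join([t, u]) for sep in [" ", "-"] for t in tens for u in units]
print(len(pairs), "TWENTY-one" in pairs)
```

With the full lists above (8 tens words, 9 units words, 3 case variants each, 2 separators) the same join produces 1,296 combined number words.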

The large, pre-trained language model (`en_core_web_lg`) seems to do a better job of parsing dependencies, which matters for the complex sentence types in these documents. 

The code below initializes the nlp pipeline and then adds the custom components. 

In [5]:
nlp = spacy.load('en_core_web_lg')

emp_noun_recognizer = pipeline_components.EmpNounRecognizer(nlp, emp_terms_list)
nlp.add_pipe(emp_noun_recognizer, last=True) 

emp_type_recognizer = pipeline_components.EmpTypeRecognizer(nlp, emp_type_dict)
nlp.add_pipe(emp_type_recognizer, last=True) 

number_word_recognizer = pipeline_components.NumberWordRecognizer(nlp, number_word_list)
nlp.add_pipe(number_word_recognizer, last=True) 

year_matcher = pipeline_components.YearMatcher(nlp, year_patterns)
nlp.add_pipe(year_matcher, last=True) 

false_date_matcher = pipeline_components.FalseDateMatcher(nlp, false_date_patterns)
nlp.add_pipe(false_date_matcher, last=True) 

print('Pipeline', nlp.pipe_names) 

Pipeline ['tagger', 'parser', 'ner', 'employee_nouns', 'employee_types', 'number_words', 'year_matcher', 'false_date']


## Code for relationship extraction and helper functions

### Function: extract_emp_relations

In [6]:
from collections import namedtuple

def find_root_tok(tok):
    """Return (tok's root node, num steps to reach root)."""
    
    steps = 0 
    root_tok = tok
    while root_tok.dep_ != 'ROOT':
        steps +=1 
        root_tok = root_tok.head
    return (root_tok, steps)

def find_verb_tok(tok, verbose = False):
    """Return the first verb ancestor of tok that is a ROOT, ccomp, advcl, or relcl; return 0 if none."""
    verb_tok = 0
    for a in tok.ancestors:
        #if a.pos_ == 'VERB' and a.dep_ in ['ROOT', 'ccomp', 'advcl']:
        if a.pos_ == 'VERB' and a.dep_ in ['ROOT', 'ccomp', 'advcl', 'relcl']:
            return a
    return verb_tok
    
def find_tok_side_of_root(tok, root_tok):
    """Return 'right' or 'left' if tok is in subtree of root."""
    
    for a in [tok] + list(tok.ancestors): # The ancestors of a token will either be in root.rights or root.lefts
        if a in root_tok.lefts:
            return 'left'
        elif a in root_tok.rights:
            return 'right'
    else:
        return None

def find_subject(root_tok, verbose=False):
    """Return the first nominal subject left of root_tok, or False if there is none."""
    subjects = [w for w in root_tok.lefts if w.dep_ == 'nsubj']
    try:
#        for i, s in enumerate(subjects):
#            print("nsubj "+str(i)+" of " + str(root_tok) + " is : " + str(s))
        return subjects[0]
    except IndexError: # No nsubj among the root's left children
        if verbose == True:
            print("No nsubj found left of ROOT. Noun chunks in the doc are:")
            print(list(root_tok.doc.noun_chunks))
        return False
        
def get_org_span(tok):
    """Return the entity span if token has ORG ent_type_."""
    if tok.ent_type_ == 'ORG':
        subject = [e for e in tok.doc.ents if tok in e][0]
        return subject
    return tok

def check_emp_type_flags(toks, verbose=False):
    """Return 'Part-Time' or 'Full-Time' if corresponding flags
    are set to True."""
    part_time, full_time = (0,0)
    for tok in toks:
        if tok._.is_part_time == True:
            part_time += 1 
        elif tok._.is_full_time == True:
            full_time += 1
    if part_time and full_time:
        if verbose == True:
            print("Part_time and full_time flags found.")
        return 'Other Employees'
    if part_time:
        return 'Part-Time Employees'
    if full_time:
        return 'Full-Time Employees'
    return 'Other Employees'

def find_emp_type_toks(tok, verbose=False):
    """Return list of emp_type-flagged tokens attached to tok, else the preceding ADJ or compound child (empty list if none)."""
    
    tok_emp_type_subtree = [t for t in tok.subtree if t._.is_emp_type == True]
    flagged_toks = [t for t in tok.children if t in tok_emp_type_subtree]
    if tok_emp_type_subtree:
        if verbose == True:
            print("tok_emp_type_subtree: ",tok_emp_type_subtree)
        if flagged_toks:
            if verbose == True:
                print("Flagged toks from tok.children:", flagged_toks)
        if not flagged_toks:
            flagged_toks = [t for t in tok_emp_type_subtree if t.dep_ == 'compound' and t.head.dep_ == 'pobj' and t.head.head.head == tok]
            if flagged_toks:
                if verbose == True:
                    print("Flagged toks from tok_emp_subtree:", flagged_toks)
                    print("t.dep_ == 'compound', t.head.dep_ == 'pobj', and t.head.head.head is emp_noun")
        if not flagged_toks:
            flagged_toks = [t for t in tok_emp_type_subtree if t.dep_ == 'pobj' and t.head.head == tok]
            if flagged_toks:
                if verbose == True:
                    print("Flagged toks from tok_emp_subtree:", flagged_toks)
                    print("t.dep_ == 'pobj' and t.head.head is emp_noun")
    # If word is dobj and emp_type is part of a prepositional phrase, need to check head.rights
    if not flagged_toks and tok.dep_ == 'dobj':
        tok_head_emptype_subtreee = [t for t in list(tok.head.subtree) if t._.is_emp_type]
        tok_head_empnoun_subtreee = [t for t in list(tok.head.subtree) if t._.is_emp_noun]
        if verbose==True:
            print("Finding emp_type, emp_tok is dobj")
            print("emp_noun tok head:", tok.head)
            print("tok_head_emptype_subtreee:", tok_head_emptype_subtreee)
            print("tok_head_empnoun_subtreee:", tok_head_empnoun_subtreee)
        flagged_toks = [t for t in tok_head_emptype_subtreee if tok_head_emptype_subtreee.index(t) == tok_head_empnoun_subtreee.index(tok) ]
        if flagged_toks:
            if verbose == True:
                print("Flagged toks from tok_head_emptype_subtreee:", flagged_toks)
                print("Flagged tok heads:", [t.head for t in flagged_toks])
                print("Flagged tok head deps:", [t.head.dep_ for t in flagged_toks])
    if flagged_toks:
        type_conjs = [t for t in list(flagged_toks[0].conjuncts) if t._.is_emp_type == True]
        if type_conjs:
            if verbose == True:
                print("type_conjs: ", type_conjs)
            flagged_toks = flagged_toks + type_conjs
        if verbose == True:
            print("Flagged_toks: ", flagged_toks)
        #return flagged_toks[0]
        return flagged_toks
    
    candidate_tok = tok.doc[tok.i - 1]  
    while candidate_tok.is_punct == True:
        candidate_tok = tok.doc[candidate_tok.i - 1]
    if candidate_tok.head == tok:
        if candidate_tok.pos_ == 'ADJ' or candidate_tok.dep_ == 'compound':
            return [candidate_tok]
        if verbose == True:
            print("Candidate tok: ", candidate_tok)
            print("Candidate tok.pos_:  ", candidate_tok.pos_)
            print("Candidate tok.dep_:  ", candidate_tok.dep_)
    if verbose == True:
        print("No toks, returning empty list.")
    return flagged_toks

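def get_nummod_tok(tok, years, verbose=False):
    """Return tok's nummod child with a CARDINAL, QUANTITY, or FALSE_DATE entity type.

    If that child has conjoined numbers and their count matches len(years),
    return the number paired with the latest year. Return [] if tok has no such child."""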
    
    num_toks = [c for c in tok.children if c.dep_ == 'nummod' and c.ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE']]
    if num_toks:
        if verbose == True:
            print("Num_toks are: " + str(num_toks))
        num_tok = num_toks[0]
        num_tok_conj = [c for c in num_tok.children if c.dep_=='conj' and c.tag_ == 'CD']
        
        if num_tok_conj:
            if verbose == True:
                print("num_tok has conjunct children:" + str(num_tok_conj))
                print("num_tok subtree is :" + str(list(num_tok.subtree)))
            cards = [(c.i, c) for c in num_tok.subtree if c.tag_== 'CD' and c.ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE']]
            
            if len(years) == len(cards):
                order_indices = [years.index(y) for y in sorted(years, reverse=True, key = lambda x: x[1])]
                #year_emps = [(years[i][1].text, cards[i][1].text) for i in order_indices]
                #num_tok = max(year_emps)[1]
                year_emps = sorted([(years[i][1], cards[i][1]) for i in order_indices], reverse=True, key = lambda x: x[0].text)
                num_tok = year_emps[0][1]
                if verbose == True:
                    print("years: " + str(years))
                    print("cards: " + str(cards))
                    print("order_indices: " + str(order_indices))
                    print("year_emps: " + str(year_emps))
        return num_tok
    
    return num_toks    

def extract_emp_relations(doc, verb_list=False, verbose=False):
    """Return a list of RelationDetails namedtuples, one per extracted relation."""
    if not verb_list:
        verb_list = ['be', 'employ', 'have']
    
    relation_tuples = []
    
    tuple_field_names = ["sent_num", "word_num",  "subject", "verb", 
                         "quantity", "quantity_idx",  "quantity_type", "type_token" , 
                         "word", "word_idx", "word_dep",  "depth", "sentence"]
    RelationDetails = namedtuple('RelationDetails', tuple_field_names)
    
    for sent_id, sent in enumerate(doc.sents):
        
        # Find the root token
        root_tok, depth = find_root_tok(sent[0])
        
        match_pairs = []
        num_tok, num_tok_conj, subject, year_conj = (False, False, False, False)
        years = [(y.i, y) for y in root_tok.subtree if y._.is_year == True] # root_tok.subtree limits the years to this sentence

        for word_id, word in enumerate(filter(lambda w: w.ent_type_ == 'EMP_NOUN', sent)):  
            
            if verbose == True:
                print("Word_id is : " + str(word_id))
                print("Word is : " + str(word))
                print("Word subtree is : ", doc[word.left_edge.i:word.right_edge.i].text)
                print("Word children : ", list(word.children))
            
            num_toks = []
            num_tok = get_nummod_tok(word, years, verbose = verbose)
            
            # Find first verb ancestor 
            verb_tok = find_verb_tok(word)
            if not verb_tok:
                continue
            
            # If the root's lemma is not in verb_list, use the verb ancestor as root;
            # if its lemma is not in verb_list either, skip this employee noun
            if root_tok.lemma_ not in verb_list:
                root_tok = verb_tok
                if verbose == True:
                        print("Root token lemma not one of ['be', 'employ', 'have']. ")
                        print("Root token, lemma are : " + str(root_tok) + " " + str(root_tok.lemma_))
                        print(list(root_tok.subtree))
                
                if verb_tok.lemma_ not in verb_list:
                    if verbose == True:
                        print("verb token lemma not one of ['be', 'employ', 'have']. ")
                        print("verb token, lemma are : " + str(verb_tok) + " " + str(verb_tok.lemma_))
                        print(list(verb_tok.subtree))
                    continue

            emp_type_toks = find_emp_type_toks(word, verbose=verbose)
            emp_type = 'Other Employees'
            if emp_type_toks:
                emp_type = check_emp_type_flags(emp_type_toks, verbose=verbose)
            parts_found = []
            # Find out if the employee noun is in subject (left) or predicate (right)
            left_side = []; right_side = []
            emp_tok_side = find_tok_side_of_root(word, root_tok)
            if emp_tok_side == 'left':
                left_side.append(word)
            elif emp_tok_side == 'right':
                right_side.append(word)
            else:
                if verbose == True:
                    print("No ancestor of'" + str(word) + "' is in root.rights or root.lefts.")    

            if verbose == True:
                print("Dep_ of EMP_NOUN is: " + str(word.dep_))
            if word.dep_ in ('attr', 'dobj', 'compound') or (word.dep_ == 'pobj' and word.head.dep_ == 'prep'):
                #num_tok = get_nummod_tok(word, years, verbose = verbose)      
                if num_tok:
                    match_pairs.append((num_tok, word))
                else:
                    cards = [e for e in word.doc.ents if e.label_ in ['CARDINAL', 'FALSE_DATE'] and e.root in root_tok.rights]
                    if cards:
                        cards = cards + [c for c in word.doc.ents if c.root in cards[0].root.subtree and c not in cards and c.label_ in ['CARDINAL', 'FALSE_DATE']]           
                        if word in left_side:    
                            if len(years) > 0:                       
                                emp_counts = [(c.start, c) for c in sorted(cards, reverse=False, key = lambda x: x.start)]                      
                                order_indices = [years.index(y) for y in sorted(years, reverse=True, key = lambda x: x[1])]
                                try: 
                                    #year_emps = [(years[i][1].text, emp_counts[i][1].text) for i in order_indices]
                                    year_emps = sorted([(years[i][1], emp_counts[i][1]) for i in order_indices], reverse=True, key = lambda x: x[0].text)
                                    if verbose == True:
                                        print("years: " + str(years))
                                        print("emp_counts: " + str(emp_counts))
                                        print("order_indices: " + str(order_indices))
                                        print("year_emps: " + str(year_emps))
                                    #num_tok = max(year_emps)[1]
                                    num_tok = year_emps[0][1]
                                except:
                                    print(str("==" * 20))
                                    print("Length of emp_counts is : " + str(len(emp_counts)) + 
                                         " while length of years is : " + str(len(years)))
                                    num_tok = cards[0]
                                    print(str("-" * 20))
                                    print(word.doc.text)
                                if verbose == True:
                                    print("Sentence has multiple years:" + str(years))
                                    print("First card subtree is :" + str(list(cards[0].subtree)))
                                    print("years: " + str(years))
                                    print("cards: " + str(cards))
                                    print("emp_counts: " + str(emp_counts))
                    #                print("order_indices: " + str(order_indices))                   
                                match_pairs.append((num_tok, word))
                        else:
                            if verbose == True:
                                print("Emp_tok is in right side; appending first card.")
                            match_pairs.append((cards[0], word))
                    elif verb_tok.dep_ == 'relcl':
                        cards = [e for e in word.doc.ents if e.label_ in ['CARDINAL', 'FALSE_DATE'] and e.root in verb_tok.lefts]
                        if cards:
                            match_pairs.append((cards[0], word))
                            num_tok = cards[0]
                if verbose == True:
                    print("Root is at "+str(depth)+" steps from "+str(word)+".")
                subject = find_subject(root_tok)
                if not subject: # For debugging
                    if verbose == True:
                        print("No nsubj found left of ROOT. Noun phrases left of root are:")
                        left_filter = lambda e: e.root in root_tok.lefts
                        # Helpers are assumed to live in spacy_utils.helper_funcs (imported as hf)
                        hf.print_df(hf.make_span_df(doc, entities=False, span_filter_func=left_filter))
                else:
                    if subject == word.head.head: # If word is part of prep phrase of subject
                        subject = doc[subject.left_edge.i : subject.right_edge.i + 1]
                    else: 
                        subject = get_org_span(subject) # Use full span of ORG entity if subject tok is in ORG 
                    parts_found.append(subject)
                    match_pairs.append((subject, word))
                    #[print(str(p) + '  :  ' + str(p.dep_)) for p in subject.subtree]
                    sub_poss = [p for p in subject.subtree if p.dep_ == 'poss']
    #                if sub_poss:
    #                    sub_poss = sub_poss[0]
    #                    match_pairs.append((sub_poss, word))
                    if root_tok:
                        parts_found.append(root_tok)
                    if num_tok:
                        parts_found.append(num_tok)
                        parts_found.append(emp_type)
                        parts_found.append(emp_type_toks)
                        parts_found.append(word)
                    elif word.head.head.head.pos_ == 'VERB':
                        if verbose == True:
                            print("No num_tok. ")
                        cards = [c for c in word.head.head.head.rights if c.tag_== 'CD' and c.ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE'] ]
                        years = [(y.i, y) for y in root_tok.subtree if y._.is_year == True]
                        match_pairs.append((years, cards))
                        if cards:
                            match_pairs.append((cards[0], word))

            elif word.dep_ == 'conj':
                
                head_num_tok = [w for w in [word.head] if w.tag_ == 'CD' and w.ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE'] ]
                if verbose == True:
                    print("Emp_noun token has dep_ == 'conj'.")
                    print("Child num_tok: " + str(num_tok))
                    print("Head num_tok: " + str(head_num_tok))
                if num_tok and head_num_tok:
                    if verbose == True:
                        print("child_num_tok and head_num_toks")                      
                    num_toks = [num_tok] + head_num_tok
                    years = [(y.i, y) for y in root_tok.subtree if y._.is_year == True]
                    if verbose == True:
                        print("years: " + str(years))   
                        print("num_toks: " + str(num_toks))   
                    if head_num_tok[0].dep_ == 'conj' or head_num_tok[0].head.ent_type_ in ['CARDINAL', 'FALSE_DATE'] :
                        num_toks = num_toks + [w for w in [head_num_tok[0].head] if w.tag_ == 'CD']
                    if len(years) > len(num_toks):
                        possible_series_num_tok = doc[head_num_tok[0].i - 2]
                        if possible_series_num_tok.ent_type_ in ['CARDINAL', 'FALSE_DATE']:
                            if verbose == True:
                                print("possible_series_num_tok: " + str(possible_series_num_tok))
                            num_toks = num_toks + [possible_series_num_tok] 
                        head_num_conjucts = [c for c in head_num_tok[0].conjuncts if c.tag_ == 'CD']
                        if head_num_conjucts:
                            if head_num_conjucts[0].ent_type_ != 'CARDINAL':
                                print(str("==" * 30))
                                print("Potential series token :" + str(head_num_conjucts[0]) + 
                                     " does not have CARDINAL entity type. ")
                                print("Entity type is: " + str(head_num_conjucts[0].ent_type_))
                                print("Token index: " + str(head_num_conjucts[0].i))
                                print("Doc is: " + str(word.doc))
                            num_toks = num_toks + head_num_conjucts
                    emp_counts = sorted([(c.i, c) for c in num_toks], key = lambda x: x[0])
                    if verbose == True:
                        print("emp_counts: " + str(emp_counts))
                    order_indices = [years.index(y) for y in sorted(years, reverse=True, key = lambda x: x[1])]
                    if verbose == True:
                        print("order_indices: " + str(order_indices))
                    year_emps = []
                    try:
                        year_emps = sorted([(years[i][1], emp_counts[i][1]) for i in order_indices], reverse=True, key = lambda x: x[0].text)
                    except IndexError: # More years than employee counts
                        print(str("==" * 20))
                        print("Error on doc:")
                        print(str("-" * 20))
                        print(word.doc.text)
                        print(str("-" * 20))
                        print("Error sentence:")
                        print(sent)
                        
                    if verbose == True:
                        print("year_emps: " + str(year_emps))
                    #num_tok = max(year_emps)[1]
                    if year_emps: # Otherwise keep the child num_tok found above
                        num_tok = year_emps[0][1]
                match_pairs.append((num_tok, word))
                
                if not subject:
                    try:
                        subject = get_org_span(find_subject(root_tok))
                    except (AttributeError, IndexError): # find_subject returned False, or no entity span found
                        continue
                parts_found.append(subject)
                parts_found.append(root_tok)
                parts_found.append(num_tok)
                parts_found.append(emp_type)
                parts_found.append(emp_type_toks)
                parts_found.append(word)
            
            else:
                continue
            
            # Check for employee counts that are nummods of emp_type tokens
            emp_type_num_toks = [get_nummod_tok(x,[], verbose=verbose) for x in emp_type_toks if get_nummod_tok(x,[])]
            if emp_type_num_toks:
                if verbose == True:
                    print("emp_type_num_toks: ", emp_type_num_toks)
                    print("num_tok: ", num_tok)
                if num_tok:
                    num_toks = sorted([num_tok] + emp_type_num_toks, key = lambda t: t.i)
                else:
                    num_toks = sorted(emp_type_num_toks, key = lambda t: t.i)
                if verbose == True:
                    print("num_toks: ", num_toks)
                if len(emp_type_toks) == len(num_toks):
                    emp_types = [check_emp_type_flags([x], verbose=verbose) for x in emp_type_toks]
                    num_type_toks = list(zip(num_toks, emp_types,  emp_type_toks))
                    if all([subject, root_tok]):
                        for ir, r in enumerate(num_type_toks):
                            details = [sent_id, word_id, subject, root_tok, r[0], r[0].i, r[1], r[2], word, word.i, word.dep_, depth, sent.text]
                            if verbose == True:
                                print("Detail_list ", ir, ": ", details)
                            relation_tuples.append(RelationDetails(*details))
                        continue
                if verbose == True:
                    print("emp_type_num_toks: ", emp_type_num_toks)
            
            # Check to see if emp_type actually belongs to a relative clause 
            # with a different employee number. 
            # This should be replaced by a handler for relative clauses 
            # and more abstract collection functions for employee count tokens.
            
            if len(emp_type_toks) == 1 and all([num_tok, emp_type_toks[0].head.dep_ == 'relcl', root_tok != find_verb_tok(emp_type_toks[0])]):
                type_tok_relcl = emp_type_toks[0] # known to exist because of "if" condition 
                verb_relcl = find_verb_tok(type_tok_relcl) # known to exist because of "if" condition 
                sub_relcl = find_subject(verb_relcl) # looking for employee count as nsubj
                if verbose == True:
                    print("Type token in relative clause while emp_noun is not.")
                    print("type_tok_relcl.dep_ is", type_tok_relcl.dep_)
                    if sub_relcl:
                        print("sub_relcl found: ", sub_relcl)
                if sub_relcl:
                    num_tok_relcl = [s for s in [sub_relcl] if s.tag_ == 'CD' and s.ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE']] 
                    if num_tok_relcl:
                        num_toks = [num_tok] + num_tok_relcl
                        if verbose == True:
                            print("num_tok_relcl:", num_tok_relcl)
                            print("num_toks:", num_toks)
                        if type_tok_relcl.dep_ == 'attr':
                            emp_types = ['Other Employees', emp_type]
                            emp_type_toks_relcl = list(find_verb_tok(type_tok_relcl).subtree)
                            if emp_type_toks_relcl: # Get evidence to support "Other Employees" classification
                                if len(emp_type_toks_relcl) > 4:
                                    emp_type_toks_relcl = emp_type_toks_relcl[:min(3,len(emp_type_toks_relcl))]
                            emp_type_toks = [emp_type_toks_relcl, emp_type_toks[0]]
                            for i_tup, tup in enumerate(list(zip([root_tok, verb_tok], num_toks, emp_types, emp_type_toks))):
                                details = [sent_id, word_id, subject, tup[0], tup[1], tup[1].i, tup[2], tup[3], word, word.i, word.dep_, depth, sent.text]
                                if verbose == True:
                                    print("Detail_list ", i_tup, ": ", details)
                                relation_tuples.append(RelationDetails(*details))
                            continue
            
            if all([subject, root_tok, num_tok, emp_type ]):
                if verbose == True:
                    print("Parts found: ", tuple(parts_found))
                if hasattr(num_tok, 'root'): # num_tok is an entity Span rather than a Token
                    num_tok_idx = num_tok.root.i
                else:
                    num_tok_idx = num_tok.i
                details = [sent_id, word_id, subject, root_tok, num_tok, num_tok_idx, emp_type, emp_type_toks, word, word.i, word.dep_, depth, sent.text]
                #details = [sent_id, word_id] + parts_found + [word.dep_, depth, sent.text]
                relation_tuples.append(RelationDetails(*details))        
    return relation_tuples
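
When a sentence reports counts for several years, `extract_emp_relations` pairs years and counts by their order in the sentence and keeps the count that belongs to the latest year. The sketch below is a simplified, spaCy-free restatement of that step. The function name `pick_latest_count` is hypothetical, plain `(token_index, text)` tuples stand in for tokens and spans, and the values are taken from the verbose output of the example further down. Unlike the code above, which falls back to the first cardinal in one branch, this sketch returns `None` when the lists cannot be paired.

```python
def pick_latest_count(years, counts):
    """Pair years and counts by document order; return the count of the latest year.

    years, counts: lists of (token_index, text) tuples (stand-ins for tokens).
    Returns None if the lists are empty or differ in length.
    """
    if not years or len(years) != len(counts):
        return None
    years = sorted(years)    # order by token index
    counts = sorted(counts)  # order by token index
    pairs = [(y_text, c_text) for (_, y_text), (_, c_text) in zip(years, counts)]
    # Four-digit year strings sort the same way lexically and numerically.
    latest_year, latest_count = max(pairs, key=lambda p: p[0])
    return latest_count

years = [(15, "2016"), (22, "2015")]
counts = [(9, "approximately 31,800"), (17, "32,300")]
print(pick_latest_count(years, counts))
```

Positional pairing assumes each year in the sentence has exactly one count and that the two appear in the same relative order.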

## Define sentence structure types

In [7]:
from spacy import displacy

### `Emp_noun` and `Company_noun`  in subject, `Emp_num` in predicate

#### `"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015."`

Desired output:  
`(number, was, 31,800, full-time employees)`

In [8]:
ex6 = nlp("The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.")
ex6_emp_tok = ex6[4]
ex6_emp_tok.dep_
hf.print_df(hf.make_tok_df(ex6))

Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,,The,the,det,number,nsubj,DET,DT,determiner,determiner
1,,number,number,nsubj,was,ROOT,NOUN,NN,nominal subject,"noun, singular or mass"
2,,of,of,prep,number,nsubj,ADP,IN,prepositional modifier,"conjunction, subordinating or preposition"
3,FULL_TIME,full-time,full,compound,employees,pobj,ADJ,JJ,,adjective
4,EMP_NOUN,employees,employee,pobj,of,prep,NOUN,NNS,object of preposition,"noun, plural"
5,,of,of,prep,employees,pobj,ADP,IN,prepositional modifier,"conjunction, subordinating or preposition"
6,,the,the,det,Company,pobj,DET,DT,determiner,determiner
7,ORG,Company,company,pobj,of,prep,PROPN,NNP,object of preposition,"noun, proper singular"
8,,was,be,ROOT,was,ROOT,VERB,VBD,,"verb, past tense"
9,CARDINAL,approximately,approximately,advmod,31800,attr,ADV,RB,adverbial modifier,adverb
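
The `head` column above defines a tree, and `find_root_tok` simply follows heads until it reaches the root. The sketch below replays that walk using token indices copied from the table. The `heads` dictionary and the `steps_to_root` name are stand-ins, not part of the notebook, and the root is detected as the token that is its own head rather than through `dep_ == 'ROOT'`. It also shows that the distance depends on the starting token: `extract_emp_relations` calls `find_root_tok(sent[0])`, so its `depth` is measured from the first token of the sentence, not from the employee noun.

```python
# index -> index of head, copied from rows 0-8 of the table above
# (row 9 is omitted because its head, token 10, is not shown).
heads = {
    0: 1,  # The -> number
    1: 8,  # number -> was
    2: 1,  # of -> number
    3: 4,  # full-time -> employees
    4: 2,  # employees -> of
    5: 4,  # of -> employees
    6: 7,  # the -> Company
    7: 5,  # Company -> of
    8: 8,  # was is the root and is its own head
}

def steps_to_root(i, heads):
    """Count head-to-head hops from token i to the root."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps

print(steps_to_root(0, heads))  # from "The"
print(steps_to_root(4, heads))  # from "employees"
```

The first value (2) is the `depth` reported for this sentence, while the employee noun itself is 3 steps from the root.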


In [9]:
hf.print_doc_info(ex6)

doc is: 
The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,3,full-time,FULL_TIME,full-time,compound,,employees,pobj,NOUN
1,4,employees,EMP_NOUN,employees,pobj,object of preposition,of,prep,ADP
2,7,Company,ORG,Company,pobj,object of preposition,of,prep,ADP
3,9,"approximately 31,800",CARDINAL,31800,attr,attribute,was,ROOT,VERB
4,12,"December 31, 2016",DATE,December,pobj,object of preposition,at,prep,ADP
5,17,32300,CARDINAL,32300,conj,conjunct,was,ROOT,VERB
6,19,"December 31, 2015",DATE,December,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,The number,number,,nsubj,nominal subject,was,ROOT,VERB
1,3,full-time employees,employees,EMP_NOUN,pobj,object of preposition,of,prep,ADP
2,6,the Company,Company,ORG,pobj,object of preposition,of,prep,ADP
3,12,December,December,DATE,pobj,object of preposition,at,prep,ADP
4,19,December,December,DATE,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,approximately,approximately,advmod,31800,attr,ADV,RB,adverbial modifier,adverb
1,CARDINAL,31800,31800,attr,was,ROOT,NUM,CD,attribute,cardinal number
2,CARDINAL,32300,32300,conj,was,ROOT,NUM,CD,conjunct,cardinal number


In [10]:
[c for c in ex6[8].conjuncts]

[32,300]

In [11]:
#displacy.render(ex6, style='dep', jupyter=True, options={'distance': 110})

In [13]:
extract_emp_relations(ex6, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  full-time employees of the
Word children :  [full-time, of]
tok_emp_type_subtree:  [full-time]
Flagged toks from tok.children: [full-time]
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: pobj
years: [(15, 2016), (22, 2015)]
emp_counts: [(9, approximately 31,800), (17, 32,300)]
order_indices: [1, 0]
year_emps: [(2016, approximately 31,800), (2015, 32,300)]
Sentence has multiple years:[(15, 2016), (22, 2015)]
First card subtree is :[approximately, 31,800]
years: [(15, 2016), (22, 2015)]
cards: [approximately 31,800, 32,300]
emp_counts: [(9, approximately 31,800), (17, 32,300)]
Root is at 2 steps from employees.
Parts found:  (The number of full-time employees of the Company, was, approximately 31,800, 'Full-Time Employees', [full-time], employees)


[RelationDetails(sent_num=0, word_num=0, subject=The number of full-time employees of the Company, verb=was, quantity=approximately 31,800, quantity_idx=10, quantity_type='Full-Time Employees', type_token=[full-time], word=employees, word_idx=4, word_dep='pobj', depth=2, sentence='The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.')]

#### `Total workforce level at December 31, 2016 was approximately 150,500.`

Example text:  
`"Total workforce level at December 31, 2016 was approximately 150,500."`

Desired output:  
`(workforce, was, 150,500, 'Other', 'Total', workforce)`

In [14]:
ex10 = nlp("Total workforce level at December 31, 2016 was approximately 150,500.")
ex10_emp_tok = ex10[1]
ex10_emp_tok.dep_
#print_df(make_tok_df(ex10))

'compound'

In [15]:
hf.print_doc_info(ex10)

doc is: 
Total workforce level at December 31, 2016 was approximately 150,500.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,workforce,EMP_NOUN,workforce,compound,,level,nsubj,NOUN
1,4,"December 31, 2016",DATE,2016,nsubj,nominal subject,was,ROOT,VERB
2,10,150500,CARDINAL,150500,attr,attribute,was,ROOT,VERB


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,Total workforce level,level,,nsubj,nominal subject,was,ROOT,VERB
1,4,December,December,DATE,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,150500,150500,attr,was,ROOT,NUM,CD,attribute,cardinal number


In [16]:
print('Token children: ')
for w in ex10[10].children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token children: 
approximately       child.dep_:advmod


In [17]:
#displacy.render(ex10, style='dep', jupyter=True, options={'distance': 110})

In [18]:
extract_emp_relations(ex10)

[RelationDetails(sent_num=0, word_num=0, subject=level, verb=was, quantity=150,500, quantity_idx=10, quantity_type='Other Employees', type_token=[], word=workforce, word_idx=1, word_dep='compound', depth=2, sentence='Total workforce level at December 31, 2016 was approximately 150,500.')]

### `Emp_num` above and below `EMP_NOUN` in parse tree

#### One emp_num is a conj child of the other emp_num

Example text:

`At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.`

Desired output: 

`(we, had, 56,400, employees)`

In [19]:
ex7 = nlp("At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.")
ex7_emp_tok = ex7[14]
ex7_emp_num_tok = ex7[11]
print(ex7_emp_tok.dep_)
print(ex7_emp_num_tok)
print([c for c in ex7_emp_num_tok.conjuncts])
#print_df(make_tok_df(ex7))

dobj
56,400
[66,400]


In [20]:
hf.print_doc_info(ex7)

doc is: 
At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,"December 31, 2016 and 2015",DATE,December,pobj,object of preposition,At,prep,ADP
1,10,"approximately 56,400",CARDINAL,56400,nummod,,employees,dobj,NOUN
2,13,66400,CARDINAL,66400,conj,conjunct,56400,nummod,NUM
3,14,employees,EMP_NOUN,employees,dobj,direct object,had,ROOT,VERB


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,December,December,DATE,pobj,object of preposition,At,prep,ADP
1,8,we,we,,nsubj,nominal subject,had,ROOT,VERB
2,10,"approximately 56,400 and 66,400 employees",employees,EMP_NOUN,dobj,direct object,had,ROOT,VERB


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,approximately,approximately,advmod,56400,nummod,ADV,RB,adverbial modifier,adverb
1,CARDINAL,56400,56400,nummod,employees,dobj,NUM,CD,,cardinal number
2,CARDINAL,66400,66400,conj,56400,nummod,NUM,CD,conjunct,cardinal number


In [21]:
print('Token children: ')
for w in ex7_emp_num_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token children: 
approximately       child.dep_:advmod
and       child.dep_:cc
66,400       child.dep_:conj


In [22]:
#displacy.render(ex7, style='dep', jupyter=True, options={'distance': 110})

In [23]:
extract_emp_relations(ex7, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  approximately 56,400 and 66,400
Word children :  [56,400]
Num_toks are: [56,400]
num_tok has conjugate children:[66,400]
num_tok subtree is :[approximately, 56,400, and, 66,400]
years: [(4, 2016), (6, 2015)]
cards: [(11, 56,400), (13, 66,400)]
order_indices: [1, 0]
year_emps: [(2016, 56,400), (2015, 66,400)]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: []
tok_head_empnoun_subtreee: [employees]
No toks, returning 0.
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Parts found:  (we, had, 56,400, 'Other Employees', [], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=56,400, quantity_idx=11, quantity_type='Other Employees', type_token=[], word=employees, word_idx=14, word_dep='dobj', depth=1, sentence='At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.')]

#### `At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 and 2,843 employees, respectively.`

Desired output: 
`(we, had, 3,066, 'Other Employees', 0, employees)`

In [24]:
ex12 = nlp("At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 and 2,843 employees, respectively.")

ex12_emp_tok = ex12[17]
# Count tokens, last to first: 2,843, 2,982, 3,066
ex12_emp_num_tok = ex12[16]
ex12_emp_num_tok_2 = ex12[14]
ex12_emp_num_tok_3 = ex12[12]

print(ex12_emp_tok)
print(ex12_emp_tok.dep_)
print(ex12_emp_num_tok)
print([c for c in ex12_emp_num_tok_3.conjuncts])

#print_df(make_tok_df(ex12))

employees
conj
2,843
[2,982]


In [25]:
hf.print_doc_info(ex12)

doc is: 
At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 and 2,843 employees, respectively.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,"March 31, 2016, 2015 and 2014",DATE,March,pobj,object of preposition,At,prep,ADP
1,12,3066,CARDINAL,3066,dobj,direct object,had,ROOT,VERB
2,14,2982,CARDINAL,2982,conj,conjunct,3066,dobj,NUM
3,16,2843,CARDINAL,2843,nummod,,employees,conj,NOUN
4,17,employees,EMP_NOUN,employees,conj,conjunct,2982,conj,NUM


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,March,March,DATE,pobj,object of preposition,At,prep,ADP
1,10,we,we,,nsubj,nominal subject,had,ROOT,VERB
2,16,"2,843 employees",employees,EMP_NOUN,conj,conjunct,2982,conj,NUM


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,3066,3066,dobj,had,ROOT,NUM,CD,direct object,cardinal number
1,CARDINAL,2982,2982,conj,3066,dobj,NUM,CD,conjunct,cardinal number
2,CARDINAL,2843,2843,nummod,employees,conj,NUM,CD,,cardinal number


In [26]:
#displacy.render(ex12, style='dep', jupyter=True)

In [27]:
extract_emp_relations(ex12, verbose = True)

Word_id is : 0
Word is : employees
Word subtree is :  2,843
Word children :  [2,843]
Num_toks are: [2,843]
Candidate tok:  2,843
Candidate tok.pos_:   NUM
Candidate tok.dep_:   nummod
No toks, returning 0.
Dep_ of EMP_NOUN is: conj
Emp_noun token has dep_ == 'conj'.
Child num_tok: 2,843
Head num_tok: [2,982]
child_num_tok and head_num_toks
years: [(4, 2016), (6, 2015), (8, 2014)]
num_toks: [2,843, 2,982]
emp_counts: [(12, 3,066), (14, 2,982), (16, 2,843)]
order_indices: [2, 1, 0]
year_emps: [(2016, 3,066), (2015, 2,982), (2014, 2,843)]
Parts found:  (we, had, 3,066, 'Other Employees', [], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=3,066, quantity_idx=12, quantity_type='Other Employees', type_token=[], word=employees, word_idx=17, word_dep='conj', depth=1, sentence='At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 and 2,843 employees, respectively.')]

#### `We had a total of 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively.`

Desired output: 
`(we, had, 9,832, 'Other Employees', 0, employees)`

In [28]:
ex13 = nlp("We had a total of 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively.")
#ex13 = nlp("We had 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively.")

ex13_emp_tok = ex13[11]
# Count tokens, last to first: 8,806, 9,058, 9,832
ex13_emp_num_tok = ex13[10]
ex13_emp_num_tok_2 = ex13[7]
ex13_emp_num_tok_3 = ex13[5]

print(ex13_emp_tok)
print(ex13_emp_tok.dep_)
print(ex13_emp_num_tok)
print([c for c in ex13_emp_num_tok_2.conjuncts])

#print_df(make_tok_df(ex13))

employees
conj
8,806
[employees]


In [29]:
hf.print_doc_info(ex13)

doc is: 
We had a total of 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,5,9832,CARDINAL,9832,pobj,object of preposition,of,prep,ADP
1,7,9058,CARDINAL,9058,appos,appositional modifier,9832,pobj,NUM
2,10,8806,CARDINAL,8806,nummod,,employees,conj,NOUN
3,11,employees,EMP_NOUN,employees,conj,conjunct,9058,appos,NUM
4,14,"December 31, 2016, 2015",DATE,December,pobj,object of preposition,of,prep,ADP
5,22,2014,DATE,2014,conj,conjunct,December,pobj,PROPN


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,We,We,,nsubj,nominal subject,had,ROOT,VERB
1,2,a total,total,,dobj,direct object,had,ROOT,VERB
2,10,"8,806 employees",employees,EMP_NOUN,conj,conjunct,9058,appos,NUM
3,14,December,December,DATE,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,9832,9832,pobj,of,prep,NUM,CD,object of preposition,cardinal number
1,CARDINAL,9058,9058,appos,9832,pobj,NUM,CD,appositional modifier,cardinal number
2,CARDINAL,8806,8806,nummod,employees,conj,NUM,CD,,cardinal number


In [30]:
#displacy.render(ex13, style='dep', jupyter=True)

In [31]:
extract_emp_relations(ex13, verbose = True)

Word_id is : 0
Word is : employees
Word subtree is :  8,806 employees as of December 31, 2016, 2015, and 2014,
Word children :  [8,806, as, respectively]
Num_toks are: [8,806]
Candidate tok:  8,806
Candidate tok.pos_:   NUM
Candidate tok.dep_:   nummod
No toks, returning 0.
Dep_ of EMP_NOUN is: conj
Emp_noun token has dep_ == 'conj'.
Child num_tok: 8,806
Head num_tok: [9,058]
child_num_tok and head_num_toks
years: [(17, 2016), (19, 2015), (22, 2014)]
num_toks: [8,806, 9,058]
emp_counts: [(5, 9,832), (7, 9,058), (10, 8,806)]
order_indices: [2, 1, 0]
year_emps: [(2016, 9,832), (2015, 9,058), (2014, 8,806)]
Parts found:  (We, had, 9,832, 'Other Employees', [], employees)


[RelationDetails(sent_num=0, word_num=0, subject=We, verb=had, quantity=9,832, quantity_idx=5, quantity_type='Other Employees', type_token=[], word=employees, word_idx=11, word_dep='conj', depth=1, sentence='We had a total of 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively.')]
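
In the three-count sentences the parser attaches only the last count to `employees` as a `nummod`; the earlier counts are reached by walking up through `conj` and `appos` heads. The sketch below illustrates that walk on a hand-copied fragment of the `ex13` parse rather than on spaCy tokens. `Tok`, `EX13` and `collect_counts` are hypothetical names for illustration, not the notebook's `extract_emp_relations`.

```python
from typing import Dict, List, NamedTuple


class Tok(NamedTuple):
    text: str
    pos: str
    dep: str
    head: int  # index of the head token


# Fragment of the ex13 parse, copied from the tables above (index -> token).
# Assumption: only the tokens needed for the walk are included.
EX13: Dict[int, Tok] = {
    5: Tok("9,832", "NUM", "pobj", 4),
    7: Tok("9,058", "NUM", "appos", 5),
    10: Tok("8,806", "NUM", "nummod", 11),
    11: Tok("employees", "NOUN", "conj", 7),
}


def collect_counts(tokens: Dict[int, Tok], emp_idx: int) -> List[str]:
    """Return count tokens linked to the employee noun, in sentence order."""
    found = set()
    # Counts attached directly to the noun, e.g. "8,806 employees".
    for i, tok in tokens.items():
        if tok.head == emp_idx and tok.pos == "NUM" and tok.dep == "nummod":
            found.add(i)
    # Walk up through coordination/apposition links to earlier counts.
    i = emp_idx
    while tokens[i].dep in ("conj", "appos") and tokens[i].head in tokens:
        i = tokens[i].head
        if tokens[i].pos == "NUM":
            found.add(i)
    return [tokens[i].text for i in sorted(found)]


print(collect_counts(EX13, 11))  # ['9,832', '9,058', '8,806']
```

This walk only follows heads. When the noun attaches directly to the first count and the middle count is a sibling conjunct, the siblings would have to be collected as well, for example through spaCy's `Token.conjuncts`.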

#### `We had 17,912, 14,533 and 10,625 employees as of December 31, 2014, 2015 and 2016, respectively.`

Desired output: 
`(we, had, 10,625, 'Other Employees', 0, employees)`

In [36]:
ex14 = nlp("We had 17,912, 14,533 and 10,625 employees as of December 31, 2014, 2015 and 2016, respectively.")

#print_df(make_tok_df(ex14))

ex14_emp_tok = ex14[7]
# Count tokens, last to first: 10,625, 14,533, 17,912
ex14_emp_num_tok = ex14[6]
ex14_emp_num_tok_2 = ex14[4]
ex14_emp_num_tok_3 = ex14[2]

In [37]:
print(ex14_emp_tok)
print(ex14_emp_tok.dep_)
print(ex14_emp_num_tok)
print([c for c in ex14_emp_num_tok_3.conjuncts])
hf.print_doc_info(ex14)

employees
conj
10,625
[14,533, employees]
doc is: 
We had 17,912, 14,533 and 10,625 employees as of December 31, 2014, 2015 and 2016, respectively.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,17912,CARDINAL,17912,dobj,direct object,had,ROOT,VERB
1,4,14533,CARDINAL,14533,conj,conjunct,17912,dobj,NUM
2,6,10625,CARDINAL,10625,nummod,,employees,conj,NOUN
3,7,employees,EMP_NOUN,employees,conj,conjunct,17912,dobj,NUM
4,10,"December 31, 2014, 2015 and 2016",DATE,December,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,We,We,,nsubj,nominal subject,had,ROOT,VERB
1,6,"10,625 employees",employees,EMP_NOUN,conj,conjunct,17912,dobj,NUM
2,10,December,December,DATE,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,17912,17912,dobj,had,ROOT,NUM,CD,direct object,cardinal number
1,CARDINAL,14533,14533,conj,17912,dobj,NUM,CD,conjunct,cardinal number
2,CARDINAL,10625,10625,nummod,employees,conj,NUM,CD,,cardinal number


In [46]:
#displacy.render(ex14, style='dep', jupyter=True)

In [38]:
extract_emp_relations(ex14, verbose = True)

Word_id is : 0
Word is : employees
Word subtree is :  10,625 employees as of December 31, 2014, 2015 and 2016
Word children :  [10,625, as]
Num_toks are: [10,625]
Candidate tok:  10,625
Candidate tok.pos_:   NUM
Candidate tok.dep_:   nummod
No toks, returning 0.
Dep_ of EMP_NOUN is: conj
Emp_noun token has dep_ == 'conj'.
Child num_tok: 10,625
Head num_tok: [17,912]
child_num_tok and head_num_toks
years: [(13, 2014), (15, 2015), (17, 2016)]
num_toks: [10,625, 17,912]
emp_counts: [(2, 17,912), (4, 14,533), (6, 10,625)]
order_indices: [2, 1, 0]
year_emps: [(2016, 10,625), (2015, 14,533), (2014, 17,912)]
Parts found:  (We, had, 10,625, 'Other Employees', [], employees)


[RelationDetails(sent_num=0, word_num=0, subject=We, verb=had, quantity=10,625, quantity_type='Other Employees', type_token=[], word=employees, word_dep='conj', depth=1, sentence='We had 17,912, 14,533 and 10,625 employees as of December 31, 2014, 2015 and 2016, respectively.')]
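
The `year_emps` lines in the verbose logs pair each count with the year in the same position ("respectively"), and the reported quantity is the one for the most recent year. A minimal sketch of that alignment follows, assuming both lists are already extracted as `(token_index, value)` pairs. `align_counts_to_years` is a hypothetical helper, not the notebook's code.

```python
from typing import List, Tuple


def align_counts_to_years(
    years: List[Tuple[int, int]],
    counts: List[Tuple[int, str]],
) -> List[Tuple[int, str]]:
    """Pair the n-th year with the n-th count, most recent year first."""
    if len(years) != len(counts):
        raise ValueError("'respectively' pairing needs equal-length lists")
    # Sort each list by token index so pairing follows sentence order.
    ordered_years = [y for _, y in sorted(years)]
    ordered_counts = [c for _, c in sorted(counts)]
    pairs = list(zip(ordered_years, ordered_counts))
    # Most recent year first, so the first pair holds the current headcount.
    return sorted(pairs, key=lambda p: p[0], reverse=True)


# Values copied from the ex14 log above.
years = [(13, 2014), (15, 2015), (17, 2016)]
counts = [(2, "17,912"), (4, "14,533"), (6, "10,625")]
year_emps = align_counts_to_years(years, counts)
print(year_emps)     # [(2016, '10,625'), (2015, '14,533'), (2014, '17,912')]
print(year_emps[0])  # (2016, '10,625')
```

Raising on a length mismatch is a deliberate choice in this sketch: a sentence with two years and three counts cannot be paired positionally, and guessing would silently produce a wrong headcount.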

### Dealing with units

#### `The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively.`

Example text:  

`"The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively."`

In [32]:
thousands_sent_doc = nlp("The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively.")
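# thousands_sent_doc[4] is "employees"; three heads up is the ROOT verb "was".
# Subtree of the verb's first right child (the quantity phrase):
print(list(list(thousands_sent_doc[4].head.head.head.rights)[0].subtree))

# Left children of the verb (the subject):
print([r for r in thousands_sent_doc[4].head.head.head.lefts])
hf.print_df(hf.make_tok_df(thousands_sent_doc))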



[71.1, thousand, ,, 73.5, thousand, ,, and, 75.3, thousand]
[number]


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,,The,the,det,number,nsubj,DET,DT,determiner,determiner
1,,number,number,nsubj,was,ROOT,NOUN,NN,nominal subject,"noun, singular or mass"
2,,of,of,prep,number,nsubj,ADP,IN,prepositional modifier,"conjunction, subordinating or preposition"
3,,regular,regular,amod,employees,pobj,ADJ,JJ,adjectival modifier,adjective
4,EMP_NOUN,employees,employee,pobj,of,prep,NOUN,NNS,object of preposition,"noun, plural"
5,,was,be,ROOT,was,ROOT,VERB,VBD,,"verb, past tense"
6,CARDINAL,71.1,71.1,compound,thousand,attr,NUM,CD,,cardinal number
7,CARDINAL,thousand,thousand,attr,was,ROOT,NUM,CD,attribute,cardinal number
8,,",",",",punct,thousand,attr,PUNCT,",",punctuation,"punctuation mark, comma"
9,CARDINAL,73.5,73.5,compound,thousand,appos,NUM,CD,,cardinal number


In [33]:
hf.print_doc_info(thousands_sent_doc)

doc is: 
The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,4,employees,EMP_NOUN,employees,pobj,object of preposition,of,prep,ADP
1,6,71.1 thousand,CARDINAL,thousand,attr,attribute,was,ROOT,VERB
2,9,73.5 thousand,CARDINAL,thousand,appos,appositional modifier,thousand,attr,NUM
3,13,75.3 thousand,CARDINAL,thousand,conj,conjunct,thousand,attr,NUM
4,16,years ended 2016,DATE,ended,advcl,adverbial clause modifier,was,ROOT,VERB
5,20,2015,DATE,2015,conj,conjunct,2016,npadvmod,NUM
6,22,2014,DATE,2014,conj,conjunct,2015,conj,NUM


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,The number,number,,nsubj,nominal subject,was,ROOT,VERB
1,3,regular employees,employees,EMP_NOUN,pobj,object of preposition,of,prep,ADP
2,16,years,years,DATE,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,71.1,71.1,compound,thousand,attr,NUM,CD,,cardinal number
1,CARDINAL,thousand,thousand,attr,was,ROOT,NUM,CD,attribute,cardinal number
2,CARDINAL,73.5,73.5,compound,thousand,appos,NUM,CD,,cardinal number
3,CARDINAL,thousand,thousand,appos,thousand,attr,NUM,CD,appositional modifier,cardinal number
4,CARDINAL,75.3,75.3,compound,thousand,conj,NUM,CD,,cardinal number
5,CARDINAL,thousand,thousand,conj,thousand,attr,NUM,CD,conjunct,cardinal number


In [34]:
nsub = thousands_sent_doc[1]

print(thousands_sent_doc[nsub.left_edge.i : nsub.right_edge.i + 1])

# Testing custom flags
year_tok = thousands_sent_doc[18]
print(year_tok._.is_year)

thousand_tok = thousands_sent_doc[7]
print(thousand_tok._.is_num_word)

The number of regular employees
True
True


In [35]:
displacy.render(thousands_sent_doc, style='ent', jupyter=True)

In [36]:
#displacy.render(thousands_sent_doc, jupyter=True)

In [37]:
extract_emp_relations(thousands_sent_doc, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  regular
Word children :  [regular]
Dep_ of EMP_NOUN is: pobj
years: [(18, 2016), (20, 2015), (22, 2014)]
emp_counts: [(6, 71.1 thousand), (9, 73.5 thousand), (13, 75.3 thousand)]
order_indices: [2, 1, 0]
year_emps: [(2016, 71.1 thousand), (2015, 73.5 thousand), (2014, 75.3 thousand)]
Sentence has multiple years:[(18, 2016), (20, 2015), (22, 2014)]
First card subtree is :[71.1, thousand, ,, 73.5, thousand, ,, and, 75.3, thousand]
years: [(18, 2016), (20, 2015), (22, 2014)]
cards: [71.1 thousand, 73.5 thousand, 75.3 thousand]
emp_counts: [(6, 71.1 thousand), (9, 73.5 thousand), (13, 75.3 thousand)]
Root is at 2 steps from employees.
Parts found:  (The number of regular employees, was, 71.1 thousand, 'Other Employees', [regular], employees)


[RelationDetails(sent_num=0, word_num=0, subject=The number of regular employees, verb=was, quantity=71.1 thousand, quantity_idx=7, quantity_type='Other Employees', type_token=[regular], word=employees, word_idx=4, word_dep='pobj', depth=2, sentence='The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively.')]
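
The extracted quantity here is still the string `71.1 thousand`. A minimal sketch of turning such strings into integers follows, assuming a small hypothetical multiplier table (`NUMBER_WORDS`) and helper name (`quantity_to_int`). The notebook's own handling via `_.is_num_word` may differ.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed multipliers; extend as needed.
NUMBER_WORDS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}

# A number such as "31,800" or "71.1", optionally followed by a word.
QUANTITY_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)(?:\s+([A-Za-z]+))?")


def quantity_to_int(text: str) -> Optional[int]:
    """Convert e.g. 'approximately 71.1 thousand' to 71100; None if no number."""
    match = QUANTITY_RE.search(text)
    if match is None:
        return None
    # Decimal avoids float rounding on values like 71.1 * 1000.
    value = Decimal(match.group(1).replace(",", ""))
    word = (match.group(2) or "").lower()
    # Words that are not number words (e.g. "at") leave the value unscaled.
    return int(value * NUMBER_WORDS.get(word, 1))


print(quantity_to_int("71.1 thousand"))         # 71100
print(quantity_to_int("approximately 31,800"))  # 31800
```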

### Identifying full-time, part-time, etc.

#### `As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.`

Example text:

`As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.`  

Desired output: 

`(we, employed, 41,000, full-time Team Members)`   
`(we, employed, 33,000, part-time Team Members)`

In [38]:
ex8 = nlp("As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.")

#hf.print_df(make_tok_df(ex8))

In [39]:
hf.print_doc_info(ex8)

doc is: 
As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,"February 23, 2017",DATE,February,pobj,object of preposition,of,prep,ADP
1,9,"approximately 41,000",CARDINAL,41000,nummod,,Team Members,dobj,PROPN
2,11,full-time,FULL_TIME,full-time,compound,,Team Members,dobj,PROPN
3,12,Team Members,EMP_NOUN,Team Members,dobj,direct object,employed,ROOT,VERB
4,14,"approximately 33,000",CARDINAL,33000,nummod,,Team Members,conj,PROPN
5,16,part-time,PART_TIME,part-time,compound,,Team Members,conj,PROPN
6,17,Team Members,EMP_NOUN,Team Members,conj,conjunct,Team Members,dobj,PROPN


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,February,February,DATE,pobj,object of preposition,of,prep,ADP
1,7,we,we,,nsubj,nominal subject,employed,ROOT,VERB
2,9,"approximately 41,000 full-time Team Members",Team Members,EMP_NOUN,dobj,direct object,employed,ROOT,VERB
3,14,"approximately 33,000 part-time Team Members",Team Members,EMP_NOUN,conj,conjunct,Team Members,dobj,PROPN


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,approximately,approximately,advmod,41000,nummod,ADV,RB,adverbial modifier,adverb
1,CARDINAL,41000,41000,nummod,Team Members,dobj,NUM,CD,,cardinal number
2,CARDINAL,approximately,approximately,advmod,33000,nummod,ADV,RB,adverbial modifier,adverb
3,CARDINAL,33000,33000,nummod,Team Members,conj,NUM,CD,,cardinal number


In [40]:
ex8_emp_tok = ex8[12]
ex8_emp_num_tok = ex8[10]
ex8_emp_tok_2 = ex8[17]

print(ex8[ex8_emp_tok.i - 1])
print(check_emp_type_flags(find_emp_type_toks(ex8_emp_tok)))
print("ex8_emp_tok.child emp_type : ", [t for t in ex8_emp_tok.children if t._.is_emp_type])
print("ex8_emp_tok.child emp_type pos_: ", [t for t in ex8_emp_tok.children if t._.is_emp_type][0].pos_)
print("ex8_emp_tok_2.child emp_type : ", [t for t in ex8_emp_tok_2.children if t._.is_emp_type])
print("ex8_emp_tok_2.child emp_type pos_: ", [t for t in ex8_emp_tok_2.children if t._.is_emp_type][0].pos_)
print(ex8_emp_tok.dep_)
print(ex8_emp_num_tok)
print([c for c in ex8_emp_tok.conjuncts])

full-time
Full-Time Employees
ex8_emp_tok.child emp_type :  [full-time]
ex8_emp_tok.child emp_type pos_:  ADJ
ex8_emp_tok_2.child emp_type :  [part-time]
ex8_emp_tok_2.child emp_type pos_:  ADJ
dobj
41,000
[Team Members]


In [41]:
print('Token children: ')
for w in ex8_emp_tok.children:
    print(str(w) + '       child.dep_:' + str(w.dep_))

Token children: 
41,000       child.dep_:nummod
full-time       child.dep_:compound
and       child.dep_:cc
Team Members       child.dep_:conj


In [42]:
extract_emp_relations(ex8, verbose = True)

Word_id is : 0
Word is : Team Members
Word subtree is :  approximately 41,000 full-time Team Members and approximately 33,000 part-time
Word children :  [41,000, full-time, and, Team Members]
Num_toks are: [41,000]
tok_emp_type_subtree:  [full-time, part-time]
Flagged toks from tok.children: [full-time]
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from Team Members.
Parts found:  (we, employed, 41,000, 'Full-Time Employees', [full-time], Team Members)
Word_id is : 1
Word is : Team Members
Word subtree is :  approximately 33,000 part-time
Word children :  [33,000, part-time]
Num_toks are: [33,000]
tok_emp_type_subtree:  [part-time]
Flagged toks from tok.children: [part-time]
Flagged_toks:  [part-time]
Dep_ of EMP_NOUN is: conj
Emp_noun token has dep_ == 'conj'.
Child num_tok: 33,000
Head num_tok: []
Parts found:  (we, employed, 33,000, 'Part-Time Employees', [part-time], Team Members)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=employed, quantity=41,000, quantity_idx=10, quantity_type='Full-Time Employees', type_token=[full-time], word=Team Members, word_idx=12, word_dep='dobj', depth=1, sentence='As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.'),
 RelationDetails(sent_num=0, word_num=1, subject=we, verb=employed, quantity=33,000, quantity_idx=15, quantity_type='Part-Time Employees', type_token=[part-time], word=Team Members, word_idx=17, word_dep='conj', depth=1, sentence='As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.')]
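
The `quantity_type` values above come from the type token attached to each `EMP_NOUN`. A minimal sketch of such a mapping on plain strings follows. `EMP_TYPE_LABELS` and `label_emp_type` are hypothetical stand-ins, not the notebook's `check_emp_type_flags`; the fallback label `'Other Employees'` is taken from the outputs in this notebook.

```python
import re
from typing import Iterable

# Assumed lookup from normalized modifier to label.
EMP_TYPE_LABELS = {
    "full time": "Full-Time Employees",
    "part time": "Part-Time Employees",
}


def label_emp_type(modifiers: Iterable[str]) -> str:
    """Return the label for the first recognized modifier, else the fallback."""
    for mod in modifiers:
        # Treat "full-time", "Full Time" and "full  time" alike.
        key = re.sub(r"[-\s]+", " ", mod.strip().lower())
        if key in EMP_TYPE_LABELS:
            return EMP_TYPE_LABELS[key]
    return "Other Employees"


print(label_emp_type(["full-time"]))  # Full-Time Employees
print(label_emp_type(["part-time"]))  # Part-Time Employees
print(label_emp_type(["regular"]))    # Other Employees
```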

#### `As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.`

Example text:

`As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.`  

Desired output: 

`(we, had, 19,000, (0, 'Other'), employees)`  
`(we, had, 18,000, ('full-time', 'Full-Time'), employees)`  

In [43]:
ex9 = nlp("As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.")

#hf.print_df(make_tok_df(ex9))

In [44]:
hf.print_doc_info(ex9)

doc is: 
As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,"September 30, 2016",DATE,September,pobj,object of preposition,of,prep,ADP
1,9,"approximately 19,000",CARDINAL,19000,nummod,,employees,dobj,NOUN
2,11,employees,EMP_NOUN,employees,dobj,direct object,had,ROOT,VERB
3,15,"approximately 18,000",CARDINAL,18000,nsubj,nominal subject,were,relcl,VERB
4,18,full-time,FULL_TIME,full-time,compound,,employees,attr,NOUN
5,19,employees,EMP_NOUN,employees,attr,attribute,were,relcl,VERB


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,September,September,DATE,pobj,object of preposition,of,prep,ADP
1,7,we,we,,nsubj,nominal subject,had,ROOT,VERB
2,9,"approximately 19,000 employees",employees,EMP_NOUN,dobj,direct object,had,ROOT,VERB
3,18,full-time employees,employees,EMP_NOUN,attr,attribute,were,relcl,VERB


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,approximately,approximately,advmod,19000,nummod,ADV,RB,adverbial modifier,adverb
1,CARDINAL,19000,19000,nummod,employees,dobj,NUM,CD,,cardinal number
2,CARDINAL,approximately,approximately,advmod,18000,nsubj,ADV,RB,adverbial modifier,adverb
3,CARDINAL,18000,18000,nsubj,were,relcl,NUM,CD,nominal subject,cardinal number


In [45]:
ex9_emp_tok = ex9[11]
ex9_emp_tok_2 = ex9[19]
ex9_emp_num_tok = ex9[16]
print(ex9[ex9_emp_tok.i - 1])
print(check_emp_type_flags(find_emp_type_toks(ex9_emp_tok_2)))
print([t for t in ex9_emp_tok.children if t._.is_emp_type])
print([t for t in ex9_emp_tok_2.children if t._.is_emp_type])
print(ex9_emp_tok.dep_)
print(ex9_emp_num_tok)
print([c for c in ex9_emp_tok.conjuncts])

19,000
Full-Time Employees
[]
[full-time]
dobj
18,000
[]


In [46]:
displacy.render(ex9, style='ent', jupyter=True, options={'distance': 110})

In [47]:
list(ex9[19].head.lefts)

[18,000]

In [48]:
extract_emp_relations(ex9, verbose = True)

Word_id is : 0
Word is : employees
Word subtree is :  approximately 19,000 employees, of which approximately 18,000 were full-time
Word children :  [19,000, ,, were]
Num_toks are: [19,000]
tok_emp_type_subtree:  [full-time]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: [full-time]
tok_head_empnoun_subtreee: [employees, employees]
Flagged toks from tok_head_emptype_subtreee: [full-time]
Flagged tok heads: [employees]
Flagged tok head deps: ['attr']
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Parts found:  (we, had, 19,000, 'Full-Time Employees', [full-time], employees)
Word_id is : 1
Word is : employees
Word subtree is :  full-time
Word children :  [full-time]
tok_emp_type_subtree:  [full-time]
Flagged toks from tok.children: [full-time]
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: attr
Root is at 1 steps from employees.
Parts found:  (we, had, approximately 18,000, 'Full-Time Employees', [full-time], employees)

[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=19,000, quantity_idx=10, quantity_type='Full-Time Employees', type_token=[full-time], word=employees, word_idx=11, word_dep='dobj', depth=1, sentence='As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.'),
 RelationDetails(sent_num=0, word_num=1, subject=we, verb=had, quantity=approximately 18,000, quantity_idx=16, quantity_type='Full-Time Employees', type_token=[full-time], word=employees, word_idx=19, word_dep='attr', depth=1, sentence='As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.')]
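
The first relation above is labelled `'Full-Time Employees'` although the desired output for the 19,000 figure is `'Other'`: the log shows `full-time` being picked up from the head's subtree even though its own head is the second `employees` token. One possible guard is to drop a candidate type token when it modifies a different `EMP_NOUN`. The sketch below shows this on a toy table instead of spaCy tokens; `Row`, `EX9` and `type_tokens_for` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Row(NamedTuple):
    text: str
    head: int          # index of the head token
    is_emp_type: bool  # e.g. "full-time"
    is_emp_noun: bool  # e.g. "employees"


# Toy fragment of the ex9 parse (index -> row), copied from the tables above.
EX9: Dict[int, Row] = {
    8: Row("had", 8, False, False),          # ROOT is its own head
    11: Row("employees", 8, False, True),
    17: Row("were", 11, False, False),
    18: Row("full-time", 19, True, False),
    19: Row("employees", 17, False, True),
}


def type_tokens_for(rows: Dict[int, Row], emp_idx: int,
                    candidates: List[int]) -> List[str]:
    """Keep candidate type tokens unless they modify a different EMP_NOUN."""
    kept = []
    for i in candidates:
        head = rows[i].head
        if rows[head].is_emp_noun and head != emp_idx:
            continue  # belongs to another employee noun
        kept.append(rows[i].text)
    return kept


# The same candidate (token 18) is offered for both nouns.
print(type_tokens_for(EX9, 11, [18]))  # []
print(type_tokens_for(EX9, 19, [18]))  # ['full-time']
```

With an empty result for the first noun, the 19,000 relation would fall back to `'Other Employees'`, which matches the desired output.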

### Dealing with sub-clauses

#### `Including our full and part-time personnel, we estimate that we have the equivalent of 12 full time employees.`

In [49]:
ex11 = nlp("Including our full and part-time personnel, we estimate that we have the equivalent of 12 full time employees.")
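ex11_emp_tok = ex11[17]  # 'employees'
# Walk up the dependency heads: employees -> of -> equivalent -> have
ex11_verb = ex11_emp_tok.head.head.head
# Subtree of the first right child of 'have' (its direct object, 'equivalent')
print(list(list(ex11_verb.rights)[0].subtree))

# Left children of 'have': the marker 'that' and the subject 'we'
print([r for r in ex11_verb.lefts])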
hf.print_df(hf.make_tok_df(ex11))

[the, equivalent, of, 12, full time, employees]
[that, we]


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,,Including,include,prep,estimate,ROOT,VERB,VBG,prepositional modifier,"verb, gerund or present participle"
1,,our,-PRON-,poss,personnel,pobj,ADJ,PRP$,possession modifier,"pronoun, possessive"
2,,full,full,amod,personnel,pobj,ADJ,JJ,adjectival modifier,adjective
3,,and,and,cc,full,amod,CCONJ,CC,coordinating conjunction,"conjunction, coordinating"
4,PART_TIME,part-time,part,conj,full,amod,ADJ,JJ,conjunct,adjective
5,,personnel,personnel,pobj,Including,prep,NOUN,NNS,object of preposition,"noun, plural"
6,,",",",",punct,estimate,ROOT,PUNCT,",",punctuation,"punctuation mark, comma"
7,,we,-PRON-,nsubj,estimate,ROOT,PRON,PRP,nominal subject,"pronoun, personal"
8,,estimate,estimate,ROOT,estimate,ROOT,VERB,VBP,,"verb, non-3rd person singular present"
9,,that,that,mark,have,ccomp,ADP,IN,marker,"conjunction, subordinating or preposition"


In [50]:
hf.print_doc_info(ex11)

doc is: 
Including our full and part-time personnel, we estimate that we have the equivalent of 12 full time employees.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,4,part-time,PART_TIME,part-time,conj,conjunct,full,amod,ADJ
1,15,12,CARDINAL,12,nummod,,employees,pobj,NOUN
2,16,full time,FULL_TIME,full time,compound,,employees,pobj,NOUN
3,17,employees,EMP_NOUN,employees,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,our full and part-time personnel,personnel,,pobj,object of preposition,Including,prep,VERB
1,7,we,we,,nsubj,nominal subject,estimate,ROOT,VERB
2,10,we,we,,nsubj,nominal subject,have,ccomp,VERB
3,12,the equivalent,equivalent,,dobj,direct object,have,ccomp,VERB
4,15,12 full time employees,employees,EMP_NOUN,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,12,12,nummod,employees,pobj,NUM,CD,,cardinal number


In [51]:
extract_emp_relations(ex11)

[RelationDetails(sent_num=0, word_num=0, subject=we, verb=have, quantity=12, quantity_idx=15, quantity_type='Full-Time Employees', type_token=[full time], word=employees, word_idx=17, word_dep='pobj', depth=1, sentence='Including our full and part-time personnel, we estimate that we have the equivalent of 12 full time employees.')]

### Testing on sentences

In [52]:
test_sents = ["As of September 30, 2016, we employed approximately 7,300 employees world-wide.", 
"As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.", 
"At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.", 
"At December 31, 2016, we had approximately 9,400 full-time employees.", 
"As of October 29, 2016, we employed approximately 10,000 individuals worldwide.", 
"The number of full-time employees of the Company was approximately 31,800 at December 31, 2016 and 32,300 at December 31, 2015.", 
"As of December 31, 2016, we had 1,469 total employees.", 
"ADP employed approximately 57,000 persons as of June 30, 2016.", 
"At December 31, 2016, we employed approximately 26,400 employees.", 
"The Company and its subsidiaries employed 1,562 persons at December 31, 2016, 114 of whom are covered by a collective bargaining agreement with District 10 of the International Association of Machinists.", 
"As of December 31, 2016, the Company had 455 employees, an increase of 17 employees from the prior year end.", 
"As of December 31, 2016, we had approximately 17,500 employees worldwide.", 
"Based in Neenah, Wisconsin, at December 31, 2016, the Company employed approximately 17,500 individuals and had 59 manufacturing facilities.", 
"As of January 31, 2017, we employed 7,683 individuals.", 
"At December 31, 2016, Bio-Rad had approximately 8,250 employees.", 
"As of December 31, 2016, we had approximately 8,500 full-time employees and 600 contractors.", 
"Alcoa's total worldwide employment at the end of 2016 was approximately 14,000 employees in 15 countries.", 
"As of December 31, 2016, we employed approximately 2,100 people.", 
"At December 31, 2016, the Company had approximately 11,500 employees.",
"As of February 23, 2017, we employed approximately 41,000 full-time Team Members and approximately 33,000 part-time Team Members.", 
"As of December 31, 2016, we had 699 full-time employees and 202 temporary employees.", 
"As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.", 
"As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots.", 
"The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively.",
"We are a small company with approximately 61 employees.",
"Total workforce level at December 31, 2016 was approximately 150,500.",           
"Currently, the Company and its subsidiaries have an aggregate of 35 employees.",
"we employ only 31 employees", 
"We currently have 21 employees",
"We currently employ 26 full-time employees",
"Including our full and part-time personnel, we estimate that we have the equivalent of 12 full time employees.",
"As a REIT, we employ only 31 employees and have a cost-effective management structure."
]

Nominal subjects are almost always to the left of the root token. The cell below prints the test sentences that contain more than one `nsubj` token.

In [53]:
for i, sent in enumerate(test_sents):
    doc = nlp(sent)  # parse once and reuse the Doc
    nsubs = [tok for tok in doc if tok.dep_ == 'nsubj']
    if len(nsubs) > 1:
        print(str(i))
        for i2, tok in enumerate(nsubs):
            root = find_root_tok(tok)[0]
            print("Tok "+str(i2 + 1) + " of " + str(len(nsubs)))
            print("Token root: " + str(root))
            print("Token side of root: " + find_tok_side_of_root(tok, root))
            print("Sentence is :" + doc.text)
            print("Token, dep_, index :" + str(tok) + " " + str(tok.dep_) + " " + str(tok.i))
            print("POS, tag, lemma :" + str(tok.pos_) + " " + str(tok.tag_) + " " + str(tok.lemma_))

21
Tok 1 of 2
Token root: had
Token side of root: left
Sentence is :As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.
Token, dep_, index :we nsubj 7
POS, tag, lemma :PRON PRP -PRON-
Tok 2 of 2
Token root: had
Token side of root: right
Sentence is :As of September 30, 2016, we had approximately 19,000 employees, of which approximately 18,000 were full-time employees.
Token, dep_, index :18,000 nsubj 16
POS, tag, lemma :NUM CD 18,000
22
Tok 1 of 2
Token root: had
Token side of root: left
Sentence is :As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots.
Token, dep_, index :we nsubj 7
POS, tag, lemma :PRON PRP -PRON-
Tok 2 of 2
Token root: had
Token side of root: right
Sentence is :As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots.
Token, dep_, index :1,581 nsubj 12
POS, tag, lemma :NUM CD 1,581
25
Tok 1 of 2
Token root: was
Token side of root: left
Sentence is :Total work

In [54]:
for i, sent in enumerate(test_sents[:3]):
    print(str(i))
    doc = nlp(sent)  # parse once and reuse the Doc
    emp_nouns = [tok for tok in doc if tok.ent_type_ == 'EMP_NOUN']
    for i2, tok in enumerate(emp_nouns):
        print("Tok "+str(i2 + 1) + " of " + str(len(emp_nouns)))
        print("Sentence is :" + doc.text)
        print("Token, dep_, index :" + str(tok) + " " + str(tok.dep_) + " " + str(tok.i))
        print("POS, tag, lemma :" + str(tok.pos_) + " " + str(tok.tag_) + " " + str(tok.lemma_))

0
Tok 1 of 1
Sentence is :As of September 30, 2016, we employed approximately 7,300 employees world-wide.
Token, dep_, index :employees dobj 11
POS, tag, lemma :NOUN NNS employee
1
Tok 1 of 1
Sentence is :As of December 31, 2016, the subsidiaries of AEP had a total of 17,634 employees.
Token, dep_, index :employees pobj 16
POS, tag, lemma :NOUN NNS employee
2
Tok 1 of 1
Sentence is :At December 31, 2016 and 2015, we had approximately 56,400 and 66,400 employees, respectively.
Token, dep_, index :employees dobj 14
POS, tag, lemma :NOUN NNS employee


In [55]:
for i,t in enumerate(test_sents[24:26]):
    print(str(i))
    print(extract_emp_relations(nlp(t), verbose=True))

0
Word_id is : 0
Word is : employees
Word subtree is :  approximately 61
Word children :  [61]
Num_toks are: [61]
Candidate tok:  61
Candidate tok.pos_:   NUM
Candidate tok.dep_:   nummod
No toks, returning 0.
Dep_ of EMP_NOUN is: pobj
Root is at 1 steps from employees.
Parts found:  (We, are, 61, 'Other Employees', [], employees)
[RelationDetails(sent_num=0, word_num=0, subject=We, verb=are, quantity=61, quantity_idx=7, quantity_type='Other Employees', type_token=[], word=employees, word_idx=8, word_dep='pobj', depth=1, sentence='We are a small company with approximately 61 employees.')]
1
Word_id is : 0
Word is : workforce
Word subtree is :  
Word children :  []
No toks, returning 0.
Dep_ of EMP_NOUN is: compound
years: [(7, 2016)]
emp_counts: [(10, 150,500)]
order_indices: [0]
year_emps: [(2016, 150,500)]
Sentence has multiple years:[(7, 2016)]
First card subtree is :[approximately, 150,500]
years: [(7, 2016)]
cards: [150,500]
emp_counts: [(10, 150,500)]
Root is at 2 steps from work

## Testing with paragraphs

acc_id: `0000034088-17-000017`  
```
"The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively. Regular employees are defined as active executive, management, professional, technical and wage employees who work full time or part time for the Corporation and are covered by the Corporation's benefit plans and programs. Regular employees do not include employees of the company‑operated retail sites (CORS). The number of CORS employees was 1.6 thousand, 2.1 thousand, and 8.4 thousand at years ended 2016, 2015 and 2014, respectively. The decrease in CORS employees reflects the multi‑year transition of the company‑operated retail network to a more capital‑efficient Branded Wholesaler model."
```

In [56]:
test_paragraphs = ["The number of regular employees was 71.1 thousand, 73.5 thousand, and 75.3 thousand at years ended 2016, 2015 and 2014, respectively. Regular employees are defined as active executive, management, professional, technical and wage employees who work full time or part time for the Corporation and are covered by the Corporation's benefit plans and programs. Regular employees do not include employees of the company‑operated retail sites (CORS). The number of CORS employees was 1.6 thousand, 2.1 thousand, and 8.4 thousand at years ended 2016, 2015 and 2014, respectively. The decrease in CORS employees reflects the multi‑year transition of the company‑operated retail network to a more capital‑efficient Branded Wholesaler model.",
                  "As of December 31, 2016, we had a total of 619 employees, including 565 full-time, 20 regularly scheduled part-time employees, and 34 need-based part-time employees. We consider our current relationship with our employees to be good. Our employees are not represented by labor unions and are not subject to collective bargaining agreements.", 
                  "As of September 30, 2016, we employed approximately 7,300 employees world-wide. Approximately 860 of our employees in Mexico, 450 employees in Singapore, and 200 employees in Japan are covered by collective bargaining and other union agreements.",
                  "As of December 31, 2016, we employed 2,776 full-time employees, of which 31 held Ph.D. degrees in a science or engineering field. Of our employees, 287 are located in the U.S., 1,218 are located in Taiwan and 1,271 are located in China. None of our employees are represented by any collective bargaining agreement, but certain employees of our China subsidiary are members of a trade union. We have never suffered any work stoppage as a result of an employment related strike or any employee related dispute and believe that we have satisfactory relations with our employees.", 
                  "Our business depends on highly qualified management, operations and flight personnel. As a percentage of our consolidated operating expenses, salaries, wages and benefits accounted for approximately 25.4% in 2016, 20.7% in 2015 and 19.2% in 2014. As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots.", 
                  "As of December 31, 2016, we had 699 full-time employees and 202 temporary employees. The breakdown of our full-time employees by department is as follows: 175 direct manufacturing employees and 524 administrative and manufacturing support employees. Of the 524 administrative and manufacturing support employees, 213 were involved in sales, marketing, communications and training. Of the 202 temporary employees, more than 92% worked in direct manufacturing roles. Our employees are not covered by any collective bargaining agreement, and we have never experienced a work stoppage. We believe that our relations with our employees are good.", 
                  "The employee cost at Jaguar Land Rover increased by 17.6% to Rs.228,730 million in Fiscal 2016 from Rs.194,467 million in Fiscal 2015. This increase includes an unfavorable foreign currency translation from GBP to Indian rupees of Rs.546 million. In GBP terms, employee costs at Jaguar Land Rover increased to GBP 2,321 million in Fiscal 2016 from GBP1,977 million in Fiscal 2015. The employee cost at Jaguar Land Rover as a percentage to revenue increased to 10.5% in Fiscal 2016 from 9.0% in Fiscal 2015. Due to consistent increases in volumes and to support new launches and product development projects, Jaguar Land Rover increased its average permanent headcount by 19.6% in Fiscal 2016 to 29,789 employees from 24,902 employees in Fiscal 2015. However, the average temporary headcount was flat at 7,216 employees in Fiscal 2016 from 7,225 employees in Fiscal 2015. Total number of permanent employees as at March 31, 2016 was 30,750, as compared to 27,004 as at March 31, 2015 for Jaguar Land Rover."]

test_para = nlp(test_paragraphs[0])

In [57]:
extract_emp_relations(nlp(test_paragraphs[5]), verbose=False)

[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=699, quantity_idx=9, quantity_type='Full-Time Employees', type_token=[full-time], word=employees, word_idx=11, word_dep='dobj', depth=1, sentence='As of December 31, 2016, we had 699 full-time employees and 202 temporary employees.'),
 RelationDetails(sent_num=0, word_num=1, subject=we, verb=had, quantity=202, quantity_idx=13, quantity_type='Other Employees', type_token=[temporary], word=employees, word_idx=15, word_dep='conj', depth=1, sentence='As of December 31, 2016, we had 699 full-time employees and 202 temporary employees.'),
 RelationDetails(sent_num=1, word_num=1, subject=breakdown, verb=is, quantity=175, quantity_idx=29, quantity_type='Other Employees', type_token=[manufacturing], word=employees, word_idx=32, word_dep='dobj', depth=2, sentence='The breakdown of our full-time employees by department is as follows: 175 direct manufacturing employees and 524 administrative and manufacturing support employees

In [58]:
hf.print_doc_info(nlp(test_paragraphs[5]))

doc is: 
As of December 31, 2016, we had 699 full-time employees and 202 temporary employees. The breakdown of our full-time employees by department is as follows: 175 direct manufacturing employees and 524 administrative and manufacturing support employees. Of the 524 administrative and manufacturing support employees, 213 were involved in sales, marketing, communications and training. Of the 202 temporary employees, more than 92% worked in direct manufacturing roles. Our employees are not covered by any collective bargaining agreement, and we have never experienced a work stoppage. We believe that our relations with our employees are good.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,"December 31, 2016",DATE,December,pobj,object of preposition,of,prep,ADP
1,9,699,CARDINAL,699,nummod,,employees,dobj,NOUN
2,10,full-time,FULL_TIME,full-time,compound,,employees,dobj,NOUN
3,11,employees,EMP_NOUN,employees,dobj,direct object,had,ROOT,VERB
4,13,202,CARDINAL,202,nummod,,employees,conj,NOUN
5,15,employees,EMP_NOUN,employees,conj,conjunct,employees,dobj,NOUN
6,21,full-time,FULL_TIME,full-time,compound,,employees,pobj,NOUN
7,22,employees,EMP_NOUN,employees,pobj,object of preposition,of,prep,ADP
8,29,175,CARDINAL,175,nummod,,employees,dobj,NOUN
9,32,employees,EMP_NOUN,employees,dobj,direct object,follows,advcl,VERB


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,December,December,DATE,pobj,object of preposition,of,prep,ADP
1,7,we,we,,nsubj,nominal subject,had,ROOT,VERB
2,9,699 full-time employees,employees,EMP_NOUN,dobj,direct object,had,ROOT,VERB
3,13,202 temporary employees,employees,EMP_NOUN,conj,conjunct,employees,dobj,NOUN
4,17,The breakdown,breakdown,,nsubj,nominal subject,is,ROOT,VERB
5,20,our full-time employees,employees,EMP_NOUN,pobj,object of preposition,of,prep,ADP
6,24,department,department,,pobj,object of preposition,by,prep,ADP
7,29,175 direct manufacturing employees,employees,EMP_NOUN,dobj,direct object,follows,advcl,VERB
8,34,524 administrative and manufacturing support e...,employees,EMP_NOUN,conj,conjunct,employees,dobj,NOUN
9,42,the 524 administrative and manufacturing suppo...,employees,EMP_NOUN,pobj,object of preposition,Of,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,699,699,nummod,employees,dobj,NUM,CD,,cardinal number
1,CARDINAL,202,202,nummod,employees,conj,NUM,CD,,cardinal number
2,CARDINAL,175,175,nummod,employees,dobj,NUM,CD,,cardinal number
3,CARDINAL,524,524,nmod,employees,conj,NUM,CD,modifier of nominal,cardinal number
4,CARDINAL,524,524,nummod,employees,pobj,NUM,CD,,cardinal number
5,CARDINAL,213,213,nsubjpass,involved,ROOT,NUM,CD,nominal subject (passive),cardinal number
6,CARDINAL,202,202,nummod,employees,pobj,NUM,CD,,cardinal number


In [59]:
extract_emp_relations(nlp(test_paragraphs[6]), verbose=True)

Word_id is : 0
Word is : headcount
Word subtree is :  its average permanent
Word children :  [its, average, permanent]
Root token lemma not one of ['be', 'employ', 'have']. 
Root token, lemma are : increased increase
[Due, to, consistent, increases, in, volumes, and, to, support, new, launches, and, product, development, projects, ,, Jaguar, Land, Rover, increased, its, average, permanent, headcount, by, 19.6, %, in, Fiscal, 2016, to, 29,789, employees, from, 24,902, employees, in, Fiscal, 2015, .]
verb token lemma not one of ['be', 'employ', 'have']. 
verb token, lemma are : increased increase
[Due, to, consistent, increases, in, volumes, and, to, support, new, launches, and, product, development, projects, ,, Jaguar, Land, Rover, increased, its, average, permanent, headcount, by, 19.6, %, in, Fiscal, 2016, to, 29,789, employees, from, 24,902, employees, in, Fiscal, 2015, .]
Word_id is : 1
Word is : employees
Word subtree is :  29,789
Word children :  [29,789]
Num_toks are: [29,789]
R

[RelationDetails(sent_num=5, word_num=1, subject=headcount, verb=was, quantity=7,216, quantity_idx=141, quantity_type='Other Employees', type_token=[], word=employees, word_idx=142, word_dep='pobj', depth=1, sentence='However, the average temporary headcount was flat at 7,216 employees in Fiscal 2016 from 7,225 employees in Fiscal 2015.'),
 RelationDetails(sent_num=5, word_num=2, subject=headcount, verb=was, quantity=7,225, quantity_idx=147, quantity_type='Other Employees', type_token=[], word=employees, word_idx=148, word_dep='pobj', depth=1, sentence='However, the average temporary headcount was flat at 7,216 employees in Fiscal 2016 from 7,225 employees in Fiscal 2015.'),
 RelationDetails(sent_num=6, word_num=0, subject=Total number of permanent employees as at March 31, 2016, verb=was, quantity=30,750, quantity_idx=165, quantity_type='Other Employees', type_token=[permanent], word=employees, word_idx=157, word_dep='pobj', depth=2, sentence='Total number of permanent employees as at

## Extract information from training paragraphs

### Make fact dataframes

Train fact df

In [60]:
print(train_df.shape)
#fact_df = make_fact_df(train_df.para_text, extract_emp_relations, df=train_df, verbose=True)
#fact_df2 = make_fact_df(train_df.para_text, extract_emp_relations, df=train_df, verbose=True)
#fact_df3 = hf.make_fact_df(train_df.para_text, extract_emp_relations, nlp=nlp, df=train_df, verbose=True)
fact_df3 = pd.read_csv('../data/fact_df3.csv')

(7700, 9)


In [61]:
err_doc = nlp("We had 49, 61, and 57 full-time equivalent employees in research and development at December 31, 2016, 2015, and 2014, respectively.")
err_tok = err_doc[4]
err_ent = err_doc.ents[1]
list(err_doc[2].conjuncts)

[61, employees]

In [62]:
#fact_df = add_units_and_values(fact_df, 'quantity')
#fact_df.units.value_counts()

#fact_df3 = hf.add_units_and_values(fact_df3, 'quantity'); 
fact_df3.units.value_counts()

ones        2841
thousand       2
Name: units, dtype: int64

In [63]:
fact_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2843 entries, 0 to 2842
Data columns (total 16 columns):
doc_ids          2843 non-null int64
sent_num         2843 non-null int64
word_num         2843 non-null int64
subject          2843 non-null object
verb             2843 non-null object
quantity         2843 non-null object
quantity_idx     2843 non-null int64
quantity_type    2843 non-null object
type_token       2843 non-null object
word             2843 non-null object
word_idx         2843 non-null int64
word_dep         2843 non-null object
depth            2843 non-null int64
sentence         2843 non-null object
units            2843 non-null object
data_value       2843 non-null float64
dtypes: float64(1), int64(6), object(9)
memory usage: 355.5+ KB


TODO:
- Keep track of year 
- Deal with ", a(n) (in|de)crease of \d" clauses
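
For the second TODO item, one possible starting point is a regular expression that locates the change clause, so that its number can be kept out of the extracted head counts. This is only a sketch: the pattern, the name `CHANGE_CLAUSE_RE`, and the helper `find_change_clauses` are hypothetical and are not part of `extract_emp_relations`.

```python
import re

# Hypothetical pattern for clauses such as ", an increase of 17 employees".
CHANGE_CLAUSE_RE = re.compile(
    r",\s+an?\s+(?P<direction>increase|decrease)\s+of\s+"
    r"(?:approximately\s+)?(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)",
    flags=re.IGNORECASE,
)


def find_change_clauses(sentence):
    """Return (direction, amount_text) pairs for each change clause found."""
    return [
        (m.group("direction").lower(), m.group("amount"))
        for m in CHANGE_CLAUSE_RE.finditer(sentence)
    ]


sent = ("As of December 31, 2016, the Company had 455 employees, "
        "an increase of 17 employees from the prior year end.")
print(find_change_clauses(sent))
```

The matched spans could then be used to skip the numbers inside them when collecting quantities. That wiring is left open here.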

Validation fact df

In [64]:
fact_df_val = hf.make_fact_df(val_df.para_text, extract_emp_relations, nlp, df=val_df, verbose=True)

Length of emp_counts is : 1 while length of years is : 2
--------------------
As of December 31, 2016, we had approximately 28,300 employees, as compared to approximately 19,200 employees as of December 31, 2015. During 2016, reduction in workforce activities resulted in the severance of approximately 1,950 employees of which approximately 450 employees remained as employees as of December 31, 2016. Approximately 17,900 of our total employees are represented by unions. The number of employees covered by a collective bargaining agreement that expired in 2016, but have been extended and are still effective for 2017, is approximately 600. The number of employees covered by collective bargaining agreements that expire in 2017 is approximately 3,800. We consider our relations with our employees to be good.
len of all_rels is: 953
len of all_docs is: 953
len of doc_id_list is: 953


Create units column

In [66]:
fact_df_val = hf.add_units_and_values(fact_df_val, 'quantity')

fact_df_val.units.value_counts()

ones    953
Name: units, dtype: int64
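
The implementation of `hf.add_units_and_values` is not shown in this notebook. As a rough illustration of the idea only, the sketch below splits a quantity string such as `approximately 18,000` or `71.1 thousand` into a units label and a number. The function name `split_quantity`, the `UNIT_WORDS` tuple, and the choice to return the number as written (without scaling it by the unit) are assumptions, not the helper's actual behavior.

```python
import re

# Hypothetical list of number words to look for in a quantity string.
UNIT_WORDS = ("thousand", "million", "billion")
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def split_quantity(quantity):
    """Return (units, value) for a quantity string, or (None, None) if it has no number."""
    match = NUMBER_RE.search(quantity)
    if match is None:
        return None, None
    # Strip thousands separators before converting to a float.
    value = float(match.group().replace(",", ""))
    lowered = quantity.lower()
    # Default to "ones" when no number word is present.
    units = next((word for word in UNIT_WORDS if word in lowered), "ones")
    return units, value


for q in ["approximately 18,000", "71.1 thousand", "699"]:
    print(q, "->", split_quantity(q))
```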

In [67]:
fact_df_val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 16 columns):
doc_ids          953 non-null int64
sent_num         953 non-null int64
word_num         953 non-null int64
subject          953 non-null object
verb             953 non-null object
quantity         953 non-null object
quantity_idx     953 non-null int64
quantity_type    953 non-null object
type_token       953 non-null object
word             953 non-null object
word_idx         953 non-null int64
word_dep         953 non-null object
depth            953 non-null int64
sentence         953 non-null object
units            953 non-null object
data_value       953 non-null object
dtypes: int64(6), object(10)
memory usage: 119.2+ KB


## Classification Scores

### Identify tables in labeled data

I can eliminate tables from the false negative pool, as table parsing is a separate (and easier) problem. In my labeled data set, tables contain a lot of white space, so I flag a paragraph as a table when spaces make up more than 36% of its characters.
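
A minimal, dataframe-free sketch of this heuristic is shown here. The 0.36 cutoff is the one applied to the space fraction later in this notebook. The names `space_fraction` and `looks_like_table`, the empty-string guard, and the two sample strings are assumptions added for illustration.

```python
TABLE_SPACE_THRESHOLD = 0.36  # cutoff applied to the space fraction in this notebook


def space_fraction(text):
    """Fraction of characters in `text` that are spaces (0.0 for empty text)."""
    if not text:
        return 0.0
    return text.count(' ') / len(text)


def looks_like_table(text, threshold=TABLE_SPACE_THRESHOLD):
    """Flag text whose share of space characters exceeds the threshold."""
    return space_fraction(text) > threshold


# Made-up samples: a prose sentence and a padded table row.
prose = "As of December 31, 2016, we had 2,646 employees, 1,581 of whom were pilots."
table = "Shipboard personnel" + " " * 10 + "612" + " " * 10 + "645" + " " * 10 + "646"

print(looks_like_table(prose), looks_like_table(table))
```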

In [68]:
labeled_df = pd.read_excel('../data/train_val_employee_count_paragraphs.xlsx')
subset_df = pd.read_excel('../data/subset_employee_count_paragraphs.xlsx')

train_labeled_df = labeled_df[labeled_df.split == 'train'].copy()
val_labeled_df = labeled_df[labeled_df.split == 'val'].copy()
#print(train_labeled_df.paragraph_text.apply(lambda x: x.count(' ') / len(x)).to_frame('space_percent').sort_values('space_percent', ascending=False).head(100))

val_paragraphs = val_df.loc[:, ['acc_id', 'para_text']].copy().merge(labeled_df.loc[:,['accession_number', 'ticker']].copy().drop_duplicates(), 
                                          how = 'left', left_on = 'acc_id', right_on = 'accession_number')

In [69]:
# Fraction of each paragraph's characters that are spaces; table-like text scores high.
train_labeled_df.loc[:,'space_percent'] = train_labeled_df.loc[:, 'paragraph_text'].apply(lambda x: x.count(' ') / len(x))

val_labeled_df.loc[:,'space_percent'] = val_labeled_df.loc[:, 'paragraph_text'].apply(lambda x: x.count(' ') / len(x))

# Boolean masks flagging likely tables (more than 36% spaces); reuse the columns computed above.
space_bs = train_labeled_df.space_percent > 0.36 
space_bs_val = val_labeled_df.space_percent > 0.36 
print(sum(space_bs))
print(sum(space_bs_val))

105
37


In [70]:
def print_row_detail(df=val_paragraphs, header_list = ['ticker', 'accession_number' ], nrow=10, 
                    detail_list = ['data_key_friendly_name','para_text'],
                    sortby=['ticker', 'data_key_friendly_name'], ascending=True):
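    """Print up to `nrow` rows of `df`, sorted by `sortby`: header fields as banners, then detail fields."""
    df_sorted = df.sort_values(sortby, ascending=ascending).copy().reset_index()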
    nrow = min(len(df), nrow)
    for i in range(0, nrow):
        for h in header_list:
            print('-'*35  + ' ' +  str(df_sorted[h][i]) + ' ' + '-'*35)
        for d in detail_list:
            print(d + '  :' + str(df_sorted[d][i]))
            print('')

In [71]:
print_row_detail(train_labeled_df[space_bs].head(2), detail_list=['space_percent','data_key_friendly_name', 'data_value', 'reported_units', 'text', 'paragraph_text'],
                 sortby='space_percent')

----------------------------------- ADBE -----------------------------------
----------------------------------- 0000796343-17-000031 -----------------------------------
space_percent  :0.49632892804698975

data_key_friendly_name  :Other Employees

data_value  :15706.0

reported_units  :ones

text  :Worldwide employees

paragraph_text  :Fiscal Years                                           2016           2015           2014           2013           2012   Operations:   Revenue                             $  5,854,430   $  4,795,511   $  4,147,065   $  4,055,240   $  4,403,677   Gross profit                        $  5,034,522   $  4,051,194   $  3,524,985   $  3,468,683   $  3,919,895   Income before income taxes          $  1,435,138   $    873,781   $    361,376   $    356,141   $  1,118,794   Net income                          $  1,168,782   $    629,551   $    268,395   $    289,985   $    832,775   Net income per share:   Basic                               $       2.35   $     

In [72]:
print_row_detail(val_labeled_df[space_bs_val].head(2), detail_list=['space_percent','data_key_friendly_name', 'data_value', 'reported_units', 'text', 'paragraph_text'],
                 sortby='space_percent')

----------------------------------- ANW -----------------------------------
----------------------------------- 0000919574-17-004377 -----------------------------------
space_percent  :0.5145631067961165

data_key_friendly_name  :Other Employees

data_value  :379.0

reported_units  :ones

text  :Shoreside personnel

paragraph_text  :Year Ended December 31,                         2016     2015   2014   Shipboard personnel     612      645    646   Shoreside personnel     379      332    314   Total                   991      977    960

----------------------------------- AAL -----------------------------------
----------------------------------- 0001193125-17-051216 -----------------------------------
space_percent  :0.6181384248210023

data_key_friendly_name  :Full-Time Employees

data_value  :122300.0

reported_units  :ones

text  :Total

paragraph_text  :Mainline Operations   Wholly-owned Regional Carriers    Total   Pilots and Flight Crew Training Instructors               13,400 

### Merge fact_df with train_df

In [73]:
fact_df = pd.read_csv('../data/fact_df1.csv')
train_fact_df = train_df.merge(fact_df, left_index=True, right_on='doc_ids')

train_accession_ids = pd.read_csv('../data/train_accession_ids.csv', names=['acc_id']).loc[:,'acc_id'].tolist()

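# Build the set of labeled accession numbers once instead of recomputing unique() per element.
labeled_ids = set(train_labeled_df.accession_number.unique())
negative_ids = [x for x in train_accession_ids if x not in labeled_ids]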

# Select columns by name and rename them to the labeled-data names, rather than
# relying on the column order returned by Index.intersection.
fact_cols = {'acc_id': 'accession_number', 'quantity_type': 'data_key_friendly_name'}

def select_facts(df, cols):
    """Return the de-duplicated `cols` of `df`, renamed to match the labeled data."""
    return df.loc[:, cols].rename(columns=fact_cols).drop_duplicates()

train_facts = select_facts(train_fact_df, ['acc_id', 'quantity_type', 'data_value'])
train_facts_values = select_facts(train_fact_df, ['acc_id', 'data_value'])
train_facts_keys = select_facts(train_fact_df, ['acc_id', 'quantity_type'])

train_fact_df3 = train_df.merge(fact_df3, left_index=True, right_on='doc_ids')
train_facts3 = select_facts(train_fact_df3, ['acc_id', 'quantity_type', 'data_value'])
train_facts_values3 = select_facts(train_fact_df3, ['acc_id', 'data_value'])
train_facts_keys3 = select_facts(train_fact_df3, ['acc_id', 'quantity_type'])


print("facts: ", train_facts.shape)
print("values: ", train_facts_values.shape)
print("keys: ", train_facts_keys.shape)
print("facts3: ", train_facts3.shape)
print("values3: ", train_facts_values3.shape)
print("keys3: ", train_facts_keys3.shape)

facts:  (2322, 3)
values:  (2311, 2)
keys:  (1578, 2)
facts3:  (2530, 3)
values3:  (2511, 2)
keys3:  (1714, 2)


In [15]:
#fact_df.to_csv("../data/fact_df1.csv", index=False)
#fact_df2.to_csv("../data/fact_df2.csv", index=False)
#fact_df3.to_csv("../data/fact_df3.csv", index=False)

In [74]:
len(train_accession_ids)

1667

In [75]:
labeled_facts = train_labeled_df.loc[~space_bs, ['accession_number', 'data_key_friendly_name', 'data_value']].copy().drop_duplicates()
labeled_facts_values = train_labeled_df.loc[~space_bs, ['accession_number', 'data_value']].copy().drop_duplicates()
labeled_facts_keys = train_labeled_df.loc[~space_bs, ['accession_number', 'data_key_friendly_name']].copy().drop_duplicates()
print("facts: ", labeled_facts.shape)
print("values: ", labeled_facts_values.shape)
print("keys: ", labeled_facts_keys.shape)

facts:  (1853, 3)
values:  (1852, 2)
keys:  (1775, 2)


In [76]:
def left_only(left, right):
    """Return the rows of `left` with no exact match in `right` (anti-join on all of `left`'s columns)."""
    merged = pd.merge(left, right, on=left.columns.tolist(), how='outer', indicator=True)
    # drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts.
    return merged.query("_merge == 'left_only'").drop(columns='_merge').sort_values('accession_number')

# Labeled facts not recovered from fact_df (fact_df1.csv)
missed_facts = left_only(labeled_facts, train_facts)
missed_facts_values = left_only(labeled_facts_values, train_facts_values)
missed_facts_keys = left_only(labeled_facts_keys, train_facts_keys)

# Labeled facts not recovered from fact_df3
missed_facts3 = left_only(labeled_facts, train_facts3)
missed_facts_values3 = left_only(labeled_facts_values, train_facts_values3)
missed_facts_keys3 = left_only(labeled_facts_keys, train_facts_keys3)
print("missed facts: ", missed_facts.shape)
print("missed values: ", missed_facts_values.shape)
print("missed keys: ", missed_facts_keys.shape)

print("missed facts3: ", missed_facts3.shape)
print("missed values3: ", missed_facts_values3.shape)
print("missed keys3: ", missed_facts_keys3.shape)

missed facts:  (549, 3)
missed values:  (463, 2)
missed keys:  (456, 2)
missed facts3:  (416, 3)
missed values3:  (331, 2)
missed keys3:  (298, 2)
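
Each `missed_*` frame is a left anti-join: the labeled rows that have no exact counterpart among the extracted rows. The pattern can be sketched as a small reusable function; the name `left_anti_join` and the toy frames below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def left_anti_join(left, right, on):
    """Return the rows of `left` that have no exact match in `right` on the `on` columns."""
    merged = left.merge(right[on].drop_duplicates(), on=on, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Toy data standing in for labeled_facts_values and train_facts_values
labeled = pd.DataFrame({"accession_number": ["a", "a", "b"],
                        "data_value": [10.0, 20.0, 30.0]})
extracted = pd.DataFrame({"accession_number": ["a", "c"],
                          "data_value": [10.0, 99.0]})

missed = left_anti_join(labeled, extracted, on=["accession_number", "data_value"])
print(missed)
```

With `how="left"` the right-only rows are never materialized, so the function should return the same rows as the outer merge followed by the `left_only` query.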


In [77]:
train_facts3.loc[train_facts3.accession_number.isin(missed_facts3.head(15).accession_number.tolist())]

Unnamed: 0,accession_number,data_key_friendly_name,data_value
56,0000018498-16-000065,Other Employees,27500.0
108,0000049071-17-000019,Other Employees,51600.0
109,0000049754-17-000003,Full-Time Employees,960.0
128,0000059527-17-000006,Other Employees,50.0
129,0000059527-17-000006,Other Employees,3.0


In [78]:
missed_facts.data_key_friendly_name.value_counts()

Other Employees        218
Full-Time Employees    214
Part-Time Employees    117
Name: data_key_friendly_name, dtype: int64

In [79]:
missed_facts3.data_key_friendly_name.value_counts()

Other Employees        234
Full-Time Employees    131
Part-Time Employees     51
Name: data_key_friendly_name, dtype: int64

### Validation fact df

In [80]:
val_fact_df = val_df.merge(fact_df_val, left_index=True, right_on='doc_ids')

val_accession_ids = pd.read_csv('../data/val_accession_ids.csv', names=['acc_id'])['acc_id'].tolist()
val_labeled_ids = set(val_labeled_df.accession_number)  # build once instead of calling unique() per element
val_negative_ids = [x for x in val_accession_ids if x not in val_labeled_ids]

val_facts = val_fact_df.loc[:, val_fact_df.columns.intersection(['acc_id', 'quantity_type', 'data_value'])].copy().drop_duplicates()
val_facts_values = val_fact_df.loc[:, val_fact_df.columns.intersection(['acc_id', 'data_value'])].copy().drop_duplicates()
val_facts_keys = val_fact_df.loc[:, val_fact_df.columns.intersection(['acc_id', 'quantity_type'])].copy().drop_duplicates()
val_facts.columns = ['accession_number', 'data_key_friendly_name', 'data_value']
val_facts_values.columns = ['accession_number',  'data_value']
val_facts_keys.columns = ['accession_number',  'data_key_friendly_name']


print(val_facts.shape)
print(val_facts_values.shape)
print(val_facts_keys.shape)

(881, 3)
(873, 2)
(593, 2)


In [81]:
len(val_accession_ids)

555

In [82]:
labeled_facts_val = val_labeled_df.loc[~space_bs_val, ['accession_number', 'data_key_friendly_name', 'data_value']].copy().drop_duplicates()
labeled_facts_val_values = val_labeled_df.loc[~space_bs_val, ['accession_number', 'data_value']].copy().drop_duplicates()
labeled_facts_val_keys = val_labeled_df.loc[~space_bs_val, ['accession_number', 'data_key_friendly_name']].copy().drop_duplicates()
print("facts: ", labeled_facts_val.shape)
print("values: ", labeled_facts_val_values.shape)
print("keys: ", labeled_facts_val_keys.shape)

facts:  (630, 3)
values:  (627, 2)
keys:  (600, 2)


In [83]:
missed_facts_val = pd.merge(labeled_facts_val, val_facts, on=labeled_facts_val.columns.tolist(), 
                         how='outer', indicator=True).query(
    "_merge == 'left_only'").drop(columns='_merge').sort_values('accession_number')
missed_facts_val_values = pd.merge(labeled_facts_val_values, val_facts_values, on=labeled_facts_val_values.columns.tolist(), 
                               how='outer', indicator=True).query(
    "_merge == 'left_only'").drop(columns='_merge').sort_values('accession_number')
missed_facts_val_keys = pd.merge(labeled_facts_val_keys, val_facts_keys, on=labeled_facts_val_keys.columns.tolist(), 
                             how='outer', indicator=True).query(
    "_merge == 'left_only'").drop(columns='_merge').sort_values('accession_number')

print("facts: ", missed_facts_val.shape)
print("values: ", missed_facts_val_values.shape)
print("keys: ", missed_facts_val_keys.shape)

facts:  (147, 3)
values:  (115, 2)
keys:  (95, 2)


In [84]:
val_fact_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 953 entries, 0 to 952
Data columns (total 25 columns):
acc_id                  953 non-null object
para_text               953 non-null object
len                     953 non-null int64
emp_header              953 non-null bool
first_emp_head_block    953 non-null bool
para_text_orig          953 non-null object
para_tag                953 non-null object
split                   953 non-null object
label                   953 non-null int64
doc_ids                 953 non-null int64
sent_num                953 non-null int64
word_num                953 non-null int64
subject                 953 non-null object
verb                    953 non-null object
quantity                953 non-null object
quantity_idx            953 non-null int64
quantity_type           953 non-null object
type_token              953 non-null object
word                    953 non-null object
word_idx                953 non-null int64
word_dep                95

### Scores for training set

In [85]:
# Labeled training filings for which the current pipeline extracted no fact at all
len([x for x in train_labeled_df[~space_bs].accession_number.unique() if x not in train_fact_df3.acc_id.unique()])

151

Current state (`train_fact_df3`), matching on value only:

In [86]:
value_match3 = train_fact_df3[['acc_id', 'data_value']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_value']].drop_duplicates(), left_on=['acc_id', 'data_value'], 
                   right_on = ['accession_number', 'data_value']).drop_duplicates()

value_match_dict3 = {'True Positives' : value_match3.shape[0], 
                   'False Positives' : train_fact_df3[['acc_id', 'data_value']].drop_duplicates().shape[0] - value_match3.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df3.acc_id.unique()]), 
                   'False Negatives' : missed_facts_values3.shape[0]
                  }

value_match_dict3['recall'] = round(value_match_dict3['True Positives'] /  (
    value_match_dict3['True Positives'] + value_match_dict3['False Negatives']), 2)
value_match_dict3['precision'] = round(value_match_dict3['True Positives'] /  (
    value_match_dict3['True Positives'] + value_match_dict3['False Positives']), 2)
value_match_dict3['accuracy'] = round((value_match_dict3['True Positives'] + value_match_dict3['True Negatives']) /  (
    value_match_dict3['True Positives'] + value_match_dict3['True Negatives'] + value_match_dict3['False Positives'] + value_match_dict3['False Negatives']), 2)

value_match_dict3

{'False Negatives': 331,
 'False Positives': 968,
 'True Negatives': 10,
 'True Positives': 1543,
 'accuracy': 0.54,
 'precision': 0.61,
 'recall': 0.82}
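
The same recall, precision, and accuracy arithmetic is repeated for every score dictionary in this section. A minimal sketch of a helper that computes the derived metrics from the four counts is given here; `score_counts` is a hypothetical name, and the counts in the example are copied from the value-match output above.

```python
def score_counts(tp, fp, tn, fn, ndigits=2):
    """Return the four confusion counts plus rounded recall, precision, and accuracy."""
    scores = {"True Positives": tp, "False Positives": fp,
              "True Negatives": tn, "False Negatives": fn}
    scores["recall"] = round(tp / (tp + fn), ndigits)
    scores["precision"] = round(tp / (tp + fp), ndigits)
    scores["accuracy"] = round((tp + tn) / (tp + tn + fp + fn), ndigits)
    return scores

# Counts copied from the value-match output for train_fact_df3
value_scores = score_counts(tp=1543, fp=968, tn=10, fn=331)
print(value_scores["recall"], value_scores["precision"], value_scores["accuracy"])
# 0.82 0.61 0.54
```

In these cells the true negatives are counted per filing (accession number), while the other three counts are per fact, so the accuracy figure mixes units and is probably best read as a rough indicator.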

Demo day performance (`train_fact_df`), matching on value only, after some corrections to the evaluation code:

In [87]:
value_match = train_fact_df[['acc_id', 'data_value']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_value']].drop_duplicates(), left_on=['acc_id', 'data_value'], 
                   right_on = ['accession_number', 'data_value']).drop_duplicates()

value_match_dict = {'True Positives' : value_match.shape[0], 
                   'False Positives' : train_fact_df[['acc_id', 'data_value']].drop_duplicates().shape[0] - value_match.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts_values.shape[0]
                  }

value_match_dict['recall'] = round(value_match_dict['True Positives'] /  (
    value_match_dict['True Positives'] + value_match_dict['False Negatives']), 2)
value_match_dict['precision'] = round(value_match_dict['True Positives'] /  (
    value_match_dict['True Positives'] + value_match_dict['False Positives']), 2)
value_match_dict['accuracy'] = round((value_match_dict['True Positives'] + value_match_dict['True Negatives']) /  (
    value_match_dict['True Positives'] + value_match_dict['True Negatives'] + value_match_dict['False Positives'] + value_match_dict['False Negatives']), 2)

value_match_dict

{'False Negatives': 463,
 'False Positives': 900,
 'True Negatives': 10,
 'True Positives': 1411,
 'accuracy': 0.51,
 'precision': 0.61,
 'recall': 0.75}

Current state (`train_fact_df3`), matching on key (employee type) only:

In [88]:
key_match3 = train_fact_df3[['acc_id', 'quantity_type']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_key_friendly_name']].drop_duplicates(), left_on=['acc_id', 'quantity_type'], 
                   right_on = ['accession_number', 'data_key_friendly_name']).drop_duplicates()

key_match_dict3 = {'True Positives' : key_match3.shape[0], 
                   'False Positives' : train_fact_df3[['acc_id', 'quantity_type']].drop_duplicates().shape[0] - key_match3.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df3.acc_id.unique()]), 
                   'False Negatives' : missed_facts_keys3.shape[0]
                  }

key_match_dict3['recall'] = round(key_match_dict3['True Positives'] /  (
    key_match_dict3['True Positives'] + key_match_dict3['False Negatives']), 2)
key_match_dict3['precision'] = round(key_match_dict3['True Positives'] /  (
    key_match_dict3['True Positives'] + key_match_dict3['False Positives']), 2)
key_match_dict3['accuracy'] = round((key_match_dict3['True Positives'] + key_match_dict3['True Negatives']) /  (
    key_match_dict3['True Positives'] + key_match_dict3['True Negatives'] + key_match_dict3['False Positives'] + key_match_dict3['False Negatives']), 2)

key_match_dict3

{'False Negatives': 298,
 'False Positives': 192,
 'True Negatives': 10,
 'True Positives': 1522,
 'accuracy': 0.76,
 'precision': 0.89,
 'recall': 0.84}

Demo day performance (`train_fact_df`), matching on key (employee type) only, after some corrections to the evaluation code:

In [89]:
key_match = train_fact_df[['acc_id', 'quantity_type']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_key_friendly_name']].drop_duplicates(), left_on=['acc_id', 'quantity_type'], 
                   right_on = ['accession_number', 'data_key_friendly_name']).drop_duplicates()

key_match_dict = {'True Positives' : key_match.shape[0], 
                   'False Positives' : train_fact_df[['acc_id', 'quantity_type']].drop_duplicates().shape[0] - key_match.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts_keys.shape[0]
                  }

key_match_dict['recall'] = round(key_match_dict['True Positives'] /  (
    key_match_dict['True Positives'] + key_match_dict['False Negatives']), 2)
key_match_dict['precision'] = round(key_match_dict['True Positives'] /  (
    key_match_dict['True Positives'] + key_match_dict['False Positives']), 2)
key_match_dict['accuracy'] = round((key_match_dict['True Positives'] + key_match_dict['True Negatives']) /  (
    key_match_dict['True Positives'] + key_match_dict['True Negatives'] + key_match_dict['False Positives'] + key_match_dict['False Negatives']), 2)

key_match_dict

{'False Negatives': 456,
 'False Positives': 215,
 'True Negatives': 10,
 'True Positives': 1363,
 'accuracy': 0.67,
 'precision': 0.86,
 'recall': 0.75}

Current state (`train_fact_df3`), matching on both value and key:

In [90]:
full_match3 = train_fact_df3[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_value', 'data_key_friendly_name']].drop_duplicates(), 
                                                           left_on=['acc_id', 'data_value', 'quantity_type'], 
                   right_on = ['accession_number', 'data_value', 'data_key_friendly_name']).drop_duplicates()

full_match_dict3 = {'True Positives' : full_match3.shape[0], 
                   'False Positives' : train_fact_df3[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().shape[0] - full_match3.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df3.acc_id.unique()]), 
                   'False Negatives' : missed_facts3.shape[0]
                  }

full_match_dict3['recall'] = round(full_match_dict3['True Positives'] /  (
    full_match_dict3['True Positives'] + full_match_dict3['False Negatives']), 2)
full_match_dict3['precision'] = round(full_match_dict3['True Positives'] /  (
    full_match_dict3['True Positives'] + full_match_dict3['False Positives']), 2)
full_match_dict3['accuracy'] = round((full_match_dict3['True Positives'] + full_match_dict3['True Negatives']) /  (
    full_match_dict3['True Positives'] + full_match_dict3['True Negatives'] + full_match_dict3['False Positives'] + full_match_dict3['False Negatives']), 2)

full_match_dict3

{'False Negatives': 416,
 'False Positives': 1073,
 'True Negatives': 10,
 'True Positives': 1457,
 'accuracy': 0.5,
 'precision': 0.58,
 'recall': 0.78}

Demo day performance (`train_fact_df`), matching on both value and key, after some corrections to the evaluation code:

In [91]:
full_match = train_fact_df[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().merge(train_labeled_df[['accession_number', 'data_value', 'data_key_friendly_name']].drop_duplicates(), 
                                                           left_on=['acc_id', 'data_value', 'quantity_type'], 
                   right_on = ['accession_number', 'data_value', 'data_key_friendly_name']).drop_duplicates()

full_match_dict = {'True Positives' : full_match.shape[0], 
                   'False Positives' : train_fact_df[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().shape[0] - full_match.shape[0], 
                   'True Negatives' : len([x for x in negative_ids if x not in train_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts.shape[0]
                  }

full_match_dict['recall'] = round(full_match_dict['True Positives'] /  (
    full_match_dict['True Positives'] + full_match_dict['False Negatives']), 2)
full_match_dict['precision'] = round(full_match_dict['True Positives'] /  (
    full_match_dict['True Positives'] + full_match_dict['False Positives']), 2)
full_match_dict['accuracy'] = round((full_match_dict['True Positives'] + full_match_dict['True Negatives']) /  (
    full_match_dict['True Positives'] + full_match_dict['True Negatives'] + full_match_dict['False Positives'] + full_match_dict['False Negatives']), 2)

full_match_dict

{'False Negatives': 549,
 'False Positives': 999,
 'True Negatives': 10,
 'True Positives': 1323,
 'accuracy': 0.46,
 'precision': 0.57,
 'recall': 0.71}

In [92]:
train_fact_df.quantity_type.value_counts()

Other Employees        2030
Full-Time Employees     527
Part-Time Employees      48
Name: quantity_type, dtype: int64

In [93]:
train_fact_df3.quantity_type.value_counts()

Other Employees        2019
Full-Time Employees     662
Part-Time Employees     162
Name: quantity_type, dtype: int64

In [94]:
train_labeled_df[~space_bs].data_key_friendly_name.value_counts()

Other Employees        1093
Full-Time Employees     605
Part-Time Employees     155
Name: data_key_friendly_name, dtype: int64
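
The three `value_counts()` outputs above are easier to compare side by side. The sketch below rebuilds them as one table from the printed numbers; the column names `demo_day`, `current`, `labeled`, and `current_to_labeled` are hypothetical. The extracted counts are rows of the fact dataframes and the labeled counts are rows of the labeled data, so the ratio is only a rough signal.

```python
import pandas as pd

labels = ["Other Employees", "Full-Time Employees", "Part-Time Employees"]

# Counts copied from the three value_counts() outputs above
comparison = pd.DataFrame(
    {"demo_day": [2030, 527, 48],
     "current": [2019, 662, 162],
     "labeled": [1093, 605, 155]},
    index=labels,
)

# Ratio of extracted rows to labeled rows; values well above 1 hint at over-extraction
comparison["current_to_labeled"] = (comparison["current"] / comparison["labeled"]).round(2)
print(comparison)
```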

### Scores for validation set

In [95]:
# Labeled validation filings for which the pipeline extracted no fact at all
len([x for x in val_labeled_df[~space_bs_val].accession_number.unique() if x not in val_fact_df.acc_id.unique()])

45

In [96]:
value_match_val = val_fact_df[['acc_id', 'data_value']].drop_duplicates().merge(val_labeled_df[['accession_number', 'data_value']].drop_duplicates(), left_on=['acc_id', 'data_value'], 
                   right_on = ['accession_number', 'data_value']).drop_duplicates()

value_match_dict_val = {'True Positives' : value_match_val.shape[0], 
                   'False Positives' : val_fact_df[['acc_id', 'data_value']].drop_duplicates().shape[0] - value_match_val.shape[0], 
                   'True Negatives' : len([x for x in val_negative_ids if x not in val_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts_val_values.shape[0]
                  }

value_match_dict_val['recall'] = round(value_match_dict_val['True Positives'] /  (
    value_match_dict_val['True Positives'] + value_match_dict_val['False Negatives']), 2)
value_match_dict_val['precision'] = round(value_match_dict_val['True Positives'] /  (
    value_match_dict_val['True Positives'] + value_match_dict_val['False Positives']), 2)
value_match_dict_val['accuracy'] = round((value_match_dict_val['True Positives'] + value_match_dict_val['True Negatives']) /  (
    value_match_dict_val['True Positives'] + value_match_dict_val['True Negatives'] + value_match_dict_val['False Positives'] + value_match_dict_val['False Negatives']), 2)

value_match_dict_val

{'False Negatives': 115,
 'False Positives': 350,
 'True Negatives': 1,
 'True Positives': 523,
 'accuracy': 0.53,
 'precision': 0.6,
 'recall': 0.82}

In [97]:
key_match_val = val_fact_df[['acc_id', 'quantity_type']].drop_duplicates().merge(val_labeled_df[['accession_number', 'data_key_friendly_name']].drop_duplicates(), left_on=['acc_id', 'quantity_type'], 
                   right_on = ['accession_number', 'data_key_friendly_name']).drop_duplicates()

key_match_dict_val = {'True Positives' : key_match_val.shape[0], 
                   'False Positives' : val_fact_df[['acc_id', 'quantity_type']].drop_duplicates().shape[0] - key_match_val.shape[0], 
                   'True Negatives' : len([x for x in val_negative_ids if x not in val_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts_val_keys.shape[0]
                  }

key_match_dict_val['recall'] = round(key_match_dict_val['True Positives'] /  (
    key_match_dict_val['True Positives'] + key_match_dict_val['False Negatives']), 2)
key_match_dict_val['precision'] = round(key_match_dict_val['True Positives'] /  (
    key_match_dict_val['True Positives'] + key_match_dict_val['False Positives']), 2)
key_match_dict_val['accuracy'] = round((key_match_dict_val['True Positives'] + key_match_dict_val['True Negatives']) /  (
    key_match_dict_val['True Positives'] + key_match_dict_val['True Negatives'] + key_match_dict_val['False Positives'] + key_match_dict_val['False Negatives']), 2)

key_match_dict_val

{'False Negatives': 95,
 'False Positives': 72,
 'True Negatives': 1,
 'True Positives': 521,
 'accuracy': 0.76,
 'precision': 0.88,
 'recall': 0.85}

In [98]:
full_match_val = val_fact_df[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().merge(val_labeled_df[['accession_number', 'data_value', 'data_key_friendly_name']].drop_duplicates(), 
                                                           left_on=['acc_id', 'data_value', 'quantity_type'], 
                   right_on = ['accession_number', 'data_value', 'data_key_friendly_name']).drop_duplicates()

full_match_dict_val = {'True Positives' : full_match_val.shape[0], 
                   'False Positives' : val_fact_df[['acc_id', 'data_value', 'quantity_type']].drop_duplicates().shape[0] - full_match_val.shape[0], 
                   'True Negatives' : len([x for x in val_negative_ids if x not in val_fact_df.acc_id.unique()]), 
                   'False Negatives' : missed_facts_val.shape[0]
                  }

full_match_dict_val['recall'] = round(full_match_dict_val['True Positives'] /  (
    full_match_dict_val['True Positives'] + full_match_dict_val['False Negatives']), 2)
full_match_dict_val['precision'] = round(full_match_dict_val['True Positives'] /  (
    full_match_dict_val['True Positives'] + full_match_dict_val['False Positives']), 2)
full_match_dict_val['accuracy'] = round((full_match_dict_val['True Positives'] + full_match_dict_val['True Negatives']) /  (
    full_match_dict_val['True Positives'] + full_match_dict_val['True Negatives'] + full_match_dict_val['False Positives'] + full_match_dict_val['False Negatives']), 2)

full_match_dict_val

{'False Negatives': 147,
 'False Positives': 389,
 'True Negatives': 1,
 'True Positives': 492,
 'accuracy': 0.48,
 'precision': 0.56,
 'recall': 0.77}

In [99]:
val_fact_df.quantity_type.value_counts()

Other Employees        648
Full-Time Employees    239
Part-Time Employees     66
Name: quantity_type, dtype: int64

In [100]:
val_labeled_df[~space_bs_val].data_key_friendly_name.value_counts()

Other Employees        364
Full-Time Employees    205
Part-Time Employees     61
Name: data_key_friendly_name, dtype: int64

### Evaluate misses

In [101]:
# Full matches found by the demo-day extraction (full_match) that the current extraction (full_match3) no longer finds
lost_matches = pd.merge(full_match, full_match3, on=full_match3.columns.tolist(), 
                         how='outer', indicator=True).query(
    "_merge == 'left_only'").drop(columns='_merge').sort_values('accession_number')

lost_matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54 entries, 57 to 1248
Data columns (total 5 columns):
acc_id                    54 non-null object
data_value                54 non-null float64
quantity_type             54 non-null object
accession_number          54 non-null object
data_key_friendly_name    54 non-null object
dtypes: float64(1), object(4)
memory usage: 2.5+ KB


In [102]:
lost_matches

Unnamed: 0,acc_id,data_value,quantity_type,accession_number,data_key_friendly_name
57,0000049754-17-000003,960.0,Other Employees,0000049754-17-000003,Other Employees
127,0000105770-17-000011,7300.0,Other Employees,0000105770-17-000011,Other Employees
187,0000745732-16-000037,77800.0,Other Employees,0000745732-16-000037,Other Employees
208,0000785161-17-000009,27968.0,Other Employees,0000785161-17-000009,Other Employees
219,0000799292-17-000004,1138.0,Other Employees,0000799292-17-000004,Other Employees
271,0000868671-17-000037,2291.0,Other Employees,0000868671-17-000037,Other Employees
307,0000894081-17-000036,350.0,Other Employees,0000894081-17-000036,Other Employees
308,0000895419-16-000167,6237.0,Other Employees,0000895419-16-000167,Other Employees
350,0000919956-17-000013,7400.0,Other Employees,0000919956-17-000013,Other Employees
359,0000927089-17-000123,1263.0,Other Employees,0000927089-17-000123,Other Employees


In [103]:
print_row_detail(df=train_fact_df3[train_fact_df3.acc_id.isin(lost_matches.acc_id.tolist())], 
                 nrow=10, header_list = ['acc_id', 'sent_num', 'word_num'],
                    detail_list = ['quantity_type', 'type_token', 'data_value',  'para_text'],
                    sortby=['acc_id', 'sent_num', 'word_num'], ascending=True)

----------------------------------- 0000049754-17-000003 -----------------------------------
----------------------------------- 0 -----------------------------------
----------------------------------- 0 -----------------------------------
quantity_type  :Full-Time Employees

type_token  :[full-time]

data_value  :960.0

para_text  :At December 31, 2016, we had approximately 960 employees, of whom approximately 500 were full-time, non-restaurant, corporate personnel. Our employees are not presently represented by any collective bargaining agreements and we have never experienced a work stoppage. We believe our relations with employees are good. Our franchisees are independent business owners and their employees are not our employees. Therefore, their employees are not included in our employee count.

----------------------------------- 0000105770-17-000011 -----------------------------------
----------------------------------- 0 -----------------------------------
--------------------

para_text  :As of December 31, 2016, the Company had approximately 3,230 full-time and part-time employees. The Company employed approximately 450 flight crewmembers, 1,405 aircraft maintenance technicians and flight support personnel, 670 warehousing, sorting and logistics personnel, 440 employees for airport maintenance and logistics, 50 employees for sales and marketing and 215 employees for administrative functions. In addition to full time and part time employees, the Company typically has approximately 350 temporary employees mainly serving the USPS operations and aircraft line maintenance operations. On December 31, 2015, the Company had approximately 2,170 full-time and part-time employees.



In [108]:
full_match_left= train_fact_df3.drop_duplicates().merge(train_labeled_df.drop_duplicates(), 
                                                           how='left', left_on=['acc_id', 'data_value', 'quantity_type'], 
                   right_on = ['accession_number', 'data_value', 'data_key_friendly_name']).drop_duplicates()

full_match_right= train_fact_df3.drop_duplicates().merge(train_labeled_df.loc[~space_bs,:].drop_duplicates(), 
                                                           how='right', left_on=['acc_id', 'data_value', 'quantity_type'], 
                   right_on = ['accession_number', 'data_value', 'data_key_friendly_name']).drop_duplicates()

print(full_match_left.shape)
print(full_match_right.shape)

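# ticker comes only from the labeled frame, so a missing ticker marks an extracted row with no labeled match
fp_bs = full_match_left.ticker.isna()
# acc_id comes only from the extracted frame, so a missing acc_id marks a labeled row that was not extracted
fn_bs = full_match_right.acc_id.isna()
print(fp_bs.sum())
print(fn_bs.sum())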

(2843, 32)
(2039, 32)
1192
416
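
The left and right merges above flag false positives and false negatives through missing values on the opposite side. As an alternative sketch, a single outer merge with `indicator=True` classifies every row at once. The toy frames and the `outcome` column are hypothetical, and the sketch assumes both frames have already been renamed to shared key columns and de-duplicated.

```python
import pandas as pd

keys = ["accession_number", "data_key_friendly_name", "data_value"]

# Toy stand-ins for the extracted facts and the labeled (golden) facts
extracted = pd.DataFrame({
    "accession_number": ["a", "a", "b"],
    "data_key_friendly_name": ["Other Employees", "Full-Time Employees", "Other Employees"],
    "data_value": [100.0, 60.0, 5.0],
})
labeled = pd.DataFrame({
    "accession_number": ["a", "a", "c"],
    "data_key_friendly_name": ["Other Employees", "Part-Time Employees", "Other Employees"],
    "data_value": [100.0, 40.0, 7.0],
})

merged = extracted.merge(labeled, on=keys, how="outer", indicator=True)

# Translate the merge indicator into evaluation outcomes
outcome_names = {"both": "true positive",
                 "left_only": "false positive",
                 "right_only": "false negative"}
merged["outcome"] = merged["_merge"].astype(str).map(outcome_names)
counts = merged["outcome"].value_counts()
print(counts)
```

This avoids relying on a column such as `ticker` being non-null in the labeled data.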


In [109]:
full_match_left.quantity_type.value_counts(dropna=False)

Other Employees        2019
Full-Time Employees     662
Part-Time Employees     162
Name: quantity_type, dtype: int64

In [110]:
full_match_left.data_key_friendly_name.value_counts(dropna=False)

NaN                    1192
Other Employees        1029
Full-Time Employees     511
Part-Time Employees     111
Name: data_key_friendly_name, dtype: int64

In [111]:
full_match_right.data_key_friendly_name.value_counts(dropna=False)

Other Employees        1236
Full-Time Employees     641
Part-Time Employees     162
Name: data_key_friendly_name, dtype: int64

In [112]:
full_match_left[fp_bs].quantity_type.value_counts(dropna=False)

Other Employees        990
Full-Time Employees    151
Part-Time Employees     51
Name: quantity_type, dtype: int64

In [113]:
full_match_right.quantity_type.value_counts(dropna=False)

Other Employees        1002
Full-Time Employees     510
NaN                     416
Part-Time Employees     111
Name: quantity_type, dtype: int64

In [114]:
full_match_right[fn_bs].data_key_friendly_name.value_counts(dropna=False)

Other Employees        234
Full-Time Employees    131
Part-Time Employees     51
Name: data_key_friendly_name, dtype: int64

In [115]:
full_match_right[fn_bs].head(1)

Unnamed: 0,acc_id,para_text,len,emp_header,first_emp_head_block,para_text_orig,para_tag,split_x,label,doc_ids,...,data_value,ticker,accession_number,data_key_friendly_name,text,reported_data_value,reported_units,paragraph_text,split_y,space_percent
1623,,,,,,,,,,,...,19000.0,ABC,0001140859-16-000022,Other Employees,employees,19000.0,ones,"Employees As of September 30, 2016, we had a...",train,0.182598


In [116]:
# Columns shared by both views; only the leading ID column differs
shared_columns = ['ticker', 'data_value', 'quantity_type', 'data_key_friendly_name',
                  'paragraph_text', 'para_text', 'sent_num', 'sentence', 'emp_header',
                  'first_emp_head_block', 'word_num', 'subject', 'verb', 'quantity',
                  'type_token', 'word', 'len', 'para_text_orig', 'para_tag', 'split_x',
                  'label', 'doc_ids', 'units', 'reported_units', 'text',
                  'reported_data_value', 'space_percent']

full_match_left_columns = ['acc_id'] + shared_columns
full_match_right_columns = ['accession_number'] + shared_columns

False negatives: facts in the golden dataset that were either not extracted or extracted with the wrong label.

In [117]:
print_row_detail(df=full_match_right.loc[fn_bs, :], 
                 nrow=10, header_list = ['accession_number', 'ticker' ],
                    detail_list = ['data_key_friendly_name','data_value',  'text', 'paragraph_text'],
                    sortby=['accession_number', 'ticker', 'data_key_friendly_name' ], ascending=True)

----------------------------------- 0000018498-16-000065 -----------------------------------
----------------------------------- GCO -----------------------------------
data_key_friendly_name  :Part-Time Employees

data_value  :18275.0

text  :part-time

paragraph_text  :Employees   Genesco had approximately 27,500 employees  at January 30, 2016, approximately  130 of whom were employed  in   corporate staff departments  and the  balance in  operations. Retail stores  employ a  substantial number  of   part-time employees, and approximately 18,275 of the Company's employees were part-time at January 30, 2016

----------------------------------- 0000024090-17-000008 -----------------------------------
----------------------------------- CIA -----------------------------------
data_key_friendly_name  :Other Employees

data_value  :333.0

text  :employee-agents

paragraph_text  :Domestic Home Service Insurance   Our domestic  Home Service  segment operates  in  this market  through our  s

False positives: extracted facts that are either absent from the golden dataset or carry the wrong label.

In [118]:
print_row_detail(df=full_match_left.loc[fp_bs, :], 
                 nrow=10, header_list = ['acc_id', 'sent_num' ],
                    detail_list = ['quantity_type','data_value',  'label','subject', 'verb', 'quantity',
                                   'sentence', 'para_text'],
                    sortby=['acc_id', 'quantity_type', 'sent_num'], ascending=True)

----------------------------------- 0000004904-17-000019 -----------------------------------
----------------------------------- 2 -----------------------------------
quantity_type  :Other Employees

data_value  :2475.0

label  :1

subject  :I&M

verb  :had

quantity  :2,475

sentence  :As of December 31, 2016, I&M had 2,475 employees.

para_text  :Organized in Indiana in 1907, I&M is engaged in the generation, transmission and distribution of electric power to approximately 592,000 retail customers in northern and eastern Indiana and southwestern Michigan, and in supplying and marketing electric power at wholesale to other electric utility companies, rural electric cooperatives, municipalities and other market participants.  I&M owns or leases 3,539 MWs of generating capacity, which it uses to serve its retail and other customers.  As of December 31, 2016, I&M had 2,475 employees. Among the principal industries served are primary metals, transportation equipment, electrical and electr

In [147]:
train_labeled_df[train_labeled_df.accession_number == '0000011199-17-000011']

Unnamed: 0,ticker,accession_number,data_key_friendly_name,text,data_value,reported_data_value,reported_units,paragraph_text,split,space_percent
342,BMS,0000011199-17-000011,Other Employees,Number of employees,17678.0,17678.0,ones,"Years Ended December 31, ...",train,0.563313


In [119]:
full_match_left.loc[fp_bs, ['acc_id', 'data_value', 'quantity_type', 'label', 'para_text']].drop_duplicates().shape

(1175, 5)

In [120]:
full_match_left.loc[:,full_match_left_columns].head()

Unnamed: 0,acc_id,ticker,data_value,quantity_type,data_key_friendly_name,paragraph_text,para_text,sent_num,sentence,emp_header,...,para_text_orig,para_tag,split_x,label,doc_ids,units,reported_units,text,reported_data_value,space_percent
0,0000004127-16-000068,SWKS,7300.0,Other Employees,Other Employees,"EMPLOYEES As of September 30, 2016, we emp...","As of September 30, 2016, we employed approxim...",0,"As of September 30, 2016, we employed approxim...",True,...,"As of September 30, 2016,\r\r\r\nwe employed a...","<div class=""c80""><span class=""c32"">As of</span...",train,1,4,ones,ones,employed,7300.0,0.201465
1,0000004904-17-000019,AEP,17634.0,Other Employees,Other Employees,"As of December 31, 2016, the subsidiaries of ...","As of December 31, 2016, the subsidiaries of A...",0,"As of December 31, 2016, the subsidiaries of A...",False,...,"As of December 31, 2016,\r\r\r\nthe subsidiari...","<div class=""c110""><span class=""c99"">As of Dece...",train,1,13,ones,ones,employees,17634.0,0.198214
2,0000004904-17-000019,,1500.0,Other Employees,,,"Organized in Delaware in 1925, AEP Texas was f...",4,"As of December 31, 2016, AEP Texas had 1,500 e...",False,...,"Organized in Delaware in 1925, AEP Texas was f...","<div class=""c110""><span class=""c99"">Organized ...",train,1,14,ones,,,,
3,0000004904-17-000019,,1845.0,Other Employees,,,"Organized in Virginia in 1926, APCo is engaged...",3,"As of December 31, 2016, APCo had 1,845 employ...",False,...,"Organized in Virginia in 1926, APCo is engaged...","<div class=""c110""><span class=""c99"">Organized ...",train,1,15,ones,,,,
4,0000004904-17-000019,,2475.0,Other Employees,,,"Organized in Indiana in 1907, I&M is engaged i...",2,"As of December 31, 2016, I&M had 2,475 employees.",False,...,"Organized in Indiana in 1907, I&M is engaged i...","<div class=""c110""><span class=""c99"">Organized ...",train,1,16,ones,,,,


### Miss examples

The sentences below were missed by the extractor (false negatives) in earlier runs. Some of them are now caught.

In [121]:
fn_list = [
"We had approximately 11,000 full-time and part-time employees as of January 30, 2016.", 
"As of December 31, 2016, Comerica and its subsidiaries had 7,659 full-time and 490 part-time employees.", 
"At December 31, 2016, we had approximately 1,150 employees on a full-time equivalent basis.",
"The Company employed 4,482 persons on a full-time basis and 395 persons on a part-time basis at December 31, 2016.", 
"As of December 31, 2016, we had 428 employees (200 U.S. and 228 non-U.S.), the majority of whom are engaged in manufacturing operations, with the remainder primarily in sales, marketing and administrative functions.", 
"Retail stores employ a substantial number of part-time employees, and approximately 18,275 of the Company's employees were part-time at January 30, 2016.", 
"As of November 30, 2016 we had over 792 full-time employees.", 
"At December 31, 2016, we had approximately 960 employees, of whom approximately 500 were full-time, non-restaurant, corporate personnel.", 
"At December 30, 2016, we employed over 8,900 people.", 
"As of February 15, 2017, we had 2,254 employees, 1,157 of which were covered by a contract with Locals 304 and 1523 of the International Brotherhood of Electrical Workers that extends through June 30, 2018.", 
"The number of persons employed by the Company worldwide at December 31, 2016 was approximately 9,000.", 
"The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees, was approximately 375,000 as of year-end 2016.",
"As of December 31, 2016, ALLETE had 1,963 employees, of which 1,917 were full-time.",
"As of December 31, 2016, the Company had 9,598 employees with 138 employed at MDU Resources Group, Inc., 1,030 at Montana-Dakota, 35 at Great Plains, 342 at Cascade, 236 at Intermountain, 342 at WBI Holdings, 3,099 at Knife River and 4,376 at MDU Construction Services.",
"At December 31, 2016, and December 31, 2015 we had approximately 14,000 employees.",
"At January 31, 2017, we employed approximately 1,800 full and part-time personnel none of which are represented by unions.", 
"During 2015, we employed approximately 72,500 employees on a full- or part-time basis.", 
" As of December 31, 2016, Eversource Energy employed a total of 7,762 employees, excluding temporary employees, of which 1,258 were employed by CL&P, 1,627 were employed by NSTAR Electric, 928 were employed by PSNH, and 297 were employed by WMECO.", 
"At December 31, 2016, we had 269,100 active, full-time equivalent team members.", 
"At December 31, 2016, the utility workforce consisted of 611 members of the Office and Professional Employees International Union (OPEIU) Local No. 11, AFL-CIO, and 497 non-union employees.", 
"The average number of persons employed worldwide by PPG during 2016 was about 47,000.",
"Northern Trust employed approximately 17,100 full-time equivalent staff members as of December 31, 2016.", 
"With about 201,000 employees and 62 plants worldwide, our core business includes designing, manufacturing, marketing, and servicing a full line of Ford cars, trucks, and SUVs, as well as Lincoln luxury vehicles.",
"Our policies are sold and serviced through a home service marketing distribution system of approximately 333 employee-agents who work on a route system and through over 286 funeral homes and independent agents to sell policies, collect premiums and service policyholders.",
]
fn1 = nlp(fn_list[0])
fn2 = nlp(fn_list[1])
fn3 = nlp(fn_list[2])
fn4 = nlp(fn_list[3])
fn5 = nlp(fn_list[4])
fn6 = nlp(fn_list[5])
fn7 = nlp(fn_list[6])
fn8 = nlp(fn_list[7])
fn12 = nlp(fn_list[11])
fn13 = nlp(fn_list[12])
fn17 = nlp(fn_list[16])
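
The variable names above are 1-based while `fn_list` is 0-based (`fn12` is `fn_list[11]`). One possible alternative, sketched here with a stand-in parser so that it runs without the notebook's spaCy pipeline, keeps every parsed sentence in a dict keyed by its 1-based position. `parse_all`, `fake_nlp`, and `sample_sentences` are hypothetical names; in the notebook one would presumably pass `nlp` and `fn_list` instead.

```python
def parse_all(sentences, parser):
    """Map the 1-based sentence number to the parsed result."""
    return {n: parser(s) for n, s in enumerate(sentences, start=1)}


def fake_nlp(text):
    # Stand-in for the spaCy pipeline: whitespace tokenization only.
    return text.split()


sample_sentences = [
    "At December 30, 2016, we employed over 8,900 people.",
    "As of November 30, 2016 we had over 792 full-time employees.",
]

fn_docs = parse_all(sample_sentences, fake_nlp)
first_doc = fn_docs[1]  # plays the role of fn1
```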

In [None]:
print("verb rights: ", list(fn1[1].rights))
print("employees tok rights: ", list(fn1[7].rights))
print("employees tok lefts: ", list(fn1[7].lefts))
print("employees tok children deps: ", [x.dep_ for x in list(fn1[7].children)])
print("employees tok gc: ", [list(x.children) for x in list(fn1[7].children)])
print("employees tok gc deps: ", [y.dep_  for x in list(fn1[7].children) for y in list(x.children)])

In [None]:
print("verb rights: ", list(fn2[11].rights))
print("employees tok rights: ", list(fn2[17].rights))
print("employees tok lefts: ", list(fn2[17].lefts))
print("employees tok children deps: ", [x.dep_ for x in list(fn2[17].children)])
print("employees tok gc: ", [list(x.children) for x in list(fn2[17].children)])
print("employees tok gc deps: ", [y.dep_  for x in list(fn2[17].children) for y in list(x.children)])
print("emp-type gc children: ", [list(y.children)  for x in list(fn2[17].children) for y in list(x.children) if y._.is_emp_type])
print("emp-type gc children deps: ", [z.dep_  for x in list(fn2[17].children) for y in list(x.children) for z in list(y.children)  if y._.is_emp_type])

In [None]:
extract_emp_relations(fn1, verbose=True)

In [None]:
print_doc_info(fn1)

In [None]:
extract_emp_relations(fn2, verbose=True)

In [123]:
#print_df(make_tok_df(fn2))

In [124]:
list(fn2[13].conjuncts)

[part-time]

In [125]:
[get_nummod_tok(x,[], verbose=True) for x in find_emp_type_toks(fn2[17], verbose=True) if get_nummod_tok(x,[])]

tok_emp_type_subtree:  [full-time, part-time]
Flagged toks from tok.children: [full-time]
type_conjs:  [part-time]
Flagged_toks:  [full-time, part-time]
Num_toks are: [490]


[490]

In [126]:
get_nummod_tok(find_emp_type_toks(fn2[17], verbose=True)[0], [], verbose=True)

tok_emp_type_subtree:  [full-time, part-time]
Flagged toks from tok.children: [full-time]
type_conjs:  [part-time]
Flagged_toks:  [full-time, part-time]


[]

In [127]:
print_doc_info(fn2)

doc is: 
As of December 31, 2016, Comerica and its subsidiaries had 7,659 full-time and 490 part-time employees.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,"December 31, 2016",DATE,December,pobj,object of preposition,of,prep,ADP
1,7,Comerica,ORG,Comerica,nsubj,nominal subject,had,ROOT,VERB
2,12,7659,CARDINAL,7659,nummod,,employees,dobj,NOUN
3,13,full-time,FULL_TIME,full-time,nmod,modifier of nominal,employees,dobj,NOUN
4,15,490,CARDINAL,490,nummod,,part-time,conj,ADJ
5,16,part-time,PART_TIME,part-time,conj,conjunct,full-time,nmod,ADJ
6,17,employees,EMP_NOUN,employees,dobj,direct object,had,ROOT,VERB


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,2,December,December,DATE,pobj,object of preposition,of,prep,ADP
1,7,Comerica,Comerica,ORG,nsubj,nominal subject,had,ROOT,VERB
2,9,its subsidiaries,subsidiaries,,conj,conjunct,Comerica,nsubj,PROPN
3,12,"7,659 full-time and 490 part-time employees",employees,EMP_NOUN,dobj,direct object,had,ROOT,VERB


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,7659,7659,nummod,employees,dobj,NUM,CD,,cardinal number
1,CARDINAL,490,490,nummod,part-time,conj,NUM,CD,,cardinal number
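
The parse printed above explains why the two type tokens behave differently: `490` is a `nummod` of `part-time`, while `7,659` is a `nummod` of `employees`, so `full-time` has no number of its own. The toy model below mimics that with plain tuples and a fallback to the noun's number. It is a minimal sketch, not the notebook's `get_nummod_tok`; the helper names are hypothetical, and the `and` row is an assumption because that token is not listed in the entity table.

```python
# index -> (text, dep, head index), copied from the entity table above.
TOKENS = {
    12: ("7,659", "nummod", 17),
    13: ("full-time", "nmod", 17),
    14: ("and", "cc", 13),  # assumed attachment, not shown in the table
    15: ("490", "nummod", 16),
    16: ("part-time", "conj", 13),
    17: ("employees", "dobj", 11),
}


def nummod_children(head_i):
    """Texts of all nummod tokens whose head is head_i."""
    return [text for text, dep, head in TOKENS.values()
            if dep == "nummod" and head == head_i]


def quantity_for_type(type_i, noun_i):
    """Prefer a number attached to the type token, else the noun's number."""
    own = nummod_children(type_i)
    return own if own else nummod_children(noun_i)


full_time_qty = quantity_for_type(13, 17)
part_time_qty = quantity_for_type(16, 17)
```

With this fallback, `full-time` picks up the number attached to `employees`, while `part-time` keeps its own.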


In [128]:
extract_emp_relations(fn3, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  approximately 1,150 employees on a full-time equivalent
Word children :  [1,150, on]
Num_toks are: [1,150]
tok_emp_type_subtree:  [full-time equivalent]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: [full-time equivalent]
tok_head_empnoun_subtreee: [employees]
Flagged toks from tok_head_emptype_subtreee: [full-time equivalent]
Flagged tok heads: [basis]
Flagged tok head deps: ['pobj']
Flagged_toks:  [full-time equivalent]
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Parts found:  (we, had, 1,150, 'Full-Time Employees', [full-time equivalent], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=1,150, quantity_type='Full-Time Employees', type_token=[full-time equivalent], word=employees, word_dep='dobj', depth=1, sentence='At December 31, 2016, we had approximately 1,150 employees on a full-time equivalent basis.')]

In [129]:
extract_emp_relations(fn4, verbose=True)

Word_id is : 0
Word is : persons
Word subtree is :  4,482
Word children :  [4,482]
Num_toks are: [4,482]
Finding emp_type, emp_tok is dobj
emp_noun tok head: employed
tok_head_emptype_subtreee: [full-time, part-time]
tok_head_empnoun_subtreee: [persons, persons]
Flagged toks from tok_head_emptype_subtreee: [full-time]
Flagged tok heads: [basis]
Flagged tok head deps: ['pobj']
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: dobj
Root is at 2 steps from persons.
Parts found:  (Company, employed, 4,482, 'Full-Time Employees', [full-time], persons)
Word_id is : 1
Word is : persons
Word subtree is :  395 persons on a part-time
Word children :  [395, on]
Num_toks are: [395]
tok_emp_type_subtree:  [part-time]
Flagged toks from tok_emp_subtree: [part-time]
t.dep_ == 'compound', t.head.dep_ == 'pobj', and t.head.head.head is emp_noun
Flagged_toks:  [part-time]
Dep_ of EMP_NOUN is: conj
Emp_noun token has dep_ == 'conj'.
Child num_tok: 395
Head num_tok: []
Parts found:  (Company, employed, 395, 

[RelationDetails(sent_num=0, word_num=0, subject=Company, verb=employed, quantity=4,482, quantity_type='Full-Time Employees', type_token=[full-time], word=persons, word_dep='dobj', depth=2, sentence='The Company employed 4,482 persons on a full-time basis and 395 persons on a part-time basis at December 31, 2016.'),
 RelationDetails(sent_num=0, word_num=1, subject=Company, verb=employed, quantity=395, quantity_type='Part-Time Employees', type_token=[part-time], word=persons, word_dep='conj', depth=2, sentence='The Company employed 4,482 persons on a full-time basis and 395 persons on a part-time basis at December 31, 2016.')]

In [130]:
#print_df(make_tok_df(fn4))

In [131]:
print_doc_info(fn4)

doc is: 
The Company employed 4,482 persons on a full-time basis and 395 persons on a part-time basis at December 31, 2016.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,3,4482,CARDINAL,4482,nummod,,persons,dobj,NOUN
1,4,persons,EMP_NOUN,persons,dobj,direct object,employed,ROOT,VERB
2,7,full-time,FULL_TIME,full-time,compound,,basis,pobj,NOUN
3,10,395,CARDINAL,395,nummod,,persons,conj,NOUN
4,11,persons,EMP_NOUN,persons,conj,conjunct,employed,ROOT,VERB
5,14,part-time,PART_TIME,part-time,compound,,basis,pobj,NOUN
6,17,"December 31, 2016",DATE,December,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,The Company,Company,,nsubj,nominal subject,employed,ROOT,VERB
1,3,"4,482 persons",persons,EMP_NOUN,dobj,direct object,employed,ROOT,VERB
2,6,a full-time basis,basis,,pobj,object of preposition,on,prep,ADP
3,10,395 persons,persons,EMP_NOUN,conj,conjunct,employed,ROOT,VERB
4,13,a part-time basis,basis,,pobj,object of preposition,on,prep,ADP
5,17,December,December,DATE,pobj,object of preposition,at,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,4482,4482,nummod,persons,dobj,NUM,CD,,cardinal number
1,CARDINAL,395,395,nummod,persons,conj,NUM,CD,,cardinal number


In [132]:
print(fn4[7].head.head.head)
print(list(fn4[4].head.subtree))
print([t for t in list(fn4[4].head.subtree) if t._.is_emp_type])
print([t for t in list(fn4[4].head.subtree) if t._.is_emp_noun])
print("emp noun tok indices: ", [t.i for t in list(fn4[4].head.subtree) if t._.is_emp_noun])
print("emp noun list index: ", [t for t in list(fn4[4].head.subtree) if t._.is_emp_noun].index(fn4[4]))

employed
[The, Company, employed, 4,482, persons, on, a, full-time, basis, and, 395, persons, on, a, part-time, basis, at, December, 31, ,, 2016, .]
[full-time, part-time]
[persons, persons]
emp noun tok indices:  [4, 11]
emp noun list index:  0
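
The output above shows the asymmetry in this sentence: three head steps up from `full-time` land on the verb `employed`, whereas the verbose trace reports that the same walk from `part-time` reaches the second `persons`. The sketch below reproduces that head walk on a hand-written head map. The map is transcribed from the tables and traces above, and `ancestor` and `attached_noun` are hypothetical helpers rather than functions from the notebook.

```python
# token index -> head index, for the tokens that matter in fn4.
HEADS = {
    2: 2,    # employed (ROOT, its own head)
    4: 2,    # persons -> employed
    5: 2,    # on -> employed
    7: 8,    # full-time -> basis
    8: 5,    # basis -> on
    11: 2,   # persons -> employed
    12: 11,  # on -> persons
    14: 15,  # part-time -> basis
    15: 12,  # basis -> on
}
EMP_NOUNS = {4, 11}


def ancestor(tok_i, steps):
    """Follow the head pointer `steps` times."""
    for _ in range(steps):
        tok_i = HEADS[tok_i]
    return tok_i


def attached_noun(type_i):
    """Return the employee noun three heads up, or None if there is none."""
    candidate = ancestor(type_i, 3)
    return candidate if candidate in EMP_NOUNS else None
```

When `attached_noun` returns `None`, as it does for `full-time` here, some other rule is needed to link the type to a noun, which is the situation the subtree listing above is probing.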


In [133]:
extract_emp_relations(fn5, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  428 employees (200 U.S. and 228 non-U.S.
Word children :  [428, (, U.S., )]
Num_toks are: [428]
Root token lemma not one of ['be', 'employ', 'have']. 
Root token, lemma are : had have
[As, of, December, 31, ,, 2016, ,, we, had, 428, employees, (, 200, U.S., and, 228, non, -, U.S., )]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: []
tok_head_empnoun_subtreee: [employees]
Candidate tok:  428
Candidate tok.pos_:   NUM
Candidate tok.dep_:   nummod
No toks, returning 0.
Dep_ of EMP_NOUN is: dobj
Root is at 2 steps from employees.
Parts found:  (we, had, 428, 'Other Employees', [], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=428, quantity_type='Other Employees', type_token=[], word=employees, word_dep='dobj', depth=2, sentence='As of December 31, 2016, we had 428 employees (200 U.S. and 228 non-U.S.), the majority of whom are engaged in manufacturing operations, with the remainder primarily in sales, marketing and administrative functions.')]

In [134]:
extract_emp_relations(fn8, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  approximately 960 employees, of whom approximately 500 were full-time, non-restaurant, corporate
Word children :  [960, ,, were]
Num_toks are: [960]
tok_emp_type_subtree:  [full-time]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: [full-time]
tok_head_empnoun_subtreee: [employees]
Flagged toks from tok_head_emptype_subtreee: [full-time]
Flagged tok heads: [personnel]
Flagged tok head deps: ['attr']
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Parts found:  (we, had, 960, 'Full-Time Employees', [full-time], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=had, quantity=960, quantity_type='Full-Time Employees', type_token=[full-time], word=employees, word_dep='dobj', depth=1, sentence='At December 31, 2016, we had approximately 960 employees, of whom approximately 500 were full-time, non-restaurant, corporate personnel.')]

In [135]:
#print_df(make_tok_df(fn8))

In [136]:
list(fn8[17].head.lefts)

[full-time, ,, corporate]

In [137]:
extract_emp_relations(fn12, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  
Word children :  []
No toks, returning 0.
Dep_ of EMP_NOUN is: pobj
years: [(28, 2016)]
emp_counts: [(22, 375,000)]
order_indices: [0]
year_emps: [(2016, 375,000)]
Sentence has multiple years:[(28, 2016)]
First card subtree is :[approximately, 375,000, as, of, year, -, end, 2016]
years: [(28, 2016)]
cards: [375,000]
emp_counts: [(22, 375,000)]
Root is at 3 steps from employees.
Parts found:  (The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees,, was, 375,000, 'Other Employees', [], employees)
Word_id is : 1
Word is : employees
Word subtree is :  its corporate office employees and company-owned restaurant
Word children :  [its, corporate, office, and, employees]
Dep_ of EMP_NOUN is: pobj
years: [(28, 2016)]
emp_counts: [(22, 375,000)]
order_indices: [0]
year_emps: [(2016, 375,000)]
Sentence has multiple years:[(28, 2016)]
First card subtree is :[approximately, 

[RelationDetails(sent_num=0, word_num=0, subject=The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees,, verb=was, quantity=375,000, quantity_type='Other Employees', type_token=[], word=employees, word_dep='pobj', depth=3, sentence="The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees, was approximately 375,000 as of year-end 2016."),
 RelationDetails(sent_num=0, word_num=1, subject=The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees,, verb=was, quantity=375,000, quantity_type='Other Employees', type_token=[office], word=employees, word_dep='pobj', depth=3, sentence="The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees, was approximately 375,000 as of year-end 2016.")]

In [138]:
#print_df(make_tok_df(fn12))

In [139]:
print_doc_info(fn12)

doc is: 
The Company's number of employees worldwide, including its corporate office employees and company-owned restaurant employees, was approximately 375,000 as of year-end 2016.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,Company,ORG,Company,poss,possession modifier,number,nsubj,NOUN
1,5,employees,EMP_NOUN,employees,pobj,object of preposition,of,prep,ADP
2,12,employees,EMP_NOUN,employees,pobj,object of preposition,including,prep,VERB
3,18,employees,EMP_NOUN,employees,conj,conjunct,employees,pobj,NOUN
4,21,approximately,DATE,approximately,advmod,adverbial modifier,375000,attr,NUM
5,22,375000,FALSE_DATE,375000,attr,attribute,was,ROOT,VERB
6,23,as of year-end 2016.,DATE,as,prep,prepositional modifier,375000,attr,NUM


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,0,The Company's number,number,,nsubj,nominal subject,was,ROOT,VERB
1,5,employees,employees,EMP_NOUN,pobj,object of preposition,of,prep,ADP
2,9,its corporate office employees,employees,EMP_NOUN,pobj,object of preposition,including,prep,VERB
3,14,company-owned restaurant employees,employees,EMP_NOUN,conj,conjunct,employees,pobj,NOUN
4,25,year-end,end,DATE,pobj,object of preposition,of,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def


In [142]:
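dtok = fn12[22]
# Entity span that contains the flagged token
dtok_span = [e for e in fn12.ents if dtok in e][0]

# Re-run NER on the part of the entity to the left of the token, if any
if dtok.i > dtok_span.start:
    left_span = fn12[dtok_span.start:dtok.i]
    left_ents = list(nlp(left_span.text).ents)
    if left_ents:
        print(left_ents)
# Note: Span.end is exclusive, so this condition always holds for a token inside
# the span, and the slice below reaches one token past the entity
if dtok.i < dtok_span.end:
    right_span = fn12[dtok.i + 1: dtok_span.end + 1]
    right_ents = list(nlp(right_span.text).ents)
    if right_ents:
        print(right_ents[0].end)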
    
print("dtok index:", dtok.i)
print("right span start:", right_span.start)
print("right span end:", right_span.end)
print("doc last token index:", fn12[-1].i)

dtok index: 22
right span start: 23
right span end: 24
doc last token index: 29
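
Because spaCy's `Span.end` is exclusive, like the stop of a Python slice, the printed right span (tokens 23 to 24) starts after the flagged token and covers one token beyond a single-token entity. A minimal index-only sketch of splitting an entity around a token, assuming the intent is to stay inside the entity, could look like this. `split_entity_around` and the example boundaries are hypothetical.

```python
def split_entity_around(ent_start, ent_end, tok_i):
    """Return (left, right) index ranges of the entity on each side of tok_i.

    ent_end is exclusive, as in Python slices.
    """
    if not ent_start <= tok_i < ent_end:
        raise ValueError("token is not inside the entity")
    left = range(ent_start, tok_i)
    right = range(tok_i + 1, ent_end)
    return left, right


# Hypothetical entity covering tokens 21..29 with the flagged token at 22.
left, right = split_entity_around(21, 30, 22)

# A single-token entity has nothing on either side.
solo_left, solo_right = split_entity_around(22, 23, 22)
```

An empty range signals that there is nothing to re-parse on that side, which avoids referencing a span variable that was never assigned.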


In [143]:
extract_emp_relations(fn13, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  1,963 employees, of which 1,917 were
Word children :  [1,963, ,, were]
Num_toks are: [1,963]
tok_emp_type_subtree:  [full-time]
Finding emp_type, emp_tok is dobj
emp_noun tok head: had
tok_head_emptype_subtreee: [full-time]
tok_head_empnoun_subtreee: [employees]
Flagged toks from tok_head_emptype_subtreee: [full-time]
Flagged tok heads: [were]
Flagged tok head deps: ['relcl']
Flagged_toks:  [full-time]
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Type token in relative clause while emp_noun is not.
type_tok_relcl.dep_ is attr
sub_relcl found:  1,917
num_tok_relcl: [1,917]
num_toks: [1,963, 1,917]
Detail_list  0 :  [0, 0, ALLETE, had, 1,963, 'Other Employees', [of, which, 1,917], employees, 'dobj', 1, 'As of December 31, 2016, ALLETE had 1,963 employees, of which 1,917 were full-time.']
Detail_list  1 :  [0, 0, ALLETE, had, 1,917, 'Full-Time Employees', full-time, employees, 'dobj', 1, 'As of December 31, 2016, ALLETE

[RelationDetails(sent_num=0, word_num=0, subject=ALLETE, verb=had, quantity=1,963, quantity_type='Other Employees', type_token=[of, which, 1,917], word=employees, word_dep='dobj', depth=1, sentence='As of December 31, 2016, ALLETE had 1,963 employees, of which 1,917 were full-time.'),
 RelationDetails(sent_num=0, word_num=0, subject=ALLETE, verb=had, quantity=1,917, quantity_type='Full-Time Employees', type_token=full-time, word=employees, word_dep='dobj', depth=1, sentence='As of December 31, 2016, ALLETE had 1,963 employees, of which 1,917 were full-time.')]

In [144]:
#print_df(make_tok_df(fn13))

In [145]:
ft_tok = fn13[16]
fn13_emp_tok = fn13[10]
ft_tok.dep_ == 'attr' and ft_tok.head.dep_ == 'relcl' and find_verb_tok(ft_tok) != fn13_emp_tok.head
find_subject(ft_tok.head).tag_ == 'CD' and find_subject(ft_tok.head).ent_type_ in ['CARDINAL', 'QUANTITY', 'FALSE_DATE']
list(find_verb_tok(ft_tok).subtree)[:3]
#print(find_verb_tok(ft_tok).left_edge.subtree)

[of, which, 1,917]

In [146]:
find_emp_type_toks(fn13_emp_tok)

[full-time]

In [147]:
list(fn13[16].head.children)

[of, 1,917, full-time]

In [148]:
extract_emp_relations(fn17, verbose=True)

Word_id is : 0
Word is : employees
Word subtree is :  approximately 72,500
Word children :  [72,500]
Num_toks are: [72,500]
Finding emp_type, emp_tok is dobj
emp_noun tok head: employed
tok_head_emptype_subtreee: [full-, part-time]
tok_head_empnoun_subtreee: [employees]
Flagged toks from tok_head_emptype_subtreee: [full-]
Flagged tok heads: [basis]
Flagged tok head deps: ['pobj']
type_conjs:  [part-time]
Flagged_toks:  [full-, part-time]
Part_time and full_time flags found.
Dep_ of EMP_NOUN is: dobj
Root is at 1 steps from employees.
Parts found:  (we, employed, 72,500, 'Other Employees', [full-, part-time], employees)


[RelationDetails(sent_num=0, word_num=0, subject=we, verb=employed, quantity=72,500, quantity_type='Other Employees', type_token=[full-, part-time], word=employees, word_dep='dobj', depth=1, sentence='During 2015, we employed approximately 72,500 employees on a full- or part-time basis.')]

In [149]:
#print_df(make_tok_df(fn17))

In [150]:
print_doc_info(fn17)

doc is: 
During 2015, we employed approximately 72,500 employees on a full- or part-time basis.
--------------------------------------------------
Entities are: 


Unnamed: 0,tok_i,entity,ent_label,root,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,1,2015,DATE,2015,pobj,object of preposition,During,prep,ADP
1,5,"approximately 72,500",CARDINAL,72500,nummod,,employees,dobj,NOUN
2,7,employees,EMP_NOUN,employees,dobj,direct object,employed,ROOT,VERB
3,10,full-,FULL_TIME,full-,amod,adjectival modifier,basis,pobj,NOUN
4,12,part-time,PART_TIME,part-time,conj,conjunct,full-,amod,ADJ


--------------------------------------------------
Noun chunks are: 


Unnamed: 0,tok_i,noun_chunk,root,root_ent,root_dep,dep_def,root_head,root_head_dep,root_head_pos
0,3,we,we,,nsubj,nominal subject,employed,ROOT,VERB
1,5,"approximately 72,500 employees",employees,EMP_NOUN,dobj,direct object,employed,ROOT,VERB
2,9,a full- or part-time basis,basis,,pobj,object of preposition,on,prep,ADP


--------------------------------------------------
Cardinal entities are: 


Unnamed: 0,tok_ent,toks,lemma,dep,head,h_dep,pos,tag,dep_def,tag_def
0,CARDINAL,approximately,approximately,advmod,72500,nummod,ADV,RB,adverbial modifier,adverb
1,CARDINAL,72500,72500,nummod,employees,dobj,NUM,CD,,cardinal number


Check for full-time or part-time indicators that the pipeline component misses.

In [151]:
print_row_detail(df=train_fact_df3[train_fact_df3.para_text.str.contains("full- ")], 
                 nrow=10, header_list = ['acc_id', 'sent_num', 'word_num'],
                    detail_list = ['quantity_type', 'type_token', 'data_value',  'para_text'],
                    sortby=['acc_id', 'sent_num', 'word_num'], ascending=True)

----------------------------------- 0000072333-16-000260 -----------------------------------
----------------------------------- 0 -----------------------------------
----------------------------------- 0 -----------------------------------
quantity_type  :Other Employees

type_token  :[full-, part-time]

data_value  :72500.0

para_text  :During 2015, we employed approximately 72,500 employees on a full- or part-time basis. Due to the seasonal nature of our business, employment increased to approximately 74,000 employees in July 2015 and 78,000 in December 2015. All of our employees are non-union. We believe our relationship with our employees is good.

----------------------------------- 0000750686-17-000057 -----------------------------------
----------------------------------- 1 -----------------------------------
----------------------------------- 0 -----------------------------------
quantity_type  :Other Employees

type_token  :[full-, part-time]

data_value  :631.0

para_text  

In [152]:
train_fact_df3[train_fact_df3.para_text.str.contains("full or part")]

Unnamed: 0,acc_id,para_text,len,emp_header,first_emp_head_block,para_text_orig,para_tag,split,label,doc_ids,...,verb,quantity,quantity_type,type_token,word,word_dep,depth,sentence,units,data_value
1268,0001171843-17-001394,"As of December 31, 2016, we employed approxima...",318,True,True,"As of December 31, 2016, we employed approxima...","<p class=""c22"">As of December 31, 2016, we emp...",train,1,7330,...,employed,135,Part-Time Employees,[part-time],people,dobj,1,"As of December 31, 2016, we employed approxima...",ones,135.0
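
The two `str.contains` checks above each look for one literal spelling. A single regular expression can cover several spellings of a combined full/part-time phrase in one pass. The pattern below is a sketch under that assumption; `COMBINED_TYPE_RE` and `has_combined_emp_type` are hypothetical names and the pattern has not been validated against the full dataset.

```python
import re

# Matches "full- or part-time", "full or part-time", "full and part-time",
# "full-time and part-time", "full time or part time", and similar.
COMBINED_TYPE_RE = re.compile(
    r"\bfull(?:[- ]?time|-)?\s+(?:or|and)\s+part[- ]?time\b",
    re.IGNORECASE,
)


def has_combined_emp_type(text):
    """True if the text names full and part time as one combined group."""
    return COMBINED_TYPE_RE.search(text) is not None


combined = has_combined_emp_type(
    "During 2015, we employed approximately 72,500 employees "
    "on a full- or part-time basis."
)
separate = has_combined_emp_type(
    "As of December 31, 2016, Comerica and its subsidiaries had "
    "7,659 full-time and 490 part-time employees."
)
```

The second sentence is deliberately not matched: a number sits between the two type words, so the counts are reported separately rather than as one combined total.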


## Examples that needed special handling to avoid errors (a more robust treatment is still needed)

In [384]:
#train_df.loc[7729].para_text

'Our current product, the Argus ® II System, treats outer retinal degenerations, such as retinitis pigmentosa, also referred to as RP. RP is a hereditary disease, affecting an estimated 1.5 million people worldwide including about 100,000 people in the United States, that causes a progressive degeneration of the light-sensitive cells of the retina, leading to significant visual impairment and ultimately blindness. The Argus II System is the only retinal prosthesis approved in the United States by the Food and Drug Administration (FDA), and was the first approved retinal prosthesis in the world. By restoring a form of useful vision in patients who otherwise have total sight loss, the Argus II System can provide benefits which include:'

In [476]:
#train_para_list[744:747]

['Employees and Labor Relations - As of December 31, 2016, Praxair had 26,498 employees worldwide. Of this number, 10,182 are employed in the United States. Praxair has collective bargaining agreements with unions at numerous locations throughout the world, which expire at various dates. Praxair considers relations with its employees to be good.',
 'The number of employees at December 31, 2016 was 26,498, a decrease of 159 employees from December 31, 2015. This decrease primarily reflects the impact of cost reduction programs implemented during the current year partially offset by acquisitions.',
 'The number of employees at December 31, 2015 was 26,657, a decrease of 1,123 employees from December 31, 2014. This decrease primarily reflects the impact of cost reduction programs implemented during the current year.']

In [441]:
#train_para_list[462]

'At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 and 2,843 employees, respectively. None of our employees are covered by a collective bargaining agreement. We consider our relations with our employees to be satisfactory. However, competition for experienced asset management personnel is intense and from time to time we may experience a loss of valuable personnel. We recognize the importance to our business of hiring, training and retaining skilled professionals.'

In [489]:
#train_para_list[1144]

'Our operating expenses have historically been driven in large part by personnel-related costs, including wages, commissions, bonuses, benefits, share-based compensation, and travel. Facility and information technology, or IT, departmental costs are allocated to each department based on usage and headcount. We had a total of 9,832, 9,058, and 8,806 employees as of December 31, 2016, 2015, and 2014, respectively. Our headcount increased by 774 employees, or 9%, in 2016, compared to 2015, primarily in research and development, driven by our 2016 business acquisitions, as well as higher services and sales headcount as we focus on delivering our new products to our customers.'

In [620]:
#train_para_list[1694]

'We had 17,912, 14,533 and 10,625 employees as of December 31, 2014, 2015 and 2016, respectively. The following table sets forth the number of our employees categorized by our areas of operations and as a percentage of our total employees as of December 31, 2016.'
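
Several of the paragraphs above list counts and years in parallel and end with "respectively". One simple way to handle that pattern, sketched here with regular expressions rather than the dependency parse the notebook uses, is to pair the counts and years in order and keep the most recent year. The function names are hypothetical, and the count pattern only recognizes numbers written with a thousands separator.

```python
import re

COUNT_RE = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")   # e.g. 17,912
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")        # e.g. 2016


def pair_counts_with_years(sentence):
    """Pair counts with years in order of appearance, or return []."""
    if "respectively" not in sentence:
        return []
    counts = [int(c.replace(",", "")) for c in COUNT_RE.findall(sentence)]
    years = [int(y) for y in YEAR_RE.findall(sentence)]
    if len(counts) != len(years):
        return []  # ambiguous, so refuse to guess
    return list(zip(years, counts))


def latest(pairs):
    """Return the (year, count) pair with the largest year, or None."""
    return max(pairs) if pairs else None


ascending = pair_counts_with_years(
    "We had 17,912, 14,533 and 10,625 employees as of "
    "December 31, 2014, 2015 and 2016, respectively."
)
descending = pair_counts_with_years(
    "At March 31, 2016, 2015 and 2014, we had 3,066, 2,982 "
    "and 2,843 employees, respectively."
)
```

Pairing by position works whether the years ascend or descend, because both lists follow the same order in a "respectively" sentence.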

In [677]:
#train_para_list[1907]

'The employee cost at Jaguar Land Rover increased by 17.6% to Rs.228,730 million in Fiscal 2016 from Rs.194,467 million in Fiscal 2015. This increase includes an unfavorable foreign currency translation from GBP to Indian rupees of Rs.546 million. In GBP terms, employee costs at Jaguar Land Rover increased to GBP 2,321 million in Fiscal 2016 from GBP1,977 million in Fiscal 2015. The employee cost at Jaguar Land Rover as a percentage to revenue increased to 10.5% in Fiscal 2016 from 9.0% in Fiscal 2015. Due to consistent increases in volumes and to support new launches and product development projects, Jaguar Land Rover increased its average permanent headcount by 19.6% in Fiscal 2016 to 29,789 employees from 24,902 employees in Fiscal 2015. However, the average temporary headcount was flat at 7,216 employees in Fiscal 2016 from 7,225 employees in Fiscal 2015. Total number of permanent employees as at March 31, 2016 was 30,750, as compared to 27,004 as at March 31, 2015 for Jaguar Land 

In [689]:
#train_para_list[2204]

'•  Independent Equipment Dealer Solicitations. This origination channel focuses on soliciting and establishing relationships with independent equipment dealers in a variety of equipment categories located across the United States. Our typical independent equipment dealer has less than $12.0 million in annual revenues and fewer than 50 employees. Service is a key determinant in becoming the preferred provider of financing recommended by these equipment dealers.'

In [696]:
#train_para_list[2460]

'The assets of communications services businesses are primarily their employees, and the Company is highly dependent on the talent, creative abilities and technical skills of its personnel and the relationships its personnel have with clients. The Company believes that its operating companies have established reputations in the industry that attract talented personnel. However, the Company, like all communications services businesses, is vulnerable to adverse consequences from the loss of key employees due to the competition among these businesses for talented personnel. On 31 December 2016, the Group, including all employees of associated undertakings, had approximately 198,000 employees located in over 3,000 offices in 112 countries compared with 190,000 and 179,000 as at 31 December 2015 and 2014, respectively. Excluding all employees of associated undertakings, this figure is 134,341 (2015: 128,123, 2014: 123,621). The average number of employees in 2016 was 132,657 compared to 124

In [697]:
#extract_emp_relations(nlp(train_para_list[2460]), verbose=True)

Word_id is : 0
Word is : employees
Dep_ of EMP_NOUN is: attr
Root is at 2 steps from employees.
No num_tok. 
Word_id is : 0
Word is : employees
Dep_ of EMP_NOUN is: pobj
Root is at 1 steps from employees.
Word_id is : 0
Word is : employees
Dep_ of EMP_NOUN is: pobj
Root is at 1 steps from employees.
No num_tok. 
Word_id is : 1
Word is : employees
Dep_ of EMP_NOUN is: dobj
Num_toks are: [198,000]
Root is at 1 steps from employees.
(Group, had, 198,000, 'Other', 0, employees)
Word_id is : 0
Word is : employees
Dep_ of EMP_NOUN is: pobj
year_emps: [('2014', '128,123'), ('2015', '134,341')]
max year_emps: ('2015', '134,341')
Sentence has multiple years:[(143, 2015), (147, 2014)]
First card subtree is :[134,341, (, 2015, :, 128,123, ,, 2014, :, 123,621, )]
years: [(143, 2015), (147, 2014)]
cards: [134,341, 128,123, 123,621]
emp_counts: [(141, 134,341), (145, 128,123), (149, 123,621)]
Root is at 1 steps from employees.
(figure, is, '134,341', 'Other', 0, employees)
Word_id is : 0
Word is : e

[RelationDetails(sent_num=3, word_num=1, s=Group, v=had, quantity=198,000, quantity_type='Other', type_token=0, word=employees),
 RelationDetails(sent_num=4, word_num=0, s=figure, v=is, quantity='134,341', quantity_type='Other', type_token=0, word=employees),
 RelationDetails(sent_num=5, word_num=0, s=The average number of employees in 2016, v=was, quantity=132,657, quantity_type='Other', type_token=0, word=employees)]
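
In the verbose output above, `year_emps` pairs 2015 with 134,341, while the paragraph text labels 128,123 as the 2015 figure and 123,621 as the 2014 figure. A minimal sketch of reading the "year: count" labels directly from the parenthetical follows. It assumes a regex-only approach, and `parse_labeled_counts` and its `current_year` parameter are hypothetical names introduced for illustration; the year of the leading figure has to be supplied by the caller because the sentence itself does not state it.

```python
import re

# "2015: 128,123" style labels inside a parenthetical
LABELED = re.compile(r"\b((?:19|20)\d{2}):\s*(\d{1,3}(?:,\d{3})*)")
# A comma-grouped count immediately before an opening parenthesis
HEAD = re.compile(r"(\d{1,3}(?:,\d{3})+)\s*\(")

def parse_labeled_counts(sentence, current_year=None):
    """Return (year, count) pairs from 'N (YYYY: N, YYYY: N)' sentences.

    Hypothetical helper; current_year is an assumption supplied by the
    caller for the unlabeled leading figure.
    """
    pairs = list(LABELED.findall(sentence))
    head = HEAD.search(sentence)
    if head and current_year is not None:
        pairs.insert(0, (str(current_year), head.group(1)))
    return pairs

sentence = ("Excluding all employees of associated undertakings, "
            "this figure is 134,341 (2015: 128,123, 2014: 123,621).")
labeled = parse_labeled_counts(sentence, current_year=2016)
```

Passing `current_year=2016` here reflects the reporting date given earlier in the same paragraph (31 December 2016); without it, only the explicitly labeled pairs are returned.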

In [428]:
#train_para_list[392]

"At September 30, 2016, HRG employed 21 persons and HRG's subsidiaries employed approximately 16,000 persons. In the normal course of business, HRG and its subsidiaries use contract personnel to supplement their employee base to meet business needs. As of September 30, 2016, none of HRG's employees were represented by labor unions or covered by collective bargaining agreements. See the remainder of this report for additional information regarding the employees of HRG's subsidiaries. HRG believes that its overall relationship with its employees is good."

In [409]:
#extract_emp_relations(nlp(train_para_list[47]), verbose=True)

Word_id is : 0
Word is : individuals
Dep_ of EMP_NOUN is: dobj
Num_toks are: [7,683]
Root is at 1 steps from individuals.
(we, employed, 7,683, 'Other', 0, individuals)
Word_id is : 0
Word is : persons
Dep_ of EMP_NOUN is: dobj
Num_toks are: [7,536]
Root is at 1 steps from persons.
(subsidiaries, employed, 7,536, 'Other', 0, persons)
Word_id is : 1
Word is : persons
Dep_ of EMP_NOUN is: nsubjpass


[RelationDetails(sent_num=0, word_num=0, s=we, v=employed, quantity=7,683, quantity_type='Other', type_token=0, word=individuals),
 RelationDetails(sent_num=1, word_num=0, s=subsidiaries, v=employed, quantity=7,536, quantity_type='Other', type_token=0, word=persons)]
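
The `quantity` values in the results above still carry thousands separators, so they need converting before they can go into a numeric column of the facts dataframe. A minimal sketch follows. The `RelationDetails` definition here is an assumption reconstructed from the field names in the printed output (the real class is not shown in this section), and `quantity_to_int` is a hypothetical helper.

```python
from collections import namedtuple

# Assumed shape, reconstructed from the printed repr above
RelationDetails = namedtuple(
    "RelationDetails",
    ["sent_num", "word_num", "s", "v", "quantity", "quantity_type",
     "type_token", "word"],
)

def quantity_to_int(quantity):
    """Convert a count such as '7,683' (string or token-like) to an int."""
    return int(str(quantity).replace(",", ""))

# Values copied from the output above; plain strings stand in for spaCy tokens
rows = [
    RelationDetails(0, 0, "we", "employed", "7,683", "Other", 0, "individuals"),
    RelationDetails(1, 0, "subsidiaries", "employed", "7,536", "Other", 0, "persons"),
]
facts = [r._replace(quantity=quantity_to_int(r.quantity)) for r in rows]
```

Calling `str()` before stripping commas lets the same helper accept either a plain string or an object whose string form is the number text.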