<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Explore-labeled-dataframe" data-toc-modified-id="Explore-labeled-dataframe-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Explore labeled dataframe</a></span><ul class="toc-item"><li><span><a href="#funtion:-print_row_detail" data-toc-modified-id="funtion:-print_row_detail-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>funtion: <code>print_row_detail</code></a></span></li></ul></li><li><span><a href="#Exploring-tags-that-match-any-of-the-regex-patterns" data-toc-modified-id="Exploring-tags-that-match-any-of-the-regex-patterns-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring tags that match any of the regex patterns</a></span><ul class="toc-item"><li><span><a href="#Patterns-to-flag-candidate-paragraphs" data-toc-modified-id="Patterns-to-flag-candidate-paragraphs-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Patterns to flag candidate paragraphs</a></span></li><li><span><a href="#Tag-Exploration" data-toc-modified-id="Tag-Exploration-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Tag Exploration</a></span><ul class="toc-item"><li><span><a href="#Example-1" data-toc-modified-id="Example-1-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Example 1</a></span></li><li><span><a href="#Example-2" data-toc-modified-id="Example-2-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Example 2</a></span></li><li><span><a href="#Example-3" data-toc-modified-id="Example-3-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>Example 3</a></span></li><li><span><a href="#Example-4" data-toc-modified-id="Example-4-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>Example 4</a></span></li><li><span><a href="#Example-5" data-toc-modified-id="Example-5-2.2.5"><span class="toc-item-num">2.2.5&nbsp;&nbsp;</span>Example 5</a></span></li></ul></li></ul></li><li><span><a href="#Test-regex" data-toc-modified-id="Test-regex-3"><span 
class="toc-item-num">3&nbsp;&nbsp;</span>Test regex</a></span><ul class="toc-item"><li><span><a href="#function:-check_regex_match" data-toc-modified-id="function:-check_regex_match-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>function: check_regex_match</a></span></li></ul></li><li><span><a href="#Write-functions-to-flag-paragraphs" data-toc-modified-id="Write-functions-to-flag-paragraphs-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Write functions to flag paragraphs</a></span><ul class="toc-item"><li><span><a href="#Test-logic-for-finding-paragraphs-below-flagged-headings" data-toc-modified-id="Test-logic-for-finding-paragraphs-below-flagged-headings-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Test logic for finding paragraphs below flagged headings</a></span></li><li><span><a href="#Find-candidate-paragraphs" data-toc-modified-id="Find-candidate-paragraphs-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Find candidate paragraphs</a></span></li><li><span><a href="#Create-dataframes-and-save-as-CSVs" data-toc-modified-id="Create-dataframes-and-save-as-CSVs-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Create dataframes and save as CSVs</a></span></li></ul></li><li><span><a href="#Debug-odd-findings-to-refine-regex-or-code" data-toc-modified-id="Debug-odd-findings-to-refine-regex-or-code-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Debug odd findings to refine regex or code</a></span><ul class="toc-item"><li><span><a href="#Explore-dataframe-of-paragraphs-for-training-data" data-toc-modified-id="Explore-dataframe-of-paragraphs-for-training-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Explore dataframe of paragraphs for training data</a></span></li></ul></li></ul></div>

# Retrieve candidate paragraphs from HTML files

In [1]:
import numpy as np
import pandas as pd
import re

from bs4 import BeautifulSoup as bs

from path import Path, getcwdu

import glob
import os
from pathlib import PurePath
import copy

import random
import gzip
import shutil

In [2]:
full_path_list = [PurePath(os.getcwd()).joinpath(file).as_posix() for file in glob.iglob('../employee_filings/*.gz')]
full_file_list = [PurePath(file).name for file in glob.iglob('../employee_filings/*.gz')]
full_accession_ids = [PurePath(file).stem.replace('.html', '') for file in full_file_list] 
full_cik_nbrs = [x.split(sep='-')[0] for x in full_accession_ids]
html_path_list = [x.replace('.gz', '') for x in full_path_list]
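
As a hedged illustration of how the accession ID and CIK fall out of a file name, the sketch below wraps the same `PurePath(...).stem` logic in a hypothetical helper, `parse_filing_name`. It assumes file names of the form `<accession_id>.html.gz`, which is what the `.replace('.html', '')` calls above suggest; the sample name reuses an accession number that appears in this notebook's output.

```python
from pathlib import PurePath

def parse_filing_name(file_name):
    """Hypothetical helper: split '<accession_id>.html.gz' into (accession_id, cik)."""
    # .stem drops only the last suffix ('.gz'), so '.html' is removed separately
    accession_id = PurePath(file_name).stem.replace(".html", "")
    # The first dash-separated field is treated as the CIK, as in full_cik_nbrs
    cik = accession_id.split("-")[0]
    return accession_id, cik

acc_id, cik = parse_filing_name("../employee_filings/0000706688-17-000030.html.gz")
print(acc_id, cik)
```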

Read in accession ID lists created from initial data splitting

In [3]:
train_accession_ids = pd.read_csv('../data/train_accession_ids.csv', names=['acc_id'])['acc_id'].tolist()
val_accession_ids = pd.read_csv('../data/val_accession_ids.csv', names=['acc_id'])['acc_id'].tolist()

labeled_df = pd.read_excel('../data/train_val_employee_count_paragraphs.xlsx')
subset_df = pd.read_excel('../data/subset_employee_count_paragraphs.xlsx')

## Explore labeled dataframe

### function: `print_row_detail`

In [4]:
def print_row_detail(df=subset_df, nrow=10, header_list=['ticker', 'accession_number'],
                     detail_list=['data_key_friendly_name', 'data_value', 'reported_units', 'text', 'paragraph_text'],
                     sortby=['accession_number', 'data_key_friendly_name'], ascending=True):
    """Print the header and detail columns for the first `nrow` rows of `df`, sorted by `sortby`."""
    # reset_index so that positions 0..nrow-1 follow the sorted order
    df_sorted = df.sort_values(sortby, ascending=ascending).reset_index()
    nrow = min(len(df_sorted), nrow)
    for i in range(nrow):
        # One banner line per header column
        for h in header_list:
            print('-'*35 + ' ' + str(df_sorted[h][i]) + ' ' + '-'*35)
        # One "name  :value" line per detail column, each followed by a blank line
        for d in detail_list:
            print(d + '  :' + str(df_sorted[d][i]))
            print('')

In [5]:
print_row_detail(labeled_df)

----------------------------------- LNG -----------------------------------
----------------------------------- 0000003570-17-000052 -----------------------------------
data_key_friendly_name  :Full-Time Employees

data_value  :911.0

reported_units  :ones

text  :full-time employees

paragraph_text  :Employees   We had 911 full-time employees at January 31, 2017

----------------------------------- SWKS -----------------------------------
----------------------------------- 0000004127-16-000068 -----------------------------------
data_key_friendly_name  :Other Employees

data_value  :7300.0

reported_units  :ones

text  :employed

paragraph_text  :EMPLOYEES   As of September  30, 2016, we  employed approximately 7,300  employees world-wide. Approximately  860 of  our   employees in  Mexico, 450  employees in  Singapore, and  200 employees  in Japan  are covered  by  collective   bargaining and other union agreements

----------------------------------- UHAL ---------------------------

In [6]:
print_row_detail(df=subset_df, nrow=20)

----------------------------------- AAN -----------------------------------
----------------------------------- 0000706688-17-000030 -----------------------------------
data_key_friendly_name  :Other Employees

data_value  :11500

reported_units  :ones

text  :employees

paragraph_text  :Employees   At December 31, 2016, the Company had approximately 11,500 employees. None of our employees are covered by a   collective bargaining agreement and we believe that our relations with employees are good

----------------------------------- AAON -----------------------------------
----------------------------------- 0000824142-17-000034 -----------------------------------
data_key_friendly_name  :Other Employees

data_value  :1619

reported_units  :ones

text  :employees

paragraph_text  :Employees   As of February 12, 2017, we employed 1,619 permanent employees. Our employees are not represented by  unions

----------------------------------- AAXN -----------------------------------
---------

Only 15 filings are included in `subset_df`.

In [7]:
subset_df.accession_number.unique()

array(['0001090872-16-000082', '0001193125-17-083862',
       '0001193125-17-065791', '0001564590-17-003590',
       '0000706688-17-000030', '0001558370-17-001556',
       '0000824142-17-000034', '0001158449-17-000034',
       '0001193125-17-053796', '0001069183-17-000042',
       '0001104659-17-015892', '0001551152-17-000004',
       '0001140859-16-000022', '0001466815-17-000007',
       '0001628280-16-022122'], dtype=object)

In [8]:
labeled_df.groupby('data_key_friendly_name').text.value_counts()

data_key_friendly_name  text                                             
Full-Time Employees     full-time                                            516
                        full-time employees                                  104
                        full time                                             41
                        employees                                             19
                        full time employees                                   16
                        full-time equivalent                                  16
                        full-time equivalent employees                        16
                        Total                                                 11
                        employed                                               7
                        Full-time                                              4
                        full time equivalent                                   4
                        full-time e

In [9]:
subset_df.groupby('data_key_friendly_name').text.value_counts()

data_key_friendly_name  text                  
Full-Time Employees     full-time                 5
                        full-time Team            1
Other Employees         employees                 5
                        employed                  4
                        Total                     1
                        temporary                 1
Part-Time Employees     part-time                 2
                        part-time Team Members    1
Name: text, dtype: int64

In [5]:
# Create list of paths for subset_df
subset_file_list = [PurePath(os.getcwd()).joinpath('../employee_filings/').joinpath(file) for file in full_file_list if PurePath(file).stem.replace('.html', '') in subset_df.accession_number.unique().tolist()]

In [10]:
train_accession_ids.index('0000034088-17-000017')

54

## Exploring tags that match any of the regex patterns

### Patterns to flag candidate paragraphs

Modular components make it easier to organize (and update) regular expressions.

In [33]:
employee_terms = "(associates|employees|full[ -]time[ -]equivalent(s)?|staff|team members|workers)"
person_terms = "(individuals|people|persons)"
# The trailing "s" is optional so that both "employee count" and "employee counts" match
workforce_terms = "(((employee|employment|head|personnel|staff|worker|workforce) (count(s)?|level(s)?|total(s)?)+)|(head-count|headcount|workforce))"
employee_type_terms = "(full time|full-time|permanent|part time|part-time|regular|seasonal|temporary|total)"

numeral_pat = "(([0-9]{1,3},)*[0-9]{1,3}([.][0-9])?)"
rel_qualifiers = "(a total of|approximately|in aggregate|in total|(an|the) equivalent of|total)"

space_pat = "( |\n)"

magnitude_words = "(hundred|thousand|million|billion)" 
num_words = "(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen)"

In [34]:
num_pat = "".join(["((",numeral_pat, "|", num_words, ")(", space_pat, magnitude_words, ")?", ")"])

num_emps_pat = "".join([num_pat, space_pat, "(", rel_qualifiers, space_pat, ")*",  "(", employee_type_terms, space_pat, ")*"])

#number_employees_pat = "".join(['r"',num_emps_pat, employee_terms, '"'])
number_employees_pat = "".join([num_emps_pat, employee_terms])
employed_num_pat = "".join(["employ((ed|s)?)?", space_pat, "(", rel_qualifiers, space_pat, ")*", 
                        num_emps_pat])
emp_type_emp_term_pat = "".join([employee_type_terms, space_pat, employee_terms])
employed_end_span_pat = "".join(["employed(", space_pat, rel_qualifiers, ")*$"])
span_start_employees_pat = "".join(["^", "(", rel_qualifiers, space_pat, ")*",  
                                   "(", employee_type_terms, space_pat, ")*", 
                                   employee_terms])

emp_pat_list = [number_employees_pat, employed_num_pat, emp_type_emp_term_pat, 
                employed_end_span_pat, span_start_employees_pat, workforce_terms]
emp_pats = [re.compile(x, re.I) for x in emp_pat_list]
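
To make the composition easier to follow, here is a minimal sketch that assembles one pattern from reduced building blocks and checks it against a few strings. The shortened term lists and the names `space`, `numeral`, `qualifier`, `emp_type` and `emp_term` are illustrative assumptions, not the notebook's full patterns.

```python
import re

# Reduced, hypothetical versions of the building blocks defined above
space = r"( |\n)"
numeral = r"(([0-9]{1,3},)*[0-9]{1,3})"
qualifier = r"(approximately|a total of)"
emp_type = r"(full-time|part-time|temporary)"
emp_term = r"(employees|associates|team members)"

# A number, then optional qualifiers, optional type words, then an employee term
number_employees = "".join(
    [numeral, space, "(", qualifier, space, ")*", "(", emp_type, space, ")*", emp_term]
)
pat = re.compile(number_employees, re.I)

samples = [
    "we employed 2,776 full-time employees",
    "We had 911 full-time employees at January 31, 2017",
    "Workers' Compensation",
]
matches = [bool(pat.search(s)) for s in samples]
print(matches)
print(pat.search(samples[0]).group(0))
```

Because each component is a plain string, widening a term list (for example adding another employee type) changes every composed pattern that uses it.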

In [46]:
sum(labeled_df.paragraph_text.str.contains(re.compile("workforce")))

115

In [48]:
print_row_detail(labeled_df[labeled_df.paragraph_text.str.contains(re.compile("workforce"))], nrow=115)

----------------------------------- TREC -----------------------------------
----------------------------------- 0000007039-17-000009 -----------------------------------
data_key_friendly_name  :Other Employees

data_value  :310.0

reported_units  :ones

text  :employees

paragraph_text  :Personnel   The number of  our regular, U.S.  based employees was  approximately 310, 296,  and 271 for  the years  ended   December 31,  2016,  2015, and  2014,  respectively.  Of these  employees,  none are  covered  by  collective   bargaining agreements. Regular employees are defined as active executive, management, professional, technical   and wage employees who work full time or part time  for the Company and are covered by our benefit plans  and   programs. Our workforce has increased primarily due to expansions at both facilities

----------------------------------- BA -----------------------------------
----------------------------------- 0000012927-17-000006 --------------------------------


data_value  :24000.0

reported_units  :ones

text  :employees

paragraph_text  :Employees and Labor   As of  December 31,  2016, we  had approximately  24,000  employees. The  majority of  the employees  of  our   subsidiaries outside of the United States are subject  to the terms of individual employment agreements.  One   of our wholly owned subsidiaries has approximately 1,600  employees in the United Kingdom, a portion of  whom   are members of the Unite trade union. Employees  of our subsidiaries in Vienna, Austria; Frankfurt,  Germany;   and Nuernberg, Germany are also represented by local work councils. The Vienna workforce and a portion of the   Frankfurt workforce are  also covered by  a union contract.  Certain employees of  our Korean subsidiary  are   represented by a Labor-Management council. In Brazil, all employees are unionized and covered by the terms of   industry-specific collective agreements. Employees in certain other  countries are also covered by the  terms   o

In [9]:
emp_pat_list_old = [r"^(Employees|Team Members)$",
r"([0-9]{1,3},)*[0-9]{1,3}( |\n)((permanent|full-time|part-time|temporary|total)( |\n))*(employees|people|team members|members)",
r"employ((ed|s)?)?( |\n)(approximately( |\n))?([0-9]{1,3},)*[0-9]{1,3}( |\n)((permanent|full-time|part-time|temporary)( |\n))?(employees|people|team members|members|persons|associates)", 
r"employed(( |\n)approximately)?$", 
r"Total workforce",
r"((full|full-time|permanent|part|part-time|regular|seasonal|temporary|time)( |\n))+(employees|team members|associates)",
r"^((permanent|full|part|time|full-time|part-time|temporary|total)( |\n))*(employees|team members|associates)"]
emp_pats_old = [re.compile(x, re.I) for x in emp_pat_list_old]

Regex for identifying "block" HTML elements. This is useful for finding actual paragraph boundaries, or the first *n* paragraphs or tables immediately after a section heading.

In [10]:
block_re = re.compile(r"^(p|div|table)$")

In [11]:
def show_str_context(search_str: str, in_string: str, char_before: int=500, char_after: int=2000, match_num: int=0):
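    """Return in_string around the (match_num + 1)-th occurrence of search_str.

    Up to char_before characters before and char_after characters after the
    match are included. If search_str is absent, str.find returns -1 and the
    slice falls back to the start of in_string.
    """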
    sub_idx = in_string.find(search_str) 
    len_offset = len(search_str)
    if match_num:
        idx = sub_idx
        for i in range(match_num):
            idx = in_string.find(search_str, idx + len_offset)
        sub_idx = idx
    context_beg = max(sub_idx - char_before, 0)
    context_end = sub_idx + len_offset + char_after
    return in_string[context_beg:context_end]   
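
When the search string is missing, `show_str_context` still returns text (the beginning of `in_string`), which can be mistaken for a hit. The sketch below is a hypothetical variant, `show_str_context_safe`, that returns an empty string in that case; it is an assumption about what a stricter version could look like and it drops the `match_num` argument for brevity.

```python
def show_str_context_safe(search_str, in_string, char_before=500, char_after=2000):
    """Hypothetical variant: return '' when search_str is not found."""
    sub_idx = in_string.find(search_str)
    if sub_idx == -1:
        # Make a miss explicit instead of slicing from the start of the text
        return ""
    context_beg = max(sub_idx - char_before, 0)
    context_end = sub_idx + len(search_str) + char_after
    return in_string[context_beg:context_end]

text = "<p>Intro</p><p>We had 911 full-time employees.</p><p>Outro</p>"
found = show_str_context_safe("911", text, char_before=7, char_after=10)
missing = show_str_context_safe("not present", text)
print(repr(found), repr(missing))
```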

### Tag Exploration

#### Example 1

In [35]:
with gzip.open(subset_file_list[2], mode='rt', encoding="utf8") as file:
    file1_html = file.read()
    soup1 = bs(file1_html, 'lxml')
# Pass the pattern list directly; the extra [...] nesting is unnecessary
soup1_emp_count = soup1.find_all(string=emp_pats)

In [36]:
len(soup1_emp_count)

9
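
The examples in this section read some filings through `gzip.open` and others through plain `open`. A minimal sketch of a single reader that handles both is shown below; `read_filing` is a hypothetical helper, and choosing the opener from the `.gz` suffix is an assumption about how the files are named.

```python
import gzip
import os
import tempfile

def read_filing(path, encoding="utf8"):
    """Hypothetical helper: read a filing whether or not it is gzip-compressed."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", encoding=encoding) as fh:
        return fh.read()

# Round trip on a throwaway file so the sketch runs without the real filings
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "sample.html.gz")
    with gzip.open(gz_path, mode="wt", encoding="utf8") as fh:
        fh.write("<p>We had 911 full-time employees.</p>")
    html_text = read_filing(gz_path)

print(html_text)
```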

In [70]:
emp1_tag = soup1_emp_count[0]

In [71]:
for i, v in enumerate(emp1_tag.parent.find_next_siblings(block_re, limit=6)):
    print(i)
    print(v)

0
<div class="c88"><span class="c16">As of December 31,</span> <span class="c16">2016</span><span class="c16">,
we had</span> <span class="c16">699</span> <span class="c16">full-time employees and</span> <span class="c16">202</span> <span class="c16">temporary employees. The breakdown of our full-time employees by department
is as follows:</span> <span class="c16">175</span> <span class="c16">direct manufacturing employees and</span>
<span class="c16">524</span> <span class="c16">administrative and manufacturing support employees. Of
the</span> <span class="c16">524</span> <span class="c16">administrative and manufacturing support
employees,</span> <span class="c16">213</span> <span class="c16">were involved in sales, marketing,
communications and training. Of the</span> <span class="c16">202</span> <span class="c16">temporary employees,
more than</span> <span class="c16">92%</span> <span class="c16">worked in direct manufacturing roles. Our
employees are not covered by any collective 

#### Example 2

In [37]:
with gzip.open(subset_file_list[12], mode='rt', encoding="utf8") as file:
    file2_html = file.read()
    soup2 = bs(file2_html, 'lxml')
# Pass the pattern list directly; the extra [...] nesting is unnecessary
soup2_emp_count = soup2.find_all(string=emp_pats)

In [38]:
len(soup2_emp_count)

7

In [39]:
soup2_emp_count

['As of December 31, 2016, we had a total of 276 employees working in the\nR&D department, including 16 with Ph.D. degrees. We continue to recruit talented engineers to further\nenhance our research and development capabilities. We have research and development departments in our\nfacilities in Texas, Georgia, China and Taiwan. Our research and development teams collaborate on joint\nprojects, and by co-locating with our manufacturing operations enable us to achieve an efficient cost structure\nand improve our time to market.',
 'Employees',
 'As of December 31, 2016, we employed 2,776 full-time employees, of which 31\nheld Ph.D. degrees in a science or engineering field. Of our employees, 287 are located in the U.S., 1,218 are\nlocated in Taiwan and 1,271 are located in China. None of our employees are represented by any collective\nbargaining agreement, but certain employees of our China subsidiary are members of a trade union. We have never\nsuffered any work stoppage as a result of

In [78]:
soup2_emp_count

['As of December 31, 2016, we had a total of 276 employees working in the\nR&D department, including 16 with Ph.D. degrees. We continue to recruit talented engineers to further\nenhance our research and development capabilities. We have research and development departments in our\nfacilities in Texas, Georgia, China and Taiwan. Our research and development teams collaborate on joint\nprojects, and by co-locating with our manufacturing operations enable us to achieve an efficient cost structure\nand improve our time to market.',
 'Employees',
 'As of December 31, 2016, we employed 2,776 full-time employees, of which 31\nheld Ph.D. degrees in a science or engineering field. Of our employees, 287 are located in the U.S., 1,218 are\nlocated in Taiwan and 1,271 are located in China. None of our employees are represented by any collective\nbargaining agreement, but certain employees of our China subsidiary are members of a trade union. We have never\nsuffered any work stoppage as a result of

In [19]:
emp_tag2 = soup2_emp_count[1]

In [20]:
emp_tag2.parent

<span class="c58">Employees</span>

In [21]:
emp2_tag = soup2_emp_count[1].parent

In [22]:
soup2_emp_count_body = soup2.body.find_all(string=emp_pats)

In [23]:
soup2.body.text.find('bargaining agreements. We believe that our relationship with our employees is good.') 

-1

In [24]:
search_str = 'bargaining agreements. We believe that our relationship with our employees is good.'
show_str_context(search_str=search_str, in_string= file2_html)

'<!--HTML document created with Merrill Bridge  6.4.50.0-->\n<!--Created on: 3/9/2017 4:02:22 PM-->\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="generator" content=\n"HTML Tidy for HTML5 (experimental) for Linux https://github.com/w3c/tidy-html5/tree/68a9e74" />\n<title>aaoi_Current_Folio_10K</title>\n<title>aaoi_Ex10_6</title>\n<title>aaoi_Ex23_1</title>\n<title>aaoi_Ex31_1</title>\n<title>aaoi_Ex31_2</title>\n<title>aaoi_Ex32_1</title>\n\n<style type="text/css">\n/*<![CDATA[*/\n td.c1420 {width:40.00%;padding:0pt;}\n td.c1419 {width:00.96%;padding:0pt;}\n td.c1418 {width:40.00%;border-bottom:1pt solid #000000 ;padding:0pt;}\n p.c1417 {display: inline; font-family: Arial,Helvetica,sans-serif; font-size: 10pt; line-height: 100%; margin: 0pt 0pt 0pt 7.2pt; text-indent: -7.2pt}\n p.c1416 {margin:0pt 0pt 0pt 27pt;line-height:100%;font-family:Arial,Helvetica,sans-serif;font-size: 10pt;}\n p.c1415 {color: #0563C1; display: inline; font-family: Arial,Hel

In [25]:
len(soup2.body.text)

471394

Examples of employee count data in HTML

view-source:file:///C:/projects/DSBC/capstone/subset_filings/0001090872-16-000082.html

#### Example 3

In [40]:
with gzip.open(subset_file_list[11], mode='rt', encoding="utf8") as file:
    file3_html = file.read()
    soup3 = bs(file3_html, 'lxml')
# Pass the pattern list directly; the extra [...] nesting is unnecessary
soup3_emp_count = soup3.find_all(string=emp_pats)

In [41]:
len(soup3_emp_count)

4

In [42]:
soup3_emp_count#[0].find_parent(block_re).next_sibling.next_sibling

['Employees',
 "AbbVie employed approximately 30,000 persons as of January 31, 2017. Outside the United\nStates, some of AbbVie's employees are represented by unions or works councils. AbbVie believes that it has\ngood relations with its employees.",
 "AbbVie's historical financial statements for periods prior to January 1, 2013 reflected an\nallocation of expenses related to certain Abbott corporate functions, including senior management, legal, human\nresources, finance, information technology and quality assurance. These expenses were allocated to AbbVie based\non direct usage or benefit where identifiable, with the remainder allocated on a pro rata basis of revenues,\nheadcount, square footage, number of transactions or other measures. AbbVie considers the expense allocation\nmethodology and results to be reasonable. However, the allocations may not be indicative of the actual expenses\nthat would have been incurred had AbbVie operated as an independent, stand-alone, publicly-trade

In [29]:
soup3_emp_count[0].parent.next_sibling.next_sibling

<div class="c77">AbbVie employed approximately 30,000 persons as of January 31, 2017. Outside the United
States, some of AbbVie's employees are represented by unions or works councils. AbbVie believes that it has
good relations with its employees.</div>

In [30]:
soup3_emp_count[0].find_parent(name=block_re)#.parent#.find_next_siblings(block_re, limit=6)

<div class="c74">Employees</div>

In [31]:
soup3_emp_count[0].find_parent(name=block_re).next_sibling.next_sibling

<div class="c77">AbbVie employed approximately 30,000 persons as of January 31, 2017. Outside the United
States, some of AbbVie's employees are represented by unions or works councils. AbbVie believes that it has
good relations with its employees.</div>
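
The doubled `.next_sibling` above is most likely needed because the whitespace between two tags is parsed as its own text node. The sketch below illustrates this on a small hand-written snippet (an assumption modeled on the output above, parsed with the built-in `html.parser` rather than `lxml`) and shows `find_next_sibling`, which skips text nodes.

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="c74">Employees</div>
<div class="c77">AbbVie employed approximately 30,000 persons.</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")
block_re = re.compile(r"^(p|div|table)$")

heading = soup.find(string="Employees").find_parent(name=block_re)

# The immediate sibling is the newline between the two <div> tags
first_sibling = heading.next_sibling
# find_next_sibling only considers tags whose name matches block_re
next_block = heading.find_next_sibling(block_re)

print(repr(str(first_sibling)))
print(next_block.get_text())
```

Using `find_next_sibling` (or `find_next_siblings` with a `limit`) avoids hard-coding how many whitespace nodes sit between blocks, which varies from filing to filing.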

#### Example 4

In [116]:
with open(html_path_list[4], encoding="utf8") as file:
    file4_html = file.read()
    soup4 = bs(file4_html, 'lxml')
# Pass the pattern list directly; the extra [...] nesting is unnecessary
soup4_emp_count = soup4.find_all(string=emp_pats)

In [117]:
len(soup4_emp_count)

5

In [119]:
soup4_emp_count[1]

"Workers' Compensation):"

#### Example 5

In [120]:
with open(html_path_list[24], encoding="utf8") as file:
    file5_html = file.read()
    soup5 = bs(file5_html, 'lxml')
# Pass the pattern list directly; the extra [...] nesting is unnecessary
soup5_emp_count = soup5.find_all(string=emp_pats)

In [121]:
len(soup5_emp_count)

7

In [37]:
soup5_emp_count

['Employees',
 ', we employed approximately 8,400 people in R&D and related support\nactivities, including a substantial number of physicians, scientists holding graduate or postgraduate degrees\nand higher-skilled technical personnel.',
 'Employees',
 'We have approximately 25,000 employees as of December 31,',
 'BMS sponsors defined benefit pension plans, defined contribution plans and\ntermination indemnity plans for regular full-time employees. The principal defined benefit pension plan is the\nBristol-Myers Squibb Retirement Income Plan, covering most U.S. employees and representing approximately',
 'In the\nevent of your Retirement prior to settlement of the Performance Share Units, you will be deemed vested in a\nprorated portion of the Performance Share Units granted, provided that you have been employed',
 'during the Non-Competition and Non-Solicitation Period, employ, solicit for employment,\nsolicit, induce, encourage, or participate in soliciting, inducing or encouraging a

In [38]:
soup5_emp_count[2].find_parent(block_re).next_sibling.next_sibling.next_sibling.next_sibling

<div class="c71"><span class="c73">We have approximately 25,000 employees as of December 31,</span>
<span class="c73">2016</span><span class="c73">.</span></div>

In [39]:
soup5_emp_count[2].find_parent(block_re).find_next_siblings(block_re, limit=2)[1].find(string=emp_pats)

'We have approximately 25,000 employees as of December 31,'

## Test regex

###  function: check_regex_match

In [12]:
def check_regex_match(pattern, text_list):
    for idx, s in enumerate(text_list):
        mo = re.search(pattern, s)
        if mo:
            ms = mo.span()[1]
            print("------    " + str(idx) + "   Matched!    -----")
            print('str length  :' + str(len(s)) + '    match span  :' + str(ms))
            print(s[:ms])
            print('')
            print(s[ms:])
            print(re.search(pattern, s))
        else:
            print("------    " + str(idx) + "  NO MATCH    -----")
            print(s)

## Write functions to flag paragraphs

Notes:
- If `^Employees$` matches a heading, scan the following sibling block elements for one that matches another pattern.
- Treat tables as "matches" when they are siblings of an Employees heading block.

A separate regex is needed to capture heading tags.

In [13]:
head_block_re = re.compile(r"^(p|div|h[1-6])$")

In [49]:
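# The optional prefix includes its trailing whitespace, so "Our Employees" and "Number of Employees" match
emp_head_raw = re.compile(r"[>](Our\s+|Number\s+of\s+)?(Associates|Employees|Headcount|Personnel|Team Members|Workforce)[<]", re.I)
emp_head = re.compile(r"^(Our\s+|Number\s+of\s+)?(Associates|Employees|Headcount|Personnel|Team Members|Workforce)$", re.I)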

### Test logic for finding paragraphs below flagged headings

In [42]:
emp2_tag

<span class="c58">Employees</span>

In [43]:
file2_html.find('>Employees<')

342239

In [44]:
re.search(emp_head, file2_html)
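
The call above shows no output, presumably because `^` and `$` anchor to the whole string when `re.MULTILINE` is not set, so an anchored heading pattern cannot match inside a full HTML document. The sketch below uses reduced, hypothetical versions of the two heading patterns to show the difference between searching raw HTML and testing a single text node.

```python
import re

# Reduced, hypothetical versions of emp_head and emp_head_raw
head = re.compile(r"^(Associates|Employees|Headcount)$", re.I)
head_raw = re.compile(r"[>](Associates|Employees|Headcount)[<]", re.I)

html = '<div class="c74">Employees</div>\n<div>We had 911 employees.</div>'

whole_doc = head.search(html)        # anchored pattern against the full document
raw_hit = head_raw.search(html)      # delimiter-based pattern against the full document
node_hit = head.search("Employees")  # anchored pattern against one text node

print(whole_doc, raw_hit.group(0), node_hit.group(0))
```

This is why the raw pattern suits a cheap pre-check on the file text, while the anchored pattern suits `find`/`find_all(string=...)`, where each text node is tested separately.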

In [45]:
file2_html.count('\n',0,86339)

459

In [46]:
soup2.body.find(string=emp_head).find_parent(
    name=block_re).next_sibling.next_sibling.next_sibling.next_sibling

<p class="c77"><span class="c32">As of December 31, 2016, we employed 2,776 full-time employees, of which 31
held Ph.D. degrees in a science or engineering field. Of our employees, 287 are located in the U.S., 1,218 are
located in Taiwan and 1,271 are located in China. None of our employees are represented by any collective
bargaining agreement, but certain employees of our China subsidiary are members of a trade union. We have never
suffered any work stoppage as a result of an employment related strike or any employee related dispute and
believe that we have satisfactory relations with our employees.</span></p>

In [47]:
soup5.find_all(string=emp_head)[1].find_parent(block_re).find_next_siblings(block_re, limit=5)#[1].find(string=emp_pats)
#.next_sibling.next_sibling#.next_sibling.next_sibling


[<div class="c106"></div>,
 <div class="c71"><span class="c73">We have approximately 25,000 employees as of December 31,</span>
 <span class="c73">2016</span><span class="c73">.</span></div>,
 <div><a id="sD00DD6DDD08DA54E10480FFA5CAA1F4F" name="sD00DD6DDD08DA54E10480FFA5CAA1F4F"></a></div>,
 <div class="c1"></div>,
 <div class="c70">Foreign Operations</div>]
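
The heading-then-siblings logic tested above can be sketched end to end on a small snippet. The HTML below is hand-written to resemble the output just shown, the patterns are reduced illustrations, and `html.parser` is used so the sketch does not depend on `lxml`; none of this is the notebook's exact code.

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="c70">Employees</div>
<div class="c106"></div>
<div class="c71"><span>We have approximately 25,000 employees as of December 31,</span>
<span>2016</span><span>.</span></div>
<div class="c70">Foreign Operations</div>
</body>"""

block_re = re.compile(r"^(p|div|table)$")
emp_head = re.compile(r"^(Employees|Headcount|Workforce)$", re.I)
count_pat = re.compile(r"([0-9]{1,3},)*[0-9]{1,3}( |\n)employees", re.I)

soup = BeautifulSoup(html, "html.parser")
heading = soup.find(string=emp_head).find_parent(name=block_re)

# Keep only the sibling blocks that contain a string matching the count pattern
flagged = [
    block.get_text()
    for block in heading.find_next_siblings(block_re, limit=6)
    if block.find(string=count_pat) is not None
]
print(flagged)
```

The `limit` keeps the scan close to the heading, so empty spacer blocks are tolerated without wandering far into the next section.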

Lists to accumulate flagged items; these are later turned into dataframes.

### Find candidate paragraphs

In [50]:
acc_id_list, para_list_orig, tag_list = [], [], []
emp_head_list, emp_head_first_list = [], []
tbl_acc_id_list, tbl_tag_list = [], []

In [51]:
for i, fl in enumerate(html_path_list[:557]):
    acc_id = PurePath(fl).stem.replace('.html', '')
    tag_set = set();
    with open(fl, encoding="utf8") as file: 
        file_html = file.read()
        soup = bs(file_html, 'lxml')
#        emp_head_flag = False
#        emp_head_first_match = False
        if re.search(emp_head_raw, file_html):
            for ihead, hblock in enumerate(soup.body.find_all(string=emp_head, limit=4)):
                try:
                    emp_head_tag = hblock.find_parent(name=head_block_re)
                    if emp_head_tag.name != 'table' and emp_head_tag.find_parent('table') is None:
                        emp_head_matched = False
            #            print(emp_head_tag.name) ;print(emp_head_tag)
                        for i2, block in enumerate(emp_head_tag.find_next_siblings(block_re, limit=6)):
                            #print('Block sibling number: ' + str(i2))
                            if block.find(string=emp_pats) is not None and block.name != 'table':
                                block_tag = copy.copy(block)
                                if block_tag not in tag_set:
                                    acc_id_list.append(acc_id) 
                                    tag_list.append(block_tag)
                                    para_list_orig.append(block_tag.get_text())
                                    tag_set.add(block_tag)
                                    emp_head_list.append(True)
                                    if not emp_head_matched:
                                        emp_head_flag = True
                                        emp_head_matched = True
                                        emp_head_first_list.append(True)
                                    else:
                                        emp_head_first_list.append(False)
                            if block.find('table') != None:
            #                    print('Found table match!')
                                tbl_block_tag = copy.copy(block)
                                if tbl_block_tag not in tag_set:
                                    tbl_acc_id_list.append(acc_id) 
                                    tbl_tag_list.append(tbl_block_tag)
                                    tag_set.add(tbl_block_tag)
                except Exception:
                    # e.g. find_parent returned None for this heading; skip it
                    continue
                        #tbl_df = pd.read_html(block.find('table').prettify(), tupleize_cols=True)[0].dropna(axis=1,how='all')
    #                    print(tbl_df); print(tbl_df.info()); print(block)
#        else:
#            print('No Employees header')
        soup_emp_count = soup.body.find_all(string=emp_pats)
        soup_emp_paras = [x.find_parent(name=block_re) for x in soup_emp_count]
        soup_emp_paras = [x for x in soup_emp_paras if x is not None]
        for i2, block in enumerate(soup_emp_paras):
#                print('Para number: ' + str(i2)); print(block)
            block_tag = copy.copy(block)
            if block_tag not in tag_set:
                if block.find('table') != None:
                    tbl_acc_id_list.append(acc_id) 
                    tbl_tag_list.append(block_tag)
                    tag_set.add(block_tag)
                else:
                    acc_id_list.append(acc_id) 
                    tag_list.append(block_tag)
                    tag_set.add(block_tag)
                    para_list_orig.append(block_tag.get_text())
                    emp_head_list.append(False)
                    emp_head_first_list.append(False)
#emp2_tag.find_next_siblings(block_re, limit = 1, string=False)[0].find(string=[emp_pats]).parent
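
The bookkeeping in the loop above is hard to follow at that indentation depth. Here is a minimal sketch of the same idea, using plain strings in place of BeautifulSoup tags. The function name `collect_blocks` and the sample data are hypothetical and not part of the notebook. Each block is recorded once, and only the first block kept in a run gets the "first" flag.

```python
def collect_blocks(blocks, seen):
    """Record each unseen block once and flag only the first kept block.

    `blocks` stands in for the sibling tags that matched the employee
    patterns; `seen` plays the role of `tag_set` in the notebook.
    """
    kept, first_flags = [], []
    matched = False  # mirrors emp_head_matched
    for block in blocks:
        if block in seen:
            continue  # already collected, so nothing is appended anywhere
        seen.add(block)
        kept.append(block)
        first_flags.append(not matched)  # True only for the first kept block
        matched = True
    return kept, first_flags


seen = {"c"}  # "c" was collected earlier
kept, first_flags = collect_blocks(["a", "b", "a", "c"], seen)
print(kept)         # ['a', 'b']
print(first_flags)  # [True, False]
```

Appending to every list inside the same branch is what keeps the parallel lists the same length.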

In [52]:
print(len(set(acc_id_list)))
print(len(acc_id_list))
print(len(para_list_orig))
print(len(set(para_list_orig)))
print(len(tag_list))
print(len(emp_head_list))
print(len(emp_head_first_list))
print(len(tbl_acc_id_list))
print(len(tbl_tag_list))

555
3775
3775
3237
3775
3775
3775
57
57
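
The counts above show that the paragraph-level lists all have 3775 entries and the table-level lists both have 57, which the DataFrame constructors below depend on. They also show 3237 unique paragraph texts, so 538 rows repeat a text that already appears. The length comparison could be automated with a small helper. This is a sketch under the assumption that a mismatch should stop the run; `check_parallel_lengths` and the toy lists are hypothetical.

```python
def check_parallel_lengths(**named_lists):
    """Return the shared length, or raise ValueError if the lists differ."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy stand-ins for acc_id_list, para_list_orig and emp_head_list
acc_ids = ["id-1", "id-1", "id-2"]
paras = ["first paragraph", "second paragraph", "third paragraph"]
head_flags = [True, False, True]

n_rows = check_parallel_lengths(acc_ids=acc_ids, paras=paras, head_flags=head_flags)
print(n_rows)  # 3
```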


### Create dataframes and save as CSVs

In [53]:
#### Make dataframe
tbl_html_df = pd.DataFrame(data = { 'acc_id': tbl_acc_id_list, 'tbl_html': tbl_tag_list, 'split' : 'train' })
tbl_html_df.loc[tbl_html_df.acc_id.isin(val_accession_ids),'split'] = 'val'

In [54]:
tbl_html_df.loc[:,'idx'] = tbl_html_df.index.values

tbl_html_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 4 columns):
acc_id      57 non-null object
split       57 non-null object
tbl_html    57 non-null object
idx         57 non-null int64
dtypes: int64(1), object(3)
memory usage: 1.9+ KB


In [55]:
#### Write to csv for later use
tbl_html_df.to_csv('data/tbl_html_df_1.csv')

In [56]:
paragraph_input_dict = {'acc_id' : acc_id_list, 
                        'para_text' : [p.replace('\n', ' ').strip().replace(' ,', ',') for p in para_list_orig],
                        'len' : [len(p) for p in para_list_orig],  # length of the original, uncleaned text
                        'emp_header' : emp_head_list,
                        'first_emp_head_block' : emp_head_first_list,
                        'para_text_orig' : para_list_orig, 
                        'para_tag' : tag_list, 
                        'split' : 'train', 
                        'label' : 0 }
#paragraph_input_df['para_text'] = paragraph_input_df.para_text_orig.replace('\n', ' ')
p_columns = ['acc_id', 'para_text', 'len', 'emp_header', 'first_emp_head_block', 'para_text_orig',
             'para_tag', 'split', 'label']

paragraph_input_df = pd.DataFrame(paragraph_input_dict, columns=p_columns)


paragraph_input_df.loc[paragraph_input_df.acc_id.isin(val_accession_ids),'split'] = 'val'

train_df = paragraph_input_df[paragraph_input_df.split == 'train']
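
The `para_text` column is built with three chained string operations. Pulled out into a function (the name `clean_para` and the sample sentence are made up for illustration), the steps are easier to test in isolation:

```python
def clean_para(text: str) -> str:
    """Apply the same three steps used for the para_text column."""
    return (
        text.replace('\n', ' ')   # join lines that were wrapped in the HTML
            .strip()              # drop leading and trailing whitespace
            .replace(' ,', ',')   # remove the space left before a comma
    )


raw = "  As of June 30 , 2020 ,\nwe had 120 employees.\n"
cleaned = clean_para(raw)
print(repr(cleaned))  # 'As of June 30, 2020, we had 120 employees.'
```

Note that the `len` column is computed from `para_list_orig`, so it can be larger than the length of the cleaned text.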

In [57]:
paragraph_input_df.loc[:,'idx'] = paragraph_input_df.index.values

paragraph_input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3775 entries, 0 to 3774
Data columns (total 10 columns):
acc_id                  3775 non-null object
para_text               3775 non-null object
len                     3775 non-null int64
emp_header              3775 non-null bool
first_emp_head_block    3775 non-null bool
para_text_orig          3775 non-null object
para_tag                3775 non-null object
split                   3775 non-null object
label                   3775 non-null int64
idx                     3775 non-null int64
dtypes: bool(2), int64(3), object(5)
memory usage: 243.4+ KB


In [58]:
paragraph_input_df.head()

| | acc_id | para_text | len | emp_header | first_emp_head_block | para_text_orig | para_tag | split | label | idx |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000003570-17-000052 | We had 911 full-time employees at January 31, ... | 53 | True | True | We had 911 full-time\nemployees at January 31,... | `<div class="c93"><span class="c14">We had</spa...` | val | 0 | 0 |
| 1 | 0000003570-17-000052 | Employees | 9 | False | False | Employees | `<div class="c91">Employees</div>` | val | 0 | 1 |
| 2 | 0000003570-17-000052 | Our total operating costs and expenses increas... | 778 | False | False | Our total operating costs and expenses increas... | `<div class="c93"><span class="c14">Our total o...` | val | 0 | 2 |
| 3 | 0000003570-17-000052 | Blackstone Group disclosed that NCR Corporatio... | 1826 | False | False | Blackstone Group disclosed that NCR Corporatio... | `<div class="c93"><span class="c14">Blackstone ...` | val | 0 | 3 |
| 4 | 0000004127-16-000068 | As of September 30, 2016, we employed approxim... | 245 | True | True | As of September 30, 2016,\nwe employed approxi... | `<div class="c80"><span class="c32">As of</span...` | train | 0 | 4 |


In [59]:
paragraph_input_df.to_csv('data/paragraph_input_df_1.csv')

## Debug odd findings to refine regex or code

The function below prints the results of the paragraph search for a single file.

In [62]:
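def print_parse_objects(file: str, use_acc_id: bool=True, acc_id_list = full_accession_ids, 
                       file_path_list = html_path_list):
    """Print header, paragraph and table matches for one filing (accession ID or file path)."""
    fl = file
    if use_acc_id:
        # Look the path up through the arguments rather than the globals
        fl = file_path_list[acc_id_list.index(file)]
        print('Index number: ')
        print(acc_id_list.index(file))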
    print('File path: ')
    print(fl)
    #for i, fl in enumerate(html_path_list[303:304]):
    with open(fl, encoding="utf8") as f:  # avoid shadowing the `file` argument
        file_html = f.read()
        soup = bs(file_html, 'lxml')
        emp_head_flag = False
        emp_head_first_match = False
        if re.search(emp_head_raw, file_html):
            for ihead, hblock in enumerate(soup.body.find_all(string=emp_head)):
                emp_head_tag = hblock.find_parent(name=head_block_re)

                if emp_head_tag is not None and emp_head_tag.name != 'table' and emp_head_tag.find_parent('table') is None:
                    emp_head_matched = False
                    print(ihead)
                    try:
                        print(emp_head_tag.name) 
                    except Exception:
                        print('emp_head_tag has no name')
                    print(emp_head_tag)
                    for i2, block in enumerate(emp_head_tag.find_next_siblings(block_re, limit=6)):
                        print('Block sibling number: ' + str(i2))
                        if block.find(string=emp_pats) is not None:
                            block_tag = copy.copy(block)
                            print('Found match!'); print(block_tag) ; print(block_tag.get_text())
                        if block.find('table') is not None:
                            print('Found table match!')
                            tbl_block_tag = copy.copy(block)
                            tbl_df = pd.read_html(block.find('table').prettify())[0].dropna(axis=1,how='all')
                            print(tbl_df); print(tbl_df.info()); print(block)
        else:
            print('No Employees header')
        soup_emp_count = soup.body.find_all(string=[emp_pats])
        # Keep each matched string paired with its enclosing block so the two stay
        # aligned after matches without an enclosing block are dropped.
        soup_emp_pairs = [(x, x.find_parent(name=head_block_re)) for x in soup_emp_count]
        soup_emp_pairs = [(x, para) for x, para in soup_emp_pairs if para is not None]
        soup_emp_blocks = [x  for x in soup_emp_count if x.name in ['p', 'div']]
        for i2, (matched_str, block) in enumerate(soup_emp_pairs):
            block_tag = copy.copy(block)
            print('Para number: ' + str(i2))
            try:
                print('Original match name:  ' +  matched_str.name)
            except (TypeError, AttributeError):
                print('Original match has no name')
            print('Original match:  ' +  matched_str)
            try:
                print('Block match name:  ' +  block_tag.name)
            except (TypeError, AttributeError):
                print('Block match has no name')
            print('Block:  ')
            print(block)
            print(type(block))
            print('Block Text:  ')
            print(block.get_text().replace('\n', ' '))
        for i2, block in enumerate(soup_emp_blocks):
            print('Block number: ' + str(i2)); print(block)
#emp2_tag.find_next_siblings(block_re, limit = 1, string=False)[0].find(string=[emp_pats]).parent

In [77]:
print('Paragraphs: ' + str(len(acc_id_list)))
print('Files with paragraphs: ' + str(len(set(acc_id_list))))
print('Tables: ' + str(len(tbl_acc_id_list)))

Paragraphs: 8205
Files with paragraphs: 2182
Tables: 208


The code below lists files for which no candidate paragraphs were identified, so they can be investigated.

In [78]:
found_ids = set(acc_id_list)  # build the set once rather than on every iteration
missed_ids = [x for x in full_accession_ids[:100] if x not in found_ids]
# For investigating files with no candidate paragraphs if they are part of the training set
#[x for x in missed_ids if x in train_accession_ids]  
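
As a self-contained illustration of the same check, the sketch below finds IDs that never appear in the list of matched IDs while preserving the original order. The helper name `find_missed` and the toy IDs are hypothetical.

```python
def find_missed(all_ids, found_ids):
    """Return the IDs in all_ids that never occur in found_ids, in order."""
    found = set(found_ids)  # duplicates collapse; membership tests are O(1)
    return [x for x in all_ids if x not in found]


all_ids = ["A-1", "A-2", "A-3", "A-4"]
matched_ids = ["A-2", "A-2", "A-4"]  # one entry per paragraph, so IDs repeat
missed = find_missed(all_ids, matched_ids)
print(missed)  # ['A-1', 'A-3']
```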

In [350]:
# Find the index of the file for a specific accession ID
[i for i, x in enumerate(html_path_list[:300]) if re.search(r"0000042316-17-000014", x)]

[89]
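
The lookup above can be wrapped in a small function. This is a sketch with hypothetical names (`find_path_indexes`, the sample paths); it adds `re.escape` so that the ID is always matched literally.

```python
import re


def find_path_indexes(paths, accession_id):
    """Return the positions of all paths that contain the accession ID."""
    pattern = re.compile(re.escape(accession_id))
    return [i for i, path in enumerate(paths) if pattern.search(path)]


paths = [
    "filings/0000003570-17-000052.html",
    "filings/0000042316-17-000014.html",
]
hits = find_path_indexes(paths, "0000042316-17-000014")
print(hits)  # [1]
```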

### Explore dataframe of paragraphs for training data

In [460]:
train_df.head(2)

| | acc_id | para_text | len | emp_header | first_emp_head_block | para_text_orig | para_tag | split | label |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0000004127-16-000068 | As of September 30, 2016, we employed approxim... | 245 | True | True | As of September 30, 2016,\nwe employed approxi... | `<div class="c80"><span class="c32">As of</span...` | train | 0 |
| 3 | 0000004127-16-000068 | EMPLOYEES | 9 | False | False | EMPLOYEES | `<div class="c90">EMPLOYEES</div>` | train | 0 |


In [79]:
train_df.describe()

| | len | label |
|---|---|---|
| count | 6170.0 | 6170.0 |
| mean | 415.585575 | 0.0 |
| std | 475.61218 | 0.0 |
| min | 8.0 | 0.0 |
| 25% | 26.0 | 0.0 |
| 50% | 300.0 | 0.0 |
| 75% | 618.0 | 0.0 |
| max | 5054.0 | 0.0 |


In [490]:
train_df.len.value_counts().to_frame('len_counts').sort_index(ascending=False)

| | len_counts |
|---|---|
| 5054 | 1 |
| 4944 | 1 |
| 4868 | 1 |
| 4213 | 1 |
| 3972 | 1 |
| 3904 | 1 |
| 3878 | 1 |
| 3862 | 1 |
| 3795 | 1 |
| 3791 | 1 |


In [80]:
print_row_detail(df=train_df, nrow=10, header_list = ['acc_id'  ],
                    detail_list = ['len', 'emp_header', 'first_emp_head_block', 'para_text'],
                    sortby=['len', 'acc_id'], ascending=False)

----------------------------------- 0001193125-17-106055 -----------------------------------
len  :5054

emp_header  :False

first_emp_head_block  :False

para_text  :Current format     Previous format                             Millions of Euros       Millions of Euros    ASSETS   December  2015     December  2014      ASSETS   December  2015     December  2014              CASH, CASH BALANCES AT CENTRAL BANKS AND OTHER DEMAND DEPOSITS (1)   29,282    27,719     CASH AND BALANCES WITH CENTRAL BANKS   43,467    31,430    FINANCIAL ASSETS HELD FOR TRADING   78,326    83,258     FINANCIAL ASSETS HELD FOR TRADING   78,326    83,258    Derivatives   40,902    44,229     Loans and advances to credit institutions   -    -    Equity instruments   4,534    5,017     Loans and advances to customers   65    128    Debt securities   32,825    33,883     Debt securities   32,825    33,883    Loans and advances to central banks   -    -     Equity instruments   4,534    5,017    Loans and advances
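
The longest "paragraph" above is a flattened balance sheet rather than prose. One possible way to flag such rows for review is to measure the share of numeric tokens. This is only a sketch: the function names, the token pattern, and the 0.3 threshold are assumptions, not part of the notebook, and would need tuning against the real data.

```python
import re

# A token counts as numeric if it is made of digits, commas and periods,
# optionally wrapped in parentheses or followed by a percent sign.
NUMERIC_TOKEN = re.compile(r"^\(?-?[\d,.]+\)?%?$")


def numeric_share(text):
    """Return the fraction of whitespace-separated tokens that look numeric."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(
        1 for t in tokens
        if NUMERIC_TOKEN.match(t) and any(c.isdigit() for c in t)
    )
    return numeric / len(tokens)


def looks_like_table(text, threshold=0.3):
    """Flag text whose numeric share reaches the (assumed) threshold."""
    return numeric_share(text) >= threshold


table_like = "Derivatives   40,902    44,229     Equity instruments   4,534    5,017"
prose = "We currently have 21 employees, 18 of whom are located in Denver, Colorado."
print(looks_like_table(table_like))  # True
print(looks_like_table(prose))       # False
```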

In [68]:
print_parse_objects(file='0001047469-16-014916')

Index number: 
734
File path: 
c:/projects/DSBC/capstone/sec_employee_information_extraction/../employee_filings/0001047469-16-014916.html
No Employees header
Para number: 0
Original match has no name
Original match:  We currently have 21 employees, 18 of whom are located in Denver, Colorado, one who is located
in Zug, Switzerland, and two who are located in Toronto, Canada. Our employees are not subject to a labor
contract or a collective bargaining agreement. We consider our employee relations to be good.
Block match name:  p
Block:  
<p class="c15">We currently have 21 employees, 18 of whom are located in Denver, Colorado, one who is located
in Zug, Switzerland, and two who are located in Toronto, Canada. Our employees are not subject to a labor
contract or a collective bargaining agreement. We consider our employee relations to be good.</p>
<class 'bs4.element.Tag'>
Block Text:  
We currently have 21 employees, 18 of whom are located in Denver, Colorado, one who is located in Zug, 