# Getting Bulk SEC Archive of Annual and Quarterly Financial Reports 

Date: 2021-01-20  
Author: Jason Beach  
Categories: Data_Science 
Tags: nlp, spacy, finance

<!--eofm-->

In this post we investigate information within the most important SEC filings (10K, 10Q, 8K).  We first review the data and make sure we can validate the numbers.  Then we look into the unstructured data to determine whether: i) it can be converted into a useful structure, and ii) there is enough contextual information to map the newly structured data back to the document.  If this structured data seems valuable, it can be compared against other data to discover predictive relationships.

## Data Source

The data is obtained from the SEC [web page](https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html).  These files are flattened from their original XBRL format.  The data extracted from the XBRL submissions is organized into eight data sets containing information about submissions, numbers, taxonomy tags, presentation, and more.  Each data set consists of rows and fields, and is provided as a tab-delimited TXT format file.  The data sets are as follows (non-meta data is in bold font):

* __SUB__ – Submission data set; this includes one record for each XBRL submission. The set includes fields of information pertinent to the submission and the filing entity. Information is extracted from the Commission's EDGAR system and the filings submitted to the Commission by registrants.
* TAG – Tag data set; includes defining information about each tag.  Information includes tag descriptions (documentation labels), taxonomy version information and other tag attributes.
* DIM – Dimension data set; used to provide more detail in numeric and non-numeric facts.
* __NUM__ – Number data set; this includes one row for each distinct amount from each submission included in the SUB data set. The Number data set includes, for every submission, all line item values as rendered by the Commission Viewer/Previewer.
* __TXT__ – Text data set; this is the plain text of all the non-numeric tagged items in the submission.
* REN – Rendering data set; this contains data from the rendering of the filing on the Commission website.
* PRE – Presentation data set; this provides information about how the tags and numbers were presented in the primary financial statements.
* CAL – Calculation data set; provides information to arithmetically relate tags in a filing.
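
To make the relationship between the data sets concrete, here is a minimal sketch of attaching submission details to numeric facts. It assumes `adsh` (the accession number) is the key shared by SUB and NUM; the rows are invented stand-ins for `sub.tsv` and `num.tsv`, not real filings.

```python
import pandas as pd

# Toy stand-ins for sub.tsv and num.tsv; every value here is invented.
sub = pd.DataFrame({
    "adsh": ["0000000001-21-000001", "0000000002-21-000007"],
    "name": ["EXAMPLE CORP", "SAMPLE INC"],
    "form": ["10-K", "8-K"],
})
num = pd.DataFrame({
    "adsh": ["0000000001-21-000001", "0000000001-21-000001", "0000000002-21-000007"],
    "tag": ["Assets", "Liabilities", "Assets"],
    "value": [100.0, 60.0, 25.0],
})

# Many NUM rows point at one SUB row, so a many-to-one merge on adsh
# attaches the filer name and form type to every reported amount.
facts = num.merge(sub, on="adsh", how="left", validate="many_to_one")
print(facts[["name", "form", "tag", "value"]])
```

The `validate="many_to_one"` argument makes pandas raise an error if `adsh` is ever duplicated on the SUB side, which is a cheap check on the assumption above.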

The forms we are interested in include:

* 10K - report filed annually by a publicly-traded company about its financial performance
* 10Q - unaudited report filed three times a year on a company's performance, including relevant information about its financial position
* 8K - required by the SEC whenever companies announce major events, may include sales, acquisitions, delistings, departures, and elections of executives, as well as changes in a company's status or control, bankruptcies, information about operations, assets, and any other relevant news

Steps

* search EDGAR for specific firm: https://www.sec.gov/edgar/searchedgar/companysearch.html
* manually compare against documents

In [1]:
! ls large_dataset/2021_12_notes/

cal.tsv  notes-metadata.json  pre.tsv	  ren.tsv  tag.tsv
dim.tsv  num.tsv	      readme.htm  sub.tsv  txt.tsv


In [2]:
! head large_dataset/2021_12_notes/cal.tsv

adsh	grp	arc	negative	ptag	pversion	ctag	cversion
0001654954-21-013551	3	1	1	NetCashProvidedByUsedInOperatingActivities	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0001477932-21-009337	2	3	1	NetCashProvidedByUsedInOperatingActivities	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0001589526-21-000140	13	2	1	NetCashProvidedByUsedInOperatingActivities	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0000950170-21-005014	2	2	1	NetCashProvidedByUsedInOperatingActivities	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0001589526-21-000140	6	2	-1	OtherComprehensiveIncomeLossPensionAndOtherPostretirementBenefitPlansAdjustmentBeforeTax	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0001558370-21-016630	19	1	1	ComprehensiveIncomeNetOfTax	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0000950170-21-005014	25	1	1	ComprehensiveIncomeNetOfTax	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0000074046-21-000108	5	4	1	ComprehensiveIncomeNetOfTax	us-gaap/2020	NetIncomeLoss	us-gaap/2020
0001589526-21-000140	24	1	1	ComprehensiveIncomeNetOfTax	us-gaap/2

### Setup

In [1]:
import pandas as pd
import numpy as np

In [2]:
def read_table(file_path, display_cols=False):
    """Load a tab-delimited SEC data set and report its size."""
    df = pd.read_csv(file_path, sep='\t', low_memory=False)
    print(f'Data size: {df.shape}')
    if display_cols:
        print(f'Data columns: {df.columns}')

    # For data sets with a `form` column (such as SUB), count the filings of interest.
    if 'form' in df.columns:
        forms = ['10-K', '10-Q', '8-K']
        df_tmp = df[df['form'].isin(forms)]
        print(f"Size of submission of interest (10K,10Q,8K): {df_tmp.shape[0]}")
    return df

### Data ETL

In [98]:
file_path = 'large_dataset/2021_12_notes/sub.tsv'
df_sub = read_table(file_path, display_cols=True)

Data size: (5689, 40)
Data columns: Index(['adsh', 'cik', 'name', 'sic', 'countryba', 'stprba', 'cityba', 'zipba',
       'bas1', 'bas2', 'baph', 'countryma', 'stprma', 'cityma', 'zipma',
       'mas1', 'mas2', 'countryinc', 'stprinc', 'ein', 'former', 'changed',
       'afs', 'wksi', 'fye', 'form', 'period', 'fy', 'fp', 'filed', 'accepted',
       'prevrpt', 'detail', 'instance', 'nciks', 'aciks', 'pubfloatusd',
       'floatdate', 'floataxis', 'floatmems'],
      dtype='object')
Size of submission of interest (10K,10Q,8K): 5089


In [99]:
file_path = 'large_dataset/2021_12_notes/num.tsv'
df_num = read_table(file_path, display_cols=True)

Data size: (780139, 16)
Data columns: Index(['adsh', 'tag', 'version', 'ddate', 'qtrs', 'uom', 'dimh', 'iprx',
       'value', 'footnote', 'footlen', 'dimn', 'coreg', 'durp', 'datp',
       'dcml'],
      dtype='object')


In [100]:
file_path = 'large_dataset/2021_12_notes/txt.tsv'
df_txt = read_table(file_path, display_cols=True)

Data size: (216344, 20)
Data columns: Index(['adsh', 'tag', 'version', 'ddate', 'qtrs', 'iprx', 'lang', 'dcml',
       'durp', 'datp', 'dimh', 'dimn', 'coreg', 'escaped', 'srclen', 'txtlen',
       'footnote', 'footlen', 'context', 'value'],
      dtype='object')


## Basic Review

Submissions

In [45]:
df_sub[['adsh','name', 'countryma', 'stprma', 'cityma', 'countryinc', 'ein', 'former', 'changed']].head()
df_sub[['adsh', 'afs', 'wksi', 'fye', 'form', 'period', 'fy', 'fp', 'filed', 'accepted']].head()
df_sub[['adsh', 'prevrpt', 'detail', 'instance', 'nciks', 'aciks', 'pubfloatusd', 'floatdate', 'floataxis', 'floatmems']].head()

Unnamed: 0,adsh,prevrpt,detail,instance,nciks,aciks,pubfloatusd,floatdate,floataxis,floatmems
0,0001096906-21-002921,0,0,byoc-20211201_htm.xml,1,,,,,
1,0001104659-21-145567,0,0,tm2132359d2_8k_htm.xml,1,,,,,
2,0001104659-21-145577,0,1,pixy-20210531.xml,1,,,,,
3,0001104659-21-145626,0,0,tm2134286d1_8k_htm.xml,1,,,,,
4,0001104659-21-145637,0,0,tm2134057d1_8k_htm.xml,1,,,,,


In [64]:
df_tmp = df_sub[df_sub['form'].isin(['10-K', '10-Q', '8-K'])]  # filings of interest
pd.crosstab(df_tmp['name'], df_tmp['form'], rownames=['name'], colnames=['form'])

Size of submission of interest: 5089


form,10-K,10-Q,8-K
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1 800 FLOWERS COM INC,0,0,1
180 LIFE SCIENCES CORP.,0,0,4
1847 GOEDEKER INC.,0,0,2
1ST CONSTITUTION BANCORP,0,0,2
"22ND CENTURY GROUP, INC.",0,0,1
...,...,...,...
"ZSCALER, INC.",0,1,0
ZUMIEZ INC,0,1,2
ZUORA INC,0,1,1
ZYMEWORKS INC.,0,0,3


Number: all line item values from each submission

In [52]:
df_num[['adsh', 'tag', 'version', 'ddate', 'qtrs', 'uom', 'dimh', 'iprx']].head()
df_num[['value', 'footnote', 'footlen', 'dimn', 'coreg', 'durp', 'datp','dcml']].head()

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,dimh,iprx
0,0001062993-21-011894,WeightedAverageExercisePriceOfShareOptionsOuts...,ifrs/2020,20191231,0,CAD,0x833c5d9b45ae0f823524a9499bd2029b,0
1,0001062993-21-011894,EachSharePurchaseWarrantAssessedValue,0001062993-21-011894,20171231,0,CAD,0xb36e98a1603d71c9130dc58a98eb1359,0
2,0001193125-21-344339,ProfitLossAttributableToOwnersOfParent,ifrs/2020,20201031,4,CAD,0x00000000,0
3,0001193125-21-344339,NotionalAmount,ifrs/2020,20211031,0,CAD,0x330691fdeafe2e135ffc8330e5171705,0
4,0001193125-21-344339,IncreaseDecreaseThroughNetExchangeDifferencesP...,ifrs/2020,20201031,4,CAD,0x99090c3ff88704202dc8610a2a65a5a0,0


Text: plain text of all non-numeric items in the dataset.  This includes tables and other unstructured data.

In [56]:
df_txt[['adsh', 'tag', 'version', 'ddate', 'qtrs', 'iprx', 'lang', 'dcml']].head()
df_txt[['durp', 'datp', 'dimh', 'dimn', 'coreg', 'escaped', 'srclen', 'txtlen','footnote', 'footlen', 'context', 'value']].head()

Unnamed: 0,durp,datp,dimh,dimn,coreg,escaped,srclen,txtlen,footnote,footlen,context,value
0,0.019179,0.0,0x00000000,0,,1,14355,10701,,0,Duration_1_1_2021_To_9_30_2021_RwCPA6c1GkyrMOE...,"1. Organization, Business Operations. Incorpor..."
1,0.019179,0.0,0x00000000,0,,1,12319,9175,,0,Duration_1_1_2021_To_9_30_2021_-D0OjqPQ9UywHQZ...,1. NATURE OF BUSINESS AND BASIS OF PRESENTATIO...
2,0.482192,0.0,0x00000000,0,,1,17429,14660,,0,Duration_2_14_2020_To_12_31_2020_1uTU7qmFN0u03...,NOTE 1. DESCRIPTION OF ORGANIZATION AND BUSINE...
3,0.019179,0.0,0x00000000,0,,1,18120,14832,,0,Duration_1_1_2021_To_9_30_2021_J8h0TwBpskSZgk7...,NOTE 1. DESCRIPTION OF ORGANIZATION AND BUSINE...
4,0.0,0.0,0x00000000,0,,1,16072,11749,,0,Duration_1_1_2020_To_12_31_2020_yzcIljh5FkytHG...,"Note 1 Organization, Business Operations and ..."


## Data Validation

Let's validate that we understand what we are looking at.

* `adsh` - unique primary key, 20-char Accession Number automatically assigned when submission is accepted by EDGAR; format: `<entity_CIK>-<YR>-<CIK_sequence_num>`
  - `entity_CIK`: CIK of the company or third-party filer agent that submitted the filing; it may have no searchable presence in the public EDGAR database
  - `YR`: two-digit year
  - `CIK_sequence_num`: sequence number of the filings submitted from that CIK
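
The format above can be checked with a small parser. This is a sketch only: `Accession` and `parse_adsh` are hypothetical names, and the 10-2-6 digit layout is inferred from the 20-character length and the accession numbers shown in this post.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    filer_id: str   # 10-digit prefix (company or filing agent)
    year: str       # 2-digit year
    sequence: str   # 6-digit sequence number


# Assumed layout: 10 digits, 2 digits, 6 digits, separated by hyphens.
_ADSH_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})$")


def parse_adsh(adsh: str) -> Accession:
    """Split an accession number into its three parts, or raise ValueError."""
    match = _ADSH_RE.match(adsh)
    if match is None:
        raise ValueError(f"not a 20-character accession number: {adsh!r}")
    return Accession(*match.groups())


print(parse_adsh("0000072971-20-000217"))
```

Validating the shape before splitting guards against malformed rows that a plain `split('-')` would silently accept.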

There are 3,321 unique names making submissions during this month and 3,297 unique values in the `cik` column.  However, the CIK prefix derived from `adsh` appears to be different: it has only 923 unique values.

TODO:

* is this because most `adsh` prefixes belong to a limited number of third-party preparers?
* why is the CIK derived from `adsh` different?

In [13]:
print(f"Count of names: {df_sub['name'].unique().shape}")

Count of names: (3321,)


In [14]:
print(f"Count of CIKs: {df_sub['cik'].unique().shape}")

Count of CIKs: (3297,)


In [25]:
df_sub[['id_cik','id_yr','id_seq_num']] = df_sub['adsh'].str.split('-', expand=True)

In [26]:
df_sub['id_cik'].unique().shape

(923,)

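The count of unique CIK prefixes derived from `adsh` (923) still differs from the count in the `cik` column (3,297).  The text data set gives the same prefix count: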

In [20]:
df_txt[['id_cik','id_yr','id_seq_num']] = df_txt['adsh'].str.split('-', expand=True)

In [23]:
#df_txt['id_cik'].value_counts()
print(f"Unique CIK from ADSH within txt: {df_txt['id_cik'].unique().shape }")

Unique CIK from ADSH within txt: (923,)


The year component is the same for every record in the text data set:

In [28]:
df_txt['id_yr'].value_counts()

21    216344
Name: id_yr, dtype: int64

In [10]:
df_txt['qtrs'].value_counts()

0     122934
3      36587
4      34554
2      13163
1       9072
5         26
8          3
12         2
9          1
15         1
60         1
Name: qtrs, dtype: int64

## Review

Only a small percentage of records have footnotes.

In [68]:
footnotes = df_num['footnote'].dropna()
footnotes.shape

(1427,)
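
To put a number on "small", the share can be worked out from the row counts reported in this post. The counts below are copied from the outputs shown here; the variable names are only for illustration.

```python
# Row counts copied from the outputs in this post.
num_rows, num_footnotes = 780_139, 1_427   # num.tsv rows, rows with a footnote
txt_rows, txt_footnotes = 216_344, 66      # txt.tsv rows, rows with a footnote

num_share = num_footnotes / num_rows
txt_share = txt_footnotes / txt_rows

print(f"NUM rows with a footnote: {num_share:.2%}")
print(f"TXT rows with a footnote: {txt_share:.2%}")
```

On the loaded frames, the same ratio should be available directly as `df_num['footnote'].notna().mean()`.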

In [42]:
footnotes.iloc[0]

'Represents unallocated items. In the three-month periods ended October 31, 2021, and November 1, 2020, there were pension actuarial losses of $6 and pension actuarial gains of $4, respectively. Costs related to the cost savings initiatives were $4 and $5 in the three-month periods ended October 31, 2021, and November 1, 2020, respectively. Unrealized mark-to-market adjustments on outstanding undesignated commodity hedges were losses of $3 and gains of $6 in the three-month periods ended October 31, 2021, an'

In [58]:
df_txt['footnote'].dropna().shape

(66,)

In [67]:
df_txt['value'].iloc[2]

'NOTE 1. DESCRIPTION OF ORGANIZATION AND BUSINESS OPERATIONS Capstar Special Purpose Acquisition Corp. (the Company) was incorporated in Delaware on February 14, 2020. The Company was formed for the purpose of effecting a merger, capital stock exchange, asset acquisition, stock purchase, reorganization or similar business combination with one or more businesses (the Business Combination). Although the Company is not limited to a particular industry or geographic region for purposes of consummating a Business Combination, the Company intends to focus on businesses in the consumer, healthcare and technology, media and telecommunications (TMT) industries. The Company is an early stage and emerging growth company and, as such, the Company is subject to all of the risks associated with early stage and emerging growth companies. As of December 31, 2020, the Company had not commenced any operations. All activity for the period from February 14, 2020 (inception) through December 31, 2020 relat

## Manual Review: Wells Fargo

We will search for two .pdf documents from WFC: 10K, 8K.  Then we will look for specific data within these files to map it to the documents.

### Find the documents

In [3]:
! ls large_dataset/

2020q1_notes  2020q1_notes.zip	2020q2_notes  2020q2_notes.zip	2021_12_notes


In [4]:
file_path = 'large_dataset/2020q1_notes/sub.tsv'
df_sub2020q1 = read_table(file_path)

Data size: (13561, 40)
Size of submission of interest (10K,10Q,8K): 12714


In [5]:
file_path = 'large_dataset/2020q2_notes/sub.tsv'
df_sub2020q2 = read_table(file_path)

Data size: (16411, 40)
Size of submission of interest (10K,10Q,8K): 15369


In [6]:
df_sub = pd.concat([df_sub2020q2, df_sub2020q1])

In [7]:
df_sub.shape

(29972, 40)

In [8]:
df_wfc = df_sub[df_sub['name'].str.contains('FARGO')]

In [12]:
df_10k = df_wfc[df_wfc['form'].isin(['10-K'])]
df_10k['adsh'].values[0]

'0000072971-20-000217'

In [13]:
df_8k = df_wfc[df_wfc['instance'].str.contains('0414')]
df_8k['adsh'].values[0]

'0001387131-20-003874'

In [11]:
df_sub2020q1[df_sub2020q1['adsh'].str.contains(df_10k['adsh'].values[0])].shape
df_sub2020q2[df_sub2020q2['adsh'].str.contains(df_8k['adsh'].values[0])].shape

(1, 40)

### Find associated records

In [5]:
! sqlite3 --version

3.37.0 2021-11-27 14:13:22 bd41822c7424d393a30e92ff6cb254d25c26769889c1499a18a0b9339f5d6c8a


The data is too big to load into memory with pandas, so the following does not work:

```
file_path = 'large_dataset/2020q1_notes/num.tsv'
df_tmp = read_table(file_path)
df_num = df_tmp[df_tmp['adsh'].str.contains('0000072971-20-000217')]
```
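
One alternative to loading the whole file, sketched here and not used in the rest of this post, is to stream it in chunks with the `chunksize` argument of `pd.read_csv` and keep only the matching rows. The helper name `filter_tsv_by_adsh` is hypothetical and the sample rows are invented.

```python
import io

import pandas as pd


def filter_tsv_by_adsh(source, adsh, chunksize=100_000):
    """Stream a tab-delimited file and keep only the rows for one accession number."""
    matches = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows,
    # so only one chunk plus the accumulated matches is held in memory.
    for chunk in pd.read_csv(source, sep="\t", dtype=str, chunksize=chunksize):
        matches.append(chunk[chunk["adsh"] == adsh])
    return pd.concat(matches, ignore_index=True)


# Tiny in-memory stand-in for num.tsv (invented values).
sample = io.StringIO(
    "adsh\ttag\tvalue\n"
    "0000000001-20-000001\tAssets\t100\n"
    "0000072971-20-000217\tAssets\t200\n"
    "0000072971-20-000217\tLiabilities\t150\n"
)
df_one = filter_tsv_by_adsh(sample, "0000072971-20-000217", chunksize=2)
print(df_one)
```

This uses an exact match on `adsh`, not `str.contains`, which avoids treating the accession number as a regular expression. The SQLite approach below remains the better fit when many different filings need to be looked up.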

From the command line:

```bash
sqlite3 sec.db

sqlite> .mode tabs
sqlite> .import 2020q1_notes/num.tsv num
sqlite> .tables

sqlite> .headers on
sqlite> .mode column
sqlite> SELECT * FROM num LIMIT 3;
```

and the same for the text data set:

```bash
sqlite> .mode tabs
sqlite> .import 2020q1_notes/txt.tsv txt
sqlite> .tables

sqlite> .headers on
sqlite> .mode column
sqlite> SELECT * FROM txt LIMIT 3;
```
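
The same import could also be scripted from Python with the standard library. This is a rough sketch, not the method used in this post: `load_tsv` is a hypothetical helper, the index on `adsh` is an added assumption to speed up per-filing lookups, and the sample rows are invented.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, lines):
    """Create `table` from a tab-delimited stream whose first row is the header."""
    # QUOTE_NONE treats quote characters inside text fields as ordinary data.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    # Rows whose field count does not match the header are skipped, not loaded.
    rows = (r for r in reader if len(r) == len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    # Assumption: most queries filter on adsh, so index that column.
    conn.execute(f'CREATE INDEX "idx_{table}_adsh" ON "{table}" (adsh)')
    conn.commit()


# Tiny in-memory stand-in for num.tsv (invented values).
sample = io.StringIO(
    "adsh\ttag\tvalue\n"
    "0000072971-20-000217\tAssets\t200\n"
    "0000000001-20-000001\tAssets\t100\n"
)
conn = sqlite3.connect(":memory:")
load_tsv(conn, "num", sample)

# A parameterized query keeps the accession number out of the SQL string.
rows = conn.execute(
    "SELECT tag, value FROM num WHERE adsh = ?", ("0000072971-20-000217",)
).fetchall()
print(rows)
```

For a real file, pass an open file handle (for example `open(path, newline='', encoding='utf-8')`) in place of the `StringIO` object. Very long text fields in `txt.tsv` may require raising the limit with `csv.field_size_limit`.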

In [37]:
! ls -alh 2020q1_notes/ | head -n4

total 3.4G
drwx------ 12 root   root   384 Jan 26 15:26 .
drwxr-xr-x 10 jovyan users  320 Jan 26 18:33 ..
-rw-r--r--  1 jovyan users  65K Dec 13  2020 2020q1_notes-metadata.json


In [18]:
#check the results of the load
! head -n3 large_dataset/2020q1_notes/num.tsv

adsh	tag	version	ddate	qtrs	uom	dimh	iprx	value	footnote	footlen	dimn	coreg	durp	datp	dcml
0000014195-20-000008	DerivativeNonmonetaryNotionalAmount	invest/2013	20191231	0	Btu	0xd510cbd5aa429b08bfe3fe1c1a52fc18	0	4074000.0000		0	1		0.0	2.0	-3


In [19]:
import sqlite3
import os

In [23]:
os.chdir('large_dataset/')
os.getcwd()

'/home/jovyan/NOTEBOOK_PUBLIC/large_dataset'

In [25]:
conn = sqlite3.connect('sec.db')
query = "SELECT * FROM num WHERE adsh = '0000072971-20-000217';"

df_num = pd.read_sql_query(query,conn)
df_num.shape

In [38]:
query = "SELECT * FROM txt WHERE adsh = '0000072971-20-000217';"

df_txt = pd.read_sql_query(query,conn)
df_txt.shape

(392, 20)

In [43]:
df_txt[df_txt['value'].str.contains('Key Economic')]

Unnamed: 0,adsh,tag,version,ddate,qtrs,iprx,lang,dcml,durp,datp,dimh,dimn,coreg,escaped,srclen,txtlen,footnote,footlen,context,value
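
The search above returns no rows. Why is not established here; one possibility worth ruling out is a difference in capitalization. The sketch below shows a case-insensitive literal search on invented rows; `df_demo` is a hypothetical stand-in for `df_txt`.

```python
import pandas as pd

# Invented rows standing in for df_txt; None mimics a missing text value.
df_demo = pd.DataFrame({
    "tag": ["NoteA", "NoteB", "NoteC"],
    "value": [
        "KEY ECONOMIC assumptions used in the valuation",
        None,
        "Summary of significant accounting policies",
    ],
})

# case=False ignores capitalization, na=False treats missing text as no match,
# and regex=False searches for the literal phrase.
mask = df_demo["value"].str.contains("key economic", case=False, na=False, regex=False)
hits = df_demo[mask]
print(hits)
```

If this still finds nothing on the real data, the phrase may simply not appear in the tagged text for this filing.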
