## Parsing Company 10-K Filings

After the daily-index scrape, we can retrieve many types of filings. In this example, we focus on one of the most important filings, the 10-K (annual report).

### Using Pandas

We will convert the tabular data contained in a 10-K into pandas DataFrames, which makes it easier to reshape the information into what we need.

In [13]:
# import some libraries
import requests
from bs4 import BeautifulSoup
import urllib
import pandas as pd

In [14]:
# defining a base url
base_url = r"https://www.sec.gov"

# from daily index, we get a file that looks like this
example = r"https://www.sec.gov/Archives/edgar/data/1652044/0001652044-22-000019.txt"

# we use a URL that lets us reach the 10-K reference page: the same folder, with the dashes removed from the accession number, plus index.json
documents_url = r"https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/index.json"

# declare a user agent header for the request; headers pass additional information along with it (the SEC asks automated clients to identify themselves here)
headers = {'user-agent':'William',
          'Host':'www.sec.gov'}

# request and decode the documents url
content = requests.get(documents_url, headers=headers).json()

for file in content['directory']['item']:
    if file['name'] == 'FilingSummary.xml':
        
        xml_summary = base_url + content['directory']['name'] + '/' + file['name']
        
        print('-' * 100)
        print('File Name: ' + file['name'])
        print('File Path: ' + xml_summary)

----------------------------------------------------------------------------------------------------
File Name: FilingSummary.xml
File Path: https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/FilingSummary.xml

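The `documents_url` above was typed by hand, but it can be derived from the `.txt` path that the daily index gives us. The following is a minimal sketch, assuming every filing follows the same pattern as the two URLs shown in the cell; `filing_index_url` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def filing_index_url(txt_url: str) -> str:
    """Turn a daily-index .txt filing URL into that filing's index.json URL.

    Assumption: the folder name is the accession number with dashes removed,
    as in the example URLs of this notebook.
    """
    parts = urlsplit(txt_url)

    # split '/Archives/edgar/data/1652044/0001652044-22-000019.txt'
    # into the folder part and the file name
    folder, _, file_name = parts.path.rpartition('/')
    if not file_name.endswith('.txt'):
        raise ValueError('expected a URL ending in .txt')

    # drop the extension, then drop the dashes from the accession number
    accession = file_name[:-len('.txt')].replace('-', '')

    new_path = f"{folder}/{accession}/index.json"
    return urlunsplit((parts.scheme, parts.netloc, new_path, '', ''))


example = "https://www.sec.gov/Archives/edgar/data/1652044/0001652044-22-000019.txt"
print(filing_index_url(example))
```

For the example filing this prints the same `index.json` address that was assigned to `documents_url` by hand.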

### Now it's time to parse this XML file

We notice that under the MyReports tag there are multiple Report entries, each carrying information about one report.

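To make the structure concrete, here is a small sketch that walks a trimmed sample with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sample XML is hypothetical and heavily shortened: the field values are copied from this filing's output, while the tag capitalisation and the final entry without a file name are assumptions for illustration. `list_reports` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# hypothetical, trimmed sample; a real FilingSummary.xml contains many more tags
SAMPLE_XML = """
<FilingSummary>
  <MyReports>
    <Report>
      <LongName>0001001 - Document - Cover Page</LongName>
      <ShortName>Cover Page</ShortName>
      <MenuCategory>Cover</MenuCategory>
      <Position>1</Position>
      <HtmlFileName>R1.htm</HtmlFileName>
    </Report>
    <Report>
      <LongName>1001003 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
      <ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
      <MenuCategory>Uncategorized</MenuCategory>
      <Position>3</Position>
      <HtmlFileName>R3.htm</HtmlFileName>
    </Report>
    <Report>
      <LongName>All Reports</LongName>
      <ShortName>All Reports</ShortName>
    </Report>
  </MyReports>
</FilingSummary>
"""


def list_reports(xml_text: str, base_url: str) -> list:
    """Collect one dictionary per Report entry that has an HTML file."""
    root = ET.fromstring(xml_text.strip())
    reports = []
    for report in root.iter('Report'):
        file_name = report.findtext('HtmlFileName')
        # skip entries without their own HTML file rather than slicing the list
        if file_name is None:
            continue
        reports.append({
            'name_short': report.findtext('ShortName'),
            'name_long': report.findtext('LongName'),
            'category': report.findtext('MenuCategory'),
            'position': report.findtext('Position'),
            'url': base_url + file_name,
        })
    return reports


base = "https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/"
for info in list_reports(SAMPLE_XML, base):
    print(info['position'], info['name_short'], info['url'])
```

Checking for a missing file name is one way to avoid depending on the position of an incomplete entry; whether that suits a given filing is something to verify against its actual XML.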
In [23]:
# new base url for every report 
base_url = xml_summary.replace('FilingSummary.xml', '')

# request and parse the XML file; html.parser lower-cases tag names, which is why the lookups below use lower case
content = requests.get(xml_summary, headers=headers).content
soup = BeautifulSoup(content, "html.parser")

# find the 'myreports' tag
reports = soup.find('myreports')

# store each individual report in a master list
master_list = []

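# add each report to the master list; the [:-1] slice skips the final entry, assumed here to be a summary entry that lacks some of the tags read below
for report in reports.find_all('report')[:-1]: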
    # saving all the info we need in a dictionary
    report_info = {}
    report_info['name_short'] = report.shortname.text
    report_info['url'] = base_url + report.htmlfilename.text
    report_info['position'] = report.position.text
    report_info['name_long'] = report.longname.text
    report_info['category'] = report.menucategory.text
    
    # adding each report info to the master list
    master_list.append(report_info)
    
    # print the details so we can see the URLs and what is being collected
    print('-' * 100)
    print("Short Name: " + report_info['name_short'])
    print("Long Name: " + report_info['name_long'])
    print("Category Type: " + report_info['category'])
    print("Position: " + report_info['position'])
    print("URL: " + report_info['url'])

----------------------------------------------------------------------------------------------------
Short Name: Cover Page
Long Name: 0001001 - Document - Cover Page
Category Type: Cover
Position: 1
URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/R1.htm
----------------------------------------------------------------------------------------------------
Short Name: Audit Information
Long Name: 0002002 - Document - Audit Information
Category Type: Notes
Position: 2
URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/R2.htm
----------------------------------------------------------------------------------------------------
Short Name: CONSOLIDATED BALANCE SHEETS
Long Name: 1001003 - Statement - CONSOLIDATED BALANCE SHEETS
Category Type: Uncategorized
Position: 3
URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/R3.htm
----------------------------------------------------------------------------------------------------
Short

### Let's convert the information in these reports to dataframes

We have parsed the FilingSummary.xml file to retrieve all the reports.
Now it is time to go through those reports and save the information we want.


In [25]:
# save all important statements into another list
statements_url = []

# statement names to look for
item1 = r"Consolidated Balance Sheets"
item2 = r"Consolidated Statements of Operations and Comprehensive Income (Loss)"
item3 = r"Consolidated Statements of Cash Flows"
item4 = r"Consolidated Statements of Stockholder's (Deficit) Equity"

# store the upper-cased names in a list, because this filing reports statement short names in upper case
search_list = [item1.upper(), item2.upper(), item3.upper(), item4.upper()]

for report in master_list:
    # keep the report only if its short name matches one of the names exactly
    if report['name_short'] in search_list:
        
        # output information
        print('-' * 100)
        print(report['name_short'])
        print(report['url'])
        
        # add report to the statements url list
        statements_url.append(report['url'])

----------------------------------------------------------------------------------------------------
CONSOLIDATED BALANCE SHEETS
https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/R3.htm
----------------------------------------------------------------------------------------------------
CONSOLIDATED STATEMENTS OF CASH FLOWS
https://www.sec.gov/Archives/edgar/data/1652044/000165204422000019/R9.htm

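Only two of the four names matched, because the comparison is exact and statement titles vary between companies. A looser comparison could be sketched as below. The keyword list, the sample entries marked as hypothetical, and the helper names `normalize` and `find_statements` are assumptions for illustration, not names used by this filing or by any library.

```python
import re


def normalize(name: str) -> str:
    """Lower-case a name and reduce it to words separated by single spaces."""
    return ' '.join(re.findall(r'[a-z0-9]+', name.lower()))


def find_statements(reports, keywords):
    """Return the reports whose short name contains any of the keywords."""
    wanted = [normalize(keyword) for keyword in keywords]
    matches = []
    for report in reports:
        short = normalize(report['name_short'])
        # assumption: parenthetical companion tables are not wanted here
        if 'parenthetical' in short:
            continue
        if any(keyword in short for keyword in wanted):
            matches.append(report)
    return matches


sample_reports = [
    {'name_short': 'Cover Page'},
    {'name_short': 'CONSOLIDATED BALANCE SHEETS'},
    # the next two entries are hypothetical examples of title variations
    {'name_short': 'CONSOLIDATED BALANCE SHEETS (Parenthetical)'},
    {'name_short': 'CONSOLIDATED STATEMENTS OF INCOME'},
    {'name_short': 'CONSOLIDATED STATEMENTS OF CASH FLOWS'},
]

keywords = ['Balance Sheets', 'Statements of Income', 'Statements of Cash Flows']
for report in find_statements(sample_reports, keywords):
    print(report['name_short'])
```

Substring matching trades precision for recall, so the keyword list should stay specific enough to avoid pulling in notes or detail tables that merely mention a statement's name.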

In [33]:
# creating a dataset, with a list of data, headers and sections
statements_data = []

# looping through each statement
for statement in statements_url:
    
    # create a dictionary to store the content of each statement
    statement_data = {}
    statement_data['headers'] = []
    statement_data['sections'] = []
    statement_data['data'] = []
    
    # request the content and parse it with BeautifulSoup, naming the parser explicitly
    content = requests.get(statement, headers=headers).content
    report_soup = BeautifulSoup(content, "html.parser")
    
    # loop through the soup and retrieve the data
    for index, row in enumerate(report_soup.table.find_all('tr')):
        
        # first we get all the elements
        cols = row.find_all('td')
        
        # detect difference between regular rows, sections and table headers
        if len(row.find_all('th')) == 0 and len(row.find_all('strong')) == 0:
            
            reg_row = [ele.text.strip() for ele in cols]
            statement_data['data'].append(reg_row)
                                               
        # if strong but not th, we know it will be a section
        elif len(row.find_all('th')) == 0 and len(row.find_all('strong')) != 0:   
             
            section_row = cols[0].text.strip()
            statement_data['sections'].append(section_row)
                                               
        # else we know it will be header since th is present but not strong
        elif len(row.find_all('th')) != 0:
            
            header_row = [ele.text.strip() for ele in row.find_all('th')]
            statement_data['headers'].append(header_row)
                                               
        # not reachable in practice, because the three conditions above cover every case; kept as a safeguard
        else:
            print("an error occurred on this row")
    
    # append data to master list
    statements_data.append(statement_data)

# output the statements_data
statements_data

[{'headers': [['CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions',
    'Dec. 31, 2021',
    'Dec. 31, 2020']],
  'sections': ['Current assets:',
   'Current liabilities:',
   'Stockholders’ equity:'],
  'data': [['Cash and cash equivalents', '$ 20,945', '$ 26,465'],
   ['Marketable securities', '118,704', '110,229'],
   ['Total cash, cash equivalents, and marketable securities',
    '139,649',
    '136,694'],
   ['Accounts receivable, net', '39,304', '30,930'],
   ['Income taxes receivable, net', '966', '454'],
   ['Inventory', '1,170', '728'],
   ['Other current assets', '7,054', '5,490'],
   ['Total current assets', '188,143', '174,296'],
   ['Non-marketable securities', '29,549', '20,703'],
   ['Deferred income taxes', '1,284', '1,084'],
   ['Property and equipment, net', '97,599', '84,749'],
   ['Operating lease assets', '12,959', '12,211'],
   ['Intangible assets, net', '1,417', '1,445'],
   ['Goodwill', '22,956', '21,175'],
   ['Other non-current assets', '5,361', '3,953'],
  

### Now we convert to DataFrame

At this point, statements_data holds the table contents of each statement, with the rows separated into data, headers and sections.

Now we can convert the data into a DataFrame, which gives us flexibility for visualizations or further manipulation.

In [51]:
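# grab the components we want; in this run statements_data[1] is the cash flow statement
# (despite the "income" variable names), and its second header row holds the dates
income_headers = statements_data[1]['headers'][1]
income_data = statements_data[1]['data']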

# save data into a df
income_df = pd.DataFrame(income_data)

# display before reindex
print('-' * 100)
print('Before Reindexing')
print('-' * 100)
display(income_df.head(5))

# Drop the old index and define a new one
income_df.index = income_df[0]
income_df.index.name = 'Category'
income_df = income_df.drop(0, axis = 1)

# display before regex
print('-' * 100)
print('Before Regex Matching')
print('-' * 100)
display(income_df.head(5))

# Strip '$', ',' and ')', turn '(' into a minus sign, and mark empty cells as NaN
income_df = income_df.replace(r'[\$,)]', '', regex = True)\
                     .replace('[(]', '-', regex = True)\
                     .replace('', 'NaN', regex = True)

# Convert the data from string to float
income_df = income_df.astype(float)

# Replace the headers with the income headers
income_df.columns = income_headers

# Display of final results
print('-' * 100)
print('Final Product')
print('-' * 100)
display(income_df.head(5))

----------------------------------------------------------------------------------------------------
Before Reindexing
----------------------------------------------------------------------------------------------------


|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Net income | $ 76,033 | $ 40,269 | $ 34,343 |
| 1 | Depreciation and impairment of property and eq... | 11555 | 12905 | 10856 |
| 2 | Amortization and impairment of intangible assets | 886 | 792 | 925 |
| 3 | Stock-based compensation expense | 15376 | 12991 | 10794 |
| 4 | Deferred income taxes | 1808 | 1390 | 173 |


----------------------------------------------------------------------------------------------------
Before Regex Matching
----------------------------------------------------------------------------------------------------


| Category | 1 | 2 | 3 |
|---|---|---|---|
| Net income | $ 76,033 | $ 40,269 | $ 34,343 |
| Depreciation and impairment of property and equipment | 11555 | 12905 | 10856 |
| Amortization and impairment of intangible assets | 886 | 792 | 925 |
| Stock-based compensation expense | 15376 | 12991 | 10794 |
| Deferred income taxes | 1808 | 1390 | 173 |


----------------------------------------------------------------------------------------------------
Final Product
----------------------------------------------------------------------------------------------------


| Category | Dec. 31, 2021 | Dec. 31, 2020 | Dec. 31, 2019 |
|---|---|---|---|
| Net income | 76033.0 | 40269.0 | 34343.0 |
| Depreciation and impairment of property and equipment | 11555.0 | 12905.0 | 10856.0 |
| Amortization and impairment of intangible assets | 886.0 | 792.0 | 925.0 |
| Stock-based compensation expense | 15376.0 | 12991.0 | 10794.0 |
| Deferred income taxes | 1808.0 | 1390.0 | 173.0 |

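The cleanup step is easier to reason about on a single cell. The sketch below applies the same idea as the chained `replace` calls to one string at a time: strip the currency formatting, read parentheses as a negative sign, and treat an empty cell as missing. `parse_amount` is a hypothetical helper name, and the parenthesised input is an illustrative value, not a figure from the filing.

```python
import math
import re


def parse_amount(cell: str) -> float:
    """Convert a statement cell such as '$ 76,033' or '(1,234)' to a float."""
    # drop '$', ',', ')' and any whitespace
    text = re.sub(r'[\$,)\s]', '', cell)
    # accounting notation: an opening parenthesis marks a negative number
    text = text.replace('(', '-')
    # an empty cell carries no value
    if text == '':
        return math.nan
    return float(text)


for raw in ['$ 76,033', '11,555', '(1,234)', '']:
    print(repr(raw), parse_amount(raw))
```

A helper like this can be handy for spot-checking individual cells when the `astype(float)` call fails on an unexpected value.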