#### PGGM Data Science Bootcamp 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](../img/image_2.png)

# 2. Intro to Internet Parsing
* [2.1. Internet parsing using API](#2.1)
* [2.2. Internet parsing using BeautifulSoup](#2.2)

---

### 2.1. Internet Parsing using API
<a id="2.1"></a>

- Before digging into the EDGAR data, we need to map CIK codes to company names so that we can gather information automatically.
- A Central Index Key (CIK) is a number assigned to an individual, company, or foreign government by the United States Securities and Exchange Commission (SEC).
- The number identifies the entity's filings in several online databases, including EDGAR.
- CIK numbers are ten digits long.
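The mapping tables used in this notebook store CIKs as plain integers (for example `6201` for American Airlines Group Inc), so the leading zeros of the ten-digit form are not visible. A minimal sketch of restoring them, assuming left zero-padding is the format you need; the helper name `pad_cik` is hypothetical and not part of any library:

```python
def pad_cik(cik):
    """Return the CIK as a ten-character string, padded with zeros on the left."""
    return str(cik).zfill(10)

print(pad_cik(6201))    # American Airlines Group Inc
print(pad_cik(320193))  # Apple Inc
```

Keeping the integer form in the DataFrame and padding only when a string is needed avoids mixing the two representations in one column.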

Accessing CIK - Ticker mappings via API

In [2]:
import pandas as pd

In [3]:
NASDAQ = pd.read_json('https://mapping-api.herokuapp.com/exchange/NASDAQ')
OTC = pd.read_json('https://mapping-api.herokuapp.com/exchange/OTC')
NYSE = pd.read_json('https://mapping-api.herokuapp.com/exchange/NYSE')

In [4]:
# DataFrame.append was removed in pandas 2.0, so the three tables are combined with pd.concat
df = pd.concat([NASDAQ, OTC, NYSE])

In [6]:
df

Unnamed: 0,cik,ticker,name,sic,exchange,irs
0,1099290,AAC,Sinocoking Coal & Coke Chemical Industries Inc,3312,NASDAQ,593404233
1,6201,AAL,American Airlines Group Inc,4512,NASDAQ,751825172
2,8177,AAME,Atlantic American Corp,6311,NASDAQ,581027114
3,1158114,AAOI,Applied Optoelectronics Inc,3674,NASDAQ,760533927
4,320193,AAPL,Apple Inc,3571,NASDAQ,942404110
...,...,...,...,...,...,...
2731,1024628,ZP,Zap,3751,NYSE ARCA,943210624
2732,805305,ZQK,Quiksilver Inc,2320,NYSE,330199426
2733,1052257,ZTM,Z Trim Holdings Inc,2040,NYSE,364197173
2734,1555280,ZTS,Zoetis Inc,2834,NYSE,460696167


In [None]:
# optional: draw a reproducible random sample of 120 rows (replace=True allows repeated rows)
df = df.sample(120, random_state=1298, replace=True)

In [9]:
df = df.head(10)  # keep only the first 10 rows for now to avoid heavy processing
len(df)

In [10]:
df.to_csv('output.csv')

**Several data sources provide company names, tickers, CIK codes, and exchanges, such as [rankandfiled.com](http://rankandfiled.com/#/data/tickers) or [dan.vonkohorn.com](https://dan.vonkohorn.com/2016/07/03/cik-ticker-mappings/).**

In [11]:
import pandas as pd

In [22]:
#reading the ticker file
cik_ticker = pd.read_csv('../datasets/cik_ticker.csv', sep='|')

In [23]:
cik_ticker.head(2)

Unnamed: 0,CIK,Ticker,Name,Exchange,SIC,Business,Incorporated,IRS
0,1090872,A,Agilent Technologies Inc,NYSE,3825.0,CA,DE,770518772.0
1,4281,AA,Alcoa Inc,NYSE,3350.0,PA,PA,250317820.0


In [15]:
type(cik_ticker)

pandas.core.frame.DataFrame

In [18]:
cik_ticker['Exchange'].unique()

array(['NYSE', nan, 'OTC', 'NASDAQ', 'OTCBB', 'NYSE ARCA', 'NYSE MKT',
       'BATS'], dtype=object)

In [19]:
len(cik_ticker)

13737
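Once the mapping is loaded, translating a ticker into its CIK is a simple lookup. The sketch below is illustrative only: it rebuilds the two rows shown in the `head(2)` output as an in-memory pipe-separated file, assuming the real file has a header row with those column names, and the variable names `toy` and `ticker_to_cik` are made up for the example:

```python
import io
import pandas as pd

# Two rows copied from the head(2) output, written in a pipe-separated layout
sample = io.StringIO(
    "CIK|Ticker|Name|Exchange|SIC|Business|Incorporated|IRS\n"
    "1090872|A|Agilent Technologies Inc|NYSE|3825|CA|DE|770518772\n"
    "4281|AA|Alcoa Inc|NYSE|3350|PA|PA|250317820\n"
)
toy = pd.read_csv(sample, sep="|")

# Series indexed by ticker, so a ticker can be translated into its CIK
ticker_to_cik = toy.set_index("Ticker")["CIK"]
aa_cik = int(ticker_to_cik["AA"])
print(aa_cik)
```

The same pattern should carry over to `cik_ticker`, provided each ticker appears only once in the file.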

---
#### *Learn more about APIs at [dataquest.io](https://www.dataquest.io/blog/python-api-tutorial/) and [freecodecamp.org](https://www.freecodecamp.org/news/what-is-an-api-in-english-please-b880a3214a82/)*

---
### 2.2. Internet Parsing using BeautifulSoup
<a id="2.2"></a>

The main idea is to automate the company search and extract the information from the annual filings. Since there is no ready-made data source of annual reports, we can access them with BeautifulSoup from the official [EDGAR Company Filings](https://www.sec.gov/edgar/searchedgar/companysearch.html) website.

EDGAR data analysis sources:
- [SEC Edgar Crawler](https://github.com/coyo8/sec-edgar)
- [OpenEDGAR by LexPredict](https://github.com/LexPredict/openedgar) and [Paper](https://arxiv.org/pdf/1806.04973.pdf)     

In [24]:
#main libraries for internet parsing
import requests
from bs4 import BeautifulSoup

In [25]:
import re
import time
from datetime import datetime, timedelta

Let's go step by step through the EDGAR database:

1. On the main page, enter a ticker to look up, for example `AAL`.
2. Look at the URL: it uses a query syntax, `https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAL`.
3. The annual reports are listed by date added, so the most recent year should come first.
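The query string from step 2 can also be assembled from a dictionary of parameters instead of by string concatenation. This is a small sketch using only the standard library; the function name `build_query` is hypothetical, and the `type=10-K` parameter is the one this notebook uses to restrict the results to annual reports:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sec.gov/cgi-bin/browse-edgar"

def build_query(cik, form_type="10-K"):
    """Return the EDGAR search URL for one CIK (or ticker) and form type."""
    return BASE_URL + "?" + urlencode({"CIK": cik, "type": form_type})

print(build_query("AAL"))
```

Letting `urlencode` build the query string keeps the parameters readable and escapes any characters that are not allowed in a URL.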

In [28]:
# query search (the CIK parameter also accepts a ticker, as in the URL above)
cik = 'AAL'
query_search = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&type=10-K'.format(cik)

In [29]:
query_search

'https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAL&type=10-K'

In [30]:
# internet request
page = requests.get(query_search)
page

<Response [200]>

In [31]:
# parsing the page
page_parsed = BeautifulSoup(page.text, 'html.parser')

In [33]:
#page_parsed

In [34]:
# looking for the right doc
results_table = page_parsed.find(class_='tableFile2')

In [41]:
results_table.find_all('a')[0]

<a href="/Archives/edgar/data/6201/000000620120000023/0000006201-20-000023-index.htm" id="documentsbutton"> Documents</a>

In [None]:
'https://www.sec.gov/Archives/edgar/data/6201/000000620120000023/0000006201-20-000023.txt'
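The step from the relative link in the results table to the URL of the text file can be sketched with the standard library. The helper name `index_to_txt_url` is hypothetical, and the sketch assumes, as this notebook does, that the `.txt` file sits next to the `-index.htm` page under the same name:

```python
from urllib.parse import urljoin

def index_to_txt_url(href, base="https://www.sec.gov"):
    """Turn the relative link of a filing index page into the URL of its .txt file."""
    absolute = urljoin(base, href)
    # assumption: the text file shares the index page's name, minus the '-index.htm' suffix
    return absolute.replace("-index.htm", ".txt")

href = "/Archives/edgar/data/6201/000000620120000023/0000006201-20-000023-index.htm"
print(index_to_txt_url(href))
```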

Make the procedure modular

In [43]:
#list of cik codes
cik_codes = list(df.cik.unique())

In [44]:
# function to get the links listed in the filings table
def get_url(query):
    page = requests.get(query)
    page_parsed = BeautifulSoup(page.text, 'html.parser')
    results_table = page_parsed.find(class_='tableFile2')
    # find() returns None when the page has no filings table
    if results_table is None:
        return []
    return results_table.find_all('a')
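One detail worth checking is what BeautifulSoup returns when a page has no filings table: `find` gives back `None`, and calling `find_all` on `None` raises an `AttributeError`. The sketch below illustrates this with two hand-written HTML snippets (assumptions for the example, not real EDGAR responses); the function name `extract_links` is hypothetical:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the href of every link inside the 'tableFile2' table, or [] if it is missing."""
    table = BeautifulSoup(html, "html.parser").find(class_="tableFile2")
    if table is None:  # no element with that class on the page
        return []
    return [a.get("href") for a in table.find_all("a")]

# Hand-written snippets, not real EDGAR responses
with_table = (
    '<table class="tableFile2"><tr><td>'
    '<a href="/Archives/edgar/data/6201/000000620120000023/0000006201-20-000023-index.htm"'
    ' id="documentsbutton"> Documents</a>'
    '</td></tr></table>'
)
without_table = "<html><body><p>No matching filings.</p></body></html>"

print(extract_links(with_table))
print(extract_links(without_table))
```

Reading the `href` attribute directly also avoids having to pull the link out of the tag's text with a regular expression.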

In [45]:
# function to get the txt files
def get_txt_files(cik_codes):
    # INPUT = list of CIK codes; OUTPUT = list with one URL per CIK code
    docs_urls = []
    for cik in cik_codes:
        query = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&type=10-K'.format(cik)
        results_files = get_url(query)
        # an empty list means there are no 10-K filings listed for this CIK
        if results_files:
            # the first link points to the index page of the most recent filing
            raw_url = 'https://www.sec.gov' + results_files[0]['href']
            url = raw_url.replace('-index.htm', '.txt')
            docs_urls.append(url)
        else:
            # placeholder URL so the list keeps the same length as cik_codes
            docs_urls.append('https://www.google.com/')
    return docs_urls
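EDGAR is a public server, and the SEC publishes fair-access guidance for automated clients, so it is worth pausing between requests and not letting one failing company stop a long run. The loop below is only a sketch of that idea: `collect_with_pause` and `fake_fetch` are hypothetical names, the pause length is an arbitrary choice, and the download itself is replaced by a stand-in function so the example runs without network access:

```python
import time

def collect_with_pause(cik_codes, fetch, pause_seconds=0.5):
    """Call fetch(cik) for every CIK, sleeping between calls and continuing after errors."""
    results = {}
    for position, cik in enumerate(cik_codes):
        if position > 0:
            time.sleep(pause_seconds)  # wait before every request except the first
        try:
            results[cik] = fetch(cik)
        except Exception as error:  # one failed company should not stop the whole run
            print("Skipping {}: {}".format(cik, error))
            results[cik] = None
    return results

# Stand-in for a real download, so the sketch runs offline
def fake_fetch(cik):
    if cik == 8177:
        raise ValueError("no filings table")
    return "url-for-{}".format(cik)

demo = collect_with_pause([6201, 8177, 320193], fake_fetch, pause_seconds=0)
print(demo)
```

In a real run, `fetch` would be a function that builds the query for one CIK and returns the URL of its latest filing.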

**Web crawler to extract the URL of the text files from EDGAR database**

In [46]:
start_time = time.monotonic()
#execution
urls_files = get_txt_files(cik_codes)
#time
sttime = datetime.now().strftime('%Y%m%d_%H:%M_')
end_time = time.monotonic()
ex_time = timedelta(seconds=end_time - start_time)
print("Execution time: {}".format(ex_time))

Execution time: 0:34:09.126541


Now we can add a new column that holds the filing URL found for each CIK.

In [47]:
# copy the slice so that adding a column does not trigger SettingWithCopyWarning
df = df.head(len(urls_files)).copy()
df['file'] = urls_files


In [48]:
# check the first URL
urls_files[0]

'https://www.sec.gov/Archives/edgar/data/1099290/000114420417037280/0001144204-17-037280.txt'
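The URL itself carries two useful identifiers: the CIK and the filing's accession number. A regular expression can pull them back out, assuming every URL follows the same pattern as the one shown above; the names `FILING_URL` and `split_filing_url` are hypothetical:

```python
import re

# /data/<CIK>/<folder>/<accession number>.txt at the end of the URL
FILING_URL = re.compile(r"/data/(\d+)/\d+/(\d{10}-\d{2}-\d{6})\.txt$")

def split_filing_url(url):
    """Return (cik, accession_number) for a filing URL, or None if it does not match."""
    match = FILING_URL.search(url)
    if match is None:
        return None  # for example, a placeholder URL
    return int(match.group(1)), match.group(2)

url = "https://www.sec.gov/Archives/edgar/data/1099290/000114420417037280/0001144204-17-037280.txt"
print(split_filing_url(url))
```

Returning `None` for non-matching URLs makes it easy to filter out companies for which no filing was found.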

In [50]:
df

Unnamed: 0,cik,ticker,name,sic,exchange,irs,file
0,1099290,AAC,Sinocoking Coal & Coke Chemical Industries Inc,3312,NASDAQ,593404233,https://www.sec.gov/Archives/edgar/data/109929...
1,6201,AAL,American Airlines Group Inc,4512,NASDAQ,751825172,https://www.sec.gov/Archives/edgar/data/6201/0...
2,8177,AAME,Atlantic American Corp,6311,NASDAQ,581027114,https://www.sec.gov/Archives/edgar/data/8177/0...
3,1158114,AAOI,Applied Optoelectronics Inc,3674,NASDAQ,760533927,https://www.sec.gov/Archives/edgar/data/115811...
4,320193,AAPL,Apple Inc,3571,NASDAQ,942404110,https://www.sec.gov/Archives/edgar/data/320193...
...,...,...,...,...,...,...,...
2731,1024628,ZP,Zap,3751,NYSE ARCA,943210624,https://www.sec.gov/Archives/edgar/data/102462...
2732,805305,ZQK,Quiksilver Inc,2320,NYSE,330199426,https://www.sec.gov/Archives/edgar/data/805305...
2733,1052257,ZTM,Z Trim Holdings Inc,2040,NYSE,364197173,https://www.sec.gov/Archives/edgar/data/105225...
2734,1555280,ZTS,Zoetis Inc,2834,NYSE,460696167,https://www.sec.gov/Archives/edgar/data/155528...


In [49]:
df.to_csv('output.csv')

---
#### *Learn more about BeautifulSoup at [dataquest.io](https://www.dataquest.io/blog/web-scraping-beautifulsoup/) and in the [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). The implementation is inspired by [scrapsfromtheloft.com](https://scrapsfromtheloft.com)*