#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/1_2.png)

# 1.2 Text Extraction:
* [1.2.1. Exploring text data sources](#1.2.1)
* [1.2.2. Internet parsing using API](#1.2.2)
* [1.2.3. Internet parsing using BeautifulSoup](#1.2.3)

---

### Financial Text Data Sources 

Examples of data sources:

Corporate Reports  
• SEC’s EDGAR: 1994-2015, 15+ million filings, annual and quarterly reports  
• Regulatory disclosures: annual and interim filings (10-K and 10-Q), correspondences, IPO registration statements, etc.  
• Company Announcements: Websites, SEC 8-K filings, news feeds  

News: Newswires, articles, blogs.  
• WSJ News Archive: XML encapsulated, 2000 - present  
• SEC’s Current Report filings (8-K): any material new information  
• Audio transcripts of conference calls by top executives, earnings announcements  
• Analyst ratings and research reports, investor surveys, investor activism campaigns  

Social Media, other sources    
• Twitter, Stocktwits, message boards, websites  
• Google searches (Google Analytics)  
• Patent Applications: [patft.uspto.gov](http://patft.uspto.gov/netahtml/PTO/search-adv.htm) and [patents.google.com](https://patents.google.com/)

<br>

*For the purpose of the **Use case** we will use [EDGAR Company Filings](https://www.sec.gov/edgar/searchedgar/companysearch.html) and [Annual reports](http://bit.ly/39VuFHW) provided by PGGM*

---
### 1.2.2. Internet Parsing using API
<a id="1.2.2">

- Before we dig into the EDGAR data, we need to map the CIK codes to the company names so that we can gather information automatically.
- A Central Index Key, or CIK number, is a number given to an individual, company, or foreign government by the United States Securities and Exchange Commission.
- The number helps us identify an entity's filings in several online databases, including EDGAR.
- The numbers are ten digits in length.
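
Because a CIK is ten digits long, a value that was read as an integer loses its leading zeros. A minimal sketch of padding it back, assuming the CIK arrives as an integer or a digit string; `normalize_cik` is a hypothetical helper name and `1234` is a made-up value, neither comes from the notebook:

```python
def normalize_cik(cik):
    """Return the CIK as a zero-padded, ten-character string."""
    digits = str(cik).strip()
    if not digits.isdigit():
        raise ValueError(f"CIK must contain only digits, got {cik!r}")
    if len(digits) > 10:
        raise ValueError(f"CIK has more than ten digits: {cik!r}")
    return digits.zfill(10)


# Made-up CIK, used only to show the padding
print(normalize_cik(1234))  # prints 0000001234
```

Keeping CIKs as strings from the start avoids the problem when they are later used as lookup keys.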

**A couple of data sources provide company names, tickers, CIKs and exchanges, such as [rankandfiled.com](http://rankandfiled.com/#/data/tickers) or [dan.vonkohorn.com](https://dan.vonkohorn.com/2016/07/03/cik-ticker-mappings/)**

In [None]:
import pandas as pd

In [None]:
#reading the ticker file
cik_ticker = pd.read_csv('datasets/cik_ticker.csv', sep='|')

In [None]:
cik_ticker.head()

In [None]:
cik_ticker.Exchange.unique()

In [None]:
len(cik_ticker)

Accessing CIK - Ticker mappings via API

In [None]:
NASDAQ = pd.read_json('https://mapping-api.herokuapp.com/exchange/NASDAQ')
OTC = pd.read_json('https://mapping-api.herokuapp.com/exchange/OTC')
NYSE = pd.read_json('https://mapping-api.herokuapp.com/exchange/NYSE')
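
`pd.read_json` does two things in one call: it fetches the URL and decodes the JSON body into rows. The sketch below shows only the decoding step on a made-up payload. It assumes the API returns a list of objects with `cik` and `ticker` fields (the two columns used later in this notebook); the `name` field and all sample values are hypothetical.

```python
import json

# Hypothetical response body; the real API may return other fields and values.
payload = (
    '[{"cik": 1111, "ticker": "AAA", "name": "Example One"},'
    ' {"cik": 2222, "ticker": "BBB", "name": "Example Two"}]'
)

# json.loads turns the text into a list of dicts, one per company
records = json.loads(payload)

# Build a ticker -> CIK lookup from the decoded records
cik_by_ticker = {row["ticker"]: row["cik"] for row in records}
print(cik_by_ticker)  # prints {'AAA': 1111, 'BBB': 2222}
```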

In [None]:
# Stack the three exchange tables into one DataFrame
# (DataFrame.append was removed in pandas 2.0, pd.concat is the replacement)
df = pd.concat([NASDAQ, OTC, NYSE])

In [None]:
# draw a reproducible random sample of 120 rows (with replacement) as an example
df = df.sample(120, random_state=1298, replace=True)

In [None]:
df = df.head(10) #Only taking the first 10 for the moment to avoid heavy processing
len(df)

In [None]:
df

---
#### *Learn more about APIs at [dataquest.io](https://www.dataquest.io/blog/python-api-tutorial/) and [freecodecamp.org](https://www.freecodecamp.org/news/what-is-an-api-in-english-please-b880a3214a82/)*

---
### 1.2.3. Internet Parsing using BeautifulSoup
<a id="1.2.3">

The main idea is to automate the search for companies and extract the information from their annual filings. Since there is no proper data source of annual reports, we can always "scrape EDGAR" using BeautifulSoup.

EDGAR text analysis sources:
- [SEC Edgar Crawler](https://github.com/coyo8/sec-edgar)
- [OpenEDGAR by LexPredict](https://github.com/LexPredict/openedgar) and [Paper](https://arxiv.org/pdf/1806.04973.pdf)     

In [None]:
#main libraries for internet parsing
import requests
from bs4 import BeautifulSoup

In [None]:
import re
import time
from datetime import datetime, timedelta

Let's go step by step through the EDGAR database

1. On the main page, enter a ticker to search for, `AAL` for example.
2. Look at the URL: it uses a query syntax, `https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAL`.
3. The annual reports are listed in the order they were added, so the most recent year should come first.
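
The query syntax in step 2 can also be assembled from a dictionary of parameters instead of string concatenation. A minimal sketch using the standard library, with the `CIK` and `type` parameters taken from the query used in this notebook; `build_query` and `EDGAR_BROWSE` are hypothetical names:

```python
from urllib.parse import urlencode

EDGAR_BROWSE = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_query(cik, form_type="10-K"):
    """Return the EDGAR browse URL for one company and one form type."""
    params = {"CIK": cik, "type": form_type}
    # urlencode escapes any characters that are not safe inside a URL
    return EDGAR_BROWSE + "?" + urlencode(params)


print(build_query("AAL"))
# prints https://www.sec.gov/cgi-bin/browse-edgar?CIK=AAL&type=10-K
```

Letting `urlencode` build the query string avoids malformed URLs when a value contains spaces or ampersands.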

In [None]:
# query search
cik = 'AAL'
query_search = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK='+str(cik)+'&type=10-K'

In [None]:
# internet request
page = requests.get(query_search)
page

In [None]:
# parsing the page
page_parsed = BeautifulSoup(page.text, 'html.parser')

In [None]:
# looking for the right doc
results_table = page_parsed.find(class_='tableFile2')

Make the procedure modular

In [None]:
#list of cik codes
cik_codes = list(df.cik.unique())

In [None]:
# function to get the links listed in the filings table
def get_url(query):
    page = requests.get(query)
    page_parsed = BeautifulSoup(page.text, 'html.parser')
    results_table = page_parsed.find(class_='tableFile2')
    # find() returns None when the page has no filings table
    if results_table is None:
        return []
    return results_table.find_all('a')

In [None]:
# function to get the txt files
def get_txt_files(cik_codes):
    # INPUT = list of CIK codes
    docs_urls = []
    for cik in cik_codes:
        query = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK='+str(cik)+'&type=10-K'
        results_files = get_url(query)
        #if no files are registered, the table and the list of links are empty
        if results_files != []:
            #print(str(query))
            results_items = results_files[0].prettify()  #here access the company
            raw_url = 'https://www.sec.gov'+re.findall(r'\"(.+?)\"',results_items)[0]
            url = raw_url.replace('-index.htm','.txt')
            docs_urls.append(url)
        else:
            docs_urls.append('https://www.google.com/')  # placeholder URL (instead of 'no-url') so the later download step does not fail
    return docs_urls
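
The two string operations inside `get_txt_files` are easier to follow in isolation: the regular expression pulls the first quoted value out of the anchor tag, and the replacement turns the filing's index page into its full text file. A small sketch on a hypothetical anchor tag; the path and accession number are made up, and `index_link_to_txt_url` is a hypothetical helper name:

```python
import re

# Hypothetical anchor tag, shaped like the output of .prettify() on a results link
sample_anchor = (
    '<a href="/Archives/edgar/data/1234/0001234567-20-000001-index.htm"'
    ' id="documentsbutton">\n Documents\n</a>'
)


def index_link_to_txt_url(anchor_html):
    """Turn a prettified EDGAR results link into the URL of the .txt filing."""
    # The first double-quoted value in the tag is assumed to be the href
    href = re.findall(r'"(.+?)"', anchor_html)[0]
    return "https://www.sec.gov" + href.replace("-index.htm", ".txt")


print(index_link_to_txt_url(sample_anchor))
# prints https://www.sec.gov/Archives/edgar/data/1234/0001234567-20-000001.txt
```

This only works while `href` is the first attribute of the tag. Reading the attribute from the parsed tag itself (`tag['href']` in BeautifulSoup) would not depend on attribute order.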

#### Web crawler to extract the URL of the text files from EDGAR database

In [None]:
start_time = time.monotonic()
#execution
urls_files = get_txt_files(cik_codes)
#time
sttime = datetime.now().strftime('%Y%m%d_%H:%M_')
end_time = time.monotonic()
ex_time = timedelta(seconds=end_time - start_time)
print("Execution time: {}".format(ex_time))

Now we can define a new column, and add the URL of each company related to each CIK

In [None]:
# Only the head to test
df = df.head(len(urls_files))
df['file'] = urls_files

In [None]:
# save to csv
df.to_csv('datasets/table_companies_short.csv')

In [None]:
tickers = list(df['ticker'])

In [None]:
df.head()

In [None]:
# check the first URL
urls_files[0]

**WARNING: Here we actually request documents (it takes some time to run)**

In [None]:
def url_to_transcript(url):
    return requests.get(url).text

In [None]:
reports = [url_to_transcript(u) for u in urls_files]
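
The list comprehension above sends all requests back to back and stops at the first failure. A hedged alternative pauses between requests and records a failed download as `None`. `fetch_all` and its parameters are hypothetical names; the fetch function is passed in, so `url_to_transcript` can be reused unchanged:

```python
import time


def fetch_all(urls, fetch, delay_seconds=0.2, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing between calls.

    A failed download is stored as None so that the result keeps
    the same length and order as the input list.
    """
    texts = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # be polite to the server between requests
        try:
            texts.append(fetch(url))
        except Exception as error:
            print(f"Failed to fetch {url}: {error}")
            texts.append(None)
    return texts


# Offline demonstration with a stand-in fetch function (no network access)
demo = fetch_all(["first", "second"], fetch=str.upper, delay_seconds=0)
print(demo)  # prints ['FIRST', 'SECOND']
```

In the notebook this would be called as `reports = fetch_all(urls_files, url_to_transcript)`. The SEC asks automated clients to keep their request rate low and to identify themselves with a `User-Agent` header, so check its current guidance before running a large crawl.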

<br>
Saving the collected files to disk with pickle

In [None]:
# Make a new directory to hold the text files
#!mkdir reports

In [None]:
import pickle

In [None]:
# for i, c in enumerate(tickers):
#     with open("reports/" + c + ".txt", "wb") as file:
#         pickle.dump(reports[i], file)

In [None]:
# Take a look at a file
print(reports[0][:800])


**Congratulations, we have scraped EDGAR successfully. To find out whether the files are saved correctly, we simply load them back and inspect them (good practice).**
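
A save-and-load round trip could look like the sketch below. It writes to a temporary folder so that it runs anywhere; the `.pkl` extension, the helper names `save_reports` and `load_report`, and the sample texts are assumptions made for illustration:

```python
import pickle
import tempfile
from pathlib import Path


def save_reports(reports, names, folder):
    """Pickle each report text to <folder>/<name>.pkl."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, text in zip(names, reports):
        with open(folder / f"{name}.pkl", "wb") as file:
            pickle.dump(text, file)


def load_report(name, folder):
    """Load one pickled report text back from disk."""
    with open(Path(folder) / f"{name}.pkl", "rb") as file:
        return pickle.load(file)


# Round trip with made-up report texts and ticker names
with tempfile.TemporaryDirectory() as tmp:
    save_reports(["first report text", "second report text"], ["AAA", "BBB"], tmp)
    restored = load_report("AAA", tmp)
    print(restored == "first report text")  # prints True
```

With the notebook's own variables the call would be `save_reports(reports, tickers, "reports")`. Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.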

---
#### *Learn more about BeautifulSoup at [dataquest.io](https://www.dataquest.io/blog/web-scraping-beautifulsoup/) and in the [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). The implementation is inspired by [scrapsfromtheloft.com](http://scrapsfromtheloft.com)*