# NLP Financial Statements



 >**[Part 1 - Get Raw 10-Ks](#Part-1---Get-Raw-10-Ks)**
 
 >>[Get List of 10-Ks](#Get-List-of-10-Ks-for-Each-Company)
 
 >>[Download 10-Ks](#Download-10-Ks)

 >**[Part 2 - Pre-process Text](#Part-2---Pre-process-Text)**
 
 >>[Clean Text](#Clean-Text)
 
 >>[Lemmatize](#Lemmatize)
 
 >>[Remove Stopwords](#Remove-Stopwords)
 
 >**[Part 3 - Measure Sentiment](#Part-3---Measure-Sentiment)**
 
 >>[Loughran McDonald Sentiment Word Lists](#Loughran-McDonald-Sentiment-Word-Lists)
 
 >>[Bag of Words](#Bag-of-Words)
 
 >>[Jaccard Similarity](#Jaccard-Similarity)
 
 >>[TFIDF](#TFIDF)
 
 >>[Cosine Similarity](#Cosine-Similarity)
 
 >**[Part 4 - Evaluate Alpha Factors](#Part-4---Evaluate-Alpha-Factors)**
 
 >>[Price Data](#Price-Data)
 
 >>[Alphalens Format](#Alphalens-Format)
 
 >>[Factor Returns](#Factor-Returns)
 
 >>[Basis Points per Day per Quantile](#Basis-Points-per-Day-per-Quantile)
 
 >>[Turnover Analysis](#Turnover-Analysis)
 
 >>[Sharpe Ratio of the Alphas](#Sharpe-Ratio-of-The-Alphas)




In [1]:
import nltk
import numpy as np
import pandas as pd
import pickle
import pprint
import requests
import datetime as dt
from bs4 import BeautifulSoup
from tqdm import tqdm

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ryanbusby/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ryanbusby/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

---
## Part 1 - Get Raw 10-Ks

### Get List of 10-Ks for Each Company

In [7]:
cik_lookup = {
    'AEP': '0000004904',
    'AMZN': '0001018724',
    'AXP': '0000004962',
    'BA': '0000012927', 
    'BK': '0001390777',
    'BMY': '0000014272',
    'CAT': '0000018230',
    'CNP': '0001130310',
    'CVX': '0000093410',
    'DE': '0000315189',
    'DIS': '0001001039', 
    'DTE': '0000936340',
    'ED': '0001047862',
    'EMR': '0000032604',
    'ETN': '0001551182',
    'FL': '0000850209',
    'FRT': '0000034903',
    'GE': '0000040545',
    'HON': '0000773840',
    'IBM': '0000051143',
    'IP': '0000051434',
    'JNJ': '0000200406',
    'KO': '0000021344',
    'LLY': '0000059478',
    'MCD': '0000063908',
    'MO': '0000764180',
    'MRK': '0000310158',
    'MRO': '0000101778',
    'PCG': '0001004980',
    'PEP': '0000077476',
    'PFE': '0000078003',
    'PG': '0000080424',
    'PNR': '0000077360',
    'SYY': '0000096021',
    'TXN': '0000097476',
    'UTX': '0000101829',
    'WFC': '0000072971',
    'WMT': '0000104169',
    'WY': '0000106535',
    'XOM': '0000034088'
}

In [8]:
def get_sec_data(cik, doc_type, start=0, count=60):
    newest_price_data = pd.to_datetime('2021-02-01')
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany'\
    '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom'\
    .format(cik, doc_type, start, count)
    sec_data = requests.get(rss_url).text.encode('ascii')
    feed = BeautifulSoup(sec_data, 'xml').feed
    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText(),
        )\
        for entry in feed.find_all('entry', recursive=False)
        if pd.to_datetime(entry.content.find('filing-date').getText())\
        <= newest_price_data
    ]
    
    return entries
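
A note on the request itself: SEC's EDGAR asks automated clients to identify themselves through a `User-Agent` header, and requests sent without one may be rejected. The sketch below builds the same query URL as `get_sec_data` using `urllib.parse.urlencode` and pairs it with such a header. The helper name `build_edgar_request` and the contact string are placeholders for illustration, not part of the original notebook.

```python
from urllib.parse import urlencode

EDGAR_BROWSE_URL = 'https://www.sec.gov/cgi-bin/browse-edgar'

def build_edgar_request(cik, doc_type, start=0, count=60,
                        contact='Your Name your.email@example.com'):
    """Return (url, headers) for an EDGAR company filing feed (hypothetical helper)."""
    params = {
        'action': 'getcompany',
        'CIK': cik,
        'type': doc_type,
        'start': start,
        'count': count,
        'owner': 'exclude',
        'output': 'atom',
    }
    url = '{}?{}'.format(EDGAR_BROWSE_URL, urlencode(params))
    # Placeholder contact string; replace with real contact details.
    headers = {'User-Agent': contact}
    return url, headers

url, headers = build_edgar_request('0001018724', '10-K')
print(url)
# The pair would then be passed as requests.get(url, headers=headers).
```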

In [10]:
sec_data = {}

for ticker, cik in cik_lookup.items():
    sec_data[ticker] = get_sec_data(cik, '10-K')
    
pprint.pprint(sec_data['AMZN'][:10])

[('https://www.sec.gov/Archives/edgar/data/1018724/000101872420000004/0001018724-20-000004-index.htm',
  '10-K',
  '2020-01-31'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872419000004/0001018724-19-000004-index.htm',
  '10-K',
  '2019-02-01'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872418000005/0001018724-18-000005-index.htm',
  '10-K',
  '2018-02-02'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872417000011/0001018724-17-000011-index.htm',
  '10-K',
  '2017-02-10'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872416000172/0001018724-16-000172-index.htm',
  '10-K',
  '2016-01-29'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872415000006/0001018724-15-000006-index.htm',
  '10-K',
  '2015-01-30'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm',
  '10-K',
  '2014-01-31'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000119312513028520/0001193125-13-028520

[&#9650;back to top](#NLP-Financial-Statements)

---

### Download 10-Ks

Iterate over the URLs in ```sec_data``` and load the raw 10-Ks into a dictionary.
The URLs in ```sec_data``` point to filing index pages; replacing the `-index.htm` suffix with `.txt` gives the URL of the raw text file.
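
As a minimal sketch of that URL rewrite, the helper below strips either index suffix and appends `.txt`. The function name `index_url_to_txt_url` is hypothetical, and the handling of the `.html` variant is an assumption based on the replace chain used in the download cell.

```python
def index_url_to_txt_url(index_url):
    """Map an EDGAR filing index URL to the full-submission text file URL."""
    # Check the longer suffix first so '-index.html' is not cut short.
    for suffix in ('-index.html', '-index.htm'):
        if index_url.endswith(suffix):
            return index_url[:-len(suffix)] + '.txt'
    # Leave unrecognized URLs unchanged instead of guessing.
    return index_url

example = ('https://www.sec.gov/Archives/edgar/data/1018724/'
           '000101872420000004/0001018724-20-000004-index.htm')
print(index_url_to_txt_url(example))
```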

In [11]:
raw_filings = {}

for ticker, data in sec_data.items():
    raw_filings[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc='Downloading {} Filings'.format(ticker), unit='filing'):
        if file_type == '10-K':
            # Index pages end in '-index.htm' or '-index.html'; both map to '.txt'.
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')
            raw_filings[ticker][file_date] = requests.get(file_url).text

Downloading AEP Filings: 100%|██████████| 29/29 [01:10<00:00,  2.44s/filing]
Downloading AMZN Filings: 100%|██████████| 25/25 [00:05<00:00,  4.51filing/s]
Downloading AXP Filings: 100%|██████████| 31/31 [00:11<00:00,  2.72filing/s]
Downloading BA Filings: 100%|██████████| 30/30 [00:13<00:00,  2.25filing/s]
Downloading BK Filings: 100%|██████████| 15/15 [00:12<00:00,  1.18filing/s]
Downloading BMY Filings: 100%|██████████| 30/30 [00:11<00:00,  2.62filing/s]
Downloading CAT Filings: 100%|██████████| 40/40 [00:25<00:00,  1.57filing/s]
Downloading CNP Filings: 100%|██████████| 22/22 [00:08<00:00,  2.56filing/s]
Downloading CVX Filings: 100%|██████████| 28/28 [00:14<00:00,  1.90filing/s]
Downloading DE Filings: 100%|██████████| 33/33 [00:17<00:00,  1.87filing/s]
Downloading DIS Filings: 100%|██████████| 38/38 [00:13<00:00,  2.75filing/s]
Downloading DTE Filings: 100%|██████████| 27/27 [00:18<00:00,  1.43filing/s]
Downloading ED Filings: 100%|██████████| 23/23 [00:12<00:00,  1.84filing/s]
Do

In [12]:
print('Example Document:\n\n{}...'.format(next(iter(raw_filings['XOM'].values()))[:3000]))

Example Document:

<SEC-DOCUMENT>0000034088-20-000016.txt : 20200226
<SEC-HEADER>0000034088-20-000016.hdr.sgml : 20200226
<ACCEPTANCE-DATETIME>20200226161519
ACCESSION NUMBER:		0000034088-20-000016
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		138
CONFORMED PERIOD OF REPORT:	20191231
FILED AS OF DATE:		20200226
DATE AS OF CHANGE:		20200226

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			EXXON MOBIL CORP
		CENTRAL INDEX KEY:			0000034088
		STANDARD INDUSTRIAL CLASSIFICATION:	PETROLEUM REFINING [2911]
		IRS NUMBER:				135409005
		STATE OF INCORPORATION:			NJ
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-02256
		FILM NUMBER:		20655849

	BUSINESS ADDRESS:	
		STREET 1:		5959 LAS COLINAS BLVD
		CITY:			IRVING
		STATE:			TX
		ZIP:			75039-2298
		BUSINESS PHONE:		9729406000

	MAIL ADDRESS:	
		STREET 1:		5959 LAS COLINAS BLVD
		CITY:			IRVING
		STATE:			TX
		ZIP:			75039-2298

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	E

#### Extract the Content Within the &lt;DOCUMENT&gt; Tag

In [13]:
import re

In [14]:
def get_documents(raw_file):
    start_regex, end_regex = re.compile(r'<DOCUMENT>'), re.compile(r'</DOCUMENT>')
    
    start_indices = [m.end() for m in start_regex.finditer(raw_file)]
    end_indices = [m.start() for m in end_regex.finditer(raw_file)]
    
    extracted_docs = [raw_file[a:b] for a,b in zip(start_indices, end_indices)]
    
    return extracted_docs
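
To see what the extraction does, here is a small self-contained run on a toy filing string. The toy text is invented for illustration; real filings are far larger but follow the same `<DOCUMENT>` ... `</DOCUMENT>` layout shown in the example document above.

```python
import re

def get_documents(raw_file):
    # Slice out everything between each <DOCUMENT> / </DOCUMENT> pair.
    starts = [m.end() for m in re.finditer(r'<DOCUMENT>', raw_file)]
    ends = [m.start() for m in re.finditer(r'</DOCUMENT>', raw_file)]
    return [raw_file[a:b] for a, b in zip(starts, ends)]

# Invented toy filing with two embedded documents.
toy_filing = (
    '<SEC-DOCUMENT>header text\n'
    '<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report body</DOCUMENT>\n'
    '<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>subsidiaries</DOCUMENT>\n'
)

docs = get_documents(toy_filing)
print(len(docs))
print(docs[0])
```

Note that the `<SEC-DOCUMENT>` header tag is not matched, because the pattern requires `<` to be immediately followed by `DOCUMENT>`.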

In [15]:
filing_docs = {}
for ticker, filings in raw_filings.items():
    filing_docs[ticker] = {}
    for file_date, raw_file in tqdm(filings.items(), desc='Getting Documents from {} Filings'.format(ticker), unit='filing'):
        filing_docs[ticker][file_date] = get_documents(raw_file)

Getting Documents from AEP Filings: 100%|██████████| 23/23 [00:10<00:00,  2.19filing/s]
Getting Documents from AMZN Filings: 100%|██████████| 20/20 [00:00<00:00, 30.69filing/s]
Getting Documents from AXP Filings: 100%|██████████| 24/24 [00:02<00:00, 11.78filing/s]
Getting Documents from BA Filings: 100%|██████████| 28/28 [00:01<00:00, 18.22filing/s]
Getting Documents from BK Filings: 100%|██████████| 13/13 [00:03<00:00,  3.62filing/s]
Getting Documents from BMY Filings: 100%|██████████| 26/26 [00:02<00:00, 10.37filing/s]
Getting Documents from CAT Filings: 100%|██████████| 19/19 [00:06<00:00,  2.95filing/s]
Getting Documents from CNP Filings: 100%|██████████| 18/18 [00:02<00:00,  7.84filing/s]
Getting Documents from CVX Filings: 100%|██████████| 24/24 [00:02<00:00,  9.61filing/s]
Getting Documents from DE Filings: 100%|██████████| 20/20 [00:04<00:00,  4.97filing/s]
Getting Documents from DIS Filings: 100%|██████████| 22/22 [00:01<00:00, 11.95filing/s]
Getting Documents from DTE Filings

In [16]:
print(
    '\n\n'.join(
        [
            'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
            for file_date, docs in filing_docs['AMZN'].items()
            for doc_i, doc in enumerate(docs)
        ][:3]
    )
)

Document 0 Filed on 2020-01-31:

<TYPE>10-K
<SEQUENCE>1
<FILENAME>amzn-20191231x10k.htm
<DESCRIPTION>10-K
<TEXT>
<XBRL>
<?xml version="1.0" encoding="UTF-8"?>
<!--XBRL Document Created with Wdesk from Workiva-->
<!--p:c57a17684e854b...

Document 1 Filed on 2020-01-31:

<TYPE>EX-4.6
<SEQUENCE>2
<FILENAME>amzn-20191231xex46.htm
<DESCRIPTION>EXHIBIT 4.6
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>...

Document 2 Filed on 2020-01-31:

<TYPE>EX-21.1
<SEQUENCE>3
<FILENAME>amzn-20191231xex211.htm
<DESCRIPTION>EXHIBIT 21.1
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<ht...


#### Filter Out 10-Ks

In [17]:
def get_doc_type(doc):
    regex = re.compile(r'<TYPE>[\S]+')
    matches = regex.finditer(doc)
    doc_type = next(matches).group(0)[6:].lower()
    
    return doc_type
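
`get_doc_type` raises `StopIteration` if a document has no `<TYPE>` tag. A slightly more defensive variant, sketched here under the hypothetical name `get_doc_type_safe`, returns `None` in that case so the caller can skip the document.

```python
import re

TYPE_PATTERN = re.compile(r'<TYPE>(\S+)')

def get_doc_type_safe(doc):
    """Return the lower-cased <TYPE> value, or None when the tag is absent."""
    match = TYPE_PATTERN.search(doc)
    return match.group(1).lower() if match else None

print(get_doc_type_safe('\n<TYPE>10-K\n<SEQUENCE>1\n'))  # 10-k
print(get_doc_type_safe('no type tag here'))             # None
```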

In [18]:
tenKs = {}

for ticker, filing_documents in filing_docs.items():
    tenKs[ticker] = []
    for file_date, docs in filing_documents.items():
        for doc in docs:
            if get_doc_type(doc) == '10-k':
                tenKs[ticker].append(
                    {
                        'cik': cik_lookup[ticker],
                        'file': doc,
                        'file_date': file_date
                    }
                )

In [19]:
from nb_utils import *

In [20]:
ten_k_data = tenKs['AMZN'][:5]
fields = ['cik','file','file_date']
field_length_limit=50

print_ten_k_data(ten_k_data, fields)

[
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2019123...
    file_date: '2020-01-31'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2018123...
    file_date: '2019-02-01'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2017123...
    file_date: '2018-02-02'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2016123...
    file_date: '2017-02-10'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2015123...
    file_date: '2016-01-29'},
]


[&#9650;back to top](#NLP-Financial-Statements)

---

## Part 2 - Pre-process Text

### Clean Text

In [21]:
def clean_text(text):
    text = text.lower()
    
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    return text

In [22]:
for ticker, tenK_ in tenKs.items():
    for tenK in tqdm(tenK_, desc='Cleaning {} 10-Ks'.format(ticker), unit='10-K'):
        tenK['file_clean'] = clean_text(tenK['file'])

Cleaning AEP 10-Ks: 100%|██████████| 25/25 [00:30<00:00,  1.22s/10-K]
Cleaning AMZN 10-Ks: 100%|██████████| 18/18 [00:36<00:00,  2.03s/10-K]
Cleaning AXP 10-Ks: 100%|██████████| 19/19 [00:41<00:00,  2.16s/10-K]
Cleaning BA 10-Ks: 100%|██████████| 21/21 [00:49<00:00,  2.34s/10-K]
Cleaning BK 10-Ks: 100%|██████████| 13/13 [00:04<00:00,  2.9210-K/s]
Cleaning BMY 10-Ks: 100%|██████████| 19/19 [01:16<00:00,  4.01s/10-K]
Cleaning CAT 10-Ks: 100%|██████████| 35/35 [01:09<00:00,  1.98s/10-K]
Cleaning CNP 10-Ks: 100%|██████████| 18/18 [01:11<00:00,  3.99s/10-K]
Cleaning CVX 10-Ks: 100%|██████████| 19/19 [02:23<00:00,  7.56s/10-K]
Cleaning DE 10-Ks: 100%|██████████| 19/19 [01:06<00:00,  3.51s/10-K]
Cleaning DIS 10-Ks: 100%|██████████| 19/19 [01:00<00:00,  3.19s/10-K]
Cleaning DTE 10-Ks: 100%|██████████| 19/19 [01:27<00:00,  4.59s/10-K]
Cleaning ED 10-Ks: 100%|██████████| 20/20 [01:30<00:00,  4.53s/10-K]
Cleaning EMR 10-Ks: 100%|██████████| 21/21 [00:10<00:00,  2.0810-K/s]
Cleaning ETN 10-Ks: 100

In [23]:
print_ten_k_data(tenKs['AMZN'][:5], ['file_clean'])

[
  {
    file_clean: '\n10-k\n1\namzn-20191231x10k.htm\n10-k\n\n\n\n\n\...},
  {
    file_clean: '\n10-k\n1\namzn-20181231x10k.htm\n10-k\n\n\n\n\n\...},
  {
    file_clean: '\n10-k\n1\namzn-20171231x10k.htm\n10-k\n\n\n\n\n\...},
  {
    file_clean: '\n10-k\n1\namzn-20161231x10k.htm\nform 10-k\n\n\n...},
  {
    file_clean: '\n10-k\n1\namzn-20151231x10k.htm\nform 10-k\n\n\n...},
]


[&#9650;back to top](#NLP-Financial-Statements)

---

### Lemmatize

In [29]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [30]:
def lemmatize_words(words):
    
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]
    
    return lemmatized_words
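
Before lemmatizing, the next cell tokenizes the cleaned text with the pattern `[a-z]{2,}`, which keeps only runs of two or more lowercase letters. A quick illustration on an invented sample string:

```python
import re

word_pattern = re.compile('[a-z]{2,}')

# Invented sample; digits, punctuation, and single letters are dropped.
sample = 'form 10-k: amazon.com, inc. (the "company")'
tokens = word_pattern.findall(sample)
print(tokens)
```

Because the pattern ignores everything except letters, `10-k` contributes no token at all (the lone `k` is too short), and `amazon.com` is split into two tokens.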

In [51]:
word_pattern = re.compile('[a-z]{2,}')

for ticker, ten_ks in tenKs.items():
    for ten_k in tqdm(ten_ks, desc='Lemmatize {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = lemmatize_words(word_pattern.findall(ten_k['file_clean']))

Lemmatize AEP 10-Ks: 100%|██████████| 25/25 [00:06<00:00,  3.7710-K/s]
Lemmatize AMZN 10-Ks: 100%|██████████| 18/18 [00:02<00:00,  6.1110-K/s]
Lemmatize AXP 10-Ks: 100%|██████████| 19/19 [00:04<00:00,  3.8110-K/s]
Lemmatize BA 10-Ks: 100%|██████████| 21/21 [00:04<00:00,  4.2310-K/s]
Lemmatize BK 10-Ks: 100%|██████████| 13/13 [00:01<00:00,  7.8110-K/s]
Lemmatize BMY 10-Ks: 100%|██████████| 19/19 [00:05<00:00,  3.3210-K/s]
Lemmatize CAT 10-Ks: 100%|██████████| 35/35 [00:17<00:00,  1.9710-K/s]
Lemmatize CNP 10-Ks: 100%|██████████| 18/18 [00:05<00:00,  3.0610-K/s]
Lemmatize CVX 10-Ks: 100%|██████████| 19/19 [00:04<00:00,  3.9710-K/s]
Lemmatize DE 10-Ks: 100%|██████████| 19/19 [00:03<00:00,  5.4010-K/s]
Lemmatize DIS 10-Ks: 100%|██████████| 19/19 [00:03<00:00,  4.8210-K/s]
Lemmatize DTE 10-Ks: 100%|██████████| 19/19 [00:05<00:00,  3.4410-K/s]
Lemmatize ED 10-Ks: 100%|██████████| 20/20 [00:05<00:00,  3.4110-K/s]
Lemmatize EMR 10-Ks: 100%|██████████| 21/21 [00:01<00:00, 15.4010-K/s]
Lemmatize

In [52]:
print_ten_k_data(tenKs['AMZN'][:15], ['file_lemma'])

[
  {
    file_lemma: '['amzn', 'htm', 'document', 'yp', 'yp', 'false', ...},
  {
    file_lemma: '['amzn', 'htm', 'document', 'table', 'of', 'conte...},
  {
    file_lemma: '['amzn', 'htm', 'document', 'unite', 'statessecur...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'document', 'unite', 'sta...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'unite', 'statessecuritie...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'amzn', 'unite', 'statess...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'amzn', 'table', 'of', 'c...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'of', 'content',...},
  {
    fi

[&#9650;back to top](#NLP-Financial-Statements)

---

### Remove Stopwords

In [53]:
from nltk.corpus import stopwords

In [54]:
# A set gives constant-time membership checks in the filter below.
lemma_english_stopwords = set(lemmatize_words(stopwords.words('english')))

for ticker, ten_ks in tenKs.items():
    for ten_k in tqdm(ten_ks, desc='Remove Stop Words for {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = [word for word in ten_k['file_lemma'] if word not in lemma_english_stopwords]

Remove Stop Words for AEP 10-Ks: 100%|██████████| 25/25 [00:04<00:00,  6.1910-K/s]
Remove Stop Words for AMZN 10-Ks: 100%|██████████| 18/18 [00:01<00:00, 15.9610-K/s]
Remove Stop Words for AXP 10-Ks: 100%|██████████| 19/19 [00:02<00:00,  9.4210-K/s]
Remove Stop Words for BA 10-Ks: 100%|██████████| 21/21 [00:01<00:00, 10.6110-K/s]
Remove Stop Words for BK 10-Ks: 100%|██████████| 13/13 [00:00<00:00, 28.8210-K/s]
Remove Stop Words for BMY 10-Ks: 100%|██████████| 19/19 [00:02<00:00,  8.0510-K/s]
Remove Stop Words for CAT 10-Ks: 100%|██████████| 35/35 [00:09<00:00,  3.7010-K/s]
Remove Stop Words for CNP 10-Ks: 100%|██████████| 18/18 [00:02<00:00,  7.6810-K/s]
Remove Stop Words for CVX 10-Ks: 100%|██████████| 19/19 [00:01<00:00, 10.0610-K/s]
Remove Stop Words for DE 10-Ks: 100%|██████████| 19/19 [00:01<00:00, 13.9510-K/s]
Remove Stop Words for DIS 10-Ks: 100%|██████████| 19/19 [00:01<00:00, 12.2110-K/s]
Remove Stop Words for DTE 10-Ks: 100%|██████████| 19/19 [00:02<00:00,  8.6010-K/s]
Remove

In [55]:
print_ten_k_data(tenKs['AMZN'][:15], ['file_lemma'])

[
  {
    file_lemma: '['amzn', 'htm', 'document', 'yp', 'yp', 'false', ...},
  {
    file_lemma: '['amzn', 'htm', 'document', 'table', 'content', '...},
  {
    file_lemma: '['amzn', 'htm', 'document', 'unite', 'statessecur...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'document', 'unite', 'sta...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'unite', 'statessecuritie...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'amzn', 'unite', 'statess...},
  {
    file_lemma: '['amzn', 'htm', 'form', 'amzn', 'table', 'content...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    file_lemma: '['htm', 'form', 'form', 'table', 'content', 'unit...},
  {
    fi

[&#9650;back to top](#NLP-Financial-Statements)

---
## Part 3 - Measure Sentiment

### Loughran McDonald Sentiment Word Lists
The [Loughran and McDonald](https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary) sentiment word lists cover the following sentiment categories:
- Negative 
- Positive
- Uncertainty
- Litigious
- Constraining
- Superfluous
- Modal

In [56]:
def load_sent_df():
    cols = ['Word', 'Negative', 'Positive', 'Uncertainty', 'Litigious', 'Constraining', 'Superfluous', 'Interesting']
    sent_df = pd.read_csv('LoughranMcDonald_MasterDictionary_2018.csv', usecols=cols)
    sent_df.columns = [c.lower() for c in cols]
    cols = [c.lower() for c in cols]
    sent_df[cols[1:]] = sent_df[cols[1:]].astype(bool)
    # Keep only words flagged in at least one sentiment column.
    sent_df = sent_df[sent_df[cols[1:]].any(axis=1)]
    sent_df.word = lemmatize_words(sent_df.word.str.lower())
    sent_df = sent_df.drop_duplicates('word')
    
    return sent_df

In [57]:
sent_df = load_sent_df()
sent_df.head()

| | word | negative | positive | uncertainty | litigious | constraining | superfluous | interesting |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | abandon | True | False | False | False | False | False | False |
| 12 | abandonment | True | False | False | False | False | False | False |
| 13 | abandonments | True | False | False | False | False | False | False |
| 51 | abdicate | True | False | False | False | False | False | False |
| 54 | abdication | True | False | False | False | False | False | False |


[&#9650;back to top](#NLP-Financial-Statements)

---

### Bag of Words

Using the sentiment word lists, generate sentiment bag of words from the 10-K documents.

In [58]:
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
def get_bag_of_words(sentiment_words, docs):
    rows = [Counter(doc.split()) for doc in docs]
    bag_of_words = pd.DataFrame(rows, columns=sentiment_words)\
        .fillna(0)\
        .astype(int)\
        .values
    return bag_of_words
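
The same idea can be shown without pandas: count every token in each document, then read off the counts for the sentiment vocabulary in a fixed column order. This is only an illustrative sketch; the helper name `count_vocabulary` and the toy words are invented.

```python
from collections import Counter

def count_vocabulary(vocabulary, docs):
    """Return one row of counts per document, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter returns 0 for words that never appear in the document.
        rows.append([counts[word] for word in vocabulary])
    return rows

vocab = ['loss', 'litigation', 'gain']
toy_docs = ['loss loss gain revenue', 'litigation risk loss']
bow = count_vocabulary(vocab, toy_docs)
print(bow)  # [[2, 0, 1], [1, 1, 0]]
```

Words outside the vocabulary (`revenue`, `risk`) are simply ignored, which is what restricting the DataFrame columns to `sentiment_words` achieves above.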

In [60]:
sent_bow_ten_ks = {}
sentiments = ['negative', 'positive', 'uncertainty', 'litigious', 'constraining', 'interesting']
for ticker, ten_ks in tenKs.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sent_bow_ten_ks[ticker] = {
        sentiment: get_bag_of_words(sent_df[sent_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments
    }

In [61]:
print_ten_k_data([sent_bow_ten_ks['AMZN']], sentiments)

[
  {
    negative: '[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 0 ....
    positive: '[[12  0  0 ...  0  0  0]\n [15  0  0 ...  0  0  0...
    uncertainty: '[[0 0 0 ... 1 1 2]\n [0 0 0 ... 1 1 2]\n [0 0 0 ....
    litigious: '[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 0 ....
    constraining: '[[0 0 0 ... 0 0 2]\n [0 0 0 ... 0 0 2]\n [0 0 0 ....
    interesting: '[[2 0 0 ... 0 0 0]\n [2 0 0 ... 0 0 0]\n [2 0 0 ....},
]


[&#9650;back to top](#NLP-Financial-Statements)

---

### Jaccard Similarity

In [62]:
from sklearn.metrics import jaccard_score

In [63]:
def get_j_sim(bow_mat):
    bow = bow_mat.astype(bool)
    jaccard_sims = []
    for idx in range(1, bow.shape[0]):
        u = bow[idx-1]
        v = bow[idx]
        j_sim = jaccard_score(u,v)
        jaccard_sims.append(j_sim)
    return jaccard_sims
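
For intuition, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. Casting the bag of words to `bool` above turns each row into a set-membership vector, so the score compares which sentiment words appear in consecutive filings, not how often. A plain-Python sketch on invented word sets follows; returning 0.0 for two empty sets is a convention chosen here, not something the notebook specifies.

```python
def jaccard(u, v):
    """Jaccard similarity of two collections treated as sets."""
    u, v = set(u), set(v)
    union = u | v
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(u & v) / len(union)

last_year = ['loss', 'risk', 'decline']
this_year = ['risk', 'decline', 'impair']
print(jaccard(last_year, this_year))  # 0.5
```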

In [64]:
file_dates = {
    ticker: [ten_k['file_date'] for ten_k in ten_ks]
    for ticker, ten_ks in tenKs.items()
}

jaccard_similarities = {
    ticker: {
        sentiment_name: get_j_sim(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sents.items()
    }
    for ticker, ten_k_sents in sent_bow_ten_ks.items()
}

In [65]:
from bokeh.plotting import figure, show, output_file, save
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook, export_png
from bokeh.palettes import Bokeh as palette

In [66]:
output_notebook()

In [67]:
import datetime as dt

In [68]:
def compare_similarities(tick, output_f=None):
    jaccard_sims_df = pd.DataFrame(np.array([jaccard_similarities[tick][sent] for sent in sentiments]).T, file_dates[tick][1:], sentiments)
    jaccard_sims_df.index = pd.to_datetime(jaccard_sims_df.index)
    jaccard_sims_df.index.name = 'jaccard_dates'
    x = pd.to_datetime(file_dates[tick][1:])
    add_x = dt.timedelta(days=800)
    minus_x = dt.timedelta(days=250)
    p = figure(
        plot_width=1200,
        plot_height=250,
        x_axis_type='datetime',
        x_range=(x[-1]-minus_x, x[0]+add_x),
        y_range=(.1,1.05),
        title = f'Jaccard Similarities of {tick} 10-Ks',
    )

    for n, sentiment in enumerate(sentiments):
        y = jaccard_similarities[tick][sentiment]
        p.line(x, y, color=palette[6][n], line_width=1, legend_label=sentiment)
    p.xaxis.ticker.desired_num_ticks = len(x)
    p.xaxis.axis_label = 'Date of Report'
    p.yaxis.axis_label = 'Jaccard Similarity'
    p.legend.label_text_font_size='8pt'
    if output_f:
        export_png(p, filename=output_f)
    else:
        show(p)
    

In [69]:
compare_similarities('AMZN', output_f='amzn_jaccard.png')

![AMZN](img/amzn_jaccard.png)

In [70]:
compare_similarities('HON', output_f='hon_jaccard.png')

![HON](img/hon_jaccard.png)

[&#9650;back to top](#NLP-Financial-Statements)

---
### TFIDF

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [73]:
def get_tfidf(sentiment_words, docs):
    vectorizer = TfidfVectorizer(vocabulary=sentiment_words)
    tfidf = vectorizer.fit_transform(docs).toarray()
    
    return tfidf
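
To make the weighting concrete, here is a plain-Python sketch of TF-IDF over a fixed vocabulary. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is my understanding of `TfidfVectorizer`'s default settings; tokenization is simplified to whitespace splitting, so results on real text may differ from scikit-learn's. The function name `tfidf_rows` and the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_rows(vocabulary, docs):
    """Return L2-normalized TF-IDF rows for a fixed vocabulary."""
    n_docs = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocabulary}
    # Smoothed IDF (assumed to match the scikit-learn default).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for c in counts:
        raw = [c[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm if norm else 0.0 for x in raw])
    return rows

vocab = ['loss', 'gain']
toy_docs = ['loss loss gain', 'loss']
rows = tfidf_rows(vocab, toy_docs)
print(rows)
```

In this toy case `loss` appears in every document, so its IDF is 1, while the rarer `gain` receives a higher weight per occurrence.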

In [74]:
sentiment_tfidf_ten_ks = {}

for ticker, ten_ks in tenKs.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sentiment_tfidf_ten_ks[ticker] = {
        sentiment: get_tfidf(sent_df[sent_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments
    }

In [75]:
print_ten_k_data([sentiment_tfidf_ten_ks['AMZN']], sentiments)

[
  {
    negative: '[[0.         0.         0.         ... 0.        ...
    positive: '[[0.19508185 0.         0.         ... 0.        ...
    uncertainty: '[[0.         0.         0.         ... 0.00569911...
    litigious: '[[0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0....
    constraining: '[[0.         0.         0.         ... 0.        ...
    interesting: '[[0.01889322 0.         0.         ... 0.        ...},
]


[&#9650;back to top](#NLP-Financial-Statements)

---
### Cosine Similarity

In [76]:
from sklearn.metrics.pairwise import cosine_similarity

In [77]:
def get_cosine_similarity(tfidf_matrix):
    cosine_similarities = []
    for idx in range(1, tfidf_matrix.shape[0]):
        u = [tfidf_matrix[idx-1]]
        v = [tfidf_matrix[idx]]
        # cosine_similarity returns a 1x1 array; take the single element explicitly.
        cos_sim = float(cosine_similarity(u, v)[0, 0])
        cosine_similarities.append(cos_sim)
    return cosine_similarities
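
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. A small plain-Python sketch, with zero vectors mapped to 0.0 as an assumed convention:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention when either vector is all zeros
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [0, 1]))  # 0.0
print(cosine([1, 2], [2, 4]))
```

Because TF-IDF rows are non-negative, the score for consecutive 10-Ks falls between 0 (no shared weighted terms) and 1 (identical term profile).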

In [78]:
cosine_similarities = {
    ticker: {
        sentiment_name: get_cosine_similarity(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sentiments.items()
    }\
    for ticker, ten_k_sentiments in sentiment_tfidf_ten_ks.items()
}

In [79]:
def compare_similarities(tick, sim_dict, sim_type, output_f=None):
    sims_df = pd.DataFrame(np.array([sim_dict[tick][sent] for sent in sentiments]).T, file_dates[tick][1:], sentiments)
    sims_df.index = pd.to_datetime(sims_df.index)
    x = pd.to_datetime(file_dates[tick][1:])
    add_x = dt.timedelta(days=800)
    minus_x = dt.timedelta(days=250)
    p = figure(
        plot_width=1200,
        plot_height=250,
        x_axis_type='datetime',
        x_range=(x[-1]-minus_x, x[0]+add_x),
        y_range=(.1,1.05),
        title = f'{sim_type} Similarities of {tick} 10-Ks',
    )

    for n, sentiment in enumerate(sentiments):
        # Plot the similarity series that was passed in, not the Jaccard values.
        y = sim_dict[tick][sentiment]
        p.line(x, y, color=palette[6][n], line_width=2, legend_label=sentiment)
    p.xaxis.ticker.desired_num_ticks = len(x)
    p.xaxis.axis_label = 'Date of Report'
    p.yaxis.axis_label = f'{sim_type} Similarity'
    p.legend.label_text_font_size='8pt'
    if output_f:
        export_png(p, filename=output_f)
    else:
        show(p)

In [80]:
compare_similarities('AMZN', cosine_similarities, 'Cosine', output_f='amzn_csine.png')

![AMZN](img/amzn_csine.png)

In [82]:
compare_similarities('HON', cosine_similarities, 'Cosine', output_f='hon_csine.png')

![HON](img/hon_csine.png)

[&#9650;back to top](#NLP-Financial-Statements)

---
## Part 4 - Evaluate Alpha Factors

### Price Data

Use the [Alpha Vantage API](https://www.alphavantage.co/) to acquire stock price data.
Then evaluate the cosine similarities against those prices to determine whether they are valid alpha factors.
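
Evaluating a factor against prices relies on forward returns: the return earned from one observation to the next. As a rough sketch with invented numbers, the simple forward return between consecutive January closes can be computed like this (the helper name `forward_returns` is hypothetical, and Alphalens performs its own, more general version of this calculation):

```python
def forward_returns(prices):
    """Return the simple return from each period to the next."""
    return [nxt / cur - 1 for cur, nxt in zip(prices, prices[1:])]

# Invented January closing prices for three consecutive years.
january_closes = [100.0, 110.0, 99.0]
returns = forward_returns(january_closes)
print(returns)
```

The last price has no following observation, so the result is one element shorter than the input.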

In [87]:
import os
import json

In [95]:
from ratelimit import limits, sleep_and_retry
class AlphaVantageAPI(object):
    @staticmethod
    @sleep_and_retry
    @limits(calls=5, period=60)
    def _call_api(url):
        response = requests.get(url)
        if response.status_code != 200:
            raise Exception('API response: {}'.format(response.status_code))
        return response

    def get(self, url):
        return self._call_api(url).json()

def make_requests(series, tickers):
    missed = []
    for ticker in tickers:
        try:
            prices = av_api.get(url.format(ticker,key))
            df = pd.DataFrame(prices['Monthly Adjusted Time Series']).T
            df.index = pd.to_datetime(df.index)
            series[ticker]=df[df.index.month==1]['5. adjusted close']
            print(f'{ticker} prices successfully loaded')
        except Exception as e:
            print(e, ticker)
            missed.append(ticker)
    return series, missed    

In [96]:
av_api = AlphaVantageAPI()
key = os.getenv('alpha_api')
url = 'https://www.alphavantage.co/query?function=TIME_SERIES_MONTHLY_ADJUSTED&symbol={}&apikey={}&outputsize=full'

In [97]:
tickers = list(cik_lookup.keys())
tries = 0
series = {}

while True:
    tries += 1
    if tries == 1:
        series, missed = make_requests(series, tickers)
    else:
        series, missed = make_requests(series, missed)
    if len(missed) == 0 or tries == 10:
        break
        
if len(missed) > 0:
    print(f'{missed} were missed')

AEP prices successfully loaded
AMZN prices successfully loaded
AXP prices successfully loaded
BA prices successfully loaded
BK prices successfully loaded
BMY prices successfully loaded
CAT prices successfully loaded
CNP prices successfully loaded
CVX prices successfully loaded
DE prices successfully loaded
DIS prices successfully loaded
DTE prices successfully loaded
ED prices successfully loaded
EMR prices successfully loaded
ETN prices successfully loaded
FL prices successfully loaded
FRT prices successfully loaded
GE prices successfully loaded
HON prices successfully loaded
IBM prices successfully loaded
IP prices successfully loaded
JNJ prices successfully loaded
KO prices successfully loaded
LLY prices successfully loaded
MCD prices successfully loaded
MO prices successfully loaded
MRK prices successfully loaded
MRO prices successfully loaded
PCG prices successfully loaded
PEP prices successfully loaded
PFE prices successfully loaded
PG prices successfully loaded
PNR prices succes

In [110]:
pricing = pd.DataFrame(series).astype(float)
pricing.index = pd.to_datetime(pricing.index.year, format='%Y')

In [112]:
cosine_similarities_df_dict = {
    'date': [],
    'ticker': [],
    'sentiment': [],
    'value': []
}

for ticker, ten_k_sentiments in cosine_similarities.items():
    for sentiment_name, sentiment_values in ten_k_sentiments.items():
        # idx lines each similarity up with its filing date (the first filing has no predecessor).
        for idx, sentiment_value in enumerate(sentiment_values):
            cosine_similarities_df_dict['ticker'].append(ticker)
            cosine_similarities_df_dict['sentiment'].append(sentiment_name)
            cosine_similarities_df_dict['value'].append(sentiment_value)
            cosine_similarities_df_dict['date'].append(file_dates[ticker][1:][idx])

In [113]:
cosine_similarities_df = pd.DataFrame(cosine_similarities_df_dict)
cosine_similarities_df['date'] = pd.DatetimeIndex(cosine_similarities_df['date']).year
cosine_similarities_df['date'] = pd.to_datetime(cosine_similarities_df['date'], format='%Y')

cosine_similarities_df.head()

| | date | ticker | sentiment | value |
| --- | --- | --- | --- | --- |
| 0 | 2019-01-01 | AEP | negative | 0.981974 |
| 1 | 2018-01-01 | AEP | negative | 0.964001 |
| 2 | 2017-01-01 | AEP | negative | 0.933406 |
| 3 | 2016-01-01 | AEP | negative | 0.976963 |
| 4 | 2015-01-01 | AEP | negative | 0.959769 |


[&#9650;back to top](#NLP-Financial-Statements)

---

### Alphalens Format

In [107]:
import alphalens as al

In [108]:
# get_clean_factor_and_forward_returns documentation:
al.utils.get_clean_factor_and_forward_returns?

Signature:
al.utils.get_clean_factor_and_forward_returns(
    factor,
    prices,
    groupby=None,
    binning_by_group=False,
    quantiles=5,
    bins=None,
    periods=(1, 5, 10),
    filter_zscore=20,
    groupby_labels=None,
    max_loss=0.35,

In [127]:
factor_data = {}
skipped_sentiments = []

for sentiment in sentiments:
    cs_df = cosine_similarities_df\
        .query(
            f'''sentiment=='{sentiment}' & value!=0'''
        )\
        .drop_duplicates(
            subset=['date', 'ticker'], keep='first'
        )
    cs_df = cs_df.pivot(index='date', columns='ticker', values='value')

    try:
        data = al.utils.get_clean_factor_and_forward_returns(
            cs_df.stack(),
            pricing,
            quantiles=5,
            bins=None,
            periods=[-1]
        )
        factor_data[sentiment] = data
    except Exception:  # record sentiments that alphalens could not process
        skipped_sentiments.append(sentiment)
        
if skipped_sentiments:
    print('\nSkipped the following sentiments:\n{}'.format('\n'.join(skipped_sentiments)))

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.1% entries from factor data: 0.0% in forward returns computation and 0.1% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.1% en

In [128]:
factor_data[sentiments[0]].head(20)

| date | asset | 1D | factor | factor_quantile |
|---|---|---|---|---|
| 2000-01-01 | DIS | -0.155786 | 0.682675 | 1 |
| 2000-01-01 | EMR | 0.414174 | 0.966952 | 4 |
| 2000-01-01 | PG | -0.274783 | 0.838083 | 2 |
| 2000-01-01 | SYY | 0.531144 | 0.972026 | 5 |
| 2001-01-01 | AXP | -0.231893 | 0.799564 | 3 |
| 2001-01-01 | BA | -0.290376 | 0.759543 | 2 |
| 2001-01-01 | BMY | -0.251902 | 0.612686 | 2 |
| 2001-01-01 | CVX | 0.036224 | 0.791476 | 3 |
| 2001-01-01 | DIS | -0.301666 | 0.895939 | 4 |
| 2001-01-01 | ED | 0.241159 | 0.970284 | 5 |
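
The `(date, asset)` index shown above comes from the pivot-then-stack step in the previous cell. A minimal sketch of that reshaping, assuming made-up tickers and similarity values (none of these numbers come from the filings):

```python
import pandas as pd

# Long-form rows, one per (date, ticker), with made-up similarity values.
long_df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2018-01-01', '2018-01-01', '2019-01-01', '2019-01-01']
    ),
    'ticker': ['AEP', 'BA', 'AEP', 'BA'],
    'value': [0.96, 0.91, 0.98, 0.93],
})

# Wide form: one row per date, one column per ticker.
wide_df = long_df.pivot(index='date', columns='ticker', values='value')

# Stacking yields a Series with a (date, ticker) MultiIndex,
# the layout passed as `factor` to alphalens.
factor = wide_df.stack()
print(factor)
```

Going through the wide form first makes duplicate `(date, ticker)` pairs fail loudly in `pivot`, which is presumably why the cell drops duplicates beforehand.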


**Alphalens' ```factor_rank_autocorrelation``` and ```mean_return_by_quantile``` functions require Unix timestamps**

In [129]:
unixt_factor_data = {
    factor: data.set_index(
        pd.MultiIndex.from_tuples(
            [
                (x.timestamp(), y)
                for x, y in data.index.values
            ],
            names=['date', 'asset']
        )
    )
    for factor, data in factor_data.items()
}
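
To see what `x.timestamp()` produces for one index entry, here is a small sketch on a single illustrative date. It relies on pandas treating a timezone-naive `Timestamp` as UTC when converting to seconds since the epoch:

```python
import pandas as pd

# A timezone-naive pandas Timestamp is interpreted as UTC by .timestamp().
ts = pd.Timestamp('2000-01-01')
seconds = ts.timestamp()
print(seconds)

# Round trip: seconds since the epoch back to a Timestamp.
restored = pd.to_datetime(seconds, unit='s')
print(restored)
```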

[&#9650;back to top](#NLP-Financial-Statements)

---

### Factor Returns

In [130]:
ls_factor_returns = pd.DataFrame()

for factor_name, data in factor_data.items():
    ls_factor_returns[factor_name] = al.performance.factor_returns(data).iloc[:,0]

factor_returns_df = (1 + ls_factor_returns).cumprod()
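
The `(1 + r).cumprod()` step compounds per-period returns into a cumulative growth curve that starts from 1. A minimal sketch with made-up returns:

```python
import pandas as pd

# Made-up per-period returns: +10%, -5%, +2%.
period_returns = pd.Series([0.10, -0.05, 0.02])

# Compound them: each entry is the value of 1 unit invested at the start.
growth = (1 + period_returns).cumprod()
print(growth)
```

The last entry equals `1.10 * 0.95 * 1.02`, which is why summing the returns instead would misstate the outcome.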

In [144]:
p = figure(
    x_axis_type='datetime',
    plot_width=1000,
    plot_height=200,
)
for n, factor_name in enumerate(factor_returns_df.columns):
    p.line(
        factor_returns_df.index,
        factor_returns_df[factor_name],
        color=palette[6][n],
        legend_label=factor_name
    )
p.xaxis.ticker.desired_num_ticks = len(factor_returns_df.index)
p.legend.label_text_font_size='8pt'
p.legend.location='top_left'
p.legend.spacing=1
export_png(p, filename='factor_returns.png')
show(p)

![](img/factor_returns.png)

[&#9650;back to top](#NLP-Financial-Statements)

---

### Basis Points per Day per Quantile

In [132]:
qr_factor_returns = pd.DataFrame()

for factor_name, data in unixt_factor_data.items():
    qr_factor_returns[factor_name] = al.performance.mean_return_by_quantile(data)[0].iloc[:, 0]

In [133]:
qr_factor_returns_df = (10000*qr_factor_returns)
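
One basis point is 0.01%, so multiplying a return by 10,000 expresses it in basis points. The sketch below averages made-up forward returns per quantile with a plain `groupby` and converts the result. The column name `forward_return` is hypothetical, and this is not alphalens's implementation: as far as we know, `mean_return_by_quantile` demeans returns by default, so its numbers will differ.

```python
import pandas as pd

# Made-up forward returns with their factor quantile labels.
toy = pd.DataFrame({
    'forward_return': [0.010, -0.020, 0.030, 0.000],
    'factor_quantile': [1, 1, 2, 2],
})

# Average return per quantile, then convert to basis points.
mean_by_quantile = toy.groupby('factor_quantile')['forward_return'].mean()
bps = 10000 * mean_by_quantile
print(bps)
```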

In [134]:
from bokeh.layouts import gridplot

In [146]:
vbars = {sentiment: None for sentiment in sentiments}
x = [str(q) for q in qr_factor_returns_df.index]
for n, sentiment in enumerate(sentiments):
    vbars[sentiment] = figure(x_range=x, title=sentiment)
    vbars[sentiment].vbar(
        x=x,
        top=qr_factor_returns_df[sentiment],
        color=palette[6][n],
        width=.75,
    )
    
grid = gridplot(
    [
        [
            vbars['negative'],vbars['positive'],vbars['uncertainty']
        ],
        [
            vbars['litigious'],vbars['constraining'],vbars['interesting']
        ]
    ],
    plot_width=300,
    plot_height=250
)

show(grid)
# export_png(grid, filename='quantiles.png')

![img](img/grid/bokeh_plot.png)

![img](img/grid/bokeh_plot1.png)

![](img/grid/bokeh_plot2.png)

![](img/grid/bokeh_plot3.png)

![](img/grid/bokeh_plot4.png)

![](img/grid/bokeh_plot5.png)

[&#9650;back to top](#NLP-Financial-Statements)

---

### Turnover Analysis

In [136]:
ls_FRA = pd.DataFrame()

for factor, data in unixt_factor_data.items():
    ls_FRA[factor] = al.performance.factor_rank_autocorrelation(data)

In [139]:
p = figure(
    x_axis_type='datetime',
    plot_width=1000,
    plot_height=200,
    title='Factor Rank Autocorrelation'
)
for n, factor_name in enumerate(ls_FRA.columns):
    p.line(
        ls_FRA.index,
        ls_FRA[factor_name],
        color=palette[6][n],
        legend_label=factor_name,
        line_width=2
    )
p.xaxis.ticker.desired_num_ticks = len(ls_FRA.index)
p.legend.label_text_font_size='8pt'
p.legend.location='top_left'
p.legend.spacing=1

show(p)

![](img/fra.png)
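
Factor rank autocorrelation measures how stable the cross-sectional ranking of assets is from one period to the next: values near 1 mean the ranking barely changes (low turnover), values near -1 mean it flips. A rough sketch of the idea in plain pandas, on made-up factor values (this is not alphalens's code):

```python
import pandas as pd

# Made-up factor values: rows are dates, columns are assets.
factor = pd.DataFrame(
    {'A': [0.1, 0.2, 0.9], 'B': [0.5, 0.6, 0.5], 'C': [0.9, 1.0, 0.1]},
    index=pd.to_datetime(['2001-01-01', '2002-01-01', '2003-01-01']),
)

# Rank assets within each date, then correlate each date's ranks
# with the previous date's ranks.
ranks = factor.rank(axis=1)
rank_autocorr = ranks.corrwith(ranks.shift(1), axis=1)
print(rank_autocorr)
```

The first date has no predecessor, so its value is `NaN`. The ranking is unchanged in 2002 and fully reversed in 2003.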

[&#9650;back to top](#NLP-Financial-Statements)

---

### Sharpe Ratio of The Alphas

Let's see what the Sharpe ratios of the factors are. Generally, a Sharpe ratio near 1.0 or higher is acceptable for a single alpha in this universe.

In [140]:
daily_annualization_factor = np.sqrt(252)

(daily_annualization_factor * ls_factor_returns.mean() / ls_factor_returns.std()).round(2)

negative       -1.48
positive       -3.12
uncertainty    -4.91
litigious       0.94
constraining    3.86
interesting     4.46
dtype: float64
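
The calculation above is mean return divided by its standard deviation, scaled by the square root of the number of periods per year. A small sketch with a hypothetical helper (`sharpe_ratio` and `periods_per_year` are names introduced here, and the returns are made up). Note that `np.sqrt(252)` presumes daily returns; if each factor return spans a year, as the yearly factor dates suggest, the matching scale would be `np.sqrt(1)`.

```python
import numpy as np
import pandas as pd


def sharpe_ratio(returns: pd.Series, periods_per_year: int) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    per_period = returns.mean() / returns.std()  # sample std (ddof=1)
    return float(np.sqrt(periods_per_year) * per_period)


# Made-up per-period factor returns.
toy_returns = pd.Series([0.01, -0.02, 0.03, 0.00])

daily_scaled = sharpe_ratio(toy_returns, periods_per_year=252)
yearly_scaled = sharpe_ratio(toy_returns, periods_per_year=1)
print(daily_scaled, yearly_scaled)
```

The two results differ only by the factor `np.sqrt(252)`, so the choice of `periods_per_year` changes the magnitude of the ratio but never its sign.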