
# NLP in Finance: SEC Filings

In this notebook, you will analyze SEC filings such as 10-K and 8-K forms using the NLP skills we practiced today.

Reference: https://github.com/Seungju182/AI-for-Trading/tree/master/Project_5_NLP_on_Financial_Statements

We will use the `sec-edgar-downloader` package to fetch the filings. For details, see the link below.

https://pypi.org/project/sec-edgar-downloader/

In [1]:
!pip install -U sec-edgar-downloader



In [2]:
from sec_edgar_downloader import Downloader

In [3]:
doc_type = '8-K'
ticker = 'AAPL'
after = '2017-01-01'
before = '2017-03-25'

In [4]:
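# Save filings under ./data. This one-argument call matches sec-edgar-downloader 4.x;
# 5.x releases also expect a company name and an email address as the first two arguments.
dl = Downloader("./data")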

# Get all 8-K filings for Apple (ticker: AAPL)
#dl.get("8-K", "AAPL")

# Get all 8-K filings for Apple, including filing amends (8-K/A)
#dl.get("8-K", "AAPL", include_amends=True)

# Get all 8-K filings for Apple after January 1, 2017 and before March 25, 2017
# Note: after and before strings must be in the form "YYYY-MM-DD"
dl.get(doc_type, ticker, after=after, before=before)


4
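The `after` and `before` arguments must be `YYYY-MM-DD` strings. A minimal sketch of checking them before calling the downloader; `validate_date_range` is a hypothetical helper written for this illustration, not part of `sec-edgar-downloader`.

```python
from datetime import datetime


def validate_date_range(after, before):
    """Parse two YYYY-MM-DD strings and check that they form a valid range."""
    # strptime raises ValueError if a string does not match the format
    start = datetime.strptime(after, "%Y-%m-%d").date()
    end = datetime.strptime(before, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"'after' ({after}) is later than 'before' ({before})")
    return start, end


start, end = validate_date_range("2017-01-01", "2017-03-25")
print((end - start).days)  # length of the download window in days
```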

## Import libraries

> Don't forget to download stopwords and wordnet



In [5]:
import os

import time
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup

In [6]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
os.listdir(os.getcwd())

['.config', 'data', 'sample_data']

In [8]:
txt_files = []

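# Walk the working directory and collect every .txt file (avoids shadowing the built-in `dir`)
for dirpath, _, filenames in os.walk(os.getcwd()):
  for name in filenames:
    if name.endswith('.txt'):
      print(dirpath)
      print(name)
      txt_files.append(os.path.join(dirpath, name))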

/content/data/sec-edgar-filings/AAPL/8-K/0001628280-17-000663
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-069853
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-064019
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-036283
full-submission.txt
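The same search can also be written with `pathlib`. This is a sketch, assuming the folder layout shown in the output above; `find_submissions` is a hypothetical name, and the demo runs on a temporary directory with made-up accession numbers rather than on real downloads.

```python
import tempfile
from pathlib import Path


def find_submissions(root):
    """Return every .txt file below `root`, sorted for a stable order."""
    return sorted(Path(root).rglob("*.txt"))


# Demo on a throwaway directory that mimics the layout above
# (the accession-number folder names here are made up).
with tempfile.TemporaryDirectory() as tmp:
    for accession in ("0000000000-17-000001", "0000000000-17-000002"):
        folder = Path(tmp) / "sec-edgar-filings" / "AAPL" / "8-K" / accession
        folder.mkdir(parents=True)
        (folder / "full-submission.txt").write_text("dummy", encoding="utf-8")

    found = find_submissions(tmp)
    relative = [p.relative_to(tmp).as_posix() for p in found]

print(relative)
```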


In [9]:
txt_files

['/content/data/sec-edgar-filings/AAPL/8-K/0001628280-17-000663/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-069853/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-064019/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-036283/full-submission.txt']

In [10]:
text_data = []

for path in txt_files:
  # The with block closes the file automatically, so no explicit close() is needed
  with open(path, 'r', encoding='utf-8') as f:
    text_data.append(f.read())

In [11]:
text_data[0]

'<SEC-DOCUMENT>0001628280-17-000663.txt : 20170131\n<SEC-HEADER>0001628280-17-000663.hdr.sgml : 20170131\n<ACCEPTANCE-DATETIME>20170131163102\nACCESSION NUMBER:\t\t0001628280-17-000663\nCONFORMED SUBMISSION TYPE:\t8-K\nPUBLIC DOCUMENT COUNT:\t\t4\nCONFORMED PERIOD OF REPORT:\t20170131\nITEM INFORMATION:\t\tResults of Operations and Financial Condition\nITEM INFORMATION:\t\tFinancial Statements and Exhibits\nFILED AS OF DATE:\t\t20170131\nDATE AS OF CHANGE:\t\t20170131\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC\n\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tELECTRONIC COMPUTERS [3571]\n\t\tIRS NUMBER:\t\t\t\t942404110\n\t\tSTATE OF INCORPORATION:\t\t\tCA\n\t\tFISCAL YEAR END:\t\t\t0930\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t8-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-36743\n\t\tFILM NUMBER:\t\t17561519\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\tONE INFINITE LOOP\n\t\tCITY:\t\t\tCUPERTINO\n\t\tSTATE:\t\t\tCA\

In [14]:
len(text_data[0])

390102
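Each submission begins with a header made of `KEY:<tabs>value` lines, as the output above shows. A sketch of pulling those fields into a dictionary; `parse_header` and `HEADER_LINE` are hypothetical names, and the sample string is copied from the first lines of that output.

```python
import re

# First lines of the header shown above
sample = (
    "<ACCEPTANCE-DATETIME>20170131163102\n"
    "ACCESSION NUMBER:\t\t0001628280-17-000663\n"
    "CONFORMED SUBMISSION TYPE:\t8-K\n"
    "PUBLIC DOCUMENT COUNT:\t\t4\n"
    "FILED AS OF DATE:\t\t20170131\n"
)

# An upper-case key, a colon, one or more tabs, then a non-empty value
HEADER_LINE = re.compile(r"^\s*([A-Z][A-Z0-9 ]*?):\t+(\S.*)$", re.MULTILINE)


def parse_header(text):
    """Map header keys to values; for repeated keys the last value wins."""
    return {key: value.strip() for key, value in HEADER_LINE.findall(text)}


header = parse_header(sample)
print(header["CONFORMED SUBMISSION TYPE"], header["PUBLIC DOCUMENT COUNT"])
```

Keys that repeat in a real header, such as `ITEM INFORMATION`, would need a list-valued mapping instead of a plain dictionary.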

## Preprocess

In [15]:
def get_documents(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """
    
    regex_start = re.compile(r'<DOCUMENT>')
    regex_end = re.compile(r'</DOCUMENT>')
    
    start_idx = [x.end() for x in re.finditer(regex_start, text)]
    end_idx = [x.start() for x in re.finditer(regex_end, text)]
    
    result = []
    for start, end in zip(start_idx, end_idx):
        result.append(text[start:end])
    
    return result

In [16]:
get_documents(text_data[0])

['\n<TYPE>8-K\n<SEQUENCE>1\n<FILENAME>a8-kq1201712312016.htm\n<DESCRIPTION>8-K\n<TEXT>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2017 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style="font-family:Times New Roman;font-size:10pt;">\n<div><a name="sBEE0111826B857478DB50DD19DFE1989"></a></div><div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div></div><div><br></div><div style="line-height:120%;padding-top:16px;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></t

In [17]:
len(get_documents(text_data[1]))

6
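For comparison, the extraction can also be expressed as a single non-greedy pattern. This is a sketch that should match `get_documents` only when every `<DOCUMENT>` tag has a matching `</DOCUMENT>`; `get_documents_regex` is a hypothetical name and the sample text is made up.

```python
import re

# DOTALL lets `.` span newlines; `.*?` stops at the first closing tag
DOCUMENT_BLOCK = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)


def get_documents_regex(text):
    """Return the text between each <DOCUMENT> ... </DOCUMENT> pair."""
    return DOCUMENT_BLOCK.findall(text)


sample = (
    "<SEC-HEADER>header</SEC-HEADER>\n"
    "<DOCUMENT>\n<TYPE>8-K\n<TEXT>body one</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-99.1\n<TEXT>body two</TEXT>\n</DOCUMENT>\n"
)
docs = get_documents_regex(sample)
print(len(docs))
```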

In [18]:
def get_document_type(doc):
    """
    Return the document type lowercased

    Parameters
    ----------
    doc : str
        The document string

    Returns
    -------
    doc_type : str
        The document type lowercased
    """
    
    match = re.search(r'<TYPE>[^\n]+', doc)
    
    return match.group()[len('<TYPE>'):].lower()

In [23]:
len(text_data)

4

In [25]:
# Something's wrong!
len(text_data[1])

234543

In [31]:
data = []

# Keep only the documents whose type matches the form we requested
for submission in text_data:
  for doc in get_documents(submission):
    found_type = get_document_type(doc)
    print(found_type)
    if found_type == doc_type.lower():
      data.append(doc)


8-k
ex-99.1
ex-99.2
graphic
8-k
ex-1.1
ex-4.1
ex-5.1
graphic
graphic
8-k
graphic
8-k
ex-1.1
ex-4.1
ex-5.1
ex-12.1
graphic
graphic
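The listing above gives the type of every document in the four submissions. A short sketch of tallying those types with `collections.Counter`; the list literal is copied from that output.

```python
from collections import Counter

# Document types copied from the output above, one row per submission
doc_types = [
    "8-k", "ex-99.1", "ex-99.2", "graphic",
    "8-k", "ex-1.1", "ex-4.1", "ex-5.1", "graphic", "graphic",
    "8-k", "graphic",
    "8-k", "ex-1.1", "ex-4.1", "ex-5.1", "ex-12.1", "graphic", "graphic",
]
type_counts = Counter(doc_types)
print(type_counts.most_common(2))
```

Only the four `8-k` documents pass the type filter; the exhibits and graphics are discarded.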


In [32]:
data

['\n<TYPE>8-K\n<SEQUENCE>1\n<FILENAME>a8-kq1201712312016.htm\n<DESCRIPTION>8-K\n<TEXT>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2017 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style="font-family:Times New Roman;font-size:10pt;">\n<div><a name="sBEE0111826B857478DB50DD19DFE1989"></a></div><div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div></div><div><br></div><div style="line-height:120%;padding-top:16px;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></t

## Clean up

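Now let's clean up the text: lowercase it and strip the HTML markup with BeautifulSoup. Note that `get_text()` joins adjacent text nodes with no separator by default, which is why some words run together in the output below (for example `statessecurities`); passing `separator=' '` would keep them apart.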

In [33]:
def remove_html_tags(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    return text


def clean_text(text):
    text = text.lower()
    text = remove_html_tags(text)
    
    return text

In [34]:
cleaned_data = [clean_text(text) for text in data]

In [35]:
cleaned_data[0]

'\n8-k\n1\na8-kq1201712312016.htm\n8-k\n\n\n\n\n\n\ndocument\n\n\n\xa0united statessecurities and exchange commissionwashington, d.c. 20549\xa0form 8-kcurrent reportpursuant to section 13 or 15(d) of the securities exchange act of 1934january\xa031, 2017 date of report (date of earliest event reported)\xa0\xa0\xa0apple inc.(exact name of registrant as specified in its charter)\xa0california\xa0001-36743\xa094-2404110(state or other jurisdictionof incorporation)\xa0(commissionfile number)\xa0(irs. employeridentification no.)1 infinite loopcupertino, california 95014(address of principal executive offices) (zip code)(408) 996-1010(registrant’s telephone number, including area code)not applicable(former name or former address, if changed since last report.)check the appropriate box below if the form 8-k filing is intended to simultaneously satisfy the filing obligation of the registrant under any of the following provisions:\xa0¨written communications pursuant to rule 425 under the securi
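The cleaned text still contains non-breaking spaces (`\xa0`) and runs of newlines. A minimal sketch of normalizing them; `normalize_whitespace` is a hypothetical helper and the sample is a shortened, hand-edited fragment in the style of the output above.

```python
import re


def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to one space."""
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


sample = "\n8-k\n1\n\n\ndocument\n\n\n\xa0united states\xa0form 8-k"
print(normalize_whitespace(sample))
```

This step is optional for the `\w+` tokenization used later, which already skips whitespace, but it makes the intermediate text easier to read.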

## Lemmatize and remove stopwords

In [36]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


def lemmatize_words(words):
    """
    Lemmatize words 

    Parameters
    ----------
    words : list of str
        List of words

    Returns
    -------
    lemmatized_words : list of str
        List of lemmatized words
    """
    
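    # Create the lemmatizer once instead of once per word;
    # pos='v' lemmatizes every word as a verb
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, pos='v') for w in words]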

In [37]:
# Raw string so that \w is not treated as an (invalid) string escape
word_pattern = re.compile(r'\w+')

cleaned_data = [lemmatize_words(word_pattern.findall(text)) for text in cleaned_data]

In [38]:
len(cleaned_data)

4

In [39]:
len(cleaned_data[0])

439
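The `\w+` pattern splits on every non-word character, so hyphenated form names and possessives are broken apart. A sketch illustrating this on a made-up sentence; `hyphen_pattern` is a hypothetical alternative that keeps internal hyphens, not something the notebook uses.

```python
import re

word_pattern = re.compile(r"\w+")
# Hypothetical alternative: allow hyphens between runs of word characters
hyphen_pattern = re.compile(r"\w+(?:-\w+)*")

sample = "form 8-k current report, registrant's telephone number"
print(word_pattern.findall(sample))
print(hyphen_pattern.findall(sample))
```

Whether `8-k` should stay as one token depends on the analysis; for word-list matching as done later in this notebook, the simpler pattern is usually sufficient.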

In [40]:
from nltk.corpus import stopwords

In [41]:
lemma_stopwords = lemmatize_words(stopwords.words('english'))

In [42]:
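final_words = []

# A set makes each membership test O(1) on average
stopword_set = set(lemma_stopwords)

# Use `tokens` as the loop variable so the earlier `data` list is not overwritten
for tokens in cleaned_data:
  final = [word for word in tokens if word not in stopword_set]
  final_words.append(final)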

In [44]:
len(final_words)

4

In [45]:
len(final_words[0])

318
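With stop words removed, a quick sanity check is to look at the most frequent remaining tokens. A sketch using `collections.Counter`; the token list is a hypothetical stand-in for one entry of `final_words`.

```python
from collections import Counter

# Hypothetical tokens standing in for one entry of `final_words`
tokens = ["apple", "report", "quarter", "revenue", "apple", "quarter", "apple"]

top_words = Counter(tokens).most_common(2)
print(top_words)
```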

We have now lemmatized the tokens and removed stop words, so the text is ready for further analysis. The code below is a brief example of sentiment analysis using lists of positive and negative words.

In [46]:
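# `squeeze=True` was removed from read_csv in pandas 2.0; .squeeze("columns") gives the same Series
pos = pd.read_csv('https://raw.githubusercontent.com/RaghuveerRao/Sentiment-Analysis/master/LM_pos_words.txt').squeeze("columns").str.lower()

In [47]:
# read_csv treats the first line as a header, so the first word (ABANDON here)
# becomes the Series name instead of an entry; pass header=None to keep it
neg = pd.read_csv('https://raw.githubusercontent.com/RaghuveerRao/Sentiment-Analysis/master/LM_neg_words.txt').squeeze("columns").str.lower()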

In [48]:
neg

0          abandoned
1         abandoning
2        abandonment
3       abandonments
4           abandons
            ...     
2344      wrongdoing
2345     wrongdoings
2346        wrongful
2347      wrongfully
2348         wrongly
Name: ABANDON, Length: 2349, dtype: object

In [49]:
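# Lemmatize the word lists the same way as the documents; sets make the lookups below fast
pos = set(lemmatize_words(list(pos.values)))
neg = set(lemmatize_words(list(neg.values)))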

In [50]:
for doc in final_words:
  
  num_pos = len([word for word in doc if word in pos])
  num_neg = len([word for word in doc if word in neg])

  if num_pos > num_neg:
    print("Positive")
  elif num_pos < num_neg:
    print("Negative")
  else:
    print("Neutral")

Negative
Positive
Positive
Negative
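The labels above depend only on which count is larger. A sketch of a normalized score, (positive - negative) / (positive + negative), which also conveys how lopsided the counts are; `polarity` is a hypothetical helper, and the toy word sets are assumptions for the demo, not the word lists loaded above.

```python
def polarity(tokens, positive, negative):
    """Return (pos - neg) / (pos + neg), or 0.0 when no lexicon word occurs."""
    num_pos = sum(token in positive for token in tokens)
    num_neg = sum(token in negative for token in tokens)
    total = num_pos + num_neg
    return (num_pos - num_neg) / total if total else 0.0


# Toy lexicons and tokens (assumptions for this demo, not the real word lists)
positive = {"gain", "improve", "strong"}
negative = {"loss", "decline", "impairment"}
tokens = ["revenue", "gain", "strong", "decline", "quarter"]

score = polarity(tokens, positive, negative)
print(round(score, 3))
```

The score ranges from -1 (only negative words) to 1 (only positive words), so documents of different lengths can be compared on the same scale.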
