# NLP in Finance: SEC Filings

In this notebook, you will analyze SEC filings such as 10-K and 8-K reports using the NLP skills that we practiced today.

Reference: https://github.com/Seungju182/AI-for-Trading/tree/master/Project_5_NLP_on_Financial_Statements

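We will use the `sec-edgar-downloader` package to fetch the filings. The outputs in this notebook were produced with version 4.3.0; newer releases may change the `Downloader` interface, so consider pinning the version if the cells below fail. For details, please check the link below.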

https://pypi.org/project/sec-edgar-downloader/

In [1]:
!pip install -U sec-edgar-downloader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sec-edgar-downloader
  Downloading sec_edgar_downloader-4.3.0-py3-none-any.whl (13 kB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting Faker
  Downloading Faker-18.4.0-py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 24.3 MB/s eta 0:00:00
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1270 sha256=dd8c420603f2560f30d4a14ae558cdfc3c6852bc867596963e882e880187e61b
  Stored in directory: /root/.cache/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: Faker, bs4, sec-edgar-downloader
Successfully installed Faker-18.4.0 bs4-0.0.1 sec-edgar-downloader-4

In [2]:
from sec_edgar_downloader import Downloader

In [3]:
doc_type = '8-K'
ticker = 'AAPL'
after = '2017-01-01'
before = '2017-03-25'

In [4]:
dl = Downloader("./data")

# Get all 8-K filings for Apple (ticker: AAPL)
#dl.get("8-K", "AAPL")

# Get all 8-K filings for Apple, including filing amends (8-K/A)
#dl.get("8-K", "AAPL", include_amends=True)

# Get all 8-K filings for Apple after January 1, 2017 and before March 25, 2017
# Note: after and before strings must be in the form "YYYY-MM-DD"
dl.get(doc_type, ticker, after=after, before=before)


4
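The note above says that `after` and `before` must be `YYYY-MM-DD` strings. As a minimal sketch, the dates could be checked before calling `dl.get`; `check_date_range` is a hypothetical helper introduced here for illustration and is not part of `sec-edgar-downloader`.

```python
from datetime import datetime


def check_date_range(after, before):
    """Hypothetical helper: parse two "YYYY-MM-DD" strings and check their order."""
    fmt = "%Y-%m-%d"
    start = datetime.strptime(after, fmt)  # raises ValueError if the string does not parse
    end = datetime.strptime(before, fmt)
    if start > end:
        raise ValueError(f"after ({after}) is later than before ({before})")
    return start, end


start, end = check_date_range("2017-01-01", "2017-03-25")
print((end - start).days)  # length of the download window in days
```

Failing early on a malformed date is cheaper than discovering it after a network request.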

## Import libraries

> Don't forget to download the NLTK `stopwords` and `wordnet` corpora.



In [5]:
import os

import time
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup

In [6]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [7]:
os.listdir(os.getcwd())

['.config', 'data', 'sample_data']

In [8]:
txt_files = []

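# Walk the working directory and collect every .txt file
# (the loop variable is named `root` to avoid shadowing the built-in `dir`)
for root, _, files in os.walk(os.getcwd()):
  for name in files:
    if name.endswith('.txt'):
      print(root)
      print(name)
      txt_files.append(os.path.join(root, name))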

/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-064019
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-036283
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001628280-17-000663
full-submission.txt
/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-069853
full-submission.txt


In [9]:
txt_files

['/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-064019/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-036283/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001628280-17-000663/full-submission.txt',
 '/content/data/sec-edgar-filings/AAPL/8-K/0001193125-17-069853/full-submission.txt']
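Every downloaded filing in the listing above is stored as `full-submission.txt`, so the search could also be written with `pathlib`. This is only a sketch: `find_submissions` is a hypothetical helper, and the accession number in the demo directory is a made-up placeholder.

```python
import tempfile
from pathlib import Path


def find_submissions(root):
    """Return the sorted paths of every full-submission.txt under `root`."""
    return sorted(str(p) for p in Path(root).rglob("full-submission.txt"))


# Demo on a throwaway directory that mimics the downloader's folder layout
with tempfile.TemporaryDirectory() as tmp:
    filing_dir = Path(tmp) / "sec-edgar-filings" / "AAPL" / "8-K" / "0000000000-00-000000"
    filing_dir.mkdir(parents=True)
    (filing_dir / "full-submission.txt").write_text("<SEC-DOCUMENT>")
    (filing_dir / "notes.md").write_text("ignored")
    found = find_submissions(tmp)
    print(len(found))
```

In the notebook, `find_submissions("./data")` would restrict the search to the download folder instead of the whole working directory.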

In [10]:
text_data = []

for path in txt_files:
  # The with block closes the file automatically, so no explicit close() is needed
  with open(path, 'r', encoding='utf-8') as f:
    text_data.append(f.read())

In [11]:
text_data[0]

'<SEC-DOCUMENT>0001193125-17-064019.txt : 20170301\n<SEC-HEADER>0001193125-17-064019.hdr.sgml : 20170301\n<ACCEPTANCE-DATETIME>20170301090056\nACCESSION NUMBER:\t\t0001193125-17-064019\nCONFORMED SUBMISSION TYPE:\t8-K\nPUBLIC DOCUMENT COUNT:\t\t2\nCONFORMED PERIOD OF REPORT:\t20170228\nITEM INFORMATION:\t\tSubmission of Matters to a Vote of Security Holders\nFILED AS OF DATE:\t\t20170301\nDATE AS OF CHANGE:\t\t20170301\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC\n\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tELECTRONIC COMPUTERS [3571]\n\t\tIRS NUMBER:\t\t\t\t942404110\n\t\tSTATE OF INCORPORATION:\t\t\tCA\n\t\tFISCAL YEAR END:\t\t\t0930\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t8-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-36743\n\t\tFILM NUMBER:\t\t17651574\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\tONE INFINITE LOOP\n\t\tCITY:\t\t\tCUPERTINO\n\t\tSTATE:\t\t\tCA\n\t\tZIP:\t\t\t95014\n\t\tBUSINESS PHONE:\t\t(408)

In [12]:
len(text_data[0])

49026

## Preprocess

In [13]:
def get_documents(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """
    
    regex_start = re.compile(r'<DOCUMENT>')
    regex_end = re.compile(r'</DOCUMENT>')
    
    start_idx = [x.end() for x in re.finditer(regex_start, text)]
    end_idx = [x.start() for x in re.finditer(regex_end, text)]
    
    result = []
    for start, end in zip(start_idx, end_idx):
        result.append(text[start:end])
    
    return result

In [14]:
get_documents(text_data[0])

['\n<TYPE>8-K\n<SEQUENCE>1\n<FILENAME>d342218d8k.htm\n<DESCRIPTION>FORM 8-K\n<TEXT>\n<HTML><HEAD>\n<TITLE>Form 8-K</TITLE>\n</HEAD>\n <BODY BGCOLOR="WHITE">\n\n <P STYLE="line-height:1.0pt;margin-top:0pt;margin-bottom:0pt;border-bottom:1px solid #000000">&nbsp;</P>\n<P STYLE="line-height:3.0pt;margin-top:0pt;margin-bottom:2pt;border-bottom:1px solid #000000">&nbsp;</P> <P STYLE="margin-top:12pt; margin-bottom:0pt; font-size:15pt; font-family:ARIAL" ALIGN="center"><B>UNITED STATES </B></P>\n<P STYLE="margin-top:0pt; margin-bottom:0pt; font-size:15pt; font-family:ARIAL" ALIGN="center"><B>SECURITIES AND EXCHANGE COMMISSION </B></P>\n<P STYLE="margin-top:0pt; margin-bottom:0pt; font-size:11pt; font-family:ARIAL" ALIGN="center"><B>Washington, D.C. 20549 </B></P> <P STYLE="font-size:12pt;margin-top:0pt;margin-bottom:0pt">&nbsp;</P><center>\n<P STYLE="line-height:6.0pt;margin-top:0pt;margin-bottom:2pt;border-bottom:1.00pt solid #000000;width:21%">&nbsp;</P></center> <P STYLE="margin-top:12pt;

In [15]:
len(get_documents(text_data[1]))

7
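`get_documents` pairs each `<DOCUMENT>` tag with the `</DOCUMENT>` tag at the same position in the list. Assuming the tags are well formed and never nested, the same extraction can be sketched with a single non-greedy pattern. `get_documents_alt` and the sample string are hypothetical and only illustrate the idea.

```python
import re

# Non-greedy match between the tags; DOTALL lets "." span newlines
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)


def get_documents_alt(text):
    """Return the text between each <DOCUMENT> ... </DOCUMENT> pair."""
    return DOCUMENT_RE.findall(text)


# Synthetic filing fragment, not real EDGAR data
sample = (
    "<SEC-DOCUMENT>header\n"
    "<DOCUMENT>\n<TYPE>8-K\n<TEXT>body one</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-99.1\n<TEXT>body two</DOCUMENT>\n"
)
docs = get_documents_alt(sample)
print(len(docs))
```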

In [16]:
def get_document_type(doc):
    """
    Return the document type lowercased

    Parameters
    ----------
    doc : str
        The document string

    Returns
    -------
    doc_type : str
        The document type lowercased
    """
    
    match = re.search(r'<TYPE>[^\n]+', doc)
    
    return match.group()[len('<TYPE>'):].lower()

In [17]:
len(text_data)

4

In [18]:
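# Something's off: this filing is far larger than the first one (49,026 characters).
# It contains 7 documents, so we need to filter by document type below.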
len(text_data[1])

675606
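To see where the extra characters come from, one option is to list the type and length of every document inside a filing. The sketch below re-implements the extraction inline so that it runs on its own; `summarize_documents` is a hypothetical helper and the sample string is synthetic.

```python
import re


def summarize_documents(text):
    """Return a (type, length) pair for each <DOCUMENT> block in a filing string."""
    summary = []
    for body in re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>", text, flags=re.DOTALL):
        match = re.search(r"<TYPE>([^\n]+)", body)
        doc_type = match.group(1).lower() if match else "unknown"
        summary.append((doc_type, len(body)))
    return summary


# Synthetic example: a short 8-K form followed by a much longer exhibit
sample = (
    "<DOCUMENT>\n<TYPE>8-K\n<TEXT>short</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-99.1\n<TEXT>" + "x" * 1000 + "</DOCUMENT>\n"
)
for doc_type, size in summarize_documents(sample):
    print(doc_type, size)
```

Running `summarize_documents(text_data[1])` in the notebook would show which of the attached documents account for most of the length.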

In [19]:
data = []

# We should only select documents with the type we want
for docs in text_data:
  for doc in get_documents(docs):
    if get_document_type(doc) == doc_type.lower():
      data.append(doc)


In [20]:
data

['\n<TYPE>8-K\n<SEQUENCE>1\n<FILENAME>d342218d8k.htm\n<DESCRIPTION>FORM 8-K\n<TEXT>\n<HTML><HEAD>\n<TITLE>Form 8-K</TITLE>\n</HEAD>\n <BODY BGCOLOR="WHITE">\n\n <P STYLE="line-height:1.0pt;margin-top:0pt;margin-bottom:0pt;border-bottom:1px solid #000000">&nbsp;</P>\n<P STYLE="line-height:3.0pt;margin-top:0pt;margin-bottom:2pt;border-bottom:1px solid #000000">&nbsp;</P> <P STYLE="margin-top:12pt; margin-bottom:0pt; font-size:15pt; font-family:ARIAL" ALIGN="center"><B>UNITED STATES </B></P>\n<P STYLE="margin-top:0pt; margin-bottom:0pt; font-size:15pt; font-family:ARIAL" ALIGN="center"><B>SECURITIES AND EXCHANGE COMMISSION </B></P>\n<P STYLE="margin-top:0pt; margin-bottom:0pt; font-size:11pt; font-family:ARIAL" ALIGN="center"><B>Washington, D.C. 20549 </B></P> <P STYLE="font-size:12pt;margin-top:0pt;margin-bottom:0pt">&nbsp;</P><center>\n<P STYLE="line-height:6.0pt;margin-top:0pt;margin-bottom:2pt;border-bottom:1.00pt solid #000000;width:21%">&nbsp;</P></center> <P STYLE="margin-top:12pt;

## Clean up

Now let's clean up the text: convert it to lowercase and strip the HTML tags.

In [21]:
def remove_html_tags(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    return text


def clean_text(text):
    text = text.lower()
    text = remove_html_tags(text)
    
    return text

In [22]:
cleaned_data = [clean_text(text) for text in data]

In [23]:
cleaned_data[0]

'\n8-k\n1\nd342218d8k.htm\nform 8-k\n\n\nform 8-k\n\n\n\xa0\n\xa0 united states \nsecurities and exchange commission \nwashington, d.c. 20549  \xa0\n\xa0 form 8-k  current report \npursuant to section\xa013 or 15(d) of the securities exchange act of 1934 \nfebruary 28, 2017  date of report (date of\nearliest event reported)  \xa0 \xa0\n\xa0 \n\n apple inc. \n(exact name of registrant as specified in its charter)  \xa0\n\n\n\n\n\n\n\n\ncalifornia\n\xa0\n001-36743\n\xa0\n94-2404110\n\n (state or other jurisdiction\nof incorporation)\n\xa0\n (commission file\nnumber)\n\xa0\n (irs. employer\nidentification no.)\n 1 infinite loop \ncupertino, california 95014  (address of principal\nexecutive offices) (zip code)  (408) 996-1010 \n(registrant’s telephone number, including area code) \nnot applicable  (former name or former address,\nif changed since last report.)  check the appropriate box below if the form 8-k filing is intended to\nsimultaneously satisfy the filing obligation of the regist
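The cleaned text above still contains many newlines and non-breaking spaces (`\xa0`). They do no harm to the word-level steps that follow, but if a tidier string is wanted, whitespace could be collapsed as in this sketch. `normalize_whitespace` is a hypothetical helper and is not used elsewhere in the notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (including non-breaking spaces) into single spaces."""
    # For str patterns, \s matches Unicode whitespace, which includes \xa0
    return re.sub(r"\s+", " ", text).strip()


sample = "\n8-k\n1\n\xa0\n\xa0 united states \nsecurities and exchange commission \n"
print(normalize_whitespace(sample))
# 8-k 1 united states securities and exchange commission
```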

## Lemmatize and remove stopwords

In [24]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


def lemmatize_words(words):
    """
    Lemmatize words 

    Parameters
    ----------
    words : list of str
        List of words

    Returns
    -------
    lemmatized_words : list of str
        List of lemmatized words
    """
    
    # Create the lemmatizer once and treat every word as a verb (pos='v')
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, pos='v') for w in words]

In [25]:
# Use a raw string so that \w is passed to the regex engine unchanged
word_pattern = re.compile(r'\w+')

cleaned_data = [lemmatize_words(word_pattern.findall(text)) for text in cleaned_data]

In [26]:
len(cleaned_data)

4

In [27]:
len(cleaned_data[0])

851
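The `\w+` pattern keeps only runs of letters, digits, and underscores, so hyphenated form names and possessives are split into separate tokens. The short sketch below shows this on a phrase taken from the cleaned text; the second pattern, which keeps hyphenated tokens together, is an alternative offered here as an assumption and is not what the notebook uses.

```python
import re

phrase = "form 8-k, registrant’s telephone number"

# Pattern used in the notebook: splits on every non-word character
word_pattern = re.compile(r"\w+")
tokens = word_pattern.findall(phrase)
print(tokens)
# ['form', '8', 'k', 'registrant', 's', 'telephone', 'number']

# Alternative (assumption): keep hyphen-joined pieces such as "8-k" together
hyphen_pattern = re.compile(r"\w+(?:-\w+)*")
hyphen_tokens = hyphen_pattern.findall(phrase)
print(hyphen_tokens)
# ['form', '8-k', 'registrant', 's', 'telephone', 'number']
```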

In [28]:
from nltk.corpus import stopwords

In [29]:
lemma_stopwords = lemmatize_words(stopwords.words('english'))

In [30]:
final_words = []

# Use `words` as the loop variable so the `data` list defined earlier is not overwritten
for words in cleaned_data:
  final = [word for word in words if word not in lemma_stopwords]
  final_words.append(final)

In [31]:
len(final_words)

4

In [32]:
len(final_words[0])

653
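Stopword removal is a membership test repeated for every token. As a minimal sketch, converting the stopword list to a `set` makes each test a constant-time lookup on average, which matters for longer filings. The stopword list and document below are toy values, not the NLTK list.

```python
def remove_stopwords(words, stopword_list):
    """Return `words` without any token that appears in `stopword_list`."""
    stopword_set = set(stopword_list)  # set lookup instead of scanning a list
    return [w for w in words if w not in stopword_set]


# Toy inputs for illustration only
toy_stopwords = ["the", "of", "be", "to"]
doc = ["the", "registrant", "be", "subject", "to", "section", "13", "of", "the", "act"]
filtered = remove_stopwords(doc, toy_stopwords)
print(filtered)
# ['registrant', 'subject', 'section', '13', 'act']
```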

We have now lemmatized the text and removed stopwords. From here on, you can experiment with this text data. The code below is a brief example of sentiment analysis.

In [33]:
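# The squeeze=True argument of read_csv is deprecated; squeeze the single column after reading instead
pos = pd.read_csv('https://raw.githubusercontent.com/RaghuveerRao/Sentiment-Analysis/master/LM_pos_words.txt').squeeze('columns').str.lower()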





In [34]:
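# The squeeze=True argument of read_csv is deprecated; squeeze the single column after reading instead.
# Note: read_csv treats the first line of the file as a header, so that word becomes the Series name instead of an entry.
neg = pd.read_csv('https://raw.githubusercontent.com/RaghuveerRao/Sentiment-Analysis/master/LM_neg_words.txt').squeeze('columns').str.lower()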





In [35]:
neg

0          abandoned
1         abandoning
2        abandonment
3       abandonments
4           abandons
            ...     
2344      wrongdoing
2345     wrongdoings
2346        wrongful
2347      wrongfully
2348         wrongly
Name: ABANDON, Length: 2349, dtype: object

In [36]:
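# Lemmatize the word lists the same way as the documents, and store them as sets for fast lookup
pos = set(lemmatize_words(list(pos.values)))
neg = set(lemmatize_words(list(neg.values)))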

In [37]:
for doc in final_words:
  
  num_pos = len([word for word in doc if word in pos])
  num_neg = len([word for word in doc if word in neg])

  if num_pos > num_neg:
    print("Positive")
  elif num_pos < num_neg:
    print("Negative")
  else:
    print("Neutral")

Positive
Negative
Negative
Positive
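The labels above only record which count is larger. A graded score can be more informative; one common choice, sketched here as an assumption rather than as part of the original exercise, is the difference between the counts divided by their sum. The word lists and document are toy values, and `sentiment_score` is a hypothetical helper.

```python
def sentiment_score(words, positive, negative):
    """Return (num_pos, num_neg, score), where score lies in [-1, 1]."""
    positive, negative = set(positive), set(negative)
    num_pos = sum(word in positive for word in words)
    num_neg = sum(word in negative for word in words)
    total = num_pos + num_neg
    # Documents with no sentiment words get a neutral score of 0.0
    score = (num_pos - num_neg) / total if total else 0.0
    return num_pos, num_neg, score


# Toy lexicons and document for illustration only
toy_pos = ["gain", "improve"]
toy_neg = ["loss", "decline", "impair"]
doc = ["revenue", "improve", "despite", "loss", "and", "decline"]
print(sentiment_score(doc, toy_pos, toy_neg))
```

A score near zero separates weakly leaning documents from strongly leaning ones, which the three-way label cannot do.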
