---
title: "Module 1 - Lab 1"
subtitle: "From Raw Text to NLP Pipelines (SEC 10-K)"
author: "Nakul R. Padalkar"
number-sections: true
date: "2024-11-21"
date-modified: today
date-format: long
format: 
    html:
        code-overflow: wrap
categories: ['1', 'M01:', 'Lab']
description: "Hands-on lab activity: Interacting with Textual Data in Jupyter and Colab."
---

## Lab Objective {.unnumbered}

In this lab, you will:

- Connect **Google Colab** to **VS Code**
- Load real-world corporate text data (SEC 10-K filings)
- Implement a **classical NLP preprocessing pipeline**
- Answer **exploratory questions** about corporate disclosures using text analytics

This lab establishes the **computational and conceptual foundation** for later work with embeddings and generative models.


## Background Context {.unnumbered}

Public companies file **Form 10-K** annually with the U.S. Securities and Exchange Commission (SEC).  
These filings contain rich textual information about:

- business operations  
- risks and uncertainties  
- management discussion  
- regulatory disclosures  

In this lab, we treat each 10-K as **raw text data** and apply a standard NLP pipeline to prepare it for analysis.

## Dataset Overview  {.unnumbered}

- All data for this lab is located in: [SEC-10K-2024/](https://drive.google.com/drive/folders/1q7BfsNHCewG1zNfnqyCcBj9p_RUt-zW6?usp=drive_link)
- You do not need to duplicate the files. Instead, add the folder to your own Google Drive as a shortcut.
- Right-click the folder, then click "Add shortcut to Drive". The folder will then be accessible from your own Drive.
- This folder contains **plain-text 10-K filings** for multiple publicly traded firms.
- Each file represents **one company’s annual report**.

![](./M01_lecture02_figures/gdrive-add-folder.png){width="80%" fig-align="center"}

## Research Framing (Important) {.unnumbered}

You are **not** training a model yet. Instead, think of this lab as **asking structured questions of text**, such as:

- What terms dominate risk disclosures?
- How consistent is language across companies?
- Which words survive aggressive cleaning?
- How does preprocessing change the text representation?

Your answers will be supported by **intermediate outputs**, not final predictions.

## NLP Processing Pipeline {.unnumbered}

You will implement the following pipeline **step by step**:

1. Raw text  
2. Sentence segmentation  
3. Tokenization  
4. Part-of-Speech (POS) tagging  
5. Stop-word removal  
6. Stemming / Lemmatization  
7. Dependency parsing  
8. String metrics & matching  

Each stage produces **artifacts** that help you answer analytical questions.

## Load and Inspect the Data {.unnumbered}

In [24]:
# import os

# from google.colab import drive

# drive.mount("/content/drive")

# DATA_DIR = "/content/drive/MyDrive/Research/SEC-10K-2024"

# assert os.path.exists(DATA_DIR), (
#     "Google Drive is not mounted or the dataset path is incorrect. "
#     "Did you run drive.mount()?"
# )

# print("Drive mounted successfully. Data directory found.")


In [25]:
from pathlib import Path

SEC_DIR = Path("../../data/SEC-10K-2024")
# DATA_ROOT = Path("/content/drive/MyDrive/Research")
# SEC_DIR = DATA_ROOT / "SEC-10K-2024"

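assert SEC_DIR.exists(), f"SEC data folder not found at {SEC_DIR}. Check the path (or the Drive mount on Colab)."

# Sort the paths so that sec_files[0] refers to the same file on every run
sec_files = sorted(SEC_DIR.glob("*.txt"))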
print(f"Found {len(sec_files)} SEC filings")

Found 7754 SEC filings


In [26]:

# Read a sample document
sample_text = sec_files[0].read_text(encoding="utf-8")
print(sample_text[:1500])

<Header>
<FileStats>
    <FileName>20240102_10-K_edgar_data_1288750_0001654954-24-000069.txt</FileName>
    <GrossFileSize>5409869</GrossFileSize>
    <NetFileSize>331515</NetFileSize>
    <NonText_DocumentType_Chars>3657261</NonText_DocumentType_Chars>
    <HTML_Chars>1346602</HTML_Chars>
    <XBRL_Chars>0</XBRL_Chars>
    <XML_Chars>0</XML_Chars>
    <N_Exhibits>4</N_Exhibits>
</FileStats>
<SEC-Header>
0001654954-24-000069.hdr.sgml : 20240102
<ACCEPTANCE-DATETIME>20240102153402
ACCESSION NUMBER:		0001654954-24-000069
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		32
CONFORMED PERIOD OF REPORT:	20230930
FILED AS OF DATE:		20240102
DATE AS OF CHANGE:		20240102

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Timberline Resources Corp
		CENTRAL INDEX KEY:			0001288750
		STANDARD INDUSTRIAL CLASSIFICATION:	GOLD & SILVER ORES [1040]
		ORGANIZATION NAME:           	01 Energy & Transportation
		IRS NUMBER:				820291227
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			0930
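
The printed excerpt shows that each file opens with a metadata header made of `LABEL: value` lines. As a minimal sketch (not part of the original lab), those lines could be collected into a dictionary before any NLP step runs. The regular expression and the helper name `parse_header_fields` are assumptions for illustration; they rely on labels being upper-case and sitting on the same line as their value, as in the excerpt.

```python
import re

# Assumption: header labels are upper-case, end with a colon, and are
# followed by their value on the same line (as in the excerpt above).
FIELD_RE = re.compile(
    r"^[ \t]*([A-Z][A-Z &]*[A-Z]):[ \t]+(\S.*?)[ \t]*$",
    re.MULTILINE,
)


def parse_header_fields(text: str) -> dict[str, str]:
    """Return {label: value} for 'LABEL: value' lines; the first occurrence of a label wins."""
    fields: dict[str, str] = {}
    for key, value in FIELD_RE.findall(text):
        fields.setdefault(key, value)
    return fields


# A few lines copied from the header printed above.
header_excerpt = (
    "CONFORMED SUBMISSION TYPE:\t10-K\n"
    "FILED AS OF DATE:\t\t20240102\n"
    "\tCOMPANY DATA:\t\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tTimberline Resources Corp\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001288750\n"
)
fields = parse_header_fields(header_excerpt)
print(fields["COMPANY CONFORMED NAME"])
```

Section labels with no value on the same line, such as `COMPANY DATA:`, are skipped. In the notebook, the same helper could be called on `sample_text`.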

	

## What sections of the 10-K appear most frequently in the opening text?
Answering this question helps you understand how each document is structured and locate key areas for analysis (e.g., risk factors, management discussion). The first pipeline stage is sentence segmentation.

In [27]:
import nltk

# Download only the resources this lab uses.
# Recent NLTK releases may also require "punkt_tab" and "averaged_perceptron_tagger_eng".
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")  # Open Multilingual Wordnet
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

True

In [28]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(sample_text)
print(f"Number of sentences: {len(sentences)}")
sentences[:5]

Number of sentences: 1738


["<Header>\n<FileStats>\n    <FileName>20240102_10-K_edgar_data_1288750_0001654954-24-000069.txt</FileName>\n    <GrossFileSize>5409869</GrossFileSize>\n    <NetFileSize>331515</NetFileSize>\n    <NonText_DocumentType_Chars>3657261</NonText_DocumentType_Chars>\n    <HTML_Chars>1346602</HTML_Chars>\n    <XBRL_Chars>0</XBRL_Chars>\n    <XML_Chars>0</XML_Chars>\n    <N_Exhibits>4</N_Exhibits>\n</FileStats>\n<SEC-Header>\n0001654954-24-000069.hdr.sgml : 20240102\n<ACCEPTANCE-DATETIME>20240102153402\nACCESSION NUMBER:\t\t0001654954-24-000069\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t32\nCONFORMED PERIOD OF REPORT:\t20230930\nFILED AS OF DATE:\t\t20240102\nDATE AS OF CHANGE:\t\t20240102\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tTimberline Resources Corp\n\t\tCENTRAL INDEX KEY:\t\t\t0001288750\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tGOLD & SILVER ORES [1040]\n\t\tORGANIZATION NAME:           \t01 Energy & Transportation\n\t\tIRS NUMBER:\t\t\t\t82029

## Are sentences in 10-Ks longer or shorter than typical news or social media text?
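
One way to approach this question is to summarize sentence lengths numerically. The sketch below is an illustration rather than part of the original lab: it measures length in whitespace-separated words, and the helper name `sentence_length_summary` and the demo sentences are hypothetical. Because the first "sentence" returned above is the entire header block, the median is likely a more robust summary than the mean.

```python
from statistics import mean, median


def sentence_length_summary(sentences: list[str]) -> dict[str, float]:
    """Summarize sentence lengths, measured in whitespace-separated words."""
    lengths = [len(s.split()) for s in sentences if s.strip()]
    if not lengths:
        return {"count": 0, "mean": 0.0, "median": 0.0, "max": 0}
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Hypothetical sentences for illustration only (not taken from the filing).
demo_sentences = [
    "We may not be able to raise additional capital.",
    "Prices fell.",
    "Our operations are subject to extensive environmental regulation.",
]
summary = sentence_length_summary(demo_sentences)
print(summary)
```

In the notebook, calling `sentence_length_summary(sentences)` on the list produced by `sent_tokenize` would give figures to compare against a news or social media sample.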

In [29]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(sample_text)
tokens[:30]

['<',
 'Header',
 '>',
 '<',
 'FileStats',
 '>',
 '<',
 'FileName',
 '>',
 '20240102_10-K_edgar_data_1288750_0001654954-24-000069.txt',
 '<',
 '/FileName',
 '>',
 '<',
 'GrossFileSize',
 '>',
 '5409869',
 '<',
 '/GrossFileSize',
 '>',
 '<',
 'NetFileSize',
 '>',
 '331515',
 '<',
 '/NetFileSize',
 '>',
 '<',
 'NonText_DocumentType_Chars',
 '>']

## What kinds of tokens appear that are not “words” (e.g., symbols, numbers, legal references)?
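
A rough way to explore this question is to sort tokens into coarse classes and count them. The following is a sketch under simple assumptions: the four class names and the helper `token_kind` are hypothetical, and the rules use only Python's built-in string methods.

```python
from collections import Counter


def token_kind(token: str) -> str:
    """Assign a token to one of four coarse classes."""
    if token.isalpha():
        return "word"      # letters only
    if token.isdigit():
        return "number"    # digits only
    if not any(ch.isalnum() for ch in token):
        return "symbol"    # punctuation or markup characters only
    return "mixed"         # e.g. identifiers, file names, tag fragments


# Tokens copied from the tokenizer output above.
sample_tokens = [
    "<",
    "Header",
    ">",
    "5409869",
    "/FileName",
    "20240102_10-K_edgar_data_1288750_0001654954-24-000069.txt",
]
kinds = Counter(token_kind(t) for t in sample_tokens)
print(kinds)
```

Running `Counter(token_kind(t) for t in tokens)` on the full token list would show how much of the document is markup and numbers rather than words.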

In [30]:
from nltk import pos_tag

pos_tags = pos_tag(tokens[:50])
pos_tags

[('<', 'JJ'),
 ('Header', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
 ('FileStats', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
 ('FileName', 'NNP'),
 ('>', 'NNP'),
 ('20240102_10-K_edgar_data_1288750_0001654954-24-000069.txt', 'JJ'),
 ('<', 'NNP'),
 ('/FileName', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
 ('GrossFileSize', 'NNP'),
 ('>', 'NNP'),
 ('5409869', 'CD'),
 ('<', 'NNP'),
 ('/GrossFileSize', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
 ('NetFileSize', 'NNP'),
 ('>', 'NNP'),
 ('331515', 'CD'),
 ('<', 'NNP'),
 ('/NetFileSize', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
 ('NonText_DocumentType_Chars', 'NNP'),
 ('>', 'VBD'),
 ('3657261', 'CD'),
 ('<', 'JJ'),
 ('/NonText_DocumentType_Chars', 'NNS'),
 ('>', 'VBP'),
 ('<', 'JJ'),
 ('HTML_Chars', 'NNP'),
 ('>', 'VBD'),
 ('1346602', 'CD'),
 ('<', 'JJ'),
 ('/HTML_Chars', 'NNS'),
 ('>', 'VBP'),
 ('<', 'JJ'),
 ('XBRL_Chars', 'NNP'),
 ('>', 'VBD'),
 ('0', 'CD'),
 ('<', 'JJ'),
 ('/XBRL_Chars', 'NNS'),
 ('>', 'VBP'),
 ('<', 'CD'),
 ('XML_Chars', 'NNS')]

## Which POS categories dominate risk-related sections (nouns, verbs, adjectives)?
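
To compare categories, the fine-grained Penn Treebank tags returned by `pos_tag` can be collapsed into coarse groups by their two-letter prefix. The sketch below is one possible grouping, not the lab's prescribed method; the mapping `COARSE` and the helper `coarse_pos_counts` are assumptions. Note that the output above tags `<` as `JJ`, so counts taken over header markup are unreliable; the grouping is more meaningful on sentences from the filing body.

```python
from collections import Counter

# Assumed grouping: the first two letters of a Penn Treebank tag give its coarse class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb", "CD": "number"}


def coarse_pos_counts(tagged):
    """Count coarse POS classes in a list of (token, tag) pairs."""
    counts = Counter()
    for _token, tag in tagged:
        counts[COARSE.get(tag[:2], "other")] += 1
    return counts


# Pairs copied from the tagger output above.
tagged_sample = [
    ("<", "JJ"),
    ("Header", "NNP"),
    (">", "NNP"),
    ("5409869", "CD"),
    (">", "VBD"),
    ("/HTML_Chars", "NNS"),
]
pos_counts = coarse_pos_counts(tagged_sample)
print(pos_counts)
```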

In [31]:
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

filtered_tokens = [
    t.lower() for t in tokens
    if t.isalpha() and t.lower() not in stop_words
]

filtered_tokens[:30]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nakulpadalkar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['header',
 'filestats',
 'filename',
 'grossfilesize',
 'netfilesize',
 'accession',
 'number',
 'conformed',
 'submission',
 'type',
 'public',
 'document',
 'count',
 'conformed',
 'period',
 'report',
 'filed',
 'date',
 'date',
 'change',
 'filer',
 'company',
 'data',
 'company',
 'conformed',
 'name',
 'timberline',
 'resources',
 'corp',
 'central']

## Which important business terms survive stop-word removal?
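
A frequency count over the filtered tokens is a simple way to see which terms survive. This is a minimal sketch; the helper name `top_terms` and the `min_length` threshold are assumptions added for illustration.

```python
from collections import Counter


def top_terms(tokens, n=10, min_length=3):
    """Return the n most frequent tokens that have at least min_length characters."""
    counts = Counter(t for t in tokens if len(t) >= min_length)
    return counts.most_common(n)


# A subset of the filtered tokens shown above.
sample_filtered = [
    "conformed", "submission", "type", "public", "document", "count",
    "conformed", "period", "report", "filed", "date", "date", "change",
    "filer", "company", "data", "company", "conformed",
]
top = top_terms(sample_filtered, n=3)
print(top)
```

Applied to the full `filtered_tokens` list, the same call would surface header vocabulary alongside business terms, which is one argument for removing the header before counting.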

In [32]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(t) for t in filtered_tokens[:20]]
lemmas = [lemmatizer.lemmatize(t) for t in filtered_tokens[:20]]

list(zip(filtered_tokens[:20], stems, lemmas))

[('header', 'header', 'header'),
 ('filestats', 'filestat', 'filestats'),
 ('filename', 'filenam', 'filename'),
 ('grossfilesize', 'grossfiles', 'grossfilesize'),
 ('netfilesize', 'netfiles', 'netfilesize'),
 ('accession', 'access', 'accession'),
 ('number', 'number', 'number'),
 ('conformed', 'conform', 'conformed'),
 ('submission', 'submiss', 'submission'),
 ('type', 'type', 'type'),
 ('public', 'public', 'public'),
 ('document', 'document', 'document'),
 ('count', 'count', 'count'),
 ('conformed', 'conform', 'conformed'),
 ('period', 'period', 'period'),
 ('report', 'report', 'report'),
 ('filed', 'file', 'filed'),
 ('date', 'date', 'date'),
 ('date', 'date', 'date'),
 ('change', 'chang', 'change')]

## Which transformation preserves interpretability better for financial text?
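
One way to ground this comparison is to list which tokens each transformation actually changed. The sketch below works on `(token, stem, lemma)` rows like those printed above; the helper name `altered_forms` is hypothetical. Keep in mind that `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is given, which is why forms such as `filed` come back unchanged in the output above.

```python
def altered_forms(rows):
    """Split (token, stem, lemma) rows into tokens changed by stemming and by lemmatization."""
    by_stem = [(tok, stem) for tok, stem, _ in rows if stem != tok]
    by_lemma = [(tok, lemma) for tok, _, lemma in rows if lemma != tok]
    return by_stem, by_lemma


# Rows copied from the comparison output above.
rows = [
    ("accession", "access", "accession"),
    ("number", "number", "number"),
    ("submission", "submiss", "submission"),
    ("filed", "file", "filed"),
    ("change", "chang", "change"),
]
changed_by_stem, changed_by_lemma = altered_forms(rows)
print(changed_by_stem)
print(changed_by_lemma)
```

Stems such as `submiss` and `chang` are not dictionary words, and `accession` collapses into `access`, which is a different term. Both observations are relevant when judging interpretability.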

In [33]:
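import spacy

# The small English model is installed separately from spaCy itself.
# If spacy.load raises "OSError: [E050] Can't find model 'en_core_web_sm'",
# install the model once with:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Note: sentences[0] is the header block; a sentence from the filing body gives a more meaningful parse
doc = nlp(sentences[0])
[(token.text, token.dep_, token.head.text) for token in doc]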

## How might dependency relationships help identify risk statements or obligations?
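
As a rough illustration, obligation-like statements can be located by pairing a modal auxiliary with the subject attached to the same verb. The sketch below operates on `(text, dep, head_text)` triples shaped like the list built in the cell above. The triples here are hand-written for illustration, using the label names `nsubj`, `aux`, and `dobj`; they are not actual parser output. The modal list and the helper `modal_statements` are also assumptions.

```python
# Assumed list of modal verbs that often signal obligations or possibilities.
MODALS = {"shall", "must", "will", "may", "should"}


def modal_statements(triples):
    """Return (subject, modal, verb) for each modal auxiliary in (text, dep, head_text) triples."""
    # Map each head verb to the first subject attached to it.
    subjects = {}
    for text, dep, head in triples:
        if dep in {"nsubj", "nsubjpass"}:
            subjects.setdefault(head, text)

    results = []
    for text, dep, head in triples:
        if dep == "aux" and text.lower() in MODALS:
            results.append((subjects.get(head), text.lower(), head))
    return results


# Hand-written triples for "The Company shall maintain insurance" (illustrative only).
demo_triples = [
    ("The", "det", "Company"),
    ("Company", "nsubj", "maintain"),
    ("shall", "aux", "maintain"),
    ("maintain", "ROOT", "maintain"),
    ("insurance", "dobj", "maintain"),
]
found = modal_statements(demo_triples)
print(found)
```

Because the lookup is keyed on the head's text, a sentence that repeats the same verb would mix up subjects. Keying on token positions (for example `token.head.i` in spaCy) avoids that.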

from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio in [0, 1] between strings a and b (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()

similarity(
    "risk management strategy",
    "enterprise risk management"
)

## Why might approximate string matching be useful for cross-company comparison?
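
One use case is reconciling company names that are written differently across filings. The sketch below uses `difflib.get_close_matches` from the standard library. The helpers `normalize` and `best_match`, the cutoff of 0.8, and the name "Example Mining Inc." are assumptions for illustration; only "Timberline Resources Corp" comes from the header shown earlier.

```python
from difflib import get_close_matches


def normalize(name: str) -> str:
    """Lower-case a name, drop periods and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def best_match(query, candidates, cutoff=0.8):
    """Return the candidate most similar to query, or None if none reaches the cutoff."""
    lookup = {normalize(c): c for c in candidates}
    hits = get_close_matches(normalize(query), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# "Example Mining Inc." is a made-up name used only for this sketch.
known_names = ["Timberline Resources Corp", "Example Mining Inc."]
match = best_match("TIMBERLINE RESOURCES CORPORATION", known_names)
print(match)
```

The cutoff trades recall for precision: a lower value catches more spelling variants but also raises the chance of pairing two different firms.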