In [None]:
---
title: "Module 1 - Lab 1"
subtitle: "From Raw Text to NLP Pipelines (SEC 10-K)"
author: "Nakul R. Padalkar"
number-sections: true
date: "2024-11-21"
date-modified: today
date-format: long
format: 
    html:
        code-overflow: wrap
categories: ['1', 'M01:', 'Lab']
description: "Hands-on lab activity: Interacting with Textual Data in Jupyter and Colab."
---

## Lab Objective {.unnumbered}

In this lab, you will:

- Connect **Google Colab** to **VS Code**
- Load real-world corporate text data (SEC 10-K filings)
- Implement a **classical NLP preprocessing pipeline**
- Answer **exploratory questions** about corporate disclosures using text analytics

This lab establishes the **computational and conceptual foundation** for later work with embeddings and generative models.


## Background Context {.unnumbered}

Public companies file **Form 10-K** annually with the U.S. Securities and Exchange Commission (SEC).  
These filings contain rich textual information about:

- business operations  
- risks and uncertainties  
- management discussion  
- regulatory disclosures  

In this lab, we treat each 10-K as **raw text data** and apply a standard NLP pipeline to prepare it for analysis.

## Dataset Overview  {.unnumbered}

- All data for this lab is located in [SEC-10K-2024/](https://drive.google.com/drive/folders/1q7BfsNHCewG1zNfnqyCcBj9p_RUt-zW6?usp=drive_link).
- You need to make the folder available in your own Google Drive.
- To do so, right-click the folder and choose "Add shortcut to Drive". The folder is then accessible from your Drive.
- This folder contains **plain-text 10-K filings** for multiple publicly traded firms.
- Each file represents **one company’s annual report**.

![](./M01_lecture02_figures/gdrive-add-folder.png){width="80%" fig-align="center"}

## Research Framing (Important) {.unnumbered}

You are **not** training a model yet. Instead, think of this lab as **asking structured questions of text**, such as:

- What terms dominate risk disclosures?
- How consistent is language across companies?
- Which words survive aggressive cleaning?
- How does preprocessing change the text representation?

Your answers will be supported by **intermediate outputs**, not final predictions.

## NLP Processing Pipeline {.unnumbered}

You will implement the following pipeline **step by step**:

1. Raw text  
2. Sentence segmentation  
3. Tokenization  
4. Part-of-Speech (POS) tagging  
5. Stop-word removal  
6. Stemming / Lemmatization  
7. Dependency parsing  
8. String metrics & matching  

Each stage produces **artifacts** that help you answer analytical questions.

## Load and Inspect the Data {.unnumbered}

In [None]:
# import os

# from google.colab import drive

# drive.mount("/content/drive")

# DATA_DIR = "/content/drive/MyDrive/Research/SEC-10K-2024"

# assert os.path.exists(DATA_DIR), (
#     "Google Drive is not mounted or the dataset path is incorrect. "
#     "Did you run drive.mount()?"
# )

# print("Drive mounted successfully. Data directory found.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Drive mounted successfully. Data directory found.


In [None]:
from pathlib import Path

# Local copy of the dataset; on Colab, use the Drive paths below instead.
SEC_DIR = Path("D:/Repositories/AD698-generative-ai-for-BA/data/SEC-10K-2024")
# DATA_ROOT = Path("/content/drive/MyDrive/Research")
# SEC_DIR = DATA_ROOT / "SEC-10K-2024"

assert SEC_DIR.exists(), "SEC data folder not found. Check SEC_DIR (or the Drive mount on Colab)."

sec_files = list(SEC_DIR.glob("*.txt"))
print(f"Found {len(sec_files)} SEC filings")

Found 7754 SEC filings


In [None]:

# Read a sample document
sample_text = sec_files[0].read_text(encoding="utf-8")
print(sample_text[:1500])

<Header>
<FileStats>
    <FileName>20240426_10-K-A_edgar_data_1434524_0001104659-24-053028.txt</FileName>
    <GrossFileSize>573357</GrossFileSize>
    <NetFileSize>79834</NetFileSize>
    <NonText_DocumentType_Chars>125288</NonText_DocumentType_Chars>
    <HTML_Chars>219025</HTML_Chars>
    <XBRL_Chars>69133</XBRL_Chars>
    <XML_Chars>53924</XML_Chars>
    <N_Exhibits>8</N_Exhibits>
</FileStats>
<SEC-Header>
0001104659-24-053028.hdr.sgml : 20240426
<ACCEPTANCE-DATETIME>20240426164535
ACCESSION NUMBER:		0001104659-24-053028
CONFORMED SUBMISSION TYPE:	10-K/A
PUBLIC DOCUMENT COUNT:		19
CONFORMED PERIOD OF REPORT:	20231231
FILED AS OF DATE:		20240426
DATE AS OF CHANGE:		20240426

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			ClearSign Technologies Corp
		CENTRAL INDEX KEY:			0001434524
		STANDARD INDUSTRIAL CLASSIFICATION:	INDUSTRIAL INSTRUMENTS FOR MEASUREMENT, DISPLAY, AND CONTROL [3823]
		ORGANIZATION NAME:           	08 Industrial Applications and Services
		IRS NUMBER:				0000

## What sections of the 10-K appear most frequently in the opening text?
Answering this question helps you understand the structure of the document and identify key areas for analysis (e.g., risk factors, management discussion). We start with sentence segmentation.

In [None]:
import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4") # Open Multilingual Wordnet

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
from nltk.tokenize import sent_tokenize

# Newer NLTK releases load the Punkt sentence tokenizer from the "punkt_tab"
# resource; without it, sent_tokenize raises a LookupError.
nltk.download("punkt_tab")

sentences = sent_tokenize(sample_text)
print(f"Number of sentences: {len(sentences)}")
sentences[:5]


## Are sentences in 10-Ks longer or shorter than typical news or social media text?
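
One way to approach this question is to summarize sentence lengths numerically and compare the figures across text sources. The sketch below is a minimal illustration that counts whitespace-separated words; the helper `sentence_length_summary` and the toy sentences are assumptions made for the example, and it expects a non-empty list.

```python
from statistics import mean, median


def sentence_length_summary(sentences):
    """Summarize sentence lengths in whitespace-separated words (hypothetical helper)."""
    lengths = [len(s.split()) for s in sentences if s.strip()]
    return {
        "n_sentences": len(lengths),
        "mean_words": mean(lengths),
        "median_words": median(lengths),
        "max_words": max(lengths),
    }


# Toy input; in the notebook you would pass the `sentences` list from sent_tokenize.
toy_sentences = [
    "We may incur losses.",
    "Our business depends on the continued availability of financing on acceptable terms.",
]
summary = sentence_length_summary(toy_sentences)
print(summary)
```

Running the same summary on a news article or a set of social media posts gives a like-for-like comparison.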


```python
from nltk.tokenize import word_tokenize

tokens = word_tokenize(sample_text)
tokens[:30]
```

## What kinds of tokens appear that are not “words” (e.g., symbols, numbers, legal references)?
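
To make this question concrete, each token can be assigned a coarse category and the categories counted. The sketch below uses only the standard library; the categories, the helper `token_kind`, and the hand-written `toy_tokens` list are illustrative assumptions, not output from `word_tokenize`.

```python
import re
from collections import Counter


def token_kind(token):
    """Assign a coarse category to a token (categories are illustrative)."""
    if token.isalpha():
        return "word"
    if re.fullmatch(r"[\d.,]*\d[\d.,]*", token):
        return "number"
    if not any(ch.isalnum() for ch in token):
        return "punctuation"
    return "mixed"  # letters and digits combined, e.g. "10-K"


# Hand-written tokens; in the notebook you would pass the `tokens` list instead.
toy_tokens = ["Form", "10-K", "filed", "20240426", ",", "Section",
              "13", "(", "a", ")", "3.5", "%", "$"]
kind_counts = Counter(token_kind(t) for t in toy_tokens)
print(kind_counts)
```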

```python
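nltk.download("averaged_perceptron_tagger")
# Newer NLTK releases look for the "_eng" variant of the tagger;
# downloading both covers either case.
nltk.download("averaged_perceptron_tagger_eng")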

from nltk import pos_tag

pos_tags = pos_tag(tokens[:50])
pos_tags
```

## Which POS categories dominate risk-related sections (nouns, verbs, adjectives)?
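
A simple way to answer this is to collapse the fine-grained Penn Treebank tags returned by `pos_tag` into coarse groups and count them. The sketch below assumes tags in the `(token, tag)` format shown above; the prefix mapping, the helper `coarse_pos`, and the hand-written `toy_tags` are illustrative and are not actual tagger output.

```python
from collections import Counter

# Penn Treebank tags share prefixes within a category (NN, NNS, NNP -> noun, etc.).
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse category (hypothetical helper)."""
    return COARSE_PREFIXES.get(tag[:2], "other")


# Hand-written example; in the notebook you would pass `pos_tags` instead.
toy_tags = [("The", "DT"), ("Company", "NNP"), ("faces", "VBZ"),
            ("significant", "JJ"), ("regulatory", "JJ"), ("risks", "NNS"), (".", ".")]
coarse_counts = Counter(coarse_pos(tag) for _, tag in toy_tags)
print(coarse_counts)
```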


```python
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

filtered_tokens = [
    t.lower() for t in tokens
    if t.isalpha() and t.lower() not in stop_words
]

filtered_tokens[:30]
```

## Which important business terms survive stop-word removal?
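
Counting the filtered tokens is a quick way to see which terms remain prominent after cleaning. The sketch below uses `collections.Counter` on a small hand-written list; `toy_filtered_tokens` is an assumption for illustration and stands in for the `filtered_tokens` list built above.

```python
from collections import Counter

# Hand-written stand-in for `filtered_tokens`.
toy_filtered_tokens = ["company", "risk", "financial", "risk",
                       "operations", "company", "risk", "market"]

term_counts = Counter(toy_filtered_tokens)

# Show the two most frequent surviving terms with their counts.
top_terms = term_counts.most_common(2)
for term, count in top_terms:
    print(f"{term:<12}{count}")
```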

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(t) for t in filtered_tokens[:20]]
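# lemmatize() treats every token as a noun by default (pos="n"),
# so verb forms may pass through unchanged.
lemmas = [lemmatizer.lemmatize(t) for t in filtered_tokens[:20]]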

list(zip(filtered_tokens[:20], stems, lemmas))
```

## Which transformation preserves interpretability better for financial text?


```python
import spacy

# Requires the small English model, installed once with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(sentences[0])
[(token.text, token.dep_, token.head.text) for token in doc]
```

## How might dependency relationships help identify risk statements or obligations?
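
As one illustration, modal auxiliaries such as "may" or "must" attach to the verb they modify, so a dependency parse links the modal to the action it qualifies. The sketch below works on `(text, dep, head)` triples in the same shape as the list comprehension above; the triples are hand-written for illustration (not actual parser output), and the `MODALS` set and the helper `modal_statements` are assumptions.

```python
# Modal verbs that often signal risk ("may", "could") or obligation ("must", "shall").
MODALS = {"may", "might", "could", "must", "shall", "should", "will", "would"}


def modal_statements(triples):
    """Return (modal, governed_verb) pairs from (text, dep, head_text) triples."""
    return [
        (text.lower(), head)
        for text, dep, head in triples
        if dep == "aux" and text.lower() in MODALS
    ]


# Hand-written triples for the sentence "We may incur losses."
toy_triples = [
    ("We", "nsubj", "incur"),
    ("may", "aux", "incur"),
    ("incur", "ROOT", "incur"),
    ("losses", "dobj", "incur"),
    (".", "punct", "incur"),
]
pairs = modal_statements(toy_triples)
print(pairs)
```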

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarity(
    "risk management strategy",
    "enterprise risk management"
)
```

## Why might approximate string matching be useful for cross-company comparison?
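
Different filers often write the same concept with small spelling or punctuation differences, and approximate matching can map those variants to one shared label. The sketch below uses `difflib.get_close_matches` from the standard library; the `canonical_terms` list, the helper `normalize_term`, and the `cutoff` of 0.6 are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical shared vocabulary of disclosure topics.
canonical_terms = ["risk management", "cybersecurity", "supply chain", "liquidity"]


def normalize_term(phrase, vocabulary, cutoff=0.6):
    """Return the closest vocabulary entry, or None if nothing is similar enough."""
    matches = get_close_matches(phrase.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(normalize_term("Risk-management", canonical_terms))
print(normalize_term("cyber security", canonical_terms))
print(normalize_term("goodwill impairment", canonical_terms))
```

A higher `cutoff` reduces false matches at the cost of missing looser variants, so the threshold is worth inspecting on real phrases.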



## Deliverables

Submit a Word document that answers the questions above, together with the Jupyter notebook containing your code and outputs (as either `.ipynb` or `.pdf`).

## Key Takeaway

> Before we can generate language,
> we must first **discipline text into structure**.

This pipeline is the foundation upon which **Bag of Words, TF-IDF, embeddings, and generative models** are built.
