# Sentiment Analysis of Company Filings

This notebook implements a simple open-source method to extract and analyze sentence-level sentiment from the MD&A sections of 10-Q filings submitted to the SEC. 

We focus on selected S&P 500 companies with complete filings over the last 4 years. Using `sec-edgar-downloader`, `sent_tokenize`, and FinBERT, we build a dictionary mapping tickers to positive/negative sentiment counts per filing.

See the README for a detailed description of the pipeline.


In [None]:
import pandas as pd
import time
from pathlib import Path
import shutil
import random
from bs4 import BeautifulSoup
import re
from collections import defaultdict

from datetime import datetime
import nltk
#nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize


## Fetching tickers
The following snippet scrapes the table of S&P 500 companies from Wikipedia, records the tickers, and samples 50 of them at random.

In [16]:
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
sp500 = pd.read_html(url)[0]
sp500.head()

|   | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|--------|----------|-------------|-------------------|-----------------------|------------|-----|---------|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


In [17]:
AllTickersSp500 = sp500['Symbol'].tolist()
random.seed(193)
#Sampling 50 random S&P500 tickers
SampleTickerSp500=random.sample(AllTickersSp500,50)

## Downloading 10-Q filings using EDGAR Downloader

EDGAR stands for Electronic Data Gathering, Analysis, and Retrieval. It is the official online system maintained by the U.S. Securities and Exchange Commission (SEC) for collecting and publishing financial filings submitted by public companies.

The SEC is a federal government agency responsible for:
- Regulating the securities markets (stocks, bonds, ETFs)
- Enforcing laws against market manipulation, fraud, and insider trading
- Protecting investors
- Requiring public companies to disclose financial and operational information through filings (such as 10-Ks and 10-Qs)

In [18]:
from sec_edgar_downloader import Downloader
import time
from tqdm.notebook import tqdm
dl=Downloader(company_name="MyNLPProject",email_address="bersudsky87@gmail.com")
TickersToAnalyze = [] #tickers with a complete set of 12 filings

for ticker in tqdm(SampleTickerSp500,desc='Ticker processed'):
    if len(TickersToAnalyze)==40:
        break
    #Downloading 10-Q filings in a period of four years, 3 per year, 
    #and only saving those which have no missing filings.    
    
    dl.get("10-Q",ticker,after="2021-01-01",before="2024-12-31",limit=12)
    folder_path=Path(f"sec-edgar-filings/{ticker}/10-Q")
    #Verifying that there are no missing filings
    #(the folder does not exist if nothing was downloaded):
    if folder_path.is_dir() and len([f for f in folder_path.iterdir() if f.is_dir()]) == 12:
        TickersToAnalyze.append(ticker)
    else:
        try:
            shutil.rmtree(folder_path)
        except Exception as e:
            print(f"{folder_path} could not be deleted due to {e}")
    time.sleep(10)



The following creates a list of the downloaded tickers.

In [19]:
from collections import defaultdict
from pathlib import Path
from bs4 import BeautifulSoup
import re
root=Path("sec-edgar-filings")
TickersToAnalyze = [
    name.name
    for name in root.iterdir()
    if name.is_dir() and any(f.is_dir() and f.name == "10-Q" for f in name.iterdir())
]
print(TickersToAnalyze,'\n')
print(f'The number of tickers to analyze is: {len(TickersToAnalyze)}')

['CAT', 'TROW', 'WSM', 'HWM', 'CME', 'NWS', 'INTC', 'LII', 'CDNS', 'HD', 'NDSN', 'AKAM', 'WAB', 'TYL', 'PSX', 'INVH', 'LVS', 'ELV', 'MCD', 'COR', 'VMC', 'PEG', 'XEL', 'SPG', 'CSX', 'KR', 'ESS', 'PAYX', 'VICI', 'DOV', 'UNH', 'LDOS', 'CSCO', 'ETN', 'DAY', 'CL', 'RL', 'COP', 'DVN', 'WRB'] 

The number of tickers to analyze is: 40


## Filtering Tickers – Extracting Filings with a Table of Contents

We now filter the tickers to retain only those whose 10-Q filings include a **table of contents (TOC)**. The downloaded filings are raw and unstructured HTML documents. However, if a filing contains a TOC, it typically includes hyperlinks (anchors) to specific sections.

We use this structure to identify:
- A reference (href) to the **MD&A** section, and
- A reference to the **following section**

This allows us to accurately extract the portion of the filing that contains the main textual content for sentiment analysis.
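
To illustrate the anchor mechanism, here is a minimal sketch that runs on a toy HTML string rather than a real filing. The ids `item2` and `item3` and the section text are made up for illustration, and the built-in `html.parser` is used instead of `lxml` only to keep the example light on dependencies.

```python
from bs4 import BeautifulSoup

# Toy filing: a TOC whose links point at section anchors further down.
html = """
<a href="#item2">Management's Discussion and Analysis</a>
<a href="#item3">Quantitative and Qualitative Disclosures About Market Risk</a>
<div id="item2"></div>
<p>Sales increased compared with the prior-year quarter.</p>
<div id="item3"></div>
<p>Interest rate risk is discussed here.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each TOC link text to its in-page target (the href without '#').
toc = {a.get_text(): a["href"].lstrip("#") for a in soup.find_all("a")}
start_id = next(v for k, v in toc.items() if "discussion and" in k.lower())
end_id = next(v for k, v in toc.items() if "quantitative and" in k.lower())

start = soup.find(id=start_id)
end = soup.find(id=end_id)

# Walk forward from the start anchor and keep paragraph text
# until the anchor of the following section is reached.
section = []
node = start.find_next(True)
while node is not None and node is not end:
    if node.name == "p":
        section.append(node.get_text(strip=True))
    node = node.find_next(True)

print(section)
```

Only the paragraph between the two anchors is collected; the text after the second anchor belongs to the next section and is left out.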

The result of this step is a dictionary called `hrefToMDA`, which maps each valid ticker to a list with one entry per filing. Each entry holds the path of the filing file, the href of the MD&A section, and the href of the following section.
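
As a rough sketch of that layout, the snippet below builds a small dictionary of the same shape and sends it through a JSON round trip. The ticker `XYZ`, the paths, and the anchor names are placeholders invented for this example, not values from real filings.

```python
import json
from collections import defaultdict

# Hypothetical entries: the ticker, paths, and anchors are placeholders.
hrefToMDA = defaultdict(list)
hrefToMDA["XYZ"].append(["filings/XYZ/q1.txt", "#mda_q1", "#risk_q1"])
hrefToMDA["XYZ"].append(["filings/XYZ/q2.txt", "#mda_q2", "#risk_q2"])

# One entry per filing: [file path, MD&A href, next-section href].
for filing_file, start_href, end_href in hrefToMDA["XYZ"]:
    print(filing_file, start_href, end_href)

# A JSON round trip keeps the nested lists but returns a plain dict,
# so looking up a missing ticker raises KeyError instead of creating a list.
restored = json.loads(json.dumps(hrefToMDA))
print(type(restored).__name__, len(restored["XYZ"]))
```

Because the reloaded object is a plain `dict`, code that runs after loading should iterate over existing keys rather than rely on `defaultdict` behavior.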


In [None]:
from pathlib import Path

hrefToMDA = defaultdict(list) #to be modified in place

def CheckTC(ticker : str):
    root_path = Path(f"sec-edgar-filings/{ticker}/10-Q")
    filing_folders = [f for f in root_path.iterdir() if f.is_dir()]
    if not filing_folders:
        print(f"[!] No valid filing folders for {ticker}")
        return

    for filing_folder in filing_folders:
        txt_files = list(filing_folder.glob("*.txt")) + list(filing_folder.glob("*.TXT"))
        if not txt_files:
            continue #skip folders that contain no filing document
        filing_file=txt_files[0] #there is only one file in each filing folder
        with open(filing_file, encoding='utf-8', errors='ignore') as f:
            html = f.read(700000) #this optimizes run time. The hrefs are in the beginning of the document.
        soup = BeautifulSoup(html, 'lxml')
        pattern_first = re.compile(r"discussion and", re.I)
        pattern_second = re.compile(r"quantitative and", re.I) #The locations were spotted after manually inspecting the files
        pattern_third = re.compile(r"qualitative and",re.I)
        # Find hrefs for MD&A section
        href_first = None
        h_ref_second = None
        for a_tag in soup.find_all("a"):
          if pattern_first.search(a_tag.get_text()):
              href_first = a_tag.get("href")
              next_tag = a_tag
              for i in range(30):
                 next_tag = next_tag.find_next("a")
                 if not next_tag:
                     break
                 if pattern_second.search(next_tag.get_text()):
                     h_ref_second = next_tag.get("href")
                     break
                 elif pattern_third.search(next_tag.get_text()):
                     h_ref_second = next_tag.get("href")
                     break
              if not h_ref_second:
                print(f"Quantitative and Qualitative section was not found for: {ticker}")

              if href_first and h_ref_second:
                hrefToMDA[ticker].append([str(filing_file),href_first, h_ref_second])
                break  # Stop after first match

# Run for all tickers
for tick in tqdm(TickersToAnalyze,desc='Tickers analyzed'):
    CheckTC(tick)

In [22]:
import json
#Saving the hrefToMDA
with open('hrefToMDA.json','w') as f:
  json.dump(hrefToMDA,f)

In [29]:
import json
#Load hrefToMDA
with open('hrefToMDA.json','r') as f:
    hrefToMDA = json.load(f)

## Cleaning MD&A section

In [24]:
import re

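def clean(paragraph : str) -> str: #helper function which cleans paragraphs
    # Collapse runs of whitespace into single spaces
    paragraph = re.sub(r'\s+', ' ', paragraph).strip()
    # Capture the text following each bullet character. The character class
    # also contains '–' and '-', so hyphenated words are split as well.
    bullets = re.findall(r'[•●–\-]\s*([^•●–\-]+?)(?=(?:[•●–\-]|$))', paragraph)
    if not bullets:
        # No bullet characters: keep the paragraph instead of returning ''
        return paragraph
    # Strip whitespace and join the bullet items with commas
    return ', '.join(b.strip().rstrip(',') for b in bullets)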
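
The behavior of the bullet pattern is easier to see on small inputs. The sketch below applies the same regular expression to three made-up strings; `split_bullets` and `BULLET_PATTERN` are names introduced here for illustration only.

```python
import re

# Same pattern as in clean(): a bullet-like character, optional spaces,
# then the shortest run of text up to the next bullet-like character or the end.
BULLET_PATTERN = r'[•●–\-]\s*([^•●–\-]+?)(?=(?:[•●–\-]|$))'

def split_bullets(text: str) -> list:
    """Return the fragments that follow a bullet, en dash, or hyphen."""
    text = re.sub(r'\s+', ' ', text).strip()
    return [m.strip() for m in re.findall(BULLET_PATTERN, text)]

# Two bullet items on separate lines.
print(split_bullets("• Sales rose 25 percent\n• Margin improved"))
# A hyphen inside a word is treated like a bullet.
print(split_bullets("• Third-quarter profit rose"))
# Text without any bullet-like character yields no fragments.
print(split_bullets("Sales were higher across the three segments."))
```

The second call shows why strings such as "Third, quarter" can appear in the cleaned text: the hyphen is in the same character class as the bullets, so the word is split there.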

In [31]:
from bs4 import BeautifulSoup
from datetime import datetime
def extract_mda_section(ticker:str,filing_file:str, start_href:str, end_href:str):
    with open(Path(filing_file), encoding='utf-8', errors='ignore') as f:
        html = f.read()
    filing_date = re.search(r'FILED AS OF DATE:\s*(\d{8})',html).group(1)
    date_res = datetime.strptime(filing_date, "%Y%m%d").date()
    soup = BeautifulSoup(html, 'lxml')

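    # Step 1: Locate the anchors by href (strip the leading '#')
    start_anchor = soup.find(id = start_href[1:])
    end_anchor = soup.find(id = end_href[1:])
    if start_anchor is None:
        raise ValueError(f"MD&A anchor {start_href} not found in {filing_file}")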


    # Step 2: Collect content between start and end
    current = start_anchor.find_next()
    content_between = []

    while current and current != end_anchor:
        next_node = current.find_next()
        if current.name!='table':
            content_between.append(current)
        current = next_node

    # Step 3: Wrap content into a temporary container for further parsing
    container = BeautifulSoup('<div></div>', 'lxml')
    for tag in content_between:
        container.div.append(tag)


    # Step 4: Break into paragraphs
    paragraphs = []
    for tag in container.find_all(['p', 'div']):
        text = tag.get_text(separator=' ',strip=True)
        if text and len(text.split())>10:
            text=clean(text)
            paragraphs.append(text)

    return date_res, paragraphs

## Example of the extracted lines

In [32]:
#An example of extracted paragraphs of a filing (of CAT):
CAT_date, CAT_paragraphs = extract_mda_section('CAT', hrefToMDA['CAT'][0][0], hrefToMDA['CAT'][0][1],hrefToMDA['CAT'][0][2])


The extracted paragraphs are often quite long, so we first split them into individual sentences using `sent_tokenize`.  
Even after segmentation, some noisy or irrelevant lines may remain. To address this, we apply FinBERT for sentiment analysis **only** to sentences of at most 512 tokens, and we **only count predictions with a confidence score above 0.7**.
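
The confidence filter can be sketched without loading the model. The example below assumes results in the shape returned by a `transformers` text-classification pipeline (one dict with `label` and `score` per sentence); the scores are made up, `count_confident` is a hypothetical helper, and the 0.7 threshold mirrors the value used in this notebook's counting step.

```python
def count_confident(results, threshold=0.7):
    """Count Positive/Negative labels whose score exceeds the threshold."""
    counts = {"Positive": 0, "Negative": 0}
    for r in results:
        if r["label"] in counts and r["score"] > threshold:
            counts[r["label"]] += 1
    return counts

# Made-up classifier outputs, one dict per sentence.
mock_results = [
    {"label": "Positive", "score": 0.98},
    {"label": "Positive", "score": 0.55},  # below the threshold, ignored
    {"label": "Neutral", "score": 0.99},   # neutral sentences are not counted
    {"label": "Negative", "score": 0.81},
]

print(count_confident(mock_results))
```

Neutral predictions and low-confidence predictions are both dropped, so a filing's counts reflect only sentences the model labels clearly.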


In [48]:
import random
from nltk.tokenize import sent_tokenize
lines = []

for paragraph in CAT_paragraphs:
    lines.extend(sent_tokenize(paragraph))
for l in lines:
    print(f"{l}\n{'='*40}\n" )

K .

Highlights for the third quarter of 2021 include:, Total sales and revenues for the third quarter of 2021 were $12.397 billion, an increase of $2.516 billion, or 25 percent, compared with $9.881 billion in the third quarter of 2020.

Sales were higher across the three primary segments., Operating profit margin was 13.4 percent for the third quarter of 2021, compared with 10.0 percent for the third quarter of 2020.

Adjusted operating profit margin was 13.7 percent for the third quarter of 2021, compared with 11.1 percent for the third quarter of 2020., Third, quarter 2021 profit per share was $2.60, and excluding the items in the table below, adjusted profit per share was $2.66.

Third, quarter 2020 profit per share was $1.22, and excluding the items in the table below, adjusted profit per share was $1.52., Caterpillar ended the third quarter of 2021 with $9.4 billion of enterprise cash.

Highlights for the nine months ended September 30, 2021 include:, Total sales and revenues we

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

model_name="yiyanghkust/finbert-tone"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis",model=model,tokenizer=tokenizer)

In [None]:
sentiment_data = defaultdict(list)
for ticker in hrefToMDA:
    for filing_file, start_href, end_href in hrefToMDA[ticker]:
        ticker_date, ticker_paragraphs = extract_mda_section(ticker, filing_file, start_href, end_href)

        # Break every paragraph into sentences using the Punkt model
        lines = []
        for paragraph in ticker_paragraphs:
            lines.extend(sent_tokenize(paragraph))

        # Filter out sentences longer than 512 tokens
        filtered_lines = [l for l in lines if len(tokenizer.encode(l)) <= 512]

        # Run classifier in batch.
        results = classifier(filtered_lines, batch_size=16, truncation=True, max_length=512)

        # Count positives and negatives of high confidence.
        count_positive = sum(1 for r in results if r['label'] == 'Positive' and r['score'] > 0.7)
        count_negative = sum(1 for r in results if r['label'] == 'Negative' and r['score'] > 0.7)

        sentiment_data[ticker].append({'date': ticker_date, 'positive': count_positive, 'negative': count_negative})
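
One possible way to inspect the resulting dictionary is sketched below. The counts and the ticker `XYZ` are invented, and the net score `(positive - negative) / (positive + negative)` is only an illustrative summary measure, not something the notebook itself computes.

```python
from datetime import date

# Made-up counts in the same shape as `sentiment_data`; "XYZ" is a placeholder.
sentiment_data = {
    "XYZ": [
        {"date": date(2021, 8, 4), "positive": 30, "negative": 12},
        {"date": date(2021, 5, 5), "positive": 18, "negative": 20},
    ],
}

def net_sentiment(entry):
    """(positive - negative) / (positive + negative), or 0.0 if both are zero."""
    total = entry["positive"] + entry["negative"]
    return (entry["positive"] - entry["negative"]) / total if total else 0.0

# Print filings in chronological order with their net score.
for ticker, filings in sentiment_data.items():
    for entry in sorted(filings, key=lambda e: e["date"]):
        print(ticker, entry["date"], round(net_sentiment(entry), 3))
```

Sorting by date matters because the filing folders are not guaranteed to be read in chronological order.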

