# Annual Financial Report Analyzer (Stocks) 📈

### Idea:
Collect 10–20 annual reports (10-Ks) from different publicly traded companies (Apple, Tesla, etc.), totaling 200+ pages. Extract the financial tables, risk factors, and business sections so that a user can ask questions such as:

- “What are the main risk factors Apple mentioned in 2023?”
- “Summarize Tesla’s financial performance in 2022.”

Why it’s valuable:

- Practical value for financial analysts
- Real-world PDFs with a heavy mix of tables and text
- Good for a resume: “Built an LLM-powered SEC report analyzer with semantic search, MMR reranking, and document rendering.”

### Downloading Documents

In [1]:
from sec_edgar_downloader import Downloader

In [2]:
# Arguments: requester name, contact email (both identify you to SEC EDGAR), download folder
d1 = Downloader("student", "nirmitkhurana82@gmail.com", "data")

companies = {
    "AAPL": "Apple Inc.",
    "MSFT": "Microsoft Corporation",
    "TSLA": "Tesla, Inc.",
    "AMZN": "Amazon.com, Inc.",
    "GOOGL": "Alphabet Inc.",
    "META": "Meta Platforms, Inc."
}

# Fetch up to three of the most recent 10-K filings per ticker
for ticker in companies:
    d1.get("10-K", ticker, limit=3, download_details=False)

print("All the files have been downloaded successfully")

All the files have been downloaded successfully
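
With six tickers and `limit=3`, the loop should leave up to 18 filings on disk, but the success message is printed regardless of what was actually fetched. A minimal sanity check, assuming the `<ticker>/10-K/<filing_id>/full-submission.txt` layout that the ingestion step reads; `count_filings` is a hypothetical helper and the accession numbers in the demo are placeholders, not real filings:

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_filings(root, form="10-K"):
    """Count full-submission.txt files per ticker under root/<ticker>/<form>/<filing_id>/."""
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        return counts
    for submission in root.glob(f"*/{form}/*/full-submission.txt"):
        # parts: ..., <ticker>, <form>, <filing_id>, full-submission.txt
        ticker = submission.parts[-4]
        counts[ticker] += 1
    return counts


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    placeholder_filings = [
        ("AAPL", "0000000000-23-000001"),
        ("AAPL", "0000000000-22-000001"),
        ("TSLA", "0000000000-23-000002"),
    ]
    for ticker, filing_id in placeholder_filings:
        filing_dir = Path(tmp) / ticker / "10-K" / filing_id
        filing_dir.mkdir(parents=True)
        (filing_dir / "full-submission.txt").write_text("placeholder")
    demo_counts = count_filings(tmp)

print(dict(demo_counts))
```

On the real download, `count_filings("data/sec-edgar-filings")` would be expected to report three filings per ticker if every request succeeded.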


### Data Ingestion

In [7]:
from bs4 import BeautifulSoup

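def extract_clean_text_from_xbrl(raw_text):
    """Return the text of every <p> and <td> element in raw_text, one per line.

    All paragraph text is emitted first, then all table-cell text, so the
    original document order is not preserved.
    """
    soup = BeautifulSoup(raw_text, "html.parser")

    cleaned_text = []

    # Collect paragraph text first, then table-cell text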
    for p in soup.find_all("p"):
        cleaned_text.append(p.get_text(strip=True))

    for td in soup.find_all("td"):
        cleaned_text.append(td.get_text(strip=True))

    return "\n".join(cleaned_text)

In [3]:
from langchain_community.document_loaders.text import TextLoader

In [8]:
# Walk data/sec-edgar-filings/<ticker>/10-K/<filing_id>/ and wrap each filing in a Document
import os

from langchain_core.documents import Document
folder_path = "data/sec-edgar-filings"
documents = []

for company in os.listdir(folder_path):
    company_folder = os.path.join(folder_path, company,"10-K")
    if not os.path.isdir(company_folder):
        continue

    for filing_id in os.listdir(company_folder):
        filing_path = os.path.join(company_folder, filing_id)
        file_path = os.path.join(filing_path, "full-submission.txt")

        if os.path.exists(file_path):
            try:
                # Step 1: Load raw HTML text manually
                with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                    raw_html = f.read()

                # Step 2: Clean it using the XBRL parser
                cleaned_text = extract_clean_text_from_xbrl(raw_html)

                # Step 3: Wrap into a LangChain Document object
                doc = Document(
                    page_content=cleaned_text,
                    metadata={
                        "ticker": company,
                        "filing_id": filing_id,
                        "source_file": file_path
                    }
                )

                documents.append(doc)
            except Exception as e:
                print(f"❌ Error loading {file_path}:{e}")

print(f"✅ Loaded all {len(documents)} documents.")


✅ Loaded all 18 documents.


In [13]:
print(documents[0].page_content)

Boolean flag that is true when the XBRL content amends previously-filed or accepted submission.
No definition available.
Area code of city
No definition available.
Cover page.
No definition available.
End date of current fiscal year in the format --MM-DD.
No definition available.
Boolean flag that is true only for a form used as an annual report.
Reference 1: http://www.xbrl.org/2003/role/presentationRef-Publisher SEC-Name Form 10-K-Number 249-Section 310Reference 2: http://www.xbrl.org/2003/role/presentationRef-Publisher SEC-Name Form 20-F-Number 249-Section 220-Subsection fReference 3: http://www.xbrl.org/2003/role/presentationRef-Publisher SEC-Name Form 40-F-Number 249-Section 240-Subsection f
Indicates whether any of the financial statement period in the filing include a restatement due to error correction.
Reference 1: http://www.xbrl.org/2003/role/presentationRef-Publisher SEC-Name Regulation S-K-Number 229-Section 402-Subsection wReference 2: http://www.xbrl.org/2003/role/presen
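
The output above is mostly XBRL taxonomy definitions rather than the 10-K narrative. A likely cause is that `full-submission.txt` bundles every document of the filing (main report, exhibits, XBRL viewer pages), and the parser reads all of them. The sketch below, assuming the usual EDGAR layout of `<DOCUMENT>` blocks that each carry a `<TYPE>` line and a `<TEXT>` body, isolates the main 10-K document before it is handed to the HTML parser. `extract_primary_document` is a hypothetical helper, not part of any library, and the sample submission is invented for illustration:

```python
import re

DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)(?:</TEXT>|\Z)", re.DOTALL | re.IGNORECASE)


def extract_primary_document(raw_submission, form_type="10-K"):
    """Return the <TEXT> body of the first document whose <TYPE> equals form_type, or None."""
    for match in DOCUMENT_RE.finditer(raw_submission):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        if type_match and type_match.group(1).upper() == form_type.upper():
            text_match = TEXT_RE.search(block)
            return text_match.group(1) if text_match else block
    return None


# Invented two-document submission: the main report plus an XBRL viewer page.
sample_submission = """<SEC-DOCUMENT>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<TEXT>
<html><body><p>Item 1A. Risk Factors</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>2
<TEXT>
<p>No definition available.</p>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

primary = extract_primary_document(sample_submission)
print(primary)
```

In the ingestion loop, the result could be passed to `extract_clean_text_from_xbrl` in place of the raw file contents, falling back to the raw text when the helper returns `None`.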

## Chunking

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1000-character chunks, with 200 characters shared between neighbouring chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.split_documents(documents)
print(f"✅ Split into {len(chunks)} chunks.")

✅ Split into 87179 chunks.
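
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks); it only illustrates how an overlap of 200 characters makes consecutive chunks share text. `sliding_window_chunks` is a hypothetical name:

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 2500 characters of filler text, so the windows start at 0, 800, and 1600.
demo_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(demo_text)

print(len(demo_chunks), [len(c) for c in demo_chunks])
```

The last 200 characters of each chunk reappear at the start of the next one, which is why the total chunk text is larger than the source text.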


In [None]:
print(chunks[0].metadata)

{'ticker': 'TSLA', 'filing_id': '0000950170-23-001409', 'source_file': 'data/sec-edgar-filings/TSLA/10-K/0000950170-23-001409/full-submission.txt'}
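
The metadata carries no explicit year, yet questions like “risk factors in 2023” need one. The `filing_id` is an EDGAR accession number of the form `<10-digit filer ID>-<2-digit year>-<6-digit sequence>`, so a filing year can be derived from it. A minimal sketch; `filing_year_from_accession` is a hypothetical helper, and the century pivot (values of 90 and above are read as 19xx) is an assumption. Note that this is the year the report was filed, which can differ from the fiscal year it covers:

```python
import re

ACCESSION_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})$")


def filing_year_from_accession(accession):
    """Return the four-digit filing year encoded in an accession number, or None."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    two_digit_year = int(match.group(2))
    # Assumption: 90-99 means the 1990s, everything else means the 2000s.
    return 1900 + two_digit_year if two_digit_year >= 90 else 2000 + two_digit_year


metadata = {"ticker": "TSLA", "filing_id": "0000950170-23-001409"}
enriched = {**metadata, "filing_year": filing_year_from_accession(metadata["filing_id"])}

print(enriched)
```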


In [None]:
# Save the chunks as JSON Lines: one JSON object per line.

import json
import os

output_file = "data/chunks/all_chunks.json"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

with open(output_file, "w", encoding="utf-8") as f:
    for chunk in chunks:
        f.write(json.dumps({
            "text": chunk.page_content,
            "metadata": chunk.metadata
        }) + "\n")

print(f"✅ Saved {len(chunks)} chunks to {output_file}")

✅ Saved 87179 chunks to data/chunks/all_chunks.json
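
Because the file holds one JSON object per line rather than a single JSON array, `json.load` on the whole file would fail; it has to be read back line by line. A minimal sketch of the reverse step, where `load_chunks` is a hypothetical helper and the demo records are invented:

```python
import json
import os
import tempfile


def load_chunks(path):
    """Read a JSON Lines file and return a list of {"text", "metadata"} dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Round trip on a temporary file written in the same format as above.
records_in = [
    {"text": "Item 1A. Risk Factors", "metadata": {"ticker": "AAPL"}},
    {"text": "Item 7. Management's Discussion", "metadata": {"ticker": "TSLA"}},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "all_chunks.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in records_in:
            f.write(json.dumps(record) + "\n")
    records_out = load_chunks(demo_path)

print(len(records_out))
```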


## Embeddings