<a href="https://colab.research.google.com/github/Retieun/HA-agent/blob/main/10_K_Financial_Analyst%22_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Cell 1: Install Dependencies & Restart Runtime
import os

# Install the latest LangChain v0.3 ecosystem
!pip install -qU langchain langchain-community langchain-openai langchain-chroma beautifulsoup4 langchain-text-splitters pypdf pysqlite3-binary

# Automatically restart the runtime to apply the new versions
print("Restarting runtime to apply upgrades...")
os.kill(os.getpid(), 9)

In [1]:
# Cell 2: Setup Infrastructure
import os
import sys
from google.colab import userdata

# 1. FIX: Upgrade SQLite for ChromaDB (Critical for Colab)
# We swap the old system sqlite3 with the new pysqlite3-binary we just installed
try:
    __import__('pysqlite3')
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
    print("✅ SQLite successfully patched for ChromaDB.")
except ImportError:
    print("❌ Error: pysqlite3 not found. Did you run Cell 1?")

# 2. Load API Key
try:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
    print("✅ API Key loaded.")
except Exception as e:
    print("❌ Error: Go to the 'Key' icon on the left -> Add 'OPENAI_API_KEY'")

✅ SQLite successfully patched for ChromaDB.
✅ API Key loaded.
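
The patch above only reports that the module swap happened, not which SQLite version ends up active. A minimal sketch of a version check follows; the `MIN_SQLITE` threshold of 3.35.0 and the helper name `sqlite_is_new_enough` are assumptions for illustration, so confirm the actual requirement against the Chroma documentation.

```python
import sqlite3

# Assumed minimum SQLite version for Chroma; confirm against the Chroma docs.
MIN_SQLITE = (3, 35, 0)


def sqlite_is_new_enough(version_info=None, minimum=MIN_SQLITE):
    """Return True if the given (or currently active) SQLite version meets the minimum."""
    if version_info is None:
        # After the sys.modules swap, this reports the patched library's version.
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= tuple(minimum)


print("Active SQLite version:", sqlite3.sqlite_version)
print("Meets assumed minimum:", sqlite_is_new_enough())
```

Run this after the patch cell; if it prints `False`, the swap probably did not take effect in the current runtime.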


In [2]:
# Cell 3: Ingest Data (HTML 10-K)
import requests

# Direct link to Capital One's 2023 10-K HTML file
url = "https://www.sec.gov/Archives/edgar/data/927628/000092762824000094/cof-20231231.htm"

headers = {
    "User-Agent": "StudentProject/1.0 (student@example.com)"
}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # Fail on an HTTP error instead of saving an error page

with open("capital_one_10k.html", "w", encoding="utf-8") as f:
    f.write(response.text)

print("✅ Data Ingestion Complete: 'capital_one_10k.html' saved.")

✅ Data Ingestion Complete: 'capital_one_10k.html' saved.
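
Before parsing, it can help to confirm that the downloaded body looks like a full filing rather than a short error page. The sketch below is one way to do that under stated assumptions: the helper name `check_download`, the `min_chars` threshold, and the "look for `<html` near the start" heuristic are all illustrative choices, not part of the notebook.

```python
def check_download(status_code, body, min_chars=100_000):
    """Return a list of problems found with a downloaded filing (empty if none)."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected HTTP status {status_code}")
    if len(body) < min_chars:
        # A full 10-K is large; a tiny body usually means an error page.
        problems.append(f"body is only {len(body)} characters")
    if "<html" not in body[:2000].lower():
        problems.append("no <html tag near the start of the body")
    return problems


# Usage with stand-in values; in the notebook these would come from `response`.
issues = check_download(200, "<html>" + "x" * 200_000)
print("Problems found:", issues)
```

In the notebook, the call would be `check_download(response.status_code, response.text)`, and a non-empty list would be a reason to stop before writing the file.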


In [3]:
# Cell 4: Parse and Chunk
from langchain_community.document_loaders import BSHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load HTML
print("Parsing HTML... (This may take 30-60 seconds)")
loader = BSHTMLLoader("capital_one_10k.html")
docs = loader.load()

# 2. Split text
# We use a large chunk size (4000) because HTML contains dense tables
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)

print(f"✅ Preprocessing Complete: Created {len(splits)} chunks.")

Parsing HTML... (This may take 30-60 seconds)

Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

  soup = BeautifulSoup(f, **self.bs_kwargs)

✅ Preprocessing Complete: Created 233 chunks.
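
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph breaks); the function name `chunk_with_overlap` is hypothetical, and the sketch only shows how consecutive chunks share a tail of characters.

```python
def chunk_with_overlap(text, chunk_size=4000, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text.
    return chunks


# Stand-in text of 10,000 characters.
sample = "".join(str(i % 10) for i in range(10_000))
pieces = chunk_with_overlap(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence or table row that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.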


In [4]:
# Cell 5: Create Vector Database
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

print("Creating Vector Store... (Embedding chunks, takes ~1 minute)")

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    collection_name="capital_one_10k"
)
retriever = vectorstore.as_retriever()

print("✅ Vector Store Ready. Database is live.")

Creating Vector Store... (Embedding chunks, takes ~1 minute)
✅ Vector Store Ready. Database is live.
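
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below illustrates that ranking with toy two-dimensional vectors and cosine similarity. It is only an illustration: Chroma's actual index structure and distance metric may differ, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": real ones have hundreds or thousands of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```

Raising `k` widens the set of chunks handed to the model, which is the knob the retriever's `search_kwargs` exposes.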


In [9]:
# --- CORRECTED CELL 6: Tuned RAG Pipeline ---

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# 1. TUNING: Increase 'k' to retrieve more context (top 8 chunks instead of the default 4)
# This is crucial for dense financial documents.
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

# 2. Helper to format the extra documents nicely
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

# 3. Refined System Prompt (Encourages extraction)
system_prompt = (
    "You are a senior financial analyst. "
    "Analyze the provided 10-K extracts below to answer the user's question. "
    "Focus on 'Item 1A. Risk Factors' and 'MD&A' sections if present. "
    "Summarize the key points in bullet points. "
    "If the text mentions specific dollar amounts or percentages, include them."
    "\n\n"
    "Context from 10-K:\n{context}"
)

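# The system message carries the instructions and retrieved context; the human
# message carries the question. Without an {input} placeholder the question
# would never reach the model.
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])
llm = ChatOpenAI(model="gpt-3.5-turbo-16k")  # Or another long-context chat model, e.g. gpt-4-turbo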

# 4. Modern LCEL Chain
rag_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 5. Try the question again
query = "What are the specific credit risks related to consumer loans and credit cards?"
print(f"❓ Question: {query}\n")

response = rag_chain.invoke(query)
print("💡 Analyst Answer:")
print(response)

❓ Question: What are the specific credit risks related to consumer loans and credit cards?

💡 Analyst Answer:
1. Credit Card and Consumer Banking Loan Portfolios:
- Loans are assessed based on common risk characteristics such as origination year, interest rate, borrower credit score, and geography.
- Credit card loans do not have a defined contractual life, and expected credit losses are measured based only on the drawn balance.
- An allowance is established based on a modeled calculation supplemented by management judgment.
- Loan portfolios are divided into segments like auto loans and retail banking loans.

2. Commercial Banking Loan Portfolio:
- Loans are subject to internal risk ratings considering factors like borrower financial condition, collateral performance, and industry-specific information.
- The contractual period typically does not include renewals or extensions.
- The company assigns internal risk ratings and monitors delinquency trends for credit quality assessmen
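
The chain feeds two values into the prompt, `context` and `input`, and a value only reaches the model if the template has a placeholder for it. The sketch below shows that assembly step with plain string formatting instead of LangChain; `Doc` is a stand-in for a retrieved document, and `TEMPLATE` and `build_prompt` are hypothetical names with a shortened prompt.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def format_docs(docs):
    # Same separator the notebook uses between retrieved chunks.
    return "\n\n---\n\n".join(doc.page_content for doc in docs)


# Shortened, illustrative template with one placeholder per chain input.
TEMPLATE = (
    "You are a senior financial analyst.\n\n"
    "Context from 10-K:\n{context}\n\n"
    "Question: {input}"
)


def build_prompt(question, docs):
    """Fill both placeholders so the model sees the context and the question."""
    return TEMPLATE.format(context=format_docs(docs), input=question)


final_prompt = build_prompt(
    "What are the credit risks?",
    [Doc("Chunk about card loans."), Doc("Chunk about auto loans.")],
)
print(final_prompt)
```

Printing the assembled prompt like this is a quick way to check that the question text is actually present before spending tokens on a model call.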

In [7]:
# --- DEBUG CELL: Inspect Retrieval ---

# 1. Run a similarity search for the term directly against the vector store
print("🔍 DEBUGGING RETRIEVAL...")
results = vectorstore.similarity_search("credit risk", k=5)

if len(results) == 0:
    print("❌ ERROR: No documents found in Vector Store. Did parsing fail?")
else:
    print(f"✅ Found {len(results)} chunks related to 'credit risk'.")
    print("-" * 40)
    for i, doc in enumerate(results):
        print(f"📄 Chunk {i+1} Preview:")
        print(doc.page_content[:300].replace("\n", " "))  # Print first 300 chars
        print("-" * 40)

🔍 DEBUGGING RETRIEVAL...
✅ Found 5 chunks related to 'credit risk'.
----------------------------------------
📄 Chunk 1 Preview:
the industry. Additionally, we monitor timely and effective responsiveness to these conditions, strategic decisions that impact the Company’s scale, market position or operating model and failure to appropriately consider implementation risks in the Company’s strategy. Potential areas of opportunity
----------------------------------------
📄 Chunk 2 Preview:
We assign internal risk ratings to loans based on relevant information about the ability of the borrowers to repay their debt. In determining the risk rating of a particular loan, some of the factors considered are the borrower’s current financial condition, historical and projected future credit pe
----------------------------------------
📄 Chunk 3 Preview:
credit risk and comply with credit policies and guidelines. In addition, the Chief Credit and Financial Risk Officer establishes p