# 📚 Table of Contents

- [🧾 Introduction](#🧾-Introduction)
- [🏗️ System Architecture Overview](#🏗️-System-Architecture-Overview)
- [⚙️ System Setup](#⚙️-System-Setup)
- [📄 PDF Parsing](#📄-PDF-Parsing)
- [🧹 Text Preprocessing](#🧹-Text-Preprocessing)
- [📊 Table Preprocessing](#📊-Table-Preprocessing)
- [🧩 Chunking Mechanism](#🧩-Chunking-Mechanism)
- [✍️ Summarization](#✍️-Summarization)
- [🧠 Embedding & Storage](#🧠-Embedding--Storage)
- [🔎 Retrieval & Q&A](#🔎-Retrieval--QA)
- [📝 Report Generation](#📝-Report-Generation)
- [📌 Conclusion](#📌-Conclusion)


## 🧾 Introduction

Welcome to the **Financial PDF Analyzer** – a modular pipeline system designed to transform unstructured financial documents into structured, actionable insights.

This notebook walks through each phase of the pipeline:
<details>
<summary>📌 <strong>Pipeline Steps (Click to Expand)</strong></summary>

- 📄 **PDF Parsing**: Extract text and tables from raw financial PDFs using [unstructured](https://docs.unstructured.io/welcome).
- 🔍 **Preprocessing**: Clean and prepare text and tabular data.
- 🧩 **Chunking**: Segment text into manageable, coherent blocks.
- ✍️ **Summarization**: Use [Gemini](https://ai.google.dev/gemini-api/docs/models) to summarize chunks.
- 🧠 **Embedding Generation**: Create text and table embeddings using the Hugging Face models [Instructor-XL](https://huggingface.co/hkunlp/instructor-xl) and [BAAI bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5).
- 🗂️ **Vector Store**: Store embeddings and metadata in [ChromaDB](https://www.trychroma.com/).
- 📥 **Retrieval and Reporting**: Query and synthesize insights via LLM to produce financial summaries.

</details>

---

### 🔗 Key Technologies Used

- [LangChain](https://www.langchain.com/)
- [Chroma Vector Store](https://docs.trychroma.com/)
- [Instructor-XL Embedding Model](https://huggingface.co/hkunlp/instructor-xl)
- [Gemini](https://deepmind.google/technologies/gemini/)


## 🧾 Introduction

The **Financial PDF Analyzer** is a pipeline-based system designed to transform unstructured financial documents into structured, actionable insights. This notebook walks through each step of the system, demonstrating how financial PDFs are processed, summarized, embedded, and ultimately used to generate concise reports.

The architecture follows a modular approach:

- 📄 **PDF Parsing**: Raw financial PDFs are parsed to extract both freeform text and structured tables.
- 🔍 **Preprocessing**: Separate pipelines clean and prepare text and tabular content.
- 🧩 **Chunking**: Text content is segmented into manageable chunks to support efficient processing.
- ✍️ **Summarization**: Each chunk is summarized using a Large Language Model to distill core insights.
- 🧠 **Embedding Generation**: Summarized chunks are embedded using the BAAI model to enable vector-based retrieval.
- 🗂️ **Vector Store**: Embeddings, summaries, and metadata are stored for downstream retrieval and analysis.
- 📥 **Retrieval and Reporting**: Relevant text and tables are retrieved and synthesized by an LLM to produce a final financial report.

## 🏗️ System Architecture Overview

The **Financial PDF Analyzer** is designed as a modular pipeline, where each stage transforms data for the next. Here's an overview of the system flow:

<img src="./resources/Diagram.jpg" alt="System Architecture" width="900" height="500"/>

In [1]:
# importing libraries
import os
import json
from langchain.vectorstores import Chroma
from source.summary_generator import SummaryGenerator
from unstructured.staging.base import elements_from_json
from unstructured.staging.base import elements_to_json
from unstructured.partition.pdf import partition_pdf
from source.document_preprocessor import DocumentPreprocessor, Chunker
from source.financial_analysis_agent import FinancialAnalysisAgent
from source.multi_vector_store import TextVectorStoreBuilder, TableVectorStoreBuilder
from config import LANGSMITH_TRACING, LANGSMITH_ENDPOINT, PDF_FILE, CHUNK_SIZE, CHUNK_OVERLAP, LANGCHAIN_API_KEY, GEMINI_API_KEY, DATA_SAVE_PATH

---

## ⚙️ System Setup

The `config.py` file serves as a centralized system configuration.

- **API Keys**: Stores credentials for services like LangChain, LangSmith, and Gemini to ensure secure access to LLM and tracing functionalities.
- **Paths**: Defines file system locations for the input PDF, Chroma DB storage, and data output directories.
- **Model Configuration**: Sets the model names used for chatting, summarization, and embedding text/table data.
- **Chunking Settings**: Controls the `CHUNK_SIZE` and `CHUNK_OVERLAP` used in preprocessing and splitting document content.
- **Device Selection**: Allows specification of the device (CPU/GPU) for running embedding models efficiently.
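
The notebook does not show `config.py` itself, so the following is only a sketch of what such a file might look like. The constant names are the ones imported in the first cell; every value, the `DEVICE` name, and the directory in `PDF_FILE` are placeholders assumed for illustration.

```python
# Hypothetical config.py: names follow the notebook's import cell, values are placeholders.
import os

# API keys are read from the environment so they stay out of version control.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
LANGCHAIN_API_KEY = os.environ.get("LANGCHAIN_API_KEY", "")

# Optional LangSmith tracing; kept as strings because they are copied into os.environ.
LANGSMITH_TRACING = "false"
LANGSMITH_ENDPOINT = "https://api.smith.langchain.com"

# Input and output locations (the input directory is an assumption).
PDF_FILE = "data/pfizer-report.pdf"
DATA_SAVE_PATH = "outputs"

# Chunking settings, in tokens (placeholder values).
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100

# Device used by the embedding models (assumed name, not imported in the notebook).
DEVICE = "cpu"
```

Keeping the tracing flag as a string avoids a `TypeError` when it is later assigned to `os.environ`, which only accepts strings.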



---


📁 Creating an output directory based on the input PDF filename to store extracted and processed data.


In [2]:
pdf_output_path = os.path.join(DATA_SAVE_PATH, PDF_FILE.split("/")[-1].replace(".pdf", ""))
if not os.path.exists(pdf_output_path):
    os.makedirs(pdf_output_path)
print(f"Outputs path: {pdf_output_path}")

Outputs path: outputs/pfizer-report


**Data**: For development and testing, the Pfizer annual report [PFE-2022-Form-10K-FINAL (without Exhibits)](https://s28.q4cdn.com/781576035/files/doc_financials/2022/ar/PFE-2022-Form-10K-FINAL-(without-Exhibits).pdf) was used, which contains a total of 144 pages.


### 🧭 Optional: LangSmith Configuration

LangSmith is optionally integrated into the system to provide detailed tracing, debugging, and monitoring of LangChain workflows. This setup is especially valuable during development or for auditing model behaviors in production.

- **Tracing Enablement**: Controlled via the `LANGSMITH_TRACING` flag, which toggles tracing functionality.
- **API Configuration**: Uses `LANGSMITH_ENDPOINT` and `LANGCHAIN_API_KEY` to connect with the LangSmith service.
- **Use Cases**: Helps in visualizing execution chains, diagnosing failures, analyzing latency, and understanding prompt-response interactions.

This optional configuration is fully decoupled—developers can enable or disable it without affecting core functionality, making it a flexible addition for deeper observability.
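
One way to keep tracing decoupled is to export the LangSmith variables only when the flag is on. The helper below is a sketch under that assumption; `configure_langsmith` is a hypothetical name and is not part of the project or of LangSmith.

```python
import os


def configure_langsmith(tracing, endpoint, api_key, env=os.environ):
    """Export LangSmith settings to `env` only when tracing is enabled.

    Hypothetical helper: `tracing` may be a bool or a string such as "true".
    Returns True when tracing was switched on.
    """
    enabled = str(tracing).strip().lower() in {"1", "true", "yes"}
    env["LANGSMITH_TRACING"] = "true" if enabled else "false"
    if enabled:
        # Credentials are written only when they are actually needed.
        env["LANGSMITH_ENDPOINT"] = endpoint
        env["LANGSMITH_API_KEY"] = api_key
    return enabled
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real process environment.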


In [3]:
# Setting up Langsmith
os.environ["LANGSMITH_TRACING"] = LANGSMITH_TRACING
os.environ["LANGSMITH_ENDPOINT"] = LANGSMITH_ENDPOINT
os.environ["LANGSMITH_API_KEY"] = LANGCHAIN_API_KEY

## 📄 Step 1: Load PDF

In this step, the input PDF is parsed into a list of unstructured elements using `partition_pdf` from the `unstructured` library. These elements can include text blocks, titles, tables, and images.

### ⚙️ Key Parameters Used:
- `filename`: Path to the input PDF file.
- `strategy="hi_res"`: Uses a high-resolution layout detection model for better accuracy in identifying content structure.
- `extract_images_in_pdf=True`: Extracts images embedded within the PDF.
- `infer_table_structure=True`: Extracts the structure of detected tables; `unstructured` stores each table's HTML representation in `metadata.text_as_html`.
- `table_extraction_mode="lattice"`: Extracts tables using visible gridlines. (`"stream"` can be used for borderless, spacing-based tables).
- `skip_infer_table_types=[]`: Lists the document types for which table extraction is skipped (empty here, so table extraction is never skipped).

📦 **Why Save as JSON?**  
The parsing process can be time-consuming, especially for large PDFs. Saving the extracted elements to a JSON file (`raw_unstructured_elements.json`) allows for faster reloads and avoids redundant parsing in future runs.
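
The caching idea can be sketched generically as "load the file if it exists, otherwise compute and save". The helper below uses only the standard library and plain JSON data; `load_or_compute` is a hypothetical name, not a function from `unstructured`.

```python
import json
import os


def load_or_compute(cache_path, compute):
    """Return cached JSON data if present; otherwise compute, cache, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = compute()  # the expensive step, e.g. parsing a large PDF
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

In this notebook the same roles are played by `partition_pdf` followed by `elements_to_json` (compute and save) and by `elements_from_json` (reload).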

In [5]:
# 1. Load PDFs
print("Loading PDF...")

pdf_elements = partition_pdf(
        filename=PDF_FILE,
        strategy="hi_res",
        extract_images_in_pdf=True,
        infer_table_structure=True,  # Extract table structure (kept as HTML in each table's metadata).
        table_extraction_mode="lattice",  # "lattice": tables with visible grid lines; "stream": borderless, spacing-based tables.
        skip_infer_table_types=[],
    )

print(f"Total elements: {len(pdf_elements)}")

# Saving to JSON file
_ = elements_to_json(pdf_elements, filename=os.path.join(pdf_output_path, "raw_unstructured_elements.json"))

Loading PDF...
Total elements: 1935


In [3]:
# Load elements from JSON
pdf_elements = elements_from_json(os.path.join(pdf_output_path, "raw_unstructured_elements.json"))
print(f"Total elements: {len(pdf_elements)}")

Total elements: 1935


📌 Extracted and categorized various element types (e.g., text, titles, tables, images) from the PDF to understand its content structure.

In [4]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Set of distinct element categories found in the document
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.Header'>": 27,
 "<class 'unstructured.documents.elements.Title'>": 343,
 "<class 'unstructured.documents.elements.NarrativeText'>": 882,
 "<class 'unstructured.documents.elements.Image'>": 12,
 "<class 'unstructured.documents.elements.Table'>": 98,
 "<class 'unstructured.documents.elements.Text'>": 124,
 "<class 'unstructured.documents.elements.ListItem'>": 428,
 "<class 'unstructured.documents.elements.FigureCaption'>": 20,
 "<class 'unstructured.documents.elements.Formula'>": 1}

## 🧹 Step 2: Preprocessing of Text and Table Elements

The preprocessing step transforms raw PDF elements into clean, structured text and table chunks:

- **Textual Preprocessing**: Groups together contiguous headers, titles, and text-like elements (narrative, list items, formulas, etc.) while preserving their order and associated metadata.
- **Table Handling**: Detects tables and converts them into HTML format for structure retention, storing them with their own metadata.
- **Metadata Aggregation**: Merges metadata like page numbers, languages, and element types for each chunk, making the data searchable and analyzable later.
- **Output**: The result is a list of `Element` objects categorized into `text_elements` and `table_elements`, ready for chunking or embedding.
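
The grouping idea can be illustrated with a small sketch. It is not the `DocumentPreprocessor` implementation, which the notebook does not show. It assumes elements are plain dictionaries with `type`, `text`, `page_number`, and an optional `html` key, and that a title appearing after body text starts a new group.

```python
def group_elements(elements):
    """Split elements into grouped text blocks and standalone table blocks."""
    text_blocks, table_blocks = [], []
    current = None  # text block currently being accumulated

    def flush():
        # Emit the accumulated text block, if any, and reset the accumulator.
        nonlocal current
        if current and current["parts"]:
            text_blocks.append({
                "type": "text",
                "text": "\n".join(current["parts"]),
                "metadata": {"page_numbers": sorted(current["pages"])},
            })
        current = None

    for el in elements:
        if el["type"] == "Table":
            flush()  # a table interrupts the run of contiguous text
            table_blocks.append({
                "type": "table",
                "text": el.get("html", el["text"]),
                "metadata": {"page_number": el["page_number"]},
            })
            continue
        starts_section = (
            el["type"] == "Title" and current is not None and current["has_body"]
        )
        if current is None or starts_section:
            flush()
            current = {"parts": [], "pages": set(), "has_body": False}
        current["parts"].append(el["text"])
        current["pages"].add(el["page_number"])
        if el["type"] != "Title":
            current["has_body"] = True
    flush()
    return text_blocks, table_blocks
```

Consecutive titles stay in the same block here, so a heading followed by a sub-heading is kept together with the text beneath them.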


In [5]:
# 2. Preprocessing text and table elements
print("Splitting text into chunks...")
preprocessor = DocumentPreprocessor(pdf_elements)

# Preprocessing groups element text by type, converts tables to HTML, and builds the matching metadata
elements = preprocessor.preprocess_as_html()
text_elements, table_elements = preprocessor.split_by_type()
print(f"Processed text elements count: {len(text_elements)}")
print(f"Processed table elements count: {len(table_elements)}")

Splitting text into chunks...
Processed text elements count: 384
Processed table elements count: 98


## 🔗 Step 3: Chunking Text Elements

In this step, large preprocessed text blocks are split into smaller chunks using a token-based approach:

- **Chunking Strategy**: The `Chunker` class breaks down long text into manageable pieces based on a predefined token limit (`CHUNK_SIZE`) with optional overlap (`CHUNK_OVERLAP`) to preserve context across chunks.
- **Metadata Preservation**: During chunking, relevant metadata is merged and attached to each chunk to retain traceability.
- **Result**: The `text_chunks` contain segmented text ready for embedding or further processing, while `table_chunks` are retained as-is without further splitting.
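
A sliding window with overlap can be sketched as follows. This is an illustration of the technique rather than the `Chunker` class itself, and it uses whitespace-separated words as a stand-in for whatever tokenizer the project uses.

```python
def chunk_tokens(text, chunk_size, overlap):
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()  # whitespace tokens stand in for a real tokenizer
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk starts with the last token of the previous one, which is how context carries across chunk boundaries.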


In [6]:
# Split text elements into chunks while carrying their metadata along
chunker = Chunker(chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP)
text_chunks = chunker.chunk_elements(text_elements) # chunking only text elements

table_chunks = table_elements

print(f"Text chunks count: {len(text_chunks)}")
print(f"Table chunks count: {len(table_chunks)}")

Text chunks count: 107
Table chunks count: 98


In [None]:
# Sample text chunk
print(text_chunks[10])

type='text' text="OPERATIONS</td><td>25</td></tr><tr><td>ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK</td><td>44</td></tr><tr><td>ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA</td><td>45</td></tr><tr><td>ITEM 9. CHANGES IN AND DISAGREEMENTS WITH ACCOUNTANTS ON ACCOUNTING AND FINANCIAL DISCLOSURE</td><td>101</td></tr><tr><td>ITEM 9A. CONTROLS AND PROCEDURES</td><td>101</td></tr><tr><td>ITEM 9B. OTHER INFORMATION</td><td>N/A</td></tr><tr><td>ITEM 9C. DISCLOSURE REGARDING FOREIGN JURISDICTIONS THAT PREVENT INSPECTIONS</td><td>N/A</td></tr><tr><td>PART Ill</td><td>104</td></tr><tr><td>ITEM 10. DIRECTORS, EXECUTIVE OFFICERS AND CORPORATE GOVERNANCE</td><td>104</td></tr><tr><td>ITEM 11. EXECUTIVE COMPENSATION</td><td>104</td></tr><tr><td>ITEM 12. SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT AND RELATED STOCKHOLDER MATTERS.</td><td>104</td></tr><tr><td>ITEM 13. CERTAIN RELATIONSHIPS AND RELATED TRANSACTIONS, AND DIRECTOR INDEPENDENCE</td><td>104<

In [11]:
# Sample table chunk
print(table_chunks[10])

type='table' text='<table><thead><tr><th>2022</th><th></th><th>$100.3</th></tr></thead><tbody><tr><td>2020</td><td>$41.7</td><td></td></tr></tbody></table>' metadata={'languages': 'eng', 'page_number': 31}


## 📄 Step 4: Summarizing Text and Table Elements

In this step, the content from the financial PDF is summarized:

- **Text and Table Summarization**: Textual data and tables are processed and summarized separately.
  
- **Using Gemini API**: The `SummaryGenerator` class uses the Google Gemini API to generate summaries for each chunk of text and table content.
  
- **Saving Summaries**: The summaries are saved as JSON files (`text_summaries.json` and `table_summaries.json`) for future use.
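
The summarization loop can be sketched with the model call injected as a function, so the example runs without an API key. The function below is a hypothetical stand-in for `SummaryGenerator.summarize_chunks`, whose real implementation is not shown; the failure handling is an assumption added to keep summaries aligned with chunks.

```python
def summarize_chunks(chunks, summarize, label="text"):
    """Summarize each chunk in order, returning exactly one summary per chunk."""
    summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing {label} #{i}")
        try:
            summaries.append(summarize(chunk))
        except Exception as exc:
            # Append a placeholder so summaries[i] still corresponds to chunks[i].
            print(f"  failed: {exc}")
            summaries.append("")
    return summaries


def fake_summarize(text):
    """Stand-in for an LLM call: returns the first eight words of the text."""
    return " ".join(text.split()[:8])
```

Keeping one entry per chunk matters later, because chunks and summaries are paired by position when the vector stores are built.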


In [13]:
# 3. Multi-Vector Retriever

# Initialize summarizer
summarizer = SummaryGenerator(api_key=GEMINI_API_KEY)

# Getting summaries
text_summaries = summarizer.summarize_chunks(text_chunks, label="text")
table_summaries = summarizer.summarize_chunks(table_chunks, label="table")

# Saving to file
with open(os.path.join(pdf_output_path, "text_summaries.json"), "w") as f:
    json.dump(text_summaries, f)
    
with open(os.path.join(pdf_output_path, "table_summaries.json"), "w") as f:
    json.dump(table_summaries, f)

Summarizing text #0
Summarizing text #1
Summarizing text #2
...

In [16]:
# Loading summaries
with open(os.path.join(pdf_output_path, "text_summaries.json"), "r") as f:
    text_summaries = json.load(f)
    
with open(os.path.join(pdf_output_path, "table_summaries.json"), "r") as f:
    table_summaries = json.load(f)

In [17]:
print(f"Text chunks length: {len(text_chunks)}")
print(f"Text summaries length: {len(text_summaries)}")

print(f"Table chunks length: {len(table_chunks)}")
print(f"Table summaries length: {len(table_summaries)}")

Text chunks length: 107
Text summaries length: 107
Table chunks length: 98
Table summaries length: 98


## 🧠 Step 5: Storing Summarized Chunks in Chroma Vector Database

This step stores the summarized text and table chunks into Chroma DB for semantic search and retrieval:

- **Persistence Setup**: 
  - Separate directories are created for storing text and table embeddings (`embeddings/chroma_text` and `embeddings/chroma_table`).

In [18]:
# Persist directories
text_pdir = os.path.join(pdf_output_path, "embeddings/chroma_text")
table_pdir = os.path.join(pdf_output_path, "embeddings/chroma_table")

if not os.path.exists(text_pdir):
    os.makedirs(text_pdir)
if not os.path.exists(table_pdir):
    os.makedirs(table_pdir)

- **Vector Store Creation**: 
  - The `TextVectorStoreBuilder` and `TableVectorStoreBuilder` classes build vector stores using Hugging Face embedding models (Instructor-XL and BAAI `bge-base-en-v1.5` are the models named in the introduction).
  - Each chunk is converted into a `Document` object containing the original content and its corresponding summary as metadata.
  - These documents are embedded and stored in Chroma DB using `Chroma.from_documents()`.
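
Pairing each chunk with its summary can be sketched as below. `Doc` is a minimal stand-in for LangChain's `Document` so the example runs without LangChain installed, and chunks are assumed to be dictionaries with `text` and `metadata` keys; the builders' actual code is not shown in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_documents(chunks, summaries):
    """Attach each summary to its chunk's metadata, pairing them by position."""
    if len(chunks) != len(summaries):
        raise ValueError("chunks and summaries must have the same length")
    docs = []
    for chunk, summary in zip(chunks, summaries):
        metadata = dict(chunk.get("metadata", {}))  # copy so the chunk is not mutated
        metadata["summary"] = summary
        docs.append(Doc(page_content=chunk["text"], metadata=metadata))
    return docs
```

The length check guards against a silently truncated `zip`, which would otherwise drop chunks if a summarization run was interrupted.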


In [19]:
# Storing texts to vector DB
text_builder = TextVectorStoreBuilder(text_pdir)
text_store, text_retriever = text_builder.build_store_and_retriever(text_chunks, text_summaries)



In [20]:
# Storing tables to vector DB
table_builder = TableVectorStoreBuilder(table_pdir)
table_store, table_retriever = table_builder.build_store_and_retriever(table_chunks, table_summaries)


- **Retriever Configuration**: 
  - A retriever is created from each Chroma store using **MMR (Maximal Marginal Relevance)** to balance relevance and diversity in search results.
  - The retrievers are configured with custom `lambda_mult` values and return the top 5 results (`k=5`).

- **Output**:
  - Two retrievers (`text_retriever` and `table_retriever`) are initialized for querying textual and tabular content respectively from the stored embeddings.
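
To make the role of `lambda_mult` concrete, here is a small pure-Python sketch of MMR selection over embedding vectors. It illustrates the technique only and is not the code Chroma or LangChain runs internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A `lambda_mult` of 1 ranks purely by relevance, while values closer to 0 favor diversity. LangChain vector stores expose this behavior through `as_retriever(search_type="mmr", search_kwargs={"k": 5, "lambda_mult": ...})`; whether the builders here call it exactly that way is an assumption.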


In [11]:
# Loading text retriever
text_builder = TextVectorStoreBuilder(text_pdir)
text_store = Chroma(
    persist_directory=text_pdir,
    embedding_function=text_builder.embedding_model
)

text_retriever = text_builder.get_retriever(text_store)



In [12]:
# Loading table retriever
table_builder = TableVectorStoreBuilder(table_pdir)
table_store = Chroma(
    embedding_function=table_builder.embedding_model,
    persist_directory=table_pdir,
)

table_retriever = table_builder.get_retriever(table_store)

## 📊 Step 6: Generating the Financial Analysis Report

This step creates a comprehensive markdown report summarizing key insights from the financial PDF:

- **Agent Setup**:  
  - The `FinancialAnalysisAgent` is initialized with retrievers for both text and table embeddings.
  - It uses Google’s Gemini chat model to generate report content.

- **Section-wise Reporting**:  
  - A predefined set of financial topics (e.g., Executive Summary, Revenue & Profit Trends, Liquidity & Solvency) guides the structure of the report.
  - For each topic, relevant chunks are retrieved from the vector stores using the corresponding query.

- **LLM-Powered Generation**:  
  - The context (retrieved content) is passed to the language model with a tailored prompt to generate markdown-formatted analysis.
  - Each section includes titles and attempts to reference metadata such as page numbers.

- **Output**:  
  - All generated sections are combined into a single markdown report (`Financial_Analysis_Report.md`) and saved to disk.
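
The section-wise flow can be sketched with the retrievers and the language model passed in as plain callables. This is an assumed outline of what `generate_full_report` might do, not the agent's real code; the prompt wording is illustrative.

```python
def generate_report(sections, retrieve_text, retrieve_tables, llm):
    """Build a markdown report with one LLM-written section per topic.

    `sections` maps a section title to its retrieval query; the two retrieve
    callables return lists of strings and `llm` maps a prompt to markdown text.
    """
    parts = []
    for title, query in sections.items():
        # Merge text and table hits into a single context block.
        context = "\n\n".join(retrieve_text(query) + retrieve_tables(query))
        prompt = (
            f"Write the '{title}' section of a financial report in markdown.\n"
            "Cite page numbers where the context provides them.\n\n"
            f"Context:\n{context}"
        )
        parts.append(f"## {title}\n\n{llm(prompt)}")
    return "\n\n".join(parts)
```

Injecting the callables keeps the report logic testable with stubs before any paid model call is made.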

In [22]:
# 5. Run Financial Analysis
print("\nRunning Financial Analysis...")
agent = FinancialAnalysisAgent(text_retriever, table_retriever)

full_report = agent.generate_full_report()
print("\n=== Executive Summary Report ===")
with open(os.path.join(pdf_output_path, "Financial_Analysis_Report.md"), "w", encoding="utf-8") as f:
    f.write(full_report)
print("Report saved at",  os.path.join(pdf_output_path, "Financial_Analysis_Report.md"))



Running Financial Analysis...

## Executive Summary

* **Total Revenues:** $100.3 billion in 2022 (a 23% increase from 2021) (p. 31)
* **Net Cash Flow from Operations:** $29.3 billion in 2022 (a 10% decrease from 2021) (p. 31)
* **Reported Diluted EPS:** $5.47 in 2022 (a 42% increase from 2021) (p. 31)
* **Adjusted Diluted EPS (Non-GAAP):** $6.58 in 2022 (a 62% increase from 2021) (p. 31)
* **Strategic Actions:** Spin-off of Upjohn and sale of Meridian (p. 31).  Restructuring since 2019 resulted in a more focused structure with Biopharma as the only reportable operating segment (p. 31).  Further organizational changes in 2022 aimed to optimize operations and R&D (p. 31).




## ❓ Step 7: QA Testing on Financial Data

This step validates the retriever and LLM integration by answering specific financial questions:

- **QA Agent Setup**:  
  - `RetrieverQATester` is initialized with both text and table retrievers along with the Gemini chat model.

- **Context Retrieval**:  
  - For a given question, relevant chunks are fetched from both retrievers and merged as context.

- **Answer Generation**:  
  - The combined context and question are passed through a prompt to the LLM, which returns a detailed answer with potential metadata references.
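
Merging hits from both retrievers into one prompt might look like the sketch below. It assumes each hit is a dictionary with `text` and `metadata` keys; the actual prompt used by `RetrieverQATester` is not shown in the notebook.

```python
def format_hit(hit):
    """Prefix a retrieved chunk with its page number so the model can cite it."""
    page = hit["metadata"].get("page_number", "n/a")
    return f"[page {page}] {hit['text']}"


def build_qa_prompt(question, text_hits, table_hits):
    """Combine text and table hits into a single grounded question prompt."""
    context = "\n".join(format_hit(h) for h in text_hits + table_hits)
    return (
        "Answer the question using only the context below. "
        "Mention page numbers when they are available.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Putting the page number next to each chunk is what lets an answer point back to a specific page of the filing.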


In [23]:
# QA Testing
from source.retriever_qa_tester import RetrieverQATester

qa_tester = RetrieverQATester(text_retriever, table_retriever)

question = "What was the total equity of Pfizer as of December 31, 2022?"
answer = qa_tester.ask(question)

print("Answer:", answer)

Answer: Pfizer's total equity as of December 31, 2022, was $95,916 million.  This is found in the table on page 54 of the 2022 Form 10-K.


In [24]:
question = "What is the total revenue geographically??"
answer = qa_tester.ask(question)

print("Answer:", answer)

Answer: In 2022, Pfizer's worldwide revenues totaled $100.33 billion, a 23% increase from 2021.  This is broken down as follows: U.S. revenues increased by 39%, while international revenues increased by 12%.  Emerging market revenues, however, decreased by 3% to $20.1 billion due to foreign exchange impacts.


In [25]:
question = "What is the total revenue geographically??"
answer = qa_tester.ask(question)

print("Answer:", answer)

Answer: The total worldwide revenue for 2022 was $100.33 billion, a 23% increase from 2021.  This is broken down as follows:

* **U.S.:** $42.473 billion (39% increase from 2021)
* **International:** $57.857 billion (12% increase from 2021)

Emerging market revenues decreased 3% to $20.1 billion in 2022.  A more detailed geographic breakdown for 2020 and 2021 can be found in the Revenues by Geography section of the 2021 Form 10-K.


In [25]:
answer = qa_tester.ask("Tell me about the administrative expenses")

print("Answer:", answer)

Answer: The provided text mentions selling, informational, and administrative expenses in several sections, but doesn't offer a comprehensive overview of these expenses as a single entity.  Instead, it shows how these expenses are affected by other factors:

* **Restructuring Charges:**  A portion of additional depreciation related to asset restructuring is recorded within selling, informational, and administrative expenses (page 44, section N).  The amounts vary yearly.  Implementation costs associated with acquisitions and cost-reduction initiatives are also partially allocated to these expenses (page 44, section N).

* **Discontinued Operations:** Selling, informational, and administrative expenses related to discontinued operations are detailed in a table on page 67 (section 2022 Form 10-K Notes to Consolidated Financial Statements).  The amounts are $8 million in 2022, $26 million in 2021, and $1,682 million in 2020.

  There is no single, total figure for selling, informational, 

### 🔭 Future Enhancements

Planned improvements to further enhance the system’s capabilities:

1. **Image Semantics Integration**  
   Incorporate visual analysis to interpret diagrams, charts, and scanned figures for deeper insights.

2. **TOC-Guided Search**  
   Use the Table of Contents to enable structured, section-aware retrieval for more contextually accurate results.

3. **Table Correction via Page Image Reference**  
   Enhance table accuracy by validating and correcting extracted tables against their original visual representation on the page.

4. **Smarter QA with Agents and Memory**  
   Improve the question-answering experience by leveraging agentic reasoning and memory to handle multi-turn and context-rich queries.



---

### 📚 References

The following tools, libraries, and resources were instrumental in building this financial PDF analyzer:

- **LangChain** – For powerful chaining, retrieval, and LLM orchestration.  
  [https://www.langchain.com](https://www.langchain.com)

- **Semi_Structured_RAG** – For a hands-on multi-vector retriever example.  
  [https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb)

- **Chroma** – For efficient local vector store and retrieval.  
  [https://www.trychroma.com](https://www.trychroma.com)

- **Google AI Studio (Gemini)** – For Gemini APIs.  
  [https://aistudio.google.com/welcome](https://aistudio.google.com/welcome)

- **Unstructured.io** – For high-quality document element extraction.  
  [https://www.unstructured.io](https://www.unstructured.io)
  
- **HuggingFace** – For pretrained embedding models.  
  [https://huggingface.co](https://huggingface.co)

---

### 🙏 Acknowledgements

A heartfelt thank you to the amazing open-source communities, tool creators, and researchers who made these technologies accessible. Your contributions empower builders like me to bring complex ideas to life.

Special thanks to:

- The teams behind **LangChain**, **Unstructured**, and **Chroma** for their excellent documentation and active development.
- Open-source contributors at **Hugging Face** for democratizing machine learning.
- **Google** for providing access to free-tier Gemini APIs that enabled high-quality language understanding.

---  