# Simple RAG for 10-K Filings

This notebook demonstrates a compact Retrieval-Augmented Generation (RAG) pipeline that answers targeted questions using public 10‑K filings from the U.S. Securities and Exchange Commission (SEC).

Scope: process a 10‑K document, split it into chunks, create embeddings, store vectors in FAISS, retrieve relevant passages, and generate answers with a large language model.

Learning goals:
- Load and inspect 10‑K filings programmatically.
- Split long documents into overlapping chunks for retrieval.
- Produce embeddings and build a FAISS vector store.
- Implement a retriever + prompt template to feed context to an LLM.
- Generate and post-process model answers grounded in source text.

Definitions:
- **RAG (Retrieval-Augmented Generation)**: a pattern that augments a generative model with retrieved documents so answers are grounded in external sources.
- **10‑K**: an annual report that public companies file with the SEC describing business operations, risks, and financials.

**Objective**: Use the specified source document(s) to answer the question below, and show the supporting passages used to form the answer.

Question: "What technological advancements were made in the batteries used in Tesla's electric vehicles?"
Source: Tesla 2023 Form 10‑K (SEC)

# How to Run this Notebook

Follow these steps to prepare the environment and run all cells. This notebook expects an OpenAI API key to generate embeddings and call the model.

1. Get an OpenAI API key:
   - Visit: https://platform.openai.com/account/api-keys and create a new API key.
   - Save it securely; you will not be able to view the key again after creation.

2. Add the API key to your notebook's Secrets manager:
   - Click the Secrets / Keys icon in the notebook UI (left sidebar).
   - Click '+ Add new secret'.
   - **Name**: `OPEN_AI_KEY` (the notebook's code reads this name).
   - **Value**: paste your OpenAI API key.

   Note: the code in this notebook maps the secret `OPEN_AI_KEY` into the environment variable `OPENAI_API_KEY` before using OpenAI clients.

3. Enable access for this notebook to the secret (toggle access permissions in the Secrets UI).

4. Optional: if you prefer a different secret name (for example `OPENAI_API_KEY`), update the small setup cell that reads the secret or add both names to the Secrets manager.

5. Security best practices:
   - Never commit API keys to source control.
   - Use scoped/organizational keys when possible and rotate keys periodically.

6. Run the notebook:
   - Click 'Runtime' → 'Run all' (or run cells sequentially).

If you encounter authentication errors, verify the secret name and that the environment variable `OPENAI_API_KEY` is being set in the setup cell.
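
A quick way to rule out the most common authentication problem is to confirm that the variable is present without printing its value. The sketch below is illustrative only; `describe_api_key` is a hypothetical helper, not part of this notebook or of the OpenAI client.

```python
import os

def describe_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return a short status string without revealing the key itself."""
    value = env.get(name)
    if not value:
        return f"{name} is not set"
    # Report only the length and a short prefix so the key never appears in output.
    return f"{name} is set ({len(value)} characters, starts with {value[:3]!r})"

print(describe_api_key())
```

If the message says the variable is not set, re-run the setup cell and check the secret name and its access toggle.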

## Basic Setup

This notebook was developed for Python 3.10+ and is intended to run in a Jupyter / Colab-like environment.
Recommended workflow:
- Use a virtual environment (`venv` or `conda`) to isolate dependencies.
- On macOS (zsh) activate your environment before running the notebook.
- If you use Google Colab, follow the 'How to Run this Notebook' cell to add secrets.

Quick setup commands (macOS, zsh):
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
```

If you plan to run larger models locally, ensure appropriate hardware (GPU + CUDA) and install GPU-compatible builds of libraries. Otherwise, use CPU-only packages or cloud APIs.

## Install Frameworks

Install the Python packages required by this notebook. The provided pip command installs the main dependencies used in the examples below. Pin versions for reproducibility if needed.

Run this cell to install dependencies (works in Colab and most local environments):

```bash
!pip install langchain langchain_core langchain_community faiss-cpu openai langchain_openai langchain_huggingface -U
```

Note: On local macOS systems you may prefer `conda install -c conda-forge faiss-cpu` if pip wheels fail.

### Key libraries used

- **langchain / langchain_core / langchain_community**: A framework for composing LLM pipelines, chains, document loaders, and retrievers.
- **langchain_huggingface**: Integration helpers for using HuggingFace embeddings and models within LangChain workflows.
- **faiss-cpu**: Facebook AI Similarity Search — an in-memory vector index for fast nearest-neighbor retrieval (CPU build). Use GPU FAISS for large-scale datasets if available.
- **openai / langchain_openai**: Official OpenAI Python client and LangChain's OpenAI integration for embeddings and chat/completion calls.

Optional utilities: `transformers` and `sentence-transformers` if you want to run local HuggingFace embedding models instead of using OpenAI (lower API cost, potentially higher runtime).

Advice: Use OpenAI embeddings for high-quality vectors but be mindful of API costs. HuggingFace models are a viable local alternative for experimentation or cost-sensitive workflows.

In [None]:
%%capture
# Install necessary libraries.
# We use %%capture to suppress the extensive output logs during installation.
!pip install langchain langchain_core langchain_community faiss-cpu openai langchain_openai langchain_huggingface -U


In [None]:
import langchain
import langchain_core
import langchain_community
import openai

# Verify installed versions to ensure compatibility
print(f"langchain version: {langchain.__version__}")
print(f"langchain_core version: {langchain_core.__version__}")
print(f"langchain_community version: {langchain_community.__version__}")
# FAISS and langchain_openai do not have a standard __version__ attribute exposed here
print(f"openai version: {openai.__version__}")

## API Keys Setup

This notebook uses API keys to access external services (OpenAI for embeddings/LLM calls; optional HuggingFace for embeddings/models). Below are recommended secret names and examples for Colab and local runs.

Recommended secret / environment variable names:
- `OPENAI_API_KEY` — the standard environment variable used by OpenAI client libraries.
- `HUGGINGFACEHUB_API_TOKEN` — optional token for HuggingFace Hub access if you use HuggingFace models.

Colab / Notebook Secrets (recommended):
- Add a secret named `OPEN_AI_KEY` or `OPENAI_API_KEY` in the notebook Secrets UI.
- The notebook contains a small setup cell that maps `OPEN_AI_KEY` → `OPENAI_API_KEY` for compatibility. If you prefer, add `OPENAI_API_KEY` directly and update the setup cell accordingly.

Example: local environment (macOS / zsh):
```bash
export OPENAI_API_KEY="sk-..."
export HUGGINGFACEHUB_API_TOKEN="hf_..."  # optional
```

Example: Google Colab Secrets mapping (the notebook does this automatically if you used the Secrets UI):
```python
# inside the notebook setup cell
import os
from google.colab import userdata  # only available in Colab
os.environ['OPENAI_API_KEY'] = userdata.get('OPEN_AI_KEY')  # maps the secret to the env var
```

Security notes:
- Never hard-code API keys in notebooks or commit them to source control.
- Prefer notebook Secret managers or environment variables for runtime-only access.
- Rotate keys regularly and limit scope where possible.


In [None]:
import os

# Load the API key from Colab User Data (Secrets)
if 'google.colab' in str(get_ipython):
    from google.colab import userdata
    # Set the environment variable that OpenAI libraries expect
    os.environ["OPENAI_API_KEY"] = userdata.get('OPEN_AI_KEY')

## Import Libraries

This cell imports the Python modules used in the RAG pipeline. Short descriptions help you understand each component's role:

- `WebBaseLoader`: downloads and parses web pages (used to fetch SEC filing HTML).
- `RecursiveCharacterTextSplitter`: splits long documents into overlapping chunks for retrieval.
- `PromptTemplate`: constructs prompt strings with placeholders for question/context.
- `FAISS`: in-memory vector index used as the vector store for nearest-neighbor search.
- `ChatOpenAI`: LangChain wrapper to call OpenAI chat/completion models (used for answer generation).
- `StrOutputParser`: extracts the text content of the model's response message and returns it as a plain string for display.
- `RunnablePassthrough`: a pipeline utility that forwards inputs unchanged (useful for questions).
- `Document`: LangChain document data type that holds `page_content` and `metadata`.
- `HuggingFaceEmbeddings`: optional embeddings implementation if you use HuggingFace models locally.
- `pprint`: pretty-print helper for readable debug output.

In [None]:
# LangChain imports for RAG components
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from pprint import pprint

# Configuration Dictionary

This dictionary centralizes settings used throughout the notebook so you can tune behavior from one place. Below are the most important fields and recommended defaults with rationale:

- `chunkSize` (int, default 500): maximum length of each text chunk, measured in characters (the default length function of `RecursiveCharacterTextSplitter` is `len`). Larger chunks preserve more context but increase embedding cost and may reduce retrieval precision. Start with 400–800 for SEC filings.
- `chunkOverlap` (int, default 50): number of overlapping characters between adjacent chunks to preserve cross-boundary context. Typical values: 20–200 depending on chunkSize.
- `userAgentHeader` (str): custom User-Agent header for web requests. Use a descriptive string (e.g., `YourCompany-ResearchBot/1.0 (email@example.com)`) to comply with site policies and avoid being blocked.
- `embeddingModelName` (str): embedding model identifier (e.g., `text-embedding-3-small`). Use high-quality embeddings for better retrieval; choose cheaper or local models for experimentation.
- `numRetrievedDocuments` (int): how many candidate passages the retriever returns (k). Higher `k` increases recall but may add noise—common values 3–10.
- `numSelectedDocuments` (int): how many documents to include when formatting the final `context` passed to the LLM. Keep this small enough to fit the model context window (2–6).
- `ragAnswerModel` (str): name of the LLM used for answer generation. Choose a model with enough context and reasoning capacity.
- `ragAnswerModelTemeprature` (float): sampling temperature for generation. OpenAI chat models accept 0.0–2.0, and 0.0–1.0 is the usual working range. Lower values produce more deterministic, factual answers; higher values increase creativity.
- `companyFilingUrls` (list of tuples): target sources as `(company_name, url)` tuples. Use canonical SEC URLs or local file paths if you prefer offline files.
- `ragPromptTemplate` (str): the prompt template used to ask the LLM. Keep prompts explicit about using only provided context and requesting source citations if desired.

Tuning tips:
- If answers contain hallucinations, lower the temperature to 0.0–0.3 (the default in this notebook is 0.7), reduce `numSelectedDocuments`, and check that the retriever returns relevant passages (adjust the embedding model or `k`).
- For long documents, increase `chunkOverlap` slightly to preserve sentence continuity across chunks.
- For cost-sensitive runs, use smaller embedding models or run HuggingFace embeddings locally; measure retrieval F1/precision to select tradeoffs.
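
Because several of these settings constrain one another, a small sanity check before running the pipeline can catch mistakes early. The following is a minimal sketch; `validate_config` and the two rules it enforces are assumptions made for illustration, not checks performed by the notebook or by LangChain.

```python
def validate_config(cfg):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    # An overlap as large as the chunk itself would never advance through the text.
    if cfg["chunkOverlap"] >= cfg["chunkSize"]:
        problems.append("chunkOverlap should be smaller than chunkSize")
    # Selecting more passages than were retrieved has no effect.
    if cfg["numSelectedDocuments"] > cfg["numRetrievedDocuments"]:
        problems.append("numSelectedDocuments should not exceed numRetrievedDocuments")
    return problems

example = {
    "chunkSize": 500,
    "chunkOverlap": 50,
    "numRetrievedDocuments": 5,
    "numSelectedDocuments": 5,
}
print(validate_config(example))  # an empty list means no problems were found
```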


In [7]:
defaultConfig = {
    # Document processing settings
    "chunkSize": 500,          # Size of each text chunk
    "chunkOverlap": 50,        # Overlap between chunks to preserve context
    "userAgentHeader": "YourCompany-ResearchBot/1.0 (your@email.com)",

    # Embedding model configuration
    "embeddingModelName": "text-embedding-3-small",  # OpenAI's efficient embedding model

    # Vector store retrieval settings
    "numRetrievedDocuments": 5,

    # Document formatter settings
    "numSelectedDocuments": 5,

    # LLM generation settings
    "ragAnswerModel": "gpt-4o",
    "ragAnswerModelTemeprature": 0.7,

    # Target data source
    "companyFilingUrls": [
        ("Tesla", "https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm")
    ],

    # RAG prompt template
    "ragPromptTemplate": """
    Give an answer for the `Question` using only the given `Context`. Use information relevant to the query from the entire context.
    Provide a detailed answer with thorough explanations, avoiding summaries.

    Question: {question}

    Context: {context}

    Answer:
    """
}

In [8]:
# Create a working copy of the configuration to avoid accidental modification of defaults.
# deepcopy also copies nested values such as the companyFilingUrls list.
import copy
config = copy.deepcopy(defaultConfig)

# Load Document

This section demonstrates loading the target 10‑K filing from the SEC using LangChain's `WebBaseLoader` and best practices for responsible web access.

About `WebBaseLoader`:
- `WebBaseLoader` fetches a URL and extracts the page content into a LangChain `Document` object (with `page_content` and `metadata`).
- It is convenient for single-page HTML sources such as SEC filings, but may require HTML cleaning depending on the site structure.

User-Agent and polite scraping:
- Use a clear `User-Agent` (configured via `userAgentHeader` in `config`) so the server can identify your requests (e.g., `MyOrg-ResearchBot/1.0 (email@example.com)`).
- Respect `robots.txt`, rate limits, and site terms of service. For the SEC and many public data sources a polite scraping cadence is expected.
- If you expect many requests or bulk downloads, prefer official APIs or bulk data feeds where available (SEC provides bulk access for filings).

Caching and reproducibility:
- For repeatable experiments, cache downloaded documents locally instead of re-fetching on every run. This reduces network load and avoids accidental rate-limiting.
- Consider saving the raw HTML or extracted text alongside metadata (source URL and retrieval timestamp) for traceability.
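
The caching idea above can be sketched with the standard library alone. This is an illustrative outline, not the notebook's loader: `load_with_cache` is a hypothetical helper, and the `fetch` argument stands in for whatever function actually downloads the page (for example, a wrapper around `WebBaseLoader`).

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def load_with_cache(url, fetch, cache_dir):
    """Return the text for `url`, calling `fetch(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    text_path = cache_dir / f"{key}.txt"
    meta_path = cache_dir / f"{key}.json"
    if text_path.exists():
        return text_path.read_text(encoding="utf-8")
    text = fetch(url)
    text_path.write_text(text, encoding="utf-8")
    # Keep the source URL and retrieval time next to the text for traceability.
    meta = {"source": url, "retrieved_at": datetime.now(timezone.utc).isoformat()}
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return text

# Demonstration with a stand-in fetcher so no network access is needed.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "example filing text"

with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("https://example.com/filing.htm", fake_fetch, tmp)
    second = load_with_cache("https://example.com/filing.htm", fake_fetch, tmp)
print(len(calls))  # the fetcher ran once; the second call was served from the cache
```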

Alternatives and fallbacks:
- If the target page structure causes noisy HTML extraction, save the filing locally (or use the SEC bulk-download) and use a file loader instead of `WebBaseLoader`.
- For large-scale crawling use robust tools (Scrapy, newspaper3k, or custom parsers) and respect site policies.

The code cell below initializes `WebBaseLoader` with a custom `User-Agent` and loads the Tesla 10‑K filing into a `Document` object for downstream processing.

In [9]:
# Extract URL and Company Name from config
url = config["companyFilingUrls"][0][1]
company = config["companyFilingUrls"][0][0]

# Initialize the WebBaseLoader with a proper User-Agent
loader = WebBaseLoader(
    url,
    header_template={'User-Agent': config["userAgentHeader"]}
)

# Load the content
docs = loader.load()
print(f"Loaded {len(docs)} document(s) with {len(docs[0].page_content)} characters")

Loaded 1 document(s) with 422473 characters


In [10]:
# Verify the type and count of loaded documents
print(type(docs), len(docs))

<class 'list'> 1


In [11]:
# Check the type of a single document object
type(docs[0])

In [12]:
# Inspect the metadata of the loaded document
docs[0].metadata

{'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm',
 'title': 'tsla-20231231',
 'language': 'No language found.'}

In [13]:
# Preview the first 100 characters of the raw content
docs[0].page_content[:100]

'\ntsla-20231231false00013186052023FYhttp://fasb.org/us-gaap/2023#AccountingStandardsUpdate202006Membe'

# Split Document into Chunks

Long documents (like 10‑Ks) must be split into smaller chunks before creating embeddings and indexing. This cell uses `RecursiveCharacterTextSplitter` to produce overlapping chunks that preserve local context.

Key concepts and tradeoffs:
- `chunkSize` (measured in characters by the splitter used here): larger chunks keep more context per vector but yield fewer distinct vectors, which can lower recall for narrowly phrased questions. Smaller chunks create more vectors (often higher recall), but each vector holds less context and information may be fragmented across chunks.
- `chunkOverlap`: overlapping characters between adjacent chunks. Overlap helps preserve sentence continuity and avoids cutting important phrases; typical values are 10–30% of `chunkSize` (e.g., 50 for 500).
- Tokens vs characters: many splitters work in characters; when working with token-limited models, approximate tokens ≈ characters/4 as a rough heuristic, or use a tokenizer to be precise.
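
To make the overlap idea concrete, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators); `sliding_window_chunks` and `estimate_tokens` are hypothetical helpers that only illustrate size, overlap, and the characters/4 heuristic.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

def estimate_tokens(text):
    """Rough heuristic: about four characters per token."""
    return len(text) // 4

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
print(estimate_tokens("x" * 500))  # 125
```

Each chunk begins with the last three characters of the previous one, which is the same mechanism that preserves cross-boundary context in the real splitter.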

Practical recommendations for SEC filings:
- Start with `chunkSize=500` and `chunkOverlap=50` (good default used in this notebook).
- If retrieval returns overly broad passages, reduce `chunkSize` (e.g., 300–400) to increase precision.
- If answers cut sentences or lose context at boundaries, increase `chunkOverlap` to 100–200 or switch to a sentence-aware splitter.

Measuring effects:
- Evaluate retrieval quality by manually inspecting top-k retrieved chunks for a sample of queries.
- Track metrics such as whether the supporting passage contains a direct answer (precision) and whether the retriever returns at least one relevant passage (recall).
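
Once a few queries have been labelled by hand, both metrics can be computed in a few lines. The sketch below assumes a hypothetical data layout (retrieved chunk ids per query, plus a set of relevant ids per query); the ids and function names are invented for illustration.

```python
def hit_rate(results, relevant):
    """Fraction of queries whose retrieved chunks include at least one relevant chunk."""
    hits = sum(1 for q, retrieved in results.items() if set(retrieved) & relevant[q])
    return hits / len(results)

def mean_precision_at_k(results, relevant):
    """Average share of retrieved chunks that are relevant, per query."""
    scores = [len(set(r) & relevant[q]) / len(r) for q, r in results.items()]
    return sum(scores) / len(scores)

# Hypothetical chunk ids: retrieved ids per query and hand-labelled relevant ids.
results = {"q1": [3, 7, 9, 12], "q2": [1, 2, 4, 5]}
relevant = {"q1": {7, 12}, "q2": {8}}

print(hit_rate(results, relevant))             # 0.5
print(mean_precision_at_k(results, relevant))  # 0.25
```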

Implementation note:
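- The `RecursiveCharacterTextSplitter` used below tries a hierarchy of separators in order (paragraph breaks, newlines, spaces, and finally individual characters), falling back to the next one only when a piece is still too long. This avoids arbitrary cuts where possible and produces readable chunks for embedding. It is not sentence-aware by default.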

In [14]:
# Initialize the text splitter with config parameters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=config["chunkSize"],
    chunk_overlap=config["chunkOverlap"]
)

# Split the loaded documents into chunks
chunks = splitter.transform_documents(docs)

In [15]:
# Verify the number of resulting chunks
type(chunks), len(chunks)

(list, 942)

In [16]:
# Inspect metadata of the first chunk
chunks[0].metadata

{'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm',
 'title': 'tsla-20231231',
 'language': 'No language found.'}

In [17]:
# Inspect content of a specific chunk (e.g., the 100th chunk)
chunks[100].page_content

'dangerous aspects of road travel much like the system that airplane pilots use, when conditions permit. As with other vehicle systems, we improve these functions in our vehicles over time through over-the-air updates.We intend to establish in the future an autonomous Tesla ride-hailing network, which we expect would also allow us to access a new customer base even as modes of transportation evolve.We are also applying our artificial intelligence learnings from self-driving technology to the'

In [18]:
# Inspect the next chunk to observe the overlap
chunks[101].page_content

'learnings from self-driving technology to the field of robotics, such as through Optimus, a robotic humanoid in development, which is controlled by the same AI system. 5Table of ContentsEnergy Generation and StorageEnergy Storage ProductsWe leverage many of the component-level technologies from our vehicles in our energy storage products. By taking a modular approach to the design of battery systems, we can optimize manufacturing capacity of our energy storage products. Additionally, our'

In [19]:
# Enrich metadata: Add the company name to each chunk for better traceability
for chunk in chunks:
    chunk.metadata["company"] = company

In [20]:
# Verify metadata update
chunks[0].metadata

{'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm',
 'title': 'tsla-20231231',
 'language': 'No language found.',
 'company': 'Tesla'}

OpenAIEmbeddings is a LangChain wrapper that uses OpenAI's embedding API to generate vector embeddings. This requires an OpenAI API key to be set in your environment.

## Generate Embeddings

Embeddings convert text into numeric vectors that capture semantic meaning. These vectors enable similarity search (retrieval) by comparing distances between vectors instead of raw text matching.

**Theoretical Foundation: Embeddings as Semantic Space**

Modern embeddings (produced by transformers like BERT or GPT-based models) map text into a high-dimensional vector space (e.g., 1536 dimensions for OpenAI's `text-embedding-3-small`, the model configured in this notebook) where:
- **Semantically similar texts** have vectors that are close together (small distance).
- **Semantically different texts** have vectors that are far apart (large distance).
- **Vector operations** are meaningful: the vector for "king" minus "man" plus "woman" is close to "queen" (though this is an idealized view).

This property is what makes similarity search work: to find documents relevant to a query, we compare the query vector with every document vector, using either a distance (typically L2, where smaller is closer) or a similarity (typically cosine, where larger is closer), and return the nearest neighbors.

**Options in this notebook**:
- **OpenAIEmbeddings**: high-quality, managed embeddings provided by OpenAI. Easy to use and frequently produce better downstream retrieval accuracy, but incur API cost and latency.
- **HuggingFaceEmbeddings**: local or hosted models from HuggingFace (via `sentence-transformers` or `transformers`). Lower API cost (can run locally) but may require more compute and tuning.

**Practical Considerations and Tradeoffs**:
- **Quality vs Cost**: OpenAI's embeddings often give better retrieval for small collections; for large-scale or cost-sensitive workflows, consider local HuggingFace models or smaller OpenAI models.
- **Dimensionality**: embedding dimension (e.g., 1536) affects index size and similarity behavior. Keep consistent embedding models for indexing and querying. Mismatched dimensions will cause errors.
- **Batching & Rate Limits**: Generate embeddings in batches to reduce per-call overhead and respect API rate limits. For 10k+ documents, batch requests in groups of 100–1000 to minimize roundtrips.
- **Caching**: Cache embeddings locally in JSON or pickle format to avoid repeated calls for the same text, reducing costs and latency on subsequent runs.
- **Normalization**: Some retrieval methods benefit from L2-normalized vectors (each vector has length 1). With normalized vectors the inner product equals cosine similarity, so a FAISS `IndexFlatIP` (inner product) index then ranks by cosine similarity; check your index configuration.
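
Batching and caching can be combined in one small wrapper. The following is a sketch under stated assumptions: `embed_with_cache` and `batched` are hypothetical helpers, the cache is a plain dictionary keyed by a SHA-256 hash of the text, and `embed_batch` stands in for any function that maps a list of texts to a list of vectors.

```python
import hashlib

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_with_cache(texts, embed_batch, cache, batch_size=100):
    """Embed texts, calling `embed_batch` only for texts not already cached."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Collect each uncached text once, preserving first-seen order.
    missing = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text
    for key_batch in batched(list(missing), batch_size):
        vectors = embed_batch([missing[k] for k in key_batch])
        for key, vector in zip(key_batch, vectors):
            cache[key] = vector
    return [cache[k] for k in keys]

# Stand-in embedder that records batch sizes instead of calling an API.
batch_sizes = []
def fake_embed_batch(batch):
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

cache = {}
texts = ["alpha", "beta", "gamma", "alpha", "delta"]
vectors = embed_with_cache(texts, fake_embed_batch, cache, batch_size=2)
again = embed_with_cache(texts, fake_embed_batch, cache, batch_size=2)
print(batch_sizes)  # [2, 2]: four unique texts in two batches, nothing on the second call
```

To persist the cache between runs, the dictionary could be written to a JSON file, since both keys and vectors are JSON-serializable.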

**Distance Metrics** (related to embeddings):
- **L2 Distance (Euclidean)**: $d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$ — widely used, works well with FAISS Flat index.
- **Cosine Similarity**: $\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$; ranges from -1 to 1, and some libraries rescale it to the 0 to 1 range. It compares direction regardless of magnitude.
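
Both metrics are easy to compute by hand for small vectors, which also shows how they relate. The sketch below uses plain Python lists and invented two-dimensional vectors; for unit-length vectors the squared L2 distance equals $2 - 2\,\text{sim}(\mathbf{u}, \mathbf{v})$, so the two metrics rank neighbors identically after normalization.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

u, v = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(u, v))        # about 1.414 (the square root of 2)
print(cosine_similarity(u, v))  # about 0.96 (24 / 25)

# After normalization, squared L2 distance and 2 - 2 * cosine agree.
un, vn = normalize(u), normalize(v)
print(l2_distance(un, vn) ** 2, 2 - 2 * cosine_similarity(u, v))
```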

**Implementation Note**:
The cell below chooses `OpenAIEmbeddings` if the configured model name starts with `text-embedding` and falls back to `HuggingFaceEmbeddings` otherwise. Adjust `config['embeddingModelName']` to switch providers.


In [21]:
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model_name = config.get('embeddingModelName', 'text-embedding-3-small')

if embedding_model_name.startswith("text-embedding"):
    # Use OpenAIEmbeddings for OpenAI models
    embeddingFunction = OpenAIEmbeddings(model=embedding_model_name)
else:
    # Fallback to HuggingFaceEmbeddings for other models
    embeddingFunction = HuggingFaceEmbeddings(model_name=embedding_model_name)

## Create Vector Store

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search on dense vectors; the index built in this notebook is held in memory. It provides multiple index types optimized for different dataset sizes and recall-speed tradeoffs. For this RAG system, we use a flat L2 index, which is simple, exact, and fast enough for small collections such as the chunks of a single filing.

**Theoretical Foundation: Nearest-Neighbor Search as Retrieval**

At its core, RAG retrieval is a nearest-neighbor search problem:
1. Each document chunk is converted to a vector via embeddings: $\mathbf{d}_i = \text{embed}(\text{chunk}_i)$.
2. The user's query is converted to a vector: $\mathbf{q} = \text{embed}(\text{question})$.
3. The retriever finds the k chunks whose vectors are nearest to $\mathbf{q}$ (i.e., smallest distance or highest cosine similarity).
4. These k nearest-neighbor chunks are returned as the relevant context.

The assumption: because the embedding space preserves semantic similarity, chunks with vectors near the query vector are likely semantically relevant to the question.
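
The four steps above amount to a brute-force scan, which is what a flat index does conceptually. Here is a minimal sketch with invented two-dimensional vectors standing in for real embeddings; `top_k_nearest` is a hypothetical helper, not a FAISS or LangChain function.

```python
import heapq
import math

def top_k_nearest(query_vec, doc_vecs, k):
    """Return (distance, index) pairs for the k vectors closest to the query."""
    scored = ((math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return heapq.nsmallest(k, scored)

# Tiny hand-made 2-D "embeddings"; real embeddings have hundreds of dimensions.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.2]

nearest = top_k_nearest(query_vec, doc_vecs, k=2)
for distance, index in nearest:
    print(index, round(distance, 3))
```

Vector 2 is returned first and vector 1 second, because they lie closest to the query; the indices are what a vector store maps back to document chunks.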

**Creating the vector store**:
- `FAISS.from_documents()` automatically embeds all chunks using your embedding function and builds an in-memory index.
- The resulting vectorstore object provides a standard interface for similarity search, retrieval, and serialization.

**FAISS Index Types** (for reference and future tuning):
- **Flat (L2)**: exhaustive nearest-neighbor search; computes distance from query to all vectors. Best for small datasets (<100k vectors) and when exact recall is critical. It stores every vector uncompressed and search time grows linearly with the number of vectors, but it is guaranteed to find the true nearest neighbors.
- **IVF (Inverted File)**: partitions vectors into k clusters (centroids) for faster approximate search; queries probe a subset of clusters. Ideal for medium datasets (100k–1M vectors). Trades some recall for speed; requires training on sample data.
- **HNSW**: approximate nearest-neighbor using a hierarchical graph structure (inspired by Small-World networks); allows efficient search without exhaustive comparison. Excellent for large-scale (1M+ vectors) and fast retrieval with tunable recall-precision tradeoff.

For more options, see these detailed [vector database benchmarks](https://docs.google.com/document/d/1RzLxisgBhFwciCNuztrgm5Y9CsHK7qxP0lYARH5ADo8/edit?usp=drive_link).


In [22]:
# Create the FAISS vector store from our document chunks and embedding function
vectorstore = FAISS.from_documents(chunks, embeddingFunction)
print("Vector store created successfully")

Vector store created successfully


In [23]:
# Display basic statistics about the vector store
print(f"Vector store type: {type(vectorstore)}")
print(f"Number of vectors: {vectorstore.index.ntotal}")
print(f"Vector dimension: {vectorstore.index.d}")

Vector store type: <class 'langchain_community.vectorstores.faiss.FAISS'>
Number of vectors: 942
Vector dimension: 1536


In [24]:
# Extract and view a sample vector (first 10 dimensions) to understand the data structure
sample_vector = vectorstore.index.reconstruct(0)  # Get the first vector
print("\nSample vector (first 10 dimensions):")
print(sample_vector[:10])
print(f"Vector shape: {sample_vector.shape}")


Sample vector (first 10 dimensions):
[-0.02827102 -0.03220107  0.0256369  -0.01048718 -0.01531171  0.00960679
 -0.04927355 -0.03713124  0.01701614 -0.00686703]
Vector shape: (1536,)


## Save Vector Store Locally

Persisting the vector store to disk allows you to reuse it across notebook runs, share indexes with teammates, and avoid recomputing expensive embeddings for the same document set. LangChain's `save_local` writes two files: the FAISS index in FAISS's own binary format (`index.faiss`) and the document store plus ID mapping as a pickle file (`index.pkl`).

**Saving best practices**:
- Store the FAISS index alongside a metadata file (JSON or YAML) containing the embedding model name, creation date, chunk configuration, and source document info for full traceability.
- Use version control or numbered backups (e.g., `faiss_index_v1/`, `faiss_index_v2/`) to track index evolution and enable rollback if needed.
- On cloud storage (Google Drive, S3), use checksums and file locking to detect tampering or accidental overwrites.
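
A metadata file with checksums can be produced with the standard library. The sketch below is one possible layout, not a LangChain feature: `write_manifest`, the `manifest.json` file name, and the manifest fields are assumptions, and the two stand-in files only mimic the names a saved index directory typically contains.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path):
    """Hash a file in 1 MiB blocks so large indexes need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def write_manifest(index_dir, settings):
    """Write manifest.json describing the index files and how they were built."""
    index_dir = Path(index_dir)
    files = {
        p.name: sha256_of(p)
        for p in sorted(index_dir.iterdir())
        if p.is_file() and p.name != "manifest.json"
    }
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "settings": settings,
        "files": files,
    }
    (index_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files; a real directory would hold the saved index.
    (Path(tmp) / "index.faiss").write_bytes(b"fake index bytes")
    (Path(tmp) / "index.pkl").write_bytes(b"fake docstore bytes")
    manifest = write_manifest(
        tmp,
        {"embeddingModelName": "text-embedding-3-small", "chunkSize": 500, "chunkOverlap": 50},
    )
print(sorted(manifest["files"]))  # ['index.faiss', 'index.pkl']
```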


In [25]:
from google.colab import drive
from pathlib import Path

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [26]:
# Define the path where the FAISS index will be saved
gdrive_root = Path('/content/drive/MyDrive')
faiss_dir = gdrive_root/"teaching_fall_2025/LLM_Fall_2025/9_RAG/faiss_index"

In [27]:
# Ensure the directory exists
faiss_dir.mkdir(parents=True, exist_ok=True)

In [28]:
# Save the vectorstore to Google Drive
vectorstore.save_local(str(faiss_dir))
print(f"Vector store saved to {faiss_dir}")

Vector store saved to /content/drive/MyDrive/teaching_fall_2025/LLM_Fall_2025/9_RAG/faiss_index


## Load Vector Store

When loading a saved FAISS index, the `allow_dangerous_deserialization=True` parameter enables pickle deserialization, which poses a security risk if the index file comes from an untrusted source (pickle can execute arbitrary code during deserialization). 

**Security best practices when loading**:
- Only load indexes from sources you control or trust (your own Google Drive, GitHub releases with GPG signatures, official team archives).
- Always version and checksum your indexes so you can detect tampering (store a SHA-256 hash separately; MD5 only catches accidental corruption, not deliberate tampering).
- If working with indexes from untrusted third parties, rebuild the index from raw documents instead of deserializing a pickled object.
- Consider wrapping loads in try-except blocks to catch deserialization errors gracefully.

By following these practices, you gain the performance and convenience of serialized indexes while minimizing security exposure.
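
One way to apply the checksum advice is to compare file hashes against known-good values before deserializing anything. This is a minimal sketch: `verify_checksums` is a hypothetical helper, and the expected digests are assumed to come from a record you stored separately when the index was saved.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_checksums(index_dir, expected):
    """Return the names of files that are missing or whose SHA-256 differs."""
    mismatched = []
    for name, expected_digest in expected.items():
        path = Path(index_dir) / name
        if not path.is_file():
            mismatched.append(name)
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected_digest:
            mismatched.append(name)
    return mismatched

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index.pkl"
    target.write_bytes(b"original bytes")
    expected = {"index.pkl": hashlib.sha256(b"original bytes").hexdigest()}
    before = verify_checksums(tmp, expected)
    target.write_bytes(b"tampered bytes")  # simulate a modified file
    after = verify_checksums(tmp, expected)
print(before, after)  # [] ['index.pkl']
```

Only call the loader when the returned list is empty; a non-empty list means the files differ from what was originally saved.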


In [29]:
# Load the vectorstore from Google Drive
# 'allow_dangerous_deserialization' is required because the document store is saved as a pickle file
loaded_vectorstore = FAISS.load_local(str(faiss_dir), embeddingFunction, allow_dangerous_deserialization=True)
print("Vector store loaded successfully from Google Drive")

Vector store loaded successfully from Google Drive


## Explore vector store

In [30]:
# Get the internal mapping of FAISS IDs to Document IDs
index_dict = loaded_vectorstore.index_to_docstore_id

In [31]:
# Display the first 10 items of this mapping
print(f"\nFirst {min(10, len(index_dict))} items in the index_to_docstore_id dictionary:")
print("-" * 60)
for key, value in list(index_dict.items())[:10]:
    print(f"FAISS ID: {key:>3} → Document ID: {value}")


First 10 items in the index_to_docstore_id dictionary:
------------------------------------------------------------
FAISS ID:   0 → Document ID: eb3dfc20-d125-43b9-a133-3b4b1d99cd93
FAISS ID:   1 → Document ID: 771b9c55-a7d0-4dc8-86e6-cd2e80c5d5ca
FAISS ID:   2 → Document ID: c63d11f7-0c28-4533-a527-a576166cca87
FAISS ID:   3 → Document ID: 1025ae2f-0893-4fc4-92a8-e3a7add5924d
FAISS ID:   4 → Document ID: 7a28d8a9-95c4-4ab3-8f8c-ef2315220b07
FAISS ID:   5 → Document ID: 93a35c29-28a0-433d-ae39-f17c4b4db4d2
FAISS ID:   6 → Document ID: f2bae244-6acc-47bd-bdee-2455b183fdfd
FAISS ID:   7 → Document ID: 96021e99-edec-4808-81a8-e68ba77ebb17
FAISS ID:   8 → Document ID: 659e35cf-ce61-40d8-b337-5b90bf325375
FAISS ID:   9 → Document ID: 0f6019c3-a1c9-418b-b67c-4e2fc22ad5c5


In [32]:
# Detailed inspection of a specific chunk (e.g., ID 100)
faiss_id = "100" if "100" in index_dict else 100

# Retrieve Document ID
document_id = loaded_vectorstore.index_to_docstore_id[faiss_id]
print(f"\nFAISS ID: {faiss_id}")
print(f"Document ID: {document_id}")

# Retrieve actual content and metadata
document = loaded_vectorstore.docstore.search(document_id)  # public lookup instead of the private _dict
print("\nDOCUMENT METADATA:")
pprint(document.metadata)
print("\nDOCUMENT CONTENT:")
pprint(document.page_content)

# Retrieve the vector itself
vector = loaded_vectorstore.index.reconstruct(int(faiss_id))
print(f"\nVector dimensions: {len(vector)}")
print(f"\nFirst 10 values: {vector[:10]}")


FAISS ID: 100
Document ID: a89867d9-29d7-4e03-9ecc-614dd1439e93

DOCUMENT METADATA:
{'company': 'Tesla',
 'language': 'No language found.',
 'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm',
 'title': 'tsla-20231231'}

DOCUMENT CONTENT:
('dangerous aspects of road travel much like the system that airplane pilots '
 'use, when conditions permit. As with other vehicle systems, we improve these '
 'functions in our vehicles over time through over-the-air updates.We intend '
 'to establish in the future an autonomous Tesla ride-hailing network, which '
 'we expect would also allow us to access a new customer base even as modes of '
 'transportation evolve.We are also applying our artificial intelligence '
 'learnings from self-driving technology to the')

Vector dimensions: 1536

First 10 values: [-0.00058303 -0.04598977  0.00645413  0.01947445 -0.00173879 -0.02887546
  0.01028441  0.04728191  0.0040841   0.07594641]


In [33]:
# Perform a manual similarity search to see what chunks match our query
query = "What technological advancements were made in the batteries used in Tesla's Electric Vehicles?"
print(f"\n=== Similarity Search Results for Query ===\n{query}")

# search_results is a list of (Document, score) tuples.
# With this FAISS store the score behaves as a distance: lower means a closer match.
search_results = loaded_vectorstore.similarity_search_with_score(query, k=5)

for i, (doc, score) in enumerate(search_results):
    print(f"\nResult {i+1} (Similarity Score: {score}):")
    pprint(f"Content: {doc.page_content}")
    pprint(f"Metadata: {doc.metadata}")

print("\nSearch complete!")


=== Similarity Search Results for Query ===
What technological advancements were made in the batteries used in Tesla's Electric Vehicles?

Result 1 (Similarity Score: 0.8078051209449768):
('Content: technology featuring three electric motors for further increased '
 'performance in certain versions of Model S and Model X, Cybertruck and the '
 'Tesla Semi.We maintain extensive testing and R&D capabilities for battery '
 'cells, packs and systems, and have built an expansive body of knowledge on '
 'lithium-ion cell chemistry types and performance characteristics. In order '
 'to enable a greater supply of cells for our products with higher energy '
 'density at lower costs, we have developed a new proprietary')
("Metadata: {'source': "
 "'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm', "
 "'title': 'tsla-20231231', 'language': 'No language found.', 'company': "
 "'Tesla'}")

Result 2 (Similarity Score: 0.9508378505706787):
('Content: electric veh

# Retrievers

## Why Use a Retriever?

A retriever serves as an abstraction layer that decouples the retrieval mechanism from the rest of your RAG pipeline. This provides multiple benefits:

1. **Standardizes Access**: Provides a consistent interface (`invoke()` method) regardless of the underlying vector store implementation (FAISS, Pinecone, Weaviate, etc.).
2. **Simplifies Integration**: Makes it seamless to plug into LangChain chains, agents, and other components without rewriting code when switching vector store providers.
3. **Encapsulates Search Logic**: Hides implementation details of similarity search, filtering, reranking, and other retrieval strategies behind a simple public interface.
4. **Enables Composition**: Retrievers can be stacked or combined (e.g., chaining multiple retrievers, adding filters, or reranking results) while maintaining a clean API.

**In this notebook**, we create a retriever from our FAISS vector store using `.as_retriever()` and configure it with `search_kwargs={"k": 5}` to return the top-5 most similar documents. This retriever is then composed into the RAG chain to automatically fetch relevant context for every question.

**Theoretical Insight: Relevance Ranking**

When you call `retriever.invoke(question)`:
1. The question is embedded into the same vector space as your document chunks: $\mathbf{q} = \text{embed}(\text{question})$.
2. The index computes a similarity score (e.g., L2 distance) between $\mathbf{q}$ and each document vector: $\text{score}_i = d(\mathbf{q}, \mathbf{d}_i)$ or $\text{sim}(\mathbf{q}, \mathbf{d}_i)$.
3. Results are ranked by score; the top-k documents (smallest distance or highest similarity) are returned.

This ranking assumes that semantic similarity (as captured by embeddings) correlates with relevance. In practice, this often works well but isn't perfect: passages with similar wording but unrelated meaning can rank high, and relevant passages phrased differently can rank lower. Advanced techniques like reranking (using cross-encoders) can improve this, but this notebook relies on embedding-based ranking alone.
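
The ranking steps above can be sketched in plain Python. This is only a toy illustration: the 3-dimensional vectors are made up, and `l2_distance` and `top_k_by_l2` are hypothetical helpers, not part of FAISS or LangChain (which use optimized index structures instead of a full sort).

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_by_l2(query_vec, doc_vecs, k):
    """Return (doc_index, distance) pairs for the k closest vectors, nearest first."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]

# Made-up embeddings purely for illustration (the real ones here have 1536 dimensions).
toy_doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
toy_query_vec = [1.0, 0.0, 0.0]

toy_ranking = top_k_by_l2(toy_query_vec, toy_doc_vecs, k=2)
print(toy_ranking)
```

The document at index 0 is closest to the query, followed by index 2; the vector at index 1 points in a different direction and is dropped when `k=2`.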


In [34]:
# Create a retriever object from the vector store
retriever = loaded_vectorstore.as_retriever(search_kwargs={"k": config["numRetrievedDocuments"]})

In [35]:
# Inspect the retriever object
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7a5139aa58b0>, search_kwargs={'k': 5})

In [36]:
# Test the retriever with our query
retrieved_documents = retriever.invoke("What technological advancements were made in the batteries used in Tesla's Electric Vehicles?")

In [37]:
# View the retrieved documents
retrieved_documents

[Document(id='17098739-e7c7-4f6c-b746-458c6ce97e3c', metadata={'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm', 'title': 'tsla-20231231', 'language': 'No language found.', 'company': 'Tesla'}, page_content='technology featuring three electric motors for further increased performance in certain versions of Model S and Model X, Cybertruck and the Tesla Semi.We maintain extensive testing and R&D capabilities for battery cells, packs and systems, and have built an expansive body of knowledge on lithium-ion cell chemistry types and performance characteristics. In order to enable a greater supply of cells for our products with higher energy density at lower costs, we have developed a new proprietary'),
 Document(id='778a4cb7-f62f-4be8-babb-49aa8c60c57b', metadata={'source': 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm', 'title': 'tsla-20231231', 'language': 'No language found.', 'company': 'Tesla'}, pag

# Prepare inputs for Prompts

## Document Formatter

The document formatter function transforms raw retrieved documents into a structured format suitable for LLM consumption. 

**Why we need formatting**:
- **Prevents Context Overflow**: By limiting results to `numSelectedDocuments` (default 5), we keep the formatted context small enough to fit within the LLM's context window and reduce noise.
- **Improves Readability**: Arranging chunks with clear separation and consistent formatting (company name, chunk text) helps the LLM parse and prioritize information.
- **Maintains Traceability**: Preserving metadata like company names allows the LLM to cite sources and helps humans verify claims against original documents.
- **Reduces Hallucination**: Clear, well-organized context encourages the LLM to cite supported facts rather than generate plausible-sounding but unsupported answers.

**Implementation approach**:
The `format_docs()` function below joins selected documents into a single string with company metadata and clear delimiters. This string is then inserted into the prompt template under the `{context}` placeholder, providing grounded context for answer generation.


In [38]:
# Function to format documents into a string
def format_docs(docs):
    return "\n\n".join([
        f"{doc.metadata.get('company', '')}\n{doc.page_content}"
        for doc in docs[:config["numSelectedDocuments"]]
    ])

In [39]:
# Test the formatting function
context = format_docs(retrieved_documents)
pprint(context)

('Tesla\n'
 'technology featuring three electric motors for further increased performance '
 'in certain versions of Model S and Model X, Cybertruck and the Tesla Semi.We '
 'maintain extensive testing and R&D capabilities for battery cells, packs and '
 'systems, and have built an expansive body of knowledge on lithium-ion cell '
 'chemistry types and performance characteristics. In order to enable a '
 'greater supply of cells for our products with higher energy density at lower '
 'costs, we have developed a new proprietary\n'
 '\n'
 'Tesla\n'
 'electric vehicles to address additional vehicle markets, and to continue '
 'leveraging developments in our proprietary Full Self-Driving (“FSD”) '
 'Capability features, battery cell and other technologies.Energy Generation '
 'and StorageEnergy Storage ProductsPowerwall and Megapack are our lithium-ion '
 'battery energy storage products. Powerwall, which we sell directly to '
 'customers, as well as through channel partners, is designed t

## RunnablePassthrough()

`RunnablePassthrough()` is a utility in LangChain's pipeline (LCEL - LangChain Expression Language) that acts as an identity function: it takes an input and passes it through unchanged to the next step in the chain.

In your RAG pipeline, `RunnablePassthrough()` is used for the `"question"` key:
```python
{"context": retriever | format_docs, "question": RunnablePassthrough()}
```

This line says: "Process the input (the question) through the retriever and formatter to get context, but also pass the original question directly to the prompt template."

**Why we need RunnablePassthrough()**:

In LangChain's pipeline architecture, when you need to process some keys of your input dictionary while leaving others unchanged, `RunnablePassthrough()` elegantly solves this:

1. **Maintains Original Input**: Without `RunnablePassthrough()`, you'd lose the original question while processing it through the retriever. With it, both the processed context and the original question are available to the prompt.
2. **Avoids Custom Functions**: Without it, you'd need to manually write a function to extract and preserve the question—`RunnablePassthrough()` handles this idiomatically.
3. **Expresses Intent Clearly**: Reading `"question": RunnablePassthrough()` makes it explicit that the question should be forwarded unchanged, improving code readability and maintainability.
4. **Enables Complex Pipelines**: As pipelines grow, you often need some inputs to be transformed (e.g., retrieval) and others to pass through (e.g., original query, metadata). `RunnablePassthrough()` scales this pattern cleanly.

**Real-world analogy**: Think of a mail sorting facility. The address on an envelope (the question) is looked up to find the right delivery route (the retrieved context), but the address itself must also stay on the envelope, unchanged, so the carrier can still read it at the door. `RunnablePassthrough()` is the mechanism that keeps the original address intact while the lookup happens.
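
To make the fan-out idea concrete, here is a minimal plain-Python sketch of what the dictionary step does conceptually. The names `fan_out`, `identity`, and `fake_retrieve_and_format` are hypothetical stand-ins invented for this illustration; they are not LangChain APIs, and the real implementation differs.

```python
def fan_out(branches, value):
    """Feed the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}

def identity(value):
    """Stand-in for RunnablePassthrough: return the input unchanged."""
    return value

def fake_retrieve_and_format(question):
    """Hypothetical stand-in for `retriever | format_docs`."""
    return f"[context retrieved for: {question}]"

prompt_inputs = fan_out(
    {"context": fake_retrieve_and_format, "question": identity},
    "What is in the 10-K?",
)
print(prompt_inputs)
```

Both keys receive the same input string; only the `"context"` branch transforms it, which is why the prompt ends up with both the retrieved context and the untouched question.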


In [40]:
# Example usage of RunnablePassthrough
passthrough = RunnablePassthrough()
question = passthrough.invoke("What technological advancements were made in the batteries used in Tesla's Electric Vehicles?")
print(question)

What technological advancements were made in the batteries used in Tesla's Electric Vehicles?


# Prompt

## Why Use LangChain's `PromptTemplate` Instead of String Formatting?

While you could use simple Python f-strings or `.format()` for string templating, LangChain's `PromptTemplate` class provides important benefits:

1. **Validation & Safety**: `PromptTemplate` validates that all required placeholders (e.g., `{question}`, `{context}`) are supplied at runtime. This catches missing variables early, reducing silent failures.

2. **Serialization**: Prompts can be saved to JSON and loaded from disk, enabling version control and reproducibility of experiments across teams and time.

3. **Integration with Components**: `PromptTemplate` objects are first-class citizens in LCEL pipelines, allowing seamless composition with retrievers, LLMs, and other components. This makes complex chains readable and maintainable.

4. **Metadata & Documentation**: You can attach additional metadata to prompts (e.g., version, description, author) for better experiment tracking.

5. **Variable Extraction**: The `input_variables` attribute lists the placeholders the template requires (here `['context', 'question']`), helping you debug pipeline mismatches.

**In this notebook**, the `PromptTemplate` ensures your RAG prompt has exactly the structure needed to ground answers in the retrieved context and request clear, detailed responses.
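
The validation and variable-extraction ideas above can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not how `PromptTemplate` is implemented: `placeholder_names` and `safe_format` are hypothetical helpers, and `toy_template` is a shortened stand-in for the notebook's real prompt.

```python
from string import Formatter

def placeholder_names(template):
    """Collect the named placeholders (e.g. {question}) found in a template string."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})

def safe_format(template, **values):
    """Fill the template, raising a clear error if any placeholder is missing."""
    missing = [name for name in placeholder_names(template) if name not in values]
    if missing:
        raise KeyError(f"Missing template variables: {missing}")
    return template.format(**values)

toy_template = "Question: {question}\n\nContext: {context}\n\nAnswer:"
print(placeholder_names(toy_template))

try:
    # `context` is deliberately omitted to trigger the check.
    safe_format(toy_template, question="What changed?")
except KeyError as err:
    print(err)
```

Failing early with the name of the missing variable is the practical benefit: a prompt with an unfilled `{context}` never reaches the model.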


In [41]:
# Create the prompt template from the config string
prompt_template = PromptTemplate.from_template(config["ragPromptTemplate"])

In [42]:
# Inspect the template object
prompt_template

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='\n    Give an answer for the `Question` using only the given `Context`. Use information relevant to the query from the entire context.\n    Provide a detailed answer with thorough explanations, avoiding summaries.\n\n    Question: {question}\n\n    Context: {context}\n\n    Answer:\n    ')

In [43]:
# Preview the final formatted prompt with actual data
# This ensures context is defined even if cells are run out of order
context = format_docs(retrieved_documents)
formatted_prompt = prompt_template.format(question=question, context=context)
print(formatted_prompt)


    Give an answer for the `Question` using only the given `Context`. Use information relevant to the query from the entire context.
    Provide a detailed answer with thorough explanations, avoiding summaries.

    Question: What technological advancements were made in the batteries used in Tesla's Electric Vehicles?

    Context: Tesla
technology featuring three electric motors for further increased performance in certain versions of Model S and Model X, Cybertruck and the Tesla Semi.We maintain extensive testing and R&D capabilities for battery cells, packs and systems, and have built an expansive body of knowledge on lithium-ion cell chemistry types and performance characteristics. In order to enable a greater supply of cells for our products with higher energy density at lower costs, we have developed a new proprietary

Tesla
electric vehicles to address additional vehicle markets, and to continue leveraging developments in our proprietary Full Self-Driving (“FSD”) Capability fea

# LLM for Answer Generation

The LLM (Large Language Model) is the reasoning engine of your RAG system. It reads the retrieved context and the user's question, then generates a grounded answer based on what the documents contain.

**Theoretical Foundation: Attention and Language Generation**

Modern LLMs (like GPT-4) are transformer-based models that use the **attention mechanism** to:
1. **Understand Context**: Given the prompt (question + retrieved passages), the model uses self-attention to compute relevance weights between all token pairs, allowing it to "focus" on the most important information.
2. **Reason Over Evidence**: The attention mechanism allows the model to trace dependencies and synthesize information across multiple passages.
3. **Generate Answers**: Using these learned representations, the model generates the next token (roughly a word or word piece) based on the previous tokens, repeating until a stopping criterion (e.g., an end-of-sequence token) is met.

Mathematically, attention computes: $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V}$, where queries, keys, and values are derived from the input tokens. This allows the model to dynamically weight which parts of the input to pay attention to.
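
The formula above can be traced by hand on a tiny example. The sketch below implements scaled dot-product attention with plain Python lists; the matrices `toy_Q`, `toy_K`, and `toy_V` are made-up values, and real models use learned projections, many heads, and optimized tensor libraries.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for small list-of-lists matrices."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

toy_Q = [[1.0, 0.0]]
toy_K = [[1.0, 0.0], [0.0, 1.0]]
toy_V = [[10.0, 0.0], [0.0, 10.0]]
toy_out, toy_weights = attention(toy_Q, toy_K, toy_V)
print(toy_weights, toy_out)
```

The query matches the first key more closely, so the first value vector receives the larger weight in the output.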

**Why Use `ChatOpenAI` in This Notebook**:
- **High Quality**: GPT-4 and similar models provide excellent reasoning, understanding of technical documents, and the ability to synthesize information across multiple sources.
- **Managed Service**: No need to run or fine-tune a model locally; just provide an API key and let OpenAI handle infrastructure, scaling, and updates.
- **Context Understanding**: These models excel at following instructions (e.g., "use only the provided context") and maintaining consistency within a conversation.
- **Instruction Following**: Modern LLMs respond well to explicit instructions in prompts, enabling better control over answer style and factuality.

**Key Configuration Parameters**:
- `model`: Model identifier (e.g., `"gpt-4o"`) determines capability level. Larger models (gpt-4) are more capable but costlier; smaller models (gpt-3.5-turbo) are faster and cheaper.
- `temperature`: Controls randomness in generation. 
  - `0.0` → near-deterministic and the most reproducible setting (best for RAG when you want answers drawn directly from context).
  - `0.7` → balanced creativity and consistency (suitable for open-ended questions).
  - `1.0` → highly creative but less predictable (risky for fact-based tasks).

**For RAG Systems**: prefer lower temperatures (0.0–0.3) so the model samples its most likely tokens and varies less between runs, which tends to keep answers closer to the retrieved text. Low temperature does not by itself prevent hallucination; the prompt instruction to use only the provided context matters as well. Note that the run shown below was configured with `temperature=0.7`, so its wording can differ between runs.
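
One common way to picture temperature is as a divisor applied to the model's token scores (logits) before the softmax. The sketch below assumes that textbook formulation; the logits are made up, `softmax_with_temperature` is a hypothetical helper, and how a hosted API handles a temperature of exactly 0 is not modeled here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass lands on the top-scoring token; as the temperature rises, the alternatives become more likely to be sampled.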

The cell below initializes `ChatOpenAI` with your configured model and temperature, preparing it for use in the RAG chain.


In [44]:
# Initialize the Chat OpenAI model
llm = ChatOpenAI(model=config["ragAnswerModel"], temperature=config["ragAnswerModelTemeprature"])

In [45]:
# Verify model configuration
llm

ChatOpenAI(profile={'max_input_tokens': 128000, 'max_output_tokens': 16384, 'image_inputs': True, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True, 'structured_output': True, 'image_url_inputs': True, 'pdf_inputs': True, 'pdf_tool_message': True, 'image_tool_message': True, 'tool_choice': True}, client=<openai.resources.chat.completions.completions.Completions object at 0x7a5139aa6d50>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x7a5135e1d130>, root_client=<openai.OpenAI object at 0x7a5139aa61e0>, root_async_client=<openai.AsyncOpenAI object at 0x7a5139aa50a0>, model_name='gpt-4o', temperature=0.7, model_kwargs={}, openai_api_key=SecretStr('**********'), stream_usage=True)

In [46]:
# Generate the answer by invoking the LLM with the formatted prompt
llm_output = llm.invoke(formatted_prompt)

In [47]:
# View the raw output object (AIMessage)
llm_output

AIMessage(content="Tesla has made several technological advancements in the batteries used in its electric vehicles, primarily focusing on lithium-ion battery cells. The company has developed a new proprietary lithium-ion battery cell that offers higher energy density at lower costs. This advancement is crucial as it directly impacts the range, performance, and overall efficiency of Tesla's electric vehicles.\n\nFurthermore, Tesla has improved its manufacturing processes, which likely contribute to the increased efficiency and reduced costs associated with their battery production. This indicates a focus not only on the chemistry of the battery cells themselves but also on the way they are manufactured, allowing Tesla to scale production and meet the growing demand for their vehicles.\n\nTesla's extensive research and development capabilities are evident in their comprehensive testing of battery cells, packs, and systems. This has allowed them to build a significant body of knowledge r

In [48]:
# View just the content string
pprint(llm_output.content)

('Tesla has made several technological advancements in the batteries used in '
 'its electric vehicles, primarily focusing on lithium-ion battery cells. The '
 'company has developed a new proprietary lithium-ion battery cell that offers '
 'higher energy density at lower costs. This advancement is crucial as it '
 "directly impacts the range, performance, and overall efficiency of Tesla's "
 'electric vehicles.\n'
 '\n'
 'Furthermore, Tesla has improved its manufacturing processes, which likely '
 'contribute to the increased efficiency and reduced costs associated with '
 'their battery production. This indicates a focus not only on the chemistry '
 'of the battery cells themselves but also on the way they are manufactured, '
 'allowing Tesla to scale production and meet the growing demand for their '
 'vehicles.\n'
 '\n'
 "Tesla's extensive research and development capabilities are evident in their "
 'comprehensive testing of battery cells, packs, and systems. This has allowed '
 '

# PostProcessing

## Output Parsing and String Conversion

`StrOutputParser()` is the final component in the RAG chain that converts the LLM's structured output into a clean, readable string. Without this parser, the LLM returns an `AIMessage` object containing metadata (like token count, model name, finish reason) along with the text content. `StrOutputParser()` extracts just the text, discarding the wrapper.

**In the RAG chain**, `StrOutputParser()` is the last step:
```python
rag_chain = (...) | StrOutputParser()
```

When you call `rag_chain.invoke(question)`, the result is a plain string ready for display, rather than a complex object with embedded metadata.

**Why Output Parsing Matters**:
- **User Experience**: Users expect plain text answers, not Python objects or JSON.
- **Integration**: Downstream systems (web APIs, frontends, logging) expect strings or simple data types.
- **Composability**: In complex chains, output parsers enable type-safe data flow and prevent errors when chaining components.

## Evaluating RAG System Quality

To improve your RAG system, measure its performance on a test set:

1. **Retrieval Quality** (does the retriever find relevant passages?):
   - Precision@k: % of top-k retrieved docs that are relevant
   - Recall: % of all relevant docs in the corpus that appear in top-k results
   - MRR (Mean Reciprocal Rank): average, over test queries, of 1/rank of the first relevant document (1.0 means the first hit is always relevant)
   - Tool: Run `retriever.invoke(question)` for each test query and manually assess relevance.

2. **Answer Quality** (does the LLM give correct, grounded answers?):
   - F1 Score: overlap between generated answer and reference answer(s)
   - Factuality: % of claims in the answer that are directly supported by retrieved docs
   - Hallucination Rate: % of claims not in the source context
   - User satisfaction: ask domain experts to rate answer quality on a scale

3. **End-to-End Quality**:
   - Compare answers with/without retrieval (should be better with RAG).
   - Baseline against traditional search or closed-book LLM answers.
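
The retrieval metrics listed above can be computed with a few small functions once you have relevance labels. This is a minimal sketch: the chunk IDs and relevance judgments in `toy_runs` are hypothetical, and the function names are illustrative, not from an evaluation library.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids[:k]) / len(relevant_ids)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant ID per query (0 if none is found)."""
    total = 0.0
    for retrieved_ids, relevant_ids in runs:
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

# Hypothetical labels: chunk IDs a human judged relevant for each test query.
toy_runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # first relevant hit at rank 1
    (["c9", "c2", "c4"], {"c2"}),        # first relevant hit at rank 2
]
print(precision_at_k(*toy_runs[0], k=3))
print(recall_at_k(*toy_runs[0], k=3))
print(mean_reciprocal_rank(toy_runs))
```

In practice the retrieved IDs would come from running the retriever on each test query, and the relevant sets from manual review of the filing.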

**Theoretical Note: The Hallucination Problem**

Even with retrieval context, LLMs can "hallucinate"—generate false or unsupported information. This happens because:
- The model's training data contains patterns that generate plausible-sounding text.
- The model may overgeneralize from the context or inject knowledge from its training set.
- Attention weights may focus on context tokens that don't directly answer the question.

RAG reduces (but doesn't eliminate) hallucinations by grounding answers in retrieved documents. Using low temperature and explicit instructions ("only use the provided context") further helps.

## Quick Debugging Checklist

If answers are poor:
- **Check retrieval**: print retrieved docs to see if relevant passages are being found.
- **Adjust config**: try different `chunkSize`, `chunkOverlap`, `numRetrievedDocuments`, or `temperature`.
- **Verify embeddings**: ensure you're using the same embedding model for indexing and querying.
- **Test prompt clarity**: modify `ragPromptTemplate` to give clearer instructions to the LLM.

## Next Steps for Learning

- **Experiment with different documents**: load a different SEC filing or article and re-run the pipeline.
- **Implement reranking**: after retrieval, rerank the top-k docs using a cross-encoder to improve precision.
- **Add filtering**: pre-filter chunks by date, section, or metadata before sending to the LLM.
- **Try different LLMs**: replace `ChatOpenAI` with open-source models (Llama, Mistral) or other providers (Anthropic, Cohere).
- **Optimize costs**: use smaller, cheaper embedding/LLM models if performance allows.
- **Deploy**: wrap the chain in a web service (FastAPI, Streamlit) for real-world use.


In [49]:
# Initialize the output parser
output_parser = StrOutputParser()

In [50]:
# Parse the LLM output
final_output = output_parser.invoke(llm_output)

In [51]:
# Print the final clean result
pprint(final_output)

('Tesla has made several technological advancements in the batteries used in '
 'its electric vehicles, primarily focusing on lithium-ion battery cells. The '
 'company has developed a new proprietary lithium-ion battery cell that offers '
 'higher energy density at lower costs. This advancement is crucial as it '
 "directly impacts the range, performance, and overall efficiency of Tesla's "
 'electric vehicles.\n'
 '\n'
 'Furthermore, Tesla has improved its manufacturing processes, which likely '
 'contribute to the increased efficiency and reduced costs associated with '
 'their battery production. This indicates a focus not only on the chemistry '
 'of the battery cells themselves but also on the way they are manufactured, '
 'allowing Tesla to scale production and meet the growing demand for their '
 'vehicles.\n'
 '\n'
 "Tesla's extensive research and development capabilities are evident in their "
 'comprehensive testing of battery cells, packs, and systems. This has allowed '
 '

# RAG Chain

## RAG Chain Overview

The RAG chain is a composable pipeline that orchestrates all components—retriever, formatter, prompt, and LLM—into a single callable unit. Using LCEL (LangChain Expression Language), you can express the entire pipeline as a linear sequence, making the flow clear and easy to modify.

## RAG Chain Components and Data Flow

1. **Input**: User's question (string)
2. **Retriever**: Fetches top-k most similar documents based on semantic similarity to the question.
3. **Document Formatter**: Transforms retrieved documents into a structured text block with company metadata and clear delimiters.
4. **Prompt Template**: Combines the formatted context and original question into a structured prompt string with explicit instructions (e.g., "use only provided context").
5. **LLM**: Generates a grounded answer based on the prompt.
6. **Output Parser**: Extracts the text response from the LLM's structured output (AIMessage object).
7. **Output**: Final answer string ready for display to the user.

## LCEL Pipeline Syntax

The RAG chain is built using LCEL's pipe operator (`|`):

```python
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

**Reading this syntax**:
- `{"context": retriever | format_docs, "question": RunnablePassthrough()}`: Create a dictionary with two keys. For "context", pass the input through the retriever, then through `format_docs()`. For "question", pass the input unchanged via `RunnablePassthrough()`.
- `| prompt`: Pass the resulting dictionary to the prompt template, which substitutes `{context}` and `{question}` placeholders.
- `| llm`: Send the formatted prompt to the LLM.
- `| StrOutputParser()`: Extract the text response.

**Key insight**: Each `|` represents a handoff between components. Data flows left-to-right, with each component's output becoming the next component's input.
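
To see how a `|` operator can express this kind of handoff, here is a toy sketch using Python's `__or__` method. The `Step` class and the three stages are hypothetical stand-ins written for this illustration; LangChain's own runnables are considerably richer (batching, streaming, async) and are not implemented this way line for line.

```python
class Step:
    """Tiny stand-in for a composable pipeline step (not LangChain's Runnable)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

# Hypothetical stages standing in for the prompt, the model, and the parser.
build_prompt = Step(lambda q: f"Question: {q}\nAnswer:")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})
parse_text = Step(lambda message: message["content"])

toy_chain = build_prompt | fake_llm | parse_text
print(toy_chain.invoke("what is lcel?"))
```

Because each `|` returns another `Step`, chains of any length compose the same way, and any intermediate step can still be invoked on its own for debugging.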

## Why This Design?

- **Modularity**: Each component (retriever, formatter, prompt, LLM) is independently testable and replaceable.
- **Readability**: The LCEL syntax reads like a narrative description of the process.
- **Composability**: You can build complex pipelines by combining simple components without nesting function calls.
- **Debuggability**: You can invoke intermediate steps (e.g., `retriever.invoke(question)`) to inspect behavior and troubleshoot issues.

## Usage and Reproducibility

Once the chain is built, invoke it with your question:

```python
answer = rag_chain.invoke("Your question here?")
```

The same chain can answer multiple questions without reinitialization. For reproducible results (especially important in research), keep `temperature` low and document your config settings.


In [52]:
# Re-creating components for clarity in the final chain definition

# 1. Retriever
retriever = loaded_vectorstore.as_retriever(search_kwargs={"k": config["numRetrievedDocuments"]})

# 2. Formatting function (re-defined or reused)
def format_docs(docs):
    return "\n\n".join([
        f"{doc.metadata.get('company', '')}\n{doc.page_content}"
        for doc in docs[:config["numSelectedDocuments"]]
    ])

# 3. LLM
llm = ChatOpenAI(model=config["ragAnswerModel"], temperature=config["ragAnswerModelTemeprature"])

# 4. Prompt
prompt = PromptTemplate.from_template(config["ragPromptTemplate"])
print(prompt)

# 5. Build the RAG Chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

input_variables=['context', 'question'] input_types={} partial_variables={} template='\n    Give an answer for the `Question` using only the given `Context`. Use information relevant to the query from the entire context.\n    Provide a detailed answer with thorough explanations, avoiding summaries.\n\n    Question: {question}\n\n    Context: {context}\n\n    Answer:\n    '


# Ask a Question

Now let's invoke the complete RAG chain with a concrete question about Tesla's 10-K filing. The chain will:
1. Retrieve the 5 most relevant passages from the document.
2. Format them into a single context block.
3. Construct a prompt combining context and question.
4. Send the prompt to GPT-4o for answer generation.
5. Parse and return the final answer.

**Expected behavior**: The answer should be grounded in the retrieved passages, avoiding hallucinations. If the source document doesn't contain relevant information, the LLM should indicate that rather than inventing an answer.

**Debugging tip**: If the answer doesn't match your expectations, inspect the retrieved documents first (for example, by calling `retriever.invoke(question)`) to see whether the retriever found relevant passages before blaming the LLM.


In [53]:
# Define the question
question = "What technological advancements were made in the batteries used in Tesla's Electric Vehicles?"

# Invoke the chain
answer = rag_chain.invoke(question)
print(answer)

Tesla has made several technological advancements in the batteries used in their electric vehicles, focusing on lithium-ion cell chemistry and performance characteristics. These advancements include the development of a new proprietary lithium-ion battery cell that aims to offer higher energy density at lower costs. This involves improved manufacturing processes that enhance the efficiency and performance of the battery cells.

Tesla's extensive testing and research and development (R&D) capabilities have allowed them to build a comprehensive understanding of different lithium-ion cell chemistries, which is crucial for optimizing the performance of their battery packs. These battery packs are integral to the performance and safety systems of Tesla vehicles. They utilize sophisticated control software to optimize performance, manage charging, and customize vehicle behavior.

Additionally, Tesla's battery advancements are reflected in their energy storage products, such as the Powerwall 

# Try Other Questions

Use the cell below to test the RAG system with different questions. The same `rag_chain` can answer multiple queries without reinitialization, provided the vector store and configuration remain unchanged.

**Experiment ideas**:
- Ask questions that are directly answered in the 10-K (should return high-quality answers).
- Ask questions the document doesn't address (the LLM should acknowledge this rather than hallucinate).
- Ask follow-up questions that require synthesizing information across multiple sections.
- Modify `config['numRetrievedDocuments']` (higher values = more context, potentially better answers but also more noise) and rerun cells to observe effects.

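**For consistent results across runs**: keep the configured temperature at 0.0–0.3. In this notebook's code the setting is read from `config["ragAnswerModelTemeprature"]` (spelled exactly as it appears in the code cells), so change that key rather than a differently spelled one. Higher temperatures will produce varied responses even with identical inputs.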


In [54]:
# Try another question to test generalization
another_question = "What are Tesla's main revenue sources?"
rag_chain.invoke(another_question)

"Tesla's main revenue sources are primarily categorized into three segments: Automotive Sales, Energy Generation and Storage, and Services and Other.\n\n1. **Automotive Sales**: This is the largest revenue source for Tesla. It includes the sale of new vehicles such as the Model S, Model X, Model 3, Model Y, Cybertruck, and Semi. The revenue from automotive sales encompasses both cash and financed deliveries. It also includes revenue from the sale of optional features and services such as Full Self-Driving (FSD) capabilities, over-the-air software updates, internet connectivity, and other subscriptions or additional features that can be purchased through the Tesla app or in-vehicle interface.\n\n2. **Automotive Regulatory Credits**: Although part of the automotive segment, it's worth mentioning separately due to its significance. Tesla earns regulatory credits by producing zero-emission vehicles, which it can then sell to other manufacturers that need these credits to comply with enviro