# Formula 1 RAG Ingestion Pipeline

This notebook handles document ingestion and retrieval for a Formula 1 knowledge base using LangChain and Pinecone.

## Setup and Imports

This cell loads environment variables and imports the required libraries:
- **os**: Reads configuration values such as `PINECONE_INDEX_NAME` from the environment
- **dotenv**: Loads API keys from the `.env` file
- **LangChain components**: Text splitter, document loader, embeddings, and vector store


In [19]:
import os
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()



True
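
Because `load_dotenv()` returns `True` whenever it finds a `.env` file, even one that lacks some keys, a quick check of the expected variables can catch configuration problems before any API call is made. The sketch below is a minimal example of that idea. Only `PINECONE_INDEX_NAME` appears in this notebook; `OPENAI_API_KEY` and `PINECONE_API_KEY` are assumed here as the conventional variable names, and `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Assumed variable names: only PINECONE_INDEX_NAME is used explicitly in this notebook.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running this right after `load_dotenv()` reports any variable that is unset or empty.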

## Functions from ingestion.py

This notebook relies on two key functions that correspond to the `ingestion.py` module; the cells below define them inline:

### `ingest_documents()`
Performs the complete document ingestion pipeline:
1. **Load**: Fetches web pages from Formula 1 URLs using `WebBaseLoader`
2. **Split**: Divides documents into chunks (200 tokens, no overlap) using `RecursiveCharacterTextSplitter`
3. **Embed**: Generates embeddings using OpenAI's `text-embedding-3-small` model
4. **Store**: Uploads document chunks and embeddings to Pinecone vector database
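
To make the chunk size and overlap settings in the Split step concrete, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); it only illustrates how the two parameters interact. The function name `sliding_window_chunks` and the toy token list are illustrative assumptions.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = list(range(10))  # stand-in for ten token IDs
no_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=2)

print(no_overlap)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(with_overlap)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With no overlap each token appears in exactly one chunk, which keeps the index small. Overlap repeats boundary tokens so that a sentence cut at a chunk edge still appears whole in at least one chunk, at the cost of more chunks to embed and store.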

### `get_retriever()`
Creates a retriever that connects to your existing Pinecone index without re-ingesting documents.

**Benefit**: These functions can now be imported in any Python module or notebook!


In [12]:
def ingest_documents():
    urls = [
        "https://rishivikram348.medium.com/formula-one-for-dummies-part-one-the-basics-of-the-sport-26de6eeeca38",
        "https://medium.com/@Formula.101/an-introduction-to-formula-1-teams-and-drivers-94de161ec82f",
        "https://en.wikipedia.org/wiki/Drag_reduction_system",
    ]
    
    print("Loading documents from URLs...")
    documents = [WebBaseLoader(url).load() for url in urls]
    documents_list = [item for sublist in documents for item in sublist]
    
    print("Splitting documents...")
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, chunk_overlap=0)
    doc_splits = text_splitter.split_documents(documents_list)
    
    print(f"Ingesting {len(doc_splits)} document chunks into Pinecone...")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    index_name = os.getenv("PINECONE_INDEX_NAME")
    
    PineconeVectorStore.from_documents(
        documents=doc_splits,
        embedding=embeddings,
        index_name=index_name
    )
    
    print(f"✓ Successfully ingested {len(doc_splits)} documents into Pinecone")


## Function: get_retriever()

This function creates a retriever that connects to your existing Pinecone index **without re-ingesting documents**.

**How it works**:
- Connects to the existing Pinecone vector store
- Returns a retriever configured to fetch the top 5 most relevant documents
- Can be called multiple times without duplicating data

**Usage**: Call this function whenever you need to query the knowledge base.
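
As a rough illustration of what "top 5 most relevant" means, the sketch below ranks a handful of stored vectors by cosine similarity to a query vector and keeps the best `k`. This is a toy model only: the three-dimensional vectors and document IDs are invented, cosine is assumed as the similarity metric (the real metric depends on how the Pinecone index was created), and Pinecone performs this search server-side with approximate methods rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=5):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Invented toy embeddings; real embeddings have far more dimensions.
stored = {
    "drs-intro": [0.9, 0.1, 0.0],
    "drs-rules": [0.8, 0.2, 0.1],
    "teams": [0.1, 0.9, 0.2],
    "tyres": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

nearest = top_k(query, stored, k=2)
print(nearest)  # ['drs-intro', 'drs-rules']
```

The comparison is only meaningful when the query and the stored chunks are embedded with the same model, since different models place text in different vector spaces.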


In [None]:
def get_retriever():
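    # Use the same model as ingest_documents(); query and document embeddings must match
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")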
    index_name = os.getenv("PINECONE_INDEX_NAME")
    
    vector_store = PineconeVectorStore(
        embedding=embeddings,
        index_name=index_name
    )
    
    return vector_store.as_retriever(search_kwargs={"k": 5})


---

# Part 1: Document Ingestion

## ⚠️ Run This Cell Only When Updating the Knowledge Base

**Purpose**: Loads Formula 1 articles from the web and stores them in Pinecone

**When to run**:
- First-time setup
- Adding new URLs to the knowledge base
- Refreshing existing content

**What happens**:
- Downloads content from 3 Formula 1 URLs
- Splits content into manageable chunks
- Creates embeddings (costs OpenAI API tokens)
- Uploads to Pinecone (may duplicate if run multiple times)
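
One common way to reduce the duplication risk noted above is to derive each chunk's ID from its content, so that re-running ingestion overwrites existing vectors instead of adding new ones (Pinecone replaces a vector when an upsert reuses its ID). The sketch below only shows the ID derivation. `chunk_id` is a hypothetical helper, the sample chunk texts are invented, and whether the vector store accepts explicit IDs (for example through an `ids` argument when adding documents) is an assumption to verify against the installed `langchain_pinecone` version.

```python
import hashlib


def chunk_id(source, text):
    """Derive a stable ID from a chunk's source URL and its text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; still effectively unique


# Invented sample chunks as (source URL, chunk text) pairs.
chunks = [
    ("https://en.wikipedia.org/wiki/Drag_reduction_system", "DRS reduces aerodynamic drag."),
    ("https://en.wikipedia.org/wiki/Drag_reduction_system", "DRS use is restricted to set zones."),
]

first_run_ids = [chunk_id(source, text) for source, text in chunks]
second_run_ids = [chunk_id(source, text) for source, text in chunks]

print(first_run_ids == second_run_ids)  # True: identical content yields identical IDs
```

This approach does not remove stale vectors: if a page changes, its new chunks get new IDs and the old ones remain until deleted.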

**Note**: The cell below runs the ingestion as soon as it is executed. Skip it if the index is already populated.


In [13]:
ingest_documents()


Loading documents from URLs...
Splitting documents...
Ingesting 54 document chunks into Pinecone...
✓ Successfully ingested 54 documents into Pinecone


---

# Part 2: Query and Test Retrieval

## 🔍 Test Your Knowledge Base

**Purpose**: Query the existing Pinecone index to verify retrieval works correctly

**What this does**:
- Connects to your existing Pinecone vector store (no data upload)
- Searches for documents relevant to "What is DRS?"
- Returns the top 5 most relevant document chunks
- Displays the first 200 characters of each result

**Safe to run multiple times**: This cell only reads from Pinecone; it does not modify or duplicate data.

**Tip**: Change the query string to test different questions!


In [17]:
retriever = get_retriever()
results = retriever.invoke("What is DRS?")

print(f"\nFound {len(results)} results:\n")
for i, doc in enumerate(results, 1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:200])
    print()



Found 5 results:

--- Result 1 ---
The effectiveness of the DRS will vary from track to track and, to a lesser extent, from car to car. The system's effectiveness was reviewed in 2011 to see if overtaking could be made easier, but not 

--- Result 2 ---
Reception[edit]
There has been a mixed reaction to the introduction of DRS in Formula One amongst both fans and drivers. Some believe that this is the solution to the lack of overtaking in F1 in recen

--- Result 3 ---
DRS in open (top) and closed (bottom) positions on a Red Bull RB7 in 2011
In motor racing, the drag reduction system (DRS) is a form of driver-adjustable bodywork aimed at reducing aerodynamic drag in

--- Result 4 ---
Rationale[edit]
In most higher-performance racing categories, the cars depend on the downforce produced by their aerodynamic bodywork to increase cornering speeds.[5] However, the aerodynamic bodywork

--- Result 5 ---
Functional description[edit]
The horizontal elements of the rear wing consist of the mai