# Formula 1 RAG Ingestion Pipeline

This notebook handles document ingestion and retrieval for a Formula 1 knowledge base using LangChain and Pinecone.

## Setup and Imports

This cell loads environment variables and imports all necessary libraries:
- **dotenv**: Loads API keys from a `.env` file
- **langchain compatibility patch**: Sets `langchain.debug` and `langchain.verbose` to `False` when the installed version does not define them
- **LangChain components**: Text splitters, document loaders, embeddings, and vector store
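
Before running the rest of the notebook, it can help to confirm that the `.env` file actually supplied the expected keys. The sketch below is one way to do that. `PINECONE_INDEX_NAME` is read later in this notebook; `OPENAI_API_KEY` and `PINECONE_API_KEY` are the conventional variable names for these services and are assumed here, as is the helper name `missing_env_vars`.

```python
import os

# PINECONE_INDEX_NAME comes from this notebook; the two API key names are
# an assumption based on the usual conventions for OpenAI and Pinecone.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing key immediately instead of partway through ingestion.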


In [1]:
import os
import sys
from dotenv import load_dotenv

load_dotenv()

import langchain
if not hasattr(langchain, 'debug'):
    langchain.debug = False
if not hasattr(langchain, 'verbose'):
    langchain.verbose = False

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore


USER_AGENT environment variable not set, consider setting it to identify your requests.


## Function: ingest_documents()

This function performs the complete document ingestion pipeline:

1. **Load**: Fetches web pages from Formula 1 URLs using `WebBaseLoader`
2. **Split**: Divides documents into chunks of up to 200 tokens with no overlap using `RecursiveCharacterTextSplitter`
3. **Embed**: Generates embeddings using OpenAI's `text-embedding-3-small` model
4. **Store**: Uploads document chunks and embeddings to Pinecone vector database

**Note**: Only call this function when you need to refresh/update the knowledge base.
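
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines), and whitespace-separated words stand in for tiktoken tokens. The function name `window_chunks` is hypothetical.

```python
def window_chunks(tokens, chunk_size, chunk_overlap=0):
    """Split `tokens` into windows of at most `chunk_size` items.

    Each window repeats the last `chunk_overlap` items of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Words stand in for tokens in this illustration.
words = "the car that crosses the finish line first wins".split()
print(window_chunks(words, chunk_size=4))                   # no shared words
print(window_chunks(words, chunk_size=4, chunk_overlap=2))  # 2 words shared
```

With no overlap, every token appears in exactly one chunk, which keeps the chunk count and embedding cost down. A non-zero overlap repeats context across chunk boundaries at the price of more chunks.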


In [None]:
def ingest_documents():
    urls = [
        "https://rishivikram348.medium.com/formula-one-for-dummies-part-one-the-basics-of-the-sport-26de6eeeca38",
        "https://medium.com/@Formula.101/an-introduction-to-formula-1-teams-and-drivers-94de161ec82f",
        "https://en.wikipedia.org/wiki/Drag_reduction_system",
    ]
    
    print("Loading documents from URLs...")
    documents = [WebBaseLoader(url).load() for url in urls]
    documents_list = [item for sublist in documents for item in sublist]
    
    print("Splitting documents...")
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, chunk_overlap=0)
    doc_splits = text_splitter.split_documents(documents_list)
    
    print(f"Ingesting {len(doc_splits)} document chunks into Pinecone...")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    index_name = os.getenv("PINECONE_INDEX_NAME")
    
    PineconeVectorStore.from_documents(
        documents=doc_splits,
        embedding=embeddings,
        index_name=index_name
    )
    
    print(f"✓ Successfully ingested {len(doc_splits)} documents into Pinecone")


## Function: get_retriever()

This function creates a retriever that connects to your existing Pinecone index **without re-ingesting documents**.

**How it works**:
- Connects to the existing Pinecone vector store
- Returns a retriever configured to fetch the top 5 most relevant documents
- Can be called multiple times without duplicating data

**Usage**: Call this function whenever you need to query the knowledge base.


In [3]:
def get_retriever():
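    # Use the same embedding model as ingest_documents() so query vectors
    # are comparable with the stored vectors.
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")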
    index_name = os.getenv("PINECONE_INDEX_NAME")
    
    vector_store = PineconeVectorStore(
        embedding=embeddings,
        index_name=index_name
    )
    
    return vector_store.as_retriever(search_kwargs={"k": 5})


---

# Part 1: Document Ingestion

## ⚠️ Run This Cell Only When Updating the Knowledge Base

**Purpose**: Loads Formula 1 articles from the web and stores them in Pinecone

**When to run**:
- First-time setup
- Adding new URLs to the knowledge base
- Refreshing existing content

**What happens**:
- Downloads content from 3 Formula 1 URLs
- Splits content into manageable chunks
- Creates embeddings (costs OpenAI API tokens)
- Uploads to Pinecone (may duplicate if run multiple times)

**Note**: Run the cell below to execute the ingestion.
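
One possible way to reduce duplicates on repeated runs is to give every chunk a deterministic ID, so that re-ingesting identical content overwrites existing vectors instead of adding new ones. The sketch below only shows the ID derivation. The helper name `chunk_id` is hypothetical, and the commented usage assumes that each chunk carries `metadata["source"]` and that the vector store accepts an `ids` argument, both of which should be checked against the installed library versions.

```python
import hashlib


def chunk_id(source, text):
    """Derive a stable 32-character ID from a chunk's source URL and text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()[:32]


# Hypothetical use inside ingest_documents(), under the assumptions above:
# ids = [chunk_id(d.metadata.get("source", ""), d.page_content) for d in doc_splits]
# PineconeVectorStore.from_documents(
#     documents=doc_splits, embedding=embeddings, index_name=index_name, ids=ids
# )
```

This only helps when the chunk text is unchanged between runs. If a page is edited, its chunks hash to new IDs and the stale vectors remain in the index until they are deleted.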


In [10]:
ingest_documents()


Loading documents from URLs...
Splitting documents...
Ingesting 54 document chunks into Pinecone...
✓ Successfully ingested 54 documents into Pinecone


---

# Part 2: Query and Test Retrieval

## 🔍 Test Your Knowledge Base

**Purpose**: Query the existing Pinecone index to verify retrieval works correctly

**What this does**:
- Connects to your existing Pinecone vector store (no data upload)
- Searches for documents relevant to "What is Formula 1?"
- Returns the top 5 most relevant document chunks
- Displays the first 200 characters of each result

**Safe to run multiple times**: This only reads from Pinecone; it does not modify or duplicate data.

**Tip**: Change the query string to test different questions!
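
Scraped pages often contain long runs of newlines from navigation menus, which can make the 200-character previews hard to read. A small helper like the one below (the name `preview` is hypothetical) collapses whitespace before truncating.

```python
import re


def preview(text, width=200):
    """Collapse runs of whitespace and truncate the result to `width` characters."""
    flat = re.sub(r"\s+", " ", text).strip()
    if len(flat) <= width:
        return flat
    return flat[:width].rstrip() + "..."
```

In the display loop, `print(preview(doc.page_content))` could replace `print(doc.page_content[:200])`.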


In [11]:
retriever = get_retriever()
results = retriever.invoke("What is Formula 1?")

print(f"\nFound {len(results)} results:\n")
for i, doc in enumerate(results, 1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:200])
    print()



Found 5 results:

--- Result 1 ---
Formula One For Dummies — Part One: the basics of the sport | by Rishi Vikram | MediumSitemapOpen in appSign upSign inMedium LogoWriteSearchSign upSign inFormula One For Dummies — Part One: the basics

--- Result 2 ---
across the world. After each race, the teams pack up all their equipment and ship it over to the location of the next race.Scroll to the end of the article to see the current teams, cars and their dri

--- Result 3 ---
line is the winner. Sounds simple enough? There’s a lot more to it. But what I want is for you, the reader, to go from that level of understanding to something deeper; understanding how the sport work

--- Result 4 ---
An Introduction to Formula 1 Teams and Drivers | by Dylan Kane | MediumSitemapOpen in appSign upSign inMedium LogoWriteSearchSign upSign inAn Introduction to Formula 1 Teams and DriversDylan Kane9 min

--- Result 5 ---
(Top) 1 Rationale 2 Formula One Toggle Formula One subsection 2.1