# SETUP, GATHER RAG TOOLS, & LOAD THE DATA

In [1]:
# We start by installing the libraries we need and setting up our OpenAI API key
# This cell installs the necessary libraries. Please run it once

print("Installing necessary libraries...")
!pip install -q langchain langchain-openai openai chromadb gradio python-dotenv tiktoken langchain-community
print("Libraries installed successfully!")

Installing necessary libraries...
Libraries installed successfully!


In [2]:
# Let's upgrade the OpenAI package (already installed above) and import the client class
!pip install -q --upgrade openai
from openai import OpenAI

# Let's import os, which stands for "Operating System"
import os

# This will be used to load the API key from the .env file
from dotenv import load_dotenv
load_dotenv()

# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY is not set. Add it to your .env file.")

# Let's configure the OpenAI client using our key
openai_client = OpenAI(api_key=openai_api_key)
print("OpenAI client successfully configured.")

# Let's view the first few characters of the key as a sanity check (never print the full key)
print(openai_api_key[:15])

OpenAI client successfully configured.
sk-proj-H3dZxa9


In [3]:
# Let's import LangChain components
# Note: this `OpenAI` is LangChain's LLM wrapper; it shadows the `OpenAI` client class imported from `openai` earlier
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQAWithSourcesChain

In [4]:
# Define the path to your data file
# Ensure 'eleven_madison_park_data.txt' is in the same folder as this notebook
DATA_FILE_PATH = "eleven_madison_park_data.txt"
print(f"Data file path set to: {DATA_FILE_PATH}")

Data file path set to: eleven_madison_park_data.txt


In [5]:
# Let's load the Eleven Madison Park restaurant data, which has been scraped from their website
# The data is saved in "eleven_madison_park_data.txt"; LangChain's TextLoader makes it easy to read
print(f"Attempting to load data from: {DATA_FILE_PATH}")

# Initialize the TextLoader with the file path and specify UTF-8 encoding
# Encoding helps handle various characters correctly
loader = TextLoader(DATA_FILE_PATH, encoding = "utf-8")

# Load the document(s) using TextLoader from LangChain, which loads the entire file as one Document object
raw_documents = loader.load()
print(f"Successfully loaded {len(raw_documents)} document(s).")


Attempting to load data from: eleven_madison_park_data.txt
Successfully loaded 1 document(s).


In [6]:
# Let's display a few characters of the loaded content to perform a sanity check!
print(raw_documents[0].page_content[:500] + "...")

Source: https://www.elevenmadisonpark.com/
Title: Eleven Madison Park
Content:
Book on Resy
---END OF SOURCE---

Source: https://www.elevenmadisonpark.com/careers
Title: Careers — Eleven Madison Park
Content:
Join Our Team Eleven Madison Park ▾ All Businesses Eleven Madison Park Clemente Bar Daniel Humm Hospitality Filter Categories Culinary Pastry Wine & Beverage Dining Room Office & Admin Other Job Types Full Time Part Time Compensation Salary Hourly Apply filters OPEN OPPORTUNITIES Staff Acco...


# SPLITTING DOCUMENTS (CHUNKING) WITH LANGCHAIN TEXT SPLITTER

Large documents are hard for AI models to process efficiently and make it difficult to find specific answers. We need to split the loaded document into smaller, manageable "chunks". We'll use Langchain's `RecursiveCharacterTextSplitter`.

*   Smaller pieces are easier to embed, store, and retrieve accurately.
*   **`chunk_size`**: Max characters per chunk.
*   **`chunk_overlap`**: Characters shared between consecutive chunks (helps maintain context).
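
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter in plain Python. It is a simplification for illustration only: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on natural separators rather than cutting at arbitrary positions.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Split text into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Hypothetical helper, not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny example so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last three characters of the previous one, which is the context that `chunk_overlap` preserves across a chunk boundary.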

In [11]:
# Let's split the document into chunks
print("\nSplitting the loaded document into smaller chunks...")

# Let's initialize the splitter, which by default tries to split on paragraph breaks (\n\n), then line breaks (\n),
# then spaces (' '), and finally individual characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,    # Chunks of at most 1000 characters
                                               chunk_overlap = 150)  # Consecutive chunks share up to 150 characters

# Split the raw document(s) into smaller Document objects (chunks)
documents = text_splitter.split_documents(raw_documents)

# Check if splitting produced any documents
if not documents:
    raise ValueError("Error: Splitting resulted in zero documents. Check the input file and splitter settings.")
print(f"Document split into {len(documents)} chunks.")



Splitting the loaded document into smaller chunks...
Document split into 38 chunks.


In [12]:
# Let's display the Python list containing document chunks 
documents

[Document(metadata={'source': 'eleven_madison_park_data.txt'}, page_content='Source: https://www.elevenmadisonpark.com/\nTitle: Eleven Madison Park\nContent:\nBook on Resy\n---END OF SOURCE---'),
 Document(metadata={'source': 'eleven_madison_park_data.txt'}, page_content='Source: https://www.elevenmadisonpark.com/careers\nTitle: Careers — Eleven Madison Park\nContent:'),
 Document(metadata={'source': 'eleven_madison_park_data.txt'}, page_content="Join Our Team Eleven Madison Park ▾ All Businesses Eleven Madison Park Clemente Bar Daniel Humm Hospitality Filter Categories Culinary Pastry Wine & Beverage Dining Room Office & Admin Other Job Types Full Time Part Time Compensation Salary Hourly Apply filters OPEN OPPORTUNITIES Staff Accountant - Part Time Eleven Madison Park Part Time • Hourly ($20 - $25) Host/Reservationist Eleven Madison Park Full Time • Hourly ($24) Sous Chef Eleven Madison Park Full Time • Salary ($72K - $75K) Pastry Cook Eleven Madison Park Full Time • Hourly ($18 - $2

In [13]:
# Let's display an example chunk and its metadata
print("\n--- Example Chunk (Chunk 2) ---")
print(documents[2].page_content)
print("\n--- Metadata for Chunk 2 ---")
print(documents[2].metadata) # Should show {'source': 'eleven_madison_park_data.txt'}


--- Example Chunk (Chunk 2) ---
Join Our Team Eleven Madison Park ▾ All Businesses Eleven Madison Park Clemente Bar Daniel Humm Hospitality Filter Categories Culinary Pastry Wine & Beverage Dining Room Office & Admin Other Job Types Full Time Part Time Compensation Salary Hourly Apply filters OPEN OPPORTUNITIES Staff Accountant - Part Time Eleven Madison Park Part Time • Hourly ($20 - $25) Host/Reservationist Eleven Madison Park Full Time • Hourly ($24) Sous Chef Eleven Madison Park Full Time • Salary ($72K - $75K) Pastry Cook Eleven Madison Park Full Time • Hourly ($18 - $20) Kitchen Server Eleven Madison Park Full Time • Hourly ($16) plus tips Dining Room Manager Eleven Madison Park Full Time • Salary ($72K - $75K) Porter Manager Eleven Madison Park Full Time • Salary ($70K - $75K) Senior Sous Chef Eleven Madison Park Full Time • Salary ($85K - $95K) Maitre D Eleven Madison Park Full Time • Hourly ($16) plus tips Even if you don't see the opportunity you're looking for, we would sti

# EMBEDDINGS AND VECTOR STORE CREATION 


Now, we convert our text chunks into **embeddings** (numerical vectors) using OpenAI. Similar text chunks will have similar vectors. We then store these vectors in a **vector store** (ChromaDB) for fast searching.

*   **Embeddings:** Text -> Numbers (Vectors) representing meaning.
*   **Vector Store:** Database optimized for searching these vectors.
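
As a rough illustration of "similar text has similar vectors", the sketch below compares hand-made 3-dimensional vectors using cosine similarity, one common similarity measure. The vectors and their labels are invented for this example; real OpenAI embeddings have far more dimensions, and the metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumed values, not produced by any model)
menu_vec = [0.9, 0.1, 0.0]      # stands in for a chunk about menus
tasting_vec = [0.8, 0.2, 0.1]   # stands in for a chunk about the tasting menu
careers_vec = [0.0, 0.2, 0.9]   # stands in for a chunk about job openings

sim_related = cosine_similarity(menu_vec, tasting_vec)
sim_unrelated = cosine_similarity(menu_vec, careers_vec)

print(f"menu vs tasting: {sim_related:.3f}")
print(f"menu vs careers: {sim_unrelated:.3f}")
```

The two menu-related vectors score much higher than the menu/careers pair, which is the property the vector store relies on when searching.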

In [22]:
# Let's initialize our embeddings model. Note that we will use OpenAI's embedding model 
print("Initializing OpenAI Embeddings model...")

# Create an instance of the OpenAI Embeddings model
# We pass the API key we loaded earlier; LangChain uses it for every embedding request
embeddings = OpenAIEmbeddings(openai_api_key = openai_api_key)

print("OpenAI Embeddings model initialized.")

# Let's Create ChromaDB Vector Store
print("\nCreating ChromaDB vector store and embedding documents...")

# Each chunk in 'documents' is converted to a vector using the 'embeddings' model
# The vectors are then stored, together with the chunk text and metadata, in ChromaDB
# You could add `persist_directory="./my_chroma_db"` to save the store to disk
# You need to specify: (1) the list of chunked Document objects and (2) the embedding model to use
vector_store = Chroma.from_documents(documents = documents, embedding = embeddings)  

# Verify the number of items in the store (note: `_collection` is a private attribute and may change between versions)
vector_count = vector_store._collection.count()
print(f"ChromaDB vector store created with {vector_count} items.")

if vector_count == 0:
    raise ValueError("Vector store creation resulted in 0 items. Check previous steps.")

Initializing OpenAI Embeddings model...
OpenAI Embeddings model initialized.

Creating ChromaDB vector store and embedding documents...
ChromaDB vector store created with 38 items.


In [23]:
# Let's retrieve the first chunk of stored data from the vector store
stored_data = vector_store._collection.get(include=["embeddings", "documents"], limit = 1)  

# Display the results
print("First chunk text:\n", stored_data['documents'][0])
print("\nEmbedding vector:\n", stored_data['embeddings'][0])
print(f"\nFull embedding has {len(stored_data['embeddings'][0])} dimensions.")

First chunk text:
 Source: https://www.elevenmadisonpark.com/
Title: Eleven Madison Park
Content:
Book on Resy
---END OF SOURCE---

Embedding vector:
 [ 0.02330522 -0.01571015 -0.00706136 ... -0.02464633 -0.01022939
 -0.06158162]

Full embedding has 1536 dimensions.


# TESTING THE RETRIEVAL

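We'll use the vector store's `similarity_search` method, which embeds the query and returns the `k` chunks whose embeddings are closest to it.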
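
Conceptually, a top-`k` search ranks every stored vector by its distance to the query vector and keeps the closest `k`. The sketch below shows that idea with a tiny in-memory list and squared Euclidean distance. `toy_similarity_search`, `toy_store`, and all vectors are hypothetical; a real vector store uses an approximate index rather than scanning every item, and its distance metric depends on configuration.

```python
import heapq


def squared_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_similarity_search(query_vec, store, k=2):
    """Return the k (text, vector) items closest to query_vec (brute-force scan)."""
    return heapq.nsmallest(k, store, key=lambda item: squared_distance(query_vec, item[1]))


# Hypothetical store of (chunk text, toy embedding) pairs
toy_store = [
    ("menus and pricing", [0.9, 0.1, 0.0]),
    ("reservations", [0.6, 0.4, 0.1]),
    ("careers", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.0]  # stands in for the embedded question
results = [text for text, _ in toy_similarity_search(query_vec, toy_store, k=2)]
print(results)
```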


In [29]:
# Let's perform a similarity search in our vector store
print("\n--- Testing Similarity Search in Vector Store ---")
test_query = "What different menus are offered?"
print(f"Searching for documents similar to: '{test_query}'")


# Perform a similarity search. 'k=2' retrieves the top 2 most similar chunks
try:
    similar_docs = vector_store.similarity_search(test_query, k = 2)
    print(f"\nFound {len(similar_docs)} similar documents:")

    # Display snippets of the retrieved documents and their sources
    for i, doc in enumerate(similar_docs):
        print(f"\n--- Document {i+1} ---")
        # Displaying the first 700 chars for brevity
        content_snippet = doc.page_content[:700].strip() + "..."
        source = doc.metadata.get("source", "Unknown Source")  # Get source from metadata
        print(f"Content Snippet: {content_snippet}")
        print(f"Source: {source}")

except Exception as e:
    print(f"An error occurred during similarity search: {e}")




--- Testing Similarity Search in Vector Store ---
Searching for documents similar to: 'What different menus are offered?'

Found 2 similar documents:

--- Document 1 ---
Content Snippet: FAQs We are located at 11 Madison Avenue, on the northeast corner of East 24th and Madison Avenue, directly across the street from Madison Square Park. We offer three menus, all 100% plant-based: Full Tasting Menu : An eight- to nine-course experience priced at $365 per guest. This menu typically lasts about two to three hours and features a mix of plated and communal dishes. 5-Course Menu : Priced at $285 per guest, this menu highlights selections from the Full Tasting Menu and lasts approximately two hours. Bar Tasting Menu : Available in our lounge for $225 per guest, this menu includes four to five courses and is designed to last around two hours. Note : These durations are estimates bas...
Source: eleven_madison_park_data.txt

--- Document 2 ---
Content Snippet: Reservations are available via Res

# BUILDING & TESTING THE RAG CHAIN USING LANGCHAIN


Now we assemble the core RAG logic using Langchain's `RetrievalQAWithSourcesChain`. This chain combines:
1.  A **Retriever**: Fetches relevant documents from our `vector_store`.
2.  An **LLM**: Generates the answer based on the question and retrieved documents (we'll use OpenAI).

This specific chain type automatically handles retrieving documents, formatting them with the question for the LLM, and tracking the sources.
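
To give a feel for what "formatting them with the question" means, here is a simplified sketch of the "stuff" strategy: all retrieved chunks are concatenated into one prompt, followed by the question. The prompt wording and the `build_stuff_prompt` helper are assumptions made for illustration; LangChain's actual prompt template differs.

```python
def build_stuff_prompt(question: str, docs: list[dict]) -> str:
    """Concatenate every retrieved chunk (with its source) ahead of the question.

    Hypothetical helper; the wording is not LangChain's real template.
    """
    context = "\n\n".join(
        f"Content: {doc['page_content']}\nSource: {doc['source']}" for doc in docs
    )
    return (
        "Answer the question using only the extracts below and list the sources you used.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Stand-ins for retrieved chunks, using short excerpts from the loaded data file
retrieved = [
    {"page_content": "We offer three menus, all 100% plant-based.",
     "source": "eleven_madison_park_data.txt"},
    {"page_content": "Book on Resy",
     "source": "eleven_madison_park_data.txt"},
]

prompt = build_stuff_prompt("What different menus are offered?", retrieved)
print(prompt)
```

Because everything is placed in a single prompt, this approach only works while the combined chunks fit within the model's context limit, which is why `k` and `chunk_size` should be chosen together.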


In [50]:
# --- 1. Define the Retriever ---
# The retriever uses the vector store to fetch documents
# We configure it to retrieve the top 'k' documents
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
print("Retriever configured successfully from vector store.")

# --- 2. Define the Language Model (LLM) from OpenAI---
# Temperature controls randomness: 'temperature=0' aims for more factual, less creative answers
# Note: the value used below (1.3) is high for Q&A and can produce invented details; consider lowering it
# You can also specify the model explicitly, such as "gpt-3.5-turbo-instruct"
llm = OpenAI(temperature = 1.3, openai_api_key = openai_api_key)
print("OpenAI LLM successfully initialized.")

# --- 3. Create the RetrievalQAWithSourcesChain ---
# This chain type is designed specifically for Q&A with source tracking.
# chain_type="stuff": Puts all retrieved text directly into the prompt context.
#                      Suitable if the total text fits within the LLM's context limit.
#                      Other types like "map_reduce" handle larger amounts of text.
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm = llm,
                                                       chain_type = "stuff",
                                                       retriever = retriever,
                                                       return_source_documents = True,  # Request the actual Document objects used
                                                       verbose = True)  # Set to True to see Langchain's internal steps (can be noisy)

print("RetrievalQAWithSourcesChain created")

Retriever configured successfully from vector store.
OpenAI LLM successfully initialized.
RetrievalQAWithSourcesChain created


In [52]:
# --- Test the Full Chain ---
print("\n--- Testing the Full RAG Chain ---")
chain_test_query = "What kind of food does Eleven Madison Park serve?"
print(f"Query: {chain_test_query}")

# Run the query through the chain. Use invoke() for Langchain >= 0.1.0
# The input must be a dictionary, often with the key 'question'.
try:
    result = qa_chain.invoke({"question": chain_test_query})

    # Print the answer and sources from the result dictionary
    print("\n--- Answer ---")
    print(result.get("answer", "No answer generated."))

    print("\n--- Sources ---")
    print(result.get("sources", "No sources identified."))

    # Optionally print snippets from the source documents returned
    if "source_documents" in result:
        print("\n--- Source Document Snippets ---")
        for i, doc in enumerate(result["source_documents"]):
            content_snippet = doc.page_content[:250].strip()
            print(f"Doc {i+1}: {content_snippet}")

except Exception as e:
    print(f"\nAn error occurred while running the chain: {e}")
    # Consider adding more specific error handling if needed


--- Testing the Full RAG Chain ---
Query: What kind of food does Eleven Madison Park serve?


> Entering new RetrievalQAWithSourcesChain chain...

> Finished chain.

--- Answer ---
 Eleven Madison Park serves a discounted plant-based menu and farm-sourced à la carte snacks, reservations are required.


--- Sources ---
https://www.elevenmadisonpark.com/faq

--- Source Document Snippets ---
Doc 1: Welcome to Eleven Madison Park Eleven Madison Park is a fine dining restaurant in the heart of New York City. Overlooking Madison Square Park–one of Manhattan’s most beautiful green spaces–we sit at the base of a historic Art Deco building on the cor
Doc 2: Source: https://www.elevenmadisonpark.com/ourrestaurant
Title: About — Eleven Madison Park
Content:
Doc 3: Source: https://www.elevenmadisonpark.com/faq
Title: FAQs — Eleven Madison Park
Content:
