## Overview: Creating a RAG System with the OpenAI API and LangChain

This notebook demonstrates how to build a complete RAG (Retrieval-Augmented Generation) system using OpenAI's GPT-4o model and an OpenAI embedding model with the LangChain framework.

## Main Features:
- Uses OpenAI GPT-4o as the generation model for answering questions
- Uses OpenAI text-embedding-3-small as the embedding model for document vectorization
- Supports multiple document formats (PDF, TXT, DOCX, etc.)
- Provides an interactive Gradio web interface for user interaction
- Uses a FAISS vector database for efficient similarity search
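
As a rough orientation before the individual steps, the retrieve-then-generate flow can be sketched in plain Python. This is a toy illustration, not the notebook's implementation: word overlap stands in for embedding similarity, the passages are made up, and `score`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
# Toy retrieve-then-generate flow. Word overlap stands in for embedding
# similarity; the passages below are invented for illustration only.

def score(query: str, passage: str) -> int:
    """Count the distinct lowercase words shared by the query and the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

passages = [
    "overtime is paid for hours worked beyond forty in a week",
    "the cafeteria opens at eight",
    "timesheets are due every friday",
]
top = retrieve("when is overtime paid", passages, k=1)
prompt = build_prompt("when is overtime paid", top)
print(prompt)  # In a real system this prompt would be sent to the LLM
```

In the notebook itself, the retrieval part is handled by FAISS with OpenAI embeddings and the generation part by GPT-4o.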


In [14]:
# Step 1: Install Required Dependencies
# This step installs all necessary Python packages for the RAG system including:
# - openai: Official OpenAI Python client library
# - langchain: Framework for building LLM applications
# - langchain-openai: OpenAI integration for LangChain
# - langchain-community: Community-contributed LangChain components
# - python-dotenv: Environment variable management
# - gradio: Web interface for ML applications
# - faiss-cpu: Facebook AI Similarity Search library for vector operations
# - pypdf, python-docx, unstructured: Document processing libraries
# - ipywidgets: Interactive widgets for Jupyter notebooks

!pip install openai langchain langchain-openai langchain-community python-dotenv gradio faiss-cpu pypdf python-docx unstructured ipywidgets

print("✅ All dependencies installed successfully")


✅ All dependencies installed successfully


In [2]:
# Step 2: Configure OpenAI API Key
# This step handles the secure configuration of your OpenAI API key
# The key can be loaded from:
# 1. Environment variables (.env file)
# 2. Direct user input (secure prompt)
# The API key is required for accessing OpenAI's GPT-4o and embedding models

import os
from dotenv import load_dotenv
import getpass

# Load environment variables from .env file if it exists
load_dotenv()

# Attempt to get OpenAI API key from environment variables first
openai_api_key = os.getenv("OPENAI_API_KEY")

# If not found in environment, prompt user to enter it securely
if not openai_api_key:
    openai_api_key = getpass.getpass("Please enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = openai_api_key

# Verify that the API key has been successfully configured
if openai_api_key:
    print(f"🔑 API key configured (first 10 characters: {openai_api_key[:10]}...)")
else:
    print("❌ API key not configured, please set your OpenAI API key")


🔑 API key configured (first 10 characters: sk-proj-j4...)
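
The confirmation message above echoes the first characters of the key. If the notebook output is going to be shared, a more conservative option is to mask most of the key before printing. The sketch below assumes a hypothetical helper name, `mask_secret`, and uses a placeholder string rather than a real key.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters of a secret with '*'."""
    if len(secret) <= visible:
        # Too short to reveal anything safely: mask everything
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Placeholder value, not a real API key
print(mask_secret("sk-example-not-a-real-key"))
```

Showing only a short suffix is usually enough to tell two keys apart without exposing the prefix in saved outputs.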




In [4]:
# Step 3: Prepare Example Documents
# This step downloads the FLSA example documents (DOCX) into the EllerDocs-FLSA folder
# and collects their local paths in text_example_path.
# These documents will be processed, embedded, and stored in the vector database


import os
from pathlib import Path
import requests
from urllib.parse import unquote  # Decode %20 and other escapes

# Remote raw URLs of the DOCX files
flsa_urls = [
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Accurate%20Timekeeping%20Supervisors%2012.2.20_AH%20edits.docx",
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Eller%20FLSA%20information%209.2024_AH%20edits.docx",
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Eller%20Overtime%20Guidelines.docx",
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Employee-Guide-Accurate-Timekeeping_AH%20edits.docx",
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Salary-vs-Hourly-Guide_AH%20edits.docx",
    "https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/EllerDocs-FLSA/Supervisors-Guide-Accurate-Timekeeping_AH%20edits.docx",
]

# Target folder: same directory as this script/notebook
target_dir = Path("EllerDocs-FLSA")
target_dir.mkdir(exist_ok=True)

# Download each file
for url in flsa_urls:
    file_name = unquote(url.split("/")[-1])   # Restore spaces and other characters
    file_path = target_dir / file_name

    # Skip if file already exists (optional)
    if file_path.exists():
        print(f"✔️ Already exists: {file_path}")
        continue

    print(f"⬇️ Downloading: {file_name} ...")
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        file_path.write_bytes(resp.content)
        print(f"✅ Saved to: {file_path}")
    except Exception as e:
        print(f"❌ Failed to download {file_name}: {e}")

print("\nAll downloads finished!")


# Check for FLSA documents in the EllerDocs-FLSA folder
flsa_docs_folder = Path("EllerDocs-FLSA")

if flsa_docs_folder.exists():
    # Discover all DOCX files in the FLSA documents folder
    flsa_files = list(flsa_docs_folder.glob("*.docx"))
    print(f"Found {len(flsa_files)} FLSA-related documents:")
    for file in flsa_files:
        print(f"   - {file.name}")
    
    # Set the documents to be used (all files in the folder)
    text_example_path = [str(file) for file in flsa_files]
    print("✅ FLSA documents prepared successfully")

⬇️ Downloading: Accurate Timekeeping Supervisors 12.2.20_AH edits.docx ...
✅ Saved to: EllerDocs-FLSA/Accurate Timekeeping Supervisors 12.2.20_AH edits.docx
⬇️ Downloading: Eller FLSA information 9.2024_AH edits.docx ...
✅ Saved to: EllerDocs-FLSA/Eller FLSA information 9.2024_AH edits.docx
⬇️ Downloading: Eller Overtime Guidelines.docx ...
✅ Saved to: EllerDocs-FLSA/Eller Overtime Guidelines.docx
⬇️ Downloading: Employee-Guide-Accurate-Timekeeping_AH edits.docx ...
✅ Saved to: EllerDocs-FLSA/Employee-Guide-Accurate-Timekeeping_AH edits.docx
⬇️ Downloading: Salary-vs-Hourly-Guide_AH edits.docx ...
✅ Saved to: EllerDocs-FLSA/Salary-vs-Hourly-Guide_AH edits.docx
⬇️ Downloading: Supervisors-Guide-Accurate-Timekeeping_AH edits.docx ...
✅ Saved to: EllerDocs-FLSA/Supervisors-Guide-Accurate-Timekeeping_AH edits.docx

All downloads finished!
Found 6 FLSA-related documents:
   - Accurate Timekeeping Supervisors 12.2.20_AH edits.docx
   - Salary-vs-Hourly-Guide_AH edits.docx
   - Employee-Guide
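
The download loop in Step 3 relies on two ideas: decoding the URL's last path segment into a readable file name, and skipping files that already exist. The sketch below isolates both so they can be tried without network access. `local_name`, `download_once`, and `fake_fetch` are hypothetical names, and the `example.com` URL is a stand-in, not one of the notebook's sources.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

def local_name(url: str) -> str:
    """Decode the last path segment of a URL (e.g. %20 becomes a space)."""
    return unquote(PurePosixPath(urlsplit(url).path).name)

def download_once(url: str, target_dir: str, fetch) -> tuple[Path, bool]:
    """Save fetch(url) into target_dir unless the file is already there.

    Returns the local path and whether a download actually happened.
    """
    path = Path(target_dir) / local_name(url)
    if path.exists():
        return path, False
    path.write_bytes(fetch(url))
    return path, True

calls = []

def fake_fetch(url: str) -> bytes:
    """Stand-in for requests.get(url).content so the sketch runs offline."""
    calls.append(url)
    return b"placeholder bytes"

url = "https://example.com/docs/Eller%20Overtime%20Guidelines.docx"
with tempfile.TemporaryDirectory() as tmp:
    first = download_once(url, tmp, fake_fetch)
    second = download_once(url, tmp, fake_fetch)  # Skipped: file already exists

print(local_name(url), first[1], second[1], len(calls))
```

Passing the fetch function in as a parameter is a design choice made here only to keep the example testable; the notebook calls `requests.get` directly.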

In [5]:
# -------------------------------------------------------------
# Step 4 – Validate and Set Document Sources
# -------------------------------------------------------------
# This step confirms which documents will be used by the RAG system
# and falls back to local files if the preferred list is empty.

from pathlib import Path

if "text_example_path" in locals() and text_example_path:
    print("Using FLSA-related documents (remote or pre-defined):")
    for doc in text_example_path:
        print(f"   - {doc}")
else:
    # Fallback: search for local DOCX files under EllerDocs-FLSA/
    local_dir = Path("EllerDocs-FLSA")
    local_docs = list(local_dir.glob("*.docx")) if local_dir.exists() else []

    if local_docs:
        print("Using FLSA-related documents found locally:")
        for doc in local_docs:
            print(f"   - {doc.name}")
        text_example_path = [str(doc) for doc in local_docs]
    else:
        raise FileNotFoundError(
            "No FLSA documents found in 'text_example_path' or in the local 'EllerDocs-FLSA' folder."
        )

Using FLSA-related documents (remote or pre-defined):
   - EllerDocs-FLSA/Accurate Timekeeping Supervisors 12.2.20_AH edits.docx
   - EllerDocs-FLSA/Salary-vs-Hourly-Guide_AH edits.docx
   - EllerDocs-FLSA/Employee-Guide-Accurate-Timekeeping_AH edits.docx
   - EllerDocs-FLSA/Eller Overtime Guidelines.docx
   - EllerDocs-FLSA/Supervisors-Guide-Accurate-Timekeeping_AH edits.docx
   - EllerDocs-FLSA/Eller FLSA information 9.2024_AH edits.docx


In [6]:
# Step 5: Initialize OpenAI Embedding Model
# This step sets up the OpenAI text-embedding-3-small model which will be used to:
# 1. Convert documents into vector representations (embeddings)
# 2. Convert user queries into vector representations for similarity search
# The embedding model is crucial for the retrieval component of RAG

from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings with the text-embedding-3-small model
print("🔄 Initializing OpenAI Embedding model...")
embedding = OpenAIEmbeddings(
    model="text-embedding-3-small",  # OpenAI's efficient embedding model
    openai_api_key=openai_api_key
)

# Test the embedding model to ensure it's working correctly
test_text = "This is a test document."
embedding_result = embedding.embed_query(test_text)
print("Embedding model test successful")
print(f"Embedding dimension: {len(embedding_result)}")  # Should be 1536 for text-embedding-3-small
print(f"First 3 values: {embedding_result[:3]}")


🔄 Initializing OpenAI Embedding model...
Embedding model test successful
Embedding dimension: 1536
First 3 values: [-0.0023375607561320066, 0.05312768369913101, 0.03345499932765961]
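
To make the role of these vectors concrete: retrieval compares the query embedding with each chunk embedding and prefers the closest ones. A common closeness measure is cosine similarity, sketched below with tiny hand-written vectors instead of real 1536-dimensional embeddings. The vectors and the name `cosine_similarity` are illustrative assumptions, and the FAISS index built later in the notebook may use a different distance internally.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written 3-dimensional stand-ins for real embeddings
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # Same direction as the query
doc_far = [0.0, 1.0, 0.0]     # Orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

A vector pointing in the same direction as the query scores close to 1, while an orthogonal one scores 0, regardless of vector length.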


In [7]:
# Step 6: Initialize OpenAI GPT-4o Language Model
# This step sets up the OpenAI GPT-4o model which will be used for:
# 1. Generating answers based on retrieved document contexts
# 2. Providing conversational AI capabilities
# GPT-4o is a multimodal OpenAI model with strong reasoning ability

from langchain_openai import ChatOpenAI

# Initialize OpenAI LLM with GPT-4o model
print("🔄 Initializing OpenAI GPT-4o model...")
llm = ChatOpenAI(
    model="gpt-4o",           # Multimodal OpenAI chat model
    temperature=0.1,          # Low temperature for more focused, deterministic responses
    openai_api_key=openai_api_key
)

# Test the LLM to ensure it's working correctly
test_question = "What is 2 + 2?"
test_response = llm.invoke(test_question)
print("LLM model test successful")
print(f"Test response: {test_response.content}")


🔄 Initializing OpenAI GPT-4o model...
LLM model test successful
Test response: 2 + 2 equals 4.
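
The `temperature=0.1` setting above is described as giving more focused responses. The textbook view of why is that token scores are divided by the temperature before being turned into probabilities, so a small temperature concentrates probability on the top-scoring token. The sketch below shows that standard formulation with made-up scores; it is not a description of how OpenAI implements sampling, and `softmax_with_temperature` is a hypothetical name.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert scores to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # Subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # Invented scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)   # Low temperature
flat = softmax_with_temperature(logits, 1.0)    # Default-like temperature

print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

At temperature 0.1 nearly all probability mass lands on the first token, while at 1.0 the other candidates keep a meaningful share.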


In [8]:
# Step 7: Initialize RAG System Components
# This step sets up all the essential components for the RAG system including:
# 1. Text splitters for breaking documents into manageable chunks
# 2. Document loaders for different file formats
# 3. Vector store (FAISS) for similarity search
# 4. Prompt template for structured question-answering
# 5. Chain components for retrieval and generation

# Splitters, loaders, prompts, and Document are imported from their current
# packages (langchain_text_splitters, langchain_community, langchain_core);
# the older langchain.* paths for these names are legacy re-exports.
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
)
from langchain_community.document_loaders import (
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document
import gradio as gr

# Define available text splitters for different document processing strategies
TEXT_SPLITERS = {
    "Character": CharacterTextSplitter,              # Simple character-based splitting
    "RecursiveCharacter": RecursiveCharacterTextSplitter,  # Smart recursive splitting (recommended)
    "Markdown": MarkdownTextSplitter,                # Markdown-aware splitting
}

# Define document loaders for various file formats
LOADERS = {
    ".csv": (CSVLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),   # Microsoft Word documents
    ".html": (UnstructuredHTMLLoader, {}),           # HTML files
    ".md": (UnstructuredMarkdownLoader, {}),         # Markdown files
    ".pdf": (PyPDFLoader, {}),                       # PDF documents
    ".pptx": (UnstructuredPowerPointLoader, {}),     # PowerPoint presentations
    ".txt": (TextLoader, {"encoding": "utf8"}),      # Plain text files
}

# Define the RAG prompt template for structured question-answering
rag_prompt_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: {input}
Context: {context}
Answer:"""

print("RAG components initialized")


RAG components initialized
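
The prompt template above has two placeholders, `{input}` and `{context}`. At query time the question fills the first and the retrieved chunks fill the second. The sketch below mimics that substitution with plain `str.format`; the shortened template, the sample chunks, and the blank-line separator between chunks are assumptions for illustration, not the exact behavior of the LangChain chain.

```python
# Shortened stand-in for the notebook's rag_prompt_template
template = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "Question: {input}\n"
    "Context: {context}\n"
    "Answer:"
)

# Invented chunks standing in for retriever results
chunks = ["First retrieved chunk.", "Second retrieved chunk."]

# Join the chunks into one context string (separator chosen for this sketch)
filled = template.format(
    input="What does the policy say?",
    context="\n\n".join(chunks),
)
print(filled)
```

Because the model only sees the filled-in text, the wording of the template and the number of chunks stuffed into `{context}` directly shape the answer.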


In [9]:
# Step 8: Define Core RAG Functions
# This step defines the essential functions for the RAG system including:
# 1. Document loading and processing
# 2. Vector database creation and management
# 3. Retriever configuration and updates
# 4. Chatbot functionality with RAG integration

def load_single_document(file_path: str) -> list[Document]:
    """
    Load a single document using the appropriate loader based on file extension
    
    Args:
        file_path: Path to the document file
        
    Returns:
        List of Document objects containing the loaded content
    """
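    # Lowercase the extension so "REPORT.DOCX" matches ".docx"; a path without
    # an extension yields "" and falls through to the ValueError below
    ext = os.path.splitext(file_path)[1].lower()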
    if ext in LOADERS:
        loader_class, loader_args = LOADERS[ext]
        loader = loader_class(file_path, **loader_args)
        return loader.load()
    raise ValueError(f"Unsupported file extension: '{ext}'")

def create_vectordb(
    docs, spliter_name, chunk_size, chunk_overlap, vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold, progress=gr.Progress()
):
    """
    Initialize vector database with document processing and embedding creation
    
    Args:
        docs: List of document paths to process
        spliter_name: Text splitting strategy to use
        chunk_size: Size of text chunks for processing
        chunk_overlap: Overlap between consecutive chunks
        vector_search_top_k: Number of top results to retrieve
        vector_rerank_top_n: Number of results to rerank (not used in OpenAI version)
        run_rerank: Whether to run reranking (not supported in OpenAI version)
        search_method: Similarity search method to use
        score_threshold: Minimum similarity score threshold
        progress: Gradio progress tracker
        
    Returns:
        Status message indicating success or failure
    """
    global db
    global retriever
    global combine_docs_chain
    global rag_chain
    
    import time

    if vector_rerank_top_n > vector_search_top_k:
        gr.Warning("Search top k must be >= rerank top n")

    print(f"Starting document processing...")
    start_time = time.time()
    
    documents = []
    for doc in docs:
        # Uploaded files arrive as file objects; use their path on disk
        if not isinstance(doc, str):
            doc = doc.name
        documents.extend(load_single_document(doc))

    text_splitter = TEXT_SPLITERS[spliter_name](chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    
    print(f"Document splitting completed, {len(texts)} chunks created")
    
    # Create vector database - this will call Embedding API
    print(f"🔄 Creating vector database (may take a few seconds)...")
    embed_start = time.time()
    db = FAISS.from_documents(texts, embedding)
    embed_time = time.time() - embed_start
    print(f"Vector database created, time taken: {embed_time:.2f} seconds")
    
    if search_method == "similarity_score_threshold":
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold}
    else:
        search_kwargs = {"k": vector_search_top_k}
    
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method)
    
    # Note: reranking is not enabled because this OpenAI version has no rerank model
    if run_rerank:
        print("⚠️ OpenAI version does not support rerank functionality")
    
    prompt = PromptTemplate.from_template(rag_prompt_template)
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

    total_time = time.time() - start_time
    print(f"Vector database initialization completed, total time: {total_time:.2f} seconds")
    
    return f"Vector database is ready (time taken: {total_time:.1f} seconds)"

def update_retriever(vector_search_top_k, vector_rerank_top_n, run_rerank, search_method, score_threshold):
    """
    Update retriever configuration with new parameters
    
    Args:
        vector_search_top_k: Number of top results to retrieve
        vector_rerank_top_n: Number of results to rerank (not used in OpenAI version)
        run_rerank: Whether to run reranking (not supported in OpenAI version)
        search_method: Similarity search method to use
        score_threshold: Minimum similarity score threshold
        
    Returns:
        Status message indicating successful update
    """
    global retriever
    global rag_chain

    if vector_rerank_top_n > vector_search_top_k:
        gr.Warning("Search top k must be >= rerank top n")

    if search_method == "similarity_score_threshold":
        search_kwargs = {"k": vector_search_top_k, "score_threshold": score_threshold}
    else:
        search_kwargs = {"k": vector_search_top_k}
    
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type=search_method)
    
    if run_rerank:
        print("⚠️ OpenAI version does not support rerank functionality")
    
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

    return "Retriever updated"
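

Both functions above switch between two retrieval settings: plain top-k, and top-k with a minimum score. The difference is easiest to see on a small list of scored chunks. The sketch below is a simplified stand-in for what the retriever does, assuming relevance scores between 0 and 1 where higher is better; `select_chunks` and the scores are hypothetical.

```python
def select_chunks(scored, k, score_threshold=None):
    """Pick up to k chunk names, optionally dropping those below a threshold.

    scored: list of (chunk_name, relevance_score) pairs, higher is better.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= score_threshold]
    return [chunk for chunk, _ in ranked[:k]]

# Invented relevance scores for four chunks
scored = [("a", 0.82), ("b", 0.41), ("c", 0.67), ("d", 0.55)]

top_k_only = select_chunks(scored, k=3)
with_threshold = select_chunks(scored, k=3, score_threshold=0.6)
print(top_k_only, with_threshold)
```

With a threshold, fewer than k chunks may come back, which keeps weakly related text out of the prompt at the cost of sometimes returning very little context.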

In [10]:
# Step 9: Define Non-Streaming Bot Function
# This step defines the bot function, which generates the complete response
# in one call instead of streaming it token by token. It is still written as a
# generator (it uses yield) and yields the updated history once per request.

def bot(history,
        temperature, top_p, top_k, repetition_penalty,
        hide_full_prompt, do_rag):
    """
    Non-streaming chatbot function with RAG integration
    Generates complete answer and returns at once (more stable than streaming)
    
    Args:
        history: Chat conversation history
        temperature: LLM temperature for response randomness
        top_p: LLM top-p sampling parameter
        top_k: LLM top-k sampling parameter (not used in OpenAI)
        repetition_penalty: Penalty for repetitive text
        hide_full_prompt: Whether to hide retrieved context in response
        do_rag: Whether to use RAG for response generation
        
    Yields:
        Updated conversation history with complete response
    """
    from langchain_core.messages import HumanMessage

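    # 0) Empty history: nothing to answer
    if not history:
        yield history
        return

    # 1) OpenAI parameters
    # Note: the repetition_penalty argument is passed as OpenAI's frequency_penalty
    # (valid range -2.0 to 2.0); the two are related but not the same quantity.
    llm.temperature = temperature
    llm.model_kwargs = {
        "top_p": top_p,
        "frequency_penalty": repetition_penalty,
    }

    # 2) Replace the last (user, assistant) tuple with a mutable list
    user_msg, _ = history[-1]
    history[-1] = [user_msg, ""]          # Reserve assistant reply position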

    try:
        # ---------- RAG ----------
        if do_rag:
            resp = rag_chain.invoke({"input": user_msg})
            answer = resp.get("answer", "")

            # Whether to append retrieved content
            if (not hide_full_prompt) and ("context" in resp):
                ctx_docs = resp["context"][:3]
                if ctx_docs:
                    ctx = "\n\n📄 **Retrieved relevant content:**\n"
                    for i, doc in enumerate(ctx_docs, 1):
                        txt = getattr(doc, "page_content", str(doc))[:150]
                        ctx += f"{i}. {txt}...\n"
                    answer += ctx

        # ---------- Direct LLM ----------
        else:
            resp = llm.invoke([HumanMessage(content=user_msg)])
            answer = resp.content if hasattr(resp, "content") else str(resp)

        # Write final answer
        history[-1][1] = answer
        yield history[:-1] + [tuple(history[-1])]

    except Exception as e:
        history[-1][1] = f"❌ Error occurred: {e}"
        yield history[:-1] + [tuple(history[-1])]

    # Cleanup
    history[-1] = tuple(history[-1])
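

The history handling in `bot` can be reduced to a small pattern: take the last `(user, assistant)` pair, compute a reply, and yield a new history with that pair filled in, turning any exception into an error message. The sketch below shows only that pattern, with no LLM involved. `respond`, `echo`, and `broken` are hypothetical names, and the pair-based history format is assumed from the code above.

```python
def respond(history, answer_fn):
    """Yield the history once, with the last turn's reply filled in."""
    if not history:
        # Nothing to answer yet
        yield history
        return
    user_msg, _ = history[-1]
    try:
        reply = answer_fn(user_msg)
    except Exception as exc:
        # Surface the failure in the chat instead of raising
        reply = f"Error occurred: {exc}"
    yield history[:-1] + [(user_msg, reply)]

def echo(msg: str) -> str:
    """Stand-in for a working model call."""
    return msg.upper()

def broken(msg: str) -> str:
    """Stand-in for a failing model call."""
    raise RuntimeError("backend unavailable")

ok = list(respond([("hello", None)], echo))
failed = list(respond([("hello", None)], broken))
print(ok, failed)
```

Building a new list instead of mutating the old one is a choice made for this sketch; the notebook's function edits the last entry in place and converts it back to a tuple.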

In [11]:
# Step 10: Initialize Vector Database with Example Documents
# This step creates the vector database using the prepared documents
# It processes documents, creates embeddings, and sets up the retrieval system
# The vector database enables efficient similarity search for RAG

# Initialize vector database using example documents with optimized settings
print("🔄 Initializing vector database...")
result = create_vectordb(
    text_example_path,                 # Documents to process
    "RecursiveCharacter",               # Use smart recursive text splitting
    chunk_size=400,                     # Size of each text chunk
    chunk_overlap=50,                   # Overlap between chunks for context
    vector_search_top_k=4,              # Retrieve top 4 most similar chunks
    vector_rerank_top_n=2,              # Rerank parameter (not used in OpenAI)
    run_rerank=False,                   # Disable reranking for OpenAI version
    search_method="similarity",         # Use standard similarity search
    score_threshold=0.5,                # Minimum similarity threshold
)

print(result)
print("RAG system initialization successful!")


🔄 Initializing vector database...
Starting document processing...
Document splitting completed, 129 chunks created
🔄 Creating vector database (may take a few seconds)...
Vector database created, time taken: 5.04 seconds
Vector database initialization completed, total time: 11.79 seconds
Vector database is ready (time taken: 11.8 seconds)
RAG system initialization successful!
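
The `chunk_size=400` and `chunk_overlap=50` settings control how documents are cut before embedding. A fixed-window version of the idea is sketched below: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so neighbors share 50 characters. This is a simplification under stated assumptions; the recursive splitter used in the notebook also tries to break on separators such as paragraphs and sentences, so its chunk boundaries and counts will differ. `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks

# 1000 characters of synthetic text
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text, chunk_size=400, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.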


In [12]:
# Step 11: Launch Gradio Web Interface
# This final step creates and launches the interactive web interface for the RAG system
# The interface allows users to:
# 1. Upload and process documents
# 2. Build vector databases
# 3. Ask questions and get RAG-powered answers
# 4. Adjust system parameters in real-time

# Launch interface using Gradio framework
print("Starting Gradio interface...")

import requests

# Ensure gradio_helper.py is available (download if not present)
if not Path("gradio_helper.py").exists():
    r = requests.get(
        url="https://github.com/JiakaiDu233/RAG_LangChain_OpenAI/raw/refs/heads/main/gradio_helper.py",
        timeout=30,
    )
    r.raise_for_status()  # Stop here rather than save an error page as a Python module
    Path("gradio_helper.py").write_text(r.text, encoding="utf-8")  # Writes and closes the file
    print("Updated gradio_helper.py file")

# Import the demo creation function
from gradio_helper import make_demo

# Create the Gradio demo interface with all RAG components
demo = make_demo(
    load_doc_fn=create_vectordb,         # Function to create vector database
    run_fn=bot,                          # Chatbot function with RAG
    stop_fn=None,                        # OpenAI version doesn't need stop functionality
    update_retriever_fn=update_retriever, # Function to update retriever settings
    model_name="GPT-4o",                 # Display name for the model
    language="English",                  # Interface language
)

# Launch the interface with fallback to shared link if local fails
try:
    print("🔄 Attempting to launch Gradio interface...")
    demo.queue().launch(debug=True)
    print("✅ Launch successful!")
except Exception as e:
    print(f"⚠️ Local launch failed: {e}")
    print("🔄 Attempting to launch with shared link...")
    demo.queue().launch(share=True, debug=True)
    print("✅ Shared link launch successful!")

print("\n🎉 RAG system has been launched!")
print("\n📖 Usage instructions:")
print("1. Click 'Step 2: Build Vector Store' to build the vector database")
print("2. Enter questions in 'Step 3: Input Query'")
print("3. Upload new documents for Q&A")
print("4. Adjust RAG and LLM parameters in settings")


Starting Gradio interface...




Updated gradio_helper.py file
🔄 Attempting to launch Gradio interface...
* Running on local URL:  http://127.0.0.1:7861
⚠️ Local launch failed: Couldn't start the app because 'http://127.0.0.1:7861/gradio_api/startup-events' failed (code 502). Check your network or proxy settings to ensure localhost is accessible.
🔄 Attempting to launch with shared link...
Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
* Running on public URL: https://62e5ca071afc1cd2d2.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Starting document processing...
Document splitting completed, 129 chunks created
🔄 Creating vector database (may take a few seconds)...
Vector database created, time taken: 5.59 seconds
Vector database initialization completed, total time: 5.86 seconds
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7861 <> https://62e5ca071afc1cd2d2.gradio.live
✅ Shared link launch successful!

🎉 RAG system has been launched!

📖 Usage instructions:
1. Click 'Step 2: Build Vector Store' to build the vector database
2. Enter questions in 'Step 3: Input Query'
3. Upload new documents for Q&A
4. Adjust RAG and LLM parameters in settings
