# RAG Pipeline Implementation

This notebook implements a basic Retrieval-Augmented Generation (RAG) pipeline using:
- **crawl4ai**: For web crawling
- **ChromaDB**: For vector storage
- **LangChain**: For RAG orchestration
- **Gemini**: As the language model
- **Context7**: For additional context

In [1]:
# Install required packages
!pip install crawl4ai chromadb langchain langchain-google-genai langchain-community sentence-transformers nest-asyncio requests beautifulsoup4 python-dotenv



In [2]:
# Import required libraries
import asyncio
import os

import chromadb
import nest_asyncio
import requests
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Community integrations live in langchain_community; the langchain.* paths are deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI

In [3]:
from dotenv import load_dotenv
import os

# Configuration
# Load environment variables from a .env file if present
load_dotenv()
WEBSITE_URL = "https://python.langchain.com/docs/introduction/"  # Example URL - replace with your target
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # Load from .env or environment
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not found in environment variables or .env file.")

# Set environment variable for Google API
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [6]:
# Step 1: Web crawling with crawl4ai, falling back to requests + BeautifulSoup

# Apply nest_asyncio so asyncio code can run inside Jupyter's already-running event loop
nest_asyncio.apply()

def crawl_website_simple(url):
    """Simple web crawling using requests and BeautifulSoup as fallback"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.decompose()
            
        # Get text content
        text = soup.get_text()
        
        # Clean up text
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = ' '.join(chunk for chunk in chunks if chunk)
        
        return text
    except Exception as e:
        print(f"Error crawling with simple method: {e}")
        return None

async def crawl_website_advanced(url):
    """Advanced crawling with crawl4ai (if it works)"""
    try:
        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun(url=url)
            return result.markdown
    except Exception as e:
        print(f"Advanced crawling failed: {e}")
        return None

# Try advanced crawling first, fallback to simple
print(f"Crawling website: {WEBSITE_URL}")

try:
    # Try the advanced method first
    content = await crawl_website_advanced(WEBSITE_URL)
    if not content:
        raise Exception("Advanced crawling returned no content")
    print("✅ Used advanced crawling with crawl4ai")
except Exception as e:
    print(f"Advanced crawling failed: {e}")
    print("🔄 Falling back to simple crawling...")
    content = crawl_website_simple(WEBSITE_URL)
    if content:
        print("✅ Used simple crawling with requests/BeautifulSoup")
    else:
        print("❌ Both crawling methods failed")

if content:
    print(f"Crawled content length: {len(content)} characters")
    print(f"First 500 characters:\n{content[:500]}...")
else:
    print("No content was crawled. Please check the URL or network connection.")

Crawling website: https://python.langchain.com/docs/introduction/
Advanced crawling failed: 
Advanced crawling failed: Advanced crawling returned no content
🔄 Falling back to simple crawling...
✅ Used simple crawling with requests/BeautifulSoup
Crawled content length: 12218 characters
First 500 characters:
Introduction | 🦜️🔗 LangChain Skip to main contentOur Building Ambient Agents with LangGraph course is now available on LangChain Academy!IntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a simple LLM application with chat models and prompt templatesBuild a ChatbotBuild a Retrieval Augmented Generation (RAG) App: Part 2Bui...
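
The cleanup step in the fallback crawler is compact and easy to misread, so here is a minimal standalone sketch of the same whitespace normalization applied to a small string. `clean_text` is a hypothetical helper name used only for illustration; its logic mirrors the generator expressions in `crawl_website_simple`.

```python
def clean_text(text: str) -> str:
    """Collapse blank lines and runs of spaces, mirroring crawl_website_simple."""
    # Strip leading/trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces so wide gaps become separate phrases
    phrases = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty fragments and join what is left with single spaces
    return " ".join(phrase for phrase in phrases if phrase)


sample = "  Hello  \n\n  world   foo \n"
print(clean_text(sample))  # Hello world foo
```

This normalization can only tidy whitespace that is already present. It cannot restore spaces the HTML never contained, which is likely why the output above shows fused text such as `contentOur`. Passing a separator to BeautifulSoup, for example `soup.get_text(separator=" ")`, is one possible way to avoid that, though it was not tried in this run.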


In [7]:
# Step 2: Text Processing and Splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

# Split the content into chunks
documents = text_splitter.split_text(content)
print(f"Split content into {len(documents)} chunks")
print(f"First chunk preview:\n{documents[0][:300]}...")

Split content into 16 chunks
First chunk preview:
Introduction | 🦜️🔗 LangChain Skip to main contentOur Building Ambient Agents with LangGraph course is now available on LangChain Academy!IntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchIntroductionTutorialsBuild a Que...
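
To make `chunk_size` and `chunk_overlap` concrete, the sketch below implements a naive fixed-size sliding window in plain Python. It is an illustration under simplifying assumptions, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks, newlines, and spaces. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the notebook's settings (12,218 characters, `chunk_size=1000`, `chunk_overlap=200`) this naive window also yields 16 chunks, the same count as the run above, although the real splitter's boundaries will generally differ.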


In [8]:
# Step 3: Set up ChromaDB Vector Store
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store using Chroma (the wrapper creates and manages its own ChromaDB client)
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print(f"Created vector store with {len(documents)} documents")
print("Vector store ready for retrieval")



Created vector store with 16 documents
Vector store ready for retrieval
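
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity over toy 2-dimensional vectors in plain Python. It is an assumption-laden illustration: the metric Chroma actually uses depends on the collection's configuration, real embeddings have hundreds of dimensions, and Chroma uses an approximate index rather than a full scan. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))  # [0, 2]
```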


In [9]:
# Step 4: Set up Gemini LLM and RAG Chain
# Initialize Gemini model
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite-preview-06-17", temperature=0.3)

# Create retriever from vector store
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

print("RAG pipeline setup complete!")
print("Ready to answer questions based on the crawled content")

RAG pipeline setup complete!
Ready to answer questions based on the crawled content
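
The `chain_type="stuff"` setting means that all retrieved chunks (here up to `k=4`) are placed into a single prompt and sent to the model in one call. The sketch below illustrates that idea in plain Python. The prompt wording and the `build_stuff_prompt` name are hypothetical; LangChain's actual template differs.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one prompt (the "stuff" strategy)."""
    # Separate chunks with blank lines so the model can tell them apart
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is LangChain?",
    ["LangChain is a framework for LLM apps.", "langchain-core holds base abstractions."],
)
print(prompt)
```

Because everything goes into one prompt, this strategy is simple but bounded by the model's context window: `k` times the chunk size has to fit alongside the question.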


In [11]:
# Step 5: Query Function
def query_rag(question):
    """Query the RAG pipeline with a question"""
    try:
        # Use .invoke(); calling the chain object directly is deprecated
        response = qa_chain.invoke({"query": question})
        return {
            "answer": response["result"],
            "source_documents": response["source_documents"]
        }
    except Exception as e:
        return {"error": str(e)}

# Example usage function
def ask_question(question):
    """Helper function to ask questions and display results"""
    print(f"Question: {question}")
    print("-" * 50)
    
    result = query_rag(question)
    
    if "error" in result:
        print(f"Error: {result['error']}")
    else:
        print(f"Answer: {result['answer']}")
        print(f"\nBased on {len(result['source_documents'])} source documents")
    
    print("=" * 50)
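
`ask_question` reports only how many source documents were used. When debugging retrieval quality it can help to see what those chunks contain, so the sketch below formats a short preview of each one. `format_sources` and `StubDocument` are hypothetical names; the stub stands in for LangChain's `Document`, of which only the `page_content` attribute is assumed.

```python
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def format_sources(source_documents, max_chars: int = 80) -> str:
    """Return one numbered, whitespace-normalized preview line per source document."""
    lines = []
    for i, doc in enumerate(source_documents, start=1):
        snippet = " ".join(doc.page_content.split())  # collapse newlines and extra spaces
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars] + "..."
        lines.append(f"[{i}] {snippet}")
    return "\n".join(lines)


docs = [StubDocument("LangChain is a framework\nfor LLM apps."), StubDocument("x" * 100)]
print(format_sources(docs, max_chars=40))
```

In the notebook, the same function could be called on `result["source_documents"]` inside `ask_question`.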

In [12]:
# Step 6: Test the RAG Pipeline
print("Testing RAG Pipeline with example questions:")
print("=" * 60)

# Example questions (customize based on your crawled content)
example_questions = [
    "What is the main topic of this website?",
    "What are the key features mentioned?",
    "How can I get started?",
    "What are the benefits discussed?"
]

# Test with example questions
for question in example_questions:
    ask_question(question)



Testing RAG Pipeline with example questions:
Question: What is the main topic of this website?
--------------------------------------------------
Answer: The main topic of this website is the **LangChain framework**.

It provides documentation and resources for building applications with large language models, covering its various components like:

*   **langchain-core**: Base abstractions.
*   **Integration packages**: For specific models (e.g., OpenAI, Anthropic).
*   **langchain**: Chains, agents, and retrieval strategies.
*   **langchain-community**: Community-maintained integrations.
*   **langgraph**: Orchestration framework.

The site also offers guides, tutorials, and information about the LangChain ecosystem, including LangSmith and LangGraph.

Based on 4 source documents
Question: What are the key features mentioned?
--------------------------------------------------
Answer: The key features mentioned are:

*   **langchain-core:** Base abstractions for chat models and other c

## Summary

This RAG pipeline implementation includes:

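1. **Web Crawling**: Using `crawl4ai` to extract content from websites, with a `requests`/BeautifulSoup fallback (the fallback produced the content in the run recorded above)
2. **Text Processing**: Splitting content into overlapping chunks (1,000 characters with a 200-character overlap)
3. **Vector Storage**: Using ChromaDB with `all-MiniLM-L6-v2` sentence-transformer embeddings for similarity search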
4. **LLM Integration**: Gemini model for generating responses
5. **RAG Chain**: LangChain orchestration for retrieval and generation
