In [None]:
# %pip install openai>1.50.0 langchain>0.3.0 langgraph langchainhub langchain-openai langchain-community langchain-cli langchain_ollama tavily-python>=0.5.0 langchain_nomic nomic[local] langserve faiss-cpu tiktoken pypdf chroma jira google-search-results numexpr beautifulsoup4 scikit-learn

# Local RAG with LLaMA 3.2: Building an Adaptive System

## Introduction

In this tutorial, we'll build an advanced Retrieval Augmented Generation (RAG) system using local models. Our system will incorporate several key features:
- Adaptive routing between vector store and web search
- Self-correction and hallucination detection
- Document relevance grading

## Setting Up Our Environment

First, let's install the required packages:

In [1]:
# Install required packages
from langchain_ollama import ChatOllama
from langchain_nomic import NomicEmbeddings
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

# Initialize our local LLM
llm = ChatOllama(model="llama3.2:3b-instruct-fp16", temperature=0)
llm_json = ChatOllama(model="llama3.2:3b-instruct-fp16", temperature=0, format="json")

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Creating Our Vector Store

Let's set up a vector store with some sample documents:

In [2]:
# Sample URLs to load
urls = [
    "https://langchain-ai.github.io/langgraph/",
    "https://langchain-ai.github.io/langgraph/concepts/",
    "https://langchain-ai.github.io/langgraph/concepts/low_level/"
]

# Load and process documents
loader = WebBaseLoader(urls[0])  # Loading first URL as example
documents = loader.load()

# Split documents
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, 
    chunk_overlap=200
)
splits = splitter.split_documents(documents)

# Create vector store
embeddings = NomicEmbeddings(
    model="nomic-embed-text-v1.5", 
    inference_mode="local"
)
vectorstore = SKLearnVectorStore.from_documents(
    documents=splits,
    embedding=embeddings
)

# Create retriever
retriever = vectorstore.as_retriever(k=3)

Embedding texts: 100%|██████████| 6/6 [00:00<00:00, 15.76inputs/s]


## Building the Router Component

We'll create a router to decide between vector store and web search:

In [3]:
router_system_prompt = """You are an expert at routing questions.
Route to vectorstore for AI/ML topics and to web search for current events.
Return JSON with key 'datasource' as either 'websearch' or 'vectorstore'."""

def route_question(question):
    response = llm_json.invoke([
        SystemMessage(content=router_system_prompt),
        HumanMessage(content=question)
    ])
    return json.loads(response.content)

# Test the router
test_questions = [
    "What are transformer neural networks?",
    "Who won the latest Super Bowl?"
]

for question in test_questions:
    result = route_question(question)
    print(f"Question: {question}\nRouting: {result['datasource']}\n")

Question: What are transformer neural networks?
Routing: vectorstore

Question: Who won the latest Super Bowl?
Routing: websearch



## Implementing Document Grading

Let's create a component to grade document relevance:

In [4]:
grader_prompt = """You are grading document relevance.
Rate if the document answers the question.
Return JSON with 'relevant': true/false and 'explanation'."""

def grade_document(document, question):
    prompt = f"""Document: {document}
    Question: {question}
    Is this document relevant?"""
    
    response = llm_json.invoke([
        SystemMessage(content=grader_prompt),
        HumanMessage(content=prompt)
    ])
    return json.loads(response.content)

# Test document grading
sample_doc = "Transformers are a type of neural network architecture..."
test_question = "How do transformer networks work?"
grade = grade_document(sample_doc, test_question)
print(f"Document relevance grade: {grade}")

Document relevance grade: {'relevant': True, 'explanation': "The document provides a general overview of what transformers are, which is directly related to the question about how they work. However, it does not provide specific details on the inner workings or mechanisms of transformer networks, so while it's relevant in a broad sense, it may not be sufficient for someone looking for a detailed explanation."}


## Building the Answer Generator

Finally, let's implement our RAG answer generator:

In [5]:
def generate_answer(context, question):
    prompt = f"""Using this context: {context}
    Answer this question: {question}
    Keep the answer concise and accurate."""
    
    response = llm.invoke([
        SystemMessage(content=prompt),
        HumanMessage(content=question)
    ])
    return response.content

# Test answer generation
context = "Transformers use self-attention mechanisms..."
question = "What is the key innovation in transformer networks?"
answer = generate_answer(context, question)
print(f"Generated answer: {answer}")

Generated answer: The key innovation in transformer networks is the self-attention mechanism, which replaces traditional recurrent neural network (RNN) or convolutional neural network (CNN) architectures by allowing the model to attend to all positions in an input sequence simultaneously and weigh their importance.


## Putting It All Together

Here's how to use all components together:

In [7]:
def process_query(question):
    # 1. Route the question
    route = route_question(question)
    
    # 2. Get relevant documents
    if route['datasource'] == 'vectorstore':
        docs = retriever.invoke(question)
    else:
        print("Would perform web search here")
        return
        
    # 3. Grade documents
    relevant_docs = []
    for doc in docs:
        grade = grade_document(doc.page_content, question)
        if grade['relevant']:
            relevant_docs.append(doc)
            
    # 4. Generate answer
    if relevant_docs:
        context = "\n".join(d.page_content for d in relevant_docs)
        answer = generate_answer(context, question)
        return answer
    else:
        return "No relevant documents found"

# Test the complete system
test_question = "What are the main components LangGraph?"
result = process_query(test_question)
print(f"Final answer: {result}")

Embedding texts: 100%|██████████| 1/1 [00:00<00:00, 32.88inputs/s]


Final answer: The main components of LangGraph are:

1. **LangGraph Platform**: A commercial solution for deploying agentic applications in production, built on the open-source LangGraph framework.
2. **LangGraph Server (APIs)**: Provides multiple streaming modes optimized for various application needs.
3. **LangGraph SDKs (clients for the APIs)**: Clients for interacting with the LangGraph Server.
4. **LangGraph CLI (command line tool for building the server)**: A command-line tool for building and managing the LangGraph Server.
5. **LangGraph Studio (UI/debugger)**: A user interface for debugging and monitoring LangGraph applications.

Additionally, LangGraph consists of:

1. **Nodes**: The basic building blocks of a LangGraph graph, which can represent various components such as agents, tools, and state management.
2. **Edges**: Connections between nodes that define the flow of data and control within the graph.
3. **State**: The memory of the application, which can be persisted acr

This tutorial has shown how to build a local RAG system with adaptive routing, document grading, and answer generation. The system can intelligently choose between different data sources and verify the relevance of retrieved information before generating answers.