# Chatbot application using RAG
> This notebook presents a chatbot application based on Retrieval-Augmented Generation (RAG).

Expectations:
* The chatbot should demonstrate the basic elements of agentic behavior:
autonomous decision making and autonomous decomposition and execution of
subtasks.
* The application must use the Retrieval-Augmented Generation (RAG)
technique to integrate external knowledge sources.
* The solution should include structured documentation of the design
decisions, the architecture used, and the operational logic.

## Steps
1. Setting up the environment
2. Loading and saving the data
3. Splitting text into chunks
4. Creating embeddings and building vector database
5. Define tools and agent

### 1. Setting up the environment
* Hugging Face -> models and datasets
* LangChain -> chaining and retrieval 
* FAISS -> vector store
* LangGraph -> agentic logic
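
Before installing anything, it can help to see which of these libraries are already importable. The snippet below is a minimal sketch using only the standard library; the module names in `REQUIRED_MODULES` are assumptions about the import names behind the packages listed above (for example, `faiss-cpu` is imported as `faiss`).

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages used in this notebook
REQUIRED_MODULES = [
    "langchain_community",
    "langgraph",
    "faiss",
    "huggingface_hub",
    "transformers",
    "sentence_transformers",
    "datasets",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every module was found; anything listed still needs a `pip install`.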

In [1]:
# Installing required packages
!pip install -q langchain-community langgraph faiss-cpu huggingface_hub transformers sentence-transformers datasets tiktoken unstructured


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
# Installing parser
!pip install mwparserfromhell


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [47]:
# Importing libraries

# Core libraries
import os
import json
import torch
from typing import List

# Hugging Face & Transformers
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# LangChain for RAG logic (third-party integrations live in langchain-community)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain.tools.retriever import create_retriever_tool

# LangGraph for agentic execution
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import create_react_agent

# Utilities
import logging
logging.basicConfig(level=logging.INFO)

### 2. Loading and saving the data
> Loading a legacy Wikipedia dataset from Hugging Face.

> Saving 500 articles

In [4]:
# Loading in streaming mode, so the full dataset is not downloaded to disk
streamed_dataset = load_dataset("wikipedia", "20220301.en", split="train", streaming=True, trust_remote_code=True)

In [5]:
# Getting a sample from the data
sampled_data = []
for i, article in enumerate(streamed_dataset):
    if i >= 500:
        break
    sampled_data.append(article)

print(f"Collected {len(sampled_data)} articles.")

Collected 500 articles.
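
The same "take the first N items from a stream" pattern can also be written with `itertools.islice`, which stops pulling from the iterator as soon as N items have been read. This is a minimal sketch: `fake_stream` is a hypothetical stand-in for the streamed dataset, so the snippet runs without downloading anything.

```python
from itertools import islice


def take(stream, n):
    """Collect at most n items from any iterable without consuming the rest."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical endless stream of article-like dicts (stand-in for the dataset)."""
    i = 0
    while True:
        yield {"title": f"Article {i}", "text": f"Body of article {i}"}
        i += 1


sample = take(fake_stream(), 5)
print(f"Collected {len(sample)} articles.")
```

With the real dataset, the call would look like `take(streamed_dataset, 500)`.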


In [6]:
# Print titles of the first 10 articles
for i, article in enumerate(sampled_data[:10]):
    print(f"{i+1}. {article['title']}")

1. Anarchism
2. Autism
3. Albedo
4. A
5. Alabama
6. Achilles
7. Abraham Lincoln
8. Aristotle
9. An American in Paris
10. Academy Award for Best Production Design


In [7]:
# Viewing one article (index 9, the last title in the list above)
sample_article = sampled_data[9]

print(f"Title: {sample_article['title']}")
print(f"URL: {sample_article['url']}\n")
print(sample_article['text'][:300])  # Show only the first 300 characters

Title: Academy Award for Best Production Design
URL: https://en.wikipedia.org/wiki/Academy%20Award%20for%20Best%20Production%20Design

The Academy Award for Best Production Design recognizes achievement for art direction in film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards. This change resulted from the Art Director's branch of the Academy of Motion Pi


In [8]:
# Saving articles

# Creating a directory
os.makedirs("data", exist_ok=True)

# Save each article
for i, article in enumerate(sampled_data):
    filename = f"{article['title']}.json"
    filepath = os.path.join("data", filename)
    
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(article, f, indent=2, ensure_ascii=False)

print(f"Saved {len(sampled_data)} articles to 'data'")

Saved 500 articles to 'data'
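
Using the raw title as a file name works for this sample, but a title containing a path separator or other reserved character would break the save step. The helper below is a minimal sketch of one way to guard against that; `safe_filename` and its `max_len` parameter are hypothetical names, not part of the notebook's original code.

```python
import re


def safe_filename(title, max_len=100):
    """Turn an article title into a file name that is safe on common file systems."""
    # Replace path separators, reserved characters, and control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip(" .")
    # Fall back to a placeholder if nothing usable is left
    return (cleaned[:max_len] or "untitled") + ".json"


print(safe_filename("Anarchism"))
print(safe_filename("AC/DC"))
```

Note that two different titles can map to the same file name after cleaning, so a larger sample may also need a uniqueness check (for example, appending the article `id`).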


### 3. Splitting text into chunks
> Converting articles to LangChain documents

> Specifying chunk size and overlap

In [9]:
# LangChain Document objects
documents = [Document(page_content=article["text"], metadata={"title": article["title"]}) for article in sampled_data]

# Use recursive character splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,      # Characters per chunk
    chunk_overlap=50     # Overlap between chunks to preserve context
)

# Split the documents into chunks
chunks = splitter.split_documents(documents)

print(f"Total number of text chunks: {len(chunks)}")
print(f"\nSample chunk (from article: {chunks[0].metadata['title']}):\n")
print(chunks[0].page_content)

Total number of text chunks: 46103

Sample chunk (from article: Anarchism):

Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian


In [10]:
# Average number of chunks per article
print(f"Average chunks per article: {len(chunks) / len(sampled_data):.2f}")

Average chunks per article: 92.21
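
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on paragraph, line, and word boundaries, which this sketch does not do. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=50):
    """Split text into fixed-size windows where consecutive windows share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


text = "".join(chr(97 + i % 26) for i in range(1000))
chunks_demo = sliding_chunks(text)
print(len(chunks_demo), [len(c) for c in chunks_demo])
```

The last 50 characters of each window reappear at the start of the next one, which is what lets a sentence cut at a chunk boundary still be retrieved with some of its context.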


### 4. Creating embeddings and building vector database
> Choosing an embedding model

> Build vector database with FAISS

In [11]:
# Load a sentence transformer model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


In [14]:
# Build FAISS vector store from document chunks
vectorstore = FAISS.from_documents(chunks, embedding_model)

# Save it to disk
vectorstore.save_local("vectorstore/faiss_index")

print("FAISS vector store created and saved.")

INFO:faiss.loader:Loading faiss with AVX512 support.
INFO:faiss.loader:Successfully loaded faiss with AVX512 support.
INFO:faiss:Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. This is only an error if you're trying to use GPU Faiss.


FAISS vector store created and saved.
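
Conceptually, a similarity search over the vector store embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force nearest-neighbour search over toy 2-dimensional vectors. It assumes Euclidean (L2) distance, which to my knowledge is the default for LangChain's FAISS wrapper; the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS uses optimized index structures rather than this loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, vectors, k=4):
    """Return the indices of the k vectors closest to the query, nearest first."""
    ranked = sorted(enumerate(vectors), key=lambda pair: l2_distance(query_vector, pair[1]))
    return [index for index, _ in ranked[:k]]


# Toy "embeddings" standing in for chunk vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(top_k([0.9, 0.1], toy_vectors, k=2))
```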


### 5. Define tools and agent
> Define search tool

> Define agent with tool

> Define nodes

In [15]:
# Defining tool for searching

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

retriever_tool = create_retriever_tool(
    retriever,
    name="vectorstore_search",
    description="Searches the document store for relevant context."
)

In [16]:
!pip install "accelerate>=0.26.0"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [1]:
# Defining generator model
model_id = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Cap the answer length explicitly; otherwise generation falls back to the
# model's default length limit, which the prompt alone can exhaust
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

# Wrap it for LangChain use
llm = HuggingFacePipeline(pipeline=pipe)


In [27]:
tools = [retriever_tool]

In [42]:
# Create agent with tool

from typing import TypedDict

# Input/output state passed between graph nodes
class AgentState(TypedDict):
    input: str
    output: str

def agent_step(state: AgentState) -> AgentState:
    query = state["input"]
    # The retriever tool returns the retrieved passages as one string,
    # so it is used as the context directly (iterating over it would
    # split it into single characters)
    context = retriever_tool.invoke(query)
    prompt = f"Use the following context to answer:\n\n{context}\n\nQuestion: {query}"
    response = llm.invoke(prompt)
    return {"input": query, "output": response}

# Build the LangGraph workflow: a single agent node that runs once and ends
graph = StateGraph(AgentState)
graph.add_node("agent", agent_step)
graph.set_entry_point("agent")
graph.add_edge("agent", END)

app = graph.compile()
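
The prompt built in `agent_step` grows with the amount of retrieved text, and a small generator model has a limited context window. The sketch below shows one way to keep the context within a budget before building the prompt. It is an illustration under assumptions: `build_prompt` and `max_context_chars` are hypothetical names, and a character budget is only a rough proxy for a token limit.

```python
def build_prompt(question, passages, max_context_chars=2000):
    """Assemble a RAG prompt, keeping passages in order until the budget is used up."""
    selected = []
    used = 0
    for passage in passages:
        # Stop at the first passage that would exceed the character budget
        if used + len(passage) > max_context_chars:
            break
        selected.append(passage)
        used += len(passage)
    context = "\n\n".join(selected)
    return (
        f"Use the following context to answer:\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_prompt = build_prompt(
    "Who was Achilles?",
    ["Achilles was a hero of the Trojan War.", "He is the central character of the Iliad."],
)
print(demo_prompt)
```

Ending the prompt with `Answer:` is a common nudge for base (non-instruction-tuned) models to continue with an answer rather than with more questions.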

In [43]:
response = app.invoke({"input": "What is retrieval-augmented generation?"})
print(response["output"])

retrieval-augmented generation
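
When asking several questions in a row, collecting the answers in one place avoids accidentally printing a result left over from an earlier cell. The helper below is a minimal sketch: `ask_all` is a hypothetical name, and `EchoApp` is a stand-in object with the same `invoke` interface as the compiled graph, so the snippet runs without loading a model.

```python
def ask_all(app, questions):
    """Run each question through the app and map it to its own answer."""
    return {question: app.invoke({"input": question})["output"] for question in questions}


class EchoApp:
    """Hypothetical stand-in for the compiled graph; echoes the question back."""

    def invoke(self, state):
        return {"input": state["input"], "output": f"echo: {state['input']}"}


answers = ask_all(EchoApp(), ["What is retrieval-augmented generation?", "Who was Achilles?"])
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```

With the real graph, `EchoApp()` would be replaced by the compiled `app`.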


In [45]:
response = app.invoke({"input": "Who was Achilles?"})
print(response["output"])
