# 🧠 "Retrieval-Augmented Generation with LLaMA 3.1 and Wikipedia"

### Packages & Libraries

In [1]:
# STEP 1
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter

# STEP 2
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle

# STEP 3
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document

# STEP 4 
import accelerate
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# STEP 5 
import gradio as gr


In [None]:
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

True
NVIDIA GeForce RTX 4060 Laptop GPU


## STEP 1: Load & Split Wikipedia into chunks

In [3]:
dataset = load_dataset("wikipedia", "20220301.simple", trust_remote_code=True)
dataset.shape

{'train': (205328, 4)}

In [4]:
print(dataset['train'][0]['text'][:500] +'...')  # First article

April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.

The Month 

April comes between March and May, making it the fourth month o...


In [5]:
# Use every article; uncomment the next line to test on a 10,000-article sample instead
#texts = [d['text'] for d in dataset['train'].select(range(10000))]  # First 10000 articles
texts = [d['text'] for d in dataset['train']]  # All articles

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

# Split each article into chunks
#all_chunks = []
#for text in texts:
#    chunks = text_splitter.split_text(text)
#    all_chunks.extend(chunks)

# Same split as a single comprehension, keeping only chunks longer than 50 characters after stripping whitespace
all_chunks = [chunk for text in texts for chunk in text_splitter.split_text(text) if len(chunk.strip()) > 50]

print(f"Total chunks created: {len(all_chunks)}")
print("Example chunk:\n", all_chunks[0][:500])

Total chunks created: 654208
Example chunk:
 April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.

The Month
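
The effect of `chunk_size=500` and `chunk_overlap=50` can be illustrated with a simplified fixed-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on the listed separators, whereas the hypothetical `sliding_chunks` helper below cuts at fixed character offsets, so its chunk boundaries will differ from the ones produced above.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 1200 digits, so the window positions are easy to check by hand
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])           # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.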


## STEP 2: Batch Embedding of Text Chunks Using all-MiniLM-L6-v2 for Vector Indexing

In [6]:
# Load a pre-trained embedding model
model2 = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')  # Fast & good for semantic search

# Embed every chunk; uncomment the next line to test on the first 1,000 chunks only
#sample_chunks = all_chunks[:1000]
sample_chunks = all_chunks[:]

# Compute embeddings (batch mode)
embeddings = model2.encode(sample_chunks, show_progress_bar=True)

Batches: 100%|██████████| 20444/20444 [08:08<00:00, 41.81it/s] 
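
The batch count in the progress bar can be checked with a little arithmetic, which also gives a rough size for the embedding matrix. The sketch below assumes the `encode()` default batch size of 32 (none is passed above) and float32 output; the 384-dimensional vector size is that of all-MiniLM-L6-v2.

```python
import math

num_chunks = 654_208   # "Total chunks created" in STEP 1
batch_size = 32        # assumed: the encode() default, since none is passed above
embedding_dim = 384    # output dimension of all-MiniLM-L6-v2
bytes_per_value = 4    # assumed: float32 output

num_batches = math.ceil(num_chunks / batch_size)
matrix_bytes = num_chunks * embedding_dim * bytes_per_value

print(num_batches)                   # 20444, matching the progress bar
print(round(matrix_bytes / 1e9, 2))  # size of the embedding matrix in GB
```

Under these assumptions the matrix is about 1 GB, which is worth knowing before saving it to disk or holding it in RAM next to the text chunks.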


✅ Optional: Save for Later Use

In [7]:
# Save embeddings and chunks
np.save("embeddings.npy", embeddings)
with open("chunks.pkl", "wb") as f:
    pickle.dump(sample_chunks, f)

In [2]:
# Load embeddings
embeddings = np.load("embeddings.npy")

# Load text chunks
with open("chunks.pkl", "rb") as f:
    sample_chunks = pickle.load(f)

## STEP 3: Building a Metadata-Enriched Vector Index for Semantic Retrieval with FAISS

In [4]:
# Convert your chunks into Document objects
#documents = [Document(page_content=chunk) for chunk in sample_chunks]

# Convert the chunks into Document objects with metadata
# (wiki_{i} is the chunk's position in sample_chunks, not an article title)
documents = [Document(page_content=chunk, metadata={"source": f"wiki_{i}"}) for i, chunk in enumerate(sample_chunks)]

# Wrap the same all-MiniLM-L6-v2 model as a LangChain embedding model
embedding_model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# Create FAISS index. Note: from_documents embeds every chunk again;
# the embeddings saved in STEP 2 are not reused here.
vectorstore = FAISS.from_documents(documents, embedding_model)

# Save index locally
vectorstore.save_local("faiss_index")


🔍 To Load the Index Later

In [6]:
# Recreate the embedding model that was used to build the index
embedding_model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# Load saved FAISS index. The docstore is stored with pickle, so
# allow_dangerous_deserialization=True should only be used on index files you created yourself.
vectorstore = FAISS.load_local(
    "faiss_index",
    embeddings=embedding_model,
    allow_dangerous_deserialization=True
)

In [12]:
query = "What is artificial intelligence?"
query2 = "What is a neural network?"

# Get results with scores (with the default FAISS index these are L2 distances, so lower means more similar)
docs_scores = vectorstore.similarity_search_with_score(query, k=3)
docs2_scores = vectorstore.similarity_search_with_score(query2, k=3)

# Print top 3 results for the first query
print(f"\n🔎 Query: {query}")
for i, (doc, score) in enumerate(docs_scores):
    print(f"\n--- Result {i+1} (Score: {score:.4f}) ---\n{doc.page_content[:500]}")

# Print top 3 results for the second query
print(f"\n🔎 Query: {query2}")
for i, (doc, score) in enumerate(docs2_scores):
    print(f"\n--- Result {i+1} (Score: {score:.4f}) ---\n{doc.page_content[:500]}")


🔎 Query: What is artificial intelligence?

--- Result 1 (Score: 0.3028) ---
Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955.

--- Result 2 (Score: 0.3395) ---
In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.

--- Result 3 (Score: 0.5215) ---
Related pages
 Neural networks
 Expert systems
 Machine learning

References
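
The scores above grow as the results become less relevant, which is what one would expect from a distance rather than a similarity. As a rough sketch, assuming the embeddings are unit-length and the index reports squared L2 distances (both are assumptions about this setup, not something the notebook verifies), a score can be converted to a cosine similarity with the identity ||a - b||² = 2 - 2·cos(a, b). The helper names below are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_from_l2_squared(d2):
    """Valid only for unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - d2 / 2.0


# Check the identity on two toy unit vectors
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.5])
print(abs(cosine_from_l2_squared(l2_squared(a, b)) - cosine(a, b)) < 1e-12)  # True

# Apply it to the three scores printed above
for score in (0.3028, 0.3395, 0.5215):
    print(score, "->", round(cosine_from_l2_squared(score), 4))
```

If the assumptions hold, the third result (the "Related pages" list) is noticeably further from the query than the first two, which suggests a distance cutoff could filter out such low-content chunks.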


## STEP 4: Quantized Inference with LLaMA 3.1 8B and Contextual Prompting via FAISS Retrieval

### 8-bit model

In [None]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 🔧 Quantization config (8-bit via bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=True
)

# ✅ Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# 🔍 Prompt
prompt = "Explain how solar panels generate electricity in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print("\n🧠 LLaMA 3.1 Response:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Delete the model if it exists 

In [None]:
import gc

# Free the 8-bit model before loading the 4-bit one (skipped if it was never loaded)
if "model" in globals():
    del model
gc.collect()
torch.cuda.empty_cache()

### 4-bit model

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 🔧 Quantization config (4-bit via bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # or torch.float16
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)


# ✅ Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# 🔍 Prompt
prompt = "Explain how solar panels generate electricity in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print("\n🧠 LLaMA 3.1 Response:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading checkpoint shards: 100%|██████████| 4/4 [00:20<00:00,  5.12s/it]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



🧠 LLaMA 3.1 Response:

Explain how solar panels generate electricity in simple terms. Solar panels convert sunlight into electricity by using special cells called photovoltaic cells. These cells contain tiny particles called electrons that are excited by the sunlight and flow through a circuit, creating an electric current. This process is known as the photovoltaic effect.
Solar panels are made up of many photovoltaic cells that are connected together to form a panel. When sunlight hits the cells, it excites the electrons, which then flow through a circuit and generate electricity. The electricity is then sent through an inverter, which converts the DC power into AC power, making it usable for homes and businesses.
Solar panels are a clean and renewable source of energy, producing no emissions or pollution. They are also a cost-effective way to generate electricity, as they can save homeowners and businesses money on their energy bills.
In simple terms, solar panels work by:
1. Conver
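
The reason for trying 8-bit and then 4-bit quantization can be sketched with a back-of-the-envelope estimate of the weight memory. The figures below assume roughly 8 billion parameters and count weights only; quantization constants, activations and the KV cache add to this, so treat the numbers as lower bounds. `weight_gigabytes` is a hypothetical helper.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumed: roughly 8 billion parameters for the 8B model

for label, bits in (("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>9}: ~{weight_gigabytes(NUM_PARAMS, bits):.0f} GB")
```

Comparing these estimates with the VRAM of the GPU reported in the first cell shows which precision can fit entirely on the device and which will rely on CPU offload.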

In [8]:
def generate_answer(query, retriever):
    # Step 1: Retrieve relevant chunks and keep the top 3 as context
    # (get_relevant_documents is deprecated in newer LangChain releases in favor of retriever.invoke)
    retrieved_docs = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in retrieved_docs[:3])

    # Step 2: Format prompt
    prompt = f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Step 3: Tokenize and generate
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)

    # outputs[0] holds the prompt tokens followed by the new tokens,
    # so the decoded string starts with the prompt itself
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [9]:
response = generate_answer("What is artificial intelligence?", vectorstore.as_retriever())
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Answer the question based on the context.

Context:
Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955.

In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.

Related pages
 Neural networks
 Expert systems
 Machine learning

References
What is Artificial Intelligence (A.I)?
Artificial intelligence
https://aiscite.blogspot.com
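
The output above repeats the whole prompt before the model's continuation, because the decoded sequence contains the input tokens as well as the generated ones. One way to show only the answer is to post-process the decoded string, as in the sketch below. `extract_answer` is a hypothetical helper and the prompt is a shortened toy example; a token-level alternative would be to slice `outputs[0]` at `inputs["input_ids"].shape[1]` before decoding.

```python
def extract_answer(decoded, prompt, marker="Answer:"):
    """Return only the generated continuation from a decoded prompt+completion string."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback when decoding altered the prompt text:
    # keep everything after the last occurrence of the marker
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Toy example standing in for a real model output
prompt = (
    "Answer the question based on the context.\n\n"
    "Context:\nAI is the ability of a machine to think and learn.\n\n"
    "Question: What is AI?\nAnswer:"
)
decoded = prompt + " The ability of a machine to think and learn."
print(extract_answer(decoded, prompt))  # The ability of a machine to think and learn.
```

The marker fallback is a heuristic: if the model itself writes "Answer:" again, only the text after the last occurrence is kept.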

## STEP 5: Real-Time Semantic Chat with Contextual Answering via Vector Retrieval and LLaMA

In [10]:
retriever = vectorstore.as_retriever()

def generate_answer(message, history):
    # history is passed in by gr.ChatInterface but is not used:
    # each question is answered independently of earlier turns
    query = message  # Get user's current message

    retrieved_docs = retriever.get_relevant_documents(query)

    if not retrieved_docs:
        return "Sorry, I couldn't find relevant information to answer your question."

    context = "\n\n".join(doc.page_content for doc in retrieved_docs[:3])

    prompt = f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)

    # Decode only the newly generated tokens so the chat reply does not repeat the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return answer


gr.ChatInterface(
    fn=generate_answer,
    title="🧠 LLaMA 3.1 RAG Chatbot",
    description="Ask anything based on Wikipedia (Simple English).",
    theme="soft",
).launch(share=True)


* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://a50130bf9dc3e5fff8.gradio.live




Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
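
The chat function relies on `truncation=True` to keep the prompt within `max_length`. Tokenizers truncate from the right by default, so an oversized context would cut off the trailing question rather than the context. With three chunks of at most 500 characters this is unlikely to trigger, but a context budget makes the behaviour explicit. The sketch below is one possible approach; `build_prompt` and `max_context_chars` are hypothetical names, and the budget is measured in characters rather than tokens for simplicity.

```python
PROMPT_TEMPLATE = (
    "Answer the question based on the context.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)


def build_prompt(query, chunks, max_context_chars=2000):
    """Join chunks in ranked order, stopping before the context exceeds the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk) + (2 if kept else 0)  # 2 characters for the "\n\n" separator
        if used + cost > max_context_chars:
            break
        kept.append(chunk)
        used += cost
    return PROMPT_TEMPLATE.format(context="\n\n".join(kept), query=query)


# Toy chunks: with a 1100-character budget only the first two fit
chunks = ["a" * 500, "b" * 500, "c" * 500]
prompt = build_prompt("What is artificial intelligence?", chunks, max_context_chars=1100)
print(prompt.endswith("Question: What is artificial intelligence?\nAnswer:"))  # True
print("c" * 500 in prompt)                                                     # False
```

Dropping whole low-ranked chunks keeps the question and the "Answer:" cue intact, whatever the retriever returns.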
