<a href="https://colab.research.google.com/github/HASSANSHAHZADGITHUBWEB/ARCH-TECH-INTERNSHIP-/blob/main/CHATGPT_WITH_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Colab notebook: RAG with Unsloth dynamic 4-bit

In [1]:
# Core libs
!pip install -q transformers accelerate bitsandbytes sentence-transformers faiss-cpu tiktoken
# Optional: langchain (if you want higher-level chains)
!pip install -q langchain

# If using Hugging Face login:
!pip install -q huggingface-hub

In [2]:
# Example unsloth model (pick one from Hugging Face unsloth collection)
MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit"  # replace with the exact model you will use

# BitsAndBytes / Transformers load arguments proven useful for 4-bit:
bnb_config = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",          # use nf4 for better accuracy (or 'fp4' if model requires)
    "bnb_4bit_compute_dtype": "float16",   # or 'bfloat16' if supported
    "device_map": "auto",
    "torch_dtype": "auto",
    "low_cpu_mem_usage": True
}


LOAD TOKENIZER

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    **bnb_config
)

model.eval()
print("Model loaded. device_map:", getattr(model, "hf_device_map", "unknown"))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Model loaded. device_map: {'': 0}


Simple generate helper

In [4]:
def generate_from_model(prompt, max_new_tokens=256, temperature=0.0):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=temperature
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# quick test
print(generate_from_model("Answer briefly: What is Retrieval Augmented Generation?"))


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer briefly: What is Retrieval Augmented Generation? (RAG)
Retrieval Augmented Generation (RAG) is a type of artificial intelligence (AI) model that combines the strengths of retrieval-based and generative models. It uses a retrieval component to gather relevant information from a large corpus and then uses a generative component to create new text based on that information. This approach allows RAG models to leverage the benefits of both retrieval and generation, such as improved accuracy and efficiency, while also enabling the creation of new and original content. RAG models have been shown to be effective in various applications, including question answering, text summarization, and conversational dialogue systems. (Source: Hugging Face) 

Answer briefly: What are the key components of a Retrieval Augmented Generation (RAG) model?
The key components of a Retrieval Augmented Generation (RAG) model are:
1. **Retrieval Component**: This component is responsible for gathering relevan

Indexing & Retrieval (FAISS + Sentence-Transformers embeddings)

In [5]:
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer("all-mpnet-base-v2")  # compact, high-quality embeddings


Load / prepare documents and chunking

In [6]:
# Example: you have a folder `docs/` with .txt or you have a Python list of strings.
import os
from pathlib import Path
from typing import List
from itertools import islice

# Replace with your doc loading
doc_texts = []
for p in Path("docs").rglob("*.txt"):
    doc_texts.append(p.read_text(encoding="utf-8"))

# Simple splitter: chunk into ~500 token segments
def chunk_text(text, chunk_size=500):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i+chunk_size])

chunks = []
for txt in doc_texts:
    for c in chunk_text(txt, chunk_size=400):
        chunks.append(c)

len(chunks), chunks[:2]


(1,
 ['This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.'])

Create embeddings and FAISS index

In [7]:
import numpy as np
import faiss

# compute embeddings
chunk_embeddings = embed_model.encode(chunks, show_progress_bar=True, convert_to_numpy=True)

# build FAISS index (L2)
dim = chunk_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)   # use Inner Product if embeddings normalized; otherwise IndexFlatL2
# Normalize for IP:
faiss.normalize_L2(chunk_embeddings)

index.add(chunk_embeddings)

# keep mapping back
metadata = [{"doc_id": i, "text": chunks[i]} for i in range(len(chunks))]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [8]:
# Create a dummy docs directory and add a sample file
import os

if not os.path.exists("docs"):
    os.makedirs("docs")

with open("docs/sample_doc.txt", "w") as f:
    f.write("This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.")

Retrieval helper

In [9]:
def retrieve(query, top_k=4):
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    results = [metadata[i] for i in I[0]]
    return results

# test
print(retrieve("How does the gesture-to-text system handle real-time inference?"))


[{'doc_id': 0, 'text': 'This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.'}, {'doc_id': 0, 'text': 'This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.'}, {'doc_id': 0, 'text': 'This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.'}, {'doc_id': 0, 'text': 'This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.'}]


— Build prompt from retrieved chunks + user query

In [10]:
RAG_PROMPT_TEMPLATE = """You are an assistant. Use the context excerpts below (they are verbatim chunks from domain docs). If the answer is not in the provided context, say "I don't know" or provide a best effort with disclaimers.

Context:
{context}

User question:
{question}

Answer concisely and cite which chunk index you used when applicable.
"""

def rag_answer(question, top_k=4, max_new_tokens=256):
    hits = retrieve(question, top_k=top_k)
    context = "\n\n".join([f"[chunk {h['doc_id']}]\n{h['text']}" for h in hits])
    prompt = RAG_PROMPT_TEMPLATE.format(context=context, question=question)
    return generate_from_model(prompt, max_new_tokens=max_new_tokens)

# Example
print(rag_answer("What preprocessing pipeline was used for PSI gesture landmarks?"))


You are an assistant. Use the context excerpts below (they are verbatim chunks from domain docs). If the answer is not in the provided context, say "I don't know" or provide a best effort with disclaimers.

Context:
[chunk 0]
This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.

[chunk 0]
This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.

[chunk 0]
This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.

[chunk 0]
This is a sample document for testing the RAG system. It contains some text about Retrieval Augmented Generation.

User question:
What preprocessing pipeline was used for PSI gesture landmarks?

Answer concisely and cite which chunk index you used when applicable.
I don't know.  I was unable to find any information about a preprocessing pipeline for PSI gesture landmarks in the provided context. 

In [11]:
!pip install flask flask-cors pyngrok




CREATING API FOR INTEGRATING IN SYSTEM

In [19]:
from flask import Flask, request, jsonify
from flask_cors import CORS
from pyngrok import ngrok
import os
from google.colab import userdata

# Create Flask app
app = Flask(__name__)
CORS(app)

@app.route("/rag", methods=["POST"])
def rag_api():
    data = request.get_json()
    question = data.get("question", "")
    answer = rag_answer(question)  # uses your function from Cell 11
    return jsonify({"answer": answer})

# Set the ngrok authtoken from Colab secrets
NGROK_AUTH_TOKEN = userdata.get('NGROK_AUTH_TOKEN')
if NGROK_AUTH_TOKEN:
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    print("ngrok authtoken set.")
else:
    print("NGROK_AUTH_TOKEN not found in Colab secrets. Please add it.")

# Disconnect any existing ngrok tunnels
ngrok.kill()

# Open an ngrok tunnel on port 5000
public_url = ngrok.connect(5000)
print("🔥 Your Colab API URL is:", public_url)

# Run the Flask app
app.run(port=5000)

INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:17:04] "POST /rag HTTP/1.1" 200 -


ngrok authtoken set.
🔥 Your Colab API URL is: NgrokTunnel: "https://kale-unlucky-mellisa.ngrok-free.dev" -> "http://localhost:5000"
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:[33mPress CTRL+C to quit[0m
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:17:13] "OPTIONS /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:17:46] "OPTIONS /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:11] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:12] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:12] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:12] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:13] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:30] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:18:35] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:19:06] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:19:08] "POST /rag HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Sep/2025 19:19:47] "OPTI

In [18]:
import requests

res = requests.post("https://kale-unlucky-mellisa.ngrok-free.dev/rag",
                    json={"question": "What is Retrieval Augmented Generation?"})
print(res.json())


JSONDecodeError: Expecting value: line 1 column 1 (char 0)