<a href="https://colab.research.google.com/github/Deep-Dey1/MISTRAL-RAG/blob/main/mistral_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install torch transformers datasets faiss-cpu sentence-transformers wandb flask bitsandbytes



In [2]:
!nvidia-smi

Mon Feb 24 18:33:56 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [5]:
import huggingface_hub

In [6]:
!huggingface-cli login



    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `mistral-rag` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `mistral-rag`

In [1]:
import torch
import faiss
import gc
import wandb
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from flask import Flask, request, jsonify
from huggingface_hub import notebook_login

# ==========================
# 1. Data Preparation Phase
# ==========================
wandb.init(project="rag-mistral7b", name="data_preprocessing")

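# Load Alpaca dataset
data = load_dataset("tatsu-lab/alpaca", split="train")
# Fixed seed (42 is an arbitrary choice) so the corpus order, and hence the FAISS row ids, is reproducible
data = data.shuffle(seed=42).select(range(2000))  # Reduce dataset size for memory efficiency

# Preprocess dataset
def format_text(example):
    # Many Alpaca rows have an empty "input"; join only the non-empty parts to avoid a stray double space
    prompt = " ".join(part for part in (example["instruction"], example["input"]) if part)
    return f"###Human: {prompt} ###Assistant: {example['output']}"

data = [format_text(ex) for ex in data]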

# Encode dataset using SentenceTransformer for retrieval
retriever = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = retriever.encode(data, convert_to_numpy=True)

# Create FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "rag_index.faiss")
wandb.log({"faiss_index_size": len(data)})

# Free memory
del embeddings  # Delete large variable
gc.collect()  # Manually trigger garbage collection
torch.cuda.empty_cache()  # Clear GPU cache

# ==========================
# 2. Model Loading & Saving
# ==========================
wandb.init(project="rag-mistral7b", name="model_training")

# Configure 4-bit quantization for memory efficiency
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load Mistral 7B model and tokenizer
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

# Save the 4-bit quantized base model locally (no fine-tuning step precedes this)
model.save_pretrained("mistral-rag")
tokenizer.save_pretrained("mistral-rag")




[34m[1mwandb[0m: Currently logged in as: [33mdeepdey1226[0m ([33mdeepdey1226-manipal-university-jaipur[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-67bcc1e4-68e06c3d6df24620513541dd;c04b8acc-d1bc-4228-97b6-59a949e1ad8e)

Repository Not Found for url: https://huggingface.co/api/models/deep0210/mistral-rag/preupload/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid credentials in Authorization header
Note: Creating a commit assumes that the repo already exists on the Huggingface Hub. Please use `create_repo` if it's not the case.
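
The retrieval side of the first cell relies on `faiss.IndexFlatL2`, which performs an exact (brute-force) search and reports squared Euclidean distances. As a rough illustration of what `index.search` computes, here is a minimal pure-Python sketch. The helper `l2_search` and the toy vectors are hypothetical and are not part of the notebook.

```python
import heapq

def l2_search(corpus_vectors, query_vector, top_k=3):
    """Return (distances, indices) of the top_k nearest corpus vectors.

    Distances are squared L2, mirroring what an exact flat L2 index reports.
    """
    scored = []
    for i, vec in enumerate(corpus_vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query_vector))
        scored.append((dist, i))
    # Keep the top_k smallest distances, closest first
    nearest = heapq.nsmallest(top_k, scored)
    return [d for d, _ in nearest], [i for _, i in nearest]

# Hypothetical 2-D "embeddings" standing in for sentence vectors
toy_corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(toy_corpus, [0.9, 0.1], top_k=2)
print(indices, distances)
```

The returned indices are row positions in the corpus, which is why the list passed to `index.add` and the list used to look up retrieved texts must be in the same order.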

In [2]:
notebook_login()
model.push_to_hub("mistral-rag")
tokenizer.push_to_hub("mistral-rag")



CommitInfo(commit_url='https://huggingface.co/deep0210/mistral-rag/commit/edbb3c82088673fe785887a3f79e234c3458ba2c', commit_message='Upload tokenizer', commit_description='', oid='edbb3c82088673fe785887a3f79e234c3458ba2c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/deep0210/mistral-rag', endpoint='https://huggingface.co', repo_type='model', repo_id='deep0210/mistral-rag'), pr_revision=None, pr_num=None)

In [3]:
# =========================
# 3. Inference & Evaluation
# =========================
wandb.init(project="rag-mistral7b", name="inference_evaluation")

# Load the quantized model that was pushed to the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("deep0210/mistral-rag")
model = AutoModelForCausalLM.from_pretrained("deep0210/mistral-rag", device_map="auto")

def retrieve(query, top_k=3):
    query_emb = retriever.encode([query], convert_to_numpy=True)
    _, idx = index.search(query_emb, top_k)
    retrieved_texts = [data[i] for i in idx[0]]
    return "\n".join(retrieved_texts)

app = Flask(__name__)

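@app.route("/generate", methods=["POST"])
def generate():
    # Reject requests that carry no JSON body or no "query" field
    payload = request.get_json(silent=True) or {}
    query = payload.get("query")
    if not query:
        return jsonify({"error": "Missing 'query' in JSON body"}), 400

    retrieved_context = retrieve(query)
    input_text = f"{retrieved_context}\n###Human: {query} ###Assistant:"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # max_new_tokens bounds only the generated text; max_length would also count
    # the prompt, which the retrieved context can easily push past 128 tokens
    output = model.generate(**inputs, max_new_tokens=128)

    # Decode only the newly generated tokens, not the echoed prompt
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    # Log inference performance
    wandb.log({"query": query, "response_length": len(response)})

    return jsonify({"response": response})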

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)



Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.



 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:Press CTRL+C to quit
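
The `/generate` handler above prepends retrieved passages to the query, so the prompt grows with `top_k`. One way to keep it bounded is to drop trailing passages once a budget is exceeded. The sketch below assumes a whitespace word count is an acceptable stand-in for the tokenizer's token count. `build_prompt` and `max_context_words` are hypothetical names, not part of the notebook.

```python
def build_prompt(passages, query, max_context_words=200):
    """Join retrieved passages and the query, keeping passages until the budget is used up."""
    kept, used = [], 0
    for passage in passages:
        n_words = len(passage.split())
        if used + n_words > max_context_words:
            break  # passages are assumed to be ordered best-first
        kept.append(passage)
        used += n_words
    context = "\n".join(kept)
    # Same ###Human / ###Assistant layout the notebook uses
    return f"{context}\n###Human: {query} ###Assistant:"

prompt = build_prompt(
    ["Paris is the capital of France.", "Lyon is a city in France."],
    "What is the capital of France?",
    max_context_words=8,
)
print(prompt)
```

A real implementation would more likely count tokens with the model's own tokenizer, since word counts underestimate token counts for most text.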


In [4]:
!zip -r mistral-rag.zip mistral-rag


In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the quantized model and tokenizer from the Hugging Face Hub
model_path = "deep0210/mistral-rag"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Function to retrieve relevant context
def retrieve(query, top_k=3):
    query_emb = retriever.encode([query], convert_to_numpy=True)
    _, idx = index.search(query_emb, top_k)
    retrieved_texts = [data[i] for i in idx[0]]
    return "\n".join(retrieved_texts)

# Function to generate response
def generate_response(query):
    retrieved_context = retrieve(query)
    input_text = f"{retrieved_context}\n###Human: {query} ###Assistant:"

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # max_new_tokens limits only the generated continuation; max_length would also count the prompt
        output = model.generate(**inputs, max_new_tokens=128)

    # Decode only the tokens generated after the prompt
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Example query
query = "What is the capital of France?"
response = generate_response(query)
print("Model Response:", response)


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
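
The `ValueError` above means `device_map="auto"` could not place every quantized module on the GPU. A back-of-the-envelope estimate of weight memory helps judge whether a model should fit at all. The sketch below assumes roughly 7.24 billion parameters for Mistral 7B and ignores activations, the KV cache, and quantization overhead, so the numbers are lower bounds. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral 7B

for label, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

On the 15360 MiB T4 reported by `nvidia-smi`, the 4-bit weights alone should fit comfortably on a fresh runtime. One possible cause of the error is therefore GPU memory still held by copies of the model loaded in earlier cells; restarting the runtime, or deleting the old `model` and calling `torch.cuda.empty_cache()` before reloading, may help.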

In [6]:
import torch
import faiss
import json
import matplotlib.pyplot as plt
import seaborn as sns
import wandb
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

# Initialize Weights & Biases
wandb.init(project="rag-mistral7b", name="inference_evaluation")

# Load model and tokenizer from local storage in Colab
model_path = "/content/mistral-rag"  # Make sure this folder exists
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Load FAISS index
index = faiss.read_index("/content/rag_index.faiss")

# Load retriever
retriever = SentenceTransformer("all-MiniLM-L6-v2")

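# Load the retrieval corpus. Its row order must match the FAISS index, so the
# shuffle seed and size must be the ones used when the index was built
# (seed=42 is an assumed value here).
corpus = load_dataset("tatsu-lab/alpaca", split="train").shuffle(seed=42).select(range(2000))

# Evaluate on the first 100 samples of the corpus
data = corpus.select(range(100))

# Define function to retrieve context
def retrieve(query, top_k=3):
    query_emb = retriever.encode([query], convert_to_numpy=True)
    _, idx = index.search(query_emb, top_k)
    # Look up rows in the full corpus: index ids range over all indexed rows, not just the 100 eval samples
    retrieved_texts = [corpus[int(i)]["output"] for i in idx[0]]
    return "\n".join(retrieved_texts)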

# Run inference on 100 samples
generated_responses = []
actual_responses = []
scores = []

for example in data:
    query = example['instruction']
    actual_response = example['output']

    retrieved_context = retrieve(query)
    input_text = f"{retrieved_context}\n###Human: {query} ###Assistant:"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # Bound only the generated continuation; max_length would also count the prompt
        output = model.generate(**inputs, max_new_tokens=128)

    # Score only the newly generated tokens, not the echoed prompt and context
    prompt_len = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    generated_responses.append(generated_text)
    actual_responses.append(actual_response)

    # Rough overlap score: unique words shared with the reference, divided by the
    # reference length (a simple recall-style proxy, not true BLEU)
    common_words = set(generated_text.split()) & set(actual_response.split())
    bleu_score = len(common_words) / max(1, len(actual_response.split()))
    scores.append(bleu_score)

    # Log to W&B
    wandb.log({"query": query, "bleu_score": bleu_score, "generated_response": generated_text})

# Save results as JSON file
results = {"generated": generated_responses, "actual": actual_responses, "scores": scores}
with open("/content/inference_results.json", "w") as f:
    json.dump(results, f)

print("Inference completed! Results saved to inference_results.json")

# ==========================
# 2️⃣ Generate Performance Graphs
# ==========================

# Set style
sns.set_style("darkgrid")

# Create BLEU score distribution plot
plt.figure(figsize=(8, 5))
sns.histplot(scores, bins=20, kde=True, color="blue")
plt.xlabel("BLEU Score")
plt.ylabel("Frequency")
plt.title("BLEU Score Distribution (Inference on 100 Samples)")
plt.savefig("/content/bleu_score_distribution.png")
wandb.log({"bleu_score_distribution": wandb.Image("/content/bleu_score_distribution.png")})

# Create sorted BLEU score line plot
plt.figure(figsize=(10, 5))
plt.plot(sorted(scores), marker="o", linestyle="-", color="red", label="BLEU Score")
plt.xlabel("Sample Index (Sorted)")
plt.ylabel("BLEU Score")
plt.title("Model Performance Over 100 Inference Samples")
plt.legend()
plt.savefig("/content/bleu_score_performance.png")
wandb.log({"bleu_score_performance": wandb.Image("/content/bleu_score_performance.png")})

print("Graphs saved! You can now download them.")


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
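
The score computed in the evaluation loop above is a unique-word overlap ratio rather than BLEU. For comparison, a minimal sketch of unigram BLEU (clipped unigram precision multiplied by a brevity penalty) is shown below, assuming whitespace tokenization and a single reference. `bleu1` is a hypothetical helper; full BLEU additionally combines higher-order n-gram precisions.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Unigram BLEU for one candidate and one reference, using whitespace tokens."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate word's count by how often it occurs in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates that are not longer than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the cat sat", "the cat sat"))
print(bleu1("the the the", "the cat sat"))
```

Clipping is what keeps a degenerate output such as a single repeated word from scoring well, which the set-based overlap in the notebook does not guard against in the same way.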