<a href="https://colab.research.google.com/github/Nour2ai/ARTficial_Neural_Network/blob/main/Mistral_7B_LoRA_RAG_Colab_Full.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mistral-7B LoRA + RAG — Colab Notebook

> Full end-to-end Colab notebook tuned for a single T4 GPU. Includes: installs, dataset conversion, LoRA fine-tuning (4-bit), FAISS RAG index, and inference/chat cell.

Run cells in order. Save outputs (LoRA adapter + FAISS index) to Google Drive if you need persistence.

## 1) Install dependencies

Run this cell first. It installs transformers, accelerate, bitsandbytes, datasets, peft, sentence-transformers, faiss-cpu, langdetect, and huggingface_hub.

In [1]:
!pip install -q --upgrade transformers accelerate bitsandbytes datasets peft sentence-transformers faiss-cpu langdetect huggingface_hub
# Optional: chromadb if you prefer it to faiss
# !pip install -q chromadb



## 2) Confirm dataset is present

This notebook expects your dataset at `/content/dataset.json`. Upload it through the Colab file browser, or copy it into the Colab filesystem, before running the cell.

In [2]:
from pathlib import Path
src = Path('/content/dataset.json')
print('Exists:', src.exists())
if src.exists():
    print('Size (bytes):', src.stat().st_size)
    import json
    # Use a context manager so the file handle is closed after loading
    with src.open('r', encoding='utf-8') as f:
        data = json.load(f)
    print('Entries:', len(data))
    from pprint import pprint
    pprint(data[:2])


Exists: True
Size (bytes): 626433
Entries: 402
[{'answer_ar': 'AVU (الجامعة الافتراضية الأكاديمية) هو نظام معلومات شامل '
               'للطلاب مصمم لإدارة بيانات الطلاب والمقررات والتسجيلات والنتائج '
               'والرسوم والأنشطة الأكاديمية الأخرى. يوفر منصة مركزية للطلاب '
               'والمدرسين والإداريين للوصول إلى وإدارة المعلومات التعليمية '
               'بكفاءة.',
  'answer_en': 'AVU (Academic Virtual University) is a comprehensive Student '
               'Information System designed to manage student data, courses, '
               'registrations, results, fees, and other academic activities. '
               'It provides a centralized platform for students, teachers, and '
               'administrators to access and manage educational information '
               'efficiently.',
  'category': 'General Information',
  'id': 1,
  'question_ar': ['ما هو AVU؟',
                  'هل يمكنك إخباري عن نظام AVU؟',
                  'ماذا يعني AVU؟',
                  'اشرح
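
Before converting, it can help to check that every entry has the shape the later cells rely on. The sketch below is a minimal check, assuming the field names visible in the sample output and in the conversion cell (`question_en`, `question_ar`, `answer_en`, `answer_ar`). Treating all four as required is an assumption of this sketch, since the conversion cell itself tolerates missing fields. The helper name `find_schema_problems` and the sample entries are made up for illustration.

```python
from typing import Any

# Field names taken from the sample output and the conversion cell.
# Treating all of them as required is an assumption of this sketch.
LIST_FIELDS = ("question_en", "question_ar")
TEXT_FIELDS = ("answer_en", "answer_ar")


def find_schema_problems(entries: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means the data looks fine."""
    problems = []
    for pos, item in enumerate(entries):
        label = f"entry {pos} (id={item.get('id')!r})"
        for field in LIST_FIELDS:
            value = item.get(field)
            if not isinstance(value, list) or not value:
                problems.append(f"{label}: '{field}' should be a non-empty list")
        for field in TEXT_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: '{field}' should be a non-empty string")
    return problems


# Tiny made-up example (not taken from the real dataset).
sample = [
    {"id": 1, "question_en": ["What is X?"], "question_ar": ["ما هو X؟"],
     "answer_en": "X is a thing.", "answer_ar": "X هو شيء."},
    {"id": 2, "question_en": [], "question_ar": ["سؤال"],
     "answer_en": "", "answer_ar": "جواب"},
]
for problem in find_schema_problems(sample):
    print(problem)
```

With the real file, the call would be `find_schema_problems(data)` using the `data` list loaded in the cell above.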

## 3) Convert dataset to instruction-style JSONL

Creates `/content/train_mixed.jsonl` with one record per question variant.

In [3]:
import json
from pathlib import Path
src = Path('/content/dataset.json')
out = Path('/content/train_mixed.jsonl')

# Use a context manager so the file handle is closed after loading
with src.open(encoding='utf-8') as f:
    raw = json.load(f)
records = []
for item in raw:
    qens = item.get('question_en') or []
    qars = item.get('question_ar') or []
    ans_en = item.get('answer_en') or ''
    ans_ar = item.get('answer_ar') or ''
    for q in qens:
        if q and ans_en:
            records.append({'instruction': q.strip(), 'input': '', 'output': ans_en.strip(), 'metadata': {'id': item.get('id')}})
    for q in qars:
        if q and ans_ar:
            records.append({'instruction': q.strip(), 'input': '', 'output': ans_ar.strip(), 'metadata': {'id': item.get('id')}})

with out.open('w',encoding='utf-8') as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + '\n')

print('Wrote', len(records), 'records to', out)


Wrote 3933 records to /content/train_mixed.jsonl
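
A quick read-back of the JSONL file can confirm that every line parses and carries the keys the tokenization step expects. This is a minimal sketch: the helper name `summarize_jsonl` and the demo lines are hypothetical, and only the `instruction`, `output`, and `metadata.id` keys written by the cell above are assumed.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Parse JSONL lines and return (record_count, Counter of source ids)."""
    per_id = Counter()
    count = 0
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"instruction", "output"} - record.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing {sorted(missing)}")
        per_id[record.get("metadata", {}).get("id")] += 1
        count += 1
    return count, per_id


# Made-up lines standing in for the real file.
demo_lines = [
    '{"instruction": "What is X?", "input": "", "output": "X is a thing.", "metadata": {"id": 1}}',
    '{"instruction": "Define X", "input": "", "output": "X is a thing.", "metadata": {"id": 1}}',
    '{"instruction": "ما هو X؟", "input": "", "output": "X هو شيء.", "metadata": {"id": 1}}',
]
total, per_id = summarize_jsonl(demo_lines)
print(total, "records from", len(per_id), "source entries")
```

For the real file, an open file handle can be passed directly, for example `summarize_jsonl(open('/content/train_mixed.jsonl', encoding='utf-8'))`.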


## 4) Tokenizer & preprocessing (datasets)

Load the converted jsonl and tokenize. Adjust `max_length` if you need lower memory usage.

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2'
dataset = load_dataset('json', data_files={'train':'/content/train_mixed.jsonl'}, split='train')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
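if tokenizer.pad_token is None:
    # Reuse an existing token for padding. Adding a new '<pad>' token would
    # grow the vocabulary and require resizing the model's embedding matrix.
    tokenizer.pad_token = tokenizer.unk_token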

max_length = 1024
def preprocess(example):
    instruction = example['instruction']
    output = example.get('output','')
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    full = prompt + output
    tok = tokenizer(full, truncation=True, max_length=max_length)
    tok['labels'] = tok['input_ids'].copy()
    return tok

dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
print('Dataset example tokenized:')
print(dataset[0])


Dataset example tokenized:
{'input_ids': [1, 774, 3133, 3112, 28747, 13, 3195, 349, 13190, 28779, 28804, 13, 13, 27332, 12107, 28747, 13, 9024, 28779, 325, 9475, 4247, 294, 19032, 2900, 28731, 349, 264, 15313, 15965, 9148, 2135, 5682, 298, 8594, 5716, 1178, 28725, 12318, 28725, 24113, 697, 28725, 2903, 28725, 12845, 28725, 304, 799, 11860, 6290, 28723, 661, 5312, 264, 5971, 1332, 5181, 354, 3567, 28725, 10880, 28725, 304, 5466, 3117, 298, 2735, 304, 8594, 14165, 1871, 23463, 28723], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1, 774, 3133, 3112, 28747, 13, 3195, 349, 13190, 28779, 28804, 13, 13, 27332, 12107, 28747, 13, 9024, 28779, 325, 9475, 4247, 294, 19032, 2900, 28731, 349, 264, 15313, 15965, 9148, 2135, 5682, 298, 8594, 5716, 1178, 28725, 12318, 28725, 24113, 697, 28725, 290
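
One way to choose `max_length` is to look at how long the tokenized examples actually are. The sketch below computes nearest-rank percentiles over a list of token counts. The helper name `length_percentiles` and the demo numbers are made up for illustration, and the real lengths would come from the tokenized dataset.

```python
import math


def length_percentiles(lengths, percentiles=(50, 90, 99, 100)):
    """Nearest-rank percentiles of a list of token counts."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    result = {}
    for p in percentiles:
        # Nearest-rank method: smallest value with at least p% of data at or below it
        rank = max(1, math.ceil(p * len(ordered) / 100))
        result[p] = ordered[rank - 1]
    return result


# Made-up token counts standing in for the real dataset.
demo_lengths = [74, 80, 95, 120, 133, 150, 210, 260, 300, 512]
print(length_percentiles(demo_lengths))
```

With the tokenized dataset from the cell above, the input would be `[len(ids) for ids in dataset['input_ids']]`. If the 99th percentile sits well below 1024, a smaller `max_length` should lose little data.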

## 5) Load Mistral in 4-bit and prepare LoRA

This cell loads the base model with BitsAndBytesConfig and applies PEFT LoRA. If you see permission errors, run `huggingface-cli login` first.

In [5]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
print('Loading model (this may take a while)...')
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map='auto',
    trust_remote_code=True
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj','k_proj','v_proj','o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


Loading model (this may take a while)...


trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.1879
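
The trainable parameter count printed above can be reproduced by hand. Each LoRA-wrapped linear layer adds two matrices, `A` (rank × in_features) and `B` (out_features × rank), so it contributes `r * (in_features + out_features)` parameters. The sketch below assumes the usual Mistral-7B shapes (hidden size 4096, 8 key/value heads of dimension 128, 32 decoder layers); these come from the model architecture, not from this notebook.

```python
# Assumed Mistral-7B shapes: hidden size 4096, 8 KV heads of dim 128, 32 layers.
HIDDEN = 4096
KV_DIM = 8 * 128
NUM_LAYERS = 32
RANK = 16  # r in the LoraConfig above

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # A is (rank x in_features), B is (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
share = 100 * total / 7_255_363_584  # "all params" from the output above
print(f"{per_layer:,} per layer, {total:,} total, {share:.4f}% of all params")
```

Under these assumptions the total matches the 13,631,488 reported by `print_trainable_parameters()`.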


## 6) Trainer: fine-tune with LoRA (T4-friendly)

Start with small batch size (1) and gradient accumulation. Adjust `num_train_epochs` as needed.

In [6]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir='/content/mistral-lora-output',
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy='epoch',
    save_total_limit=2,
    optim='paged_adamw_32bit'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)
trainer.train()
trainer.save_model('/content/mistral-lora-output/final')


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

| Step | Training Loss |
|------|---------------|
| 10   | 1.3547        |
| 20   | 0.993         |
| 30   | 0.9219        |
| 40   | 0.8375        |
| 50   | 0.766         |
| 60   | 0.6183        |
| 70   | 0.7349        |
| 80   | 0.6536        |
| 90   | 0.6277        |
| 100  | 0.5583        |
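
To put the logged steps in context, the effective batch size and the approximate number of optimizer steps follow from the training arguments. This is a rough sketch: the record count comes from the conversion cell's output, a single GPU is assumed, and the exact step count reported by `Trainer` may differ by one per epoch depending on how the installed version rounds the last partial accumulation.

```python
import math

num_records = 3933        # from the conversion cell's output
per_device_batch = 1      # per_device_train_batch_size
grad_accum_steps = 8      # gradient_accumulation_steps
num_epochs = 2            # num_train_epochs
num_gpus = 1              # assumption: a single T4

# One optimizer step consumes this many examples
effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(num_records / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(effective_batch, steps_per_epoch, total_steps)
```

By this estimate the log excerpt, which ends at step 100, covers roughly a tenth of the full run.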



## 7) Save LoRA adapter only (small)

Saving only adapter weights is quick and small compared to saving the full base model.

In [7]:
adapter_dir = '/content/mistral-lora-output/lora_adapter'
model.save_pretrained(adapter_dir)
print('Saved LoRA adapter to', adapter_dir)


Saved LoRA adapter to /content/mistral-lora-output/lora_adapter


## 8) Build FAISS index for RAG

We use `all-MiniLM-L6-v2` to embed passages and FAISS to index them locally.

In [8]:
from sentence_transformers import SentenceTransformer
import faiss
import json

embedder = SentenceTransformer('all-MiniLM-L6-v2')
docs = []
with open('/content/train_mixed.jsonl','r',encoding='utf-8') as f:
    for i,line in enumerate(f):
        obj = json.loads(line)
        docs.append({'id':i,'text':obj.get('output',''),'instruction':obj.get('instruction','')})

texts = [d['text'] or d['instruction'] for d in docs]
embs = embedder.encode(texts, show_progress_bar=True, convert_to_numpy=True)
d = embs.shape[1]
index = faiss.IndexFlatL2(d)
index.add(embs)
faiss.write_index(index, '/content/faiss_index.idx')
with open('/content/docs_map.json','w',encoding='utf-8') as f:
    json.dump(docs, f, ensure_ascii=False)
print('FAISS index and docs saved')


FAISS index and docs saved
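
Because `train_mixed.jsonl` holds one record per question variant, the same answer text is indexed many times, and the nearest neighbours of a query may all be copies of one passage. A possible refinement, sketched below under that assumption, is to deduplicate passages before embedding them. The helper name `unique_passages` and the demo records are hypothetical.

```python
def unique_passages(records):
    """Keep the first record for each distinct passage text, preserving order."""
    seen = set()
    unique = []
    for record in records:
        # Same fallback as the indexing cell: use the instruction if output is empty
        text = (record.get("output") or record.get("instruction") or "").strip()
        if not text or text in seen:
            continue
        seen.add(text)
        # The id is the position in the deduplicated list, so it matches
        # the row number the passage would get in the FAISS index.
        unique.append({"id": len(unique), "text": text})
    return unique


# Made-up records: two question variants share one answer.
demo_records = [
    {"instruction": "What is X?", "output": "X is a thing."},
    {"instruction": "Define X", "output": "X is a thing."},
    {"instruction": "What is Y?", "output": "Y is another thing."},
]
for doc in unique_passages(demo_records):
    print(doc)
```

The deduplicated texts would then be passed to `embedder.encode` in place of `texts`, and the same list saved as the docs map so index rows and passages stay aligned.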


## 9) RAG inference / chat cell

Load base model in 4-bit, load LoRA adapter, perform retrieval and generation.

In [9]:
import json
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from peft import PeftModel

index = faiss.read_index('/content/faiss_index.idx')
with open('/content/docs_map.json','r',encoding='utf-8') as f:
    docs = json.load(f)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# load base + quant config
bnb_config = None
try:
    from transformers import BitsAndBytesConfig
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True)
except Exception as e:
    print('Could not create BitsAndBytesConfig:', e)

base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map='auto', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

# apply LoRA adapter
model = PeftModel.from_pretrained(base, '/content/mistral-lora-output/final')
model.eval()

def retrieve(query, topk=3):
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D,I = index.search(q_emb, topk)
    return [docs[i]['text'] for i in I[0]]

@torch.no_grad()
def answer(query, topk=3, max_new_tokens=256):
    ctxs = retrieve(query, topk)
    ctx_text = '\n\n---\n\n'.join(ctxs)
    prompt = f"### Retrieved passages:\n{ctx_text}\n\n### User question:\n{query}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(model.device)
    gen = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.95, temperature=0.2)
    out = tokenizer.decode(gen[0], skip_special_tokens=True)
    return out.split('### Response:')[-1].strip()

# Try a query
print(answer('How do I reset my password?'))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You can reset your password by navigating to the login page and using the 'Forgot Password' link. You'll typically need to verify your identity through email or SMS before the password reset can be completed. If you're having trouble resetting your password, contact the IT support or system administrator for assistance. It's important to keep your password secure and change it periodically for added security. Some institutions may offer two-factor authentication as an additional security measure, which can be enabled in your security settings. You can change your password by logging into your account and visiting the 'Settings' or 'Profile' section. Passwords typically require a minimum length, combination of letters, numbers, and special characters. It's recommended to change your password periodically for security. Some institutions may offer password reset through the helpdesk or IT support. You can check your password requirements by reviewing the system's guidelines or contacting 

In [11]:
print(answer('What is AVU?'))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


AVU (Academic Virtual University) is a comprehensive Student Information System designed to manage student data, courses, registrations, results, fees, and other academic activities. It provides a centralized platform for students, teachers, and administrators to access and manage educational information efficiently.
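
The `retrieve` function above relies on `IndexFlatL2`, which performs an exhaustive search and ranks stored vectors by squared Euclidean distance to the query. The pure-Python sketch below mirrors that idea on toy vectors so the ranking logic is visible. It is an illustration only, not how FAISS is implemented, and the function names are made up.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, topk=3):
    """Return (distances, indices) of the topk vectors closest to query."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    best = heapq.nsmallest(topk, scored)  # smallest distance first
    return [dist for dist, _ in best], [idx for _, idx in best]


# Toy 2-D vectors standing in for sentence embeddings.
demo_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(demo_vectors, [0.9, 0.1], topk=2)
print(indices, distances)
```

The returned indices play the role of `I[0]` in `retrieve`, where they are used to look up passage text in `docs`.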


## 10) Tips & next steps

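- Save `/content/mistral-lora-output`, `/content/faiss_index.idx`, and `/content/docs_map.json` to Google Drive for persistence. The index is only useful together with the docs map, which translates index rows back into passage text.
- If you want separate adapters per language, filter the dataset by language, train one adapter per language, and switch adapters at inference using a language detector (for example `langdetect`, installed in step 1).
- If you hit OOM: reduce `max_length`, keep `per_device_train_batch_size` at 1, and ensure nothing else uses GPU memory. Changing `gradient_accumulation_steps` does not lower peak memory; it only changes how many micro-batches are accumulated per optimizer step.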
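
The adapter-switching tip can be sketched as a small routing function. To keep the example dependency-free, it guesses the language from the share of letters in the Arabic Unicode block; this is a simple stand-in for a real detector such as `langdetect`, not a replacement for it. The adapter names `lora_en` and `lora_ar` are hypothetical and would have to match the names used when the adapters were loaded.

```python
def detect_language(text, threshold=0.3):
    """Return 'ar' if enough letters fall in the Arabic Unicode block, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"  # nothing to go on, fall back to English
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return "ar" if arabic / len(letters) >= threshold else "en"


# Hypothetical adapter names, one per language.
ADAPTER_FOR_LANGUAGE = {"en": "lora_en", "ar": "lora_ar"}


def pick_adapter(query):
    """Map a user query to the name of the adapter that should answer it."""
    return ADAPTER_FOR_LANGUAGE[detect_language(query)]


print(pick_adapter("What is AVU?"))
print(pick_adapter("ما هو AVU؟"))
```

With PEFT, the chosen name would typically be passed to `model.set_adapter(...)` before generation, assuming both adapters were loaded under those names beforehand.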


In [13]:
from google.colab import files
import shutil
import os

# Path to the directory containing the LoRA adapter
adapter_dir = '/content/mistral-lora-output/lora_adapter'
output_zip_file = '/content/lora_adapter.zip'

# Compress the directory
shutil.make_archive(os.path.splitext(output_zip_file)[0], 'zip', adapter_dir)

# Download the compressed file
files.download(output_zip_file)

print(f'Downloaded {output_zip_file}')


Downloaded /content/lora_adapter.zip
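
As an alternative to downloading a zip, the outputs can be copied into a mounted Google Drive folder. The helper below is a minimal sketch built on the standard library. The function name `persist_outputs` and the Drive folder name in the commented usage are hypothetical, and the Drive mount itself only works inside Colab.

```python
import shutil
from pathlib import Path


def persist_outputs(paths, dest_dir):
    """Copy files and directories into dest_dir; return the copied destinations."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for raw in paths:
        src = Path(raw)
        if not src.exists():
            print("Skipping missing path:", src)
            continue
        target = dest / src.name
        if src.is_dir():
            # dirs_exist_ok lets the copy be re-run over an earlier backup
            shutil.copytree(src, target, dirs_exist_ok=True)
        else:
            shutil.copy2(src, target)
        copied.append(target)
    return copied


# In Colab, Drive would be mounted first:
#   from google.colab import drive
#   drive.mount('/content/drive')
# persist_outputs(
#     ['/content/mistral-lora-output/lora_adapter',
#      '/content/faiss_index.idx',
#      '/content/docs_map.json'],
#     '/content/drive/MyDrive/mistral_lora_rag',  # hypothetical folder name
# )
```

Copying the adapter directory, the FAISS index, and the docs map together keeps everything the inference cell needs in one place.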
