# **FINE TUNING CYBER BUDDY**

---



# **📅 Day 1: Foundations for Fine-Tuning LLMs**

🎯 Goal of Today:
> Understand the basics: What are LLMs, how transformers work, what embeddings and tokenization are.

1. What is a Language Model?
* A **Language Model (LM)** is trained to predict the next word (more precisely, the next token) given some text. An LLM (Large Language Model) is a much larger version of the same idea.
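
To make "predict the next word" concrete, here is a minimal sketch of a bigram model that counts which word follows which. It is only an illustration: the toy corpus and the helper names `train_bigram` and `predict_next` are made up for this example, and real LLMs use neural networks over tokens rather than word counts.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, not part of the CyberBuddy dataset
toy_corpus = [
    "phishing attacks steal passwords",
    "phishing attacks target banks",
    "phishing emails steal passwords",
]
model = train_bigram(toy_corpus)
print(predict_next(model, "phishing"))
print(predict_next(model, "steal"))
```

An LLM does the same job, next-token prediction, but conditions on the whole preceding context instead of a single word.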

---

2. Transformers
* Transformers are like attention machines - they pay attention to words in a sentence and understand the context. They work using:
    * Embeddings  (turning words into numbers)
    * Self-Attention  (deciding which words to focus on)
    * Positional Encoding  (word order matters)
    * Layers  (stacked attention and feed-forward blocks)
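
The self-attention step can be sketched in plain Python. This is a simplified illustration, assuming the queries, keys and values are the raw embeddings themselves (a real transformer applies learned projections first and adds positional information); the 2-dimensional `embeddings` are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = the input vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up token embeddings; the first two are similar to each other
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence; similar embeddings receive larger weights.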

---
3. Tokenization
* Before feeding text into a model, we must tokenize it.
[Tokenization explanation in detail](https://www.geeksforgeeks.org/what-is-tokenization/)

In [None]:
# Tokenization demo

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, Brinda!"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f'Tokens: {tokens}')
print(f'Token IDs: {ids}')

# 📅 **Day 2: Hugging Face 101 - The Language Model Playground**

🎯 Goal of the Day:
> Learn how to use Hugging Face 🤗 to browse, load, and run models and datasets. We'll also generate our first real outputs from pre-trained models!

1. What is HuggingFace?
* Hugging Face is like a GitHub of AI Models - it hosts thousands of Models and Datasets
* Core Libraries:
    * `transformers`: for working with LLMs
    * `datasets`: for ready-made datasets
    * `huggingface_hub`: to upload/share models and datasets

🔗 Explore: https://huggingface.co/models



2. Install Hugging Face on Colab

In [None]:
!pip3 install -q transformers datasets huggingface_hub

3. Try a model with Pipeline
>The easiest way to use any model <br>
>🧠 This runs GPT-2 to generate text given a prompt.



In [None]:
from transformers import pipeline

# Text generation pipeline
generator = pipeline('text-generation', model='gpt2')

output = generator("Cyber Security is", max_length=30, num_return_sequences=1)
print(output[0]['generated_text'])

📂 4. Try a Text Classification Pipeline
>🔎 We'll see whether the text is positive/negative — and how confident the model is.



In [None]:
classifier = pipeline('sentiment-analysis')

result = classifier('I like Cycling, but I am tired of it!')
print(result)

🗂️ 5. Load a Dataset
>💡 This loads 1000 news headlines with categories (world, sports, business, sci/tech).

In [None]:
# Load AG News dataset
from datasets import load_dataset
dataset = load_dataset("ag_news", split="train[:1000]", download_mode="force_redownload")
print(dataset[0])


---

# **📅 Day 3: Tokenizers Deep Dive**

🎯 Goal of the Day:

1. What is Tokenization?
2. Types of Tokenizers:
    * Word-level
    * Character-level
    * Subword-level: BPE, WordPiece, SentencePiece
3. Tokenization Demo using Hugging Face
4. Token IDs vs Tokens
5. Visualize Tokenization
6. Bonus: Custom training of your own tokenizer (BPE)

---

**1. What is Tokenization?**
> Tokenization is a process of splitting input text into pieces (called Tokens) so that models can work with them.

**Example:**
```
Input Text: "I love CyberBuddy"
Word Tokens: ["I", "love", "CyberBuddy"]
```

But models don't use words directly; they use Token IDs (integers that index into a vocabulary).
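
A minimal sketch of the token-to-ID mapping, assuming a tiny hand-built vocabulary. The helpers `build_vocab`, `encode` and `decode` and the `[UNK]` convention are illustrative choices here, not the Hugging Face API.

```python
def build_vocab(token_lists, unk_token="[UNK]"):
    """Assign each distinct token an integer ID; ID 0 is reserved for unknown tokens."""
    vocab = {unk_token: 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, unk_token="[UNK]"):
    """Map tokens to IDs, falling back to the unknown ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

def decode(ids, vocab):
    """Map IDs back to tokens."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[i] for i in ids]

vocab = build_vocab([["I", "love", "CyberBuddy"]])
ids = encode(["I", "love", "CyberBuddy"], vocab)
print(ids)
print(encode(["I", "love", "pizza"], vocab))  # "pizza" is not in the vocabulary
print(decode(ids, vocab))
```

Words outside the vocabulary collapse to the same unknown ID, which is one reason modern models prefer sub-word tokenizers.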

**2. Types of Tokenizers**

| Type   | Description | Example Tokens|
| ------------ | ------------------ | ----------------- |
| **Word-level** | Splits on whitespace  | ["I", "love", "CyberBuddy"]    |
| **Character-level** | Every character is a token | ["I", " ", "l", "o", "v", "e"] |
| **Subword-level**   | Splits into word pieces (most powerful) | ["Cy", "##ber", "Bud", "##dy"] |


**Most modern LLMs use sub-word Tokenizers**

**3. Common Sub-word Tokenizer Types:**

| Type                         | Used by        | Algorithm                                               |
| ---------------------------- | -------------- | ------------------------------------------------------- |
| **BPE (Byte Pair Encoding)** | GPT-2, Mistral | Merges frequent character pairs                         |
| **WordPiece**                | BERT           | Similar to BPE, but picks the merge that most increases training-data likelihood |
| **SentencePiece**            | T5, ALBERT     | Works on raw text without whitespace-based tokenization |


---

## 🧪 Part-1: Tokenizing with GPT-2 (Subword - BPE):

We'll see how Hugging Face Tokenizers work under the hood


🔍 Step-1: Load GPT-2 Tokenizer

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


🔍 Step-2: Tokenize a Simple Sentence

In [None]:
text = "CyberBuddy is your personal assistant!"

# Tokenize (subwords)
tokens = tokenizer.tokenize(text)

# Convert tokens to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"Input Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {ids}")


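We’ll notice odd-looking symbols like 'Ġ': GPT-2's byte-level BPE uses it to mark a token that starts with a space. Words the tokenizer has rarely seen, such as "CyberBuddy", are split into several sub-word pieces. (The "##" prefix belongs to WordPiece tokenizers such as BERT's, so it will not appear in GPT-2 output.) This is how BPE works: it splits text into sub-parts based on how frequently they occurred in its training data.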
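
The "merge the most frequent pair" idea behind BPE can be sketched in a few lines. This is a toy character-level version on a made-up word list; GPT-2's real tokenizer works on bytes and was trained on a far larger corpus, so its merges will differ.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_words, num_merges):
    """Start from single characters and apply the most frequent merge num_merges times."""
    words = dict(Counter(tuple(word) for word in corpus_words))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: "cyber" is frequent, so it should end up as one symbol
toy_words = ["cyber", "cyber", "cyber", "cybercrime", "crime"]
merges, segmented = train_bpe(toy_words, num_merges=4)
print(merges)
print(list(segmented))
```

After four merges the frequent string "cyber" has become a single symbol, while the rarer "crime" is still spelled out character by character.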

🧰 Step-3: Encode + Decode

In [None]:
# Encode directly
encoded = tokenizer.encode(text)
print(f"\nEncoded IDs: {encoded}")

# Decode back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded Text: {decoded}")


🧠 The model itself never sees raw text: a model like GPT-2 receives these token IDs as input and produces token IDs as output, which are then decoded back into text.

## 🎨 Part 2: Visualize with Token Strings and Offsets
Let’s see how tokens map back to parts of your original sentence.

In [None]:
output = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(output["input_ids"])
offsets = output["offset_mapping"]

for token, (start, end) in zip(tokens, offsets):
    print(f"{token:15} -> '{text[start:end]}'")

In [None]:
output

---

## ⭐ Bonus: Build Our Own Tokenizer


In [None]:
# 🧰 Step 0: Install Required Libraries
!pip install -q tokenizers


In [None]:
# 📦 Step 1: Prepare a small corpus for training
custom_corpus = [
    "CyberBuddy is your AI-powered cybersecurity assistant.",
    "Phishing attacks are dangerous and increasing in India.",
    "OTP frauds, malware, spyware — CyberBuddy helps prevent them.",
    "A good assistant understands your needs and protects your identity.",
    "India’s cybercrime rate is growing — awareness is key."
]


In [None]:
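# Write one sentence per line; UTF-8 keeps the non-ASCII punctuation in the corpus intact
with open("corpus.txt", "w", encoding="utf-8") as f:
    for line in custom_corpus:
        f.write(line + "\n")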


In [None]:

# 🔧 Step 2: Train a BPE Tokenizer from scratch
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


In [None]:
# Initialize empty BPE tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=100, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train(["corpus.txt"], trainer)


In [None]:
# 💾 Save tokenizer
tokenizer.save("cyberbuddy-tokenizer.json")


In [None]:
# 🔁 Step 3: Load and Test the Trained Tokenizer
from tokenizers import Tokenizer as LoadTokenizer
loaded_tokenizer = LoadTokenizer.from_file("cyberbuddy-tokenizer.json")


In [None]:
# Tokenize custom sentence
text = "CyberBuddy prevents phishing and OTP fraud."
output = loaded_tokenizer.encode(text)


In [None]:

print(f"Input: {text}")
print(f"Tokens: {output.tokens}")
print(f"Token IDs: {output.ids}")


In [None]:

# 🔍 Step 4: Compare with GPT-2 Tokenizer
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

print("\n[GPT-2 Tokenizer]")
hf_output = hf_tokenizer.tokenize(text)
print("GPT-2 Tokens:", hf_output)



We:

* Trained a custom BPE tokenizer on cybercrime text

* Saved, reloaded, and tested it

* Compared it with GPT-2

This gave us full control over how CyberBuddy or any assistant interprets your domain-specific text 🧠⚙️

---

# **📘 Day-4: Dataset Preprocessing for Fine-Tuning (CyberBuddy Edition)**

🧠 **Why Preprocess Data for Fine-Tuning?**
> Fine-Tuning teaches the model how to behave in specific situations - in our case, answering questions about cyber crime in India

> But raw data (like our JSON with crime_type, description, etc.) isn't enough — we need to restructure it into instruction-following format the model can learn from.

🤖 **What does the model expect?**
> LLMs trained for instruction following (like Mistral, LLaMA) learn from prompts like this:

```
<s>[INST] Your question goes here [/INST] The assistant’s helpful answer goes here </s>
```

* The `[INST]...[/INST]` marks the instruction.

* After `[/INST]`, you write the response.

* The `<s>` and `</s>` mark the start and end of a conversation.

> This teaches the model: “When I see a question like this, I should respond like that.”



🧾 **Our Cyber Crime Dataset Structure**

Our JSON entries look like this:

```
{
  "crime_type": "Hacking",
  "description": "...",
  "applicable_laws": "...",
  "penalty": "...",
  "prevention_tips": [...],
  "source_url": "..."
}

```

But what the model needs is something like this:

```
<s>[INST] What is the punishment and applicable law for Hacking? [/INST]
Description: ...
Applicable Law: ...
Penalty: ...
Prevention Tips: ...
Source: ...
</s>

```

✨ What We’ll Learn Today
By the end of Day 4, you’ll know:

✅ What instruction-tuning format is

✅ Why it’s needed

✅ How to write a data formatter that takes your JSON → model-ready format

✅ How to save your data as .jsonl (one sample per line)
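
Saving as `.jsonl` simply means writing one JSON object per line. A minimal sketch using only the standard library follows; the file name `cyberbuddy_sample.jsonl`, the helper names, and the two shortened sample records are placeholders for illustration.

```python
import json
import os
import tempfile

def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps characters such as the rupee sign readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder samples; the "..." stands in for the real answer text
samples = [
    {"text": "<s>[INST] What is the law and punishment for Hacking? [/INST] ... </s>"},
    {"text": "<s>[INST] What is the law and punishment for Phishing? [/INST] ... </s>"},
]
path = os.path.join(tempfile.gettempdir(), "cyberbuddy_sample.jsonl")
save_jsonl(samples, path)
print(load_jsonl(path) == samples)
```

Because every line is an independent JSON object, the file can be streamed or appended to without parsing the whole dataset.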

In [None]:
!pip3 uninstall datasets

Found existing installation: datasets 3.6.0
Uninstalling datasets-3.6.0:
  Would remove:
    /usr/local/bin/datasets-cli
    /usr/local/lib/python3.11/dist-packages/datasets-3.6.0.dist-info/*
    /usr/local/lib/python3.11/dist-packages/datasets/*
Proceed (Y/n)? y
  Successfully uninstalled datasets-3.6.0


In [None]:
!pip3 install -U datasets

## **Load the Dataset**

In [None]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="/content/drive/MyDrive/Fine Tuning CyberBuddy/data/cybercrime_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
dataset

Dataset({
    features: ['crime_type', 'description', 'applicable_laws', 'penalty', 'prevention_tips', 'source_url'],
    num_rows: 406
})

In [None]:
entry = dataset[0]

entry, entry['crime_type']

({'crime_type': 'Tampering with computer source documents',
  'description': 'Knowingly concealing, destroying or altering computer source code required by law.',
  'applicable_laws': 'Section\xa065, IT\xa0Act\xa02000',
  'penalty': 'Imprisonment up to 3\u202fyears, or fine up to ₹2\u202flakh, or both.',
  'prevention_tips': ['Use version control',
   'Maintain secure backups',
   'Restrict source code access'],
  'source_url': 'https://indiacode.nic.in/show-data?actid=AC_CEN_45_76_00001_200021_1517807324077&orderno=75'},
 'Tampering with computer source documents')

Besides `crime_type`, each entry has these fields: description, applicable laws, penalty, prevention tips, and source URL.

In [None]:

dataset[0]['applicable_laws']

'Section\xa065, IT\xa0Act\xa02000'

In [None]:
def clean_unicode(text):
    # Replace known invisible unicode characters with space or nothing
    return (text.replace('\xa0', ' ')
                .replace('\u202f', ' ')
                .replace('\u200b', '')  # zero-width space
                .replace('\ufeff', '')  # BOM
                .replace('\u2060', '')  # Word joiner
            )
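
Before hardcoding replacements, it can help to find out which invisible characters a field actually contains. The sketch below is one possible way to do that with the standard `unicodedata` module; the helper name `find_invisible` is made up, and it only flags format characters (category `Cf`) and non-standard spaces (category `Zs`).

```python
import unicodedata

def find_invisible(text):
    """List (index, code point) for format characters and spaces other than a plain ' '."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category == "Cf" or (category == "Zs" and ch != " "):
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits

print(find_invisible("Section\xa065, IT\xa0Act\xa02000"))
print(find_invisible("up to 3\u202fyears"))
```

Running this over every text field would show whether the replacement list in `clean_unicode` covers everything present in the data.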

In [None]:
def format(entry):
    # NOTE: this name shadows Python's built-in format(); it is kept because later cells call it
    # Build the question (instruction) from the crime type
    instruction = clean_unicode(f"What is the law and punishment for {entry['crime_type']}?")

    # Build the answer from the remaining fields, one field per line
    output_parts = [
        f"Description: {entry['description']}",
        f"Applicable Laws: {entry['applicable_laws']}",
        f"Penalty: {entry['penalty']}",
        f"Prevention Tips: {entry['prevention_tips']}",
        f"Source URL: {entry['source_url']}"
    ]
    output = clean_unicode('\n'.join(output_parts))

    # Drop the raw fields from the entry so only the formatted columns are kept
    del entry['crime_type']
    del entry['description']
    del entry['applicable_laws']
    del entry['penalty']
    del entry['prevention_tips']
    del entry['source_url']

    return {
        'instruction': instruction,
        "input": "",
        "output": output,
        # Full training sample in the <s>[INST] ... [/INST] ... </s> format
        "text": f"<s>[INST] {instruction} [/INST] \n {output} </s>"
    }



In [None]:
format(dataset[0])

{'instruction': 'What is the law and punishment for Tampering with computer source documents?',
 'input': '',
 'output': "Description: Knowingly concealing, destroying or altering computer source code required by law.\nApplicable Laws: Section 65, IT Act 2000\nPenalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\nPrevention Tips: ['Use version control', 'Maintain secure backups', 'Restrict source code access']\nSource URL: https://indiacode.nic.in/show-data?actid=AC_CEN_45_76_00001_200021_1517807324077&orderno=75",
 'text': "<s>[INST] What is the law and punishment for Tampering with computer source documents? [/INST] \n Description: Knowingly concealing, destroying or altering computer source code required by law.\nApplicable Laws: Section 65, IT Act 2000\nPenalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\nPrevention Tips: ['Use version control', 'Maintain secure backups', 'Restrict source code access']\nSource URL: https://indiacode.nic.in/show-data

In [None]:
# Format Dataset

dataset = dataset.map(format)

Map:   0%|          | 0/406 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'instruction': 'What is the law and punishment for Tampering with computer source documents?',
 'input': '',
 'output': "Description: Knowingly concealing, destroying or altering computer source code required by law.\nApplicable Laws: Section 65, IT Act 2000\nPenalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\nPrevention Tips: ['Use version control', 'Maintain secure backups', 'Restrict source code access']\nSource URL: https://indiacode.nic.in/show-data?actid=AC_CEN_45_76_00001_200021_1517807324077&orderno=75",
 'text': "<s>[INST] What is the law and punishment for Tampering with computer source documents? [/INST] \n Description: Knowingly concealing, destroying or altering computer source code required by law.\nApplicable Laws: Section 65, IT Act 2000\nPenalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\nPrevention Tips: ['Use version control', 'Maintain secure backups', 'Restrict source code access']\nSource URL: https://indiacode.nic.in/show-data

```
{
    "instruction": "What is the law and punishment for Tampering with computer source documents?",
    "input": "",
    "output": "🔍 Description: Knowingly concealing, destroying or altering computer source code required by law.\n⚖️ Law: Section 65, IT Act 2000\n🚨 Penalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\n✅ Prevention Tips: Use version control, Maintain secure backups, Restrict source code access",
    "text": "<s>[INST] What is the law and punishment for Tampering with computer source documents? [/INST] 🔍 Description: Knowingly concealing, destroying or altering computer source code required by law.\n⚖️ Law: Section 65, IT Act 2000\n🚨 Penalty: Imprisonment up to 3 years, or fine up to ₹2 lakh, or both.\n✅ Prevention Tips: Use version control, Maintain secure backups, Restrict source code access</s>"
  },

```


### **Shuffle Dataset**

In [None]:
dataset = dataset.shuffle(seed=42)


### **Split Dataset**

In [None]:
split_dataset = dataset.train_test_split(test_size=0.2)   # 80 - 20 ratio

train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

In [None]:
train_dataset, eval_dataset

(Dataset({
     features: ['instruction', 'input', 'output', 'text'],
     num_rows: 324
 }),
 Dataset({
     features: ['instruction', 'input', 'output', 'text'],
     num_rows: 82
 }))
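
The 324 / 82 row counts follow from the 80-20 ratio applied to 406 examples. Here is a plain-Python sketch of a seeded shuffle-and-split; it is not the `datasets` implementation, and rounding the test portion up is an assumption chosen because it reproduces the counts shown above.

```python
import math
import random

def train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)  # assumption: test size rounds up
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(406))  # stand-in for the 406 formatted examples
train_rows, eval_rows = train_test_split(rows)
print(len(train_rows), len(eval_rows))
```

Fixing the seed makes the split reproducible, which matters when comparing evaluation loss across runs.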

## **Tokenization Time ✨**

In [None]:
from huggingface_hub import login

# Paste your Hugging Face access token when prompted; avoid hardcoding it in the notebook
login()


### Step-1: Load a Tokenizer

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1', use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

### Step-2: Create Tokenization Function

In [None]:
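def tokenize_function(example):
    # With batched=True, example['text'] is a list of strings
    return tokenizer(
        example['text'],
        truncation=True,        # cut samples longer than max_length
        padding='max_length',   # pad shorter samples up to max_length
        max_length=512
        # return_tensors is not needed here: .map() stores plain lists in the dataset
    )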
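
What `truncation` and `padding='max_length'` do to a single sequence can be sketched without any library. This is an illustration with made-up token IDs and a made-up `pad_id`; it pads on the right, whereas the real tokenizer pads on whichever side its `padding_side` setting specifies.

```python
def pad_or_truncate(input_ids, max_length, pad_id):
    """Cut sequences longer than max_length and right-pad shorter ones."""
    ids = input_ids[:max_length]
    attention_mask = [1] * len(ids)   # 1 = real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding, ignored by attention
    }

short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long = pad_or_truncate(list(range(1, 11)), max_length=5, pad_id=0)
print(short)
print(long)
```

Every sample ends up with exactly `max_length` IDs, and the attention mask records which positions hold real tokens.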

### Step-3: Apply `.map()` for Tokenization

In [None]:
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

train_dataset, eval_dataset

Map:   0%|          | 0/324 [00:00<?, ? examples/s]

Map:   0%|          | 0/82 [00:00<?, ? examples/s]

(Dataset({
     features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
     num_rows: 324
 }),
 Dataset({
     features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
     num_rows: 82
 }))

---

# **🔥 Day 5: QLoRA Fine-Tuning Setup**


### 🎓 What is QLoRA?

🧠 QLoRA stands for:

> **`Quantized Low-Rank Adaptation`**: a technique to efficiently fine-tune large language models (LLMs) like Mistral-7B on limited hardware (like our Colab GPU).

**🤔 Why do we need QLoRA?**

| Without QLoRA                 | With QLoRA                                             |
| ----------------------------- | ------------------------------------------------------ |
| Requires > 48 GB VRAM 😵‍💫   | Fine-tunes on just 12–16 GB GPU (Colab-compatible!) 🥳 |
| Full model weights updated 💸 | Only *tiny adapters* are trained 🧩                    |
| Slow & expensive ⌛            | Fast & efficient ⚡                                     |


**⚙️ QLoRA Workflow:**


          ┌───────────────────────────┐
          │  Base LLM (e.g. Mistral)  │
          └────────────┬──────────────┘
                       │
        Load with 4-bit quantization  ← 💾 Memory-efficient
                       │
               Add LoRA adapters      ← 🧩 Small trainable modules
                       │
              Fine-tune on your data  ← 💻 CyberBuddy dataset
                       │
              Save only the adapters  ← 📁 Small files
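
A back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below counts memory for the weights only; the parameter count of about 7.24 billion for Mistral-7B is an approximation used for illustration, and activations, gradients, optimizer state and quantization overhead are deliberately left out.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7.24e9  # approximate parameter count for Mistral-7B (assumption)

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The 16-bit figure is roughly the combined size of the two checkpoint shards downloaded later in this notebook, and the 4-bit figure is about a quarter of that.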


✅ Key Benefits:

* ✅ You don't touch original weights (safe & reusable)

* ✅ You can share adapters easily (like plugins!)

* ✅ You can merge adapter with base model later if needed



🚨 Requirements for QLoRA Setup:

* `transformers`

* `accelerate`

* `peft` (for LoRA)

* `bitsandbytes` (for 4-bit quantization)

* HuggingFace Datasets and Tokenizers

## **🧰 Step 1: Install Required Libraries**


In [None]:
!pip3 install -q transformers datasets accelerate peft trl scipy


In [None]:
!pip3 install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-c

> 📝 `trl` is for training utilities (like SFTTrainer), and `bitsandbytes` enables 4-bit quantization.

## **🚦 Step 2: Login to HuggingFace**

In [None]:
# This was done above!

In [None]:
import torch
print(torch.cuda.is_available())


True


---
---
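Error fix: the commented-out cells below record the steps tried when `bitsandbytes` failed to import (clear the cached modules, reinstall the package, then restart the runtime).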

In [None]:
# import sys
# modules_to_delete = [k for k in sys.modules.keys() if "bitsandbytes" in k]
# for k in modules_to_delete:
#     print(k)
#     del sys.modules[k]


In [None]:
# !pip show bitsandbytes

----
----

In [None]:
# !pip uninstall -y bitsandbytes
# !rm -rf /usr/local/lib/python*/dist-packages/bitsandbytes

In [None]:
# !CUDA_VERSION=124 pip install git+https://github.com/TimDettmers/bitsandbytes.git

In [None]:
# import os
# os.kill(os.getpid(), 9)

In [None]:
import bitsandbytes as bnb
from transformers.utils import is_bitsandbytes_available

print("✅ bitsandbytes available:", is_bitsandbytes_available())
print("✅ version:", bnb.__version__)

✅ bitsandbytes available: True
✅ version: 0.46.0


## **🧠 Step 3: Load the Mistral-7B Model in 4-bit**

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 - best performance/quality tradeoff
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # match bnb_4bit_compute_dtype above and bf16=True in training
    token=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # ✅ Fixes the padding issue from before

print("✅ Mistral-7B model loaded in 4-bit successfully!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/996 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

✅ Mistral-7B model loaded in 4-bit successfully!


## **Applying PEFT + LoRA (QLoRA)**

### ⚙️ Step 1: Install PEFT + Accelerate

In [None]:
!pip3 install -q peft accelerate

### 🧪 Step 2: Configure LoRA (Low-Rank Adaptation)

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM   # for Mistral
)

> 🔎 You can think of LoRA as a lightweight way to fine-tune only the attention layers using low-rank matrices, which saves memory + time.
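
To see how small the adapters are, we can count the parameters LoRA adds for the configuration above. This sketch assumes Mistral-7B's layer shapes (hidden size 4096, key/value projection width 1024, 32 layers); those numbers are not stated in this notebook, so treat them as assumptions.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds two small matrices next to a frozen weight: A (rank x in) and B (out x rank)."""
    return rank * in_features + out_features * rank

# Assumed Mistral-7B shapes (not taken from this notebook)
HIDDEN, KV_WIDTH, NUM_LAYERS = 4096, 1024, 32
RANK, ALPHA = 64, 16  # r and lora_alpha from the LoraConfig above

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)      # q_proj
    + lora_param_count(HIDDEN, KV_WIDTH, RANK)  # k_proj
    + lora_param_count(HIDDEN, KV_WIDTH, RANK)  # v_proj
    + lora_param_count(HIDDEN, HIDDEN, RANK)    # o_proj
)
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
print(f"Scaling factor lora_alpha / r: {ALPHA / RANK}")
```

Under these assumptions the adapters hold tens of millions of parameters, well under 1% of the roughly 7 billion in the base model. The low-rank update is scaled by `lora_alpha / r` before being added to the frozen layer's output.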

### 🧬 Step 3: Wrap Mistral with LoRA

In [None]:
from peft import get_peft_model

model = get_peft_model(model, lora_config)

model.enable_input_require_grads()  # 👈 Makes the embedding outputs require gradients, which gradient checkpointing needs when the base weights are frozen
model.gradient_checkpointing_enable()

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)


base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight
base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight
base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight
base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight
base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight
base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight
base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight
base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight
base_m

## **Step-4: Set up the Trainer**

We'll use Hugging Face's `transformers.Trainer` with the following steps:

### 1. Define Training Arguments

In [None]:
model.gradient_checkpointing_enable()

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mistral-cyberbuddy-checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # Effective batch size = 1 x 8 = 8
    learning_rate=2e-4,
    num_train_epochs=3,              # Ignored in this run: max_steps below takes precedence
    max_steps=10,                    # Short trial run; remove to train for the full 3 epochs
    save_steps=10,                   # Only used when save_strategy="steps"
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    bf16=True,    # or fp16=True if bf16 not supported
    report_to="none"
)

### 2. Define Data Collator

We need a data collator suited to causal language modelling:

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False     # Because we're not doing masked language modeling
)

**🧠 What is Masked Language Modeling (MLM)?**

Masked Language Modeling (MLM) is a training strategy used in models like BERT. In this approach:

* Some words in the input are replaced with [MASK] tokens.

* The model is trained to predict the missing (masked) words.
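
The difference between the two objectives shows up in how labels are built. The sketch below uses made-up token IDs and hypothetical helper names; it mirrors the general idea (labels of `-100` are skipped by the loss) rather than reproducing the collator's code. Real MLM also picks the masked positions at random, which is replaced here by an explicit set.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def causal_lm_labels(input_ids, pad_id):
    """Causal LM: labels are a copy of the inputs, with padding positions ignored."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

def mlm_example(input_ids, mask_positions, mask_id):
    """Masked LM: hide the chosen positions; only those positions get a label."""
    masked = [mask_id if i in mask_positions else tok for i, tok in enumerate(input_ids)]
    labels = [tok if i in mask_positions else IGNORE_INDEX for i, tok in enumerate(input_ids)]
    return masked, labels

print(causal_lm_labels([11, 12, 13, 0, 0], pad_id=0))
print(mlm_example([11, 12, 13, 14], mask_positions={1}, mask_id=99))
```

With `mlm=False`, every real token serves as a prediction target (the shift to "next token" happens inside the model). One thing to keep in mind: because this notebook sets `pad_token = eos_token`, a rule like the one sketched here would also ignore the end-of-sequence token.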

### 3. Create the Trainer

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

  trainer = Trainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
| ----- | ------------- | --------------- |
| 0     | 2.4473        | 1.964932        |


TrainOutput(global_step=10, training_loss=2.4472970962524414, metrics={'train_runtime': 1245.8764, 'train_samples_per_second': 0.064, 'train_steps_per_second': 0.008, 'total_flos': 1760916123156480.0, 'train_loss': 2.4472970962524414, 'epoch': 0.24691358024691357})
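
The `epoch` value of about 0.247 in the output above can be checked by hand: 10 optimizer steps at an effective batch size of 8 cover 80 of the 324 training samples. A small sketch of that arithmetic, assuming a single GPU (the helper name `training_progress` is made up):

```python
def training_progress(num_samples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Work out how much of the dataset a step-limited run actually covers."""
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs_completed": samples_seen / num_samples,
    }

stats = training_progress(num_samples=324, per_device_batch=1, grad_accum=8, max_steps=10)
print(stats)
```

So this run saw roughly a quarter of the training set once, which is worth remembering when judging the generations further down.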

## **📁 Save Fine-Tuned Model**

In [None]:
trainer.save_model("cyberbuddy-finetuned")
tokenizer.save_pretrained("cyberbuddy-finetuned")

NameError: name 'trainer' is not defined

In [None]:
dataset['instruction']

['What is the law and punishment for E-commerce warehousing fraud?',
 'What is the law and punishment for Delayed forensic response?',
 'What is the law and punishment for Cyber fraud?',
 'What is the law and punishment for Extended CERT‑In mandatory reporting?',
 'What is the law and punishment for Unauthorized data breach?',
 'What is the law and punishment for ESXi hypervisor attack?',
 'What is the law and punishment for Online education fraud?',
 'What is the law and punishment for Biometric sensor hacking?',
 'What is the law and punishment for Criminal intimidation via electronic means?',
 'What is the law and punishment for Ransomware?',
 'What is the law and punishment for Cloud compliance migration importance?',
 'What is the law and punishment for Digital evidence admissibility issues?',
 'What is the law and punishment for ATM jackpotting?',
 'What is the law and punishment for eGov CISOs across states?',
 'What is the law and punishment for Good faith protection?',
 'What 

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

response = pipe("<s>[INST] 'What is the purpose of this chatbot? [/INST]",
                max_new_tokens=100,
                do_sample=True,
                temperature=0.7)

print(response[0]['generated_text'])


Device set to use cuda:0


<s>[INST] 'What is the purpose of this chatbot? [/INST] [ANS]To train chatbots to respond to conversations with empathy. [/ANS] [EXA]What can you use the bot for? [/EXA] [EXA]How can I train the bot? [/EXA]'
même

[INST] 'What is the purpose of this chatbot? [/INST] [ANS]To train chatbots to respond to conversations with empathy. [/ANS] [


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# ✅ Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("cyberbuddy-finetuned")
tokenizer = AutoTokenizer.from_pretrained("cyberbuddy-finetuned")





Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# ✅ Generate response
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)



In [None]:
prompt = "<s>[INST] What is the punishment for hacking? [/INST]"
response = pipe(prompt, max_new_tokens=100, do_sample=True, temperature=0.7)

print(response[0]['generated_text'])

In [None]:
import os
os.listdir("cyberbuddy-finetuned")


FileNotFoundError: [Errno 2] No such file or directory: 'cyberbuddy-finetuned'