
# **MICROSOFT CODEBERT BASE MODEL SETUP**
---

In [None]:
!pip install transformers torch datasets accelerate

In [5]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
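import os
# Set the cache location before importing transformers so downloads land on Drive.
# HF_HOME supersedes the deprecated TRANSFORMERS_CACHE variable.
os.environ['HF_HOME'] = '/content/drive/MyDrive/huggingface_cache'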


# **Model Distillation with Kullback-Leibler (KL) Divergence (Forward)**

In [4]:
from huggingface_hub import notebook_login

# The shell command `huggingface-cli login` ended in a traceback at the hidden
# token prompt (inside getpass); the notebook login widget avoids that prompt.
notebook_login()

## **1️⃣ Load Teacher & Student Models**

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "deepseek-ai/deepseek-coder-6.7B" is not a Hub identifier and raised OSError on load;
# the published 6.7B checkpoints are "deepseek-ai/deepseek-coder-6.7b-base" and
# "deepseek-ai/deepseek-coder-6.7b-instruct".
teacher_model_name = "deepseek-ai/deepseek-coder-6.7b-base"
student_model_name = "microsoft/codebert-base"

# Both models get a freshly initialized classification head (num_labels defaults to 2),
# so their logits have matching shapes for the distillation loss.
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_name)
student_model = AutoModelForSequenceClassification.from_pretrained(student_model_name)
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name)

# Llama-style classifiers need a pad token to handle batches larger than one.
if teacher_tokenizer.pad_token is None:
    teacher_tokenizer.pad_token = teacher_tokenizer.eos_token
teacher_model.config.pad_token_id = teacher_tokenizer.pad_token_id
teacher_model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

## **2️⃣ Prune Teacher Model to Reduce Complexity**

In [None]:
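def prune_teacher_model(model, num_layers_to_keep=12):
    # DeepSeek Coder is a Llama-style decoder: its blocks live in model.model.layers,
    # not in model.encoder.layer as in BERT/RoBERTa encoders.
    decoder = model.model
    decoder.layers = torch.nn.ModuleList(decoder.layers[:num_layers_to_keep])
    model.config.num_hidden_layers = num_layers_to_keep
    return model

teacher_model = prune_teacher_model(teacher_model, num_layers_to_keep=12)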


## **3️⃣ Generate SQL Dataset using Pruned DeepSeek Coder**

In [None]:
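from datasets import Dataset
from transformers import AutoModelForCausalLM

# generate() needs a language-modeling head, which the sequence-classification teacher
# from step 1 lacks, so load a causal-LM copy of the same checkpoint and prune it the same way.
generator = prune_teacher_model(AutoModelForCausalLM.from_pretrained(teacher_model_name))

def generate_sql_data(num_samples=5000, batch_size=16):
    input_prompt = "Generate an optimized SQL query for financial transactions."
    queries = []
    for start in range(0, num_samples, batch_size):
        n = min(batch_size, num_samples - start)
        # Every prompt is identical, so the batch needs no padding.
        inputs = teacher_tokenizer([input_prompt] * n, return_tensors="pt")
        with torch.no_grad():
            # Sample one continuation per prompt; greedy decoding would repeat the same query.
            outputs = generator.generate(**inputs, max_new_tokens=75, do_sample=True,
                                         pad_token_id=teacher_tokenizer.eos_token_id)
        prompt_len = inputs["input_ids"].shape[1]
        for output in outputs:
            # Decode only the newly generated tokens, not the echoed prompt.
            sql = teacher_tokenizer.decode(output[prompt_len:], skip_special_tokens=True)
            queries.append({"unoptimized_sql": sql,
                            "optimized_sql": sql.replace("SELECT *", "SELECT col1, col2")})
    return Dataset.from_list(queries)

dataset = generate_sql_data()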
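
Because every sample comes from the same prompt, the generated set is likely to contain duplicates, empty strings, and pairs where the `SELECT *` replacement changed nothing. The following is a minimal sketch of a cleanup pass over such rows. The helper name `clean_pairs` and the sample rows are hypothetical and not part of the notebook; only the two column names are taken from the cell above.

```python
def clean_pairs(rows):
    """Drop empty, duplicate, and unchanged (input == target) pairs."""
    seen = set()
    kept = []
    for row in rows:
        src = row["unoptimized_sql"].strip()
        tgt = row["optimized_sql"].strip()
        # Skip rows that carry no training signal or were already seen.
        if not src or src == tgt or src in seen:
            continue
        seen.add(src)
        kept.append({"unoptimized_sql": src, "optimized_sql": tgt})
    return kept

# Made-up rows for illustration only.
rows = [
    {"unoptimized_sql": "SELECT * FROM transactions",
     "optimized_sql": "SELECT col1, col2 FROM transactions"},
    {"unoptimized_sql": "SELECT * FROM transactions",   # exact duplicate
     "optimized_sql": "SELECT col1, col2 FROM transactions"},
    {"unoptimized_sql": "SELECT id FROM accounts",      # replacement changed nothing
     "optimized_sql": "SELECT id FROM accounts"},
    {"unoptimized_sql": "  ", "optimized_sql": ""},     # empty generation
]

cleaned = clean_pairs(rows)
print(len(cleaned), "of", len(rows), "rows kept")
```

The filtered list could then be passed to `Dataset.from_list` in place of the raw `queries` list.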

## **4️⃣ Tokenize Dataset for Training**

In [None]:
def tokenize_function(examples):
    return student_tokenizer(examples["unoptimized_sql"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize_function, batched=True)

## **5️⃣ Define Forward KL Divergence Loss for Distillation**

In [None]:
def compute_kld_loss(student_logits, teacher_logits, temperature=2.0):
    # KLDivLoss expects log-probabilities as input and probabilities as target,
    # which gives the forward KL divergence KL(teacher || student).
    loss_fn = torch.nn.KLDivLoss(reduction='batchmean')
    student_log_probs = torch.nn.functional.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = torch.nn.functional.softmax(teacher_logits / temperature, dim=-1)
    return loss_fn(student_log_probs, teacher_probs)
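
To make the temperature-softened forward KL concrete, here is a small pure-Python sketch that mirrors the computation above for a single example. The helper names (`softmax`, `forward_kl`) and the logit values are illustrative assumptions, not part of the notebook; PyTorch is left out so the arithmetic is easy to follow.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher: target distribution
    q = softmax(student_logits, temperature)  # student: approximation
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

teacher = [2.0, 0.5]         # hypothetical two-class teacher logits
student_close = [1.8, 0.6]   # student that nearly agrees with the teacher
student_far = [-1.0, 3.0]    # student that prefers the other class

kl_same = forward_kl(teacher, teacher)
kl_close = forward_kl(student_close, teacher)
kl_far = forward_kl(student_far, teacher)
print(kl_same, kl_close, kl_far)
```

The divergence is zero when the two distributions match and grows as the student drifts away from the teacher. With `reduction='batchmean'`, PyTorch sums this per-example value over the batch and divides by the batch size. Many distillation implementations also multiply the loss by `temperature ** 2` to keep gradient magnitudes comparable across temperatures; the function above does not.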

## **6️⃣ Custom Trainer for Distillation**

In [None]:
from transformers import TrainingArguments, Trainer

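class DistillationTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) passed by newer Trainer versions.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        student_outputs = model(**inputs)
        # The teacher uses a different vocabulary, so re-tokenize the text for it
        # instead of feeding it the student's token IDs.
        texts = student_tokenizer.batch_decode(inputs["input_ids"], skip_special_tokens=True)
        teacher_inputs = teacher_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        teacher_inputs = teacher_inputs.to(teacher_model.device)
        with torch.no_grad():
            teacher_outputs = teacher_model(**teacher_inputs)
        teacher_logits = teacher_outputs.logits.to(student_outputs.logits.device)
        loss = compute_kld_loss(student_outputs.logits, teacher_logits)
        return (loss, student_outputs) if return_outputs else loss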

## **7️⃣ Training Configuration**

In [None]:
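training_args = TrainingArguments(
    output_dir="./distilled_codebert",
    per_device_train_batch_size=64,
    learning_rate=1.5e-5,
    num_train_epochs=3,
    save_total_limit=1,
    # evaluation_strategy="epoch" is omitted: no eval_dataset is passed to the trainer,
    # and per-epoch evaluation would fail without one.
    logging_dir="./logs",
    gradient_accumulation_steps=16,  # effective batch size: 64 * 16 = 1024 per device
    fp16=True,  # Enable mixed precision for efficiency (requires a CUDA GPU)
    dataloader_num_workers=4,
)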
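
The interaction between `per_device_train_batch_size` and `gradient_accumulation_steps` in the configuration above can be checked with simple arithmetic. The sketch below assumes a single GPU and the default 5000 generated samples from step 3. It gives a rough estimate and does not reproduce the exact step count the `Trainer` reports.

```python
num_samples = 5000                  # default of generate_sql_data in step 3
per_device_train_batch_size = 64
gradient_accumulation_steps = 16
num_devices = 1                     # assumption: a single Colab GPU

# Samples that contribute to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer updates in one pass over the data.
updates_per_epoch = num_samples / effective_batch

print(effective_batch)               # 1024
print(round(updates_per_epoch, 2))   # 4.88
```

With fewer than five updates per epoch, three epochs amount to only a handful of optimizer steps, so a smaller accumulation factor or a larger dataset may be worth considering.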

## **8️⃣ Train the Student Model**

In [None]:
trainer = DistillationTrainer(
    model=student_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=student_tokenizer,
)

trainer.train()

## **✅ Training Complete: Save the Fine-Tuned Model**

In [None]:
trainer.save_model("./final_student_model")