<a href="https://colab.research.google.com/github/AlexXPZhu/XMUM-FYP-Code/blob/main/FYP_TinyBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FYP: TinyBERT for SQL Injection Detection

This notebook fine-tunes TinyBERT as a binary classifier for SQL injection (SQLi) detection,
fulfilling the proposal requirement to evaluate lightweight transformer models.

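**TinyBERT Specifications (4-layer variant, figures as reported by the TinyBERT authors):**
- 4 transformer layers (vs. BERT-base's 12)
- ~14.5M parameters (vs. BERT-base's ~110M)
- ~7.5x smaller and ~9.4x faster at inference than BERT-base
- Retains ~96.8% of BERT-base's performance on GLUE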

## 1. Environment Setup

In [1]:
# Install required packages
!pip install transformers datasets evaluate accelerate wandb -q

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import evaluate
import wandb
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


## 2. Load and Prepare Data

In [3]:
# Mount Google Drive (this cell requires Google Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Load merged dataset
file_path = '/content/drive/MyDrive/FYP/merged_data.csv'
df = pd.read_csv(file_path)

print(f"Dataset shape: {df.shape}")
print("\nLabel distribution:")
print(df['Label'].value_counts())
print("\nSample data:")
df.head()

Dataset shape: (56897, 2)

Label distribution:
Label
0    34633
1    22264
Name: count, dtype: int64

Sample data:


| | Sentence | Label |
|---|---|---|
| 0 | a | 1 |
| 1 | a' | 1 |
| 2 | a' -- | 1 |
| 3 | a' or 1 = 1; -- | 1 |
| 4 | @ | 1 |


In [5]:
# Split data: 70% train, 15% validation, 15% test (same as DistilBERT/MobileBERT)
train_df, temp_df = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df['Label']
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df['Label']
)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")

Training samples: 39827
Validation samples: 8535
Test samples: 8535
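
The split sizes above can be reproduced by hand. The sketch below assumes scikit-learn's rounding rule when only `test_size` is given: the held-out part gets the ceiling of `test_size * n_samples` and the remainder goes to the other part. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), assuming held-out = ceil(test_size * n_samples)."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_train, n_temp = split_sizes(56897, 0.3)   # first split: 70% / 30%
n_val, n_test = split_sizes(n_temp, 0.5)    # second split: 15% / 15%
print(n_train, n_val, n_test)  # 39827 8535 8535
```

Because both splits are stratified on `Label`, each subset should keep roughly the same benign-to-SQLi ratio as the full dataset.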


In [6]:
# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

# Rename 'Label' to 'labels' (required by Trainer)
train_dataset = train_dataset.rename_column('Label', 'labels')
val_dataset = val_dataset.rename_column('Label', 'labels')
test_dataset = test_dataset.rename_column('Label', 'labels')

print("Dataset columns:", train_dataset.column_names)

Dataset columns: ['Sentence', 'labels', '__index_level_0__']


## 3. Load TinyBERT Model and Tokenizer

In [7]:
# TinyBERT model checkpoint
# Options:
# - "huawei-noah/TinyBERT_General_4L_312D" (4 layers, 312 hidden dim, ~14.5M params)
# - "huawei-noah/TinyBERT_General_6L_768D" (6 layers, 768 hidden dim, ~67M params)

model_checkpoint = "huawei-noah/TinyBERT_General_4L_312D"
num_labels = 2

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Load model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels
)

# Print model info
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model: {model_checkpoint}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: huawei-noah/TinyBERT_General_4L_312D
Total parameters: 14,350,874
Trainable parameters: 14,350,874
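
The printed total can be broken down by component. The sketch below assumes a standard BERT layout and architecture values taken from the checkpoint's published configuration (vocabulary of 30522, intermediate size of 1200, 512 positions). None of these values are printed in this notebook, so treat them as assumptions.

```python
# Architecture values assumed from the checkpoint's published config (not notebook output)
vocab_size, hidden, intermediate = 30522, 312, 1200
num_layers, max_positions, type_vocab, num_labels = 4, 512, 2, 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm   # Q, K, V, output projection
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)               # the newly initialized head

total = embeddings + encoder + pooler + classifier
print(f"{total:,}")  # 14,350,874
```

Under these assumptions roughly two thirds of the parameters sit in the embedding tables, and the classification head adds only 626.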


## 4. Tokenize Data

In [8]:
def tokenize_function(examples):
    """Tokenize text with padding and truncation"""
    return tokenizer(
        examples['Sentence'],
        padding='max_length',
        truncation=True,
        max_length=128  # Same as DistilBERT/MobileBERT
    )

# Tokenize all datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

print("Tokenization complete!")
print(f"Columns after tokenization: {tokenized_train.column_names}")

Tokenization complete!
Columns after tokenization: ['Sentence', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask']
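
To illustrate what `padding='max_length'` and `truncation=True` do to each example, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` and the token IDs are made up for illustration, the pad ID of 0 is an assumption, and a real BERT tokenizer also preserves the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy fixed-length encoding: cut long inputs, right-pad short ones."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# Hypothetical token IDs for a short payload
ids, mask = pad_or_truncate([101, 1037, 1005, 102], max_length=6)
print(ids)   # [101, 1037, 1005, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With `max_length=128`, very short payloads such as `a'` consist mostly of padding, which the attention mask tells the model to ignore.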


## 5. Define Evaluation Metrics

In [9]:
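# Load the Hugging Face accuracy metric (unused below: compute_metrics relies on scikit-learn)
accuracy_metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    """Compute accuracy plus precision, recall, and F1 for the positive class (label 1, SQLi)"""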
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

## 6. Configure Training

In [10]:
# Initialize Weights & Biases
wandb.init(
    project="tinybert-sqli-detection",
    name="tinybert-4L-run-1",
    config={
        "model": model_checkpoint,
        "epochs": 3,
        "batch_size": 16,
        "learning_rate": 2e-5,
        "max_length": 128
    }
)


In [11]:
# Training arguments (same as DistilBERT/MobileBERT for fair comparison)
training_args = TrainingArguments(
    output_dir='./results-tinybert',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='wandb',
    logging_steps=50,
    metric_for_best_model='f1',
    greater_is_better=True,
)
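
A quick sanity check on what these settings imply, assuming a single GPU and no gradient accumulation (the `TrainingArguments` defaults, and consistent with the single Tesla T4 reported earlier):

```python
import math

train_samples, eval_samples = 39827, 8535
batch_size, epochs, logging_steps = 16, 3, 50

steps_per_epoch = math.ceil(train_samples / batch_size)   # the last batch is partial
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_samples / batch_size)

print(steps_per_epoch, total_steps, eval_steps)  # 2490 7470 534
print(total_steps // logging_steps)              # 149 logged training-loss points
```

The 534 evaluation steps agree with the run summary logged at the end of the notebook, where about 81.78 steps per second over about 6.53 seconds also comes to roughly 534.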

## 7. Train Model

In [12]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)

# Train!
print("Starting training...")
train_result = trainer.train()

# Print training summary
print("\n" + "="*50)
print("Training Complete!")
print("="*50)
print(f"Total training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Training loss: {train_result.metrics['train_loss']:.4f}")

Starting training...


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.0008 | 0.013055 | 0.997774 | 0.997007 | 0.997305 | 0.997156 |
| 2 | 0.0004 | 0.007194 | 0.998711 | 0.998503 | 0.998204 | 0.998353 |
| 3 | 0.0001 | 0.007667 | 0.998946 | 0.998802 | 0.998503 | 0.998652 |



Training Complete!
Total training time: 298.89 seconds
Training loss: 0.0174


## 8. Evaluate on Test Set

In [13]:
# Evaluate on test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_test)

print("\n" + "="*50)
print("TinyBERT Test Results")
print("="*50)
print(f"Accuracy:  {test_results['eval_accuracy']:.4f}")
print(f"Precision: {test_results['eval_precision']:.4f}")
print(f"Recall:    {test_results['eval_recall']:.4f}")
print(f"F1-Score:  {test_results['eval_f1']:.4f}")

Evaluating on test set...



TinyBERT Test Results
Accuracy:  0.9984
Precision: 0.9991
Recall:    0.9967
F1-Score:  0.9979


In [14]:
# Generate predictions for detailed analysis
predictions = trainer.predict(tokenized_test)
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = predictions.label_ids

# Classification report
print("\nDetailed Classification Report:")
print(classification_report(y_true, y_pred, target_names=['Benign', 'SQLi']))


Detailed Classification Report:
              precision    recall  f1-score   support

      Benign       1.00      1.00      1.00      5195
        SQLi       1.00      1.00      1.00      3340

    accuracy                           1.00      8535
   macro avg       1.00      1.00      1.00      8535
weighted avg       1.00      1.00      1.00      8535
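
The report above rounds to two decimals, which hides the few errors the model does make. The counts below are not printed by the notebook. They are inferred as the integer confusion-matrix entries consistent with the class supports (5195 benign, 3340 SQLi) and the reported test metrics, so treat them as a reconstruction.

```python
# Inferred counts (assumption: reconstructed from the reported metrics, not notebook output)
tp, fn = 3329, 11     # SQLi samples: detected / missed (support 3340)
tn, fp = 5192, 3      # benign samples: passed / falsely flagged (support 5195)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9984 0.9991 0.9967 0.9979
```

Under this reconstruction the model misses 11 of 3340 injection attempts and flags 3 of 5195 benign inputs. Passing `digits=4` to `classification_report` would surface the same detail directly.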



## 9. Save Model

In [15]:
# Save model and tokenizer
save_path = '/content/drive/MyDrive/FYP/models/tinybert-sqli'

trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model saved to: {save_path}")

# Calculate the on-disk size of everything under save_path (weights, config, tokenizer files)
import os
total_size = 0
for dirpath, dirnames, filenames in os.walk(save_path):
    for f in filenames:
        fp = os.path.join(dirpath, f)
        total_size += os.path.getsize(fp)

print(f"Model size: {total_size / (1024*1024):.2f} MB")

Model saved to: /content/drive/MyDrive/FYP/models/tinybert-sqli
Model size: 55.66 MB
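
As a rough cross-check of the reported size, the weights alone can be estimated from the parameter count. This assumes the weights are stored as 32-bit floats, which the notebook does not print.

```python
total_params = 14_350_874          # parameter count printed when the model was loaded
bytes_per_param = 4                # assumption: float32 weights

weights_mb = total_params * bytes_per_param / (1024 * 1024)
print(f"{weights_mb:.2f} MB")      # 54.74 MB
```

The remaining ~0.9 MB of the 55.66 MB directory is presumably the tokenizer vocabulary, configuration, and training-argument files.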


## 10. Log Final Results to WandB

In [16]:
# Log final metrics
wandb.log({
    "test_accuracy": test_results['eval_accuracy'],
    "test_precision": test_results['eval_precision'],
    "test_recall": test_results['eval_recall'],
    "test_f1": test_results['eval_f1'],
    "model_size_mb": total_size / (1024*1024),
    "total_parameters": total_params
})

# Finish WandB run
wandb.finish()

print("\nExperiment logged to Weights & Biases!")

Run history:

| Metric | Trend |
|---|---|
| eval/accuracy | ▁▇█▄ |
| eval/f1 | ▁▇█▄ |
| eval/loss | █▁▂▃ |
| eval/precision | ▁▆▇█ |
| eval/recall | ▃▇█▁ |
| eval/runtime | ▇█▁▅ |
| eval/samples_per_second | ▂▁█▄ |
| eval/steps_per_second | ▂▁█▄ |
| model_size_mb | ▁ |
| test/accuracy | ▁ |

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 0.99836 |
| eval/f1 | 0.9979 |
| eval/loss | 0.00864 |
| eval/precision | 0.9991 |
| eval/recall | 0.99671 |
| eval/runtime | 6.5297 |
| eval/samples_per_second | 1307.109 |
| eval/steps_per_second | 81.78 |
| model_size_mb | 55.65926 |
| test/accuracy | 0.99836 |



Experiment logged to Weights & Biases!


## 11. Model Comparison Summary

In [17]:
# Summary comparison with other models
# (scores for the other models are hard-coded; TinyBERT's come from test_results above)
comparison_data = {
    'Model': ['DistilBERT', 'MobileBERT', 'TinyBERT (4L)', 'BiLSTM', 'SVM'],
    'Parameters': ['66M', '25M', '14.5M', '~0.5M', 'N/A'],
    'Accuracy': [0.9986, 0.9987, test_results['eval_accuracy'], 0.9964, 0.9904],
    'F1-Score': [0.9982, 0.9984, test_results['eval_f1'], 0.9954, 0.9878],
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "="*60)
print("Model Comparison Summary")
print("="*60)
print(comparison_df.to_string(index=False))


Model Comparison Summary
        Model Parameters  Accuracy  F1-Score
   DistilBERT        66M   0.99860  0.998200
   MobileBERT        25M   0.99870  0.998400
TinyBERT (4L)      14.5M   0.99836  0.997902
       BiLSTM      ~0.5M   0.99640  0.995400
          SVM        N/A   0.99040  0.987800


---

## ✅ TinyBERT Training Complete!

This notebook fulfills the proposal requirement:
> "Fine-tune lightweight transformer models (e.g. DistilBERT, **TinyBERT**)"

Next steps:
1. Run `FYP_Benchmark.ipynb` for latency/throughput measurements
2. Export model to ONNX format
3. Apply INT8 quantization
4. Generate comparison charts
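
`FYP_Benchmark.ipynb` is not shown here, so the following is only a sketch of how latency and throughput could be measured with the standard library. `benchmark` and `dummy_predict` are hypothetical names. In practice the callable would wrap the fine-tuned model's forward pass.

```python
import statistics
import time

def benchmark(predict_fn, batch, warmup=3, runs=20):
    """Time predict_fn(batch); return mean latency (ms) and throughput (samples/s)."""
    for _ in range(warmup):            # warm-up calls are not timed
        predict_fn(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append(time.perf_counter() - start)
    mean_s = statistics.mean(timings)
    return {"latency_ms": mean_s * 1000, "throughput": len(batch) / mean_s}

def dummy_predict(batch):
    """Stand-in for a model call: flags inputs that contain a quote character."""
    return [int("'" in text) for text in batch]

stats = benchmark(dummy_predict, ["a' or 1 = 1; --", "hello world"] * 8)
print(stats)
```

When timing a model on a GPU, the device usually needs to be synchronized before the clock is read, otherwise asynchronous kernel launches make the numbers look better than they are.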