# Fine-tune Intent Classification Model
## Customer Service Chatbot - Intent Classifier Training

This notebook fine-tunes a DistilBERT model for intent classification.

**Platform:** Google Colab or Kaggle

**Steps:**
1. Install dependencies
2. Load training data
3. Prepare dataset
4. Fine-tune DistilBERT
5. Evaluate and save model

In [1]:
# Disable tokenizers parallelism to avoid fork-related warnings in notebook environments
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## 1. Install Dependencies

In [2]:
!pip install transformers datasets torch scikit-learn accelerate -q

## 2. Import Libraries

In [3]:
import json
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments, 
    Trainer,
    DataCollatorWithPadding
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch
print(f"Using device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")

Using device: cuda


## 3. Load Training Data

**Note:** Upload `train_data.json` and `val_data.json` to Colab/Kaggle before running this cell.

In Colab: Click the folder icon → Upload files

In Kaggle: Add files in the Input section

In [4]:
# Load data
with open('/kaggle/input/customer-chatbot/train_data.json', 'r') as f:
    train_data = json.load(f)

with open('/kaggle/input/customer-chatbot/val_data.json', 'r') as f:
    val_data = json.load(f)

print(f"Training examples: {len(train_data)}")
print(f"Validation examples: {len(val_data)}")
print(f"\nSample: {train_data[0]}")

Training examples: 1315
Validation examples: 329

Sample: {'text': 'i want exit', 'label': 'goodbye'}
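
Before building the label mapping, it can help to confirm that every record has the shape shown in the sample above. The following is a minimal sketch, assuming each record is a dict with non-empty string `text` and `label` fields; `validate_records` is a hypothetical helper, not part of the notebook.

```python
def validate_records(records):
    """Return (index, problem) pairs for records that do not match
    the assumed {'text': str, 'label': str} shape."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not a dict"))
            continue
        for field in ("text", "label"):
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{field}'"))
    return problems


# Toy records standing in for train_data (the first mirrors the sample output above)
sample_records = [
    {"text": "i want exit", "label": "goodbye"},
    {"text": "", "label": "greeting"},
    {"text": "where is my order"},
]
print(validate_records(sample_records))
```

In the notebook itself, the same check could be run on `train_data` and `val_data` right after loading.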


## 4. Prepare Dataset

In [5]:
# Create label mapping
unique_labels = sorted({item['label'] for item in train_data})
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(f"Number of intents: {len(unique_labels)}")
print(f"Labels: {unique_labels}")

# Convert labels to IDs
for item in train_data:
    item['labels'] = label2id[item['label']]

for item in val_data:
    item['labels'] = label2id[item['label']]

# Create Hugging Face datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)

print(f"\nDataset created successfully!")

Number of intents: 10
Labels: ['goodbye', 'greeting', 'hours', 'order_status', 'payment', 'pricing', 'product_info', 'return', 'support', 'thanks']

Dataset created successfully!
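
Because `label2id` is built from the training set only, a validation label that never appears in training would raise a `KeyError` in the conversion loop. A minimal sketch of a pre-check on toy data follows; `check_label_coverage` is a hypothetical helper, and the real notebook would pass `train_data` and `val_data` instead.

```python
from collections import Counter


def check_label_coverage(train_records, val_records):
    """Return (per-label training counts, validation labels missing from training)."""
    train_counts = Counter(rec["label"] for rec in train_records)
    val_labels = {rec["label"] for rec in val_records}
    unseen = sorted(val_labels - set(train_counts))
    return train_counts, unseen


# Toy data for illustration only
toy_train = [
    {"text": "hi", "label": "greeting"},
    {"text": "hello there", "label": "greeting"},
    {"text": "bye", "label": "goodbye"},
]
toy_val = [
    {"text": "see you", "label": "goodbye"},
    {"text": "refund please", "label": "return"},
]

counts, unseen = check_label_coverage(toy_train, toy_val)
print(dict(counts))  # how many training examples each intent has
print(unseen)        # validation labels the mapping would not know
```

The per-label counts also give a quick view of class balance, which matters when interpreting weighted metrics.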


## 5. Tokenize Data

In [6]:
# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function: truncate only; DataCollatorWithPadding pads each batch at training time
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

# Remove unnecessary columns - keep only what the model needs
tokenized_train = tokenized_train.remove_columns(['text', 'label'])
tokenized_val = tokenized_val.remove_columns(['text', 'label'])

print("✅ Tokenization complete!")

✅ Tokenization complete!
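
The `DataCollatorWithPadding` used in the training step pads each batch to the length of its longest sequence rather than to one fixed length. The sketch below illustrates that idea on plain Python lists. It is a conceptual illustration with made-up token IDs, not the library's implementation, and `pad_batch` and its `pad_id=0` default are assumptions for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for three utterances of different lengths
batch = pad_batch([
    [101, 7592, 102],
    [101, 2073, 2003, 2026, 7427, 102],
    [101, 9061, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short utterances from being stretched to the longest example in the whole dataset.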


## 6. Load Model

In [7]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id
)

print(f"✅ Model loaded: {model_name}")
print(f"📊 Parameters: {model.num_parameters():,}")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded: distilbert-base-uncased
📊 Parameters: 66,961,162


## 7. Define Metrics

In [8]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
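    # zero_division=0 scores a class with no predicted samples as 0 instead of emitting a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted', zero_division=0
    )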
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
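
With `average='weighted'`, each class's precision, recall, and F1 are averaged using the class's share of the true labels as the weight, which makes weighted recall equal to accuracy. A minimal pure-Python sketch of that computation on toy labels follows; the helper `weighted_prf` is hypothetical and counts undefined ratios as 0.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1; undefined ratios count as 0."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy example: class 2 is never predicted, so its precision is undefined (counted as 0)
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1]
precision, recall, f1 = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

This explains how weighted precision can fall below accuracy early in training, when some intents have not yet received any predictions.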

## 8. Training Arguments

In [9]:
training_args = TrainingArguments(
    output_dir="./intent_classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
    logging_steps=10,
    warmup_steps=100,
    dataloader_num_workers=0,
    report_to="none",                    # disable external loggers such as wandb
)
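
To put `warmup_steps=100` in context, the number of optimizer steps can be estimated from the dataset size and batch size. This is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (neither is set explicitly in the arguments above).

```python
import math

train_examples = 1315      # from the data-loading output
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
warmup_steps = 100

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions the warmup covers roughly a fifth of training, which is worth keeping in mind if the dataset or epoch count changes.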

In [10]:
# Check GPU memory before training
import subprocess
result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE)
print(result.stdout.decode('utf-8'))

Wed Jan 28 06:34:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             30W /  250W |     257MiB /  16384MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 9. Train Model

In [11]:

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create trainer with more verbosity
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train with progress tracking
print("🚀 Starting training...")
print(f"Training samples: {len(tokenized_train)}")
print(f"Validation samples: {len(tokenized_val)}")
print(f"Starting epoch 1 of {training_args.num_train_epochs}...")

trainer.train()
print("✅ Training complete!")

🚀 Starting training...
Training samples: 1315
Validation samples: 329
Starting epoch 1 of 3...


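| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|-------|---------------|-----------------|----------|-----------|--------|----|
| 1 | 0.932 | 0.669977 | 0.963526 | 0.937067 | 0.963526 | 0.948735 |
| 2 | 0.0746 | 0.049912 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.0398 | 0.028604 | 1.0 | 1.0 | 1.0 | 1.0 |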


✅ Training complete!


## 10. Evaluate Model

In [12]:
# Evaluate
results = trainer.evaluate()
print("\n📊 Evaluation Results:")
for key, value in results.items():
    print(f"  {key}: {value:.4f}")


📊 Evaluation Results:
  eval_loss: 0.0499
  eval_accuracy: 1.0000
  eval_precision: 1.0000
  eval_recall: 1.0000
  eval_f1: 1.0000
  eval_runtime: 0.2542
  eval_samples_per_second: 1294.4280
  eval_steps_per_second: 165.2460
  epoch: 3.0000


## 11. Test Predictions

In [13]:
# Test the model
def predict_intent(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Convert logits to probabilities, then take the top class for the single input
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = probabilities.argmax(dim=-1).item()
    confidence = probabilities[0, predicted_class_id].item()
    
    return id2label[predicted_class_id], confidence

# Test examples
test_examples = [
    "Hello, how are you?",
    "Where is my package?",
    "How much does this cost?",
    "I want to return my order",
    "What products do you have?"
]

print("🧪 Testing predictions:\n")
for example in test_examples:
    intent, confidence = predict_intent(example)
    print(f"Text: '{example}'")
    print(f"  → Intent: {intent} (confidence: {confidence:.2%})\n")

🧪 Testing predictions:

Text: 'Hello, how are you?'
  → Intent: greeting (confidence: 94.38%)

Text: 'Where is my package?'
  → Intent: order_status (confidence: 95.86%)

Text: 'How much does this cost?'
  → Intent: pricing (confidence: 96.60%)

Text: 'I want to return my order'
  → Intent: return (confidence: 46.21%)

Text: 'What products do you have?'
  → Intent: product_info (confidence: 95.58%)
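
The "return" example above is classified correctly but with only 46.21% confidence, so a chatbot may want a fallback when the top probability is low. A minimal sketch follows, assuming a hypothetical threshold of 0.60 and a hypothetical `fallback` label; neither comes from the notebook, and the threshold would need tuning on held-out data.

```python
def route_intent(intent, confidence, threshold=0.60, fallback="fallback"):
    """Return the predicted intent, or the fallback label when confidence is below threshold."""
    return intent if confidence >= threshold else fallback


# (intent, confidence) pairs copied from the test output above
predictions = [
    ("greeting", 0.9438),
    ("order_status", 0.9586),
    ("return", 0.4621),
]
for intent, confidence in predictions:
    print(f"{intent} ({confidence:.2%}) -> {route_intent(intent, confidence)}")
```

In the notebook, the output of `predict_intent` could be passed straight into such a function before the chatbot chooses a response.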



## 12. Save Model

In [14]:
# Save model and tokenizer
output_dir = "./intent_classifier_final"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Save label mappings
with open(f"{output_dir}/label_mappings.json", 'w') as f:
    json.dump({
        'label2id': label2id,
        'id2label': id2label
    }, f, indent=2)

print(f"✅ Model saved to {output_dir}")

# List all saved files to verify
import os
print("\n📦 Saved files:")
for file in os.listdir(output_dir):
    file_path = os.path.join(output_dir, file)
    size = os.path.getsize(file_path) / (1024*1024)  # MB
    print(f"  - {file} ({size:.2f} MB)")

✅ Model saved to ./intent_classifier_final

📦 Saved files:
  - model.safetensors (255.45 MB)
  - label_mappings.json (0.00 MB)
  - tokenizer.json (0.68 MB)
  - tokenizer_config.json (0.00 MB)
  - vocab.txt (0.22 MB)
  - config.json (0.00 MB)
  - special_tokens_map.json (0.00 MB)
  - training_args.bin (0.01 MB)
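
One detail to keep in mind when reading `label_mappings.json` back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. A minimal sketch of restoring them follows, using a temporary file and a shortened mapping for illustration; `load_label_mappings` is a hypothetical helper.

```python
import json
import os
import tempfile


def load_label_mappings(path):
    """Load label mappings and convert id2label keys back to integers."""
    with open(path, "r") as f:
        mappings = json.load(f)
    label2id = mappings["label2id"]
    id2label = {int(idx): label for idx, label in mappings["id2label"].items()}
    return label2id, id2label


# Round-trip a shortened mapping through a temporary file
original_id2label = {0: "goodbye", 1: "greeting"}
original_label2id = {"goodbye": 0, "greeting": 1}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_mappings.json")
    with open(path, "w") as f:
        json.dump({"label2id": original_label2id, "id2label": original_id2label}, f, indent=2)
    label2id, id2label = load_label_mappings(path)

print(id2label)
```

Without the `int(...)` conversion, a lookup such as `id2label[predicted_class_id]` would fail on the reloaded mapping because the keys would be strings.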


In [None]:
# Create a zip file of the trained model for download
import shutil
from datetime import datetime

# Ensure we're in the working directory where Kaggle can find outputs
os.chdir('/kaggle/working')

# Create zip file with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f"intent_classifier_final_{timestamp}"
shutil.make_archive(zip_filename, 'zip', output_dir)

# Verify the file exists
zip_path = f"{zip_filename}.zip"
if os.path.exists(zip_path):
    size_mb = os.path.getsize(zip_path) / (1024*1024)
    print(f"✅ Model zipped successfully: {zip_filename}.zip")
    print(f"📦 Size: {size_mb:.2f} MB")
    print(f"\n📥 To download from Kaggle:")
    print(f"   1. Click 'Output' tab in the right sidebar")
    print(f"   2. Look for '{zip_filename}.zip'")
    print(f"   3. Click download icon")
    print(f"\n💡 After download, extract to:")
    print(f"   d:\\3224\\customer-service-chatbot\\models\\intent_classifier_final\\")
else:
    print("❌ Error: Zip file not created. Check permissions.")
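
Beyond checking that the archive exists, its contents can be listed to confirm the expected files made it in. This is a minimal sketch using the standard-library `zipfile` module on a throwaway directory; the directory layout and file names below are placeholders, not the real model files.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small stand-in for the model directory
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir)
    for name in ("config.json", "label_mappings.json"):
        with open(os.path.join(model_dir, name), "w") as f:
            f.write("{}")

    # make_archive appends the .zip extension and returns the archive path
    archive_path = shutil.make_archive(os.path.join(tmp_dir, "model_archive"), "zip", model_dir)

    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(archive.namelist())
        corrupt_member = archive.testzip()  # None when every member passes its CRC check

print(archived_names)
print(corrupt_member)
```

For the real archive, the same two calls could be pointed at `zip_path` to check for `model.safetensors` and the tokenizer files before downloading.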


In [None]:
# Alternative: display a clickable download link with IPython's FileLink
from IPython.display import FileLink, display

# Display clickable download link
zip_path = f"{zip_filename}.zip"
display(FileLink(zip_path))
print("👆 Click the link above to download directly!")


### Alternative: Force Download in Browser

## 13. Download Model (Auto-Download)