# Fine-tune Intent Classification Model
## Customer Service Chatbot - Intent Classifier Training

This notebook fine-tunes a DistilBERT model for intent classification.

**Platform:** Google Colab or Kaggle

**Steps:**
1. Install dependencies and import libraries
2. Load training data
3. Prepare and tokenize the dataset
4. Fine-tune DistilBERT
5. Evaluate, test, and save the model

In [1]:
# Fix tokenizer parallelism issue
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## 1. Install Dependencies

In [2]:
!pip install transformers datasets torch scikit-learn accelerate -q

## 2. Import Libraries

In [3]:
import json
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments, 
    Trainer,
    DataCollatorWithPadding
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch
print(f"Using device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")


Using device: cuda


## 3. Load Training Data

**Choose one of the options below:**

**Option 1:** Use SNIPS Intent Dataset (7 intents, 13.8K examples) - **VERIFIED WORKING** ✅

**Option 2:** Upload your own train_data.json and val_data.json (1,315 examples)

In [4]:
# OPTION 1: Use SNIPS Intent Dataset ✅ VERIFIED WORKING
from datasets import load_dataset

print("📥 Loading SNIPS Intent dataset from bkonkle/snips-joint-intent...")
dataset = load_dataset("bkonkle/snips-joint-intent")

# Inspect the actual dataset structure
print(f"\n📊 Available splits: {list(dataset.keys())}")
print(f"📊 Train dataset features: {dataset['train'].features}")
print(f"📊 Train dataset columns: {dataset['train'].column_names}")
print(f"\n📝 First item in dataset:")
first_item = dataset['train'][0]
print(f"  Keys: {list(first_item.keys())}")
print(f"  Full item: {first_item}")

# Auto-detect text and label columns from actual keys
all_keys = list(first_item.keys())
print(f"\n🔍 All available columns: {all_keys}")

# Try to find text column
text_col = None
for col in all_keys:
    if isinstance(first_item[col], str) and len(first_item[col]) > 10:
        text_col = col
        break
    elif isinstance(first_item[col], list) and all(isinstance(x, str) for x in first_item[col]):
        text_col = col  # Could be tokens
        break

# Try to find label/intent column  
label_col = None
for col in all_keys:
    if col != text_col and (isinstance(first_item[col], str) or isinstance(first_item[col], int)):
        label_col = col
        break

if not text_col or not label_col:
    print(f"❌ Could not auto-detect columns. Available: {all_keys}")
    print(f"Please check the dataset structure above and manually set:")
    print(f"  text_col = 'your_text_column_name'")
    print(f"  label_col = 'your_label_column_name'")
    raise ValueError("Column auto-detection failed")

print(f"\n✅ Using columns: text='{text_col}', label='{label_col}'")

# Convert to our format - handle tokens if needed
if isinstance(first_item[text_col], list):
    print("📝 Text column contains tokens/list - joining into sentences...")
    train_data = [
        {"text": " ".join(item[text_col]) if isinstance(item[text_col], list) else item[text_col], 
         "label": item[label_col]} 
        for item in dataset["train"]
    ]
    val_data = [
        {"text": " ".join(item[text_col]) if isinstance(item[text_col], list) else item[text_col], 
         "label": item[label_col]} 
        for item in dataset["test"]
    ]
else:
    train_data = [
        {"text": item[text_col], "label": item[label_col]} 
        for item in dataset["train"]
    ]
    val_data = [
        {"text": item[text_col], "label": item[label_col]} 
        for item in dataset["test"]
    ]

print(f"\n✅ Training examples: {len(train_data)}")
print(f"✅ Validation examples: {len(val_data)}")
print(f"\n📝 Sample converted data: {train_data[0]}")

# Show all intents
unique_labels = sorted(list(set([item['label'] for item in train_data])))
print(f"\n📊 Number of intents: {len(unique_labels)}")
print(f"Intents: {unique_labels}")

📥 Loading SNIPS Intent dataset from bkonkle/snips-joint-intent...




📊 Available splits: ['train', 'test']
📊 Train dataset features: {'input': Value('string'), 'intent': Value('string'), 'slots': Value('string')}
📊 Train dataset columns: ['input', 'intent', 'slots']

📝 First item in dataset:
  Keys: ['input', 'intent', 'slots']
  Full item: {'input': 'listen to westbam alumb allergic on google music', 'intent': 'PlayMusic', 'slots': 'O O B-artist O B-album O B-service I-service'}

🔍 All available columns: ['input', 'intent', 'slots']

✅ Using columns: text='input', label='intent'

✅ Training examples: 13084
✅ Validation examples: 700

📝 Sample converted data: {'text': 'listen to westbam alumb allergic on google music', 'label': 'PlayMusic'}

📊 Number of intents: 7
Intents: ['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook', 'SearchCreativeWork', 'SearchScreeningEvent']


In [5]:
# OPTION 2: Upload your own files (comment out SNIPS above first)
# import json
# with open('/kaggle/input/customer-chatbot/train_data.json', 'r') as f:
#     train_data = json.load(f)
# with open('/kaggle/input/customer-chatbot/val_data.json', 'r') as f:
#     val_data = json.load(f)
# print(f"✅ Training examples: {len(train_data)}")
# print(f"✅ Validation examples: {len(val_data)}")
# unique_labels = sorted(list(set([item['label'] for item in train_data])))
# print(f"\n📊 Number of intents: {len(unique_labels)}")


## 4. Prepare Dataset

In [6]:
# Create label mapping
unique_labels = sorted(list(set([item['label'] for item in train_data])))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(f"Number of intents: {len(unique_labels)}")
print(f"Labels: {unique_labels}")

# Convert labels to IDs
for item in train_data:
    item['labels'] = label2id[item['label']]

for item in val_data:
    item['labels'] = label2id[item['label']]

# Create Hugging Face datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)

print(f"\nDataset created successfully!")

Number of intents: 7
Labels: ['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook', 'SearchCreativeWork', 'SearchScreeningEvent']

Dataset created successfully!


## 5. Tokenize Data

In [7]:
# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
# No padding here: DataCollatorWithPadding pads each batch at training time.
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

# Remove unnecessary columns - keep only what the model needs
tokenized_train = tokenized_train.remove_columns(['text', 'label'])
tokenized_val = tokenized_val.remove_columns(['text', 'label'])

print("✅ Tokenization complete!")


✅ Tokenization complete!
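
The `DataCollatorWithPadding` used later in this notebook pads each batch to the length of its own longest sequence, rather than padding every example to one fixed width. The dependency-free sketch below illustrates that idea on made-up token IDs. The helper `pad_batch` is hypothetical and is not part of `transformers`, and the pad ID of 0 is an assumption that matches the `[PAD]` entry of the BERT-style uncased vocabulary.

```python
# Toy illustration of dynamic padding; the token IDs below are made up.
PAD_ID = 0  # assumption: [PAD] has ID 0 in the uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_batch = pad_batch([[101, 2377, 102], [101, 2377, 2189, 102]])
long_batch = pad_batch([[101, 102], [101, 3231, 1996, 4633, 2651, 102]])

print("Width of first batch:", len(short_batch["input_ids"][0]))
print("Width of second batch:", len(long_batch["input_ids"][0]))
```

Each batch is only as wide as it needs to be, which is why short utterances such as those in SNIPS rarely come close to `max_length=128`.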


## 6. Load Model

In [8]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id
)

print(f"✅ Model loaded: {model_name}")
print(f"📊 Parameters: {model.num_parameters():,}")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded: distilbert-base-uncased
📊 Parameters: 66,958,855


## 7. Define Metrics

In [9]:
def compute_metrics(eval_pred):
    """Compute accuracy and support-weighted precision, recall, and F1 for the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # predicted class = highest logit

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
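
In the training log later in this notebook, Recall matches Accuracy to every digit in each epoch. That is expected rather than a bug: with `average='weighted'`, each class's recall is weighted by its share of the examples, and the weighted sum reduces to the overall fraction of correct predictions. The sketch below checks this on a made-up toy example in plain Python. The helper names are hypothetical and it does not call scikit-learn.

```python
from collections import Counter


def weighted_recall(labels, predictions):
    """Per-class recall, averaged with each class's support as the weight."""
    support = Counter(labels)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    total = len(labels)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


# Made-up labels and predictions for three classes
toy_labels = [0, 0, 0, 1, 1, 2]
toy_predictions = [0, 0, 1, 1, 2, 2]

print("accuracy:       ", accuracy(toy_labels, toy_predictions))
print("weighted recall:", weighted_recall(toy_labels, toy_predictions))
```

Because the two numbers carry the same information, precision and F1 are the columns that add something beyond accuracy in this setup.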

## 8. Training Arguments

In [10]:
training_args = TrainingArguments(
    output_dir="./intent_classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
    logging_steps=10,
    warmup_steps=100,
    dataloader_num_workers=0,
    report_to="none",                    # disables W&B and other logging integrations
)
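
To put `warmup_steps=100` and `logging_steps=10` in context, it helps to estimate how many optimizer steps this configuration produces. The arithmetic below is a rough sketch using the 13,084 SNIPS training examples loaded earlier. It assumes a single GPU and the default gradient accumulation of 1, so the numbers will differ with Option 2's data or a multi-GPU runtime.

```python
import math

train_examples = 13084  # size of the SNIPS train split loaded above
batch_size = 8          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
warmup_steps = 100      # warmup_steps
num_devices = 1         # assumption: one GPU
grad_accum = 1          # assumption: default gradient_accumulation_steps

# The last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices * grad_accum))
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions the warmup covers roughly 2% of training, and a loss value is logged every 10 steps.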

In [11]:
# Check GPU memory before training
import subprocess
result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE)
print(result.stdout.decode('utf-8'))

Wed Jan 28 09:22:20 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P0             32W /  250W |     257MiB /  16384MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 9. Train Model

In [12]:

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train with progress tracking
print("🚀 Starting training...")
print(f"Training samples: {len(tokenized_train)}")
print(f"Validation samples: {len(tokenized_val)}")
print(f"Starting epoch 1 of {training_args.num_train_epochs}...")

trainer.train()
print("✅ Training complete!")

🚀 Starting training...
Training samples: 13084
Validation samples: 700
Starting epoch 1 of 3...




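| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|-------|---------------|-----------------|----------|-----------|--------|----|
| 1 | 0.0357 | 0.106027 | 0.968571 | 0.972471 | 0.968571 | 0.968879 |
| 2 | 0.0006 | 0.116002 | 0.971429 | 0.973924 | 0.971429 | 0.971679 |
| 3 | 0.0004 | 0.120949 | 0.975714 | 0.977218 | 0.975714 | 0.975794 |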


✅ Training complete!


## 10. Evaluate Model

In [13]:
# Evaluate
results = trainer.evaluate()
print("\n📊 Evaluation Results:")
for key, value in results.items():
    print(f"  {key}: {value:.4f}")


📊 Evaluation Results:
  eval_loss: 0.1209
  eval_accuracy: 0.9757
  eval_precision: 0.9772
  eval_recall: 0.9757
  eval_f1: 0.9758
  eval_runtime: 0.6550
  eval_samples_per_second: 1068.7600
  eval_steps_per_second: 134.3580
  epoch: 3.0000


## 11. Test Predictions

In [14]:
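# Test the model
def predict_intent(text):
    """Return the predicted intent label and its softmax confidence for one utterance."""
    model.eval()  # make sure dropout is disabled for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}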
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions[0][predicted_class_id].item()
    
    return id2label[predicted_class_id], confidence

# Test examples
test_examples = [
    "Hello, how are you?",
    "Where is my package?",
    "How much does this cost?",
    "I want to return my order",
    "What products do you have?"
]

print("🧪 Testing predictions:\n")
for example in test_examples:
    intent, confidence = predict_intent(example)
    print(f"Text: '{example}'")
    print(f"  → Intent: {intent} (confidence: {confidence:.2%})\n")

🧪 Testing predictions:

Text: 'Hello, how are you?'
  → Intent: SearchCreativeWork (confidence: 99.32%)

Text: 'Where is my package?'
  → Intent: SearchCreativeWork (confidence: 99.41%)

Text: 'How much does this cost?'
  → Intent: BookRestaurant (confidence: 55.46%)

Text: 'I want to return my order'
  → Intent: AddToPlaylist (confidence: 93.44%)

Text: 'What products do you have?'
  → Intent: SearchCreativeWork (confidence: 90.08%)
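
None of the five test sentences belongs to any of the seven SNIPS intents, so the predicted labels above are not meaningful for a customer-service chatbot. A common first idea is to reject low-confidence predictions. The sketch below applies that idea to the confidence values copied from the output above. The helper `with_fallback`, the 0.6 threshold, and the `out_of_scope` label are all hypothetical choices made for illustration.

```python
# (text, predicted intent, confidence) copied from the prediction output above
observed = [
    ("Hello, how are you?", "SearchCreativeWork", 0.9932),
    ("Where is my package?", "SearchCreativeWork", 0.9941),
    ("How much does this cost?", "BookRestaurant", 0.5546),
    ("I want to return my order", "AddToPlaylist", 0.9344),
    ("What products do you have?", "SearchCreativeWork", 0.9008),
]


def with_fallback(intent, confidence, threshold=0.6, fallback="out_of_scope"):
    """Keep the predicted intent only when its confidence reaches the threshold."""
    return intent if confidence >= threshold else fallback


results = [(text, with_fallback(intent, conf)) for text, intent, conf in observed]
rejected = [text for text, label in results if label == "out_of_scope"]

for text, label in results:
    print(f"{text!r} -> {label}")
print(f"Rejected {len(rejected)} of {len(observed)} out-of-domain sentences")
```

Only one of the five sentences falls below the threshold, which suggests that softmax confidence alone is not a dependable out-of-scope signal here. Training on customer-service data (Option 2) is the more direct fix.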



## 12. Save Model

In [15]:
# Save model and create downloadable zip
import shutil
from datetime import datetime
import json

# Define output directory
output_dir = "./intent_classifier_final"

# Save the trained model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Model saved to: {output_dir}")

# Save label mappings (needed for your chatbot)
label_mappings = {
    'label2id': label2id,
    'id2label': id2label
}
with open(f"{output_dir}/label_mappings.json", 'w') as f:
    json.dump(label_mappings, f, indent=2)
print(f"✅ Label mappings saved")

# Change to the Kaggle working directory when it exists (skipped on Colab)
if os.path.isdir('/kaggle/working'):
    os.chdir('/kaggle/working')

# Create zip file with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f"intent_classifier_final_{timestamp}"
shutil.make_archive(zip_filename, 'zip', output_dir)

# Verify the file exists
zip_path = f"{zip_filename}.zip"
if os.path.exists(zip_path):
    size_mb = os.path.getsize(zip_path) / (1024*1024)
    print(f"\n✅ Model zipped successfully: {zip_filename}.zip")
    print(f"📦 Size: {size_mb:.2f} MB")
    print(f"\n📥 To download from Kaggle:")
    print(f"   1. Click 'Output' tab in the right sidebar")
    print(f"   2. Look for '{zip_filename}.zip'")
    print(f"   3. Click download icon")
    print(f"\n💡 After download, extract to:")
    print(f"   d:\\3224\\customer-service-chatbot\\models\\intent_classifier_final\\")
else:
    print("❌ Error: Zip file not created. Check permissions.")

✅ Model saved to: ./intent_classifier_final
✅ Label mappings saved

✅ Model zipped successfully: intent_classifier_final_20260128_092434.zip
📦 Size: 235.87 MB

📥 To download from Kaggle:
   1. Click 'Output' tab in the right sidebar
   2. Look for 'intent_classifier_final_20260128_092434.zip'
   3. Click download icon

💡 After download, extract to:
   d:\3224\customer-service-chatbot\models\intent_classifier_final\
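
One detail to watch when reading `label_mappings.json` back in the chatbot: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip with a shortened stand-in mapping and a temporary directory (both are assumptions for illustration), and converts the keys back to integers so that `id2label[predicted_class_id]` keeps working.

```python
import json
import os
import tempfile

# Shortened stand-in for the real mapping built earlier in the notebook
id2label = {0: "AddToPlaylist", 1: "BookRestaurant"}
label2id = {label: idx for idx, label in id2label.items()}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_mappings.json")
    with open(path, "w") as f:
        json.dump({"label2id": label2id, "id2label": id2label}, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

raw_keys = list(loaded["id2label"].keys())  # keys are strings after loading
restored = {int(k): v for k, v in loaded["id2label"].items()}

print("Keys as loaded:", raw_keys)
print("Keys after conversion:", list(restored.keys()))
```

Models loaded with `from_pretrained` also carry the mapping in `model.config.id2label`, so the separate file mainly serves code that does not load the model config.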


In [16]:
# Alternative: show a clickable download link with IPython's FileLink
from IPython.display import FileLink, display

# Display clickable download link
zip_path = f"{zip_filename}.zip"
display(FileLink(zip_path))
print(f"👆 Click the link above to download directly!")


👆 Click the link above to download directly!
