# MPNet Dual-Objective Training Pipeline
## Similar Ticket Detection + Classification

**Project:** ITSM Incident Management AI  
**Model:** all-mpnet-base-v2 (sentence-transformers)  
**Date:** December 2025

---

## 🎯 Training Objectives

### PRIMARY: Similar Ticket Detection
Train the model to produce embeddings in which similar tickets are **semantically close**, i.e. have high cosine similarity.

**Use Cases:**
- 🔍 Find duplicate tickets
- 🔗 Link related incidents for root cause analysis
- 💡 Suggest similar resolved tickets to speed up resolution
- 📊 Cluster tickets by problem type

**Method:** Contrastive Learning
- Positive pairs: Tickets from same category (similar)
- Negative pairs: Tickets from different categories (dissimilar)
- Loss: Contrastive loss with margin
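
As a rough illustration of the margin-based loss named above, the sketch below computes the classic pairwise contrastive loss on cosine distance. The function names `cosine_distance` and `contrastive_loss` and the toy vectors are illustrative assumptions, not part of the pipeline; the training cell later in this notebook uses the library's `OnlineContrastiveLoss` instead.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrastive_loss(u, v, label, margin=0.5):
    """Pairwise contrastive loss.

    label=1 (similar): penalize any distance between the pair.
    label=0 (dissimilar): penalize only if the pair is closer than `margin`.
    """
    d = cosine_distance(u, v)
    if label == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
same_direction = contrastive_loss([1.0, 0.0], [1.0, 0.0], label=1)
orthogonal_neg = contrastive_loss([1.0, 0.0], [0.0, 1.0], label=0)
close_negative = contrastive_loss([1.0, 0.0], [1.0, 0.0], label=0)
print(same_direction, orthogonal_neg, close_negative)
```

A negative pair that is already farther apart than the margin contributes nothing, so training effort concentrates on dissimilar tickets that the model still places close together.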

### SECONDARY: Ticket Classification
Predict category labels for automatic routing.

**Method:** Cross-entropy loss with a classification head (not implemented in this notebook; listed under Next Steps)

---

## 📋 Pipeline Overview

1. Data Loading & Preprocessing
2. Train/Val/Test Split
3. Create Contrastive Pairs (similar/dissimilar tickets)
4. Model Training with Contrastive Loss
5. **Similarity Search Index Building**
6. **Similar Ticket Retrieval Demo**
7. Retrieval Evaluation (MRR, Precision@K)
8. Visualization & Analysis

## 1. Setup and Configuration

In [1]:
# Core libraries
import os
import sys
import json
import pickle
import warnings
from pathlib import Path
from datetime import datetime

# Data and ML
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Sentence transformers
from sentence_transformers import SentenceTransformer, losses, InputExample

# Metrics and utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from tqdm.auto import tqdm

# Database (optional)
try:
    import psycopg2
    from sqlalchemy import create_engine
    DB_AVAILABLE = True
except ImportError as e:
    DB_AVAILABLE = False
    print("⚠️ psycopg2/sqlalchemy not available; DB loading will be skipped. Using sample data instead.")
    print(f"Import error: {e}")

warnings.filterwarnings('ignore')
print("✓ All libraries imported")

✓ All libraries imported


In [2]:
# Configuration
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Model settings
MODEL_NAME = 'sentence-transformers/all-mpnet-base-v2'
MAX_SEQ_LENGTH = 512
EMBEDDING_DIM = 768

# Training hyperparameters
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
NUM_EPOCHS = 10
CONTRASTIVE_MARGIN = 0.5  # For contrastive loss

# Directories - use relative paths that work on Windows and Linux
OUTPUT_DIR = Path('models')
MODEL_DIR = OUTPUT_DIR / 'mpnet_similarity_model'
PLOTS_DIR = OUTPUT_DIR / 'plots'
RESULTS_DIR = OUTPUT_DIR / 'results'

for d in [MODEL_DIR, PLOTS_DIR, RESULTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"Model: {MODEL_NAME}")
print(f"Batch size: {BATCH_SIZE}")

Using device: cuda
Model: sentence-transformers/all-mpnet-base-v2
Batch size: 16


## 2. Data Loading

In [3]:
from pathlib import Path
import pandas as pd
import os

DATA_CSV_PATH = Path('data_new/SNow_incident_ticket_data.csv')

def load_or_create_sample_data():
    """Load from CSV; fallback to sample data if CSV missing."""
    if DATA_CSV_PATH.exists():
        print(f"✓ Loading real data from CSV: {DATA_CSV_PATH}")
        df = pd.read_csv(DATA_CSV_PATH)
        
        # Normalize column names: lowercase and replace spaces with underscores
        df.columns = df.columns.str.lower().str.strip().str.replace(' ', '_')
        
        # Basic required columns check
        required = ['description', 'category']
        missing = [c for c in required if c not in df.columns]
        if missing:
            print(f"Available columns: {df.columns.tolist()}")
            raise ValueError(f"CSV missing required columns: {missing}")
        
        # Minimal cleanup for expected columns
        if 'number' in df.columns and 'ticket_id' not in df.columns:
            df['ticket_id'] = df['number']
        elif 'ticket_id' not in df.columns:
            df['ticket_id'] = [f"INC{i:06d}" for i in range(len(df))]
        
        if 'service_offering' not in df.columns:
            df['service_offering'] = ''
        
        if 'subcategory' not in df.columns:
            df['subcategory'] = df['category']
        
        if 'assignment_group' not in df.columns:
            df['assignment_group'] = ''
        
        print(f"✓ Loaded {len(df)} rows from CSV")
        return df
    
    # Fallback: synthetic sample data
    print(f"⚠️ CSV not found at {DATA_CSV_PATH.resolve()}; using sample data instead.")
    categories = ['Access Issues', 'Hardware Issue', 'Password Reset', 
                 'Software Issue', 'Network Issue']
    samples = []
    descriptions = {
        'Access Issues': [
            "Cannot access email account. Login fails with authentication error.",
            "User locked out of system after failed login attempts.",
            "Access denied to shared folder on network drive.",
        ],
        'Hardware Issue': [
            "Printer not responding. Paper jam error on printer HP4500.",
            "Monitor display flickering. Screen shows vertical lines.",
            "Keyboard keys not working properly. Multiple keys stuck.",
        ],
        'Password Reset': [
            "Need password reset for SAP system. Locked out after failed attempts.",
            "Forgot password for email account. Cannot log in.",
            "Password expired for VPN access. Need immediate reset.",
        ],
        'Software Issue': [
            "Application crashes when opening reports. Error code 0x8007000E.",
            "Software freezes during data export. Have to force quit.",
            "Cannot install software update. Error message appears.",
        ],
        'Network Issue': [
            "Network connection keeps dropping. Unable to access shared drives.",
            "Internet connectivity issues. Cannot browse websites.",
            "VPN connection fails. Cannot connect to office network.",
        ]
    }
    
    for category in categories:
        for desc in descriptions[category] * 67:  # ~1000 samples
            samples.append({
                'description': desc,
                'category': category,
                'service_offering': category.split()[0] + ' Services'
            })
    
    df = pd.DataFrame(samples)
    df['ticket_id'] = [f'INC{i:06d}' for i in range(len(df))]
    df['subcategory'] = df['category']
    df['assignment_group'] = 'Support Team'
    
    print(f"✓ Created {len(df)} sample tickets")
    return df

# Load data
df = load_or_create_sample_data()
print(f"\nDataset shape: {df.shape}")
print(f"Categories: {df['category'].unique()}")
df.head()

✓ Loading real data from CSV: data_new\SNow_incident_ticket_data.csv
✓ Loaded 10633 rows from CSV

Dataset shape: (10633, 31)
Categories: ['Application/Software' 'Network' nan 'Server' 'Hardware']


Unnamed: 0,number,description,opened_by,company,itsm_department,created,urgency,impact,priority,assignment_group,...,comments_and_work_notes,manday_effort_(hrs),ticket_type,ams_domain,ams_system_type,ams_category_type,ams_service_type,ams_business_related,ams_it_related,ticket_id
0,INC0010171,GRPT not working as expected. ZMMM_PO_REV is n...,Indah Humairah Sulaiman,PIDSAP,PIDSAP,18/3/24 9:07,2 - Medium,3 - Low,4 - Low,PISCAP L2 SD BRS,...,2025-04-11 13:26:58 - BALAKUMAR GANESAN (Addit...,3.0,Issue,IS,S4HANA,Non-Genesis,Business-Related,BZ-B12-Master Data (Wrong Maintenance),,INC0010171
1,INC0010181,eTR-S1-24000073\r\nExchange Rate did not auto ...,Indah Humairah Sulaiman,PIDSAP,PIDSAP,18/3/24 9:51,2 - Medium,3 - Low,4 - Low,PISCAP L2 Workflow (SN),...,2024-04-05 02:49:36 - Reeman Mathur (Additiona...,,Issue,,,,,,,INC0010181
2,INC0010188,There is no GRPT maintenance for Sold-To: 3901...,Indah Humairah Sulaiman,PIDSAP,PIDSAP,18/3/24 10:19,3 - Low,3 - Low,4 - Low,PISCAP L2 SD BRS,...,2024-05-13 12:57:15 - BALAKUMAR GANESAN (Addit...,,Issue,,,,,,,INC0010188
3,INC0010189,Interface\t fpl\r\nSubsidiary\t...,Chenxing Cao,PA,PISCAP,18/3/24 10:24,3 - Low,3 - Low,4 - Low,PISCAP L2 Mulesoft/SOA,...,2024-03-18 10:30:07 - Chenxing Cao (Work notes...,,Issue,,,,,,,INC0010189
4,INC0010192,"retrieve new SAP password, thank you.",SOOK FONG NG,PM,PM,18/3/24 10:33,1 - High,3 - Low,3 - Moderate,PISCAP L2 SAP BASIS,...,,,Issue,,,,,,,INC0010192


## 3. Text Preprocessing

In [4]:
import re

def preprocess_text(text):
    """Clean and normalize text"""
    if pd.isna(text):
        return ""
    text = str(text)
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'\S+@\S+', '[EMAIL]', text)  # Mask emails
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# Apply preprocessing
df['description_clean'] = df['description'].apply(preprocess_text)

# Remove rows with NaN or missing categories
print(f"Before cleaning: {len(df)} rows, {df['category'].isna().sum()} rows with missing category")
df = df.dropna(subset=['category', 'description']).copy()
df = df[df['description_clean'].str.split().str.len() >= 5].copy()

# Encode labels
label_encoder = LabelEncoder()
df['category_id'] = label_encoder.fit_transform(df['category'])

print(f"After preprocessing: {len(df)} tickets")
print(f"Categories: {len(label_encoder.classes_)}")
print(f"Category list: {label_encoder.classes_.tolist()}")

# Save label encoder
with open(MODEL_DIR / 'label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

Before cleaning: 10633 rows, 563 rows with missing category
After preprocessing: 9808 tickets
Categories: 4
Category list: ['Application/Software', 'Hardware', 'Network', 'Server']


## 4. Train/Val/Test Split

In [5]:
# Stratified split
train_val_df, test_df = train_test_split(
    df, test_size=0.15, stratify=df['category_id'], random_state=RANDOM_SEED
)

train_df, val_df = train_test_split(
    train_val_df, test_size=0.15/(1-0.15), stratify=train_val_df['category_id'], 
    random_state=RANDOM_SEED
)

print(f"Training:   {len(train_df):5d} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation: {len(val_df):5d} ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test:       {len(test_df):5d} ({len(test_df)/len(df)*100:.1f}%)")

Training:    6864 (70.0%)
Validation:  1472 (15.0%)
Test:        1472 (15.0%)
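
The second split uses `test_size=0.15/(1-0.15)` because it is taken from the remaining 85% of the data, not from the full set. A quick check of that arithmetic (the helper name `split_fractions` is an illustrative assumption, not part of the pipeline):

```python
def split_fractions(test_frac=0.15, val_frac=0.15):
    """Return (train, val, test, second_stage) for a two-stage split.

    train/val/test are fractions of the full dataset; second_stage is the
    test_size to pass to the second train_test_split call.
    """
    remaining = 1.0 - test_frac            # share left after carving out the test set
    second_stage = val_frac / remaining    # validation share of the remaining data
    val = remaining * second_stage         # validation share of the full dataset
    train = remaining - val
    return train, val, test_frac, second_stage

train, val, test, second_stage = split_fractions()
print(f"second-stage test_size = {second_stage:.4f}")
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")
```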


## 5. Create Contrastive Pairs for Similarity Learning

**This is the KEY step for learning good embeddings!**

We create pairs of tickets:
- **Positive pairs (label=1):** Two tickets from SAME category → should be close in embedding space
- **Negative pairs (label=0):** Two tickets from DIFFERENT categories → should be far apart

The model learns to minimize distance for positive pairs and maximize distance for negative pairs.

In [6]:
def create_contrastive_pairs(df, num_pairs_per_ticket=2):
    """
    Create contrastive pairs for training.
    
    Returns:
        list of (text1, text2, label) tuples
        label=1 for similar, 0 for dissimilar
    """
    pairs = []
    grouped = df.groupby('category_id')
    # Share one seeded RNG across all draws. Passing the integer seed to every
    # .sample() call would restart the generator each time and pick nearly the
    # same partner rows for every ticket in a category.
    rng = np.random.RandomState(RANDOM_SEED)
    
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Creating pairs"):
        text1 = row['description_clean']
        category = row['category_id']
        
        # Positive pairs: same category, excluding the ticket itself
        same_cat = grouped.get_group(category)
        same_cat = same_cat[same_cat.index != idx]
        
        if len(same_cat) > 0:
            n_pos = min(num_pairs_per_ticket, len(same_cat))
            pos_samples = same_cat.sample(n=n_pos, random_state=rng)
            for _, pos_row in pos_samples.iterrows():
                pairs.append((text1, pos_row['description_clean'], 1))
        
        # Negative pairs: different category  
        diff_cat = df[df['category_id'] != category]
        if len(diff_cat) > 0:
            n_neg = min(num_pairs_per_ticket, len(diff_cat))
            neg_samples = diff_cat.sample(n=n_neg, random_state=rng)
            for _, neg_row in neg_samples.iterrows():
                pairs.append((text1, neg_row['description_clean'], 0))
    
    return pairs

# Create pairs
print("Creating contrastive pairs...")
train_pairs = create_contrastive_pairs(train_df, num_pairs_per_ticket=2)
val_pairs = create_contrastive_pairs(val_df, num_pairs_per_ticket=1)

print(f"\nContrastive Pairs Created:")
print(f"  Training: {len(train_pairs)} pairs")
print(f"  - Positive (similar): {sum(1 for p in train_pairs if p[2]==1)}")
print(f"  - Negative (dissimilar): {sum(1 for p in train_pairs if p[2]==0)}")
print(f"  Validation: {len(val_pairs)} pairs")

Creating contrastive pairs...


Creating pairs: 100%|██████████| 6864/6864 [00:29<00:00, 231.00it/s]
Creating pairs: 100%|██████████| 1472/1472 [00:02<00:00, 604.74it/s]


Contrastive Pairs Created:
  Training: 27456 pairs
  - Positive (similar): 13728
  - Negative (dissimilar): 13728
  Validation: 2943 pairs





## 6. Model Training with Contrastive Loss

We use the Sentence-Transformers library, which handles:
- MPNet encoding
- Contrastive loss computation  
- Efficient batch processing
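
`OnlineContrastiveLoss`, used in the training cell, differs from the plain contrastive loss in that it only counts the hard pairs in each batch. The sketch below is a simplified, pure-Python approximation of that selection rule as described in the library's documentation. The function name `online_contrastive_loss` and the example distances are illustrative assumptions, and the library's tensor implementation may handle edge cases differently.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Sum the contrastive loss over hard pairs only.

    Hard positives: similar pairs farther apart than the closest negative.
    Hard negatives: dissimilar pairs closer together than the farthest positive.
    Assumes the batch contains at least one positive and one negative pair.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    hard_pos = [d for d in pos if d > min(neg)]
    hard_neg = [d for d in neg if d < max(pos)]
    pos_loss = sum(d ** 2 for d in hard_pos)
    neg_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return pos_loss + neg_loss, hard_pos, hard_neg

# Hypothetical cosine distances for a batch of four pairs
distances = [0.1, 0.6, 0.3, 0.9]
labels = [1, 1, 0, 0]
loss, hard_pos, hard_neg = online_contrastive_loss(distances, labels)
print(hard_pos, hard_neg, round(loss, 4))
```

In this toy batch only one positive (distance 0.6) and one negative (distance 0.3) are hard; the easy pairs are ignored.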

In [7]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Load pre-trained model
model = SentenceTransformer(MODEL_NAME, device=device)
print(f"✓ Loaded model: {MODEL_NAME}")

# Convert pairs to InputExample format
train_examples = [
    InputExample(texts=[text1, text2], label=float(label))
    for text1, text2, label in train_pairs
]

val_examples = [
    InputExample(texts=[text1, text2], label=float(label))
    for text1, text2, label in val_pairs
]

# Create dataloaders
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=BATCH_SIZE)

# OnlineContrastiveLoss computes the loss only on the hard positive/negative pairs in each batch
train_loss = losses.OnlineContrastiveLoss(
    model=model,
    margin=CONTRASTIVE_MARGIN
)

print(f"\nTraining Setup:")
print(f"  - Loss: OnlineContrastiveLoss (margin={CONTRASTIVE_MARGIN})")
print(f"  - Batch size: {BATCH_SIZE}")
print(f"  - Batches per epoch: {len(train_dataloader)}")
print(f"  - Total training steps: {len(train_dataloader) * NUM_EPOCHS}")

✓ Loaded model: sentence-transformers/all-mpnet-base-v2

Training Setup:
  - Loss: OnlineContrastiveLoss (margin=0.5)
  - Batch size: 16
  - Batches per epoch: 1716
  - Total training steps: 17160


In [8]:
# Train model with GPU memory management
print("\n" + "="*60)
print("Starting Training")
print("="*60)

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("✓ GPU cache cleared")

try:
    # No evaluator is passed (val_examples is not used here), so
    # save_best_model has no effect; the model is saved to output_path
    # once training finishes.
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=NUM_EPOCHS,
        warmup_steps=100,
        output_path=str(MODEL_DIR / 'mpnet_similarity_model'),
        show_progress_bar=True,
        save_best_model=True
    )
    
    print("\n✓ Training complete!")
    print(f"Model saved to: {MODEL_DIR / 'mpnet_similarity_model'}")
except RuntimeError as e:
    # Print a hint for GPU memory problems, then re-raise in every case
    if 'CUDA' in str(e) or 'out of memory' in str(e):
        print(f"⚠️ GPU Memory error: {e}")
        print("Try reducing BATCH_SIZE or using CPU by setting device='cpu'")
    raise


Starting Training
✓ GPU cache cleared


                                                                     


KeyboardInterrupt: 

## 7. Build Similarity Search Index

Now we encode ALL training tickets to create a searchable index.

In [None]:
print("Building similarity search index...")
print("="*60)

# Get all training tickets
all_train_texts = train_df['description_clean'].tolist()
all_train_ids = train_df['ticket_id'].tolist()
all_train_categories = train_df['category'].tolist()

print(f"Encoding {len(all_train_texts)} tickets...")
train_embeddings = model.encode(
    all_train_texts,
    show_progress_bar=True,
    convert_to_tensor=False,
    normalize_embeddings=True  # Normalize for cosine similarity
)

# Create searchable index
embedding_index = {
    'embeddings': train_embeddings,
    'ticket_ids': all_train_ids,
    'texts': all_train_texts,
    'categories': all_train_categories
}

# Save index
with open(MODEL_DIR / 'embedding_index.pkl', 'wb') as f:
    pickle.dump(embedding_index, f)

print(f"\n✓ Index created!")
print(f"  - Size: {len(all_train_texts)} tickets")
print(f"  - Dimensions: {train_embeddings.shape}")
print(f"  - Saved to: {MODEL_DIR / 'embedding_index.pkl'}")
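
Because the index stores L2-normalized embeddings, the cosine similarity between a query and every indexed ticket reduces to a dot product, and top-k retrieval is a sort over those scores. A minimal pure-Python sketch of that idea follows, with made-up 2-D vectors and hypothetical helper names (`normalize`, `top_k`); the real index uses the NumPy array built above.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, index_vectors, k=2):
    """Return (position, score) pairs for the k highest dot products."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, vec)) for vec in index_vectors]
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Three unit-length "ticket embeddings" (hypothetical values)
index_vectors = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
results = top_k([2.0, 0.1], index_vectors, k=2)
for position, score in results:
    print(position, round(score, 3))
```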

## 8. 🔍 Similar Ticket Search Demo

**This is what we trained for!** Let's find similar tickets.

In [None]:
def find_similar_tickets(query_text, embedding_index, model, top_k=5):
    """
    Find most similar tickets to query.
    
    Returns:
        list of dicts with ticket info and similarity scores
    """
    if not query_text or len(query_text.strip()) == 0:
        print("⚠️ Empty query text provided")
        return []
    
    # Encode query
    query_emb = model.encode(
        [query_text],
        convert_to_tensor=False,
        normalize_embeddings=True
    )
    
    # Compute cosine similarity
    similarities = cosine_similarity(query_emb, embedding_index['embeddings'])[0]
    
    # Get top-k
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'ticket_id': embedding_index['ticket_ids'][idx],
            'description': embedding_index['texts'][idx],
            'category': embedding_index['categories'][idx],
            'similarity': float(similarities[idx])
        })
    
    return results

print("SIMILAR TICKET SEARCH DEMO")
print("="*60)

# Try examples from test set (with safety checks)
if len(test_df) == 0:
    print("⚠️ Test set is empty - skipping demo")
else:
    num_demos = min(3, len(test_df))
    for i in range(num_demos):
        query = test_df.iloc[i]
        print(f"\n{'─'*60}")
        print(f"QUERY TICKET #{i+1}: {query['ticket_id']}")
        print(f"Category: {query['category']}")
        print(f"Description: {query['description_clean'][:100]}...")
        print(f"\nTop 5 Similar Tickets:")
        print(f"{'─'*60}")
        
        similar = find_similar_tickets(
            query['description_clean'], 
            embedding_index, 
            model, 
            top_k=5
        )
        
        if not similar:
            print("⚠️ No similar tickets found")
            continue
        
        for rank, ticket in enumerate(similar, 1):
            match = "✓" if ticket['category'] == query['category'] else "✗"
            print(f"\n{rank}. {ticket['ticket_id']} | Similarity: {ticket['similarity']:.4f} {match}")
            print(f"   Category: {ticket['category']}")
            print(f"   Description: {ticket['description'][:80]}...")

print(f"\n{'='*60}")

## 9. Evaluate Similarity Retrieval Quality

Measure how well the model finds similar tickets using:
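- **Mean Reciprocal Rank (MRR):** Mean over queries of 1/rank of the first relevant result (0 when none appears in the top-K); 1.0 means the top result is always relevant
- **Precision@K:** Fraction of the top-K results that are relevant, i.e. share the query's category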
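
To make the two metrics concrete, here is a small worked example on made-up relevance flags. The helper names `reciprocal_rank` and `precision_at_k` and the toy data are illustrative assumptions; the evaluation cell below computes the same quantities from actual retrieval results.

```python
def reciprocal_rank(relevance):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k

# Hypothetical relevance flags for three queries (True = same category as the query)
queries = [
    [True, False, True],    # first hit at rank 1
    [False, True, True],    # first hit at rank 2
    [False, False, False],  # no hit in the top 3
]
toy_mrr = sum(reciprocal_rank(r) for r in queries) / len(queries)
toy_p_at_3 = sum(precision_at_k(r, 3) for r in queries) / len(queries)
print(f"MRR = {toy_mrr:.3f}, Precision@3 = {toy_p_at_3:.3f}")
```

The reciprocal ranks are 1, 0.5 and 0, giving an MRR of 0.5; the per-query precisions are 2/3, 2/3 and 0, giving a mean Precision@3 of 4/9.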

In [None]:
def evaluate_similarity_retrieval(test_df, embedding_index, model, k=10):
    """Evaluate with MRR and Precision@K"""
    reciprocal_ranks = []
    precisions = []
    
    for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Evaluating"):
        query_text = row['description_clean']
        query_category = row['category']
        
        # Find similar
        similar = find_similar_tickets(query_text, embedding_index, model, top_k=k)
        
        # Calculate MRR
        first_correct_rank = None
        for rank, ticket in enumerate(similar, 1):
            if ticket['category'] == query_category:
                first_correct_rank = rank
                break
        
        if first_correct_rank:
            reciprocal_ranks.append(1.0 / first_correct_rank)
        else:
            reciprocal_ranks.append(0.0)
        
        # Calculate Precision@K
        correct = sum(1 for t in similar if t['category'] == query_category)
        precisions.append(correct / k)
    
    return np.mean(reciprocal_ranks), np.mean(precisions)

# Evaluate
print("Evaluating similarity retrieval...")
mrr, precision_10 = evaluate_similarity_retrieval(test_df, embedding_index, model, k=10)

print(f"\n{'='*60}")
print("SIMILARITY RETRIEVAL RESULTS")
print(f"{'='*60}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
print(f"Precision@10:               {precision_10:.4f}")
print(f"\nInterpretation:")
if mrr > 0:
    print(f"  - 1/MRR = {1/mrr:.1f} (roughly the harmonic-mean rank of the first relevant result)")
else:
    print(f"  - No relevant results found in top-10")
print(f"  - {precision_10*100:.1f}% of top-10 results are relevant")

if mrr > 0.75:
    print(f"\n✓ EXCELLENT: MRR > 0.75 (Target achieved!)")
elif mrr > 0.60:
    print(f"\n✓ GOOD: MRR > 0.60 (Close to target)")
else:
    print(f"\n⚠ Needs improvement (Target: MRR > 0.75)")
print(f"{'='*60}")

## 10. Visualization: t-SNE Embedding Space

In [None]:
# Sample for visualization
sample_size = min(500, len(test_df))
sample_texts = test_df['description_clean'].sample(sample_size, random_state=RANDOM_SEED)
sample_labels = test_df.loc[sample_texts.index, 'category']

print(f"Computing embeddings for {sample_size} samples...")
embeddings = model.encode(sample_texts.tolist(), show_progress_bar=True)

print("Computing t-SNE projection...")
tsne = TSNE(n_components=2, random_state=RANDOM_SEED, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
fig, ax = plt.subplots(figsize=(12, 10))
unique_cats = sample_labels.unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_cats)))

for cat, color in zip(unique_cats, colors):
    mask = sample_labels == cat
    ax.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        c=[color],
        label=cat,
        alpha=0.6,
        s=50
    )

ax.set_xlabel('t-SNE Dimension 1', fontsize=12, fontweight='bold')
ax.set_ylabel('t-SNE Dimension 2', fontsize=12, fontweight='bold')
ax.set_title('t-SNE: Ticket Embeddings by Category', fontsize=14, fontweight='bold')
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(PLOTS_DIR / 'tsne_embeddings.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✓ Saved to {PLOTS_DIR / 'tsne_embeddings.png'}")

## 11. Summary & Next Steps

### ✅ Completed:
- [x] Loaded and preprocessed ticket data
- [x] Created contrastive pairs for similarity learning
- [x] Trained MPNet with contrastive loss
- [x] Built searchable embedding index
- [x] Demonstrated similar ticket search
- [x] Evaluated retrieval quality (MRR, Precision@K)
- [x] Visualized embedding space

### 🚀 Next Steps:
1. **Deploy similarity search API** - Expose find_similar_tickets() as REST endpoint
2. **Integrate with ServiceNow** - Auto-link similar tickets when created
3. **Add classification head** - For category prediction (secondary objective)
4. **Implement caching** - Store embeddings for fast retrieval
5. **Monitor performance** - Track MRR in production
6. **Fine-tune on feedback** - Retrain with user relevance signals

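### 💾 Saved Artifacts:
- Model: `models/mpnet_similarity_model/mpnet_similarity_model/`
- Embedding index: `models/mpnet_similarity_model/embedding_index.pkl`
- Label encoder: `models/mpnet_similarity_model/label_encoder.pkl`
- Visualizations: `models/plots/`
- Metrics: `models/results/similarity_metrics.json`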

In [None]:
# Save final metrics
metrics = {
    'mrr': float(mrr),
    'precision_at_10': float(precision_10),
    'num_train_tickets': len(train_df),
    'num_test_tickets': len(test_df),
    'num_categories': len(label_encoder.classes_),
    'model_name': MODEL_NAME,
    'training_date': datetime.now().isoformat()
}

with open(RESULTS_DIR / 'similarity_metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)

print("="*60)
print("TRAINING COMPLETE!")
print("="*60)
print(f"Model saved to: {MODEL_DIR}")
print(f"Results saved to: {RESULTS_DIR}")
print(f"\nKey Metrics:")
print(f"  - MRR: {mrr:.4f}")
print(f"  - Precision@10: {precision_10:.4f}")
print("="*60)