# FraudDiffuse Implementation: Diffusion-Based Synthetic Fraud Data Generation

## Overview
This notebook implements the FraudDiffuse model from the paper "FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection". The goal is to generate synthetic fraud cases that maintain the statistical properties of real fraud while being diverse enough to improve fraud detection models.

## Key Components

### 1. Base Architecture
- **Diffusion Model**: Progressive noising and denoising of features
- **Conditional Generation**: Uses fraud/non-fraud labels to guide generation
- **Time Embedding**: Sinusoidal embeddings for timestep information

### 2. FraudDiffuse Innovations
- **Adaptive Prior**: Uses non-fraud distribution statistics as prior
- **Probability-Based Loss**: Ensures generated samples follow expected distributions
- **Contrastive Learning**: Triplet loss to maintain fraud/non-fraud distinctions

### 3. Loss Components
- **MSE Loss**: Basic reconstruction loss
- **Probability Loss**: Based on z-scores from non-fraud distribution
- **Triplet Loss**: Contrastive loss for fraud/non-fraud separation

## Implementation Notes
- Features are normalized and scaled appropriately
- Loss components are balanced using weights
- Early stopping is implemented to prevent overfitting
- GPU acceleration is used where available

## Training Process
1. Load and preprocess fraud transaction data
2. Train diffusion model with combined losses
3. Generate synthetic fraud samples
4. Evaluate quality of generated samples

## Evaluation Metrics
We'll evaluate the generated samples using:
- Distribution similarity to real fraud cases
- Statistical measures (mean, variance, correlations)
- Fraud detection model performance improvement

## Usage
The model can be used to:
1. Generate synthetic fraud cases
2. Augment fraud detection training data
3. Study fraud patterns and characteristics

In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import math
import time 

In [3]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.6.0+cpu
CUDA available: False


In [3]:
# For reproducibility
torch.manual_seed(42)
np.random.seed(42)


# First define the Dataset class
class FraudDataset(Dataset):
    def __init__(self, data_path):
        # Load data in chunks to handle large files better
        self.data = pd.read_csv(data_path)
        
        # Convert to tensors once during initialization
        self.features = torch.FloatTensor(
            self.data.drop(['is_fraud'], axis=1).values
        )
        self.labels = torch.FloatTensor(
            self.data['is_fraud'].values
        )
        
        self.num_features = self.features.shape[1]
        
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return {
            'features': self.features[idx],
            'label': self.labels[idx]
        }

# Check GPU availability and print info
print(f"GPU Available: {torch.cuda.is_available()}")
print(f"GPU Device Name: {torch.cuda.get_device_name(0)}")
print(f"GPU Memory Usage: {torch.cuda.memory_allocated(0)/1024**2:.2f}MB / {torch.cuda.get_device_properties(0).total_memory/1024**2:.2f}MB")

# 2. Reinitialize the datasets with the modified settings
print("Reinitializing datasets...")
train_dataset = FraudDataset('Data/processed/train.csv')
val_dataset = FraudDataset('Data/processed/val.csv')
test_dataset = FraudDataset('Data/processed/test.csv')


# 3. Create new data loaders with conservative settings
train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=0,  # Start with 0 workers
    pin_memory=False,  # Disable pin memory initially
    persistent_workers=False
)

val_loader = DataLoader(
    val_dataset,
    batch_size=128,
    shuffle=False,
    num_workers=0,
    pin_memory=False
)

test_loader = DataLoader(
    test_dataset,
    batch_size=128,
    shuffle=False,
    num_workers=0,
    pin_memory=False
)

# 4. Test the full train loader
print("\nTesting full train loader...")
start_time = time.time()
train_iter = iter(train_loader)
print(f"Iterator creation time: {time.time() - start_time:.3f}s")

start_time = time.time()
first_batch = next(train_iter)
print(f"First batch load time: {time.time() - start_time:.3f}s")
print(f"Batch shapes: features={first_batch['features'].shape}, labels={first_batch['label'].shape}")

GPU Available: True
GPU Device Name: NVIDIA GeForce RTX 4060
GPU Memory Usage: 0.00MB / 8187.50MB
Reinitializing datasets...

Testing full train loader...
Iterator creation time: 0.000s
First batch load time: 0.027s
Batch shapes: features=torch.Size([128, 25]), labels=torch.Size([128])


In [4]:
# # Test gradual optimization of data loading
# optimizations = [
#     {
#         "name": "Base (current)",
#         "settings": {"num_workers": 0, "pin_memory": False}
#     },
#     {
#         "name": "With pin_memory",
#         "settings": {"num_workers": 0, "pin_memory": True}
#     },
#     {
#         "name": "With 1 worker",
#         "settings": {"num_workers": 1, "pin_memory": True}
#     },
#     {
#         "name": "With 2 workers",
#         "settings": {"num_workers": 2, "pin_memory": True}
#     }
# ]

# print("Testing DataLoader optimizations...")
# for opt in optimizations:
#     print(f"\nTesting {opt['name']}:")
#     test_loader = DataLoader(
#         train_dataset,
#         batch_size=128,
#         shuffle=True,
#         **opt['settings']
#     )
    
#     # Test iteration speed
#     start_time = time.time()
#     test_iter = iter(test_loader)
#     first_batch = next(test_iter)
#     batch_time = time.time() - start_time
    
#     print(f"Batch load time: {batch_time:.3f}s")

Seems like I am running into issues with PyTorch's DataLoader multiprocessing on Windows

In [5]:
# 1. First, the time embedding component
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

# 2. The U-Net building block
class Block(nn.Module):
    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose1d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv1d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm1d(out_ch)
        self.bnorm2 = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU()
        
    def forward(self, x, t):
        # First convolution
        h = self.bnorm1(self.relu(self.conv1(x)))
        
        # Time embedding
        time_emb = self.relu(self.time_mlp(t))
        # Reshape time embedding to match the feature dimension
        time_emb = time_emb.unsqueeze(-1).repeat(1, 1, h.shape[-1])
        
        # Add time embedding
        h = h + time_emb
        
        # Second convolution and transform
        h = self.bnorm2(self.relu(self.conv2(h)))
        return self.transform(h)

In [6]:
# 3. The complete U-Net architecture
class UNet(nn.Module):
    def __init__(self, num_features, time_emb_dim=32):
        super().__init__()
        self.num_features = num_features
        
        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.ReLU()
        )
        
        # Condition embedding
        self.condition_embedding = nn.Sequential(
            nn.Linear(1, time_emb_dim),
            nn.ReLU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )
        
        # Initial projection
        self.proj = nn.Linear(num_features, 64)
        
        # Encoder
        self.enc1 = Block(1, 64, time_emb_dim)
        self.enc2 = Block(64, 128, time_emb_dim)
        self.enc3 = Block(128, 256, time_emb_dim)
        self.enc4 = Block(256, 512, time_emb_dim)
        
        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv1d(512, 512, 3, padding=1),
            nn.BatchNorm1d(512),
            nn.ReLU()
        )
        
        # Decoder
        self.dec4 = Block(512, 256, time_emb_dim, up=True)
        self.dec3 = Block(256, 128, time_emb_dim, up=True)
        self.dec2 = Block(128, 64, time_emb_dim, up=True)
        self.dec1 = Block(64, 1, time_emb_dim, up=True)
        
        # Final projection
        self.final = nn.Linear(64, num_features)
        
    def forward(self, x, t, y):
        # Time embedding
        t = self.time_mlp(t)
        
        # Condition embedding
        y = y.view(-1, 1)
        y = self.condition_embedding(y)
        
        # Initial projection and reshape
        x = self.proj(x)
        x = x.unsqueeze(1)  # Add channel dimension
        
        # U-Net forward pass
        x1 = self.enc1(x, t)
        x2 = self.enc2(x1, t)
        x3 = self.enc3(x2, t)
        x4 = self.enc4(x3, t)
        
        x4 = self.bottleneck(x4)
        
        x = self.dec4(x4, t)
        x = self.dec3(x + x3, t)
        x = self.dec2(x + x2, t)
        x = self.dec1(x + x1, t)
        
        # Reshape and final projection
        x = x.squeeze(1)  # Remove channel dimension
        x = self.final(x)
        
        return x

# Initialize model and move to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(num_features=train_dataset.num_features).to(device)
print(f"Model initialized on: {device}")

Model initialized on: cuda


In [7]:
# 5. Diffusion process
class DiffusionModel:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.beta_start = beta_start
        self.beta_end = beta_end
        
        self.beta = self.prepare_noise_schedule().to(device)
        self.alpha = 1. - self.beta
        self.alpha_hat = torch.cumprod(self.alpha, dim=0)
        
    def prepare_noise_schedule(self):
        return torch.linspace(self.beta_start, self.beta_end, self.num_timesteps)
    
    def noise_images(self, x, t):
        sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None]
        sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None]
        ε = torch.randn_like(x)
        return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * ε, ε

# Initialize diffusion model and optimizer
diffusion = DiffusionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3, verbose=True
)

# Test a single training iteration
print("\nTesting single training iteration...")
model.train()
batch = next(iter(train_loader))
x = batch['features'].to(device)
labels = batch['label'].to(device)

# Sample timestep
t = torch.randint(0, diffusion.num_timesteps, (x.shape[0],), device=device).long()

# Get noisy features and noise
noisy_x, noise = diffusion.noise_images(x, t)

# Forward pass
predicted_noise = model(noisy_x, t, labels)

# Calculate loss
loss = F.mse_loss(noise, predicted_noise)

# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Single iteration test complete!")
print(f"Loss: {loss.item():.6f}")
print(f"GPU Memory Usage: {torch.cuda.memory_allocated()/1024**2:.1f}MB / {torch.cuda.get_device_properties(0).total_memory/1024**2:.2f}MB")




Testing single training iteration...
Single iteration test complete!
Loss: 1.097803
GPU Memory Usage: 93.7MB / 8187.50MB


Implementing the full training loop

In [9]:
# First add the sampling function
@torch.no_grad()
def sample_features(model, diffusion, n_samples, labels, device):
    model.eval()
    
    # Start from random noise
    x = torch.randn((n_samples, model.num_features)).to(device)
    
    # Gradually denoise
    for i in tqdm(reversed(range(diffusion.num_timesteps)), desc='Sampling'):
        t = torch.full((n_samples,), i, device=device, dtype=torch.long)
        predicted_noise = model(x, t, labels)
        
        alpha = diffusion.alpha[t][:, None]
        alpha_hat = diffusion.alpha_hat[t][:, None]
        beta = diffusion.beta[t][:, None]
        
        if i > 0:
            noise = torch.randn_like(x)
        else:
            noise = torch.zeros_like(x)
            
        x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise
    
    return x

# Training configuration and setup
num_epochs = 100
best_val_loss = float('inf')
early_stopping_patience = 5
early_stopping_counter = 0

# Create directories for checkpoints and samples
import os
os.makedirs('checkpoints', exist_ok=True)
os.makedirs('samples', exist_ok=True)

# Training history
history = {
    'train_loss': [],
    'val_loss': [],
    'lr': []
}

# Main training loop
print("Starting training...")
print(f"Training on device: {device}")
print(f"Number of epochs: {num_epochs}")
print(f"Batch size: {train_loader.batch_size}")
print(f"Training batches per epoch: {len(train_loader)}")
print(f"Initial learning rate: {optimizer.param_groups[0]['lr']}")

training_start_time = time.time()

for epoch in range(num_epochs):
    epoch_start_time = time.time()
    model.train()
    total_loss = 0
    
    # Training phase
    progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}')
    for batch_idx, batch in enumerate(progress_bar):
        optimizer.zero_grad()
        
        # Get batch data
        x = batch['features'].to(device)
        labels = batch['label'].to(device)
        
        # Sample timestep
        t = torch.randint(0, diffusion.num_timesteps, (x.shape[0],), device=device).long()
        
        # Get noisy features and noise
        noisy_x, noise = diffusion.noise_images(x, t)
        
        # Predict noise
        predicted_noise = model(noisy_x, t, labels)
        
        # Calculate loss
        loss = F.mse_loss(noise, predicted_noise)
        
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        # Update progress
        total_loss += loss.item()
        avg_loss = total_loss / (batch_idx + 1)
        progress_bar.set_postfix({
            'loss': f"{loss.item():.4f}",
            'avg_loss': f"{avg_loss:.4f}",
            'gpu_mem': f"{torch.cuda.memory_allocated()/1024**2:.0f}MB"
        })
    
    train_loss = total_loss / len(train_loader)
    
    # Validation phase
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            x = batch['features'].to(device)
            labels = batch['label'].to(device)
            t = torch.randint(0, diffusion.num_timesteps, (x.shape[0],), device=device).long()
            noisy_x, noise = diffusion.noise_images(x, t)
            predicted_noise = model(noisy_x, t, labels)
            val_loss += F.mse_loss(noise, predicted_noise).item()
    
    val_loss /= len(val_loader)
    
    # Update learning rate
    scheduler.step(val_loss)
    
    # Update history
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['lr'].append(optimizer.param_groups[0]['lr'])
    
    # Print epoch summary
    epoch_time = time.time() - epoch_start_time
    print(f"\nEpoch {epoch+1}/{num_epochs} Summary:")
    print(f"Train Loss: {train_loss:.6f}")
    print(f"Val Loss: {val_loss:.6f}")
    print(f"Learning Rate: {optimizer.param_groups[0]['lr']:.6f}")
    print(f"Epoch Time: {epoch_time/60:.1f} minutes")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        early_stopping_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_loss': train_loss,
            'val_loss': val_loss,
            'history': history,
        }, 'checkpoints/best_model.pt')
        print("Saved new best model!")
    else:
        early_stopping_counter += 1
    
    # Early stopping
    if early_stopping_counter >= early_stopping_patience:
        print(f"\nEarly stopping triggered after {epoch+1} epochs")
        break
    
    # Generate and save samples every 10 epochs
    if (epoch + 1) % 10 == 0:
        print("\nGenerating samples...")
        model.eval()
        with torch.no_grad():
            # Generate fraud samples
            fraud_samples = sample_features(
                model, 
                diffusion, 
                n_samples=100, 
                labels=torch.ones(100).to(device),
                device=device
            )
            
            # Save samples
            torch.save(fraud_samples.cpu(), f'samples/fraud_samples_epoch_{epoch+1}.pt')
            
            # Plot training history
            plt.figure(figsize=(12, 4))
            plt.subplot(1, 2, 1)
            plt.plot(history['train_loss'], label='Train Loss')
            plt.plot(history['val_loss'], label='Val Loss')
            plt.xlabel('Epoch')
            plt.ylabel('Loss')
            plt.legend()
            
            plt.subplot(1, 2, 2)
            plt.plot(history['lr'], label='Learning Rate')
            plt.xlabel('Epoch')
            plt.ylabel('Learning Rate')
            plt.legend()
            
            plt.tight_layout()
            plt.savefig(f'samples/training_history_epoch_{epoch+1}.png')
            plt.close()

total_training_time = time.time() - training_start_time
print("\nTraining completed!")
print(f"Total training time: {total_training_time/3600:.1f} hours")
print(f"Best validation loss: {best_val_loss:.6f}")

Starting training...
Training on device: cuda
Number of epochs: 100
Batch size: 128
Training batches per epoch: 9407
Initial learning rate: 0.0001


Epoch 1/100: 100%|██████████| 9407/9407 [02:40<00:00, 58.48it/s, loss=0.9808, avg_loss=0.9998, gpu_mem=92MB]



Epoch 1/100 Summary:
Train Loss: 0.999816
Val Loss: 1.000211
Learning Rate: 0.000100
Epoch Time: 2.8 minutes
Saved new best model!


Epoch 2/100: 100%|██████████| 9407/9407 [03:13<00:00, 48.65it/s, loss=1.0235, avg_loss=1.0000, gpu_mem=92MB]



Epoch 2/100 Summary:
Train Loss: 1.000012
Val Loss: 0.998341
Learning Rate: 0.000100
Epoch Time: 3.5 minutes
Saved new best model!


Epoch 3/100: 100%|██████████| 9407/9407 [03:48<00:00, 41.22it/s, loss=0.9912, avg_loss=0.9997, gpu_mem=92MB]



Epoch 3/100 Summary:
Train Loss: 0.999748
Val Loss: 1.000156
Learning Rate: 0.000100
Epoch Time: 4.1 minutes


Epoch 4/100: 100%|██████████| 9407/9407 [03:16<00:00, 47.91it/s, loss=1.0022, avg_loss=1.0003, gpu_mem=92MB]



Epoch 4/100 Summary:
Train Loss: 1.000306
Val Loss: 0.999662
Learning Rate: 0.000100
Epoch Time: 3.4 minutes


Epoch 5/100: 100%|██████████| 9407/9407 [03:13<00:00, 48.65it/s, loss=1.0832, avg_loss=0.9999, gpu_mem=92MB]



Epoch 5/100 Summary:
Train Loss: 0.999902
Val Loss: 0.999844
Learning Rate: 0.000100
Epoch Time: 3.5 minutes


Epoch 6/100: 100%|██████████| 9407/9407 [03:22<00:00, 46.38it/s, loss=1.0145, avg_loss=1.0000, gpu_mem=92MB]



Epoch 6/100 Summary:
Train Loss: 0.999981
Val Loss: 0.999655
Learning Rate: 0.000050
Epoch Time: 3.7 minutes


Epoch 7/100: 100%|██████████| 9407/9407 [03:23<00:00, 46.20it/s, loss=1.0605, avg_loss=1.0000, gpu_mem=92MB]



Epoch 7/100 Summary:
Train Loss: 0.999979
Val Loss: 0.999658
Learning Rate: 0.000050
Epoch Time: 3.7 minutes

Early stopping triggered after 7 epochs

Training completed!
Total training time: 0.4 hours
Best validation loss: 0.998341


The above is the baseline diffusion model implementation.

Adding in the FraudDiffuse Model to the baseline

In [36]:
# 1. Required imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
import time
from tqdm import tqdm
import os
import pandas as pd

# 2. Dataset class
class FraudDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.FloatTensor(features)
        self.labels = torch.LongTensor(labels)
        self.num_features = features.shape[1]
        
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# 3. Load and prepare data
train_data = pd.read_csv('Data/processed/train.csv')
val_data = pd.read_csv('Data/processed/val.csv')
test_data = pd.read_csv('Data/processed/test.csv')

# Separate features and labels
train_features = train_data.drop('is_fraud', axis=1).values
train_labels = train_data['is_fraud'].values
val_features = val_data.drop('is_fraud', axis=1).values
val_labels = val_data['is_fraud'].values
test_features = test_data.drop('is_fraud', axis=1).values
test_labels = test_data['is_fraud'].values

# Create datasets
train_dataset = FraudDataset(train_features, train_labels)
val_dataset = FraudDataset(val_features, val_labels)
test_dataset = FraudDataset(test_features, test_labels)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# 4. Time embedding
class SinusoidalEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

# 5. Model architecture
class FraudDiffuseNet(nn.Module):
    def __init__(self, num_features, hidden_size=256, time_dim=32):
        super().__init__()
        self.num_features = num_features
        
        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalEmbedding(time_dim),
            nn.Linear(time_dim, hidden_size),
            nn.ReLU()
        )
        
        # Label embedding
        self.label_emb = nn.Embedding(2, hidden_size)
        
        # Feature transformation
        self.net = nn.Sequential(
            nn.Linear(num_features + hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_features)
        )
        
    def forward(self, x, t, labels):
        # Time and label embeddings
        t = self.time_mlp(t)
        t = t + self.label_emb(labels)
        
        # Concatenate features with time embedding
        x_t = torch.cat([x, t], dim=1)
        
        # Transform features
        return self.net(x_t)

# 6. Diffusion process with adaptive prior
class FraudDiffusion:
    def __init__(self, train_dataset, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.beta_start = beta_start
        self.beta_end = beta_end
        
        # Calculate non-fraud distribution parameters
        non_fraud_mask = train_dataset.labels == 0
        non_fraud_features = train_dataset.features[non_fraud_mask]
        self.mu_nf = non_fraud_features.mean(dim=0)
        self.sigma_nf = non_fraud_features.std(dim=0)
        
        # Setup noise schedule
        self.beta = self.prepare_noise_schedule()
        self.alpha = 1. - self.beta
        self.alpha_hat = torch.cumprod(self.alpha, dim=0)
        
    def prepare_noise_schedule(self):
        return torch.linspace(self.beta_start, self.beta_end, self.num_timesteps)
    
    def noise_images(self, x, t):
        sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None]
        sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None]
        # Use non-fraud distribution for noise
        ε = torch.randn_like(x) * self.sigma_nf + self.mu_nf
        return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * ε, ε
    
    def sample_timesteps(self, n):
        return torch.randint(low=1, high=self.num_timesteps, size=(n,))

# 7. Loss functions
def probability_loss(predicted_noise, prior_mu, prior_sigma):
    z_score = (predicted_noise - prior_mu) / prior_sigma
    return 2 * torch.distributions.Normal(0, 1).cdf(torch.abs(z_score)).mean()

# UPDATED triplet loss function
def triplet_loss(anchor, positive, negative, margin=1.0):
    # Make sure we have equal number of samples by taking the minimum
    min_samples = min(len(positive), len(negative))
    
    # Randomly sample to get equal sizes
    if len(positive) > min_samples:
        idx = torch.randperm(len(positive))[:min_samples]
        positive = positive[idx]
    if len(negative) > min_samples:
        idx = torch.randperm(len(negative))[:min_samples]
        negative = negative[idx]
    
    # Also sample the anchor to match
    if len(anchor) > min_samples:
        idx = torch.randperm(len(anchor))[:min_samples]
        anchor = anchor[idx]
    
    # Now calculate distances with balanced samples
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    
    return torch.mean(torch.clamp(pos_dist - neg_dist + margin, min=0))

# 8. Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FraudDiffuseNet(num_features=train_dataset.num_features).to(device)
diffusion = FraudDiffusion(train_dataset)
diffusion.beta = diffusion.beta.to(device)
diffusion.alpha = diffusion.alpha.to(device)
diffusion.alpha_hat = diffusion.alpha_hat.to(device)
diffusion.mu_nf = diffusion.mu_nf.to(device)
diffusion.sigma_nf = diffusion.sigma_nf.to(device)

optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

# Loss weights
w1 = 0.1  # Weight for probability loss
w2 = 0.1  # Weight for triplet loss

# Training configuration
num_epochs = 100
best_val_loss = float('inf')
early_stopping_patience = 5
early_stopping_counter = 0

# Create directories
os.makedirs('checkpoints/frauddiffuse', exist_ok=True)
os.makedirs('samples/frauddiffuse', exist_ok=True)

# 9. Training loop
training_start_time = time.time()

for epoch in range(num_epochs):
    model.train()
    epoch_losses = []
    epoch_start_time = time.time()
    
    progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}')
    for batch_features, batch_labels in progress_bar:
        batch_features = batch_features.to(device)
        batch_labels = batch_labels.to(device)
        batch_size = batch_features.shape[0]
        
        # Sample timesteps
        t = diffusion.sample_timesteps(batch_size).to(device)
        
        # Get noisy features and noise
        x_t, noise = diffusion.noise_images(batch_features, t)
        
        # Predict noise
        predicted_noise = model(x_t, t, batch_labels)
        
        # Calculate losses
        mse_loss = F.mse_loss(noise, predicted_noise)
        prob_loss = probability_loss(predicted_noise, diffusion.mu_nf, diffusion.sigma_nf)
        
        # UPDATED triplet loss calculation
        trip_loss = torch.tensor(0.0).to(device)
        fraud_mask = batch_labels == 1
        non_fraud_mask = batch_labels == 0
        if fraud_mask.any() and non_fraud_mask.any():
            fraud_samples = predicted_noise[fraud_mask]
            non_fraud_samples = predicted_noise[non_fraud_mask]
            if len(fraud_samples) > 0 and len(non_fraud_samples) > 0:
                # Use the smaller batch as anchor
                if len(fraud_samples) <= len(non_fraud_samples):
                    trip_loss = triplet_loss(fraud_samples, fraud_samples, non_fraud_samples)
                else:
                    trip_loss = triplet_loss(non_fraud_samples, non_fraud_samples, fraud_samples)
        
        # Combined loss
        loss = mse_loss + w1 * prob_loss + w2 * trip_loss
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_losses.append(loss.item())
        progress_bar.set_postfix({
            'loss': loss.item(),
            'avg_loss': np.mean(epoch_losses),
            'gpu_mem': f"{torch.cuda.memory_allocated()/1024**2:.0f}MB"
        })
    
    # Validation
    model.eval()
    val_losses = []
    with torch.no_grad():
        for val_features, val_labels in val_loader:
            val_features = val_features.to(device)
            val_labels = val_labels.to(device)
            batch_size = val_features.shape[0]
            
            t = diffusion.sample_timesteps(batch_size).to(device)
            x_t, noise = diffusion.noise_images(val_features, t)
            predicted_noise = model(x_t, t, val_labels)
            
            val_loss = F.mse_loss(noise, predicted_noise)
            val_losses.append(val_loss.item())
    
    # Calculate average losses
    train_loss = np.mean(epoch_losses)
    val_loss = np.mean(val_losses)
    
    # Learning rate scheduling
    scheduler.step()
    
    # Print epoch summary
    epoch_time = (time.time() - epoch_start_time) / 60
    print(f"\nEpoch {epoch+1}/{num_epochs} Summary:")
    print(f"Train Loss: {train_loss:.6f}")
    print(f"Val Loss: {val_loss:.6f}")
    print(f"Learning Rate: {scheduler.get_last_lr()[0]:.6f}")
    print(f"Epoch Time: {epoch_time:.1f} minutes")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'checkpoints/frauddiffuse/best_model.pt')
        print("Saved new best model!")
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1
    
    # Early stopping
    if early_stopping_counter >= early_stopping_patience:
        print(f"\nEarly stopping triggered after {epoch+1} epochs")
        break

# Print final training summary
total_training_time = time.time() - training_start_time
print("\nTraining completed!")
print(f"Total training time: {total_training_time/3600:.1f} hours")
print(f"Best validation loss: {best_val_loss:.6f}")

Epoch 1/100: 100%|██████████| 9407/9407 [01:08<00:00, 137.80it/s, loss=3.82e+9, avg_loss=4.77e+9, gpu_mem=81MB] 



Epoch 1/100 Summary:
Train Loss: 4773571137.626023
Val Loss: 4218566401.002303
Learning Rate: 0.000100
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 2/100: 100%|██████████| 9407/9407 [01:06<00:00, 142.28it/s, loss=3.45e+9, avg_loss=3.79e+9, gpu_mem=81MB]



Epoch 2/100 Summary:
Train Loss: 3785866024.303604
Val Loss: 3686459207.753109
Learning Rate: 0.000100
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 3/100: 100%|██████████| 9407/9407 [01:08<00:00, 137.77it/s, loss=2.94e+9, avg_loss=3.59e+9, gpu_mem=81MB]



Epoch 3/100 Summary:
Train Loss: 3585768195.986818
Val Loss: 3457558228.193459
Learning Rate: 0.000100
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 4/100: 100%|██████████| 9407/9407 [01:20<00:00, 116.68it/s, loss=3.28e+9, avg_loss=3.39e+9, gpu_mem=26MB]



Epoch 4/100 Summary:
Train Loss: 3389646098.015521
Val Loss: 3301076337.260249
Learning Rate: 0.000100
Epoch Time: 1.4 minutes
Saved new best model!


Epoch 5/100: 100%|██████████| 9407/9407 [01:24<00:00, 110.67it/s, loss=3.74e+9, avg_loss=3.27e+9, gpu_mem=26MB]



Epoch 5/100 Summary:
Train Loss: 3265047856.875944
Val Loss: 3230065718.301244
Learning Rate: 0.000050
Epoch Time: 1.5 minutes
Saved new best model!


Epoch 6/100: 100%|██████████| 9407/9407 [01:09<00:00, 135.68it/s, loss=3.68e+9, avg_loss=3.2e+9, gpu_mem=26MB] 



Epoch 6/100 Summary:
Train Loss: 3197526347.872010
Val Loss: 3192289668.716721
Learning Rate: 0.000050
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 7/100: 100%|██████████| 9407/9407 [01:21<00:00, 115.88it/s, loss=1.98e+9, avg_loss=3.16e+9, gpu_mem=26MB]



Epoch 7/100 Summary:
Train Loss: 3163393336.427766
Val Loss: 3232300259.169046
Learning Rate: 0.000050
Epoch Time: 1.5 minutes


Epoch 8/100: 100%|██████████| 9407/9407 [01:29<00:00, 105.65it/s, loss=4.03e+9, avg_loss=3.14e+9, gpu_mem=26MB]



Epoch 8/100 Summary:
Train Loss: 3139622344.824067
Val Loss: 3144024757.357900
Learning Rate: 0.000050
Epoch Time: 1.6 minutes
Saved new best model!


Epoch 9/100: 100%|██████████| 9407/9407 [01:28<00:00, 106.28it/s, loss=2.47e+9, avg_loss=3.12e+9, gpu_mem=26MB]



Epoch 9/100 Summary:
Train Loss: 3117264546.262145
Val Loss: 3139966354.159374
Learning Rate: 0.000050
Epoch Time: 1.6 minutes
Saved new best model!


Epoch 10/100: 100%|██████████| 9407/9407 [01:28<00:00, 106.43it/s, loss=3.66e+9, avg_loss=3.12e+9, gpu_mem=26MB]



Epoch 10/100 Summary:
Train Loss: 3116782505.732327
Val Loss: 3053407211.541225
Learning Rate: 0.000025
Epoch Time: 1.6 minutes
Saved new best model!


Epoch 11/100: 100%|██████████| 9407/9407 [01:27<00:00, 108.01it/s, loss=3.01e+9, avg_loss=3.07e+9, gpu_mem=26MB]



Epoch 11/100 Summary:
Train Loss: 3073021358.596790
Val Loss: 3103807486.526025
Learning Rate: 0.000025
Epoch Time: 1.6 minutes


Epoch 12/100: 100%|██████████| 9407/9407 [01:24<00:00, 111.22it/s, loss=4.05e+9, avg_loss=3.08e+9, gpu_mem=26MB]



Epoch 12/100 Summary:
Train Loss: 3081908854.556819
Val Loss: 3062055300.421926
Learning Rate: 0.000025
Epoch Time: 1.5 minutes


Epoch 13/100: 100%|██████████| 9407/9407 [01:22<00:00, 113.54it/s, loss=2.28e+9, avg_loss=3.05e+9, gpu_mem=26MB]



Epoch 13/100 Summary:
Train Loss: 3052357906.083555
Val Loss: 3079562436.864118
Learning Rate: 0.000025
Epoch Time: 1.5 minutes


Epoch 14/100: 100%|██████████| 9407/9407 [01:20<00:00, 116.19it/s, loss=2.82e+9, avg_loss=3.07e+9, gpu_mem=26MB]



Epoch 14/100 Summary:
Train Loss: 3065496078.967577
Val Loss: 3086353911.863657
Learning Rate: 0.000025
Epoch Time: 1.4 minutes


Epoch 15/100: 100%|██████████| 9407/9407 [01:19<00:00, 118.32it/s, loss=2.94e+9, avg_loss=3.06e+9, gpu_mem=26MB]



Epoch 15/100 Summary:
Train Loss: 3061312181.461465
Val Loss: 3057757389.413174
Learning Rate: 0.000013
Epoch Time: 1.4 minutes

Early stopping triggered after 15 epochs

Training completed!
Total training time: 0.4 hours
Best validation loss: 3053407211.541225


# Dataset Class Changes:
- Added MinMaxScaler with feature range (-1, 1) instead of standard scaling
- Added scaler parameter to allow reuse of the same scaling for validation/test sets
- Added inverse_transform method to convert generated samples back to original scale
- Added fit parameter to control when scaling is fitted


# Model Architecture Changes:
- Added LayerNorm to help with numerical stability
- Changed ReLU to GELU activation for better gradient flow
- Added Dropout layers (0.1) to prevent overfitting
- Enhanced time embedding with GELU and Dropout


# Diffusion Process Changes:
- Added clamp(min=1e-5) to prevent division by zero in standard deviation
- Scaled noise magnitude (ε 0.1) to prevent extreme values
- Added numerical stability in noise_images method

# Loss Function Changes:
- Added epsilon (1e-5) to prevent division by zero in probability_loss
- Scaled distances in triplet_loss (multiplied by 0.1)
- Reduced margin in triplet_loss from 1.0 to 0.1
- Added better handling of uneven batch sizes

In [45]:
# 1. Required imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
import time
from tqdm import tqdm
import os
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Dataset class with MinMax scaling
class FraudDataset(Dataset):
    def __init__(self, features, labels, scaler=None, fit=False):
        if scaler is None and fit:
            self.scaler = MinMaxScaler(feature_range=(-1, 1))
            self.features = torch.FloatTensor(self.scaler.fit_transform(features))
        elif scaler is not None:
            self.scaler = scaler
            self.features = torch.FloatTensor(self.scaler.transform(features))
        else:
            self.scaler = None
            self.features = torch.FloatTensor(features)
            
        self.labels = torch.LongTensor(labels)
        self.num_features = features.shape[1]
        
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
    
    def inverse_transform(self, features):
        if self.scaler is not None:
            return self.scaler.inverse_transform(features)
        return features

# 3. Time embedding
class SinusoidalEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

# 4. Model architecture
class FraudDiffuseNet(nn.Module):
    def __init__(self, num_features, hidden_size=256, time_dim=32):
        super().__init__()
        self.num_features = num_features
        
        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalEmbedding(time_dim),
            nn.Linear(time_dim, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1)
        )
        
        # Label embedding
        self.label_emb = nn.Embedding(2, hidden_size)
        
        # Feature transformation
        self.net = nn.Sequential(
            nn.Linear(num_features + hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_features)
        )
        
        # Layer normalization
        self.norm = nn.LayerNorm(num_features)
        
    def forward(self, x, t, labels):
        # Time and label embeddings
        t = self.time_mlp(t)
        t = t + self.label_emb(labels)
        
        # Concatenate features with time embedding
        x_t = torch.cat([x, t], dim=1)
        
        # Transform features
        return self.norm(self.net(x_t))

# 5. Diffusion process
class FraudDiffusion:
    def __init__(self, train_dataset, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.beta_start = beta_start
        self.beta_end = beta_end
        
        # Calculate non-fraud distribution parameters
        non_fraud_mask = train_dataset.labels == 0
        non_fraud_features = train_dataset.features[non_fraud_mask]
        self.mu_nf = non_fraud_features.mean(dim=0)
        self.sigma_nf = non_fraud_features.std(dim=0).clamp(min=1e-5)
        
        # Setup noise schedule
        self.beta = self.prepare_noise_schedule()
        self.alpha = 1. - self.beta
        self.alpha_hat = torch.cumprod(self.alpha, dim=0)
        
    def prepare_noise_schedule(self):
        return torch.linspace(self.beta_start, self.beta_end, self.num_timesteps)
    
    def noise_images(self, x, t):
        sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None]
        sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None]
        ε = torch.randn_like(x)
        scaled_noise = ε * 0.1
        return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * scaled_noise, scaled_noise
    
    def sample_timesteps(self, n):
        return torch.randint(low=1, high=self.num_timesteps, size=(n,))

# 6. Loss functions
def probability_loss(predicted_noise, prior_mu, prior_sigma):
    z_score = (predicted_noise - prior_mu) / (prior_sigma + 1e-5)
    return torch.mean(torch.abs(z_score))

def triplet_loss(anchor, positive, negative, margin=0.1):
    min_samples = min(len(positive), len(negative))
    
    if len(positive) > min_samples:
        idx = torch.randperm(len(positive))[:min_samples]
        positive = positive[idx]
    if len(negative) > min_samples:
        idx = torch.randperm(len(negative))[:min_samples]
        negative = negative[idx]
    if len(anchor) > min_samples:
        idx = torch.randperm(len(anchor))[:min_samples]
        anchor = anchor[idx]
    
    pos_dist = F.pairwise_distance(anchor, positive) * 0.1
    neg_dist = F.pairwise_distance(anchor, negative) * 0.1
    
    return torch.mean(torch.clamp(pos_dist - neg_dist + margin, min=0))

# 7. Load and prepare data
print("Loading and preparing data...")
train_data = pd.read_csv('Data/processed/train.csv').drop('Unnamed: 0', axis=1, errors='ignore')
val_data = pd.read_csv('Data/processed/val.csv').drop('Unnamed: 0', axis=1, errors='ignore')
test_data = pd.read_csv('Data/processed/test.csv').drop('Unnamed: 0', axis=1, errors='ignore')

# Print feature names for verification
print("\nFeatures being used:")
print(train_data.drop('is_fraud', axis=1).columns.tolist())

# Prepare features and labels
train_features = train_data.drop('is_fraud', axis=1).values
train_labels = train_data['is_fraud'].values
val_features = val_data.drop('is_fraud', axis=1).values
val_labels = val_data['is_fraud'].values
test_features = test_data.drop('is_fraud', axis=1).values
test_labels = test_data['is_fraud'].values

# Create datasets
train_dataset = FraudDataset(train_features, train_labels, scaler=None, fit=True)
val_dataset = FraudDataset(val_features, val_labels, scaler=train_dataset.scaler)
test_dataset = FraudDataset(test_features, test_labels, scaler=train_dataset.scaler)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128)
test_loader = DataLoader(test_dataset, batch_size=128)

# 8. Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FraudDiffuseNet(num_features=train_features.shape[1]).to(device)
diffusion = FraudDiffusion(train_dataset)

# Move diffusion parameters to device
diffusion.beta = diffusion.beta.to(device)
diffusion.alpha = diffusion.alpha.to(device)
diffusion.alpha_hat = diffusion.alpha_hat.to(device)
diffusion.mu_nf = diffusion.mu_nf.to(device)
diffusion.sigma_nf = diffusion.sigma_nf.to(device)

# Optimizer and scheduler
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5, verbose=True)

# Create directories
os.makedirs('checkpoints', exist_ok=True)
os.makedirs('samples', exist_ok=True)
os.makedirs('metrics', exist_ok=True)

# Training configuration
num_epochs = 100
best_val_loss = float('inf')
early_stopping_patience = 10
early_stopping_counter = 0
training_start_time = time.time()

# Training history
history = {
    'train_loss': [], 'val_loss': [],
    'prob_loss': [], 'triplet_loss': [],
    'learning_rate': []
}

# 9. Training loop
print("\nStarting training...")
print(f"Training on device: {device}")
print(f"Number of features: {train_features.shape[1]}")
print(f"Number of epochs: {num_epochs}")
print(f"Batch size: 128")
print(f"Initial learning rate: 1e-4")

for epoch in range(num_epochs):
    model.train()
    epoch_start_time = time.time()
    
    # Training metrics
    train_losses = []
    prob_losses = []
    triplet_losses = []
    
    progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}')
    for batch_features, batch_labels in progress_bar:
        batch_features = batch_features.to(device)
        batch_labels = batch_labels.to(device)
        
        # Sample timesteps
        t = diffusion.sample_timesteps(batch_features.shape[0]).to(device)
        
        # Get noisy features and noise
        x_t, noise = diffusion.noise_images(batch_features, t)
        
        # Predict noise
        predicted_noise = model(x_t, t, batch_labels)
        
        # Calculate losses
        mse_loss = F.mse_loss(noise, predicted_noise)
        prob_loss = probability_loss(predicted_noise, diffusion.mu_nf, diffusion.sigma_nf)
        
        # Calculate triplet loss if we have both fraud and non-fraud samples
        trip_loss = torch.tensor(0.0).to(device)
        fraud_mask = batch_labels == 1
        non_fraud_mask = batch_labels == 0
        
        if fraud_mask.any() and non_fraud_mask.any():
            fraud_samples = predicted_noise[fraud_mask]
            non_fraud_samples = predicted_noise[non_fraud_mask]
            if len(fraud_samples) > 0 and len(non_fraud_samples) > 0:
                trip_loss = triplet_loss(fraud_samples, fraud_samples, non_fraud_samples)
        
        # Combined loss
        loss = mse_loss + 0.1 * prob_loss + 0.1 * trip_loss
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        # Record losses
        train_losses.append(loss.item())
        prob_losses.append(prob_loss.item())
        triplet_losses.append(trip_loss.item())
        
        # Update progress bar
        progress_bar.set_postfix({
            'loss': f"{loss.item():.4f}",
            'prob_loss': f"{prob_loss.item():.4f}",
            'trip_loss': f"{trip_loss.item():.4f}"
        })
    
    # Validation phase
    model.eval()
    val_losses = []
    
    with torch.no_grad():
        for val_features, val_labels in val_loader:
            val_features = val_features.to(device)
            val_labels = val_labels.to(device)
            
            t = diffusion.sample_timesteps(val_features.shape[0]).to(device)
            x_t, noise = diffusion.noise_images(val_features, t)
            predicted_noise = model(x_t, t, val_labels)
            
            val_loss = F.mse_loss(noise, predicted_noise)
            val_losses.append(val_loss.item())
    
    # Calculate epoch metrics
    train_loss = np.mean(train_losses)
    val_loss = np.mean(val_losses)
    epoch_prob_loss = np.mean(prob_losses)
    epoch_triplet_loss = np.mean(triplet_losses)
    
    # Update learning rate
    scheduler.step(val_loss)
    
    # Update history
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['prob_loss'].append(epoch_prob_loss)
    history['triplet_loss'].append(epoch_triplet_loss)
    history['learning_rate'].append(optimizer.param_groups[0]['lr'])
    
    # Print epoch summary
    epoch_time = time.time() - epoch_start_time
    print(f"\nEpoch {epoch+1}/{num_epochs} Summary:")
    print(f"Train Loss: {train_loss:.6f}")
    print(f"Val Loss: {val_loss:.6f}")
    print(f"Probability Loss: {epoch_prob_loss:.6f}")
    print(f"Triplet Loss: {epoch_triplet_loss:.6f}")
    print(f"Learning Rate: {optimizer.param_groups[0]['lr']:.6f}")
    print(f"Epoch Time: {epoch_time/60:.1f} minutes")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'train_loss': train_loss,
            'val_loss': val_loss,
            'history': history,
            'scaler': train_dataset.scaler,
            'feature_names': train_data.drop('is_fraud', axis=1).columns.tolist()
        }, 'checkpoints/best_model_v2.pt')
        print("Saved new best model!")
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1
    
    # Early stopping
    if early_stopping_counter >= early_stopping_patience:
        print(f"\nEarly stopping triggered after {epoch+1} epochs")
        break

# Print final training summary
total_training_time = time.time() - training_start_time
print("\nTraining completed!")
print(f"Total training time: {total_training_time/3600:.1f} hours")
print(f"Best validation loss: {best_val_loss:.6f}")

Loading and preparing data...

Features being used:
['cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long', 'trans_hour', 'trans_day', 'trans_month', 'trans_dayofweek']





Starting training...
Training on device: cuda
Number of features: 24
Number of epochs: 100
Batch size: 128
Initial learning rate: 1e-4


Epoch 1/100: 100%|██████████| 9407/9407 [01:06<00:00, 140.96it/s, loss=0.0558, prob_loss=0.3549, trip_loss=0.0000]



Epoch 1/100 Summary:
Train Loss: 0.232400
Val Loss: 0.020758
Probability Loss: 0.323599
Triplet Loss: 0.000126
Learning Rate: 0.000100
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 2/100: 100%|██████████| 9407/9407 [01:03<00:00, 147.98it/s, loss=0.0554, prob_loss=0.3650, trip_loss=0.0000]



Epoch 2/100 Summary:
Train Loss: 0.055730
Val Loss: 0.019081
Probability Loss: 0.360616
Triplet Loss: 0.000189
Learning Rate: 0.000100
Epoch Time: 1.1 minutes
Saved new best model!


Epoch 3/100: 100%|██████████| 9407/9407 [00:57<00:00, 163.01it/s, loss=0.0551, prob_loss=0.3607, trip_loss=0.0000]



Epoch 3/100 Summary:
Train Loss: 0.055412
Val Loss: 0.019043
Probability Loss: 0.362016
Triplet Loss: 0.000177
Learning Rate: 0.000100
Epoch Time: 1.0 minutes
Saved new best model!


Epoch 4/100: 100%|██████████| 9407/9407 [01:24<00:00, 111.61it/s, loss=0.0542, prob_loss=0.3556, trip_loss=0.0000]



Epoch 4/100 Summary:
Train Loss: 0.055189
Val Loss: 0.018684
Probability Loss: 0.361937
Triplet Loss: 0.000174
Learning Rate: 0.000100
Epoch Time: 1.5 minutes
Saved new best model!


Epoch 5/100: 100%|██████████| 9407/9407 [01:18<00:00, 119.24it/s, loss=0.0543, prob_loss=0.3641, trip_loss=0.0000]



Epoch 5/100 Summary:
Train Loss: 0.055003
Val Loss: 0.018466
Probability Loss: 0.363193
Triplet Loss: 0.000171
Learning Rate: 0.000100
Epoch Time: 1.4 minutes
Saved new best model!


Epoch 6/100: 100%|██████████| 9407/9407 [01:14<00:00, 125.53it/s, loss=0.0563, prob_loss=0.3626, trip_loss=0.0000]



Epoch 6/100 Summary:
Train Loss: 0.054876
Val Loss: 0.018525
Probability Loss: 0.363490
Triplet Loss: 0.000181
Learning Rate: 0.000100
Epoch Time: 1.3 minutes


Epoch 7/100: 100%|██████████| 9407/9407 [01:23<00:00, 112.98it/s, loss=0.0546, prob_loss=0.3577, trip_loss=0.0000]



Epoch 7/100 Summary:
Train Loss: 0.054797
Val Loss: 0.018767
Probability Loss: 0.363645
Triplet Loss: 0.000196
Learning Rate: 0.000100
Epoch Time: 1.4 minutes


Epoch 8/100: 100%|██████████| 9407/9407 [01:14<00:00, 126.65it/s, loss=0.0549, prob_loss=0.3672, trip_loss=0.0000]



Epoch 8/100 Summary:
Train Loss: 0.054741
Val Loss: 0.018052
Probability Loss: 0.363680
Triplet Loss: 0.000201
Learning Rate: 0.000100
Epoch Time: 1.3 minutes
Saved new best model!


Epoch 9/100: 100%|██████████| 9407/9407 [01:07<00:00, 138.86it/s, loss=0.0549, prob_loss=0.3621, trip_loss=0.0000]



Epoch 9/100 Summary:
Train Loss: 0.054701
Val Loss: 0.018181
Probability Loss: 0.363684
Triplet Loss: 0.000185
Learning Rate: 0.000100
Epoch Time: 1.2 minutes


Epoch 10/100: 100%|██████████| 9407/9407 [01:14<00:00, 126.85it/s, loss=0.0539, prob_loss=0.3606, trip_loss=0.0000]



Epoch 10/100 Summary:
Train Loss: 0.054652
Val Loss: 0.017993
Probability Loss: 0.363641
Triplet Loss: 0.000180
Learning Rate: 0.000100
Epoch Time: 1.3 minutes
Saved new best model!


Epoch 11/100: 100%|██████████| 9407/9407 [01:10<00:00, 134.24it/s, loss=0.0534, prob_loss=0.3615, trip_loss=0.0000]



Epoch 11/100 Summary:
Train Loss: 0.054627
Val Loss: 0.018216
Probability Loss: 0.363659
Triplet Loss: 0.000185
Learning Rate: 0.000100
Epoch Time: 1.2 minutes


Epoch 12/100: 100%|██████████| 9407/9407 [01:05<00:00, 142.79it/s, loss=0.0534, prob_loss=0.3667, trip_loss=0.0000]



Epoch 12/100 Summary:
Train Loss: 0.054604
Val Loss: 0.017729
Probability Loss: 0.363654
Triplet Loss: 0.000181
Learning Rate: 0.000100
Epoch Time: 1.2 minutes
Saved new best model!


Epoch 13/100: 100%|██████████| 9407/9407 [01:07<00:00, 139.38it/s, loss=0.0543, prob_loss=0.3641, trip_loss=0.0000]



Epoch 13/100 Summary:
Train Loss: 0.054569
Val Loss: 0.018037
Probability Loss: 0.363606
Triplet Loss: 0.000193
Learning Rate: 0.000100
Epoch Time: 1.2 minutes


Epoch 14/100: 100%|██████████| 9407/9407 [01:16<00:00, 123.59it/s, loss=0.0547, prob_loss=0.3657, trip_loss=0.0000]



Epoch 14/100 Summary:
Train Loss: 0.054512
Val Loss: 0.018289
Probability Loss: 0.363212
Triplet Loss: 0.000179
Learning Rate: 0.000100
Epoch Time: 1.3 minutes


Epoch 15/100: 100%|██████████| 9407/9407 [01:11<00:00, 131.09it/s, loss=0.0530, prob_loss=0.3569, trip_loss=0.0000]



Epoch 15/100 Summary:
Train Loss: 0.054448
Val Loss: 0.017934
Probability Loss: 0.362927
Triplet Loss: 0.000199
Learning Rate: 0.000100
Epoch Time: 1.3 minutes


Epoch 16/100: 100%|██████████| 9407/9407 [01:08<00:00, 138.30it/s, loss=0.0543, prob_loss=0.3614, trip_loss=0.0000]



Epoch 16/100 Summary:
Train Loss: 0.054415
Val Loss: 0.018093
Probability Loss: 0.362909
Triplet Loss: 0.000201
Learning Rate: 0.000100
Epoch Time: 1.2 minutes


Epoch 17/100: 100%|██████████| 9407/9407 [00:58<00:00, 160.63it/s, loss=0.0537, prob_loss=0.3554, trip_loss=0.0000]



Epoch 17/100 Summary:
Train Loss: 0.054407
Val Loss: 0.018125
Probability Loss: 0.362895
Triplet Loss: 0.000195
Learning Rate: 0.000100
Epoch Time: 1.0 minutes


Epoch 18/100: 100%|██████████| 9407/9407 [00:58<00:00, 161.68it/s, loss=0.0553, prob_loss=0.3644, trip_loss=0.0000]



Epoch 18/100 Summary:
Train Loss: 0.054395
Val Loss: 0.018023
Probability Loss: 0.362924
Triplet Loss: 0.000192
Learning Rate: 0.000050
Epoch Time: 1.0 minutes


Epoch 19/100: 100%|██████████| 9407/9407 [01:02<00:00, 151.62it/s, loss=0.0544, prob_loss=0.3586, trip_loss=0.0000]



Epoch 19/100 Summary:
Train Loss: 0.054365
Val Loss: 0.017924
Probability Loss: 0.362959
Triplet Loss: 0.000189
Learning Rate: 0.000050
Epoch Time: 1.1 minutes


Epoch 20/100: 100%|██████████| 9407/9407 [01:07<00:00, 140.10it/s, loss=0.0553, prob_loss=0.3637, trip_loss=0.0000]



Epoch 20/100 Summary:
Train Loss: 0.054356
Val Loss: 0.018140
Probability Loss: 0.362954
Triplet Loss: 0.000176
Learning Rate: 0.000050
Epoch Time: 1.2 minutes


Epoch 21/100: 100%|██████████| 9407/9407 [01:08<00:00, 138.12it/s, loss=0.0534, prob_loss=0.3593, trip_loss=0.0000]



Epoch 21/100 Summary:
Train Loss: 0.054353
Val Loss: 0.018133
Probability Loss: 0.362958
Triplet Loss: 0.000196
Learning Rate: 0.000050
Epoch Time: 1.2 minutes


Epoch 22/100: 100%|██████████| 9407/9407 [01:03<00:00, 148.22it/s, loss=0.0549, prob_loss=0.3673, trip_loss=0.0000]



Epoch 22/100 Summary:
Train Loss: 0.054345
Val Loss: 0.017887
Probability Loss: 0.362949
Triplet Loss: 0.000206
Learning Rate: 0.000050
Epoch Time: 1.1 minutes

Early stopping triggered after 22 epochs

Training completed!
Total training time: 0.4 hours
Best validation loss: 0.017729
