# Legal Clause Semantic Similarity Detection - GPU Optimized
## Baseline NLP Architectures with GPU Acceleration

**Author:** NLP Expert Implementation  
**Date:** November 2025  
**Dataset:** Legal Clause Dataset from Kaggle  
**Hardware:** GPU-Accelerated Training with Mixed Precision

---

### 🚀 GPU Optimizations Included

1. **Automatic GPU Detection and Configuration**
2. **Memory Growth Management** - Allocates GPU memory on demand instead of reserving it all up front
3. **Mixed Precision Training** - Up to 2-3x faster training on GPUs with Tensor Cores
4. **XLA Compilation** - Just-In-Time optimization
5. **Data Pipeline Optimization** - `tf.data` with prefetching
6. **Batch Size Optimization** - Larger batches for GPU
7. **Multi-GPU Support** - Automatic distribution strategy

### 📋 Assignment Overview

This notebook implements two baseline NLP models to detect semantic similarity between legal clauses:
1. **BiLSTM Siamese Network** - Shared encoder architecture
2. **Attention-Based Encoder** - Self-attention mechanism

### 🚫 Constraints
- No pre-trained transformers (BERT, RoBERTa, Legal-BERT)
- Only TensorFlow/Keras built-in layers

## 1. GPU Configuration and Setup

In [None]:
"""GPU Configuration and Environment Setup."""

import os
import tensorflow as tf
from tensorflow import keras
import numpy as np

# ============================================================================
# GPU CONFIGURATION
# ============================================================================

def configure_gpu():
    """
    Configure TensorFlow for optimal GPU usage.
    
    Optimizations:
    1. Enable memory growth to prevent TensorFlow from allocating all GPU memory
    2. Set up mixed precision for faster training (up to 2-3x on Tensor Core GPUs)
    3. Enable XLA compilation for optimized operations
    4. Configure multi-GPU strategy if available
    """
    print("=" * 80)
    print("GPU CONFIGURATION")
    print("=" * 80)
    
    # List available GPUs
    gpus = tf.config.list_physical_devices('GPU')
    print("\n🔍 Detecting GPUs...")
    print(f"Number of GPUs Available: {len(gpus)}")
    
    if gpus:
        try:
            # Enable memory growth for all GPUs
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
                print(f"✓ Memory growth enabled for: {gpu.name}")
            
            # Get GPU details
            for i, gpu in enumerate(gpus):
                print(f"\nGPU {i}: {gpu.name}")
                print(f"  Type: {gpu.device_type}")
            
            # Enable Mixed Precision Training
            print("\n🚀 Enabling Mixed Precision Training (float16)...")
            policy = keras.mixed_precision.Policy('mixed_float16')
            keras.mixed_precision.set_global_policy(policy)
            print(f"✓ Compute dtype: {policy.compute_dtype}")
            print(f"✓ Variable dtype: {policy.variable_dtype}")
            print("✓ Mixed precision enabled - up to 2-3x speedup on Tensor Core GPUs")
            
            # Enable XLA (Accelerated Linear Algebra)
            print("\n⚡ Enabling XLA Compilation...")
            tf.config.optimizer.set_jit(True)
            print("✓ XLA JIT compilation enabled")
            
            # Set up distribution strategy for multi-GPU
            if len(gpus) > 1:
                print(f"\n🔄 Setting up Multi-GPU Strategy ({len(gpus)} GPUs)...")
                strategy = tf.distribute.MirroredStrategy()
                print("✓ MirroredStrategy initialized")
                print(f"✓ Number of devices: {strategy.num_replicas_in_sync}")
                return strategy
            else:
                print("\n✓ Single GPU mode")
                return None
                
        except RuntimeError as e:
            print(f"\n⚠️ GPU configuration error: {e}")
            return None
    else:
        print("\n⚠️ No GPU detected. Running on CPU.")
        print("   For GPU support, ensure:")
        print("   1. NVIDIA GPU with CUDA support")
        print("   2. CUDA Toolkit installed (11.2+)")
        print("   3. cuDNN library installed (8.1+)")
        print("   4. TensorFlow-GPU installed: pip install tensorflow[and-cuda]")
        return None

# Configure GPU
strategy = configure_gpu()

# Display TensorFlow build information
print("\n" + "=" * 80)
print("TENSORFLOW BUILD INFO")
print("=" * 80)
print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
print(f"CUDA available: {tf.test.is_built_with_cuda()}")
print(f"GPU available: {bool(tf.config.list_physical_devices('GPU'))}")
print("\n✅ GPU Configuration Complete!\n")
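
Mixed precision runs most computation in float16, whose smallest positive value is about 6e-8, so small gradients can underflow to zero. Loss scaling counters this, and according to the Keras documentation `Model.fit` applies it automatically under the `mixed_float16` policy. The sketch below uses only the standard library to illustrate the effect; `to_float16`, the gradient value, and the scale factor of 2**15 are illustrative assumptions, not values read from TensorFlow.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                 # a small gradient value (illustrative)
loss_scale = 2.0 ** 15      # assumed scale factor for this sketch

underflowed = to_float16(grad)            # too small for float16
scaled = to_float16(grad * loss_scale)    # representable after scaling
recovered = scaled / loss_scale           # unscale in higher precision

print(underflowed)   # 0.0
print(recovered)
```

The unscaled gradient is lost entirely, while the scaled one survives the round trip with a small relative error.
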

## 2. Environment Setup and Imports

In [None]:
# Core libraries
import re
import glob
import warnings
import time
from typing import List, Tuple, Dict

warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# TensorFlow/Keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
    Input, Embedding, LSTM, Bidirectional, Dense, Dropout, 
    Lambda, Concatenate, Multiply, Attention,
    GlobalAveragePooling1D, BatchNormalization
)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
    classification_report, roc_curve, precision_recall_curve
)

# Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print("\n✅ All imports successful!")

## 3. GPU-Optimized Configuration

In [None]:
class GPUConfig:
    """GPU-optimized configuration for hyperparameters and settings."""
    
    # Data paths
    DATA_DIR = '.'
    
    # Text preprocessing
    MAX_VOCAB_SIZE = 20000
    MAX_SEQUENCE_LENGTH = 200
    OOV_TOKEN = '<OOV>'
    
    # Model architecture
    EMBEDDING_DIM = 128
    LSTM_UNITS = 128
    ATTENTION_UNITS = 64
    DENSE_UNITS = 64
    DROPOUT_RATE = 0.3
    
    # GPU-Optimized Training Parameters
    BATCH_SIZE = 128  # Increased for GPU (was 64)
    EPOCHS = 50
    LEARNING_RATE = 0.001
    VALIDATION_SPLIT = 0.2
    TEST_SPLIT = 0.2
    
    # Data pipeline optimization
    PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE  # Auto-tune prefetching
    NUM_PARALLEL_CALLS = tf.data.AUTOTUNE  # Auto-tune parallel processing
    
    # Pair generation
    POSITIVE_PAIRS_PER_CATEGORY = 100
    NEGATIVE_SAMPLE_RATIO = 1.0
    
    # Callbacks
    EARLY_STOPPING_PATIENCE = 10
    REDUCE_LR_PATIENCE = 5
    
    # Mixed precision
    USE_MIXED_PRECISION = True
    
    # XLA compilation
    USE_XLA = True

config = GPUConfig()

print("GPU-Optimized Configuration:")
print(f"  Max Vocabulary Size: {config.MAX_VOCAB_SIZE}")
print(f"  Max Sequence Length: {config.MAX_SEQUENCE_LENGTH}")
print(f"  Batch Size (GPU-optimized): {config.BATCH_SIZE}")
print(f"  Embedding Dimension: {config.EMBEDDING_DIM}")
print(f"  LSTM Units: {config.LSTM_UNITS}")
print(f"  Mixed Precision: {config.USE_MIXED_PRECISION}")
print(f"  XLA Compilation: {config.USE_XLA}")
print(f"  Data Prefetching: AUTOTUNE")

## 4. Data Loading and Exploration

In [None]:
def load_legal_clauses(data_dir: str) -> pd.DataFrame:
    """
    Load all CSV files from the data directory and combine into a single DataFrame.
    
    Args:
        data_dir: Directory containing CSV files
        
    Returns:
        Combined DataFrame with all legal clauses
    """
    csv_files = glob.glob(os.path.join(data_dir, '*.csv'))
    
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    
    print(f"Found {len(csv_files)} CSV files")
    
    all_data = []
    
    for file_path in csv_files:
        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
            all_data.append(df)
            print(f"✓ Loaded: {os.path.basename(file_path)} - {len(df)} clauses")
        except Exception as e:
            print(f"✗ Error loading {file_path}: {e}")
    
    combined_df = pd.concat(all_data, ignore_index=True)
    return combined_df

# Load data
print("=" * 80)
print("LOADING LEGAL CLAUSE DATASET")
print("=" * 80)

df_clauses = load_legal_clauses(config.DATA_DIR)

print(f"\nTotal clauses loaded: {len(df_clauses)}")
print(f"Columns: {list(df_clauses.columns)}")
print(f"DataFrame Shape: {df_clauses.shape}")
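
The loader normalizes column headers so that later cells can rely on `clause_text` and `clause_type`. A minimal standard-library sketch of the same normalization, plus an early check for those two columns, might look like this; `normalize_header`, `missing_columns`, and the sample CSV text are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"clause_text", "clause_type"}  # names used by later cells

def normalize_header(name: str) -> str:
    """Mirror the loader: trim, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

def missing_columns(csv_text: str) -> set:
    """Return the required columns that are absent after normalization."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return REQUIRED_COLUMNS - {normalize_header(col) for col in header}

sample = "Clause Text, Clause Type \nThe tenant shall pay rent.,payment\n"
print(missing_columns(sample))              # set()
print(missing_columns("text,label\na,b"))   # both required names are missing
```
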

In [None]:
# Data exploration
print("\n" + "=" * 80)
print("DATA EXPLORATION")
print("=" * 80)

# Check for missing values and clean
print("\nMissing Values:")
print(df_clauses.isnull().sum())

# Drop only rows missing the two columns the pipeline needs
df_clauses = df_clauses.dropna(subset=['clause_text', 'clause_type'])
print(f"\nClauses after removing NaN: {len(df_clauses)}")

# Display sample
print("\nSample Clauses:")
print(df_clauses.head())

# Clause type distribution
print("\nClause Type Distribution:")
clause_type_counts = df_clauses['clause_type'].value_counts()
print(clause_type_counts.head(20))
print(f"\nTotal unique clause types: {df_clauses['clause_type'].nunique()}")

In [None]:
# Visualizations
plt.figure(figsize=(14, 6))

top_20_types = df_clauses['clause_type'].value_counts().head(20)
plt.subplot(1, 2, 1)
top_20_types.plot(kind='barh', color='steelblue')
plt.xlabel('Number of Clauses')
plt.ylabel('Clause Type')
plt.title('Top 20 Most Common Clause Types')
plt.gca().invert_yaxis()

df_clauses['text_length'] = df_clauses['clause_text'].str.len()
plt.subplot(1, 2, 2)
plt.hist(df_clauses['text_length'], bins=50, color='coral', edgecolor='black')
plt.xlabel('Character Length')
plt.ylabel('Frequency')
plt.title('Distribution of Clause Text Length')
plt.axvline(df_clauses['text_length'].mean(), color='red', linestyle='--', label='Mean')
plt.legend()

plt.tight_layout()
plt.show()

print(f"\nText Length Statistics:")
print(df_clauses['text_length'].describe())
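
The histogram above measures length in characters, while `MAX_SEQUENCE_LENGTH` counts tokens. One way to sanity-check a chosen limit is to look at token-count percentiles and at the share of clauses that fit without truncation. The helper names and sample clauses below are made up for illustration.

```python
def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct * n / 100) using integers
    return ordered[max(rank, 1) - 1]

def coverage(values, max_len: int) -> float:
    """Fraction of sequences that fit within max_len tokens."""
    return sum(v <= max_len for v in values) / len(values)

token_counts = [len(text.split()) for text in [
    "the tenant shall pay rent monthly",
    "this agreement is governed by the laws of the state",
    "notices must be in writing",
]]
print(token_counts)                                 # [6, 10, 5]
print(percentile_nearest_rank(token_counts, 95))    # 10
print(coverage(token_counts, 8))
```
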

## 5. Text Preprocessing

In [None]:
def preprocess_text(text: str) -> str:
    """
    Preprocess legal text by cleaning and normalizing.
    
    Args:
        text: Raw text string
        
    Returns:
        Cleaned text string
    """
    if not isinstance(text, str):
        return ""
    
    text = text.lower()
    text = ' '.join(text.split())
    text = re.sub(r'[^a-z0-9\s\.,;:\-]', '', text)
    text = re.sub(r'([.,;:]){2,}', r'\1', text)
    
    return text.strip()

print("Preprocessing clause texts...")
df_clauses['clause_text_clean'] = df_clauses['clause_text'].apply(preprocess_text)
df_clauses = df_clauses[df_clauses['clause_text_clean'].str.len() > 0]

print(f"Clauses after preprocessing: {len(df_clauses)}")
print("\nExample:")
print(f"Original: {df_clauses['clause_text'].iloc[0][:200]}")
print(f"Cleaned:  {df_clauses['clause_text_clean'].iloc[0][:200]}")
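
To make the effect of each cleaning step concrete, here is the same function applied to a made-up clause. The function body is repeated so the example runs on its own; the input strings are illustrative.

```python
import re

def preprocess_text(text: str) -> str:
    """Copy of the notebook's cleaning steps, repeated so this example runs alone."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = ' '.join(text.split())                    # collapse whitespace
    text = re.sub(r'[^a-z0-9\s\.,;:\-]', '', text)   # drop other symbols
    text = re.sub(r'([.,;:]){2,}', r'\1', text)      # collapse punctuation runs
    return text.strip()

raw = 'The Borrower  shall pay $1,000 (the "Fee");;  within 30-days!!'
print(preprocess_text(raw))
# the borrower shall pay 1,000 the fee; within 30-days

print(preprocess_text("as agreed.,"))   # as agreed,
print(repr(preprocess_text(None)))      # ''
```

Note that currency symbols and parentheses are removed, and a mixed run of punctuation keeps only its last character.
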

## 6. Generate Training Pairs

In [None]:
def generate_clause_pairs(df: pd.DataFrame, 
                         max_positive_per_category: int = 100,
                         negative_ratio: float = 1.0) -> Tuple[List, List, List]:
    """
    Generate positive (similar) and negative (different) clause pairs.
    
    Args:
        df: DataFrame with clause_text_clean and clause_type columns
        max_positive_per_category: Maximum positive pairs per category
        negative_ratio: Ratio of negative to positive pairs
        
    Returns:
        Tuple of (clause1_list, clause2_list, labels_list)
    """
    clause1_list = []
    clause2_list = []
    labels_list = []
    
    grouped = df.groupby('clause_type')
    
    print("Generating positive pairs (same category)...")
    positive_count = 0
    
    for clause_type, group in grouped:
        texts = group['clause_text_clean'].values
        
        if len(texts) < 2:
            continue
        
        pairs_generated = 0
        for i in range(len(texts)):
            if pairs_generated >= max_positive_per_category:
                break
            for j in range(i + 1, len(texts)):
                if pairs_generated >= max_positive_per_category:
                    break
                clause1_list.append(texts[i])
                clause2_list.append(texts[j])
                labels_list.append(1)
                pairs_generated += 1
                positive_count += 1
    
    print(f"Generated {positive_count} positive pairs")
    
    print("Generating negative pairs (different categories)...")
    negative_target = int(positive_count * negative_ratio)
    negative_count = 0
    
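    # Look up each category's texts once; get_group() inside the loop is slow.
    texts_by_type = {
        clause_type: group['clause_text_clean'].values
        for clause_type, group in grouped
    }
    clause_types = list(texts_by_type)
    if negative_target > 0 and len(clause_types) < 2:
        raise ValueError("Need at least two clause types to build negative pairs")
    
    while negative_count < negative_target:
        idx1, idx2 = np.random.choice(len(clause_types), size=2, replace=False)
        text1 = np.random.choice(texts_by_type[clause_types[idx1]])
        text2 = np.random.choice(texts_by_type[clause_types[idx2]])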
        
        clause1_list.append(text1)
        clause2_list.append(text2)
        labels_list.append(0)
        negative_count += 1
    
    print(f"Generated {negative_count} negative pairs")
    print(f"Total pairs: {len(labels_list)}")
    print(f"Class balance: {np.mean(labels_list):.2%} positive")
    
    return clause1_list, clause2_list, labels_list

print("\n" + "=" * 80)
print("GENERATING CLAUSE PAIRS")
print("=" * 80)

clause1, clause2, labels = generate_clause_pairs(
    df_clauses,
    max_positive_per_category=config.POSITIVE_PAIRS_PER_CATEGORY,
    negative_ratio=config.NEGATIVE_SAMPLE_RATIO
)

clause1 = np.array(clause1)
clause2 = np.array(clause2)
labels = np.array(labels)

print(f"\nDataset size: {len(labels)} pairs")
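
The nested loop in `generate_clause_pairs` stops as soon as it reaches the cap, so in a large category most positive pairs share the first few clauses. If more varied pairs are wanted, one option is to sample index pairs uniformly. This is a sketch of an alternative, not what the notebook does; `sample_positive_pairs` is a hypothetical helper.

```python
import itertools
import random

def sample_positive_pairs(texts, max_pairs, seed=42):
    """Return up to max_pairs distinct unordered pairs, drawn uniformly at random."""
    n = len(texts)
    total = n * (n - 1) // 2
    if total <= max_pairs:
        # Small category: keep every pair.
        return list(itertools.combinations(texts, 2))
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_pairs:
        i, j = rng.sample(range(n), 2)        # two different positions
        chosen.add((min(i, j), max(i, j)))    # store as an unordered pair
    return [(texts[i], texts[j]) for i, j in sorted(chosen)]

texts = [f"clause {k}" for k in range(50)]
pairs = sample_positive_pairs(texts, max_pairs=100)
print(len(pairs))                            # 100
print(len({first for first, _ in pairs}))    # number of distinct first clauses
```
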

## 7. Tokenization and Sequence Preparation

In [None]:
def prepare_sequences(clause1: np.ndarray, 
                     clause2: np.ndarray, 
                     labels: np.ndarray,
                     max_vocab_size: int,
                     max_seq_length: int,
                     oov_token: str) -> Tuple:
    """Tokenize and prepare sequences."""
    print("Tokenizing texts...")
    
    all_texts = np.concatenate([clause1, clause2])
    
    tokenizer = Tokenizer(
        num_words=max_vocab_size,
        oov_token=oov_token,
        lower=True
    )
    tokenizer.fit_on_texts(all_texts)
    
    clause1_seq = tokenizer.texts_to_sequences(clause1)
    clause2_seq = tokenizer.texts_to_sequences(clause2)
    
    clause1_padded = pad_sequences(clause1_seq, maxlen=max_seq_length, padding='post', truncating='post')
    clause2_padded = pad_sequences(clause2_seq, maxlen=max_seq_length, padding='post', truncating='post')
    
    vocab_size = min(max_vocab_size, len(tokenizer.word_index) + 1)  # ids >= num_words map to OOV
    print(f"Vocabulary size: {vocab_size}")
    
    return clause1_padded, clause2_padded, labels, tokenizer

print("\n" + "=" * 80)
print("TOKENIZATION")
print("=" * 80)

clause1_seq, clause2_seq, labels, tokenizer = prepare_sequences(
    clause1, clause2, labels,
    config.MAX_VOCAB_SIZE,
    config.MAX_SEQUENCE_LENGTH,
    config.OOV_TOKEN
)

vocab_size = min(config.MAX_VOCAB_SIZE, len(tokenizer.word_index) + 1)  # embedding rows actually used
print(f"\nFinal vocabulary size: {vocab_size}")
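
As a rough picture of what the tokenizer and padding steps produce, the sketch below mimics them in plain Python: index 0 is reserved for padding, index 1 for the OOV token, and words ranked at or beyond `num_words` fall back to OOV. It is a simplified stand-in, not the Keras implementation (for instance, it does no punctuation filtering), and the helper names are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; 1 is the OOV token, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(vocab, start=1)}

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to ids; unknown or low-rank words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_id)
        ids.append(idx if idx < num_words else oov_id)
    return ids

def pad_post(ids, maxlen, value=0):
    """Truncate or right-pad to exactly maxlen (padding='post', truncating='post')."""
    return ids[:maxlen] + [value] * max(0, maxlen - len(ids))

texts = ["the tenant shall pay rent", "the landlord shall repair the roof"]
word_index = build_word_index(texts)
ids = encode(texts[0], word_index, num_words=5)
print(ids)                 # [2, 4, 3, 1, 1]
print(pad_post(ids, 7))    # [2, 4, 3, 1, 1, 0, 0]
print(pad_post(ids, 3))    # [2, 4, 3]
```

With `num_words=5` only ids 0 to 4 ever appear, which is why an embedding table needs at most `num_words` rows.
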

## 8. GPU-Optimized Data Pipeline with tf.data

In [None]:
"""Create GPU-optimized data pipelines using tf.data API."""

# Split data
indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices, test_size=config.TEST_SPLIT, stratify=labels, random_state=SEED
)

X1_train = clause1_seq[train_idx]
X2_train = clause2_seq[train_idx]
y_train = labels[train_idx]

X1_test = clause1_seq[test_idx]
X2_test = clause2_seq[test_idx]
y_test = labels[test_idx]

print("=" * 80)
print("GPU-OPTIMIZED DATA PIPELINE")
print("=" * 80)
print(f"\nTraining set: {len(y_train)}")
print(f"Test set: {len(y_test)}")

def create_gpu_dataset(X1, X2, y, batch_size, shuffle=True):
    """
    Create GPU-optimized tf.data dataset with prefetching.
    
    Optimizations:
    - Cache: Keeps the examples in memory after the first epoch
    - Shuffle: Applied after the cache so each epoch sees a new order
    - Prefetch: Overlaps data preprocessing and model execution
    """
    dataset = tf.data.Dataset.from_tensor_slices(((X1, X2), y))
    
    # Cache before shuffling: a cache placed after shuffle() would replay
    # the first epoch's order on every later epoch.
    dataset = dataset.cache()
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(y), seed=SEED)
    
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)  # Prefetch for GPU
    
    return dataset

# Create validation split. train_test_split already shuffled the rows,
# so the last VALIDATION_SPLIT share of the training set is a random sample.
val_size = int(len(X1_train) * config.VALIDATION_SPLIT)
X1_val = X1_train[-val_size:]
X2_val = X2_train[-val_size:]
y_val = y_train[-val_size:]

X1_train_only = X1_train[:-val_size]
X2_train_only = X2_train[:-val_size]
y_train_only = y_train[:-val_size]

train_dataset = create_gpu_dataset(X1_train_only, X2_train_only, y_train_only,
                                  config.BATCH_SIZE, shuffle=True)
val_dataset = create_gpu_dataset(X1_val, X2_val, y_val,
                                config.BATCH_SIZE, shuffle=False)

print("\n✅ GPU-optimized data pipeline created!")
print(f"  • Batch size: {config.BATCH_SIZE}")
print("  • Prefetching: AUTOTUNE")
print("  • Caching: Enabled")
# cardinality() reads the batch count without iterating over the data
print(f"  • Training batches: {train_dataset.cardinality().numpy()}")
print(f"  • Validation batches: {val_dataset.cardinality().numpy()}")
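
The split sizes follow from two steps: `train_test_split` holds out `TEST_SPLIT` of all pairs, then the last `VALIDATION_SPLIT` share of the remaining rows becomes the validation set. The arithmetic can be sketched as follows; the total of 10,000 pairs is an assumed figure, and the test size is assumed to round up, as scikit-learn does for a fractional `test_size`.

```python
import math

def split_sizes(n_pairs, test_split=0.2, val_split=0.2, batch_size=128):
    """Return train/val/test sizes and batch counts for the two-step split."""
    n_test = math.ceil(n_pairs * test_split)     # assumed ceiling rounding
    n_train_full = n_pairs - n_test
    n_val = int(n_train_full * val_split)        # same as the notebook's int(...)
    n_train = n_train_full - n_val
    return {
        "train": n_train,
        "val": n_val,
        "test": n_test,
        "train_batches": math.ceil(n_train / batch_size),
        "val_batches": math.ceil(n_val / batch_size),
    }

print(split_sizes(10_000))
# {'train': 6400, 'val': 1600, 'test': 2000, 'train_batches': 50, 'val_batches': 13}
```

The effective split is therefore about 64% training, 16% validation, and 20% test.
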

## 9. Model 1: BiLSTM Siamese Network (GPU-Optimized)

In [None]:
def build_bilstm_siamese_model(vocab_size: int,
                               embedding_dim: int,
                               max_seq_length: int,
                               lstm_units: int,
                               dropout_rate: float = 0.3) -> Model:
    """
    Build GPU-optimized BiLSTM Siamese Network.
    
    GPU Optimizations:
    - CuDNN-optimized LSTM layers (automatic when GPU available)
    - Mixed precision compatible architecture
    - Batch normalization for faster convergence
    """
    input_1 = Input(shape=(max_seq_length,), name='clause_1')
    input_2 = Input(shape=(max_seq_length,), name='clause_2')
    
    # Shared embedding
    embedding_layer = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True,
        name='shared_embedding'
    )
    
    # Shared BiLSTM (CuDNN-optimized on GPU)
    bilstm_layer = Bidirectional(
        LSTM(lstm_units, return_sequences=False, dropout=dropout_rate),
        name='shared_bilstm'
    )
    
    embedded_1 = embedding_layer(input_1)
    embedded_2 = embedding_layer(input_2)
    
    encoded_1 = bilstm_layer(embedded_1)
    encoded_2 = bilstm_layer(embedded_2)
    
    # Similarity features
    difference = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]))([encoded_1, encoded_2])
    multiplication = Multiply()([encoded_1, encoded_2])
    merged = Concatenate()([difference, multiplication])
    
    # Dense layers with BatchNorm for GPU
    dense1 = Dense(64, activation='relu')(merged)
    dense1 = BatchNormalization()(dense1)
    dense1 = Dropout(dropout_rate)(dense1)
    
    dense2 = Dense(32, activation='relu')(dense1)
    dense2 = BatchNormalization()(dense2)
    dense2 = Dropout(dropout_rate)(dense2)
    
    # Output (float32 for mixed precision)
    output = Dense(1, activation='sigmoid', dtype='float32', name='similarity_output')(dense2)
    
    model = Model(inputs=[input_1, input_2], outputs=output, name='BiLSTM_Siamese_GPU')
    return model

print("\n" + "=" * 80)
print("MODEL 1: BiLSTM SIAMESE NETWORK (GPU-OPTIMIZED)")
print("=" * 80)

# Build and compile inside the distribution scope when several GPUs are present;
# nullcontext() is a no-op stand-in for the single-device case.
from contextlib import nullcontext

with (strategy.scope() if strategy else nullcontext()):
    model_bilstm = build_bilstm_siamese_model(
        vocab_size, config.EMBEDDING_DIM, config.MAX_SEQUENCE_LENGTH,
        config.LSTM_UNITS, config.DROPOUT_RATE
    )
    model_bilstm.compile(
        optimizer=Adam(learning_rate=config.LEARNING_RATE),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
        jit_compile=config.USE_XLA  # XLA compilation
    )

model_bilstm.summary()
print(f"\n✅ Model compiled with XLA: {config.USE_XLA}")
print(f"✅ Mixed precision: {config.USE_MIXED_PRECISION}")
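
The merge step combines the two clause encodings `u` and `v` into two element-wise features, the absolute difference `|u - v|` and the product `u * v`. Both are symmetric, so swapping the two clauses leaves the classifier input unchanged. A plain-Python sketch with toy vectors (the values are arbitrary):

```python
def similarity_features(u, v):
    """Concatenate element-wise absolute difference and product."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod

u = [1.0, -2.0, 0.5]
v = [0.5, 2.0, 0.5]
print(similarity_features(u, v))   # [0.5, 4.0, 0.0, 0.5, -4.0, 0.25]
print(similarity_features(u, v) == similarity_features(v, u))   # True
```

With `LSTM_UNITS = 128`, each bidirectional encoding has 256 values, so the merged vector passed to the first dense layer has 512.
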

In [None]:
# GPU-optimized callbacks
callbacks_bilstm = [
    EarlyStopping(
        monitor='val_loss',
        patience=config.EARLY_STOPPING_PATIENCE,
        restore_best_weights=True,
        verbose=1
    ),
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=config.REDUCE_LR_PATIENCE,
        min_lr=1e-6,
        verbose=1
    ),
    ModelCheckpoint(
        'best_bilstm_siamese_gpu.h5',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    )
]

# Train with GPU acceleration
print("\n🚀 Training BiLSTM Siamese Network on GPU...")
print("=" * 80)

start_time = time.time()

history_bilstm = model_bilstm.fit(
    train_dataset,
    epochs=config.EPOCHS,
    validation_data=val_dataset,
    callbacks=callbacks_bilstm,
    verbose=1
)

training_time_bilstm = time.time() - start_time
print(f"\n✅ Training completed in {training_time_bilstm:.2f}s ({training_time_bilstm/60:.2f} min)")
print(f"   Average time per epoch: {training_time_bilstm/len(history_bilstm.history['loss']):.2f}s")
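
`ReduceLROnPlateau` multiplies the learning rate by `factor` each time the monitored loss stalls for `patience` epochs, and never goes below `min_lr`. The sketch below reproduces only that update rule to show how many reductions the configured values allow; it does not model the patience counter or early stopping, and `lr_after_reductions` is a hypothetical helper.

```python
def lr_after_reductions(initial_lr, factor, min_lr, reductions):
    """Apply the plateau update rule `reductions` times."""
    lr = initial_lr
    for _ in range(reductions):
        lr = max(lr * factor, min_lr)   # shrink, but never below the floor
    return lr

for k in (1, 5, 9, 10):
    print(k, lr_after_reductions(1e-3, 0.5, 1e-6, k))
```

Starting from 0.001 with a factor of 0.5, the floor of 1e-6 is reached on the tenth reduction.
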

## 10. Model 2: Attention-Based Encoder (GPU-Optimized)

In [None]:
def build_attention_encoder_model(vocab_size: int,
                                 embedding_dim: int,
                                 max_seq_length: int,
                                 lstm_units: int,
                                 dropout_rate: float = 0.3) -> Model:
    """
    Build GPU-optimized Attention-based Encoder.
    
    GPU Optimizations:
    - Efficient attention computation
    - Batch normalization
    - Mixed precision compatible
    """
    input_1 = Input(shape=(max_seq_length,), name='clause_1')
    input_2 = Input(shape=(max_seq_length,), name='clause_2')
    
    embedding_layer = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True,
        name='shared_embedding'
    )
    
    bilstm_layer = Bidirectional(
        LSTM(lstm_units, return_sequences=True, dropout=dropout_rate),
        name='shared_bilstm'
    )
    
    embedded_1 = embedding_layer(input_1)
    embedded_2 = embedding_layer(input_2)
    
    lstm_output_1 = bilstm_layer(embedded_1)
    lstm_output_2 = bilstm_layer(embedded_2)
    
    attention_layer = Attention(name='shared_attention')
    
    attention_output_1 = attention_layer([lstm_output_1, lstm_output_1])
    attention_output_2 = attention_layer([lstm_output_2, lstm_output_2])
    
    pooling_1 = GlobalAveragePooling1D()(attention_output_1)
    pooling_2 = GlobalAveragePooling1D()(attention_output_2)
    
    difference = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]))([pooling_1, pooling_2])
    multiplication = Multiply()([pooling_1, pooling_2])
    merged = Concatenate()([pooling_1, pooling_2, difference, multiplication])
    
    dense1 = Dense(128, activation='relu')(merged)
    dense1 = BatchNormalization()(dense1)
    dense1 = Dropout(dropout_rate)(dense1)
    
    dense2 = Dense(64, activation='relu')(dense1)
    dense2 = BatchNormalization()(dense2)
    dense2 = Dropout(dropout_rate)(dense2)
    
    dense3 = Dense(32, activation='relu')(dense2)
    dense3 = Dropout(dropout_rate)(dense3)
    
    output = Dense(1, activation='sigmoid', dtype='float32', name='similarity_output')(dense3)
    
    model = Model(inputs=[input_1, input_2], outputs=output, name='Attention_Encoder_GPU')
    return model

print("\n" + "=" * 80)
print("MODEL 2: ATTENTION-BASED ENCODER (GPU-OPTIMIZED)")
print("=" * 80)

# Build and compile inside the distribution scope when several GPUs are present;
# nullcontext() is a no-op stand-in for the single-device case.
from contextlib import nullcontext

with (strategy.scope() if strategy else nullcontext()):
    model_attention = build_attention_encoder_model(
        vocab_size, config.EMBEDDING_DIM, config.MAX_SEQUENCE_LENGTH,
        config.LSTM_UNITS, config.DROPOUT_RATE
    )
    model_attention.compile(
        optimizer=Adam(learning_rate=config.LEARNING_RATE),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
        jit_compile=config.USE_XLA
    )

model_attention.summary()
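
The Keras `Attention` layer used here performs dot-product attention: each query position scores every key position, the scores pass through a softmax, and the result weights the values. Passing the BiLSTM output as both query and value makes it self-attention. The plain-Python sketch below shows the computation on a toy two-step sequence; it ignores masking and batching, and the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Unscaled dot-product attention with query = key = value = seq."""
    outputs = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) for key in seq]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, seq))
            for d in range(len(seq[0]))
        ])
    return outputs

def mean_pool(seq):
    """Average over time steps, like GlobalAveragePooling1D without a mask."""
    return [sum(step[d] for step in seq) / len(seq) for d in range(len(seq[0]))]

sequence = [[1.0, 0.0], [0.0, 1.0]]   # two time steps, two features
attended = self_attention(sequence)
print(attended)
print(mean_pool(attended))
```

Each output step is a blend of all input steps, weighted toward the steps most similar to it.
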

In [None]:
callbacks_attention = [
    EarlyStopping(
        monitor='val_loss',
        patience=config.EARLY_STOPPING_PATIENCE,
        restore_best_weights=True,
        verbose=1
    ),
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=config.REDUCE_LR_PATIENCE,
        min_lr=1e-6,
        verbose=1
    ),
    ModelCheckpoint(
        'best_attention_encoder_gpu.h5',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    )
]

print("\n🚀 Training Attention-Based Encoder on GPU...")
print("=" * 80)

start_time = time.time()

history_attention = model_attention.fit(
    train_dataset,
    epochs=config.EPOCHS,
    validation_data=val_dataset,
    callbacks=callbacks_attention,
    verbose=1
)

training_time_attention = time.time() - start_time
print(f"\n✅ Training completed in {training_time_attention:.2f}s ({training_time_attention/60:.2f} min)")
print(f"   Average time per epoch: {training_time_attention/len(history_attention.history['loss']):.2f}s")

## 11. Evaluation

In [None]:
def evaluate_model(model: Model, X1_test, X2_test, y_test, model_name: str) -> Dict:
    """Comprehensive model evaluation."""
    print(f"\n{'=' * 80}")
    print(f"EVALUATING {model_name.upper()}")
    print(f"{'=' * 80}")
    
    # Flatten the (n, 1) sigmoid output so every metric receives a 1-D array
    y_pred_prob = model.predict([X1_test, X2_test], verbose=0).ravel()
    y_pred = (y_pred_prob > 0.5).astype(int)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    pr_auc = average_precision_score(y_test, y_pred_prob)
    
    print("\n📊 Performance Metrics:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-Score:  {f1:.4f}")
    print(f"  ROC-AUC:   {roc_auc:.4f}")
    print(f"  PR-AUC:    {pr_auc:.4f}")
    
    cm = confusion_matrix(y_test, y_pred)
    print("\n📊 Confusion Matrix:")
    print(f"  TN: {cm[0][0]:4d}  |  FP: {cm[0][1]:4d}")
    print(f"  FN: {cm[1][0]:4d}  |  TP: {cm[1][1]:4d}")
    
    print("\n📊 Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['Different', 'Similar']))
    
    return {
        'model_name': model_name,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
        'confusion_matrix': cm,
        'y_true': y_test,
        'y_pred': y_pred,
        'y_pred_prob': y_pred_prob.flatten()
    }

results_bilstm = evaluate_model(
    model_bilstm, X1_test, X2_test, y_test, "BiLSTM Siamese (GPU)"
)

results_attention = evaluate_model(
    model_attention, X1_test, X2_test, y_test, "Attention Encoder (GPU)"
)
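
The reported accuracy, precision, recall, and F1 all derive from the four confusion-matrix counts. As a worked example with made-up counts (not results from this notebook), using a hypothetical helper `metrics_from_counts`:

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
print({k: round(v, 4) for k, v in m.items()})
# {'accuracy': 0.7, 'precision': 0.8, 'recall': 0.6667, 'f1': 0.7273}
```
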

## 12. Training History Visualization

In [None]:
def plot_training_history(history, model_name):
    """Plot training curves."""
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    axes[0].plot(history.history['accuracy'], label='Train', linewidth=2)
    axes[0].plot(history.history['val_accuracy'], label='Val', linewidth=2)
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Accuracy', fontsize=12)
    axes[0].set_title(f'{model_name} - Accuracy', fontsize=14, fontweight='bold')
    axes[0].legend(fontsize=11)
    axes[0].grid(True, alpha=0.3)
    
    axes[1].plot(history.history['loss'], label='Train', linewidth=2)
    axes[1].plot(history.history['val_loss'], label='Val', linewidth=2)
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Loss', fontsize=12)
    axes[1].set_title(f'{model_name} - Loss', fontsize=14, fontweight='bold')
    axes[1].legend(fontsize=11)
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_training_history(history_bilstm, "BiLSTM Siamese (GPU)")
plot_training_history(history_attention, "Attention Encoder (GPU)")

## 13. Evaluation Visualizations

In [None]:
def plot_confusion_matrices(results_list):
    """Plot confusion matrices."""
    fig, axes = plt.subplots(1, len(results_list), figsize=(7*len(results_list), 5))
    if len(results_list) == 1:
        axes = [axes]
    
    for idx, results in enumerate(results_list):
        cm = results['confusion_matrix']
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                   xticklabels=['Different', 'Similar'],
                   yticklabels=['Different', 'Similar'],
                   ax=axes[idx], cbar=True, square=True)
        axes[idx].set_xlabel('Predicted', fontsize=12)
        axes[idx].set_ylabel('Actual', fontsize=12)
        axes[idx].set_title(f"{results['model_name']}\nConfusion Matrix",
                          fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

plot_confusion_matrices([results_bilstm, results_attention])

In [None]:
def plot_roc_curves(results_list):
    """Plot ROC curves."""
    plt.figure(figsize=(10, 8))
    colors = ['blue', 'red']
    
    for idx, results in enumerate(results_list):
        fpr, tpr, _ = roc_curve(results['y_true'], results['y_pred_prob'])
        auc_score = results['roc_auc']
        plt.plot(fpr, tpr, label=f"{results['model_name']} (AUC = {auc_score:.4f})",
                linewidth=2, color=colors[idx])
    
    plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.title('ROC Curves - GPU-Accelerated Models', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_roc_curves([results_bilstm, results_attention])
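
ROC-AUC can be read as the probability that a randomly chosen similar pair receives a higher score than a randomly chosen different pair. The brute-force sketch below computes it that way on four made-up scores; it is quadratic in the number of samples and meant only for intuition, not as a replacement for `roc_auc_score`. The name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```
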

## 14. Comparative Analysis with GPU Speedup

In [None]:
# Performance comparison
comparison_df = pd.DataFrame([
    {
        'Model': 'BiLSTM Siamese (GPU)',
        'Accuracy': results_bilstm['accuracy'],
        'Precision': results_bilstm['precision'],
        'Recall': results_bilstm['recall'],
        'F1-Score': results_bilstm['f1_score'],
        'ROC-AUC': results_bilstm['roc_auc'],
        'Training Time (min)': training_time_bilstm / 60
    },
    {
        'Model': 'Attention Encoder (GPU)',
        'Accuracy': results_attention['accuracy'],
        'Precision': results_attention['precision'],
        'Recall': results_attention['recall'],
        'F1-Score': results_attention['f1_score'],
        'ROC-AUC': results_attention['roc_auc'],
        'Training Time (min)': training_time_attention / 60
    }
])

print("\n" + "=" * 80)
print("GPU-ACCELERATED MODEL COMPARISON")
print("=" * 80)
print("\n", comparison_df.to_string(index=False))

# idxmax() returns an index label, so look the row up with .loc
best_idx = comparison_df['F1-Score'].idxmax()
print(f"\n🏆 Best Model (by F1-Score): {comparison_df.loc[best_idx, 'Model']}")

print("\n⚡ GPU Performance Benefits:")
print(f"  • Larger batch size: {config.BATCH_SIZE} (vs 64 on CPU)")
print(f"  • Mixed precision training: {config.USE_MIXED_PRECISION}")
print(f"  • XLA compilation: {config.USE_XLA}")
print("  • Data prefetching: Enabled")
print("  • CuDNN-optimized LSTM: Automatic on GPU")
print("\n  Expected speedup: roughly 2-5x over CPU training (hardware- and model-dependent)")
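
The 2-5x figure above is an expectation, not a measurement. One way to check it is to time the same workload on both devices and take the ratio. The sketch below is illustrative only: `time_call` and `speedup` are hypothetical helpers that do not appear elsewhere in this notebook, and the two stand-in workloads would in practice be replaced by one training epoch run under `tf.device('/CPU:0')` and `tf.device('/GPU:0')`.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time (values > 1 mean faster)."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Stand-in workloads (assumption): replace with one training epoch per device.
slow = lambda: sum(i * i for i in range(200_000))
fast = lambda: sum(i * i for i in range(20_000))

print(f"Measured speedup: {speedup(time_call(slow), time_call(fast)):.1f}x")
```

Taking the best of several runs reduces the influence of one-off costs such as XLA compilation or cuDNN autotuning on the first call.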

## 15. Save Results

In [None]:
import pickle

# Save comparison
comparison_df.to_csv('gpu_model_comparison_results.csv', index=False)
print("✓ Saved: gpu_model_comparison_results.csv")

# Save tokenizer
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
print("✓ Saved: tokenizer.pkl")

# Save models
model_bilstm.save('final_bilstm_siamese_gpu.h5')
print("✓ Saved: final_bilstm_siamese_gpu.h5")

model_attention.save('final_attention_encoder_gpu.h5')
print("✓ Saved: final_attention_encoder_gpu.h5")

print("\n✅ All GPU-optimized models saved!")
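
To confirm the saved artifacts are usable, they can be read back in a fresh session. This is a minimal sketch that assumes the file names written above. `load_tokenizer` and `load_saved_model` are hypothetical helper names, and if either model uses custom layers, `load_model` will also need a `custom_objects` mapping that is not shown here.

```python
import pickle


def load_tokenizer(path="tokenizer.pkl"):
    """Read back the object that was pickled to `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_saved_model(path):
    """Load a saved Keras model for inference only (no optimizer state)."""
    # Imported lazily so the tokenizer helper works without TensorFlow.
    import tensorflow as tf
    return tf.keras.models.load_model(path, compile=False)


# Example usage (assumes the files saved above exist in the working directory):
# tokenizer = load_tokenizer()
# model_bilstm = load_saved_model("final_bilstm_siamese_gpu.h5")
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.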

## 16. Technical Summary - GPU Optimization

In [None]:
print("\n" + "=" * 80)
print("GPU-OPTIMIZED IMPLEMENTATION SUMMARY")
print("=" * 80)

print("\n🚀 GPU Optimizations Applied:")
print("  1. Mixed Precision Training (FP16)")
print("     - Compute: float16 for up to 2-3x speedup on Tensor Core GPUs")
print("     - Variables: float32 for numerical stability")
print("     - Output: float32 for accurate predictions")
print("\n  2. Memory Growth Management")
print("     - Prevents TensorFlow from allocating all GPU memory")
print("     - Allows multiple processes to share GPU")
print("\n  3. XLA (Accelerated Linear Algebra)")
print("     - Just-In-Time compilation of operations")
print("     - Optimized GPU kernel fusion")
print("\n  4. Data Pipeline Optimization")
print("     - tf.data with AUTOTUNE prefetching")
print("     - Dataset caching in memory")
print("     - Parallel data loading")
print("\n  5. CuDNN-Optimized LSTM")
print("     - Used automatically on GPU when the LSTM layer keeps its default")
print("       activations and has no recurrent dropout")
print("     - Significantly faster than CPU LSTM")
print("\n  6. Batch Size Optimization")
print(f"     - Increased to {config.BATCH_SIZE} for GPU")
print("     - Better GPU utilization")
print("\n  7. Multi-GPU Support")
print("     - MirroredStrategy for data parallelism")
print("     - Automatic when multiple GPUs detected")

print("\n📊 Performance Metrics:")
print(f"  BiLSTM Siamese:")
print(f"    F1-Score: {results_bilstm['f1_score']:.4f}")
print(f"    Training Time: {training_time_bilstm/60:.2f} min")
print(f"  Attention Encoder:")
print(f"    F1-Score: {results_attention['f1_score']:.4f}")
print(f"    Training Time: {training_time_attention/60:.2f} min")

print("\n" + "=" * 80)
print("✅ GPU-OPTIMIZED IMPLEMENTATION COMPLETE")
print("=" * 80)
print("\n💡 Key Takeaways:")
print("  • Mixed precision can provide a 2-3x speedup with minimal accuracy loss")
print("  • Larger batch sizes improve GPU utilization")
print("  • Data prefetching overlaps I/O with computation")
print("  • XLA compilation optimizes operations for target hardware")
print("  • CuDNN-optimized layers provide significant speedups")
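
The summary states that mixed precision computes in float16 but keeps variables in float32 for numerical stability. The sketch below illustrates why, using only the standard library's half-precision `struct` format rather than TensorFlow. `to_fp16` is a hypothetical helper, and the scale factor of 1024 is an illustrative value, not the one Keras uses. With the `mixed_float16` policy, `model.fit` applies loss scaling automatically.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# A small weight update vanishes if the weight itself is stored in float16:
# near 1.0 the spacing between float16 values is 2**-10 (about 0.00098).
weight, update = 1.0, 1e-4
print(to_fp16(weight + update) == weight)  # the update is rounded away

# A tiny gradient underflows to zero in float16 ...
grad = 1e-8
print(to_fp16(grad))

# ... but survives if the loss (and hence the gradient) is scaled up first,
# then divided back down in float32.
scale = 1024.0
print(to_fp16(grad * scale) / scale)
```

The first case is why weights stay in float32, and the second is why gradients are computed on a scaled loss.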

---

## 📋 GPU Setup Instructions

### For NVIDIA GPUs:

1. **Install NVIDIA GPU Driver**
   ```bash
   # Check current driver
   nvidia-smi
   ```

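2. **Install CUDA Toolkit** (not needed with `tensorflow[and-cuda]`, which installs the CUDA libraries through pip)
   - Download from: https://developer.nvidia.com/cuda-toolkit
   - Match the version to your TensorFlow release; TensorFlow 2.12, for example, is tested against CUDA 11.8

3. **Install cuDNN** (also installed by `tensorflow[and-cuda]`)
   - Download from: https://developer.nvidia.com/cudnn
   - TensorFlow 2.12 is tested against cuDNN 8.6

4. **Install TensorFlow with GPU Support**
   ```bash
   # Recent TensorFlow releases on Linux: pulls in matching CUDA/cuDNN packages
   pip install "tensorflow[and-cuda]"
   # OR pin a specific version (uses the system CUDA/cuDNN from steps 2-3).
   # The separate tensorflow-gpu package is discontinued; install tensorflow instead.
   pip install tensorflow==2.12.0
   ```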

5. **Verify GPU Setup**
   ```python
   import tensorflow as tf
   print("Num GPUs:", len(tf.config.list_physical_devices('GPU')))
   ```

### For Google Colab:

1. Runtime → Change runtime type → GPU (T4, V100, or A100)
2. Run this notebook - GPU setup is automatic!

### Performance Tips:

- **Batch Size:** Increase until you hit OOM errors, then reduce slightly
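- **Mixed Precision:** Enable on GPUs with Tensor Cores (Volta and newer, compute capability 7.0+); older GPUs see little or no benefit
- **XLA:** Benchmark with and without it; compilation overhead can outweigh the gains for small models or short runs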
- **Data Prefetching:** Always use `tf.data.AUTOTUNE`
- **Monitor GPU:** Use `nvidia-smi` or `nvtop` to check utilization
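
The batch-size tip above can be automated with a simple doubling search. This is a sketch under stated assumptions: `find_max_batch_size` and its `fits` callback are hypothetical names that do not appear elsewhere in this notebook, and the stand-in probe only simulates a memory limit. A real probe would run one training step and return `False` on `tf.errors.ResourceExhaustedError`.

```python
def find_max_batch_size(fits, start=32, limit=4096):
    """Double the batch size while `fits(batch_size)` succeeds.

    `fits` is a caller-supplied callable (hypothetical) that returns True if
    one training step runs at that batch size and False on an OOM error.
    Returns the largest size that fit, or None if even `start` failed.
    """
    best = None
    size = start
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best


# Stand-in probe (assumption): pretend anything above 600 samples runs out
# of memory. A real probe would run a training step and catch
# tf.errors.ResourceExhaustedError.
print(find_max_batch_size(lambda n: n <= 600))
```

Doubling keeps batch sizes at powers of two, and returning the last size that fit leaves some headroom below the actual memory limit.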

---