# Spam Detection Pipeline

This notebook runs the full spam detection pipeline used for the assignment:
1. Combine and clean multiple message sources (SMS, email, mixed, YouTube, reviews).
2. Preprocess and balance the data for downstream models.
3. Engineer features for the K-Means, XGBoost and BiLSTM models.
4. Run experiments and train models: K-Means parameter sweeps, XGBoost hyperparameter search, and BiLSTM experiments.
5. Evaluate the models and visualize their performance and feature importance.

The cells below note where each major step is implemented. See the code cells for exact function names and parameters.


## 1. Setup and imports

This cell loads the required Python libraries and project modules. GPU memory is cleared where appropriate to avoid allocation conflicts when several models are trained in sequence. On a CPU-only machine, the notebook falls back to the CPU.

Key modules:
- Data combination and base preprocessing functions, which create the train and test data in the `final_dataset` folder.
- Model training wrappers for K-Means, XGBoost and LSTM that are called later in the notebook.
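
The GPU clean-up and CPU fallback mentioned above could look roughly like the sketch below. `pick_device` is a hypothetical helper, not a function from the project modules. It relies only on `torch.cuda.is_available()` and `torch.cuda.empty_cache()`, and it assumes that a missing PyTorch install should also mean CPU.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu" (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch only the CPU code paths can run.
        return "cpu"
    if torch.cuda.is_available():
        # Release cached blocks left over from a previously trained model.
        torch.cuda.empty_cache()
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Calling a helper like this before each training stage keeps one model's cached allocations from crowding out the next.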

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import time
from pathlib import Path
import torch

# Custom modules
from src.base_preprocessing.combine import combine_datasets
from src.base_preprocessing.clean import run_base_preprocessing_pipeline

from src.kmeans.feature_engineer import engineer_features
from src.kmeans.train import train_kmeans
from src.kmeans.model import KMeansTextInferencer

from src.xgb.feature_engineer import engineer_ml_features
from src.xgb.train import train_xgboost_model
from src.xgb.model import XGBTextClassifier

from src.lstm.preprocessing import run_lstm_preprocessing
from src.lstm.train import run_lstm_experiments
from src.lstm.model import LSTMTextClassifier


## 2. Combine datasets and base preprocessing

This step merges the SMS, email, mixed, YouTube and review sources into a unified CSV, then runs the base preprocessing pipeline that:
- normalizes labels to binary spam/ham
- performs token replacements for URLs, emails, phones
- balances and splits the dataset into train and test sets

The main invocations are `combine_datasets(...)` and `run_base_preprocessing_pipeline(...)`. Check that the output files are in the `final_dataset` folder before continuing.
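
To make the token replacement step concrete, here is a minimal sketch using regular expressions. The patterns, the `<PHONE>` token name and the `replace_tokens` helper are assumptions for illustration; the project's own rules in `src.base_preprocessing.clean` may differ.

```python
import re

# Assumed patterns; the project's own regexes may be stricter or looser.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def replace_tokens(text: str) -> str:
    """Replace URLs, e-mail addresses and phone numbers with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


print(replace_tokens("Call +44 7700 900123 or visit http://win.example.com now"))
# → Call <PHONE> or visit <URL> now
```

Replacing these spans with fixed tokens lets the models learn that "a link is present" without memorizing individual URLs or numbers.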

In [None]:
file_paths = {
    "sms": "datasets/sms/sms.csv",
    "email": "datasets/email/email.csv",
    "mix": "datasets/mix_email_sms/mix.csv",
    "youtube": "datasets/comment/youtube_comments.csv",
    "review": "datasets/review/review.jsonl",
}

combine_datasets(file_paths, output_path="datasets/combined.csv")

Starting data combination process...
Loaded 5572 SMS records
Loaded 5728 email records
Loaded 8175 mixed email/SMS records
Loaded 1956 YouTube comment records

Combining datasets...

Dataset Summary:
Total records: 21431

Records by source:
  mix_email_sms: 8175
  email: 5728
  sms: 5572
  youtube: 1956

Label distribution (0=ham, 1=spam):
  0: 14186
  1: 7245

Spam rate by source:
  sms: 13.4% (747/5572)
  email: 23.9% (1368/5728)
  mix_email_sms: 50.5% (4125/8175)
  youtube: 51.4% (1005/1956)

Overall spam rate: 33.8%

Combined dataset saved to: datasets/combined.csv
Final dataset shape: (21431, 3)

Sample of combined data:
                                                                                                                                                          text  label source
0                                              Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...      0    sms
1                   

This stage cleans and balances the combined dataset to ensure a representative spam ratio across sources.
The `run_base_preprocessing_pipeline` function:
- Applies text normalization and token replacements (`<URL>`, `<EMAIL>`, etc.)
- Balances datasets by adjusting spam ratios per source
- Splits the data into training and testing sets (default 85/15)
- Saves preprocessed files to `final_dataset`
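
The per-source balancing can be illustrated with a little arithmetic: to reach a target spam rate `r` by undersampling ham, keep about `n_spam * (1 - r) / r` ham rows. The sketch below assumes floor rounding and random sampling; `ham_to_keep` and `undersample_ham` are hypothetical helpers, not the pipeline's functions. With 635 spam rows and a 0.3 target this gives 1,481 ham rows, which matches the `sms_0` and `sms_1` group sizes in the XGBoost log later in this notebook.

```python
import random


def ham_to_keep(n_spam: int, target_spam_rate: float) -> int:
    """Ham rows to keep so that spam / (spam + ham) is close to the target."""
    return int(n_spam * (1 - target_spam_rate) / target_spam_rate)


def undersample_ham(rows, target_spam_rate, seed=0):
    """rows: list of (text, label) pairs with label 1 = spam, 0 = ham."""
    spam = [r for r in rows if r[1] == 1]
    ham = [r for r in rows if r[1] == 0]
    # Never ask for more ham than is available.
    keep = min(len(ham), ham_to_keep(len(spam), target_spam_rate))
    rng = random.Random(seed)
    return spam + rng.sample(ham, keep)


# Illustrative counts: 635 spam rows and 4,000 ham rows from one source.
rows = [("spam text", 1)] * 635 + [("ham text", 0)] * 4000
balanced = undersample_ham(rows, target_spam_rate=0.3)
spam_rate = sum(label for _, label in balanced) / len(balanced)
print(len(balanced), round(spam_rate, 3))
```

Sources whose target is `None` would simply skip this step and keep their original ratio.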


In [2]:
target_spam_rates = {
    "sms": 0.3,
    "email": 0.35,
    "mix_email_sms": None,  # Keep original
    "youtube": None,  # Keep original
}

success = run_base_preprocessing_pipeline(
    target_spam_rates=target_spam_rates,
    input_path="datasets/combined.csv",
    test_size=0.15,
    soft_cap_ratio=0.35,
    output_dir="final_dataset"
)

if success:
    print("\nPreprocessing pipeline completed successfully!")
else:
    print("\nPreprocessing failed. Check error messages above.")

STARTING BASE PREPROCESSING PIPELINE
Strategy:
  1. Stratified split by source and label (80/20)
  2. Rebalance training data only:
     - sms/email: undersample ham to achieve 35-40% spam
     - mix_email_sms/youtube: keep original spam/ham ratio
  3. Apply soft cap: each source ≤40-45% of total training samples
LOADING AND ANALYZING DATA
Loaded 21,431 records

Dataset shape: (21431, 3)
Columns: ['text', 'label', 'source']

Source Distribution:
  mix_email_sms: 8,175 (38.15%)
  email: 5,728 (26.73%)
  sms: 5,572 (26.0%)
  youtube: 1,956 (9.13%)

Spam Rate by Source:
  sms: 13.4% (747/5,572)
  email: 23.9% (1,368/5,728)
  mix_email_sms: 50.5% (4,125/8,175)
  youtube: 51.4% (1,005/1,956)

Overall spam rate: 33.8%

TEXT CLEANING
Cleaning text data...
Removed 18 invalid records
Text cleaning completed. Final dataset: 21,413 records

CREATING STRATIFIED TRAIN/TEST SPLITS
Original distribution by source and label:
  email_ham: 4,360 records
  email_spam: 1,368 records
  mix_email_sms_ham: 4

## 3. Feature Engineering
Feature engineering transforms the cleaned text into numerical vectors suitable for machine learning models.

The pipeline creates three main feature types:
1. **TF-IDF Features:** Capture keyword frequency and context.
2. **Statistical Features:** Text length, word count, average word size.
3. **Special Token Features:** Count of special markers like `<EMAIL>`, `<URL>`, and punctuation ratios.
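
As a rough illustration of the statistical and special token features, the sketch below computes a handful of them for a single message. The feature names, the token set and the `text_features` helper are assumptions; the real feature list produced by `engineer_ml_features` is larger and may be defined differently.

```python
import string

SPECIAL_TOKENS = ("<URL>", "<EMAIL>", "<PHONE>")  # assumed token set


def text_features(text: str) -> dict:
    """Compute simple length, punctuation and special-token features."""
    words = text.split()
    n_chars = len(text)
    # Note: the angle brackets of special tokens also count as punctuation here.
    n_punct = sum(ch in string.punctuation for ch in text)
    feats = {
        "char_count": n_chars,
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punct_ratio": n_punct / n_chars if n_chars else 0.0,
    }
    for tok in SPECIAL_TOKENS:
        feats[f"count_{tok.strip('<>').lower()}"] = text.count(tok)
    return feats


print(text_features("win a free prize!!! visit <URL>"))
```

Features like these are appended to the TF-IDF columns, which is why the combined matrix has a few more columns than `max_tfidf_features`.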

In [None]:
output_dir, metadata = engineer_ml_features(
    data_path="final_dataset",
    output_dir="preprocessed/xgboost",
    max_tfidf_features=10000, 
    ngram_range=(1, 2),  # Unigrams and bigrams
)

Starting ML feature engineering pipeline
Loading preprocessed data...
Training data: 12003 records
Test data: 3212 records
Sources found: ['email', 'mix_email_sms', 'sms', 'youtube']
Extracting text statistics...
Extracting text statistics...
Extracting special character features...
Extracting special character features...
Extracting source features...
Extracting source features...
Creating TF-IDF features (max_features=10000, ngram_range=(1, 2))...
Fitting TF-IDF on training data...
Transforming test data...
TF-IDF matrix shape - Train: (12003, 10000), Test: (3212, 10000)
TF-IDF sparsity - Train: 0.0083
TF-IDF sparsity - Test: 0.0080
Combining all features...
Note: Numerical features NOT scaled (appropriate for tree-based models)
Combined feature matrix shape: (12003, 10023)
Combined sparsity: 0.0092
Combining all features...
Note: Numerical features NOT scaled (appropriate for tree-based models)
Combined feature matrix shape: (3212, 10023)
Combined sparsity: 0.0089
Saving features an

## 4. XGBoost Classification
XGBoost is used as a supervised model optimized for high accuracy and explainability.

The process includes:
1. **Randomized Hyperparameter Search:** Finds optimal parameters using cross-validation with AUCPR as the metric.
2. **Retraining with Early Stopping:** Retrains the best configuration with a larger budget of boosting rounds and uses early stopping to prevent overfitting.
3. **Threshold Optimization:** Chooses the F1-optimal decision threshold instead of the default 0.5 for better balance between precision and recall.
4. **Evaluation:** Assesses model performance across sources and generates feature importance plots.
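
The threshold optimization step can be sketched as a sweep over candidate thresholds that keeps the one with the highest F1. This is a minimal pure-Python version, assuming the threshold is tuned on held-out validation predictions; the helper names and the toy data are hypothetical, and the project's implementation may use a library routine instead.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 score when predicting spam for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs):
    """Try every distinct predicted probability as a threshold."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))


# Toy validation labels and predicted spam probabilities.
y_val = [0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.05, 0.10, 0.30, 0.35, 0.45, 0.60, 0.80, 0.95]

t = best_f1_threshold(y_val, p_val)
print(t, round(f1_at_threshold(y_val, p_val, t), 3), round(f1_at_threshold(y_val, p_val, 0.5), 3))
# → 0.35 0.889 0.857
```

In this toy case, lowering the threshold from 0.5 to 0.35 recovers one more spam message at the cost of one false positive, which raises F1.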


In [None]:
model, results = train_xgboost_model(
    features_dir="xgboost",
    output_dir="models/xgboost",
    cv_folds=5,  # 5-fold cross-validation
    n_iter=500,  # Number of random parameter combinations
    random_state=36,  # For reproducibility
    verbose=1,  # Show training progress
)

print("\nXGBoost training pipeline completed!")
print("Key results:")
print(f"- Best parameters: {results['best_parameters']}")
# The hyperparameter search is scored with AUCPR, so label the CV score accordingly
print(f"- Cross-validation AUCPR: {results['best_cv_score']:.4f}")
print(f"- Test set F1: {results['test_metrics']['f1']:.4f}")
print(f"- Test set Accuracy: {results['test_metrics']['accuracy']:.4f}")
print(f"- GPU acceleration: {results['optimization']['gpu_used']}")
print(f"- Parameter combinations tested: {results['n_iter']}")

Redirecting to new AUCPR-based training pipeline...
Starting XGBoost training pipeline with AUCPR optimization and early stopping
✓ GPU is available and working with XGBoost
Loading engineered features...
Training features: (12003, 10023)
Test features: (3212, 10023)
Total features: 10,023
Training samples: 12,003
Test samples: 3,212

Dataset class distribution:
Training: 6,872 negative, 5,131 positive (42.7% spam)
Test: 2,125 negative, 1,087 positive (33.8% spam)
Class ratio (neg/pos): 1.339
Setting up 5-fold cross-validation with source+label stratification...
Stratification groups found:
  email_0: 2159 samples
  email_1: 1163 samples
  mix_email_sms_0: 2433 samples
  mix_email_sms_1: 2479 samples
  sms_0: 1481 samples
  sms_1: 635 samples
  youtube_0: 799 samples
  youtube_1: 854 samples

Fold balance verification:
Fold 1:
  Spam rates - Train: 0.428, Val: 0.427
  Val source dist: mix_email_sms: 0.41, email: 0.28, sms: 0.18, youtube: 0.14
Fold 2:
  Spam rates - Train: 0.428, Val:

## 5. K-Means Clustering
K-Means is applied as an unsupervised baseline to explore the natural grouping of messages.

The script:
- Loads engineered features (`X_train.npy`, `X_test.npy`)
- Trains a K-Means model with a predefined number of clusters
- Evaluates cluster quality using metrics like Silhouette Score, Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI)
- Maps clusters to spam/ham classes for comparison with supervised models
- Generates detailed cluster composition and visualization plots
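
One common way to map clusters to spam/ham classes is a majority vote over the training members of each cluster. The sketch below assumes that rule; `map_clusters_to_labels` is a hypothetical helper, and the mapping used inside `train_kmeans` is not shown in this notebook.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, labels):
    """Assign each cluster the majority label of its training members."""
    votes = defaultdict(Counter)
    for c, y in zip(cluster_ids, labels):
        votes[c][y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


# Toy training assignments: cluster id and true label (1 = spam) per message.
train_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
train_labels = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]

mapping = map_clusters_to_labels(train_clusters, train_labels)
test_clusters = [1, 2, 0]
test_preds = [mapping[c] for c in test_clusters]
print(mapping, test_preds)
# → {0: 0, 1: 1, 2: 0} [1, 0, 0]
```

With a rule like this, any cluster whose spam rate is below 50% is labeled ham. That would be consistent with the k=2 run in this notebook, where both test clusters are mostly ham and precision and recall are 0, but the project's actual mapping rule may differ.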


### 5.1. Clustering Configuration
Modify these parameters to experiment with different clustering approaches:

In [None]:
# CLUSTERING CONFIGURATION
# Streamlined configuration for K-Means analysis

CLUSTERING_CONFIG = {
    # Feature Engineering Parameters
    "max_tfidf_features": 500,       # Default TF-IDF features
    "source_weight": 0.0,            # Default source weight
    
    # Clustering Parameters - K values to test (2-8)
    "k_values": list(range(2, 9)),   # Test k from 2 to 8
    
    # Paths
    "data_path": "final_dataset",
    "features_base_dir": "preprocessed/kmeans",  # Features independent of k
    "models_base_dir": "models/kmeans"           # Models depend on k
}

print("Configuration set:")
for key, value in CLUSTERING_CONFIG.items():
    if key == "k_values":
        print(f"  {key}: {value} (will test all)")
    else:
        print(f"  {key}: {value}")

Configuration set:
  max_tfidf_features: 500
  source_weight: 0.0
  k_values: [2, 3, 4, 5, 6, 7, 8] (will test all)
  data_path: datasets/preprocessed
  features_base_dir: preprocessed/kmeans
  models_base_dir: models/kmeans


### 5.2. Comprehensive K-Values Analysis

Test all k values from 2 to 8 with multiple configurations:

In [None]:
# Streamlined K-Means Analysis (K=2-8)
print("STREAMLINED K-MEANS ANALYSIS")
print("=" * 60)
print("Testing k values from 2 to 8 with selected configurations...")

def test_clustering_config(config_name, max_tfidf=50, source_weight=0.0, k_clusters=3):
    """Test a specific clustering configuration"""
    start_time = time.time()  # `time` is already imported in the setup cell
    
    print(f"\n{'='*60}")
    print(f"TESTING: {config_name}")
    print(f"TF-IDF Features: {max_tfidf}, Source Weight: {source_weight}, K Clusters: {k_clusters}")
    print(f"{'='*60}")
    
    try:
        # Feature Engineering (independent of k - only run once per config)
        features_dir = f"{CLUSTERING_CONFIG['features_base_dir']}/{config_name}"
        features_path = Path(features_dir)
        
        if not features_path.exists():
            print(f"Creating features for {config_name}...")
            features_path, metadata = engineer_features(
                data_path=CLUSTERING_CONFIG["data_path"],
                output_dir=features_dir,
                max_tfidf_features=max_tfidf,
                source_weight=source_weight
            )
        else:
            print(f"Using existing features: {config_name}")
            # Load existing metadata
            with open(features_path / "feature_metadata.json", "r") as f:
                metadata = json.load(f)
        
        # Clustering (depends on k)
        model_dir = f"{CLUSTERING_CONFIG['models_base_dir']}/k{k_clusters}/{config_name}"
        model, results = train_kmeans(
            features_dir=features_dir,
            output_dir=model_dir,
            k_clusters=k_clusters
        )
        
        # Extract key metrics
        perf = results['clustering_performance']
        clusters = results['cluster_analysis']['cluster_composition']
        
        print("\nRESULTS SUMMARY:")
        print(f"  Silhouette Score: {perf['silhouette_score']['test']:.3f}")
        print(f"  Adjusted Rand Index: {perf['adjusted_rand_index']['test']:.3f}")
        print(f"  Normalized Mutual Info: {perf['normalized_mutual_info']['test']:.3f}")
        print(f"  Total Features: {metadata['n_features']}")
        
        print("\nCluster Composition:")
        for i, cluster in enumerate(clusters):
            spam_rate = cluster['spam_rate'] * 100
            size = cluster['size']
            dominant = cluster['dominant_source']
            print(f"  Cluster {i}: {size:4d} samples, {spam_rate:5.1f}% spam, mainly {dominant}")
        
        # Calculate and display runtime
        end_time = time.time()
        runtime = end_time - start_time
        print(f"\nRuntime: {runtime:.2f} seconds")
        
        return {
            'config': config_name,
            'k': k_clusters,
            'silhouette': perf['silhouette_score']['test'],
            'ari': perf['adjusted_rand_index']['test'],
            'nmi': perf['normalized_mutual_info']['test'],
            'features': metadata['n_features'],
            'clusters': clusters,
            'runtime': runtime
        }
        
    except Exception as e:
        end_time = time.time()
        runtime = end_time - start_time
        print(f"ERROR: {e}")
        print(f"Runtime before error: {runtime:.2f} seconds")
        return None

# Define streamlined test configurations
test_configs = [
    # Current configuration
    {"name": "current", "max_tfidf": 500, "source_weight": 0.0},
    
    # TF-IDF experiments (with source_weight = 0.0)
    {"name": "tfidf_50", "max_tfidf": 50, "source_weight": 0.0},
    {"name": "tfidf_100", "max_tfidf": 100, "source_weight": 0.0},
    {"name": "tfidf_1000", "max_tfidf": 1000, "source_weight": 0.0},
    
    # Source weight experiments (with max_tfidf = 500)
    {"name": "source_0.1", "max_tfidf": 500, "source_weight": 0.1},
    {"name": "source_0.3", "max_tfidf": 500, "source_weight": 0.3},
    {"name": "source_0.5", "max_tfidf": 500, "source_weight": 0.5}
]

all_results = []

for k in CLUSTERING_CONFIG["k_values"]:
    print(f"\nTesting K = {k}")
    print("-" * 40)
    
    k_results = []
    for config in test_configs:
        result = test_clustering_config(
            config["name"],
            max_tfidf=config["max_tfidf"],
            source_weight=config["source_weight"],
            k_clusters=k
        )
        if result:
            k_results.append(result)
            all_results.append(result)
    
    print(f"Completed K = {k}: {len(k_results)} configurations")

print(f"\nTotal completed: {len(all_results)} configurations across all k values")

STREAMLINED K-MEANS ANALYSIS
Testing k values from 2 to 8 with selected configurations...

Testing K = 2
----------------------------------------

TESTING: current
TF-IDF Features: 500, Source Weight: 0.0, K Clusters: 2
Using existing features: current
Starting K-Means training with k=2
Loading features from: preprocessed\kmeans\current
Training features: (12003, 557)
Test features: (3212, 557)
Training K-Means with k=2...
Model trained. Inertia: 6446996
Evaluating clustering performance...
Clustering Metrics:
  Silhouette Score - Train: 0.166, Test: 0.161
  Adjusted Rand Index - Train: -0.007, Test: -0.045
  Normalized Mutual Info - Train: 0.050, Test: 0.046
Evaluating classification performance...
Classification Metrics:
  Accuracy: 0.662
  Precision: 0.000
  Recall: 0.000
  F1-Score: 0.000
  ROC-AUC: 0.579
Analyzing cluster composition (Test)...
  Cluster 0: size= 505, spam_rate=11.3%, dominant=email
  Cluster 1: size=2707, spam_rate=38.0%, dominant=mix_email_sms
Creating enhanced v

### 5.3. Comprehensive Analysis Summary

Detailed results for every tested k value under the `current` configuration:

In [None]:
# Comprehensive analysis for current configuration across all k values
print("COMPREHENSIVE CLUSTERING ANALYSIS - CURRENT CONFIGURATION")
print("=" * 80)

# Load and display results for each k value
for k in CLUSTERING_CONFIG["k_values"]:
    results_path = Path(f"{CLUSTERING_CONFIG['models_base_dir']}/k{k}/current/clustering_results.json")
    
    if results_path.exists():
        with open(results_path, 'r') as f:
            results = json.load(f)
        
        # Add indicator for chosen configuration
        chosen_indicator = " (chosen one)" if k == 3 else ""
        
        print(f"{'='*80}")
        print(f"K = {k} RESULTS (Current Configuration){chosen_indicator}")
        print(f"{'='*80}")
        
        # Clustering Performance
        clustering_perf = results['clustering_performance']
        print("\nClustering Performance:")
        print(f"  Silhouette Score:       Train={clustering_perf['silhouette_score']['train']:.3f}, Test={clustering_perf['silhouette_score']['test']:.3f}")
        print(f"  Adjusted Rand Index:    Train={clustering_perf['adjusted_rand_index']['train']:.3f}, Test={clustering_perf['adjusted_rand_index']['test']:.3f}")
        print(f"  Normalized Mutual Info: Train={clustering_perf['normalized_mutual_info']['train']:.3f}, Test={clustering_perf['normalized_mutual_info']['test']:.3f}")
        
        # Classification Performance
        class_perf = results['classification_performance']
        print("\nClassification Performance:")
        print(f"  Accuracy:          {class_perf['accuracy']:.3f}")
        print(f"  Precision:         {class_perf['precision']:.3f}")
        print(f"  Recall:            {class_perf['recall']:.3f}")
        print(f"  F1-Score:          {class_perf['f1_score']:.3f}")
        print(f"  ROC-AUC:           {class_perf['roc_auc']:.3f}")
        print(f"  Average Precision: {class_perf['average_precision']:.3f}")
        
        # Add special emphasis for chosen configuration
        if k == 3:
            print("\n" + ">" * 80)
            print(">>> THIS IS THE CHOSEN CONFIGURATION FOR FINAL MODEL <<<")
            print(">" * 80)
        
        print()
    else:
        print(f"K = {k}: Results not found\n")

print(f"{'='*80}")
print("For parameter fine-tuning results across different configurations,")
print(f"see: {CLUSTERING_CONFIG['models_base_dir']}/k{{X}}/{{config_name}}/")
print(f"{'='*80}")

COMPREHENSIVE CLUSTERING ANALYSIS - CURRENT CONFIGURATION
Total runtime: 345.30 seconds

K = 2 RESULTS (Current Configuration)

Clustering Performance:
  Silhouette Score:       Train=0.166, Test=0.161
  Adjusted Rand Index:    Train=-0.007, Test=-0.045
  Normalized Mutual Info: Train=0.050, Test=0.046

Classification Performance:
  Accuracy:          0.662
  Precision:         0.000
  Recall:            0.000
  F1-Score:          0.000
  ROC-AUC:           0.579
  Average Precision: 0.378

K = 3 RESULTS (Current Configuration) (chosen one)

Clustering Performance:
  Silhouette Score:       Train=0.014, Test=0.011
  Adjusted Rand Index:    Train=0.180, Test=0.270
  Normalized Mutual Info: Train=0.255, Test=0.308

Classification Performance:
  Accuracy:          0.835
  Precision:         0.996
  Recall:            0.514
  F1-Score:          0.678
  ROC-AUC:           0.782
  Average Precision: 0.694

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>> T

## 6. BiLSTM Deep Learning Model
This section trains a bidirectional LSTM using PyTorch for contextual spam detection.

Steps:
- Creates and loads tokenized, padded sequences with their corresponding labels and sources.
- Trains multiple BiLSTM variants (basic, deep, attention-based) with early stopping and validation monitoring.
- Evaluates final models on test data using accuracy, F1, ROC-AUC, and source-level breakdown.
- Visualizes confusion matrices, ROC curves, and performance across different message sources.
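
The tokenize, index and pad step can be sketched in a few lines. This version assumes whitespace tokenization and `<PAD>`/`<UNK>` special tokens at indices 0 and 1; `build_vocab` and `encode` are hypothetical helpers. The `vocab_size`, `min_freq` and `max_length` parameters mirror the arguments passed to `run_lstm_preprocessing`, whose internals may differ.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # assumed special tokens; index 0 is padding


def build_vocab(texts, vocab_size=10000, min_freq=2):
    """Keep the most frequent tokens that occur at least min_freq times."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    frequent = [tok for tok, n in counts.most_common() if n >= min_freq]
    vocab = {PAD: 0, UNK: 1}
    for tok in frequent[: vocab_size - len(vocab)]:
        vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_length=128):
    """Map tokens to ids, truncate to max_length, then right-pad with PAD."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()]
    ids = ids[:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


texts = ["win a free prize", "free prize today", "how are you today"]
vocab = build_vocab(texts, vocab_size=10, min_freq=2)
print(encode("win free prize today now", vocab, max_length=6))
# → [1, 2, 3, 4, 1, 0]
```

Tokens below `min_freq` or outside the vocabulary collapse to `<UNK>`, so every message becomes a fixed-length integer sequence that an embedding layer can consume.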


In [None]:
run_lstm_preprocessing(
    input_dir="final_dataset",  # Input from shared preprocessing
    output_dir="preprocessed/lstm",  # Output directory for LSTM data
    vocab_size=10000,  # Vocabulary size
    max_length=None,  # Auto-determine from data (will cap at 128)
    min_freq=2,  # Minimum token frequency
)

STARTING BILSTM PREPROCESSING PIPELINE
LOADING PREPROCESSED DATA FOR BILSTM
Loaded training data: 12,003 records
Loaded test data: 3,212 records

Applying LSTM-specific text cleaning...

BUILDING VOCABULARY
Building vocabulary...
Vocabulary size: 10,000
Total tokens processed: 1,301,127
Unique tokens: 36,568

CONVERTING TEXT TO SEQUENCES
Creating LSTM dataset with max_length=128...
Creating LSTM dataset with max_length=128...

Analyzing sequence lengths...
Sequence length statistics:
  Mean: 61.7
  Median: 48.0
  25th percentile: 20.0
  75th percentile: 116.0
  95th percentile: 128.0
  Max: 128
  Suggested max_length: 128
Using max_length: 128

PADDING SEQUENCES
Padded sequences to length: 128

Creating PyTorch tensors...
Created PyTorch tensors:
  X shape: torch.Size([12003, 128])
  y shape: torch.Size([12003])
Created PyTorch tensors:
  X shape: torch.Size([3212, 128])
  y shape: torch.Size([3212])
SAVING LSTM PREPROCESSED DATA
LSTM preprocessed data saved to: preprocessed\lstm
Files

True

In [None]:
data_dir = "preprocessed/lstm/"
output_dir = "models/lstm"
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
model_configs = [
    {
        "name": "basic_bilstm",
        "config": {
            "embedding_dim": 128,
            "hidden_dims": [64],
            "dense_layers": [32],
            "dropout_rate": 0.3,
            "bidirectional": True,
            "use_attention": False,
        },
    },
    {
        "name": "deep_bilstm", # Best model so far (user should choose this one)
        "config": {
            "embedding_dim": 128,
            "hidden_dims": [128, 64],
            "dense_layers": [64, 32],
            "dropout_rate": 0.3,
            "bidirectional": True,
            "use_attention": False,
        },
    },
    {
        "name": "bilstm_attention",
        "config": {
            "embedding_dim": 128,
            "hidden_dims": [128],
            "dense_layers": [64, 32],
            "dropout_rate": 0.3,
            "bidirectional": True,
            "use_attention": True,
            "attention_dim": 64,
        },
    },
    {
        "name": "unidirectional_lstm",
        "config": {
            "embedding_dim": 128,
            "hidden_dims": [128, 64],
            "dense_layers": [64, 32],
            "dropout_rate": 0.3,
            "bidirectional": False,
            "use_attention": False,
        },
    },
]

# Training configuration
train_config = {
    "epochs": 500,
    "batch_size": 512,
    "learning_rate": 0.0001,
    "weight_decay": 1e-5,
    "early_stopping_patience": 10,
    "lr_patience": 3,
    "lr_factor": 0.5,
    "clip_grad_norm": 1.0,
    "print_every": 5,
}

# Run experiments
results = run_lstm_experiments(
    model_configs,
    train_config,
    data_dir=data_dir,
    output_dir=output_dir,
    device=device,
)

Starting LSTM Spam Detection Training Pipeline
Loading LSTM preprocessed data...
Training data: torch.Size([12003, 128])
Test data: torch.Size([3212, 128])
Vocabulary size: 10000
Training sources: ['email' 'mix_email_sms' 'sms' 'youtube']
Test sources: ['email' 'mix_email_sms' 'sms' 'youtube']

Training basic_bilstm
Model config: {'embedding_dim': 128, 'hidden_dims': [64], 'dense_layers': [32], 'dropout_rate': 0.3, 'bidirectional': True, 'use_attention': False}

Training model with validation split...
Setting up stratified train-validation split based on source+label...
Stratification groups found:
  email_0.0: 2159 samples
  email_1.0: 1163 samples
  mix_email_sms_0.0: 2433 samples
  mix_email_sms_1.0: 2479 samples
  sms_0.0: 1481 samples
  sms_1.0: 635 samples
  youtube_0.0: 799 samples
  youtube_1.0: 854 samples
Train set: 9602 samples
Validation set: 2401 samples
Model architecture:
LSTMModel(
  (embedding): Embedding(10000, 128, padding_idx=0)
  (lstm_layers): ModuleList(
    (0):

## 7. Load models and run inference on text

In [2]:
# List of input texts
input_texts = [
    "win a free prize!!! Only 1 chance :D",
    "how r u today?"
]

In [None]:
# XGB
xgb = XGBTextClassifier(model_dir="models/xgboost")
pred = xgb.predict(input_texts)

pred

Extracting text statistics...
Extracting special character features...
Extracting source features...


array([1, 0])

In [4]:
# KMeans
kmeans = KMeansTextInferencer(model_dir="models/kmeans/k3/tfidf_1000")
pred = kmeans.predict(input_texts)

pred

Extracting comprehensive text features...
Extracted 53 comprehensive text features per sample


array([0, 0])

In [5]:
# LSTM
lstm = LSTMTextClassifier(model_dir="models/lstm/deep_bilstm_chosen")
pred = lstm.predict(input_texts)

pred

array([1, 0])