# M.A.R.L.IN eDNA Species Classifier - Model Training

## Overview
This notebook trains deep learning models for:
1. **Taxonomic Classification**: Predicting taxonomic labels from sequence embeddings
2. **Novelty Detection**: Identifying potentially novel species/taxa
3. **Cluster Prediction**: Assigning sequences to biological clusters
4. **Database Classification**: Predicting the most likely reference database

## Model Architectures
- **Feed-forward Neural Networks**: For embedding-based classification
- **CNN Models**: For sequence-based learning (optional)
- **Ensemble Methods**: Combining multiple models for better predictions
- **Autoencoder**: For novelty detection based on reconstruction error

## Training Strategy
- Stratified train/validation/test splits
- Cross-validation for robust evaluation
- Hyperparameter optimization
- Early stopping and regularization
- Class imbalance handling
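
Class imbalance handling is listed above, and the notebook imports `compute_class_weight`, but the cells shown here never call it. One common option, sketched below as an assumption rather than as this notebook's actual method, is inverse-frequency class weighting, the same heuristic scikit-learn calls `balanced`. The helper name `balanced_class_weights` and the example labels are hypothetical, and the sketch assumes integer labels `0..K-1` with every class present.

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class).

    Assumes integer labels 0..K-1 with at least one sample per class;
    an empty class would produce a division by zero.
    """
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Hypothetical imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2
y_example = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
weights = balanced_class_weights(y_example)
print(weights)  # rarer classes receive larger weights
```

Such weights could then be passed to the loss, for example `nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))`. For perfectly balanced targets every weight is 1.0, so the weighting has no effect.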

## Goals
- Train robust models for taxonomic classification
- Develop an effective novelty detection system
- Create a model ensemble for improved accuracy
- Generate model confidence scores and uncertainty estimates
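
As a rough illustration of the last goal, a confidence score and a simple uncertainty estimate can both be derived from a classifier's raw logits. The sketch below is an assumption about how this might be done, not something the notebook implements: it uses the maximum softmax probability as the confidence and the normalized Shannon entropy of the softmax distribution as the uncertainty. The function name `confidence_and_entropy` and the example logits are hypothetical.

```python
import numpy as np

def confidence_and_entropy(logits):
    """Return (max softmax probability, normalized entropy) for each row of logits."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)
    # Small epsilon avoids log(0); dividing by log(K) scales entropy to [0, 1]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    normalized_entropy = entropy / np.log(probs.shape[1])
    return confidence, normalized_entropy

# Hypothetical logits: one maximally uncertain row, one confident row
example_logits = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
conf, unc = confidence_and_entropy(example_logits)
print(conf, unc)
```

Softmax confidence is often poorly calibrated, so these values are best read as relative rankings unless a calibration step is added.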

In [4]:
# Import required libraries
import os
import sys
import pandas as pd
import numpy as np
import pickle
import json
from pathlib import Path
from collections import Counter
import random

# Machine learning libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight

# Deep learning libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torch.nn.functional as F

# Metrics and evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_curve, auc

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set up paths
BASE_DIR = Path("../data")
PROCESSED_DIR = BASE_DIR / "processed"
EMBEDDINGS_DIR = BASE_DIR / "embeddings"
MODEL_DIR = Path("../model")
DNABERT_DIR = MODEL_DIR / "dnabert_finetuned"

# Create directories
DNABERT_DIR.mkdir(parents=True, exist_ok=True)

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"Model training directory: {MODEL_DIR}")

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

Using device: cpu
Model training directory: ../model


In [5]:
# Load data for model training
print("Loading processed data for model training...")

try:
    # Load clustered sequences with embeddings
    with open(PROCESSED_DIR / "sequences_clustered.pkl", 'rb') as f:
        df_sequences = pickle.load(f)
    
    # Load embeddings
    with open(EMBEDDINGS_DIR / "sequence_embeddings.pkl", 'rb') as f:
        embeddings_data = pickle.load(f)
    
    # Load biodiversity analysis
    with open(PROCESSED_DIR / "biodiversity_analysis.pkl", 'rb') as f:
        analysis_results = pickle.load(f)
    
    print(f"Loaded {len(df_sequences)} sequences with embeddings")
    print(f"Embedding dimensions: {embeddings_data['original_embeddings'].shape[1]}")
    
except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    print("Please run all previous notebooks first!")
    sys.exit(1)

# Extract embeddings and prepare features
X_original = embeddings_data['original_embeddings']
X_pca = embeddings_data['pca_embeddings']

print(f"Original embeddings shape: {X_original.shape}")
print(f"PCA embeddings shape: {X_pca.shape}")

# Prepare target variables for different tasks
# 1. Database classification
y_database = df_sequences['database'].values
database_encoder = LabelEncoder()
y_database_encoded = database_encoder.fit_transform(y_database)

# 2. Cluster prediction (remove noise points)
mask_no_noise = df_sequences['cluster_id'] != -1
X_clustered = X_pca[mask_no_noise]
y_cluster = df_sequences[mask_no_noise]['cluster_id'].values
cluster_encoder = LabelEncoder()
y_cluster_encoded = cluster_encoder.fit_transform(y_cluster)

# 3. Novelty detection (binary: novel vs known)
novelty_threshold = 0.5
y_novelty = (df_sequences['cluster_novelty'] >= novelty_threshold).astype(int).values

print(f"\nTarget variable distributions:")
print(f"Databases: {Counter(y_database)}")
print(f"Clusters (no noise): {len(np.unique(y_cluster_encoded))} clusters")
print(f"Novelty: {Counter(y_novelty)} (0=known, 1=novel)")

# Basic statistics
print(f"\nData preparation:")
print(f"Total sequences: {len(df_sequences)}")
print(f"Sequences with clusters (no noise): {len(X_clustered)}")
print(f"Novel sequences: {np.sum(y_novelty)}")
print(f"Known sequences: {len(y_novelty) - np.sum(y_novelty)}")

Loading processed data for model training...
Loaded 15000 sequences with embeddings
Embedding dimensions: 1344
Original embeddings shape: (15000, 1344)
PCA embeddings shape: (15000, 50)

Target variable distributions:
Databases: Counter({'16S_ribosomal_RNA': 5000, '18S_fungal_sequences': 5000, '28S_fungal_sequences': 5000})
Clusters (no noise): 1 clusters
Novelty: Counter({np.int64(0): 15000}) (0=known, 1=novel)

Data preparation:
Total sequences: 15000
Sequences with clusters (no noise): 15000
Novel sequences: 0
Known sequences: 15000
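
The output above reports a single cluster and zero novel sequences, so two of the three targets carry no signal for a supervised classifier. A small guard like the following could make that explicit before any training is attempted. It is a sketch only: the helper name `is_trainable_target` and its default thresholds are assumptions. The minimum of two samples per class reflects the requirement of a single stratified split; the two-stage split used later may need more.

```python
from collections import Counter

def is_trainable_target(labels, min_classes=2, min_per_class=2):
    """Check that a label vector has enough classes and samples per class.

    Hypothetical helper: the thresholds are illustrative defaults.
    """
    counts = Counter(labels)
    if len(counts) < min_classes:
        return False
    return min(counts.values()) >= min_per_class

# Stand-in label vectors mirroring the situations seen above
single_class = [0] * 10             # e.g. every sequence marked as known
three_classes = [0, 0, 1, 1, 2, 2]  # e.g. three reference databases

print(is_trainable_target(single_class))
print(is_trainable_target(three_classes))
```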

In [7]:
# Define neural network models
class TaxonomicClassifier(nn.Module):
    """Neural network for taxonomic classification"""
    
    def __init__(self, input_dim, num_classes, hidden_dims=(256, 128, 64), dropout_rate=0.3):
        super(TaxonomicClassifier, self).__init__()
        
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate)
            ])
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, num_classes))
        
        self.network = nn.Sequential(*layers)
        
    def forward(self, x):
        return self.network(x)

class NoveltyDetector(nn.Module):
    """Autoencoder for novelty detection"""
    
    def __init__(self, input_dim, encoding_dims=(128, 64, 32), dropout_rate=0.2):
        super(NoveltyDetector, self).__init__()
        
        # Encoder
        encoder_layers = []
        prev_dim = input_dim
        
        for encoding_dim in encoding_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, encoding_dim),
                nn.BatchNorm1d(encoding_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate)
            ])
            prev_dim = encoding_dim
        
        self.encoder = nn.Sequential(*encoder_layers)
        
        # Decoder
        decoder_layers = []
        decoding_dims = list(reversed(encoding_dims[:-1])) + [input_dim]
        
        for i, decoding_dim in enumerate(decoding_dims):
            if i == len(decoding_dims) - 1:  # Last layer
                decoder_layers.append(nn.Linear(prev_dim, decoding_dim))
            else:
                decoder_layers.extend([
                    nn.Linear(prev_dim, decoding_dim),
                    nn.BatchNorm1d(decoding_dim),
                    nn.ReLU(),
                    nn.Dropout(dropout_rate)
                ])
            prev_dim = decoding_dim
        
        self.decoder = nn.Sequential(*decoder_layers)
        
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded, encoded
    
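    def get_reconstruction_error(self, x):
        """Per-sample mean squared reconstruction error as a NumPy array."""
        # Use eval mode so dropout and batch-norm batch statistics do not perturb the error
        was_training = self.training
        self.eval()
        with torch.no_grad():
            decoded, _ = self.forward(x)
            mse = torch.mean((x - decoded) ** 2, dim=1)
        self.train(was_training)  # Restore the previous mode
        return mse.cpu().numpy()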

class ModelTrainer:
    """Comprehensive model trainer for eDNA classification"""
    
    def __init__(self, device='cpu'):
        self.device = device
        self.models = {}
        self.scalers = {}
        self.encoders = {}
        self.training_history = {}
        
    def prepare_data(self, X, y, test_size=0.2, val_size=0.2):
        """Prepare train/validation/test splits"""
        # First split: separate test set
        X_temp, X_test, y_temp, y_test = train_test_split(
            X, y, test_size=test_size, random_state=42, stratify=y
        )
        
        # Second split: separate train and validation
        val_size_adjusted = val_size / (1 - test_size)
        X_train, X_val, y_train, y_val = train_test_split(
            X_temp, y_temp, test_size=val_size_adjusted, random_state=42, stratify=y_temp
        )
        
        return X_train, X_val, X_test, y_train, y_val, y_test
    
    def scale_features(self, X_train, X_val, X_test, scaler_name):
        """Scale features and store scaler"""
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled = scaler.transform(X_val)
        X_test_scaled = scaler.transform(X_test)
        
        self.scalers[scaler_name] = scaler
        
        return X_train_scaled, X_val_scaled, X_test_scaled
    
    def train_classifier(self, X_train, X_val, X_test, y_train, y_val, y_test, 
                        model_name, num_epochs=100, lr=0.001, batch_size=32):
        """Train taxonomic classifier"""
        print(f"Training {model_name} classifier...")
        
        # Scale features
        X_train_scaled, X_val_scaled, X_test_scaled = self.scale_features(
            X_train, X_val, X_test, f"{model_name}_scaler"
        )
        
        # Prepare data loaders
        train_dataset = TensorDataset(
            torch.FloatTensor(X_train_scaled), 
            torch.LongTensor(y_train)
        )
        val_dataset = TensorDataset(
            torch.FloatTensor(X_val_scaled), 
            torch.LongTensor(y_val)
        )
        
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size)
        
        # Initialize model
        num_classes = len(np.unique(np.concatenate([y_train, y_val, y_test])))
        model = TaxonomicClassifier(
            input_dim=X_train_scaled.shape[1],
            num_classes=num_classes
        ).to(self.device)
        
        # Loss and optimizer
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
        
        # Training loop
        train_losses = []
        val_losses = []
        val_accuracies = []
        best_val_acc = 0
        patience_counter = 0
        
        for epoch in range(num_epochs):
            # Training
            model.train()
            train_loss = 0
            for batch_X, batch_y in train_loader:
                batch_X, batch_y = batch_X.to(self.device), batch_y.to(self.device)
                
                optimizer.zero_grad()
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
                
                train_loss += loss.item()
            
            # Validation
            model.eval()
            val_loss = 0
            correct = 0
            total = 0
            
            with torch.no_grad():
                for batch_X, batch_y in val_loader:
                    batch_X, batch_y = batch_X.to(self.device), batch_y.to(self.device)
                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)
                    val_loss += loss.item()
                    
                    _, predicted = torch.max(outputs.data, 1)
                    total += batch_y.size(0)
                    correct += (predicted == batch_y).sum().item()
            
            train_loss /= len(train_loader)
            val_loss /= len(val_loader)
            val_acc = correct / total
            
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            val_accuracies.append(val_acc)
            
            scheduler.step(val_loss)
            
            # Early stopping
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                torch.save(model.state_dict(), f"{model_name}_best.pth")
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= 20:
                    print(f"Early stopping at epoch {epoch}")
                    break
            
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
        
        # Load best model
        model.load_state_dict(torch.load(f"{model_name}_best.pth"))
        
        # Evaluate on test set
        test_dataset = TensorDataset(torch.FloatTensor(X_test_scaled), torch.LongTensor(y_test))
        test_loader = DataLoader(test_dataset, batch_size=batch_size)
        
        model.eval()
        y_pred = []
        y_true = []
        
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                batch_X = batch_X.to(self.device)
                outputs = model(batch_X)
                _, predicted = torch.max(outputs.data, 1)
                y_pred.extend(predicted.cpu().numpy())
                y_true.extend(batch_y.numpy())
        
        test_acc = accuracy_score(y_true, y_pred)
        print(f"Test Accuracy: {test_acc:.4f}")
        
        # Store model and results
        self.models[model_name] = model
        self.training_history[model_name] = {
            'train_losses': train_losses,
            'val_losses': val_losses,
            'val_accuracies': val_accuracies,
            'test_accuracy': test_acc,
            'y_true': y_true,
            'y_pred': y_pred
        }
        
        return model, test_acc
    
    def train_novelty_detector(self, X_train, X_val, X_test, model_name, 
                              num_epochs=100, lr=0.001, batch_size=32):
        """Train autoencoder for novelty detection"""
        print(f"Training {model_name} novelty detector...")
        
        # Scale features
        X_train_scaled, X_val_scaled, X_test_scaled = self.scale_features(
            X_train, X_val, X_test, f"{model_name}_scaler"
        )
        
        # Prepare data loaders (autoencoder uses input as target)
        train_dataset = TensorDataset(torch.FloatTensor(X_train_scaled), torch.FloatTensor(X_train_scaled))
        val_dataset = TensorDataset(torch.FloatTensor(X_val_scaled), torch.FloatTensor(X_val_scaled))
        
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size)
        
        # Initialize model
        model = NoveltyDetector(input_dim=X_train_scaled.shape[1]).to(self.device)
        
        # Loss and optimizer
        criterion = nn.MSELoss()
        optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
        
        # Training loop
        train_losses = []
        val_losses = []
        best_val_loss = float('inf')
        patience_counter = 0
        
        for epoch in range(num_epochs):
            # Training
            model.train()
            train_loss = 0
            for batch_X, batch_y in train_loader:
                batch_X, batch_y = batch_X.to(self.device), batch_y.to(self.device)
                
                optimizer.zero_grad()
                decoded, _ = model(batch_X)
                loss = criterion(decoded, batch_y)
                loss.backward()
                optimizer.step()
                
                train_loss += loss.item()
            
            # Validation
            model.eval()
            val_loss = 0
            
            with torch.no_grad():
                for batch_X, batch_y in val_loader:
                    batch_X, batch_y = batch_X.to(self.device), batch_y.to(self.device)
                    decoded, _ = model(batch_X)
                    loss = criterion(decoded, batch_y)
                    val_loss += loss.item()
            
            train_loss /= len(train_loader)
            val_loss /= len(val_loader)
            
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            
            scheduler.step(val_loss)
            
            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(model.state_dict(), f"{model_name}_best.pth")
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= 20:
                    print(f"Early stopping at epoch {epoch}")
                    break
            
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Train Loss: {train_loss:.6f}, Val Loss: {val_loss:.6f}")
        
        # Load best model
        model.load_state_dict(torch.load(f"{model_name}_best.pth"))
        
        # Store model and results
        self.models[model_name] = model
        self.training_history[model_name] = {
            'train_losses': train_losses,
            'val_losses': val_losses,
            'best_val_loss': best_val_loss
        }
        
        return model

# Initialize trainer
trainer = ModelTrainer(device=device)
print("Model trainer initialized!")

Model trainer initialized!
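
`NoveltyDetector.get_reconstruction_error` returns one error value per sequence, but the cells here do not show how those errors become a novel/known decision. One plausible approach, given as an assumption rather than as the notebook's method, is to set the threshold at a high percentile of the errors measured on held-out known sequences. The function names, the 95th-percentile default, and the synthetic gamma-distributed errors are all illustrative stand-ins.

```python
import numpy as np

def novelty_threshold(known_errors, percentile=95.0):
    """Threshold above which a reconstruction error is treated as novel."""
    return float(np.percentile(known_errors, percentile))

def flag_novel(errors, threshold):
    """Boolean mask: True where the error exceeds the threshold."""
    return np.asarray(errors) > threshold

# Synthetic stand-in for validation-set reconstruction errors of known sequences
rng = np.random.default_rng(0)
known_errors = rng.gamma(shape=2.0, scale=0.05, size=1000)

threshold = novelty_threshold(known_errors)
flags = flag_novel([threshold / 2, threshold * 2], threshold)
print(f"threshold={threshold:.4f}, flags={flags.tolist()}")
```

By construction, roughly 5% of known sequences will exceed a 95th-percentile threshold, so the percentile directly sets the expected false-positive rate on known data.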


In [8]:
# Train models for different tasks
print("=== Training Models ===")

# 1. Database Classification Model
print("\n--- Training Database Classifier ---")
X_train_db, X_val_db, X_test_db, y_train_db, y_val_db, y_test_db = trainer.prepare_data(
    X_pca, y_database_encoded
)

db_model, db_accuracy = trainer.train_classifier(
    X_train_db, X_val_db, X_test_db, y_train_db, y_val_db, y_test_db,
    model_name="database_classifier",
    num_epochs=50,
    lr=0.001
)

# 2. Cluster Prediction Model (only for non-noise sequences)
print("\n--- Training Cluster Classifier ---")
if len(np.unique(y_cluster_encoded)) > 1:  # Only if we have multiple clusters
    X_train_cl, X_val_cl, X_test_cl, y_train_cl, y_val_cl, y_test_cl = trainer.prepare_data(
        X_clustered, y_cluster_encoded
    )
    
    cluster_model, cluster_accuracy = trainer.train_classifier(
        X_train_cl, X_val_cl, X_test_cl, y_train_cl, y_val_cl, y_test_cl,
        model_name="cluster_classifier", 
        num_epochs=75,
        lr=0.001
    )
else:
    print("Not enough clusters for cluster classification")

# 3. Novelty Detection Model
print("\n--- Training Novelty Detector ---")
# For novelty detection, we train on known sequences only
known_mask = y_novelty == 0
X_known = X_pca[known_mask]

if len(X_known) > 50:  # Only if we have enough known sequences
    X_train_nov, X_val_nov, X_test_nov, _, _, _ = trainer.prepare_data(
        X_known, np.zeros(len(X_known))  # Dummy targets for autoencoders
    )
    
    novelty_model = trainer.train_novelty_detector(
        X_train_nov, X_val_nov, X_test_nov,
        model_name="novelty_detector",
        num_epochs=100,
        lr=0.001
    )
else:
    print("Not enough known sequences for novelty detection training")

print("\n=== Model Training Completed ===")
print(f"Database classifier accuracy: {db_accuracy:.4f}")
if 'cluster_classifier' in trainer.training_history:
    print(f"Cluster classifier accuracy: {trainer.training_history['cluster_classifier']['test_accuracy']:.4f}")
if 'novelty_detector' in trainer.training_history:
    print(f"Novelty detector trained successfully")

=== Training Models ===

--- Training Database Classifier ---
Training database_classifier classifier...
Epoch 0: Train Loss: 1.0778, Val Loss: 1.0293, Val Acc: 0.4083
Epoch 10: Train Loss: 0.9663, Val Loss: 1.0253, Val Acc: 0.4030
Epoch 20: Train Loss: 0.8959, Val Loss: 1.0473, Val Acc: 0.4110
Epoch 30: Train Loss: 0.8546, Val Loss: 1.0703, Val Acc: 0.4003
Epoch 40: Train Loss: 0.8273, Val Loss: 1.0944, Val Acc: 0.4047
Early stopping at epoch 42
Test Accuracy: 0.4153

--- Training Cluster Classifier ---
Not enough clusters for cluster classification

--- Training Novelty Detector ---
Training novelty_detector novelty detector...
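
In the database classifier log above, training loss falls from 1.0778 to 0.8273 while validation loss rises from 1.0293 to 1.0944, a pattern consistent with overfitting. The test accuracy of 0.4153 is only modestly above the 0.3333 expected from always predicting one of three equally sized classes. A single accuracy figure hides which classes are confused, so a per-class breakdown may help. The sketch below uses hypothetical helper names and toy labels; it builds a confusion matrix, per-class recall, and the majority-class baseline with NumPy alone.

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Return (confusion matrix, recall per class); rows are true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # Count each (true, predicted) pair
    recall = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)
    return cm, recall

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(y_true))
    return counts.max() / counts.sum()

# Toy labels standing in for the stored test-set results
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
cm, recall = per_class_recall(toy_true, toy_pred, num_classes=3)
baseline = majority_baseline(toy_true)
print(cm, recall, baseline)
```

On the real run, the inputs would be `trainer.training_history['database_classifier']['y_true']` and `['y_pred']`, which `train_classifier` stores after evaluating on the test set.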

In [9]:
# Save trained models and encoders
print("\n=== Saving Models and Results ===")

# Create model save directory
model_save_dir = DNABERT_DIR
model_save_dir.mkdir(exist_ok=True)

# Save PyTorch models
for model_name, model in trainer.models.items():
    model_path = model_save_dir / f"{model_name}.pth"
    torch.save(model.state_dict(), model_path)
    print(f"Saved {model_name} to {model_path}")

# Save scalers and encoders
preprocessing_data = {
    'scalers': trainer.scalers,
    'database_encoder': database_encoder,
    'cluster_encoder': cluster_encoder if 'cluster_classifier' in trainer.models else None,
    'embedding_metadata': {
        'original_dim': X_original.shape[1],
        'pca_dim': X_pca.shape[1],
        'num_sequences': len(df_sequences)
    }
}

preprocessing_path = model_save_dir / "preprocessing_data.pkl"
with open(preprocessing_path, 'wb') as f:
    pickle.dump(preprocessing_data, f)
print(f"Saved preprocessing data to {preprocessing_path}")

# Save training history and results
results_data = {
    'training_history': trainer.training_history,
    'model_architectures': {
        'database_classifier': {
            'input_dim': X_pca.shape[1],
            'num_classes': len(np.unique(y_database_encoded)),
            'task': 'database_classification'
        },
        'cluster_classifier': {
            'input_dim': X_clustered.shape[1] if len(X_clustered) > 0 else X_pca.shape[1],
            'num_classes': len(np.unique(y_cluster_encoded)) if 'cluster_classifier' in trainer.models else 0,
            'task': 'cluster_classification'
        } if 'cluster_classifier' in trainer.models else None,
        'novelty_detector': {
            'input_dim': X_pca.shape[1],
            'task': 'novelty_detection'
        } if 'novelty_detector' in trainer.models else None
    },
    'performance_metrics': {
        'database_accuracy': db_accuracy,
        'cluster_accuracy': trainer.training_history.get('cluster_classifier', {}).get('test_accuracy', 0),
    }
}

results_path = model_save_dir / "training_results.pkl"
with open(results_path, 'wb') as f:
    pickle.dump(results_data, f)
print(f"Saved training results to {results_path}")

# Create a simple model configuration file
config = {
    'model_version': '1.0',
    'training_date': str(pd.Timestamp.now()),
    'models_available': list(trainer.models.keys()),
    'database_classes': database_encoder.classes_.tolist(),
    'cluster_classes': cluster_encoder.classes_.tolist() if 'cluster_classifier' in trainer.models else [],
    'embedding_dim': X_pca.shape[1],
    'device_used': str(device),
    'performance': {
        'database_classification_accuracy': float(db_accuracy),
        'cluster_classification_accuracy': float(trainer.training_history.get('cluster_classifier', {}).get('test_accuracy', 0))
    }
}

config_path = model_save_dir / "model_config.json"
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
print(f"Saved model configuration to {config_path}")

print(f"\n=== Model Training Complete ===")
print(f"Models saved to: {model_save_dir}")
print(f"Available models: {list(trainer.models.keys())}")
print("\nModel files created:")
print("  - *.pth files (PyTorch model weights)")
print("  - preprocessing_data.pkl (scalers and encoders)")
print("  - training_results.pkl (training history and metrics)")
print("  - model_config.json (model configuration)")
print("\nReady for model evaluation!")


=== Saving Models and Results ===
Saved database_classifier to ../model/dnabert_finetuned/database_classifier.pth
Saved novelty_detector to ../model/dnabert_finetuned/novelty_detector.pth
Saved preprocessing data to ../model/dnabert_finetuned/preprocessing_data.pkl
Saved training results to ../model/dnabert_finetuned/training_results.pkl
Saved model configuration to ../model/dnabert_finetuned/model_config.json

=== Model Training Complete ===
Models saved to: ../model/dnabert_finetuned
Available models: ['database_classifier', 'novelty_detector']

Model files created:
  - *.pth files (PyTorch model weights)
  - preprocessing_data.pkl (scalers and encoders)
  - training_results.pkl (training history and metrics)
  - model_config.json (model configuration)

Ready for model evaluation!
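
For downstream evaluation, predicted class indices need to be mapped back to database names. Because `model_config.json` stores `database_classes` in the encoder's order, that list can serve as the lookup table. The following is a minimal sketch under that assumption: the helper names are hypothetical, and the demo writes a stand-in config to a temporary directory rather than reading the real file.

```python
import json
import tempfile
from pathlib import Path

def load_class_names(config_path):
    """Read the ordered list of database class names from a config file."""
    with open(config_path) as f:
        config = json.load(f)
    return config["database_classes"]

def decode_prediction(class_index, class_names):
    """Map an integer class index back to its database name."""
    return class_names[class_index]

# Stand-in config using the three database names from the training data
demo_config = {
    "database_classes": [
        "16S_ribosomal_RNA",
        "18S_fungal_sequences",
        "28S_fungal_sequences",
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_config.json"
    demo_path.write_text(json.dumps(demo_config))
    class_names = load_class_names(demo_path)

decoded = decode_prediction(1, class_names)
print(decoded)
```

The same scaler saved in `preprocessing_data.pkl` would need to be applied to new embeddings before they are passed to a reloaded model.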
