# 16S Model Training and Evaluation

**Objective:** To build, train, and evaluate a deep learning classifier for the 16S rRNA gene using the pre-processed data.

**Methodology:**
1. Load the training/testing data and encoders from disk.
2. Define the neural network architecture using TensorFlow/Keras.
3. Train the model on the training data, using the GPU if available.
4. Evaluate the final model's accuracy on the unseen test data.

In [1]:
import numpy as np
import tensorflow as tf
from scipy.sparse import load_npz
import pickle
from pathlib import Path
import sys

# Set up project path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# --- Verification Step: Check for GPU ---
# This will tell us if TensorFlow can see your GPU.
print("--- TensorFlow Setup ---")
print(f"TensorFlow Version: {tf.__version__}")
gpu_devices = tf.config.list_physical_devices('GPU')
if gpu_devices:
    print(f"✅ GPU detected: {gpu_devices[0]}")
else:
    print("⚠️ No GPU detected. TensorFlow will run on CPU.")
print("-" * 26)

--- TensorFlow Setup ---
TensorFlow Version: 2.10.1
✅ GPU detected: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
--------------------------


### Step 2: Load Pre-processed Data and Encoders

We will now load all the artifacts that were saved by our data preparation notebook. This includes the training data, testing data, and the crucial `vectorizer` and `label_encoder` objects.

In [2]:
# --- Define file paths ---
PROCESSED_DATA_DIR = project_root / "data" / "processed"
MODELS_DIR = project_root / "models"

X_TRAIN_PATH = PROCESSED_DATA_DIR / "X_train_16s.npz"
X_TEST_PATH = PROCESSED_DATA_DIR / "X_test_16s.npz"
Y_TRAIN_PATH = PROCESSED_DATA_DIR / "y_train_16s.npy"
Y_TEST_PATH = PROCESSED_DATA_DIR / "y_test_16s.npy"

VECTORIZER_PATH = MODELS_DIR / "16s_genus_vectorizer.pkl"
LABEL_ENCODER_PATH = MODELS_DIR / "16s_genus_label_encoder.pkl"


# --- Load the data and encoders ---
print("Loading data from disk...")
X_train = load_npz(X_TRAIN_PATH)
X_test = load_npz(X_TEST_PATH)
y_train = np.load(Y_TRAIN_PATH)
y_test = np.load(Y_TEST_PATH)

with open(LABEL_ENCODER_PATH, 'rb') as f:
    label_encoder = pickle.load(f)

# Note: We don't need to load the vectorizer right now, but we will need it for a final script.
# The label_encoder is important because it tells us the number of classes.
print("✅ Data loading complete.")


# --- Verification Step ---
print("\n--- Loaded Data Shapes ---")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print("-" * 30)
print(f"Shape of X_test:  {X_test.shape}")
print(f"Shape of y_test:  {y_test.shape}")
print(f"Number of classes (genera): {len(label_encoder.classes_)}")

Loading data from disk...
✅ Data loading complete.

--- Loaded Data Shapes ---
Shape of X_train: (4449, 12850)
Shape of y_train: (4449,)
------------------------------
Shape of X_test:  (1113, 12850)
Shape of y_test:  (1113,)
Number of classes (genera): 345


### Step 3: Define the Neural Network Architecture

We will now define our deep learning model using the Keras `Sequential` API. The architecture will consist of a series of layers:

-   An **Input Layer** that accepts our k-mer feature vectors.
-   Two hidden **Dense** layers with ReLU activation, which act as the primary learning components of the network.
-   **Dropout** layers placed after each Dense layer to prevent the model from overfitting to the training data.
-   An **Output Layer** with a `softmax` activation function, which will output the probability for each of the possible genera.

The model will then be compiled with an `adam` optimizer and a `sparse_categorical_crossentropy` loss function, which are standard and effective choices for multi-class classification.

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# --- Get model parameters from our loaded data ---
num_classes = len(label_encoder.classes_)
input_shape = X_train.shape[1] # The number of unique k-mer features

# --- Define the Sequential model ---
model = Sequential([
    # Input layer and first hidden layer
    Dense(2048, activation='relu', input_shape=(input_shape,)),
    Dropout(0.5),
    
    # Second hidden layer
    Dense(1024, activation='relu'),
    Dropout(0.5),
    
    # Output layer
    Dense(num_classes, activation='softmax')
])

# --- Compile the model ---
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# --- Print a summary of the model's architecture ---
print("Model Architecture Defined and Compiled.")
print("Here is a summary:")

# ASCII Art representation
print("\n+-----------------------------------------------------------------+")
print(f"| Input Layer:         (None, {input_shape})                     |")
print("| Dense Layer (ReLU):    (None, 2048)                               |")
print("| Dropout (0.5):         (None, 2048)                               |")
print("| Dense Layer (ReLU):    (None, 1024)                               |")
print("| Dropout (0.5):         (None, 1024)                               |")
print(f"| Output Layer (Softmax):(None, {num_classes})                                |")
print("+-----------------------------------------------------------------+\n")

model.summary()

Model Architecture Defined and Compiled.
Here is a summary:

+-----------------------------------------------------------------+
| Input Layer:         (None, 12850)                     |
| Dense Layer (ReLU):    (None, 2048)                               |
| Dropout (0.5):         (None, 2048)                               |
| Dense Layer (ReLU):    (None, 1024)                               |
| Dropout (0.5):         (None, 1024)                               |
| Output Layer (Softmax):(None, 345)                                |
+-----------------------------------------------------------------+

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 2048)              26318848  
                                                                 
 dropout (Dropout)           (None, 2048)              0         
                                           

### Step 4: Train the Neural Network

We will now begin the training process by calling the `model.fit()` method. This function will feed the training data (`X_train`, `y_train`) to the model for a specified number of cycles, or **epochs**.

During training, it will:
-   Show the progress for each epoch.
-   Calculate the `loss` and `accuracy` on the training data.
-   After each epoch, it will evaluate the model on the **validation data** (a small portion of the training set held aside) to monitor for overfitting.
-   We will use an `EarlyStopping` callback, which automatically stops the training process if the validation accuracy does not improve for a set number of epochs. This saves time and prevents overfitting.

In [4]:
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping, Callback

# --- PART 1: FIX THE STRATIFICATION VALUE-ERROR ---
# We must ensure our y_train set doesn't have any singleton classes before splitting it.

print("--- Pre-flight Check for Stratification ---")
# Count the occurrences of each class in the training labels
unique_classes, class_counts = np.unique(y_train, return_counts=True)

# Find which classes have fewer than 2 members (the new singletons)
singletons = unique_classes[class_counts < 2]

if len(singletons) > 0:
    print(f"Found {len(singletons)} singleton class(es) in the training set. Removing them...")
    
    # Get the indices of the rows that are NOT singletons
    non_singleton_indices = np.where(~np.isin(y_train, singletons))[0]
    
    # Filter both X_train and y_train to keep only the non-singletons
    X_train = X_train[non_singleton_indices]
    y_train = y_train[non_singleton_indices]
    
    print(f"Cleaned training set shape: {X_train.shape}")
else:
    print("Training set is clean. No singletons found.")

print("-" * 43)


# --- PART 2: CREATE THE BEAUTIFUL OUTPUT CALLBACK ---

class TrainingProgressCallback(Callback):
    """A custom callback to print a single, clean line of progress for each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        # Access the metrics from the logs dictionary
        acc = logs.get('accuracy', 0)
        val_acc = logs.get('val_accuracy', 0)
        loss = logs.get('loss', 0)
        val_loss = logs.get('val_loss', 0)

        # Create progress bars using ASCII block characters
        acc_bar = '█' * int(acc * 20) + '·' * (20 - int(acc * 20))
        val_acc_bar = '█' * int(val_acc * 20) + '·' * (20 - int(val_acc * 20))

        # Print the formatted output string
        print(f"Epoch {epoch+1:02d}/{EPOCHS} | Loss: {loss:.4f} | Acc: {acc:.2%} [{acc_bar}] | Val_Loss: {val_loss:.4f} | Val_Acc: {val_acc:.2%} [{val_acc_bar}]")


# --- PART 3: TRAIN THE MODEL WITH THE FIX AND THE NEW CALLBACK ---

# Define training parameters
EPOCHS = 50
BATCH_SIZE = 16
RANDOM_STATE = 42

# Manually create the validation set from our now-guaranteed-clean training data
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train, y_train,
    test_size=0.1,
    random_state=RANDOM_STATE,
    stratify=y_train
)

# Define the EarlyStopping callback
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=3,
    verbose=1,
    restore_best_weights=True
)

# --- Start training ---
print("\n--- Starting Model Training ---")

history = model.fit(
    X_train_final, y_train_final,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_val, y_val),
    # We remove the default verbose output and add our custom callbacks
    verbose=0, 
    callbacks=[early_stopping, TrainingProgressCallback()]
)

print("\n--- Training complete. ---")

--- Pre-flight Check for Stratification ---
Training set is clean. No singletons found.
-------------------------------------------


NameError: name 'train_test_split' is not defined