# 16S Model Training and Evaluation

**Objective:** Build, train, and evaluate a deep learning classifier that assigns genus labels to 16S rRNA gene sequences, using the pre-processed data.

**Methodology:**
1. Load the training/testing data and encoders from disk.
2. Define the neural network architecture using TensorFlow/Keras.
3. Train the model on the training data, using the GPU if available.
4. Evaluate the final model's accuracy on the unseen test data.

### Step 1: Import Libraries and Check for a GPU

In [1]:
import numpy as np
import tensorflow as tf
from scipy.sparse import load_npz
import pickle
from pathlib import Path
import sys

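# Add the project root to sys.path so that project modules can be imported.
# Path.cwd().parent assumes the notebook runs from a direct subdirectory of the project root.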
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# --- Verification Step: Check for GPU ---
# This will tell us if TensorFlow can see your GPU.
print("--- TensorFlow Setup ---")
print(f"TensorFlow Version: {tf.__version__}")
gpu_devices = tf.config.list_physical_devices('GPU')
if gpu_devices:
    print(f"✅ GPU detected: {gpu_devices[0]}")
else:
    print("⚠️ No GPU detected. TensorFlow will run on CPU.")
print("-" * 26)

--- TensorFlow Setup ---
TensorFlow Version: 2.10.1
✅ GPU detected: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
--------------------------
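
`Path.cwd().parent` resolves to the project root only when the notebook is launched from a direct subdirectory of it. The block below is a minimal sketch of a more tolerant lookup that walks upward until it finds the expected folders. The helper name `find_project_root` and the use of `data` and `models` as marker directories are assumptions for illustration (they mirror the folders this notebook reads from), not part of the original code.

```python
from pathlib import Path


def find_project_root(start, markers=("data", "models")):
    """Walk upward from `start` until a directory containing every marker is found.

    Hypothetical helper: the marker names are an assumption about the project layout.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    raise FileNotFoundError(
        f"No directory at or above {start} contains all of: {', '.join(markers)}"
    )
```

Under those assumptions, the hard-coded line could become `project_root = find_project_root(Path.cwd())`, which would keep working if the notebook were moved one level deeper.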


### Step 2: Load Pre-processed Data and Encoders

We now load the artifacts saved by the data preparation notebook: the training and test feature matrices, their label arrays, and the fitted `label_encoder`. The `vectorizer` was saved alongside them, but it is not loaded here because it is only needed later for a final script.
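
Every file read in this step is produced by the data preparation notebook, so a missing file usually means that notebook has not been run yet. The block below is a small sketch of a pre-flight check that reports all missing files at once instead of failing on the first one. The helper name `require_artifacts` and the wording of its error message are hypothetical.

```python
from pathlib import Path


def require_artifacts(paths):
    """Raise FileNotFoundError listing every path in `paths` that is not an existing file."""
    missing = [Path(p) for p in paths if not Path(p).is_file()]
    if missing:
        listing = "\n".join(f"  - {p}" for p in missing)
        raise FileNotFoundError(
            "Missing pre-processed artifacts "
            "(has the data preparation notebook been run?):\n" + listing
        )
```

It could be called with the paths defined in the cell below, for example `require_artifacts([X_TRAIN_PATH, X_TEST_PATH, Y_TRAIN_PATH, Y_TEST_PATH, LABEL_ENCODER_PATH])`, before any loading starts.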

In [2]:
# --- Define file paths ---
PROCESSED_DATA_DIR = project_root / "data" / "processed"
MODELS_DIR = project_root / "models"

X_TRAIN_PATH = PROCESSED_DATA_DIR / "X_train_16s.npz"
X_TEST_PATH = PROCESSED_DATA_DIR / "X_test_16s.npz"
Y_TRAIN_PATH = PROCESSED_DATA_DIR / "y_train_16s.npy"
Y_TEST_PATH = PROCESSED_DATA_DIR / "y_test_16s.npy"

VECTORIZER_PATH = MODELS_DIR / "16s_genus_vectorizer.pkl"
LABEL_ENCODER_PATH = MODELS_DIR / "16s_genus_label_encoder.pkl"


# --- Load the data and encoders ---
print("Loading data from disk...")
X_train = load_npz(X_TRAIN_PATH)
X_test = load_npz(X_TEST_PATH)
y_train = np.load(Y_TRAIN_PATH)
y_test = np.load(Y_TEST_PATH)

with open(LABEL_ENCODER_PATH, 'rb') as f:
    label_encoder = pickle.load(f)

# Note: We don't need to load the vectorizer right now, but we will need it for a final script.
# The label_encoder is important because it tells us the number of classes.
print("✅ Data loading complete.")


# --- Verification Step ---
print("\n--- Loaded Data Shapes ---")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print("-" * 30)
print(f"Shape of X_test:  {X_test.shape}")
print(f"Shape of y_test:  {y_test.shape}")
print(f"Number of classes (genera): {len(label_encoder.classes_)}")

Loading data from disk...
✅ Data loading complete.

--- Loaded Data Shapes ---
Shape of X_train: (4744, 13261)
Shape of y_train: (4744,)
------------------------------
Shape of X_test:  (1186, 13261)
Shape of y_test:  (1186,)
Number of classes (genera): 529
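
The printed shapes can also be checked programmatically. With 4,744 training rows spread over 529 genera, the average is roughly nine sequences per genus (4744 / 529 ≈ 8.97), so some classes may be thinly represented. The block below is a sketch of two checks, assuming the label arrays hold integer class indices in the range `0` to `n_classes - 1` (which is what scikit-learn's `LabelEncoder.transform` produces). The function names `check_split` and `samples_per_class` are hypothetical.

```python
import numpy as np


def check_split(X, y, n_features, n_classes):
    """Raise ValueError if a feature matrix and its label array are inconsistent.

    Only `X.shape` is read, so a SciPy sparse matrix or a NumPy array both work.
    Assumes `y` holds integer class indices in [0, n_classes).
    """
    y = np.asarray(y)
    if y.ndim != 1:
        raise ValueError(f"labels must be 1-D, got shape {y.shape}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{X.shape[0]} feature rows but {y.shape[0]} labels")
    if X.shape[1] != n_features:
        raise ValueError(f"expected {n_features} features, got {X.shape[1]}")
    if y.size and (y.min() < 0 or y.max() >= n_classes):
        raise ValueError(f"labels fall outside the range 0..{n_classes - 1}")


def samples_per_class(y, n_classes):
    """Count the samples carrying each integer label in [0, n_classes)."""
    return np.bincount(np.asarray(y), minlength=n_classes)
```

In this notebook the calls would look like `check_split(X_test, y_test, X_train.shape[1], len(label_encoder.classes_))`, and `samples_per_class(y_train, len(label_encoder.classes_)).min()` would show whether any genus has no training examples at all.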
