## Description

### PyTorch k-fold Cross-Validation Hyperparameter Tuning (Bayesian Optimization)

This strategy is similar to k-fold cross-validation, but instead of splitting a single dataset into folds, each of the 9 datasets is held out in turn for evaluation while the model is trained on the remaining 8 datasets.
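As a minimal sketch of this leave-one-dataset-out scheme, the generator below yields one train/test split per dataset. The function name `leave_one_dataset_out` and the toy dictionary are illustrative assumptions, not part of `CNN_6_3`; the real data are feature arrays rather than short lists.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_rows, test_rows) once per dataset."""
    for held_out in datasets:
        # The held-out dataset is used only for evaluation.
        test_rows = list(datasets[held_out])
        # Every other dataset is pooled into the training set.
        train_rows = [
            row
            for name, rows in datasets.items()
            if name != held_out
            for row in rows
        ]
        yield held_out, train_rows, test_rows


# Toy stand-ins for the real datasets (names and values are illustrative only).
toy = {"A": [1, 2], "B": [3], "C": [4, 5, 6]}
splits = list(leave_one_dataset_out(toy))
for name, train, test in splits:
    print(f"test={name}: {len(train)} train rows, {len(test)} test rows")
```

With 9 datasets this produces 9 folds, and each dataset is used for testing exactly once.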

A hyperparameter is a setting chosen before training that controls a configurable part of the model or its learning process. For this CNN, the hyperparameters we optimize are:
- The number and structure of Conv1D layers.
- Filter sizes, kernel sizes, and strides.
- Max-pooling sizes and activation functions for each layer.
- The size of the dense layer.
- The learning rate for optimization.
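To make the list above concrete, here is a hedged sketch of what such a search space could look like as a plain dictionary, together with a helper that draws one candidate configuration. Every key name and value range is a hypothetical placeholder; the real search space is defined inside `CNN_6_3` and may differ. The helper samples at random purely to show the structure of a candidate; it is not the Bayesian optimizer itself.

```python
import random

# Hypothetical search space; the real ranges live in CNN_6_3 and may differ.
SEARCH_SPACE = {
    "num_conv_layers": [1, 2, 3],
    "filters": [32, 64, 128],
    "kernel_size": [3, 5, 7],
    "stride": [1, 2],
    "pool_size": [1, 2],
    "activation": ["relu", "tanh"],
    "dense_units": [32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

# Settings that are chosen independently for each Conv1D layer.
PER_LAYER_KEYS = ("filters", "kernel_size", "stride", "pool_size", "activation")


def sample_candidate(space, rng):
    """Draw one configuration uniformly at random from the search space."""
    n_layers = rng.choice(space["num_conv_layers"])
    conv_layers = [
        {key: rng.choice(space[key]) for key in PER_LAYER_KEYS}
        for _ in range(n_layers)
    ]
    return {
        "conv_layers": conv_layers,
        "dense_units": rng.choice(space["dense_units"]),
        "learning_rate": rng.choice(space["learning_rate"]),
    }


candidate = sample_candidate(SEARCH_SPACE, random.Random(0))
print(candidate)
```

A Bayesian optimizer would propose candidates of the same shape, but would choose each new one based on the scores of earlier trials instead of sampling uniformly.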

This approach uses PyTorch to make the sequence generation/infill easier.

### Architecture

Model changes: higher TX/expression now corresponds to a higher prediction.

This version streamlines the training/testing process and adds hyperparameter tuning. It uses a similar architecture to CNN_5_0. It does not include augmented data; it uses only the data from La Fleur's supplemental materials, including:
- La Fleur et al (and De Novo Designs)
- Urtecho et al
- Hossain et al
- Yu et al
- Lagator (36N, Pl, and Pr)
- Anderson Series

We one-hot encode each base pair and pad the whole sequence. Because a CNN is designed to identify local "features," promoters of different lengths can be padded to a common size and passed to the same model to predict expression.
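The encoding step can be sketched as follows. This is an illustration under stated assumptions, not the `load_features` implementation: the channel order `ACGT`, zero-padding on the right, truncation of over-long sequences, and the all-zero row for unknown bases such as `N` are all choices made here for the example.

```python
BASES = "ACGT"  # assumed channel order


def one_hot_pad(sequence, length):
    """Return a `length` x 4 matrix of floats for a DNA sequence.

    Unknown bases and padding positions are encoded as all-zero rows.
    Sequences longer than `length` are truncated (an assumption of this sketch).
    """
    rows = [
        [1.0 if base == b else 0.0 for b in BASES]
        for base in sequence.upper()[:length]
    ]
    # Pad on the right with all-zero rows up to the target length.
    rows.extend([0.0] * len(BASES) for _ in range(length - len(rows)))
    return rows


encoded = one_hot_pad("acgtn", 8)
for row in encoded:
    print(row)
```

Because padding rows are all zeros, they contribute nothing to a convolution without bias, which is one reason zero-padding is a common choice for variable-length inputs.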

In [1]:
from CNN_6_3 import *

In [2]:
epochs = 100

# Paths and filenames
results_file = "../Testing Expression Prediction/Hyperparameter Search/CNN_6_3_cross_validation/hyperparameter_results.json"
runtime_model_path = "../Models/CNN_6_3_runtime.pth"
model_path = '../Models/CNN_6_3.pt'

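# Load all datasets from the directory
import os

files = glob.glob('../Data/Cross Validation/*.csv')
# Use os.path so the dataset name is extracted correctly on any OS, not just Windows
file_data = {os.path.splitext(os.path.basename(file))[0]: load_features(file) for file in files}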

file_keys = ['Anderson_Series',
             'De_Novo_Designs',
             'Hossain_et_al',
             'Lagator_36N',
             'Lagator_Pl',
             'Lagator_Pr',
             'La_Fleur_et_al',
             'Urtecho_et_al',
             'Yu_et_al'
]

# Load previous results (or start fresh)
results = load_results(results_file)

all_mse_scores = []
best_hyperparams = None
best_mse = float('inf')

#### Wrapping each fold in a function and calling it separately means that if something goes wrong, we don't need to restart from scratch (model states are also saved and loaded as a backup)
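The resume-after-failure idea can be sketched with a small JSON checkpoint. The helpers below are hypothetical stand-ins (`load_results_sketch`, `save_results_sketch`), not the `load_results` and `save_results` functions from `CNN_6_3`, whose signatures are not shown here. Only the dictionary keys are taken from the notebook code.

```python
import json
import os
import tempfile


def load_results_sketch(path):
    """Load earlier results if the file exists, otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {
        "hyperparameter_trials": [],
        "best_mse": None,
        "best_hyperparameters": None,
    }


def save_results_sketch(results, path):
    """Write to a temporary file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(results, handle, indent=2)
    # os.replace swaps the file in one step, so a crash cannot leave it half-written.
    os.replace(tmp_path, path)


# Usage: record one fold, then reload as a later session would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.json")
    results = load_results_sketch(path)
    results["hyperparameter_trials"].append({"fold": 1, "mse": 0.5})
    save_results_sketch(results, path)
    reloaded = load_results_sketch(path)

print(reloaded["hyperparameter_trials"])
```

Saving after every fold means a crash in fold 5 loses at most that fold's work; the earlier folds can be read back from disk.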

In [3]:
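def run_loocv(i):
    # best_mse and best_hyperparams are reassigned below, so they must be
    # declared global; otherwise the comparison raises UnboundLocalError
    global best_mse, best_hyperparams
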
    test_key = file_keys[i]
    X_test, y_test = file_data[test_key]

    print(f"Fold {i + 1}: Test File = {test_key}")

    # Prepare training data for the current fold
    X_train_list = [X for key, (X, y) in file_data.items() if key != test_key]
    y_train_list = [y for key, (X, y) in file_data.items() if key != test_key]

    X_train = np.concatenate(X_train_list, axis=0)
    y_train = np.concatenate(y_train_list, axis=0)

    # Define input shape for the model
    input_shape = (X_train.shape[0], X_train.shape[1], X_train.shape[2])

    # Perform hyperparameter search (for this fold)
    params = hyperparameter_search(X_train, y_train, input_shape, epochs)
    print(f"Best Hyperparameters for Fold {i + 1}: {params}")

    # Train the model with the best hyperparameters for this fold
    model = PyTorchRegressor(input_shape, params, epochs=epochs)
    model.fit(X_train, y_train)

    # Make predictions on the test dataset
    y_pred = model.predict(X_test)

    # Calculate MSE for this fold
    mse = mean_squared_error(y_test, y_pred)
    print(f"Fold {i + 1}: MSE = {mse:.4f}")
    all_mse_scores.append(mse)

    # Save the hyperparameters, fold metrics, and MSE to results JSON
    trial_data = {
        "fold": i + 1,
        "hyperparameters": params,
        "mse": mse,
        "training_data_size": len(X_train),
        "test_data_size": len(X_test)
    }
    results["hyperparameter_trials"].append(trial_data)

    # Check if this fold has the best MSE so far
    if mse < best_mse:
        best_mse = mse
        best_hyperparams = params

        # Update best model metrics in the results JSON
        results["best_mse"] = best_mse
        results["best_hyperparameters"] = best_hyperparams

        # Save the best model, overwriting the previous one
        save_best_model(model, runtime_model_path)

    # Save the updated results to the JSON file after each fold
    save_results(results)

In [None]:
run_loocv(0)

In [None]:
run_loocv(1)

In [None]:
run_loocv(2)

In [None]:
run_loocv(3)

In [None]:
run_loocv(4)

In [None]:
run_loocv(5)

In [None]:
run_loocv(6)

In [None]:
run_loocv(7)

In [None]:
run_loocv(8)

In [None]:
# Print final summary
print(f"All MSE Scores: {all_mse_scores}")
print(f"Best Hyperparameters: {best_hyperparams}")
print(f"Best MSE: {best_mse}")

# Train the final model with the best hyperparameters on all data
X_all = np.concatenate([X for X, y in file_data.values()], axis=0)
y_all = np.concatenate([y for X, y in file_data.values()], axis=0)

input_shape = (X_all.shape[0], X_all.shape[1], X_all.shape[2])

final_model = PyTorchRegressor(input_shape, best_hyperparams, epochs=epochs)
final_model.fit(X_all, y_all)

# Save the final model
torch.save(final_model.model.state_dict(), model_path)
print(f"Final model saved to {model_path}")