# Predicting Lethality of rAAV Drug Formulations using Neural Networks

#### By Emmitt Tucker 

## Introduction

Recombinant adeno-associated virus (rAAV) drug formulation is a time-consuming and expertise-intensive task that can involve years of research and experimentation. Testing every formulation condition experimentally to determine lethality is resource-intensive, expensive, and often impractical. As a potential solution, this proof-of-concept project explores the use of artificial neural networks to predict the lethality of different rAAV drug formulations, with the goal of streamlining formulation optimization.<br>
<br>
To make this project more approachable for researchers with limited experience in AI, machine learning, and coding, we have received invaluable assistance from ChatGPT, an AI language model. ChatGPT's collaboration has facilitated the generation of key content, including this introduction, data pre-processing, model evaluation, and neural network architecture sections.<br>
<br>
Moreover, this project embraces the spirit of citizen science, enabling amateur scientists and enthusiasts to actively contribute to scientific research. By leveraging AI-driven methodologies, non-experts can participate in drug formulation development and contribute to advancements in biopharmaceuticals and gene therapy.

### Disclaimer!

This project is a proof of concept only. It should not be used directly in research projects until it has been validated as fit for the user's purpose.

## Data Pre-Processing

Data preprocessing is a crucial step in preparing the dataset for training the neural network model. In this project, the provided dataset contains a combination of numerical and categorical variables related to rAAV drug formulations. The data preprocessing steps involve handling missing values, encoding categorical variables, and normalizing numerical variables to ensure the neural network can effectively learn from the data.

### Handling Missing Values 

Missing values can adversely affect the performance of the neural network model, so they should be identified and handled before any further steps. Common options are mean or median imputation for missing numerical values and mode imputation for missing categorical values. The synthetic data initially used in this project had no missing values, so the code in this notebook does not include an imputation step.
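
As a minimal sketch of the imputation options described above, the following uses pandas on a few invented rows. The column names follow the dataset, but the values and the choice of median over mean are assumptions made for illustration; this is not part of the project code.

```python
import numpy as np
import pandas as pd

# Invented toy rows (not project data) with one missing value per column.
toy = pd.DataFrame({
    "Buffer_pH": [6.0, np.nan, 7.0, 8.0],
    "Buffer_Type": ["Phosphate", "Histidine", "Phosphate", None],
})

# Numerical column: fill gaps with the median of the observed values.
toy["Buffer_pH"] = toy["Buffer_pH"].fillna(toy["Buffer_pH"].median())

# Categorical column: fill gaps with the most frequent category (the mode).
toy["Buffer_Type"] = toy["Buffer_Type"].fillna(toy["Buffer_Type"].mode()[0])

print(toy)
```

Median imputation is less sensitive to outliers than mean imputation, which is why it is used in this sketch.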

### Encoding Categorical Variables 

Neural networks require numerical input, and categorical variables need to be encoded to facilitate training. In this project, we used the LabelEncoder from the scikit-learn library to convert categorical variables, such as Cryoprotectant_Type, Lypoprotectant_Type, Buffer_Type, BulkingAgent_Type, Preservative_Type, and Lethality, into numerical values. Each unique category is assigned a unique integer, allowing the neural network to process categorical information effectively.
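
To make the integer mapping concrete, here is a small sketch of how LabelEncoder treats one categorical column. The cryoprotectant names are invented example values, not taken from the project dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Invented example values for one categorical column.
cryoprotectants = ["Sucrose", "Trehalose", "Sucrose", "Glycerol"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cryoprotectants)

# Categories are sorted alphabetically and numbered from 0.
print(encoder.classes_.tolist())  # ['Glycerol', 'Sucrose', 'Trehalose']
print(codes.tolist())             # [1, 2, 1, 0]

# The mapping is reversible, which helps when interpreting predictions.
decoded = encoder.inverse_transform([0, 2]).tolist()
print(decoded)                    # ['Glycerol', 'Trehalose']
```

Because the codes follow alphabetical order, the integer assigned to a category carries no chemical meaning.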

### Normalizing Numerical Variables 

Since the numerical variables in the dataset have different scales, normalization is essential to prevent certain features from dominating the learning process during training. We used the StandardScaler from scikit-learn to scale the numerical variables (Vector_Concentration, Cryoprotectant_Concentration, Lypoprotectant_Concentration, Buffer_pH, Buffer_Concentration, and Preservative_Concentration). StandardScaler standardizes each feature to zero mean and unit variance, giving all numerical variables a consistent scale. Note that the code in this notebook fits the scaler on the full dataset before splitting, so the scaling statistics include the test rows; fitting on the training set only would give a stricter evaluation.
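
The sketch below shows what standardization does to a single feature, using invented numbers. Unlike the notebook code, it fits the scaler on the training rows only and then reuses those statistics for the test rows; that ordering is an assumption about good practice, not a description of the project code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented concentrations (one feature), already split into train and test.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Each value becomes z = (x - mean) / std, using the population std (ddof=0).
manual = (train - train.mean()) / train.std()
print(np.allclose(train_scaled, manual))  # True
print(test_scaled)
```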

### Splitting Data Into Training and Test Sets 

Before feeding the data into the neural network, it is split into a training set and a testing set. The training set is used to train the neural network, while the testing set is held back to evaluate its performance on unseen data. In this project, an 80-20 split was employed, with 80% of the data used for training and 20% for testing, and a fixed random_state of 42 so that the split is reproducible.
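
A minimal sketch of an 80-20 split on invented data follows. The `stratify=y` argument is an addition not used in the project code; it is included here on the assumption that keeping the class proportions equal in both sets is desirable for a binary target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 10 samples, 3 features, balanced binary labels.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
print(sorted(y_test.tolist()))      # [0, 1]
```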

### Limitations and Considerations 

It is crucial to acknowledge certain limitations and considerations in the data preprocessing phase. The use of imputation techniques to handle missing values assumes that the imputed values adequately represent the true values, which may not always be the case. In some scenarios, the presence of missing data may require more sophisticated imputation methods or careful handling based on domain knowledge. <br>
<br>
Additionally, LabelEncoder assigns each category an integer code in sorted order of the category names. A neural network treats these codes as quantities, which can introduce an unintended ordering between categories. For nominal variables (without an inherent order), one-hot encoding or other categorical encoding methods may be more appropriate. <br>
<br>
In conclusion, data preprocessing is a critical step to ensure the dataset is appropriately prepared for neural network training. Proper handling of missing values, categorical encoding, and numerical normalization are essential for the successful implementation of the model. While this project makes efforts to address these preprocessing steps, further exploration and adaptation may be needed based on the specific characteristics and complexity of real-world data. <br>
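
The one-hot alternative mentioned above can be sketched with pandas. The rows are invented, and `pd.get_dummies` is one possible tool among several; the project code itself uses LabelEncoder.

```python
import pandas as pd

# Invented toy rows with one nominal column.
toy = pd.DataFrame({
    "Buffer_pH": [6.5, 7.0, 7.4],
    "Buffer_Type": ["Phosphate", "Histidine", "Citrate"],
})

# Each category becomes its own 0/1 column, so no ordering is implied.
encoded = pd.get_dummies(toy, columns=["Buffer_Type"], dtype=int)
print(encoded.columns.tolist())
# ['Buffer_pH', 'Buffer_Type_Citrate', 'Buffer_Type_Histidine', 'Buffer_Type_Phosphate']
print(encoded)
```

One-hot encoding increases the number of input columns, so the input size of the network grows with the number of distinct categories.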

## Neural Network Architecture

The neural network architecture is the design of the "brain" of our model, where it learns from the data and makes predictions. We aim to create a simple yet effective neural network that can predict the lethality of rAAV drug formulations based on various input features.

### Input Layer

The neural network starts with an input layer, which is responsible for receiving the input data. It has one input per feature in the dataset, that is, one per formulation parameter: the scaled numerical columns (e.g., Vector_Concentration, Cryoprotectant_Concentration, Lypoprotectant_Concentration, Buffer_pH, Buffer_Concentration, and Preservative_Concentration) as well as the integer-encoded categorical columns. In the Keras code, the input size is declared through the input_shape argument of the first Dense layer rather than as a separate layer.

### Activation Function

The inputs feed a first fully connected hidden layer that uses an activation function called ReLU (Rectified Linear Unit). ReLU introduces non-linearity to the model, which is crucial for enabling the neural network to learn complex patterns. When the input to a neuron is positive or zero, ReLU outputs the same value; when the input is negative, ReLU outputs zero. As a result, only some neurons are active for any given input.
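
ReLU is simple enough to write out directly. The following NumPy sketch uses invented pre-activation values to show the behavior described above; Keras applies the same function internally when `activation='relu'` is set.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keep positive values, replace negatives with 0."""
    return np.maximum(0.0, x)

# Invented pre-activation values for five neurons.
pre_activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(pre_activations)
print(activated.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```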

### Dropout Layer

Next, we add a dropout layer after the first activation function. The dropout layer helps prevent overfitting, which occurs when the model becomes too specialized in the training data and does not generalize well to new, unseen data. During training, the dropout layer randomly deactivates (sets to zero) a fraction of neurons. This dropout forces the neural network to learn more robust features, reducing overreliance on specific neurons and improving generalization to new data.
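
The mechanism can be sketched in NumPy as "inverted dropout", where surviving values are scaled up during training so that nothing needs to change at prediction time. The function name and the toy activations are hypothetical and only illustrate the idea; the project relies on the Keras Dropout layer.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction of values during training and
    scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    if not training:
        return activations  # dropout is inactive at prediction time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones(10)  # invented activations, all equal to 1 for readability

train_out = dropout(hidden, rate=0.2, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.2, training=False, rng=rng)
print(train_out)  # each entry is either 0.0 or 1.25
print(eval_out)   # unchanged
```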

### Hidden Layer

Following the dropout layer, we add a second hidden layer with ReLU activation. This hidden layer is responsible for learning more complex relationships between the input features and their impact on lethality predictions. The number of neurons in each hidden layer is a hyperparameter (units_layer1 and units_layer2 in the code), and in this project we explore different configurations through random search.
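
What a fully connected hidden layer computes can be sketched in a few lines of NumPy. The layer sizes below match the defaults of `create_model` (64 units feeding 32 units); the batch size of 4 and the random, untrained weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_inputs, n_units = 4, 64, 32  # batch size is an arbitrary choice
previous_output = rng.random((n_samples, n_inputs))       # stand-in for layer 1 output
weights = rng.normal(scale=0.1, size=(n_inputs, n_units))  # untrained, random
bias = np.zeros(n_units)

# Dense layer with ReLU: a weighted sum per neuron, then negatives set to 0.
hidden = np.maximum(0.0, previous_output @ weights + bias)
print(hidden.shape)  # (4, 32)
```

During training, the optimizer adjusts `weights` and `bias`; here they stay random, so only the shapes and the non-negative outputs are meaningful.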

### Output Layer 

Finally, we have the output layer, which produces the final prediction. Since we are performing binary classification (predicting lethality or non-lethality), we use a single neuron in the output layer with a sigmoid activation function. The sigmoid function compresses the output to a value between 0 and 1, which is interpreted as the probability that the input belongs to the class encoded as 1. Which label that is (lethal or non-lethal) depends on how LabelEncoder ordered the values of the Lethality column.
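
The sigmoid function and the step from probability to class label can be sketched as follows. The raw scores are invented, and the 0.5 decision threshold is a common default assumed here rather than something stated in the project.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Invented raw scores from the output neuron.
raw_scores = np.array([-3.0, 0.0, 3.0])
probabilities = sigmoid(raw_scores)
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities.round(3).tolist())  # [0.047, 0.5, 0.953]
print(predicted_class.tolist())         # [0, 1, 1]
```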

### Compilation and Optimization 

Once the neural network architecture is defined, we compile the model with specific settings. We use the Adam optimizer, which is an efficient optimization algorithm, to adjust the model's parameters during training. The loss function chosen for this binary classification task is binary cross-entropy, which measures the difference between the predicted probabilities and the actual labels.
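
Binary cross-entropy averages the quantity -(y log p + (1 - y) log(1 - p)) over all samples, where y is the true label and p the predicted probability. The NumPy sketch below uses invented labels and predictions to show that confident wrong answers are penalized far more than confident right ones; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(loss_right)  # about 0.105
print(loss_wrong)  # about 2.303
```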

## Hyperparameter Tuning

Hyperparameter tuning is a critical process that allows us to optimize the neural network model for predicting rAAV drug formulation lethality. However, it can be a daunting task, especially for researchers with limited experience in machine learning and coding. To address this challenge, we adopt a user-friendly and automated approach to hyperparameter tuning. <br>
<br>
In this project, we use random search to explore combinations of hyperparameters: the number of neurons in each hidden layer (units_layer1 from 32 to 120 and units_layer2 from 16 to 56, both in steps of 8), the dropout rate (0.1 to 0.4), and the learning rate (0.001, 0.01, or 0.1). RandomizedSearchCV samples 10 configurations at random and scores each one with 3-fold cross-validation on the training set, so researchers do not need to fine-tune these settings by hand.<br>
<br>
This automated hyperparameter tuning serves as a valuable aid for novice users, enabling them to focus on other critical aspects of research without becoming overwhelmed by intricate model optimization. Researchers with limited expertise in machine learning can now rapidly experiment with different configurations, efficiently exploring the hyperparameter space to identify the optimal neural network architecture for predicting lethality.<br>
<br>
By simplifying the process of model optimization, this approach empowers researchers from diverse backgrounds, including wetlab biologists, to delve into AI-driven solutions without the need for extensive ML knowledge. The random search method provides an accessible gateway for novice users to harness the power of neural networks effectively.<br>
<br>
Through this user-friendly hyperparameter tuning process, we demonstrate how AI, in the form of ChatGPT, collaborates with researchers to streamline model development. This collaborative effort enables researchers to focus on the formulation development itself, while ChatGPT helps optimize the neural network model, paving the way for faster and more efficient drug product development.<br>
<br>
In conclusion, the random search-based hyperparameter tuning method simplifies model optimization, making it accessible to researchers with limited experience in machine learning and coding. This approach exemplifies the potential of AI to support and empower researchers in diverse fields, enabling them to accelerate research, gain valuable insights, and contribute to scientific progress with AI-driven solutions.
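
To show what random search actually samples, the sketch below draws 10 candidate configurations from the same search space as the project code, using scikit-learn's ParameterSampler. No model is trained here, and the fixed `random_state=0` is an assumption added for reproducibility; the notebook does not set a seed for its search.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same search space as the notebook's param_dist.
param_dist = {
    "units_layer1": np.arange(32, 128, 8),
    "units_layer2": np.arange(16, 64, 8),
    "dropout_rate": np.arange(0.1, 0.5, 0.1),
    "learning_rate": [0.001, 0.01, 0.1],
}

# Draw 10 random configurations; RandomizedSearchCV would train and
# cross-validate one model per configuration.
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=0))
for params in candidates[:3]:
    print(params)
```

With only 10 draws from a much larger grid, the best configuration found is not guaranteed to be optimal; increasing `n_iter` trades compute time for coverage.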

## Model Evaluation 

The neural network model's performance is a critical aspect of this project, as it determines the predictive capability in identifying lethal and non-lethal rAAV drug formulations. The model was trained on a synthetic dataset, which was not subjected to peer review or experimental validation. As such, it is essential to approach the model evaluation with a cautious perspective and recognize its limitations.

### Evaluation Metrics

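Several evaluation metrics are relevant for assessing the model's effectiveness on the test dataset. The code shown in this document reports only cross-validated accuracy; the remaining metrics are described here because they matter for a fuller evaluation. The primary evaluation metrics are:<br>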
<br>
1) Accuracy: The proportion of correctly predicted outcomes (both lethal and non-lethal) over the total number of samples. While accuracy is a commonly used metric, it may not provide a comprehensive assessment when classes are imbalanced.<br>
<br>
2) Confusion Matrix: A confusion matrix helps visualize the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions made by the model. It provides insights into how well the model distinguishes between lethal and non-lethal formulations. <br>
<br>
3) Precision: Precision measures the proportion of true positive predictions (lethal formulations) over the total number of positive predictions. It helps understand the model's accuracy in identifying lethal formulations specifically. <br>
<br>
4) Recall (Sensitivity): Recall represents the proportion of true positive predictions over the total number of actual positive samples (lethal formulations). It reflects the model's ability to correctly identify lethal formulations. <br>
<br>
5) F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balanced assessment of the model's overall performance.
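
The five metrics listed above can be computed with scikit-learn as in the following sketch. The labels and predictions are invented, with 1 assumed to mean lethal and 0 non-lethal; they are not results from the project model.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 1 = lethal, 0 = non-lethal.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 2 2

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 5/8
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2/3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2/4
f1 = f1_score(y_true, y_pred)                # harmonic mean = 4/7

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.625 precision=0.667 recall=0.500 f1=0.571
```

In this toy case the model misses half of the lethal formulations (recall 0.5) even though accuracy looks moderate, which illustrates why accuracy alone can be misleading.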

### Performance and Interpretation

The best configuration reached a mean accuracy of approximately 54% (0.5412) in 3-fold cross-validation on the training set. For a binary classification task with roughly balanced classes, this is close to the 50% expected from random guessing, so there is substantial room for improvement. Performance evaluation also goes beyond accuracy: a confusion matrix would reveal the trade-off between true positive and false positive predictions, which is particularly important when dealing with potentially lethal formulations.<br>
<br>
Precision, recall, and F1-score complement the accuracy metric by offering a deeper understanding of the model's strengths and weaknesses. A high precision means that few formulations flagged as lethal are actually non-lethal (few false positives), while a high recall means that few truly lethal formulations are missed (few false negatives). Balancing these metrics is crucial to ensure the model's reliability in predicting potentially lethal drug formulations.

## Conclusion

In conclusion, this proof of concept illustrates the potential of AI-driven citizen science in drug formulation development. Assisted by ChatGPT, researchers with limited expertise have navigated complex AI-driven methodologies and built a working pipeline that applies artificial neural networks to predicting the lethality of rAAV drug formulations.<br>
<br>
While the current cross-validated accuracy of approximately 54% serves only as a starting point, further refinement and validation with real-world experimental data are essential for future improvements. The automated hyperparameter tuning demonstrated how AI can simplify model optimization, enabling researchers to focus on the formulation development itself.<br>
<br>
This project highlights the collaborative potential of researchers and AI working together. By democratizing access to AI technologies, we envision a future where non-experts actively contribute to scientific research, accelerating the pace of drug development and fostering collective intelligence.<br>
<br>
As a proof of concept, this project underscores the importance of validation with experimental data and peer review. Nonetheless, it showcases how AI can be harnessed as a powerful tool in citizen science, empowering researchers from diverse backgrounds to embark on AI-driven solutions and drive innovation in scientific discovery.

## Overall Code Process Summarized

1) Loads the data from the CSV file into a pandas DataFrame.<br>
<br>
2) Encodes categorical variables to numerical values using LabelEncoder.<br>
<br>
3) Splits the data into features (X) and the target variable (y).<br>
<br>
4) Normalizes the numerical variables using StandardScaler.<br>
<br>
5) Splits the data into training and testing sets using train_test_split.<br>
<br>
6) Defines a function create_model that constructs the neural network architecture, compiles it, and returns the model.<br>
<br>
7) Creates a KerasClassifier wrapper for the create_model function, allowing it to be used with scikit-learn utilities.<br>
<br>
8) Defines the hyperparameter search space for RandomizedSearchCV to explore different combinations of hyperparameters.<br>
<br>
9) Performs random search with cross-validation using RandomizedSearchCV to find the best model architecture and hyperparameters.<br>
<br>
10) Prints the best score (accuracy) and the corresponding best hyperparameters found during the search.

## The Code

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
# Note: this wrapper module is deprecated and absent from recent TensorFlow
# releases; the separate SciKeras package provides a similar KerasClassifier.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Load the formulation dataset
df = pd.read_csv('original_data2.csv')

# Encode categorical variables to numerical values
label_encoder = LabelEncoder()
categorical_cols = [
    'Cryoprotectant_Type', 'Lypoprotectant_Type', 'Buffer_Type',
    'BulkingAgent_Type', 'Preservative_Type', 'Lethality',
]

for col in categorical_cols:
    # fit_transform refits the encoder for each column
    df[col] = label_encoder.fit_transform(df[col])

# Split the data into features (X) and target (y)
X = df.drop(columns=['Lethality'])
y = df['Lethality']

# Normalize the numerical variables to have zero mean and unit variance.
# Caveat: the scaler is fit on all rows, so test-set statistics influence
# the scaling; fit on the training rows only for a stricter evaluation.
scaler = StandardScaler()
numerical_cols = [
    'Vector_Concentration', 'Cryoprotectant_Concentration',
    'Lypoprotectant_Concentration', 'Buffer_pH', 'Buffer_Concentration',
    'Preservative_Concentration',
]
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a function to create the neural network model
def create_model(units_layer1=64, units_layer2=32, dropout_rate=0.2, learning_rate=0.001):
    model = keras.Sequential([
        # First hidden layer; input_shape declares the number of features
        layers.Dense(units_layer1, activation='relu', input_shape=(X_train.shape[1],)),
        layers.Dropout(dropout_rate),
        # Second hidden layer
        layers.Dense(units_layer2, activation='relu'),
        # Single sigmoid output for binary classification
        layers.Dense(1, activation='sigmoid')
    ])

    # Compile the model with the given learning rate
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

    return model

# Create a KerasClassifier wrapper for the create_model function
model_wrapper = KerasClassifier(build_fn=create_model, epochs=100, batch_size=32, verbose=0)

# Define the hyperparameter search space
param_dist = {
    'units_layer1': np.arange(32, 128, 8),
    'units_layer2': np.arange(16, 64, 8),
    'dropout_rate': np.arange(0.1, 0.5, 0.1),
    'learning_rate': [0.001, 0.01, 0.1]
}

# Perform random search with 3-fold cross-validation over 10 sampled configurations
random_search = RandomizedSearchCV(model_wrapper, param_distributions=param_dist, cv=3, n_iter=10, scoring='accuracy', verbose=1)
random_search.fit(X_train, y_train)

# Get the best model and its mean cross-validated accuracy
best_model = random_search.best_estimator_
best_score = random_search.best_score_

print(f"Best score: {best_score:.4f}")
print(f"Best parameters: {random_search.best_params_}")
```


```
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best score: 0.5412
Best parameters: {'units_layer2': 48, 'units_layer1': 32, 'learning_rate': 0.001, 'dropout_rate': 0.30000000000000004}
```
