# OpenSource Citizen Science rAAV Drug Product Formulation Data Generator

## Introduction

### The Problem 

I went looking online for open-source training data from recombinant adeno-associated virus (rAAV) drug product formulation development, hoping to use it to develop a machine learning algorithm. Try as I might, I could not find anything that fit my purpose.

### The Solution 

To get training data for a potential machine learning model, I needed to generate synthetic data systematically. I am not a formulation scientist, or even a chemist, but I was able to work out some of the basics of rAAV formulation through ChatGPT. I was also able to work out how to generate somewhat realistic data using ChatGPT.

### The Code 

import pandas as pd
import numpy as np

# Generate synthetic data
np.random.seed(42)
num_samples = 1000

# AAV Serotype
aav_serotypes = ['AAV1', 'AAV2', 'AAV3B', 'AAV5', 'AAV6', 'AAV8','AAV9']
aav_serotype = np.random.choice(aav_serotypes, num_samples)

# Vector Concentration in VG/mL
vector_concentration = np.random.uniform(1.0e12, 1.0e14, num_samples)

#Cryoprotectant Type
cryoprotectant_types = ['DMSO','Glycerol','PEG','Trehalose','Sucrose','Mannitol','Sorbitol','Ethylene_Glycol','Propylene_Glycol']
cryoprotectant_type = np.random.choice(cryoprotectant_types, num_samples)

#Cryoprotectant Concentration in v/v or v/w
cryoprotectant_concentration = np.random.uniform(1.0, 10.0, num_samples)

#Lyoprotectant Type
lypoprotectant_types = ['Trehalose','Mannitol','Sucrose','Sorbitol','Maltose','Lactose','Mannose']
lypoprotectant_type = np.random.choice(lypoprotectant_types, num_samples)

#Lyoprotectant Concentration in v/v or v/w
lyoprotectant_concentration = np.random.uniform(1.0, 10.0, num_samples)

#Surfactant Type 
surfactant_types = ['Polysorbate_80', 'Poloxamer','Pluronic_F68','Triton_X100','SDS','Pluronics']
surfactant_type = np.random.choice(surfactant_types, num_samples)

#Surfactant Concentration in v/v
surfactant_concentration = np.random.uniform(0.01, 0.1,num_samples)

# Buffer Type
buffer_types = ['PBS', 'HEPES', 'Tris', 'Sodium_Phosphate','Acetate_Buffer']
buffer_type = np.random.choice(buffer_types, num_samples)

# Buffer pH
buffer_pH = np.random.uniform(7.2, 8.5, num_samples)

# Buffer Concentration in mol/L
buffer_concentration = np.random.uniform(0.01, 0.15, num_samples)

# Bulking Agent Type
bulkingagent_types = ['Mannitol','Sucrose','Sorbitol']
bulkingagent_type = np.random.choice(bulkingagent_types, num_samples)

# Bulking Agent Concentration in moles
bulkingagent_concentration = np.random.uniform(1,10, num_samples)

# Preservatives Type
preservative_types = ['Ethanol','Phenol','Benzyl_Alcohol']
preservative_type = np.random.choice(preservative_types, num_samples)

# Preservative Concentration in v/v
preservative_concentration = np.random.uniform(0.1, 2, num_samples)

# Stability (in months); np.random.randint excludes the upper bound, so values run from 6 to 11
stability = np.random.randint(6, 12, num_samples)

# Lethality types
lethality_types = ['Yes','No']
lethality_type = np.random.choice(lethality_types, num_samples)


# Assemble all samples into a single DataFrame, one column per variable
data_df = pd.DataFrame({
    'AAV_Serotype': aav_serotype,
    'Vector_Concentration': vector_concentration,
    'Cryoprotectant_Type': cryoprotectant_type,
    'Cryoprotectant_Concentration': cryoprotectant_concentration,
    'Lyoprotectant_Type': lypoprotectant_type,
    'Lyoprotectant_Concentration': lyoprotectant_concentration,
    'Surfactant_Type': surfactant_type,
    'Surfactant_Concentration': surfactant_concentration,
    'Buffer_Type': buffer_type,
    'Buffer_pH': buffer_pH,
    'Buffer_Concentration': buffer_concentration,
    'BulkingAgent_Type': bulkingagent_type,
    'BulkingAgent_Concentration': bulkingagent_concentration,
    'Preservative_Type': preservative_type,
    'Preservative_Concentration': preservative_concentration,
    'Stability': stability,
    'Lethality': lethality_type
})

# Save the dataset to a CSV file
data_df.to_csv('synthetic_rAAV_formulation_data.csv', index=False)


### Known Errors 

As mentioned, I am not a formulation scientist, just a simple molecular biologist. I also wrote this code in about 4 hours, so I was unable to add all of the features that I wanted.

For example, the concentrations do not depend on the type of material they belong to, say, the buffer type. The Lethality column is also drawn at random, independently of the rest of the formulation. So the data will contain obviously lethal combinations that the algorithm would not be able to identify. My solution is to add conditional logic to the data generation so that each concentration range is specific to the type of material it is associated with.
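
A minimal sketch of what that conditional generation could look like, using surfactants as the example. The ranges in `surfactant_ranges` are hypothetical placeholders that I made up for illustration, not validated formulation values, and both the dictionary and the `conditional_df` name are assumptions that are not part of the code above. It also uses `np.random.default_rng` instead of `np.random.seed`, which is just a choice for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
num_samples = 1000

# HYPOTHETICAL (low, high) concentration bounds per surfactant type.
# These numbers are placeholders and would need review by a formulation scientist.
surfactant_ranges = {
    'Polysorbate_80': (0.001, 0.05),
    'Pluronic_F68': (0.001, 0.10),
}

# Pick a surfactant type for every sample
surfactant_type = rng.choice(list(surfactant_ranges), size=num_samples)

# Look up the bounds that belong to each sample's surfactant type
low = np.array([surfactant_ranges[s][0] for s in surfactant_type])
high = np.array([surfactant_ranges[s][1] for s in surfactant_type])

# Draw one concentration per sample from its own type-specific range
surfactant_concentration = rng.uniform(low, high)

conditional_df = pd.DataFrame({
    'Surfactant_Type': surfactant_type,
    'Surfactant_Concentration': surfactant_concentration,
})
```

The same lookup pattern could be repeated for buffers, cryoprotectants, and the other excipients, with one dictionary of ranges per material class.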

If you're reading this and have any comments, I would love to hear them!