## Imports

In [61]:
import pandas as pd
import numpy as np

df_AA2024 = pd.read_excel('/workspaces/project-project-surface-science-syndicate/data/filtered_AA2024.xlsx')
print(df_AA2024.describe())

           Time_h          pH  Inhib_Concentrat_M  Salt_Concentrat_M  \
count  611.000000  611.000000          611.000000         611.000000   
mean   135.801964    6.342062            0.006808           0.145450   
std    201.683867    2.529080            0.014059           0.200575   
min      0.500000    0.000000            0.000010           0.000000   
25%     24.000000    4.000000            0.000500           0.010000   
50%     24.000000    7.000000            0.001000           0.100000   
75%    144.000000    7.000000            0.003000           0.100000   
max    672.000000   10.000000            0.100000           0.600000   

        Efficiency  
count   611.000000  
mean     26.736841  
std     288.788317  
min   -4834.000000  
25%      30.000000  
50%      58.000000  
75%      87.950000  
max     100.000000  


In [62]:
print(df_AA2024.head())

                                    SMILES  Time_h    pH  Inhib_Concentrat_M  \
0             COCCOC(=O)OCSc1nc2c(s1)cccc2    24.0   4.0               0.001   
1             COCCOC(=O)OCSc1nc2c(s1)cccc2    24.0  10.0               0.001   
2            Cc1ccc(c(c1)n1nc2c(n1)cccc2)O    24.0   4.0               0.001   
3            Cc1ccc(c(c1)n1nc2c(n1)cccc2)O    24.0  10.0               0.001   
4  Clc1ccc(cc1)CC[C@](C(C)(C)C)(Cn1cncn1)O    24.0   4.0               0.001   

   Salt_Concentrat_M  Efficiency  
0                0.1         0.0  
1                0.1         0.0  
2                0.1        30.0  
3                0.1        30.0  
4                0.1        30.0  


Construct dataframe to work with

In [63]:
df = df_AA2024

### Set targets/objectives = efficiency for now

In [64]:
from baybe.targets import NumericalTarget
from baybe.objective import Objective

target = NumericalTarget(
    name="Efficiency",
    mode="MAX",
)
objective = Objective(mode="SINGLE", targets=[target])

### Search Space

In [65]:
from baybe.parameters import NumericalContinuousParameter, CategoricalParameter
from baybe.searchspace import SearchSpace

parameters = [
    CategoricalParameter(
        name="SMILES",
        values=df["SMILES"].unique().tolist(),
        encoding="OHE",
    ),
    NumericalContinuousParameter(
        name="Time_h",
        bounds=(df["Time_h"].min(), df["Time_h"].max()),
    ),
    NumericalContinuousParameter(
        name="pH",
        # NOTE: fixed bounds; the data itself spans pH 0 to 10 (see describe() above)
        bounds=(1, 14),
    ),
    NumericalContinuousParameter(
        name="Inhib_Concentrat_M",
        bounds=(df["Inhib_Concentrat_M"].min(), df["Inhib_Concentrat_M"].max()),
    ),
    NumericalContinuousParameter(
        name="Salt_Concentrat_M",
        bounds=(df["Salt_Concentrat_M"].min(), df["Salt_Concentrat_M"].max()),
    ),
]

**Substance parameter**

Instead of `values`, the `SubstanceParameter` accepts `data` in the form of a dictionary whose items are pairs of labels and SMILES. SMILES are string-based representations of molecular structures. Based on these, BayBE can assign each label a set of molecular descriptors as encoding.

For instance, a parameter corresponding to a choice of solvents can be initialized with a dictionary such as `{"Water": "O", "Toluene": "CC1=CC=CC=C1"}`.

The descriptor calculations typically result in 500 to 1500 numbers per molecule. **To avoid detrimental effects on the surrogate model fit, we reduce the number of descriptors via decorrelation before using them.** For instance, setting `decorrelate=0.7` specifies that only descriptors with a correlation lower than 0.7 to any other descriptor will be kept. This usually reduces the number of descriptors to 10-50, depending on the specific items in `data`.
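To make the decorrelation idea concrete, here is a minimal dependency-free sketch. It is not BayBE's implementation: the greedy keep-or-drop order, the helper names `pearson` and `decorrelate`, and the descriptor values are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def decorrelate(descriptors, threshold=0.7):
    """Greedily keep columns whose |correlation| with every kept column is below threshold."""
    kept = {}
    for name, column in descriptors.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept


# Hypothetical descriptor values for four molecules (illustration only).
descriptors = {
    "mol_weight": [18.0, 130.2, 92.1, 46.1],
    "heavy_atoms": [1.0, 9.0, 7.0, 3.0],  # almost proportional to mol_weight
    "ring_count": [0.0, 1.0, 0.0, 1.0],
}

reduced = decorrelate(descriptors)
print(list(reduced))
```

Because `heavy_atoms` is almost perfectly correlated with `mol_weight`, it is dropped, while the weakly correlated `ring_count` survives. With hundreds of real descriptors the same filter removes most of the redundant columns.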

In [66]:
"""
The encoding concept introduced above is generalized by the CustomParameter. Here, the user is expected to provide their own descriptors for the encoding.

Take, for instance, a parameter that corresponds to the choice of a polymer. Polymers are not well represented by the small molecule descriptors utilized in the SubstanceParameter. 
Still, one could provide experimental measurements or common metrics used to classify polymers:
from baybe.parameters import CustomDiscreteParameter

# Create or import new dataframe containing custom descriptors

descriptors = pd.DataFrame(
    {
        "Glass_Transition_TempC": [20, -71, -39],
        "Weight_kDalton": [120, 32, 241],
    },
    index=["Polymer A", "Polymer B", "Polymer C"],  # put labels in the index
)

CustomDiscreteParameter(
    name="Polymer",
    data=descriptors,
    decorrelate=True,  # optional, uses default correlation threshold = 0.7?
)
""" 

In [67]:
searchspace = SearchSpace.from_product(parameters)

### Recommenders

The **SequentialGreedyRecommender** is a powerful recommender that leverages BoTorch optimization functions to perform sequential greedy optimization. It can be applied to discrete, continuous and hybrid search spaces. **It is important to note that this recommender performs a brute-force search when applied to hybrid search spaces, as it optimizes the continuous part of the space while exhaustively searching choices in the discrete subspace.** You can customize this behavior to only sample a certain percentage of the discrete subspace via the `sampling_percentage` attribute and to choose different sampling strategies via the `hybrid_sampler` attribute.

e.g.
strategy = TwoPhaseStrategy(recommender=SequentialGreedyRecommender(hybrid_sampler="Farthest", sampling_percentage=0.3))
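The hybrid behaviour described above can be pictured with a small toy loop. This is only a sketch under stated assumptions: uniform random sampling stands in for the "Farthest" sampler, a random search stands in for BoTorch's continuous optimizer, and `optimize_hybrid`, `toy_acquisition` and the candidate labels are hypothetical names, not part of BayBE.

```python
import random


def optimize_hybrid(discrete_candidates, bounds, acquisition,
                    sampling_percentage=0.3, n_continuous_trials=200, seed=0):
    """Sample a fraction of the discrete candidates and, for each sampled one,
    search the continuous variable for the highest acquisition value."""
    rng = random.Random(seed)
    n_samples = max(1, round(sampling_percentage * len(discrete_candidates)))
    sampled = rng.sample(discrete_candidates, n_samples)

    best = None  # (acquisition value, discrete label, continuous value)
    for label in sampled:
        for _ in range(n_continuous_trials):
            x = rng.uniform(*bounds)
            value = acquisition(label, x)
            if best is None or value > best[0]:
                best = (value, label, x)
    return best, sampled


# Hypothetical discrete choices and a toy acquisition function.
candidates = [f"molecule_{i}" for i in range(10)]
offsets = {label: i / 10 for i, label in enumerate(candidates)}


def toy_acquisition(label, x):
    return offsets[label] - (x - 0.5) ** 2


best, sampled = optimize_hybrid(candidates, (0.0, 1.0), toy_acquisition)
print(sampled)
print(best)
```

With `sampling_percentage=1.0` the loop visits every discrete choice, which corresponds to the brute-force case. Lower values trade coverage of the discrete subspace for speed.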

For implementing fully customized surrogate models e.g. from sklearn or PyTorch, see:
https://emdgroup.github.io/baybe/examples/Custom_Surrogates/Custom_Surrogates.html


In [68]:
from baybe.recommenders import RandomRecommender, SequentialGreedyRecommender
from baybe.surrogates import GaussianProcessSurrogate

available_surr_models = [
    "GaussianProcessSurrogate", 
    "BayesianLinearSurrogate",
    "MeanPredictionSurrogate",
    "NGBoostSurrogate",
    "RandomForestSurrogate"
]

available_acq_functions = [
    "qPI",  # q-Probability Of Improvement
    "qEI",  # q-Expected Improvement
    "qUCB", # q-upper confidence bound with beta of 1.0
]

# These are the defaults anyway
SURROGATE_MODEL = GaussianProcessSurrogate()
ACQ_FUNCTION = "qEI"  # q-Expected Improvement; only q-functions are available for batch_size > 1

seq_greedy_recommender = SequentialGreedyRecommender(
    surrogate_model=SURROGATE_MODEL,
    acquisition_function_cls=ACQ_FUNCTION,
    hybrid_sampler="Farthest",  # find more details in the documentation
    sampling_percentage=0.3,  # should be relatively low
    allow_repeated_recommendations=False,
    allow_recommending_already_measured=False,
)
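For intuition about the three acquisition functions listed above, the sketch below evaluates their classic single-point, closed-form counterparts for a Gaussian posterior. This is an illustration only: the q-variants used for batches are estimated by Monte Carlo sampling rather than by these formulas, and the posterior mean, standard deviation and best observed value are made-up numbers.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best):
    """Probability that a draw from N(mean, std^2) exceeds the best value so far."""
    return normal_cdf((mean - best) / std)


def expected_improvement(mean, std, best):
    """Expected value of max(f - best, 0) for f ~ N(mean, std^2)."""
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def upper_confidence_bound(mean, std, beta=1.0):
    """Optimistic estimate: posterior mean plus sqrt(beta) standard deviations."""
    return mean + math.sqrt(beta) * std


# Hypothetical posterior at one candidate point, maximizing efficiency.
mean, std, best = 60.0, 10.0, 58.0
print(probability_of_improvement(mean, std, best))
print(expected_improvement(mean, std, best))
print(upper_confidence_bound(mean, std))
```

All three reward a high predicted mean, but they weight uncertainty differently. PI only asks whether an improvement is likely, EI also accounts for how large it could be, and UCB adds an explicit exploration bonus controlled by `beta`.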

### Campaign Strategy

In [69]:
from baybe.strategies import TwoPhaseStrategy
from baybe import Campaign

strategy = TwoPhaseStrategy(
    # The initial recommender hardly matters here because training data already
    # exists, but it can be useful for benchmarking.
    initial_recommender=RandomRecommender(),
    recommender=seq_greedy_recommender,  # Bayesian model-based optimization
    switch_after=1,  # switch to the model-based recommender after 1 batch, i.e. immediately
)

campaign = Campaign(searchspace, objective, strategy)



### Get recommendations

In [70]:
from baybe.simulation import simulate_experiment

# Simulate the experiment
df_simulation = simulate_experiment(campaign, df)

IndexError: Cannot match the recommended row SMILES                OC[C@H]([C@H]([C@@H]([C@@H](CO)O)O)O)O
Time_h                                            177.249612
pH                                                 12.928015
Inhib_Concentrat_M                                  0.081588
Salt_Concentrat_M                                   0.368947
Name: 74, dtype: object to any of the rows in the lookup.
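The error above most likely arises because the simulation tries to find each recommended row in the lookup table, and a recommendation drawn from continuous parameters (here pH 12.928015 and a time of 177.249612 h) will almost never coincide exactly with a measured row. One possible workaround, sketched below with plain dictionaries, is to match the SMILES exactly and then pick the measured row whose remaining parameters are closest. `nearest_lookup_row` is a hypothetical helper, not a BayBE function, the range normalization is an assumption, and the recommendation values are invented. The lookup rows are taken from `df.head()` above.

```python
def nearest_lookup_row(recommendation, lookup_rows, numeric_keys, label_key="SMILES"):
    """Return the lookup row with the same label that is closest to the
    recommendation in range-normalized squared distance over numeric_keys."""
    candidates = [row for row in lookup_rows if row[label_key] == recommendation[label_key]]
    if not candidates:
        raise LookupError(f"No lookup rows for {recommendation[label_key]!r}")

    # Normalize by each column's range over the whole table so that hours and
    # molar concentrations contribute on comparable scales.
    spans = {}
    for key in numeric_keys:
        values = [row[key] for row in lookup_rows]
        spans[key] = (max(values) - min(values)) or 1.0

    def distance(row):
        return sum(((row[k] - recommendation[k]) / spans[k]) ** 2 for k in numeric_keys)

    return min(candidates, key=distance)


numeric_keys = ["Time_h", "pH", "Inhib_Concentrat_M", "Salt_Concentrat_M"]
lookup_rows = [
    {"SMILES": "COCCOC(=O)OCSc1nc2c(s1)cccc2", "Time_h": 24.0, "pH": 4.0,
     "Inhib_Concentrat_M": 0.001, "Salt_Concentrat_M": 0.1, "Efficiency": 0.0},
    {"SMILES": "Cc1ccc(c(c1)n1nc2c(n1)cccc2)O", "Time_h": 24.0, "pH": 4.0,
     "Inhib_Concentrat_M": 0.001, "Salt_Concentrat_M": 0.1, "Efficiency": 30.0},
    {"SMILES": "Cc1ccc(c(c1)n1nc2c(n1)cccc2)O", "Time_h": 24.0, "pH": 10.0,
     "Inhib_Concentrat_M": 0.001, "Salt_Concentrat_M": 0.1, "Efficiency": 30.0},
]

# Hypothetical recommendation with continuous values that match no row exactly.
recommendation = {"SMILES": "Cc1ccc(c(c1)n1nc2c(n1)cccc2)O", "Time_h": 30.0,
                  "pH": 8.7, "Inhib_Concentrat_M": 0.002, "Salt_Concentrat_M": 0.15}

match = nearest_lookup_row(recommendation, lookup_rows, numeric_keys)
print(match["pH"], match["Efficiency"])
```

Nearest-row matching silently assigns a measurement taken under different conditions, so the distance to the matched row is worth tracking. An alternative is to restrict the search space to the measured parameter combinations so that every recommendation has an exact match.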

In [None]:
new_rec = campaign.recommend(batch_size=3) # TEST with different batch sizes for optimal performance
print("\n\nRecommended experiments: ")
print(new_rec.to_markdown())



Recommended experiments: 
|         |   Time (h) |   pH |   Inhibitor Concentration (M) |   Salt Concentration (M) |
|--------:|-----------:|-----:|------------------------------:|-------------------------:|
| 4924793 |         16 |  2.5 |                          0.01 |                     0.92 |
| 6006943 |         19 |  8   |                          0.05 |                     0.58 |
| 6994486 |         22 |  8.8 |                          0.08 |                     0.88 |


In [None]:
# Look up the efficiency in the Excel table: match the SMILES first, then take
# the row whose remaining parameters are closest to the recommendation.
# The values below are placeholders.

new_rec["efficiency"] = [0.1, 0.2, 0.3]
print("\n\nRecommended experiments with measured values: ")
print(new_rec.to_markdown())



Recommended experiments with measured values: 
|         |   Time (h) |   pH |   Inhibitor Concentration (M) |   Salt Concentration (M) |   efficiency |
|--------:|-----------:|-----:|------------------------------:|-------------------------:|-------------:|
| 4924793 |         16 |  2.5 |                          0.01 |                     0.92 |          0.1 |
| 6006943 |         19 |  8   |                          0.05 |                     0.58 |          0.2 |
| 6994486 |         22 |  8.8 |                          0.08 |                     0.88 |          0.3 |


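Note: the next call to `campaign.recommend` returns exactly the same rows as before, including the manually added `efficiency` column. The most likely cause is that the measured values were only written into `new_rec` and never registered with the campaign (e.g. via `campaign.add_measurements`), so the campaign state is unchanged and the earlier recommendation is simply returned again. Note also that the target is named `Efficiency`, not `efficiency`.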

In [None]:
new_new_rec = campaign.recommend(batch_size=3) # TEST with different batch sizes for optimal performance
print("\n\nRecommended experiments: ")
print(new_new_rec.to_markdown())



Recommended experiments: 
|         |   Time (h) |   pH |   Inhibitor Concentration (M) |   Salt Concentration (M) |   efficiency |
|--------:|-----------:|-----:|------------------------------:|-------------------------:|-------------:|
| 4924793 |         16 |  2.5 |                          0.01 |                     0.92 |          0.1 |
| 6006943 |         19 |  8   |                          0.05 |                     0.58 |          0.2 |
| 6994486 |         22 |  8.8 |                          0.08 |                     0.88 |          0.3 |


### Merge all results into a dataframe

### Transfer learning + Initial Data INFO

https://emdgroup.github.io/baybe/examples/Transfer_Learning/basic_transfer_learning.html

https://emdgroup.github.io/baybe/userguide/transfer_learning.html

https://emdgroup.github.io/baybe/examples/Backtesting/full_initial_data.html

https://emdgroup.github.io/baybe/examples/Backtesting/full_lookup.html