## Classical machine learning models for the GPTchallenge
### Overall demo
This notebook demonstrates several models (regression and classification) that predict the melting point of a molecule from its SMILES string. It is an overall roadmap and may contain errors. For detailed information, go to the 'ml' directory inside each experiment folder.

In [15]:
import pandas as pd
import numpy as np
import zipfile
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

path_to_dataset = 'experiments/01_Materials_and_Properties/MeltingPoint_molecules/ml/train_meltingPoint_noDuplicates.zip'
csv_filename = 'train_meltingPoint_noDuplicates.csv'

# Open the file; adjust the encoding and separator if necessary
if path_to_dataset.endswith('.zip'):
    with zipfile.ZipFile(path_to_dataset, 'r') as z:
        # Open the CSV file within the ZIP file
        with z.open(csv_filename) as f:
            # Read the CSV file into a DataFrame
            df = pd.read_csv(f, sep=',', on_bad_lines='warn')
else:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(path_to_dataset, sep=',', on_bad_lines='warn')

print('Count of unique smiles:', df.SMILES.unique().shape[0])
print('Count of all of the smiles:', df.shape[0])

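from typing import Optional

fpsize = 1000

# Build the generator once and reuse it for every molecule.
# Note: radius=20 is much larger than the commonly used radius of 2 or 3.
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=20, fpSize=fpsize)

def smiles_to_fingerprint(smiles: str) -> Optional[np.ndarray]:
    """
    Convert a SMILES string to a Morgan fingerprint bit vector using RDKit.
    Returns None if RDKit cannot parse the SMILES string.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Invalid SMILES string: {smiles} \n Skipping this molecule")
        return None
    # Generate the Morgan fingerprint as a NumPy array of 0/1 values
    fingerprint = mfpgen.GetFingerprint(mol)
    return np.array(fingerprint)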

df['fingerprint'] = df.SMILES.apply(smiles_to_fingerprint)

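# Sanity check: compare two example fingerprints bit by bit
a = df.fingerprint.iloc[15] == df.fingerprint.iloc[4]

# Number of bit positions in which the two fingerprints differ
fpsize - sum(a)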


Unnamed: 0.1,Unnamed: 0,SMILES,mp,NAME,Melting Point,mp_bin,len_smiles,SMILES_random,name_smiles
0,0,C(CCC)C1=NC=CC2=C(C=CC=C12)[N+](=O)[O-],69.25,1-n-butyl-5-nitro-isoquinoline,69.0 - 69.5,0,39,(CCNO)2=CCCC2C([1C]N)=OC=CC=+(1CC==]-[),1-n-butyl-5-nitro-isoquinoline (C(CCC)C1=NC=CC...
1,1,ClC=1N=CC2=CC=CC(=C2C1)N,176.5,3-chloro-5-amino-isoquinoline,176.0 - 177.0,1,24,C===CCCCC=221C)NNC=C(Cl1,3-chloro-5-amino-isoquinoline (ClC=1N=CC2=CC=C...
2,2,ClC1=NC(=CC2=C(C=CC=C12)[N+](=O)[O-])C,112.0,1-chloro-3-methyl-5-nitro-isoquinoline,112.0,0,38,C((C1CC+[[C=)]O=C=C)lNO2C1-=C==C)]2CN(,1-chloro-3-methyl-5-nitro-isoquinoline (ClC1=N...
3,3,C(CCCCCCCCCCC)C=1C(C=CC(C1)=O)=S,131.5,2-dodecylthio-p-benzoquinone,131.0 - 132.0,0,32,CCCCCCC(C)O=C==1)C1CS)C=CCCC(CC(,2-dodecylthio-p-benzoquinone (C(CCCCCCCCCCC)C=...
4,4,[N+](=O)([O-])OCC12CC(C3C4(C=CC(C=C4CCC3C1CCC2...,163.0,"11,18-dihydroxy-pregna-1,4-diene-3,20-dione 18...",162.0 - 164.0,1,59,OCC=C](O))CCOO[C)C)1C(C(C3NCC)CO2C=((=(C]4=[CC...,"11,18-dihydroxy-pregna-1,4-diene-3,20-dione 18..."


Count of unique smiles: 273237
Count of all of the smiles: 273237


[16:13:34] Explicit valence for atom # 10 S, 9, is greater than permitted


Invalid SMILES string: ClC1C(C(=O)O)=CC(=C(C1=[SH3](=O)=O)Cl)C(F)(F)F 
 Skipping this molecule


[16:13:37] Explicit valence for atom # 16 S, 9, is greater than permitted
[16:13:37] Explicit valence for atom # 10 S, 9, is greater than permitted


Invalid SMILES string: C(C)N(CC)CCNC(C=1C(C(C=C(C1)C)=[SH3](=O)=O)OC)=O 
 Skipping this molecule
Invalid SMILES string: N1(CCCC1)N2C(NC(C2)=[SH3](=O)=O)=O 
 Skipping this molecule


[16:13:49] Explicit valence for atom # 14 S, 9, is greater than permitted
[16:13:49] Explicit valence for atom # 11 S, 9, is greater than permitted


Invalid SMILES string: C(=C)C(CCNC(=O)NCCC(C=C)=[SH3](=O)=O)=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: CON=C(C=1C(C(C=CC1Cl)=[SH3](=O)=O)Cl)C#N 
 Skipping this molecule


[16:13:51] Explicit valence for atom # 10 S, 9, is greater than permitted


Invalid SMILES string: COC(=O)C=1S(C=CC1Cl)=[SH3](=O)=O 
 Skipping this molecule


[16:14:02] Explicit valence for atom # 29 S, 9, is greater than permitted
[16:14:02] Explicit valence for atom # 24 S, 9, is greater than permitted
[16:14:02] Explicit valence for atom # 22 S, 9, is greater than permitted
[16:14:02] Explicit valence for atom # 30 S, 9, is greater than permitted


Invalid SMILES string: C(C1=CC=CC=C1)NC=2C(C(C(=O)O)C=C(C2SC3=CC=CC=C3)N4C=CC=C4)=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: COC(C1C(C(=C(C(=C1)N2C=CC=C2)SC3=CC=CC=C3)[N+](=O)[O-])=[SH3](=O)=O)=O 
 Skipping this molecule
Invalid SMILES string: COC(C1C(C(=C(C(=C1)N2C=CC=C2)SC3=CC=CC=C3)N)=[SH3](=O)=O)=O 
 Skipping this molecule
Invalid SMILES string: COC(C1C(C(=C(C(=C1)N2C=CC=C2)SC3=CC=CC=C3)NC(C4=CC=CC=C4)=O)=[SH3](=O)=O)=O 
 Skipping this molecule


[16:14:05] Explicit valence for atom # 9 S, 9, is greater than permitted


Invalid SMILES string: C1(=CC(=CC=C1)SC(=C=[SH3](=O)=O)C2=CC=CC=C2)C 
 Skipping this molecule


[16:14:07] Explicit valence for atom # 11 S, 9, is greater than permitted
[16:14:07] Explicit valence for atom # 12 S, 9, is greater than permitted
[16:14:07] Explicit valence for atom # 12 S, 9, is greater than permitted
[16:14:07] Explicit valence for atom # 20 S, 9, is greater than permitted


Invalid SMILES string: C1(CCCCC1)C2C(NC(C2=[SH3](=O)=O)=O)=O 
 Skipping this molecule
Invalid SMILES string: C1(=CC=CC=C1)N2C(C(C(C2=O)=[SH3](=O)=O)C3CCCCC3)=O 
 Skipping this molecule
Invalid SMILES string: C1(=CC=C(C=C1)N2C(C(C(C2=O)=[SH3](=O)=O)C3CCCCC3)=O)C 
 Skipping this molecule
Invalid SMILES string: C(C)(=O)OC=1C=C(C=CC1C(=O)OC)N2C(C(C(C2=O)=[SH3](=O)=O)C3CCCCC3)=O 
 Skipping this molecule


[16:14:21] Explicit valence for atom # 24 S, 9, is greater than permitted


Invalid SMILES string: C(C)OC(=O)C1=NC(C2=NC3=CC=C(C=C3C2=C1COC)N(C)C)=[SH3](=O)=O 
 Skipping this molecule


[16:14:23] Explicit valence for atom # 16 S, 9, is greater than permitted


Invalid SMILES string: C(C)(=O)C1=C(C(=C(OCCCCC(C(=O)O)=[SH3](=O)=O)C=C1)CC=C)O 
 Skipping this molecule


[16:14:23] Explicit valence for atom # 0 S, 9, is greater than permitted


Invalid SMILES string: [SH3](=O)(=O)=CCCN1C(C=NC(=C1)C)C 
 Skipping this molecule


[16:14:50] Explicit valence for atom # 19 S, 9, is greater than permitted


Invalid SMILES string: C(C)(C)(C)C1=CC(N(O1)C=2C(C(C(=CC2)S(=O)(=O)C)=[SH3](=O)=O)CCCCCCCCCCCCCC)=O 
 Skipping this molecule


[16:14:51] Explicit valence for atom # 12 S, 9, is greater than permitted


Invalid SMILES string: OCCC1=C(C(C(C(=O)O)C=C1)=[SH3](=O)=O)[N+](=O)[O-] 
 Skipping this molecule


[16:15:20] Explicit valence for atom # 7 S, 9, is greater than permitted


Invalid SMILES string: ClCC1=C(C(N(C1=[SH3](=O)=O)C2=CC=C(C=C2)C)C)C 
 Skipping this molecule


[16:15:24] Explicit valence for atom # 10 S, 9, is greater than permitted


Invalid SMILES string: CN(C)C=NCCC(C(I)=[SH3](=O)=O)(C)C 
 Skipping this molecule


[16:15:32] Explicit valence for atom # 24 S, 9, is greater than permitted


Invalid SMILES string: C(#N)C1=CC=C(C=C1)N2C(N(CC2)C3C(C=C(C=C3)CC(=O)OC)=[SH3](=O)=O)=O 
 Skipping this molecule


[16:15:46] Explicit valence for atom # 17 S, 9, is greater than permitted


Invalid SMILES string: ClC1=C(C(=O)OC)C=CC(C1NCC(=O)OC)=[SH3](=O)=O 
 Skipping this molecule


[16:15:47] Explicit valence for atom # 12 S, 9, is greater than permitted


Invalid SMILES string: [N+](=[N-])=C(C(C(CC1=CC=CC=C1)=[SH3](=O)=O)=O)C 
 Skipping this molecule


[16:15:53] Explicit valence for atom # 16 S, 9, is greater than permitted


Invalid SMILES string: NC(=NC(C=1C(C(C(=C(C1)C)N2C=NC=C2)=[SH3](=O)=O)C)=O)N 
 Skipping this molecule


[16:15:59] Explicit valence for atom # 27 S, 9, is greater than permitted


Invalid SMILES string: NC1=CC(=C(C(=O)NCC2CCN(CC2)CCC(CC3=CC=C(C=C3)OC)=[SH3](=O)=O)C=C1Cl)OC 
 Skipping this molecule


[16:16:10] Explicit valence for atom # 26 S, 9, is greater than permitted


Invalid SMILES string: CC1=NC2=C(C=CC=C2C(=C1)C)OCC=3C(=C(C=CC3Cl)N4C(C(CC4)=[SH3](=O)=O)C(=O)N5CCN(CC5)S(=O)(=O)C6=CC=C(C=C6)C=NN)Cl 
 Skipping this molecule


[16:16:14] Explicit valence for atom # 7 S, 9, is greater than permitted


Invalid SMILES string: BrC1=C(C(C(C=C1)=[SH3](=O)=O)CF)Cl 
 Skipping this molecule


[16:16:18] Explicit valence for atom # 14 S, 9, is greater than permitted


Invalid SMILES string: ClC1=C(C(=O)OC)C=CC(=C1CBr)C=[SH3](=O)=O 
 Skipping this molecule


[16:16:24] Explicit valence for atom # 18 S, 9, is greater than permitted


Invalid SMILES string: FC=1C=C2C(=C(C(C2=CC1)=CC3C(C=C(C=C3)C)=[SH3](=O)=O)C)CCN(C(=O)N)O 
 Skipping this molecule


[16:16:28] Explicit valence for atom # 21 S, 9, is greater than permitted


Invalid SMILES string: C(#N)C1=NN(C(C1)C2=CC=C(C=C2)C)C3C(C=C(C=C3)C)=[SH3](=O)=O 
 Skipping this molecule


[16:16:33] Explicit valence for atom # 14 S, 9, is greater than permitted


Invalid SMILES string: CN1N=CC(=C1O)C(C=2C(=C(C(C(C2)=[SH3](=O)=O)C)C3=NOCC3)C)=O 
 Skipping this molecule


[16:16:38] Explicit valence for atom # 17 S, 9, is greater than permitted


Invalid SMILES string: ClC1=C(C(=O)OC)C=CC(=C1C=C[N+](=O)[O-])C=[SH3](=O)=O 
 Skipping this molecule


[16:16:40] Explicit valence for atom # 27 S, 9, is greater than permitted


Invalid SMILES string: CC=1C(C(=O)C2C(C3CCC(C2=O)C3)=O)=CC(C(C1C4=NOC(C4)CCl)C)=[SH3](=O)=O 
 Skipping this molecule


[16:16:43] Explicit valence for atom # 26 S, 9, is greater than permitted


Invalid SMILES string: CC=1C(C(=O)C2=C(C(CC(C2=O)C)C)O)=CC(C(C1C3=NOC4CC34)C)=[SH3](=O)=O 
 Skipping this molecule


[16:16:46] Explicit valence for atom # 20 S, 9, is greater than permitted


Invalid SMILES string: ClC=1C(C(=O)O)=CC(C(C1CSC2=NN=NN2CC)C)=[SH3](=O)=O 
 Skipping this molecule


[16:16:47] Explicit valence for atom # 30 S, 9, is greater than permitted


Invalid SMILES string: COC=1C=C2C=C(N(C2=CC1)S(=O)(=O)C3=CC=CC=C3)CC4N(C5=CC=CC=C5C4=[SH3](=O)=O)C6=CC=CC=C6 
 Skipping this molecule


[16:17:42] Explicit valence for atom # 25 S, 9, is greater than permitted


Invalid SMILES string: ClC=1C(C(=O)C=2C=NN(C2O)C3CC3)=CC(C(C1C4CC(=NO4)C)C)=[SH3](=O)=O 
 Skipping this molecule


[16:17:46] Explicit valence for atom # 15 S, 9, is greater than permitted


Invalid SMILES string: C(C)C=1C=CC=C2C(C(=NC12)C(=O)OC)=[SH3](=O)=O 
 Skipping this molecule


[16:17:54] Explicit valence for atom # 11 S, 9, is greater than permitted


Invalid SMILES string: ClC1C(C(=O)OC)=CC(=C(C1=[SH3](=O)=O)C)C2=NOC(C2)C 
 Skipping this molecule


[16:18:55] Explicit valence for atom # 12 Cl, 7, is greater than permitted


Invalid SMILES string: CON(C(=O)NC1=CC(=CC=C1)Cl(=O)(=O)=O)C 
 Skipping this molecule


[16:18:55] Explicit valence for atom # 22 S, 9, is greater than permitted
[16:18:55] Explicit valence for atom # 12 S, 9, is greater than permitted
[16:18:55] Explicit valence for atom # 0 Cl, 5, is greater than permitted


Invalid SMILES string: ClC1=C(C=CC=C1)N(C2=NC(NC=N2)(C)NC(C)=O)C(=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: FC(C=1C=C(CN2N=C(C(=C2N=[SH3](=O)=O)C#N)C)C=CC1)(F)F 
 Skipping this molecule
Invalid SMILES string: Cl(=O)(=O)CC1=CC(=C(OCC(=O)OCC)C=C1)NC(C2=CC=C(C=C2)OCCCCC3=CC=CC=C3)=O 
 Skipping this molecule


[16:18:56] Explicit valence for atom # 25 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 23 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 23 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 22 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 24 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 19 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 39 S, 9, is greater than permitted
[16:18:56] Explicit valence for atom # 0 S, 9, is greater than permitted


Invalid SMILES string: C(C)OC1C(=CC=CC1=C=O)NN2N(C(N(C2=O)C)OCC)C(=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: COC1=C(C=CC=C1)N(C)N2N(C(N(C2=O)CC)OC)C(=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: C(C)OC1=C(C=CC=C1)ON2N(C(N(C2=O)C3CC3)Cl)C(=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: C(#N)C1=C(C(=C(C=C1C)N2N=C(N(C2=O)C)C(F)(F)F)F)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: C(CCC)C1C(C2=CC=C(C=C2C1)C(C=CC3=CC=CC=C3)=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: C(=C)C1C(C2=CC=C(C=C2C1)C=3CCC(NN3)=O)N=[SH3](=O)=O 
 Skipping this molecule
Invalid SMILES string: C(CCCCCC)NC(NC1=C(C(=O)NC2CCN(CC2)C(C3=CC=CC=C3)C4=CC=CC=C4)C=C(C=C1N=[SH3](=O)=O)C)=O 
 Skipping this molecule
Invalid SMILES string: [SH3](=O)(=O)=NC=1C=C(C=CC1OC)CC(=O)N(C2CCCCC2)C(=S)NC 
 Skipping this molecule


[16:18:56] Explicit valence for atom # 25 S, 9, is greater than permitted


Invalid SMILES string: C(C)(C)(C)C1=CC=C(C=N1)C2(C(C2)C(=O)NC(C)C3=CC(=C(C=C3)N=[SH3](=O)=O)C)CC 
 Skipping this molecule


122
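
The value 122 is the result of `fpsize - sum(a)`: the number of bit positions in which the two fingerprints differ, also known as the Hamming distance. The following is a minimal sketch of that count next to the Tanimoto similarity, which is often used to compare fingerprints. It uses plain Python lists instead of RDKit objects, and the function names and toy vectors are made up for illustration.

```python
def hamming_distance(fp_a, fp_b):
    """Number of positions where two equal-length bit vectors differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(fp_a, fp_b))


def tanimoto_similarity(fp_a, fp_b):
    """Bits set in both vectors divided by bits set in either vector."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0


# Toy 8-bit vectors (made up for illustration, not real fingerprints)
fp_1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp_2 = [1, 1, 0, 1, 0, 0, 0, 0]

distance = hamming_distance(fp_1, fp_2)       # differing positions
similarity = tanimoto_similarity(fp_1, fp_2)  # shared on-bits / all on-bits
```

The Hamming distance depends on the fingerprint length, while the Tanimoto similarity is normalized to the range 0 to 1, which makes it easier to compare across fingerprint sizes.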

In [16]:
from MLPipeline import MLmodel, BinTheTarget

Target = 'mp'  # melting-point column; the loaded DataFrame has no '[°C ]' column
Features = ['SMILES']



### Regression without hyperparameter optimization. (RandomForestRegressor)
The DataFrame is split inside the `MLmodel` class; you can change the split fraction and retrieve the resulting train and test values.
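
How `MLmodel` exposes the split fraction is not shown in this notebook. As a rough illustration of what a holdout split does, here is a dependency-free sketch; `holdout_split`, `test_fraction`, and `seed` are hypothetical names and not part of `MLPipeline`.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for DataFrame row indices
train_rows, test_rows = holdout_split(rows, test_fraction=0.2)
```

Fixing the seed matters when comparing models: every model then sees the same test molecules.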

In [20]:
model = MLmodel(modelType = 'random_forest', df = df, feature_types=['SMILES'],
                target = 'mp', features = Features)

# get the values (input and output) of the model
X_train, X_test, y_train, y_test = model.getValues()


# Train the model
model.train()
# Predictions and evaluation are done on the test set
predictions = model.predict()
model.evaluate()

TypeError: MLmodel.smiles_to_fingerprint() got an unexpected keyword argument 'axis'
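
The cell above stops with a `TypeError`, so no scores are shown. Which metrics `model.evaluate()` reports is not visible in this notebook; as a reference, the sketch below computes three common regression metrics (MAE, RMSE, and R²) in plain Python. The function name and the sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up melting points, for illustration only
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 200.0]
metrics = regression_metrics(y_true, y_pred)
```

MAE and RMSE are in the same unit as the target, so they can be read directly as a typical error in the predicted melting point.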

### Regression model with hyperparameter optimization using Grid search. (RandomForestRegressor)

In [None]:
param_grid = {
    'n_estimators': [50, 100],  # Number of trees in the forest
    'max_depth': [None, 10, 40],  # Maximum depth of the trees
#     'min_samples_split': [2, 5, 10, 15],  # Minimum number of samples required to split an internal node
#     'min_samples_leaf': [1, 2, 4, 6],     # Minimum number of samples required to be at a leaf node
#     'max_features': ['auto', 'sqrt', 'log2'],  # Number of features to consider for the best split
#     'bootstrap': [True, False]            # Whether bootstrap samples are used when building trees
}
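
With two values for `n_estimators` and three for `max_depth`, the grid above contains 2 × 3 = 6 candidate settings. The sketch below enumerates them with the standard library to make the cost of a grid search explicit. The fold count of 5 is an assumption (it is the default of scikit-learn's `GridSearchCV`; what `MLmodel` uses internally is not shown here).

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 40],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 5  # assumption, see the note above
total_fits = len(combinations) * n_folds  # models trained during the search
```

Uncommenting the remaining entries of the grid multiplies the number of combinations by the length of each added list, so the search cost grows quickly on a dataset of this size.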

In [None]:
model = MLmodel(modelType='random_forest', df=df, target='mp',
                features=['SMILES'], hyperparameter_tuning=True, 
                optimization_method='grid_search', param_grid=param_grid)
model.train()
predictions = model.predict()
model.evaluate()

### Classification without hyperparameter optimization. (RandomForestClassifier)

In [None]:
binner = BinTheTarget(df = df, target = Target, bins = 2)
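
How `BinTheTarget` chooses its bin edges is not shown in this notebook. One common way to turn a continuous target into two classes is a split at the median, sketched below with the five `mp` values from the table preview above. `bin_at_median` is a hypothetical helper, not part of `MLPipeline`.

```python
from statistics import median


def bin_at_median(values):
    """Label each value 1 if it lies above the median, else 0."""
    threshold = median(values)
    labels = [1 if v > threshold else 0 for v in values]
    return labels, threshold


# The five 'mp' values from the table preview above
mp_values = [69.25, 176.5, 112.0, 131.5, 163.0]
labels, threshold = bin_at_median(mp_values)
```

A median split yields roughly balanced classes, which keeps plain accuracy meaningful as a score.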

In [None]:
model = MLmodel(modelType = 'RandomForestClassifier', df = binner.df,
                target = Target , features = Features)
                
model.train()
predictions = model.predict()
model.evaluate()

### Classification using grid search for hyperparameter optimization. (RandomForestClassifier)

In [None]:
param_grid = {
    'n_estimators': [ 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20],  # Maximum depth of the trees
    # 'min_samples_split': [2, 5],  # Minimum number of samples required to split an internal node
    # 'min_samples_leaf': [1, 2],     # Minimum number of samples required to be at a leaf node
    # 'max_features': ['auto', 'sqrt'],  # Number of features to consider for the best split
    # 'bootstrap': [True, False],           # Whether bootstrap samples are used when building trees
    # 'class_weight': [None, 'balanced']  # Weights associated with classes in the form {class_label: weight}
}


In [None]:
# Use the binned DataFrame, as in the classification cell without tuning
model = MLmodel(modelType='RandomForestClassifier', df=binner.df, target=Target,
                features=Features, hyperparameter_tuning=True, param_grid=param_grid,
                optimization_method='grid_search')
model.train()
predictions = model.predict()
model.evaluate()

### Classification using optuna for hyperparameter optimization. (RandomForestClassifier)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.base import clone

def objective(trial, model_instance):
    """
    Objective function for Optuna.
    Returns the mean cross-validated accuracy, which the study should maximize.
    """
    # Define hyperparameters to tune
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_categorical('max_depth', [None, 10, 20, 30, 40]),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 15),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 6),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
    }

    # Clone the model to ensure a fresh instance each trial
    model_clone = clone(model_instance.model)
    model_clone.set_params(**params)
    
    # Define the score metric
    scoring = 'accuracy'

    # Perform cross-validation
    scores = cross_val_score(model_clone, model_instance.X_train, model_instance.y_train, cv=model_instance.cv, scoring=scoring)

    # Return the average score across all folds
    return scores.mean()
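
The objective returns the mean of the per-fold scores. To make that step concrete, here is a dependency-free sketch of plain (unshuffled, unstratified) k-fold splitting and score averaging. It is only an illustration of the idea, not what `cross_val_score` or `model_instance.cv` do internally; both function names are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if not start <= i < stop]
        yield train_idx, val_idx
        start = stop


def mean_cv_score(score_fn, n_samples, n_folds):
    """Average score_fn(train_idx, val_idx) over all folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=10, n_folds=3))
# Toy score: fraction of all samples that fall into the validation fold
toy_mean = mean_cv_score(lambda tr, va: len(va) / 10, n_samples=10, n_folds=3)
```

Each sample is used for validation exactly once, so the averaged score uses all training data without ever scoring a model on rows it was fitted on.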

In [None]:
# Use the binned DataFrame; the lambda looks up `model` only when Optuna calls it
model = MLmodel(modelType='RandomForestClassifier', df=binner.df, target=Target,
                features=Features, hyperparameter_tuning=True, param_grid=param_grid,
                optimization_method='optuna', objective=lambda trial: objective(trial, model))

model.train()
predictions = model.predict()
model.evaluate()

### Different classification models can be used. (MLPClassifier)
#### Without hyperparameter optimization

In [None]:
model = MLmodel(modelType='MLPClassifier', df=binner.df, target=Target, features=Features)

model.train()
predictions = model.predict()
model.evaluate()