# Generating ChemBERTa Embeddings (Chemical Language Model) for Condensation Reactions

This dataset comes from a recent publication by D.M. Makarov et al., Journal of Computational Science 74 (2023) 102173 (https://doi.org/10.1016/j.jocs.2023.102173). The authors describe it as follows:

"""We considered pyrrole or dipyrromethane condensation reactions with various aldehydes, resulting in the production of boron(III) dipyrromethene or BODIPY (681 records). These reactions were retrieved from articles (see “Dataset reactions” and Scheme S1). Additionally, we used the reactions of the production of dipyrromethane (111 records) and porphyrins (457 records). All condensation reactions for dipyrromethanes and 213 reactions for porphyrins with various aldehydes were obtained in our laboratory. The remaining 244 reactions for the porphyrins synthesis were obtained from articles (see “Dataset reactions”). Our experimental dataset is based on a study of pyrrole condensation processes with aldehydes, using catalytic amounts of organic acids to produce ms-aryl- and β-alkyl-substituted dipyrromethanes."""

The objective of this notebook is to introduce a method for predicting reaction yield from the reaction SMILES, using chemical language model embeddings as features.

I am a beginner, so there is likely plenty of room for improvement in this notebook. It was largely inspired by the work of D.M. Makarov et al. and by Stephen Lee's notebook (BELKA: Molecule Representations for ML Tutorial); thanks to them.

## Data

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 211)
pd.set_option('display.max_columns', 211)

In [3]:
# Load the dataset
data_path = r"C:\Users\loris\Desktop\IA_chemistry\Dipyrromethanes_condensation_reactions-main\Condensation_reactions - Copie.xlsx"

df = pd.read_excel(data_path)

print(df.shape)
df.head()

(1249, 4)


Unnamed: 0,SMILES,yield,Temperature,Ind
0,CC(C)(C)c1ccc(OCCCCOc2ccc(C(C)(C)C)cc2C=O)c(C=...,0.05,20,78
1,O=Cc1ccccc1OCCOC(=O)C#CC(=O)OCCOc1ccccc1C=O.c1...,1.0,40,909
2,c1cc[nH]c1.Cc1cc(C)c(C=O)c(C)c1.COC(=O)c1ccc(C...,2.0,20,747
3,CC(C)(C)c1ccc(OCCCCCOc2ccc(C(C)(C)C)cc2C=O)c(C...,2.1,20,77
4,Cc1ccc(-c2ccccc2C=O)cc1.Fc1c(F)c(F)c(C(c2ccc[n...,2.5,25,266


In [4]:
df["SMILES"][0]

'CC(C)(C)c1ccc(OCCCCOc2ccc(C(C)(C)C)cc2C=O)c(C=O)c1.CC(C)(C)c1ccc(OCCCCOc2ccc(C(C)(C)C)cc2C(c2ccc[nH]2)c2ccc[nH]2)c(C(c2ccc[nH]2)c2ccc[nH]2)c1.CCC(=O)O.CC(=O)O.O=[N+]([O-])c1ccccc1>>CC(C)(C)c1ccc2c(c1)-c1c3nc(c4c5ccc([nH]5)c(c5nc(c(c6ccc1[nH]6)-c1cc(C(C)(C)C)ccc1OCCCCOc1ccc(C(C)(C)C)cc1-4)C=C5)-c1cc(C(C)(C)C)ccc1OCCCCO2)C=C3.O.O'
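
Each entry in the `SMILES` column is a reaction SMILES: the molecules on the left of `>>` are the starting materials (reactants, acids, solvent), the molecules on the right are the products, and `.` separates individual molecules on each side. The sketch below shows one way to split such a string with plain Python. The helper name `split_reaction_smiles` and the toy reaction are illustrative assumptions, not part of the original notebook or dataset.

```python
def split_reaction_smiles(rxn):
    """Split a 'reactants>agents>products' reaction SMILES into three lists.

    Hypothetical helper for illustration. With the '>>' form used in this
    dataset, the middle (agents) field is simply empty.
    """
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected a reaction SMILES with exactly two '>' characters")
    # '.' separates the individual molecules within each field
    return tuple([mol for mol in part.split(".") if mol] for part in parts)


# Toy example (not taken from the dataset): pyrrole + benzaldehyde
rxn = "c1cc[nH]c1.O=Cc1ccccc1>>c1ccc(C(c2ccc[nH]2)c2ccc[nH]2)cc1.O"
reactants, agents, products = split_reaction_smiles(rxn)

print("reactants:", reactants)
print("agents:   ", agents)
print("products: ", products)
```

Applied to `df["SMILES"][0]`, the same call would return the five left-hand molecules as reactants and the porphyrin plus two waters as products.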

## Missing values

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values

SMILES         0
yield          0
Temperature    0
Ind            0
dtype: int64

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create the figure and axes for the subplots
fig, axs = plt.subplots(2, 1, figsize=(10, 8))

# Histogram of yields on the first subplot
sns.histplot(data=df, x='yield', bins=30, ax=axs[0])
axs[0].set_xlabel('Chemical yield (%)')
axs[0].set_ylabel('Number of occurrences')
axs[0].set_title('Distribution of chemical yields')

# Box plot of yields on the second subplot (y= makes the orientation explicit)
sns.boxplot(y=df["yield"], ax=axs[1])
axs[1].set_ylabel("Chemical yield (%)")
axs[1].set_title("Box plot of chemical yields")

# Adjust the spacing between subplots
plt.tight_layout()

# Display the plots
plt.show()

ModuleNotFoundError: No module named 'seaborn'

## 🤖 Chemical Language Model Embeddings

Just as natural language models use self-attention to compute the representation of each element (e.g. a word in a sentence) in relation to every other element, chemical language models apply the same principle to chemical units (e.g. atoms or SMILES tokens) instead of words.
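
To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention over a toy token sequence. Everything in it is an assumption for illustration: the tokenization of `CC=O`, the random embeddings, and the random projection matrices. Real models such as ChemBERTa use learned weights, several heads, and several layers.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token vectors; w_q, w_k, w_v: (d_model, d_model).
    Returns the context vectors and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ v, weights


rng = np.random.default_rng(0)
tokens = ["C", "C", "=", "O"]   # toy tokenization of the SMILES "CC=O"
d_model = 8
x = rng.normal(size=(len(tokens), d_model))   # random stand-in embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = self_attention(x, w_q, w_k, w_v)
print(context.shape)          # one context vector per token
print(weights.sum(axis=1))    # each row of attention weights sums to 1
```

Each row of `weights` says how much one token draws on every other token when its new representation is built.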

Two transformer-based chemical language models that provide learned embeddings are described below; the code in this notebook uses only ChemBERTa:

ChemBERTa: adapted for chemical SMILES from the RoBERTa architecture, with checkpoints pretrained on up to 77 million molecules (this notebook loads the `DeepChem/ChemBERTa-10M-MLM` checkpoint)

MoLFormer: another transformer-based model adapted for SMILES but trained on a larger dataset (1.1 billion molecules!)


References:

Chithrananda, S., Grand, G., & Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv:2010.09885.

Ahmad, W., Simon, E., Chithrananda, S., Grand, G., & Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. arXiv:2209.01712.

Ross, J., Belgodere, B., Chenthamarakshan, V., et al. (2022). Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4, 1256-1264.

ChemBERTa on HuggingFace Model Repo

MoLFormer on HuggingFace Model Repo

MoLFormer GitHub repo

In [7]:
# for chemical language models
from transformers import AutoModel, AutoTokenizer
import torch


In [8]:
smiles = df["SMILES"]

In [9]:
# load pre-trained ChemBERTa model checkpoint and tokenizer
cb_tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-10M-MLM')
cb_model = AutoModel.from_pretrained('DeepChem/ChemBERTa-10M-MLM')
cb_model.eval()

# tokenize SMILES
cb_encoded_inputs = cb_tokenizer(list(smiles), padding=True, truncation=True, return_tensors="pt")

# calculate embeddings
with torch.no_grad():
    outputs = cb_model(**cb_encoded_inputs)

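# extract pooled output
# note: this checkpoint has no trained pooler head, so the pooler weights are
# randomly initialized when the model is loaded (see the warning printed below)
cb_embeddings = outputs.pooler_output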

cb_embeddings_df = pd.DataFrame(cb_embeddings.numpy())
cb_embeddings_df.head()

Some weights of RobertaModel were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,...,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383
0,-0.162271,0.359961,-0.208631,0.212792,-0.184559,0.023665,0.031174,-0.056128,0.070213,0.246463,0.404678,0.222561,0.246697,0.569248,-0.286271,0.163123,-0.211469,-0.137589,0.333122,0.270672,-0.171357,-0.007442,0.026922,0.093674,0.56386,0.190829,-0.260796,0.335964,0.166725,-0.452981,-0.144069,0.137669,0.074681,-0.076224,0.409972,-0.091544,-0.143137,-0.533438,-0.177174,0.303646,-0.170809,0.113998,-0.180275,-0.298499,0.129155,-0.358678,-0.197952,-0.127823,0.095293,0.146926,0.200765,-0.178596,0.116211,-0.196614,0.198353,0.067136,0.16422,0.255172,-0.430976,-0.052471,0.122035,-0.120684,0.194026,0.059159,-0.242641,0.216599,-0.209428,-0.450314,-0.067776,0.322422,0.050678,-0.037782,0.081106,0.362837,-0.256037,-0.186787,-0.048138,-0.397506,-0.240361,-0.247374,0.147705,0.132116,0.320903,-0.256691,-0.02149,-0.365033,-0.155543,0.344853,0.302176,-0.139125,0.172635,-0.128772,0.321327,0.151302,-0.063909,-0.135832,0.33211,-0.003239,0.258752,0.01268,-0.174956,0.26075,0.035437,0.454084,0.084206,...,0.325775,0.087837,-0.047203,-0.375325,0.240884,0.096204,0.210107,0.152762,-0.285373,-0.281662,0.243171,0.335456,0.193422,-0.344888,-0.180593,0.024781,-0.214996,-0.305409,-0.109843,-0.137799,-0.499643,0.295993,0.223503,0.012261,-0.14791,0.103415,-0.0914,0.090393,0.1784,-0.128588,0.159457,0.131086,-0.192747,0.246661,-0.018331,-0.00498,0.06622,0.229715,0.022504,-0.275646,0.057845,0.330876,0.000617,-0.515452,0.235576,0.299304,0.124795,0.08653,0.011721,-0.124608,0.353909,0.176914,0.021789,-0.560504,-0.019877,0.02793,-0.026669,-0.124045,0.082781,-0.416438,0.100424,0.171194,-0.240402,-0.120383,-0.101247,-0.219663,-0.180439,0.171812,-0.37869,-0.122113,0.160866,0.131128,0.28467,0.330584,-0.42532,0.112695,0.506408,0.36175,-0.046845,0.007392,-0.070656,-0.199145,0.104572,-0.008363,-0.072674,-0.080588,-0.084459,0.144243,0.227459,-0.067456,-0.541151,0.081943,-0.373041,-0.3458,0.036262,0.095997,-0.096873,-0.06386,-0.114892,0.092941,-0.445563,-0.034386,-0.321318,-0.407256,0.045911
1,-0.164998,0.200312,-0.291708,0.385741,-0.113083,0.090308,-0.015352,0.234842,0.140149,0.200789,0.300456,0.268847,0.327487,0.552415,-0.058277,0.069845,-0.283775,0.059433,0.132053,0.206883,0.020623,-0.173855,0.225282,-0.017904,0.497691,0.216123,-0.304758,0.297272,0.06349,-0.155111,0.19279,0.090803,0.076976,0.226654,0.310238,-0.049298,0.095382,-0.430092,-0.348674,-0.033861,0.29753,0.052408,-0.273249,-0.018303,0.000791,-0.184625,-0.371987,-0.196453,0.265759,-0.08904,0.163914,-0.434184,0.054147,-0.180023,0.367596,0.134108,0.269924,0.013862,-0.086283,0.153577,-0.441175,-0.102963,0.16932,0.168424,-0.019549,0.066295,0.027417,-0.02225,0.115971,-0.026206,0.111903,0.134881,0.163826,0.415292,-0.045816,0.029169,-0.002534,-0.43791,-0.069225,0.127752,-0.087582,0.20903,-0.017341,-0.325876,0.031164,-0.381508,-0.209271,0.060853,-0.095384,-0.291828,-0.112984,-0.0854,-0.271536,-0.28751,-0.201039,-0.028334,0.257096,0.126726,-0.001215,0.055023,0.019923,0.145393,0.170775,0.345924,-0.001274,...,0.059508,0.037455,-0.226644,-0.200301,0.135569,-0.03571,0.130619,0.011335,-0.017847,-0.136202,0.259121,-0.040067,0.328404,-0.241552,0.123093,-0.204784,-0.206881,-0.301866,-0.098708,-0.209272,-0.543192,0.190023,0.23518,-0.119917,0.086322,0.212423,-0.129963,0.174801,0.025072,-0.205548,0.252248,0.091975,0.29269,0.279312,0.064004,0.02954,-0.140886,0.274939,-0.214205,0.232435,-0.021775,0.326521,-0.173589,-0.327937,0.211141,0.306258,-0.026692,0.191214,-0.011201,-0.158324,0.48552,0.04253,0.131386,-0.392145,-0.045617,0.089598,0.010491,-0.237086,0.0473,-0.110376,-0.070836,-0.116565,0.031171,-0.102659,0.007115,-0.085476,-0.344428,-0.038877,-0.376358,0.223386,0.124038,0.318288,0.010998,0.29055,-0.214707,0.292548,0.314998,0.234466,0.075865,0.162426,0.147017,-0.085037,-0.049484,-0.181991,0.153568,-0.151253,0.183836,0.133308,0.08893,-0.268663,-0.258682,0.156034,-0.331153,-0.366319,-0.134505,-0.029372,0.222066,0.003363,-0.056301,-0.256319,-0.349865,0.015541,-0.251664,-0.024831,-0.041127
2,0.087443,0.348354,0.075141,0.265771,-0.317597,-0.08073,0.031378,-0.159422,0.076782,0.450581,0.020486,0.412668,0.452576,0.266731,-0.088604,0.035567,-0.181696,-0.197046,0.319921,0.329228,-0.230394,-0.054677,0.264726,0.104968,0.342561,0.441863,-0.341266,0.059278,0.336389,-0.076413,0.208437,0.19156,0.204068,0.083635,0.235593,-0.106136,0.289786,-0.252061,-0.266143,0.307279,-0.119827,-0.172364,-0.145971,-0.299094,0.357934,-0.059064,-0.443192,-0.2118,0.157538,0.08295,-0.139568,-0.335272,0.136966,-0.431957,0.0646,-0.038943,0.23024,0.004617,-0.389729,-0.0428,-0.323701,0.019915,0.090443,-0.200131,-0.156367,-0.051447,-0.080336,-0.053585,0.072576,0.029933,0.189949,0.046997,0.251135,0.505619,-0.113831,-0.042439,0.008957,-0.528493,-0.237793,-0.154873,0.127994,0.068931,0.257621,-0.324349,-0.068819,-0.075986,-0.260203,0.25797,-0.067435,-0.132135,0.114913,-0.042286,-0.296356,-0.137618,0.058961,-0.165032,0.498271,-0.165034,0.341459,0.013973,-0.128801,0.12968,-0.112283,0.303148,-0.040948,...,0.332492,0.02587,-0.151847,-0.217514,0.112499,-0.370836,0.152899,0.015132,-0.226122,-0.105108,0.320797,0.272942,0.320707,-0.060986,-0.301983,-0.103061,-0.183105,-0.022568,-0.245966,-0.115399,-0.429196,0.369226,0.339843,0.007097,0.079089,-0.031242,-0.007945,-0.031585,0.197342,-0.037872,0.163866,-0.137193,-0.108474,0.155755,-0.125552,-0.024259,-0.114572,0.037851,-0.132192,-0.03692,0.108299,0.408869,0.09432,-0.339686,0.086932,0.3726,0.080849,0.171356,-0.15516,-0.17394,0.399333,0.123328,-0.080917,-0.506595,-0.073971,0.09876,0.042386,-0.131056,0.021697,-0.016402,-0.091808,0.026021,-0.084562,0.02777,-0.086904,-0.062518,-0.211319,0.055575,-0.173359,-0.016561,-0.078423,0.076999,0.092031,0.185906,-0.532436,0.376591,0.106628,0.114349,0.102644,0.191851,0.03762,-0.343103,-0.184611,-0.083883,0.058932,0.040169,-0.105352,0.038794,-0.052485,0.007382,-0.301429,0.04378,-0.478187,-0.103193,-0.006233,0.230008,0.009344,-0.184263,0.031506,0.062032,-0.15129,0.162426,-0.353901,-0.378658,-0.27125
3,-0.17626,0.3315,-0.217019,0.218216,-0.184135,0.013105,0.025132,-0.038927,0.075716,0.217835,0.394953,0.209932,0.223927,0.594058,-0.284427,0.171727,-0.221581,-0.123168,0.306397,0.265301,-0.183306,0.007575,0.060609,0.096721,0.55645,0.177043,-0.261085,0.32019,0.155211,-0.43618,-0.188859,0.121432,0.058139,-0.102365,0.389624,-0.113831,-0.169072,-0.526837,-0.186467,0.306183,-0.118291,0.102877,-0.162771,-0.291369,0.117419,-0.337459,-0.190894,-0.094893,0.122876,0.110975,0.203009,-0.16131,0.10342,-0.196771,0.20909,0.066207,0.156469,0.282545,-0.436368,-0.041194,0.152029,-0.163704,0.208622,0.081991,-0.222724,0.200585,-0.189137,-0.445165,-0.061106,0.329794,0.043275,-0.031708,0.044975,0.330188,-0.257523,-0.191809,-0.026744,-0.394451,-0.216947,-0.258086,0.136926,0.127646,0.341987,-0.242775,-0.02042,-0.342146,-0.134044,0.349911,0.297647,-0.128103,0.177103,-0.146584,0.326321,0.15435,-0.100294,-0.146707,0.318168,0.012583,0.245729,-0.015691,-0.147501,0.265394,0.018757,0.45225,0.058254,...,0.286873,0.075731,-0.060459,-0.369617,0.230923,0.091402,0.222454,0.119655,-0.288558,-0.32972,0.243055,0.333877,0.196843,-0.336662,-0.154158,0.026134,-0.223284,-0.306917,-0.09845,-0.144181,-0.496922,0.273367,0.190322,0.032943,-0.149621,0.097822,-0.119656,0.101924,0.198588,-0.158726,0.159052,0.11375,-0.158031,0.269808,-0.0208,0.012123,0.047193,0.227319,0.013312,-0.267428,0.06636,0.320278,-0.023354,-0.514239,0.228328,0.292068,0.109983,0.072455,-0.001843,-0.089084,0.361358,0.171718,0.038441,-0.574028,-0.042479,0.027775,0.003253,-0.101972,0.103554,-0.439865,0.107998,0.138701,-0.236071,-0.147181,-0.083717,-0.198127,-0.157046,0.17063,-0.361139,-0.091939,0.160136,0.125527,0.263363,0.338415,-0.404516,0.117673,0.499794,0.387065,-0.062616,-0.006799,-0.106931,-0.18095,0.097889,0.019412,-0.092774,-0.104903,-0.104122,0.16784,0.250242,-0.074696,-0.541867,0.063428,-0.365987,-0.374295,0.042924,0.080642,-0.067237,-0.074299,-0.152272,0.088243,-0.449713,-0.005897,-0.332854,-0.401828,0.084737
4,-0.137525,0.265025,-0.010123,0.40796,-0.002231,0.017766,0.034262,-0.041863,0.038054,0.227417,0.051316,0.236605,0.482495,0.572863,-0.236792,0.023759,-0.301567,-0.119358,0.053671,0.26052,-0.101332,0.046685,0.147247,-0.146464,0.226873,0.238391,-0.332869,0.323694,0.209686,-0.173994,0.087302,-0.117985,0.112142,0.146261,0.414851,-0.071862,-0.006431,-0.225612,-0.289347,0.048542,0.198003,-0.126273,-0.273958,0.148164,0.377191,-0.131095,-0.500679,-0.194147,0.210539,-0.103456,-0.074326,-0.274666,0.255624,-0.269893,-0.066593,-0.059447,0.350845,0.024596,-0.328549,0.117672,-0.323744,-0.052897,0.043723,0.088231,0.086536,0.076209,0.017663,-0.033744,-0.072464,-0.117434,0.319621,0.174219,0.200891,0.483039,0.13165,-0.053119,0.15173,-0.206849,-0.073102,0.159188,-0.000124,0.230478,0.299481,-0.321978,-0.113721,-0.273351,-0.057563,0.163494,0.025687,-0.194245,-0.078695,-0.11346,-0.400821,-0.138526,-0.270769,-0.235079,0.165924,0.054616,0.05985,0.018769,-0.122229,-0.015627,0.184568,0.245649,-0.183787,...,0.092792,0.097946,-0.038889,-0.039618,0.280452,-0.132929,0.143625,0.038979,0.037168,-0.363583,0.308713,0.042017,0.307992,-0.325368,-0.254257,-0.111948,0.026055,-0.164983,-0.260877,0.108569,-0.452726,0.237875,0.161968,-0.092039,0.181199,0.112137,-0.039931,0.002873,0.318678,-0.083134,0.093927,-0.057268,0.006241,0.388679,-0.371659,0.03506,-0.17827,0.105218,-0.184556,0.161999,0.172466,0.31776,-0.047731,-0.385257,0.129601,0.266923,-0.046403,0.292638,-0.080534,0.247024,0.574939,-0.089122,-0.027367,-0.344002,-0.015118,0.071419,0.230483,-0.17233,-0.065445,-0.251528,-0.018486,0.147374,0.036158,-0.306249,0.034631,0.169968,-0.115037,-0.008717,-0.322422,0.3297,0.164823,0.152219,-0.002751,0.218693,-0.033536,0.090824,0.218572,0.33592,0.058331,0.111005,0.151942,-0.21839,-0.239677,0.230013,-0.029526,-0.029864,0.108796,0.082263,0.230274,-0.396407,-0.154371,0.200644,-0.391888,-0.319362,-0.082632,0.05096,0.341008,0.04678,-0.221067,-0.215818,-0.455212,0.067931,-0.197712,-0.222433,-0.051372
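
The warning above reports that the `roberta.pooler.dense` weights were newly initialized, so `pooler_output` passes the first-token state through an untrained, random layer. A common alternative, which is not what this notebook does, is to average the last hidden states over the non-padding tokens. The sketch below illustrates that arithmetic in NumPy on small made-up arrays standing in for `outputs.last_hidden_state` and the tokenizer's `attention_mask`; the function name `masked_mean_pool` is hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) positions.

    hidden_states:  (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array with 1 for tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts


# Toy stand-ins (assumed shapes): 2 sequences, 3 tokens, 2 hidden dimensions
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # last token is padding
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)   # the padded position in the first sequence is ignored
```

With the real model outputs, the same multiply, sum, and divide steps would be applied to the hidden-state tensor and the attention mask, giving one 384-dimensional vector per reaction.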


In [10]:
X = cb_embeddings_df
X.shape

(1249, 384)

In [11]:
y = df["yield"]
y.shape

(1249,)
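
With a feature matrix `X` and a target `y` in hand, a natural next step toward the stated objective is to fit a baseline regressor and score it on held-out reactions. The notebook does not show this step, so the following is only a sketch under stated assumptions: it uses synthetic arrays in place of the real embeddings and yields, a random 80/20 split, and a closed-form ridge regression written in NumPy. The names `X_demo`, `y_demo`, `fit_ridge`, and `r2_score` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for X (embeddings) and y (yields); NOT the real data
n_samples, n_features = 200, 16
X_demo = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=n_samples)

# Random 80/20 train/test split
order = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)
train_idx, test_idx = order[:n_train], order[n_train:]


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


w, b = fit_ridge(X_demo[train_idx], y_demo[train_idx], alpha=1.0)
r2 = r2_score(y_demo[test_idx], X_demo[test_idx] @ w + b)
print(f"test R^2 on synthetic data: {r2:.3f}")
```

On the real data, `X.to_numpy()` and `y.to_numpy()` would replace the synthetic arrays. The score on synthetic data says nothing about how well yields can actually be predicted.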

In [12]:
# Save the DataFrame to a CSV file
X.to_csv("cb_embeddings_df.csv")

In [None]:
def filter_descriptors_low(data, threshold):
    # Drop the non-descriptor columns
    data = data.drop(["SMILES", "Ind", "SMILES_2", "mols"], axis=1)
    # Work on a copy of the descriptors
    descriptors = data.copy()
    
    # Compute the correlation matrix and zero out the diagonal
    corr = descriptors.corr()
    for index in corr.index:
        corr.loc[index, index] = 0
    
    # Collect the names of descriptors that are not strongly correlated with any other
    descriptors_not_correlated = []
    
    # Check each column individually; the absolute value is used so that
    # strong negative correlations are caught too
    for col in corr.columns:
        if all(corr[col][corr[col].notna()].abs() <= threshold):
            descriptors_not_correlated.append(col)
    
    print('Number of descriptors:', len(descriptors_not_correlated))
    
    # Return the filtered data
    return data[descriptors_not_correlated]

In [None]:
df_3_desc = filter_descriptors_low(df_2, 0.9)
df_3_desc.shape

In [None]:
df_3_desc.columns

In [None]:
import pandas as pd
import numpy as np

def filter_descriptors_hard(data, threshold):
    # Drop the non-descriptor columns
    data = data.drop(["SMILES", "Ind", "SMILES_2", "mols"], axis=1)
    # Work on a copy of the descriptors
    descriptors = data.copy()
    
    # Compute the correlation matrix and zero out the diagonal
    corr = descriptors.corr()
    for index in corr.index:
        corr.loc[index, index] = 0
    
    # Identify groups of strongly correlated descriptors
    corr_mask = corr.abs() > threshold
    groups = []
    for col in corr.columns:
        correlated_with = corr.index[corr_mask[col]].tolist()
        if correlated_with and col not in sum(groups, []):
            groups.append(correlated_with)
    
    # Pick one descriptor per group
    selected_descriptors = []
    for group in groups:
        # Keep the item with the smallest sum of above-threshold correlations
        min_correlation_sum = np.inf
        selected = None
        for item in group:
            # Only non-NaN values are included in the correlation sum
            corr_sum = corr[item][corr_mask[item] & corr[item].notna()].sum()
            if corr_sum < min_correlation_sum:
                min_correlation_sum = corr_sum
                selected = item
        selected_descriptors.append(selected)
    
    print('Selected descriptors:', selected_descriptors)
        
    # Return the filtered data and the target column
    return data[selected_descriptors], data['yield']
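
For comparison, a widely used and shorter way to prune correlated features is to look only at the upper triangle of the absolute correlation matrix and drop every column that is strongly correlated with an earlier one, so the first member of each correlated pair is kept. The sketch below is a different technique from the two functions above, not a rewrite of them. The function name `drop_correlated` and the demo columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.9):
    """Drop each column whose |correlation| with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Synthetic demo: "a_copy" is almost identical to "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
demo = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=100), "b": b})

filtered = drop_correlated(demo, threshold=0.9)
print(list(filtered.columns))
```

Because it takes any numeric DataFrame, the same function could be applied directly to the embedding matrix `X`, whose columns are the integers 0 to 383.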


In [None]:
df = calculate_descriptors(df)