In [87]:
import pandas as pd
import numpy as np
import pickle
from tqdm import tqdm
from itertools import product
import pdfplumber
import pubchempy as pcp
tqdm.pandas()
import rdkit
from rdkit import Chem

First, let's import and curate all the datasets from the [*Chemicals*](https://chemicals.readthedocs.io/index.html) library. We do it iteratively, merging the tables after each step. The source tables store Pc in pascals and we want it in **bar**, so we divide the pressure by 100000.

In [2]:
df_yaws = pd.read_table('Yaws Collection.tsv')
df_yaws['Pc'] = df_yaws['Pc']/100000

In [3]:
df_passut = pd.read_table('PassutDanner1973.tsv')
df_passut['Pc'] = df_passut['Pc']/100000

In [4]:
merged = pd.concat([df_yaws, df_passut], axis = 0).reset_index(drop = True)

In [5]:
df_PSRK = pd.read_table('Appendix to PSRK Revision 4.tsv')
df_PSRK['Pc']  = df_PSRK['Pc']/100000

In [6]:
merged = pd.concat([merged, df_PSRK], axis = 0).reset_index(drop = True)

Some datasets contain information that is not relevant to this work, so we drop the unnecessary columns. In the example below, we drop "Tc_error", "Pc_error" and "Vc_error".

In [7]:
df_CRC = pd.read_table('CRCCriticalOrganics.tsv')
df_CRC.drop(columns = ['Tc_error', 'Pc_error', 'Vc_error'], inplace = True)
df_CRC['Pc']  = df_CRC['Pc']/100000

In [8]:
merged = pd.concat([merged, df_CRC], axis = 0).reset_index(drop = True)

In [9]:
df_DIPP = pd.read_table('DIPPRPinaMartines.tsv')
df_DIPP['Pc'] = df_DIPP['Pc']/100000

In [10]:
merged = pd.concat([merged, df_DIPP], axis = 0).reset_index(drop = True)

In [11]:
df_IUPAC = pd.read_table('IUPACOrganicCriticalProps.tsv')
df_IUPAC.drop(columns = ['MW', 'Reference'], inplace = True)
df_IUPAC['Pc'] = df_IUPAC['Pc']/100000

In [12]:
merged = pd.concat([merged, df_IUPAC], axis = 0).reset_index(drop = True)
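The load, convert and merge steps above repeat the same pattern, so they could also be written as one loop. The following is only a sketch of that idea on toy frames with illustrative values; `concat_with_pc_in_bar` and `PA_PER_BAR` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

PA_PER_BAR = 100_000  # 1 bar = 100000 Pa


def concat_with_pc_in_bar(frames, drop_columns=()):
    """Drop unwanted columns, convert Pc from Pa to bar, and stack the frames."""
    converted = []
    for frame in frames:
        # errors="ignore" lets the same column list be used for every table
        frame = frame.drop(columns=list(drop_columns), errors="ignore").copy()
        frame["Pc"] = frame["Pc"] / PA_PER_BAR
        converted.append(frame)
    return pd.concat(converted, axis=0).reset_index(drop=True)


# Toy frames standing in for the .tsv tables (values are illustrative only).
toy_a = pd.DataFrame({"CAS": ["0-00-0"], "Tc": [500.0], "Pc": [4_000_000.0]})
toy_b = pd.DataFrame({"CAS": ["0-00-1"], "Tc": [600.0], "Pc": [3_000_000.0], "Tc_error": [1.0]})

toy_merged = concat_with_pc_in_bar([toy_a, toy_b], drop_columns=["Tc_error"])
print(toy_merged)
```

In the notebook, the list of frames would come from calling `pd.read_table` on each file.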

All the datasets processed so far contain the CAS number or the name of the molecule, so we use the [*PubChemPy*](https://pubchempy.readthedocs.io/en/latest/) library to get the structure (as SMILES). The function below takes a CAS number (as a string) and returns the canonical SMILES, or NaN if the lookup fails.

In [43]:
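def CAS_to_smiles(cas):
    """Look up a CAS number on PubChem and return RDKit-canonical SMILES, or NaN on failure."""
    try:
        # PubChem lists CAS numbers among compound synonyms, so they are queried via 'name'
        comp = pcp.get_compounds(cas, 'name')[0]
        smi = comp.canonical_smiles
        mol = Chem.MolFromSmiles(smi)
        can_smi = Chem.MolToSmiles(mol)
        return can_smi
    except Exception:
        # no PubChem hit, a network error, or a SMILES string that RDKit cannot parse
        return np.nan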

We observed that in some cases *PubChemPy* cannot resolve the CAS number correctly. For these cases we created a second function that takes the name of the molecule and gets its SMILES.

In [85]:
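def name_to_smiles(name):
    """Look up a chemical name on PubChem and return RDKit-canonical SMILES, or NaN on failure."""
    try:
        comp = pcp.get_compounds(name, 'name')[0]
        smi = comp.canonical_smiles
        mol = Chem.MolFromSmiles(smi)
        can_smi = Chem.MolToSmiles(mol)
        return can_smi
    except Exception:
        # no PubChem hit, a network error, or a SMILES string that RDKit cannot parse
        return np.nan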

We apply the first function to get the SMILES representation of all molecules.

In [50]:
merged['SMILES'] = merged['CAS'].progress_apply(CAS_to_smiles)

 33%|████████████████████████▍                                                  | 3986/12206 [46:58<1:30:12,  1.52it/s][21:38:41] Explicit valence for atom # 1 Cl, 7, is greater than permitted
 34%|█████████████████████████▌                                                 | 4159/12206 [48:53<1:23:41,  1.60it/s][21:40:36] Explicit valence for atom # 1 Br, 3, is greater than permitted
 34%|█████████████████████████▌                                                 | 4167/12206 [48:59<1:32:45,  1.44it/s][21:40:42] Explicit valence for atom # 1 Br, 5, is greater than permitted
 34%|█████████████████████████▋                                                 | 4189/12206 [49:12<1:24:49,  1.58it/s][21:40:55] Explicit valence for atom # 1 Cl, 3, is greater than permitted
 37%|███████████████████████████▌                                               | 4478/12206 [52:35<1:34:16,  1.37it/s][21:44:18] Explicit valence for atom # 1 Cl, 5, is greater than permitted
 39%|█████████████████████████████ 
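Each lookup above is a network request running at roughly 1.5 items per second, and the same CAS number can appear in more than one source table. A minimal sketch of one way to avoid repeated requests is to resolve each distinct key only once. The names `lookup_unique` and `fake_resolve` are hypothetical and are not part of the original notebook; `fake_resolve` merely stands in for a slow lookup such as `CAS_to_smiles`.

```python
def lookup_unique(keys, resolve):
    """Call resolve once per distinct key and return results aligned with keys."""
    cache = {}
    for key in keys:
        if key not in cache:
            cache[key] = resolve(key)
    return [cache[key] for key in keys]


# Hypothetical stand-in for a slow network lookup; it records every call it receives.
calls = []


def fake_resolve(cas):
    calls.append(cas)
    return f"SMILES-for-{cas}"


results = lookup_unique(["a", "b", "a", "a"], fake_resolve)
print(len(calls), results)
```

With pandas, the same idea could be expressed by building the cache over `merged['CAS'].unique()` and then calling `merged['CAS'].map(cache)`.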

Here we split the result into two tables. "merged_non_valid" holds the rows where the **CAS_to_smiles** function could not retrieve SMILES, so its SMILES column contains NaNs. "merged_valid" holds the rows where the function succeeded. We then apply **name_to_smiles** to "merged_non_valid" to try once more to get the structures of those molecules.

In [67]:
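# .copy() gives independent tables, so the later column assignment does not trigger SettingWithCopyWarning
merged_non_valid = merged[merged['SMILES'].isnull()].copy()
merged_valid = merged.dropna(axis = 0, subset = 'SMILES').copy()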

In [87]:
merged_non_valid['SMILES'] = merged_non_valid['Chemical'].progress_apply(name_to_smiles)

 11%|████████▌                                                                      | 127/1170 [01:22<10:29,  1.66it/s][23:24:39] Explicit valence for atom # 1 Cl, 7, is greater than permitted
 14%|███████████                                                                    | 163/1170 [01:44<10:07,  1.66it/s][23:25:01] Explicit valence for atom # 1 Br, 3, is greater than permitted
 14%|███████████                                                                    | 164/1170 [01:44<10:07,  1.66it/s][23:25:01] Explicit valence for atom # 1 Br, 5, is greater than permitted
 14%|███████████▏                                                                   | 165/1170 [01:45<10:29,  1.60it/s][23:25:02] Explicit valence for atom # 1 Cl, 3, is greater than permitted
 17%|█████████████▎                                                                 | 198/1170 [02:06<09:44,  1.66it/s][23:25:23] Explicit valence for atom # 1 Cl, 5, is greater than permitted
 19%|███████████████               

We drop the molecules for which both lookups failed...

In [96]:
merged_non_valid = merged_non_valid.dropna(axis = 0, subset = 'SMILES')
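The two-pass lookup above (CAS number first, chemical name second) can also be sketched as a single function. This is only an illustration: `resolve_with_fallback` is a hypothetical name, the two dictionaries stand in for the PubChem queries, and `None` is used for a failed lookup where the notebook uses NaN.

```python
def resolve_with_fallback(cas, name, by_cas, by_name):
    """Try the CAS lookup first; fall back to the name lookup if it returns None."""
    smiles = by_cas(cas)
    if smiles is None and name is not None:
        smiles = by_name(name)
    return smiles


# Hypothetical in-memory lookups standing in for the network queries.
cas_table = {"64-17-5": "CCO"}         # ethanol, resolved by CAS number
name_table = {"benzene": "c1ccccc1"}   # benzene, resolved only by name here

found_by_cas = resolve_with_fallback("64-17-5", "ethanol", cas_table.get, name_table.get)
found_by_name = resolve_with_fallback("71-43-2", "benzene", cas_table.get, name_table.get)
not_found = resolve_with_fallback("0-00-0", "no such compound", cas_table.get, name_table.get)
print(found_by_cas, found_by_name, not_found)
```

Rows for which the function returns `None` correspond to the molecules dropped in the cell above.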

We get the final table, which contains no *NaN*s in the **SMILES** column. We drop **Zc** and focus on the **Tc**, **Pc** and **omega** values, because the paper dataset that we process later does not contain **Zc** values.

In [106]:
final_table = pd.concat([merged_valid, merged_non_valid], axis = 0)
final_table.drop(columns = ['Zc'], inplace = True)
final_table.to_csv('from_chemicals_datasets.csv')


Checking for remaining *NaN*s: the **SMILES** column has none, while the property columns still have gaps.

In [108]:
final_table.isnull().sum()

CAS            0
Chemical    1785
Tc             8
Pc           295
Vc           607
omega       4557
SMILES         0
dtype: int64

Since one molecule can be written as several different SMILES strings, we need a single algorithm that produces a unique SMILES for each unique molecule. To avoid installing yet more libraries, this simple function converts the SMILES string to an RDKit molecule object and then back to SMILES. The [*RDKit*](https://www.rdkit.org/) canonicalization algorithm gives one SMILES string per molecule.

The function name is misspelled, but let's forget about it :-)

In [21]:
def caconicalize_SMILES(smi):
    mol = Chem.MolFromSmiles(smi)
    can_smi = Chem.MolToSmiles(mol)
    return can_smi
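As a quick illustration of why the round trip helps, the sketch below canonicalizes two different spellings of ethanol and checks that they collapse to the same string. The helper `safe_canonical_smiles` is a hypothetical variant, not the notebook's function: it returns `None` when RDKit cannot parse the input, instead of raising.

```python
from rdkit import Chem


def safe_canonical_smiles(smi):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # RDKit returns None for unparsable input
        return None
    return Chem.MolToSmiles(mol)


ethanol_a = safe_canonical_smiles("OCC")
ethanol_b = safe_canonical_smiles("C(O)C")
unparsable = safe_canonical_smiles("C1CC")  # ring bond 1 is never closed

print(ethanol_a == ethanol_b, unparsable)
```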

In [32]:
final_table = pd.read_csv('from_chemicals_datasets.csv', index_col=0).reset_index(drop = True).drop(columns = ['CAS', 'Chemical'])

Here we process the dataset from the [article](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00546). We load the .csv file, drop the unnecessary columns and rename the useful ones.

In [26]:
article_ds = pd.read_csv('./ci3c00546_si_002/CritProp_SI/all_data/experimental_data/critprop_data_only_smiles_mean_value_expt.csv')
article_ds = article_ds.rename(columns = {'smiles':'SMILES', 'Tc (K)':'Tc', 'Pc (bar)':'Pc', 'omega (-)':'omega'}).drop(columns = ['rhoc (mol/L)', 'Tb (K)', 'Tm (K)', 'dHvap (kJ/mol)', 'dHfus (kJ/mol)'])

We get canonicalized SMILES...

In [28]:
article_ds['SMILES'] = article_ds['SMILES'].progress_apply(caconicalize_SMILES)

100%|████████████████████████████████████████████████████████████████████████████| 5680/5680 [00:00<00:00, 7237.86it/s]


And finally concatenate tables

In [35]:
concatenated_table = pd.concat([final_table, article_ds])

We observed that the table contains a lot of duplicates: most molecules occur in several datasets. As the values for the same molecule can differ between sources, we simply average them. We group the table by SMILES and aggregate **Tc**, **Pc** and **omega** with *mean*. This averages the values when there is more than one and returns the value unchanged when there is only one.

In [80]:
grouped_table = concatenated_table.groupby(by = 'SMILES').agg({'Tc':'mean', 
                                               'Pc':'mean',
                                               'omega':'mean'}).reset_index()
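A small sketch of how this aggregation behaves, using a toy table with illustrative values only (not taken from the datasets). It shows that *mean* skips missing values within a group and yields NaN only when every value in the group is missing. The extra `n_Tc` column, which counts how many non-missing values were averaged, is an addition for illustration and is not in the original notebook.

```python
import pandas as pd

# Toy data: two rows per molecule, with some missing values.
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "CC", "CC"],
    "Tc": [514.0, 516.0, 305.0, float("nan")],
    "omega": [float("nan"), float("nan"), 0.5, 0.5],
})

averaged = (
    toy.groupby("SMILES")
    .agg(Tc=("Tc", "mean"), omega=("omega", "mean"), n_Tc=("Tc", "count"))
    .reset_index()
)
print(averaged)
```

Here the two **Tc** values of `CCO` are averaged, the single available **Tc** of `CC` is returned as is, and **omega** of `CCO` stays NaN because no source provides it.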

Let's add a **mol** column with RDKit molecule objects to facilitate data curation.

In [81]:
grouped_table['mol'] =  grouped_table['SMILES'].progress_apply(Chem.MolFromSmiles)

100%|███████████████████████████████████████████████████████████████████████████| 7533/7533 [00:00<00:00, 12000.14it/s]


And keep only the organic molecules. With an RDKit substructure match we check whether a molecule contains an aromatic or aliphatic carbon atom, store the result in a mask column and use it to select the organic molecules of interest.

In [83]:
grouped_table['Is_organic'] = grouped_table['mol'].progress_apply(lambda x: x.HasSubstructMatch(Chem.MolFromSmarts('[C,c]')))
grouped_table = grouped_table[grouped_table['Is_organic']]

100%|███████████████████████████████████████████████████████████████████████████| 7533/7533 [00:00<00:00, 57280.39it/s]


We also remove molecules that consist of more than one fragment: salts, mixtures, etc. We count the fragments with an RDKit function, store the result in a mask column and use it to keep the valid molecules.

In [84]:
grouped_table['num_fragments'] = grouped_table['mol'].progress_apply(lambda x: len(Chem.GetMolFrags(x)) < 2)
grouped_table = grouped_table[grouped_table['num_fragments']]

100%|██████████████████████████████████████████████████████████████████████████| 7115/7115 [00:00<00:00, 171389.28it/s]
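The two filters above can be sketched together on a few hand-picked SMILES strings. The helper `is_single_organic` is a hypothetical name used for illustration; it compiles the SMARTS pattern once instead of inside the lambda, and treats unparsable input as not passing.

```python
from rdkit import Chem

CARBON = Chem.MolFromSmarts("[C,c]")  # aliphatic or aromatic carbon


def is_single_organic(smi):
    """True if the molecule contains carbon and consists of exactly one fragment."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return False
    has_carbon = mol.HasSubstructMatch(CARBON)
    n_fragments = len(Chem.GetMolFrags(mol))  # one tuple of atom indices per fragment
    return has_carbon and n_fragments == 1


examples = ["c1ccccc1", "O", "CC(=O)[O-].[Na+]", "O=C=O"]
checks = {smi: is_single_organic(smi) for smi in examples}
print(checks)
```

Benzene passes, water fails the carbon test and sodium acetate fails the fragment test. Note that carbon dioxide also passes, since the carbon-atom criterion does not distinguish it from organic compounds.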


And save the table

In [86]:
grouped_table.to_csv('grouped_table.csv')