# Getting the SMILES code from CAS number and Compound Names:

Updated on: 2022-12-15 13:37:39 CET

Author: Abzer Kelminal (abzer.shah@uni-tuebingen.de)<br>
Input file format: .csv /.txt /.xlsx files<br>
Outputs: .csv file <br>
Dependencies: pandas, cirpy, openpyxl <br>

<div class="alert alert-block alert-info">
<b> TIP:</b> Lines that start with '#' are comments that explain what the code does. A line of code with '#' in front of it is commented out and will not be executed. You can run such a line by uncommenting it (removing the '#').
</div>

In [2]:
# Importing datetime function from datetime module
from datetime import datetime
 
# returns current date and time
now = datetime.now().replace(microsecond=0)
print("now = ", now)

now =  2022-12-15 14:44:57


In [3]:
#Installing necessary packages if not present already:
!pip install pandas cirpy openpyxl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cirpy
  Downloading CIRpy-1.0.2.tar.gz (20 kB)
Building wheels for collected packages: cirpy
  Building wheel for cirpy (setup.py) ... done
  Created wheel for cirpy: filename=CIRpy-1.0.2-py3-none-any.whl size=7275 sha256=a74b8ea22df797acd9b6369e1b342b744568e2db894e2ec6ae616e6cb0b46195
  Stored in directory: /root/.cache/pip/wheels/f3/eb/17/f92433a13fee7d374ef246df6adc1a58ba07f7969d72aee1f1
Successfully built cirpy
Installing collected packages: cirpy
Successfully installed cirpy-1.0.2


In [4]:
#Importing necessary modules 
import os             #lets Python interact with the user's operating system
import pandas as pd   #pandas is python's data analysis library
import cirpy          # CIRPy:  Chemical Identifier Resolver in Python

## For Google Colab:
Run the cell below if you are running the script in Google Colab. Since Colab runs in the cloud, it cannot access your local disk the way Jupyter Notebook does. A workaround is to put your input files in a Google Drive folder and mount the Drive with the code below.

In [1]:
# To mount Google drive:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
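
If the same notebook is used both locally and in Colab, the mount step can be guarded so that it only runs in Colab. The sketch below assumes the commonly used check that the `google.colab` module is already loaded in a Colab session; `running_in_colab` is a hypothetical helper name, not part of the original notebook.

```python
import sys

def running_in_colab():
    """Return True when the google.colab module is loaded (assumed to mean Colab)."""
    return "google.colab" in sys.modules

if running_in_colab():
    # Only reached in Colab, where this import is available
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("Not running in Colab, skipping the Drive mount.")
```

In a local Jupyter session the cell just prints the message, so it is safe to leave in place.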


Once the Drive is mounted, you can access your Drive files by following the steps shown in the images. After locating the folder for the analysis, copy its path and paste it when the next code cell asks for it, to set it as your working directory. <br>

**Accessing the Drive in Google Colab environment:**
![Google-Colab Files Upload](https://github.com/abzer005/Images-for-Jupyter-Notebooks/blob/main/path_showing_files_in_Colab.png?raw=true)

**Copy the path of the folder in the drive for setting as working directory:**
![Google-Colab Files Upload](https://github.com/abzer005/Images-for-Jupyter-Notebooks/blob/main/path_to_navigate_to_drive_in_Colab.png?raw=true)

When you run the code below, it displays an input box where you can enter the path of the folder containing all your input files (on your local computer, or in the mounted Drive when using Colab). That folder is set as your working directory, so you can access all the files within it. <br><font color ="red"> Note: Once you run the next cell, make sure you enter a path before proceeding to the following cell. Otherwise, the cell keeps waiting for your input and keeps the kernel busy, and you will get an error. </font>

In [7]:
#Setting the working directory
directory = input("Enter the path of the folder with input files:\n")
os.chdir(directory)

Enter the path of the folder with input files:
/content/drive/MyDrive/get_smiles
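
A mistyped or quoted path makes `os.chdir` fail with an error that is not always easy to read. As a minimal sketch, the path can be cleaned and checked before changing directory; `set_working_directory` is a hypothetical helper name, and stripping surrounding quotes and spaces is an assumption about how paths typically get pasted.

```python
import os

def set_working_directory(path):
    """Change to `path` after checking that it is an existing folder."""
    # Copied paths often carry surrounding spaces or quotes
    cleaned = path.strip().strip('"').strip("'")
    if not os.path.isdir(cleaned):
        raise NotADirectoryError(f"Not a folder: {cleaned!r}")
    os.chdir(cleaned)
    return os.getcwd()

# Usage in the notebook:
# directory = input("Enter the path of the folder with input files:\n")
# set_working_directory(directory)
```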


In [8]:
# Getting all the files in the working directory
files = sorted([os.path.basename(item) for item in os.listdir() if os.path.isfile(item)])
pd.DataFrame(files, index=range(1, len(files)+1), columns=["Filename"])

Unnamed: 0,Filename
1,compound_list_testdata.xlsx


In [9]:
#reading the input file
review = pd.read_excel("compound_list_testdata.xlsx")

If the input file is a .csv or .txt file, uncomment the corresponding line below (remove the # at the beginning of the line):

In [None]:
#review = pd.read_csv('compound_list.csv')              #for csv files
#review = pd.read_csv("compound_list.txt", sep='\t')    #for tab-separated txt files
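
Instead of commenting and uncommenting lines, the reader can be chosen from the file extension. This is only a sketch: `READERS` and `pick_reader` are hypothetical names, and it assumes that .txt files are tab-separated, as in the commented line above.

```python
import os

# Maps a file extension to the pandas reader name and its keyword arguments
READERS = {
    ".xlsx": ("read_excel", {}),
    ".csv": ("read_csv", {}),
    ".txt": ("read_csv", {"sep": "\t"}),  # assumes tab-separated text
}

def pick_reader(filename):
    """Return (pandas function name, keyword arguments) for a filename."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in READERS:
        raise ValueError(f"Unsupported input format: {extension!r}")
    return READERS[extension]

# Usage in the notebook (pandas already imported as pd):
# name, kwargs = pick_reader("compound_list_testdata.xlsx")
# review = getattr(pd, name)("compound_list_testdata.xlsx", **kwargs)
```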

In [10]:
review.head() #previewing the top of the table: by default the first 5 rows, with all columns

Unnamed: 0,ID,Compound,CAS_Registry_Number,PubChem
0,com_1,"1-(3,4-dichlorophenyl)-3-methylurea",3567-62-2,https://pubchem.ncbi.nlm.nih.gov/compound/1-_3...
1,com_2,Phenol,108-95-2,https://pubchem.ncbi.nlm.nih.gov/compound/996
2,com_3,Testosterone,58-22-0,https://pubchem.ncbi.nlm.nih.gov/compound/6013
3,com_4,Tetrachloroethylene,127-18-4,https://pubchem.ncbi.nlm.nih.gov/compound/31373
4,com_5,Tetraconazole,112281-77-3,https://pubchem.ncbi.nlm.nih.gov/compound/80277


In [11]:
#trying CIR resolver for one CAS code
cirpy.resolve('3567-62-2', 'smiles')

'CNC(=O)Nc1ccc(Cl)c(Cl)c1'
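
When an identifier cannot be resolved, `cirpy.resolve` is expected to return `None` rather than a SMILES string, so it can help to turn that into an explicit label. The sketch below takes the resolver as an argument; `smiles_or_label` and `fake_resolver` are hypothetical names, and `fake_resolver` is only a stand-in so the example runs without a network connection.

```python
def smiles_or_label(identifier, resolver, label="Not found"):
    """Return the resolver's SMILES, or `label` when it returns None."""
    smiles = resolver(identifier, "smiles")
    return label if smiles is None else smiles

# Stand-in resolver for illustration only; in the notebook pass cirpy.resolve instead
def fake_resolver(identifier, representation):
    known = {"3567-62-2": "CNC(=O)Nc1ccc(Cl)c(Cl)c1"}
    return known.get(identifier)

print(smiles_or_label("3567-62-2", fake_resolver))
print(smiles_or_label("not-a-compound", fake_resolver))
```

In the notebook the call would be `smiles_or_label('3567-62-2', cirpy.resolve)`.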

In [12]:
#Checking whether the CAS number column contains any empty cells
review["CAS_Registry_Number"].isnull().values.any()

True

# In case of missing CAS numbers in the column:

In [14]:
# If the cell above returns True, count how many cells are empty
review["CAS_Registry_Number"].isnull().sum()

1

In [15]:
#Getting the row which has no CAS number
review[review["CAS_Registry_Number"].isnull()]

Unnamed: 0,ID,Compound,CAS_Registry_Number,PubChem
8,com_9,Tramadol,,https://pubchem.ncbi.nlm.nih.gov/compound/33741
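
Besides empty cells, a CAS number can also be mistyped. CAS Registry Numbers end in a check digit, so a quick format check is possible before querying the resolver. The sketch below applies the standard check-digit rule (each digit before the check digit is weighted by its position counted from the right, and the sum modulo 10 must equal the check digit); `is_valid_cas` is a hypothetical helper name, not part of the original notebook.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(cas):
    """Return True if `cas` has the CAS format and a matching check digit."""
    match = CAS_PATTERN.match(str(cas).strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight 1 for the rightmost digit, 2 for the next one, and so on
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

# Usage in the notebook:
# review[~review["CAS_Registry_Number"].map(is_valid_cas)]
```

Empty cells also fail this check, so the filter in the usage line lists both missing and malformed entries.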


# Getting SMILES from CAS numbers:

In [16]:
#using CAS number
smil = []
for value in review["CAS_Registry_Number"]:
    if not pd.isna(value):
        smil.append(cirpy.resolve(value, "smiles"))
    else:
        smil.append("No CAS number")   # placeholder for rows without a CAS number

smil[0:5] # Looking at the first 5 SMILES codes as a sanity check

['CNC(=O)Nc1ccc(Cl)c(Cl)c1',
 'Oc1ccccc1',
 'C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@@H]1CC[C@@H]2O',
 'ClC(Cl)=C(Cl)Cl',
 'FC(F)C(F)(F)OCC(Cn1cncn1)c2ccc(Cl)cc2Cl']
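
Each lookup is a separate web request, so a list with repeated CAS numbers queries the service more often than necessary. As a rough sketch, the loop can remember earlier answers in a dictionary; `is_missing` and `resolve_column` are hypothetical names, and the resolver is passed in as an argument so that `cirpy.resolve` can be swapped for a stand-in when testing.

```python
import math

def is_missing(value):
    """True for None and for NaN, which pandas uses for empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def resolve_column(values, resolver, missing_label="No CAS number"):
    """Resolve every value to SMILES, querying each distinct value only once."""
    cache = {}
    results = []
    for value in values:
        if is_missing(value):
            results.append(missing_label)
            continue
        key = str(value).strip()
        if key not in cache:
            cache[key] = resolver(key, "smiles")  # one request per distinct value
        results.append(cache[key])
    return results

# Usage in the notebook:
# smil = resolve_column(review["CAS_Registry_Number"], cirpy.resolve)
```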

In [17]:
#Adding a column "Smiles" to the review dataframe
review["Smiles"] = smil

# Getting SMILES from Compound names:

In [18]:
#using compound names
smil_names = []
for value in review["Compound"]:
    smil_names.append(cirpy.resolve(value,"smiles"))
    
smil_names[0:5]

['CNC(=O)Nc1ccc(Cl)c(Cl)c1',
 'Oc1ccccc1',
 'C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@@H]1CC[C@@H]2O',
 'ClC(Cl)=C(Cl)Cl',
 'FC(F)C(F)(F)OCC(Cn1cncn1)c2ccc(Cl)cc2Cl']
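
Because every name is sent to a web service, a temporary network problem can interrupt the whole loop. A minimal sketch of a retry wrapper follows; `resolve_with_retry` is a hypothetical name, the number of attempts and the delay are arbitrary choices, and it assumes that network failures surface as `OSError` (the base class of the `urllib` errors).

```python
import time

def resolve_with_retry(identifier, resolver, attempts=3, delay=1.0):
    """Call resolver(identifier, 'smiles'), retrying when an OSError is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return resolver(identifier, "smiles")
        except OSError:
            if attempt == attempts:
                raise              # give up after the last attempt
            time.sleep(delay)      # wait before trying again

# Usage in the notebook:
# smil_names.append(resolve_with_retry(value, cirpy.resolve))
```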

In [19]:
review["Smiles_from_CompoundNames"] = smil_names       #Adding a new column 'Smiles_from_CompoundNames'

In [20]:
#After adding these new columns, let's check the review dataframe
review.head()

Unnamed: 0,ID,Compound,CAS_Registry_Number,PubChem,Smiles,Smiles_from_CompoundNames
0,com_1,"1-(3,4-dichlorophenyl)-3-methylurea",3567-62-2,https://pubchem.ncbi.nlm.nih.gov/compound/1-_3...,CNC(=O)Nc1ccc(Cl)c(Cl)c1,CNC(=O)Nc1ccc(Cl)c(Cl)c1
1,com_2,Phenol,108-95-2,https://pubchem.ncbi.nlm.nih.gov/compound/996,Oc1ccccc1,Oc1ccccc1
2,com_3,Testosterone,58-22-0,https://pubchem.ncbi.nlm.nih.gov/compound/6013,C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@...,C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@...
3,com_4,Tetrachloroethylene,127-18-4,https://pubchem.ncbi.nlm.nih.gov/compound/31373,ClC(Cl)=C(Cl)Cl,ClC(Cl)=C(Cl)Cl
4,com_5,Tetraconazole,112281-77-3,https://pubchem.ncbi.nlm.nih.gov/compound/80277,FC(F)C(F)(F)OCC(Cn1cncn1)c2ccc(Cl)cc2Cl,FC(F)C(F)(F)OCC(Cn1cncn1)c2ccc(Cl)cc2Cl
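
With both columns in place, the name-based SMILES can fill in rows that have no CAS number, and rows where the two lookups disagree can be flagged for a manual check. This is only a sketch: `combine_smiles` is a hypothetical helper, and it compares SMILES as plain strings, so two different strings that describe the same molecule would still be reported as a mismatch.

```python
def combine_smiles(from_cas, from_names, missing_label="No CAS number"):
    """Prefer the CAS-based SMILES and fall back to the name-based one."""
    combined, mismatches = [], []
    for row, (cas_smiles, name_smiles) in enumerate(zip(from_cas, from_names)):
        cas_ok = cas_smiles is not None and cas_smiles != missing_label
        if cas_ok and name_smiles is not None and cas_smiles != name_smiles:
            mismatches.append(row)  # both lookups succeeded but disagree
        combined.append(cas_smiles if cas_ok else name_smiles)
    return combined, mismatches

# Usage in the notebook:
# review["Smiles_combined"], mismatched_rows = combine_smiles(smil, smil_names)
```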


In [21]:
#Writing the dataframe with the new SMILES columns to a result file
review.to_csv('Review_with_Smiles.csv', index = False)