# This is a sandbox/preface for the solubility calculation assignment

The solubility calculation assignment builds a simple linear solubility model: the model is trained on a set of compounds with known solubilities and then used to predict solubilities for a series of new compounds. In other words, we have a training set and a test set, and we want to use the known solubilities in the training set to predict solubilities for the test set.

## For solubility prediction, we'll use a series of *descriptors*

Descriptors are properties of a molecule which might (or might not) be related to its solubility. For example, we might expect solubility to generally go down as molecular weight goes up, to go up as polarity increases, and so on.
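
To make the idea of a linear model built from descriptors concrete, here is a minimal sketch using NumPy least squares. Everything in it is an assumption for illustration: the descriptor values and coefficients are made up (not real measurements), and the helper names `fit_linear_model` and `predict` are hypothetical rather than taken from the assignment.

```python
import numpy as np

# Synthetic, made-up descriptor values for illustration only (not real data).
# Columns: molecular weight, calculated logP
train_descriptors = np.array([
    [94.1, 1.5],
    [128.2, 3.3],
    [78.1, 2.1],
    [180.2, 1.2],
    [46.1, -0.3],
])

# Made-up linear relationship used to generate toy "solubilities"
true_coeffs = np.array([-0.01, -0.8])
true_intercept = 0.5
train_solubility = train_descriptors @ true_coeffs + true_intercept


def fit_linear_model(descriptors, solubility):
    """Least-squares fit of solubility = descriptors @ coeffs + intercept."""
    # Append a column of ones so the intercept is fit along with the coefficients
    design = np.hstack([descriptors, np.ones((descriptors.shape[0], 1))])
    params, *_ = np.linalg.lstsq(design, solubility, rcond=None)
    return params[:-1], params[-1]


def predict(descriptors, coeffs, intercept):
    """Apply the fitted linear model to new descriptor rows."""
    return descriptors @ coeffs + intercept


# Train on the training set, then predict for a (made-up) test compound
coeffs, intercept = fit_linear_model(train_descriptors, train_solubility)
test_descriptors = np.array([[150.0, 2.0]])
predicted = predict(test_descriptors, coeffs, intercept)
print(coeffs, intercept, predicted)
```

Because the toy solubilities were generated from an exactly linear relationship, the fit recovers the made-up coefficients; real solubility data will be noisy and the fit will be only approximate.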

Here, let's take a sample molecule and calculate a series of descriptors which we might want to use in constructing a simple solubility model. 

In [2]:
# Run cells if using Colab

%env PYTHONPATH=


In [None]:
%env PYTHONPATH=
! wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
! chmod +x Miniforge3-Linux-x86_64.sh
! bash ./Miniforge3-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.10/site-packages/')

In [3]:
# Run cell if using Colab

# Mount google drive to Colab Notebooks to access files
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


In [None]:
!mamba install -c openeye openeye-toolkits --yes

In [4]:
%%shell
# Run cell if using Colab (the %%shell cell magic must be the first line of the cell)
# Link the OpenEye license in .bash_profile
echo export OE_LICENSE="/content/drive/MyDrive/drug-computing/oelicense/oe_license.txt" >> ~/.bash_profile
source ~/.bash_profile



In [None]:
# Move into directory so that files for this lecture can be accessed
#%cd /content/drive/MyDrive/drug-computing/uci-pharmsci/lectures/empirical_physical_properties

# Set the OE_LICENSE environment variable to point to the license file
%env OE_LICENSE=/content/drive/MyDrive/drug-computing/oelicense/oe_license.txt
# List all environment variables to check that OE_LICENSE is set
%env

In [1]:
from openeye.oechem import *
from openeye.oemolprop import *
from openeye.oeiupac import *
from openeye.oezap import *
from openeye.oeomega import *
import numpy as np
import scipy.stats

#Initialize an OpenEye molecule
mol = OEMol()

#let's look at naphthalene
OEParseIUPACName( mol, 'naphthalene' )

#Generate conformation
omega = OEOmega()
omega(mol)

#Here one of the descriptors we'll use is the calculated solvation free energy, from OpenEye's ZAP electrostatics solver
#Get zap ready for electrostatics calculations
zap = OEZap()
zap.SetInnerDielectric( 1.0 )
zap.SetGridSpacing(0.5)
area = OEArea()

#Reduce verbosity
OEThrow.SetLevel(OEErrorLevel_Warning)


#Let's print a bunch of properties
#Molecular weight
print( "Molecular weight: %.2f" % OECalculateMolecularWeight(mol) )
#Number of atoms
print( "Number of atoms: %s" % mol.NumAtoms() ) 
#Number of heavy atoms
print( "Number of heavy atoms: %s" % OECount(mol, OEIsHeavy() ) )
#Number of ring atoms
print( "Number of ring atoms: %s" % OECount(mol, OEAtomIsInRing() ) )
#Number of halogens
print( "Number of halogens: %s" % OECount( mol, OEIsHalogen() ))
#Number of nitrogens, oxygens, and rotatable bonds
print( "Number of nitrogens: %s" % OECount( mol, OEIsNitrogen() ) )
print( "Number of oxygens: %s" % OECount( mol, OEIsOxygen() ) )
print( "Number of rotatable bonds: %s" % OECount( mol, OEIsRotor() ) )

#Calculated logP: log of the octanol/water partition coefficient, which often correlates somewhat with solubility
print( "Calculated logP: %.2f" %  OEGetXLogP( mol ) )

print( "Number of aromatic rings: %s" % OEGetAromaticRingCount( mol ) )

    
    
#Calculate lots of other properties using molprop toolkit as per example in OE MolProp manual
#Handle the setup of 'filter', which computes lots of properties with the goal of filtering compounds. Here we'll not do any filtering
#and will use it solely for property calculation
filt = OEFilter()
ostr = oeosstream()
pwnd = False
filt.SetTable( ostr, pwnd)
#headers = ostr.str().split('\t') #Python 2.x would want something like this; Python 3 version follows
headers = ostr.str().decode().split('\t')
ostr.clear()
filt(mol)
#fields = ostr.str().split('\t') #Python 2.x would want something like this; Python 3 version follows
fields = ostr.str().decode().split('\t')
tmpdct = dict( zip(headers, fields) ) #Format the data we need into a dictionary for easy extraction

print("Polar surface area: %s" % tmpdct[ '2d PSA' ] )
print("Number of hbond donors: %s" % int(tmpdct['hydrogen-bond donors']) )
print("Number of hbond acceptors: %s" % int(tmpdct['hydrogen-bond acceptors']) )
#Note: this field counts ring systems, so fused rings (as in naphthalene) count as one
print("Number of rings: %s" % int(tmpdct['number of ring systems']) )
#print(tmpdct.keys())

#Quickly estimate hydration free energy, or a value correlated with that -- from ZAP manual
#Do ZAP setup for molecule
OEAssignBondiVdWRadii(mol)
OEMMFFAtomTypes(mol)
OEMMFF94PartialCharges(mol)
zap.SetMolecule( mol )
solv = zap.CalcSolvationEnergy()
aval = area.GetArea( mol )
#Empirically estimate solvation free energy (hydration):
#convert the electrostatic part to kcal/mol (factor 0.59), then add an
#empirically determined kcal/sq angstrom value (0.01) times the surface area term
solvation = 0.59*solv + 0.01*aval
print("Calculated solvation free energy: %.2f" % solvation)

Molecular weight: 128.17
Number of atoms: 18
Number of heavy atoms: 10
Number of ring atoms: 10
Number of halogens: 0
Number of nitrogens: 0
Number of oxygens: 0
Number of rotatable bonds: 0
Calculated logP: 3.57
Number of aromatic rings: 2
Polar surface area: 0.00
Number of hbond donors: 0
Number of hbond acceptors: 0
Number of rings: 1
Calculated solvation free energy: -4.13
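
The header/field pairing used for the filter table above can be tried without OpenEye. The sketch below assumes a tab-separated header line and a tab-separated data line; both strings are hypothetical stand-ins (not real OpenEye output), and `table_row_to_dict` is an illustrative helper name. Stripping the trailing newline is an extra step that is not in the cell above.

```python
# Hypothetical tab-separated header and data rows, standing in for the text
# that the filter writes to its output stream (not real OpenEye output)
header_line = "name\tmolecular weight\thydrogen-bond donors\n"
data_line = "example\t100.00\t1\n"


def table_row_to_dict(header_line, data_line):
    """Pair each column header with the matching field from one data row."""
    headers = header_line.rstrip("\n").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    return dict(zip(headers, fields))


row = table_row_to_dict(header_line, data_line)

# Fields come back as strings, so convert them before using them as numbers
donors = int(row["hydrogen-bond donors"])
weight = float(row["molecular weight"])
print(row)
```

Looking values up by column name rather than by position keeps the extraction readable and robust to column reordering.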


## In the assignment, these get stored in a dictionary. Let's see how that works.

In [3]:
#Initialize an empty dictionary
compounds = {}
#Name we're working with
molname = 'phenol'
#Create a new OEMol to store this into
mol = OEMol()

#let's look at phenol
OEParseIUPACName( mol, molname )

#Generate conformation
omega = OEOmega()
omega(mol)

#Create a slot in our dictionary for phenol
compounds[molname] = {} #Make it another empty dictionary

#Now let's store some stuff in there
compounds[molname]['mw'] = OECalculateMolecularWeight(mol)
compounds[molname]['rotatable bonds'] = OECount( mol, OEIsRotor() )



#TO DO: Try making an update here to add properties for another compound of your choice to the dictionary



#Let's print it out
print(compounds)

{'phenol': {'mw': 94.11124000000002, 'rotatable bonds': 0}}


The point here is that a dictionary is a flexible data structure that lets us store information in an organized way for later use. For example, to see everything stored for phenol, simply use:

In [4]:
print( compounds['phenol'])

{'mw': 94.11124000000002, 'rotatable bonds': 0}
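
Since a linear model needs its descriptors as rows of numbers, one possible next step is to flatten the nested dictionary into a table. The sketch below is an assumption about how this could be done, not the assignment's own code: the values are typed in by hand (phenol from the output above, naphthalene from the earlier descriptor printout) so that it runs without OpenEye, and `descriptor_table` is a hypothetical helper name.

```python
# Nested dictionary in the same shape as above; values are entered by hand
# here so the example runs without OpenEye
compounds = {
    'phenol': {'mw': 94.11124, 'rotatable bonds': 0},
    'naphthalene': {'mw': 128.17, 'rotatable bonds': 0},
}

# Fix the descriptor order so every row has the same columns
descriptor_names = ['mw', 'rotatable bonds']


def descriptor_table(compounds, descriptor_names):
    """Return compound names (sorted) and one row of descriptor values per compound."""
    names = sorted(compounds)
    rows = [[compounds[name][key] for key in descriptor_names] for name in names]
    return names, rows


names, rows = descriptor_table(compounds, descriptor_names)
for name, row in zip(names, rows):
    print(name, row)
```

Sorting the names gives a reproducible row order, which matters once the rows need to line up with a separate list of known solubilities.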
