<a href="https://colab.research.google.com/github/StyrbjornKall/ecoCAIT/blob/master/tutorials/Inference_tutorial_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inference
This script showcases the different models available in ecoCAIT and how to use them efficiently.

This is the same tutorial as the Jupyter notebook available under `tutorials`, but because this one runs on Google Colab, some additional code needs to run for it to work.

Note: For large files it is recommended to switch the Runtime to GPU (select *GPU* under the *Change Runtime type* in the dropdown menu *Runtime*).  

## Install dependencies

In [4]:
!pip install transformers
!pip install torch
!pip install rdkit==2022.03.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.3

## Mount personal google drive
The paths below should work without modification. Cloning the repository in this section creates a new folder called `ecoCAIT` in your Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
%cd /content/gdrive/My Drive/
%ls

/content/gdrive/My Drive
'20220628_121321 (1).jpg'   drive-download-20220823T101517Z-001.zip
 20220628_121321.jpg        Exjobb/
'20220628_121322 (1).jpg'   fishbAIT/
 20220628_121322.jpg        k-fold.drawio
'20220628_121325 (1).jpg'   Logo.pdf
 20220628_121325.jpg        Logotext.pdf
'20220628_122002 (1).jpg'   Målningar/
 20220628_122002.jpg       'Namnlös presentation.gslides'
'20220628_122004 (1).jpg'   P6280057.JPG
 20220628_122004.jpg        P6280075.JPG
'20220628_122007 (1).jpg'   P6280076.JPG
 20220628_122007.jpg        P6280132.JPG
'20220628_122008 (1).jpg'   P6280133.JPG
 20220628_122008.jpg        selfattention.drawio
'20220628_122009 (1).jpg'  'stubbs låtar'/
 20220628_122009.jpg        test/
 BERT.drawio                tox_assay.drawio
 chemicalattention.drawio   WASP_2022_Industrial_PhD_student.gdoc
'Colab Notebooks'/          WASP_2022_Industrial_PhD_student.pdf
 doseresponse1.drawio


In [3]:
!git clone https://github.com/StyrbjornKall/ecoCAIT

Cloning into 'ecoCAIT'...
remote: Enumerating objects: 1020, done.
remote: Counting objects: 100% (224/224), done.
remote: Compressing objects: 100% (129/129), done.
remote: Total 1020 (delta 119), reused 191 (delta 95), pack-reused 796
Receiving objects: 100% (1020/1020), 954.19 MiB | 17.14 MiB/s, done.
Resolving deltas: 100% (546/546), done.
Updating files: 100% (137/137), done.


In [5]:
import os
os.chdir('/content/gdrive/My Drive/ecoCAIT/tutorials/')

## Run tutorial

Now we are ready to run the tutorial. It follows the same layout as the Jupyter notebook tutorial available under `tutorials`.

In [6]:
import torch
import pandas as pd
import numpy as np
from inference_utils.ecoCAIT_for_inference import ecoCAIT_for_inference

Specify the model version and load the model

In [8]:
MODEL_TYPE = 'EC50'
SPECIES_GROUP = 'fish'
MODEL_VERSION = f'{MODEL_TYPE}_{SPECIES_GROUP}'

In [13]:
ecocait = ecoCAIT_for_inference(model_version=MODEL_VERSION)
ecocait.load_fine_tuned_model()

Load the SMILES you wish to predict

In [10]:
data = pd.read_excel('../data/tutorials/Inference_example_2.xlsx')
data

Unnamed: 0,SMILES,cmpdname
0,CC(=O)Oc1ccccc1C(O)=O,Aspirin
1,[Cr],Chromium
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2,Fluoxetine hydrochloride
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl,Clofenotane
4,[Cu],Copper
...,...,...
995,[Pb++].[O-]c1c(cc(c([O-])c1[N+]([O-])=O)[N+]([...,Lead styphnate
996,CC(C)(C)C(O)(CCc1ccc(Cl)cc1)Cn2cncn2,Tebuconazole
997,[Na+].[Na+].[Na+].[Na+].OCCN(CCO)c1nc(Nc2ccc(c...,OpticalBrightenerBbu220
998,CNC.OC(=O)COc1ccc(Cl)cc1Cl,"2,4-D dimethylamine salt"


Specify the endpoint and effect you wish to predict and make the prediction

In [11]:
PREDICTION_ENDPOINT = 'EC50'
PREDICTION_EFFECT = 'MOR'
EXPOSURE_DURATION = 96
SMILES_COLUMN_NAME = 'SMILES'

In [14]:
results = ecocait.predict_toxicity(SMILES=data[SMILES_COLUMN_NAME].iloc[0:10].tolist(),
                                   exposure_duration=EXPOSURE_DURATION,
                                   endpoint=PREDICTION_ENDPOINT,
                                   effect=PREDICTION_EFFECT,
                                   return_cls_embeddings=True)
results

Did not return onehotencoding for Endpoint. Why? You specified only one Endpoint or you specified NOEC and EC10 which are coded to be the same endpoint.
Did not return onehotencoding for Effect. Why? You specified only one Effect.
Will use input 0 to network due to no Onehotencodings being present.


  0%|          | 0/2 [00:00<?, ?it/s]
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 2/2 [00:00<00:00,  4.56it/s]


Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L),CLS_embeddings
0,CC(=O)Oc1ccccc1C(O)=O,1.982271,EC50,MOR,CC(=O)Oc1ccccc1C(=O)O,[0.0],1.884043,76.567215,"[1.031221866607666, 0.9812864661216736, 1.5895..."
1,[Cr],1.982271,EC50,MOR,[Cr],[0.0],1.831604,67.858498,"[2.001863479614258, -0.05010111257433891, 1.34..."
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2,1.982271,EC50,MOR,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1.[Cl-].[H+],[0.0],-0.225506,0.594969,"[0.3745545446872711, -1.3732742071151733, 1.56..."
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl,1.982271,EC50,MOR,Clc1ccc(C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl)cc1,[0.0],-1.892852,0.012798,"[-1.3456339836120605, -0.27076178789138794, -1..."
4,[Cu],1.982271,EC50,MOR,[Cu],[0.0],-0.414268,0.38524,"[-2.549443483352661, 0.6732224225997925, -1.10..."
5,CCNc1nc(Cl)nc(NC(C)C)n1,1.982271,EC50,MOR,CCNc1nc(Cl)nc(NC(C)C)n1,[0.0],1.320259,20.905407,"[0.1520749181509018, 1.1833568811416626, 0.951..."
6,CN(C)C1=NC(=O)N(C2CCCCC2)C(=O)N1C,1.982271,EC50,MOR,CN(C)c1nc(=O)n(C2CCCCC2)c(=O)n1C,[0.0],2.474742,298.361237,"[1.2230730056762695, 0.0014117200626060367, 2...."
7,CC(Br)(CO)[N+]([O-])=O,1.982271,EC50,MOR,CC(Br)(CO)[N+](=O)[O-],[0.0],1.364374,23.14057,"[0.7047246694564819, 0.3339962959289551, 0.413..."
8,c1ccc2c(c1)c3cccc4cccc2c34,1.982271,EC50,MOR,c1ccc2c(c1)-c1cccc3cccc-2c13,[0.0],-1.589469,0.025735,"[-1.6370105743408203, 0.2737756669521332, -1.1..."
9,[Cl-].[Cl-].[Zn++],1.982271,EC50,MOR,[Cl-].[Cl-].[Zn+2],[0.0],0.320301,2.090747,"[0.39709165692329407, -0.532727837562561, -0.2..."
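
The numeric columns in the output above appear to be related through base-10 logarithms. The printed values are consistent with `exposure_duration` holding log10 of the duration passed in (96 here) and with `predictions (mg/L)` being 10 raised to `predictions log10(mg/L)`. The sketch below checks this on the first row; it is an observation drawn from the printed values, not a statement about ecoCAIT's internals.

```python
import math

# Values copied from the first row of the output above (Aspirin).
exposure_duration = 96
log10_prediction = 1.884043

# The exposure_duration column matches log10 of the value passed in.
log10_duration = math.log10(exposure_duration)

# Back-transform the log10 prediction to a concentration in mg/L.
prediction_mg_per_l = 10 ** log10_prediction

print(f"log10(duration) = {log10_duration:.6f}")
print(f"prediction      = {prediction_mg_per_l:.3f} mg/L")
```

Working on the log10 scale is convenient when comparing chemicals, since the predictions above span several orders of magnitude.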


## Upload your list of SMILES

For convenience, you can upload a file containing the SMILES you wish to predict directly.

In [15]:
from google.colab import files
import io

In [46]:
uploaded_file = files.upload()

Saving All 6k SMILES.txt to All 6k SMILES (1).txt


In [49]:
# excel file with one column containing SMILES
#data = pd.read_excel(io.BytesIO(uploaded_file[list(uploaded_file.keys())[0]]), header=None, names=['SMILES'])

# .csv file with one column containing SMILES
#data = pd.read_csv(io.BytesIO(uploaded_file[list(uploaded_file.keys())[0]]), header=None, names=['SMILES'])

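# .txt file with one SMILES per line
# Split on line breaks directly: this handles both \n and \r\n line endings
# and avoids passing a line terminator as `sep` to pd.read_csv.
file_name = list(uploaded_file.keys())[0]
lines = uploaded_file[file_name].decode('utf-8').splitlines()
data = pd.DataFrame({'SMILES': [line.strip() for line in lines if line.strip()]})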

In [50]:
data

Unnamed: 0,SMILES
0,O=[N+]([O-])c1ccc(Cl)cc1
1,Nc1ccc([N+](=O)[O-])cc1
2,O=[N+]([O-])c1ccc(O)cc1
3,CN(C)c1ccc(C=O)cc1
4,O=[N+]([O-])c1ccc([N+](=O)[O-])cc1
...,...
6503,CCC(C(=O)O)c1ccc(N2C(=O)c3ccccc3C2=O)cc1
6504,NC(=O)NC1NC(=O)NC1=O
6505,S=C(SSSSSSC(=S)N1CCCCC1)N1CCCCC1
6506,CC1CCC(C(C)C)CC1


In [51]:
PREDICTION_ENDPOINT = 'EC50'
PREDICTION_EFFECT = 'MOR'
EXPOSURE_DURATION = 96
SMILES_COLUMN_NAME = 'SMILES'

In [53]:
results = ecocait.predict_toxicity(SMILES = data[SMILES_COLUMN_NAME].tolist(),
                                   exposure_duration=EXPOSURE_DURATION,
                                   endpoint=PREDICTION_ENDPOINT,
                                   effect=PREDICTION_EFFECT,
                                   return_cls_embeddings=True)
results

Did not return onehotencoding for Endpoint. Why? You specified only one Endpoint or you specified NOEC and EC10 which are coded to be the same endpoint.
Did not return onehotencoding for Effect. Why? You specified only one Effect.
Will use input 0 to network due to no Onehotencodings being present.


100%|██████████| 814/814 [06:23<00:00,  2.12it/s]


Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L),CLS_embeddings
0,O=[N+]([O-])c1ccc(Cl)cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc(Cl)cc1,[0.0],1.124751,13.327562,"[0.23420575261116028, 0.4216983914375305, 1.87..."
1,Nc1ccc([N+](=O)[O-])cc1,1.982271,EC50,MOR,Nc1ccc([N+](=O)[O-])cc1,[0.0],1.792578,62.026550,"[2.082552909851074, -0.5444334149360657, 2.003..."
2,O=[N+]([O-])c1ccc(O)cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc(O)cc1,[0.0],1.397018,24.946980,"[1.0401618480682373, 0.3333073556423187, 1.806..."
3,CN(C)c1ccc(C=O)cc1,1.982271,EC50,MOR,CN(C)c1ccc(C=O)cc1,[0.0],1.628551,42.515839,"[2.072230339050293, -0.5749673843383789, 2.343..."
4,O=[N+]([O-])c1ccc([N+](=O)[O-])cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc([N+](=O)[O-])cc1,[0.0],-0.234958,0.582160,"[-1.9305063486099243, -0.5327048301696777, 1.2..."
...,...,...,...,...,...,...,...,...,...
6503,CCC(C(=O)O)c1ccc(N2C(=O)c3ccccc3C2=O)cc1,1.982271,EC50,MOR,CCC(C(=O)O)c1ccc(N2C(=O)c3ccccc3C2=O)cc1,[0.0],1.584956,38.455318,"[1.1273287534713745, 2.010953426361084, 1.2554..."
6504,NC(=O)NC1NC(=O)NC1=O,1.982271,EC50,MOR,NC(=O)NC1NC(=O)NC1=O,[0.0],2.127614,134.157135,"[0.453058660030365, -0.470965176820755, 2.2553..."
6505,S=C(SSSSSSC(=S)N1CCCCC1)N1CCCCC1,1.982271,EC50,MOR,S=C(SSSSSSC(=S)N1CCCCC1)N1CCCCC1,[0.0],0.501646,3.174287,"[-1.0864914655685425, 0.29974955320358276, 0.1..."
6506,CC1CCC(C(C)C)CC1,1.982271,EC50,MOR,CC1CCC(C(C)C)CC1,[0.0],0.543796,3.497805,"[-0.3729274570941925, -1.2117304801940918, 1.2..."
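
The two progress bars in this tutorial report 2 iterations for 10 SMILES and 814 iterations for 6,507 SMILES, which is consistent with batches of 8 SMILES each. That batch size is inferred from the logs, not taken from the ecoCAIT source. Under that assumption, a rough runtime estimate for a new file can be sketched as follows, using the 2.12 it/s rate reported for the run above; `estimate_runtime` is a hypothetical helper written for this example.

```python
import math

BATCH_SIZE = 8                 # assumption inferred from the progress bars
ITERATIONS_PER_SECOND = 2.12   # rate reported for the 6,507-SMILES run above

def estimate_runtime(n_smiles, batch_size=BATCH_SIZE, rate=ITERATIONS_PER_SECOND):
    """Return (number of batches, estimated seconds) for n_smiles inputs."""
    n_batches = math.ceil(n_smiles / batch_size)
    return n_batches, n_batches / rate

n_batches, seconds = estimate_runtime(6507)
print(f"{n_batches} batches, about {seconds / 60:.1f} minutes")
```

The rate depends on the hardware and on the length of the SMILES, so treat the result as an order-of-magnitude guide when deciding whether to switch the runtime to GPU.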


# Check the predictions compared to our training sets

In [25]:
!pip install umap-learn
!pip install pacmap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pacmap
  Downloading pacmap-0.7.0-py3-none-any.whl (18 kB)
Collecting annoy>=1.11
  Downloading annoy-1.17.1.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.1-cp39-cp39-linux_x86_64.whl size=580984 sha256=ecbde36ecdd24e3411680239216dda9ba2b0350d9057862a8dc09fcbede8cdbe
  Stored in directory: /root/.cache/pip/wheels/5b/7d/31/9a9a4993d085bc85bee21946bce94cd5906ce99730f5467e57
Successfully built annoy
Installing collected packages: annoy, pacmap
Successfully installed annoy-1.17.1 pacmap-0

In [26]:
from inference_utils.plots_for_space import PlotPCA_CLSProjection, PlotPaCMAP_CLSProjection, PlotUMAP_CLSProjection
from inference_utils.pytorch_data_utils import check_closest_chemical, check_training_data

Check whether the chemicals are present in the training data. A chemical may be flagged as either:

- 'endpoint_match' i.e. the chemical was used for training this model for this species and endpoint.
- 'effect_match' i.e. the chemical was used for training this model for this species, endpoint and effect.

In [54]:
results = check_training_data(results, model_type=MODEL_TYPE, species_group=SPECIES_GROUP, endpoint=PREDICTION_ENDPOINT, effect=PREDICTION_EFFECT)
results

Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L),CLS_embeddings,endpoint match,effect match
0,O=[N+]([O-])c1ccc(Cl)cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc(Cl)cc1,[0.0],1.124751,13.327562,"[0.23420575261116028, 0.4216983914375305, 1.87...",1,1
1,Nc1ccc([N+](=O)[O-])cc1,1.982271,EC50,MOR,Nc1ccc([N+](=O)[O-])cc1,[0.0],1.792578,62.026550,"[2.082552909851074, -0.5444334149360657, 2.003...",1,1
2,O=[N+]([O-])c1ccc(O)cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc(O)cc1,[0.0],1.397018,24.946980,"[1.0401618480682373, 0.3333073556423187, 1.806...",1,1
3,CN(C)c1ccc(C=O)cc1,1.982271,EC50,MOR,CN(C)c1ccc(C=O)cc1,[0.0],1.628551,42.515839,"[2.072230339050293, -0.5749673843383789, 2.343...",1,1
4,O=[N+]([O-])c1ccc([N+](=O)[O-])cc1,1.982271,EC50,MOR,O=[N+]([O-])c1ccc([N+](=O)[O-])cc1,[0.0],-0.234958,0.582160,"[-1.9305063486099243, -0.5327048301696777, 1.2...",1,1
...,...,...,...,...,...,...,...,...,...,...,...
6503,CCC(C(=O)O)c1ccc(N2C(=O)c3ccccc3C2=O)cc1,1.982271,EC50,MOR,CCC(C(=O)O)c1ccc(N2C(=O)c3ccccc3C2=O)cc1,[0.0],1.584956,38.455318,"[1.1273287534713745, 2.010953426361084, 1.2554...",0,0
6504,NC(=O)NC1NC(=O)NC1=O,1.982271,EC50,MOR,NC(=O)NC1NC(=O)NC1=O,[0.0],2.127614,134.157135,"[0.453058660030365, -0.470965176820755, 2.2553...",0,0
6505,S=C(SSSSSSC(=S)N1CCCCC1)N1CCCCC1,1.982271,EC50,MOR,S=C(SSSSSSC(=S)N1CCCCC1)N1CCCCC1,[0.0],0.501646,3.174287,"[-1.0864914655685425, 0.29974955320358276, 0.1...",0,0
6506,CC1CCC(C(C)C)CC1,1.982271,EC50,MOR,CC1CCC(C(C)C)CC1,[0.0],0.543796,3.497805,"[-0.3729274570941925, -1.2117304801940918, 1.2...",0,0
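
Once the match columns have been added, they can be used to separate chemicals the model saw during training from new ones. The sketch below uses a small hand-made DataFrame with the column names shown in the output above (`endpoint match`, `effect match`); the three rows are copied from that output and `toy_results` is only a stand-in for the real `results`.

```python
import pandas as pd

# Toy stand-in for `results`, using the column names from the output above.
toy_results = pd.DataFrame({
    'SMILES': ['O=[N+]([O-])c1ccc(Cl)cc1', 'CN(C)c1ccc(C=O)cc1', 'CC1CCC(C(C)C)CC1'],
    'predictions (mg/L)': [13.327562, 42.515839, 3.497805],
    'endpoint match': [1, 1, 0],
    'effect match': [1, 1, 0],
})

# Chemicals that were not part of the training data for this endpoint.
unseen = toy_results[toy_results['endpoint match'] == 0]

# Chemicals that were trained on with this endpoint and effect.
seen_same_effect = toy_results[toy_results['effect match'] == 1]

print(unseen['SMILES'].tolist())
print(len(seen_same_effect))
```

Predictions for unseen chemicals are the ones worth inspecting more closely, for example with the similarity check that follows.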


Next we check which training-set chemical is closest to each predicted chemical by comparing its CLS-embedding with the CLS-embeddings of the training set using cosine similarity:

- cosine similarity = 1 --> the embeddings point in the same direction (e.g. identical structures)
- cosine similarity = -1 --> the embeddings point in opposite directions, i.e. completely opposite in terms of toxicity
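
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for small made-up vectors to show the two extremes listed above; the vectors are hypothetical, and how `check_closest_chemical` computes the value internally is not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, for illustration only.
query = [1.0, 2.0, 0.5]
same_direction = [2.0, 4.0, 1.0]   # query scaled by 2
opposite = [-1.0, -2.0, -0.5]      # query negated

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, opposite))
```

Because the measure ignores vector length, two embeddings that differ only in scale still count as maximally similar.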

In [31]:
results = check_closest_chemical(results=results, MODELTYPE=MODEL_TYPE, PREDICTION_SPECIES=SPECIES_GROUP, PREDICTION_ENDPOINT=PREDICTION_ENDPOINT, PREDICTION_EFFECT=PREDICTION_EFFECT)
results

Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L),CLS_embeddings,most similar chemical,cosine similarity
0,endpoint,1.982271,EC50,MOR,endpoint,[0.0],1.213103,16.334377,"[-0.1075105145573616, 0.45776957273483276, 0.7...",CCCCCCCCCCBr,0.791287
1,EC10,1.982271,EC50,MOR,EC10,[0.0],0.602213,4.001406,"[0.3030869960784912, -0.1913156658411026, 0.46...",Brc1ccccc1,0.864927
2,EC50,1.982271,EC50,MOR,EC50,[0.0],0.942790,8.765773,"[0.871803879737854, -0.06653017550706863, 0.56...",CCCCCCCCCCCCCCCNCCCCCCCCCCCCCCC,0.920570
3,EC10,1.982271,EC50,MOR,EC10,[0.0],0.602213,4.001406,"[0.3030869960784912, -0.1913156658411026, 0.46...",Brc1ccccc1,0.864927
4,EC50,1.982271,EC50,MOR,EC50,[0.0],0.942790,8.765773,"[0.871803879737854, -0.06653017550706863, 0.56...",CCCCCCCCCCCCCCCNCCCCCCCCCCCCCCC,0.920570
...,...,...,...,...,...,...,...,...,...,...,...
996,EC10,1.982271,EC50,MOR,EC10,[0.0],0.602213,4.001407,"[0.3030863106250763, -0.19131538271903992, 0.4...",Brc1ccccc1,0.864927
997,EC10,1.982271,EC50,MOR,EC10,[0.0],0.602213,4.001407,"[0.3030863106250763, -0.19131538271903992, 0.4...",Brc1ccccc1,0.864927
998,EC10,1.982271,EC50,MOR,EC10,[0.0],0.602213,4.001407,"[0.3030863106250763, -0.19131538271903992, 0.4...",Brc1ccccc1,0.864927
999,EC50,1.982271,EC50,MOR,EC50,[0.0],0.942790,8.765768,"[0.8718034625053406, -0.06652995944023132, 0.5...",CCCCCCCCCCCCCCCNCCCCCCCCCCCCCCC,0.920570


# Plot projections for all chemicals
Finally, we can plot the chemical space built during training of the Transformer module in the model. The space is built from the CLS-embeddings of the model's training set, but new chemicals can also be projected onto it. The example below sets `show_all_predictions=True`, which plots additional SMILES not included in the training data to aid interpretation. It also sets `inference_df=results` to plot the SMILES predicted above in the same space; set it to `None` if this is not desired.

The plot can be saved as interactive HTML with `fig.write_html('figurename.html')`.

Note that the hover text of each point includes the L1Error obtained for that chemical when the model was evaluated on it in our 10x10-fold cross-validation.
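
The idea of building a space from training embeddings and then projecting new chemicals onto it can be illustrated with plain NumPy. This is a minimal sketch using random stand-in vectors, not the implementation behind `PlotPCA_CLSProjection`; the array names and the embedding size of 16 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLS-embeddings (random data, illustration only).
train_embeddings = rng.normal(size=(100, 16))
new_embeddings = rng.normal(size=(5, 16))

# Fit PCA on the training embeddings only: center them, then take the
# top-2 right singular vectors as the principal axes.
train_mean = train_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(train_embeddings - train_mean, full_matrices=False)
components = vt[:2]  # shape (2, 16)

# Project both sets onto the same axes, reusing the training mean.
train_2d = (train_embeddings - train_mean) @ components.T
new_2d = (new_embeddings - train_mean) @ components.T

print(train_2d.shape, new_2d.shape)
```

The key design point is that the axes and the mean come from the training set alone, so the position of a new chemical can be read relative to the chemicals the model was trained on.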

## PCA - CLS projection

In [32]:
PlotPCA_CLSProjection(model_type=MODEL_TYPE, endpoint=PREDICTION_ENDPOINT, effect=PREDICTION_EFFECT, species_group=SPECIES_GROUP, show_all_predictions=True, inference_df=results)

## PaCMAP - CLS projection


In [33]:
PlotPaCMAP_CLSProjection(model_type=MODEL_TYPE, endpoint=PREDICTION_ENDPOINT, effect=PREDICTION_EFFECT, species_group=SPECIES_GROUP, show_all_predictions=True, inference_df=results)


## UMAP - CLS projection

In [34]:
PlotUMAP_CLSProjection(model_type=MODEL_TYPE, endpoint=PREDICTION_ENDPOINT, effect=PREDICTION_EFFECT, species_group=SPECIES_GROUP, show_all_predictions=True, inference_df=results, n_neighbors=10, min_dist=0.1)
