<a href="https://colab.research.google.com/github/StyrbjornKall/ecoCAIT/blob/master/tutorials/Inference_tutorial_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inference
This script showcases the different models available in ecoCAIT and how to use them efficiently.

This is the same tutorial as the Jupyter notebook available under `tutorials`, but because this version runs on Google Colab, some additional setup code needs to run first.

Note: For large files, switching the runtime to GPU is recommended (open the *Runtime* dropdown menu, choose *Change runtime type*, and select *GPU*).  
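
To confirm that the runtime change took effect, a small check like the one below can be run in any cell. It is a minimal sketch: `pick_device` is a hypothetical helper name, not part of ecoCAIT, and whether ecoCAIT moves the model to the GPU automatically is not stated in this tutorial.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed yet (e.g. the install cell has not run)
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Running on: {pick_device()}")
```

If this prints `cpu` after selecting *GPU*, the runtime may need to be restarted before the change is picked up.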

## Install dependencies

In [38]:
!pip install transformers
!pip install torch
!pip install rdkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rd

## Mount personal google drive
The paths below should work without changes. The script automatically creates a new folder called `ecoCAIT` in your Google Drive.

In [39]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
%cd /content/gdrive/My Drive/
%ls

In [None]:
!git clone https://github.com/StyrbjornKall/ecoCAIT

Cloning into 'fishbAIT'...
remote: Enumerating objects: 193, done.
remote: Counting objects: 100% (193/193), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 193 (delta 89), reused 159 (delta 57), pack-reused 0
Receiving objects: 100% (193/193), 26.49 MiB | 15.54 MiB/s, done.
Resolving deltas: 100% (89/89), done.


In [41]:
import os
os.chdir('/content/gdrive/My Drive/ecoCAIT/tutorials/')
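
Re-running the clone cell in a later session fails with a git error, because the `ecoCAIT` folder already exists in the drive. A guard along these lines avoids that. This is a sketch only: `clone_if_missing` is a hypothetical helper, and the `run` parameter exists just so the function can be exercised without network access.

```python
import os
import subprocess

REPO_URL = "https://github.com/StyrbjornKall/ecoCAIT"


def clone_if_missing(parent_dir, repo_url=REPO_URL, run=subprocess.run):
    """Clone repo_url into parent_dir unless the target folder already exists.

    Returns the path of the repository folder in both cases.
    """
    repo_name = repo_url.rstrip("/").split("/")[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    target = os.path.join(parent_dir, repo_name)
    if not os.path.isdir(target):
        # Only reached on the first run, when the folder is not there yet
        run(["git", "clone", repo_url, target], check=True)
    return target
```

In this notebook it could replace the clone and `os.chdir` cells with `repo_dir = clone_if_missing('/content/gdrive/My Drive')` followed by `os.chdir(os.path.join(repo_dir, 'tutorials'))`.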

## Run tutorial

Now we are ready to run the tutorial. It follows the same layout as the Jupyter notebook tutorial available under `tutorials`.

In [42]:
import torch
import pandas as pd
import numpy as np
from inference_utils.ecoCAIT_for_inference import ecoCAIT_for_inference

Specify the model version and load the model

In [43]:
MODEL_VERSION = 'EC50_fish'

In [44]:
fishbait = ecoCAIT_for_inference(model_version=MODEL_VERSION)
fishbait.load_fine_tuned_model()

Load the SMILES you wish to predict

In [None]:
data = pd.read_excel('../data/tutorials/Inference_example_2.xlsx')
data

Unnamed: 0,SMILES,cmpdname
0,CC(=O)Oc1ccccc1C(O)=O,Aspirin
1,[Cr],Chromium
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2,Fluoxetine hydrochloride
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl,Clofenotane
4,[Cu],Copper
...,...,...
995,[Pb++].[O-]c1c(cc(c([O-])c1[N+]([O-])=O)[N+]([...,Lead styphnate
996,CC(C)(C)C(O)(CCc1ccc(Cl)cc1)Cn2cncn2,Tebuconazole
997,[Na+].[Na+].[Na+].[Na+].OCCN(CCO)c1nc(Nc2ccc(c...,OpticalBrightenerBbu220
998,CNC.OC(=O)COc1ccc(Cl)cc1Cl,"2,4-D dimethylamine salt"


Specify the endpoint and effect you wish to predict and make the prediction

In [None]:
PREDICTION_ENDPOINT = 'EC50'
PREDICTION_EFFECT = 'MOR'
EXPOSURE_DURATION = 96
SMILES_COLUMN_NAME = 'SMILES'

In [None]:
fishbait.predict_toxicity(
    SMILES=data[SMILES_COLUMN_NAME].iloc[0:10].tolist(),
    exposure_duration=EXPOSURE_DURATION,
    endpoint=PREDICTION_ENDPOINT,
    effect=PREDICTION_EFFECT,
)

Did not return onehotencoding for Endpoint. Why? You specified only one Endpoint or you specified NOEC and EC10 which are coded to be the same endpoint.
Did not return onehotencoding for Effect. Why? You specified only one Effect.
Will use input 0 to network due to no Onehotencodings being present.


  0%|          | 0/2 [00:00<?, ?it/s]
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 2/2 [00:00<00:00,  4.02it/s]


Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L)
0,CC(=O)Oc1ccccc1C(O)=O,96,EC50,MOR,CC(=O)Oc1ccccc1C(=O)O,[0.0],-2.900215,0.001258303
1,[Cr],96,EC50,MOR,[Cr],[0.0],-2.937654,0.001154373
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2,96,EC50,MOR,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1.[Cl-].[H+],[0.0],-4.483784,3.282584e-05
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl,96,EC50,MOR,Clc1ccc(C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl)cc1,[0.0],-6.660405,2.185722e-07
4,[Cu],96,EC50,MOR,[Cu],[0.0],-4.658821,2.193711e-05
5,CCNc1nc(Cl)nc(NC(C)C)n1,96,EC50,MOR,CCNc1nc(Cl)nc(NC(C)C)n1,[0.0],-3.16755,0.000679908
6,CN(C)C1=NC(=O)N(C2CCCCC2)C(=O)N1C,96,EC50,MOR,CN(C)c1nc(=O)n(C2CCCCC2)c(=O)n1C,[0.0],-3.280939,0.0005236741
7,CC(Br)(CO)[N+]([O-])=O,96,EC50,MOR,CC(Br)(CO)[N+](=O)[O-],[0.0],-2.916111,0.001213078
8,c1ccc2c(c1)c3cccc4cccc2c34,96,EC50,MOR,c1ccc2c(c1)-c1cccc3cccc-2c13,[0.0],-6.311255,4.883661e-07
9,[Cl-].[Cl-].[Zn++],96,EC50,MOR,[Cl-].[Cl-].[Zn+2],[0.0],-3.643362,0.0002273204
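
Judging by the column names and the values in the table above, the last two columns appear to hold the same prediction on two scales: `predictions (mg/L)` is 10 raised to the power of `predictions log10(mg/L)`. The sketch below checks this for two rows of the table. `log10_to_mg_per_l` is a hypothetical helper, not an ecoCAIT function.

```python
import math


def log10_to_mg_per_l(log10_value: float) -> float:
    """Convert a log10(mg/L) prediction back to a concentration in mg/L."""
    return 10.0 ** log10_value


# (log10 prediction, mg/L prediction) copied from rows 0 and 3 of the table above
table_rows = {
    "Aspirin": (-2.900215, 0.001258303),
    "Clofenotane": (-6.660405, 2.185722e-07),
}

for name, (log10_pred, mg_per_l_pred) in table_rows.items():
    converted = log10_to_mg_per_l(log10_pred)
    matches = math.isclose(converted, mg_per_l_pred, rel_tol=1e-4)
    print(f"{name}: {converted:.6e} mg/L (matches table: {matches})")
```

The log10 column is usually the more convenient one for comparing compounds, since the predicted concentrations span several orders of magnitude.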


## Upload your list of SMILES

For convenience, you can upload a file containing the SMILES you want to predict.

In [45]:
from google.colab import files
import io

In [61]:
uploaded_file = files.upload()

Saving SMILES.txt to SMILES (1).txt


In [62]:
# files.upload() returns a dict mapping each uploaded file name to its bytes; use the first entry
file_bytes = next(iter(uploaded_file.values()))

# Excel file with one column containing SMILES
#data = pd.read_excel(io.BytesIO(file_bytes), header=None, names=['SMILES'])

# .csv file with one column containing SMILES
#data = pd.read_csv(io.BytesIO(file_bytes), header=None, names=['SMILES'])

# .txt file with one SMILES per line
# (the lines are read directly because recent pandas versions reject sep='\n' in read_csv)
smiles_lines = [line.strip() for line in file_bytes.decode('utf-8').splitlines() if line.strip()]
data = pd.DataFrame({'SMILES': smiles_lines})
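
The three file types handled in the cell above can also be folded into one helper that picks the reader from the file extension, so nothing has to be commented in or out. This is a sketch under assumptions: `load_smiles` is a hypothetical name, the file is assumed to have no header row, and `.xls`/`.xlsx` reading relies on whichever Excel engine pandas has available.

```python
import io
import os


def load_smiles(filename: str, raw: bytes) -> list:
    """Return the SMILES strings in an uploaded .txt, .csv or Excel file."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".txt":
        # One SMILES per line; blank lines are skipped
        lines = raw.decode("utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    if ext in (".csv", ".xlsx", ".xls"):
        import pandas as pd  # only needed for tabular files

        reader = pd.read_csv if ext == ".csv" else pd.read_excel
        frame = reader(io.BytesIO(raw), header=None, names=["SMILES"])
        return frame["SMILES"].astype(str).tolist()
    raise ValueError(f"Unsupported file type: {ext!r}")


# Stand-in for the dict returned by files.upload()
example_upload = {"SMILES (1).txt": b"CC(=O)Oc1ccccc1C(O)=O\n[Cr]\n\n[Cu]\n"}
name, raw = next(iter(example_upload.items()))
print(load_smiles(name, raw))
```

With a real upload, `name, raw = next(iter(uploaded_file.items()))` followed by `data = pd.DataFrame({'SMILES': load_smiles(name, raw)})` yields the same one-column DataFrame used in the rest of the tutorial.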

In [63]:
data

Unnamed: 0,SMILES
0,CC(=O)Oc1ccccc1C(O)=O
1,[Cr]
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl
4,[Cu]
...,...
995,[Pb++].[O-]c1c(cc(c([O-])c1[N+]([O-])=O)[N+]([...
996,CC(C)(C)C(O)(CCc1ccc(Cl)cc1)Cn2cncn2
997,[Na+].[Na+].[Na+].[Na+].OCCN(CCO)c1nc(Nc2ccc(c...
998,CNC.OC(=O)COc1ccc(Cl)cc1Cl


In [65]:
PREDICTION_ENDPOINT = 'EC50'
PREDICTION_EFFECT = 'MOR'
EXPOSURE_DURATION = 96
SMILES_COLUMN_NAME = 'SMILES'

In [66]:
fishbait.predict_toxicity(
    SMILES=data[SMILES_COLUMN_NAME].tolist(),
    exposure_duration=EXPOSURE_DURATION,
    endpoint=PREDICTION_ENDPOINT,
    effect=PREDICTION_EFFECT,
)

Did not return onehotencoding for Endpoint. Why? You specified only one Endpoint or you specified NOEC and EC10 which are coded to be the same endpoint.
Did not return onehotencoding for Effect. Why? You specified only one Effect.
Will use input 0 to network due to no Onehotencodings being present.


  0%|          | 0/2 [00:00<?, ?it/s]
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 2/2 [00:00<00:00,  3.22it/s]


Unnamed: 0,SMILES,exposure_duration,endpoint,effect,SMILES_Canonical_RDKit,OneHotEnc_concatenated,predictions log10(mg/L),predictions (mg/L)
0,CC(=O)Oc1ccccc1C(O)=O,96,EC50,MOR,CC(=O)Oc1ccccc1C(=O)O,[0.0],-2.900215,0.001258303
1,[Cr],96,EC50,MOR,[Cr],[0.0],-2.937654,0.001154373
2,[H+].[Cl-].CNCCC(Oc1ccc(cc1)C(F)(F)F)c2ccccc2,96,EC50,MOR,CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1.[Cl-].[H+],[0.0],-4.483784,3.282587e-05
3,Clc1ccc(cc1)C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl,96,EC50,MOR,Clc1ccc(C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl)cc1,[0.0],-6.660405,2.185722e-07
4,[Cu],96,EC50,MOR,[Cu],[0.0],-4.65882,2.193713e-05
5,CCNc1nc(Cl)nc(NC(C)C)n1,96,EC50,MOR,CCNc1nc(Cl)nc(NC(C)C)n1,[0.0],-3.16755,0.000679908
6,CN(C)C1=NC(=O)N(C2CCCCC2)C(=O)N1C,96,EC50,MOR,CN(C)c1nc(=O)n(C2CCCCC2)c(=O)n1C,[0.0],-3.280938,0.0005236747
7,CC(Br)(CO)[N+]([O-])=O,96,EC50,MOR,CC(Br)(CO)[N+](=O)[O-],[0.0],-2.916111,0.001213079
8,c1ccc2c(c1)c3cccc4cccc2c34,96,EC50,MOR,c1ccc2c(c1)-c1cccc3cccc-2c13,[0.0],-6.311254,4.883672e-07
9,[Cl-].[Cl-].[Zn++],96,EC50,MOR,[Cl-].[Cl-].[Zn+2],[0.0],-3.643362,0.0002273204
