# CafChem tools for running inference with the TxGemma model, or fine-tuning TxGemma on your own med chem dataset.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MauricioCafiero/CafChem/blob/main/notebooks/TxGemma_CafChem.ipynb)

## This notebook allows you to:
- Explore some of the tasks that TxGemma has been trained for.
- Formulate a prompt for a specific TxGemma task.
- Select any of the TxGemma models.
- Run inference.

Also:
- Upload a classification dataset and create a set of training prompts based on it.
- Prepare TxGemma for fine-tuning.
- Fine-tune TxGemma.
- Run inference with the fine-tuned model.
- Push the dataset and model to the HF Hub.

## Requirements:
- This notebook will install rdkit, bitsandbytes, and the other libraries it needs.
- It will pull the CafChem tools from GitHub.
- Any GPU runtime can be used for inference on a small model. An A100 is recommended for the larger models and for fine-tuning.
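
To see which accelerator the current runtime offers before loading a model, PyTorch can be queried directly. The helper below is a minimal sketch: `describe_runtime` is a name introduced here for illustration and is not part of CafChem. It returns a message instead of failing when PyTorch or a GPU is missing.

```python
def describe_runtime() -> str:
    """Return a short description of the accelerator visible to PyTorch."""
    try:
        import torch  # imported later in this notebook as well
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    # Report the name and total memory of the first visible GPU.
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name} with {total_gib:.1f} GiB of memory"


print(describe_runtime())
```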

## Set-up

### Install libraries

In [1]:
! pip install --upgrade --quiet accelerate bitsandbytes huggingface_hub transformers
! pip install --upgrade datasets
! pip install rdkit
! pip install --upgrade --quiet peft trl


### Import libraries and set up some definitions
- Pull the CafChem tools from GitHub.
- Import libraries.

In [1]:
!git clone https://github.com/MauricioCafiero/CafChem.git

fatal: destination path 'CafChem' already exists and is not an empty directory.


In [2]:
import os
import json
from IPython.display import display, Markdown
import torch
from rdkit.Chem import AllChem, Draw, QED
from rdkit import Chem

import CafChem.CafChemTxGemma as cctxg

## Inference with TxGemma
- First, set up your choice of model.
- View suggested tasks.
- View the training prompts for those tasks.
- Make your own prompt based on the training prompt.
- Generate a response from TxGemma in either text or markdown format (markdown for chat models only).

In [5]:
model, tokenizer, pipe = cctxg.setup_txgemma(1)
cctxg.get_some_tdc_tasks()

Device set to use cuda:0


LD50_Zhu
logP_Morgan
BindingDB_kd
BindingDB_ic50
BindingDB_ki
Lipophilicity_AstraZeneca
Solubility_AqSolDB
Bioavailability_Ma
BBB_Martins
Skin_Reaction
Carcinogens_Lagunin
SARSCoV2_Vitro_Touret
SARSCOV2_3CLPro_Diamond
HIV
ClinTox


In [6]:
parameters = cctxg.view_task_prompt('ClinTox')

Instructions: Answer the following question about drug properties.
Context: Humans are exposed to a variety of chemicals through food, household products, and medicines, some of which can be toxic, leading to over 30% of promising pharmaceuticals failing in human trials due to toxicity. Toxic drugs can be identified from clinical trials that failed due to toxicity, while non-toxic drugs can be identified from FDA approval status or from clinical trials that report no toxicity.
Question: Given a drug SMILES string, predict whether it
(A) is not toxic (B) is toxic
Drug SMILES: {Drug SMILES}
Answer:
Parameters:
Drug SMILES


In [7]:
parameters

['{Drug SMILES}']
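
The returned list holds the placeholders that appear in the task prompt. As a rough illustration of how such a list could be derived from a template (this is an assumption, not necessarily how `view_task_prompt` works internally), a regular expression can collect every `{...}` token. The function name `find_placeholders` and the shortened template are hypothetical.

```python
import re


def find_placeholders(template: str) -> list:
    """Return each {placeholder} in the template, in order of first appearance."""
    seen = []
    for token in re.findall(r"\{[^{}]+\}", template):
        if token not in seen:  # keep only the first occurrence of repeated tokens
            seen.append(token)
    return seen


# Shortened stand-in for the ClinTox prompt shown above.
template = "Question: ...\nDrug SMILES: {Drug SMILES}\nAnswer:"
print(find_placeholders(template))
```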

In [8]:
prompt = cctxg.make_prompt('ClinTox',parameters, ['c1ccc(F)cc1'])

In [9]:
prompt

'Instructions: Answer the following question about drug properties.\nContext: Humans are exposed to a variety of chemicals through food, household products, and medicines, some of which can be toxic, leading to over 30% of promising pharmaceuticals failing in human trials due to toxicity. Toxic drugs can be identified from clinical trials that failed due to toxicity, while non-toxic drugs can be identified from FDA approval status or from clinical trials that report no toxicity.\nQuestion: Given a drug SMILES string, predict whether it\n(A) is not toxic (B) is toxic\nDrug SMILES: c1ccc(F)cc1\nAnswer:'
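
The filled prompt is the training prompt with each placeholder swapped for the supplied value. A minimal sketch of that substitution, assuming plain string replacement (the helper `fill_prompt` is hypothetical and is not the CafChem implementation):

```python
def fill_prompt(template: str, placeholders: list, values: list) -> str:
    """Replace each placeholder with the value at the same position."""
    if len(placeholders) != len(values):
        raise ValueError("need exactly one value per placeholder")
    prompt = template
    for placeholder, value in zip(placeholders, values):
        prompt = prompt.replace(placeholder, value)
    return prompt


# Shortened stand-in for the ClinTox prompt.
template = "Drug SMILES: {Drug SMILES}\nAnswer:"
print(fill_prompt(template, ["{Drug SMILES}"], ["c1ccc(F)cc1"]))
```

Checking the list lengths first catches the common mistake of passing fewer values than the task prompt expects.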

In [10]:
answer, full = cctxg.generate_text(prompt, model, tokenizer)



In [11]:
answer

' (A)'
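
The raw answer is a short string such as `' (A)'`. When screening many molecules it may help to turn that into a label. The sketch below assumes the answer contains a single parenthesised capital letter. `parse_choice` and `CLINTOX_LABELS` are names introduced here for illustration, with the label text taken from the ClinTox prompt above.

```python
import re
from typing import Optional

# Option letters and their meaning, copied from the ClinTox prompt.
CLINTOX_LABELS = {"A": "is not toxic", "B": "is toxic"}


def parse_choice(answer: str) -> Optional[str]:
    """Return the option letter inside parentheses, or None if there is none."""
    match = re.search(r"\(([A-Z])\)", answer)
    return match.group(1) if match else None


choice = parse_choice(" (A)")
print(choice, "->", CLINTOX_LABELS.get(choice, "unknown"))
```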

In [12]:
cctxg.generate_chat(prompt, model, tokenizer)



---



**User:**

Instructions: Answer the following question about drug properties.
Context: Humans are exposed to a variety of chemicals through food, household products, and medicines, some of which can be toxic, leading to over 30% of promising pharmaceuticals failing in human trials due to toxicity. Toxic drugs can be identified from clinical trials that failed due to toxicity, while non-toxic drugs can be identified from FDA approval status or from clinical trials that report no toxicity.
Question: Given a drug SMILES string, predict whether it
(A) is not toxic (B) is toxic
Drug SMILES: c1ccc(F)cc1
Answer:

---



**TxGemma:**

(A)

---



**User:**

Explain your reasoning based on the molecule structure.

---



**TxGemma:**

The molecule represented by the SMILES string "c1ccc(F)cc1" is **benzene with a fluorine atom attached**. 

* **Benzene (c1ccc1)** is a common aromatic hydrocarbon. While benzene itself is known to be toxic and carcinogenic, it's also a fundamental building block in many pharmaceuticals.  
* **The fluorine atom (F)** is often introduced into molecules to modify their properties, including increasing metabolic stability and lipophilicity. Fluorine substitution can also change the molecule's overall toxicity profile compared to its non-fluorinated counterpart.

**Reasoning:**

Without further information about the specific context or other functional groups present in the molecule, it's impossible to definitively say whether this specific benzene derivative is toxic or not. 

* **Potentially Toxic:**  The presence of benzene raises concerns about potential carcinogenicity and other toxic effects. 
* **Potentially Non-Toxic:** The fluorine atom *could* potentially mitigate some of the toxicity of the benzene ring, depending on its position and the overall molecular structure. 

**Conclusion:**  More information is needed to make a reliable prediction about the toxicity of this molecule.  It's crucial to remember that toxicity is a complex property influenced by many factors beyond the simple structure of a molecule. 


---



## Fine-tuning TxGemma

### Prepare dataset
- Read in your dataset.
- Create a training prompt.
- Fill the prompt from your dataset.

In [3]:
tyro = cctxg.prepare_dataset("/content/tyrosinase_3classes_aug.csv")

prepare dataset class initiated!


In [4]:
prompt_text = "Drugs which act as inhibitors for Tyrosinase can be categoried by IC50, which is the \n\
concentration at which they inhibit 50% of the Tyrosinase enzyme's activity. Given the drug SMILES string and molecular \n\
properties below, tell which of the three categories the drug will be in: (A) IC50 less than \n\
2.5 micromolar, (B) between 2.5 and 50 micromolar, or (C) above 50 micromolar.\ndrug SMILES : DRUG_SMILES. \n\
molecular properties: MOLECULAR_PROPERTIES. \n\
\nAnswer: "

In [5]:
prompt_template = tyro.define_prompt_template(prompt_text)
print(prompt_template)

{'input_text': "Drugs which act as inhibitors for Tyrosinase can be categoried by IC50, which is the \nconcentration at which they inhibit 50% of the Tyrosinase enzyme's activity. Given the drug SMILES string and molecular \nproperties below, tell which of the three categories the drug will be in: (A) IC50 less than \n2.5 micromolar, (B) between 2.5 and 50 micromolar, or (C) above 50 micromolar.\ndrug SMILES : DRUG_SMILES. \nmolecular properties: MOLECULAR_PROPERTIES. \n\nAnswer: ", 'output_text': 'ANSWER_TEXT'}
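
The prompt defines three classes by IC50 threshold. If your own dataset holds raw IC50 values rather than class labels, a binning step along these lines could produce them. This is a sketch under stated assumptions: the function name `ic50_class` is hypothetical, values are in micromolar, and the boundary values 2.5 and 50 are assigned to class B because the prompt describes A as "less than 2.5" and C as "above 50".

```python
def ic50_class(ic50_um: float) -> str:
    """Map an IC50 in micromolar to the class letter used in the prompt."""
    if ic50_um < 0:
        raise ValueError("IC50 cannot be negative")
    if ic50_um < 2.5:
        return "A"  # IC50 less than 2.5 micromolar
    if ic50_um <= 50.0:
        return "B"  # between 2.5 and 50 micromolar (boundaries included by assumption)
    return "C"      # above 50 micromolar


for value in (0.8, 2.5, 12.0, 50.0, 75.0):
    print(value, ic50_class(value))
```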


In [6]:
training_prompts = tyro.fill_prompt_template(prompt_template, ['A','B','C'])

Dataset loaded!


In [7]:
training_prompts["train"][700]

{'input_text': "Drugs which act as inhibitors for Tyrosinase can be categoried by IC50, which is the \nconcentration at which they inhibit 50% of the Tyrosinase enzyme's activity. Given the drug SMILES string and molecular \nproperties below, tell which of the three categories the drug will be in: (A) IC50 less than \n2.5 micromolar, (B) between 2.5 and 50 micromolar, or (C) above 50 micromolar.\ndrug SMILES : O=C(OCc1ccc(O)cc1)c1cc(O)c(O)c(O)c1. \nmolecular properties: Molecular weight 276.24, partition coefficient: 1.87, Hydrgen-bond acceptors: 6, \nHydrgen-bond donors: 4, Polariable Surface Area: 107.22, Rotatable bonds: 3,  Aromatic rings: 2. \n\nAnswer: ",
 'output_text': 'B'}
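
Each training record is the template with `DRUG_SMILES`, `MOLECULAR_PROPERTIES`, and `ANSWER_TEXT` replaced by the values for one molecule. The sketch below shows that substitution for a single record. It assumes simple token replacement. `fill_training_prompt` and the shortened template are hypothetical and do not reproduce `fill_prompt_template`.

```python
def fill_training_prompt(template: dict, smiles: str, properties: str, label: str) -> dict:
    """Build one training record from a template and one molecule's values."""
    input_text = (
        template["input_text"]
        .replace("DRUG_SMILES", smiles)
        .replace("MOLECULAR_PROPERTIES", properties)
    )
    output_text = template["output_text"].replace("ANSWER_TEXT", label)
    return {"input_text": input_text, "output_text": output_text}


# Shortened stand-in for the template printed earlier.
template = {
    "input_text": "drug SMILES : DRUG_SMILES. \nmolecular properties: MOLECULAR_PROPERTIES. \n\nAnswer: ",
    "output_text": "ANSWER_TEXT",
}
record = fill_training_prompt(
    template,
    smiles="O=C(OCc1ccc(O)cc1)c1cc(O)c(O)c(O)c1",
    properties="Molecular weight 276.24, Rotatable bonds: 3",
    label="B",
)
print(record["input_text"] + record["output_text"])
```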

### Training
- Set up a model (the 2B model is used for compute reasons).
- Get the model and tokenizer.
- Test the model before fine-tuning.
- Set up quantization/LoRA for the model.
- Train on your dataset.
- Test the model after fine-tuning.
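
LoRA keeps the original weights frozen and trains two small low-rank matrices per adapted layer, which is why fine-tuning fits on a single GPU. The arithmetic below illustrates the saving for one weight matrix. The layer size and rank are illustrative assumptions, not values read from TxGemma or from the CafChem configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # illustrative layer size (assumption)
rank = 8             # illustrative LoRA rank (assumption)

full = d_in * d_out  # parameters in the frozen weight matrix
added = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,}  LoRA: {added:,}  ratio: {added / full:.2%}")
```

With these illustrative numbers the adapter trains fewer than 1% of the parameters in the layer it is attached to.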

In [8]:
tyro_model = cctxg.train_TxF2BPredict(batch_size = 16, epochs = 10, trained_model_name = "tyro_ft", push = False)

train_TxF2BPredict class initiated!


In [9]:
model, tokenizer = tyro_model.setup_model()

In [10]:
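# Build the full training text for one example, then drop the last 8 characters
# (the answer label and whatever the formatting function appends after it).
test_prompt = tyro_model.formatting_func(training_prompts["train"][700])
test_prompt = test_prompt[:-8]

# Tokenize the prompt and move the tensors to the GPU.
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

# Generate up to 8 new tokens and print the prompt together with the completion.
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))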

Drugs which act as inhibitors for Tyrosinase can be categoried by IC50, which is the 
concentration at which they inhibit 50% of the Tyrosinase enzyme's activity. Given the drug SMILES string and molecular 
properties below, tell which of the three categories the drug will be in: (A) IC50 less than 
2.5 micromolar, (B) between 2.5 and 50 micromolar, or (C) above 50 micromolar.
drug SMILES : O=C(OCc1ccc(O)cc1)c1cc(O)c(O)c(O)c1. 
molecular properties: Molecular weight 276.24, partition coefficient: 1.87, Hydrgen-bond acceptors: 6, 
Hydrgen-bond donors: 4, Polariable Surface Area: 107.22, Rotatable bonds: 3,  Aromatic rings: 2. 

Answer:537


In [11]:
model, lora_config = tyro_model.set_up_peft(model)

In [12]:
model = tyro_model.train_model(model, tokenizer, lora_config, training_prompts)

[Progress output: formatting, adding EOS, tokenizing, and truncating applied to the train dataset (1053 examples) and the eval dataset (186 examples).]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


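| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 5.6765        | 1.653417        |
| 2     | 0.8042        | 0.492384        |
| 3     | 0.4205        | 0.373542        |
| 4     | 0.332         | 0.326129        |
| 5     | 0.2854        | 0.286634        |
| 6     | 0.2472        | 0.261297        |
| 7     | 0.2197        | 0.242341        |
| 8     | 0.2008        | 0.232703        |
| 9     | 0.1883        | 0.226801        |
| 10    | 0.1804        | 0.225766        |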
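
One way to judge whether more epochs would help is to look at how much the validation loss still drops from one epoch to the next. The sketch below computes that relative improvement from the validation losses reported above. The helper name `relative_improvements` is introduced here for illustration.

```python
# Validation loss per epoch, copied from the training log above.
val_loss = [1.653417, 0.492384, 0.373542, 0.326129, 0.286634,
            0.261297, 0.242341, 0.232703, 0.226801, 0.225766]


def relative_improvements(losses: list) -> list:
    """Fractional drop in loss between each pair of consecutive epochs."""
    return [(prev - cur) / prev for prev, cur in zip(losses, losses[1:])]


for epoch, gain in enumerate(relative_improvements(val_loss), start=2):
    print(f"epoch {epoch}: validation loss down {gain:.1%}")
```

In this run the loss falls at every epoch, but the drop from epoch 9 to epoch 10 is under 1%, which suggests the curve is flattening.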


In [13]:
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Drugs which act as inhibitors for Tyrosinase can be categoried by IC50, which is the 
concentration at which they inhibit 50% of the Tyrosinase enzyme's activity. Given the drug SMILES string and molecular 
properties below, tell which of the three categories the drug will be in: (A) IC50 less than 
2.5 micromolar, (B) between 2.5 and 50 micromolar, or (C) above 50 micromolar.
drug SMILES : O=C(OCc1ccc(O)cc1)c1cc(O)c(O)c(O)c1. 
molecular properties: Molecular weight 276.24, partition coefficient: 1.87, Hydrgen-bond acceptors: 6, 
Hydrgen-bond donors: 4, Polariable Surface Area: 107.22, Rotatable bonds: 3,  Aromatic rings: 2. 

Answer:  B
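
Because the decoded output repeats the whole prompt, the predicted class is whatever follows the final `Answer:`. A minimal sketch for pulling it out, assuming the prompt always ends with that marker (`extract_answer` is a hypothetical helper, not part of CafChem):

```python
def extract_answer(decoded: str, marker: str = "Answer:") -> str:
    """Return the text after the last occurrence of the marker, stripped of whitespace."""
    _, found, tail = decoded.rpartition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in output")
    return tail.strip()


# Endings of the two decoded outputs shown in this notebook.
print(extract_answer("... Aromatic rings: 2. \n\nAnswer:537"))   # before fine-tuning
print(extract_answer("... Aromatic rings: 2. \n\nAnswer:  B"))   # after fine-tuning
```

Comparing the extracted string with the `output_text` field of the same record gives a simple per-example correctness check.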
