
# Inference using the Multitask Text and Chemistry T5 model

This notebook shows how to run inference with the Multitask Text and Chemistry T5 model. It gives one example for each of the five tasks the model was trained on.


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
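max_length = 512  # upper bound on the length of each generated sequence, in tokens
num_beams = 10  # number of beams kept during beam search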

## Load model and tokenizer
Load the model and its tokenizer. The Hugging Face Hub hosts the two small variants of our model. The following examples use the small version trained on the augmented dataset.

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("GT4SD/multitask-text-and-chemistry-t5-small-augm")
tokenizer = AutoTokenizer.from_pretrained("GT4SD/multitask-text-and-chemistry-t5-small-augm")

## Paragraph-to-actions

Input: The reaction mixture was cooled to -80° C., and a solution of tert-butyl 6-[(cyclopropylmethoxy)methyl]-6-hydroxy-1,4-oxazepane-4-carboxylate (Preparation 80, 50 g, 0.22 mol, 1 eq) in THF was added.

Expected output: SETTEMPERATURE −80° C; MAKESOLUTION with tert-butyl 6-[(cyclopropylmethoxy)methyl]-6-hydroxy-1,4-oxazepane-4-carboxylate (50 g, 0.22 mol, 1 eq) and THF; ADD SLN.




In [None]:
instance = "The reaction mixture was cooled to -80° C., and a solution of tert-butyl 6-[(cyclopropylmethoxy)methyl]-6-hydroxy-1,4-oxazepane-4-carboxylate (Preparation 80, 50 g, 0.22 mol, 1 eq) in THF was added."
input_text = f"Which actions are described in the following paragraph: {instance}"

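# Tokenize the prompt into PyTorch tensors.
text = tokenizer(input_text, return_tensors="pt")
# Generate with beam search and decode the top-ranked sequence.
output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
output = tokenizer.decode(output[0].cpu())

# Keep the text before the end-of-sequence token, then drop padding tokens and surrounding whitespace.
output = output.split(tokenizer.eos_token)[0]
output = output.replace(tokenizer.pad_token,"")
output = output.strip()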

output
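
The three clean-up lines above recur in every example in this notebook. As a minimal sketch, they could be collected into one helper. The name `clean_decoded` is hypothetical and not part of the model or of the `transformers` API. The literal token strings `</s>` and `<pad>` in the example call are assumed values for illustration; in practice one would pass `tokenizer.eos_token` and `tokenizer.pad_token`.

```python
def clean_decoded(decoded: str, eos_token: str, pad_token: str) -> str:
    """Trim a decoded sequence the same way the cells in this notebook do."""
    # Keep only the text before the first end-of-sequence token.
    text = decoded.split(eos_token)[0]
    # Drop any remaining padding tokens.
    text = text.replace(pad_token, "")
    # Remove leading and trailing whitespace.
    return text.strip()


# Example call; "</s>" and "<pad>" are assumed token strings for this sketch.
example = clean_decoded("<pad> CCOC(=O)c1cc2sc(C)cc2n1C</s><pad><pad>", "</s>", "<pad>")
print(example)
```

With such a helper, each cell would reduce to tokenizing, generating, decoding, and one call to `clean_decoded`.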

## Forward prediction

Input: CCOC(=O)c1cc2sc(C)cc2[nH]1.CI.CN(C)C=O.[H-]~[Na+]

Expected output: CCOC(=O)c1cc2sc(C)cc2n1C

In [None]:
instance = "CCOC(=O)c1cc2sc(C)cc2[nH]1.CI.CN(C)C=O.[H-]~[Na+]"
input_text = f"Predict the product of the following reaction: {instance}"

text = tokenizer(input_text, return_tensors="pt")
output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
output = tokenizer.decode(output[0].cpu())

output = output.split(tokenizer.eos_token)[0]
output = output.replace(tokenizer.pad_token,"")
output = output.strip()

output

## Backward prediction

Input: CCS(=O)c1ccc(CCN)cc1 

Expected output: CCS(=O)c1ccc(C#N)cc1.CO.N.[Ni]


In [None]:
instance = "CCS(=O)c1ccc(CCN)cc1"
input_text = f"Predict the reaction that produces the following product: {instance}"

text = tokenizer(input_text, return_tensors="pt")
output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
output = tokenizer.decode(output[0].cpu())

output = output.split(tokenizer.eos_token)[0]
output = output.replace(tokenizer.pad_token,"")
output = output.strip()

output

## Description-to-SMILES

Input: The molecule is a tripeptide composed of L-alanine, L-aspartic acid, and glycine units joined in sequence by peptide linkages. It has a role as a metabolite. It derives from a L-alanine, a L-aspartic acid and a glycine.

Expected output: C[C@@H](C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)O)N

In [None]:
instance = "The molecule is a tripeptide composed of L-alanine, L-aspartic acid, and glycine units joined in sequence by peptide linkages. It has a role as a metabolite. It derives from a L-alanine, a L-aspartic acid and a glycine."
input_text = f"Write in SMILES the described molecule: {instance}"

text = tokenizer(input_text, return_tensors="pt")
output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
output = tokenizer.decode(output[0].cpu())

output = output.split(tokenizer.eos_token)[0]
output = output.replace(tokenizer.pad_token,"")
output = output.replace("<unk>","\\\\")
output = output.strip()

output

## SMILES-to-caption

Input: COC1=C(C=C2C3CC4=CC(=C(C=C4C(N3)CC2=C1)OC)OC)OC

Expected output: The molecule is a racemate comprising equimolar amounts of (R,R)- and (S,S)-pavine. It has a role as a plant metabolite. It contains a (R,R)-pavine and a (S,S)-pavine. It is a conjugate base of a pavine(1+).

In [None]:
instance = "COC1=C(C=C2C3CC4=CC(=C(C=C4C(N3)CC2=C1)OC)OC)OC"
input_text = f"Caption the following molecule: {instance}"

text = tokenizer(input_text, return_tensors="pt")
output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
output = tokenizer.decode(output[0].cpu())

output = output.split(tokenizer.eos_token)[0]
output = output.replace(tokenizer.pad_token,"")
output = output.strip()

output
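
The five examples differ only in the prompt prefix placed before the input. As a sketch, the prefixes used in this notebook can be gathered in a single mapping. The dictionary keys and the `build_prompt` helper are hypothetical names introduced here for illustration; the prefix strings themselves are copied from the cells above.

```python
# Prompt prefixes copied from the cells in this notebook; the task keys are
# hypothetical labels chosen for this sketch.
PROMPTS = {
    "paragraph-to-actions": "Which actions are described in the following paragraph: ",
    "forward": "Predict the product of the following reaction: ",
    "backward": "Predict the reaction that produces the following product: ",
    "description-to-smiles": "Write in SMILES the described molecule: ",
    "smiles-to-caption": "Caption the following molecule: ",
}


def build_prompt(task: str, instance: str) -> str:
    """Return the model input for a task; unknown tasks raise KeyError."""
    return PROMPTS[task] + instance


print(build_prompt("forward", "CCOC(=O)c1cc2sc(C)cc2[nH]1.CI.CN(C)C=O.[H-]~[Na+]"))
```

The resulting string can be passed to the tokenizer in place of `input_text`.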