## Code for Chapter 6 of the book "LangChain for Life Science and Healthcare", by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NgmCs1_LGLRwkia_oxOv0CbaLNUVOt1h?usp=sharing)

## Chemistry T5 models

This notebook demonstrates the use of various pre-trained chemistry models for different chemical tasks including:
- Forward reaction prediction (reactants → products)
- Backward reaction prediction (products → reactants)
- Description-to-SMILES conversion
- SMILES-to-description conversion
- Paragraph-to-actions extraction
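
Each task is selected purely by the prompt prefix placed in front of the input. As a minimal sketch, the prefixes used in the cells of this notebook can be collected in one place; the task keys and the `build_prompt` helper are hypothetical names introduced here for illustration, not part of any library.

```python
# Prompt prefixes copied from the cells of this notebook.
# The dictionary keys are hypothetical labels chosen for this sketch.
PROMPT_PREFIXES = {
    "forward": "Predict the product of the following reaction: ",
    "backward": "Predict the reaction that produces the following product: ",
    "text2smiles": "Write in SMILES the described molecule: ",
    "smiles2text": "Caption the following smile: ",
    "paragraph2actions": "Which actions are described in the following paragraph: ",
}


def build_prompt(task: str, instance: str) -> str:
    """Return the task prefix followed by the raw instance text."""
    if task not in PROMPT_PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(PROMPT_PREFIXES)}")
    return PROMPT_PREFIXES[task] + instance


print(build_prompt("forward", "CC(=O)O.OCC>[H+].[Cl-].OCC"))
```

Keeping the prefixes in a single mapping makes it harder to mistype a prompt, which matters because the models were trained on fixed prompt wording.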

**Data Sources & Inspiration:**
- GT4SD: https://github.com/GT4SD
- MolT5: https://github.com/blender-nlp/MolT5
- Online SMILES visualization: https://www.cheminfo.org/flavor/malaria/Utilities/SMILES_generator___checker/index.html


## Installation and Setup

First, we need to install the required packages for chemistry modeling:
- `einops`: Einstein Operations for tensor manipulation
- `rdkit`: Chemistry informatics toolkit for molecular operations
- `langchain` ecosystem: For LLM integration
- `py3Dmol`: 3D molecular visualization
- `transformers`: HuggingFace transformers library

In [1]:
!pip install einops
!pip install -q rdkit==2023.9.6 langchain langchain_openai langchainhub langchain_experimental langchain_community py3Dmol transformers==4.48

In [2]:
# !pip install transformers==4.42.4
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
from transformers import T5Tokenizer, T5ForConditionalGeneration, GenerationConfig
import torch

In [3]:
!pip freeze | grep transformers

sentence-transformers==4.1.0
transformers==4.48.0


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## ChemistryGenerator Class

This class wraps chemistry models to provide a unified interface for different chemical tasks.
It handles model initialization, tokenization, and text generation with customizable parameters.

**Key Features:**
- Supports both seq2seq and causal language models
- Configurable generation parameters (temperature, beam search, sampling)
- Automatic device placement and optimization
- Robust text processing and cleanup

In [5]:
class ChemistryGenerator:
    """Wrap a Hugging Face model and tokenizer behind a single `run_model` call."""

    def __init__(self, model, tokenizer, temperature=0.0001, do_sample=False, max_length=512, num_beams=5, top_k=1):
        # Cast the weights to bfloat16 to reduce memory use, then move the model to the selected device.
        self.model = model.bfloat16()
        self.model.to(device)
        # Force a pad token and left padding so every tokenizer is configured the same way.
        tokenizer.pad_token = "[PAD]"
        tokenizer.padding_side = "left"
        self.tokenizer = tokenizer
        if do_sample:
            # Sampling mode: temperature and top_k shape the token distribution.
            self.generation_config = GenerationConfig(
                do_sample=do_sample,
                top_k=top_k,
                num_beams=num_beams,
                temperature=temperature,
                max_new_tokens=max_length,
                pad_token_id=self.tokenizer.pad_token_id,
                repetition_penalty=1.5,
                num_return_sequences=1,
            )
        else:
            # Default mode: deterministic beam search.
            self.generation_config = GenerationConfig(
                do_sample=do_sample,
                num_beams=num_beams,
                max_new_tokens=max_length,
            )

    def run_model(self, input_text):
        """Tokenize `input_text`, generate, and return the decoded, cleaned string."""
        text = self.tokenizer(input_text, return_tensors="pt").to(device)

        output = self.model.generate(input_ids=text["input_ids"], generation_config=self.generation_config)
        # .cpu() is a no-op when the tensor already lives on the CPU, so no device branch is needed.
        output = self.tokenizer.decode(output[0].cpu(), skip_special_tokens=True)
        try:
            # Keep only the text before the end-of-sequence marker and drop leftover padding.
            output = output.split(self.tokenizer.eos_token)[0]
            output = output.replace(self.tokenizer.pad_token, "")
            # Replace unknown-token markers with backslashes.
            output = output.replace("<unk>","\\\\")
            output = output.strip()
        except Exception as e:
            print(e)

        return output
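
Two artifacts show up in the raw outputs in this notebook: decoder-only (causal) models return the prompt followed by the continuation, and a literal `<pad>` marker sometimes survives decoding. The following is a minimal sketch of an extra cleanup step; `clean_generation` is a hypothetical helper, not part of `ChemistryGenerator` or of the transformers library.

```python
def clean_generation(prompt: str, decoded: str, pad_token: str = "<pad>") -> str:
    """Strip an echoed prompt and leftover padding markers from decoded text."""
    text = decoded
    # Causal models return prompt + continuation; keep only the continuation.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Remove padding markers that survived decoding.
    text = text.replace(pad_token, "")
    return text.strip()


print(clean_generation("Write in SMILES: water", "<pad> O"))
```

It could be applied to the string returned by `run_model`, for example `clean_generation(input_text, generator.run_model(input_text))`.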

## Model Loading and Initialization

We'll load four different chemistry models, each with specific strengths:

1. **GT4SD Base (Augmented)**: General-purpose chemistry model for multiple tasks
2. **MolT5 Large**: Specialized for caption-to-SMILES conversion
3. **GT4SD Small (Augmented)**: Lighter version of GT4SD for faster inference
4. **CHEMLLM-2B**: Advanced chemistry language model for complex reasoning

### GT4SD Multitask Base Model

This is a T5-based model trained on multiple chemistry tasks with data augmentation.
Good for: reaction prediction, SMILES generation, general chemistry tasks.

In [6]:
# https://huggingface.co/GT4SD/multitask-text-and-chemistry-t5-base-augm
gt4sd_model_name = "GT4SD/multitask-text-and-chemistry-t5-base-augm"
model = AutoModelForSeq2SeqLM.from_pretrained(gt4sd_model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(gt4sd_model_name)

gt4sd_base_generator = ChemistryGenerator(model, tokenizer)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### MolT5 Large Caption-to-SMILES Model

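Specialized T5 model for converting molecular descriptions to SMILES notation.
Good for: natural language to chemical structure conversion.
Because it was fine-tuned only for this task, it emits a SMILES string for every prompt; its outputs on the other tasks in this notebook are shown for contrast only.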

In [7]:
# https://huggingface.co/laituan245/molt5-large-caption2smiles
laituan245_model_name = "laituan245/molt5-large-caption2smiles"
# device_map applies to models, not tokenizers; ChemistryGenerator moves the model to `device`.
tokenizer = T5Tokenizer.from_pretrained(laituan245_model_name)
model = T5ForConditionalGeneration.from_pretrained(laituan245_model_name)

laituan245_generator = ChemistryGenerator(model, tokenizer)  # keep do_sample=False (the default) if sampling produces NaN errors

### GT4SD Multitask Small Model

Smaller, faster version of the GT4SD model with similar capabilities.
Good for: quick prototyping, resource-constrained environments.

In [8]:
# https://huggingface.co/GT4SD/multitask-text-and-chemistry-t5-small-augm
gt4sd_model_name = "GT4SD/multitask-text-and-chemistry-t5-small-augm"
model = AutoModelForSeq2SeqLM.from_pretrained(gt4sd_model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(gt4sd_model_name)

gt4sd_small_generator = ChemistryGenerator(model, tokenizer)

### AI4Chem CHEMLLM 2B Model

Advanced chemistry-focused language model with 2 billion parameters.
Good for: complex chemical reasoning, detailed explanations, advanced tasks.

In [9]:
# https://huggingface.co/AI4Chem/CHEMLLM-2b-1_5
ai4chem_model_name = "AI4Chem/CHEMLLM-2b-1_5"
model = AutoModelForCausalLM.from_pretrained(ai4chem_model_name, torch_dtype=torch.float16, device_map="auto",trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(ai4chem_model_name,trust_remote_code=True)

ai4chem_generator = ChemistryGenerator(model, tokenizer)

A new version of the following files was downloaded from https://huggingface.co/AI4Chem/CHEMLLM-2b-1_5:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/AI4Chem/CHEMLLM-2b-1_5:
- modeling_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/AI4Chem/CHEMLLM-2b-1_5:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


## Forward Reaction Prediction

Forward prediction involves predicting the products of a chemical reaction given the reactants and conditions.

**SMILES Reaction Format:** `reactants>reagents>products`
- Reactants: Starting materials
- Reagents: Catalysts, solvents, conditions (between `>` symbols)
- Products: Expected outcome

**Example:** `CC(=O)O.OCC>[H+].[Cl-].OCC>CC(=O)OCC`
- Reactants: Acetic acid (CC(=O)O) + Ethanol (OCC)  
- Reagents: HCl catalyst ([H+].[Cl-]), with ethanol (OCC) listed again as the solvent
- Expected Product: Ethyl acetate (CC(=O)OCC)

This represents an esterification reaction where acetic acid and ethanol form ethyl acetate.

Reference:

https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html
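
The `reactants>reagents>products` layout described above can be taken apart with plain string operations: `>` separates the three fields and `.` separates molecules within a field. A minimal sketch follows; `Reaction` and `split_reaction` are hypothetical names, and accepting a two-field input (no products yet) is an assumption made so the prediction inputs used in this notebook also parse.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    reagents: List[str]
    products: List[str]


def split_reaction(rxn: str) -> Reaction:
    """Split a reaction SMILES into lists of reactant, reagent, and product SMILES."""
    fields = rxn.strip().split(">")
    if len(fields) == 2:
        # Prediction inputs stop after the reagents; treat products as empty.
        fields.append("")
    if len(fields) != 3:
        raise ValueError(f"Expected 'reactants>reagents>products', got {rxn!r}")

    def molecules(field: str) -> List[str]:
        # '.' separates disconnected molecules; skip empty fragments.
        return [m for m in field.split(".") if m]

    return Reaction(*(molecules(f) for f in fields))


rxn = split_reaction("CC(=O)O.OCC>[H+].[Cl-].OCC>CC(=O)OCC")
print(rxn.reactants, rxn.reagents, rxn.products)
```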

### Test Case: Esterification Reaction

**Chemical Context:**
- Acetic acid + Ethanol → Ethyl acetate (in presence of acid catalyst)
- This is a classic Fischer esterification reaction
- Expected product: CC(=O)OCC (ethyl acetate)

In [10]:
instance = "CC(=O)O.OCC>[H+].[Cl-].OCC"
input_text = f"Predict the product of the following reaction: {instance}"

In [11]:
gt4sd_base_generator.run_model(input_text)

'CCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCOCCO

In [12]:
laituan245_generator.run_model(input_text)

'<pad> C1C2CN(CN2C3=C(N1)N=C(NC3=O)N)C4=CC=C(C=C4)OCC5(CC5)C6=CC=CC=C6'

In [13]:
gt4sd_small_generator.run_model(input_text)

'CCOC(=O)CCOC(C)=O'

In [14]:
ai4chem_generator.run_model(input_text)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


'Predict the product of the following reaction: CC(=O)O.OCC>[H+].[Cl-].OCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCOCCCO

### Test Case: Complex Organic Synthesis

**Data Source:** USPTO 2023 reaction dataset

**Reaction Type:** Boc protection of an aromatic amine (carbamate formation with di-tert-butyl dicarbonate in water/dioxane)

**Expected:** Formation of tert-butyl carbamate protected amine

Reference:

https://figshare.com/articles/dataset/Reaction_SMILES_USPTO_year_2023/24921555?file=43858050

Reaction:

**`C(=O)(OC(C)(C)C)OC(=O)OC(C)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C>O.O1CCOCC1`**`>C(#N)C(C)(C)C1=CC=C(C=N1)NC(OC(C)(C)C)=O`

In [15]:
instance = "C(=O)(OC(C)(C)C)OC(=O)OC(C)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C>O.O1CCOCC1"
input_text = f"Predict the product of the following reaction: {instance}"

In [16]:
gt4sd_base_generator.run_model(input_text)

'CC(C)(C)OC(=O)NC1=C=CC(=NC1)C(C)(C#N)C(=O)O'

In [17]:
laituan245_generator.run_model(input_text)

'<pad> CC(C)(COP(=O)(O)OP(=O)(O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=NC3=C(N=CN=C32)N)O)OP(=O)(O)O)[C@H](C(=O)NCCC(=O)NCCS)O'

In [18]:
gt4sd_small_generator.run_model(input_text)

'CC(C)(C)OC(=O)NC=1C=CC(=NC(=O)OC(C)(C)C)C(C)(C#N)C1'

In [19]:
ai4chem_generator.run_model(input_text)

'Predict the product of the following reaction: C(=O)(OC(C)(C)C)OC(=O)OC(C)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C>O.O1CCOCC1>NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)C(C#N)(C)C.NC=1C=CC(=NC1)'

### Test Case: Nucleophilic Substitution

**Reaction Type:** SN2 reaction between chloroacetic acid and aniline derivative

**Expected:** Formation of N-substituted glycine derivative

Reaction:

**`ClCC(=O)O.CS(=O)(=O)C1=CC=C(N)C=C1>O.[OH-].[Na+]`**`>CS(=O)(=O)C1=CC=C(C=C1)NCC(=O)O`


In [20]:
instance = "ClCC(=O)O.CS(=O)(=O)C1=CC=C(N)C=C1>O.[OH-].[Na+]"
input_text = f"Predict the product of the following reaction: {instance}"

In [21]:
gt4sd_base_generator.run_model(input_text)

'CS(=O)(=O)C1=CC=C(NCCC(=O)O)C=C1O'

In [22]:
laituan245_generator.run_model(input_text)

'<pad> C(C[C@@H](C(=O)O)NC(=O)CC(CC(=O)O)(C(=O)O)O)CNC(=O)CC(CC(=O)O)(C(=O)O)O'

In [23]:
gt4sd_small_generator.run_model(input_text)

'CS(=O)(=O)C1=CC=C(NC(=O)CCl)C=C1O'

In [24]:
ai4chem_generator.run_model(input_text)

'Predict the product of the following reaction: ClCC(=O)O.CS(=O)(=O)C1=CC=C(N)C=C1>O.[OH-].[Na+].CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.C'

### Test Case: Multi-Component Reaction

**Reaction Type:** N-alkylation of a Boc-protected amine with 4-bromobenzyl bromide (NaH in DMF, NH4Cl quench)

**Challenge:** Multiple reactive sites and potential side reactions

Reaction:

**`BrC1=CC=C(C=C1)CBr.C(#N)C(C)(C)C1=CC=C(C=N1)NC(OC(C)(C)C)=O>N(C)(C)C=O.[H-].[Na+].[Cl-].[NH4+]`**`>BrC1=CC=C(CN(C(OC(C)(C)C)=O)C=2C=NC(=CC2)C(C)(C)C#N)C=C1`

In [25]:
instance = "BrC1=CC=C(C=C1)CBr.C(#N)C(C)(C)C1=CC=C(C=N1)NC(OC(C)(C)C)=O>N(C)(C)C=O.[H-].[Na+].[Cl-].[NH4+]"
input_text = f"Predict the product of the following reaction: {instance}"

In [26]:
gt4sd_base_generator.run_model(input_text)

'CC(C)(C#N)C1=CC=C(NC(=O)OC(C)(C)C)C=N1'

In [27]:
laituan245_generator.run_model(input_text)

'<pad> C1=CC(=CC=C1C2=C3C=CC(=O)C=C3OC4=C2C=CC(=C4)O[C@H]5[C@@H]([C@H]([C@@H]([C@H](O5)CO)O)O)O)O'

In [28]:
gt4sd_small_generator.run_model(input_text)

'CC(C)(C#N)C1=CC=C(C=CC2=CC=C(Br)C2)NC(=O)OC(C)(C)C'

In [29]:
ai4chem_generator.run_model(input_text)

'Predict the product of the following reaction: BrC1=CC=C(C=C1)CBr.C(#N)C(C)(C)C1=CC=C(C=N1)NC(OC(C)(C)C)=O>N(C)(C)C=O.[H-].[Na+].[Cl-].[NH4+].CCOC(C)=O>BrC1=CC=C(C=C1)CBr.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C#N.C'

## Backward Reaction Prediction (Retrosynthesis)

Retrosynthesis is the process of determining what reactants and conditions are needed to produce a given product. This is crucial for:
- Drug discovery and development
- Process optimization
- Route planning in organic synthesis
- Understanding reaction mechanisms

**Challenge:** One product can often be made through multiple synthetic routes, making this a complex prediction task.
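
Because several routes can be valid, a prediction that differs from the known route is not automatically wrong. Still, a rough overlap score against a reference route is a convenient first check. The sketch below compares precursor sets by exact string match, which is a simplifying assumption: the same molecule can be written as different SMILES strings (for example `OCC` and `CCO`), so in practice both sides should be canonicalized first. `precursor_overlap` is a hypothetical helper.

```python
def precursor_set(smiles: str) -> frozenset:
    """Split a dot-separated SMILES string into a set of component strings."""
    return frozenset(part for part in smiles.split(".") if part)


def precursor_overlap(predicted: str, reference: str) -> float:
    """Jaccard overlap between predicted and reference precursor sets (exact string match)."""
    pred, ref = precursor_set(predicted), precursor_set(reference)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


# Component order does not matter; only set membership does.
print(precursor_overlap("OCC.CC(=O)O", "CC(=O)O.OCC"))
```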

We'll start from the product of the esterification used in the forward-prediction section: ethyl acetate, `CC(=O)OCC`.

In [30]:
instance = "CC(=O)OCC"
input_text = f"Predict the reaction that produces the following product: {instance}"

### Test Case: Ethyl Acetate Retrosynthesis

**Target Product:** CC(=O)OCC (ethyl acetate)

**Expected Retrosynthesis:** Acetic acid + Ethanol with acid catalyst

**Chemical Logic:** Esters are commonly formed from carboxylic acids and alcohols

In [31]:
gt4sd_base_generator.run_model(input_text)

'CC(=O)OCC.CC(=O)[O-]CC(=O)[O-][Pb+2][Pd].CC(=O)[O-]CC(=O)[O-][Pb+2][Pd][Pd].CC(=O)[O-]CC(=O)[O-][Pb+2][Pd].CC(=O)[O-]CC(=O)[O-][Pb+2]'

In [32]:
laituan245_generator.run_model(input_text)

'<pad> CC(=O)OC[C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)N)O)O)O'

In [33]:
gt4sd_small_generator.run_model(input_text)

'O=C([O-])[O-][K+][K+].CC(=O)OCC'

In [34]:
ai4chem_generator.run_model(input_text)

'Predict the reaction that produces the following product: CC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(=O)OCC(='

### Test Case: Brominated Indole Amide

**Target Product:** BrC1=C2C=C(NC2=CC=C1)C(=O)N

**Chemical Context:** Indole derivative with bromine and amide functional groups

**Synthesis Challenge:** Multiple functional groups require careful reaction sequence

**Known Route:** `BrC1=C2C=C(NC2=CC=C1)C(=O)O>N(C)(C)C=O.O.C1CCCO1.C(C(=O)Cl)(=O)Cl.N>`**`BrC1=C2C=C(NC2=CC=C1)C(=O)N`**


In [35]:
instance = "BrC1=C2C=C(NC2=CC=C1)C(=O)N"
input_text = f"Predict the reaction that produces the following product: {instance}"

In [36]:
gt4sd_base_generator.run_model(input_text)

'CC(C)(C)OC(=O)N1C2=CC=CC(Br)=C2C=C1C(=O)OC(C)(C)C.ClCCl.O=C(O)C(F)(F)F'

In [37]:
laituan245_generator.run_model(input_text)

'<pad> C1=CC(=CC=C1C(=O)N[C@@H](CCC(=O)O)C(=O)O)NC(=O)C2C(=O)NC(=N2)N'

In [38]:
gt4sd_small_generator.run_model(input_text)

'O=C(O)C(F)(F)F.ClCCl.CC(C)(C)OC(=O)N1C2=CC=C(Br)C(=C1)C2'

In [39]:
ai4chem_generator.run_model(input_text)

'Predict the reaction that produces the following product: BrC1=C2C=C(NC2=CC=C1)C(=O)N1C(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)CNC'

### Test Case: Brominated Indole Nitrile

**Target Product:** BrC1=C2C=C(NC2=CC=C1)C#N

**Chemical Context:** Conversion of amide to nitrile (dehydration reaction)

**Known Route:** `BrC1=C2C=C(NC2=CC=C1)C(=O)N>C1(=CC=CC=C1)C.O.P(=O)(Cl)(Cl)Cl>`**`BrC1=C2C=C(NC2=CC=C1)C#N`**

**Mechanism:** POCl3-mediated dehydration of primary amide to nitrile

In [40]:
instance = "BrC1=C2C=C(NC2=CC=C1)C#N"
input_text = f"Predict the reaction that produces the following product: {instance}"

In [41]:
gt4sd_base_generator.run_model(input_text)

'O=C1CCC(=O)N1Br.ClCCl.CC(C)(C)[O-][Na+].CC(C)(C)[O-]CC(C)(C)[O-]CC(C)(C)[O-]CC(C)(C)[O-]CC(C)(C)[O-]CC(C)(C)[O-][Ti+4].CC(C)(C)[O-]CC(C)(C)P(C(C)(C)C)C(C)(C)CCC(C)(C)P(C(C)(C)C)C(C)(C)CCl[Pd]Cl[Fe+2]c1ccc(P(c2ccccc2)[c-]2cccc2)cc1c1ccc(P(c2ccccc2)[c-]2cccc2)cc1'

In [42]:
laituan245_generator.run_model(input_text)

'<pad> C1=CC=C(C(=C1)C(=O)O)NCC(=O)NCCC(=O)O'

In [43]:
gt4sd_small_generator.run_model(input_text)

'c1ccccc1P(c1ccccc1)c1ccccc1.C1(=O)N(Br)C(=O)CC1.C1=C(C#N)C=C(Br)C1'

In [44]:
ai4chem_generator.run_model(input_text)

'Predict the reaction that produces the following product: BrC1=C2C=C(NC2=CC=C1)C#N.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.C'

## Description-to-SMILES Conversion

This task involves converting natural language descriptions of molecules into SMILES (Simplified Molecular Input Line Entry System) notation.

**SMILES Benefits:**
- Compact, text-based representation of molecular structures
- Machine-readable format for computational chemistry
- Standardized way to store and search chemical databases
- Essential for cheminformatics and drug discovery

**Challenge:** Natural language descriptions can be ambiguous, verbose, or incomplete, making accurate SMILES generation difficult.


### Test Case: 4-Chlorophenol

**Source:** PubChem (https://pubchem.ncbi.nlm.nih.gov/compound/4-Chlorophenol)

**Chemical Name:** para-Chlorophenol

**Expected SMILES:** C1=CC(=CC=C1O)Cl

**Structure:** Benzene ring with hydroxyl group and chlorine in para positions

**Input**: The molecule appears as white crystals with a strong phenol odor. Slightly soluble to soluble in water, depending on the isomer, and denser than water. Noncombustible. Used as an intermediate in organic synthesis of dyes and drugs. The molecule is a monochlorophenol substituted at the pare position by a chlorine atom.


In [45]:
instance = "The molecule appears as white crystals with a strong phenol odor. Slightly soluble to soluble in water, depending on the isomer, and denser than water. Noncombustible. Used as an intermediate in organic synthesis of dyes and drugs. The molecule is a monochlorophenol substituted at the pare position by a chlorine atom."
input_text = f"Write in SMILES the described molecule: {instance}"

In [46]:
gt4sd_base_generator.run_model(input_text)

'C1=CC(=CC=C1Cl)Cl'

In [47]:
laituan245_generator.run_model(input_text)

'<pad> C1=CC(=C(C=C1Cl)O)N=C2C=CC(=[NH2+])C=C2'

In [48]:
gt4sd_small_generator.run_model(input_text)

'C1=CC(=CC=C1Cl)O'

In [49]:
ai4chem_generator.run_model(input_text)

'Write in SMILES the described molecule: The molecule appears as white crystals with a strong phenol odor. Slightly soluble to soluble in water, depending on the isomer, and denser than water. Noncombustible. Used as an intermediate in organic synthesis of dyes and drugs. The molecule is a monochlorophenol substituted at the pare position by a chlorine atom. It is a member of monochlorophenols and a member of monochlorobenzenes.'

### Test Case: Tripeptide (Ala-Asp-Gly)

**Source:** PubChem (https://pubchem.ncbi.nlm.nih.gov/compound/Ala-Asp-Gly)

**Chemical Name:** L-alanyl-L-aspartyl-glycine

**Structure:** Three amino acids linked by peptide bonds

**Expected SMILES:**
- `CC(C(=O)NC(CC(=O)O)C(=O)NCC(=O)O)N`
- `C[C@@H](C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)O)N`

In [50]:
instance = "The molecule is a tripeptide composed of L-alanine, L-aspartic acid, and glycine units joined in sequence by peptide linkages. It has a role as a metabolite. It is functionally related to a L-alanine, a L-aspartic acid and a glycine."
input_text = f"Write in SMILES the described molecule: {instance}"

In [51]:
gt4sd_base_generator.run_model(input_text)

'C[C@@H](C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)O)N'

In [52]:
laituan245_generator.run_model(input_text)

'<pad> C[C@@H](C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)O)N'

In [53]:
gt4sd_small_generator.run_model(input_text)

'C[C@@H](C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)O)N'

In [54]:
ai4chem_generator.run_model(input_text)

'Write in SMILES the described molecule: The molecule is a tripeptide composed of L-alanine, L-aspartic acid, and glycine units joined in sequence by peptide linkages. It has a role as a metabolite. It is functionally related to a L-alanine, a L-aspartic acid and a glycine. It is a conjugate acid of a L-aspartyl-L-alanyl-L-glycyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenylalanyl-L-phenyl

## SMILES-to-Caption Conversion

This task reverses the previous one: given a SMILES string, generate a natural language description of the molecule.

**Applications:**
- Making chemical databases more accessible to non-experts
- Generating training data for chemistry education
- Creating human-readable chemical reports
- Enhancing chemical search with natural language queries

**Challenge:** Good captions should include:
- Physical properties (appearance, solubility, etc.)
- Chemical properties and reactivity
- Biological activity and applications
- Structural features and functional groups

### Test Case: 4-Chlorophenol Caption Generation

**Input SMILES:** C1=CC(=CC=C1O)Cl

**Structure:** para-Chlorophenol

**Expected Caption Elements:**
- Physical appearance (white crystals)
- Odor characteristics (strong phenol odor)  
- Solubility properties
- Applications (dye and drug synthesis)
- Structural description (monochlorophenol, para substitution)
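
A checklist like the one above can be turned into a crude automatic check by searching the caption for keywords. This is only a sketch: the `caption_coverage` helper and the keyword lists are hypothetical, and substring matching will miss paraphrases.

```python
from typing import Dict, List


def caption_coverage(caption: str, expected_terms: Dict[str, List[str]]) -> Dict[str, bool]:
    """For each expected element, report whether any of its keywords occurs in the caption."""
    lowered = caption.lower()
    return {
        element: any(term.lower() in lowered for term in terms)
        for element, terms in expected_terms.items()
    }


# Hypothetical keyword lists for the 4-chlorophenol checklist.
expected = {
    "appearance": ["crystal"],
    "odor": ["odor"],
    "structure": ["chlorophenol"],
    "substitution": ["para", "position 4"],
}
caption = "The molecule is a chlorophenol that is phenol in which the hydrogen at position 4 has been replaced by a chlorine."
print(caption_coverage(caption, expected))
```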

In [55]:
instance = "C1=CC(=CC=C1O)Cl"
input_text = f"Caption the following smile: {instance}"

In [56]:
gt4sd_base_generator.run_model(input_text)

'The molecule is a chlorocatechol that is catechol in which the hydrogen para- to the hydroxy group is replaced by a chlorine. It is a chlorocatechol and a member of monochlorobenzenes.'

In [57]:
laituan245_generator.run_model(input_text)

'<pad> CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](OC1OCC=C)CO)O)O[C@H]2[C@@H]([C@H]([C@@H]([C@H](O2)CO)O)O)O'

In [58]:
gt4sd_small_generator.run_model(input_text)

'The molecule is a chlorophenol that is phenol in which the hydrogen at position 4 has been replaced by a chlorine. It has a role as a bacterial xenobiotic metabolite. It is a chlorophenol and a member of monochlorobenzenes. It derives from a hydride of a phenol.'

In [59]:
ai4chem_generator.run_model(input_text)

'Caption the following smile: C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C1O)Cl C1=CC(=CC=C'

## Paragraph-to-Actions Extraction

This task involves parsing scientific literature or patents to extract actionable synthesis steps.

**Applications:**
- Automated synthesis planning
- Laboratory procedure extraction
- Patent analysis and prior art search
- Recipe standardization and optimization
- Training data generation for synthesis robots

**Action Types:**
- ADD: Adding reagents or starting materials
- ADDSOLVENT: Adding solvents
- SETTEMPERATURE: Temperature control
- HEAT, COOL: Temperature changes
- STIR: Mixing operations
- FILTER: Separation operations
- And many more synthetic operations
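
The GT4SD outputs in this notebook list actions as a semicolon-separated string such as `ADD ...; SETTEMPERATURE ...`. Assuming that format, a small parser can turn the string into `(action, argument)` pairs for downstream use; `parse_actions` is a hypothetical helper and does not validate action names.

```python
from typing import List, Tuple


def parse_actions(text: str) -> List[Tuple[str, str]]:
    """Split 'ACTION argument; ACTION argument.' into (action, argument) tuples."""
    actions = []
    # Drop surrounding whitespace and the final period, then split on ';'.
    for step in text.strip().rstrip(".").split(";"):
        step = step.strip()
        if not step:
            continue
        # The first word is the action name; the remainder is its argument.
        name, _, argument = step.partition(" ")
        actions.append((name, argument.strip()))
    return actions


for action in parse_actions("ADD acetic anhydride; ADD salicylic acid; SETTEMPERATURE 90° C."):
    print(action)
```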


### Test Case: Aspirin Synthesis Patent

**Source:** US Patent 6278014B1 (https://patents.google.com/patent/US6278014B1/en)

**Historical Context:**
- Original Hoffman patent (1900): acetic anhydride + salicylic acid
- Ledeler modification (1901): added sulfuric acid catalyst
- Bercy method (1936): acetic acid solvent, temperature control

**Expected Actions:**
- ADD acetic anhydride
- ADD salicylic acid  
- ADD sulfuric acid (catalyst)
- ADDSOLVENT acetic acid
- SETTEMPERATURE 90°C (heating)
- SETTEMPERATURE 20°C (cooling)

In [60]:
instance = '''
The invention describes a novel method for the synthesis of acetylsalicylic acid. Since 1900, when Hoffman received the patent for the manufacture of acetyl salicylic acid from acetic anhydride and salicylic acid, there have been many modifications of the synthesis: Ledeler (1901) added sulfuric acid to the system in order to accelerate the process of esterification. A. Bercy, (Nature, No. 2977, p.462, 1936) further proposed to make this synthesis in the presence of acetic acid as a solvent, heating the system to 90° C. for some time and then cooling to 20° C. Other authors (e.g., E.J.Perry, Chem. Abst. Vol. 10 No. 2121), proposed that during the synthesis process at those temperatures, the ester o-AcC6H4CO2C6H4CO2H is formed and then it is decomposed into acetyl salicylic acid and salicylic acid.
'''
input_text = f"Which actions are described in the following paragraph: {instance}"

In [61]:
gt4sd_base_generator.run_model(input_text)

'ADD acetic acid; ADD sulfuric acid; ADD acetic acid; SETTEMPERATURE 90° C; SETTEMPERATURE 20° C; YIELD ester o-AcC6H4CO2C6H4CO2H.'

In [62]:
laituan245_generator.run_model(input_text)

'<pad> CC(=O)OC1=CC=CC=C1C(=O)O[C@@H]([C@H](C)O)C(=O)O'

In [63]:
gt4sd_small_generator.run_model(input_text)

'ADD acetyl salicylic acid; ADD acetic anhydride; ADD salicylic acid; SETTEMPERATURE 90° C; SETTEMPERATURE 20° C.'

In [64]:
ai4chem_generator.run_model(input_text)

'Which actions are described in the following paragraph: \nThe invention describes a novel method for the synthesis of acetylsalicylic acid. Since 1900, when Hoffman received the patent for the manufacture of acetyl salicylic acid from acetic anhydride and salicylic acid, there have been many modifications of the synthesis: Ledeler (1901) added sulfuric acid to the system in order to accelerate the process of esterification. A. Bercy, (Nature, No. 2977, p.462, 1936) further proposed to make this synthesis in the presence of acetic acid as a solvent, heating the system to 90° C. for some time and then cooling to 20° C. Other authors (e.g., E.J.Perry, Chem. Abst. Vol. 10 No. 2121), proposed that during the synthesis process at those temperatures, the ester o-AcC6H4CO2C6H4CO2H is formed and then it is decomposed into acetyl salicylic acid and salicylic acid.\nA. Bercy, (Nature, No. 2977, p.462, 1936) further proposed to make this synthesis in the presence of acetic acid as a solvent, heat

## Best Practices

1. **Task-Specific Model Selection:** Choose models based on your primary use case
2. **Ensemble Approaches:** Compare predictions from several models; agreement between them (as in the tripeptide test case) is a useful confidence signal
3. **Prompt Engineering:** Clear, specific prompts improve model performance
4. **Validation:** Always validate chemical predictions with domain experts
5. **Post-Processing:** Clean and verify generated SMILES for chemical validity
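
Points 2 and 5 can be combined in a small sketch: discard candidates that fail a cheap structural check, then take the most common remaining string. The check below only looks at balanced parentheses, closed bracket atoms, and paired ring-closure labels; it is not a chemical validity test. For real validation, RDKit's `Chem.MolFromSmiles` (installed above) returns `None` for strings it cannot parse. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

DIGITS = "0123456789"


def smiles_syntax_ok(smiles: str) -> bool:
    """Cheap structural check: balanced (), closed [], and paired ring-closure labels."""
    depth = 0
    in_bracket = False
    open_rings = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside [...] are isotopes, H counts, or charges, not ring labels.
            if ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {int(label)}
            i += 2
        elif ch in DIGITS:
            # A label opens a ring on first use and closes it on the second.
            open_rings ^= {int(ch)}
        i += 1
    return bool(smiles) and depth == 0 and not in_bracket and not open_rings


def majority_vote(candidates: Iterable[str]) -> Optional[str]:
    """Return the most frequent structurally plausible candidate, or None if none pass."""
    plausible = [s for s in candidates if smiles_syntax_ok(s)]
    if not plausible:
        return None
    return Counter(plausible).most_common(1)[0][0]


print(majority_vote(["C1=CC(=CC=C1Cl)Cl", "C1=CC(=CC=C1Cl)O", "C1=CC(=CC=C1Cl)O", "CC(="]))
```

Voting on exact strings is an assumption of this sketch; canonicalizing the candidates first would let equivalent SMILES written differently count as the same answer.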
