## Code for Chapter 2 of the book LangChain for Life Science and Healthcare, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Pfr9eJv2Md0HbMDsZU5w8vOVT03HHeIk?usp=sharing)

This notebook explores LLM and transformer models in a life-science context, covering encoder, decoder, and encoder-decoder architectures along with their textual and visual outputs.

It also demonstrates greedy decoding with beam visualization, LangChain pipelines, and chemistry-focused models.

In [None]:
import os
from google.colab import userdata

try:
  os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
except Exception as e:
  print(e)

Setup and Installations

In [None]:
!pip install -q tiktoken transformers torch langchain_huggingface langchain_community
#!pip install -q https://download.pytorch.org/whl/cu121/torch-2.2.1%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=1adf430f01ff649c848ac021785e18007b0714fdde68e4e65bd0c640bf3fb8e1


In [None]:
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl (67.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.46.0


In [None]:
!sudo apt-get install graphviz graphviz-dev
!pip install pygraphviz

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'libgraphviz-dev' instead of 'graphviz-dev'
graphviz is already the newest version (2.42.2-6ubuntu0.1).
The following additional packages will be installed:
  libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common
  libgvc6-plugins-gtk librsvg2-common libxdot4
Suggested packages:
  gvfs
The following NEW packages will be installed:
  libgail-common libgail18 libgraphviz-dev libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libgvc6-plugins-gtk librsvg2-common libxdot4
0 upgraded, 9 newly installed, 0 to remove and 35 not upgraded.
Need to get 2,434 kB of archives.
After this operation, 7,681 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libgtk2.0-common all 2.24.33-2ubuntu2.1 [125 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libgtk2.0-0 amd64 2.24.33-2ubuntu2.1 [2,038 kB]
Get:3 http://a

In [None]:
!pip freeze | grep "tiktoken\|bitsandbytes\|transformers\|pygraphviz"

bitsandbytes==0.46.0
pygraphviz==1.14
sentence-transformers==4.1.0
tiktoken==0.9.0
transformers==4.52.4


Set CUDA or CPU device for computation

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

Encoder, Decoder, and Encoder-Decoder Examples

In [None]:
# Import necessary libraries
from transformers import (
 AutoTokenizer,
 AutoModel,
 GPT2Tokenizer,
 GPT2LMHeadModel,
 AutoModelForSeq2SeqLM,
 AutoModelForCausalLM
)
input_text = "Butane is the only compound"

Encoder: BERT model that produces hidden states

In [None]:
## Encoder
# Load the tokenizer and model
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer(input_text, return_tensors='pt').input_ids


In [None]:
# Encode the input text
encoded_output = model(input_ids)
encoded_output.last_hidden_state

tensor([[[-0.3771,  0.2544, -0.0380,  ..., -0.3627,  0.4245,  1.0094],
         [ 0.2611,  0.0374, -0.0724,  ..., -0.1084,  0.6303,  0.8671],
         [-0.3729,  0.2033, -0.3311,  ..., -0.9189,  1.1230,  0.1301],
         ...,
         [-0.8172,  0.0670, -0.1465,  ..., -0.0153,  0.4193, -0.0631],
         [-0.9140, -0.0249, -0.3779,  ..., -0.1870,  0.5409,  0.1522],
         [ 0.7572,  0.1510, -0.3398,  ...,  0.0111, -0.4003, -0.1945]]],
       grad_fn=<NativeLayerNormBackward0>)

Decoder-only: GPT2 for text generation

In [None]:
## Decoder
# Load the tokenizer and model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer(input_text, return_tensors="pt").input_ids


Generate text with the GPT-2 decoder

In [None]:
# Generate and decode text
model_output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(model_output[0], skip_special_tokens=True)
generated_text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


'Butane is the only compound that can be used to make a chemical that is more effective than the one used in the chemical used in the chemical used in the chemical used in the chemical used in the chemical used in the chemical used in the chemical used'

Encoder-Decoder: T5 model

In [None]:
## Encoder-Decoder
# Load the tokenizer and model
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
input_ids = tokenizer(input_text, return_tensors="pt").input_ids


In [None]:
# Generate and decode text
model_output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(model_output[0], skip_special_tokens=True)
generated_text

'is the only compound that is not a compound. Butane is the only compound that is not a compound. Butane is the only compound that is not a compound.'

Because T5 has been trained with the span-mask denoising objective, it can be used to predict the sentinel (masked-out) tokens during inference. The predicted tokens will then be placed between the sentinel tokens.

In [None]:
# Most cancers are caused by genetic changes that happen mostly by chance and are not inherited
masked_text = "Most cancers are caused by <extra_id_0> changes that happen mostly by <extra_id_1> and are not <extra_id_2>"
input_ids = tokenizer(masked_text, return_tensors="pt").input_ids

In [None]:
sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences

['<pad><extra_id_0> genetic<extra_id_1> accident<extra_id_2> a cause for concern.</s>']

In [None]:
# Reference sentence: Most cancers are caused by genetic changes that happen mostly by chance and are not inherited
# Here the first sentinel hides two words ("genetic changes") instead of one
masked_text = "Most cancers are caused by <extra_id_0> that happen mostly by <extra_id_1> and are not <extra_id_2>"
input_ids = tokenizer(masked_text, return_tensors="pt").input_ids

In [None]:
sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences

['<pad><extra_id_0> mutations<extra_id_1> accident<extra_id_2> inherited.</s>']
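
The sentinel-delimited output above can be merged back into the masked input to read the completed sentence. The helper below is a minimal sketch using only the standard library; `fill_sentinels` is a hypothetical name, not part of `transformers`, and it assumes the decoded string still contains the `<extra_id_N>` markers (that is, it was decoded without `skip_special_tokens=True`).

```python
import re

SENTINEL = re.compile(r"(<extra_id_\d+>)")

def fill_sentinels(masked_text: str, decoded_output: str) -> str:
    """Insert T5 span predictions back into the masked input (hypothetical helper)."""
    # batch_decode keeps <pad> and </s> by default, so drop them first
    cleaned = decoded_output.replace("<pad>", "").replace("</s>", "")
    # Splitting on a capturing group keeps the sentinel names:
    # ['', '<extra_id_0>', ' mutations', '<extra_id_1>', ' accident', ...]
    parts = SENTINEL.split(cleaned)
    predictions = {
        parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)
    }
    # Replace each sentinel in the input with its predicted span;
    # sentinels without a prediction are left untouched
    return SENTINEL.sub(
        lambda m: predictions.get(m.group(0), m.group(0)), masked_text
    )

masked = (
    "Most cancers are caused by <extra_id_0> that happen mostly by "
    "<extra_id_1> and are not <extra_id_2>"
)
decoded = "<pad><extra_id_0> mutations<extra_id_1> accident<extra_id_2> inherited.</s>"
filled = fill_sentinels(masked, decoded)
print(filled)
# Most cancers are caused by mutations that happen mostly by accident and are not inherited.
```

Matching sentinels with a regular expression rather than plain string replacement avoids confusing `<extra_id_1>` with `<extra_id_10>` when many spans are masked.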

In [None]:
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
import time
import locale

locale.getpreferredencoding = lambda: "UTF-8"

GraphManager class to visualize decoding beams

In [None]:
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import torch
import matplotlib.colors as mcolors
from matplotlib.colors import LinearSegmentedColormap

class GraphManager:
    def __init__(self, beams=1, length=5, text=""):
        self.beams = beams
        self.length = length
        self.text = text
        self.graph = self.create_graph()

    def create_graph(self):
        graph = nx.balanced_tree(self.beams, self.length, create_using=nx.DiGraph())
        nx.set_node_attributes(graph, 100, "tokenscore")
        nx.set_node_attributes(graph, 0, "sequencescore")
        nx.set_node_attributes(graph, 0, "cumscore")
        nx.set_node_attributes(graph, self.text, "token")
        return graph

    def get_best_sequence(self):
        """
        Get the best sequence of tokens and its score.

        Uses the graph stored on this instance (self.graph).

        Returns:
            tuple: The best sequence as a string and its length-normalized sequence score.
        """
        # Create a list of leaf nodes
        leaf_nodes = [node for node in self.graph.nodes() if self.graph.out_degree(node) == 0]

        # Get the leaf node with the highest sequencescore
        max_score_node = max(
            leaf_nodes, key=lambda node: self.graph.nodes[node]["sequencescore"]
        )

        # Retrieve the path of nodes from the root node to this leaf node
        path = nx.shortest_path(self.graph, source=0, target=max_score_node)

        # Return the string of token attributes of this sequence
        sequence = " ".join([self.graph.nodes[node]["token"].split("_")[0] for node in path])

        return sequence, self.graph.nodes[max_score_node]["sequencescore"]

def plot_graph(graph, score, args=""):
    """
    Plot the graph with nodes colored based on scores.

    Parameters:
        graph (GraphManager): The graph manager whose tree is drawn.
        score (str): Score type, either 'token' or 'sequence'.
        args (str, optional): Extra arguments passed to the Graphviz layout,
            e.g. '-Grankdir="LR"'. Defaults to ''.
    """

    width, height = 3 + 1.2 * graph.beams**graph.length, max(4, 2 + graph.length)
    if args == '-Grankdir="LR"':
        height = 2
    fig, ax = plt.subplots(figsize=(width, height), dpi=300, facecolor="white")

    # Create positions for each node
    pos = nx.nx_agraph.graphviz_layout(graph.graph, prog="dot", args=args)

    # Determine the data key based on the score type
    score_key = "tokenscore" if score == "token" else "sequencescore"

    # Extract scores and compute normalization parameters
    node_data = graph.graph.nodes(data=True)
    scores = [data[score_key] for _, data in node_data if data["token"] is not None]
    vmin, vmax = min(scores), max(scores)
    norm = mcolors.Normalize(vmin=vmin, vmax=vmax)

    # Create a colormap with three colors
    cmap = LinearSegmentedColormap.from_list("rg", ["r", "y", "g"], N=256)

    # Draw the nodes with colors based on their scores
    node_colors = [
        data[score_key] for _, data in node_data if data["token"] is not None
    ]

    nx.draw_networkx_nodes(
        graph.graph,
        pos,
        node_size=2000,
        node_shape="o",
        alpha=1,
        linewidths=4,
        node_color=node_colors,
        cmap=cmap,
    )

    # Draw the edges
    nx.draw_networkx_edges(graph.graph, pos)

    # Prepare node labels based on the score type
    label_format = "{:.2f}%" if score == "token" else "{:.2f}"
    labels = {
        node: f"{data['token'].split('_')[0]}\n{label_format.format(data[score_key])}"
        for node, data in node_data
        if data["token"] is not None
    }

    # Draw the labels
    nx.draw_networkx_labels(graph.graph, pos, labels=labels, font_size=10)
    plt.box(False)

    # Add a colorbar
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    label_text = "Token probability (%)" if score == "token" else "Sequence score"
    fig.colorbar(sm, ax=ax, orientation="vertical", pad=0, label=label_text)

    plt.show()



class GreedyTextGenerator:
    def __init__(self, model, tokenizer, beams=1, decoder_mode=False):
        self.model = model
        self.tokenizer = tokenizer
        self.decoder_mode = decoder_mode
        self.beams = beams

    def _greedy_sampling(self, logits):
        return torch.topk(logits, self.beams).indices


    def run_greedy(self, text, number_of_outputs):
        # Ensure input is on the same device as the model
        encoded_input_ids = self.tokenizer(text).input_ids
        encoded_input_ids = torch.tensor(encoded_input_ids, dtype=torch.long).unsqueeze(dim=0).to(self.model.device)

        decoder_input_ids = None
        if self.decoder_mode:
            decoder_input_ids = torch.tensor([self.model.config.decoder_start_token_id],
                                              dtype=torch.long).unsqueeze(dim=0).to(self.model.device)

        graph_manager = GraphManager(beams=self.beams, length=number_of_outputs, text=text)

        self._greedy_search(encoded_input_ids, 0, number_of_outputs, graph_manager.graph, decoder_input_ids)
        return graph_manager

    def _greedy_search(self, encoded_input, node, length, graph, decoder_input=None):
        if length:
            # Ensure inputs are on the same device
            if decoder_input is not None:
                outputs = self.model(input_ids=encoded_input, decoder_input_ids=decoder_input)
            else:
                outputs = self.model(encoded_input)

            logits = outputs.logits[0, -1, :]
            top_token_ids = self._greedy_sampling(logits)

            for j, token_id in enumerate(top_token_ids):
                # Ensure new tensors are on the same device
                if decoder_input is not None:
                    new_encoded_inputs = torch.cat((decoder_input,
                                                    torch.tensor([token_id], dtype=torch.long)
                                                    .unsqueeze(dim=0)
                                                    .to(self.model.device)),
                                                  dim=1)
                else:
                    new_encoded_inputs = torch.cat((encoded_input,
                                                    torch.tensor([token_id], dtype=torch.long)
                                                    .unsqueeze(dim=0)
                                                    .to(self.model.device)),
                                                  dim=1)

                current_node = list(graph.successors(node))[j]
                token_score = torch.log(torch.softmax(logits, dim=-1)[token_id])
                cumulative_score = graph.nodes[node]["cumscore"] + token_score.item()

                graph.nodes[current_node].update({
                    "tokenscore": np.exp(token_score.item()) * 100,
                    "cumscore": cumulative_score,
                    "sequencescore": 1 / (len(new_encoded_inputs.squeeze())) * cumulative_score,
                    "token": self.tokenizer.decode(token_id, skip_special_tokens=True) + f"_{length}_{j}"
                })

                # Recursively call with device-consistent inputs
                if decoder_input is not None:
                    self._greedy_search(encoded_input, current_node, length - 1, graph, decoder_input=new_encoded_inputs)
                else:
                    self._greedy_search(new_encoded_inputs, current_node, length - 1, graph, decoder_input)
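
At each step, `_greedy_search` takes the logits for the last position, keeps the `beams` highest-scoring token ids, and records each token's softmax probability (as a percentage) and log-probability. The sketch below reproduces that single step in plain Python so it can be followed without a model; the vocabulary and logit values are made up for illustration, and `top_k_candidates` is a hypothetical helper, not part of the class above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_candidates(logits, vocab, k):
    """Return (token, log-probability, probability in %) for the k best tokens."""
    probs = softmax(logits)
    # Same selection as torch.topk(logits, k).indices: highest logits first
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(vocab[i], math.log(probs[i]), probs[i] * 100) for i in ranked]

# Toy values, assumed for illustration only
toy_vocab = [" the", " blood", " cells", " a"]
toy_logits = [3.0, 0.1, 1.2, -0.5]

candidates = top_k_candidates(toy_logits, toy_vocab, k=3)
for token, log_prob, percent in candidates:
    print(f"{token!r}: log p = {log_prob:.3f}, p = {percent:.2f}%")
```

With `k=1` this reduces to ordinary greedy decoding; larger values of `k` expand several branches per step, which is what produces the tree drawn by `plot_graph`.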

In [None]:
# TEXT = "The formula of dihydrogen monoxide is"
# TEXT = "RCHO + H2 ->"
# TEXT = "The formula of dihydrogen"
# TEXT = "The application of CAR-T cell therapy in pancreatic"
TEXT = "Lipinki assumption: Compounds are absorbed into" # cells via passive diffusion
NUMBER_OF_OUTPUTS = 4

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Option for lower RAM (uncomment to use instead of the DeepSeek model below):
# model_name = "bigscience/bloom-560m"
# model = AutoModelForCausalLM.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit")
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit")
model.to(device)


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
      (1): LlamaDecoder

In [None]:
text_generator = GreedyTextGenerator(model, tokenizer, decoder_mode=False, beams = 3)
graph_m = text_generator.run_greedy(TEXT, NUMBER_OF_OUTPUTS)

In [None]:
sequence, max_score = graph_m.get_best_sequence()
print(f"Generated text: {sequence}, score: {max_score}")

Generated text: Lipinki assumption: Compounds are absorbed into  the  bloodstream  via  the, score: -0.26241861979166664
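
The score printed above is the one computed in `_greedy_search`: the cumulative log-probability of the chosen tokens divided by the full sequence length, which in decoder-only mode includes the prompt tokens. The sketch below shows why the highest-probability first choice does not always lead to the best-scoring path. The probabilities are rounded from the token percentages this notebook prints for the first three levels of the tree; `prompt_len = 10` is an assumed value, since the real prompt length depends on the tokenizer.

```python
import math

def sequence_score(token_probs, prompt_len):
    """Length-normalized log-probability, mirroring the 'sequencescore' attribute."""
    cumulative = sum(math.log(p) for p in token_probs)
    total_len = prompt_len + len(token_probs)  # prompt tokens plus generated tokens
    return cumulative / total_len

prompt_len = 10  # assumed for illustration

# " the" -> " body" -> " and": picks the most likely token at every step
greedy_path = [0.755, 0.290, 0.223]
# " the" -> " bloodstream" -> " and": a less likely second token, but a stronger third
alternative_path = [0.755, 0.233, 0.325]

greedy_score = sequence_score(greedy_path, prompt_len)
alternative_score = sequence_score(alternative_path, prompt_len)

print(f"greedy:      {greedy_score:.4f}")
print(f"alternative: {alternative_score:.4f}")
print("alternative wins" if alternative_score > greedy_score else "greedy wins")
```

Scores are negative because they are averages of log-probabilities; values closer to zero indicate a more likely sequence.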


In [None]:
# Plot graph
plot_graph(graph_m, "token")

Output hidden; open in https://colab.research.google.com to view.

In [None]:
import pickle

with open('deepseek.pickle', 'wb') as f:
    pickle.dump(graph_m.graph, f)

with open('deepseek.pickle', 'rb') as f:
    G = pickle.load(f)

In [None]:
G.nodes

NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120))
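
The 121 nodes listed above follow directly from how `GraphManager` builds the tree with `nx.balanced_tree(beams, length)`: every node has `beams` children and the tree is `length` levels deep. The short sketch below computes the node count and the number of model forward passes (one per non-leaf node, since `_greedy_search` only calls the model while `length` is non-zero). Both function names are hypothetical helpers introduced for illustration.

```python
def balanced_tree_size(branching: int, height: int) -> int:
    """Total nodes in a balanced tree: one root plus branching**d nodes at depth d."""
    return sum(branching ** depth for depth in range(height + 1))

def forward_passes(branching: int, height: int) -> int:
    """Model calls made by the recursive search: one per non-leaf node."""
    return sum(branching ** depth for depth in range(height))

# beams=3, NUMBER_OF_OUTPUTS=4, as used above
print(balanced_tree_size(3, 4))  # 121
print(forward_passes(3, 4))      # 40

# beams=1 degenerates to a single chain (plain greedy decoding)
print(balanced_tree_size(1, 20))  # 21
print(forward_passes(1, 20))      # 20
```

Because the cost grows exponentially with depth, full expansion of every branch is practical only for a few steps; it is useful here for visualization rather than as a production decoding strategy.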

In [None]:
edges_dict = {}
for e in list(G.edges):
  if e[0] not in edges_dict:
    edges_dict[e[0]] = []
  edges_dict[e[0]].append(e[1])

In [None]:
# Label every non-leaf node with its decoded token (suffixed _<remaining steps>_<rank>) and its token probability in percent
tokens = {node: f"token_{G.nodes[node]['token']}__{G.nodes[node]['tokenscore']}" for node in range(max(edges_dict.keys()) + 1)}

def print_hierarchy(tokens, node, level=0):
    # Print the current node with its token and indentation based on its level in the hierarchy
    if node in tokens:
      print('  ' * level + f'Node {node}, Token: {tokens[node]}')
      # Recurse for each child node, if any
      for child in edges_dict.get(node, []):
          print_hierarchy(tokens, child, level + 1)

# Start printing the hierarchy from the root node
print_hierarchy(tokens, 0)

Node 0, Token: token_Lipinki assumption: Compounds are absorbed into__100
  Node 1, Token: token_ the_4_0__75.48396019890073
    Node 4, Token: token_ body_3_0__29.016525093012497
      Node 13, Token: token_ and_2_0__22.26947843644562
      Node 14, Token: token_ as_2_1__18.03437323390992
      Node 15, Token: token_,_2_2__17.343487244895613
    Node 5, Token: token_ bloodstream_3_1__23.292675058834106
      Node 16, Token: token_ and_2_0__32.528717383743725
      Node 17, Token: token_ via_2_1__15.607449166991074
      Node 18, Token: token_,_2_2__14.004096213351506
    Node 6, Token: token_ blood_3_2__7.924933558859634
      Node 19, Token: token_ and_2_0__20.71693405218311
      Node 20, Token: token_,_2_1__16.908668780144364
      Node 21, Token: token_ via_2_2__8.436047860698274
  Node 2, Token: token_ blood_4_1__4.159865266915964
    Node 7, Token: token_,_3_0__18.826177301524922
      Node 22, Token: token_ but_2_0__16.58162552121675
      Node 23, Token: token_ so_2_1__16.4365

In [None]:
text_generator = GreedyTextGenerator(model, tokenizer, decoder_mode=False, beams = 1)
graph_m = text_generator.run_greedy(TEXT, 20)

G = graph_m.graph

edges_dict = {}
for e in list(G.edges):
  if e[0] not in edges_dict:
    edges_dict[e[0]] = []
  edges_dict[e[0]].append(e[1])

tokens = {node: f"token_{G.nodes[node]['token']}__{G.nodes[node]['tokenscore']}" for node in range(max(edges_dict.keys()) + 1)}
print_hierarchy(tokens, 0)

Node 0, Token: token_Lipinki assumption: Compounds are absorbed into__100
  Node 1, Token: token_ the_20_0__75.48396019890073
    Node 2, Token: token_ body_19_0__29.016525093012497
      Node 3, Token: token_ and_18_0__22.26947843644562
        Node 4, Token: token_ distributed_17_0__12.838337491044586
          Node 5, Token: token_ in_16_0__18.229146123106823
            Node 6, Token: token_ the_15_0__34.89821950953099
              Node 7, Token: token_ blood_14_0__30.231400125704933
                Node 8, Token: token_,_13_0__32.81587438577953
                  Node 9, Token: token_ but_12_0__17.7028363674121
                    Node 10, Token: token_ the_11_0__12.713573293203556
                      Node 11, Token: token_ exact_10_0__7.5916213560166605
                        Node 12, Token: token_ mechanism_9_0__22.664394888734783
                          Node 13, Token: token_ is_8_0__61.909769754472954
                            Node 14, Token: token_ unclear_7_0__57.7062

Model wrapper class to abstract inference for any transformer

In [None]:
class Model:
    def __init__(self, model, tokenizer, device=None):
        """
        Initialize the Model class with a given model and tokenizer.

        Parameters:
        - model: The machine learning model to generate text.
        - tokenizer: The tokenizer used for encoding the input text and decoding the model output.
        - device: Optional device (e.g. "cuda" or "cpu") to move the model to.
        """
        self.model = model
        self.tokenizer = tokenizer
        if device:
          self.model.to(device)

    def generate_model_output(self, input_text, max_length=512, num_beams=5):
        """
        Generates text output from the model given an input text.

        Parameters:
        - input_text: The input text to be processed by the model.
        - max_length: The maximum length of the output, in tokens.
        - num_beams: The number of beams for beam search (accepted, but not currently passed to generate()).

        Returns:
        - The generated text output from the model.
        """

        # Move the inputs to the model's own device instead of relying on a global variable
        encoded_input = self.tokenizer(input_text, return_tensors="pt").to(self.model.device)

        output_ids = self.model.generate(input_ids=encoded_input["input_ids"], max_length=max_length)
        output_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        output_text = output_text.strip()

        return output_text

Load and use a chemistry-specific T5 model

In [None]:
from transformers import T5ForConditionalGeneration, AutoTokenizer
model_name = "GT4SD/multitask-text-and-chemistry-t5-base-augm"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

chem_t5_model = Model(model, tokenizer, device)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Run chemistry prompts through the model

In [None]:
TEXT = "The formula of dihydrogen monoxide is"

In [None]:
input_text = f"{TEXT}"
chem_t5_model.generate_model_output(input_text)

'[O-][Mn](=O)(=O)=O'

In [None]:
input_text = f"Continue the following phrase as a chemist: {TEXT}"
chem_t5_model.generate_model_output(input_text)

'[O-][Mn](=O)(=O)=O.[O-][Mn](=O)(=O)=O.[O-][Mn](=O)(=O)=O.[Na+].[Na+]'

In [None]:
input_text = f"You are a professional chemistry researcher. Finish the following sentence: {TEXT}"
chem_t5_model.generate_model_output(input_text)

'The molecule is a dihydrogen monoxide. It is a conjugate base of a dihydrogen monoxide(2+). It is a conjugate acid of a dihydrogen monoxide(1-).'

In [None]:
instance = """To support clinical use, a multigram-scale process has been developed to provide 5-MeO-DMT,
a psychedelic natural product found in the parotid gland secretions of the toad, Incilius alvarius.
Several synthetic routes were initially explored, and the selected process featured an optimized Fischer
indole reaction to 5-MeO-DMT freebase in high-yield, from which the 1:1 succinate salt was produced to
provide 136 g of crystalline active pharmaceutical ingredient (API) with 99.86% peak area by high-performance
liquid chromatography (HPLC) and a net yield of 49%. The report provides in-process monitoring, validated
analytical methods, impurity formation and removal, and solid-state characterization of the API essential
for subsequent clinical development."""

In [None]:
input_text = f"Summarize the following abstract: {instance}"
chem_t5_model.generate_model_output(input_text)

'CC(=O)N(CC)CC.C1=CC2=C(C=C1C(=O)N(CC)CC)N(C(=O)N2CC)C3=NC(=NC(=N3)C4=C(N=CN4)C=C5C(=CC(=N5)C=C6C(=O)N(CC)CC)N6CC)C(=O)N(CC)CC.C(=O)(O)O'

In [None]:
input_text = f"You are a professional chemistry researcher. Summarize the following abstract: {instance}"
chem_t5_model.generate_model_output(input_text)

'CC(=O)N(CC)CC.C1=CC(=CC=C1C(=O)N[C@H](C2=NC(=NC(=C2)C(=O)N[C@H](C3=NC(=CS3)C(=O)N[C@H](C4=NC(=CS4)C(=O)N[C@H](C(=N1)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)'

In [None]:
input_text = f"You are a professional chemistry researcher. Summarize the following abstract in 1 sentence: {instance}"
chem_t5_model.generate_model_output(input_text)

'CC(=O)N(CC)CC.C1=CC(=CC=C1C(=O)N[C@H](C2=NC(=NC(=C2)C(=O)N[C@H](C3=NC(=CS3)C(=O)N[C@H](C4=NC(=CS4)C(=O)N[C@H](C(=N1)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N)CC(=O)N'

Comparing models on one question: what is the molecule of Ibuprofen?

In [None]:
RUN_AI4Chem_GPU = False

In [None]:
input_text = "What is the molecule of Ibuprofen?"
chem_t5_model.generate_model_output(input_text)

'CC(C)C1=CC=C(C=C1)C(C)C(C)C2=CC=C(C=C2)I.C1=C(C=C(C=C1)Cl)Cl'

In [None]:
#pip install einops
#https://huggingface.co/AI4Chem/ChemLLM-7B-Chat
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if RUN_AI4Chem_GPU:
  model_name_or_id = "AI4Chem/ChemLLM-7B-Chat"

  model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto",trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_id,trust_remote_code=True)

  chemllm_model = Model(model, tokenizer, device)

In [None]:
if RUN_AI4Chem_GPU:
  chemllm_model.generate_model_output(input_text)

### DeepSeek

In [None]:
tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit")
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit")
deepseek_model = Model(model, tokenizer, device)

In [None]:
deepseek_model.generate_model_output(input_text)

"What is the molecule of Ibuprofen? What are its key properties?\nIbuprofen is a nonsteroidal anti-inflammatory drug (NSAID), which means it has anti-inflammatory, analgesic (pain-relieving), and antipyretic (fever-reducing) properties. It's commonly used to treat headaches, muscle aches, arthritis, and other inflammatory conditions.\n\nThe molecular formula of ibuprofen is C₁₄H₁₈O₃. It has a molecular weight of approximately 206.08 g/mol.\n\nKey properties of ibuprofen include:\n\n- **Analgesic:** Relieves pain.\n- **Anti-inflammatory:** Reduces inflammation.\n- **Antipyretic:** Lowers body temperature.\n- **Anti-platelet:** Prevents blood platelets from clumping, which is important for reducing the risk of blood clots.\n- **Lipid-lowering:** May reduce levels of certain types of lipoproteins in the blood, which is associated with heart health.\n- **Prostaglandin inhibitor:** It inhibits the production of prostaglandins, which play a role in inflammation and pain.\n\nIbuprofen is a wh

In [None]:
'''# install all proper packages
!pip install -qU einops==0.6.1 bitsandbytes==0.41.1 scikit-learn==1.2.2 seaborn==0.12.2 matplotlib==3.7.1 jedi==0.19.0
!pip install -qU ctransformers==0.2.23 Cython==0.29.36
!pip install -qU transformers==4.28.0 sentence-transformers==2.2.2 safetensors==0.3.2 xformers==0.0.21
!pip install -qU pydantic==2.2.0 openai==0.27.9 tiktoken==0.4.0 langchain==0.0.271 huggingface-hub==0.16.4
!pip install -qU auto-gptq==0.3.0 jax==0.4.23
#!pip install -qU git+https://github.com/huggingface/transformers'''


In [None]:
'''if device == "cuda":
    !pip install -q auto_gptq
    !pip install -q --upgrade jax jaxlib -f https://storage.googleapis.com/jax-releases/jax_releases.html'''


### LLAMA

In [None]:
RUN_LLAMA_GPU = False

if RUN_LLAMA_GPU:
    # Import specific modules for GPU execution
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    # Define the model name or path and the base model name
    model_name_or_path = "TheBloke/llama-2-13B-Guanaco-QLoRA-GPTQ"
    model_basename = "model"

    # Configure whether to use Triton
    use_triton = False

    # Initialize the tokenizer with the specified model name or path
    l_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    # Initialize the AutoGPTQForCausalLM model with appropriate parameters
    l_model = AutoGPTQForCausalLM.from_quantized(
        model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device_map="auto",
        use_triton=use_triton,
        quantize_config=None,
    )

In [None]:
if RUN_LLAMA_GPU:
    llama_model = Model(l_model, l_tokenizer, device)
    llama_model.generate_model_output(input_text)

## LangChain integration

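This section wraps the locally loaded Hugging Face model in LangChain's `HuggingFacePipeline` and combines it with a `PromptTemplate`.

Hosted models (OpenAI, Anthropic, Gemini, DeepSeek, etc.) can be called through LangChain as well. The disabled cell below shows the DeepSeek variant.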

In [None]:
'''
%pip install -qU langchain-deepseek
from langchain_deepseek import ChatDeepSeek

# ChatDeepSeek reads the API key from the DEEPSEEK_API_KEY environment variable

llm = ChatDeepSeek(
    model="deepseek-chat",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

messages = [
    (
        "system",
        "You are a helpful assistant.",
    ),
    ("human", input_text),
]
ai_msg = llm.invoke(messages)
ai_msg.content
'''


In [None]:
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
hf = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:0


In [None]:
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
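
Before the template is piped into the model, it helps to see the exact string the model will receive. The following is a plain-Python sketch of that substitution, assuming the default placeholder style behaves like `str.format`; the helper name `fill_template` is hypothetical and is not part of LangChain.

```python
# Stand-alone sketch: fill the same template text with str.format.
# `fill_template` is an illustrative helper, not a LangChain function.
template = """Question: {question}

Answer: Let's think step by step."""


def fill_template(template_text: str, **variables: str) -> str:
    """Substitute named placeholders such as {question}."""
    return template_text.format(**variables)


filled = fill_template(template, question="What is the molecule of Ibuprofen?")
print(filled)
```

The printed text is the prompt that the model continues, which is why the generations start mid-thought after "Let's think step by step."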

In [None]:
chain = prompt | hf.bind(skip_prompt=True)
print(chain.invoke({"question": input_text}))

 Ibuprofen is a common painkiller, so it's definitely a organic compound. It's commonly referred to as 2-(4-methylphenyl)propan-1-one. Wait, no, that might not be correct. Alternatively, I remember that the structure involves a benzene ring attached to a propanone group. So the molecule is a phenyl group connected to a ketone.

But wait, I think I'm mixing up something. Let me try to draw it mentally. A benzene ring with a methyl group attached, and then a ketone group connected to the ring. So, it's like a benzene ring substituted with a methyl group and a carbonyl group.

Wait, but I think I might be forgetting something. Oh, right! The carbonyl is connected to a propyl chain? No, wait, it's a propanone, which is an acetyl group. So, the structure is a benzene ring with a methyl group and an acetyl group attached.

Hmm, but I'm getting confused. Let me try to break it down. The base of the molecule is a benzene ring. Attached to this ring is a methyl group (CH3) and an acetyl group (

In [None]:
chain = prompt | hf
print(chain.invoke({"question": input_text}))

Question: What is the molecule of Ibuprofen?

Answer: Let's think step by step. Ibuprofen is a common medication used for pain relief. I know that it's in the class of Non-Steroidal Anti-Inflammatory Drugs (NSAIDs). NSAIDs usually have a structure similar to aspirin but with different substituents. Aspirin is C₆H₅COOH, which is a benzene ring attached to a carboxylic acid group. Ibuprofen, being an NSAID, must have a similar structure but with additional groups that provide the anti-inflammatory properties. 

I recall that ibuprofen has a specific substituent on the benzene ring. If I remember correctly, it's a cyclohexenyl ring. So, instead of a plain benzene ring, ibuprofen has a ring with a double bond, making it cyclohexene. This double bond allows for the necessary structure to interact with the enzymes that cause inflammation.

Additionally, the carboxylic acid group in aspirin is replaced by a ketone group in ibuprofen. This change allows the molecule to have a different mode of
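
The two runs above differ only in `skip_prompt`: without it, the returned text starts with the prompt itself, followed by the completion. A minimal sketch of that stripping step is shown here under the assumption that the generated text begins with the exact prompt string; `strip_prompt` is a hypothetical helper and does not reproduce LangChain's internal code.

```python
def strip_prompt(full_text: str, prompt_text: str) -> str:
    """Drop the echoed prompt from the front of a generation, if present."""
    if full_text.startswith(prompt_text):
        return full_text[len(prompt_text):]
    # Leave the text untouched when the prompt was not echoed.
    return full_text


prompt_text = (
    "Question: What is the molecule of Ibuprofen?\n\n"
    "Answer: Let's think step by step."
)
# Made-up completion, used only to demonstrate the helper.
generated = prompt_text + " Ibuprofen is a common painkiller."

completion = strip_prompt(generated, prompt_text)
print(completion)
```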

In [None]:
for chunk in chain.stream({"question": input_text}):
  print(chunk, end="", flush=True)

 Ibuprofen is a common medication used for pain relief. It belongs to a class of drugs known as nonsteroidal anti-inflammatory drugs (NSAIDs). NSAIDs typically work by inhibiting the activity of cyclooxygenase (COX) enzymes, which are involved in the production of prostaglandins. Prostaglandins play a role in causing inflammation, pain, and fever. By inhibiting COX, NSAIDs reduce these inflammatory reactions.

Ibuprofen is particularly known for its effectiveness in treating headaches, muscle aches, arthritis, and other conditions involving inflammation. The molecule of ibuprofen consists of a central carbon connected to four different groups: a hydroxyl group (-OH), a methyl group (-CH3), and two hydrogen atoms. The central carbon also has a ring structure, which is a cyclohexyl ring. Additionally, there are ketone groups attached to the ring, which contribute to the molecule's pharmacological activity.

In terms of molecular formula, ibuprofen is C13H18O2. This means it contains 13 c

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    batch_size=2,
    temperature=0.001,
    model_kwargs={"max_length": 1000},
)
gpu_llm = HuggingFacePipeline(pipeline=pipe)
gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

molecular_names_list = ["Ibuprofen", "Vitamin C", "Crude Oil", "Air"]
questions = []
for mol in molecular_names_list:
    questions.append({"question": f"What is the molecule of {mol}"})

answers = gpu_chain.batch(questions)
for answer in answers:
    print(answer)

Device set to use cuda:0


Question: What is the molecule of Ibuprofen

Answer: Let's think step by step. Ibuprofen is a common over-the-counter (OTC) medication used to relieve pain, inflammation, and fever. It belongs to a class of drugs known as nonsteroidal anti-inflammatory drugs (NSAIDs). NSAIDs work by inhibiting the activity of cyclooxygenase (COX) enzymes, which are involved in the production of prostaglandins. Prostaglandins play a role in causing inflammation, pain, and fever. By inhibiting COX enzymes, ibuprofen reduces the production of these prostaglandins, thereby alleviating the symptoms associated with inflammation and pain.

Ibuprofen is also known as 2-(4-methylphenyl)propan-1-ol. Its chemical structure consists of a central carbon atom connected to three groups: a hydroxyl group (-OH), a methyl group (-CH3), and a phenyl group (C6H5). The phenyl group is attached to the central carbon through a single bond, and the hydroxyl group is attached to the central carbon as well. The methyl group is 

## CPU-only model

To use Llama models, you need to accept the license agreement on the Hugging Face website: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. The cells below use `Qwen/Qwen3-0.6B`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=32768)[0]
        # Keep only the newly generated tokens (drop the prompt tokens)
        response_ids = output_ids[len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response
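
With thinking mode enabled, each reply starts with a `<think>` block, as the outputs further down show. To display the reasoning and the final answer separately, something like the following could be used. It is a sketch only: `split_thinking` is a hypothetical helper, and it assumes the reasoning is delimited by literal `<think>` and `</think>` tags in the decoded text.

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a reply that may contain <think>...</think>."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = response.partition(close_tag)
    if not sep:
        # No closing tag: the reply is either plain text or was cut off mid-thought.
        if open_tag in response:
            return response.replace(open_tag, "", 1).strip(), ""
        return "", response.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Made-up reply, used only to demonstrate the helper.
sample = "<think>\nLet me recall.\n</think>\n\nIbuprofen is an NSAID."
reasoning, answer = split_thinking(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

A reply that is truncated before `</think>` yields an empty answer, which makes cut-off generations easy to spot.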

In [None]:
chatbot = QwenChatbot()


The cells below might take a couple of minutes to run on a CPU, so give them some time.

In [None]:
input_text = "What is the molecule of Ibuprofen?"

In [None]:
# First input (without /think or /no_think tags, thinking mode is enabled by default)
print(f"User: {input_text}")
%time response_1 = chatbot.generate_response(input_text)
print(f"Bot: {response_1}")

User: What is the molecule of Ibuprofen?
CPU times: user 2min 5s, sys: 330 ms, total: 2min 5s
Wall time: 2min 6s
Bot: <think>
Okay, the user is asking about the molecule of Ibuprofen. Let me start by recalling what Ibuprofen is. It's a common over-the-counter medication, right? I remember it's used for pain relief and inflammation. Now, the question is about its molecular structure.

Ibuprofen is a compound with the molecular formula C18H22O. Wait, is that right? Let me check. Yes, the molecular formula is C18H22O. Now, the structure. Ibuprofen has a cyclohexane ring fused with an ethyl group and a hydroxyl group. The structure is like a six-membered ring with an ethyl group attached to one side and a hydroxyl group on another side. The functional groups are the hydroxyl group and the methyl groups.

I should also mention the functional groups. The hydroxyl group is a key part of the molecule, and the methyl groups are attached to the ring. The structure is important for its biological

In [None]:
input_text = "How many carbon atoms does it contain?"

In [None]:
# Second input (a follow-up question; the stored chat history supplies the context)
print(f"User: {input_text}")
%time response_2 = chatbot.generate_response(input_text)
print(f"Bot: {response_2}")

User: How many carbon atoms does it contain?
CPU times: user 1min 40s, sys: 245 ms, total: 1min 40s
Wall time: 1min 40s
Bot: <think>
Okay, so the user asked, "How many carbon atoms does it contain?" and I need to answer that. Let me think. The previous answer mentioned Ibuprofen's formula is C₁₈H₂₂O. Wait, but the user is asking about the number of carbon atoms, not the formula. Let me check the formula again. The formula is C₁₈H₂₂O, which means 18 carbon atoms. But maybe I should verify that. 

Wait, the formula is C₁₈H₂₂O. Carbon is the first element in the formula, so each H is hydrogen, and O is oxygen. So yes, 18 carbons. But sometimes people might confuse the formula with the molecular formula. Let me confirm once more. The molecular formula of Ibuprofen is C₁₈H₂₂O. Therefore, the number of carbon atoms is 18. 

I should present this clearly. Make sure to state the formula and the number of carbons. The user might be a student or someone learning about molecular structures, so ac
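
Formula-based answers like the one above are easy to check programmatically. The sketch below counts atoms in a formula string and applies it to C13H18O2, the accepted molecular formula of ibuprofen. The helper names are hypothetical, the parser handles only flat formulas without parentheses, and the atomic masses are rounded standard values.

```python
import re

# Rounded standard atomic masses in g/mol (only the elements needed here).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def parse_formula(formula: str) -> dict[str, int]:
    """Count atoms in a flat formula such as 'C13H18O2' (no parentheses)."""
    counts: dict[str, int] = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molar_mass(counts: dict[str, int]) -> float:
    """Sum atomic masses weighted by atom counts."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


ibuprofen = parse_formula("C13H18O2")
print("Carbon atoms:", ibuprofen["C"])
print("Molar mass (g/mol):", round(molar_mass(ibuprofen), 1))
```

Comparing the parsed carbon count with the number stated in a model reply gives a quick signal of whether the small model's answer can be trusted.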

In [None]:
input_text = "Base your calculations on the functional groups"

In [None]:
# Third input (no /think tag; thinking mode stays enabled by default)
print(f"User: {input_text}")
%time response_3 = chatbot.generate_response(input_text)
print(f"Bot: {response_3}")

User: Base your calculations on the functional groups
CPU times: user 16min 35s, sys: 6.9 s, total: 16min 42s
Wall time: 16min 45s
Bot: <think>
Okay, the user is asking about the number of carbon atoms in the molecule of Ibuprofen, and they want me to base the calculation on the functional groups. Let me start by recalling the structure of Ibuprofen. 

First, I remember that Ibuprofen is a compound with a cyclohexane ring. So, the cyclohexane part contributes 6 carbon atoms. Then there's the ethyl group, which is a CH₃CH₂ group. That adds another 2 carbon atoms. Finally, there's the hydroxyl group, which is an OH group. Wait, but the hydroxyl group is actually a single carbon atom connected to oxygen, right? Or does it count as part of the ring?

Wait, the formula is C₁₈H₂₂O. Let me break it down. The formula is derived from the structure. Let me count the carbons. The cyclohexane has 6 carbons, the ethyl group adds 2, and the hydroxyl group is part of the molecule. But does the hydrox