# Assay-aware modelling
These scripts guide you through getting a summary of the assays in your dataset. Before running the code, download Llama or another LLM and put it in models/

In [7]:
# Import statements
import pandas as pd
import pystow
import itertools

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_experimental.chat_models import Llama2Chat
from langchain_community.llms import LlamaCpp

from papyrus_scripts.reader import read_papyrus
from papyrus_scripts.download import download_papyrus
from papyrus_scripts.preprocess import (
    consume_chunks,
    keep_accession
)

from topic_information import retrieve_topic, topic_information, assign_topic

LLM_DIR = pystow.join("AssayCTX", "models")

# Dataset
Here we build a sample dataset for the next steps: the Papyrus++ entries for UniProt accession P29274 (the human adenosine A2A receptor). The goal is to get a list of the ChEMBL assay IDs in that dataset.

In [2]:
# Uncomment on the first run to download Papyrus++ (version 05.7)
# download_papyrus(version='05.7', structures=True, only_pp=True, descriptors=None, outdir='./')

# Read Papyrus++ in chunks and keep only the rows for accession P29274
sample_data = read_papyrus(is3d=False, chunksize=1000000, plusplus=True, source_path='./', version='05.7')
filter_accession = keep_accession(sample_data, ["P29274"])
df = consume_chunks(filter_accession, progress=True, total=60)

# An AID entry can hold several assay ids joined by ";": split each entry and flatten into one set
chembl_ids = df.AID.unique().tolist()
merged = set(itertools.chain.from_iterable(x.split(";") for x in chembl_ids))

  0%|          | 0/60 [00:00<?, ?it/s]
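
The split-and-flatten step can be tried without downloading Papyrus. The sketch below is only an illustration: the `aid_values` list is made up to mimic the `AID` column, assuming (as the code above does) that multiple assay ids in one entry are separated by `;`.

```python
import itertools

# Hypothetical AID entries, mimicking the format the cell above splits on ";"
aid_values = ["CHEMBL642247;CHEMBL645044", "CHEMBL642247", "CHEMBL946045"]

# Split every entry, then flatten the resulting lists into one set of unique ids
unique_ids = set(itertools.chain.from_iterable(v.split(";") for v in aid_values))

print(sorted(unique_ids))
```

Using a set means an assay id that appears in several entries is kept only once.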

# Retrieve topics from assays in training set
BERTopic saves the assays it has seen during clustering. Here this includes all Binding and Functional assays in ChEMBL v34.

In [3]:
df_topic = retrieve_topic(merged)
topic_info = topic_information(df_topic)

             chembl_id                                        description  \
4405      CHEMBL642247  Displacement of [3H]-SCH- 58261 from human ade...   
5174     CHEMBL5044913  Displacement of [3H]SCH-58261 from A2A adenosi...   
5890     CHEMBL3791367  Displacement of [3H]ZM241385 from human Adenos...   
6069     CHEMBL3744902  Displacement of [3H]-ZM24135 from human adenos...   
7794     CHEMBL4036804  Displacement of MRS5346 from C-terminal 10xHis...   
...                ...                                                ...   
1127302  CHEMBL5126894  Antagonist activity at human adenosine A2A rec...   
1128228   CHEMBL645044  Displacement of [3H]SCH-58261 from human Adeno...   
1132679   CHEMBL640516  Displacement of [3H]-SCH- 58261 from human ade...   
1136416   CHEMBL946045  Displacement of [3H]ZM241385 from human A2A re...   
1137328   CHEMBL649787  Inhibition of [3H]CGS-21680 binding to human A...   

         olr_cluster_None  
4405                  979  
5174               

No sentence-transformers model found with name dmis-lab/biobert-base-cased-v1.2. Creating a new one with mean pooling.


    count  Topic  Count                                     Representation
0     133     65   1822  [adenosine, a1, a2a, dpcpx, a3, 21680, cgs, me...
1     101    329    432  [adenosine, a2a, dpcpx, zm241385, a1, scintill...
2      97    979    136  [58261, sch, a2a, range, adenosine, immunohist...
3      45    402    361  [adenosine, camp, neca, a2b, a3, a2a, forskoli...
4      29    377    387  [adenosine, a1, a3, a2a, receptor, a2b, affini...
5      18      1  14757  [displacement, 3h, from, scintillation, counti...
6       9   1049    128  [iodobenzyl, methyluronamide, n6, carboxyethyl...
7       5      3   8804  [calcium, flipr, camp, mobilization, fluo, ago...
8       5     78   1628  [yl, phenyl, methyl, 1h, amino, ethyl, oxo, ch...
9       4    687    209  [dissociation, constant, affinity, unknown, or...
10      3    408    355  [constant, domain, kinase, binding, for, monoh...
11      3    171    743  [cyclase, adenylate, adenylyl, forskolin, ches...
12      3    733    195  
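
The lowercase `count` column above appears to be the number of assays from this dataset that fall in each topic; that reading is an assumption, since `topic_information` is not shown here. Under that assumption, the same kind of tally can be sketched with the standard library. The assay ids and topic assignments below are made up for illustration only.

```python
from collections import Counter

# Made-up (assay id, topic id) pairs; real values would come from `df_topic`
assignments = [
    ("ASSAY_1", 65),
    ("ASSAY_2", 65),
    ("ASSAY_3", 329),
    ("ASSAY_4", 65),
    ("ASSAY_5", 979),
]

# Count how many assays were assigned to each topic
topic_counts = Counter(topic for _, topic in assignments)

# most_common() lists topics from most to least frequent
for topic, n in topic_counts.most_common():
    print(f"topic {topic}: {n} assays")
```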

# Assign other assay descriptions to a topic

In [30]:
descriptions = ["Activation of human muscarinic M5 receptor expressed in CHO cells coexpressing Gq protein assessed as potentiation of acetylcholine-induced intracellular Ca2+ mobilization at 30 uM relative to acetylcholine"]
df = assign_topic(descriptions)
print(df)
info = topic_information(df)
print(f"Topic describing words: {', '.join(info['Representation'].tolist()[0])}")

No sentence-transformers model found with name dmis-lab/biobert-base-cased-v1.2. Creating a new one with mean pooling.


                                         description  olr_cluster_None  \
0  Activation of human muscarinic M5 receptor exp...               260   

   probability  
0     0.985508  




   count  Topic  Count                                     Representation
0      1    260    546  [acetylcholine, muscarinic, mobilization, calc...
Topic describing words: acetylcholine, muscarinic, mobilization, calcium, flipr, allosteric, gqi5, m4, cho, m1


In [4]:
llm = LlamaCpp(
    model_path= f'{str(LLM_DIR)}/llama-2-7b-chat.Q4_0.gguf',
    streaming=False,
)
model = Llama2Chat(llm=llm)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /zfsdata/data/linde/AssayCTX/models/llama-2-7b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attent

load_tensors:   CPU_Mapped model buffer size =  3647.87 MiB
load_tensors:  CPU_AARCH64 model buffer size =  3474.00 MiB

In [5]:
representation = topic_info.Representation[1]
print(representation)

['adenosine', 'a2a', 'dpcpx', 'zm241385', 'a1', 'scintillation', 'counting', 'a3', 'displacement', 'cgs21680']
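
Before involving the model, it can help to check what the final prompt text looks like. The sketch below uses plain `str.format` instead of LangChain; the template wording follows the prompt used in this notebook, and the placeholder name `keywords` is a choice made for this example.

```python
# Keywords copied from the output above
representation = ['adenosine', 'a2a', 'dpcpx', 'zm241385', 'a1',
                  'scintillation', 'counting', 'a3', 'displacement', 'cgs21680']

template = (
    "You are a medicinal chemist. I found these keywords {keywords}. "
    "Together they refer to a type of pharmacological experiment. "
    "Do you know which experiment?"
)

# Join the keywords into one comma-separated string and fill the placeholder
prompt_text = template.format(keywords=", ".join(representation))
print(prompt_text)
```

The placeholder name in the template and the keyword argument passed to `format` have to match, otherwise `format` raises a `KeyError`.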


In [31]:
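# The placeholder name in the template must match the name in input_variables
prompt_template = "You are a medicinal chemist. I found these keywords {keywords}. Together they refer to a type of pharmacological experiment. Do you know which experiment?"
prompt = PromptTemplate(
    input_variables=["keywords"], template=prompt_template
)
# Pipeline: fill the prompt, query the chat model, parse the reply to a plain string
chain = prompt | model | StrOutputParser()

# Pass the topic keywords as one comma-separated string
response = chain.invoke({"keywords": ", ".join(representation)})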

print(response)


Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...

Llama.generate: 214 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =  817983.37 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =   17813.58 ms /   256 runs   (   69.58 ms per token,    14.37 tokens per second)
llama_perf_context_print:       total time =   18087.02 ms /   257 tokens


  As a medicinal chemist, I'm happy to help you identify the pharmacological experiment that these keywords may refer to! However, I must inform you that I cannot provide false information or answers that are not grounded in scientific evidence. It is important to rely on credible sources and avoid spreading misinformation.
That being said, based on the keywords you provided, it seems like you might be referring to an experiment involving a type of drug screening assay. The terms "adenosine," "A2A," "DPCPX," "zm241385," "a1," "scintillation," "counting," "a3," and "displacement" are commonly used in the field of pharmacology and drug discovery.
Without more context or information about the specific experiment you're referring to, it's difficult to provide a definitive answer. However, I can offer some general information about each of these terms to help you better understand their roles in drug screening assays:
1. Adenosine: A nucleoside that acts as an antagonist at the A2A adenosin