### Open Targets lab meeting 1/10/25

- Walkthrough demoing our new CeLLaTe model, which shows promise for adding insight into the potential of therapeutic targets
- Cell-type specificity can be crucial for a complete understanding of a target's function in a given disease
    - Prominent examples from recent years include:
        - the different roles of APOE in Alzheimer's disease, depending on cellular context (e.g. microglia vs. neurons)
        - PD-1 in CD8+ T cells vs. TREG cells in cancer - https://pmc.ncbi.nlm.nih.gov/articles/PMC10754338/
    - Links to one of our therapeutic hypothesis questions:
        - _"What is the function of target [y] in cell type [d] in disease [x]?"_

- As an Open Targets user, we can search for the target PDCD1 (the gene encoding PD-1) and see high association scores with many types of cancer, including NSCLC.
    - https://partner-platform.opentargets.org/target/ENSG00000188389/associations
- Supporting evidence here currently links co-occurrences of the target _PD-1_ with the disease _NSCLC_, but:
    - We can see that the literature already contains cell-type specific information
    - Our data as it stands doesn't reflect this - potential therapeutic insights are buried here, but aren't specifically captured
- OpenAI even gives us a nod, stating "Specific information about PDCD1 would require additional _context_ / sources"

### Advice when running the notebook

- This notebook runs in two parts, due to an incompatibility between ONNX and TensorFlow
- Variables needed in the second part will be saved, and the venv will need to be switched partway through
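
One way to hand variables across the two parts is to write them to disk before switching venvs and read them back afterwards. The following is a minimal sketch using only the standard library; the file name `notebook_state.json` and the helpers `save_state` / `load_state` are assumptions for illustration, not part of the notebook, and it only covers JSON-serialisable values.

```python
import json
from pathlib import Path

# Hypothetical location for the handoff file; not defined in the original notebook
STATE_PATH = Path("notebook_state.json")

def save_state(state: dict, path: Path = STATE_PATH) -> None:
    """Write JSON-serialisable variables to disk before switching venvs."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")

def load_state(path: Path = STATE_PATH) -> dict:
    """Read the saved variables back in after switching venvs."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A DataFrame such as `evidence_df` would need converting first (for example to a dict of lists) or saving with one of pandas' own writers.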

In [4]:
# Helper - check which venv is in use
import sys
print(sys.executable)  # Shows the Python binary in use

/Users/withers/GitProjects/OTAR3088/Data_mining/otarlab/labenv/bin/python
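
Alongside printing the binary path, it can help to confirm that the interpreter is actually running inside a virtual environment. A small sketch, assuming environments created with the standard `venv` mechanism (where `sys.prefix` differs from `sys.base_prefix`); `in_virtualenv` is a hypothetical helper name.

```python
import sys

def in_virtualenv() -> bool:
    """True when the interpreter prefix differs from the base installation."""
    return sys.prefix != sys.base_prefix

print(sys.executable)
print("Inside a venv:", in_virtualenv())
```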


In [5]:
import requests
import json

variables = {
  "ensemblId": "ENSG00000188389", # PDCD1
  "efoId": "EFO_0003060", # NSCLC
  "size": 100
}

# Build query string to get literature information linking PDCD1 and NSCLC
# TODO - Alter Q to return annotations from ePMC as well?
query_string = """
query EuropePMCQuery(
  $ensemblId: String!
  $efoId: String!
  $size: Int!
  $cursor: String
) {
  disease(efoId: $efoId) {
    id
    europePmc: evidences(
      ensemblIds: [$ensemblId]
      enableIndirect: true
      size: $size
      datasourceIds: ["europepmc"]
      cursor: $cursor
    ) {
      count
      cursor
      rows {
        disease {
          name
          id
        }
        target {
          approvedSymbol
          id
        }
        literature
        textMiningSentences {
          tStart
          tEnd
          dStart
          dEnd
          section
          text
        }
        resourceScore
      }
    }
  }
}
"""

# GraphQL API request
base_url = "https://api.platform.opentargets.org/api/v4/graphql"

r = requests.post(base_url, json={"query": query_string, "variables": variables}, timeout=30)
r.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body
print("Query ran successfully")
resp = r.json()

Query ran successfully
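
The query accepts a `$cursor` variable and returns a `cursor` field, so results beyond the first `size` rows can in principle be fetched page by page. The sketch below shows one way to follow the cursor. It assumes a null cursor signals the last page, and `fetch_page` is a hypothetical callable that would wrap the `requests.post` call and return the `europePmc` object of the response.

```python
from typing import Callable, Optional

def collect_rows(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 10) -> list:
    """Follow the `cursor` field until it is empty or max_pages is reached.

    fetch_page(cursor) is assumed to return a dict with `rows` and `cursor` keys.
    """
    rows, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        rows.extend(page.get("rows", []))
        cursor = page.get("cursor")
        if not cursor:  # Assumption: a null cursor means there are no more pages
            break
    return rows
```

The `max_pages` cap is there so a demo run cannot loop indefinitely if the cursor never empties.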


In [6]:
import pandas as pd
from pprint import pprint

data = resp["data"]["disease"]["europePmc"]["rows"]
literature_evidence = {}
for paper in data:
    pprint(paper)
    pmcid = paper["literature"][0]
    target = paper["target"]["approvedSymbol"]
    disease = paper["disease"]["name"]
    text = paper["textMiningSentences"]
    text = [t["text"] for t in text]
    literature_evidence[pmcid] = [target, disease, text]

evidence_df = pd.DataFrame.from_dict(literature_evidence, orient="index", columns=["Target", "Disease", "Text"])

{'disease': {'id': 'EFO_0000571', 'name': 'lung adenocarcinoma'},
 'literature': ['32154170'],
 'resourceScore': 106,
 'target': {'approvedSymbol': 'PDCD1', 'id': 'ENSG00000188389'},
 'textMiningSentences': [{'dEnd': 289,
                          'dStart': 285,
                          'section': 'results',
                          'tEnd': 57,
                          'tStart': 53,
                          'text': 'Given that Science-LUAD cohort was only '
                                  'treated with PD-1 blockade and performed '
                                  'better than Cancer Cell-LUAD cohort on '
                                  'predicting ICB efficacy (Figure 2B; AUC = '
                                  '0.82 and 0.80, respectively), we stratified '
                                  'the Discovery-LUAD cohort into two groups '
                                  'based on the TMB cutoff from Science-LUAD '
                                  'cohort (Figure 2C; TMB = 16
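
Because `literature_evidence` is keyed by the first literature ID, two evidence rows citing the same paper would overwrite one another. If that matters, one option is to accumulate sentences per ID instead. A sketch, assuming rows shaped like the output above; `group_by_paper` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def group_by_paper(rows: list) -> dict:
    """Collect text-mined sentences per literature ID without overwriting."""
    grouped = defaultdict(list)
    for row in rows:
        paper_id = row["literature"][0]
        grouped[paper_id].extend(s["text"] for s in row["textMiningSentences"])
    return dict(grouped)
```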

In [7]:
evidence_df.head()

| | Target | Disease | Text |
|---|---|---|---|
| 32154170 | PDCD1 | lung adenocarcinoma | [Given that Science-LUAD cohort was only treat... |
| 38299030 | PDCD1 | non-small cell lung carcinoma | [In the current study, we tested the feasibili... |
| 34194433 | PDCD1 | non-small cell lung carcinoma | [Thus, PD-1 was highly expressed on ILC2s obta... |
| 30325558 | PDCD1 | non-small cell lung carcinoma | [However, we further compared PD‐1 expression ... |
| 27191652 | PDCD1 | non-small cell lung carcinoma | [We report that surface expression of PD-1 on ... |


In [8]:
pprint(evidence_df["Text"].iloc[0])

['Given that Science-LUAD cohort was only treated with PD-1 blockade and '
 'performed better than Cancer Cell-LUAD cohort on predicting ICB efficacy '
 '(Figure 2B; AUC = 0.82 and 0.80, respectively), we stratified the '
 'Discovery-LUAD cohort into two groups based on the TMB cutoff from '
 'Science-LUAD cohort (Figure 2C; TMB = 166.5).',
 'Immune checkpoint blockade (ICB) therapies that target programmed cell death '
 '1 (PD1) and PD1 ligand 1 (PDL1) have demonstrated promising benefits in lung '
 'adenocarcinoma (LUAD), and tumor mutational burden (TMB) is the most robust '
 'biomarker associated with the efficacy of PD-1-PD-L1 axis blockade in LUAD, '
 'but the assessment of TMB by whole-exome sequencing (WES) is rather '
 'expensive and time-consuming.',
 'We retrieved many previous studies and cancer databases, only collecting '
 'four high-quality LUAD datasets that contained both clinical and genomic '
 'information: 29 LUAD patients treated with anti-PD-1 therapy (Science-LUA

In [6]:
"""
To install on an M3 Mac, Xcode / Metal are required for a successful install. Execute:

 xcode-select --install
 python -m uv pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 python -m uv pip install --upgrade transformers tokenizers accelerate

 & Restart kernel, checking that install was successful below
"""

import torch
print(torch.__version__)
print(torch.backends.mps.is_available())  # Should print True if MPS is supported
print(torch.backends.mps.is_built())      # Should print True

2.8.0
True
True


In [7]:
# Check torch is installed 
!pip list | grep torch

torch                   2.8.0
torchaudio              2.8.0
torchvision             0.23.0


### CeLLaTe model

In [36]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "OTAR3088/bioformer-CeLLaTe_V1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Quick check pipeline is behaving
print("Pipeline loaded")
text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)

Device set to use mps:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Pipeline loaded
[{'entity_group': 'CellLine', 'score': np.float32(0.72429657), 'word': 'HeLa cell line', 'start': 4, 'end': 18}]
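
Each hit carries a confidence `score`, so low-confidence entities can be dropped before further processing. A minimal sketch; the helper `filter_by_score` and the 0.7 threshold are assumptions for illustration, not values used by the notebook.

```python
def filter_by_score(entities: list, min_score: float = 0.7) -> list:
    """Keep only entity dicts whose `score` is at least min_score."""
    return [ent for ent in entities if float(ent["score"]) >= min_score]

# Shaped like the pipeline output above (score rounded for readability)
example = [
    {"entity_group": "CellLine", "score": 0.724, "word": "HeLa cell line", "start": 4, "end": 18}
]
print(filter_by_score(example, min_score=0.7))
```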


In [35]:
# Reference for data formatting
# from spacy import displacy

# # Pre-annotated text and entities
# text = "When Sebastian Thrun started working on self-driving cars at Google in 2007."
# entities = [
#     {"start": 5, "end": 20, "label": "PERSON"},
#     {"start": 45, "end": 51, "label": "ORG"},
#     {"start": 55, "end": 59, "label": "DATE"}
# ]

# # Build the data dict
# doc_data = {
#     "text": text,
#     "ents": entities,
#     "title": None
# }

# # Render in Jupyter notebook
# displacy.render(doc_data, style="ent", manual=True, jupyter=True)


In [9]:
from tqdm.notebook import tqdm
hit_column = []

texts = evidence_df["Text"].to_list()

for sentences in tqdm(texts, desc="Processing in batches"):
    # Each item is the list of sentences for one paper; the pipeline
    # returns one list of entity dicts per sentence
    res = nlp_pipeline(sentences)
    ent_terms = [[hit["word"] for hit in ent] for ent in res]
    # ent_terms = [x for x in ent_terms if x != []] # Remove sentences w.o. hits
    hit_column.append(ent_terms)

evidence_df["CellateHits"] = hit_column

Processing in batches:   0%|          | 0/100 [00:00<?, ?it/s]
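
The `CellateHits` column holds, for each paper, one list of recognised terms per sentence. To get a quick view of which terms dominate a paper, the nested lists can be flattened and counted. A sketch; `count_terms` is a hypothetical helper, and lowercasing to group case variants is an assumption.

```python
from collections import Counter

def count_terms(hits_per_sentence: list) -> Counter:
    """Count how often each recognised term appears across a paper's sentences."""
    return Counter(
        term.lower()
        for sentence_hits in hits_per_sentence
        for term in sentence_hits
    )
```

For example, `count_terms(evidence_df["CellateHits"].iloc[0])` would summarise the first paper.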

In [33]:
from tqdm.notebook import tqdm

def merge_adjacent_ents(ents):
    if not ents:
        return []
    ents = sorted(ents, key=lambda x: x["start"])
    merged = [ents[0]]
    for next_ent in ents[1:]:
        cur = merged[-1]
        # Merge if entities are adjacent or overlapping and have the same label
        if cur["end"] >= next_ent["start"] and cur["label"] == next_ent["label"]:
            cur["end"] = max(cur["end"], next_ent["end"])
        else:
            merged.append(next_ent)
    return merged

hit_column = []
# Demo on a single paper: the sentences of the evidence row at position 7
texts = evidence_df["Text"].to_list()[7]
all_ent = []

for text in tqdm(texts, desc="Processing in batches"):
    entities = []
    res = nlp_pipeline(text)
    for hit in res:
        disp = {"start": hit["start"], "end": hit["end"], "label": hit["entity_group"]}
        entities.append(disp)
    # Merge adjacent or overlapping entities before appending
    merged_entities = merge_adjacent_ents(entities)
    all_ent.append(merged_entities)


Processing in batches:   0%|          | 0/23 [00:00<?, ?it/s]
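
Note that `merge_adjacent_ents` only merges spans that touch or overlap (`cur["end"] >= next_ent["start"]`), so two same-label spans separated by a single space stay separate. The sketch below is a variant with a hypothetical `max_gap` parameter, not the notebook's function, to illustrate the difference. It also copies each dict rather than mutating the caller's input.

```python
def merge_with_gap(ents: list, max_gap: int = 0) -> list:
    """Merge same-label spans whose gap is at most max_gap characters."""
    merged = []
    for ent in sorted(ents, key=lambda e: e["start"]):
        if (
            merged
            and ent["label"] == merged[-1]["label"]
            and ent["start"] - merged[-1]["end"] <= max_gap
        ):
            merged[-1]["end"] = max(merged[-1]["end"], ent["end"])
        else:
            merged.append(dict(ent))  # copy so the caller's dicts are untouched
    return merged

spans = [
    {"start": 0, "end": 3, "label": "CellType"},
    {"start": 4, "end": 9, "label": "CellType"},  # one space after the first span
]
print(merge_with_gap(spans))             # max_gap=0 matches the notebook's condition
print(merge_with_gap(spans, max_gap=1))  # bridges the single-space gap
```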

In [34]:
from spacy import displacy
# Trying displacy
colors = {
    "CellLine": "#fde68a",
    "CellType": "#6ee7b7",
    "Tissue": "#a7c7e7"
}
options = {"colors": colors}

for e, t in zip(all_ent, texts):
    # Build the data dict
    doc_data = {
        "text": t,
        "ents": e,
        "title": None
    }

    # Render in Jupyter notebook
    displacy.render(doc_data, style="ent", manual=True, jupyter=True, options=options)

### Cellfinder model

In [38]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "OTAR3088/bioformer-cellfinder_V1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Quick check pipeline is behaving
print("Pipeline loaded")
text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)

Device set to use mps:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Pipeline loaded
[{'entity_group': 'CellLine', 'score': np.float32(0.59069914), 'word': 'HeLa', 'start': 4, 'end': 8}]


In [39]:
from tqdm.notebook import tqdm

def merge_adjacent_ents(ents):
    if not ents:
        return []
    ents = sorted(ents, key=lambda x: x["start"])
    merged = [ents[0]]
    for next_ent in ents[1:]:
        cur = merged[-1]
        # Merge if entities are adjacent or overlapping and have the same label
        if cur["end"] >= next_ent["start"] and cur["label"] == next_ent["label"]:
            cur["end"] = max(cur["end"], next_ent["end"])
        else:
            merged.append(next_ent)
    return merged

hit_column = []
# Demo on a single paper: the sentences of the evidence row at position 7
texts = evidence_df["Text"].to_list()[7]
all_ent = []

for text in tqdm(texts, desc="Processing in batches"):
    entities = []
    res = nlp_pipeline(text)
    for hit in res:
        disp = {"start": hit["start"], "end": hit["end"], "label": hit["entity_group"]}
        entities.append(disp)
    # Merge adjacent or overlapping entities before appending
    merged_entities = merge_adjacent_ents(entities)
    all_ent.append(merged_entities)

Processing in batches:   0%|          | 0/23 [00:00<?, ?it/s]

In [40]:
from spacy import displacy
# Trying displacy
colors = {
    "CellLine": "#fde68a",
    "CellType": "#6ee7b7",
    "Tissue": "#a7c7e7"
}
options = {"colors": colors}

for e, t in zip(all_ent, texts):
    # Build the data dict
    doc_data = {
        "text": t,
        "ents": e,
        "title": None
    }

    # Render in Jupyter notebook
    displacy.render(doc_data, style="ent", manual=True, jupyter=True, options=options)

In [37]:
# from typing import List
# def match_hits_text(text: List, hits: List) -> List:
#     return [(t, h) for t, h in list(zip(text, hits))]

# testing = match_hits_text(evidence_df["Text"].to_list()[4], evidence_df["CellateHits"].to_list()[4])
# # testing


# df_matched = evidence_df.apply(lambda row: match_hits_text(row["Text"], row["CellateHits"]), axis=1)
# df_matched[3]

### OTAR NER model - to be fixed

In [42]:
# from huggingface_hub import login
# login(token="my token")

# # Use a pipeline as a high-level helper
# from transformers import pipeline

# pipe = pipeline("token-classification", model="OTAR3088/otar-ner-model")

In [13]:
# Highlight annotations in-text
# Grab annotations from ePMC - either via graphQL or ePMC API - unsure why they aren't returned as default in graphQL?

model_name = "OTAR3088/otar-ner-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer)
model = AutoModelForTokenClassification.from_pretrained(model_name)
print(model)

nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Quick check pipeline is behaving
print("Pipeline loaded")
text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)

BertTokenizerFast(name_or_path='OTAR3088/otar-ner-model', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


OSError: OTAR3088/otar-ner-model does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.
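
The error suggests the repository has tokenizer files but no weight file under any of the names listed. One way to confirm this before loading is to inspect the repository's file listing. A sketch, with the weight file names taken from the error message; `has_model_weights` is a hypothetical helper, and sharded weight files with other names would not be detected.

```python
# File names taken from the OSError message above
WEIGHT_FILES = {
    "pytorch_model.bin",
    "model.safetensors",
    "tf_model.h5",
    "model.ckpt",
    "flax_model.msgpack",
}

def has_model_weights(repo_files: list) -> bool:
    """True if any known weight file name appears in the repository listing."""
    return any(name.split("/")[-1] in WEIGHT_FILES for name in repo_files)

# The listing could come from huggingface_hub, e.g.:
#   from huggingface_hub import list_repo_files
#   repo_files = list_repo_files("OTAR3088/otar-ner-model")
repo_files = ["config.json", "tokenizer.json", "vocab.txt"]  # hypothetical listing
print(has_model_weights(repo_files))
```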