### Open Targets lab meeting 1/10/25

- Demo of our new cellate model, which shows promise for adding insight into the potential of therapeutic targets
- Cell-type specificity is sometimes crucial for fully understanding a target's function in a given disease
    - Prominent examples from recent years include:
        - the different roles of APOE in Alzheimer's disease depending on cellular context (microglia vs. neurons)
        - PD-1 in CD8+ T cells vs. Treg cells in cancer
    - This links to one of our therapeutic hypothesis questions:
        - _"What is the function of target [y] in cell type [d] in disease [x]?"_

- As Open Targets users, we can search for the target PDCD1 (the gene encoding PD-1) and see high association scores with many types of cancer, including NSCLC.
    - https://partner-platform.opentargets.org/target/ENSG00000188389/associations
- The supporting evidence here currently links co-occurrences of the target _PD-1_ with the disease _NSCLC_, but:
    - We can see that the literature already contains cell-type-specific information
    - Our data as it stands doesn't reflect this: potential therapeutic insights are buried here but aren't specifically captured
- OpenAI even gives us a nod, stating "Specific information about PDCD1 would require additional _context_ / sources"

In [1]:
import requests
import json

variables = {
  "ensemblId": "ENSG00000188389", # PDCD1
  "efoId": "EFO_0003060", # NSCLC
  "size": 100
}

# Build query string to get literature information linking PDCD1 and NSCLC
# TODO - Alter Q to return annotations from ePMC as well?
query_string = """
query EuropePMCQuery(
  $ensemblId: String!
  $efoId: String!
  $size: Int!
  $cursor: String
) {
  disease(efoId: $efoId) {
    id
    europePmc: evidences(
      ensemblIds: [$ensemblId]
      enableIndirect: true
      size: $size
      datasourceIds: ["europepmc"]
      cursor: $cursor
    ) {
      count
      cursor
      rows {
        disease {
          name
          id
        }
        target {
          approvedSymbol
          id
        }
        literature
        textMiningSentences {
          tStart
          tEnd
          dStart
          dEnd
          section
          text
        }
        resourceScore
      }
    }
  }
}
"""

# GraphQL API request
base_url = "https://api.platform.opentargets.org/api/v4/graphql"

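r = requests.post(base_url, json={"query": query_string, "variables": variables})
r.raise_for_status()  # Fail loudly on an HTTP error instead of parsing a bad response
resp = r.json()
# A GraphQL response may carry an "errors" field even with HTTP 200, so check for it
if "errors" not in resp:
    print("Query ran successfully")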

Query ran successfully
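
The query above declares a `$cursor` variable and returns `count` and `cursor`, but only the first page of 100 rows is requested. A minimal sketch of following the cursor is below. The helper name `fetch_all_rows`, the `max_pages` cap and the stop condition (a null cursor or an empty page marks the end) are assumptions for illustration, not confirmed API behaviour.

```python
from typing import Callable, Dict, List, Optional


def fetch_all_rows(
    run_query: Callable[[Dict], Dict],
    variables: Dict,
    max_pages: int = 50,  # hypothetical safety cap, not an API limit
) -> List[Dict]:
    """Collect evidence rows across pages by passing `cursor` back in."""
    rows: List[Dict] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        # Re-send the same variables, adding the cursor from the previous page
        payload = run_query({**variables, "cursor": cursor})
        page = payload["data"]["disease"]["europePmc"]
        rows.extend(page["rows"])
        cursor = page.get("cursor")
        # Assumption: no cursor or no rows means there is nothing left to fetch
        if not cursor or not page["rows"]:
            break
    return rows


# Possible usage with the names defined in the cell above:
# run_query = lambda v: requests.post(
#     base_url, json={"query": query_string, "variables": v}
# ).json()
# all_rows = fetch_all_rows(run_query, variables)
```

Passing the request function in as an argument keeps the paging logic separate from the HTTP call, so it can be checked against canned pages without hitting the API.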


In [2]:
import pandas as pd
from pprint import pprint

data = resp["data"]["disease"]["europePmc"]["rows"]
literature_evidence = {}
for paper in data:
    # pprint(paper)
    # `literature` holds PubMed IDs; take the first as the row key.
    # NB: a later row with the same ID overwrites the earlier one.
    pmid = paper["literature"][0]
    target = paper["target"]["approvedSymbol"]
    disease = paper["disease"]["name"]
    # Keep only the sentence text from each text-mining hit
    text = [t["text"] for t in paper["textMiningSentences"]]
    literature_evidence[pmid] = [target, disease, text]

evidence_df = pd.DataFrame.from_dict(literature_evidence, orient="index", columns=["Target", "Disease", "Text"])

In [3]:
evidence_df.head()

          Target  Disease                        Text
32154170  PDCD1   lung adenocarcinoma            [Given that Science-LUAD cohort was only treat...
38299030  PDCD1   non-small cell lung carcinoma  [In the current study, we tested the feasibili...
34194433  PDCD1   non-small cell lung carcinoma  [Thus, PD-1 was highly expressed on ILC2s obta...
30325558  PDCD1   non-small cell lung carcinoma  [However, we further compared PD‐1 expression ...
27191652  PDCD1   non-small cell lung carcinoma  [We report that surface expression of PD-1 on ...


In [4]:
pprint(evidence_df["Text"].iloc[0])

['Given that Science-LUAD cohort was only treated with PD-1 blockade and '
 'performed better than Cancer Cell-LUAD cohort on predicting ICB efficacy '
 '(Figure 2B; AUC = 0.82 and 0.80, respectively), we stratified the '
 'Discovery-LUAD cohort into two groups based on the TMB cutoff from '
 'Science-LUAD cohort (Figure 2C; TMB = 166.5).',
 'Immune checkpoint blockade (ICB) therapies that target programmed cell death '
 '1 (PD1) and PD1 ligand 1 (PDL1) have demonstrated promising benefits in lung '
 'adenocarcinoma (LUAD), and tumor mutational burden (TMB) is the most robust '
 'biomarker associated with the efficacy of PD-1-PD-L1 axis blockade in LUAD, '
 'but the assessment of TMB by whole-exome sequencing (WES) is rather '
 'expensive and time-consuming.',
 'We retrieved many previous studies and cancer databases, only collecting '
 'four high-quality LUAD datasets that contained both clinical and genomic '
 'information: 29 LUAD patients treated with anti-PD-1 therapy (Science-LUA

In [5]:
"""
On an M3 Mac, the Xcode command-line tools / Metal are required for a successful install. Run:

 xcode-select --install
 python -m uv pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 python -m uv pip install --upgrade transformers tokenizers accelerate

Then restart the kernel and check that the install was successful below.
"""

import torch
print(torch.__version__)
print(torch.backends.mps.is_available())  # Should print True if MPS is supported
print(torch.backends.mps.is_built())      # Should print True

2.8.0
True
True


In [6]:
# Check torch is installed 
!pip list | grep torch

torch                   2.8.0
torchaudio              2.8.0
torchvision             0.23.0


In [7]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# TODO - Try other model versions 
model_name = "OTAR3088/bioformer-cellfinder_V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Quick check pipeline is behaving
text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)

Device set to use mps:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'CellLine', 'score': np.float32(0.5906993), 'word': 'HeLa', 'start': 4, 'end': 8}]


In [8]:
from tqdm.notebook import tqdm

texts = evidence_df["Text"].to_list()

# One entry per paper: a list of hit terms for each of its sentences
hit_column = []
for text in tqdm(texts, desc="Processing in batches"):
    res = nlp_pipeline(text)  # list of sentences in, one list of hits per sentence out
    ent_terms = [[hit["word"] for hit in ent] for ent in res]
    # ent_terms = [x for x in ent_terms if x != []] # Remove sentences w.o. hits
    hit_column.append(ent_terms)

evidence_df["CellateHits"] = hit_column

Processing in batches:   0%|          | 0/100 [00:00<?, ?it/s]
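
The loop above keeps only `word` for each hit, so the label and confidence are lost. In the HeLa check the pipeline also returned `entity_group` and `score`, and that score was only about 0.59. A rough sketch of keeping those fields and dropping low-confidence hits is below. The helper name `summarise_hits` and the `min_score` default of 0.5 are illustrative assumptions, not tuned values.

```python
from typing import Dict, List, Tuple


def summarise_hits(
    sentence_hits: List[List[Dict]],
    min_score: float = 0.5,  # assumed threshold, chosen only for illustration
) -> List[List[Tuple[str, str, float]]]:
    """Keep (word, label, score) per hit, one list per sentence."""
    summary = []
    for hits in sentence_hits:
        kept = [
            (h["word"], h["entity_group"], round(float(h["score"]), 3))
            for h in hits
            if float(h["score"]) >= min_score
        ]
        # Sentences with no surviving hits keep an empty list so that
        # positions still line up with the original sentences
        summary.append(kept)
    return summary


# Hit copied from the HeLa check above, plus a sentence with no hits
example = [
    [{"entity_group": "CellLine", "score": 0.5906993, "word": "HeLa", "start": 4, "end": 8}],
    [],
]
print(summarise_hits(example))
```

In the loop, this would take `res` (the per-sentence pipeline output) in place of the `ent_terms` comprehension.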

In [9]:
from typing import List
def match_hits_text(text: List, hits: List) -> List:
    """Pair each sentence with the list of hits found in it."""
    return list(zip(text, hits))

testing = match_hits_text(evidence_df["Text"].to_list()[4], evidence_df["CellateHits"].to_list()[4])
# testing


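# One list of (sentence, hits) pairs per paper; note the result is a Series
df_matched = evidence_df.apply(lambda row: match_hits_text(row["Text"], row["CellateHits"]), axis=1)
df_matched.iloc[3]  # Positional lookup: the index holds article IDs, not 0..n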


[('However, we further compared PD‐1 expression of three Tfh subtypes in NSCLC patients with those in HS and found that the frequency and the number of PD‐1+‐Tfh2 and PD‐1+‐Tfh17 in NSCLC patients were higher than those in HS (Figure 1G).',
  ['Tfh', 'NSCLC', 'Tfh', 'NSCLC']),
 ('We also observed a great skewing toward PD‐1+‐Tfh2 and PD‐1+‐Tfh17 subtypes in NSCLC.',
  ['Tfh', '##17', 'NSCLC']),
 ('These data suggest that circulating Tfh cells, especially PD‐1+‐Tfh2 and PD‐1+‐Tfh17 subtypes, expand in NSCLC and indicate a potential involvement of Tfh cells in NSCLC.',
  ['circulating', 'Tfh cells', 'Tfh', '##17', 'Tfh cells']),
 ('We also investigated PD‐1 expression of three Tfh subtypes in HS and NSCLC patients and found higher mean fluorescence intensity (MFI) of PD‐1 in Tfh1 than in Tfh2 and Tfh17, and higher MFI of PD‐1 in Tfh1 in NSCLC patients than in HS (Figure 1E,F).',
  ['Tfh', 'Tfh', 'Tfh', 'Tfh', '##17', 'Tfh', 'NSCLC', 'HS']),
 ('PD‐1/PD‐L1 checkpoint inhibitors show impres
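
Several hits above are word-piece fragments such as `'##17'`, split off from `Tfh17`. One way to tidy this is to ignore `word`, merge hits whose character spans touch, and read the text back from the sentence using the `start`/`end` offsets the pipeline returns. This is a sketch under assumptions: the helper name `merge_contiguous_hits` is hypothetical, and it merges touching spans regardless of their labels, which may not always be wanted. A word-level `aggregation_strategy` such as `"first"` or `"max"` in the pipeline call might also reduce these fragments, though that has not been tried here.

```python
from typing import Dict, List


def merge_contiguous_hits(sentence: str, hits: List[Dict]) -> List[str]:
    """Merge hits whose spans touch or overlap, then slice the sentence."""
    spans: List[List[int]] = []
    for hit in sorted(hits, key=lambda h: h["start"]):
        if spans and hit["start"] <= spans[-1][1]:
            # Touches or overlaps the previous span: extend it
            spans[-1][1] = max(spans[-1][1], hit["end"])
        else:
            spans.append([hit["start"], hit["end"]])
    return [sentence[start:end] for start, end in spans]


# Synthetic example with hand-written offsets (not real model output)
sentence = "PD-1+ Tfh17 subtypes in NSCLC"
hits = [
    {"word": "Tfh", "start": 6, "end": 9},
    {"word": "##17", "start": 9, "end": 11},
    {"word": "NSCLC", "start": 24, "end": 29},
]
print(merge_contiguous_hits(sentence, hits))  # ['Tfh17', 'NSCLC']
```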

In [10]:
# TODO - Retry with other cellate model
# Highlight annotations in-text
# Grab annotations from ePMC - either via graphQL or ePMC API - unsure why they aren't returned as default in graphQL?
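
For the "highlight annotations in-text" TODO, a minimal sketch using the `start`/`end` offsets from the pipeline output is below. The helper name `highlight_hits` and the default `**` markers are assumptions for illustration; any pair of markers (for example HTML tags) could be passed instead.

```python
from typing import Dict, List


def highlight_hits(
    sentence: str,
    hits: List[Dict],
    open_mark: str = "**",
    close_mark: str = "**",
) -> str:
    """Wrap each hit span in markers, working left to right."""
    pieces: List[str] = []
    cursor = 0
    for hit in sorted(hits, key=lambda h: h["start"]):
        start, end = hit["start"], hit["end"]
        if start < cursor:
            # Skip a span that overlaps one already highlighted
            continue
        pieces.append(sentence[cursor:start])
        pieces.append(f"{open_mark}{sentence[start:end]}{close_mark}")
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces)


# Sentence and offsets taken from the HeLa check earlier in the notebook
text = "The HeLa cell line is widely used in cancer research."
hela_hit = [{"entity_group": "CellLine", "word": "HeLa", "start": 4, "end": 8}]
print(highlight_hits(text, hela_hit))
# The **HeLa** cell line is widely used in cancer research.
```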