<a href="https://colab.research.google.com/github/PathwayCommons/pathway-abstract-classifier/blob/main/pathway_abstract_classifier/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> Using PubMed Article Classifier - Tutorial </h1>

Set up environment

In [10]:
!git clone https://github.com/PathwayCommons/pathway-abstract-classifier.git

In [11]:
# This cell may throw some errors which you can safely ignore
%cd pathway-abstract-classifier
!pip install -r requirements.txt
# Currently necessary to get this to run in Colab environment
!pip install --upgrade google-cloud-storage

/content/pathway-abstract-classifier/pathway-abstract-classifier


Below we import the necessary libraries, optionally turn on automatic mixed precision (AMP), and load the pretrained model

In [12]:
import tensorflow as tf
import ktrain 
from cached_path import cached_path
import numpy as np
import pandas as pd 

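# Activate AMP (automatic mixed precision). This is optional and may improve performance.
# Note: the `experimental` module used below is the older API; TensorFlow 2.4+ provides
# tf.keras.mixed_precision.set_global_policy('mixed_float16') instead.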
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

# Load model
model_path = cached_path("https://github.com/PathwayCommons/pathway-abstract-classifier/releases/download/pretrained-models/title_abstract_model.zip", extract_archive=True)
predictor = ktrain.load_predictor(model_path)

Here we read in the validation data, build the model inputs from it, and make predictions. Finally, we convert the existing labels to booleans so that we can calculate performance metrics

In [13]:
# Unzip and read in validation data
import shutil
shutil.unpack_archive('../pathway-abstract-classifier/Data/val_data.tsv.zip', '../pathway-abstract-classifier/Data')

df=pd.read_csv('../pathway-abstract-classifier/Data/val_data.tsv', delimiter="\t")

# Pre-process the input data and make predictions. This method uses lists, but dataframes would also work (see the files in the archive)
titles=df['title'].tolist()
abstracts=df['abstract'].tolist()
# Load sep token ("[SEP]" in this case) for transformer
sep_token = predictor.preproc.get_tokenizer().sep_token
# Concatenate input
texts = [" ".join([title, sep_token, abstract]) for title, abstract in zip(titles, abstracts)] 
# Make predictions and time them (note: there are 1042 examples here)
%time predictions = predictor.predict(texts)

# Pre-process the existing labels: cast to bool so they can be compared with the predictions
df['class']=df['class'].astype('bool')
y_val=df['class'].to_numpy()

CPU times: user 28.4 s, sys: 4.66 s, total: 33 s
Wall time: 2min 31s
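
The concatenation step above can be sketched in isolation. This is a minimal sketch rather than the classifier's own code: `build_input` is a hypothetical helper, the separator is hard-coded to "[SEP]" (the value the tutorial reports for this model), and treating a missing title or abstract as an empty string is an assumption, not something the repository prescribes.

```python
def build_input(title, abstract, sep_token="[SEP]"):
    """Join a title and an abstract around the separator token.

    Hypothetical helper for illustration. Any non-string field (None, or the
    float NaN that pandas uses for missing values) is replaced by an empty
    string; that fallback is an assumption of this sketch.
    """
    title = title.strip() if isinstance(title, str) else ""
    abstract = abstract.strip() if isinstance(abstract, str) else ""
    return " ".join([title, sep_token, abstract])


# Toy inputs, not taken from the validation data
titles = ["A title", None]
abstracts = ["An abstract.", "Abstract without a title."]
texts = [build_input(t, a) for t, a in zip(titles, abstracts)]
print(texts[0])
```

In the notebook itself, the separator would come from `predictor.preproc.get_tokenizer().sep_token` instead of the hard-coded default, so the same helper would follow whichever tokenizer the loaded model uses.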


Now we check performance on the validation set

In [14]:
# Check performance on validation set
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_val, predictions))
print(classification_report(y_val, predictions))
print(matthews_corrcoef(y_val, predictions))

[[797  45]
 [ 47 153]]
              precision    recall  f1-score   support

       False       0.94      0.95      0.95       842
        True       0.77      0.77      0.77       200

    accuracy                           0.91      1042
   macro avg       0.86      0.86      0.86      1042
weighted avg       0.91      0.91      0.91      1042

0.7142926808037249
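
To make the link between the confusion matrix and the reported scores explicit, the positive-class metrics can be recomputed by hand. This is a sketch using only the four counts printed above, read in scikit-learn's layout of `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions); the variable names are chosen here for illustration.

```python
import math

# Counts copied from the confusion matrix printed above: [[TN, FP], [FN, TP]]
tn, fp = 797, 45
fn, tp = 47, 153

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Matthews correlation coefficient for the binary case
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} mcc={mcc:.4f}")
```

Because the classes are imbalanced (842 negatives against 200 positives), accuracy is dominated by the negative class, which is why the positive-class precision and recall and the MCC give a more informative picture here.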


Let's try raising the confidence threshold now. This makes the classifier more conservative: it labels fewer articles as positive, trading recall on the positive class for higher precision

In [15]:
# The default threshold is 0.5. In my experience, it needs to exceed 0.9 before the effect becomes noticeable
threshold = 0.994
# Column 1 of predict_proba holds the positive-class probability
conf_predictions = (predictor.predict_proba(texts)[:, 1] >= threshold).astype(bool)

print(confusion_matrix(y_val, conf_predictions))
print(classification_report(y_val, conf_predictions))
print(matthews_corrcoef(y_val, conf_predictions))

[[838   4]
 [121  79]]
              precision    recall  f1-score   support

       False       0.87      1.00      0.93       842
        True       0.95      0.40      0.56       200

    accuracy                           0.88      1042
   macro avg       0.91      0.70      0.74      1042
weighted avg       0.89      0.88      0.86      1042

0.5676294827683775
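
The precision/recall trade-off seen above can be illustrated with a small threshold sweep. This is a sketch under stated assumptions: `precision_recall_at` is a hypothetical helper, and the probabilities and labels are made-up toy values, not output from the model.

```python
def precision_recall_at(probs, labels, threshold):
    """Return (precision, recall) for the positive class at a given threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y)
    # Guard against division by zero when nothing is predicted or present
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Made-up positive-class probabilities and labels, for illustration only
probs = [0.999, 0.997, 0.98, 0.95, 0.93, 0.70, 0.60, 0.30, 0.10, 0.02]
labels = [True, True, True, False, True, False, True, False, False, False]

for threshold in (0.5, 0.9, 0.994):
    p, r = precision_recall_at(probs, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

In the notebook, the probabilities would come from `predictor.predict_proba(texts)[:, 1]`. Computing that array once and reusing it for every threshold avoids repeating the slow inference step.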


Now let's try a single example (with an explanation)

In [16]:
!pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip

In [17]:
title = "YTHDC1-mediated augmentation of miR-30d in repressing pancreatic tumorigenesis via attenuation of RUNX1-induced transcriptional activation of Warburg effect"
abstract = "Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal human cancers. It thrives in a malnourished environment; however, little is known about the mechanisms by which PDAC cells actively promote aerobic glycolysis to maintain their metabolic needs. Gene Expression Omnibus (GEO) was used to identify differentially expressed miRNAs. The expression pattern of miR-30d in normal and PDAC tissues was studied by in situ hybridization. The role of miR-30d/RUNX1 in vitro and in vivo was evaluated by CCK8 assay and clonogenic formation as well as transwell experiment, subcutaneous xenograft model and liver metastasis model, respectively. Glucose uptake, ATP and lactate production were tested to study the regulatory effect of miR-30d/RUNX1 on aerobic glycolysis in PDAC cells. Quantitative real-time PCR, western blot, Chip assay, promoter luciferase activity, RIP, MeRIP, and RNA stability assay were used to explore the molecular mechanism of YTHDC1/miR-30d/RUNX1 in PDAC. Here, we discover that miR-30d expression was remarkably decreased in PDAC tissues and associated with good prognosis, contributed to the suppression of tumor growth and metastasis, and attenuation of Warburg effect. Mechanistically, the m6A reader YTHDC1 facilitated the biogenesis of mature miR-30d via m6A-mediated regulation of mRNA stability. Then, miR-30d inhibited aerobic glycolysis through regulating SLC2A1 and HK1 expression by directly targeting the transcription factor RUNX1, which bound to the promoters of the SLC2A1 and HK1 genes. Moreover, miR-30d was clinically inversely correlated with RUNX1, SLC2A1 and HK1, which function as adverse prognosis factors for overall survival in PDAC tissues. Overall, we demonstrated that miR-30d is a functional and clinical tumor-suppressive gene in PDAC. Our findings further uncover that miR-30d is a novel target for YTHDC1 through m6A modification, and miR-30d represses pancreatic tumorigenesis via suppressing aerobic glycolysis."
print(predictor.predict(" ".join([title, sep_token, abstract])))

1


In [18]:
predictor.explain(" ".join([title, sep_token, abstract]))

Contribution?  Feature
       19.582  Highlighted in text (sum)
       -0.981  <BIAS>
