In [1]:
# !pip install transformers

In [2]:
from transformers import pipeline

import numpy as np
import pandas as pd
import seaborn as sn

import utils # all data reading and preprocessing functionality

We use the druglib dataset as test data.

In [3]:
druglib = pd.read_csv("./data/drugLib_raw.tsv", sep="\t")

In [4]:
druglib = druglib.rename(columns = {'Unnamed: 0': 'ID'})
druglib.sideEffects = pd.Categorical(druglib.sideEffects,
                                     ordered=True,
                                     categories=['No Side Effects', 'Mild Side Effects', 'Moderate Side Effects', 'Severe Side Effects', 'Extremely Severe Side Effects'])
druglib.effectiveness = pd.Categorical(druglib.effectiveness,
                                       ordered=True,
                                       categories=['Ineffective', 'Marginally Effective', 'Moderately Effective', 'Considerably Effective', 'Highly Effective'])
druglib.head()

Unnamed: 0,ID,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview
0,1366,biaxin,9,Considerably Effective,Mild Side Effects,sinus infection,The antibiotic may have destroyed bacteria cau...,"Some back pain, some nauseau.",Took the antibiotics for 14 days. Sinus infect...
1,3724,lamictal,9,Highly Effective,Mild Side Effects,bipolar disorder,Lamictal stabilized my serious mood swings. On...,"Drowsiness, a bit of mental numbness. If you t...",Severe mood swings between hypomania and depre...
2,3824,depakene,4,Moderately Effective,Severe Side Effects,bipolar disorder,Initial benefits were comparable to the brand ...,"Depakene has a very thin coating, which caused...",Depakote was prescribed to me by a Kaiser psyc...
3,969,sarafem,10,Highly Effective,No Side Effects,bi-polar / anxiety,It controlls my mood swings. It helps me think...,I didnt really notice any side effects.,This drug may not be for everyone but its wond...
4,696,accutane,10,Highly Effective,Mild Side Effects,nodular acne,Within one week of treatment superficial acne ...,Side effects included moderate to severe dry s...,Drug was taken in gelatin tablet at 0.5 mg per...


In [5]:
# Basic usage
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
type(classifier)

transformers.pipelines.text_classification.TextClassificationPipeline

In [8]:
# the downloaded model is cached locally under ~/.cache/huggingface

# the model properties
classifier.model.config


DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.17.0",
  "vocab_size": 30522
}

In [9]:
# Output is a list with one dictionary (label and score) per input text
classifier("This is such a great movie!")

[{'label': 'POSITIVE', 'score': 0.9998759031295776}]

In [10]:
# try a trickier example
classifier("I am not sure what to think of the book but I couldn't stop reading.")

[{'label': 'POSITIVE', 'score': 0.9961125254631042}]

In [11]:
# try another tricky example
classifier("The book hit me in the guts until I could breathe no more.")

[{'label': 'NEGATIVE', 'score': 0.9968511462211609}]

In [28]:
druglib_comments = druglib['commentsReview'][:10].to_list()
druglib_comments

['Took the antibiotics for 14 days. Sinus infection was gone after the 6th day.',
 'Severe mood swings between hypomania and depression with suicide ideation before Lamictal. Began with 10mg and tritrated up to 400mg over a few months. Played around with the dosage to finally arrive at 400mg. Experimented with taking it at different times in the evening. Found that most comfortable time is before sleep.',
 'Depakote was prescribed to me by a Kaiser psychiatrist in Pleasant Hill, CA in 2006.  The medication was given to help treat the diagnosis of Bipolar Disorder, Type II.  My disease was misdiagnosed for several years as depression, since I was never seen by a professional during hypomanic episodes, and when I did see a professional, I did not think that my manic symptoms were signs of a serious psychiatric problem.  Anti-depressant drugs were prescribed: Prozac in 2001, which had minimal effect and I stopped taking them a few months later. Wellbutrin was prescribed again for "chronic

In [29]:
classifier(druglib_comments)

[{'label': 'NEGATIVE', 'score': 0.9951233267784119},
 {'label': 'NEGATIVE', 'score': 0.9942966103553772},
 {'label': 'NEGATIVE', 'score': 0.9966172575950623},
 {'label': 'POSITIVE', 'score': 0.9998546838760376},
 {'label': 'NEGATIVE', 'score': 0.9553026556968689},
 {'label': 'NEGATIVE', 'score': 0.9960475564002991},
 {'label': 'NEGATIVE', 'score': 0.9929372668266296},
 {'label': 'NEGATIVE', 'score': 0.9108172655105591},
 {'label': 'NEGATIVE', 'score': 0.9247077107429504},
 {'label': 'NEGATIVE', 'score': 0.997154712677002}]

In [13]:
classifier(druglib['benefitsReview'][:10].to_list())

[{'label': 'NEGATIVE', 'score': 0.99843829870224},
 {'label': 'POSITIVE', 'score': 0.9974270462989807},
 {'label': 'POSITIVE', 'score': 0.9917389154434204},
 {'label': 'POSITIVE', 'score': 0.9949883222579956},
 {'label': 'NEGATIVE', 'score': 0.9986709356307983},
 {'label': 'NEGATIVE', 'score': 0.9974334836006165},
 {'label': 'NEGATIVE', 'score': 0.9982646107673645},
 {'label': 'POSITIVE', 'score': 0.9977782368659973},
 {'label': 'POSITIVE', 'score': 0.9927157759666443},
 {'label': 'NEGATIVE', 'score': 0.9964677095413208}]
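Because the pipeline returns one dictionary per input text, a batch of results converts directly into a DataFrame. The sketch below reuses the ten predictions printed above as literal data, so it runs without downloading the model. The names `results`, `results_df` and `positive_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Predictions copied from the output above (benefitsReview, first 10 rows).
results = [
    {'label': 'NEGATIVE', 'score': 0.99843829870224},
    {'label': 'POSITIVE', 'score': 0.9974270462989807},
    {'label': 'POSITIVE', 'score': 0.9917389154434204},
    {'label': 'POSITIVE', 'score': 0.9949883222579956},
    {'label': 'NEGATIVE', 'score': 0.9986709356307983},
    {'label': 'NEGATIVE', 'score': 0.9974334836006165},
    {'label': 'NEGATIVE', 'score': 0.9982646107673645},
    {'label': 'POSITIVE', 'score': 0.9977782368659973},
    {'label': 'POSITIVE', 'score': 0.9927157759666443},
    {'label': 'NEGATIVE', 'score': 0.9964677095413208},
]

# One row per review, with the columns 'label' and 'score'.
results_df = pd.DataFrame(results)

# Fraction of reviews that the model labelled POSITIVE.
positive_share = (results_df['label'] == 'POSITIVE').mean()

print(results_df['label'].value_counts().to_dict())
print(positive_share)
```

With the real data, the same DataFrame could be joined to `druglib` by row position to compare the predicted label with the `rating` or `effectiveness` columns.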

In [44]:
file1 = './data/abstract_set1.txt'
file2 = './data/abstract_set2.txt'
data_selection = 'abstract'
label_selection = 'label' # can be 'label' or 'text_label'
df = utils.read_abstract_data(negatives_path=file2, positives_path=file1)
df.head()

Unnamed: 0,pmid,title,abstract,label,text_label
0,29981025,Impact of Neoadjuvant Chemotherapy on Breast C...,"BACKGROUND: Breast cancer subtype, as determin...",0,control
1,29984001,Expert-Performed Endotracheal Intubation-Relat...,The aim of this study was to determine complic...,0,control
2,29988545,A case report: Addison disease caused by adren...,We report middle age man with skin hyperpigmen...,0,control
3,29998100,An Unusual Morphological Presentation of Cutan...,Cutaneous squamous cell carcinoma (SCC) exhibi...,0,control
4,29999256,Informing Consent: Medical Malpractice and the...,"Since the early 1990s, jurisdictions around th...",0,control


In [57]:

abstract_sample = df.sample(10, random_state=1234)['abstract'].to_list()

# The model accepts at most 512 tokens (max_position_embeddings in the config).
# Its subword tokenizer usually produces more tokens than there are
# whitespace-separated words, so we keep only the first 250 words
# as a conservative limit.
abstract_sample_truncated = [
    ' '.join(abstract.split()[:250]) for abstract in abstract_sample
]

abstract_sample_truncated

['Ustiloxins were cyclopeptide mycotoxins from rice false smut balls (FSBs) that formed in rice spikelets infected by the fungal pathogen Ustilaginoidea virens. To investigate the chemical diversity of these metabolites and their bioactivities, one new cyclopeptide, ustiloxin G (1), together with four known congeners-ustiloxins A (2), B (3), D (4), and F (5)-were isolated from water extract of rice FSBs. Their structures were elucidated by analyses of their physical and spectroscopic data, including ultraviolet spectrometry (UV), infrared spectroscopy (IR), 1D and 2D nuclear magnetic resonance (NMR), and high-resolution electrospray ionization-mass spectrometry (HR-ESI-MS). All the compounds were evaluated for their cytotoxic as well as radicle and germ elongation inhibitory activities. Ustiloxin B (3) showed the best activity against the cell line BGC-823 with an IC50 value of 1.03 µM, while ustiloxin G (1) showed moderate activity against the cell lines A549 and A375 with IC50 values
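The word-based truncation above can be wrapped in a small reusable helper. This is a minimal sketch. The function name `truncate_words` and the two test strings are hypothetical, and the default of 250 words is taken from the cell above.

```python
def truncate_words(text, max_words=250):
    """Return text cut to at most max_words whitespace-separated words.

    Runs of whitespace are collapsed to single spaces as a side effect.
    """
    return ' '.join(text.split()[:max_words])


# Illustrative inputs: one text longer than the limit, one shorter.
long_text = ' '.join(['word'] * 300)
short_text = 'a  short   abstract'

print(len(truncate_words(long_text).split()))  # 250
print(truncate_words(short_text))              # a short abstract
```

An alternative is to let the tokenizer do the truncation. Text-classification pipelines generally forward tokenizer arguments given at call time, for example `classifier(texts, truncation=True, max_length=512)`, but this should be verified against the installed transformers version.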

In [58]:
predictions = classifier(abstract_sample_truncated)
predictions

[{'label': 'NEGATIVE', 'score': 0.977128267288208},
 {'label': 'NEGATIVE', 'score': 0.9973832964897156},
 {'label': 'NEGATIVE', 'score': 0.9983440637588501},
 {'label': 'POSITIVE', 'score': 0.9969672560691833},
 {'label': 'NEGATIVE', 'score': 0.7761743068695068},
 {'label': 'NEGATIVE', 'score': 0.9976509213447571},
 {'label': 'NEGATIVE', 'score': 0.5205545425415039},
 {'label': 'NEGATIVE', 'score': 0.9979456067085266},
 {'label': 'POSITIVE', 'score': 0.9912579655647278},
 {'label': 'NEGATIVE', 'score': 0.9980074763298035}]

In [35]:
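# Probability of the POSITIVE class: 'score' is the probability of the
# predicted label, so for NEGATIVE predictions we take 1 - score.
# Note: this cell (In [35]) was last run before In [58], so the values
# shown below do not correspond to the predictions printed above.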
probs = [d['score'] if d['label'].startswith('P') else 1 - d['score'] for d in predictions]
probs

[0.033216893672943115,
 0.006510734558105469,
 0.0036910176277160645,
 0.9968621730804443,
 0.22382569313049316,
 0.003848135471343994,
 0.003258049488067627,
 0.017386913299560547,
 0.9982472658157349,
 0.0014461874961853027]
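The conversion can also be written as a small function that compares the full label instead of its first letter. This is a sketch for a two-class model with the labels `POSITIVE` and `NEGATIVE`, as listed in the model config. The function name `positive_probability` and the two example dictionaries (taken from the `predictions` output of In [58]) are illustrative.

```python
def positive_probability(prediction):
    """Map one pipeline result dict to P(POSITIVE) for a two-class model."""
    if prediction['label'] == 'POSITIVE':
        return prediction['score']
    # The score belongs to the NEGATIVE class, so take the complement.
    return 1 - prediction['score']


# Two results copied from the predictions output above.
example = [
    {'label': 'POSITIVE', 'score': 0.9969672560691833},
    {'label': 'NEGATIVE', 'score': 0.5205545425415039},
]

probs_example = [positive_probability(p) for p in example]
print(probs_example)
```

The second example shows why the complement matters: a NEGATIVE prediction with a score of about 0.52 corresponds to a POSITIVE probability of about 0.48, which is close to a coin flip.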

In [38]:
preds = [1 if d['label'].startswith('P') else 0 for d in predictions]
preds = np.array(preds)
preds

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
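The same binary predictions can be derived from the probabilities by thresholding, which makes the cut-off explicit and adjustable. The sketch below uses the `probs` values printed earlier as literal data. The threshold of 0.5 and the name `preds_from_probs` are assumptions for illustration.

```python
import numpy as np

# Probabilities of the POSITIVE class, copied from the probs output above.
probs = np.array([
    0.033216893672943115,
    0.006510734558105469,
    0.0036910176277160645,
    0.9968621730804443,
    0.22382569313049316,
    0.003848135471343994,
    0.003258049488067627,
    0.017386913299560547,
    0.9982472658157349,
    0.0014461874961853027,
])

threshold = 0.5  # assumed cut-off; equivalent to taking the more likely label

# 1 where P(POSITIVE) reaches the threshold, otherwise 0.
preds_from_probs = (probs >= threshold).astype(int)
print(preds_from_probs)  # [0 0 0 1 0 0 0 0 1 0]
```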

In [30]:
import torch

In [31]:
torch.cuda.is_available()

False

In [32]:
torch.cuda.current_device()

AssertionError: Torch not compiled with CUDA enabled
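The error above occurs because `torch.cuda.current_device()` is called on a CPU-only build. A guarded check avoids it. The sketch below returns a device index in the convention the `pipeline` function accepts for its `device` argument (`-1` for CPU, `0` for the first GPU). The function name `pipeline_device_index` is hypothetical.

```python
def pipeline_device_index():
    """Return 0 (first GPU) if CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA device to select.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pipeline_device_index()
print(device)
```

The result can then be passed on when creating the classifier, for example `pipeline("sentiment-analysis", device=device)`.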