<a href="https://colab.research.google.com/github/AmirMoazzami/266_final_proj/blob/mk%2Fbart-biobart-exploration/pretrained_no_finetune/BioBART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers
!pip install datasets
!pip install torchsummary

In [1]:
import shutil
import os

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, BartForConditionalGeneration, BartTokenizer
from datasets import load_dataset, load_from_disk
import torch


# Make CUDA calls synchronous so errors surface at the failing call (debugging aid; slows execution).
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

# Use the GPU when one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


model_names = [
    "GanjinZero/biobart-v2-base",
    "facebook/bart-base",
    "facebook/bart-large-cnn",
]

def get_model_and_tokenizer(model_name: str):
    """Load a BART checkpoint onto `device` and return it with its tokenizer."""
    model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = BartTokenizer.from_pretrained(model_name)
    return model, tokenizer

# One-time setup during development (uncomment to rebuild the debug subset):
# dataset = load_dataset(
#     "allenai/mslr2022",
#     "ms2",
#     split='train[:10]',  # small slice, for setup/debugging only
# )  # slow (~6 min); most of the time is spent on the train split

# # Save the first 10 examples to a small local copy so later sessions can skip the full download.
# dataset.save_to_disk("first_10_train_examples")
# shutil.make_archive('first_10_train_examples', 'zip', 'first_10_train_examples')

# Run every session: load the cached 10-example subset from disk.
dataset = load_from_disk("first_10_train_examples")
dataset

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


Dataset({
    features: ['review_id', 'pmid', 'title', 'abstract', 'target', 'background'],
    num_rows: 10
})

In [2]:
model, tokenizer = get_model_and_tokenizer("GanjinZero/biobart-v2-base")

In [19]:
test_idx = 5
background_text = dataset[test_idx]["background"].replace("\n", " ")
text1 = dataset[test_idx]["abstract"][0].replace("\n", " ")
text2 = dataset[test_idx]["abstract"][1].replace("\n", " ")
instruction = "summarize: BACKGROUND - "
background_length = len(tokenizer.encode(instruction + background_text))  # token count of the shared prefix (not used below)

# Encode each (background + abstract) prompt separately, concatenate the encoder states, then decode.
inputs1 = tokenizer.encode(instruction + background_text + " ABSTRACT - " + text1, return_tensors="pt", max_length=1024, truncation=True).to(device)
inputs2 = tokenizer.encode(instruction + background_text + " ABSTRACT - " + text2, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Keep at most 512 encoder positions per prompt so the concatenation fits BART's 1024 max position embeddings.
half = 1024 // 2
encoded1 = model(inputs1).encoder_last_hidden_state[:, :half]
encoded2 = model(inputs2).encoder_last_hidden_state[:, :half]

concatenated = torch.cat((encoded1, encoded2), dim=1)  # (batch, kept1 + kept2, hidden)
# Caveat: passing the hidden states as `inputs_embeds` sends them through the encoder again inside
# generate(), so they are encoded twice. Passing them as `encoder_outputs` would likely be the more direct route.
decoded = model.generate(max_length=300, pad_token_id=1, inputs_embeds=concatenated, decoder_inputs_embeds=concatenated)
print(decoded.shape)
tokenizer.decode(decoded[0], skip_special_tokens=True)

torch.Size([1, 300])


'summarize: BACKGROUND - INTRODUCTION Surgical stress in the presence of fasting worsens the catabolic state, causes insulin resistance and may delay recovery.. Carbohydrate rich drinks given preoperatively may ameliorate these deleterious effects. A systematic review was undertaken to analyse the effect of the potential effect of preoperative carbohydrate loading on insulin resistance, gastric emptying, gastric acidity, patient wellbeing, immunity and nutrition following surgery. ABSTRACT - The effect on gastric pH and volume, and the effects of this effect on the ability of an effect on a possible effect of a potential effect on an effect of an upcoming event, was found to be, respectively, of -1, -2, -3, and -1.5, -4, -1, -1 and -2.2,-1, and was investigated in this study by the use of a technique of, and hence, of, a, and a, thus, of-1. During and after the period, we were able to detect, and we could not detect, the possibility of, or the outcome of, an effect, of an event, of bei
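
The prompts in the cell above and in the single-abstract cell further down share one template. A minimal sketch of a helper that builds it in one place; `build_prompt` is a hypothetical name, not something defined in this notebook, and the template strings are copied from the cell above.

```python
def build_prompt(background: str, abstract: str) -> str:
    """Build the 'summarize: BACKGROUND - <background> ABSTRACT - <abstract>' prompt."""
    # Flatten newlines the same way the notebook does before tokenizing.
    background = background.replace("\n", " ")
    abstract = abstract.replace("\n", " ")
    return "summarize: BACKGROUND - " + background + " ABSTRACT - " + abstract


# Toy strings for illustration only; in the notebook these would come from the dataset row.
print(build_prompt("Fasting worsens\nthe catabolic state .", "Apple juice was given ."))
```

Keeping the template in one function would make it harder for the prompts in different cells to drift apart.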

In [73]:
background_text

'INTRODUCTION Surgical stress in the presence of fasting worsens the catabolic state , causes insulin resistance and may delay recovery . Carbohydrate rich drinks given preoperatively may ameliorate these deleterious effects . A systematic review was undertaken to analyse the effect of preoperative carbohydrate loading on insulin resistance , gastric emptying , gastric acidity , patient wellbeing , immunity and nutrition following surgery .'

In [77]:
dataset[test_idx]["target"].replace("\n", " ")

'Preoperative carbohydrate drinks significantly improved insulin resistance and indices of patient comfort following surgery , especially hunger , thirst , malaise , anxiety and nausea . No definite conclusions could be made regarding preservation of muscle mass . Following ingestion of carbohydrate drinks , no adverse events such as apparent or proven aspiration during or after surgery were reported . Administration of oral carbohydrate drinks before surgery is probably safe and may have a positive influence on a wide range of perioperative markers of clinical outcome .'

In [74]:
text1

'The effect on gastric pH and volume of 0 , 6 and 10 ml · kg−1 , of apple juice given 2.5 hours before surgery to children aged five to ten years was investigated in this prospect i ve , r and omized , single-blind study . Gastric contents were aspirated after induction of anaesthesia , and the volume measured . The pH of the gastric aspirate was then assessed using pH paper . Neither gastric volume nor pH immediately following the induction of general anaesthesia were significantly different among the three groups . Gastric volumes after 0 , 6 and 10 ml · kg−1 , of juice averaged ( mean ±SD ) 0.45 ±0.31 , 0.66 ±0.79 and 0.71 ±0.76 ml · kg−1 , respectively ; gastric pH averaged 1.7 ±0.6 , 1.7 ±0.6 and 1.8 ±0.8 , respectively . On the basis of questions asked immediately before induction of anaesthiesia , patients who drank 6 ml · kg−1 of apple juice had decreased thirst and were less irritable and upset before anaesthesia than those who had not ( P < 0.05 ) . It is concluded that drink

In [4]:
inputs_combined = tokenizer.encode("summarize: " + background_text + " " + text1 + " " + text2, return_tensors="pt", max_length=1024, truncation=True).to(device)

decoded_combined = model.generate(inputs_combined, max_length=300, pad_token_id=1)
print(decoded_combined.shape)
tokenizer.decode(decoded_combined[0], skip_special_tokens=True)

torch.Size([1, 300])


'summarize: INTRODUCTION Surgical stress in the presence of fasting worsens the catabolic state, causes insulin resistance and may delay recovery. Carbohydrate rich drinks given preoperatively may ameliorate these deleterious effects. A systematic review was undertaken to analyse the effect of preoperative carbohydrate loading on insulin resistance. gastric emptying, gastric acidity, patient wellbeing, immunity and nutrition following surgery. The effect on gastric pH and volume of 0, 6 and 10 ml · kg−1, of apple juice given 2.5 hours before surgery to children aged five to ten years was investigated in this prospect i ve, r and omized, single-blind study. Gastric contents were aspirated after induction of anaesthesia, and the volume measured. The pH of the gastric aspirate was then assessed using pH paper. Neither gastric volume nor pH immediately following the induction of general anaesthesia were significantly different among the three groups. Gast gastric volumes after 0,6 and 10 m

In [13]:
inputs_solo = tokenizer.encode("summarize: BACKGROUND - " + background_text + " ABSTRACT - " + text1, return_tensors="pt", max_length=1024, truncation=True).to(device)

decoded_solo = model.generate(inputs_solo, max_length=300, pad_token_id=1)
tokenizer.decode(decoded_solo[0], skip_special_tokens=True)

'summarize: BACKGROUND - INTRODUCTION Surgical stress in the presence of fasting worsens the catabolic state, causes insulin resistance and may delay recovery. Carbohydrate rich drinks given preoperatively may ameliorate these deleterious effects. A systematic review was undertaken to analyse the effect of preoperative carbohydrate loading on insulin resistance. gastric emptying, gastric acidity, patient wellbeing, immunity and nutrition following surgery. ABSTRACT - The effect on gastric pH and volume of 0, 6 and 10 ml · kg−1, of apple juice given 2.5 hours before surgery to children aged five to ten years was investigated in this prospect i ve, r and omized, single-blind study. Gastric contents were aspirated after induction of anaesthesia, and the volume measured. The pH of the gastric aspirate was then assessed using pH paper. Neither gastric volume nor pH immediately following the induction of general anaesthesia were significantly different among the three groups. Gastral volumes
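
The decoded text above reproduces the prompt almost word for word, which is plausible for a denoising checkpoint that has not been fine-tuned for summarization. One rough way to quantify that is sketched below. `copy_rate` and its word-trigram definition are illustrative assumptions, not a metric used elsewhere in this notebook.

```python
import re


def word_tokens(text: str) -> list:
    """Lowercase word tokens; ignores punctuation so 'state ,' and 'state,' match."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: list, n: int) -> list:
    """All contiguous n-grams of `tokens`, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copy_rate(source: str, output: str, n: int = 3) -> float:
    """Fraction of the output's word n-grams that also occur in the source."""
    source_ngrams = set(ngrams(word_tokens(source), n))
    output_ngrams = ngrams(word_tokens(output), n)
    if not output_ngrams:
        return 0.0  # output too short to form any n-gram
    copied = sum(1 for gram in output_ngrams if gram in source_ngrams)
    return copied / len(output_ngrams)


# Toy example: the output is the source with different punctuation spacing.
src = "Surgical stress worsens the catabolic state , causes insulin resistance ."
out = "Surgical stress worsens the catabolic state, causes insulin resistance."
print(copy_rate(src, out))
```

A value near 1.0 would indicate that the model is echoing its input rather than summarizing it.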

In [12]:
print(inputs_solo.shape)
print(decoded_solo.shape)

torch.Size([1, 824])
torch.Size([1, 300])
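
The input shape above hints at what the half-length slice in the encode-and-concatenate cell discards. If the 824-token prompt printed here is the same one encoded as `inputs1` (an assumption, since the cells were run out of order), then `[:, :512]` keeps only the first 512 encoder positions. A minimal sketch of that arithmetic follows; `kept_and_dropped` is a hypothetical helper name.

```python
MAX_POSITIONS = 1024        # BART's max position embeddings, as noted in the notebook
HALF = MAX_POSITIONS // 2   # per-prompt budget used by the slice [:, :512]


def kept_and_dropped(seq_len: int, budget: int = HALF):
    """Return (positions kept, positions dropped) when slicing a sequence to `budget`."""
    kept = min(seq_len, budget)
    return kept, seq_len - kept


kept, dropped = kept_and_dropped(824)
print(kept, dropped)  # 512 312
```

Under that assumption, the slice would drop the last 312 positions of the first prompt, which lie in the abstract rather than in the shared background.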


In [31]:
print(decoded.shape)
tokenizer.decode(decoded[0], skip_special_tokens=True)

torch.Size([1, 400])


'summarize: (3 sentences) BACKGROUND Pulmonary arterial hypertension is a devastating disease, which leads to right heart failure and premature death. Recent evidence suggests that endothelin receptor antagonists may be promising drugs in the treatment of pulmonary arterial pressure. OBJECTIVES To evaluate the efficacy of endendothelin- receptor antagonists in pulmonaryterial hypertension. BACKGROUND Primary pulmonary hypertension ist a progressive disease for which no treatment has been shown in a prospect i ve, r and omized trial to improve survival. METHODS We conducted a 12-week prospect i v, rand omized, multicenter open trial comparing the effects of the continuous intravenous infusion of epoprostenol ( formerly called prostacyclin ) plus conventional therapy with those of conventional tapy alone in 81 patients with severe primary pulmonmonary hypertension ( New York Heart Association functional class III or IV ). RESULTS Exercise capacity was improved in  the 41 patients treated

In [33]:
def summarize(text, model, tokenizer, **generate_args):
    """Prefix `text` with "summarize: ", truncate to 1024 tokens, generate, and decode the first sequence."""
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True).to(device)
    summary_ids = model.generate(inputs, **generate_args)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Summarize only the first example for now (the loop breaks after one iteration)
summaries = []
for example in dataset:
    text = example["abstract"][1]  # second abstract of the review only
    # text = "; ".join([f"ARTICLE {e + 1}: {txt}" for e, txt in enumerate(example["abstract"])])  # concatenate everything into one

    # test text from news article:
    # text = """ATLANTIC CITY, N.J. — Danish energy developer Orsted said Tuesday night it is scrapping two large offshore wind-power projects off the coast of New Jersey, adding uncertainty to a nascent industry the Biden administration and many state governments are counting on to help transition away from the burning of planet-warming fossil fuels. The company said it is canceling its Ocean Wind I and II projects in southern New Jersey, citing supply-chain issues and rising interest rates."""
    summary = summarize(text, model, tokenizer, max_length=200, min_length=20, length_penalty=1.0, num_beams=10, early_stopping=True)  # guesses for generate args
    summaries.append(summary)
    break

for i, summary in enumerate(summaries):
    print(f"Summary {i+1}:\n{summary}\n")

Summary 1:
The GFP-tagged ADSCs were identified in the lungs and differentiated into endothelial-like cells. Two weeks post-MCT administration, the ADSCs group received 1 × 106 ADSCs via the external jugular vein. Compared to PAH rats, mean pulmonary arterial pressure was decreased in rats at 1, 2, and 3 weeks after ADSCs-treatment.



In [29]:
"; ".join([f"ARTICLE {e + 1}: {txt}" for e, txt in enumerate(example["abstract"])])

'ARTICLE 1: Although transplantation of adult bone marrow mesenchymal stem cells ( BM-MSCs ) holds promise in the treatment for pulmonary arterial hypertension ( PAH ) , the poor survival and differentiation potential of adult BM-MSCs have limited their therapeutic efficiency . Here , we compared the therapeutic efficacy of human embryonic stem cell-derived MSCs ( hESC-MSCs ) with adult BM-MSCs for the treatment of PAH in an animal model . One week following monocrotaline (MCT)-induced PAH , mice were r and omly assigned to receive phosphate-buffered saline ( MCT group ) ; 3.0 × 106 human BM-derived MSCs ( BM-MSCs group ) or 3.0 × 106 hESC-derived MSCs ( hESC-MSCs group ) via tail vein injection . At 3 weeks posttransplantation , the right ventricular systolic pressure ( RVSP ) , degree of RV hypertrophy , and medial wall thickening of pulmonary arteries were lower= , and pulmonary capillary density was higher in the hESC-MSC group as compared with BM-MSC and MCT groups ( all p < 0.05 
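
Joining every abstract into one string, as above, means the 1024-token truncation in `summarize` can cut off the later articles entirely. A hedged sketch of one alternative gives each abstract an equal share of a word budget. `join_with_budget` and `max_words` are hypothetical, and whitespace-separated words are only a rough proxy for BART tokens.

```python
def join_with_budget(abstracts, max_words=600):
    """Join abstracts as 'ARTICLE n: ...', truncating each to an equal share of `max_words`."""
    if not abstracts:
        return ""
    # Equal share per abstract; the 'ARTICLE n:' labels are not counted against the budget.
    share = max(1, max_words // len(abstracts))
    parts = []
    for i, txt in enumerate(abstracts):
        words = txt.replace("\n", " ").split()
        parts.append(f"ARTICLE {i + 1}: " + " ".join(words[:share]))
    return "; ".join(parts)


# Toy example with a budget of 4 words shared by two abstracts.
print(join_with_budget(["a b c d", "e f g h"], max_words=4))
```

With this layout every article contributes its opening sentences, at the cost of losing the tail of each one.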

In [32]:
print(dataset[0]["abstract"][1].replace(". ", ". \n"))

Abstract We investigated the effect of adipose-derived stem cells ( ADSCs ) transplantation effects on structural remodeling and pulmonary artery pressure in  monocrotaline (MCT)-induced pulmonary hypertensive rats . 
In the first experiment , 32 male Sprague-Dawley ( SD ) rats were r and omly divided into four groups ( n = 8/group ) : 3 ADSCs treated groups and normal control ( Ctrl ) . 
ADSCs were administered through the left jugular vein at 105 , 106 and 107 cells , respectively , and a cell density of 106cells/ml was shown to be optimal . 
The GFP-tagged ADSCs were identified in the lungs and differentiated into endothelial-like cells . 
In the second experiment , 96 male SD rats were r and omly divided into three groups ( n = 32/group ) : Ctrl , MCT-induced pulmonary arterial hypertension ( PAH ) , and PAH treated with ADSCs ( ADSCs ) . 
Two weeks post-MCT administration , the ADSCs group received 1 × 106 ADSCs via the external jugular vein . 
Compared to PAH rats , mean pulmonar

In [None]:
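# NOTE: `pipe` and `sample` are not defined in any cell shown here. The 'summary_text' key in the
# output suggests `pipe` was a transformers summarization pipeline and `sample` a single dataset row.
pipe(sample["abstract"][0])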

[{'summary_text': 'Although transplantation of adult bone marrow mesenchymal stem cells ( BM-MSCs ) holds promise in the treatment for pulmonary arterial hypertension ( PAH ) , the poor survival and differentiation potential of adult BM-derived MSCs have limited their therapeutic efficiency . Here , we compared the potential efficacy of human embryonic stem cell- derived MSCs ( hESC-MSSCs ) with adult BM -MSCs for the repair of PAH in an animal model . One week following monocrotaline (MCT)-induced PAH , mice were r and omly assigned to receive phosphate-buffered saline ( MCT'}]