## 03 - Baseline Summarization Using a Transformer Model
This notebook builds a transformer baseline for summarization:
- read the mini evaluation set (a 10% random sample)
- extract the abstract (the first paragraph of each article)
- summarize each abstract with the model "Falconsai/medical_summarization"

In [1]:
!pip install pandas
!pip install transformers



In [2]:
import pandas as pd
from transformers import pipeline
import torch

In [3]:
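# check whether a GPU is available
print(torch.cuda.is_available())
# current_device() raises an error when no GPU is present, so only call it when one is available
if torch.cuda.is_available():
    print(torch.cuda.current_device())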

True
0
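
The check above assumes a GPU is present, and the next cell hard-codes `device=0`. As a minimal sketch, a device index could be chosen with a CPU fallback. The helper name `pick_device` is hypothetical, and the sketch assumes the pipeline convention that `device=-1` selects the CPU.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # torch is not installed, so only the CPU can be used
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)
```

The returned value could then be passed as `device=device` when the pipeline is created.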


In [4]:
# set up summarization model
model = pipeline("summarization",
                 model="Falconsai/medical_summarization",
                 device=0 # run on GPU
                )

In [5]:
s = 'The evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . Homology has been proposed between glomus cells , which are neural crest-derived , and the hypoxia-sensitive ‘neuroepithelial cells’ ( NECs ) of fish gills , whose embryonic origin is unknown . NECs have also been likened to PNECs , which differentiate in situ within lung airway epithelia . Using genetic lineage-tracing and neural crest-deficient mutants in zebrafish , and physical fate-mapping in frog and lamprey , we find that NECs are not neural crest-derived , but endoderm-derived , like PNECs , whose endodermal origin we confirm . We discover neural crest-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , and propose a new model for amniote hypoxia-sensitive cell evolution: endoderm-derived NECs were retained as PNECs , while the carotid body evolved via the aggregation of neural crest-derived catecholaminergic ( chromaffin ) cells already associated with blood vessels in anamniote pharyngeal arches . '
s

'The evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . Homology has been proposed between glomus cells , which are neural crest-derived , and the hypoxia-sensitive ‘neuroepithelial cells’ ( NECs ) of fish gills , whose embryonic origin is unknown . NECs have also been likened to PNECs , which differentiate in situ within lung airway epithelia . Using genetic lineage-tracing and neural crest-deficient mutants in zebrafish , and physical fate-mapping in frog and lamprey , we find that NECs are not neural crest-derived , but endoderm-derived , like PNECs , whose endodermal origin we confirm . We discover neural crest-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , and propose a new model for amniote hypoxia-sensitive cell evolution: endoderm-derived NECs were retained as PNECs , while the carotid body evolved via

In [6]:
result = model(s,
               max_length=500,
               min_length=100
              )
summ = result[0]["summary_text"]
print(len(summ))
print(summ)

Your max_length is set to 500, but your input_length is only 354. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=177)


783
the evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . we find that endoderm-derived neuroepithelial cells ( nEC ) are not neural crest , but endodermal , like PNCEs , whose endodermale origin we confirm . here we propose a new model for amnium-sensitive cell evolution : endodermm-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , while the aggregation of chromaffin ) cells already associated with blood vessels in anamniotic arches has been proposed between the gills . ncs have also been likened to nephrine , which differentiate in situ within lung airway epithelia . the phenotypic origin is unknown .
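
The warning above suggests lowering `max_length` to half of the input length (177 for an input of 354 tokens). A minimal sketch of that rule follows; the helper name `length_bounds` is hypothetical, and it assumes the input length in tokens is already known (for example from the model's tokenizer).

```python
def length_bounds(input_length, min_length=100, max_length=500):
    """Pick (min_length, max_length) for one input of `input_length` tokens."""
    # follow the warning: use half of the input length as the upper bound,
    # but never exceed the configured max_length
    capped_max = max(1, min(max_length, input_length // 2))
    # the lower bound must not exceed the upper bound
    capped_min = min(min_length, capped_max)
    return capped_min, capped_max


print(length_bounds(354))  # (100, 177)
print(length_bounds(150))  # (75, 75)
```

For short inputs the two bounds collapse to the same value, which forces a fixed output length; a smaller `min_length` may be preferable in that case.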


In [7]:
# wrap the pipeline call in a function
def summarize(text, model=model, min_length=100, max_length=500):
    """
        Summarize a text with a transformer summarization pipeline.
        min_length and max_length are the lower and upper limits on the
        number of tokens in the output.
    """
    doc = model(text,
                max_length=max_length,
                min_length=min_length
               )
    summ = doc[0]["summary_text"]
    return summ

summarize(s)

Your max_length is set to 500, but your input_length is only 354. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=177)


'the evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . we find that endoderm-derived neuroepithelial cells ( nEC ) are not neural crest , but endodermal , like PNCEs , whose endodermale origin we confirm . here we propose a new model for amnium-sensitive cell evolution : endodermm-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , while the aggregation of chromaffin ) cells already associated with blood vessels in anamniotic arches has been proposed between the gills . ncs have also been likened to nephrine , which differentiate in situ within lung airway epithelia . the phenotypic origin is unknown .'

In [8]:
# load data
filepath = "../data/mini_dataset/"
filename = "eLife_val_mini_milestone3.jsonl"
df = pd.read_json(filepath + filename,
                  orient="records",
                  lines=True
                 )
print(len(df))
df.head()         

24


Unnamed: 0,lay_summary,article,headings,keywords,id
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1


In [9]:
# extract abstract (first paragraph)
df["abstract"] = df["article"].apply(lambda text: text.split("\n")[0])
print(df["abstract"].iloc[3])
df.head()

Gradients of signaling proteins are essential for inducing tissue morphogenesis . However , mechanisms of gradient formation remain controversial . Here we characterized the distribution of fluorescently-tagged signaling proteins , FGF and FGFR , expressed at physiological levels from the genomic knock-in alleles in Drosophila . FGF produced in the larval wing imaginal-disc moves to the air-sac-primordium ( ASP ) through FGFR-containing cytonemes that extend from the ASP to contact the wing-disc source . The number of FGF-receiving cytonemes extended by ASP cells decreases gradually with increasing distance from the source , generating a recipient-specific FGF gradient . Acting as a morphogen in the ASP , FGF activates concentration-dependent gene expression , inducing pointed-P1 at higher and cut at lower levels . The transcription-factors Pointed-P1 and Cut antagonize each other and differentially regulate formation of FGFR-containing cytonemes , creating regions with higher-to-lower

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2,Mature neural networks synchronize and integra...
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1,Many decisions are thought to arise via the ac...
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2,Mutations in the inositol 5-phosphatase OCRL c...
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3,Gradients of signaling proteins are essential ...
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1,Similarity between two individuals in the comb...
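
The extraction above takes everything before the first newline, which yields an empty string if an article starts with a blank line. As a hedged sketch, a slightly more defensive variant could skip empty lines. The helper name `first_paragraph` is hypothetical, and whether any article in this dataset actually starts with a blank line is not known from the output shown.

```python
def first_paragraph(article):
    """Return the first non-empty line of `article`, or "" if there is none."""
    for line in article.split("\n"):
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


print(first_paragraph("\nAbstract text .\nIntroduction text ."))
```

It could be applied in the same way as the lambda above, e.g. `df["article"].apply(first_paragraph)`.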


In [10]:
# apply summarization
df["baseline_summary"] = df["abstract"].apply(summarize)
df.head()

Your max_length is set to 500, but your input_length is only 226. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=113)
Your max_length is set to 500, but your input_length is only 206. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=103)
Your max_length is set to 500, but your input_length is only 318. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=159)
Your max_length is set to 500, but your input_length is only 321. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract,baseline_summary
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2,Mature neural networks synchronize and integra...,we investigate the progression of large-scale ...
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1,Many decisions are thought to arise via the ac...,a novel sensory manipulation mimics the additi...
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2,Mutations in the inositol 5-phosphatase OCRL c...,the inositol 5-phosphatase OCRL is recruited t...
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3,Gradients of signaling proteins are essential ...,gradient formation remains controversial . her...
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1,Similarity between two individuals in the comb...,background : similarity between two individual...


In [11]:
# write to output
output_path = "../data/milestone3/transformer_baseline/"
output_file = "elife.csv"
df.to_csv(output_path+output_file,
          index=False,
         )
print("Output file completed")

Output file completed


In [12]:
# write to txt file
output_file_txt = "elife.txt"

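# write the baseline_summary column to a txt file, one summary per line
# sep="\n" is used so that commas inside a summary are not treated as delimiters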
txt_df = df['baseline_summary']
txt_df.to_csv(output_path+output_file_txt,
              index=False,
              header=False,
              sep="\n"
             )
print("Output file completed")

Output file completed
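
The cell above relies on `to_csv` with a newline separator to get one summary per line. An alternative sketch with plain file I/O is shown below; the helper name `write_summaries` is hypothetical, and it assumes each summary should occupy exactly one line, so internal whitespace (including line breaks) is collapsed to single spaces.

```python
import tempfile
from pathlib import Path


def write_summaries(summaries, path):
    """Write one summary per line and return the number of lines written."""
    # collapse runs of whitespace, including newlines, into single spaces
    lines = [" ".join(str(s).split()) for s in summaries]
    Path(path).write_text("".join(line + "\n" for line in lines),
                          encoding="utf-8")
    return len(lines)


# small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "demo.txt"
    n = write_summaries(["first summary , with a comma", "second\nsummary"], out)
    written = out.read_text(encoding="utf-8")
print(written)
```

In the notebook this could be called as `write_summaries(df["baseline_summary"], output_path + output_file_txt)`.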


In [13]:
# repeat for the PLOS mini validation (dev) set
# load data
filepath = "../data/mini_dataset/"
filename = "PLOS_val_mini_milestone3.jsonl"
df = pd.read_json(filepath + filename,
                  orient="records",
                  lines=True
                 )
print(len(df))
df.head()  

138


Unnamed: 0,lay_summary,article,headings,keywords,id
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145


In [14]:
# extract abstract (first paragraph)
df["abstract"] = df["article"].apply(lambda text: text.split("\n")[0])
print(df["abstract"].iloc[3])
df.head()

Recently , we presented a study of adult neurogenesis in a simplified hippocampal memory model . The network was required to encode and decode memory patterns despite changing input statistics . We showed that additive neurogenesis was a more effective adaptation strategy compared to neuronal turnover and conventional synaptic plasticity as it allowed the network to respond to changes in the input statistics while preserving representations of earlier environments . Here we extend our model to include realistic , spatially driven input firing patterns in the form of grid cells in the entorhinal cortex . We compare network performance across a sequence of spatial environments using three distinct adaptation strategies: conventional synaptic plasticity , where the network is of fixed size but the connectivity is plastic; neuronal turnover , where the network is of fixed size but units in the network may die and be replaced; and additive neurogenesis , where the network starts out with fe

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859,Fleas can transmit Yersinia pestis by two mech...
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170,Endogenous retroviruses ( ERVs ) are remnants ...
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720,The Drosophila embryonic gonad is assembled fr...
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063,"Recently , we presented a study of adult neuro..."
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145,Understanding the transcriptional regulation o...


In [15]:
# apply summarization
df["baseline_summary"] = df["abstract"].apply(summarize)
df.head()

Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
Your max_length is set to 500, but your input_length is only 461. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=230)
Your max_length is set to 500, but your input_length is only 398. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=199)
Your max_length is set to 500, but your input_length is only 359. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=179)
Your max_length is set to 500, but your input_length is only 417.

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract,baseline_summary
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859,Fleas can transmit Yersinia pestis by two mech...,fleas can transmit Yersinia pestis by two mech...
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170,Endogenous retroviruses ( ERVs ) are remnants ...,endogenous retroviruses ( ERVs ) are remnants ...
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720,The Drosophila embryonic gonad is assembled fr...,the hedgehog ( hh ) pathway gene shifted ( shf...
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063,"Recently , we presented a study of adult neuro...",introduction : we present a study of adult neu...
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145,Understanding the transcriptional regulation o...,we first analyzed the transcriptional profiles...
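
The first warning in the output above reports an input of 531 tokens against a model limit of 512. One way to handle this, sketched under the assumption that the summarization pipeline accepts `truncation=True` at call time, is to let the pipeline truncate over-long inputs. The function name `summarize_truncated` is hypothetical, and the pipeline is passed in explicitly as `summarizer`.

```python
def summarize_truncated(text, summarizer, min_length=100, max_length=500):
    """Summarize `text`, letting the pipeline truncate over-long inputs."""
    doc = summarizer(text,
                     max_length=max_length,
                     min_length=min_length,
                     truncation=True  # drop tokens beyond the model's input limit
                    )
    return doc[0]["summary_text"]


# hypothetical usage with the pipeline created earlier in this notebook:
# df["baseline_summary"] = df["abstract"].apply(
#     lambda text: summarize_truncated(text, model))
```

Truncation discards the end of the abstract, so the summary can only reflect the first 512 tokens of such inputs.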


In [16]:
# write to output
output_path = "../data/milestone3/transformer_baseline/"
output_file = "plos_mini.csv"
df.to_csv(output_path+output_file,
          index=False,
         )
print("Output file completed")

Output file completed


In [17]:
# write to txt file
output_file_txt = "plos.txt"

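# write the baseline_summary column to a txt file, one summary per line
# sep="\n" is used so that commas inside a summary are not treated as delimiters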
txt_df = df['baseline_summary']
txt_df.to_csv(output_path+output_file_txt,
              index=False,
              header=False,
              sep="\n"
             )
print("Output file completed")

Output file completed


In [21]:
# apply to test sets
filepath = "../data/biolaysumm2024_data/"
filenames = ["eLife_test.jsonl",
             "PLOS_test.jsonl"
            ]
output_path = "../data/milestone3/transformer_baseline/test_set/"

output_csv_filenames = ["elife_test.csv",
                        "plos_test.csv"
                       ]

output_txt_filenames = ["elife.txt",
                        "plos.txt"
                       ]

                    

for i, fname in enumerate(filenames):
    print("Loading file = ", fname)
    df = pd.read_json(filepath + fname,
                      orient="records",
                      lines=True
                     )
    print("n rows =", len(df))

    print("Extracting abstract...")
    df["abstract"] = df["article"].apply(lambda text: text.split("\n")[0])
    
    print("Making summaries....")
    df["baseline_summary"] = df["abstract"].apply(summarize)

    print("Writing csv output...")
    df.to_csv(output_path+output_csv_filenames[i],
              index=False,
             )
    print("Output csv file completed")

    txt_df = df['baseline_summary']
    txt_df.to_csv(output_path+output_txt_filenames[i],
                  index=False,
                  header=False,
                  sep="\n"
                 )
    print("Output txt file completed")

print("---- All completed----")
    

Your max_length is set to 500, but your input_length is only 394. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=197)


Loading file =  eLife_test.jsonl
n rows = 142
Extracting abstract...
Making summaries....


Your max_length is set to 500, but your input_length is only 259. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=129)
Your max_length is set to 500, but your input_length is only 352. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=176)
Your max_length is set to 500, but your input_length is only 392. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=196)
Your max_length is set to 500, but your input_length is only 308. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1

Writing csv output...
Output csv file completed
Output txt file completed
Loading file =  PLOS_test.jsonl
n rows = 142
Extracting abstract...
Making summaries....


Your max_length is set to 500, but your input_length is only 477. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=238)
Your max_length is set to 500, but your input_length is only 484. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=242)
Your max_length is set to 500, but your input_length is only 488. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=244)
Your max_length is set to 500, but your input_length is only 464. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2

Writing csv output...
Output csv file completed
Output txt file completed
---- All completed----
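
The loop above keeps input and output file names in three parallel lists and indexes them with `i`. As a sketch only, the same paths could be derived from the dataset names, so the lists cannot drift out of sync. The helper name `build_jobs` is hypothetical; the file-name patterns are taken from the cell above.

```python
from pathlib import Path


def build_jobs(input_dir, output_dir, names):
    """Return (input, csv_output, txt_output) paths for each dataset name."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for name in names:
        jobs.append((input_dir / f"{name}_test.jsonl",
                     output_dir / f"{name.lower()}_test.csv",
                     output_dir / f"{name.lower()}.txt"))
    return jobs


jobs = build_jobs("../data/biolaysumm2024_data/",
                  "../data/milestone3/transformer_baseline/test_set/",
                  ["eLife", "PLOS"])
for in_path, csv_path, txt_path in jobs:
    print(in_path.name, csv_path.name, txt_path.name)
```

Each tuple could then drive one iteration of the load, summarize, and write steps shown above.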
