# Data Extraction

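In this notebook, we extract the abstract and introduction (the first and second paragraphs) from each article, keep them alongside the lay summary, and write a random sample of the rows to a JSON Lines (`.jsonl`) file.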

In [55]:
import pandas as pd

In [72]:
path = "../../biolaysumm2024_data/"
#filename = "eLife_train.jsonl"
filename = "PLOS_train.jsonl"
df = pd.read_json(path + filename,
                  orient="records",
                  lines=True)
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145


In [73]:
item = df.iloc[0]
item

lay_summary    Yersinia pestis , the bacterial agent of plagu...
article        Fleas can transmit Yersinia pestis by two mech...
headings       [Abstract, Introduction, Results, Discussion, ...
keywords       [united states, invertebrates, medicine and he...
id                                          journal.ppat.1006859
Name: 0, dtype: object

In [74]:
item['headings']

['Abstract', 'Introduction', 'Results', 'Discussion', 'Materials and methods']

In [75]:
# count words
len(item["article"].split())

9266

In [76]:
# split by paragraph
paras = item["article"].split("\n")
print(len(paras))

5
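
The article splits into five paragraphs and `headings` also has five entries, so the two can be paired up. The following is a minimal sketch, assuming one paragraph per heading in the same order; the helper name `split_sections` is hypothetical and not part of the original notebook, and the strings are toy data rather than dataset content.

```python
def split_sections(article, headings):
    """Pair each heading with its paragraph (assumes one paragraph per heading)."""
    paragraphs = article.split("\n")
    if len(paragraphs) != len(headings):
        # Fail loudly instead of silently mislabelling sections
        raise ValueError(
            f"expected {len(headings)} paragraphs, found {len(paragraphs)}"
        )
    # strip() removes the leading space carried by later paragraphs
    return {h: p.strip() for h, p in zip(headings, paragraphs)}


# toy example (not real data)
sections = split_sections("first part\n second part", ["Abstract", "Introduction"])
print(sections["Introduction"])
```

Note that a dict keeps only the last paragraph if two headings share the same name, so a list of pairs may be a safer return type for articles with repeated headings.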


In [77]:
# the Abstract section is the first one
print(len(paras[0].split()))
paras[0]

335


'Fleas can transmit Yersinia pestis by two mechanisms , early-phase transmission ( EPT ) and biofilm-dependent transmission ( BDT ) . Transmission efficiency varies among flea species and the results from different studies have not always been consistent . One complicating variable is the species of rodent blood used for the infectious blood meal . To gain insight into the mechanism of EPT and the effect that host blood has on it , fleas were fed bacteremic mouse , rat , guinea pig , or gerbil blood; and the location and characteristics of the infection in the digestive tract and transmissibility of Y . pestis were assessed 1 to 3 days after infection . Surprisingly , 10–28% of two rodent flea species fed bacteremic rat or guinea pig blood refluxed a portion of the infected blood meal into the esophagus within 24 h of feeding . We term this phenomenon post-infection esophageal reflux ( PIER ) . In contrast , PIER was rarely observed in rodent fleas fed bacteremic mouse or gerbil blood 

In [78]:
for pr in paras:
    print(pr)
    print("-------------")

Fleas can transmit Yersinia pestis by two mechanisms , early-phase transmission ( EPT ) and biofilm-dependent transmission ( BDT ) . Transmission efficiency varies among flea species and the results from different studies have not always been consistent . One complicating variable is the species of rodent blood used for the infectious blood meal . To gain insight into the mechanism of EPT and the effect that host blood has on it , fleas were fed bacteremic mouse , rat , guinea pig , or gerbil blood; and the location and characteristics of the infection in the digestive tract and transmissibility of Y . pestis were assessed 1 to 3 days after infection . Surprisingly , 10–28% of two rodent flea species fed bacteremic rat or guinea pig blood refluxed a portion of the infected blood meal into the esophagus within 24 h of feeding . We term this phenomenon post-infection esophageal reflux ( PIER ) . In contrast , PIER was rarely observed in rodent fleas fed bacteremic mouse or gerbil blood .

In [79]:
item['lay_summary']

'Yersinia pestis , the bacterial agent of plague , is transmitted by fleas that feed on blood from rodents that carry this disease . The conclusions from studies comparing how efficiently fleas transmit plague after becoming infected have been inconsistent , possibly because a variety of rodent blood sources have been used . To investigate this , we infected three different flea species with Y . pestis using four different types of rodent blood and compared how well they could transmit three days later . The two rodent flea species that transmitted efficiently tended to reflux bacteria and blood into their esophagus when rat or guinea pig blood was used for the infections , but not when mouse or gerbil blood was used . This reflux phenomenon appears to be related to the solubility of the hemoglobin molecule of different rodent species . In contrast , cat fleas , inefficient transmitters , never refluxed their infected blood meal into the esophagus . Rodent fleas that were infected usin

In [80]:
len(item['lay_summary'].split(' '))

217
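
The article word count earlier uses `split()`, while the lay-summary count uses `split(' ')`. The two can differ when the text contains consecutive spaces or other whitespace. A small illustration on a toy string (an assumption for demonstration, not taken from the dataset):

```python
text = "Y . pestis  is transmitted"  # toy string containing a double space

by_space = text.split(" ")    # keeps an empty string between the two spaces
by_whitespace = text.split()  # collapses any run of whitespace

print(len(by_space), len(by_whitespace))
```

Using `split()` for both counts would keep the numbers comparable.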

In [81]:
# function to extract the first paragraph of the text
def get_abstract(text):
    """
        return the abstract (first paragraph) of the text
    """
    return text.split("\n")[0]

get_abstract(item["article"])

'Fleas can transmit Yersinia pestis by two mechanisms , early-phase transmission ( EPT ) and biofilm-dependent transmission ( BDT ) . Transmission efficiency varies among flea species and the results from different studies have not always been consistent . One complicating variable is the species of rodent blood used for the infectious blood meal . To gain insight into the mechanism of EPT and the effect that host blood has on it , fleas were fed bacteremic mouse , rat , guinea pig , or gerbil blood; and the location and characteristics of the infection in the digestive tract and transmissibility of Y . pestis were assessed 1 to 3 days after infection . Surprisingly , 10–28% of two rodent flea species fed bacteremic rat or guinea pig blood refluxed a portion of the infected blood meal into the esophagus within 24 h of feeding . We term this phenomenon post-infection esophageal reflux ( PIER ) . In contrast , PIER was rarely observed in rodent fleas fed bacteremic mouse or gerbil blood 

In [82]:
# function to extract the second paragraph of the text
def get_introduction(text):
    """
        return the introduction (second paragraph) of the text
    """
    return text.split("\n")[1]

get_introduction(item["article"])

' Y . pestis is transmitted by the bite of infected fleas , and two modes of transmission have been described: early-phase transmission ( EPT ) and biofilm-dependent transmission ( BDT ) [1–3] . Fleas that have taken a highly bacteremic infectious blood meal are capable of EPT on their next feeding attempt within 4 days of becoming infected [4] . An extrinsic incubation period , the time needed for a vector to become infective after acquiring a pathogen , is not required or is very short; fleas can transmit Y . pestis by 24 h after an infectious blood meal [1] . In contrast , BDT does not typically ensue until at least 5–7 days after infection , the time required for a mature biofilm to form in the proventriculus [5 , 6] . The proventriculus is a valve in the flea foregut that regulates the ingress of blood and prevents its backflow into the esophagus [7] . BDT occurs when the Y . pestis biofilm begins to interfere with or block normal blood feeding . In partially blocked fleas , biofi
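
Indexing the split result directly raises `IndexError` for any article that has fewer paragraphs than expected. A guarded variant could look like the sketch below; `get_paragraph` and its `default` parameter are hypothetical names introduced here, not part of the original notebook.

```python
def get_paragraph(text, index, default=""):
    """Return the paragraph at `index`, or `default` if the article is too short."""
    paragraphs = text.split("\n")
    if 0 <= index < len(paragraphs):
        # strip() drops the leading space seen on paragraphs after the first
        return paragraphs[index].strip()
    return default


# toy example: an article with only one paragraph
print(repr(get_paragraph("only an abstract", 1)))
```

With such a helper, `df["article"].apply(lambda t: get_paragraph(t, 1))` would fill missing introductions with an empty string instead of stopping the run.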

In [83]:
# apply to dataset
df["abstract"] = df["article"].apply(get_abstract)
df["introduction"] = df["article"].apply(get_introduction)
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract,introduction
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859,Fleas can transmit Yersinia pestis by two mech...,Y . pestis is transmitted by the bite of infe...
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170,Endogenous retroviruses ( ERVs ) are remnants ...,An essential step in the replication cycle of...
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720,The Drosophila embryonic gonad is assembled fr...,The hedgehog ( hh ) signaling pathway plays a...
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063,"Recently , we presented a study of adult neuro...",The adult mammalian brain contains two neurog...
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145,Understanding the transcriptional regulation o...,Pluripotent stem cells can give rise to all f...


In [84]:
len(df)

138

In [85]:
import random
random.seed(531)

number_of_sample = 200
# sample row positions without replacement, in random order;
# cap at len(df) so a small dataset does not raise an error
lst = random.sample(range(len(df)), min(number_of_sample, len(df)))
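
The same draw can also be expressed with pandas' own `DataFrame.sample`, which samples without replacement by default and accepts a `random_state` for reproducibility. This is a sketch of an alternative, not the method used in this notebook; `sample_rows` is a hypothetical helper and the frame below is toy data.

```python
import pandas as pd


def sample_rows(frame, n, seed=531):
    """Return up to n distinct rows in random order, reproducibly."""
    # DataFrame.sample raises if n exceeds the number of rows, so cap it
    n = min(n, len(frame))
    return frame.sample(n=n, random_state=seed)


toy = pd.DataFrame({"id": range(10)})  # toy frame, not the real dataset
subset = sample_rows(toy, 200)
print(len(subset))
```

Because the request is capped at the frame size, asking for 200 rows from a smaller dataset returns every row once, in shuffled order.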

In [86]:
output_df = df.iloc[lst]
output_df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract,introduction
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859,Fleas can transmit Yersinia pestis by two mech...,Y . pestis is transmitted by the bite of infe...
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170,Endogenous retroviruses ( ERVs ) are remnants ...,An essential step in the replication cycle of...
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720,The Drosophila embryonic gonad is assembled fr...,The hedgehog ( hh ) signaling pathway plays a...
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063,"Recently , we presented a study of adult neuro...",The adult mammalian brain contains two neurog...
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145,Understanding the transcriptional regulation o...,Pluripotent stem cells can give rise to all f...


In [87]:
output_path = ""
output_filename = filename.split('.')[0]+'_mini.jsonl'
print("Writing output to", output_filename)
output_df.to_json(output_path + output_filename, orient='records', lines=True)
print("Completed")

Writing output to PLOS_val_mini_mini.jsonl
Completed
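
Deriving the output name with `filename.split('.')[0]` drops everything after the first dot, which matters for names that contain more than one. A hedged alternative using `pathlib` is sketched below; `mini_name` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def mini_name(filename, suffix="_mini"):
    """Insert a suffix before the extension, e.g. data.jsonl -> data_mini.jsonl."""
    p = Path(filename)
    # p.stem is the name without its last extension; p.suffix is that extension
    return f"{p.stem}{suffix}{p.suffix}"


print(mini_name("PLOS_train.jsonl"))
```

The helper returns only the file name, so any directory part still has to be supplied separately, as `output_path` does in the cell above.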


In [88]:
df[['keywords', 'abstract']]

Unnamed: 0,keywords,abstract
0,"[united states, invertebrates, medicine and he...",Fleas can transmit Yersinia pestis by two mech...
1,"[viruses, sheep, virology]",Endogenous retroviruses ( ERVs ) are remnants ...
2,[],The Drosophila embryonic gonad is assembled fr...
3,[computational biology/computational neuroscie...,"Recently , we presented a study of adult neuro..."
4,"[developmental biology, cell biology, mammals,...",Understanding the transcriptional regulation o...
...,...,...
133,[],Inhibitory interneurons play critical roles in...
134,"[medicine and health sciences, cancer risk fac...",The BRCA Challenge is a long-term data-sharing...
135,"[insulin-dependent signal transduction, invert...",The small GTPase RAS is among the most prevale...
136,"[taxonomy, medicine and health sciences, coron...",Recent years have seen the development of nume...
