# Data Extraction

In this notebook, we extract the abstract and conclusion sections from each article and write them, together with the lay summaries, to CSV files.

In [1]:
import pandas as pd

In [2]:
path = "./data/biolaysumm2024_data/"
filename = "eLife_train.jsonl"
df = pd.read_json(path + filename,
                  orient="records",
                  lines=True)
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"In the USA , more deaths happen in the winter ...","In temperate climates , winter deaths exceed s...","[Abstract, Introduction, Results, Discussion, ...",[epidemiology and global health],elife-35500-v1
1,Most people have likely experienced the discom...,Whether complement dysregulation directly cont...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-48378-v2
2,The immune system protects an individual from ...,Variation in the presentation of hereditary im...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-04494-v1
3,The brain adapts to control our behavior in di...,Rapid and flexible interpretation of conflicti...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-12352-v2
4,Cells use motor proteins that to move organell...,Myosin 5a is a dual-headed molecular motor tha...,"[Abstract, Introduction, Results, Discussion, ...",[structural biology and molecular biophysics],elife-05413-v2


In [3]:
item = df.iloc[0]
item

lay_summary    In the USA , more deaths happen in the winter ...
article        In temperate climates , winter deaths exceed s...
headings       [Abstract, Introduction, Results, Discussion, ...
keywords                        [epidemiology and global health]
id                                                elife-35500-v1
Name: 0, dtype: object

In [4]:
# count words
len(item["article"].split())

3039
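
The same word count can be computed for every article at once with pandas string methods. The sketch below uses a small made-up DataFrame (`toy`) rather than the real dataset. Because the text is pre-tokenized with spaces around punctuation, punctuation marks are counted as words.

```python
import pandas as pd

# Toy stand-in for the "article" column; the real data comes from the JSONL file.
toy = pd.DataFrame({
    "article": [
        "In temperate climates , winter deaths exceed summer ones .",
        "Myosin 5a is a motor .",
    ]
})

# Split on whitespace, then count the tokens in each row.
word_counts = toy["article"].str.split().str.len()
print(word_counts.tolist())
```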

In [5]:
# split by paragraph
paras = item["article"].split("\n")
print(len(paras))

5
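
The `headings` column appears to list one heading per newline-separated section. A minimal sketch, assuming that one-to-one alignment holds (this should be checked per record), pairs each heading with its section text. The helper name `sections_by_heading` and the toy record are hypothetical.

```python
def sections_by_heading(article, headings):
    """Pair each heading with its newline-separated section (assumes equal counts)."""
    sections = article.split("\n")
    if len(sections) != len(headings):
        # The alignment is an assumption; fail loudly when it does not hold.
        raise ValueError(
            f"{len(headings)} headings but {len(sections)} sections"
        )
    # Note: repeated headings would overwrite each other in this dict.
    return {h: s.strip() for h, s in zip(headings, sections)}


# Toy record standing in for one row of the dataset.
toy_article = "Short abstract .\n Some background .\n Main findings ."
toy_headings = ["Abstract", "Introduction", "Results"]
toy_sections = sections_by_heading(toy_article, toy_headings)
print(toy_sections["Abstract"])
```

Looking sections up by name avoids relying on a fixed position such as "first" or "second to last".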


In [6]:
# the Abstract section is the first one
print(len(paras[0].split()))
paras[0]

171


'In temperate climates , winter deaths exceed summer ones . However , there is limited information on the timing and the relative magnitudes of maximum and minimum mortality , by local climate , age group , sex and medical cause of death . We used geo-coded mortality data and wavelets to analyse the seasonality of mortality by age group and sex from 1980 to 2016 in the USA and its subnational climatic regions . Death rates in men and women ≥ 45 years peaked in December to February and were lowest in June to August , driven by cardiorespiratory diseases and injuries . In these ages , percent difference in death rates between peak and minimum months did not vary across climate regions , nor changed from 1980 to 2016 . Under five years , seasonality of all-cause mortality largely disappeared after the 1990s . In adolescents and young adults , especially in males , death rates peaked in June/July and were lowest in December/January , driven by injury deaths . '

In [7]:
for pr in paras:
    print(pr)
    print("-------------")

In temperate climates , winter deaths exceed summer ones . However , there is limited information on the timing and the relative magnitudes of maximum and minimum mortality , by local climate , age group , sex and medical cause of death . We used geo-coded mortality data and wavelets to analyse the seasonality of mortality by age group and sex from 1980 to 2016 in the USA and its subnational climatic regions . Death rates in men and women ≥ 45 years peaked in December to February and were lowest in June to August , driven by cardiorespiratory diseases and injuries . In these ages , percent difference in death rates between peak and minimum months did not vary across climate regions , nor changed from 1980 to 2016 . Under five years , seasonality of all-cause mortality largely disappeared after the 1990s . In adolescents and young adults , especially in males , death rates peaked in June/July and were lowest in December/January , driven by injury deaths . 
-------------
 It is well-esta

In [8]:
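# MAX_CHAR = 1000
# NOTE: with -1, the slice text[:MAX_CHAR] drops only the last character;
# use None instead to disable truncation completely.
MAX_CHAR = -1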
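
The effect of the slice bound is easy to check on a short string. The sketch below uses a hypothetical helper `truncate` to compare a stop index of `-1`, `None`, and a positive limit.

```python
def truncate(text, max_chars=None):
    """Return text[:max_chars]; None means no truncation."""
    return text[:max_chars]


sample = "winter deaths ."

print(repr(truncate(sample, -1)))    # everything except the final character
print(repr(truncate(sample, None)))  # the whole string
print(repr(truncate(sample, 6)))     # at most 6 characters
```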


In [9]:
# extract the first paragraph of the text
def get_abstract(text):
    """
        Return the abstract (first paragraph) of the text,
        truncated by the MAX_CHAR slice.
    """
    return text.split("\n")[0][:MAX_CHAR]

get_abstract(item["article"])

'In temperate climates , winter deaths exceed summer ones . However , there is limited information on the timing and the relative magnitudes of maximum and minimum mortality , by local climate , age group , sex and medical cause of death . We used geo-coded mortality data and wavelets to analyse the seasonality of mortality by age group and sex from 1980 to 2016 in the USA and its subnational climatic regions . Death rates in men and women ≥ 45 years peaked in December to February and were lowest in June to August , driven by cardiorespiratory diseases and injuries . In these ages , percent difference in death rates between peak and minimum months did not vary across climate regions , nor changed from 1980 to 2016 . Under five years , seasonality of all-cause mortality largely disappeared after the 1990s . In adolescents and young adults , especially in males , death rates peaked in June/July and were lowest in December/January , driven by injury deaths .'

In [10]:
# extract the second-to-last paragraph of the text
def get_conclusion(text):
    """
        Return the conclusion (second-to-last paragraph) of the text,
        truncated by the MAX_CHAR slice.
    """
    return text.split("\n")[-2][:MAX_CHAR]

get_conclusion(item["article"])

' We used wavelet and centre of gravity analyses , which allowed systematically identifying and characterizing seasonality of total and cause-specific mortality in the USA , and examining how seasonality has changed over time . We identified distinct seasonal patterns in relation to age and sex , including higher all-cause summer mortality in young men ( Feinstein , 2002; Rau et al . , 2018 ) . Importantly , we also showed that all-cause and cause-specific mortality seasonality is largely similar in terms of both timing and magnitude across diverse climatic regions with substantially different summer and winter temperatures . Insights of this kind would not have been possible analysing data averaged over time or nationally , or fixed to pre-specified frequencies . Prior studies have noted seasonality of mortality for all-cause mortality and for specific causes of death in the USA ( Feinstein , 2002; Kalkstein , 2013; Rau , 2004; Rau et al . , 2018; Rosenwaike , 1966; Seretakis et al . 
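
Indexing with `[-2]` raises an `IndexError` for any article that has only one paragraph. A minimal sketch of a more defensive accessor follows; the name `get_paragraph` and the choice of an empty-string default are assumptions, not part of the original notebook. It also strips the leading space visible in the output above.

```python
def get_paragraph(text, index, default=""):
    """Return the paragraph at `index`, or `default` if the article is too short."""
    paragraphs = text.split("\n")
    try:
        return paragraphs[index].strip()
    except IndexError:
        # Fewer paragraphs than expected: fall back instead of failing.
        return default


print(repr(get_paragraph("Abstract .\n Discussion .\n Methods .", -2)))
print(repr(get_paragraph("Only one paragraph .", -2)))
```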

In [11]:
# apply to dataset
df["abstract"] = df["article"].apply(get_abstract)
df["conclusion"] = df["article"].apply(get_conclusion)
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,abstract,conclusion
0,"In the USA , more deaths happen in the winter ...","In temperate climates , winter deaths exceed s...","[Abstract, Introduction, Results, Discussion, ...",[epidemiology and global health],elife-35500-v1,"In temperate climates , winter deaths exceed s...",We used wavelet and centre of gravity analyse...
1,Most people have likely experienced the discom...,Whether complement dysregulation directly cont...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-48378-v2,Whether complement dysregulation directly cont...,Mechanistic advances in our understanding of ...
2,The immune system protects an individual from ...,Variation in the presentation of hereditary im...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-04494-v1,Variation in the presentation of hereditary im...,We report that HOIL-1 is essential during inf...
3,The brain adapts to control our behavior in di...,Rapid and flexible interpretation of conflicti...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-12352-v2,Rapid and flexible interpretation of conflicti...,We used intracranial field potentials to meas...
4,Cells use motor proteins that to move organell...,Myosin 5a is a dual-headed molecular motor tha...,"[Abstract, Introduction, Results, Discussion, ...",[structural biology and molecular biophysics],elife-05413-v2,Myosin 5a is a dual-headed molecular motor tha...,Label-sizes of a few tens of nm are tradition...


In [12]:
# drop the article column to reduce file size
output_df = df.drop(columns=["article"])
output_df.head()

Unnamed: 0,lay_summary,headings,keywords,id,abstract,conclusion
0,"In the USA , more deaths happen in the winter ...","[Abstract, Introduction, Results, Discussion, ...",[epidemiology and global health],elife-35500-v1,"In temperate climates , winter deaths exceed s...",We used wavelet and centre of gravity analyse...
1,Most people have likely experienced the discom...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-48378-v2,Whether complement dysregulation directly cont...,Mechanistic advances in our understanding of ...
2,The immune system protects an individual from ...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-04494-v1,Variation in the presentation of hereditary im...,We report that HOIL-1 is essential during inf...
3,The brain adapts to control our behavior in di...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-12352-v2,Rapid and flexible interpretation of conflicti...,We used intracranial field potentials to meas...
4,Cells use motor proteins that to move organell...,"[Abstract, Introduction, Results, Discussion, ...",[structural biology and molecular biophysics],elife-05413-v2,Myosin 5a is a dual-headed molecular motor tha...,Label-sizes of a few tens of nm are tradition...


In [13]:
filename

'eLife_train.jsonl'

In [14]:
path = "./data/extracted/"
filename = "eLife_train.csv"
print("Writing output to", filename)
output_df.to_csv(path + filename)
print("Completed")

Writing output to eLife_train.csv
Completed
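
Two details are worth checking when writing the file: `to_csv` fails if the target directory does not exist, and without `index=False` the row index is written as an unnamed first column that reads back as `Unnamed: 0`. The sketch below illustrates both on a toy DataFrame; the temporary directory stands in for `./data/extracted/`, and all names in it are illustrative.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"id": ["a-1", "b-2"], "abstract": ["First .", "Second ."]})

# A temporary directory stands in for "./data/extracted/".
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "extracted")
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    out_file = os.path.join(out_dir, "toy.csv")

    # index=False leaves the row index out of the file.
    toy_df.to_csv(out_file, index=False)
    round_trip = pd.read_csv(out_file)

print(list(round_trip.columns))
```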


### Apply data extraction for all datasets

In [15]:
# testing filename conversion jsonl -> csv
filename = "eLife_train.jsonl"
print(filename)
print(filename[:filename.rfind(".")] + "_extracted.csv") # should be eLife_train_extracted.csv

eLife_train.jsonl
eLife_train_extracted.csv
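
The same conversion can be written with `pathlib`, which avoids manual index arithmetic. This is an alternative sketch, not the approach used in the cells of this notebook; the helper name `extracted_name` is hypothetical. Note that `Path.stem` also drops any directory part of the input.

```python
from pathlib import Path


def extracted_name(filename, suffix="_extracted.csv"):
    """Replace the final extension of `filename` with `suffix`."""
    return Path(filename).stem + suffix


print(extracted_name("eLife_train.jsonl"))
```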


In [19]:
file_path = "./data/biolaysumm2024_data/"
file_names = ["eLife_train.jsonl", "eLife_val.jsonl", "eLife_test.jsonl",
              "PLOS_train.jsonl", "PLOS_val.jsonl", "PLOS_test.jsonl"
             ]

output_path = "./data/extracted/"

print("Abstract text extraction:")
print("=============================")
for filename in file_names:
    print("Processing file =", filename)
    df = pd.read_json(file_path+filename,
                       orient="records",
                       lines=True)
    print("Number of records =", len(df))

    # apply get_abstract function
    print("Getting abstracts")
    df["abstract"] = df["article"].apply(get_abstract)
    print("Getting conclusions")
    df["conclusion"] = df["article"].apply(get_conclusion)
    print("Completed")
    # character-count statistics
    abstract_len = df["abstract"].str.len()
    print("Abstract char count =\n", abstract_len.describe())
    if "lay_summary" in df.columns:
        lay_summary_len = df["lay_summary"].str.len()
        print("Summary char count =\n", lay_summary_len.describe())
    
    output_df = df.drop(columns=["article"])
    output_filename = filename[:filename.rfind(".")] + "_extracted_max1000.csv"
    print("Writing output to", output_filename)
    # output_df.to_csv(output_path + output_filename)

    # write txt file
    output_df_txt = df["abstract"]
    output_filename_txt = filename[:filename.rfind(".")] + "_extracted_max1000.txt"
    print("Writing output to", output_filename_txt)
    # output_df_txt.to_csv(output_path + output_filename_txt,
    #                     index=False,
    #                     sep="\n",
    #                     header=False)
    
    
    print("Completed")
    print("--------------------")

print("======= completed ========")

Abstract text extraction:
Processing file = eLife_train.jsonl
Number of records = 4346
Getting abstracts
Getting conclusions
Completed
Abstract char count =
 count    4346.000000
mean     1086.333870
std       139.682148
min       448.000000
25%      1028.000000
50%      1081.000000
75%      1135.000000
max      3043.000000
Name: abstract, dtype: float64
Summary char count =
 count    4346.000000
mean     2211.847906
std       371.989210
min      1006.000000
25%      1966.000000
50%      2191.000000
75%      2449.000000
max      4154.000000
Name: lay_summary, dtype: float64
Writing output to eLife_train_extracted_max1000.csv
Writing output to eLife_train_extracted_max1000.txt
Completed
--------------------
Processing file = eLife_val.jsonl
Number of records = 241
Getting abstracts
Getting conclusions
Completed
Abstract char count =
 count     241.000000
mean     1080.215768
std       131.387829
min       539.000000
25%      1018.000000
50%      1073.000000
75%      1130.000000
max     