Prompt: I want to rank the sentences in the passage
ChatGPT response: Pre-trained language models: You can use pre-trained language models like BERT, GPT, or RoBERTa to generate sentence embeddings, and then compute the similarity between these embeddings and the abstract's embedding. Sentences with higher similarity scores are likely to be more important.
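
A minimal sketch of the idea in the response above: score each sentence by the cosine similarity between its embedding and the abstract's embedding, then rank by score. The helper names and the 2-D toy vectors are hypothetical stand-ins; in practice the vectors would come from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_abstract_similarity(sentence_embeddings, abstract_embedding):
    """Return sentence indices ordered from most to least similar to the abstract."""
    scores = [cosine(e, abstract_embedding) for e in sentence_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D vectors standing in for real model embeddings (hypothetical values)
toy_sentences = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_abstract = [0.0, 1.0]
ranking = rank_by_abstract_similarity(toy_sentences, toy_abstract)
print(ranking)
```

Note that the function defined further down scores sentences differently: it uses each sentence's average similarity to all sentences in the paper rather than its similarity to the abstract.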

In [None]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.0/86.0 kB 10.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m103.9 MB/s[0m eta [36m0:00:00[0m
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 86.0 MB/s eta 0:00:00
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.14.1-py

In [None]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from tqdm import tqdm

# flatten the format of nested list
def flatten(nested_list):
    return [item for sublist in nested_list for item in sublist]

# get the top k sentences based on their sentence embeddings
def get_top_k_sentences(paper, model_name='all-MiniLM-L6-v2', k=20):
    # Use the GPU when one is available; otherwise fall back to the CPU
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Running on GPU")
    else:
        device = torch.device("cpu")
        print("Running on CPU")

    model = SentenceTransformer(model_name, device=device)
    section_series = paper["sections"]
    section_arr = section_series.values

    # Flatten the passage structure
    print("Flattening the passage structure...")
    sentences = flatten(section_arr[0])
    if len(sentences) <= k:
        # Nothing to rank: return the flat list itself, not a list wrapping it
        return sentences
    print(f"Total sentences: {len(sentences)}")

    # Generate embeddings
    print("Generating sentence embeddings...")
    embeddings = []
    for sentence in tqdm(sentences, desc="Generating embeddings"):
        embedding = model.encode([sentence])
        embeddings.append(embedding[0])
    print("Embeddings generated.")

    # Calculate pairwise cosine similarity
    print("Calculating pairwise cosine similarity...")
    similarity_matrix = cosine_similarity(embeddings)
    print("Similarity matrix calculated.")

    # Compute average similarity for each sentence
    print("Computing average similarity for each sentence...")
    avg_similarity = np.mean(similarity_matrix, axis=1)
    print("Average similarity computed.")

    # Get indices of top k sentences
    top_k_indices = np.argpartition(avg_similarity, -k)[-k:]

    # Sort indices to maintain the original order of sentences
    top_k_indices = sorted(top_k_indices)

    # Return the top k sentences
    return [sentences[i] for i in top_k_indices]
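
The scoring step above can be traced on toy data. This sketch, with hypothetical helper names and made-up 2-D vectors, computes each vector's average cosine similarity to every vector (itself included, as `np.mean` over the full similarity matrix does) and keeps the top k indices in their original order. It assumes no vector is all zeros.

```python
import heapq
import math

def average_similarity(embeddings):
    """Mean cosine similarity of each vector to every vector, itself included."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    scores = []
    for i, u in enumerate(embeddings):
        total = 0.0
        for j, v in enumerate(embeddings):
            dot = sum(a * b for a, b in zip(u, v))
            total += dot / (norms[i] * norms[j])
        scores.append(total / len(embeddings))
    return scores

def top_k_in_order(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(best)

# Toy vectors (hypothetical): the two middle ones sit closest to the rest
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
scores = average_similarity(toy)
picked = top_k_in_order(scores, 2)
print(picked)
```

Sorting the selected indices is what keeps the chosen sentences in reading order rather than in score order.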

In [None]:
import pandas as pd
from zipfile import ZipFile

# Load dataset
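def loadDataset(path, fileName):
  # Read the JSON file straight out of the zip archive and close both handles afterwards
  with ZipFile(path) as zip_file, zip_file.open(fileName) as json_file:
    return pd.read_json(json_file)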
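
For reference, the same "read JSON from inside a zip without extracting it" step can be sketched with only the standard library. The function name and the in-memory archive are hypothetical; the record layout is made up for illustration and is not the real eLife schema.

```python
import io
import json
import zipfile

def load_json_from_zip(archive, member):
    """Parse one JSON member of a zip archive without extracting it to disk."""
    with zipfile.ZipFile(archive) as zf, zf.open(member) as fh:
        return json.load(fh)

# Build a tiny archive in memory so the sketch runs without any files on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("test.json", json.dumps([{"id": "paper-1", "sections": [["A .", "B ."]]}]))
buffer.seek(0)

records = load_json_from_zip(buffer, "test.json")
print(records[0]["id"])
```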

In [None]:
import json
import os
import pickle
import glob

In [None]:
def most_recent_pickle_index(save_folder_path):
  filenames = os.listdir(save_folder_path)

  # Filter filenames that match the pattern "checkpoint_{index}.pkl"
  filtered_filenames = [filename for filename in filenames if filename.startswith('checkpoint_') and filename.endswith('.pkl')]

  # Extract the indices from the filenames and convert them to integers
  indices = [int(filename.split('_')[1].split('.')[0]) for filename in filtered_filenames]

  # Return the highest checkpoint index, or None if there are no checkpoints yet
  if indices:
      return max(indices)
  else:
      return None
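
The split-based parsing above raises `ValueError` on a name such as `checkpoint_old.pkl`. A possible stricter variant, sketched here with a hypothetical function name, matches the whole filename against a regular expression and ignores anything that does not fit the `checkpoint_{index}.pkl` pattern.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"checkpoint_(\d+)\.pkl")

def latest_checkpoint_index(folder):
    """Highest index among checkpoint_{index}.pkl files, or None if there are none."""
    matches = (CHECKPOINT_RE.fullmatch(name) for name in os.listdir(folder))
    indices = [int(m.group(1)) for m in matches if m]
    return max(indices, default=None)

# Demo in a throwaway folder with two valid names and two that should be ignored
with tempfile.TemporaryDirectory() as tmp:
    for name in ["checkpoint_3.pkl", "checkpoint_12.pkl", "checkpoint_old.pkl", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    latest = latest_checkpoint_index(tmp)
print(latest)
```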

In [None]:
def summarize_with_checkpoint(path, JSONfileName, summary_function, nameOfYourSummary, save_folder_path, **kwargs):
  '''
  path: path to the zip archive holding the original dataset
  JSONfileName: name of the dataset's JSON file inside the archive
  summary_function: takes a (1, n) row of the original dataset as input, returns the summary
  nameOfYourSummary: the column name to save the new summary under in the final dataframe
  save_folder_path: folder holding the checkpoint_{index}.pkl files
  **kwargs: extra keyword arguments forwarded to summary_function
  '''
  recent_index = most_recent_pickle_index(save_folder_path)
  print(f'recent index is {recent_index}')

  original_dataset = loadDataset(path, JSONfileName)

  if recent_index is None:
    # First run: start from an empty frame with the expected columns
    cols_to_copy = ['id', 'year', 'title', 'abstract', 'summary', 'keywords']
    our_summary = pd.DataFrame(columns=cols_to_copy + [nameOfYourSummary])
    recent_index = -1
  else:
    # Resume from the most recent checkpoint
    our_summary = pd.read_pickle(os.path.join(save_folder_path, f'checkpoint_{recent_index}.pkl'))

  # Checkpoint indices double as row positions, which assumes the default 0..n-1 index
  for index, row in original_dataset.iloc[recent_index+1:].iterrows():
    paper = pd.Series(row).copy().to_frame().T
    generated_summary = summary_function(paper, **kwargs)
    print(generated_summary)
    paper[nameOfYourSummary] = [generated_summary]

    our_summary = pd.concat([our_summary, paper])

    # Save a checkpoint after every paper so an interrupted run can resume
    our_summary.to_pickle(os.path.join(save_folder_path, f'checkpoint_{index}.pkl'))
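
The resume behaviour of the function above can be seen in isolation with a small standard-library sketch. Everything here is hypothetical (the function name, the list-based results, the simulated failure); it only illustrates the pattern of saving cumulative results after each item and restarting from the highest saved index.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(items, work, folder):
    """Apply `work` to each item, pickling the cumulative results after every item."""
    indices = [int(name[len("checkpoint_"):-len(".pkl")])
               for name in os.listdir(folder)
               if name.startswith("checkpoint_") and name.endswith(".pkl")]
    last = max(indices, default=-1)
    results = []
    if last >= 0:
        # Resume: reload everything computed up to and including item `last`
        with open(os.path.join(folder, f"checkpoint_{last}.pkl"), "rb") as f:
            results = pickle.load(f)
    for i in range(last + 1, len(items)):
        results.append(work(items[i]))
        with open(os.path.join(folder, f"checkpoint_{i}.pkl"), "wb") as f:
            pickle.dump(results, f)
    return results

with tempfile.TemporaryDirectory() as tmp:
    calls = []

    def flaky_square(x):
        # Fails the first time it sees 3, to simulate an interrupted run
        calls.append(x)
        if x == 3 and calls.count(3) == 1:
            raise RuntimeError("simulated interruption")
        return x * x

    try:
        run_with_checkpoints([1, 2, 3, 4], flaky_square, tmp)
    except RuntimeError:
        pass
    resumed = run_with_checkpoints([1, 2, 3, 4], flaky_square, tmp)
print(resumed)
```

The second call skips items 1 and 2 because their results were already saved. Only the item that failed and the ones after it are recomputed.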


In [None]:
!rm -rf pickles
!mkdir pickles

In [None]:
summarize_with_checkpoint(path = 'elife.zip', JSONfileName = 'test.json', summary_function = get_top_k_sentences, nameOfYourSummary = 'topK_importance', save_folder_path = '/content/pickles/', model_name = "all-MiniLM-L6-v2", k=20)

recent index is 230
Running on GPU
Flattening the passage structure...
Total sentences: 365
Generating sentence embeddings...


Generating embeddings: 100%|██████████| 365/365 [00:03<00:00, 117.56it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['DOI: http://dx . doi . org/10 . 7554/eLife . 12401 . 00310 . 7554/eLife . 12401 . 004Figure 1—figure supplement 1 . Developing adult pigment patterns during peak stages of aox5+ fast projections in zebrafish . In zebrafish ( 7 . 5 SSL ) , melanophores are found along the horizontal myoseptum ( e . g . , outline and inset ) and an interstripe region with pigmented xanthophores ( inset , arrow ) has started to develop .', 'Scale bar: 200 µm . DOI: http://dx . doi . org/10 . 7554/eLife . 12401 . 00410 . 7554/eLife . 12401 . 005Figure 1—figure supplement 2 . Zebrafish\xa0aox5+ cells extend fast projections independently of melanophores and iridophores .', 'Right , Despite increased numbers of melanophores , there were no effects on interstripe xanthophore numbers ( p=0 . 8 ) , suggesting that aox5+ cells do not alter 

Generating embeddings: 100%|██████████| 400/400 [00:03<00:00, 117.95it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['Phosphorylation between DDR1 dimers occurs both on the juxtamembrane region and the kinase activation loop and can be elicited by different types of ligand .', 'This allowed us to monitor exclusively the phosphorylation of receiver DDR1 by Western blotting with the Ab against phosphorylated tyrosine-513 . 10 . 7554/eLife . 25716 . 003Figure 1 . Co-expression of DDR1 donor kinase with signalling-incompetent receiver DDR1 mutants leads to collagen-induced phosphorylation of receiver DDR1 . ( A ) Schematic diagrams of wild-type and mutant DDR1b .', 'Collagen binding induces phosphorylation of DDR1b on cytoplasmic tyrosine residues; Y513 in the JM region and the three activation loop tyrosines are shown in phosphorylated form for WT DDR1b as yellow circles .', '( C ) Schematic diagram showing that co-expression of dim

Generating embeddings: 100%|██████████| 321/321 [00:02<00:00, 116.65it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['To evaluate the interaction of PsAvh52 with GmTAP1 , we observed the localization of PsAvh52 and GmTAP1 during transient co-expression in N . benthamiana .', 'These data suggested that PsAvh52 could cause the relocation of GmTAP1 into the nucleus when the two proteins were co-expressed in N . benthamiana .', 'These data suggested that PsAvh52 causes GmTAP1 to relocate into the nucleus through a specific interaction .', 'As shown in Figure 3 , PsAvh52M4 could not cause GmTAP1 to relocate into the nucleus when the two were transiently co-expressed in N . benthamiana .', 'These results suggest that the amino acids 69\xa0–\xa086 of PsAvh52 , that are essential for the interaction , are also required for its ability to trigger the relocation of GmTAP1 into the nucleus .', 'When PsAvh52-RFP was co-expressed with GFP-GmT

Generating embeddings: 100%|██████████| 424/424 [00:03<00:00, 117.40it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['TRIM5 proteins also contain one of two different C-terminal viral core recognition domains , a B30 . 2/SPRY domain in TRIM5α ( hereafter termed SPRY ) or a cyclophilin A ( CypA ) domain in TRIMCyp ( Brennan et al . , 2008; Newman et al . , 2008; Nisole et al . , 2004; Sayah et al . , 2004; Stremlau et al . , 2005; 2006; Virgen et al . , 2008 ) . 10 . 7554/eLife . 16269 . 003Figure 1 . ECT analysis of TRIM5-21R 2D crystals .', 'Consistent with the pattern recognition model , TRIM5-21R was shown to assemble into open hexagonal lattices , both alone and on the surface of 2D CA crystals that mimic the surface of the HIV-1 capsid ( Ganser-Pornillos et al . , 2011 ) .', 'Domain positions therefore had to be inferred , and were interpreted in the absence of high-resolution information on the structure of the TRIM5 protei

Generating embeddings: 100%|██████████| 261/261 [00:02<00:00, 115.96it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['At this stage , homologous chromosomes are brought into a loose 400 nm-wide alignment , whilst lateral element proteins SYCP2 and SYCP3 are recruited to the chromosome axes in an inter-dependent manner ( Pelttari et al . , 2001; Yang et al . , 2006 ) .', 'We determine that SYCP3 is a tetrameric protein and that its helical core folds in an elongated rod-like structure spanning 20 nm in length .', 'We show that SYCP3 can bind DNA through the N-terminal regions extending from its tetrameric core .', 'As the DNA-binding sites are located at both tetramer ends , SYCP3 can act as a physical strut to hold distant regions of DNA together .', '( C ) The crystal structure of SYCP3Core is shown with a 90° rotation around its longitudinal axis; chains A-D are depicted in purple , salmon , teal and blue .', 'A noticeable cons

Generating embeddings: 100%|██████████| 300/300 [00:02<00:00, 118.68it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['( B ) Quantification of pErk staining of FSCs and prefollicle cells just downstream from the niche within a wildtype or Egfrf24 FSC clone .', '( B ) A wildtype GFP ( − ) FSC clone with bright pErk in FSCs ( white arrows ) , and in prefollicle cells that recently divided from the FSC in the clone ( blue asterisks in B′–B′′′ ) .', 'To determine whether this pErk signal is dependent upon EGFR , we generated FSC clones that are homozygous for Egfrf24 , a loss-of-function allele , and stained for pErk .', 'Interestingly , all Egfrf24 FSC clones and a subset of early Egfrf24 prefollicle cell clones had severe morphological defects that suggested a loss of cell polarity .', 'To determine whether these cells had polarity defects , we stained ovarioles with Egfrf24 FSC clones for markers of apical , lateral , and basal ide

Generating embeddings: 100%|██████████| 310/310 [00:02<00:00, 117.49it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['In addition , recent research has shown that the NeST lncRNA also binds WDR5 to upregulate IFN-γ expression through H3K4me3 ( Gomez et al . , 2013 ) , suggesting the existence of multiple different enhancing lncRNAs that function via WDR5 interactions .', 'We find that WDR5 binds over a thousand endogenous RNAs and that RNA binding is essential for WDR5 maintenance of ESC pluripotency .', 'To pinpoint the functional consequences of a selective lncRNA-binding mutation of WDR5 , we further analyzed WDR5 F266A .', 'In contrast to the other HOTTIP binding mutations , WDR5 F266A is defective in lncRNA binding in vitro and in vivo , but without any defects in binding MLL complex subunits RbBP5 or MLL1 in immunoprecipitation experiments ( Figure 1D ) .', 'We reasoned that the F266A mutation offered an experimental strate

Generating embeddings: 100%|██████████| 421/421 [00:03<00:00, 117.72it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['Here , we use population genomic data and laboratory evolution experiments on S . paradoxus and its sibling species S . cerevisiae to investigate the hybrid reactivation hypothesis and , more generally , the factors governing TE accumulation in natural and experimental lineages .', 'Rather , we show that deterministic factors like population structure and the properties of individual hybrid genotypes are major determinants of TE content evolution in hybrid genomes .', 'Based on the assembly annotations , Ty1 was the most abundant family and exhibited the largest CN variation among lineages , with a striking difference between SpB and SpC ( Figure 2c ) .', 'We thus explored the evolutionary dynamics of Ty families within each genome .', 'Overall , this result indicated that CN variation reflected variation in famil

Generating embeddings: 100%|██████████| 271/271 [00:02<00:00, 122.31it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['A highly conserved\xa0~780 bp enhancer called the ZRS controls the spatiotemporal expression of the Shh gene in the ZPA of both the fore and hind limbs ( Lettice et al . , 2002; 2003; Sagai et al . , 2005 ) .', 'Elevated frequencies of Shh/ZRS co-localization were observed only in the Shh expressing regions of the limb bud ( Amano et al . , 2009 ) , in a conformation consistent with enhancer-promoter loop formation ( Williamson et al . , 2016 ) .', 'We demonstrate that even though the activity of the ZRS is restricted to the ZPA , it retains features of a poised enhancer along the full distal portion of the limb bud composed of the mesenchymal cells of the progress zone; whereas , H3K27ac is enriched just in the distal-posterior limb region .', 'Even though Shh was not expressed in the anterior region of the limb 

Generating embeddings: 100%|██████████| 370/370 [00:03<00:00, 120.65it/s]


Embeddings generated.
Calculating pairwise cosine similarity...
Similarity matrix calculated.
Computing average similarity for each sentence...
Average similarity computed.
['Systematic mining of published data on axonal and dendritic profiles , augmented with information on neurotransmitter and synaptic specificity , led to the tentative definition of over 100 distinct neuron types across the hippocampal formation .', 'Information summaries are available for each neuron type , anatomical parcel , molecular marker , and cited author .', 'We define parcels and neuron types , explain how biomarker and electrophysiological data are linked to morphological data , describe how names are assigned to neuron types , and expand upon how the knowledge base will be maintained going forward .', 'Most publications that report morphological information on hippocampal neurons include evidence of axonal and dendritic presence in at least a subset of these parcels in the form of reconstructions , traci

In [None]:
def dumpLastPklToJson(path, jsonFileName, jsonFileStoringPath):
    # Get list of all pkl files in the directory
    pkl_files = glob.glob(os.path.join(path, 'checkpoint_*.pkl'))

    # If there are no pkl files, return
    if not pkl_files:
        print("No pkl files found")
        return

    # Sort the files by modification time
    pkl_files.sort(key=os.path.getmtime)

    # Get the latest file
    latest_pkl_file = pkl_files[-1]

    # Load data from the latest pkl file
    with open(latest_pkl_file, 'rb') as f:
        data = pickle.load(f)

    # Dump data to JSON file
    json_file_path = os.path.join(jsonFileStoringPath, jsonFileName)
    if isinstance(data, pd.DataFrame):
        # If data is a DataFrame, use pandas' to_json() method
        data.to_json(json_file_path, orient="records", lines=True)
    else:
        # If data is not a DataFrame, use the json module
        with open(json_file_path, 'w') as f:
            json.dump(data, f, indent=4)

    print(f"Data from {latest_pkl_file} has been dumped to {json_file_path}")


In [None]:
dumpLastPklToJson("/content/pickles", "mimicbertSum.json", "/content")

Data from /content/pickles/checkpoint_240.pkl has been dumped to /content/mimicbertSum.json


In [None]:
data = []
with open('/content/mimicbertSum.json') as f:
    for line in f:
        data.append(json.loads(line))

dataF = pd.DataFrame(data)
dataFMySummary = dataF["topK_importance"]
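
The file read above is in JSON Lines format (one JSON object per line), which is what `to_json(..., orient="records", lines=True)` writes. A small sketch of the same reading loop, with a hypothetical helper name and made-up records, that also tolerates blank lines:

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in handle if line.strip()]

# In-memory stand-in for the exported file (hypothetical records)
sample = io.StringIO(
    '{"id": "a", "topK_importance": ["s1"]}\n'
    '\n'
    '{"id": "b", "topK_importance": ["s2", "s3"]}\n'
)
rows = read_json_lines(sample)
print(len(rows))
```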

In [None]:
dataF

Unnamed: 0,id,year,title,abstract,summary,keywords,topK_importance,sections,headings
0,elife-37443-v3,2018,Cerebellar implementation of movement sequence...,"[Most movements are not unitary , but are comp...",[Imagine a gymnastics competition in which par...,[neuroscience],"[Given these factors , the ability to learn a ...","[[Most movements are comprised of sequences .,...","[Introduction, Results, Discussion, Materials ..."
1,elife-33101-v2,2018,Architecture of the human mTORC2 core complex,[The mammalian target of rapamycin ( mTOR ) is...,"[To grow and multiply , a living cell must tak...","[short report, structural biology and molecula...","[Together with the small protein mLST8 , mTOR ...",[[The serine/threonine kinase mammalian target...,"[Introduction, Results and discussion, Materia..."
2,elife-10806-v2,2016,Motion along the mental number line reveals sh...,[Perception of number and space are tightly in...,[Our sense of number is thought to have emerge...,[neuroscience],[We describe a new phenomenon in which visual ...,[[Our perception of numerosity and space are t...,"[Introduction, Results, Discussion, Materials ..."
3,elife-02848-v2,2014,Allosteric inhibition of a stem cell RNA-bindi...,[Gene expression and metabolism are coupled at...,"[When an embryo is developing , stem cells mus...","[biochemistry and chemical biology, structural...",[( B ) MSI1 displays decreased affinity for an...,[[The RNA-binding protein Musashi-1 ( MSI1 ) i...,"[Introduction, Results, Discussion, Materials ..."
4,elife-01524-v1,2014,Synaptotagmin 7 functions as a Ca2+-sensor for...,[Synaptotagmin ( syt ) 7 is one of three syt i...,[Neurons communicate with one another at junct...,"[cell biology, neuroscience]","[The EPSC amplitudes ( Figure 1F–I , K–L ) , a...",[[Chemical communication at synapses in the ce...,"[Introduction, Results, Discussion, Materials ..."
...,...,...,...,...,...,...,...,...,...
236,elife-04437-v2,2014,EGFR signaling promotes self-renewal through t...,"[Epithelial stem cells divide asymmetrically ,...",[A stem cell is a special cell that divides to...,"[stem cells and regenerative medicine, cell bi...",[( B ) Quantification of pErk staining of FSCs...,[[Adult stem cell divisions produce asymmetric...,"[Introduction, Results, Discussion, Materials ..."
237,elife-02046-v1,2014,Essential role of lncRNA binding for WDR5 main...,[The WDR5 subunit of the MLL complex enforces ...,[If all the DNA contained within a single huma...,"[stem cells and regenerative medicine, chromos...","[In addition , recent research has shown that ...","[[An orchestra of chromatin readers , writers ...","[Introduction, Results, Discussion, Materials ..."
238,elife-60474-v2,2020,The effect of hybridization on transposable el...,[Transposable elements ( TEs ) are mobile gene...,[Hybrids arise when two populations of organis...,"[evolutionary biology, genetics and genomics]","[Here , we use population genomic data and lab...",[[Hybridization is increasingly recognized as ...,"[Introduction, Results, Discussion, Materials ..."
239,elife-28590-v2,2017,Fibroblast growth factors (FGFs) prime the lim...,[Sonic hedgehog ( Shh ) expression in the limb...,"[As an animal embryo develops , specific genes...","[chromosomes and gene expression, developmenta...",[A highly conserved ~780 bp enhancer called th...,[[Spatial specific gene expression is fundamen...,"[Introduction, Results, Discussion, Materials ..."
