## 01-Groq.com API for Dev set
In this notebook, we make API calls to Groq.com to summarize each article (truncated to its first 20k characters) in the mini dev set, a 10% random sample of the dev set.

We use Mixtral 8x7B because it scores well on performance benchmarks and is open source.
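
For reference, a 10% random sample like the mini dev set could be drawn along the following lines. This is only a sketch: the helper name `sample_fraction`, the seed, and the rounding rule are assumptions, not the procedure actually used to build the mini dataset.

```python
import random

def sample_fraction(records, fraction=0.1, seed=42):
    """Return a reproducible random subset holding about `fraction` of the records."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(records)
    if not items:
        return []
    k = max(1, round(len(items) * fraction))  # keep at least one record
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(items, k)

subset = sample_fraction(range(100), fraction=0.1)
print(len(subset))  # 10
```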

In [1]:
# !pip install groq

In [2]:
# load the API key from a local credentials file
import json

with open('./data/credentials.json') as f:
    login = json.load(f)
    
api_key = login["GROQ_API_KEY"]
print(len(api_key))

56
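
A slightly more defensive variant of the key loading above is sketched here. It assumes the same JSON layout with a `GROQ_API_KEY` entry; the helper name `load_api_key` and the choice to check an environment variable first are hypothetical additions, not part of the original notebook.

```python
import json
import os

def load_api_key(path, name="GROQ_API_KEY"):
    """Return the key from the environment if set, otherwise from a JSON credentials file."""
    key = os.environ.get(name)
    if key:
        return key
    with open(path) as f:
        credentials = json.load(f)
    if not credentials.get(name):
        # fail early with a clear message instead of a bare KeyError later on
        raise KeyError(f"{name} not found in the environment or in {path}")
    return credentials[name]
```

In this notebook it would be called as `load_api_key('./data/credentials.json')`.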


In [3]:
from groq import Groq

client = Groq(
    api_key=api_key,
)

In [4]:
completion = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[
        {
            "role": "user",
            "content": "Answer in 1 sentence: What's the capital city of Alberta?"
        },
    ],
    temperature=1,
    max_tokens=1024,
    top_p=1,
    stream=True,
    stop=None,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

The capital city of Alberta is Edmonton.
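
The streaming loop above prints each text delta as it arrives, and the `or ""` guards against chunks whose `delta.content` is `None`. The sketch below collects the deltas into one string instead. The `fake_chunk` objects are hypothetical stand-ins that mimic only the attributes accessed in the loop; they are not real Groq response types.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # delta.content may be None (for example on a final chunk), hence the `or ""`
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)

def fake_chunk(text):
    # Hypothetical stand-in exposing only chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

stream = [fake_chunk("The capital "), fake_chunk("is Edmonton."), fake_chunk(None)]
print(collect_stream(stream))  # The capital is Edmonton.
```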

In [5]:
# test some summarization
s = 'The evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . Homology has been proposed between glomus cells , which are neural crest-derived , and the hypoxia-sensitive ‘neuroepithelial cells’ ( NECs ) of fish gills , whose embryonic origin is unknown . NECs have also been likened to PNECs , which differentiate in situ within lung airway epithelia . Using genetic lineage-tracing and neural crest-deficient mutants in zebrafish , and physical fate-mapping in frog and lamprey , we find that NECs are not neural crest-derived , but endoderm-derived , like PNECs , whose endodermal origin we confirm . We discover neural crest-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , and propose a new model for amniote hypoxia-sensitive cell evolution: endoderm-derived NECs were retained as PNECs , while the carotid body evolved via the aggregation of neural crest-derived catecholaminergic ( chromaffin ) cells already associated with blood vessels in anamniote pharyngeal arches . '
s

'The evolutionary origins of the hypoxia-sensitive cells that trigger amniote respiratory reflexes – carotid body glomus cells , and ‘pulmonary neuroendocrine cells’ ( PNECs ) - are obscure . Homology has been proposed between glomus cells , which are neural crest-derived , and the hypoxia-sensitive ‘neuroepithelial cells’ ( NECs ) of fish gills , whose embryonic origin is unknown . NECs have also been likened to PNECs , which differentiate in situ within lung airway epithelia . Using genetic lineage-tracing and neural crest-deficient mutants in zebrafish , and physical fate-mapping in frog and lamprey , we find that NECs are not neural crest-derived , but endoderm-derived , like PNECs , whose endodermal origin we confirm . We discover neural crest-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels , and propose a new model for amniote hypoxia-sensitive cell evolution: endoderm-derived NECs were retained as PNECs , while the carotid body evolved via

In [6]:
prompt = f"[INST] Simplify and summarize in 200 to 300 words:  {s} [/INST]"

In [7]:
# test pause
import time
print("Hello")
time.sleep(2)
print("World")

Hello
World


In [8]:
SLEEP_TIME = 10 # pause between requests

In [9]:
# put into a function
def send_sumarize_request(content, model=client, min_words=250, max_words=500, quiet=True):
    """
        Summarize the content with a Groq chat completion.
        input: content (text), model (Groq API client), min_words (int), max_words (int),
               quiet (bool, suppress the request log line)
        output: summarized text, or "" if the request failed
    """
    if not quiet:
        print("Sending request for text =", content[:100])
        
    result = ""
    prompt = f'Simplify and summarize in minimum {min_words} to maximum {max_words} words, combine answer into 1 paragraph, keep important factual details:  "{content}"'
    try:
        # use the client passed in as `model` rather than the global `client`
        completion = model.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=0.8,
            max_tokens=2048,
            top_p=1,
            stream=True,
            stop=None,
        )

        # streamed response: concatenate the text deltas
        for chunk in completion:
            result += chunk.choices[0].delta.content or ""
    except Exception as err:
        # discard any partial output so failed rows can be detected and retried later
        print("Skipping, error : ", err)
        result = ""

    # pause to avoid hitting the rate limit (~ 14K tokens / minute)
    print(f"Pausing for {SLEEP_TIME} secs...")
    time.sleep(SLEEP_TIME)
    print("OK")
    
    return result
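
As a rough check on the pause length, the sketch below estimates how long to wait between requests to stay under a token-per-minute limit. It rests on assumptions: about 4 characters per token (a heuristic, not a tokenizer count), the ~14K tokens/minute figure from the comment above, and a limit that counts both input and output tokens. The helper `min_pause_seconds` is hypothetical.

```python
def min_pause_seconds(n_chars, max_output_tokens=2048, tokens_per_minute=14_000, chars_per_token=4):
    """Estimate the pause needed so that one request per pause stays under the limit."""
    input_tokens = n_chars / chars_per_token         # rough heuristic, not a tokenizer count
    total_tokens = input_tokens + max_output_tokens  # worst case: the full output budget is used
    return 60 * total_tokens / tokens_per_minute

print(round(min_pause_seconds(20_000), 1))  # 30.2
```

Under these assumptions a 20k-character article could need a pause of about 30 seconds in the worst case. The estimate is conservative, since the request itself takes time and summaries rarely use the full output budget.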

In [10]:
send_sumarize_request(s)

Pausing for 10 secs...
OK


'The evolutionary origins of hypoxia-sensitive cells responsible for amniote respiratory reflexes, specifically carotid body glomus cells and pulmonary neuroendocrine cells (PNECs), are unclear. Although glomus cells are neural crest-derived, it has been suggested that they share a common origin with hypoxia-sensitive neuroepithelial cells (NECs) found in fish gills, whose embryonic origin is unknown. NECs have also been compared to PNECs, which differentiate in situ within lung airway epithelia. However, through the use of genetic lineage-tracing, neural crest-deficient mutants in zebrafish, and physical fate-mapping in frog and lamprey, it has been determined that NECs are not neural crest-derived, but instead, endoderm-derived, like PNECs, whose endodermal origin has now been confirmed. The study also discovered neural crest-derived catecholaminergic cells associated with zebrafish pharyngeal arch blood vessels. Therefore, the study proposes a new model for the evolution of amniote 

In [11]:
import pandas as pd

In [12]:
# load in some data

dev_df_filename = "../../data/mini_dataset/eLife_val_mini.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1


In [13]:
import random
random.seed(42)
k = random.randint(0, len(df) - 1)
item = df.iloc[k]
item

lay_summary    Spoken language is colored by fluctuations in ...
article        Individuals with congenital amusia have a life...
headings       [Abstract, Introduction, Results, Discussion, ...
keywords                                          [neuroscience]
id                                                elife-53539-v2
Name: 20, dtype: object

In [14]:
item.article

"Individuals with congenital amusia have a lifelong history of unreliable pitch processing . Accordingly , they downweight pitch cues during speech perception and instead rely on other dimensions such as duration . We investigated the neural basis for this strategy . During fMRI , individuals with amusia ( N = 15 ) and controls ( N = 15 ) read sentences where a comma indicated a grammatical phrase boundary . They then heard two sentences spoken that differed only in pitch and/or duration cues and selected the best match for the written sentence . Prominent reductions in functional connectivity were detected in the amusia group between left prefrontal language-related regions and right hemisphere pitch-related regions , which reflected the between-group differences in cue weights in the same groups of listeners . Connectivity differences between these regions were not present during a control task . Our results indicate that the reliability of perceptual dimensions is linked with functi

In [15]:
# test summarizing the abstract
paras = item.article.split("\n")
abstract = paras[0]
abstract

'Individuals with congenital amusia have a lifelong history of unreliable pitch processing . Accordingly , they downweight pitch cues during speech perception and instead rely on other dimensions such as duration . We investigated the neural basis for this strategy . During fMRI , individuals with amusia ( N = 15 ) and controls ( N = 15 ) read sentences where a comma indicated a grammatical phrase boundary . They then heard two sentences spoken that differed only in pitch and/or duration cues and selected the best match for the written sentence . Prominent reductions in functional connectivity were detected in the amusia group between left prefrontal language-related regions and right hemisphere pitch-related regions , which reflected the between-group differences in cue weights in the same groups of listeners . Connectivity differences between these regions were not present during a control task . Our results indicate that the reliability of perceptual dimensions is linked with functi

In [16]:
print("Summary based on abstract:\n----------------------")
send_sumarize_request(abstract)

Summary based on abstract:
----------------------
Pausing for 10 secs...
OK


'Individuals with congenital amusia, a lifelong condition characterized by unreliable pitch processing, rely on other dimensions such as duration during speech perception. A study investigating the neural basis for this strategy used fMRI to compare functional connectivity in individuals with amusia (N = 15) and controls (N = 15) as they read sentences with grammatical phrase boundaries and then heard two sentences that differed only in pitch and/or duration cues, selecting the best match for the written sentence. Results showed prominent reductions in functional connectivity in the amusia group between left prefrontal language-related regions and right hemisphere pitch-related regions, reflecting the between-group differences in cue weights. These connectivity differences were not present during a control task, indicating a specific compensation mechanism. The study suggests that the reliability of perceptual dimensions is linked with functional connectivity between frontal and percep

In [17]:
print("Summary based on full article:\n----------------------")
s_full = send_sumarize_request(item.article)
print(len(s_full))
print(len(s_full.split()))
print(s_full)

Summary based on full article:
----------------------
Pausing for 10 secs...
OK
4484
633
Congenital amusia is a rare condition characterized by a lifelong history of unreliable pitch processing, resulting in a strategy of downweighting pitch cues during speech perception and relying on other dimensions such as duration. A study using fMRI found that individuals with amusia exhibited prominent reductions in functional connectivity between left prefrontal language-related regions and right hemisphere pitch-related regions, reflecting the between-group differences in cue weights. These connectivity differences were not present during a control task. The results suggest a compensatory mechanism for the reduced reliability of perceptual dimensions. While congenital amusia is believed to be innate, recovery is possible through training. Pitch is important for cueing categories in spoken language, conveying emotion in speech, and is usually associated with music. In highly controlled laborato
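
The full-article summary above has 633 words, which exceeds the `max_words=500` requested in the prompt, so the model does not strictly follow the length instruction. A small check like the following could flag such cases. The helper `check_word_bounds` is a hypothetical addition and uses the same whitespace-based word count as the cell above.

```python
def check_word_bounds(text, min_words=250, max_words=500):
    """Return (word_count, status) where status is 'too short', 'ok', or 'too long'."""
    n_words = len(text.split())  # whitespace-based count, as in len(s_full.split())
    if n_words < min_words:
        return n_words, "too short"
    if n_words > max_words:
        return n_words, "too long"
    return n_words, "ok"

print(check_word_bounds("word " * 633))  # (633, 'too long')
```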

In [19]:
print("Gold lay_summary:\n----------------------")
print(len(item.lay_summary))
print(len(item.lay_summary.split()))
item.lay_summary

Gold lay_summary:
----------------------
2652
469


'Spoken language is colored by fluctuations in pitch and rhythm . Rather than speaking in a flat monotone , we allow our sentences to rise and fall . We vary the length of syllables , drawing out some , and shortening others . These fluctuations , known as prosody , add emotion to speech and denote punctuation . In written language , we use a comma or a period to signal a boundary between phrases . In speech , we use changes in pitch – how deep or sharp a voice sounds – or in the length of syllables . Having more than one type of cue that can signal emotion or transitions between sentences has a number of advantages . It means that people can understand each other even when factors such as background noise obscure one set of cues . It also means that people with impaired sound perception can still understand speech . Those with a condition called congenital amusia , for example , struggle to perceive pitch , but they can compensate for this difficulty by placing greater emphasis on oth

In [20]:
# apply to all rows in eval miniset
text_cap = 20_000  # temporarily limit to 20k characters; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Mature neural networks synchronize and integrate spatiotemporal activity patterns to support cogniti
Pausing for 10 secs...
OK
Sending request for text = Many decisions are thought to arise via the accumulation of noisy evidence to a threshold or bound .
Pausing for 10 secs...
OK
Sending request for text = Mutations in the inositol 5-phosphatase OCRL cause Lowe syndrome and Dent's disease . Although OCRL 
Pausing for 10 secs...
OK
Sending request for text = Gradients of signaling proteins are essential for inducing tissue morphogenesis . However , mechanis
Pausing for 10 secs...
OK
Sending request for text = Similarity between two individuals in the combination of genetic markers along their chromosomes ind
Pausing for 10 secs...
OK
Sending request for text = Latent Epstein-Barr virus ( EBV ) infection is causally linked to several human cancers . EBV expres
Pausing for 10 secs...
OK
Sending request for text = Dynamic post-tra
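
The cap in `text[:text_cap]` relies on Python slice semantics, which matter when disabling it: a stop value of `None` keeps the whole string, while `-1` silently drops the last character. A minimal illustration follows; `cap_text` is a hypothetical helper used only for this demonstration.

```python
def cap_text(text, cap=None):
    """Return at most `cap` leading characters; cap=None keeps the full text."""
    return text[:cap]

sample = "abcdef"
print(cap_text(sample, 3))     # abc
print(cap_text(sample, None))  # abcdef
print(cap_text(sample, -1))    # abcde (last character dropped)
```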

In [21]:
# count rows with a blank result (failed requests return an empty string)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0
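
The check above only matches strings that are empty after stripping. If missing values such as `None` or NaN could also appear in the column, a single predicate along these lines could cover all of the cases. The name `is_blank` is hypothetical and the function is a sketch, not part of the original workflow.

```python
def is_blank(value):
    """True for None, NaN floats, and strings that are empty or whitespace only."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return True
    return isinstance(value, str) and value.strip() == ""

print([is_blank(v) for v in ["", "  \n", None, float("nan"), "A summary."]])
# [True, True, True, True, False]
```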


In [22]:
# retry rows whose summary is blank (the request failed and returned an empty string)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in df.index:
        item = df.loc[i]
        if item["groq_mistral_summary"].strip() == "":
            print("Item =", i)
            text = item["article"]
            df.at[i, "groq_mistral_summary"] = send_sumarize_request(text[:text_cap], quiet=False)
    print("Completed")

Retrying for empty results...
Completed
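
The retry cell makes a single extra pass over blank rows. Because the request function returns an empty string on failure, a generic wrapper could retry each call a few times with a growing delay. This is a sketch under assumptions: `call_with_retries`, its parameters, and the exponential backoff schedule are hypothetical, and the `sleep` argument is injectable so the example can run without waiting.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args) until it returns a non-blank string or attempts run out."""
    result = ""
    for attempt in range(max_attempts):
        result = func(*args)
        if result.strip():
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    return result

# Demonstration with a stand-in function that fails twice before succeeding
attempts = []
def flaky_summarizer(text):
    attempts.append(text)
    return "" if len(attempts) < 3 else "summary of " + text

print(call_with_retries(flaky_summarizer, "abc", sleep=lambda seconds: None))  # summary of abc
```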


In [23]:
df

Unnamed: 0,lay_summary,article,headings,keywords,id,groq_mistral_summary
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2,The development of neural networks from an imm...
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1,The article discusses how decisions are made t...
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2,Mutations in the inositol 5-phosphatase OCRL c...
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3,"The distribution of signaling proteins, FGF an..."
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1,The article discusses the use of genetic marke...
5,Over 90% of adults around the world are infect...,Latent Epstein-Barr virus ( EBV ) infection is...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, cancer b...",elife-22509-v2,Epstein-Barr virus (EBV) is a widespread human...
6,Genes are sections of DNA that encode the inst...,Dynamic post-translational modification of RNA...,"[Abstract, Introduction, Results, Discussion, ...","[chromosomes and gene expression, computationa...",elife-11215-v2,The C-terminal domain (CTD) of RNA polymerase ...
7,An organism’s DNA contains thousands of genes ...,Swi2/Snf2 ATPases remodel substrates such as n...,"[Abstract, Introduction, Results, Discussion, ...",[structural biology and molecular biophysics],elife-07432-v2,The Swi2/Snf2 ATPases are a large and diverse ...
8,Human reproductive cells—eggs and sperm—are pr...,Accurate chromosome segregation depends on coo...,"[Abstract, Introduction, Results, Discussion, ...","[chromosomes and gene expression, cell biology]",elife-01133-v1,"In mammalian oocytes, Shugoshin-like protein 2..."
9,Microorganisms live in most parts of our body ...,Streptococcus pneumoniae is a leading cause of...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, genetics...",elife-26255-v2,"The bacterium Streptococcus pneumoniae, a lead..."


In [24]:
output_path = "./data/output/mini_dev_set/"
output_filename = "elife_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  elife_groq_mistral_summary.csv
Completed


In [25]:
output_filename = "elife_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  elife_groq_mistral_summary.json
Completed
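
One practical difference between the two output formats is that list-valued columns such as `headings` and `keywords` keep their type in JSON, whereas in CSV every cell is text. The standard-library sketch below illustrates the effect on a single record; it does not use pandas, so it shows the general behavior of the formats rather than the exact files written above.

```python
import csv
import io
import json

row = {"id": "elife-69011-v2", "keywords": ["neuroscience"]}

# CSV: every cell is text, so the list comes back as its string representation
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON: the list type survives the round trip
json_row = json.loads(json.dumps(row))

print(type(csv_row["keywords"]).__name__, csv_row["keywords"])    # str ['neuroscience']
print(type(json_row["keywords"]).__name__, json_row["keywords"])  # list ['neuroscience']
```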


In [26]:
# process the PLOS dataset:
dev_df_filename = "../../data/mini_dataset/PLOS_val_mini.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145


In [27]:
# apply to all rows in eval miniset
text_cap = 20_000  # temporarily limit to 20k characters; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Fleas can transmit Yersinia pestis by two mechanisms , early-phase transmission ( EPT ) and biofilm-
Pausing for 10 secs...
OK
Sending request for text = Endogenous retroviruses ( ERVs ) are remnants of ancient retroviral infections of the host germline 
Pausing for 10 secs...
OK
Sending request for text = The Drosophila embryonic gonad is assembled from two distinct cell types , the Primordial Germ Cells
Pausing for 10 secs...
OK
Sending request for text = Recently , we presented a study of adult neurogenesis in a simplified hippocampal memory model . The
Pausing for 10 secs...
OK
Sending request for text = Understanding the transcriptional regulation of pluripotent cells is of fundamental interest and wil
Pausing for 10 secs...
OK
Sending request for text = The current model of hepatitis C virus ( HCV ) production involves the assembly of virions on or nea
Pausing for 10 secs...
OK
Sending request for text = Secondary amphip

In [28]:
# count rows with a blank result (failed requests return an empty string)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0


In [29]:
# retry rows whose summary is blank (the request failed and returned an empty string)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in df.index:
        item = df.loc[i]
        if item["groq_mistral_summary"].strip() == "":
            print("Item =", i)
            text = item["article"]
            df.at[i, "groq_mistral_summary"] = send_sumarize_request(text[:text_cap], quiet=False)
    print("Completed")

Retrying for empty results...
Completed


In [30]:
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,groq_mistral_summary
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859,"Fleas can transmit Yersinia pestis, the bacter..."
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170,Endogenous retroviruses (ERVs) are remnants of...
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720,The Drosophila embryonic gonad is formed from ...
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063,"In a study, the researchers presented a model ..."
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145,The transcriptional regulation of pluripotent ...


In [31]:
output_path = "./data/output/mini_dev_set/"
output_filename = "plos_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  plos_groq_mistral_summary.csv
Completed


In [32]:
output_filename = "plos_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  plos_groq_mistral_summary.json
Completed
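
The eLife and PLOS sections repeat the same load, summarize, and save steps. One way to avoid the duplication is a single function applied to each dataset, sketched below with the standard library only. The function name `summarize_jsonl` and its parameters are hypothetical, and the summarizer is passed in as an argument so that the notebook's `send_sumarize_request` (or a stub for testing) can be plugged in.

```python
import json

def summarize_jsonl(in_path, out_path, summarize, text_cap=20_000, column="groq_mistral_summary"):
    """Read a JSON Lines file, add one summary per record, and write a JSON array."""
    records = []
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # cap the article length before sending it to the summarizer
            record[column] = summarize(record["article"][:text_cap])
            records.append(record)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return len(records)
```

With the names used in this notebook, a call might look like `summarize_jsonl("../../data/mini_dataset/PLOS_val_mini.jsonl", output_path + "plos_groq_mistral_summary.json", send_sumarize_request)`.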
