## 02-Groq.com API for test set
In this notebook, we make API calls to Groq.com to summarize each article in the test set, truncated to its first 20k characters.

We use Mixtral 8x7B because it performs well on common benchmarks and its weights are openly available.

In [1]:
# !pip install groq

In [2]:
# load the API key from a local credentials file
import json

with open('./data/credentials.json') as f:
    login = json.load(f)
    
api_key = login["GROQ_API_KEY"]
print(len(api_key))

56
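
The cell above reads the key from a JSON file. As a minimal sketch of an alternative, the helper below first checks an environment variable and only then falls back to the file. The function name `load_api_key` and the choice of `GROQ_API_KEY` as the environment variable name are assumptions made for illustration; only the file path and the JSON field come from this notebook.

```python
import json
import os
from pathlib import Path


def load_api_key(path="./data/credentials.json", env_var="GROQ_API_KEY"):
    """Return the key from the environment if set, else from the JSON credentials file.

    Hypothetical helper: the notebook itself only reads the JSON file.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    credentials_file = Path(path)
    if not credentials_file.is_file():
        raise FileNotFoundError(f"No {env_var} in the environment and no file at {path}")
    with credentials_file.open() as f:
        return json.load(f)[env_var]
```

With this helper, `api_key = load_api_key()` would replace the `with open(...)` block, and the key never needs to be written to disk on machines where the variable is set.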


In [3]:
from groq import Groq

client = Groq(
    api_key=api_key,
)

In [4]:
# test pause
import time
print("Hello")
time.sleep(2)
print("World")

Hello
World


In [5]:
SLEEP_TIME = 10 # pause between requests
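
The pause exists to stay under a tokens-per-minute limit (the request function below cites roughly 14K tokens per minute). The following sketch shows one way to estimate a pause from that budget. The helper `min_pause_seconds` is hypothetical, and the figure of 4 characters per token is an assumed average, not a measured value for this model. It also assumes every request uses its full output budget, so the result is a worst-case estimate; in practice the time each request takes also counts toward the gap between requests.

```python
import math


def min_pause_seconds(chars_per_request, max_output_tokens, tokens_per_minute, chars_per_token=4.0):
    """Rough worst-case pause (in seconds) needed to stay under a tokens-per-minute budget."""
    prompt_tokens = chars_per_request / chars_per_token      # assumed average, not measured
    tokens_per_request = prompt_tokens + max_output_tokens   # worst case: full output budget used
    requests_per_minute = tokens_per_minute / tokens_per_request
    return math.ceil(60 / requests_per_minute)


# 20k-character prompts, 2048 output tokens, ~14K tokens per minute
print(min_pause_seconds(20_000, 2048, 14_000))  # 31
```

If blank summaries start to appear because of rate-limit errors, raising `SLEEP_TIME` toward this estimate is one option to try.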

In [6]:
def send_sumarize_request(content, model=client, min_words=250, max_words=500, quiet=True):
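    """
        Summarize the content with a single chat-completion request.
        input: content (text), model (Groq client; currently unused, the global `client` is called),
               min_words (int), max_words (int), quiet (bool; if False, print the first 100 characters)
        output: summarized text, or "" if the request failed
    """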
    if not quiet:
        print("Sending request for text =", content[:100])
        
    result = ""
    prompt = f'Simplify and summarize in minimum {min_words} to maximum {max_words} words, combine answer into 1 paragraph, keep important factual details:  "{content}"'
    try:
        completion = client.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=0.8,
            max_tokens=2048,
            top_p=1,
            stream=True,
            stop=None,
        )
    
        
        for chunk in completion:
            result += chunk.choices[0].delta.content or ""
    except Exception as err:
        print("Skipping, error : ", err)
        result = ""
        

    # pause to stay under the API rate limit (~14K tokens / minute)
    print(f"Pausing for {SLEEP_TIME} secs...", end="")
    time.sleep(SLEEP_TIME)
    print("OK")
    
    return result
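
The function above returns an empty string when a request fails, and the caller decides what to do next. A minimal sketch of one option is to wrap the call and retry with a growing delay. `call_with_backoff` is a hypothetical helper, not part of the Groq client, and the attempt count and delays are illustrative defaults.

```python
import time


def call_with_backoff(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-blank string, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            result = fn()
            if result.strip():
                return result
        except Exception as err:
            print(f"Attempt {attempt + 1} failed:", err)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    return ""


# Possible usage with the function defined above (not executed here):
# summary = call_with_backoff(lambda: send_sumarize_request(article_text))
```

The `sleep` argument is injectable so the wrapper can be exercised without actually waiting.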

In [7]:
import pandas as pd

In [8]:
# load in some data

dev_df_filename = "../../data/biolaysumm2024_data/eLife_test.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,article,headings,keywords,id
0,Acylation of diverse carbohydrates occurs acro...,"[Abstract, Introduction, Results and discussio...","[biochemistry and chemical biology, computatio...",elife-81547-v1
1,Honey bee ecology demands they make both rapid...,"[Abstract, Introduction, Results, Discussion, ...",[computational and systems biology],elife-86176-v2
2,"Biguanides , including the world’s most prescr...","[Abstract, Introduction, Results, Discussion, ...",[genetics and genomics],elife-82210-v1
3,Ecological relationships between bacteria medi...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, ecology]",elife-83152-v2
4,Gamma oscillations are believed to underlie co...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-83044-v2
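
For readers unfamiliar with the `.jsonl` layout, the sketch below feeds `pd.read_json` two made-up records, one JSON object per line, using the same `orient="records", lines=True` arguments as the cell above. The sample values are invented for illustration; only the column names mirror the dataset.

```python
import io

import pandas as pd

# Two invented records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"id": "a-1", "article": "First text.", "headings": ["Abstract"], "keywords": ["x"]}\n'
    '{"id": "a-2", "article": "Second text.", "headings": ["Abstract"], "keywords": ["y"]}\n'
)
sample_df = pd.read_json(sample, orient="records", lines=True)
print(sample_df.shape)  # (2, 4)
```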


In [9]:
# apply to all rows in the eLife test set
text_cap = 20_000  # temporarily limit to 20k characters because of API restriction; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Acylation of diverse carbohydrates occurs across all domains of life and can be catalysed by protein
Pausing for 10 secs...OK
Sending request for text = Honey bee ecology demands they make both rapid and accurate assessments of which flowers are most li
Pausing for 10 secs...OK
Sending request for text = Biguanides , including the world’s most prescribed drug for type 2 diabetes , metformin , not only l
Pausing for 10 secs...OK
Sending request for text = Ecological relationships between bacteria mediate the services that gut microbiomes provide to their
Pausing for 10 secs...OK
Sending request for text = Gamma oscillations are believed to underlie cognitive processes by shaping the formation of transien
Pausing for 10 secs...OK
Sending request for text = Asynchronous replication of chromosome domains during S phase is essential for eukaryotic genome fun
Pausing for 10 secs...OK
Sending request for text = While biological age i
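
The cap is applied with a plain slice, `text[:text_cap]`. The short sketch below shows how that slice behaves for a positive cap, for `None`, and for `-1`. `cap_text` is a hypothetical helper used only to make the three cases easy to compare.

```python
def cap_text(text, cap=None):
    """Return at most `cap` leading characters; cap=None keeps the whole text."""
    return text[:cap]


sample = "abcdefghij"
print(cap_text(sample, 4))     # abcd
print(cap_text(sample, None))  # abcdefghij
print(cap_text(sample, -1))    # abcdefghi (the last character is dropped)
```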

In [10]:
df.head()

Unnamed: 0,article,headings,keywords,id,groq_mistral_summary
0,Acylation of diverse carbohydrates occurs acro...,"[Abstract, Introduction, Results and discussio...","[biochemistry and chemical biology, computatio...",elife-81547-v1,Acyltransferase family 3 (AT3) domain-containi...
1,Honey bee ecology demands they make both rapid...,"[Abstract, Introduction, Results, Discussion, ...",[computational and systems biology],elife-86176-v2,Honey bees make rapid and accurate decisions w...
2,"Biguanides , including the world’s most prescr...","[Abstract, Introduction, Results, Discussion, ...",[genetics and genomics],elife-82210-v1,"Metformin, a biguanide and first-line treatmen..."
3,Ecological relationships between bacteria medi...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, ecology]",elife-83152-v2,"In this study, researchers aimed to understand..."
4,Gamma oscillations are believed to underlie co...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-83044-v2,Gamma oscillations in the brain are believed t...


In [11]:
# count rows with a blank summary (requests that failed)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0
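
The same blank check can be written as a boolean mask, which also gives the row labels of the failed requests. This is a sketch on an invented four-row frame, not on the notebook's data.

```python
import pandas as pd

# Invented data: rows 1 and 2 stand in for failed requests.
demo = pd.DataFrame({"groq_mistral_summary": ["A summary.", "", "   ", "Another one."]})

blank_mask = demo["groq_mistral_summary"].str.strip() == ""
print(blank_mask.sum())                 # 2
print(demo.index[blank_mask].tolist())  # [1, 2]
```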


In [12]:
# retry once for rows whose summary is blank (the request failed)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in range(len(df)):
        item = df.iloc[i]
        if item["groq_mistral_summary"].strip() == "":
            print("Item =", i)
            text = item["article"]
            df.at[i, "groq_mistral_summary"] = send_sumarize_request(text[:text_cap], quiet=False)
    print("Completed")

Retrying for empty results...
Completed
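
The retry cell makes a single pass, so a row that fails twice stays blank. A minimal sketch of a bounded multi-pass variant follows. `retry_blank_rows` is a hypothetical helper, and the summarizer is passed in as a plain callable so the sketch runs without the API; in the notebook that callable would wrap `send_sumarize_request`.

```python
import pandas as pd


def retry_blank_rows(frame, summarize, column="groq_mistral_summary", max_rounds=3):
    """Re-run `summarize` on rows whose summary is blank, for at most `max_rounds` passes.

    Modifies `frame` in place and returns the number of rows still blank.
    """
    for _ in range(max_rounds):
        blank = frame[column].str.strip() == ""
        if not blank.any():
            break
        for idx in frame.index[blank]:
            frame.at[idx, column] = summarize(frame.at[idx, "article"])
    return int((frame[column].str.strip() == "").sum())


# Stand-in summarizer for illustration only.
demo = pd.DataFrame({"article": ["one", "two"], "groq_mistral_summary": ["ok", ""]})
print(retry_blank_rows(demo, lambda text: text.upper()))  # 0
```

Iterating over `frame.index[blank]` uses row labels throughout, so the helper does not depend on the frame having a default integer index.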


In [13]:
output_path = "./data/output/test_set/"
output_filename = "elife_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  elife_groq_mistral_summary.csv
Completed


In [14]:
output_filename = "elife_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  elife_groq_mistral_summary.json
Completed
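
Both write cells assume that `./data/output/test_set/` already exists. As a sketch under that observation, the helper below creates the directory if needed and writes the CSV and JSON files in one call. `save_summaries` is a hypothetical name; the `to_csv` and `to_json` arguments match the cells above.

```python
from pathlib import Path

import pandas as pd


def save_summaries(frame, output_dir, stem):
    """Write `frame` as <stem>.csv and <stem>.json under output_dir, creating it if needed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / f"{stem}.csv"
    json_path = out / f"{stem}.json"
    frame.to_csv(csv_path, index=False)
    frame.to_json(json_path, orient="records")
    return csv_path, json_path


# Possible usage in this notebook (not executed here):
# save_summaries(df, "./data/output/test_set/", "elife_groq_mistral_summary")
```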


In [15]:
# process the PLOS dataset:
dev_df_filename = "../../data/biolaysumm2024_data/PLOS_test.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,article,headings,keywords,id
0,Lung-resident ( LR ) mesenchymal stem and stro...,"[Abstract, Introduction, Results, Discussion, ...","[immune system, medical conditions, molecular ...",journal.ppat.1009789
1,Visceral leishmaniasis ( VL ) is endemic in So...,"[Abstract, Introduction, Methods, Results, Dis...","[neonates, clinical laboratory sciences, trans...",journal.pntd.0007992
2,A high burden of Salmonella enterica subspecie...,"[Abstract, Introduction, Methods, Results, Dis...","[pathogens, medical conditions, taxonomy, bact...",journal.pntd.0010704
3,Severe Acute Respiratory Syndrome Coronavirus-...,"[Abstract, Introduction, Results, Discussion, ...","[pathogens, amniotes, medical conditions, bind...",journal.ppat.1010691
4,Many fungal species utilize hydroxyderivatives...,"[Abstract, Introduction, Results and discussio...","[taxonomy, proteins, chemistry, genetics, enzy...",journal.pgen.1009815


In [16]:
# apply to all rows in the PLOS test set
text_cap = 20_000  # temporarily limit to 20k characters due to API restriction; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Lung-resident ( LR ) mesenchymal stem and stromal cells ( MSCs ) are key elements of the alveolar ni
Pausing for 10 secs...OK
Sending request for text = Visceral leishmaniasis ( VL ) is endemic in South Sudan , where outbreaks occur frequently . Because
Pausing for 10 secs...OK
Sending request for text = A high burden of Salmonella enterica subspecies enterica serovar Typhi ( S . Typhi ) bacteremia has 
Pausing for 10 secs...OK
Sending request for text = Severe Acute Respiratory Syndrome Coronavirus-2 ( SARS-CoV-2 ) marks the third novel β-coronavirus t
Pausing for 10 secs...OK
Sending request for text = Many fungal species utilize hydroxyderivatives of benzene and benzoic acid as carbon sources . The y
Pausing for 10 secs...OK
Sending request for text = Improving biological plausibility and functional capacity are two important goals for brain models t
Pausing for 10 secs...OK
Sending request for text = Trisomy of human chrom

In [17]:
df.head()

Unnamed: 0,article,headings,keywords,id,groq_mistral_summary
0,Lung-resident ( LR ) mesenchymal stem and stro...,"[Abstract, Introduction, Results, Discussion, ...","[immune system, medical conditions, molecular ...",journal.ppat.1009789,Lung-resident mesenchymal stem and stromal cel...
1,Visceral leishmaniasis ( VL ) is endemic in So...,"[Abstract, Introduction, Methods, Results, Dis...","[neonates, clinical laboratory sciences, trans...",journal.pntd.0007992,Visceral leishmaniasis (VL) is a protozoan dis...
2,A high burden of Salmonella enterica subspecie...,"[Abstract, Introduction, Methods, Results, Dis...","[pathogens, medical conditions, taxonomy, bact...",journal.pntd.0010704,A study was conducted to genetically character...
3,Severe Acute Respiratory Syndrome Coronavirus-...,"[Abstract, Introduction, Results, Discussion, ...","[pathogens, amniotes, medical conditions, bind...",journal.ppat.1010691,The Severe Acute Respiratory Syndrome Coronavi...
4,Many fungal species utilize hydroxyderivatives...,"[Abstract, Introduction, Results and discussio...","[taxonomy, proteins, chemistry, genetics, enzy...",journal.pgen.1009815,The yeast Candida parapsilosis metabolizes hyd...


In [18]:
# count rows with a blank summary (requests that failed)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0


In [19]:
# retry once for rows whose summary is blank (the request failed)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in range(len(df)):
        item = df.iloc[i]
        if item["groq_mistral_summary"].strip() == "":
            print("Item =", i)
            text = item["article"]
            df.at[i, "groq_mistral_summary"] = send_sumarize_request(text[:text_cap], quiet=False)
    print("Completed")

Retrying for empty results...
Completed


In [20]:
output_path = "./data/output/test_set/"
output_filename = "plos_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  plos_groq_mistral_summary.csv
Completed


In [21]:
output_filename = "plos_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  plos_groq_mistral_summary.json
Completed
