## 03-Groq.com API for full dev set
In this notebook, we call the Groq.com API to summarize each article (truncated to the first 20k characters) in the full dev set (about 2,000 examples).

We use Mixtral 8x7B because it scores well on public benchmarks and is an open-weights model.

In [1]:
# !pip install groq

In [2]:
# load the API key from a local credentials file
import json

with open('./data/credentials.json') as f:
    login = json.load(f)
    
api_key = login["GROQ_API_KEY"]
print(len(api_key))

56


In [3]:
from groq import Groq

client = Groq(
    api_key=api_key,
)

In [4]:
# test pause
import time
print("Hello")
time.sleep(2)
print("World")

Hello
World


In [5]:
SLEEP_TIME = 10 # pause between requests
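
As a rough check on the pause length, the sketch below estimates the spacing between requests needed to stay under a tokens-per-minute budget. The ~14K tokens/minute limit, the 20k-character cap, and the 2048-token output cap are the values used in this notebook; the ratio of 4 characters per token and the helper name `min_seconds_between_requests` are assumptions for illustration, not Groq figures.

```python
import math

def min_seconds_between_requests(max_chars, max_output_tokens, tokens_per_minute, chars_per_token=4.0):
    """Estimate the minimum spacing (seconds) between requests under a token budget."""
    # worst-case tokens per request: estimated prompt tokens plus the output cap
    prompt_tokens = math.ceil(max_chars / chars_per_token)  # chars_per_token is an assumption
    tokens_per_request = prompt_tokens + max_output_tokens
    requests_per_minute = tokens_per_minute / tokens_per_request
    return 60.0 / requests_per_minute

estimate = min_seconds_between_requests(20_000, 2048, 14_000)
print(f"~{estimate:.1f} secs between requests")
```

Under these worst-case assumptions the estimate comes out at roughly 30 seconds. The request's own latency also counts toward the spacing, and many summaries use fewer than 2048 output tokens, so a shorter pause may still work in practice.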

In [6]:
def send_sumarize_request(content, model=client, min_words=250, max_words=500, quiet=True):
    """
        Summarize the content with Mixtral 8x7B via the Groq API.
        input: content (text), model (Groq API client), min_words (int), max_words (int), quiet (bool)
        output: summarized text, or "" if the request failed
    """
    if not quiet:
        print("Sending request for text =", content[:100])
        
    result = ""
    prompt = f'Simplify and summarize in minimum {min_words} to maximum {max_words} words, combine answer into 1 paragraph, keep important factual details:  "{content}"'
    try:
        completion = model.chat.completions.create(  # `model` is the Groq client passed in (defaults to `client`)
            model="mixtral-8x7b-32768",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=1,
            max_tokens=2048,
            top_p=1,
            stream=True,
            stop=None,
        )
    
        
        for chunk in completion:
            result += chunk.choices[0].delta.content or ""
    except Exception as err:
        print("Skipping, error : ", err)
        result = ""
        

    # pause to avoid hitting the rate limit (~14K tokens / minute)
    print(f"Completed. Pausing for {SLEEP_TIME} secs...", end="")
    time.sleep(SLEEP_TIME)
    print("OK")
    
    return result
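
Because the function above returns an empty string when a request fails, one way to recover from transient errors is to call it again with a growing delay. The following is a minimal sketch of that idea, not part of the original workflow; `retry_until_nonempty` and its parameters are hypothetical names, and a stand-in function replaces the real API call so the example runs offline.

```python
import time

def retry_until_nonempty(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns non-blank text, doubling the wait after each blank result."""
    result = ""
    for attempt in range(max_attempts):
        result = fn()
        if result.strip():
            break  # got a usable summary
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return result

# stand-in for send_sumarize_request: returns blank twice, then a summary
responses = iter(["", "", "a summary"])
waits = []  # record the delays instead of actually sleeping
summary = retry_until_nonempty(lambda: next(responses), sleep=waits.append)
print(summary, waits)
```

In the notebook, `fn` could be something like `lambda: send_sumarize_request(text[:text_cap])`. Note that the request function already pauses for `SLEEP_TIME` after every call, so the extra delay here is added on top of that pause.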

In [7]:
import pandas as pd

In [8]:
# load the eLife dev set

dev_df_filename = "../../data/biolaysumm2024_data/eLife_val.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,The DNA in genes encodes the basic information...,Cell-fate reprograming is at the heart of deve...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-15477-v3
1,Klebsiella pneumoniae is a type of bacteria th...,"Klebsiella pneumoniae is a respiratory , blood...","[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-56656-v2
2,Malaria is one of the world's most deadly infe...,Plasmodium vivax relapse infections occur foll...,"[Abstract, Introduction, Results, Discussion, ...",[epidemiology and global health],elife-04692-v2
3,The Amazon rainforest in South America is the ...,When 2 Mha of Amazonian forests are disturbed ...,"[Abstract, Introduction, Results, Discussion, ...",[ecology],elife-21394-v2
4,Neurons that arise in the adult nervous system...,Neurosphere formation is commonly used as a su...,"[Abstract, Introduction, Results, Discussion, ...",[stem cells and regenerative medicine],elife-02669-v2


In [9]:
# apply to all rows in the eLife dev set
text_cap = 20_000  # limit each article to 20k characters; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Cell-fate reprograming is at the heart of development , yet very little is known about the molecular
Completed. Pausing for 10 secs...OK
Sending request for text = Klebsiella pneumoniae is a respiratory , blood , liver , and bladder pathogen of significant clinica
Completed. Pausing for 10 secs...OK
Sending request for text = Plasmodium vivax relapse infections occur following activation of latent liver-stages parasites ( hy
Completed. Pausing for 10 secs...OK
Sending request for text = When 2 Mha of Amazonian forests are disturbed by selective logging each year , more than 90 Tg of ca
Completed. Pausing for 10 secs...OK
Sending request for text = Neurosphere formation is commonly used as a surrogate for neural stem cell ( NSC ) function but the 
Completed. Pausing for 10 secs...OK
Sending request for text = Piezo1 is a mechanically activated ion channel involved in sensing forces in various cell types and 
Completed. Pausing 
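
With about 2,000 examples and a 10-second pause after each request, a full run spends several hours in pauses alone, and `apply` keeps nothing if the kernel stops partway. A possible safeguard, sketched here under assumptions, is to append each summary to a JSONL checkpoint file and skip ids that are already present on restart. The function name, the checkpoint layout, and the stand-in summarizer are all hypothetical.

```python
import json
import os
import tempfile

def summarize_with_checkpoint(records, summarize, checkpoint_path):
    """Summarize records one by one, appending each result to a JSONL checkpoint."""
    done = {}
    # reload results saved by an earlier (possibly interrupted) run
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            for line in f:
                row = json.loads(line)
                done[row["id"]] = row["summary"]
    with open(checkpoint_path, "a") as f:
        for rec in records:
            if rec["id"] in done:
                continue  # already summarized in an earlier run
            summary = summarize(rec["article"])
            done[rec["id"]] = summary
            f.write(json.dumps({"id": rec["id"], "summary": summary}) + "\n")
            f.flush()  # push the line to the file before the next request
    return done

# toy demo with a stand-in summarizer (upper-casing instead of an API call)
records = [{"id": "a", "article": "first"}, {"id": "b", "article": "second"}]
calls = []

def fake_summarize(text):
    calls.append(text)
    return text.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.jsonl")
    first = summarize_with_checkpoint(records[:1], fake_summarize, path)
    second = summarize_with_checkpoint(records, fake_summarize, path)
```

With the DataFrame used here, `records` could come from `df[["id", "article"]].to_dict("records")`, and the returned dictionary could be mapped back with `df["id"].map(done)`.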

In [10]:
# count rows with a blank summary (requests that failed with an error)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0
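
The blank check only catches failed requests. Since the prompt asks for 250 to 500 words, a complementary check is to count the words in each summary and flag those outside that range. This is a small sketch on toy data; `word_count_report` is a hypothetical helper, and whitespace splitting is only an approximation of how words are counted.

```python
import pandas as pd

def word_count_report(summaries, min_words=250, max_words=500):
    """Return per-summary word counts and whether each falls inside [min_words, max_words]."""
    counts = summaries.str.split().str.len()  # whitespace-delimited word count
    return pd.DataFrame({"n_words": counts, "in_range": counts.between(min_words, max_words)})

# toy summaries: one in range, one too short, one too long
toy = pd.Series(["word " * 300, "too short", "word " * 600])
report = word_count_report(toy)
print(report)
```

On the real data, the same helper could be applied to `df["groq_mistral_summary"]`.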


In [11]:
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,groq_mistral_summary
0,The DNA in genes encodes the basic information...,Cell-fate reprograming is at the heart of deve...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-15477-v3,Cell-fate decisions are regulated by intercell...
1,Klebsiella pneumoniae is a type of bacteria th...,"Klebsiella pneumoniae is a respiratory , blood...","[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, immunolo...",elife-56656-v2,Klebsiella pneumoniae is a dangerous pathogen ...
2,Malaria is one of the world's most deadly infe...,Plasmodium vivax relapse infections occur foll...,"[Abstract, Introduction, Results, Discussion, ...",[epidemiology and global health],elife-04692-v2,The paper discusses the development of a withi...
3,The Amazon rainforest in South America is the ...,When 2 Mha of Amazonian forests are disturbed ...,"[Abstract, Introduction, Results, Discussion, ...",[ecology],elife-21394-v2,"Selective logging in Amazonian forests, which ..."
4,Neurons that arise in the adult nervous system...,Neurosphere formation is commonly used as a su...,"[Abstract, Introduction, Results, Discussion, ...",[stem cells and regenerative medicine],elife-02669-v2,Neural stem cells (NSCs) in the adult mammalia...


In [12]:
output_path = "./data/output/full_dev_set/"
output_filename = "elife_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  elife_groq_mistral_summary.csv
Completed


In [13]:
output_filename = "elife_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  elife_groq_mistral_summary.json
Completed


In [14]:
# process the PLOS dataset:
dev_df_filename = "../../data/biolaysumm2024_data/PLOS_val.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,Messenger RNAs carry the instructions necessar...,Gene expression varies widely between individu...,"[Abstract, Introduction, Results, Discussion, ...","[genetics, biology, genomics, genetics and gen...",journal.pgen.1002882
1,"Annually , more than two million people are in...",The live attenuated simian immunodeficiency vi...,"[Abstract, Introduction, Materials and Methods...",[],journal.ppat.1004633
2,The opportunistic pathogen Candida albicans is...,Mucosal infections with Candida albicans belon...,"[Abstract, Introduction, Results, Discussion, ...","[blood cells, cell motility, medicine and heal...",journal.ppat.1005882
3,"Lymphatic filariasis ( LF ) , commonly known a...","Between 2000–2007 , the Global Programme to El...","[Abstract, Introduction, Methods, Results, Dis...",[infectious diseases/neglected tropical diseas...,journal.pntd.0000708
4,Parkinson’s disease ( PD ) is a neurodegenerat...,Homozygous mutations in the glucocerebrosidase...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1005065


In [15]:
# apply to all rows in the PLOS dev set
text_cap = 20_000  # limit each article to 20k characters; set to None for full text (-1 would drop the last character)

print("Summarization process started...")
df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
print("Completed")

Summarization process started...
Sending request for text = Gene expression varies widely between individuals of a population , and regulatory change can underl
Completed. Pausing for 10 secs...OK
Sending request for text = The live attenuated simian immunodeficiency virus ( LASIV ) vaccine SIVΔnef is one of the most effec
Completed. Pausing for 10 secs...OK
Sending request for text = Mucosal infections with Candida albicans belong to the most frequent forms of fungal diseases . Host
Completed. Pausing for 10 secs...OK
Sending request for text = Between 2000–2007 , the Global Programme to Eliminate Lymphatic Filariasis ( GPELF ) delivered more 
Completed. Pausing for 10 secs...OK
Sending request for text = Homozygous mutations in the glucocerebrosidase ( GBA ) gene result in Gaucher disease ( GD ) , the m
Completed. Pausing for 10 secs...OK
Sending request for text = Listeria monocytogenes is a facultative intracellular pathogen capable of inducing a robust cell-med
Completed. Pausing 

In [16]:
df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id,groq_mistral_summary
0,Messenger RNAs carry the instructions necessar...,Gene expression varies widely between individu...,"[Abstract, Introduction, Results, Discussion, ...","[genetics, biology, genomics, genetics and gen...",journal.pgen.1002882,"Gene expression varies between individuals, an..."
1,"Annually , more than two million people are in...",The live attenuated simian immunodeficiency vi...,"[Abstract, Introduction, Materials and Methods...",[],journal.ppat.1004633,The live attenuated simian immunodeficiency vi...
2,The opportunistic pathogen Candida albicans is...,Mucosal infections with Candida albicans belon...,"[Abstract, Introduction, Results, Discussion, ...","[blood cells, cell motility, medicine and heal...",journal.ppat.1005882,In the study of mucosal infections with Candid...
3,"Lymphatic filariasis ( LF ) , commonly known a...","Between 2000–2007 , the Global Programme to El...","[Abstract, Introduction, Methods, Results, Dis...",[infectious diseases/neglected tropical diseas...,journal.pntd.0000708,The Global Programme to Eliminate Lymphatic Fi...
4,Parkinson’s disease ( PD ) is a neurodegenerat...,Homozygous mutations in the glucocerebrosidase...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1005065,A study investigating the relationship between...


In [44]:
# count rows with a blank summary (requests that failed with an error)
empty_df = df.query("groq_mistral_summary.str.strip() == ''")
print(len(empty_df))

0


In [46]:
# retry rows whose summary is blank (the request failed with an error)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in df.index:  # iterate over index labels so df.at addresses the right row
        if df.at[i, "groq_mistral_summary"].strip() == "":  # same blank test as the check above
            print("Item =", i)
            text = df.at[i, "article"]
            df.at[i, "groq_mistral_summary"] = send_sumarize_request(text[:text_cap], quiet=False)
    print("Completed")

Retrying for empty results...
Completed


In [42]:
output_path = "./data/output/full_dev_set/"
output_filename = "plos_groq_mistral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  plos_groq_mistral_summary.csv
Completed


In [43]:
output_filename = "plos_groq_mistral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  plos_groq_mistral_summary.json
Completed
