## 02-Groq.com API for test set
In this notebook, we call the Groq.com API to summarize each article in the test set, sending at most the first 20k characters of each article.

We use Mixtral 8x7B because it has very good performance scores and it is open source.

In [1]:
# !pip install groq

In [2]:
# load the API key from a local credentials file
import json

with open('./data/credentials.json') as f:
    login = json.load(f)
    
api_key = login["GROQ_API_KEY"]
print(len(api_key))

56
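
The cell above reads the key from a JSON file. As an alternative, the key could be taken from an environment variable first, with the file as a fallback. The sketch below is only an illustration: `load_api_key` is a hypothetical helper that is not part of this notebook, and using `GROQ_API_KEY` as the environment variable name is an assumption that simply mirrors the JSON key.

```python
import json
import os


def load_api_key(path="./data/credentials.json", env_var="GROQ_API_KEY"):
    """Return the API key from the environment if set, else from the JSON file.

    Hypothetical helper: the notebook itself only reads the JSON file.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to the credentials file, which stores the key under the same name.
    with open(path) as f:
        return json.load(f)[env_var]
```

With this helper, `api_key = load_api_key()` would replace the `with open(...)` block, and the key never has to be written to disk on machines where the variable is set.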


In [3]:
from groq import Groq

client = Groq(
    api_key=api_key,
)

In [4]:
# test pause
import time
print("Hello")
time.sleep(2)
print("World")

Hello
World


In [5]:
SLEEP_TIME = 10 # pause between requests

In [6]:
def send_sumarize_request(content, model=client, min_words=250, max_words=500, max_char=20000, quiet=True):
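    """
        Summarize the content into a lay summary.
        input: content (text), model (Groq API client), min_words / max_words (int),
               max_char (int, number of characters of content that are sent), quiet (bool)
        output: summarized text, or "" if the request failed
    """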
    if not quiet:
        print("Sending request for text =", content[:100])
        
    result = ""
    # get 1 example from train set for the model to follow the style
    lay_summary_example = (
        "Electrocorticography , or ECoG , is a technique that is used to record the electrical activity of the brain via electrodes placed inside the skull . This electrical activity repeatedly rises and falls , and can therefore be represented as a series of waves . All waves have three basic properties: amplitude , frequency and phase . Amplitude describes the height of a wave's peaks ( and the depth of its troughs ) , and frequency defines how many waves are produced per second . The phase of a wave changes from 0° to 360° between two consecutive peaks of that wave and then repeats , similar to the phases of the moon . Previous studies have shown that brain activity at different frequencies can interact . For instance , neural firing ( when nerve impulses are sent from one neuron to the next ) is related to ‘high frequency activity’; and the amplitude of high frequency activity can be altered by the phase of other , lower frequency brain activity . It has been suggested that this phenomenon , called ‘phase-amplitude coupling’ , might be one way that the brain uses to represent information . This ‘phase coding’ hypothesis has been demonstrated in rodents but is largely untested in humans . Now , Watrous et al . have explored this hypothesis in epilepsy patients who had ECoG electrodes implanted in their brains for a diagnostic procedure before surgery . These electrodes were used to record brain activity while the patients viewed images from four different categories ( houses , scenes , tools and faces ) . Watrous et al . found that phase-amplitude coupling occurred in over 40% of the recordings of brain activity . The analysis also revealed that the phase of the lower frequency activity at which the high frequency activity occurred was different for each of the four image categories . This provides support for the phase-coding hypothesis in humans .\n"
        "Furthermore , it suggests that not only how much neural firing occurs but also when ( or specifically at what phase ) it occurs is important for how the brain represents information . Future studies could now build on this analysis to see if phase-amplitude coupling also supports phase coding and neural representations in other thought processes , such as memory and navigation ."
    )
    # prompt = f'Simplify and summarize in minimum {min_words} to maximum {max_words} words, combine answer into 1 paragraph, keep important factual details:  "{content}"'
    prompt = f"One good example of lay summary is '{lay_summary_example}'. Follow the style of this example. The following text is from a medical research paper, which contains medical terms. The keywords are listed with ## keywords tag. The section headings have ## tag before it. Pay more attention to 'Abstract', 'Introduction', 'Conclusion' and 'Discussion' sections. Simplify and summarize in minimum {min_words} to maximum {max_words} words, combine answer into 1 paragraph, keeping important facts. Now do a lay summary on this text:\n {content[:max_char]}"

    try:
        completion = model.chat.completions.create(  # `model` is the Groq client passed to this function
            model="mixtral-8x7b-32768",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=0.8,
            max_tokens=2048,
            top_p=1,
            stream=True,
            stop=None,
        )
    
        
        for chunk in completion:
            result += chunk.choices[0].delta.content or ""
    except Exception as err:
        print("Skipping, error : ", err)
        result = ""
        

    # pause to stay under the API rate limit (~14K tokens / minute)
    print(f"OK. Pausing for {SLEEP_TIME} secs...", end="")
    time.sleep(SLEEP_TIME)
    print("OK")
    
    return result
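
The way `min_words`, `max_words` and `max_char` feed into the prompt can be easier to see when the prompt is assembled by a small pure function. The sketch below is a simplified illustration, not the prompt used in this notebook: `build_prompt` is a hypothetical helper and its instruction text is deliberately shortened.

```python
def build_prompt(content, example, min_words=250, max_words=500, max_char=20000):
    """Assemble a summarization prompt from its parts (hypothetical helper).

    The instruction text is a shortened stand-in for the notebook's full prompt.
    """
    instructions = (
        f"One good example of lay summary is '{example}'. "
        "Follow the style of this example. "
        f"Simplify and summarize in minimum {min_words} to maximum {max_words} words, "
        "combine answer into 1 paragraph, keeping important facts. "
        "Now do a lay summary on this text:\n"
    )
    # Only the article is truncated, so the instructions are never cut off.
    return instructions + content[:max_char]


# Made-up inputs: a 30,000-character article capped at 100 characters.
prompt = build_prompt("x" * 30000, example="A short example.", max_char=100)
print(prompt[-100:] == "x" * 100)  # True
```

Keeping the prompt construction separate from the API call also makes it possible to inspect or unit-test the prompt without spending any tokens.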

In [7]:
import pandas as pd

In [8]:
# load the eLife test set
dev_df_filename = "../../../data/biolaysumm2024_data/eLife_test.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,article,headings,keywords,id
0,Acylation of diverse carbohydrates occurs acro...,"[Abstract, Introduction, Results and discussio...","[biochemistry and chemical biology, computatio...",elife-81547-v1
1,Honey bee ecology demands they make both rapid...,"[Abstract, Introduction, Results, Discussion, ...",[computational and systems biology],elife-86176-v2
2,"Biguanides , including the world’s most prescr...","[Abstract, Introduction, Results, Discussion, ...",[genetics and genomics],elife-82210-v1
3,Ecological relationships between bacteria medi...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, ecology]",elife-83152-v2
4,Gamma oscillations are believed to underlie co...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-83044-v2


In [9]:
# take one item to inspect
k = 3
item = df.iloc[k]
print("Full text article =", item["article"])

Full text article = Ecological relationships between bacteria mediate the services that gut microbiomes provide to their hosts . Knowing the overall direction and strength of these relationships is essential to learn how ecology scales up to affect microbiome assembly , dynamics , and host health . However , whether bacterial relationships are generalizable across hosts or personalized to individual hosts is debated . Here , we apply a robust , multinomial logistic-normal modeling framework to extensive time series data ( 5534 samples from 56 baboon hosts over 13 years ) to infer thousands of correlations in bacterial abundance in individual baboons and test the degree to which bacterial abundance correlations are ‘universal’ . We also compare these patterns to two human data sets . We find that , most bacterial correlations are weak , negative , and universal across hosts , such that shared correlation patterns dominate over host-specific correlations by almost twofold . Further , tax

In [10]:
# we can create a function to embed the headings and keywords into the article text
def build_text_with_headings(item):
    """
    Return the article text with the keywords and section headings embedded.
    Falls back to the raw article text if paragraphs and headings do not line up.
    """
    paras = item["article"].split("\n")
    headings = item["headings"]
    keywords = ', '.join(item["keywords"])
    result = f"## Keywords: {keywords}\n"
    if len(paras) != len(headings):
        print("Error, not matching length")
        return item["article"]
    for (heading, paragraph) in zip(headings, paras):
        result += f"## {heading}\n{paragraph}\n\n"

    return result
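
`build_text_with_headings` silently falls back to the raw article whenever the number of paragraphs differs from the number of headings. A quick way to see how often that happens is to count the mismatches up front. This is a minimal sketch: `count_heading_mismatches` is a hypothetical helper, and the records below are made up for illustration rather than taken from the dataset.

```python
def count_heading_mismatches(records):
    """Count records whose paragraph count differs from their heading count."""
    mismatches = 0
    for record in records:
        paragraphs = record["article"].split("\n")
        if len(paragraphs) != len(record["headings"]):
            mismatches += 1
    return mismatches


# Toy records (made up for illustration, not taken from the dataset).
toy_records = [
    {"article": "Abstract text\nIntro text", "headings": ["Abstract", "Introduction"]},
    {"article": "Only one paragraph", "headings": ["Abstract", "Introduction"]},
]
print(count_heading_mismatches(toy_records))  # 1
```

On the real data this could be called as `count_heading_mismatches(df.to_dict("records"))`.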

In [11]:
processed_text = build_text_with_headings(item)
print(processed_text)

## Keywords: microbiology and infectious disease, ecology
## Abstract
Ecological relationships between bacteria mediate the services that gut microbiomes provide to their hosts . Knowing the overall direction and strength of these relationships is essential to learn how ecology scales up to affect microbiome assembly , dynamics , and host health . However , whether bacterial relationships are generalizable across hosts or personalized to individual hosts is debated . Here , we apply a robust , multinomial logistic-normal modeling framework to extensive time series data ( 5534 samples from 56 baboon hosts over 13 years ) to infer thousands of correlations in bacterial abundance in individual baboons and test the degree to which bacterial abundance correlations are ‘universal’ . We also compare these patterns to two human data sets . We find that , most bacterial correlations are weak , negative , and universal across hosts , such that shared correlation patterns dominate over host-speci

In [12]:
# test with 1 example
send_sumarize_request(processed_text, quiet=False)

Sending request for text = ## Keywords: microbiology and infectious disease, ecology
## Abstract
Ecological relationships betwe
OK. Pausing for 10 secs...OK


'Gut microbiomes, which are communities of bacteria in our stomach, play a significant role in our health by affecting various processes like digestion and metabolism. However, there are still many unknowns when it comes to the relationships between these bacteria in the gut. This study aimed to understand these relationships by analyzing extensive time series data from 56 baboons over 13 years. The researchers found that most of the relationships between bacteria are weak, negative, and consistent across different hosts. This means that the same patterns of bacterial relationships are more common than unique patterns in each host. The researchers also compared these patterns to two human data sets and found that the universality of bacterial associations in baboons is similar to that in human infants and stronger than one data set from human adults. These findings contribute to our understanding of microbiome personalization, community assembly, stability, and designing microbiome int

In [13]:
# # apply to all rows in eval miniset
# text_cap = 20_000  # temporarily limit to 20k characters because of API restriction, set to -1 for full text

# print("Summarization process started...")
# df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text, quiet=False))
# print("Completed")

In [14]:
# create empty column
df["mixtral_summary"] = ""
n = len(df)

for i in range(n):
    print(f"Processing item {i}/{n} = {i*100/n:.1f} %")
    item = df.iloc[i]
    text = build_text_with_headings(item)
    print("Processing article =", text[:200])
    summary = send_sumarize_request(text)
    print("Summary =", summary[:200])
    # df.at[i, ...] relies on the default RangeIndex, where label i equals position i
    df.at[i, "mixtral_summary"] = summary
    print("------------------------")

Processing item 0/142 = 0.0 %
Processing article = ## Keywords: biochemistry and chemical biology, computational and systems biology
## Abstract
Acylation of diverse carbohydrates occurs across all domains of life and can be catalysed by proteins with
OK. Pausing for 10 secs...OK
Summary = Acylation of carbohydrates is a process that happens in all forms of life and is essential for various functions in bacteria, including symbiosis, resistance to viruses and antimicrobials, and biosynt
------------------------
Processing item 1/142 = 0.7 %
Processing article = ## Keywords: computational and systems biology
## Abstract
Honey bee ecology demands they make both rapid and accurate assessments of which flowers are most likely to offer them nectar or pollen . To 
OK. Pausing for 10 secs...OK
Summary = Honey bees have impressive decision-making skills when it comes to choosing flowers for nectar or pollen. They can quickly and accurately assess which flowers are most likely to offer rewards 
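
With a 10-second pause per request, the loop above takes a long time, and everything is lost if the kernel stops before the results are saved. One possible safeguard is to append each finished summary to a JSON Lines file and skip already-finished ids on restart. The sketch below assumes this approach: `load_done_ids`, `append_result` and the checkpoint file are hypothetical and not part of this notebook.

```python
import json
import os


def load_done_ids(checkpoint_path):
    """Return the set of ids already present in a JSON Lines checkpoint file."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def append_result(checkpoint_path, item_id, summary):
    """Append one finished summary to the checkpoint file as a single JSON line."""
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"id": item_id, "mixtral_summary": summary}) + "\n")
```

Inside the loop, one would call `load_done_ids(path)` once before starting, skip any row whose `item["id"]` is in that set, and call `append_result(path, item["id"], summary)` after each request.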

In [15]:
df.head()

Unnamed: 0,article,headings,keywords,id,mixtral_summary
0,Acylation of diverse carbohydrates occurs acro...,"[Abstract, Introduction, Results and discussio...","[biochemistry and chemical biology, computatio...",elife-81547-v1,Acylation of carbohydrates is a process that h...
1,Honey bee ecology demands they make both rapid...,"[Abstract, Introduction, Results, Discussion, ...",[computational and systems biology],elife-86176-v2,Honey bees have impressive decision-making ski...
2,"Biguanides , including the world’s most prescr...","[Abstract, Introduction, Results, Discussion, ...",[genetics and genomics],elife-82210-v1,"Metformin, a commonly prescribed drug for type..."
3,Ecological relationships between bacteria medi...,"[Abstract, Introduction, Results, Discussion, ...","[microbiology and infectious disease, ecology]",elife-83152-v2,Gut microbiomes are diverse and dynamic commun...
4,Gamma oscillations are believed to underlie co...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-83044-v2,Gamma oscillations are a type of brain wave th...


In [16]:
# check how many rows have blank result (some errors)
empty_df = df.query("mixtral_summary.str.strip() == ''")
print(len(empty_df))

0
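
A non-empty summary is not necessarily within the 250 to 500 words requested in the prompt. A simple length check can flag the ones that are not. This is a minimal sketch: `words_out_of_range` is a hypothetical helper, it counts whitespace-separated words, and the sample strings are made up.

```python
def words_out_of_range(summaries, min_words=250, max_words=500):
    """Return (index, word_count) pairs for summaries outside the requested range."""
    flagged = []
    for index, summary in enumerate(summaries):
        count = len(summary.split())
        if not min_words <= count <= max_words:
            flagged.append((index, count))
    return flagged


# Made-up strings standing in for model output.
samples = ["word " * 300, "too short", "word " * 600]
print(words_out_of_range(samples))  # [(1, 2), (2, 600)]
```

On the real results this could be called as `words_out_of_range(df["mixtral_summary"].tolist())`.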


In [17]:
# attempt to retry for blank rows (due to some errors)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in range(len(df)):
        item = df.iloc[i]
        if item["mixtral_summary"] == "":
            print("Item =", i)
            # text = item["article"]
            text = build_text_with_headings(item)
            # send_sumarize_request already truncates to max_char, so no extra cap is needed here
            df.at[i, "mixtral_summary"] = send_sumarize_request(text, quiet=False)
    print("Completed")

Retrying for empty results...
Completed
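
The retry cell makes a single extra pass over the blank rows. If errors were more frequent, a per-item retry with a growing delay would be one option. The sketch below assumes that a failed request returns an empty string, as `send_sumarize_request` does; `call_with_retries` and its parameters are hypothetical.

```python
import time


def call_with_retries(request_fn, text, max_attempts=3, base_delay=10, sleep=time.sleep):
    """Call request_fn(text) until it returns a non-empty string or attempts run out.

    The delay doubles after each failed attempt (10 s, 20 s, ... by default).
    Returns "" if every attempt fails.
    """
    for attempt in range(max_attempts):
        result = request_fn(text)
        if result.strip():
            return result
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return ""
```

A call such as `call_with_retries(send_sumarize_request, text)` would then replace the direct call. Note that `send_sumarize_request` already pauses for `SLEEP_TIME` after every request, so the backoff delay comes on top of that pause.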


In [18]:
output_path = "./data/output/test_set/"
output_filename = "elife_groq_mixtral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  elife_groq_mixtral_summary.csv
Completed


In [19]:
output_filename = "elife_groq_mixtral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  elife_groq_mixtral_summary.json
Completed


In [20]:
# process the PLOS dataset:
dev_df_filename = "../../../data/biolaysumm2024_data/PLOS_test.jsonl"
df = pd.read_json(dev_df_filename,
                  orient="records",
                  lines=True
                 )
df.head()

Unnamed: 0,article,headings,keywords,id
0,Lung-resident ( LR ) mesenchymal stem and stro...,"[Abstract, Introduction, Results, Discussion, ...","[immune system, medical conditions, molecular ...",journal.ppat.1009789
1,Visceral leishmaniasis ( VL ) is endemic in So...,"[Abstract, Introduction, Methods, Results, Dis...","[neonates, clinical laboratory sciences, trans...",journal.pntd.0007992
2,A high burden of Salmonella enterica subspecie...,"[Abstract, Introduction, Methods, Results, Dis...","[pathogens, medical conditions, taxonomy, bact...",journal.pntd.0010704
3,Severe Acute Respiratory Syndrome Coronavirus-...,"[Abstract, Introduction, Results, Discussion, ...","[pathogens, amniotes, medical conditions, bind...",journal.ppat.1010691
4,Many fungal species utilize hydroxyderivatives...,"[Abstract, Introduction, Results and discussio...","[taxonomy, proteins, chemistry, genetics, enzy...",journal.pgen.1009815


In [21]:
# # apply to all rows in eval miniset
# text_cap = 20_000  # temporarily limit to 20k characters due to API restriction, set to -1 for full text

# print("Summarization process started...")
# df["groq_mistral_summary"] = df["article"].apply(lambda text: send_sumarize_request(text[:text_cap], quiet=False))
# print("Completed")

In [22]:
# create empty column
df["mixtral_summary"] = ""
n = len(df)

for i in range(n):
    print(f"Processing item {i}/{n} = {i*100/n:.1f} %")
    item = df.iloc[i]
    text = build_text_with_headings(item)
    print("Processing article =", text[:200])
    summary = send_sumarize_request(text)
    print("Summary =", summary[:200])
    # df.at[i, ...] relies on the default RangeIndex, where label i equals position i
    df.at[i, "mixtral_summary"] = summary
    print("------------------------")

Processing item 0/142 = 0.0 %
Processing article = ## Keywords: immune system, medical conditions, molecular development, health care, developmental biology, cellular types, immunology, pediatrics, respiratory physiology, cell biology, pediatric infec
OK. Pausing for 10 secs...OK
Summary = Lung-resident mesenchymal stem cells (LR-MSCs) play a crucial role in maintaining lung health and regenerating lung tissue after injury. They are found in the alveolar niche and can interact with othe
------------------------
Processing item 1/142 = 0.7 %
Processing article = ## Keywords: neonates, clinical laboratory sciences, transfusion medicine, parasitic diseases, drugs, blood transfusion, developmental biology, women's health, hematology, protozoan infections, pharma
OK. Pausing for 10 secs...OK
Summary = Visceral Leishmaniasis (VL), a parasitic disease, is endemic in South Sudan and can cause severe complications for pregnant women. A study was conducted to describe the characteristics and out
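
The eLife and PLOS loops are identical apart from the input file, so the shared logic could live in one function. The following is a sketch under that assumption: `summarize_records` is a hypothetical helper that works on plain dictionaries and receives the text-building and summarizing functions as arguments.

```python
def summarize_records(records, build_text, summarize, column="mixtral_summary"):
    """Add a summary to every record and return the list of updated records."""
    total = len(records)
    results = []
    for position, record in enumerate(records, start=1):
        print(f"Processing item {position}/{total}")
        summary = summarize(build_text(record))
        # Copy the record so the input list is left unchanged.
        results.append({**record, column: summary})
    return results
```

For either dataset, this could be used as `pd.DataFrame(summarize_records(df.to_dict("records"), build_text_with_headings, send_sumarize_request))`.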

In [23]:
df.head()

Unnamed: 0,article,headings,keywords,id,mixtral_summary
0,Lung-resident ( LR ) mesenchymal stem and stro...,"[Abstract, Introduction, Results, Discussion, ...","[immune system, medical conditions, molecular ...",journal.ppat.1009789,Lung-resident mesenchymal stem cells (LR-MSCs)...
1,Visceral leishmaniasis ( VL ) is endemic in So...,"[Abstract, Introduction, Methods, Results, Dis...","[neonates, clinical laboratory sciences, trans...",journal.pntd.0007992,"Visceral Leishmaniasis (VL), a parasitic disea..."
2,A high burden of Salmonella enterica subspecie...,"[Abstract, Introduction, Methods, Results, Dis...","[pathogens, medical conditions, taxonomy, bact...",journal.pntd.0010704,"Salmonella Typhi, a bacterium that causes typh..."
3,Severe Acute Respiratory Syndrome Coronavirus-...,"[Abstract, Introduction, Results, Discussion, ...","[pathogens, amniotes, medical conditions, bind...",journal.ppat.1010691,"The COVID-19 pandemic, caused by the SARS-CoV-..."
4,Many fungal species utilize hydroxyderivatives...,"[Abstract, Introduction, Results and discussio...","[taxonomy, proteins, chemistry, genetics, enzy...",journal.pgen.1009815,Lay Summary:\nCandida parapsilosis is a type o...
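
Row 4 in the table above shows that the model sometimes prefixes its answer with a `Lay Summary:` label. If that label is unwanted in the final output, it can be stripped afterwards. This is a minimal sketch: `strip_label` is a hypothetical helper, and the example string is made up.

```python
import re

# Matches a leading "Lay Summary:" label (any letter case) plus following whitespace.
_LABEL = re.compile(r"^\s*lay summary\s*:\s*", flags=re.IGNORECASE)


def strip_label(summary):
    """Remove a leading 'Lay Summary:' label and surrounding whitespace, if present."""
    return _LABEL.sub("", summary).strip()


print(strip_label("Lay Summary:\nSome summary text."))  # Some summary text.
```

It could be applied to the whole column with `df["mixtral_summary"] = df["mixtral_summary"].apply(strip_label)`.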


In [24]:
# check how many rows have blank result (some errors)
empty_df = df.query("mixtral_summary.str.strip() == ''")
print(len(empty_df))

0


In [25]:
# attempt to retry for blank rows (due to some errors)
retry = True
if retry:
    print("Retrying for empty results...")
    for i in range(len(df)):
        item = df.iloc[i]
        if item["mixtral_summary"] == "":
            print("Item =", i)
            # text = item["article"]
            text = build_text_with_headings(item)
            df.at[i, "mixtral_summary"] = send_sumarize_request(text, quiet=False)
    print("Completed")

Retrying for empty results...
Completed


In [26]:
output_path = "./data/output/test_set/"
output_filename = "plos_groq_mixtral_summary.csv"

print("Writing to file ", output_filename)
df.to_csv(output_path+output_filename,
          index = False
         )
print("Completed")

Writing to file  plos_groq_mixtral_summary.csv
Completed


In [27]:
output_filename = "plos_groq_mixtral_summary.json"

print("Writing to file ", output_filename)
df.to_json(output_path+output_filename,
           orient="records",
           )
print("Completed")

Writing to file  plos_groq_mixtral_summary.json
Completed
