## 01_Mixtral_Mini_Dev_Set
In this notebook, we load the mini development set and use the Mixtral 8x7B model on a Colab A100 GPU to process (summarize and simplify) the article text.

In [1]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
# load the mini dev set (PLOS; the eLife paths are kept below as commented-out alternatives)
import pandas as pd

# df = pd.read_json("./drive/MyDrive/531/milestone5/data/mini_dev_set/eLife_val.jsonl",
#                   orient="records",
#                   lines=True)

df = pd.read_json("./drive/MyDrive/531/milestone5/data/mini_dev_set/PLOS_val.jsonl",
                  orient="records",
                  lines=True)

# df = pd.read_json("./data/eLife_val.jsonl",
#                   orient="records",
#                   lines=True)

df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"Yersinia pestis , the bacterial agent of plagu...",Fleas can transmit Yersinia pestis by two mech...,"[Abstract, Introduction, Results, Discussion, ...","[united states, invertebrates, medicine and he...",journal.ppat.1006859
1,The genome of all vertebrates is heavily colon...,Endogenous retroviruses ( ERVs ) are remnants ...,"[Abstract, Introduction, Results, Discussion, ...","[viruses, sheep, virology]",journal.ppat.0030170
2,The molecular mechanisms underlying directed c...,The Drosophila embryonic gonad is assembled fr...,"[Abstract, Introduction, Results, Discussion, ...",[],journal.pgen.1003720
3,Contrary to the long-standing belief that no n...,"Recently , we presented a study of adult neuro...","[Abstract, Introduction, Model, Results, Discu...",[computational biology/computational neuroscie...,journal.pcbi.1001063
4,Embryonic stem cells have two remarkable prope...,Understanding the transcriptional regulation o...,"[Abstract, Introduction, Results, Discussion, ...","[developmental biology, cell biology, mammals,...",journal.pgen.0030145


In [18]:
# take some item to inspect
k = 5
item = df.iloc[k]
print("Lay summary = ", item["lay_summary"])
print("Full text article =", item["article"])

Lay summary =  The current model of HCV egress is that virions assemble at lipid droplets , envelope at the ER and then likely exit the hepatocyte via the secretory pathway in association with apolipoproteins . To gain a more detailed insight into infectious HCV release , we combined an RNAi analysis of host factors that are required for infectious HCV secretion with live cell imaging of HCV core trafficking . Using this approach , we identified numerous components of the secretory pathway that are both required for infectious HCV release and co-traffic with HCV core . The dynamics of HCV core trafficking , both in terms of frequency of transport , particle velocity , and the corresponding run lengths were quantified . We observe that dynamic core movements in the periphery require NS2 , a viral protein required for virion assembly . Core co-traffics with multiple components of the secretory pathway , including the Golgi , recycling endosome , microtubules , VAMP1 secretory vesicles , 

In [19]:
def build_text_with_headings(item):
  """
    Return the article text with its section headings embedded.
    Each paragraph is prefixed with a "## <heading>" line. If the number of
    paragraphs and headings differ, the raw article text is returned unchanged.
  """
  result = ""
  paras = item["article"].split("\n")
  headings = item["headings"]
  if len(paras) != len(headings):
    print("Error: number of paragraphs and headings do not match")
    return item["article"]

  for heading, paragraph in zip(headings, paras):
    result += f"## {heading}\n{paragraph}\n\n"

  return result


In [20]:
print(build_text_with_headings(item))

## Abstract
The current model of hepatitis C virus ( HCV ) production involves the assembly of virions on or near the surface of lipid droplets , envelopment at the ER in association with components of VLDL synthesis , and egress via the secretory pathway . However , the cellular requirements for and a mechanistic understanding of HCV secretion are incomplete at best . We combined an RNA interference ( RNAi ) analysis of host factors for infectious HCV secretion with the development of live cell imaging of HCV core trafficking to gain a detailed understanding of HCV egress . RNAi studies identified multiple components of the secretory pathway , including ER to Golgi trafficking , lipid and protein kinases that regulate budding from the trans-Golgi network ( TGN ) , VAMP1 vesicles and adaptor proteins , and the recycling endosome . Our results support a model wherein HCV is infectious upon envelopment at the ER and exits the cell via the secretory pathway . We next constructed infectiou

In [6]:
!pip install bitsandbytes



In [7]:
!pip install transformers
!pip install accelerate



In [8]:
!pip install  flash_attn



In [9]:
# load Mixtral 8x7B, 4-bit
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1" # this probably works on A100
tokenizer = AutoTokenizer.from_pretrained(model_id)

# the following WORKS with V100, and T4 but very slow without 4_bit
# model = AutoModelForCausalLM.from_pretrained(model_id,
#                                              load_in_4bit=True,
#                                              device_map="auto")

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             torch_dtype=torch.float16,
                                             use_flash_attention_2=True,
                                             device_map="auto")  # flash attention



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
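
Both warnings above name their replacements: a `BitsAndBytesConfig` passed as `quantization_config`, and `attn_implementation="flash_attention_2"`. A minimal sketch of the same load with those arguments follows. The function name `load_mixtral_4bit` is hypothetical, the float16 compute dtype is an assumption chosen to match `torch_dtype` in the cell above, and the imports sit inside the function so that defining it does not require a GPU.

```python
def load_mixtral_4bit(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Load tokenizer and 4-bit model without the deprecated keyword arguments.

    Hypothetical helper; it mirrors the settings of the cell above.
    """
    # Imported lazily so the definition itself has no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit weights via bitsandbytes; float16 compute dtype is an assumption.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,         # replaces load_in_4bit=True
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
        device_map="auto",
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_mixtral_4bit()` should give the same objects as the cell above; it still needs `bitsandbytes`, `accelerate` and `flash_attn` installed.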



In [10]:
# test if the model works correctly
messages = [
    {"role": "user", "content": "What is the capital city of British Columbia? Answer in 1 sentence"},
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] What is the capital city of British Columbia? Answer in 1 sentence [/INST] The capital city of British Columbia, a province in Canada, is Victoria.
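
The warning about the attention mask and pad token appears because only `input_ids` is passed to `generate`. A possible way to silence it, sketched under the assumption that the installed `transformers` version supports `return_dict=True` in `apply_chat_template`, is to request the full encoding and pass `pad_token_id` explicitly. The helper name `generate_with_mask` is hypothetical.

```python
def generate_with_mask(model, tokenizer, messages, max_new_tokens=1000):
    """Generate a reply while passing attention_mask and pad_token_id explicitly.

    Hypothetical helper; `model` and `tokenizer` are the objects loaded earlier.
    """
    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    outputs = model.generate(
        **inputs,                              # input_ids + attention_mask
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,   # the value the warning falls back to
    )
    # Same decoding as in the notebook: prompt and reply together.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

The return value still includes the `[INST] ... [/INST]` prompt, so the filtering step in the next cell applies unchanged.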


In [11]:
# keep only the model's reply, i.e. the text after the closing [/INST] tag
# text = tokenizer.decode(outputs[0], skip_special_tokens=True)
s = "[INST] What is the capital city of British Columbia? Answer in 1 sentence [/INST] The capital city of British Columbia, a province in Canada, is Victoria."
start = s.find('[/INST]') + len('[/INST]')  # named `start` to avoid shadowing the built-in `id`
result = s[start:]
print(result)

 The capital city of British Columbia, a province in Canada, is Victoria.
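
One caveat of the `find`-based slicing: if the marker is missing, `find` returns -1 and the slice silently starts at index 6, dropping the first six characters. A small sketch of a more forgiving variant follows; `extract_reply` is a hypothetical name, and it assumes the reply itself never contains the literal marker.

```python
def extract_reply(decoded, marker="[/INST]"):
    """Return the text after the last `marker`, stripped of surrounding whitespace.

    If the marker is absent, the whole input is returned (stripped),
    because str.rpartition then places the full string in the last slot.
    """
    return decoded.rpartition(marker)[2].strip()


print(extract_reply("[INST] What is 1 + 1? [/INST] It is 2."))
```

Unlike the cell above, this also strips the leading space that follows `[/INST]`.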


In [21]:
# test with new prompt
# max_char = 20000
# prompt = f"Simplify and summarize in minimum 250 to maximum 500 words, combine answer into 1 paragraph, keep important factual details. Also include key facts in 1-5 concise sentences: {s[:max_char]}"
# prompt = f"Simplify and summarize in 200 to 300 words:  {s[:max_char]}"

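# def summarize(article, max_char=20000):
def summarize(article, max_char=None):
  """
    Summarize the article text with Mixtral.
    If max_char is given, only the first max_char characters of the article
    are sent to the model; the default (None) sends the full text.
    (A default of -1 would silently drop the last character.)
  """
  print("Processing text =", article[:100])
  print("---------------")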
  lay_summary_example = "Over 90% of adults around the world are infected with the Epstein-Barr virus . Like other closely related viruses , such as those that cause chicken pox and cold sores , an infection lasts for the rest of the person’s life , although the virus generally remains in a latent or dormant state . However , under certain conditions the latent viruses can cause cancers to develop; in fact , it is estimated that such infections are responsible for nearly 2% of all cancer deaths worldwide . One way that healthy human cells prevent cancer is by triggering their own death in a process called apoptosis . The Epstein-Barr virus can block apoptosis , therefore making the cells more likely to become cancerous . Previous research identified one protein in the Epstein-Barr virus that promotes cancer by preventing infected cells from dying as normal . However , even in the absence of this protein , Epstein-Barr virus-infected cells remain resistant to apoptosis . This suggests that the virus has another way of blocking cell death . Price et al . have now used a technique that stresses living cells in a way that reveals which proteins prevent apoptosis to study human cells infected with the Epstein-Barr virus . This revealed that soon after infection , the virus could force the human cell to produce MCL-1 , a protein that prevents cell death . Later , the Epstein-Barr virus enlisted a second human protein called BFL-1 , which makes the infected cell further resistant to apoptosis . Price et al . discovered that a protein in the Epstein-Barr virus called EBNA3A controls the production of the MCL-1 and BFL-1 proteins . In the future , developing therapies that target these proteins may lead to new treatments for cancers caused by the Epstein-Barr virus . Such treatments would be likely to have fewer side effects for patients than traditional chemotherapies ."
  prompt = f"The following text is from a medical research paper, which contains medical terms. The section headings have ## tag before it. Pay special attention to 'Abstract', 'Introduction' and 'Conclusion' sections. Simplify and summarize in minimum 250 to maximum 500 words, combine answer into 1 paragraph, keeping important facts. One good example of lay summary is '{lay_summary_example}'. Now do a lay summary on this text: {article[:max_char]}"
  messages = [
      {"role": "user", "content": prompt},
  ]

  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

  outputs = model.generate(input_ids, max_new_tokens=1000)
  output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  # keep only the text after the closing [/INST] tag
  start = output_text.find('[/INST]') + len('[/INST]')  # avoid shadowing the built-in `id`
  result = output_text[start:]

  return result
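
`max_char` limits the input by characters, while the model's limit is counted in tokens. As a sketch of an alternative, the helper below trims by token count using the tokenizer's `encode` and `decode` methods. The name `truncate_to_tokens` and the idea of a token budget are assumptions, not part of the original notebook.

```python
def truncate_to_tokens(tokenizer, text, max_tokens):
    """Return `text` cut to at most `max_tokens` tokens (unchanged if already shorter).

    Hypothetical helper; `tokenizer` is any Hugging Face tokenizer.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return text
    # Decoding a token prefix may not reproduce the original whitespace exactly.
    return tokenizer.decode(ids[:max_tokens], skip_special_tokens=True)
```

Inside `summarize`, something like `truncate_to_tokens(tokenizer, article, 24000)` could replace `article[:max_char]`. The 24,000 figure is only an illustrative budget that leaves room for the prompt, the example summary and the generated tokens.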


In [22]:
# test function
text = build_text_with_headings(item)
print("--------------")
print("Summary =\n", summarize(text))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


--------------
Processing text = ## Abstract
The current model of hepatitis C virus ( HCV ) production involves the assembly of virio
---------------
Summary =
  Hepatitis C virus (HCV) is a member of the Flaviviridae family that causes chronic liver infections in humans. The virus has a single-stranded positive RNA genome that encodes for structural and non-structural proteins. The structural proteins include core and the glycoproteins E1 and E2, while the non-structural proteins include p7, NS2, NS3, NS4A, NS4B, NS5A, and NS5B. The virus infects hepatocytes in the liver and uses several host factors for entry, replication, and egress.

A recent study combined RNA interference (RNAi) analysis and live cell imaging to understand the cellular requirements and mechanisms of HCV egress. The RNAi analysis identified multiple components of the secretory pathway, including ER to Golgi trafficking, lipid and protein kinases, and the recycling endosome, that are required for HCV egress. The st

In [23]:
# create an empty column for the generated summaries
df["mixtral_summary"] = ""

n = len(df)
max_row = min(50, n)  # for dev; set to n for the full set

for i in range(max_row):
  print("Processing item i =", i)
  item = df.iloc[i]
  text = build_text_with_headings(item)
  summary = summarize(text)
  print("Summary =", summary[:200])
  df.at[df.index[i], "mixtral_summary"] = summary
  print("------------------------")
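
Each summary takes a while to generate, and a Colab session can disconnect mid-loop. A minimal sketch of a resumable variant, using only the standard library, is shown here. The function name `summarize_with_checkpoint`, the JSONL layout and the checkpoint path are all assumptions; `summarize_fn` stands in for `summarize`.

```python
import json
import os


def summarize_with_checkpoint(records, summarize_fn, path):
    """Summarize (record_id, text) pairs, appending one JSON line per result.

    Ids already present in `path` are skipped, so the loop can be re-run
    after an interruption. Returns a dict {record_id: summary}.
    """
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                done[row["id"]] = row["summary"]

    with open(path, "a", encoding="utf-8") as f:
        for record_id, text in records:
            if record_id in done:
                continue  # finished in an earlier run
            summary = summarize_fn(text)
            f.write(json.dumps({"id": record_id, "summary": summary}) + "\n")
            f.flush()  # make the result durable before moving on
            done[record_id] = summary
    return done
```

With the DataFrame above, `records` could be built from the `id` column and `build_text_with_headings` applied to each row, and the returned dict mapped back into `mixtral_summary`.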

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Processing item i = 0
Processing text = ## Abstract
Fleas can transmit Yersinia pestis by two mechanisms , early-phase transmission ( EPT ) 
---------------


Summary =  Fleas can transmit the bacteria Yersinia pestis, which causes plague, through two methods: early-phase transmission (EPT) and biofilm-dependent transmission (BDT). EPT occurs when fleas feed on an in
------------------------
Processing item i = 1
Processing text = ## Abstract
Endogenous retroviruses ( ERVs ) are remnants of ancient retroviral infections of the ho
---------------


Summary =  This research paper studies the coevolution of endogenous and exogenous retroviruses in sheep. The researchers isolated and characterized 27 endogenous retroviruses (ERVs) in the sheep genome, which 
------------------------
Processing item i = 2
Processing text = ## Abstract
The Drosophila embryonic gonad is assembled from two distinct cell types , the Primordia
---------------


Summary =  The Drosophila embryonic gonad is formed from two types of cells, Primordial Germ Cells (PGCs) and Somatic Gonadal Precursor cells (SGPs). The PGCs form at the posterior of the blastoderm stage embry
------------------------
Processing item i = 3
Processing text = ## Abstract
Recently , we presented a study of adult neurogenesis in a simplified hippocampal memory
---------------


Summary =  The research presents a study of adult neurogenesis in a simplified hippocampal memory model. The model is extended to include realistic, spatially driven input firing patterns in the form of grid ce
------------------------
Processing item i = 4
Processing text = ## Abstract
Understanding the transcriptional regulation of pluripotent cells is of fundamental inte
---------------


Summary =  This research paper analyzes the transcriptional profiles of mouse embryonic stem (ES) cells and primordial germ cells to identify genes upregulated in pluripotent cells. A novel computational algori
------------------------
Processing item i = 5
Processing text = ## Abstract
The current model of hepatitis C virus ( HCV ) production involves the assembly of virio
---------------


Summary =  Hepatitis C virus (HCV) is a member of the Flaviviridae family that causes chronic liver infections in humans. The virus has a single-stranded positive RNA genome that encodes for structural and non-
------------------------
Processing item i = 6
Processing text = ## Abstract
Secondary amphiphilicity is inherent to the secondary structural elements of proteins . 
---------------


Summary =  This research paper uses molecular dynamics simulations to study the conformational behavior of two synthetic peptides, LK and EALA, with built-in secondary amphiphilicity. The study finds that these
------------------------
Processing item i = 7
Processing text = ## Abstract
Herein , we studied a virulent isolate of the leading bacterial pathogen Streptococcus p
---------------


Summary =  This research uses an infant mouse model to study the bacterial pathogen Streptococcus pneumoniae, also known as the pneumococcus, and its interaction with the influenza A virus (IAV). The researcher
------------------------
Processing item i = 8
Processing text = ## Abstract
HIV is known to spread efficiently both in a cell-free state and from cell to cell , how
---------------


Summary =  HIV can spread efficiently from cell to cell through direct contact between infected and uninfected cells, and this mode of transmission may be important for the virus to maintain infection in vivo. 
------------------------
Processing item i = 9
Processing text = ## Abstract
An increasing number of genetic variants have been identified for many complex diseases 
---------------


Summary =  This research paper discusses the use of genomic profiles for risk prediction of complex diseases. The authors propose a new statistical framework that connects various predictive indices and demonst
------------------------
Processing item i = 10
Processing text = ## Abstract
The Saccharomyces cerevisae RAD3 gene is the homolog of human XPD , an essential gene en
---------------


Summary =  The Saccharomyces cerevisae RAD3 gene is the equivalent of the human XPD gene, which is involved in both nucleotide excision repair (NER) and transcription. Some mutant RAD3 alleles, such as rad3-101
------------------------
Processing item i = 11
Processing text = ## Abstract
A budget proposal to stop the U . S . Centers for Disease Control and Prevention ( CDC )
---------------


Summary =  The proposed budget cut to the US Centers for Disease Control and Prevention (CDC) for surveillance and research of mosquito-borne diseases, such as dengue and West Nile virus, could have serious con
------------------------
Processing item i = 12
Processing text = ## Abstract
Insulator or enhancer-blocking elements are proposed to play an important role in the re
---------------


Summary =  The research paper uses the chromatin immunopurification (ChIP) method to identify the binding sites of the CTCF protein in the Drosophila genome. The researchers used a CTCF-specific antibody to imm
------------------------
Processing item i = 13
Processing text = ## Abstract
Ebolaviruses , highly lethal zoonotic pathogens , possess longer genomes than most other
---------------


Summary =  Ebolaviruses are deadly viruses that cause lethal hemorrhagic fever in humans and are a concern as potential bioterrorism agents. There are currently no approved treatments for these infections. The 
------------------------
Processing item i = 14
Processing text = ## Abstract
Studies of the furious and paralytic forms of canine rabies at the early stage of diseas
---------------


Summary =  Rabies is a deadly virus that attacks the nervous system and is usually transmitted through the bite of an infected animal. The disease has two clinical forms: furious and paralytic. In a study of na
------------------------
Processing item i = 15
Processing text = ## Abstract
Replication fork integrity , which is essential for the maintenance of genome stability 
---------------


Summary =  The research paper is about the role of 14-3-3 proteins in maintaining genome stability during DNA replication. When DNA replication is stressed, 14-3-3 proteins help to prevent chromosome breaks and
------------------------
Processing item i = 16
Processing text = ## Abstract
Polyarthritis and rash caused by Sindbis virus ( SINV ) , was first recognised in northe
---------------


Summary =  Sindbis virus (SINV) is a mosquito-borne virus that causes polyarthritis and rash, known as Ockelbo disease in Sweden and Pogosta disease in Finland. It is mainly found in tropical and sub-tropical c
------------------------
Processing item i = 17
Processing text = ## Abstract
With the post-genomic era came a dramatic increase in high-throughput technologies , of 
---------------


Summary =  This research uses the lab worm Caenorhabditis elegans to study how native soil nematodes interact with their environment. The researchers isolated bacteria from grassland prairie soils and used C. e
------------------------
Processing item i = 18
Processing text = ## Abstract
Centromeres are the attachment points between the genome and the cytoskeleton: centromer
---------------


Summary =  The research paper is about the epigenetic marking system in centromeres, the attachment points between the genome and the cytoskeleton. The DNA sequence of centromeres has little role in perpetuatin
------------------------
Processing item i = 19
Processing text = ## Abstract
Humans are a diploid species that inherit one set of chromosomes paternally and one homo
---------------


Summary =  The text is a research paper about phasing human genomes, which is the process of determining the unique nucleotide content of each of the two chromosomes in the 23 pairs that make up a human genome.
------------------------
Processing item i = 20
Processing text = ## Abstract
In contrast to HIV infection in humans and SIV in macaques , SIV infection of natural ho
---------------


Summary =  Sooty mangabeys (SM) are natural hosts of the simian immunodeficiency virus (SIV) that doesn't cause disease in them. A study found that 8% of SM have a mutation in the CCR5 gene, which encodes a pro
------------------------
Processing item i = 21
Processing text = ## Abstract
Individuals choose their mates so as to maximize reproductive success , and one importan
---------------


Summary =  In this study, researchers investigated the link between insulin signaling, reproduction, and attractiveness in fruit flies. They found that global activation of insulin signaling increases attractiv
------------------------
Processing item i = 22
Processing text = ## Abstract
Breast cancers that are “triple-negative” for the clinical markers ESR1 , PGR , and HER2
---------------


Summary =  This research paper discusses the connection between breast cancer and the inactivation of three tumor suppressor pathways: Rb, p53, and BRCA1. The researchers created a mouse model with combined ina
------------------------
Processing item i = 23
Processing text = ## Abstract
Pathogen-associated secretion systems translocate numerous effector proteins into eukary
---------------


Summary =  Legionella pneumophila, a bacterium that causes a severe pneumonia called Legionnaires' disease, uses a type IV secretion system to inject effector proteins into host cells. One of these effectors, L
------------------------
Processing item i = 24
Processing text = ## Abstract
The inference of regulatory interactions and quantitative models of gene regulation from
---------------


Summary =  This research paper uses time-series transcriptomics data to study the FliA-FlgM module of the E. coli motility network. The paper measures the activity of genes involved in this module in real time 
------------------------
Processing item i = 25
Processing text = ## Abstract
Soil-transmitted helminth ( STH ) infections ( i . e . , Ascaris lumbricoides , hookworm
---------------


Summary =  Soil-transmitted helminth (STH) infections, such as Ascaris lumbricoides, hookworm, and Trichuris trichiura, affect more than a billion people worldwide. These infections are more common in developin
------------------------
Processing item i = 26
Processing text = ## Abstract
Prompt post-exposure prophylaxis ( PEP ) is essential in preventing the fatal onset of d
---------------


Summary =  Rabies is a fatal disease that can be prevented through prompt and appropriate post-exposure prophylaxis (PEP). However, in many low-income countries, life-saving rabies vaccines are not always avail
------------------------
Processing item i = 27
Processing text = ## Abstract
Uganda has active foci of both chronic and acute HAT with the acute zoonotic form of dis
---------------


Summary =  In Uganda, there are active foci of both chronic and acute Human African Trypanosomiasis (HAT), a deadly disease spread by the tsetse fly. The acute form of the disease, caused by T. b. rhodesiense, 
------------------------
Processing item i = 28
Processing text = ## Abstract
Dietary restriction ( DR ) extends lifespan in various species and also slows the onset 
---------------


Summary =  Dietary restriction (DR) can extend lifespan in various species and delay age-related diseases. The TOR pathway is essential for longevity phenotypes resulting from DR in flies, yeast, and worms. TOR
------------------------
Processing item i = 29
Processing text = ## Abstract
The unique capability of acetogens to ferment a broad range of substrates renders them i
---------------


Summary =  Acetogens are ideal for sustainable bioproduction due to their ability to grow on H2, CO2, or syngas. However, advanced design strategies for acetogens are limited by incomplete knowledge of their ph
------------------------
Processing item i = 30
Processing text = ## Abstract
Arabidopsis thaliana cryptochrome 2 ( CRY2 ) mediates light control of flowering time . 
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Arabidopsis thaliana cryptochrome 2 (CRY2) regulates flowering time by mediating the light control of flowering time. CRY2 interacts with CIB1 (CRY2-interacting bHLH 1) in response to blue light to a
------------------------
Processing item i = 31
Processing text = ## Abstract
Enteric bacterial pathogens cause food borne disease , which constitutes an enormous eco
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Enterohemorrhagic Escherichia coli (EHEC) causes severe diarrhea and potentially fatal kidney disease in humans. EHEC and other enteric pathogens use a type III secretion system (T3SS) to inject viru
------------------------
Processing item i = 32
Processing text = ## Abstract
High-altitude hypoxia ( reduced inspired oxygen tension due to decreased barometric pres
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  High-altitude hypoxia, caused by decreased barometric pressure and oxygen tension, poses significant physiological challenges to human populations living at high altitudes, such as the Andean Altipla
------------------------
Processing item i = 33
Processing text = ## Abstract
Machupo virus ( MACV ) , a New World arenavirus , is the etiological agent of Bolivian h
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The Machupo virus (MACV) is a New World arenavirus that causes Bolivian hemorrhagic fever (BHF), a disease with similar symptoms to Argentine hemorrhagic fever (AHF) caused by Junin virus (JUNV). Bot
------------------------
Processing item i = 34
Processing text = ## Abstract
Serological tests for IgM and IgG are routinely used in clinical laboratories for the ra
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Dengue is a significant public health threat, with 50 to 100 million cases per year and 3 billion people at risk. It is caused by the dengue virus, which has four serotypes (DENV1-4). The virus is tr
------------------------
Processing item i = 35
Processing text = ## Abstract
Toll/interleukin-1 receptor ( TIR ) domains in Toll-like receptors are essential for ini
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The text is a medical research paper about the Toll-like receptors (TLRs) and their role in the innate immune system. TLRs are part of the Toll-like receptor/interleukin-1 receptor (TLR/IL-1R) superf
------------------------
Processing item i = 36
Processing text = ## Abstract
Protein modifications play a major role for most biological processes in living organism
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  This research paper analyzes the N-terminal peptides from proteins extracted from Drosophila Kc167 cells to understand the biological function of N-terminal acetylation. The study finds that N-termin
------------------------
Processing item i = 37
Processing text = ## Abstract
Exosomes can transfer genetic materials between cells . Their roles in viral infections 
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Enterovirus 71 (EV71) is a non-enveloped virus that causes hand-foot-and-mouth disease, and can sometimes lead to severe complications such as brain inflammation and lung congestion. A recent study f
------------------------
Processing item i = 38
Processing text = ## Abstract
According to recent experimental evidence , the interaction between chromatin loops , wh
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The text is about a research paper on the role of chromatin loops in gene expression. Chromatin loops are formed when regulatory elements, such as enhancers and promoters, interact with each other. T
------------------------
Processing item i = 39
Processing text = ## Abstract
Predicting the dynamic behavior of a large network from that of the composing modules is
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  This text is a medical research paper about the behavior of gene transcription networks and how they are affected by retroactivity. Retroactivity is a phenomenon where the connection of a module to o
------------------------
Processing item i = 40
Processing text = ## Abstract
The 5-year survival of non-small cell lung cancer patients can be as low as 1% in advanc
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The text is about a research paper on the use of a mechanistic model to estimate the cell killing efficacy of chemotherapy for non-small cell lung cancer (NSCLC) patients. The model takes into accoun
------------------------
Processing item i = 41
Processing text = ## Abstract
The mechanisms and treatment of psychomotor retardation , which includes motor and cogni
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The Allan-Herndon-Dudley syndrome (AHDS) is a genetic disorder that affects brain development and results in severe psychomotor retardation. It is caused by mutations in the monocarboxylate transport
------------------------
Processing item i = 42
Processing text = ## Abstract
Natural selection drives populations towards higher fitness , but crossing fitness valle
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  This research investigates the impact of population subdivision on the time it takes for a population to cross a fitness valley or plateau in a rugged fitness landscape. The researchers use a simple 
------------------------
Processing item i = 43
Processing text = ## Abstract
The nuclear pore complex ( NPC ) regulates molecular traffic across the nuclear envelope
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The nuclear pore complex (NPC) is a crucial structure in the cell that regulates the transport of molecules across the nuclear envelope. Though the NPC's structure is highly conserved across species,
------------------------
Processing item i = 44
Processing text = ## Abstract
Staphylococcus aureus is an opportunistic pathogen that colonizes the skin and mucosal s
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Staphylococcus aureus is a bacterium that can cause various diseases, and it forms biofilms that are resistant to host immune responses and chemotherapies. Biofilms are made up of proteins, polysacch
------------------------
Processing item i = 45
Processing text = ## Abstract
The Type II Secretion System ( T2SS ) is a molecular machine that drives the secretion o
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The Type II Secretion System (T2SS) is a molecular machine used by certain bacteria to secrete proteins through their outer membrane. It consists of three main forms, each with a different sequence o
------------------------
Processing item i = 46
Processing text = ## Abstract
Burkholderia pseudomallei is a mostly saprophytic bacterium , but can infect humans wher
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Burkholderia pseudomallei is a bacterium that causes melioidosis, a serious disease with high mortality rates even with proper diagnosis and treatment. The bacterium is considered an emerging pathoge
------------------------
Processing item i = 47
Processing text = ## Abstract
Offspring of Schistosoma mansoni-infected women in schistosomiasis-endemic areas may be 
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  Schistosomiasis is a parasitic infection that can affect pregnant women and their unborn children. A study in Uganda examined the effects of praziquantel treatment of Schistosoma mansoni during pregn
------------------------
Processing item i = 48
Processing text = ## Abstract
Hox proteins play fundamental roles in controlling morphogenetic diversity along the ant
---------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Summary =  The research paper is about the regulation of the cell death gene, reaper (rpr), by the Hox protein, Deformed (Dfd), in Drosophila. The researchers found that Dfd regulates rpr expression in a highly
------------------------
Processing item i = 49
Processing text = ## Abstract
The antiproliferative response to anticancer treatment is the result of concurrent respo
---------------
Summary =  This research uses time-lapse live cell microscopy and flow cytometry to study how cancer cells respond to X-ray treatment. The researchers built a computational model to simulate the process of prol
------------------------
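
The warning repeated throughout the output above is emitted when `generate` is called without an `attention_mask` and without an explicit `pad_token_id`. The generation cell is not part of this section, so the following is only a sketch of one possible way to pass both explicitly, assuming a Hugging Face tokenizer and model. `build_generate_kwargs` is a hypothetical helper and `max_new_tokens=512` is an assumed value; neither comes from the notebook.

```python
def build_generate_kwargs(tokenizer, encoded, max_new_tokens=512):
    """Collect explicit keyword arguments for model.generate().

    `encoded` is the mapping returned by tokenizer(prompt, return_tensors="pt"),
    which carries both `input_ids` and `attention_mask`.
    """
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        # Fall back to the EOS token, which is what the log shows
        # generate() doing on its own ("Setting pad_token_id to eos_token_id").
        pad_id = tokenizer.eos_token_id
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "pad_token_id": pad_id,
        "max_new_tokens": max_new_tokens,  # assumed value, adjust as needed
    }


# Usage (assumed variable names; not executed here):
# encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(**build_generate_kwargs(tokenizer, encoded))
```

Passing the mask explicitly should not change the summaries for single, unpadded prompts, but it removes the ambiguity the warning describes and keeps the log readable.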


In [24]:
# write output: save the DataFrame with the generated summaries as CSV
# filename = "./data/eLife_val_output.csv"
filename = "./data/plos_val_output.csv"

df.to_csv(filename,
          index=False)

In [30]:
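# write txt file: one summary per line
filename_txt = "./data/plos.txt"

# clean text -> 1 paragraph
# collapse newlines and repeated whitespace into single spaces (replacing "\n"
# with "" would glue the words on adjacent lines together), then strip
# surrounding quotes and whitespace
df["txt_summary"] = df["mixtral_summary"].apply(lambda text: (" ".join(text.split())
                                                              .strip('"')
                                                              .strip()
                                                              ))

# write with plain file I/O: to_csv(sep="\n") goes through the csv writer,
# which quotes any summary that still contains a '"' and doubles that quote
with open(filename_txt, "w", encoding="utf-8") as f:
    f.write("\n".join(df["txt_summary"]) + "\n")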
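
Because the text file is meant to hold one summary per line, it may be worth confirming that the number of lines written equals the number of rows in `df`. The following is a minimal sketch of such a check; `count_lines` and `has_one_line_per_row` are hypothetical helpers, not part of the original notebook.

```python
def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def has_one_line_per_row(path, expected_rows):
    """True if the file holds exactly one line per expected row."""
    return count_lines(path) == expected_rows


# Usage (assumed names from the cell above; not executed here):
# assert has_one_line_per_row(filename_txt, len(df))
```

A mismatch would usually point to a summary that still contains a line break, which is exactly what the cleaning step is meant to prevent.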