## 01_Mixtral_Mini_Dev_Set
In this notebook, we load the mini development set and use the Mixtral 8x7B model on a Colab A100 GPU to summarize and simplify the article text.

In [1]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# load mini dev set for elife
import pandas as pd

df = pd.read_json("./drive/MyDrive/531/milestone5/data/mini_dev_set/eLife_val.jsonl",
                  orient="records",
                  lines=True)

# df = pd.read_json("./data/eLife_val.jsonl",
#                   orient="records",
#                   lines=True)

df.head()

Unnamed: 0,lay_summary,article,headings,keywords,id
0,"It can take several months , or even years , f...",Mature neural networks synchronize and integra...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-69011-v2
1,Many of our decisions are made on the basis of...,Many decisions are thought to arise via the ac...,"[Abstract, Introduction, Results, Discussion, ...",[neuroscience],elife-17688-v1
2,Oculo-Cerebro-Renal syndrome of Lowe ( Lowe sy...,Mutations in the inositol 5-phosphatase OCRL c...,"[Abstract, Introduction, Results, Discussion, ...",[cell biology],elife-02975-v2
3,"When an embryo develops , its cells must work ...",Gradients of signaling proteins are essential ...,"[Abstract, Introduction, Results, Discussion, ...",[developmental biology],elife-38137-v3
4,Our genomes contain a record of historical eve...,Similarity between two individuals in the comb...,"[Abstract, Introduction, Results, Discussion, ...","[evolutionary biology, genetics and genomics]",elife-15266-v1


In [3]:
# take some item to inspect
k = 5
item = df.iloc[k]
print("Lay summary = ", item["lay_summary"])
print("Full text article =", item["article"])

Lay summary =  Over 90% of adults around the world are infected with the Epstein-Barr virus . Like other closely related viruses , such as those that cause chicken pox and cold sores , an infection lasts for the rest of the person’s life , although the virus generally remains in a latent or dormant state . However , under certain conditions the latent viruses can cause cancers to develop; in fact , it is estimated that such infections are responsible for nearly 2% of all cancer deaths worldwide . One way that healthy human cells prevent cancer is by triggering their own death in a process called apoptosis . The Epstein-Barr virus can block apoptosis , therefore making the cells more likely to become cancerous . Previous research identified one protein in the Epstein-Barr virus that promotes cancer by preventing infected cells from dying as normal . However , even in the absence of this protein , Epstein-Barr virus-infected cells remain resistant to apoptosis . This suggests that the vi

In [4]:
def build_text_with_headings(item):
  """
  Return the article text with each section heading embedded as a "## " line.

  Falls back to the raw article text if the number of paragraphs does not
  match the number of headings.
  """
  result = ""
  paras = item["article"].split("\n")
  headings = item["headings"]
  if len(paras) != len(headings):
    print("Error: number of paragraphs and headings do not match")
    return item["article"]

  for heading, paragraph in zip(headings, paras):
    result += f"## {heading}\n{paragraph}\n\n"

  return result
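As a quick check of the heading logic, the sketch below runs the same pairing on a toy item. The item contents and the name `embed_headings` are made up for illustration; only the `article` and `headings` keys come from the dataset.

```python
def embed_headings(item):
    """Pair each newline-separated paragraph with its heading (same idea as above)."""
    paras = item["article"].split("\n")
    headings = item["headings"]
    if len(paras) != len(headings):
        # Same fallback as the notebook function: return the raw text unchanged.
        return item["article"]
    return "".join(f"## {h}\n{p}\n\n" for h, p in zip(headings, paras))


# Hypothetical two-section item, not taken from the eLife data.
toy_item = {
    "article": "Short abstract text .\nShort introduction text .",
    "headings": ["Abstract", "Introduction"],
}
toy_output = embed_headings(toy_item)
print(toy_output)

# A mismatched item falls back to the plain article text.
bad_item = {"article": "Only one paragraph .", "headings": ["Abstract", "Results"]}
fallback_output = embed_headings(bad_item)
print(fallback_output)
```

The fallback matters for the prompt used later, which tells the model that headings carry a `##` tag: an item that hits the fallback reaches the model without any tags.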


In [5]:
print(build_text_with_headings(item))

## Abstract
Latent Epstein-Barr virus ( EBV ) infection is causally linked to several human cancers . EBV expresses viral oncogenes that promote cell growth and inhibit the apoptotic response to uncontrolled proliferation . The EBV oncoprotein LMP1 constitutively activates NFκB and is critical for survival of EBV-immortalized B cells . However , during early infection EBV induces rapid B cell proliferation with low levels of LMP1 and little apoptosis . Therefore , we sought to define the mechanism of survival in the absence of LMP1/NFκB early after infection . We used BH3 profiling to query mitochondrial regulation of apoptosis and defined a transition from uninfected B cells ( BCL-2 ) to early-infected ( MCL-1/BCL-2 ) and immortalized cells ( BFL-1 ) . This dynamic change in B cell survival mechanisms is unique to virus-infected cells and relies on regulation of MCL-1 mitochondrial localization and BFL-1 transcription by the viral EBNA3A protein . This study defines a new role for EBN

In [6]:
!pip install bitsandbytes



In [7]:
!pip install transformers
!pip install accelerate



In [8]:
!pip install flash_attn



In [9]:
# load Mixtral 8x7B with 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # should fit on an A100
tokenizer = AutoTokenizer.from_pretrained(model_id)

# the following also works on V100 and T4, but is very slow without 4-bit quantization
# model = AutoModelForCausalLM.from_pretrained(model_id,
#                                              load_in_4bit=True,
#                                              device_map="auto")

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             torch_dtype=torch.float16,
                                             use_flash_attention_2=True,
                                             device_map="auto")  # flash attention



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.


Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]
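The two deprecation warnings above name their own replacements: a `BitsAndBytesConfig` passed as `quantization_config`, and `attn_implementation="flash_attention_2"`. A minimal sketch of the same load with those arguments follows. The helper name `load_mixtral_4bit` is an assumption, `bnb_4bit_compute_dtype` is an optional setting that the original cell does not use, and the imports sit inside the function only so that defining it does not require a GPU runtime.

```python
def load_mixtral_4bit(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Load the tokenizer and a 4-bit quantized model without the deprecated arguments."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Replaces load_in_4bit=True passed directly to from_pretrained.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # assumption: match torch_dtype below
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
        device_map="auto",
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_mixtral_4bit()` would stand in for the loading cell above; it has not been run as part of this notebook.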

In [10]:
# test if the model works correctly
messages = [
    {"role": "user", "content": "What is the capital city of British Columbia? Answer in 1 sentence"},
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] What is the capital city of British Columbia? Answer in 1 sentence [/INST] The capital city of British Columbia, a province in Canada, is Victoria.
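The warning above appears because `generate` receives only `input_ids`. One way to avoid it, sketched here under the assumption that the prompt is a single unpadded sequence, is to pass an all-ones attention mask and an explicit `pad_token_id`. Slicing off the prompt tokens before decoding also returns just the reply. The function name `generate_reply` is hypothetical; `model` and `tokenizer` are the objects loaded earlier.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=1000):
    """Generate a chat reply and return only the newly generated text."""
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    # Single unpadded prompt, so every position is a real token.
    attention_mask = input_ids.new_ones(input_ids.shape)
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning applies
        max_new_tokens=max_new_tokens,
    )
    # A causal LM returns the prompt tokens followed by the reply; keep the reply.
    new_tokens = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Because the prompt is removed at the token level, this variant does not depend on the `[INST]` markers surviving in the decoded string.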


In [11]:
# strip the echoed prompt: keep only the text after [/INST]
# text = tokenizer.decode(outputs[0], skip_special_tokens=True)
s = "[INST] What is the capital city of British Columbia? Answer in 1 sentence [/INST] The capital city of British Columbia, a province in Canada, is Victoria."
start = s.find('[/INST]') + len('[/INST]')  # avoid shadowing the built-in id()
result = s[start:]
print(result)

 The capital city of British Columbia, a province in Canada, is Victoria.
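`str.find` returns -1 when the marker is missing, in which case the slice in the cell above would silently start at index 6 and drop the first characters of the text. A small defensive variant, with the hypothetical name `strip_prompt`, could look like this:

```python
def strip_prompt(decoded, marker="[/INST]"):
    """Return the text after the first occurrence of marker, or the whole text if absent."""
    head, found, reply = decoded.partition(marker)
    # partition leaves `found` empty when the marker is missing;
    # the full input is then in `head`.
    return reply.strip() if found else head.strip()


example = "[INST] What is the capital city of British Columbia? Answer in 1 sentence [/INST] The capital city of British Columbia, a province in Canada, is Victoria."
reply_text = strip_prompt(example)
no_marker_text = strip_prompt("plain text without a marker")
print(reply_text)
```

Unlike the slice above, this version also strips the leading space that follows `[/INST]`.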


In [12]:
# test with new prompt
# max_char = 20000
# prompt = f"Simplify and summarize in minimum 250 to maximum 500 words, combine answer into 1 paragraph, keep important factual details. Also include key facts in 1-5 concise sentences: {s[:max_char]}"
# prompt = f"Simplify and summarize in 200 to 300 words:  {s[:max_char]}"

def summarize(article, max_char=20000):
  """
  Summarize the article text with Mixtral, using at most max_char characters of the article.
  """
  print("Processing text =", article[:100])
  print("---------------")
  lay_summary_example = "Over 90% of adults around the world are infected with the Epstein-Barr virus . Like other closely related viruses , such as those that cause chicken pox and cold sores , an infection lasts for the rest of the person’s life , although the virus generally remains in a latent or dormant state . However , under certain conditions the latent viruses can cause cancers to develop; in fact , it is estimated that such infections are responsible for nearly 2% of all cancer deaths worldwide . One way that healthy human cells prevent cancer is by triggering their own death in a process called apoptosis . The Epstein-Barr virus can block apoptosis , therefore making the cells more likely to become cancerous . Previous research identified one protein in the Epstein-Barr virus that promotes cancer by preventing infected cells from dying as normal . However , even in the absence of this protein , Epstein-Barr virus-infected cells remain resistant to apoptosis . This suggests that the virus has another way of blocking cell death . Price et al . have now used a technique that stresses living cells in a way that reveals which proteins prevent apoptosis to study human cells infected with the Epstein-Barr virus . This revealed that soon after infection , the virus could force the human cell to produce MCL-1 , a protein that prevents cell death . Later , the Epstein-Barr virus enlisted a second human protein called BFL-1 , which makes the infected cell further resistant to apoptosis . Price et al . discovered that a protein in the Epstein-Barr virus called EBNA3A controls the production of the MCL-1 and BFL-1 proteins . In the future , developing therapies that target these proteins may lead to new treatments for cancers caused by the Epstein-Barr virus . Such treatments would be likely to have fewer side effects for patients than traditional chemotherapies ."
  prompt = f"The following text is from a medical research paper, which contains medical terms. The section headings have ## tag before it. Pay special attention to 'Abstract', 'Introduction' and 'Conclusion' sections. Simplify and summarize in minimum 250 to maximum 500 words, combine answer into 1 paragraph, keeping important facts. One good example of lay summary is '{lay_summary_example}'. Now do a lay summary on this text: {article[:max_char]}"
  messages = [
      {"role": "user", "content": prompt},
  ]

  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

  outputs = model.generate(input_ids, max_new_tokens=1000)
  output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  # keep only the text after the echoed prompt
  start = output_text.find('[/INST]') + len('[/INST]')
  result = output_text[start:]

  return result


In [13]:
# test function
text = build_text_with_headings(item)
print("--------------")
print("Summary =\n", summarize(text))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


--------------
Processing text = ## Abstract
Latent Epstein-Barr virus ( EBV ) infection is causally linked to several human cancers 
---------------
Summary =
  The Epstein-Barr virus (EBV) is a widespread human tumor virus that causes various cancers, including lymphomas and carcinomas. EBV expresses viral oncoproteins that promote cell growth and inhibit apoptosis, a programmed cell death process. One such oncoprotein, LMP1, constitutively activates NFκB and is critical for the survival of EBV-immortalized B cells. However, during early infection, EBV induces rapid B cell proliferation with low levels of LMP1 and little apoptosis. Researchers have sought to define the mechanism of survival in the absence of LMP1/NFκB early after infection. They used BH3 profiling to query mitochondrial regulation of apoptosis and found that EBV infection modestly reduces overall mitochondrial priming but changes BCL-2 family dependencies between uninfected and infected cells. Specifically, uninfecte

In [14]:
# create an empty column to hold the generated summaries
df["mixtral_summary"] = ""

for i in range(len(df)):
  print("Processing item i =", i)
  item = df.iloc[i]
  text = build_text_with_headings(item)
  summary = summarize(text)
  print("Summary =", summary[:200])
  df.at[i, "mixtral_summary"] = summary
  print("------------------------")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Processing item i = 0
Processing text = ## Abstract
Mature neural networks synchronize and integrate spatiotemporal activity patterns to sup
---------------
Summary =  This research paper studies the development of neural networks in mice and humans, focusing on the transition from immature to mature network dynamics. In mice, the paper found that mature cortical p
------------------------
Processing item i = 1
Processing text = ## Abstract
Many decisions are thought to arise via the accumulation of noisy evidence to a threshol
---------------
Summary =  The text is a research paper about a study on how people make decisions based on visual stimuli. The study uses a task where people have to decide the direction of motion of a set of dots on a screen
------------------------
Processing item i = 2
Processing text = ## Abstract
Mutations in the inositol 5-phosphatase OCRL cause Lowe syndrome and Dent's disease . Al
---------------
Summary =  The text is about a medical research paper that discusses the role of a protein called OCRL in clathrin-mediated endocytosis, a process by which cells internalize molecules from their environment. Th
------------------------
Processing item i = 3
Processing text = ## Abstract
Gradients of signaling proteins are essential for inducing tissue morphogenesis . Howeve
---------------
Summary =  The research paper studies the formation of a gradient of a protein called Bnl in the Drosophila larval air-sac primordium (ASP), which is important for the development of the tracheal system. The re
------------------------
Processing item i = 4
Processing text = ## Abstract
Similarity between two individuals in the combination of genetic markers along their chr
---------------
Summary =  This study uses genetic data from over 4,000 individuals from 60 different population groups in Africa to analyze the genetic structure and history of African populations. The study found that Africa
------------------------
Processing item i = 5
Processing text = ## Abstract
Latent Epstein-Barr virus ( EBV ) infection is causally linked to several human cancers 
---------------
Summary =  The Epstein-Barr virus (EBV) is a widespread human tumor virus that causes various cancers, including lymphomas and carcinomas. EBV expresses viral oncoproteins that promote cell growth and inhibit a
------------------------
Processing item i = 6
Processing text = ## Abstract
Dynamic post-translational modification of RNA polymerase II ( RNAPII ) coordinates the 
---------------
Summary =  This summary discusses a research study about the C-terminal domain (CTD) of RNA polymerase II (RNAPII), which plays a crucial role in co-transcriptional regulation. The CTD is a structurally disorde
------------------------
Processing item i = 7
Processing text = ## Abstract
Swi2/Snf2 ATPases remodel substrates such as nucleosomes and transcription complexes to 
---------------
Summary =  The research paper is about the Swi2/Snf2 ATPase Mot1, which dissociates the TATA-binding protein (TBP) from DNA in an ATP-dependent manner. The paper reports the crystal structure of the N-terminal 
------------------------
Processing item i = 8
Processing text = ## Abstract
Accurate chromosome segregation depends on coordination between cohesion resolution and 
---------------
Summary =  The research paper is about the role of a protein called Shugoshin-like protein 2 (Sgol2) in the first meiotic division of oocytes, which produces haploid gametes from diploid germ cells. Sgol2 is es
------------------------
Processing item i = 9
Processing text = ## Abstract
Streptococcus pneumoniae is a leading cause of invasive disease in infants , especially 
---------------
Summary =  The text is about a medical research study on the bacterium Streptococcus pneumoniae, which can cause various diseases such as pneumonia, otitis media, and meningitis. The study focuses on the durati
------------------------
Processing item i = 10
Processing text = ## Abstract
C4 photosynthesis has independently evolved from the ancestral C3 pathway in at least 60
---------------
Summary =  The research paper studies the convergent evolution of C4 photosynthesis, a complex trait found in at least 60 independent lineages of angiosperms. The study uses a meta-analysis of 18 lineages conta
------------------------
Processing item i = 11
Processing text = ## Abstract
Host shutoff is a common strategy used by viruses to repress cellular mRNA translation a
---------------
Summary =  The Influenza A virus (IAV) causes significant shutoff of host gene expression, but the mechanisms behind this are not well understood. This study used RNA sequencing and ribosome profiling to explor
------------------------
Processing item i = 12
Processing text = ## Abstract
The transcription factor RpaA is the master regulator of circadian transcription in cyan
---------------
Summary =  The research paper studies the role of a protein called RpaA in the cyanobacterium Synechococcus elongatus PCC7942. RpaA is a key component of the circadian clock, which regulates gene expression in 
------------------------
Processing item i = 13
Processing text = ## Abstract
Coiled coils are the best-understood protein fold , as their backbone structure can uniq
---------------
Summary =  The text is about the discovery of a new type of fiber formed by inserting two or six residues into the heptad repeat of a parallel, trimeric coiled coil. This insertion causes local formation of sho
------------------------
Processing item i = 14
Processing text = ## Abstract
Hemoglobin ( Hb ) represents a model protein to study molecular adaptation in vertebrate
---------------
Summary =  This research paper studies the molecular adaptation of hemoglobin (Hb) in vertebrates. The traditional view is that Hb affinity for oxygen is the primary determinant of molecular adaptation, but the
------------------------
Processing item i = 15
Processing text = ## Abstract
We report a functional switching valve within the female genitalia of the Brazilian cave
---------------
Summary =  The Brazilian cave insect, Neotrogla, has a unique genital structure where the female has a penis-like organ for coercively mating with the male and obtaining nutritious semen. The semen is stored in
------------------------
Processing item i = 16
Processing text = ## Abstract
Previous studies tracking AMPA receptor ( AMPAR ) diffusion at synapses observed a large
---------------
Summary =  This research uses super-resolution microscopy to study the movement of AMPA receptors, which are important for synaptic function in the brain. The researchers found that most AMPA receptors are foun
------------------------
Processing item i = 17
Processing text = ## Abstract
Histone acetylation and deposition of H2A . Z variant are integral aspects of active tra
---------------
Summary =  The research paper is about the DOMINO chromatin regulator complex in Drosophila, which is responsible for histone acetylation and the deposition of the H2A.Z variant in chromatin. The DOMINO complex
------------------------
Processing item i = 18
Processing text = ## Abstract
Membrane nanodomains have been implicated in Ras signaling , but what these domains are 
---------------
Summary =  The research uses a high-resolution imaging technique called spt-PALM to study the diffusion and trafficking of a mutant KRas protein, KRasG12D, on the membrane of U2OS cells. The researchers found t
------------------------
Processing item i = 19
Processing text = ## Abstract
This is an analysis of how magnetic fields affect biological molecules and cells . It wa
---------------
Summary =  The text is a critique of recent research claiming that magnetic fields can affect biological cells and organisms. The author argues that these claims contradict basic laws of physics and are not sup
------------------------
Processing item i = 20
Processing text = ## Abstract
Individuals with congenital amusia have a lifelong history of unreliable pitch processin
---------------
Summary =  Congenital amusia is a condition where individuals have a lifelong history of unreliable pitch processing, leading them to rely on other dimensions such as duration during speech perception. A study 
------------------------
Processing item i = 21
Processing text = ## Abstract
NCOA4 is a selective cargo receptor for the autophagic turnover of ferritin , a process 
---------------
Summary =  This research paper is about NCOA4, a protein involved in the regulation of intracellular iron levels. NCOA4 is responsible for delivering ferritin, a protein that stores iron, to the lysosome for de
------------------------
Processing item i = 22
Processing text = ## Abstract
ISG15 is an interferon-stimulated , linear di-ubiquitin-like protein , with anti-viral a
---------------
Summary =  The research paper investigates the role of ISG15, an interferon-stimulated protein with anti-viral activity, during bacterial infection. The study found that ISG15 expression in nonphagocytic cells 
------------------------
Processing item i = 23
Processing text = ## Abstract
Temporal experience of odor gradients is important in spatial orientation of animals . T
---------------
Summary =  The research paper investigates how early olfactory circuits in fruit flies process temporal variation of olfactory stimuli. The researchers subjected flies to precisely defined odor concentration wa
------------------------
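Each summary takes a long generation call, so a crash or Colab disconnect partway through the loop loses every result held in memory. One hedged option is to append each summary to a JSON Lines file as it is produced and to skip finished ids on restart. The function name `summarize_with_checkpoint`, the checkpoint path, and the stand-in summarizer below are all assumptions for illustration; in the notebook the real `summarize` would be passed in.

```python
import json
import os
import tempfile


def summarize_with_checkpoint(records, summarize_fn, checkpoint_path):
    """Summarize (id, text) pairs, appending each result to a JSONL checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                done[row["id"]] = row["summary"]

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for item_id, text in records:
            if item_id in done:
                continue  # already summarized in an earlier run
            summary = summarize_fn(text)
            f.write(json.dumps({"id": item_id, "summary": summary}) + "\n")
            f.flush()  # keep the file current in case the session dies
            done[item_id] = summary
    return done


# Stand-in summarizer so the sketch runs without a GPU.
calls = []

def fake_summarize(text):
    calls.append(text)
    return text.upper()


path = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
records = [("a", "first"), ("b", "second")]
first_run = summarize_with_checkpoint(records, fake_summarize, path)
second_run = summarize_with_checkpoint(records, fake_summarize, path)  # resumes, no new calls
```

With the dataframe from this notebook, the records could be built from the `id` column and `build_text_with_headings`, and the returned dictionary mapped back onto `df["id"]` to fill `mixtral_summary`.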


In [16]:
# write the dataframe, including the generated summaries, to CSV
filename = "./data/eLife_val_output.csv"
df.to_csv(filename, index=False)