## Setting up the environment and installing necessary libraries

In [1]:
!git clone https://github.com/CarperAI/trlx.git
!git config --global --add safe.directory /content/trlx && cd /content/trlx && pip install -e .

# uninstall scikit_learn + jax to avoid numpy issues
!pip uninstall -y scikit_learn jax

import os

# run within repo
os.chdir('/content/trlx/examples/summarize_rlhf/')
print(os.getcwd())

!pip install -r requirements.txt
!pip install mpi4py

# run within reward model directory
os.chdir('/content/trlx/examples/summarize_rlhf/reward_model/')
print(os.getcwd())

Cloning into 'trlx'...
remote: Enumerating objects: 8089, done.[K
remote: Counting objects: 100% (3366/3366), done.[K
remote: Compressing objects: 100% (764/764), done.[K
remote: Total 8089 (delta 2857), reused 2896 (delta 2600), pack-reused 4723[K
Receiving objects: 100% (8089/8089), 46.84 MiB | 21.32 MiB/s, done.
Resolving deltas: 100% (5563/5563), done.
Obtaining file:///content/trlx
  Installing build dependencies ... [?25l[?25hdone
  Checking if build backend supports build_editable ... [?25l[?25hdone
  Getting requirements to build editable ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing editable metadata (pyproject.toml) ... [?25l[?25hdone
Collecting accelerate>=0.17.1 (from trlx==0.7.0)
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m258.1/258.1 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
Collecting cattrs>=22.2.0 (from trlx==0.7.0)
  Downloading cattrs-2

## Creating a custom dataset

In [2]:
from transformers import pipeline, set_seed
import json

def generate_examples(prompt_list, model_name='gpt2', max_length=50, num_return_sequences=2, seed=42):
    generator = pipeline('text-generation', model=model_name, device=0)
    set_seed(seed)
    examples = []
    for prompt in prompt_list:
        result = generator(prompt, max_length=max_length, num_return_sequences=num_return_sequences)
        example = {'prompt': prompt}
        for i, res in enumerate(result):
            answer = res['generated_text'].lstrip().removeprefix(prompt).strip()
            example[f'answer{i + 1}'] = answer
        examples.append(example)
        print(json.dumps(example, indent=2))
    return examples

In [3]:
prompts = [
    "Among which hypoxia AV 0 difference is max ?",
    "Which of the following needs cholesterol and other lipids for growth?",
    "Reduction potential of potassium is",
    "Drinking can be induced by",
    "Endemic typhus is transmitted by",
    "In Familial hypercholesterolemia there is",
    "Corynebacterum other than diphtheriae carrying toxin:",
    "Which of the following insecticides is a natural Which of the product ?",
    "Radiofrequency ablation treatment is most useful in:",
    "Which of the following will cause Bull's eye retinopathy",
    "Which of the following is lined by transitional epithelium ?",
    "Reversible cause of dementia is ?",
    "Donepezil is used in treatment of which disease?",
    "Rivastigmine is given in which disease?",
    "Alzheimer's disease, which is involved?",
    "Senile plaques in brain is a feature of which disease?",
    "Defect in Amyloid protein folding occurs in which disease?",
    "Regions of 'trinucleotide repeats' are seen in which disease?",
    "Alzheimer's Disease is associated with :",
    "Tau proteins are most commonly associated with ",
    " MMSE is used for the diagnosis of :",
    "The nucleus involved in Alzheimer's disease is -",
    "In Alzheimer's disease, plaque is made up of:",
    "Hirano bodies seen in?",
    "Which drug is not used now in Alzheimer's disease?",
    "Misfolded amyloid deposition in brain is seen in?",
    "Anticholinesterases drugs are used in which disease?",
    "'Silent epidemic' of the century is:",
    "Galantamine is used in ?"
    ]

In [4]:
generated_examples = generate_examples(prompts)

# Save generated examples to import in Label Studio
with open('ls_input_data.json', 'w') as f:
    json.dump(generated_examples, f, indent=2)

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Among which hypoxia AV 0 difference is max ?",
  "answer1": "for the hypoxiologic hypothesis, and a mean difference in hypoxial concentrations between subjects with no hypoxia (18 \u03bcg/dL and 0.3 ppm) and those with significant",
  "answer2": "This means AV 0 or no. On the other hand, if AV 0 is greater than no AV, then no AV is required to treat AV 0 or AV 0 difference.\n\nExceptions"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Which of the following needs cholesterol and other lipids for growth?",
  "answer1": "Do they help the liver or muscles of the heart be more hydrated? Have these changes caused atherosclerosis. They are usually not.\n\nCholesterol or lipids may be",
  "answer2": "Are you having trouble, or just feel like it doesn't add up?\n\nYou know that one of the most commonly asked questions about all types of cholesterol is \"which one has"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Reduction potential of potassium is",
  "answer1": "approximately 3 mg/L in rats. Because potassium has no excitatory modulator, it cannot compensate for reduced metabolic rate and muscle hypertrophy. In general, decreased potassium excitation from the kidneys causes renal disease",
  "answer2": "greater and therefore there is less available potassium for consumption. The reduction in potassium has been suggested as a factor favoring an increased risk of atherosclerosis.28\n\n2.3.2 Keto Foods and Metabol"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Drinking can be induced by",
  "answer1": "alcohol or some other form of caffeine, which can cause you to have an increased tendency to get anxious or moody.\n\nResearch shows that the use of coffee can make you feel like you're eating more, especially",
  "answer2": "certain compounds that are metabolized in the body or taken for free, using a chemical receptor antagonist like Ritalin. However, even if the compound is being taken orally, the effects on the skin could differ depending on"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Endemic typhus is transmitted by",
  "answer1": "contact with blood transfusion by a type of blood transfusion. All those who wish to avoid transmitting these diseases can avoid contact with these substances by the use of means containing these substances.\n\nThe person using the",
  "answer2": "the bite marks of animals like mollusks. In addition, typhus infects other animals like mollusks and the mollusks die with the disease. Most cases of acute typhus are"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "In Familial hypercholesterolemia there is",
  "answer1": "a risk of type II diabetes in very large numbers, even those with normal blood glucose control. The majority of patients in the AP-1, AP-1B, and AP-1C groups",
  "answer2": "a strong risk that people of familial hypercholesterolemia will become sexually active when they become a mother, thus the risk for sexually transmitted disease.\n\nFamilial hypercholesterolemia"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Corynebacterum other than diphtheriae carrying toxin:",
  "answer1": "For diphtheriae, diphtheria B and diphtheria C enter the fungal group in the intestinal walls only (n = 4); d",
  "answer2": "the possibility that diphtheriae would carry toxins by D. melanogaster as evidenced by their toxin content. This possibility was further tested in several case isolated from young children"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Which of the following insecticides is a natural Which of the product ?",
  "answer1": "It is a chemical of which an insect is responsible, and can lead to problems with the body and with certain cancers.\n\nIt is also a natural pesticide and should",
  "answer2": "Insecticides which are used in small size areas may occur in some cases, a few of them may increase or decrease certain insect population, some of them may act as"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Radiofrequency ablation treatment is most useful in:",
  "answer1": "(1) patients with a history of radiofrequency ablation, (2) patients with a history of type-1 diabetes mellitus, or (3) patients with the treatment of peripheral hyperar",
  "answer2": "1) reducing the frequency, 2) removing interference from radio frequencies and producing effective suppression at low noise levels.\n\n1.1.1. Frequency-Resolution Optimizer 2.2."
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Which of the following will cause Bull's eye retinopathy",
  "answer1": "to flare up in future years?\n\n1. You have no control over how you react to exercise, and it might be difficult to control your mood and behaviour if you do not exercise",
  "answer2": "and should be ruled out in your care from taking aspirin:\n\nHeart failure\n\nHeart disease\n\nBreast enlargement\n\nDiseases that cause a loss of circulation"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Which of the following is lined by transitional epithelium ?",
  "answer1": "\u2014 Is this the only group consisting mainly of subcapsular and extracellular cell lines? Can we interpret the other two by looking at the morphology of the plaques? (1)",
  "answer2": "Where do the two epithelium intersect?\n\nLemma : a part of the epithelium (usually called a phalange)\n\nLiam : a piece of"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Reversible cause of dementia is ?",
  "answer1": "Cisgender?Aging. Cisgender?\" or?\n\n(incl. \"Degenerate\" is commonly used for people with cognitive impairment )...or\n\n(incl. \"",
  "answer2": "For a person who is suffering from dementia, there are more problems than you may imagine, but they're very rare and very important to people with dementia. I don't think that any of the symptoms that"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Donepezil is used in treatment of which disease?",
  "answer1": "Biological activity: No.\n\nConceivably, for a given organism, a certain percentage of the RNA that was introduced into a human cell would be detectable at specific sites by",
  "answer2": "The study showed a significant inhibition of TNF-\u03b1 levels in the blood in the diabetic group. It was also observed that TNF-\u03b1 did not increase in the diabetic group as expected."
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Rivastigmine is given in which disease?",
  "answer1": "I have no evidence of evidence of disease in this animal. A very early animal from eastern Europe, the Lederidorus, is mentioned, but this animal was not observed till the",
  "answer2": "s symptoms range from mild to severe. It may be called a tachycardia-paresthesized and tachycardia-associated rash in which the patient's right forearm may be"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Alzheimer's disease, which is involved?",
  "answer1": "In all, of the eight million people diagnosed, one out of 20 is the result of a genetic mutation, including one who has not yet been diagnosed. Other possible causes included changes in genetics",
  "answer2": "And will the United Nations report on this soon?\n\nNorman Eisen: Well, I think you'd probably have to hear from somebody who is on the medical side of the issue. It is"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Senile plaques in brain is a feature of which disease?",
  "answer1": "s incidence and mortality are related, and which diseases might be particularly susceptible?s prognosis?s outcome (see below, for examples). The aim was to determine which diseases may have",
  "answer2": "I don't think so. That's an interesting question because, unfortunately, this is the most widespread, and very rare, disease that you could go around to diagnose yourself with"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Defect in Amyloid protein folding occurs in which disease?",
  "answer1": "In fact it is caused by the defective protein-coding mechanism (i.e. it is the mainstay of the cellular and protein-chain interactions, namely the intracellular",
  "answer2": "Monsanto is known to be involved in a number of important cellular activities, including proteomics, tissue physiology, and transcriptional regulation. Several mechanisms appear linked to amyl"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Regions of 'trinucleotide repeats' are seen in which disease?",
  "answer1": "\"In an observational, randomized, double-blind, placebo-controlled trial, a high proportion of human T cells were identified from the control group. However,",
  "answer2": "A. The major mutations in these regions were found when using two large human (in vivo) mice, that are known to develop HIV. Therefore a study with a large"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Alzheimer's Disease is associated with :",
  "answer1": "Alzheimer's Disease is not the only serious case of Alzheimer's Disease that occurs in families with a limited number of Alzheimer's sufferers.\n\n\nWhat are the treatments to stop Alzheimer's",
  "answer2": "(1) greater risk for dementia, (2) greater risk for stroke, and (3) greater risk for coronary heart disease as a result of eating or drinking that are considered to cause the risk in"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Tau proteins are most commonly associated with ",
  "answer1": "erythrocyte formation. Since this enzyme and the histone acetyltransferase are not involved in this process, in mice A\u03b2 signaling can be inhibited by the enzyme which is not activated in A",
  "answer2": "erythrocytes, but may be implicated in the pathogenesis and degradation of other types of erythrocytes and lymphocytes [42]. A possible source of erythrocyte-derived"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": " MMSE is used for the diagnosis of :",
  "answer1": "MMSE is used for the diagnosis of :\n\nT1S2JT J3\n\nT4S5T J6\n\nT5J1T N1\n\nT5YT J7\n\nT5Y",
  "answer2": "MMSE is used for the diagnosis of :\n\n\u2022 Asymmetries such as (S) = 2.7 \u00d7 108 (S, S = 11.5 \u00d7 108) or (S^2) > S = 1.2"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "The nucleus involved in Alzheimer's disease is -",
  "answer1": "13.2% -- so that means that -13.1% of human tumors are involved there\" and that is, -13.2% is associated with the use of oral agents which target the",
  "answer2": "12C-phosphatase, which breaks down cholesterol and other free radicals, to allow our bodies to make better antibodies when activated in the brain.\n\nBut it is still quite tiny. Only"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "In Alzheimer's disease, plaque is made up of:",
  "answer1": "Insulin-like growth factor C (IGF-C)\n\nProtein-cholesterol\n\nFat distribution in the body\n\nStructure and function\n\nGrowth",
  "answer2": "Plaque (a protein found in blood cells that keeps us healthy)\n\nBicarbonate (B-calcium ions, charged to take away carbonates)\n\nFat"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Hirano bodies seen in?",
  "answer1": "At the very least, we have identified several cases where the K.M.W.I.S. was involved in an attempted murder. We were very surprised at how many of these men didn't",
  "answer2": "Vladimir Ilyich Kaltenberg, the Russian Orthodox Patriarch of Constantinople who fought in the Armenian War"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Which drug is not used now in Alzheimer's disease?",
  "answer1": "Is it really that hard to figure out if Alzheimer's is a sign of Alzheimer's disease as a condition not caused by alcohol?\n\nRigid and transparent communication between researchers\n\nThe",
  "answer2": "Read more\n\n\"The FDA does not treat Alzheimer's disease with every antibiotic, including clobazepam. In contrast, many antibiotics are available today in most non-human primate"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Misfolded amyloid deposition in brain is seen in?",
  "answer1": ".\n\nIn vitro experiments are possible to confirm that amyloid deposition in humans is caused by a genetic defect. If this defect is present, its cause might be an inherited condition",
  "answer2": "I think that's what this is related to.\"\n\n\"It's not,\" said Dr. Mihail. \"It's the same thing they're doing with some people with"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Anticholinesterases drugs are used in which disease?",
  "answer1": "is an isolated pathogen for which therapies can be developed. There have been studies with patients with metastatic kidney disease [21,23], which were shown to result in increased protein synthesis",
  "answer2": "Do they have direct physical impact on liver function?\n\nWhat if they can be taken after cancer is treated with chemotherapy drugs like tetracycline?\n\nIt"
}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "'Silent epidemic' of the century is:",
  "answer1": "the lack of awareness of the problems affecting the disabled and non-disabled community. The world is not looking for solutions to poverty, unemployment, poverty rates, illness and even death. It is not looking",
  "answer2": "\"It was just a coincidence that the first major mass killer had come just before the onset of the Cold War. It was not surprising that he was the first to come here. One of the big"
}
{
  "prompt": "Galantamine is used in ?",
  "answer1": "-methamphetamine?, methamphetamine use, and cocaine use. It also has been shown that methylphenidate is particularly toxic to people who have previously been treated. The only other known instance of methylphenidate was",
  "answer2": "/m? as a stimulant and non-psychotropic. It is used in people who smoke cigarettes, so it is important in their daily dose. Because dopamine levels increase over time due to smoking, caffeine or"
}


In [6]:
with open('ls_input_data.json', 'w') as f:
    json.dump(generated_examples, f, indent=2)

In [7]:
import json

try:
    with open('ls_input_data.json', 'w') as f:
        json.dump(generated_examples, f, indent=2)
    print("File 'ls_input_data.json' created and data written successfully.")
except Exception as e:
    print("An error occurred:", str(e))

File 'ls_input_data.json' created and data written successfully.


In [8]:
import os
import json

# Print the current working directory (path)
current_directory = os.getcwd()
print("Current Working Directory:", current_directory)

try:
    with open('ls_input_data.json', 'w') as f:
        json.dump(generated_examples, f, indent=2)
    print("File 'ls_input_data.json' created and data written successfully.")
except Exception as e:
    print("An error occurred:", str(e))


Current Working Directory: /content/trlx/examples/summarize_rlhf/reward_model
File 'ls_input_data.json' created and data written successfully.
