## Describe your model -> fine-tuned LLaMA 2
By Matt Shumer (https://twitter.com/mattshumer_)

The goal of this notebook is to experiment with a new way to make it very easy to build a task-specific model for your use-case.

First, use the best GPU available (go to Runtime -> change runtime type)

To create your model, just go to the first code cell, and describe the model you want to build in the prompt. Be descriptive and clear.

Select a temperature (high=creative, low=precise), and the number of training examples to generate to train the model. From there, just run all the cells.

You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell.

In [None]:
# prompt = "A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Spanish."
# temperature = .4
# number_of_examples = 100

# Install necessary libraries

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 textstat
import pandas as pd
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/244.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.7/244.2 kB[0m [31m2.0 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━[0m [32m235.5/244.2 kB[0m [31m3.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m45.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.4/77.4 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[2K 

  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


In [2]:
import pandas as pd

# Assuming your dataset is stored in a CSV file named "dataset.csv"
dataset = pd.read_csv("withparaphrased.csv")

# Create a new column 'text' by concatenating values from selected columns
dataset['prompt'] = "Paraphrase the following user story: " + dataset['User Story'] + ".\n"
dataset['prompt'] += "Based on the following metrics: "
for col in dataset.columns:
    if col.startswith('diff'):
        dataset['prompt'] += col + ": " + dataset[col].astype(str) + ","
dataset['response'] = dataset['Paraphrased User Story']

# Selecting the columns
selected_columns = ['prompt'] + [col for col in dataset.columns if col.startswith('diff.')]

# Creating a new DataFrame with selected columns
selected_data = dataset[['prompt', 'response', 'User Story', 'Paraphrased User Story']]
train_df = selected_data.sample(frac=0.9, random_state=42)
test_df = selected_data.drop(train_df.index)

train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)


FileNotFoundError: [Errno 2] No such file or directory: 'withparaphrased.csv'

In [None]:
test_df

Unnamed: 0,prompt,response,User Story,Paraphrased User Story
14,Paraphrase the following user story: A researc...,A computational biologist is utilizing backpro...,A researcher in computational biology is using...,A computational biologist is utilizing backpro...
20,Paraphrase the following user story: As a comp...,"As a computational biologist, I want to levera...","As a computational biologist, I want to use ba...","As a computational biologist, I want to levera..."
51,Paraphrase the following user story: As a micr...,"As a microbiologist, I aim to leverage constra...","As a microbiologist, I want to use constrained...","As a microbiologist, I aim to leverage constra..."
60,Paraphrase the following user story: As a biol...,"As a biologist, I want to utilize data mining ...","As a biologist, I want to use data mining to a...","As a biologist, I want to utilize data mining ..."
71,Paraphrase the following user story: As a bioi...,"As a bioinformatics researcher, I aim to creat...","As a bioinformatics researcher, I want to deve...","As a bioinformatics researcher, I aim to creat..."
74,Paraphrase the following user story: As a bioi...,"As a bioinformatics specialist, I aim to utili...","As a bioinformatics specialist, I want to use ...","As a bioinformatics specialist, I aim to utili..."
82,Paraphrase the following user story: As a medi...,"As a medical imaging specialist, I need an aut...","As a medical imaging specialist, I want to use...","As a medical imaging specialist, I need an aut..."
86,Paraphrase the following user story: As a bioi...,"As a bioinformatics researcher, I desire to ut...","As a bioinformatics researcher, I want to use ...","As a bioinformatics researcher, I desire to ut..."
91,Paraphrase the following user story: As a biol...,"As a biologist, I want to leverage entity link...","As a biologist, I want to use entity linking t...","As a biologist, I want to leverage entity link..."
92,Paraphrase the following user story: As an evo...,"As an evolutionary biologist, I aim to leverag...","As an evolutionary biologist, I want to use ev...","As an evolutionary biologist, I aim to leverag..."


# Define Hyperparameters

In [None]:
model_name = "NousResearch/llama-2-7b-chat-hf" # use this if you have access to the official LLaMA 2 model "meta-llama/Llama-2-7b-chat-hf", though keep in mind you'll need to pass a Hugging Face key argument
dataset_name = "/content/train.jsonl"
new_model = "llama-2-7b-custom"
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 25
logging_steps = 5
max_seq_length = None
packing = False
device_map = {"": 0}

#Load Datasets and Train

In [None]:
# Load datasets
train_dataset = load_dataset('json', data_files='/content/train.jsonl', split="train")
valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split="train")

# Preprocess datasets
train_dataset_mapped = train_dataset.map(lambda examples: {'text': [f'<s>[INST]' + prompt + ' [/INST] ' + response + for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)
valid_dataset_mapped = valid_dataset.map(lambda examples: {'text': [f'[INST]' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="all",
    evaluation_strategy="steps",
    eval_steps=5  # Evaluate every 20 steps
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    eval_dataset=valid_dataset_mapped,  # Pass validation dataset here
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
trainer.train()
trainer.model.save_pretrained(new_model)

# Cell 4: Test the model
# logging.set_verbosity(logging.CRITICAL)
# prompt = f"[INST]Write a function that reverses a string. [/INST]" # replace the command here with something relevant to your task
# pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
# result = pipe(prompt)
# print(result[0]['generated_text'])

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/179 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]



Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
5,1.4352,1.240898
10,1.0861,0.964827
15,0.8591,0.736576
20,0.6245,0.605554


#Run Inference

In [None]:
from transformers import pipeline

prompt = f"[INST]Paraphrase the following user story: A researcher in computational biology is using backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer. By analyzing large sets of genomic data, the algorithm is trained to identify the underlying genetic factors that contribute to cancer development..\nBased on the following metrics: diff_total_characters: 9,diff_uppercase_characters: 0,diff_lowercase_characters: 8,diff_special_characters: 0,diff_numbers: 0,diff_blanks: 1,diff_number_of_words: 1,diff_average_length_of_words: 0.0453283996299722,diff_number_of_propositions: 0,diff_average_length_of_propositions: 0.5,diff_punctuation_characters: 0,diff_lowercase_words: 1,diff_uppercase_words: 0,diff_vocabulary_richness: 1,diff_number_of_urls: 0, [/INST]" # replace the command here with something relevant to your task
num_new_tokens = 100  # change to the number of new tokens you want to generate

# Count the number of tokens in the prompt
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

# Calculate the maximum length for the generation
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


 A computational biologist is utilizing backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer. By analyzing vast sets of genomic data, the algorithm is trained to identify the underlying genetic factors that contribute to cancer development.

Based on the following metrics:

* diff_total_characters: 9
* diff_uppercase_characters: 0
* diff_lowercase_characters


In [None]:
import string
import re
import textstat

def total_characters(text):
    return len(text)

def uppercase_characters(text):
    return sum(1 for char in text if char.isupper())

def lowercase_characters(text):
    return sum(1 for char in text if char.islower())

def special_characters(text):
    special_chars = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
    return sum(1 for char in text if char in special_chars)

def numbers(text):
    return sum(1 for char in text if char.isdigit())

def blanks(text):
    return sum(1 for char in text if char.isspace())

def number_of_words(text):
    return len(text.split())

def average_length_of_words(text):
    words = text.split()
    total_length = sum(len(word) for word in words)
    num_words = len(words)
    if num_words == 0:
        return 0
    return total_length / num_words

def number_of_propositions(text):
    propositions = re.split(r'[.!?]+', text)
    return len([prop for prop in propositions if prop.strip()])

def average_length_of_propositions(text):
    propositions = re.split(r'[.!?]+', text)
    lengths = [len(prop.strip().split()) for prop in propositions if prop.strip()]
    if lengths:
        return sum(lengths) / len(lengths)
    else:
        return 0

def punctuation_characters(text):
    return sum(1 for char in text if char in string.punctuation)

def lowercase_words(text):
    words = text.split()
    return sum(1 for word in words if word.islower())

def uppercase_words(text):
    words = text.split()
    return sum(1 for word in words if word.isupper())

def vocabulary_richness(text):
    words = text.lower().split()
    unique_words = set(words)
    dw = len(unique_words)
    return dw

def number_of_urls(text):
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    return len(urls)

def flesch_kincaid_grade_level(text):
    return textstat.flesch_kincaid_grade(text)

def flesch_reading_ease(text):
    return textstat.flesch_reading_ease(text)

def dale_chall_readability(text):
    return textstat.dale_chall_readability_score(text)

def automated_readability_index(text):
    return textstat.automated_readability_index(text)

def coleman_liau_index(text):
    return textstat.coleman_liau_index(text)

def gunning_fog(text):
    return textstat.gunning_fog(text)

def smog_index(text):
    return textstat.smog_index(text)

def linsear_write_index(text):
    return textstat.linsear_write_formula(text)

In [None]:
test_df

Unnamed: 0,prompt,response,User Story,Paraphrased User Story
14,Paraphrase the following user story: A researc...,A computational biologist is utilizing backpro...,A researcher in computational biology is using...,A computational biologist is utilizing backpro...
20,Paraphrase the following user story: As a comp...,"As a computational biologist, I want to levera...","As a computational biologist, I want to use ba...","As a computational biologist, I want to levera..."
51,Paraphrase the following user story: As a micr...,"As a microbiologist, I aim to leverage constra...","As a microbiologist, I want to use constrained...","As a microbiologist, I aim to leverage constra..."
60,Paraphrase the following user story: As a biol...,"As a biologist, I want to utilize data mining ...","As a biologist, I want to use data mining to a...","As a biologist, I want to utilize data mining ..."
71,Paraphrase the following user story: As a bioi...,"As a bioinformatics researcher, I aim to creat...","As a bioinformatics researcher, I want to deve...","As a bioinformatics researcher, I aim to creat..."
74,Paraphrase the following user story: As a bioi...,"As a bioinformatics specialist, I aim to utili...","As a bioinformatics specialist, I want to use ...","As a bioinformatics specialist, I aim to utili..."
82,Paraphrase the following user story: As a medi...,"As a medical imaging specialist, I need an aut...","As a medical imaging specialist, I want to use...","As a medical imaging specialist, I need an aut..."
86,Paraphrase the following user story: As a bioi...,"As a bioinformatics researcher, I desire to ut...","As a bioinformatics researcher, I want to use ...","As a bioinformatics researcher, I desire to ut..."
91,Paraphrase the following user story: As a biol...,"As a biologist, I want to leverage entity link...","As a biologist, I want to use entity linking t...","As a biologist, I want to leverage entity link..."
92,Paraphrase the following user story: As an evo...,"As an evolutionary biologist, I aim to leverag...","As an evolutionary biologist, I want to use ev...","As an evolutionary biologist, I aim to leverag..."


In [None]:
df = test_df.copy()
df['prompt'][14]

'Paraphrase the following user story: A researcher in computational biology is using backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer. By analyzing large sets of genomic data, the algorithm is trained to identify the underlying genetic factors that contribute to cancer development..\nBased on the following metrics: diff_total_characters: 13,diff_uppercase_characters: 0,diff_lowercase_characters: 12,diff_special_characters: 0,diff_numbers: 0,diff_blanks: 1,diff_number_of_words: 1,diff_average_length_of_words: 0.132284921369103,diff_number_of_propositions: 0,diff_average_length_of_propositions: 0.5,diff_punctuation_characters: 0,diff_lowercase_words: 1,diff_uppercase_words: 0,diff_vocabulary_richness: 0,diff_number_of_urls: 0,diff_flesch_kincaid_grade_level: 1.3999999999999986,diff_flesch_reading_ease: -8.969999999999999,diff_dale_chall_readability: -0.4900000000000002,diff_automated_readability_index: 0.8999

In [None]:
def cut_prompt(x):
  return x.split(':')[1].split('.')[0] + '.'

def cut_text(x):
  prompt = '[/INST]' + x + '[/INST]'
  num_new_tokens = 100  # change to the number of new tokens you want to generate

  # Count the number of tokens in the prompt
  num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

  # Calculate the maximum length for the generation
  max_length = num_prompt_tokens + num_new_tokens

  gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
  result = gen(prompt)
  print(result[0]['generated_text'].replace(prompt, ''))
  return result[0]['generated_text'].replace(prompt, '')


df['llm_output'] = df['prompt'].apply(cut_text)
df

In [None]:
df['llm_output'][14]

' Paraphrase the following user story: A computational biologist is using backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer. By analyzing large sets of genomic data, the algorithm is trained to identify the underlying genetic factors that contribute to cancer development.\nBased on the following metrics: diff_total_characters: 13, diff_uppercase_characters: 0, diff_lowercase'

In [None]:
d = df.copy()

In [None]:
import pandas as pd

# Reshape the DataFrame
num_rows = df.shape[0]
num_cols = df.shape[1]
reshape_data = []

for i in range(num_rows):
    for j in range(num_cols):
        reshape_data.append(df.iloc[i, j])

# Create a new DataFrame with reshaped data
reshaped_df = pd.DataFrame(columns=['reshaped_column'], data=reshape_data)

def name_the_text(ind):
  if ind % 3 == 0:
    return 'Original'
  elif ind % 3 == 1:
    return 'Paraphrased'
  return 'LLM'

reshaped_df['origin'] = reshaped_df.index.map(name_the_text)
reshaped_df

Unnamed: 0,reshaped_column,origin
0,A researcher in computational biology is usin...,Original
1,A computational biologist is utilizing backpro...,Paraphrased
2,A computational biologist is using backpropag...,LLM
3,"As a computational biologist, I want to use b...",Original
4,"As a computational biologist, I want to utiliz...",Paraphrased
5,"As a computational biologist, I want to use B...",LLM
6,"As a microbiologist, I want to use constraine...",Original
7,"As a microbiologist, I want to use constrained...",Paraphrased
8,"As a microbiologist, I want to use constraine...",LLM
9,"As a biologist, I want to use data mining to ...",Original


In [None]:
metric_functions = [
    total_characters,
    uppercase_characters,
    lowercase_characters,
    special_characters,
    numbers,
    blanks,
    number_of_words,
    average_length_of_words,
    number_of_propositions,
    average_length_of_propositions,
    punctuation_characters,
    lowercase_words,
    uppercase_words,
    vocabulary_richness,
    number_of_urls
]

In [None]:
df

Unnamed: 0,prompt,response,text
0,A researcher in computational biology is usin...,A computational biologist is utilizing backpro...,A computational biologist is using backpropag...
1,"As a computational biologist, I want to use b...","As a computational biologist, I want to utiliz...","As a computational biologist, I want to use B..."
2,"As a microbiologist, I want to use constraine...","As a microbiologist, I want to use constrained...","As a microbiologist, I want to use constraine..."
3,"As a biologist, I want to use data mining to ...","As a biologist, I want to utilize data mining ...","As a biologist, I want to use data mining tec..."
4,"As a bioinformatics researcher, I want to dev...","As a bioinformatics researcher, I want to crea...","As a bioinformatics researcher, I want to cre..."
5,"As a bioinformatics specialist, I want to use...","As a bioinformatics specialist, I aim to lever...","As a bioinformatics specialist, I want to use..."
6,"As a medical imaging specialist, I want to us...","As a medical imaging specialist, I need an eff...","As a medical imaging specialist, I want to le..."
7,"As a bioinformatics researcher, I want to use...","As a bioinformatics researcher, I want to use ...","As a bioinformatics researcher, I want to lev..."
8,"As a biologist, I want to use entity linking ...","As a biologist, I want to utilize entity linki...","As a biologist, I want to use entity linking ..."
9,"As an evolutionary biologist, I want to use e...","As an evolutionary biologist, I aim to leverag...","As a computational biologist, I want to lever..."


In [None]:
for func in metric_functions:
    reshaped_df[func.__name__] = reshaped_df['reshaped_column'].apply(func)
reshaped_df

Unnamed: 0,reshaped_column,origin,total_characters,uppercase_characters,lowercase_characters,special_characters,numbers,blanks,number_of_words,average_length_of_words,number_of_propositions,average_length_of_propositions,punctuation_characters,lowercase_words,uppercase_words,vocabulary_richness,number_of_urls
0,A researcher in computational biology is usin...,Original,176,1,150,0,0,25,25,6.04,1,25.0,0,24,1,23,0
1,A computational biologist is utilizing backpro...,Paraphrased,315,2,265,3,0,45,46,5.869565,2,23.0,3,44,1,36,0
2,A computational biologist is using backpropag...,LLM,312,2,263,2,0,45,45,5.933333,2,22.5,2,43,1,35,0
3,"As a computational biologist, I want to use b...",Original,262,3,218,2,0,39,39,5.717949,1,39.0,2,36,2,35,0
4,"As a computational biologist, I want to utiliz...",Paraphrased,267,4,222,3,0,38,39,5.871795,1,39.0,3,35,2,36,0
5,"As a computational biologist, I want to use B...",LLM,404,9,314,28,3,50,55,6.309091,2,27.5,21,39,2,47,0
6,"As a microbiologist, I want to use constraine...",Original,232,2,194,2,0,34,34,5.823529,1,34.0,2,32,1,32,0
7,"As a microbiologist, I want to use constrained...",Paraphrased,306,2,256,4,0,44,45,5.822222,1,45.0,4,43,1,38,0
8,"As a microbiologist, I want to use constraine...",LLM,368,7,284,27,5,45,50,6.32,2,25.0,20,36,1,44,0
9,"As a biologist, I want to use data mining to ...",Original,182,3,147,2,0,30,30,5.066667,1,30.0,2,27,2,27,0


In [None]:
reshaped_df.to_csv("gen_output_metrics.csv", index=False)


In [None]:
df['text'].to_csv('instructions.csv')

In [None]:
pd.DataFrame(valid_dataset_mapped['text']).to_csv('instructions.csv')

In [None]:
valid_dataset_mapped[0]['prompt'].split(':')[1].split('.')[0]
valid_dataset_mapped[0]['response']

prompt = valid_dataset_mapped[0]['text'].split('[/INST]')[0] + '[/INST]'
num_new_tokens = 100  # change to the number of new tokens you want to generate

# Count the number of tokens in the prompt
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

# Calculate the maximum length for the generation
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))



 A computational biologist is using backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer by analyzing large sets of genomic data. The algorithm is trained to identify the underlying genetic factors that contribute to cancer development.


In [None]:
valid_dataset_mapped[0]['text'].split('[/INST]')[0] + '[/INST]'

'[INST]Paraphrase the following user story: A researcher in computational biology is using backpropagation to train a machine learning algorithm to predict the likelihood of different genetic mutations leading to cancer. By analyzing large sets of genomic data, the algorithm is trained to identify the underlying genetic factors that contribute to cancer development..\nBased on the following metrics: diff_total_characters: 9,diff_uppercase_characters: 0,diff_lowercase_characters: 8,diff_special_characters: 0,diff_numbers: 0,diff_blanks: 1,diff_number_of_words: 1,diff_average_length_of_words: 0.0453283996299722,diff_number_of_propositions: 0,diff_average_length_of_propositions: 0.5,diff_punctuation_characters: 0,diff_lowercase_words: 1,diff_uppercase_words: 0,diff_vocabulary_richness: 1,diff_number_of_urls: 0, [/INST]'

                                              prompt  \
0  Paraphrase the following user story: A researc...   
1  Paraphrase the following user story: As a comp...   
2  Paraphrase the following user story: As a micr...   
3  Paraphrase the following user story: As a biol...   
4  Paraphrase the following user story: As a bioi...   
5  Paraphrase the following user story: As a bioi...   
6  Paraphrase the following user story: As a medi...   
7  Paraphrase the following user story: As a bioi...   
8  Paraphrase the following user story: As a biol...   
9  Paraphrase the following user story: As an evo...   

                                            response  \
0  A computational biologist is utilizing backpro...   
1  As a computational biologist, I want to utiliz...   
2  As a microbiologist, I want to use constrained...   
3  As a biologist, I want to utilize data mining ...   
4  As a bioinformatics researcher, I want to crea...   
5  As a bioinformatics specialist, I aim to lev

#Merge the model and store in Google Drive

In [None]:
# Merge and save the fine-tuned model
from google.colab import drive
drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/llama-2-7b-custom"  # change to your preferred path

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

# Load a fine-tuned model from Drive and run inference

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/llama-2-7b-custom"  # change to the path where your model is saved

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
from transformers import pipeline

prompt = "What is 2 + 2?"  # change to your desired prompt
gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = gen(prompt)
print(result[0]['generated_text'])