<a href="https://colab.research.google.com/github/AlexanderFnug/Demo2023KEA/blob/master/Kopi_af_Finetuning_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers trl accelerate torch bitsandbytes peft datasets -qU

#### Load HF Dataset

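First things first, we need to load our dataset. This notebook was originally written around `mosaicml/instruct-v3`; here we instead load a local JSON file (`/content/co2data.json`) that contains `prompt` and `response` columns.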

In [None]:
from datasets import load_dataset

# Specify the path to your JSON file
file_path = "/content/co2data.json"

# Load the dataset explicitly as a JSON file
instruct_tune_dataset = load_dataset('json', data_files=file_path)

Let's take a peek at our dataset.

It's our job to merge these `prompt` and `response` columns into a single formatted prompt for instruct-tuning.

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['response', 'prompt'],
        num_rows: 30
    })
})
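Before formatting, it can help to confirm that every record actually carries both columns. The sketch below is a minimal, standard-library check. It assumes the JSON file is either a single JSON array of objects or JSON Lines (one object per line), and the helper names `parse_records` and `find_incomplete` are hypothetical.

```python
import json

REQUIRED_KEYS = ("prompt", "response")

def parse_records(text):
    """Parse either a JSON array of objects or JSON Lines into a list of dicts."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

def find_incomplete(records, required=REQUIRED_KEYS):
    """Return indices of records that lack a required key or have an empty value."""
    return [i for i, rec in enumerate(records)
            if any(not rec.get(key) for key in required)]

# Tiny made-up sample standing in for the real file contents
sample_text = '{"prompt": "p1", "response": "r1"}\n{"prompt": "p2", "response": ""}'
records = parse_records(sample_text)
print(find_incomplete(records))
```

Any index this prints points at a row that would produce an incomplete training prompt.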

Since we want a model that generates instructions, we filter out all the other subsets and use only the `dolly_hhrlhf` component! Note that this step assumes a `source` column, which the local JSON dataset shown above does not have.

In [None]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})

We're going to train on a small subset of the data. If you were considering an epoch-based approach, this would also reduce the time spent training!

In [None]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(30))

In [None]:
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200))
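A JSON file loaded through `data_files=` as above yields only a `train` split, so a held-out `test` split has to be carved out before it can be selected from. The sketch below shows the idea on plain row indices with a fixed seed. `split_indices` is a hypothetical helper; in the notebook, the `Dataset.train_test_split` method from `datasets` could play the same role.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train/test lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(n_rows * test_fraction))  # keep at least one test row
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(30)  # 30 rows, as in the local dataset
print(len(train_idx), len(test_idx))
```

The two index lists are disjoint, so no row appears in both splits.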

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 200
    })
})

#### Create Formatted Prompt

In the following function we'll merge a fixed instruction, the input text, and the `response` column into a single training prompt using the following template:

```
<s>### Instruction:
{system_message}

### Input:
{input}

### Response:
{response}</s>
```

In [None]:
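# Default paths let SFTTrainer call create_prompt(sample) with a single argument
def create_prompt(sample, file_path1='/content/Lighthouse.txt', file_path2='/content/WSG.txt'):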
    # Function to read the content of a file
    def read_file_content(file_path):
        with open(file_path, 'r') as file:
            return file.read()

    # Read the content of your files
    file_content1 = read_file_content(file_path1)
    file_content2 = read_file_content(file_path2)

    bos_token = "<s>"
    original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    system_message = "You are a helpful expert web sustainability analyst. Your function is to make a web sustainability analysis based on the information provided." + \
    " Please read through the provided content and analyse pageload and all other relevant data to make recommendations on how to make the website more sustainable." + \
    " Please do not ask the user any follow-up questions." + \
    " Please do not write any source links." + " Please describe the benefits of all recommendations given." + \
    " Please do not use lists but make use of paragraphs and headlines instead. Keep the word count under 400."

    # Including file contents in the input section
    input_content = file_content1 + "\n" + file_content2

    eos_token = "</s>"

    full_prompt = f"{bos_token}### Instruction:\n{system_message}\n\n### Input:\n{input_content}\n\n### Response:\n{sample['response']}{eos_token}"
    return full_prompt

# Usage example with file paths and a sample
file_path1 = '/content/Lighthouse.txt'
file_path2 = '/content/WSG.txt'
sample = {'response': 'Your sample response here'}
prompt_text = create_prompt(sample, file_path1, file_path2)
print(prompt_text)
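Because the trainer calls the formatting function once per sample, re-reading both files each time repeats the same disk access. One possible refinement, sketched below, caches file contents with `functools.lru_cache`. `read_file_cached` is a hypothetical replacement for the inner `read_file_content` helper, and the temporary file only stands in for the real Lighthouse and WSG text files.

```python
import functools
import os
import tempfile

@functools.lru_cache(maxsize=None)
def read_file_cached(file_path):
    """Read a text file once; later calls with the same path reuse the cached string."""
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()

# Demonstrate with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("FIRST CONTENTFUL PAINT\nValue: 0.7 s\n")
    tmp_path = tmp.name

first = read_file_cached(tmp_path)
second = read_file_cached(tmp_path)  # served from the cache, no second disk read
cache_hits = read_file_cached.cache_info().hits
os.remove(tmp_path)
print(cache_hits)
```

The trade-off is that the cache returns stale text if a file changes during the session; `read_file_cached.cache_clear()` resets it.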


In [None]:
create_prompt(instruct_tune_dataset["train"][0], '/content/Lighthouse.txt', '/content/WSG.txt')

'<s>### Instruction:\nYou are a helpful expert web sustainability analyst. Your function is to make a web sustainability analysis based on the information provided. Please read through the provided content and analyse pageload and all other relevant data to make recommendations on how to make the website more sustainable. Please do not ask the user any follow-up questions. Please do not write any source links. Please describe the benefits of all recommendations given. Please do not use lists but make use of paragraphs and headlines instead. Keep the word count under 400.\n\n### Input:\n\nSERVE IMAGES IN NEXT-GEN FORMATS\nImage formats like WebP and AVIF often provide better compression than PNG or JPEG, which means faster downloads and less data consumption. \n\nFIRST MEANINGFUL PAINT\nFirst Meaningful Paint measures when the primary content of a page is visible. \nValue: 0.7\xa0s\n\nHAS A `<META NAME="VIEWPORT">` TAG WITH `WIDTH` OR `INITIAL-SCALE`\nA `<meta name="viewport">` not only

### Loading the Base Model

We're going to load our model in `4bit`, with double quantization, with `bfloat16` as our compute dtype.

You'll notice we're loading the instruct-tuned model - this is because it's already adept at following tasks - we're just teaching it a new one!

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
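To see why 4-bit loading matters, a rough back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 7.24 billion parameters for Mistral 7B (an approximation, not a value read from the model) and ignores quantization constants, activations, and optimizer state. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7.24e9  # assumed parameter count for Mistral 7B

for label, bits in [("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint; actual GPU usage will be higher once activations and adapter gradients are included.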

Let's examine how well the model currently does at this task:

In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")
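Stripping the prompt with `str.replace` only works when the decoded text reproduces the prompt character for character. An alternative, sketched here on plain Python lists rather than tensors, is to drop the prompt tokens by position before decoding, since for decoder-only models `generate` returns the prompt tokens followed by the new ones. `strip_prompt_tokens` is a hypothetical helper and the token ids are made up.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Keep only the tokens produced after the prompt."""
    if prompt_length > len(generated_ids):
        raise ValueError("prompt is longer than the generated sequence")
    return generated_ids[prompt_length:]

prompt_ids = [1, 733, 16289]          # made-up ids standing in for the encoded prompt
generated = prompt_ids + [28705, 5]   # generate() output: prompt tokens + new tokens
new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)
```

With tensors, the equivalent slice would be `generated_ids[:, model_inputs["input_ids"].shape[1]:]`, decoded with `skip_special_tokens=True` to drop `<s>` and `</s>`.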

In [None]:
text = """
### Instruction:\nUse the provided input to create an instruction that can help lessen carbon footprint for the webpage using the Lighthouse data### Input:\n
SERVE IMAGES IN NEXT-GEN FORMATS
Image formats like WebP and AVIF often provide better compression than PNG or JPEG, which means faster downloads and less data consumption.

FIRST MEANINGFUL PAINT
First Meaningful Paint measures when the primary content of a page is visible.
Value: 0.7 s

HAS A `<META NAME="VIEWPORT">` TAG WITH `WIDTH` OR `INITIAL-SCALE`
A `<meta name="viewport">` not only optimizes your app for mobile screen sizes, but also prevents .

TOTAL BLOCKING TIME
Sum of all time periods between FCP and Time to Interactive, when task length exceeded 50ms, expressed in milliseconds.
Value: 260 ms

ELIMINATE RENDER-BLOCKING RESOURCES
Resources are blocking the first paint of your page. Consider delivering critical JS/CSS inline and deferring all non-critical JS/styles.

REMOVE DUPLICATE MODULES IN JAVASCRIPT BUNDLES
Remove large, duplicate JavaScript modules from bundles to reduce unnecessary bytes consumed by network activity.

AVOID SERVING LEGACY JAVASCRIPT TO MODERN BROWSERS
Polyfills and transforms enable legacy browsers to use new JavaScript features. However, many aren't necessary for modern browsers. For your bundled JavaScript, adopt a modern script deployment strategy using module/nomodule feature detection to reduce the amount of code shipped to modern browsers, while retaining support for legacy browsers.
Value: Potential savings of 0 KiB

REDUCE UNUSED CSS
Reduce unused rules from stylesheets and defer CSS not used for above-the-fold content to decrease bytes consumed by network activity.

MINIFY JAVASCRIPT
Minifying JavaScript files can reduce payload sizes and script parse time.

ENABLE TEXT COMPRESSION
Text-based resources should be served with compression  to minimize total network bytes.

ALL TEXT REMAINS VISIBLE DURING WEBFONT LOADS
Leverage the `font-display` CSS feature to ensure text is user-visible while webfonts are loading.

AVOIDS `DOCUMENT.WRITE()`
For users on slow connections, external scripts dynamically injected via `document.write` can delay page load by tens of seconds.

EFFICIENTLY ENCODE IMAGES
Optimized images load faster and consume less cellular data.

USES PASSIVE LISTENERS TO IMPROVE SCROLLING PERFORMANCE
Consider marking your touch and wheel event listeners as `passive` to improve your page's scroll performance.

PROPERLY SIZE IMAGES
Serve images that are appropriately-sized to save cellular data and improve load time.
Value: Potential savings of 9 KiB

CUMULATIVE LAYOUT SHIFT
Cumulative Layout Shift measures the movement of visible elements within the viewport.
Value: 0.002

MINIFY CSS
Minifying CSS files can reduce network payload sizes.

TIME TO INTERACTIVE
Time to Interactive is the amount of time it takes for the page to become fully interactive.
Value: 1.8 s

SERVE STATIC ASSETS WITH AN EFFICIENT CACHE POLICY
A long cache lifetime can speed up repeat visits to your page.
Value: 27 resources found

FIRST CONTENTFUL PAINT
First Contentful Paint marks the time at which the first text or image is painted.
Value: 0.7 s

USE VIDEO FORMATS FOR ANIMATED CONTENT
Large GIFs are inefficient for delivering animated content. Consider using MPEG4/WebM videos for animations and PNG/WebP for static images instead of GIF to save network bytes.

LARGEST CONTENTFUL PAINT
Largest Contentful Paint marks the time at which the largest text or image is painted.
Value: 1.1 s

MAX POTENTIAL FIRST INPUT DELAY
The maximum potential First Input Delay that your users could experience is the duration of the longest task.
Value: 210 ms

DEFER OFFSCREEN IMAGES
Consider lazy-loading offscreen and hidden images after all critical resources have finished loading to lower time to interactive.

AVOID MULTIPLE PAGE REDIRECTS
Redirects introduce additional delays before the page can be loaded.

IMAGE ELEMENTS DO NOT HAVE EXPLICIT `WIDTH` AND `HEIGHT`
Set an explicit width and height on image elements to reduce layout shifts and improve CLS.

SPEED INDEX
Speed Index shows how quickly the contents of a page are visibly populated.
Value: 2.3 s

PRECONNECT TO REQUIRED ORIGINS
Consider adding `preconnect` or `dns-prefetch` resource hints to establish early connections to important third-party origins.
\n\n### Response:
"""
generate_response(text, model)

"<s> In order to reduce your webpage's carbon footprint, consider the following suggestions:\n\n1. Serve images in next-gen formats: Using WebP and AVIF formats for images can result in faster downloads and less data consumption.\n2. Optimize the order of resources: Ensure that the most important resources like JavaScript, CSS, and media files are loaded first, and defer non-critical resources.\n3. Prioritize above-the-fold content: Ensure that important content is visible above the fold, which can improve user experience and reduce page load times.\n4. Minimize resources: Consider removing duplicate modules in JavaScript bundles, deferring non-critical resources, and adopting a modern script deployment strategy to reduce the amount of code shipped to modern browsers.\n5. Optimize images: Use appropriate image file formats, optimal image sizes, and lazy loading to reduce the amount of data consumed by network activity.</s>"

Now, we're going to prepare our model for 4bit LoRA training!

We can use these handy helper functions to achieve this goal thanks to `huggingface` and the `peft` library!

In [None]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)
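The two main knobs in `LoraConfig` have simple arithmetic behind them: each adapted weight matrix of shape `d_out x d_in` gains two low-rank factors with `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r`. The sketch below works this through for a square 4096 x 4096 projection, a size chosen for illustration. The helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

d_in = d_out = 4096       # illustrative layer size (assumption)
r, lora_alpha = 64, 16    # values from the LoraConfig above

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out
print(added, full, added / full, lora_scaling(lora_alpha, r))
```

For this layer size the adapter trains about 3% as many parameters as the full matrix, and with `lora_alpha=16`, `r=64` the update is scaled by 0.25.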

In [None]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

All that's left to do is set up a number of hyperparameters.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "mistral_instruct_generation",
  #num_train_epochs=5,
  max_steps = 100, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 4,
  warmup_ratio = 0.03, # a fraction of total steps; warmup_steps expects an integer step count
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=20, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
)

ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0
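The error above appears on GPUs without bfloat16 support. One way to avoid it, sketched below, is to choose the mixed-precision flags from a capability check instead of hard-coding `bf16=True`. `precision_flags` is a hypothetical helper; in the notebook its argument could come from `torch.cuda.is_available() and torch.cuda.is_bf16_supported()`.

```python
def precision_flags(bf16_supported):
    """Return mutually exclusive mixed-precision flags for TrainingArguments."""
    if bf16_supported:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": True}

print(precision_flags(False))
```

The result could be unpacked into the arguments, e.g. `TrainingArguments(..., **precision_flags(supported))`, with the literal `bf16=True` line removed. On such hardware, `bnb_4bit_compute_dtype` would likely need to be `torch.float16` as well.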

In [None]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
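With `packing=True`, short samples are concatenated and cut into blocks of `max_seq_length` tokens so that little of each batch is padding. The following is a simplified sketch of that idea on made-up token ids, not TRL's actual implementation. `pack_sequences` and the choice to drop the final partial block are assumptions.

```python
def pack_sequences(tokenized_samples, seq_length, eos_token_id):
    """Concatenate samples (each followed by EOS) and cut into fixed-length blocks."""
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(eos_token_id)
    n_blocks = len(stream) // seq_length  # any trailing partial block is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]

blocks = pack_sequences([[11, 12, 13], [21, 22]], seq_length=3, eos_token_id=2)
print(blocks)
```

A consequence is that the number of training sequences in a packed run depends on the total token count rather than on the number of dataset rows.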



In [None]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.5234        | 1.371291        |
| 40   | 1.4518        | 1.3347          |
| 60   | 1.4292        | 1.32248         |
| 80   | 1.4286        | 1.314245        |
| 100  | 1.4276        | 1.309581        |


TrainOutput(global_step=100, training_loss=1.478941478729248, metrics={'train_runtime': 476.1112, 'train_samples_per_second': 0.84, 'train_steps_per_second': 0.21, 'total_flos': 3.50843194834944e+16, 'train_loss': 1.478941478729248, 'epoch': 0.08})
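The throughput figures in `TrainOutput` can be cross-checked from the hyperparameters: 100 steps at a per-device batch size of 4 is 400 sequences, and dividing by the reported runtime gives the per-second rates. A quick check, assuming a single device and no gradient accumulation:

```python
max_steps = 100
per_device_train_batch_size = 4
train_runtime = 476.1112  # seconds, from the TrainOutput above

sequences_seen = max_steps * per_device_train_batch_size
samples_per_second = sequences_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(sequences_seen, round(samples_per_second, 2), round(steps_per_second, 2))
```

The rounded results agree with the reported `train_samples_per_second` of 0.84 and `train_steps_per_second` of 0.21.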

In [None]:
trainer.save_model("mistral_instruct_generation")

### Save Model and Push to Hub

4bit save and push coming soon!

The PR is literally in the process of being added! Check it out [here](https://github.com/TimDettmers/bitsandbytes/pull/753)!

For now, we'll save our adapters!

In [None]:
!pip install huggingface-hub -qU

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub("ai-maker-space/mistral-instruct-generation")

In [None]:
merged_model = model.merge_and_unload()
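Conceptually, `merge_and_unload()` folds each adapter into its base weight, `W' = W + (lora_alpha / r) * B @ A`, so that inference no longer needs the separate LoRA matrices. The sketch below shows that arithmetic on tiny nested lists. The matrices are made up, `merge_lora` is a hypothetical helper, and it does not model how `peft` handles quantized 4-bit weights.

```python
def matmul(left, right):
    """Multiply two matrices given as nested lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)] for row in left]

def merge_lora(weight, lora_b, lora_a, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2 x 2)
B = [[1.0], [2.0]]             # LoRA B (2 x r), with r = 1
A = [[3.0, 4.0]]               # LoRA A (r x 2)

merged = merge_lora(W, B, A, lora_alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiplication with the merged weight gives the same result as applying the base weight and the scaled adapter separately.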



In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [None]:
generate_response("### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:", merged_model)

'<s> ### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nIdentify the most common species of grass, and provide a brief description of its properties.</s>'