## Finetuning Google's Gemma Model on Intel Max Series GPUs 🚀

Welcome to this exciting journey where we'll dive into the world of finetuning large language models (LLMs) using Intel® Data Center GPU Max Series! 🌟 

In this notebook, we'll be working with Google's Gemma model and optimizing it for a specific task using the Intel Max 1550 GPU. 💪
___

### Overview

In this notebook, you will learn how to fine-tune a large language model (Google's Gemma) using Intel Max Series GPUs (XPUs) for a specific task. The notebook covers the following key points:

1. Setting up the environment and optimizing it for Intel GPUs
2. Initializing the XPU and configuring LoRA settings for efficient fine-tuning
3. Loading the pre-trained Gemma model and testing its performance
4. Preparing a diverse dataset of question-answer pairs covering various domains
5. Fine-tuning the model using the Hugging Face `Trainer` class
6. Evaluating the fine-tuned model on a test dataset
7. Saving and loading the fine-tuned model for future use


The notebook demonstrates how fine-tuning can enhance a model's performance on a diverse range of topics, making it more versatile and applicable to various domains. You will gain insights into the process of creating a **task-specific model** that can provide accurate and relevant responses to a wide range of questions.
</br>
___

#### Step 1: Setting Up the Environment 🛠️

First things first, let's get our environment ready! We'll install all the necessary packages, including the Hugging Face `transformers` library, `datasets` for easy data loading, `wandb` for experiment tracking, and a few others. 📦

In [1]:
import sys
import site
import os

# Install the required packages
!{sys.executable} -m pip install --upgrade  "transformers>=4.38.*"
!{sys.executable} -m pip install --upgrade  "datasets>=2.18.*"
!{sys.executable} -m pip install --upgrade "wandb>=0.16.*"
!{sys.executable} -m pip install --upgrade "trl>=0.7.11"
!{sys.executable} -m pip install --upgrade "peft>=0.9.0"
!{sys.executable} -m pip install --upgrade "accelerate>=0.28.*"

# Get the site-packages directory
site_packages_dir = site.getsitepackages()[0]

# add the site pkg directory where these pkgs are insalled to the top of sys.path
if not os.access(site_packages_dir, os.W_OK):
    user_site_packages_dir = site.getusersitepackages()
    if user_site_packages_dir in sys.path:
        sys.path.remove(user_site_packages_dir)
    sys.path.insert(0, user_site_packages_dir)
else:
    if site_packages_dir in sys.path:
        sys.path.remove(site_packages_dir)
    sys.path.insert(0, site_packages_dir)

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


We'll now make sure to optimize our environment for the Intel GPU by setting the appropriate environment variables and configuring the number of cores and threads. This will ensure we get the best performance out of our hardware! ⚡

In [2]:
import warnings
warnings.filterwarnings("ignore")

import os
import psutil

num_physical_cores = psutil.cpu_count(logical=False)
num_cores_per_socket = num_physical_cores // 2

os.environ["TOKENIZERS_PARALLELISM"] = "0"
#HF_TOKEN = os.environ["HF_TOKEN"]

# Set the LD_PRELOAD environment variable
ld_preload = os.environ.get("LD_PRELOAD", "")
conda_prefix = os.environ.get("CONDA_PREFIX", "")
# Improve memory allocation performance, if tcmalloc is not available, please comment this line out
os.environ["LD_PRELOAD"] = f"{ld_preload}:{conda_prefix}/lib/libtcmalloc.so"
# Reduce the overhead of submitting commands to the GPU
os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"
# reducing memory accesses by fusing SDP ops
os.environ["ENABLE_SDP_FUSION"] = "1"
# set openMP threads to number of physical cores
os.environ["OMP_NUM_THREADS"] = str(num_physical_cores)
# Set the thread affinity policy
os.environ["OMP_PROC_BIND"] = "close"
# Set the places for thread pinning
os.environ["OMP_PLACES"] = "cores"

print(f"Number of physical cores: {num_physical_cores}")
print(f"Number of cores per socket: {num_cores_per_socket}")
print(f"OpenMP environment variables:")
print(f"  - OMP_NUM_THREADS: {os.environ['OMP_NUM_THREADS']}")
print(f"  - OMP_PROC_BIND: {os.environ['OMP_PROC_BIND']}")
print(f"  - OMP_PLACES: {os.environ['OMP_PLACES']}")

Number of physical cores: 96
Number of cores per socket: 48
OpenMP environment variables:
  - OMP_NUM_THREADS: 96
  - OMP_PROC_BIND: close
  - OMP_PLACES: cores


___
#### Step 2: Initializing the XPU and monitoring GPU memory in realtime 🎮

Next, we'll initialize the Intel Max 1550 GPU, which is referred to as an XPU. We'll use the `intel_extension_for_pytorch` library to seamlessly integrate XPU namespace with. 🤝

##### 👀 GPU Memory Monitoring 👀

To keep track of the Intel Max 1550 GPU (XPU) memory usage throughout this notebook, please refer to the cell below. It displays the current memory usage and updates every 5 seconds, providing you with real-time information about the GPU's memory consumption. 📊

The memory monitoring cell displays the following information:

- XPU Device Name: The name of the Intel Max 1550 GPU being used.
- Reserved Memory: The amount of memory currently reserved by the GPU.
- Allocated Memory: The amount of memory currently allocated by the GPU.
- Max Reserved Memory: The maximum amount of memory that has been reserved by the GPU.
- Max Allocated Memory: The maximum amount of memory that has been allocated by the GPU.

Keep an eye on this cell to monitor the GPU memory usage as you progress through the notebook. If you need to check the current memory usage at any point, simply scroll down to the memory monitoring cell for a quick reference. 👇

In [3]:
import asyncio
import threading
import torch
from IPython.display import display, HTML

import torch
import intel_extension_for_pytorch as ipex

if torch.xpu.is_available():
    torch.xpu.empty_cache()
    
    def get_memory_usage():
        memory_reserved = round(torch.xpu.memory_reserved() / 1024**3, 3)
        memory_allocated = round(torch.xpu.memory_allocated() / 1024**3, 3)
        max_memory_reserved = round(torch.xpu.max_memory_reserved() / 1024**3, 3)
        max_memory_allocated = round(torch.xpu.max_memory_allocated() / 1024**3, 3)
        return memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated
   
    def print_memory_usage():
        device_name = torch.xpu.get_device_name()
        print(f"XPU Name: {device_name}")
        memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated = get_memory_usage()
        memory_usage_text = f"XPU Memory: Reserved={memory_reserved} GB, Allocated={memory_allocated} GB, Max Reserved={max_memory_reserved} GB, Max Allocated={max_memory_allocated} GB"
        print(f"\r{memory_usage_text}", end="", flush=True)
    
    async def display_memory_usage(output):
        device_name = torch.xpu.get_device_name()
        output.update(HTML(f"<p>XPU Name: {device_name}</p>"))
        while True:
            memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated = get_memory_usage()
            memory_usage_text = f"XPU ({device_name}) :: Memory: Reserved={memory_reserved} GB, Allocated={memory_allocated} GB, Max Reserved={max_memory_reserved} GB, Max Allocated={max_memory_allocated} GB"
            output.update(HTML(f"<p>{memory_usage_text}</p>"))
            await asyncio.sleep(5)
    
    def start_memory_monitor(output):
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.create_task(display_memory_usage(output))
        thread = threading.Thread(target=loop.run_forever)
        thread.start()    
    output = display(display_id=True)
    start_memory_monitor(output)
else:
    print("XPU device not available.")

___
#### Step 3: Configuring the LoRA Settings 🎛️

To finetune our Gemma model efficiently, we'll use the LoRA (Low-Rank Adaptation) technique. 

LoRA allows us to adapt the model to our specific task by training only a small set of additional parameters. This greatly reduces the training time and memory requirements! ⏰

We'll define the LoRA configuration, specifying the rank (`r`) and the target modules we want to adapt. 🎯

In [1]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    # could use q, v and 0 projections as well and comment out the rest
    target_modules=["q_proj", "o_proj", 
                    "v_proj", "k_proj", 
                    "gate_proj", "up_proj",
                    "down_proj"],
    task_type="CAUSAL_LM")

___
#### Step 4: Loading the Gemma Model 🤖

Now, let's load the Gemma model using the Hugging Face `AutoModelForCausalLM` class. We'll also load the corresponding tokenizer to preprocess our input data. The model will be moved to the XPU for efficient training. 💪

> Note: Before running this notebook, please ensure you have read and agreed to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). You'll need to visit the Gemma model card on the Hugging Face Hub, accept the usage terms, and generate an access token with write permissions. This token will be required to load the model and push your finetuned version back to the Hub.

To create an access token:
1. Go to your Hugging Face account settings.
2. Click on "Access Tokens" in the left sidebar.
3. Click on the "New token" button.
4. Give your token a name, select the desired permissions (make sure to include write access), and click "Generate".
5. Copy the generated token and keep it secure. You'll use this token to authenticate when loading the model.

Make sure to follow these steps to comply with the terms of use and ensure a smooth finetuning experience. If you have any questions or concerns, please refer to the official Gemma documentation or reach out to the Hugging Face community for assistance.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now that you have logged in , let's load the model using transformers library:

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

USE_CPU = False
device = "xpu:0" if torch.xpu.is_available() else "cpu"
if USE_CPU:
    device = "cpu"
print(f"using device: {device}")

model_id = "Qwen/CodeQwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Set padding side to the right to ensure proper attention masking during fine-tuning
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
# Disable caching mechanism to reduce memory usage during fine-tuning
model.config.use_cache = False
# Configure the model's pre-training tensor parallelism degree to match the fine-tuning setup
model.config.pretraining_tp = 1 

using device: xpu:0


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

___
#### Step 5: Testing the Model 🧪

Before we start finetuning, let's test the Gemma model on a sample input to see how it performs out-of-the-box. We'll generate some responses bsaed on a few questions in the `test_inputs` list below. 🌿

In [18]:
# Example prompt and messages for generating response
prompt = "Grade my code against requirements and provide feedback for each wrong step and right step "
code = """
import re

def regex_stat(file_path):
    # Read text from file
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    while True:
        token = input("Enter a token to search (or 'quit' to exit): ").strip()
        
        if token.lower() == 'quit':
            break
        
        # Use re.findall to find all occurrences of the token in the text
        occurrences = re.findall(r'\b{}\b'.format(re.escape(token)), text, flags=re.IGNORECASE)
        
        print(f"Token '{token}' appears {len(occurrences)} times in the text.")

def regex_substitute(file_path):
    # Read text from file
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    while True:
        old_token = input("Enter the token to replace (or 'quit' to exit): ").strip()
        
        if old_token.lower() == 'quit':
            break
        
        new_token = input(f"Enter the new token to substitute '{old_token}' with: ").strip()

        # Use re.sub to substitute all occurrences of old_token with new_token
        modified_text = re.sub(r'\b{}\b'.format(re.escape(old_token)), new_token, text, flags=re.IGNORECASE)
        
        # Print the modified text
        print("Modified text:")
        print(modified_text)
        
        # Optionally, write the modified text back to the file
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(modified_text)
            print("Changes saved to file.")

# Example usage:
if __name__ == "__main__":
    file_path = 'sample.txt'  # Replace with the relative path to your text file
    
    while True:
        print("\nChoose an option:")
        print("1. Search for a token (regex_stat)")
        print("2. Substitute tokens (regex_substitute)")
        print("3. Quit")
        choice = input("Enter your choice (1/2/3): ").strip()

        if choice == '1':
            regex_stat(file_path)
        elif choice == '2':
            regex_substitute(file_path)
        elif choice == '3':
            print("Exiting program.")
            break
        else:
            print("Invalid choice. Please enter 1, 2, or 3.")


"""

requir ="""
Task -
In this homework, you will be creating two functions: 
1) allows a user to enter a token (word) and
then uses regular expressions to create stats about the token in the text, 
2) allows a user to substitute
new tokens for old tokens (via regular expressions).

Coding requirements -
• You must use regular expressions in python to accomplish the above task. (re library)
o Hint: re.sub
• You must not split the text into individual words (you will lose points if you do)
• Your function must take in a relative path to a filename and use the text from that file for
the search and substitute.
• Your function should loop until the user enters “quit”, allowing multiple substitutions to be
made.

Grading Rubrik -
Assignment will be graded as follows:
Description Points
Code Runs 10
Regex_stat Implementation 40
Regex_substitute Implementation 40
Code (Comments, functions, cleanliness, readability) 10
Total: 100

"""

outcome_format = """
{
  "Score": "{Score}",
  "Evaluation feedback": "{Evaluation feedback}",
  "Suggestion": "{Suggestion}"
  "Total_Score" : "{Maximum possible Score}" 
}
"""

messages = [
    {"role": "system", "content": "You are a critic feedback system and grader providing marks out of 100 with valid reasons why less marks are granted and grant fewer marks for wrong implementation."},
    {"role": "user", "content": f"{prompt} Code is as below : {code} Evaluate by requirements given: {requir} in given output format {outcome_format} + dont provide code at any point"}
]


# Format input using apply_chat_template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate response
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [output_ids[len(model_inputs.input_ids[0]):] for output_ids in generated_ids]

# Decode response
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Generated Response:", response)

Generated Response: {
  "Score": "70",
  "Evaluation feedback": "The code is good, but the regex_stat function has some bugs. The function is not case-insensitive as requested, and it uses the wrong regex syntax to escape the token. ",
  "Suggestion": "Ensure that the regex_stat function is case-insensitive and use the correct syntax to escape the token. ",
  "Total_Score": "100"
}


In [6]:
#Using Flsk to create API for this model 
!pip install flask
!pip install pyngrok
from flask import Flask, request, jsonify
import re

import getpass
import os
import threading
from pyngrok import ngrok, conf



ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Defaulting to user installation because normal site-packages is not writeable
ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Defaulting to user installation because normal site-packages is not writeable
Collecting pyngrok
  Downloading pyngrok-7.1.6-py3-none-any.whl.metadata (7.4 kB)
Downloading pyngrok-7.1.6-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
[0mSuccessfully installed pyngrok-7.1.6
Enter your authtoken, which

 ········


In [15]:
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = "2iGty5lGP8g2jc5apF7Fe3tfL4O_2iQrhGY6ViJm7rQwGStHs"

app = Flask(__name__)

# Open a ngrok tunnel to the HTTP server
public_url = ngrok.connect(5000).public_url
print(" * ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}/\"".format(public_url, 5000))

# Update any base URLs to use the public ngrok URL
app.config["BASE_URL"] = public_url

# ... Update inbound traffic via APIs to use the public-facing ngrok URL

@app.route('/evaluate', methods=['POST'])
def evaluate():
    data = request.get_json()

    if not data or 'code' not in data or 'requirements' not in data:
        return jsonify({"error": "Invalid input"}), 400

    code = data['code']
    requirements = data['requirements']
    prompt = "Grade my code against requirements and provide feedback for each wrong step and right step "
    outcome_format = """
    {
      "Score": "{Score}",
      "Evaluation feedback": "{Evaluation feedback}",
      "Suggestion": "{Suggestion}"
      "Total_Score" : "{Maximum possible Score}" 
    }
    """

    messages = [
        {"role": "system", "content": "You are a critic feedback system and grader providing marks out of 100 with valid reasons why less marks are granted and grant fewer marks for wrong implementation."},
        {"role": "user", "content": f"{prompt} Code is as below : {code} Evaluate by requirements given: {requir} in given output format {outcome_format} + dont provide code at any point"}
    ]
    
    
    # Format input using apply_chat_template
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer(text, return_tensors="pt").to(device)
    
    # Generate response
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [output_ids[len(model_inputs.input_ids[0]):] for output_ids in generated_ids]
    
    # Decode response
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    
    return jsonify(response)

if __name__ == "__main__":
    app.run(debug=True)


# Start ngrok tunnel
public_url = ngrok.connect(5000).public_url
print(f" * ngrok tunnel {public_url} -> http://127.0.0.1:5000/")

# Update any base URLs to use the public ngrok URL
app.config["BASE_URL"] = public_url

# Start Flask server (note: use_reloader=False to prevent it from restarting in Jupyter notebooks)
app.run(use_reloader=False)


    
    



2024-06-23 09:37:51,589 - pyngrok.ngrok - INFO - Opening tunnel named: http-5000-b38e3ae5-9c6b-4f8d-83ce-69686a46a77d


Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken
                                                                                                    

2024-06-23 09:37:52,827 - pyngrok.process - INFO - Overriding default auth token
2024-06-23 09:37:52,941 - pyngrok.process.ngrok - INFO - t=2024-06-23T09:37:52+0000 lvl=info msg="no configuration paths supplied"
2024-06-23 09:37:52,943 - pyngrok.process.ngrok - INFO - t=2024-06-23T09:37:52+0000 lvl=info msg="using configuration at default config path" path=/home/u3de0c7e3c41391700102d87c8bbfc17/.config/ngrok/ngrok.yml
2024-06-23 09:37:52,944 - pyngrok.process.ngrok - INFO - t=2024-06-23T09:37:52+0000 lvl=info msg="open config file" path=/home/u3de0c7e3c41391700102d87c8bbfc17/.config/ngrok/ngrok.yml err=nil
2024-06-23 09:37:52,951 - pyngrok.process.ngrok - INFO - t=2024-06-23T09:37:52+0000 lvl=info msg="starting web service" obj=web addr=127.0.0.1:4040 allow_hosts=[]


PyngrokNgrokError: The ngrok process was unable to start.