Step 1: Checking GPU Availability
Our first step is to ensure that we have access to GPU resources. Let's kick things off by running the following command:

In [None]:
!nvidia-smi

Thu Nov  9 01:11:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    41W / 350W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

This command queries the NVIDIA System Management Interface to display information about our GPU. It's a crucial step to verify that our environment is GPU-enabled, which is essential for accelerating the training of large language models.
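
The same check can also be done from Python, which is handy when the notebook should stop early on a CPU-only runtime. The following is a minimal sketch using only the standard library; the helper name `gpu_visible` is made up for illustration, and it assumes that a successful `nvidia-smi -L` call means a GPU is usable.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "-L" lists the detected GPUs, one per line.
        completed = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False`, switching the runtime to a GPU instance before continuing is likely to save time.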

## Setup

Step 2: Cloning the Repository and Installing Dependencies
Next up, we'll clone the Alpaca-LoRA repository and install the required dependencies. Execute the following commands:

In [None]:
!git clone https://github.com/tloen/alpaca-lora
!pip install -r alpaca-lora/requirements.txt
!pip install huggingface_hub

Cloning into 'alpaca-lora'...
remote: Enumerating objects: 607, done.
remote: Total 607 (delta 0), reused 0 (delta 0), pack-reused 607
Receiving objects: 100% (607/607), 27.84 MiB | 21.08 MiB/s, done.
Resolving deltas: 100% (357/357), done.
Collecting git+https://github.com/huggingface/peft.git (from -r alpaca-lora/requirements.txt (line 9))
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-78q8_v7x
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-78q8_v7x
  Resolved https://github.com/huggingface/peft.git to commit 3ff90626b6c4ec5c611392298e0f0339132bcc24
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate (from -r alpaca-lora/requirements.txt (line 1))
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)

These commands fetch the Alpaca-LoRA repository from GitHub and install the necessary packages, including huggingface_hub, the client library for the Hugging Face Hub, which is used for managing and sharing models and datasets.

Step 3: Updating and Installing Python Packages
Now, let's make sure we have the correct versions of some essential Python packages. Execute the following commands:

In [None]:
!pip install -U pip
!pip install accelerate==0.18.0
!pip install appdirs==1.4.4
!pip install bitsandbytes==0.37.2
!pip install datasets==2.10.1
!pip install fire==0.5.0
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torch==2.0.0
!pip install sentencepiece==0.1.97
!pip install tensorboardX==2.6
!pip install gradio==3.23.0

Collecting pip
  Downloading pip-23.3.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 21.6 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.1
Collecting accelerate==0.18.0
  Downloading accelerate-0.18.0-py3-none-any.whl (215 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m215.3/215.3 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.18.0
Collecting bitsandbytes==0.37.2
  Downloading bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.2/84.2 MB 5.9 MB/s eta 0:00:00
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0

These commands ensure that our Python environment is equipped with the correct versions of the required packages, including the Accelerate library for distributed training, Hugging Face's Transformers library, and Gradio for creating interactive user interfaces for machine learning models.
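
Because the requirements file and the explicit pins can disagree (the log above first downloads accelerate 0.24.1 and then installs 0.18.0), it may help to confirm which versions actually ended up installed. Below is a small sketch using the standard library's `importlib.metadata`; the `PINNED` dictionary copies a few pins from the cell above, and `check_pins` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# A few of the pins from the install cell above.
PINNED = {
    "accelerate": "0.18.0",
    "bitsandbytes": "0.37.2",
    "datasets": "2.10.1",
    "torch": "2.0.0",
    "sentencepiece": "0.1.97",
}


def check_pins(pins):
    """Map package name -> (installed version or None, matches pin?)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu117" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    print(f"{name:15s} installed={installed} matches_pin={ok}")
```

In a notebook, a mismatch after installation often just means the runtime needs a restart so the newly installed versions are picked up.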

Congratulations! With these steps, you've successfully set up an environment ready to explore the fascinating world of natural language processing using Hugging Face's Transformers library. I encourage you to delve into the tutorial notebooks and start experimenting with the powerful tools at your disposal.

In [None]:
import transformers
import textwrap
from transformers import LlamaTokenizer, LlamaForCausalLM
import os
import sys
from typing import List

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
)

import fire
import torch
from datasets import load_dataset
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pylab import rcParams
import json

%matplotlib inline
sns.set(rc={'figure.figsize':(8, 6)})
sns.set(rc={'figure.dpi':100})
sns.set(style='white', palette='muted', font_scale=1.2)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

'cuda'

Step 4: Importing Pandas
Now that we've set up our environment, let's import Pandas, the data manipulation library. It was already imported in the previous cell, so this command is redundant but harmless:

In [None]:
import pandas as pd

This command imports the Pandas library, which is widely used for data manipulation and analysis in Python. The dataset itself is loaded in Step 7.

## Data

Step 5: Downloading the Dataset
Our next step is to acquire the training data. Execute the following command to download the GenMedGPT 5k dataset:

In [None]:
#GenMedGPT 5k
!gdown --id 1nDTKZ3wZbZWTkFMBkxlamrzbNz0frugg

Downloading...
From: https://drive.google.com/uc?id=1nDTKZ3wZbZWTkFMBkxlamrzbNz0frugg
To: /content/GenMedGPT-5k.json
100% 3.08M/3.08M [00:00<00:00, 173MB/s]


This command uses gdown to download the dataset from Google Drive.

## Alpaca-LoRA

Step 6: Setting up the Language Model and Tokenizer
Let's initialize our language model and tokenizer from the pre-trained LLaMA 7B checkpoint hosted on Hugging Face. Execute the following commands:

In [None]:
BASE_MODEL = "yahma/llama-7b-hf"

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)
tokenizer.padding_side = "left"

Downloading (…)lve/main/config.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/207 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


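These commands load LLaMA 7B in 8-bit precision, letting device_map="auto" place the weights on the available GPU, and load the matching tokenizer. The padding token ID is set to 0 so that it differs from the end-of-sequence token, and padding is applied on the left.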

## Dataset

Step 7: Loading the Dataset
Let's move on to loading our dataset. Execute the following commands:

In [None]:
data = load_dataset("json", data_files="GenMedGPT-5k.json")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

This command uses the Hugging Face Datasets library to load the JSON file downloaded in Step 5. Because no split is specified, all records are placed in a single train split.



Step 8: Exploring the Training Data
Now, let's take a quick look at the training data. Execute the following command:

In [None]:
data["train"]

Dataset({
    features: ['instruction', 'output', 'input'],
    num_rows: 5452
})

This command prints information about the training dataset, providing insights into its structure and contents.
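
The three features correspond to the fields of each record in the JSON file. The sketch below shows the record shape and a quick field check using only the standard library. It assumes the file is a JSON array of objects, and the record contents are placeholders, not actual rows from GenMedGPT-5k; `missing_fields` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

# Placeholder records that mimic the assumed structure of the dataset file.
sample_text = json.dumps([
    {"instruction": "<task description>", "input": "<patient message>", "output": "<expected reply>"},
    {"instruction": "<task description>", "input": "", "output": "<expected reply>"},
])


def missing_fields(records):
    """Map record index -> list of required fields that are absent."""
    problems = {}
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems[index] = missing
    return problems


records = json.loads(sample_text)
print(len(records), "records; problems:", missing_fields(records))
```

To run the check on the real file, the same function could be applied to `json.load(open("GenMedGPT-5k.json"))`, assuming the file has the structure described above.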

Step 9: Specifying a Cutoff Length
To manage the length of our input sequences, let's set a cutoff length. Execute the following command:

In [None]:
CUTOFF_LEN = 256

This variable, CUTOFF_LEN, will be used to limit the length of input sequences during training.

With these additional steps, you've now prepared the foundation for working with a pre-trained language model and loading your dataset.

Step 10: Creating a Prompt Generator Function
In this step, we define a function generate_prompt that takes a data point as input and constructs a prompt with instruction, input, and response. Execute the following command:

In [None]:
def generate_prompt(data_point):
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{data_point["instruction"]}
### Input:
{data_point["input"]}
### Response:
{data_point["output"]}"""

This function is designed to create a structured prompt using information from a given data point.
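
Some records may have an empty input field, and the inference prompt used later in this notebook has no Input section at all. One possible variant, sketched below, picks the template depending on whether an input is present. This is an illustrative alternative rather than the function used in this notebook; the name `generate_prompt_with_optional_input` and the example data point are made up.

```python
def generate_prompt_with_optional_input(data_point):
    """Build a training prompt, omitting the Input section when it is empty."""
    instruction = data_point["instruction"]
    context = data_point.get("input", "")
    if context:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n"
            f"### Instruction:\n{instruction}\n"
            f"### Input:\n{context}\n"
            f"### Response:\n{data_point['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n"
        f"### Instruction:\n{instruction}\n"
        f"### Response:\n{data_point['output']}"
    )


# Placeholder data point, not taken from the dataset.
example = {"instruction": "<task description>", "input": "", "output": "<expected reply>"}
print(generate_prompt_with_optional_input(example))
```

Keeping the training and inference templates aligned in this way is a design choice worth considering, since a model tends to respond best to prompts shaped like the ones it was trained on.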

Step 11: Tokenizing Prompts
Next, we define a tokenization function tokenize and a utility function generate_and_tokenize_prompt. Execute the following commands:

In [None]:
def tokenize(prompt, add_eos_token=True):
    # there's probably a way to do this with the tokenizer settings
    # but again, gotta move fast
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < CUTOFF_LEN
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    result["labels"] = result["input_ids"].copy()

    return result

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt

These functions tokenize the generated prompts, preparing them for consumption by the language model.
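
The truncation and end-of-sequence logic in tokenize can be hard to follow at a glance. The toy sketch below reproduces it on plain lists of integers, without a real tokenizer. The small cutoff of 8 and the EOS id of 2 are assumptions chosen for illustration, and `truncate_and_add_eos` is a hypothetical name.

```python
CUTOFF_LEN = 8  # tiny cutoff for illustration; the notebook uses 256
EOS_ID = 2      # assumed EOS token id for this toy example


def truncate_and_add_eos(token_ids, cutoff_len=CUTOFF_LEN, eos_id=EOS_ID, add_eos_token=True):
    """Mimic tokenize(): truncate, then append EOS only if there is room."""
    ids = list(token_ids[:cutoff_len])
    if add_eos_token and len(ids) < cutoff_len and (not ids or ids[-1] != eos_id):
        ids.append(eos_id)
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "labels": ids.copy(),  # causal LM: labels are a copy of the inputs
    }


short = truncate_and_add_eos([5, 6, 7])
long = truncate_and_add_eos(list(range(10, 30)))
print(short["input_ids"])  # [5, 6, 7, 2]
print(long["input_ids"])   # [10, 11, 12, 13, 14, 15, 16, 17]
```

Note the consequence for long examples: a sequence that hits the cutoff is truncated and receives no EOS token, so the model never sees where such an answer ends.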

Step 12: Splitting and Processing Training and Validation Data
Now, let's split the training data into training and validation sets, and process them using the functions we defined earlier. Execute the following commands:

In [None]:
train_val = data["train"].train_test_split(
    test_size=200, shuffle=True, seed=42
)
train_data = (
    train_val["train"].shuffle().map(generate_and_tokenize_prompt)
)
val_data = (
    train_val["test"].shuffle().map(generate_and_tokenize_prompt)
)

Map:   0%|          | 0/5252 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

These commands split the training data into training and validation sets and apply tokenization to each data point.
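
To make the split sizes concrete, here is a standard-library sketch of a seeded shuffle-and-split over row indices. It illustrates the idea only: it will not reproduce the exact rows chosen by train_test_split, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and hold out test_size of them."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices[test_size:], indices[:test_size]


train_idx, val_idx = split_indices(n_rows=5452, test_size=200, seed=42)
print(len(train_idx), len(val_idx))  # 5252 200
```

The sizes match the two progress bars above: 5,452 rows minus 200 held-out rows leaves 5,252 for training.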

Step 13: Configuring Model Training Parameters
Define the training parameters, including the LoRA hyperparameters, batch size, learning rate, and other relevant settings:

In [None]:
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
]

BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LEARNING_RATE = 3e-4
TRAIN_STEPS = 300
OUTPUT_DIR = "experiments"

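With these values, each optimizer step accumulates gradients over BATCH_SIZE // MICRO_BATCH_SIZE = 32 micro-batches of 4 examples, for an effective batch size of 128. The LoRA settings restrict training to rank-8 adapters on the attention query and value projections (q_proj and v_proj).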

Step 14: Preparing and Configuring the LoRA Model
Now, let's prepare the model and wrap it with LoRA adapters using the hyperparameters defined above:

In [None]:
model = prepare_model_for_int8_training(model)
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199




These commands prepare the 8-bit model for training, wrap it with LoRA adapters using the specified hyperparameters, and print the number of trainable parameters: about 4.2 million, or roughly 0.06% of the model.
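
The trainable parameter count can be checked by hand. For a linear layer with input size d_in and output size d_out, a rank-r LoRA adapter adds two matrices with r × d_in and d_out × r entries. The sketch below assumes LLaMA 7B's published dimensions (hidden size 4096, 32 transformer layers) and square q_proj and v_proj layers.

```python
HIDDEN_SIZE = 4096  # assumed LLaMA 7B hidden size
NUM_LAYERS = 32     # assumed LLaMA 7B layer count
LORA_R = 8
TARGET_MODULES = ["q_proj", "v_proj"]


def lora_params_per_linear(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


trainable = (
    NUM_LAYERS
    * len(TARGET_MODULES)
    * lora_params_per_linear(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
)
all_params = 6_742_609_920  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")  # 4,194,304
print(f"trainable%: {100 * trainable / all_params:.4f}")
print(f"adapter size in float32: {trainable * 4 / 1e6:.1f} MB")  # 16.8 MB
```

The result matches the printed count, and the float32 size is consistent with the 16.8M adapter_model.safetensors file that appears in the upload and download progress bars later on.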

## Training

Step 15: Configuring Training Arguments
Let's configure the training arguments using the Transformers library. Execute the following command:

In [None]:
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=100,
    # max_steps=TRAIN_STEPS,
    num_train_epochs=1,
    learning_rate=LEARNING_RATE,
    fp16=True,
    logging_steps=10,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    output_dir=OUTPUT_DIR,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="tensorboard"
)

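These arguments set the per-device batch size, gradient accumulation, learning rate, mixed-precision (fp16) training, and the logging, evaluation, and checkpoint cadence. Note that max_steps is commented out, so TRAIN_STEPS from Step 13 is unused and training runs for one epoch instead.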
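
It is worth estimating how many optimizer steps one epoch actually produces with these settings. The back-of-the-envelope sketch below assumes a single GPU, the 5,252 training rows from Step 12, and the Trainer's default linear warmup; the exact step count may differ by one depending on how the installed Transformers version rounds the last partial accumulation.

```python
import math

TRAIN_ROWS = 5252
MICRO_BATCH_SIZE = 4
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32
WARMUP_STEPS = 100
EVAL_STEPS = 50
LEARNING_RATE = 3e-4

# Micro-batches the dataloader yields in one epoch (single device assumed).
micro_batches = math.ceil(TRAIN_ROWS / MICRO_BATCH_SIZE)

# Optimizer steps per epoch, rounding the last partial accumulation down or up.
steps_floor = micro_batches // GRADIENT_ACCUMULATION_STEPS
steps_ceil = math.ceil(micro_batches / GRADIENT_ACCUMULATION_STEPS)

# With linear warmup, the learning rate at step s < WARMUP_STEPS is
# LEARNING_RATE * s / WARMUP_STEPS.
lr_at_end = LEARNING_RATE * min(1.0, steps_floor / WARMUP_STEPS)

print("micro-batches per epoch:", micro_batches)
print("optimizer steps per epoch:", steps_floor, "to", steps_ceil)
print("warmup finishes within the epoch:", steps_ceil >= WARMUP_STEPS)
print("first evaluation reached:", steps_ceil >= EVAL_STEPS)
print("approximate learning rate at the last step:", lr_at_end)
```

If this arithmetic holds for your setup, a single epoch ends before the 100 warmup steps complete and before the first evaluation at step 50, so the learning rate never reaches 3e-4 and no intermediate checkpoint is written. Un-commenting max_steps=TRAIN_STEPS, or lowering warmup_steps and eval_steps, would be ways to change that.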



Step 16: Setting Up the Data Collator
Next, let's configure the data collator that assembles tokenized examples into padded batches. Execute the following command:

In [None]:
data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)

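Although LLaMA is a causal (decoder-only) model rather than a sequence-to-sequence one, DataCollatorForSeq2Seq is a convenient choice here because it pads the labels along with the input IDs. It pads each batch to a multiple of 8 and returns PyTorch tensors.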
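
To illustrate what the collator does to a batch, here is a simplified pure-Python sketch of the padding step. It assumes left padding and a pad token id of 0 (as configured on the tokenizer in Step 6) and a label padding value of -100, which the loss function ignores. The function `pad_batch` is a hypothetical stand-in, not the library implementation, and it returns lists rather than tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100, multiple=8, side="left"):
    """Pad every field to the longest sequence, rounded up to a multiple."""
    longest = max(len(f["input_ids"]) for f in features)
    target = ((longest + multiple - 1) // multiple) * multiple
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = target - len(f["input_ids"])
        pads = {
            "input_ids": [pad_token_id] * gap,
            "attention_mask": [0] * gap,   # padded positions are masked out
            "labels": [label_pad_id] * gap,  # padded positions are ignored by the loss
        }
        for key in batch:
            row = pads[key] + f[key] if side == "left" else f[key] + pads[key]
            batch[key].append(row)
    return batch


features = [
    {"input_ids": [1, 5, 6], "attention_mask": [1, 1, 1], "labels": [1, 5, 6]},
    {"input_ids": list(range(1, 11)), "attention_mask": [1] * 10, "labels": list(range(1, 11))},
]
batch = pad_batch(features)
print([len(row) for row in batch["input_ids"]])  # [16, 16]
```

Here the longest sequence has 10 tokens, so both rows are padded to 16, the next multiple of 8.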

Step 17: Initializing and Training the Trainer
Now, let's create the Trainer instance and start the training process. Execute the following commands:

In [None]:
trainer = transformers.Trainer(
    model=lora_model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=training_arguments,
    data_collator=data_collator
)

trainer.train()
lora_model.save_pretrained(OUTPUT_DIR)

These commands set up the Trainer with the LoRA model, datasets, training arguments, and data collator, then run training. Afterwards, save_pretrained writes the LoRA adapter weights (not the full base model) to OUTPUT_DIR.

Step 18: Logging in to Hugging Face Hub
To facilitate model sharing and collaboration, log in to the Hugging Face Hub. Execute the following command:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

This command prompts you to log in using your Hugging Face credentials.

Step 19: Pushing the Model to Hugging Face Hub
Now, let's push the trained LoRA model to the Hugging Face Hub. Execute the following command:

In [None]:
lora_model.push_to_hub("test", organization="KalbeDigitalLab", use_auth_token=True)



adapter_model.safetensors:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/KalbeDigitalLab/alpara-7b-new/commit/41b2df0c1ed6f404b595e92985adefc0201f5f70', commit_message='Upload model', commit_description='', oid='41b2df0c1ed6f404b595e92985adefc0201f5f70', pr_url=None, pr_revision=None, pr_num=None)

Replace "test" with the desired repository name, and "KalbeDigitalLab" with the appropriate organization name.

Step 20: Monitoring Training Progress with TensorBoard
Lastly, visualize the training progress using TensorBoard. Execute the following commands:

In [None]:
%load_ext tensorboard
%tensorboard --logdir experiments/runs

These commands load the TensorBoard extension and launch TensorBoard, allowing you to monitor key metrics and visualizations during the training process.

Congratulations! With these final steps, you've successfully configured, trained, and shared your LoRA-based language model using the Transformers library and Hugging Face Hub.






# Inspecting the Adapter Weights

Step 21: Downloading the SafeTensors File
To download the SafeTensors file for the model, execute the following command:

In [None]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="KalbeDigitalLab/alpara-7b-peft", filename="adapter_model.safetensors")

Downloading (…)er_model.safetensors:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

'/root/.cache/huggingface/hub/models--KalbeDigitalLab--alpara-7b-peft/snapshots/787a2e170a915e3a4b3327f8c004ce2d8a842e36/adapter_model.safetensors'

This command downloads the adapter's SafeTensors file from the specified repository and returns its path in the local cache.

Step 22: Loading SafeTensors and Extracting Tensors
Now, let's load the SafeTensors file and extract tensors. Execute the following commands:

In [None]:
from safetensors import safe_open

tensors = {}
with safe_open("/root/.cache/huggingface/hub/models--KalbeDigitalLab--alpara-7b-peft/snapshots/787a2e170a915e3a4b3327f8c004ce2d8a842e36/adapter_model.safetensors", framework="pt", device=0) as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)

These commands use the SafeTensors library to open the file and copy each tensor into a dictionary keyed by name. Passing device=0 places the tensors on the first GPU.

# Inference

Step 23: Loading the Adapted Model
Now, let's load the base model and apply the adapter using PEFT (Parameter-Efficient Fine-Tuning). Execute the following commands:

In [None]:
from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

tokenizer = LlamaTokenizer.from_pretrained("yahma/llama-7b-hf")

model = LlamaForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, "KalbeDigitalLab/alpara-7b-peft")


(…)7b-hf/resolve/main/tokenizer_config.json:   0%|          | 0.00/207 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

(…)-hf/resolve/main/special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


(…)hma/llama-7b-hf/resolve/main/config.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

(…)esolve/main/pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

(…)b-hf/resolve/main/generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

(…)7b-peft/resolve/main/adapter_config.json:   0%|          | 0.00/484 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

These commands load the original Llama model and apply the PEFT modifications.

Step 24: Defining a Prompt for Generation
Now, let's define a prompt for text generation. Execute the following command:

In [None]:
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
"how to cure flu?"

### Response:"""


This prompt provides instructions for the language model to generate a response related to curing the flu.

Step 25: Generating Responses
Finally, let's generate responses using the adapted model. Execute the following commands:

In [None]:
inputs = tokenizer(
    PROMPT,
    return_tensors="pt"
)
input_ids = inputs["input_ids"].cuda()

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.95,
    top_k=40,
    num_beams=4,
    repetition_penalty=1.15,
)
print("Generating...")
generation_output = model.generate(
    input_ids=input_ids,
    # generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=512,
)
for s in generation_output.sequences:
    result = tokenizer.decode(s).split("### Response:")[1]
    print(result)

Generating...

"If you have flu, you can try taking some medications to relieve your symptoms. You can also try drinking plenty of fluids, resting, and using a humidifier to help relieve your symptoms."</s>


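These commands use the adapted model to generate a response to the prompt. Note that the generation_config argument is commented out in the call to generate, so the temperature, top-p, top-k, beam search, and repetition penalty settings defined above are not applied; generation falls back to the model's default settings. Un-comment that line to use them.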
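
The printed result above still ends with the end-of-sequence marker </s>, and the split on "### Response:" raises an IndexError if the marker is missing from the decoded text. A slightly more defensive post-processing step could look like the sketch below; `extract_response` is a hypothetical helper, and the decoded string is assembled from the prompt and output shown above.

```python
def extract_response(decoded, marker="### Response:", eos_token="</s>"):
    """Return the text after the response marker, without the EOS token."""
    _, found, tail = decoded.partition(marker)
    if not found:
        # Marker missing: fall back to the whole text instead of failing.
        tail = decoded
    return tail.replace(eos_token, "").strip()


decoded = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n\n"
    '### Instruction:\n"how to cure flu?"\n\n'
    '### Response:\n"If you have flu, you can try taking some medications to '
    'relieve your symptoms."</s>'
)
print(extract_response(decoded))
```

An alternative is to pass skip_special_tokens=True to tokenizer.decode, which drops special tokens such as </s> during decoding.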

You've now completed the process of downloading the adapted model, loading it, and generating responses based on a given prompt. Feel free to experiment with different prompts and generation configurations to explore the capabilities of your fine-tuned language model!