<a href="https://colab.research.google.com/github/Gaurav7888/BhagwadGitaGPT/blob/main/H2O_LLM_Studio_CLI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune a large language model using [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio)

In this notebook, we demonstrate how to finetune a large language model using the **CLI** of H2O LLM Studio.

In [4]:
!git clone https://github.com/h2oai/h2o-llmstudio.git
!cd h2o-llmstudio && git checkout ce10af57ff118a2bbb81b5b3eae12273e290299a -q
!cp -r h2o-llmstudio/. ./
!rm -r h2o-llmstudio

Cloning into 'h2o-llmstudio'...
remote: Enumerating objects: 648, done.
remote: Counting objects: 100% (473/473), done.
remote: Compressing objects: 100% (287/287), done.
remote: Total 648 (delta 261), reused 331 (delta 168), pack-reused 175
Receiving objects: 100% (648/648), 10.79 MiB | 16.89 MiB/s, done.
Resolving deltas: 100% (321/321), done.


In [5]:
# Install Python 3.10, which will be used within pipenv
!sudo add-apt-repository ppa:deadsnakes/ppa -y > /dev/null
!sudo apt install python3.10 python3.10-distutils psmisc -y > /dev/null
!curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 > /dev/null
    
# install requirements
!make setup > /dev/null



debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 6.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Creating a virtualenv for this project...
Pipfile: /content/Pipfile
Using /usr/bin/python3.10 (3.10.11) to create virtualenv...
Creating virtual environment...
created virtual environment CPython3.10.11.final.0-64 in 798ms
  creator Venv(dest=/root/.local/share/virtualenvs/content-cQIIIOO2, clear=False, no_vcs_ignore=False, global=False, describe=CPython3Posix)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/roo

In [7]:
!python -m pip install datasets > /dev/null
# -p creates the parent directory and does not fail if the cell is re-run
!mkdir -p data/oasst-data

In [8]:
!mv /content/data.csv /content/data/oasst-data

In [9]:
!mv /content/val_data.csv /content/data/oasst-data
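
The configuration used later in this notebook reads the prompt from an `Instruction` column and the answer from an `Output` column, so it can help to confirm that the uploaded CSV has both before training. The following is a minimal sketch using only the standard library; the helper name `check_columns` and the sample row are made up for illustration and are not part of H2O LLM Studio.

```python
import csv
import io

# Column names taken from the config in this notebook; the sample row is made up.
REQUIRED_COLUMNS = ("Instruction", "Output")


def check_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are missing from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Build a tiny in-memory CSV with the expected layout.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["Instruction", "Output"])
writer.writerow(["placeholder prompt", "placeholder answer"])

missing = check_columns(sample.getvalue())
print(missing)  # an empty list means both columns are present
```

To check the real file, the same helper could be called on the text read from `data/oasst-data/data.csv`.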

## Configurations

In H2O LLM Studio, we use dataclasses to specify various [finetuning parameters](https://github.com/h2oai/h2o-llmstudio/blob/main/docs/parameters.md).
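
As a rough illustration of this pattern, the sketch below nests one dataclass inside another and overrides a single default. `TrainingSketch` and `ConfigSketch` are hypothetical stand-ins, not the real `llm_studio` classes, and the fields shown are only a small subset chosen for the example.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration; not the real llm_studio classes.
@dataclass
class TrainingSketch:
    learning_rate: float = 1e-4
    epochs: int = 1


@dataclass
class ConfigSketch:
    experiment_name: str = "demo_experiment"
    # default_factory gives every ConfigSketch its own TrainingSketch instance.
    training: TrainingSketch = field(
        default_factory=lambda: TrainingSketch(learning_rate=0.00015)
    )


cfg = ConfigSketch()
# Overridden value and untouched default, respectively.
print(cfg.training.learning_rate, cfg.training.epochs)
```

The config cell in this notebook passes dataclass instances directly as field defaults, which works on the Python 3.10 interpreter installed above. Python 3.11 and later reject such unhashable defaults, so the sketch uses `default_factory` instead.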

In [10]:
%%writefile cfg_notebook.py

import os
from dataclasses import dataclass

from llm_studio.python_configs.text_causal_language_modeling_config import ConfigProblemBase, ConfigNLPCausalLMDataset, \
    ConfigNLPCausalLMTokenizer, ConfigNLPAugmentation, ConfigNLPCausalLMArchitecture, ConfigNLPCausalLMTraining, \
    ConfigNLPCausalLMPrediction, ConfigNLPCausalLMEnvironment, ConfigNLPCausalLMLogging


ROOT_DIR = "./data/oasst-data/"
@dataclass
class Config(ConfigProblemBase):
    output_directory: str = "output/demo_oasst-data/"
    experiment_name: str = "demo_experiment"
    llm_backbone: str = "EleutherAI/pythia-1.4b-deduped"

    dataset: ConfigNLPCausalLMDataset = ConfigNLPCausalLMDataset(
        train_dataframe=os.path.join(ROOT_DIR, "data.csv"),
        
        validation_strategy="automatic",
        validation_dataframe="",
        validation_size=0.01,

        prompt_column=("Instruction",),
        answer_column="Output",
        text_prompt_start="",
        text_answer_separator="",

        add_eos_token_to_prompt=True,
        add_eos_token_to_answer=True,
        mask_prompt_labels=False,

    )
    tokenizer: ConfigNLPCausalLMTokenizer = ConfigNLPCausalLMTokenizer(
        max_length_prompt=128,
        max_length_answer=128,
        max_length=128,
        padding_quantile=1.0
    )
    augmentation: ConfigNLPAugmentation = ConfigNLPAugmentation(token_mask_probability=0.0)
    architecture: ConfigNLPCausalLMArchitecture = ConfigNLPCausalLMArchitecture(
        backbone_dtype="float16",
        gradient_checkpointing=False,
        force_embedding_gradients=False,
        intermediate_dropout=0
    )
    training: ConfigNLPCausalLMTraining = ConfigNLPCausalLMTraining(
        loss_function="CrossEntropy",
        optimizer="AdamW",

        learning_rate=0.00015,

        batch_size=2,
        drop_last_batch=True,
        epochs=1,
        schedule="Cosine",
        warmup_epochs=0.0,

        weight_decay=0.0,
        gradient_clip=0.0,
        grad_accumulation=1,

        lora=True,
        lora_r=4,
        lora_alpha=16,
        lora_dropout=0.05,
        lora_target_modules="",

        save_best_checkpoint=False,
        evaluation_epochs=1.0,
        evaluate_before_training=False,
    )
    prediction: ConfigNLPCausalLMPrediction = ConfigNLPCausalLMPrediction(
        metric="BLEU",

        min_length_inference=2,
        max_length_inference=256,
        batch_size_inference=0,

        do_sample=False,
        num_beams=2,
        temperature=0.3,
        repetition_penalty=1.2,
    )
    environment: ConfigNLPCausalLMEnvironment = ConfigNLPCausalLMEnvironment(
        mixed_precision=True,
        number_of_workers=4,
        seed=1
    )

Overwriting cfg_notebook.py
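
With `lora=True`, only the low-rank adapter matrices are trained, and `lora_r` largely determines how many parameters that is. The arithmetic below is a sketch under stated assumptions: a hidden size of 2048 and 24 layers for `EleutherAI/pythia-1.4b-deduped`, and a single adapter per layer on the fused query/key/value projection. None of these values is stated in this notebook. Under those assumptions the result agrees with the `trainable params` line that the prompt script logs further down.

```python
# Assumed architecture values for EleutherAI/pythia-1.4b-deduped
# (not stated in this notebook).
HIDDEN_SIZE = 2048
NUM_LAYERS = 24
LORA_R = 4  # lora_r from cfg_notebook.py


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumption: one adapter per layer on the fused query/key/value projection,
# which maps hidden_size -> 3 * hidden_size.
per_layer = lora_params(HIDDEN_SIZE, 3 * HIDDEN_SIZE, LORA_R)
trainable = per_layer * NUM_LAYERS
total = 1415434240  # "all params" from the log in this notebook

print(trainable)
print(100 * trainable / total)  # percentage of parameters that are trainable
```

Doubling `lora_r` would double the adapter size under the same assumptions, since both matrices scale linearly with the rank.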


In [11]:
%%writefile run.sh

pipenv run python train.py -C cfg_notebook.py & 

wait
echo "all done"

Overwriting run.sh


In [12]:
!sh run.sh

  from distutils import util

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-04-26 23:15:00,101 - INFO: Global random seed: 1
2023-04-26 23:15:00,101 - INFO: Preparing the data...
2023-04-26 23:15:00,101 - INFO: Setting up automatic validation split...
2023-04-26 23:15:00,268 - INFO: Preparing train and validation data
2023-04-26 23:15:00,269 - INFO: Loading train dataset...
Downloading (…)okenizer_config.json: 100% 396/396 [00:00<00:00, 306kB/s]
Downloading (…)/main/tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 2.

In [16]:
import pandas as pd
val_outputs = pd.read_csv("output/demo_oasst-data/validation_predictions.csv")

In [17]:
val_outputs.head()

| Unnamed: 0 | Instruction | Output | pred_Output |
|---|---|---|---|
| 0 | What is the hindi commentary of this sanskrit ... | This shlok is from chapter 1 and shlok 10Hindi... | --अध्याय 1.10।. --अध्याय 1.10।. --अध्याय 1.10।... |
| 1 | What is the hindi commentary of this sanskrit ... | This shlok is from chapter 6 and shlok 29Hindi... | ईक्षते योगयुक्तात्मा सर्वत्र समदर्शनः।।6.29।।1... |
| 2 | What is the hindi commentary of this sanskrit ... | This shlok is from chapter 2 and shlok 44Hindi... | भोगात्मिका विधीयते हुए कहा जाता है कि भोगात्मि... |
| 3 | What is the hindi commentary of this sanskrit ... | This shlok is from chapter 1 and shlok 14Hindi... | ।\n\n1.14।।1.14।।1.14।।1.14।।1.14।-।1.14।-।1.1... |
| 4 | What is the hindi commentary of this sanskrit ... | This shlok is from chapter 4 and shlok 42Hindi... | भारतीय विषयोंके अनुसार भारतीय विषयोंके अनुसार ... |


In [19]:
for _, row in val_outputs.iloc[2:3].iterrows():
    print("============")
    print()
    print(row.Instruction)
    print()
    print("-----Target Answer-----")
    print()
    print(row.Output)
    print()
    print("-----Predicted Answer-----")
    print()
    print(row.pred_Output)
    print()


What is the hindi commentary of this sanskrit shlok in bhagvad gita मूल श्लोकः
भोगैश्वर्यप्रसक्तानां तयापहृतचेतसाम्।

व्यवसायात्मिका बुद्धिः समाधौ न विधीयते।।2.44।।

-----Target Answer-----

This shlok is from chapter 2 and shlok 44Hindi Commentary By Swami Ramsukhdas
 2.44।। व्याख्या -- 'तयापहृतचेतसाम्'-- पूर्वश्लोकोंमें जिस पुष्पित वाणीका वर्णन किया गया है  उस वाणीसे जिनका चित्त अपहृत हो गया है अर्थात् स्वर्गमें बड़ा भारी सुख है दिव्य नन्दनवन है अप्सराएँ हैं अमृत है ऐसी वाणीसे जिनका चित्त उन भोगोंकी तरफ खिंच गया है।
 'भोगैश्वर्यप्रसक्तानाम्'-- शब्द स्पर्श रूप रस और गन्ध ये पाँच विषय शरीरका आराम मान और नामकी बड़ाई इनके द्वारा सुख लेनेका नाम भोग है। भोगोंके लिये पदार्थ रूपयेपैसे मकान आदिका जो संग्रह किया जाता है उसका नाम ऐश्वर्य है। इन भोग और ऐश्वर्यमें जिनकी आसक्ति है प्रियता है खिंचाव है अर्थात् इनमें जिनकी महत्त्वबुद्धि है उनको  'भोगैश्वर्यप्रसक्तानाम्'  कहा गया है।

जो भोग और ऐश्वर्यमें ही लगे रहते हैं वे आसुरी सम्पत्तिवाले होते हैं। कारण कि असु नाम प्राणोंका है और उन प्राणोंको जो

### Inference and prompting

You can also load the trained model and manually prompt it.

In [28]:
!pipenv run python prompt.py --e output/demo_oasst-data/


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
Loading model weights...
trainable params: 786432 || all params: 1415434240 || trainable%: 0.055561182411413196

You can change inference parameters on the fly by typing --param value, such as --num_beams 4. You can also chain them such as --num_beams 4 --top_k 30.

Please enter some prompt (type 'exit' to stop): what is the meaning of नैव किञ्चित्करोमीति 
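
The `--param value` convention described in the message above can be pictured with a small parser. This is a hypothetical sketch of one way to separate prompt text from trailing overrides; the function `split_prompt_and_params` is invented for illustration, and `prompt.py` may handle its input differently.

```python
def split_prompt_and_params(line):
    """Split an input line into prompt text and trailing `--param value` pairs."""
    tokens = line.split()
    params = {}
    # Peel `--name value` pairs off the end of the line, right to left.
    while len(tokens) >= 2 and tokens[-2].startswith("--"):
        value = tokens.pop()
        name = tokens.pop()[2:]  # drop the leading "--"
        params[name] = value
    return " ".join(tokens), params


prompt, params = split_prompt_and_params("hello world --num_beams 4 --top_k 30")
print(prompt)
print(params)
```

Values are kept as strings here; a real implementation would also need to convert them to the types the generation settings expect.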