<a href="https://colab.research.google.com/github/Lostkyd/Thesis/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Training LLM

In [99]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git


#### Confirm CUDA

In [100]:
import torch
torch.cuda.is_available()

#### Load Base Model

In [101]:

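import os

import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

# Read the Hugging Face access token from the environment instead of
# hard-coding it in the notebook; Llama 2 is a gated model.
hf_token = os.environ.get("HF_TOKEN")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map='auto',
    token=hf_token,
)
# Load the tokenizer that belongs to the same checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=hf_token)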


##### View Model Summary

In [102]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_

In [103]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  # Cast the LM head's output to fp32.
  def forward(self, x):
    return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

#### Helper Function

In [104]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### Obtain LoRA Model

In [105]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 33554432 || all params: 6771970048 || trainable%: 0.49548996469513035
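
The reported count can be reproduced by hand. The sketch below assumes that PEFT fell back to its default Llama targets, `q_proj` and `v_proj` (the config above does not set `target_modules`), and that each adapted linear layer receives two low-rank matrices of shape `r x in_features` and `out_features x r`. The layer sizes come from the model summary printed earlier, and `all_params` is copied from the output above.

```python
# Sketch: reproduce the LoRA trainable-parameter count by hand.
# Assumption: adapters are attached to q_proj and v_proj only.

def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

hidden_size = 4096                      # from the printed model summary
num_layers = 32                         # 32 x LlamaDecoderLayer
r = 64                                  # LoraConfig(r=64)
lora_alpha = 16                         # LoraConfig(lora_alpha=16)
target_modules = ("q_proj", "v_proj")   # assumed default targets

trainable = num_layers * len(target_modules) * lora_param_count(
    hidden_size, hidden_size, r
)
all_params = 6771970048                 # copied from the output above

print(f"trainable params: {trainable}")
print(f"trainable%: {100 * trainable / all_params}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the total matches the 33554432 printed by `print_trainable_parameters`. With `lora_alpha=16` and `r=64`, the standard LoRA scaling factor `lora_alpha / r` is 0.25.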


#### Load Sample Dataset

In [106]:
import json

dataset_file = '/content/drive/MyDrive/InstructionDataset.json'

```
### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>
```
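
One way to fill this template is a small helper, sketched here. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and not part of the notebook. Leaving out the answer gives the inference-time prompt, which ends right after the `### ANSWER` header.

```python
# Sketch: fill the CONTEXT / QUESTION / ANSWER template shown above.
# PROMPT_TEMPLATE and build_prompt are illustrative names (assumptions).

PROMPT_TEMPLATE = (
    "### CONTEXT\n{context}\n\n"
    "### QUESTION\n{question}\n\n"
    "### ANSWER\n"
)

def build_prompt(context, question, answer=None, eos_token="</s>"):
    """Return a training example if `answer` is given, else an inference prompt."""
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    if answer is not None:
        # Training text ends with the answer followed by the end-of-sequence marker.
        prompt += f"{answer}{eos_token}"
    return prompt

train_text = build_prompt("Cheese is the best food.", "What is the best food?", "Cheese")
infer_text = build_prompt("Cheese is the best food.", "What is the best food?")
print(train_text)
```

Keeping one helper for both cases means the training text and the inference prompt share exactly the same prefix.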

In [216]:
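from datasets import load_dataset
# split='train' (assumed split name) returns a single Dataset rather than a DatasetDict
dataset = load_dataset('Lostkyd/pdf_forms', split='train')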

In [269]:
dataset

Dataset({
    features: ['Input', 'Output', 'Instruction'],
    num_rows: 35
})

In [285]:
from datasets import Dataset
df = dataset.to_pandas()

In [286]:
df

| | Input | Output | Instruction |
|---|---|---|---|
| 0 | \n \n \nCROSS ENROLLMENT FORM \nCCIT - FO... | RESEARCH METHODS, INTERNSHIP 2 , 3.0 , 3.0 , M... | What is the course description, units, section... |
| 1 | \n \nCROSS ENROLLMENT FORM \nCCIT - FO - 012... | Edson John Domingo, Mr. Johua Galvez, Mr. Rafa... | Who has the signature in this form? What is t... |
| 2 | \n \n \nCROSS ENROLLMENT FORM \nCCIT - FO... | CCAUTOMA, 3.0, disapproved | What is the course code in this form? How many... |
| 3 | PROCEDURE: \nSTEP 1 – Fill-up form \nSTEP 2 – ... | Arlene O. Trillanes. Yes, it has a signature. | Who approved this shifting form? Does it have ... |
| 4 | \nPROCEDURE: \nSTEP 1 – Fill-up form \nSTE... | Santos, Maria Teresa R. , Jose Luis M. Rodrigu... | Give me the name of individuals that has a sig... |
| 5 | \n \n \n \nREG-FO-042 \n \nAPPLICATION FOR ... | BS Computer Science specialization in Machine ... | What is the course of the student in this form? |
| 6 | \n \n \n \nREG-FO-042 \n \nAPPLICATION FOR ... | Edwards, Sofia T., Alexander Harrison, Sophia... | Who are the individuals that have the signatur... |
| 7 | \n \nCROSS ENROLLMENT FORM \nCCIT - FO - 012... | Disapproved | What is the status of this cross enrollment form? |
| 8 | REG - FO - 002\nRevision Status/Date 1 : 07No... | Jackson Powell. Yes, it has a signature. | Who is the university registrar in this form? ... |
| 9 | \n \n \n \n \nREG-FO-042 \n \n \nAPPL... | There is no deans name and signature on the form. | What is the name of dean in this form? is the... |


In [287]:
df.loc[0]

Input           \n  \n  \nCROSS ENROLLMENT FORM   \nCCIT - FO...
Output         RESEARCH METHODS, INTERNSHIP 2 , 3.0 , 3.0 , M...
Instruction    What is the course description, units, section...
Name: 0, dtype: object

In [307]:
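def merge(row):
    # `row` is a single DataFrame row. The comma-separated expression below
    # builds a tuple of fragments, not one concatenated string.
    output = "### Instruction: {", row["Instruction"] , "}", "### Input: {", row["Input"], "}", "### Output: {", row["Output"], "}"
    return output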

In [308]:
merge(df.loc[0])

('### Instruction: {',
 'What is the course description, units, section and schedule in this form? ',
 '}',
 '### Input: {',
 ' \n  \n  \nCROSS ENROLLMENT FORM   \nCCIT - FO - 012   \nRevision Status/Date: 05/15/2020 \n1st TERM AY : 2020-2021   \n   \n   \n   \n   \n   \n   \nStudent’s Copy   \nName (Lastname,  Given Name, Middle Initial):   \n  Smith, John A. \n  \nStudent ID:   \n  2018-100731 \n  \nProgram:   \nMedical Technology  \n  \nDate:   \n   \n 05/15/2020 \n#   Course Code    \nCourse Description   \nUnits   \nSection   \nSchedule   \n1   \nCMSCSMTD \nRESEARCH METHODS \n3.0 \nMSCS23A \n \n \n05:00PM - 09:00PM \n \n2   \nCTNTERN1 \nINTERNSHIP 2 \n3.0 \n  MIT2021 \n06:00PM - 08:00PM \n06:00PM - 08:00PM \n \n3      \n   \n   \n   \n   \n4      \n   \n   \n   \n   \n5      \n   \n   \n   \n   \nReason for Cross Enrollment:   \n   \nRequested by:    \n   \n  John A. Smith \nStudent’s Signature   \nEndorsed by:   \n   \nMrs. Maria L. Santos \nFaculty Adviser’s Signature over Print
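
The result above is a tuple because a comma-separated expression in Python builds a tuple rather than concatenating its parts. If a single string per row is wanted instead, a variant along these lines could be used. This is a minimal sketch: `merge_to_text` is a hypothetical name, the section layout is one possible choice, and the `Input` value in the demo row is a placeholder rather than real form text.

```python
# Sketch: build one prompt string per row instead of a tuple of fragments.
# merge_to_text is an illustrative name, not part of the notebook.

def merge_to_text(row):
    """Join the three columns of a row into a single prompt string."""
    return (
        f"### Instruction: {row['Instruction']}\n"
        f"### Input: {row['Input']}\n"
        f"### Output: {row['Output']}"
    )

# A plain dict stands in for a DataFrame row; both support row["column"].
demo_row = {
    "Instruction": "What is the status of this cross enrollment form?",
    "Input": "<form text>",  # placeholder, not actual dataset content
    "Output": "Disapproved",
}

text = merge_to_text(demo_row)
print(type(text).__name__)
print(text)
```

Because the function only indexes by column name, it can be passed to `df.apply(merge_to_text, axis=1)` in the same way as `merge`. Note that text built this way is a plain string, so it would be tokenized without `is_split_into_words=True`.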

In [310]:
df1 = df.apply(merge, axis=1)

In [311]:
df1

0     (### Instruction: {, What is the course descri...
1     (### Instruction: {, Who has the signature in ...
2     (### Instruction: {, What is the course code i...
3     (### Instruction: {, Who approved this shiftin...
4     (### Instruction: {, Give me the name of indiv...
5     (### Instruction: {, What is the course of the...
6     (### Instruction: {, Who are the individuals t...
7     (### Instruction: {, What is the status of thi...
8     (### Instruction: {, Who is the university reg...
9     (### Instruction: {, What is  the name of dean...
10    (### Instruction: {, Does the parent/guardian ...
11    (### Instruction: {, Who is the student who ha...
12    (### Instruction: {, what is the course code t...
13    (### Instruction: {, what is the course of the...
14    (### Instruction: {, what is the Student No. o...
15    (### Instruction: {, Why is the student making...
16    (### Instruction: {, what is the schedule of M...
17    (### Instruction: {, Who verified and rece

In [314]:
df2 = df1.to_frame()

In [320]:
df2.rename(columns={0:"text"}, inplace=True)

In [321]:
df2.head()

| | text |
|---|---|
| 0 | (### Instruction: {, What is the course descri... |
| 1 | (### Instruction: {, Who has the signature in ... |
| 2 | (### Instruction: {, What is the course code i... |
| 3 | (### Instruction: {, Who approved this shiftin... |
| 4 | (### Instruction: {, Give me the name of indiv... |


In [322]:
dataset = Dataset.from_pandas(df2)

In [325]:
from transformers import AutoTokenizer
from datasets import load_dataset

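# bert-base-uncased is a WordPiece tokenizer with a 512-token limit; its
# vocabulary differs from that of the Llama-2 base model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")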

In [328]:
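def tokenization(batch):
    # Each "text" entry is a sequence of fragments produced by merge(),
    # so it is passed to the tokenizer as pre-split words.
    return tokenizer(batch["text"], is_split_into_words=True)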

dataset = dataset.map(tokenization, batched=True)

Token indices sequence length is longer than the specified maximum sequence length for this model (929 > 512). Running this sequence through the model will result in indexing errors


In [329]:
dataset

Dataset({
    features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 35
})

#### Train LoRA

In [None]:
dataset[0]

In [337]:
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

NotImplementedError: ignored
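
To make the training arguments above more concrete, the sketch below works out the effective batch size and the learning rate at each optimizer step. It assumes a single GPU and the `Trainer` default linear schedule with warmup; `linear_warmup_factor` is a hypothetical helper written to mirror that schedule's formula, not a library function. The row count of 35 comes from the dataset summary printed earlier.

```python
# Sketch: what the TrainingArguments above imply, under stated assumptions.

def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier on the base learning rate for a linear warmup/decay schedule."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 100
max_steps = 100
learning_rate = 1e-3
num_rows = 35       # from the dataset summary
num_devices = 1     # assumption: single GPU

# Examples contributing to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on examples processed over the whole run (assumes full batches).
max_examples_seen = effective_batch * max_steps
approx_passes = max_examples_seen / num_rows

# Learning rate used at each of the optimizer steps 0..max_steps-1.
lrs = [learning_rate * linear_warmup_factor(s, warmup_steps, max_steps)
       for s in range(max_steps)]

print("effective batch size:", effective_batch)
print("max examples seen:", max_examples_seen)
print("approx. passes over the data:", round(approx_passes, 1))
print("lr at steps 0, 50, 99:", lrs[0], lrs[50], lrs[99])
```

Because `warmup_steps` equals `max_steps`, every step in this sketch falls inside the warmup phase: the learning rate climbs for the whole run and stays below the configured `1e-3`. With only 35 rows, 100 steps at an effective batch size of 16 also amounts to several dozen passes over the data.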

In [None]:
print(dataset)

In [None]:
HUGGING_FACE_USER_NAME = ""

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model_name = ""

model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", token=True)

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=False, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
qa_model = PeftModel.from_pretrained(model, peft_model_id)

In [None]:
from IPython.display import display, Markdown

def make_inference(context, question):
  prompt = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n"
  # Move the tokenized prompt to the model's device before generating.
  batch = tokenizer(prompt, return_tensors='pt').to(qa_model.device)

  with torch.cuda.amp.autocast():
    output_tokens = qa_model.generate(**batch, max_new_tokens=200)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))

In [None]:
context = "Cheese is the best food."
question = "What is the best food?"

make_inference(context, question)

In [None]:
context = "Cheese is the best food."
question = "How far away is the Moon from the Earth?"

make_inference(context, question)

In [None]:
context = "The Moon orbits Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration."
question = "At what distance does the Moon orbit the Earth?"

make_inference(context, question)

In [None]:
marketmail_model = PeftModel.from_pretrained(model, "c-s-ale/bloom-7b1-marketmail-ai")

In [None]:
from IPython.display import display, Markdown

def make_inference_mm_ai(product, description):
  prompt = f"Below is a product and description, please write a marketing email for this product.\n\n### Product:\n{product}\n### Description:\n{description}\n\n### Marketing Email:\n"
  # Move the tokenized prompt to the model's device before generating.
  batch = tokenizer(prompt, return_tensors='pt').to(marketmail_model.device)

  with torch.cuda.amp.autocast():
    output_tokens = marketmail_model.generate(**batch, max_new_tokens=200)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))

In [None]:
your_product_name_here = "The Coolinator"
your_product_description_here = "A personal cooling device to keep you from getting overheated on a hot summer's day!"

make_inference_mm_ai(your_product_name_here, your_product_description_here)