# **Library Installation**

In [None]:
!pip install transformers datasets bitsandbytes accelerate peft
!pip install scikit-learn torch --upgrade
!pip install wandb flash-attn
!pip install evaluate

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.0-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting peft
  Downloading peft-0.13.0-py3-none-any.whl.metadata (13 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)

## Imports

In [None]:
import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import wandb
from huggingface_hub import notebook_login
import json
import hashlib
from sklearn.model_selection import train_test_split


## Environment Setup

In [None]:
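# Set the token as an environment variable.
# Never hard-code a real token in a notebook; revoke any token that has been published.
os.environ["HF_TOKEN"] = "<your-hf-token>"  # HF_TOKEN is the variable huggingface_hub reads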

# Login to Hugging Face
notebook_login()

# Set HF_HOME
os.environ['HF_HOME'] = 'REDACTED'

# Set environment variables
os.environ["WANDB_PROJECT"] = "phi-3-5-mini-qlora"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Adjust if using multiple GPUs

# Initialize wandb
wandb.init(project="phi-3-5-mini-qlora")


wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## Model Configuration

In [None]:

# Model and tokenizer configuration
model_name = "microsoft/Phi-3.5-mini-instruct"
output_dir = "./phi-3-5-mini-qlora-output"

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)


In [None]:

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model-00001-of-00002.safetensors: 4.97G

model-00002-of-00002.safetensors: 2.67G
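
The shard sizes in the download log give a quick way to sanity-check what 4-bit loading should save. The sketch below is a rough estimate only: it assumes the checkpoint is stored in a 16-bit dtype (2 bytes per parameter), and it ignores quantization constants and any layers that bitsandbytes leaves unquantized, so the real footprint will be somewhat larger.

```python
# Shard sizes as reported by the download log (decimal gigabytes).
shard_sizes_gb = [4.97, 2.67]

BYTES_PER_PARAM_16BIT = 2    # assumption: checkpoint stored in a 16-bit dtype
BYTES_PER_PARAM_4BIT = 0.5   # a 4-bit format packs two weights per byte

checkpoint_gb = sum(shard_sizes_gb)
n_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM_16BIT
nf4_gb = n_params * BYTES_PER_PARAM_4BIT / 1e9

print(f"checkpoint size : {checkpoint_gb:.2f} GB")
print(f"parameter count : {n_params / 1e9:.2f} B")
print(f"4-bit weights   : {nf4_gb:.2f} GB (lower bound)")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which is what makes a single-GPU run feasible.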

## LoRA configuration

In [None]:

# LoRA configuration
# Note: the full module paths below restrict the adapters to decoder layers 0-9 only.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        f"model.layers.{i}.{name}"
        for name in (
            "self_attn.o_proj",
            "self_attn.qkv_proj",
            "mlp.gate_up_proj",
            "mlp.down_proj",
        )
        for i in range(10)
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, peft_config)
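
To see how many trainable parameters this configuration adds, the count can be worked out by hand. The sketch below assumes a hidden size of 3072 and an MLP intermediate size of 8192 for this model (values not stated in this notebook), with the fused `qkv_proj` and `gate_up_proj` layers producing three and two times those widths respectively. Each adapted linear layer gains two matrices, `A` (r x in) and `B` (out x r).

```python
R = 16                   # LoRA rank from the config above
N_LAYERS_ADAPTED = 10    # range(10) in target_modules
HIDDEN = 3072            # assumption: model hidden_size
INTERMEDIATE = 8192      # assumption: model intermediate_size

# (in_features, out_features) for each targeted linear layer.
shapes = {
    "self_attn.o_proj": (HIDDEN, HIDDEN),
    "self_attn.qkv_proj": (HIDDEN, 3 * HIDDEN),
    "mlp.gate_up_proj": (HIDDEN, 2 * INTERMEDIATE),
    "mlp.down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * N_LAYERS_ADAPTED
size_mb = total * 4 / 1e6  # assumption: adapter saved as float32

print(f"per layer: {per_layer:,}  total: {total:,}  size: {size_mb:.1f} MB")
```

If these assumptions hold, the result is about 7.9 million trainable parameters and roughly 31.5 MB in float32, which is consistent with the `adapter_model.safetensors` size reported when the model is pushed to the Hub.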


## Data loading and preprocessing functions

In [None]:


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as file:
        return [json.loads(line) for line in file if line.strip()]

def format_ultrachat_data(data):
    """Convert '### Query / ### Response / ### References' records to UltraChat-style items."""
    formatted_data = []
    for item in data:
        text = item['text']
        # partition() tolerates a missing "### References:" section; the earlier
        # find()-based slicing silently dropped the last character in that case.
        _, _, after_query = text.partition("### Query:")
        query, _, after_response = after_query.partition("### Response:")
        response, _, _ = after_response.partition("### References:")
        query, response = query.strip(), response.strip()

        # Stable identifier derived from the prompt text
        prompt_id = hashlib.sha256(query.encode()).hexdigest()

        formatted_item = {
            "prompt": query,
            "prompt_id": prompt_id,
            "messages": [
                {"content": query, "role": "user"},
                {"content": response, "role": "assistant"}
            ]
        }
        formatted_data.append(formatted_item)
    return formatted_data

def prepare_datasets(data_path, tokenizer, max_length=2048):
    data = load_jsonl(data_path)
    train_data, test_data = train_test_split(data, test_size=0.3, random_state=42)

    train_data_formatted = format_ultrachat_data(train_data)
    test_data_formatted = format_ultrachat_data(test_data)

    def tokenize_function(examples):
        texts = [" ".join([msg['content'] for msg in example['messages']]) for example in examples['data']]
        return tokenizer(texts, padding="max_length", truncation=True, max_length=max_length)

    train_dataset = Dataset.from_dict({"data": train_data_formatted})
    test_dataset = Dataset.from_dict({"data": test_data_formatted})

    tokenized_train = train_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=train_dataset.column_names
    )
    tokenized_test = test_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=test_dataset.column_names
    )

    return tokenized_train, tokenized_test
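
`tokenize_function` joins the user and assistant messages with a single space, so the model never sees the role markers that an instruct model is normally trained with. One alternative, sketched below, is to render each conversation with the tokenizer's own chat template and to build inference prompts the same way. The helper names `render_training_text` and `render_prompt` are hypothetical; `apply_chat_template` is the tokenizer method they rely on. This is not what the run in this notebook used.

```python
def render_training_text(messages, tokenizer):
    """Render a full user/assistant exchange with the model's chat template."""
    return tokenizer.apply_chat_template(messages, tokenize=False)


def render_prompt(question, tokenizer):
    """Render a single user turn and open the assistant turn for generation."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

Inside `tokenize_function`, the list comprehension could then call `render_training_text(example['messages'], tokenizer)` instead of `" ".join(...)`, and the generation cell could tokenize `render_prompt(input_text, tokenizer)` so that training and inference share one format.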


## Load and preprocess dataset

In [None]:


train_dataset, test_dataset = prepare_datasets("combined_UnitOps_Training_ZAR.jsonl", tokenizer)


Map:   0%|          | 0/4370 [00:00<?, ? examples/s]

Map:   0%|          | 0/1873 [00:00<?, ? examples/s]

## Training arguments

In [None]:


training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=1,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    fp16=True,
    fp16_full_eval=True,
    max_grad_norm=0.3,
    report_to=["wandb"],
)
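
The `warmup_ratio` and `lr_scheduler_type="cosine"` settings together define the learning-rate curve. The sketch below reimplements that curve in plain Python to show its shape. It assumes 680 total optimizer steps (the `global_step` reported by `TrainOutput` for this run), a warmup length rounded up from `warmup_ratio * total_steps`, and a half-cosine decay to zero. It is an illustration, not the library's scheduler code.

```python
import math

BASE_LR = 1e-4
WARMUP_RATIO = 0.03
TOTAL_STEPS = 680  # global_step reported by TrainOutput for this run

warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup to BASE_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

The curve peaks at the end of warmup and falls to zero at the final step, which agrees with the `train/learning_rate` value of 0.0 in the W&B run summary.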




## Initialize Trainer and train the model

In [None]:


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train the model
trainer.train()
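
With `mlm=False`, `DataCollatorForLanguageModeling` builds the labels by copying `input_ids` and replacing every pad position with -100, the index the loss ignores. Because the tokenizer's pad token was set to the EOS token earlier, any EOS token in a sequence is masked as well. The plain-Python sketch below illustrates that effect with made-up token ids; `causal_lm_labels` is a hypothetical helper, not the collator's actual code.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking every position that holds the pad id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token ids: 2 plays the role of both EOS and PAD.
EOS = PAD = 2
sequence = [101, 57, 902, EOS, PAD, PAD]
labels = causal_lm_labels(sequence, PAD)
print(labels)  # [101, 57, 902, -100, -100, -100]
```

The EOS position contributes nothing to the loss here, so a model trained this way gets no signal about where a response should end. That may be one reason generations run on past the answer.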




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.2429 | 1.197741 |
| 200 | 1.1102 | 1.168833 |
| 300 | 1.0495 | 1.161945 |
| 400 | 1.1005 | 1.152431 |
| 500 | 1.0706 | 1.155998 |
| 600 | 1.0478 | 1.157609 |
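
The validation losses logged above can be turned into perplexities and used to find the best checkpoint. Since `load_best_model_at_end=True` is combined with `metric_for_best_model="loss"`, the Trainer is expected to reload the checkpoint with the lowest validation loss once training ends. A minimal sketch using only the logged numbers:

```python
import math

# Validation loss per evaluation step, copied from the training log.
eval_loss = {
    100: 1.197741,
    200: 1.168833,
    300: 1.161945,
    400: 1.152431,
    500: 1.155998,
    600: 1.157609,
}

best_step = min(eval_loss, key=eval_loss.get)
perplexity = {step: math.exp(loss) for step, loss in eval_loss.items()}

print(f"best step: {best_step}")
for step, ppl in perplexity.items():
    print(f"step {step}: perplexity = {ppl:.3f}")
```

Validation loss bottoms out at step 400 and rises slightly afterwards, so the later steps in this run do not appear to improve held-out performance.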




TrainOutput(global_step=680, training_loss=1.1411159681923249, metrics={'train_runtime': 23050.5981, 'train_samples_per_second': 0.948, 'train_steps_per_second': 0.03, 'total_flos': 9.97104867883352e+17, 'train_loss': 1.1411159681923249, 'epoch': 4.977127172918573})
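
The `global_step=680` and `epoch` of about 4.977 in `TrainOutput` follow from the dataset size and batch settings. The arithmetic below assumes a single GPU, 4370 training examples (the size shown by the first `Map` progress line), and a final partial batch that is kept rather than dropped. Leftover batches that do not fill a complete accumulation window are assumed not to trigger an extra update, which is how the library version used here appears to behave.

```python
import math

N_TRAIN = 4370          # training examples after the 70/30 split
PER_DEVICE_BATCH = 4    # per_device_train_batch_size
GRAD_ACCUM = 8          # gradient_accumulation_steps
EPOCHS = 5              # num_train_epochs

batches_per_epoch = math.ceil(N_TRAIN / PER_DEVICE_BATCH)        # 1093
updates_per_epoch = batches_per_epoch // GRAD_ACCUM               # 136
total_updates = updates_per_epoch * EPOCHS                        # 680
reported_epoch = total_updates * GRAD_ACCUM / batches_per_epoch   # about 4.977

print(f"effective batch size: {PER_DEVICE_BATCH * GRAD_ACCUM}")
print(f"optimizer steps: {total_updates}, epochs completed: {reported_epoch:.4f}")
```

The five leftover batches per epoch (1093 minus 136 x 8) explain why the run stops just short of 5.0 epochs.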

## Saving the fine-tuned model and pushing it to Hugging Face

In [None]:
from huggingface_hub import login
login(token="<your-hf-token>")  # insert your key here; do not commit a real token

trainer.save_model(output_dir)

# Push the model to the Hugging Face Hub
model.push_to_hub("KunalRaghuvanshi/phi3_mini_qlora_chemical_eng")

# End wandb run
wandb.finish()


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


adapter_model.safetensors: 31.5M

Run history:
eval/loss,█▄▂▁▂▂
eval/runtime,▁▇█▅▁▁
eval/samples_per_second,█▂▁▄██
eval/steps_per_second,█▄▁▄██
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇█
train/global_step,▁▁▁▁▁▂▂▃▃▃▃▃▃▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇███
train/grad_norm,█▅▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▃▃▄▃▃▃▃▄▄▃▄▄▄▃▄▄▄▄▄▄
train/learning_rate,▄███████▇▇▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁
train/loss,▇█▃▄▅▃▄▄▄▃▃▃▂▃▃▃▃▂▂▃▂▃▃▂▂▂▂▃▂▃▁▂▂▂▂▂▂▂▃▁

Run summary:
eval/loss,1.15761
eval/runtime,523.5816
eval/samples_per_second,3.577
eval/steps_per_second,0.896
total_flos,9.97104867883352e+17
train/epoch,4.97713
train/global_step,680.0
train/grad_norm,0.66325
train/learning_rate,0.0
train/loss,1.1396


## Example of generating text with the fine-tuned model

In [None]:
input_text = "What is gauss law"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
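    # do_sample=True is needed for temperature/top_p to take effect; otherwise generate() decodes greedily
    outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)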
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What is gauss law and how is it applied in the context of fluid flow in pipes? Gauss law, in the context of fluid flow in pipes, is applied to understand how pressure and velocity change along the length of a pipe. It states that the change in pressure and velocity across a pipe section is related to the pressure and velocity at the entrance and exit of that section. This principle is crucial for engineers to design efficient piping systems, ensuring that fluids are transported effectively from one point to another. By applying gauss law, engineers can predict how changes in pipe diameter, length, or roughness affect the flow characteristics, which is essential for optimizing the design and operation of piping systems in chemical engineering.

2. How does the concept of gauss law extend to the analysis of fluid flow in pipes? The concept of gauss law extends to the analysis of fluid flow in pipes by providing a mathematical framework to calculate the pressure and velocity changes along