# **Project Name**    - IndustryGPT: Specialized LLM Bot Using Pre-Trained Models


##### **Project Type**    - Deep Learning for NLP
##### **Contribution**    - Individual
##### **Name -** Ayush Bhagat

# **Project Summary -**

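A domain-specific chatbot for the finance industry was developed by fine-tuning the TinyLlama-1.1B-Chat-v1.0 model with QLoRA (4-bit NF4 quantization plus LoRA adapters) on a 5,000-example sample drawn from two Alpaca-style finance instruction datasets. The aim is to answer queries about budgeting, investments, and financial planning more accurately and relevantly than a general-purpose model.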

# **GitHub Link -**

https://github.com/Ayushx29/IndustryGPT-Specialized-LLM-Bot-Using-Pre-Trained-Models

# **Problem Statement**


Most general-purpose chatbots struggle to provide accurate and context-aware responses in specialized fields like finance. This leads to vague or incorrect guidance on important topics such as budgeting, investing, and financial planning. There is a growing demand for compact, efficient, and domain-adapted language models that can deliver reliable financial insights while being resource-friendly and easy to integrate into user-facing applications.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following points are answered for each algorithm.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

In [None]:
!pip install -q transformers datasets peft trl bitsandbytes accelerate
!pip install -q huggingface_hub

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

2025-06-21 08:18:01.415938: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750493881.615023     145 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750493881.673717     145 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [None]:
torch.cuda.is_available()

True

In [None]:
from datasets import load_dataset, DatasetDict, concatenate_datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments


# Load dataset
dataset1 = load_dataset("poornima9348/finance-alpaca-1k-test")
dataset2 = load_dataset("ssbuild/alpaca_finance_en")

In [None]:
dataset1

DatasetDict({
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 1000
    })
})

In [None]:
dataset2

DatasetDict({
    train: Dataset({
        features: ['id', 'instruction', 'input', 'output'],
        num_rows: 68912
    })
})

In [None]:
# Combine 'instruction' and 'output' columns into a new 'text' column
def combine_text_columns(example):
    return {'text': f"{example['instruction']} ### {example['output']}"}

# Apply the function to each example in the dataset
dataset1 = dataset1.map(combine_text_columns)
dataset2 = dataset2.map(combine_text_columns)

# Remove 'instruction', 'input' and 'output' columns
dataset1['test']=dataset1['test'].remove_columns(['instruction','input', 'output'])
dataset2['train']=dataset2['train'].remove_columns(['instruction','input', 'output','id'])
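The cell above joins only `instruction` and `output`, so any text in the `input` column is dropped before training. A minimal sketch of a variant that keeps it is shown below; the helper name `combine_with_input` and the sample record are hypothetical, and this is not what the notebook actually ran.

```python
def combine_with_input(example, sep=" ### "):
    """Join instruction, optional input, and output into one training string."""
    prompt = example["instruction"].strip()
    # Alpaca-style records may carry extra context in 'input'; append it if present
    extra = (example.get("input") or "").strip()
    if extra:
        prompt = f"{prompt}\n{extra}"
    return {"text": f"{prompt}{sep}{example['output'].strip()}"}


# Hypothetical record, for illustration only
record = {
    "instruction": "Classify the expense.",
    "input": "Monthly rent",
    "output": "Fixed cost",
}
combined = combine_with_input(record)
print(combined["text"])
```

A function like this could be passed to `dataset.map(...)` in place of `combine_text_columns`; records with an empty `input` come out identical to the original format.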

In [None]:
# Perform the train-test split on the necessary dataset if required
split_dataset1 = dataset1['test'].train_test_split(train_size=0.8)
split_dataset2 = dataset2['train'].train_test_split(test_size=0.2)

In [None]:
split_dataset1

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text'],
        num_rows: 200
    })
})

In [None]:
split_dataset2

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 55129
    })
    test: Dataset({
        features: ['text'],
        num_rows: 13783
    })
})

In [None]:
# Concatenate the datasets
merged_train = concatenate_datasets([split_dataset1['train'], split_dataset2['train']])
merged_test = concatenate_datasets([split_dataset1['test'], split_dataset2['test']])

# Create a new DatasetDict with the merged datasets
merged_dataset = DatasetDict({
    'train': merged_train,
    'test': merged_test
})

# Filter out None values in case some splits are missing
merged_dataset = DatasetDict({k: v for k, v in merged_dataset.items() if v is not None})

# Print the merged dataset to verify
print(merged_dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 55929
    })
    test: Dataset({
        features: ['text'],
        num_rows: 13983
    })
})
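The row counts printed above can be checked by hand. The sketch below assumes an 80/20 split in which the test share is rounded up; that rounding rule is an assumption chosen because it reproduces the printed counts, not something taken from the `datasets` documentation. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_percent):
    """Return (train_rows, test_rows), rounding the test share up."""
    n_test = math.ceil(n_rows * test_percent / 100)
    return n_rows - n_test, n_test


train1, test1 = split_sizes(1000, 20)    # poornima9348/finance-alpaca-1k-test
train2, test2 = split_sizes(68912, 20)   # ssbuild/alpaca_finance_en

merged_train_rows = train1 + train2
merged_test_rows = test1 + test2
print(merged_train_rows, merged_test_rows)

# Share of each merged split that the later cells actually sample
train_used = 5000 / merged_train_rows
test_used = 100 / merged_test_rows
print(f"{train_used:.1%} of train, {test_used:.1%} of test")
```

The second print makes the sampling explicit: the notebook later trains on 5,000 rows and samples 100 test rows, a small share of what was merged.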


In [None]:
# Shuffle the dataset and slice it
merged_train_dataset = merged_dataset['train'].shuffle(seed=42).select(range(5000))

def transform_conversation(example):
    """Rewrite 'prompt ### answer' text into the <s>[INST] ... [/INST] ... </s> template."""
    # Split on the first separator only, so a '###' inside the answer is kept
    # as part of the answer instead of being silently dropped
    prompt, _, answer = example['text'].partition('###')
    return {'text': f'<s>[INST] {prompt.strip()} [/INST] {answer.strip()} </s>'}

# Apply the transformation
transformed_dataset = merged_train_dataset.map(transform_conversation)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
# Shuffle the test split and keep a 100-example sample
merged_test_dataset = merged_dataset['test'].shuffle(seed=42).select(range(100))

# Reuse transform_conversation from the previous cell instead of redefining it
transformed_test_dataset = merged_test_dataset.map(transform_conversation)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
merged_test_dataset['text'][0]

'You are given a list of words, sort them alphabetically ### ["ant", "bat", "cat", "dog", "monkey"]'
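To make the template concrete, here is a standalone sketch that applies the same `[INST]` format to the sample printed above. The helper `to_inst_format` is a hypothetical stand-in written for this illustration; it splits on the first `###` only, which is an assumption about the intended behaviour when an answer itself contains the separator.

```python
def to_inst_format(text, sep="###"):
    """Turn 'prompt ### answer' into '<s>[INST] prompt [/INST] answer </s>'."""
    # partition() splits on the first separator and keeps the rest intact
    prompt, _, answer = text.partition(sep)
    return f"<s>[INST] {prompt.strip()} [/INST] {answer.strip()} </s>"


sample = 'You are given a list of words, sort them alphabetically ### ["ant", "bat", "cat", "dog", "monkey"]'
formatted = to_inst_format(sample)
print(formatted)

# Edge case: the answer contains the separator characters
tricky = to_inst_format("Explain headings ### Use ### for a level-3 heading")
print(tricky)
```

Note that the sample itself is a generic sorting task rather than a finance question, so the datasets are not purely domain-specific.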

In [None]:
import os
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq
)
from peft import LoraConfig
from trl import SFTTrainer

# -------------------- CONFIGURATION --------------------

# Model and output
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
new_model = "tinyllama-finance-chatbot-finetune"

# QLoRA parameters
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

# BitsAndBytes config
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

# Training config
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25
max_seq_length = 350

# -------------------- TOKENIZER & DATA --------------------

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

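def tokenize(example):
    encodings = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_seq_length,
        return_attention_mask=True,
    )
    # Labels mirror input_ids, except padded positions are set to -100 so the
    # loss ignores them (pad_token is eos_token, so padding would otherwise
    # be trained on as if it were real end-of-sequence tokens)
    labels = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
    ]
    return {
        "input_ids": encodings["input_ids"],
        "attention_mask": encodings["attention_mask"],
        "labels": labels,
    }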

# Assuming `transformed_dataset` is already defined
tokenized_dataset = transformed_dataset.map(
    tokenize,
    batched=True,
    remove_columns=transformed_dataset.column_names,
    load_from_cache_file=False
)

# Set torch format for compatibility
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print("✅ Tokenized keys:", tokenized_dataset[0].keys())

# -------------------- MEMORY MANAGEMENT --------------------

torch.cuda.empty_cache()
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.makedirs("/kaggle/working/offload", exist_ok=True)

# -------------------- LOAD MODEL --------------------

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    offload_folder="/kaggle/working/offload",
    offload_buffers=True
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# -------------------- PEFT CONFIG --------------------

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# -------------------- TRAINING ARGS --------------------

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# -------------------- DATA COLLATOR --------------------

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding="max_length",
    max_length=max_seq_length,
    return_tensors="pt"
)

# -------------------- SFT TRAINER --------------------

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    peft_config=peft_config,
    args=training_arguments,
    data_collator=data_collator
)

# -------------------- TRAIN --------------------

trainer.train()
print("🎉 Training complete!")


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

✅ Tokenized keys: dict_keys(['input_ids', 'attention_mask', 'labels'])


Truncating train dataset:   0%|          | 0/5000 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
25,3.4676
50,0.9139
75,0.6452
100,0.7781
125,0.7139
150,0.6836
175,0.5809
200,0.6354
225,0.7574
250,0.6875


🎉 Training complete!
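The `LoraConfig` in this cell does not set `target_modules`, so which layers receive adapters depends on PEFT's defaults for the architecture. As a rough sizing sketch, a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The layer shapes below (hidden size 2048, key/value projection width 256, 22 layers, adapters on the query and value projections only) are assumptions for illustration; the real figure should be read from `trainer.model.print_trainable_parameters()`.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


lora_r = 64          # from the notebook's configuration

# Assumed model shapes, not read from the notebook
hidden_size = 2048
kv_width = 256
num_layers = 22

per_layer = (
    lora_params(hidden_size, hidden_size, lora_r)   # query projection
    + lora_params(hidden_size, kv_width, lora_r)    # value projection
)
total_trainable = per_layer * num_layers
print(f"{total_trainable:,} adapter parameters")
```

Under these assumptions the adapters amount to under 1% of a 1.1B-parameter base model, which is why the saved `adapter_model.safetensors` is small compared with the full checkpoint.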


In [None]:
# Save trained model
trainer.model.save_pretrained(new_model)

In [None]:
# List the contents to ensure files are saved
print("Contents of new_model directory:", os.listdir(new_model))

Contents of new_model directory: ['adapter_config.json', 'adapter_model.safetensors', 'README.md']


In [None]:
# Silence transformers log messages below CRITICAL during generation
from transformers import pipeline
from transformers.utils import logging
logging.set_verbosity(logging.CRITICAL)


# Run text generation pipeline with our fine-tuned model
prompt = "Generate a title for a blog about the Nobel Prize ceremony."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Generate a title for a blog about the Nobel Prize ceremony. [/INST] The Nobel Prize Ceremony: A Tribute to Innovation 
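The pipeline returns the prompt together with the completion, as the output above shows. For a user-facing chatbot, only the text after `[/INST]` is usually wanted. A small sketch of that post-processing step follows; `extract_answer` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def extract_answer(generated_text, end_tag="[/INST]"):
    """Return only the model's reply from a '<s>[INST] ... [/INST] reply' string."""
    _, found, answer = generated_text.partition(end_tag)
    if not found:
        # No instruction tag present: fall back to the whole text
        answer = generated_text
    # Drop a trailing end-of-sequence marker if the model emitted one
    return answer.replace("</s>", "").strip()


raw = ("<s>[INST] Generate a title for a blog about the Nobel Prize ceremony. "
       "[/INST] The Nobel Prize Ceremony: A Tribute to Innovation ")
reply = extract_answer(raw)
print(reply)
```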


In [None]:
# TensorBoard extension (if applicable)
%load_ext tensorboard
%tensorboard --logdir results/runs

<IPython.core.display.Javascript object>

In [None]:
# Reload model and tokenizer (if needed) and merge LoRA
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import os
import shutil

# Define model_name and new_model
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
new_model = "tinyllama-finance-chatbot-finetune"

# Clear GPU memory
torch.cuda.empty_cache()

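# The adapter saved after training must already be on disk; creating an
# empty directory here would only hide the problem until PeftModel loads it
if not os.path.isdir(new_model):
    raise FileNotFoundError(f"LoRA adapter directory not found: {new_model}")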

# Define offload directory
offload_dir = "/kaggle/working/"
os.makedirs(offload_dir, exist_ok=True)

try:
    # Load base model and merge with LoRA
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder=offload_dir
    )

    model = PeftModel.from_pretrained(base_model, new_model)
    model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model.save_pretrained(new_model)
    tokenizer.save_pretrained(new_model)

    print("Contents of new_model directory:", os.listdir(new_model))

except RuntimeError as e:
    if "out of memory" in str(e):
        print("Out of memory error. Try using a smaller model or increasing GPU memory.")
        torch.cuda.empty_cache()
    else:
        raise e
except ValueError as e:
    print(f"ValueError: {e}")



Saving checkpoint shards:   0%|          | 0/1 [00:00<?, ?it/s]

Contents of new_model directory: ['tokenizer.json', 'adapter_config.json', 'special_tokens_map.json', 'tokenizer_config.json', 'adapter_model.safetensors', 'config.json', 'tokenizer.model', 'model.safetensors', 'generation_config.json', 'README.md']
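`merge_and_unload()` folds the adapter into the base weights so the saved model no longer needs PEFT at inference time. For standard LoRA, the merged weight is `W + (alpha / r) * B @ A`. The sketch below works this through on toy 2x2 matrices in plain Python; the numbers are hypothetical and the helpers `matmul` and `merge_lora` are written for this illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy values: a 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # shape (d_out, r)
A = [[3.0, 4.0]]      # shape (r, d_in)
merged = merge_lora(W, B, A, alpha=2, r=1)
print(merged)

# The merged weight reproduces the base path plus the scaled adapter path
x = [[1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
two_path = [[b[0] + 2 * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == two_path)
```

With the notebook's settings (`lora_alpha = 16`, `lora_r = 64`) the scale factor would be 0.25 rather than the 2 used in this toy case.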


In [None]:
# Hugging Face Hub authentication
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

# ✅ Retrieve token using your actual secret name
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("llm_training")

# Login and push to hub
login(token=hf_token)

model_repo_name = "Ayushx29/finance_finetune_model"
tokenizer_repo_name = "Ayushx29/finance_finetune_model"

# Pass the token via `token`; `use_auth_token` is deprecated in recent
# transformers releases
model.push_to_hub(model_repo_name, token=hf_token)
tokenizer.push_to_hub(tokenizer_repo_name, token=hf_token)



Saving checkpoint shards:   0%|          | 0/1 [00:00<?, ?it/s]

Uploading...:   0%|          | 0.00/2.20G [00:00<?, ?B/s]



README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Ayushx29/finance_finetune_model/commit/7359821bd1cb6f50c12fb48733784a5be4b05035', commit_message='Upload tokenizer', commit_description='', oid='7359821bd1cb6f50c12fb48733784a5be4b05035', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Ayushx29/finance_finetune_model', endpoint='https://huggingface.co', repo_type='model', repo_id='Ayushx29/finance_finetune_model'), pr_revision=None, pr_num=None)

In [None]:
# Clear GPU memory
torch.cuda.empty_cache()

In [None]:
# Load model back from hub (for inference/verification)
fine_tuned_finance_model = AutoModelForCausalLM.from_pretrained("Ayushx29/finance_finetune_model")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("Ayushx29/finance_finetune_model", trust_remote_code=True)

config.json:   0%|          | 0.00/674 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.62M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

In [None]:
import os
import shutil

# Define the source (output) directory and the new target directory
source_dir = "/kaggle/working/tinyllama-finance-chatbot-finetune"
target_dir = "/kaggle/working/new_dir"
os.makedirs(target_dir, exist_ok=True)

files_to_copy = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "adapter_model.safetensors",
    "generation_config.json",
    "tokenizer.model",
    "special_tokens_map.json"
]
for file_name in files_to_copy:
    src = os.path.join(source_dir, file_name)
    dst = os.path.join(target_dir, file_name)
    if os.path.exists(src):
        shutil.copy2(src, dst)
        print(f"Copied {file_name} to {target_dir}")
    else:
        print(f"{file_name} not found in {source_dir}")

print("✅ File copying complete.")

Copied config.json to /kaggle/working/new_dir
Copied tokenizer.json to /kaggle/working/new_dir
Copied tokenizer_config.json to /kaggle/working/new_dir
Copied adapter_model.safetensors to /kaggle/working/new_dir
Copied generation_config.json to /kaggle/working/new_dir
Copied tokenizer.model to /kaggle/working/new_dir
Copied special_tokens_map.json to /kaggle/working/new_dir
✅ File copying complete.
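The copy loop above reports missing files only through `print`, which is easy to overlook when the notebook runs unattended. One possible refactor, sketched below, returns the copied and missing names so a later cell can act on them. The function `copy_existing` is hypothetical, and the demo uses a temporary directory instead of the Kaggle paths.

```python
import shutil
import tempfile
from pathlib import Path


def copy_existing(source_dir, target_dir, file_names):
    """Copy the named files that exist; return (copied, missing) name lists."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in file_names:
        src = source / name
        if src.is_file():
            shutil.copy2(src, target / name)   # copy2 also preserves timestamps
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


# Demo on throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "src"
    src_dir.mkdir()
    (src_dir / "config.json").write_text("{}")
    copied, missing = copy_existing(
        src_dir, Path(tmp) / "dst", ["config.json", "tokenizer.model"]
    )
    target_has_config = (Path(tmp) / "dst" / "config.json").is_file()

print("copied:", copied)
print("missing:", missing)
```

A caller could then raise an error when `missing` is non-empty, so an incomplete export is caught before the files are used.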
