<a href="https://colab.research.google.com/github/Saithurubilli/Deep-Learning-for-NLP/blob/main/Capstone_Project_CHAT_BOT_LLM_Deep_Learning_for_NLP_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Capstone Project: Deep Learning for NLP

CONTRIBUTION - INDIVIDUAL

GITHUB -

Project brief: build an Industry-Specific Large Language Model (LLM) bot using pre-trained models from sources such as Hugging Face. The objective is a bot that engages with users by answering questions and providing insights specific to a chosen industry. The project is meant to build technical skills as well as an understanding of the chosen industry's nuances, challenges, and trends.

In this project, we developed an industry-specific chatbot for the retail banking sector using a fine-tuned large language model (LLM). The primary goal was to create an AI assistant capable of answering real customer queries about banking services such as account balance checks, card activation, ATM location, loan applications, and more.

We used the publicly available Bitext Retail Banking Chatbot Dataset, which pairs banking prompts with professional, human-like responses. After cleaning, it yields 25,545 instruction/response pairs (22,990 for training and 2,555 held out for evaluation). We fine-tuned Google's FLAN-T5-small on these cleaned Q&A pairs and deployed the result through a Gradio web interface for user-friendly interaction.

This chatbot demonstrates how LLMs can be effectively adapted to specialized industries like banking, delivering fast, informative, and polite responses — similar to a live customer support agent.



In [None]:
%pip install transformers datasets scikit-learn gradio



In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("/content/bitext-retail-banking-llm-chatbot-training-dataset.csv")

# Keep only needed columns and drop missing values
df_clean = df[['instruction', 'response']].dropna()
df_clean = df_clean[df_clean['instruction'].str.strip() != ""]
df_clean = df_clean[df_clean['response'].str.strip() != ""]
df_clean = df_clean.drop_duplicates(subset=['instruction', 'response'])

# Rename for model input/output
df_clean = df_clean.rename(columns={"instruction": "input", "response": "output"})

# Preview
df_clean.sample(3)


index,input,output
21164,"I want to get a password, where do I do it?",I'm here to assist you in getting a password a...
18609,i have to dispute a withdrawal could ya help me,"Certainly, I can help you with that! I underst..."
2006,"I want to take out a loan, will you help me?",I'd be delighted to assist you with taking out...


In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_clean, test_size=0.1, random_state=42)
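
As a sanity check on the split, the arithmetic behind `test_size=0.1` can be reproduced by hand. The sketch below assumes 25,545 cleaned rows (the sum of the 22,990 and 2,555 examples reported by the tokenization step) and assumes that scikit-learn rounds the test share up and gives the remainder to training. The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and training gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 25,545 is assumed from the example counts in the notebook's own output.
n_train, n_test = split_sizes(25545, 0.1)
print(n_train, n_test)
```

Under these assumptions the result agrees with the example counts shown in the tokenization progress output.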


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
from datasets import Dataset

# Convert to Hugging Face format
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenization function
def tokenize(example):
    input_enc = tokenizer(example["input"], padding="max_length", truncation=True, max_length=128)
    target_enc = tokenizer(example["output"], padding="max_length", truncation=True, max_length=128)
    input_enc["labels"] = target_enc["input_ids"]
    return input_enc

# Tokenize datasets
train_tokenized = train_dataset.map(tokenize)
test_tokenized = test_dataset.map(tokenize)


Map:   0%|          | 0/22990 [00:00<?, ? examples/s]

Map:   0%|          | 0/2555 [00:00<?, ? examples/s]
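
The tokenization cell pads every target to 128 tokens and copies the padded ids directly into `labels`, so the loss is also computed on padding positions. A common alternative is to replace padding ids in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows only that masking step in plain Python. The helper `mask_padding` is hypothetical, the token ids are made up for illustration, and a pad id of `0` is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Illustrative ids only: three content tokens, an end-of-sequence id, then padding.
padded_labels = [363, 19, 82, 1, 0, 0, 0]
masked_labels = mask_padding(padded_labels, pad_token_id=0)
print(masked_labels)
```

Inside `tokenize`, this would be applied to `target_enc["input_ids"]` before assigning `labels`. Doing so changes what the loss measures, so the loss values reported in this notebook would not be directly comparable to a run with masking.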

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./banking_bot",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    num_train_epochs=3,
    logging_dir='./logs',
    evaluation_strategy="epoch"
)

collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    tokenizer=tokenizer,
    data_collator=collator
)

trainer.train()




Epoch,Training Loss,Validation Loss
1,1.0446,0.85381
2,0.9215,0.779876
3,0.8951,0.76184
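
One way to read the validation losses above is to convert them to perplexity, the exponential of the mean cross-entropy. This is a rough sketch: the losses are copied from the table, and because the labels in this run include padding positions, the resulting numbers are averaged over padded positions too and should not be compared with perplexities reported elsewhere.

```python
import math

# Validation loss per epoch, copied from the training table above.
eval_losses = {1: 0.85381, 2: 0.779876, 3: 0.76184}

# Perplexity is exp(mean cross-entropy loss).
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity ~ {ppl:.3f}")
```

The value falls each epoch, mirroring the drop in validation loss.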


TrainOutput(global_step=17244, training_loss=1.0451939254757203, metrics={'train_runtime': 2451.9731, 'train_samples_per_second': 28.128, 'train_steps_per_second': 7.033, 'total_flos': 3205217027358720.0, 'train_loss': 1.0451939254757203, 'epoch': 3.0})
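
The `global_step` and throughput figures in `TrainOutput` can be cross-checked from the run configuration. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in the training arguments), with 22,990 training examples, a batch size of 4, and 3 epochs.

```python
import math

train_examples = 22990       # from the tokenization output
batch_size = 4               # per_device_train_batch_size
epochs = 3                   # num_train_epochs
train_runtime_s = 2451.9731  # train_runtime from TrainOutput

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime_s

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

Both derived values agree with the reported `global_step=17244` and `train_samples_per_second` of 28.128.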

In [None]:
# Report the final training and validation loss, then re-evaluate the held-out split.
# Seq2SeqTrainer reports loss by default, not accuracy.

# Return the most recent training loss and evaluation loss from the Trainer logs
def get_metrics(log_history):
    train_loss = None
    eval_loss = None
    # Later log entries overwrite earlier ones, so the last logged values win
    for log in log_history:
        if 'loss' in log:
            train_loss = log['loss']
        if 'eval_loss' in log:
            eval_loss = log['eval_loss']
    return train_loss, eval_loss

train_loss, eval_loss = get_metrics(trainer.state.log_history)

print(f"Training Loss (last logged): {train_loss}")
print(f"Validation Loss (last epoch): {eval_loss}")

# Evaluate on test_tokenized. This is the same split that was passed to the
# Trainer as eval_dataset, so the loss matches the last-epoch validation loss.
# An independent test score would need a third, untouched split.
eval_results = trainer.evaluate(test_tokenized)
print(f"Test Evaluation Results: {eval_results}")

# Note: seq2seq quality is usually not summarized as classification-style accuracy.
# Overlap metrics such as ROUGE or BLEU are more common. To report them, pass a
# compute_metrics function to the Trainer or score generated outputs manually.



Training Loss (last logged): 0.8951
Validation Loss (last epoch): 0.761839747428894


Test Evaluation Results: {'eval_loss': 0.761839747428894, 'eval_runtime': 20.6817, 'eval_samples_per_second': 123.539, 'eval_steps_per_second': 30.897, 'epoch': 3.0}
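
Since the evaluation above reports only loss, a text-overlap score is one way to get an accuracy-like number for generated answers. The sketch below is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation: it lowercases and splits on whitespace only. The function name `unigram_f1` and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration; real use would compare chat(prompt)
# against the dataset's reference output.
score = unigram_f1("please activate your card online",
                   "activate your card in the app")
print(round(score, 3))
```

Averaging this score over the held-out examples would give a single number to track alongside the loss, though an established metric library is the safer choice for reported results.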


In [None]:
model.save_pretrained("banking-chatbot")
tokenizer.save_pretrained("banking-chatbot")


('banking-chatbot/tokenizer_config.json',
 'banking-chatbot/special_tokens_map.json',
 'banking-chatbot/spiece.model',
 'banking-chatbot/added_tokens.json',
 'banking-chatbot/tokenizer.json')

In [None]:
import torch

def chat(prompt):
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize and move tensors to the same device as model
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate response
    with torch.no_grad():
        output = model.generate(**inputs, max_length=100)

    return tokenizer.decode(output[0], skip_special_tokens=True)


In [None]:
chat("How can I activate my debit card?")


'I\'m here to assist you with activating your debit card. Activating your card is a simple process. Here\'s what you need to do: 1. Visit our website at Company Website URL. 2. Look for the "Activate Card" or "Card Activation" option. 3. Click on it to access the activation instructions. 4. You will be prompted to enter your card details, such as the card number, expiration date'

In [None]:
import gradio as gr

interface = gr.Interface(
    fn=chat,
    inputs=gr.Textbox(label="Your banking question"),
    outputs=gr.Textbox(label="Chatbot response"),
    title="Retail Banking Chatbot"
)

interface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://4f0cc5649f4a914c36.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


