# Project: BERT for Banking FAQ Classification
This project fine-tunes BERT (Bidirectional Encoder Representations from Transformers) to classify questions from a banking FAQ dataset into predefined categories such as 'accounts' and 'loans'.

**Why this matters:**
- Enhances customer support automation in banking
- Enables quick and accurate routing of queries
- Applies a pre-trained transformer model, fine-tuned on domain-specific questions

### Import Required Libraries
We begin by importing the `pandas` library to work with structured data.

In [1]:
import pandas as pd

### Load and Display Dataset
The dataset containing banking FAQs is loaded using `pandas.read_csv()`. We use `.head()` to preview the first few entries.

In [2]:
bank = pd.read_csv("BankFAQs.csv")  # Load CSV data into a DataFrame
bank.head()                         # Preview the first five rows

| Unnamed: 0 | Question | Answer | Class |
|---|---|---|---|
| 0 | What are the documents required for opening a ... | Following documents are required to open a Cur... | accounts |
| 1 | Can I transfer my Current Account from one bra... | Yes, Current Accounts can be transferred from ... | accounts |
| 2 | My present status is NRI. What extra documents... | NRI/PIO can open the proprietorship/partnershi... | accounts |
| 3 | What are the documents required for opening a ... | Following documents are required for opening a... | accounts |
| 4 | What documents are required to change the addr... | Following documents are required to change the... | accounts |


### Import Label Encoder
The `LabelEncoder` from scikit-learn is imported to convert categorical labels (e.g., 'accounts') into numerical form for model compatibility.

In [3]:
from sklearn.preprocessing import LabelEncoder

In [4]:
label_encoder = LabelEncoder()

In [5]:
bank["label"] = label_encoder.fit_transform(bank["Class"])
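
To make the encoding concrete, the following plain-Python sketch mimics the mapping that `LabelEncoder` builds: the unique class names are sorted, and each one receives its position as an integer ID. The class names used here are hypothetical stand-ins, not the dataset's actual category list.

```python
# Hypothetical class column; in the notebook the real values come from bank["Class"].
classes_column = ["loans", "accounts", "security", "accounts", "loans"]

# LabelEncoder sorts the unique labels and uses each label's position as its ID.
sorted_classes = sorted(set(classes_column))
class_to_id = {name: idx for idx, name in enumerate(sorted_classes)}

# Equivalent of fit_transform: map every row to its integer ID.
encoded = [class_to_id[name] for name in classes_column]

# Equivalent of inverse_transform: map IDs back to class names.
decoded = [sorted_classes[idx] for idx in encoded]

print(class_to_id)  # {'accounts': 0, 'loans': 1, 'security': 2}
print(encoded)      # [1, 0, 2, 0, 1]
```

Because the IDs follow alphabetical order, the same class names always map to the same integers, which is what lets `inverse_transform` recover the label at inference time.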

### Train-Test Split of Dataset

We split the dataset into training and validation sets using an 80/20 ratio.  
- `stratify=bank["label"]` ensures that the distribution of classes remains consistent across train and validation sets (important for classification problems).  
- `random_state=42` ensures reproducibility of the split.


In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train_text, val_text, train_label, val_label = train_test_split(
    bank["Question"].tolist(),  # Input text data
    bank["label"].tolist(),     # Corresponding target labels
    test_size=0.2,              # 20% of data used for validation
    stratify=bank["label"],     # Maintain class distribution in split
    random_state=42             # Reproducible results
)
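
One way to check that stratification behaved as intended is to compare class proportions in the two splits. The helper below is a minimal sketch; `class_proportions` and the toy label lists are hypothetical names introduced for illustration.

```python
from collections import Counter

def class_proportions(labels):
    """Return each class's share of the given label list, rounded to 3 places."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(n / total, 3) for cls, n in sorted(counts.items())}

# Hypothetical encoded labels standing in for train_label and val_label.
toy_train = [0] * 8 + [1] * 4 + [2] * 4
toy_val = [0] * 2 + [1] * 1 + [2] * 1

print(class_proportions(toy_train))  # {0: 0.5, 1: 0.25, 2: 0.25}
print(class_proportions(toy_val))    # {0: 0.5, 1: 0.25, 2: 0.25}
```

In the notebook, `class_proportions(train_label)` and `class_proportions(val_label)` should come out nearly identical if the stratified split worked.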


### Tokenization Using BERT Tokenizer

We use the `BertTokenizer` from Hugging Face Transformers to tokenize the text data.  
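- `bert-base-uncased` is a pre-trained BERT checkpoint whose tokenizer lowercases the input text.
- Tokenization converts raw text into token IDs that the BERT model can understand.
- We apply truncation and padding so that all sequences in a call end up the same length: `truncation=True` with `max_length=128` caps each sequence at 128 tokens, and `padding=True` pads shorter sequences to the longest one in that call. Uniform length is essential for batch processing.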


In [8]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [9]:
train_encodings = tokenizer(train_text, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_text, truncation=True, padding=True, max_length=128)
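
The effect of `truncation=True` and `padding=True` can be illustrated without the tokenizer. The sketch below is a simplified plain-Python imitation, not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, and unlike the real tokenizer it does not preserve the closing `[SEP]` token when it truncates. The IDs 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in `bert-base-uncased`; the remaining IDs are arbitrary.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized questions of different lengths.
batch = [[101, 2054, 102], [101, 2129, 2079, 1045, 102]]

padded = pad_and_truncate(batch, max_length=8)   # nothing exceeds 8: pad only
print(padded["input_ids"])       # [[101, 2054, 102, 0, 0], [101, 2129, 2079, 1045, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

clipped = pad_and_truncate(batch, max_length=4)  # second sequence is cut to 4 tokens
print(clipped["input_ids"])      # [[101, 2054, 102, 0], [101, 2129, 2079, 1045]]
```

Note that the padded length is that of the longest sequence in the call, so `train_encodings` and `val_encodings` may be padded to different lengths.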

### Creating a Custom PyTorch Dataset for BERT

We define a custom `Dataset` class to wrap the tokenized inputs and labels so they can be used with PyTorch's `DataLoader`.

- `__init__`: Stores the input encodings and labels.
- `__len__`: Returns the number of samples (required by PyTorch).
- `__getitem__`: Retrieves one data sample as a dictionary of tensors (the tokenizer outputs such as `input_ids` and `attention_mask`, plus `labels`), which is the format the model and the `Trainer` expect.

Finally, we create `train_dataset` and `val_dataset` objects for training and validation respectively.


In [11]:
import torch

# Define a custom PyTorch Dataset for BERT
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # Tokenized inputs (input_ids, attention_mask)
        self.labels = labels        # Corresponding class labels

    def __len__(self):
        return len(self.labels)     # Total number of samples

    def __getitem__(self, idx):
        # Create a dictionary with input tensors for a single sample
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])  # Add label tensor
        return item

# Create dataset objects for training and validation
train_dataset = Dataset(train_encodings, train_label)
val_dataset = Dataset(val_encodings, val_label)


### Load Pretrained BERT Model for Sequence Classification

We load a pre-trained BERT model (`bert-base-uncased`) that we will fine-tune for our multi-class classification task.

- `BertForSequenceClassification` is a wrapper over BERT that adds a classification head (a linear layer) on top.
- `num_labels` is set to the number of unique classes in our target variable.
- This prepares the model to predict one of the N classes given a banking-related question.


In [12]:
from transformers import BertForSequenceClassification

# Determine the number of target classes from the label encoder
num_labels = len(label_encoder.classes_)

# Load pre-trained BERT model with a classification head on top
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',  # Base BERT model (lowercased)
    num_labels=num_labels # Output layer adjusted to number of classes
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
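
The warning above is expected: the classification head is new, so it has no pre-trained weights to load. As a rough sketch of what that head computes (the notation is introduced here for illustration and is not part of the notebook), let $h \in \mathbb{R}^{768}$ be BERT's pooled `[CLS]` representation, $K$ the number of classes, and $y$ the true class of a question. Dropout, which is applied to $h$ during training, is omitted.

$$
z = W h + b, \qquad W \in \mathbb{R}^{K \times 768},\; b \in \mathbb{R}^{K}
$$

$$
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad \mathcal{L} = -\log p_y
$$

Here $W$ and $b$ correspond to `classifier.weight` and `classifier.bias` from the warning, $z$ holds the logits, and $\mathcal{L}$ is the cross-entropy loss minimized during fine-tuning.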


### Define Training Arguments for BERT Fine-Tuning

We use `TrainingArguments` from Hugging Face to configure how the model will be trained:

- `output_dir`: Where the model checkpoints and predictions will be saved.
- `evaluation_strategy="epoch"`: Evaluate the model at the end of each epoch (recent `transformers` releases name this argument `eval_strategy`).
- `per_device_train_batch_size` & `per_device_eval_batch_size`: Batch sizes during training and evaluation.
- `num_train_epochs`: Total number of passes through the training data.
- `weight_decay`: Regularization to reduce overfitting.
- `logging_dir`: Directory to save training logs.
- `logging_steps`: How often to log training progress.


In [13]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',                 # Directory to save model checkpoints
    evaluation_strategy="epoch",           # Evaluate after each epoch
    per_device_train_batch_size=16,        # Batch size for training
    per_device_eval_batch_size=16,         # Batch size for evaluation
    num_train_epochs=4,                    # Number of training epochs
    weight_decay=0.01,                     # Weight decay applied by the AdamW optimizer
    logging_dir='./logs',                  # Directory to save logs
    logging_steps=10                       # Log training metrics every 10 steps
)







### Initialize the Hugging Face Trainer

We use the `Trainer` class to manage the full training and evaluation pipeline:

- `model`: The BERT model for sequence classification.
- `args`: Training arguments we defined earlier (batch size, epochs, logging, etc.).
- `train_dataset`: Training data in the format expected by the model.
- `eval_dataset`: Validation data used to evaluate the model after each epoch.

The `Trainer` handles optimization, backpropagation, evaluation, and logging internally.


In [14]:
from transformers import Trainer

trainer = Trainer(
    model=model,                   # Pretrained BERT model with classification head
    args=training_args,            # Training configuration (epochs, batch size, etc.)
    train_dataset=train_dataset,   # Tokenized and labeled training data
    eval_dataset=val_dataset       # Tokenized and labeled validation data
)


In [15]:
trainer.train()


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.541 | 0.461093 |
| 2 | 0.3076 | 0.293099 |
| 3 | 0.0485 | 0.271037 |
| 4 | 0.0129 | 0.274052 |


TrainOutput(global_step=356, training_loss=0.35204597361636963, metrics={'train_runtime': 2778.6927, 'train_samples_per_second': 2.041, 'train_steps_per_second': 0.128, 'total_flos': 119511228541200.0, 'train_loss': 0.35204597361636963, 'epoch': 4.0})
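
The `global_step=356` in the output above can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and roughly 1,420 training questions. That count is not stated in the notebook; it is inferred here from the 80/20 split and the logged evaluation throughput, which points to about 355 validation samples. `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Batches per epoch (the last batch may be partial) multiplied by epochs."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

# n_train=1420 is an assumed figure, inferred rather than read from the notebook.
print(total_optimizer_steps(n_train=1420, batch_size=16, epochs=4))  # 356
```

Any training-set size from 1,409 to 1,424 gives 89 batches per epoch at batch size 16, and therefore the same 356 steps over 4 epochs.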

In [16]:
eval_results = trainer.evaluate()
print(eval_results)


{'eval_loss': 0.27405211329460144, 'eval_runtime': 41.4655, 'eval_samples_per_second': 8.561, 'eval_steps_per_second': 0.555, 'epoch': 4.0}


In [17]:
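from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids                # True class IDs
    preds = pred.predictions.argmax(-1)    # Predicted class = index of the highest logit
    acc = accuracy_score(labels, preds)    # Fraction of correct predictions
    return {"accuracy": acc}


# Re-create the Trainer around the already fine-tuned model so that
# evaluation also reports accuracy
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Evaluate again, this time with the accuracy metric
trainer.evaluate()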


{'eval_loss': 0.27405211329460144,
 'eval_model_preparation_time': 0.0064,
 'eval_accuracy': 0.9295774647887324,
 'eval_runtime': 43.7587,
 'eval_samples_per_second': 8.113,
 'eval_steps_per_second': 0.526}
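
To make the computation inside `compute_metrics` concrete, the sketch below reproduces the argmax-and-compare step in plain Python. The logits and labels are hypothetical, and `argmax` and `accuracy` are illustrative helpers rather than the scikit-learn functions.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits, labels):
    """Share of rows whose highest-scoring class matches the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits for four questions over three classes.
toy_logits = [
    [2.1, 0.3, -1.0],   # predicted 0
    [0.2, 1.7, 0.1],    # predicted 1
    [0.5, 0.4, 0.9],    # predicted 2
    [1.2, 1.5, 0.0],    # predicted 1 (true label is 0, so this one is wrong)
]
toy_labels = [0, 1, 2, 0]

print(accuracy(toy_logits, toy_labels))  # 0.75
```

The reported `eval_accuracy` of about 0.93 is the same ratio, computed over the full validation set.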

In [18]:
model.save_pretrained("bert-faq-model")
tokenizer.save_pretrained("bert-faq-model")


('bert-faq-model\\tokenizer_config.json',
 'bert-faq-model\\special_tokens_map.json',
 'bert-faq-model\\vocab.txt',
 'bert-faq-model\\added_tokens.json')
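
`save_pretrained` writes the model weights and tokenizer files, but the fitted `LabelEncoder` exists only in the notebook session. One possible way to keep the ID-to-label mapping next to the model is to store the class list as JSON. The sketch below is an assumption about how that could look: the file name `label_classes.json` and the class names are hypothetical, and a temporary directory stands in for the `bert-faq-model` folder.

```python
import json
import os
import tempfile

# Hypothetical class list; in the notebook this would be list(label_encoder.classes_).
classes = ["accounts", "loans", "security"]

# A temporary directory stands in for the "bert-faq-model" folder.
with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, "label_classes.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(classes, f)

    # Later, at inference time: reload the list to decode predicted IDs.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

predicted_class_id = 2
print(restored[predicted_class_id])  # security
```

Since `LabelEncoder` assigns IDs by position in `classes_`, indexing the restored list with a predicted ID recovers the original label.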

### Inference: Predict Label for New Question

We define a helper function `predict_faq(question)` that:
- Tokenizes the input question using the same tokenizer used during training.
- Feeds the tokenized input into the trained BERT model.
- Extracts the predicted class by selecting the index with the highest logit score.
- Converts the predicted class index back to the original label using the `LabelEncoder`.

This allows us to make real-time predictions for new user questions.


In [20]:
def predict_faq(question):
    # Tokenize the input question and convert to PyTorch tensors
    inputs = tokenizer(
        question, 
        return_tensors="pt",        # Return as PyTorch tensors
        truncation=True,            # Truncate if longer than max length
        padding=True,               # Pad shorter sequences
        max_length=128
    )

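    # Move inputs to the model's device (matters if the model was trained on a GPU)
    inputs = {key: val.to(model.device) for key, val in inputs.items()}

    # Switch to evaluation mode and disable gradient calculation for inference
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)   # Get model predictions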

    logits = outputs.logits         # Raw model outputs (logits)
    predicted_class_id = torch.argmax(logits, dim=1).item()  # Index of highest logit
    predicted_label = label_encoder.inverse_transform([predicted_class_id])[0]  # Decode label
    return predicted_label

# Example usage
question = "How can I apply for a credit card?"
print(predict_faq(question))  # Output: predicted label for the question


security
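
The function returns only the top label. When a rough confidence score is also useful, the logits can be passed through a softmax to obtain probabilities. The plain-Python sketch below uses hypothetical logits and an illustrative `softmax` helper.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one question over three classes.
toy_logits = [0.0, math.log(3.0), 0.0]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=lambda i: probs[i])  # same index argmax would pick
print(best, round(probs[best], 2))  # 1 0.6
```

Inside `predict_faq`, `torch.softmax(logits, dim=1)` performs the same computation on the real logits. A low top probability can flag predictions that deserve a manual check.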
