# **FinQuery: Intelligent Banking Intent Classification for Customer Support**

## Introduction

In the rapidly evolving world of digital banking, customers expect instant, accurate responses to their queries, whether that means checking a balance, reporting a lost card, or troubleshooting a payment issue. **FinQuery** is designed to meet this demand by automatically classifying customer support messages into one of the 77 fine‑grained intents of the BANKING77 dataset. By comparing a streamlined **TF‑IDF + Linear SVC** pipeline against **Logistic Regression** and **Multinomial Naive Bayes** baselines, FinQuery demonstrates how even classic NLP pipelines can deliver high accuracy in complex, single‑domain intent detection. This notebook walks through each stage, from data exploration and preprocessing to model training, hyperparameter tuning, and final evaluation, and shows how text normalization, stratified splits, and metric‑driven model selection yield a dependable automated routing solution for customer inquiries.


# BUSINESS UNDERSTANDING

- I aim to build a robust intent-classification model for BANKING77, which contains 77 fine-grained customer banking service intents.
- This will help automated customer support route queries correctly, improving response time and customer satisfaction.

# DATA UNDERSTANDING

- Train set: 10,003 examples
- Test set:  3,080 examples
- Number of Intents: 77
- Data Source: https://huggingface.co/datasets/PolyAI/banking77

# PROJECT AIM

- Accurately classify user queries into one of 77 intents

## My metrics of success are:

**Accuracy**
- The percentage of queries assigned the correct intent; easy to understand and useful as an overall view.

**Macro F1 Score**
- It balances both precision and recall
- It weights all 77 intents equally, so frequent intents cannot mask poor performance on rare ones
- It's meaningful when every intent matters, not just the frequent ones
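
To make the difference between the two metrics concrete, here is a minimal hand-rolled sketch on toy labels. The label names, the counts, and the helper `per_class_f1` are made up for illustration; the notebook itself relies on scikit-learn's `f1_score(..., average='macro')`.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from two equally long label lists."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Toy data: 8 "balance" queries, 2 "lost_card" queries,
# and a lazy classifier that always predicts the majority intent.
y_true = ["balance"] * 8 + ["lost_card"] * 2
y_pred = ["balance"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_by_class = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print(f"Accuracy: {accuracy:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

The lazy classifier still reaches 80% accuracy, while its macro F1 drops to about 0.44 because the `lost_card` intent contributes an F1 of zero to the average.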

## **DATA PREPARATION**

In [None]:
#imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
from wordcloud import WordCloud
import re
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, confusion_matrix
import joblib
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# For Bert Transformer
import torch
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

#### **Loading the Datasets**

In [None]:
# Creating Dataframe from the training Dataset
# Load the training dataset

train_df = pd.read_csv("train.csv")
train_df.head()

In [None]:
# Creating Dataframe from the test Dataset
# Load the test dataset

test_df = pd.read_csv("test.csv")
test_df.head()

In [None]:
# Examining the Data

print(f"The Training Data Shape is: {train_df.shape}")
print(f"The Test Data Shape is: {test_df.shape}")

- Our Training data has 10,003 records and 2 columns
- Our Test data has 3,080 records and 2 columns

In [None]:
train_df.info()

- Both columns of `train_df` (`text` and `category`) contain string data

In [None]:
test_df.info()

- Both columns of `test_df` (`text` and `category`) contain string data

In [None]:
# Checking out the missing data in the training data

train_df.isna().sum()

In [None]:
# Checking out the missing data in the test data

test_df.isna().sum()

In [None]:
# Checking out for duplicates in the training data

len(train_df[train_df.duplicated(keep="first")])

In [None]:
# Checking out for duplicates in the test data 

len(test_df[test_df.duplicated(keep="first")])

- Neither dataset contains missing values or duplicate rows, so the data is already clean

## **Exploratory Data Analysis**

### **Distribution of Intent category**

In [None]:
# Count the 10 most frequent intent categories in the training data

category_counts = train_df['category'].value_counts().sort_values(ascending=False).head(10)
category_counts
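
The cell above lists only the ten most frequent intents. To judge how imbalanced the classes are, the least frequent intents matter as well. A minimal sketch on toy labels (the label names and counts are invented for illustration and are not BANKING77 figures):

```python
from collections import Counter

# Hypothetical labels standing in for train_df['category']
toy_labels = ["card_lost"] * 6 + ["top_up"] * 3 + ["exchange_rate"] * 1

counts = Counter(toy_labels)
most_label, most_n = counts.most_common(1)[0]     # most frequent intent
least_label, least_n = counts.most_common()[-1]   # least frequent intent
imbalance_ratio = most_n / least_n

print(f"Most frequent:  {most_label} ({most_n})")
print(f"Least frequent: {least_label} ({least_n})")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the notebook's data, the same idea should work by passing `train_df['category']` to `Counter` instead of the toy list.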

#### **Visualize category distribution**

In [None]:
# Intent categories by frequency

# Ensure data is sorted
category_counts = category_counts.sort_values(ascending=False)

plt.figure(figsize=(15, 8))

# Use seaborn barplot with the 'magma' palette
# Pass the properly sorted index to the 'order' parameter
ax = sns.barplot(
    x=category_counts.values,
    y=category_counts.index,
    palette='magma',
    order=category_counts.index  # This works because category_counts is now properly sorted
)

# No need to invert y-axis as the data is already in descending order and
# seaborn plots from top to bottom by default

# Add title and axis labels with styling
ax.set_title("Top 10 Most Frequent Intents", fontsize=16, fontweight='bold')
ax.set_xlabel("Number of Queries", fontsize=14)
ax.set_ylabel("")  # no label on the y-axis

# Improve the overall styling
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(axis='x', linestyle='--', alpha=0.7)
ax.set_axisbelow(True)

# Increase tick label font size
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)

# Annotate each bar with its count
for p in ax.patches:
    width = p.get_width()
    ax.text(
        width + width * 0.02,                      # x position: slightly past the end of the bar
        p.get_y() + p.get_height() / 2,  # y position: center of the bar
        f'{int(width):,}',                # text to display with comma formatting
        va='center',
        fontsize=12,
        fontweight='bold'
    )
    
# Add each intent's share of the full training set
# (dividing by the top-10 sum would overstate the percentages)
total = len(train_df)
for i, p in enumerate(ax.patches):
    width = p.get_width()
    percentage = (width / total) * 100
    ax.text(
        width + width * 0.12,                 
        p.get_y() + p.get_height() / 2,  
        f'({percentage:.1f}%)',           
        va='center',
        fontsize=11,
        color='#666666'
    )

plt.tight_layout()
plt.show();

#### **WordCloud of full training text**

In [None]:
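# Define stop words to remove them from the word cloud
# (download the NLTK stopword list first if it is not already available)
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))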

all_text = " ".join(train_df['text'].tolist())
wordcloud = WordCloud(
    width=750,
    height=450,
    background_color='white',
    max_words=200,
    collocations=False,
    stopwords=stop_words
).generate(all_text)

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Customer Queries", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show();

In [None]:
# Build a CountVectorizer (Here I'll remove English stopwords)
count_vect = CountVectorizer(stop_words='english', ngram_range=(1,1))

# Fit my corpus
X_counts = count_vect.fit_transform(train_df['text'])

# Sum up the counts of each vocabulary word
word_counts = X_counts.sum(axis=0)  # returns a 1×V sparse matrix
counts = [(word, word_counts[0, idx]) for word, idx in count_vect.vocabulary_.items()]

# Sort by frequency and take top N
top_n = 20
top_words = sorted(counts, key=lambda x: x[1], reverse=True)[:top_n]
words, freqs = zip(*top_words)

# Create a DataFrame for plotting
df_top = pd.DataFrame({
    'word': words,
    'frequency': freqs
})

# Plot horizontal bar chart
plt.figure(figsize=(10, 6))
sns.barplot(data=df_top, x='frequency', y='word', palette='magma')
plt.title("Top 20 Most Common Words")
plt.xlabel("Frequency")
plt.ylabel("Word")
plt.tight_layout()
plt.show();

## **Data Preprocessing**

In [None]:
def clean_text(text):

    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)

    # remove any leading or trailing whitespace characters
    text = text.strip()

    return text
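
A quick check of what `clean_text` does to typical queries. The function is repeated here only so the snippet runs on its own, and the two example queries are made up for illustration:

```python
import re

def clean_text(text):
    # Same steps as the notebook's function: lowercase, strip punctuation,
    # collapse whitespace, trim.
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

examples = [
    "Why can't I top-up?",
    "  I was charged £20.50   twice!! ",
]
cleaned = [clean_text(t) for t in examples]

for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

Removing punctuation also merges tokens: `can't` becomes `cant`, `top-up` becomes `topup`, and `£20.50` becomes `2050`. That is acceptable for a bag-of-words model but worth keeping in mind when reading the learned vocabulary.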

I am keeping the preprocessing minimal (lowercasing, punctuation removal, and whitespace normalization) for a few reasons:

1. **TF‑IDF’s built‑in tokenization**  
   Scikit‑learn’s `TfidfVectorizer` already:
   * Lowercases the text and, by default, extracts tokens of two or more word characters,
   * And builds the n‑gram vocabulary for you.  
     Adding a separate tokenization step outside of the vectorizer would be redundant.

2. **Stopword removal can hurt fine‑grained intents**  
   In a 77‑way banking intent task, words like “to”, “on”, “did”, or “my” can carry important signals. For example:  
   _“did my payment go through?”_ vs. _“why is my payment delayed?”_  
   So I keep stopwords in the bag‑of‑words representation and let the vectorizer's `max_df` setting filter only terms that appear in almost every query.

3. **Lemmatization adds extra complexity with limited gain**
   * Lemmatization (via spaCy or NLTK) can reduce inflectional forms (e.g., “payments” → “payment”), but banking queries already tend to use a consistent vocabulary (“withdraw”, “withdrawal”; “pay”, “payment”), so the expected gain is small.
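
The tokenization point can be checked without fitting a vectorizer. The sketch below applies the default `token_pattern` of scikit-learn's `TfidfVectorizer` with plain `re`; the helper name `default_tokens` and the example query are my own, for illustration only.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Approximate the vectorizer's default analyzer: lowercase, then match tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = default_tokens("Did my payment go through? I am waiting")
print(tokens)
```

Stopwords such as "did" and "my" survive, which is what point 2 relies on, while the single-character token "i" is dropped by the pattern itself.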

In [None]:
# Clean the text columns

train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)

# Sample cleaned text
train_df[['text','clean_text']].sample(3)

#### **Train/Validation Split**

In [None]:
# Determining Target vector (y) and Feature Matrix (X)

X = train_df['clean_text']
y = train_df['category']

print("Feature vector (cleaned text) shape:", X.shape)
print("Our Target Vector Shape is:", y.shape)

In [None]:
# Hold out 20% of the training data as a stratified validation split
# (the *_test names below refer to this validation split, not to test.csv)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Checking the shapes of both splits
print("X_train_raw:", X_train_raw.shape, "X_test_raw:", X_test_raw.shape)
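
To see what `stratify=y` buys us, here is a small sketch on toy data. The intent names and counts are invented for illustration; the point is only that both splits keep the same class proportions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (hypothetical, not BANKING77 counts)
labels = ["card_lost"] * 50 + ["top_up"] * 30 + ["exchange_rate"] * 20
texts = [f"query {i}" for i in range(len(labels))]

X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Share of each intent in the two splits
train_share = {k: v / len(y_tr) for k, v in Counter(y_tr).items()}
val_share = {k: v / len(y_val) for k, v in Counter(y_val).items()}

print("train:", train_share)
print("val:  ", val_share)
```

Without stratification, a rare intent could end up with very few (or no) validation examples, which would make its per-class F1 unreliable.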

#### **Feature Extraction (TF-IDF)**

In [None]:
# Create Vectorizer for text classification

tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),  # Capture longer phrases that might be important
    min_df=2,            # Removes terms that appear in fewer than 2 documents
    max_df=0.95,         # Removes terms that appear in more than 95% of documents
    sublinear_tf=True    # Reduces the weight of terms that occur very frequently in a document
)
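
As a small illustration of `sublinear_tf=True`: scikit-learn documents this option as replacing the raw term count `tf` with `1 + log(tf)`. The helper below is my own stand-in for that formula, not the library's code.

```python
import math

def sublinear_tf(raw_count):
    """1 + ln(tf) for tf > 0, else 0 (the scaling described for sublinear_tf)."""
    return 1 + math.log(raw_count) if raw_count > 0 else 0.0

# How the weight of a term grows with its count inside one query
scaled = {tf: round(sublinear_tf(tf), 3) for tf in (1, 2, 5, 10)}
print(scaled)
```

A word repeated ten times in one query therefore counts roughly 3.3 times as much as a word that appears once, rather than ten times as much.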

In [None]:
# Fit vectorizer on training text

X_train = tfidf_vectorizer.fit_transform(X_train_raw)
X_test = tfidf_vectorizer.transform(X_test_raw)

In [None]:
print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)

## **MODELLING**

### **Linear SVC Model**

In [None]:
# Training LinearSVC Model

# Create Model
svc_model = LinearSVC(class_weight='balanced', random_state=42, max_iter=5000)

# Fit the Model
svc_model.fit(X_train, y_train)

# Predict
svc_prediction = svc_model.predict(X_test)

# Test Accuracy, precision, recall, f1 score
svc_accuracy = accuracy_score(y_test, svc_prediction)
svc_precision = precision_score(y_test, svc_prediction, average='macro')
svc_recall = recall_score(y_test, svc_prediction, average='macro')
svc_f1_score = f1_score(y_test, svc_prediction, average='macro')

print(f"SVC Model Accuracy: {svc_accuracy:.4f}")
print(f"SVC model Precision: {svc_precision:.4f}  (macro‐avg)")
print(f"SVC Model Recall: {svc_recall:.4f}  (macro‐avg)")
print(f"SVC F1 Score: {svc_f1_score:.4f}  (macro‐avg)");

In [None]:
print("Classification Report - LinearSVC")
print(classification_report(y_test, svc_prediction, zero_division=0));

### **Logistic Regression Model**

In [None]:
# Training Logistic Regression Model

# Create model
lr_model = LogisticRegression(random_state=42, max_iter=1000, class_weight="balanced")

# Fit model
lr_model.fit(X_train, y_train)

# Predict
lr_prediction = lr_model.predict(X_test)

# Test Accuracy, precision, recall, f1 score
lr_accuracy = accuracy_score(y_test, lr_prediction)
lr_precision = precision_score(y_test, lr_prediction, average='macro')
lr_recall = recall_score(y_test, lr_prediction, average='macro')
lr_f1_score = f1_score(y_test, lr_prediction, average='macro')

print(f"Logistic Regression Model Accuracy: {lr_accuracy:.4f}")
print(f"Logistic Regression model Precision: {lr_precision:.4f} (macro‐avg)")
print(f"Logistic Regression Model Recall: {lr_recall:.4f} (macro‐avg)")
print(f"Logistic Regression F1 Score: {lr_f1_score:.4f} (macro‐avg)");

In [None]:
print("Classification Report - Logistic Regression")
print(classification_report(y_test, lr_prediction, zero_division=0));

### **Naive Bayes Model**

In [None]:
# Training the Naive Bayes model

# Create Model
nb_model = MultinomialNB()

# Fit Model
nb_model.fit(X_train, y_train)

# Predict on test set
nb_predictions = nb_model.predict(X_test)

# Test Accuracy, precision, recall, f1 score
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_precision = precision_score(y_test, nb_predictions, average='macro')
nb_recall = recall_score(y_test, nb_predictions, average='macro')
nb_f1_score = f1_score(y_test, nb_predictions, average='macro')

# Evaluation results
print(f"Naive Bayes Accuracy:         {nb_accuracy:.4f}")
print(f"Naive Bayes Precision:        {nb_precision:.4f}  (macro avg)")
print(f"Naive Bayes Recall:           {nb_recall:.4f}  (macro avg)")
print(f"Naive Bayes F1 Score:         {nb_f1_score:.4f}  (macro avg)\n")

In [None]:
# Classification report
print("Classification Report:\n")
print(classification_report(y_test, nb_predictions, zero_division=0))

#### **Getting Best Performing Model**

In [None]:
model_names = ['Logistic Regression', 'Linear SVC', 'Naive Bayes']
accuracy = [lr_accuracy, svc_accuracy, nb_accuracy]
f1 = [lr_f1_score, svc_f1_score, nb_f1_score]
precision = [lr_precision, svc_precision, nb_precision]
recall = [lr_recall, svc_recall, nb_recall]

In [None]:
# Create a DataFrame
metrics_df = pd.DataFrame({
    'Model': model_names,
    'Accuracy': accuracy,
    'Macro F1': f1,
    'Macro Precision': precision,
    'Macro Recall': recall
})

In [None]:
# Melt for plotting
df_melted = metrics_df.melt(id_vars='Model', var_name='Metric', value_name='Score')

# Visualization
plt.figure(figsize=(10, 6))
sns.barplot(data=df_melted, x='Metric', y='Score', hue='Model')
plt.title('Model Comparison')
plt.ylim(0.75, 0.90)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show();

- Based on the bar plot above, Linear SVC is the best-performing model, so it is the one I tune next

### **Hyperparameter Tuning for Best Model (i.e., LinearSVC)**

In [None]:
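# Create Pipeline
# Note: this TfidfVectorizer uses scikit-learn's defaults (unigrams only),
# not the ngram_range/min_df/max_df settings of tfidf_vectorizer above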
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC())
])

# Define Hyperparameter Grid
param_grid = {
    'svc__class_weight': ['balanced'],
    'svc__random_state': [42],
    'svc__max_iter': [1000, 2000, 5000],
    'svc__C': [0.01, 0.1, 1, 10]
}

# Grid Search with 5-fold CV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', verbose=2, n_jobs=-1)

# Fit your data
grid_search.fit(X_train_raw, y_train)

# Print best model details
print("Best Parameters:", grid_search.best_params_)
print("Best Macro F1 Score:", grid_search.best_score_)
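
To get a feel for how much work this grid search does, the grid can be expanded by hand. A small sketch using only the standard library; `candidates`, `n_candidates`, and `n_fits` are names I introduce here for illustration.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "svc__class_weight": ["balanced"],
    "svc__random_state": [42],
    "svc__max_iter": [1000, 2000, 5000],
    "svc__C": [0.01, 0.1, 1, 10],
}
cv_folds = 5

# Every combination of the listed values is one candidate configuration
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # GridSearchCV then refits the best one once more

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Only `C` and `max_iter` actually vary. Since `max_iter` is just an iteration cap, candidates that differ only in `max_iter` will probably score the same once the solver converges, so most of the useful search is over the four `C` values.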

In [None]:
# Evaluate on the held-out validation split

tuned_prediction = grid_search.predict(X_test_raw)
print(classification_report(y_test, tuned_prediction, zero_division=0))

### **Saving The Model**

In [None]:
# Save the entire GridSearchCV object

joblib.dump(grid_search, 'tuned_linear_svc_model.pkl')

In [None]:
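# Save the standalone TF-IDF vectorizer used for the untuned baselines
# (the tuned pipeline saved above already contains its own fitted vectorizer)

joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')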
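
Because the tuned model is a `Pipeline`, the saved object carries its own vectorizer and can be applied to raw strings after loading. A minimal sketch of that round trip on a toy pipeline; the texts, labels, and file name are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical training set
texts = [
    "i lost my card", "my card is lost", "card was stolen or lost",
    "top up my account", "how do i top up", "top up failed",
]
labels = ["lost_card"] * 3 + ["top_up"] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(random_state=42)),
])
pipe.fit(texts, labels)

# Persist and reload the whole pipeline (vectorizer + classifier) as one file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

new_queries = ["i think my card is lost", "top up is not working"]
original_pred = list(pipe.predict(new_queries))
restored_pred = list(restored.predict(new_queries))
print(restored_pred)
```

The reloaded pipeline takes plain text and produces the same predictions as the original, so no separately saved vectorizer is needed at inference time.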

### **TESTING MODEL ON UNSEEN DATA**

In [None]:
# Define the test data feature matrix

X_real_test = test_df['clean_text']

In [None]:
# Get the true labels

y_real_test = test_df['category']

#### **Load Model and Make Predictions**

In [None]:
# load my trained model

loaded_model = joblib.load('tuned_linear_svc_model.pkl')

# predict on the unseen data

y_real_pred = loaded_model.predict(X_real_test)

#### **Evaluate Model Performance on Test Data**

In [None]:
# Evaluate on the true test set
real_test_accuracy = accuracy_score(y_real_test, y_real_pred)
real_test_macro_f1 = f1_score(y_real_test, y_real_pred, average='macro')

print("Final Evaluation on Real Test Set:")
print(f"Accuracy: {real_test_accuracy:.4f}")
print(f"Macro F1 Score: {real_test_macro_f1:.4f} (macro‐avg)")
print("\nFull Classification Report:")
print(classification_report(y_real_test, y_real_pred, zero_division=0))

## Conclusion

1. **Performance Recap**  
   - Our best performing model, **Linear SVC** (with class‑weight balancing and tuned C parameter), achieved a **macro‑F1** of **≈0.87** on the unseen test set, indicating consistent performance across all 77 intent categories.  
   - The **Logistic Regression** and **Naive Bayes** baselines reached macro‑F1 scores of **≈0.84** and **≈0.77** on the validation split, highlighting the strength of even simple TF‑IDF pipelines.

2. **Key Takeaways**  
   - **Minimal preprocessing** (lowercasing, punctuation removal, whitespace normalization) paired with **TF‑IDF n‑grams** is sufficient to capture much of the nuance in banking queries.  
   - **Stratified sampling** and **macro‑averaged metrics** are essential for fair evaluation when classes are imbalanced.  
   - **Hyperparameter tuning** (via `GridSearchCV`) can yield noticeable gains—here, a ~1% lift in macro‑F1 for Linear SVC.

3. **Next Steps**  
   - **Transformer‑based fine‑tuning** (e.g., BERT or FinBERT) to capture deeper semantic patterns and further improve low‑resource intents.  
   - **Active learning** to continuously incorporate new customer queries and emerging intents into the training set.  
   - **Production deployment** via a lightweight FastAPI service, with a confidence‑based fallback to human agents for low‑certainty inputs.
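
The confidence-based fallback mentioned in the last step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the 0.6 threshold, and the example scores are hypothetical. `LinearSVC` exposes raw margins through `decision_function` rather than probabilities, so turning them into a softmax "confidence" is itself an assumption that would need calibration in practice.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into values that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, labels, threshold=0.6):
    """Return (intent, confidence), or ('human_agent', confidence) below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "human_agent", probs[best]
    return labels[best], probs[best]

labels = ["card_lost", "top_up", "exchange_rate"]
confident = route([4.0, 0.5, 0.2], labels)   # one class clearly ahead
uncertain = route([1.0, 0.9, 0.8], labels)   # scores nearly tied

print(confident)
print(uncertain)
```

The first query is routed automatically, while the second falls back to a human agent because no intent stands out.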

FinQuery’s pipeline shows that with careful preprocessing, rigorous validation, and thoughtful metric selection, automated intent classification can be both **accurate** and **interpretable**, ready to enhance your bank’s customer support with near‑real‑time routing and resolution.

## **Fine-Tune BERT for Intent Classification**

In [None]:
# Encode Labels

le = LabelEncoder()
train_df["label"] = le.fit_transform(train_df["category"])
test_df["label"]  = le.transform(test_df["category"])
num_labels = len(le.classes_)

In [None]:
# Creating a PyTorch Dataset

class IntentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text  = str(self.texts[idx])
        label = int(self.labels[idx])
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids':      encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels':         torch.tensor(label, dtype=torch.long)
        }
        

In [None]:
# Initializing Tokenizer & Datasets

tokenizer     = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = IntentDataset(
    texts=train_df["text"].tolist(),
    labels=train_df["label"].tolist(),
    tokenizer=tokenizer,
    max_len=128
)
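# Note: test.csv serves as the evaluation set during training here, so it also
# drives checkpoint selection; a separate validation split would keep it unseen
eval_dataset  = IntentDataset(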
    texts=test_df["text"].tolist(),
    labels=test_df["label"].tolist(),
    tokenizer=tokenizer,
    max_len=128
)

In [None]:
# Load Pretrained BERT for Sequence Classification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_labels
)

In [None]:
# Training Arguments & Trainer

training_args = TrainingArguments(
    output_dir='./bert_finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)

from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(pred):
    """Compute accuracy and macro-averaged precision/recall/F1 for the Trainer."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='macro', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision_macro': precision,
        'recall_macro': recall,
        'f1_macro': f1
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

In [None]:
# Training & Evaluation

trainer.train()
eval_results = trainer.evaluate()
print("Evaluation results:", eval_results)

In [None]:
# Saving the Model, Tokenizer, and Label Encoder

model.save_pretrained("./bert_intent_model")

tokenizer.save_pretrained("./bert_intent_model")

joblib.dump(le, "intent_label_encoder.pkl")

In [None]:
# Helper Function for Inference

def predict_intent(text: str) -> str:
    """
    Returns the predicted intent label for a single banking query.
    """
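    # 1) tokenize the raw text (the model was fine-tuned on the uncleaned `text` column)
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
        max_length=128,
        return_tensors='pt'
    )
    # 2) move the tensors to the model's device (the GPU after training, if one was used)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    # 3) forward pass; the argmax over the logits below picks the predicted class
    model.eval()
    with torch.no_grad():
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask']
        )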
    logits = outputs.logits
    pred_id = logits.argmax(dim=-1).item()
    return le.inverse_transform([pred_id])[0]