### Project description ###

This project builds an end-to-end Natural Language Processing (NLP) pipeline for classifying and analyzing customer complaints submitted to a major U.S. bank.

The primary objective is to automate the triage and routing of complaints by using machine learning and transformer-based models. The project also incorporates sentiment analysis to assess customer emotions and prioritize critical issues.

Key steps included:

- Data cleaning and preprocessing of over 7,000 complaint records
- Text vectorization using TF-IDF for baseline models
- Training and evaluation of traditional classifiers (Naive Bayes, Logistic Regression, SVM)
- Fine-tuning a BERT transformer model to classify complaints with ~77% accuracy
- Performing sentiment analysis using VADER to detect negative, positive, or neutral tones
- Providing business insights to help departments identify and prioritize issues based on volume and sentiment

This solution enables the bank to:

- Improve operational efficiency by automating complaint routing
- Monitor product performance via complaint sentiment trends
- Identify high-risk cases for faster escalation and response





In [None]:
# Install the packages that Google Colab doesn't have by default.

! pip install spacy

! pip install transformers datasets scikit-learn torch

! pip install vaderSentiment


Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
# Library imports

import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
import spacy
import os
import torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizerFast, BertForSequenceClassification
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

NOTE: I tried a couple of ML models that did not reach the best accuracy. Their code is kept below as commented-out cells, so you can see the models, but they are not run.

---



In [None]:
# Download the NLTK resources we need

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('punkt_tab')
os.environ["WANDB_DISABLED"] = "true"

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
# Load the complaints CSV

df_path = 'complaints_banking_2023.csv'
df = pd.read_csv(df_path)

df.head()

Unnamed: 0,Complaint ID,Date Received,Banking Product,Issue ID,Complaint Description,State,ZIP,Bank Response
0,CID76118977,1/1/2023,Checking or savings account,I_3510635,on XX/XX/XX22 I opened a safe balance account ...,California,92311,Closed with monetary relief
1,CID98703933,1/1/2023,"Credit reporting, credit repair services, or o...",I_3798538,There is an item from Bank of ABC on my credit...,California,91344,Closed with explanation
2,CID52036665,1/1/2023,Checking or savings account,I_3648593,On XX/XX/XX22 I found out that my account was ...,New York,10466,Closed with monetary relief
3,CID62581335,1/1/2023,Credit card or prepaid card,I_6999080,I've had a credit card for years with Bank of ...,California,92127,Closed with monetary relief
4,CID65731164,1/1/2023,Checking or savings account,I_3648593,This issue has to do with the way that Bank of...,New Jersey,7946,Closed with explanation


In [None]:
# Checking for NaN
df.isna().sum()

Unnamed: 0,0
Complaint ID,0
Date Received,0
Banking Product,0
Issue ID,0
Complaint Description,0
State,27
ZIP,30
Bank Response,0


In [None]:
# Drop rows with NaN values (found in the State and ZIP columns)
df = df.dropna()

df.columns

Index(['Complaint ID', 'Date Received', 'Banking Product', 'Issue ID',
       'Complaint Description', 'State', 'ZIP', 'Bank Response'],
      dtype='object')

In [None]:
# Convert 'Date Received' to datetime format
df['Date Received'] = pd.to_datetime(df['Date Received'], errors='coerce')
# Find the date range
date_max = df['Date Received'].max()
date_min = df['Date Received'].min()

print(f'Max date: {date_max}')
print(f'Min date: {date_min}')

Max date: 2023-10-21 00:00:00
Min date: 2023-01-01 00:00:00


In [None]:
# Set of English stop words
stop_words = set(stopwords.words('english'))

# WordNet lemmatizer instance
lemmatizer = WordNetLemmatizer()

In [None]:
def preprocessing(text):
    # Lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove stopwords and lemmatize
    cleaned = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(cleaned)
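
To see what this cleaning does to a single complaint, the sketch below repeats the same steps using only the standard library. The tiny `DEMO_STOP_WORDS` set and the whitespace split are stand-ins (assumptions) for NLTK's stop-word list and `word_tokenize`, and lemmatization is skipped, so the result is only illustrative.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"on", "i", "a", "the", "to", "and"}

def demo_preprocessing(text):
    # Lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace split stands in for nltk.word_tokenize
    tokens = text.split()
    # Remove stop words (no lemmatization in this sketch)
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

sample = "On XX/XX/XX22 I opened a safe balance account."
cleaned = demo_preprocessing(sample)
print(cleaned)  # xxxxxx opened safe balance account
```

Note how the redacted date `XX/XX/XX22` collapses into the single token `xxxxxx` once the digits and slashes are removed. Placeholder tokens like this stay in the cleaned text unless they are filtered out separately.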

In [None]:
# Apply the preprocessing to the complaint text

df['Cleaned Complaint'] = df['Complaint Description'].apply(preprocessing)

In [None]:
print(list(df['Cleaned Complaint'].head()))



In [None]:
# Alternatively, we can use spaCy to clean the data. spaCy handles lemmatization, stop-word removal, and punctuation filtering in a single pass.

nlp = spacy.load('en_core_web_sm')

def spacy_preprocessing(text):
    doc = nlp(text.lower())  # Lowercase and parse
    cleaned = [
        token.lemma_ for token in doc
        if not token.is_stop and not token.is_punct and not token.like_num and token.is_alpha
    ]
    return ' '.join(cleaned)

df['Cleaned Complaint'] = df['Complaint Description'].apply(spacy_preprocessing)



In [None]:
### Because this model didn't perform well, we need to change the strategy. ###

# tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1,2))

# X = tfidf.fit_transform(df['Cleaned Complaint'])
# y = df['Banking Product']


In [None]:
# print(X.shape)
# print(y.shape)

In [None]:
# X_train , X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# need to change the

# model = MultinomialNB()
# model.fit(X_train, y_train)

# y_pred = model.predict(X_test)

# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("\nClassification Report:\n", classification_report(y_test, y_pred))
# print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


In [None]:
### Group classes that scored 0.00 in the first training run: their support is very low, so the model couldn't learn enough from them ###

class_counts = df['Banking Product'].value_counts()

rare_classes = class_counts[class_counts < 50].index

df['Product Grouped'] = df['Banking Product'].apply(lambda x: 'Other' if x in rare_classes else x)
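
The grouping logic can be checked by hand on a small example. The sketch below mirrors it with `collections.Counter` instead of pandas; the label list is hypothetical and only the threshold of 50 comes from the cell above.

```python
from collections import Counter

# Hypothetical labels; the real column has one product label per complaint.
products = (
    ["Mortgage"] * 60
    + ["Student loan"] * 55
    + ["Payday loan"] * 3
    + ["Vehicle loan"] * 10
)
MIN_SUPPORT = 50  # same threshold as in the cell above

counts = Counter(products)
# Labels with fewer than MIN_SUPPORT examples are treated as rare
rare = {label for label, n in counts.items() if n < MIN_SUPPORT}
# Replace every rare label with a single 'Other' bucket
grouped = ["Other" if label in rare else label for label in products]
grouped_counts = Counter(grouped)
print(grouped_counts)
```

In this toy data the two rare labels merge into an `Other` bucket of 13 items, which is still below the threshold. The same can happen on real data, so it is worth checking the size of `Other` after grouping.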

In [None]:
y = df['Product Grouped']

In [None]:
### The model still couldn't reach a good accuracy. ###

# X_train , X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# model = MultinomialNB()
# model.fit(X_train, y_train)

# y_pred = model.predict(X_test)

# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("\nClassification Report:\n", classification_report(y_test, y_pred))
# print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


In [None]:
# print(y_test.value_counts()) # Checking the exact support per class in y_test
# print(y_train.value_counts()) # Checking the exact support per class in y_train

In [None]:
# second_model = LogisticRegression(max_iter=1000)
# second_model.fit(X_train, y_train)

# y_pred = second_model.predict(X_test)

# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("\nClassification Report:\n", classification_report(y_test, y_pred))
# print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
# third_model = LinearSVC()

# third_model.fit(X_train, y_train)

# y_pred = third_model.predict(X_test)

# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("\nClassification Report:\n", classification_report(y_test, y_pred))
# print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
# Encode the department labels into numeric format (e.g., 'Credit card' → 0, 'Mortgage' → 1)

le = LabelEncoder()
df['label'] = le.fit_transform(df['Product Grouped'])

In [None]:
# Load the BERT tokenizer and classification model with the correct number of output labels

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(le.classes_)
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Define a custom Dataset class to wrap tokenized complaints and labels for use with the Hugging Face Trainer
class ComplaintDataset(Dataset):
    def __init__(self, texts, labels):
        # Tokenize the input complaint texts with truncation, padding, and max length
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels

    def __getitem__(self, idx):
        # Return the encoded inputs (input_ids, attention_mask, ...) along with the corresponding label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # Return total number of samples in the dataset
        return len(self.labels)

# Split the dataset into training and testing sets (80/20 split)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Cleaned Complaint'], df['label'], test_size=0.2, random_state=42
)


# Convert text and labels into Dataset objects that can be used by the Trainer
train_dataset = ComplaintDataset(train_texts.tolist(), train_labels.tolist())
test_dataset = ComplaintDataset(test_texts.tolist(), test_labels.tolist())



In [None]:
# Set up the training arguments for the model.
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    logging_dir='./logs',
    save_total_limit=1,
    logging_steps=20,
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()

Step,Training Loss
20,2.2305
40,1.8343
60,1.4667
80,1.2492
100,1.0686
120,0.9242
140,0.9138
160,0.8501
180,0.84
200,0.6585



TrainOutput(global_step=440, training_loss=0.8205267917026173, metrics={'train_runtime': 1119.9508, 'train_samples_per_second': 24.925, 'train_steps_per_second': 0.393, 'total_flos': 3672702282700800.0, 'train_loss': 0.8205267917026173, 'epoch': 5.0})
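
The `global_step=440` in the output can be reproduced from the settings above. The sketch below is a sanity check that assumes a single device, no gradient accumulation, and 6,979 rows after `dropna` (the total of the sentiment counts reported later in the notebook).

```python
import math

n_rows = 6979      # rows left after dropna (assumed from the sentiment counts)
test_size = 0.2    # from train_test_split
batch_size = 64    # per_device_train_batch_size
epochs = 5         # num_train_epochs

# scikit-learn rounds the test share up and gives the rest to training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# One optimizer step per batch; the last batch of an epoch may be partial
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_steps)  # 5583 1396 88 440
```

The logged loss table stops at step 200, so it covers a little under half of the 440 training steps.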

In [None]:
preds_output = trainer.predict(test_dataset)
y_pred = preds_output.predictions.argmax(axis=1)

In [None]:
# True labels from the test split (re-run the train/test split cell first if `test_labels` is not defined)
y_true = test_labels

# Predicted label indices → convert to text labels
y_pred_labels = le.inverse_transform(y_pred)
y_true_labels = le.inverse_transform(y_true)

print("Accuracy:", accuracy_score(y_true_labels, y_pred_labels))
print(classification_report(y_true_labels, y_pred_labels))

NameError: name 'test_labels' is not defined

In [None]:
analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    score = analyzer.polarity_scores(text)['compound']
    return 'positive' if score > 0.2 else 'negative' if score < -0.2 else 'neutral'

df['Sentiment'] = df['Complaint Description'].apply(get_sentiment)
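
The threshold rule inside `get_sentiment` can be tested without running VADER. The sketch below isolates it in a helper named `label_from_compound` (a hypothetical name) and feeds it made-up compound scores, including the two boundary values.

```python
def label_from_compound(score, threshold=0.2):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores chosen to hit each branch and both boundaries
for s in (0.65, 0.2, 0.0, -0.2, -0.73):
    print(s, label_from_compound(s))
```

Scores of exactly 0.2 and -0.2 fall into the neutral band because the comparisons are strict. The ±0.2 cutoff is wider than the ±0.05 commonly suggested in the VADER documentation, so more complaints end up labeled neutral than they would with the default.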


In [None]:
print(df['Sentiment'].value_counts())

Sentiment
negative    3563
positive    2577
neutral      839
Name: count, dtype: int64
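
Turning these counts into shares makes them easier to compare. The sketch below uses only the three numbers printed above.

```python
# Counts copied from the value_counts() output above
counts = {"negative": 3563, "positive": 2577, "neutral": 839}

total = sum(counts.values())
# Percentage share of each label, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 6979
print(shares)  # {'negative': 51.1, 'positive': 36.9, 'neutral': 12.0}
```

Roughly 37% of the texts score as positive even though every record is a complaint, so a manual spot check of a few "positive" rows may be worthwhile before relying on the labels.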


In [None]:
df_eval = pd.DataFrame({
    'Complaint': test_texts,
    'True Label': le.inverse_transform(test_labels),
    'Predicted Label': le.inverse_transform(y_pred)
})

# Filter mistakes
mistakes = df_eval[df_eval['True Label'] != df_eval['Predicted Label']]
mistakes.sample(5)


Unnamed: 0,Complaint,True Label,Predicted Label
3483,executive summary fraudulent application made ...,Debt collection,Credit card or prepaid card
2936,xxxxxx got phone call xxxx xxxx xxxx xxxx bank...,"Money transfer, virtual currency, or money ser...",Checking or savings account
5851,sent letter request agency xxxx xxxx xxxx prov...,Debt collection,"Credit reporting, credit repair services, or o..."
3650,xxxx xxxx always go line pay bill bank abc day...,Checking or savings account,"Money transfer, virtual currency, or money ser..."
6754,disputing issue navient sending letter agency ...,Debt collection,Student loan
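
A random sample of five rows gives only a rough impression. Counting the mistakes per true label shows where the model struggles most. The sketch below does this on hypothetical `(true, predicted)` pairs loosely modeled on the sample above, with shortened label names.

```python
from collections import Counter

# Hypothetical (true label, predicted label) pairs for misclassified complaints
pairs = [
    ("Debt collection", "Credit card or prepaid card"),
    ("Debt collection", "Credit reporting"),
    ("Debt collection", "Student loan"),
    ("Checking or savings account", "Money transfer"),
    ("Money transfer", "Checking or savings account"),
]

# Number of mistakes per true label
by_true = Counter(true for true, _ in pairs)
most_missed, n_missed = by_true.most_common(1)[0]

print(most_missed, n_missed)  # Debt collection 3
```

On the real `mistakes` frame the same count would come from the `True Label` column, ideally compared against the class sizes in the test split so that large classes are not over-represented.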


### Business Insights & Strategic Recommendations

Key Findings:

- About 51% of complaints are negative (3,563 of 6,979).
- Most negative complaints fall under product categories like “Debt collection,” “Credit reporting,” and “Checking or savings account.”
- The BERT model performs best on Mortgage, Credit Card, and Student Loan complaints (F1 > 0.80), making it well suited for automating classification in these areas.



## Strategic Recommendations:

1. Auto-Triage High-Risk Complaints:

   (1.1) Use the BERT model to auto-route complaints to the relevant departments based on the predicted category.

   (1.2) Flag complaints with “negative” sentiment and a high-risk category (e.g., Mortgage, Debt Collection) for priority review.

2. Track Negative Sentiment Trends:

   (2.1) Build a weekly dashboard tracking the volume of negative complaints by department.

   (2.2) This helps uncover areas with poor service or new product issues.

3. Improve Response Scripts:

   (3.1) Analyze common keywords in negative complaints using attention/SHAP tools.

   (3.2) Fine-tune chatbot/response templates based on the specific language that drives frustration.

4. Use Sentiment as a Risk Signal:

   (4.1) Tag complaints with very negative VADER compound scores (e.g. < -0.5) for escalation to compliance/legal.
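
Recommendations 1.2 and 4.1 can be combined into one routing rule. The sketch below is one possible way to do that, not a tested policy. The function name `triage`, the queue names, and the order of the checks are assumptions. The categories and the -0.5 cutoff are the examples given above.

```python
HIGH_RISK = {"Mortgage", "Debt collection"}  # example categories from 1.2
ESCALATION_CUTOFF = -0.5                      # example cutoff from 4.1

def triage(category, sentiment, compound):
    """Return a hypothetical queue name for one classified complaint."""
    # 4.1: very negative compound scores go to compliance/legal first
    if compound < ESCALATION_CUTOFF:
        return "escalate"
    # 1.2: negative sentiment in a high-risk category gets priority review
    if sentiment == "negative" and category in HIGH_RISK:
        return "priority"
    # 1.1: everything else is routed to its department as usual
    return "standard"

print(triage("Mortgage", "negative", -0.72))        # escalate
print(triage("Debt collection", "negative", -0.3))  # priority
print(triage("Student loan", "neutral", 0.0))       # standard
```

Checking the escalation cutoff first means a strongly negative complaint is escalated regardless of its predicted category, which limits the impact of a misclassified product label.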