# **`Customer Grievance Categorization with BERT and TensorFlow`**

##### This project builds a Natural Language Processing (NLP) model that categorizes customer grievances. Grievances span service quality, access and availability, billing disputes, benefit packages, and marketing concerns. The project uses BERT and TensorFlow to train a classifier that assigns each free-text grievance description to one grievance category.

### Key Features

- **BERT Integration:** The model uses BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language representation model, to capture the context of each word within the grievance text.

- **TensorFlow Implementation:** TensorFlow, an open-source machine learning framework, is used to build, train, and save the classification model.

- **Multi-Class Categorization:** The model assigns each grievance to exactly one of nine categories through a softmax output layer.




In [68]:
!pip install transformers



In [69]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import tensorflow as tf
from transformers import BertTokenizer

In [70]:
df = pd.read_excel('/content/NLP_Data.xlsx')
df.head()

| | Description of the Grievance | Grievance Category | Grievance SubCategory |
|---|---|---|---|
| 0 | concerns regarding laboratory tests being bill... | Billing/Financial Dispute | Provider Claim Issues |
| 1 | dassatifaction with provider | Quality Of Service | Not Satisfied With Provider Services |
| 2 | Dissatisafaction with delay in care. | Access And Availability | Pharmacy |
| 3 | Dissatisafaction with Dental provider way of c... | Quality Of Service | Not Satisfied With Provider Services |
| 4 | Dissatisfaction for not being informed he had ... | Billing/Financial Dispute | Balance Billing |


In [71]:
df = df.dropna()

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499 entries, 0 to 521
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Description of the Grievance  499 non-null    object
 1   Grievance Category            499 non-null    object
 2   Grievance SubCategory         499 non-null    object
dtypes: object(3)
memory usage: 15.6+ KB


### **Data Preparation**

In [73]:
df['Grievance Category'].value_counts()


Quality Of Service           160
Quality Of Care              135
Access And Availability      100
Billing/Financial Dispute     60
Benefit Package               36
Marketing                      4
Enrollment/Disenrollment       2
Confidentiality/Privacy        1
Cms Issues                     1
Name: Grievance Category, dtype: int64
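
The counts above can be turned into shares to see how unevenly the categories are represented. A minimal sketch in plain Python, with the counts copied from the `value_counts()` output; the variable names are illustrative and not part of the notebook:

```python
# Category counts copied from the value_counts() output above.
counts = {
    "Quality Of Service": 160,
    "Quality Of Care": 135,
    "Access And Availability": 100,
    "Billing/Financial Dispute": 60,
    "Benefit Package": 36,
    "Marketing": 4,
    "Enrollment/Disenrollment": 2,
    "Confidentiality/Privacy": 1,
    "Cms Issues": 1,
}

total = sum(counts.values())  # should match the row count from df.info()

# Share of each category as a percentage of all rows.
shares = {name: 100 * n / total for name, n in counts.items()}

# Rows belonging to categories with fewer than 10 examples.
rare_rows = sum(n for n in counts.values() if n < 10)

for name, pct in shares.items():
    print(f"{name:<28}{pct:5.1f}%")
print(f"rows in rare categories: {rare_rows} of {total}")
```

The four smallest categories contribute 8 of the 499 rows, so a random validation split may contain none of them.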

### **Tokenization**

In [74]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [75]:
token = tokenizer.encode_plus(
    df['Description of the Grievance'].iloc[0],
    max_length=256,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_tensors='tf'
)
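
To make the effect of `max_length`, `truncation`, and `padding='max_length'` concrete, here is a toy sketch in plain Python. It does not call the real tokenizer: `pad_and_mask` and the example ids are hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the usual BERT vocabulary values, assumed here rather than read from the tokenizer.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed BERT special-token ids

def pad_and_mask(token_ids, max_length):
    """Mimic truncation plus padding to a fixed length (toy version)."""
    # Reserve two positions for [CLS] and [SEP], truncating the rest.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

# Hypothetical word-piece ids for a short sentence.
ids, mask = pad_and_mask([2345, 1114, 9720], max_length=8)
print(ids)   # [101, 2345, 1114, 9720, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

In the notebook the same idea applies with `max_length=256`, so every grievance becomes a row of 256 ids and a matching row of 256 mask values.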

In [76]:
X_input_ids = np.zeros((len(df), 256))
X_attn_masks = np.zeros((len(df), 256))

In [77]:
def generate_training_data(df, ids, masks, tokenizer):
    # Wrap the column (not enumerate) so tqdm knows the total and shows real progress
    for i, text in enumerate(tqdm(df['Description of the Grievance'])):
        tokenized_text = tokenizer.encode_plus(
            text,
            max_length=256,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks

In [78]:
X_input_ids, X_attn_masks = generate_training_data(df, X_input_ids, X_attn_masks, tokenizer)

0it [00:00, ?it/s]

In [79]:
labels = np.zeros((len(df), 9))
labels.shape

(499, 9)

In [80]:
unique_categories = df['Grievance Category'].unique()

# Map each category name to an integer index (order of first appearance in the data)
category_to_index = {category: index for index, category in enumerate(unique_categories)}

# Integer class index for every row, as a NumPy array
target_values = df['Grievance Category'].map(category_to_index).values

# One-hot encode: one row per sample, with a 1 in the column of its category
num_categories = len(unique_categories)
labels = np.zeros((len(df), num_categories))
labels[np.arange(len(df)), target_values] = 1
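
The encoding step can be followed on a toy example. This sketch uses plain Python lists instead of pandas and NumPy, and the label values are made up for illustration; it also builds the inverse mapping, which is what turns a predicted index back into a category name.

```python
# Toy labels standing in for df['Grievance Category'] (hypothetical values).
categories = ["Billing", "Service", "Access", "Service", "Billing"]

# Distinct labels in order of first appearance, like pandas' unique().
unique = list(dict.fromkeys(categories))
category_to_index = {name: i for i, name in enumerate(unique)}

# One row per sample, with a single 1.0 in the column of its category.
one_hot = [[0.0] * len(unique) for _ in categories]
for row, name in zip(one_hot, categories):
    row[category_to_index[name]] = 1.0

# Inverse mapping: predicted index -> category name.
index_to_category = {i: name for name, i in category_to_index.items()}

print(category_to_index)  # {'Billing': 0, 'Service': 1, 'Access': 2}
print(one_hot[3])         # [0.0, 1.0, 0.0]
```

Because the indices follow the order in which categories first appear, they generally differ from the frequency order printed by `value_counts()`.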


In [81]:
labels

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [82]:
# Build a tf.data pipeline from the token ids, attention masks, and one-hot labels
dataset = tf.data.Dataset.from_tensor_slices((X_input_ids, X_attn_masks, labels))
dataset.take(1) # a single (unbatched) example

<_TakeDataset element_spec=(TensorSpec(shape=(256,), dtype=tf.float64, name=None), TensorSpec(shape=(256,), dtype=tf.float64, name=None), TensorSpec(shape=(9,), dtype=tf.float64, name=None))>

In [83]:
def CategoryDatasetMapFunction(input_ids, attn_masks, labels):
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels

In [84]:
dataset = dataset.map(CategoryDatasetMapFunction) # converting to required format for tensorflow dataset

In [85]:
dataset.take(1)

<_TakeDataset element_spec=({'input_ids': TensorSpec(shape=(256,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(256,), dtype=tf.float64, name=None)}, TensorSpec(shape=(9,), dtype=tf.float64, name=None))>

In [86]:
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(16, drop_remainder=True) # shuffle once so the take/skip split below stays fixed; batch by 16 and drop the final partial batch

In [87]:
p = 0.8
train_size = int((len(df)//16)*p) # len(df)//16 full batches of 16 samples each; take 80% of those batches for training.

In [88]:
train_size

24

In [89]:
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)
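
`take` and `skip` count batches here, not rows, because the dataset was batched first. A sketch of the resulting sizes in plain Python, assuming the 499 rows and batch size of 16 used above and an 80% training fraction; `split_batches` is a hypothetical helper, not part of the notebook:

```python
def split_batches(n_rows, batch_size, train_fraction):
    """Return (train_batches, val_batches) for a take/skip split."""
    # drop_remainder=True discards the final partial batch.
    n_batches = n_rows // batch_size
    # take(n) cannot yield more than n_batches, so clamp the training size.
    train_batches = min(int(n_batches * train_fraction), n_batches)
    return train_batches, n_batches - train_batches

train_batches, val_batches = split_batches(499, 16, 0.8)
print(train_batches, val_batches)  # 24 7
```

Any fraction of 1 or more leaves no batches for validation, since `skip` then passes over the whole dataset.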

### **Model**

In [90]:
from transformers import TFBertModel

In [91]:
model = TFBertModel.from_pretrained('bert-base-cased') # bert base model with pretrained weights

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [92]:
# defining 2 input layers for input_ids and attn_masks
input_ids = tf.keras.layers.Input(shape=(256,), name='input_ids', dtype='int32')
attn_masks = tf.keras.layers.Input(shape=(256,), name='attention_mask', dtype='int32')

bert_embds = model.bert(input_ids, attention_mask=attn_masks)[1] # 0 -> last hidden state (3D), 1 -> pooled output (2D)
intermediate_layer = tf.keras.layers.Dense(512, activation='relu', name='intermediate_layer')(bert_embds)
output_layer = tf.keras.layers.Dense(9, activation='softmax', name='output_layer')(intermediate_layer) # softmax -> calcs probs of classes

Categorization_model = tf.keras.Model(inputs=[input_ids, attn_masks], outputs=output_layer)
Categorization_model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 256)]                0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 256)]                0         []                            
 )                                                                                                
                                                                                                  
 bert (TFBertMainLayer)      TFBaseModelOutputWithPooli   1083102   ['input_ids[0][0]',           
                             ngAndCrossAttentions(last_   72         'attention_mask[0][0]']      
                             hidden_state=(None, 256, 7                                     
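
The summary output above is cut off, but the sizes of the two dense layers follow from the layer definitions. A sketch of that arithmetic, assuming the pooled output of `bert-base-cased` has 768 features (the standard hidden size for BERT base); `dense_params` is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

HIDDEN_SIZE = 768  # assumed width of the BERT base pooled output

intermediate = dense_params(HIDDEN_SIZE, 512)  # intermediate_layer
output = dense_params(512, 9)                  # output_layer

print(intermediate)  # 393728
print(output)        # 4617
```

Both layers are small next to the BERT encoder itself, so most of the trainable weights sit in the pre-trained part of the model.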

In [93]:
optim = tf.keras.optimizers.legacy.Adam(learning_rate=1e-5, decay=1e-6)
loss_func = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')
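
For a single example, the softmax output and the categorical cross-entropy loss can be worked out by hand. The sketch below is plain Python rather than the Keras implementation: the scores are made up, the `eps` clipping value is an assumption, and Keras additionally averages the loss over the batch.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
target = [1.0, 0.0, 0.0]          # true class is index 0

loss = categorical_crossentropy(target, probs)
# Categorical accuracy counts a hit when the argmax matches the true index.
correct = probs.index(max(probs)) == target.index(1.0)
print(round(loss, 3), correct)
```

The loss shrinks toward zero as the probability of the true class approaches 1, while accuracy only checks whether the largest probability lands on the right index.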

In [94]:
Categorization_model.compile(optimizer=optim, loss=loss_func, metrics=[acc])

In [95]:
hist = Categorization_model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=1
)



In [96]:
Categorization_model.save('Categorization_model')

### **Prediction**

In [97]:
Categorization_model = tf.keras.models.load_model('Categorization_model')

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def prepare_data(input_text, tokenizer):
    token = tokenizer.encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token.input_ids, tf.float64),
        'attention_mask': tf.cast(token.attention_mask, tf.float64)
    }

def make_prediction(model, processed_data, classes=None):
    # Class order must match the one-hot encoding used for training,
    # i.e. the order of unique_categories, not the value_counts() order.
    if classes is None:
        classes = list(unique_categories)
    probs = model.predict(processed_data)[0]
    return classes[np.argmax(probs)]
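
Mapping the highest-probability index back to a name only works if the label list is in the same order as the encoding used for training. A minimal sketch in plain Python; `predict_label`, the three labels, and the probabilities are hypothetical:

```python
def predict_label(probs, index_to_category):
    """Return the label whose index has the highest probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return index_to_category[best]

# Hypothetical encoding, in the order the labels were first seen in training.
category_to_index = {"Billing": 0, "Service": 1, "Access": 2}
index_to_category = {i: name for name, i in category_to_index.items()}

print(predict_label([0.1, 0.7, 0.2], index_to_category))  # Service
```

A label list in any other order, such as one sorted by frequency, would return a different name for the same index.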

In [98]:
input_text = input('Enter your grievance here: ')
processed_data = prepare_data(input_text, tokenizer)
result = make_prediction(Categorization_model, processed_data=processed_data)
print(f"Predicted Category: {result}")

Enter your grievance here: Dissatisfaction with chiropractic benefits
Predicted Category: Quality Of Care
