
**Zero-Shot Text Classification**

Zero-shot learning (ZSL) is a machine learning approach, commonly used in natural language processing, that lets us generate predictions for classes a model was never explicitly trained on, without any task-specific training. In practice, zero-shot classification relies on large models pre-trained on enormous datasets that work out of the box.

**Goal of the project**

The Consumer Financial Protection Bureau (CFPB) is a U.S. federal agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute. A zero-shot text classification model could classify these complaints and route them to the appropriate teams more efficiently than manual tagging.


**Attribute information**

Each submission was tagged with one of five financial product classes:

* credit reporting

* debt collection

* mortgages and loans 

* credit cards

* retail banking

**Load the packages**

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import tensorflow as tf

**Load the data**

In [None]:
# Load dataset
df = pd.read_csv('/content/customer_complaints.csv')

In [None]:
# Examine the data
df.head()

| | narrative | product |
|---|---|---|
| 0 | purchase order day shipping amount receive pro... | credit_card |
| 1 | forwarded message date tue subject please inve... | credit_card |
| 2 | forwarded message cc sent friday pdt subject f... | retail_banking |
| 3 | payment history missing credit report speciali... | credit_reporting |
| 4 | payment history missing credit report made mis... | credit_reporting |


In [None]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   narrative  1000 non-null   object
 1   product    1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


**Check for missing values**

In [None]:
df.isnull().sum()

narrative    0
product      0
dtype: int64

We can see there are no null values. 

**Check for special characters**

In [None]:
# Concatenate all narratives into a single string
text_data = ''.join(df['narrative'].values)

# Collect the distinct characters that appear in the text
''.join(set(text_data))

'gost vuykapmdeinbrlqwcfxjhz'

The narratives contain only lowercase letters and spaces, so there are no special characters.
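
The check above reports which characters occur, but not where. As a complementary sketch, the helper below returns the positions of any narratives containing characters other than lowercase letters and spaces. The name `find_rows_with_special_chars` and the allowed pattern are assumptions made for illustration and are not part of the original notebook.

```python
import re

# Assumption: "clean" text consists only of lowercase ASCII letters and spaces.
ALLOWED_PATTERN = re.compile(r'[a-z ]*')

def find_rows_with_special_chars(narratives):
    """Return positions of narratives containing characters outside a-z and space."""
    return [i for i, text in enumerate(narratives)
            if ALLOWED_PATTERN.fullmatch(text) is None]

# Toy narratives for illustration only (not taken from the dataset).
sample = ['payment history missing', 'charged $25 fee!', 'hard inquiry credit report']
print(find_rows_with_special_chars(sample))  # [1]
```

Passing `df['narrative']` instead of the toy list should return an empty list for this dataset, consistent with the character set printed above.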

**Text preprocessing**

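Customer feedback is usually unstructured and often contains unusual text and symbols that must be cleaned before a machine learning model can make good use of it. This dataset is already free of special characters, so the remaining steps are tokenization, stop word removal and lemmatization.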

In [None]:
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

We will now set up our cleaning function.

In [None]:
def text_cleaning(text_data):
  """Tokenize, remove stop words, lemmatize and re-join a narrative."""

  # Tokenization
  tokenizer = ToktokTokenizer()
  text_data = tokenizer.tokenize(text_data)

  # Removing stopwords
  text_data = [item for item in text_data if item not in stop_words]

  # Lemmatization (every token is treated as a verb)
  text_data = [lemmatizer.lemmatize(word = w, pos = 'v') for w in text_data]

  # Convert the list of tokens back to a single string
  text_data = ' '.join(text_data)

  return text_data

In [None]:
df['clean_narrative'] = df['narrative'].apply(text_cleaning)

We get the following output:

In [None]:
df[['narrative', 'clean_narrative']]

| | narrative | clean_narrative |
|---|---|---|
| 0 | purchase order day shipping amount receive pro... | purchase order day ship amount receive product... |
| 1 | forwarded message date tue subject please inve... | forward message date tue subject please invest... |
| 2 | forwarded message cc sent friday pdt subject f... | forward message cc send friday pdt subject fin... |
| 3 | payment history missing credit report speciali... | payment history miss credit report specialize ... |
| 4 | payment history missing credit report made mis... | payment history miss credit report make mistak... |
| ... | ... | ... |
| 995 | bank america add hard inquiry credit report pe... | bank america add hard inquiry credit report pe... |
| 996 | opened premium checking bundle advertised main... | open premium check bundle advertise main land ... |
| 997 | opened premium checking bundle advertised main... | open premium check bundle advertise main land ... |
| 998 | original account number date original account ... | original account number date original account ... |


**Testing for GPU**

In [None]:
import torch

In [None]:
# Whether cuda is available
torch.cuda.is_available()

True

In [None]:
# Use the current GPU index if CUDA is available, otherwise -1 (CPU)
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

In [None]:
print(device)

0


**Transformer Pipeline**

We will use the pipeline() function to load a zero-shot-classification pipeline backed by the valhalla/distilbart-mnli-12-9 model.

In [None]:
from transformers import pipeline

In [None]:
task = 'zero-shot-classification'
zero_shot_model = 'valhalla/distilbart-mnli-12-9'
zero_shot_classifier = pipeline(task, zero_shot_model, device = device)

We can use this pipeline by passing in a sequence and a list of candidate labels. By default, the pipeline assumes that exactly one of the candidate labels is true, and returns one score per label, with the scores summing to 1.
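
To make the normalization concrete, here is a minimal sketch of that last step, assuming the model yields one entailment logit per candidate label and that single-label mode applies a softmax across them. The logits below are made up for illustration; the real pipeline computes them with the NLI model by pairing the text with a hypothesis built from each label.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits, one per candidate label (illustrative values only).
entailment_logits = {
    'credit_card': 2.1,
    'credit_reporting': 0.9,
    'debt_collection': 0.4,
}

scores = dict(zip(entailment_logits, softmax(list(entailment_logits.values()))))
for label, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{label}: {score:.3f}')
```

If several labels can apply to the same text, the pipeline also accepts `multi_label=True`, in which case each label is scored independently and the scores no longer sum to 1.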

In [None]:
sequence = df['clean_narrative'][0]

In [None]:
candidate_labels = ['credit_reporting',       
                    'debt_collection',        
                    'mortgages_and_loans',    
                    'retail_banking',          
                    'credit_card'] 

In [None]:
outputs = zero_shot_classifier(sequences = sequence, candidate_labels = candidate_labels)

Let’s take a look at the outputs.

In [None]:
for label, score in zip(outputs['labels'], outputs['scores']):
    print(f'{label}: {score:.3f}')

credit_card: 0.580
credit_reporting: 0.171
debt_collection: 0.103
retail_banking: 0.099
mortgages_and_loans: 0.047


The model correctly identifies credit_card as the most likely label, with a score of 0.580. The other labels (credit_reporting, debt_collection, retail_banking and mortgages_and_loans) all score much lower, at 0.171 or below.
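
Since the goal is to route complaints to teams, the top score could also be used to decide when a prediction is trusted. The sketch below assumes the `outputs` dictionary has the `labels` and `scores` keys shown above, sorted by descending score. The function name `route_or_review` and the thresholds are hypothetical choices, not something the original notebook defines or tunes.

```python
def route_or_review(outputs, min_score=0.5):
    """Return the top label if its score reaches min_score, otherwise 'manual_review'."""
    top_label, top_score = outputs['labels'][0], outputs['scores'][0]
    return top_label if top_score >= min_score else 'manual_review'

# Scores copied from the printed output above.
example_outputs = {
    'labels': ['credit_card', 'credit_reporting', 'debt_collection',
               'retail_banking', 'mortgages_and_loans'],
    'scores': [0.580, 0.171, 0.103, 0.099, 0.047],
}

print(route_or_review(example_outputs))                 # credit_card
print(route_or_review(example_outputs, min_score=0.8))  # manual_review
```

A suitable threshold would have to be chosen by inspecting score distributions on labelled data.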

**Classify all the submissions**

In [None]:
# Reuse the pipeline loaded earlier instead of loading the same model a second time
classifier = zero_shot_classifier

In [None]:
candidate_labels = ['credit_reporting',       
                    'debt_collection',        
                    'mortgages_and_loans',    
                    'retail_banking',          
                    'credit_card']  

In [None]:
# Compute the predicted label for each submission
df['label_pred_zero_shot'] = df['clean_narrative'].apply(lambda x: classifier(x, candidate_labels = candidate_labels)['labels'][0])
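
Calling the classifier once per row via `apply` works, but the pipeline can also take a list of texts, which is usually faster on a GPU. The helper below is a sketch under that assumption: `predict_top_labels` and its `chunk_size` parameter are hypothetical names, and the function relies only on the classifier returning one dictionary per input with a `labels` list sorted by score, as seen earlier. A stand-in classifier is used so the example runs without downloading a model.

```python
def predict_top_labels(classifier, texts, candidate_labels, chunk_size=32):
    """Classify texts in chunks and return the highest-scoring label for each."""
    texts = list(texts)
    predictions = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results = classifier(chunk, candidate_labels=candidate_labels)
        # A single input may come back as a bare dict rather than a list.
        if isinstance(results, dict):
            results = [results]
        predictions.extend(result['labels'][0] for result in results)
    return predictions

# Stand-in classifier (an assumption for this sketch): it mimics the output
# shape only and always ranks the first candidate label highest.
def fake_classifier(sequences, candidate_labels):
    return [{'labels': list(candidate_labels), 'scores': []} for _ in sequences]

print(predict_top_labels(fake_classifier, ['text a', 'text b', 'text c'],
                         ['credit_card', 'retail_banking'], chunk_size=2))
```

With the real pipeline, the call would look like `df['label_pred_zero_shot'] = predict_top_labels(classifier, df['clean_narrative'], candidate_labels)`.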

At this point, we have a dataset that contains labels produced by the zero-shot classifier.

In [None]:
df[['product','label_pred_zero_shot']]

| | product | label_pred_zero_shot |
|---|---|---|
| 0 | credit_card | credit_card |
| 1 | credit_card | retail_banking |
| 2 | retail_banking | credit_card |
| 3 | credit_reporting | credit_reporting |
| 4 | credit_reporting | credit_reporting |
| ... | ... | ... |
| 995 | credit_reporting | credit_reporting |
| 996 | retail_banking | retail_banking |
| 997 | retail_banking | retail_banking |
| 998 | debt_collection | credit_reporting |


Finally, we can evaluate the model against the labels already present in the original dataset.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Calculate the accuracy of the zero-shot model
accuracy_score(df['product'], df['label_pred_zero_shot'])

0.77

The zero-shot classifier does a decent job, reaching 77% accuracy without any training. Glancing at a few random submissions incorrectly labeled by the zero-shot classifier, there does not seem to be a particularly problematic class, although such an assertion would require further investigation.
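
One way to start that further investigation is to compute accuracy per class (equivalently, per-class recall). The sketch below uses only the standard library and toy label lists; `per_class_accuracy` is a hypothetical helper, and on the real data the two arguments would be `df['product']` and `df['label_pred_zero_shot']`.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return, for each true class, the fraction of its samples predicted correctly."""
    totals = Counter()
    correct = Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels for illustration only (not results from the notebook).
y_true = ['credit_card', 'credit_card', 'retail_banking', 'debt_collection']
y_pred = ['credit_card', 'retail_banking', 'retail_banking', 'credit_reporting']

for label, acc in per_class_accuracy(y_true, y_pred).items():
    print(f'{label}: {acc:.2f}')
```

For a fuller picture, scikit-learn's `classification_report` and `confusion_matrix` report precision, recall and the specific classes that get confused with one another.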