## Complaint Categorization Baseline Model

Handling consumer complaints quickly and efficiently is vital to commerce today. This notebook presents a baseline approach to categorizing complaints automatically, using a dataset of consumer complaints about financial products to establish results.

The tf-idf (term frequency times inverse document frequency) scheme for weighting individual tokens is widely used in information retrieval. One advantage of tf-idf is that it reduces the impact of tokens that occur very frequently across the corpus and therefore carry little information.
The tf-idf of term $t$ in document $d$ is $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$, where $\text{tf}(t, d)$ is the number of times $t$ occurs in $d$ and, in the smoothed form that scikit-learn uses by default, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$. Here $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain $t$.
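As a rough illustration of the weighting, the sketch below computes smoothed idf values, assuming the form $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, on a tiny made-up corpus. The corpus and the helper names `idf` and `tfidf` are hypothetical and use only the standard library. Note that `TfidfVectorizer` additionally L2-normalizes each document vector by default, which this sketch leaves out.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the complaints dataset)
docs = [
    "credit report error",
    "credit card payment",
    "loan payment late",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency of a term."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Unnormalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first document, "credit" appears in two of the three documents and so receives a lower weight than "report" or "error", which each appear in only one.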

In [None]:
# Import required libraries

import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string
from nltk.corpus import stopwords

#from gensim.models import Word2Vec
#from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK resources (only needed once per environment)
nltk.download('stopwords',quiet=True) # stopword library
nltk.download('wordnet', quiet=True) # wordnet library
nltk.download('words', quiet=True) # words library
nltk.download('punkt', quiet=True) # tokenize library


True

In [None]:
# Read the dataset
df = pd.read_csv('complaints.csv')

# Information about the dataset
print(df.info())
print('-'*60)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179776 entries, 0 to 179775
Data columns (total 2 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Consumer complaint narrative  179776 non-null  object
 1   Product                       179776 non-null  object
dtypes: object(2)
memory usage: 2.7+ MB
None
------------------------------------------------------------


Unnamed: 0,Consumer complaint narrative,Product
0,I have outdated information on my credit repor...,Credit reporting
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan
2,An account on my credit report has a mistaken ...,Credit reporting
3,This company refuses to provide me verificatio...,Debt collection
4,This complaint is in regards to Square Two Fin...,Debt collection


- There are no null values in the dataframe.
- Further analysis is based on the 'Consumer complaint narrative' feature.

### Typical Complaint

In [None]:
df['Consumer complaint narrative'][0]

'I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements'

### Categories

In [None]:
print(df.Product.unique())

['Credit reporting' 'Consumer Loan' 'Debt collection' 'Mortgage'
 'Credit card' 'Other financial service' 'Bank account or service'
 'Student loan' 'Money transfers' 'Payday loan' 'Prepaid card'
 'Virtual currency'
 'Credit reporting, credit repair services, or other personal consumer reports'
 'Credit card or prepaid card' 'Checking or savings account'
 'Payday loan, title loan, or personal loan'
 'Money transfer, virtual currency, or money service'
 'Vehicle loan or lease']


# Preprocessing steps
- Method 1 = Normalization + Tokenization
    - Normalization = Lowercasing + Punctuation removal
- Method 2 = Method 1 + Lemmatization + Stop-word removal
- Method 3 = Method 2 + Removal of alphanumeric tokens

### Method 1 

In [None]:
# Normalization

def lower_case(text):
    return text.lower()

def remove_punctuation(text):
    # Replace every non-letter character (punctuation, digits, symbols) with a space
    return re.sub('[^a-zA-Z]', ' ', str(text))

def normalize_document(text):
    text = remove_punctuation(text)
    text = lower_case(text)
    return text

In [None]:
df['normalize_document'] = df['Consumer complaint narrative'].apply(normalize_document)
df.head() 

Unnamed: 0,Consumer complaint narrative,Product,normalize_document
0,I have outdated information on my credit repor...,Credit reporting,i have outdated information on my credit repor...
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,i purchased a new car on xxxx xxxx the car de...
2,An account on my credit report has a mistaken ...,Credit reporting,an account on my credit report has a mistaken ...
3,This company refuses to provide me verificatio...,Debt collection,this company refuses to provide me verificatio...
4,This complaint is in regards to Square Two Fin...,Debt collection,this complaint is in regards to square two fin...


In [None]:
# Tokenize the normalized documents
df['Method1_doc'] = df['normalize_document'].apply(nltk.word_tokenize)
df.head()

Unnamed: 0,Consumer complaint narrative,Product,normalize_document,Method1_doc
0,I have outdated information on my credit repor...,Credit reporting,i have outdated information on my credit repor...,"[i, have, outdated, information, on, my, credi..."
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,i purchased a new car on xxxx xxxx the car de...,"[i, purchased, a, new, car, on, xxxx, xxxx, th..."
2,An account on my credit report has a mistaken ...,Credit reporting,an account on my credit report has a mistaken ...,"[an, account, on, my, credit, report, has, a, ..."
3,This company refuses to provide me verificatio...,Debt collection,this company refuses to provide me verificatio...,"[this, company, refuses, to, provide, me, veri..."
4,This complaint is in regards to Square Two Fin...,Debt collection,this complaint is in regards to square two fin...,"[this, complaint, is, in, regards, to, square,..."


In [None]:
# Remove the intermediate normalize_document column
df.drop(columns=['normalize_document'], inplace=True)
df.head()

Unnamed: 0,Consumer complaint narrative,Product,Method1_doc
0,I have outdated information on my credit repor...,Credit reporting,"[i, have, outdated, information, on, my, credi..."
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,"[i, purchased, a, new, car, on, xxxx, xxxx, th..."
2,An account on my credit report has a mistaken ...,Credit reporting,"[an, account, on, my, credit, report, has, a, ..."
3,This company refuses to provide me verificatio...,Debt collection,"[this, company, refuses, to, provide, me, veri..."
4,This complaint is in regards to Square Two Fin...,Debt collection,"[this, complaint, is, in, regards, to, square,..."


### Method 2 = Method 1 + Lemmatization + Stopwords

In [None]:
stops = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def method2(tokens, lemma=True):
    """Remove stop words from a token list, optionally lemmatize, and return a string."""
    # Remove stop words
    tokens = [word for word in tokens if word not in stops]

    # Lemmatize each remaining token (WordNet lemmatizer, default noun POS)
    if lemma:
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

In [None]:
df['Method2_doc'] = df['Method1_doc'].apply(method2)
df.head()

Unnamed: 0,Consumer complaint narrative,Product,Method1_doc,Method2_doc
0,I have outdated information on my credit repor...,Credit reporting,"[i, have, outdated, information, on, my, credi...",outdated information credit report previously ...
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,"[i, purchased, a, new, car, on, xxxx, xxxx, th...",purchased new car xxxx xxxx car dealer called ...
2,An account on my credit report has a mistaken ...,Credit reporting,"[an, account, on, my, credit, report, has, a, ...",account credit report mistaken date mailed deb...
3,This company refuses to provide me verificatio...,Debt collection,"[this, company, refuses, to, provide, me, veri...",company refuse provide verification validation...
4,This complaint is in regards to Square Two Fin...,Debt collection,"[this, complaint, is, in, regards, to, square,...",complaint regard square two financial refer cf...


### Method 3 = Method 2 + Remove alpha numeric tokens

In [None]:
only_english = set(nltk.corpus.words.words())
def method3(text):
    
    sample = text
    sample = re.sub(r"\S*https?:\S*", '', sample) # links and URLs
    sample = re.sub(r'\[.*?\]', '', sample) # text between [square brackets]
    sample = re.sub(r'\(.*?\)', '', sample) # text between (parentheses)
    sample = re.sub('[%s]' % re.escape(string.punctuation), '', sample) # punctuation
    sample = re.sub(r'\w*\d\w*', '', sample) # tokens containing digits
    sample = re.sub(r'\n', ' ', sample) # newline character
    sample = re.sub(r'\\n', ' ', sample) # literal backslash-n sequence
    sample = re.sub("['\".“”‘’…]", '', sample) # quotation marks, periods, and ellipses
    sample = re.sub(r'<[^>]+>', '', sample) # HTML tags
    
    sample = ' '.join([w for w in nltk.wordpunct_tokenize(sample) if w.lower() in only_english or not w.isalpha()]) # keep English dictionary words; may not remove Indian-language text
    sample = ' '.join(list(filter(lambda ele: re.search(r"[a-zA-Z\s]+", ele) is not None, sample.split()))) # drop tokens without Latin letters (languages other than English)
    
    sample = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc. symbols and arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE).sub(r'', sample) # emojis and symbols
    sample = sample.strip()
    sample = " ".join([x.strip() for x in sample.split()])
    
    return sample

In [None]:
df['Method3_doc'] = df['Method2_doc'].apply(method3)
df.loc[:, ['Consumer complaint narrative', 'Method1_doc', 'Method2_doc', 'Method3_doc']].head()

Unnamed: 0,Consumer complaint narrative,Method1_doc,Method2_doc,Method3_doc
0,I have outdated information on my credit repor...,"[i, have, outdated, information, on, my, credi...",outdated information credit report previously ...,outdated information credit report previously ...
1,I purchased a new car on XXXX XXXX. The car de...,"[i, purchased, a, new, car, on, xxxx, xxxx, th...",purchased new car xxxx xxxx car dealer called ...,new car car dealer citizen bank get day payoff...
2,An account on my credit report has a mistaken ...,"[an, account, on, my, credit, report, has, a, ...",account credit report mistaken date mailed deb...,account credit report mistaken date mailed deb...
3,This company refuses to provide me verificatio...,"[this, company, refuses, to, provide, me, veri...",company refuse provide verification validation...,company refuse provide verification validation...
4,This complaint is in regards to Square Two Fin...,"[this, complaint, is, in, regards, to, square,...",complaint regard square two financial refer cf...,complaint regard square two financial refer ca...


### Bag of words
- Bag of words (BOW) is a technique for extracting features from text.
- The vocabulary consists of the words that remain after all the preprocessing steps.
- The bag-of-words model represents a sentence by its word counts, ignoring word order.
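To make the idea concrete, here is a minimal standard-library sketch of a bag-of-words count matrix on three made-up documents. The documents and the helper `to_count_vector` are hypothetical; the notebook itself builds the matrix with the Keras `Tokenizer`.

```python
from collections import Counter

# Toy documents (hypothetical), already preprocessed into space-separated tokens
docs = ["credit report credit", "loan payment", "credit loan"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order, mapped to a column index
vocab = sorted({token for tokens in tokenized for token in tokens})
index = {token: i for i, token in enumerate(vocab)}

def to_count_vector(tokens):
    """Return the word-count vector of one document over the vocabulary."""
    vector = [0] * len(vocab)
    for term, count in Counter(tokens).items():
        vector[index[term]] = count
    return vector

bow = [to_count_vector(tokens) for tokens in tokenized]
print(vocab)
print(bow)
```

Each row is one document and each column one vocabulary term, so word order is discarded and only counts remain.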

In [None]:
from keras.preprocessing.text import Tokenizer

# Collect the Method 3 documents as a plain list of strings
sentence = df['Method3_doc'].tolist()

# Fit the Keras tokenizer to build the vocabulary
model = Tokenizer()
model.fit_on_texts(sentence)

# Print the first 20 keys and the vocabulary size
keys = list(model.word_index.keys())
print(f'Key : {keys[0:20]}')
print('Total_Keys:', len(keys)) 
print('-'*40)

# Create the bag-of-words count matrix (one row per document)
bow = model.texts_to_matrix(sentence, mode='count')
print(bow)

Key : ['account', 'credit', 'payment', 'loan', 'would', 'time', 'bank', 'report', 'debt', 'told', 'n', 'information', 'call', 'company', 'received', 'card', 'mortgage', 'day', 'month', 'letter']
Total_Keys: 19017
----------------------------------------
[[0. 0. 2. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 2. 3. ... 0. 0. 0.]
 [0. 0. 3. ... 0. 0. 0.]]


### TF-IDF Experiment
- The TF-IDF vectorizer is applied to the output of all three preprocessing methods.
- The importance of different terms under each method can then be compared.

### Method 1 
- Because Method 1 yields each document as a list of tokens, the TF-IDF vectorizer is configured with pass-through tokenizer and preprocessor functions, as defined in the next cell.

In [None]:
# Defining the TF-IDF vectorizer

def dummy_fun(doc):
    # Identity function: the documents are already lists of tokens
    return doc

tfidf = TfidfVectorizer(analyzer='word',
                        tokenizer=dummy_fun,
                        preprocessor=dummy_fun,
                        token_pattern=None,
                        stop_words='english')

In [None]:
tfidf_method1 = tfidf.fit_transform(df.Method1_doc)
tfidf_method1.shape


(179776, 78812)

In [None]:
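# Create a DataFrame of tf-idf scores
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df_method1 = pd.DataFrame(tfidf_method1.toarray(),
             columns = tfidf.get_feature_names_out(),
             index = df.index)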

In [None]:
# Calculate tfidf for all columns and list top 10
tfidf_df_method1.mean().sort_values(ascending = False).head(10)  

xxxx           0.169184
credit         0.048357
xx             0.047034
account        0.041912
debt           0.030252
report         0.029436
loan           0.028440
payment        0.026103
bank           0.024022
information    0.021939
dtype: float64

### TF-IDF Method 2
- Because Method 2 preprocessing returns each document as a sentence (a single string), the TF-IDF vectorizer is defined in the conventional way.

In [None]:
# Pass the stop words as a list; recent scikit-learn versions reject a set here
vectorizer_tfidf = TfidfVectorizer(stop_words = stopwords.words('english'))
tfidf_method2 = vectorizer_tfidf.fit_transform(df.Method2_doc)
tfidf_method2.shape

(179776, 74110)

In [None]:
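# Create a DataFrame of tf-idf scores
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df_method2 = pd.DataFrame(tfidf_method2.toarray(),
             columns = vectorizer_tfidf.get_feature_names_out(),
             index = df.index)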

In [None]:
# Calculate tfidf for all columns and list top 10
tfidf_df_method2.mean().sort_values(ascending = False).head(10)  

xxxx       0.168266
credit     0.048320
account    0.046734
xx         0.046716
payment    0.036594
loan       0.033560
report     0.032294
debt       0.031119
bank       0.024618
company    0.023310
dtype: float64

### TF-IDF Method 3
- Because Method 3 preprocessing returns each document as a sentence (a single string), the TF-IDF vectorizer is defined in the conventional way.

In [None]:
# Pass the stop words as a list; recent scikit-learn versions reject a set here
vectorizer_tfidf = TfidfVectorizer(stop_words = stopwords.words('english'))
tfidf_method3 = vectorizer_tfidf.fit_transform(df.Method3_doc)
tfidf_method3.shape

(179776, 18961)

In [None]:
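# Create a DataFrame of tf-idf scores
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df_method3 = pd.DataFrame(tfidf_method3.toarray(),
             columns = vectorizer_tfidf.get_feature_names_out(),
             index = df.index)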

In [None]:
# Calculate tfidf for all columns and list top 10
tfidf_df_method3.mean().sort_values(ascending = False).head(10) 

credit     0.058716
account    0.057299
payment    0.044387
loan       0.040472
report     0.039748
debt       0.037294
bank       0.029835
company    0.027746
card       0.027173
would      0.026787
dtype: float64

- The TF-IDF vectorizer was applied to the output of all three preprocessing techniques; the results above list the 10 terms with the highest mean TF-IDF score in the consumer complaint narratives.
- Models are trained next with each preprocessing technique separately, and the results are compared.

### Training a model with Method1 preprocessing
- Train-test split: 15% of the data is held out for validation and the remaining 85% is used for training. This gives 152809 training instances and 26967 validation instances.
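The split sizes can be checked with a little arithmetic. The sketch below assumes the validation count is rounded up and the training set receives the remainder, which is my understanding of how `train_test_split` treats a fractional `test_size`.

```python
import math

n_samples = 179776   # number of rows reported by df.info()
test_size = 0.15     # fraction held out for validation

# Assumption: the validation count is rounded up, training gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```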

In [None]:
def sentence(tokens):
    # Join a list of tokens back into a single space-separated string
    return ' '.join(tokens)

In [None]:
df['Method1'] = df['Method1_doc'].apply(sentence)
df.head()

Unnamed: 0,Consumer complaint narrative,Product,Method1_doc,Method1
0,I have outdated information on my credit repor...,Credit reporting,"[i, have, outdated, information, on, my, credi...",i have outdated information on my credit repor...
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,"[i, purchased, a, new, car, on, xxxx, xxxx, th...",i purchased a new car on xxxx xxxx the car dea...
2,An account on my credit report has a mistaken ...,Credit reporting,"[an, account, on, my, credit, report, has, a, ...",an account on my credit report has a mistaken ...
3,This company refuses to provide me verificatio...,Debt collection,"[this, company, refuses, to, provide, me, veri...",this company refuses to provide me verificatio...
4,This complaint is in regards to Square Two Fin...,Debt collection,"[this, complaint, is, in, regards, to, square,...",this complaint is in regards to square two fin...


In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['Method1'].values, df['Product'].values, test_size=0.15, random_state=42)
print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

Training utterances: 152809
Validation utterances: 26967


##### Calculating tf-idf scores
The TF-IDF weights are learned from the training split only: the vectorizer is fitted on the training utterances and then used to transform both the training and validation utterances into sparse TF-IDF feature matrices.

In [None]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

TfidfVectorizer()

In [None]:
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
X_train, X_test

(<152809x72535 sparse matrix of type '<class 'numpy.float64'>'
 	with 13620353 stored elements in Compressed Sparse Row format>,
 <26967x72535 sparse matrix of type '<class 'numpy.float64'>'
 	with 2397990 stored elements in Compressed Sparse Row format>)

##### Feature Selection
The chi-square test measures dependence between stochastic variables, so using it for feature selection “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification. Here the 5000 highest-scoring features are kept.
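The following standard-library sketch shows how such a score can be computed by hand for a tiny term-count matrix. It mirrors my understanding of how `sklearn.feature_selection.chi2` scores non-negative features (observed per-class feature totals against totals expected under independence). The function `chi2_scores` and the toy data are hypothetical.

```python
def chi2_scores(X, y):
    """Chi-square score of each feature column in X against the class labels y."""
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])

    # Fraction of samples in each class
    class_prob = {c: y.count(c) / n_samples for c in classes}
    # Total value of each feature over all samples (assumed non-zero here)
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]

    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * feature_total[j]
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy counts for two terms: column 0 = "mortgage", column 1 = "account"
X_toy = [[2, 1], [3, 1], [0, 1], [0, 1]]
y_toy = ["Mortgage", "Mortgage", "Debt collection", "Debt collection"]

scores = chi2_scores(X_toy, y_toy)
print(scores)
```

The term that occurs only in one class gets a high score, while the term spread evenly across both classes scores zero and would be among the first to be dropped.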

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=5000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

X_train, X_test

(<152809x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 10704472 stored elements in Compressed Sparse Row format>,
 <26967x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 1889836 stored elements in Compressed Sparse Row format>)

##### Naive Bayes
In multinomial naive Bayes, the probability of a document $d$ being in class $c$ is computed as $$P(c|d) \propto P(c) \prod_{1\le k \le n_d}{P(t_k|c)} $$ where $P(c)$ is the prior probability of a document occurring in class $c$, $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$, and $n_d$ is the number of tokens in $d$.
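The product above can be sketched in a few lines of standard-library Python. The version below works in log space to avoid underflow and uses add-one (Laplace) smoothing for $P(t_k|c)$, which is an assumption of this sketch (it matches the default `alpha=1.0` of `MultinomialNB`). It uses raw counts rather than tf-idf weights, and the training examples and helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training set (hypothetical): (document, class) pairs
train = [
    ("credit report error", "Credit reporting"),
    ("credit report dispute", "Credit reporting"),
    ("loan payment late", "Consumer Loan"),
]

class_docs = Counter(label for _, label in train)   # documents per class
term_counts = defaultdict(Counter)                  # term counts per class
for text, label in train:
    term_counts[label].update(text.split())
vocab = {term for counts in term_counts.values() for term in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(c|d) = log P(c) + sum over tokens of log P(t_k|c)."""
    log_prob = math.log(class_docs[label] / len(train))
    total = sum(term_counts[label].values())
    for term in text.split():
        if term not in vocab:
            continue  # ignore terms never seen in training
        p_term = (term_counts[label][term] + alpha) / (total + alpha * len(vocab))
        log_prob += math.log(p_term)
    return log_prob

def predict(text):
    """Return the class with the highest unnormalized log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("credit report"))
print(predict("loan late"))
```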

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
model_mnb_m1 = MultinomialNB()
model_mnb_m1.fit(X_train, y_train)
pred = model_mnb_m1.predict(X_test)
print(accuracy_score(y_test, pred))

0.7624133199836838


### With minimal preprocessing the accuracy obtained is about 76%. Let us train the model with the other preprocessing techniques too.

In [None]:
# Training the model with Method 2 preprocessing

## Train_Test Split
X_train, X_test, y_train, y_test = train_test_split(df['Method2_doc'].values, df['Product'].values, test_size=0.15, random_state=42)
print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

## TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

## Feature Selection
from sklearn.feature_selection import SelectKBest, chi2
ch2 = SelectKBest(chi2, k=5000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

## Model Testing
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
model_mnb_m2 = MultinomialNB()
model_mnb_m2.fit(X_train, y_train)
pred = model_mnb_m2.predict(X_test)
print(accuracy_score(y_test, pred))

Training utterances: 152809
Validation utterances: 26967
0.7655653205770016


In [None]:
# Training the model with Method 3 preprocessing

## Train_Test Split
X_train, X_test, y_train, y_test = train_test_split(df['Method3_doc'].values, df['Product'].values, test_size=0.15, random_state=42)
print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

## TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

## Feature Selection
from sklearn.feature_selection import SelectKBest, chi2
ch2 = SelectKBest(chi2, k=5000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

## Model Testing
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
model_mnb_m3 = MultinomialNB()
model_mnb_m3.fit(X_train, y_train)
pred = model_mnb_m3.predict(X_test)
print(accuracy_score(y_test, pred))

Training utterances: 152809
Validation utterances: 26967
0.7340823970037453


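- The comparative analysis shows that Method 2 performs marginally better than Method 1 (76.6% vs. 76.2% accuracy), whereas Method 3 drops to 73.4%. This suggests that overly aggressive preprocessing reduces model performance, because some informative words and characters are removed unnecessarily.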
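The three training cells above repeat the same split, vectorize, select, and fit steps. One possible way to avoid the repetition is to wrap them in a scikit-learn `Pipeline`, as sketched below. The helper `evaluate_preprocessing` is hypothetical and the toy data only demonstrates that it runs; it is not expected to reproduce the accuracies reported above exactly unless called with the same columns and settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate_preprocessing(texts, labels, k=5000, test_size=0.15, random_state=42):
    """Split, vectorize, select features, train naive Bayes, and return validation accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state)
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectKBest(chi2, k=k)),
        ("nb", MultinomialNB()),
    ])
    pipeline.fit(X_train, y_train)
    return accuracy_score(y_test, pipeline.predict(X_test))

# Toy data (hypothetical); k="all" because the toy vocabulary has fewer than 5000 terms
toy_texts = ["credit report error"] * 10 + ["loan payment late"] * 10
toy_labels = ["Credit reporting"] * 10 + ["Consumer Loan"] * 10
toy_accuracy = evaluate_preprocessing(toy_texts, toy_labels, k="all", test_size=0.2)
print(toy_accuracy)

# Possible use with the notebook's columns:
# for column in ['Method1', 'Method2_doc', 'Method3_doc']:
#     print(column, evaluate_preprocessing(df[column].values, df['Product'].values))
```

Because the pipeline fits the vectorizer and the feature selector on the training split only, it keeps the same train/validation separation as the original cells.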