## Complaint Categorization using Word Embeddings

In [1]:
from nltk.tokenize import RegexpTokenizer
import numpy as np
import re

In [2]:
import pandas as pd
complaints_dataframe = pd.read_csv('complaints.csv') 


In [4]:
complaints_dataframe.head()

Unnamed: 0,Consumer complaint narrative,Product
0,I have outdated information on my credit repor...,Credit reporting
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan
2,An account on my credit report has a mistaken ...,Credit reporting
3,This company refuses to provide me verificatio...,Debt collection
4,This complaint is in regards to Square Two Fin...,Debt collection


In [5]:
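def convert_complaint_to_words(comp):
    # Split the complaint into runs of word characters.
    converted_words = RegexpTokenizer(r'\w+').tokenize(comp)
    # Strip the 'XXXX' redaction masks and digits, then lowercase.
    # Note: this removes every 'x'/'X', including those inside ordinary words.
    converted_words = [re.sub(r'[xX]+|\d+', '', w).lower() for w in converted_words]
    converted_words = [w for w in converted_words if w != '']

    return converted_words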
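
To make the cleaning step concrete, here is a minimal sketch on a made-up sentence. It assumes `re.findall(r'\w+', ...)` as a stand-in for `RegexpTokenizer` (both return runs of word characters), and `tokenize_sketch` is a hypothetical name, not part of the notebook. It shows that a word such as "tax" loses its "x" along with the redaction masks.

```python
import re

def tokenize_sketch(text):
    # Stand-in for the NLTK tokenizer: collect runs of word characters.
    words = re.findall(r'\w+', text)
    # Remove runs of x/X (redaction masks) and digits, then lowercase.
    cleaned = [re.sub(r'[xX]+|\d+', '', w).lower() for w in words]
    # Drop tokens that became empty after cleaning.
    return [w for w in cleaned if w]

print(tokenize_sketch('On XXXX I paid $ 100 in tax.'))
# prints ['on', 'i', 'paid', 'in', 'ta']
```

If keeping such words intact matters, one option would be to remove only tokens that consist entirely of `X` characters.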

In [6]:
all_words = list()
for comp in complaints_dataframe['Consumer complaint narrative']:
    for w in convert_complaint_to_words(comp):
        all_words.append(w)

In [7]:
print('Size of vocabulary: {}'.format(len(set(all_words))))

Size of vocabulary: 76908


In [9]:
print('Complaint is \n', complaints_dataframe['Consumer complaint narrative'][10], '\n')
print('Tokens are\n', convert_complaint_to_words(complaints_dataframe['Consumer complaint narrative'][10]))

Complaint is 
 Without provocation, I received notice that my credit line was being decreased by nearly 100 %. My available credit was reduced from $ XXXX to XXXX ( the rough amount of my available balance ). 

When I called to question the change, I was provided a nob-descript response referencing my XXXX report. It was my understanding that under the FCRA I was entitled to a copy of this report, but was refused by Citi and have been given no further explanation. 

This is predatory in that it affects my utilization of credit, further subjecting me to increase in APrs, etc and a higher cost of credit without any reason. 

Tokens are
 ['without', 'provocation', 'i', 'received', 'notice', 'that', 'my', 'credit', 'line', 'was', 'being', 'decreased', 'by', 'nearly', 'my', 'available', 'credit', 'was', 'reduced', 'from', 'to', 'the', 'rough', 'amount', 'of', 'my', 'available', 'balance', 'when', 'i', 'called', 'to', 'question', 'the', 'change', 'i', 'was', 'provided', 'a', 'nob', 'descript

### Indexing


In [10]:
# Index 0 is reserved for unknown words; sorting makes the indices reproducible across runs.
index_dictionary = {'<unk>': 0}
for count, word in enumerate(sorted(set(all_words)), start=1):
    index_dictionary[word] = count
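
The notebook does not use `index_dictionary` again after this cell. One plausible use, sketched here on a toy vocabulary, is to turn a tokenized complaint into a list of integer ids, with unseen words falling back to `<unk>`. The helper names `build_index` and `encode` are hypothetical.

```python
def build_index(words):
    # Index 0 is reserved for unknown words.
    index = {'<unk>': 0}
    for i, word in enumerate(sorted(set(words)), start=1):
        index[word] = i
    return index

def encode(tokens, index):
    # Map each token to its id, or to the <unk> id if it is not in the vocabulary.
    return [index.get(t, index['<unk>']) for t in tokens]

toy_index = build_index(['credit', 'report', 'credit', 'loan'])
print(toy_index)  # {'<unk>': 0, 'credit': 1, 'loan': 2, 'report': 3}
print(encode(['credit', 'card', 'report'], toy_index))  # [1, 0, 3]
```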

### Loading pretrained GloVe embeddings

In [11]:
# Map each word to its 300-dimensional GloVe vector.
embeddings_index = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
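
The loop above relies on the GloVe text layout: one word per line, followed by its space-separated coefficients. A small sketch with two made-up 3-dimensional lines (not real GloVe values) shows the same parsing on an in-memory file:

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout: a word, then its coefficients.
fake_glove = io.StringIO("credit 0.1 -0.2 0.3\nloan 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in fake_glove:
    values = line.split()
    # First field is the word, the rest are the vector components.
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

print(sorted(toy_embeddings))          # ['credit', 'loan']
print(toy_embeddings['loan'].shape)    # (3,)
```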

#### Averaging the word embeddings of each complaint to produce a fixed-length sentence representation

In [13]:
complaints_list = list()
for comp in complaints_dataframe['Consumer complaint narrative']:
    sentence = np.zeros(300)
    count = 0
    for w in convert_complaint_to_words(comp):
        try:
            sentence += embeddings_index[w]
            count += 1
        except KeyError:
            # Skip words that have no GloVe vector.
            continue
    # Guard against complaints with no known words to avoid dividing by zero.
    complaints_list.append(sentence / count if count else sentence)
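
As a small worked example of the averaging step, the sketch below uses a toy 2-dimensional embedding table (invented values) and a hypothetical helper `average_embedding`. Words without a vector are skipped, and a complaint with no known words is assumed to map to the zero vector.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    total = np.zeros(dim)
    count = 0
    for t in tokens:
        vec = embeddings.get(t)
        if vec is not None:
            total += vec
            count += 1
    # Return the mean of the known word vectors, or zeros if none were found.
    return total / count if count else total

toy = {'credit': np.array([1.0, 0.0]), 'loan': np.array([0.0, 1.0])}
print(average_embedding(['credit', 'loan', 'unknownword'], toy, dim=2))  # [0.5 0.5]
print(average_embedding(['unknownword'], toy, dim=2))                    # [0. 0.]
```

Averaging discards word order, so two complaints with the same words in different order get the same representation.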

#### Converting categorical labels to numerical format with a label encoder

In [14]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(complaints_dataframe['Product'])
complaints_dataframe['Target'] = le.transform(complaints_dataframe['Product'])
complaints_dataframe.head()

Unnamed: 0,Consumer complaint narrative,Product,Target
0,I have outdated information on my credit repor...,Credit reporting,5
1,I purchased a new car on XXXX XXXX. The car de...,Consumer Loan,2
2,An account on my credit report has a mistaken ...,Credit reporting,5
3,This company refuses to provide me verificatio...,Debt collection,7
4,This complaint is in regards to Square Two Fin...,Debt collection,7
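
The classifiers used in this notebook take the integer targets directly. If one-hot vectors were needed (for example, for a neural network output layer), a minimal NumPy sketch could look like the following. It uses `np.unique` as a stand-in for `LabelEncoder` (both assign indices in sorted class order), and the four product labels are a made-up sample.

```python
import numpy as np

products = np.array(['Credit reporting', 'Consumer Loan',
                     'Credit reporting', 'Debt collection'])

# Sorted class names, plus the index of each row's class within them.
classes, targets = np.unique(products, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes))[targets]

print(targets)        # [1 0 1 2]
print(one_hot.shape)  # (4, 3)
```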


### Train/test split

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    np.array(complaints_list), complaints_dataframe.Target.values,
    test_size=0.15, random_state=0)

In [16]:
print(X_train.shape)

(152809, 300)


In [18]:
print(y_train.shape)

(152809,)


#### Training and testing the classifier

In [19]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
clf = BernoulliNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

0.4839618793340008
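
One thing worth knowing about `BernoulliNB` is that it binarizes its input (the `binarize` parameter defaults to `0.0`), so for real-valued features it only sees whether each value is above zero. This may contribute to the accuracy above, though that has not been checked on this dataset. The sketch below uses synthetic data, not the complaints, and swaps in `GaussianNB` purely as a point of comparison: the two classes differ in magnitude but are both entirely positive.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic features: class 0 centred at 1.0, class 1 centred at 3.0, all positive.
X = np.vstack([rng.normal(1.0, 0.1, size=(100, 5)),
               rng.normal(3.0, 0.1, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

# After binarizing at 0.0 every feature is 1, so BernoulliNB cannot tell the classes apart.
bernoulli_acc = BernoulliNB().fit(X, y).score(X, y)
# GaussianNB models the actual values and can separate the two means.
gaussian_acc = GaussianNB().fit(X, y).score(X, y)

print(bernoulli_acc, gaussian_acc)
```

Whether a Gaussian model helps on the averaged GloVe features would need to be measured on the real train/test split.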


In [20]:
from sklearn.tree import DecisionTreeClassifier

In [21]:
dt_classifier = DecisionTreeClassifier()  
dt_classifier.fit(X_train, y_train) 

DecisionTreeClassifier()

In [22]:
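# Score the decision tree's own predictions. `pred` still holds the BernoulliNB
# output from In [19], so scoring it here would print 0.4839618793340008 again.
dt_pred = dt_classifier.predict(X_test)
print(accuracy_score(y_test, dt_pred))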
