# Clustering Customer Complaints
-[Rishit Dagli](rishit.tech)

## About Me

[Twitter](https://twitter.com/rishit_dagli)

[GitHub](https://github.com/Rishit-dagli)

[Medium](https://medium.com/@rishit.dagli)

## Some imports

In [3]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from tensorflow.keras import utils as np_utils

## Read Data
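We will also split the data into inputs `x` (the complaint narratives) and labels `y` (the product categories), keeping only rows where both are present.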

In [4]:
data = pd.read_csv("Consumer Finance Complaints/consumer_complaints.csv",
                   usecols=['product', 'consumer_complaint_narrative'],
                   dtype={'consumer_complaint_narrative': object})

# Keep only rows that have both a narrative and a product label
data = data[data['consumer_complaint_narrative'].notnull()]
data = data[data['product'].notnull()]
data.reset_index(drop=True, inplace=True)

x = data.iloc[:, 1].values  # complaint narratives (model input)
y = data.iloc[:, 0].values  # product categories (labels)
print(y)

['Debt collection' 'Consumer Loan' 'Mortgage' ... 'Payday loan' 'Mortgage'
 'Mortgage']


Let's see the unique categories

In [5]:
print(np.unique(y, return_counts=True))


(array(['Bank account or service', 'Consumer Loan', 'Credit card',
       'Credit reporting', 'Debt collection', 'Money transfers',
       'Mortgage', 'Other financial service', 'Payday loan',
       'Prepaid card', 'Student loan'], dtype=object), array([ 5711,  3678,  7929, 12526, 17552,   666, 14919,   110,   726,
         861,  2128]))
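
The counts above are quite unbalanced, which matters when judging accuracy later. As a rough sketch (the counts are copied from the output above; the variable names are only illustrative), we can compute each class's share and the accuracy a trivial model would get by always predicting the largest class:

```python
# Class counts copied from the output of the cell above
counts = {
    'Bank account or service': 5711,
    'Consumer Loan': 3678,
    'Credit card': 7929,
    'Credit reporting': 12526,
    'Debt collection': 17552,
    'Money transfers': 666,
    'Mortgage': 14919,
    'Other financial service': 110,
    'Payday loan': 726,
    'Prepaid card': 861,
    'Student loan': 2128,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
# Accuracy of a trivial model that always predicts the largest class
majority_baseline = counts[majority_class] / total

# Print classes from most to least frequent with their share of the data
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<26}{n:>7}{n / total:>8.1%}")
print(f"majority baseline: {majority_baseline:.1%} ({majority_class})")
```

Any accuracy we report later should be read against this baseline rather than against zero.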


## Tokenization

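Let's tokenize the complaint text with the Keras `Tokenizer`: it lowercases the text, strips punctuation, and keeps only the most frequent words (`num_words=200`). We then pad or truncate every sequence to 200 tokens.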

In [6]:
tokenizer = Tokenizer(num_words=200,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ")
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
x = sequence.pad_sequences(x, maxlen=200)

print(x)

[[  0   0   0 ...   3  84 108]
 [  0   0   0 ...   2   8   6]
 [145  10 112 ...   7   9   7]
 ...
 [  0   0   0 ... 171  66   1]
 [  0   0   0 ...   2 150  68]
 [ 32   2   4 ...   5  24  16]]
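
To make the output above easier to read, here is a simplified pure-Python sketch of what this step does. It is not the Keras implementation: the helper names (`fit_word_index`, `texts_to_sequences`, `pad_pre`) are made up for illustration, and it splits on whitespace only, without the punctuation filter. It does mirror three behaviours: words are ranked by frequency starting at index 1, words whose index is `num_words` or higher are dropped, and padding and truncation both happen at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping any word with index >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(seqs, maxlen):
    """Keep the last maxlen entries and left-pad with zeros."""
    padded = []
    for s in seqs:
        s = s[-maxlen:]  # truncate from the front
        padded.append([0] * (maxlen - len(s)) + s)
    return padded

# Toy corpus, not taken from the dataset
texts = ["the bank charged the fee", "the fee was wrong"]
word_index = fit_word_index(texts)
seqs = texts_to_sequences(texts, word_index, num_words=4)
padded = pad_pre(seqs, maxlen=5)
print(padded)  # [[0, 1, 3, 1, 2], [0, 0, 0, 1, 2]]
```

Note that `num_words=4` keeps only indices 1 to 3, so by the same logic `num_words=200` in the cell above keeps the 199 most frequent words.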


Let's now convert categorical values to numerical identities

In [7]:
labelencoder_Y = LabelEncoder()
y = labelencoder_Y.fit_transform(y)
print(y)

[4 1 6 ... 8 6 6]
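
As a small sketch of where these numbers come from: `LabelEncoder` sorts the distinct class names and assigns each its position in that sorted order. The pure-Python version below is only an illustration (the names `products`, `class_to_id` and `encode` are made up here), using the eleven category names printed earlier:

```python
# The eleven product categories, as printed by np.unique above
products = [
    'Mortgage', 'Debt collection', 'Credit reporting', 'Credit card',
    'Bank account or service', 'Consumer Loan', 'Student loan',
    'Prepaid card', 'Payday loan', 'Money transfers',
    'Other financial service',
]

# Sort the distinct names and number them from 0
class_to_id = {name: i for i, name in enumerate(sorted(set(products)))}

def encode(labels):
    """Map each class name to its integer id."""
    return [class_to_id[label] for label in labels]

print(encode(['Debt collection', 'Consumer Loan', 'Mortgage', 'Payday loan']))
# [4, 1, 6, 8]
```

This matches the first and last distinct values in the printed `y`: 'Debt collection' is 4, 'Consumer Loan' is 1, 'Mortgage' is 6 and 'Payday loan' is 8.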


These are the unique values we have

In [8]:
print(np.unique(y, return_counts=True))

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), array([ 5711,  3678,  7929, 12526, 17552,   666, 14919,   110,   726,
         861,  2128]))


Let's also convert `y` to a one-hot form, where each class is a vector with a single 1 at the position of that class. With four classes, for example:

Class 1 = [1, 0, 0, 0]

Class 2 = [0, 1, 0, 0]

In [9]:
y = np_utils.to_categorical(y, num_classes=11)

print(y)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
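
A minimal pure-Python sketch of the same conversion, for illustration only (`one_hot` is a hypothetical helper, not part of Keras):

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

# The first three integer labels from the earlier output
encoded = [one_hot(label, 11) for label in [4, 1, 6]]
for row in encoded:
    print(row)

# The original label is the position of the 1.0
decoded = [row.index(1.0) for row in encoded]
```

Each row sums to 1, and taking the position of the 1 recovers the integer label.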


In [10]:
np.random.seed(10)
indices = np.arange(len(x))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]
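
The cell above shuffles `x` and `y` with one shared permutation, so each complaint stays paired with its label. A small standard-library sketch of the same idea, on toy data (the lists and variable names here are illustrative, not from the notebook):

```python
import random

features = ['a', 'b', 'c', 'd', 'e']
labels = [0, 1, 2, 3, 4]

rng = random.Random(10)  # fixed seed for reproducibility
indices = list(range(len(features)))
rng.shuffle(indices)

# Apply the SAME permutation to both lists
features_shuffled = [features[i] for i in indices]
labels_shuffled = [labels[i] for i in indices]

# Every feature is still next to its original label
pairs_before = sorted(zip(features, labels))
pairs_after = sorted(zip(features_shuffled, labels_shuffled))
print(pairs_before == pairs_after)  # True
```

Shuffling the two arrays independently would break this pairing and make the labels meaningless.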

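We will now prepend a start token (index 1) to every sequence and shift all word indices up by `index_from = 3`, which reserves the low indices for special tokens. Because `x` is already padded, the padding zeros are shifted to 3 as well, and each sequence grows to 201 entries.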

In [11]:
index_from = 3  # offset added to every index
start_char = 1  # token prepended to every sequence
if start_char is not None:
    x = [[start_char] + [w + index_from for w in x1] for x1 in x]
elif index_from:
    x = [[w + index_from for w in x1] for x1 in x]
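
To see what the cell above does to a single row, here is a tiny sketch on a made-up padded sequence (`add_start_and_shift` is a hypothetical helper that wraps the same list comprehension):

```python
def add_start_and_shift(seq, start_char=1, index_from=3):
    """Prepend start_char and add index_from to every index in seq."""
    return [start_char] + [w + index_from for w in seq]

row = [0, 0, 5, 7]  # two padding zeros followed by two word indices
shifted = add_start_and_shift(row)
print(shifted)  # [1, 3, 3, 8, 10]
```

The row grows by one element, and the padding value 0 becomes 3, since the shift is applied to every entry including padding.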

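We now need to deal with OOV (out-of-vocabulary) terms: any index outside the range `skip_top <= w < num_words` is replaced with `oov_char`. In the same cell we hold out the last 20% of the shuffled data as a test set.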

In [12]:
num_words = None
if not num_words:
    # Default to the largest index present in the data
    num_words = max(max(x1) for x1 in x)

oov_char = 2  # replacement for out-of-vocabulary indices
skip_top = 0  # number of lowest indices to treat as OOV (none here)

if oov_char is not None:
    x = [[w if (skip_top <= w < num_words) else oov_char for w in x1] for x1 in x]
else:
    x = [[w for w in x1 if (skip_top <= w < num_words)] for x1 in x]

# Hold out the last 20% of the (already shuffled) data for testing
test_split = 0.2
idx = int(len(x) * (1 - test_split))
x_train, y_train = np.array(x[:idx]), np.array(y[:idx])
x_test, y_test = np.array(x[idx:]), np.array(y[idx:])

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(53444, 201)
(53444, 11)
(13362, 201)
(13362, 11)
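
A small sketch of the two steps in the cell above, on toy data (`replace_oov` is a hypothetical helper wrapping the same comprehension; the toy indices are made up). One detail worth noticing: because the comparison is a strict `w < num_words` and `num_words` defaults to the largest index present, that largest index is itself replaced by `oov_char`.

```python
def replace_oov(seqs, num_words=None, oov_char=2, skip_top=0):
    """Replace every index outside [skip_top, num_words) with oov_char."""
    if not num_words:
        num_words = max(max(s) for s in seqs)
    return [[w if skip_top <= w < num_words else oov_char for w in s]
            for s in seqs]

# Toy sequences; 202 is the largest index, e.g. 199 + 3 after shifting
toy = [[1, 3, 8, 10], [1, 3, 3, 202]]
cleaned = replace_oov(toy)
print(cleaned)  # [[1, 3, 8, 10], [1, 3, 3, 2]]

# Reproduce the split sizes from the shapes printed above
n_samples = 53444 + 13362
test_split = 0.2
idx = int(n_samples * (1 - test_split))
n_train, n_test = idx, n_samples - idx
print(n_train, n_test)  # 53444 13362
```

If replacing the most frequent-rank boundary word is not intended, passing an explicit `num_words` one larger than the maximum index would keep it.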


To make sure our sequences are of equal length we pad them to 201. They already have that length here, so this step leaves the shapes unchanged.

In [13]:
x_train = sequence.pad_sequences(x_train, maxlen=201)
x_test = sequence.pad_sequences(x_test, maxlen=201)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (53444, 201)
x_test shape: (13362, 201)


## Model

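Let's first define a few hyperparameters and then build the model: an embedding layer, a 1D convolution followed by global max pooling, a hidden dense layer, and an 11-way softmax output.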

In [80]:
max_features = 1000
maxlen = 201
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250

model = Sequential()
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())

model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

model.add(Dense(11))
model.add(Activation('softmax'))
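
As a sanity check on the architecture above, the layer sizes can be worked out by hand from the hyperparameters. This is a plain-arithmetic sketch, assuming the standard parameter counts for embedding, convolution and dense layers with biases; dropout, pooling and activation layers have no trainable parameters.

```python
max_features, maxlen, embedding_dims = 1000, 201, 50
filters, kernel_size, hidden_dims, num_classes = 250, 3, 250, 11

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1
conv_steps = maxlen - kernel_size + 1

params = {
    # one embedding vector per vocabulary index
    'embedding': max_features * embedding_dims,
    # each filter spans kernel_size steps x embedding_dims channels, plus a bias
    'conv1d': kernel_size * embedding_dims * filters + filters,
    # global max pooling leaves one value per filter, fed to the hidden layer
    'dense_hidden': filters * hidden_dims + hidden_dims,
    'dense_output': hidden_dims * num_classes + num_classes,
}
total_params = sum(params.values())

print('conv output steps:', conv_steps)
for name, n in params.items():
    print(f'{name:<14}{n:>8}')
print('total trainable parameters:', total_params)
```

Comparing these figures with `model.summary()` is a quick way to confirm the model was built as intended.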

In [81]:
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
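
To make the loss concrete, here is a minimal pure-Python sketch of softmax followed by categorical cross-entropy for a single example. It only illustrates the formula; the logits are made up, and unlike Keras it does not clip probabilities for numerical safety.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

# Made-up three-class example where class 0 is the true class
logits = [2.0, 0.5, -1.0]
target = [1.0, 0.0, 0.0]
probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs, loss)

# A model that spreads probability evenly over 11 classes has loss log(11)
uniform_target = [1.0] + [0.0] * 10
uniform_loss = categorical_crossentropy(uniform_target, softmax([0.0] * 11))
print(uniform_loss, math.log(11))
```

The uniform case gives a useful reference point: a training loss near log(11), about 2.4, means the model is doing no better than guessing evenly across the 11 classes.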

In [83]:
model.fit(x_train, 
          y_train,
          batch_size=32,
          epochs=50,
          validation_data=(x_test, y_test))

Train on 53444 samples, validate on 13362 samples
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fd7101e7978>
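
`model.fit` returns a `History` object whose `history` attribute is a dictionary of per-epoch metrics. The sketch below shows one way to find the best validation epoch and check for overfitting from such a dictionary. The numbers are made up purely for illustration and are not this notebook's results, and the key names `accuracy` and `val_accuracy` are an assumption that depends on the TensorFlow version.

```python
# Hypothetical stand-in for `history.history`; these values are made up
history_dict = {
    'accuracy':     [0.60, 0.70, 0.74, 0.78, 0.81],
    'val_accuracy': [0.62, 0.71, 0.75, 0.74, 0.73],
}

val_acc = history_dict['val_accuracy']
# Epochs are 1-based in the training log
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
best_val_acc = val_acc[best_epoch - 1]

# A growing gap between training and validation accuracy suggests overfitting
final_gap = history_dict['accuracy'][-1] - val_acc[-1]

print(f'best validation accuracy {best_val_acc:.2f} at epoch {best_epoch}')
print(f'train/validation gap at the last epoch: {final_gap:.2f}')
```

Capturing the return value as `history = model.fit(...)` in the training cell would make the real per-epoch values available for this kind of check.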

## Conclusion

A peak accuracy of roughly 75% is a good result for a dataset like this, which has a lot of noise, and the model did not overfit, which is good for now. We first tokenized the text, which also stripped punctuation and other unwanted characters, and then, with some hyperparameter tuning, arrived at our best model.