# Complaint Status Tracking (HACKEREARTH CHALLENGE)

Problem Statement

Societe Generale (SocGen) is a French multinational banking and financial services company. With over 154,000 employees based in 76 countries, it serves more than 32 million clients worldwide every day.

They provide services like retail banking, corporate and investment banking, asset management, portfolio management, insurance and other financial services.

While handling customer complaints, it is hard to track the status of the complaint. To automate this process, SocGen wants you to build a model that can automatically predict the complaint status (how the complaint was resolved) based on the complaint submitted by the consumer and other related meta-data.

In [1]:
import warnings
import math
import pandas as pd
import string
import numpy as np
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
warnings.filterwarnings('ignore')

In [90]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample= pd.read_csv('sample_submission.csv')

In [3]:
train.head()

| | Complaint-ID | Date-received | Transaction-Type | Complaint-reason | Company-response | Date-sent-to-company | Complaint-Status | Consumer-disputes | Consumer-complaint-summary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Tr-1 | 11/11/2015 | Mortgage | Loan servicing, payments, escrow account | NaN | 11/11/2015 | Closed with explanation | Yes | Seterus, Inc a déposé un faux rapport auprès d... |
| 1 | Tr-2 | 7/7/2015 | Credit reporting | Incorrect information on credit report | Company chooses not to provide a public response | 7/7/2015 | Closed with non-monetary relief | No | XX / XX / XXXX La requête en faillite n ° XXXX... |
| 2 | Tr-3 | 5/7/2015 | Bank account or service | Using a debit or ATM card | NaN | 5/7/2015 | Closed with explanation | No | El XXXX / XXXX / 15, estaba preparando el vuel... |
| 3 | Tr-4 | 11/12/2016 | Debt collection | Cont'd attempts collect debt not owed | Company believes it acted appropriately as aut... | 11/12/2016 | Closed with explanation | No | The loan was paid in XXXX XXXX. In XXXX, 4 yea... |
| 4 | Tr-5 | 9/29/2016 | Credit card | Payoff process | Company has responded to the consumer and the ... | 9/29/2016 | Closed with explanation | No | J'ai obtenu un compte de crédit de soins pour ... |


### We have to Predict the Complaint Status

In [4]:
train['Complaint-Status'].value_counts()

Closed with explanation            34300
Closed with non-monetary relief     5018
Closed with monetary relief         2818
Closed                               809
Untimely response                    321
Name: Complaint-Status, dtype: int64
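
The counts above show a heavily imbalanced target. As a rough sanity check, the sketch below computes the share of the largest class, which is the accuracy a model would reach by always predicting that class. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Closed with explanation": 34300,
    "Closed with non-monetary relief": 5018,
    "Closed with monetary relief": 2818,
    "Closed": 809,
    "Untimely response": 321,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of a trivial model that always predicts the majority class.
majority_share = counts[majority_label] / total

print(total)                     # 43266
print(majority_label)            # Closed with explanation
print(round(majority_share, 4))  # 0.7928
```

Any accuracy figure reported later is worth reading against this baseline of roughly 0.79.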

### Complaint ID is irrelevant to the Prediction

In [5]:
train = train.drop('Complaint-ID',axis=1)

In [6]:
### CREATE A COPY OF DATAFRAME FOR LABEL ENCODING 

df_label = train.copy()

In [7]:
(train['Date-received'][0])

'11/11/2015'

In [8]:
#### CONVERT STRING TO DATETIME FORMAT

#for i in range(0,len(train)):
    #df_label['Date-received'][i] = datetime.strptime(train['Date-received'][i],'%m/%d/%Y')
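
The commented-out loop above parses one row at a time. A minimal sketch of a vectorized alternative, assuming the dates always follow the month/day/year pattern seen in `'11/11/2015'`, could use `pd.to_datetime`. The toy frame `dates` is a stand-in for the real data.

```python
import pandas as pd

# Toy frame that mimics the two date columns; values are taken from train.head().
dates = pd.DataFrame({
    "Date-received": ["11/11/2015", "7/7/2015", "9/29/2016"],
    "Date-sent-to-company": ["11/11/2015", "7/7/2015", "9/29/2016"],
})

# Parse each column in one call instead of looping over rows.
for col in ["Date-received", "Date-sent-to-company"]:
    dates[col] = pd.to_datetime(dates[col], format="%m/%d/%Y")

print(dates.dtypes)
```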

### Label Encoding of Categorical variables 

In [9]:
### NAIVE METHOD BY DROPPING ALL DATETIME COLUMNS (THEY MIGHT BE IMPORTANT...WE WILL CHECK LATER)

df_label = df_label.drop('Date-received',axis=1)
df_label = df_label.drop('Date-sent-to-company',axis=1)

In [10]:
### LET US ALSO DROP THE DESCRIPTION COLUMN (NEED NLP THAT WE MIGHT IGNORE FOR NAIVE APPROACH)

df_label = df_label.drop('Consumer-complaint-summary',axis=1)

### There are several null values, so let us replace them with "No Response" and "Not Known"

In [11]:
print(df_label['Company-response'].isnull().sum())
print(df_label['Consumer-disputes'].isnull().sum())

22506
7698


In [12]:
df_label['Company-response'] = df_label['Company-response'].fillna('No Response')
df_label['Consumer-disputes'] = df_label['Consumer-disputes'].fillna('Not Known')

In [13]:
print(df_label['Company-response'].isnull().sum())
print(df_label['Consumer-disputes'].isnull().sum())

0
0


In [14]:
tt = preprocessing.LabelEncoder()
tt = tt.fit(df_label['Transaction-Type'])
cr = preprocessing.LabelEncoder()
cr = cr.fit(df_label['Complaint-reason'])
cre = preprocessing.LabelEncoder()
cre = cre.fit(df_label['Company-response'])
cd = preprocessing.LabelEncoder()
cd = cd.fit(df_label['Consumer-disputes'])
cs= preprocessing.LabelEncoder()
cs = cs.fit(df_label['Complaint-Status'])

### Convert all categorical variables so the dataframe is algorithm-ready

In [15]:
df_label['Transaction-Type'] = tt.transform(df_label['Transaction-Type'])
df_label['Complaint-reason'] = cr.transform(df_label['Complaint-reason'])
df_label['Company-response'] = cre.transform(df_label['Company-response'])
df_label['Consumer-disputes'] = cd.transform(df_label['Consumer-disputes'])
df_label['Complaint-Status'] = cs.transform(df_label['Complaint-Status'])
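
The fit and transform steps above repeat the same pattern for every column. One possible way to condense it, sketched here on toy data, is to keep the fitted encoders in a dictionary keyed by column name. The `encoders` dict and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; column names follow the notebook, values are illustrative.
toy = pd.DataFrame({
    "Transaction-Type": ["Mortgage", "Credit card", "Mortgage"],
    "Consumer-disputes": ["Yes", "No", "Not Known"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in toy.columns:
    encoders[col] = LabelEncoder().fit(toy[col])
    toy[col] = encoders[col].transform(toy[col])

print(toy)
print(list(encoders["Transaction-Type"].classes_))
```

Keeping the fitted encoders around matters because the same objects are needed later to transform the test set and to map predictions back to label names.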

In [16]:
df_label.head(1)

| | Transaction-Type | Complaint-reason | Company-response | Complaint-Status | Consumer-disputes |
|---|---|---|---|---|---|
| 0 | 10 | 78 | 10 | 1 | 2 |


In [17]:
len(df_label['Complaint-Status'].value_counts())

5

### Train Test Split and applying Algorithms for NAIVE LABEL ENCODER

In [18]:
X_train, X_test, y_train, y_test = train_test_split(df_label.drop('Complaint-Status',axis=1), 
                 df_label['Complaint-Status'], test_size=0.33, random_state=42)

In [19]:
nb = GaussianNB()
nb.fit(X_train, y_train)
prediction=nb.predict(X_test)
print("F1 SCORE for Naive Bayes:",metrics.f1_score(y_test, prediction, average='weighted'))

clf = LogisticRegression(random_state=0, multi_class='ovr')
model = clf.fit(X_train, y_train)
prediction=model.predict(X_test)
print("F1 SCORE for Logistic Regression:",metrics.f1_score(y_test, prediction, average='weighted'))

decisiontree=DecisionTreeClassifier()
decisiontree.fit(X_train, y_train)
prediction=decisiontree.predict(X_test)
print("F1 SCORE For Decision Trees:",metrics.f1_score(y_test, prediction, average='weighted'))

neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
prediction=neigh.predict(X_test)
print("F1 SCORE for KNN:",metrics.f1_score(y_test, prediction, average='weighted'))

randomforest=RandomForestClassifier(n_estimators =100)
randomforest.fit(X_train, y_train)
prediction=randomforest.predict(X_test)
print("F1 SCORE for Random Forest:",metrics.f1_score(y_test, prediction, average='weighted'))

F1 SCORE for Naive Bayes: 0.5197088275842644
F1 SCORE for Logistic Regression: 0.7019842327760217
F1 SCORE For Decision Trees: 0.7198656879561207
F1 SCORE for KNN: 0.7206302352354023
F1 SCORE for Random Forest: 0.7208757103057245


In [20]:
list(X_train.columns), list(randomforest.feature_importances_)

(['Transaction-Type',
  'Complaint-reason',
  'Company-response',
  'Consumer-disputes'],
 [0.3585601473414549,
  0.375994951593255,
  0.1893296666155011,
  0.07611523444978911])

- Company response and Consumer disputes have the lowest importance (possibly because of the NaN values they originally contained)
- The prediction depends most heavily on the Complaint reason and the Transaction Type
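
To make the ranking easier to read, the names and importances can be paired and sorted. The sketch below uses the values from the output above, rounded to four decimal places.

```python
# Names and importances copied from the cell output above (rounded to 4 places).
features = ["Transaction-Type", "Complaint-reason", "Company-response", "Consumer-disputes"]
importances = [0.3586, 0.3760, 0.1893, 0.0761]

# Sort the (name, importance) pairs from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score:.4f}")
```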

### Let us do some Feature Engineering

In [21]:
train['Date-received'][0] ,train['Date-sent-to-company'][0]

('11/11/2015', '11/11/2015')

Date received and Date sent to company can be either

- the same
- different

Let us generate a new column from this and call it Promptness (promptness in customer service): 1 if the two dates match, 0 otherwise.

In [22]:
df_label['Promptness'] = 1

In [23]:
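# Vectorized comparison: 1 when both dates match, 0 otherwise.
# This avoids the row-by-row chained assignment, which is slow and can silently fail.
df_label['Promptness'] = (train['Date-received'] == train['Date-sent-to-company']).astype(int)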

In [24]:
df_label.head(1)

| | Transaction-Type | Complaint-reason | Company-response | Complaint-Status | Consumer-disputes | Promptness |
|---|---|---|---|---|---|---|
| 0 | 10 | 78 | 10 | 1 | 2 | 1 |


Let's calculate the correlation of each feature, including the new one, with the target

In [25]:
df_label.corr()['Complaint-Status']

Transaction-Type    -0.115914
Complaint-reason    -0.002027
Company-response    -0.005819
Complaint-Status     1.000000
Consumer-disputes   -0.109791
Promptness           0.006972
Name: Complaint-Status, dtype: float64

In [26]:
X_train, X_test, y_train, y_test = train_test_split(df_label.drop('Complaint-Status',axis=1), 
                 df_label['Complaint-Status'], test_size=0.33, random_state=42)
nb = GaussianNB()
nb.fit(X_train, y_train)
prediction=nb.predict(X_test)
print("F1 SCORE for Naive Bayes:",metrics.f1_score(y_test, prediction, average='weighted'))

clf = LogisticRegression(random_state=0, multi_class='ovr')
model = clf.fit(X_train, y_train)
prediction=model.predict(X_test)
print("F1 SCORE for Logistic Regression:",metrics.f1_score(y_test, prediction, average='weighted'))

decisiontree=DecisionTreeClassifier()
decisiontree.fit(X_train, y_train)
prediction=decisiontree.predict(X_test)
print("F1 SCORE For Decision Trees:",metrics.f1_score(y_test, prediction, average='weighted'))

neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
prediction=neigh.predict(X_test)
print("F1 SCORE for KNN:",metrics.f1_score(y_test, prediction, average='weighted'))

randomforest=RandomForestClassifier(n_estimators =100)
randomforest.fit(X_train, y_train)
prediction=randomforest.predict(X_test)
print("F1 SCORE for Random Forest:",metrics.f1_score(y_test, prediction, average='weighted'))

F1 SCORE for Naive Bayes: 0.5213423790556663
F1 SCORE for Logistic Regression: 0.7019842327760217
F1 SCORE For Decision Trees: 0.7214784119910971
F1 SCORE for KNN: 0.7215723534275794
F1 SCORE for Random Forest: 0.7211507527573842
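
The same five-model block is repeated after each feature change. A minimal sketch of how it could be folded into one helper is shown below. `evaluate_models` is a hypothetical function name, and the data comes from `make_classification` so the example runs without `train.csv`; the scores it prints say nothing about the complaint data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its weighted F1 score on the held-out split."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")
    return scores


# Synthetic stand-in data (an assumption for this sketch only).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = evaluate_models(models, X_train, X_test, y_train, y_test)
for name, score in scores.items():
    print(f"F1 SCORE for {name}: {score:.4f}")
```

With a helper like this, each feature-engineering step only needs a new split followed by one call.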


In [27]:
types=['rbf','linear']
for i in types:
    model=svm.SVC(kernel=i)
    model.fit(X_train, y_train)
    prediction=model.predict(X_test)
    print('Accuracy for SVM kernel is',metrics.accuracy_score(prediction,y_test))

Accuracy for SVM kernel is 0.792968202829528
Accuracy for SVM kernel is 0.7933884297520661


### We have a slight increase in F1 score for most models

### This might not seem like much, but it shows how useful feature engineering can be as a machine learning tool

### Let's look at the COMPLAINT REASON column, as it has >70 unique values

In [28]:
train['Complaint-reason'].value_counts().keys()[0] , train['Complaint-reason'].value_counts().keys()[6]

('Incorrect information on credit report',
 'Incorrect information on your report')

### Both mean almost the same thing, but the label encoder separates them
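
One simple way to handle such near-duplicates, shown here only as an illustration, is to map them to a single canonical label before encoding. The `canonical` mapping and `normalize_reason` helper are hypothetical; the one pair they contain is the pair from the output above.

```python
# Hypothetical mapping from a variant label to its canonical form.
canonical = {
    "Incorrect information on your report": "Incorrect information on credit report",
}


def normalize_reason(reason):
    """Return the canonical label for a reason, or the reason itself if unmapped."""
    return canonical.get(reason, reason)


reasons = [
    "Incorrect information on credit report",
    "Incorrect information on your report",
    "Communication tactics",
]
merged = [normalize_reason(r) for r in reasons]
print(len(set(reasons)), "->", len(set(merged)))  # 3 -> 2
```

The notebook takes a different route and scores the words of each reason instead, as the following cells show.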

In [29]:
reason = list(train['Complaint-reason'])

In [30]:
uni = list(train['Complaint-reason'].value_counts().keys())

In [31]:
uni[:10]

['Incorrect information on credit report',
 "Cont'd attempts collect debt not owed",
 'Loan servicing, payments, escrow account',
 'Loan modification,collection,foreclosure',
 'Dealing with my lender or servicer',
 'Disclosure verification of debt',
 'Incorrect information on your report',
 'Communication tactics',
 'Account opening, closing, or management',
 "Credit reporting company's investigation"]

### Convert each complaint reason into a score based on the TF-IDF of its words

In [32]:
e = []
for i in uni:
    kk = i.split(' ')
    kk = [j.lower() for j in kk]
    e.extend(kk)

In [33]:
k = reason

In [34]:
len(k), len(e)

(43266, 700)

In [35]:
k = [i.lower() for i in k]

In [36]:
k[0]

'loan servicing, payments, escrow account'

In [42]:
elements = ' '.join(k)

In [43]:
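def tf(lis,elem):
    # lis is the joined corpus string, so this counts substring matches
    # and normalises by the corpus length in characters
    no = lis.count(elem)
    return no/len(lis)

def idf(k,e,elem):
    # k: list of lower-cased reasons; e is unused but kept for the calls below
    # count = number of reasons that contain elem as a substring
    count = 0
    for i in k:
        if elem in i:
            count+=1
    return 1+ math.log10(len(k)/count)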

In [44]:
df = pd.DataFrame()
TF=[]
IDF = []
e = list(set(e))
for i in e:
    TF.append(tf(elements,i))
    IDF.append(idf(k,e,i))
T = pd.Series(TF)
W = pd.Series(e)
I = pd.Series(IDF)
df['Words'] = W.values
df['TF'] = T.values
df['IDF'] = I.values
df['TFIDF'] = df['TF'] * df['IDF']
tfidf = pd.Series(df.TFIDF.values,index=df.Words).to_dict()
mat = []
for i in k:
    wor = i.split(' ')
    el=[]
    for j in e:
        if j in wor:
            el.append(tfidf[j])
        else:
            el.append(0)
    mat.append(el)

In [45]:
len(mat) == len(df_label)

True

In [46]:
tfidf = [sum(i)/len(i) for i in mat]
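
To make the idea easier to follow, here is a small token-based variant on three of the reasons listed above. It is a sketch, not the notebook's exact computation: term frequency is taken per reason over whole words, whereas `tf()` above counts substring matches in the joined corpus. The IDF has the same `1 + log10(N / df)` shape, and each vector is collapsed to its mean in the same way.

```python
import math

# Three reasons taken from the list printed above, lower-cased.
docs = [
    "incorrect information on credit report",
    "incorrect information on your report",
    "communication tactics",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def idf(token):
    """1 + log10(N / document frequency), counted over whole tokens."""
    df = sum(1 for doc in tokenized if token in doc)
    return 1 + math.log10(len(tokenized) / df)


def tfidf_vector(doc):
    """One weight per vocabulary word: term frequency in the doc times IDF."""
    return [doc.count(tok) / len(doc) * idf(tok) for tok in vocab]


vectors = [tfidf_vector(doc) for doc in tokenized]
# Collapse each vector to its mean, giving one scalar feature per reason.
scalars = [sum(v) / len(v) for v in vectors]
print(vocab)
print([round(s, 4) for s in scalars])
```

Collapsing to a single number discards most of the information in the vector, which may be one reason the F1 scores barely move.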

In [47]:
df_label['reason'] = tfidf

In [48]:
df_label.head(1)

| | Transaction-Type | Complaint-reason | Company-response | Complaint-Status | Consumer-disputes | Promptness | reason |
|---|---|---|---|---|---|---|---|
| 0 | 10 | 78 | 10 | 1 | 2 | 1 | 0.000122 |


In [105]:
df_label = df_label.drop('Complaint-reason',axis=1)

### Lets apply the algorithms after generating tfidf features

In [106]:
X_train, X_test, y_train, y_test = train_test_split(df_label.drop('Complaint-Status',axis=1), 
                 df_label['Complaint-Status'], test_size=0.33, random_state=42)
nb = GaussianNB()
nb.fit(X_train, y_train)
prediction=nb.predict(X_test)
print("F1 SCORE for Naive Bayes:",metrics.f1_score(y_test, prediction, average='weighted'))

clf = LogisticRegression(random_state=0, multi_class='ovr')
model = clf.fit(X_train, y_train)
prediction=model.predict(X_test)
print("F1 SCORE for Logistic Regression:",metrics.f1_score(y_test, prediction, average='weighted'))

decisiontree=DecisionTreeClassifier()
decisiontree.fit(X_train, y_train)
prediction=decisiontree.predict(X_test)
print("F1 SCORE For Decision Trees:",metrics.f1_score(y_test, prediction, average='weighted'))

neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
prediction=neigh.predict(X_test)
print("F1 SCORE for KNN:",metrics.f1_score(y_test, prediction, average='weighted'))

randomforest=RandomForestClassifier(n_estimators =100)
randomforest.fit(X_train, y_train)
prediction=randomforest.predict(X_test)
print("F1 SCORE for Random Forest:",metrics.f1_score(y_test, prediction, average='weighted'))

F1 SCORE for Naive Bayes: 0.5188888132715702
F1 SCORE for Logistic Regression: 0.7019842327760217
F1 SCORE For Decision Trees: 0.7215626388691304
F1 SCORE for KNN: 0.7231281072749107
F1 SCORE for Random Forest: 0.7213825646306306


### Again a very slight change in F1 scores

In [109]:
types=['rbf']
for i in types:
    model=svm.SVC(kernel=i)
    model.fit(X_train, y_train)
    prediction=model.predict(X_test)
    print('Accuracy for SVM kernel is',metrics.accuracy_score(prediction,y_test))

Accuracy for SVM kernel is 0.7935985432133352


## TESTING

In [91]:
idval = test['Complaint-ID']

In [92]:
test = test.drop('Complaint-ID',axis=1)
test = test.drop('Consumer-complaint-summary',axis=1)

In [93]:
r = test['Complaint-reason']

In [94]:
test = test.drop('Complaint-reason',axis=1)

In [95]:
test['Company-response'] = test['Company-response'].fillna('No Response')
test['Consumer-disputes'] = test['Consumer-disputes'].fillna('Not Known')

In [96]:
test['Transaction-Type'] = tt.transform(test['Transaction-Type'])
test['Company-response'] = cre.transform(test['Company-response'])
test['Consumer-disputes'] = cd.transform(test['Consumer-disputes'])

In [99]:
# 1 when the complaint was forwarded on the day it was received, else 0
test['Promptness'] = (test['Date-received'] == test['Date-sent-to-company']).astype(int)

In [102]:
test = test.drop(['Date-received','Date-sent-to-company'],axis=1)

In [113]:
test['Complaint-reason'] = r
reason = list(test['Complaint-reason'])

In [114]:
uni = list(test['Complaint-reason'].value_counts().keys())

In [115]:
uni[:10]

['Incorrect information on credit report',
 "Cont'd attempts collect debt not owed",
 'Loan servicing, payments, escrow account',
 'Loan modification,collection,foreclosure',
 'Incorrect information on your report',
 'Dealing with my lender or servicer',
 'Disclosure verification of debt',
 'Communication tactics',
 'Account opening, closing, or management',
 "Credit reporting company's investigation"]

In [117]:
len(uni)

147

In [118]:
e = []
for i in uni:
    kk = i.split(' ')
    kk = [j.lower() for j in kk]
    e.extend(kk)

In [119]:
k = reason

In [120]:
len(k), len(e)

(18543, 689)

In [121]:
k = [i.lower() for i in k]

In [122]:
k[0]

'account opening, closing, or management'

In [123]:
elements = ' '.join(k)

In [124]:
def tf(lis,elem):
    no = lis.count(elem)
    return no/len(lis)

def idf(k,e,elem):
    count = 0
    n = len(k)
    for i in k:
        if elem in i:
            count+=1
    return 1+ math.log10(len(k)/count)

df = pd.DataFrame()
TF=[]
IDF = []
e = list(set(e))
for i in e:
    TF.append(tf(elements,i))
    IDF.append(idf(k,e,i))
T = pd.Series(TF)
W = pd.Series(e)
I = pd.Series(IDF)
df['Words'] = W.values
df['TF'] = T.values
df['IDF'] = I.values
df['TFIDF'] = df['TF'] * df['IDF']
tfidf = pd.Series(df.TFIDF.values,index=df.Words).to_dict()
mat = []
for i in k:
    wor = i.split(' ')
    el=[]
    for j in e:
        if j in wor:
            el.append(tfidf[j])
        else:
            el.append(0)
    mat.append(el)

In [136]:
# A bare comparison in the middle of a cell is discarded, so assert it instead
assert len(mat) == len(test)
tfidf = [sum(i)/len(i) for i in mat]
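
Note that the cells above rebuild the vocabulary and weights from the test reasons, so the resulting `reason` values are not on the same scale as the ones the model was trained on. A common alternative, sketched here with scikit-learn's `TfidfVectorizer` rather than the notebook's hand-written functions, is to fit on the training text once and reuse the fitted object for the test text. The example reasons are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative reasons; in the notebook these would be the train and test columns.
train_reasons = [
    "Incorrect information on credit report",
    "Communication tactics",
    "Disclosure verification of debt",
]
test_reasons = [
    "Incorrect information on your report",
    "Communication tactics",
]

vectorizer = TfidfVectorizer(lowercase=True)
train_matrix = vectorizer.fit_transform(train_reasons)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_reasons)        # reuses them unchanged

print(train_matrix.shape, test_matrix.shape)
```

Both matrices share the same columns, so a model fitted on one can score the other; words that appear only in the test text are ignored.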

In [129]:
test['reason'] = 0

In [138]:
test['reason'] = tfidf

In [140]:
test=test.drop('Complaint-reason',axis=1)

In [143]:
# model is the last estimator fitted above: the RBF-kernel SVC from In [109]
values = model.predict(test)

In [154]:
p = cs.inverse_transform(values)

In [155]:
test['Complaint-Status'] = p

In [156]:
test['Complaint-Status'].value_counts()

Closed with explanation        18450
Closed with monetary relief       93
Name: Complaint-Status, dtype: int64

In [165]:
t = test.drop(['Transaction-Type','Company-response','Consumer-disputes','Promptness','reason'],axis=1)
t['Complaint-ID'] = idval

In [175]:
columnsTitles=["Complaint-ID","Complaint-Status"]
df=t.reindex(columns=columnsTitles)

In [177]:
df['Complaint-Status'].value_counts()

Closed with explanation        18450
Closed with monetary relief       93
Name: Complaint-Status, dtype: int64

In [178]:
df.to_csv('submission.csv', index=False)