### Consumer Complaints Analysis, Visualization & Prediction

The goal is to analyze the consumer complaints data and build a model that predicts which consumers are likely to dispute the resolution of a complaint, that is, to make predictions for the "Consumer disputed?" column.

**Importing packages**

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

import plotly
import plotly.express as px
import plotly.graph_objs as go

import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode, plot
cf.go_offline()

import warnings
warnings.filterwarnings('ignore')

In [2]:
consumer_data= pd.read_csv("C:\\Users\\Spyder\\Dev\\CSV files\\rows.csv\\rows.csv")

In [3]:
consumer_data.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,05/10/2019,Checking or savings account,Checking account,Managing an account,Problem using a debit or ATM card,,,NAVY FEDERAL CREDIT UNION,FL,328XX,Older American,,Web,05/10/2019,In progress,Yes,,3238275
1,05/10/2019,Checking or savings account,Other banking product or service,Managing an account,Deposits and withdrawals,,,BOEING EMPLOYEES CREDIT UNION,WA,98204,,,Referral,05/10/2019,Closed with explanation,Yes,,3238228
2,05/10/2019,Debt collection,Payday loan debt,Communication tactics,Frequent or repeated calls,,,CURO Intermediate Holdings,TX,751XX,,,Web,05/10/2019,Closed with explanation,Yes,,3237964
3,05/10/2019,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Old information reappears or never goes away,,,Ad Astra Recovery Services Inc,LA,708XX,,,Web,05/10/2019,Closed with explanation,Yes,,3238479
4,05/10/2019,Checking or savings account,Checking account,Managing an account,Banking errors,,,ALLY FINANCIAL INC.,AZ,85205,,,Postal mail,05/10/2019,In progress,Yes,,3238460


In [4]:
consumer_data.shape

(1282355, 18)

In [5]:
consumer_data.isnull().sum()

Date received                         0
Product                               0
Sub-product                      235166
Issue                                 0
Sub-issue                        531186
Consumer complaint narrative     898791
Company public response          833273
Company                               0
State                             19400
ZIP code                         115298
Tags                            1106712
Consumer consent provided?       591701
Submitted via                         0
Date sent to company                  0
Company response to consumer          7
Timely response?                      0
Consumer disputed?               513854
Complaint ID                          0
dtype: int64

In [6]:
consumer_data.columns = consumer_data.columns.str.title()

In [7]:
# mode() returns a Series, so take its first entry (the most frequent value, 'No')
mode_value = consumer_data['Consumer Disputed?'].mode()[0]

In [8]:
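# Assign the result back rather than calling fillna(inplace=True) on a column selection
consumer_data['Consumer Disputed?'] = consumer_data['Consumer Disputed?'].fillna(mode_value)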

In [9]:
# Confirm that no missing values remain in the target column
assert consumer_data['Consumer Disputed?'].notna().all()
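The mode imputation used for the target column can be tried on a small example. This is a minimal sketch on a made-up toy column (the values below are not taken from the dataset). Keep in mind that 513,854 of the 1,282,355 target values, roughly 40%, are missing, so a large share of the labels end up being filled in this way.

```python
import pandas as pd

# Toy column standing in for 'Consumer Disputed?' (values are made up).
toy = pd.DataFrame({'Consumer Disputed?': ['No', 'Yes', None, 'No', None]})

# mode() ignores missing values and returns a Series (ties are possible),
# so the first entry is taken as the fill value.
mode_value = toy['Consumer Disputed?'].mode()[0]

# Assign the filled column back to the DataFrame.
toy['Consumer Disputed?'] = toy['Consumer Disputed?'].fillna(mode_value)

# Class counts after imputation.
filled_counts = toy['Consumer Disputed?'].value_counts().to_dict()
print(mode_value, filled_counts)
```

Comparing the class counts before and after filling is a quick way to see how much the imputation shifts the class balance.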

In [10]:
consumer_data.isnull().mean().round(4)*100

Date Received                    0.00
Product                          0.00
Sub-Product                     18.34
Issue                            0.00
Sub-Issue                       41.42
Consumer Complaint Narrative    70.09
Company Public Response         64.98
Company                          0.00
State                            1.51
Zip Code                         8.99
Tags                            86.30
Consumer Consent Provided?      46.14
Submitted Via                    0.00
Date Sent To Company             0.00
Company Response To Consumer     0.00
Timely Response?                 0.00
Consumer Disputed?               0.00
Complaint Id                     0.00
dtype: float64

In [11]:
total = consumer_data.isnull().sum().sort_values(ascending = False)  

percent = (consumer_data.isnull().sum()/consumer_data.isnull().count()*100).sort_values(ascending =False)

df = pd.concat([total , percent],axis =1,keys=['Total' ,'Percent'])

# Keep only the columns that actually have missing values
df[df['Total'] > 0]

| | Total | Percent |
|---|---|---|
| Tags | 1106712 | 86.303091 |
| Consumer Complaint Narrative | 898791 | 70.089094 |
| Company Public Response | 833273 | 64.9799 |
| Consumer Consent Provided? | 591701 | 46.141747 |
| Sub-Issue | 531186 | 41.422695 |
| Sub-Product | 235166 | 18.338604 |
| Zip Code | 115298 | 8.991114 |
| State | 19400 | 1.512842 |
| Company Response To Consumer | 7 | 0.000546 |
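The missing-value summary can also be wrapped in a small reusable function. The sketch below is one possible way to do it; the function name `missing_summary` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, largest first."""
    total = frame.isnull().sum()
    percent = frame.isnull().mean() * 100
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    # Drop columns without any missing values.
    summary = summary[summary['Total'] > 0]
    return summary.sort_values('Total', ascending=False)

# Toy data (hypothetical values) with gaps in two of the three columns.
toy = pd.DataFrame({
    'Tags': [np.nan, np.nan, np.nan, 'Servicemember'],
    'State': ['FL', np.nan, 'TX', 'LA'],
    'Product': ['Mortgage', 'Debt collection', 'Mortgage', 'Mortgage'],
})
summary = missing_summary(toy)
print(summary)
```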


#### Summary Statistics

In [12]:
consumer_data[['Issue','Date Received','Product','Sub-Issue','Consumer Complaint Narrative','Company',
               'Company Public Response','Consumer Consent Provided?',
               'Company Response To Consumer','Submitted Via']].describe().transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| Issue | 1282355 | 167 | Incorrect information on your report | 134338 |
| Date Received | 1282355 | 2717 | 09/08/2017 | 3553 |
| Product | 1282355 | 18 | Mortgage | 278098 |
| Sub-Issue | 751169 | 218 | Information belongs to someone else | 59168 |
| Consumer Complaint Narrative | 383564 | 366945 | There are many mistakes appear in my report wi... | 978 |
| Company | 1282355 | 5275 | EQUIFAX, INC. | 115703 |
| Company Public Response | 449082 | 10 | Company has responded to the consumer and the ... | 311852 |
| Consumer Consent Provided? | 690654 | 4 | Consent provided | 383885 |
| Company Response To Consumer | 1282348 | 8 | Closed with explanation | 993221 |
| Submitted Via | 1282355 | 6 | Web | 945329 |


### What are the top 15 issues, sub-issues, and companies?

In [13]:
sns.set(style='white')
consumer_data['Issue'].str.strip("'").value_counts()[0:15].iplot(kind='bar',title='Top 15 issues',fontsize=14,color='orange')

In [14]:
consumer_data['Sub-Issue'].str.strip("'").value_counts()[0:15].iplot(kind ='bar',
                                                                     title='Top 15 Sub Issues',fontsize=14,color='#9370DB')

In [15]:
consumer_data['Company'].str.strip("'").value_counts()[0:15].iplot(kind='bar',
                                                          title='Top 15 Companies',fontsize=14,color='purple')

### In which month did most complaints occur and on which day of the week are most complaints received?

In this section, we extract the date features from the Date Received field.

In [16]:
from datetime import datetime

In [17]:
# Dates are stored as MM/DD/YYYY strings; an explicit format avoids ambiguous parsing
consumer_data['Date'] = pd.to_datetime(consumer_data['Date Received'], format='%m/%d/%Y')

# Extract the year
consumer_data['Year'] = consumer_data['Date'].dt.year

# Extract the month name
consumer_data['Month'] = consumer_data['Date'].dt.month_name()

# Extract the weekday name
consumer_data['Week_Days'] = consumer_data['Date'].dt.day_name()
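The date feature extraction can be checked on a couple of values. This is a minimal sketch using two dates that appear in the outputs above, assuming the MM/DD/YYYY format seen in the data.

```python
import pandas as pd

# Two date strings in MM/DD/YYYY form.
dates = pd.to_datetime(pd.Series(['05/10/2019', '09/08/2017']), format='%m/%d/%Y')

# Derive the same three features as in the notebook cell.
features = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month_name(),
    'Week_Days': dates.dt.day_name(),
})
print(features)
```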


In [18]:
consumer_data.head()

Unnamed: 0,Date Received,Product,Sub-Product,Issue,Sub-Issue,Consumer Complaint Narrative,Company Public Response,Company,State,Zip Code,...,Submitted Via,Date Sent To Company,Company Response To Consumer,Timely Response?,Consumer Disputed?,Complaint Id,Date,Year,Month,Week_Days
0,05/10/2019,Checking or savings account,Checking account,Managing an account,Problem using a debit or ATM card,,,NAVY FEDERAL CREDIT UNION,FL,328XX,...,Web,05/10/2019,In progress,Yes,No,3238275,2019-05-10,2019,May,Friday
1,05/10/2019,Checking or savings account,Other banking product or service,Managing an account,Deposits and withdrawals,,,BOEING EMPLOYEES CREDIT UNION,WA,98204,...,Referral,05/10/2019,Closed with explanation,Yes,No,3238228,2019-05-10,2019,May,Friday
2,05/10/2019,Debt collection,Payday loan debt,Communication tactics,Frequent or repeated calls,,,CURO Intermediate Holdings,TX,751XX,...,Web,05/10/2019,Closed with explanation,Yes,No,3237964,2019-05-10,2019,May,Friday
3,05/10/2019,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Old information reappears or never goes away,,,Ad Astra Recovery Services Inc,LA,708XX,...,Web,05/10/2019,Closed with explanation,Yes,No,3238479,2019-05-10,2019,May,Friday
4,05/10/2019,Checking or savings account,Checking account,Managing an account,Banking errors,,,ALLY FINANCIAL INC.,AZ,85205,...,Postal mail,05/10/2019,In progress,Yes,No,3238460,2019-05-10,2019,May,Friday


In [19]:
consumer_data['Week_Days'].value_counts().iplot(kind ='barh',title ='Number of Complaints per Weekday')

In [20]:
pd.crosstab(consumer_data['Year'],consumer_data['Month']).iplot(kind='bar',barmode='stack',
                                                        title='Number of Complaints per Month')

### What is the most common response received from companies?

In [21]:
grouped = consumer_data.groupby(['Company Response To Consumer']).size()
pie_chart = go.Pie(labels=grouped.index,values=grouped,
                  title='Company Response to the Customer')
iplot([pie_chart])

### Which state received the largest number of complaints?

In [22]:
states = consumer_data['State'].value_counts()

scl = [
    [0.0, 'rgb(242,240,247)'],
    [0.2, 'rgb(218,218,235)'],
    [0.4, 'rgb(188,189,220)'],
    [0.6, 'rgb(158,154,200)'],
    [0.8, 'rgb(117,107,177)'],
    [1.0, 'rgb(84,39,143)']
]

data = [go.Choropleth(
    colorscale = scl,
    autocolorscale = False,
    locations = states.index,
    z = states.values,
    locationmode = 'USA-states',
    text = states.index,
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(254,254,254)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Complaints")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Complaints by State'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(100,149,237)'),
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

### What was the most common medium via which complaints were submitted?

In [23]:
pd.crosstab(consumer_data['Timely Response?'],consumer_data['Submitted Via']).iplot(kind='bar',
                                                                                    title='Timely Response by Submission Channel')

### How many consumers disputed the company response, and how many did not?

In [24]:
pd.crosstab(consumer_data['Timely Response?'], consumer_data['Consumer Disputed?']).iplot(kind='bar',
                                                                    title ='Timely Response vs Consumer Disputed' )

### Random Forest on Consumer Complaints Dataset

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They work by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
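The "mode of the classes" idea can be illustrated with a few lines of NumPy. This is only a sketch with hypothetical tree predictions, and the helper `majority_vote` is made up for this example.

```python
import numpy as np

# Hypothetical class predictions from 5 trees (rows) for 4 samples (columns).
tree_votes = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
])

def majority_vote(votes: np.ndarray) -> np.ndarray:
    """Return the most frequent class label in each column (one column per sample)."""
    return np.array([np.bincount(column).argmax() for column in votes.T])

forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so the sketch shows the textbook idea and not the library's exact procedure.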

**Importing the library to Label Encode and One-Hot Encode our categorical column**

In [25]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
onehotencoder =OneHotEncoder()

In [26]:
consumer_data['Consumer_encode']= labelencoder.fit_transform(consumer_data['Consumer Disputed?'])

In [27]:
enc = OneHotEncoder(handle_unknown='ignore')
consumer_data1 = pd.DataFrame(enc.fit_transform(consumer_data[['Product']]).toarray())
df = consumer_data.join(consumer_data1)
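The joined one-hot columns above are named with plain integers, which makes them hard to identify later. As an alternative sketch, `pd.get_dummies` (a different technique from the `OneHotEncoder` used in this notebook) produces named indicator columns. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy Product column (hypothetical rows).
toy = pd.DataFrame({'Product': ['Mortgage', 'Debt collection', 'Mortgage']})

# One indicator column per category, named '<prefix>_<category>'.
dummies = pd.get_dummies(toy['Product'], prefix='Product', dtype=int)

# Attach the indicator columns to the original frame.
encoded = toy.join(dummies)
print(encoded)
```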

**Splitting Independent features from Dependent ones**

In [28]:
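# Features: positions 24-40 hold 17 of the 18 one-hot Product columns
# (position 23, the first category, is left out)
x = df.iloc[:,24:41].values
# Target: the label-encoded 'Consumer Disputed?' column
y = df['Consumer_encode'].values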

**Splitting the dataset into the Training set and Test set**

In [29]:
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x,y, test_size =0.25, random_state =10)

**Feature Scaling**

In [30]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
# Reuse the statistics learned from the training set; do not refit on the test set
x_test = sc.transform(x_test)
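Scaling parameters are normally learned from the training data only and then reused on the test data, so that both sets are expressed on the same scale. The sketch below shows this with plain NumPy on made-up numbers; it mirrors what a standard scaler does (subtract the mean, divide by the population standard deviation) but is not the notebook's own code.

```python
import numpy as np

# Toy single-feature data (hypothetical values).
x_train_toy = np.array([[1.0], [2.0], [3.0]])
x_test_toy = np.array([[2.0], [4.0]])

# Learn the mean and standard deviation from the training set only.
mean = x_train_toy.mean(axis=0)
std = x_train_toy.std(axis=0)

# Apply the same training statistics to both sets.
x_train_scaled = (x_train_toy - mean) / std
x_test_scaled = (x_test_toy - mean) / std
print(x_test_scaled.ravel())
```

A test value equal to the training mean maps to 0, which would not hold if the test set were scaled with its own mean.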

**Fitting Random Forest Classification to the Training set**

In [31]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10,criterion='entropy', random_state =10)
classifier.fit(x_train,y_train)

**Predicting the Test set results**

In [32]:
y_pred = classifier.predict(x_test)

**Making the Confusion Matrix**

In [33]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[283465      3]
 [ 37120      1]]


**Model Accuracy Score**

In [34]:
from sklearn.metrics import accuracy_score
print('Accuracy Score:',accuracy_score(y_test,y_pred))

Accuracy Score: 0.884203762449741
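Accuracy alone can hide how the model treats the minority class, so it helps to derive a few more numbers from the confusion matrix printed above. The sketch below hard-codes that matrix and assumes class 1 corresponds to "Yes" (disputed), which is how `LabelEncoder` would order the labels "No" and "Yes".

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([[283465, 3],
               [37120, 1]])

# Assumption: class 0 = 'No', class 1 = 'Yes' (disputed).
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = (tn + fp) / total
# Share of actual disputes that the model identifies.
recall_disputed = tp / (tp + fn)
# Share of predicted disputes that are actual disputes.
precision_disputed = tp / (tp + fp)

print(f'accuracy={accuracy:.6f} baseline={baseline_accuracy:.6f}')
print(f'recall={recall_disputed:.6f} precision={precision_disputed:.2f}')
```

Under these assumptions the model predicts "No" for almost every complaint: it flags only 4 test cases as disputed, and its accuracy is marginally below that of always predicting "No".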
