### Consumer Complaints Analysis, Visualization & Prediction

The goal is to analyze the data and build a model that predicts which consumers are more likely to dispute the resolution of a complaint, i.e., to make predictions for the "Consumer disputed?" column.

**Importing packages**

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly libraries
import plotly
import plotly.express as px
import plotly.graph_objs as go
import chart_studio.plotly as py

import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode, plot
cf.go_offline()

import warnings
warnings.filterwarnings('ignore')

In [2]:
consumer_data= pd.read_csv('E:/Amit Baghel/Kumar Amit/My Projects/US Consumer Finance Complaints/Consumer_Complaints.csv')

In [3]:
consumer_data.head(2)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,3/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,3/17/2014,Closed with explanation,Yes,No,759217
1,10/1/2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10/5/2016,Closed with explanation,Yes,No,2141773


In [4]:
consumer_data.columns = consumer_data.columns.str.title()

In [5]:
# mode() returns a Series (ties are possible), so take its first entry
mode_value = consumer_data['Consumer Disputed?'].mode()[0]

In [6]:
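# Assign the result back rather than relying on inplace=True on a column selection
consumer_data['Consumer Disputed?'] = consumer_data['Consumer Disputed?'].fillna(mode_value)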

In [7]:
# Confirm that no missing values remain in the target column
assert consumer_data['Consumer Disputed?'].notna().all()
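
The mode imputation above can be checked on a small example. This is a minimal sketch on a toy Series with made-up values, not the real column; the name `disputed` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Consumer Disputed?' column (illustrative values only)
disputed = pd.Series(['No', 'Yes', np.nan, 'No', np.nan], name='Consumer Disputed?')

# mode() ignores NaN and returns a Series, so take the first entry
mode_value = disputed.mode()[0]

# Fill the gaps with the most frequent value and assign the result back
disputed = disputed.fillna(mode_value)

print(mode_value)
print(disputed.isnull().sum())
```

Filling with the mode pushes every unknown outcome into the majority class, which is worth keeping in mind when reading the class balance later on.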

In [8]:
consumer_data.isnull().mean().round(4)*100

Date Received                    0.00
Product                          0.00
Sub-Product                     26.01
Issue                            0.00
Sub-Issue                       52.83
Consumer Complaint Narrative    77.88
Company Public Response         71.46
Company                          0.00
State                            1.02
Zip Code                         1.03
Tags                            86.06
Consumer Consent Provided?      58.47
Submitted Via                    0.00
Date Sent To Company             0.00
Company Response To Consumer     0.00
Timely Response?                 0.00
Consumer Disputed?               0.00
Complaint Id                     0.00
dtype: float64

In [9]:
# Count of null values per column, largest first
total = consumer_data.isnull().sum().sort_values(ascending=False)

# Percentage of null values per column, largest first
percent = (consumer_data.isnull().sum() / consumer_data.isnull().count() * 100).sort_values(ascending=False)

# Combine the counts and percentages into one table
df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

# Show only the columns that contain at least one null value
df[df['Total'] != 0]

Unnamed: 0,Total,Percent
Tags,777945,86.057481
Consumer Complaint Narrative,704013,77.879009
Company Public Response,646002,71.461742
Consumer Consent Provided?,528549,58.468909
Sub-Issue,477597,52.83252
Sub-Product,235160,26.013764
Zip Code,9278,1.026347
State,9225,1.020484


#### Summary Statistics

In [10]:
consumer_data[['Issue','Date Received','Product','Sub-Issue','Consumer Complaint Narrative','Company',
               'Company Public Response','Consumer Consent Provided?',
               'Company Response To Consumer','Submitted Via']].describe().transpose()

Unnamed: 0,count,unique,top,freq
Issue,903983,166,"Loan modification,collection,foreclosure",112315
Date Received,903983,2176,9/8/2017,3551
Product,903983,18,Mortgage,242194
Sub-Issue,426386,217,Account status,37056
Consumer Complaint Narrative,199970,195304,I am filing this complaint because Experian ha...,103
Company,903983,4504,"BANK OF AMERICA, NATIONAL ASSOCIATION",70488
Company Public Response,257981,10,Company has responded to the consumer and the ...,147983
Consumer Consent Provided?,375434,4,Consent provided,199971
Company Response To Consumer,903983,8,Closed with explanation,686039
Submitted Via,903983,6,Web,634850


### What are the top 15 issues and sub issues?

In [11]:
sns.set(style='white')
consumer_data['Issue'].str.strip("'").value_counts()[0:15].iplot(kind='bar',title='Top 15 issues',fontsize=14,color='orange')

In [12]:
consumer_data['Sub-Issue'].str.strip("'").value_counts()[0:15].iplot(kind ='bar',
                                                                     title='Top 15 Sub Issues',fontsize=14,color='#9370DB')

In [13]:
consumer_data['Company'].str.strip("'").value_counts()[0:15].iplot(kind='bar',
                                                          title='Top 15 Companies',fontsize=14,color='purple')

### In which month did most complaints occur and on which day of the week are most complaints received?

In this section, we extract the date features from the Date Received field.

In [14]:
from datetime import datetime

In [15]:
consumer_data['Date'] =pd.to_datetime(consumer_data['Date Received'])

# Extract the year
consumer_data['Year'] = consumer_data['Date'].dt.year

# Extract the month name (English is the default, so no locale argument is needed;
# the locale name 'English' is only recognized on Windows)
consumer_data['Month'] = consumer_data['Date'].dt.month_name()

# Extract the day of the week
consumer_data['Week_Days'] = consumer_data['Date'].dt.day_name()
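
To see what these date features look like, here is a minimal sketch that applies the same extraction to the two dates from the sample rows shown earlier. The `sample` frame is a stand-in for illustration, and the explicit `format` assumes the dates are always written month/day/year.

```python
import pandas as pd

# The two 'Date received' values from the head(2) output
sample = pd.DataFrame({'Date Received': ['3/12/2014', '10/1/2016']})

# An explicit format avoids any day/month ambiguity (assumes month/day/year)
sample['Date'] = pd.to_datetime(sample['Date Received'], format='%m/%d/%Y')

sample['Year'] = sample['Date'].dt.year
sample['Month'] = sample['Date'].dt.month_name()    # English names by default
sample['Week_Days'] = sample['Date'].dt.day_name()

print(sample[['Year', 'Month', 'Week_Days']])
```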


In [16]:
consumer_data['Week_Days'].value_counts().iplot(kind ='barh',title ='Number of Complaints per Weekday')

In [17]:
pd.crosstab(consumer_data['Year'],consumer_data['Month']).iplot(kind='bar',barmode='stack',
                                                        title='Number of Complaints per Month')

### What is the most common response received from companies?

In [18]:
grouped = consumer_data.groupby(['Company Response To Consumer']).size()
pie_chart = go.Pie(labels=grouped.index,values=grouped,
                  title='Company Response to the Customer')
iplot([pie_chart])

### Which state received the largest number of complaints?

In [19]:
states = consumer_data['State'].value_counts()

scl = [
    [0.0, 'rgb(242,240,247)'],
    [0.2, 'rgb(218,218,235)'],
    [0.4, 'rgb(188,189,220)'],
    [0.6, 'rgb(158,154,200)'],
    [0.8, 'rgb(117,107,177)'],
    [1.0, 'rgb(84,39,143)']
]

data = [go.Choropleth(
    colorscale = scl,
    autocolorscale = False,
    locations = states.index,
    z = states.values,
    locationmode = 'USA-states',
    text = states.index,
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(254,254,254)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Complaints")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Complaints by State'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(100,149,237)'),
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

### What was the most common medium via which complaints were submitted?

In [20]:
pd.crosstab(consumer_data['Timely Response?'],consumer_data['Submitted Via']).iplot(kind='bar',
                                                                                    title='Timely Response by Submission Channel')

### How many consumers disputed the company response, and how many did not?

In [21]:
pd.crosstab(consumer_data['Timely Response?'], consumer_data['Consumer Disputed?']).iplot(kind='bar',
                                                                    title ='Timely Response vs Consumer Disputed' )

### Random Forest on Consumer Complaints Dataset

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
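
The "mode of the classes" step in this definition can be sketched in a few lines. The tree predictions below are hypothetical, and `majority_vote` is a helper written for this illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the textbook idea, not the library's exact procedure.

```python
from collections import Counter

# Hypothetical class predictions from three trees for four samples
tree_predictions = [
    ['No', 'Yes', 'No', 'No'],   # tree 1
    ['No', 'No', 'Yes', 'No'],   # tree 2
    ['Yes', 'Yes', 'No', 'No'],  # tree 3
]

def majority_vote(predictions):
    """Return the most common class per sample across all trees."""
    per_sample = zip(*predictions)  # regroup the votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```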

**Importing the library to Label Encode and One-Hot Encode our categorical column**

In [22]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
onehotencoder =OneHotEncoder()

In [23]:
# Label Encoding the Consumer Disputed? column
consumer_data['Consumer_encode']= labelencoder.fit_transform(consumer_data['Consumer Disputed?'])

In [24]:
enc = OneHotEncoder(handle_unknown='ignore')
consumer_data1 = pd.DataFrame(enc.fit_transform(consumer_data[['Product']]).toarray())
df = consumer_data.join(consumer_data1)
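
As a rough illustration of what one-hot encoding produces, the sketch below encodes a toy `Product` column. It swaps in `pd.get_dummies` for `OneHotEncoder` because it returns named columns directly, which makes the result easier to read; the `toy` frame and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Product column (illustrative values only)
toy = pd.DataFrame({'Product': ['Mortgage', 'Credit reporting',
                                'Mortgage', 'Debt collection']})

# One 0/1 indicator column per distinct product, named after the category
dummies = pd.get_dummies(toy['Product'], prefix='Product', dtype=int)

# Attach the indicator columns to the original frame
encoded = toy.join(dummies)
print(encoded.columns.tolist())
```

With named columns, features can later be selected by name instead of by position.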

**Splitting Independent features from Dependent ones**

In [25]:
# Positions 23-40 hold the 18 one-hot Product columns; starting at 24 skips the first one
x = df.iloc[:,24:41].values
y = df['Consumer_encode'].values

**Splitting the dataset into the Training set and Test set**

In [26]:
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x,y, test_size =0.25, random_state =10)

**Feature Scaling**

In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
# Apply the scaling learned from the training set; do not refit on the test set
x_test = sc.transform(x_test)
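
Standardization should use statistics learned from the training set only, so that both sets are put on the same scale. Here is a minimal NumPy sketch of that idea on toy arrays (the `_toy` names and values are made up for illustration); it mirrors what `StandardScaler` does when `fit_transform` is called on the training data and `transform` on the test data.

```python
import numpy as np

# Toy feature columns (illustrative values only)
x_train_toy = np.array([[0.0], [2.0], [4.0]])
x_test_toy = np.array([[2.0], [6.0]])

# Learn the mean and standard deviation from the training data only
mean = x_train_toy.mean(axis=0)
std = x_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both sets
x_train_scaled = (x_train_toy - mean) / std
x_test_scaled = (x_test_toy - mean) / std

print(x_train_scaled.ravel())
print(x_test_scaled.ravel())
```

A test value equal to the training mean maps to 0, regardless of what the rest of the test set looks like.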

**Fitting Random Forest Classification to the Training set**

In [28]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10,criterion='entropy', random_state =10)
classifier.fit(x_train,y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=10)

**Predicting the Test set results**

In [29]:
y_pred = classifier.predict(x_test)

**Making the Confusion Matrix**

In [30]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[188549      0]
 [ 37447      0]]


**Model Accuracy Score**

In [31]:
from sklearn.metrics import accuracy_score
print('Accuracy Score:',accuracy_score(y_test,y_pred))

Accuracy Score: 0.8343023770332219
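
The accuracy score is easier to interpret next to the confusion matrix. The second column of the matrix is empty, which means the classifier never predicts class 1 (`Yes`, a dispute). A short sketch using only the numbers printed above shows that the accuracy equals the share of the majority class and that recall for disputes is zero; the variable names are introduced here for illustration.

```python
# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = [[188549, 0],
      [37447, 0]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
majority_share = (tn + fp) / total   # accuracy of always predicting class 0 ('No')
recall_disputed = tp / (tp + fn)     # share of actual disputes that were predicted

print(round(accuracy, 4), round(majority_share, 4), recall_disputed)
```

In other words, the model does no better here than always answering `No`, so accuracy alone is not a useful measure on this imbalanced target.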


**Thank You !!!**