Classify whether the application was accepted or not using logistic regression

card
Factor. Was the application for a credit card accepted?

reports
Number of major derogatory reports.

age
Age in years plus twelfths of a year.

income
Yearly income (in USD 10,000).

share
Ratio of monthly credit card expenditure to yearly income.

expenditure
Average monthly credit card expenditure.

owner
Factor. Does the individual own their home?

selfemp
Factor. Is the individual self-employed?

dependents
Number of dependents.

months
Months living at current address.

majorcards
Number of major credit cards held.

active
Number of active credit accounts.

Output variable -> y
y -> Whether the client has subscribed to a term deposit or not
Binary ("yes" or "no")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
df = pd.read_csv('bank-full.csv')

In [None]:
df1 = df.copy()

In [None]:
df1

In [None]:
df1.describe()

In [None]:
sns.pairplot(df1)

In [None]:
df1['y'].value_counts()

In [None]:
count_no_sub = len(df1[df1['y']=="no"])
count_sub = len(df1[df1['y']=="yes"])

In [None]:
(count_sub / (count_sub + count_no_sub))*100

The percentage of clients who subscribed is 11.70% in the current data set.
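The same share can be read directly from value_counts with normalize=True. The snippet below is a minimal sketch on a made-up Series (y_toy is a hypothetical stand-in for df1['y'], not the real data):

```python
import pandas as pd

# Toy stand-in for df1['y']; the real column comes from bank-full.csv.
y_toy = pd.Series(["no", "no", "no", "yes", "no", "yes", "no", "no"])

# normalize=True returns class shares instead of raw counts.
shares = y_toy.value_counts(normalize=True) * 100
pct_subscribed = shares["yes"]
print(f"Subscribed: {pct_subscribed:.2f} %")
```

A strongly imbalanced outcome like this one means plain accuracy should be read against the share of the majority class.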

In [None]:
pd.crosstab(df1.job,df1.y).plot(kind='bar')
plt.title('Subscribed Frequency for Job Title')
plt.xlabel('Job')
plt.ylabel('Frequency of subscription')

The frequency of subscription varies considerably with job title, so job title may be a good predictor of the outcome variable.

In [None]:
table=pd.crosstab(df1.marital,df1.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Marital Status vs Subscribed')
plt.xlabel('Marital Status')
plt.ylabel('Proportion of Customers')

Marital status seems to be a strong predictor for the outcome variable.
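The row-wise proportions used for the stacked bar charts can also be obtained with the normalize="index" option of pd.crosstab. The sketch below uses a small invented frame (toy is hypothetical) and checks that both routes give the same table:

```python
import pandas as pd

# Invented rows; the real notebook uses df1.marital and df1.y.
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "single", "married"],
    "y": ["no", "yes", "no", "no", "no", "yes"],
})

# Route 1: raw counts divided by each row total, as in the cells above.
table = pd.crosstab(toy.marital, toy.y)
manual = table.div(table.sum(axis=1).astype(float), axis=0)

# Route 2: let crosstab normalise each row itself.
builtin = pd.crosstab(toy.marital, toy.y, normalize="index")
print(builtin)
```

Either table can be passed to .plot(kind='bar', stacked=True); each bar then sums to 1, which makes groups of different sizes comparable.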

In [None]:
table=pd.crosstab(df1.education,df1.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Education vs Subscribed')
plt.xlabel('Education')
plt.ylabel('Proportion of Customers')

Education seems to be a strong predictor for the outcome variable.

In [None]:
table=pd.crosstab(df1.contact,df1.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Contact vs Subscribed')
plt.xlabel('Contact')
plt.ylabel('Proportion of Customers')

Contact does not seem to be a strong predictor for the outcome variable.

In [None]:
table=pd.crosstab(df1.poutcome,df1.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Poutcome vs Subscribed')
plt.xlabel('Poutcome')
plt.ylabel('Proportion of Customers')

Poutcome does not seem to be a strong predictor for the outcome variable.

In [None]:
df1.age.hist()
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

Most customers are between 20 and 50 years old.

In [None]:
pd.crosstab(df1.month,df1.y).plot(kind='bar')
plt.title('Subscribed Frequency for Month')
plt.xlabel('Month')
plt.ylabel('Frequency of Subscribed')

Month might be a good predictor of the outcome variable


In [None]:
df1.duration.hist()
plt.title('Histogram of Duration')
plt.xlabel('Duration')
plt.ylabel('Frequency')

In [None]:
df1['housing'].value_counts()

The data is fairly evenly split on whether the client has a housing loan or not.

In [None]:
df1['loan'].value_counts()

However, the majority of clients do not have a personal loan.


# 3 - Cleaning Data

In [None]:
df1.isnull().sum()

There are no null values in any column, so no imputation or row removal is needed.

# 4 - Logistic Regression Model

In [None]:
df1['default'] = df1['default'].map({'yes': 1, 'no': 0})

In [None]:
df1['housing'] = df1['housing'].map({'yes': 1, 'no': 0})

In [None]:
df1['loan'] = df1['loan'].map({'yes': 1, 'no': 0})

In [None]:
df1['y'] = df1['y'].map({'yes': 1, 'no': 0})

In [None]:
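# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
df1 = pd.get_dummies(df1, columns=['job'], dtype=int)

In [None]:
df1 = pd.get_dummies(df1, columns=['marital'], dtype=int)

In [None]:
df1 = pd.get_dummies(df1, columns=['education'], dtype=int)

In [None]:
df1 = pd.get_dummies(df1, columns=['month'], dtype=int)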

In [None]:
df1 = df1.drop(['contact', 'poutcome'], axis=1)
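The encoding steps above can be sketched on a small invented frame (toy and its values are hypothetical). The sketch also shows one reason to re-check for nulls after mapping: any value outside the yes/no mapping becomes NaN.

```python
import pandas as pd

# Invented rows with the same kinds of columns as the bank data.
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "job": ["student", "retired", "student"],
    "y": ["no", "yes", "no"],
})

# Binary yes/no columns: map to 1/0. Values outside the mapping become NaN,
# so a null check afterwards catches unexpected categories.
for col in ["housing", "y"]:
    toy[col] = toy[col].map({"yes": 1, "no": 0})
assert toy[["housing", "y"]].notna().all().all()

# Multi-level columns: one-hot encode; dtype=int keeps the dummies numeric.
encoded = pd.get_dummies(toy, columns=["job"], dtype=int)
print(encoded.columns.tolist())
```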

In [None]:
X = df1.loc[:, df1.columns != 'y']
y = df1.loc[:, df1.columns == 'y']

In [None]:
logreg = LogisticRegression()

In [None]:
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(X, y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)

Recursive Feature Elimination (RFE) marks the 20 features it keeps as True in support_ (rank 1 in ranking_). The variables marked False can be excluded.
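Rather than copying the selected names by hand, the boolean mask in support_ can index the column labels directly. The following is a minimal sketch on synthetic data from make_classification (X_toy, the f0..f7 column names, and the choice of 4 features are all illustrative assumptions, not the notebook's data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded bank data.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# n_features_to_select is passed by keyword.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_toy, y_arr)

# support_ is a boolean mask over the columns; kept features have ranking_ == 1.
selected = X_toy.columns[selector.support_].tolist()
print(selected)
```

With the real data, the equivalent of X_toy[selected] would replace the hand-typed column list.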

In [None]:
X = df1[['default', 'housing', 'loan', 'job_housemaid', 'job_retired', 'job_student', 'marital_married', 'education_primary', 'education_unknown', 'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep']]
y = df1.loc[:, df1.columns == 'y']


In [None]:
logit=sm.Logit(y,X)
result = logit.fit()

In [None]:
result.summary()

All variables have significant p-values.

In [None]:
logreg.fit(X, y.values.ravel())

In [None]:
y_pred = logreg.predict(X)
print('Accuracy of logistic regression classifier on the training data: {:.2f}'.format(logreg.score(X, y)))
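The score above is computed on the same rows the model was fitted on. A held-out split gives an estimate for unseen clients. The sketch below illustrates the idea on synthetic, imbalanced data (the data, the 70/30 split, and all variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 12% positives, loosely mimicking the class balance.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, weights=[0.88, 0.12], random_state=0
)

# stratify keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_arr, y_arr, test_size=0.3, stratify=y_arr, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```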

In [None]:
print(classification_report(y, y_pred))

In [None]:
confusion_matrix(y, y_pred)

# 5 - Output Interpretation
1 - Confusion Matrix
The confusion matrix shows 39455 + 456 = 39911 correct predictions and 4833 + 467 = 5300 incorrect predictions.

2 - Accuracy == 84%
Accuracy is the share of clients in the data set that the model classifies correctly. It does not mean that 84% of the clients will subscribe.
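The counts quoted for the confusion matrix can be turned into the usual metrics by hand. The sketch below assumes that 4833 are subscribers the model missed and 467 are non-subscribers it flagged, which is the assignment consistent with the 11.70% subscription rate reported earlier ((456 + 4833) / 45211). The accuracy implied by these counts is about 88%, which is worth comparing with the value printed by logreg.score.

```python
# Counts quoted in the confusion-matrix paragraph.
tn, tp = 39455, 456   # correct "no" and correct "yes" predictions
fn, fp = 4833, 467    # assumed: missed subscribers, false alarms

total = tn + tp + fn + fp
accuracy = (tn + tp) / total      # share of correct predictions
recall = tp / (tp + fn)           # share of actual subscribers found
precision = tp / (tp + fp)        # share of predicted subscribers that are real
baseline = (tn + fp) / total      # accuracy of always predicting "no"

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} baseline={baseline:.3f}")
```

Under that assumption, accuracy sits almost exactly at the always-"no" baseline while recall for subscribers is low, so the classification report is more informative than accuracy alone for this imbalanced outcome.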