The goal of this notebook is to generate visualizations that portray our statistical findings while being easy to read and aesthetically pleasing.

Step 1. Dig through the data to find statistically significant and interesting correlations. This is the messier of the EDA steps.

In [1]:
#Importing EDA type modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#importing clean_df
df = pd.read_csv('Peruvian_Bank_Data/clean_df.csv', header = 0)
df.head()

Unnamed: 0,age,job,marital,education,in_default,avg_yearly_balance,housing_loan,personal_loan,contact_method,day,month,duration,campaign_contacts,prev_days,previous_contacts,prev_outcome,term_deposit
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,5,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,5,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,5,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,5,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,5,198,1,-1,0,unknown,no


In [3]:
#let's start with age
print('Avg age of no : {} \nAvg age of yes: {}'.format(df.age.loc[df.term_deposit == 'no'].mean(), df.age.loc[df.term_deposit == 'yes'].mean()))

Avg age of no : 40.85346751058695 
Avg age of yes: 41.743717728055074


In [42]:
#order of jobs for both
#yes split
df.loc[df.term_deposit == 'yes'].groupby('job')['term_deposit'].count().sort_values(ascending = False)

job
management       1432
technician        923
blue-collar       777
admin.            689
retired           570
services          407
student           288
unemployed        215
self-employed     207
entrepreneur      138
housemaid         123
unknown            41
Name: term_deposit, dtype: int64

In [43]:
df.loc[df.term_deposit == 'no'].groupby('job')['term_deposit'].count().sort_values(ascending = False)

job
blue-collar      9901
management       8995
technician       7442
admin.           4960
services         4164
retired          1924
self-employed    1555
entrepreneur     1517
housemaid        1229
unemployed       1216
student           734
unknown           285
Name: term_deposit, dtype: int64

In [106]:
#This gives me the idea of building dictionaries of the yes/no proportions per job, stored as floats
jobs_total = df.job.value_counts().to_dict()
jobs_yes = df.job.loc[df.term_deposit == 'yes'].value_counts().to_dict()
jobs_no = df.job.loc[df.term_deposit == 'no'].value_counts().to_dict()

jobs_no_perc = {}
jobs_yes_perc = {}
for key in jobs_total:
    jobs_no_perc[key] = round(jobs_no[key]/jobs_total[key], 4)
    jobs_yes_perc[key] = round(jobs_yes[key]/jobs_total[key], 4)

print(jobs_no_perc)

print(jobs_yes_perc)

{'blue-collar': 0.9272, 'management': 0.8627, 'technician': 0.8897, 'admin.': 0.878, 'services': 0.911, 'retired': 0.7715, 'self-employed': 0.8825, 'entrepreneur': 0.9166, 'unemployed': 0.8498, 'housemaid': 0.909, 'student': 0.7182, 'unknown': 0.8742}
{'blue-collar': 0.0728, 'management': 0.1373, 'technician': 0.1103, 'admin.': 0.122, 'services': 0.089, 'retired': 0.2285, 'self-employed': 0.1175, 'entrepreneur': 0.0834, 'unemployed': 0.1502, 'housemaid': 0.091, 'student': 0.2818, 'unknown': 0.1258}


It is too early to draw any real conclusions here, although we do see that a surprisingly high share of students (28.18%) signed up for term deposits, followed by retirees (22.85%).
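
The per-category proportions built with the dictionaries above can also be obtained in a single call. The following is a minimal sketch, assuming pandas' `crosstab` with `normalize='index'`; the helper name `subscription_rates` and the tiny `toy` frame are hypothetical and only stand in for `df`.

```python
import pandas as pd

def subscription_rates(frame, column, target="term_deposit"):
    """Return the share of each target value within each category of `column`.

    Hypothetical helper, not part of the original notebook.
    """
    # normalize="index" divides each row by its row total, so every row
    # sums to 1 (the same ratios as the *_perc dictionaries).
    rates = pd.crosstab(frame[column], frame[target], normalize="index")
    return rates.round(4)

# Toy stand-in for df (made-up rows, for illustration only).
toy = pd.DataFrame({
    "job": ["student", "student", "retired", "retired", "retired", "retired"],
    "term_deposit": ["yes", "no", "yes", "no", "no", "no"],
})

rates = subscription_rates(toy, "job")
print(rates)
```

On the real data the equivalent call would be `subscription_rates(df, 'job')`, and the same helper should work unchanged for other categorical columns such as `marital`, `education`, or `in_default`.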

In [102]:
#let's take a look at marital data

print('Total marital data:\n{}\n'.format(df.marital.value_counts()))
print('Yes marital data: \n{}\n'.format(df.marital.loc[df.term_deposit =='yes'].value_counts()))
print('No marital data: \n{}\n'.format(df.marital.loc[df.term_deposit =='no'].value_counts()))

Total marital data:
married     30011
single      13986
divorced     5735
Name: marital, dtype: int64

Yes marital data: 
married     3032
single      2079
divorced     699
Name: marital, dtype: int64

No marital data: 
married     26979
single      11907
divorced     5036
Name: marital, dtype: int64



Just from eyeballing the data here, we see that a higher proportion of single customers have subscribed to a term deposit than married customers. However, this may be because only one partner in a married couple opens the term deposit, using the couple's combined money. Since I don't have data on the individual partners, I can't say that marital status leans either way. We can still go ahead and compute the ratios again for exploration.
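
One way to move beyond eyeballing is to put an interval around the gap between the two subscription rates. The sketch below is an illustration rather than part of the original analysis: it uses the counts printed above and assumes a normal-approximation (Wald) interval with each row treated as an independent customer. It does not address the shared-household concern raised above.

```python
import math

# Counts taken from the marital output above.
single_yes, single_total = 2079, 13986
married_yes, married_total = 3032, 30011

p_single = single_yes / single_total
p_married = married_yes / married_total
diff = p_single - p_married

# Standard error of the difference between two independent proportions.
se = math.sqrt(
    p_single * (1 - p_single) / single_total
    + p_married * (1 - p_married) / married_total
)

# 1.96 is the standard normal quantile for an approximate 95% interval.
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"single - married: {diff:.4f} (95% CI {low:.4f} to {high:.4f})")
```

If the interval excludes zero, the gap is unlikely to be sampling noise alone, although that says nothing about why the gap exists.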

In [108]:
mar_total = df.marital.value_counts().to_dict()
mar_yes = df.marital.loc[df.term_deposit == 'yes'].value_counts().to_dict()
mar_no = df.marital.loc[df.term_deposit == 'no'].value_counts().to_dict()

mar_no_perc = {}
mar_yes_perc = {}
for key in mar_total:
    mar_no_perc[key] = round(mar_no[key]/mar_total[key], 4)
    mar_yes_perc[key] = round(mar_yes[key]/mar_total[key], 4)

print("Marital status, No: {}\n".format(mar_no_perc))

print("Marital status, Yes: {}\n".format(mar_yes_perc))

Marital status, No: {'married': 0.899, 'single': 0.8514, 'divorced': 0.8781}

Marital status, Yes: {'married': 0.101, 'single': 0.1486, 'divorced': 0.1219}



In [114]:
#same exact thing, but for education now. We seem to have found a groove.

print('Total educational data:\n{}\n'.format(df.education.value_counts()))
print('Yes educational data: \n{}\n'.format(df.education.loc[df.term_deposit =='yes'].value_counts()))
print('No educational data: \n{}\n'.format(df.education.loc[df.term_deposit =='no'].value_counts()))

edu_total = df.education.value_counts().to_dict()
edu_yes = df.education.loc[df.term_deposit == 'yes'].value_counts().to_dict()
edu_no = df.education.loc[df.term_deposit == 'no'].value_counts().to_dict()

edu_no_perc = {}
edu_yes_perc = {}
for key in edu_total:
    edu_no_perc[key] = round(edu_no[key]/edu_total[key], 4)
    edu_yes_perc[key] = round(edu_yes[key]/edu_total[key], 4)

print("Educational status, No: {}\n".format(edu_no_perc))

print("Educational status, Yes: {}\n".format(edu_yes_perc))

Total educational data:
secondary    25508
tertiary     14651
primary       7529
unknown       2044
Name: education, dtype: int64

Yes educational data: 
secondary    2695
tertiary     2189
primary       655
unknown       271
Name: education, dtype: int64

No educational data: 
secondary    22813
tertiary     12462
primary       6874
unknown       1773
Name: education, dtype: int64

Educational status, No: {'secondary': 0.8943, 'tertiary': 0.8506, 'primary': 0.913, 'unknown': 0.8674}

Educational status, Yes: {'secondary': 0.1057, 'tertiary': 0.1494, 'primary': 0.087, 'unknown': 0.1326}



In [115]:
#I have a hunch about defaulting, but let's see what the data holds. 

print('Total default data:\n{}\n'.format(df.in_default.value_counts()))
print('Yes default data: \n{}\n'.format(df.in_default.loc[df.term_deposit =='yes'].value_counts()))
print('No default data: \n{}\n'.format(df.in_default.loc[df.term_deposit =='no'].value_counts()))

dft_total = df.in_default.value_counts().to_dict()
dft_yes = df.in_default.loc[df.term_deposit == 'yes'].value_counts().to_dict()
dft_no = df.in_default.loc[df.term_deposit == 'no'].value_counts().to_dict()

dft_no_perc = {}
dft_yes_perc = {}
for key in dft_total:
    dft_no_perc[key] = round(dft_no[key]/dft_total[key], 4)
    dft_yes_perc[key] = round(dft_yes[key]/dft_total[key], 4)

print("In Default status, No: {}\n".format(dft_no_perc))

print("In Default status, Yes: {}\n".format(dft_yes_perc))

Total default data:
no     48841
yes      891
Name: in_default, dtype: int64

Yes default data: 
no     5749
yes      61
Name: in_default, dtype: int64

No default data: 
no     43092
yes      830
Name: in_default, dtype: int64

In Default status, No: {'no': 0.8823, 'yes': 0.9315}

In Default status, Yes: {'no': 0.1177, 'yes': 0.0685}



Pretty substantial difference: only 6.85% of customers in default subscribed, versus 11.77% of customers not in default. I foresee it being a strong feature for the model.
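
Whether the gap is statistically significant can be checked with a chi-square test of independence on the 2x2 table printed above. This is a hedged sketch, not the notebook's own method: it computes the Pearson statistic by hand, without a continuity correction, and uses the fact that for 1 degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`. A significant result would not by itself make `in_default` a strong predictor, since only 891 customers are in default.

```python
import math

# Observed counts from the output above:
# rows = in_default (no, yes), columns = term_deposit (no, yes).
observed = [[43092, 5749],
            [830, 61]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected,
# where expected = row_total * col_total / n under independence.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, where the chi-square
# survival function reduces to erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```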

In [122]:
#avg_yearly_balance is next. This numerical data will be easier to read with visuals, but let's get some basic info from it first.
print('Avg_yearly_balance of no : {} \nAvg_yearly_balance of yes: {}'.format(df.avg_yearly_balance.loc[df.term_deposit == 'no'].mean(), \
                                                                             df.avg_yearly_balance.loc[df.term_deposit == 'yes'].mean()))
print('Min/Max no: {}/{}\nMin/Max yes:{}/{}'.format(df.avg_yearly_balance.loc[df.term_deposit == 'no'].min(), \
                                                    df.avg_yearly_balance.loc[df.term_deposit == 'no'].max(), \
                                                    df.avg_yearly_balance.loc[df.term_deposit == 'yes'].min(), \
                                                    df.avg_yearly_balance.loc[df.term_deposit == 'yes'].max()))
print('Total STD; {}\nNo STD: {}\nYes STD:{}'.format(df.avg_yearly_balance.std(), \
                                                     df.avg_yearly_balance.loc[df.term_deposit == 'no'].std(), \
                                                    df.avg_yearly_balance.loc[df.term_deposit == 'yes'].std()))

Avg_yearly_balance of no : 1312.7761941623787 
Avg_yearly_balance of yes: 1783.4358003442342
Min/Max no: -8019/102127
Min/Max yes:-3058/81204
Total STD; 3041.6087657666208
No STD: 2983.651008144628
Yes STD:3420.1800572166817


It's hard to tell what's meaningful here. Clearly a negative balance hasn't stopped customers from signing up for a term deposit, or at least a negative average balance hasn't. The customers who subscribed to a term deposit also have the larger standard deviation (3420 vs. 2984), which suggests that average balance may have less to do with subscription than one would think. However, this could simply be because there are fewer customers in this group.
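
A rough way to judge whether the difference in mean balance is more than noise is Welch's t statistic, which does not assume equal variances or equal group sizes. The sketch below is an illustration under stated assumptions: it plugs in the summary statistics printed above, takes the group sizes from the earlier `in_default` counts, and treats the printed standard deviations as sample standard deviations (the pandas default). Because the balances range from -8019 to 102127, the distribution is probably heavily skewed, so a comparison of medians or a rank-based test might be more robust.

```python
import math

# Summary statistics copied from the output above.
mean_no, std_no = 1312.7761941623787, 2983.651008144628
mean_yes, std_yes = 1783.4358003442342, 3420.1800572166817

# Group sizes from the earlier in_default counts (43092 + 830 and 5749 + 61).
n_no, n_yes = 43922, 5810

# Welch's t: difference in means divided by the unpooled standard error.
se = math.sqrt(std_no ** 2 / n_no + std_yes ** 2 / n_yes)
t_stat = (mean_yes - mean_no) / se
print(f"difference in means = {mean_yes - mean_no:.2f}, Welch t = {t_stat:.2f}")
```

With groups this large, a t statistic well above 2 would indicate that the higher mean balance among subscribers is unlikely to come from the smaller group size alone.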