-------

# Bank Marketing Data Set

## Abstract

The data relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

## Data Set Information

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

## Attribute Information

### Input variables:
    
#### bank client data:

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

#### related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

#### other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

#### social and economic context attributes

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

### Output variable (desired target):

21 - y: has the client subscribed to a term deposit? (binary: 'yes','no')
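The notes on `duration` and `pdays` above suggest two small preprocessing steps before fitting a realistic model. The sketch below applies them to a tiny made-up frame; the rows and the helper column name `previously_contacted` are assumptions for illustration and are not part of the data set.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real file; the values are invented.
toy = pd.DataFrame({
    "age": [35, 52, 41],
    "duration": [120, 0, 600],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# 'duration' is only known after the call ends, so it is dropped
# together with the target when building realistic model inputs.
features = toy.drop(columns=["duration", "y"])

# 999 is a sentinel for "not previously contacted": keep that fact as a
# flag (hypothetical column name) and replace the sentinel with NaN so it
# is not treated as a real number of days.
features["previously_contacted"] = (features["pdays"] != 999).astype(int)
features["pdays"] = features["pdays"].replace(999, np.nan)

print(features)
```

Keeping the flag separate from the day count is one possible design; whether it helps depends on the model being used.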

------

In [1]:
import pandas as pd
import numpy as np
data= pd.read_csv('data_after_imputing_unknown_value.csv')

# Statistical Test of Categorical Columns
# Job Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of job of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of job of the client on the term deposit subscription.

In [2]:
c2= pd.crosstab(data['job'],data['y'])
c2

| job | no | yes |
|---|---:|---:|
| admin. | 9152 | 1362 |
| blue-collar | 8730 | 643 |
| entrepreneur | 1333 | 124 |
| housemaid | 958 | 106 |
| management | 2602 | 328 |
| retired | 1328 | 446 |
| self-employed | 1275 | 149 |
| services | 3659 | 323 |
| student | 605 | 281 |
| technician | 6034 | 734 |


In [3]:
import scipy.stats as stats
stats.chi2_contingency(c2)

(987.5156293968546,
 9.14272721732315e-206,
 10,
 array([[9329.55404487, 1184.44595513],
        [8317.09245411, 1055.90754589],
        [1292.86287268,  164.13712732],
        [ 944.13596193,  119.86403807],
        [2599.92327862,  330.07672138],
        [1574.15150044,  199.84849956],
        [1263.58046033,  160.41953967],
        [3533.41109061,  448.58890939],
        [ 786.18840439,   99.81159561],
        [6005.55656987,  762.44343013],
        [ 901.54336214,  114.45663786]]))

__Observation-__ Since the p-value (9.14272721732315e-206) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the client's job on the term deposit subscription.

### Post-Hoc Test of the job column

In [4]:
d2= pd.get_dummies(data['job'])
d2.head()

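| | admin. | blue-collar | entrepreneur | housemaid | management | retired | self-employed | services | student | technician | unemployed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |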


In [5]:
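nl = "\n"
for series in d2:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this job
    crosstab = pd.crosstab(d2[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c2, so the statistics printed
    # below are identical for every job and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c2)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")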

y             no       yes
admin.                    
0       0.893134  0.106866
1       0.870458  0.129542 

Chi2 value= 987.5156293968546
p-value= 9.14272721732315e-206
Degrees of freedom= 10

y                  no       yes
blue-collar                    
0            0.874367  0.125633
1            0.931399  0.068601 

Chi2 value= 987.5156293968546
p-value= 9.14272721732315e-206
Degrees of freedom= 10

y                   no       yes
entrepreneur                    
0             0.886336  0.113664
1             0.914894  0.085106 

Chi2 value= 987.5156293968546
p-value= 9.14272721732315e-206
Degrees of freedom= 10

y                no       yes
housemaid                    
0          0.887000  0.113000
1          0.900376  0.099624 

Chi2 value= 987.5156293968546
p-value= 9.14272721732315e-206
Degrees of freedom= 10

y                 no       yes
management                    
0           0.887292  0.112708
1           0.888055  0.111945 

Chi2 value= 987.5156293968546
p-value=

__Observation-__ From the above tables, we can infer that the subscription rate is highest among clients who are __RETIRED (25%)__ and __STUDENT (32%)__. These are rates within each job, not shares of all subscribers: in absolute numbers, most 'yes' answers come from the larger job groups such as admin.

### Conclusion of Job Column
There is a __significant effect__ of the client's job on the term deposit subscription, and clients who are RETIRED or STUDENT are the most likely to say 'yes'.

# Age_Cat Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of age category of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of age category of the client on the term deposit subscription.

In [6]:
c1= pd.crosstab(data['age_cat'],data['y'])
c1

| age_cat | no | yes |
|---|---:|---:|
| Adult | 9004 | 995 |
| Mature | 9344 | 796 |
| Old | 8552 | 1321 |
| Young | 9648 | 1528 |


In [7]:
stats.chi2_contingency(c1)

(244.52160929053852,
 1.0015466976149123e-52,
 3,
 array([[8872.57094299, 1126.42905701],
        [8997.68670487, 1142.31329513],
        [8760.76536855, 1112.23463145],
        [9916.97698359, 1259.02301641]]))

__Observation-__ Since the p-value (1.0015466976149123e-52) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the client's age category on the term deposit subscription.

### Post-Hoc Test of the age category column

In [8]:
d1= pd.get_dummies(data['age_cat'])
d1.head()

| | Adult | Mature | Old | Young |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 |


In [9]:
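nl = "\n"
for series in d1:
    # Row-normalized rates of 'no'/'yes' for clients inside (1) and outside (0) this age category
    crosstab = pd.crosstab(d1[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c1, so the statistics printed
    # below are identical for every category and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c1)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")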

y            no       yes
Adult                    
0      0.883132  0.116868
1      0.900490  0.099510 

Chi2 value= 244.52160929053852
p-value= 1.0015466976149123e-52
Degrees of freedom= 3

y             no       yes
Mature                    
0       0.876192  0.123808
1       0.921499  0.078501 

Chi2 value= 244.52160929053852
p-value= 1.0015466976149123e-52
Degrees of freedom= 3

y          no       yes
Old                    
0    0.894012  0.105988
1    0.866201  0.133799 

Chi2 value= 244.52160929053852
p-value= 1.0015466976149123e-52
Degrees of freedom= 3

y            no       yes
Young                    
0      0.896308  0.103692
1      0.863278  0.136722 

Chi2 value= 244.52160929053852
p-value= 1.0015466976149123e-52
Degrees of freedom= 3



__Observation-__ From the above tables, we can infer that the subscription rate is highest in the __OLD (13%)__ and __YOUNG (14%)__ categories.

### Conclusion of Age Category Column
There is a __significant effect__ of the client's age category on the term deposit subscription, and clients in the OLD and YOUNG categories are the most likely to say 'yes'.
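The loop above prints the overall chi-square statistic once per level. One possible per-level post-hoc test, sketched here as an assumption rather than as the notebook's method, compares each age category against all the others in a 2x2 table and applies a Bonferroni-adjusted threshold. The counts are copied from the `age_cat` crosstab; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# (no, yes) counts per age category, copied from the crosstab above.
counts = {
    "Adult": (9004, 995),
    "Mature": (9344, 796),
    "Old": (8552, 1321),
    "Young": (9648, 1528),
}

table = np.array(list(counts.values()))
totals = table.sum(axis=0)        # overall (no, yes) counts
alpha = 0.05 / len(counts)        # Bonferroni-adjusted significance level

results = {}
for i, level in enumerate(counts):
    this_level = table[i]
    rest = totals - this_level    # all other categories pooled
    # 2x2 test of "this category" versus "every other category"
    chi2, p, dof, _ = stats.chi2_contingency(np.array([this_level, rest]))
    results[level] = (chi2, p, bool(p < alpha))
    print(f"{level}: chi2={chi2:.2f}, p={p:.3g}, dof={dof}, significant={p < alpha}")
```

Bonferroni is a conservative choice; other multiple-comparison corrections would fit the same loop.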

# Education Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of education of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of education of the client on the term deposit subscription.

In [10]:
c3= pd.crosstab(data['education'],data['y'])
c3

| education | no | yes |
|---|---:|---:|
| basic.4y | 3877 | 476 |
| basic.6y | 2104 | 188 |
| basic.9y | 6028 | 498 |
| high.school | 8739 | 1109 |
| illiterate | 14 | 4 |
| professional.course | 4847 | 623 |
| university.degree | 10939 | 1742 |


In [11]:
stats.chi2_contingency(c3)

(187.96404840958104,
 6.893288776179271e-38,
 6,
 array([[3.86261639e+03, 4.90383607e+02],
        [2.03379664e+03, 2.58203360e+02],
        [5.79081888e+03, 7.35181121e+02],
        [8.73858172e+03, 1.10941828e+03],
        [1.59722249e+01, 2.02777508e+00],
        [4.85378168e+03, 6.16218316e+02],
        [1.12524325e+04, 1.42856754e+03]]))

__Observation-__ Since the p-value (6.893288776179271e-38) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the client's education on the term deposit subscription.

### Post-Hoc Test of the education column

In [12]:
d3= pd.get_dummies(data['education'])
d3.head()

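| | basic.4y | basic.6y | basic.9y | high.school | illiterate | professional.course | university.degree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |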


In [13]:
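nl = "\n"
for series in d3:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this education level
    crosstab = pd.crosstab(d3[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c3, so the statistics printed
    # below are identical for every level and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c3)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")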

y               no       yes
basic.4y                    
0         0.886955  0.113045
1         0.890650  0.109350 

Chi2 value= 187.96404840958104
p-value= 6.893288776179271e-38
Degrees of freedom= 6

y               no       yes
basic.6y                    
0         0.885541  0.114459
1         0.917976  0.082024 

Chi2 value= 187.96404840958104
p-value= 6.893288776179271e-38
Degrees of freedom= 6

y               no       yes
basic.9y                    
0         0.880503  0.119497
1         0.923690  0.076310 

Chi2 value= 187.96404840958104
p-value= 6.893288776179271e-38
Degrees of freedom= 6

y                  no       yes
high.school                    
0            0.887332  0.112668
1            0.887388  0.112612 

Chi2 value= 187.96404840958104
p-value= 6.893288776179271e-38
Degrees of freedom= 6

y                 no       yes
illiterate                    
0           0.887394  0.112606
1           0.777778  0.222222 

Chi2 value= 187.96404840958104
p-value= 6.89328877

__Observation-__ From the above tables, we can infer that the subscription rate is highest among clients who are __ILLITERATE (22%)__ or hold a __UNIVERSITY.DEGREE (14%)__. The illiterate group contains only 18 clients (4 of whom said 'yes'), so its rate is not a reliable estimate.

### Conclusion of Education Column
There is a __significant effect__ of the client's education on the term deposit subscription, and clients who are ILLITERATE or hold a UNIVERSITY.DEGREE show the highest subscription rates.
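Because the illiterate group is so small, an exact test is a reasonable cross-check on its apparent 22% rate. The sketch below is an addition, not part of the original analysis: it uses `scipy.stats.fisher_exact` on a 2x2 table of illiterate clients versus everyone else, with counts copied from the education crosstab.

```python
from scipy import stats

# (no, yes) counts per education level, copied from the crosstab above.
education = {
    "basic.4y": (3877, 476),
    "basic.6y": (2104, 188),
    "basic.9y": (6028, 498),
    "high.school": (8739, 1109),
    "illiterate": (14, 4),
    "professional.course": (4847, 623),
    "university.degree": (10939, 1742),
}

no_total = sum(no for no, _ in education.values())
yes_total = sum(yes for _, yes in education.values())
no_ill, yes_ill = education["illiterate"]

# Rows: illiterate vs. all other levels; columns: yes, no.
table = [
    [yes_ill, no_ill],
    [yes_total - yes_ill, no_total - no_ill],
]
odds_ratio, p_value = stats.fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, two-sided p-value = {p_value:.3f}")
```

If the exact p-value is not small, the high rate in this group should be read as noise from a tiny sample rather than as a finding.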

# Default Column

In [14]:
c4= pd.crosstab(data['default'],data['y'])
c4

| default | no | yes |
|---|---:|---:|
| no | 36545 | 4640 |
| yes | 3 | 0 |


In [15]:
stats.chi2_contingency(c4)

(0.08755907942916237,
 0.7673035224267515,
 1,
 array([[3.65453380e+04, 4.63966204e+03],
        [2.66203749e+00, 3.37962513e-01]]))

__Observation-__ Since the p-value (0.7673035224267515) is greater than 0.05, we __fail to reject the null hypothesis__: there is no significant effect of the client's 'credit in default' on the term deposit subscription. Only 3 clients have credit in default, so two of the expected counts are below 5 and the chi-square approximation is unreliable here.

### Conclusion of Default Column
There is __no significant effect__ of the client's 'credit in default' on the term deposit subscription. All clients who said 'yes' have NO credit in default.

# Housing Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of housing loan of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of housing loan of the client on the term deposit subscription.

In [16]:
c5= pd.crosstab(data['housing'],data['y'])
c5

| housing | no | yes |
|---|---:|---:|
| no | 17000 | 2074 |
| yes | 19548 | 2566 |


In [17]:
stats.chi2_contingency(c5)

(5.387626999544283,
 0.020280031834089277,
 1,
 array([[16925.2343401,  2148.7656599],
        [19622.7656599,  2491.2343401]]))

__Observation-__ Since the p-value (0.020280031834089277) is less than 0.05, the __null hypothesis is rejected__: there is a statistically significant, though weak, effect of the client's housing loan on the term deposit subscription.

### Post-Hoc Test of the housing column

In [18]:
d5= pd.get_dummies(data['housing'])
d5.head()

| | no | yes |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |


In [19]:
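nl = "\n"
for series in d5:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this housing value
    crosstab = pd.crosstab(d5[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c5, so the statistics printed
    # below are identical for every level and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c5)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")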

y         no       yes
no                    
0   0.883965  0.116035
1   0.891266  0.108734 

Chi2 value= 5.387626999544283
p-value= 0.020280031834089277
Degrees of freedom= 1

y          no       yes
yes                    
0    0.891266  0.108734
1    0.883965  0.116035 

Chi2 value= 5.387626999544283
p-value= 0.020280031834089277
Degrees of freedom= 1



__Observation-__ From the above tables, we can infer that the subscription rates are __nearly equal__ for clients with a housing loan (11.6%) and clients without one (10.9%). The difference is statistically significant only because the sample is large.
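To see how small the housing effect is in practical terms, the difference between the two subscription rates can be reported with a confidence interval. The sketch below is an illustrative addition that uses a simple Wald interval (a normal approximation, chosen here for brevity) on the counts from the housing crosstab.

```python
import math

# (no, yes) counts from the housing crosstab above.
without_loan = (17000, 2074)
with_loan = (19548, 2566)

def yes_rate(no, yes):
    """Share of clients who subscribed."""
    return yes / (no + yes)

p_without = yes_rate(*without_loan)
p_with = yes_rate(*with_loan)
diff = p_with - p_without

# Standard error of a difference between two independent proportions.
se = math.sqrt(
    p_without * (1 - p_without) / sum(without_loan)
    + p_with * (1 - p_with) / sum(with_loan)
)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

print(f"rate without loan = {p_without:.4f}, rate with loan = {p_with:.4f}")
print(f"difference = {diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero agrees with the significant chi-square result, while its width in percentage points shows how little the rates actually differ.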

# Loan Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of personal loan of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of personal loan of the client on the term deposit subscription.

In [20]:
c6= pd.crosstab(data['loan'],data['y'])
c6

| loan | no | yes |
|---|---:|---:|
| no | 30839 | 3942 |
| yes | 5709 | 698 |


In [21]:
stats.chi2_contingency(c6)

(1.001666950509534,
 0.316907490545382,
 1,
 array([[30862.77527435,  3918.22472565],
        [ 5685.22472565,   721.77527435]]))

__Observation-__ Since the p-value (0.316907490545382) is greater than 0.05, we __fail to reject the null hypothesis__: there is no significant effect of the client's personal loan on the term deposit subscription.

### Conclusion of Loan Column
There is __no significant effect__ of the client's personal loan on the term deposit subscription.

# Marital Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of marital status of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of marital status of the client on the term deposit subscription.

In [22]:
c11= pd.crosstab(data['marital'],data['y'])
c11

| marital | no | yes |
|---|---:|---:|
| divorced | 4136 | 476 |
| married | 22446 | 2538 |
| single | 9966 | 1626 |


In [23]:
stats.chi2_contingency(c11)

(123.17077425521249,
 1.7939329408092675e-27,
 2,
 array([[ 4092.4389628 ,   519.5610372 ],
        [22169.44818879,  2814.55181121],
        [10286.1128484 ,  1305.8871516 ]]))

__Observation-__ Since the p-value (1.7939329408092675e-27) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the client's marital status on the term deposit subscription.

### Post-Hoc Test of the marital column

In [24]:
d11= pd.get_dummies(data['marital'])
d11.head()

| | divorced | married | single |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |


In [25]:
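nl = "\n"
for series in d11:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this marital status
    crosstab = pd.crosstab(d11[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c11, so the statistics printed
    # below are identical for every status and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c11)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")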

y               no       yes
divorced                    
0         0.886155  0.113845
1         0.896791  0.103209 

Chi2 value= 123.17077425521249
p-value= 1.7939329408092675e-27
Degrees of freedom= 2

y              no       yes
married                    
0        0.870279  0.129721
1        0.898415  0.101585 

Chi2 value= 123.17077425521249
p-value= 1.7939329408092675e-27
Degrees of freedom= 2

y             no       yes
single                    
0       0.898162  0.101838
1       0.859731  0.140269 

Chi2 value= 123.17077425521249
p-value= 1.7939329408092675e-27
Degrees of freedom= 2



__Observation-__ From the above tables, we can infer that the subscription rate is highest among clients whose marital status is __SINGLE (14%)__.

### Conclusion of Marital Column
There is a __significant effect__ of the client's marital status on the term deposit subscription, and SINGLE clients are the most likely to say 'yes'.

# Contact Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of contact communication type of the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of contact communication type of the client on the term deposit subscription.

In [26]:
c7= pd.crosstab(data['contact'],data['y'])
c7

| contact | no | yes |
|---|---:|---:|
| cellular | 22291 | 3853 |
| telephone | 14257 | 787 |


In [27]:
stats.chi2_contingency(c7)

(862.3183642075705,
 1.5259856523129964e-189,
 1,
 array([[23198.7693503,  2945.2306497],
        [13349.2306497,  1694.7693503]]))

__Observation-__ Since the p-value (1.5259856523129964e-189) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the contact communication type on the term deposit subscription.

### Post-Hoc Test of the contact column

In [28]:
d7= pd.get_dummies(data['contact'])
d7.head()

| | cellular | telephone |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |


In [29]:
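nl = "\n"
for series in d7:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this contact type
    crosstab = pd.crosstab(d7[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c7, so the statistics printed
    # below are identical for every level and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c7)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")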

y               no       yes
cellular                    
0         0.947687  0.052313
1         0.852624  0.147376 

Chi2 value= 862.3183642075705
p-value= 1.5259856523129964e-189
Degrees of freedom= 1

y                no       yes
telephone                    
0          0.852624  0.147376
1          0.947687  0.052313 

Chi2 value= 862.3183642075705
p-value= 1.5259856523129964e-189
Degrees of freedom= 1



__Observation-__ From the above tables, we can infer that the subscription rate is much higher for clients contacted by __CELLULAR (15%)__ than for clients contacted by telephone (5%).

### Conclusion of Contact Column
There is a __significant effect__ of the contact communication type on the term deposit subscription, and clients contacted by CELLULAR are the most likely to say 'yes'.

# Month Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of last contacted month to the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of last contacted month to the client on the term deposit subscription.

In [30]:
c8= pd.crosstab(data['month'],data['y'])
c8

| month | no | yes |
|---|---:|---:|
| apr | 2093 | 539 |
| aug | 5523 | 655 |
| dec | 93 | 89 |
| jul | 6525 | 649 |
| jun | 4759 | 559 |
| mar | 270 | 276 |
| may | 12883 | 886 |
| nov | 3685 | 416 |
| oct | 403 | 315 |
| sep | 314 | 256 |


In [31]:
stats.chi2_contingency(c8)

(3101.149351411678, 0.0, 9, array([[ 2335.49422162,   296.50577838],
        [ 5482.02253083,   695.97746917],
        [  161.49694086,    20.50305914],
        [ 6365.8189764 ,   808.1810236 ],
        [ 4718.905118  ,   599.094882  ],
        [  484.49082257,    61.50917743],
        [12217.86471788,  1551.13528212],
        [ 3639.00524425,   461.99475575],
        [  637.11430514,    80.88569486],
        [  505.78712246,    64.21287754]]))

__Observation-__ Since the p-value (reported as 0.0 because it is too small to represent in floating point) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the last contact month on the term deposit subscription.

### Post Hoc Test of Month Column

In [32]:
d8= pd.get_dummies(data['month'])
d8.head()

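| | apr | aug | dec | jul | jun | mar | may | nov | oct | sep |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |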


In [33]:
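nl = "\n"
for series in d8:
    # Row-normalized rates of 'no'/'yes' for clients last contacted in (1) or outside (0) this month
    crosstab = pd.crosstab(d8[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c8, so the statistics printed
    # below are identical for every month and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c8)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")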

y          no       yes
apr                    
0    0.893635  0.106365
1    0.795213  0.204787 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no       yes
aug                    
0    0.886175  0.113825
1    0.893979  0.106021 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no       yes
dec                    
0    0.889016  0.110984
1    0.510989  0.489011 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no       yes
jul                    
0    0.882666  0.117334
1    0.909534  0.090466 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no       yes
jun                    
0    0.886228  0.113772
1    0.894885  0.105115 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no       yes
mar                    
0    0.892623  0.107377
1    0.494505  0.505495 

Chi2 value= 3101.149351411678
p-value= 0.0
Degrees of freedom= 9

y          no   

__Observation-__ From the above tables, we can infer that the subscription rate is highest for clients last contacted in __MARCH (51%)__ and __DECEMBER (49%)__. Both months have comparatively few contacts (546 and 182 clients).

### Conclusion of Month Column
There is a __significant effect__ of the last contact month on the term deposit subscription, and clients last contacted in MARCH and DECEMBER are the most likely to say 'yes'.

# Day of Week Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of last contact day of the week to the client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of last contact day of the week to the client on the term deposit subscription.

In [34]:
c9= pd.crosstab(data['day_of_week'],data['y'])
c9

| day_of_week | no | yes |
|---|---:|---:|
| fri | 6981 | 846 |
| mon | 7667 | 847 |
| thu | 7578 | 1045 |
| tue | 7137 | 953 |
| wed | 7185 | 949 |


In [35]:
stats.chi2_contingency(c9)

(26.14493907587197,
 2.9584820052785324e-05,
 4,
 array([[6945.25580266,  881.74419734],
        [7554.8623871 ,  959.1376129 ],
        [7651.58308245,  971.41691755],
        [7178.62775566,  911.37224434],
        [7217.67097213,  916.32902787]]))

__Observation-__ Since the p-value (2.9584820052785324e-05) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the last contact day of the week on the term deposit subscription.

### Post-Hoc Test of the day of week column

In [36]:
d9= pd.get_dummies(data['day_of_week'])
d9.head()

| | fri | mon | thu | tue | wed |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |


In [37]:
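nl = "\n"
for series in d9:
    # Row-normalized rates of 'no'/'yes' for clients last contacted on (1) or not on (0) this day
    crosstab = pd.crosstab(d9[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c9, so the statistics printed
    # below are identical for every day and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c9)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")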

y          no       yes
fri                    
0    0.886274  0.113726
1    0.891913  0.108087 

Chi2 value= 26.14493907587197
p-value= 2.9584820052785324e-05
Degrees of freedom= 4

y          no       yes
mon                    
0    0.883914  0.116086
1    0.900517  0.099483 

Chi2 value= 26.14493907587197
p-value= 2.9584820052785324e-05
Degrees of freedom= 4

y          no       yes
thu                    
0    0.889605  0.110395
1    0.878812  0.121188 

Chi2 value= 26.14493907587197
p-value= 2.9584820052785324e-05
Degrees of freedom= 4

y          no       yes
tue                    
0    0.888604  0.111396
1    0.882200  0.117800 

Chi2 value= 26.14493907587197
p-value= 2.9584820052785324e-05
Degrees of freedom= 4

y          no       yes
wed                    
0    0.888334  0.111666
1    0.883329  0.116671 

Chi2 value= 26.14493907587197
p-value= 2.9584820052785324e-05
Degrees of freedom= 4



__Observation-__ From the above tables, we can infer that the subscription rate is slightly higher for clients last contacted on __TUESDAY (12%)__ and __THURSDAY (12%)__, although the rates for all five days lie between 10% and 12%.

### Conclusion of Day_of_Week Column
There is a __significant effect__ of the last contact day of the week on the term deposit subscription, and clients contacted on TUESDAY and THURSDAY are slightly more likely to say 'yes'.

# POutcome Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of outcome of the previous marketing campaign on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of outcome of the previous marketing campaign on the term deposit subscription.

In [38]:
c10= pd.crosstab(data['poutcome'],data['y'])
c10

| poutcome | no | yes |
|---|---:|---:|
| failure | 3647 | 605 |
| nonexistent | 32422 | 3141 |
| success | 479 | 894 |


In [39]:
stats.chi2_contingency(c10)

(4230.5237978319765, 0.0, 2, array([[ 3772.99446441,   479.00553559],
        [31556.67971254,  4006.32028746],
        [ 1218.32582306,   154.67417694]]))

__Observation-__ Since the p-value (reported as 0.0 because it is too small to represent in floating point) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the outcome of the previous marketing campaign on the term deposit subscription.

### Post-Hoc Test of the poutcome column

In [40]:
d10=pd.get_dummies(data['poutcome'])
d10.head()

| | failure | nonexistent | success |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |


In [41]:
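nl = "\n"
for series in d10:
    # Row-normalized rates of 'no'/'yes' for clients with (1) and without (0) this previous outcome
    crosstab = pd.crosstab(d10[series], data['y'], normalize='index')
    print(crosstab, nl)
    # Caution: this re-runs the overall test on c10, so the statistics printed
    # below are identical for every outcome and are not a per-level test.
    chi2, p, dof, expected = stats.chi2_contingency(c10)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")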

y              no       yes
failure                    
0        0.890757  0.109243
1        0.857714  0.142286 

Chi2 value= 4230.5237978319765
p-value= 0.0
Degrees of freedom= 2

y                  no       yes
nonexistent                    
0            0.733511  0.266489
1            0.911678  0.088322 

Chi2 value= 4230.5237978319765
p-value= 0.0
Degrees of freedom= 2

y              no       yes
success                    
0        0.905915  0.094085
1        0.348871  0.651129 

Chi2 value= 4230.5237978319765
p-value= 0.0
Degrees of freedom= 2



__Observation-__ From the above tables, we can infer that the subscription rate is highest among clients whose previous campaign outcome was __SUCCESS (65%)__.

### Conclusion of POutcome Column
There is a __significant effect__ of the outcome of the previous marketing campaign on the term deposit subscription, and clients whose previous campaign outcome was SUCCESS are the most likely to say 'yes'.

# Summary of Statistical Tests of Categorical Features (after dealing with 'unknown' values)
## Except for the 'Loan' and 'Default' features, all categorical features have a significant effect on the target variable.
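With more than 41,000 rows, even tiny differences reach significance, so an effect size helps rank the features. The sketch below is an illustrative addition that computes Cramér's V from the chi-square statistics reported in the cells above (rounded). It assumes n = 41188 rows, as stated in the data set description. The statistics for the 2x2 tables were computed with SciPy's default continuity correction, so their V values are approximate.

```python
import math

n = 41188  # number of examples stated in the data set description

# Chi-square statistics as reported in the cells above (rounded).
reported_chi2 = {
    "job": 987.5156,
    "age_cat": 244.5216,
    "education": 187.9640,
    "default": 0.0876,
    "housing": 5.3876,
    "loan": 1.0017,
    "marital": 123.1708,
    "contact": 862.3184,
    "month": 3101.1494,
    "day_of_week": 26.1449,
    "poutcome": 4230.5238,
}

# The target y has two levels, so min(rows, cols) - 1 = 1 and
# Cramér's V reduces to sqrt(chi2 / n) for every feature.
cramers_v = {name: math.sqrt(chi2 / n) for name, chi2 in reported_chi2.items()}

# Print the features from strongest to weakest association with y.
for name, v in sorted(cramers_v.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} V = {v:.3f}")
```

Ranking by V separates features that are merely significant from those with a sizeable association with the target.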

# Statistical Test- Numerical Features
# Age Column

In [42]:
import pandas as pd
import numpy as np
import statsmodels.api as sms
import statsmodels.formula.api as statsmodel
from statsmodels.formula.api import ols
import scipy.stats as stats

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of age of the clients on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of age of the clients on the term deposit subscription.

In [43]:
model1 = ols("age ~ y",data = data).fit()

In [44]:
sms.stats.anova_lm(model1)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---:|---:|---:|---:|---:|
| y | 1.0 | 4133.451 | 4133.450624 | 38.094659 | 6.802136e-10 |
| Residual | 41186.0 | 4468876.0 | 108.504727 | | |


__Observation-__ Since the p-value (6.802136e-10) is less than 0.05, the __null hypothesis is rejected__: the mean age differs significantly between clients who subscribed and clients who did not.
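Because `y` has only two levels, this one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the squared t statistic and the p-values coincide. The sketch below illustrates that identity on synthetic data (the two groups are randomly generated stand-ins, not the bank data).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the ages of the 'no' and 'yes' groups.
rng = np.random.default_rng(0)
group_no = rng.normal(loc=40, scale=10, size=200)
group_yes = rng.normal(loc=41, scale=10, size=50)

# One-way ANOVA with two groups ...
f_stat, p_anova = stats.f_oneway(group_no, group_yes)

# ... and the equivalent two-sample t-test (equal variances assumed by default).
t_stat, p_ttest = stats.ttest_ind(group_no, group_yes)

print(f"F = {f_stat:.6f}, t^2 = {t_stat**2:.6f}")
print(f"ANOVA p = {p_anova:.6f}, t-test p = {p_ttest:.6f}")
```

For the same reason, a post-hoc comparison on a two-level factor adds only the direction and size of the mean difference.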

### Post Hoc Test of Age Column

In [45]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [46]:
print(pairwise_tukeyhsd(data['age'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff lower  upper  reject
-------------------------------------------
  no    yes    1.002   0.6838 1.3201  True 
-------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a higher mean age (by about 1 year) than clients who say 'no'.

### Conclusion of Age Column

There is a __significant effect__ of the client's age on the term deposit subscription: clients who say 'yes' have a higher mean age than clients who say 'no'.

# Duration Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of last contact duration of the clients on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of last contact duration of the clients on the term deposit subscription.

In [47]:
model2 = ols("duration ~ y",data = data).fit()

In [48]:
sms.stats.anova_lm(model2)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,454771000.0,454771000.0,8094.101634,0.0
Residual,41186.0,2314055000.0,56185.48,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the last contact duration of the clients on the term deposit subscription.
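A printed p-value of 0.0 most likely reflects floating-point underflow rather than a probability of exactly zero. The sketch below illustrates this with a large-sample approximation: with one numerator degree of freedom, F equals the square of a t statistic, and the t statistic is treated as approximately normal. This is an assumption made for illustration; statsmodels itself uses the F distribution, so the numbers differ slightly. The helper name `approx_two_sided_p` is hypothetical.

```python
import math

def approx_two_sided_p(f_stat: float) -> float:
    """Approximate p-value for an F statistic with 1 numerator df.

    ASSUMPTION: sqrt(F) is treated as a standard normal z-score, which is
    only a large-sample approximation of the exact F-distribution tail.
    """
    z = math.sqrt(f_stat)
    return math.erfc(z / math.sqrt(2))

p_age = approx_two_sided_p(38.094659)         # F from the age ANOVA table
p_duration = approx_two_sided_p(8094.101634)  # F from the duration ANOVA table

print(p_age)       # same order of magnitude as the reported 6.8e-10
print(p_duration)  # underflows to 0.0 in double precision
```

The conclusion is unchanged either way: the p-value is far below 0.05. It is just safer to read 0.0 as "too small to represent" than as "impossible under the null hypothesis".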

### Post Hoc Test of Duration Column

In [49]:
print(pairwise_tukeyhsd(data['duration'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
  no    yes   332.3464 325.1059 339.5868  True 
-----------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a higher mean last call duration than clients who say 'no'.

### Conclusion of Duration Column

There is a __significant effect__ of the last call duration of the clients on the term deposit subscription, and clients who say 'yes' have a higher mean last call duration than clients who say 'no'.

# Campaign Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of the number of contacts performed to a particular client during the campaign on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of the number of contacts performed to a particular client during the campaign on the term deposit subscription.

In [50]:
model3 = ols("campaign ~ y",data = data).fit()

In [51]:
sms.stats.anova_lm(model3)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,1391.562959,1391.562959,182.156673,2.007780e-41
Residual,41186.0,314635.259513,7.639374,,


__Observation-__ Since the p-value (2.007780e-41) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the number of contacts performed to a particular client during the campaign on the term deposit subscription.

### Post Hoc Test of Campaign Column

In [52]:
print(pairwise_tukeyhsd(data['campaign'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  no    yes   -0.5814  -0.6658 -0.4969  True 
---------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a slightly lower mean number of contacts during the campaign than clients who say 'no'.

### Conclusion of Campaign Column

There is a __significant effect__ of the number of contacts performed to a particular client during the campaign on the term deposit subscription, and clients who say 'yes' have a slightly lower mean number of contacts than clients who say 'no'.

# PDays Column 

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of number of days that passed by after the client was last contacted from a previous campaign on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of number of days that passed by after the client was last contacted from a previous campaign on the term deposit subscription.

In [53]:
model_pdays = ols("pdays ~ y",data = data).fit()

In [54]:
sms.stats.anova_lm(model_pdays)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,154.018928,154.018928,4859.909473,0.0
Residual,41186.0,1305.255497,0.031692,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the number of days that passed after the client was last contacted in a previous campaign on the term deposit subscription.

### Post Hoc Test of PDays Column

In [55]:
print(pairwise_tukeyhsd(data['pdays'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff lower upper  reject
------------------------------------------
  no    yes    0.1934  0.188 0.1988  True 
------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a higher mean number of days since they were last contacted in a previous campaign than clients who say 'no'.

### Conclusion of PDays Column

There is a __significant effect__ of the number of days that passed after the client was last contacted in a previous campaign on the term deposit subscription, and clients who say 'yes' have a higher mean number of days since the last contact than clients who say 'no'.

# Previous Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of the number of contacts performed before this campaign for a particular client on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of the number of contacts performed before this campaign for a particular client on the term deposit subscription.

In [56]:
model4 = ols("previous ~ y",data = data).fit()

In [57]:
sms.stats.anova_lm(model4)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,534.48549,534.48549,2304.257088,0.0
Residual,41186.0,9553.326106,0.231956,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the number of contacts performed before this campaign for a particular client on the term deposit subscription.

### Post Hoc Test of Previous Column

In [58]:
print(pairwise_tukeyhsd(data['previous'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff lower  upper reject
------------------------------------------
  no    yes    0.3603  0.3456 0.375  True 
------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a slightly higher mean number of contacts performed before this campaign than clients who say 'no'.

### Conclusion of Previous Column

There is a __significant effect__ of the number of contacts performed before this campaign for a particular client on the term deposit subscription, and clients who say 'yes' have a slightly higher mean number of contacts performed before this campaign than clients who say 'no'.

# 'emp.var.rate' Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of employment variation rate on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of employment variation rate on the term deposit subscription.

In [59]:
data2=data.rename(index=str,columns={'emp.var.rate':'emp_var_rate', 'cons.price.idx':'cons_price_idx',
       'cons.conf.idx':'cons_conf_idx', 'nr.employed':'nr_employed'})
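The rename is presumably needed because the formula string passed to `ols` is parsed like Python code, where a name such as `emp.var.rate` would read as attribute access. As a small sketch, the same mapping can be built programmatically instead of being typed by hand. The variable names `dotted_columns` and `rename_map` are illustrative.

```python
# Column names copied from the rename call above.
dotted_columns = ["emp.var.rate", "cons.price.idx", "cons.conf.idx", "nr.employed"]

# Replace every dot with an underscore so the names are valid Python
# identifiers and can be used directly inside a formula string.
rename_map = {name: name.replace(".", "_") for name in dotted_columns}

for old, new in rename_map.items():
    print(f"{old} -> {new}")
```

A mapping like this could then be passed as the `columns` argument of `data.rename`, which would produce the same columns as the hand-written dictionary.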

In [60]:
model5 = ols("emp_var_rate ~ y",data = data2).fit()

In [61]:
sms.stats.anova_lm(model5)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,9046.842163,9046.842163,4023.829925,0.0
Residual,41186.0,92599.152609,2.248316,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the employment variation rate on the term deposit subscription.

### Post Hoc Test of emp.var.rate Column

In [62]:
print(pairwise_tukeyhsd(data2['emp_var_rate'],data2['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  no    yes   -1.4823  -1.5281 -1.4365  True 
---------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a lower mean employment variation rate than clients who say 'no'.

### Conclusion of emp.var.rate Column

There is a __significant effect__ of the employment variation rate on the term deposit subscription, and clients who say 'yes' have a lower mean employment variation rate than clients who say 'no'.

# 'cons.price.idx' Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of consumer price index on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of consumer price index on the term deposit subscription.

In [63]:
model6 = ols("cons_price_idx ~ y",data = data2).fit()

In [64]:
sms.stats.anova_lm(model6)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,256.037173,256.037173,778.589786,9.318965e-170
Residual,41186.0,13543.906156,0.328847,,


__Observation-__ Since the p-value (9.318965e-170) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the consumer price index on the term deposit subscription.

### Post Hoc Test of cons.price.idx Column

In [65]:
print(pairwise_tukeyhsd(data2['cons_price_idx'],data2['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  no    yes   -0.2494  -0.2669 -0.2319  True 
---------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a slightly lower mean consumer price index than clients who say 'no'.

### Conclusion of cons.price.idx Column

There is a __significant effect__ of the consumer price index on the term deposit subscription, and clients who say 'yes' have a slightly lower mean consumer price index than clients who say 'no'.

# 'cons.conf.idx' Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of consumer confidence index on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of consumer confidence index on the term deposit subscription.

In [66]:
model7 = ols("cons_conf_idx ~ y",data = data2).fit()

In [67]:
sms.stats.anova_lm(model7)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,2656.927416,2656.927416,124.409975,7.536665e-29
Residual,41186.0,879577.484094,21.356225,,


__Observation-__ Since the p-value (7.536665e-29) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the consumer confidence index on the term deposit subscription.

### Post Hoc Test of cons.conf.idx Column

In [68]:
print(pairwise_tukeyhsd(data2['cons_conf_idx'],data2['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff lower  upper  reject
-------------------------------------------
  no    yes    0.8033  0.6622 0.9445  True 
-------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a slightly higher mean consumer confidence index than clients who say 'no'.

### Conclusion of cons.conf.idx Column

There is a __significant effect__ of the consumer confidence index on the term deposit subscription, and clients who say 'yes' have a slightly higher mean consumer confidence index than clients who say 'no'.

# 'euribor3m' Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of euribor 3 month rate on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of euribor 3 month rate on the term deposit subscription.

In [69]:
model8 = ols("euribor3m ~ y",data = data).fit()

In [70]:
sms.stats.anova_lm(model8)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,11736.509666,11736.509666,4309.479048,0.0
Residual,41186.0,112166.6637,2.723417,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the euribor 3 month rate on the term deposit subscription.

### Post Hoc Test of euribor3m Column

In [71]:
print(pairwise_tukeyhsd(data['euribor3m'],data['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  no    yes   -1.6884  -1.7388 -1.6379  True 
---------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a lower mean euribor 3 month rate than clients who say 'no'.

### Conclusion of euribor3m Column

There is a __significant effect__ of the euribor 3 month rate on the term deposit subscription, and clients who say 'yes' have a lower mean euribor 3 month rate than clients who say 'no'.

# 'nr.employed' Column

### Define null and Alternate Hypothesis
__Null Hypothesis: H0:__ There is no effect of number of employees on the term deposit subscription.

__Alternate Hypothesis: H1:__ There is an effect of number of employees on the term deposit subscription.

In [72]:
model9 = ols("nr_employed ~ y",data = data2).fit()

In [73]:
sms.stats.anova_lm(model9)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
y,1.0,27047270.0,27047270.0,5926.610646,0.0
Residual,41186.0,187960500.0,4563.7,,


__Observation-__ Since the p-value (0.0) is less than 0.05, the __null hypothesis is rejected__: there is a significant effect of the number of employees on the term deposit subscription.

### Post Hoc Test of nr.employed Column

In [74]:
print(pairwise_tukeyhsd(data2['nr_employed'],data2['y']))

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
  no    yes   -81.0506 -83.1142 -78.9871  True 
-----------------------------------------------


__Observation-__ From the above table, we can infer that clients who say 'yes' to the term deposit subscription have a lower mean number of employees than clients who say 'no'.

### Conclusion of nr.employed Column

There is a __significant effect__ of the number of employees on the term deposit subscription, and clients who say 'yes' have a lower mean number of employees than clients who say 'no'.

# Summary of Statistical Tests of Numerical Features

## All numerical features have a significant impact on the target variable.
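With more than 41,000 rows, even very small differences reach statistical significance, so it can help to look at effect size as well. The sketch below computes eta squared (the share of each feature's variance accounted for by the yes/no grouping) from the `sum_sq` values printed in the ANOVA tables of this section. The choice of eta squared is an addition for illustration and not part of the original analysis.

```python
# (sum_sq of the y row, sum_sq of the Residual row), copied from the ANOVA tables.
anova_sums = {
    "age": (4133.451, 4468876.0),
    "duration": (454771000.0, 2314055000.0),
    "campaign": (1391.562959, 314635.259513),
    "pdays": (154.018928, 1305.255497),
    "previous": (534.48549, 9553.326106),
    "emp.var.rate": (9046.842163, 92599.152609),
    "cons.price.idx": (256.037173, 13543.906156),
    "cons.conf.idx": (2656.927416, 879577.484094),
    "euribor3m": (11736.509666, 112166.6637),
    "nr.employed": (27047270.0, 187960500.0),
}

# Eta squared = between-group sum of squares / total sum of squares.
eta_squared = {
    name: ss_y / (ss_y + ss_res) for name, (ss_y, ss_res) in anova_sums.items()
}

# Print features from the largest to the smallest effect size.
for name, value in sorted(eta_squared.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} {value:.4f}")
```

By this measure duration shows the largest separation between the two groups (roughly 0.16) and age the smallest (below 0.001), even though both tests are significant at the 0.05 level.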