# Geodemographic Segmentation Model

Dataset: https://www.superdatascience.com/training/

Author: Filipa C. S. Rodrigues (filipacsrodrigues@gmail.com)

In [1]:
%matplotlib inline 
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import statsmodels.discrete.discrete_model as smdis
import statsmodels.stats.outliers_influence as outliers
import matplotlib.pyplot as plt



In [2]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv is the supported loader
df = pandas.read_csv('Churn-Modelling.csv')
df[:5]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


#### Variables

- Exited: dependent variable (y), binary
- Gender, Geography: categorical independent variables (x)
- All the remaining variables: numeric independent variables (x)

In [3]:
df_y = df['Exited']
df_x = df.drop(['Exited'], axis = 1)

Create dummy variables for the categorical variables:

In [4]:
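# One 0/1 indicator column per category; dtype=int keeps the columns numeric
# (recent pandas versions return booleans by default).
dummy = pandas.get_dummies(df_x['Gender'], dtype=int)
df_x = dummy.join(df_x)
dummy = pandas.get_dummies(df_x['Geography'], dtype=int)
df_x = dummy.join(df_x)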
df_x[:5]

Unnamed: 0,France,Germany,Spain,Female,Male,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,1,0,0,1,0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88
1,0,0,1,1,0,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,1,0,0,1,0,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57
3,1,0,0,1,0,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63
4,0,0,1,1,0,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1


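For each categorical variable, one dummy column must be dropped (keeping k - 1 dummies for k categories) to avoid the "dummy variable trap". The original text columns are dropped as well: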

In [5]:
df_x = df_x.drop(['Gender', 'Male', 'France', 'Geography'], axis =1)
df_x[:5]

Unnamed: 0,Germany,Spain,Female,RowNumber,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,0,0,1,1,15634602,Hargrave,619,42,2,0.0,1,1,1,101348.88
1,0,1,1,2,15647311,Hill,608,41,1,83807.86,1,0,1,112542.58
2,0,0,1,3,15619304,Onio,502,42,8,159660.8,3,1,0,113931.57
3,0,0,1,4,15701354,Boni,699,39,1,0.0,2,0,0,93826.63
4,0,1,1,5,15737888,Mitchell,850,43,2,125510.82,1,1,1,79084.1
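The trap itself can be shown in a few lines. This is a minimal sketch on a toy Geography column (the values are made up for illustration): with a constant and all three dummies, the dummies sum to the constant column, so the design matrix loses full column rank.

```python
import numpy as np

# Toy Geography column (values chosen for illustration only).
geography = np.array(["France", "Spain", "France", "Germany", "Spain", "Germany"])
categories = ["France", "Germany", "Spain"]

# One 0/1 indicator column per category.
dummies = np.column_stack([(geography == c).astype(float) for c in categories])
const = np.ones((len(geography), 1))

# Constant + all three dummies: the dummies sum to the constant column,
# so the design matrix is rank deficient (the "dummy variable trap").
X_all = np.hstack([const, dummies])
rank_all = np.linalg.matrix_rank(X_all)

# Constant + k - 1 dummies (France dropped as the baseline): full column rank.
X_reduced = np.hstack([const, dummies[:, 1:]])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("all dummies:    %d columns, rank %d" % (X_all.shape[1], rank_all))
print("France dropped: %d columns, rank %d" % (X_reduced.shape[1], rank_reduced))
```

The dropped category (France here, Male for Gender) becomes the baseline against which the remaining dummy coefficients are read.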


Add a constant $b_0$ (the intercept) to the model:

In [6]:
df_x = sm.add_constant(df_x)
df_x[:2]

Unnamed: 0,const,Germany,Spain,Female,RowNumber,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,1,0,0,1,1,15634602,Hargrave,619,42,2,0.0,1,1,1,101348.88
1,1,0,1,1,2,15647311,Hill,608,41,1,83807.86,1,0,1,112542.58


Exclude the identifier variables, which should not affect the model:

In [7]:
df_x = df_x.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)
df_x[:2]

Unnamed: 0,const,Germany,Spain,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,1,0,0,1,619,42,2,0.0,1,1,1,101348.88
1,1,0,1,1,608,41,1,83807.86,1,0,1,112542.58


Create a model with all the remaining variables:

In [8]:
model1 = smdis.Logit(df_y, df_x).fit()
print(model1.summary())
print("\n_____P-values____")
p = model1.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
# pred_table(): rows are observed classes, columns are predicted classes (threshold 0.5)
cm = model1.pred_table()
print(cm)
# Accuracy: correctly predicted cases (the diagonal) as a share of all cases
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428068
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9988
Method:                           MLE   Df Model:                           11
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1532
Time:                        15:07:20   Log-Likelihood:                -4280.7
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9208      0.245    -15.980      0.000        -4.402    -3.440
Germany       

Note:

__"Pseudo R-squ."__ - pseudo R-squared.
"Logistic regression does not have an equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with one. There are a wide variety of pseudo-R-square statistics. Because this statistic does not mean what R-square means in OLS regression (the proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution." (source: http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm)
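As far as the statsmodels documentation goes, the value reported as "Pseudo R-squ." is McFadden's pseudo R-squared, which compares the fitted model's log-likelihood with that of an intercept-only model. A small sketch that recomputes it from the two log-likelihoods printed in the first summary above:

```python
# Log-likelihoods copied from the first model's summary above
log_likelihood = -4280.7   # "Log-Likelihood"
ll_null = -5054.9          # "LL-Null": intercept-only model

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null)
pseudo_r2 = 1 - log_likelihood / ll_null
print("McFadden pseudo R-squared: %.4f" % pseudo_r2)
```

The result agrees with the 0.1532 shown in the summary, up to the rounding of the printed log-likelihoods.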



__Backward Elimination__

New model without the variable with the highest p-value: "Spain"

In [9]:
df_x1 = df_x.drop(['Spain'], axis = 1)
df_x1[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,1,0,1,619,42,2,0.0,1,1,1,101348.88
1,1,0,1,608,41,1,83807.86,1,0,1,112542.58


In [10]:
model2 = smdis.Logit(df_y, df_x1).fit()
print(model2.summary())
print("\n_____P-values____")
p = model2.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model2.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428080
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        15:07:22   Log-Likelihood:                -4280.8
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9110      0.245    -15.994      0.000        -4.390    -3.432
Germany       

New model without the variable with the highest p-value: "HasCrCard"

In [11]:
df_x2 = df_x1.drop(['HasCrCard'], axis = 1)
df_x2[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember,EstimatedSalary
0,1,0,1,619,42,2,0.0,1,1,101348.88
1,1,0,1,608,41,1,83807.86,1,1,112542.58


In [12]:
model3 = smdis.Logit(df_y, df_x2).fit()
print(model3.summary())
print("\n_____P-values____")
p = model3.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model3.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428109
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        15:07:23   Log-Likelihood:                -4281.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9444      0.241    -16.395      0.000        -4.416    -3.473
Germany       

New model without the variable with the highest p-value: "EstimatedSalary"

In [13]:
df_x3 = df_x2.drop(['EstimatedSalary'], axis = 1)
df_x3[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember
0,1,0,1,619,42,2,0.0,1,1
1,1,0,1,608,41,1,83807.86,1,1


In [14]:
model4 = smdis.Logit(df_y, df_x3).fit()
print(model4.summary())
print("\n_____P-values____")
p = model4.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model4.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428161
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        15:07:24   Log-Likelihood:                -4281.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.8959      0.236    -16.528      0.000        -4.358    -3.434
Germany          

New model without the variable with the highest p-value: "Tenure"

In [15]:
df_x4 = df_x3.drop(['Tenure'], axis = 1)
df_x4[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Balance,NumOfProducts,IsActiveMember
0,1,0,1,619,42,0.0,1,1
1,1,0,1,608,41,83807.86,1,1


In [16]:
model5 = smdis.Logit(df_y, df_x4).fit()
print(model5.summary())
print("\n_____P-values____")
p = model5.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model5.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428307
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9992
Method:                           MLE   Df Model:                            7
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        15:07:26   Log-Likelihood:                -4283.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9760      0.231    -17.200      0.000        -4.429    -3.523
Germany          

Since the accuracy decreases, the variable "Tenure" should be kept in the model.
Now, the variable with the highest p-value is "NumOfProducts" with 0.032, which is below the defined threshold of 0.05, so no further variables are removed.

In [17]:
df_x3[:5]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember
0,1,0,1,619,42,2,0.0,1,1
1,1,0,1,608,41,1,83807.86,1,1
2,1,0,1,502,42,8,159660.8,3,0
3,1,0,1,699,39,1,0.0,2,0
4,1,0,1,850,43,2,125510.82,1,1


#### Transforming Independent Variables

Most common transformations:

1. $$\sqrt{x}$$
2. $$x^2$$
3. $$\ln(x)$$

Transform the "Balance" variable with a logarithm (base 10 is used here; the argument holds for any base):

Original value: Balance, with one unit taken as 1000$:

        Bal2 = Bal1 + 1 unit -->  Bal2 = Bal1 + 1000$

        Scenario 1: Bal1 = 1000$
            --> Bal2 = 1000$ + 1000$ = 2000$
        Scenario 2: Bal1 = 10000$
            --> Bal2 = 10000$ + 1000$ = 11000$

Log10(Balance + 1) (the + 1 only keeps zero balances defined and is ignored below):

        log10(Bal2) = log10(Bal1) + 1 unit --> Bal2 = Bal1*10

        Scenario 1: Bal1 = 1000$
            --> Bal2 = 1000$*10 = 10000$
        Scenario 2: Bal1 = 10000$
            --> Bal2 = 10000$*10 = 100000$

With the original variable, a one-unit increase is the same absolute amount for everyone, but it means two completely different things: 1000$ doubles a balance of 1000$, while it raises a balance of 10000$ by only 10%.

With the log transformation, a one-unit increase has the same relative effect on any person regardless of their starting point: the balance is always multiplied by 10.

So regardless of who we are segmenting, the effect of a one-unit increase in the transformed variable is consistent throughout the population, which is much more powerful because it does not restrict the logistic regression.
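The two scenarios can be checked numerically. A minimal sketch, using the same starting balances as above and ignoring the + 1 inside the logarithm:

```python
import math

ratios = []
for start in (1000.0, 10000.0):
    # Raw scale: +1 unit (here 1000$) is a different relative change each time.
    raised = start + 1000.0
    # Log scale: +1 unit in log10(balance) multiplies the balance by 10.
    scaled = 10 ** (math.log10(start) + 1)
    ratios.append((raised / start, scaled / start))
    print("start=%.0f  raw +1 unit -> %.0f (x%.1f)  log10 +1 unit -> %.0f (x%.1f)"
          % (start, raised, raised / start, scaled, scaled / start))
```

The raw-scale factor changes with the starting balance, while the log-scale factor stays at 10 in both scenarios.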
        

In [18]:
df_x4 = df_x3.copy()  # copy, so that df_x3 is left unchanged
df_x4['Balance_log'] = np.log10(df_x4['Balance'] + 1)  # + 1 keeps zero balances defined
df_x4[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember,Balance_log
0,1,0,1,619,42,2,0.0,1,1,0.0
1,1,0,1,608,41,1,83807.86,1,1,4.92329


In [19]:
model6 = smdis.Logit(df_y, df_x4).fit()
print(model6.summary())
print("\n_____P-values____")
p = model6.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model6.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428134
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        15:07:42   Log-Likelihood:                -4281.3
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9147      0.237    -16.505      0.000        -4.380    -3.450
Germany          

The accuracy is improved!
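The accuracy compared across models is the share of cases on the diagonal of the confusion matrix that `pred_table()` returns (rows are observed classes, columns are predicted classes). A minimal sketch with made-up counts, not taken from any of the models above:

```python
import numpy as np

# Hypothetical confusion matrix in the layout of statsmodels' pred_table():
# rows = observed class (0, 1), columns = predicted class (0, 1).
cm = np.array([[900.0, 50.0],
               [150.0, 100.0]])

correct = np.trace(cm)                # observed 0 predicted 0 + observed 1 predicted 1
accuracy = 100 * correct / cm.sum()   # share of correctly predicted cases, in %
print("correct: %d of %d (%.2f %%)" % (correct, cm.sum(), accuracy))
```

Because accuracy depends on the 0.5 classification threshold, small differences between models are best read together with the log-likelihood and pseudo R-squared.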

#### Derived Variables

Sometimes it is useful to account for combined effects. In our case, balance and age might be correlated, because the older a person is, the more wealth he/she can accumulate.

So a new variable can be created:

$$WealthAccumulation = \frac{Balance}{Age}$$

In [20]:
df_x5 = df_x4.copy()  # copy, so that df_x4 is left unchanged
df_x5['WealthAccumulation'] = df_x5['Balance'] / df_x5['Age']
df_x5[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember,Balance_log,WealthAccumulation
0,1,0,1,619,42,2,0.0,1,1,0.0,0.0
1,1,0,1,608,41,1,83807.86,1,1,4.92329,2044.094146


In [21]:
model7 = smdis.Logit(df_y, df_x5).fit()
print(model7.summary())
print("\n_____P-values____")
p = model7.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model7.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.427323
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        15:07:54   Log-Likelihood:                -4273.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4253      0.266    -12.896      0.000        -3.946    -2.905
Germa

This new variable had a negative effect on the model accuracy.

However, one should take into account that there might be some collinearity effects, because wealth accumulation is built from the variables balance and age, which are already in the model. The new variable might therefore be correlated with the other two, and including all of them in the model can make the coefficient estimates unstable, so one of them should be excluded.


#### Multicollinearity

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

The "Variance Inflation Factors - VIF" can be used to measure the degree of multicollinearity.

- Minimum possible value: 1.0
- Values > 5 may indicate a collinearity problem.

$$VIF(j) = \frac{1}{1 - R(j)^2}$$

where R(j) is the multiple correlation coefficient between variable j and the other independent variables.
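To make the formula concrete, here is a minimal NumPy sketch that computes VIF(j) by regressing column j on the remaining columns and plugging the resulting R squared into the formula. The helper `vif` and the synthetic columns `a`, `b`, `c` are illustrative assumptions; the helper adds its own intercept, whereas the statsmodels function used in this notebook relies on the `const` column already present in the data.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef = np.linalg.lstsq(design, target, rcond=None)[0]
    residuals = target - design.dot(coef)
    # R^2 = 1 - SS_res / SS_tot (residuals have zero mean thanks to the intercept)
    r_squared = 1 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)


# Synthetic predictors: b is almost a copy of a, c is independent.
rng = np.random.RandomState(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip("abc", vifs):
    print("%s %.2f" % (name, value))
```

The near-duplicate columns `a` and `b` get large VIFs, while the independent column `c` stays close to the minimum of 1.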


In [22]:
# VIF of every column, in column order
for i, column in enumerate(df_x5.columns):
    print("%s %s" % (column, outliers.variance_inflation_factor(df_x5.values, i)))

const 91.4773722803
Germany 1.27101224421
Female 1.00321582613
CreditScore 1.00104450767
Age 2.10487620512
Tenure 1.00161073156
Balance 20.8895673667
NumOfProducts 1.15312466504
IsActiveMember 1.01384972933
Balance_log 8.72881202924
WealthAccumulation 14.5598595824


Rerun without the "Balance_log" variable:

In [23]:
df_x6 = df_x5.drop('Balance_log', axis = 1)

In [24]:
model8 = smdis.Logit(df_y, df_x6).fit()
print(model8.summary())
print("\n_____P-values____")
p = model8.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model8.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.427344
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        15:08:25   Log-Likelihood:                -4273.4
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4089      0.264    -12.904      0.000        -3.927    -2.891
Germa

In [25]:
# VIF of every column, in column order
for i, column in enumerate(df_x6.columns):
    print("%s %s" % (column, outliers.variance_inflation_factor(df_x6.values, i)))

const 90.6320775047
Germany 1.2161676569
Female 1.00302785174
CreditScore 1.00097995156
Age 2.10477505278
Tenure 1.00153406128
Balance 14.0245653407
NumOfProducts 1.12330226029
IsActiveMember 1.01353234553
WealthAccumulation 14.5598562679


Create a new log-transformed variable for wealth accumulation:

In [29]:
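# Note: df_x7 is an alias of df_x5 (not a copy), so the new column is also added to df_x5;
# the correlation and df_final cells further down rely on that.
df_x7 = df_x5
df_x7['WealthAccumulation_log'] = np.log10(df_x7['WealthAccumulation'] + 1)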
df_x7[:2]

Unnamed: 0,const,Germany,Female,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember,Balance_log,WealthAccumulation,WealthAccumulation_log
0,1,0,1,619,42,2,0.0,1,1,0.0,0.0,0.0
1,1,0,1,608,41,1,83807.86,1,1,4.92329,2044.094146,3.310713


In [30]:
df_x7 = df_x7.drop(['Balance', 'WealthAccumulation'], axis = 1)

In [31]:
model9 = smdis.Logit(df_y, df_x7).fit()
print(model9.summary())
print("\n_____P-values____")
p = model9.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model9.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.427891
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1535
Time:                        15:09:32   Log-Likelihood:                -4278.9
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.5735      0.267    -13.362      0.000        -4.098    

In [33]:
# VIF of every column, in column order
for i, column in enumerate(df_x7.columns):
    print("%s %s" % (column, outliers.variance_inflation_factor(df_x7.values, i)))

const 93.5608074531
Germany 1.26986359735
Female 1.00282038022
CreditScore 1.00102945704
Age 2.26479838945
Tenure 1.00165459162
NumOfProducts 1.15243000098
IsActiveMember 1.01168357579
Balance_log 705.940971113
WealthAccumulation_log 704.739971273


"Balance_log" and "WealthAccumulation_log" have very large VIF values (above 700), which means that these two variables carry essentially the same information. So one of these variables must be excluded!

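One way to see why the two logged variables nearly coincide is the quotient rule for logarithms. This is a sketch that ignores the + 1 inside the logarithm and assumes that Age varies far less on the log scale than Balance does:

$$\log_{10}\left(\frac{Balance}{Age}\right) = \log_{10}(Balance) - \log_{10}(Age)$$

So "WealthAccumulation_log" is roughly "Balance_log" shifted by $\log_{10}(Age)$, which is about 1.6 for the ages in the rows shown earlier (39 to 43). In the second sample row, for instance, 4.92329 - 3.310713 is about 1.61, close to $\log_{10}(41)$. Customers with a zero balance reinforce the effect, since both variables are exactly 0 for them.
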
In [34]:
df_x8 = df_x7.drop(['Balance_log'], axis = 1)

In [35]:
model10 = smdis.Logit(df_y, df_x8).fit()
print(model10.summary())
print("\n_____P-values____")
p = model10.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model10.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428322
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        15:17:01   Log-Likelihood:                -4283.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.9325      0.239    -16.467      0.000        -4.401    

Correlation between variables:

In [39]:
df_x5[['Age', 'Balance_log', 'WealthAccumulation_log', 'WealthAccumulation']].corr()

Unnamed: 0,Age,Balance_log,WealthAccumulation_log,WealthAccumulation
Age,1.0,0.03453,-0.007524,-0.246293
Balance_log,0.03453,1.0,0.998404,0.865141
WealthAccumulation_log,-0.007524,0.998404,1.0,0.888872
WealthAccumulation,-0.246293,0.865141,0.888872,1.0


As one can observe, the "WealthAccumulation_log" and "Balance_log" variables are highly correlated (0.998): they are basically the same variable, which is very bad for the model.

Rule of thumb: anything above 0.9 is a very high correlation. Correlations above 0.5 should be addressed.
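The rule of thumb can be applied mechanically. A small sketch that scans the correlation matrix printed above; the helper name `flag_pairs` is hypothetical, and the thresholds are the ones from the rule of thumb:

```python
import numpy as np

# Correlation matrix copied from the output above
names = ["Age", "Balance_log", "WealthAccumulation_log", "WealthAccumulation"]
corr = np.array([
    [1.0,       0.03453,  -0.007524, -0.246293],
    [0.03453,   1.0,       0.998404,  0.865141],
    [-0.007524, 0.998404,  1.0,       0.888872],
    [-0.246293, 0.865141,  0.888872,  1.0],
])


def flag_pairs(corr, names, threshold):
    """Return (name_a, name_b, r) for each pair with |r| above threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], corr[i, j]))
    return pairs


very_high = flag_pairs(corr, names, 0.9)
to_address = flag_pairs(corr, names, 0.5)
for a, b, r in to_address:
    label = "very high" if abs(r) > 0.9 else "should be addressed"
    print("%s / %s: %.3f (%s)" % (a, b, r, label))
```

Only the pair "Balance_log" / "WealthAccumulation_log" exceeds 0.9; the two pairs involving "WealthAccumulation" fall in the range that should be addressed.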

In [43]:
df_final = df_x5.drop(['Balance', 'WealthAccumulation', 'WealthAccumulation_log'], axis = 1)

In [44]:
model11 = smdis.Logit(df_y, df_final).fit()
print(model11.summary())
print("\n_____P-values____")
p = model11.pvalues
print(p)
print("\n_____Highest p-value____")
print("\n %s \n" % p[p == max(p)])
print("\n____Confusion Matrix___")
cm = model11.pred_table()
print(cm)
print("\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm)))

Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Tue, 27 Dec 2016   Pseudo R-squ.:                  0.1528
Time:                        15:52:34   Log-Likelihood:                -4282.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9126      0.237    -16.497      0.000        -4.377    -3.448
Germany          

In [45]:
# VIF of every column, in column order
for i, column in enumerate(df_final.columns):
    print("%s %s" % (column, outliers.variance_inflation_factor(df_final.values, i)))

const 76.3936634172
Germany 1.26934389528
Female 1.00281465894
CreditScore 1.00102204693
Age 1.01171547993
Tenure 1.0014367308
NumOfProducts 1.15145597018
IsActiveMember 1.01028026511
Balance_log 1.42007014313


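Now, none of the variables presents multicollinearity: every predictor has a VIF below 1.5 (the large value for "const" belongs to the intercept and does not describe a relationship between predictors).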