# Calculating the Accuracy of the Model

Using the same dataset, expand the model by including all other features in the regression.

Then calculate the accuracy of the model and create a confusion matrix.

## Import the relevant libraries

In [1]:
import numpy as npp
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

## Load the data

Load the ‘Titanic_Test.csv’ dataset.

In [41]:
data=pd.read_csv("D:\\Data_Sets\\DS_for_eng\\Titanic_Test.csv")
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,2,male,34.0
1,1,1,male,38.0
2,1,2,female,22.0
3,0,3,male,42.0
4,1,3,male,19.0


In [42]:
data_new=data.copy()
data_new["Sex"]=data_new["Sex"].map({"male":1,"female":0})
data_new.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,2,1,34.0
1,1,1,1,38.0
2,1,2,0,22.0
3,0,3,1,42.0
4,1,3,1,19.0


In [43]:
data_new.describe()

Unnamed: 0,Survived,Pclass,Sex,Age
count,206.0,206.0,206.0,206.0
mean,0.42233,2.23301,0.61165,30.286796
std,0.495134,0.852015,0.488562,15.23471
min,0.0,1.0,0.0,0.75
25%,0.0,1.0,0.0,20.625
50%,0.0,3.0,1.0,29.0
75%,1.0,3.0,1.0,39.75
max,1.0,3.0,1.0,71.0


### Declare the dependent and independent variables

Use 'Survived' as the dependent variable and 'Pclass', 'Sex' and 'Age' as the independent variables.

In [44]:
y=data_new["Survived"]
x1=data_new[["Pclass","Sex","Age"]]

### Simple Logistic Regression

Run the regression and graph the scatter plot.

## Expand the model

A simple logistic model may omit many causal factors, so we switch to a multivariate logistic regression model. Add the ‘Pclass’, ‘Sex’ and ‘Age’ regressors to the model and run the regression.

### Declare the independent variable(s)

In [45]:
x=sm.add_constant(x1)
log_reg=sm.Logit(y,x)
result=log_reg.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.418652
         Iterations 7


Dep. Variable:,Survived,No. Observations:,206.0
Model:,Logit,Df Residuals:,202.0
Method:,MLE,Df Model:,3.0
Date:,"Wed, 18 Mar 2020",Pseudo R-squ.:,0.3853
Time:,06:21:35,Log-Likelihood:,-86.242
converged:,True,LL-Null:,-140.29
,,LLR p-value:,2.812e-23

,coef,std err,z,P>|z|,[0.025,0.975]
const,6.0989,1.040,5.862,0.000,4.060,8.138
Pclass,-1.5275,0.283,-5.393,0.000,-2.083,-0.972
Sex,-2.6115,0.401,-6.518,0.000,-3.397,-1.826
Age,-0.0513,0.015,-3.433,0.001,-0.081,-0.022
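One way to read the coefficients above is to convert them to odds ratios and to plug a passenger's values into the logistic function. The sketch below is only an illustration: it hard-codes the coefficients copied from the summary table, and the names `coefs`, `odds_ratios` and `predict_probability` are hypothetical helpers, not part of statsmodels.

```python
import math

# Coefficients copied by hand from the regression summary above.
coefs = {"const": 6.0989, "Pclass": -1.5275, "Sex": -2.6115, "Age": -0.0513}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that variable, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "const"}

def predict_probability(pclass, sex, age):
    """Predicted survival probability (hypothetical helper for illustration)."""
    log_odds = (coefs["const"]
                + coefs["Pclass"] * pclass
                + coefs["Sex"] * sex      # 1 = male, 0 = female, as mapped earlier
                + coefs["Age"] * age)
    return 1.0 / (1.0 + math.exp(-log_odds))

# First row of the data: Pclass 2, male, age 34.
p_first = predict_probability(pclass=2, sex=1, age=34.0)

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.4f}")
print(f"Predicted probability for the first passenger: {p_first:.4f}")
```

Under this reading, the odds ratio for 'Sex' is about 0.07, so a male passenger's odds of survival are roughly 0.07 times those of a female passenger of the same class and age. The predicted probability for the first passenger falls below 0.5, which agrees with the recorded 'Survived' value of 0.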


### Confusion Matrix

Create the confusion matrix of the model and estimate its accuracy. 

<i> For convenience we have already provided you with a function that finds the confusion matrix and the model accuracy.</i>

In [46]:
def confusion_matrix(data,actual_values,model):
        
        # Confusion matrix 
        
        # Parameters
        # ----------
        # data: data frame or array
            # data is a data frame formatted in the same way as your input data (without the actual values)
            # e.g. const, var1, var2, etc. Order is very important!
        # actual_values: data frame or array
            # These are the actual values from the test_data
            # In the case of a logistic regression, it should be a single column with 0s and 1s
            
        # model: a LogitResults object
            # this is the variable where you have the fitted model 
            # e.g. results_log in this course
        # ----------
        
        # Predict the probabilities using the Logit model
        pred_values = model.predict(data)
        # Specify the bin edges
        bins=npp.array([0,0.5,1])
        # Create a 2D histogram: predictions between 0 and 0.5 are counted as 0,
        # predictions between 0.5 and 1 are counted as 1
        # Rows correspond to actual values, columns to predicted values
        cm = npp.histogram2d(actual_values, pred_values, bins=bins)[0]
        # Calculate the accuracy: correct predictions divided by all predictions
        accuracy = (cm[0,0]+cm[1,1])/cm.sum()
        # Return the confusion matrix and the accuracy
        return cm, accuracy
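To see how the binning step in the function above turns probabilities into a confusion matrix, here is a minimal sketch on a handful of made-up values. The arrays `actual` and `predicted_probs` are invented for illustration and do not come from the Titanic data.

```python
import numpy as np

# Made-up example values (not from the dataset).
actual = np.array([0, 0, 1, 1, 1, 0])
predicted_probs = np.array([0.1, 0.7, 0.8, 0.4, 0.9, 0.2])

# Two bins per axis: [0, 0.5) is treated as class 0, [0.5, 1] as class 1.
bins = np.array([0, 0.5, 1])

# Rows index the actual class, columns index the predicted class.
cm = np.histogram2d(actual, predicted_probs, bins=bins)[0]

# Correct predictions sit on the diagonal.
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)
print(accuracy)
```

In this toy example four of the six predictions land on the diagonal, so the accuracy is 4/6.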

In [50]:
confusion_matrix(x,y,result)

(array([[101.,  18.],
        [ 17.,  70.]]), 0.8300970873786407)
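The reported accuracy can be checked by hand from the matrix above. The sketch below copies the four counts from the output and assumes the layout produced by the function (rows are actual values, columns are predicted values); the variable names are chosen for illustration only.

```python
import numpy as np

# Counts copied from the output above.
cm = np.array([[101., 18.],
               [17., 70.]])

true_negatives = cm[0, 0]   # actual 0, predicted 0
false_positives = cm[0, 1]  # actual 0, predicted 1
false_negatives = cm[1, 0]  # actual 1, predicted 0
true_positives = cm[1, 1]   # actual 1, predicted 1

total = cm.sum()
accuracy = (true_negatives + true_positives) / total
misclassified = false_positives + false_negatives

print(f"Correct: {int(true_negatives + true_positives)} of {int(total)}")
print(f"Misclassified: {int(misclassified)}")
print(f"Accuracy: {accuracy:.4f}")
```

The model classifies 171 of the 206 passengers correctly, which gives the accuracy of about 83% shown in the output.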