# AccelerateAI: Logistic Regression

## Context setting: Logit Model

**Context and Dataset:**

We will predict whether a student will be admitted to a particular college based on their GMAT score, GPA, and work experience. The dependent variable is **binary**: it takes one of two values, ```admitted``` or ```not admitted``` (coded as 1 and 0 in the dataset).

**Usage of statsmodels:**

Statsmodels is a Python library that provides functions for estimating statistical models and performing statistical tests.
* We define the dependent (y) and independent (X) variables. If the dependent variable is non-numeric, it is first converted to numeric. In our example dataset this is not needed, as all columns are already numeric.
* Statsmodels provides a Logit() class for logistic regression. Logit() accepts y and X as parameters and returns a Logit model object. The model is then fitted to the data by calling its fit() method.
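
The conversion step mentioned above is not needed for this dataset, but it can be sketched as follows. The label strings and the `raw` series are made up for illustration; only the idea of mapping two categories to 1 and 0 comes from the text.

```python
import pandas as pd

# Hypothetical raw labels; the tutorial's dataset already stores 0/1.
raw = pd.Series(['admitted', 'not admitted', 'admitted'], name='admitted')

# Map each category to a number: 1 for the event of interest, 0 otherwise.
y = raw.map({'admitted': 1, 'not admitted': 0})

print(y.tolist())
```

Any label missing from the mapping dictionary becomes NaN, so it is worth checking `y.isna().sum()` before fitting.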

## Import Libraries & Load Dataset

Note: the dataset is available in the GitHub repo.

In [1]:
import pandas as pd 
import statsmodels.api as sm

In [2]:
# Loading the training & test dataset 
df_train = pd.read_csv('C:/Users/mishr/Desktop/Notebooks/data/studentadmit_train1.csv', index_col = 0) 
df_test = pd.read_csv('C:/Users/mishr/Desktop/Notebooks/data/studentadmit_test1.csv', index_col = 0) 
df_train.head()

Unnamed: 0,gmat,gpa,work_experience,admitted
16,580,2.7,4,0
33,660,3.3,6,1
8,740,3.3,5,1
38,590,1.7,4,0
32,660,4.0,4,1


In [3]:
# Defining the dependent and independent variables
X_train = df_train[['gmat', 'gpa', 'work_experience']]
y_train = df_train['admitted']  # 1-d Series, matching how y_test is defined later

# Add constant term
X_train_const = sm.add_constant(X_train, prepend=False)

## Logit Model

Parameters:

* endog: A 1-d endogenous response variable. The dependent variable.
* exog: A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user (using add_constant).
* offset: Offset is added to the linear prediction with coefficient equal to 1.
* missing: Available options are ‘none’, ‘drop’, and ‘raise’. Default is ‘none’.
    - If ‘none’, no nan checking is done.
    - If ‘drop’, any observations with nans are dropped.
    - If ‘raise’, an error is raised.
* check_rank: Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape is large.
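
To make the roles of `exog`, the constant column, and `offset` concrete, here is a minimal NumPy sketch of how a logit model turns a linear prediction into a probability. The two observations and the coefficient values are invented for illustration and are not the fitted values from this notebook.

```python
import numpy as np

def sigmoid(eta):
    """Logistic function: maps a linear prediction to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Two made-up observations with columns gmat, gpa, work_experience.
X = np.array([[580.0, 2.7, 4.0],
              [740.0, 3.3, 5.0]])

# Same effect as sm.add_constant(X, prepend=False): append a column of ones
# so the last coefficient acts as the intercept.
X_const = np.column_stack([X, np.ones(len(X))])

# Hypothetical coefficients (last entry is the intercept), not fitted values.
beta = np.array([0.01, 1.0, 0.5, -12.0])

# The offset enters the linear prediction with a fixed coefficient of 1.
offset = np.zeros(len(X))

eta = X_const @ beta + offset   # linear prediction
prob = sigmoid(eta)             # predicted probability of admission

print(eta, prob)
```

Without the column of ones, the model would be forced through a zero intercept, which is why the constant is added before fitting.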

In [4]:
# Building the model and fitting the data
log_reg = sm.Logit(y_train, X_train_const).fit()

Optimization terminated successfully.
         Current function value: 0.247296
         Iterations 8


In [5]:
print(log_reg.summary())

                           Logit Regression Results                           
Dep. Variable:               admitted   No. Observations:                   30
Model:                          Logit   Df Residuals:                       26
Method:                           MLE   Df Model:                            3
Date:                Tue, 06 Sep 2022   Pseudo R-squ.:                  0.6432
Time:                        19:30:17   Log-Likelihood:                -7.4189
converged:                       True   LL-Null:                       -20.794
Covariance Type:            nonrobust   LLR p-value:                 6.639e-06
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat                0.0025      0.018      0.141      0.888      -0.032       0.037
gpa                 3.3208      2.397      1.385      0.166      -1.378       8.020
work_experience     0.9975      

**Interpretation:**

* coef : The estimated coefficients of the independent variables in the regression equation
* std err : Estimates of the standard errors of the estimated coefficients
* z : Ratio of each estimated coefficient to its estimated standard error. This is the Wald test statistic, z = coef / std err, for the null hypothesis that the coefficient is zero.
* P>|z| : The two-sided p-value of that Wald test for each variable.
* Pseudo R-squ. : A substitute for the R-squared value of least-squares linear regression. Statsmodels reports McFadden's pseudo R-squared, 1 - (Log-Likelihood / LL-Null); here 1 - (-7.4189 / -20.794) ≈ 0.6432.
* Log-Likelihood : The natural logarithm of the likelihood function evaluated at the fitted parameters. Maximum Likelihood Estimation (MLE) is the optimization process that finds the parameters maximizing this value.
* LL-Null : The log-likelihood of the model when no independent variable is included (only an intercept is included).
* LLR p-value : The p-value of the log-likelihood ratio test comparing the fitted model with the intercept-only model. A small value indicates that the regressors jointly improve the fit.
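
Several of these quantities can be recomputed by hand from the numbers printed in the summary. The sketch below uses only the standard library. The closed-form chi-square tail used here is specific to 3 degrees of freedom (the Df Model in this summary); for other cases a general routine such as `scipy.stats.chi2.sf` would be the usual choice. Small differences from the printed values come from rounding in the table.

```python
import math

# Values copied from the summary table above.
ll_full = -7.4189      # Log-Likelihood
ll_null = -20.794      # LL-Null
coef_gpa, se_gpa = 3.3208, 2.397

# Wald z statistic for gpa: coefficient divided by its standard error.
z_gpa = coef_gpa / se_gpa

# McFadden's pseudo R-squared.
pseudo_r2 = 1.0 - ll_full / ll_null

# Likelihood ratio statistic: twice the gain in log-likelihood.
llr_stat = 2.0 * (ll_full - ll_null)

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variable with exactly 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

llr_pvalue = chi2_sf_3df(llr_stat)

print(round(z_gpa, 3), round(pseudo_r2, 4), llr_pvalue)
```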

## Predict on Test Data

In [6]:
df_test.sample(4)

Unnamed: 0,gmat,gpa,work_experience,admitted
35,650,2.3,1,0
7,720,3.3,4,1
36,670,2.7,2,0
30,640,3.0,1,0


In [7]:
X_test = df_test[['gmat', 'gpa', 'work_experience']]
y_test = df_test['admitted']

# Add constant term
X_test_const = sm.add_constant(X_test, prepend=False)

In [8]:
# Performing predictions on the test dataset
yhat = log_reg.predict(X_test_const)
prediction = list(map(round, yhat))
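
`predict` returns probabilities, and rounding them amounts to a 0.5 cutoff. An explicit threshold makes that choice visible and easy to change. The probabilities below are hypothetical stand-ins for `yhat`, chosen to include the boundary case.

```python
import numpy as np

# Hypothetical predicted probabilities standing in for yhat.
probs = np.array([0.08, 0.5, 0.93, 0.41])

# Classify as admitted (1) when the probability reaches the threshold.
threshold = 0.5
labels = (probs >= threshold).astype(int)

print(labels.tolist())
```

One design difference to be aware of: Python's built-in `round` rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0, whereas `>= threshold` assigns it to class 1.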

In [9]:
# comparing original and predicted values of y
print('Actual values :', list(y_test.values))
print('Predictions   :', prediction)

Actual values : [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions   : [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]


## Accuracy

In [10]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [11]:
# confusion matrix
cm = confusion_matrix(y_test, prediction) 
print ("Confusion Matrix : \n", cm) 

Confusion Matrix : 
 [[6 0]
 [1 3]]


In [12]:
# accuracy score of the model
print('Test accuracy = ', accuracy_score(y_test, prediction))

Test accuracy =  0.9
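
The accuracy can be traced back to the confusion matrix printed earlier. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The sketch below also derives precision and recall, which the notebook does not report, as an illustration of what else the same four counts provide.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = [[6, 0],
      [1, 3]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # share of predicted admits that are real admits
recall = tp / (tp + fn)         # share of real admits that were found

print(accuracy, precision, recall)
```

With only 10 test observations, a single misclassified student moves accuracy by 10 percentage points, so these figures should be read as rough indications.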
