A researcher is interested in how predictor variables such as
1) GRE (Graduate Record Exam) scores,
2) GPA (grade point average), and
3) rank/prestige of the undergraduate institution
affect admission into graduate school.

# The response variable, "admission to grad school", is a binary variable.
The only two outcomes are admit/don’t admit.
Values are 0 = no admit, 1 = admit.

THIS TASK IS CALLED CLASSIFICATION.
CLASSIFICATION => The target output is one of a limited number of categories.
In this problem we have only two possible targets: no admit and admit.



CHECK FOR UNDERSTANDING: How is Classification different from Regression (for example, Linear Regression)?

Note: The fact that you solve CLASSIFICATION problems with a technique called LOGISTIC REGRESSION is unfortunate, but a fact of life.
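
One way to approach the question above: linear regression outputs an unbounded number, whereas logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function, giving a probability between 0 and 1 that can then be cut into a class label. The sketch below uses only the standard library; the function names and the 0.5 cutoff are illustrative assumptions, not part of the original notebook.

```python
import math

def sigmoid(log_odds):
    """Map a log-odds value (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(log_odds, threshold=0.5):
    """Turn the probability into a class label: 1 = admit, 0 = no admit."""
    return 1 if sigmoid(log_odds) >= threshold else 0

# A regression-style score can be any real number; the sigmoid squeezes
# it into a probability, and the threshold turns that into a category.
for z in (-3.0, 0.0, 3.0):
    print(f"log-odds {z:+.1f} -> P(admit) = {sigmoid(z):.3f} -> class {classify(z)}")
```

The "regression" in the name refers to the linear model for the log-odds; the classification step is the thresholding applied afterwards.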

In [8]:
import pandas as pd
import statsmodels.formula.api as smf

In [3]:
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
admit    400 non-null int64
gre      400 non-null int64
gpa      400 non-null float64
rank     400 non-null int64
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


In [9]:
df.describe()

          admit         gre       gpa     rank
count     400.0       400.0     400.0    400.0
mean     0.3175       587.7    3.3899    2.485
std    0.466087  115.516536  0.380567  0.94446
min         0.0       220.0      2.26      1.0
25%         0.0       520.0      3.13      2.0
50%         0.0       580.0     3.395      2.0
75%         1.0       660.0      3.67      3.0
max         1.0       800.0       4.0      4.0


'''
admit is the target we want to model.
admit is categorical (binary): 0 = no admit, 1 = admit
gre is continuous
gpa is continuous
rank is categorical: 1,2,3, or 4
'''

# You could do this. But don't. (Rank should be treated as a categorical variable!)

In [17]:
fitted_model = smf.logit(formula='admit ~ gre + gpa + rank', data=df).fit()
fitted_model.summary()

Optimization terminated successfully.
         Current function value: 0.574302
         Iterations 6


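==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      396
Method:                           MLE   Df Model:                            3
Date:                Mon, 18 Nov 2019   Pseudo R-squ.:                 0.08107
Time:                        17:18:07   Log-Likelihood:                -229.72
converged:                       True   LL-Null:                       -249.99
Covariance Type:            nonrobust   LLR p-value:                 8.207e-09
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.4495      1.133     -3.045      0.002      -5.670      -1.229
gre            0.0023      0.001      2.101      0.036       0.000       0.004
gpa            0.7770      0.327      2.373      0.018       0.135       1.419
rank          -0.5600      0.127     -4.405      0.000      -0.809      -0.311
==============================================================================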
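
The coefficients in this table are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate: exp(coef) is the factor by which the odds of admission are multiplied for a one-unit increase in that predictor, holding the others fixed. The sketch below copies the rounded coefficients from the table above, so the results are approximate; the variable names are illustrative. It also shows what treating rank as a number implies: every one-step change in rank is forced to have the same effect.

```python
import math

# Coefficients copied (rounded) from the summary table above,
# where rank is treated as a numeric predictor.
coefs = {"gre": 0.0023, "gpa": 0.7770, "rank": -0.5600}

# exp(coef) = multiplicative change in the odds of admission for a
# one-unit increase in that predictor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.4f}")

# With numeric rank, each one-step move (1->2, 2->3, 3->4) shifts the
# log-odds by the same amount, so going from rank 1 to rank 4 is 3 steps.
rank_1_to_4 = math.exp(3 * coefs["rank"])
print(f"rank 1 -> 4: odds multiplied by {rank_1_to_4:.4f}")
```

That equal-step constraint is the reason the numeric treatment is questionable: nothing guarantees that the gap between rank 1 and rank 2 schools matters as much as the gap between rank 3 and rank 4.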


# Do This Instead!
## In practice you should *always* explicitly mark categorical factors with C(), like so:
## Notice that this model has more coefficients. (Why?)

In [16]:
fitted_model = smf.logit(formula='admit ~ gre + gpa + C(rank)', data=df).fit()
fitted_model.summary()

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


================================================================================
Dep. Variable:                    admit   No. Observations:                  400
Model:                            Logit   Df Residuals:                      394
Method:                             MLE   Df Model:                            5
Date:                  Mon, 18 Nov 2019   Pseudo R-squ.:                 0.08292
Time:                          17:15:57   Log-Likelihood:                -229.26
converged:                         True   LL-Null:                       -249.99
Covariance Type:              nonrobust   LLR p-value:                 7.578e-08
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -3.9900      1.140     -3.500      0.000      -6.224      -1.756
C(rank)[T.2]    -0.6754      0.316     -2.134      0.033      -1.296      -0.055
C(rank)[T.3]    -1.3402      0.345     -3.881      0.000      -2.017      -0.663
C(rank)[T.4]    -1.5515      0.418     -3.713      0.000      -2.370      -0.733
gre              0.0023      0.001      2.070      0.038       0.000       0.004
gpa              0.8040      0.332      2.423      0.015       0.154       1.454
================================================================================
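
To see where the extra coefficients come from and how they are used, the sketch below rebuilds a prediction by hand from the table above. Rank 1 has no coefficient of its own because it serves as the baseline level, so four levels need only three indicator terms, each measured relative to rank 1. The coefficients are the rounded values from the table, so the probabilities are approximate, and the applicant (GRE 600, GPA 3.5) is a hypothetical example chosen for illustration.

```python
import math

# Coefficients copied (rounded) from the C(rank) summary table above.
# Rank 1 is the baseline level and is absorbed into the intercept,
# so its effect relative to itself is 0.
intercept = -3.9900
rank_effect = {1: 0.0, 2: -0.6754, 3: -1.3402, 4: -1.5515}
b_gre, b_gpa = 0.0023, 0.8040

def admit_probability(gre, gpa, rank):
    """Predicted probability of admission from the rounded coefficients."""
    log_odds = intercept + rank_effect[rank] + b_gre * gre + b_gpa * gpa
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical applicant (illustrative values only), varied across ranks.
probabilities = {}
for rank in (1, 2, 3, 4):
    probabilities[rank] = admit_probability(gre=600, gpa=3.5, rank=rank)
    print(f"rank {rank}: P(admit) = {probabilities[rank]:.3f}")
```

Unlike the numeric-rank model, the steps here are not equal: the drop from rank 1 to rank 2 differs from the drop from rank 3 to rank 4, which is exactly the flexibility that C(rank) buys.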


In [18]:
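# CONFUSION MATRIX: how well did the model's predictions match the observed data?
# Rows are observed classes (0, 1); columns are predicted classes (0, 1), using a 0.5 threshold.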
fitted_model.pred_table()

array([[253.,  20.],
       [ 98.,  29.]])
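
The raw counts are easier to judge once they are turned into rates. The sketch below copies the matrix above and computes accuracy, recall, and precision, plus the accuracy of a trivial baseline that always predicts "no admit". It assumes the statsmodels convention that rows are observed classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix copied from the pred_table() output above.
# Rows are observed classes, columns are predicted classes.
table = [[253.0, 20.0],
         [98.0, 29.0]]

tn, fp = table[0]  # observed 0: predicted 0, predicted 1
fn, tp = table[1]  # observed 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # share of all predictions that were right
recall = tp / (tp + fn)         # share of actual admits the model caught
precision = tp / (tp + fp)      # share of predicted admits that were right
baseline = (tn + fp) / total    # accuracy of always predicting "no admit"

print(f"accuracy  = {accuracy:.4f}")
print(f"recall    = {recall:.4f}")
print(f"precision = {precision:.4f}")
print(f"baseline  = {baseline:.4f}")
```

Accuracy alone flatters the model here: it is only slightly better than the always-reject baseline, and the low recall shows that most actual admits are predicted as rejections at the 0.5 threshold.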