# Homework w04d04


The attached dataset describes students applying to graduate school: whether they were admitted, their GRE (Graduate Record Examination) score, their GPA (Grade Point Average) and the prestige of their undergraduate institution. The aim is to use logistic regression to predict the probability of admission from the remaining features.

The variables have the following ranges:

1. Admit: 1 (admitted), 0 (not admitted)
1. GRE: 0 (lowest) to 800 (highest)
1. GPA: 0 (lowest) to 4 (highest)
1. Prestige: 1 (highest) to 4 (lowest)

In [81]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab as pl
import statsmodels.formula.api as smf
%matplotlib inline

# Optional
import seaborn as sns
sns.set_style("darkgrid")
from sklearn import model_selection, linear_model, metrics

# Import data
df = pd.read_csv("admissions.csv").dropna()
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


## Part 1. Frequency Tables

#### Create a frequency table of our variables

In [40]:
# Create a frequency table for prestige vs whether or not someone was admitted (hint: look at pd.crosstab)
xtab = pd.crosstab(df['prestige'],df['admit'])
xtab

admit,0,1
prestige,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [30]:
from __future__ import division
# Share of applicants admitted at each prestige level (a fraction between 0 and 1)
for prestige in xtab.index:
    fraction_accepted = xtab.loc[prestige, 1] / (xtab.loc[prestige, 0] + xtab.loc[prestige, 1])
    print("Fraction of students accepted: %f" % fraction_accepted)

Fraction of students accepted: 0.540984
Fraction of students accepted: 0.358108
Fraction of students accepted: 0.231405
Fraction of students accepted: 0.179104
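
The same admission rates can also be obtained by dividing each row of the frequency table by its row total. The snippet below is a minimal sketch, assuming the counts are re-entered by hand from the table above so that it runs without `admissions.csv`; the name `rates` is illustrative only.

```python
import pandas as pd

# Counts copied from the prestige vs admit frequency table above.
xtab = pd.DataFrame(
    {0: [28, 95, 93, 55], 1: [33, 53, 28, 12]},
    index=pd.Index([1, 2, 3, 4], name="prestige"),
)
xtab.columns.name = "admit"

# Divide every row by its total so each row sums to 1.
rates = xtab.div(xtab.sum(axis=1), axis=0)
print(rates.round(6))
```

Column `1` of `rates` then holds the fraction admitted at each prestige level.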


## Part 2. Dummy variables

#### 2.1 Create four new dummy variables for prestige

In [49]:
# Create dummy vars here
dummy_ranks = pd.get_dummies(df['prestige'])
dummy_ranks.columns = ['prestige_1','prestige_2','prestige_3','prestige_4']

#### 2.2 When modelling our prestige categorical variables, how many do we need? Why?
All 4? 3? 2? 1?

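Answer: In general we need one fewer dummy variable than the number of categories, so 3 here. The omitted category acts as the reference level, and including all 4 alongside an intercept would make the columns perfectly collinear.
For the hand calculations in Part 3 we still keep all 4 dummies, because each one is cross-tabulated on its own.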
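
To illustrate why one category has to be dropped, the sketch below builds the design matrix both ways on a tiny made-up `prestige` column (the six values are hypothetical, not taken from the dataset) and compares the matrix rank with the number of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing all four prestige levels.
prestige = pd.Series([3, 3, 1, 4, 4, 2], name="prestige")

# All four dummies versus three dummies (first level dropped as reference).
full = pd.get_dummies(prestige, prefix="prestige", dtype=float)
reduced = pd.get_dummies(prestige, prefix="prestige", drop_first=True, dtype=float)

# Prepend an intercept column of ones to each design matrix.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # more columns than rank: collinear
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: identifiable
```

With all four dummies the columns sum to the intercept column, so the matrix loses one rank; dropping one level removes that redundancy.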

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

In [50]:
cols_to_keep = ['admit', 'gre', 'gpa']
# .ix has been removed from pandas; .loc does the same label-based column slice
hand_calc = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_1':])
hand_calc.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


#### 3.1 Cross-tabulate prestige_1 vs admission

In [52]:
# Use pd.crosstab to create a frequency table of prestige_1 vs admission
xtab2 = pd.crosstab(hand_calc['admit'],hand_calc['prestige_1'])
xtab2

prestige_1,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,243,28
1,93,33


#### 3.2 Use the cross-tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [69]:
prob1 = xtab2[1][1]/(xtab2[1][0]+xtab2[1][1])
odds1 = prob1/(1-prob1)
print("Probability of admission after attending a #1 ranked college: %f, odds: %f" % (prob1,odds1))

Probability of admission after attending a #1 ranked college: 0.540984, odds: 1.178571


#### 3.3 Now calculate the odds of admission if you did not attend a #1 ranked college

In [70]:
prob2 = xtab2[0][1]/(xtab2[0][0]+xtab2[0][1])
odds2 = prob2/(1-prob2)
print("Probability of admission without attending a #1 ranked college: %f, odds: %f" % (prob2,odds2))

Probability of admission without attending a #1 ranked college: 0.276786, odds: 0.382716


#### 3.4 Calculate the odds ratio

In [66]:
odds1/odds2

3.0794930875576041

#### 3.5 Write this finding in a sentence 

Answer: the odds ratio tells us that the odds of admission for applicants from a rank #1 college are about 3.1 times the odds for applicants from any other college.
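
As a cross-check on the hand calculation, the odds can be computed directly from the counts (admitted divided by not admitted), and the odds ratio equals the cross-product of the 2x2 table. The sketch below re-enters the counts from the prestige_1 cross-tab; the helper `odds` and the variable names are illustrative.

```python
def odds(admitted, rejected):
    """Odds of admission = number admitted / number not admitted."""
    return admitted / rejected

# Counts copied from the prestige_1 vs admit cross-tab above.
admitted_rank1, rejected_rank1 = 33, 28
admitted_other, rejected_other = 93, 243

odds_rank1 = odds(admitted_rank1, rejected_rank1)
odds_other = odds(admitted_other, rejected_other)
odds_ratio = odds_rank1 / odds_other

# Equivalent cross-product form: (a * d) / (b * c).
cross_product = (admitted_rank1 * rejected_other) / (rejected_rank1 * admitted_other)

print(round(odds_rank1, 6), round(odds_other, 6), round(odds_ratio, 6))
```

Computing the odds from counts skips the intermediate probability, since p / (1 - p) reduces to admitted / rejected.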

#### 3.6 Print the cross-tab vs prestige_4

In [71]:
xtab4 = pd.crosstab(hand_calc['admit'],hand_calc['prestige_4'])
xtab4

prestige_4,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,216,55
1,114,12


#### 3.7 Calculate the odds ratio

In [74]:
prob3 = xtab4[1][1]/(xtab4[1][0]+xtab4[1][1])
odds3 = prob3/(1-prob3)
print("Probability of admission after attending a #4 ranked college: %f, odds: %f" % (prob3,odds3))
prob4 = xtab4[0][1]/(xtab4[0][0]+xtab4[0][1])
odds4 = prob4/(1-prob4)
print("Probability of admission without attending a #4 ranked college: %f, odds: %f" % (prob4,odds4))
ratio = odds3/odds4
print(ratio)
print(1/ratio)

Probability of admission after attending a #4 ranked college: 0.179104, odds: 0.218182
Probability of admission without attending a #4 ranked college: 0.345455, odds: 0.527778
0.413397129187
2.41898148148


#### 3.8 Write this finding in a sentence

Answer: the odds ratio tells us that the odds of admission for applicants from a rank #4 college are about 0.41 times the odds for applicants from any other college; in other words, their odds are lower by a factor of about 2.4.

## Part 4. Analysis
First we'll create a clean data frame for the regression analysis.

In [75]:
# We'll set the top tier (#1, aka most prestigious) as our reference category
# and merge prestige_2, prestige_3 and prestige_4 back into the dataset
cols_to_keep = ['admit', 'gre', 'gpa']
# .ix has been removed from pandas; .loc does the same label-based column slice
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
data.head()

Unnamed: 0,admit,gre,gpa,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0


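We're going to add a constant term for our logistic regression. The array-based statsmodels interface requires that intercepts/constants are specified explicitly; the formula interface used in 4.2 adds an `Intercept` term automatically, so the manual column is not needed there.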

In [76]:
# Manually add the intercept
data['intercept'] = 1.0

#### 4.1 Assign the predictor column names to a variable called train_cols

In [None]:
train_cols = data.columns[1:]  # predictor column names: every column except 'admit'

#### 4.2 Fit a logistic regression model using statsmodels
We want to model admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4 + intercept using logistic regression.

In [94]:
model = smf.logit(formula="admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4",data = data).fit() 

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6
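
The "Iterations" line above reflects that the coefficients are found by iteratively maximizing the likelihood. As an illustration of what such a fit involves, the sketch below implements Newton-Raphson for logistic regression in plain NumPy on synthetic data. It is not statsmodels' actual code, and the function name `fit_logit_newton`, the simulated data and the true coefficients are all assumptions made for the example.

```python
import numpy as np

def fit_logit_newton(X, y, max_iter=25, tol=1e-10):
    """Maximize the logistic log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        gradient = X.T @ (y - p)                   # score vector
        weights = p * (1.0 - p)
        information = (X * weights[:, None]).T @ X  # negative Hessian
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data: an intercept plus one standard-normal predictor.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit_newton(X, y)
print(beta_hat)
```

At the solution the score vector `X.T @ (y - p)` is numerically zero, which is the condition the optimizer is driving towards.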


#### 4.3 Print the logistic regression summary results

In [95]:
model.summary() 

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 03 Nov 2016",Pseudo R-squ.:,0.08166
Time:,16:26:51,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735


#### 4.4  Give a short interpretation of the regression coefficients

In [96]:
np.exp(model.params)

Intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64

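Parameter interpretation (the fitted coefficients are on the log-odds scale; the exponentiated values above are odds and odds ratios):
* Intercept: the coefficient is the log odds of admission for an applicant from a rank #1 college with gre and gpa set to 0; its exponential (0.021) is the corresponding odds.
* gre: each additional GRE point adds 0.0022 to the log odds, which multiplies the odds by 1.002.
* gpa: each additional GPA point adds 0.78 to the log odds, which multiplies the odds by 2.18.
* prestige_2, prestige_3, prestige_4: the change in log odds relative to a rank #1 college; the odds are multiplied by 0.51, 0.26 and 0.21 respectively.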
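
The confidence bounds in the summary table can be moved to the odds-ratio scale in the same way, by exponentiating them. The sketch below uses the rounded coefficients and 95% bounds copied by hand from the table in 4.3, so its results differ slightly from `np.exp(model.params)`; the dictionary names are illustrative.

```python
import numpy as np

# (coef, lower 95% bound, upper 95% bound), copied from the summary table.
estimates = {
    "gre": (0.0022, 7.44e-05, 0.004),
    "gpa": (0.7793, 0.128, 1.431),
    "prestige_2": (-0.6801, -1.301, -0.059),
    "prestige_3": (-1.3387, -2.015, -0.663),
    "prestige_4": (-1.5534, -2.372, -0.735),
}

# Exponentiate each value to get the odds ratio and its interval.
odds_ratios = {
    name: tuple(float(np.exp(value)) for value in triple)
    for name, triple in estimates.items()
}

for name, (ratio, lower, upper) in odds_ratios.items():
    print("%-11s OR = %.3f  (95%% CI %.3f to %.3f)" % (name, ratio, lower, upper))
```

An interval that excludes 1 on the odds-ratio scale corresponds to an interval that excludes 0 on the log-odds scale.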

Example: if you attended a rank #2 college, scored 600 on the GRE and had a GPA of 3.5, then your log odds will be:

In [110]:
log_odds = model.params.Intercept + model.params.gre*600 + model.params.gpa*3.5 + model.params.prestige_2
log_odds

-0.49826839678180379

In [111]:
np.exp(log_odds)

0.60758184000585658

In this case your odds of being admitted are 0.61 to 1, or approximately 3 to 5.

Example 2: if you attended a rank #1 college, scored 700 on the GRE and had a GPA of 4.5 (above the 0 to 4 range listed at the top, so this is an extrapolation), then your log odds will be:

In [112]:
log_odds = model.params.Intercept + model.params.gre*700 + model.params.gpa*4.5
log_odds

1.1830457668781018

In [117]:
np.exp(log_odds)

3.264301378302533

In this case your odds of being admitted are 3.26 to 1, or approximately 13 to 4.
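
Since the stated aim is to predict a probability, the log odds from the two examples can be passed through the logistic function, p = 1 / (1 + exp(-log_odds)). A minimal sketch, with the two log-odds values copied from the outputs above and a hypothetical helper name:

```python
import math

def log_odds_to_probability(log_odds):
    """Logistic function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Log odds copied from the two worked examples above.
example_1 = log_odds_to_probability(-0.49826839678180379)
example_2 = log_odds_to_probability(1.1830457668781018)

print(round(example_1, 2), round(example_2, 2))  # 0.38 0.77
```

This is the same as computing odds / (1 + odds) from the exponentiated values.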