# Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

In [101]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np


In [102]:
##import data set
df_raw = pd.read_csv("C:/Users/ilybangi/Desktop/Python/Unit Projects/project-3/assets/admissions.csv")

In [103]:
##drop null values and rename data frame
df = df_raw.dropna() 
print(df.head())

   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0


## Part 1. Frequency Tables

#### 1. Let's create a frequency table of our variables

In [104]:
# frequency table for prestige and whether or not someone was admitted
admit_prestige = pd.crosstab(index=df["admit"], 
                           columns=df["prestige"])

admit_prestige.index= ["not admitted","admitted"]

admit_prestige

prestige      1.0  2.0  3.0  4.0
not admitted   28   95   93   55
admitted       33   53   28   12
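
The raw counts are easier to compare as admission rates. The following is a minimal sketch that hard-codes the counts from the frequency table (the `counts` and `rates` names are illustrative, not part of the original notebook) and divides admitted by the column total for each prestige rank.

```python
# Counts copied from the admit/prestige frequency table.
counts = {
    1: {"admitted": 33, "not admitted": 28},
    2: {"admitted": 53, "not admitted": 95},
    3: {"admitted": 28, "not admitted": 93},
    4: {"admitted": 12, "not admitted": 55},
}

rates = {}
for rank, c in counts.items():
    total = c["admitted"] + c["not admitted"]
    rates[rank] = c["admitted"] / total
    print(f"prestige {rank}: {rates[rank]:.3f} ({c['admitted']}/{total})")
# prestige 1: 0.541 (33/61)
# prestige 2: 0.358 (53/148)
# prestige 3: 0.231 (28/121)
# prestige 4: 0.179 (12/67)
```

The admission rate drops steadily from prestige 1 to prestige 4, which is the pattern the odds ratios in Part 3 quantify.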


## Part 2. Return of dummy variables

#### 2.1 Create class or dummy variables for prestige 

In [105]:
dummy_ranks = pd.get_dummies(df_raw['prestige']) ##create dummy variables

In [106]:
dummy_ranks.columns = ['prestige_1', 'prestige_2','prestige_3','prestige_4']

In [107]:
df = df.join(dummy_ranks)

In [108]:
df.drop('prestige', inplace=True, axis=1)  ##drop prestige in place of dummy columns

In [109]:
dummy_ranks

   prestige_1  prestige_2  prestige_3  prestige_4
0           0           0           1           0
1           0           0           1           0
2           1           0           0           0
3           0           0           0           1
4           0           0           0           1
5           0           1           0           0
6           1           0           0           0
7           0           1           0           0
8           0           0           1           0
9           0           1           0           0


In [110]:
df.head()

   admit    gre   gpa  prestige_1  prestige_2  prestige_3  prestige_4
0      0  380.0  3.61           0           0           1           0
1      1  660.0  3.67           0           0           1           0
2      1  800.0  4.00           1           0           0           0
3      1  640.0  3.19           0           0           0           1
4      0  520.0  2.93           0           0           0           1


#### 2.2 When modeling our class variables, how many do we need? 



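Answer: We need 3, one fewer than the number of prestige ranks. With an intercept in the model, the four dummy variables always sum to 1, so including all of them makes the design matrix perfectly collinear. We drop prestige_1 and treat it as the reference category, which is why the regression in Part 4 uses only prestige_2 through prestige_4.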
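
The collinearity can be checked numerically. The sketch below uses a small made-up `prestige` series (not the admissions data) and compares the rank of the design matrix with all four dummies plus an intercept against the one built with `drop_first=True`, which is pandas' built-in way of omitting the reference category.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; every rank appears at least once.
prestige = pd.Series([3, 3, 1, 4, 4, 2, 1, 2], name="prestige")

# All four dummies plus an intercept: the dummies sum to the intercept column.
all_dummies = pd.get_dummies(prestige, prefix="prestige").astype(float)
all_dummies["intercept"] = 1.0
rank_all = np.linalg.matrix_rank(all_dummies.to_numpy())

# drop_first=True omits prestige_1, leaving it as the reference category.
reduced = pd.get_dummies(prestige, prefix="prestige", drop_first=True).astype(float)
reduced["intercept"] = 1.0
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(all_dummies.shape[1], rank_all)    # 5 4 (rank deficient)
print(reduced.shape[1], rank_reduced)    # 4 4 (full column rank)
```

Five columns with rank 4 means one column is redundant, so the coefficients would not be uniquely identified. After dropping one dummy, the remaining columns are linearly independent.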

## Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

In [111]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_1':])
print(handCalc.head())

   admit    gre   gpa  prestige_1  prestige_2  prestige_3  prestige_4
0      0  380.0  3.61           0           0           1           0
1      1  660.0  3.67           0           0           1           0
2      1  800.0  4.00           1           0           0           0
3      1  640.0  3.19           0           0           0           1
4      0  520.0  2.93           0           0           0           1


In [112]:
#crosstab prestige 1 admission 
# frequency table cutting prestige and whether or not someone was admitted
pd.crosstab(df.admit, df.prestige_1)

prestige_1    0   1
admit
0           243  28
1            93  33


#### 3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

In [113]:
odds_admit_P1 = 33/28

odds_admit_P1

1.1785714285714286

#### 3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

In [114]:
odds_admit_not_P1 = 93/243

odds_admit_not_P1

0.38271604938271603

#### 3.3 Calculate the odds ratio

In [115]:
odds_ratio = odds_admit_P1 / odds_admit_not_P1

odds_ratio

3.079493087557604

#### 3.4 Write this finding in a sentence: 

Answer: The odds of being admitted into this program for students from a top-tier (prestige 1) school are about 3.1 times the odds for students from lower-ranked (prestige 2, 3, or 4) schools.

#### 3.5 Print the cross tab for prestige_4

In [116]:
#crosstab prestige 4 admission 
# frequency table cutting prestige and whether or not someone was admitted
pd.crosstab(df.admit, df.prestige_4)

prestige_4    0   1
admit
0           216  55
1           114  12


In [117]:
odds_admit_p4 = 12/55

odds_admit_p4

0.21818181818181817

In [118]:
odds_admit_not_p4 = 114/216

odds_admit_not_p4

0.5277777777777778

#### 3.6 Calculate the OR 

In [119]:
odds_ratio = odds_admit_p4 / odds_admit_not_p4

odds_ratio

0.4133971291866028

#### 3.7 Write this finding in a sentence

Answer: The odds of being admitted into the program for students who attended a prestige 4 school are about 0.41 times the odds for students who attended a prestige 1, 2, or 3 school, which is roughly 59% lower.
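
Both hand calculations in this part follow the same pattern, so they can be wrapped in one small function. This is a minimal sketch; the `odds_ratio` helper and its argument names are illustrative and not part of the original notebook. The counts are read off the two crosstabs above.

```python
def odds_ratio(admitted_in, rejected_in, admitted_out, rejected_out):
    """Odds of admission inside a group divided by the odds outside it."""
    odds_in = admitted_in / rejected_in
    odds_out = admitted_out / rejected_out
    return odds_in / odds_out

# prestige_1 crosstab: 33 admitted / 28 not, versus 93 admitted / 243 not.
or_prestige_1 = odds_ratio(33, 28, 93, 243)
# prestige_4 crosstab: 12 admitted / 55 not, versus 114 admitted / 216 not.
or_prestige_4 = odds_ratio(12, 55, 114, 216)

print(round(or_prestige_1, 3))  # 3.079
print(round(or_prestige_4, 3))  # 0.413
```

An odds ratio above 1 means the group has higher odds of admission than everyone else, and a value below 1 means lower odds.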

## Part 4. Analysis

In [120]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
print(data.head())

   admit    gre   gpa  prestige_2  prestige_3  prestige_4
0      0  380.0  3.61           0           1           0
1      1  660.0  3.67           0           1           0
2      1  800.0  4.00           0           0           0
3      1  640.0  3.19           0           0           1
4      0  520.0  2.93           0           0           1


We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

In [121]:
# manually add the intercept
data['intercept'] = 1.0

#### 4.1 Set the covariates to a variable called train_cols

In [122]:
train_cols = data.columns[1:]

In [123]:
print(train_cols)

Index(['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4', 'intercept'], dtype='object')


#### 4.2 Fit the model

In [124]:
logit = sm.Logit(data['admit'], data[train_cols])

#### 4.3 Print the summary results

In [76]:
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


In [125]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Thu, 24 Aug 2017   Pseudo R-squ.:                 0.08166
Time:                        19:10:18   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043    7.44e-05       0.004
gpa            0.7793      0.333      2.344      0.019       0.128       1.431
prestige_2    -0.6801      0.317     -2.146      0.0

#### 4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params
        
           conf.columns = ['2.5%', '97.5%', 'OR']

In [134]:
np.exp(result.params)

gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
intercept     0.020716
dtype: float64

In [153]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

                2.5%     97.5%        OR
gre         1.000074  1.004372  1.002221
gpa         1.136120  4.183113  2.180027
prestige_2  0.272168  0.942767  0.506548
prestige_3  0.133377  0.515419  0.262192
prestige_4  0.093329  0.479411  0.211525
intercept   0.002207  0.194440  0.020716
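
The interval for an odds ratio is the exponentiated interval for the coefficient. As a rough check, the sketch below rebuilds the gpa row from the rounded coefficient and standard error printed in the summary table, assuming a normal critical value of 1.96. Because the inputs are rounded to three or four digits, the result agrees with the table only to about two decimal places.

```python
import math

coef, std_err = 0.7793, 0.333  # gpa row of the summary table (rounded values)
z = 1.96                       # assumed normal critical value for a 95% interval

lower = math.exp(coef - z * std_err)
odds_ratio = math.exp(coef)
upper = math.exp(coef + z * std_err)

print(f"OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Exponentiating the endpoints, rather than computing an interval around the odds ratio directly, is why the interval is not symmetric around 2.18.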


#### 4.5 Interpret the OR of Prestige_2

Answer: Holding GRE and GPA constant, the odds of admission for students from prestige 2 schools are about half (OR = 0.51, 95% CI 0.27 to 0.94) the odds for students from prestige 1 schools.

#### 4.6 Interpret the OR of GPA

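Answer: The OR for gpa is about 2.18, meaning that each one-point increase in GPA multiplies the odds of admission by roughly 2.2, holding GRE and prestige constant. The 95% CI (1.14 to 4.18) excludes 1, though it is wide.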
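
A full GPA point is a large step, and a single GRE point is a tiny one, so it can help to rescale the odds ratios to more natural increments. Since the model is linear in the log-odds, the odds ratio for a change of k units is the one-unit odds ratio raised to the power k. The sketch below applies this to the values printed above; the increments of 0.1 GPA points and 100 GRE points are arbitrary choices for illustration.

```python
or_gpa = 2.180027  # odds ratio for a one-point GPA increase
or_gre = 1.002221  # odds ratio for a one-point GRE increase

or_gpa_tenth = or_gpa ** 0.1    # effect of a 0.1-point GPA increase
or_gre_hundred = or_gre ** 100  # effect of a 100-point GRE increase

print(round(or_gpa_tenth, 3))    # 1.081
print(round(or_gre_hundred, 3))  # 1.248
```

Under this model, a 0.1-point GPA increase raises the odds of admission by about 8%, and a 100-point GRE increase raises them by about 25%.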

## Part 5: Predicted probabilities


As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across different variables. First we're going to generate the combinations using a helper function called cartesian, which we import from sklearn.utils.extmath.

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified min and maximum value--in our case just the min/max observed values.
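
To make the grid construction concrete, here is a minimal sketch that builds the same kind of grid with `itertools.product` from the standard library in place of the scikit-learn helper. The min/max values are hard-coded from the notebook output (GRE 220 to 800, GPA 2.26 to 4.0) so that the example runs on its own.

```python
from itertools import product

import numpy as np

# Ten evenly spaced values between the observed minimum and maximum.
gres = np.linspace(220, 800, 10)
gpas = np.linspace(2.26, 4.0, 10)
prestiges = [1, 2, 3, 4]

# product() yields every (gre, gpa, prestige, intercept) combination,
# with the rightmost factor varying fastest.
grid = np.array(list(product(gres, gpas, prestiges, [1.0])))

print(grid.shape)  # (400, 4)
print(grid[:4])
```

The grid has 10 x 10 x 4 x 1 = 400 rows, one for each combination of GRE, GPA, and prestige.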

In [159]:
##use the cartesian helper from scikit-learn
from sklearn.utils.extmath import cartesian

In [166]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print(gres)
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print(gpas)
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])


# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]), columns = ["gre", "gpa", 'prestige','intercept'])

combos.head()

[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]


     gre       gpa  prestige  intercept
0  220.0  2.260000       1.0        1.0
1  220.0  2.260000       2.0        1.0
2  220.0  2.260000       3.0        1.0
3  220.0  2.260000       4.0        1.0
4  220.0  2.453333       1.0        1.0


#### 5.1 Recreate the dummy variables

In [174]:
# enumerate all possibilities

combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]), columns = ['gre', 'gpa', 'prestige', 'intercept'])

combos.head()

     gre       gpa  prestige  intercept
0  220.0  2.260000       1.0        1.0
1  220.0  2.260000       2.0        1.0
2  220.0  2.260000       3.0        1.0
3  220.0  2.260000       4.0        1.0
4  220.0  2.453333       1.0        1.0


In [177]:

# recreate the dummy variables
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
dummy_ranks.columns = ['prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']
# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_1':])

combos.head()

     gre       gpa  prestige  intercept  prestige_1  prestige_2  prestige_3  prestige_4
0  220.0  2.260000       1.0        1.0           1           0           0           0
1  220.0  2.260000       2.0        1.0           0           1           0           0
2  220.0  2.260000       3.0        1.0           0           0           1           0
3  220.0  2.260000       4.0        1.0           0           0           0           1
4  220.0  2.453333       1.0        1.0           1           0           0           0


#### 5.2 Make predictions on the enumerated dataset

In [180]:
# make predictions on the enumerated dataset
combos['admit_pred'] = result.predict(combos[train_cols])

combos.tail(4)

       gre  gpa  prestige  intercept  prestige_1  prestige_2  prestige_3  prestige_4  admit_pred
396  800.0  4.0       1.0        1.0           1           0           0           0    0.734040
397  800.0  4.0       2.0        1.0           0           1           0           0    0.582995
398  800.0  4.0       3.0        1.0           0           0           1           0    0.419833
399  800.0  4.0       4.0        1.0           0           0           0           1    0.368608


#### 5.3 Interpret findings for the last 4 observations

Answer: The final 4 observations all have the maximum observed GRE score (800) and GPA (4.0); the only variable that changes is the prestige of the school. A student with these scores who attended a prestige 1 school has a predicted 73% probability of being admitted. With the same GRE and GPA, the predicted probability falls to 58% for prestige 2, 42% for prestige 3, and 37% for prestige 4, which shows how strongly school prestige affects admission even for the strongest applicants.
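
These probabilities can be reproduced by hand from the odds ratios in section 4.4. The sketch below is illustrative only (the `predicted_probability` helper is not part of the original notebook). It multiplies the baseline odds by the odds ratio for each unit of GRE and GPA and for the prestige rank, then converts odds to a probability with p = odds / (1 + odds).

```python
# Odds ratios copied from the np.exp(result.params) output.
odds_ratios = {
    "gre": 1.002221,
    "gpa": 2.180027,
    "prestige_2": 0.506548,
    "prestige_3": 0.262192,
    "prestige_4": 0.211525,
    "intercept": 0.020716,  # baseline odds when every covariate is zero
}

def predicted_probability(gre, gpa, prestige):
    """Probability of admission implied by the odds ratios above."""
    odds = odds_ratios["intercept"] * odds_ratios["gre"] ** gre * odds_ratios["gpa"] ** gpa
    if prestige != 1:  # prestige 1 is the reference category
        odds *= odds_ratios[f"prestige_{prestige}"]
    return odds / (1 + odds)

for rank in (1, 2, 3, 4):
    print(rank, round(predicted_probability(800, 4.0, rank), 2))
# 1 0.73
# 2 0.58
# 3 0.42
# 4 0.37
```

The results match the admit_pred column up to rounding in the printed odds ratios.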

## Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.
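
One possible approach is sketched below. It is not the only way to read "stratified by GPA and GRE score": it plots predicted probability against GRE and against GPA, with one line per prestige rank. To stay runnable on its own, it recomputes the probabilities from the odds ratios printed in section 4.4 instead of calling `result.predict`, and it holds the other score at the midpoint of its grid, which is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Odds ratios copied from the np.exp(result.params) output; prestige 1 is the baseline.
BASELINE_ODDS, OR_GRE, OR_GPA = 0.020716, 1.002221, 2.180027
OR_PRESTIGE = {1: 1.0, 2: 0.506548, 3: 0.262192, 4: 0.211525}

def predicted_probability(gre, gpa, prestige):
    """Probability of admission; gre and gpa may be scalars or NumPy arrays."""
    odds = BASELINE_ODDS * OR_GRE ** gre * OR_GPA ** gpa * OR_PRESTIGE[prestige]
    return odds / (1 + odds)

gres = np.linspace(220, 800, 10)
gpas = np.linspace(2.26, 4.0, 10)
# Assumption: hold the other score at the midpoint of its grid.
fixed_gpa, fixed_gre = gpas.mean(), gres.mean()

fig, (ax_gre, ax_gpa) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for rank in OR_PRESTIGE:
    ax_gre.plot(gres, predicted_probability(gres, fixed_gpa, rank), label=f"prestige {rank}")
    ax_gpa.plot(gpas, predicted_probability(gpas, fixed_gre, rank), label=f"prestige {rank}")

ax_gre.set(xlabel="GRE", ylabel="Predicted P(admit)", title=f"GPA fixed at {fixed_gpa:.2f}")
ax_gpa.set(xlabel="GPA", title=f"GRE fixed at {fixed_gre:.0f}")
ax_gre.legend()
fig.tight_layout()
```

With the fitted model available, the same plot could instead be built by grouping the `combos` frame by prestige and one score and averaging `admit_pred`.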