# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [88]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model, cross_validation

In [44]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

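| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.00 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 620.0 | 4.00 | 2.0 |
| 396 | 0 | 560.0 | 3.04 | 3.0 |
| 397 | 0 | 460.0 | 2.63 | 2.0 |
| 398 | 0 | 700.0 | 3.65 | 2.0 |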


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [45]:
# TODO
pd.crosstab(index=df['prestige'], columns=df['admit'])

| prestige | admit = 0 | admit = 1 |
|---|---|---|
| 1.0 | 28 | 33 |
| 2.0 | 95 | 53 |
| 3.0 | 93 | 28 |
| 4.0 | 55 | 12 |


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [46]:
# TODO
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                              'prestige_2.0': 'prestige_2',
                              'prestige_3.0': 'prestige_3',
                              'prestige_4.0': 'prestige_4'}, inplace = True)

In [47]:
prestige_df

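| | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 395 | 0.0 | 1.0 | 0.0 | 0.0 |
| 396 | 0.0 | 0.0 | 1.0 | 0.0 |
| 397 | 0.0 | 1.0 | 0.0 | 0.0 |
| 398 | 0.0 | 1.0 | 0.0 | 0.0 |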


> ### Question 3.  How many of these binary variables do we need for modeling?

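Answer: We need three binary variables.  With four prestige levels, the fourth column is fully determined by the other three, so one level is left out to serve as the reference category (the model in Question 12 leaves out `prestige_1`).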

> ### Question 4.  Why are we doing this?

Answer: `prestige` is a categorical ranking rather than a true numeric quantity.  Treating it as a single number would force the effect of moving from level 1 to level 2 to equal the effect of moving from level 3 to level 4.  Splitting it into one binary feature per level lets the model estimate a separate weight for each level.
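
As a rough illustration of the reference-category idea, the sketch below one-hot encodes a small made-up `prestige` column and drops the first level.  The sample values and the use of `drop_first=True` are assumptions for illustration only; the notebook itself builds all four columns and leaves `prestige_1` out when fitting the model.

```python
import pandas as pd

# Hypothetical sample, not the real admissions data.
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 2.0], name='prestige')

# Casting to int avoids column names such as 'prestige_3.0'.
# drop_first=True omits the lowest level (1), which becomes the reference.
dummies = pd.get_dummies(prestige.astype(int), prefix='prestige', drop_first=True)

print(list(dummies.columns))

# A tier-1 row (index 2) has zeros in every remaining column.
print(int(dummies.iloc[2].sum()))
```

Casting to `int` before encoding also makes the later `rename` step unnecessary, since the generated column names no longer carry a `.0` suffix.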

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [48]:
# TODO
df = df.join([prestige_df])
df.drop('prestige', axis=1, inplace=True)
df.head()

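| | admit | gre | gpa | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 660.0 | 3.67 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 0.0 | 1.0 |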


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [49]:
# TODO
pd.crosstab(index=df['prestige_1'], columns=df['admit'])

| prestige_1 | admit = 0 | admit = 1 |
|---|---|---|
| 0.0 | 243 | 93 |
| 1.0 | 28 | 33 |


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [69]:
# TODO - p / (1 - p)
p_admitted_p1 = 33./(33+28)
odds_admitted_p1 = p_admitted_p1 / (1 - p_admitted_p1)
print "The odds of being admitted into grad school for applicants from the most prestigious undergrad schools: {}".format(odds_admitted_p1)

The odds of being admitted into grad school for applicants from the most prestigious undergrad schools: 1.17857142857
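
The relationship between a probability and its odds can be sketched as a pair of small helper functions.  The function names are hypothetical and chosen for illustration; the counts come from the `prestige_1` row of the frequency table.

```python
def odds_from_probability(p):
    """Convert a probability (0 <= p < 1) into odds."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


# Counts from the prestige_1 row of the frequency table: 33 admitted, 28 not.
p = 33.0 / (33 + 28)
odds = odds_from_probability(p)

print(odds)                         # same value as 33 / 28
print(probability_from_odds(odds))  # recovers p
```

Because the denominators cancel, the odds can also be read straight off the table as admitted divided by not admitted.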

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [73]:
# TODO
p_admitted_p0 = 93./(93+243)
odds_admitted_p0 = p_admitted_p0 / (1 - p_admitted_p0)
print "The odds of being admitted into grad school for applicants not from the most prestigious undergrad schools: {}".format(odds_admitted_p0)

The odds of being admitted into grad school for applicants not from the most prestigious undergrad schools: 0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [62]:
# TODO
odds_admitted_p1/odds_admitted_p0

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer:  The odds of admission to graduate school for applicants from the most prestigious undergraduate schools are about 3.08 times the odds for applicants from all other schools.
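
The hand calculation can be wrapped in a small function that takes the four cells of a 2x2 frequency table.  This is a minimal sketch; the function name and argument names are made up for illustration, and the tier-4 counts are derived by summing rows of the first frequency table.

```python
def odds_ratio(admitted_a, rejected_a, admitted_b, rejected_b):
    """Odds of admission in group A divided by odds of admission in group B."""
    odds_a = float(admitted_a) / rejected_a
    odds_b = float(admitted_b) / rejected_b
    return odds_a / odds_b


# Counts taken from (or summed from) the frequency tables.
or_tier1 = odds_ratio(33, 28, 93, 243)    # prestige 1 vs. everyone else
or_tier4 = odds_ratio(12, 55, 114, 216)   # prestige 4 vs. everyone else

print(round(or_tier1, 3))
print(round(or_tier4, 3))
```

An odds ratio above 1 means group A has higher odds of admission than group B, and a value below 1 means lower odds.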

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [82]:
# TODO
pd.crosstab(index=df['prestige_4'], columns=df['admit'])
p_admit_p4 = 12./(12+55)
odds_admit_p4 = p_admit_p4 / (1-p_admit_p4)
print "odds admitted if from prestige level 4 undergraduate school: {}".format(odds_admit_p4)

p_admit_not_p4 = 114./(114 + 216)
odds_admit_not_p4 = p_admit_not_p4 / (1-p_admit_not_p4)
print "odds admitted if not from prestigious level 4 undergraduate school: {}".format(odds_admit_not_p4)

odds_ratio = odds_admit_p4 / odds_admit_not_p4
print "odds_ratio: {}".format(odds_ratio)


odds admitted if from prestige level 4 undergraduate school: 0.218181818182
odds admitted if not from prestigious level 4 undergraduate school: 0.527777777778
odds_ratio: 0.413397129187


Answer: The odds of admission to graduate school for applicants from the least prestigious undergraduate schools are about 0.41 times the odds for applicants from more prestigious schools, i.e. roughly 59% lower.

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [83]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3',
       u'prestige_4'],
      dtype='object')

In [105]:
# TODO
X = df.drop('admit', axis=1, inplace=False)[['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']]
c = df.admit
train_X, test_X, train_c, test_c = cross_validation.train_test_split(X, c, test_size=0.4, random_state=0)
model = smf.Logit(train_c, train_X)
model_score = model.fit()

Optimization terminated successfully.
         Current function value: 0.596703
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [106]:
# TODO
model_score.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | admit | No. Observations: | 238.0 |
| Model: | Logit | Df Residuals: | 233.0 |
| Method: | MLE | Df Model: | 4.0 |
| Date: | Tue, 25 Oct 2016 | Pseudo R-squ.: | 0.02656 |
| Time: | 03:26:45 | Log-Likelihood: | -142.02 |
| converged: | True | LL-Null: | -145.89 |
| | | LLR p-value: | 0.1012 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| gre | 0.0011 | 0.001 | 0.834 | 0.404 | -0.001 0.004 |
| gpa | -0.2240 | 0.244 | -0.917 | 0.359 | -0.702 0.255 |
| prestige_2 | -0.5271 | 0.373 | -1.414 | 0.157 | -1.258 0.203 |
| prestige_3 | -1.0937 | 0.428 | -2.555 | 0.011 | -1.933 -0.255 |
| prestige_4 | -1.3659 | 0.515 | -2.650 | 0.008 | -2.376 -0.356 |


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [107]:
# TODO -- source: http://blog.yhat.com/posts/logistic-regression-python-rodeo.html
params = model_score.params
conf = model_score.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)

                2.5%     97.5%        OR
gre         0.998531  1.003656  1.001090
gpa         0.495357  1.289905  0.799352
prestige_2  0.284343  1.225432  0.590291
prestige_3  0.144728  0.775252  0.334963
prestige_4  0.092907  0.700714  0.255150
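
The table above is simply the exponential of each coefficient and of its confidence bounds.  As a sanity check, the sketch below repeats the calculation with the standard library, using the rounded values printed in the model summary; because of that rounding, the results agree with the table only to about three decimal places.

```python
import math

# Rounded values copied from the model summary: (coef, lower bound, upper bound).
summary = {
    'gre':        (0.0011, -0.001, 0.004),
    'gpa':        (-0.2240, -0.702, 0.255),
    'prestige_2': (-0.5271, -1.258, 0.203),
    'prestige_3': (-1.0937, -1.933, -0.255),
    'prestige_4': (-1.3659, -2.376, -0.356),
}

# An odds ratio is exp(coefficient) on the log-odds scale.
odds_ratios = {name: math.exp(coef) for name, (coef, _, _) in summary.items()}

for name, (coef, lower, upper) in sorted(summary.items()):
    print("{:<12}{:.3f}  ({:.3f}, {:.3f})".format(
        name, math.exp(coef), math.exp(lower), math.exp(upper)))
```

A confidence interval that contains 1 on the odds-ratio scale corresponds to a coefficient interval that contains 0.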


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

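Answer: Holding `gre` and `gpa` constant, the odds of admission for an applicant from a tier-2 school are about 0.59 times the odds for an applicant from a tier-1 school (the reference level).  The 95% confidence interval (0.28 to 1.23) includes 1, so this difference is not statistically significant at the 5% level.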

> ### Question 16.  Interpret the odds ratio of `gpa`.

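Answer: Holding the other features constant, each one-point increase in `gpa` multiplies the odds of admission by about 0.80.  The 95% confidence interval (0.50 to 1.29) includes 1, so the effect is not statistically significant.  Note that the model in Question 12 was fit without an intercept column, which may be distorting this estimate.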

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [111]:
# TODO
p1 = model_score.predict([800, 4, 0, 0, 0])
print "The probability of admissions if from a tier-1: {}".format(p1[0])
p2 = model_score.predict([800, 4, 1, 0, 0])
print "The probability of admissions if from a tier-2: {}".format(p2[0])
p3 = model_score.predict([800, 4, 0, 1, 0])
print "The probability of admissions if from a tier-3: {}".format(p3[0])
p4 = model_score.predict([800, 4, 0, 0, 1])
print "The probability of admissions if from a tier-4: {}".format(p4[0])


The probability of admissions if from a tier-1: 0.494015474217
The probability of admissions if from a tier-2: 0.365614030597
The probability of admissions if from a tier-3: 0.246443015052
The probability of admissions if from a tier-4: 0.199432816699
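
To see where these numbers come from, the prediction can be reproduced by hand: sum the coefficients times the feature values, then pass the result through the logistic function.  The sketch below uses the rounded coefficients from the model summary, so it matches the output above only to about two decimal places; the function name is hypothetical.

```python
import math

# Rounded coefficients copied from the model summary (the model has no intercept).
COEF = {'gre': 0.0011, 'gpa': -0.2240,
        'prestige_2': -0.5271, 'prestige_3': -1.0937, 'prestige_4': -1.3659}


def admission_probability(gre, gpa, tier):
    """Logistic prediction for one applicant; tier 1 is the reference level."""
    z = COEF['gre'] * gre + COEF['gpa'] * gpa
    if tier != 1:
        z += COEF['prestige_{}'.format(tier)]
    return 1.0 / (1.0 + math.exp(-z))


probabilities = {tier: admission_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier in sorted(probabilities):
    print("tier-{}: {:.3f}".format(tier, probabilities[tier]))
```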


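Answer: The predicted probability of admission is about 49% for tier-1, 37% for tier-2, 25% for tier-3, and 20% for tier-4.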

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO
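
One possible starting point is sketched below on synthetic data, not the admissions dataset, so the printed numbers mean nothing in themselves.  The random features, the labels, and the `max_iter` setting are all assumptions for illustration; only `LogisticRegression(C = 10 ** 2)` comes from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions features (NOT the real data).
rng = np.random.RandomState(0)
n = 200
tier = rng.randint(1, 5, size=n)
X = np.column_stack([
    rng.randint(300, 801, size=n),       # gre
    rng.uniform(2.0, 4.0, size=n),       # gpa
    tier == 2, tier == 3, tier == 4,     # prestige dummies, tier 1 as reference
]).astype(float)
y = rng.randint(0, 2, size=n)            # random 0/1 labels

# C = 10 ** 2 means weak regularization; max_iter is raised as a precaution.
model = LogisticRegression(C=10 ** 2, max_iter=1000)
model.fit(X, y)

# Odds ratios are the exponentiated coefficients, as with statsmodels.
odds_ratios = np.exp(model.coef_[0])

# Column 1 of predict_proba holds the probability of the positive class.
probs = model.predict_proba([[800, 4, 0, 0, 0]])[:, 1]

print(odds_ratios)
print(probs)
```

Note that `LogisticRegression` fits an intercept by default, whereas the `statsmodels` call in Question 12 does not add one, so the two sets of coefficients are not expected to match exactly.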

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: