# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [39]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

     admit    gre   gpa  prestige
0        0  380.0  3.61       3.0
1        1  660.0  3.67       3.0
2        1  800.0  4.00       1.0
3        1  640.0  3.19       4.0
4        0  520.0  2.93       4.0
..     ...    ...   ...       ...
395      0  620.0  4.00       2.0
396      0  560.0  3.04       3.0
397      0  460.0  2.63       2.0
398      0  700.0  3.65       2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [40]:
# TODO
df['prestige'].value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64

In [41]:
df['admit'].value_counts()

0    271
1    126
Name: admit, dtype: int64

In [42]:
df.groupby(['prestige', 'admit']).size()

prestige  admit
1.0       0        28
          1        33
2.0       0        95
          1        53
3.0       0        93
          1        28
4.0       0        55
          1        12
dtype: int64

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [43]:
# TODO
df = df.join(pd.get_dummies(df['prestige']))

In [44]:
df.head()


   admit    gre   gpa  prestige  1.0  2.0  3.0  4.0
0      0  380.0  3.61       3.0  0.0  0.0  1.0  0.0
1      1  660.0  3.67       3.0  0.0  0.0  1.0  0.0
2      1  800.0  4.00       1.0  1.0  0.0  0.0  0.0
3      1  640.0  3.19       4.0  0.0  0.0  0.0  1.0
4      0  520.0  2.93       4.0  0.0  0.0  0.0  1.0


> ### Question 3.  How many of these binary variables do we need for modeling?

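Answer: Three. With four prestige levels, one dummy is redundant: whenever the other three are 0, the applicant must belong to the remaining level. The omitted level becomes the reference category (here `prestige = 1`). Including all four alongside an intercept would make the columns perfectly collinear.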
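
The four indicators are not independent of each other, and a minimal plain-Python sketch can show this. The `one_hot` helper and the sample values below are illustrative assumptions, not part of the assignment: once three of the four indicators are known, the fourth is fully determined.

```python
# Illustrative prestige values; `one_hot` is a hypothetical helper, not a pandas function.
LEVELS = [1.0, 2.0, 3.0, 4.0]

def one_hot(value, levels=LEVELS):
    """Return one indicator per level: 1 where the level matches, else 0."""
    return [1 if value == level else 0 for level in levels]

sample = [3.0, 3.0, 1.0, 4.0, 4.0, 2.0]
encoded = [one_hot(v) for v in sample]

# Every row sums to 1, so the four dummies add up to an intercept column of ones.
row_sums = [sum(row) for row in encoded]

# Drop the first level (the reference) and recover it from the other three.
reduced = [row[1:] for row in encoded]
recovered_first = [1 - sum(row) for row in reduced]

print(row_sums)          # [1, 1, 1, 1, 1, 1]
print(recovered_first)   # [0, 0, 1, 0, 0, 0]
```

In pandas, `pd.get_dummies(..., drop_first=True)` drops the first level in the same way.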

> ### Question 4.  Why are we doing this?

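Answer: `prestige` is a categorical ranking, not a quantity. One-hot encoding lets the model estimate a separate effect for each tier instead of assuming that every one-step change in rank shifts the log-odds of admission by the same amount.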

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [45]:
# TODO
del df['prestige']
df.head()

   admit    gre   gpa  1.0  2.0  3.0  4.0
0      0  380.0  3.61  0.0  0.0  1.0  0.0
1      1  660.0  3.67  0.0  0.0  1.0  0.0
2      1  800.0  4.00  1.0  0.0  0.0  0.0
3      1  640.0  3.19  0.0  0.0  0.0  1.0
4      0  520.0  2.93  0.0  0.0  0.0  1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [46]:
# TODO
df.groupby([1.0, 'admit']).size()

1.0  admit
0.0  0        243
     1         93
1.0  0         28
     1         33
dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [51]:
# TODO
# odds = admitted / not admitted (dividing by the total would give the probability instead)
print(33.0 / 28.0)

1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [52]:
# TODO
# odds = admitted / not admitted among applicants from schools that are not tier 1
print(93.0 / 243.0)

0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [67]:
# TODO
# odds ratio = (tier-1 admitted * other not admitted) / (tier-1 not admitted * other admitted)
print((33.0 * 243.0) / (28.0 * 93.0))

3.07949308756
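
Odds and probabilities are easy to mix up, so here is a short sketch of how they relate, using the counts from the `prestige = 1` frequency table above. The helper names are illustrative assumptions.

```python
def probability(admitted, rejected):
    """Share of applicants admitted: admitted / total."""
    return admitted / float(admitted + rejected)

def odds(admitted, rejected):
    """Odds of admission: admitted / not admitted."""
    return admitted / float(rejected)

# Counts from the frequency table for prestige = 1 versus all other tiers.
tier1_admit, tier1_reject = 33, 28
other_admit, other_reject = 93, 243

p_tier1 = probability(tier1_admit, tier1_reject)
odds_tier1 = odds(tier1_admit, tier1_reject)
odds_other = odds(other_admit, other_reject)
odds_ratio = odds_tier1 / odds_other

# Odds and probability carry the same information: odds = p / (1 - p).
assert abs(odds_tier1 - p_tier1 / (1 - p_tier1)) < 1e-12

print(round(p_tier1, 3))     # 0.541
print(round(odds_tier1, 3))  # 1.179
print(round(odds_other, 3))  # 0.383
print(round(odds_ratio, 3))  # 3.079
```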


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from the most prestigious undergraduate schools are about 3.08 times the odds for applicants from all other schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [64]:
# TODO
# Least prestigious = prestige 4. From the frequency table in Part A:
# prestige 4: 12 admitted, 55 not admitted; all other tiers: 114 admitted, 216 not admitted

# odds of admission for applicants from the least prestigious schools
print(12.0 / 55.0)

# odds ratio: odds for prestige 4 divided by the odds for all other applicants
print((12.0 / 55.0) / (114.0 / 216.0))

0.218181818182
0.413397129187


Answer: The odds of admission for applicants from the least prestigious schools are about 0.22, which is about 0.41 times the odds for applicants from all other schools.
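
The same odds-ratio calculation can be repeated for any tier against all remaining applicants. The sketch below copies the counts from the Part A frequency table; the `odds_ratio_vs_rest` helper is an illustrative assumption.

```python
# (not admitted, admitted) counts per prestige tier, copied from the Part A table.
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

def odds_ratio_vs_rest(tier, table):
    """Odds of admission in `tier` divided by the odds in all other tiers combined."""
    rejected, admitted = table[tier]
    rest_rejected = sum(r for t, (r, a) in table.items() if t != tier)
    rest_admitted = sum(a for t, (r, a) in table.items() if t != tier)
    return (admitted / float(rejected)) / (rest_admitted / float(rest_rejected))

# One odds ratio per tier, each compared against everyone outside that tier.
ratios = {tier: odds_ratio_vs_rest(tier, counts) for tier in sorted(counts)}
for tier in sorted(ratios):
    print('prestige %d vs rest: %.2f' % (tier, ratios[tier]))
```

An odds ratio above 1 means that tier's applicants have better odds than the rest of the pool; below 1 means worse odds.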

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [59]:
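# TODO
# All four prestige dummies are included and no intercept is added, so each dummy's
# coefficient is that tier's own baseline log-odds. Comparisons against the tier-1
# reference are differences between a tier's coefficient and the coefficient on 1.0.
feature_cols = ['gre', 'gpa', 1.0, 2.0, 3.0, 4.0]
X = df[feature_cols]
y = df['admit']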

In [62]:
logreg = smf.Logit(y, X)
results = logreg.fit()



Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [61]:
# TODO
print(results.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Tue, 17 Jan 2017   Pseudo R-squ.:                 0.08166
Time:                        15:43:12   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.019         0.128     1.431
1.0           -3.8769      1.142     -3.393      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [65]:
# TODO features odds ratios
np.exp(results.params)


gre    1.002221
gpa    2.180027
1.0    0.020716
2.0    0.010494
3.0    0.005432
4.0    0.004382
dtype: float64

In [66]:
# CI odds ratios
params = results.params
conf = results.conf_int()
conf['OddsR'] = params
conf.columns = ['2.5%', '97.5%', 'OddsR']
print(np.exp(conf))

         2.5%     97.5%     OddsR
gre  1.000074  1.004372  1.002221
gpa  1.136120  4.183113  2.180027
1.0  0.002207  0.194440  0.020716
2.0  0.001183  0.093045  0.010494
3.0  0.000569  0.051880  0.005432
4.0  0.000469  0.040919  0.004382


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

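Answer: Because the model was fit with all four prestige dummies and no intercept, the 0.0105 reported for `2.0` is the baseline odds for a tier-2 applicant (the odds at `gre = 0` and `gpa = 0`), not an odds ratio against tier 1. Dividing by the tier-1 value gives the odds ratio: 0.010494 / 0.020716 ≈ 0.51. Holding GRE and GPA constant, an applicant from a tier-2 school has about half the odds of admission of an applicant from a tier-1 school.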
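
To compare every tier against the tier-1 reference, the exponentiated coefficients reported in Question 14 can be rescaled by hand. This is a sketch under the assumption that the model has no intercept and contains all four dummies, as fitted above; the variable names are illustrative.

```python
import math

# Exponentiated coefficients as reported by np.exp(results.params) above.
tier_odds = {1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

# With no intercept and all four dummies, each value is a tier's baseline odds.
# Dividing by the tier-1 value gives the odds ratio against the tier-1 reference.
reference = tier_odds[1]
odds_ratio_vs_tier1 = {tier: value / reference for tier, value in tier_odds.items()}

# Equivalent reference-coded coefficients: intercept = log(tier-1 odds),
# tier effect = difference in log-odds from tier 1.
intercept = math.log(reference)
tier_effects = {tier: math.log(ratio) for tier, ratio in odds_ratio_vs_tier1.items()}

for tier in sorted(odds_ratio_vs_tier1):
    print('tier %d: odds ratio %.2f, log-odds difference %.2f'
          % (tier, odds_ratio_vs_tier1[tier], tier_effects[tier]))
```

The ratios fall as the tier number rises, so each step down in prestige lowers the odds of admission relative to tier 1.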

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: `gpa` has an odds ratio of 2.18 (95% confidence interval 1.14 to 4.18): holding GRE and prestige constant, each one-point increase in GPA multiplies the odds of admission by about 2.18.
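
An odds ratio describes a one-unit change, which is a large step for GPA and a tiny one for GRE. A rough sketch of how the reported odds ratios scale to other step sizes (the `odds_multiplier` helper is an illustrative assumption):

```python
# Odds ratios per one-unit change, from np.exp(results.params) above.
gpa_odds_ratio = 2.180027
gre_odds_ratio = 1.002221

def odds_multiplier(odds_ratio, delta):
    """Multiplicative change in odds for a change of `delta` units in the feature."""
    return odds_ratio ** delta

print(round(odds_multiplier(gpa_odds_ratio, 1.0), 2))    # 2.18
print(round(odds_multiplier(gpa_odds_ratio, 0.5), 2))    # 1.48
print(round(odds_multiplier(gpa_odds_ratio, -1.0), 2))   # 0.46
print(round(odds_multiplier(gre_odds_ratio, 100.0), 2))  # 1.25
```

So a 100-point GRE increase raises the odds by roughly a quarter, even though the per-point odds ratio looks negligible.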

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [92]:
# TODO
# Exploratory check: raw admit counts at each GRE score
df.groupby(['gre', 'admit']).size()
# 11 of the 25 applicants with a GRE of 800 were admitted


gre    admit
220.0  0         1
300.0  0         2
       1         1
340.0  0         3
       1         1
                ..
760.0  1         4
780.0  0         1
       1         4
800.0  0        14
       1        11
dtype: int64

In [93]:
# Exploratory check: raw admit counts at each GPA
df.groupby(['gpa', 'admit']).size()
# 13 of the 28 applicants with a 4.0 GPA were admitted

gpa   admit
2.26  0         1
2.42  0         1
      1         1
2.48  0         1
2.52  0         1
               ..
3.98  1         1
3.99  0         2
      1         1
4.00  0        15
      1        13
dtype: int64

In [100]:
# Predicted probability from the fitted model: p = 1 / (1 + exp(-x.b)).
# Each row is [gre, gpa, 1.0, 2.0, 3.0, 4.0] for a GRE of 800 and a GPA of 4 in one tier.
profiles = np.hstack([np.tile([800.0, 4.0], (4, 1)), np.eye(4)])
for tier, p in zip([1, 2, 3, 4], results.predict(profiles)):
    print('tier %d: %.2f' % (tier, p))


tier 1: 0.73
tier 2: 0.58
tier 3: 0.42
tier 4: 0.37


Answer:
According to the fitted model, a student with 800 on the GRE and a 4.0 GPA has a probability of admission of about 0.73 from a tier-1 school, 0.58 from a tier-2 school, 0.42 from a tier-3 school, and 0.37 from a tier-4 school.
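
A sketch of the same calculation by hand, using only the exponentiated coefficients reported in Question 14. A logistic model is linear on the log-odds scale, so the log-odds are summed first and then mapped to a probability. The `admission_probability` helper is an illustrative assumption, and the inputs are the rounded values printed above, so results are approximate.

```python
import math

# Exponentiated coefficients reported by np.exp(results.params) above.
odds_per_unit = {'gre': 1.002221, 'gpa': 2.180027}
tier_odds = {1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

def admission_probability(gre, gpa, tier):
    """Sum the log-odds contributions, then apply the logistic function."""
    log_odds = (gre * math.log(odds_per_unit['gre'])
                + gpa * math.log(odds_per_unit['gpa'])
                + math.log(tier_odds[tier]))
    return 1.0 / (1.0 + math.exp(-log_odds))

probabilities = {tier: admission_probability(800, 4.0, tier) for tier in sorted(tier_odds)}
for tier in sorted(probabilities):
    print('tier %d: %.2f' % (tier, probabilities[tier]))
```

Multiplying a probability by an odds ratio does not work here, because the coefficients act on the log-odds, not on the probability itself.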


## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [68]:
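# TODO
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.preprocessing import StandardScaler

# Hold out a test set (25% by default), then standardize using training-set statistics only.
# Note: the model below is therefore fit on standardized features from the training split,
# so its coefficients are not on the same scale as the statsmodels fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=2)  # defined but not used below
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)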

In [77]:
from sklearn.linear_model import LogisticRegression
logreg2 = LogisticRegression(C=10 ** 2)
logreg2.fit(X_train_std, y_train)
list(zip(feature_cols, logreg2.coef_[0]))

[('gre', 0.37560643110460101),
 ('gpa', 0.31820912351339253),
 (1.0, 0.314293862251669),
 (2.0, 0.20505405224976539),
 (3.0, -0.21727446585345289),
 (4.0, -0.28169406531265123)]

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [80]:
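# TODO
# sklearn has no summary table; exponentiate the fitted coefficients to get odds ratios.
odds_ratios = np.exp(logreg2.coef_[0])

Answer: Exponentiating the coefficients from Question 18 gives odds ratios of about 1.46 (`gre`), 1.37 (`gpa`), 1.37 (tier 1), 1.23 (tier 2), 0.80 (tier 3), and 0.75 (tier 4). They are not directly comparable to the `statsmodels` odds ratios: this model was fit on standardized features from a training split, so each value is the multiplicative change in odds per one standard deviation increase in that feature. `LogisticRegression` also adds an intercept and L2 regularization by default.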

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: