# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [102]:
import os

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf


from sklearn import linear_model

In [103]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [104]:
df.groupby(['prestige'])[['admit']].count()

prestige,admit
1.0,61
2.0,148
3.0,121
4.0,67


In [105]:
pd.crosstab(df['admit'],df['prestige'],rownames=['admit'])

admit,prestige 1.0,prestige 2.0,prestige 3.0,prestige 4.0
0,28,95,93,55
1,33,53,28,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [106]:
prestige_binary_trimmed = pd.get_dummies(df.prestige, drop_first = True)
prestige_binary_trimmed

,2.0,3.0,4.0
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
...,...,...,...
395,1.0,0.0,0.0
396,0.0,1.0,0.0
397,1.0,0.0,0.0
398,1.0,0.0,0.0


In [137]:
# Leaving 1.0 prestige in to do part C of the hw, will remove later.
prestige_binary = pd.get_dummies(df.prestige)
prestige_binary

,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We need only three of the four indicators. The fourth is redundant: if all three retained indicators are 0, the applicant must belong to the omitted tier, which then serves as the reference category.
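One way to see the redundancy is to check the rank of the design matrix. The sketch below uses a handful of made-up prestige values (an assumption for illustration, not the admissions data) and compares an intercept plus all four indicators against an intercept plus three.

```python
import numpy as np

# Illustrative prestige values only; every tier appears at least once.
prestige = np.array([3, 3, 1, 4, 4, 2, 3, 2, 2, 1])

# One indicator column per tier (columns correspond to tiers 1, 2, 3, 4).
dummies = (prestige[:, None] == np.array([1, 2, 3, 4])).astype(float)
intercept = np.ones((len(prestige), 1))

full = np.hstack([intercept, dummies])            # intercept + 4 indicators
trimmed = np.hstack([intercept, dummies[:, 1:]])  # intercept + 3 indicators

# The four indicators always sum to the intercept column, so `full`
# has 5 columns but only rank 4; `trimmed` has full column rank.
print(np.linalg.matrix_rank(full), "of", full.shape[1], "columns")
print(np.linalg.matrix_rank(trimmed), "of", trimmed.shape[1], "columns")
```

A rank-deficient design matrix means the coefficients cannot be estimated uniquely, which is why one indicator is dropped whenever the model also has an intercept.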

> ### Question 4.  Why are we doing this?

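Answer: `prestige` is a categorical ranking, not a measured quantity, so entering it as a single number would force the model to treat every one-step change in tier as having the same effect. One-hot encoding gives each tier its own coefficient. We keep only three of the four indicators because, together with the intercept, the fourth is a linear combination of the others (perfect multicollinearity), and the coefficients could not be estimated uniquely.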

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [108]:
df1 = df.join(prestige_binary)

In [109]:
# drop() returns a copy; df1 itself keeps `prestige` until Question 12.
df1.drop('prestige', axis=1)

,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


In [110]:
df1.count()

admit       397
gre         397
gpa         397
prestige    397
1.0         397
2.0         397
3.0         397
4.0         397
dtype: int64

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [111]:
best_table = pd.crosstab(df1['admit'], df1[1.0], rownames=['admit'])
best_table

admit,prestige 1 = 0.0,prestige 1 = 1.0
0,243,28
1,93,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [112]:
odds_prest_1 = (33.0/61.0) / (1 - 33.0/61.0)
odds_prest_1

1.1785714285714288

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [113]:
odds_prest_not1 = (93.0/336.0) / (1 - 93.0/336.0)
odds_prest_not1

0.3827160493827161

> ### Question 9.  Finally, what's the odds ratio?

In [114]:
odds_ratio = odds_prest_1/odds_prest_not1
odds_ratio

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Applicants who attended a prestige-1 undergraduate school have about 3.1 times the odds of admission of applicants who did not.
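The hand calculation above can be condensed into two small helpers. This is a minimal sketch: the function names `odds` and `odds_ratio` are hypothetical, and the counts are copied from the frequency table in Question 6. It relies on the identity p / (1 - p) = admitted / rejected, so the probabilities never need to be computed explicitly.

```python
def odds(admitted, rejected):
    """Odds of admission: p / (1 - p), which simplifies to admitted / rejected."""
    return admitted / rejected


def odds_ratio(group, reference):
    """Ratio of the odds in `group` to the odds in `reference`.

    Each argument is an (admitted, rejected) pair of counts.
    """
    return odds(*group) / odds(*reference)


# Counts from the Question 6 frequency table.
tier1 = (33, 28)       # prestige 1: 33 admitted, 28 rejected
not_tier1 = (93, 243)  # all other tiers: 93 admitted, 243 rejected

print(round(odds(*tier1), 4))                  # odds for prestige-1 applicants
print(round(odds(*not_tier1), 4))              # odds for everyone else
print(round(odds_ratio(tier1, not_tier1), 4))  # odds ratio
```

Swapping the two arguments of `odds_ratio` gives the reciprocal, so it is worth stating which group is the reference whenever an odds ratio is reported.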

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [115]:
least_table = pd.crosstab(df1['admit'], df1[4.0], rownames=['admit'])
least_table

admit,prestige 4 = 0.0,prestige 4 = 1.0
0,216,55
1,114,12


In [116]:
prob4_admit = 12.0 / (55+12) 
prob4_admit

0.1791044776119403

In [117]:
odds4_admit = prob4_admit/ (1 - prob4_admit)
odds4_admit

0.21818181818181817

In [118]:
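# Applicants NOT from tier-4 schools: 114 admitted out of 216 + 114.
prob_all_admit = 114.0 / (216 + 114)
round(prob_all_admit, 4)

0.3455

In [119]:
odds_all_admit = prob_all_admit / (1 - prob_all_admit)
round(odds_all_admit, 4)

0.5278

In [120]:
odds4_ratio = odds_all_admit / odds4_admit
round(odds4_ratio, 4)

2.419

Answer: Applicants from tier-4 schools have odds of admission of about 0.22, compared with about 0.53 for everyone else. Applicants who did not attend a tier-4 school therefore have about 2.4 times the odds of admission; equivalently, tier-4 applicants have about 0.41 times the odds of the rest.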

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [121]:
df2 = df1.drop([1.0, 'prestige'], axis=1)
df2['intercept'] = 1.0
df2.head()

,admit,gre,gpa,2.0,3.0,4.0,intercept
0,0,380.0,3.61,0.0,1.0,0.0,1.0
1,1,660.0,3.67,0.0,1.0,0.0,1.0
2,1,800.0,4.0,0.0,0.0,0.0,1.0
3,1,640.0,3.19,0.0,0.0,1.0,1.0
4,0,520.0,2.93,0.0,0.0,1.0,1.0


In [122]:
X = df2[df2.columns[1:]]
y = df2['admit']

logit_reg = smf.Logit(y, X)
output = logit_reg.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


In [123]:
admission_pred = output.predict(X)

In [124]:
df3 = df2.copy()
df3['admission_pred'] = admission_pred
df3.head()

,admit,gre,gpa,2.0,3.0,4.0,intercept,admission_pred
0,0,380.0,3.61,0.0,1.0,0.0,1.0,0.173771
1,1,660.0,3.67,0.0,1.0,0.0,1.0,0.290859
2,1,800.0,4.0,0.0,0.0,0.0,1.0,0.73404
3,1,640.0,3.19,0.0,0.0,1.0,1.0,0.178814
4,0,520.0,2.93,0.0,0.0,1.0,1.0,0.119915


> ### Question 13.  Print the model's summary results.

In [125]:
output.summary()

Dep. Variable:,admit,No. Observations:,397
Model:,Logit,Df Residuals:,391
Method:,MLE,Df Model:,5
Date:,"Wed, 18 Jan 2017",Pseudo R-squ.:,0.08166
Time:,11:52:06,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,95% CI lower,95% CI upper
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
2.0,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
3.0,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
4.0,-1.5534,0.417,-3.721,0.000,-2.372,-0.735
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [126]:
np.exp(output.params)

gre          1.002221
gpa          2.180027
2.0          0.506548
3.0          0.262192
4.0          0.211525
intercept    0.020716
dtype: float64
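Question 14 also asks for 95% confidence intervals, which the cell above does not show. With the fitted result, `np.exp(output.conf_int())` should give them directly. The sketch below instead rebuilds them by hand from the rounded `coef` and `std err` columns of the summary table, assuming the usual normal approximation (z = 1.96), so the last digits may differ slightly from what statsmodels reports. The helper name `odds_ratio_ci` is hypothetical.

```python
import math

# (coef, std err) pairs copied from the summary table in Question 13.
summary = {
    "gre": (0.0022, 0.001),
    "gpa": (0.7793, 0.333),
    "2.0": (-0.6801, 0.317),
    "3.0": (-1.3387, 0.345),
    "4.0": (-1.5534, 0.417),
}
Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def odds_ratio_ci(coef, se, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) on the odds scale."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


or_table = {name: odds_ratio_ci(coef, se) for name, (coef, se) in summary.items()}
for name, (ratio, low, high) in or_table.items():
    print(f"{name:>4}: OR = {ratio:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

An interval that excludes 1 corresponds to a coefficient whose confidence interval excludes 0 on the log-odds scale.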

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

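Answer: Holding GRE and GPA constant, applicants from prestige-2 schools have about 0.51 times the odds of admission of applicants from prestige-1 schools (the reference group), i.e. roughly 49% lower odds.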

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige constant, each one-point increase in GPA multiplies the odds of admission by about 2.18, an increase of roughly 118%.

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [143]:
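# One hypothetical applicant per tier (GRE 800, GPA 4.0); all three dummies at 0 means tier 1.
rows = [[800.0, 4.0, t == 2, t == 3, t == 4, 1.0] for t in (1, 2, 3, 4)]
tiers = pd.DataFrame(rows, columns=X.columns, index=[1, 2, 3, 4]).astype(float)
tier_probs = output.predict(tiers)

Answer: Based on the fitted coefficients, the predicted probabilities of admission are roughly 0.73 (tier 1), 0.58 (tier 2), 0.42 (tier 3), and 0.37 (tier 4). The tier-1 value agrees with the model's prediction for row 2 of the data (GRE 800, GPA 4.0, prestige 1), which is 0.734; the other tiers follow from multiplying the tier-1 odds by the odds ratios in Question 14.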
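The probability for a hypothetical applicant can also be worked out by hand from the coefficients in the Question 13 summary table: add up the log-odds, then apply the logistic function. This is a minimal sketch using the rounded coefficients, so results may differ from the model's own predictions in the third decimal place. The function name `admit_probability` is hypothetical.

```python
import math

# Coefficients copied (rounded) from the statsmodels summary in Question 13.
coef = {
    "intercept": -3.8769,
    "gre": 0.0022,
    "gpa": 0.7793,
    2: -0.6801,  # prestige 2 vs. prestige 1
    3: -1.3387,  # prestige 3 vs. prestige 1
    4: -1.5534,  # prestige 4 vs. prestige 1
}


def admit_probability(gre, gpa, tier):
    """Predicted admission probability; tier 1 is the reference group."""
    log_odds = coef["intercept"] + coef["gre"] * gre + coef["gpa"] * gpa
    if tier != 1:
        log_odds += coef[tier]
    # Logistic function: converts log-odds to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


probs = {tier: admit_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)}
for tier, p in probs.items():
    print(f"tier {tier}: {p:.2f}")
```

With these rounded coefficients the four values come out at approximately 0.73, 0.58, 0.42, and 0.37, falling steadily as the tier number rises.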

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [133]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=10**2)
X = df2[df2.columns[1:]]
y = df2['admit']
logreg.fit(X, y)
df2['sklearn_admit'] = logreg.predict_proba(X)[:, 1]
df2

,admit,gre,gpa,2.0,3.0,4.0,intercept,sklearn_admit
0,0,380.0,3.61,0.0,1.0,0.0,1.0,0.182358
1,1,660.0,3.67,0.0,1.0,0.0,1.0,0.290498
2,1,800.0,4.00,0.0,0.0,0.0,1.0,0.733622
3,1,640.0,3.19,0.0,0.0,1.0,1.0,0.180641
4,0,520.0,2.93,0.0,0.0,1.0,1.0,0.126712
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,1.0,0.0,0.0,1.0,0.458963
396,0,560.0,3.04,0.0,1.0,0.0,1.0,0.179415
397,0,460.0,2.63,1.0,0.0,0.0,1.0,0.195857
398,0,700.0,3.65,1.0,0.0,0.0,1.0,0.440649


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

Answer: Follow up in office hours
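As a starting point for that follow-up, here is a hedged sketch of how odds ratios can be read off a fitted `sklearn` model. It runs on synthetic stand-in data (an assumption, not the admissions data) so that it is self-contained. For the model in Question 18 the analogous expression would be `np.exp(logreg.coef_[0])` paired with `X.columns`. Two differences from `statsmodels` are worth keeping in mind: `LogisticRegression` applies an L2 penalty by default (weak at `C = 10 ** 2`), and it fits its own intercept by default, so a constant `intercept` column in `X` shares the constant term with `logreg.intercept_`. The odds ratios would therefore be expected to be close to, but not identical to, the `statsmodels` values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the admissions data): one continuous
# feature and one 0/1 indicator, with known true coefficients.
rng = np.random.default_rng(0)
n = 2000
x_cont = rng.normal(size=n)
x_flag = rng.integers(0, 2, size=n).astype(float)
X_demo = np.column_stack([x_cont, x_flag])
true_log_odds = -0.5 + 0.8 * x_cont - 0.7 * x_flag
y_demo = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_log_odds))).astype(int)

# Same regularization setting as in Question 18.
model = LogisticRegression(C=10 ** 2, max_iter=1000)
model.fit(X_demo, y_demo)

# Odds ratios are the exponentiated coefficients.
odds_ratios = np.exp(model.coef_[0])
print(dict(zip(["x_cont", "x_flag"], odds_ratios.round(3))))

# Predicted probabilities for two hypothetical profiles that differ
# only in the indicator, analogous to comparing tiers in Question 20.
profiles = np.array([[1.0, 0.0], [1.0, 1.0]])
profile_probs = model.predict_proba(profiles)[:, 1]
print(profile_probs.round(3))
```

The same `predict_proba` call on four hand-built rows (GRE 800, GPA 4.0, one row per tier) is one way to approach Question 20.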

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: Follow up in office hours