# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
# TODO
prestige_to_admit = pd.crosstab(index=df["prestige"], 
                           columns=df["admit"])

prestige_to_admit.index= ["1.0","2.0","3.0","4.0"]

prestige_to_admit

admit,0,1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
# TODO
X = df[ ['admit', 'gre', 'gpa'] ]

In [5]:
c = df.prestige

In [6]:
cs = pd.get_dummies(c, prefix = None)

In [7]:
cs

,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [8]:
model_Prestige1 = linear_model.LogisticRegression().\
    fit(X, cs[1])

print(model_Prestige1.coef_)
print(model_Prestige1.intercept_)

[[  1.04418376e+00   3.47952409e-04  -3.76477313e-01]]
[-1.01670195]


In [9]:
model_Prestige1.score(X, cs[1])

0.84634760705289669

> ### Question 3.  How many of these binary variables do we need for modeling?

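Answer: We need 3. `prestige` has 4 levels, so one level serves as the reference category (here, prestige 1) and each of the other 3 binary variables is compared against it. Including all 4 alongside an intercept would make the columns perfectly collinear.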
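
A minimal sketch of why one column can be left out, using a small made-up `prestige` series (the values are illustrative, not the admissions data). `drop_first=True` is a standard `pd.get_dummies` option that drops the first category, which here would make prestige 1 the reference level.

```python
import pandas as pd

# Illustrative values only; not the admissions data.
prestige = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0], name="prestige")

# Full one-hot encoding: one column per category.
full = pd.get_dummies(prestige, prefix="prestige")

# Every row has exactly one indicator set, so any one column is
# determined by the other three.
row_sums = full.astype(int).sum(axis=1)

# Dropping the first category leaves three columns; a row of all
# zeros now means "prestige 1".
reduced = pd.get_dummies(prestige, prefix="prestige", drop_first=True)

print(full.shape[1], reduced.shape[1])
```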

> ### Question 4.  Why are we doing this?

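Answer: `prestige` is a categorical (ordinal) variable, so its codes 1 to 4 should not be treated as a quantity. Encoding it as binary indicator variables lets the model estimate a separate effect for each prestige level instead of assuming that every step in rank changes the log-odds by the same amount.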

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [10]:
# TODO
df2 = df.drop('prestige', axis = 1)

In [11]:
df2 = pd.concat([df2, cs], axis=1)

In [12]:
df2

,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [13]:
# TODO
prestige1_to_admit = pd.crosstab(index=df2[1], 
                           columns=df2["admit"])

prestige1_to_admit.index= ["0.0","1.0"]

prestige1_to_admit

admit,0,1
0.0,243,93
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [14]:
# TODO
# Odds of admission for prestige 1: 33:28 (about 1.18)

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [15]:
# TODO
# Odds of admission for all other schools: 93:243 (about 0.38)

> ### Question 9.  Finally, what's the odds ratio?

In [16]:
# TODO
(33/28.0)/(93/243.0)

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from prestige-1 schools are about 3.08 times the odds for applicants from all other schools.
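
The hand calculation in Questions 7 to 9 can be wrapped in two small helpers. This is a sketch; the function names `odds` and `odds_ratio` are introduced here for illustration, and the counts are copied from the `prestige = 1` frequency table above.

```python
def odds(successes, failures):
    """Odds are the number of successes per failure."""
    return successes / float(failures)


def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of group A divided by odds of group B."""
    return odds(a_yes, a_no) / odds(b_yes, b_no)


# Group A: prestige 1 (33 admitted, 28 not admitted)
# Group B: all other schools (93 admitted, 243 not admitted)
prestige1_or = odds_ratio(33, 28, 93, 243)
print(prestige1_or)
```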

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [17]:
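### TODO
# Odds of admission for prestige 4: 12:55 (about 0.22)
# Prestige 4 odds ratio versus all other schools (114 admitted, 216 not admitted)
(12/55.0)/(114/216.0)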

0.4133971291866028

Answer: Applicants from prestige-4 schools have odds of admission of 12:55 (about 0.22), which is about 0.41 times the odds for applicants from all other schools.
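
The comparison counts 114 and 216 are not shown in a table at this point in the notebook; they can be derived from the Part A frequency table by summing over tiers 1 to 3. A short sketch, with the counts copied by hand into a dictionary (the variable names are illustrative):

```python
# Counts from the Part A frequency table: tier -> (not admitted, admitted)
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

not_admitted_4, admitted_4 = counts[4]

# Pool every tier except tier 4 to form the comparison group.
not_admitted_rest = sum(no for tier, (no, yes) in counts.items() if tier != 4)
admitted_rest = sum(yes for tier, (no, yes) in counts.items() if tier != 4)

odds_4 = admitted_4 / float(not_admitted_4)
odds_rest = admitted_rest / float(not_admitted_rest)
odds_ratio_4 = odds_4 / odds_rest

print(admitted_rest, not_admitted_rest, odds_ratio_4)
```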

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [18]:
### Optional: split into training and test sets first

#train_df_stats = df2.sample(frac = .6, random_state = 0)
#test_df_stats = df2.drop(train_df_stats.index)

In [19]:
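### Use the commented-out line instead if you ran the split in the previous cell

#df3 = train_df_stats.drop(1, axis = 1)
# Drop the prestige 1 dummy so that prestige 1 is the reference category
df3 = df2.drop(1, axis = 1)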

In [20]:

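# Every column after 'admit' is a feature: gre, gpa, and the prestige 2, 3, 4 dummies
train_cols = df3.columns[1:]

# Note: no constant column is added, so this model is fit without an intercept
logit = smf.Logit(df3['admit'], df3[train_cols])

result = logit.fit()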

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [21]:
result.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Fri, 10 Feb 2017",Pseudo R-squ.:,0.05722
Time:,18:20:23,Log-Likelihood:,-233.88
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.039e-05

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
gre,0.0014,0.001,1.308,0.191,-0.001 0.003
gpa,-0.1323,0.195,-0.680,0.497,-0.514 0.249
2.0,-0.9562,0.302,-3.171,0.002,-1.547 -0.365
3.0,-1.5375,0.332,-4.627,0.000,-2.189 -0.886
4.0,-1.8699,0.401,-4.658,0.000,-2.657 -1.083


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [22]:
print(np.exp(result.params))

gre    1.001368
gpa    0.876073
2.0    0.384342
3.0    0.214918
4.0    0.154135
dtype: float64


In [23]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
# Exponentiate the coefficients and their bounds to get odds ratios
print(np.exp(conf))

         2.5%     97.5%        OR
gre  0.999320  1.003420  1.001368
gpa  0.598303  1.282800  0.876073
2.0  0.212826  0.694082  0.384342
3.0  0.112055  0.412207  0.214918
4.0  0.070176  0.338540  0.154135


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

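Answer: Holding `gre` and `gpa` constant, the odds of admission for an applicant from a prestige-2 school are about 0.38 times the odds for an applicant from a prestige-1 school, that is, about 62% lower (95% confidence interval for the odds ratio: 0.21 to 0.69).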
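
An odds ratio is a multiplier, not a difference, so it is often easier to read as a percent change in the odds: `(OR - 1) * 100`. A small sketch using the odds ratios printed in Question 14 (the dictionary is typed in by hand for illustration):

```python
# Odds ratios copied from the Question 14 output
odds_ratios = {
    "gre": 1.001368,
    "gpa": 0.876073,
    "2.0": 0.384342,
    "3.0": 0.214918,
    "4.0": 0.154135,
}

# An odds ratio of 0.38 means the odds are multiplied by 0.38,
# which is a change of (0.38 - 1) * 100, or about -62 percent.
percent_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

for name in sorted(percent_change):
    print(name, round(percent_change[name], 1))
```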

> ### Question 16.  Interpret the odds ratio of `gpa`.

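Answer: Holding `gre` and `prestige` constant, each one-point increase in GPA multiplies the odds of admission by about 0.88 in this model. The 95% confidence interval (0.60 to 1.28) includes 1, so the effect is not statistically distinguishable from no effect. The model in Question 12 was fit without an intercept term, which may contribute to this counterintuitive direction.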
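
One quick way to read the table in Question 14 is to check which confidence intervals contain 1, since an odds ratio of 1 means no change in the odds. A sketch with the bounds copied by hand from that output:

```python
# (2.5%, 97.5%) bounds on the odds ratios, copied from the Question 14 output
conf_int = {
    "gre": (0.999320, 1.003420),
    "gpa": (0.598303, 1.282800),
    "2.0": (0.212826, 0.694082),
    "3.0": (0.112055, 0.412207),
    "4.0": (0.070176, 0.338540),
}

# True means the interval includes 1, so the estimate is compatible
# with no effect at the 5% level.
includes_one = {name: low <= 1.0 <= high for name, (low, high) in conf_int.items()}

for name in sorted(includes_one):
    print(name, includes_one[name])
```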

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [24]:
#Tier 2
predict_X = [ [800,4,1,0,0] ]

In [25]:
#Tier 2
result.predict(predict_X)

array([ 0.40320425])

In [26]:
#Tier 3
predict_X = [ [800,4,0,1,0] ]

In [27]:
#Tier 3
result.predict(predict_X)

array([ 0.27420161])

In [28]:
#Tier 4
predict_X = [ [800,4,0,0,1] ]

In [29]:
#Tier 4
result.predict(predict_X)

array([ 0.21318433])

In [30]:
#Tier 1
predict_X = [ [800,4,0,0,0] ]

In [31]:
#Tier 1
result.predict(predict_X)

array([ 0.63739858])

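Answer: For a GRE of 800 and a GPA of 4, the predicted probability of admission is about 0.64 for tier 1, 0.40 for tier 2, 0.27 for tier 3, and 0.21 for tier 4.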
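
The `result.predict` calls above apply the inverse-logit function to a linear combination of the inputs. The sketch below reproduces them by hand, assuming the coefficients are the logs of the odds ratios printed in Question 14 and that the model has no intercept (the summary lists no constant term). The helper name `admit_probability` is hypothetical, and the results agree with the outputs above only up to the rounding of the printed odds ratios.

```python
import math

# Odds ratios copied from the Question 14 output; log(OR) is the coefficient.
odds_ratios = {"gre": 1.001368, "gpa": 0.876073,
               "2.0": 0.384342, "3.0": 0.214918, "4.0": 0.154135}
coef = {name: math.log(ratio) for name, ratio in odds_ratios.items()}


def admit_probability(gre, gpa, tier):
    """Inverse-logit of the linear predictor; tier 1 is the reference level."""
    z = coef["gre"] * gre + coef["gpa"] * gpa
    if tier != 1:
        z += coef["%.1f" % tier]  # add the dummy coefficient for tiers 2 to 4
    return 1.0 / (1.0 + math.exp(-z))


probabilities = {tier: admit_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier in sorted(probabilities):
    print(tier, round(probabilities[tier], 3))
```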

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [32]:
X = df2[ ['gre', 'gpa',1] ]

c = df2.admit

model_admission = linear_model.LogisticRegression().\
    fit(X, c)


In [33]:
# TODO
print(model_admission.coef_)
print(model_admission.intercept_)

[[ 0.00184569  0.06560212  0.94622713]]
[-2.21655818]


In [34]:
model_admission.score(X, c)

0.7002518891687658

In [35]:
train_df = df2.sample(frac = .6, random_state = 0)
test_df = df2.drop(train_df.index)

In [36]:
names_X = ['gre', 'gpa',
    1,2,3,4]

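def X_c(df):
    # NOTE: this reads from the global df2 rather than the df argument, so the
    # train and test sets below both contain every row; the outputs that follow
    # reflect a fit on the full dataset. Use df[ names_X ] and df.admit to
    # honor the split.
    X = df2[ names_X ]
    c = df2.admit
    return X, c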

train_X, train_c = X_c(train_df)
test_X, test_c = X_c(test_df)
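
A helper like `X_c` is meant to pull the features and target from whichever frame it receives. The sketch below shows that pattern on a small synthetic frame; the frame `demo`, the helper name `split_X_c`, and its parameters are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df2; the values are made up.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "admit": rng.randint(0, 2, size=20),
    "gre": rng.randint(300, 801, size=20),
    "gpa": rng.uniform(2.0, 4.0, size=20),
})


def split_X_c(frame, feature_names, target_name="admit"):
    """Return (features, target) taken from the frame that was passed in."""
    return frame[feature_names], frame[target_name]


train_demo = demo.sample(frac=0.6, random_state=0)
test_demo = demo.drop(train_demo.index)

train_X_demo, train_c_demo = split_X_c(train_demo, ["gre", "gpa"])
test_X_demo, test_c_demo = split_X_c(test_demo, ["gre", "gpa"])

# The two pieces are disjoint and together cover every row.
print(len(train_X_demo), len(test_X_demo))
```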

In [37]:
model = linear_model.LogisticRegression(C = 10**2).\
    fit(train_X, train_c)

print(model.intercept_)
print(model.coef_)

[-3.66383698]
[[ 0.00215865  0.73484431 -0.05427652 -0.70070849 -1.34355883 -1.56529314]]


In [38]:
# TODO


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [39]:
# TODO
prestige2_to_admit = pd.crosstab(index=df2[2], 
                           columns=df2["admit"])

prestige2_to_admit.index= ["0.0","1.0"]

prestige2_to_admit

admit,0,1
0.0,176,73
1.0,95,53


In [40]:
#prestige 2.0 odds ratio
(53/95.0)/(73/176.0)

1.3450612833453495

In [41]:
prestige3_to_admit = pd.crosstab(index=df2[3], 
                           columns=df2["admit"])

prestige3_to_admit.index= ["0.0","1.0"]

prestige3_to_admit

admit,0,1
0.0,178,98
1.0,93,28


In [42]:
#Prestige 3.0 odds ratio
(28/93.0)/(98/178.0)

0.5468509984639017

In [43]:
prestige4_to_admit = pd.crosstab(index=df2[4], 
                           columns=df2["admit"])

prestige4_to_admit.index= ["0.0","1.0"]

prestige4_to_admit

admit,0,1
0.0,216,114
1.0,55,12


In [44]:
#prestige 4.0 odds ratio
(12/55.0)/(114/216.0)

0.4133971291866028
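
The crosstab ratios above compare each tier against all other tiers combined and do not adjust for `gre` or `gpa`, so they are not the fitted model's odds ratios. Those come from exponentiating `model.coef_`. The sketch below does this with the coefficients printed in cell 37, typed in by hand in the order of `names_X`. Because the `sklearn` model includes an intercept, all four prestige dummies, and (by default) an L2 penalty, the raw values are not directly comparable with the `statsmodels` ones; dividing by the tier-1 odds ratio is one way to re-express them relative to prestige 1.

```python
import numpy as np

# Coefficients copied from the cell 37 output (order: gre, gpa, 1, 2, 3, 4)
names = ["gre", "gpa", "1.0", "2.0", "3.0", "4.0"]
coef = np.array([0.00215865, 0.73484431, -0.05427652,
                 -0.70070849, -1.34355883, -1.56529314])

# Odds ratio for each feature is exp(coefficient).
sklearn_or = dict(zip(names, np.exp(coef)))

# Re-express the prestige effects relative to tier 1.
relative_to_tier1 = {name: sklearn_or[name] / sklearn_or["1.0"]
                     for name in ["2.0", "3.0", "4.0"]}

for name in names:
    print(name, round(float(sklearn_or[name]), 4))
for name in sorted(relative_to_tier1):
    print(name, "vs tier 1:", round(float(relative_to_tier1[name]), 4))
```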

In [45]:
# TODO

Answer:

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [46]:
# Tier 1
predict_X = [ [800,4,1,0,0,0] ]

print(model.predict(predict_X))
print(model.predict_proba(predict_X))

[1]
[[ 0.27924987  0.72075013]]


In [47]:
# Tier 2
predict_X = [ [800,4,0,1,0,0] ]

print(model.predict(predict_X))
print(model.predict_proba(predict_X))

[1]
[[ 0.42512887  0.57487113]]


In [48]:
# Tier 3
predict_X = [ [800,4,0,0,1,0] ]

print(model.predict(predict_X))
print(model.predict_proba(predict_X))

[0]
[[ 0.58445692  0.41554308]]


In [49]:
# Tier 4
predict_X = [ [800,4,0,0,0,1] ]

print(model.predict(predict_X))
print(model.predict_proba(predict_X))

[0]
[[ 0.63710735  0.36289265]]


In [50]:
# TODO

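Answer: With the `sklearn` model, the predicted probability of admission is about 0.72 for tier 1, 0.57 for tier 2, 0.42 for tier 3, and 0.36 for tier 4.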
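
In the outputs above, `predict` returns 1 for tiers 1 and 2 and 0 for tiers 3 and 4. For a binary `LogisticRegression`, `predict` picks the class with the larger probability, which amounts to thresholding P(admit) at 0.5. A small sketch using the second column of each `predict_proba` output, copied by hand:

```python
# P(admit) for each tier, copied from the predict_proba outputs above
p_admit = {1: 0.72075013, 2: 0.57487113, 3: 0.41554308, 4: 0.36289265}

# Predict "admit" (1) when the probability exceeds 0.5, otherwise 0.
predicted_class = {tier: int(p > 0.5) for tier, p in p_admit.items()}

for tier in sorted(predicted_class):
    print(tier, p_admit[tier], predicted_class[tier])
```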