# DS-SF-36 | Unit Project | 3 | Machine Learning Modeling and Executive Summary | Starter Code

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.  You will summarize and present your findings and the methods you used.

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [3]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [4]:
pd.crosstab(df.admit, df.prestige, dropna = False)

admit \ prestige,1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12


## Part B.  Feature Engineering

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
df.prestige.value_counts(dropna = False).sort_index()

1.0     61
2.0    148
3.0    121
4.0     67
Name: prestige, dtype: int64

In [6]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

In [7]:
prestige_df

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [8]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
    'prestige_2.0': 'prestige_2',
    'prestige_3.0': 'prestige_3',
    'prestige_4.0': 'prestige_4'}, inplace = True)

In [9]:
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [10]:
df = df.join([prestige_df])

In [11]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige', u'prestige_1', u'prestige_2',
       u'prestige_3', u'prestige_4'],
      dtype='object')

> ### Question 3.  How many of these binary variables do we need for modeling?

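Answer: For modeling we need 3 binary variables. `prestige` has 4 levels, so one level serves as the reference category and the other 3 dummies fully encode the feature. Including all 4 alongside an intercept would make the columns perfectly collinear.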
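A minimal sketch in plain Python of why one dummy is redundant. The `one_hot` helper and the `prestige_sample` list (the tiers of rows 0 to 4 and 395 shown in the tables above) are illustrative names, not part of the original notebook.

```python
def one_hot(values, levels):
    """Return one row of 0/1 indicators per value, one column per level."""
    return [[1 if value == level else 0 for level in levels] for value in values]

# Tiers of rows 0-4 and 395 as displayed above (illustrative sample only).
prestige_sample = [3, 3, 1, 4, 4, 2]
levels = [1, 2, 3, 4]

full = one_hot(prestige_sample, levels)

# Every row has exactly one 1, so the four columns always sum to 1,
# which duplicates the information carried by an intercept column.
row_sums = [sum(row) for row in full]

# Drop the tier-1 column (the reference level) and recover it from the rest.
reduced = [row[1:] for row in full]
recovered_tier_1 = [1 - sum(row) for row in reduced]

print(row_sums)
print(recovered_tier_1 == [row[0] for row in full])
```

Because the dropped column can always be rebuilt from the other three, keeping it adds no information to the model.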

> ### Question 4.  Why are we doing this?

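Answer: We are doing this because `prestige` is a categorical feature. If its codes (1 to 4) were used directly, the model would treat them as a continuous quantity with equal spacing between tiers. One-hot encoding lets the model estimate a separate effect for each tier.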

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [12]:
df = df[ ['admit', 'gre', 'gpa'] ].join(prestige_df)

df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [13]:
pd.crosstab(df.admit, df['prestige_1'], dropna = False)

admit \ prestige_1,0,1
0,243,28
1,93,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [14]:
odds_admitted_tier_1 = 33. / 28

odds_admitted_tier_1

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [15]:
odds_admitted_NOT_tier_1 = 93. / 243

odds_admitted_NOT_tier_1

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [16]:
odds_ratio = odds_admitted_tier_1 / odds_admitted_NOT_tier_1

odds_ratio

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

The odds of admission for applicants from tier-1 undergraduate schools are about 3.1 times the odds for applicants from all other schools.
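The hand calculation in Questions 7 to 9 can be wrapped in a small helper. This is a sketch only: the function names `odds` and `odds_ratio` are hypothetical, and the counts are taken from the frequency table in Question 6.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted count divided by rejected count."""
    return float(admitted) / rejected

def odds_ratio(admitted_in, rejected_in, admitted_out, rejected_out):
    """Odds of admission inside a group divided by the odds outside it."""
    return odds(admitted_in, rejected_in) / odds(admitted_out, rejected_out)

# Tier-1 applicants: 33 admitted, 28 rejected.
# All other applicants: 93 admitted, 243 rejected.
tier_1_ratio = odds_ratio(33, 28, 93, 243)
print(round(tier_1_ratio, 2))  # 3.08
```

The same helper applies to any other 2x2 frequency table of admission by group membership.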

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [17]:
pd.crosstab(df.admit, df.prestige_4)

admit \ prestige_4,0,1
0,216,55
1,114,12


In [18]:
odds_admitted_tier_4 = 12. / 55
print 'odds_admitted_tier_4 =', odds_admitted_tier_4

odds_admitted_NOT_tier_4 = 114. / 216
print 'odds_admitted_NOT_tier_4 = ', odds_admitted_NOT_tier_4

odds_ratio = odds_admitted_tier_4 / odds_admitted_NOT_tier_4
print 'odds_ratio = ', odds_ratio

odds_admitted_tier_4 = 0.218181818182
odds_admitted_NOT_tier_4 =  0.527777777778
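odds_ratio =  0.413397129187

Answer: The odds of admission for applicants from tier-4 undergraduate schools are about 0.41 times the odds for applicants from all other schools.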


## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [20]:
model = smf.logit(formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [21]:
model.summary()

0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Wed, 30 Aug 2017",Pseudo R-squ.:,0.08166
Time:,00:23:49,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372,-0.735


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [22]:
np.exp(model.params)

Intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64

In [23]:
np.exp(model.conf_int(alpha = .05)).\
    rename(columns = {0: '2.5%', 1: '97.5%'})

Unnamed: 0,2.5%,97.5%
Intercept,0.002207,0.19444
gre,1.000074,1.004372
gpa,1.13612,4.183113
prestige_2,0.272168,0.942767
prestige_3,0.133377,0.515419
prestige_4,0.093329,0.479411
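As a rough check on the table above, an odds ratio and its 95% interval can be rebuilt from a coefficient and its standard error. The sketch below uses the rounded `prestige_2` values from the summary table and the normal approximation `z = 1.96`, so it only approximately reproduces the exact figures.

```python
import math

# Rounded values from the prestige_2 row of the model summary.
coef = -0.6801
std_err = 0.317

# Approximate 97.5th percentile of the standard normal distribution.
z = 1.96

# Build the interval on the log-odds scale, then exponentiate each bound.
lower = math.exp(coef - z * std_err)
point = math.exp(coef)
upper = math.exp(coef + z * std_err)

print("%.3f %.3f %.3f" % (lower, point, upper))
```

The results should land close to the 0.272, 0.507, and 0.943 reported for `prestige_2`; small differences come from rounding the inputs.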


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Students who attended a tier 2 undergraduate school had ~ 51% the odds of being admitted to graduate school compared to students who attend a tier 1 undergraduate school.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: The odds ratio for `gpa` is ~ 2.2. Holding `gre` and `prestige` constant, each 1-point increase in GPA roughly doubles the odds (not the probability) of being admitted.
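An odds ratio multiplies odds, not probabilities, and the two only behave alike when the probability is small. The sketch below illustrates this with the `gpa` odds ratio from Question 14; the starting probabilities and the helper names are arbitrary choices for illustration.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_to_prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)

gpa_odds_ratio = 2.180027  # from np.exp(model.params)

# Arbitrary starting probabilities, chosen only for illustration.
shifted = []
for p in [0.10, 0.30, 0.50]:
    new_p = odds_to_prob(prob_to_odds(p) * gpa_odds_ratio)
    shifted.append((p, new_p))
    print("%.2f -> %.3f" % (p, new_p))
```

Starting from a probability of 0.50, multiplying the odds by about 2.18 gives a probability near 0.69, well short of doubling.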

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [24]:
predict_X = pd.DataFrame({'intercept': [1, 1, 1, 1],
    'gre': [800, 800, 800, 800],
    'gpa': [4, 4, 4, 4],
    'prestige_2': [0, 1, 0, 0],
    'prestige_3': [0, 0, 1, 0],
    'prestige_4': [0, 0, 0, 1]})

predict_X

Unnamed: 0,gpa,gre,intercept,prestige_2,prestige_3,prestige_4
0,4,800,1,0,0,0
1,4,800,1,1,0,0
2,4,800,1,0,1,0
3,4,800,1,0,0,1


In [25]:
model.predict(predict_X)

0    0.734040
1    0.582995
2    0.419833
3    0.368608
dtype: float64

Tier-1: 73%   
Tier-2: 58%   
Tier-3: 42%   
Tier-4: 37%   
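These probabilities can be approximately reproduced by hand from the odds ratios in Question 14: multiply the baseline odds by each feature's odds ratio raised to the feature's value, then convert odds to a probability. This is a sketch using the rounded printed values; `admission_probability` is a hypothetical helper, not a `statsmodels` function.

```python
# Odds ratios as printed by np.exp(model.params).
odds_ratios = {
    'Intercept': 0.020716,
    'gre': 1.002221,
    'gpa': 2.180027,
    'prestige_2': 0.506548,
    'prestige_3': 0.262192,
    'prestige_4': 0.211525,
}

def admission_probability(gre, gpa, tier):
    """Predicted admission probability; tier 1 is the reference level."""
    odds = (odds_ratios['Intercept']
            * odds_ratios['gre'] ** gre
            * odds_ratios['gpa'] ** gpa)
    if tier != 1:
        odds *= odds_ratios['prestige_%d' % tier]
    return odds / (1.0 + odds)

probabilities = [admission_probability(800, 4, tier) for tier in [1, 2, 3, 4]]
for tier, p in zip([1, 2, 3, 4], probabilities):
    print("Tier-%d: %.0f%%" % (tier, 100 * p))
```

Small discrepancies against `model.predict` are expected because the odds ratios here are rounded to six decimals.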

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [28]:
X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ]
y = df.admit

model = linear_model.LogisticRegression(C = 10 ** 2).fit(X, y)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [29]:
print np.exp(model.intercept_)
print np.exp(model.coef_)

[ 0.02975414]
[[ 1.00216055  1.96041259  0.53321936  0.28586733  0.20829663]]


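The odds ratios are close to those calculated with `statsmodels` but not identical: `gre` and `prestige_4` are nearly unchanged, `gpa` is lower (1.96 vs. 2.18), and `prestige_2` and `prestige_3` are slightly higher. A likely reason is that `LogisticRegression` applies L2 regularization by default (`C = 10 ** 2` weakens it but does not remove it), whereas `smf.logit` fits an unpenalized maximum-likelihood model.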
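To quantify the comparison, the sketch below computes the relative change of each `sklearn` odds ratio against its `statsmodels` counterpart, using the values printed in Questions 14 and 19. The dictionary names are illustrative.

```python
# Odds ratios printed in Question 14 (statsmodels) and Question 19 (sklearn).
statsmodels_or = {'gre': 1.002221, 'gpa': 2.180027, 'prestige_2': 0.506548,
                  'prestige_3': 0.262192, 'prestige_4': 0.211525}
sklearn_or = {'gre': 1.00216055, 'gpa': 1.96041259, 'prestige_2': 0.53321936,
              'prestige_3': 0.28586733, 'prestige_4': 0.20829663}

# Relative change of the sklearn estimate with respect to statsmodels.
relative_change = {}
for name in statsmodels_or:
    relative_change[name] = (sklearn_or[name] - statsmodels_or[name]) / statsmodels_or[name]

# Feature whose odds ratio moved the most in relative terms.
largest = max(relative_change, key=lambda name: abs(relative_change[name]))

for name in sorted(relative_change):
    print("%-10s %+.1f%%" % (name, 100 * relative_change[name]))
print(largest)
```

On these numbers the largest relative shift is in `gpa`, at roughly 10%, while `gre` barely moves.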

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [31]:
predict_X

Unnamed: 0,gpa,gre,intercept,prestige_2,prestige_3,prestige_4
0,4,800,1,0,0,0
1,4,800,1,1,0,0
2,4,800,1,0,1,0
3,4,800,1,0,0,1


In [32]:
predict_X.drop('intercept', axis = 1, inplace = True)

In [33]:
model.predict_proba(predict_X[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ])

array([[ 0.28814605,  0.71185395],
       [ 0.43153702,  0.56846298],
       [ 0.58608936,  0.41391064],
       [ 0.66024514,  0.33975486]])

Sklearn

Tier-1: 71%   
Tier-2: 57%   
Tier-3: 41%   
Tier-4: 34%  

Statsmodels

Tier-1: 73%   
Tier-2: 58%   
Tier-3: 42%   
Tier-4: 37%  
    
The two sets of results are similar: for every tier, the predicted probabilities differ by no more than about 3 percentage points.

## Part F.  Executive Summary

> ## Question 21.  Introduction
>
> Write a problem statement for this project.

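Graduate admissions decisions depend on several factors. This project uses the UCLA admissions data to estimate how GRE score, GPA, and the prestige of an applicant's undergraduate school relate to the odds of admission, so that future applicants can see which areas to focus on.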

> ## Question 22.  Dataset
>
> Write up a description of your data and any cleaning that was completed.

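Overall the data was clean. Rows with missing values were dropped with `dropna`, leaving 397 observations. The main change was to `prestige`, which was one-hot encoded into four binary columns (`prestige_1` to `prestige_4`) after building its frequency table.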

> ## Question 23.  Demo
>
> Provide a table that explains the data by admission status.

In [35]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

def question_3(df):
    print '%.1f (%.1f)' % (df['gpa'].mean(), df['gpa'].std())
    print '%.1f (%.1f)' % (df['gre'].mean(), df['gre'].std())
    for i in range(1, 5):
        print '%s' % (df[df.prestige == i].shape[0])

question_3(df[df.admit == 0])
print
question_3(df[df.admit == 1])

3.3 (0.4)
573.6 (116.1)
28
95
93
55

3.5 (0.4)
618.6 (109.3)
33
53
28
12


| Not Admitted | Admitted
---|---|---
GPA | 3.3 (0.4) | 3.5 (0.4)
GRE | 573.6 (116.1) | 618.6 (109.3)
Prestige 1 | 28 | 33
Prestige 2 | 95 | 53
Prestige 3 | 93 | 28
Prestige 4 | 55 | 12

> ## Question 24.  Methods
>
> Write up the methods used in your analysis.

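Admission was modeled with logistic regression, using `gre`, `gpa`, and the one-hot encoded `prestige` (tier 1 as the reference level) as features. The model was fit with both `statsmodels` and `sklearn`.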

> ## Question 25.  Results
>
> Write up your results.

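Predicted probability of admission for a student with a GRE of 800 and a GPA of 4, by undergraduate school tier:

Sklearn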

Tier-1: 71%   
Tier-2: 57%   
Tier-3: 41%   
Tier-4: 34%  

Statsmodels

Tier-1: 73%   
Tier-2: 58%   
Tier-3: 42%   
Tier-4: 37%  

> ## Question 26.  Visuals
>
> Provide a table or visualization of these results.

Odds ratios (with 95% confidence intervals) of being admitted to graduate school vs. Prestige 1, from the `statsmodels` fit:

Prestige | Lower | Estimate | Upper
:---:|---:|---:|---:
2 | 27% | 51% | 94%
3 | 13% | 26% | 52%
4 | 9% | 21% | 48%

> ## Question 27.  Discussion
>
> Write up your discussion and future steps.

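There appears to be an association between the prestige of an applicant's undergraduate program and admission to graduate school: after controlling for GRE and GPA, applicants from lower-tier schools had lower odds of admission. Next steps would be to continue refining the current model and to add new data that opens up different areas to explore.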