# DS-NYC-45 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

Determine the association between a UCLA candidate's GPA, GRE scores, and school prestige and that candidate's likelihood of being accepted as a student.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

The predictors are GRE Score, GPA Score and School Prestige and our dependent variable is the Admission column:

* GRE - Continuous feature (200 - 800)
* GPA - Continuous float feature (1 - 4)
* Prestige - Ordinal feature (1 - 4 ranking, highest to lowest)
* Admission - Outcome, categorical feature (binary)
    
When importing, I dropped rows with NA values. A few of the rows were missing values and would have created calculation errors down the line. To further clean up the data, I used one-hot encoding to convert each of the 'Prestige' categories to its own binary column and dropped one of those columns to avoid redundant features.

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

In [79]:
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv(os.path.join('..','..', 'dataset/ucla-admissions.csv'))

pd.crosstab(df['admit'],df['prestige'],rownames=['admit'])

| admit \ prestige | 1.0 | 2.0 | 3.0 | 4.0 |
| --- | --- | --- | --- | --- |
| 0 | 28 | 97 | 93 | 55 |
| 1 | 33 | 53 | 28 | 12 |


In [80]:
df.groupby(['gpa'])[['admit']].count()

| gpa | admit |
| --- | --- |
| 2.26 | 1 |
| 2.42 | 2 |
| 2.48 | 1 |
| 2.52 | 1 |
| 2.55 | 1 |
| 2.56 | 1 |
| 2.62 | 2 |
| 2.63 | 1 |
| 2.65 | 1 |
| 2.67 | 2 |


In [81]:
df.groupby(['gre'])[['admit']].count()

| gre | admit |
| --- | --- |
| 220.0 | 1 |
| 300.0 | 3 |
| 340.0 | 4 |
| 360.0 | 4 |
| 380.0 | 8 |
| 400.0 | 11 |
| 420.0 | 7 |
| 440.0 | 10 |
| 460.0 | 13 |
| 480.0 | 16 |


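We can now see the distribution of admits and rejections by prestige, along with counts by GPA and GRE. Note that `count()` tallies every applicant at each GPA or GRE value, whether admitted or rejected, so these counts show where applicants cluster rather than how many were admitted. Summing the `admit` column instead would give the number of admissions per value.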
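The prestige crosstab reports raw counts, which are hard to compare across tiers of different sizes. A minimal sketch of turning those counts into admission rates follows; the numbers are copied from the crosstab output, while the dictionary names are illustrative and not part of the original notebook.

```python
# Counts copied from the prestige crosstab output: prestige tier -> applicants.
rejected = {1: 28, 2: 97, 3: 93, 4: 55}
admitted = {1: 33, 2: 53, 3: 28, 4: 12}

# Admission rate per tier = admitted / (admitted + rejected).
admit_rate = {
    tier: admitted[tier] / float(admitted[tier] + rejected[tier])
    for tier in sorted(admitted)
}

for tier, rate in sorted(admit_rate.items()):
    print('Prestige {}: {:.1%} of applicants admitted'.format(tier, rate))
```

Rates put the tiers on a common scale, so the drop in admissions from tier 1 to tier 4 is visible without adjusting for how many applicants each tier has.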

> ## Question 4. Methods
> Write up the methods used in your analysis.

After using EDA to examine our data, we chose logistic regression as our prediction model because the outcome we are trying to predict is binary, and the S-shaped logistic curve keeps predictions between 0 and 1, which fits a binary outcome better than a straight line. First, using statsmodels, we estimated the log odds coefficient of each predictor at a 95% confidence level. Using those coefficients, we were able to create predictions with log odds as the unit of change.
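The logistic regression described above can be written out explicitly. This is a sketch in notation that does not appear in the original notebook: $p$ is the probability of admission, the $\beta$ symbols stand for the fitted coefficients, and $\mathbb{1}[\cdot]$ is an indicator for a prestige tier. It assumes tier 1 is the reference category, since its dummy column is the one that was dropped.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_{\text{gre}}\,\text{gre} + \beta_{\text{gpa}}\,\text{gpa} + \beta_2\,\mathbb{1}[\text{prestige}=2] + \beta_3\,\mathbb{1}[\text{prestige}=3] + \beta_4\,\mathbb{1}[\text{prestige}=4]
$$

$$
p = \frac{1}{1 + e^{-z}}, \qquad z = \log\frac{p}{1-p}
$$

Under this form, a one-unit increase in a predictor adds its coefficient to the log odds, which is the same as multiplying the odds of admission by $e^{\beta}$.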

Using sklearn, we fit a `LogisticRegression` model with `C = 10**2` as the only altered parameter. With this model we were able to create predictions and check our coefficients to calculate the unit of change for each feature. We then examined how School Prestige affects the probability of a student being accepted by testing each Prestige tier with a student who scored 800 on the GRE and has a 4.0 GPA.

> ## Question 5. Results
> Write up your results.

Using the model we created, the probability of being admitted for a student with an 800 GRE score and a 4.0 GPA is 73% from a tier 1 school, 57% from tier 2, 40% from tier 3, and 35% from tier 4. However, I think it requires a look at the model's accuracy score before we can be more confident in those results. When using sklearn's `cross_val_score` to measure accuracy, we see that we have roughly 70% accuracy, which is not much better than the 68% we would get by predicting 0 (not admitted) for everyone.
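The tier probabilities can also be expressed as odds ratios, which is the natural scale for a logistic model. The sketch below uses the four probabilities from the notebook's `predict_proba` output; the helper `odds` and the variable names are illustrative and not part of the original analysis.

```python
# Predicted admission probabilities copied from the predict_proba output
# (GRE 800, GPA 4.0), keyed by prestige tier.
p_admit = {1: 0.73193515, 2: 0.56828592, 3: 0.40231414, 4: 0.35471058}


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


# Odds of admission for each tier relative to tier 1 (the reference tier).
odds_ratio = {tier: odds(p) / odds(p_admit[1]) for tier, p in p_admit.items()}

for tier in sorted(odds_ratio):
    print('Tier {}: odds {:.2f}, odds ratio vs tier 1 {:.2f}'.format(
        tier, odds(p_admit[tier]), odds_ratio[tier]))
```

Because the model is additive in log odds, these ratios between tiers do not depend on the GRE and GPA values chosen for the test student, even though the probabilities themselves do.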

> ## Question 6. Visuals
> Provide a table or visualization of these results.

In [82]:
# Clean the data into the final dataframe from Unit Project 3.
df.dropna(inplace=True)
# One-hot encode prestige; drop_first drops tier 1, making it the reference category.
prestige_binary_trimmed = pd.get_dummies(df.prestige, drop_first=True)
df1 = df.join(prestige_binary_trimmed)
df2 = df1.drop('prestige', axis=1)
df2['intercept'] = 1.0
df2.head()

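|   | admit | gre | gpa | 2.0 | 3.0 | 4.0 | intercept |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 380.0 | 3.61 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 1 | 660.0 | 3.67 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 1 | 800.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 1.0 | 1.0 |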


**Visualization of the Results**

In [84]:
# Fit the logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=10**2)
X = df2[df2.columns[1:]]
y = df2['admit']
output2 = logreg.fit(X, y)

# Each input row is [gre, gpa, prestige 2, prestige 3, prestige 4, intercept].
# The right column of each output shows the probability of being admitted:
print(logreg.predict_proba([[800, 4, 0, 0, 0, 1]]))
print(logreg.predict_proba([[800, 4, 1, 0, 0, 1]]))
print(logreg.predict_proba([[800, 4, 0, 1, 0, 1]]))
print(logreg.predict_proba([[800, 4, 0, 0, 1, 1]]))

[[ 0.26806485  0.73193515]]
[[ 0.43171408  0.56828592]]
[[ 0.59768586  0.40231414]]
[[ 0.64528942  0.35471058]]




**Examining Accuracy**

In [85]:
df2['sklearn_admit'] = logreg.predict_proba(X)[:, 1].round()
df2.head(20)

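|   | admit | gre | gpa | 2.0 | 3.0 | 4.0 | intercept | sklearn_admit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 380.0 | 3.61 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 660.0 | 3.67 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1 | 800.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 5 | 1 | 760.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 1 | 560.0 | 2.98 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | 0 | 400.0 | 3.08 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | 1 | 540.0 | 3.39 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 9 | 0 | 700.0 | 3.92 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |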


In [86]:
# Here we can see that our sklearn logreg is good at predicting the 0's but not so good at guessing the admissions.
# By running cross_val_score on accuracy we can see that we are looking at roughly 70%

from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
print('CV Accuracy {}, Average Accuracy {}'.format(scores, scores.mean()))

CV Accuracy [ 0.80487805  0.6         0.725       0.725       0.675       0.7
  0.71794872  0.61538462  0.76923077  0.66666667], Average Accuracy 0.699910881801


In [87]:
# Find the null accuracy: guessing "rejected" for every student is only
# slightly worse than using our model

print(y.value_counts())
print('=====================================================')
print('If I guess 0 for all I will have a probability of *{}* being correct'.format(1 - y.mean()))

0    271
1    126
Name: admit, dtype: int64
If I guess 0 for all I will have a probability of *0.682619647355* being correct
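To make the comparison between the model and the null guess concrete, the two figures can be put side by side. This is a small sketch that only re-uses the fold scores and class counts printed above; the variable names are illustrative.

```python
# Values copied from the notebook output above.
cv_scores = [0.80487805, 0.6, 0.725, 0.725, 0.675,
             0.7, 0.71794872, 0.61538462, 0.76923077, 0.66666667]
class_counts = {0: 271, 1: 126}

mean_cv_accuracy = sum(cv_scores) / len(cv_scores)
# Null accuracy: share of the majority class (always guessing "rejected").
null_accuracy = max(class_counts.values()) / float(sum(class_counts.values()))
lift = mean_cv_accuracy - null_accuracy

print('Mean CV accuracy: {:.4f}'.format(mean_cv_accuracy))
print('Null accuracy:    {:.4f}'.format(null_accuracy))
print('Difference:       {:.4f}'.format(lift))
print('Fold range:       {:.2f} to {:.2f}'.format(min(cv_scores), max(cv_scores)))
```

The gap between the model and the null guess is under two percentage points, while individual folds range from 0.60 to about 0.80, so the improvement is small relative to the fold-to-fold variation.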


> ## Question 7.  Discussion
> Write up your discussion and future steps.

In conclusion, our model suggests that among students with equal accomplishments, the prestige of their school plays a large role in the UCLA acceptance decision.

For future steps, I have no doubt we can further tune our model to get higher accuracy (the scoring metric I will be using for this problem) by utilizing cross-validation to evaluate the model on more varied samples of the data. Additionally, the only logistic regression parameter we altered was C. We can run a grid search to help us further tune the model against our accuracy metric.