# DS-NYC-45 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

Answer:
Admission to UCLA is influenced by a number of factors, including a student's GPA and GRE scores. However, the prestige of a student's previous school may also significantly influence the outcome of their application. By analyzing a cross-section of admission data, we hope to confirm or refute the existence of a correlation between admission and prestige.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

Answer:
The dataset covers 400 prospective UCLA students and records their admission status, GRE score, GPA, and the prestige of their previous school (ranked 1-4). GPA and GRE had two inadmissible values, while prestige had one, all of which were replaced by 0 through `.fillna()`.

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model



In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
# Drop rows that contain missing values
df.dropna(inplace=True)

In [6]:
df.groupby(['admit']).size()

admit
0    271
1    126
dtype: int64

Answer:
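The table above counts applicants by admission status: 271 were not admitted (`admit` = 0) and 126 were admitted (`admit` = 1).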

> ## Question 4. Methods
> Write up the methods used in your analysis.

Answer:
After cleaning the dataset, each variable was plotted to observe its distribution. All three factors were then placed alongside admission in a correlation matrix to determine which relationships were most likely to be collinear.
Admission counts for each prestige level were then tabulated in a frequency table, and the prestige variable was one-hot encoded, after which the original column was dropped from the data. Odds ratios were calculated before the data was analyzed with a logistic regression model in both statsmodels and sklearn.
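
The odds-ratio step can be sketched from a 2x2 frequency table as follows. This is a minimal illustration, not the code used in the earlier unit projects: the function names are made up for this example, and the counts are hypothetical placeholders rather than values from the dataset.

```python
def odds(successes, failures):
    """Odds of success: number of successes divided by number of failures."""
    if failures == 0:
        raise ValueError("odds are undefined when there are no failures")
    return successes / float(failures)


def odds_ratio(admitted_a, rejected_a, admitted_b, rejected_b):
    """Odds ratio of group A relative to group B from a 2x2 frequency table."""
    return odds(admitted_a, rejected_a) / odds(admitted_b, rejected_b)


# Hypothetical counts for illustration only (NOT taken from the dataset):
# group A: 30 admitted, 20 rejected; group B: 40 admitted, 80 rejected.
ratio_a_vs_b = odds_ratio(30, 20, 40, 80)
ratio_b_vs_a = odds_ratio(40, 80, 30, 20)
print(ratio_a_vs_b, ratio_b_vs_a)
```

Swapping the two groups gives the reciprocal ratio, so reporting the comparison in either direction carries the same information.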


> ## Question 5. Results
> Write up your results.

Answer:
The correlation matrix suggested that there was no significant relationship between admission and prestige.
The odds of admission for applicants from the most prestigious undergraduate schools are 3.08 times the odds for all other applicants; conversely, the odds ratio for applicants from less prestigious (tier 2-4) schools is 0.32.
From the logistic regression model in statsmodels, the predicted probability of admission for a student with an 800 GRE score and a 4.0 GPA is 0.0042 from a tier-1 school, 0.0021 from a tier-2 school, 0.0011 from a tier-3 school, and 0.00089 from a tier-4 school. The model gave odds ratios of 0.02 for tier-2 schools and 2.18 for GPA. This suggests that GPA is a far more powerful factor in determining admission.
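
As a rough check on how such figures are derived, the sketch below converts a logit coefficient into an odds ratio with `exp(coef)` and turns a linear predictor into a probability with the logistic function. It uses only the rounded coefficients that are fully visible in the summary printed later in this notebook (`gre`, `gpa`, and the `1.0` dummy), and the helper names are illustrative rather than part of statsmodels. Because the model is fit without a separate intercept, the prestige dummy is assumed to act as the baseline log-odds for its tier.

```python
import math

# Rounded coefficients copied from the statsmodels summary in this notebook.
# Only rows that are fully visible there are used, so results are approximate.
COEF_GRE = 0.0022
COEF_GPA = 0.7793
COEF_TIER_1 = -3.8769  # dummy column `1.0`


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in a feature."""
    return math.exp(coef)


def admission_probability(gre, gpa, tier_coef):
    """Predicted probability: logistic function applied to the log-odds."""
    log_odds = COEF_GRE * gre + COEF_GPA * gpa + tier_coef
    return 1.0 / (1.0 + math.exp(-log_odds))


gpa_odds_ratio = odds_ratio(COEF_GPA)
tier_1_probability = admission_probability(800, 4.0, COEF_TIER_1)
print(round(gpa_odds_ratio, 2))
print(round(tier_1_probability, 2))
```

With these rounded inputs, the GPA odds ratio is about 2.18, matching the value above, while the tier-1 probability comes out near 0.73, so the much smaller probabilities quoted above may be worth re-checking against the fitted model.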

> ## Question 6. Visuals
> Provide a table or visualization of these results.

In [7]:
df = df.join(pd.get_dummies(df['prestige']))
del df['prestige']
df.head()

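|   | admit | gre   | gpa  | 1.0 | 2.0 | 3.0 | 4.0 |
|---|-------|-------|------|-----|-----|-----|-----|
| 0 | 0     | 380.0 | 3.61 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1     | 660.0 | 3.67 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1     | 800.0 | 4.0  | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1     | 640.0 | 3.19 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0     | 520.0 | 2.93 | 0.0 | 0.0 | 0.0 | 1.0 |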
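
To make the one-hot encoding step concrete, here is a small pure-Python sketch of the idea behind `pd.get_dummies`: each distinct category becomes its own 0/1 column. The `one_hot` helper is hypothetical and is not how pandas implements it; the input values are the prestige levels implied by the first five rows of the `df.head()` output above.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in its category's slot."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Prestige levels of the first five rows shown in the output above.
categories, rows = one_hot([3.0, 3.0, 1.0, 4.0, 4.0])
print(categories)
for row in rows:
    print(row)
```

Only three columns appear here because prestige level 2.0 does not occur in these five rows; on the full dataset all four levels are present.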


In [8]:

feature_cols = ['gre', 'gpa', 1.0, 2.0, 3.0, 4.0]
X = df[feature_cols]
y = df['admit']


In [9]:

logreg = smf.Logit(y, X)
results = logreg.fit()



Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


Answer:

In [10]:
print(results.summary())


                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Tue, 31 Jan 2017   Pseudo R-squ.:                 0.08166
Time:                        17:10:50   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.019         0.128     1.431
1.0           -3.8769      1.142     -3.393      0.0

In [19]:
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Shuffle-split scheme: 10 random splits with 20% held out in each
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=2)

# Standardize features using statistics learned from the training set only
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
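
The fit-on-train, transform-on-test pattern used with `StandardScaler` can be sketched in plain Python as below. This is an illustration of z-score scaling for a single feature, not scikit-learn's implementation; the helper names are hypothetical, and the GRE values are borrowed from the first five rows shown earlier and split arbitrarily for the example.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.mean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Apply z-score scaling using previously learned statistics."""
    return [(x - mean) / std for x in column]


# Illustrative split of five GRE scores (arbitrary, for demonstration only).
train_gre = [380.0, 660.0, 800.0, 640.0]
test_gre = [520.0]

mean, std = fit_scaler(train_gre)              # statistics from training data only
train_scaled = transform(train_gre, mean, std)
test_scaled = transform(test_gre, mean, std)   # reuse the training statistics
print(train_scaled, test_scaled)
```

Reusing the training mean and standard deviation on the test data keeps information about the held-out rows from leaking into the preprocessing step.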

In [20]:


from sklearn.linear_model import LogisticRegression

# Large C means weak regularization
logreg2 = LogisticRegression(C=10 ** 2)
logreg2.fit(X_train_std, y_train)

# Pair each feature name with its fitted coefficient
list(zip(feature_cols, logreg2.coef_[0]))



[('gre', 0.37560643110460101),
 ('gpa', 0.31820912351339253),
 (1.0, 0.314293862251669),
 (2.0, 0.20505405224976539),
 (3.0, -0.21727446585345289),
 (4.0, -0.28169406531265123)]

> ## Question 7.  Discussion
> Write up your discussion and future steps.

Answer:
From this work we can see that the prestige of a student's school does have some predictive correlation with their admission; that relationship is strongest for tier-1 schools and gets progressively weaker for lower tiers. Additionally, even for tier-1 schools, the coefficients are lower than those for GRE scores and GPA.
While some of the steps used in this analysis were flawed, the overall conclusion that prestige is not a reliable predictor of admission seems to be sound.
Potential future analysis could include examining how GRE scores and GPA relate to admission when broken down into ranges or, if the data can be acquired, how these features interact with whether an applicant is applying from inside or outside California.
