# DS-SF-30 | Unit Project 1: Research Design

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: A binary result: whether a free-tier customer converts to a paying customer or stays on the free tier.

> ### Question 2.  What are the predictors/covariates?

Answer: Age, gender, location, and profession (demographic data), plus days since last log in and activity score (customer usage data).

> ### Question 3.  What timeframe is this data relevant for?

Answer: January - April 2015

> ### Question 4.  What is the hypothesis?

Answer: If a free-tier customer is active and has certain attributes, then they will convert to a premium customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

In [12]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))

df.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


> ### Question 5.  Create a data dictionary.

Answer: (Use the template below)

Variable | Description | Type of Variable
---|---|---
admit | 0 = not admitted, 1 = admitted | Categorical
gre | GRE score, out of 800 | Continuous
gpa | GPA, out of 4, with 4 being the highest | Continuous
prestige | Rank between 1 and 4, with 1 being the most prestigious | Categorical
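
One way to sanity-check a data dictionary like the one above is to compare it with the storage types and distinct-value counts that pandas reports. The sketch below is only illustrative: it rebuilds the five rows printed by `df.head()` as a toy frame (`sample` and `summary` are hypothetical names) instead of reading the real file.

```python
import pandas as pd

# Toy frame: the five rows printed by df.head() above, not the full dataset.
sample = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0],
    'gre':      [380.0, 660.0, 800.0, 640.0, 520.0],
    'gpa':      [3.61, 3.67, 4.0, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# One row per variable: storage type and number of distinct values.
# A small number of distinct values (admit, prestige) hints at a
# categorical variable even when it is stored as a number.
summary = pd.DataFrame({
    'dtype': sample.dtypes.astype(str),
    'n_unique': sample.nunique(),
})
print(summary)
```

On the full dataset the same two calls, `df.dtypes` and `df.nunique()`, would be run on `df` directly.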

We would like to explore the association between X and Y.

> ### Question 6.  What is the outcome?

In [13]:
df.describe()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| count | 400.0 | 398.0 | 398.0 | 399.0 |
| mean | 0.3175 | 588.040201 | 3.39093 | 2.486216 |
| std | 0.466087 | 115.628513 | 0.38063 | 0.945333 |
| min | 0.0 | 220.0 | 2.26 | 1.0 |
| 25% | 0.0 | | | |
| 50% | 0.0 | | | |
| 75% | 1.0 | | | |
| max | 1.0 | 800.0 | 4.0 | 4.0 |
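
The `count` row above is 400 for `admit` but 398 for `gre` and `gpa` and 399 for `prestige`, which suggests that a few values are missing. A minimal sketch of counting them directly follows; it uses a toy frame based on the `df.head()` rows, and the positions of the `NaN` entries are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame based on the df.head() rows; the NaN entries are hypothetical,
# inserted only to illustrate the check.
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0],
    'gre':      [380.0, np.nan, 800.0, 640.0, 520.0],
    'gpa':      [3.61, 3.67, np.nan, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, np.nan],
})

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
complete_rows = toy.dropna()             # keep only rows with no missing values

print(missing_per_column)
print(len(toy), 'rows ->', len(complete_rows), 'complete rows')
```

Comparing the row count before and after `dropna()` shows how much data a call such as `df.dropna().corr()` leaves out.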


In [14]:
df.dropna().corr()

Unnamed: 0,admit,gre,gpa,prestige
admit,1.0,0.181202,0.174116,-0.243563
gre,0.181202,1.0,0.382408,-0.124533
gpa,0.174116,0.382408,1.0,-0.060976
prestige,-0.243563,-0.124533,-0.060976,1.0


In [15]:
df.plot(kind='box')


Answer: The outcome is `admit`, predicted for a new student: 1 if they will be admitted and 0 if they will not.

> ### Question 7.  What are the predictors/covariates?

Answer: The predictors here are GRE, GPA, and Prestige.

> ### Question 8.  What timeframe is this data relevant for?

Answer: We do not know the exact timeframe but we can assume it was recorded prior to 12/5/16. 

> ### Question 9.  What is the hypothesis?

Answer: Students with a high GPA, a high GRE score, and a strong prestige rank (closer to 1) are admitted to UCLA, while students with a low GPA, a low GRE score, and a weak prestige rank (closer to 4) are not.

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Is it possible to predict with reasonable accuracy whether or not a student will be admitted, based on the three available predictors: GPA, GRE, and prestige?

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design.  Guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: To understand the data in ways that the model will not indicate. For example, EDA helps us understand if GPA is evenly distributed or skewed, if there is a minimum level of GPA needed to gain admittance or if GPAs are more often round numbers.
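
Each of the three example questions above (skew, the minimum GPA among admitted students, round-number GPAs) can be checked with a line or two of pandas. The sketch below assumes hypothetical GPA values, and treating "round" as "a multiple of 0.5" is an arbitrary choice made for illustration.

```python
import pandas as pd

# Hypothetical values for illustration only; not the UCLA data.
toy = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0, 0, 1, 0],
    'gpa':   [3.61, 3.67, 4.0, 3.19, 2.93, 3.0, 3.5, 2.5],
})

# Skewness: close to 0 for a symmetric distribution,
# negative when the longer tail is on the low side.
gpa_skew = toy['gpa'].skew()

# Lowest GPA observed among admitted students.
min_admitted_gpa = toy.loc[toy['admit'] == 1, 'gpa'].min()

# Share of GPAs that are multiples of 0.5 (a crude "round number" check).
is_round = (toy['gpa'] * 2).round(6) % 1 == 0
round_share = is_round.mean()

print(gpa_skew, min_admitted_gpa, round_share)
```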

> ### Question 12.  What are the assumptions of the distribution of data?

Answer: The assumption is that the predictor variables are not randomly distributed. If a predictor turns out to be random, it might be excluded from the model.

> ### Question 13.  How will you determine the distribution of your data?

Answer: By performing a series of visualizations like histograms, box plots, scatter plots, and frequency charts.
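
The numbers behind a histogram or a frequency chart can also be tabulated without plotting. This is a minimal sketch using the five `df.head()` rows as a toy frame; the 200-point GRE bins are an assumption, chosen only because they cover the 220 to 800 range reported by `df.describe()`.

```python
import pandas as pd

# Toy frame: the five rows printed by df.head(), for illustration only.
sample = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0],
    'gre':      [380.0, 660.0, 800.0, 640.0, 520.0],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Histogram counts for a continuous variable: bin GRE into 200-point ranges.
# Bins are right-inclusive, e.g. (600, 800] contains a score of 800.
gre_bins = pd.cut(sample['gre'], bins=[200, 400, 600, 800])
gre_counts = gre_bins.value_counts().sort_index()

# Frequency table for a categorical variable.
prestige_counts = sample['prestige'].value_counts().sort_index()

print(gre_counts)
print(prestige_counts)
```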

> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers might reveal that there are meaningful exceptions to the model, or that there are typos in the data, in which case those points can be excluded.

> ### Question 15.  How will you test for outliers?

Answer: We can test for outliers by creating a box plot or by counting how many data points fall below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`.
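
The IQR rule above can be sketched as a small helper. `iqr_outliers` is a hypothetical name and the GRE scores are made-up values; pandas' default linear interpolation is used for the quartiles, so other quartile conventions may give slightly different fences.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical GRE scores; 220 is included as a deliberately low value.
gre = pd.Series([220.0, 520.0, 540.0, 560.0, 580.0, 600.0, 620.0, 640.0, 660.0])

# Here Q1 = 540 and Q3 = 620, so the fences are 420 and 740.
outliers = iqr_outliers(gre)
print(outliers)
```

These are the same fences a box plot typically uses for its whiskers, so the points returned here should match the ones a box plot draws individually.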

> ### Question 16.  What is colinearity?

Answer: Colinearity is when one predictor can be predicted reasonably well by a linear function of another predictor in the data set.

> ### Question 17.  How will you test for covariance?

Answer: By either quickly plotting the variables on scatter plots with a predictor on each axis, or running `df.corr()` to see if the variables are highly correlated.
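
As a rough sketch of the `df.corr()` check, the predictor-only part of the correlation matrix printed earlier by `df.dropna().corr()` can be scanned for pairs above a cutoff. The 0.8 threshold is a hypothetical choice, not a fixed rule.

```python
import pandas as pd

# Predictor-only part of the df.dropna().corr() output shown earlier.
cols = ['gre', 'gpa', 'prestige']
corr = pd.DataFrame(
    [[1.0, 0.382408, -0.124533],
     [0.382408, 1.0, -0.060976],
     [-0.124533, -0.060976, 1.0]],
    index=cols, columns=cols,
)

threshold = 0.8  # hypothetical cutoff for "highly correlated"

# Visit each predictor pair once (upper triangle of the matrix).
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(cols)
         for b in cols[i + 1:]]

high_pairs = [p for p in pairs if abs(p[2]) > threshold]
strongest = max(pairs, key=lambda p: abs(p[2]))

print(high_pairs)
print(strongest)
```

With these values no pair exceeds the cutoff; the strongest relationship is between `gre` and `gpa`, at about 0.38.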


> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer: My plan for an EDA: 
1. Acquire the dataset
2. Establish the predictors and expected outcome
3. Formulate a testable hypothesis
4. Outline your problem statement
5. Plot the data in a histogram, scatter plot, box plot, and frequency chart. Also, run `df.describe()` and `df.corr()` to get an initial picture of the data.
6. Determine if there are any outliers in the data set and if these outliers are meaningful 
7. Determine if there is any colinearity or covariance between predictors