<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. 

_Citation to be updated on export_

# Data Preparation for Machine Learning
----

In this notebook, we will go over the data preparation process for setting up our tables for Machine Learning. The next notebook will discuss actually applying the machine learning methods as well as the evaluation process. 

## Motivation

We want to use characteristics about individuals based on their answers in the SED as well as characteristics about their institution based on the HERD in order to predict whether a graduate student goes into academia after receiving their doctorate. More specifically, we will base the outcome on the answers to the SDR two years afterwards, which limits the doctorate recipients that we predict on to scientists and engineers. That is, our main question of interest is:

> **Which science, engineering, and health students will go into academia upon receiving their PhD?**

## Python Setup
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `sqlalchemy` from previous tutorials. We'll be using these same tools, as well as many SQL queries, in order to prepare our data for our Machine Learning problem. 

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# We will use weighted statistics
from statsmodels.stats.weightstats import DescrStatsW

In [None]:
# set database connections
host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = create_engine(connection_string)

> **NOTE:** In this notebook, we create a series of tables to use later on. To avoid having everyone in the class run this notebook and simultaneously try to create the same tables from the same sources, we have used conditional statements that skip table creation if the table already exists. When you adapt this code for your projects, make sure that you are actually creating the tables, and that you are creating them in the appropriate schema (`ada_ncses_2019`).

In [None]:
qry = '''
SELECT * 
FROM pg_tables
WHERE schemaname = 'ada_ncses_2019'
'''

# List of tables inside ada_ncses_2019
tables = pd.read_sql(qry,conn)
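
The "create only if missing" pattern described in the note can be sketched on its own. The example below is a minimal illustration that assumes an in-memory SQLite database instead of the class PostgreSQL server; the helper `create_table_if_missing` and the table `demo_label` are hypothetical names, and `sqlite_master` plays the role that `pg_tables` plays in the query above.

```python
import sqlite3


def create_table_if_missing(conn, table_name, create_sql):
    """Run create_sql only when table_name does not exist yet.

    Returns True if the table was created, False if it was already there.
    """
    # sqlite_master is SQLite's catalog of objects; the notebook queries
    # pg_tables in PostgreSQL for the same purpose.
    existing = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table_name in existing:
        return False
    conn.execute(create_sql)
    return True


# Hypothetical toy table, not one of the class tables
demo_conn = sqlite3.connect(":memory:")
create_sql = "CREATE TABLE demo_label (drf_id TEXT, label INTEGER)"

first_run = create_table_if_missing(demo_conn, "demo_label", create_sql)
second_run = create_table_if_missing(demo_conn, "demo_label", create_sql)
print(first_run, second_run)
```

Because the helper looks up the catalog every time it is called, it does not rely on a table list that was fetched once at the top of the notebook and may have gone stale since.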

## Creating Labels

Labels are the dependent variables, or y variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, often encoded as 1 or 0. We start by defining a cohort of people to predict on as well as their associated labels. This is what we'll use as our basis for adding on any features, or predictor variables.

It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).

For this example, our training set will be the cohort of graduate students who received their doctorate in the 2012-2013 academic year and were part of the 2015 SDR. The testing set will be the cohort of graduate students who graduated in 2014-2015 and answered the 2017 SDR. Don't worry too much about how the training and testing sets work for now; we'll cover this in more detail in the next notebook.

### Outcome: Predict whether a graduate is in academia two years after receiving their PhD.

Our training cohort consists of 2012-2013 academic year graduates, while the testing cohort consists of 2014-2015 academic year graduates. We will now create a `label` variable that is set to `1` if a person went into academia after earning their doctorate and `0` if not. This will, at the same time, define the cohort of people for our training and testing sets.

In [None]:
# create label table
sql = """
DROP TABLE IF EXISTS ada_ncses_2019.sdr_label_2013;
CREATE TABLE ada_ncses_2019.sdr_label_2013 AS
SELECT sed.drf_id, sdr.refid, sed.phdinst, wtsurvy,
    CASE WHEN (edtp != '1' and edtp != 'L') THEN 1 ELSE 0 END as label
FROM ncses_2019.nsf_sed sed
JOIN ncses_2019.sdr_drfid_2015 xwalk
ON sed.drf_id = xwalk.drf_id
JOIN ncses_2019.nsf_sdr_2015 sdr
ON xwalk.refid = sdr.refid
WHERE sed.phdfy = '2013'
"""

if 'sdr_label_2013' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)

We are using a JOIN with the SED and SDR data, since we want to take the outcome from the SDR. Note that we are using the `CASE WHEN` statement here. This gives us a way to create a new binary variable, setting it equal to 1 when `edtp` is not one of the codes that are associated with a non-academia job, and 0 otherwise. We do the same with the 2015 SED cohort.

In [None]:
# create label table
sql = """
DROP TABLE IF EXISTS ada_ncses_2019.sdr_label_2015;
CREATE TABLE ada_ncses_2019.sdr_label_2015 AS
SELECT sed.drf_id, sdr.refid, sed.phdinst, wtsurvy,
    CASE WHEN (edtp != '1' and edtp != 'L') THEN 1 ELSE 0 END as label
FROM ncses_2019.nsf_sed sed
JOIN ncses_2019.sdr_drfid_2017 xwalk
ON sed.drf_id = xwalk.drf_id
JOIN ncses_2019.nsf_sdr_2017 sdr
ON xwalk.refid = sdr.refid
WHERE sed.phdfy = '2015'
"""

if 'sdr_label_2015' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)
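
To see what the `CASE WHEN` expression in the label queries does row by row, it can help to run it against a tiny table. This is a sketch under stated assumptions: it uses an in-memory SQLite database rather than the class PostgreSQL server, the table `toy_sdr` is hypothetical, and the `edtp` values `'2'` and `'3'` are invented for illustration (only `'1'` and `'L'` come from the queries above).

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE toy_sdr (refid TEXT, edtp TEXT)")
conn_demo.executemany(
    "INSERT INTO toy_sdr VALUES (?, ?)",
    [("a", "1"), ("b", "2"), ("c", "L"), ("d", "3"), ("e", None)],
)

# Same CASE WHEN expression as in the label queries
rows = conn_demo.execute(
    """
    SELECT refid,
           CASE WHEN (edtp != '1' AND edtp != 'L') THEN 1 ELSE 0 END AS label
    FROM toy_sdr
    ORDER BY refid
    """
).fetchall()

for refid, label in rows:
    print(refid, label)
```

Note the last row: when `edtp` is `NULL`, the comparison is neither true nor false, so the row falls through to the `ELSE` branch and receives a label of 0. If the real data contain missing `edtp` values, it may be worth deciding explicitly how they should be labeled.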

Now that we've created two label tables, let's take a look at them to see if they seem to be what we want.

In [None]:
df = pd.read_sql("SELECT * FROM ada_ncses_2019.sdr_label_2013", conn)

In [None]:
df.head()

Let's take a look at the balance in our label. This is important for later, because this will provide the basis for our random model baseline in the evaluation portion of the machine learning process. 

Since the SDR uses survey weights, we make sure to use the weights in calculating our proportion.

In [None]:
wtstats = DescrStatsW(df.label, weights = df.wtsurvy)
wtstats.mean
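
The weighted mean reported above is the sum of weight times label divided by the sum of the weights. The short sketch below works through that arithmetic by hand on made-up numbers (the labels and weights are purely illustrative, not SDR values) to show how the weighted and unweighted proportions can differ.

```python
# Made-up labels and survey weights for four respondents
labels = [1, 0, 0, 1]
weights = [10.0, 30.0, 40.0, 20.0]

# Unweighted proportion: every respondent counts equally
unweighted = sum(labels) / len(labels)

# Weighted proportion: each respondent counts in proportion to their weight
weighted = sum(w * y for w, y in zip(weights, labels)) / sum(weights)

print(unweighted, weighted)
```

Here half of the respondents have `label = 1`, but they carry only 30 of the 100 total weight units, so the weighted proportion is lower than the unweighted one.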

We can check what the actual values are in the dataset using the `crosstab` function.

In [None]:
pd.crosstab(index = df['label'], columns =  'count')

We'll do the same for our 2015 cohort.

In [None]:
df = pd.read_sql("SELECT * FROM ada_ncses_2019.sdr_label_2015", conn)
wtstats = DescrStatsW(df.label, weights = df.wtsurvy)
wtstats.mean

It seems as though between around XX% and around XX% of doctoral recipients in science, engineering, and health go into academia.

<font color=red><h3> Checkpoint 1: Create a label table</h3></font>

Try creating a different label table based on a slightly different definition of the label. How would you create the labels if you were interested in whether graduates went into a government job? What if you wanted a different cohort (for example, only looking at people in certain fields)?

In [None]:
# create label table
sql = """
SELECT sed.drf_id, sdr.refid, sed.phdinst, wtsurvy,
    CASE WHEN (sdr.emsecsm = '1' and sdr.emsecsm != 'L') THEN 1 ELSE 0 END as label
FROM ncses_2019.nsf_sed sed
JOIN ncses_2019.sdr_drfid_2015 xwalk
ON sed.drf_id = xwalk.drf_id
JOIN ncses_2019.nsf_sdr_2015 sdr
ON xwalk.refid = sdr.refid
WHERE sed.phdfy = '2013'
"""

In [None]:
# read the SQL code into pandas
df = pd.read_sql(sql, conn)

In [None]:
# Find how many people get a label = 1 (working at an educational institution)
len(df[df['label'] == 1])

Those who work in business (change the variable to `emsecsm = '3'`):

In [None]:
# create label table
sql = """
SELECT sed.drf_id, sdr.refid, sed.phdinst, wtsurvy,
    CASE WHEN (sdr.emsecsm = '3' and sdr.emsecsm != 'L') THEN 1 ELSE 0 END as label
FROM ncses_2019.nsf_sed sed
JOIN ncses_2019.sdr_drfid_2015 xwalk
ON sed.drf_id = xwalk.drf_id
JOIN ncses_2019.nsf_sdr_2015 sdr
ON xwalk.refid = sdr.refid
WHERE sed.phdfy = '2013'
"""

In [None]:
# read the SQL code into pandas
df = pd.read_sql(sql, conn)

In [None]:
# Find how many people get a label = 1 (working in business)
len(df[df['label'] == 1])

## Creating Features

Our features are our independent variables or predictors. Good features make machine learning systems effective. 
The better the features, the easier it is to capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand than extensively searching for the "right" model and the "right" set of parameters. 

Machine learning algorithms learn a solution to a problem from sample data. The set of features is the representation of that sample data from which the algorithm learns, so it should describe the problem as well as possible. 

- **Feature engineering** is "the process of transforming raw data into features that better represent the underlying problem/data/structure to the predictive models, resulting in improved model accuracy on unseen data" (from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)). In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

Examples of feature engineering include: 

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by various approaches like equal width, deciles, Fisher-Jenks, etc. 
- **Aggregation**. Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one feature, aggregating over varying windows of time and space. For example, for policing or criminal justice problems, we may want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
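
The four kinds of feature engineering listed above can each be sketched in a line or two of `pandas`. The DataFrame below is entirely made up (the institutions, fields, funding amounts, and ages are illustrative assumptions, not values from the SED or HERD), and the bin edges are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per doctorate recipient
toy = pd.DataFrame({
    "inst": ["A", "A", "B", "B", "B"],
    "field": ["bio", "chem", "bio", "physics", "chem"],
    "total_rd": [100.0, 100.0, 2500.0, 2500.0, 2500.0],
    "age": [27, 34, 29, 41, 31],
})

# Transformation: log(1 + x) compresses a skewed funding variable
toy["log_total_rd"] = np.log1p(toy["total_rd"])

# Dummy variables: one indicator column per category of `field`
field_dummies = pd.get_dummies(toy["field"], prefix="field")

# Discretization: bin age into intervals (20, 30], (30, 40], (40, 50]
toy["age_bin"] = pd.cut(
    toy["age"], bins=[20, 30, 40, 50], labels=["21-30", "31-40", "41-50"]
)

# Aggregation: summarize individuals up to the institution level
inst_summary = toy.groupby("inst")["age"].agg(["count", "mean", "max"])

print(toy)
print(field_dummies)
print(inst_summary)
```

In practice the choice of transformation, bin edges, and aggregation window should come from what you know about the data rather than from defaults like these.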

>This notebook walks through creating the following features:
>- Individual-level characteristics, taken from the SED.
>- Institution-level characteristics from HERD.

### Feature Creation Plan

We will be creating a series of temporary tables containing all of the features we want to include in our model. These tables will be:
- `features_ind`: This table will contain information from the SED about the individual, such as source of funding, race, years worked on dissertation, etc.
- `features_inst`: This table will contain information about the institution.

We will then join these tables together, along with our labels table, to create the full dataset that we use for our machine learning problem.

### Individual level features

We will start by creating a table containing all of the individual level features. In general, this is mostly information taken from the SED.

For many variables, we can simply use them as they are. However, `sklearn`, which we will be using for our machine learning algorithms, requires us to convert all categorical variables into binary variables (that is, make dummy variables). We can do that in Python, so it will be covered in the next notebook. 
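
As a preview of the dummy-variable step, here is a minimal sketch using `pd.get_dummies`. The category values are invented, and the two small DataFrames only stand in for the training and testing cohorts. One detail worth illustrating is that a category present in one cohort may be absent from the other, so the sketch aligns the testing columns to the training columns.

```python
import pandas as pd

# Hypothetical stand-ins for the training and testing cohorts
train = pd.DataFrame({"sex": ["F", "M", "F"], "race": ["a", "b", "c"]})
test = pd.DataFrame({"sex": ["M", "F"], "race": ["a", "a"]})

# One indicator column per category value
X_train = pd.get_dummies(train, columns=["sex", "race"])
X_test = pd.get_dummies(test, columns=["sex", "race"])

# The testing cohort never saw race 'b' or 'c', so those columns are missing.
# Reindex to the training columns and fill the absent indicators with 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_train.columns.tolist())
print(X_test)
```

Reindexing to the training columns also discards any category that appears only in the testing data, which is usually what you want since the model was never trained on it.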

We'll start by first bringing in the variables that are in the SED.

> The list of variables we're including isn't exhaustive, or even what you might be interested in looking at. These are just sample features, and you should think carefully about what you're interested in and what to include.

In [None]:
# SED Individual variables
sql = '''
DROP TABLE IF EXISTS features_ind_2013;
CREATE TEMP TABLE features_ind_2013 AS 
SELECT cohort.drf_id, cohort.label, -- ID and Label
doccode, tuitrems, srceprim, srce1ed, srcesec, srcea, srceb, srcec, srced, srcef, srceg, srceh, srcei, srcej, srcek, srcel, srcem, srcen, -- funding
udebtlvl, gdebtlvl, -- debt info
phdfield_name, --phd info
race, sex, -- demographic variables 
age_at_dissertation, -- age at time of doctorate
yrscours, yrsdisst,yrsnotwrk, -- PhD Program workload
wtsurvy -- survey weight
FROM ada_ncses_2019.sdr_label_2013 cohort
LEFT JOIN ncses_2019.nsf_sed sed
ON cohort.drf_id = sed.drf_id
where phdfy = '2013';
'''
conn.execute(sql)


In [None]:
# SED Individual variables
sql = '''
DROP TABLE IF EXISTS features_ind_2015;
CREATE TEMP TABLE features_ind_2015 AS 
SELECT cohort.drf_id, cohort.label, -- ID and Label
doccode, tuitrems, srceprim, srce1ed, srcesec, srcea, srceb, srcec, srced, srcef, srceg, srceh, srcei, srcej, srcek, srcel, srcem, srcen, -- funding
udebtlvl, gdebtlvl, -- debt info
phdfield_name, --phd info
race, sex, -- demographic variables 
age_at_dissertation, -- age at time of doctorate
yrscours, yrsdisst,yrsnotwrk, -- PhD Program workload
wtsurvy -- survey weight
FROM ada_ncses_2019.sdr_label_2015 cohort
LEFT JOIN ncses_2019.nsf_sed sed
ON cohort.drf_id = sed.drf_id
where phdfy = '2015';
'''
conn.execute(sql)


Let's take a quick look at the data to make sure it's working as intended.

In [None]:
df = pd.read_sql('select * from features_ind_2013', conn)
df.head()

In [None]:
df = pd.read_sql('select * from features_ind_2015', conn)
df.head()

### Add in Institutional characteristics

Now, we want to add in institutional characteristics.

In [None]:
# Add in institutional level features 
sql = '''
DROP TABLE IF EXISTS features_inst_2013;
CREATE TEMP TABLE features_inst_2013 AS
SELECT cohort.drf_id, cohort.phdinst,
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM ada_ncses_2019.sdr_label_2013 cohort
LEFT JOIN ncses_2019.nsf_herd herd
ON cohort.phdinst = herd.ipeds_inst_id 
where herd.year = '2013'
'''

conn.execute(sql)

In [None]:
# Add in institutional level features 
sql = '''
DROP TABLE IF EXISTS features_inst_2015;
CREATE TEMP TABLE features_inst_2015 AS
SELECT cohort.drf_id, cohort.phdinst, 
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM ada_ncses_2019.sdr_label_2015 cohort
LEFT JOIN ncses_2019.nsf_herd herd
ON cohort.phdinst = herd.ipeds_inst_id
where herd.year = '2015'
'''

conn.execute(sql)
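
One behavior of the institutional queries is worth knowing about: filtering on a column of the right-hand table in `WHERE` after a `LEFT JOIN` removes cohort members who have no matching HERD row, because their `herd.year` is `NULL`. The sketch below illustrates this on a toy example. It assumes an in-memory SQLite database, and the tables `cohort` and `herd` here are hypothetical two-row stand-ins, not the class tables.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE cohort (drf_id TEXT, phdinst TEXT);
CREATE TABLE herd (ipeds_inst_id TEXT, year TEXT, total_rd REAL);
INSERT INTO cohort VALUES ('p1', 'inst1'), ('p2', 'inst2');
INSERT INTO herd VALUES ('inst1', '2015', 500.0);
""")

# Filter in WHERE: the unmatched person p2 has herd.year = NULL and is dropped
where_rows = db.execute("""
SELECT cohort.drf_id, herd.total_rd
FROM cohort
LEFT JOIN herd ON cohort.phdinst = herd.ipeds_inst_id
WHERE herd.year = '2015'
ORDER BY cohort.drf_id
""").fetchall()

# Filter in ON: p2 is kept, with NULL for the institutional feature
on_rows = db.execute("""
SELECT cohort.drf_id, herd.total_rd
FROM cohort
LEFT JOIN herd ON cohort.phdinst = herd.ipeds_inst_id
               AND herd.year = '2015'
ORDER BY cohort.drf_id
""").fetchall()

print(where_rows)
print(on_rows)
```

Either form can be appropriate. The point is only that the row count of the institutional feature table may be smaller than the cohort, which is one reason to check `df.shape` and the missing values after joining.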

As before, we'll do a little checking to make sure the variables we're bringing in look like they should.

In [None]:
df = pd.read_sql('select * from features_inst_2015', conn)

In [None]:
df.shape

Note that we can use the `isna()` method for DataFrames in order to check if there are missing values. We use it in conjunction with the `sum()` method to find how many missing values there are in each column.

In [None]:
df.isna().sum()
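
To make the `isna()` and `sum()` combination concrete, here is a small sketch on a made-up DataFrame (the values are illustrative only). `isna()` returns a table of `True`/`False` values, and summing treats `True` as 1, so the result is a count of missing values per column. Taking the mean instead gives the share missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "drf_id": ["p1", "p2", "p3", "p4"],
    "hbcu_flag": [0, 1, np.nan, 0],
    "total_rd": [100.0, np.nan, np.nan, 250.0],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```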

In [None]:
df.head(10)

### Combining all to make a features table

Now that we've created all of our individual and institution level feature tables, we can combine them all into one big table that we will use for our machine learning problem. 

In [None]:
sql = '''
DROP TABLE IF EXISTS ada_ncses_2019.sdr_ml_2013;
CREATE TABLE ada_ncses_2019.sdr_ml_2013 AS
SELECT ind.drf_id, label, wtsurvy, -- ID and label
tuitrems, srceprim, srce1ed, srcesec, srcea, srceb, srcec, srced, srcef, srceg, srceh, srcei, srcej, srcek, srcel, srcem, srcen, -- funding
udebtlvl, gdebtlvl, -- debt info
phdfield_name, --phd info
race, sex, -- demographic variables 
age_at_dissertation, -- age at time of doctorate
yrscours, yrsdisst,yrsnotwrk, -- PhD Program workload
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM features_ind_2013 ind
LEFT JOIN features_inst_2013 inst
ON ind.drf_id = inst.drf_id
'''

if 'sdr_ml_2013' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)

In [None]:
sql = '''
DROP TABLE IF EXISTS ada_ncses_2019.sdr_ml_2015;
CREATE TABLE ada_ncses_2019.sdr_ml_2015 AS
SELECT ind.drf_id, label, wtsurvy, -- ID and label
tuitrems, srceprim, srce1ed, srcesec, srcea, srceb, srcec, srced, srcef, srceg, srceh, srcei, srcej, srcek, srcel, srcem, srcen, -- funding
udebtlvl, gdebtlvl, -- debt info
phdfield_name, --phd info
race, sex, -- demographic variables 
age_at_dissertation, -- age at time of doctorate
yrscours, yrsdisst,yrsnotwrk, -- PhD Program workload
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM features_ind_2015 ind
LEFT JOIN features_inst_2015 inst
ON ind.drf_id = inst.drf_id
'''

if 'sdr_ml_2015' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)

In [None]:
df = pd.read_sql('select * from ada_ncses_2019.sdr_ml_2015', conn)
print(df.shape)
df.head()

<font color=red><h3>Checkpoint 2: Create a Feature and add to the feature table</h3></font>

What are some additional features you might want to add? Think about the different variables in all of the different tables that you have access to, both at the institutional level and at the individual level.

In [None]:
df.columns

In [None]:
# Add `race2` variable
sql = '''
SELECT cohort.drf_id, cohort.label, -- ID and Label
race2 -- detailed ethnicity code
FROM ada_ncses_2019.sdr_label_2015 cohort
LEFT JOIN ncses_2019.nsf_sed sed
ON cohort.drf_id = sed.drf_id
where phdfy = '2015';
'''

In [None]:
# Read into a pandas dataframe
race2 = pd.read_sql(sql,conn)

In [None]:
# Merge with the existing table with features
added_features = df.merge(race2, on=['drf_id','label'])

In [None]:
# Check that the new variable has been added to the columns list
added_features.columns
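
When adding features with `merge`, it can be useful to confirm that no rows were lost or duplicated. By default `merge` performs an inner join, so rows without a match are dropped. The sketch below uses made-up DataFrames (hypothetical IDs and values) together with the `how`, `validate`, and `indicator` arguments of `merge` to keep every row of the feature table and report which ones found no match.

```python
import pandas as pd

# Hypothetical existing feature table and a new feature to add
features = pd.DataFrame({
    "drf_id": ["p1", "p2", "p3"],
    "label": [1, 0, 1],
    "sex": ["F", "M", "F"],
})
extra = pd.DataFrame({
    "drf_id": ["p1", "p2"],
    "label": [1, 0],
    "race2": ["x", "y"],
})

merged = features.merge(
    extra,
    on=["drf_id", "label"],
    how="left",               # keep every row of `features`
    validate="one_to_one",    # raise if the keys are duplicated on either side
    indicator=True,           # add a `_merge` column describing each row's match
)

unmatched = (merged["_merge"] == "left_only").sum()
print(len(features), len(merged), unmatched)
```

With an inner join, person `p3` would have silently disappeared from the result. With the left join they are kept, and `race2` is missing for them.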

## Notes and Considerations

Notice that there are missing values in the final table we created. We'll have to figure out how to deal with those. Most scikit-learn estimators raise an error when the input contains missing values rather than handling them for us, and simply dropping every incomplete row (listwise deletion) is not always desirable. In addition, we should carefully consider whether there are any other errors in our dataset. For example, there might have been data entry errors, or coding mistakes when transferring the data.

### Removing Outliers 

**It is never a good idea to drop observations without prior investigation AND a good reason to believe the data is wrong!** 
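
One way to investigate before dropping anything is to flag unusual values and look at them. The sketch below applies a common rule of thumb (values more than 1.5 interquartile ranges beyond the quartiles) to a made-up series. The numbers are illustrative, and the rule is an assumption chosen for the example, not a method prescribed by this notebook.

```python
import pandas as pd

# Hypothetical years-on-dissertation values with one suspicious entry
yrs = pd.Series([4, 5, 5, 6, 6, 7, 30], name="yrsdisst")

q1 = yrs.quantile(0.25)
q3 = yrs.quantile(0.75)
iqr = q3 - q1

# Flag, rather than drop, values far outside the middle of the distribution
is_outlier = (yrs < q1 - 1.5 * iqr) | (yrs > q3 + 1.5 * iqr)
flagged = yrs[is_outlier]

print(flagged)
```

A flagged value is a prompt to go back to the source and the codebook. Whether it is an error or a genuine extreme case is a judgment the rule cannot make for you.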

### Imputing Missing Values

There are many ways of imputing missing values based on the rest of the data. Missing values can be imputed with the median of the rest of the data, or you can use other characteristics (e.g., industry, geography, etc.) to inform the imputed value.
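
Median imputation can be sketched in a few lines of `pandas`. The example below is a minimal illustration on made-up values. It makes two assumptions that are design choices rather than requirements: the median is computed from the training data only and then reused for the testing data, and an indicator column records which values were originally missing.

```python
import numpy as np
import pandas as pd

# Hypothetical training and testing data with missing funding values
train = pd.DataFrame({"total_rd": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_rd": [np.nan, 250.0]})

# Median of the observed training values (missing values are skipped)
train_median = train["total_rd"].median()

for frame in (train, test):
    # Record which rows were missing before filling them in
    frame["total_rd_missing"] = frame["total_rd"].isna().astype(int)
    # Fill with the training median in both sets
    frame["total_rd"] = frame["total_rd"].fillna(train_median)

print(train)
print(test)
```

Using the training median for both sets keeps information from the testing cohort out of the preparation step, and the indicator column lets a model learn whether missingness itself is informative.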

For our data, we have made an assumption about what "missing" means for each of our data's components (e.g., if the individual does not show up in the IDES data, we say they do not have a job in that time period).

Before running any machine learning algorithms, we have to ensure there are no `NULL` (or `NaN`) values in the data for both our testing and training sets. As you have heard before, __never remove observations with missing values without considering the data you are dropping__. One easy way to check if there are any missing values with `Pandas` is to use the `.info()` method, which returns a count of non-null values for each column in your DataFrame.

In [None]:
df.info()