<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. 

_Citation to be updated on export_

# Data Preparation for Machine Learning
----

In this notebook, we will go over the data preparation process for setting up our tables for Machine Learning. The next notebook will discuss actually applying the machine learning methods as well as the evaluation process. 

## Motivation

We want to use information about an individual's funding history combined with characteristics about individuals based on their answers in the SED as well as characteristics about their institution based on the HERD, PatentsView, and Federal Reporter in order to predict whether a graduate student goes into academia after receiving their doctorate. More specifically, we will base the outcome on the answers to the post-graduation plan questions in SED to answer the question:

> **Of students who graduated from an IRIS member institution, who will go into academia upon receiving their PhD?**

Note that because we only have UMETRICS funding data from IRIS member institutions, we can only generalize to that population. If we try to generalize the results to be about PhD graduates in general, we are making an additional assumption that the IRIS member institutions are essentially the same as non-member institutions in their behavior.

## Python Setup
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `sqlalchemy` from previous tutorials. We'll be using these same tools, as well as many SQL queries, in order to prepare our data for our Machine Learning problem. 

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [None]:
# set database connections
host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = create_engine(connection_string)

> **NOTE:** In this notebook, we create a series of tables to use later on. To avoid having everyone in the class run this notebook and try to create the same tables from the same sources simultaneously, we use conditional statements that skip table creation if the table already exists. When you adapt this code for your projects, make sure you are actually creating the tables, and that you are creating them in the appropriate schema (`ada_ncses_2019`).

In [None]:
qry = '''
SELECT * 
FROM pg_tables
WHERE schemaname = 'ada_ncses_2019'
'''

tables = pd.read_sql(qry,conn)
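
The check-before-create pattern described in the note above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions: it uses an in-memory SQLite database from the standard library rather than the class PostgreSQL server, SQLite's `sqlite_master` catalog stands in for `pg_tables`, and the names `demo_conn`, `create_table_if_missing`, and `demo_label` are hypothetical.

```python
import sqlite3

def create_table_if_missing(conn, table_name, create_sql):
    """Run create_sql only if table_name is absent; return True if it was created."""
    # List the tables that already exist (SQLite's analogue of pg_tables)
    existing = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table_name in existing:
        print('Table already created.')
        return False
    conn.execute(create_sql)
    return True

demo_conn = sqlite3.connect(':memory:')  # throwaway database for the sketch
create_sql = "CREATE TABLE demo_label (drf_id TEXT, label INTEGER)"

first_call = create_table_if_missing(demo_conn, 'demo_label', create_sql)
second_call = create_table_if_missing(demo_conn, 'demo_label', create_sql)
```

The first call creates the table and the second call skips it, which is the behavior the conditional statements in this notebook rely on.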

## Creating Labels

Labels are the dependent variables, or y variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, often encoded as 1 or 0. We start by defining a cohort of people to predict on as well as their associated labels. This is what we'll use as our basis for adding on any features, or predictor variables.

It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).

For this example, we want to use the funding information from the UMETRICS data, along with some basic demographic information and institutional characteristics, in order to predict whether a PhD recipient goes into academia or not. Since we want to maximize the amount of data that we have and our main limiting factor is the number of IRIS member universities, we chose to limit our analytical frame to just 2014-2015 graduates from IRIS member universities.

### Outcome: Predict whether an IRIS member university graduate goes into academia upon receiving their PhD.

We will now create a `label` variable that is set to `1` if a person went into academia after earning their doctorate and `0` if not. We will keep our full dataset together for now, and separate into training and testing sets in the next notebook.

In [None]:
# create label table
sql = """
DROP TABLE IF EXISTS ada_ncses_2019.umetrics_label;
CREATE TABLE ada_ncses_2019.umetrics_label AS
SELECT drf_id, phdinst,
    CASE WHEN pdemploy IN ('A', 'B', 'C', 'D') THEN 1 ELSE 0 END as label  
FROM ncses_2019.nsf_sed 
WHERE phdfy = '2015' 
AND 
phdinst in -- list IRIS member universities
    ('List of University IDs that cannot be disclosed');
"""

if 'umetrics_label' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)

We take the outcome from the SED, but we only want to include people who are from IRIS member institutions. Rather than joining to another table, the query filters the SED with a `WHERE ... IN` clause on the list of IRIS member institutions. Recall that we used this list in the Visualizations notebook for the funding history diagram, because these were the only institutions with information for 2012 to 2015. We then keep just the rows in the SED for people whose `phdinst` is in that list of member universities and who graduated in 2015.

Note that we are using the `CASE WHEN` statement here. This gives us a way to create a new binary variable, setting it equal to 1 when `pdemploy` is one of the codes that are associated with an academia job, and 0 otherwise.
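
To see how `CASE WHEN` behaves on a few rows, here is a small self-contained sketch. It assumes an in-memory SQLite database and a hypothetical table `toy_sed` with made-up IDs. Only the `pdemploy` codes in the `IN` list come from the query above.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')  # throwaway database for the sketch
demo_conn.execute("CREATE TABLE toy_sed (drf_id TEXT, pdemploy TEXT)")
demo_conn.executemany(
    "INSERT INTO toy_sed VALUES (?, ?)",
    [('p1', 'A'), ('p2', 'H'), ('p3', 'C'), ('p4', None)],  # p4 has no answer
)

# Same CASE WHEN expression as in the label query
rows = demo_conn.execute("""
    SELECT drf_id,
           CASE WHEN pdemploy IN ('A', 'B', 'C', 'D') THEN 1 ELSE 0 END AS label
    FROM toy_sed
    ORDER BY drf_id
""").fetchall()
print(rows)
```

Note that the row with a missing `pdemploy` falls into the `ELSE` branch and receives a label of 0, so people who did not answer the question are treated the same as people who reported a non-academic plan. Whether that is appropriate depends on your research question.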

Now that we've created the label table, let's take a look to see if it seems to be what we want.

In [None]:
df = pd.read_sql("SELECT * FROM ada_ncses_2019.umetrics_label", conn)

In [None]:
df.head()

Let's take a look at the balance in our label. This is important for later, because this will provide the basis for our random model baseline in the evaluation portion of the machine learning process. 

In [None]:
pd.crosstab(index = df['label'], columns =  'count')

In [None]:
df.describe()
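
As a rough sketch of why the label balance matters for the baseline, the example below computes the share of each class on made-up labels. The DataFrame `toy` and its values are hypothetical. The idea is that a model guessing at random is expected to have precision close to the share of positives, and a model that always predicts the majority class has accuracy equal to the majority share.

```python
import pandas as pd

# Ten hypothetical labels: three positives, seven negatives
toy = pd.DataFrame({'label': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Share of each class instead of raw counts
balance = toy['label'].value_counts(normalize=True)

positive_rate = balance.loc[1]     # expected precision of random guessing
majority_baseline = balance.max()  # accuracy of always predicting the majority class
print(balance)
```

Any model we train later should be judged against these numbers rather than against zero.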

<font color=red><h3> Checkpoint 1: Create a label table</h3></font>

Try creating a different label table based on a slightly different definition of the label. How would you create the labels if you were interested in whether graduates went into a government job? What if you wanted a different cohort (for example, only looking at people in certain fields)?

For example, if we want to subset by people who work in the U.S. government (federal, state, and local), we would specify: <br>`CASE WHEN pdemploy IN ('H','I','J')`

In [None]:
# create label table
sql = """
SELECT drf_id, phdinst,
    CASE WHEN pdemploy IN ('H', 'I', 'J') THEN 1 ELSE 0 END as label  
FROM ncses_2019.nsf_sed 
WHERE phdfy = '2015' 
AND 
phdinst in -- list IRIS member universities
    ('List of IRIS institution IDs that did not pass disclosure review');
"""

In [None]:
# Read-in the SQL
df = pd.read_sql(sql,conn)

In [None]:
# How many people with label = 1 (work in the U.S. government)
len(df[df['label'] == 1])

## Creating Features

Our features are our independent variables, or predictors. Good features make machine learning systems effective. 
The better the features, the easier it is to capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand than extensively searching for the "right" model and the "right" set of parameters. 

Machine learning algorithms learn a solution to a problem from sample data. The features are the representation of that sample data from which the algorithm learns. 

- **Feature engineering** is the process of transforming raw data into features that better represent the underlying problem/data/structure to the predictive models, resulting in improved model accuracy on unseen data (from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)). In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

Examples of feature engineering are: 

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by various approaches like equal width, deciles, Fisher-Jenks, etc. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one feature, aggregating over varying windows of time and space. For example, for policing or criminal justice problems, we may want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
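
The four kinds of feature engineering listed above can be sketched on a toy table. Everything in this example is made up for illustration: the DataFrame `toy`, its columns `person`, `field`, and `amount`, and the choice of three bins are assumptions and are not taken from the SED or UMETRICS data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'b', 'c'],
    'field': ['bio', 'bio', 'chem', 'chem', 'chem', 'bio'],
    'amount': [0.0, 99.0, 9.0, 999.0, 9.0, 9999.0],
})

# Transformation: log10(1 + x) compresses a long right tail
toy['log_amount'] = np.log10(1 + toy['amount'])

# Dummy variables: one 0/1 column per category of `field`
dummies = pd.get_dummies(toy['field'], prefix='field').astype(int)

# Discretization: three equal-width bins over the transformed values
toy['amount_bin'] = pd.cut(toy['log_amount'], bins=3,
                           labels=['low', 'mid', 'high'])

# Aggregation: collapse several rows per person into one row of summaries
per_person = toy.groupby('person')['amount'].agg(['count', 'mean', 'max'])
print(per_person)
```

The aggregation step is the one this notebook relies on most: the funding data has several rows per person, and the model needs exactly one row per person.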

>This notebook walks through creating the following features:
>- Individual-level funding history variables, taken from the UMETRICS.
>- Demographic characteristics, taken from the SED.
>- Institution-level characteristics from HERD.

### Feature Creation Plan

We will be creating a series of temporary tables containing all of the features we want to include in our model. These tables will be:
- `features_fund`: This table will contain information about the funding history of the individual from UMETRICS.
- `features_ind`: This table will contain demographic information about the individual such as race and sex.
- `features_inst`: This table will contain institution level information from HERD.

We will then join these tables together, along with our labels table, to create the full dataset that we use for our machine learning problem.

### Individual level features

We will start by creating a table containing all of the individual level features. We start by bringing in data from the UMETRICS files, then move on to the SED variables. For many variables, we can simply use them as they are. However, `sklearn`, which we will be using for our machine learning algorithms, requires us to convert all categorical variables into binary variables (that is, make dummy variables). We can do that in Python, so it will be covered in the next notebook. In addition, we might need to do some aggregating and reformatting in order to get the data that we want. This is most apparent in the UMETRICS data that we pull.

> Note: The way we decide to bring in the semester level funding data is based on how we're choosing to define the features. This could be different for your problem! Additionally, the code to format it is just one way of doing it. If you think of a different way to do it that makes more intuitive sense to you, that might be even better!

Let's start with the UMETRICS data.

In [None]:
# UMETRICS variables 
# Only getting whether a person was funded, as well as total semesters with any federal funding
# You can add more!

sql = '''
drop table if exists semester_funding;
create temp table semester_funding as
select a.drf_id, a.phdinst, 
cast(case when team_size >= 1 then 1 else 0 end as int) as any_federal,
cast(case when any_non_federal = 1 then 1 else 0 end as int) as any_non_federal,
case when semester = '2012-jan-apr' then 1 else 0 end as spr12,
case when semester = '2012-may-aug' then 1 else 0 end as sum12,
case when semester = '2012-sep-dec' then 1 else 0 end as fal12,
case when semester = '2013-jan-apr' then 1 else 0 end as spr13,
case when semester = '2013-may-aug' then 1 else 0 end as sum13,
case when semester = '2013-sep-dec' then 1 else 0 end as fal13,
case when semester = '2014-jan-apr' then 1 else 0 end as spr14,
case when semester = '2014-may-aug' then 1 else 0 end as sum14,
case when semester = '2014-sep-dec' then 1 else 0 end as fal14
from ada_ncses_2019.umetrics_label a
left join ncses_2019.sed_umetrics_xwalk b
on a.drf_id = b.drf_id
left join ncses_2019.iris_semester c 
on b.emp_number = c.emp_number;

drop table if exists features_fund;
create temp table features_fund as
select drf_id, 
sum(any_federal) as total_sem_federal,
sum(any_non_federal) as total_sem_non_federal,
sum(spr12) as spr12, sum(sum12) as sum12, sum(fal12) as fal12,
sum(spr13) as spr13, sum(sum13) as sum13, sum(fal13) as fal13,
sum(spr14) as spr14, sum(sum14) as sum14, sum(fal14) as fal14
from semester_funding
group by drf_id;
'''
conn.execute(sql)
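
The query above reshapes the semester-level data from one row per person and semester into one row per person: it first flags each semester with `CASE WHEN`, then sums the flags within each `drf_id`. The sketch below shows the same idea in a single query on made-up rows. It assumes an in-memory SQLite database and a hypothetical table `toy_semester`. Only the semester strings follow the format used above.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')  # throwaway database for the sketch
demo_conn.execute("CREATE TABLE toy_semester (drf_id TEXT, semester TEXT)")
demo_conn.executemany(
    "INSERT INTO toy_semester VALUES (?, ?)",
    [('p1', '2012-jan-apr'), ('p1', '2012-sep-dec'), ('p2', '2012-sep-dec')],
)

# Flag each semester with CASE WHEN, then sum the flags per person
wide = demo_conn.execute("""
    SELECT drf_id,
           SUM(CASE WHEN semester = '2012-jan-apr' THEN 1 ELSE 0 END) AS spr12,
           SUM(CASE WHEN semester = '2012-sep-dec' THEN 1 ELSE 0 END) AS fal12
    FROM toy_semester
    GROUP BY drf_id
    ORDER BY drf_id
""").fetchall()
print(wide)
```

When writing this kind of query by hand, check that each output column sums its own flag, since copy-and-paste mistakes in the aliases are easy to make and do not raise an error.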


Let's take a look at the `features_fund` table to make sure it's reasonable.

In [None]:
df = pd.read_sql('select * from features_fund', conn)
df.shape # Check number of rows and columns - is this number correct?

In [None]:
df.describe(include = 'all')

In [None]:
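# Proportion of people who were federally funded in all semesters in which they were funded
# Columns 3 onward are the semester indicators (spr12 ... fal14); this skips drf_id and the two totals
np.mean(df.iloc[:, 3:].sum(axis = 1) == df.total_sem_federal)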

Now, let's get the individual-level information.

In [None]:
# SED Individual variables
sql = '''
DROP TABLE IF EXISTS features_ind;
CREATE TEMP TABLE features_ind AS 
SELECT cohort.drf_id, cohort.label, -- ID and Label
phdfield_name, --phd info
race, sex -- demographic variables 
FROM ada_ncses_2019.umetrics_label cohort
LEFT JOIN ncses_2019.nsf_sed sed
ON cohort.drf_id = sed.drf_id
where phdfy = '2015';
'''
conn.execute(sql)


Let's take a quick look at the data to make sure it's working as intended.

In [None]:
df = pd.read_sql('select * from features_ind', conn)
df.head()

### Add in Institutional characteristics

Now, we want to add in institutional characteristics from the HERD data.

In [None]:
# Add in institutional level features from various sources
sql = '''
DROP TABLE IF EXISTS features_inst;
CREATE TEMP TABLE features_inst AS
SELECT cohort.drf_id, cohort.phdinst, 
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM ada_ncses_2019.umetrics_label cohort
LEFT JOIN ncses_2019.nsf_herd herd
ON cohort.phdinst = herd.ipeds_inst_id 
where herd.year = '2015' 
'''

conn.execute(sql)

As before, we'll do a little checking to make sure the variables we're bringing in look like they should.

In [None]:
df = pd.read_sql('select * from features_inst', conn)

In [None]:
df.shape

Note that we can use the `isna()` method for DataFrames in order to check if there are missing values. We use it in conjunction with the `sum()` method to find how many missing values there are in each column.

In [None]:
df.isna().sum()

In [None]:
df.head(100)

### Combining all to make an overall table with labels and features

Now that we've created all of our individual and institution level feature tables, we can combine them all into one big table that we will use for our machine learning problem. 

In [None]:
sql = '''
DROP TABLE IF EXISTS ada_ncses_2019.umetrics_ml_aggregate;
CREATE TABLE ada_ncses_2019.umetrics_ml_aggregate AS
SELECT ind.drf_id, label, -- ID and label
total_sem_federal, total_sem_non_federal, spr12, sum12, fal12, spr13, sum13, fal13, spr14, sum14, fal14, -- UMETRICS data
phdfield_name, --phd info
race, sex, -- demographic variables 
hhe_flag, hbcu_flag, -- Flags for HBCU and HHE
total_rd, federal_rd -- R&D Funding
FROM features_ind ind
LEFT JOIN features_fund fund
ON ind.drf_id = fund.drf_id
LEFT JOIN features_inst inst
ON ind.drf_id = inst.drf_id
'''

if 'umetrics_ml_aggregate' in tables.tablename.tolist():
    print('Table already created.')
else:
    conn.execute(sql)

In [None]:
df = pd.read_sql('select * from ada_ncses_2019.umetrics_ml_aggregate', conn)
print(df.shape)
df.head()
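
A useful sanity check after combining tables is that the left joins neither added nor dropped people. The sketch below shows the same join logic in pandas on made-up data. The DataFrames `cohort`, `fund`, and `inst` and their values are hypothetical stand-ins for the feature tables, and the `validate` argument of `merge` is used here as one possible guard against duplicate keys.

```python
import pandas as pd

cohort = pd.DataFrame({'drf_id': ['p1', 'p2', 'p3'], 'label': [1, 0, 1]})
fund = pd.DataFrame({'drf_id': ['p1', 'p2'], 'total_sem_federal': [4, 0]})
inst = pd.DataFrame({'drf_id': ['p1', 'p3'], 'total_rd': [120.5, 98.0]})

# Left joins keep every person in the cohort, even without a match;
# validate raises an error if drf_id is duplicated on either side
combined = (
    cohort
    .merge(fund, on='drf_id', how='left', validate='one_to_one')
    .merge(inst, on='drf_id', how='left', validate='one_to_one')
)

# The row count should equal the size of the cohort
assert len(combined) == len(cohort)

# Unmatched people show up as missing values, not as missing rows
missing_per_column = combined.isna().sum()
print(missing_per_column)
```

If the row count grows after a join, the key is probably not unique in one of the feature tables. If it shrinks, a filter has likely turned the left join into an inner join.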

<font color=red><h3>Checkpoint 2: Create a Feature and create an overall table</h3></font>

What are some additional features you might want to add? Think about the different variables in all of the different tables that you have access to, both at the institutional level and at the individual level.

## Notes and Considerations

Notice that there are missing values in the final table we created. We'll have to figure out how to deal with those. Most scikit-learn estimators will raise an error if the input contains missing values, and simply dropping every incomplete row (listwise deletion) is not always desirable. In addition, we should carefully consider whether there are any other errors in our dataset. For example, there might have been data entry errors, or coding mistakes when transferring the data.

### Removing Outliers 

**It is never a good idea to drop observations without prior investigation AND a good reason to believe the data is wrong!** 
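
One way to follow this advice is to flag unusual values for investigation instead of deleting them. The sketch below uses the 1.5 × IQR rule of thumb, which is an assumption chosen for illustration and not something this notebook prescribes. The column `total_rd` is filled with made-up numbers.

```python
import pandas as pd

toy = pd.DataFrame({'total_rd': [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 250.0]})

# Quartiles and interquartile range of the column
q1, q3 = toy['total_rd'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mark values outside the fences; nothing is dropped
toy['needs_review'] = ~toy['total_rd'].between(lower, upper)
flagged = toy[toy['needs_review']]
print(flagged)
```

The flagged rows are candidates to look up in the source data. Whether a flagged value is an error or a genuinely large institution is a judgment that requires domain knowledge.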

### Imputing Missing Values

There are many ways of imputing missing values based on the rest of the data. Missing values can be imputed to the median of the rest of the data, or you can use other characteristics (e.g., industry, geography, etc.) to impute within groups.
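
The two approaches can be sketched in pandas as follows. The data is made up, and using `field` as the grouping characteristic is an assumption for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'field': ['bio', 'bio', 'bio', 'chem', 'chem', 'chem'],
    'total_rd': [10.0, None, 30.0, 100.0, 200.0, None],
})

# Keep a record of which values were originally missing
toy['total_rd_was_missing'] = toy['total_rd'].isna().astype(int)

# Option 1: fill with the overall median of the observed values
overall = toy['total_rd'].fillna(toy['total_rd'].median())

# Option 2: fill with the median of the observed values in the same field
by_field = toy['total_rd'].fillna(
    toy.groupby('field')['total_rd'].transform('median')
)
print(pd.DataFrame({'overall': overall, 'by_field': by_field}))
```

In a real modeling workflow, the medians would typically be computed on the training set only and then applied to the testing set, so that no information from the testing set leaks into the features.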

For our data, we have made an assumption about what "missing" means for each of our data's components (e.g., if the individual does not show up in the IDES data, we say they do not have a job in that time period).

Before running any machine learning algorithms, we have to ensure there are no `NULL` (or `NaN`) values in the data for both our testing and training sets. As you have heard before, __never remove observations with missing values without considering the data you are dropping__. One easy way to check if there are any missing values with `Pandas` is to use the `.info()` method, which returns a count of non-null values for each column in your DataFrame.

In [None]:
df.info()