<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan. 

_Citation to be updated on export_

# Data Preparation for Machine Learning - Creating Labels
----

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Data-Preparation-for-Machine-Learning---Creating-Labels" data-toc-modified-id="Data-Preparation-for-Machine-Learning---Creating-Labels-1">Data Preparation for Machine Learning - Creating Labels</a></span><ul class="toc-item"><li><span><a href="#Python-Setup" data-toc-modified-id="Python-Setup-1.1">Python Setup</a></span></li><li><span><a href="#Creating-Labels" data-toc-modified-id="Creating-Labels-1.2">Creating Labels</a></span><ul class="toc-item"><li><span><a href="#Outcome-example:-not-employed-1-year-after-graduation" data-toc-modified-id="Outcome-example:-not-employed-1-year-after-graduation-1.2.1">Outcome example: not employed 1 year after graduation</a></span></li><li><span><a href="#Repeating-the-Label-Creation-Process" data-toc-modified-id="Repeating-the-Label-Creation-Process-1.2.2">Repeating the Label Creation Process</a></span></li><li><span><a href="#Writing-a-Function-to-Create-Labels" data-toc-modified-id="Writing-a-Function-to-Create-Labels-1.2.3">Writing a Function to Create Labels</a></span></li></ul></li></ul></li></ul></div>

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit models.

In [None]:
%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import time

In [None]:
# and set our database connection parameters
db_name = "appliedda"
hostname = "10.10.2.10"

## Creating Labels

Labels are the dependent variables, or *Y* variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, often encoded as 1 or 0. 

It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).
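
In this notebook each row is one graduate: the cohort query below keeps a single completion record per `ssn`. As a rough illustration of that rule, here is a minimal pure-Python sketch on made-up records. The helper name `one_row_per_person` and the toy values are assumptions for illustration only, not part of the course data or code.

```python
from datetime import date

# Toy completion records (hypothetical values): one person may earn several degrees.
completions = [
    {"ssn": "A", "degree_conferred_date": date(2009, 5, 15)},
    {"ssn": "A", "degree_conferred_date": date(2009, 12, 20)},
    {"ssn": "B", "degree_conferred_date": date(2009, 8, 1)},
]


def one_row_per_person(records):
    """Keep the earliest degree per ssn so each person appears exactly once."""
    earliest = {}
    ordered = sorted(records, key=lambda r: (r["ssn"], r["degree_conferred_date"]))
    for rec in ordered:
        # setdefault only stores the first (earliest) record seen for each ssn
        earliest.setdefault(rec["ssn"], rec)
    return list(earliest.values())


cohort = one_row_per_person(completions)

# Every ssn should now appear exactly once
assert len(cohort) == len({r["ssn"] for r in cohort})
```

Whether to keep the earliest or the latest degree is an analytical choice; what matters is that it is stated explicitly and applied consistently.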

In [None]:
# set database connections - use psycopg2 to more easily execute queries without returning data 
# (eg for series of CREATE queries)
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

In [None]:
start_time = time.time()
sql = '''
DROP TABLE IF EXISTS cohort_2009;

CREATE TEMP TABLE cohort_2009 AS
SELECT DISTINCT ON (ssn) ssn, degree_conferred_date,
    extract(year from degree_conferred_date) AS year, 
    extract(quarter from degree_conferred_date) quarter,
    date_trunc('quarter', degree_conferred_date)::date yr_q,
    1 AS label --placeholder for the outcome to be created
FROM in_data_2019.che_completions
WHERE ssn_available_flag = 'Y' AND extract(year from degree_conferred_date) = 2009
ORDER BY ssn, degree_conferred_date;

COMMIT;
'''
# df = pd.read_sql(sql, conn)
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
sql = '''
SELECT *
FROM cohort_2009
'''
df = pd.read_sql(sql, conn)

In [None]:
print('there are {:,.0f} graduates in our selected study period'.format(df.shape[0]))

In [None]:
df['ssn'].nunique() # confirm unique individual records

### Outcome example: not employed 1 year after graduation

Above we defined our population: individuals who graduated from a higher education institution in 2009.

Now we'll say a given individual is "at risk of not getting a job" if they were not present in the wage record data 1 year after they graduated.

In [None]:
start_time = time.time()

sql = '''
DROP TABLE IF EXISTS cohort_2009_in_jobs_1yr;

CREATE TEMP TABLE cohort_2009_in_jobs_1yr AS
SELECT *
FROM in_data_2019.wages_by_employer
where year = 2009+1 
    and ssn in (select ssn from cohort_2009);

COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))
pd.read_sql('select count(*) from cohort_2009_in_jobs_1yr;', conn)

In [None]:
# Count job-quarters and unique individuals, then load the counts into pandas
sql = '''
SELECT count(*) jq, count(distinct ssn) num
FROM cohort_2009_in_jobs_1yr
'''
df = pd.read_sql(sql, conn)
print('people in our cohort were found in {:,.0f} "job-quarters" in 2010'.format(df['jq'][0]))
print('there are {:,.0f} unique individuals who were found in Indiana wage records in 2010'.format(df['num'][0]))

We will now update the `label` variable to `0` if the individual was found in Indiana wage records in the same calendar quarter 1 year after graduating.
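
To see how this rule behaves at the edges, here is a small sketch that runs the same kind of update on toy data in an in-memory SQLite database. The table names, values, and the `WHERE EXISTS` form are assumptions chosen so the example runs without the project database; the notebook itself uses PostgreSQL's `UPDATE ... FROM` syntax.

```python
import sqlite3

# In-memory toy database; names and values are hypothetical
conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()
cur.executescript("""
CREATE TABLE cohort (ssn TEXT PRIMARY KEY, quarter INTEGER, label INTEGER);
CREATE TABLE jobs (ssn TEXT, year INTEGER, quarter INTEGER);

-- everyone starts with label = 1 (placeholder: not found in wage records)
INSERT INTO cohort VALUES ('A', 2, 1), ('B', 2, 1), ('C', 4, 1);

-- wage records one year after graduation
INSERT INTO jobs VALUES ('A', 2010, 2), ('B', 2010, 3);
""")

# Set label to 0 only when ssn AND quarter both match a wage record
cur.execute("""
UPDATE cohort SET label = 0
WHERE EXISTS (
    SELECT 1 FROM jobs b
    WHERE b.ssn = cohort.ssn AND b.quarter = cohort.quarter
)
""")

labels = dict(cur.execute("SELECT ssn, label FROM cohort ORDER BY ssn").fetchall())
conn_demo.close()
print(labels)
```

In this toy data, person `B` has a wage record in 2010 but in a different quarter than their graduation quarter, so their label stays `1`, just like person `C`, who has no wage record at all.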

In [None]:
# update label column in the cohort table
# by setting those we found in the wage
# records to 0

sql = """
UPDATE cohort_2009 a SET label = 0
FROM cohort_2009_in_jobs_1yr b
WHERE a.ssn = b.ssn
    AND a.quarter = b.quarter;
    
commit;
"""
cursor.execute(sql)

df = pd.read_sql("SELECT * FROM cohort_2009", conn)

In [None]:
df.shape

In [None]:
pd.crosstab(index = df['label'], columns =  'count')

In [None]:
# or use .value_counts(normalize=True) to show ratio
df['label'].value_counts(normalize=True)

### Repeating the Label Creation Process

We will need at least one (but preferably many) training and test sets for our machine learning analysis. We will put the above steps into a function with parameters for easier reuse.
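
As a rough sketch of what "many training and test sets" could look like, the snippet below pairs each cohort year with a later cohort year to test on. The helper `temporal_splits` and the pairing rule (test on the cohort `year_ahead` years after the training cohort) are illustrative assumptions, not the course's prescribed validation design.

```python
def temporal_splits(cohort_years, year_ahead=1):
    """Pair each training cohort with a test cohort `year_ahead` years later.

    Hypothetical helper: only years present in `cohort_years` are used.
    """
    available = set(cohort_years)
    return [
        (train_year, train_year + year_ahead)
        for train_year in sorted(available)
        if train_year + year_ahead in available
    ]


splits = temporal_splits([2007, 2008, 2009, 2010, 2011])
for train_year, test_year in splits:
    print("train on cohort", train_year, "-> test on cohort", test_year)
```

Each pair needs a label table for both years, which is why a reusable label function is convenient.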

### Writing a Function to Create Labels

In the above, the SQL queries were all hard coded. In this section, we demonstrate how to use functions with parameters for the choices we made to define our observations (rows) and label (outcome variable).
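
Table names cannot be passed to the database as ordinary query parameters, so they have to be formatted into the SQL string. A minimal sketch of that idea, with a check on the prefix before it is used, might look like the following. The helper `cohort_table_name` and its validation pattern are assumptions for illustration; only the `ada_edwork` schema and the `{tbl_prefix}cohort_{year}` naming come from this notebook.

```python
import re

# Naming pattern used for the cohort tables in this notebook
COHORT_TABLE = "ada_edwork.{tbl_prefix}cohort_{year}"


def cohort_table_name(year, prefix="no_job_"):
    """Build the cohort table name, rejecting prefixes that are not
    lower-case letters, digits, or underscores (hypothetical helper)."""
    if not re.fullmatch(r"[a-z][a-z0-9_]*", prefix):
        raise ValueError("prefix must be lower case letters, digits, or '_'")
    # int() guards against anything other than a plain year being formatted in
    return COHORT_TABLE.format(tbl_prefix=prefix, year=int(year))


print(cohort_table_name(2009))
```

Validating the pieces before formatting keeps a typo in the prefix from producing an unexpected table name.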

In [None]:
# to "namespace" the table(s) created, recommend team number (eg 't2_')

# note - we recommend using lower case characters only!

table_prefix = 'no_job_' 

In [None]:
def generate_labels(YEAR, year_ahead=1, prefix=table_prefix, overwrite=False):
    
    #database connection
    conn = psycopg2.connect(database=db_name, host = hostname) 
    cursor = conn.cursor()
    
    # create full set of queries to create labels
    # this step will not execute the code in the database
    # it will only create the Python string object
    sql = """
    -- drop table if already exists in the database
    DROP TABLE IF EXISTS ada_edwork.{tbl_prefix}cohort_{year};

    -- create cohort of unique individuals who graduated in the 
    -- input year; this code takes the latest degree
    CREATE TABLE ada_edwork.{tbl_prefix}cohort_{year} AS
    SELECT DISTINCT ON (ssn) ssn, degree_conferred_date,
        extract(year from degree_conferred_date)::int AS year, 
        extract(quarter from degree_conferred_date)::int quarter,
        date_trunc('quarter', degree_conferred_date)::date yr_q,
        1 AS label --placeholder for the outcome to be created
    FROM in_data_2019.che_completions
    WHERE ssn_available_flag = 'Y' 
        AND extract(year from degree_conferred_date) = {year}
    ORDER BY ssn ASC, degree_conferred_date DESC;

    COMMIT;
    
    -- find wage records in following year for our cohort
    CREATE TEMP TABLE cohort_{year}_in_jobs_{ahead}yr AS
    SELECT *
    FROM in_data_2019.wages_by_employer
    where year = {year}+{ahead} 
        and ssn in (select ssn from ada_edwork.{tbl_prefix}cohort_{year});

    -- set label to 0 for those who were present in wage records
    UPDATE ada_edwork.{tbl_prefix}cohort_{year} a SET label = 0
    FROM cohort_{year}_in_jobs_{ahead}yr b
    WHERE a.ssn = b.ssn
        AND a.quarter = b.quarter;

    commit;
    
    """.format(year=YEAR, tbl_prefix=prefix, ahead=year_ahead)
    
    
    # Let's check if the table already exists:
    # This query will return an empty table (with no rows) if the table does not exist
    cursor.execute('''
    SELECT * FROM pg_tables 
    WHERE tablename = '{tbl_prefix}cohort_{year}' 
    AND schemaname = 'ada_edwork';
    '''.format(year=YEAR, tbl_prefix=prefix))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        print("Creating table")
        cursor.execute(sql)
    else:
        print("Table already exists")

    cursor.close()
    
    # Load table into pandas dataframe
    sql = '''
    SELECT * FROM ada_edwork.{tbl_prefix}cohort_{year}
    '''.format(year=YEAR, tbl_prefix=prefix)
    
    df = pd.read_sql(sql, conn)
    
    # close the connection opened at the top of this function
    conn.close()
    
    return df

Let's test the function with a couple of different parameters:

In [None]:
start_time = time.time()

# Set parameter(s):
year = 2007

df_test1 = generate_labels(year)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
pd.crosstab(index = df_test1['label'], columns =  'count')

In [None]:
start_time = time.time()

# Set parameter(s):
year = 2008

df_test2 = generate_labels(year)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
pd.crosstab(index = df_test2['label'], columns =  'count')

In [None]:
years = [2009, 2010, 2011]

for y in years:
    start_time = time.time()
    
    df_test3 = generate_labels(y)
    
    print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
    print(pd.crosstab(index = df_test3['label'], columns =  'count'))

In [None]:
# here's an easy way to compare proportions of outcomes between DataFrames
df_test1['label'].value_counts(normalize=True)

In [None]:
df_test2['label'].value_counts(normalize=True)

In [None]:
# note: df_test3 holds only the last cohort from the loop above (2011)
df_test3['label'].value_counts(normalize=True)