<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan. 

_Citation to be updated on export_

# Data Preparation for Machine Learning - Creating Labels
----

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit models.

In [None]:
%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import time

In [None]:
# and set our database connection parameters
db_name = "appliedda"
hostname = "10.10.2.10"

## Creating Labels

Labels are the dependent variables, or _Y_ variables, that we are trying to predict. In the machine learning framework, your labels are usually _binary_: true or false, often encoded as 1 or 0. 

It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).

In this notebook, we define each row as an individual finishing a TANF spell. A spell could be participation in just one case or a series of multiple cases.

For this example, let's use January 1, 2010, as our "date of prediction" to simulate predicting return to TANF **after 6 months of being off TANF**. With this definition, we can consider the workforce participation of individuals who exited TANF in Q2 of 2009 as a "feature" (more on features in the next notebook) in our prediction.
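
To make the date arithmetic concrete, here is a minimal sketch of how the cohort window follows from the prediction date. The helpers `shift_months` and `cohort_window` are hypothetical names introduced only for illustration, and the sketch assumes the prediction date falls on the first day of a month.

```python
from datetime import date

def shift_months(d, months):
    """Shift a first-of-month date by a whole number of months (negative = back).

    Hypothetical helper: assumes d.day == 1, so no end-of-month clamping is needed.
    """
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)

def cohort_window(preddate, months_off=6, months_back=3):
    """Return (window_start, window_end); exits qualify if start <= end_date < end."""
    window_end = shift_months(preddate, -months_off)
    window_start = shift_months(window_end, -months_back)
    return window_start, window_end

window_start, window_end = cohort_window(date(2010, 1, 1))
print(window_start, window_end)  # 2009-04-01 2009-07-01
```

With the default values, the window covers exits from April 1 up to (but not including) July 1, 2009, which is Q2 of 2009.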

In [None]:
# set database connections - use psycopg2 to more easily execute queries without returning data (eg for series of CREATE queries)
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

In [None]:
start_time = time.time()
sql = '''
DROP TABLE IF EXISTS il_cohort_20100101;
CREATE TEMP TABLE il_cohort_20100101 AS
SELECT a.recptno, b.ch_dpa_caseid, a.start_date, a.end_date
FROM il_dhs.ind_spells a
JOIN il_dhs.indcase_spells b
ON a.recptno = b.recptno 
    AND a.end_date = b.end_date
WHERE a.end_date >= (('2010-01-01'::date - '6 months'::interval)-'3 months'::interval) AND 
        a.end_date < ('2010-01-01'::date - '6 months'::interval)
        AND a.benefit_type = 'tanf46' AND b.benefit_type = 'tanf46';
COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
sql = '''
SELECT *
FROM il_cohort_20100101
'''
df = pd.read_sql(sql, conn)

In [None]:
print('there are {} TANF spells in IL that end in our selected study period'.format(df.shape[0]))

In [None]:
# check the same info for Indiana data
# subset to only those with affil==1

start_time = time.time()
sql = '''
DROP TABLE IF EXISTS in_cohort_20100101;
CREATE TEMP TABLE in_cohort_20100101 AS
SELECT DISTINCT ON (ssn) ssn, caseid, tanf_start_date, tanf_end_date , month
FROM in_fssa.person_month
WHERE tanf_end_date::date >= (('2010-01-01'::date - '6 months'::interval)-'3 months'::interval) AND 
    tanf_end_date::date < ('2010-01-01'::date - '6 months'::interval)
    AND affil = '1'
ORDER BY ssn asc, month desc;
COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# this time we'll just grab the count
sql = '''
SELECT count(*)
FROM in_cohort_20100101
'''
in_count = pd.read_sql(sql, conn)['count'][0]

print('there are {:,.0f} TANF spells in IN that end in our selected study period'.format(in_count))

### Outcome example: return to TANF within 1 year

For our prediction problem we will focus on the `ind_spells` table, which has the start and end dates of individual-level spells on three different benefit programs: TANF, SNAP, and cash assistance.

We defined our `cohort` above as those who exited the TANF program between 6 and 9 months prior to our prediction date. Now we will find those in our cohort who started a new TANF spell at any point between their exit and one year after our prediction date.
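
The "first return within the horizon" logic can be sketched in plain Python on made-up data. The person identifiers and dates below are invented for illustration and do not come from the TANF tables; the sketch assumes a prediction date of 2010-01-01 and a 12-month horizon.

```python
from datetime import date

# prediction date 2010-01-01 plus a 12-month horizon (assumed values)
horizon_end = date(2011, 1, 1)

# made-up cohort: person -> date their TANF spell ended
cohort_exits = {"A": date(2009, 4, 30), "B": date(2009, 6, 30)}

# made-up later spells: (person, spell start date)
later_spell_starts = [
    ("A", date(2010, 9, 1)),
    ("A", date(2010, 3, 1)),
    ("B", date(2011, 2, 1)),
]

# keep only the earliest qualifying spell per person
first_return = {}
for person, start in sorted(later_spell_starts, key=lambda s: s[1]):
    if cohort_exits[person] <= start < horizon_end:
        first_return.setdefault(person, start)

print(first_return)
```

Person A has two qualifying spells and only the earliest is kept; person B's next spell starts after the horizon, so B does not count as a return.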

In [None]:
start_time = time.time()
# only return the first spell in the event they returned within the following year
sql = '''
DROP TABLE IF EXISTS il_cohort_returned_20100101;
CREATE TEMP TABLE il_cohort_returned_20100101 AS
select distinct on (a.recptno) a.* 
from il_cohort_20100101 a 
join il_dhs.ind_spells b 
on a.recptno = b.recptno
where b.start_date >= a.end_date
    and b.start_date < ('2010-01-01'::date + '12 months'::interval)
    and b.benefit_type = 'tanf46'
order by a.recptno, b.start_date;
COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# Load into pandas 
sql = '''
SELECT *
FROM il_cohort_returned_20100101
'''
df = pd.read_sql(sql, conn)
print('of our study cohort, {} returned to TANF'.format(df.shape[0]))

In [None]:
# repeat for Indiana
start_time = time.time()
# only return the first spell in the event they returned within the following year
sql = '''
DROP TABLE IF EXISTS in_cohort_returned_20100101;
CREATE TEMP TABLE in_cohort_returned_20100101 AS
SELECT DISTINCT ON (ssn) a.*
FROM in_cohort_20100101 a
JOIN in_fssa.person_month b
ON a.ssn = b.ssn
WHERE b.tanf_start_date >= a.tanf_end_date::date
    AND b.tanf_end_date < ('2010-01-01'::date + '12 months'::interval)
    AND affil = '1'
ORDER BY ssn asc, month;
COMMIT;
'''

cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
sql = '''
SELECT count(*)
FROM in_cohort_returned_20100101
'''
df = pd.read_sql(sql, conn)
print('of our study cohort, {} returned to TANF'.format(df['count'][0]))

We will now create a `label` variable that is set to `0` if the individual _does not_ return to TANF by one year after our prediction date and `1` if the individual _does_ have another TANF spell beginning within our time horizon (1 year after the prediction date).
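
The labeling pattern itself (a `LEFT JOIN` plus `CASE WHEN ... IS NULL`) can be tried on toy data. This is only a sketch: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the tables `cohort` and `returned` and the column `person_id` are made up for illustration.

```python
import sqlite3

# in-memory database so nothing touches the real connection
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# toy cohort of three people, one of whom returned
demo_cur.executescript("""
CREATE TABLE cohort (person_id INTEGER, end_date TEXT);
CREATE TABLE returned (person_id INTEGER);
INSERT INTO cohort VALUES (1, '2009-04-15'), (2, '2009-05-31'), (3, '2009-06-30');
INSERT INTO returned VALUES (2);
""")

# everyone in the cohort keeps a row; unmatched rows get label 0
rows = demo_cur.execute("""
SELECT a.person_id,
       CASE WHEN b.person_id IS NULL THEN 0 ELSE 1 END AS label
FROM cohort a
LEFT JOIN returned b ON a.person_id = b.person_id
ORDER BY a.person_id
""").fetchall()
demo_conn.close()

print(rows)  # [(1, 0), (2, 1), (3, 0)]
```

Using a left join rather than an inner join is what keeps the non-returners in the result, so the label table has exactly one row per cohort member.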

In [None]:
# create IL label table
sql = """
DROP TABLE IF EXISTS il_label_20100101;
CREATE TEMP TABLE il_label_20100101 AS
SELECT a.recptno, a.start_date, a.end_date, 
    CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label
FROM il_cohort_20100101 a
LEFT JOIN il_cohort_returned_20100101 b
ON a.recptno = b.recptno;
commit;
"""
cursor.execute(sql)

df = pd.read_sql("SELECT * FROM il_label_20100101", conn)

In [None]:
df.shape

In [None]:
pd.crosstab(index = df['label'], columns =  'count')

In [None]:
# or use .value_counts(normalize=True) to show ratio
df['label'].value_counts(normalize=True)

In [None]:
# create IN label table
sql = """
DROP TABLE IF EXISTS in_label_20100101;
CREATE TEMP TABLE in_label_20100101 AS
SELECT a.*, 
    CASE WHEN b.ssn IS NULL THEN 0 ELSE 1 END as label
FROM in_cohort_20100101 a
LEFT JOIN in_cohort_returned_20100101 b
ON a.ssn = b.ssn;
commit;
"""
cursor.execute(sql)

df = pd.read_sql("SELECT * FROM in_label_20100101", conn)

In [None]:
pd.crosstab(index = df['label'], columns =  'count')

In [None]:
# or use .value_counts(normalize=True) to show ratio
df['label'].value_counts(normalize=True)

In [None]:
# close connection
cursor.close()
conn.close()

### Repeating the Label Creation Process for Prediction date

We will need at least one (but preferably many) training and test sets for our machine learning analysis. We will put the above steps into a function with parameters for easier reuse.
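
As a small illustration, a list of yearly prediction dates could be built as follows. `yearly_pred_dates` is a hypothetical helper, and choosing January 1 of consecutive years is an assumption that matches the dates used in this notebook.

```python
def yearly_pred_dates(first_year, n_years):
    """Return 'YYYY-01-01' strings for n_years consecutive years (hypothetical helper)."""
    return ["{}-01-01".format(first_year + i) for i in range(n_years)]

pred_date_list = yearly_pred_dates(2010, 4)
print(pred_date_list)  # ['2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01']
```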

In [None]:
# overwrite connection and open new cursor
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

### Writing a Function to Create Labels

In the section above, the SQL queries were all hard-coded. In this section, we demonstrate how to use functions with parameters for the choices we made to define our observations (rows) and label (outcome variable). The complete list of parameters is given in parentheses after each `def` statement (`generate_IL_labels` and `generate_IN_labels`). Some parameters are given a default value (like `months_back=3`); others (like `preddate`) are not and must always be supplied.

**Parameters of the `generate_IL_labels()` and `generate_IN_labels()` functions**
- `preddate`: date of prediction, given as a `'YYYY-MM-DD'` string; note that this should be the first day of the quarter.
- `months_off`: months off of TANF before the prediction date (default `6`).
- `months_back`: months before "date of prediction - months off TANF" to define the cohort (default `3`).
- `months_ahead`: time horizon ahead of the date of prediction to consider (default `12`).

- `schema`: Your team schema, where the label table will be written. The default value is `ada_tdc_2019`.
- `tbl_prefix`: Prefix for the name of the label table. The default value is `my_prefix`, which you define in the cell above the functions.
- `overwrite`: Whether you want the function to overwrite tables that already exist. Before writing a table, the function will check whether this table exists, and by default will not overwrite existing tables.
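
To see how these parameters turn into a table name, here is a minimal sketch of the string formatting involved. `label_table_name` is a hypothetical helper, not part of the functions below; the defaults `ada_tdc_2019` and `tanfret` are the values used in this notebook.

```python
def label_table_name(preddate, state, schema="ada_tdc_2019", tbl_prefix="tanfret"):
    """Build the schema-qualified label table name (hypothetical helper)."""
    # remove dashes so the date can be used inside a table name
    tbl_suffix = preddate.replace("-", "")
    return "{schema}.{tbl_prefix}_{state}_label_{tbl_suffix}".format(
        schema=schema, tbl_prefix=tbl_prefix, state=state, tbl_suffix=tbl_suffix
    )

il_name = label_table_name("2010-01-01", "il")
print(il_name)  # ada_tdc_2019.tanfret_il_label_20100101
```

Because the prediction date is part of the table name, each prediction date produces its own label table, which is what allows several training and test sets to coexist in the same schema.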

In [None]:
my_prefix = 'tanfret'

In [None]:
def generate_IL_labels(preddate, months_off=6, months_back=3, 
                    months_ahead=12, schema='ada_tdc_2019', 
                    tbl_prefix = my_prefix, overwrite=False):
    
    #database connection
    conn = psycopg2.connect(database=db_name, host = hostname) 
    cursor = conn.cursor()
    
    # set variables based on prediction date
    tbl_suffix = preddate.replace('-', '') #remove dashes
   
    # create full set of queries to create labels
    sql = """
    -- create our study cohort for this prediction date
    CREATE TEMP TABLE il_cohort_{tbl_suffix} AS
    SELECT a.recptno, b.ch_dpa_caseid, a.start_date, a.end_date
    FROM il_dhs.ind_spells a
    JOIN il_dhs.indcase_spells b
    ON a.recptno = b.recptno 
        AND a.end_date = b.end_date
    WHERE a.end_date >= (('{pred_date}'::date - '{months_off} months'::interval)-'{months_back} months'::interval) 
        AND a.end_date < ('{pred_date}'::date - '{months_off} months'::interval)
            AND a.benefit_type = 'tanf46' AND b.benefit_type = 'tanf46';
    COMMIT;
    
    -- find how many in our cohort returned to TANF
    CREATE TEMP TABLE il_cohort_returned_{tbl_suffix} AS
    select distinct on (a.recptno) a.* 
    from il_cohort_{tbl_suffix} a 
    join il_dhs.ind_spells b 
    on a.recptno = b.recptno
    where b.start_date >= a.end_date
        and b.start_date < ('{pred_date}'::date + '{months_ahead} months'::interval)
        and b.benefit_type = 'tanf46'
    order by a.recptno, b.start_date;
    COMMIT;
    
    -- create the label table for this prediction date
    -- first DROP to handle the overwrite case
    DROP TABLE IF EXISTS {schema}.{tbl_prefix}_il_label_{tbl_suffix};
    
    CREATE TABLE {schema}.{tbl_prefix}_il_label_{tbl_suffix} AS
    SELECT a.*, 
        CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label
    FROM il_cohort_{tbl_suffix} a
    LEFT JOIN il_cohort_returned_{tbl_suffix} b
    ON a.recptno = b.recptno;
    commit;
    
    -- also add the SSN from the member table
    ALTER TABLE {schema}.{tbl_prefix}_il_label_{tbl_suffix} ADD COLUMN ssn text;
    UPDATE {schema}.{tbl_prefix}_il_label_{tbl_suffix} a SET ssn = b.ssn_hash
    FROM il_dhs.member b
    WHERE a.recptno = b.recptno AND a.ch_dpa_caseid = b.ch_dpa_caseid;
    commit;
    
    -- change owner of table to schema group
    ALTER TABLE {schema}.{tbl_prefix}_il_label_{tbl_suffix} OWNER TO {schema}_admin;
    """.format(tbl_suffix=tbl_suffix, pred_date=preddate, months_off=months_off,
               months_back=months_back, months_ahead=months_ahead, tbl_prefix=tbl_prefix,
               schema=schema)
    
    
    # Let's check if the table already exists:
    # This query will return an empty table (with no rows) if the table does not exist
    cursor.execute('''
    SELECT * FROM pg_tables 
    WHERE tablename ='{tbl_prefix}_il_label_{tbl_suffix}'
    AND schemaname = '{schema}';
    '''.format(tbl_suffix=tbl_suffix, tbl_prefix=tbl_prefix, schema=schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        print("Creating table")
        cursor.execute(sql)
    else:
        print("Table already exists")

    cursor.close()
    
    # Load table into pandas dataframe
    sql = '''
    SELECT * FROM {schema}.{tbl_prefix}_il_label_{tbl_suffix}
    '''.format(tbl_suffix=tbl_suffix, tbl_prefix=tbl_prefix, schema=schema)
    df = pd.read_sql(sql, conn)
    
    # close the connection opened at the top of this function
    conn.close()
    
    return df

In [None]:
def generate_IN_labels(preddate, months_off=6, months_back=3, 
                    months_ahead=12, schema='ada_tdc_2019', 
                    tbl_prefix = my_prefix, overwrite=False):
    
    #database connection
    conn = psycopg2.connect(database=db_name, host = hostname) 
    cursor = conn.cursor()
    
    # set variables based on prediction date
    tbl_suffix = preddate.replace('-', '') #remove dashes
   
    # create full set of queries to create labels
    sql = """
    -- create initial cohort table
    CREATE TEMP TABLE in_cohort_{tbl_suffix} AS
    SELECT DISTINCT ON (ssn) ssn, caseid, tanf_start_date, tanf_end_date , month
    FROM in_fssa.person_month
    WHERE tanf_end_date >= (('{pred_date}'::date - '{months_off} months'::interval)-'{months_back} months'::interval) 
        AND tanf_end_date < ('{pred_date}'::date - '{months_off} months'::interval)
        AND affil = '1'
    ORDER BY ssn asc, month desc;
    COMMIT;
    
    -- find individuals who did return
    CREATE TEMP TABLE in_cohort_returned_{tbl_suffix} AS
    SELECT DISTINCT ON (ssn) a.*
    FROM in_cohort_{tbl_suffix} a
    JOIN in_fssa.person_month b
    ON a.ssn = b.ssn
    WHERE b.tanf_start_date >= a.tanf_end_date::date
        AND b.tanf_end_date < ('{pred_date}'::date + '{months_ahead} months'::interval)
        AND affil = '1'
    ORDER BY ssn asc, month;
    COMMIT;
    
    -- create the label table for this prediction date
    -- first DROP to handle the overwrite case
    DROP TABLE IF EXISTS {schema}.{tbl_prefix}_in_label_{tbl_suffix};
    CREATE TABLE {schema}.{tbl_prefix}_in_label_{tbl_suffix} AS
    SELECT a.*, 
        CASE WHEN b.ssn IS NULL THEN 0 ELSE 1 END as label
    FROM in_cohort_{tbl_suffix} a
    LEFT JOIN in_cohort_returned_{tbl_suffix} b
    ON a.ssn = b.ssn;
    commit;
    
    
    -- change owner of table to schema group
    ALTER TABLE {schema}.{tbl_prefix}_in_label_{tbl_suffix} OWNER TO {schema}_admin;
    """.format(tbl_suffix=tbl_suffix, pred_date=preddate, months_off=months_off,
               months_back=months_back, months_ahead=months_ahead, tbl_prefix=tbl_prefix,
               schema=schema)
    
    
    # Let's check if the table already exists:
    # This query will return an empty table (with no rows) if the table does not exist
    cursor.execute('''
    SELECT * FROM pg_tables 
    WHERE tablename ='{tbl_prefix}_in_label_{tbl_suffix}'
    AND schemaname = '{schema}';
    '''.format(tbl_suffix=tbl_suffix, tbl_prefix=tbl_prefix, schema=schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        print("Creating table")
        cursor.execute(sql)
    else:
        print("Table already exists")

    cursor.close()
    
    # Load table into pandas dataframe
    sql = '''
    SELECT * FROM {schema}.{tbl_prefix}_in_label_{tbl_suffix}
    '''.format(tbl_suffix=tbl_suffix, tbl_prefix=tbl_prefix, schema=schema)
    df = pd.read_sql(sql, conn)
    
    # close the connection opened at the top of this function
    conn.close()
    
    return df

Let's run the functions.

In [None]:
start_time = time.time()

# Set prediction date:
preddate = '2010-01-01' # "date of prediction"

# create labels and return DataFrame
# note: parameters with default values only need to be passed when you want to change them

df = generate_IL_labels(preddate)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))

pd.crosstab(index = df['label'], columns =  'count')

In [None]:
start_time = time.time()

# Set prediction date:
preddate = '2010-01-01' # "date of prediction"

# create labels and return DataFrame
# note: parameters with default values only need to be passed when you want to change them

df = generate_IN_labels(preddate)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))

pd.crosstab(index = df['label'], columns =  'count')

In [None]:
# and make both for the three following years:
pred_dates = ['2011-01-01', '2012-01-01', '2013-01-01']

for preddate in pred_dates:
    start_time = time.time()
    df = generate_IN_labels(preddate)
    print('IN Labels generated in {:.2f} seconds'.format(time.time()-start_time))
    print(pd.crosstab(index = df['label'], columns =  'count'))
    
    start_time = time.time()
    df = generate_IL_labels(preddate)
    print('IL Labels generated in {:.2f} seconds'.format(time.time()-start_time))
    print(pd.crosstab(index = df['label'], columns =  'count'))