<img style="float: center;" src="../images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.

# Data Preparation for Machine Learning - Creating Labels
----

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit models.

In [None]:
%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"

## Creating Labels

Labels are the dependent variables, or *Y* variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, encoded as 1 or 0. 

In this case, our label is __whether an existing single-unit employer in a given year disappears within a given number of years__. By convention, we will flag employers who still exist in the following year as 0, and those who no longer exist as 1.

Single unit employers can be flagged using the `multi_unit_code` (`multi_unit_code = '1'`). We create a unique firm ID using EIN (`ein`), SEIN Unit (`seinunit`) and Employer Number (`empr_no`).
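To make the firm ID concrete, here is a minimal pandas sketch that mirrors the `CONCAT(ein, '-', seinunit, '-', empr_no)` expression used in the SQL queries of this notebook. The `employers` DataFrame and its values are made up for illustration; they are not taken from the QCEW data.

```python
import pandas as pd

# Toy employer records; the values are made up for illustration only.
employers = pd.DataFrame({
    "ein": ["111", "111", "222"],
    "seinunit": ["0", "1", "0"],
    "empr_no": ["5", "5", "7"],
})

# Join the three identifier columns with '-' to build one ID per employer unit.
employers["id"] = (
    employers["ein"].astype(str) + "-"
    + employers["seinunit"].astype(str) + "-"
    + employers["empr_no"].astype(str)
)
print(employers["id"].tolist())
```

Two rows that share an `ein` but differ in `seinunit` get different IDs, which is why all three columns are part of the key.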

We need to pick the year and quarter of prediction, and the number of years we look forward to see if the employer still exists. Let's use Q1 of 2013 as our date of prediction. Different projects might be interested in looking at short-term or long-term survivability of employers, but for this first example, we evaluate firm survivability within one year of the prediction date.

### Detailed Creation of Labels for a Given Year

For this example, let's use 2013 (Q1) as our reference year (year of prediction).

Let's start by creating the list of unique employers in that quarter:

In [None]:
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

In [None]:
sql = '''
CREATE TEMP TABLE eins_2013q1 AS
SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, ein, seinunit, empr_no 
FROM il_des_kcmo.il_qcew_employers
WHERE multi_unit_code = '1' AND year = 2013 AND quarter = 1;
COMMIT;
'''
cursor.execute(sql)

In [None]:
sql = '''
SELECT *
FROM eins_2013q1
LIMIT 10
'''
pd.read_sql(sql, conn)

Now let's create this same table one year later.

In [None]:
sql = '''
CREATE TEMP TABLE eins_2014q1 AS
SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, 
ein, seinunit, empr_no 
FROM il_des_kcmo.il_qcew_employers
WHERE multi_unit_code = '1' AND year = 2014 AND quarter = 1;
COMMIT;
'''
cursor.execute(sql)

In [None]:
sql = '''
SELECT *
FROM eins_2014q1
LIMIT 10
'''
pd.read_sql(sql, conn)

In order to assess whether a 2013 employer still exists in 2014, let's merge the 2014 table onto the 2013 list of employers. Notice that we create a `label` variable that takes the value `0` if the 2013 employer still exists in 2014, `1` if the employer disappears.

In [None]:
sql = '''
CREATE TABLE IF NOT EXISTS ada_18_uchi.labels_2013q1_2014q1 AS
SELECT a.*, CASE WHEN b.ein IS NULL THEN 1 ELSE 0 END AS label
FROM eins_2013q1 AS a
LEFT JOIN eins_2014q1 AS b
ON a.id = b.id AND a.ein = b.ein AND a.seinunit = b.seinunit AND a.empr_no = b.empr_no;
COMMIT;

ALTER TABLE ada_18_uchi.labels_2013q1_2014q1 OWNER TO ada_18_uchi_admin;
COMMIT;
'''
cursor.execute(sql)
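If the SQL join above feels abstract, the same labeling logic can be sketched in pandas on toy data. This is only an illustration under stated assumptions: `base` and `following` are made-up stand-ins for `eins_2013q1` and `eins_2014q1`, and the join uses the `id` column alone.

```python
import pandas as pd

# Toy ID lists standing in for eins_2013q1 and eins_2014q1 (values are made up).
base = pd.DataFrame({"id": ["a-0-1", "b-0-2", "c-0-3"]})
following = pd.DataFrame({"id": ["a-0-1", "c-0-3", "d-0-4"]})

# A left join keeps every base-year employer; `_merge` records whether a match was found.
merged = base.merge(following, on="id", how="left", indicator=True)

# label = 1 when the employer has no match one year later, 0 otherwise.
merged["label"] = (merged["_merge"] == "left_only").astype(int)
labels = merged.drop(columns="_merge")
print(labels)
```

Note that employer `d-0-4`, which only appears in the later period, is not in the result: the label table describes the employers that existed at the prediction date, not new entrants.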

In [None]:
# Load the 2013 Labels into Python Pandas 
sql = '''
SELECT *
FROM ada_18_uchi.labels_2013q1_2014q1
'''
df_labels_2013 = pd.read_sql(sql, conn)
df_labels_2013.head(10)

Given these first rows, employers who survive seem to be more common than employers who disappear. Let's get an idea of the distribution of our label variable.

In [None]:
pd.crosstab(index = df_labels_2013['label'], columns =  'count')
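Counts are easier to compare across years when we also look at shares. As a small sketch, `value_counts(normalize=True)` gives the proportion of each label value. The `toy_labels` series below is made up and only stands in for a column like `df_labels_2013['label']`.

```python
import pandas as pd

# Toy labels standing in for a real label column (values are made up).
toy_labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="label")

# Absolute counts and relative shares of each label value, side by side.
counts = toy_labels.value_counts().sort_index()
shares = toy_labels.value_counts(normalize=True).sort_index()
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

The share of 1's is worth keeping in mind later: when one class is rare, a model that always predicts the majority class can look accurate without being useful.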

### Repeating the Label Creation Process for Another Year

Since we need one training and one test set for our machine learning analysis, let's create the same labels table for the following year.

In [None]:
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

In [None]:
sql = '''
CREATE TEMP TABLE eins_2014q1 AS
SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, ein, seinunit, empr_no 
FROM il_des_kcmo.il_qcew_employers
WHERE multi_unit_code = '1' AND year = 2014 AND quarter = 1;
COMMIT;

CREATE TEMP TABLE eins_2015q1 AS
SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, ein, seinunit, empr_no 
FROM il_des_kcmo.il_qcew_employers
WHERE multi_unit_code = '1' AND year = 2015 AND quarter = 1;
COMMIT;

CREATE TABLE IF NOT EXISTS ada_18_uchi.labels_2014q1_2015q1 AS
SELECT a.*, CASE WHEN b.ein IS NULL THEN 1 ELSE 0 END AS label
FROM eins_2014q1 AS a
LEFT JOIN eins_2015q1 AS b
ON a.id = b.id AND a.ein = b.ein AND a.seinunit = b.seinunit AND a.empr_no = b.empr_no;
COMMIT;

ALTER TABLE ada_18_uchi.labels_2014q1_2015q1 OWNER TO ada_18_uchi_admin;
COMMIT;
'''
cursor.execute(sql)

In [None]:
# Load the 2014 Labels into Python Pandas 
sql = '''
SELECT *
FROM ada_18_uchi.labels_2014q1_2015q1
'''
df_labels_2014 = pd.read_sql(sql, conn)
df_labels_2014.head()

Let's get an idea of the distribution of our label variable.

In [None]:
pd.crosstab(index = df_labels_2014['label'], columns =  'count')

### Writing a Function to Create Labels

If you feel comfortable with the content we saw above, and expect to be creating labels for several different years as part of your project, the following code defines a Python function that generates the label table for any given year and quarter.

In the cells above, the whole SQL query was hard-coded. The function below instead takes parameters for your choice of year and quarter, your choice of prediction horizon, your team's schema, and so on. The complete list of parameters is given in parentheses after the `def generate_labels` statement. Some parameters have a default value (like `delta_t=1`); others (like `year` and `qtr`) do not and must always be supplied. More information on the different parameters is given below:
- `year`: The year at which we are doing the prediction.
- `qtr`: The quarter at which we are doing the prediction.
- `delta_t`: The forward-looking window, or number of years over which we are predicting employer survival or failure. The default value is 1, which means we are predicting, at a given time, whether an employer will still exist one year later.
- `schema`: Your team schema, where the label table will be written. The default value is set to `myschema`, which you define in the cell above the function.
- `db_name`: Database name. This is the name of the SQL database we are using. The default value is set to `db_name`, defined in the [Python Setup](#Python-Setup) section of this notebook.
- `hostname`: Host name. This is the host name for the SQL database we are using. The default value is set to `hostname`, defined in the [Python Setup](#Python-Setup) section of this notebook.
- `overwrite`: Whether you want the function to overwrite tables that already exist. Before writing a table, the function will check whether this table exists, and by default will not overwrite existing tables.
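To see how these parameters end up in the table name, here is a small sketch of the string templating the function relies on. `label_table_name` is a hypothetical helper written for this illustration; it is not part of the notebook's function, and it only builds a string without touching the database.

```python
def label_table_name(year, qtr, delta_t=1, schema="myschema"):
    """Build the schema-qualified label table name (hypothetical helper)."""
    template = "{schema}.labels_{year}q{qtr}_{year_pdelta}q{qtr}"
    # year_pdelta is the year at the end of the forward-looking window.
    return template.format(year=year, year_pdelta=year + delta_t, qtr=qtr, schema=schema)

print(label_table_name(2013, 1, schema="ada_18_uchi"))
print(label_table_name(2012, 1, delta_t=3, schema="ada_18_uchi"))
```

The first call reproduces the name of the table we created by hand, `ada_18_uchi.labels_2013q1_2014q1`. Changing `delta_t` changes the second year in the name, so tables for different horizons do not overwrite each other.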

In [None]:
# Insert team schema name below:
myschema = 'ada_18_uchi'

In [None]:
def generate_labels(year, qtr, delta_t=1, schema=myschema, db_name=db_name, hostname=hostname, overwrite=False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
   
    sql = """
    
    CREATE TEMP TABLE eins_{year}q{qtr} AS
    SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, ein, seinunit, empr_no 
    FROM il_des_kcmo.il_qcew_employers
    WHERE multi_unit_code = '1' AND year = {year} AND quarter = {qtr};
    COMMIT;

    CREATE TEMP TABLE eins_{year_pdelta}q{qtr} AS
    SELECT DISTINCT CONCAT(ein, '-', seinunit, '-', empr_no) AS id, ein, seinunit, empr_no 
    FROM il_des_kcmo.il_qcew_employers
    WHERE multi_unit_code = '1' AND year = {year_pdelta} AND quarter = {qtr};
    COMMIT;
    
    DROP TABLE IF EXISTS {schema}.labels_{year}q{qtr}_{year_pdelta}q{qtr};
    CREATE TABLE {schema}.labels_{year}q{qtr}_{year_pdelta}q{qtr} AS
    SELECT a.*, CASE WHEN b.ein IS NULL THEN 1 ELSE 0 END AS label
    FROM eins_{year}q{qtr} AS a
    LEFT JOIN eins_{year_pdelta}q{qtr} AS b
    ON a.id = b.id AND a.ein = b.ein AND a.seinunit = b.seinunit AND a.empr_no = b.empr_no;
    COMMIT;
    
    ALTER TABLE {schema}.labels_{year}q{qtr}_{year_pdelta}q{qtr} OWNER TO {schema}_admin;
    COMMIT;
   
    """.format(year=year, year_pdelta=year+delta_t, qtr=qtr, schema=schema)
    
    
    # Let's check if the table already exists:
    # This query will return an empty table (with no rows) if the table does not exist
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'labels_{year}q{qtr}_{year_pdelta}q{qtr}' 
    AND table_schema = '{schema}';
    '''.format(year=year, year_pdelta=year+delta_t, qtr=qtr, schema=schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        print("Creating table")
        cursor.execute(sql)
    else:
        print("Table already exists")

    cursor.close()
    
    # Load table into pandas dataframe
    sql = '''
    SELECT * FROM {schema}.labels_{year}q{qtr}_{year_pdelta}q{qtr}
    '''.format(year=year, year_pdelta=year+delta_t, qtr=qtr, schema=schema)
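    df = pd.read_sql(sql, conn)
    
    # Close the connection opened at the top of the function so repeated calls do not leave sessions open
    conn.close()
    
    return df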

Let's run the defined function for a few different years:

In [None]:
# For 2012 Q1
df_labels_2012 = generate_labels(year=2012, qtr=1)
pd.crosstab(index = df_labels_2012['label'], columns =  'count')

In [None]:
# For 2012 Q1 with a 3-year forward-looking window
# (stored under a separate name so the 1-year labels above are not overwritten)
df_labels_2012_3yr = generate_labels(year=2012, qtr=1, delta_t=3)
pd.crosstab(index = df_labels_2012_3yr['label'], columns =  'count')

Why is the number of 1's higher in the second case?

In [None]:
df_labels_2015 = generate_labels(year=2015, qtr=1)
pd.crosstab(index = df_labels_2015['label'], columns =  'count')

Notice the surprising results in 2015. What is the underlying data problem?
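One way to start investigating is to check how many records exist for each year and quarter. The sketch below does this on a made-up `toy` DataFrame; it makes no claim about what the real employer table contains, and you would need to run the equivalent count against the database to draw a conclusion.

```python
import pandas as pd

# Toy stand-in for the year and quarter columns of an employer table (values are made up).
toy = pd.DataFrame({
    "year":    [2014, 2014, 2015, 2015, 2015],
    "quarter": [1, 1, 1, 1, 1],
})

# Number of records per year and quarter. A follow-up period with no rows at all
# would make every base-year employer look like it disappeared.
coverage = toy.groupby(["year", "quarter"]).size()
print(coverage)

# Check whether the follow-up period for a 2015 Q1 prediction is present at all.
has_followup = (2016, 1) in coverage.index
print(has_followup)
```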