# Data Preparation for Machine Learning
----

In this notebook, we prepare the table we will use to run our Machine Learning algorithm predicting business survival. We start by creating the model's label (whether or not the firm survives over the coming *X* years). We then create some potential features to predict the outcome (firm characteristics, geography, industrial sector).

## Python Setup

In [None]:
%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) #database connection

## Creating Labels

The function below generates labels for the machine learning model. It is very similar to the one seen in class, with two additional customizable parameters (`lookback` and `term`). Note that `label` is 1 when the employer is no longer present `term` years later, and 0 when it survives.
- The `lookback` parameter sets how long the employer must already have existed in order to be included. This avoids counting short-lived employers. The default value is 1 year: the algorithm only considers the survival of employers that already existed 1 year before the focal year.
- The `term` parameter controls how far into the future we predict the employer's outcome. The default value is 1 year: we look at whether the employer still exists 1 year after the focal year. This parameter can be changed depending on the scope of the analysis (predicting short-term failures vs. long-term failures).
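
To make the roles of `lookback` and `term` concrete, here is a minimal pandas sketch of the same labeling logic on made-up data. The `employers` table and the `toy_labels` helper are hypothetical stand-ins, and employers are keyed on `ein` alone for brevity (the real table is keyed on `ein`, `run`, and `ui_acct`).

```python
import pandas as pd

# Toy stand-in for the Q1 rows of the employers table; all values are made up.
employers = pd.DataFrame({
    "ein":  ["A", "A", "A", "B", "B", "C", "C", "D"],
    "year": [2012, 2013, 2014, 2012, 2013, 2013, 2014, 2013],
})

def toy_labels(df, year, lookback=1, term=1):
    """Label is 1 if an employer seen in `year` and `year - lookback` is missing in `year + term`."""
    def present(y):
        return set(df.loc[df["year"] == y, "ein"])

    # Cohort: employers present in the focal year AND `lookback` years earlier.
    cohort = sorted(present(year) & present(year - lookback))
    future = present(year + term)
    return pd.DataFrame({
        "ein": cohort,
        "label": [0 if e in future else 1 for e in cohort],
    })

labels_2013 = toy_labels(employers, 2013)
print(labels_2013)
```

Employers `C` and `D` are excluded because they did not exist in 2012. `A` is still present in 2014 and gets label 0, while `B` has disappeared and gets label 1.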

In [None]:
def generate_labels(year, schema, lookback = 1, term = 1,
                    db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
   
    sql_script="""
    -- First, let's make a list of the employers present at time t: Q1 of the focal year

    DROP TABLE IF EXISTS {schema}.labels_{year};
    CREATE TABLE {schema}.labels_{year} AS
    SELECT CONCAT(a.ein, a.run, a.ui_acct) AS id
            , a.ein, a.run, a.ui_acct
            , case when b.flag = 1 then 0 else 1 end as label 
    FROM (
        SELECT x.ein, x.run, x.ui_acct
        FROM (
            SELECT ein, run, ui_acct
            FROM kcmo_lehd.mo_qcew_employers
            WHERE year = {year}
            AND qtr = 1
        ) AS x
        INNER JOIN (
            SELECT ein, run, ui_acct
            FROM kcmo_lehd.mo_qcew_employers
            WHERE year = {year}-{lookback}
            AND qtr = 1
        ) AS y
        ON x.ein = y.ein AND x.run = y.run AND x.ui_acct = y.ui_acct
    ) AS a
    LEFT JOIN (
        SELECT ein, run, ui_acct, 1 as flag 
        FROM kcmo_lehd.mo_qcew_employers
        WHERE year = {year}+{term}
        AND qtr = 1   
    ) AS b
    ON a.ein = b.ein AND a.run = b.run AND a.ui_acct = b.ui_acct;
    
    ALTER TABLE {schema}.labels_{year} OWNER TO {schema}_admin;

    COMMIT;

    """.format(year = year, schema = schema, lookback = lookback, term = term)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'labels_{year}' 
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
    
    df = pd.read_sql('SELECT * FROM {schema}.labels_{year}'.format(year = year, schema = schema), conn)  
    
    return df

## Creating Features

Here we will add features to the Machine Learning model. We will start by defining functions that will be useful throughout the notebook. We will then include the features that we presented in class during the module on Machine Learning. Finally, we will discuss potential additional features.

### Useful Functions

In [None]:
def scaling_var(df, var):
    min_var = df[var].min()
    max_var = df[var].max()
    scaled_var = '{}_scaled'.format(var)
    
    df[scaled_var] = (df[var] - min_var)/(max_var - min_var)
    
    return df[scaled_var]
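
The helper above applies min-max scaling, which maps the smallest value to 0 and the largest to 1. One edge case worth knowing: if every value in the column is identical, the denominator is zero and the result is `NaN`. The sketch below is a hypothetical variant (`min_max_scale` is not part of the class code) that returns 0.0 in that case; whether that is the right convention depends on your model.

```python
import pandas as pd

def min_max_scale(series):
    """Return (x - min) / (max - min); a constant column maps to 0.0 instead of NaN."""
    span = series.max() - series.min()
    if span == 0:
        # All values are equal, so there is no spread to scale by.
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

wages = pd.Series([100.0, 150.0, 200.0])
scaled = min_max_scale(wages)                      # 0.0, 0.5, 1.0
constant = min_max_scale(pd.Series([5.0, 5.0]))    # 0.0, 0.0
print(scaled.tolist(), constant.tolist())
```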

### New vs Old Employers

This function, which defines "old" and "new" firms, is the same as the one we saw in class. Notice the `age_cutoff` parameter (how many years an employer has to exist to be considered an "old" employer), which you can choose to modify.

In [None]:
def employer_age_features(year, schema, age_cutoff = 5, 
                          db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
    
    sql_script = '''
    DROP TABLE IF EXISTS {schema}.features_age_{year};
    CREATE TABLE {schema}.features_age_{year} AS
    SELECT a.*, CASE WHEN b.flag = 1 THEN 0 ELSE 1 END AS new_employer
    FROM (
        SELECT ein, run, ui_acct 
        FROM {schema}.labels_{year}
    ) AS a
    LEFT JOIN (
        SELECT ein, run, ui_acct, 1 as flag 
        FROM kcmo_lehd.mo_qcew_employers
        WHERE year = {year}-{age_cutoff}
        AND qtr = 1   
    ) AS b
    ON a.ein = b.ein AND a.run = b.run AND a.ui_acct = b.ui_acct;
    
    ALTER TABLE {schema}.features_age_{year} OWNER TO {schema}_admin;    
    
    COMMIT;
    '''.format(year = year, schema = schema, age_cutoff = age_cutoff)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_age_{year}' 
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
        
    df = pd.read_sql('SELECT * FROM {schema}.features_age_{year}'.format(year = year, schema = schema), conn)  
    
    return df
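
To see what `new_employer` encodes, the following sketch reproduces the left join in pandas on made-up data. The `labels` and `earlier` frames are hypothetical stand-ins for the label table and for the employers present `age_cutoff` years before the focal year.

```python
import pandas as pd

keys = ["ein", "run", "ui_acct"]

# Employers in the label table for the focal year (made-up values).
labels = pd.DataFrame({
    "ein": ["A", "B", "C"], "run": ["1", "1", "1"], "ui_acct": ["10", "20", "30"],
})
# Employers already present `age_cutoff` years earlier (made-up values).
earlier = pd.DataFrame({
    "ein": ["A", "C"], "run": ["1", "1"], "ui_acct": ["10", "30"],
})

merged = labels.merge(earlier.assign(flag=1), how="left", on=keys)
# No match `age_cutoff` years back leaves flag as NaN, so the employer counts as new.
merged["new_employer"] = merged["flag"].isna().astype(int)
age_features = merged.drop(columns="flag")
print(age_features)
```

Only employer `B` has no record in the earlier year, so it is the only one flagged with `new_employer = 1`.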

### QWI Statistics

Here too, the QWI statistics features are the same as the ones we saw in class. You may want to modify the code to include more years of QWI statistics, or replace the change in level of the QWI metrics by a rate of change (change in percent).

In [None]:
def qwi_features(year, schema,
                 db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
    
    sql_script = '''
    DROP TABLE IF EXISTS {schema}.features_qwi_{year};
    CREATE TABLE {schema}.features_qwi_{year} AS
    SELECT a.*
            , b.nb_jobs_current_qtr AS m1_nb_jobs_current_qtr
            , b.emp_current_qtr AS m1_emp_current_qtr
            , b.emp_4qtrs_ago AS m1_emp_4qtrs_ago
            , b.emp_3qtrs_ago AS m1_emp_3qtrs_ago
            , b.emp_2qtrs_ago AS m1_emp_2qtrs_ago
            , b.emp_prev_qtr AS m1_emp_prev_qtr
            , b.emp_next_qtr AS m1_emp_next_qtr
            , b.emp_begin_qtr AS m1_emp_begin_qtr
            , b.emp_end_qtr AS m1_emp_end_qtr
            , b.emp_full_qtr AS m1_emp_full_qtr
            , b.accessions_current AS m1_accessions_current
            , b.accessions_consecutive_qtr AS m1_accessions_consecutive_qtr
            , b.accessions_full_qtr AS m1_accessions_full_qtr
            , b.separations AS m1_separations
            , b.new_hires AS m1_new_hires
            , b.recalls AS m1_recalls
    FROM(
        SELECT * 
        FROM ada_kcmo.qwi_ein_{year}_1
    ) AS a
    LEFT JOIN (
        SELECT *
        FROM ada_kcmo.qwi_ein_{year_m1}_1
    ) AS b
    ON a.ein = b.ein;
    
    ALTER TABLE {schema}.features_qwi_{year} OWNER TO {schema}_admin; 
    
    COMMIT;
    '''.format(year = year, schema = schema, year_m1 = year-1)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_qwi_{year}'
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
    
    df = pd.read_sql('SELECT * FROM {schema}.features_qwi_{year};'.format(schema = schema, year = year), conn)
    
    for var in ['nb_jobs_current_qtr', 'emp_current_qtr'
                , 'emp_4qtrs_ago', 'emp_3qtrs_ago', 'emp_2qtrs_ago', 'emp_prev_qtr', 'emp_next_qtr'
                , 'emp_begin_qtr', 'emp_end_qtr', 'emp_full_qtr'
                , 'accessions_current', 'accessions_consecutive_qtr', 'accessions_full_qtr'
                , 'separations', 'new_hires', 'recalls']:
        m1_var = 'm1_{}'.format(var)
        change_var = 'change_{}'.format(var)
        df[change_var] = df[var] - df[m1_var]
   
    # Remove NULL rows
    isnan_rows = df.isnull().any(axis=1)
    df = df[~isnan_rows]
    
    return df
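
If you prefer a rate of change to a change in level, one possible approach is sketched below on made-up data. The helper `pct_change_from_lag` is hypothetical; it assumes the lagged columns use the `m1_` prefix as in the function above, and it returns `NaN` when the lagged value is 0 rather than dividing by zero.

```python
import numpy as np
import pandas as pd

def pct_change_from_lag(df, var, lag_prefix="m1_"):
    """Relative change of `var` versus its lagged column; NaN when the lagged value is 0."""
    previous = df[lag_prefix + var].astype(float).replace(0, np.nan)
    return (df[var] - previous) / previous

# Made-up employment counts for the current year and the year before.
qwi = pd.DataFrame({
    "emp_current_qtr":    [110.0, 50.0, 8.0],
    "m1_emp_current_qtr": [100.0, 0.0, 10.0],
})
qwi["pct_change_emp_current_qtr"] = pct_change_from_lag(qwi, "emp_current_qtr")
print(qwi)
```

Keep in mind that `qwi_features` drops every row containing a null, so employers with a zero baseline would be removed unless you impute those values first.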

### Wages and Employees
Wage and employee statistics are pulled from the MO employers data. The statistics here are the same as in class, but feel free to add additional metrics.

In [None]:
def wages_features(year, schema,
                   db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()

    sql_script = '''
    DROP TABLE IF EXISTS {schema}.features_wages_{year};
    CREATE TABLE {schema}.features_wages_{year} AS    
    SELECT ein, run, ui_acct
            , mon1_empl+mon2_empl+mon3_empl AS total_empl
            , total_wage 
    FROM kcmo_lehd.mo_qcew_employers 
    WHERE year = {year} AND qtr = 1;
    
    ALTER TABLE {schema}.features_wages_{year} OWNER TO {schema}_admin; 
    
    COMMIT;
    '''.format(year = year, schema = schema)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_wages_{year}'
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()    
    
    df = pd.read_sql('SELECT * FROM {schema}.features_wages_{year}'.format(year = year, schema = schema), conn)
    df['avg_wage'] = df['total_wage']/df['total_empl']
    
    # Treat infinite average wages (division by zero employees) as missing
    df['avg_wage'] = df['avg_wage'].replace([np.inf, -np.inf], np.nan)
    
    # Impute the median wage value
    df['avg_wage'] = df['avg_wage'].fillna(df['avg_wage'].median())
    
    # Remove outliers (keep the filtered rows so the scaling below uses them)
    outlier_rows = ((df['avg_wage'] == 0) | (df['avg_wage'] > 50000))
    df = df[~outlier_rows].copy()
    
    # Scaling values
    df['total_wage_scaled'] = scaling_var(df, 'total_wage')
    df['total_empl_scaled'] = scaling_var(df, 'total_empl')
    df['avg_wage_scaled'] = scaling_var(df, 'avg_wage')
    
    return df
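
The cleaning steps inside `wages_features` can be traced on a small made-up table. This is only a sketch: the numbers are invented, and the 50000 threshold is simply the one used in the function above.

```python
import numpy as np
import pandas as pd

# Made-up quarterly totals; two rows have zero employees.
wages = pd.DataFrame({
    "total_wage": [30000.0, 108000.0, 12000.0, 0.0, 500000.0],
    "total_empl": [3, 9, 0, 0, 5],
})
wages["avg_wage"] = wages["total_wage"] / wages["total_empl"]  # yields inf and NaN for zero employees

# Step 1: infinite values become missing.
wages["avg_wage"] = wages["avg_wage"].replace([np.inf, -np.inf], np.nan)
# Step 2: missing values are imputed with the median of the observed ones.
wages["avg_wage"] = wages["avg_wage"].fillna(wages["avg_wage"].median())
# Step 3: rows with a zero or implausibly high average wage are dropped.
outliers = (wages["avg_wage"] == 0) | (wages["avg_wage"] > 50000)
clean = wages[~outliers].copy()
print(clean)
```

The two zero-employee rows receive the median (12000), and the row with an average wage of 100000 is removed as an outlier.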

In [None]:
df_wages = wages_features(2013, 'ada_kcmo')

In [None]:
df_wages.head()

## Combining all data

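We can now combine all our subsets of features into one features table. The merges below assume that `df_age`, `df_qwi`, and `df_labels` have already been created with `employer_age_features`, `qwi_features`, and `generate_labels` for the same year as `df_wages`.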

In [None]:
df_features = pd.merge(df_age, df_qwi, how = 'left', on = 'ein')

In [None]:
df_features = pd.merge(df_features, df_wages, how = 'left', on = ['ein', 'run', 'ui_acct'])

Let's merge our features with our labels.

In [None]:
df_table = pd.merge(df_labels, df_features, how = 'left', on = ['ein', 'run', 'ui_acct'])
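
Left merges silently keep unmatched rows and can duplicate rows when keys repeat, so it may be worth checking the result. The sketch below uses two made-up frames (`df_labels_toy` and `df_features_toy` are hypothetical) with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

keys = ["ein", "run", "ui_acct"]

# Made-up label and feature tables; employer C has no features.
df_labels_toy = pd.DataFrame({
    "ein": ["A", "B", "C"], "run": ["1"] * 3, "ui_acct": ["10", "20", "30"],
    "label": [0, 1, 0],
})
df_features_toy = pd.DataFrame({
    "ein": ["A", "B"], "run": ["1"] * 2, "ui_acct": ["10", "20"],
    "new_employer": [0, 1],
})

# validate raises an error if the keys are not unique on both sides;
# indicator adds a `_merge` column telling where each row came from.
merged = pd.merge(df_labels_toy, df_features_toy, how="left", on=keys,
                  validate="one_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[keys])
```

Here the merge keeps all three labeled employers, and `unmatched` shows that `C` has no feature row, which would surface as nulls in the final table.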

Let's now write the table into our class schema so we can use it for the Machine Learning notebook. In order to write a data table, we have to create an engine with SQLAlchemy (see notebook on Databases for more details).

In [None]:
# Let's check if the table already exists:  
conn = psycopg2.connect(database=db_name, host = hostname) #database connection
cursor = conn.cursor()    
cursor.execute('''
SELECT * FROM information_schema.tables 
WHERE table_name = 'table_employers_2013'
AND table_schema = 'ada_kcmo';
''')

# Let's write table if it does not exist (or if overwrite = True)
overwrite = False
if not(cursor.rowcount) or overwrite:
    engine = create_engine('postgresql://{}/{}'.format(hostname, db_name))
    df_table.to_sql('table_employers_2013', engine, schema = 'ada_kcmo', index = False, if_exists='replace')
    
    # Change Admin rights of table to admin
    conn = psycopg2.connect(database = db_name, host = hostname)
    cursor = conn.cursor()
    cursor.execute('ALTER TABLE ada_kcmo.table_employers_2013 OWNER TO ada_kcmo_admin; COMMIT;')

cursor.close() 

In [None]:
table_2013 = pd.read_sql('SELECT * FROM ada_kcmo.table_employers_2013 LIMIT 100', conn)
table_2013.head()
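
The "write only if the table is missing, unless `overwrite` is set" pattern used above can be tried without access to the class database. The sketch below is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL, so the existence check queries `sqlite_master` instead of `information_schema.tables`, and the helper `write_if_missing` and the table name are hypothetical.

```python
import sqlite3
import pandas as pd

def write_if_missing(df, table_name, conn, overwrite=False):
    """Write `df` unless `table_name` already exists; return True when a write happened."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone() is not None
    if exists and not overwrite:
        return False
    df.to_sql(table_name, conn, index=False, if_exists="replace")
    return True

conn_demo = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
toy = pd.DataFrame({"ein": ["A", "B"], "label": [0, 1]})

first = write_if_missing(toy, "table_employers_demo", conn_demo)           # table is created
second = write_if_missing(toy.head(1), "table_employers_demo", conn_demo)  # skipped: already exists
forced = write_if_missing(toy.head(1), "table_employers_demo", conn_demo, overwrite=True)
n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM table_employers_demo", conn_demo)["n"].iloc[0]
print(first, second, forced, n_rows)
```

Note the consequence of the default: with `overwrite = False`, a stale table is left untouched even if the DataFrame in memory has changed.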

## Overall Function for Label and Feature Generation

The function below wraps all of the steps above into a single call.

In [None]:
def generate_table(year, db_name = db_name, hostname = hostname, schema = 'ada_kcmo', overwrite = False):
    
    # Generate Labels
    print("Generating labels")
    df_label = generate_labels(year, schema, db_name = db_name, hostname = hostname, overwrite = overwrite)
    
    # Generate Features
    print("Generating features")
    df_age = employer_age_features(year, schema, db_name = db_name, hostname = hostname, overwrite = overwrite)
    df_qwi = qwi_features(year, schema, db_name = db_name, hostname = hostname, overwrite = overwrite)
    df_wages = wages_features(year, schema, db_name = db_name, hostname = hostname, overwrite = overwrite)
    
    # Merge Labels and Features together
    print("Merging labels and features")
    df_table = pd.merge(df_label, df_age, how = 'inner', on = ['ein', 'run', 'ui_acct'])
    df_table = pd.merge(df_table, df_qwi, how = 'inner', on = 'ein')
    df_table = pd.merge(df_table, df_wages, how = 'inner', on = ['ein', 'run', 'ui_acct'])
    
    # Removing NULL values
    isnan_rows = df_table.isnull().any(axis=1)
    df_table = df_table[~isnan_rows]
    
    # Write Table
    print("Writing table")
    
    # Let's check if the table already exists:  
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()    
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'table_employers_{year}'
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        table_name = 'table_employers_{}'.format(year)
        engine = create_engine('postgresql://{}/{}'.format(hostname, db_name))
        df_table.to_sql(table_name, engine, schema = schema, index = False, if_exists='replace')
        
        # Transfer ownership of the new table to the schema's admin role
        conn = psycopg2.connect(database = db_name, host = hostname)
        cursor = conn.cursor()
        cursor.execute('ALTER TABLE {schema}.table_employers_{year} OWNER TO {schema}_admin; COMMIT;'.format(schema = schema, year = year))

    cursor.close()        
    
    return df_table

In [None]:
df_table_2013 = generate_table(2013)

In [None]:
df_table_2014 = generate_table(2014)

In [None]:
df_table_2013.head()

In [None]:
df_table_2014.head()