<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. "ADA-KCMO-2018." Coleridge Initiative GitHub Repositories. 2018. https://github.com/Coleridge-Initiative/ada-kcmo-2018. [![DOI](https://zenodo.org/badge/119078858.svg)](https://zenodo.org/badge/latestdoi/119078858)

# Data Preparation for Machine Learning
----

## Python Setup

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit models.

In [None]:
%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) #database connection

## Creating Labels

Labels are the dependent variables, or *Y* variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, encoded as 1 or 0. In this case, our label is whether an employer that is at least one year old disappears in the coming year (1 if it disappears, 0 if it survives). We need to pick our year of prediction. We will look back one year to check that the employer existed one year ago, and forward one year to see whether the employer still exists one year from now.
> For this example, let's use 2013 (Q1) as our reference year (year of prediction).
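
To make the label logic concrete before running it against the database, here is a minimal sketch on toy data. The three small DataFrames and their `id` values are hypothetical stand-ins for the Q1 employer lists at *t-1*, *t*, and *t+1*; the real function below does the same thing in SQL.

```python
import pandas as pd

# Hypothetical toy data: employer ids observed in Q1 of each year
prev_year = pd.DataFrame({'id': ['A', 'B', 'C']})        # t-1
curr_year = pd.DataFrame({'id': ['A', 'B', 'C', 'D']})   # t
next_year = pd.DataFrame({'id': ['A', 'C', 'D']})        # t+1

# Keep employers present at both t-1 and t (at least one year old)
cohort = curr_year.merge(prev_year, on='id', how='inner')

# Left join to t+1 and flag employers that are no longer there
merged = cohort.merge(next_year, on='id', how='left', indicator=True)
cohort['label'] = (merged['_merge'] == 'left_only').astype(int)

print(cohort)
```

Employer `D` is excluded because it did not exist at *t-1*, and employer `B` gets label 1 because it is missing at *t+1*.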

In [None]:
def generate_labels(year, db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
   
    sql_script="""
    -- First, let's make a list of the employers present at time t: Q1 of the prediction year

    DROP TABLE IF EXISTS ada_kcmo.labels_{year};
    CREATE TABLE ada_kcmo.labels_{year} AS
    SELECT CONCAT(a.ein, a.run, a.ui_acct) AS id
            , a.ein, a.run, a.ui_acct
            , case when b.flag = 1 then 0 else 1 end as label 
    FROM (
        SELECT x.ein, x.run, x.ui_acct
        FROM (
            SELECT ein, run, ui_acct
            FROM kcmo_lehd.mo_qcew_employers
            WHERE year = {year}
            AND qtr = 1
        ) AS x
        INNER JOIN (
            SELECT ein, run, ui_acct
            FROM kcmo_lehd.mo_qcew_employers
            WHERE year = {year}-1
            AND qtr = 1
        ) AS y
        ON x.ein = y.ein AND x.run = y.run AND x.ui_acct = y.ui_acct
    ) AS a
    LEFT JOIN (
        SELECT ein, run, ui_acct, 1 as flag 
        FROM kcmo_lehd.mo_qcew_employers
        WHERE year = {year}+1
        AND qtr = 1   
    ) AS b
    ON a.ein = b.ein AND a.run = b.run AND a.ui_acct = b.ui_acct;
    
    ALTER TABLE ada_kcmo.labels_{year} OWNER TO ada_kcmo_admin;

    COMMIT;

    """.format(year = year)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'labels_{year}' 
    AND table_schema = 'ada_kcmo';
    '''.format(year = year))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
    
    df = pd.read_sql('SELECT * FROM ada_kcmo.labels_{}'.format(year), conn)  
    
    return df

In [None]:
df_labels = generate_labels(2013)

In [None]:
pd.crosstab(index = df_labels['label'], columns =  'count')

## Creating Features

Our features are our independent variables or predictors. Good features make machine learning systems effective. 
The better the features, the easier it is to capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand than extensively searching for the "right" model and the "right" set of parameters. 

Machine Learning Algorithms learn a solution to a problem from sample data. The set of features is the best representation of the sample data to learn a solution to a problem. 

- **Feature engineering** is "the process of transforming raw data into features that better represent the underlying problem/data/structure to the predictive models, resulting in improved model accuracy on unseen data" (from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)). In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

Examples of feature engineering are: 

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by equal width. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one feature, aggregating over varying windows of time and space. For example, given urban data, 
we would want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius
of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
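
The four kinds of feature engineering listed above can each be sketched in a line or two of `pandas`. The DataFrame below, with its `city` and `wage` columns, is a hypothetical toy example and not part of the class data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'city': ['KC', 'STL', 'KC', 'STL'],
    'wage': [1000.0, 2000.0, 4000.0, 8000.0],
})

# Transformation: log of wage
df['log_wage'] = np.log(df['wage'])

# Dummy variables: one binary column per city
dummies = pd.get_dummies(df['city'], prefix='city')

# Discretization: two equal-width bins over the range of wage
df['wage_bin'] = pd.cut(df['wage'], bins=2, labels=['low', 'high'])

# Aggregation: mean and count of wage per city
agg = df.groupby('city')['wage'].agg(['mean', 'count'])

print(df)
print(dummies)
print(agg)
```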

>Our preliminary features are the following
>
>- `n_spells` (Aggregation): Total number of spells someone has had up until the date of prediction.
>- `age` (Transformation): The age feature is created by subtracting `bdate_year` from the current year of prediction.
>- `edlevel` (Binary): 0 if the person has less than a high school education and 1 if they have more than a high school education.
>- `workexp` (Binary): 0 if there is no work experience, 1 if there is some work experience.
>- `married` (Binary): 1 if the person is married, 0 if they are not.
>- `gender` (Binary): 1 (male), 2 (female).
>- `n_days_last_spell` (Aggregation): The number of days since a person's last spell.
>- `(foodstamp, tanf, granf)` (Binary): 0 if the last benefit was not foodstamp, tanf or grantf, 1 if it was.

### New vs Old Employers

Let's create a first binary feature that distinguishes "old" from "new" firms. A firm counts as old if it already existed `age_cutoff` years before the year of prediction; the default cutoff is 5 years.
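
As a rough illustration of the old/new flag, here is a sketch on hypothetical toy ids. It uses `isin` in place of the SQL `LEFT JOIN` used by the function below, which gives the same flag when ids are unique.

```python
import pandas as pd

# Hypothetical toy data: employers in our cohort, and employers
# that were already present `age_cutoff` years earlier
cohort = pd.DataFrame({'id': ['A', 'B', 'C']})
present_at_cutoff = pd.DataFrame({'id': ['A', 'C']})

# new_employer = 1 if the employer was NOT present at the cutoff year
cohort['new_employer'] = (~cohort['id'].isin(present_at_cutoff['id'])).astype(int)

print(cohort)
```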

In [None]:
def employer_age_features(year, age_cutoff = 5, db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
    
    sql_script = '''
    DROP TABLE IF EXISTS ada_kcmo.features_age_{year};
    CREATE TABLE ada_kcmo.features_age_{year} AS
    SELECT a.*, CASE WHEN b.flag = 1 THEN 0 ELSE 1 END AS new_employer
    FROM (
        SELECT ein, run, ui_acct 
        FROM ada_kcmo.labels_{year}
    ) AS a
    LEFT JOIN (
        SELECT ein, run, ui_acct, 1 as flag 
        FROM kcmo_lehd.mo_qcew_employers
        WHERE year = {year}-{age_cutoff}
        AND qtr = 1   
    ) AS b
    ON a.ein = b.ein AND a.run = b.run AND a.ui_acct = b.ui_acct;
    
    ALTER TABLE ada_kcmo.features_age_{year} OWNER TO ada_kcmo_admin;    
    
    COMMIT;
    '''.format(year = year, age_cutoff = age_cutoff)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_age_{year}' 
    AND table_schema = 'ada_kcmo';
    '''.format(year = year))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
        
    df = pd.read_sql('SELECT * FROM ada_kcmo.features_age_{}'.format(year), conn)  
    
    return df

In [None]:
df_age = employer_age_features(2013)

In [None]:
df_age.head()

## QWI Statistics

The next set of features we would like to include are the QWI (Quarterly Workforce Indicators) statistics. Since we are looking at firms that are at least one year old, it might be interesting to consider both the current QWI numbers and the numbers from the year before. 

Note that these statistics are taken at company level (EIN), instead of individual entity level (combination of EIN, RUN, and UI Account Number). This is because QWI is calculated at firm level. We therefore merge on EIN, instead of using all three variables.
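
One consequence of merging on EIN alone is that every establishment of a firm receives the same firm-level values. The sketch below shows this on hypothetical toy data; the `validate` argument is an optional safeguard (not used in the notebook's own code) that raises an error if the firm-level table has duplicate EINs.

```python
import pandas as pd

# Hypothetical toy data: three establishments belonging to two firms
establishments = pd.DataFrame({
    'ein': ['11', '11', '22'],
    'run': ['0', '1', '0'],
    'ui_acct': ['x', 'x', 'y'],
})
firm_stats = pd.DataFrame({'ein': ['11', '22'], 'emp_current_qtr': [50, 8]})

# Many establishments map to one firm-level row
merged = establishments.merge(firm_stats, on='ein', how='left',
                              validate='many_to_one')

print(merged)
```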

In [None]:
conn = psycopg2.connect(database = db_name, host = hostname)

In [None]:
df_qwi = pd.read_sql('SELECT * FROM ada_kcmo.qwi_ein_{year}_1'.format(year = 2013), conn)

In [None]:
df_qwi.head()

Let's also consider the QWI statistics one year before our prediction quarter. We can create additional features that capture the change in level of the QWI statistics.

In [None]:
df_qwi_m1 = pd.read_sql('SELECT * FROM ada_kcmo.qwi_ein_{year}_1'.format(year = 2012), conn)
df_qwi_m1 = df_qwi_m1.add_prefix('m1_')

In [None]:
df_qwi_m1.head()

In [None]:
df_qwi = pd.merge(df_qwi, df_qwi_m1, how = 'left', left_on = 'ein', right_on = 'm1_ein')

In [None]:
for var in ['nb_jobs_current_qtr', 'emp_current_qtr'
            , 'emp_4qtrs_ago', 'emp_3qtrs_ago', 'emp_2qtrs_ago', 'emp_prev_qtr', 'emp_next_qtr'
            , 'emp_begin_qtr', 'emp_end_qtr', 'emp_full_qtr'
            , 'accessions_current', 'accessions_consecutive_qtr', 'accessions_full_qtr'
            , 'separations', 'new_hires', 'recalls']:
    m1_var = 'm1_{}'.format(var)
    change_var = 'change_{}'.format(var)
    df_qwi[change_var] = df_qwi[var] - df_qwi[m1_var]

### Dropping Missing Values
`NULL` values will make it impossible to run our Machine Learning algorithm. Let's see if there are any in the data.
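
To see where such missing values come from, consider this sketch on hypothetical toy data: an employer with no match in the previous year gets `NaN` after a left merge, the `NaN` propagates into the change feature, and the row-wise mask picks it up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: employer '22' has no value for the year before
df = pd.DataFrame({
    'ein': ['11', '22', '33'],
    'emp': [50.0, 8.0, 12.0],
    'm1_emp': [45.0, np.nan, 12.0],
})

# NaN in either operand gives NaN in the difference
df['change_emp'] = df['emp'] - df['m1_emp']

# True for every row with at least one missing value
isnan_rows = df.isnull().any(axis=1)
share = isnan_rows.mean()       # fraction of rows with NaNs
clean = df[~isnan_rows]

print(share)
print(clean)
```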

In [None]:
isnan_rows = df_qwi.isnull().any(axis=1)

In [None]:
df_qwi[isnan_rows].head()

In [None]:
nrows_df_qwi = df_qwi.shape[0]
nrows_df_qwi_isnan = df_qwi[isnan_rows].shape[0]
print('Share of rows with NaNs: {:.2%}'.format(float(nrows_df_qwi_isnan)/nrows_df_qwi))

In [None]:
df_qwi = df_qwi[~isnan_rows]

Let's combine the two previous queries into a single SQL query that will retrieve all the relevant QWI statistics.

In [None]:
def qwi_features(year, db_name = db_name, hostname = hostname, overwrite = False):
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()
    
    sql_script = '''
    DROP TABLE IF EXISTS ada_kcmo.features_qwi_{year};
    CREATE TABLE ada_kcmo.features_qwi_{year} AS
    SELECT a.*
            , b.nb_jobs_current_qtr AS m1_nb_jobs_current_qtr
            , b.emp_current_qtr AS m1_emp_current_qtr
            , b.emp_4qtrs_ago AS m1_emp_4qtrs_ago
            , b.emp_3qtrs_ago AS m1_emp_3qtrs_ago
            , b.emp_2qtrs_ago AS m1_emp_2qtrs_ago
            , b.emp_prev_qtr AS m1_emp_prev_qtr
            , b.emp_next_qtr AS m1_emp_next_qtr
            , b.emp_begin_qtr AS m1_emp_begin_qtr
            , b.emp_end_qtr AS m1_emp_end_qtr
            , b.emp_full_qtr AS m1_emp_full_qtr
            , b.accessions_current AS m1_accessions_current
            , b.accessions_consecutive_qtr AS m1_accessions_consecutive_qtr
            , b.accessions_full_qtr AS m1_accessions_full_qtr
            , b.separations AS m1_separations
            , b.new_hires AS m1_new_hires
            , b.recalls AS m1_recalls
    FROM(
        SELECT * 
        FROM ada_kcmo.qwi_ein_{year}_1
    ) AS a
    LEFT JOIN (
        SELECT *
        FROM ada_kcmo.qwi_ein_{year_m1}_1
    ) AS b
    ON a.ein = b.ein;
    
    ALTER TABLE ada_kcmo.features_qwi_{year} OWNER TO ada_kcmo_admin; 
    
    COMMIT;
    '''.format(year = year, year_m1 = year-1)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_qwi_{year}'
    AND table_schema = 'ada_kcmo';
    '''.format(year = year))
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()
    
    df = pd.read_sql('SELECT * FROM ada_kcmo.features_qwi_{};'.format(year), conn)
    
    for var in ['nb_jobs_current_qtr', 'emp_current_qtr'
                , 'emp_4qtrs_ago', 'emp_3qtrs_ago', 'emp_2qtrs_ago', 'emp_prev_qtr', 'emp_next_qtr'
                , 'emp_begin_qtr', 'emp_end_qtr', 'emp_full_qtr'
                , 'accessions_current', 'accessions_consecutive_qtr', 'accessions_full_qtr'
                , 'separations', 'new_hires', 'recalls']:
        m1_var = 'm1_{}'.format(var)
        change_var = 'change_{}'.format(var)
        df[change_var] = df[var] - df[m1_var]
   
    # Remove NULL rows
    isnan_rows = df.isnull().any(axis=1)
    df = df[~isnan_rows]
    
    return df

In [None]:
df_qwi = qwi_features(2013)

In [None]:
df_qwi.head()

## Wages and Employees

Let's use wage and employee statistics from the MO wage records.

In [None]:
conn = psycopg2.connect(database = db_name, host = hostname)

In [None]:
query = '''
SELECT ein, run, ui_acct
        , mon1_empl+mon2_empl+mon3_empl AS total_empl
        , total_wage 
FROM kcmo_lehd.mo_qcew_employers 
WHERE year = 2013 AND qtr = 1
'''

In [None]:
df_wages = pd.read_sql(query, conn)

Let's create an additional feature for average monthly wage.

In [None]:
df_wages['avg_wage'] = df_wages['total_wage']/df_wages['total_empl']

### Imputation 

It is important to do a quick check of our matrix to see if we have any outlier values. 

In [None]:
df_wages.describe(include = 'all', percentiles=[0.01,0.05,0.25,0.50,0.75,0.95,0.99])

Because of some data inconsistencies in total employees and total wages, some average wages could not be calculated (when `total_empl == 0` and `total_wage == 0`) and some have `inf` values (when `total_empl == 0` but `total_wage` is positive). These `NULL` and `inf` values will be problematic for the machine learning algorithm. 

Let's impute these missing values with the median value of all average wages.
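
The sketch below reproduces both problem cases on hypothetical toy numbers and then applies median imputation. It is only an illustration of the idea; the cells that follow apply it to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is 0/0 (NaN), row 2 is 500/0 (inf)
df = pd.DataFrame({
    'total_wage': [3000.0, 0.0, 500.0, 9000.0],
    'total_empl': [3, 0, 0, 3],
})
df['avg_wage'] = df['total_wage'] / df['total_empl']

# Treat infinite values as missing
df['avg_wage'] = df['avg_wage'].replace([np.inf, -np.inf], np.nan)

# The median ignores NaN, so it is computed from the valid rows only
median_avg_wage = df['avg_wage'].median()
df['avg_wage'] = df['avg_wage'].fillna(median_avg_wage)

print(df)
```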

In [None]:
# Flag NULL and infinite average wages and set them all to NaN
mask = ((df_wages['avg_wage'].isnull()) | (df_wages['avg_wage'] == np.inf))
df_wages.loc[mask, 'avg_wage'] = np.nan

In [None]:
median_avg_wage = df_wages['avg_wage'].median()
print(median_avg_wage)

In [None]:
df_wages['avg_wage'] = df_wages['avg_wage'].fillna(median_avg_wage)

In [None]:
df_wages.describe(include = 'all')

### Removing Outliers 

Some values of average wage still seem impossible or very unlikely. Certain employers have an average wage of 0, and some outliers have average wages far exceeding the 99th percentile. These are things you'd want to do a "sanity check" on with someone who knows the data well.

Here, we believe these are data errors and choose to drop these values.
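
As a quick illustration on hypothetical toy values, a boolean mask marks the rows to drop, and its mean gives the share of rows affected. The 50,000 threshold is the same one used in the cells below.

```python
import pandas as pd

# Hypothetical toy data: one zero wage and one implausibly high wage
wages = pd.DataFrame({'avg_wage': [0.0, 1500.0, 3200.0, 75000.0]})

# True for rows we consider data errors
outlier_rows = (wages['avg_wage'] == 0) | (wages['avg_wage'] > 50000)

share = outlier_rows.mean()     # fraction of rows flagged
kept = wages[~outlier_rows]

print(share)
print(kept)
```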

In [None]:
# Find all rows where the wage is 0 or above 50,000 per month
outlier_rows = ((df_wages['avg_wage'] == 0) | (df_wages['avg_wage'] > 50000))
df_wages[outlier_rows].head()

In [None]:
nrows_wages = df_wages.shape[0]
nrows_wages_outliers = df_wages[outlier_rows].shape[0]

In [None]:
print('Share of outlier rows: {:.2%}'.format(float(nrows_wages_outliers)/nrows_wages))

In [None]:
df_wages = df_wages[~outlier_rows].copy()

### Scaling of Values

Certain models have trouble when features are on very different scales, such as number of employees and average wages. Number of employees is typically a number between 1 and 100, while average wages are usually between 1000 and 4000. To circumvent this problem, we can rescale our features, here to the range from 0 to 1.
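
The rescaling used here is min-max scaling: subtract the minimum and divide by the range. Below is a minimal sketch on hypothetical toy values. The helper name `min_max_scale` and the guard for a constant column are our own assumptions; the notebook's own function does not include that guard.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric Series to the [0, 1] range."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Assumption: map a constant column to 0 to avoid dividing by zero
        return series * 0.0
    return (series - lo) / (hi - lo)

# Hypothetical toy data
wages = pd.Series([1000.0, 2500.0, 4000.0])
scaled = min_max_scale(wages)

print(scaled.tolist())
```

Note that the minimum and maximum are taken from the data at hand, so a single extreme value compresses all other values toward 0, which is one reason to remove outliers before scaling.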

In [None]:
# Example: let's scale average wages:
min_avg_wage = df_wages['avg_wage'].min()
max_avg_wage = df_wages['avg_wage'].max()

df_wages['avg_wage_scaled'] = (df_wages['avg_wage']-min_avg_wage)/(max_avg_wage-min_avg_wage)

In [None]:
df_wages[['avg_wage', 'avg_wage_scaled']].describe()

In [None]:
# Replace the original var by the scaled var
df_wages['avg_wage'] = df_wages['avg_wage_scaled']
del df_wages['avg_wage_scaled']

This generic function can be used to scale other variables.

In [None]:
def scaling_var(df, var):
    min_var = df[var].min()
    max_var = df[var].max()
    scaled_var = '{}_scaled'.format(var)

    df[scaled_var] = (df[var] - min_var)/(max_var - min_var)
    
    return df[scaled_var]

In [None]:
df_wages['total_empl_scaled'] = scaling_var(df_wages, 'total_empl')
df_wages['total_wage_scaled'] = scaling_var(df_wages, 'total_wage')

All the steps above can be summarized in the following function:

In [None]:
def wages_features(year, db_name = db_name, hostname = hostname, overwrite = False):
    
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()

    sql_script = '''
    DROP TABLE IF EXISTS ada_kcmo.features_wages_{year};
    CREATE TABLE ada_kcmo.features_wages_{year} AS    
    SELECT ein, run, ui_acct
            , mon1_empl+mon2_empl+mon3_empl AS total_empl
            , total_wage 
    FROM kcmo_lehd.mo_qcew_employers 
    WHERE year = {year} AND qtr = 1;
    
    ALTER TABLE ada_kcmo.features_wages_{year} OWNER TO ada_kcmo_admin; 
    
    COMMIT;
    '''.format(year = year)
    
    # Let's check if the table already exists:
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'features_wages_{year}'
    AND table_schema = 'ada_kcmo';
    '''.format(year = year))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        cursor.execute(sql_script)
    
    cursor.close()    
    
    df = pd.read_sql('SELECT * FROM ada_kcmo.features_wages_{}'.format(year), conn)
    df['avg_wage'] = df['total_wage']/df['total_empl']
    
    # Flag null, infinite average wage values
    mask = ((df['avg_wage'].isnull()) | (df['avg_wage'] == np.inf))
    df.loc[mask, 'avg_wage'] = np.nan
    
    # Impute the median wage value
    df['avg_wage'] = df['avg_wage'].fillna(df['avg_wage'].median())
    
    # Remove Outliers (reassign df so the filtered rows are actually dropped)
    outlier_rows = ((df['avg_wage'] == 0) | (df['avg_wage'] > 50000))
    df = df[~outlier_rows].copy()
    
    # Scaling values
    df['total_wage_scaled'] = scaling_var(df, 'total_wage')
    df['total_empl_scaled'] = scaling_var(df, 'total_empl')
    df['avg_wage_scaled'] = scaling_var(df, 'avg_wage')
    
    return df

In [None]:
df_wages = wages_features(2013)

In [None]:
df_wages.head()

## Combining all data

We can now combine all our subsets of features into one features table.
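
The choice of join type matters when combining tables. In this sketch on hypothetical toy data, a left merge keeps every row and fills unmatched features with `NaN`, while an inner merge drops unmatched rows altogether.

```python
import pandas as pd

# Hypothetical toy data: employer '22' has no features
labels = pd.DataFrame({'ein': ['11', '22', '33'], 'label': [0, 1, 0]})
features = pd.DataFrame({'ein': ['11', '33'], 'avg_wage': [0.2, 0.7]})

left = labels.merge(features, on='ein', how='left')     # 3 rows, one NaN
inner = labels.merge(features, on='ein', how='inner')   # 2 rows, no NaN

print(left)
print(inner)
```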

In [None]:
df_features = pd.merge(df_age, df_qwi, how = 'left', on = 'ein')

In [None]:
df_features = pd.merge(df_features, df_wages, how = 'left', on = ['ein', 'run', 'ui_acct'])

Let's merge our features with our labels.

In [None]:
df_table = pd.merge(df_labels, df_features, how = 'left', on = ['ein', 'run', 'ui_acct'])

Let's now write the table into our class schema so we can use it for the Machine Learning notebook. In order to write a data table, we have to create an engine with SQLAlchemy (see notebook on Databases for more details).
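
The "write only if the table does not exist, unless `overwrite` is set" pattern can be tried out without access to the class database. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL, so the existence check queries `sqlite_master` instead of `information_schema.tables`; the table name and the helper `table_exists` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table to write
df = pd.DataFrame({'id': ['A', 'B'], 'label': [0, 1]})

# Assumption: in-memory SQLite stands in for the PostgreSQL server
conn = sqlite3.connect(':memory:')

def table_exists(conn, name):
    """Return True if a table with this name exists in the SQLite database."""
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None

overwrite = False
if not table_exists(conn, 'table_employers_demo') or overwrite:
    df.to_sql('table_employers_demo', conn, index=False, if_exists='replace')

# Read the table back to confirm the write
back = pd.read_sql('SELECT * FROM table_employers_demo', conn)
print(back)
```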

In [None]:
# Let's check if the table already exists:  
conn = psycopg2.connect(database=db_name, host = hostname) #database connection
cursor = conn.cursor()    
cursor.execute('''
SELECT * FROM information_schema.tables 
WHERE table_name = 'table_employers_2013'
AND table_schema = 'ada_kcmo';
''')

# Let's write table if it does not exist (or if overwrite = True)
overwrite = False
if not(cursor.rowcount) or overwrite:
    engine = create_engine('postgresql://{}/{}'.format(hostname, db_name))
    df_table.to_sql('table_employers_2013', engine, schema = 'ada_kcmo', index = False, if_exists='replace')
    
    # Change the owner of the table to the admin role
    conn = psycopg2.connect(database = db_name, host = hostname)
    cursor = conn.cursor()
    cursor.execute('ALTER TABLE ada_kcmo.table_employers_2013 OWNER TO ada_kcmo_admin; COMMIT;')

cursor.close() 

In [None]:
table_2013 = pd.read_sql('SELECT * FROM ada_kcmo.table_employers_2013 LIMIT 100', conn)
table_2013.head()

## Overall Function for Label and Features Generation:

We have wrapped all the steps above into one general function below.

In [None]:
def generate_table(year, db_name = db_name, hostname = hostname, schema = 'ada_kcmo', overwrite = False):
    
    # Generate Labels
    print("Generating labels")
    df_label = generate_labels(year, db_name = db_name, hostname = hostname, overwrite = overwrite)
    
    # Generate Features
    print("Generating features")
    df_age = employer_age_features(year, db_name = db_name, hostname = hostname, overwrite = overwrite)
    df_qwi = qwi_features(year, db_name = db_name, hostname = hostname, overwrite = overwrite)
    df_wages = wages_features(year, db_name = db_name, hostname = hostname, overwrite = overwrite)
    
    # Merge Labels and Features together
    print("Merging labels and features")
    df_table = pd.merge(df_label, df_age, how = 'inner', on = ['ein', 'run', 'ui_acct'])
    df_table = pd.merge(df_table, df_qwi, how = 'inner', on = 'ein')
    df_table = pd.merge(df_table, df_wages, how = 'inner', on = ['ein', 'run', 'ui_acct'])
    
    # Removing NULL values
    isnan_rows = df_table.isnull().any(axis=1)
    df_table = df_table[~isnan_rows]
    
    # Write Table
    print("Writing table")
    
    # Let's check if the table already exists:  
    conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    cursor = conn.cursor()    
    cursor.execute('''
    SELECT * FROM information_schema.tables 
    WHERE table_name = 'table_employers_{year}'
    AND table_schema = '{schema}';
    '''.format(year = year, schema = schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        table_name = 'table_employers_{}'.format(year)
        engine = create_engine('postgresql://{}/{}'.format(hostname, db_name))
        # Use the schema argument rather than a hard-coded schema name
        df_table.to_sql(table_name, engine, schema = schema, index = False, if_exists='replace')
        
        # Change the owner of the table to the admin role
        conn = psycopg2.connect(database = db_name, host = hostname)
        cursor = conn.cursor()
        cursor.execute('ALTER TABLE {}.{} OWNER TO ada_kcmo_admin; COMMIT;'.format(schema, table_name))

    cursor.close()        
    
    return df_table

In [None]:
df_table_2013 = generate_table(2013)

In [None]:
df_table_2014 = generate_table(2014)

In [None]:
df_table_2013.head()

In [None]:
df_table_2014.head()