<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.

_Citation to be updated on export_

# Data Preparation for Machine Learning - Feature Creation
----

## Python Setup

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials.

In [None]:
%pylab inline
import time  # used below to time the feature-creation function

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"

## Creating Features

Our features are our independent variables or predictors. Good features make machine learning systems effective. 
The better the features, the easier it is to capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model than the reverse. Keeping the model simple makes it faster to train and easier to understand than extensively searching for the "right" model and the "right" set of parameters. 

Machine learning algorithms learn a solution to a problem from sample data. A good set of features is the representation of that sample data from which the solution is easiest to learn. 

- **Feature engineering** is the process of transforming raw data into features that better represent the underlying problem, data, and structure to the predictive models, resulting in improved model accuracy on unseen data (from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)). In text, for example, this might involve deriving traits such as word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

Examples of feature engineering include: 

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often created by taking a categorical variable
(such as city), which has no numerical value, and adding it to the model as one binary column per category.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by various approaches like equal width, deciles, Fisher-Jenks, etc. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one feature, aggregating over varying windows of time and space. For example, for policing or criminal justice problems, we may want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
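
The four kinds of features listed above can be sketched in a few lines of `pandas`. This is only an illustration on made-up data: the column names (`person_id`, `city`, `wage`) and the choice of three equal-width bins are assumptions, not part of the course database.

```python
import numpy as np
import pandas as pd

# Made-up toy data; column names are hypothetical, not from the course database.
df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 3],
    "city": ["Columbus", "Columbus", "Dayton", "Dayton", "Akron"],
    "wage": [1000.0, 3000.0, 0.0, 500.0, 12000.0],
})

# Transformation: log(1 + x), which is still defined when the wage is zero.
df["log_wage"] = np.log1p(df["wage"])

# Dummy variables: one 0/1 column per city.
city_dummies = pd.get_dummies(df["city"], prefix="city").astype(int)

# Discretization: three equal-width bins over the observed wage range.
df["wage_bin"] = pd.cut(df["wage"], bins=3, labels=["low", "mid", "high"])

# Aggregation: summarize each person's rows into one row of features.
agg = df.groupby("person_id")["wage"].agg(["count", "min", "max", "mean"])

print(city_dummies)
print(df)
print(agg)
```

Equal-width bins are sensitive to extreme values (here a single large wage stretches the range), which is one reason quantile-based bins such as deciles are often tried as well.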

## Graduate demographics

### Step by Step Approach

In [None]:
conn = psycopg2.connect(database=db_name, host = hostname)
cursor = conn.cursor()

In [None]:
# first let's confirm what our cohort table currently has
sql = '''
select * from ada_edwork.no_job_cohort_2007
limit 10;
'''
pd.read_sql(sql, conn)

In [None]:
# add demographic columns to the cohort table
sql = """
ALTER TABLE ada_edwork.no_job_cohort_2007 
    ADD COLUMN years_old int,
    ADD COLUMN gender text,
    ADD COLUMN ethnicity text;
"""

cursor.execute(sql)

In [None]:
# update columns from the che_completions table
sql = '''
UPDATE ada_edwork.no_job_cohort_2007 a SET (years_old, gender, ethnicity)
    = (2007 - birth_year, b.gender_code, b.ethnicity_code)
FROM in_data_2019.che_completions b
WHERE a.ssn = b.ssn AND a.degree_conferred_date = b.degree_conferred_date;
'''
cursor.execute(sql)
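
For readers more comfortable with `pandas` than SQL, the logic of the `UPDATE ... FROM` statement can be sketched as a left join on the same two keys. The two small DataFrames below are made-up stand-ins for the cohort and completions tables, and the code values in them are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for the cohort and completions tables.
cohort = pd.DataFrame({
    "ssn": ["A", "B", "C"],
    "degree_conferred_date": ["2007-05-01", "2007-06-15", "2007-12-20"],
})
completions = pd.DataFrame({
    "ssn": ["A", "B"],
    "degree_conferred_date": ["2007-05-01", "2007-06-15"],
    "birth_year": [1985, 1979],
    "gender_code": ["F", "M"],       # invented codes
    "ethnicity_code": ["X1", "X2"],  # invented codes
})

# A left join keeps every cohort row; unmatched rows get missing values,
# just as the UPDATE leaves unmatched rows NULL.
merged = cohort.merge(completions, on=["ssn", "degree_conferred_date"], how="left")
merged["years_old"] = 2007 - merged["birth_year"]
merged = merged.rename(columns={"gender_code": "gender", "ethnicity_code": "ethnicity"})
features = merged[["ssn", "degree_conferred_date", "years_old", "gender", "ethnicity"]]

print(features)
```

One difference to keep in mind: if several completions rows match the same cohort row, the join produces duplicate rows, whereas PostgreSQL's `UPDATE ... FROM` silently uses only one of the matches.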

In [None]:
df = pd.read_sql('select * from ada_edwork.no_job_cohort_2007;', conn)

In [None]:
df.head()

In [None]:
cursor.close()

In [None]:
conn = psycopg2.connect(database=db_name, host = hostname)

### Define Function

In order to facilitate creating this feature for several years of data, we combined all the above steps into a Python function, and added a final step that writes the feature table to the database.

Note that we assume the corresponding `<prefix>cohort_<year>` table has already been created.

In [None]:
# Insert team table prefix
tbl_prefix = 'no_job_'

In [None]:
def grad_demographics(YEAR, prefix = tbl_prefix):
    """Add years_old, gender, and ethnicity to <prefix>cohort_<YEAR> and return the table."""
    # set the database connection
    conn = psycopg2.connect(database=db_name, host = hostname) 
    cursor = conn.cursor()
    
    print("Adding demographic features")    
        
    # ALTER adds the empty columns; UPDATE fills them from the completions table
    sql = '''
    
    ALTER TABLE ada_edwork.{pref}cohort_{year} 
        ADD COLUMN years_old int,
        ADD COLUMN gender text,
        ADD COLUMN ethnicity text;
    
    commit;
    
    UPDATE ada_edwork.{pref}cohort_{year}  a 
        SET (years_old, gender, ethnicity)
        = ({year} - birth_year, b.gender_code, b.ethnicity_code)
    FROM in_data_2019.che_completions b
    WHERE a.ssn = b.ssn AND a.degree_conferred_date = b.degree_conferred_date;
    
    commit;
    '''.format(pref=prefix, year=YEAR)  
    # print(sql)  # uncomment to debug
    cursor.execute(sql)
        
    print("demographic features added")
    
    cursor.close()
    
    # read the updated table back into a DataFrame
    sql = '''
    SELECT * FROM ada_edwork.{pref}cohort_{year};
    '''.format(pref=prefix, year=YEAR) 
    df = pd.read_sql(sql, conn)  
    
    # close the connection so repeated calls do not leave sessions open
    conn.close()
    
    return df
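
Because the function pastes the prefix and year directly into the SQL text, it can help to build and inspect the statement separately before running it. The sketch below is one possible way to do that; `build_demographics_sql` is a hypothetical helper (not part of the notebook), and the rule for what counts as a valid prefix is an assumption.

```python
import re


def build_demographics_sql(year, prefix="no_job_", schema="ada_edwork"):
    """Return the ALTER/UPDATE statements for one cohort table (hypothetical helper)."""
    year = int(year)  # raises ValueError for non-numeric input
    # Assumed rule: prefixes are lowercase letters, digits, and underscores only.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", prefix):
        raise ValueError("unexpected table prefix: {!r}".format(prefix))
    table = "{}.{}cohort_{}".format(schema, prefix, year)
    return '''
    ALTER TABLE {table}
        ADD COLUMN years_old int,
        ADD COLUMN gender text,
        ADD COLUMN ethnicity text;

    UPDATE {table} a
        SET (years_old, gender, ethnicity)
        = ({year} - birth_year, b.gender_code, b.ethnicity_code)
    FROM in_data_2019.che_completions b
    WHERE a.ssn = b.ssn AND a.degree_conferred_date = b.degree_conferred_date;
    '''.format(table=table, year=year)


# Inspect the generated statement; nothing is sent to the database here.
print(build_demographics_sql(2008))
```

Printing the statement for one year is a cheap check that the table name and the age calculation come out as intended before the loop alters several tables.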

In [None]:
start_time = time.time()
df_test1 = grad_demographics(2007)
print('demographic features added in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
years = [2008, 2009, 2010, 2011]

for year in years:
    start_time = time.time()
    df = grad_demographics(year)
    print('demographic features added in {:.2f} seconds'.format(time.time()-start_time))

## Removing Outliers 

**It is never a good idea to drop observations without prior investigation AND a good reason to believe the data is wrong!** 
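
One way to investigate before deciding anything is to flag unusual values and look at them rather than drop them. The sketch below uses the common 1.5 × IQR rule on made-up wage values; both the rule and the data are assumptions chosen for illustration, not a method prescribed by this notebook.

```python
import pandas as pd

# Made-up values for illustration only.
wages = pd.Series([900, 1000, 1100, 1200, 1300, 50000], name="wage")

q1, q3 = wages.quantile(0.25), wages.quantile(0.75)
iqr = q3 - q1

# Flag values far outside the middle half of the data.
is_outlier = (wages < q1 - 1.5 * iqr) | (wages > q3 + 1.5 * iqr)

# Inspect the flagged rows first; nothing is dropped here.
print(wages[is_outlier])
```

A flagged value may be a data entry error, or it may be a genuine high earner; only a closer look at the record can tell which.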



## Imputing Missing Values

There are many ways to impute missing values from the rest of the data. A missing value can be replaced with the median of the observed values, or with a value estimated from other characteristics (e.g., industry or geography).
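
Both options can be sketched with `pandas`. The data and the column names (`industry`, `wage`) are made up for illustration, and the extra `wage_missing` flag is a suggestion rather than something this notebook requires.

```python
import numpy as np
import pandas as pd

# Made-up data; "industry" and "wage" are hypothetical column names.
df = pd.DataFrame({
    "industry": ["retail", "retail", "retail", "health", "health", "health"],
    "wage": [100.0, np.nan, 300.0, 1000.0, 2000.0, np.nan],
})

# Keep a flag so imputed values can be told apart from observed ones later.
df["wage_missing"] = df["wage"].isna().astype(int)

# Option 1: fill with the overall median of the observed values.
df["wage_median"] = df["wage"].fillna(df["wage"].median())

# Option 2: fill with the median within each industry.
group_median = df.groupby("industry")["wage"].transform("median")
df["wage_group_median"] = df["wage"].fillna(group_median)

print(df)
```

The group-wise version gives the retail and health rows different fill values, which is the point of using other characteristics when they are informative.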

For our data, we have made an assumption about what "missing" means for each component of the data (e.g., if an individual does not appear in the IDES data, we assume they did not have a job in that time period).