# Machine Learning
-----


## Table of Contents
- [Introduction](#Introduction)
- [Glossary of Terms](#Glossary-of-Terms)
- [Setup](#Setup)
- [The Machine Learning Process](#The-Machine-Learning-Process)
- [Problem Formulation](#Problem-Formulation)
- [Label Generation](#Creating-Labels)
- [Feature Generation](#Feature-Generation)
- [Model Fitting](#Model-Fitting)
- [Model Evaluation](#Model-Evaluation)
- [Machine Learning Pipeline](#Machine-Learning-Pipeline)
- [Survey of Algorithms](#Survey-of-Algorithms)
- [Assess Model Against Baselines](#Assess-Model-Against-Baselines)
- [Exercise](#Exercise)
- [Resources](#Resources)

# Introduction

- Back to [Table of Contents](#Table-of-Contents)

In this tutorial, we'll discuss how to formulate a research question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the `scikit-learn` package in Python to model the data you're familiar with from previous tutorials. 


This tutorial is based on chapter 6 of [Big Data and Social Science](https://github.com/BigDataSocialScience/).

# Glossary of Terms

- Back to [Table of Contents](#Table-of-Contents)


- **Learning**: In machine learning, you'll hear about "learning a model." This is what you probably know as 
*fitting* or *estimating* a function, or *training* or *building* a model. These terms are all synonyms and are 
used interchangeably in the machine learning literature.
- **Examples**: These are what you probably know as *data points* or *observations*. 
- **Features**: These are what you probably know as *independent variables*, *attributes*, *predictors*, 
or *explanatory variables.*
- **Underfitting**: This happens when a model is too simple and does not capture the structure of the data well 
enough.
- **Overfitting**: This happens when a model is too complex or too sensitive to the noise in the data; this can
result in poor generalization performance, or applicability of the model to new data. 
- **Regularization**: This is a general method to avoid overfitting by applying additional constraints to the model. 
For example, you can limit the number of features present in the final model, or require that the weight coefficients
applied to the (standardized) features be small.
- **Supervised learning** involves problems with one target or outcome variable (continuous or discrete) that we want
to predict, or classify data into. Classification, prediction, and regression fall into this category. We call the
set of explanatory variables $X$ **features**, and the outcome variable of interest $Y$ the **label**.
- **Unsupervised learning** involves problems that do not have a specific outcome variable of interest; instead,
we are looking to understand "natural" patterns or groupings in the data, uncovering structure that 
we do not know about a priori. Clustering is the most common example of unsupervised learning; another example is 
principal components analysis (PCA).
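
To make the idea of regularization concrete, here is a minimal sketch on synthetic data. It assumes a ridge-style penalty (one of several possible constraints) and uses only `numpy`; the data, the `ridge_coefficients` helper, and the penalty values are all made up for illustration. The point is only that a larger penalty forces the weight coefficients to be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly standardized features: only the first of five drives y.
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=40)


def ridge_coefficients(X, y, penalty):
    """Closed-form ridge solution: (X'X + penalty * I)^-1 X'y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


penalties = [0.0, 10.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, p))) for p in penalties]

for p, norm in zip(penalties, norms):
    print("penalty={:>7.1f}  coefficient norm={:.3f}".format(p, norm))
```

With a penalty of zero this is ordinary least squares; as the penalty grows, the size of the coefficient vector shrinks toward zero.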


## Setup
---
*[Back to Table of Contents](#Table-of-Contents)*

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit models.

In [None]:
%pylab inline
from __future__ import division 
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from sklearn.metrics import precision_recall_curve,roc_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
#import pydotplus
sns.set_style("white")
sns.set_context("poster", font_scale=1.25, rc={"lines.linewidth":1.25, "lines.markersize":8})

### Connect to the database

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) #database connection

The database connection allows us to make queries to a database from Python. 

In [None]:
df_tables = pd.read_sql("""SELECT * FROM ides.il_wage limit 10;""", conn)

In [None]:
df_tables.head()
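
When a query depends on a value (a year, a date, an ID), it is usually safer to pass that value as a query parameter than to paste it into the SQL string. The sketch below illustrates this with `pd.read_sql` and its `params` argument. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the class PostgreSQL database, and the table `il_wage_toy` and its columns are hypothetical. SQLite uses `:name` placeholders, whereas `psycopg2` uses `%(name)s`.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database (assumption for this sketch).
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE il_wage_toy (ssn TEXT, year INTEGER, quarter INTEGER, wage REAL)"
)
toy_conn.executemany(
    "INSERT INTO il_wage_toy VALUES (?, ?, ?, ?)",
    [
        ("a1", 2005, 4, 3100.0),
        ("a1", 2006, 1, 3250.0),
        ("b2", 2006, 1, 0.0),
        ("b2", 2006, 2, 1800.0),
    ],
)
toy_conn.commit()

# The value of min_year is sent separately from the SQL text.
df_toy = pd.read_sql(
    "SELECT * FROM il_wage_toy WHERE year >= :min_year ORDER BY ssn, quarter",
    toy_conn,
    params={"min_year": 2006},
)
print(df_toy)
```

Note that parameters can only replace values, not table or schema names.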

# The Machine Learning Process
*[Go back to Table of Contents](#Table-of-Contents)*

- [**Understand the problem and goal.**](#problem-formulation) *This sounds obvious but is often nontrivial.* Problems typically start as vague 
descriptions of a goal - improving health outcomes, increasing graduation rates, understanding the effect of a 
variable *X* on an outcome *Y*, etc. It is really important to work with people who understand the domain being
studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric 
that you are trying to optimize?
- [**Formulate it as a machine learning problem.**](#problem-formulation) Is it a classification problem or a regression problem? Is the 
goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data 
come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on
to one or more machine learning settings and give you access to a suite of methods.
- **Data exploration and preparation.** Next, you need to carefully explore the data you have. What additional data
do you need or have access to? What variable will you use to match records for integrating different data sources?
What variables exist in the data set? Are they continuous or categorical? What about missing values? Can you use the 
variables in their original form, or do you need to alter them in some way?
- [**Feature engineering.**](#feature-generation) In machine learning language, what you might know as independent variables or predictors 
or factors or covariates are called "features." Creating good features is probably the most important step in the 
machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data
points or over time and space.
- **Method selection.** Having formulated the problem and created your features, you now have a suite of methods to
choose from. It would be great if there were a single method that always worked best for a specific type of problem, but unfortunately there is not. Typically, in machine learning, you take a variety of methods and try them, empirically validating which one is the best approach to your problem.
- [**Evaluation.**](#evaluation) As you build a large number of possible models, you need a way to choose the best among them. We'll cover methodology to validate models on historical data and discuss a variety of evaluation metrics. The next step is to validate using a field trial or experiment.
- [**Deployment.**](#deployment) Once you have selected the best model and validated it using historical data as well as a field
trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in,
and the model might change over time.



You're probably used to fitting models in physical or social science classes. In those cases, you probably had a hypothesis or theory about the underlying process that gave rise to your data, chose an appropriate model based on prior knowledge and fit it using least squares, and used the resulting parameter or coefficient estimates (or confidence intervals) for inference. This type of modeling is very useful for *interpretation*.

In machine learning, our primary concern is *generalization*. This means that:
- **We care less about the structure of the model and more about its performance.** This means that we'll try out a whole bunch of models at a time and choose the one that works best, rather than determining which model to use ahead of time. We can still choose a *suboptimal* model if we care about a specific model type. 
- **We don't (necessarily) want the model that best fits the data we've *already seen*,** but rather the model that will perform the best on *new data*. This means that we won't gauge our model's performance using the same data that we used to fit the model (e.g., sum of squared errors or $R^2$), and that "best fit" or accuracy will most often *not* determine the best model.  
- **We can include a lot of variables in the model.** This may sound like the complete opposite of what you've heard in the past, and it can be hard to swallow. But the model fitting process addresses many of those concerns through a more automatic variable selection process.
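
The difference between fitting the data we have already seen and performing well on new data can be illustrated with a small synthetic experiment. The sketch below is not part of the tutorial's dataset: it assumes a made-up noisy curve, fits polynomials of increasing degree with `numpy`, and reports the error on the data used for fitting and on fresh data from the same process.

```python
import numpy as np

rng = np.random.default_rng(42)


def noisy_sample(n_points):
    """Draw points from a smooth curve plus noise (synthetic data)."""
    x = rng.uniform(-1, 1, size=n_points)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n_points)
    return x, y


x_train, y_train = noisy_sample(20)    # data the model gets to see
x_test, y_test = noisy_sample(200)     # "new" data from the same process


def mean_squared_error(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mean_squared_error(coeffs, x_train, y_train),
                      mean_squared_error(coeffs, x_test, y_test))
    print("degree {}: train MSE = {:.3f}, test MSE = {:.3f}".format(
        degree, *errors[degree]))
```

The training error can only go down as the polynomial gets more flexible, so it cannot tell us when the model has started fitting noise; the error on held-out data is the number that reflects generalization.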

# Problem Formulation
*[Go back to Table of Contents](#Table-of-Contents)*

The first step is turning a vague goal into a concrete objective function. What do you care about? Do you have data on that thing? What action can you take based on your findings? Do you risk introducing any bias based on the way you model something? 

## Four Main Types of ML Tasks for Policy Problems
- **Description**: [How can we identify and respond to the most urgent online government petitions?](https://dssg.uchicago.edu/project/improving-government-response-to-citizen-requests-online/)
- **Prediction**: [Which students will struggle academically by third grade?](https://dssg.uchicago.edu/project/predicting-students-that-will-struggle-academically-by-third-grade/)
- **Detection**: [Which police officers are likely to have an adverse interaction with the public?](https://dssg.uchicago.edu/project/expanding-our-early-intervention-system-for-adverse-police-interactions/)
- **Behavior Change**: [How can we prevent juveniles from interacting with the criminal justice system?](https://dssg.uchicago.edu/project/preventing-juvenile-interactions-with-the-criminal-justice-system/)
  
## Our Machine Learning Problem
> Of all the heads of household who have been off government assistance for one month, who is likely to need assistance in the next year? This is an example of a *binary classification problem*.


Note that the one-year outcome window is arbitrary. You could use a window of 5 years, 3 years, 1 year, or even 1 day. The right window depends on how often you receive new data (there is no sense in making the same predictions on the same data), on how accurate your predictions are over a given time period, and on the time scale at which you can act on the model's output. 

# Data Exploration and Preparation

We have already explored the data in the first module and the database modules. In order to predict whether someone will need benefits, we will be using data from the `idhs.hh_member`, `idhs.hh_member_info`, and `idhs.hh_indcase_spells` tables to create **labels** and **features**. 


## Building a Model

We need to munge our dataset into **features** (predictors, or independent variables, or $X$ variables) and **labels** (dependent variables, or $Y$ variables).  For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "`X_`", and names of variables that pertain to outcome variables will start with "`y_`".
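
As a small illustration of this naming convention, the sketch below splits a combined table into an `X_` feature matrix and a `y_` label vector. The toy values are made up; the column names `recptno`, `benefits`, `n_spells`, and `age` follow the tables built later in this tutorial, and treating `recptno` as an identifier to exclude from the features is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-in for a combined label + feature table (values are made up).
toy_set = pd.DataFrame({
    "recptno": [101, 102, 103, 104],
    "benefits": [1, 0, 0, 1],
    "n_spells": [3, 1, 2, 5],
    "age": [34, 51, 28, 45],
})

label_col = "benefits"   # outcome we want to predict
id_col = "recptno"       # identifier, not a predictor

feature_cols = [c for c in toy_set.columns if c not in (id_col, label_col)]

X_toy = toy_set[feature_cols]   # predictors
y_toy = toy_set[label_col]      # labels

print(X_toy.shape, y_toy.shape)
```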



# Creating Labels

Labels are the dependent variables, or *Y* variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, encoded as 1 or 0. In this case, our label is whether a person will likely need assistance in the future. Let's pick a day to make our prediction, `2007-01-01`. We will look back two years into the past for everyone who received assistance from `2005-2006` but did not receive assistance in the last month of  `2006`. 
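
The date arithmetic behind these windows can be sketched in a few lines of plain Python. The helper `label_windows` is hypothetical (it is not part of the tutorial code); its argument names and defaults mirror those of `create_labels`.

```python
from datetime import date, timedelta


def label_windows(date_of_prediction, past_days=730, off_benefit_days=30,
                  prediction_horizon=365):
    """Return the boundary dates implied by a prediction date."""
    pred = date.fromisoformat(date_of_prediction)
    return {
        "lookback_start": pred - timedelta(days=past_days),
        "off_benefit_start": pred - timedelta(days=off_benefit_days),
        "prediction_date": pred,
        "horizon_end": pred + timedelta(days=prediction_horizon),
    }


windows = label_windows("2007-01-01")
for name, day in windows.items():
    print("{:<18} {}".format(name, day))
```

For a prediction date of `2007-01-01`, the lookback starts on `2005-01-01`, the off-benefit window starts on `2006-12-02`, and the outcome window ends on `2008-01-01`.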

We can write SQL code in `psql`, `dbeaver`, or `pgAdmin`, or programmatically generate the SQL and pass it to the database using `psycopg2` to create the labels. 

In [None]:
def create_labels(date_of_prediction,
                  conn,
                  past_days=730,
                  off_benefit_days=30,
                  prediction_horizon=365,
                  schema='ada_class3',
                  overwrite=False):
    """
    Generate binary labels for a prediction date and return
    them as a DataFrame.

    Parameters
    ----------
    date_of_prediction: str
        day the predictions are made on, e.g. '2006-01-01'
    conn: obj
        psycopg2 connection object to the database
    past_days: int
        number of days we are looking into the past to find
        people who received benefits
    off_benefit_days: int
        number of days before the prediction date during which
        a person must not have received assistance
    prediction_horizon: int
        number of days after the prediction date in which we
        check whether a person returns to benefits
    schema: str
        name of the schema tables will be written to
    overwrite: bool
        if True, regenerate the tables even if they already
        exist; if False, only run the query when the label
        table does not exist
        
    Returns
    -------
    df_labels: DataFrame
        DataFrame of labels
    """
    
    table_date = date_of_prediction.replace('-','')
    
    # check if the table you're trying to create already exists
    cursor = conn.cursor()
    query = """
            select * from information_schema.tables 
            where table_name=\'binary_label_{table_date}\'
            and table_schema=\'{schema}\';
            """.format(table_date=table_date, schema=schema);
    cursor.execute(query) 
    
    if not(cursor.rowcount) or overwrite:
        print('generating labels')
        sql_script="""---------------------------------------------------------------------
--LABELS--------------------------------------------------------------
----------------------------------------------------------------------
/*
Create a label. Predict who is likely to go back on 
benefits within the next year after being off benefits
for at least one month.
*/

--fields
--{date_of_prediction}: day the prediction is being made
--{past_days}: number of days into the past of people receiving
--             benefits
--{off_benefit_days}: number of days off benefits before the day of prediction
--{prediction_horizon}: number of days into the future we are making a prediction
--{table_date}: date for the table



DROP TABLE IF EXISTS {schema}.hh_indcase_spells_before_{table_date}; 
CREATE TABLE {schema}.hh_indcase_spells_before_{table_date} AS
SELECT * 
FROM idhs.hh_indcase_spells
WHERE start_date >= ('{date_of_prediction}'::date - {past_days}) 
and end_date < ('{date_of_prediction}'::date - {off_benefit_days});

COMMIT;

--find all the spells that fall inside the off-benefit window before the prediction date
DROP TABLE IF EXISTS hh_indcase_spells_{table_date};
CREATE TEMP TABLE hh_indcase_spells_{table_date} AS
SELECT *
FROM idhs.hh_indcase_spells
WHERE start_date >= ('{date_of_prediction}'::date - {off_benefit_days}) 
and end_date < '{date_of_prediction}'::date; 

COMMIT;

--grab all the people that are in the first table that are 
--not in the second
DROP TABLE IF EXISTS hh_before_{table_date}_recptno;
CREATE TEMP TABLE hh_before_{table_date}_recptno AS
SELECT DISTINCT(recptno)
FROM {schema}.hh_indcase_spells_before_{table_date}
WHERE recptno NOT IN (
	SELECT DISTINCT(recptno)
	FROM hh_indcase_spells_{table_date});  

COMMIT;

--grab the list of spells that fall inside the prediction horizon
DROP TABLE IF EXISTS hh_after_{table_date};
CREATE TEMP TABLE hh_after_{table_date} AS
SELECT *	
FROM idhs.hh_indcase_spells
WHERE start_date >= '{date_of_prediction}'::date 
AND end_date <= ('{date_of_prediction}'::date + {prediction_horizon});

COMMIT;

--attach any spells in the prediction horizon to the cohort
DROP TABLE IF EXISTS label_{table_date};
CREATE TEMP TABLE label_{table_date} AS
SELECT a.recptno, b.benefit_type, b.start_date, b.end_date
FROM hh_before_{table_date}_recptno a
LEFT JOIN hh_after_{table_date} b ON a.recptno = b.recptno;

COMMIT;

-- turn into binary labels
DROP TABLE IF EXISTS pre_binary_label_{table_date};
CREATE TEMP TABLE pre_binary_label_{table_date} AS
SELECT recptno,
case when benefit_type is null then 0 else 1 end benefits
FROM label_{table_date};

COMMIT;

DROP TABLE IF EXISTS {schema}.binary_label_{table_date};
CREATE TABLE {schema}.binary_label_{table_date} AS
SELECT DISTINCT recptno, benefits
FROM pre_binary_label_{table_date};

COMMIT;
-----------------------------------------------------------------------
-----------------------------------------------------------------------
-----------------------------------------------------------------------

        """.format(date_of_prediction=date_of_prediction,
                   past_days=past_days,
                   off_benefit_days=off_benefit_days,
                   prediction_horizon=prediction_horizon,
                   table_date=table_date,
                   schema=schema)
    
        cursor.execute(sql_script)
    else:
        print('Table already generated')
    
    cursor.close()
    df_label = pd.read_sql('select * from {schema}.binary_label_{table_date};'.format(table_date=table_date,
                                                                                     schema=schema), conn)
    
    return df_label

In [None]:
df_label_2007 = create_labels('2007-01-01',
                              conn,
                              overwrite=False)

In [None]:
df_label_2008 = create_labels('2008-01-01',
                              conn,
                              overwrite=False)

In [None]:
df_label_2007.head()

Now we have a label: 0 indicates that the person *did not need assistance for a year*, and 1 indicates that the person did receive assistance within one year of the prediction date (for the first table, during 2007). 
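
The labeling logic can be checked on a handful of made-up rows. The sketch below is a condensed illustration, not the tutorial's script: it assumes an in-memory SQLite database in place of PostgreSQL, a toy `spells` table, and a single query with common table expressions instead of the intermediate tables. The window dates correspond to a prediction date of `2007-01-01` with the default arguments.

```python
import sqlite3

# In-memory SQLite database with a toy spells table (made-up rows).
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE spells (recptno INTEGER, benefit_type TEXT, "
    "start_date TEXT, end_date TEXT)"
)
toy_conn.executemany(
    "INSERT INTO spells VALUES (?, ?, ?, ?)",
    [
        (1, "foodstamp", "2005-03-01", "2005-09-01"),  # in cohort
        (1, "tanf46", "2007-05-01", "2007-08-01"),     # returns during 2007
        (2, "grant", "2006-01-15", "2006-06-30"),      # in cohort, never returns
        (3, "foodstamp", "2005-06-01", "2006-06-01"),  # in cohort, but...
        (3, "foodstamp", "2006-12-10", "2006-12-20"),  # ...on benefits last month
    ],
)

windows = {
    "lookback_start": "2005-01-01",
    "off_start": "2006-12-02",
    "pred": "2007-01-01",
    "horizon_end": "2008-01-01",
}

label_sql = """
WITH cohort AS (
    SELECT DISTINCT recptno FROM spells
    WHERE start_date >= :lookback_start AND end_date < :off_start
),
recent AS (
    SELECT DISTINCT recptno FROM spells
    WHERE start_date >= :off_start AND end_date < :pred
),
returned AS (
    SELECT DISTINCT recptno FROM spells
    WHERE start_date >= :pred AND end_date <= :horizon_end
)
SELECT c.recptno,
       CASE WHEN r.recptno IS NULL THEN 0 ELSE 1 END AS benefits
FROM cohort c
LEFT JOIN returned r ON r.recptno = c.recptno
WHERE c.recptno NOT IN (SELECT recptno FROM recent)
ORDER BY c.recptno
"""

toy_labels = toy_conn.execute(label_sql, windows).fetchall()
print(toy_labels)
```

Recipient 1 returns to benefits in 2007 and gets label 1, recipient 2 does not return and gets label 0, and recipient 3 is excluded because they received benefits in the month before the prediction date.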

## Feature Generation
*[Go back to Table of Contents](#Table-of-Contents)*


Our features are our independent variables or predictors. Good features make machine learning systems effective. 
The better the features, the easier it is to capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand, and avoids an extensive search for the "right" model and "right" set of parameters. 

Machine learning algorithms learn a solution to a problem from sample data. The set of features is the best representation of the sample data to learn a solution to a problem. 

- **Feature engineering** is "the process of transforming raw data into features that better represent the underlying problem/data/structure to the predictive models, resulting in improved model accuracy on unseen data" (from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)). In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

Examples of feature engineering include: 

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, for example into bins of equal width. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one feature, aggregating over varying windows of time and space. For example, given urban data, 
we would want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius
of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
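
Each of the four kinds of feature above has a common `pandas` or `numpy` idiom. The sketch below is a minimal illustration on a made-up table; the column names, bin labels, and values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up records: several rows per person.
toy = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3],
    "city": ["Chicago", "Chicago", "Peoria", "Peoria", "Peoria", "Aurora"],
    "age": [23, 23, 41, 41, 41, 67],
    "wage": [0.0, 1200.0, 300.0, 450.0, 0.0, 5000.0],
})

# Transformation: log(1 + wage) compresses large values and handles zeros.
toy["log_wage"] = np.log1p(toy["wage"])

# Dummy variables: one 0/1 indicator column per city.
city_dummies = pd.get_dummies(toy["city"], prefix="city")

# Discretization: three equal-width age bins.
toy["age_bin"] = pd.cut(toy["age"], bins=3, labels=["young", "middle", "older"])

# Aggregation: summarize each person's wage records into one row.
per_person = toy.groupby("person_id")["wage"].agg(["count", "min", "max", "mean"])

print(city_dummies.columns.tolist())
print(toy[["age", "age_bin"]])
print(per_person)
```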

Our preliminary features are the following

- `n_spells` (Aggregation): Total number of spells someone has had up until the date of prediction.

- `age` (Transformation): The age feature is created by subtracting `bdate_year` from the year of prediction. 

- `edlevel` (Binary): 0 if the person has less than a high school education and 1 if they have more than a high school education. 

- `workexp` (Binary): 0 if there is no work experience, 1 if there is some sort of work experience.

- `marstat` (Binary): 1 if the person is married, 0 if they are not. 

- `gender` (Binary): 1 (male), 2 (female).

- `n_days_last_spell` (Aggregation): The number of days between the end of a person's last spell and the date of prediction.

- `foodstamp`, `tanf`, `grantf` (Binary): each is 1 if the person's last benefit was of that type and 0 otherwise.

In [None]:
def create_features(date_of_prediction, conn, past_days=7200,
                    schema='ada_class3', overwrite=False):
    """
    Generate the feature table for a prediction date and
    return its name.
    
    Note: There has to be a table of labels that
    corresponds to the same prediction date. 
    
    Parameters
    ----------
    date_of_prediction: str
        date to make prediction from e.g., '2006-01-01'
    conn: obj
        psycopg2 connection object to database
    past_days: int
        number of days to look into the past for the
        wage feature
    schema: str
        schema to write tables into
    overwrite: bool
        If True, rerun the SQL script even if the
        tables already exist. 
        
    Returns
    -------
    table_name: str
        name of table with features
    """
    table_date = date_of_prediction.replace('-','')
        
    cursor = conn.cursor()
    query = """select * from information_schema.tables 
            where table_name=\'feature_table_{table_date}\'
            and table_schema=\'{schema}\';""".format(table_date=table_date, schema=schema)
    print(query)
    cursor.execute(query)
    
    if not(cursor.rowcount) or overwrite:

    
        sql_script="""
        -----------------------------------------------------------------------
--FEATURE CREATION-----------------------------------------------------
-----------------------------------------------------------------------

--number of individual spells
--find how many records before the end_date
-- group by recptno
drop table if exists feature_n_spells_{table_date};
create temp table feature_n_spells_{table_date} as 
select recptno, count(*) n_spells
from idhs.hh_indcase_spells
where end_date <= '{date_of_prediction}'::date
and recptno in (
	select distinct(recptno)
	from {schema}.binary_label_{table_date})
group by recptno; 

commit;

-- age 
drop table if exists feature_age_{table_date};
create temp table feature_age_{table_date} as
select recptno, 
	(date_part('year','{date_of_prediction}'::date)-bdate_year) age
from idhs.hh_member
where recptno in (
select distinct(recptno) from {schema}.binary_label_{table_date}); 

commit; 

-- marstat, edlevel, workexp

drop table if exists last_case_before_{table_date};
create temp table last_case_before_{table_date} as
select distinct on ("recptno") 	recptno,
				ch_dpa_caseid,
			        start_date,
			        end_date,
				(end_date::date - start_date::date) n_days_spell,
				benefit_type
from {schema}.hh_indcase_spells_before_{table_date}
where recptno in (
select distinct(recptno) from {schema}.binary_label_{table_date})
order by recptno, end_date desc;

commit; 

drop table if exists pre_categorical_features;
create temp table pre_categorical_features as 
select 	c.recptno,
	c.n_days_spell,
case when c.edlevel in ('A', 'B', 'C', 'D', 'E', 'F','1','2','3','4') then 0 
     when c.edlevel is NULL then 0 
     else 1 end edlevel,
case when c.martlst in (0,1,3,4,5,6) then 0 else 1 end marstat,
case when c.workexp in ('0','1') then 0 else 1 end workexp,
case when c.benefit_type = 'foodstamp' then 1 else 0 end foodstamp,
case when c.benefit_type = 'tanf46' then 1 else 0 end tanf,
case when c.benefit_type = 'grant' then 1 else 0 end grantf,
('{date_of_prediction}'::date - end_date::date) n_days_last_spell
from (select a.recptno,
       	b.edlevel,
	b.workexp,
	b.martlst,
	a.end_date,
	a.n_days_spell,
	a.benefit_type
from last_case_before_{table_date} a
join idhs.member_info b
 on a.recptno = b.recptno and a.ch_dpa_caseid=b.ch_dpa_caseid) as c; 
-- how much money are they earning

commit; 

-- gender
drop table if exists feature_gender_{table_date};
create temp table feature_gender_{table_date} as
select recptno, sex gender
from idhs.hh_member
where recptno in (
select distinct(recptno) from {schema}.binary_label_{table_date}); 

commit; 

--salary
drop table if exists {schema}.recptno_ssn_{table_date}; 
create table {schema}.recptno_ssn_{table_date} as
select a.recptno, b.ssn_hash
from {schema}.binary_label_{table_date} a
join idhs.hh_member b on a.recptno = b.recptno;

commit; 

drop table if exists {schema}.wage_ssn_{table_date};
create table {schema}.wage_ssn_{table_date} as
select *
from {schema}.il_wage_hh_recipient
where ssn in ( 	select distinct ssn_hash
	from {schema}.recptno_ssn_{table_date});

commit; 

drop table if exists {schema}.feature_wage_{table_date};
create table {schema}.feature_wage_{table_date} as  
select 	recptno, 
	sum(wage) total_wages,
	 count(distinct(year,quarter)) n_quarters
from {schema}.wage_ssn_{table_date}
where year > date_part('year', timestamp '{date_of_prediction}'::date-{past_days}) 
and year < date_part('year', timestamp '{date_of_prediction}'::date)
group by recptno;

commit; 

--create feature table
drop table if exists {schema}.feature_table_{table_date};
create table {schema}.feature_table_{table_date} as 
select 	a.recptno,
       	b.n_spells,
       	c.age,
 	e.edlevel,
	e.workexp,
	e.marstat,
	e.n_days_last_spell,
	e.n_days_spell,
	e.foodstamp,
	e.tanf,
	e.grantf,
	d.gender,
    case when f.total_wages is NULL then 0 else f.total_wages end total_wages,
    case when f.n_quarters is NULL then 0 else f.n_quarters end n_quarters
from {schema}.binary_label_{table_date} a
left join feature_n_spells_{table_date} b on a.recptno=b.recptno
left join feature_age_{table_date} c on a.recptno = c.recptno
left join feature_gender_{table_date} d on a.recptno = d.recptno
left join pre_categorical_features e on a.recptno = e.recptno
left join {schema}.feature_wage_{table_date} f on a.recptno = f.recptno;

commit; 

drop table if exists {schema}.set_{table_date};
create table {schema}.set_{table_date} as 
select 	a.*,
	b.n_spells,
	b.age,
	b.edlevel,
	b.workexp,
	b.marstat,
	b.gender,
	b.n_days_last_spell,
	b.foodstamp,
	b.tanf,
	b.grantf,
	b.n_days_spell,
    b.total_wages,
    b.n_quarters
from {schema}.binary_label_{table_date} a
join {schema}.feature_table_{table_date} b on a.recptno=b.recptno;

commit; 


----------------------------------------------------------------------
----------------------------------------------------------------------

        """.format(date_of_prediction=date_of_prediction,
                    table_date=table_date,
                   past_days=past_days,
                       schema=schema)
    
    
        cursor.execute(sql_script)
    
    cursor.close()
    
    print('created {schema}.feature_table_{table_date}'.format(
        table_date=table_date,
        schema=schema))
    
    # return the name of the table the SQL script actually creates
    table_name = '{schema}.feature_table_{table_date}'.format(schema=schema,
                                                              table_date=table_date)
    return table_name

In [None]:
train_feature_table = create_features('2007-01-01',conn, overwrite=False)

In [None]:
test_feature_table = create_features('2008-01-01',conn, overwrite=False)

## Model Fitting
*[Go back to Table of Contents](#Table-of-Contents)*

It's not enough to just build the model; we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the *most challenging* part of an analysis.  Making repeatable, well-documented work with clear success metrics makes all the difference.

To convince ourselves - and others - that our modeling results will generalize, we need to hold
some data back (not using it to train the model), then apply our model to that hold-out set and "blindly" predict, comparing the model's predictions to what we actually observed. This is called **cross-validation**, and it's the best way we have to estimate how a model will perform on *entirely* novel data. We call the data used to build the model the **training set**, and the rest the **test set**.

In general, we'd like our training set to be as large as possible, to give our model more information. However, you also want to be as confident as possible that your model will be applicable to new data, or else the model is useless. In practice, you'll have to balance these two objectives in a reasonable way.  

There are also many ways to split up your data into training and testing sets. Since you're trying to evaluate how your model will perform *in practice*, it's best to emulate the true use case of your model as closely as possible when you decide how to evaluate it. A good [tutorial on cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) can be found on the `scikit-learn` site.

One simple and commonly used method is ***k-fold* cross-validation**, which entails splitting up our dataset into *k* groups, holding out one group while training a model on the rest of the data, evaluating model performance on the held-out "fold," and repeating this process *k* times (we'll get back to this in the text-analysis tutorial). Another method is **temporal validation**, which involves building a model using all the data up until a given point in time, and then testing the model on observations that happened after that point.

Our problem of predicting whether someone will need government assistance is a problem in time: we are trying to predict an event in the future. If a model is allowed to use information from the future to predict the past, temporal effects will inflate the apparent accuracy of its predictions. We cannot use the future to predict the past in real life, so it is important to use **temporal validation** and create our training and test sets accordingly. 
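
The two splitting strategies can be compared side by side on a toy table. This is a sketch under stated assumptions: the rows, the `event_date` column, and the cutoff date are made up, and `KFold` from `sklearn.model_selection` is used for the *k*-fold part.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Made-up observations, each stamped with the date it was recorded.
toy = pd.DataFrame({
    "event_date": pd.to_datetime([
        "2005-03-01", "2005-09-15", "2006-02-01",
        "2006-07-20", "2007-01-10", "2007-06-30",
    ]),
    "x": [0.2, 1.5, 0.7, 2.1, 1.1, 0.4],
    "y": [0, 1, 0, 1, 1, 0],
})

# k-fold: every row is used for testing exactly once, regardless of its date.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kfold.split(toy)]
print("k-fold (train, test) sizes:", fold_sizes)

# Temporal validation: train on everything before the cutoff, test on the rest.
cutoff = pd.Timestamp("2007-01-01")
train = toy[toy["event_date"] < cutoff]
test = toy[toy["event_date"] >= cutoff]
print("temporal split sizes:", len(train), len(test))
```

In the *k*-fold split, a training fold can contain rows that are later in time than the rows being tested; the temporal split rules that out by construction.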

Our training set uses `2007-01-01` as the day of prediction, so its labels cover 2007. Looking back from that date, we found all people who received benefits during 2005 and 2006 but not in the last month of 2006. The features are then developed using only data from before our day of prediction, `2007-01-01`. *Note: it is important to segregate your data based on time when creating features. Otherwise there can be "leakage," where you accidentally use information that you would not have known at the time.*  This happens often when calculating aggregation features; for instance, it is quite easy to calculate an average using values that go beyond our training set time-span and not realize it.  
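
Here is a minimal sketch of how leakage creeps into an aggregate feature and how a date cutoff prevents it. The toy `spells` table is made up; the feature names `n_spells` and `n_days_last_spell` follow the features defined earlier, and `2007-01-01` is used as the day of prediction.

```python
import pandas as pd

# Made-up spell end dates for two recipients.
spells = pd.DataFrame({
    "recptno": [1, 1, 1, 2, 2],
    "end_date": pd.to_datetime([
        "2005-06-30", "2006-03-31", "2007-08-31", "2006-05-31", "2007-02-28",
    ]),
})
date_of_prediction = pd.Timestamp("2007-01-01")

# Leaky version: counts spells that end after the day of prediction.
leaky_n_spells = spells.groupby("recptno").size()

# Safe version: keep only what was known on the day of prediction.
known = spells[spells["end_date"] <= date_of_prediction]
n_spells = known.groupby("recptno").size()
n_days_last_spell = (
    date_of_prediction - known.groupby("recptno")["end_date"].max()
).dt.days

print(leaky_n_spells.to_dict())
print(n_spells.to_dict())
print(n_days_last_spell.to_dict())
```

The leaky count includes spells from 2007, which is exactly the period we are trying to predict.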

Our testing set uses `2008-01-01` as the day of prediction, so its labels cover the following year, 2008, and our features will be generated from 1989-2007. 

Notice that our testing set uses more data than our training set, because we have "new" data from 2007. This emulates the way our models could be used in practice. Every year we can run our model and make new predictions, updating our dataset with the most recent five years of data. 


# Data Check 

In [None]:
# generate a matrix for training the models

def create_train_or_test_matrix(date_of_prediction, conn, schema='ada_class3', overwrite=False):
    """
    Join the feature table with the labels table to generate a
    matrix for training or testing.
      
    Parameters
    ----------
    date_of_prediction: str
        day to make prediction from, e.g. '2005-01-01'
    conn: obj
        psycopg2 connection object to database
    schema: str
        schema to write the tables into
    overwrite: bool
        If True, rerun the SQL script even if the
        table already exists. 
        
    Returns
    -------
    None
        the joined table is written to the given schema
    """
        
    table_date = date_of_prediction.replace('-','')
        
    cursor = conn.cursor()
    query = """
            select * from information_schema.tables 
            where table_name=\'set_{table_date}\'
            and table_schema=\'{schema}\';
            """.format(table_date=table_date,
                      schema=schema)
    cursor.execute(query)
    
    if not(cursor.rowcount) or overwrite:

        sql_script="""
   drop table if exists {schema}.set_{table_date};
create table {schema}.set_{table_date} as 
select 	a.*,
	b.n_spells,
	b.age,
	b.edlevel,
	b.workexp,
	b.marstat,
	b.gender,
	b.n_days_last_spell,
	b.foodstamp,
	b.tanf,
	b.grantf,
	b.n_days_spell,
    b.total_wages,
    b.n_quarters
from class2.binary_label_{table_date} a
join {schema}.feature_table_{table_date} b on a.recptno=b.recptno;

    commit; 

    """.format(table_date=table_date,
               schema=schema)
    
    
        cursor.execute(sql_script)
    
    cursor.close()
    
    print('created {schema}.set_{table_date}'.format(
        schema=schema, table_date=table_date))
    
      


In [None]:
create_train_or_test_matrix('2007-01-01', conn, overwrite=False)

In [None]:
create_train_or_test_matrix('2008-01-01', conn, overwrite=False)

In [None]:
df_training = pd.read_sql('select * from ada_class3.set_20070101;', conn)
df_testing = pd.read_sql('select * from ada_class3.set_20080101;', conn)

In [None]:
df_training.head()

In [None]:
df_testing.head()

In [None]:
isnan_training_rows = df_training.isnull().any(axis=1) # Find the rows where there are NaNs

In [None]:
df_training[isnan_training_rows].head()

No `NaNs` 

In [None]:
nrows_training = df_training.shape[0]
nrows_training_isnan = df_training[isnan_training_rows].shape[0]

In [None]:
print('Fraction of rows with NaNs: {}'.format(float(nrows_training_isnan)/nrows_training))

In [None]:
df_training = df_training[~isnan_training_rows].copy()  # copy so later column edits act on an independent frame

### Imputation 

It is important to do a quick check of our matrix to see if we have any outlier values. 

In [None]:
df_training.describe()

Let's check the values of the ages to see if they are reasonable. 

In [None]:
np.unique( df_training['age'] )

Aha! It is unlikely that people born in the future were receiving benefits in the past! This is likely due to an incorrect entry in the `birth_yr` in the `idhs.hh_member` table. On the other end of the age spectrum, the ages are more likely to be correct, but this is still something that you'd want to do a "sanity check" on with someone who knows the data well.

Let's mark rows that have age less than 1 or age greater than 100 as NaN and then impute the age with the mean. 

In [None]:
mask = ( (df_training['age'] < 1) | (df_training['age'] > 100) )
# Keep plausible ages; implausible ones become NaN so they can be imputed below
df_training['age'] = df_training['age'].where(~mask)

In [None]:
df_training['age'].unique()

In [None]:
mean_training_age = df_training['age'].mean()

In [None]:
mean_training_age

In [None]:
df_training['age'] = df_training['age'].fillna(mean_training_age)

In [None]:
df_training['age'].unique()

### Class Balancing

Let's check how much data we still have and how many examples of going back on benefits are in our training dataset. We don't necessarily need to have a perfect 50-50 balance of off-benefits/on-benefits, but it's good to know what the "baseline" is in our dataset, to be able to intelligently evaluate our performance.

In [None]:
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['benefits'].value_counts(normalize=True)

We have about N examples, and about X% of those are *positive* examples (needed assistance), which is what we're trying to identify. About Y% of the examples are *negative* examples (did not need assistance).

Let's take a look at our testing set. 

In [None]:
df_testing.head()

In [None]:
isnan_testing_rows = df_testing.isnull().any(axis=1) # Find the rows where there are NaNs
nrows_testing = df_testing.shape[0]
nrows_testing_isnan = df_testing[isnan_testing_rows].shape[0]
print('Fraction of rows with NaNs: {}'.format(float(nrows_testing_isnan)/nrows_testing))

In [None]:
df_testing[isnan_testing_rows].head()

In [None]:
mask = ( (df_testing['age'] < 1) | (df_testing['age'] > 100) )
# Implausible ages become NaN, then are filled with the mean age from the TRAINING set
df_testing['age'] = df_testing['age'].where(~mask)
df_testing['age'] = df_testing['age'].fillna(mean_training_age)

In [None]:
df_testing.head()

In [None]:
print('Number of rows: {}'.format(df_testing.shape[0]))
df_testing['benefits'].value_counts(normalize=True)

### Scaling of Values

Certain models have trouble when features are on very different scales, such as age and total earnings: age is typically a number between 0 and 100, while earnings can range from 0 to 1,000,000. To circumvent this problem, we can scale our features so that they cover a comparable range.  

In [None]:
min_training_wage = df_training['total_wages'].min()
max_training_wage = df_training['total_wages'].max()

df_training['scaled_wages'] = (df_training['total_wages'] - min_training_wage)/(max_training_wage-min_training_wage)


In [None]:
df_training[['scaled_wages','total_wages']].describe()

In [None]:
df_testing['scaled_wages'] = (df_testing['total_wages'] - min_training_wage)/(max_training_wage-min_training_wage) 
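
The same min-max scaling can also be done with `scikit-learn`'s `MinMaxScaler`. The sketch below uses made-up wage values (`train_wages` and `test_wages` are hypothetical stand-ins for the `total_wages` columns) and, like the cells above, learns the minimum and maximum from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical wage values standing in for the training and testing columns.
train_wages = np.array([[0.0], [20000.0], [50000.0], [100000.0]])
test_wages = np.array([[25000.0], [150000.0]])

scaler = MinMaxScaler()
scaler.fit(train_wages)                      # learn min and max from training data only
train_scaled = scaler.transform(train_wages)
test_scaled = scaler.transform(test_wages)   # reuse the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test set is scaled with the training minimum and maximum, test values can fall outside the 0 to 1 range, as the second test value does here.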

### Crosstabs

We can use crosstabs to find trends and patterns in our data. 

In [None]:
df_training.head()

In [None]:
pd.crosstab(index=df_training['benefits'], columns=df_training['gender']).plot(kind='bar')

In [None]:
pd.crosstab(index=df_training['benefits'], columns=df_training['edlevel']).plot(kind='bar')

In [None]:
pd.crosstab(index=df_training['benefits'], columns=df_training['marstat']).plot(kind='bar')

In [None]:
ax = sns.boxplot(x="benefits", y="scaled_wages", data=df_training)

In [None]:
ax = sns.boxplot(x="benefits", y="n_days_last_spell", data=df_training)

In [None]:
ax = sns.boxplot(x="benefits", y="n_spells", data=df_training)

In [None]:
ax = sns.boxplot(x="benefits", y="age", data=df_training)

### Split into features and labels

In [None]:
sel_features = ['n_spells','age', 'edlevel','workexp','marstat','gender',
                'n_days_last_spell', 'scaled_wages', 'n_quarters']
sel_label = 'benefits'

In [None]:
# use conventions typically used in python scikitlearn

X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_testing[sel_features].values
y_test = df_testing[sel_label].values

# Model Selection

## Model Evaluation 
*[Go back to Table of Contents](#Table-of-Contents)*

In this phase, you take the predictors from your test set and apply your model to them, then assess the quality of the model by comparing the *predicted values* to the *actual values* for each record in your testing data set. 

- **Performance Estimation**: How well will our model do once it is deployed and applied to new data?

Let's fit a model to our training data, use it to make predictions on our test dataset, and see what our accuracy score is:

Python's [`scikit-learn`](http://scikit-learn.org/stable/) is a commonly used, well documented Python library for machine learning. This library can help you split your data into training and test sets, fit models and use them to predict results on new data, and evaluate your results.

We will start with the simplest [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model and see how well that does.

You can use any number of metrics to judge your models (see [model evaluation](#Model-Evaluation)), but we'll use [`accuracy_score()`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) (ratio of correct predictions to total number of predictions) as our measure.

In [None]:
# Let's fit a model
from sklearn import linear_model
model = linear_model.LogisticRegression(penalty='l1', C=1e5, solver='liblinear')  # 'liblinear' supports the L1 penalty
model.fit( X_train, y_train )
print(model)

When we print the model, we see the different parameters we can adjust as we refine the model based on running it against test data (values such as `intercept_scaling`, `max_iter`, `penalty`, and `solver`). The exact output depends on your `scikit-learn` version and on the arguments passed when creating the model.  Example output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

To adjust these parameters, one would alter the call that creates the `LogisticRegression()` model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with `max_iter` of 1000, `intercept_scaling` of 2, and `solver` of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:

    model = linear_model.LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )

The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameter values, and see which performs best on held-out data. An obvious drawback is that if you tune against the test set you can also *overfit* to it; to avoid this, you can alter your method of validation, for example by tuning with cross-validation inside the training data and keeping the test set for the final evaluation.
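
One way to tune without touching the test set is `scikit-learn`'s `GridSearchCV`, which runs cross-validation on the training data for each parameter combination. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the grid values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values to try; every combination is evaluated with 5-fold CV.
param_grid = {'C': [0.01, 1.0, 100.0], 'penalty': ['l1', 'l2']}
search = GridSearchCV(
    LogisticRegression(solver='liblinear'),  # liblinear supports both penalties
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best parameters are chosen using only the cross-validation folds, so the test set remains an honest estimate of performance on new data.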



# Model Understanding

In [None]:
print("The coefficients for each of the features are ")
list(zip(sel_features, model.coef_[0]))

In [None]:
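# Multiply each coefficient by its feature's standard deviation so the magnitudes are comparable
std_coef = np.std(X_test,0)*model.coef_
list(zip(sel_features, std_coef[0]))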

# Model Evaluation 

Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score between 0 and 1 (that can sometimes be interpreted as a probability), which lets you more finely rank all of the examples from *most likely* to *least likely* to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 (1/2) as positive (1), but there's no reason that has to be the cutoff. 

In [None]:
# Generate predicted scores from our "predictors" using the model.
y_scores = model.predict_proba(X_test)[:,1]

In [None]:
y_scores

Let's take a look at the distribution of scores and see if it makes sense to us. 

In [None]:
sns.histplot(y_scores, kde=False)  # distplot is deprecated in recent seaborn releases

In [None]:
df_testing['y_score'] = y_scores

In [None]:
df_testing[['recptno', 'y_score']].head()

Tools like `sklearn` often have a default threshold of 0.5, but a good threshold is selected based on the data, model and the specific problem you are solving. As a trial run, let's set a threshold of 0.5. 

In [None]:
calc_threshold = lambda x,y: 0 if x < y else 1 
predicted = np.array( [calc_threshold(score,0.5) for score in y_scores] )
expected = y_test

## Confusion Matrix

Once we have converted our scores to 0 or 1 for classification, we create a *confusion matrix*, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(expected,predicted)
print(conf_matrix)

In this matrix, rows are the actual labels and columns are the predicted labels. The count of true negatives is `conf_matrix[0,0]`, false negatives `conf_matrix[1,0]`, true positives `conf_matrix[1,1]`, and false positives `conf_matrix[0,1]`.
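
As a small worked example of the four cells, the sketch below uses made-up labels and scores (`toy_expected` and `toy_scores` are hypothetical, not taken from our data) and a threshold of 0.5.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model scores, for illustration only.
toy_expected = np.array([0, 0, 0, 0, 1, 1, 1, 0])
toy_scores = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.3, 0.9, 0.7])

# Turn scores into 0/1 predictions with a 0.5 threshold.
toy_predicted = (toy_scores >= 0.5).astype(int)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_expected, toy_predicted).ravel()
toy_accuracy = (tp + tn) / float(tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(toy_accuracy)
```

Here three negatives are correctly rejected, two negatives are wrongly flagged, one positive is missed, and two positives are found, so 5 of the 8 predictions are correct.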

Accuracy is the ratio of the correct predictions (both positive and negative) to all predictions. 
$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$

In [None]:
# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

Two additional metrics that are often used are **precision** and **recall**. 

Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive. 

$$ Precision = \frac{TP}{TP+FP}$$

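Recall measures the ability of the classifier to find positive examples in the data. It is the ratio of correctly predicted positive examples to all examples that are actually positive. 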

$$ Recall = \frac{TP}{TP+FN} $$

By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (threshold 0.99) will classify a case as 1 only when it is *very sure*, leading to high precision. On the other end of the spectrum, a low threshold (e.g. 0.01) will lead to higher recall. 

In [None]:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))
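
To see the threshold trade-off on a small scale, the sketch below computes precision and recall at three thresholds on made-up labels and scores. The arrays and the threshold values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and scores, for illustration only.
toy_expected = np.array([0, 0, 0, 0, 1, 1, 1, 0])
toy_scores = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.3, 0.9, 0.7])

# Map each threshold to its (precision, recall) pair.
tradeoff = {}
for threshold in [0.25, 0.5, 0.85]:
    toy_predicted = (toy_scores >= threshold).astype(int)
    tradeoff[threshold] = (
        precision_score(toy_expected, toy_predicted),
        recall_score(toy_expected, toy_predicted),
    )
    print(threshold, tradeoff[threshold])
```

On this toy data, raising the threshold from 0.25 to 0.85 moves precision from 0.5 up to 1.0 while recall drops from 1.0 to one third.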

If we care about our whole precision-recall space, we can optimize for a metric known as the **area under the curve (AUC-PR)**, which is the area under the precision-recall curve. The maximum AUC-PR is 1. 

In [None]:
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    from sklearn.metrics import precision_recall_curve, auc
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:1f}'.format(auc_val))
    plt.show()
    plt.clf()

In [None]:
plot_precision_recall(expected, y_scores)

## Precision and Recall at k%

If we only care about a specific part of the precision-recall curve we can focus on more fine-grained metrics. For instance, say there is a special program for people likely to need assistance within the next year, but only *3000 or 1% of the people in our test set* can be admitted. In that case, we would want to prioritize the 1% who were *most likely* to need assistance within the next year, and it wouldn't matter too much how accurate we were on the 78% or so who weren't very likely to need assistance.

Let's say that, out of the approximately 300,000 people, we can intervene on 1% of them, or the "top" 3000 people in a year (where "top" means highest likelihood of needing assistance in the next year). We can then focus on optimizing our **precision at 1%**.

In [None]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: ls 
        ls of ground truth labels
    y_prob: ls
        ls of predic proba from model
    model_name: str
        str of model name (e.g, LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()

In [None]:
def precision_at_k(y_true, y_scores,k):
    
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)

In [None]:
plot_precision_recall_n(expected,y_scores, 'LR')

In [None]:
p_at_1 = precision_at_k(expected,y_scores, 0.01)
print('Precision at 1%: {:.2f}'.format(p_at_1))
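
For intuition about what "precision at k%" measures, here is an illustrative variant that ranks examples by score and takes exactly the top fraction. The helper name `precision_at_top_k` and the toy arrays are hypothetical; this is a sketch, not a replacement for the function above.

```python
import numpy as np

def precision_at_top_k(y_true, y_scores, k):
    """Share of true positives among the top k fraction of examples by score."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_selected = max(1, int(k * len(y_scores)))        # how many examples we can act on
    top_idx = np.argsort(y_scores)[::-1][:n_selected]  # indices of the highest scores
    return y_true[top_idx].mean()

# Hypothetical labels and scores, already listed from highest to lowest score.
toy_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
toy_score = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10, 0.05])

top20 = precision_at_top_k(toy_true, toy_score, 0.2)
top50 = precision_at_top_k(toy_true, toy_score, 0.5)
print(top20, top50)
```

Selecting by rank always returns the same number of examples, whereas a score-threshold version may select more examples when several scores are tied at the cutoff.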

## Machine Learning Pipeline
*[Go back to Table of Contents](#Table-of-Contents)*

When working on machine learning projects, it is a good idea to structure your code as a modular **pipeline**, which contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. This has many advantages:
- **Reproducibility**. It's important that your work be reproducible. This means that someone else should be able
to see what you did, follow the exact same process, and come up with the exact same results. It also means that
someone else can follow the steps you took and see what decisions you made, whether that person is a collaborator, 
a reviewer for a journal, or the agency you are working with. 
- **Ease of model evaluation and comparison**.
- **Ability to make changes.** If you receive new data and want to go through the process again, or if there are 
updates to the data you used, you can easily substitute new data and reproduce the process without starting from scratch.
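
One concrete tool for part of this idea is `scikit-learn`'s `Pipeline`, which chains preprocessing and model fitting into a single object. It covers only the modeling steps, not data extraction or reporting. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the step names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# Each step is (name, transformer or estimator); the last step is the model.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression(solver='liblinear')),
])

# Fitting learns the scaling and the model from the training rows only;
# predicting applies the same learned scaling to the held-out rows.
pipe.fit(X_demo[:80], y_demo[:80])
demo_scores = pipe.predict_proba(X_demo[80:])[:, 1]
print(demo_scores[:5])
```

Keeping the steps in one object makes it harder to accidentally scale the test data with its own statistics, and easier to rerun the whole process on new data.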

# Survey of Algorithms

*[Go back to Table of Contents](#Table-of-Contents)*

We have only scratched the surface of what we can do with our model. We've only tried one classifier (Logistic Regression), and there are plenty more classification algorithms in `sklearn`. Let's try them! 

In [None]:
clfs = {'RF': RandomForestClassifier(n_estimators=50, n_jobs=-1),
       'ET': ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy'),
        'LR': LogisticRegression(penalty='l1', C=1e5, solver='liblinear'),  # liblinear supports the L1 penalty
        'SGD':SGDClassifier(loss='log_loss'),  # named 'log' in scikit-learn releases before 1.1
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, random_state=17, n_estimators=10),
        'NB': GaussianNB()}

In [None]:
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB']


In [None]:
max_p_at_k = 0
df_results = pd.DataFrame()
for clfNM in sel_clfs:
    clf = clfs[clfNM]
    clf.fit( X_train, y_train )
    print(clf)
    y_score = clf.predict_proba(X_test)[:,1]
    predicted = np.array(y_score)
    expected = np.array(y_test)
    plot_precision_recall_n(expected,predicted, clfNM)
    p_at_1 = precision_at_k(expected,y_score, 0.01)
    p_at_5 = precision_at_k(expected,y_score,0.05)
    p_at_10 = precision_at_k(expected,y_score,0.10)
    fpr, tpr, thresholds = roc_curve(expected,y_score)
    auc_val = auc(fpr,tpr)
    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
    df_results = pd.concat([df_results, pd.DataFrame([{
        'clfNM':clfNM,
        'p_at_1':p_at_1,
        'p_at_5':p_at_5,
        'p_at_10':p_at_10,
        'auc':auc_val,
        'clf': clf
    }])], ignore_index=True)
    
    # feature importances (not every classifier exposes them, e.g. GaussianNB)
    if hasattr(clf, 'coef_'):
        feature_import = dict(
            zip(sel_features,clf.coef_.ravel()))
    elif hasattr(clf, 'feature_importances_'):
        feature_import = dict(
            zip(sel_features, clf.feature_importances_))
    else:
        feature_import = None  # avoid reusing the previous classifier's values
    
    if feature_import is not None:
        print("FEATURE IMPORTANCES")
        print(feature_import)
        plt.clf()
        sns.set_style('whitegrid')
        f, ax = plt.subplots(figsize=(36,12))
        sns.barplot(x=list(feature_import.keys()), y=list(feature_import.values()), palette="Blues")
        plt.show()
    
    if max_p_at_k < p_at_1:
        max_p_at_k = p_at_1
    print('Precision at 1%: {:.2f}'.format(p_at_1))
df_results.to_csv('modelrun.csv')

# Assess Model Against Baselines

- Back to [Table of Contents](#Table-of-Contents)

It is important to check our model against a reasonable **baseline** to know how well our model is doing. Without any context, 78% accuracy can sound really great... but it's not so great when you remember that you could do almost that well by declaring that no one will need benefits in the next year, which would be a simplistic (not to mention useless) model. 

A good place to start is checking against a *random* baseline, assigning every example a label (positive or negative) completely at random. 

In [None]:
max_p_at_k

In [None]:
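import random
# Baseline: give every example a score drawn uniformly at random
random_score = [random.uniform(0,1) for _ in y_test]
random_predicted = np.array( [calc_threshold(score,0.5) for score in random_score] )
# Note: despite the variable name, this is precision at 1% (k=0.01), ranked by the random scores
random_p_at_5 = precision_at_k(expected, np.array(random_score), 0.01)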
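
A useful rule of thumb for reading the random baseline: when scores are unrelated to the labels, precision at any k should land close to the share of positives in the data. The simulation below illustrates this on made-up data; the 20% base rate and all variable names are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100000
base_rate = 0.2                            # hypothetical share of positive examples

# Labels and scores are generated independently, so the scores carry no information.
toy_labels = (rng.uniform(size=n) < base_rate).astype(int)
toy_random_scores = rng.uniform(size=n)

# Precision among the top 1% of examples by (random) score.
n_top = int(0.01 * n)
top_idx = np.argsort(toy_random_scores)[::-1][:n_top]
random_precision = toy_labels[top_idx].mean()

print(toy_labels.mean(), random_precision)
```

A model is only adding value at the top of the list if its precision there is clearly above this base-rate level.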

Another good practice is checking against an "expert" or rule of thumb baseline. For example, say that talking to people at the IDHS, you find that they think it's much more likely that someone who has been on assistance multiple times already will need assistance in the future. Then you should check that your classifier does better than just labeling everyone who has had multiple past spells as positive.

In [None]:
reenter_predicted = np.array([ 1 if n_spells > 3 else 0 for n_spells in df_testing.n_spells.values ])
reenter_p_at_1 = precision_at_k(expected,reenter_predicted,0.01)

In [None]:
all_non_reenter = np.array([0 for n_spells in df_testing.n_spells.values])
all_non_reenter_p_at_1 = precision_at_k(expected, all_non_reenter,0.01)

In [None]:
sns.set_style("white")
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
fig, ax = plt.subplots(1, figsize=(22,12))
sns.barplot(x=['Random','All no need', 'More than 3 Spells','Model'],
            y=[random_p_at_5, all_non_reenter_p_at_1, reenter_p_at_1, max_p_at_k],
            palette=['#6F777D','#6F777D','#6F777D','#800000'])
sns.despine()
plt.ylim(0,1)
plt.ylabel('precision at 1%')

# Exercise

- Back to [Table of Contents](#Table-of-Contents)

Our model has just scratched the surface. Try the following: 
    
- Create more features
- Try more models
- Try different parameters for your model

## Resources
*[Go back to Table of Contents](#Table-of-Contents)*

- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), also available online, includes less mathematics and is more approachable.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).