# Machine Learning
-----


## Table of Contents
- [Introduction](#introduction)
- [Problem Formulation](#problem-formulation)
- [Creating Labels (Outcomes)](#labels)
- [Feature Generation](#features)
- [Create Training and Test Sets](#train-test)
- [Model Training](#model-training)
- [Model Evaluation](#model-evaluation)
- [More Models](#more-models)
- [Resources](#resources)

# Introduction

In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the `scikit-learn` package in Python. 

This tutorial is based on the "Machine Learning" chapter of [Big Data and Social Science](https://coleridge-initiative.github.io/big-data-and-social-science/).

## Setup

In [None]:
%pylab inline
import yaml
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from dateutil.parser import parse
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
sns.set_style("white")

In [None]:
# to be adjusted
with open('/home/ckern/nc-data/database_login.yaml') as f:
    # safe_load avoids executing arbitrary YAML tags and works without an explicit Loader
    db_connection_string = yaml.safe_load(f)

In [None]:
conn = psycopg2.connect(db_connection_string)
cur = conn.cursor()

# Problem Formulation
---
  
Our Machine Learning Problem
>Of all prisoners released, we would like to predict who is likely to re-enter jail within *5* years of the day we make our prediction. For instance, say it is Jan 1, 2012 and we want to identify which 
>prisoners are likely to re-enter jail between now and the end of 2016. We can run our predictive model and identify who is most at risk. This is an example of a *binary classification* problem. 

Note that the outcome window of 5 years is arbitrary. You could just as well use a window of 3 years, 1 year, or 1 day. 
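
To make the outcome window concrete, here is a minimal sketch that turns a prediction date and a window length into start and end dates. The helper name `outcome_window` is hypothetical and not part of the original notebook; it assumes the window is half-open, so the end date itself is excluded.

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def outcome_window(prediction_date, years=5):
    """Return (start, end) of the outcome window; end is exclusive.

    Hypothetical helper for illustration only.
    """
    start = prediction_date
    end = prediction_date + relativedelta(years=years)
    return start, end


# A 5-year window starting on 2012-01-01 covers 2012 through the end of 2016.
start, end = outcome_window(date(2012, 1, 1), years=5)
print(start, end)
```

Changing `years` (or swapping in `relativedelta(days=1)`) is all it takes to try a different window.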

In order to predict recidivism, we will be using data from the `...` and `...` table to create **labels** and **features**. 

We need to munge our dataset into **features** (predictors, or independent variables, or $X$ variables) and **labels** (dependent variables, or $Y$ variables).

# Creating Labels (Outcomes)
---

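First, we create a table `release_dates_1989_2006`, which is based on the `inmt4bb1` table. We take the `inmate_doc_number`, `actual_sentence_end_date`, and `sentence_begin_date_for_max` of all records with an `actual_sentence_end_date` from the start of 1989 up to (but not including) the start of 2006.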

In [None]:
#drop table if exists release_dates_1989_2006;
sql_string = "create temp table release_dates_1989_2006 as "
sql_string += "select inmate_doc_number, actual_sentence_end_date, sentence_begin_date_for_max "
sql_string += "from inmt4bb1 "
sql_string += "where actual_sentence_end_date >= '1989-01-01' and actual_sentence_end_date < '2006-01-01' "
sql_string += ";"

cur.execute(sql_string)

Next we create a table `last_exit_1989_2006`, which takes the *maximum* (most recent) `actual_sentence_end_date` for every `inmate_doc_number`. This table has only one entry per `inmate_doc_number`, so for any given `inmate_doc_number`, or individual, we know their *most recent* release date.

In [None]:
#drop table if exists last_exit_1989_2006;
sql_string = "create temp table last_exit_1989_2006 as "
sql_string += "select inmate_doc_number, max(actual_sentence_end_date) actual_sentence_end_date "
sql_string += "from release_dates_1989_2006 "
sql_string += "group by inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)
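
If you think more easily in `pandas` than in SQL, the same "most recent release per person" step can be sketched as a group-by. The data frame below is a made-up stand-in for `release_dates_1989_2006`; only the column names come from the query above.

```python
import pandas as pd

# Toy stand-in for release_dates_1989_2006 (values are invented).
releases = pd.DataFrame({
    "inmate_doc_number": ["A1", "A1", "B2", "C3", "C3"],
    "actual_sentence_end_date": pd.to_datetime(
        ["1995-03-01", "2001-07-15", "1999-12-31", "1990-01-10", "2004-06-30"]
    ),
})

# Same idea as: select inmate_doc_number, max(actual_sentence_end_date)
#               ... group by inmate_doc_number
last_exit = releases.groupby("inmate_doc_number", as_index=False)[
    "actual_sentence_end_date"
].max()
print(last_exit)
```

The result has one row per `inmate_doc_number`, just like the SQL table.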

We then find everyone admitted into prison between 2006 and 2010.

In [None]:
#drop table if exists admit_2006_2011;
sql_string = "create temp table admit_2006_2011 as "
sql_string += "select inmate_doc_number, sentence_begin_date_for_max "
sql_string += "from inmt4bb1 "
sql_string += "where sentence_begin_date_for_max >= '2006-01-01' and sentence_begin_date_for_max < '2011-01-01' and inmate_sentence_component = 1 "
sql_string += ";"

cur.execute(sql_string)

Next, we do a `left join` on the `last_exit_1989_2006` (left) table and the `admit_2006_2011` (right) table on the `inmate_doc_number` field. The resulting table will keep all the entries from the *left* table (most recent releases between 1989 and 2005) and add their admits between 2006 and 2010. 

In [None]:
#drop table if exists recidivism_2006_2011;
sql_string = "create temp table recidivism_2006_2011 as "
sql_string += "select r.inmate_doc_number, r.actual_sentence_end_date, a.sentence_begin_date_for_max, "
sql_string += "case when a.sentence_begin_date_for_max is null then 0 else 1 end recidivism "
sql_string += "from last_exit_1989_2006 r "
sql_string += "left join admit_2006_2011 a on r.inmate_doc_number = a.inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

Now we have a label: 0 indicates *no recidivism*, 1 indicates that the person did return to jail within the outcome period (beginning of 2006 to the end of 2010). 
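
The left join and the `case when ... is null` logic can be sketched in `pandas` on toy data. The two small data frames are invented stand-ins for `last_exit_1989_2006` and `admit_2006_2011`; this is an illustration of the labeling rule, not a replacement for the SQL.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented).
last_exit = pd.DataFrame({"inmate_doc_number": ["A1", "B2", "C3"]})
admits = pd.DataFrame({
    "inmate_doc_number": ["A1", "A1"],
    "sentence_begin_date_for_max": pd.to_datetime(["2007-02-01", "2009-05-20"]),
})

# Left join: keep everyone who was released, matched to later admits if any.
joined = last_exit.merge(admits, on="inmate_doc_number", how="left")

# No matching admit (missing date) means label 0, otherwise label 1.
joined["recidivism"] = joined["sentence_begin_date_for_max"].notna().astype(int)

# One row per person, as in the "select distinct" step.
labels = (
    joined[["inmate_doc_number", "recidivism"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(labels)
```

Note that `A1` appears twice after the join (two admits), which is why the de-duplication step is needed.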

In [None]:
#drop table if exists recidivism_labels_2006_2011;
sql_string = "create table public.recidivism_labels_2006_2011 as "
sql_string += "select distinct inmate_doc_number, recidivism "
sql_string += "from recidivism_2006_2011 "
sql_string += ";"

cur.execute(sql_string)

In [None]:
sql_string = "SELECT * "
sql_string += "FROM recidivism_labels_2006_2011 "
sql_string += ";"

label_2006_2011 = pd.read_sql(sql_string, con = conn)

In [None]:
label_2006_2011.head(5)

Label function

In [None]:
def create_labels(prediction_start, prediction_end, conn):
    """
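    Generate a table of labels and return it as a dataframe.
    
    Parameters
    ----------
    prediction_start: str
        Start date of the outcome window (inclusive), e.g. '2011-01-01'.
    prediction_end: str
        End date of the outcome window (exclusive), e.g. '2016-01-01'.
    conn: obj
        Open database connection.
        
    Returns
    -------
    df_label: DataFrame
        One row per inmate_doc_number with a 0/1 recidivism label.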
    """
    begin_range = prediction_start
    end_range = prediction_end
    begin_year = parse(begin_range, fuzzy=True).year
    end_year = parse(end_range, fuzzy=True).year
    cursor = conn.cursor()
    
    sql_script="""

drop table if exists release_dates_1989_{begin_year};
create temp table release_dates_1989_{begin_year} as
select inmate_doc_number, actual_sentence_end_date, sentence_begin_date_for_max
from inmt4bb1
where actual_sentence_end_date >= '1989-01-01' and actual_sentence_end_date < '{begin_range}';
commit;

drop table if exists last_exit_1989_{begin_year};
create temp table last_exit_1989_{begin_year} as
select inmate_doc_number, max(actual_sentence_end_date) actual_sentence_end_date
from release_dates_1989_{begin_year}
group by inmate_doc_number;
commit;

drop table if exists admit_{begin_year}_{end_year};
create temp table admit_{begin_year}_{end_year} as
select inmate_doc_number, sentence_begin_date_for_max
from inmt4bb1
where sentence_begin_date_for_max >= '{begin_range}' and sentence_begin_date_for_max < '{end_range}' and inmate_sentence_component = 1;
commit;

drop table if exists recidivism_{begin_year}_{end_year};
create temp table recidivism_{begin_year}_{end_year} as
select r.inmate_doc_number, r.actual_sentence_end_date, a.sentence_begin_date_for_max,
case when a.sentence_begin_date_for_max is null then 0 else 1 end recidivism
from last_exit_1989_{begin_year} r
left join admit_{begin_year}_{end_year} a on r.inmate_doc_number = a.inmate_doc_number;
commit;

drop table if exists recidivism_labels_{begin_year}_{end_year};
create table recidivism_labels_{begin_year}_{end_year} as
select distinct inmate_doc_number, recidivism
from recidivism_{begin_year}_{end_year};
commit; 

    """.format(begin_range=begin_range,
               end_range=end_range,
               begin_year=begin_year,
               end_year=end_year)
    
    cursor.execute(sql_script)
    df_label = pd.read_sql('select * from recidivism_labels_{begin_year}_{end_year}'.format(
                                                                                    begin_year=begin_year,
                                                                                    end_year=end_year), conn)    
    return df_label

In [None]:
label_2011_2016 = create_labels('2011-01-01', '2016-01-01', conn)

In [None]:
label_2011_2016.head(5)

# Feature Generation
---

Our preliminary features are the following:

- `num_admits`: The number of times someone was admitted to prison between 1989 and 2005. The more often someone has been to prison, the more likely they are to be arrested again. 

- `length_longest_sentence`: The length of the longest sentence of all admits between 1989 and 2005.

- `age_first_admit`: The age at which someone was first admitted to prison. This is calculated by subtracting their birth year (extracted from `offender_birth_date`) from the year they were first admitted to prison. The idea behind this feature is that people who are younger when they are first arrested are more likely to be arrested again. 

- ...
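
Before writing the SQL, it can help to see the three features computed on a tiny example. The sketch below uses an invented `sentences` table and an invented `birth_year` lookup; only the feature names match the ones used in this notebook.

```python
import pandas as pd

# Toy sentence records (values are invented).
sentences = pd.DataFrame({
    "inmate_doc_number": ["A1", "A1", "B2"],
    "sentence_begin_date_for_max": pd.to_datetime(
        ["1992-01-01", "1998-06-01", "2000-03-15"]
    ),
    "actual_sentence_end_date": pd.to_datetime(
        ["1993-01-01", "2001-06-01", "2000-09-15"]
    ),
})
# Hypothetical birth-year lookup, indexed by inmate_doc_number.
birth_year = pd.Series({"A1": 1970, "B2": 1975})

# Sentence length in days for every record.
sentences["length_days"] = (
    sentences["actual_sentence_end_date"] - sentences["sentence_begin_date_for_max"]
).dt.days

# One row per person: count of admits, longest sentence, first admit date.
features = sentences.groupby("inmate_doc_number").agg(
    num_admits=("sentence_begin_date_for_max", "count"),
    length_longest_sentence=("length_days", "max"),
    first_admit=("sentence_begin_date_for_max", "min"),
)

# Age at first admit = year of first admit minus birth year.
features["age_first_admit"] = features["first_admit"].dt.year - birth_year
print(features)
```

The SQL cells that follow build the same three columns, one table at a time.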

Number of admits

In [None]:
#drop table if exists feature_num_admits_1989_2006;
sql_string = "create table feature_num_admits_1989_2006 as "
sql_string += "select inmate_doc_number, count(*) num_admits "
sql_string += "from inmt4bb1 "
sql_string += "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2006_2011) "
sql_string += "and sentence_begin_date_for_max >= '1988-01-01' and sentence_begin_date_for_max < '2006-01-01' and inmate_sentence_component = 1 " 
sql_string += "group by inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

Length of longest sentence

In [None]:
#drop table if exists feature_length_sentence_1989_2006;
sql_string = "create table feature_length_sentence_1989_2006 as "
sql_string += "select inmate_doc_number, inmate_sentence_component, (actual_sentence_end_date - sentence_begin_date_for_max) length_sentence "
sql_string += "from inmt4bb1 "
sql_string += "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2006_2011) "
sql_string += "and sentence_begin_date_for_max >= '1988-01-01' and sentence_begin_date_for_max < '2006-01-01' and inmate_sentence_component = 1 " 
sql_string += "and sentence_begin_date_for_max > '0001-01-01' and actual_sentence_end_date > '0001-01-01' and actual_sentence_end_date > sentence_begin_date_for_max "
sql_string += ";"

cur.execute(sql_string)

In [None]:
#drop table if exists feature_length_long_sentence_1989_2006;
sql_string = "create temp table feature_length_long_sentence_1989_2006 as "
sql_string += "select inmate_doc_number, max(length_sentence) length_longest_sentence "
sql_string += "from feature_length_sentence_1989_2006 "
sql_string += "group by inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

Age at first admit

In [None]:
#drop table if exists docnbr_admityr;
sql_string = "create temp table docnbr_admityr as "
sql_string += "select inmate_doc_number, min(sentence_begin_date_for_max) min_admityr "
sql_string += "from inmt4bb1 "
sql_string += "where sentence_begin_date_for_max > '0001-01-01' "
sql_string += "group by inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

In [None]:
#drop table if exists age_first_admit_birth_year;
sql_string = "create temp table age_first_admit_birth_year as "
sql_string += "select da.inmate_doc_number, extract(year from da.min_admityr) min_admityr, extract(year from p.offender_birth_date) offender_birth_date "
sql_string += "from docnbr_admityr da "
sql_string += "left join ofnt3aa1 p on da.inmate_doc_number = p.offender_nc_doc_id_number "
sql_string += ";"

cur.execute(sql_string)

In [None]:
#drop table if exists feature_age_first_admit;
sql_string = "create table feature_age_first_admit as "
sql_string += "select inmate_doc_number, (min_admityr - offender_birth_date) age_first_admit "
sql_string += "from age_first_admit_birth_year "
sql_string += ";"

cur.execute(sql_string)

In [None]:
# drop table if exists feature_agefirstadmit;
sql_string = "create table feature_agefirstadmit as "
sql_string += "select inmate_doc_number, age_first_admit "
sql_string += "from feature_age_first_admit "
sql_string += "where inmate_doc_number in (select inmate_doc_number from feature_num_admits_1989_2006) "
sql_string += ";"

cur.execute(sql_string)

Join everything

In [None]:
# drop table if exists features_1989_2006;
sql_string = "create table features_1989_2006 as "
sql_string += "select f1.inmate_doc_number, f1.num_admits, f2.length_longest_sentence, f3.age_first_admit "
sql_string += "from feature_num_admits_1989_2006 f1 "
sql_string += "left join feature_length_long_sentence_1989_2006 f2 on f1.inmate_doc_number = f2.inmate_doc_number "
sql_string += "left join feature_agefirstadmit f3 on f1.inmate_doc_number = f3.inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

In [None]:
sql_string = "SELECT * "
sql_string += "FROM features_1989_2006 "
sql_string += ";"

features_1989_2006 = pd.read_sql(sql_string, con = conn)
features_1989_2006.describe()

Function to create features

In [None]:
def create_features(prediction_start, prediction_end, conn):
    """
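    Generate a table of features and return it as a dataframe.
    Note: There has to be a table of labels that corresponds to the same time period. 
    
    Parameters
    ----------
    prediction_start: str
        Start date of the outcome window; features only use data before this date.
    prediction_end: str
        End date of the outcome window (used to find the matching label table).
    conn: obj
        Open database connection.
        
    Returns
    -------
    df_features: DataFrame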
    """
    begin_range = prediction_start
    end_range = prediction_end
    begin_year = parse(begin_range, fuzzy=True).year
    end_year = parse(end_range, fuzzy=True).year 
    cursor = conn.cursor()
    
    sql_script="""

drop table if exists feature_num_admits_1989_{begin_year};
create table feature_num_admits_1989_{begin_year} as 
select inmate_doc_number, count(*) num_admits
from inmt4bb1
where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{begin_year}_{end_year})
and sentence_begin_date_for_max >= '1988-01-01' and sentence_begin_date_for_max < '{begin_range}' and inmate_sentence_component = 1
group by inmate_doc_number; 
commit; 

drop table if exists feature_length_sentence_1989_{begin_year};
create table feature_length_sentence_1989_{begin_year} as
select inmate_doc_number, inmate_sentence_component, (actual_sentence_end_date - sentence_begin_date_for_max) length_sentence
from inmt4bb1
where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{begin_year}_{end_year})
and sentence_begin_date_for_max >= '1988-01-01' and sentence_begin_date_for_max < '{begin_range}' and inmate_sentence_component = 1
and sentence_begin_date_for_max > '0001-01-01' and actual_sentence_end_date > '0001-01-01' and actual_sentence_end_date > sentence_begin_date_for_max ;
commit; 

drop table if exists feature_length_long_sentence_1989_{begin_year};
create temp table feature_length_long_sentence_1989_{begin_year} as
select inmate_doc_number, max(length_sentence) length_longest_sentence
from feature_length_sentence_1989_{begin_year}
group by inmate_doc_number;
commit; 

drop table if exists docnbr_admityr;
create temp table docnbr_admityr as
select inmate_doc_number, min(sentence_begin_date_for_max) min_admityr
from inmt4bb1
where sentence_begin_date_for_max > '0001-01-01'
group by inmate_doc_number;
commit; 

drop table if exists age_first_admit_birth_year;
create temp table age_first_admit_birth_year as
select da.inmate_doc_number, extract(year from da.min_admityr) min_admityr, extract(year from p.offender_birth_date) offender_birth_date
from docnbr_admityr da
left join ofnt3aa1 p on da.inmate_doc_number = p.offender_nc_doc_id_number;
commit; 

drop table if exists feature_age_first_admit; 
create table feature_age_first_admit as 
select inmate_doc_number, (min_admityr - offender_birth_date) age_first_admit
from age_first_admit_birth_year;
commit; 

drop table if exists feature_agefirstadmit; 
create table feature_agefirstadmit as
select inmate_doc_number, age_first_admit
from feature_age_first_admit
where inmate_doc_number in (select inmate_doc_number from feature_num_admits_1989_{begin_year});
commit; 

drop table if exists features_1989_{begin_year}; 
create table features_1989_{begin_year} as
select f1.inmate_doc_number, f1.num_admits, f2.length_longest_sentence, f3.age_first_admit
from feature_num_admits_1989_{begin_year} f1
left join feature_length_long_sentence_1989_{begin_year} f2 on f1.inmate_doc_number = f2.inmate_doc_number
left join feature_agefirstadmit f3 on f1.inmate_doc_number = f3.inmate_doc_number;
commit; 

    """.format(begin_range=begin_range, 
               end_range = end_range,
               begin_year = begin_year,
               end_year = end_year)
    
    cursor.execute(sql_script)
    df_features = pd.read_sql('select * from features_1989_{begin_year}'.format(begin_year=begin_year), conn)    
    return df_features     

In [None]:
features_1989_2011 = create_features('2011-01-01', '2016-01-01', conn)

In [None]:
features_1989_2011.describe()

# Create Training and Test Sets
---

### Our Training Set

We are going to create a training set that takes people at the beginning of 2006 and generates labels for them based on data from 2006-2010. The features for each person are created based on data from the beginning of our data (1989) up to the end of 2005.

*Note:* it is important to segregate your data based on time when creating features. Otherwise there can be "leakage," where you accidentally use information that you would not have known at the time.  This happens often when calculating aggregation features; for instance, it is quite easy to calculate an average using values that go beyond our training set time-span and not realize it.  
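
One way to guard against leakage is to check, before training, that no date used for features falls inside or after the outcome window. The helper `assert_no_leakage` below is a hypothetical sketch (it is not part of the original notebook), shown on invented data.

```python
import pandas as pd


def assert_no_leakage(df, date_columns, prediction_start):
    """Raise ValueError if any feature date falls on or after prediction_start.

    Hypothetical helper for illustration only.
    """
    cutoff = pd.Timestamp(prediction_start)
    for col in date_columns:
        latest = df[col].max()
        if latest >= cutoff:
            raise ValueError(
                "column {!r} has dates up to {}, which is not before {}".format(
                    col, latest.date(), cutoff.date()
                )
            )


# Toy admission history (values are invented).
history = pd.DataFrame({
    "inmate_doc_number": ["A1", "A1", "B2"],
    "sentence_begin_date_for_max": pd.to_datetime(
        ["1995-04-01", "2003-09-12", "2005-12-31"]
    ),
})

# Passes: every admission is before the start of the 2006 outcome window.
assert_no_leakage(history, ["sentence_begin_date_for_max"], "2006-01-01")

# Fails: with an earlier prediction date, the later admissions would leak.
try:
    assert_no_leakage(history, ["sentence_begin_date_for_max"], "2004-01-01")
except ValueError as err:
    print("Leakage detected:", err)
```

A check like this is cheap to run on every feature table and catches the most common mistake, a date filter that was not updated for a new prediction window.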

In [None]:
sql_string = "create table train_matrix as "
sql_string += "select l.inmate_doc_number, l.recidivism, f.num_admits, f.length_longest_sentence, f.age_first_admit "
sql_string += "from recidivism_labels_2006_2011 l "
sql_string += "left join features_1989_2006 f on f.inmate_doc_number = l.inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

In [None]:
sql_string = "SELECT * "
sql_string += "FROM train_matrix "
sql_string += ";"

df_training = pd.read_sql(sql_string, con = conn)
df_training.head()

### Our Test (Validation) Set

We will then take the model built on that training set and validate it on the Test Set. Our testing set will use labels from 2011-2015, and our features will be generated from 1989-2010. 

In [None]:
sql_string = "create table test_matrix as "
sql_string += "select l.inmate_doc_number, l.recidivism, f.num_admits, f.length_longest_sentence, f.age_first_admit "
sql_string += "from recidivism_labels_2011_2016 l "
sql_string += "left join features_1989_2011 f on f.inmate_doc_number = l.inmate_doc_number "
sql_string += ";"

cur.execute(sql_string)

In [None]:
sql_string = "SELECT * "
sql_string += "FROM test_matrix "
sql_string += ";"

df_testing = pd.read_sql(sql_string, con = conn)
df_testing.head()

### Data Cleaning

Before we proceed to model training, we check the percentage of missing values in the training data.

In [None]:
isnan_training_rows = df_training.isnull().any(axis=1)
nrows_training = df_training.shape[0]
nrows_training_isnan = df_training[isnan_training_rows].shape[0]
print('Fraction of rows with NaNs: {:.3f}'.format(float(nrows_training_isnan)/nrows_training))

We see that about 4% of the rows in our training set have missing values. In our example, we will drop rows with missing values. In practice, there are better ways of handling missing values, such as imputation.

In [None]:
df_training = df_training[~isnan_training_rows]
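
As an alternative to dropping rows, here is a minimal sketch of median imputation with `scikit-learn`'s `SimpleImputer`. The array `X_toy` is invented; its three columns merely mimic the order of our features. This is one common option, not the approach used in the rest of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (values are invented).
X_toy = np.array([
    [1.0, 200.0, 21.0],
    [3.0, np.nan, 35.0],
    [2.0, 400.0, np.nan],
])

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

If you use imputation, fit the imputer on the training set only and then call `imputer.transform` on the test set, so that no information from the test period leaks into training.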

Let's check that the values of the ages at first admit are reasonable.

In [None]:
np.unique( df_training['age_first_admit'] )

Let's drop any rows that have age <= 15 or >= 99.

In [None]:
keep = (df_training['age_first_admit'] > 15) & (df_training['age_first_admit'] < 99)
df_training = df_training[keep]

Let's check how much data we still have and how many examples of recidivism are in our training dataset. We don't necessarily need to have a perfect balance of recidivists and non-recidivists, but it's good to know what the "baseline" is in our dataset.

In [None]:
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['recidivism'].value_counts(normalize=True)

We have about 200,000 examples, and about 20% of those are *positive* examples (recidivist), which is what we're trying to identify. About 80% of the examples are *negative* examples (non-recidivist). Let's take a look at our testing set.

In [None]:
isnan_testing_rows = df_testing.isnull().any(axis=1)
nrows_testing = df_testing.shape[0]
nrows_testing_isnan = df_testing[isnan_testing_rows].shape[0]
print('Fraction of rows with NaNs: {:.3f}'.format(float(nrows_testing_isnan)/nrows_testing))

We see that about 3% of the rows in our testing set have missing values. This matches what we'd expect based on what we saw in the training set.

In [None]:
df_testing = df_testing[~isnan_testing_rows]

As before, we drop cases with age <= 15 or >= 99.

In [None]:
keep = (df_testing['age_first_admit'] > 15) & (df_testing['age_first_admit'] < 99)
df_testing = df_testing[keep]

In [None]:
print('Number of rows: {}'.format(df_testing.shape[0]))
df_testing['recidivism'].value_counts(normalize=True)

### Split into features and labels

In [None]:
sel_features = ['num_admits', 'length_longest_sentence', 'age_first_admit']
sel_label = 'recidivism'

In [None]:
X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_testing[sel_features].values
y_test = df_testing[sel_label].values

# Model Training
---

In [None]:
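# LogisticRegression is already imported in the Setup cell.
# The L1 penalty needs a solver that supports it, such as 'liblinear'.
model = LogisticRegression(penalty = 'l1', C = 1e5, solver = 'liblinear')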
model.fit( X_train, y_train )
print(model)

When we print the model results, we see different parameters we can adjust as we refine the model based on running it against test data (values such as `penalty`, `C` , and `intercept_scaling`).

To adjust these parameters, one would alter the call that creates the `LogisticRegression()` model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with `penalty` of "l2", `C` of 0.01, and `intercept_scaling` of 2 (as an example; `intercept_scaling` only takes effect with the `liblinear` solver), you'd create your model as follows:

    model = LogisticRegression(penalty = 'l2', C = 0.01, intercept_scaling = 2, solver = 'liblinear')

The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameter values, and see which perform best on held-out data. An obvious drawback of tuning against the test set is that you can *overfit* to it; to avoid this, you can tune with cross-validation on the training data and keep the test set for the final evaluation.
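
As a rough sketch of tuning with cross-validation, the example below searches over a few values of `C` using `GridSearchCV`. It runs on synthetic data from `make_classification` rather than on our recidivism data, and the grid values and the `average_precision` scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a training set with three features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

# Candidate values for the regularization strength (illustrative only).
param_grid = {"C": [0.01, 1.0, 100.0]}

# 5-fold cross-validation on the training data; the test set is never touched.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="average_precision",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

With our data you would call `search.fit(X_train, y_train)` and evaluate `search.best_estimator_` once on the test set. Keep in mind that plain k-fold splits ignore time, which may or may not be acceptable for a temporal problem like ours.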

Let's look at what the model learned and what the coefficients are.

In [None]:
model.coef_[0]

In [None]:
std_coef = np.std(X_test,0)*model.coef_
std_coef[0]

# Model Evaluation 
---

Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score (that can sometimes be interpreted as a probability) between 0 and 1, which lets you more finely rank all of the examples from *most likely* to *least likely* to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 (1/2) as positive (1), but there's no reason that has to be the cutoff. 

In [None]:
y_scores = model.predict_proba(X_test)[:,1]

Let's take a look at the distribution of scores and see if it makes sense to us. 

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(y_scores, kde=False)

Our distribution of scores is skewed, with the majority of scores on the lower end of the scale. We expect this because about 80% of the training data is made up of non-recidivists, so we'd guess that a higher proportion of the examples in the test set will be negative (meaning they should have lower scores). 

In [None]:
df_testing['y_score'] = y_scores

Tools like `sklearn` often have a default threshold of 0.5, but a good threshold is selected based on the data, model and the specific problem you are solving. As a trial run, let's set a threshold of 0.5. 

In [None]:
calc_threshold = lambda x,y: 0 if x < y else 1 
predicted = np.array( [calc_threshold(score,0.5) for score in y_scores] )
expected = y_test

## Confusion Matrix

Once we have converted our scores to 0 or 1 for classification, we create a *confusion matrix*, which has four cells: true negatives, true positives, false negatives, and false positives. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.

In [None]:
conf_matrix = confusion_matrix(expected,predicted)
print(conf_matrix)

The count of true negatives is `conf_matrix[0,0]`, false negatives `conf_matrix[1,0]`, true positives `conf_matrix[1,1]`, and false positives `conf_matrix[0,1]`.
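
To see how the four cells relate to accuracy, precision, and recall, here is a small worked sketch on invented labels. It unpacks the matrix with `.ravel()`, which for a binary problem returns the cells in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (values are invented).
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision_toy = tp / (tp + fp)                  # share of predicted positives that are right
recall_toy = tp / (tp + fn)                     # share of actual positives that were found

print(tn, fp, fn, tp)
print(accuracy_toy, precision_toy, recall_toy)
```

These hand-computed values agree with what `accuracy_score`, `precision_score`, and `recall_score` return for the same inputs.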

In [None]:
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

We get an accuracy score of 84%. Recall that our testing dataset had 85% non-recidivists and 15% recidivists. If we had just labeled all the examples as negative and guessed non-recidivist every time, we would have had an accuracy of 85%, so our basic model is not doing better than a "dumb classifier". That's ok, because we're just getting started!
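
The "guess non-recidivist every time" classifier can be sketched with `scikit-learn`'s `DummyClassifier`. The labels below are invented to match an 85% / 15% split; the single all-zero feature column exists only because `fit` expects a feature matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 85 negatives and 15 positives (invented to mirror the split above).
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
majority = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
majority_pred = majority.predict(X_toy)

majority_accuracy = accuracy_score(y_toy, majority_pred)
majority_recall = recall_score(y_toy, majority_pred)
print(majority_accuracy, majority_recall)
```

The accuracy looks respectable, but the recall shows that this baseline never identifies a single recidivist, which is why accuracy alone is a poor guide for imbalanced problems.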

In [None]:
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))

## AUC-PR and AUC-ROC

If we care about our whole precision-recall space, we can optimize for a metric known as the **area under the curve (AUC-PR)**, which is the area under the precision-recall curve. The maximum AUC-PR is 1. 

In [None]:
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(expected, y_scores)
auc_val = auc(recall_curve,precision_curve)

In [None]:
plt.plot(recall_curve, precision_curve)
plt.xlabel('Recall')
plt.ylabel('Precision')
print('AUC-PR: {:.2f}'.format(auc_val))
plt.show()

In [None]:
fpr, tpr, thresholds = roc_curve(expected, y_scores)
roc_auc = auc(fpr, tpr)

In [None]:
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

## Precision and Recall at k%

If we only care about a specific part of the precision-recall curve we can focus on more fine-grained metrics. For instance, say there is a special program for people likely to be recidivists, but only 5% can be admitted. In that case, we would want to prioritize the 5% who were *most likely* to end up back in jail, and it wouldn't matter too much how accurate we were on the 80% or so who weren't very likely to end up back in jail. 

Let's say that, out of the approximately 200,000 prisoners, we can intervene on 5% of them, or the "top" 10,000 prisoners (where "top" means highest predicted risk of recidivism). We can then focus on optimizing our **precision at 5%**.

In [None]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: array
        ground truth labels
    y_prob: numpy array
        predicted probabilities from the model
    model_name: str
        model name used as the plot title (e.g., LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()

In [None]:
def precision_at_k(y_true, y_scores,k):
    
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)
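
The notebook's function covers precision only. As a companion, here is a hedged sketch that computes both precision and recall at k by ranking. It is an alternative, rank-based formulation (take exactly the top `k` share of examples) rather than the threshold-based one used in `precision_at_k`, and the function name is hypothetical. It assumes `y_true` contains at least one positive example.

```python
import numpy as np


def precision_recall_at_k_ranked(y_true, y_scores, k):
    """Precision and recall among the top k share of examples, ranked by score.

    Hypothetical helper; ties are broken by original order.
    """
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_top = max(1, int(k * len(y_scores)))              # how many examples we can act on
    top_idx = np.argsort(-y_scores, kind="stable")[:n_top]
    true_positives = y_true[top_idx].sum()
    precision = true_positives / n_top                   # hits among those selected
    recall = true_positives / y_true.sum()               # hits among all positives
    return precision, recall


# Toy example: 10 people, 3 of whom are actual positives (values are invented).
y_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

p_toy, r_toy = precision_recall_at_k_ranked(y_toy, scores_toy, 0.2)
print(p_toy, r_toy)
```

In the toy example the top 20% are the two highest-scored people, one of whom is a true positive, so precision is one half and recall is one third.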

In [None]:
plot_precision_recall_n(expected,y_scores, 'LR')

In [None]:
p_at_1 = precision_at_k(expected,y_scores, 0.01)
print('Precision at 1%: {:.2f}'.format(p_at_1))

In [None]:
p_at_5 = precision_at_k(expected,y_scores, 0.05)
print('Precision at 5%: {:.2f}'.format(p_at_5))

## Baseline 

It is important to check our model against a reasonable **baseline** to know how well our model is doing. Without any context, 84% accuracy can sound really great... but it's not so great when you remember that you could do about as well by declaring everyone a non-recidivist, which would be a useless model. 

A good place to start is checking against a *random* baseline, assigning every example a label (positive or negative) completely at random. 

In [None]:
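# draw a random score for every test example
random_score = np.random.uniform(0, 1, len(y_test))
random_predicted = np.array( [calc_threshold(score,0.5) for score in random_score] )
# precision at k needs the continuous scores; thresholded 0/1 labels would flag about half of the examples
random_p_at_5 = precision_at_k(expected, random_score, 0.05)
random_p_at_5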
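
What should we expect from a random baseline? If scores carry no information, the top 5% is just a random sample, so its precision should be close to the overall share of positives (the base rate). The simulation below illustrates this on synthetic labels with an assumed 15% base rate; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels with roughly 15% positives (an assumed base rate).
y_sim = (rng.uniform(size=100_000) < 0.15).astype(int)

# Scores that are unrelated to the labels.
random_scores_sim = rng.uniform(size=y_sim.size)

# Take the 5% of examples with the highest random scores.
n_top = int(0.05 * y_sim.size)
top_idx = np.argsort(-random_scores_sim)[:n_top]

random_precision_at_5 = y_sim[top_idx].mean()
base_rate = y_sim.mean()
print(random_precision_at_5, base_rate)
```

Any model worth using should beat the base rate at the top of its ranking by a clear margin.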

# More Models
---

We have only scratched the surface of what we can do with our model. We've only tried one classifier (Logistic Regression), and there are plenty more classification algorithms in `sklearn`. Let's try them! 


In [None]:
clfs = {'RF': RandomForestClassifier(n_estimators=500, n_jobs=-1),
        'ET': ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=100)}

In [None]:
sel_clfs = ['RF', 'ET', 'GB']

In [None]:
max_p_at_k = 0
for clfNM in sel_clfs:
    clf = clfs[clfNM]
    clf.fit( X_train, y_train )
    print(clf)
    y_score = clf.predict_proba(X_test)[:,1]
    predicted = np.array(y_score)
    expected = np.array(y_test)
    plot_precision_recall_n(expected,predicted, clfNM)
    p_at_5 = precision_at_k(expected,y_score, 0.05)
    if max_p_at_k < p_at_5:
        max_p_at_k = p_at_5
    print('Precision at 5%: {:.2f}'.format(p_at_5))

Let's explore some of the models we just built

In [None]:
# explore random forest RF
sel_clfs
clf = clfs[sel_clfs[0]]
#clf = clfs[clfNM]
print(clf)
clf.fit( X_train, y_train )
print(clf.feature_importances_)

Let's see if we can make this look a little better

In [None]:
importances = clf.feature_importances_
std = np.std ([tree.feature_importances_ for tree in clf.estimators_],
       axis=0)
indices = np.argsort(importances)[::-1]

print ("Feature ranking")
for f in range(X_test.shape[1]):
    # look up the name through the sorted index so names match their importances
    print ("%d. %s (%f)" % (f + 1, sel_features[indices[f]], importances[indices[f]]))

# plot 
plt.figure()
plt.title ("Feature Importances")
plt.bar(range(X_test.shape[1]), importances[indices], color='r',
      yerr=std[indices], align = "center")
# label the bars in the same (sorted) order as the plotted importances
plt.xticks(range(X_test.shape[1]), [sel_features[i] for i in indices], rotation=90)
plt.xlim([-1, X_test.shape[1]])
plt.show()

Our model has just scratched the surface. Try the following: 
    
- Create more features
- Try more models
- Try different parameters for your model

In [None]:
cur.close()
conn.close()

# Resources

- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), also available online, includes less mathematics and is more approachable.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).