## This is a recap of the machine learning lecture and notebook.

We will take the IL DOC (Illinois Department of Corrections) data and build predictive models for recidivism.
 
1. Import Python Libraries
2. Connect to the Database
3. Create rows for analysis
4. Create labels for each row
5. Create features for each row (based on the date of prediction for each row)
6. Create Training and Test/Validation Sets
7. Process Features within the training and test sets
    1. Create dummy variables
    2. Impute Missing values
    3. Scale/Normalize Variables
8. Build Models: For each model type
    1. Select features to use
    2. Select Label to build model for
    3. Fit model on training set
    4. Predict/Score on Test set
    5. Evaluate (try different metrics)
    6. Store results (print or csv)
9. Compare models to see how they work
10. Go deeper into well performing models to see which features are useful/predictive
11. Check for what types of people it puts in high risk groups/low risk groups
12. Check for biases
13. Decide which model to move forward with for future use

# Setup - Import python libraries

In [None]:
%pylab inline
from __future__ import division 
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from sklearn.metrics import (precision_recall_curve, roc_curve, auc,
                             accuracy_score, precision_score, recall_score,
                             confusion_matrix)
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
#import pydotplus
sns.set_style("white")
sns.set_context("poster", font_scale=1.25, rc={"lines.linewidth":1.25, "lines.markersize":8})

# Connect to the database

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) #database connection

# Create rows that we want and labels for each row

In [None]:
--create labels for 2 year readmission outcome

create table ada_class3.ildoc_exit_admit_joined as

select docnbr, curadm_date, exit_date, next_admit_date,

case 
   when (next_admit_date - exit_date <= 730) then 1    
   else 0
end as two_year_readmit 
from
(
select exit.docnbr, exit.curadm_date, exit.exit_date, 
    min(admit.curadm_date) as next_admit_date
from ildoc.ildoc_exit as exit left join ildoc.ildoc_admit as admit 
    on exit.docnbr = admit.docnbr and exit.exit_date <= admit.curadm_date
group by exit.docnbr, exit.curadm_date, exit.exit_date
) d;
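
The labeling rule in the query above can be sketched in plain Python for a single exit record. This is only an illustration: it assumes dates are `datetime.date` objects and that a missing next admission is represented as `None`. The function name `two_year_readmit` is made up here to mirror the SQL column.

```python
from datetime import date

def two_year_readmit(exit_date, next_admit_date, window_days=730):
    """Return 1 if the next admission is within window_days of the exit, else 0.

    A missing next admission (None) is labeled 0, matching the SQL CASE,
    where a comparison against NULL falls through to the ELSE branch.
    """
    if next_admit_date is None:
        return 0
    return 1 if (next_admit_date - exit_date).days <= window_days else 0

# Readmitted after roughly 13 months: positive label.
print(two_year_readmit(date(2008, 1, 15), date(2009, 2, 20)))
# Readmitted after more than two years: negative label.
print(two_year_readmit(date(2008, 1, 15), date(2010, 3, 1)))
# No later admission observed: negative label.
print(two_year_readmit(date(2008, 1, 15), None))
```

Note that an exit less than two years before the end of the data gets a 0 whenever no readmission has been observed yet, so those rows deserve some care.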


# Create Features

For each exit, we will create:
- demographic features: sex, race
- features from current stay: length of stay, age at exit, etc.
- aggregate features from all stays up to now: # of stays, age at first stay, total number of days in prison, etc.

## Remember: These features can *only* come from data on or before the exit date since that is your time of prediction

If you are predicting at admit time or during someone's stay in prison, then you should have a date of prediction to use.

We will create different types of features in different tables (with different types of source data) and then join them at the end.
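
The "only data on or before the prediction date" rule can be sketched in plain Python. This is a toy illustration under stated assumptions: the stays are made up, and `aggregate_features` is a hypothetical helper rather than anything defined in this notebook.

```python
from datetime import date

def aggregate_features(stays, prediction_date):
    """Summarize only the stays that ended on or before prediction_date.

    stays: list of (admit_date, exit_date) tuples for one person.
    """
    known = [(a, e) for a, e in stays if e <= prediction_date]
    days = [(e - a).days for a, e in known]
    return {
        "n_stays": len(known),
        "total_days_in_prison": sum(days),
        "first_admit_date": min(a for a, _ in known) if known else None,
    }

stays = [
    (date(2001, 3, 1), date(2002, 3, 1)),
    (date(2005, 6, 1), date(2006, 1, 1)),
    (date(2009, 9, 1), date(2010, 9, 1)),  # later than the prediction date below
]
features = aggregate_features(stays, prediction_date=date(2006, 1, 1))
print(features)
```

The third stay is ignored because it had not happened yet at the time of prediction; including it would leak future information into the features.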


In [None]:
--create feature set 1

create table ada_class3.temp_ildoc_features1 as

select a.docnbr, a.curadm_date, a.exit_date,

max(((exit_date - birth_date)/365)) as age_at_exit,
max(sex) as sex,
max(race) as race,
max( a.exit_date -  a.curadm_date) as days_in_prison_this_time,
max(birth_date) as birth_date

from ada_class3.ildoc_exit_admit_joined a
join ildoc.ildoc_exit b using (docnbr, curadm_date, exit_date)
group by 1,2,3;


--create feature set 2

create table ada_class3.temp_ildoc_features2 as

select a.docnbr, a.exit_date, a.curadm_date,

count(distinct b.exit_date) as prior_exits,
min (b.curadm_date) as first_admit_date,
sum(b.exit_date - b.curadm_date) as total_days_in_prison,
sum(b.exit_date - b.curadm_date)/count(distinct b.exit_date) as avg_days_in_prison

from ada_class3.ildoc_exit_admit_joined a
left join ildoc.ildoc_exit b on a.docnbr = b.docnbr and a.exit_date >= b.exit_date
group by 1,2,3;

--create feature set 3

create table ada_class3.temp_ildoc_features3 as 

select a.docnbr,
max(first_admit_date - birth_date)/365 as age_at_first_admit 

from ildoc.ildoc_exit a
left join (select docnbr, min(curadm_date) as first_admit_date 
from ildoc.ildoc_exit group by docnbr) b using (docnbr)
group by 1 ;


# Combine Features and Labels

In [None]:
--create joined feature and labels table

create table ada_class3.ildoc_matrix as
select a.docnbr,a.exit_date,a.curadm_date, a.next_admit_date, a.two_year_readmit,age_at_exit, sex,race,
days_in_prison_this_time, prior_exits, first_admit_date, total_days_in_prison,avg_days_in_prison,
age_at_first_admit

from 

ada_class3.ildoc_exit_admit_joined a left join ada_class3.temp_ildoc_features1 f1  
using (docnbr, curadm_date, exit_date)
left join  ada_class3.temp_ildoc_features2 f2  
using (docnbr, curadm_date, exit_date) left join
ada_class3.temp_ildoc_features3 f3 using (docnbr);


# Pull data in to python

In [None]:
df_all = pd.read_sql("select * from ada_class3.ildoc_matrix where exit_date is not null;", conn, parse_dates = ['exit_date','curadm_date', 'next_admit_date'])

###  Create Dummy variables (convert categorical to binary)


In [None]:
columns_to_dummify = ['sex', 'race']
df_all = pd.get_dummies(df_all, dummy_na = True, columns = columns_to_dummify)
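
To make the idea of dummy variables concrete, here is a hand-rolled sketch of the encoding in plain Python. It is not how `pd.get_dummies` is implemented; `one_hot` is a hypothetical helper, the input values are made up, and the sketch assumes missing values are represented as `None`.

```python
def one_hot(values, prefix):
    """Turn a list of category values into one 0/1 column per category.

    None is treated as its own category, named '<prefix>_nan'.
    """
    labels = ["nan" if v is None else str(v) for v in values]
    categories = sorted(set(labels))
    return {
        "{}_{}".format(prefix, c): [1 if lab == c else 0 for lab in labels]
        for c in categories
    }

encoded = one_hot(["M", "F", None, "M"], prefix="sex")
for name in sorted(encoded):
    print("{}: {}".format(name, encoded[name]))
```

Each row has exactly one 1 across the generated columns, which is what lets a model treat a categorical value as a set of binary features.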

## Create Train and test sets

In [None]:
# .copy() makes each split an independent DataFrame and avoids SettingWithCopyWarning in later cells
df_train1 = df_all[df_all['exit_date'] < '2009-06-01'].copy()
df_test1 = df_all[df_all['exit_date'].between('2009-06-01','2010-06-01')].copy()
df_train2 = df_all[df_all['exit_date'] < '2012-06-01'].copy()
df_test2 = df_all[df_all['exit_date'].between('2012-06-01' , '2013-06-01')].copy()
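
The time-based split can also be sketched without pandas. The rows and `docnbr` values below are made up, and `temporal_split` is a hypothetical helper. The upper bound is inclusive to mirror `Series.between`, which includes both endpoints by default.

```python
from datetime import date

def temporal_split(rows, train_end, test_end):
    """Train on exits before train_end; test on exits in [train_end, test_end]."""
    train = [r for r in rows if r["exit_date"] < train_end]
    test = [r for r in rows if train_end <= r["exit_date"] <= test_end]
    return train, test

rows = [
    {"docnbr": "A1", "exit_date": date(2008, 5, 1)},
    {"docnbr": "B2", "exit_date": date(2009, 6, 1)},
    {"docnbr": "C3", "exit_date": date(2010, 6, 1)},
    {"docnbr": "D4", "exit_date": date(2011, 1, 1)},
]
train, test = temporal_split(rows, date(2009, 6, 1), date(2010, 6, 1))
print([r["docnbr"] for r in train])
print([r["docnbr"] for r in test])
```

Because the label looks two years past the exit date, labels for exits near the end of a training window are only fully observed two years later. Whether to leave a gap between the training and test windows is a design decision this sketch does not make.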

## Check for missing values and impute

In [None]:
print(df_all.isnull().sum())

In [None]:
# Impute missing ages with the mean of the matching *training* set, so that
# no information from the test period is used to fill values.
impute_cols = ['age_at_exit', 'age_at_first_admit']
for train, test in [(df_train1, df_test1), (df_train2, df_test2)]:
    for col in impute_cols:
        train_mean = train[col].mean()
        train[col] = train[col].fillna(train_mean)
        test[col] = test[col].fillna(train_mean)
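
One common convention for mean imputation, sketched here with made-up ages, is to compute the fill value from the training rows and reuse it for the test rows. The helper names are hypothetical, and missing values are assumed to be `None`.

```python
def mean_ignoring_missing(values):
    """Mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    return sum(present) / float(len(present))

def impute(values, fill_value):
    """Replace missing entries with fill_value."""
    return [fill_value if v is None else v for v in values]

train_age = [30.0, None, 40.0, 50.0]
test_age = [None, 25.0]

train_mean = mean_ignoring_missing(train_age)  # computed on training rows only
train_filled = impute(train_age, train_mean)
test_filled = impute(test_age, train_mean)     # test rows reuse the training mean
print(train_filled)
print(test_filled)
```

Reusing the training mean keeps the test set an honest stand-in for future data, where a test-period mean would not be available at prediction time.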




# Define  feature groups and labels 

In [None]:
all_features = ['days_in_prison_this_time','age_at_exit','prior_exits','total_days_in_prison','avg_days_in_prison',
                 'race_ASN','race_BLK','race_HSP','race_IND','race_WHI', 'race_nan','race_UNK',
                'sex_F', 'sex_M', 'sex_nan' ]

sex_features = ['sex_F', 'sex_M', 'sex_nan']
race_features = ['race_ASN','race_BLK','race_HSP','race_IND','race_WHI', 'race_nan','race_UNK']

sel_label = 'two_year_readmit'

In [None]:
features_to_use = all_features

X_train = df_train1[features_to_use]
y_train = df_train1[sel_label]
X_test = df_test1[features_to_use]
y_test = df_test1[sel_label]

# Scale/Normalize Variables

In [None]:
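from sklearn.preprocessing import MinMaxScaler
# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)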
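
Min-max scaling maps each feature to the range [0, 1] on the data it was fitted on, using x' = (x - min) / (max - min). Below is a minimal sketch with made-up numbers, assuming the minimum and maximum are learned from training values and then reused for test values. `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned parameters."""
    span = float(hi - lo)
    if span == 0:
        # Constant feature: map everything to 0 (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

train = [10, 20, 30]
test = [15, 40]
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(train_scaled)
print(test_scaled)
```

A test value outside the training range lands outside [0, 1]. That is expected when the scaling parameters come from the training set alone.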



# Fit a model

In [None]:
# Let's fit a random forest model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
model.fit( X_train, y_train )
print(model)

# Predict on the Test Set and Look at the Score Distribution

In [None]:
# Generate predicted scores (probability of the positive class) from our "predictors" using the model.
y_scores = model.predict_proba(X_test)[:,1]
df_test1['y_score'] = y_scores
sns.distplot(y_scores, kde=False, rug=False)

# Evaluate: Calculate Precision and Recall at different levels of thresholds and intervention capacity

In [None]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: array-like
        ground truth labels (0 or 1)
    y_prob: numpy array
        predicted probabilities from the model
    model_name: str
        model name used as the plot title (e.g., LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()

In [None]:
expected = y_test
plot_precision_recall_n(expected,y_scores, 'RF')

# THRESHOLD THRESHOLD THRESHOLD
To explore the effect of choosing different thresholds to turn the prediction scores into 0 or 1, we will select one arbitrary threshold and compute the confusion matrix, accuracy, precision, and recall metrics.

In [None]:
threshold = 0.8

calc_threshold = lambda x,y: 0 if x < y else 1 
predicted = np.array( [calc_threshold(score,threshold) for score in y_scores] )


## Calculate Confusion Matrix, Accuracy, Precision, and Recall metrics

In [None]:
conf_matrix = confusion_matrix(expected,predicted)

print("THRESHOLD = " + str(threshold) + "\n")
print("Confusion Matrix\n[[TN   FP]\n [FN  TP]]\n")
print(conf_matrix)

# generate an accuracy score by comparing expected to predicted.

accuracy = accuracy_score(expected, predicted)
print("\nAccuracy = " + str(round(accuracy*100, 2)) + "%")


precision = round(precision_score(expected, predicted)*100,0)
recall = round(recall_score(expected, predicted)*100,0)
print( "Precision = " + str( precision ) + "%" )
print("Recall = " + str(recall) + "%")
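
To see what the threshold does to the counts, here is a hand-rolled sketch that tallies the confusion matrix for two thresholds. The labels and scores are made up, and `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
def confusion_counts(y_true, y_scores, threshold):
    """Return (tn, fp, fn, tp) when scores >= threshold are predicted positive."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, y_scores):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.85, 0.6, 0.3, 0.2, 0.1]
for t in (0.8, 0.5):
    tn, fp, fn, tp = confusion_counts(y_true, scores, t)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    print("threshold={} tn={} fp={} fn={} tp={} precision={:.2f} recall={:.2f}".format(
        t, tn, fp, fn, tp, precision, recall))
```

Lowering the threshold from 0.8 to 0.5 flags one more true positive here, so recall rises. In general the trade-off between precision and recall depends on how the scores are distributed.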


In [None]:
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:1f}'.format(auc_val))
    plt.show()
    plt.clf()

In [None]:
plot_precision_recall(expected, y_scores)

In [None]:
def precision_at_k(y_true, y_scores,k):
    
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)

In [None]:
p_at_1 = precision_at_k(expected,y_scores, 0.01)
print('Precision at 1%: {:.2f}'.format(p_at_1))
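
Precision at k answers the question: if we can only intervene on the top k fraction of people, what share of them are true positives? The sketch below is a plain-Python variant that ranks by score and takes exactly the top k fraction, which differs slightly from the threshold-based function above when scores are tied. The labels and scores are made up, and `precision_at_top_k` is a hypothetical name.

```python
def precision_at_top_k(y_true, y_scores, k):
    """Precision among the top k fraction of examples, ranked by score."""
    n_top = max(1, int(k * len(y_scores)))
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n_top]
    return sum(label for _, label in top) / float(n_top)

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

print(precision_at_top_k(y_true, y_scores, 0.2))  # top 2 examples
print(precision_at_top_k(y_true, y_scores, 0.4))  # top 4 examples
```

With these toy values, one of the top two and three of the top four examples are positives.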

# Let's run a lot of models

So far we've run one model and looked at the results. Now we will fit several model types and compare them.

In [None]:
clfs = {'RF': RandomForestClassifier(n_estimators=1000, n_jobs=-1),
       'ET': ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        'LR': LogisticRegression(penalty='l1', C=1e5),
        'SGD':SGDClassifier(loss='log'),
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, random_state=17, n_estimators=10),
        'NB': GaussianNB()}

In [None]:
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB']


In [None]:
max_p_at_k = 0
df_results = pd.DataFrame()
sns.set(font_scale=2)

for selected_classifier in sel_clfs:
    clf = clfs[selected_classifier]
    clf.fit( X_train, y_train )
    print(clf)
    y_score = clf.predict_proba(X_test)[:,1]
    predicted = np.array(y_score)
    expected = np.array(y_test)
    plot_precision_recall_n(expected,predicted, selected_classifier)
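    p_at_1 = precision_at_k(expected,y_score, 0.01)
    # Track the best precision at 1% across models for the baseline comparison below
    max_p_at_k = max(max_p_at_k, p_at_1)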
    p_at_5 = precision_at_k(expected,y_score,0.05)
    p_at_10 = precision_at_k(expected,y_score,0.10)
    p_at_20 = precision_at_k(expected,y_score,0.20)
    fpr, tpr, thresholds = roc_curve(expected,y_score)
    auc_val = auc(fpr,tpr)
    df_results = df_results.append([{
        'Classifier Type':selected_classifier,
        'precision_at_1_percent':p_at_1,
        'precision_at_5_percent':p_at_5,
        'precision_at_10_percent':p_at_10,
        'precision_at_20_percent':p_at_20,
        'Area Under Curve':auc_val,
        'Classifier Details': clf
    }])
    
    # Feature importances: linear models expose coef_, tree ensembles expose
    # feature_importances_. GaussianNB exposes neither, so skip the plot for it
    # instead of reusing the previous model's values.
    if hasattr(clf, 'coef_'):
        feature_import = dict(
            zip(features_to_use,clf.coef_.ravel()))
    elif hasattr(clf, 'feature_importances_'):
        feature_import = dict(
            zip(features_to_use, clf.feature_importances_))
    else:
        feature_import = None

    if feature_import is not None:
        print("FEATURE IMPORTANCES")
        print(feature_import)

        plt.clf()
        sns.set_style('whitegrid')
        f, ax = plt.subplots(figsize=(36,12))
        sns.barplot(x=list(feature_import.keys()), y=list(feature_import.values()))
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()
    
#saving results to csv
df_results.to_csv('modelrun.csv')
df_results

# Assess Model Against Baselines

It is important to check our model against a reasonable **baseline** to know how well it is doing. Without any context, 78% accuracy can sound really great, but it is far less impressive if you could do almost that well by declaring that no one will be readmitted within two years, which would be a useless model.
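
As a small illustration of why accuracy needs context, the sketch below computes the accuracy of always predicting the most common class. The 25% positive rate is made up for the example and is not taken from the IL DOC data; `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return max(positives, negatives) / float(len(y_true))

# Made-up labels: 25 positives out of 100.
y_true = [0] * 75 + [1] * 25
baseline = majority_baseline_accuracy(y_true)
print("Majority-class accuracy: {:.2f}".format(baseline))
```

With labels this imbalanced, a model reporting 78% accuracy would be only three points better than predicting "no" for everyone.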

A good place to start is checking against a *random* baseline, assigning every example a label (positive or negative) completely at random. 

In [None]:
max_p_at_k

In [None]:
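# Random baseline: give every test example a uniform random score
random_score = np.random.uniform(0, 1, len(y_test))
random_predicted = np.array( [calc_threshold(score,0.5) for score in random_score] )
# Precision among the top 1% of random scores (the variable keeps its original name)
random_p_at_5 = precision_at_k(expected, random_score, 0.01)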
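
A useful sanity check on the random baseline is that its precision in the top k should hover around the overall positive rate. The sketch below averages over many random scorings of made-up labels with a 30% positive rate; the helper name and all numbers are assumptions for illustration.

```python
import random

def random_baseline_precision(y_true, k, seed=0):
    """Precision in the top k fraction when scores are uniform random numbers."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in y_true]
    n_top = max(1, int(k * len(y_true)))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n_top]) / float(n_top)

# Made-up labels with a 30% positive rate.
y_true = [1] * 300 + [0] * 700
estimates = [random_baseline_precision(y_true, 0.10, seed=s) for s in range(200)]
average = sum(estimates) / len(estimates)
print("Average precision at 10% for random scores: {:.3f}".format(average))
```

Any single random run can land above or below the base rate, which is why averaging several runs gives a steadier baseline than one draw.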

Another good practice is checking against an "expert" or rule-of-thumb baseline. For example, say that in talking to people at the IL DOC, you find that they think someone who has already been in prison multiple times is much more likely to be readmitted. Then you should check that your classifier does better than just labeling everyone who has had multiple past admits as positive.

In [None]:
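# Rule of thumb: flag anyone with more than 3 exits so far (prior_exits counts the current exit too)
reenter_predicted = np.array([ 1 if n_exits > 3 else 0 for n_exits in df_test1.prior_exits.values ])
reenter_p_at_1 = precision_at_k(expected,reenter_predicted,0.01)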

In [None]:
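# Constant score of 0 for everyone. Because precision_at_k labels every score >= the
# cutoff as positive, a constant score flags everyone and the result equals the base rate.
all_non_reenter = np.zeros(len(df_test1), dtype=int)
all_non_reenter_p_at_1 = precision_at_k(expected, all_non_reenter,0.01)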
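
The rule-of-thumb baseline can also be checked directly, without going through a score cutoff: flag everyone the rule selects and see what share of them are true positives. The counts and labels below are made up, and `rule_baseline_precision` is a hypothetical helper.

```python
def rule_baseline_precision(prior_counts, y_true, min_prior=3):
    """Precision of flagging everyone with more than min_prior earlier stays.

    Returns None when the rule flags nobody, since precision is undefined then.
    """
    flagged = [label for n, label in zip(prior_counts, y_true) if n > min_prior]
    if not flagged:
        return None
    return sum(flagged) / float(len(flagged))

prior_counts = [1, 5, 4, 2, 6, 0]
y_true = [0, 1, 0, 0, 1, 1]
rule_precision = rule_baseline_precision(prior_counts, y_true)
print(rule_precision)
```

A rule like this flags however many people it flags, so its precision is not measured at a fixed 1% of the population. That is worth keeping in mind when comparing it with precision at k for a model.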

In [None]:
sns.set_style("white")
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
fig, ax = plt.subplots(1, figsize=(22,12))
sns.barplot(['Random','All no readmit', 'More than 3 exits','Model'],
            [random_p_at_5, all_non_reenter_p_at_1, reenter_p_at_1, max_p_at_k],
            palette=['#6F777D','#6F777D','#6F777D','#800000'])
sns.despine()
plt.ylim(0,1)
plt.ylabel('precision at 1%')

## Resources

- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), also available online, includes less mathematics and is more approachable.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).