# Machine Learning Quick Start

# Table of contents

- [Setup](#Setup)

    - [Setup - Connect to the database](#Setup---Connect-to-the-database)
    
        - [Setup - Database - SQLAlchemy](#Setup---Database---SQLAlchemy)
        - [Setup - Database - psycopg2](#Setup---Database---psycopg2)
        - [Setup - Database - rollback if needed](#Setup---Database---rollback-if-needed)
        
- [Load data table](#Load-data-table)

    - [Database access examples](#Database-access-examples)
    - [Load your features and labels table](#Load-your-features-and-labels-table)

- [Data Check](#Data-Check)

    - [Look for null values](#Look-for-null-values)
    - [Examine values within columns](#Examine-values-within-columns)
    - [Examine distribution of key variables](#Examine-distribution-of-key-variables)

- [Model Fitting](#Model-Fitting)

    - [Make training and testing data](#Make-training-and-testing-data)
    
        - [Specify features and labels](#Specify-features-and-labels)
        - [Split into training and testing sets using scikit-learn](#Split-into-training-and-testing-sets-using-scikit-learn)
    
            - [OPTIONAL - free up memory](#OPTIONAL---free-up-memory)

        - [Create training and testing sets manually](#Create-training-and-testing-sets-manually)
    
    - [Model Selection](#Model-Selection)
    - [Model Understanding](#Model-Understanding)

- [Model Evaluation](#Model-Evaluation)

    - [Predicted vs. Expected](#Predicted-vs.-Expected)
    - [Confusion Matrix](#Confusion-Matrix)
    - [Accuracy](#Accuracy)
    - [Precision and Recall](#Precision-and-Recall)
    - [Precision and Recall at k percent](#Precision-and-Recall-at-k-percent)
    - [Baseline](#Baseline)

# Setup

- Back to [Table of contents](#Table-of-contents)

In [None]:
%pylab inline
import datetime
import gc
import pandas
import pandas as pd
import psycopg2
import psycopg2.extras
import sklearn
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import precision_recall_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import sqlalchemy
from sqlalchemy import create_engine
sns.set_style("white")

## Setup - Connect to the database

- Back to [Table of contents](#Table-of-contents)

In [None]:
# schema name
schema_name = ""

# ==> database table names - store reused database information in variables here.

# work table name
work_db_table = ""

print( "Database variables initialized at " + str( datetime.datetime.now() ) )

In [None]:
# Database connection properties
db_host = "10.10.2.10"
db_port = -1
db_username = None
db_password = None
db_name = "appliedda"

print( "Database connection properties initialized at " + str( datetime.datetime.now() ) )

### Setup - Database - `SQLAlchemy`

- Back to [Table of contents](#Table-of-contents)

Initialize database connections.  First, SQLAlchemy engine:

In [None]:
# initialize database connections
# Create connection to database using SQLAlchemy
#     (3 '/' indicates use environment settings for username, host, and port)
sqlalchemy_connection_string = "postgresql://"

if ( ( db_host is not None ) and ( db_host != "" ) ):
    sqlalchemy_connection_string += str( db_host )
#-- END check to see if host --#

sqlalchemy_connection_string += "/"

if ( ( db_name is not None ) and ( db_name != "" ) ):
    sqlalchemy_connection_string += str( db_name )
#-- END check to see if database name --#

# create engine.
pgsql_engine = sqlalchemy.create_engine( sqlalchemy_connection_string )

print( "SQLAlchemy engine created at " + str( datetime.datetime.now() ) )
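
The cell above builds the URL from the host and database name only. As a hedged sketch, a URL that also carries the username, password, and port could be assembled as shown here. The helper name `build_postgres_url` and the sample credentials are hypothetical; the layout follows the usual `postgresql://user:password@host:port/dbname` form.

```python
from urllib.parse import quote_plus


def build_postgres_url( db_name, host = None, port = None, username = None, password = None ):
    """Assemble a PostgreSQL URL; parts that are None or empty are left out."""
    url = "postgresql://"
    if username:
        url += quote_plus( username )
        if password:
            # escape characters such as '@' or '/' that would break the URL.
            url += ":" + quote_plus( password )
        url += "@"
    if host:
        url += str( host )
        # a port only makes sense alongside a host; skip placeholders like -1.
        if port is not None and int( port ) > 0:
            url += ":" + str( port )
    url += "/" + str( db_name )
    return url


# host and database only, as in the cell above.
basic_url = build_postgres_url( "appliedda", host = "10.10.2.10" )

# made-up credentials, to show the escaping.
full_url = build_postgres_url( "appliedda", host = "10.10.2.10", port = 5432,
                               username = "analyst", password = "p@ss/word" )
print( basic_url )
print( full_url )
```

Either string could then be passed to `sqlalchemy.create_engine()` in place of `sqlalchemy_connection_string`.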

### Setup - Database - `psycopg2`

- Back to [Table of contents](#Table-of-contents)

And then a direct psycopg2 connection and cursor:

In [None]:
# create psycopg2 connection to Postgresql

# example connect() call that uses all the possible parameters
#pgsql_connection = psycopg2.connect( host = db_host, port = db_port, database = db_name, user = db_username, password = db_password )

# for SQLAlchemy, just needed database name. Same for DBAPI?
pgsql_connection = psycopg2.connect( host = db_host, database = db_name )

print( "Postgresql connection to database \"" + db_name + "\" created at " + str( datetime.datetime.now() ) )

In [None]:
# create a cursor that maps column names to values
pgsql_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )

print( "Postgresql cursor for database \"" + db_name + "\" created at " + str( datetime.datetime.now() ) )

### Setup - Database - rollback if needed

- Back to [Table of contents](#Table-of-contents)

In [None]:
# rollback, in case you need it.
pgsql_connection.rollback()

print( "Postgresql connection for database \"" + db_name + "\" rolled back at " + str( datetime.datetime.now() ) )

# Load data table

- Back to [Table of contents](#Table-of-contents)

For this quick start, we assume that there is a table that contains the items you want to analyze, with one row per item and one column per feature or label.  This table can contain multiple columns you'd like to use for labels.  You can either filter to a given set of features and labels in the SQL you use to load the data, or you can filter later, when you break out the data into X and y training and testing data frames.

## Database access examples

- Back to [Table of contents](#Table-of-contents)

The database connection allows us to use queries of a database to populate pandas DataFrames in Python.

In [None]:
# create SQL query
sql_select = """SELECT table_schema, table_name
FROM information_schema.tables
order by table_schema, table_name;"""

In [None]:
# load the data into a DataFrame (df)
df_tables = pd.read_sql( sql_select, pgsql_engine )

In [None]:
# look at a few sample rows.
df_tables.head()

## Load your features and labels table

- Back to [Table of contents](#Table-of-contents)

Now, we'll load our table that contains features and labels into a pandas DataFrame.

First we create a SELECT statement.  This can be very simple:

In [None]:
# build SELECT to pull in features and labels table.
sql_select = "SELECT *"
sql_select += " FROM " + schema_name + "." + work_db_table
sql_select += ";"

Or, it can be more complex:

In [None]:
# build SELECT to pull in features and labels table.
sql_select = "SELECT feature1, feature2, feature3, feature4, label1, label2"
sql_select += " FROM " + schema_name + "." + work_db_table
sql_select += " WHERE important_variable IS NOT NULL"
sql_select += " AND age > 18"
sql_select += ";"

Load the data into a pandas DataFrame.

In [None]:
# load the data into a DataFrame (df)
data_table_df = pd.read_sql( sql_select, pgsql_engine )

# Data Check

- Back to [Table of contents](#Table-of-contents)

Now, we look at the columns in our data set to see if they are appropriate for machine learning models, and if not, we fix them.  Examples of this process:

In [None]:
# first, look at small sample of rows.
data_table_df.head()

## Look for null values

- Back to [Table of contents](#Table-of-contents)

Machine learning models might or might not be able to accommodate null values in features or labels.  It is good to be aware of where nulls are in your data.

In [None]:
# get a list of the rows in the data that have empty values (known as NaN or null).
isnan_rows_list = data_table_df.isnull().any( axis = 1 )

In [None]:
# Take a look at the contents of these rows.
data_table_df[ isnan_rows_list ].head()

In [None]:
# look at percent of rows that contain nulls
nrows_data_table = data_table_df.shape[ 0 ]
nrows_data_table_isnan = data_table_df[ isnan_rows_list ].shape[ 0 ]
percent_isnan = float( nrows_data_table_isnan ) / nrows_data_table
print( 'Percent of rows with NaNs: {:.2%}'.format( percent_isnan ) )

In general, machine learning models don't handle nulls well.  So, we remove the rows with nulls/`NaN`s.

In [None]:
data_table_df = data_table_df[ ~isnan_rows_list ]
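
To see the null check on something small, here is a minimal sketch with a made-up four-row table (the `toy_` names and values are assumptions for illustration, not part of the real data). It also shows that pandas' `dropna()` keeps the same rows as the mask-based filter.

```python
import numpy as np
import pandas as pd

# made-up data: two of the four rows contain a null.
toy_df = pd.DataFrame( {
    "age": [ 25, np.nan, 40, 31 ],
    "label1": [ 0, 1, np.nan, 1 ]
} )

# boolean mask: True for each row that has at least one null.
toy_isnan_rows = toy_df.isnull().any( axis = 1 )

# the mean of a boolean mask is the fraction of True values.
toy_fraction_isnan = toy_isnan_rows.mean()
print( "Fraction of rows with NaNs: {:.2f}".format( toy_fraction_isnan ) )

# two equivalent ways to keep only the complete rows.
toy_masked = toy_df[ ~toy_isnan_rows ]
toy_dropped = toy_df.dropna()
print( toy_masked.equals( toy_dropped ) )
```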

## Examine values within columns

- Back to [Table of contents](#Table-of-contents)

Let's check the values of a column to see if they are reasonable.  Our example: "age":

In [None]:
# column name
column_name = "age"

First, we grab the unique values in the column.

In [None]:
# Use numpy to get unique values in a given column.
np.unique( data_table_df[ column_name ] )

Let's say, in our table, there are ages of 0 or less.  This is unlikely if these are people, for example.  So, let's drop any rows that have age less than 1 or greater than 150.

In [None]:
# create a filter criteria, then apply it to our data.
filter_criteria = ~( ( data_table_df[ column_name ] < 1) | ( data_table_df[ column_name ] > 150 ) )

# only keep rows from our DataFrame that fit our criteria.
data_table_df = data_table_df[ filter_criteria ]
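
As an alternative sketch, the same range rule as the filter cell above can be written with pandas' `Series.between()`, which includes both endpoints by default. The toy ages here are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame( { "age": [ -3, 0, 1, 42, 150, 151 ] } )

# the rule from the cell above: drop ages below 1 or above 150.
keep_by_mask = ~( ( toy_df[ "age" ] < 1 ) | ( toy_df[ "age" ] > 150 ) )

# Series.between() keeps values from 1 through 150, endpoints included.
keep_by_between = toy_df[ "age" ].between( 1, 150 )

print( toy_df[ keep_by_between ][ "age" ].tolist() )
```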

## Examine distribution of key variables

- Back to [Table of contents](#Table-of-contents)

As we clean up, we should intermittently check how much data we still have and how key variables of interest are distributed. We don't necessarily need to have a perfect balance in any given feature or label, but it's good to know what the "baseline" is in our dataset, to be able to intelligently evaluate our performance.

In [None]:
# number of rows:
print('Number of rows: {}'.format( data_table_df.shape[ 0 ] ) )

In [None]:
# look at distribution of a key variable
key_variable = "awesomeness"
data_table_df[ key_variable ].value_counts( normalize = True )

# Model Fitting

- Back to [Table of contents](#Table-of-contents)

## Make training and testing data

- Back to [Table of contents](#Table-of-contents)

Before we can fit a model, we need to split our data into separate sets of training and testing data.

For simple models where you have a pool of data you want to use to both test and train, scikit-learn provides an automated method for randomly splitting data from a single table of records into either a training or testing set.

For more complex models where you have specific data you want to use to train and test (training on older data, then testing against newer data to test performance over time, for instance), you can also set up your training and testing data manually.  Examples of each are below.

For the rest of the code in this notebook to run correctly, whichever method you choose, when you are done, you need the following variables set:

In [None]:
# DataFrames to hold training and testing data, with features (X-variables)
#     and label (y-variable) commingled.
df_training = None
df_testing = None

# DataFrames to hold features (X-variables) and label (y-variable)
#     for training and testing sets of data.  Should just include
#     features and labels, not additional columns.
X_train = None
y_train = None
X_test = None
y_test = None

# numpy array of label/y values, for use in scikit-learn training.
y_train_values = None

### Specify features and labels

- Back to [Table of contents](#Table-of-contents)

To start, we specify the names of the columns that we will use as features (predictors, or X variables) and as the label (predicted, or y variable).  Make sure to set these variables no matter how you are setting up your training and testing data, as they are referenced later in the notebook.

In [None]:
# Make list of the names of columns that contain features in our data table.
feature_column_names = []
feature_column_names.append( 'feature1' )
feature_column_names.append( 'feature2' )
feature_column_names.append( 'feature3' )
# ... etc.

# And, capture name of label column.
label_column_name = 'label1'

### Split into training and testing sets using scikit-learn

- Back to [Table of contents](#Table-of-contents)

First, starting with a single DataFrame that contains all of our data (`data_table_df`), we look at using scikit-learn to randomly split a single data set into test and train sets.

To start, create DataFrames that just contain features (predictors, or x-variables) and our label (the value we want to predict, or the y-variable).

We start from a full copy of the data table, so that complete rows remain available for `df_training` and `df_testing` after the split.  Columns not named in `feature_column_names` are omitted from `X_train` and `X_test` once the split is done.

In [None]:
# Split into separate Feature and Label DataFrames

# features - start with a copy of all columns; we filter down to
#     feature_column_names after the split.
feature_df = data_table_df.copy()

In [None]:
# ...and the label, based on label_column_name.
label_column_name_list = [ label_column_name ]
label_df = data_table_df[ label_column_name_list ]

Next, we split our features (predictors, or x-variables) and our label values (the value we want to predict, or the y-variable) into training and testing sets.  We'll use the `scikit-learn` `train_test_split()` function.

In [None]:
# configuration
percent_in_test = 0.25
desired_random_state = 0

# use `train_test_split` from scikit-learn.
X_train, X_test, y_train, y_test = train_test_split( feature_df, 
                                                     label_df,
                                                     test_size = percent_in_test,
                                                     random_state = desired_random_state )

# Filter X_train and X_test to just the features we want.
df_testing = pandas.DataFrame.copy( X_test )
df_training = pandas.DataFrame.copy( X_train )
X_test = X_test[ feature_column_names ]
X_train = X_train[ feature_column_names ]

# Convert to numpy arrays
y_train_values = y_train[ label_column_name ].values

#### OPTIONAL - free up memory

- Back to [Table of contents](#Table-of-contents)

In case you need or want to, here is how you free up memory now that you have your features and labels filtered.  If you are going to be working with different sets of features or different labels, you probably don't want to do this, because it will remove your data table and feature and label data frames from memory.

In [None]:
# First set variables that refer to DataFrames to None.
data_table_df = None
feature_df = None
label_df = None

# then tell Python to collect garbage.
gc.collect()

### Create training and testing sets manually

- Back to [Table of contents](#Table-of-contents)

If you have a purposive set of training and testing data you'd like to use, the code below shows how you can set all the variables the rest of this notebook needs manually.

In [None]:
# set df_testing and df_training if necessary.
df_training = my_training_data_frame
df_testing = my_testing_data_frame

# ...and the label, based on label_column_name.
label_column_name_list = [ label_column_name ]

# create pandas DataFrames of training and testing features (X) and label (y)
X_train = df_training[ feature_column_names ]
y_train = df_training[ label_column_name_list ]
X_test = df_testing[ feature_column_names ]
y_test = df_testing[ label_column_name_list ]

# Convert to numpy arrays as needed.

# y_train
y_train_values = y_train[ label_column_name ].values
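
One common reason to split manually is to train on older data and test on newer data. A minimal sketch, assuming a hypothetical `year` column and made-up records, of how `my_training_data_frame` and `my_testing_data_frame` could be produced:

```python
import pandas as pd

# made-up records; the "year" column is an assumption for this sketch.
toy_df = pd.DataFrame( {
    "year": [ 2012, 2012, 2013, 2013, 2014, 2014 ],
    "feature1": [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 ],
    "label1": [ 0, 1, 0, 1, 1, 0 ]
} )

# train on everything before the cutoff year, test on the rest.
cutoff_year = 2014
my_training_data_frame = toy_df[ toy_df[ "year" ] < cutoff_year ]
my_testing_data_frame = toy_df[ toy_df[ "year" ] >= cutoff_year ]

print( len( my_training_data_frame ), len( my_testing_data_frame ) )
```

Splitting on time keeps every training row strictly older than every testing row, which a random split does not guarantee.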

## Model Selection

- Back to [Table of contents](#Table-of-contents)

In [None]:
# Let's fit a model
from sklearn import linear_model

# the L1 penalty needs a solver that supports it, such as 'liblinear'.
model = linear_model.LogisticRegression( penalty = 'l1', C = 1e5, solver = 'liblinear' )

# use y_train_values - it wants a numpy array.
model.fit( X_train, y_train_values )

print(model)

## Model Understanding

- Back to [Table of contents](#Table-of-contents)

Look at the coefficients for each of the features in the model (an indication of the weight the machine learning algorithm assigned to each feature, similar to regression beta-weights/coefficients):

In [None]:
print( "The coefficients for each of the features are:" )
list( zip( feature_column_names, model.coef_[ 0 ] ) )

In [None]:
print( "The standardized coefficients for each of the features are:" )
std_coef = np.std( X_test, 0 ) * model.coef_[ 0 ]
list( zip( feature_column_names, std_coef ) )

# Model Evaluation

- Back to [Table of contents](#Table-of-contents)

## Predicted vs. Expected

- Back to [Table of contents](#Table-of-contents)

Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score between 0 and 1 (that can sometimes be interpreted as a probability), which lets you more finely rank all of the examples from *most likely* to *least likely* to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 (1/2) as positive (1), but there's no reason that has to be the cutoff. 

In [None]:
# generate scores (probability of the positive class) from our "predictors" using the model.
y_scores = model.predict_proba( X_test )[ :,1]

In [None]:
y_scores

Let's take a look at the distribution of scores and see if it makes sense to us. 

In [None]:
sns.distplot(y_scores, kde=False, rug=False)

Our distribution of scores is skewed, with the majority of scores on the lower end of the scale. We expect this because 79% of the training data is made up of people not returning to benefits, so we'd guess that a higher proportion of the examples in the test set will be negative (meaning they should have lower scores).

In [None]:
# add set of y scores to testing data, alongside actual data.
df_testing['y_score'] = y_scores

In [None]:
# display actual values alongside y scores.
df_testing[ [ label_column_name, 'y_score' ] ].head()

Tools like `sklearn` often have a default threshold of 0.5, but a good threshold is selected based on the data, model and the specific problem you are solving. As a trial run, let's set a threshold of 0.5. 

In [None]:
calc_threshold = lambda x, y : 0 if x < y else 1 
predicted = np.array( [ calc_threshold( score, 0.5 ) for score in y_scores ] )

# use a 1-D numpy array of the actual labels.
expected = y_test[ label_column_name ].values
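
Since the cutoff is a choice, it helps to see how the labels move as it changes. Here is a minimal vectorized sketch of the same rule as `calc_threshold` (a score at or above the threshold becomes 1), using made-up scores; the name `apply_threshold` is hypothetical.

```python
import numpy as np

toy_scores = np.array( [ 0.05, 0.30, 0.45, 0.50, 0.90 ] )

def apply_threshold( scores, threshold ):
    """Label a score 1 when it is at or above the threshold, else 0."""
    return ( scores >= threshold ).astype( int )

# a lower threshold labels more examples positive, a higher one fewer.
for toy_threshold in [ 0.25, 0.5, 0.75 ]:
    print( toy_threshold, apply_threshold( toy_scores, toy_threshold ) )
```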

## Confusion Matrix

- Back to [Table of contents](#Table-of-contents)

Once we have turned our scores into 0 or 1 for classification, we create a *confusion matrix*, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix( expected, predicted )
print( conf_matrix )

The count of true negatives is `conf_matrix[0,0]`, false negatives `conf_matrix[1,0]`, true positives `conf_matrix[1,1]`, and false positives `conf_matrix[0,1]`.
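
To make the cell layout concrete, here is a minimal sketch that counts the four cells by hand with numpy on made-up labels and arranges them the same way, with rows for actual labels and columns for predicted labels.

```python
import numpy as np

# made-up actual and predicted labels.
toy_expected = np.array( [ 0, 0, 0, 0, 1, 1, 1, 0 ] )
toy_predicted = np.array( [ 0, 0, 1, 0, 1, 0, 1, 1 ] )

# count each combination of actual and predicted label.
tn = int( np.sum( ( toy_expected == 0 ) & ( toy_predicted == 0 ) ) )
fp = int( np.sum( ( toy_expected == 0 ) & ( toy_predicted == 1 ) ) )
fn = int( np.sum( ( toy_expected == 1 ) & ( toy_predicted == 0 ) ) )
tp = int( np.sum( ( toy_expected == 1 ) & ( toy_predicted == 1 ) ) )

# rows are actual labels (0, 1); columns are predicted labels (0, 1).
toy_conf_matrix = np.array( [ [ tn, fp ],
                              [ fn, tp ] ] )
print( toy_conf_matrix )
```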

### Accuracy

- Back to [Table of contents](#Table-of-contents)

Accuracy is the ratio of the correct predictions (both positive and negative) to all predictions. 
$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$

In [None]:
# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score( expected, predicted )
print( "Accuracy = " + str( accuracy ) )

Example of interpreting accuracy score:

We get an accuracy score of XX%. Recall that our testing dataset had XX% of people staying off benefits and XX% returning to benefits. If we had just labeled all the examples as negative, guessing that no one goes back to benefits, we would have had an accuracy of XX%, so our basic model is not doing much better than a "dumb classifier." That's ok, because we're just getting started!
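
The "dumb classifier" comparison can be computed directly. A minimal sketch on made-up labels (8 negatives, 2 positives), where every example is assigned the most common label:

```python
import numpy as np

# made-up labels: 8 negatives and 2 positives.
toy_expected = np.array( [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1 ] )

# the "dumb classifier": predict the most common label for everyone.
values, counts = np.unique( toy_expected, return_counts = True )
majority_label = values[ np.argmax( counts ) ]
dumb_predicted = np.full( len( toy_expected ), majority_label )

# accuracy is the fraction of predictions that match the actual labels.
dumb_accuracy = np.mean( dumb_predicted == toy_expected )
print( "Majority-class accuracy = " + str( dumb_accuracy ) )
```

A model's accuracy is only informative once it is compared against this number.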

### Precision and Recall

- Back to [Table of contents](#Table-of-contents)

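Precision and recall are other ways you can look at the relationships between true and false positives and negatives. Precision is the share of examples predicted positive that really are positive, and recall is the share of actual positives that the model found.
$$ Precision = \frac{TP}{TP+FP} \qquad Recall = \frac{TP}{TP+FN} $$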

In [None]:
from sklearn.metrics import precision_score, recall_score
precision = precision_score( expected, predicted )
recall = recall_score( expected, predicted )
print( "Precision = " + str( precision ) )
print( "Recall = " + str( recall ) )
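
As a worked example with made-up confusion matrix counts (not results from this notebook), precision divides the true positives by everything predicted positive, and recall divides them by everything actually positive:

```python
# made-up confusion matrix counts.
toy_tp, toy_fp, toy_fn, toy_tn = 30, 10, 20, 140

# precision: of the 40 predicted positive, 30 are right.
toy_precision = toy_tp / float( toy_tp + toy_fp )

# recall: of the 50 actual positives, 30 were found.
toy_recall = toy_tp / float( toy_tp + toy_fn )

# accuracy: 170 of the 200 predictions are correct.
toy_accuracy = ( toy_tp + toy_tn ) / float( toy_tp + toy_tn + toy_fp + toy_fn )

print( "Precision = " + str( toy_precision ) )
print( "Recall = " + str( toy_recall ) )
print( "Accuracy = " + str( toy_accuracy ) )
```

Here accuracy looks strong while recall shows that 40% of the actual positives were missed, which is why the three numbers are read together.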

If we care about our whole precision-recall space, we can optimize for a metric known as the **area under the curve (AUC-PR)**, which is the area under the precision-recall curve. The maximum AUC-PR is 1. 

In [None]:
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:.2f}'.format(auc_val))
    plt.show()
    plt.clf()

In [None]:
plot_precision_recall(expected, y_scores)
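
To try the AUC-PR calculation without a fitted model, here is a minimal sketch on six made-up labels and scores, using the same `precision_recall_curve` and `auc` functions from scikit-learn as the plotting function above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# made-up labels and scores; higher scores mostly go with label 1.
toy_true = np.array( [ 0, 0, 1, 0, 1, 1 ] )
toy_scores = np.array( [ 0.1, 0.2, 0.3, 0.4, 0.8, 0.9 ] )

# one precision/recall pair per threshold, plus a final end point.
toy_precision, toy_recall, toy_thresholds = precision_recall_curve( toy_true, toy_scores )

# area under the curve, with recall on the x axis.
toy_auc_pr = auc( toy_recall, toy_precision )
print( "AUC-PR: {:.2f}".format( toy_auc_pr ) )
```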

## Precision and Recall at k percent

- Back to [Table of contents](#Table-of-contents)

If we only care about a specific part of the precision-recall curve, we can focus on more fine-grained metrics. For instance, say there is a special program for people likely to need assistance within the next year, but only *3000 or 1% of the people in our test set* can be admitted. In that case, we would want to prioritize the 1% who were *most likely* to need assistance within the next year, and it wouldn't matter too much how accurate we were on the 78% or so who weren't very likely to need assistance.

Let's say that, out of the approximately 300,000 people, we can intervene on 1% of them, or the "top" 3000 people in a year (where "top" means highest likelihood of needing assistance in the next year). We can then focus on optimizing our **precision at 1%**.

In [None]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: ls 
        ls of ground truth labels
    y_prob: ls
        ls of predic proba from model
    model_name: str
        str of model name (e.g, LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()

In [None]:
def precision_at_k(y_true, y_scores,k):
    
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)

In [None]:
plot_precision_recall_n(expected,y_scores, 'LR')

In [None]:
p_at_1 = precision_at_k(expected,y_scores, 0.01)
print('Precision at 1%: {:.2f}'.format(p_at_1))
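
An alternative way to think about precision at k is to rank the examples by score and look only at the top slice. The sketch below is not the notebook's `precision_at_k`; it is a hypothetical variant, `precision_at_top_k`, that selects a fixed number of examples, so tied scores cannot enlarge the selected set. The labels and scores are made up.

```python
import numpy as np

def precision_at_top_k( y_true, y_scores, k ):
    """Precision among the top k fraction of examples, ranked by score."""
    y_true = np.asarray( y_true )
    y_scores = np.asarray( y_scores )
    # at least one example is always selected.
    n_selected = max( 1, int( k * len( y_scores ) ) )
    # indices of the highest scores, best first.
    top_indices = np.argsort( y_scores )[ ::-1 ][ :n_selected ]
    return y_true[ top_indices ].mean()

# made-up labels, listed from highest score to lowest.
toy_true = np.array( [ 1, 0, 1, 1, 0, 0, 0, 0, 1, 0 ] )
toy_scores = np.array( [ 0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10 ] )

# the top 40% is the first four examples, three of which are positive.
toy_p_at_40 = precision_at_top_k( toy_true, toy_scores, 0.4 )
print( "Precision at 40%: {:.2f}".format( toy_p_at_40 ) )
```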

# Multiple Models

In [None]:
clfs = {'RF': RandomForestClassifier(n_estimators=50, n_jobs=-1),
       'ET': ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy'),
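        # the L1 penalty needs a solver that supports it, such as 'liblinear'.
        'LR': LogisticRegression(penalty='l1', C=1e5, solver='liblinear'),
        # note: scikit-learn 1.1 and later name this loss 'log_loss'.
        'SGD':SGDClassifier(loss='log'),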
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, random_state=17, n_estimators=10),
        'NB': GaussianNB()}

In [None]:
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB']

In [None]:
max_p_at_k = 0
for clfNM in sel_clfs:
    clf = clfs[clfNM]
    clf.fit( X_train, y_train_values )
    print( clf )
    y_score = clf.predict_proba(X_test)[:,1]
    predicted = np.array(y_score)
    expected = np.array(y_test).ravel()
    plot_precision_recall_n(expected,predicted, clfNM)
    p_at_1 = precision_at_k(expected,y_score, 0.01)
    if max_p_at_k < p_at_1:
        max_p_at_k = p_at_1
    print('Precision at 1%: {:.2f}'.format(p_at_1))

## Baseline 

- Back to [Table of contents](#Table-of-contents)

It is important to check our model against a reasonable **baseline** to know how well our model is doing. Without any context, 78% accuracy can sound really great... but it's not so great when you remember that you could do almost that well by declaring everyone will not need benefits in the next year, which would be a stupid (not to mention useless) model.

A good place to start is checking against a *random* baseline, assigning every example a label (positive or negative) completely at random. 

In [None]:
max_p_at_k

In [None]:
# make a set of random scores that is the same length as y_test
random_score = []
for i in range( 0, len( y_test ) ):
    random_score.append( random.uniform( 0,1 ) )

# calculate predicted values
random_predicted = np.array( [calc_threshold(score,0.5) for score in random_score] )

print( "Count of items in y_test (type = " + str( type( y_test ) ) + ") = " + str( len( y_test ) ) )
print( "Random score length: " + str( len( random_score ) ) )
print( "Random predicted length: " + str( len( random_predicted ) ) )

In [None]:
# calculate precision for random labels, thresholded at 0.5
random_p_at_5 = precision_at_k(expected,random_predicted, 0.01)
print( "Precision with random values at 0.5 threshold: " + str( random_p_at_5 ) )
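
A useful sanity check on the random baseline: when labels are assigned at random, precision should land close to the share of positives in the data. A minimal simulation sketch, assuming a made-up 20% positive rate and a fixed seed so the run is repeatable:

```python
import numpy as np

rng = np.random.RandomState( 0 )
n_examples = 100000

# made-up ground truth with roughly 20% positives.
toy_expected = ( rng.uniform( 0, 1, n_examples ) < 0.2 ).astype( int )

# random scores thresholded at 0.5, independent of the truth.
toy_random_predicted = ( rng.uniform( 0, 1, n_examples ) >= 0.5 ).astype( int )

# share of positives overall, and among those predicted positive.
base_rate = toy_expected.mean()
random_precision = toy_expected[ toy_random_predicted == 1 ].mean()

print( "Base rate: {:.3f}".format( base_rate ) )
print( "Random precision: {:.3f}".format( random_precision ) )
```

A model whose precision is not clearly above the base rate is doing no better than guessing.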

Another good practice is checking against an "expert" or rule of thumb baseline. For example, say that talking to people at the IDHS, you find that they think it's much more likely that someone who has been on assistance multiple times already will need assistance in the future. Then you should check that your classifier does better than just labeling everyone who has had multiple past admits as positive.

In [None]:
reenter_predicted = np.array([ 1 if n_spells > 3 else 0 for n_spells in df_testing.n_spells.values ])
reenter_p_at_1 = precision_at_k(expected,reenter_predicted,0.01)

In [None]:
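# note: with all scores tied at 0, precision_at_k() labels every example
#     positive, so this baseline equals the share of positives in the test set.
all_non_reenter = np.array([0 for n_spells in df_testing.n_spells.values])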
all_non_reenter_p_at_1 = precision_at_k(expected, all_non_reenter,0.01)

In [None]:
sns.set_style("white")
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
fig, ax = plt.subplots(1, figsize=(22,12))
sns.barplot(x=['Random','All no need', 'More than 3 Spells','Model'],
            y=[random_p_at_5, all_non_reenter_p_at_1, reenter_p_at_1, max_p_at_k],
            palette=['#6F777D','#6F777D','#6F777D','#800000'])
sns.despine()
plt.ylim(0,1)
plt.ylabel('precision at 1%')