# How to train a model and compute predictions

In today's class we'll close the loop started in the previous class: we'll train a model on the training set, and then use it to compute the probability that an applicant won't be able to repay a loan (a value of 1 means we are certain that the applicant won't repay).

In [None]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd 

# File system management
import os

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## Read in Data 

First, we read the training and test sets. 

In [None]:
# Training data
app_train = pd.read_csv('./input/application_train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

In [None]:
# Testing data features
app_test = pd.read_csv('./input/application_test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()

# Baseline

For a naive baseline, we could guess the same value for all examples in the test set. We are asked to predict the probability of not repaying the loan, so if we are entirely unsure, we would guess 0.5 for all observations in the test set. This will get us an area under the Receiver Operating Characteristic curve (ROC AUC) of 0.5 in the competition ([random guessing on a classification task will score a 0.5](https://stats.stackexchange.com/questions/266387/can-auc-roc-be-between-0-0-5)).
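As a quick sanity check of the claim above, here is a minimal sketch on synthetic labels (the labels are made up for illustration and are not the competition data). When every observation receives the same score, no ranking is possible and the ROC AUC is 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic binary labels (an assumption, used only for illustration).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)

# Naive baseline: the same probability for every observation.
constant_pred = np.full(y_true.shape, 0.5)

baseline_auc = roc_auc_score(y_true, constant_pred)
print(baseline_auc)
```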

Since we already know what score we are going to get, we don't really need to make a naive baseline guess. Let's use a slightly more sophisticated model for our actual baseline: Logistic Regression.

## Logistic Regression Implementation

To get a baseline, we will use all of the features after encoding the categorical variables. We will preprocess the data by filling in the missing values (imputation) and normalizing the range of the features (feature scaling). The cells below first encode the categorical variables and align the training and test sets; imputation and scaling then follow under "Data preprocessing".

In [None]:
# Helper functions: label encoding (features with <= 2 values), one-hot encoding (> 2 values), and train/test alignment.

from sklearn.preprocessing import LabelEncoder

def label_encode(app_train, app_test) : 
    le = LabelEncoder()
    le_count = 0

    # Iterate through the columns
    for col in app_train:
        if app_train[col].dtype == 'object':
            # If 2 or fewer unique categories
            set_values = app_train[col].unique()
            num_values = len(list(set_values))
            if num_values <= 2:
                print(f"{col} will be label encoded! Found {num_values} values: {set_values}")
                # Train on the training data
                le.fit(app_train[col])
                # Transform both training and testing data
                app_train[col] = le.transform(app_train[col])
                app_test[col] = le.transform(app_test[col])

                # Keep track of how many columns were label encoded
                le_count += 1

    print('%d columns were label encoded.' % le_count)
    print('Training Features shape: ', app_train.shape)
    print('Testing Features shape: ', app_test.shape)
    
    return app_train, app_test


def one_hot_encode(app_train, app_test) :
    
    # Let's perform the one-hot encoding of categorical features with > 2 values...
    app_train = pd.get_dummies(app_train)
    app_test = pd.get_dummies(app_test)
    
    return app_train, app_test


def align_train_test(app_train, app_test) :
    
    # Save target variable in a separate Series...
    train_labels = app_train['TARGET']

    # Align the training and testing data on columns -- this keeps only the columns present in both dataframes.
    app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

    # Add the target column back in.
    app_train['TARGET'] = train_labels
    
    return train_labels, app_train, app_test

In [None]:
# Let's perform the label encoding of categorical features with <= 2 values...
app_train, app_test = label_encode(app_train, app_test)

# Let's perform the one-hot encoding of categorical features with > 2 values...
app_train, app_test = one_hot_encode(app_train, app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
print('Training Features columns: ', app_train.columns.values)

In [None]:
train_labels, app_train, app_test = align_train_test(app_train, app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
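To see why the alignment step is needed, consider the following minimal sketch on two toy dataframes (the column names `AMT` and `TYPE` are hypothetical). If a category appears only in the training set, one-hot encoding yields a column that the test set lacks; the inner join on the columns keeps only the shared ones, which is also why `TARGET` must be saved beforehand.

```python
import pandas as pd

# Toy data: category 'c' appears only in the training set.
toy_train = pd.DataFrame({'AMT': [1.0, 2.0, 3.0],
                          'TYPE': ['a', 'b', 'c'],
                          'TARGET': [0, 1, 0]})
toy_test = pd.DataFrame({'AMT': [4.0, 5.0],
                         'TYPE': ['a', 'b']})

toy_train_oh = pd.get_dummies(toy_train)
toy_test_oh = pd.get_dummies(toy_test)
print(sorted(set(toy_train_oh.columns) - set(toy_test_oh.columns)))

# Save the labels, then keep only the columns present in both dataframes.
toy_labels = toy_train_oh['TARGET']
toy_train_al, toy_test_al = toy_train_oh.align(toy_test_oh, join='inner', axis=1)
print(list(toy_train_al.columns))
```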

#### Data preprocessing

In [None]:
# Prepare the training and test sets...

# Drop the target from the training data
train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
    
# Copy of the testing data
test = app_test.copy()

display(train)

In [None]:
display(train_labels)

In [None]:
display(test)

We then proceed to impute...

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer

...and normalize the data.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Median imputation of missing values
# NOTE: different strategies can be used. Another one is to replace NaNs with some fixed value.
imputer = SimpleImputer(strategy = 'median')

# Scale each feature such that its values fall within the interval [0,1].
scaler = MinMaxScaler(feature_range = (0, 1))

In [None]:
# Fit imputer ON the ***training data*** (finds the median for each column...)
imputer.fit(train)

# Transform both train and test sets according to the medians found before.
display(train)
train = imputer.transform(train)
test = imputer.transform(test)
display(train)

In [None]:
# Fit scaler on the training data
scaler.fit(train)

# Scale the columns within the training and test sets.
display(train)
train = scaler.transform(train)
test = scaler.transform(test)
display(train)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
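The fit-on-train, transform-both pattern used above can be checked by hand on a tiny example. The sketch below uses a single made-up feature: the median and the min/max are learned from the training column only, so a test value larger than the training maximum ends up outside [0, 1].

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# One toy feature with missing values (made-up numbers).
toy_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
toy_test = np.array([[np.nan], [7.0]])

# The median (3.0) is computed on the training data only.
toy_imputer = SimpleImputer(strategy='median').fit(toy_train)
train_imp = toy_imputer.transform(toy_train)
test_imp = toy_imputer.transform(toy_test)

# The min (1.0) and max (5.0) are also taken from the training data only.
toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_imp)
train_scaled = toy_scaler.transform(train_imp)
test_scaled = toy_scaler.transform(test_imp)

print(train_scaled.ravel())  # [0.  0.5 0.5 1. ]
print(test_scaled.ravel())   # [0.5 1.5]
```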

We'll use [`LogisticRegression` from Scikit-Learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) as our first model. The main change we will make from the default model settings is to lower the [regularization parameter](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) `C`, which is the inverse of the regularization strength: a lower value means stronger regularization, which should decrease overfitting. This will get us slightly better results than the default `LogisticRegression`, but it still will set a low bar for any future models.
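To get a feel for what `C` does, here is a minimal sketch on synthetic data generated with `make_classification` (not the loan dataset). With the default L2 penalty, a smaller `C` shrinks the learned coefficients towards zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption, used only for illustration).
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in (0.0001, 1.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_toy, y_toy)
    # The Euclidean norm of the coefficients measures how "free" the model is.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C}: coefficient norm = {coef_norms[C]:.4f}")
```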

Here we use the familiar Scikit-Learn modeling syntax. We **(1)** first **instantiate the model**, then **(2)** we **train the model using `.fit`** and finally **(3) compute predictions** on the testing data using `.predict_proba` -- remember that we want probabilities and not a 0 or 1.

In [None]:
# Step 1 -- Instantiate an object representing the model. The constructor can accept several parameters.

# Use of Logistic Regression...
from sklearn.linear_model import LogisticRegression
# Note: here we're specifying the regularization parameter C
log_reg = LogisticRegression(C = 0.0001, class_weight = 'balanced')


# Use of SVC...
# from sklearn.svm import SVC
# log_reg = SVC(C = 0.0001, class_weight = 'balanced', max_iter = 10, probability=True)

In [None]:
# Step 2 -- Train the model on the training data
log_reg.fit(train, train_labels)

Now that the model has been trained, we can use it to make predictions. We want to predict the probabilities of not repaying a loan, so we use the model's `predict_proba` method. This returns an m x 2 array where m is the number of observations. The first column is the probability of the target being 0 and the second column is the probability of the target being 1 (so for a single row, the two columns must sum to 1). We want the probability the loan won't be repaid, so we will select the second column.

The following code makes the predictions and selects the correct column.

In [None]:
# Step 3 -- Compute predictions
# NOTE: Make sure to select the second column only (i.e., the probability that an applicant won't be able to repay the loan)
log_reg_pred = log_reg.predict_proba(test)[:, 1]
print(log_reg_pred)

Each of the above predictions represents a probability between 0 and 1 that the associated loan will not be repaid. If we were using these predictions to classify applicants, we could set a probability threshold for determining that a loan is risky.
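The shape of the `predict_proba` output and the idea of a threshold can be illustrated with a small sketch on synthetic data. The threshold of 0.5 below is a hypothetical choice, not one prescribed by the challenge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption, used only for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = LogisticRegression().fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)
print(toy_model.classes_)  # column order of predict_proba
print(proba.shape)         # one row per observation, one column per class

# Second column: probability of class 1.
p_positive = proba[:, 1]

# Hypothetical threshold used to turn probabilities into a yes/no decision.
threshold = 0.5
flagged = p_positive >= threshold
print(flagged.sum(), 'observations flagged as risky')
```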

Once we compute the probabilities, we want them to be stored in a dataframe with the appropriate format. We use the format required by the challenge (as shown in the CSV example `sample_submission.csv`), where there are only two columns: `SK_ID_CURR` and `TARGET`. We will create a dataframe in this format from the test set and the predictions called `submit`. 

In [None]:
# Final result dataframe
submit = app_test[['SK_ID_CURR']].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
submit['TARGET'] = log_reg_pred

submit.head()

In [None]:
# Save the submission to a csv file
submit.to_csv('log_reg_baseline.csv', index = False)

Let's try to submit the file to Kaggle and see which score we get...

__The logistic regression baseline should score around 0.671 when submitted.__

**Just for fun**: let's see how our classifier performs on the training set...

In [None]:
log_reg_pred = log_reg.predict_proba(train)[:, 1]
print(log_reg_pred)

In [None]:
from sklearn.metrics import roc_auc_score
# ROC AUC computed on the training set (not on the test set).
train_auc = roc_auc_score(y_true = train_labels, 
                          y_score = log_reg_pred)
train_auc

**Question**: comparing the score obtained on the training set with the one achieved on the test set, do we have underfitting or overfitting?

**Exercise**: try different C values and explore the other parameters. 
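One possible way to approach the question and the exercise, sketched on synthetic imbalanced data (not the loan dataset): hold out part of the training data as a validation split and compare the training and validation ROC AUC for a few values of `C`. The split size and the values of `C` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives.
X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                   weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

scores = {}
for C in (0.0001, 0.01, 1.0):
    model = LogisticRegression(C=C, class_weight='balanced', max_iter=1000)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    scores[C] = (train_auc, val_auc)
    print(f"C = {C}: train AUC = {train_auc:.3f}, validation AUC = {val_auc:.3f}")
```

A large gap between the two scores hints at overfitting, while two similarly low scores hint at underfitting.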

# Using a different classifier: SV classifier

To actually use scikit-learn's SV classifier **we don't need to rewrite much code**... in fact, by commenting/uncommenting a few lines of code in the above cells, we'll be able to use a different classifier! Indeed, in scikit-learn all classifiers follow a protocol imposed by a common interface.

Note: technically, in scikit-learn each classifier is a class derived from a set of superclasses that define said interface. Thus, when using any scikit-learn classifier we can expect it to offer a set of common methods. Among these, we're especially interested in **fit** and **predict_proba**.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
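The common interface can be seen in a short sketch: the same loop trains and queries different classifiers without any model-specific code. The data is synthetic and the choice of classifiers is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption, used only for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    'logistic regression': LogisticRegression(),
    'k-NN': KNeighborsClassifier(n_neighbors=5),
    'naive Bayes': GaussianNB(),
}

proba_shapes = {}
for name, model in candidates.items():
    model.fit(X_toy, y_toy)                # same training call for every model
    proba = model.predict_proba(X_toy)     # same prediction call for every model
    proba_shapes[name] = proba.shape
    print(name, proba.shape)
```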

Let's try to submit the file to Kaggle and see which score we get...

__The SVC baseline should score around 0.60 when submitted.__

- **Exercise 1**: what happens if we use a different number of iterations? Or different C values? 
- **Exercise 2**: try to use the k-NN classifier (or any other suitable classifier of choice) and see how it goes.

## Streamlining of the steps behind our classification task...use of scikit pipelines!

If we pay attention to what we've done before, we can see that going from the initial train data to the actual predictions on the test set takes several steps:

- Read the training and test sets
- Preprocess the training and test set: categorical encoding, imputing, rescaling, feature selection, ...
- Fit some model on training set
- Computing predictions on test set via the previously trained model

What if we tidy up the code we wrote? Let's use a **pipeline of transforms with a final estimator.**

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

**Underlying idea**: sequentially apply a list of transforms and a final estimator.

Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement the fit and transform methods. 
The final estimator only needs to implement fit.

Going back to the operations we've seen before, which falls in which category?
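To make the two categories concrete, the sketch below (on synthetic data with artificially injected missing values) chains the steps by hand and then through a `Pipeline`, and checks that both routes give the same probabilities. It assumes the same imputer, scaler, and classifier used earlier, with default settings for the classifier.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with about 10% missing entries.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan
X_tr, X_te, y_tr = X_toy[:150], X_toy[150:], y_toy[:150]

# Manual route: each transform is fitted on the output of the previous one.
imp = SimpleImputer(strategy='median').fit(X_tr)
sca = MinMaxScaler().fit(imp.transform(X_tr))
est = LogisticRegression().fit(sca.transform(imp.transform(X_tr)), y_tr)
manual = est.predict_proba(sca.transform(imp.transform(X_te)))[:, 1]

# Pipeline route: the same steps, chained automatically.
pipe = Pipeline([('imp', SimpleImputer(strategy='median')),
                 ('sca', MinMaxScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_tr, y_tr)
piped = pipe.predict_proba(X_te)[:, 1]

same = np.allclose(manual, piped)
print('Same predictions:', same)
```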

In [None]:
# Read training and test sets
app_train = pd.read_csv('./input/application_train.csv')
app_test = pd.read_csv('./input/application_test.csv')


# Label encoding...
app_train, app_test = label_encode(app_train, app_test)

# One-hot encoding...
app_train, app_test = one_hot_encode(app_train, app_test)

# Let's align train and test data...
train_labels, app_train, app_test = align_train_test(app_train, app_test)
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

    
# Copy training and test data
train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
test = app_test.copy()

display(train)
display(train_labels)
display(test)

In [None]:
# Median imputation of missing values
# imputer = SimpleImputer(strategy = 'median')

# Scale each feature such that it falls within [0,1].
# scaler = MinMaxScaler(feature_range = (0, 1))

# from sklearn.linear_model import LogisticRegression
# here we're specifying the regularization parameter
log_reg = LogisticRegression(C = 0.0001)
# log_reg = SVC(C = 0.0001, class_weight = 'balanced', max_iter = 10, probability=True)


from sklearn.pipeline import Pipeline
clf = Pipeline([('imp', SimpleImputer(strategy = 'median')),
                ('sca', MinMaxScaler(feature_range = (0, 1))),
                ('clf', log_reg)],
                verbose = True)

In [None]:
# Here we execute the imp and sca transforms, and finally fit the clf estimator.
clf.fit(train, train_labels)

**Question**: why don't we include label and one-hot encoding in the pipeline? Check the scikit-learn documentation to see the limitations that come with having these functionalities within the pipeline...

In [None]:
# Finally, we compute the predictions over the test set.
log_reg_pred = clf.predict_proba(test)[:, 1]
print(log_reg_pred)

In [None]:
# Final result dataframe
submit = app_test[['SK_ID_CURR']].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
submit['TARGET'] = log_reg_pred

submit.head()

# Save the submission to a csv file
submit.to_csv('log_reg_pipe.csv', index = False)

**Exercise**: try to use the SVC and k-NN classifiers, and see how they perform...

# Tuning the hyper-parameters: grid search!

We usually want to try out some model with different combinations of parameters, to find out the combination that yields the best results.

### Step 1: read and prepare the training and test sets

In [None]:
# Read training and test sets
app_train = pd.read_csv('./input/application_train.csv')
app_test = pd.read_csv('./input/application_test.csv')


# Label encoding...
app_train, app_test = label_encode(app_train, app_test)


# One-hot encoding...
app_train, app_test = one_hot_encode(app_train, app_test)


# Let's align train and test data...
train_labels, app_train, app_test = align_train_test(app_train, app_test)
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

    
# Copy training and test data
train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
test = app_test.copy()

display(train)
display(train_labels)
display(test)

### Step 2: Set up the pipeline

Note that this time we're not passing any parameter to the classifier's constructor...!

In [None]:
# Pick the classifier of your choice...
log_reg = LogisticRegression()
# log_reg = SVC(probability = True)

from sklearn.pipeline import Pipeline
clf = Pipeline([('imp', SimpleImputer(strategy = 'median')),
                ('sca', MinMaxScaler(feature_range = (0, 1))),
                ('clf', log_reg)],
                verbose = True)

# Set up the grid of hyperparameters we want to explore...
param_grid =\
{
    'clf__C': [0.0001, 0.0005]
}
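The key `clf__C` follows the `<step name>__<parameter name>` convention that scikit-learn uses for nested parameters. A small sketch, rebuilding an equivalent pipeline under the hypothetical name `toy_pipe`, shows how to list the tunable names and set one of them by hand:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipe = Pipeline([('imp', SimpleImputer(strategy='median')),
                     ('sca', MinMaxScaler(feature_range=(0, 1))),
                     ('clf', LogisticRegression())])

# Every parameter of every step is exposed as '<step>__<parameter>'.
tunable = sorted(toy_pipe.get_params().keys())
print([name for name in tunable if name.startswith('clf__')])

# The grid search sets the parameters in the same way.
toy_pipe.set_params(clf__C=0.0005)
print(toy_pipe.named_steps['clf'].C)
```

The same convention lets the grid include preprocessing choices, for example `'imp__strategy': ['mean', 'median']`.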

### Step 3: Set up the grid search

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Now it's time to try out our model with different combinations of parameters. The goal is to find out the combination yielding the best results.

Here there are several things we need to keep in mind:

- in order to evaluate the performance of different configurations, we need to use the notion of **cross validation**. Thus, we need to reserve part of the training set to _"emulate"_ the test set.
- the type of cross validation should take into consideration that our problem is imbalanced: **stratified k-fold** cross validation.
- we need to use the evaluation metric considered by the challenge (ROC AUC).
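The effect of stratification can be verified on a toy label vector with 10% positives (made-up data): each validation fold keeps the same class ratio as the whole set. According to the scikit-learn documentation, passing an integer as `cv` to `GridSearchCV` with a classifier and a binary target uses `StratifiedKFold` under the hood, so an explicit splitter is only needed for extra control such as shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 10 positives out of 100.
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.zeros((len(y_toy), 1))  # the features are irrelevant for the split

skf = StratifiedKFold(n_splits=5)
positives_per_fold = [int(y_toy[val_idx].sum())
                      for _, val_idx in skf.split(X_toy, y_toy)]
print(positives_per_fold)  # [2, 2, 2, 2, 2]
```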


Let's first set up the grid search...

In [None]:
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(estimator = clf, 
                      param_grid = param_grid, 
                      cv = 5,
                      scoring = 'roc_auc',
                      n_jobs = 1,
                      verbose = 3)

search

### Step 4: tuning the hyperparameters

Let's find out the best combination of hyperparameters via the grid search...

In [None]:
search.fit(train, train_labels)

Let's inspect the results we got from the grid search...

In [None]:
print('Miscellanea of results:', search.cv_results_)
print()
print('Score achieved by the best config. during stratified CV:', search.best_score_)
print()
print('Best estimator config:', search.best_estimator_)
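The raw `cv_results_` dictionary is hard to read when printed. A common way to inspect it is to load it into a dataframe, as in the sketch below. It runs a small grid search on synthetic data, under the hypothetical names `toy_pipe` and `toy_search`, so that it stays self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption, used only for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_pipe = Pipeline([('sca', MinMaxScaler()),
                     ('clf', LogisticRegression())])
toy_search = GridSearchCV(toy_pipe, {'clf__C': [0.01, 1.0]},
                          cv=5, scoring='roc_auc')
toy_search.fit(X_toy, y_toy)

# One row per configuration, with mean and standard deviation over the folds.
results = pd.DataFrame(toy_search.cv_results_)
summary = results[['param_clf__C', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
print(toy_search.best_params_)
```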

### Step 5: compute the predictions over the test set and see the final result

The object _search_ will compute the predictions, via the method _predict_proba_, **according to the best configuration found during the fit phase**.

In [None]:
# Finally, we compute the predictions over the test set with the best configuration of hyperparameters found before.
log_reg_pred = search.predict_proba(test)[:, 1]
print(log_reg_pred)

# Final result dataframe
submit = app_test[['SK_ID_CURR']].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
submit['TARGET'] = log_reg_pred
submit.head()

# Save the submission to a csv file
submit.to_csv('log_reg_grid.csv', index = False)

Once we submit the CSV, we should get around **0.711**. This is close to what we observe over the validation sets used during the grid search, and represents a noticeable improvement over the previous result (which was around 0.671). Good!

In general, there's a trade-off between the time and computational resources we want to spend on tuning and the predictive performance we want to achieve.

- **Exercise 1**: experiment with the code above using the SV classifier.


- **Exercise 2**: during the grid search you can go as far as experimenting with different models. E.g., https://stackoverflow.com/questions/50265993/alternate-different-models-in-pipeline-for-gridsearchcv


- **Exercise 3**: there are other types of searches, e.g., random search and Bayesian search. They're typically much faster than an exhaustive grid search, although they may return less accurate estimators. See the links below and try them in place of GridSearchCV.

    https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
    
    https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html
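As a starting point for Exercise 3, here is a minimal sketch of a random search on synthetic data. Instead of a fixed list, `C` is sampled from a log-uniform distribution; the range, the number of iterations, and the names `toy_pipe` and `random_search` are arbitrary choices made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption, used only for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_pipe = Pipeline([('sca', MinMaxScaler()),
                     ('clf', LogisticRegression(max_iter=1000))])

# Sample 5 values of C between 1e-4 and 10 instead of trying a full grid.
random_search = RandomizedSearchCV(
    toy_pipe,
    param_distributions={'clf__C': loguniform(1e-4, 1e1)},
    n_iter=5, cv=5, scoring='roc_auc', random_state=0)
random_search.fit(X_toy, y_toy)

best_C = random_search.best_params_['clf__C']
print('Best C:', best_C)
print('Best CV score:', random_search.best_score_)
```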