## Random forest
Random forest classifiers are similar to decision trees in that they use hierarchical structures to split the dataset based on features. However, unlike a single decision tree, these classifiers combine multiple decision trees (a "forest") in the classification process using a method called *bagging* (short for bootstrap aggregating). Random forest is called an *ensemble* method because the final prediction is made by combining multiple classifiers.

The random forest algorithm consists of four general steps:
* Select random samples from a given dataset - *bootstrapping*.
* Construct a decision tree for each sample and get a prediction result from each decision tree.
* Perform a vote for each predicted result.
* Select the prediction result with the most votes as the final prediction - *aggregating*.
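
The four steps above can be sketched by hand with plain decision trees. This is only a minimal illustration, not how scikit-learn implements `RandomForestClassifier` internally: the synthetic dataset from `make_classification`, the choice of 25 trees, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption made only for this sketch)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Step 1 (bootstrapping): sample rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: fit one tree per bootstrap sample, testing a random
    # subset of features at each split
    tree_i = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree_i.fit(X_tr[idx], y_tr[idx])
    trees.append(tree_i)

# Step 3 (voting): every tree predicts every test sample
votes = np.array([t.predict(X_te) for t in trees])  # (n_trees, n_test)
# Step 4 (aggregating): the majority class wins (labels are 0/1)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()
print(f"accuracy of the hand-built forest: {bagged_acc:.3f}")
```

In practice we would let `RandomForestClassifier` do all of this for us, as shown later in this notebook.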

<img width="500px" src="img/random_forest_voting.png" />

**Advantages**
* Random forests are considered a highly accurate and robust method because of the number of decision trees participating in the process.
* They are much less prone to overfitting than a single decision tree. The main reason is that averaging (or voting over) the predictions of many trees reduces the variance of the model.
* The algorithm can be used in both classification and regression problems.
* Random forests can also handle missing values. There are two ways to handle these: using median values to replace continuous variables, and computing the proximity-weighted average of missing values.
* You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.
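
The resistance to overfitting listed above comes from averaging. One way to make this concrete, as a sketch: if each of the $B$ trees $T_b$ has prediction variance $\sigma^2$ and any two trees have correlation $\rho \geq 0$ (these symbols are introduced here only for illustration), the variance of their average is

$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$

The second term shrinks as more trees are added, while testing only a random subset of features at each split lowers $\rho$ and so shrinks the first term.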

**Disadvantages**
* Random forests are slow at generating predictions compared to a single decision tree. For every prediction, all the trees in the forest have to make a prediction for the same input and then vote on it, so the cost grows with the number of trees.
* The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.



## Implementing random forest
Like decision trees, building and fitting a random forest classifier is a straightforward task in scikit-learn. First, we define a random forest classifier variable, and, second, we train the classifier by calling the `fit` method.

Random forest has many hyperparameters, including:
* `n_estimators` = number of trees in the forest
* `criterion` = the criterion used to choose a split at each node (e.g. `gini` or `entropy` for classification)
* `max_depth` = maximum depth of each tree, i.e. the length of the longest path from the root to a leaf
* `min_samples_split` = minimum number of samples a node must contain before it can be split
* `max_leaf_nodes` = maximum number of leaf nodes
* `max_features` = maximum number of randomly chosen features to consider at each split
* `max_samples` = size of bootstrapped dataset for each tree
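
To see what these hyperparameters constrain, we can fit a small forest and inspect the individual trees afterwards. This is a sketch on synthetic data; the dataset, the particular values, and the variable names are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit
X_hp, y_hp = make_classification(n_samples=300, n_features=8, random_state=0)

rf_hp = RandomForestClassifier(
    n_estimators=20,      # 20 trees in the forest
    max_depth=3,          # no tree deeper than 3 levels
    max_leaf_nodes=6,     # no tree with more than 6 leaves
    min_samples_split=4,  # a node needs at least 4 samples to be split
    max_features=2,       # 2 random features tested at each split
    max_samples=0.5,      # each bootstrap sample holds 50% of the rows
    random_state=0,
)
rf_hp.fit(X_hp, y_hp)

# The fitted trees are stored in `estimators_`
n_fitted_trees = len(rf_hp.estimators_)
deepest_tree = max(t.get_depth() for t in rf_hp.estimators_)
most_leaves = max(t.get_n_leaves() for t in rf_hp.estimators_)
print(n_fitted_trees, deepest_tree, most_leaves)
```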

In [None]:
import pandas as pds
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pydotplus
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
        roc_auc_score, auc, precision_recall_curve, roc_curve
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import tree
from IPython.display import Image 

import random
## set seeds for randomization (scikit-learn draws from NumPy's generator unless `random_state` is passed)
random.seed(42)
np.random.seed(42)

In [None]:
## build and fit random forest classifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
rfc.fit(X_train, y_train)

## Evaluating random forest
We can evaluate our random forest classifier by calculating the accuracy, recall, precision, and F1 scores.
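
As a reminder of what these four scores measure, the sketch below computes them by hand from the confusion-matrix counts and checks them against scikit-learn. The ten toy labels and all variable names are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels, made up for this example
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn_toy, fp_toy, fn_toy, tp_toy = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# accuracy: fraction of all predictions that are correct
acc_toy = (tp_toy + tn_toy) / (tp_toy + tn_toy + fp_toy + fn_toy)
# precision: fraction of predicted positives that are truly positive
prec_toy = tp_toy / (tp_toy + fp_toy)
# recall: fraction of true positives that were found
rec_toy = tp_toy / (tp_toy + fn_toy)
# F1: harmonic mean of precision and recall
f1_toy = 2 * prec_toy * rec_toy / (prec_toy + rec_toy)

print(acc_toy, prec_toy, rec_toy, f1_toy)
```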

In [None]:
y_pred_forest = rfc.predict(X_test)
y_proba_forest = rfc.predict_proba(X_test)[:, 1]  # probability of the positive class
accuracy_score(y_test, y_pred_forest)

In [None]:
recall_score(y_test, y_pred_forest)

In [None]:
precision_score(y_test, y_pred_forest)

In [None]:
f1_score(y_test, y_pred_forest)

As before, we can display the `confusion_matrix` of our classifier.

In [None]:
## get values for confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_forest).ravel()
print((tn, fp, fn, tp))

In [None]:
## show confusion matrix for random forest
show_confusion_matrix(y_test, y_pred_forest)

In [None]:
def plot_static_roc_curve(fpr, tpr):
    plt.figure(figsize=[5,5])
    plt.fill_between(fpr, tpr, alpha=.5, color='darkorange')
    plt.plot(fpr, tpr, color='darkorange', lw=2)
    # Add dashed diagonal (slope of 1), the ROC curve of a random classifier
    plt.plot([0,1], [0,1], linestyle=(0, (5, 5)), linewidth=2)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC curve");
    
def plot_static_pr_curve(recall, precision):
    plt.figure(figsize=[5,5])
    plt.fill_between(recall, precision, alpha=.5, color='darkorange')
    plt.plot(recall, precision, color='darkorange', lw=2)
    # Add dashed line with a slope of -1 (from (0,1) to (1,0))
    plt.plot([1,0], [0,1], linestyle=(0, (5, 5)), linewidth=2)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-recall curve");

In [None]:
roc_auc_score(y_test, y_proba_forest)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba_forest)
plot_static_roc_curve(fpr,tpr)

In [None]:
precision, recall, thresholds = precision_recall_curve(y_test, y_proba_forest)
auc(recall, precision)

In [None]:
plot_static_pr_curve(recall,precision)

## Hyperparameter tuning

Cross-validation is key to choosing the best possible hyperparameters. This involves splitting the training set into $k$ subsets (folds), where one fold is used as a validation set and the remaining $k-1$ folds are used for training. This is repeated $k$ times so that each fold serves as the validation set exactly once, and the average of the metrics across the $k$ runs is used to assess the model with the given hyperparameters.
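
The procedure just described can be written out as an explicit loop. This is a sketch on synthetic data with $k=4$; the dataset and the variable names are assumptions, and scikit-learn's search utilities run an equivalent loop for us.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Synthetic stand-in for a training set
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=42)

kf = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X_cv):
    # Train on k-1 folds ...
    model_cv = RandomForestClassifier(n_estimators=50, max_depth=4,
                                      random_state=42)
    model_cv.fit(X_cv[train_idx], y_cv[train_idx])
    # ... and validate on the fold that was held out
    fold_pred = model_cv.predict(X_cv[val_idx])
    fold_scores.append(f1_score(y_cv[val_idx], fold_pred))

# The average over the k folds scores this set of hyperparameters
mean_f1 = float(np.mean(fold_scores))
print(fold_scores, mean_f1)
```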

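To further this idea, we can use cross-validation in concert with a *grid search*, which fits the model for every combination of hyperparameters drawn from lists of candidate values. The cross-validated metrics are averaged for each combination, and the combination with the best average score is returned as the best model. Below we use scikit-learn's `RandomizedSearchCV`, a close relative of grid search that evaluates at most `n_iter` randomly sampled combinations from the grid.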
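
For comparison, an exhaustive grid search looks like the sketch below, which uses scikit-learn's `GridSearchCV` on synthetic data. The tiny 2 x 2 grid, the dataset, and the variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a training set
X_gs, y_gs = make_classification(n_samples=200, n_features=8, random_state=42)

# Every combination of these values is tried: 2 x 2 = 4 candidates
small_grid = {"n_estimators": [20, 40], "max_depth": [2, 4]}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid=small_grid, cv=3, scoring="f1")
grid_search.fit(X_gs, y_gs)

n_candidates = len(grid_search.cv_results_["params"])
print(n_candidates, grid_search.best_params_, grid_search.best_score_)
```

Exhaustive search becomes expensive as the grid grows, which is one reason to sample combinations at random instead.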

In [None]:
# Number of trees to be used
rfc_n_estimators = [int(x) for x in np.linspace(100, 500, 5)]
# Maximum depth of each tree
rfc_max_depth = [int(x) for x in np.linspace(2, 10, 5)]

rfc_grid = {'n_estimators': rfc_n_estimators,
            'max_depth': rfc_max_depth}

# Create the model to be tuned
rfc_base = RandomForestClassifier(random_state=42)

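# Create the randomized search over the Random Forest grid
# (the grid holds only 5 x 5 = 25 combinations, so n_iter = 25 covers all of them)
rfc_random = RandomizedSearchCV(estimator = rfc_base, param_distributions = rfc_grid, 
                                n_iter = 25, cv = 4, scoring='f1',
                                random_state = 42, n_jobs = -1)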

# Fit the random search model
rfc_random.fit(X_train, y_train)

In [None]:
# Get the optimal parameters
rfc_random.best_params_

In [None]:
y_pred_best = rfc_random.predict(X_test)
accuracy_score(y_test, y_pred_best)

In [None]:
f1_score(y_test, y_pred_best)

## Feature ranking
In addition to evaluating the random forest classifier, it is sometimes helpful to see how important each feature was in arriving at the final predictions. If we notice that a feature is of little importance, we can eliminate it from our training dataset to make training and prediction more efficient.

After fitting, a scikit-learn random forest classifier exposes an attribute named `feature_importances_`.
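
To get a feel for what this attribute contains, the sketch below fits a forest on synthetic data and compares the forest-level importances with those of the individual trees. The dataset and variable names are assumptions, and the "mean of the per-tree importances" relationship is what we expect from current scikit-learn versions rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 6 features, 3 of them informative
X_fi, y_fi = make_classification(n_samples=300, n_features=6,
                                 n_informative=3, random_state=42)
rf_fi = RandomForestClassifier(n_estimators=50, max_depth=4,
                               random_state=42).fit(X_fi, y_fi)

# One importance score per feature; the scores sum to 1
forest_importances = rf_fi.feature_importances_

# Each tree has its own importances; the forest averages them
per_tree = np.array([t.feature_importances_ for t in rf_fi.estimators_])
mean_of_trees = per_tree.mean(axis=0)

print(forest_importances.round(3))
print(mean_of_trees.round(3))
```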

In [None]:
## find important features
rfc.feature_importances_

The raw output is a little difficult to interpret, so we will put it in a pandas Series indexed by feature name.

In [None]:
feature_imp = \
    pds.Series(rfc.feature_importances_, index=feature_cols).sort_values(ascending=False)
feature_imp

We can also visualize the feature importances using a seaborn barplot.

In [None]:
## visualize important features
%matplotlib inline

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)

# Add labels to your graph
plt.xlabel('\nFeature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features\n")

plt.show()

## XGBoost

How does this differ from random forest? Random forest uses bagging to train its final model, building each tree independently. XGBoost instead uses a method called **boosting**, an iterative, sequential method that adds a new decision tree to the overall model at each step to reduce the error left by the previous trees. Each new tree is a *weak learner*; combined, the trees form a strong learner that can predict the outcome accurately.
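
The core idea of boosting can be sketched in a few lines. Note that this is a deliberately simplified version (plain gradient boosting with squared error on a toy regression problem), not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The data and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine wave
rng = np.random.default_rng(42)
X_b = rng.uniform(-3, 3, size=(200, 1))
y_b = np.sin(X_b[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_b, y_b.mean())
mse_history = [np.mean((y_b - prediction) ** 2)]

for _ in range(50):
    # Each new weak learner is fit to the errors left by the
    # trees that came before it
    residuals = y_b - prediction
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(X_b, residuals)
    # Add a shrunken version of the new tree to the ensemble
    prediction += learning_rate * weak_tree.predict(X_b)
    mse_history.append(np.mean((y_b - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```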

<img width="500px" src="img/xgboost_boosting.png" />

A problem with XGBoost is that it is highly sensitive to its hyperparameters. If too many trees are added, the model can overfit. Moreover, the learning rate (`learning_rate`) is crucial: the model tends to perform better if trained slowly, but a decreased learning rate means more trees are needed to reach the same fit. Finding the right balance is key to the robustness and generalizability of the model.
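
One way to look at this trade-off is to track validation accuracy as trees are added, for a fast and a slow learning rate. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (a swapped-in library, chosen because its `staged_predict` method makes the per-tree curve easy to obtain); the synthetic data, the two learning rates, and the names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_lr, y_lr = make_classification(n_samples=400, n_features=8, random_state=42)
X_lr_tr, X_lr_val, y_lr_tr, y_lr_val = train_test_split(
    X_lr, y_lr, test_size=0.25, random_state=42)

val_curves = {}
for lr in (0.5, 0.05):
    gbc = GradientBoostingClassifier(n_estimators=60, learning_rate=lr,
                                     max_depth=2, random_state=42)
    gbc.fit(X_lr_tr, y_lr_tr)
    # staged_predict yields predictions after 1, 2, ..., 60 trees
    val_curves[lr] = [accuracy_score(y_lr_val, stage_pred)
                      for stage_pred in gbc.staged_predict(X_lr_val)]

# Number of trees at which validation accuracy peaks for each rate
best_n_trees = {lr: int(np.argmax(curve)) + 1
                for lr, curve in val_curves.items()}
print(best_n_trees)
```

Plotting the two curves in `val_curves` against the number of trees is a quick way to see where adding trees stops helping.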


In [None]:
import xgboost as xgb

In [None]:
## build and fit XGBoost classifier
## ('binary:logistic' is the objective for binary classification; reg_alpha is the L1 penalty)
xgc = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100,
                        reg_alpha=0.01, max_depth=4, learning_rate=0.1,
                        colsample_bytree=0.3, use_label_encoder=False)
xgc.fit(X_train, y_train)

y_pred_boost = xgc.predict(X_test)

In [None]:
show_confusion_matrix(y_test, y_pred_boost)

In [None]:
accuracy_score(y_test, y_pred_boost)

In [None]:
recall_score(y_test, y_pred_boost)

In [None]:
precision_score(y_test, y_pred_boost)

In [None]:
f1_score(y_test, y_pred_boost)

In [None]:
feature_imp = \
    pds.Series(xgc.feature_importances_, index=feature_cols).sort_values(ascending=False)

## visualize important features
%matplotlib inline

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)

# Add labels to your graph
plt.xlabel('\nFeature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features\n")

plt.show()

In [None]:

import seaborn as sns
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
    f1_score, roc_auc_score, auc, precision_recall_curve, roc_curve,\
    classification_report, confusion_matrix

In [None]:
## load Pima Indians Diabetes dataset (downloaded May 14, 2019; N=768)
df = pds.read_csv("diabetes.csv")

In [None]:
def show_confusion_matrix(y_test, y_pred, palette="inferno"):
    ## see: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
    ##      https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html
    ##      https://classeval.wordpress.com/introduction/basic-evaluation-measures/
    matrix = confusion_matrix(y_test, y_pred)

    colors = sns.color_palette(palette) # set the colors to use for heatmap
    # print(colors.as_hex()) # uncomment this to see color palette

    ax = sns.heatmap(matrix, square=True, annot=True, fmt='d', 
                     cbar=False, cmap=colors, vmin=-1, annot_kws={"size":13}, linewidths=1.0)

    # set labels on figure
    ax.set_xticklabels(labels=["neg","pos"], fontsize=13)
    ax.set_yticklabels(labels=["neg","pos"], fontsize= 13)
    # confusion_matrix puts actual classes on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel("\npredicted value", fontsize=15)
    plt.ylabel("actual value\n", fontsize=15)
    plt.show()
    
def plot_static_roc_curve(fpr, tpr):
    plt.figure(figsize=[5,5])
    plt.fill_between(fpr, tpr, alpha=.5, color='darkorange')
    plt.plot(fpr, tpr, color='darkorange', lw=2)
    # Add dashed diagonal (slope of 1), the ROC curve of a random classifier
    plt.plot([0,1], [0,1], linestyle=(0, (5, 5)), linewidth=2)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC curve");
    
def plot_static_pr_curve(recall, precision):
    plt.figure(figsize=[5,5])
    plt.fill_between(recall, precision, alpha=.5, color='darkorange')
    plt.plot(recall, precision, color='darkorange', lw=2)
    # Add dashed line with a slope of -1 (from (0,1) to (1,0))
    plt.plot([1,0], [0,1], linestyle=(0, (5, 5)), linewidth=2)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-recall curve");

## Neural Networks

Now that we understand and have run a logistic regression model, let's go a bit "deeper". We can think of a neural network (NN) as a set of nested functions -- we call these layers. Each layer in our model takes input from the previous layer and outputs directly to the next layer, i.e. fully connected. 

We are going to create a three-layer neural network with the previously used 8 variables as features and "Outcome" as the label.

The first layer of our NN takes in all 8 features as input, applies a ReLU (rectified linear unit) activation function, and outputs 12 latent (hidden) features. As opposed to the logistic function discussed previously, ReLU outputs 0 if its input is negative and passes the input through unchanged otherwise.

$f(x)=\max(0,x)$

The second layer of our NN takes in all 12 latent features from the previous layer as input, applies a ReLU activation function, and outputs 8 latent features.

The third (and last) layer of our model is a sigmoid output layer that takes in the previous 8 latent features as input.
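
To make the "nested functions" picture concrete, here is a forward pass through the 8 → 12 → 8 → 1 architecture written in plain NumPy. The random weights, the fake input batch, and the helper names are assumptions for illustration; a trained Keras model would hold learned weights instead.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# 8 input features -> 12 hidden -> 8 hidden -> 1 output
layer_sizes = [8, 12, 8, 1]
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h1 = relu(x @ weights[0] + biases[0])       # layer 1: 8 -> 12
    h2 = relu(h1 @ weights[1] + biases[1])      # layer 2: 12 -> 8
    return sigmoid(h2 @ weights[2] + biases[2])  # layer 3: 8 -> 1

x_batch = rng.normal(size=(5, 8))  # 5 fake samples with 8 features each
probs = forward(x_batch)

# Trainable parameters: (8*12 + 12) + (12*8 + 8) + (8*1 + 1) = 221
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(probs.shape, n_params)
```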

The loss function we use for this model is binary cross entropy. For each sample, it takes the negative log of the probability the model assigns to the sample's true class (class 1 when $y_i=1$, class 0 when $y_i=0$) and sums this over all samples. This is the negative of the log likelihood, so minimizing this loss is the same as maximizing the likelihood.

$\mathrm{Loss} = -\sum_{i=1}^{N}\left[y_{i}\ln f(x_{i})+(1-y_{i})\ln\left(1-f(x_{i})\right)\right]$
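
As a quick worked example of this loss, the sketch below evaluates it on four made-up samples and compares the result with scikit-learn's `log_loss` (with `normalize=False`, so that it sums rather than averages). The labels, the predicted probabilities, and the names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities f(x_i) of class 1
y_toy = np.array([1, 0, 1, 0])
p_toy = np.array([0.9, 0.2, 0.6, 0.4])

# Binary cross entropy, summed over the samples
bce_sum = -np.sum(y_toy * np.log(p_toy) + (1 - y_toy) * np.log(1 - p_toy))

# The same quantity from scikit-learn
bce_sklearn = log_loss(y_toy, p_toy, normalize=False)

print(bce_sum, bce_sklearn)
```

Confident, correct predictions (such as 0.9 for a positive sample) contribute little to the loss, while less certain ones (such as 0.6) contribute more.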


For our implementation of the neural network, we will use Keras's Sequential model:
* https://keras.io/guides/sequential_model/

Let's load the libraries we will be using...

In [None]:
from numpy.random import seed
seed(42)
from tensorflow.random import set_seed
set_seed(42)
from keras.models import Sequential
from keras.layers import Dense

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=75, batch_size=10)

# make class predictions with the model
y_proba = model.predict(X_test)
y_pred = (y_proba > 0.5).astype("int32")

In [None]:
## show confusion matrix
show_confusion_matrix(y_test, y_pred)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
# calculate F1 score
f1_score(y_test, y_pred)

In [None]:
# calculate AUROC from the predicted probabilities (not the thresholded labels)
roc_auc_score(y_test, y_proba)

In [None]:
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plot_static_roc_curve(fpr,tpr)

In [None]:
# calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# calculate AUPR
auc(recall, precision)

In [None]:
plot_static_pr_curve(recall,precision)

## Unsupervised Learning
---

Without labels, we can still use machine learning to extract information from data, such as how to group the data, what patterns exist in the data, or how to restructure the data to be more concise without losing information.


### K-Means Clustering

K-means is one of the most common clustering methods. It works through the following steps:

1. Guess some cluster centers
2. E-Step: assign data points to the nearest cluster center
3. M-Step: set the cluster centers to the mean of each cluster
4. Repeat steps 2 and 3 until converged
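
These steps translate almost line for line into NumPy. The function below is a minimal sketch, not scikit-learn's implementation: the name `kmeans_sketch`, the random-point initialization, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: guess cluster centers by picking k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2 (E-step): assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (M-step): move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X_km, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                     random_state=42)
centers_km, labels_km = kmeans_sketch(X_km, k=4)
print(centers_km)
```

Because the result depends on the initial guess, library implementations typically rerun the algorithm from several starting points and keep the best result.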

In [None]:
from sklearn.datasets import make_blobs
import seaborn as sns
import pandas as pd
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=42)
df = pd.DataFrame(X)
sns.scatterplot(data=df, x=0, y=1);

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed for reproducible clusters
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

In [None]:
centers = kmeans.cluster_centers_
df['clusters'] = y_kmeans
sns.scatterplot(data=df, x=0, y=1, hue='clusters');