<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>

# Random Forests (RF) for classification with Python

Estimated time needed: **45** minutes

## Objectives

After completing this lab you will be able to:

*   Understand the difference between Bagging and Random Forest
*   Understand that Random Forests have less correlation between the predictors in their ensemble, which improves accuracy
*   Apply Random Forest
*   Understand hyperparameter selection in Random Forest


In this notebook, you will learn about Random Forests (RF) for classification and regression. Like Bagging, a Random Forest trains multiple versions of a model and aggregates the ensemble to make a single prediction. RF uses an ensemble of trees and introduces extra randomness into each tree by randomly selecting a subset of the features to consider at each node split. This makes the predictions of the trees less correlated with one another, which improves results when the models are aggregated. In this lab, we will compare the sampling process of RF with that of Bagging, then show that the predictors in a Random Forest are less correlated. Finally, we will apply Random Forests to several datasets, using Grid Search to find the optimum hyperparameters.


<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#RFvsBag">What's the difference between RF and Bagging</a></li>
        <li><a href="#Example">Cancer Data Example</a></li>
        <li><a href="https://practice/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML241ENSkillsNetwork31576874-2022-01-01">Practice</a></li>
    </ol>
</div>
<br>
<hr>


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
from sklearn import metrics
from tqdm import tqdm

In [2]:
import warnings

warnings.filterwarnings('ignore')

In [3]:
np.random.seed(42)

This function calculates the accuracy of a fitted model on the training and testing data.

In [4]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    return {
        'Test Accuracy': metrics.accuracy_score(y_test, y_test_pred),
        'Train Accuracy': metrics.accuracy_score(y_train, y_train_pred)
    }

This function calculates the average correlation between the predictors in an ensemble and returns the matrix of pairwise correlations between them.


In [32]:
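def get_correlation(X_test, y_test, models):
    # y_test is unused; it is kept so that existing calls keep working.
    n_estimators = len(models.estimators_)
    # One row of predictions per fitted estimator in the ensemble.
    model_predictions = [model.predict(X_test) for model in models]
    # Transpose so that each column holds one estimator's predictions.
    model_predictions = np.transpose(model_predictions)
    column_names = [f"model {i}" for i in range(len(models))]
    predictions_df = pd.DataFrame(model_predictions, columns=column_names)
    corr = predictions_df.corr()
    # The grand mean includes the diagonal of ones, which contributes 1/n
    # (assuming no NaN entries); subtracting it leaves the off-diagonal
    # sum divided by n**2.
    mean_correlation = corr.mean().mean() - 1 / n_estimators
    print(f"Average correlations between predictors: {mean_correlation}")
    return corr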
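
The adjustment in `get_correlation` removes the diagonal's share of the grand mean but still divides by n² rather than by the n(n-1) off-diagonal entries. As a hedged alternative, the sketch below averages only the off-diagonal entries of a correlation matrix. `mean_off_diagonal` is a hypothetical helper name that is not part of the lab, and the toy matrix uses made-up numbers purely for illustration.

```python
import numpy as np
import pandas as pd


def mean_off_diagonal(corr: pd.DataFrame) -> float:
    """Average the pairwise correlations, excluding the diagonal of ones."""
    values = corr.to_numpy(dtype=float)
    # Boolean mask that is True everywhere except on the diagonal.
    off_diagonal = ~np.eye(values.shape[0], dtype=bool)
    # nanmean skips undefined correlations (e.g. from a constant predictor).
    return float(np.nanmean(values[off_diagonal]))


# Toy 3x3 correlation matrix (made-up numbers, for illustration only).
labels = ["model 0", "model 1", "model 2"]
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.4],
     [0.2, 1.0, 0.6],
     [0.4, 0.6, 1.0]],
    columns=labels,
    index=labels,
)

off_diag_mean = mean_off_diagonal(toy_corr)
# The lab's formula applied to the same matrix, for comparison.
lab_value = toy_corr.mean().mean() - 1 / len(toy_corr)
print(off_diag_mean, lab_value)
```

The two numbers differ by a factor of (n-1)/n, so the gap shrinks as the ensemble grows.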
    

<h2 id="RFvsBag">  What's the difference between RF and Bagging </h2>

RF is similar to Bagging in that it uses model ensembles to make predictions. Adding more trees to an RF generally does not cause overfitting, and because RF also samples the features at random, its trees tend to be less correlated than those of a Bagging ensemble. In this section, we go over some of the differences between RF and Bagging, using the dataset described below.


### About the dataset

We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically, it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company.

This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

The dataset includes information about:

*   Customers who left within the last month – the column is called Churn
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
*   Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
*   Demographic info about customers – gender, age range, and if they have partners and dependents


In [5]:
churn_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv")

In [6]:
churn_df.head()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,4.4,...,1.0,0.0,1.0,1.0,0.0,1.482,3.033,4.913,4.0,1.0
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,9.45,...,0.0,0.0,0.0,0.0,0.0,2.246,3.24,3.497,1.0,1.0
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,6.3,...,0.0,0.0,0.0,1.0,0.0,1.841,3.24,3.401,3.0,0.0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,6.05,...,1.0,1.0,1.0,1.0,1.0,1.8,3.807,4.331,4.0,0.0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,7.1,...,0.0,0.0,1.0,1.0,0.0,1.96,3.091,4.382,3.0,0.0


In [7]:
churn_df.dtypes

tenure      float64
age         float64
address     float64
income      float64
ed          float64
employ      float64
equip       float64
callcard    float64
wireless    float64
longmon     float64
tollmon     float64
equipmon    float64
cardmon     float64
wiremon     float64
longten     float64
tollten     float64
cardten     float64
voice       float64
pager       float64
internet    float64
callwait    float64
confer      float64
ebill       float64
loglong     float64
logtoll     float64
lninc       float64
custcat     float64
churn       float64
dtype: object

### Data Preprocessing and Feature Selection

In [8]:
Y_COLUMN = 'churn'

The 'churn' column takes only two values, 1 or 0, so we convert it to int.

In [9]:
churn_df['churn'].unique()

array([1., 0.])

In [10]:
churn_df[Y_COLUMN] = churn_df[Y_COLUMN].astype(int)
churn_df[Y_COLUMN].dtypes

dtype('int64')

Next, we select a few feature columns to use for modelling.

In [11]:
FEATURE_COLUMNS = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']

In [12]:
full_features = FEATURE_COLUMNS.copy()
full_features.append(Y_COLUMN)

In [13]:
reduced_churn_df = churn_df[full_features]
reduced_churn_df

Unnamed: 0,tenure,age,address,income,ed,employ,equip,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,1
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,0
...,...,...,...,...,...,...,...,...
195,55.0,44.0,24.0,83.0,1.0,23.0,0.0,0
196,34.0,23.0,3.0,24.0,1.0,7.0,0.0,0
197,6.0,32.0,10.0,47.0,1.0,10.0,0.0,0
198,24.0,30.0,0.0,25.0,4.0,5.0,0.0,1


### Bootstrap Sampling

Bootstrap Sampling is a method that involves drawing sample data repeatedly with replacement from a data source to estimate a model parameter. Scikit-learn has methods for Bagging, but it is helpful to understand Bootstrap Sampling. We will import "resample".


In [14]:
from sklearn.utils import resample

In [15]:
for n in range(5):
    print(resample(reduced_churn_df[0: 5]))
    print("\n\n")

   tenure   age  address  income   ed  employ  equip  churn
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0



   tenure   age  address  income   ed  employ  equip  churn
1    33.0  33.0     12.0    33.0  2.0     0.0    0.0      1
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0



   tenure   age  address  income   ed  employ  equip  churn
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
1    33.0  33.0     12.0    33.0  
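
The repeated row indices in the output above are typical of sampling with replacement. As a rough illustration, the sketch below draws one bootstrap sample from a hypothetical array of 10,000 row ids (not the churn data) and measures how many distinct rows it contains. For large samples the expected share is about 1 - 1/e, roughly 63%. The remaining rows are left out of that sample.

```python
import numpy as np
from sklearn.utils import resample

n_rows = 10_000
row_ids = np.arange(n_rows)  # hypothetical row ids, not the churn data

# Draw one bootstrap sample: n_rows draws with replacement.
bootstrap_ids = resample(row_ids, replace=True, n_samples=n_rows, random_state=0)

unique_fraction = np.unique(bootstrap_ids).size / n_rows
# Rows that were never drawn are left out of this particular sample.
left_out_fraction = 1 - unique_fraction

print(f"unique rows in the sample: {unique_fraction:.3f}")
print(f"expected for large n:      {1 - np.exp(-1):.3f}")
```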

### Random Feature Sampling

Like Bagging, RF draws a bootstrap sample of the rows. In addition, RF samples the features at random. The cells below illustrate the idea by pairing a bootstrap sample of the rows with a random draw of the feature columns. Here all M features are drawn, so only their order changes, whereas an RF typically considers only m < M of them at each split.


In [16]:
X = churn_df[FEATURE_COLUMNS]

In [17]:
M = X.shape[1]
M

7

In [18]:
import random

In [19]:
feature_index = range(M)
random.sample(feature_index, M)

[6, 2, 1, 3, 0, 5, 4]

In [20]:
for n in range(M):
    print(f"sample {n}")
    print(resample(X[0: M]).iloc[:, random.sample(feature_index, M)])
    print("\n\n")

sample 0
   employ  address  income  tenure  equip   age   ed
2     2.0      9.0    30.0    23.0    0.0  30.0  1.0
2     2.0      9.0    30.0    23.0    0.0  30.0  1.0
6     8.0      7.0    37.0    42.0    1.0  40.0  2.0
1     0.0     12.0    33.0    33.0    0.0  33.0  2.0
3    10.0      5.0    76.0    38.0    1.0  35.0  2.0
3    10.0      5.0    76.0    38.0    1.0  35.0  2.0
6     8.0      7.0    37.0    42.0    1.0  40.0  2.0



sample 1
   tenure  address   age  income  equip   ed  employ
5    68.0     17.0  52.0   120.0    0.0  1.0    24.0
5    68.0     17.0  52.0   120.0    0.0  1.0    24.0
6    42.0      7.0  40.0    37.0    1.0  2.0     8.0
5    68.0     17.0  52.0   120.0    0.0  1.0    24.0
2    23.0      9.0  30.0    30.0    0.0  1.0     2.0
3    38.0      5.0  35.0    76.0    1.0  2.0    10.0
6    42.0      7.0  40.0    37.0    1.0  2.0     8.0



sample 2
    ed  address  equip  income   age  employ  tenure
3  2.0      5.0    1.0    76.0  35.0    10.0    38.0
0  5.0      7
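
The cell above draws all M features, so it only shuffles the columns. A minimal sketch of drawing a strict subset is shown below, assuming m = floor(sqrt(M)), which is a common choice for classification. The three "splits" are hypothetical and only show that each draw is independent; this is not how scikit-learn implements it internally.

```python
import math
import random

feature_names = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
M = len(feature_names)
# Assumption: m = floor(sqrt(M)), a common choice for classification.
m = max(1, int(math.sqrt(M)))

rng = random.Random(0)  # local generator so the sketch is reproducible

# Each hypothetical split draws its own m candidate features without replacement.
candidate_sets = [rng.sample(feature_names, m) for _ in range(3)]
for split, candidates in enumerate(candidate_sets):
    print(f"split {split}: {candidates}")
```

Because only m of the M features compete at each split, a single strong feature cannot dominate every tree, which is what lowers the correlation between trees.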

### Train Test Split

In [21]:
y = churn_df[Y_COLUMN]
y.head()

0    1
1    1
2    0
3    0
4    0
Name: churn, dtype: int64

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [23]:
print(f"Train set {X_train.shape}, {y_train.shape}")
print(f"Test set {X_test.shape}, {y_test.shape}")

Train set (140, 7), (140,)
Test set (60, 7), (60,)


### Bagging Review


In [24]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Bagging improves models that suffer from overfitting, that is, models that do well on the training data but do not generalize well to unseen data. Decision Trees are a prime candidate for this reason; in addition, they are fast to train. We create a <code>BaggingClassifier</code> object with a Decision Tree as the <code>estimator</code> (named <code>base_estimator</code> in scikit-learn versions before 1.2).


In [25]:
n_estimators = 20
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(
        criterion='entropy',
        max_depth=4,
        random_state=2
    ),
    n_estimators=n_estimators,
    random_state=0,
    bootstrap=True
)

In [26]:
bagging_classifier.fit(X_train, y_train)

In [27]:
bagging_classifier.predict(X_test).shape

(60,)

In [28]:
print(get_accuracy(X_train, X_test, y_train, y_test, bagging_classifier))

{'Test Accuracy': 0.7333333333333333, 'Train Accuracy': 0.9071428571428571}
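
To make the aggregation step concrete, the sketch below rebuilds a bagged prediction by hand: it averages the class probabilities of the individual trees and picks the most probable class. It uses synthetic stand-in data from `make_classification` rather than the churn data, and the names `X_demo`, `y_demo`, and `demo_bagging` are assumptions introduced for this example. Fully grown trees and an odd number of estimators are used so that no ties arise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
demo_bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=2),
    n_estimators=11,
    random_state=0,
).fit(X_demo, y_demo)

# Average the class probabilities of the individual trees...
tree_probabilities = [
    tree.predict_proba(X_demo[:, features])
    for tree, features in zip(demo_bagging.estimators_,
                              demo_bagging.estimators_features_)
]
mean_probabilities = np.mean(tree_probabilities, axis=0)
# ...and pick the most probable class for each row.
manual_prediction = demo_bagging.classes_[np.argmax(mean_probabilities, axis=1)]

agreement = np.mean(manual_prediction == demo_bagging.predict(X_demo))
print(f"agreement with BaggingClassifier.predict: {agreement:.3f}")
```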


In [31]:
get_correlation(X_test, y_test, bagging_classifier).style.background_gradient(cmap='coolwarm')

Average correlations between predictors: 0.25400671537472364


Unnamed: 0,model 0,model 1,model 2,model 3,model 4,model 5,model 6,model 7,model 8,model 9,model 10,model 11,model 12,model 13,model 14,model 15,model 16,model 17,model 18,model 19
model 0,1.0,-0.057709,0.152641,0.132379,0.068323,0.195047,0.209679,0.256111,0.177811,0.318511,-0.024845,0.318511,0.209679,0.112611,0.294475,-0.035245,0.161491,0.161491,0.236433,0.015456
model 1,-0.057709,1.0,-0.002979,0.335171,0.349647,0.121829,-0.078409,0.013546,0.180022,0.223814,0.451486,-0.074605,-0.078409,0.404443,0.24658,0.481571,0.04413,0.04413,0.215365,-0.059131
model 2,0.152641,-0.002979,1.0,0.395985,-0.010903,0.342381,0.455239,0.674356,0.442603,0.359425,-0.092675,0.51917,0.552099,0.296511,0.32485,0.216541,0.561502,0.47973,0.415029,0.006783
model 3,0.132379,0.335171,0.395985,1.0,0.456572,0.242393,0.436809,0.427623,0.417131,0.494783,0.051331,0.415618,0.340807,0.405843,0.224442,0.199294,0.375523,0.294475,0.445634,0.19496
model 4,0.068323,0.349647,-0.010903,0.456572,1.0,0.362231,-0.011036,0.090878,0.002915,0.409514,0.347826,-0.045502,0.099322,0.434355,0.294475,0.387699,0.068323,0.161491,0.315244,-0.100465
model 5,0.195047,0.121829,0.342381,0.242393,0.362231,1.0,0.19803,0.370625,0.183073,0.163299,0.195047,0.244949,0.19803,0.505181,0.605983,0.158114,0.529414,0.195047,0.494975,-0.069338
model 6,0.209679,-0.078409,0.455239,0.436809,-0.011036,0.19803,1.0,0.474619,0.564524,0.404226,-0.121393,0.619813,0.738562,0.323942,0.148803,0.062622,0.540752,0.430394,0.140028,0.247156
model 7,0.256111,0.013546,0.674356,0.427623,0.090878,0.370625,0.474619,1.0,0.546688,0.464008,-0.074355,0.625402,0.474619,0.256776,0.283884,0.140642,0.669193,0.50396,0.454257,0.020559
model 8,0.177811,0.180022,0.442603,0.417131,0.002915,0.183073,0.564524,0.546688,1.0,0.405727,-0.084533,0.491144,0.357359,0.241594,0.188913,0.314275,0.352707,0.177811,0.332877,0.07979
model 9,0.318511,0.223814,0.359425,0.494783,0.409514,0.163299,0.404226,0.464008,0.405727,1.0,0.318511,0.466667,0.404226,0.392837,0.178122,0.464758,0.318511,0.318511,0.50037,0.113228
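
The pairwise correlations matter because they limit how much averaging can help. For n identically distributed predictors with variance σ² and a common pairwise correlation ρ, the variance of their average is ρσ² + (1 - ρ)σ²/n, so the first term does not shrink as more models are added. The sketch below checks this numerically under stated assumptions: Gaussian "predictions", unit variance, and ρ = 0.25, chosen to be close to the average reported above. It illustrates the formula and is not a model of the actual trees.

```python
import numpy as np

n_predictors = 20   # same ensemble size as in the lab
sigma2 = 1.0        # variance of each predictor (assumed equal)
rho = 0.25          # assumed common pairwise correlation

# Theory: Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n
theoretical_variance = rho * sigma2 + (1 - rho) * sigma2 / n_predictors

# Simulation: draw correlated Gaussian "predictions" and average them.
covariance = sigma2 * (
    rho * np.ones((n_predictors, n_predictors))
    + (1 - rho) * np.eye(n_predictors)
)
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(n_predictors), covariance, size=100_000)
simulated_variance = draws.mean(axis=1).var()

print(f"theoretical: {theoretical_variance:.4f}")
print(f"simulated:   {simulated_variance:.4f}")
```

Setting `rho = 0` in the sketch reduces the variance to σ²/n, which is why lowering the correlation between trees is the main lever that Random Forests add over Bagging.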


### Random Forest

Random forests are a combination of trees such that each tree depends on a random subset of the features and data. As a result, the trees in the forest differ from one another, and the forest usually performs better than Bagging. The most important parameters are the number of trees and the number of features to sample. First, we import <code>RandomForestClassifier</code>.
