<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>

# Random Forests (RF) for classification with Python

Estimated time needed: **45** minutes

## Objectives

After completing this lab you will be able to:

*   Understand the difference between Bagging and Random Forest
*   Understand that the predictors in a Random Forest ensemble are less correlated with one another, which improves accuracy
*   Apply Random Forest
*   Understand hyperparameter selection in Random Forest


In this notebook, you will learn about Random Forests (RF) for classification. Random Forest is similar to Bagging: it trains multiple versions of a model and aggregates the ensemble to make a single prediction. RF uses an ensemble of trees and introduces additional randomness into each tree by randomly selecting a subset of the features to consider at each node split. This makes the predictions of the trees less correlated, which improves results when the models are aggregated. In this lab, we will compare the sampling process of RF with that of Bagging, then demonstrate that the predictors in a random forest are less correlated. Finally, we will apply Random Forests to several datasets, using Grid Search to find the optimal hyperparameters.


<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#RFvsBag">What's the difference between RF and Bagging</a></li>
        <li><a href="#Example">Cancer Data Example</a></li>
        <li><a href="#practice">Practice</a></li>
    </ol>
</div>
<br>
<hr>


In [88]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
from sklearn import metrics
from tqdm import tqdm

In [89]:
import warnings

warnings.filterwarnings('ignore')

In [90]:
np.random.seed(42)

This function will calculate the accuracy of the training and testing data given a model.

In [91]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    return {
        'Test Accuracy': metrics.accuracy_score(y_test, y_test_pred),
        'Train Accuracy': metrics.accuracy_score(y_train, y_train_pred)
    }

This function prints the average correlation between the predictors in an ensemble and returns their pairwise correlation matrix.


In [92]:
def get_correlation(X_test, y_test, models):
    # y_test is unused; it is kept so that existing calls keep working.
    n_estimators = len(models.estimators_)
    # One row of predictions per fitted tree, transposed to one column per tree.
    model_predictions = [model.predict(X_test) for model in models]
    model_predictions = np.transpose(model_predictions)
    column_names = [f"model {i}" for i in range(len(models))]
    predictions_df = pd.DataFrame(model_predictions, columns=column_names)
    corr = predictions_df.corr()
    # Subtracting 1/n removes the diagonal's share (n ones out of n*n entries).
    mean_correlation = corr.mean().mean() - 1 / n_estimators
    print(f"Average correlations between predictors: {mean_correlation}")
    return corr
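
The function above subtracts 1/n from the mean over all n² entries of the correlation matrix. That removes the diagonal's contribution but still divides by n² rather than by the n(n-1) off-diagonal entries, so the printed value is slightly smaller than the true off-diagonal average. A minimal sketch of averaging only the off-diagonal entries follows; <code>mean_offdiagonal</code> and the 3x3 matrix <code>example</code> are hypothetical names and values used for illustration only.

```python
import numpy as np


def mean_offdiagonal(corr):
    """Average of the entries of a square matrix, excluding the diagonal."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)  # True everywhere except the diagonal
    return corr[off_diagonal].mean()


# Hypothetical 3x3 correlation matrix, used only for illustration.
example = np.array([
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
])

exact = mean_offdiagonal(example)
shortcut = example.mean() - 1 / example.shape[0]
print(f"exact off-diagonal mean: {exact:.4f}")
print(f"mean minus 1/n shortcut: {shortcut:.4f}")
```

The two values differ by a factor of (n-1)/n, so the gap shrinks as the number of predictors grows.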
    

<h2 id="RFvsBag">  What's the difference between RF and Bagging </h2>

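RF is similar to Bagging in that it uses model ensembles to make predictions. As with Bagging, adding more models to an RF does not usually lead to overfitting. The difference is that RF also randomizes the features considered at each split, which lowers the correlation between the trees. In this section, we go over some of the differences between RF and Bagging, using the following dataset: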


### About the dataset

We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically, it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company.

This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

The dataset includes information about:

*   Customers who left within the last month – the column is called Churn
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
*   Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
*   Demographic info about customers – gender, age range, and if they have partners and dependents


In [93]:
churn_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv")

In [94]:
churn_df.head()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,4.4,...,1.0,0.0,1.0,1.0,0.0,1.482,3.033,4.913,4.0,1.0
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,9.45,...,0.0,0.0,0.0,0.0,0.0,2.246,3.24,3.497,1.0,1.0
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,6.3,...,0.0,0.0,0.0,1.0,0.0,1.841,3.24,3.401,3.0,0.0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,6.05,...,1.0,1.0,1.0,1.0,1.0,1.8,3.807,4.331,4.0,0.0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,7.1,...,0.0,0.0,1.0,1.0,0.0,1.96,3.091,4.382,3.0,0.0


In [95]:
churn_df.dtypes

tenure      float64
age         float64
address     float64
income      float64
ed          float64
employ      float64
equip       float64
callcard    float64
wireless    float64
longmon     float64
tollmon     float64
equipmon    float64
cardmon     float64
wiremon     float64
longten     float64
tollten     float64
cardten     float64
voice       float64
pager       float64
internet    float64
callwait    float64
confer      float64
ebill       float64
loglong     float64
logtoll     float64
lninc       float64
custcat     float64
churn       float64
dtype: object

### Data Preprocessing and Feature Selection

In [96]:
Y_COLUMN = 'churn'

The 'churn' column takes only two values, 1 or 0, so we convert it to int.

In [97]:
churn_df['churn'].unique()

array([1., 0.])

In [98]:
churn_df[Y_COLUMN] = churn_df[Y_COLUMN].astype(int)
churn_df[Y_COLUMN].dtypes

dtype('int64')

Next, select the feature columns to use for modelling.

In [99]:
FEATURE_COLUMNS = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']

In [100]:
full_features = FEATURE_COLUMNS.copy()
full_features.append(Y_COLUMN)

In [101]:
reduced_churn_df = churn_df[full_features]
reduced_churn_df

Unnamed: 0,tenure,age,address,income,ed,employ,equip,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,1
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,0
...,...,...,...,...,...,...,...,...
195,55.0,44.0,24.0,83.0,1.0,23.0,0.0,0
196,34.0,23.0,3.0,24.0,1.0,7.0,0.0,0
197,6.0,32.0,10.0,47.0,1.0,10.0,0.0,0
198,24.0,30.0,0.0,25.0,4.0,5.0,0.0,1


### Bootstrap Sampling

Bootstrap Sampling is a method that involves drawing sample data repeatedly with replacement from a data source to estimate a model parameter. Scikit-learn has methods for Bagging, but it is helpful to understand Bootstrap Sampling. We will import "resample".
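
Because rows are drawn with replacement, a bootstrap sample typically repeats some rows and leaves others out. A minimal sketch of this effect, using only the standard library, is shown here; the dataset size <code>n_rows</code> and the seed are hypothetical values chosen for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 10_000         # hypothetical dataset size

# Draw n_rows row indices with replacement, as a bootstrap sample does.
bootstrap_indices = [rng.randrange(n_rows) for _ in range(n_rows)]

unique_fraction = len(set(bootstrap_indices)) / n_rows
print(f"fraction of distinct rows in the sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e: {1 - 1 / math.e:.3f}")
```

For large datasets, roughly 63% of the distinct rows end up in each bootstrap sample, so every model in the ensemble sees a somewhat different view of the data.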


In [102]:
from sklearn.utils import resample

In [103]:
for n in range(5):
    print(resample(reduced_churn_df[0: 5]))
    print("\n\n")

   tenure   age  address  income   ed  employ  equip  churn
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0



   tenure   age  address  income   ed  employ  equip  churn
1    33.0  33.0     12.0    33.0  2.0     0.0    0.0      1
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0



   tenure   age  address  income   ed  employ  equip  churn
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0      0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0      0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0      0
1    33.0  33.0     12.0    33.0  

### Random Feature Selection

Random Forest adds a second source of randomness on top of Bootstrap Sampling: the features considered at each split are also drawn at random. The cells below combine a bootstrap sample of the rows with a random draw of the feature columns. Note that <code>random.sample(feature_index, M)</code> draws all M columns, so here only the column order changes; a forest would draw fewer than M.


In [104]:
X = churn_df[FEATURE_COLUMNS]

In [105]:
M = X.shape[1]
M

7

In [106]:
import random

In [107]:
feature_index = range(M)
random.sample(feature_index, M)

[4, 3, 2, 6, 1, 0, 5]
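
The call above draws all M indices, so it returns a permutation of the features rather than a subset. A minimal sketch of drawing a strict subset of m < M features, as would happen at one candidate split, follows. The value <code>m = 2</code>, the seed, and the loop over three splits are assumptions made for illustration.

```python
import random

feature_names = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
M = len(feature_names)
m = 2  # hypothetical subset size, smaller than M

rng = random.Random(0)  # fixed seed, chosen only for repeatability
for split in range(3):
    # Sample m distinct feature indices without replacement.
    chosen = rng.sample(range(M), m)
    print(f"split {split}: {[feature_names[i] for i in chosen]}")
```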

In [108]:
for n in range(M):
    print(f"sample {n}")
    print(resample(X[0: M]).iloc[:, random.sample(feature_index, M)])
    print("\n\n")

sample 0
   tenure  employ  equip  address   ed   age  income
2    23.0     2.0    0.0      9.0  1.0  30.0    30.0
2    23.0     2.0    0.0      9.0  1.0  30.0    30.0
6    42.0     8.0    1.0      7.0  2.0  40.0    37.0
1    33.0     0.0    0.0     12.0  2.0  33.0    33.0
3    38.0    10.0    1.0      5.0  2.0  35.0    76.0
3    38.0    10.0    1.0      5.0  2.0  35.0    76.0
6    42.0     8.0    1.0      7.0  2.0  40.0    37.0



sample 1
   equip   ed  income  address  tenure   age  employ
5    0.0  1.0   120.0     17.0    68.0  52.0    24.0
5    0.0  1.0   120.0     17.0    68.0  52.0    24.0
6    1.0  2.0    37.0      7.0    42.0  40.0     8.0
5    0.0  1.0   120.0     17.0    68.0  52.0    24.0
2    0.0  1.0    30.0      9.0    23.0  30.0     2.0
3    1.0  2.0    76.0      5.0    38.0  35.0    10.0
6    1.0  2.0    37.0      7.0    42.0  40.0     8.0



sample 2
    ed   age  income  tenure  address  employ  equip
3  2.0  35.0    76.0    38.0      5.0    10.0    1.0
0  5.0  33.0 

### Train Test Split

In [109]:
y = churn_df[Y_COLUMN]
y.head()

0    1
1    1
2    0
3    0
4    0
Name: churn, dtype: int64

In [110]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [111]:
print(f"Train set {X_train.shape}, {y_train.shape}")
print(f"Test set {X_test.shape}, {y_test.shape}")

Train set (140, 7), (140,)
Test set (60, 7), (60,)


### Bagging Review


In [112]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Bagging improves models that suffer from overfitting, that is, models that do well on the training data but do not generalize well to unseen data. Decision Trees are a prime candidate for this reason. In addition, they are fast to train. We create a <code>BaggingClassifier</code> object with a Decision Tree as the <code>estimator</code> (named <code>base_estimator</code> in older scikit-learn releases).


In [113]:
n_estimators = 20
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(
        criterion='entropy',
        max_depth=4,
        random_state=2
    ),
    n_estimators=n_estimators,
    random_state=0,
    bootstrap=True
)

In [114]:
bagging_classifier.fit(X_train, y_train)

In [115]:
bagging_classifier.predict(X_test).shape

(60,)

In [116]:
print(get_accuracy(X_train, X_test, y_train, y_test, bagging_classifier))

{'Test Accuracy': 0.7333333333333333, 'Train Accuracy': 0.9071428571428571}
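
To make the mechanics of Bagging concrete, here is a rough sketch that builds the ensemble by hand: each tree is fit on its own bootstrap sample and the predictions are combined by a hard majority vote. It runs on synthetic data from <code>make_classification</code> (an assumption, so the block is self-contained), and the hard vote is a simplification rather than a description of how <code>BaggingClassifier</code> combines its estimators internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic data, used only so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

trees = []
for seed in range(20):
    # Each tree sees a different bootstrap sample of the training rows.
    X_boot, y_boot = resample(X_tr, y_tr, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
    trees.append(tree.fit(X_boot, y_boot))

# Majority vote: average the 0/1 predictions and threshold at 0.5.
votes = np.mean([tree.predict(X_te) for tree in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("manual bagging test accuracy:", np.mean(ensemble_pred == y_te))
```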


In [117]:
get_correlation(X_test, y_test, bagging_classifier).style.background_gradient(cmap='coolwarm')

Average correlations between predictors: 0.25400671537472364


Unnamed: 0,model 0,model 1,model 2,model 3,model 4,model 5,model 6,model 7,model 8,model 9,model 10,model 11,model 12,model 13,model 14,model 15,model 16,model 17,model 18,model 19
model 0,1.0,-0.057709,0.152641,0.132379,0.068323,0.195047,0.209679,0.256111,0.177811,0.318511,-0.024845,0.318511,0.209679,0.112611,0.294475,-0.035245,0.161491,0.161491,0.236433,0.015456
model 1,-0.057709,1.0,-0.002979,0.335171,0.349647,0.121829,-0.078409,0.013546,0.180022,0.223814,0.451486,-0.074605,-0.078409,0.404443,0.24658,0.481571,0.04413,0.04413,0.215365,-0.059131
model 2,0.152641,-0.002979,1.0,0.395985,-0.010903,0.342381,0.455239,0.674356,0.442603,0.359425,-0.092675,0.51917,0.552099,0.296511,0.32485,0.216541,0.561502,0.47973,0.415029,0.006783
model 3,0.132379,0.335171,0.395985,1.0,0.456572,0.242393,0.436809,0.427623,0.417131,0.494783,0.051331,0.415618,0.340807,0.405843,0.224442,0.199294,0.375523,0.294475,0.445634,0.19496
model 4,0.068323,0.349647,-0.010903,0.456572,1.0,0.362231,-0.011036,0.090878,0.002915,0.409514,0.347826,-0.045502,0.099322,0.434355,0.294475,0.387699,0.068323,0.161491,0.315244,-0.100465
model 5,0.195047,0.121829,0.342381,0.242393,0.362231,1.0,0.19803,0.370625,0.183073,0.163299,0.195047,0.244949,0.19803,0.505181,0.605983,0.158114,0.529414,0.195047,0.494975,-0.069338
model 6,0.209679,-0.078409,0.455239,0.436809,-0.011036,0.19803,1.0,0.474619,0.564524,0.404226,-0.121393,0.619813,0.738562,0.323942,0.148803,0.062622,0.540752,0.430394,0.140028,0.247156
model 7,0.256111,0.013546,0.674356,0.427623,0.090878,0.370625,0.474619,1.0,0.546688,0.464008,-0.074355,0.625402,0.474619,0.256776,0.283884,0.140642,0.669193,0.50396,0.454257,0.020559
model 8,0.177811,0.180022,0.442603,0.417131,0.002915,0.183073,0.564524,0.546688,1.0,0.405727,-0.084533,0.491144,0.357359,0.241594,0.188913,0.314275,0.352707,0.177811,0.332877,0.07979
model 9,0.318511,0.223814,0.359425,0.494783,0.409514,0.163299,0.404226,0.464008,0.405727,1.0,0.318511,0.466667,0.404226,0.392837,0.178122,0.464758,0.318511,0.318511,0.50037,0.113228


### Random Forest

Random forests are a combination of trees such that each tree depends on a random subset of the features and data. As a result, the trees in the forest differ from one another, and the forest usually performs better than Bagging. The most important parameters are the number of trees and the number of features to sample. First, we import <code>RandomForestClassifier</code>.


In [118]:
from sklearn.ensemble import RandomForestClassifier

As with Bagging, increasing the number of trees improves results and does not lead to overfitting in most cases, but the improvements plateau as you add more trees. For this example, we set the number of trees in the forest, <code>n_estimators</code>, to 20 (the scikit-learn default is 100):


In [119]:
n_estimators = 20

<code>max_features</code>, denoted $m$, is the number of features to consider when looking for the best split. The total number of features, $M$, is:


In [120]:
x_features = X.shape[1]

If we have $M$ features, a popular choice for $m$ is the floor of the square root of $M$:

$m = \lfloor \sqrt{M} \rfloor$
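
As a quick check of the rule, the sketch below evaluates it for a few feature counts. The helper name <code>default_max_features</code> and the lower bound of 1 are assumptions added for illustration.

```python
import math


def default_max_features(n_features):
    """Return floor(sqrt(n_features)), never less than 1."""
    return max(1, math.isqrt(n_features))


for M in (4, 7, 9, 30):
    print(f"M = {M:2d} -> m = {default_max_features(M)}")
```

With the 7 churn features used in this section, the rule gives m = 2.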


In [121]:
max_features = int(np.floor(np.sqrt(x_features)))  # floor(sqrt(M)), matching the formula above
max_features

2

In [122]:
random_forest_classifier = RandomForestClassifier(
    max_features=max_features,
    n_estimators=n_estimators
)

In [123]:
random_forest_classifier.fit(X_train, y_train)

In [124]:
print(get_accuracy(X_train, X_test, y_train, y_test, random_forest_classifier))

{'Test Accuracy': 0.8, 'Train Accuracy': 1.0}
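
The claim that gains plateau as trees are added can be explored with a short loop. This is only a sketch on synthetic data from <code>make_classification</code> (an assumption, so the block is self-contained); the tree counts tried and the seeds are arbitrary, and the exact accuracies will differ on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

scores = {}
for n_trees in (1, 5, 20, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, max_features=2, random_state=0)
    forest.fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data.
    scores[n_trees] = forest.score(X_te, y_te)
    print(f"{n_trees:3d} trees -> test accuracy {scores[n_trees]:.3f}")
```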


In [125]:
get_correlation(X_test, y_test, random_forest_classifier).style.background_gradient(cmap='coolwarm')

Average correlations between predictors: 0.18476416891490555


Unnamed: 0,model 0,model 1,model 2,model 3,model 4,model 5,model 6,model 7,model 8,model 9,model 10,model 11,model 12,model 13,model 14,model 15,model 16,model 17,model 18,model 19
model 0,1.0,0.4344,0.296844,0.251643,0.250801,0.296752,0.296844,0.268914,0.258009,0.049031,0.28663,0.248464,0.43069,0.103429,0.103429,0.446583,0.418221,0.401878,0.200038,0.336194
model 1,0.4344,1.0,0.154303,0.302614,0.11547,-0.09759,0.308607,-0.066667,0.041996,0.149478,-0.096225,0.174078,0.244949,0.2,0.022222,0.405727,0.323287,0.383311,0.427205,0.125988
model 2,0.296844,0.154303,1.0,0.08405,-0.0,0.090351,0.0625,0.308607,0.408248,-0.069195,0.451003,0.161165,0.472456,0.077152,0.0,0.301493,0.074826,0.08405,0.058428,0.116642
model 3,0.251643,0.302614,0.08405,1.0,-0.034943,-0.053158,0.224133,-0.020174,0.205879,0.081422,0.157243,0.319582,0.2965,0.060523,0.14122,0.314055,0.194157,0.413919,0.345525,0.434634
model 4,0.250801,0.11547,-0.0,-0.034943,1.0,0.101419,0.200446,0.19245,0.072739,0.110959,-0.0,0.301511,0.070711,0.11547,0.11547,-0.036986,0.043073,0.2446,0.201802,0.072739
model 5,0.296752,-0.09759,0.090351,-0.053158,0.101419,1.0,0.225877,0.136626,0.110657,0.06877,0.422577,0.118918,0.119523,0.136626,0.448914,0.143792,0.21114,0.301227,0.147813,0.33197
model 6,0.296844,0.308607,0.0625,0.224133,0.200446,0.225877,1.0,0.077152,0.335347,0.004943,0.200446,0.36262,0.330719,0.154303,0.231455,0.301493,0.161165,0.434257,0.462932,0.408248
model 7,0.268914,-0.066667,0.308607,-0.020174,0.19245,0.136626,0.077152,1.0,0.20998,0.234895,0.096225,0.174078,0.326599,0.022222,-0.066667,0.149478,0.024868,-0.020174,0.11651,-0.041996
model 8,0.258009,0.041996,0.408248,0.205879,0.072739,0.110657,0.335347,0.20998,1.0,-0.008071,0.309142,0.212007,0.385758,-0.125988,0.125988,0.153351,-0.028198,0.205879,0.088074,0.047619
model 9,0.049031,0.149478,-0.069195,0.081422,0.110959,0.06877,0.004943,0.234895,-0.008071,1.0,-0.036986,0.026021,-0.052307,0.149478,0.149478,0.261286,-0.011152,0.003877,0.345828,-0.008071


<h2 id="Example">Cancer Data Example</h2>

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007) [http://mlearn.ics.uci.edu/MLRepository.html](http://mlearn.ics.uci.edu/MLRepository.html). The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

| Field name  | Description                 |
| ----------- | --------------------------- |
| ID          | Sample identifier           |
| Clump       | Clump thickness             |
| UnifSize    | Uniformity of cell size     |
| UnifShape   | Uniformity of cell shape    |
| MargAdh     | Marginal adhesion           |
| SingEpiSize | Single epithelial cell size |
| BareNuc     | Bare nuclei                 |
| BlandChrom  | Bland chromatin             |
| NormNucl    | Normal nucleoli             |
| Mit         | Mitoses                     |
| Class       | Benign or malignant         |

<br>
<br>

Let's load the dataset:


In [126]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv")

In [127]:
df.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [128]:
df.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

Take a quick look at the Class values to see what type of classification we are dealing with. In this dataset, 2 marks a benign sample and 4 a malignant one.

In [129]:
df['Class'].unique()

array([2, 4])

In [130]:
df['BareNuc'].unique()

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'],
      dtype=object)

Drop the rows whose BareNuc value is '?', which marks a missing value.

In [131]:
to_remove_barenuc_mask = (df['BareNuc'] == '?')

In [132]:
df[to_remove_barenuc_mask].index

Index([23, 40, 139, 145, 158, 164, 235, 249, 275, 292, 294, 297, 315, 321, 411,
       617],
      dtype='int64')

In [133]:
df.drop(df[to_remove_barenuc_mask].index, inplace=True)

Converting remaining BareNuc values to int.

In [134]:
df['BareNuc'] = df['BareNuc'].astype(int)

In [135]:
df['BareNuc'].dtype

dtype('int64')
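
An alternative to building a mask for '?' by hand is to let pandas coerce every unparseable entry to NaN and then drop those rows. The sketch below shows the idea on a hypothetical four-row frame named <code>demo</code>; it is not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical miniature of the BareNuc column, including the '?' marker.
demo = pd.DataFrame({'BareNuc': ['1', '10', '?', '4'], 'Class': [2, 4, 2, 4]})

# Non-numeric entries become NaN instead of raising an error.
demo['BareNuc'] = pd.to_numeric(demo['BareNuc'], errors='coerce')

# Drop the rows that could not be parsed, then store the column as integers.
demo = demo.dropna(subset=['BareNuc']).astype({'BareNuc': int})
print(demo)
```

This variant also catches any other non-numeric marker, not only '?'.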

Define the y column.

In [136]:
Y_COLUMN = 'Class'

Define the feature columns.

In [137]:
FEATURE_COLUMNS = ['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']

In [138]:
X = df[FEATURE_COLUMNS]
X.head()

Unnamed: 0,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1


In [139]:
y = df[Y_COLUMN]
y.head()

0    2
1    2
2    2
3    2
4    2
Name: Class, dtype: int64

Create the train test split.

In [140]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Use GridSearchCV to find the optimal hyperparameters.

In [141]:
from sklearn.model_selection import GridSearchCV

In [142]:
param_grid = {
    'n_estimators': [2 * n + 1 for n in range(20)],
    'max_depth': [2 * n + 1 for n in range(10)],
    # 'auto' is omitted: recent scikit-learn releases no longer accept it.
    'max_features': ['sqrt', 'log2']
}

In [143]:
random_forest_classifier = RandomForestClassifier()
random_forest_classifier.get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [144]:
search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='accuracy'
)
search.fit(X_train, y_train)

In [145]:
search.best_score_

0.9706922435362803

In [146]:
search.best_params_

{'max_depth': 5, 'max_features': 'log2', 'n_estimators': 13}

In [147]:
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))

{'Test Accuracy': 0.948905109489051, 'Train Accuracy': 0.9871794871794872}
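
Grid search cost grows with the product of the option counts, so it helps to count the fits before launching a search. A rough sketch follows, assuming a grid shaped like the one above with two <code>max_features</code> options and the default of 5 cross-validation folds in <code>GridSearchCV</code>; <code>demo_grid</code> is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# A grid shaped like the one searched above, assuming two max_features options.
demo_grid = {
    'n_estimators': [2 * n + 1 for n in range(20)],
    'max_depth': [2 * n + 1 for n in range(10)],
    'max_features': ['sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5  # GridSearchCV's default number of cross-validation folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```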


<h2 id="practice">Practice</h2>


Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol, and Na_to_K (sodium-to-potassium ratio) of the patients, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem. You can use the training part of the dataset to build a Random Forest, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.


In [148]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv", delimiter=",")
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


In [149]:
df.dtypes

Age              int64
Sex             object
BP              object
Cholesterol     object
Na_to_K        float64
Drug            object
dtype: object

In [150]:
Y_COLUMN = 'Drug'

In [151]:
df[Y_COLUMN].unique()

array(['drugY', 'drugC', 'drugX', 'drugA', 'drugB'], dtype=object)

In [157]:
FEATURE_COLUMNS = [colname for colname in df.columns if colname != Y_COLUMN]

Import the classes needed for the ColumnTransformer, the Pipeline, and preprocessing.

In [152]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

Start by encoding the y column.

In [153]:
label_encoder = LabelEncoder()
# fit_transform learns the classes and encodes the column in one step.
df[Y_COLUMN] = label_encoder.fit_transform(df[Y_COLUMN])
df[Y_COLUMN].unique()

array([4, 2, 3, 0, 1])
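
The integer codes above follow the sorted order of the class names, and they can be mapped back to drug names when reporting predictions. A minimal sketch, using a separate encoder named <code>encoder</code> (a hypothetical name) fitted on the five drug labels:

```python
from sklearn.preprocessing import LabelEncoder

drug_names = ['drugY', 'drugC', 'drugX', 'drugA', 'drugB']

encoder = LabelEncoder().fit(drug_names)
print(list(encoder.classes_))  # classes are stored in sorted order

codes = encoder.transform(['drugY', 'drugC'])
print(codes)

# Map integer codes back to the original names.
print(encoder.inverse_transform(codes))
```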

Split X and y and then train_test_split.

In [158]:
X = df[FEATURE_COLUMNS]
y = df[Y_COLUMN]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

Define the categorical (label) columns and the ColumnTransformer.

In [154]:
LABEL_COLUMNS = [colname for colname in df.columns if df[colname].dtype == 'object' and colname != Y_COLUMN]
LABEL_COLUMNS

['Sex', 'BP', 'Cholesterol']

In [155]:
column_transformer = ColumnTransformer(
    transformers=[
        ('ordinal_encoder', OrdinalEncoder(), LABEL_COLUMNS)
        ],
    remainder='passthrough'
)
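
By default, <code>OrdinalEncoder</code> assigns codes in sorted order of the category values, so BP would be encoded as HIGH=0, LOW=1, NORMAL=2. If a meaningful order is preferred, the categories can be listed explicitly. The sketch below does this on a hypothetical three-row frame named <code>demo</code>; the chosen orderings are assumptions for illustration, not part of the lab.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows with the same categorical columns as the drug dataset.
demo = pd.DataFrame({
    'Sex': ['F', 'M', 'M'],
    'BP': ['HIGH', 'LOW', 'NORMAL'],
    'Cholesterol': ['HIGH', 'NORMAL', 'HIGH'],
})

# One list per column, in the order the codes 0, 1, 2, ... should follow.
ordered = OrdinalEncoder(categories=[
    ['F', 'M'],
    ['LOW', 'NORMAL', 'HIGH'],
    ['NORMAL', 'HIGH'],
])
encoded = ordered.fit_transform(demo)
print(encoded)
```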

Define the param_grid for GridSearchCV.

In [159]:
param_grid = {
    'n_estimators': [2 * n + 1 for n in range(20)],
    'max_depth': [2 * n + 1 for n in range(10)],
    # 'auto' is omitted: recent scikit-learn releases no longer accept it.
    'max_features': ['sqrt', 'log2']
}

Finally, define the Pipeline.

In [161]:
pipeline = Pipeline(
    steps=[
        ('transformer', column_transformer),
        ('grid_search', GridSearchCV(
            estimator=RandomForestClassifier(),
            param_grid=param_grid,
            scoring='accuracy'
        ))
    ]
)

In [162]:
pipeline.fit(X_train, y_train)

In [163]:
pipeline[-1].best_score_

0.99375

In [164]:
pipeline[-1].best_params_

{'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 17}
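
The pipeline above places <code>GridSearchCV</code> as its final step, so the encoder is fitted once on the whole training set before cross-validation begins. A common alternative is to wrap the whole pipeline inside <code>GridSearchCV</code>, which refits the preprocessing in every fold. The sketch below illustrates that arrangement on hypothetical toy data; the step name <code>classifier</code>, the small grid, and <code>cv=3</code> are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with one categorical and one numeric feature.
X_demo = pd.DataFrame({
    'BP': ['HIGH', 'LOW', 'NORMAL', 'HIGH', 'LOW', 'NORMAL'] * 5,
    'Age': [23, 47, 28, 61, 35, 52] * 5,
})
y_demo = [1, 0, 0, 1, 0, 0] * 5

inner_pipeline = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[('ordinal_encoder', OrdinalEncoder(), ['BP'])],
        remainder='passthrough',
    )),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
demo_search = GridSearchCV(
    estimator=inner_pipeline,
    param_grid={
        'classifier__n_estimators': [5, 15],
        'classifier__max_depth': [1, 3],
    },
    scoring='accuracy',
    cv=3,
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

With this layout, <code>demo_search.predict</code> applies the encoder and the best forest together.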

In [167]:
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

In [169]:
metrics.accuracy_score(y_train, y_train_pred)

1.0

In [170]:
metrics.accuracy_score(y_test, y_test_pred)

1.0