
# Random Forests (RF) for classification with Python

Estimated time needed: **45** minutes

## Objectives

After completing this lab you will be able to:

*   Understand the difference between Bagging and Random Forest
*   Understand that Random Forests have less correlation between the predictors in their ensemble, improving accuracy
*   Apply Random Forest
*   Understand hyperparameter selection in Random Forest


In this notebook, you will learn about Random Forests (RF) for classification. Random Forest is similar to Bagging: both train multiple versions of a model and aggregate the ensemble to make a single prediction. RF uses an ensemble of trees and introduces additional randomness into each tree by randomly selecting a subset of the features to consider at each node split. **This makes the predictions of the trees less correlated**, improving results when the models are aggregated. In this lab we will compare the sampling process of RF with that of Bagging, then demonstrate that the predictors in a random forest are less correlated. Finally, we will apply Random Forests to several datasets, using Grid Search to find the optimum hyperparameters.


<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#RFvsBag">What's the difference between RF and Bagging?</a></li>
        <li><a href="#Example">Cancer Data Example</a></li>
        <li><a href="#practice">Practice</a></li>
    </ol>
</div>
<br>
<hr>


Let's first import the required libraries:


In [1]:
import pandas as pd # For dataframe manipulation
import numpy as np
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt
from sklearn import metrics

We suppress warnings to keep the output readable:


In [2]:
import warnings
warnings.filterwarnings('ignore')

This function will calculate the accuracy of the training and testing data given a model.


In [3]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    return  {"test Accuracy":metrics.accuracy_score(y_test, model.predict(X_test)),
             "trian Accuracy": metrics.accuracy_score(y_train, model.predict(X_train))}

This function prints the average correlation between the predictions of the individual estimators in an ensemble and returns their pairwise correlation matrix.


In [4]:
def get_correlation(X_test, y_test, models):
    # Calculates the average correlation between the predictions of the
    # individual estimators and returns the pairwise correlation matrix
    n_estimators=len(models.estimators_)
    predictions=pd.DataFrame({'estimator '+str(n+1):[] for n in range(n_estimators)})
    
    # One column of test-set predictions per fitted estimator
    for key,model in zip(predictions.keys(),models.estimators_):
        predictions[key]=model.predict(X_test.to_numpy())
    
    corr=predictions.corr()
    # Subtract 1/n_estimators to remove the contribution of the diagonal of ones
    print("Average correlation between predictors: ", corr.mean().mean()-1/n_estimators)
    return corr
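
To make the averaging in this helper concrete, here is a minimal sketch in plain Python. The prediction lists are made up for illustration, and `pearson` is a hypothetical stand-in for what `DataFrame.corr()` computes. It shows that subtracting `1/n_estimators` removes the diagonal of ones, leaving the sum of the off-diagonal correlations divided by `n_estimators**2`.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical 0/1 predictions of three trees on six test rows
toy_predictions = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
]
n_estimators = len(toy_predictions)

# Pairwise correlation matrix, one row and column per tree
corr = [[pearson(p, q) for q in toy_predictions] for p in toy_predictions]

# Mean of every entry, minus the share contributed by the diagonal of ones
helper_value = sum(map(sum, corr)) / n_estimators**2 - 1 / n_estimators

# The same quantity written as a sum over off-diagonal entries only
off_diagonal_sum = sum(corr[i][j]
                       for i in range(n_estimators)
                       for j in range(n_estimators) if i != j)
print(helper_value, off_diagonal_sum / n_estimators**2)
```

Because the divisor is `n_estimators**2` rather than `n_estimators * (n_estimators - 1)`, the printed figure is slightly smaller than the true mean of the off-diagonal entries. The gap shrinks as the ensemble grows.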

<h2 id="RFvsBag">What's the difference between RF and Bagging?</h2>


RF is similar to Bagging in that it uses model ensembles to make predictions. As with Bagging, adding more models to an RF does not generally lead to overfitting. In this section, we go over some of the differences between RF and Bagging, using the following dataset.


### About the dataset

We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically, it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company.

This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

The dataset includes information about:

*   Customers who left within the last month – the column is called Churn
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
*   Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
*   Demographic info about customers – gender, age range, and if they have partners and dependents


Load the data from the CSV file:


In [5]:
churn_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv")

churn_df.head()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,longmon,...,pager,internet,callwait,confer,ebill,loglong,logtoll,lninc,custcat,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,4.4,...,1.0,0.0,1.0,1.0,0.0,1.482,3.033,4.913,4.0,1.0
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,9.45,...,0.0,0.0,0.0,0.0,0.0,2.246,3.24,3.497,1.0,1.0
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,6.3,...,0.0,0.0,0.0,1.0,0.0,1.841,3.24,3.401,3.0,0.0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,6.05,...,1.0,1.0,1.0,1.0,1.0,1.8,3.807,4.331,4.0,0.0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,7.1,...,0.0,0.0,1.0,1.0,0.0,1.96,3.091,4.382,3.0,0.0


### Data pre-processing and selection


Let's select some features for the modeling. We also change the target data type to integer, as required by the scikit-learn algorithm:


In [6]:
# Applying feature selection (.copy() avoids modifying a view of the original dataframe)
churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']].copy()
churn_df['churn'] = churn_df['churn'].astype('int')
churn_df.head()

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,1
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,1
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,0


### Bootstrap Sampling

Bootstrap sampling is a method that involves repeatedly drawing samples with replacement from a data source to estimate a model parameter. Scikit-learn has methods for Bagging, but it's helpful to understand bootstrap sampling. We will import <code>resample</code>:


In [7]:
from sklearn.utils import resample # Randomly selecting rows

Consider the first five rows of data:


In [8]:
churn_df[0:5]

Unnamed: 0,tenure,age,address,income,ed,employ,equip,callcard,wireless,churn
0,11.0,33.0,7.0,136.0,5.0,5.0,0.0,1.0,1.0,1
1,33.0,33.0,12.0,33.0,2.0,0.0,0.0,0.0,0.0,1
2,23.0,30.0,9.0,30.0,1.0,2.0,0.0,0.0,0.0,0
3,38.0,35.0,5.0,76.0,2.0,10.0,1.0,1.0,1.0,0
4,7.0,35.0,14.0,80.0,2.0,15.0,0.0,1.0,0.0,0


We can perform a bootstrap sample using the function <code>resample</code>; we see the dataset is the same size, but some rows are repeated:


In [9]:
for n in range(5):
    print(resample(churn_df[0:5])) # Draw 5 rows with replacement, so some rows can repeat
print('')
print('the length is {} rows'.format(len(resample(churn_df[0:5]))))

   tenure   age  address  income   ed  employ  equip  callcard  wireless  \
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0       1.0       1.0   
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0       0.0       0.0   
0    11.0  33.0      7.0   136.0  5.0     5.0    0.0       1.0       1.0   
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0       1.0       0.0   
0    11.0  33.0      7.0   136.0  5.0     5.0    0.0       1.0       1.0   

   churn  
3      0  
2      0  
0      1  
4      0  
0      1  
   tenure   age  address  income   ed  employ  equip  callcard  wireless  \
0    11.0  33.0      7.0   136.0  5.0     5.0    0.0       1.0       1.0   
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0       1.0       0.0   
0    11.0  33.0      7.0   136.0  5.0     5.0    0.0       1.0       1.0   
1    33.0  33.0     12.0    33.0  2.0     0.0    0.0       0.0       0.0   
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0       1.0       0.0   

   churn  
0      1 
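
As a rough sketch of why repeats are expected, the snippet below draws one bootstrap sample of row indices using only the standard library (not `resample`) and measures how many distinct rows it contains. The helper name `bootstrap_indices`, the seed, and the sample size of 1000 are illustrative assumptions.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000
sample = bootstrap_indices(n_rows, rng)

# Fraction of the original rows that appear at least once in the sample
unique_fraction = len(set(sample)) / n_rows
print(len(sample), round(unique_fraction, 3))
```

For large datasets the fraction of distinct rows approaches 1 - 1/e, roughly 63%, so each tree in a bagged ensemble or random forest sees a somewhat different view of the training data.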

### Select Variables at Random


Like Bagging, RF uses an independent bootstrap sample from the training data. In addition, we select $m$ variables at random out of all $M$ possible
variables. Let's do an example.


In [10]:
X=churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']]

There are 7 features:


In [11]:
M=X.shape[1]
M # 7 possible variables.

7

Let us select $m=3$, and randomly sample features from the 5 bootstrap samples from above.


In [12]:
# Above we drew 5 bootstrap samples, each containing 5 rows.
# We now select m=3 features at random, out of the M=7 available, for each sample
m=3

We list the indexes of the features:


In [13]:
feature_index= range(M)
feature_index

range(0, 7)

We can use the function <code>random.sample</code> to randomly select indexes:


In [14]:
# To select randomly from an index
import random
random.sample(feature_index,m) # Select 3 randomly out of 0-6.

[4, 3, 2]

We now randomly select features from the bootstrap samples, mimicking how RF randomly selects a subset of the features for each node to split on:


In [15]:
for n in range(5):

    print("sample {}".format(n)) # To write 'n' in the text
    print(resample(X[0:5]).iloc[:,random.sample(feature_index,m)]) # Random 5 rows, and random 3 attributes

sample 0
    ed  employ  equip
0  5.0     5.0    0.0
2  1.0     2.0    0.0
3  2.0    10.0    1.0
2  1.0     2.0    0.0
3  2.0    10.0    1.0
sample 1
   employ  address  tenure
2     2.0      9.0    23.0
0     5.0      7.0    11.0
2     2.0      9.0    23.0
4    15.0     14.0     7.0
2     2.0      9.0    23.0
sample 2
   income  tenure   age
3    76.0    38.0  35.0
4    80.0     7.0  35.0
4    80.0     7.0  35.0
0   136.0    11.0  33.0
3    76.0    38.0  35.0
sample 3
   employ   age  equip
1     0.0  33.0    0.0
4    15.0  35.0    0.0
0     5.0  33.0    0.0
3    10.0  35.0    1.0
4    15.0  35.0    0.0
sample 4
   tenure   ed   age
1    33.0  2.0  33.0
1    33.0  2.0  33.0
3    38.0  2.0  35.0
3    38.0  2.0  35.0
2    23.0  1.0  30.0


In Random Forest, each tree is trained on a bootstrap sample, and a random subset of features like these is considered at each node when choosing a split.
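
The per-node feature draw can be sketched as follows. This is a simplified illustration using the standard library, not scikit-learn's internal code; `candidate_features` is a hypothetical helper and the seed is arbitrary.

```python
import random

feature_names = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
m = 3                      # features considered at each split
rng = random.Random(0)     # fixed seed so the sketch is repeatable

def candidate_features(names, m, rng):
    """Draw m distinct feature names; a tree would pick its best split among them."""
    return rng.sample(names, m)

# A fresh draw is made at every node, so different nodes see different candidates
for node in range(4):
    print("node", node, candidate_features(feature_names, m, rng))
```

Restricting each split to a few candidates means a single strong feature cannot dominate every tree, which is what lowers the correlation between trees.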


## Train/Test dataset


We already defined X above; let's define y for our dataset:


In [16]:
y = churn_df['churn']
y.head()

0    1
1    1
2    0
3    0
4    0
Name: churn, dtype: int64

We split our dataset into training and test sets:


In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print ('Train set', X_train.shape,  y_train.shape)
print ('Test set', X_test.shape,  y_test.shape)

Train set (140, 7) (140,)
Test set (60, 7) (60,)


### Bagging  Review


In [18]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Bagging improves models that suffer from overfitting: models that do well on the training data but do not generalize well. Decision trees are a prime candidate for this reason; in addition, they are fast to train. We create a <code>BaggingClassifier</code> object with a decision tree as the <code>base_estimator</code>:


In [19]:
n_estimators=20 # Number of trees
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
Bag= BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion="entropy", max_depth = 4,random_state=2)
                       ,n_estimators=n_estimators, random_state=0, bootstrap=True)

We fit the model:


In [20]:
Bag.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                        max_depth=4,
                                                        random_state=2),
                  n_estimators=20, random_state=0)

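The method <code>predict</code> aggregates the predictions of the individual trees into one prediction per test sample, either by voting or, when the base estimator provides class probabilities (as decision trees do), by averaging those probabilities: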


In [21]:
Bag.predict(X_test).shape # Churn or did not churn

(60,)
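
The simplest form of aggregation is a hard majority vote, sketched below on made-up predictions. This is an illustration only: `majority_vote` is a hypothetical helper, and scikit-learn may instead average class probabilities when the base estimator provides them.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return, for each sample, the label predicted by the most trees."""
    n_samples = len(tree_predictions[0])
    votes = []
    for i in range(n_samples):
        labels = [tree[i] for tree in tree_predictions]
        votes.append(Counter(labels).most_common(1)[0][0])
    return votes

# Hypothetical churn predictions (1 = churn) from three trees on four customers
tree_predictions = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
ensemble = majority_vote(tree_predictions)
print(ensemble)
```

Using an odd number of trees avoids ties in a two-class problem such as churn prediction.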

Compared with a single decision tree, Bagging typically leaves the training accuracy roughly unchanged while improving the test accuracy. For this model we obtain:


In [22]:
print(get_accuracy(X_train, X_test, y_train, y_test, Bag))

{'test Accuracy': 0.7333333333333333, 'trian Accuracy': 0.9071428571428571}


The trees are similar to one another; we can see this by displaying the pairwise correlation between the trees' predictions, along with the average correlation.


In [23]:
get_correlation(X_test, y_test,Bag).style.background_gradient(cmap='coolwarm')
# Pairwise correlation between the predictions of the bagged trees

Average correlation between predictors:  0.25400671537472364


Unnamed: 0,estimator 1,estimator 2,estimator 3,estimator 4,estimator 5,estimator 6,estimator 7,estimator 8,estimator 9,estimator 10,estimator 11,estimator 12,estimator 13,estimator 14,estimator 15,estimator 16,estimator 17,estimator 18,estimator 19,estimator 20
estimator 1,1.0,-0.057709,0.152641,0.132379,0.068323,0.195047,0.209679,0.256111,0.177811,0.318511,-0.024845,0.318511,0.209679,0.112611,0.294475,-0.035245,0.161491,0.161491,0.236433,0.015456
estimator 2,-0.057709,1.0,-0.002979,0.335171,0.349647,0.121829,-0.078409,0.013546,0.180022,0.223814,0.451486,-0.074605,-0.078409,0.404443,0.24658,0.481571,0.04413,0.04413,0.215365,-0.059131
estimator 3,0.152641,-0.002979,1.0,0.395985,-0.010903,0.342381,0.455239,0.674356,0.442603,0.359425,-0.092675,0.51917,0.552099,0.296511,0.32485,0.216541,0.561502,0.47973,0.415029,0.006783
estimator 4,0.132379,0.335171,0.395985,1.0,0.456572,0.242393,0.436809,0.427623,0.417131,0.494783,0.051331,0.415618,0.340807,0.405843,0.224442,0.199294,0.375523,0.294475,0.445634,0.19496
estimator 5,0.068323,0.349647,-0.010903,0.456572,1.0,0.362231,-0.011036,0.090878,0.002915,0.409514,0.347826,-0.045502,0.099322,0.434355,0.294475,0.387699,0.068323,0.161491,0.315244,-0.100465
estimator 6,0.195047,0.121829,0.342381,0.242393,0.362231,1.0,0.19803,0.370625,0.183073,0.163299,0.195047,0.244949,0.19803,0.505181,0.605983,0.158114,0.529414,0.195047,0.494975,-0.069338
estimator 7,0.209679,-0.078409,0.455239,0.436809,-0.011036,0.19803,1.0,0.474619,0.564524,0.404226,-0.121393,0.619813,0.738562,0.323942,0.148803,0.062622,0.540752,0.430394,0.140028,0.247156
estimator 8,0.256111,0.013546,0.674356,0.427623,0.090878,0.370625,0.474619,1.0,0.546688,0.464008,-0.074355,0.625402,0.474619,0.256776,0.283884,0.140642,0.669193,0.50396,0.454257,0.020559
estimator 9,0.177811,0.180022,0.442603,0.417131,0.002915,0.183073,0.564524,0.546688,1.0,0.405727,-0.084533,0.491144,0.357359,0.241594,0.188913,0.314275,0.352707,0.177811,0.332877,0.07979
estimator 10,0.318511,0.223814,0.359425,0.494783,0.409514,0.163299,0.404226,0.464008,0.405727,1.0,0.318511,0.466667,0.404226,0.392837,0.178122,0.464758,0.318511,0.318511,0.50037,0.113228


It can be shown that this correlation reduces performance. Random Forest reduces the correlation between trees, improving results.
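
One way to see the effect of correlation is the standard result for averaging $B$ identically distributed quantities with variance $\sigma^2$ and pairwise correlation $\rho$: the variance of the average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The simulation below is a sketch under idealized assumptions (Gaussian outputs with unit variance rather than class labels), and `variance_of_average` is a hypothetical helper.

```python
import random
import statistics

def variance_of_average(rho, n_trees, n_trials, rng):
    """Empirical variance of the mean of n_trees unit-variance variables
    whose pairwise correlation is rho."""
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # component common to every tree
        outputs = [rho**0.5 * shared + (1 - rho)**0.5 * rng.gauss(0, 1)
                   for _ in range(n_trees)]
        averages.append(sum(outputs) / n_trees)
    return statistics.pvariance(averages)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_trees = 20
for rho in (0.0, 0.25, 0.8):
    theory = rho + (1 - rho) / n_trees
    empirical = variance_of_average(rho, n_trees, 5000, rng)
    print(f"rho={rho}: theory={theory:.3f}, simulated={empirical:.3f}")
```

The first term does not shrink as trees are added, so beyond a certain ensemble size the only way to lower the variance further is to lower the correlation between trees.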


## Random Forest


Random forests are a combination of trees such that each tree depends on a random subset of the features and data. **As a result, each tree in the forest is different and usually performs better than Bagging.** The most important parameters are the number of trees and the number of features to sample. First, we import <code>RandomForestClassifier</code>.


In [24]:
from sklearn.ensemble import RandomForestClassifier

Like Bagging, increasing the number of trees improves results and does not lead to overfitting in most cases, but the improvements plateau as you add more trees. For this example, we set the number of trees in the forest to 20 (the default is 100):


In [25]:
n_estimators=20 # 20 trees

The parameter <code>max_features</code> corresponds to $m$, the number of features to consider when looking for the best split. We store the total number of features $M$:


In [26]:
M_features=X.shape[1] # 7

If we have $M$ features, a popular method to determine $m$ is to use the square root of $M$, rounded down:


$m = \lfloor \sqrt{M} \rfloor$


In [27]:
max_features=int(np.floor(np.sqrt(M_features))) # floor(sqrt(7)) = 2
max_features

2

In [28]:
y_test.shape[0] # 60 outcomes to test

60

We used floor to make sure $m$ is an integer.


We create the RF object:


In [29]:
model = RandomForestClassifier(max_features=max_features, n_estimators=n_estimators, random_state=0)

We train the model:


In [30]:
model.fit(X_train,y_train)

RandomForestClassifier(max_features=2, n_estimators=20, random_state=0)

We obtain the training and testing accuracy; we see that RF does better than Bagging:


In [31]:
print(get_accuracy(X_train, X_test, y_train, y_test, model))

{'test Accuracy': 0.8, 'trian Accuracy': 0.9857142857142858}


We see that the trees in RF are less correlated with one another than the trees in Bagging:


In [32]:
get_correlation(X_test, y_test, model).style.background_gradient(cmap='coolwarm')

Average correlation between predictors:  0.2133517931899367


Unnamed: 0,estimator 1,estimator 2,estimator 3,estimator 4,estimator 5,estimator 6,estimator 7,estimator 8,estimator 9,estimator 10,estimator 11,estimator 12,estimator 13,estimator 14,estimator 15,estimator 16,estimator 17,estimator 18,estimator 19,estimator 20
estimator 1,1.0,0.071067,0.339993,0.126674,0.169675,-0.052307,0.308607,0.222375,0.242393,-0.0,0.154303,0.195047,0.385758,0.0533,0.19803,0.122279,0.077152,0.385758,0.25,0.163299
estimator 2,0.071067,1.0,0.174712,0.104427,0.182597,0.100366,0.358218,0.108868,0.113692,-0.031607,0.212007,0.213862,0.285112,0.363636,0.182953,0.493591,0.212007,0.065795,0.355335,0.09671
estimator 3,0.339993,0.174712,1.0,0.287563,0.188913,0.015048,0.234061,0.314055,0.188913,0.158966,0.314772,0.440155,0.234061,0.289947,0.357359,0.212347,0.314772,0.314772,0.261533,0.234895
estimator 4,0.126674,0.104427,0.287563,1.0,0.42127,0.208052,0.336194,0.326761,0.200195,0.401878,0.179825,0.217425,0.258009,0.318682,0.215732,0.299877,0.179825,0.10164,0.278682,0.103429
estimator 5,0.169675,0.182597,0.188913,0.42127,1.0,0.341058,0.231893,0.283884,0.153937,0.140145,0.082285,0.213427,0.082285,0.299735,0.244805,0.18258,0.00748,0.157089,0.387829,0.178122
estimator 6,-0.052307,0.100366,0.015048,0.208052,0.341058,1.0,-0.008071,0.081422,0.112841,0.158966,-0.088782,0.440155,0.07264,0.373585,0.253777,0.058843,0.153351,0.153351,0.183073,0.149478
estimator 7,0.308607,0.358218,0.234061,0.336194,0.231893,-0.008071,1.0,0.282131,0.231893,0.358382,0.285714,0.326762,0.365079,0.263181,0.437978,0.332078,0.047619,0.126984,0.231455,0.125988
estimator 8,0.222375,0.108868,0.314055,0.326761,0.283884,0.081422,0.282131,1.0,0.212015,0.340659,0.205879,0.338727,0.358382,0.189642,0.2789,0.094265,0.205879,0.510885,0.2965,-0.020174
estimator 9,0.242393,0.113692,0.188913,0.200195,0.153937,0.112841,0.231893,0.212015,1.0,0.068276,0.306697,0.294475,0.381501,0.067182,0.340807,0.253715,0.00748,0.082285,0.096957,0.019791
estimator 10,-0.0,-0.031607,0.158966,0.401878,0.140145,0.158966,0.358382,0.340659,0.068276,1.0,0.053376,0.256111,0.282131,0.189642,0.2789,0.021753,0.129628,-0.022875,0.222375,0.14122


<h2 id="Example">Cancer Data Example</h2>

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007) \[[http://mlearn.ics.uci.edu/MLRepository.html](http://mlearn.ics.uci.edu/MLRepository.html)]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

| Field name  | Description                 |
| ----------- | --------------------------- |
| ID          | Sample identifier           |
| Clump       | Clump thickness             |
| UnifSize    | Uniformity of cell size     |
| UnifShape   | Uniformity of cell shape    |
| MargAdh     | Marginal adhesion           |
| SingEpiSize | Single epithelial cell size |
| BareNuc     | Bare nuclei                 |
| BlandChrom  | Bland chromatin             |
| NormNucl    | Normal nucleoli             |
| Mit         | Mitoses                     |
| Class       | Benign or malignant         |

<br>
<br>

Let's load the dataset:


In [33]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv")

df.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


Now let's remove rows that have a ? in the <code>BareNuc</code> column:


In [34]:
len(df)

699

In [35]:
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()]

In [36]:
len(df)

683

In [37]:
df.Class.value_counts()

2    444
4    239
Name: Class, dtype: int64

We obtain the features:


In [38]:
# Feature selection
X =  df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]

X.head()

Unnamed: 0,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1


We obtain the class labels:


In [39]:
y=df['Class'] # Target column
y.head()

0    2
1    2
2    2
3    2
4    2
Name: Class, dtype: int64

We split the data into training and testing sets.


In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (546, 9) (546,)
Test set: (137, 9) (137,)


We use <code>GridSearchCV</code> to search over specified parameter values of the model.


In [41]:
from sklearn.model_selection import GridSearchCV

We create a <code>RandomForestClassifier</code> object and list the parameters using the method <code>get_params()</code>:


In [42]:
model = RandomForestClassifier() # Initiate RF classifier
model.get_params().keys() # To find out what can be adjusted

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

We can use <code>GridSearchCV</code> for an exhaustive search over specified parameter values. Many of the parameters are the same as those of classification trees; let's try different values for <code>max_depth</code>, <code>max_features</code> and <code>n_estimators</code>.


In [43]:
# The parameter grid to search over
# Note: "auto" behaved like "sqrt" for classifiers and was removed in scikit-learn 1.3
param_grid = {'n_estimators': [2*n+1 for n in range(20)], # Odd numbers of trees: 1,3,5,...,39
             'max_depth' : [2*n+1 for n in range(10)], # Odd depths: 1,3,5,...,19
             'max_features':["auto", "sqrt", "log2"]}
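
To get a feel for what the named <code>max_features</code> options mean, the sketch below computes the number of candidate features per split for a few feature counts. The helper `n_candidate_features` is hypothetical and only assumed to mirror how scikit-learn resolves `"sqrt"` and `"log2"`. For classifiers, older scikit-learn versions treated `"auto"` the same as `"sqrt"`.

```python
import math

def n_candidate_features(rule, n_features):
    """Number of features tried per split under a named rule (assumed to
    mirror scikit-learn's 'sqrt' and 'log2' options)."""
    if rule == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if rule == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unknown rule: {rule}")

# Columns: number of features, candidates under "sqrt", candidates under "log2"
for n_features in (5, 7, 9, 64):
    print(n_features,
          n_candidate_features("sqrt", n_features),
          n_candidate_features("log2", n_features))
```

Under these assumptions, all three options resolve to the same value for the nine features of the cancer dataset, so differences between them in the search results would mostly reflect random variation between fits.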


We create the Grid Search object and fit it:


In [44]:
search = GridSearchCV(estimator=model, param_grid=param_grid,scoring='accuracy') # Initiate search object
# The search object usually encompasses A - Model, B - Dictionary to loop through, C - 'CV' or Scoring
search.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'n_estimators': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21,
                                          23, 25, 27, 29, 31, 33, 35, 37, 39]},
             scoring='accuracy')
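
An exhaustive search can be costly, so it is worth counting the work before launching it. The sketch below enumerates the grid with the standard library; the figure of 5 folds assumes the default used by `GridSearchCV` when `cv` is not specified.

```python
from itertools import product

param_grid = {'n_estimators': [2*n+1 for n in range(20)],
              'max_depth': [2*n+1 for n in range(10)],
              'max_features': ["auto", "sqrt", "log2"]}

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_folds = 5  # assumption: GridSearchCV's default when cv is not given
print(len(combinations), "combinations,", len(combinations) * n_folds, "fits")
```

Each fit trains a complete forest, and one more fit is performed at the end to refit the best combination on the whole training set.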

We can see the best mean cross-validated accuracy among the searched parameters was \~98%:


In [45]:
search.best_score_

0.9780817347789826

The best parameter values are:


In [46]:
search.best_params_

{'max_depth': 5, 'max_features': 'auto', 'n_estimators': 19}

We can calculate the accuracy of the best estimator on the training and test data:

In [47]:
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))

{'test Accuracy': 0.9708029197080292, 'trian Accuracy': 0.9853479853479854}


In [48]:
# Train a forest with hand-picked hyperparameters (not the grid-search optimum) for comparison
FinalModel = RandomForestClassifier(max_depth = 3, max_features = 'auto', n_estimators = 25) # Initiate RF classifier
FinalModel.fit(X_train,y_train)
print(get_accuracy(X_train, X_test, y_train, y_test, FinalModel))

{'test Accuracy': 0.9708029197080292, 'trian Accuracy': 0.9816849816849816}


In [49]:
# Plotting the correlations of the model
get_correlation(X_test,y_test, FinalModel).style.background_gradient(cmap='coolwarm')

Average correlation between predictors:  0.8637573124871327


Unnamed: 0,estimator 1,estimator 2,estimator 3,estimator 4,estimator 5,estimator 6,estimator 7,estimator 8,estimator 9,estimator 10,estimator 11,estimator 12,estimator 13,estimator 14,estimator 15,estimator 16,estimator 17,estimator 18,estimator 19,estimator 20,estimator 21,estimator 22,estimator 23,estimator 24,estimator 25
estimator 1,1.0,0.952676,0.968993,0.888307,0.893958,0.936456,0.922654,0.954035,0.893958,0.908149,0.922654,0.857778,0.939417,0.984309,0.904685,0.840123,0.936456,0.859892,0.925122,1.0,0.968228,0.954035,0.939417,0.908149,0.862795
estimator 2,0.952676,1.0,0.921711,0.873638,0.877815,0.921043,0.906756,0.938,0.908842,0.89213,0.906756,0.842529,0.923261,0.937011,0.889411,0.793551,0.921043,0.844269,0.939869,0.952676,0.952676,0.906756,0.923261,0.923261,0.846788
estimator 3,0.968993,0.921711,1.0,0.891283,0.892926,0.905988,0.953448,0.953448,0.892926,0.938474,0.922333,0.858982,0.969477,0.984441,0.905988,0.842989,0.937491,0.860102,0.954724,0.968993,0.937491,0.953448,0.938474,0.938474,0.862026
estimator 4,0.888307,0.873638,0.891283,1.0,0.849827,0.823814,0.877154,0.909004,0.849827,0.895076,0.909004,0.873638,0.895076,0.905744,0.85606,0.788509,0.85606,0.877154,0.881455,0.888307,0.888307,0.877154,0.895076,0.863341,0.818198
estimator 5,0.893958,0.877815,0.892926,0.849827,1.0,0.831631,0.877352,0.938913,0.877733,0.862105,0.877352,0.877815,0.923444,0.908842,0.862795,0.770065,0.862795,0.908133,0.877733,0.893958,0.893958,0.877352,0.892774,0.862105,0.847166
estimator 6,0.936456,0.921043,0.905988,0.823814,0.831631,1.0,0.891273,0.891273,0.862795,0.87688,0.891273,0.826145,0.87688,0.921043,0.872913,0.872201,0.968228,0.797129,0.925122,0.936456,0.904685,0.891273,0.908149,0.87688,0.831631
estimator 7,0.922654,0.906756,0.953448,0.877154,0.877352,0.891273,1.0,0.938009,0.877352,0.953793,0.938009,0.875512,0.953793,0.938,0.922654,0.82887,0.891273,0.845023,0.969693,0.922654,0.954035,0.907014,0.953793,0.953793,0.877352
estimator 8,0.954035,0.938,0.953448,0.909004,0.938913,0.891273,0.938009,1.0,0.908133,0.92291,0.938009,0.875512,0.953793,0.969244,0.922654,0.82887,0.922654,0.907014,0.938913,0.954035,0.954035,0.938009,0.953793,0.92291,0.877352
estimator 9,0.893958,0.908842,0.892926,0.849827,0.877733,0.862795,0.877352,0.908133,1.0,0.862105,0.908133,0.846788,0.923444,0.908842,0.893958,0.832993,0.893958,0.846572,0.9083,0.893958,0.893958,0.938913,0.923444,0.892774,0.9083
estimator 10,0.908149,0.89213,0.938474,0.895076,0.862105,0.87688,0.953793,0.92291,0.862105,1.0,0.92291,0.89213,0.938455,0.923261,0.908149,0.815056,0.87688,0.861142,0.954113,0.908149,0.908149,0.892026,0.938455,0.938455,0.862105


<h2 id="practice">Practice</h2>


Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are Age, Sex, Blood Pressure, and the Cholesterol of the patients, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem: you can use the training part of the dataset to build a random forest, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.


In [50]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv", delimiter=",")
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


Let's create the X and y for our dataset:


In [51]:
X = df[['Age','Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [52]:
y = df["Drug"]
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

Now let's use a <code>LabelEncoder</code> to turn the categorical features into numerical ones:


In [53]:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 


le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])


le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
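
Note that the encoded blood-pressure values above follow alphabetical order (HIGH, LOW, NORMAL) rather than the natural order. The sketch below reproduces that ordering in plain Python; `fit_label_codes` is a hypothetical stand-in for the encoder, and the explicit dictionary shows one alternative.

```python
def fit_label_codes(values):
    """Map each distinct label to its position in sorted order
    (a stand-in for the ordering LabelEncoder uses)."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

bp_codes = fit_label_codes(['LOW', 'NORMAL', 'HIGH'])
print(bp_codes)

# An explicit mapping keeps the natural order LOW < NORMAL < HIGH instead
ordinal_bp = {'LOW': 0, 'NORMAL': 1, 'HIGH': 2}
print(ordinal_bp)
```

Tree-based models split on thresholds, so they can still separate the categories under either coding, but an order-preserving coding may let a single split isolate the low or high end of the scale.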

Split the data into training and testing data with an 80/20 split:


In [54]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (160, 5) (160,)
Test set: (40, 5) (40,)


We can use <code>GridSearchCV</code> for an exhaustive search over specified parameter values.


In [55]:
param_grid = {'n_estimators': [2*n+1 for n in range(20)],
             'max_depth' : [2*n+1 for n in range(10) ],
             'max_features':["auto", "sqrt", "log2"]}

Create a <code>RandomForestClassifier</code> object called <code>model</code>:


In [56]:
model = RandomForestClassifier()

<details><summary>Click here for the solution</summary>

```python
model = RandomForestClassifier()

```

</details>


Create a <code>GridSearchCV</code> object called `search` with `estimator` set to <code>model</code>, <code>param_grid</code> set to <code>param_grid</code>, <code>scoring</code> set to <code>accuracy</code>, and <code>cv</code> set to 3. Then fit the <code>GridSearchCV</code> object to our <code>X_train</code> and <code>y_train</code> data:


In [57]:
search = GridSearchCV(estimator=model, param_grid=param_grid,scoring='accuracy', cv = 3) # Initiate search object
# The search object usually encompasses A - Model, B - Dictionary to loop through, C - 'CV' or Scoring
search.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'n_estimators': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21,
                                          23, 25, 27, 29, 31, 33, 35, 37, 39]},
             scoring='accuracy')

<details><summary>Click here for the solution</summary>

```python
search = GridSearchCV(estimator=model, param_grid=param_grid,scoring='accuracy', cv=3)
search.fit(X_train, y_train)

```

</details>
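
Setting <code>cv</code> to 3 means each parameter combination is scored on three train/validation splits of the training data. The sketch below shows the idea with contiguous folds. It is a simplification: `k_fold_indices` is a hypothetical helper, and for classifiers scikit-learn uses stratified folds that preserve class proportions, which this sketch ignores.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# 160 training rows, as in the split above, divided into 3 folds
for train, validation in k_fold_indices(160, 3):
    print(len(train), "training rows,", len(validation), "validation rows")
```

The score reported for a combination is the mean of its validation scores across the folds.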


We can find the accuracy of the best model.


In [58]:
search.best_score_

1.0

We can find the best parameter values:


In [59]:
search.best_params_

{'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 23}

We can find the accuracy on the test data:


In [60]:
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))

{'test Accuracy': 0.95, 'trian Accuracy': 1.0}


<details><summary>Click here for the solution</summary>

```python
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))
```

</details>


In [61]:
### Repeating the workflow with a similar approach and a few differences (own work)

In [62]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv", delimiter=",")
df.head()

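|   | Age | Sex | BP     | Cholesterol | Na_to_K | Drug  |
| - | --- | --- | ------ | ----------- | ------- | ----- |
| 0 | 23  | F   | HIGH   | HIGH        | 25.355  | drugY |
| 1 | 47  | M   | LOW    | HIGH        | 13.093  | drugC |
| 2 | 47  | M   | LOW    | HIGH        | 10.114  | drugC |
| 3 | 28  | F   | NORMAL | HIGH        | 7.798   | drugX |
| 4 | 61  | F   | LOW    | HIGH        | 18.043  | drugY |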


In [63]:
print(df.select_dtypes('object').columns.to_list())

['Sex', 'BP', 'Cholesterol', 'Drug']


In [64]:
df['Sex'] = df['Sex'].map({'M' : 0, 'F' : 1})
df['BP'] = df['BP'].map({'LOW' : 0, 'NORMAL' : 1, 'HIGH' : 2})
df['Cholesterol'] = df['Cholesterol'].map({'NORMAL' :0,'HIGH':1})

In [65]:
display(df.head())
df.dtypes

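|   | Age | Sex | BP | Cholesterol | Na_to_K | Drug  |
| - | --- | --- | -- | ----------- | ------- | ----- |
| 0 | 23  | 1   | 2  | 1           | 25.355  | drugY |
| 1 | 47  | 0   | 0  | 1           | 13.093  | drugC |
| 2 | 47  | 0   | 0  | 1           | 10.114  | drugC |
| 3 | 28  | 1   | 1  | 1           | 7.798   | drugX |
| 4 | 61  | 1   | 0  | 1           | 18.043  | drugY |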


Age              int64
Sex              int64
BP               int64
Cholesterol      int64
Na_to_K        float64
Drug            object
dtype: object

In [66]:
X = df.drop('Drug', axis = 1)
y = df['Drug']

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (160, 5) (160,)
Test set: (40, 5) (40,)


In [68]:
param_grid = {'n_estimators': [2*n+1 for n in range(20)],
             'max_depth' : [2*n+1 for n in range(10) ],
             'max_features':["auto", "sqrt", "log2"]}

In [69]:
model = RandomForestClassifier()

In [70]:
%%time 
# Initialize the search object with (a) the model, (b) the parameter grid to loop through,
# and (c) the scoring metric and the number of cross-validation folds
search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3)
search.fit(X_train, y_train)

CPU times: user 1min 1s, sys: 1 s, total: 1min 2s
Wall time: 1min 6s


GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'n_estimators': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21,
                                          23, 25, 27, 29, 31, 33, 35, 37, 39]},
             scoring='accuracy')

In [71]:
search.best_score_

1.0

In [72]:
search.best_params_

{'max_depth': 7, 'max_features': 'log2', 'n_estimators': 35}

In [73]:
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))

{'test Accuracy': 0.925, 'trian Accuracy': 1.0}


In [74]:
get_correlation(X_test,y_test, search.best_estimator_).style.background_gradient(cmap='coolwarm')

Average correlation between predictors:  0.7264884473701136


Unnamed: 0,estimator 1,estimator 2,estimator 3,estimator 4,estimator 5,estimator 6,estimator 7,estimator 8,estimator 9,estimator 10,estimator 11,estimator 12,estimator 13,estimator 14,estimator 15,estimator 16,estimator 17,estimator 18,estimator 19,estimator 20,estimator 21,estimator 22,estimator 23,estimator 24,estimator 25,estimator 26,estimator 27,estimator 28,estimator 29,estimator 30,estimator 31,estimator 32,estimator 33,estimator 34,estimator 35
estimator 1,1.0,0.662217,0.685934,0.449088,0.599945,0.572879,0.667689,0.567564,0.634583,0.677684,0.490779,0.596386,0.754314,0.508427,0.666802,0.420946,0.670956,0.530792,0.598422,0.666306,0.641575,0.672327,0.610237,0.663341,0.655113,0.677592,0.499883,0.653325,0.553783,0.440474,0.663294,0.641573,0.325721,0.68547,0.656826
estimator 2,0.662217,1.0,0.888645,0.675788,0.828778,0.845031,0.870556,0.753938,0.874967,0.886788,0.73853,0.787408,0.798432,0.72982,0.895219,0.633108,0.884286,0.804218,0.625254,0.875839,0.858895,0.879782,0.81241,0.852883,0.839823,0.884167,0.344818,0.890998,0.63415,0.628859,0.882806,0.860378,0.412269,0.886885,0.743486
estimator 3,0.685934,0.888645,1.0,0.655046,0.917192,0.87027,0.972219,0.872707,0.943775,0.97832,0.768774,0.88001,0.868517,0.804277,0.97249,0.756207,0.977378,0.8424,0.695938,0.974045,0.953132,0.983615,0.912251,0.971204,0.939785,0.983799,0.421491,0.962435,0.720136,0.737899,0.971851,0.958418,0.361302,0.994397,0.829118
estimator 4,0.449088,0.675788,0.655046,1.0,0.59741,0.580111,0.641502,0.575561,0.617499,0.621147,0.442046,0.51345,0.515341,0.404079,0.616001,0.482307,0.62327,0.564833,0.295829,0.645607,0.564234,0.656888,0.708356,0.629312,0.660162,0.616001,0.056908,0.617333,0.537568,0.493945,0.64652,0.589945,0.54855,0.653331,0.410271
estimator 5,0.599945,0.828778,0.917192,0.59741,1.0,0.882321,0.906001,0.792592,0.905684,0.91187,0.77146,0.816208,0.824409,0.745942,0.9356,0.697061,0.909763,0.864141,0.631748,0.921031,0.896335,0.915988,0.831117,0.902353,0.858665,0.891048,0.39363,0.934072,0.688246,0.686699,0.912199,0.887788,0.315552,0.925198,0.738041
estimator 6,0.572879,0.845031,0.87027,0.580111,0.882321,1.0,0.888199,0.781792,0.929396,0.895672,0.780102,0.826199,0.82247,0.792379,0.923922,0.648633,0.882341,0.908276,0.686838,0.887817,0.857346,0.874876,0.822804,0.843086,0.82619,0.858241,0.400174,0.936621,0.684717,0.613731,0.900972,0.862082,0.379745,0.876764,0.717175
estimator 7,0.667689,0.870556,0.972219,0.641502,0.906001,0.888199,1.0,0.891972,0.929928,0.983509,0.717957,0.851893,0.86392,0.781726,0.957712,0.744333,0.959725,0.823224,0.679637,0.965271,0.935341,0.964726,0.887659,0.954475,0.919567,0.946471,0.432019,0.960637,0.691754,0.721527,0.956976,0.941367,0.324122,0.97753,0.807022
estimator 8,0.567564,0.753938,0.872707,0.575561,0.792592,0.781792,0.891972,1.0,0.853875,0.899687,0.54169,0.729859,0.779184,0.675242,0.849517,0.70237,0.862767,0.722228,0.609493,0.879881,0.85985,0.866224,0.81337,0.86832,0.863103,0.860695,0.379605,0.862079,0.556791,0.639236,0.847644,0.816712,0.258621,0.86832,0.742286
estimator 9,0.634583,0.874967,0.943775,0.617499,0.905684,0.929396,0.929928,0.853875,1.0,0.935766,0.76969,0.813279,0.883864,0.756551,0.950565,0.705162,0.934431,0.907053,0.678024,0.94403,0.945142,0.928978,0.89125,0.905263,0.879103,0.950565,0.426865,0.961673,0.643861,0.706246,0.938816,0.891274,0.375248,0.92788,0.815534
estimator 10,0.677684,0.886788,0.97832,0.621147,0.91187,0.895672,0.983509,0.899687,0.935766,1.0,0.735871,0.870605,0.87231,0.807604,0.973828,0.754402,0.977719,0.840263,0.701212,0.973695,0.954919,0.97241,0.904752,0.960992,0.92702,0.962861,0.441463,0.976069,0.687325,0.690971,0.973548,0.960942,0.355224,0.983485,0.835042
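
The `get_correlation` helper is defined earlier in the notebook and its implementation is not repeated here. As a rough illustration of what an "average correlation between predictors" could mean, the sketch below computes the mean Pearson correlation over all distinct pairs of trees. This is an assumption, not necessarily how `get_correlation` works: the function names `pearson` and `average_pairwise_correlation` and the toy 0/1 predictions are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def average_pairwise_correlation(predictions):
    """Mean correlation over all distinct pairs of trees (diagonal excluded)."""
    pairs = list(combinations(predictions, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical numeric predictions from three trees on six samples
toy_predictions = [
    [0, 1, 1, 0, 1, 0],  # tree 1
    [0, 1, 1, 0, 1, 0],  # tree 2: identical to tree 1
    [0, 1, 0, 0, 1, 1],  # tree 3: disagrees with tree 1 on two samples
]

avg_corr = average_pairwise_correlation(toy_predictions)
print("Average pairwise correlation:", round(avg_corr, 4))
```

A value near 1 means the trees make almost the same predictions, so averaging them adds little; lower values indicate more diverse trees.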


<h2>Want to learn more?</h2>

IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems, and by your enterprise as a whole. A free trial is available through this course: <a href="https://www.ibm.com/analytics/spss-statistics-software?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01">SPSS Modeler</a>.

Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at <a href="https://www.ibm.com/cloud/watson-studio?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01">Watson Studio</a>


### Thank you for completing this lab!

## Author

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01" target="_blank">Joseph Santarcangelo</a>

### Other Contributors

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By           | Change Description   |
| ----------------- | ------- | -------------------- | -------------------- |
| 2022-02-09        | 0.1     | Joseph Santarcangelo | Created Lab Template |
| 2022-05-03        | 0.2     | Richard Ye           | QA pass              |

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
