# SVM, Decision Trees, and Random Forest:

This tutorial will focus on Support Vector Machines, Decision Trees, and Random Forest. We will be using the [scikit-learn](https://scikit-learn.org/stable/index.html) package for these models.

In [None]:
# import sklearn objects
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import DecisionBoundaryDisplay

# importing numpy and pandas
import numpy as np
import pandas as pd

# import plotting functions
import matplotlib.pyplot as plt
import matplotlib.colors
from cycler import cycler
binary_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#332288', 'white', '#AA4499'])
plt.rcParams["axes.prop_cycle"] = cycler(
    color=['#332288','#88CCEE','#44AA99','#117733','#999933','#DDCC77','#CC6677','#882255','#AA4499']
    )

# class for holding the random state throughout the notebook.
# this keeps results consistent
class RandomState(object):
    def __init__(self, random_state=None):
        self.random_state = random_state
    def next(self):
        self.random_state,\
            out_state = np.random.default_rng(self.random_state).integers(0, 1e9, size=(2,))
        return out_state

In [None]:
random_state = RandomState(42)
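To make the behaviour of `RandomState.next` concrete, here is a minimal sketch of the same idea written as a plain function. The name `next_seed` is hypothetical and only used for illustration. It assumes, like the class above, that two integers are drawn from a generator seeded with the current state: the first becomes the new state and the second is handed out as a seed.

```python
import numpy as np

def next_seed(state):
    """Hypothetical helper mirroring RandomState.next: returns (new_state, seed)."""
    new_state, seed = np.random.default_rng(state).integers(0, int(1e9), size=2)
    return int(new_state), int(seed)

# starting twice from the same state gives the same sequence of seeds
state_a, seed_a1 = next_seed(42)
state_a, seed_a2 = next_seed(state_a)

state_b, seed_b1 = next_seed(42)
state_b, seed_b2 = next_seed(state_b)

print(seed_a1 == seed_b1, seed_a2 == seed_b2)
```

This is why re-running the notebook from the top should give the same splits and models each time, as long as the cells are run in the same order.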

## Loading the Data:

In [None]:
# loading the dataset
bc_data = datasets.load_breast_cancer(as_frame=True)

In [None]:
# accessing the data or the target from the dataset loaded above
bc_features = bc_data.data
bc_target = bc_data.target

The first 5 lines of this data look as follows:

In [None]:
bc_features.head()

And the targets are as follows:

In [None]:
bc_target.value_counts()
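The counts above are shown against the integer labels. As a small sketch, the labels can be mapped back to their names using the `target_names` attribute of the loaded dataset. The variable names `bc_demo`, `label_names` and `named_counts` are only used here for illustration.

```python
from sklearn import datasets

# loading the dataset again so that this cell is self-contained
bc_demo = datasets.load_breast_cancer(as_frame=True)

# mapping each integer label to its name
label_names = {i: str(name) for i, name in enumerate(bc_demo.target_names)}
named_counts = bc_demo.target.map(label_names).value_counts()

print(label_names)
print(named_counts)
```

In this dataset, the label 0 corresponds to malignant and 1 to benign, which is worth keeping in mind when reading the plots and scores later on.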

We will cut the dataset down to the mean features to make things slightly easier:

In [None]:
bc_mean_features = bc_features[
    ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension']
    ]

We can use [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to visualise the data in two dimensions:

In [None]:
# import TSNE
from sklearn.manifold import TSNE

# pre-processing the data
scaler = StandardScaler()
tsne = TSNE(n_components=2, learning_rate='auto', init='random', random_state=random_state.next())
x = tsne.fit_transform(scaler.fit_transform(bc_mean_features))

# setting the figure
fig, ax = plt.subplots(1,1,figsize=(5,5))

# plotting the data
scatter = ax.scatter(
    x=x[:,0], 
    y=x[:,1], 
    c=bc_target.astype(bool), 
    alpha=0.5, 
    cmap=binary_cmap,
    )

# adding the legend
ax.legend(
    scatter.legend_elements(num=1)[0],
    ['Malignant', 'Benign'], # target 0 is malignant and 1 is benign in this dataset
    loc="upper right", 
    title="Diagnosis",
    )

# set title and labels
ax.set_title('TSNE Plot of Breast Cancer Dataset')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

# showing plot
fig.show()

The values of the features are distributed as follows:

In [None]:
# setting the plotting area
fig, axes = plt.subplots(5,2, figsize=(8,12))

# getting the column names
column_names = bc_mean_features.columns
# getting the colours to make the plot look nicer!
colors = plt.rcParams["axes.prop_cycle"]()

# looping over the subplots and the column names together
for ax, col in zip(np.ravel(axes), column_names):
    # plotting a histogram
    ax.hist(
        bc_mean_features[col], # the data, accessed by the column name 
        color=next(colors)["color"], # the colour to look nicer!
        bins=20 # the number of bins
        )
    # setting the title and labels
    ax.set_title(f"{col.title()} Histogram")
    ax.set_ylabel('Frequency')
    ax.set_xlabel('Value')

# setting plotting formats
fig.subplots_adjust(hspace=0.75, wspace=0.25)

# showing plot
fig.show()

## SVM

[Support Vector Machine](https://en.wikipedia.org/wiki/Support_vector_machine) is a classic machine learning classifier that attempts to separate classes of data using a [hyperplane](https://en.wikipedia.org/wiki/Hyperplane). 

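It works by maximising the width of the gap (the margin) between the two categories. In practice, this is done by minimising the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss) together with a regularisation term, which also lets the model tolerate some misclassified points when the classes are not perfectly linearly separable.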
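As a small illustration of the hinge loss, the sketch below computes it by hand for a few made-up decision scores. It assumes the usual convention that the labels are coded as -1 and +1, and the function name `hinge_loss_per_point` is hypothetical.

```python
import numpy as np

def hinge_loss_per_point(y_signed, scores):
    """Hinge loss max(0, 1 - y * score) for labels coded as -1 / +1."""
    return np.maximum(0.0, 1.0 - y_signed * scores)

# made-up labels and decision scores, for illustration only
y_signed = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, -3.0, 0.2])

losses = hinge_loss_per_point(y_signed, scores)
print(losses)
```

Points that are on the correct side and outside the margin (the first and third) contribute no loss, a point inside the margin (the second) contributes a little, and a misclassified point (the fourth) contributes the most.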

Different kernels can be used to learn differently shaped boundaries between classes, which helps when the classes follow different distributions. In the following examples, we will see where these different kernels are helpful.
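To give a feel for what a kernel is, the following sketch computes the RBF kernel, exp(-gamma * ||x - x'||^2), by hand for three made-up points and compares it with `rbf_kernel` from `sklearn.metrics.pairwise`. The points and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# three made-up points in two dimensions
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5

# squared Euclidean distance between every pair of points
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)

# the RBF kernel is close to 1 for nearby points and close to 0 for distant ones
manual_kernel = np.exp(-gamma * sq_dists)
sklearn_kernel = rbf_kernel(points, gamma=gamma)

print(np.allclose(manual_kernel, sklearn_kernel))
```

Because the kernel value depends only on distance, a model using it can draw curved boundaries around clusters of points, rather than a single straight line.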

See Also: https://scikit-learn.org/stable/modules/svm.html#support-vector-machines

### The Basics

To access the [SVM code from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), we need to import it as follows:

In [None]:
from sklearn.svm import SVC

What arguments can we supply this model and what are the defaults?

In [None]:
SVC().get_params()

These arguments are explained in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). For the following experiments, we will be varying the kernel function to see how it affects the classification performance on different datasets.

We will start by seeing how we can train, and evaluate the performance of our model, and understand the model's decision boundary.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Here are the train and test splits of a synthetic dataset:

In [None]:
X, y = datasets.make_moons(1000, noise=0.15, random_state=random_state.next())

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

In [None]:
# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
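Note that the scaler is fitted on the training data only and then re-used on the test data. The following sketch, on a tiny made-up array, shows what that means in practice. The names `demo_scaler`, `demo_train` and `demo_test` are only used for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny made-up train and test sets with a single feature
demo_train = np.array([[0.0], [2.0], [4.0]])
demo_test = np.array([[2.0], [6.0]])

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean and std from train
demo_test_scaled = demo_scaler.transform(demo_test)  # re-uses the train mean and std

print(demo_scaler.mean_, demo_scaler.scale_)
print(demo_test_scaled.ravel())
```

The test values are shifted and scaled by the training statistics, so the scaled test set does not need to have zero mean or unit variance itself.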

In [None]:
fig, axes = plt.subplots(1,2,figsize=(8,4))
ax1, ax2 = axes

ax1.scatter(x=X_train[:,0], y=X_train[:,1], c=y_train, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax2.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')

ax1.set_title('Train')
ax2.set_title('Test')

fig.show()

The model can be fit as follows:

In [None]:
# start with linear kernel
svc = SVC(kernel='linear', random_state=random_state.next())

In [None]:
svc.fit(X_train, y_train)

Let's evaluate the model and see how well its decision boundary fits the data:

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print(f"The accuracy is {accuracy_score(y_test, svc.predict(X_test))*100:.2f}%")
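The accuracy is simply the fraction of predictions that match the true labels. As a quick check on made-up labels, computing it by hand should agree with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up true labels and predictions, for illustration only
demo_true = np.array([0, 1, 1, 0, 1])
demo_pred = np.array([0, 1, 0, 0, 1])

# fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(demo_true == demo_pred)
sklearn_accuracy = accuracy_score(demo_true, demo_pred)

print(manual_accuracy, sklearn_accuracy)
```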

In [None]:
fig, ax = plt.subplots(1,1,figsize=(5,5))

dbd = DecisionBoundaryDisplay.from_estimator(
    estimator=svc,
    X=X_test,
    grid_resolution=200,
    plot_method='contourf',
    response_method='decision_function',
    ax=ax,
    cmap=binary_cmap,
    alpha=0.5,
    eps=0.3,
    levels=100,
    )

ax.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax.set_title('Linear Kernel')
fig.suptitle('Boundaries on the Test Set', fontsize=20)
fig.show()

This doesn't fit the data correctly: a linear kernel can only draw a straight boundary, which is not suited to this dataset. In the following, we will try several different kernels:

In [None]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

fig, axes = plt.subplots(1,len(kernels),figsize=(len(kernels)*4,4))

# looping over kernels
for nk, kernel in enumerate(kernels):
    ax = np.ravel(axes)[nk] # getting the current axis

    # fitting the model
    svc = SVC(kernel=kernel, random_state=random_state.next())
    svc.fit(X_train, y_train)

    # plotting the decision boundary
    dbd = DecisionBoundaryDisplay.from_estimator(
        estimator=svc,
        X=X_test,
        grid_resolution=200,
        plot_method='contourf',
        response_method='decision_function',
        ax=ax,
        cmap=binary_cmap,
        alpha=0.5,
        eps=0.3,
        levels=100,
        )
    
    # plotting the data
    ax.scatter(
        x=X_test[:,0], y=X_test[:,1], c=y_test, 
        alpha=0.5, cmap=binary_cmap, edgecolor='black'
        )

    # title
    ax.set_title(f'{kernel.title()} Kernel - '\
        f'accuracy {accuracy_score(y_test, svc.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20, y=1.1)
fig.show()

Clearly, in this example, the RBF kernel was the best!

But in which cases are the different kernels better?

In [None]:
# generating datasets
data_dict = {
    'moons': datasets.make_moons(
        1000, noise=0.15, random_state=random_state.next()
        ),
    'circles': datasets.make_circles(
        1000, noise=0.15, factor=0.2, random_state=random_state.next()
        ),
    'blobs': datasets.make_blobs(
        1000, centers=[[1, -1], [1, 1]], cluster_std=0.3, random_state=random_state.next(),
        ),
    }

In [None]:
# kernel names
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

In [None]:
# plotting figure
fig, axes = plt.subplots(
    len(data_dict), len(kernels), figsize=(len(kernels)*4,len(data_dict)*4),
    )

# looping over datasets and kernels
for nd, data in enumerate(data_dict):
    for nk, kernel in enumerate(kernels):
        
        # getting the current axis
        ax = axes[nd, nk]

        # getting the data
        X, y = data_dict[data]

        # train-test splits:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.75, random_state=random_state.next()
            )

        # scaling the data
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # fitting the model
        svc = SVC(kernel=kernel, random_state=random_state.next())
        svc.fit(X_train, y_train)

        # plotting the decision boundary
        dbd = DecisionBoundaryDisplay.from_estimator(
            estimator=svc,
            X=X_test,
            grid_resolution=200,
            plot_method='contourf',
            response_method='decision_function',
            ax=ax,
            cmap=binary_cmap,
            alpha=0.5,
            eps=0.3,
            levels=100,
            )
        
        # plotting the data
        ax.scatter(
            x=X_test[:,0], y=X_test[:,1], c=y_test, 
            alpha=0.5, cmap=binary_cmap, edgecolor='black'
            )

        # title
        ax.set_title(f'{kernel.title()} Kernel - '\
            f'accuracy {accuracy_score(y_test, svc.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20)
# showing plot
fig.show()

### An Example

Now that we have an understanding of how this can be used in our generated examples, let's try to use SVM to predict the classes on the breast cancer dataset that we introduced at the beginning of this notebook.

Our features are as follows:

In [None]:
bc_mean_features.describe()

And our targets are:

In [None]:
bc_target.value_counts()

In [None]:
# turning data from table to arrays
X, y = bc_mean_features.values, bc_target.values

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We will use cross validation to understand which of the SVM models might be the best predictor of breast cancer on this dataset.
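Before running the grid search, it may help to see what a single round of 5-fold cross validation produces. The sketch below is one way to do this with `cross_val_score`. As assumptions for illustration, it uses all of the features of the dataset rather than only the mean features, and it wraps the scaler and the model in a pipeline so that the scaler is re-fitted within each fold.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# loading the data again so that this cell is self-contained
X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)

# the pipeline fits the scaler on the training part of each fold only
demo_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# one accuracy score per fold
fold_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")

print(fold_scores)
print(f"mean accuracy: {fold_scores.mean()*100:.2f}%")
```

A grid search repeats this procedure for every combination of parameters and keeps the combination with the best mean score.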

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# setting up the cross validated grid search
gscv = GridSearchCV(
    estimator=SVC(random_state=random_state.next()), # the model
    param_grid={'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}, # the parameters to change in the search
    scoring='accuracy',  # how to score the parameters
    refit=True, # return the best model fitted on all of the training data
    cv=5, # the number of cross-validated folds
    verbose=4, # print lots of info as the code is running
    )

In [None]:
# fitting the model on the training data, with cross validation
gscv.fit(X_train, y_train)

In [None]:
print('The results were:')
pd.DataFrame(gscv.cv_results_)
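The full results table has many columns. A minimal sketch of how to keep only the most useful ones is given below, using a small grid search on a synthetic dataset so that the cell is self-contained. The names `demo_search` and `summary` are only used for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# a small synthetic dataset and a two-kernel grid search
X_demo, y_demo = datasets.make_moons(300, noise=0.15, random_state=0)
demo_search = GridSearchCV(
    estimator=SVC(),
    param_grid={"kernel": ["linear", "rbf"]},
    scoring="accuracy",
    cv=3,
    )
demo_search.fit(X_demo, y_demo)

# keeping the parameter, the mean and spread of the fold scores, and the rank
summary = (
    pd.DataFrame(demo_search.cv_results_)
    [["param_kernel", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    )
print(summary)
```

The same column selection can be applied to `gscv.cv_results_`, where the parameter columns are named `param_` followed by the parameter name.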

In [None]:
print(f"The best set of parameters was {gscv.best_params_}")

In [None]:
best_model = gscv.best_estimator_
best_model

The accuracy of this model on the test data is:

In [None]:
print(f"The test accuracy is {accuracy_score(y_test, best_model.predict(X_test))*100:.2f}%")

## Decision Trees

A [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree_learning) is a classic machine learning classifier that attempts to separate classes of data by learning a rule-based system, in which each rule is a threshold on a single feature. Because each split only compares one feature against a threshold, we do not actually need to scale the data.
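As a rough check of the claim about scaling, the sketch below trains one tree on raw features and one on standardised features, and compares their predictions. The dataset, `max_depth=4` and the variable names are arbitrary choices for illustration. The two trees should agree, although tiny floating point differences could in principle change a prediction that lies very close to a threshold.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = datasets.make_moons(400, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
    )

# tree trained on the raw features
raw_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# tree trained on the standardised features
demo_scaler = StandardScaler().fit(X_tr)
scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    demo_scaler.transform(X_tr), y_tr
    )

# fraction of test points on which the two trees make the same prediction
agreement = np.mean(
    raw_tree.predict(X_te) == scaled_tree.predict(demo_scaler.transform(X_te))
    )
print(f"The two trees agree on {agreement*100:.1f}% of the test points")
```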

At each iteration, the next split is performed on the feature that optimises the criterion most. For example, when using [Gini Impurity](https://victorzhou.com/blog/gini-impurity/), we want to make a split where the Gini Impurity is reduced the most between before and after the split is made.
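To make this concrete, here is a minimal sketch of the Gini Impurity and of the reduction achieved by a split, computed by hand on made-up labels. The function names `gini` and `gini_reduction` are hypothetical, and the reduction is weighted by the number of samples that fall on each side of the split.

```python
import numpy as np

def gini(labels):
    """Gini Impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return 1.0 - np.sum(proportions ** 2)

def gini_reduction(parent, left, right):
    """Impurity before the split minus the weighted impurity after it."""
    n = len(parent)
    after = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - after

# made-up labels at a node, and two candidate splits
parent = [0, 0, 0, 1, 1, 1]
good_split = gini_reduction(parent, [0, 0, 0], [1, 1, 1])
poor_split = gini_reduction(parent, [0, 0, 1], [0, 1, 1])

print(f"good split: {good_split:.3f}, poor split: {poor_split:.3f}")
```

The split that separates the two classes perfectly removes all of the impurity, so it would be preferred over the split that leaves both sides mixed.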

See also: https://scikit-learn.org/stable/modules/tree.html#decision-trees

### The Basics

The decision tree classifier is easily imported from sklearn:

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
print(f"This has the default parameters:\n {DecisionTreeClassifier().get_params()}")

What each of these parameters refers to can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). We will investigate how this decision tree can be used to classify data:

Similarly to earlier, we will start by seeing how we can train, and evaluate the performance of our model, and understand the model's decision boundary.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Here are the train and test splits of a synthetic dataset:

In [None]:
X, y = datasets.make_moons(1000, noise=0.15, random_state=random_state.next())

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

In [None]:
# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
fig, axes = plt.subplots(1,2,figsize=(8,4))
ax1, ax2 = axes

ax1.scatter(x=X_train[:,0], y=X_train[:,1], c=y_train, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax2.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')

ax1.set_title('Train')
ax2.set_title('Test')

fig.show()

The model can be fit as follows:

In [None]:
# start with the default parameters
dt = DecisionTreeClassifier(random_state=random_state.next())

In [None]:
dt.fit(X_train, y_train)

Let's evaluate the model and see how well its decision boundary fits the data:

In [None]:
from sklearn.metrics import accuracy_score

Without any tuning, this model already performs much better than the linear SVM we started with. Let us try and see why.

In [None]:
print(f"The accuracy is {accuracy_score(y_test, dt.predict(X_test))*100:.2f}%")

In [None]:
fig, ax = plt.subplots(1,1,figsize=(5,5))

dbd = DecisionBoundaryDisplay.from_estimator(
    estimator=dt,
    X=X_test,
    grid_resolution=200,
    plot_method='contourf',
    response_method='predict_proba',
    ax=ax,
    cmap=binary_cmap,
    alpha=0.5,
    eps=0.3,
    levels=100,
    )

ax.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax.set_title('Default Parameters')
fig.suptitle('Boundaries on the Test Set', fontsize=20)
fig.show()

We can see that this mostly fits the data, but has clearly tried to overfit to the few datapoints from the pink class that fall within the purple moon.

Let's study how the max depth of the tree can affect the performance:

In [None]:
max_depths = [1, 2, 5, 10, 100]

fig, axes = plt.subplots(1, len(max_depths), figsize=(len(max_depths)*4,4))

# looping over max_depths
for nmd, max_depth in enumerate(max_depths):
    # getting the current axis
    ax = np.ravel(axes)[nmd]

    # fitting the model
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state.next())
    dt.fit(X_train, y_train)

    # plotting the decision boundary
    dbd = DecisionBoundaryDisplay.from_estimator(
        estimator=dt,
        X=X_test,
        grid_resolution=200,
        plot_method='contourf',
        response_method='predict_proba',
        ax=ax,
        cmap=binary_cmap,
        alpha=0.5,
        eps=0.3,
        levels=100,
        )
    
    # plotting the data
    ax.scatter(
        x=X_test[:,0], y=X_test[:,1], c=y_test, 
        alpha=0.5, cmap=binary_cmap, edgecolor='black'
        )

    # title
    ax.set_title(f'Tree Depth: {max_depth} - '\
        f'accuracy {accuracy_score(y_test, dt.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20, y=1.1)
fig.show()

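Why might a tree depth of 100 and 10 produce the same results? Because `max_depth` is only an upper limit: the tree stops growing earlier once every leaf is pure, or once a stopping condition such as `min_samples_split`, `min_samples_leaf`, or `min_weight_fraction_leaf` prevents further splits. If the fully grown tree is shallower than 10 levels, both settings allow the same tree to be grown.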
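One way to check this is to compare the requested `max_depth` with the depth that the fitted tree actually reached, using its `get_depth()` method. The sketch below does this on a synthetic dataset. The dataset and the dictionary name `grown_depths` are only used for illustration.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = datasets.make_moons(750, noise=0.15, random_state=0)

# the depth actually reached for each requested max_depth (None means no limit)
grown_depths = {}
for max_depth in [1, 2, 5, 10, 100, None]:
    demo_tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    demo_tree.fit(X_demo, y_demo)
    grown_depths[max_depth] = demo_tree.get_depth()

print(grown_depths)
```

Once the requested `max_depth` is larger than the depth of the unrestricted tree, increasing it further makes no difference.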

Let's see how this model performs over different datasets:

In [None]:
# generating datasets
data_dict = {
    'moons': datasets.make_moons(
        1000, noise=0.15, random_state=random_state.next()
        ),
    'circles': datasets.make_circles(
        1000, noise=0.15, factor=0.2, random_state=random_state.next()
        ),
    'blobs': datasets.make_blobs(
        1000, centers=[[1, -1], [1, 1]], cluster_std=0.3, random_state=random_state.next(),
        ),
    }

In [None]:
# max depths
max_depths = [1, 2, 5, 10, 100]

In [None]:
# plotting figure
fig, axes = plt.subplots(
    len(data_dict), len(max_depths), figsize=(len(max_depths)*4,len(data_dict)*4),
    )

# looping over datasets and max_depths
for nd, data in enumerate(data_dict):
    for nmd, max_depth in enumerate(max_depths):
        
        # getting the current axis
        ax = axes[nd, nmd]

        # getting the data
        X, y = data_dict[data]

        # train-test splits:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.75, random_state=random_state.next()
            )

        # scaling the data
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # fitting the model
        dt = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state.next())
        dt.fit(X_train, y_train)

        # plotting the decision boundary
        dbd = DecisionBoundaryDisplay.from_estimator(
            estimator=dt,
            X=X_test,
            grid_resolution=200,
            plot_method='contourf',
            response_method='predict_proba',
            ax=ax,
            cmap=binary_cmap,
            alpha=0.5,
            eps=0.3,
            levels=100,
            )
        
        # plotting the data
        ax.scatter(
            x=X_test[:,0], y=X_test[:,1], c=y_test, 
            alpha=0.5, cmap=binary_cmap, edgecolor='black'
            )

        # title
        ax.set_title(f'Tree Depth: {max_depth} - '\
            f'accuracy {accuracy_score(y_test, dt.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20)
# showing plot
fig.show()

### An Example

Now that we have an understanding of how this can be used in our generated examples, let's try to use a Decision Tree to predict the classes on the breast cancer dataset that we introduced at the beginning of this notebook.

Our features are as follows:

In [None]:
bc_mean_features.describe()

And our targets are:

In [None]:
bc_target.value_counts()

In [None]:
# turning data from table to arrays
X, y = bc_mean_features.values, bc_target.values

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We will use cross validation to understand which parameters in the DT models might be the best predictor of breast cancer on this dataset. We will test different max depths and criteria.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# setting up the cross validated grid search
gscv = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=random_state.next()), # the model
    param_grid={ # the parameters to change in the search
        'max_depth': [1, 2, 5, 10, 20, 50, 100,], # max depth
        'criterion': ['gini', 'entropy', 'log_loss'], # criterion
        }, 
    scoring='accuracy',  # how to score the parameters
    refit=True, # return the best model fitted on all of the training data
    cv=5, # the number of cross-validated folds
    verbose=1, # print a short summary as the code is running
    )

In [None]:
# fitting the model on the training data, with cross validation
gscv.fit(X_train, y_train)

In [None]:
print('The results were:')
pd.DataFrame(gscv.cv_results_).head()

In [None]:
print(f"The best set of parameters was {gscv.best_params_}")

In [None]:
best_model = gscv.best_estimator_
best_model

The accuracy of this model on the test data is:

In [None]:
print(f"The test accuracy is {accuracy_score(y_test, best_model.predict(X_test))*100:.2f}%")

## Random Forest

[Random Forest](https://en.wikipedia.org/wiki/Random_forest) is a classifier that is built on top of the work done by Decision Trees and is a type of ensemble learning model. This is because it uses a "forest" of decision trees when classifying data. 

During training, many decision trees are built, each on a different random sample of the data and considering random subsets of the features when splitting (depending on the parameters). During testing, the predictions of these trees are combined to get a single prediction.
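The idea can be sketched by hand with a handful of decision trees. The code below is only an illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: each tree is trained on a bootstrap sample (drawn with replacement), considers a random subset of the features at each split through `max_features="sqrt"`, and the predicted probabilities of the trees are averaged. The dataset, the number of trees and the variable names are arbitrary choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = datasets.make_moons(500, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
    )

rng = np.random.default_rng(0)
demo_trees = []
for _ in range(25):
    # bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 10**9))
        )
    tree.fit(X_tr[idx], y_tr[idx])
    demo_trees.append(tree)

# combining the trees by averaging their predicted probabilities
probas = np.mean([tree.predict_proba(X_te) for tree in demo_trees], axis=0)
forest_pred = probas.argmax(axis=1)
forest_accuracy = np.mean(forest_pred == y_te)

print(f"The accuracy of the hand-built forest is {forest_accuracy*100:.2f}%")
```

Averaging probabilities follows what the scikit-learn documentation describes for its forests. Letting each tree vote for a single class is another common way of combining them.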

See also https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

### The Basics

The random forest classifier is easily imported from sklearn:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
print(f"This has the default parameters:\n {RandomForestClassifier().get_params()}")

What each of these parameters refers to can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We will investigate how this random forest can be used to classify data:

Similarly to earlier, we will start by seeing how we can train, and evaluate the performance of our model, and understand the model's decision boundary.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Here are the train and test splits of a synthetic dataset:

In [None]:
X, y = datasets.make_moons(1000, noise=0.15, random_state=random_state.next())

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

In [None]:
# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
fig, axes = plt.subplots(1,2,figsize=(8,4))
ax1, ax2 = axes

ax1.scatter(x=X_train[:,0], y=X_train[:,1], c=y_train, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax2.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')

ax1.set_title('Train')
ax2.set_title('Test')

fig.show()

The model can be fit as follows:

In [None]:
# start with the default parameters
rf = RandomForestClassifier(random_state=random_state.next())

In [None]:
rf.fit(X_train, y_train)

Let's evaluate the model and see how well its decision boundary fits the data:

In [None]:
from sklearn.metrics import accuracy_score

Without any tuning, this model already performs much better than the linear SVM we started with. Let us try and see why.

In [None]:
print(f"The accuracy is {accuracy_score(y_test, rf.predict(X_test))*100:.2f}%")

In [None]:
fig, ax = plt.subplots(1,1,figsize=(5,5))

dbd = DecisionBoundaryDisplay.from_estimator(
    estimator=rf,
    X=X_test,
    grid_resolution=200,
    plot_method='contourf',
    response_method='predict_proba',
    ax=ax,
    cmap=binary_cmap,
    alpha=0.5,
    eps=0.3,
    levels=100,
    )

ax.scatter(x=X_test[:,0], y=X_test[:,1], c=y_test, alpha=0.5, cmap=binary_cmap, edgecolor='black')
ax.set_title('Default Parameters')
fig.suptitle('Boundaries on the Test Set', fontsize=20)
fig.show()

We can see that this fits the data much better than a single Decision Tree!

Let's study how the number of trees in the forest can affect the performance:

In [None]:
n_trees = [1, 2, 5, 10, 100,]

fig, axes = plt.subplots(1, len(n_trees), figsize=(len(n_trees)*4,4))

# looping over n_trees
for nmd, n_estimators in enumerate(n_trees):
    # getting the current axis
    ax = np.ravel(axes)[nmd]

    # fitting the model
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state.next())
    rf.fit(X_train, y_train)

    # plotting the decision boundary
    dbd = DecisionBoundaryDisplay.from_estimator(
        estimator=rf,
        X=X_test,
        grid_resolution=200,
        plot_method='contourf',
        response_method='predict_proba',
        ax=ax,
        cmap=binary_cmap,
        alpha=0.5,
        eps=0.3,
        levels=100,
        )
    
    # plotting the data
    ax.scatter(
        x=X_test[:,0], y=X_test[:,1], c=y_test, 
        alpha=0.5, cmap=binary_cmap, edgecolor='black'
        )

    # title
    ax.set_title(f'No. Trees: {n_estimators} - '\
        f'accuracy {accuracy_score(y_test, rf.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20, y=1.1)
fig.show()

Here, because multiple trees are used, and each of them is trained on a different random sample of the training data, Random Forest is less likely to over-fit than a single decision tree trained on all of the training data.
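With the default `bootstrap=True`, each tree is trained on a sample drawn with replacement that is the same size as the training set, so some points appear several times and others not at all. The sketch below estimates what fraction of the training points a single tree sees. The sample size and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# one bootstrap sample: n_samples row indices drawn with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# fraction of distinct training points that ended up in the sample
unique_fraction = np.unique(bootstrap_idx).size / n_samples

print(f"fraction of points seen: {unique_fraction:.3f}")
print(f"1 - 1/e:                 {1 - np.exp(-1):.3f}")
```

Each tree therefore sees roughly 63% of the distinct training points, and a different 63% each time, which is part of what makes the trees disagree with each other on noisy points.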

We see that even with a small number of trees, the performance is good!

Let's see how this model performs over different datasets:

In [None]:
# generating datasets
data_dict = {
    'moons': datasets.make_moons(
        1000, noise=0.15, random_state=random_state.next()
        ),
    'circles': datasets.make_circles(
        1000, noise=0.15, factor=0.2, random_state=random_state.next()
        ),
    'blobs': datasets.make_blobs(
        1000, centers=[[1, -1], [1, 1]], cluster_std=0.3, random_state=random_state.next(),
        ),
    }

In [None]:
# number of trees
n_trees = [1, 2, 5, 10, 100,]

In [None]:
# plotting figure
fig, axes = plt.subplots(
    len(data_dict), len(n_trees), figsize=(len(n_trees)*4,len(data_dict)*4),
    )

# looping over datasets and n_trees
for nd, data in enumerate(data_dict):
    for nmd, n_estimators in enumerate(n_trees):
        
        # getting the current axis
        ax = axes[nd, nmd]

        # getting the data
        X, y = data_dict[data]

        # train-test splits:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.75, random_state=random_state.next()
            )

        # scaling the data
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # fitting the model
        rf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state.next())
        rf.fit(X_train, y_train)

        # plotting the decision boundary
        dbd = DecisionBoundaryDisplay.from_estimator(
            estimator=rf,
            X=X_test,
            grid_resolution=200,
            plot_method='contourf',
            response_method='predict_proba',
            ax=ax,
            cmap=binary_cmap,
            alpha=0.5,
            eps=0.3,
            levels=100,
            )
        
        # plotting the data
        ax.scatter(
            x=X_test[:,0], y=X_test[:,1], c=y_test, 
            alpha=0.5, cmap=binary_cmap, edgecolor='black'
            )

        # title
        ax.set_title(f'No. Trees: {n_estimators} - '\
            f'accuracy {accuracy_score(y_test, rf.predict(X_test))*100:.2f}%')

# figure title
fig.suptitle('Boundaries on the Test Set', fontsize=20)
# showing plot
fig.show()

### An Example

Now that we have an understanding of how this can be used in our generated examples, let's try to use a Random Forest to predict the classes on the breast cancer dataset that we introduced at the beginning of this notebook.

Our features are as follows:

In [None]:
bc_mean_features.describe()

And our targets are:

In [None]:
bc_target.value_counts()

In [None]:
# turning data from table to arrays
X, y = bc_mean_features.values, bc_target.values

# train-test splits:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=random_state.next()
    )

# scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We will use cross validation to understand which parameters in the RF models might be the best predictor of breast cancer on this dataset. We will test different max depths, criteria, and numbers of trees.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# setting up the cross validated grid search
gscv = GridSearchCV(
    estimator=RandomForestClassifier(random_state=random_state.next()), # the model
    param_grid={ # the parameters to change in the search
        'max_depth': [1, 2, 5, 10, 20, 50, 100,], # max depth
        'criterion': ['gini', 'entropy', 'log_loss'], # criterion
        'n_estimators': [1, 2, 5, 10, 100, 200],
        }, 
    scoring='accuracy',  # how to score the parameters
    refit=True, # return the best model fitted on all of the training data
    cv=5, # the number of cross-validated folds
    verbose=1, # print a short summary as the code is running
    )
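This grid is much larger than the earlier ones, so it is worth counting how many models will be fitted before running it. A minimal sketch using `ParameterGrid` is given below. The grid is copied from the cell above, and the names `demo_grid`, `n_candidates` and `n_fits` are only used for illustration.

```python
from sklearn.model_selection import ParameterGrid

# the same grid as in the search above
demo_grid = {
    'max_depth': [1, 2, 5, 10, 20, 50, 100],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'n_estimators': [1, 2, 5, 10, 100, 200],
    }

n_candidates = len(ParameterGrid(demo_grid))  # every combination of parameters
n_folds = 5
n_fits = n_candidates * n_folds + 1  # plus one final refit on all of the training data

print(f"{n_candidates} candidates, {n_fits} fits in total")
```

Every extra value added to one of the lists multiplies the number of fits, so the search can take noticeably longer than the one for a single decision tree.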

In [None]:
# fitting the model on the training data, with cross validation
gscv.fit(X_train, y_train)

In [None]:
print('The results were:')
pd.DataFrame(gscv.cv_results_).head()

In [None]:
print(f"The best set of parameters was {gscv.best_params_}")

In [None]:
best_model = gscv.best_estimator_
best_model

The accuracy of this model on the test data is:

In [None]:
print(f"The test accuracy is {accuracy_score(y_test, best_model.predict(X_test))*100:.2f}%")