# RandomForest Classifier

This notebook demonstrates a basic implementation of a RandomForest classifier on the iris dataset.

In [1]:
import pandas as pd
import numpy as np

from RandomForest import RandomForest, build_forest

## Loading the Dataset

The iris dataset is loaded and then shuffled before it is split into training and test sets.

In [2]:
iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
n_samples = len(iris_df)
print(f'# Samples: {n_samples}')

# Samples: 150


In [4]:
# Shuffle all rows; pass random_state (for example random_state=0) for a reproducible shuffle
iris_df = iris_df.sample(frac=1).reset_index(drop=True)
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,6.0,2.2,5.0,1.5,virginica
1,6.2,2.8,4.8,1.8,virginica
2,6.0,2.7,5.1,1.6,versicolor
3,7.1,3.0,5.9,2.1,virginica
4,7.4,2.8,6.1,1.9,virginica


## Splitting Train and Test Set

In [5]:
train_test_ratio = 0.8

In [6]:
n_train_samples = int(train_test_ratio*n_samples)

In [7]:
train_data_df = iris_df.iloc[:n_train_samples]
test_data_df = iris_df.iloc[n_train_samples:]

In [8]:
n_test_samples = len(test_data_df)

In [9]:
print(f'# train samples: {n_train_samples}')
print(f'# test samples: {n_test_samples}')

# train samples: 120
# test samples: 30


## Building The Forest

The forest is built with the `build_forest` function, which takes the following parameters:<br>
`attributes_sampling_rate` is the fraction of attributes that each tree of the forest sees during training.<br>
`data_sampling_rate` is the fraction of the data that each tree sees during training.<br>
`n_trees` is the number of trees that will be built.<br>
`data` is the training dataset.<br>
`label_column` is the name of the column that contains the labels.
<br><br>
Both the attributes and the data are sampled randomly for each tree.
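To make the two sampling rates concrete, the sketch below shows one way the per-tree sampling could work. It is not the code inside the `RandomForest` module: `sample_for_tree` is a hypothetical helper, and sampling without replacement is an assumption made only for illustration.

```python
import random

def sample_for_tree(rows, attributes, attributes_sampling_rate,
                    data_sampling_rate, rng):
    """Hypothetical helper: pick the rows and attributes one tree trains on."""
    # Assumption: sample without replacement and keep at least one item of each.
    n_attributes = max(1, int(len(attributes) * attributes_sampling_rate))
    n_rows = max(1, int(len(rows) * data_sampling_rate))
    sampled_attributes = rng.sample(attributes, n_attributes)
    sampled_rows = rng.sample(rows, n_rows)
    return sampled_rows, sampled_attributes

rng = random.Random(0)  # fixed seed so the sketch is repeatable
attributes = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
rows = list(range(120))  # row indices of the 120 training samples

sampled_rows, sampled_attributes = sample_for_tree(rows, attributes, .5, .5, rng)
print(len(sampled_rows), len(sampled_attributes))  # 60 2
```

With both rates set to `.5`, each tree in this sketch would train on 60 of the 120 rows and 2 of the 4 attributes.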

In [10]:
forest = build_forest(attributes_sampling_rate=.5, data_sampling_rate=.5,
                      n_trees=5, data=train_data_df, label_column='species')

## Making Predictions

To make predictions with the trained forest, call the `predict` method of the `RandomForest` object and pass it the data to predict labels for. Along with each predicted class, the method returns the fraction of trees that voted for that class.
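The class and vote fraction described above can be pictured as a majority vote over the individual trees. The following is a minimal sketch of that idea, not the actual `predict` implementation: `majority_vote` is a hypothetical function, and how the real forest breaks ties is not known from this notebook.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return [label, fraction of trees that voted for that label]."""
    counts = Counter(tree_predictions)
    # most_common(1) returns the label with the highest count; on a tie,
    # the label that was encountered first wins (an assumption of this sketch).
    label, n_votes = counts.most_common(1)[0]
    return [label, n_votes / len(tree_predictions)]

# Five trees, matching n_trees=5 in this notebook.
votes = ['virginica', 'virginica', 'versicolor', 'virginica', 'versicolor']
print(majority_vote(votes))  # ['virginica', 0.6]
```

With five trees the fraction can only take the values 0.2, 0.4, 0.6, 0.8, or 1.0, which matches the values seen in the outputs of this notebook.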

In [11]:
feature_cols = list(train_data_df.columns)
feature_cols.remove('species')
feature_cols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [12]:
forest.predict(train_data_df[feature_cols].iloc[:5])

[['virginica', 0.6],
 ['virginica', 1.0],
 ['versicolor', 0.6],
 ['virginica', 0.8],
 ['virginica', 0.8]]

In [13]:
train_predictions = forest.predict(train_data_df[feature_cols])
# np.asarray stores the mixed [label, fraction] pairs as strings (dtype '<U10' below)
train_predictions = np.asarray(train_predictions)
train_predictions[:5]

array([['virginica', '0.6'],
       ['virginica', '1.0'],
       ['versicolor', '0.6'],
       ['virginica', '0.8'],
       ['virginica', '0.8']], dtype='<U10')
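Because the array above has a string dtype, the vote fractions can no longer be compared numerically. If the fractions are needed as numbers, one option is to unzip the list returned by `predict` instead of converting it to a single array. The sketch below uses the first three predictions shown earlier; the 0.8 threshold is an arbitrary value chosen for illustration.

```python
# Example output copied from the first three predictions shown earlier.
predictions = [['virginica', 0.6], ['virginica', 1.0], ['versicolor', 0.6]]

# Unzip the [label, fraction] pairs so each column keeps its own type.
labels, vote_fractions = (list(column) for column in zip(*predictions))

# The fractions are still floats, so numeric comparisons work directly.
low_confidence = [label for label, fraction in zip(labels, vote_fractions)
                  if fraction < 0.8]

print(labels)          # ['virginica', 'virginica', 'versicolor']
print(vote_fractions)  # [0.6, 1.0, 0.6]
print(low_confidence)  # ['virginica', 'versicolor']
```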

In [14]:
train_predictions_df = train_data_df.copy()
train_predictions_df['predicted_species'] = train_predictions[:, 0]

In [15]:
train_predictions_df.iloc[:10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,predicted_species
0,6.0,2.2,5.0,1.5,virginica,virginica
1,6.2,2.8,4.8,1.8,virginica,virginica
2,6.0,2.7,5.1,1.6,versicolor,versicolor
3,7.1,3.0,5.9,2.1,virginica,virginica
4,7.4,2.8,6.1,1.9,virginica,virginica
5,6.1,2.6,5.6,1.4,virginica,virginica
6,5.5,2.5,4.0,1.3,versicolor,versicolor
7,6.1,2.8,4.0,1.3,versicolor,versicolor
8,5.7,3.0,4.2,1.2,versicolor,versicolor
9,5.1,3.8,1.9,0.4,setosa,setosa


### Predictions on Test Data

In [16]:
test_predictions = forest.predict(test_data_df[feature_cols])
test_predictions = np.asarray(test_predictions)
test_predictions[:5]

array([['virginica', '0.6'],
       ['setosa', '0.8'],
       ['virginica', '0.6'],
       ['virginica', '0.8'],
       ['virginica', '0.8']], dtype='<U10')

In [17]:
test_predictions_df = test_data_df.copy()
test_predictions_df['predicted_species'] = test_predictions[:, 0]

In [18]:
test_predictions_df.iloc[:10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,predicted_species
120,6.3,3.3,6.0,2.5,virginica,virginica
121,4.7,3.2,1.6,0.2,setosa,setosa
122,7.7,2.8,6.7,2.0,virginica,virginica
123,6.9,3.1,5.4,2.1,virginica,virginica
124,6.4,2.7,5.3,1.9,virginica,virginica
125,6.9,3.1,4.9,1.5,versicolor,versicolor
126,5.7,4.4,1.5,0.4,setosa,setosa
127,6.5,3.2,5.1,2.0,virginica,virginica
128,7.2,3.0,5.8,1.6,virginica,versicolor
129,6.3,2.9,5.6,1.8,virginica,virginica


## Evaluating Performance on Train Data


In [19]:
n_train_matches = sum(train_predictions_df['species'] == train_predictions_df['predicted_species'])
train_accuracy = n_train_matches/n_train_samples

In [20]:
print(f'Train Accuracy: {train_accuracy:.1%}')

Train Accuracy: 100.0%


## Evaluating Performance on Test Data


In [21]:
n_test_matches = sum(test_predictions_df['species'] == test_predictions_df['predicted_species'])
test_accuracy = n_test_matches/n_test_samples

In [22]:
print(f'Test Accuracy: {test_accuracy:.1%}')

Test Accuracy: 96.7%
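A single accuracy figure does not show which classes are confused with each other. As a small illustration, the sketch below counts (actual, predicted) pairs for the ten test rows displayed earlier in this notebook. It covers only those ten rows, not the full test set.

```python
from collections import Counter

# (actual, predicted) pairs for the ten test rows displayed earlier.
pairs = [
    ('virginica', 'virginica'), ('setosa', 'setosa'),
    ('virginica', 'virginica'), ('virginica', 'virginica'),
    ('virginica', 'virginica'), ('versicolor', 'versicolor'),
    ('setosa', 'setosa'), ('virginica', 'virginica'),
    ('virginica', 'versicolor'), ('virginica', 'virginica'),
]

# Count how often each actual class was mapped to each predicted class.
confusion = Counter(pairs)
for (actual, predicted), count in sorted(confusion.items()):
    print(f'{actual:>10} -> {predicted:<10} {count}')
```

Among these ten rows the only error is one virginica sample predicted as versicolor. On the full DataFrame, `pd.crosstab(test_predictions_df['species'], test_predictions_df['predicted_species'])` would produce the same kind of counts as a table.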
