In [1]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Random Forest


Random Forest is one of the most popular machine learning algorithms. It is an ensemble method built on Bootstrap Aggregation (bagging): many decision trees are each trained on a bootstrap sample of the training data, and their predictions are combined by majority vote.

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*i0o8mjFfCn-uD79-F1Cqkw.png)
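To make the bagging idea concrete, here is a minimal sketch using only NumPy. The toy data, the `fit_stump` helper, and the use of one-threshold "stumps" instead of full decision trees are all assumptions made for illustration. A real random forest also picks a random subset of features at each split, which is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (made up): class 0 centred at 0, class 1 centred at 3
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
y = np.array([0] * 50 + [1] * 50)

def fit_stump(x, labels):
    """Hypothetical base learner: the threshold t that best separates the labels (predict 1 if x > t)."""
    candidates = np.unique(x)
    accuracies = [((x > t).astype(int) == labels).mean() for t in candidates]
    return candidates[int(np.argmax(accuracies))]

n_trees = 25  # odd, so a majority vote can never tie
thresholds = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
    thresholds.append(fit_stump(X[idx], y[idx]))

def predict(x):
    votes = np.array([(x > t).astype(int) for t in thresholds])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)  # aggregate by majority vote

bagged_accuracy = (predict(X) == y).mean()
print("Accuracy of the bagged stumps:", bagged_accuracy)
```

Each stump sees a slightly different resample of the data, so the individual thresholds vary; the vote averages out that variation.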

We will apply a Random Forest classifier to the task of classifying penguin species. To optimize performance and accuracy, we will focus on two key parameters: max_depth and n_estimators.

- n_estimators: number of trees in the forest
- max_depth: controls how deep each decision tree can grow
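One way to see the effect of these two parameters is to loop over a few values and record train and test accuracy for each combination. The sketch below is an illustration only: it uses scikit-learn's bundled iris data as a stand-in so it runs without a download, and the chosen values and the `results` dictionary are assumptions, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): iris instead of the penguins dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

results = {}
for n_estimators in (5, 50):
    for max_depth in (1, 3, None):  # None lets each tree grow until its leaves are pure
        rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
        rf.fit(X_train, y_train)
        # score() returns the mean accuracy on the given data
        results[(n_estimators, max_depth)] = (rf.score(X_train, y_train), rf.score(X_test, y_test))

for (n_estimators, max_depth), (train_acc, test_acc) in results.items():
    print(f"n_estimators={n_estimators:>3} max_depth={max_depth!s:>4} train={train_acc:.3f} test={test_acc:.3f}")
```

Deeper trees usually fit the training set more closely; whether that also helps on the test set depends on the data.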

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
penguins = sns.load_dataset("penguins")

In [4]:
features = ['bill_length_mm']  # add features per iteration, such as 'body_mass_g'
X = penguins[features]
y = penguins['species']

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [6]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='entropy', n_estimators=5, max_depth=3)
rf.fit(X_train, y_train) #fit the random forest to the training data

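To evaluate the performance of our model, we compare its predictions with the true values. A classification report and a confusion matrix give a per-class breakdown of that comparison. Here we start with plain accuracy, the fraction of predictions that match the true labels, computed on both the training and the test data.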

In [7]:
def calculate_accuracy(predictions, actuals):
    """Return the fraction of predictions that equal the actual labels."""
    if len(predictions) != len(actuals):
        raise ValueError("The number of predictions does not equal the number of actuals")

    return (predictions == actuals).sum() / len(actuals)

In [8]:
predictionsOnTrainset = rf.predict(X_train)
predictionsOnTestset = rf.predict(X_test)

accuracyTrain = calculate_accuracy(predictionsOnTrainset, y_train)
accuracyTest = calculate_accuracy(predictionsOnTestset, y_test)

print("Accuracy on training set " + str(accuracyTrain))
print("Accuracy on test set " + str(accuracyTest))

Accuracy on training set 0.7739130434782608
Accuracy on test set 0.7192982456140351
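For a per-class view, the classification report and confusion matrix mentioned earlier can be sketched with `sklearn.metrics`. This is a minimal illustration, not the notebook's own evaluation: it trains on scikit-learn's bundled iris data as a stand-in, and the fixed `random_state` is an added assumption to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): iris instead of the penguins dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

rf = RandomForestClassifier(criterion='entropy', n_estimators=5, max_depth=3, random_state=0)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, predictions)
# Precision, recall and F1 score per class
report = classification_report(y_test, predictions)

print(cm)
print(report)
```

The diagonal of the confusion matrix counts the correct predictions, so dividing its sum by the total gives the same accuracy as `calculate_accuracy`.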


## Portfolio assignment 19
30 min: Train a random forest to predict one of the categorical columns of your **own** dataset.
- Prepare the data:<br>
    - <b>Note</b>: Some machine learning algorithms cannot handle missing values. You will need to either:
         - replace missing values (with the mean or the most frequent value). For replacing missing values you can use .fillna(\<value\>) https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html
         - remove rows with missing data. You can remove rows with missing data with .dropna() https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html <br>
- Split your dataset into a train (70%) and test (30%) set.
- Use the train set to fit a RandomForestClassifier. You are free to choose which columns you want to use as feature variables and you are also free to choose the max_depth of the trees.
- Use your random forest model to make predictions for both the train and test set.
<br>
    
![](https://i.imgur.com/0v1CGNV.png)<br>
- Calculate the accuracy for both the train set predictions and test set predictions.
- Is the accuracy different between the train set and the test set? Did you expect this difference?
- Which number of trees, depth and features did you use in each cycle?
- Does the accuracy differ between cycles? Did you expect this difference?
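The data preparation and split steps of the assignment can be sketched as follows. The DataFrame and its column names (`height`, `colour`, `label`) are hypothetical placeholders; substitute your own dataset and decide per column whether filling or dropping makes more sense.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with missing values; replace with your own DataFrame
df = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "colour": ["red", "blue", "red", None, "red", "blue", "blue", "red", "blue", "red"],
    "label": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
})

# Option 1: replace missing values (mean for numeric, most frequent value for categorical)
filled = df.copy()
filled["height"] = filled["height"].fillna(filled["height"].mean())
filled["colour"] = filled["colour"].fillna(filled["colour"].mode()[0])

# Option 2: remove every row that contains a missing value
dropped = df.dropna()

# Split into 70% train and 30% test
train, test = train_test_split(filled, test_size=0.3, random_state=0)
print(len(train), "train rows,", len(test), "test rows")
```

Text columns such as `colour` still need to be encoded as numbers (for example with `pd.get_dummies`) before they can be used as features in a RandomForestClassifier.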



Findings: ...