## Random Forests for Classification

In this notebook, we'll use a **Random Forest Classifier** to predict penguin species based on physical characteristics using the **Palmer Penguins** dataset.

The input features are four physical measurements of each penguin, and the target is the penguin species.

`bill_length_mm` - length of the bill (beak) in millimeters

`bill_depth_mm` - depth of the bill (beak) in millimeters

`flipper_length_mm` - length of the penguin’s flipper in millimeters

`body_mass_g` - weight of the penguin in grams

### 1. Import Required Libraries

In [None]:
from palmerpenguins import load_penguins
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt 
from matplotlib.colors import ListedColormap
import numpy as np
from sklearn.tree import  plot_tree


### 2. Load and Preprocess the Palmer Penguins Dataset

In [None]:
# Load the dataset and drop rows with missing values
penguins = load_penguins().dropna()

# Preprocess the data
le = preprocessing.LabelEncoder()

# Apply encoding to the categorical data
penguins['encoded'] = le.fit_transform(penguins['species'])

# Color map for each species
colours = {'Adelie':'#8966a3','Chinstrap':'#dba162','Gentoo':'#4e7e82',}
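
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why the species end up as 0, 1 and 2 alphabetically. A minimal sketch on a hand-written list (the names `species_demo` and `le_demo` are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, deliberately out of alphabetical order (illustrative data)
species_demo = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(species_demo)

# classes_ is sorted, and the position of each class is its integer code
mapping = {str(name): code for code, name in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into species names
decoded = le_demo.inverse_transform(codes)
print(decoded.tolist())
```

The same `inverse_transform` call can be applied to model predictions to report species names instead of integers.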

### 3. Feature Selection and Train-Test Split

In [None]:

features = [
            'bill_length_mm',
            'bill_depth_mm',
            'flipper_length_mm',
            'body_mass_g'
            ]
var_0 = features[0]
var_1 = features[1]
var_2 = features[2]
var_3 = features[3]
# Class names in LabelEncoder order (alphabetical), i.e. encoded labels 0, 1, 2
class_names = ['Adelie', 'Chinstrap', 'Gentoo']

x = penguins[features].to_numpy()
y = penguins['encoded'].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
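
The split above is random, so the accuracy reported later can change from run to run. If repeatable results are wanted, one option is to pass `random_state`, and `stratify` to keep the class proportions the same in both parts. A small sketch on synthetic labels (the `x_demo` and `y_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 60 / 30 / 10 samples for classes 0 / 1 / 2
y_demo = np.repeat([0, 1, 2], [60, 30, 10])
x_demo = np.arange(len(y_demo) * 4, dtype=float).reshape(-1, 4)

# random_state makes the split repeatable; stratify keeps class proportions
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

The value `random_state=0` is arbitrary; any fixed integer gives a repeatable split.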

### 4. Visualize Feature Relationships

#### a. Bill Length vs. Bill Depth (used for training)

In [None]:
for species in list(colours.keys()):
    mask = penguins[penguins['species'] == species]
    plt.scatter(mask[var_0], mask[var_1],
                c=colours[species], s=64, label=species)
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.axis('tight')
plt.legend()
plt.title('Bill Length vs. Bill Depth')
plt.show()


#### b. Flipper Length vs. Body Mass (also used for training)

In [None]:
for species in list(colours.keys()):
    mask = penguins[penguins['species'] == species]
    plt.scatter(mask[var_2], mask[var_3],
                c=colours[species], s=64, label=species)
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Body Mass (g)')
plt.axis('tight')
plt.legend()
plt.title('Flipper Length vs. Body Mass')
plt.show()
 

### 5. Train a Random Forest Classifier
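Our forest has 6 decision trees, each limited to a depth of 2. With `bootstrap=False`, every tree is fit on the full training set, so the trees differ only through the random subset of features considered at each split.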

In [None]:
clf = RandomForestClassifier(n_estimators=6, max_depth=2, bootstrap=False)
clf.fit(x_train, y_train)
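
To see how the forest combines its trees, the sketch below averages the class probabilities of the individual trees and compares the result with the forest's own `predict_proba`. It uses synthetic data from `make_blobs` as a stand-in for the penguin measurements, and the names `forest`, `x_demo` and `y_demo` are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the penguin data: 3 classes, 4 features
x_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4,
                            random_state=0)

forest = RandomForestClassifier(n_estimators=6, max_depth=2,
                                bootstrap=False, random_state=0)
forest.fit(x_demo, y_demo)

# Average the class probabilities predicted by the individual trees
tree_probas = np.mean(
    [tree.predict_proba(x_demo) for tree in forest.estimators_], axis=0
)

# The forest's own probabilities should match the average of its trees
forest_probas = forest.predict_proba(x_demo)
print(np.allclose(tree_probas, forest_probas))

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[np.argmax(forest_probas, axis=1)]
print(np.array_equal(manual_pred, forest.predict(x_demo)))
```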

### 6. Visualise the random forest for *Explainability*


In [None]:
for i, tree in enumerate(clf.estimators_):  # estimators_ holds the fitted trees of the forest
    plt.figure(figsize=(8, 4))  
    plot_tree(tree, 
              feature_names=features,
              class_names=class_names,
              filled=True,
              rounded=True)
    plt.title(f"Decision Tree {i+1} from Random Forest")
    plt.tight_layout()
    plt.show()


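Note: The splitting criterion here is the Gini impurity (scikit-learn's default, `criterion='gini'`), which is the probability that a data point drawn at random from a node would be classified incorrectly if it were labelled at random according to the class distribution in that node. It is 0 for a pure node and at most 1 - 1/K for K classes, so with three penguin species it lies between 0 and about 0.67.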
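
The `gini` value shown in each tree node can be reproduced by hand from the class counts in that node. A minimal sketch, where `gini_impurity` is a helper written for illustration and the counts are invented:

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

pure_node = gini_impurity([40, 0, 0])      # only one species in the node
mixed_node = gini_impurity([20, 10, 10])   # mostly one species
even_node = gini_impurity([10, 10, 10])    # all three species equally common

print(pure_node, mixed_node, even_node)
```

A pure node gives 0, and the evenly mixed three-class node gives the maximum of 1 - 1/3.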

### 7. Visualise the Feature importance

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 8))  
axes = axes.flatten()  

for i, tree in enumerate(clf.estimators_):
    importances = tree.feature_importances_  # importances of this individual tree, not the whole forest
    indices = np.argsort(importances)
    
    ax = axes[i]
    ax.barh(range(len(indices)), importances[indices], align='center')
    ax.set_yticks(range(len(indices)))
    ax.set_yticklabels([features[j] for j in indices])
    ax.set_xlabel("Importance")
    ax.set_title(f"Tree {i+1}")

plt.tight_layout()
plt.suptitle("Feature Importances of Trees in Random Forest", fontsize=16, y=1.03)
plt.show()
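
The forest also exposes a single `feature_importances_` array. In scikit-learn this is the average of the per-tree importances, renormalised to sum to 1. The sketch below checks that on synthetic `make_blobs` data (a stand-in for the penguin features, with illustrative names):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the penguin data: 3 classes, 4 features
x_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4,
                            random_state=0)
forest = RandomForestClassifier(n_estimators=6, max_depth=2,
                                bootstrap=False, random_state=0)
forest.fit(x_demo, y_demo)

# One row of importances per tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over trees, then renormalise so the importances sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print(np.allclose(averaged, forest.feature_importances_))
```

Plotting `forest.feature_importances_` once gives a summary for the whole forest, while the per-tree plots show how much the individual trees disagree.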

### 8. Evaluate the Model

In [None]:
y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
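
Accuracy alone does not show which species get confused with each other. A confusion matrix and a per-class report from `sklearn.metrics` can fill that gap. The labels below are invented for illustration and are not results from the penguin model:

```python
from sklearn import metrics

# Hand-made labels: one Chinstrap (1) is misclassified as Gentoo (2)
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", accuracy)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall and F1 score for each species
print(metrics.classification_report(
    y_true_demo, y_pred_demo,
    target_names=['Adelie', 'Chinstrap', 'Gentoo']))
```

In the notebook, the same calls would take `y_test` and `y_pred` in place of the demo lists.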

### 9. Visualise the Boundaries

In [None]:
# Create a mesh grid over the two bill features
x1_min, x1_max = penguins[var_0].min() - 0.1, penguins[var_0].max() + 0.1
x2_min, x2_max = penguins[var_1].min() - 0.1, penguins[var_1].max() + 0.1

# The model expects all four features, so hold the other two at their means
x3_mean = penguins[var_2].mean()
x4_mean = penguins[var_3].mean()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                       np.linspace(x2_min, x2_max, 200))
grid_points = np.c_[xx1.ravel(), xx2.ravel(),
                    np.full(xx1.size, x3_mean),
                    np.full(xx1.size, x4_mean)]

# Predict over mesh grid
Z = clf.predict(grid_points) 
Z = Z.reshape(xx1.shape) 

# Plot decision regions
colour_map = ListedColormap(list(colours.values()))
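# Fix the colour range so classes 0, 1, 2 always map to the same species colours
plt.pcolormesh(xx1, xx2, Z, cmap=colour_map, alpha=0.5, shading='auto',
               vmin=0, vmax=len(colours) - 1)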

# Plot all data points (training and test)
for species in list(colours.keys()):
    mask = penguins[penguins['species'] == species]
    plt.scatter(mask[var_0], mask[var_1],
                c=colours[species], s=64, label=species)

plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.axis('tight')
plt.legend()
plt.title('Decision Boundary of Random Forest Classifier')
plt.show()
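
The plot is a two-dimensional slice through a four-feature model: bill length and bill depth vary over the grid while flipper length and body mass stay fixed. A tiny sketch of how the grid rows are assembled, using invented ranges and fill values so the array is easy to inspect:

```python
import numpy as np

# 3 values along the first axis, 2 along the second (illustrative ranges)
xx1, xx2 = np.meshgrid(np.linspace(0.0, 1.0, 3),
                       np.linspace(10.0, 20.0, 2))

# One row per grid cell: two varying features plus two constant ones
grid = np.c_[xx1.ravel(), xx2.ravel(),
             np.full(xx1.size, 200.0),     # stand-in for mean flipper length
             np.full(xx1.size, 4000.0)]    # stand-in for mean body mass

print(xx1.shape)   # grid layout used later to reshape the predictions
print(grid)
```

Because the other two features are held constant, points whose flipper length or body mass is far from the mean can fall on the "wrong" side of the plotted boundary even when the model classifies them correctly.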
