# Random Forest

- Random forest is an ensemble learning method that combines many decision trees to make a final prediction.

- Each decision tree is trained on a random subset of data and features, which increases diversity and reduces overfitting.

- For classification, the final output is chosen by majority voting among all the trees.

- For regression, the final prediction is the average of all tree predictions.

- Random forest is usually more accurate and more stable than a single decision tree because averaging many diverse trees reduces variance.
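The variance claim can be made more precise under a simplifying assumption that is not part of the original notes: suppose each of the $B$ trees $T_b$ has the same prediction variance $\sigma^2$ and any two trees have the same pairwise correlation $\rho$. The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more trees are added, while the first term depends only on how correlated the trees are. This is why a random forest both averages many trees and tries to keep them decorrelated.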

## 1. Important Terms in Random Forest 

1. **Root Node**
- The starting point of the tree where the entire training data is given. The first split happens at this node.

2. **Splitting**
- The process of dividing data into smaller groups using conditions. Methods like Gini and Entropy are used to select the best split.

3. **Decision Nodes**
- Nodes that appear after splitting and lead the path toward leaf nodes. They contain further conditions based on features.

4. **Leaf Nodes (Terminal Nodes)**
- These are the endpoints of the tree where no further splitting occurs. Final predictions are made here.

5. **Random Forest Context**
- In a random forest, many such trees are created and each tree uses these nodes. The final prediction is based on the combined decision of multiple trees.
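The terms above can be illustrated with a tiny hand-built tree. This is only a sketch: the class names `Leaf` and `DecisionNode`, the `predict` helper, and the thresholds are hypothetical and are not taken from any trained model.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # class label returned at a terminal node


@dataclass
class DecisionNode:
    feature: str      # feature tested at this node
    threshold: float  # go left if value <= threshold, otherwise right
    left: Union["DecisionNode", Leaf]
    right: Union["DecisionNode", Leaf]


def predict(node, sample):
    """Walk from the root through decision nodes until a leaf is reached."""
    while isinstance(node, DecisionNode):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# The root node holds the first split; thresholds here are made up.
root = DecisionNode(
    feature="flipper_length_mm",
    threshold=206.0,
    left=DecisionNode(
        feature="bill_length_mm",
        threshold=43.0,
        left=Leaf("Adelie"),
        right=Leaf("Chinstrap"),
    ),
    right=Leaf("Gentoo"),
)

sample = {"flipper_length_mm": 181.0, "bill_length_mm": 39.1}
label = predict(root, sample)
print(label)
```

A random forest holds many such trees and combines the labels they return.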

## 2. Working of Random Forest

1. **Random Sampling with Replacement**
- Each tree in the random forest is trained on a bootstrap sample, meaning a random subset of the dataset is selected with replacement.

2. **Feature Selection for Splits**
- At each split in a tree, only a random subset of features is considered. This reduces correlation between trees and improves accuracy.

3. **Best Split Selection**
- Each decision tree chooses the best split using measures like Gini Impurity or Information Gain to separate classes effectively.

4. **Bootstrap Aggregation (Bagging)**
- Random forest is an ensemble technique that uses bagging. Each tree makes a prediction and the final output is based on majority vote or averaging.

5. **Improved Stability and Accuracy**
- Combining multiple trees reduces variance and makes the model more stable compared to a single decision tree.
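The bootstrap step described above can be sketched with the standard library alone. The helper name `bootstrap_indices` and the row count are assumptions made for illustration.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000           # hypothetical dataset size

indices = bootstrap_indices(n_rows, rng)

# Some rows are drawn several times, others not at all.
unique_fraction = len(set(indices)) / n_rows
out_of_bag = set(range(n_rows)) - set(indices)

print(f"unique rows in sample: {unique_fraction:.3f}")
print(f"rows left out (out-of-bag): {len(out_of_bag)}")
```

On average about 63% of the rows (1 - 1/e) appear in a given bootstrap sample, so each tree sees a somewhat different view of the data.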

### 2.1 Feature Selection in Random Forest:

**Classification Problems**
- By default, the number of features considered at each split is the square root of the total number of features.
For example, with 16 features, only sqrt(16) = 4 features are considered at each split.

**Regression Problems**
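- The classic recommendation for regression is to consider one third of the total number of features at each split.
For example, with 15 features, only 15 / 3 = 5 features are considered at each split. Note that scikit-learn's `RandomForestRegressor` considers all features by default (`max_features=1.0`), so `max_features` must be set explicitly to apply the one-third rule there.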

**Purpose of Random Feature Selection**
- This randomness reduces correlations between trees and makes the ensemble more robust and accurate.
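The two rules of thumb can be written as a small helper. The function name `features_per_split` is hypothetical, and rounding down with a minimum of one feature is an assumption about how fractional values are handled.

```python
import math


def features_per_split(n_features, task):
    """Number of candidate features per split under the rules of thumb above."""
    if task == "classification":
        # Square-root rule, rounded down, never fewer than one feature.
        return max(1, int(math.sqrt(n_features)))
    if task == "regression":
        # One-third rule, rounded down, never fewer than one feature.
        return max(1, n_features // 3)
    raise ValueError(f"unknown task: {task!r}")


print(features_per_split(16, "classification"))
print(features_per_split(15, "regression"))
print(features_per_split(7, "classification"))
```

With the 7 features used later in this notebook, the square-root rule gives 2 candidate features per split.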

### 2.2 Bootstrap Aggregation (Bagging):

1. **Multiple Decision Trees**
- Random forest builds many decision trees. Each tree is trained on a different bootstrap sample, which means a random sample of the dataset taken with replacement.

2. **Different Splits and Paths**
- Because each tree gets different data, the splits chosen by the trees are different. This creates diversity in the model.

3. **Prediction from Each Tree**
- For a new input, every tree in the forest predicts a class label.
For example, one tree may predict Chinstrap, another Adelie, and the remaining trees may vote differently again.

4. **Majority Voting**
- The final prediction is based on majority vote among all the trees.
Example: If 2 trees predict Chinstrap and 1 predicts Adelie, the output becomes Chinstrap.

5. **Why Bagging Works**
- It reduces variance and limits overfitting by averaging the predictions of multiple diverse trees.
The ensemble is usually more accurate and stable than any single decision tree.
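The aggregation step can be sketched as follows. The helper names and the regression values are hypothetical; the votes mirror the Chinstrap/Adelie example above.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Regression forests average the numeric outputs of the trees."""
    return mean(values)


# Classification: two trees say Chinstrap, one says Adelie.
tree_votes = ["Chinstrap", "Adelie", "Chinstrap"]
final_label = majority_vote(tree_votes)

# Regression: made-up body mass estimates from three trees.
tree_estimates = [3700.0, 3800.0, 3750.0]
final_value = average_prediction(tree_estimates)

print(final_label, final_value)
```

This shows hard voting. scikit-learn's `RandomForestClassifier` instead averages the class probabilities predicted by the trees and picks the class with the highest mean probability, which often but not always matches the hard vote.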

### 2.3 Splitting Methods

**Gini Impurity**

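- Measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in the node.

- Value ranges from 0 (perfectly pure) up to 1 - 1/k for k classes, which is 0.5 for two classes, so it approaches but never reaches 1.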

- Lower Gini is preferred for splits.

**Information Gain**

- Measures how much a split reduces uncertainty about the class.

- Calculated as the entropy before the split minus the weighted average entropy of the child nodes after the split.

- Higher information gain means a better split.

**Entropy**

- Measures randomness or uncertainty in the data.

- High entropy means mixed classes; low entropy means purer nodes.

- Used in calculating information gain.

**How Splitting Works**

- At each node, the algorithm evaluates the candidate features (in a random forest, only the random subset drawn for that split) using Gini or Information Gain.

- The feature and threshold that give the best purity improvement are selected for the split.
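The three measures can be computed by hand for a toy split. This is a minimal sketch with hypothetical helper names, not the implementation scikit-learn uses internally.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_children


# A perfectly mixed parent node split into two pure children.
parent = ["Adelie"] * 4 + ["Gentoo"] * 4
left = ["Adelie"] * 4
right = ["Gentoo"] * 4

print(gini(parent), entropy(parent), information_gain(parent, left, right))
```

For the two-class parent above, Gini is at its two-class maximum of 0.5, entropy is 1 bit, and a split into pure children recovers the full 1 bit as information gain.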

## **Hands-On Practical Implementation of Random Forest**

#### Importing the Libraries

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns

#### Loading the Data Set

In [15]:
data_set = sns.load_dataset("penguins")
data_set.head()

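| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |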


In [42]:
data_set.shape

(344, 7)

In [16]:
data_set.info()  # info is a method; without the parentheses the notebook only echoes the bound method

In [7]:
data_set.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [44]:
data_set.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

### Dropping the Null Values

In [17]:
data_set.dropna(inplace = True)

### Checking the data set

In [18]:
data_set.isnull().sum()  # null values have been dropped

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

## Feature Engineering

#### One-Hot Encoding: Transforming Categorical Data into Numeric

In [48]:
data_set.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [49]:
pd.get_dummies(data_set['sex']).head()

| | Female | Male |
| --- | --- | --- |
| 0 | False | True |
| 1 | True | False |
| 2 | True | False |
| 4 | True | False |
| 5 | False | True |


In [21]:
sex = pd.get_dummies(data_set['sex'],drop_first=True)
sex.head()

| | Male |
| --- | --- |
| 0 | True |
| 1 | False |
| 2 | False |
| 4 | False |
| 5 | True |


In [19]:
island = pd.get_dummies(data_set['island'] , drop_first=True)
island.head()

| | Dream | Torgersen |
| --- | --- | --- |
| 0 | False | True |
| 1 | False | True |
| 2 | False | True |
| 4 | False | True |
| 5 | False | True |


### Concatenate the above two data frames to the original data_set

In [22]:
new_data = pd.concat([data_set,island , sex] ,axis = 1)

In [23]:
new_data.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | Dream | Torgersen | Male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | False | True | True |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | False | True | False |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | False | True | False |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | False | True | False |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | False | True | True |


### Drop the Repeated Columns

In [24]:
new_data.drop(['sex' ,'island'] , axis =1 , inplace = True)

In [25]:
new_data.head()

| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | Dream | Torgersen | Male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | False | True | True |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | False | True | False |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | False | True | False |
| 4 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | False | True | False |
| 5 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | False | True | True |
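The encode, concatenate, and drop steps can also be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below uses a small made-up frame (`frame`) rather than the real penguins data so that it runs without downloading anything.

```python
import pandas as pd

# Small hypothetical frame standing in for the penguins data.
frame = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Dream"],
        "bill_length_mm": [39.1, 46.8, 49.5],
        "sex": ["Male", "Female", "Male"],
    }
)

# Encode both categorical columns at once; the original columns are
# replaced by their dummy columns, and drop_first removes one level each.
encoded = pd.get_dummies(frame, columns=["island", "sex"], drop_first=True)

print(encoded.columns.tolist())
```

One difference from the step-by-step version is that the new columns are prefixed with the source column name (for example `island_Dream` and `sex_Male`), which keeps them unambiguous when several columns are encoded.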


### Creating a Separate Target Variable

In [26]:
Y = new_data.species
Y.head()

0    Adelie
1    Adelie
2    Adelie
4    Adelie
5    Adelie
Name: species, dtype: object

In [65]:
Y.unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [67]:
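Y_encoded = Y.map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})  # Optional: map the categorical labels to integers. scikit-learn also accepts the string labels in Y directly, and the model below is trained on Y.
Y_encoded.head()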

0    0
1    0
2    0
4    0
5    0
Name: species, dtype: int64

### Dropping the Target Variable: species

In [27]:
new_data.drop('species' , inplace = True , axis = 1)

In [28]:
new_data.head()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | Dream | Torgersen | Male |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | False | True | True |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | False | True | False |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | False | True | False |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 | False | True | False |
| 5 | 39.3 | 20.6 | 190.0 | 3650.0 | False | True | True |


In [29]:
X = new_data

### Splitting the data set into Training and test data

In [30]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X , Y , test_size = 0.3 , random_state = 0)

In [31]:
print('X_train' , X_train.shape)
print('X_test' , X_test.shape)
print('Y_train' , y_train.shape)
print('Y_test' , y_test.shape)

X_train (233, 7)
X_test (100, 7)
Y_train (233,)
Y_test (100,)


### Training Random Forest Classification on the Training Set

In [32]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators= 5 , criterion='entropy' , random_state=0)
classifier.fit(X_train , y_train)

| Parameter | Value |
| --- | --- |
| n_estimators | 5 |
| criterion | 'entropy' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


### Predicting the Test Results

In [43]:
y_pred = classifier.predict(X_test)
y_pred

array(['Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie',
       'Chinstrap', 'Gentoo', 'Gentoo', 'Chinstrap', 'Gentoo', 'Adelie',
       'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Gentoo', 'Adelie',
       'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo',
       'Gentoo', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo',
       'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo', 'Gentoo',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo',
       'Adelie', 'Chinstrap', 'Adelie', 'Gentoo', 'Adelie', 'Adelie',
       'Gentoo', 'Gentoo', 'Chinstrap', 'Gentoo', 'Gentoo', 'Chinstrap',
       'Gentoo', 'Chinstrap', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo',
       'Adelie', 'Gentoo', 'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo',
       'Chinstrap', 'Ge

### Confusion Matrix

In [35]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report , accuracy_score

In [36]:
cm = confusion_matrix(y_test , y_pred)
print(cm)

[[48  0  0]
 [ 2 14  0]
 [ 0  0 36]]


In [37]:
accuracy_score(y_test,y_pred)

0.98

In [38]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

      Adelie       0.96      1.00      0.98        48
   Chinstrap       1.00      0.88      0.93        16
      Gentoo       1.00      1.00      1.00        36

    accuracy                           0.98       100
   macro avg       0.99      0.96      0.97       100
weighted avg       0.98      0.98      0.98       100
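The numbers in the report can be checked by hand from the confusion matrix printed earlier, where rows are the true classes and columns are the predicted classes. The helper names below are hypothetical.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [48, 0, 0],
    [2, 14, 0],
    [0, 0, 36],
]
labels = ["Adelie", "Chinstrap", "Gentoo"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total


def precision(i):
    """Correct predictions of class i divided by all predictions of class i."""
    predicted_as_i = sum(row[i] for row in cm)
    return cm[i][i] / predicted_as_i


def recall(i):
    """Correct predictions of class i divided by all true members of class i."""
    return cm[i][i] / sum(cm[i])


print(f"accuracy = {accuracy:.2f}")
for i, name in enumerate(labels):
    print(f"{name}: precision = {precision(i):.2f}, recall = {recall(i):.2f}")
```

The two Chinstrap penguins predicted as Adelie explain both the Chinstrap recall of 14/16 and the Adelie precision of 48/50.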



## Trying a Different Number of Trees and the Gini Criterion

In [39]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=7 , criterion='gini' , random_state=0)
classifier.fit(X_train , y_train)

| Parameter | Value |
| --- | --- |
| n_estimators | 7 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [40]:
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)

0.99

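### Note: **With 7 trees and the Gini criterion, test accuracy rises from 98% to 99%, which is one additional correct prediction out of 100. Because both the number of trees and the criterion changed, the gain cannot be attributed to the extra trees alone.**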
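A fairer comparison varies one setting at a time and scores each configuration with cross-validation instead of a single split. The sketch below is an illustration only: it uses synthetic data from `make_classification` rather than the penguins data, and the names `X_demo`, `y_demo`, and `results` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem with 7 features (not the penguins data).
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=7,
    n_informative=4,
    n_classes=3,
    random_state=0,
)

results = {}
for n_trees in (5, 7, 50):
    for criterion in ("gini", "entropy"):
        model = RandomForestClassifier(
            n_estimators=n_trees, criterion=criterion, random_state=0
        )
        # Mean accuracy over 5 folds is less noisy than one train/test split.
        scores = cross_val_score(model, X_demo, y_demo, cv=5)
        results[(n_trees, criterion)] = scores.mean()

for (n_trees, criterion), score in sorted(results.items()):
    print(f"n_estimators={n_trees:>2}, criterion={criterion:<7} -> {score:.3f}")
```

Reading the table row by row shows the effect of the tree count for a fixed criterion, and the effect of the criterion for a fixed tree count.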