# Random Forest Algorithm

Random Forest is an ensemble learning method used for classification and regression tasks. It builds multiple decision trees and combines their predictions to improve accuracy and control overfitting.

## Theory

A Random Forest is a collection of decision trees, each trained on a bootstrap sample of the data (drawn with replacement) and restricted to a random subset of features at each split. The final prediction aggregates the predictions of all individual trees: a majority vote for classification, the average for regression.

### Steps:
1. Draw $T$ bootstrap samples (sampled with replacement) from the original dataset, one per tree.
2. For each sample, grow a decision tree:
    - At each node, select a random subset of $m$ features.
    - Split the node using the best feature among the subset.
3. Aggregate the predictions from all trees.
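
The three steps above can be sketched by hand on top of scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally; the helper names `fit_forest` and `predict_forest`, the synthetic data, and the choice of 25 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Hypothetical helper: grow one tree per bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (row indices drawn with replacement).
        idx = rng.integers(0, n_samples, size=n_samples)
        # Step 2: max_features makes each split consider a random feature subset.
        tree = DecisionTreeClassifier(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Hypothetical helper: step 3, majority vote over all trees."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # votes has shape (n_trees, n_samples); take the most frequent label per column.
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic data, used only to exercise the sketch.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
forest = fit_forest(X_demo, y_demo)
pred = predict_forest(forest, X_demo)
train_accuracy = (pred == y_demo).mean()
```

The accuracy computed here is on the training data, so it only confirms that the pieces fit together; it says nothing about generalization.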

## Mathematical Expressions

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$:

- For each tree $t$:
  - Draw a bootstrap sample $D_t$ from $D$.
  - At each split, select $m$ features at random from the $M$ total features, with $m \le M$ (a common choice for classification is $m \approx \sqrt{M}$).
  - Grow the tree $h_t(x)$ to the largest extent possible, without pruning.

### Prediction

- **Classification:**  
  The final prediction $\hat{y}$ for input $x$ is:
  $$
  \hat{y} = \mathrm{mode}\{h_t(x)\}_{t=1}^T
  $$
  where $T$ is the number of trees.

- **Regression:**  
  The final prediction $\hat{y}$ for input $x$ is:
  $$
  \hat{y} = \frac{1}{T} \sum_{t=1}^T h_t(x)
  $$
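
As a small worked example of the two formulas, the snippet below aggregates made-up per-tree predictions with NumPy. The arrays are hypothetical values chosen for illustration, with one row per tree and one column per input.

```python
import numpy as np

# Classification: 3 trees (rows) voting on 3 inputs (columns).
class_votes = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
])
# Mode of each column, i.e. the majority vote per input.
majority = np.array([np.bincount(col).argmax() for col in class_votes.T])
print(majority)  # [0 1 2]

# Regression: 3 trees (rows) predicting 2 inputs (columns).
reg_preds = np.array([
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 6.0],
])
# Mean over trees, matching (1/T) * sum_t h_t(x).
average = reg_preds.mean(axis=0)
print(average)  # [3. 5.]
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict mode.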

## Advantages

- Handles high-dimensional data well.
- Reduces overfitting by averaging many trees, which lowers the variance of any single tree.
- Provides feature importance estimates.
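
To illustrate the feature importance point, here is a minimal sketch using the `feature_importances_` attribute of a fitted `RandomForestClassifier`. The data is synthetic (the first three columns are informative, the rest are noise), and the names `X_demo`, `y_demo`, and `model` are placeholders for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0-2.
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1.
importances = model.feature_importances_
# Feature indices sorted from most to least important.
ranking = importances.argsort()[::-1]
print(ranking, importances.round(3))
```

These importances are impurity-based and are computed on the training data, so they are best read as a rough ranking rather than an exact measure of each feature's effect.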

## Disadvantages

- Can be computationally intensive: training and prediction cost grow with the number of trees.
- Less interpretable than a single decision tree.

In [217]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [218]:
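# Note: car.data has no header row, so read_csv uses the first record as column names
# (hence the 'vhigh', 'vhigh.1', ... columns and the 1727 rows below).
# Passing header=None would keep that record as data.
df = pd.read_csv('../data/car.data')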

In [219]:
df.isnull().sum()

vhigh      0
vhigh.1    0
2          0
2.1        0
small      0
low        0
unacc      0
dtype: int64

In [220]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety','class']

In [221]:
df

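| | vhigh | vhigh.1 | 2 | 2.1 | small | low | unacc |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 2 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | high | unacc |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1722 | low | low | 5more | more | med | med | good |
| 1723 | low | low | 5more | more | med | high | vgood |
| 1724 | low | low | 5more | more | big | low | unacc |
| 1725 | low | low | 5more | more | big | med | good |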


In [222]:
df.columns = col_names

In [223]:
df

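| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 2 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | high | unacc |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1722 | low | low | 5more | more | med | med | good |
| 1723 | low | low | 5more | more | med | high | vgood |
| 1724 | low | low | 5more | more | big | low | unacc |
| 1725 | low | low | 5more | more | big | med | good |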
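
The original column labels (`vhigh`, `vhigh.1`, `2`, ...) were in fact the first data record, which `read_csv` consumed as a header. A minimal sketch of loading such a headerless file with `header=None` and `names=` follows; it reads three sample records from an in-memory string rather than from `../data/car.data`, and the variable names are placeholders.

```python
import io

import pandas as pd

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Three headerless records in the same layout as car.data.
raw = (
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

# Default behaviour: the first record becomes the header and is lost as data.
default_read = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= supplies the column labels directly.
full_read = pd.read_csv(io.StringIO(raw), header=None, names=col_names)

print(len(default_read), len(full_read))  # 2 3
```

Loading the real file this way would keep the first record, so the row counts would be one higher than those shown in this notebook.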


In [224]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1727 entries, 0 to 1726
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1727 non-null   object
 1   maint     1727 non-null   object
 2   doors     1727 non-null   object
 3   persons   1727 non-null   object
 4   lug_boot  1727 non-null   object
 5   safety    1727 non-null   object
 6   class     1727 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [225]:
df.describe(include='all').T

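| | count | unique | top | freq |
|---|---|---|---|---|
| buying | 1727 | 4 | high | 432 |
| maint | 1727 | 4 | high | 432 |
| doors | 1727 | 4 | 3 | 432 |
| persons | 1727 | 3 | 4 | 576 |
| lug_boot | 1727 | 3 | med | 576 |
| safety | 1727 | 3 | med | 576 |
| class | 1727 | 4 | unacc | 1209 |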


In [226]:
df.value_counts()

buying  maint  doors  persons  lug_boot  safety  class
high    high   2      2        big       high    unacc    1
med     med    4      more     small     med     acc      1
                                         high    acc      1
                               med       med     acc      1
                                         low     unacc    1
                                                         ..
low     low    3      2        med       med     unacc    1
                                         low     unacc    1
                                         high    unacc    1
                               big       med     unacc    1
vhigh   vhigh  5more  more     small     med     unacc    1
Name: count, Length: 1727, dtype: int64

In [227]:
for col in col_names:
    print(df[col].value_counts())

buying
high     432
med      432
low      432
vhigh    431
Name: count, dtype: int64
maint
high     432
med      432
low      432
vhigh    431
Name: count, dtype: int64
doors
3        432
4        432
5more    432
2        431
Name: count, dtype: int64
persons
4       576
more    576
2       575
Name: count, dtype: int64
lug_boot
med      576
big      576
small    575
Name: count, dtype: int64
safety
med     576
high    576
low     575
Name: count, dtype: int64
class
unacc    1209
acc       384
good       69
vgood      65
Name: count, dtype: int64


In [228]:
df.duplicated().sum()

np.int64(0)

In [229]:
df[df.duplicated()]

| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|


## Encoding Values

In [230]:
from sklearn.preprocessing import OrdinalEncoder

In [231]:
encoder = OrdinalEncoder()

In [232]:
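# Encode all columns in one call; OrdinalEncoder assigns codes in alphabetical
# order of the category labels (e.g. class: acc=0, good=1, unacc=2, vgood=3).
df[col_names] = encoder.fit_transform(df[col_names])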

In [233]:
df

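| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 |
| 1 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |
| 2 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| 3 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
| 4 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1722 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 2.0 | 1.0 |
| 1723 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 0.0 | 3.0 |
| 1724 | 1.0 | 1.0 | 3.0 | 2.0 | 0.0 | 1.0 | 2.0 |
| 1725 | 1.0 | 1.0 | 3.0 | 2.0 | 0.0 | 2.0 | 1.0 |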
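
The codes above follow alphabetical order (for example `safety`: high=0, low=1, med=2), which does not match the natural ranking of the categories. One possible alternative, shown here as a sketch on a small hand-made frame, is to pass an explicit order through the `categories` parameter of `OrdinalEncoder`. The orderings used are an assumption about how the levels should rank, and this is not the encoding used in the rest of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small hand-made frame with two of the ordinal features.
demo = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "high"],
    "safety": ["med", "high", "low", "low"],
})

# Default: categories are sorted alphabetically before codes are assigned.
default_codes = OrdinalEncoder().fit_transform(demo)

# Explicit order (assumed ranking): one list per column, lowest level first.
ordered = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
])
ordered_codes = ordered.fit_transform(demo)

print(default_codes[:, 0])  # [3. 1. 2. 0.]
print(ordered_codes[:, 0])  # [3. 0. 1. 2.]
```

Tree-based models split on thresholds, so they can often cope with an arbitrary code order, but an order that reflects the real ranking lets a single split separate "low" from "med and above".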


In [234]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X,y

(      buying  maint  doors  persons  lug_boot  safety
 0        3.0    3.0    0.0      0.0       2.0     2.0
 1        3.0    3.0    0.0      0.0       2.0     0.0
 2        3.0    3.0    0.0      0.0       1.0     1.0
 3        3.0    3.0    0.0      0.0       1.0     2.0
 4        3.0    3.0    0.0      0.0       1.0     0.0
 ...      ...    ...    ...      ...       ...     ...
 1722     1.0    1.0    3.0      2.0       1.0     2.0
 1723     1.0    1.0    3.0      2.0       1.0     0.0
 1724     1.0    1.0    3.0      2.0       0.0     1.0
 1725     1.0    1.0    3.0      2.0       0.0     2.0
 1726     1.0    1.0    3.0      2.0       0.0     0.0
 
 [1727 rows x 6 columns],
 0       2.0
 1       2.0
 2       2.0
 3       2.0
 4       2.0
        ... 
 1722    1.0
 1723    3.0
 1724    2.0
 1725    1.0
 1726    3.0
 Name: class, Length: 1727, dtype: float64)

In [235]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
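
Because the classes are imbalanced (1209 `unacc` against 65 `vgood`), a stratified split is one option for keeping class proportions similar in both partitions. The sketch below rebuilds a label vector from the class counts printed earlier and passes `stratify=` to `train_test_split`. The placeholder features and variable names are assumptions for illustration, and the results reported in this notebook come from the unstratified split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded labels rebuilt from the class counts: acc=0, good=1, unacc=2, vgood=3.
y_demo = np.repeat([0, 1, 2, 3], [384, 69, 1209, 65])
# Placeholder feature column; only the labels matter for this illustration.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Per-class share in each partition; stratification keeps these nearly equal.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3), test_share.round(3))
```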

In [236]:

RFclassifier = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=4,
    min_samples_leaf=2,
    class_weight='balanced',
    random_state=42
)

In [237]:
RFclassifier.fit(X_train,y_train)

In [238]:
y_pred = RFclassifier.predict(X_test)

In [239]:
y_pred

array([2., 2., 2., 0., 2., 0., 2., 2., 2., 2., 3., 2., 2., 2., 2., 2., 2.,
       2., 2., 0., 2., 2., 0., 2., 2., 0., 2., 2., 2., 2., 0., 2., 2., 1.,
       2., 2., 1., 2., 2., 3., 0., 1., 2., 2., 1., 0., 2., 2., 2., 2., 2.,
       2., 0., 1., 2., 2., 2., 2., 2., 0., 2., 2., 2., 2., 2., 3., 2., 2.,
       3., 2., 3., 0., 2., 2., 2., 0., 2., 2., 2., 2., 3., 2., 2., 0., 2.,
       0., 2., 2., 0., 3., 2., 1., 2., 2., 2., 2., 2., 2., 2., 0., 2., 2.,
       2., 2., 2., 1., 0., 2., 2., 0., 2., 0., 2., 2., 0., 2., 2., 2., 2.,
       2., 2., 0., 2., 0., 0., 2., 1., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 2., 2., 0., 2., 0., 2., 0., 2., 2., 2., 2., 1., 3., 2.,
       2., 2., 2., 2., 2., 2., 2., 0., 0., 2., 3., 2., 3., 0., 0., 2., 3.,
       0., 2., 3., 2., 0., 0., 3., 2., 0., 2., 0., 2., 2., 0., 1., 2., 2.,
       2., 2., 2., 2., 0., 0., 2., 2., 2., 0., 2., 2., 2., 2., 2., 2., 3.,
       2., 0., 2., 0., 2., 3., 2., 2., 0., 2., 2., 2., 2., 2., 2., 3., 3.,
       2., 2., 0., 3., 0.

In [240]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [241]:
print(accuracy_score(y_pred=y_pred, y_true=y_test))

0.9460500963391136


In [242]:
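# Rows are true classes, columns are predicted classes, in encoded order:
# 0 = acc, 1 = good, 2 = unacc, 3 = vgood
print(confusion_matrix(y_pred=y_pred,y_true=y_test))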

[[103   7   4   4]
 [  0  12   0   5]
 [  7   0 354   0]
 [  1   0   0  22]]
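
Overall accuracy hides how the minority classes fare, so it can help to derive per-class recall and precision from the matrix. The sketch below copies the confusion matrix printed above into NumPy; the label order assumes the alphabetical encoding (acc, good, unacc, vgood), and the variable names are placeholders.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([
    [103,   7,   4,   4],
    [  0,  12,   0,   5],
    [  7,   0, 354,   0],
    [  1,   0,   0,  22],
])
labels = ["acc", "good", "unacc", "vgood"]

# Accuracy: correct predictions (diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Recall: diagonal over row sums (how many true members of a class were found).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: diagonal over column sums (how many predictions of a class were right).
precision = np.diag(cm) / cm.sum(axis=0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name:>6}: recall={r:.3f} precision={p:.3f}")
```

For instance, the `good` class has only 17 test samples, of which 12 are recovered, a weaker result than the 0.946 overall accuracy suggests.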
