# **Random Forest algorithm intuition** <a class="anchor" id="3"></a>

![Random Forest](https://i.ytimg.com/vi/goPiwckWE9M/maxresdefault.jpg)

# **Advantages and disadvantages of Random Forest algorithm** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)


The advantages of the Random Forest algorithm are as follows:


1.	The Random Forest algorithm can be used to solve both classification and regression problems.
2.	It is generally an accurate and robust model because it combines the predictions of a large number of decision trees.
3.	A random forest aggregates the predictions of its trees: it averages them for regression and takes a vote (or averages class probabilities) for classification. Averaging mainly reduces variance rather than bias, so a forest is much less prone to overfitting than a single decision tree, although it is not immune to it.
4.	Random forests can be made to work with missing values. Two approaches are commonly described: replacing the missing values of a continuous variable with the median, or computing a proximity-weighted average of the missing values.
5.	A random forest classifier can be used for feature selection, that is, for selecting the most important features out of those available in the training dataset.
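
The aggregation described in point 3 can be checked directly. The following is a minimal sketch, assuming scikit-learn and a synthetic dataset from `make_classification` (the data and variable names are illustrative and not part of this notebook). It averages the class probabilities of the individual trees by hand and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (an assumption, not the car dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, forest.predict_proba(X_demo)))
print((manual_pred == forest.predict(X_demo)).all())
```

In scikit-learn the forest therefore performs "soft" voting over class probabilities rather than counting one hard vote per tree.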


The disadvantages of the Random Forest algorithm are listed below:


1.	The biggest disadvantage of random forests is their computational cost. Prediction is slower than with a single decision tree because every tree in the forest has to make a prediction for the same input before the results are combined by voting.
2.	The model is harder to interpret than a single decision tree, where a prediction can be traced by following one path from the root to a leaf.


# **Feature selection with Random Forests** <a class="anchor" id="4"></a>

[Table of Contents](#0.1)



The Random Forest algorithm can be used for feature selection because it can rank the importance of variables in a regression or classification problem.


We measure variable importance by fitting a random forest to the data. During fitting, the out-of-bag error for each data point is recorded and averaged over the forest. A data point is "out-of-bag" for a tree when it was not drawn into that tree's bootstrap sample, so it can serve as unseen data for that tree.
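
scikit-learn exposes an out-of-bag estimate through the `oob_score` option of `RandomForestClassifier`. The following is a minimal sketch on synthetic data (the dataset and variable names are illustrative assumptions). Note that `oob_score_` is an accuracy, so the out-of-bag error is one minus that value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (an assumption, not the car dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=4,
                                     random_state=0)

# oob_score=True scores each sample using only the trees
# whose bootstrap sample did not contain it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

oob_error = 1.0 - forest.oob_score_
print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")
print(f"Out-of-bag error   : {oob_error:.3f}")
```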


The importance of the j-th feature is measured after training. The values of the j-th feature are permuted among the training data and the out-of-bag error is computed again on this perturbed dataset. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.


Features that produce large values for this score are ranked as more important than features that produce small values. Based on this score, we can keep the most important features and drop the least important ones for model building.
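
The permutation idea can be sketched with scikit-learn's `permutation_importance`. This is an illustration under stated assumptions rather than the exact procedure described above: it permutes each feature on a held-out set instead of using out-of-bag samples, it reports the raw mean drop in accuracy without dividing by the standard deviation, and it runs on synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption). shuffle=False keeps the 3 informative
# features in the first 3 columns; the last 3 columns are pure noise.
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)
names = [f"f{i}" for i in range(6)]  # hypothetical feature names

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each column several times and record the mean drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
scores = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(scores)
```

Features whose permutation barely changes the score are candidates for removal.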


# **Lib Imports** <a class="anchor" id="4"></a>

In [2]:
import numpy as np
import seaborn as sb
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline


In [3]:
import warnings

warnings.filterwarnings('ignore')

# **Data Import** 

In [4]:
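data_file = './car_evaluation.csv'
# Note: the CSV has no header row, so read_csv consumes the first record as column names
# (see the output below); passing header=None would keep that record as data.
car_evaluation_df = pd.read_csv(data_file)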
car_evaluation_df.head()

Unnamed: 0,vhigh,vhigh.1,2,2.1,small,low,unacc
0,vhigh,vhigh,2,2,small,med,unacc
1,vhigh,vhigh,2,2,small,high,unacc
2,vhigh,vhigh,2,2,med,low,unacc
3,vhigh,vhigh,2,2,med,med,unacc
4,vhigh,vhigh,2,2,med,high,unacc


In [5]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car_evaluation_df.columns = col_names
car_evaluation_df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,med,unacc
1,vhigh,vhigh,2,2,small,high,unacc
2,vhigh,vhigh,2,2,med,low,unacc
3,vhigh,vhigh,2,2,med,med,unacc
4,vhigh,vhigh,2,2,med,high,unacc


# **Exploratory Data Analysis**

In [6]:
car_evaluation_df['class'].value_counts()

class
unacc    1209
acc       384
good       69
vgood      65
Name: count, dtype: int64

In [7]:
car_evaluation_df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

# **Data Split**

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X = car_evaluation_df.drop(['class'], axis=1)
Y = car_evaluation_df['class']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)

In [11]:
X_train.shape, X_test.shape

((1157, 6), (570, 6))
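
The class counts reported in the exploratory analysis are heavily imbalanced, and the split here is not stratified. One option, shown as a sketch and not as what this notebook does, is to pass `stratify` to `train_test_split` so that both parts keep roughly the same class proportions. The label vector below is rebuilt from those class counts; the placeholder feature column and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; the feature column is a placeholder
counts = {"unacc": 1209, "acc": 384, "good": 69, "vgood": 65}
y_demo = pd.Series(np.repeat(list(counts), list(counts.values())), name="class")
X_demo = pd.DataFrame({"row_id": np.arange(len(y_demo))})

# stratify=y_demo preserves the class proportions in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```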

In [12]:
X_train.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
83,vhigh,vhigh,5more,2,med,low
48,vhigh,vhigh,3,more,med,med
468,high,vhigh,3,4,small,med
155,vhigh,high,3,more,med,low
1043,med,high,4,more,small,low


In [13]:
y_train.head()

83      unacc
48      unacc
468     unacc
155     unacc
1043    unacc
Name: class, dtype: object

# **Feature Engineering**

In [14]:
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

In [15]:
X_train.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
dtype: object

In [16]:
x_encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
X_train = x_encoder.fit_transform(X_train)
X_test = x_encoder.transform(X_test)
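
Without an explicit mapping, `ce.OrdinalEncoder` assigns integers in order of first appearance, which is why every value in the first training row is encoded as 1. If the natural order of the levels should be preserved, one alternative is scikit-learn's `OrdinalEncoder` with explicit `categories`. The following is a minimal sketch on a toy frame: the level orderings are assumptions about what the categories mean, and scikit-learn's encoder starts at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows; the level names mirror those in the car evaluation data
toy = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low"],
    "safety": ["low", "med", "high", "med"],
})

# Assumed semantic orderings, lowest level first (one list per column)
encoder = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],   # buying
    ["low", "med", "high"],            # safety
])

encoded = pd.DataFrame(encoder.fit_transform(toy), columns=toy.columns).astype(int)
print(encoded)
```

Tree-based models split on thresholds, so an encoding that respects the level order lets a single split separate "low or med" from "high or vhigh".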

In [17]:
X_train.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
83,1,1,1,1,1,1
48,1,1,2,2,1,2
468,2,1,2,3,2,2
155,1,2,2,2,1,1
1043,3,2,3,2,2,1


In [18]:
X_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
599,2,2,3,1,3,1
932,3,1,3,3,3,1
628,2,2,1,1,3,3
1497,4,2,1,3,1,2
1262,3,4,3,2,1,1


# **Random Forest Classifier with Default Parameters**

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [20]:
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)

In [21]:
# n_estimators is left at its default, which is 100 trees in current scikit-learn releases
print(f"Model accuracy score with 100 decision-trees (default) : {accuracy_score(y_test, y_pred)}")

Model accuracy score with 100 decision-trees (default) : 0.9649122807017544


In [22]:
# Despite its name, rfc_100 is trained with 1000 trees
rfc_100 = RandomForestClassifier(n_estimators=1000, random_state=0)
rfc_100.fit(X_train, y_train)
y_pred_100 = rfc_100.predict(X_test)

In [23]:
print(f"Model accuracy score with 1000 decision-trees : {accuracy_score(y_test, y_pred_100)}")

Model accuracy score with 1000 decision-trees : 0.9701754385964912


## **Feature Importance**

In [24]:
feature_scores = pd.Series(rfc.feature_importances_,
                           index=X_train.columns).sort_values(ascending=False)
feature_scores

safety      0.291657
persons     0.235380
buying      0.160692
maint       0.134143
lug_boot    0.111595
doors       0.066533
dtype: float64
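
Note that `feature_importances_` in scikit-learn is an impurity-based measure, which differs from the permutation score described earlier. One way to act on such a ranking is to drop the lowest-ranked feature, refit, and compare accuracy. The following is a minimal sketch on synthetic data (the data and column names are hypothetical); on the car dataset the same idea would mean dropping `doors`, but that experiment is not run here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical column names (an assumption, not the car dataset)
X_arr, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Fit on all features and rank them by impurity-based importance
full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
ranking = pd.Series(full.feature_importances_, index=X_tr.columns).sort_values(ascending=False)
weakest = ranking.index[-1]

# Refit without the lowest-ranked feature
reduced = RandomForestClassifier(random_state=0).fit(X_tr.drop(columns=[weakest]), y_tr)

acc_full = accuracy_score(y_te, full.predict(X_te))
acc_reduced = accuracy_score(y_te, reduced.predict(X_te.drop(columns=[weakest])))
print(f"Dropped feature           : {weakest}")
print(f"Accuracy with all features: {acc_full:.4f}")
print(f"Accuracy without it       : {acc_reduced:.4f}")
```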

# **Model Evaluation**

### **Model Accuracy**

### **Confusion Matrix**

A confusion matrix summarizes the performance of a classification algorithm. It gives a clear picture of how the model performs and of the types of errors it makes, as a table of correct and incorrect predictions broken down by class.

Four types of outcomes are possible when evaluating a classification model. For a multi-class problem such as this one, they are defined for each class in turn, treating that class as positive and all other classes as negative:

**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.

**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.

**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called a **Type I error.**

**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. Depending on the application, this can be a very serious error; it is called a **Type II error.**

These four outcomes are summarized in the confusion matrix given below.

In [25]:
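# labels=y_test.unique() orders rows and columns by first appearance in y_test
# (here: unacc, acc, vgood, good), not alphabetically as in the classification report
cm = confusion_matrix(y_test, y_pred, labels=y_test.unique())
cm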

array([[397,   2,   0,   0],
       [  5, 119,   1,   2],
       [  0,   5,  21,   0],
       [  2,   1,   2,  13]])
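
For a multi-class matrix, the four outcomes can be read off per class. The sketch below hard-codes the matrix shown above, with the label order `unacc, acc, vgood, good` (inferred from the row sums matching the class supports of 399, 127, 26 and 18), and derives TP, FP, FN and TN for each class. Rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
import numpy as np
import pandas as pd

# Label order inferred from row sums matching the class supports
labels = ["unacc", "acc", "vgood", "good"]
cm_demo = np.array([[397,   2,   0,   0],
                    [  5, 119,   1,   2],
                    [  0,   5,  21,   0],
                    [  2,   1,   2,  13]])

tp = np.diag(cm_demo)             # correctly predicted as the class
fp = cm_demo.sum(axis=0) - tp     # predicted as the class, actually another class
fn = cm_demo.sum(axis=1) - tp     # actually the class, predicted as another class
tn = cm_demo.sum() - tp - fp - fn # everything else

summary = pd.DataFrame({
    "TP": tp, "FP": fp, "FN": fn, "TN": tn,
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
}, index=labels)

print(summary.round(2))
print("accuracy:", tp.sum() / cm_demo.sum())
```

The diagonal sums to 550 correct predictions out of 570, which matches the accuracy score reported for the default model.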

In [26]:
cm_100 = confusion_matrix(y_test, y_pred_100, labels=y_test.unique())
cm_100

### **Classification Report**

In [27]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         acc       0.94      0.94      0.94       127
        good       0.87      0.72      0.79        18
       unacc       0.98      0.99      0.99       399
       vgood       0.88      0.81      0.84        26

    accuracy                           0.96       570
   macro avg       0.92      0.87      0.89       570
weighted avg       0.96      0.96      0.96       570
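
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows: the macro average is the plain mean across classes, while the weighted average weights each class by its support. The following short sketch uses the rounded precision values from the report above, so the results are approximate.

```python
import numpy as np

# Per-class precision and support copied from the report above
# (class order: acc, good, unacc, vgood)
precision = np.array([0.94, 0.87, 0.98, 0.88])
support = np.array([127, 18, 399, 26])

macro_avg = precision.mean()                            # every class counts equally
weighted_avg = np.average(precision, weights=support)   # classes weighted by support

print(f"macro avg precision   : {macro_avg:.2f}")
print(f"weighted avg precision: {weighted_avg:.2f}")
```

Because `unacc` dominates the test set, the weighted average sits close to that class's precision, while the macro average is pulled down by the smaller `good` and `vgood` classes.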



In [28]:
print(classification_report(y_test, y_pred_100))

              precision    recall  f1-score   support

         acc       0.95      0.95      0.95       127
        good       0.87      0.72      0.79        18
       unacc       0.99      0.99      0.99       399
       vgood       0.92      0.85      0.88        26

    accuracy                           0.97       570
   macro avg       0.93      0.88      0.90       570
weighted avg       0.97      0.97      0.97       570

