# <font color='green'>Random Forest</font>

A **Random Forest** is a supervised machine learning algorithm used for both **classification** and **regression** tasks.

- In classification, it predicts the class label of a given input, while in regression, it predicts a continuous numerical value.

- A Random Forest is an ensemble of Decision Trees; it combines the predictions of the individual trees to make a better overall decision.

- **For classification:** Takes the majority vote (the class that most trees predict).

- **For regression:** Takes the average of all the predictions made by the individual trees.

- **Bootstrap Aggregation (Bagging):** Makes the Random Forest more robust by training each tree on a different bootstrap sample of the data (rows drawn at random with replacement) and combining the trees' predictions.

- **Gini Index:** Helps the trees make good decisions about how to split the data by measuring how impure or mixed the data is at each step.
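
The voting, averaging, and Gini ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation; the helper names `gini_impurity`, `majority_vote`, and `average_prediction` are hypothetical, and the toy labels are made up for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def majority_vote(tree_predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


pure_node = ["unacc"] * 4                      # only one class present
mixed_node = ["unacc", "unacc", "acc", "acc"]  # two classes, evenly mixed

print(gini_impurity(pure_node))                  # 0.0
print(gini_impurity(mixed_node))                 # 0.5
print(majority_vote(["acc", "unacc", "unacc"]))  # unacc
print(average_prediction([2.0, 3.0, 4.0]))       # 3.0
```

A lower Gini value means a purer node, so a tree prefers splits whose child nodes have low impurity. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the majority vote here is the conceptual version.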

**Advantages:**
    
- High accuracy

- Handles large datasets with many features

- Reduces overfitting compared with a single decision tree

- Can handle missing values in some implementations

## Import Libraries and Dataset:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [2]:
data=pd.read_csv("C:/Users/USER/Desktop/M.L Data Set/car_evaluation.csv")
data.head()

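| | vhigh | vhigh.1 | 2 | 2.1 | small | low | unacc |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 2 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | high | unacc |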
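
The column labels in the output above (`vhigh`, `vhigh.1`, `2`, `2.1`, ...) look like data values, which suggests the file has no header row and pandas used the first record as the header. A possible way to keep that record is sketched below, assuming the file really is header-less. The three-row in-memory sample is hypothetical and only stands in for the real CSV.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as the file (no header row)
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None tells pandas the first line is data; names= supplies the labels
sample = pd.read_csv(sample_csv, header=None, names=col_names)

print(sample.shape)
print(sample.head())
```

Read this way, the first record stays in the data, so the full dataset would have one more row than the 1727 reported below.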


## Exploratory Data Analysis (EDA):

In [3]:
data.shape

(1727, 7)

In [4]:
# Rename column names

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data.columns= col_names

In [5]:
# summary of data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1727 entries, 0 to 1726
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1727 non-null   object
 1   maint     1727 non-null   object
 2   doors     1727 non-null   object
 3   persons   1727 non-null   object
 4   lug_boot  1727 non-null   object
 5   safety    1727 non-null   object
 6   class     1727 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [6]:
data.isna()

| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1722 | False | False | False | False | False | False | False |
| 1723 | False | False | False | False | False | False | False |
| 1724 | False | False | False | False | False | False | False |
| 1725 | False | False | False | False | False | False | False |


In [7]:
data.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

In [8]:
data.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
class       object
dtype: object

In [9]:
#The class target variable is ordinal in nature.

data['class'].value_counts()

unacc    1209
acc       384
good       69
vgood      65
Name: class, dtype: int64

### Feature vector and Target variable

In [10]:
X = data.drop(['class'], axis=1)

Y = data['class']

In [11]:
# convert categorical columns into numerical format using ordinal encoding

import category_encoders as ce
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
X = encoder.fit_transform(X)
X.head()

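| | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 1 | 1 | 2 | 3 |
| 3 | 1 | 1 | 1 | 1 | 2 | 1 |
| 4 | 1 | 1 | 1 | 1 | 2 | 2 |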
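
In the output above the first row is encoded as all 1s, which is consistent with integers being assigned in order of first appearance rather than in a meaningful order (for example, `safety` value `high` becomes 3 only because it shows up third). If a semantic order is wanted, one option is scikit-learn's `OrdinalEncoder` with explicit category lists. The sketch below is an assumption-laden illustration: the toy frame, the name `ordered_encoder`, and the category orders (including the `big` level for `lug_boot`) are hypothetical and should be checked against the real data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with two of the categorical columns
sample = pd.DataFrame({
    'lug_boot': ['small', 'small', 'med', 'big'],
    'safety': ['low', 'med', 'high', 'low'],
})

# Assumed semantic order for each column, lowest to highest
ordered_categories = [
    ['small', 'med', 'big'],  # lug_boot
    ['low', 'med', 'high'],   # safety
]

ordered_encoder = OrdinalEncoder(categories=ordered_categories)
encoded = pd.DataFrame(
    ordered_encoder.fit_transform(sample),
    columns=sample.columns,
).astype(int)

print(encoded)
```

With an explicit order, a tree split such as `safety <= 0` separates `low` from the higher levels, which is easier to interpret than a split on appearance-order codes.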


**Split data into training and test set**

In [12]:
# split X and Y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)

In [13]:
# check the shapes of the training and test sets

print("X train and test :", X_train.shape, X_test.shape)
print("Y train and test :", Y_train.shape, Y_test.shape)

X train and test : (1157, 6) (570, 6)
Y train and test : (1157,) (570,)
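
The classes are quite imbalanced (1209 `unacc` against 65 `vgood`), so a plain random split can leave the rare classes unevenly represented. One option, sketched here as an assumption rather than what the notebook does, is to pass `stratify=` to `train_test_split`. The labels below are synthetic, built only from the class counts reported earlier, and the single `feature` column is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the reported class counts; the feature is a placeholder
y_demo = pd.Series(['unacc'] * 1209 + ['acc'] * 384 + ['good'] * 69 + ['vgood'] * 65)
X_demo = pd.DataFrame({'feature': np.arange(len(y_demo))})

# stratify=y_demo keeps the class proportions (almost) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```

Both printed distributions should be close to the overall class shares, which makes the test metrics for `good` and `vgood` less dependent on the luck of the split.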


### Random Forest Classifier model with default parameters

In [14]:
# import Random Forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# instantiate the classifier 

RFC = RandomForestClassifier(random_state=0)

# fit the model
RFC.fit(X_train, Y_train)

# Predict the Test set results

Y_pred = RFC.predict(X_test)

print(f'Model accuracy score with default parameters : {accuracy_score(Y_test, Y_pred):0.4f}')

Model accuracy score with default parameters : 0.9684


### Random Forest Classifier model with parameter n_estimators=120 and criterion='gini'

In [15]:
# instantiate the classifier with n_estimators = 120

RFC_120 = RandomForestClassifier(n_estimators=120, max_depth=None, criterion='gini', random_state=0)

# fit the model to the training set

RFC_120.fit(X_train, Y_train)



RandomForestClassifier(n_estimators=120, random_state=0)

In [16]:
# Predict on the test set results

Y_pred_120 = RFC_120.predict(X_test)

# Check accuracy score 
print(f'Model accuracy score with 120 decision-trees : {accuracy_score(Y_test, Y_pred_120):0.4f}')

Model accuracy score with 120 decision-trees : 0.9684


In [17]:
# view the feature scores

feature_scores = pd.Series(RFC.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores

safety      0.299179
persons     0.235318
buying      0.164889
maint       0.142675
lug_boot    0.095426
doors       0.062513
dtype: float64

We can see that the most important feature is **safety** and the least important feature is **doors**.
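
To put the scores in context, the short sketch below re-enters the importances reported above and accumulates them. The series name `feature_scores_reported` is hypothetical; the values are copied from the output, not recomputed from a model.

```python
import pandas as pd

# Importances copied from the output above (already sorted in descending order)
feature_scores_reported = pd.Series({
    'safety': 0.299179,
    'persons': 0.235318,
    'buying': 0.164889,
    'maint': 0.142675,
    'lug_boot': 0.095426,
    'doors': 0.062513,
})

# Impurity-based importances are normalised, so they sum to (about) 1
print(round(feature_scores_reported.sum(), 6))

# Cumulative share of importance, from the strongest feature downwards
cumulative = feature_scores_reported.cumsum()
print(cumulative.round(3))
```

On these numbers, `safety` and `persons` together account for a little over half of the total importance.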

### Hyperparameter Tuning Using Grid Search

RandomForestClassifier has several hyperparameters. We can use Grid Search to find the combination of hyperparameters that gives the best cross-validated performance.

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid (hyperparameters to search over)
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'max_depth': [None, 10, 20, 30],  # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],    # Minimum samples required at a leaf node
    # Number of features to consider when looking for the best split.
    # Note: for classifiers 'auto' is equivalent to 'sqrt', and it was removed
    # in scikit-learn 1.3, so newer versions need e.g. ['sqrt', 'log2'] here.
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False],       # Whether to use bootstrap samples or not
    'criterion': ['gini']             # Splitting criterion (we can also use 'entropy' if we want)
}

# Create a RandomForestClassifier
clf = RandomForestClassifier(random_state=42)

# Perform Grid Search with 5-fold cross-validation (cv=5)
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit the grid search to data
grid_search.fit(X_train, Y_train)

# Get the best parameters and the best score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Use the best model to make predictions on the test data
best_model = grid_search.best_estimator_
Y_pred_best = best_model.predict(X_test)

# Evaluate the best model
best_accuracy = accuracy_score(Y_test, Y_pred_best)
print("Test Accuracy with Best Model:", best_accuracy)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
Best Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Best Cross-Validation Accuracy: 0.9723503507986268
Test Accuracy with Best Model: 0.9614035087719298


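- Grid Search selected a model with a cross-validation accuracy of 97.24% and a test accuracy of 96.14%. The cross-validation score is computed on held-out folds of the training set, so both figures measure performance on data the model was not fitted on. The test accuracy is slightly below the 96.84% obtained with the default parameters.


- The chosen hyperparameters (no bootstrapping, unlimited depth, 50 trees, `min_samples_split=5`) fit this dataset well, but on this test split they do not improve on the default model.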
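
Looking only at `best_params_` hides how close the other candidates were. A fitted `GridSearchCV` also exposes `cv_results_`, which can be loaded into a DataFrame and ranked. The sketch below is self-contained and uses assumptions throughout: a small synthetic dataset from `make_classification`, a deliberately tiny grid, and hypothetical names such as `demo_search`. It is not a rerun of the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Tiny grid (2 x 2 = 4 candidates) so the example runs quickly
small_grid = {'n_estimators': [10, 20], 'min_samples_split': [2, 5]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; rank_test_score == 1 marks the best mean CV score
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(ranked)
```

If several candidates sit within one standard deviation of the best score, the simpler or cheaper configuration is usually a reasonable choice.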



### Confusion matrix

In [19]:
# Confusion Matrix

from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(Y_test, Y_pred_best)
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[115   6   4   2]
 [  0  13   0   5]
 [  1   0 398   0]
 [  4   0   0  22]]
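
In scikit-learn's confusion matrix, rows are the true classes and columns the predicted classes, in sorted label order (here `acc`, `good`, `unacc`, `vgood`). The sketch below re-enters the matrix printed above, attaches those labels, and derives per-class recall, precision, and overall accuracy from it. The variable names are hypothetical, and the numbers are copied from the output rather than recomputed from the model.

```python
import numpy as np
import pandas as pd

labels = ['acc', 'good', 'unacc', 'vgood']  # sorted label order

# Matrix copied from the output above (rows = true class, columns = predicted)
conf_matrix_reported = np.array([
    [115,   6,   4,   2],
    [  0,  13,   0,   5],
    [  1,   0, 398,   0],
    [  4,   0,   0,  22],
])

cm_df = pd.DataFrame(
    conf_matrix_reported,
    index=[f'true_{label}' for label in labels],
    columns=[f'pred_{label}' for label in labels],
)
print(cm_df)

correct = np.diag(conf_matrix_reported)
recall = pd.Series(correct / conf_matrix_reported.sum(axis=1), index=labels)
precision = pd.Series(correct / conf_matrix_reported.sum(axis=0), index=labels)
accuracy = correct.sum() / conf_matrix_reported.sum()

print(recall.round(2))
print(precision.round(2))
print(round(accuracy, 4))
```

The diagonal sums to 548 of 570 test rows, which reproduces the 0.9614 test accuracy of the tuned model. Most errors involve the smaller classes, for example 5 of the 18 true `good` cars are predicted as `vgood`.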


### Classification Report

In [20]:
from sklearn.metrics import classification_report

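# Note: Y_pred holds the predictions of the default-parameter model (RFC);
# pass Y_pred_best instead to report on the tuned model from the grid search.
print(classification_report(Y_test, Y_pred))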

              precision    recall  f1-score   support

         acc       0.96      0.94      0.95       127
        good       0.76      0.72      0.74        18
       unacc       0.99      1.00      0.99       399
       vgood       0.81      0.81      0.81        26

    accuracy                           0.97       570
   macro avg       0.88      0.87      0.87       570
weighted avg       0.97      0.97      0.97       570



## Results and conclusion

- In this project, I built a Random Forest Classifier to predict the evaluation class of a car (`unacc`, `acc`, `good`, or `vgood`). I created two models: one with the **default parameters** and **another with n_estimators=120 and criterion='gini'**.


- The accuracy of the model with default parameters is **0.9684,** and with n_estimators=120 and criterion='gini' it is also **0.9684.** On this test split, adding more decision trees did not change the accuracy.



- I used the Random Forest model's feature importances to rank the features. **The most important feature was safety**, while the **least important was the number of doors.**


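- I applied Grid Search to find the best hyperparameters for a RandomForestClassifier. The best configuration reached a cross-validation accuracy of 0.9724 and a test accuracy of 0.9614.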


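- I used the confusion matrix and classification report to evaluate the model’s performance. Overall results are good (weighted F1-score of 0.97), although the minority classes `good` and `vgood` score lower (F1-scores of 0.74 and 0.81).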


