# Activity 4.02 – Random Forest Classification for Car Rental Company


## Goal of this notebook

Building on Activity 4.01, we now optimize the car classification model using **Random Forest** and **Extra Trees** classifiers. These ensemble methods combine many decision trees to improve accuracy and generalization. The model assigns each car to one of four evaluation levels (unacceptable, acceptable, good, very good), which helps the company select cars clients will love.


## Dataset description

We use the classic **Car Evaluation** dataset from **the UCI Machine Learning Repository**. The dataset directly relates the overall car evaluation (the target) to six input attributes.

**Input features:**

- `buying`: buying price of the car (values: vhigh, high, med, low).  

- `maint`: maintenance cost (vhigh, high, med, low).

- `doors`: number of doors (2, 3, 4, 5more).

- `persons`: capacity in terms of persons to carry (2, 4, more).

- `lug_boot`: size of luggage boot (small, med, big). 

- `safety`: estimated safety of the car (low, med, high). 

**Target variable:**

- `class`: evaluation level of the car with four possible categories: **unacc** (unacceptable), **acc** (acceptable), **good**, **vgood** (very good).
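
As a quick sanity check on the description above, the sketch below enumerates every combination of the six attribute values. It assumes the standard UCI labels (including `5more` for `doors`) and uses only the standard library. The count it produces matches the 1728 rows implied by the split sizes reported later in this notebook (1382 + 346).

```python
from itertools import product

# Attribute values as listed in the dataset description.
# Assumption: "doors" uses the UCI label "5more" for five or more doors.
attributes = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible car description is one element of the Cartesian product.
combinations = list(product(*attributes.values()))
n_combinations = len(combinations)

# One-hot encoding creates one column per attribute value.
n_onehot_columns = sum(len(values) for values in attributes.values())

print("Possible attribute combinations:", n_combinations)
print("One-hot encoded columns:", n_onehot_columns)
```

Because every feature is categorical with only three or four levels, one-hot encoding stays small (21 columns), which is why the same simple preprocessing works for all models in this activity.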


### 1. Imports and Configuration

In [1]:
# Import core libraries for data handling and visualization
import pandas as pd              # Data loading and manipulation
import numpy as np               # Numerical operations
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns            # Statistical visualization

# Import scikit-learn tools for modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier  # Tree ensembles compared in this notebook
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make plots look nicer
sns.set(style="whitegrid")


### 2. Load the Dataset, Train-Test Split, and Preprocessing Pipeline from Activity 4.01 (for a Fair Model Comparison)


In [5]:
# Load SAME dataset as Activity 4.01
data = pd.read_csv("car.csv")
X = data.drop("Class", axis=1)
y = data["Class"]

# SAME train-test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

categorical_cols = X.columns.tolist()
print("Dataset loaded. Train shape:", X_train.shape, "Test shape:", X_test.shape)

# Same preprocessing as Decision Tree
preprocess = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)])

Dataset loaded. Train shape: (1382, 6) Test shape: (346, 6)
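
To illustrate what the preprocessing step does, here is a minimal sketch on a toy frame with two of the six features. The rows and the unseen value `"huge"` are made up for illustration. It shows how `OneHotEncoder(handle_unknown="ignore")` encodes a category that was not present during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two of the six features (hypothetical rows, not the real data).
toy_train = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "lug_boot": ["small", "big", "med", "small"],
})
toy_cols = toy_train.columns.tolist()

toy_preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), toy_cols)]
)


def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


dense_train = to_dense(toy_preprocess.fit_transform(toy_train))

# "huge" was never seen during fit: its lug_boot columns are all zero
# instead of raising an error.
toy_new = pd.DataFrame({"safety": ["high"], "lug_boot": ["huge"]})
dense_new = to_dense(toy_preprocess.transform(toy_new))

print("Encoded training shape:", dense_train.shape)
print("Encoded unseen row:", dense_new)
```

In this dataset every category should appear in the training split, so `handle_unknown="ignore"` acts mainly as a safety net for future inputs.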


### 3. Baseline Random Forest

In [6]:
# Baseline Random Forest (100 trees, default params)
rf_baseline = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

rf_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", rf_baseline)
])

# Train & evaluate
rf_pipeline.fit(X_train, y_train)
y_pred_rf_base = rf_pipeline.predict(X_test)

rf_base_train_acc = accuracy_score(y_train, rf_pipeline.predict(X_train))
rf_base_test_acc = accuracy_score(y_test, y_pred_rf_base)

print("=== BASELINE Random Forest ===")
print(f"Train Accuracy: {rf_base_train_acc:.4f}")
print(f"Test Accuracy: {rf_base_test_acc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf_base))


=== BASELINE Random Forest ===
Train Accuracy: 1.0000
Test Accuracy: 0.9682

Classification Report:
               precision    recall  f1-score   support

         acc       0.94      0.96      0.95        77
        good       0.81      0.93      0.87        14
       unacc       0.99      0.99      0.99       242
       vgood       0.90      0.69      0.78        13

    accuracy                           0.97       346
   macro avg       0.91      0.89      0.90       346
weighted avg       0.97      0.97      0.97       346
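
The f1-score column in the report is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values printed above, so results can differ from scikit-learn's own figures in the last digit. The helper name `f1_from` is introduced here only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the baseline Random Forest report.
baseline_rf_report = {
    "acc": (0.94, 0.96),
    "good": (0.81, 0.93),
    "unacc": (0.99, 0.99),
    "vgood": (0.90, 0.69),
}

f1_by_class = {
    label: round(f1_from(precision, recall), 2)
    for label, (precision, recall) in baseline_rf_report.items()
}

for label, f1 in f1_by_class.items():
    print(f"{label:>6}: f1 = {f1:.2f}")
```

The harmonic mean is pulled toward the smaller of the two values, which is why `vgood` (precision 0.90, recall 0.69) ends up with the lowest f1-score.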



### 4. Hyperparameter Tuning for Random Forest

In [17]:
# Tune Random Forest hyperparameters
param_grid_rf = {
    "classifier__n_estimators": [20, 50, 100],
    "classifier__max_depth": [5, 6, 7, 8],
}

grid_rf = GridSearchCV(
    rf_pipeline, param_grid_rf, cv=5, scoring="accuracy", n_jobs=-1, verbose=1
)

grid_rf.fit(X_train, y_train)
print("Best RF params:", grid_rf.best_params_)
print("Best CV accuracy:", grid_rf.best_score_)

# Evaluate best RF
best_rf = grid_rf.best_estimator_
y_pred_rf_best = best_rf.predict(X_test)
rf_best_train = accuracy_score(y_train, best_rf.predict(X_train))
rf_best_test = accuracy_score(y_test, y_pred_rf_best)

print(f"\nTuned RF - Train: {rf_best_train:.4f}, Test: {rf_best_test:.4f}")


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best RF params: {'classifier__max_depth': 8, 'classifier__n_estimators': 100}
Best CV accuracy: 0.9268979228797154

Tuned RF - Train: 0.9790, Test: 0.9509
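
The best `max_depth` above is the largest value in the grid, which suggests the search space may be too narrow. The sketch below is one way to check this: add `max_depth=None` (unlimited depth, the default) to the grid and inspect the full `cv_results_` table instead of only the best score. It runs on synthetic stand-in data from `make_classification`, which is an assumption for self-containment. With the notebook's pipeline, the keys would be `classifier__max_depth` and `classifier__n_estimators`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with X_train / y_train
# and the notebook's pipeline to apply this to the car data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# None means "grow trees until the leaves are pure", the scikit-learn default.
demo_grid = {"n_estimators": [10, 30], "max_depth": [3, 8, None]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    demo_grid,
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best mean CV accuracy first.
results = (
    pd.DataFrame(demo_search.cv_results_)[
        ["param_max_depth", "param_n_estimators", "mean_test_score", "std_test_score"]
    ]
    .sort_values("mean_test_score", ascending=False)
    .reset_index(drop=True)
)
print(results.to_string(index=False))
print("Best params:", demo_search.best_params_)
```

Looking at the whole table shows whether the score is still rising at the edge of the grid, which is harder to see from `best_params_` alone.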


### 5. Final Interpretation for Random Forest

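**Before tuning (Baseline Random Forest)**:

* 100% train, 96.8% test → fully grown trees memorize the training set, yet the model still generalizes well (3.2-point gap)
* Default parameters (no depth limit) → not tailored to the car evaluation patterns
* Weakest class is `vgood` (recall 0.69 on only 13 test samples)

*Risk: Suboptimal performance on rare classes and edge cases*

**After tuning (max_depth=8, n_estimators=100)**:

* 97.9% train, 95.1% test → the train-test gap narrows to 2.8 points, but test accuracy is 1.7 points below the baseline (96.8%)
* The best `max_depth` (8) is the largest value in the grid and the best CV accuracy is 92.7% → the depth cap likely underfits; deeper trees or `max_depth=None` are worth adding to the search
* +1.5 points test accuracy over the Decision Tree (93.6% → 95.1%)
* 100 diverse trees → individual errors tend to cancel out, giving more stable predictions than a single decision tree

**Summary:**

Both Random Forest models beat the single Decision Tree, and the ensemble adds robustness that a single tree can't match. On this split, however, the untuned baseline (96.8%) outperforms the depth-limited tuned model (95.1%), so the tuning grid should be widened before the tuned model is preferred.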
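
The train-test gaps discussed in this section can be recomputed directly from the accuracies printed in the outputs above. This is plain arithmetic on the reported numbers, expressed in percentage points.

```python
# (train accuracy, test accuracy) as printed in the notebook outputs.
reported = {
    "Baseline RF": (1.0000, 0.9682),
    "Tuned RF": (0.9790, 0.9509),
}

# Train-test gap in percentage points for each model.
gaps = {
    name: round((train - test) * 100, 2)
    for name, (train, test) in reported.items()
}

# Change in test accuracy caused by tuning (negative means tuning hurt).
test_change = round(
    (reported["Tuned RF"][1] - reported["Baseline RF"][1]) * 100, 2
)

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.2f} points")
print(f"Test accuracy change after tuning: {test_change:+.2f} points")
```

A smaller gap on its own does not mean a better model: here the gap shrinks while test accuracy also drops.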

### 6. Training with Extra Trees (Extremely Randomized Trees)

In [20]:
et_clf = ExtraTreesClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

#  Pipeline for Extra Trees
et_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", et_clf)
])

#  Train Extra Trees model
et_pipeline.fit(X_train, y_train)

#  Evaluate Extra Trees on test set
y_pred_et = et_pipeline.predict(X_test)

print("Extra Trees Accuracy on test set:", accuracy_score(y_test, y_pred_et))
print("\nClassification report (Extra Trees):")
print(classification_report(y_test, y_pred_et))
print("\nConfusion matrix (Extra Trees):")
print(confusion_matrix(y_test, y_pred_et))

Extra Trees Accuracy on test set: 0.9797687861271677

Classification report (Extra Trees):
              precision    recall  f1-score   support

         acc       0.94      0.97      0.96        77
        good       1.00      0.93      0.96        14
       unacc       0.99      0.99      0.99       242
       vgood       1.00      0.92      0.96        13

    accuracy                           0.98       346
   macro avg       0.98      0.95      0.97       346
weighted avg       0.98      0.98      0.98       346


Confusion matrix (Extra Trees):
[[ 75   0   2   0]
 [  1  13   0   0]
 [  3   0 239   0]
 [  1   0   0  12]]
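
The classification report can be derived from the confusion matrix printed above. In scikit-learn's layout, rows are true classes and columns are predicted classes, both in sorted label order. The sketch below recomputes per-class recall, precision, and overall accuracy with NumPy.

```python
import numpy as np

labels = ["acc", "good", "unacc", "vgood"]

# Confusion matrix copied from the Extra Trees output
# (rows = true class, columns = predicted class).
cm = np.array([
    [75, 0, 2, 0],
    [1, 13, 0, 0],
    [3, 0, 239, 0],
    [1, 0, 0, 12],
])

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)     # correct / all samples of the true class
precision = correct / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.trace() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"{label:>6}: precision = {p:.2f}, recall = {r:.2f}")
print(f"accuracy = {accuracy:.4f}")
```

Reading down the first column shows that all five false positives for `acc` come from the other three classes, while `good` and `vgood` have no false positives at all, hence their precision of 1.00.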


### 7. Hyperparameter Tuning for Extra Trees

In [29]:
# Hyperparameter tuning for Extra Trees (Extremely Randomized Trees)
param_grid_et = {
    "classifier__n_estimators": [20, 50, 100, 150],
    "classifier__max_depth": [5, 6, 7, 8],
}

grid_et = GridSearchCV(
    et_pipeline, 
    param_grid_et, 
    cv=5, 
    scoring="accuracy", 
    n_jobs=-1, 
    verbose=1
)

grid_et.fit(X_train, y_train)
print("Best Extra Trees params:", grid_et.best_params_)
print("Best CV accuracy:", grid_et.best_score_)

# Evaluate tuned Extra Trees
best_et = grid_et.best_estimator_
y_pred_et_tuned = best_et.predict(X_test)
et_tuned_train = accuracy_score(y_train, best_et.predict(X_train))
et_tuned_test = accuracy_score(y_test, y_pred_et_tuned)

print(f"\nTuned Extra Trees - Train: {et_tuned_train:.4f}, Test: {et_tuned_test:.4f}")
print("Classification report (Tuned Extra Trees):")
print(classification_report(y_test, y_pred_et_tuned))


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Extra Trees params: {'classifier__max_depth': 8, 'classifier__n_estimators': 150}
Best CV accuracy: 0.9319625385863025

Tuned Extra Trees - Train: 0.9819, Test: 0.9509
Classification report (Tuned Extra Trees):
              precision    recall  f1-score   support

         acc       0.83      0.97      0.90        77
        good       1.00      0.43      0.60        14
       unacc       0.99      0.98      0.99       242
       vgood       1.00      0.85      0.92        13

    accuracy                           0.95       346
   macro avg       0.96      0.81      0.85       346
weighted avg       0.96      0.95      0.95       346
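
The report above shows a macro-average recall of 0.81 next to a weighted-average recall of 0.95. The sketch below reproduces both from the rounded per-class recalls and supports, so the values are approximate. It shows why the two averages diverge when a small class performs poorly.

```python
# (recall, support) per class, copied from the tuned Extra Trees report.
tuned_et_recall = {
    "acc": (0.97, 77),
    "good": (0.43, 14),
    "unacc": (0.98, 242),
    "vgood": (0.85, 13),
}

recalls = [recall for recall, _ in tuned_et_recall.values()]
supports = [support for _, support in tuned_et_recall.values()]

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recalls) / len(recalls)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(f"Macro-average recall:    {macro_recall:.2f}")
print(f"Weighted-average recall: {weighted_recall:.2f}")
```

The `good` class has only 14 test samples, so its recall of 0.43 barely moves the weighted average or the accuracy, but it pulls the macro average down noticeably.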



### 8. Final Interpretation for the Extra Trees Classifier

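**Before tuning (Baseline Extra Trees)**:

* 98.0% test accuracy with default parameters → the highest test score in this notebook
* Precision of 1.00 on `good` and `vgood`, recall of at least 0.92 on every class
* Random split thresholds are already active by default → this is what distinguishes Extra Trees from Random Forest

*Note: Train accuracy was not printed for this baseline, so its train-test gap cannot be compared directly*

**After tuning (max_depth=8, n_estimators=150)**:

* 98.2% train, 95.1% test → matches the tuned Random Forest exactly (95.1%), but is 2.9 points below the Extra Trees baseline (98.0%)
* 150 trees and depth 8 are both the largest values in the grid → the search space is probably too narrow, so 150 should not be read as an optimal ensemble size
* 3.1-point train-test gap → comparable to the tuned Random Forest (2.8 points)
* Recall on `vgood` is 0.85, higher than the baseline Random Forest's 0.69 but lower than the baseline Extra Trees' 0.92
* Recall on `good` drops from 0.93 to 0.43 → the depth limit hurts the smallest classes most, which the macro-average recall (0.81) reveals better than accuracy does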
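
For context on how the two ensembles differ before any tuning, the sketch below prints a few default parameters of both scikit-learn classifiers. Beyond these defaults, Extra Trees also draw candidate split thresholds at random rather than searching for the best threshold per feature. That behavior is part of the algorithm and is not controlled by a parameter shown here.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Instantiate both classifiers with library defaults (no fitting needed).
default_models = {
    "RandomForestClassifier": RandomForestClassifier(),
    "ExtraTreesClassifier": ExtraTreesClassifier(),
}

default_params = {}
for name, model in default_models.items():
    params = model.get_params()
    default_params[name] = {
        "bootstrap": params["bootstrap"],
        "max_depth": params["max_depth"],
    }
    print(f"{name}: bootstrap={params['bootstrap']}, max_depth={params['max_depth']}")
```

By default, Random Forest trains each tree on a bootstrap sample while Extra Trees use the whole training set for every tree, and neither limits tree depth. Restricting `max_depth` to 8 or less in the grids therefore makes both tuned models shallower than their baselines.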