# **2. Random Forest Regression**

**Problem Statement**  

Predict the fuel efficiency of cars (measured in miles per gallon, `mpg`) based on their technical specifications using the **Auto-MPG Dataset**. Develop a **Random Forest Regression model** to accurately predict `mpg`, optimize the model’s performance through hyperparameter tuning, and evaluate its accuracy using appropriate metrics. Analyze the importance of different features (e.g., horsepower, weight, cylinders) in determining fuel efficiency.


**Dataset Link**: https://www.kaggle.com/datasets/uciml/autompg-dataset

**Instructions**  

1. **Data Loading and Cleaning**  
   - Load the dataset.  
   - Handle missing values in the `horsepower` column (use mean or median imputation).  

2. **Feature Encoding**  
   - Perform one-hot encoding for the `origin` column.  

3. **Splitting the Data**  
   - Split the data into training (80%) and testing (20%) sets.  

4. **Model Training**  
   - Train a Random Forest Regressor with default hyperparameters to predict `mpg`.  

5. **Hyperparameter Tuning**  
   - Perform grid search or random search to optimize parameters like:  
     - `n_estimators`  
     - `max_depth`  
     - `min_samples_split`  

6. **Model Evaluation**  
   - Evaluate the model on the test set using:  
     - Mean Squared Error (MSE)  
     - R² Score  

7. **Feature Importance**  
   - Plot the feature importance to identify the most influential variables for predicting `mpg`.  

8. **Visualization**  
   - Compare predicted vs. actual values for `mpg` using a scatter plot.  
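
Steps 7 and 8 can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not on the Auto-MPG file. The feature names are borrowed from the dataset purely as placeholder labels, and the output file name `rf_diagnostics.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the names are placeholder labels only
names = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"]
X_syn, y_syn = make_regression(n_samples=200, n_features=len(names), noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Sort features from least to most important for a horizontal bar chart
importances = model.feature_importances_
order = np.argsort(importances)

fig, (ax_imp, ax_fit) = plt.subplots(1, 2, figsize=(11, 4))
ax_imp.barh(np.array(names)[order], importances[order])
ax_imp.set_xlabel("Importance")
ax_imp.set_title("Feature importance")

# Predicted vs. actual: points on the dashed diagonal are perfect predictions
ax_fit.scatter(y_te, y_hat, alpha=0.7)
lims = [min(y_te.min(), y_hat.min()), max(y_te.max(), y_hat.max())]
ax_fit.plot(lims, lims, linestyle="--", color="gray")
ax_fit.set_xlabel("Actual")
ax_fit.set_ylabel("Predicted")
ax_fit.set_title("Predicted vs. actual")

fig.tight_layout()
fig.savefig("rf_diagnostics.png")
```

When the features are passed to the model as a plain NumPy array, the labels for the bar chart have to be taken from the DataFrame columns that were used to build that array, in the same order.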


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
df = pd.read_csv("auto-mpg.csv")
df.head()

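|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|----------|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |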


In [None]:
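# Note: isnull() only counts true NaN values. If missing entries are stored as
# text placeholders (this dataset is known to use '?'), they are not counted
# here and only become NaN after pd.to_numeric(..., errors='coerce').
missing_values_count = df['horsepower'].isnull().sum()
print(f"Number of missing values in 'horsepower' column: {missing_values_count}")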

Number of missing values in 'horsepower' column: 0
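
A count of zero here does not necessarily mean the column is complete: `isnull()` detects only real `NaN` values, so text placeholders slip through. The sketch below uses a small made-up frame (the values and the `'?'` placeholder are assumptions for illustration, not rows read from the file) to show the count before and after `pd.to_numeric(..., errors='coerce')`, followed by median imputation as the alternative to the mean that the instructions allow.

```python
import pandas as pd

# Hypothetical toy data: one horsepower entry is a '?' placeholder string
toy = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# isnull() sees only strings here, so nothing is reported as missing
missing_before = int(toy["horsepower"].isnull().sum())

# Coerce non-numeric entries to NaN, then count again
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")
missing_after = int(toy["horsepower"].isnull().sum())

# Median imputation, assigned back to the column
toy["horsepower"] = toy["horsepower"].fillna(toy["horsepower"].median())

print(missing_before, missing_after)
print(toy["horsepower"].tolist())
```

The median is less sensitive than the mean to a few very high-powered cars, which is one reason to prefer it when the column is skewed.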


In [None]:
print(df['origin'].unique())

[1 3 2]


In [None]:
# Convert 'horsepower' to numeric; non-numeric placeholders become NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')

# Fill missing 'horsepower' values with the column mean
# (assign the result back; inplace=True on a column selection triggers a pandas warning)
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())

# 'car name' is a nearly unique text label, so drop it instead of encoding it
df = df.drop(columns=['car name'])

# One-hot encode the categorical 'origin' column, as the instructions require
df = pd.get_dummies(df, columns=['origin'], prefix='origin', dtype=int)

In [None]:
# Select the target and the features by name rather than by position,
# so that y is always 'mpg' regardless of column order
X = df.drop(columns=['mpg']).values
y = df['mpg'].values

In [None]:
y.shape

(398,)

In [None]:
X.shape

(398, 9)

In [None]:
print(df.columns)

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin_1', 'origin_2', 'origin_3'],
      dtype='object')


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
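
To make the 80/20 split concrete, the sketch below applies the same call to placeholder arrays with 398 rows, matching the row count reported by `y.shape`. The array contents are arbitrary; only the resulting sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (398)
X_demo = np.arange(398 * 2).reshape(398, 2)
y_demo = np.arange(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test size is rounded up: ceil(0.2 * 398) = 80, leaving 318 rows for training
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` keeps the split reproducible, so that metrics from different runs are computed on the same 80 test rows.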

In [None]:
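from sklearn.ensemble import RandomForestRegressor

# Baseline model with default hyperparameters; random_state makes the run reproducible
rf_regressor = RandomForestRegressor(random_state=0)
rf_regressor.fit(X_train, y_train)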
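
Before tuning, it helps to record how the default model performs, so that there is something to compare the tuned model against. The sketch below does this with 5-fold cross-validation on synthetic data. The data, the sample counts, and `n_estimators=50` are assumptions chosen to keep the example fast, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training split (not the Auto-MPG data)
X_syn, y_syn = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

baseline = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn reports negated MSE so that larger is better; flip the sign back
neg_mse = cross_val_score(baseline, X_syn, y_syn, cv=5, scoring="neg_mean_squared_error")
baseline_mse = -neg_mse.mean()
print(f"Baseline CV MSE: {baseline_mse:.2f}")
```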



In [None]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning with an exhaustive grid search
# (3 x 3 x 3 = 27 combinations, each evaluated with 5-fold cross-validation)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    estimator=rf_regressor,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters
best_rf_regressor = grid_search.best_estimator_
print(f"Best Hyperparameters: {grid_search.best_params_}")



Best Hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}
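
The instructions also allow random search. A minimal sketch with `RandomizedSearchCV` is shown below; it samples a fixed number of parameter combinations instead of trying all of them. The synthetic data, the sampling ranges, and `n_iter=5` are assumptions picked for a quick illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split (not the Auto-MPG data)
X_syn, y_syn = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": randint(20, 101),     # integers from 20 to 100
    "max_depth": [None, 5, 10, 20],       # lists are sampled uniformly
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X_syn, y_syn)
print(random_search.best_params_)
```

Random search is usually the cheaper option when the grid would otherwise grow large, at the cost of not guaranteeing that every combination is tried.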


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Model evaluation on the held-out test set
y_pred = best_rf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Feature importance (one value per column of X, summing to 1)
feature_importance = best_rf_regressor.feature_importances_



Mean Squared Error: 0.0125
R-squared: -0.012658227848101333
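
These two numbers are worth a sanity check. The `mpg` values in the preview are between 15 and 18, so an MSE of 0.0125 would mean errors of roughly 0.1 mpg, while an R² just below zero says the model does no better than predicting the mean. One pattern that reproduces both figures exactly is a target that is an almost all-zero indicator column: with 80 test rows, a single 1, and a model that always predicts 0, the MSE is 1/80 and R² is -1/79. The sketch below verifies that arithmetic on hypothetical arrays. It is consistent with the recorded output but does not by itself prove what `y` contained.

```python
import numpy as np

# Hypothetical test target: 80 rows, a single 1 and 79 zeros
y_true = np.zeros(80)
y_true[0] = 1.0

# A model that predicts 0 for every row
y_hat = np.zeros(80)

# MSE and R^2 computed from their definitions
mse = np.mean((y_true - y_hat) ** 2)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
print(np.unique(y_true))
```

Running `np.unique(y_test)` or `y_test.min(), y_test.max()` on the real target is a quick way to confirm that it holds `mpg` values before interpreting the metrics.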
