## Regression Analysis with Random Forest and Extra Trees

In this notebook, you will explore how to use **Random Forest Regressor** and **Extra Trees Regressor** from the `scikit-learn` library for a regression task. We will use the [Auto MPG dataset](https://www.kaggle.com/datasets/uciml/autompg-dataset) for this purpose.

In [None]:
import kagglehub
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## 2. Download and Load the Dataset

**Info from Kaggle:**

- This dataset is a slightly modified version of the dataset provided in
  the StatLib library. In line with the use by Ross Quinlan (1993) in
  predicting the attribute "mpg", 8 of the original instances were removed
  because they had unknown values for the "mpg" attribute. The original
  dataset is available in the file "auto-mpg.data-original". 
- "The data concerns city-cycle fuel consumption in miles per gallon,
  to be predicted in terms of 3 multivalued discrete and 5 continuous
  attributes." (Quinlan, 1993)
- Number of Instances: 398
- Number of Attributes: 9 including the class attribute
- Attribute Information:
  - mpg: continuous
  - cylinders: multi-valued discrete
  - displacement: continuous
  - horsepower: continuous
  - weight: continuous
  - acceleration: continuous
  - model year: multi-valued discrete
  - origin: multi-valued discrete
  - car name: string (unique for each instance)
- Missing Attribute Values: horsepower has 6 missing values

In [None]:
# Download the latest version of the dataset
path = kagglehub.dataset_download("uciml/autompg-dataset")

print("Path to dataset files:", path)

In [None]:
!ls -lh $path

In [None]:
# Load the dataset
data_path = path + "/auto-mpg.csv"
auto_df = pd.read_csv(data_path)
auto_df_orig = auto_df.copy()

## 3. Explore the Dataset

In [None]:
auto_df.head()

In [None]:
auto_df.info()

The `horsepower` column has dtype `object` rather than a numeric dtype, which suggests it contains missing or non-numeric entries. Listing its unique values shows what they are.

In [None]:
unique_horsepower = auto_df.horsepower.unique()
unique_horsepower.sort()
unique_horsepower[::-1]
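
Scanning the sorted unique values works, but the offending entries can also be isolated programmatically. The snippet below is a minimal sketch on a toy Series (the values are illustrative, not taken from the dataset): `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN`, so the rows that become `NaN` are the non-numeric ones.

```python
import pandas as pd

# Toy stand-in for the raw `horsepower` column (illustrative values only).
raw_hp = pd.Series(["130", "165", "?", "150", "?", "95"], name="horsepower")

# errors="coerce" converts unparseable entries to NaN instead of raising.
parsed_hp = pd.to_numeric(raw_hp, errors="coerce")

# The entries that failed to parse are the non-numeric placeholders.
bad_values = raw_hp[parsed_hp.isna()].unique().tolist()
n_bad = int(parsed_hp.isna().sum())

print(bad_values)  # ['?']
print(n_bad)       # 2
```

Applied to the notebook's data, the same pattern would use `auto_df['horsepower']` in place of `raw_hp`.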

## 4. Data Preprocessing

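Clean the `horsepower` column: replace the `?` placeholders with `NaN`, convert the column to `float`, and then drop the rows that are missing a value.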

In [None]:
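# Replace the '?' placeholder with NaN and convert the column to float.
# Assigning the result back avoids a chained in-place update on a column,
# which recent pandas versions warn about and may silently ignore.
auto_df['horsepower'] = auto_df['horsepower'].replace('?', np.nan).astype(float)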

In [None]:
auto_df.isnull().sum()

In [None]:
auto_df.dropna(subset=['horsepower'], inplace=True)
auto_df.reset_index(drop=True, inplace=True)

One-hot encode the categorical `origin` column and drop the `car name` column, which is a free-text label rather than a usable numeric feature.

In [None]:
auto_df = pd.get_dummies(auto_df, columns=['origin'], prefix='origin')
auto_df.drop('car name', axis=1, inplace=True)
auto_df.head()
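
To make the effect of one-hot encoding concrete, here is a small sketch on a toy frame. It assumes `origin` holds the integer codes 1, 2, and 3; the `mpg` values are made up for illustration.

```python
import pandas as pd

# Toy frame with a numeric target and a categorical code column.
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0], "origin": [1, 2, 3]})

# Each distinct code becomes its own indicator column named `origin_<code>`.
encoded = pd.get_dummies(toy, columns=["origin"], prefix="origin")

print(encoded.columns.tolist())
# ['mpg', 'origin_1', 'origin_2', 'origin_3']

# Every row has exactly one active indicator.
indicator_cols = ["origin_1", "origin_2", "origin_3"]
print(encoded[indicator_cols].sum(axis=1).tolist())  # [1, 1, 1]
```

Tree-based models do not strictly need this encoding for integer codes, but it prevents the model from treating the codes as an ordered quantity.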

In [None]:
# Separate features and target variable  
X = auto_df.drop('mpg', axis=1)
y = auto_df['mpg']

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
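
As a quick sanity check on the split sizes, the sketch below repeats the same call on placeholder arrays. It assumes 392 rows remain after cleaning (398 instances minus the 6 with missing `horsepower`); the arrays themselves are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 392  # assumed row count after dropping missing horsepower values
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 392) = 79, leaving 313 for training.
print(len(X_tr), len(X_te))  # 313 79

# Reusing the same random_state reproduces the same partition.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```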

## 5. Initialize, Train, and Predict

In [None]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Initialize the Random Forest Regressor.
# random_state fixes the seed so that results are reproducible.
rf_regressor = RandomForestRegressor(random_state=42)

# Initialize the Extra Trees Regressor with the same seed.
et_regressor = ExtraTreesRegressor(random_state=42)

# Fit the models to the training data
rf_regressor.fit(X_train, y_train)
et_regressor.fit(X_train, y_train)

In [None]:
y_pred_rf = rf_regressor.predict(X_test)
y_pred_et = et_regressor.predict(X_test)
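
Both estimators average the predictions of many decision trees; they differ mainly in how each tree is randomized. As a brief sketch of that difference, assuming scikit-learn's default settings, the snippet below inspects the `bootstrap` parameter. A random forest trains each tree on a bootstrap sample of the training rows, whereas extra trees use the full training set by default and instead draw split thresholds at random.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
et = ExtraTreesRegressor(random_state=42)

# Random Forest resamples rows per tree; Extra Trees does not by default.
print(rf.bootstrap, et.bootstrap)  # True False

# Both ensembles build the same number of trees unless told otherwise.
print(rf.n_estimators == et.n_estimators)  # True
```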

## 6. Evaluate Model Performance
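
The two metrics used in this section can be written out directly. The sketch below computes them by hand on toy arrays (illustrative values) and checks the results against scikit-learn: MSE is the mean of the squared errors, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Mean Squared Error: (1 + 0 + 1 + 0) / 4 = 0.5
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - SS_res / SS_tot = 1 - 2 / 20 = 0.9
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same units as the target, which makes it easier to read.
rmse_manual = np.sqrt(mse_manual)

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))              # True
```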

In [None]:
# Calculate Mean Squared Error and R-squared
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Regressor Performance:")
print(f"Mean Squared Error: {mse_rf:.2f}")
print(f"R-squared Score: {r2_rf:.2f}")

In [None]:
mse_et = mean_squared_error(y_test, y_pred_et)
r2_et = r2_score(y_test, y_pred_et)

print("Extra Trees Regressor Performance:")
print(f"Mean Squared Error: {mse_et:.2f}")
print(f"R-squared Score: {r2_et:.2f}")
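
A single train/test split can be noisy on a dataset this small, so cross-validation gives a steadier comparison. The sketch below uses `cross_val_score`, which the notebook imports at the top, on synthetic data from `make_regression`. The synthetic data, the 5 folds, and `n_estimators=50` are assumptions chosen to keep the example fast and self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Each call fits the model 5 times, once per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

To run this on the Auto MPG data, pass the notebook's `X` and `y` instead of `X_demo` and `y_demo`.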

## 7. Visualize the Results

### 7.1 Plot Actual vs. Predicted Values

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7, color='b')
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.title('Random Forest Regressor: Actual vs. Predicted MPG')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_et, alpha=0.7, color='g')
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.title('Extra Trees Regressor: Actual vs. Predicted MPG')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')

### 7.2 Residual Plots

In [None]:
residuals_rf = y_test - y_pred_rf
plt.figure(figsize=(10, 6))
sns.histplot(residuals_rf, kde=True, color='b')
plt.title('Random Forest Regressor Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')

In [None]:
residuals_et = y_test - y_pred_et
plt.figure(figsize=(10, 6))
sns.histplot(residuals_et, kde=True, color='g')
plt.title('Extra Trees Regressor Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
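
The histograms can be complemented with a few summary numbers. The sketch below uses toy values (not results from the models above) to show the sign convention: a residual is actual minus predicted, so a positive residual means the model under-predicted.

```python
import numpy as np

y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([21.0, 24.0, 33.0, 34.0])

# Residual = actual - predicted.
residuals = y_true - y_hat
print(residuals.tolist())  # [-1.0, 1.0, -3.0, 1.0]

# A mean far from zero indicates systematic over- or under-prediction.
mean_residual = residuals.mean()
print(mean_residual)  # -0.5

# The mean absolute error summarizes the typical size of a miss.
mae = np.abs(residuals).mean()
print(mae)  # 1.5
```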

### 7.3 Feature Importances

In [None]:
importances_rf = rf_regressor.feature_importances_
indices_rf = np.argsort(importances_rf)[::-1]
features = X.columns

plt.figure(figsize=(10, 6))
plt.title("Random Forest Regressor Feature Importances")
sns.barplot(x=importances_rf[indices_rf], y=features[indices_rf], palette='viridis')
plt.xlabel('Importance Score')
plt.ylabel('Features')

In [None]:
importances_et = et_regressor.feature_importances_
indices_et = np.argsort(importances_et)[::-1]

plt.figure(figsize=(10, 6))
plt.title("Extra Trees Regressor Feature Importances")
sns.barplot(x=importances_et[indices_et], y=features[indices_et], palette='viridis')
plt.xlabel('Importance Score')
plt.ylabel('Features')
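
The same importances can be read as a table, which is handy when exact values matter. The sketch below fits a small forest on synthetic data (the feature names `f0` to `f3` are hypothetical) and builds a sorted `pandas.Series`. The impurity-based scores exposed by `feature_importances_` are normalized, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Pair each score with its column name and sort from most to least important.
importance_table = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_table)
print(round(importance_table.sum(), 6))  # 1.0
```

In the notebook, replacing `model` with `rf_regressor` or `et_regressor` and `X_demo` with `X` would give the corresponding table for each fitted model.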