# Random Forest Regression Analysis

This notebook evaluates the performance of Random Forest Regressors on the FINAL_cleaned.csv dataset using:

- Train/Test Splits
- K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOO-CV)
- Performance metrics: MSE, MAE, and R²
- Hyperparameter tuning: n_estimators, max_depth

### Import Libraries & Load Dataset

This cell:

- Imports core libraries for data handling (numpy, pandas), visualization (matplotlib, seaborn), and modeling (sklearn).
- Loads the dataset FINAL_cleaned.csv.
- Defines feature matrix X and target vector y.
- Sets up cross-validation and scoring metrics for regression evaluation.

In [3]:
# === Libraries ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, make_scorer

sns.set(style="whitegrid")

# === Load Dataset ===
df = pd.read_csv("FINAL_cleaned.csv")  # Adjust path if needed

# === Define Features and Target ===
X = df.drop(columns=["Quality_of_Life_Value", "Quality_of_Life_Category", "Country", "Sub-region"])
y = df["Quality_of_Life_Value"]
print(X.head())
print(y.head())

# === Cross-validation ===
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# === Scorers ===
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)


   Sub_region_encoded  Coastline_wf   Latitude   Longitude  Elevation_m  \
0                  10         362.0  41.153332   20.168331        708.0   
1                   5         998.0  28.033886    1.659626        800.0   
2                   4        4989.0 -38.416097  -63.616672        595.0   
3                  12           0.0  40.069099   45.038189       1792.0   
4                   0       25760.0 -25.274398  133.775136        330.0   

   Temperature_C  
0          12.44  
1          23.60  
2          16.30  
3           7.82  
4          22.05  
0    104.16
1     98.83
2    115.06
3    116.56
4    190.69
Name: Quality_of_Life_Value, dtype: float64
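
The scorers defined above use `greater_is_better=False`, so scikit-learn returns negated errors and the later cells flip the sign back. The sketch below illustrates that convention on synthetic stand-in data (an assumption: `make_regression` output, not FINAL_cleaned.csv); all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=80, n_features=6, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
mse_scorer_demo = make_scorer(mean_squared_error, greater_is_better=False)
model_demo = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=42)

# Custom scorer: values come back negated (higher score = lower error)
raw_scores = cross_val_score(model_demo, X_demo, y_demo, cv=kf_demo, scoring=mse_scorer_demo)

# Built-in string alias that follows the same sign convention
builtin_scores = cross_val_score(
    model_demo, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
)

# Flip the sign to recover ordinary per-fold MSE values
mse_per_fold = -raw_scores
print(mse_per_fold)
```

The built-in strings `"neg_mean_squared_error"` and `"neg_mean_absolute_error"` could replace the custom scorers without changing the results.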


### Train-Test Evaluation Across Hyperparameters

This cell:

- Evaluates Random Forest performance on a single 60/40 Train/Test Split
- Loops through combinations of:
    - n_estimators ∈ [100, 200, 400]
    - max_depth ∈ [3, 4, 5, 6, 7]
- Computes and stores MSE, MAE, and R² on the test set
- Displays the top-performing configurations sorted by R²

In [18]:
# Parameters to explore
train_ratio = 0.6
random_state = 42
n_estimators_list = [100, 200, 400]
max_depth_list = [3, 4, 5, 6, 7]

rows = []

# Split once: the partition does not depend on the hyperparameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_ratio, random_state=random_state
)

# Loop through combinations
for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:

        # Model initialization and training
        model = RandomForestRegressor(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       random_state=random_state)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Evaluation
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        rows.append({
            "split": "Train/Test",
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "MSE": mse,
            "MAE": mae,
            "R2": r2
        })

# Create and display results table
df_rf_split = pd.DataFrame(rows)
display(df_rf_split.sort_values(by="R2", ascending=False))


| | split | n_estimators | max_depth | MSE | MAE | R2 |
|---|---|---|---|---|---|---|
| 3 | Train/Test | 100 | 6 | 1361.044185 | 27.020675 | 0.521953 |
| 2 | Train/Test | 100 | 5 | 1373.607055 | 27.050909 | 0.517541 |
| 8 | Train/Test | 200 | 6 | 1381.130565 | 27.112353 | 0.514898 |
| 7 | Train/Test | 200 | 5 | 1394.923476 | 27.122108 | 0.510054 |
| 9 | Train/Test | 200 | 7 | 1404.644013 | 27.353788 | 0.506639 |
| 4 | Train/Test | 100 | 7 | 1408.812912 | 27.493401 | 0.505175 |
| 1 | Train/Test | 100 | 4 | 1425.483217 | 27.477298 | 0.49932 |
| 6 | Train/Test | 200 | 4 | 1440.855672 | 27.55764 | 0.493921 |
| 14 | Train/Test | 400 | 7 | 1458.470647 | 27.966432 | 0.487734 |
| 12 | Train/Test | 400 | 5 | 1460.949235 | 27.805657 | 0.486863 |
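
Because the scores above come from one random partition, they can shift when the split seed changes. The following is a minimal sketch of how that sensitivity could be checked, assuming synthetic stand-in data from `make_regression` and an arbitrary fixed configuration; the seeds and the `r2_by_seed` name are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=6, noise=20.0, random_state=0)

r2_by_seed = {}
for seed in range(5):
    # Same 60/40 ratio as the notebook, but a different partition per seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=0.6, random_state=seed
    )
    # Keep the model seed fixed so only the partition changes
    model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
    model.fit(X_tr, y_tr)
    r2_by_seed[seed] = r2_score(y_te, model.predict(X_te))

r2_values = np.array(list(r2_by_seed.values()))
print(f"R2 range across seeds: {r2_values.min():.3f} to {r2_values.max():.3f}")
```

A wide range across seeds would suggest that small differences between configurations in a single-split table should not be over-interpreted.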


### 5-Fold Cross-Validation (Grid Search)

This cell:

- Applies 5-Fold Cross-Validation across:
    - n_estimators ∈ [100, 200, 400]
    - max_depth ∈ [5, 6, 7, 8, 9]
- Reports mean ± std for:
    - Mean Squared Error (MSE)
    - Mean Absolute Error (MAE)
    - R² Score (R²)
- Aggregates results into a DataFrame and sorts by mean R², highest first

In [20]:
# Define hyperparameter grid
n_estimators_list = [100, 200, 400]
max_depth_list = [5, 6, 7, 8, 9]

rows_rf_kfold = []

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        model = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )

        mse_scores = -cross_val_score(model, X, y, cv=kf, scoring=mse_scorer)
        mae_scores = -cross_val_score(model, X, y, cv=kf, scoring=mae_scorer)
        r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

        rows_rf_kfold.append({
            "split": "KFold",
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "MSE_mean": mse_scores.mean(),
            "MSE_std": mse_scores.std(),
            "MAE_mean": mae_scores.mean(),
            "MAE_std": mae_scores.std(),
            "R2_mean": r2_scores.mean(),
            "R2_std": r2_scores.std()
        })

df_rf_kfold = pd.DataFrame(rows_rf_kfold)
display(df_rf_kfold.sort_values(by="R2_mean", ascending=False))


| | split | n_estimators | max_depth | MSE_mean | MSE_std | MAE_mean | MAE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|---|---|
| 3 | KFold | 100 | 8 | 1186.184975 | 347.243142 | 27.26859 | 3.692511 | 0.393727 | 0.249162 |
| 8 | KFold | 200 | 8 | 1193.320205 | 378.993312 | 27.226056 | 3.964573 | 0.389835 | 0.258094 |
| 13 | KFold | 400 | 8 | 1184.390433 | 369.822475 | 27.150573 | 4.094474 | 0.388575 | 0.270002 |
| 2 | KFold | 100 | 7 | 1198.544638 | 353.241541 | 27.473618 | 3.770685 | 0.38765 | 0.251656 |
| 4 | KFold | 100 | 9 | 1199.371008 | 351.600043 | 27.490084 | 3.770179 | 0.387447 | 0.250512 |
| 0 | KFold | 100 | 5 | 1203.324106 | 344.767555 | 27.348297 | 3.691165 | 0.387427 | 0.243873 |
| 7 | KFold | 200 | 7 | 1199.935409 | 383.235056 | 27.339093 | 4.020781 | 0.387392 | 0.257448 |
| 12 | KFold | 400 | 7 | 1189.081825 | 369.983719 | 27.236152 | 4.119601 | 0.386121 | 0.270963 |
| 5 | KFold | 200 | 5 | 1206.326889 | 381.195238 | 27.274298 | 4.055602 | 0.385651 | 0.25322 |
| 14 | KFold | 400 | 9 | 1189.2303 | 372.673863 | 27.224157 | 4.176483 | 0.384981 | 0.274385 |
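
The cell above calls `cross_val_score` three times per configuration, which refits the forest once per metric. A possible alternative is `cross_validate`, which accepts several scorers and fits each fold only once. This is a sketch on synthetic stand-in data (an assumption), using scikit-learn's built-in scorer strings instead of the notebook's custom scorers; the `_demo` names and the `summary` dict are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=80, n_features=6, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42)

# One pass over the folds computes all three metrics
scoring = {
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
cv_out = cross_validate(model_demo, X_demo, y_demo, cv=kf_demo, scoring=scoring)

# Error metrics are returned negated, so flip the sign before averaging
summary = {
    "MSE_mean": -cv_out["test_mse"].mean(),
    "MSE_std": cv_out["test_mse"].std(),
    "MAE_mean": -cv_out["test_mae"].mean(),
    "MAE_std": cv_out["test_mae"].std(),
    "R2_mean": cv_out["test_r2"].mean(),
    "R2_std": cv_out["test_r2"].std(),
}
print(summary)
```

With a fixed `random_state` on both the model and the splitter, this should give the same per-fold values as three separate calls while doing roughly a third of the fitting work.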


### Leave-One-Out Cross-Validation (LOO-CV)

This cell:

- Applies LOO-CV for fine-grained error estimation using:
    - n_estimators ∈ [100, 200, 400]
    - max_depth ∈ [5, 6, 7, 8, 9]
- Evaluates only MSE and MAE; R² is omitted because it is not defined on a single held-out sample
- Stores the mean and std of the per-sample errors across all folds (one fold per sample)
- Helps identify the lowest-error configurations across exhaustive testing

In [21]:
loo = LeaveOneOut()

n_estimators_list = [100, 200, 400]
max_depth_list = [5, 6, 7, 8, 9]

rows_rf_loo = []

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        model = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )

        mse_scores = -cross_val_score(model, X, y, cv=loo, scoring=mse_scorer)
        mae_scores = -cross_val_score(model, X, y, cv=loo, scoring=mae_scorer)

        rows_rf_loo.append({
            "split": "LOO",
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "MSE_mean": mse_scores.mean(),
            "MSE_std": mse_scores.std(),
            "MAE_mean": mae_scores.mean(),
            "MAE_std": mae_scores.std(),
            "R2_mean": None,
            "R2_std": None
        })

df_rf_loo = pd.DataFrame(rows_rf_loo)
display(df_rf_loo.sort_values(by="MSE_mean"))  # lower MSE is better


| | split | n_estimators | max_depth | MSE_mean | MSE_std | MAE_mean | MAE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|---|---|
| 8 | LOO | 200 | 8 | 1044.667893 | 1691.946408 | 25.25786 | 20.167012 | | |
| 14 | LOO | 400 | 9 | 1049.404879 | 1711.436518 | 25.3665 | 20.148091 | | |
| 13 | LOO | 400 | 8 | 1050.174062 | 1718.477577 | 25.326039 | 20.217957 | | |
| 12 | LOO | 400 | 7 | 1050.846143 | 1723.509191 | 25.283394 | 20.287832 | | |
| 2 | LOO | 100 | 7 | 1052.299956 | 1767.163253 | 25.118171 | 20.52748 | | |
| 11 | LOO | 400 | 6 | 1052.338684 | 1712.80252 | 25.383256 | 20.199727 | | |
| 7 | LOO | 200 | 7 | 1055.46725 | 1730.12828 | 25.219659 | 20.480138 | | |
| 9 | LOO | 200 | 9 | 1056.984314 | 1722.093542 | 25.35315 | 20.351956 | | |
| 3 | LOO | 100 | 8 | 1059.286694 | 1778.396939 | 25.257072 | 20.527226 | | |
| 6 | LOO | 200 | 6 | 1059.310166 | 1711.972884 | 25.40703 | 20.341902 | | |
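
R² cannot be computed per fold under LOO-CV because each fold holds out one sample. One way to still obtain an R² figure is to pool the out-of-fold predictions with `cross_val_predict` and score them once. The sketch below assumes synthetic stand-in data and a small forest for speed; it is not what the notebook ran, and the pooled value is not directly comparable to a per-fold mean R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=40, n_features=6, noise=10.0, random_state=0)

model_demo = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42)

# Each prediction comes from a model that never saw that sample
y_oof = cross_val_predict(model_demo, X_demo, y_demo, cv=LeaveOneOut())

# Score all held-out predictions together
pooled_r2 = r2_score(y_demo, y_oof)
pooled_mse = mean_squared_error(y_demo, y_oof)
print(f"pooled R2: {pooled_r2:.3f}, pooled MSE: {pooled_mse:.1f}")
```

The pooled MSE equals the mean of the per-sample squared errors, so it matches the `MSE_mean` column computed from per-fold scores.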


### Final Summary & Best Configurations

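This section summarizes the top-performing Random Forest configurations under each evaluation strategy:


- Best Train/Test Split configuration:
    - n_estimators = 100, max_depth = 6 (MSE ≈ 1361.0, MAE ≈ 27.0, R² ≈ 0.52)
    - Offers quick insight, but the scores depend on the single random partition
- Best 5-Fold CV configuration:
    - n_estimators = 100, max_depth = 8 has the highest mean R² (0.39 ± 0.25); among the rows shown, n_estimators = 400, max_depth = 8 has a marginally lower mean MSE (1184.4 vs. 1186.2)
    - Averages over five folds, so the estimate is more stable than a single split, although the fold-to-fold spread is still large
    - Suitable for balancing training size and evaluation robustness
- Best LOO-CV configuration:
    - n_estimators = 200, max_depth = 8 has the lowest mean MSE (≈ 1044.7), with MAE ≈ 25.3
    - The lower MSE compared with the other strategies likely reflects that each model trains on all but one sample, so scores are not directly comparable across strategies
    - Computationally expensive (one fit per sample) but valuable for small datasets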
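
Selecting the best configuration per strategy can be sketched as a small helper. The rows below are copied from the first rows of the three results tables in this notebook; the helper name `best_row` and the `*_top` frames are hypothetical and only illustrate the lookup.

```python
import pandas as pd

# Leading rows copied from the results tables in this notebook
split_top = pd.DataFrame([
    {"n_estimators": 100, "max_depth": 6, "MSE": 1361.044185, "R2": 0.521953},
    {"n_estimators": 100, "max_depth": 5, "MSE": 1373.607055, "R2": 0.517541},
])
kfold_top = pd.DataFrame([
    {"n_estimators": 100, "max_depth": 8, "MSE_mean": 1186.184975, "R2_mean": 0.393727},
    {"n_estimators": 200, "max_depth": 8, "MSE_mean": 1193.320205, "R2_mean": 0.389835},
    {"n_estimators": 400, "max_depth": 8, "MSE_mean": 1184.390433, "R2_mean": 0.388575},
])
loo_top = pd.DataFrame([
    {"n_estimators": 200, "max_depth": 8, "MSE_mean": 1044.667893},
    {"n_estimators": 400, "max_depth": 9, "MSE_mean": 1049.404879},
])


def best_row(df, column, higher_is_better):
    """Return (n_estimators, max_depth) of the best row for one metric column."""
    idx = df[column].idxmax() if higher_is_better else df[column].idxmin()
    return int(df.at[idx, "n_estimators"]), int(df.at[idx, "max_depth"])


print("Train/Test by R2:", best_row(split_top, "R2", True))
print("KFold by R2:     ", best_row(kfold_top, "R2_mean", True))
print("KFold by MSE:    ", best_row(kfold_top, "MSE_mean", False))
print("LOO by MSE:      ", best_row(loo_top, "MSE_mean", False))
```

For the K-Fold rows, the selected configuration differs depending on whether mean R² or mean MSE is used as the ranking metric, so the selection criterion is worth stating alongside the result.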

<img src="RFT.png" width="500"/>
