# Random Forest Regression
This notebook provides an introduction to **Random Forest Regression**, an ensemble technique that predicts continuous numerical values by averaging the outputs of many decision trees.

We'll use scikit-learn's Diabetes dataset and walk through a practical implementation step by step.

## Step 1: Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

## Step 2: Load and Explore the Diabetes Dataset
We'll use the Diabetes dataset for demonstration.

### About the Diabetes Dataset

The **Diabetes Dataset** from `scikit-learn` contains data from 442 diabetes patients. Each patient is represented by 10 baseline variables (features), including:

- **age**
- **sex**
- **body mass index (BMI)**
- **average blood pressure**
- **six blood serum measurements** (e.g., cholesterol, LDL, HDL levels)

The target variable represents a quantitative measure of **disease progression** one year after baseline.

In [None]:
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

df = pd.DataFrame(X, columns=diabetes.feature_names)
df['target'] = y

print(df.head())
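
As a quick sanity check on the description above, the sketch below confirms the dataset's shape and shows that `load_diabetes` returns features that are already mean-centered and scaled by default, so no further standardization is needed before fitting. The names `col_means` and `col_sq_norms` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# 442 patients, 10 baseline features, one continuous target
print(X.shape, y.shape)

# Each feature column is mean-centered and scaled so that
# its squared values sum to approximately 1
col_means = X.mean(axis=0)
col_sq_norms = (X ** 2).sum(axis=0)
print(np.round(col_means, 6))
print(np.round(col_sq_norms, 6))
```

Because of this scaling, the values printed by `df.head()` are small numbers around zero rather than raw ages or blood pressure readings.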

## Step 3: Splitting the Dataset
Split data into training (70%) and testing (30%) sets.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

## Step 4: Training a Random Forest Regressor

In [None]:
rf_reg = RandomForestRegressor(n_estimators=100, random_state=14)
rf_reg.fit(X_train, y_train)
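
To illustrate what "ensemble" means here, the following sketch checks that a fitted forest's prediction is the average of its individual trees' predictions. It refits the same model on the same split so that it runs on its own; `forest`, `per_tree`, and `manual_mean` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=14
)

forest = RandomForestRegressor(n_estimators=100, random_state=14)
forest.fit(X_train, y_train)

# Collect each tree's predictions: one row per tree, one column per test sample
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])

# Averaging over trees should reproduce the forest's own prediction
manual_mean = per_tree.mean(axis=0)
print(np.allclose(manual_mean, forest.predict(X_test)))

# The spread across trees gives a rough sense of how much the trees disagree
print(per_tree.std(axis=0)[:5])
```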

### Evaluating Model Performance

In [None]:
y_pred = rf_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared Score: {r2:.2f}')
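
To make the two metrics concrete, here is a small sketch that computes them by hand on four made-up values (not taken from the Diabetes dataset) and compares the result with scikit-learn. It also reports the root mean squared error, which is in the same units as the target and is often easier to interpret than MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only, not from the Diabetes dataset
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals; RMSE: its square root
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.3f}")   # 0.375
print(f"RMSE: {rmse_manual:.3f}")  # 0.612
print(f"R^2:  {r2_manual:.3f}")    # 0.949

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

An R-squared of 0 corresponds to always predicting the mean of the test targets, so values only slightly above 0 indicate a weak model.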

## Step 5: Visualizing Predicted vs Actual Values
Visualize how the model's predictions compare to actual values.

In [None]:
plt.figure(figsize=(8, 6))
# Reuse the predictions computed in the evaluation cell
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Random Forest Regression: Actual vs Predicted')
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([y.min(), y.max()], [y.min(), y.max()], '--')
plt.tight_layout()
plt.show()

## Step 6: Feature Importance
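Examine which features contribute most to the predictions. The fitted forest exposes impurity-based scores in `feature_importances_`, normalized to sum to 1 across features.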

In [None]:
importances = rf_reg.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature Importances:")
for idx in indices:
    print(f"{diabetes.feature_names[idx]}: {importances[idx]:.4f}")

plt.figure(figsize=(8, 6))
plt.title("Feature Importances in Random Forest Regression")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [diabetes.feature_names[i] for i in indices], rotation=45)
plt.ylabel('Importance Score')
plt.tight_layout()
plt.show()
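
Impurity-based importances, which is what `feature_importances_` reports, are derived from the training data and can favor features with many distinct values. As a cross-check, a different technique, permutation importance measured on the held-out test set, can be sketched as follows. This is an alternative to the method used above rather than part of the original walkthrough, and `n_repeats=10` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=14
)

forest = RandomForestRegressor(n_estimators=100, random_state=14)
forest.fit(X_train, y_train)

# Shuffle one feature at a time on the test set and record the drop in R^2
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=14
)

# Report features from largest to smallest mean drop in score
order = np.argsort(result.importances_mean)[::-1]
for idx in order:
    name = diabetes.feature_names[idx]
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

If the two rankings broadly agree, that is some reassurance that the impurity-based ordering is not an artifact of the training data.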

## Step 7: Hyperparameter Tuning

We can tune hyperparameters with a grid search, scored by 5-fold cross-validation on the training set, to try to improve model performance.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Use the same seed as the final model so the search evaluates the estimator we retrain
grid_search = GridSearchCV(RandomForestRegressor(random_state=14), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f'Best Parameters: {grid_search.best_params_}')
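
Before running an exhaustive search it helps to know how much work it implies. The sketch below counts the candidate combinations in the grid above using `ParameterGrid`, then multiplies by the number of folds; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Every combination of the listed values is one candidate: 3 * 4 * 3
n_candidates = len(ParameterGrid(param_grid))

# With cv=5, each candidate is fitted once per fold
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 36 180
```

If this becomes too slow on larger data, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel across available cores.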

In [None]:
# Retrain Random Forest with best parameters
best_rf = RandomForestRegressor(**grid_search.best_params_, random_state=14)
best_rf.fit(X_train, y_train)

# Evaluate model performance
y_pred_best = best_rf.predict(X_test)

mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f'Tuned Mean Squared Error: {mse_best:.2f}')
print(f'Tuned R-squared Score: {r2_best:.2f}')
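
Because `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), the fitted search object can also be used for prediction directly instead of retraining by hand. A minimal sketch, using a deliberately reduced grid (`small_grid`, a hypothetical subset chosen only to keep the run short) and 3 folds:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=14
)

# Hypothetical reduced grid, for illustration only
small_grid = {"n_estimators": [50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=14), param_grid=small_grid, cv=3
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated R^2 of the winning candidate
print(search.best_params_)
print(f"Mean CV R^2 of best candidate: {search.best_score_:.2f}")

# best_estimator_ is already refitted on all of X_train
tuned_model = search.best_estimator_
test_r2 = r2_score(y_test, tuned_model.predict(X_test))
print(f"Test R^2: {test_r2:.2f}")
```

Comparing the cross-validated score with the test score is a useful check: tuning is not guaranteed to beat the default settings on held-out data.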


---
## Conclusion
In this notebook, we introduced Random Forest Regression, evaluated its performance, interpreted feature importance, and tuned its hyperparameters with a grid search. Random Forest Regression is effective for continuous prediction tasks because averaging many decorrelated trees reduces variance, which makes it less prone to overfitting than a single decision tree.