# Random Forest Regression — Advanced Theory & Interview Q&A

## Advanced Theory

- **Ensemble Learning:** Combines multiple decision trees to improve accuracy and reduce overfitting.
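- **Bagging:** Each tree is trained on a bootstrap sample, drawn with replacement from the training data.
- **Feature Randomness:** At each split, only a random subset of features is considered, which decorrelates the trees and increases diversity.
- **Out-of-Bag (OOB) Error:** Each sample is scored only by the trees that did not see it during training, giving a nearly unbiased performance estimate without a separate validation set.
- **Feature Importance:** Calculated by mean decrease in impurity (variance reduction, for regression) or by permutation importance.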
- **Handling Missing Values:** Some implementations can handle missing data natively.
- **Extensions:** Extra Trees, Random Forest Classification.
- **Diagnostics:** RMSE, R², residual analysis, feature importance plots.
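- **Limitations:** Less interpretable than a single tree; training and prediction cost grow with the number of trees and the dataset size; predictions cannot extrapolate beyond the range of targets seen in training.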

## Advanced Interview Q&A

**Q1: How does Random Forest Regression reduce overfitting?**  
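A: By averaging predictions from many trees, each trained on a bootstrap sample and restricted to a random subset of features at each split. Individual trees overfit in different ways, so averaging cancels much of their variance.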
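
The averaging effect can be checked with a hand-rolled bagging loop. This is a minimal sketch, not how any library implements it: the synthetic `make_friedman1` data, the choice of 50 trees, and `max_features="sqrt"` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; any tabular dataset would do)
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 50
tree_preds = []
for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature randomness: each split considers a random subset of features
    tree = DecisionTreeRegressor(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    tree_preds.append(tree.predict(X_test))
tree_preds = np.array(tree_preds)  # shape: (n_trees, n_test_samples)

# Average test MSE of the individual trees vs. MSE of the averaged prediction
mean_tree_mse = np.mean([mean_squared_error(y_test, p) for p in tree_preds])
ensemble_mse = mean_squared_error(y_test, tree_preds.mean(axis=0))

print("Mean single-tree MSE:", round(mean_tree_mse, 3))
print("Averaged-ensemble MSE:", round(ensemble_mse, 3))
```

Because squared error is convex, the MSE of the averaged prediction can never exceed the average MSE of the individual trees. How large the gap is depends on how decorrelated the trees are.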

**Q2: What is Out-of-Bag (OOB) error?**  
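A: The error measured by predicting each training sample with only the trees whose bootstrap sample did not contain it. It gives a nearly unbiased estimate of generalization performance without a separate validation set.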
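
In scikit-learn the OOB estimate is available through `oob_score=True`. The sketch below compares it with a held-out test score; the synthetic data and the choice of 200 trees are assumptions for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=800, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# oob_score=True requires bootstrap=True (the default for random forests)
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_r2 = rf.oob_score_             # R^2 computed from out-of-bag predictions
test_r2 = rf.score(X_test, y_test)  # R^2 on data the forest never saw

print("OOB R^2: ", round(oob_r2, 3))
print("Test R^2:", round(test_r2, 3))
```

With enough trees the two numbers are usually close, which is why the OOB score is often used as a cheap stand-in for cross-validation.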

**Q3: How is feature importance calculated in Random Forest Regression?**  
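A: Either by mean decrease in impurity (the total variance reduction a feature contributes across all splits, averaged over trees) or by permutation importance (the drop in model score when a feature's values are shuffled).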

**Q4: What is the difference between Random Forest and Extra Trees?**  
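A: Extra Trees draw split thresholds at random for each candidate feature instead of searching for the best threshold, and by default train each tree on the full dataset rather than a bootstrap sample. This increases diversity and speeds up training, usually at the cost of slightly higher bias.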
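
A side-by-side comparison in scikit-learn might look like the sketch below. The dataset, tree count, and 3-fold cross-validation are assumptions, and which model scores higher depends on the data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

models = {
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Mean R^2 over 3 cross-validation folds
    cv_scores[name] = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    # The bootstrap flag shows one default difference between the two estimators
    print(name, "| bootstrap =", model.bootstrap, "| CV R^2 =", round(cv_scores[name], 3))
```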

**Q5: How do you assess model fit?**  
A: Use RMSE and R² on held-out data, residual analysis to look for systematic error, and feature importance plots to sanity-check what the model relies on.
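
These diagnostics can be computed in a few lines. The sketch below assumes synthetic data and default forest settings; RMSE is derived with `np.sqrt` so that it does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = rf.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_test, y_pred)
residuals = y_test - y_pred    # positive values mean under-prediction

print("RMSE:", round(rmse, 3), "| R^2:", round(r2, 3))
# A residual mean far from zero would point to systematic bias
print("Residual mean:", round(residuals.mean(), 3))
print("Residual std: ", round(residuals.std(), 3))
```

Plotting `residuals` against `y_pred` is the usual next step: a visible trend or funnel shape suggests the model is missing structure in the data.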

**Q6: How do you handle missing values in Random Forest Regression?**  
A: Some implementations handle missing data natively; otherwise, impute the missing values (for example with the median) before fitting.
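
Where native support is unavailable or uncertain, imputing inside a pipeline is a portable fallback. In this sketch the 10% missingness rate and the median strategy are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# Knock out roughly 10% of the entries at random to simulate missing data
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.25, random_state=0
)

# The imputer learns medians on the training split only, which avoids leakage
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Test R^2 with imputed inputs:", round(model.score(X_test, y_test), 3))
```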

**Q7: What are the limitations of Random Forest Regression?**  
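A: Less interpretable than a single tree, slower to train and predict on large datasets, and unable to extrapolate beyond the target range seen in training.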

**Q8: How do you tune hyperparameters in Random Forest Regression?**  
A: Use grid search or randomized search with cross-validation over parameters such as `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features`.
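
When the grid is large, randomized search samples a fixed number of combinations instead of trying all of them. The candidate values, `n_iter=8`, and 3-fold cross-validation below are assumptions for the sketch, not recommended settings.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)

# Candidate values to sample from (illustrative only)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=8,          # evaluate 8 sampled combinations instead of all 54
    cv=3,
    scoring="r2",
    random_state=0,    # passing n_jobs=-1 would parallelize the fits
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV R^2:", round(search.best_score_, 3))
```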

**Q9: What is permutation importance?**  
A: It measures a feature's importance by shuffling that feature's values and recording how much the model's score (for regression, e.g. R² or MSE) degrades, ideally on held-out data.
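
scikit-learn exposes this as `sklearn.inspection.permutation_importance`. The sketch below runs it on held-out data next to the impurity-based importances; the synthetic Friedman #1 data (where only the first five features carry signal) is an assumption chosen so the result is easy to check.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Features 0-4 drive the target; features 5-9 are pure noise
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 5 times on the test set and record the drop in R^2
perm = permutation_importance(
    rf, X_test, y_test, scoring="r2", n_repeats=5, random_state=0
)

for i in np.argsort(perm.importances_mean)[::-1]:
    print(
        f"feature {i}: permutation = {perm.importances_mean[i]:.3f}, "
        f"impurity = {rf.feature_importances_[i]:.3f}"
    )
```

Impurity-based importances are computed from the training data and always sum to 1, while permutation importances are score drops and can be near zero or slightly negative for uninformative features.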

**Q10: What is the bias-variance tradeoff in Random Forest Regression?**  
A: Adding trees reduces variance, with diminishing returns that are limited by how correlated the trees are, while bias stays close to that of an individual tree.
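
The variance claim can be made precise under a simplifying assumption: the $B$ trees $T_b(x)$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

As $B$ grows, the second term vanishes but the first does not, so extra trees stop helping once $\rho\,\sigma^2$ dominates. Feature randomness matters because it lowers $\rho$.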

# Random Forest Regression — Theory & Interview Q&A

Random Forest Regression is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting for continuous outcomes.

| Aspect                | Details                                                                 |
|-----------------------|------------------------------------------------------------------------|
| **Definition**        | Ensemble of decision trees for regression, uses averaging.               |
| **Equation**          | $\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$, the average of the $B$ trees' predictions |
| **Use Cases**         | Price prediction, time series, environmental modeling                   |
| **Assumptions**       | No strict assumptions, handles non-linear relationships                 |
| **Pros**              | Reduces overfitting, handles mixed data, robust to noise                |
| **Cons**              | Less interpretable, slower for large datasets                           |
| **Key Parameters**    | n_estimators, max_depth, min_samples_split, max_features                |
| **Evaluation Metrics**| MSE, RMSE, R² Score                                                     |
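
The averaging described in the table can be verified against a fitted scikit-learn forest. This is a small sketch on assumed synthetic data: it recomputes the prediction by hand from the individual trees in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
rf = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

X_new = X[:5]
# One row of predictions per fitted tree, then the mean across trees
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
manual_average = per_tree.mean(axis=0)

print("Forest prediction:", np.round(rf.predict(X_new), 3))
print("Manual average:   ", np.round(manual_average, 3))
```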

## Interview Q&A

**Q1: What is Random Forest Regression?**  
A: An ensemble of decision trees that averages their predictions for better accuracy.

**Q2: How does Random Forest reduce overfitting?**  
A: By averaging predictions from many trees, each trained on a bootstrap sample with random feature subsets at each split, so that individual trees' errors partly cancel.

**Q3: What is bagging?**  
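A: Bootstrap aggregating: each tree is trained on a sample drawn with replacement from the training data, and the trees' predictions are then aggregated (averaged, for regression).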
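
One consequence of sampling with replacement is that a bootstrap sample of size n contains only about 63.2% of the distinct rows (the fraction tends to 1 - 1/e), leaving the rest out-of-bag. A quick numerical check, with n = 10,000 as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw n row indices with replacement, as a bootstrap sample would
idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(idx).size / n   # distinct rows that were drawn
oob_fraction = 1.0 - in_bag_fraction         # rows never drawn (out-of-bag)

print("In-bag fraction:     ", round(in_bag_fraction, 3))
print("Out-of-bag fraction: ", round(oob_fraction, 3))
print("Theoretical in-bag:  ", round(1 - 1 / np.e, 3))
```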

**Q4: What are the advantages of Random Forest Regression?**  
A: Robust to noise, reduces overfitting, handles mixed data types.

**Q5: What are the limitations?**  
A: Less interpretable, slower for large datasets.

**Q6: How do you evaluate Random Forest Regression?**  
A: Using MSE, RMSE, and R² score.

In [None]:
# 1️⃣ Import Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# 2️⃣ Load Dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# 3️⃣ Split Dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4️⃣ Create Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Not required: tree splits are unaffected by feature scaling; kept to show the pipeline pattern
    ('rf', RandomForestRegressor(random_state=42))
])

# 5️⃣ Hyperparameter Tuning
param_grid = {
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [None, 5, 10, 15],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__max_features': ['sqrt', 'log2', None]
}

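# 324 parameter combinations x 5 folds = 1,620 fits; n_jobs=-1 spreads them across all CPU cores
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)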
grid_search.fit(X_train, y_train)

# 6️⃣ Best Model Evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid_search.best_params_)
mse = mean_squared_error(y_test, y_pred)
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", np.sqrt(mse))  # same units as the target

# 7️⃣ Feature Importance Visualization
importances = best_model.named_steps['rf'].feature_importances_
feature_names = X.columns
feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
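# A single color is used because passing palette without hue is deprecated in recent seaborn releases
sns.barplot(x='Importance', y='Feature', data=feat_imp_df, color='steelblue')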
plt.title("Feature Importance in Random Forest Regressor")
plt.show()

# 8️⃣ Predicted vs Actual Visualization
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, alpha=0.6, color='teal')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Random Forest Regressor: Predicted vs Actual")
plt.show()
