# Random Forest Regressor

A **Random Forest Regressor** is an ensemble learning method that combines multiple decision trees to produce more accurate and stable predictions for regression tasks. It operates on the "wisdom of crowds" principle: many individually noisy (high-variance) trees, averaged together, form a stronger and more stable predictor.



### Core Components
* **Multiple Decision Trees:** Base estimators that make individual predictions.
* **Bootstrap Aggregating (Bagging):** Training each tree on a bootstrap sample (a random sample of the training data drawn with replacement).
* **Feature Randomness:** Considering only a random subset of features at each split.
* **Averaging Predictions:** The final output is the **average** of all individual tree predictions.

---

## How Random Forest Regression Works

### Training Process
1.  **Create Bootstrap Samples:** Draw random samples *with replacement* from the original training data.
2.  **Build Decision Trees:** For each bootstrap sample, grow a decision tree. Crucially, at each split, the tree considers only a random subset of features (not all features).
3.  **Parallel Training:** All trees are trained independently and usually in parallel.
4.  **Aggregate Results:** Combine predictions from all trees by calculating the mean.
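
The four steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`. This is a minimal illustration, not how `RandomForestRegressor` is implemented internally; the toy data and the helper names `fit_forest` and `predict_forest` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: 200 samples, 4 features, noisy linear target
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + data_rng.normal(scale=0.1, size=200)


def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    forest = []
    for i in range(n_trees):
        # Step 1: bootstrap sample (row indices drawn with replacement)
        idx = rng.integers(0, n_samples, size=n_samples)
        # Step 2: each split considers only a random subset of features
        tree = DecisionTreeRegressor(max_features=max_features, random_state=i)
        # Step 3: trees are independent, so this loop could run in parallel
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest


def predict_forest(forest, X):
    """Step 4: average the per-tree predictions for each row of X."""
    return np.mean([tree.predict(X) for tree in forest], axis=0)


forest = fit_forest(X, y)
y_hat = predict_forest(forest, X[:5])
print(y_hat.shape)  # (5,)
```

Because the trees never share state during fitting, the loop body is what a parallel implementation would distribute across workers.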

### Code Example

```python
from sklearn.ensemble import RandomForestRegressor

# Initialize
# n_estimators: Number of trees (default is 100)
# random_state: Ensures reproducibility
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Train
# rf_reg.fit(X_train, y_train)

# Predict (Returns continuous values)
# predictions = rf_reg.predict(X_test)

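```

```python
import numpy as np

# Random Forest prediction
def predict_random_forest(forest, X):
    """Average the predictions of every fitted tree in `forest`."""
    predictions = []
    for tree in forest:
        pred = tree.predict(X)
        predictions.append(pred)

    # For regression: average across trees (axis=0), one value per sample
    final_prediction = np.mean(predictions, axis=0)
    return final_prediction
```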

### 1. Bootstrap Aggregating (Bagging)
* **Sampling:** Creates multiple datasets by sampling *with replacement*.
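* **Data Distribution:** Each bootstrap sample is the same size as the original dataset but contains only $\approx 63.2\%$ of its unique samples (that is, $1 - 1/e$), because some samples are drawn more than once.
* **OOB:** The remaining $\approx 36.8\%$ of samples, which a given tree never sees, are its **"Out-of-Bag" (OOB)** samples and can be used for internal validation.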
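
The 63.2% / 36.8% split can be checked numerically. The sketch below draws a single bootstrap sample with NumPy and compares the observed fraction of unique rows with the theoretical value $1 - (1 - 1/n)^n$; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement
idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that made it into the sample
unique_fraction = np.unique(idx).size / n_samples
oob_fraction = 1.0 - unique_fraction

# Probability that a given row is drawn at least once in n draws
theory = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples

print(f"in-bag (unique): {unique_fraction:.3f}")
print(f"out-of-bag:      {oob_fraction:.3f}")
print(f"theory 1-(1-1/n)^n: {theory:.3f}")
```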



### 2. Feature Randomness
* **The Logic:** At each split, the algorithm considers only a random subset (`max_features`) of the total features.
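* **Typical Values:** $\sqrt{\text{n\_features}}$ (common for classification) or $\text{n\_features} / 3$ (the classic recommendation for regression). Note that scikit-learn's `RandomForestRegressor` defaults to `max_features=1.0`, which considers all features at each split.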
* **Goal:** Reduces the **correlation** between trees, ensuring they don't all look the same.

### 3. Variance Reduction Formula
For $M$ identically distributed trees with pairwise correlation $\rho$, the variance of the averaged prediction is lower than that of a single tree:

$$\mathrm{Var}(RF) = \rho \cdot \sigma^2 + \frac{(1 - \rho) \cdot \sigma^2}{M}$$

* **$\rho$ (rho):** Correlation between trees (we want this low).
* **$\sigma^2$ (sigma squared):** Variance of individual trees.
* **$M$:** Number of trees in the forest (as $M$ increases, the second term vanishes, leaving $\rho \cdot \sigma^2$ as a floor that only lower correlation can reduce).
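
The formula can be sanity-checked with a small simulation. The sketch below does not train any trees: it stands in for "tree predictions" with equicorrelated Gaussian variables (an assumption made purely for illustration) and compares the empirical variance of their mean with the formula. The function name `forest_variance` is hypothetical.

```python
import numpy as np


def forest_variance(rho, sigma2, n_trees):
    """Variance of the mean of n_trees equicorrelated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


rng = np.random.default_rng(0)
rho, sigma2, n_trees, n_trials = 0.3, 4.0, 50, 20_000

# Each simulated "tree" = shared component + its own independent component,
# which gives variance sigma2 per tree and pairwise correlation rho.
shared = rng.normal(size=(n_trials, 1))
own = rng.normal(size=(n_trials, n_trees))
tree_preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)

empirical = tree_preds.mean(axis=1).var()
predicted = forest_variance(rho, sigma2, n_trees)  # 0.3*4 + 0.7*4/50 = 1.256

print(f"formula:   {predicted:.3f}")
print(f"simulated: {empirical:.3f}")
```

With these settings the floor $\rho \cdot \sigma^2 = 1.2$ accounts for almost all of the remaining variance, which is why adding trees beyond a certain point helps little.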

---

## Key Hyperparameters

```python
rf = RandomForestRegressor(
    n_estimators=100,       # Number of trees
    max_depth=None,         # Maximum tree depth
    min_samples_split=2,    # Minimum samples to split a node
    min_samples_leaf=1,     # Minimum samples at a leaf node
    max_features=1.0,       # Features per split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
    bootstrap=True,         # Use bootstrap samples
    oob_score=False,        # Use out-of-bag samples for scoring
    random_state=42         # Seed for reproducibility
)
```
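
### Types of Feature Importance

1.  **Gini Importance:** Based on the total **impurity reduction** brought by a feature. For regression trees the impurity is the variance (squared error) of the target rather than the Gini index, but the name is often kept.
2.  **Permutation Importance:** Measures the drop in model performance (for regression, typically $R^2$) when a specific feature is randomly shuffled (breaking its relationship with the target).
3.  **Mean Decrease Impurity (MDI):** The impurity decrease from item 1 averaged across all trees in the forest; this is what scikit-learn exposes as `feature_importances_`.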
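
To make the two measures concrete, the sketch below computes both on synthetic data in which feature 0 dominates the target by construction. The dataset and coefficients are assumptions for illustration; `feature_importances_` and `sklearn.inspection.permutation_importance` are the scikit-learn entry points used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: feature 0 is strong, feature 1 is weak, feature 2 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance (MDI): computed from the training process
mdi = model.feature_importances_

# Permutation importance: drop in R^2 on held-out data when a column is shuffled
perm = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

print("MDI:        ", np.round(mdi, 3))
print("Permutation:", np.round(perm.importances_mean, 3))
```

Computing permutation importance on held-out data, as here, measures what the model needs in order to generalize, whereas MDI reflects only how the training splits were made.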



### When to Use Random Forest Regressor

| Criteria | Description |
| :--- | :--- |
| **Good for** | Tabular data, mixed data types, non-linear relationships. |
| **Use when** | Interpretability is somewhat important (via feature importance) but you need higher accuracy than a single tree. |
| **Ideal for** | Medium-sized datasets ($1,000$ - $100,000$ samples). |

---

### FAQ: Random Forest Regressor

#### 1. How does Random Forest reduce overfitting compared to a single Decision Tree?
It uses two main mechanisms to lower variance:
* **Bagging:** Each tree trains on a different bootstrap sample of the data, so the errors of individual trees partly cancel when their predictions are averaged.
* **Feature Randomness:** Trees are forced to de-correlate by considering only a random subset of features at each split, preventing them from all relying on the same dominant features.

#### 2. What is the Out-of-Bag (OOB) score and why is it useful?
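The OOB score is computed by predicting each training sample using only the trees whose bootstrap sample did **not** include it (for any given tree, $\approx 37\%$ of the data is out-of-bag).
* **Built-in Validation:** It acts like a validation set without needing to manually split the data.
* **Reliable Estimate:** It gives a reasonable, often slightly conservative, estimate of generalization performance. In scikit-learn, `oob_score_` for a regressor is an $R^2$ value.
* **Tuning:** It is useful for tuning hyperparameters efficiently, since no separate cross-validation loop is required.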
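
As a rough illustration, the sketch below enables `oob_score=True` and compares the resulting `oob_score_` with the $R^2$ on a held-out test set. The synthetic dataset from `make_regression` is an assumption for this example; on real data the two numbers can differ more.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression problem
X, y = make_regression(
    n_samples=600, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# oob_score=True requires bootstrap=True (the default)
model = RandomForestRegressor(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
model.fit(X_train, y_train)

oob_r2 = model.oob_score_               # R^2 from out-of-bag predictions
test_r2 = model.score(X_test, y_test)   # R^2 on the held-out test set

print(f"OOB R^2:  {oob_r2:.3f}")
print(f"Test R^2: {test_r2:.3f}")
```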

#### 3. How do you interpret feature importance in Random Forest?
* The default (impurity-based) importance represents the **average decrease in impurity** contributed by that feature across all trees.
* **Higher importance** = Greater contribution to the model's predictive power.
* *Note:* Importances are normalized to sum to 1, so read them as a relative ranking rather than as absolute effect sizes. Impurity-based importances can also be inflated for high-cardinality features.

#### 4. When would you choose Random Forest over Gradient Boosting?
Choose Random Forest when:
* **Speed:** You need faster training times (RF trains in parallel; Boosting is sequential).
* **Simplicity:** You want a model that works well "out of the box" with less hyperparameter tuning.
* **Noise:** You are dealing with very noisy data (RF is generally more robust to outliers).
* **Validation:** You want to use OOB scores for quick validation.

#### 5. What are the limitations of Random Forest?
* **Computational Cost:** Can be slow and memory-intensive for very large datasets.
* **Extrapolation:** Poor at predicting values outside the range of the training data (a limitation of all tree-based models).
* **Black Box:** While feature importance helps, it is harder to interpret the specific decision path compared to a single Decision Tree.
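
The extrapolation limitation is easy to reproduce. In the sketch below (toy data assumed for illustration), the forest is trained on $y = 2x$ for $x \in [0, 10]$ and then asked to predict far outside that range; every tree routes such inputs to its right-most leaf, so the prediction plateaus near the largest training target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: y = 2x on the interval [0, 10]
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# One point inside the training range, two far outside it
X_new = np.array([[5.0], [20.0], [100.0]])
preds = model.predict(X_new)

print("x = 5   ->", round(float(preds[0]), 2), "(true value 10)")
print("x = 20  ->", round(float(preds[1]), 2), "(true value 40)")
print("x = 100 ->", round(float(preds[2]), 2), "(true value 200)")
```

If the target is expected to trend beyond the training range, a common workaround is to model the trend separately (for example with a linear model) and let the forest fit the residuals.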

# Random Forest: Classifier vs. Regressor

Just like the single Decision Tree, the Random Forest ensemble adapts its logic based on whether your target variable is a **Category** or a **Number**.

### 1. Random Forest Classifier
**Use this when:** Your target variable is **Categorical** (Classes/Labels).



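* **The Logic:** "Majority Vote."
    * Every tree in the forest makes a prediction (e.g., Tree 1: "Red", Tree 2: "Blue", Tree 3: "Red").
    * The forest counts the votes.
    * The class with the most votes wins.
    * In scikit-learn, `RandomForestClassifier` implements this as a soft vote: it averages the class probabilities predicted by each tree and returns the class with the highest mean probability.
* **Output:** A class label (or class probabilities via `predict_proba`).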
* **Evaluation Metrics:** Accuracy, Confusion Matrix, ROC-AUC.

**Example:**
* **Fraud Detection:** 100 trees look at a transaction. 90 say "Legit", 10 say "Fraud". The model predicts **Legit**.

### 2. Random Forest Regressor
**Use this when:** Your target variable is **Continuous** (Numerical Values).

* **The Logic:** "Averaging."
    * Every tree in the forest predicts a specific value (e.g., Tree 1: 100, Tree 2: 105, Tree 3: 95).
    * The forest calculates the average (mean) of all these predictions.
* **Output:** A continuous number.
* **Evaluation Metrics:** MSE, MAE, RMSE, $R^2$ Score.

**Example:**
* **House Price:** 100 trees predict the price. The average of all 100 predictions is taken as the final price.
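
The averaging logic can be verified directly on a fitted model. The sketch below (synthetic data assumed for illustration) collects the prediction of every tree in `estimators_` and checks that their mean matches what `RandomForestRegressor.predict` returns.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

reg = RandomForestRegressor(n_estimators=20, random_state=0)
reg.fit(X, y)

# One row per tree, one column per sample
per_tree = np.array([tree.predict(X[:3]) for tree in reg.estimators_])

manual_mean = per_tree.mean(axis=0)   # average over trees
forest_pred = reg.predict(X[:3])      # what the forest itself returns

print("per-tree shape:", per_tree.shape)  # (20, 3)
print("match:", np.allclose(manual_mean, forest_pred))
```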

---

### Comparison Table

| Feature | Random Forest **Classifier** | Random Forest **Regressor** |
| :--- | :--- | :--- |
| **Target Type** | **Categories** (Discrete) | **Numbers** (Continuous) |
| **Prediction Method** | Majority Voting (Mode) | Averaging (Mean) |
| **Criterion** | Gini Impurity, Entropy | Squared Error (MSE), Absolute Error |
| **Python Class** | `RandomForestClassifier` | `RandomForestRegressor` |

---

### Code Implementation

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# -------------------------------------------------------
# SCENARIO 1: CLASSIFICATION (Predicting Churn)
# -------------------------------------------------------
# y target: [0, 1, 0, 0, 1] (Classes)

clf = RandomForestClassifier(n_estimators=100, criterion='gini')
# clf.fit(X_train, y_train)
# prediction = clf.predict(X_test)  # Returns 0 or 1


# -------------------------------------------------------
# SCENARIO 2: REGRESSION (Predicting Temperature)
# -------------------------------------------------------
# y target: [23.5, 24.1, 19.8] (Continuous)

reg = RandomForestRegressor(n_estimators=100, criterion='squared_error')
# reg.fit(X_train, y_train)
# prediction = reg.predict(X_test)  # Returns e.g., 22.4

```

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assumes a feature matrix X and target vector y are already loaded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R2 Score: {r2:.4f}")
```

```python
# Grid Search Example
# Assumes X_train and y_train are already defined (see the previous cell)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    # 1.0 = all features; the old 'auto' option was removed in scikit-learn 1.3
    'max_features': [1.0, 'sqrt', 'log2']
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
```

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42
)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)
```

```text
MSE: 0.25395595827057227
R2 Score: 0.8062009933635497
Feature Importances: [0.52588625 0.05435475 0.04445027 0.02960746 0.03069355 0.13805542
 0.08864622 0.08830608]
```


```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Number of samples
n_samples = 1000

# Generate synthetic housing data
data = {
    # Basic property features
    'area_sqft': np.random.normal(1800, 600, n_samples).astype(int),
    'bedrooms': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.1, 0.25, 0.4, 0.2, 0.05]),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3, 3.5, 4], n_samples, p=[0.15, 0.2, 0.3, 0.2, 0.1, 0.04, 0.01]),
    'stories': np.random.choice([1, 2, 3], n_samples, p=[0.6, 0.35, 0.05]),
    'year_built': np.random.randint(1950, 2023, n_samples),

    # Location features
    'latitude': np.random.uniform(37.5, 37.9, n_samples),
    'longitude': np.random.uniform(-122.5, -121.9, n_samples),
    'distance_city_center': np.random.exponential(5, n_samples),
    'school_rating': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.05, 0.15, 0.3, 0.35, 0.15]),

    # Property amenities
    'has_garage': np.random.choice([0, 1], n_samples, p=[0.2, 0.8]),
    'garage_cars': np.random.choice([0, 1, 2, 3], n_samples, p=[0.2, 0.4, 0.3, 0.1]),
    'has_pool': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    'has_garden': np.random.choice([0, 1], n_samples, p=[0.3, 0.7]),
    'has_fireplace': np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),

    # Neighborhood features
    'crime_rate': np.random.uniform(0.1, 10.0, n_samples),
    'population_density': np.random.normal(5000, 2000, n_samples),
    'median_income_neighborhood': np.random.normal(75000, 25000, n_samples),

    # Additional features
    'lot_size_sqft': np.random.normal(5000, 2000, n_samples).astype(int),
    'condition': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.05, 0.1, 0.3, 0.4, 0.15]),
    'energy_efficiency': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples, p=[0.1, 0.3, 0.4, 0.15, 0.05]),
    'property_type': np.random.choice(['Apartment', 'Townhouse', 'Single Family', 'Condo'],
                                    n_samples, p=[0.3, 0.2, 0.4, 0.1])
}

# Create DataFrame
df = pd.DataFrame(data)

# Ensure some logical constraints
df['area_sqft'] = np.abs(df['area_sqft'])  # No negative areas
df['bedrooms'] = np.maximum(1, df['bedrooms'])  # At least 1 bedroom
df['bathrooms'] = np.maximum(1, df['bathrooms'])  # At least 1 bathroom
df['year_built'] = np.minimum(2023, df['year_built'])  # No future years
df['lot_size_sqft'] = np.abs(df['lot_size_sqft'])  # No negative lot sizes

# If no garage, set garage_cars to 0
df.loc[df['has_garage'] == 0, 'garage_cars'] = 0

# Generate realistic housing prices based on features with some noise
base_price = 300000  # Base price in dollars

# Calculate price based on features
price = (
    base_price +
    df['area_sqft'] * 150 +  # $150 per sqft
    df['bedrooms'] * 25000 +  # $25k per bedroom
    df['bathrooms'] * 30000 +  # $30k per bathroom
    df['stories'] * 20000 +  # $20k per story
    (2023 - df['year_built']) * -500 +  # Newer houses more expensive
    df['school_rating'] * 15000 +  # $15k per school rating point
    df['has_garage'] * 25000 +  # $25k for garage
    df['garage_cars'] * 10000 +  # $10k per garage car space
    df['has_pool'] * 35000 +  # $35k for pool
    df['has_garden'] * 15000 +  # $15k for garden
    df['has_fireplace'] * 8000 +  # $8k for fireplace
    df['condition'] * 12000 +  # $12k per condition point
    -df['crime_rate'] * 3000 +  # Lower crime = higher price
    -df['distance_city_center'] * 8000 +  # Closer to city = higher price
    df['median_income_neighborhood'] * 0.5  # Neighborhood wealth effect
)

# Add property type premium
property_type_premium = {
    'Apartment': 0,
    'Townhouse': 20000,
    'Single Family': 50000,
    'Condo': -10000
}
df['property_type_premium'] = df['property_type'].map(property_type_premium)
price += df['property_type_premium']

# Add energy efficiency premium
energy_efficiency_premium = {
    'A': 15000,
    'B': 8000,
    'C': 0,
    'D': -5000,
    'E': -10000
}
df['energy_efficiency_premium'] = df['energy_efficiency'].map(energy_efficiency_premium)
price += df['energy_efficiency_premium']

# Add random noise (standard deviation = 10% of the mean price)
noise = np.random.normal(0, 0.1 * price.mean(), n_samples)
price += noise

# Ensure prices are realistic (floor at $150k, so no negative prices)
price = np.maximum(150000, price)

# Add price to DataFrame
df['price'] = price.astype(int)

# Drop temporary columns
df = df.drop(['property_type_premium', 'energy_efficiency_premium'], axis=1)

# Display dataset info
print("Housing Dataset Overview:")
print(f"Shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
df.info()  # info() prints directly and returns None
print("\nBasic Statistics:")
print(df.describe())

# Save to CSV
df.to_csv('housing_data.csv', index=False)
print("\nDataset saved as 'housing_data.csv'")

# Display correlation with price (numeric columns only, ranked by absolute value)
print("\nTop 10 features correlated with price:")
correlations = (
    df.select_dtypes(include='number')
    .corr()['price']
    .drop('price')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(correlations.head(10))
```


Housing Dataset Overview:
Shape: (1000, 22)

First 5 rows:
   area_sqft  bedrooms  bathrooms  stories  year_built   latitude   longitude  \
0       2098         2        1.5        2        1965  37.627162 -122.278365   
1       1717         2        1.0        1        1996  37.503511 -122.404276   
2       2188         3        1.0        1        2008  37.652853 -121.968664   
3       2713         3        1.5        2        1980  37.620724 -122.493994   
4       1659         1        2.5        1        1953  37.514959 -121.970554   

   distance_city_center  school_rating  has_garage  ...  has_garden  \
0              8.244515              3           1  ...           1   
1              2.203316              4           1  ...           1   
2              0.532070              5           1  ...           0   
3             16.336162              5           1  ...           1   
4              5.286421              2           1  ...           1   

   has_fireplace  crime_rat