# Task 2.2: Advanced Model - Random Forest Classifier Implementation

## Objective

The goal of this task is to implement a Random Forest Classifier, an ensemble learning method that builds multiple decision trees and combines their predictions. This model is expected to outperform the baseline Logistic Regression by capturing non-linear relationships and feature interactions.

## Understanding Key Metrics

Since this is a supervised classification task, we use specific metrics to evaluate how well our model performs:

1. **Accuracy:** The fraction of all predictions that are correct. Range is 0 to 1, where 1 indicates perfect classification.
2. **Precision (Macro):** Average precision across all classes, where precision is the proportion of predictions for a class that are correct.
3. **Recall (Macro):** Average recall across all classes, where recall is the proportion of actual members of a class that are correctly identified.
4. **F1-Score (Macro):** Harmonic mean of precision and recall, computed per class and then averaged. Range is 0 to 1.
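
To make these definitions concrete, here is a minimal sketch that computes the four metrics on a handful of made-up labels. The values in `y_true_toy` and `y_pred_toy` are illustrative assumptions, not project data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for a three-class problem (illustration only)
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 1, 2, 0]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# average=None returns one score per class; 'macro' is their unweighted mean
per_class_precision = precision_score(y_true_toy, y_pred_toy, average=None)
macro_precision = precision_score(y_true_toy, y_pred_toy, average='macro')
macro_recall = recall_score(y_true_toy, y_pred_toy, average='macro')
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro')

print(f"Accuracy:            {toy_accuracy:.4f}")
print(f"Per-class precision: {per_class_precision}")
print(f"Precision (Macro):   {macro_precision:.4f}")
print(f"Recall (Macro):      {macro_recall:.4f}")
print(f"F1-Score (Macro):    {macro_f1:.4f}")
```

Six of the eight toy predictions are correct, so accuracy is 0.75, while the macro scores are averages of the three per-class values.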

## Why Random Forest?

Random Forest is an ensemble method that offers several advantages:

1. **Handles Non-Linear Relationships:** Unlike Logistic Regression, Random Forest can capture complex, non-linear patterns in the data.
2. **Feature Interactions:** Automatically learns interactions between features without manual feature engineering.
3. **Robust to Outliers:** Less sensitive to outliers and noisy data compared to linear models.
4. **Feature Importance:** Provides insights into which features are most important for predictions.
5. **Reduces Overfitting:** By averaging multiple trees, it reduces the risk of overfitting compared to a single decision tree.

## Step 1: Environment Setup and Data Discovery

In this step, we import the required libraries and load the preprocessed dataset from the project directory.

### Libraries Used:
- **pandas & numpy:** For data manipulation and numerical operations
- **RandomForestClassifier:** The main algorithm from sklearn.ensemble
- **sklearn.metrics:** For evaluating model performance
- **LabelEncoder:** To convert categorical target labels to numeric format
- **pickle:** For saving the trained model

### Data Loading:
We load the scaled training and testing sets that were prepared in previous tasks. Random Forest splits on feature thresholds, so it does not itself need scaled inputs; using the same scaled files simply keeps the inputs consistent with the earlier tasks.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
import pickle
import warnings
warnings.filterwarnings('ignore')

# Load data
X_train = pd.read_csv('../../data/processed/X_train_scaled.csv')
X_test = pd.read_csv('../../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../../data/processed/y_train.csv')
y_test = pd.read_csv('../../data/processed/y_test.csv')

if 'id' in X_train.columns:
    X_train = X_train.drop('id', axis=1)
if 'id' in X_test.columns:
    X_test = X_test.drop('id', axis=1)

# Encode target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train['value_category'])
y_test_encoded = label_encoder.transform(y_test['value_category'])


print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print("\nTarget distribution (Training):")
unique, counts = np.unique(y_train_encoded, return_counts=True)
for val, count in zip(unique, counts):
    category = label_encoder.classes_[val]
    print(f"  Class {val} ({category}): {count} samples ({count/len(y_train_encoded)*100:.2f}%)")

## Removing ID Columns from Features

**Why we remove ID columns:**

1. **IDs are unique identifiers, not predictive features** - They don't contain information about value categories
2. **Including IDs would cause overfitting** - The model would memorize specific listings instead of learning patterns
3. **IDs have no relationship with value categories** - A listing's ID number doesn't determine its value

This is a critical preprocessing step to ensure our model learns meaningful patterns rather than memorizing training data.

## Step 2: Model Training

We train a Random Forest Classifier with the following hyperparameters:

### Hyperparameter Explanation:

1. **n_estimators=100** - Number of decision trees in the forest. More trees generally stabilize predictions but increase computation time. 100 is the scikit-learn default and a reasonable starting point.

2. **max_depth=20** - Maximum depth of each tree. Capping depth limits how specific each tree's rules can become, which helps control overfitting. A depth of 20 still leaves room for complex patterns.

3. **min_samples_split=10** - Minimum number of samples required to split an internal node. Higher values prevent the model from learning overly specific patterns.

4. **min_samples_leaf=4** - Minimum number of samples required at a leaf node. Ensures each leaf represents a meaningful group of samples.

5. **random_state=42** - Ensures reproducibility. The same random seed produces the same results every time.

6. **n_jobs=-1** - Uses all available CPU cores for parallel processing, significantly speeding up training.
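
One way to sanity-check values such as `max_depth` is cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the project data, and the candidate depths, 50 trees, and three folds are arbitrary assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; substitute X_train / y_train_encoded)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

cv_scores = {}
for depth in (5, 20):
    candidate = RandomForestClassifier(
        n_estimators=50,
        max_depth=depth,
        min_samples_split=10,
        min_samples_leaf=4,
        random_state=42,
    )
    # Mean macro F1 over 3 folds for this depth
    fold_scores = cross_val_score(candidate, X_syn, y_syn, cv=3, scoring='f1_macro')
    cv_scores[depth] = fold_scores.mean()
    print(f"max_depth={depth}: mean macro F1 = {cv_scores[depth]:.4f}")
```

On real data, the same loop could be run over a wider grid, or replaced with `GridSearchCV`.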

### How Random Forest Works:

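1. Builds 100 decision trees, each trained on a bootstrap sample of the training data (rows drawn at random with replacement)
2. At each split, a tree considers only a random subset of the features, which makes the trees differ from one another
3. Each tree makes its own prediction; scikit-learn averages the trees' predicted class probabilities and returns the class with the highest average, which behaves much like majority voting
4. This ensemble approach reduces variance compared to a single decision tree, which typically improves generalization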
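
The steps above can be sketched by hand with individual decision trees. This is a simplified illustration rather than scikit-learn's actual implementation: the synthetic data, the 25 trees, and the variable names are all assumptions. It combines trees by averaging their predicted class probabilities, which is how scikit-learn's forest aggregates predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(
        max_features='sqrt', random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Average the per-tree class probabilities, then pick the most probable class
avg_proba = np.mean([t.predict_proba(X_demo) for t in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)

print(f"Ensemble accuracy on the demo data: {(ensemble_pred == y_demo).mean():.4f}")
```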

In [None]:
# Train model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=10,
    min_samples_leaf=4,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train_encoded)

print("Model training complete!")

## Step 3: Model Evaluation

We evaluate model performance on both the training and testing sets.

### Understanding the Metrics:

**Training vs Testing Accuracy:**
- **Training Accuracy:** How well the model performs on data it has seen during training
- **Testing Accuracy:** How well the model generalizes to new, unseen data
- **Gap between them:** A large gap suggests overfitting (the model has memorized the training data rather than learned general patterns)

**Macro-Averaged Metrics:**
- Calculate metric for each class separately, then average them
- Treats all classes equally, regardless of their size
- Important for imbalanced datasets to ensure all classes are considered
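
A small made-up example shows why macro averaging matters on imbalanced data. The labels below are hypothetical: a model that always predicts the majority class looks good on accuracy and weighted recall, but macro recall exposes the ignored minority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1
y_true_imb = [0] * 8 + [1] * 2
# A degenerate model that always predicts the majority class
y_pred_imb = [0] * 10

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
macro_recall_imb = recall_score(y_true_imb, y_pred_imb, average='macro')
weighted_recall_imb = recall_score(y_true_imb, y_pred_imb, average='weighted')

print(f"Accuracy:          {acc_imb:.2f}")              # 0.80
print(f"Recall (Weighted): {weighted_recall_imb:.2f}")  # 0.80
print(f"Recall (Macro):    {macro_recall_imb:.2f}")     # 0.50
```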

**What to Look For:**
- Testing accuracy should be higher than the baseline (Logistic Regression)
- Training and testing accuracy should be reasonably close (not too large a gap)
- Macro F1-Score balances precision and recall across all classes, so it is a more informative single number than accuracy when classes are imbalanced

In [None]:
# Predictions
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)

# Calculate metrics
train_acc = accuracy_score(y_train_encoded, y_train_pred)
test_acc = accuracy_score(y_test_encoded, y_test_pred)
test_precision = precision_score(y_test_encoded, y_test_pred, average='macro')
test_recall = recall_score(y_test_encoded, y_test_pred, average='macro')
test_f1 = f1_score(y_test_encoded, y_test_pred, average='macro')

print("\nModel Performance:")
print(f"  Training Accuracy: {train_acc:.4f}")
print(f"  Testing Accuracy: {test_acc:.4f}")
print(f"  Precision (Macro): {test_precision:.4f}")
print(f"  Recall (Macro): {test_recall:.4f}")
print(f"  F1-Score (Macro): {test_f1:.4f}")

## Step 4: Detailed Classification Report

The classification report provides per-class performance metrics, helping us understand which value categories the model predicts well and which ones it struggles with.

### Reading the Report:

**For Each Class:**
- **Precision:** Of all listings predicted as this class, what percentage were correct?
- **Recall:** Of all actual listings in this class, what percentage did we correctly identify?
- **F1-Score:** Harmonic mean of precision and recall for this class
- **Support:** Number of actual samples in this class

**Overall Metrics:**
- **Accuracy:** Overall correctness across all classes
- **Macro avg:** Simple average of metrics across all classes (treats each class equally)
- **Weighted avg:** Average weighted by the number of samples in each class

### What to Look For:
- Are all three classes performing similarly, or is one class much harder to predict?
- Low recall for a class means we're missing many actual instances of that class
- Low precision for a class means we're incorrectly labeling other classes as this one
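
The per-class numbers in the report can be traced back to a confusion matrix. The sketch below uses hypothetical labels (not project data) to show how recall comes from the rows and precision from the columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three classes (illustration only)
y_true_cm = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_cm = [0, 0, 1, 1, 1, 1, 2, 2, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_cm, y_pred_cm)

# Recall: correct predictions divided by the actual count of each class (row sums)
recall_per_class = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the predicted count of each class (column sums)
precision_per_class = cm.diagonal() / cm.sum(axis=0)

print(cm)
print("Recall per class:   ", np.round(recall_per_class, 3))
print("Precision per class:", np.round(precision_per_class, 3))
```

Here class 2 has the lowest recall (half of its actual samples are missed), while class 1 has the lowest precision (two samples from other classes are labeled as class 1).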

In [None]:
print(classification_report(
    y_test_encoded,
    y_test_pred,
    target_names=label_encoder.classes_,
))

## Step 5: Feature Importance Analysis

One of the key advantages of Random Forest is its ability to measure feature importance. This tells us which features contribute most to the model's predictions.

### How Feature Importance Works:

1. **Gini Importance:** Measures how much each feature decreases impurity across all trees
2. **Higher values:** Indicate features that are more important for making accurate predictions
3. **Sum to 1.0:** All feature importances add up to 1.0 (100%)
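
The sketch below checks the sum-to-one property on synthetic data and, as an optional cross-check, computes permutation importance on held-out data. Impurity-based importances are derived from the training data and can favor features with many unique values, so a second view is often useful. The dataset, split, and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative and 3 noise features (illustration only)
X_imp, y_imp = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y_imp, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based (Gini) importances, normalized to sum to 1.0
gini_importance = forest.feature_importances_
print("Gini importances:", np.round(gini_importance, 3))
print("Sum:", round(float(gini_importance.sum()), 6))

# Permutation importance: drop in score when one feature's values are shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
print("Permutation importances:", np.round(perm.importances_mean, 3))
```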

### Why This Matters:

- **Model Interpretability:** Understand what drives the model's decisions
- **Feature Selection:** Identify which features could potentially be removed
- **Business Insights:** Learn which listing characteristics most affect value perception
- **Validation:** Ensure the model is using sensible features (not just noise)

### Expected Important Features:

We expect features like price, review scores, location, and amenities to be highly important, as these directly relate to a listing's value proposition.

In [None]:
# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))

# Save feature importance
feature_importance.to_csv('../../data/processed/random_forest_feature_importance.csv', index=False)
print("\nFeature importance saved to: ../../data/processed/random_forest_feature_importance.csv")

## Step 6: Saving Results

We save all important outputs for future reference and comparison with other models.

### Files Saved:

1. **random_forest_model.pkl** - The trained model object
   - Can be loaded later for making predictions without retraining
   - Preserves all learned parameters and tree structures

2. **random_forest_results.csv** - Summary of model performance metrics
   - Allows easy comparison with other models (Logistic Regression, etc.)
   - Contains all key metrics in one place

3. **random_forest_predictions.csv** - Actual vs predicted values
   - Useful for error analysis and understanding misclassifications
   - Can be used to create confusion matrices

4. **random_forest_feature_importance.csv** - Feature importance rankings
   - Documents which features the model relies on most
   - Useful for feature selection and model interpretation

### Why Save These Files:

- **Reproducibility:** Can recreate results without rerunning the entire notebook
- **Comparison:** Easy to compare Random Forest with other models
- **Documentation:** Provides a record of model performance for reports
- **Deployment:** The saved model can be used in production applications
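
For reference, here is a minimal sketch of loading a pickled model back and using it. It trains a tiny stand-in model and writes it to a temporary directory, so the file name, data, and category names are all hypothetical. Two caveats apply to real use: only unpickle files from a trusted source, and load the model with the same scikit-learn version that saved it. The saved predictions are encoded integers, so mapping them back to category names requires the fitted `LabelEncoder`, which this notebook does not save to disk.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical category names and synthetic data (illustration only)
labels = np.array(['high', 'low', 'medium'])
encoder = LabelEncoder().fit(labels)
X_small, y_codes = make_classification(
    n_samples=120, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1
)
small_model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_small, y_codes)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    # Save the model, then load it back from disk
    with open(model_path, 'wb') as f:
        pickle.dump(small_model, f)
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions exactly
restored_pred = restored_model.predict(X_small)
same_predictions = np.array_equal(restored_pred, small_model.predict(X_small))
print("Predictions match after reload:", same_predictions)

# Map encoded predictions back to category names
decoded = encoder.inverse_transform(restored_pred[:5])
print("First five decoded predictions:", list(decoded))
```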

In [None]:
# Save results
results_df = pd.DataFrame({
    'model': ['Random Forest'],
    'train_accuracy': [train_acc],
    'test_accuracy': [test_acc],
    'precision_macro': [test_precision],
    'recall_macro': [test_recall],
    'f1_macro': [test_f1]
})
results_df.to_csv('../../data/processed/random_forest_results.csv', index=False)

# Save predictions
predictions_df = pd.DataFrame({
    'y_true': y_test_encoded,
    'y_pred': y_test_pred
})
predictions_df.to_csv('../../data/processed/random_forest_predictions.csv', index=False)

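# Save model (create the models directory first in case it does not exist yet)
import os
os.makedirs('../../models', exist_ok=True)
with open('../../models/random_forest_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)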

print("Files saved successfully:")
print("  - ../../models/random_forest_model.pkl")
print("  - ../../data/processed/random_forest_results.csv")
print("  - ../../data/processed/random_forest_predictions.csv")
print("  - ../../data/processed/random_forest_feature_importance.csv")

## Conclusion

### Expected Improvements Over Logistic Regression:

Random Forest is expected to outperform the baseline Logistic Regression model, which should be confirmed by comparing the test metrics of the two models, because:

1. **Non-Linear Patterns:** Can capture complex relationships that linear models miss
2. **Feature Interactions:** Automatically learns how features work together
3. **Robustness:** Less affected by outliers and noisy data
4. **Flexibility:** No assumptions about data distribution required

