# Random Forest - In-Depth Notes

## What is Random Forest?
Random Forest is an ensemble learning method used for both classification and regression tasks.
It builds multiple decision trees and combines their predictions, which typically improves accuracy and reduces overfitting relative to a single tree.

### Key Characteristics:
- Ensemble method (Bagging)
- Uses bootstrapped datasets
- Introduces feature randomness at each node split
- Outputs: majority vote (classification) or average (regression)
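
To make the regression case concrete, the sketch below checks that a fitted forest's prediction matches the plain mean of its individual trees' predictions. The synthetic dataset from `make_regression` and the names `X_demo`, `y_demo`, and `reg` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

reg = RandomForestRegressor(n_estimators=25, random_state=0)
reg.fit(X_demo, y_demo)

# Predictions of every individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in reg.estimators_])

# Average over trees and compare with the forest's own output
manual_average = per_tree.mean(axis=0)
forest_prediction = reg.predict(X_demo[:5])

print("Matches forest output:", np.allclose(manual_average, forest_prediction))
```

The fitted trees are exposed through `estimators_`, which is also handy for inspecting how much individual trees disagree on a given sample.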

## Advantages
- Scales to large datasets, since trees are independent and can be trained in parallel
- Works well with both categorical and numerical features (in scikit-learn, categorical features must be encoded numerically first)
- Reduces overfitting compared to single decision trees
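- Can tolerate missing values, although native support depends on the implementation and version; imputing beforehand is the portable option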
- Gives feature importance scores
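
Since native missing-value support varies by library and version, one portable option is to impute before fitting. The sketch below is a minimal example under that assumption: it artificially removes about 10% of the Iris values, then chains a median `SimpleImputer` with the forest. The masking rate, the imputation strategy, and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_full, y_full = load_iris(return_X_y=True)

# Knock out roughly 10% of the values at random to simulate missing data
rng = np.random.default_rng(0)
X_missing = X_full.copy()
X_missing[rng.random(X_missing.shape) < 0.1] = np.nan

# Median imputation followed by the forest, fitted as a single estimator
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_missing, y_full)

print("Training accuracy with imputed values:", model.score(X_missing, y_full))
```

Wrapping both steps in a pipeline keeps the imputation statistics tied to the training data, which matters once the model is cross-validated.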

## Disadvantages
- Less interpretable than individual decision trees
- Training time and memory use grow with the number and depth of trees
- Predictions are slower than a single tree's, since every tree in the forest must be evaluated

## How Random Forest Works
1. **Bootstrapping**: Draw multiple random samples (with replacement) from the training data.
2. **Decision Trees**: Train a separate decision tree on each sample.
3. **Random Feature Selection**: At each node, consider only a random subset of features and pick the best split among them.
4. **Aggregation**:
   - Classification: Majority vote
   - Regression: Average prediction
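
The four steps above can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn implements its forest: `n_trees = 25` and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value
n_classes = len(np.unique(y_all))
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample of row indices, drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Steps 2-3: one tree per sample; max_features="sqrt" makes the tree
    # consider a random subset of features at every split
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Step 4: majority vote over the trees for each test sample
votes = np.stack([t.predict(X_te) for t in trees]).astype(int)
majority = np.array(
    [np.bincount(col, minlength=n_classes).argmax() for col in votes.T]
)

print("Hand-rolled forest accuracy:", (majority == y_te).mean())
```

One difference worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.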

## Important Parameters (sklearn)
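- `n_estimators`: Number of trees in the forest (default 100)
- `max_depth`: Maximum depth of each tree (`None` lets trees grow until leaves are pure or too small to split)
- `min_samples_split`: Minimum samples required to split an internal node
- `min_samples_leaf`: Minimum samples required at a leaf node
- `max_features`: Number of features to consider at each split (e.g., `"sqrt"`, `"log2"`, an int, or a float fraction)
- `bootstrap`: Whether each tree is trained on a bootstrap sample; if `False`, every tree sees the whole dataset
- `random_state`: Seed for bootstrap sampling and feature selection; fix it for reproducible results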
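
As a quick illustration of how these parameters fit together, the sketch below sets each of them explicitly and then inspects the fitted forest. The specific values are arbitrary choices for the example, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_params, y_params = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=5,           # cap on the depth of every tree
    min_samples_split=4,   # a node needs at least 4 samples to be split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # train each tree on a bootstrap sample
    random_state=42,       # reproducible results
)
forest.fit(X_params, y_params)

print("Trees built:", len(forest.estimators_))
print("Deepest tree:", max(t.get_depth() for t in forest.estimators_))
```

Limiting `max_depth` and raising `min_samples_leaf` both make individual trees simpler, which trades a little fit on the training data for smaller, faster models.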

In [None]:
# Classification Example with Iris Dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
# Visualize Feature Importance
import pandas as pd
import matplotlib.pyplot as plt

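# Impurity-based importances, averaged over all trees in the fitted forest
feature_importance = pd.Series(clf.feature_importances_, index=iris.feature_names)
feature_importance.sort_values().plot(kind='barh')
plt.title("Feature Importance")
plt.xlabel("Importance (mean decrease in impurity)")
plt.tight_layout()
plt.show()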

## Use Cases
- Credit scoring
- Fraud detection
- Healthcare diagnostics
- Customer segmentation
- Stock price prediction

## Tips for Better Performance
- Use `GridSearchCV` or `RandomizedSearchCV` to tune hyperparameters
- Drop irrelevant or highly correlated features
- Use cross-validation to check generalization
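- Feature scaling is not needed for Random Forest itself; normalize only if the same data also feeds scale-sensitive models (e.g., SVM or k-NN)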
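
The tuning and cross-validation tips can be combined in a few lines. The sketch below is a minimal example: the grid is deliberately tiny so it runs quickly, and its values and the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)

# Small illustrative grid; real searches usually cover wider ranges
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
    "min_samples_leaf": [1, 3],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_cv, y_cv)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)

# Per-fold scores for the selected configuration
scores = cross_val_score(search.best_estimator_, X_cv, y_cv, cv=5)
print("Fold accuracies:", scores)
```

Because the same data is used to pick the parameters and to score them, these numbers are somewhat optimistic; holding out a separate test set gives a fairer estimate. `RandomizedSearchCV` follows the same pattern but samples a fixed number of combinations, which helps when the grid is large.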

## Summary
Random Forest is a powerful and versatile model that often achieves high accuracy,
handles both classification and regression problems, and is relatively robust to outliers and noisy features; missing values may still need imputation, depending on the implementation.
Though less interpretable than a single tree, its performance often justifies the trade-off.