***Ensemble learning***

Ensemble learning combines the predictions of multiple models to achieve better accuracy and robustness than a single model typically provides.

***Bagging (Bootstrap Aggregating)***

Bagging trains multiple models, each on a bootstrap sample of the training data (rows drawn at random with replacement), and combines them by averaging their predictions (for regression) or by majority voting (for classification).
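To make the idea concrete, here is a minimal sketch of bagging written by hand. It assumes a synthetic binary dataset from `make_classification` and uses unpruned decision trees as the base model; the variable names and the choice of 25 models are illustrative, not prescribed by the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Each row of all_preds holds one model's predictions for the test set
all_preds = np.stack([m.predict(X_test) for m in models], axis=0)

# Majority vote over the 0/1 labels: class 1 wins if more than half the models vote for it
votes_for_1 = all_preds.sum(axis=0)
bagged_pred = (votes_for_1 > n_models / 2).astype(int)

bagged_acc = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_acc:.3f}")
```

In practice, `sklearn.ensemble.BaggingClassifier` wraps the same procedure; the manual loop is shown only to expose the two steps (bootstrap, then aggregate).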





***Row Sampling***

Each tree is trained on a random subset of rows from the dataset, which reduces overfitting and improves robustness.
Typically, bootstrap sampling (sampling with replacement) is used.
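A small sketch of what one bootstrap sample looks like, assuming a hypothetical dataset of 10,000 rows identified only by their index. Because rows are drawn with replacement, some appear several times and others not at all; for large datasets the share of distinct rows tends toward 1 - 1/e.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000  # hypothetical dataset size

# Bootstrap sample: n_rows indices drawn with replacement
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Distinct rows that made it into the sample
in_bag = np.unique(sample_idx)
in_bag_fraction = len(in_bag) / n_rows     # close to 1 - 1/e (about 0.632)
left_out_fraction = 1 - in_bag_fraction    # close to 1/e (about 0.368)

print(f"Distinct rows in sample: {in_bag_fraction:.3f}")
print(f"Rows left out:           {left_out_fraction:.3f}")
```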

***Column Sampling***

At each split, the tree considers only a random subset of features (columns), which reduces correlation between trees and increases model diversity. Features are sampled without replacement, so no feature appears twice among the candidates for a split.

Both techniques combined lead to a more robust and generalized model by ensuring that each tree in the forest is trained on different data and features, thereby reducing variance and improving overall performance.
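The sketch below shows both kinds of sampling side by side, assuming a synthetic dataset with 16 features. The first part draws one candidate feature subset by hand; the second part shows the scikit-learn parameters that control the same behaviour (`max_features` for columns, `bootstrap` for rows). The sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 16
k = int(np.sqrt(n_features))  # 4 candidate features per split

# Column sampling by hand: replace=False means no feature is drawn twice
candidate_features = rng.choice(n_features, size=k, replace=False)
print("Candidate features for one split:", sorted(candidate_features.tolist()))

# The same ideas expressed through RandomForestClassifier parameters
X, y = make_classification(n_samples=300, n_features=n_features, random_state=0)
forest = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # consider sqrt(n_features) features at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=0,
).fit(X, y)
```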



***Random forest***

It builds a forest of decision trees, each trained on a random subset of the data and features, and aggregates their predictions to improve accuracy and control overfitting.

***OOB - Out of bag samples***

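When building each tree, about one third of the rows (roughly 36.8% for large datasets) are not drawn into its bootstrap sample; these are the out-of-bag (OOB) samples. Because a tree never sees its own OOB rows during training, predicting each row using only the trees for which it was out-of-bag gives a reasonable estimate of generalization performance without a separate validation set.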
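As a minimal sketch, scikit-learn exposes the OOB estimate through `oob_score=True`, after which the fitted model has an `oob_score_` attribute (accuracy for classifiers). The synthetic dataset and the held-out test split are assumptions added here only to compare the two numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# oob_score=True requires bootstrap=True (the default)
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=0
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_              # estimated from out-of-bag rows only
test_accuracy = forest.score(X_test, y_test)  # measured on the held-out split

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two values are usually close, which is why the OOB score is often used as a quick check when data is too scarce to hold out a validation set.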


```python
# Implementation of the random forest algorithm

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# For classification tasks
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# For regression tasks
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
```
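A short usage sketch for the two estimators, assuming synthetic data from `make_classification` and `make_regression` in place of a real dataset. It shows fitting and the main prediction calls; the first five rows are reused for prediction purely for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: predicted labels and class probabilities
X_clf, y_clf = make_classification(n_samples=200, n_features=8, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_clf, y_clf)
class_labels = rf_classifier.predict(X_clf[:5])
class_probs = rf_classifier.predict_proba(X_clf[:5])  # probabilities averaged over trees

# Regression: predictions are the average of the trees' outputs
X_reg, y_reg = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_reg, y_reg)
predicted_values = rf_regressor.predict(X_reg[:5])

print(class_labels)
print(class_probs.round(2))
print(predicted_values.round(1))
```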

***How to find the best n_estimators***

1. Cross-Validation (Most Reliable)

Use cross-validation to evaluate model performance for different values of n_estimators (e.g., 50, 100, 200) and select the one that yields the best validation score.

2. OOB (Out-of-Bag) Error Plot

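Random Forest can estimate generalization error using the samples left out of each tree's bootstrap sample.
Plot the OOB error against the number of trees and choose the point where the curve flattens; adding trees beyond that mostly increases training time.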
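Both approaches can be sketched in a few lines. The example assumes a synthetic dataset and the candidate values 50, 100, and 200 mentioned above. For the OOB curve it uses `warm_start=True`, which keeps the trees already built and only adds new ones when `n_estimators` is increased. Plotting is left out; `oob_errors` holds the points you would plot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
candidates = [50, 100, 200]

# 1. Cross-validation: mean 5-fold accuracy for each candidate value
cv_scores = {}
for n in candidates:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    cv_scores[n] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
best_n = max(cv_scores, key=cv_scores.get)

# 2. OOB error: grow one forest step by step and record the error at each size
oob_errors = {}
forest = RandomForestClassifier(
    warm_start=True, oob_score=True, bootstrap=True, random_state=0
)
for n in candidates:
    forest.set_params(n_estimators=n)
    forest.fit(X, y)  # with warm_start, only the missing trees are trained
    oob_errors[n] = 1 - forest.oob_score_

print("CV accuracy by n_estimators:", cv_scores)
print("Best by CV:", best_n)
print("OOB error by n_estimators:", oob_errors)
```

Differences between candidates are often within noise, so it is reasonable to prefer the smallest value whose score is close to the best.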
 

***Rule of thumb***

| Dataset Size         | Recommended Trees |
| -------------------- | ----------------- |
| Small (<10K samples) | 100–200           |
| Medium (10K–100K)    | 200–500           |
| Large (>100K)        | 500–1000          |


***Implementation of Random Forest***

1. Define the problem & target metrics  
    - Identify whether it’s classification or regression, and decide evaluation metrics (accuracy, F1, RMSE, R², etc.).

2. Perform quick EDA (Exploratory Data Analysis)  
    - Check distributions, missing values, outliers, and class balance (for classification).

3. Preprocess the data  
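    - Handle missing values, encode categorical features, and apply basic feature engineering. Normalization/standardization is generally unnecessary for tree-based models, because splits depend only on the ordering of feature values.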

4. Split the dataset  
    - Use train_test_split; for classification, use stratify=y to maintain class ratios.

5. Train a baseline Random Forest model  
    - Start with default hyperparameters to get an initial performance benchmark.

6. Cross-validation & imbalance handling  
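    - Apply KFold / StratifiedKFold; for classification, handle imbalance via class_weight='balanced' or SMOTE. If you resample, do it only on the training folds so that synthetic samples do not leak into validation data.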

7. Hyperparameter tuning  
    - Optimize key parameters such as n_estimators, max_depth, max_features, min_samples_split, and min_samples_leaf using GridSearchCV or RandomizedSearchCV.

8. Train the final model & evaluate  
    - Retrain using the best parameters; evaluate on the test set using suitable metrics for your task.

9. Interpret the model  
    - Analyze feature importance and SHAP values to understand which features drive predictions.

10. Serialize & monitor in production  
     - Save using joblib or pickle; track drift, feature importance changes, and model performance over time.
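
The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions: a synthetic, mildly imbalanced binary dataset stands in for steps 1 to 3, F1 is the chosen metric, the search space and `n_iter=5` are arbitrary, and the model is saved to a temporary directory. SHAP analysis is omitted because it needs a separate third-party package; only the built-in feature importances are shown.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Steps 1-3 (stand-in): imbalanced binary classification data, metric = F1
X, y = make_classification(
    n_samples=600, n_features=15, weights=[0.8, 0.2], random_state=0
)

# Step 4: stratified split keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 5-6: baseline with default hyperparameters, scored by stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
baseline = RandomForestClassifier(class_weight="balanced", random_state=0)
baseline_f1 = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="f1").mean()

# Step 7: randomized search over the key hyperparameters
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=cv,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# Step 8: refit=True (the default) retrains the best model on all of X_train
final_model = search.best_estimator_
test_f1 = f1_score(y_test, final_model.predict(X_test))

# Step 9: impurity-based feature importances, highest first
importances = final_model.feature_importances_
top_features = importances.argsort()[::-1][:5]

# Step 10: serialize and reload
model_path = os.path.join(tempfile.mkdtemp(), "rf_model.joblib")
joblib.dump(final_model, model_path)
reloaded = joblib.load(model_path)

print(f"Baseline CV F1: {baseline_f1:.3f}")
print(f"Best params:    {search.best_params_}")
print(f"Test F1:        {test_f1:.3f}")
print(f"Top features:   {top_features.tolist()}")
```

The test set is touched only once, after tuning, so `test_f1` is not influenced by the hyperparameter search.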
