<h1 align="center">Random Forest In Depth</h1>

![randomforest.jpg](<attachment:random forest.jpg>)

- Supervised Learning Algorithm
- Used in Regression and Classification problems
- Based on the concept of ensemble learning
- No feature scaling required, because tree splits depend only on the ordering of feature values
- Can struggle on imbalanced classification problems unless classes are reweighted or resampled
- Robust and powerful, but often less effective on very high-dimensional data
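
The claim about scaling can be checked empirically. The sketch below is illustrative only: the dataset sizes, the factor of 1000 applied to one column, and the variable names are assumptions chosen for the demonstration. It trains one forest on raw features and one on standardized features, with the same `random_state`, and compares their predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data (sizes chosen arbitrarily for this sketch).
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Blow up one column so the raw features live on very different scales.
X_raw = X.copy()
X_raw[:, 1] *= 1000.0

# Standardized copy of the same data (zero mean, unit variance per column).
X_scaled = StandardScaler().fit_transform(X_raw)

# Two forests with identical settings, one per representation.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_raw, y)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

# Fraction of samples on which both forests predict the same class.
agreement = np.mean(rf_raw.predict(X_raw) == rf_scaled.predict(X_scaled))
print(f"Prediction agreement: {agreement:.3f}")
```

Because standardization preserves the ordering of values within each column, the two forests should agree on essentially every sample; tiny differences could in principle arise from floating-point rounding.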

### How it works

![bag.png](attachment:bag.png)

1. **Bootstrap Aggregation (Bagging):**
    - Random Forest builds on bagging. It creates multiple training sets by randomly sampling rows (with replacement) from the original data, each typically the same size as the original. Within any one sample, some data points appear more than once while others are left out entirely
    
    `Note:` This row sampling with replacement is called **bootstrapping**, a resampling technique that produces many datasets from a single original dataset.
    
2. **Building Decision Trees:**
    - For each bootstrap sample, a decision tree is constructed. Every tree uses the same learning algorithm but considers only a random subset of the features at each split point. This randomness reduces the correlation between the trees, so a few strong features do not dominate every tree
3. **Making Predictions:**
    - When a new data point needs to be classified or a value predicted, it's passed through all the decision trees in the forest. Each tree makes its own prediction based on its learned rules
4. **Ensemble Prediction (Voting):**
    - For classification tasks, the final prediction is typically made by a majority vote. The class that receives the most votes from the individual trees becomes the overall prediction for the forest
    - For regression tasks, the final prediction is often the average of the predictions from all the trees in the forest
    
    `Note:` Combining the individual results into the final output, by majority voting (classification) or averaging (regression), is known as **aggregation**.
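
The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration under stated assumptions, not how any particular library implements a forest: the dataset, the number of trees (`n_trees = 25`), and all variable names are made up for the example, and it handles binary labels only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data (labels are 0 and 1).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
n_samples = X.shape[0]
n_trees = 25  # odd, so a binary majority vote cannot tie

trees = []
in_bag_fractions = []
for i in range(n_trees):
    # Step 1 (bootstrapping): draw n_samples row indices with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    in_bag_fractions.append(np.unique(idx).size / n_samples)

    # Step 2: fit a tree that considers a random subset of features
    # (sqrt of the feature count) at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: every tree predicts every sample -> array of shape (n_trees, n_samples).
all_preds = np.stack([tree.predict(X) for tree in trees])

# Step 4 (aggregation): majority vote across trees for each sample.
votes_for_class_1 = all_preds.sum(axis=0)
forest_pred = (votes_for_class_1 > n_trees / 2).astype(int)

print(f"Mean fraction of distinct rows per bootstrap sample: {np.mean(in_bag_fractions):.3f}")
print(f"Training accuracy of the hand-built forest: {np.mean(forest_pred == y):.3f}")
```

On average a bootstrap sample contains roughly 63% of the distinct original rows, which is why some points are missing from any given tree's training set. Note that scikit-learn's `RandomForestClassifier` aggregates by averaging the trees' predicted class probabilities rather than counting hard votes, so its results can differ slightly from this sketch.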
    
**Advantages:**
    
- **High Accuracy:** Random forests often achieve excellent predictive performance, especially when compared to single decision trees
- **Robustness:** By averaging predictions from multiple trees, random forests reduce variance and are less susceptible to overfitting
- **Handling Missing Data:** Some implementations can handle missing values natively, without a separate imputation step; others require missing values to be imputed first, so check the library you are using
- **Feature Importance:** Random forests provide a measure of feature importance, indicating which features contribute most to the model's predictions
- **Scalability:** Random forests can be effectively applied to large datasets due to their parallelizable nature
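
As a small illustration of the feature importance and scalability points, the sketch below fits a forest in parallel and prints its impurity-based importances. The dataset parameters mirror the classification example in this document; `n_estimators=100` and `n_jobs=-1` (use all available cores) are choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, the informative features are the first columns (0 and 1);
# columns 2 and 3 are pure noise.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# n_jobs=-1 trains the trees in parallel on all available CPU cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = clf.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature {i}: {importance:.3f}")
```

The two informative columns should receive most of the importance. Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure.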
    
**Disadvantages:**
    
- **Interpretability:** Although feature importances are available, a forest of many trees is much harder to inspect and explain than a single decision tree
- **Computational Complexity:** Training and prediction can be computationally expensive, especially for large datasets combined with a large number of trees

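```python
# Random forest for a classification problem
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic dataset: 1000 samples, 4 features, 2 of them informative
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Shallow trees (max_depth=2); random_state makes the run reproducible
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict the class of a single new sample
print(clf.predict([[0, 0, 0, 0]]))
```

Output:

```
[1]
```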


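```python
# Random forest for a regression problem
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic dataset: 4 features, 2 of them informative
X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)

# Shallow trees (max_depth=2); random_state makes the run reproducible
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X, y)

# Predict the target value for a single new sample
print(regr.predict([[0, 0, 0, 0]]))
```

Output:

```
[-8.32987858]
```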
