# RandomForestClassifier
The RandomForestClassifier in sklearn.ensemble is an ensemble classifier built from decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The forest then averages the trees' probabilistic predictions, which improves accuracy and reduces overfitting relative to a single tree.

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None, monotonic_cst=None)
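The averaging step described above can be checked by hand. The following is a minimal sketch on synthetic data (the dataset sizes, seeds, and variable names are illustrative assumptions, not part of the notebook): it averages the per-tree class probabilities and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; sizes and seeds are illustrative assumptions.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

# Compare with the probabilities reported by the forest itself.
print(np.allclose(manual_proba, forest.predict_proba(X)))

# predict() returns the class with the highest averaged probability.
manual_labels = forest.classes_[manual_proba.argmax(axis=1)]
print(np.array_equal(manual_labels, forest.predict(X)))
```

Note that this differs from the hand-written class further down, which combines trees by majority vote over hard labels rather than by averaging probabilities.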

In [1]:
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor
import numpy as np
from collections import Counter

In [2]:
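class Random_Forest_Classifier:
    """Bagged decision trees combined by majority vote."""
    def __init__(self,n_trees=10,max_depth=10,min_samples_split=2):
        self.n_trees=n_trees
        self.max_depth=max_depth
        self.min_samples_split=min_samples_split
        self.trees=[]
    
    def fit(self,X,y):
        # Reset the forest so that repeated calls to fit do not accumulate trees.
        self.trees=[]
        for _ in range(self.n_trees):
            # max_features is left at the tree default (all features), so the trees
            # differ mainly through their bootstrap samples.
            tree=DecisionTreeClassifier(max_depth=self.max_depth,min_samples_split=self.min_samples_split)
            X_sample,y_sample=self._bootstrap_samples(X,y)
            tree.fit(X_sample,y_sample)
            self.trees.append(tree)
            
    def _bootstrap_samples(self,X,y):
        # Sample n_samples row indices with replacement; X and y must be NumPy arrays.
        n_samples=X.shape[0]
        indx=np.random.choice(n_samples,n_samples,replace=True)
        return X[indx],y[indx]

    def _most_common_label(self,y):
        # Majority vote over one sample's per-tree predictions.
        counter=Counter(y)
        most_common=counter.most_common(1)[0][0]
        return most_common
    
    def predict(self,X):
        # predictions has shape (n_trees, n_samples); swap axes to (n_samples, n_trees).
        predictions=np.array([tree.predict(X) for tree in self.trees])
        pred=np.swapaxes(predictions,0,1)
        return np.array([self._most_common_label(p) for p in pred])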

In [3]:
import pandas as pd

In [4]:
df=pd.read_csv("diabetes.csv")
X=df.drop(columns=["Outcome"])
y=df["Outcome"]

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1234)

In [6]:
X_train=np.array(X_train)
y_train=np.array(y_train)
y_test=np.array(y_test)
X_test=np.array(X_test)

In [7]:
rf=Random_Forest_Classifier()
rf.fit(X_train,y_train)
predictions=rf.predict(np.array(X_test))
predictions

array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
      dtype=int64)

In [8]:
y_test

array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
      dtype=int64)

In [9]:
def accuracy(y_true,y_pred):
    accuracy=np.sum(y_true==y_pred)/len(y_true)
    return accuracy

In [10]:
accuracy(y_test,predictions)

0.7727272727272727
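As a quick sanity check, the hand-written `accuracy` helper can be compared with `sklearn.metrics.accuracy_score`. The sketch below uses small made-up label arrays (an assumption for illustration, not the diabetes data) so that the result is easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of the 8 predictions match the true labels.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and truth agree.
    return np.sum(y_true == y_pred) / len(y_true)

manual = accuracy(y_true, y_pred)
library = accuracy_score(y_true, y_pred)
print(manual, library)
```

Both calls compute the same fraction of correct predictions, so either can be used in the cells of this notebook.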

### Key Parameters:
- n_estimators: The number of trees in the forest (default: 100).
- criterion: Function to measure the quality of a split ("gini", "entropy", "log_loss").
- max_depth: Maximum depth of the trees (default: None, meaning nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples).
- min_samples_split: Minimum number of samples required to split a node.
- min_samples_leaf: Minimum number of samples at a leaf node.
- max_features: Number of features to consider when looking for the best split (default: 'sqrt').
- bootstrap: Whether to use bootstrap samples to build trees (default: True).
- oob_score: Whether to use out-of-bag samples to estimate accuracy (default: False).
- n_jobs: The number of jobs to run in parallel (-1 for all cores).
- random_state: For reproducibility.
- class_weight: Weights associated with classes for handling imbalanced data.
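To show how several of these parameters fit together, here is a minimal sketch on an imbalanced synthetic dataset. The dataset, the specific parameter values, and the 90/10 class split are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic data (roughly 90% / 10%); all values are illustrative.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,         # number of trees
    criterion="gini",         # split quality measure
    max_depth=8,              # cap on tree depth
    min_samples_leaf=2,       # minimum samples per leaf
    max_features="sqrt",      # features considered at each split
    bootstrap=True,           # required for oob_score
    oob_score=True,           # estimate accuracy from out-of-bag samples
    class_weight="balanced",  # reweight classes inversely to their frequency
    n_jobs=-1,                # use all available cores
    random_state=0,           # reproducible results
)
clf.fit(X, y)
print(f"OOB accuracy estimate: {clf.oob_score_:.3f}")
```

The out-of-bag estimate comes for free from the bootstrap procedure, so no separate validation split is needed to get a first impression of generalization.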

In [11]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10,max_depth=10,min_samples_split=2)
clf.fit(X_train, y_train)
predictions=clf.predict(X_test)
predictions



array([0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1],
      dtype=int64)

### Important Attributes:
- estimators_: Collection of fitted trees.
- feature_importances_: Importance of each feature based on the reduction in impurity.
- oob_score_: The out-of-bag score, if oob_score=True.

### Key Methods:
- fit(X, y): Fits the model to the data.
- predict(X): Predicts classes for X.
- apply(X): Returns leaf indices for X.
- decision_path(X): Returns the decision path in the forest.
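The attributes and methods listed above can be explored on a small fitted model. This is a sketch on synthetic data (dataset and sizes are assumptions); it only prints shapes and lengths so that the structure of each object is visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0).fit(X, y)

# Attributes
print(len(clf.estimators_))            # one fitted tree per estimator
print(clf.feature_importances_.shape)  # one importance value per feature
print(clf.oob_score_)                  # available because oob_score=True

# Methods
leaves = clf.apply(X)                          # leaf index per (sample, tree)
indicator, n_nodes_ptr = clf.decision_path(X)  # sparse node indicator and per-tree column offsets
print(leaves.shape, indicator.shape, n_nodes_ptr.shape)
```

The columns `indicator[:, n_nodes_ptr[i]:n_nodes_ptr[i+1]]` belong to the i-th tree, which is why `n_nodes_ptr` has one more entry than there are trees.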

In [12]:
y_test

array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
      dtype=int64)

In [13]:
accuracy(y_test,predictions)

0.7402597402597403

### Example: python

- from sklearn.ensemble import RandomForestClassifier<br>
from sklearn.datasets import make_classification

- Generate a random dataset<br>
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, random_state=0)

- Create and train the model<br>
clf = RandomForestClassifier(max_depth=2, random_state=0)<br>
clf.fit(X, y)

- Make predictions<br>
predictions = clf.predict([[0, 0, 0, 0]])

### Use Cases:
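- Binary and multiclass classification, especially on tabular data.
- Handles large datasets and high-dimensional feature spaces well.
- Less prone to overfitting than a single decision tree, provided tree size is kept in check (for example via max_depth or min_samples_leaf).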

# RandomForestRegressor
The RandomForestRegressor in scikit-learn is a powerful and versatile machine learning model used for regression tasks. It leverages the ensemble learning technique by constructing multiple decision trees during training and averaging their predictions to improve accuracy and control overfitting.

### Overview:
- Random Forest is an ensemble method that builds a collection (forest) of decision tree regressors.
- Each tree is trained on a bootstrap sample of the data. At each split, a random subset of features is considered when max_features is below 1.0 (the regressor's default, 1.0, uses all features).
- The final prediction is the average of the predictions from all individual trees, which enhances the model's robustness and generalization capabilities compared to a single decision tree.
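The averaging described in the last point can be verified directly. The following sketch (synthetic data and sizes are assumptions) averages the predictions of the individual trees and compares the result with the forest's prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Stack per-tree predictions into shape (n_trees, n_samples) and average over trees.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print(np.allclose(manual_mean, forest.predict(X)))
```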

class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, monotonic_cst=None)

In [14]:
class Random_Forest_Regressor:
    """Bagged regression trees combined by averaging."""
    def __init__(self,n_trees=10,max_depth=10,min_samples_split=2):
        self.n_trees=n_trees
        self.max_depth=max_depth
        self.min_samples_split=min_samples_split
        self.trees=[]
    
    def fit(self,X,y):
        # Reset the forest so that repeated calls to fit do not accumulate trees.
        self.trees=[]
        for _ in range(self.n_trees):
            # max_features is left at the tree default (all features), so the trees
            # differ mainly through their bootstrap samples.
            tree=DecisionTreeRegressor(max_depth=self.max_depth,min_samples_split=self.min_samples_split)
            X_sample,y_sample=self._bootstrap_samples(X,y)
            tree.fit(X_sample,y_sample)
            self.trees.append(tree)
            
    def _bootstrap_samples(self,X,y):
        # Sample n_samples row indices with replacement; X and y must be NumPy arrays.
        n_samples=X.shape[0]
        indx=np.random.choice(n_samples,n_samples,replace=True)
        return X[indx],y[indx]

    def predict(self,X):
        # predictions has shape (n_trees, n_samples); average over the tree axis.
        predictions=np.array([tree.predict(X) for tree in self.trees])
        return np.mean(predictions,axis=0)

In [15]:
df=pd.read_csv("Boston.csv")
X=df.drop(columns=["medv"])
y=df["medv"]

In [16]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1234)

In [17]:
X_train=np.array(X_train)
y_train=np.array(y_train)
y_test=np.array(y_test)
X_test=np.array(X_test)

In [18]:
rf=Random_Forest_Regressor()
rf.fit(X_train,y_train)
predictions=rf.predict(np.array(X_test))
predictions

array([31.545     , 26.62      ,  7.51685714, 21.81761381, 13.46      ,
       23.72061709, 19.90275253, 16.98280556, 20.29096032, 29.63041667,
       16.69      , 21.27779453, 21.18760546, 18.18115333, 19.9172514 ,
       21.9425    ,  9.75779798, 20.19464539, 13.30725   ,  7.51685714,
       35.46111111, 19.96033333, 19.81976492, 16.59714286, 22.82806164,
       20.44255943, 20.83546129, 23.59818056, 29.29      , 19.52719697,
       20.70055983, 16.80512158, 19.60504717, 13.32      , 24.028     ,
       16.5       , 20.30346032, 22.11405884,  9.31285714,  7.59622222,
        6.96066667, 23.23      , 47.66      , 27.74      , 15.51475   ,
       24.98413889, 26.86780303, 45.97      , 16.591875  , 20.43363636,
       20.65519339, 32.545     , 27.10720836, 23.63763086, 37.75      ,
       32.17      ,  9.30467532, 32.545     , 30.1       , 21.42790939,
       35.71771825, 17.16183333, 24.95054711, 37.39111111,  9.51848485,
       21.15374315, 33.5       , 19.66907138, 49.28      , 45.21

In [19]:
y_test

array([33. , 27.5,  5.6, 21.2, 14.9, 22.3, 18.8, 14.6, 19.4, 32. , 13.8,
       21.7, 22.6, 18.4, 20.5, 22.2, 10.8, 22.5, 13.8,  5. , 32.9, 18.6,
       16.8, 27.1, 22.9, 19.6, 22.7, 28.1, 26.6, 18.8, 18.9, 14.8, 19.6,
       17.1, 22.6, 27.5, 22.8, 21. ,  7.4, 10.4,  8.5, 21. , 45.4, 28.2,
       14.2, 22. , 24.1, 43.5, 15.2, 22. , 20.8, 36.1, 27. , 23.4, 50. ,
       31.6,  5. , 32.7, 30.1, 23. , 34.9, 14.1, 22.8, 36.4,  8.3, 22. ,
       36.5, 16.1, 50. , 44.8, 13.1, 23.1, 12. , 21.5, 24.2, 13.1, 50. ,
       20.6, 17.3, 13.3, 37.6, 48.8, 24.1, 14.3, 36.2, 11.9, 27.1, 29.6,
       20.4, 21.2, 13.9, 13.6, 22. , 19.7, 15.4, 46.7, 11.7, 22.9, 30.5,
       23.3, 19.4, 23. ])

In [20]:
def mse(y_true,y_pred):
    mse=np.sum((y_true-y_pred)**2)/len(y_true)
    return mse

In [21]:
mse(y_test,predictions)

11.742058617355111

### Key Parameters:
- n_estimators (int, default=100):
  - Description: Number of trees in the forest.
  - Note: Increased number can improve performance but also increase computational cost.
- criterion ({"squared_error", "absolute_error", "friedman_mse", "poisson"}, default="squared_error"):
  - Description: Function to measure the quality of a split.
    - squared_error: Mean Squared Error (MSE).
    - absolute_error: Mean Absolute Error (MAE).
    - friedman_mse: MSE with Friedman's improvement score.
    - poisson: Reduction in Poisson deviance.
  - Note: "absolute_error" can be significantly slower than "squared_error".
- max_depth (int, default=None):
  - Description: Maximum depth of the trees.
  - Behavior: If None, trees expand until all leaves are pure or contain fewer than min_samples_split samples.
- min_samples_split (int or float, default=2):
  - Description: Minimum number of samples required to split an internal node.
  - Behavior:
    - If int, it's the exact number.
    - If float, it's a fraction of the total samples.
- min_samples_leaf (int or float, default=1):
  - Description: Minimum number of samples required to be at a leaf node.
  - Behavior:
    - If int, it's the exact number.
    - If float, it's a fraction of the total samples.
  - Effect: Helps in smoothing the model, especially useful in regression.
- max_features ({"sqrt", "log2", None}, int, or float, default=1.0):
  - Description: Number of features to consider when looking for the best split.
  - Options:
    - "sqrt": Square root of the number of features.
    - "log2": Logarithm base 2 of the number of features.
    - int: Exact number of features.
    - float: Fraction of total features.
    - None or 1.0: All features.
  - Note: Smaller values increase randomness and reduce correlation among trees.
- bootstrap (bool, default=True):
  - Description: Whether to use bootstrap samples (sampling with replacement) when building trees.
  - Effect: False means the whole dataset is used to build each tree.
- oob_score (bool or callable, default=False):
  - Description: Whether to use out-of-bag samples to estimate the generalization score.
  - Usage: Only available if bootstrap=True.
  - Default Metric: r2_score, but can provide a custom metric via a callable.
- random_state (int, RandomState instance, or None, default=None):
  - Description: Controls the randomness of the bootstrapping and feature sampling.
  - Usage: Ensures reproducibility of results.
- n_jobs (int, default=None):
  - Description: Number of jobs to run in parallel.
  - Options:
    - -1: Use all available cores.
    - None: Defaults to 1 unless within a joblib.parallel_backend context.
- ccp_alpha (float, default=0.0):
  - Description: Complexity parameter for Minimal Cost-Complexity Pruning.
  - Effect: Controls tree size by pruning nodes with the least contribution to impurity reduction.
- max_samples (int or float, default=None):
  - Description: If bootstrap=True, the number of samples to draw from X to train each base estimator.
  - Behavior:
    - None: Draw all samples (X.shape[0]).
    - int: Exact number of samples.
    - float: Fraction of total samples.
- monotonic_cst (array-like of int, shape=(n_features,), default=None):
  - Description: Monotonicity constraints for each feature.
    - 1: Monotonically increasing.
    - -1: Monotonically decreasing.
    - 0: No constraint.
  - Limitations: Not supported for multioutput regressions or datasets with missing values.
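To make the effect of the size-related parameters concrete, the sketch below compares the total number of tree nodes in a forest grown with default settings against one constrained by max_depth, min_samples_leaf, and max_features. The dataset, the parameter values, and the `total_nodes` helper are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

def total_nodes(forest):
    # Hypothetical helper: sum of node counts over all fitted trees.
    return sum(tree.tree_.node_count for tree in forest.estimators_)

# Default settings: fully grown, unpruned trees.
full = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Constrained trees: shallower, larger leaves, half of the features per split.
small = RandomForestRegressor(n_estimators=20, max_depth=4, min_samples_leaf=10,
                              max_features=0.5, random_state=0).fit(X, y)

print(total_nodes(full), total_nodes(small))
```

With max_depth=4 a binary tree can hold at most 31 nodes, so the constrained forest is far smaller in memory than the fully grown one.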

### Attributes:
- estimators_: List of individual DecisionTreeRegressor objects.
- feature_importances_: Importance of each feature based on impurity reduction.
- n_features_in_: Number of features seen during fit.
- oob_score_: Out-of-bag score if oob_score=True.
- oob_prediction_: Predictions computed with out-of-bag estimates.
- classes_: Classifier-only attribute; RandomForestRegressor does not define it.
- n_outputs_: Number of outputs when fit is performed.
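A short sketch of how these attributes look on a fitted regressor follows. The synthetic data and sizes are assumptions; the check at the end relies on the default out-of-bag metric being r2_score, as stated in the parameter list.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
regr = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print(regr.n_features_in_, regr.n_outputs_)  # features seen during fit, number of outputs
print(regr.oob_prediction_.shape)            # one out-of-bag prediction per training sample

# With the default metric, oob_score_ is the R^2 of the out-of-bag predictions.
print(np.isclose(regr.oob_score_, r2_score(y, regr.oob_prediction_)))
```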

In [22]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor(n_estimators=10,max_depth=10,min_samples_split=2)
clf.fit(X_train, y_train)
predictions=clf.predict(X_test)
predictions



array([33.22904762, 24.41816667,  9.01030303, 20.7577729 , 13.632     ,
       23.34861111, 19.27886553, 16.23      , 20.76410654, 27.48033333,
       16.96571429, 21.31838655, 21.9598456 , 18.27983991, 21.01716492,
       23.445     ,  9.62      , 19.49544643, 16.50374338,  8.20363636,
       37.06958333, 21.29366667, 20.29895691, 16.27      , 22.56667355,
       20.0849584 , 20.95513593, 22.67000689, 27.93266667, 18.25666667,
       21.06422335, 15.022     , 19.63876523, 13.625     , 23.79125   ,
       19.1       , 20.95613165, 21.9706193 ,  9.4475    ,  8.31030303,
        7.79030303, 23.74582284, 47.18      , 27.76      , 16.44193317,
       25.98      , 26.782     , 48.83      , 15.19      , 22.38664953,
       19.20149846, 33.454     , 23.86155556, 23.87316667, 39.61      ,
       30.52      ,  9.53921212, 32.3       , 26.27      , 20.82133425,
       39.27      , 15.41583333, 26.18818182, 43.035     ,  9.38254545,
       18.97180976, 36.36      , 20.09538819, 49.38      , 41.07

In [23]:
mse(y_test,predictions)

12.005985960774359

### Important Notes:
- Feature Importance:<br>
Calculated based on the total decrease in impurity brought by each feature.
May be misleading for high-cardinality features (many unique values).
Consider using sklearn.inspection.permutation_importance for more reliable feature importance.

- Out-of-Bag Error:<br>
Provides an internal estimate of model performance without needing a separate validation set.
Useful for assessing the generalization error.

- Pruning and Complexity:<br>
Default parameters lead to fully grown and unpruned trees, which can be large.
Control tree size and memory consumption by adjusting parameters like max_depth, min_samples_split, min_samples_leaf, and ccp_alpha.

- Deterministic Behavior:<br>
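To ensure reproducible results, set random_state. Features are randomly permuted at each split, so the chosen split can vary between runs even when max_features=n_features and bootstrap=False, if several candidate splits are equally good.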
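The feature importance note above can be illustrated by putting impurity-based importances next to permutation importances computed on held-out data. This is a sketch under assumed settings (synthetic data with 2 informative features out of 6, 10 permutation repeats); the numbers it prints depend on those choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, n_informative=2,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

regr = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: drop in score when one feature's values are shuffled.
result = permutation_importance(regr, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={regr.feature_importances_[i]:.3f} "
          f"permutation={result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Because the permutation scores are computed on the test split, they reflect how much the model actually relies on each feature for unseen data rather than how often the feature was used for splitting.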

### Example : python
- from sklearn.ensemble import RandomForestRegressor<br>
from sklearn.datasets import make_regression<br>
from sklearn.model_selection import train_test_split<br>
from sklearn.metrics import mean_squared_error

-  Generate a sample regression dataset<br>
X, y = make_regression(n_samples=1000, n_features=4, n_informative=2, noise=0.1, random_state=0)

-  Split into training and testing sets<br>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- Instantiate and train the regressor<br>
regr = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)<br>
regr.fit(X_train, y_train)

- Make predictions<br>
predictions = regr.predict(X_test)

- Evaluate the model<br>
mse = mean_squared_error(y_test, predictions)<br>
print(f"Mean Squared Error: {mse:.2f}")

### Use Cases:
- Regression Tasks:<br>
Predicting continuous outcomes such as house prices, stock prices, or any other real-valued target.
- Handling High-Dimensional Data:<br>
Effective in scenarios with a large number of features.
- Robustness to Overfitting:<br>
Ensemble of trees reduces the risk of overfitting compared to individual decision trees.
- Feature Importance Assessment:<br>
Identifying which features contribute most to the prediction task.

### When to Use:
- When you need a model that balances bias and variance effectively.
- For datasets where the relationship between features and target is complex and non-linear.
- When interpretability of feature importance is valuable.

### Alternatives:
- DecisionTreeRegressor: Simpler model but more prone to overfitting.
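- ExtraTreesRegressor: Similar to Random Forest but draws split thresholds at random instead of searching for the best threshold for each candidate feature, often resulting in faster training.
- HistGradientBoostingRegressor: Histogram-based gradient boosting; much faster than GradientBoostingRegressor on large datasets (roughly 10,000 samples or more).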


The RandomForestRegressor is a strong general-purpose model for regression problems. With the number of trees and the tree-size parameters tuned to the data, it typically generalizes well to unseen data.