## Ensemble Technique

`Ensemble techniques` are machine learning methods that combine multiple individual models to create a more powerful and accurate predictive model. The idea behind ensemble techniques is that by aggregating the predictions of multiple models, the strengths of individual models can be leveraged to compensate for their weaknesses, resulting in a more robust and accurate prediction.

There are several popular ensemble techniques, including:

1. `Bagging`: Bagging stands for Bootstrap Aggregating. It involves training multiple instances of the same base model on different subsets of the training data, obtained through bootstrap sampling (sampling with replacement). Each model is trained independently, and their predictions are combined through averaging or voting to make the final prediction.

2. `Boosting`: Boosting is an iterative ensemble technique that aims to improve the performance of a weak base model by training multiple models in sequence. Each subsequent model focuses on correcting the mistakes made by the previous models, with more emphasis on the data points that were misclassified. The final prediction is typically a weighted combination of the predictions made by each individual model.

3. `Random Forest`: Random Forest is an ensemble method that combines the ideas of bagging and decision trees. It creates an ensemble of decision trees, where each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features at each split. The predictions of the individual trees are then aggregated through majority voting (classification) or averaging (regression) to make the final prediction.

4. `Stacking`: Stacking, also known as stacked generalization, involves training multiple diverse models and combining their predictions using another model called a meta-learner or blender. The base models are trained on the training data, and their predictions serve as inputs to the meta-learner, which learns to make the final prediction based on the outputs of the base models.

Ensemble techniques are effective because they exploit the diversity and complementary strengths of individual models. By combining multiple models, ensemble methods can reduce overfitting, increase generalization performance, and improve the robustness of predictions. They have been successfully applied in various machine learning tasks, including classification, regression, and anomaly detection.
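As a rough illustration of the four techniques listed above, the following sketch fits one scikit-learn estimator per technique on synthetic data. The dataset, the choice of `AdaBoostClassifier` as the boosting example, the stacked base models, and all hyperparameters are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

ensembles = {
    # Bagging: the same base model fitted on different bootstrap samples.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    # Boosting: models fitted in sequence, each focusing on earlier mistakes.
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),
    # Random Forest: bagged trees plus random feature subsets at each split.
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
    # Stacking: a meta-learner combines the outputs of diverse base models.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("forest", RandomForestClassifier(n_estimators=25, random_state=0)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # accuracy on the held-out split
    print(f"{name}: {scores[name]:.3f}")
```

The scores depend on the synthetic data and seeds, so they should not be read as a ranking of the techniques.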

### Bagging

Bagging, which stands for Bootstrap Aggregating, is an ensemble technique in machine learning that aims to improve the performance and stability of predictive models. It involves creating multiple instances of the same base model by training them on different subsets of the training data obtained through bootstrap sampling.

Here's a step-by-step explanation of the Bagging process:

1. `Bootstrap Sampling`: Given a training dataset with N samples, Bagging generates B bootstrap samples, each built by randomly selecting N samples from the original dataset with replacement. This means that each bootstrap sample can contain duplicate instances and some original instances may be left out.

2. `Base Model Training`: For each bootstrap sample, a base model (e.g., decision tree, random forest, or neural network) is trained independently. Each base model is trained on a different subset of the training data, capturing different variations and patterns in the data.

3. `Prediction Aggregation`: Once the base models are trained, they are used to make predictions on new unseen data. The predictions from all the base models are then aggregated to make the final prediction. The aggregation can be done through averaging (for regression tasks) or voting (for classification tasks).

The key idea behind Bagging is that by training multiple models on different subsets of the data, the ensemble can reduce overfitting and variance in the predictions. The individual models are exposed to different patterns and noise in the data, leading to diverse predictions. When these predictions are combined, the errors tend to cancel out, resulting in a more accurate and robust prediction.
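The three steps can be sketched by hand in a few lines. This is a minimal illustration, assuming a binary classification task on synthetic data, decision trees as the base model, and B = 25 (an odd number, so the vote cannot tie); none of these choices come from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

N = len(X_tr)  # size of the training set
B = 25         # number of bootstrap samples / base models

# Steps 1 and 2: draw B bootstrap samples and fit one tree on each.
trees = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)  # N row indices drawn with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Step 3: aggregate the B predictions by majority vote (labels are 0/1).
all_preds = np.stack([t.predict(X_te) for t in trees])  # shape (B, n_test)
votes_for_one = all_preds.sum(axis=0)
bagged_pred = (votes_for_one > B / 2).astype(int)

bagged_accuracy = (bagged_pred == y_te).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```

For a regression task the last step would average `all_preds` along axis 0 instead of voting.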

One popular type of Bagging is `Random Forest`, which combines the concepts of Bagging and decision trees. In Random Forest, each base model is a decision tree trained on a bootstrap sample of the training data, and only a random subset of the features is considered at each split. The predictions of the individual trees are then aggregated through majority voting (classification) or averaging (regression) to make the final prediction.

`Bagging and Random Forest` are widely used in machine learning because they are relatively simple yet effective techniques for improving model performance, handling noisy data, and reducing overfitting. They have been applied to various tasks such as classification, regression, and feature selection.

## Random Forest Classifier And Regressor

The mathematical intuition behind Random Forest lies in the combination of the bootstrap sampling technique and the ensemble of decision trees. Let's break down the key components:

- `Bootstrap Sampling:` Random Forest starts by creating multiple bootstrap samples from the original training data. Each bootstrap sample is created by randomly selecting data points from the original dataset with replacement. This means that some instances may appear multiple times in a bootstrap sample, while others may be left out.

The mathematical intuition behind bootstrap sampling is rooted in statistics and probability. By creating multiple bootstrap samples, we simulate different possible training datasets that can be generated from the original data. This allows us to capture different variations and distributions of the data, which helps in reducing overfitting and increasing the robustness of the model.
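How many rows end up duplicated or left out can be checked with a short simulation. The sketch below is illustrative only (the value of `N` and the seed are arbitrary assumptions): the chance that a given row is never drawn in N draws is (1 - 1/N)^N, which approaches 1/e, so a bootstrap sample contains roughly 63% of the distinct original rows.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # size of the original dataset (arbitrary for this sketch)

# One bootstrap sample: N row indices drawn with replacement.
idx = rng.integers(0, N, size=N)

# Fraction of distinct original rows that appear at least once.
unique_fraction = np.unique(idx).size / N

# Closed form: 1 - P(a given row is never drawn in N draws).
expected_fraction = 1 - (1 - 1 / N) ** N

print(f"simulated: {unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

The remaining rows, roughly 37% per sample, are the ones "left out" of that bootstrap sample.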

- `Decision Trees:` Random Forest uses decision trees as the base model. Decision trees are binary tree structures that recursively split the data based on features to create a hierarchical decision-making process.

Each decision tree in Random Forest is trained on a different bootstrap sample. When training a decision tree, at each split, a subset of features is randomly selected. This random feature selection helps to introduce diversity among the trees and prevents them from being highly correlated.

The mathematical intuition behind decision trees lies in the algorithm's ability to partition the feature space based on the training data. Decision trees learn to make decisions by evaluating features and creating binary splits that minimize impurity or maximize information gain. This process can be represented mathematically using various impurity measures or objective functions, such as Gini index or entropy.
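To make the impurity measures concrete, here is a small sketch of the Gini index, entropy, and the information gain of one candidate split. The helper names (`gini`, `entropy`, `information_gain`) and the toy labels are hypothetical and chosen for this example.

```python
import numpy as np

def gini(labels):
    """Gini index: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Toy node with four samples of each class, split perfectly into two pure children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A tree-growing algorithm evaluates many candidate splits this way and keeps the one with the lowest weighted impurity, or equivalently the highest gain.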

- `Ensemble Aggregation:` The final prediction in Random Forest is obtained by aggregating the predictions of all the individual decision trees. For classification tasks, this aggregation is typically done through majority voting, where the class with the most votes is chosen as the final prediction. For regression tasks, the predictions are averaged.

The mathematical intuition behind ensemble aggregation is based on the principle of combining many imperfect learners to create a stronger predictor. Each decision tree in the Random Forest has its own errors and limitations, but because the trees are trained on different samples and features, their errors are only partly correlated. By combining their predictions, the ensemble can capture a more accurate representation of the underlying patterns and relationships in the data.
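The aggregation step can be reproduced by hand from a fitted forest. One caveat worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, which often agrees with majority voting but is not identical to it. The sketch below uses synthetic data and arbitrary settings as assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: average the per-tree class probabilities.
Xc, yc = make_classification(n_samples=200, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(Xc, yc)
tree_probas = np.stack([t.predict_proba(Xc) for t in clf.estimators_])
manual_proba = tree_probas.mean(axis=0)                   # shape (n_samples, n_classes)
manual_class = clf.classes_[manual_proba.argmax(axis=1)]  # most probable class per row

# Regression: average the per-tree predictions.
Xr, yr = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(Xr, yr)
manual_reg = np.mean([t.predict(Xr) for t in reg.estimators_], axis=0)

print(np.allclose(manual_proba, clf.predict_proba(Xc)))
print(np.allclose(manual_reg, reg.predict(Xr)))
```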

In summary, the mathematical intuition behind Random Forest comes from the principles of bootstrap sampling, the decision-making process of decision trees, and the aggregation of their predictions. Through these components, Random Forest leverages the diversity and collective knowledge of multiple decision trees to create a more robust and accurate model for both classification and regression tasks.

The mathematical intuition behind Random Forest's use of both row sampling (bootstrap sampling) and feature sampling lies in the principles of ensemble learning and reducing overfitting.

Row Sampling (Bootstrap Sampling): Random Forest uses row sampling, specifically bootstrap sampling, to create multiple subsets of the training data. Bootstrap sampling involves randomly selecting data points (rows) from the original training set with replacement. As a result, each bootstrap sample may contain duplicate instances and some original instances may be left out. This sampling technique ensures that each base model (decision tree) in the Random Forest ensemble is trained on a slightly different subset of the training data.
The mathematical intuition behind row sampling is related to the law of large numbers: averaging over many resampled datasets gives a more stable estimate than relying on any single one. When we generate multiple bootstrap samples, each sample is likely to represent the original training data with some variations. By training each base model on these different subsets, Random Forest captures diverse patterns and noise present in the data. The ensemble benefits from the collective knowledge of the individual models, resulting in a more robust and accurate prediction.

Feature Sampling: In addition to row sampling, Random Forest also employs feature sampling. Feature sampling involves randomly selecting a subset of features (predictors) at each split when constructing a decision tree. Rather than considering all the available features at each split, only a subset of features is considered.
The mathematical intuition behind feature sampling is to introduce randomness and reduce the correlation between base models (decision trees). When all features are available for selection at each split, strong features tend to dominate the decision-making process, resulting in highly correlated trees and potentially overfitting. By randomly selecting a subset of features, Random Forest encourages diversity among the base models. Each decision tree focuses on different sets of features, resulting in a set of less correlated trees. This reduces the risk of overfitting and enhances the ensemble's generalization capability.

The combination of row sampling and feature sampling in Random Forest aims to strike a balance between individual model diversity and predictive accuracy. By leveraging the diverse perspectives captured through row sampling and feature sampling, Random Forest can effectively handle complex and noisy datasets, reduce overfitting, and provide robust predictions.
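The effect of feature sampling on tree correlation can be probed empirically. The sketch below is one possible way to do it, not a standard diagnostic: it fits two forests that differ only in `max_features` and reports the mean pairwise correlation between the individual trees' predictions on held-out data. The helper `mean_tree_correlation`, the synthetic dataset, and all settings are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=400, n_features=16, n_informative=4, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def mean_tree_correlation(max_features):
    """Mean pairwise correlation between the trees' held-out predictions."""
    forest = RandomForestRegressor(
        n_estimators=30, max_features=max_features, random_state=0
    ).fit(X_tr, y_tr)
    preds = np.stack([t.predict(X_te) for t in forest.estimators_])  # (n_trees, n_test)
    corr = np.corrcoef(preds)                        # (n_trees, n_trees) matrix
    off_diag = corr[~np.eye(len(corr), dtype=bool)]  # drop each tree's self-correlation
    return off_diag.mean()

corr_all = mean_tree_correlation(None)     # every feature considered at each split
corr_sqrt = mean_tree_correlation("sqrt")  # random subset of sqrt(16) = 4 features

print(f"all features:  {corr_all:.3f}")
print(f"sqrt features: {corr_sqrt:.3f}")
```

If the reasoning above holds for the data at hand, the second number should come out lower than the first; the exact values depend on the dataset and seeds.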

## Out-of-Bag Score

Ensemble methods in machine learning, such as bagging, are designed to improve the predictive accuracy and stability of models by combining the predictions of multiple individual models. The intuition behind bagging lies in the concept of the "wisdom of the crowd," where aggregating the opinions of a group of individuals often leads to better results than relying on a single individual.

Bagging, short for bootstrap aggregating, involves creating multiple subsets of the original training data through a process called bootstrapping. Bootstrapping is a statistical technique where random samples are drawn with replacement from the original dataset, resulting in subsets of the same size as the original dataset. Each subset is used to train a separate base model, typically referred to as weak learners or base learners.

The mathematical intuition behind bagging can be understood in terms of the bias and variance of model predictions. Bias refers to the error introduced by approximating a real-world problem with a simplified model. Variance, on the other hand, refers to the variability of model predictions when trained on different subsets of the training data.

When multiple base models are trained using different bootstrapped subsets, they collectively capture different aspects of the underlying data distribution. Because every base model belongs to the same model family and is trained in the same way, averaging them leaves the bias roughly unchanged. Bagging therefore works best with low-bias, high-variance base models such as deep decision trees, and its gain comes mainly from variance reduction.

In terms of variance, since each base model is trained on a different subset of the data, they are exposed to different variations and noise present in the training data. By averaging or combining their predictions, the impact of individual outliers or noisy samples is reduced. This leads to a decrease in the overall variability or variance of the ensemble's predictions compared to a single model.
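The variance reduction can be stated more precisely under a simplifying assumption. Suppose the B base-model predictions at a point x are identically distributed with variance σ² and pairwise correlation ρ (both symbols are introduced here for illustration). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as B grows, while the first term remains. This is why adding more models helps only up to a point, and why lowering the correlation between base models matters.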

Mathematically, the bagging process can be described as follows:

1. Let's assume we have a training dataset of size N.
2. Generate B bootstrap samples by randomly selecting N samples from the original dataset with replacement. Each bootstrap sample has the same size as the original dataset.
3. Train a separate base model on each bootstrap sample, resulting in B different base models.
4. When making predictions, each base model independently predicts the output for a given input.
5. For regression problems, the final prediction is often computed as the average of the predictions from all base models. For classification problems, voting or averaging probabilities is commonly used.

The ensemble's prediction is the aggregation of the individual predictions from the base models.

The key idea behind the mathematical intuition of bagging is that by combining the predictions of multiple models trained on different subsets of the data, the ensemble model achieves a better overall generalization and robustness, which can lead to improved performance in terms of accuracy and stability.
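The out-of-bag (OOB) score named in this section's title follows from step 2: the rows left out of a bootstrap sample were never seen by the model trained on that sample, so they can act as validation data for it. A minimal sketch, assuming scikit-learn's `oob_score=True` option on `RandomForestClassifier` and synthetic data with arbitrary settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# oob_score=True scores each training row using only the trees whose
# bootstrap sample did not contain that row (requires bootstrap=True).
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=0
)
forest.fit(X_tr, y_tr)

oob_accuracy = forest.oob_score_          # accuracy estimated from left-out rows
test_accuracy = forest.score(X_te, y_te)  # accuracy on a separate hold-out split

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The OOB score gives a generalization estimate without setting aside a separate validation set. The hold-out score is printed here only for comparison.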

## Random Forest Classifier Implementation with Pipeline and Hyperparameter Tuning

In [1]:
import seaborn as sns
df = sns.load_dataset("tips")
df.head()

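| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |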


In [2]:
df["time"].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [3]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [8]:
for i in df.columns:
    print(i, df[i].dtype)

total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64


In [9]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["time"] = encoder.fit_transform(df["time"])

In [12]:
df['time'].value_counts()

0    176
1     68
Name: time, dtype: int64

In [14]:
X = df.drop(labels=["time"], axis=1)
y = df["time"]

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=42)

In [17]:
X_train.head()

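| | total_bill | tip | sex | smoker | day | size |
|---|---|---|---|---|---|---|
| 228 | 13.28 | 2.72 | Male | No | Sat | 2 |
| 208 | 24.27 | 2.03 | Male | Yes | Sat | 2 |
| 96 | 27.28 | 4.0 | Male | Yes | Fri | 2 |
| 167 | 31.71 | 4.5 | Male | No | Sun | 4 |
| 84 | 15.98 | 2.03 | Male | No | Thur | 2 |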


In [19]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [20]:
categorical_cols = ["sex", "smoker", "day"]
numerical_cols = ["total_bill", "tip", "size"]

In [23]:
num_pipeline = Pipeline(
    steps = [
        ('imputer',SimpleImputer(strategy="median")),
        ('scaler', StandardScaler())
    ]
)



In [24]:
cat_pipeline = Pipeline(
    steps = [
        ('imputer',SimpleImputer(strategy="most_frequent")),
        ('onehotencoding', OneHotEncoder())
    ]
)



In [25]:
preprocessor = ColumnTransformer([
    ("num_pipeline", num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])

In [26]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [41]:
models = {
    "Random Forest": RandomForestClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "SVC": SVC(),
}

In [42]:
from sklearn.metrics import accuracy_score

In [43]:
def evaluate_model(X_train, X_test, y_train, y_test, models):
    """Fit each model and return a dict mapping model name to test accuracy."""
    report = {}

    for name, model in models.items():
        # Train on the training split, then score on the held-out test split.
        model.fit(X_train, y_train)
        y_test_pred = model.predict(X_test)
        report[name] = accuracy_score(y_test, y_test_pred)

    return report

In [44]:
evaluate_model(X_train, X_test, y_train, y_test, models)

{'Random Forest': 0.9591836734693877,
 'DecisionTree': 0.9387755102040817,
 'SVC': 0.9591836734693877}

In [45]:
classification = RandomForestClassifier()

In [46]:
# Hyperparameter tuning: search space for the random forest
params = {
    "max_depth":[3,5,10,None],
    "n_estimators":[100,200,300],
    "criterion":['gini','entropy']
}

In [47]:
from sklearn.model_selection import RandomizedSearchCV

In [49]:
rm = RandomizedSearchCV(classification, param_distributions=params, scoring="accuracy", cv=5, verbose=3)

In [50]:
rm.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END criterion=entropy, max_depth=10, n_estimators=100;, score=0.974 total time=   0.2s
[CV 2/5] END criterion=entropy, max_depth=10, n_estimators=100;, score=0.923 total time=   0.2s
[CV 3/5] END criterion=entropy, max_depth=10, n_estimators=100;, score=1.000 total time=   0.2s
[CV 4/5] END criterion=entropy, max_depth=10, n_estimators=100;, score=0.897 total time=   0.2s
[CV 5/5] END criterion=entropy, max_depth=10, n_estimators=100;, score=0.949 total time=   0.2s
[CV 1/5] END criterion=gini, max_depth=None, n_estimators=100;, score=0.974 total time=   0.2s
[CV 2/5] END criterion=gini, max_depth=None, n_estimators=100;, score=0.923 total time=   0.2s
[CV 3/5] END criterion=gini, max_depth=None, n_estimators=100;, score=1.000 total time=   0.2s
[CV 4/5] END criterion=gini, max_depth=None, n_estimators=100;, score=0.949 total time=   0.2s
[CV 5/5] END criterion=gini, max_depth=None, n_estimators=100;, score=0.923 tot

In [53]:
rm.best_params_

{'n_estimators': 100, 'max_depth': None, 'criterion': 'gini'}
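The cells above preprocess the data first and then tune the classifier on the transformed arrays. An alternative is to place the preprocessor and the classifier in a single `Pipeline`, so that imputation, scaling, and encoding are refit inside every cross-validation fold. The following is only a sketch of that variant: to run offline it uses a small synthetic frame with the same column names as `tips` (an assumption, including the made-up rule that generates `time`), and the names `demo`, `model`, and `search`, the `handle_unknown="ignore"` setting, and the `n_iter` and `cv` values are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the tips data (assumption: same column names only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "tip": rng.uniform(1, 10, n),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "size": rng.integers(1, 7, n),
})
# Hypothetical target: 1 ("Lunch") on weekdays, 0 ("Dinner") on weekends.
demo["time"] = (~demo["day"].isin(["Sat", "Sun"])).astype(int)

X = demo.drop(columns=["time"])
y = demo["time"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

categorical_cols = ["sex", "smoker", "day"]
numerical_cols = ["total_bill", "tip", "size"]

preprocessor = ColumnTransformer([
    ("num_pipeline", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numerical_cols),
    ("cat_pipeline", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehotencoding", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Preprocessing and model in one estimator, refit together in every CV fold.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
params = {
    "classifier__max_depth": [3, 5, 10, None],
    "classifier__n_estimators": [100, 200, 300],
    "classifier__criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    model, param_distributions=params, n_iter=5,
    scoring="accuracy", cv=3, random_state=42,
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # uses the refit best pipeline
print(search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

With this layout the raw `X_train` and `X_test` frames are passed directly to `fit` and `score`, and the tuned pipeline can be applied to new rows without a separate preprocessing call.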