## Q1. How does bagging reduce overfitting in decision trees?

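Ans-Bagging reduces the chance of overfitting in complex models. It trains a large number of “strong” learners in parallel, each on a different bootstrap sample of the training data. A strong learner here is a relatively unconstrained model, such as a fully grown decision tree, which has low bias but high variance. Bagging then combines the learners' predictions, by voting or averaging, in order to “smooth out” the errors that individual trees make by fitting noise in their own samples.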
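As a rough illustration of this effect, the sketch below compares one unconstrained tree with a bagged ensemble of such trees. The synthetic dataset, the 10% label noise (`flip_y`), the 70/30 split, and the ensemble size of 100 are all assumptions chosen for illustration; the gap between training and test accuracy is only one informal indicator of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption: flip_y randomly reassigns ~10% of labels)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single unconstrained decision tree
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 unconstrained trees, each fit on its own bootstrap sample
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           random_state=0).fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
bag_train, bag_test = bagged.score(X_train, y_train), bagged.score(X_test, y_test)

print(f"single tree : train={tree_train:.3f}  test={tree_test:.3f}")
print(f"bagged trees: train={bag_train:.3f}  test={bag_test:.3f}")
```

The exact numbers depend on the data and the library version, so no expected output is shown here.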

## Q2. What are the advantages and disadvantages of using different types of base learners in bagging?

Ans-Bagging, also known as bootstrap aggregating, is an ensemble learning technique that improves the accuracy and stability of machine learning algorithms. It addresses the bias-variance trade-off by reducing the variance of a prediction model. Bagging reduces overfitting and is used for both regression and classification models, most commonly with decision trees.



### Steps to Perform Bagging

1. Suppose the training set has n observations and m features. Draw a random sample of observations from the training set with replacement (a bootstrap sample).
2. Randomly choose a subset of the m features to build a model on the sampled observations. This feature-sampling step is what random forests add to plain bagging.
3. At each node, split on the feature from that subset that gives the best split.
4. Grow the tree in this way, so that every node, starting from the root, uses the best available split.
5. Repeat the steps above once for each tree in the ensemble, then aggregate the outputs of the individual trees to produce the final prediction.
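The steps above can be sketched by hand as follows. This is a minimal illustration, not a replacement for scikit-learn's own ensemble classes: the ensemble size `n_trees = 25` is a hypothetical choice, `max_features="sqrt"` is assumed as one way to sample features at each split, and accuracy is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]
n_trees = 25  # hypothetical ensemble size

trees = []
for i in range(n_trees):
    # Bootstrap sample: draw n_samples row indices WITH replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features="sqrt" considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregate: every tree votes and the most common class wins
all_votes = np.array([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
ensemble_pred = np.array([np.bincount(col).argmax() for col in all_votes.T])

train_accuracy = (ensemble_pred == y).mean()
print("Training accuracy of the hand-rolled ensemble:", train_accuracy)
```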

### Advantages of Bagging in Machine Learning

- Bagging reduces overfitting.
- It often improves the model’s accuracy.
- It handles high-dimensional data well.

### Advantages and Disadvantages of Bagging

Random forest is one of the most popular bagging algorithms. Bagging lets many weak learners combine to outperform a single strong learner. It also reduces variance, which lowers the risk of overfitting.

One disadvantage of bagging is a loss of interpretability: an ensemble of many models is harder to explain than a single model. The resulting model can also be heavily biased when the procedure is not applied properly, for example when the base learners themselves underfit, because averaging does not remove bias. And although bagging is often highly accurate, it can be computationally expensive, which may discourage its use in some settings.
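To make the comparison of base learners concrete, the sketch below bags three different learners and scores each ensemble with 5-fold cross-validation. The choice of learners, the iris dataset, and `n_estimators=20` are assumptions for illustration; which base learner gains most from bagging depends on the data, so this does not show that any one of them is better in general.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate base learners (illustrative selection)
base_learners = {
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, learner in base_learners.items():
    # The base learner is passed positionally as the first argument
    bagged = BaggingClassifier(learner, n_estimators=20, random_state=0)
    scores[name] = cross_val_score(bagged, X, y, cv=5).mean()
    print(f"bagged {name}: mean CV accuracy = {scores[name]:.3f}")
```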

## Q3. How does the choice of base learner affect the bias-variance tradeoff in bagging?

Ans-When building a machine learning model, it is important to manage bias and variance in order to avoid both overfitting and underfitting. A very simple model with few parameters tends to have low variance and high bias, whereas a model with a large number of parameters tends to have high variance and low bias. The balance between the bias error and the variance error is known as the bias-variance trade-off.

For accurate predictions, a model needs both low variance and low bias. In practice this is hard to achieve, because bias and variance are linked:

- Decreasing the variance tends to increase the bias.
- Decreasing the bias tends to increase the variance.

The bias-variance trade-off is a central issue in supervised learning. Ideally, we want a model that captures the regularities in the training data and also generalizes well to unseen data. It is usually hard to do both at once: a high-variance algorithm may perform well on the training data but overfit to noise, whereas a high-bias algorithm produces a much simpler model that may miss important regularities in the data.

Hence, the bias-variance trade-off is about finding the sweet spot between bias and variance errors. In bagging, averaging mainly reduces variance and leaves bias largely unchanged, so the choice of base learner sets the bias of the ensemble: low-bias, high-variance learners such as deep decision trees benefit most, while high-bias learners such as very shallow trees or linear models gain little.
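One way to see how bagging interacts with this trade-off is the standard variance formula for an average of identically distributed predictors. The symbols are assumptions introduced here rather than notation from the text above: $B$ is the number of bagged models, $f_b(x)$ is the prediction of model $b$, $\sigma^2$ is the variance of a single model's prediction, and $\rho$ is the pairwise correlation between models.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As $B$ grows, the second term shrinks toward zero, so bagging helps most when the individual models have high variance $\sigma^2$ and low correlation $\rho$. The expected value of the average equals that of a single model, so the bias is not reduced.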


## Q4. Can bagging be used for both classification and regression tasks? How does it differ in each case?

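Ans-Bagging, also known as bootstrap aggregation, is an ensemble method that can be used for both regression and classification.
The training procedure is the same in both cases: each model is fit on a bootstrap sample. The difference lies in the aggregation step. For classification, the ensemble takes a majority vote over the predicted classes (or averages the predicted class probabilities); for regression, it averages the predicted values.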
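The difference in the aggregation step can be sketched with plain Python. The function names and the sample predictions below are hypothetical, made up only to show a majority vote next to an average.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(predictions):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Average of the numeric values predicted by each model."""
    return mean(predictions)

# Hypothetical outputs of five bagged models for a single input
class_votes = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]
price_estimates = [310_000, 295_000, 305_000, 320_000, 300_000]

print(aggregate_classification(class_votes))  # versicolor
print(aggregate_regression(price_estimates))  # 306000
```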

## Q5. What is the role of ensemble size in bagging? How many models should be included in the ensemble?

Ans-Bagging is a powerful ensemble method that reduces variance and, by extension, helps prevent overfitting. Ensemble methods improve predictive accuracy by using a group (or "ensemble") of models which, when combined, outperform the individual models used on their own.

There are no fixed rules for the number of models. You can start with as few as 3. In bagging, adding more models generally does not increase overfitting; accuracy tends to level off while training and prediction costs keep growing. If training is cheap, treat the number of models as a hyperparameter and tune it.
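Treating the number of models as a hyperparameter can be sketched as a simple loop over candidate sizes. The candidate values, the synthetic dataset, and the use of 5-fold cross-validation are assumptions for illustration; on other data the point where accuracy levels off may be quite different.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

results = {}
for n_estimators in [1, 5, 25, 100]:  # hypothetical candidate ensemble sizes
    bagged = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators=n_estimators, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    results[n_estimators] = cross_val_score(bagged, X, y, cv=5).mean()
    print(f"n_estimators={n_estimators:>3}: mean CV accuracy = {results[n_estimators]:.3f}")
```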

## Q6. Can you provide an example of a real-world application of bagging in machine learning?
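A minimal sketch of bagging on real data, assuming scikit-learn's bundled breast cancer diagnostic dataset as a stand-in for a real-world tabular problem (classifying tumors from measurements). The ensemble size of 100 and the 70/30 split are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and the binary target (malignant vs. benign)
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the data for testing, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Bag 100 decision trees, each trained on a bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=42)
bagging.fit(X_train, y_train)

test_accuracy = bagging.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```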

from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the dataset
data = pd.read_csv("house_prices.csv")

# separates independent (features) and dependent (prices) variables
X = data.drop("price", axis=1)
y = data["price"]

# create the linear regression model
model = LinearRegression()

# fit the model to the data
model.fit(X, y)

# perform a prediction for a new set of features
# (assumes the feature columns in X are exactly: area, rooms, bathrooms)
new_house = [[1500, 3, 2]]
price = model.predict(new_house)

print("Expected price for the new house:", price)

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load the iris dataset
iris = load_iris()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# create the logistic regression model
logreg = LogisticRegression()

# fit the model to the training data
logreg.fit(X_train, y_train)

# predict the target values for the test data
y_pred = logreg.predict(X_test)

# print the accuracy score of the model
print("Accuracy:", logreg.score(X_test, y_test))

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# load the iris dataset
iris = load_iris()

# separate the features (independent variables) and target (dependent variable)
X = iris.data
y = iris.target

# create a Decision Tree classifier
clf = DecisionTreeClassifier()

# fit the model to the data
clf.fit(X, y)

# use the model to make predictions
new_observation = [[5.2, 3.1, 4.2, 1.5]] # a new observation to predict
prediction = clf.predict(new_observation)

print("Prediction for the new observation:", prediction)

Prediction for the new observation: [1]


In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# generate a random dataset
X, y = make_classification(n_features=4, random_state=0)

# create a random forest classifier with 100 estimators
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# fit the model to the data
rf.fit(X, y)

# predict the class of a new observation
new_observation = [[-2, 2, -1, 1]]
print("Predicted class:", rf.predict(new_observation))

Predicted class: [0]


In [4]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm

# Load the iris dataset
iris = datasets.load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

# Train the SVM classifier on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Print the accuracy of the classifier
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 1.0


In [5]:
from sklearn.naive_bayes import GaussianNB
import numpy as np

# training data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# create Naive Bayes classifier and fit to the data
clf = GaussianNB()
clf.fit(X, Y)

# make a prediction for a new data point
new_point = [[0, 0]]
prediction = clf.predict(new_point)

print("Prediction:", prediction)

Prediction: [1]


In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load the iris dataset
iris = load_iris()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# create a kNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# fit the classifier to the training data
knn.fit(X_train, y_train)

# predict the classes of the testing set
y_pred = knn.predict(X_test)

# print the accuracy of the classifier
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.9555555555555556


In [7]:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create a gradient boosting classifier with default parameters
clf = GradientBoostingClassifier()

# train the model on the training data
clf.fit(X_train, y_train)

# make predictions on the test data
y_pred = clf.predict(X_test)

# calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.96
