Bagging vs Boosting (Explanation) 📌
Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques that combine multiple models to improve accuracy. They serve different purposes: bagging mainly reduces variance, while boosting mainly reduces bias.

1️⃣ Bagging (Bootstrap Aggregating)
📌 Concept: Multiple base models are trained independently (so they can run in parallel), and their predictions are combined by averaging or taking the mode to get the final prediction.
📌 Working

Random subsets of the dataset are drawn with replacement (bootstrap samples).
An independent model (usually a Decision Tree) is trained on each subset.
The final prediction comes from majority voting (classification) or averaging (regression).
📌 Example: Random Forest is the best-known example of bagging; it also picks a random subset of features at each split.
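The three steps above can be sketched by hand in a few lines. This is only an illustrative sketch: the synthetic dataset from `make_classification`, the choice of 25 trees, and all variable names are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption, used only so the sketch runs on its own)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a 0/1 majority vote can never tie
trees = []
for _ in range(n_models):
    # Step 1: bootstrap sample = row indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train an independent tree on that sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 3: majority vote (labels are 0/1, so mean vote > 0.5 means class 1)
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```

`BaggingClassifier` in scikit-learn packages the same idea behind one estimator.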

📌 Advantages
✅ Overfitting is reduced, because the models are trained on different data subsets and their individual errors partly cancel out.
✅ Variance is reduced, which makes the model more stable.

2️⃣ Boosting
📌 Concept: A weak model is trained first, and the next model corrects its errors. It is a sequential (step-by-step) process.
📌 Working

The first weak learner is trained.
The next weak learner gives more importance to the samples the previous one got wrong.
This process repeats iteratively until the weak learners together form a strong final model.
📌 Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM
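The reweighting loop above can be sketched in the AdaBoost style, using decision stumps as weak learners. This is a simplified illustration under assumptions (synthetic data, 20 rounds, labels recoded to -1/+1, hypothetical helper `boosted_predict`); gradient boosting libraries such as XGBoost work differently, by fitting each new model to the remaining error of the ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption); labels recoded to {-1, +1} to keep the math simple
X, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(X), 1 / len(X))  # every sample starts with equal weight
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the current sample weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error, clipped to avoid division by zero
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # how much say this learner gets

    # Misclassified samples (y * pred == -1) get heavier, correct ones lighter
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all stumps (hypothetical helper for this sketch)."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(scores)

boosted_acc = (boosted_predict(X) == y).mean()
print(f"training accuracy after {n_rounds} rounds: {boosted_acc:.3f}")
```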

📌 Advantages
✅ Often high accuracy, because each step focuses on the errors left by the previous steps.
✅ Works well on complex datasets.


👀 Difference Between Bagging & Boosting
Feature	Bagging	Boosting
Model training	Parallel	Sequential
Focus	Reduce variance (controls overfitting)	Reduce bias (minimizes error)
How it works	Multiple independent models are trained	Each model corrects the errors of the previous one
Examples	Random Forest	AdaBoost, XGBoost
🛠️ When to Use Which?
✅ Bagging: when the model overfits or the dataset is noisy (e.g. Random Forest).
✅ Boosting: when the model has high bias or you need to squeeze out more accuracy (e.g. XGBoost in Kaggle competitions).
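In practice, a quick way to decide is to cross-validate one model of each kind on the same data. The sketch below assumes synthetic, slightly noisy data (`flip_y=0.1`) and default-ish settings for scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier`; the numbers it prints say nothing about any real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 10% label noise (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)

candidates = {
    "bagging (RandomForest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (GradientBoosting)": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```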

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv('diabetes.csv')
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [4]:
X = df.drop(['Outcome'],axis='columns')
X

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [5]:
y = df['Outcome']
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [6]:
df.Outcome.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)
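Because the classes are imbalanced (500 vs 268) and the split above is unseeded, the scores change from run to run. A possible variant, sketched here on stand-in arrays with the same class counts (the random features and the `_demo` names are assumptions), fixes `random_state` and uses `stratify` so train and test keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the Outcome column (features are random)
rng = np.random.default_rng(0)
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = rng.normal(size=(len(y_demo), 8))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    stratify=y_demo,   # keep the 0/1 ratio the same in both splits
    random_state=42,   # make the split reproducible
)
print(f"positive rate, train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```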

In [9]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

In [36]:
model.fit(X_train,y_train)
model.score(X_test,y_test)*100

68.3982683982684

In [31]:
from sklearn.model_selection import cross_val_score
score = cross_val_score(DecisionTreeClassifier(),X,y,cv = 5)

In [33]:
score.mean()*100

np.float64(71.3649096002037)

In [39]:
from sklearn.ensemble import BaggingClassifier
bag_model = BaggingClassifier(
    estimator = DecisionTreeClassifier(),
    n_estimators =100,
    max_samples = 0.7,
    oob_score = True,
    random_state=0
)

In [40]:
bag_model.fit(X_train,y_train)

In [42]:
bag_model.oob_score_

0.7579143389199255
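The OOB (out-of-bag) score is accuracy measured on the rows each estimator never saw during its bootstrap draw. A back-of-the-envelope calculation shows how many rows that is. It assumes 537 training rows (70% of 768) and that each estimator draws `int(0.7 * 537)` rows with replacement.

```python
import math

n_train = 537        # assumed size of the training split (70% of 768 rows)
max_samples = 0.7    # same value passed to BaggingClassifier above
draws = int(max_samples * n_train)  # rows drawn with replacement per estimator

# Chance that one particular row is never picked in `draws` independent draws
p_out_of_bag = (1 - 1 / n_train) ** draws
approx = math.exp(-max_samples)  # large-n approximation of the same quantity

print(f"exact: {p_out_of_bag:.4f}  approx: {approx:.4f}")
```

So roughly half of the training rows are out-of-bag for any single tree, which gives a validation-style estimate without touching the test set.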

In [43]:
bag_model.predict(X_test)

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1])

In [45]:
bag_model.score(X_test,y_test)*100

74.45887445887446
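A single train/test split can be noisy, so a fairer comparison is to cross-validate the bagged ensemble the same way the single tree was cross-validated earlier. The sketch below uses synthetic stand-in data (an assumption); with the notebook's `X` and `y` the calls would be the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with X, y from the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# Base estimator passed positionally, so the keyword name does not matter
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    max_samples=0.7,
    random_state=0,
)

tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=5)
bag_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=5)
print(f"single tree : {tree_scores.mean():.3f}")
print(f"bagged trees: {bag_scores.mean():.3f}")
```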

In [47]:
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
bag_model = BaggingClassifier(
    estimator = LogisticRegression(),
    n_estimators =100,
    max_samples = 0.7,
    oob_score = True,
    random_state=0
)

In [48]:
bag_model.fit(X_train,y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [49]:
bag_model.score(X_test,y_test)*100

78.35497835497836
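The convergence warning above suggests scaling the data or raising `max_iter`. One way to do both is to wrap the scaler and the logistic regression in a pipeline and bag the pipeline, so each bootstrap sample is scaled on its own. This is a sketch on synthetic data with deliberately mismatched feature scales (an assumption), not a rerun of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption), stretched so features live on very different scales
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo = X_demo * np.array([1, 10, 100, 1, 50, 5, 0.1, 20])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Scale first, then fit logistic regression with a higher iteration cap
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

bag_logreg = BaggingClassifier(
    scaled_logreg,       # base estimator, passed positionally
    n_estimators=50,
    max_samples=0.7,
    oob_score=True,
    random_state=0,
)
bag_logreg.fit(X_tr, y_tr)

oob_acc = bag_logreg.oob_score_
test_acc = bag_logreg.score(X_te, y_te)
print(f"OOB accuracy: {oob_acc:.3f}  test accuracy: {test_acc:.3f}")
```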