<a href="https://colab.research.google.com/github/MishraShardendu22/Ensemble-Learning-Project/blob/main/Ensemble_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import seaborn as sns
df = sns.load_dataset('iris')
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


**Label Encoder**

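* Converts categorical values into integer labels (e.g., `blue → 0`, `red → 1`; scikit-learn's `LabelEncoder` numbers the classes in sorted order).
* The resulting integers imply an ordering that is arbitrary and carries no real meaning.
* Used for **target variables**; for ordinal features it fits only when the sorted order of the labels happens to match their real order.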
* Output: **1D integer array**.
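
The points above can be checked with a minimal sketch. The `colors` list is hypothetical toy data, not part of the notebook; the calls are scikit-learn's `LabelEncoder` API.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, used only for illustration.
colors = ["red", "blue", "green", "blue"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# Classes are stored in sorted order, so the integer codes follow
# alphabetical order rather than any real-world ranking.
print(le.classes_)                   # ['blue' 'green' 'red']
print(codes)                         # [2 0 1 0]
print(le.inverse_transform(codes))   # recovers the original strings
```

The output is a 1D integer array, which is why the encoder suits a target column rather than a feature matrix.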

**StandardScaler ("Standard Encoder")**

* Standardizes each numerical feature to zero mean and unit variance: `z = (x - mean) / std`.
* Does **not** handle categorical data.
* Used for **numerical features** only.
* Output: **scaled numerical values**.

**Core difference**

* Label Encoder = categorical → integers.
* StandardScaler ("Standard Encoder") = numerical scaling (standardization to zero mean and unit variance, not an encoding).

**Example 1: Correct usage (numerical data)**

Numerical feature:

```
Age = [20, 30, 40]
```

Apply StandardScaler:

* mean = 30
* std ≈ 8.16 (population standard deviation, which is what `StandardScaler` uses)

Result:

```
[-1.22, 0.00, 1.22]
```

Nothing semantic is lost: the order and relative spacing of the ages are preserved, so distances stay meaningful.
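
The arithmetic in this example can be reproduced with a short sketch. The variable names are illustrative; `mean_` and `scale_` are attributes that scikit-learn's `StandardScaler` exposes after fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature (Age) with three samples; scikit-learn expects a 2D array.
age = np.array([[20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(age)

print(scaler.mean_)             # [30.]
print(scaler.scale_)            # about 8.165 (population std, ddof=0)
print(scaled.ravel().round(2))  # about -1.22, 0.00, 1.22
```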

---

**Example 2: Wrong usage (categorical data)**

Categorical feature (label-encoded):

```
Color = ["Red", "Blue", "Green"]
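Integer-coded → [0, 1, 2]   (arbitrary mapping for illustration; scikit-learn's LabelEncoder sorts the classes and would give [2, 0, 1])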
```

Apply StandardScaler:

* mean = 1
* std ≈ 0.82

Result:

```
[-1.22, 0.00, 1.22]
```

Problem:

* `Green` (1.22) now looks “greater” than `Red` (-1.22), with `Blue` exactly halfway between them.
* Distances are artificial.
* Order is fake.
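
One way to see that these distances are artificial is to try every possible integer mapping for the same three colors. This is a sketch under the assumption that any assignment of 0, 1, 2 is equally valid; the standardization formula is written out by hand rather than calling `StandardScaler`.

```python
from itertools import permutations

import numpy as np

colors = ["Red", "Blue", "Green"]
distances = []

# Every assignment of the codes 0, 1, 2 to the three colors is equally "valid".
for codes in permutations([0, 1, 2]):
    x = np.array(codes, dtype=float)
    z = (x - x.mean()) / x.std()          # same formula StandardScaler applies
    d = round(abs(z[0] - z[2]), 2)        # scaled distance between Red and Green
    distances.append(d)
    print(dict(zip(colors, codes)), "-> |Red - Green| =", d)

# The "distance" between the same two colors depends only on the mapping chosen.
print(sorted(set(distances)))             # [1.22, 2.45]
```

The categories never change, yet the scaled distance between `Red` and `Green` flips between two values, which is the false meaning described above.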

---

**Bottom line**

* Scaling numbers preserves meaning.
* Scaling categories **creates false meaning**.

In [3]:
ydf = df['species']
ydf.head()

| | species |
|---|---|
| 0 | setosa |
| 1 | setosa |
| 2 | setosa |
| 3 | setosa |
| 4 | setosa |


In [4]:
import pandas as pd

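from sklearn.preprocessing import StandardScaler

# Caveat: the scaler is fit on all rows here, before the train/test split below,
# so test-set statistics influence the scaling. Fitting on the training split only avoids this.
scaler = StandardScaler()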

xdf = df.drop('species', axis=1)
xdf = pd.DataFrame(
    scaler.fit_transform(xdf),
    columns=xdf.columns,
    index=xdf.index
)

xdf.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |
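
A quick sanity check on a scaled frame like this one is to confirm that each column ends up with mean 0 and standard deviation 1. The sketch below is an assumption-laden stand-in: it uses scikit-learn's bundled copy of the iris data (`load_iris`) so that it runs offline, rather than the seaborn dataset loaded in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled iris data, used here so the sketch needs no download.
features = load_iris(as_frame=True).data

scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# Column means are approximately 0.
print(scaled.mean().round(6))

# StandardScaler divides by the population std, so pass ddof=0;
# pandas defaults to ddof=1, which would report a value slightly above 1.
print(scaled.std(ddof=0).round(6))
```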


In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    xdf, ydf, test_size=0.2, random_state=42, stratify=ydf
)
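
The effect of `stratify` can be verified by counting classes in each split. The sketch below also shows one leakage-free alternative for scaling: fitting the scaler on the training split only. It assumes scikit-learn's bundled iris data instead of the seaborn frame, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the 1:1:1 class ratio in both splits.
print(y_tr.value_counts().sort_index().tolist())  # [40, 40, 40]
print(y_te.value_counts().sort_index().tolist())  # [10, 10, 10]

# Fit the scaler on the training rows only, then apply it to both splits,
# so no test-set statistics reach the model during training.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```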

In [17]:
# Stacking: the base learners' cross-validated predictions become the input features of a meta-learner.

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

base_learners = [
    ('dtc', DecisionTreeClassifier(random_state=42)),
    ('svc', SVC(probability=True, kernel='rbf', random_state=42)),
    ('lgr', LogisticRegression(max_iter=1000))
]

meta_learner = LogisticRegression(max_iter=1000)

from sklearn.ensemble import StackingClassifier

stackingClassifierModel = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner
)

stackingClassifierModel.fit(X_train, y_train)
accuracyStacking = stackingClassifierModel.score(X_test, y_test)

print(accuracyStacking)

0.9666666666666667
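
To make the mechanism concrete, here is a hand-rolled sketch of roughly what stacking does: out-of-fold class probabilities from each base learner become the meta-learner's features. It is an approximation for illustration, not `StackingClassifier`'s actual implementation, and it assumes scikit-learn's bundled, unscaled iris data, so its score need not match the one above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base = [
    DecisionTreeClassifier(random_state=42),
    SVC(probability=True, random_state=42),
    LogisticRegression(max_iter=1000),
]

# Step 1: out-of-fold class probabilities, so the meta-learner never sees
# predictions made on rows a base model was trained on.
oof = [cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba") for m in base]
meta_train = np.hstack(oof)          # 3 models x 3 classes = 9 meta-features

# Step 2: train the meta-learner on those meta-features.
meta = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# Step 3: refit each base model on the full training split and
# build the same meta-features for the test split.
meta_test = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in base])

stack_accuracy = meta.score(meta_test, y_te)
print(meta_train.shape, meta_test.shape)  # (120, 9) (30, 9)
print(stack_accuracy)
```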


In [18]:
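# Bagging (Bootstrap Aggregating): train each model on a bootstrap sample and aggregate their votes.
# It reduces variance, so it helps most when a single model (e.g., a deep tree) overfits.
# A random forest is bagged decision trees plus a random feature subset at each split.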

from sklearn.ensemble import RandomForestClassifier

randomForestModel = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=42,
)

randomForestModel.fit(X_train, y_train)
accuracyRandomForest = randomForestModel.score(X_test, y_test)

print(accuracyRandomForest)

0.9333333333333333
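
The bootstrap-and-vote idea behind bagging can be sketched by hand with plain decision trees. This is a simplified illustration, not a random forest: it omits the per-split feature subsampling, and it assumes scikit-learn's bundled iris data and an arbitrary choice of 25 trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25  # arbitrary choice for the sketch
all_preds = []

for _ in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

all_preds = np.array(all_preds)  # shape: (n_trees, n_test_samples)

# Aggregate: majority vote across trees for each test sample.
bagged = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])
bagging_accuracy = (bagged == y_te).mean()
print(bagging_accuracy)
```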


In [20]:
# Boosting: models are trained sequentially, each one focusing on the errors left by the previous ones.
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

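from sklearn.preprocessing import LabelEncoder

# XGBClassifier expects integer class labels (0, 1, 2), so encode the species strings.
# The encoder is fit on the training labels and reused for the test labels.
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)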

adaBoostModel = AdaBoostClassifier(
    random_state=42,
    n_estimators=100,
    learning_rate=0.1,
)

gradientBoostModel = GradientBoostingClassifier(
    random_state=42,
    n_estimators=100,
    learning_rate=0.1,
)

xgbModel = XGBClassifier(
    max_depth=3,
    random_state=42,
    n_estimators=100,
    learning_rate=0.1,
    eval_metric='mlogloss',
    objective='multi:softprob',
)

xgbModel.fit(X_train, y_train_encoded)
adaBoostModel.fit(X_train, y_train_encoded)
gradientBoostModel.fit(X_train, y_train_encoded)

accuracyXGB = xgbModel.score(X_test, y_test_encoded)
accuracyAdaBoost = adaBoostModel.score(X_test, y_test_encoded)
accuracyGradientBoost = gradientBoostModel.score(X_test, y_test_encoded)

print("Accuracy of XGBoost - ",accuracyXGB)
print("Accuracy of AdaBoost - ",accuracyAdaBoost)
print("Accuracy of Gradient Boost - ",accuracyGradientBoost)

Accuracy of XGBoost -  0.9333333333333333
Accuracy of AdaBoost -  0.9666666666666667
Accuracy of Gradient Boost -  0.9666666666666667
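
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch assumes scikit-learn's bundled iris data, whose targets are already integers, so no `LabelEncoder` is needed here; the checkpoints printed are an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
).fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators stages,
# showing how the ensemble evolves as each new stage corrects earlier errors.
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

for n in (1, 5, 20, 100):
    print(n, "stages ->", staged_acc[n - 1])
```

The final entry of `staged_acc` equals `gb.score(X_te, y_te)`, since the last stage is the full model.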
