**Doubt 1.1**

In [130]:



# "Sir says that in stacking, overfitting happens
# when we train on a dataset and also make
# predictions on that same dataset.
# So why would we ever do that?
# Training has to be done on the training data anyway,
# and predictions are made on the testing data, right?"

**Solution**

In [131]:
# The problem is not the final test step. It is how the
# meta-model's training inputs are produced.

# In naive stacking, the base models are fit on the training data
# and then predict on that same training data →
# they give “perfect predictions” (or very accurate ones),
# because they have already seen every row, labels included.

# So the input that the meta-model receives
# (the output of the base models) is very clean
# and unrealistically perfect.
# The meta-model learns to trust the base models almost blindly →
# that’s why the training accuracy becomes very high (sometimes even 100%).

# But when new test data comes →
# the base models don’t perform that perfectly
# (because it’s unseen data),
# and the meta-model, which was trained on those
# “perfect signals,” performs worse than expected.

# This is called overfitting in stacking.

# Think of it simply like this:

# During training, you cheated: the meta-model's inputs came from
# base models that had already seen the answers.

# But during testing, the cheating doesn’t work → the accuracy drops.

# The fix: train the meta-model on out-of-fold (cross-validated)
# predictions of the base models. This is what the cv argument
# of StackingClassifier does.
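
The gap between "seen" and "unseen" predictions can be made visible with a small sketch. It assumes a synthetic dataset from `make_classification` as a stand-in (not the heart dataset), with some label noise added so that memorisation shows up clearly. The variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption, not the heart dataset).
# flip_y adds label noise so that memorising the training set is visible.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           flip_y=0.2, random_state=42)

base = RandomForestClassifier(n_estimators=50, random_state=42)

# Leaky meta-features: the base model predicts rows it was trained on.
in_sample_pred = base.fit(X, y).predict(X)

# Leak-free meta-features: each row is predicted by a model
# that did not see that row during fitting.
out_of_fold_pred = cross_val_predict(base, X, y, cv=5)

in_sample_acc = accuracy_score(y, in_sample_pred)
out_of_fold_acc = accuracy_score(y, out_of_fold_pred)
print("in-sample accuracy  :", in_sample_acc)
print("out-of-fold accuracy:", out_of_fold_acc)
```

A meta-model trained on `in_sample_pred` would see base models that look far more reliable than they are on new data; `out_of_fold_pred` is closer to what it will actually receive at test time.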

**Implementation of Stacking in scikit-learn**

In [132]:
import numpy as np
import pandas as pd



In [133]:
v = pd.read_csv("/content/heart.csv")

In [134]:
v.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [186]:
v.shape

(303, 14)

In [135]:
x = v.drop(columns = ["target"])
y = v["target"]

**Let's do the train-test split**

In [136]:
from sklearn.model_selection import  train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 ,
                                                       random_state = 42)

In [137]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [138]:
estimators = [
    ("rf"  , RandomForestClassifier(n_estimators = 10 , random_state = 42)),
    ("knn" , KNeighborsClassifier(n_neighbors = 5)),
    ("gbdt", GradientBoostingClassifier())
]

# Here each base model is given a name
# and a specific algorithm.

In [139]:
from sklearn.ensemble import StackingClassifier

clf = StackingClassifier(
    estimators = estimators , final_estimator = LogisticRegression(),
    cv = 10
)
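# Here our final estimator is Logistic Regression.
# The final estimator is also known as the meta-model.
# cv = 10 means the meta-model is trained on 10-fold
# cross-validated predictions of the base models.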
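
To make the role of `cv` concrete, here is a rough manual version of the same idea. It is a simplified sketch, not the library's exact internals: it assumes synthetic stand-in data from `make_classification`, uses only two base models to stay short, and the names `meta_train`, `meta_test`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the heart dataset).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=10, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Step 1: out-of-fold class-1 probabilities become the
# meta-model's training features (one column per base model).
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=10, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 2: the meta-model learns how to combine the base models.
meta_model = LogisticRegression().fit(meta_train, y_tr)

# Step 3: base models are refit on all training rows,
# then their test-set probabilities are fed to the meta-model.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
manual_pred = meta_model.predict(meta_test)
print(meta_train.shape, meta_test.shape)
```

The key point is step 1: no row in `meta_train` was predicted by a model that had seen that row, so the meta-model is trained on realistic inputs.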

In [140]:
clf.fit(x_train , y_train)

In [141]:
y_pred = clf.predict(x_test)

In [142]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test , y_pred)

0.819672131147541

**Accuracy without ensembling (plain Logistic Regression)**

In [143]:
import numpy as np
import pandas as pd

In [144]:
G = pd.read_csv("/content/heart.csv")

In [145]:
x = G.drop(columns = ["target"])
y = G["target"]

In [146]:
from sklearn.model_selection import  train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 ,
                                                       random_state = 42)

In [147]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In [148]:
from sklearn.linear_model import LogisticRegression

In [149]:
clf = LogisticRegression()

In [150]:
clf.fit(x_train , y_train)

In [151]:
clf.predict(x_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [161]:
y_pred = clf.predict(x_test)

In [162]:
y_test

Unnamed: 0,target
179,0
228,0
111,1
246,0
60,1
...,...
249,0
104,1
300,0
193,0


In [163]:
x_test

array([[ 2.76218225e-01,  7.22504380e-01, -9.71890936e-01,
         1.16949120e+00,  5.53408401e-01, -3.83300706e-01,
        -1.04610909e+00, -1.70875171e+00,  1.47790748e+00,
        -3.75556294e-01, -6.94988026e-01,  3.21860343e-01,
        -2.19657581e+00],
       [ 4.93953764e-01,  7.22504380e-01,  1.96807914e+00,
         2.36038903e+00,  7.81171723e-01, -3.83300706e-01,
        -1.04610909e+00,  3.98288831e-01, -6.76632341e-01,
        -7.39094787e-01, -6.94988026e-01, -6.89700735e-01,
         1.17848036e+00],
       [ 2.76218225e-01,  7.22504380e-01,  9.88089118e-01,
         1.16949120e+00, -2.29363312e+00,  2.60891771e+00,
         8.43132697e-01,  1.02591793e+00, -6.76632341e-01,
        -7.39094787e-01,  9.53905134e-01,  3.21860343e-01,
         1.17848036e+00],
       [ 1.67350456e-01, -1.38407465e+00, -9.71890936e-01,
         2.16772932e-01,  3.07778522e+00, -3.83300706e-01,
        -1.04610909e+00, -5.18701733e-03,  1.47790748e+00,
         8.05943807e-01, -6.94988026e

In [164]:
from sklearn.metrics import accuracy_score

In [165]:
accuracy_score(y_test , y_pred)

0.8524590163934426
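
Before comparing 0.8197 (stacking) with 0.8525 (Logistic Regression), it helps to see how many test rows the gap corresponds to. The sketch below is a rough check only. It assumes the test split has 61 rows (303 rows with `test_size = 0.2`, matching the 61 predictions printed above) and uses the simple binomial standard error as an approximation of the noise in a single accuracy estimate.

```python
import math

n_test = 61  # assumed test-set size: 303 rows, test_size = 0.2
acc_stacking = 0.819672131147541   # stacking accuracy reported earlier
acc_logreg = 0.8524590163934426    # Logistic Regression accuracy reported above

# Convert accuracies back to counts of correct predictions.
correct_stacking = round(acc_stacking * n_test)
correct_logreg = round(acc_logreg * n_test)

# Approximate (binomial) standard error of one accuracy estimate.
se_logreg = math.sqrt(acc_logreg * (1 - acc_logreg) / n_test)
gap = acc_logreg - acc_stacking

print("correct (stacking):", correct_stacking)
print("correct (logreg)  :", correct_logreg)
print("gap:", round(gap, 4), " standard error:", round(se_logreg, 4))
```

The gap comes down to two test rows and is smaller than the approximate standard error, so this single split is weak evidence that either model is really better.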

**Conclusion: stacking ensemble vs. plain Logistic Regression**

In [None]:
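# “Ensemble methods like stacking don’t guarantee higher accuracy.
# If the base models are weak, overfit, or highly correlated,
# or if the dataset is small, the meta-learner can get misleading inputs,
# which can make the ensemble perform
# worse than a well-tuned single model like Logistic Regression.”

# Two caveats about this particular comparison:
# 1. The Logistic Regression run used StandardScaler, but the stacking run
#    used unscaled features, which matters most for the distance-based KNN.
# 2. The test set has only 61 rows, so 0.8197 vs 0.8525 is a difference
#    of just 2 predictions (50 vs 52 correct).

# Simple version:

# Ensemble = powerful, but only if the base models are strong and diverse.

# A single model sometimes beats an ensemble, especially on small or simple datasets.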
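
One way to make such a comparison more even is to give both models the same preprocessing and score them with cross-validation instead of a single split. The following is a minimal sketch under stated assumptions: it uses synthetic stand-in data from `make_classification` rather than `heart.csv`, adds `random_state = 42` to the gradient boosting model for repeatability, and uses `cv = 5` to keep the run short. It is not a re-run of the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the heart dataset).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Stacking with the same scaling step that Logistic Regression gets.
stack = make_pipeline(
    StandardScaler(),
    StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("gbdt", GradientBoostingClassifier(random_state=42)),
        ],
        final_estimator=LogisticRegression(),
        cv=5,
    ),
)
single = make_pipeline(StandardScaler(), LogisticRegression())

# Five accuracy scores per model; the scaler is refit inside each fold.
stack_scores = cross_val_score(stack, X, y, cv=5, scoring="accuracy")
single_scores = cross_val_score(single, X, y, cv=5, scoring="accuracy")

print("stacking:", stack_scores.mean(), "+/-", stack_scores.std())
print("logreg  :", single_scores.mean(), "+/-", single_scores.std())
```

Putting the scaler inside the pipeline means it is fit only on each training fold, and reporting a mean with a spread across folds gives a steadier picture than one accuracy number from one split.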