In [None]:
Question 1: What is the fundamental idea behind ensemble techniques? How does bagging differ from boosting in terms of approach and objective?

The fundamental idea behind ensemble techniques is to combine multiple models, or base learners, to create a stronger, more accurate final prediction model. Ensemble methods improve performance by aggregating the strengths of several models rather than relying on a single one, thereby reducing errors like bias and variance.

Bagging (Bootstrap Aggregating) and Boosting are two primary ensemble techniques that differ in their approach and objective:

Approach:

Bagging trains multiple models independently and in parallel on different random subsets (bootstrapped samples) of the training data. Each model is built without regard to the others.

Boosting trains models sequentially, where each new model focuses on correcting the errors made by the previous models by giving more weight to misclassified instances.

Objective:

Bagging aims to reduce variance by averaging or voting across multiple independently trained models. It helps make the model more stable and less prone to overfitting.

Boosting aims primarily to reduce bias: it iteratively learns from the mistakes of earlier models and combines weak learners into a strong learner. This often yields higher accuracy (and can lower variance as well), but with a higher risk of overfitting if the number of rounds and the learning rate are not properly tuned.

Additional differences include:

Bagging uses equal weights for models in the final prediction, while boosting assigns weights based on each model’s accuracy.

Bagging is typically used with low-bias, high-variance base learners such as fully grown decision trees; boosting usually uses weak, high-bias learners such as shallow trees and is suited to problems where a single simple model underfits.

Bagging benefits from parallelization and is less sensitive to noise; boosting is sequential and more sensitive to noisy data.

In summary, bagging stabilizes predictions by reducing variance through parallel model training on different data samples, while boosting improves accuracy by sequentially focusing on errors, mainly to reduce bias.
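
The contrast above can be sketched with scikit-learn. This is a minimal illustration, not a benchmark: the synthetic dataset, the 50 estimators, and the choice of fully grown trees for bagging versus decision stumps for boosting are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Bagging: fully grown trees (low bias, high variance), each fit independently
# on its own bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: decision stumps (high bias) fit one after another, each round
# reweighting the samples the previous rounds misclassified.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {scores[name]:.3f}")
```

Which of the two scores higher depends on the data; the point is only that the base learners and the training procedure differ.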

Question 2: Explain how the Random Forest Classifier reduces overfitting compared to a single decision tree. Mention the role of two key hyperparameters in this process.


The Random Forest Classifier reduces overfitting compared to a single decision tree primarily by using an ensemble of multiple decision trees and introducing randomness during their construction. This ensemble approach averages the predictions from many trees, which reduces the variance and makes the model more robust to noise and less likely to overfit to the training data.

Two key hyperparameters play a significant role in this:

Number of Trees (n_estimators): Increasing the number of trees in the forest improves the averaging effect and the stability of the predictions. Each individual tree can still overfit its bootstrap sample, but averaging many decorrelated trees cancels much of that variance. Adding trees does not by itself cause overfitting; the gains level off while computational cost keeps growing.

Maximum Depth of Each Tree (max_depth): Limiting the depth of each decision tree prevents them from growing very complex and specialized to the training data, thus reducing overfitting. Shallower trees are less likely to capture noise as patterns.

In addition, Random Forest randomly selects a subset of features at each split (feature bagging), which decorrelates the trees and further decreases overfitting risk compared to a single tree that exhaustively searches all features at each split.

Together, these mechanisms help Random Forest achieve better generalization performance on unseen data than a single decision tree, which tends to overfit by producing a highly complex model tailored to the training set.
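
A rough way to see this effect is to compare train and test accuracy for one unpruned tree and a forest. The sketch below assumes a noisy synthetic dataset (flip_y=0.1) and arbitrary settings for n_estimators, max_depth, and max_features; none of these values come from the question itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly flips about 10% of the labels (illustrative assumption)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree memorizes the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest averages many trees; max_depth caps each tree's complexity and
# max_features limits the features considered at each split
forest = RandomForestClassifier(n_estimators=200, max_depth=8, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
for name, model in [("single tree", tree), ("random forest", forest)]:
    print(f"{name}: train = {model.score(X_train, y_train):.3f}, "
          f"test = {model.score(X_test, y_test):.3f}")
```

A large gap between train and test accuracy is the usual sign of overfitting, so the gap is the quantity to compare between the two models.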

Question 3: What is Stacking in ensemble learning? How does it differ from traditional bagging/boosting methods? Provide a simple example use case.


Stacking in ensemble learning is a technique where multiple base models (level-0 models) are trained independently on the original dataset, and then a meta-model (level-1 model) is trained on the outputs (predictions) of these base models to make the final prediction. In practice the meta-model is usually trained on out-of-fold (cross-validated) predictions, so that it does not learn from base-model outputs on rows those models were fitted on. The meta-model learns how to best combine the predictions from the diverse base models to improve overall accuracy.

Stacking differs from traditional bagging and boosting in several ways:

Bagging and boosting primarily rely on homogeneous learners (usually the same model type) trained on different subsets or weighted data variations, while stacking uses heterogeneous base models of different types trained on the same dataset.

Bagging trains models in parallel independently, boosting trains models sequentially focusing on correcting errors, but stacking trains base models independently and then trains a meta-model on their combined predictions.

Stacking explicitly learns how to combine models through the meta-model, whereas bagging uses voting/averaging and boosting uses weighted combinations based on sequential correction.

A simple example use case of stacking is combining a decision tree, logistic regression, and a support vector machine as base models to predict customer churn, and then training a logistic regression as a meta-model to optimally combine their predictions for the final output.

In summary, stacking leverages diversity among different model types and uses a meta-model for integration, making it a flexible and powerful ensemble method distinct from bagging and boosting.
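
The churn example can be sketched with scikit-learn's StackingClassifier. No churn dataset is given here, so the code below substitutes synthetic data from make_classification as a stand-in; the model settings are likewise assumptions. The cv=5 argument makes the meta-model train on out-of-fold predictions of the base models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a churn dataset (hypothetical data, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Level-0: heterogeneous base models, all trained on the same data
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]

# Level-1: logistic regression learns how to weight the base models' outputs
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked test accuracy: {stack_accuracy:.3f}")
```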

Question 4: What is the OOB Score in Random Forest, and why is it useful? How does it help in model evaluation without a separate validation set?

The OOB (Out-Of-Bag) score in Random Forest is an internal performance metric calculated using the samples that are not included in the bootstrap sample for training each individual decision tree. Because a bootstrap sample is drawn with replacement, each tree sees approximately 63% of the distinct training rows, and the remaining ~37% not seen by that tree (the out-of-bag samples) can be used to test the tree's predictions. Aggregating these OOB predictions across all trees provides an approximately unbiased estimate of the model's performance on unseen data.

The OOB score is useful because it allows model evaluation without needing a separate validation set. This makes it efficient for performance estimation while using the entire dataset for training. It acts like a built-in cross-validation method, estimating generalization accuracy by testing each sample on trees that were not trained with it.

In summary, the OOB score helps in model evaluation by providing a reliable measure of prediction error or accuracy from the training process itself, eliminating the need to hold out part of the training data as a separate validation set and thereby utilizing all training data for learning.
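
The out-of-bag share follows from sampling with replacement: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below checks that number and then reads the OOB score from a fitted forest; the Wine dataset and the 200 trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Probability that a given row is never drawn in a bootstrap sample of size n
n = 1000
oob_fraction = (1 - 1 / n) ** n
print(f"Expected out-of-bag fraction for n={n}: {oob_fraction:.3f} "
      f"(limit 1/e = {np.exp(-1):.3f})")

# oob_score=True scores every row using only the trees that did not train on it
X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

No train/test split is needed here because every row is evaluated only by trees for which it was out-of-bag.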

Question 5: Compare AdaBoost and Gradient Boosting in terms of:
● How they handle errors from weak learners
● Weight adjustment mechanism
● Typical use cases


AdaBoost and Gradient Boosting are both boosting ensemble methods but differ in how they handle errors, weight adjustment mechanisms, and typical use cases.

Handling Errors from Weak Learners:

AdaBoost adjusts the weights of the training data points, increasing the weight of misclassified samples so that subsequent weak learners focus more on these difficult examples.

Gradient Boosting fits each new weak learner to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals). For squared-error loss these are exactly the residual errors (the difference between the observed and predicted values), so the model corrects its remaining error step by step.

Weight Adjustment Mechanism:

AdaBoost explicitly reweights the data samples at each iteration, emphasizing misclassified data points.

Gradient Boosting optimizes a loss function by performing gradient descent in function space, fitting new learners to the negative gradients of the loss (residuals), without adjusting sample weights directly.

Typical Use Cases:

AdaBoost typically uses very simple weak learners, often decision stumps, and works well for classification when the data is relatively clean, since its reweighting makes it sensitive to outliers and label noise.

Gradient Boosting is more flexible, allowing stronger base learners and is effective on complex datasets; it is widely used in regression and classification tasks with tools like XGBoost and LightGBM enhancing its power and regularization.

In summary, AdaBoost modifies sample weights to focus on hard-to-classify points, while Gradient Boosting successively fits models on residuals to minimize loss. AdaBoost is simpler and faster for simpler problems, whereas Gradient Boosting is more powerful and flexible for complex tasks.
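
The residual-fitting idea can be written out by hand for squared-error loss. This is a simplified sketch, not how any library implements it: the sine-shaped synthetic data, the learning rate of 0.1, the 50 rounds, and the depth-2 trees are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared-error loss the negative gradient is simply the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: start = {mse_history[0]:.4f}, after 50 rounds = {mse_history[-1]:.4f}")
```

AdaBoost would instead keep the targets fixed and change the per-sample weights passed to each new learner.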

Question 6: Why does CatBoost perform well on categorical features without requiring extensive preprocessing? Briefly explain its handling of categorical variables.


CatBoost performs well on categorical features without requiring extensive preprocessing because it has a native mechanism to handle categorical variables efficiently. Instead of manual encoding methods like one-hot or label encoding, CatBoost transforms categorical features into numerical values using statistics derived from the data while carefully avoiding data leakage. It does this with ordered target statistics: for each row, a statistic such as the mean target value of its category is computed only from the rows that precede it in a random permutation of the training data, so a row's own label never enters its encoding. This captures the valuable information contained in the categorical feature without inflating dimensionality or introducing false ordinal relationships.

This method provides a memory-efficient representation and better generalization, especially for high-cardinality features and rare categories, making CatBoost robust and powerful when dealing with datasets with many categorical features.
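
The permutation-based statistic can be illustrated with a small hand-written function. This is a simplified sketch of the idea only and not CatBoost's actual implementation: the function name ordered_target_stat, the prior_weight smoothing parameter, and the toy city/churn data are all hypothetical.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that come before it
    in a random permutation (hypothetical helper, simplified illustration)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of earlier targets; the row's own target is not used
        encoded[i] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now does this row's target become visible to later rows
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded

cities = ["paris", "rome", "paris", "rome", "paris", "oslo"]
churned = [1, 0, 1, 1, 0, 1]
prior = float(np.mean(churned))

encoded = ordered_target_stat(cities, churned, prior)
for city, value in zip(cities, encoded):
    print(f"{city:6s} -> {value:.3f}")
```

A category that appears only once (here "oslo") has no earlier rows to draw on, so it falls back to the prior rather than to its own label.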

Question 7: KNN Classifier Assignment: Wine Dataset Analysis with Optimization
Task:
1. Load the Wine dataset (sklearn.datasets.load_wine()).
2. Split data into 70% train and 30% test.
3. Train a KNN classifier (default K=5) without scaling and evaluate using:
a. Accuracy
b. Precision, Recall, F1-Score (print classification report)
4. Apply StandardScaler, retrain KNN, and compare metrics.
5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric
(Euclidean, Manhattan).

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# 2. Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Train KNN classifier (default K=5) without scaling and evaluate
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("KNN without scaling:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))

# 4. Apply StandardScaler, retrain KNN, and compare metrics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

print("\nKNN with StandardScaler:")
print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print("Classification report:\n", classification_report(y_test, y_pred_scaled))

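# 5. Use GridSearchCV to find best K (1 to 20) and distance metric (Euclidean, Manhattan)
# Scaling happens inside a Pipeline so each CV fold fits the scaler on its own training part only
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {
    'knn__n_neighbors': list(range(1, 21)),
    'knn__metric': ['euclidean', 'manhattan']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("\nBest parameters from GridSearchCV:", grid_search.best_params_)

# best_estimator_ is the full pipeline, so it takes the unscaled test features
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)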

print("Accuracy with best params:", accuracy_score(y_test, y_pred_best))
print("Classification report with best params:\n", classification_report(y_test, y_pred_best))


Question 8 : PCA + KNN with Variance Analysis and Visualization
Task:
1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).
2. Apply PCA and plot the scree plot (explained variance ratio).
3. Retain 95% variance and transform the dataset.
4. Train KNN on the original data and PCA-transformed data, then compare
accuracy.
5. Visualize the first two principal components using a scatter plot (color by class).


In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Apply PCA and plot scree plot (explained variance ratio)
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_scaled)

plt.figure(figsize=(8, 5))
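# Use 1-based component counts on the x-axis so it matches the axis label
component_counts = np.arange(1, len(pca_full.explained_variance_ratio_) + 1)
plt.plot(component_counts, np.cumsum(pca_full.explained_variance_ratio_) * 100, marker='o')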
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('Scree Plot - PCA on Breast Cancer Dataset')
plt.grid(True)
plt.show()

# 3. Retain 95% variance and transform the dataset
pca_95 = PCA(0.95)
X_pca = pca_95.fit_transform(X_scaled)
print(f"Number of components to retain 95% variance: {pca_95.n_components_}")

# 4. Train KNN on original data and PCA-transformed data, then compare accuracy
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# KNN on original data
knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train, y_train)
y_pred_orig = knn_orig.predict(X_test)
acc_orig = accuracy_score(y_test, y_pred_orig)

# KNN on PCA-transformed data
X_pca_train, X_pca_test, _, _ = train_test_split(X_pca, y, test_size=0.3, random_state=42)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_pca_train, y_train)
y_pred_pca = knn_pca.predict(X_pca_test)
acc_pca = accuracy_score(y_test, y_pred_pca)

print(f"Accuracy (Original Data): {acc_orig:.4f}")
print(f"Accuracy (PCA-Reduced Data): {acc_pca:.4f}")

# 5. Visualize first two principal components (color by class)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca_full[:, 0], X_pca_full[:, 1], c=y, cmap='coolwarm', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - First Two Principal Components (Breast Cancer Dataset)')
plt.colorbar(label='Class (0 = Malignant, 1 = Benign)')
plt.show()

Explanation of Each Step

1. Load Dataset: Uses load_breast_cancer() from sklearn.datasets (30 features).
2. Standardization: PCA is sensitive to feature scale, so all features are standardized with StandardScaler.
3. PCA & Scree Plot: We compute the explained variance ratio for each principal component. The plot shows the cumulative explained variance as components are added.
4. Retaining 95% Variance: PCA(0.95) keeps just enough components to explain 95% of the total variance. Usually, around 10–12 components are enough for this dataset.
5. Train KNN Classifier: Compare KNN accuracy on both the original (standardized) data and the PCA-reduced data.
6. Visualization: Scatter plot of the first two principal components, colored by class (malignant or benign).

Output
Number of components to retain 95% variance: 10
Accuracy (Original Data): 0.9591
Accuracy (PCA-Reduced Data): 0.9532
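
The cell above fits StandardScaler and PCA on the full dataset before splitting, so the test rows influence the transform. A minimal sketch of a variant that fits both steps on the training split only is given below. It assumes scikit-learn pipelines and a stratified split, which the original cell does not use, so its numbers may differ from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first, so nothing is fitted on the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each pipeline fits its scaler (and PCA) on the training split only
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                        KNeighborsClassifier(n_neighbors=5))

acc_scaled = knn_scaled.fit(X_train, y_train).score(X_test, y_test)
acc_pca = knn_pca.fit(X_train, y_train).score(X_test, y_test)
kept_components = knn_pca.named_steps["pca"].n_components_

print(f"Components kept for 95% variance: {kept_components}")
print(f"Accuracy (scaled, all features): {acc_scaled:.4f}")
print(f"Accuracy (scaled + PCA): {acc_pca:.4f}")
```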


Question 9: KNN Regressor with Distance Metrics and K-Value Analysis
Task:
1. Generate a synthetic regression dataset
(sklearn.datasets.make_regression(n_samples=500, n_features=10)).
2. Train a KNN regressor with:
a. Euclidean distance (K=5)
b. Manhattan distance (K=5)
c. Compare Mean Squared Error (MSE) for both.
3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze bias-variance tradeoff.


In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# 1. Generate a synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=10, noise=15, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2a. KNN Regressor with Euclidean distance (default metric)
knn_euclidean = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
mse_euclidean = mean_squared_error(y_test, y_pred_euclidean)

# 2b. KNN Regressor with Manhattan distance
knn_manhattan = KNeighborsRegressor(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
mse_manhattan = mean_squared_error(y_test, y_pred_manhattan)

print(f"Mean Squared Error (Euclidean, K=5): {mse_euclidean:.4f}")
print(f"Mean Squared Error (Manhattan, K=5): {mse_manhattan:.4f}")

# 3. K vs. MSE Analysis
k_values = [1, 5, 10, 20, 50]
mse_values = []

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

# Plot K vs MSE
plt.figure(figsize=(8, 5))
plt.plot(k_values, mse_values, marker='o', linestyle='-', color='b')
plt.title('KNN Regressor: K vs. Mean Squared Error')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
plt.show()


Question 10: KNN with KD-Tree/Ball Tree, Imputation, and Real-World Data
Task:
1. Load the Pima Indians Diabetes dataset (contains missing values).
2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.
3. Train KNN using:
a. Brute-force method
b. KD-Tree
c. Ball Tree
4. Compare their training time and accuracy.
5. Plot the decision boundary for the best-performing method (use 2 most important
features).

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance

# 1. Load the Pima Indians Diabetes dataset
# You can download it from UCI or load directly via URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'
]
data = pd.read_csv(url, names=columns)

# Replace 0 with NaN for columns that shouldn't have 0 values
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols_with_missing] = data[cols_with_missing].replace(0, np.nan)

print("Missing values before imputation:")
print(data.isnull().sum())

# 2. KNN Imputation to fill missing values
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data)
data = pd.DataFrame(data_imputed, columns=columns)

print("\nMissing values after imputation:")
print(data.isnull().sum())

# Split data into features and labels
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# 3. Train KNN using different algorithms
algorithms = ['brute', 'kd_tree', 'ball_tree']
results = {}

for algo in algorithms:
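    # perf_counter is a monotonic, high-resolution clock, better suited to very short intervals
    start = time.perf_counter()
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    knn.fit(X_train, y_train)
    end = time.perf_counter()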

    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results[algo] = {'accuracy': acc, 'time': end - start}

# 4. Compare training time and accuracy
print("\n--- Comparison of KNN Algorithms ---")
for algo, vals in results.items():
    print(f"{algo.upper():10s} | Accuracy: {vals['accuracy']:.4f} | Time: {vals['time']:.4f} sec")

# Identify the best-performing algorithm
best_algo = max(results, key=lambda x: results[x]['accuracy'])
print(f"\nBest performing algorithm: {best_algo.upper()}")

# 5. Plot decision boundary for the best-performing method (2 most important features)

# Fit KNN on all features, rank them by permutation importance, then keep the top 2
best_knn = KNeighborsClassifier(n_neighbors=5, algorithm=best_algo)
best_knn.fit(X_train, y_train)
importance = permutation_importance(best_knn, X_test, y_test, n_repeats=10, random_state=42)
top2_idx = np.argsort(importance.importances_mean)[-2:]  # top 2 important features

X_train_2 = X_train[:, top2_idx]
X_test_2 = X_test[:, top2_idx]

# Retrain using only top 2 features
knn_2d = KNeighborsClassifier(n_neighbors=5, algorithm=best_algo)
knn_2d.fit(X_train_2, y_train)

# Create a mesh grid for plotting decision boundary
x_min, x_max = X_train_2[:, 0].min() - 1, X_train_2[:, 0].max() + 1
y_min, y_max = X_train_2[:, 1].min() - 1, X_train_2[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_test_2[:, 0], X_test_2[:, 1], c=y_test, cmap='coolwarm', edgecolor='k', s=40)
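# Label the axes with the actual feature names instead of column indices
plt.xlabel(f'{columns[top2_idx[0]]} (standardized)')
plt.ylabel(f'{columns[top2_idx[1]]} (standardized)')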
plt.title(f'KNN Decision Boundary ({best_algo.upper()} - Top 2 Features)')
plt.show()

OUTPUT
--- Comparison of KNN Algorithms ---
BRUTE      | Accuracy: 0.7792 | Time: 0.0021 sec
KD_TREE    | Accuracy: 0.7740 | Time: 0.0019 sec
BALL_TREE  | Accuracy: 0.7740 | Time: 0.0020 sec

Best performing algorithm: BRUTE
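
In the cell above, KNNImputer is fitted on the whole table, including the Outcome column and the rows that later become the test set. A sketch of a variant that imputes features only and learns the imputation from the training split is shown below. To keep it self-contained it uses synthetic data with about 10% of the values blanked out rather than the Pima file, which is an assumption made for illustration, so its accuracy is not comparable to the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with missing values injected at random (illustrative assumption)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Split before imputing; the target y is never passed to the imputer
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Imputer and scaler are fitted on the training features only
model = make_pipeline(
    KNNImputer(n_neighbors=5),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy with in-pipeline imputation: {accuracy:.4f}")
```

The algorithm argument can be swapped for "brute" or "ball_tree" to repeat the comparison inside the same pipeline.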
