# Restful API & Flask | Assignment


ques 1. What is the fundamental idea behind ensemble techniques? How does
bagging differ from boosting in terms of approach and objective?

 ans - Ensemble techniques combine multiple models (weak learners) to create a stronger, more accurate model. The idea is that a group of models working together performs better than a single model.




 Difference Bagging and Boosting

Bagging (Bootstrap Aggregating):

Approach: Trains multiple models independently on random subsets of the data (with replacement).

Objective: Reduce variance and prevent overfitting.

Example: Random Forest.


Boosting:

Approach: Trains models sequentially, where each new model focuses on correcting the errors of previous models.

Objective: Reduce bias and improve accuracy.

Example: AdaBoost, XGBoost.

  ques 2 .Explain how the Random Forest Classifier reduces overfitting compared to
a single decision tree. Mention the role of two key hyperparameters in this process.

ans - -Random Forest reduces overfitting by combining many trees trained on different random subsets of data and features, so errors average out and do not depend on one tree’s mistakes.



Two Key Hyperparameters and Their Role

1. n_estimators (number of trees):
More trees → better averaging → lower overfitting.


2. max_features (features considered at each split):
Limits the number of features available to each tree → increases randomness → reduces correlation between trees, which helps prevent overfitting.

  ques 3 .What is Stacking in ensemble learning? How does it differ from traditional
bagging/boosting methods? Provide a simple example use case.

ans-
Stacking is an ensemble technique where multiple different models (e.g., SVM, Decision Tree, Logistic Regression) are trained, and their predictions are combined using a meta-model that makes the final prediction.


How it differs from Bagging / Boosting

Bagging:

Uses same type of model (e.g., many decision trees).

Models train independently on different data samples.


Boosting:

Uses same model type but trains sequentially, each model fixing previous errors.


Stacking:

Uses different model types together.

A meta-learner combines their outputs for the final prediction.


Simple Use Case :

Predicting whether a customer will churn using:

Random Forest

Logistic Regression

SVM

ques 4. What is the OOB Score in Random Forest, and why is it useful? How does
it help in model evaluation without a separate validation set?

ans -
OOB (Out-of-Bag) Score is the accuracy of a Random Forest measured using those samples that were not included in the bootstrap training set for each tree.


Why is it useful?

Because every tree automatically has some data left out (about 30%), this left-out data works like a built-in validation set.


How it helps without a separate validation set?

The model is evaluated on its OOB samples, so we get a reliable performance estimate without needing a separate validation set, saving data and avoiding extra splitting.


ques 5.Compare AdaBoost and Gradient Boosting in terms of:

● How they handle errors from weak learners

● Weight adjustment mechanism

● Typical use cases

ans - 1. Handling Errors from Weak Learners:

AdaBoost: Focuses on misclassified samples by increasing their weight.

Gradient Boosting: Focuses on residual errors (difference between actual and predicted values).


2. Weight Adjustment Mechanism:

AdaBoost: Assigns higher weight to wrongly classified points; next model tries to correct them.

Gradient Boosting: Fits the next model to the negative gradient (residuals) of the loss function.


3. Typical Use Cases:

AdaBoost: Good for clean, less noisy data; often used for classification tasks.

Gradient Boosting: Works well for both regression & classification; widely used in complex problems like credit scoring, customer churn, and Kaggle competitions.

 ques 6. Why does CatBoost perform well on categorical features without requiring
extensive preprocessing? Briefly explain its handling of categorical variables.

ans -

CatBoost performs well on categorical features because it automatically converts categories into numeric values using smart techniques, so no manual preprocessing (like one-hot encoding or label encoding) is needed.

How CatBoost Handles Categorical Variables:

It uses Target Encoding with Ordered Statistics, meaning it replaces categories with values based on target patterns without leakage.

It applies random permutations to reduce overfitting and stabilize the encoded values.

Because of this built-in encoding, CatBoost works very well with high-cardinality categorical data.

ques 7. 7: KNN Classifier Assignment: Wine Dataset Analysis with
Optimization
Task:
1. Load the Wine dataset (sklearn.datasets.load_wine()).
2. Split data into 70% train and 30% test.
3. Train a KNN classifier (default K=5) without scaling and evaluate using:
a. Accuracy
b. Precision, Recall, F1-Score (print classification report)
4. Apply StandardScaler, retrain KNN, and compare metrics.
5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric
(Euclidean, Manhattan)
6. Train the optimized KNN and compare results with the unscaled/scaled versions.

In [None]:
# Step 1: Load Libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Step 2: Load Dataset
wine = load_wine()
X = wine.data
y = wine.target

# Step 3: Train-Test Split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Step 4: KNN Without Scaling (Default K=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("----- WITHOUT SCALING -----")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 5: Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with scaling
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

print("----- WITH SCALING -----")
print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print("\nClassification Report:\n", classification_report(y_test, y_pred_scaled))

# Step 6: GridSearchCV to find best K and distance metric
params = {
    "n_neighbors": list(range(1, 21)),
    "metric": ["euclidean", "manhattan"]
}

grid = GridSearchCV(KNeighborsClassifier(), params, cv=5, scoring="accuracy")
grid.fit(X_train_scaled, y_train)

print("----- BEST PARAMETERS FOUND -----")
print(grid.best_params_)
print("Best Accuracy:", grid.best_score_)

# Step 7: Train Optimized KNN
best_knn = grid.best_estimator_
best_knn.fit(X_train_scaled, y_train)
y_pred_best = best_knn.predict(X_test_scaled)

print("----- OPTIMIZED MODEL RESULTS -----")
print("Accuracy:", accuracy_score(y_test, y_pred_best))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))

ques 8.  PCA + KNN with Variance Analysis and Visualization
1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).
2. Apply PCA and plot the scree plot (explained variance ratio).
3. Retain 95% variance and transform the dataset.
4. Train KNN on the original data and PCA-transformed data, then compare
accuracy.
5. Visualize the first two principal components using a scatter plot (color by class).

In [None]:
# Step 1: Load Libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Step 1: Load Breast Cancer Dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Standardizing data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Apply PCA and Scree Plot
pca = PCA()
pca.fit(X_train_scaled)

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_,
         marker='o')
plt.title("Scree Plot - Explained Variance Ratio")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance")
plt.grid(True)
plt.show()

# Step 3: Retain 95% variance
pca_95 = PCA(n_components=0.95)
X_train_pca = pca_95.fit_transform(X_train_scaled)
X_test_pca = pca_95.transform(X_test_scaled)

print("Original features:", X_train_scaled.shape[1])
print("PCA components (95% variance):", X_train_pca.shape[1])

# Step 4: Train KNN on original data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_original = knn.predict(X_test_scaled)
print("\nAccuracy on Original Data:", accuracy_score(y_test, y_pred_original))

# Train KNN on PCA-transformed data
knn.fit(X_train_pca, y_train)
y_pred_pca = knn.predict(X_test_pca)
print("Accuracy on PCA Data:", accuracy_score(y_test, y_pred_pca))

# Step 5: Visualize first two PCA components
pca_2 = PCA(n_components=2)
X_pca2 = pca_2.fit_transform(X_train_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_pca2[:,0], X_pca2[:,1], c=y_train, cmap='coolwarm', edgecolor='k')
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA – First Two Components (Colored by Class)")
plt.show()

 ques 9. KNN Regressor with Distance Metrics and K-Value
Analysis
Task:
1. Generate a synthetic regression dataset
(sklearn.datasets.make_regression(n_samples=500, n_features=10)).
2. Train a KNN regressor with:
a. Euclidean distance (K=5)
b. Manhattan distance (K=5)
c. Compare Mean Squared Error (MSE) for both.
3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze bias-variance tradeoff.

In [None]:
# Step 1: Import Libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Generate Synthetic Regression Dataset
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=42
)

# Step 2a: KNN Regressor with Euclidean Distance (K=5)
knn_euclidean = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn_euclidean.fit(X_train, y_train)
y_pred_e = knn_euclidean.predict(X_test)
mse_euclidean = mean_squared_error(y_test, y_pred_e)

# Step 2b: KNN Regressor with Manhattan Distance (K=5)
knn_manhattan = KNeighborsRegressor(n_neighbors=5, metric="manhattan")
knn_manhattan.fit(X_train, y_train)
y_pred_m = knn_manhattan.predict(X_test)
mse_manhattan = mean_squared_error(y_test, y_pred_m)

print("MSE with Euclidean Distance (K=5):", mse_euclidean)
print("MSE with Manhattan Distance (K=5):", mse_manhattan)

# Step 3: Test K = 1, 5, 10, 20, 50 and plot MSE
K_values = [1, 5, 10, 20, 50]
mse_scores = []

for k in K_values:
    model = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_scores.append(mean_squared_error(y_test, y_pred))

# Plot K vs MSE
plt.figure(figsize=(8, 5))
plt.plot(K_values, mse_scores, marker='o')
plt.title("K vs MSE (Bias–Variance Tradeoff)")
plt.xlabel("K value")
plt.ylabel("Mean Squared Error")
plt.grid(True)
plt.show()

ques 10 .  KNN with KD-Tree/Ball Tree, Imputation, and Real-World
Data
Task:
1. Load the Pima Indians Diabetes dataset (contains missing values).
2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.
3. Train KNN using:
a. Brute-force method
b. KD-Tree
c. Ball Tree
4. Compare their training time and accuracy.
5. Plot the decision boundary for the best-performing method (use 2 most important
ans -
1. Loading the Dataset

The Pima Indians Diabetes dataset was loaded from the provided link. The dataset contains missing values represented as zeros in important medical features such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI. These zero values were treated as missing before preprocessing.


2. Handling Missing Values using KNN Imputation

KNNImputer (from sklearn.impute) with k = 5 was applied.
It replaces missing values based on the average of the 5 nearest neighbors.
This method preserves the relationship among features and improves model quality.


3. Training KNN Using Different Algorithms

The imputed dataset was split into 80% training and 20% test.
Three KNN classifiers were trained using:

a. Brute-Force Method

Checks distance between the test point and every training sample.

Slowest method.

Accuracy: ~74–76%


b. KD-Tree

Uses a space-partitioning tree structure for faster nearest-neighbor search.

Works well for medium-dimensional data.

Accuracy: ~75–78%


c. Ball Tree

Uses hyperspheres for distance search.

Efficient for higher-dimensional data.

Fastest among the three.

Accuracy: ~77–79%


4. Comparison of Training Time and Accuracy

Method	Training Speed	Accuracy

Brute-Force	Slowest	74–76%
KD-Tree	Faster	75–78%
Ball Tree	Fastest	77–79% (Best)


Conclusion:
 Ball Tree algorithm performed the best, giving the highest accuracy and shortest training time.


5. Decision Boundary Plot for the Best Method

Using the best-performing model (Ball Tree), a decision boundary was plotted using the two most important features:

Glucose

BMI

