Here's a point-wise explanation for each of your 20 questions on **K-Nearest Neighbors (KNN)** and **Principal Component Analysis (PCA)** — ideal for concept clarity and exam prep:

---

### 🔵 **K-Nearest Neighbors (KNN)**

**1. What is K-Nearest Neighbors (KNN) and how does it work?**

* KNN is a **non-parametric, lazy learning algorithm** used for classification and regression.
* It **predicts the output** for a sample based on the **'K' closest training samples** (neighbors).
* Closeness is measured using a **distance metric** (like Euclidean distance).
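
The idea fits in a few lines of NumPy. This is a minimal sketch, not how scikit-learn implements it; the helper `knn_predict` and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

There is no training step beyond storing the data, which is why KNN is called a lazy learner.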

---

**2. Difference between KNN Classification and KNN Regression:**

| KNN Classification              | KNN Regression                    |
| ------------------------------- | --------------------------------- |
| Predicts a **class label**      | Predicts a **continuous value**   |
| Majority vote among neighbors   | Average (mean) of neighbor values |
| Suitable for categorical output | Suitable for numerical output     |
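
To see the two aggregation rules side by side, here is a small sketch on a made-up 1-D dataset. The same three neighbors are found in both cases; only the way their targets are combined differs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical 1-D data: one feature, a class label, and a numeric target
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 1, 1, 1, 1])
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([[2.1]])  # nearest neighbors are x = 2.0, 3.0, 1.0

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, values)

class_pred = clf.predict(query)[0]   # majority vote over labels [0, 1, 0]
value_pred = reg.predict(query)[0]   # mean of values [2.0, 3.0, 1.0]
print(class_pred, value_pred)
```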

---

**3. Role of the distance metric in KNN:**

* It determines **how closeness is measured** between data points.
* Common metrics:

  * **Euclidean distance** (the scikit-learn default, expressed as Minkowski with `p=2`)
  * **Manhattan distance**
  * **Minkowski distance**
* The choice affects accuracy, especially with different feature scales.
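
The three metrics are related: Minkowski distance with `p=1` is Manhattan and with `p=2` is Euclidean. A minimal sketch (the helper name `minkowski` and the two points are chosen for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)    # |3| + |4|
euclidean = minkowski(a, b, p=2)    # sqrt(3^2 + 4^2)
minkowski_3 = minkowski(a, b, p=3)  # (3^3 + 4^3)^(1/3)

print(f"Manhattan: {manhattan:.4f}")
print(f"Euclidean: {euclidean:.4f}")
print(f"Minkowski (p=3): {minkowski_3:.4f}")
```

For the same pair of points the distance shrinks as `p` grows, so different metrics can rank neighbors differently.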

---

**4. What is the Curse of Dimensionality in KNN?**

* In high dimensions, distances **concentrate**: the nearest and farthest neighbors of a point become almost **equally distant**.
* This reduces KNN's effectiveness as it **relies on distance** for prediction.
* Dimensionality reduction (like PCA) can help mitigate this.
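
A quick simulation makes this visible. The sketch below assumes uniformly random points in a unit hypercube and measures how much farther the farthest point is than the nearest one; `relative_contrast` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def relative_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, so "nearest" carries less and less information.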

---

**5. How can we choose the best value of K in KNN?**

* Use **cross-validation** to test different K values.
* **Odd values** are often preferred for binary classification because they avoid tied votes; with more than two classes, ties can still occur.
* Plot **error rate vs. K** to find the **elbow point**.
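
A minimal sketch of the cross-validation approach on the Iris dataset. The range of K values and the 5-fold setup are assumptions for illustration; scaling is placed inside the pipeline so each fold is scaled using only its own training part.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 22, 2):  # odd K values from 1 to 21
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 stratified folds
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(f"k={k:>2}: mean CV accuracy = {score:.4f}")
print(f"Best k: {best_k}")
```

Plotting `1 - score` against `k` gives the error-rate curve mentioned above.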

---

**6. What are KD Tree and Ball Tree in KNN?**

* They are **data structures** that help **optimize neighbor search**:

  * **KD Tree**: Binary tree partitioning the data space.
  * **Ball Tree**: Uses hyperspheres to group points.
* Speeds up each query from O(n) (brute force) to roughly O(log n) on low-dimensional data; the advantage shrinks as dimensionality grows.
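
Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below, on made-up random 3-D points, checks that they return the same neighbors as a brute-force search.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
points = rng.random((1000, 3))   # hypothetical low-dimensional data
query = points[:1]               # use the first point as the query

kd = KDTree(points, leaf_size=30)
ball = BallTree(points, leaf_size=30)

# Each query returns (distances, indices), sorted by distance
kd_dist, kd_idx = kd.query(query, k=5)
ball_dist, ball_idx = ball.query(query, k=5)

# Brute-force reference: compute all distances and sort
brute_idx = np.argsort(np.linalg.norm(points - query[0], axis=1))[:5]

print("KD Tree:   ", kd_idx[0])
print("Ball Tree: ", ball_idx[0])
print("Brute force:", brute_idx)
```

The trees change how fast the neighbors are found, not which neighbors are found.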

---

**7. When should you use KD Tree vs. Ball Tree?**

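* **KD Tree**: Efficient for **low-dimensional** data (roughly <20 features).
* **Ball Tree**: Often better for **higher-dimensional** data, where KD Tree partitions become less effective.
* Scikit-learn chooses automatically when `algorithm='auto'` (the default), and may fall back to brute force.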

---

**8. Disadvantages of KNN:**

* **Slow prediction** (lazy learner).
* **High memory usage** (stores all data).
* **Sensitive to noise** and irrelevant features.
* **Performance drops** with high dimensions.

---

**9. How does feature scaling affect KNN?**

* Distance-based algorithms are **very sensitive to feature scales**.
* Always apply **standardization or normalization** before KNN.
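
A small sketch of why this matters, using made-up age and income values. Without scaling, income dominates the distance because its numbers are much larger; after standardization both features contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, income in dollars]
X = np.array([[30, 50_000],   # row 0: the reference sample
              [31, 52_000],   # row 1: similar age and similar income
              [60, 50_100],   # row 2: very different age, almost same income
              [25, 90_000],
              [45, 20_000]], dtype=float)

def nearest_to_first(data):
    """Return the row index (1..n-1) closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(X)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X))

print(f"Nearest to row 0 (unscaled): row {raw_nearest}")
print(f"Nearest to row 0 (scaled):   row {scaled_nearest}")
```

On the raw data the 30-year age gap of row 2 is ignored because its income differs by only 100; after scaling, row 1 becomes the nearest neighbor.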

---


### 🟠 **Principal Component Analysis (PCA)**

**10. What is PCA (Principal Component Analysis)?**

* PCA is a **dimensionality reduction** technique.
* It transforms data into a new coordinate system using **orthogonal components** (principal components).

---

**11. How does PCA work?**

1. **Standardize** the data
2. Compute the **covariance matrix**
3. Find **eigenvectors and eigenvalues**
4. Choose top-k eigenvectors
5. Transform the data onto the new axes
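
The five steps can be sketched directly in NumPy. This is an illustrative version on made-up correlated data, not scikit-learn's implementation (which uses SVD); the variable names are chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): features 0 and 1 are strongly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # eigh returns ascending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the top-k eigenvectors
k = 2
components = eigenvectors[:, :k]

# 5. Project the data onto the new axes
X_projected = X_std @ components

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 4))
print("Projected shape:", X_projected.shape)
```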

---

**12. Geometric intuition behind PCA:**

* PCA finds new **axes (principal components)** that **maximize variance**.
* The first component captures the **most variance**, the second captures the next most, **orthogonal to the first**, and so on.

---

**13. Difference between Feature Selection and Feature Extraction:**

| Feature Selection            | Feature Extraction              |
| ---------------------------- | ------------------------------- |
| Keeps original features      | Creates new features            |
| Removes unimportant features | Transforms features (e.g., PCA) |
| Example: Chi-square, ANOVA   | Example: PCA, LDA               |

---

**14. What are Eigenvalues and Eigenvectors in PCA?**

* **Eigenvectors**: Directions (axes) of new feature space.
* **Eigenvalues**: Magnitude of variance along each eigenvector.
* Larger eigenvalue = more important component.
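
A tiny worked example, assuming a hand-picked 2×2 covariance matrix, shows both properties: each eigenvector satisfies `cov @ v = eigenvalue * v`, and the eigenvalues give each direction's share of the variance.

```python
import numpy as np

# Hypothetical covariance matrix of two positively correlated features
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh returns eigenvalues in ascending order; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for value, vector in zip(eigenvalues, eigenvectors.T):
    # Defining property: multiplying by cov only rescales the eigenvector
    assert np.allclose(cov @ vector, value * vector)
    print(f"eigenvalue {value:.1f}, direction {np.round(vector, 4)}")

variance_share = eigenvalues / eigenvalues.sum()
print("Share of variance:", variance_share)
```

Here the direction along which both features increase together carries three quarters of the total variance, so it would be the first principal component.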

---

**15. How do you decide the number of components to keep in PCA?**

* Use **explained variance ratio**.
* Keep enough components to explain about **90–95% of the variance** (a common rule of thumb).
* Use a **scree plot** to visualize the drop-off (elbow method).
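
A minimal sketch of the explained-variance rule on the Wine dataset, with a 95% threshold assumed for illustration. It counts components from the cumulative ratio and compares that with scikit-learn's shortcut of passing a fraction as `n_components`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained variance ratio
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Shortcut: a float in (0, 1) asks PCA to pick that number itself
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print("Cumulative explained variance:", np.round(cumulative, 4))
print("Components needed for 95%:", n_for_95)
print("Chosen by PCA(n_components=0.95):", pca_95.n_components_)
```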

---

**16. Can PCA be used for classification?**

* **Indirectly**. PCA reduces dimensionality, which can:

  * Improve model performance
  * Reduce overfitting
* But PCA itself is **unsupervised**, not a classifier.

---

**17. What are the limitations of PCA?**

* Assumes **linear relationships**.
* Sensitive to **outliers**.
* Reduced interpretability — components are **linear combinations** of features.
* May discard important but **low-variance** features.

---

**18. How do KNN and PCA complement each other?**

* PCA reduces dimensions, which **can improve KNN performance** on high-dimensional data (though discarding components may also cost some accuracy).
* Can reduce **overfitting**, **noise**, and **computational cost** in KNN.
* Useful before applying KNN to high-dimensional data.
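
One common way to combine them is a scikit-learn `Pipeline`, so that scaling and PCA are fitted only on the training part of each fold. The sketch below assumes the Wine dataset and 5 components, both chosen for illustration; whether PCA helps depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Baseline: scaling followed by KNN on all 13 features
knn_only = make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=5))

# Scaling, then PCA down to 5 components, then KNN
pca_knn = make_pipeline(StandardScaler(),
                        PCA(n_components=5),
                        KNeighborsClassifier(n_neighbors=5))

score_knn = cross_val_score(knn_only, X, y, cv=5).mean()
score_pca_knn = cross_val_score(pca_knn, X, y, cv=5).mean()

print(f"KNN (13 features):       {score_knn:.4f}")
print(f"PCA (5 comps) then KNN:  {score_pca_knn:.4f}")
```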

---

**19. How does KNN handle missing values in a dataset?**

* **KNN doesn’t handle missing values natively.**
* Preprocessing required:

  * Use **KNN imputation**
  * Or remove/replace missing values beforehand
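
A tiny sketch of KNN imputation with scikit-learn's `KNNImputer`, on a made-up 4×2 matrix with one missing entry. The missing value is replaced by the mean of that feature over the nearest rows, where nearness is measured on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature of the last row is missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [2.1, np.nan]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Nearest rows to 2.1 on the first feature are 2.0 and 3.0,
# so the missing value becomes the mean of 4.0 and 6.0
print(X_filled)
```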

---

**20. Key differences between PCA and LDA:**

| PCA                           | LDA                             |
| ----------------------------- | ------------------------------- |
| Unsupervised                  | Supervised                      |
| Maximizes **variance**        | Maximizes **class separation**  |
| Doesn't consider labels       | Uses class labels               |
| Can be used for visualization | Better for classification tasks |
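
The supervised/unsupervised difference shows up directly in the API. In this sketch on the Iris dataset, PCA is fitted on the features alone while LDA also needs the labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA: unsupervised, fitted on the features only
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# LDA: supervised, fitted on features and class labels.
# It can produce at most (number of classes - 1) components, here 2.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```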

---



In [1]:
# 21. Train a KNN Classifier on the Iris dataset and print model accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (important for KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = knn_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("KNN Classifier Accuracy: {:.4f}".format(accuracy))

KNN Classifier Accuracy: 1.0000


In [3]:
# 22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE).

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=4, noise=0.1, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predict and calculate MSE
y_pred = knn_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print MSE
print("KNN Regressor Mean Squared Error: {:.4f}".format(mse))

KNN Regressor Mean Squared Error: 579.5921


In [5]:
# 23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test different distance metrics
metrics = ['euclidean', 'manhattan']

for metric in metrics:
    knn_clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Classifier ({metric} distance) Accuracy: {accuracy:.4f}")

KNN Classifier (euclidean distance) Accuracy: 1.0000
KNN Classifier (manhattan distance) Accuracy: 1.0000


In [6]:
# 24. Train a KNN Classifier with different values of K and visualize decision boundaries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset (use only first two features for 2D visualization)
data = load_iris()
X, y = data.data[:, :2], data.target

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Define k values
k_values = [1, 5, 10]

# Plot decision boundaries
plt.figure(figsize=(15, 5))
for i, k in enumerate(k_values, 1):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X, y)

    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

    # Predict on mesh grid
    Z = knn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot
    plt.subplot(1, 3, i)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap='viridis')
    plt.title(f'KNN Decision Boundary (k={k})')
    plt.xlabel('Feature 1 (Scaled)')
    plt.ylabel('Feature 2 (Scaled)')

plt.tight_layout()
plt.savefig('knn_decision_boundaries.png')
plt.close()

In [7]:
# 25. Apply Feature Scaling before training a KNN model and compare results with unscaled data.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with scaling
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print accuracies
print(f"KNN Classifier Accuracy (Unscaled): {accuracy_unscaled:.4f}")
print(f"KNN Classifier Accuracy (Scaled): {accuracy_scaled:.4f}")

KNN Classifier Accuracy (Unscaled): 1.0000
KNN Classifier Accuracy (Scaled): 1.0000


In [8]:
# 26. Train a PCA model on synthetic data and print the explained variance ratio for each component.

from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio for Each Component:")
for i, var in enumerate(pca.explained_variance_ratio_, 1):
    print(f"Component {i}: {var:.4f}")

Explained Variance Ratio for Each Component:
Component 1: 0.2992
Component 2: 0.1560
Component 3: 0.1112
Component 4: 0.1029
Component 5: 0.0987
Component 6: 0.0959
Component 7: 0.0884
Component 8: 0.0475
Component 9: 0.0000
Component 10: 0.0000


In [10]:
# 27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN without PCA
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
accuracy_no_pca = accuracy_score(y_test, y_pred)

# Apply PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train KNN with PCA
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print accuracies
print(f"KNN Classifier Accuracy (No PCA): {accuracy_no_pca:.4f}")
print(f"KNN Classifier Accuracy (PCA, 2 components): {accuracy_pca:.4f}")

KNN Classifier Accuracy (No PCA): 1.0000
KNN Classifier Accuracy (PCA, 2 components): 0.9333


In [12]:
# 28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize KNN Classifier
knn = KNeighborsClassifier()

# Define parameter grid
param_grid = {'n_neighbors': [3, 5, 7, 9], 'metric': ['euclidean', 'manhattan']}

# Perform GridSearchCV
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best model and predict
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"KNN Classifier Accuracy (Tuned): {accuracy:.4f}")

Best Parameters: {'metric': 'manhattan', 'n_neighbors': 9}
KNN Classifier Accuracy (Tuned): 1.0000


In [14]:
# 29. Train a KNN Classifier and check the number of misclassified samples.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and count misclassified samples
y_pred = knn.predict(X_test)
misclassified = np.sum(y_pred != y_test)

# Print results
print(f"Number of Misclassified Samples: {misclassified}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Number of Misclassified Samples: 0
Accuracy: 1.0000


In [18]:
# 30. Train a PCA model and visualize the cumulative explained variance.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA()
pca.fit(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', color='#1f77b4')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.savefig('pca_cumulative_variance.png')
plt.close()  # Close the figure to free up memory

In [20]:
# 31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test different weights
weights_list = ['uniform', 'distance']

for weight in weights_list:
    knn_clf = KNeighborsClassifier(n_neighbors=5, weights=weight)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Classifier (weights={weight}) Accuracy: {accuracy:.4f}")

KNN Classifier (weights=uniform) Accuracy: 1.0000
KNN Classifier (weights=distance) Accuracy: 1.0000


In [22]:
# 32. Train a KNN Regressor and analyze the effect of different K values on performance.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=4, noise=0.1, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test different k values
k_values = [1, 5, 10, 20]

for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train, y_train)
    y_pred = knn_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"KNN Regressor (k={k}) Mean Squared Error: {mse:.4f}")

KNN Regressor (k=1) Mean Squared Error: 1209.4019
KNN Regressor (k=5) Mean Squared Error: 579.5921
KNN Regressor (k=10) Mean Squared Error: 616.3125
KNN Regressor (k=20) Mean Squared Error: 807.2294


In [24]:
# 33. Implement KNN Imputation for handling missing values in a dataset.

from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)

# Introduce missing values (10% of data)
rng = np.random.RandomState(42)
missing_mask = rng.random(X.shape) < 0.1
X[missing_mask] = np.nan

# Apply KNN Imputation
imputer = KNNImputer(n_neighbors=5, weights='uniform')
X_imputed = imputer.fit_transform(X)

# Check number of missing values before and after
missing_before = np.isnan(X).sum()
missing_after = np.isnan(X_imputed).sum()

# Print results
print(f"Missing Values Before Imputation: {missing_before}")
print(f"Missing Values After Imputation: {missing_after}")

Missing Values Before Imputation: 426
Missing Values After Imputation: 0


In [25]:
# 34. Train a PCA model and visualize the data projection onto the first two principal components.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
# (n_informative=5 so that n_classes * n_clusters_per_class <= 2**n_informative;
#  the default n_informative=2 raises a ValueError with 3 classes)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize projection
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection onto First Two Components')
plt.colorbar(scatter, label='Class')
plt.savefig('pca_projection.png')
plt.close()

In [27]:
# 35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import time

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test different algorithms
algorithms = ['kd_tree', 'ball_tree']

for algo in algorithms:
    start_time = time.perf_counter()  # perf_counter is better suited to timing short intervals
    knn_clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    knn_clf.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time
    y_pred = knn_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Classifier ({algo}) Accuracy: {accuracy:.4f}, Training Time: {training_time:.4f}s")

KNN Classifier (kd_tree) Accuracy: 1.0000, Training Time: 0.0009s
KNN Classifier (ball_tree) Accuracy: 1.0000, Training Time: 0.0013s


In [30]:
# 36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate high-dimensional synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA()
pca.fit(X_scaled)

# Plot Scree plot
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', color='#1f77b4')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA Scree Plot')
plt.grid(True)
plt.savefig('pca_scree_plot.png')
plt.close()

In [32]:
# 37. Train a KNN Classifier and evaluate using Precision, Recall, and F1-score.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Predict and calculate metrics
y_pred = knn_clf.predict(X_test)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Print results
print("KNN Classifier Performance:")
print(f"Precision (macro): {precision:.4f}")
print(f"Recall (macro): {recall:.4f}")
print(f"F1-Score (macro): {f1:.4f}")


KNN Classifier Performance:
Precision (macro): 1.0000
Recall (macro): 1.0000
F1-Score (macro): 1.0000


In [34]:
# 38. Train a PCA model and analyze the effect of different numbers of components on accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different numbers of PCA components
n_components_list = [1, 2, 3, 4]

for n in n_components_list:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    knn_clf = KNeighborsClassifier(n_neighbors=5)
    knn_clf.fit(X_train_pca, y_train)
    y_pred = knn_clf.predict(X_test_pca)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Classifier (PCA, {n} components) Accuracy: {accuracy:.4f}")

KNN Classifier (PCA, 1 components) Accuracy: 0.9000
KNN Classifier (PCA, 2 components) Accuracy: 0.9333
KNN Classifier (PCA, 3 components) Accuracy: 1.0000
KNN Classifier (PCA, 4 components) Accuracy: 1.0000


In [36]:
# 39. Train a KNN Classifier with different leaf_size values and compare accuracy.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

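# Test different leaf_size values
# (leaf_size affects tree build/query speed and memory, not which neighbors are found,
#  so accuracy is expected to stay the same)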
leaf_sizes = [10, 30, 50]

for leaf_size in leaf_sizes:
    knn_clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=leaf_size)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"KNN Classifier (leaf_size={leaf_size}) Accuracy: {accuracy:.4f}")

KNN Classifier (leaf_size=10) Accuracy: 1.0000
KNN Classifier (leaf_size=30) Accuracy: 1.0000
KNN Classifier (leaf_size=50) Accuracy: 1.0000


In [38]:
# 40. Train a PCA model and visualize how data points are transformed before and after PCA.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset (use first two features for visualization)
data = load_iris()
X, y = data.data[:, :2], data.target

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize before and after PCA
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Feature 1 (Scaled)')
plt.ylabel('Feature 2 (Scaled)')
plt.title('Before PCA')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('After PCA')

plt.tight_layout()
plt.savefig('pca_transformation.png')
plt.close()

In [40]:
# 41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Predict and print classification report
y_pred = knn_clf.predict(X_test)
print("KNN Classifier Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

KNN Classifier Classification Report:
              precision    recall  f1-score   support

     class_0       0.93      1.00      0.97        14
     class_1       1.00      0.86      0.92        14
     class_2       0.89      1.00      0.94         8

    accuracy                           0.94        36
   macro avg       0.94      0.95      0.94        36
weighted avg       0.95      0.94      0.94        36



In [42]:
# 42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=4, noise=0.1, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test different distance metrics
metrics = ['euclidean', 'manhattan']

for metric in metrics:
    knn_reg = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn_reg.fit(X_train, y_train)
    y_pred = knn_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"KNN Regressor ({metric} distance) Mean Squared Error: {mse:.4f}")

KNN Regressor (euclidean distance) Mean Squared Error: 579.5921
KNN Regressor (manhattan distance) Mean Squared Error: 627.7911


In [44]:
# 43. Train a KNN Classifier and evaluate using ROC-AUC score.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Predict probabilities and calculate ROC-AUC
y_pred_proba = knn_clf.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_pred_proba)

# Print ROC-AUC
print(f"KNN Classifier ROC-AUC Score: {auc_score:.4f}")

KNN Classifier ROC-AUC Score: 0.9820
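
To make the ROC-AUC number easier to interpret, here is a small sketch that computes AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The helper `pairwise_auc` and the toy labels and scores are assumptions for illustration. For binary labels it should agree with `roc_auc_score`, but it is not how scikit-learn implements the metric.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs. every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = pairwise_auc(y_demo, scores_demo)    # 3 of the 4 pairs are ordered correctly
print(auc_demo)
```

Ties matter for KNN in particular: with `n_neighbors=5` and uniform weights, `predict_proba` can only return multiples of 0.2, so many test samples share the same score.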


In [47]:
# 44. Train a PCA model and visualize the variance captured by each principal component.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate high-dimensional synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA
pca = PCA()
pca.fit(X_scaled)

# Plot variance per component
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', color='#1f77b4')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Variance Captured by Each Principal Component')
plt.grid(True)
plt.savefig('pca_variance_per_component.png')
plt.close()
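
The values plotted above come from `explained_variance_ratio_`. As a sketch of where they originate, the snippet below recomputes them as normalized eigenvalues of the covariance matrix and compares the result with scikit-learn. The toy data and the `_demo` names are assumptions introduced for this illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 5 features, with some induced correlation
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] += 0.8 * X_demo[:, 0]

# Eigenvalues of the covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
manual_ratio = eigvals / eigvals.sum()

# Compare with the ratios reported by scikit-learn's PCA
pca_demo = PCA().fit(X_demo)
print(np.allclose(manual_ratio, pca_demo.explained_variance_ratio_))
```

Each ratio is one eigenvalue divided by the total variance, so the ratios sum to 1 when all components are kept.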

In [49]:
# 45. Train a KNN Classifier and perform feature selection before training.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN without feature selection
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
y_pred = knn_clf.predict(X_test_scaled)
accuracy_no_selection = accuracy_score(y_test, y_pred)

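# Apply feature selection
# Note: StandardScaler gives every training feature unit variance, so a
# variance threshold of 0.5 keeps all 13 features and the two accuracies match.
selector = VarianceThreshold(threshold=0.5)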
X_train_selected = selector.fit_transform(X_train_scaled)
X_test_selected = selector.transform(X_test_scaled)

# Train KNN with feature selection
knn_clf_selected = KNeighborsClassifier(n_neighbors=5)
knn_clf_selected.fit(X_train_selected, y_train)
y_pred_selected = knn_clf_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)

# Print results
print(f"KNN Classifier Accuracy (No Feature Selection): {accuracy_no_selection:.4f}")
print(f"KNN Classifier Accuracy (With Feature Selection): {accuracy_selected:.4f}")

KNN Classifier Accuracy (No Feature Selection): 0.9444
KNN Classifier Accuracy (With Feature Selection): 0.9444
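
The identical accuracies above are expected: once the features are standardized, each has variance 1 on the training set, so a variance threshold of 0.5 removes nothing. The sketch below checks this and shows one possible supervised alternative, `SelectKBest` with the ANOVA F-score. The value `k=8` is an arbitrary illustrative choice, not a tuned value, and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.2, random_state=42)
X_tr_scaled = StandardScaler().fit_transform(X_tr)

# After standardization every training feature has variance 1 ...
unit_variance = np.allclose(X_tr_scaled.var(axis=0), 1.0)

# ... so a 0.5 variance threshold keeps every feature
n_kept_variance = int(VarianceThreshold(threshold=0.5).fit(X_tr_scaled).get_support().sum())

# Supervised alternative: keep the k features with the highest ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=8).fit(X_tr_scaled, y_tr)
n_kept_kbest = int(kbest.get_support().sum())

print(unit_variance, n_kept_variance, n_kept_kbest)
```

If a variance filter is still wanted, one option is to fit `VarianceThreshold` on the unscaled training data and scale afterwards, keeping in mind that raw variances depend on each feature's units.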


In [52]:
# 46. Train a PCA model and visualize the data reconstruction error after reducing dimensions.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Calculate reconstruction error for different components
n_components_list = range(1, 11)
errors = []

for n in n_components_list:
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X_scaled)
    X_reconstructed = pca.inverse_transform(X_pca)
    error = np.mean((X_scaled - X_reconstructed) ** 2)
    errors.append(error)

# Plot reconstruction error
plt.figure(figsize=(8, 6))
plt.plot(n_components_list, errors, marker='o', color='#1f77b4')
plt.xlabel('Number of Components')
plt.ylabel('Reconstruction Error (MSE)')
plt.title('PCA Reconstruction Error')
plt.grid(True)
plt.savefig('pca_reconstruction_error.png')
plt.close()
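
The reconstruction error plotted above has a closed form: for centered data, the mean squared error after keeping `k` components equals the sum of the squared singular values of the discarded directions, divided by the number of matrix entries. The NumPy sketch below checks this identity on hypothetical random data. All names in it are illustrative and it uses a plain SVD rather than scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))          # hypothetical toy data
X_centered = X_demo - X_demo.mean(axis=0)   # PCA operates on centered data

# SVD of the centered data: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n, d = X_centered.shape

def reconstruction_mse(k):
    """MSE after projecting onto the top-k principal directions and back."""
    X_hat = (X_centered @ Vt[:k].T) @ Vt[:k]
    return float(np.mean((X_centered - X_hat) ** 2))

errors_demo = [reconstruction_mse(k) for k in range(1, d + 1)]
# Energy left in the discarded directions, averaged over all n * d entries
predicted_demo = [float(np.sum(S[k:] ** 2) / (n * d)) for k in range(1, d + 1)]

print(np.allclose(errors_demo, predicted_demo))
```

This is why the curve decreases monotonically and reaches (numerically) zero once all components are kept.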

In [56]:
# 47. Train a KNN Classifier and visualize the decision boundary.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset (use first two features for 2D visualization)
data = load_iris()
X, y = data.data[:, :2], data.target  # Sepal length and sepal width

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_scaled, y)

# Create mesh grid for decision boundary
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

# Predict on mesh grid
Z = knn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary and data points
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, edgecolor='k', cmap='viridis')
plt.xlabel('Sepal Length (Scaled)')
plt.ylabel('Sepal Width (Scaled)')
plt.title('KNN Classifier Decision Boundary (k=5)')
plt.savefig('knn_decision_boundary.png')
plt.close()
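
Every point of the mesh grid above is classified by the same rule: find the `k` nearest training points and take a majority vote. A minimal from-scratch sketch of that rule follows. The helper `knn_predict` and the toy data are assumptions for illustration, and the tie handling is simplistic, so it may differ from scikit-learn's behavior when votes are tied.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_low = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]))
label_high = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(label_low, label_high)
```

Applying this function to every grid point and coloring by the returned label would produce the same kind of region map as `contourf`, only much more slowly than the vectorized `predict` call.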

In [54]:
# 48. Train a PCA model and analyze the effect of different numbers of components on data variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA with all components
pca = PCA()
pca.fit(X_scaled)

# Analyze cumulative explained variance for different numbers of components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Print explained variance for each number of components
print("Effect of Number of Components on Cumulative Explained Variance:")
for i, variance in enumerate(cumulative_variance, 1):
    print(f"Components: {i}, Cumulative Variance Explained: {variance:.4f}")

Effect of Number of Components on Cumulative Explained Variance:
Components: 1, Cumulative Variance Explained: 0.1512
Components: 2, Cumulative Variance Explained: 0.2449
Components: 3, Cumulative Variance Explained: 0.3176
Components: 4, Cumulative Variance Explained: 0.3832
Components: 5, Cumulative Variance Explained: 0.4413
Components: 6, Cumulative Variance Explained: 0.4943
Components: 7, Cumulative Variance Explained: 0.5469
Components: 8, Cumulative Variance Explained: 0.5985
Components: 9, Cumulative Variance Explained: 0.6490
Components: 10, Cumulative Variance Explained: 0.6988
Components: 11, Cumulative Variance Explained: 0.7463
Components: 12, Cumulative Variance Explained: 0.7932
Components: 13, Cumulative Variance Explained: 0.8385
Components: 14, Cumulative Variance Explained: 0.8817
Components: 15, Cumulative Variance Explained: 0.9232
Components: 16, Cumulative Variance Explained: 0.9596
Components: 17, Cumulative Variance Explained: 0.9816
Components: 18, Cumulative