In [None]:
# KNN & PCA Assignment

In [None]:
# Q1: What is K-Nearest Neighbors (KNN) and how does it work?

K-Nearest Neighbors (KNN) is a simple, instance-based machine learning algorithm used for classification and regression tasks. It works by finding the 'K' closest data points (neighbors) to the given input data and making predictions based on these neighbors. For classification, it assigns the class most common among the neighbors, and for regression, it calculates the average of the neighbors' values.
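
The voting and averaging described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are assumptions made for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data (illustrative values only): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

print(knn_classify(points, labels, (1.1, 1.0)))  # class of the nearby cluster
print(knn_regress(points, values, (5.0, 5.0)))   # average of the three closest targets
```

Note that there is no training step beyond storing the data: all the work happens at prediction time.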

In [None]:
# Q2: What is the difference between KNN Classification and KNN Regression?

KNN Classification: Predicts a class label based on the majority vote among the 'K' nearest neighbors.

KNN Regression: Predicts a continuous value by averaging the values of the 'K' nearest neighbors.

In [None]:
# Q3: What is the role of the distance metric in KNN?

The distance metric defines how closeness between two points is measured, and therefore which points count as the nearest neighbors. Common choices include Euclidean distance, Manhattan distance, and Minkowski distance (which generalizes the other two). Different metrics can rank the same points differently, so the choice can change the predictions.
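
A small sketch of how the metric can change which neighbor is nearest. The function names and the three points are made up for this example; Minkowski distance with p=1 is Manhattan and with p=2 is Euclidean.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)

def euclidean(a, b):
    return minkowski(a, b, 2)

query = (0.0, 0.0)
p1 = (3.0, 3.0)  # reached by a diagonal move
p2 = (0.0, 5.0)  # reached by an axis-aligned move

# Under Euclidean distance p1 is closer; under Manhattan distance p2 is closer.
print("Euclidean:", euclidean(query, p1), euclidean(query, p2))
print("Manhattan:", manhattan(query, p1), manhattan(query, p2))
```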

In [None]:
# Q4: What is the Curse of Dimensionality in KNN?

The Curse of Dimensionality refers to the degradation of KNN's performance as the number of features (dimensions) increases. In high-dimensional spaces, data points become sparse and the distances to the nearest and farthest points become nearly equal, so the "nearest" neighbors are no longer meaningfully close.
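
One way to see this effect is to compare the nearest and farthest distances from a random query point as the dimension grows. The sketch below assumes uniformly random data and a hypothetical helper `nearest_to_farthest_ratio`; a ratio close to 1 means the nearest point is barely closer than the farthest one.

```python
import numpy as np

def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for d in (2, 10, 100, 1000):
    print(f"dims={d:4d}  nearest/farthest = {nearest_to_farthest_ratio(d):.3f}")
```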

In [None]:
# Q5: How can we choose the best value of K in KNN?

The best value of K is usually chosen empirically, for example with cross-validation over a range of candidate values. A small K can overfit (predictions follow noise in individual points), while a large K can underfit (the decision boundary is over-smoothed). For binary classification, an odd K is typically chosen to avoid tied votes.
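
A minimal sketch of picking K by leave-one-out cross-validation, written in plain Python. The helper names and the toy dataset (two clusters plus one deliberately mislabeled point) are assumptions for illustration only.

```python
import math
from collections import Counter

def predict(train, query, k):
    """Majority vote among the k nearest (point, label) pairs in `train`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, point, k) == label
    return hits / len(data)

# Toy data: two clusters, plus one noisy point labeled "b" inside cluster "a".
data = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((0.8, 1.1), "a"), ((1.1, 1.3), "a"),
        ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b"), ((4.1, 4.3), "b"),
        ((1.05, 1.15), "b")]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first K with the highest accuracy
print(scores, "best K:", best_k)
```

With K=1 the mislabeled point misleads its closest neighbors, which is the overfitting behavior described above; larger K values vote it down.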

In [None]:
# Q6: What are KD Tree and Ball Tree in KNN?

KD Tree and Ball Tree are data structures used to speed up the search for nearest neighbors:

KD Tree: A binary tree that partitions the data along axis-aligned hyperplanes.

Ball Tree: A hierarchical structure that groups data points into hyperspheres.

These structures allow for faster querying, especially in large datasets.
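
To make the KD Tree idea concrete, here is a small sketch of building one and running a nearest-neighbor query with pruning. It is a simplified illustration (dictionary nodes, median splits, a single nearest neighbor), not how any particular library implements it; the function names are hypothetical.

```python
import math

def build_kd_tree(points, depth=0):
    """Recursively split the points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to `query`, skipping subtrees that cannot win."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (9, 2)))
```

The pruning step is where the speed-up comes from: whole branches are skipped when they lie farther away than the current best match.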

In [None]:
# Q7: When should you use KD Tree vs. Ball Tree?

KD Tree: Works best with low-dimensional data (fewer than 20 dimensions).

Ball Tree: Performs better with higher-dimensional data and non-axis-aligned data.

In [None]:
# Q8: What are the disadvantages of KNN?

High computational cost during prediction as the model needs to calculate the distance to all points.

Sensitive to the choice of K.

Poor performance in high-dimensional spaces due to the Curse of Dimensionality.

Sensitive to noise and irrelevant features.

In [None]:
# Q9: How does feature scaling affect KNN?

KNN is sensitive to the scale of features since it uses distance metrics to calculate nearest neighbors. Feature scaling (e.g., normalization or standardization) ensures that all features contribute equally to the distance metric, improving the performance of the algorithm.
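
A small sketch of why scale matters. The data is invented for this example: one feature is an age in years and the other an income in dollars, so the raw distance is dominated by income. Standardization here uses the mean and standard deviation of the data itself.

```python
import numpy as np

# Illustrative data: column 0 is age in years, column 1 is income in dollars.
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [27.0, 58_000.0]])
query = np.array([26.0, 51_500.0])

def nearest_index(data, point):
    """Index of the row in `data` closest to `point` (Euclidean)."""
    return int(np.argmin(np.linalg.norm(data - point, axis=1)))

# Standardize with statistics from the data, then apply the same transform to the query.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
query_scaled = (query - mean) / std

print("nearest (raw):   ", nearest_index(X, query))
print("nearest (scaled):", nearest_index(X_scaled, query_scaled))
```

On the raw data the 60-year-old is the nearest neighbor of a 26-year-old query because the incomes are close; after scaling, the 25-year-old is nearest.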

In [None]:
# Q10: What is PCA (Principal Component Analysis)?

PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space by finding the directions (principal components) that maximize the variance in the data.

In [None]:
# Q11: How does PCA work?

PCA works by:

Standardizing the data.

Computing the covariance matrix of the data.

Finding the eigenvalues and eigenvectors of the covariance matrix.

Projecting the data onto the top 'k' eigenvectors (principal components) corresponding to the largest eigenvalues.
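
The four steps above can be sketched with NumPy. This is an illustrative implementation under stated assumptions (the function name `pca_fit_transform` and the synthetic data are made up here), not the routine a library would use internally.

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project X onto its top-k principal components; also return variance ratios."""
    # 1. Standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition; eigh suits symmetric matrices and returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 4. Project onto the k eigenvectors with the largest eigenvalues.
    projected = X_std @ eigenvectors[:, :k]
    return projected, eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Synthetic data: the first two features are strongly correlated, the third is independent.
X = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

X_proj, ratios = pca_fit_transform(X, k=2)
print(X_proj.shape, np.round(ratios, 3))
```

Because two of the three features carry almost the same information, two components are enough to retain nearly all of the variance.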

In [None]:
# Q12: What is the geometric intuition behind PCA?

Geometrically, PCA finds the directions (principal components) that capture the most variance in the data. It can be thought of as finding the axes of an ellipsoid that best represent the data's spread.

In [None]:
# Q13: What are Eigenvalues and Eigenvectors in PCA?

Eigenvectors: Represent the directions (principal components) of the new feature space.

Eigenvalues: Represent the magnitude of the variance captured by each principal component.
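
Both statements can be checked numerically. The sketch below uses synthetic 2-D data stretched along one direction and rotated by 30 degrees (an arbitrary choice for this example), then verifies that the top eigenvector points along that direction and that its eigenvalue equals the variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D data stretched along the x-axis, then rotated by 30 degrees.
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ rotation.T
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
top_value, top_vector = eigenvalues[-1], eigenvectors[:, -1]

# Defining property: multiplying by the covariance matrix only rescales the eigenvector.
is_eigenpair = np.allclose(cov @ top_vector, top_value * top_vector)
# The eigenvalue equals the variance of the data projected onto that direction.
matches_variance = np.isclose((X @ top_vector).var(ddof=1), top_value)
# Angle of the top eigenvector, folded into [0, 180) because its sign is arbitrary.
angle = np.degrees(np.arctan2(top_vector[1], top_vector[0])) % 180

print(is_eigenpair, matches_variance, round(float(angle), 1))
```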

In [None]:
# Q14: What is the difference between Feature Selection and Feature Extraction?

Feature Selection: Involves selecting a subset of the original features based on some criteria.

Feature Extraction: Involves creating new features by transforming the original features (as in PCA) to capture the essential information.

In [None]:
# Q15: How do you decide the number of components to keep in PCA?

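The number of components is often chosen by retaining enough of them to explain a target share of the total variance (e.g., 90-95%). A cumulative explained variance plot shows where that threshold is reached, and a scree plot (variance per component) shows the "elbow" after which additional components add little.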
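
The threshold rule can be written in a few lines. The helper `components_for_variance` and the ratio values are hypothetical; in practice the ratios would come from a fitted PCA model.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical per-component ratios, sorted from largest to smallest.
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(components_for_variance(ratios, 0.90))
print(components_for_variance(ratios, 0.95))
```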

In [None]:
# Q16: Can PCA be used for classification?

PCA itself is not a classification algorithm, but it can be used as a preprocessing step for classification. Reducing the dimensionality of the data can speed up training and, in some cases, improve the accuracy of the classifier.

In [None]:
# Q17: What are the limitations of PCA?

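PCA assumes that the important structure in the data lies along linear combinations of the features, so it can miss non-linear relationships.

It may lose important information, especially when the directions of highest variance do not capture the essential structure of the data (for example, the separation between classes).

It is sensitive to feature scale: unless the data is standardized, features with larger variances dominate the components. It can also be distorted by outliers and heavy noise.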

In [None]:
# Q18: How do KNN and PCA complement each other?

PCA can be used to reduce the dimensionality of the data before applying KNN. This can help mitigate the Curse of Dimensionality and improve the performance of KNN by reducing noise and irrelevant features.

In [None]:
# Q19: How does KNN handle missing values in a dataset?

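KNN itself does not accept missing values, but the nearest-neighbor idea can be used to impute them. A common approach is to replace each missing value with the mean (or, for categorical features, the mode) of that feature among the K nearest neighbors, with distances computed on the features that are observed.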
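
A simplified sketch of this idea with NumPy. The function `knn_impute` is written for this example and is not a reimplementation of any library's imputer: distances use only the features observed in both rows, and a donor row must have the missing feature observed.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(len(X)):
            if r == i or np.isnan(X[r, j]):
                continue  # a donor row must have feature j observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda pair: pair[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

# Toy data: the last row is missing its third feature and resembles the first two rows.
X = np.array([[1.0, 2.0, 10.0],
              [1.1, 2.1, 12.0],
              [5.0, 6.0, 50.0],
              [1.05, 2.05, np.nan]])
X_filled = knn_impute(X, k=2)
print(X_filled)
```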

In [None]:
# Q20: What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

PCA: Focuses on maximizing variance in the data, unsupervised.

LDA: Focuses on maximizing class separability, supervised.

LDA is better suited for classification problems where class labels are available, whereas PCA is used for general dimensionality reduction.

In [None]:
# Practical

In [None]:
# Q21: Train a KNN Classifier on the Iris dataset and print model accuracy

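import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_wine, make_classification, make_regression
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, classification_report, f1_score, mean_squared_error
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

iris = load_iris()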
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_pred = knn_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classifier Accuracy: {accuracy * 100:.2f}%")

In [None]:
# Q22: Train a KNN Regressor on a synthetic dataset and evaluate using MSE

# Separate variable names keep the Iris split from Q21 available to the classifier cells that follow
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_reg_train, y_reg_train)
y_pred = knn_reg.predict(X_reg_test)
mse = mean_squared_error(y_reg_test, y_pred)
print(f"Q22 - KNN Regressor MSE: {mse:.2f}")

In [None]:
# Q23: Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy

knn_clf_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_clf_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')

knn_clf_euclidean.fit(X_train, y_train)
knn_clf_manhattan.fit(X_train, y_train)

acc_euclidean = accuracy_score(y_test, knn_clf_euclidean.predict(X_test))
acc_manhattan = accuracy_score(y_test, knn_clf_manhattan.predict(X_test))

print(f"Q23 - Euclidean Accuracy: {acc_euclidean * 100:.2f}%")
print(f"Q23 - Manhattan Accuracy: {acc_manhattan * 100:.2f}%")

In [None]:
# Q24: Train a KNN Classifier with different values of K and visualize decision boundaries

k_values = [1, 5, 10]
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[:, :2], y_train)
    
    # Plotting decision boundary
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure()
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', s=20)
    plt.title(f'K = {k}')
    plt.show()

In [None]:
# Q25: Apply Feature Scaling before training a KNN model and compare results with unscaled data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_unscaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled = KNeighborsClassifier(n_neighbors=3)

knn_unscaled.fit(X_train, y_train)
knn_scaled.fit(X_train_scaled, y_train)

acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Q25 - Unscaled Accuracy: {acc_unscaled * 100:.2f}%")
print(f"Q25 - Scaled Accuracy: {acc_scaled * 100:.2f}%")

In [None]:
# Q26: Train a PCA model on synthetic data and print the explained variance ratio for each component

X_synthetic, _ = make_regression(n_samples=100, n_features=5)
pca = PCA(n_components=5)
pca.fit(X_synthetic)

print(f"Q26 - Explained Variance Ratio: {pca.explained_variance_ratio_}")

In [None]:
# Q27: Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)

acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))
acc_no_pca = accuracy_score(y_test, knn_clf.predict(X_test))

print(f"Q27 - Accuracy with PCA: {acc_pca * 100:.2f}%")
print(f"Q27 - Accuracy without PCA: {acc_no_pca * 100:.2f}%")


In [None]:
# Q28: Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 10], 'metric': ['euclidean', 'manhattan']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Q28 - Best Params: {grid_search.best_params_}")
print(f"Q28 - Best Score: {grid_search.best_score_:.4f}")

In [None]:
# Q29: Train a KNN Classifier and check the number of misclassified samples

y_pred_knn = knn_clf.predict(X_test)
misclassified_samples = np.sum(y_test != y_pred_knn)

print(f"Q29 - Number of Misclassified Samples: {misclassified_samples}")

In [None]:
# Q30: Train a PCA model and visualize the cumulative explained variance

pca = PCA().fit(X_synthetic)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure()
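# Start the x-axis at 1 so it matches the "Number of Components" label
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.title('Q30 - Cumulative Explained Variance by PCA Components')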
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()

In [None]:
# Q31: Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn_uniform.fit(X_train, y_train)
knn_distance.fit(X_train, y_train)

acc_uniform = accuracy_score(y_test, knn_uniform.predict(X_test))
acc_distance = accuracy_score(y_test, knn_distance.predict(X_test))

print(f"Q31 - Uniform Accuracy: {acc_uniform * 100:.2f}%")
print(f"Q31 - Distance Accuracy: {acc_distance * 100:.2f}%")

In [None]:
# Q32: Train a KNN Regressor and analyze the effect of different K values on performance

# Separate variable names keep the Iris split from Q31 available to the classifier cells that follow
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

k_values = [1, 3, 5, 7, 10]
for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_reg_train, y_reg_train)
    y_pred = knn_reg.predict(X_reg_test)
    mse = mean_squared_error(y_reg_test, y_pred)
    print(f"Q32 - K={k}, MSE: {mse:.2f}")

In [None]:
# Q33: Implement KNN Imputation for handling missing values in a dataset

# Creating synthetic dataset with missing values

X_missing, _ = make_regression(n_samples=100, n_features=2, noise=0.1)
X_missing[np.random.randint(0, 100, 20), np.random.randint(0, 2, 20)] = np.nan

imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X_missing)
print(f"Q33 - First 5 rows after Imputation:\n{X_imputed[:5]}")

In [None]:
# Q34: Train a PCA model and visualize the data projection onto the first two principal components

X_synthetic, _ = make_classification(n_samples=100, n_features=5)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_synthetic)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("Q34 - Data Projection on First Two PCA Components")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.grid(True)
plt.show()

In [None]:
# Q35: Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance

knn_kd_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_ball_tree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')

knn_kd_tree.fit(X_train, y_train)
knn_ball_tree.fit(X_train, y_train)

acc_kd_tree = accuracy_score(y_test, knn_kd_tree.predict(X_test))
acc_ball_tree = accuracy_score(y_test, knn_ball_tree.predict(X_test))

print(f"Q35 - KD Tree Accuracy: {acc_kd_tree * 100:.2f}%")
print(f"Q35 - Ball Tree Accuracy: {acc_ball_tree * 100:.2f}%")

In [None]:
# Q36: Train a PCA model on a high-dimensional dataset and visualize the Scree plot

X_high_dim, _ = make_classification(n_samples=200, n_features=20)
pca_high_dim = PCA()
pca_high_dim.fit(X_high_dim)

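# A scree plot shows the variance explained by each component individually
component_numbers = range(1, len(pca_high_dim.explained_variance_ratio_) + 1)
plt.plot(component_numbers, pca_high_dim.explained_variance_ratio_, marker='o')
plt.title("Q36 - Scree Plot of High-Dimensional Dataset")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")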
plt.grid(True)
plt.show()

In [None]:
# Q37: Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
y_pred = knn_clf.predict(X_test)

precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Q37 - Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}")

In [None]:
# Q38: Train a PCA model and analyze the effect of different numbers of components on accuracy

pca_components = [1, 2, 3, 4]
for n in pca_components:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    knn_pca = KNeighborsClassifier(n_neighbors=5)
    knn_pca.fit(X_train_pca, y_train)
    
    acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))
    print(f"Q38 - Components={n}, Accuracy: {acc_pca * 100:.2f}%")

In [2]:
# Q39: Train a KNN Classifier with different leaf_size values and compare accuracy

leaf_sizes = [10, 20, 30, 40]
for leaf in leaf_sizes:
    knn_leaf = KNeighborsClassifier(n_neighbors=5, leaf_size=leaf)
    knn_leaf.fit(X_train, y_train)
    acc_leaf = accuracy_score(y_test, knn_leaf.predict(X_test))
    print(f"Q39 - Leaf Size={leaf}, Accuracy: {acc_leaf * 100:.2f}%")

In [None]:
# Q40: Train a PCA model and visualize how data points are transformed before and after PCA

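# With only 2 features, both must be informative (the default of 2 redundant features would raise an error)
X_synthetic_2D, _ = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0)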
pca = PCA(n_components=1)
X_pca_transformed = pca.fit_transform(X_synthetic_2D)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_synthetic_2D[:, 0], X_synthetic_2D[:, 1])
plt.title("Q40 - Before PCA")

plt.subplot(1, 2, 2)
plt.scatter(X_pca_transformed, np.zeros_like(X_pca_transformed))
plt.title("Q40 - After PCA")
plt.show()

In [None]:
# Q41: Train a KNN Classifier on the Wine dataset and print classification report

wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn_wine = KNeighborsClassifier(n_neighbors=5)
knn_wine.fit(X_train, y_train)
y_pred_wine = knn_wine.predict(X_test)

print(f"Q41 - Classification Report:\n{classification_report(y_test, y_pred_wine)}")

In [None]:
# Q42: Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

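# Use a regression dataset here: the Wine target holds class labels, not continuous values
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

metrics = ['euclidean', 'manhattan', 'chebyshev']
for metric in metrics:
    knn_reg = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn_reg.fit(X_reg_train, y_reg_train)
    y_pred = knn_reg.predict(X_reg_test)
    mse = mean_squared_error(y_reg_test, y_pred)
    print(f"Q42 - Metric={metric}, MSE: {mse:.2f}")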

In [None]:
# Q43: Train a KNN Classifier and evaluate using ROC-AUC score

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
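y_prob = knn_clf.predict_proba(X_test)  # class probabilities, shape (n_samples, n_classes)

# The Wine target has three classes, so score one-vs-rest on the full probability matrix
roc_auc = roc_auc_score(y_test, y_prob, multi_class='ovr')
print(f"Q43 - ROC-AUC Score: {roc_auc:.2f}")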

In [None]:
# Q44: Train a PCA model and visualize the variance captured by each principal component

pca = PCA()
pca.fit(X_train)

plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.title("Q44 - Variance Captured by Each Principal Component")
plt.xlabel("Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.grid(True)
plt.show()

In [None]:
# Q45: Train a KNN Classifier and perform feature selection before training

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=3)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

knn_fs = KNeighborsClassifier(n_neighbors=5)
knn_fs.fit(X_train_selected, y_train)

acc_fs = accuracy_score(y_test, knn_fs.predict(X_test_selected))
print(f"Q45 - Accuracy after Feature Selection: {acc_fs * 100:.2f}%")

In [None]:
# Q46: Train a PCA model and visualize the data reconstruction error after reducing dimensions

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)
X_reconstructed = pca.inverse_transform(X_pca)

reconstruction_error = np.mean((X_train - X_reconstructed) ** 2)
print(f"Q46 - Reconstruction Error: {reconstruction_error:.4f}")

In [None]:
# Q47: Train a KNN Classifier and visualize the decision boundary

from matplotlib.colors import ListedColormap

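# Three classes with 2 informative features require one cluster per class (3 * 1 <= 2 ** 2)
X_2D, y_2D = make_classification(n_samples=100, n_features=2, n_classes=3, n_informative=2,
                                 n_redundant=0, n_clusters_per_class=1)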
knn_clf_2D = KNeighborsClassifier(n_neighbors=5)
knn_clf_2D.fit(X_2D, y_2D)

x_min, x_max = X_2D[:, 0].min() - 1, X_2D[:, 0].max() + 1
y_min, y_max = X_2D[:, 1].min() - 1, X_2D[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

Z = knn_clf_2D.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.8, cmap=ListedColormap(('red', 'green', 'blue')))
plt.scatter(X_2D[:, 0], X_2D[:, 1], c=y_2D, edgecolors='k', cmap=ListedColormap(('red', 'green', 'blue')))
plt.title("Q47 - KNN Decision Boundary")
plt.show()

In [3]:
# Q48: Train a PCA model and analyze the effect of different numbers of components on data variance

pca_variance = [1, 2, 3, 4]
for n in pca_variance:
    pca = PCA(n_components=n)
    pca.fit(X_train)
    print(f"Q48 - Components={n}, Variance Ratio: {np.sum(pca.explained_variance_ratio_):.4f}")