
**Theoretical**

**1. What is K-Nearest Neighbors (KNN) and how does it work?**  
KNN is a non-parametric algorithm that classifies or predicts a data point based on the majority label or average of its ‘K’ nearest neighbors. It uses distance metrics like Euclidean to find closeness.
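
As a rough illustration of the idea, the following sketch implements the classification case with NumPy only. The function name `knn_predict` and the toy data are made up for this example; it is not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.8, 1.6])))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([6.2, 6.4])))  # query near the second cluster
```

There is no training step beyond storing the data, which is why KNN is often called a lazy learner.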

**2. What is the difference between KNN Classification and KNN Regression?**  
In KNN classification, the output is the majority class among neighbors, while in KNN regression, the output is the average (or weighted average) of the neighbors’ values.
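
A small sketch of the difference, using a hypothetical one-dimensional dataset: the same three neighbors are found in both cases, and only the way their targets are combined changes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical toy data: one feature, with both a class label and a numeric target
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1, 1])
values = np.array([1.0, 2.0, 6.0, 20.0, 22.0])

query = np.array([[2.2]])  # its three nearest neighbors are x = 2.0, 3.0 and 1.0

clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, values)

predicted_class = clf.predict(query)[0]  # majority vote over labels [0, 1, 0]
predicted_value = reg.predict(query)[0]  # mean of values [2.0, 6.0, 1.0]
print(predicted_class, predicted_value)
```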

**3. What is the role of the distance metric in KNN?**  
The distance metric determines how closeness is measured between points. Common metrics include Euclidean, Manhattan, and Minkowski.
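
The three metrics are closely related: Minkowski distance of order p reduces to Manhattan for p = 1 and to Euclidean for p = 2. A minimal sketch (the helper `minkowski` is written here for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5
cubic = minkowski(a, b, p=3)      # (27 + 64) ** (1/3), roughly 4.50
print(manhattan, euclidean, cubic)
```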

**4. What is the Curse of Dimensionality in KNN?**  
As dimensions increase, data becomes sparse and distance between points becomes less meaningful, reducing KNN’s effectiveness and increasing computation.
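
One way to see this effect is to compare the nearest and farthest distances from a query point as the number of dimensions grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration; as the ratio approaches 1, the "nearest" neighbor is barely closer than any other point.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"dims={d:4d}  nearest/farthest = {r:.3f}")
```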

**5. How can we choose the best value of K in KNN?**  
Use cross-validation to test different K values and choose the one that gives the best performance. A smaller K may overfit; a larger K may underfit.

**6. What are KD Tree and Ball Tree in KNN?**  
They are data structures used to speed up nearest neighbor searches by organizing data hierarchically to avoid brute-force computation.
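
scikit-learn exposes both structures directly as `KDTree` and `BallTree`. The sketch below, on synthetic points, checks that they return the same neighbors as a brute-force search; the dataset size and `leaf_size` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
X_demo = rng.random((1000, 3))  # synthetic low-dimensional points
query = X_demo[:1]              # use the first point as the query

# Brute force: compute every distance, then sort
brute_idx = np.argsort(np.linalg.norm(X_demo - query, axis=1))[:5]

# Tree-based search: build the index once, then query it
_, kd_idx = KDTree(X_demo, leaf_size=30).query(query, k=5)
_, ball_idx = BallTree(X_demo, leaf_size=30).query(query, k=5)

print(brute_idx, kd_idx[0], ball_idx[0])
```

The results are identical because both trees perform exact searches; the benefit is in query time on larger datasets.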

**7. When should you use KD Tree vs. Ball Tree?**  
KD Tree works well on low-dimensional data (as a rule of thumb, fewer than about 20 features), while Ball Tree tends to remain more efficient as dimensionality grows.

**8. What are the disadvantages of KNN?**  
KNN is slow on large datasets, sensitive to irrelevant features and outliers, and struggles with high-dimensional data.

**9. How does feature scaling affect KNN?**  
Since KNN uses distance metrics, unscaled features can bias the results. Scaling ensures all features contribute equally to distance.
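
To make the bias concrete, here is a sketch with hypothetical age and income columns. On the raw data the income column dominates the distance, so the nearest neighbor of the first person changes once both features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years, income in dollars
people = np.array([[25.0, 50_000.0],
                   [60.0, 51_000.0],
                   [27.0, 54_000.0],
                   [45.0, 90_000.0],
                   [35.0, 20_000.0]])

def nearest_to_first(X):
    """Row index of the point closest to row 0 (excluding row 0 itself)."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_neighbor = nearest_to_first(people)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(people))
print(raw_neighbor, scaled_neighbor)
```

On the raw data the 60-year-old with a similar income is "nearest"; after scaling, the 27-year-old is.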

**10. What is PCA (Principal Component Analysis)?**  
PCA is a dimensionality reduction technique that transforms data into new axes (principal components) capturing maximum variance.

**11. How does PCA work?**  
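PCA centers the data (and the data is usually standardized first when features are on different scales), computes the covariance matrix, then finds its eigenvectors (directions) and eigenvalues (variance along each direction) and projects the data onto the eigenvectors with the largest eigenvalues.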
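
These steps can be sketched directly with NumPy and checked against scikit-learn. The sketch only centers the data, which matches what scikit-learn's `PCA` does; standardizing first is a separate preprocessing choice. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data

# 1. Center each feature at zero
X_centered = X_demo - X_demo.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order, so reorder
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the two eigenvectors with the largest eigenvalues
X_projected = X_centered @ eigenvectors[:, :2]

# Compare with scikit-learn (the sign of each component may differ)
pca_ref = PCA(n_components=2).fit(X_demo)
print(eigenvalues[:2], pca_ref.explained_variance_)
```

The eigenvalues match `explained_variance_`, and the projections agree up to the sign of each component.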

**12. What is the geometric intuition behind PCA?**  
PCA rotates the coordinate system to align with the directions of greatest variance, so the first few axes (principal components) capture most of the information.

**13. What is the difference between Feature Selection and Feature Extraction?**  
Feature selection picks a subset of original features; feature extraction creates new features from combinations of original ones (like in PCA).

**14. What are Eigenvalues and Eigenvectors in PCA?**  
Eigenvectors represent the directions of new axes, while eigenvalues represent how much variance each principal component explains.

**15. How do you decide the number of components to keep in PCA?**  
Use a scree plot or retain enough components to explain a desired percentage (e.g., 95%) of total variance.
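
A sketch of the variance-threshold approach on the Wine data (the dataset and the 95% threshold are example choices). scikit-learn also accepts a float between 0 and 1 as `n_components` and then selects the number of components itself.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_wine().data)

# Manual approach: smallest number of components reaching 95% cumulative variance
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Shortcut: a float in (0, 1) asks PCA to keep enough components for that fraction
pca_95 = PCA(n_components=0.95).fit(X_demo)
print(n_manual, pca_95.n_components_)
```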

**16. Can PCA be used for classification?**  
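Not on its own: PCA is an unsupervised transformation, not a classifier. It is commonly applied for dimensionality reduction before a classifier, which can cut computation and reduce overfitting, although it may also discard information that is useful for separating the classes.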

**17. What are the limitations of PCA?**  
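PCA captures only linear structure, yields components that are harder to interpret than the original features, is sensitive to feature scaling, and can discard useful information when the directions of highest variance are not the ones that matter for the task.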

**18. How do KNN and PCA complement each other?**  
PCA reduces dimensionality and noise, which improves KNN’s accuracy and reduces computation, especially in high-dimensional spaces.
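
One common way to combine the two is a scikit-learn pipeline, sketched below on the Wine data. The choice of 5 components and 5 neighbors is arbitrary for this example, and whether PCA actually helps depends on the dataset.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

# Scale, reduce to 5 components (an arbitrary choice here), then classify
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

# Cross-validation refits the scaler and PCA on each training fold only
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Keeping the scaler and PCA inside the pipeline avoids leaking information from the validation folds into the preprocessing.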

**19. How does KNN handle missing values in a dataset?**  
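Standard KNN cannot compute distances when values are missing, so they must be imputed beforehand. KNN can itself be used for imputation, filling each missing value from the corresponding values of the nearest neighbors (as scikit-learn's KNNImputer does).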

**20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?**  
PCA is unsupervised and maximizes variance, while LDA is supervised and maximizes class separability using label information.
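
The difference shows up directly in the scikit-learn API: PCA is fit on the features alone, while LDA also requires the labels. A minimal sketch on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_demo, y_demo = load_iris(return_X_y=True)

# PCA never sees the labels
X_pca_demo = PCA(n_components=2).fit_transform(X_demo)

# LDA needs the labels; it yields at most (n_classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda_demo = lda.fit_transform(X_demo, y_demo)

print(X_pca_demo.shape, X_lda_demo.shape)
```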


**Practical**

21. Train a KNN Classifier on the Iris dataset and print model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)

In [None]:
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy

In [None]:
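from sklearn.neighbors import KNeighborsClassifier

# Re-split Iris first: X_train/y_train still hold the regression data from
# question 22, and a classifier cannot be fit on continuous targets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    print(f"{metric.capitalize()} Distance Accuracy: {acc}")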


24. Train a KNN Classifier with different values of K and visualize decision boundaries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 15]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']))
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
    plt.title(f"Decision boundary (k={k})")
    plt.show()


25. Apply Feature Scaling before training a KNN model and compare results with unscaled data

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy without scaling:", knn.score(X_test, y_test))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", knn_scaled.score(X_test_scaled, y_test))


26. Train a PCA model on synthetic data and print the explained variance ratio for each component

In [None]:
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA

X, _ = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
pca = PCA()
pca.fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)


27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

In [None]:
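from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Use all four Iris features (separate names so later cells keep their X_train)
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Baseline: KNN on the original features
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_tr, y_tr)
print("Accuracy without PCA:", knn_raw.score(X_te, y_te))

# KNN on the first two principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_tr)
X_test_pca = pca.transform(X_te)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_tr)
print("Accuracy with PCA:", knn_pca.score(X_test_pca, y_te))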


28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [1, 3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best accuracy:", grid.best_score_)


29. Train a KNN Classifier and check the number of misclassified samples

In [None]:
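# Train on the current split, then count the test samples predicted incorrectly
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
misclassified = (y_test != y_pred).sum()
print("Misclassified samples:", misclassified)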


30. Train a PCA model and visualize the cumulative explained variance

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

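pca = PCA().fit(X_train)
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.figure()
# Start the x-axis at 1 so it matches the number of components kept
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')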
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.show()


31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy

In [None]:
from sklearn.neighbors import KNeighborsClassifier

for weights in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)
    print(f"Accuracy with {weights} weights:", knn.score(X_test, y_test))


32. Train a KNN Regressor and analyze the effect of different K values on performance

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for k in [1, 3, 5, 10]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"K={k}, MSE={mean_squared_error(y_test, y_pred)}")


33. Implement KNN Imputation for handling missing values in a dataset

In [None]:
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2], [np.nan, 3], [7, 6]])
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)


34. Train a PCA model and visualize the data projection onto the first two principal components

In [None]:
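from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Re-split Iris: X_train currently holds the single-feature regression data
# from question 32, which cannot be projected onto two components
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Data projected onto first two PCA components")
plt.show()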


35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance

In [None]:
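# Re-split Iris so this cell has class labels and all four features
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Both algorithms find exact neighbors, so accuracy should match; they differ in speed
for algo in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algo)
    knn.fit(X_train, y_train)
    print(f"{algo} accuracy:", knn.score(X_test, y_test))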


36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot

In [None]:
from sklearn.datasets import make_classification
X_hd, _ = make_classification(n_samples=200, n_features=50, random_state=0)
pca = PCA().fit(X_hd)
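# A scree plot shows the variance explained by each component, not the cumulative sum
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Component')
plt.ylabel('Explained Variance Ratio')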
plt.title('Scree Plot')
plt.grid(True)
plt.show()


37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

In [None]:
from sklearn.metrics import classification_report

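# Re-split Iris so y_train holds class labels, as a classifier requires
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)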
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))


38. Train a PCA model and analyze the effect of different numbers of components on accuracy

In [None]:
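# Re-split Iris: up to 4 components requires all four features
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

for n in [2, 3, 4]: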
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    print(f"Accuracy with {n} components:", knn.score(X_test_pca, y_test))


39. Train a KNN Classifier with different leaf_size values and compare accuracy

In [None]:
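# Re-split Iris so the targets are class labels
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# leaf_size affects tree construction and query speed, not which neighbors are found
for leaf in [10, 30, 50]: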
    knn = KNeighborsClassifier(leaf_size=leaf)
    knn.fit(X_train, y_train)
    print(f"Leaf size {leaf} accuracy:", knn.score(X_test, y_test))


40. Train a PCA model and visualize how data points are transformed before and after PCA

In [None]:
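# Re-split Iris so X_train has at least two feature columns to plot
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.title("Original Data (first two features)")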

X_pca = PCA(n_components=2).fit_transform(X_train)
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
plt.title("After PCA")
plt.show()


41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report

In [None]:
from sklearn.datasets import load_wine

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=42)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))


42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

In [None]:
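# Use a synthetic regression problem (the Wine targets are class labels);
# separate names keep the Wine split and classifier available to later cells
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=15, random_state=0)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)

for metric in ['euclidean', 'manhattan']:
    knn_reg = KNeighborsRegressor(metric=metric)
    knn_reg.fit(X_reg_train, y_reg_train)
    y_reg_pred = knn_reg.predict(X_reg_test)
    print(f"{metric} MSE:", mean_squared_error(y_reg_test, y_reg_pred))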


43. Train a KNN Classifier and evaluate using ROC-AUC score

In [None]:
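from sklearn.metrics import roc_auc_score

# Fit a classifier on the Wine split; predict_proba is only available on classifiers
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_score = knn.predict_proba(X_test)
# Multiclass AUC: pass the class labels and per-class probabilities (one-vs-rest)
print("ROC-AUC Score:", roc_auc_score(y_test, y_score, multi_class='ovr'))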


44. Train a PCA model and visualize the variance captured by each principal component

In [None]:
pca = PCA().fit(X_train)
plt.bar(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_)
plt.xlabel("Components")
plt.ylabel("Variance Ratio")
plt.title("Explained Variance per Component")
plt.show()


45. Train a KNN Classifier and perform feature selection before training

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Fit the selector once on the training data, then apply it to both splits
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
knn = KNeighborsClassifier()
knn.fit(X_new, y_train)
print("Accuracy after feature selection:", knn.score(X_test_new, y_test))


46. Train a PCA model and visualize the data reconstruction error after reducing dimensions

In [None]:
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_train)
X_reconstructed = pca.inverse_transform(X_reduced)
error = np.mean((X_train - X_reconstructed) ** 2)
print("Reconstruction Error:", error)


47. Train a KNN Classifier and visualize the decision boundary

In [None]:
from matplotlib.colors import ListedColormap

X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure()
plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
plt.title("Decision Boundary (k=5)")
plt.show()


48. Train a PCA model and analyze the effect of different numbers of components on data variance

In [None]:
pca = PCA().fit(X_train)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance)+1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance')
plt.title('PCA Variance Analysis')
plt.grid(True)
plt.show()
