
1. What is K-Nearest Neighbors (KNN) and how does it work
KNN is a simple, non-parametric algorithm used for classification and regression. It finds the *k* nearest data points (neighbors) to a query point and predicts by majority vote (classification) or by averaging the neighbors' values (regression). Closeness is measured with a distance metric such as Euclidean distance. There is no explicit training phase: the model stores the training data and defers all computation to prediction time.
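
As a rough illustration of the description above, here is a minimal pure-Python sketch of KNN classification. The function name `knn_predict` and the toy points are made up for this example; it is not how scikit-learn implements the algorithm.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

Note that all the work happens inside `knn_predict`: nothing is learned ahead of time, which is why prediction cost grows with the size of the stored data.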

2. What is the difference between KNN Classification and KNN Regression
KNN Classification predicts the class label by majority vote of *k* nearest neighbors.  
KNN Regression predicts the target value by averaging the values of *k* nearest neighbors.  
Classification deals with discrete labels; regression deals with continuous values.  
Both use distance metrics to find neighbors.
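
The sketch below shows the two variants side by side on the same neighbors. The helper `k_nearest` and the 1-D toy data are assumptions made for illustration only.

```python
import math
from collections import Counter
from statistics import mean

def k_nearest(train_points, targets, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return [targets[i] for i in order[:k]]

# Hypothetical 1-D training points with both a class label and a numeric target
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
classes = ["low", "low", "low", "high", "high"]
values = [1.5, 2.5, 3.5, 10.5, 11.5]

query = (2.2,)
# Classification: majority vote over the neighbors' labels
predicted_class = Counter(k_nearest(points, classes, query, k=3)).most_common(1)[0][0]
# Regression: average of the neighbors' numeric targets
predicted_value = mean(k_nearest(points, values, query, k=3))

print(predicted_class, predicted_value)
```

The neighbor search is identical in both cases; only the final aggregation step differs.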

3. What is the role of the distance metric in KNN
The distance metric measures similarity between data points.  
It helps KNN find the closest neighbors to a query point.  
Common metrics: Euclidean, Manhattan, and Minkowski, which generalizes both (scikit-learn defaults to Minkowski with p = 2, which is the Euclidean distance).  
The choice of metric affects accuracy and performance.
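
To make the relationship between these metrics concrete, here is a small sketch of the Minkowski distance; the function name `minkowski` is chosen for this example.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, p=1)  # sum of absolute differences
euclidean = minkowski(a, b, p=2)  # straight-line distance
print(manhattan, euclidean)
```

Setting `p=1` gives the Manhattan distance and `p=2` the Euclidean distance, so both are special cases of one formula.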

4. What is the Curse of Dimensionality in KNN
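In high dimensions, data becomes sparse and the distances to the nearest and farthest points become nearly equal.  
KNN then struggles to identify neighbors that are meaningfully "near".  
Accuracy drops because irrelevant dimensions add noise to every distance.  
Feature selection or dimensionality reduction helps mitigate this.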
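
The effect can be observed with a small experiment. The sketch below, which assumes uniformly random points in a unit hypercube, measures how much farther the farthest point is than the nearest one, relative to the nearest distance; `relative_contrast` is a name invented for this example.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(p, query) for p in points]
    return (max(distances) - min(distances)) / min(distances)

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(200, d), 3))
```

As the number of dimensions grows, the printed contrast shrinks, meaning the "nearest" neighbor is barely closer than any other point.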

5. How can we choose the best value of K in KNN
Use cross-validation to test different K values.  
Plot accuracy vs K and pick the one with best performance.  
Avoid very low K (overfitting) and very high K (underfitting).  
An odd K avoids tied votes in binary classification.
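
A minimal sketch of this procedure, using leave-one-out cross-validation in pure Python, is shown below. The toy data (two clusters with one deliberately mislabeled point) and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k points closest to `query`."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    correct = 0
    for i in range(len(points)):
        # Hold out point i and predict it from all the others
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)

# Hypothetical data: two clusters, with one "b" point sitting inside the "a" cluster
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (6, 6), (6, 7), (7, 6), (7, 7), (6.5, 6.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the single mislabeled point misleads K = 1 far more than K = 3 or K = 5, which is the overfitting behavior described above.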

6. What are KD Tree and Ball Tree in KNN
KD Tree and Ball Tree are data structures used to speed up nearest neighbor search in KNN.  
- **KD Tree**: Best for low-dimensional data. Splits data along axes.  
- **Ball Tree**: Tends to cope better with higher-dimensional data. Partitions data into nested hyperspheres.  
They reduce computation time for distance queries.
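
To show how axis-aligned splitting speeds up a query, here is a simplified KD Tree sketch for a single nearest neighbor. It is an illustration under simplifying assumptions (dictionary nodes, one neighbor, Euclidean distance), not the implementation used by scikit-learn.

```python
import math
import random

def build_kd_tree(points, depth=0):
    """Recursively split points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to `query`, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the current best
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
query = (0.5, 0.5)
tree = build_kd_tree(points)
print(nearest(tree, query))
```

The pruning test is where the savings come from: whole subtrees are skipped when the splitting plane is already farther away than the best candidate found so far.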

7. When should you use KD Tree vs. Ball Tree
Use **KD Tree** when data is **low-dimensional (d < 20)** and balanced.  
Use **Ball Tree** when data is **high-dimensional**, sparse, or unbalanced.  
Ball Tree handles complex structures better; KD Tree is faster for simpler data.

8. What are the disadvantages of KNN
- Slow for large datasets (no training, all computation at prediction).  
- Sensitive to noisy or irrelevant features.  
- Struggles in high-dimensional spaces (curse of dimensionality).  
- Requires feature scaling.  
- Memory-intensive (stores entire dataset).

9. How does feature scaling affect KNN
Feature scaling is crucial for KNN because it uses distance metrics. Without scaling, features with larger ranges dominate, leading to biased results. Scaling (like StandardScaler or MinMaxScaler) ensures fair distance computation.
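
The following sketch illustrates the point with made-up (age, income) rows: the nearest neighbor of the first row changes once both columns are standardized. The helper names are assumptions for this example.

```python
import math
from statistics import mean, pstdev

# Hypothetical data: (age in years, income in dollars)
rows = [(25, 50_000), (60, 51_000), (27, 53_000), (45, 90_000), (35, 20_000)]

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest to row 0 (excluding row 0 itself)."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

print(nearest_to_first(rows))               # income dominates the raw distance
print(nearest_to_first(standardize(rows)))  # both features contribute after scaling
```

Before scaling, the 60-year-old with a similar income is treated as the closest match; after scaling, the 27-year-old is, because the age difference is no longer drowned out by the income column.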

10. What is PCA (Principal Component Analysis)
PCA (Principal Component Analysis) is a dimensionality reduction technique. It transforms data into a new coordinate system, where the greatest variance lies on the first axis (principal component), the second greatest on the second axis, and so on. This helps reduce the number of features while preserving as much variance as possible.

11. How does PCA work
PCA works by identifying the directions (principal components) in which the data varies the most. Steps:
1. **Standardize the data** (mean = 0, variance = 1).
2. **Calculate the covariance matrix** to understand relationships between features.
3. **Compute the eigenvalues and eigenvectors** of the covariance matrix.
4. **Sort eigenvalues** in descending order; select top 'k' eigenvectors.
5. **Project data** onto the selected eigenvectors (principal components).
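
The five steps above can be sketched directly in NumPy. This is a minimal illustration on synthetic data; the function name `pca` and the data-generating choices are assumptions, and library implementations typically use SVD rather than an explicit covariance matrix.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize each feature to mean 0 and variance 1
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen-decomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue, largest first, and keep the top k eigenvectors
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    # 5. Project the data onto the selected components
    return Z @ components, eigenvalues[order]

# Hypothetical data: three features, the second nearly a multiple of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

projected, top_eigenvalues = pca(X, k=2)
print(projected.shape, top_eigenvalues)
```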

12. What is the geometric intuition behind PCA
PCA can be understood geometrically as projecting data onto a lower-dimensional subspace that captures the maximum variance.

1. **High variance direction**: PCA identifies the axis (principal component) along which the data varies the most.
2. **Projection**: It then projects the original data points onto this axis, reducing dimensions.
3. **Maximizing variance**: The first principal component captures the maximum variance, and each subsequent component captures the largest share of the remaining variance in a direction orthogonal to the previous ones.
   
This helps in finding the best representation of the data in fewer dimensions.

13. What is the difference between Feature Selection and Feature Extraction
**Feature Selection**:
- **Goal**: Select the most important features from the original set, keeping the data in its original form.
- **Method**: Techniques like filtering, wrapper, or embedded methods are used to remove irrelevant or redundant features.
- **Outcome**: Reduced feature set, but original features are retained.

**Feature Extraction**:
- **Goal**: Transform the original features into a new set of features, often of lower dimensionality.
- **Method**: Techniques like PCA, LDA, or autoencoders are used to combine or transform features into new ones.
- **Outcome**: A new set of features (e.g., principal components) that captures the essential information.

In short, feature selection reduces the number of features, while feature extraction creates new features.

14. What are Eigenvalues and Eigenvectors in PCA
**Eigenvalues**:
- Represent the magnitude of the variance captured by each principal component in PCA.
- Larger eigenvalues indicate more significant components that capture more data variability.

**Eigenvectors**:
- Represent the direction of the principal components in the feature space.
- They define the new axes onto which the data will be projected, capturing the directions of maximum variance.

In PCA, eigenvalues and eigenvectors are derived from the covariance matrix and are used to determine the principal components.
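
As a quick numerical check of these two roles, the sketch below (on synthetic, correlated 2-D data chosen for illustration) verifies that an eigenvector satisfies the defining equation and that its eigenvalue equals the variance of the data projected onto it.

```python
import numpy as np

# Hypothetical 2-D data with correlated features, centered at zero
rng = np.random.default_rng(1)
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + 0.2 * rng.normal(size=500)])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

v = eigenvectors[:, -1]  # direction of the first principal component
lam = eigenvalues[-1]    # variance captured along that direction

# Defining property: cov @ v equals lam * v
defining_property_holds = np.allclose(cov @ v, lam * v)
# The variance of the data projected onto v equals the eigenvalue
variance_matches = np.isclose((X @ v).var(ddof=1), lam)
print(defining_property_holds, variance_matches)
```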

15. How do you decide the number of components to keep in PCA
To decide the number of components to keep in PCA:

1. **Cumulative Explained Variance**: Choose components that explain a significant portion of the variance (typically 80-95%). Plot the cumulative explained variance and select the number of components where the curve plateaus.

2. **Scree Plot**: Look for an "elbow" in the plot of eigenvalues, where the explained variance starts to diminish significantly.

3. **Domain Knowledge**: Sometimes, domain knowledge or business requirements can guide the selection of the number of components.
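
The cumulative explained variance rule can be expressed in a few lines. The function name `components_for_variance` and the eigenvalues below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of components whose cumulative explained variance reaches `threshold`."""
    ratios = np.sort(eigenvalues)[::-1] / np.sum(eigenvalues)
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold, converted to a count
    count = np.searchsorted(cumulative, threshold) + 1
    return int(min(count, len(cumulative)))

# Hypothetical eigenvalues of a covariance matrix
eigenvalues = np.array([4.0, 3.0, 2.0, 1.0])
print(components_for_variance(eigenvalues, threshold=0.80))
```

Here the explained variance ratios are 0.4, 0.3, 0.2, and 0.1, so three components are needed to pass 80%.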

16. Can PCA be used for classification
PCA itself is not a classification method, but it can be used as a preprocessing step. By reducing the dimensionality of the data, PCA helps to:

1. **Reduce noise** and irrelevant features, improving the performance of classification algorithms.
2. **Speed up computation**, especially with large datasets.
3. **Improve visualization** by projecting data onto 2D or 3D space for easier interpretation.

After applying PCA, you can use a classification algorithm (e.g., SVM, Logistic Regression) on the reduced data.

17. What are the limitations of PCA
Limitations of PCA:

1. **Assumes linearity** – can't capture complex non-linear relationships.  
2. **Loses interpretability** – transformed features (PCs) are not easily interpretable.  
3. **Sensitive to scaling** – requires feature scaling for meaningful results.  
4. **Affected by outliers** – outliers can distort the direction of principal components.  
5. **Only captures variance** – may ignore features important for classification if they have low variance.

18. How do KNN and PCA complement each other
KNN and PCA complement each other well:

1. **PCA reduces dimensionality**, which helps **mitigate the Curse of Dimensionality** in KNN.  
2. PCA removes noise and redundant features, improving KNN’s accuracy.  
3. Both methods are scale-sensitive, so the **feature scaling applied before PCA** also benefits KNN's distance-based approach.  
4. PCA speeds up KNN by reducing computation on fewer features.  
5. Together, they enhance both performance and efficiency.

19. How does KNN handle missing values in a dataset
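KNN does not handle missing values inherently, so they must be imputed first:

1. **KNN imputation** fills each missing value from the k nearest neighbors (mean or median for numeric features, mode for categorical ones).  
2. Some libraries support this directly, for example scikit-learn's `KNNImputer`.  
3. The KNN classifier and regressor themselves are not built to handle missing data, so **imputation is done before training**.  
4. Scale the features after imputation so the final KNN model compares them on an equal footing.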

20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?
**Key differences between PCA and LDA**:

1. **PCA** is unsupervised; **LDA** is supervised.  
2. **PCA** maximizes variance; **LDA** maximizes class separability.  
3. **PCA** doesn’t use class labels; **LDA** uses them.  
4. **PCA** works for any data; **LDA** mainly for classification tasks.  
5. **PCA** components are orthogonal; **LDA** components may not be.

21. Train a KNN Classifier on the Iris dataset and print model accuracy


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict & evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)



In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predict & evaluate
y_pred = knn_reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Euclidean (default)
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test))

# Manhattan
knn_manhattan = KNeighborsClassifier(metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)


24. Train a KNN Classifier with different values of K and visualize decision boundaries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load data; keep only the first two features so the boundary can be drawn in 2D
X, y = load_iris(return_X_y=True)
X = X[:, :2]

# Grid of points covering the feature space
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Train one classifier per K and plot its decision regions
k_values = [1, 5, 15]
fig, axes = plt.subplots(1, len(k_values), figsize=(15, 4))
for ax, k in zip(axes, k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k', s=20)
    ax.set_title(f"K = {k}")
    ax.set_xlabel("Sepal length (cm)")
    ax.set_ylabel("Sepal width (cm)")

plt.tight_layout()
plt.show()


25. Apply Feature Scaling before training a KNN model and compare results with unscaled data

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without Scaling
knn_raw = KNeighborsClassifier()
knn_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn_raw.predict(X_test))

# With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print("Accuracy without scaling:", acc_raw)
print("Accuracy with scaling:", acc_scaled)

26. Train a PCA model on synthetic data and print the explained variance ratio for each component

In [None]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import pandas as pd

# Create synthetic data
X, _ = make_classification(n_samples=100, n_features=5, random_state=42)

# Apply PCA
pca = PCA()
pca.fit(X)

# Explained variance ratio
print("Explained variance ratio per component:")
print(pd.Series(pca.explained_variance_ratio_))

27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Split & scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN without PCA
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
acc_no_pca = accuracy_score(y_test, knn.predict(X_test_scaled))

# KNN with PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
knn_pca = KNeighborsClassifier()
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy without PCA: {acc_no_pca:.2f}")
print(f"Accuracy with PCA: {acc_pca:.2f}")


28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define KNN classifier
knn = KNeighborsClassifier()

# Hyperparameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and model performance
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best parameters: {best_params}")
print(f"Accuracy of best model: {accuracy:.2f}")


29. Train a KNN Classifier and check the number of misclassified samples

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and count the test samples whose predicted label differs from the true label
y_pred = knn.predict(X_test)
misclassified = (y_test != y_pred).sum()

print(f"Number of misclassified samples: {misclassified} out of {len(y_test)}")

30. Train a PCA model and visualize the cumulative explained variance.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = data.data

# Train PCA model
pca = PCA()
pca.fit(X)

# Cumulative explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', color='b', linestyle='--')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()


31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN Classifier with uniform weights
knn_uniform = KNeighborsClassifier(weights='uniform')
knn_uniform.fit(X_train, y_train)
y_pred_uniform = knn_uniform.predict(X_test)
accuracy_uniform = accuracy_score(y_test, y_pred_uniform)

# Train KNN Classifier with distance weights
knn_distance = KNeighborsClassifier(weights='distance')
knn_distance.fit(X_train, y_train)
y_pred_distance = knn_distance.predict(X_test)
accuracy_distance = accuracy_score(y_test, y_pred_distance)

# Print results
print(f'Accuracy with uniform weights: {accuracy_uniform:.4f}')
print(f'Accuracy with distance weights: {accuracy_distance:.4f}')


32. Train a KNN Regressor and analyze the effect of different K values on performance

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# List of K values to test
k_values = [1, 3, 5, 7, 9]
mse_scores = []

# Train KNN Regressor with different K values
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Print results
for k, mse in zip(k_values, mse_scores):
    print(f'K = {k}, Mean Squared Error: {mse:.4f}')


33. Implement KNN Imputation for handling missing values in a dataset

In [None]:
import numpy as np
from sklearn.impute import KNNImputer

# Small dataset with missing values marked as np.nan
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
    [7.0, np.nan, 6.0],
])

print("Original data with missing values:")
print(X)

# Replace each missing value with the mean of that feature over the 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print("Data after KNN imputation:")
print(X_imputed)


34. Train a PCA model and visualize the data projection onto the first two principal components

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Standardize the features so each contributes equally to the components
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Scatter plot of the projection, colored by class
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title("Iris Data Projected onto the First Two Principal Components")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label="Class")
plt.show()


35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN Classifier using KD Tree
knn_kd_tree = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn_kd_tree.fit(X_train, y_train)
y_pred_kd_tree = knn_kd_tree.predict(X_test)
accuracy_kd_tree = accuracy_score(y_test, y_pred_kd_tree)

# Train KNN Classifier using Ball Tree
knn_ball_tree = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
knn_ball_tree.fit(X_train, y_train)
y_pred_ball_tree = knn_ball_tree.predict(X_test)
accuracy_ball_tree = accuracy_score(y_test, y_pred_ball_tree)

# Compare performance
print(f'Accuracy with KD Tree: {accuracy_kd_tree:.4f}')
print(f'Accuracy with Ball Tree: {accuracy_ball_tree:.4f}')


36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification

# Generate a high-dimensional synthetic dataset (e.g., 100 features)
X, _ = make_classification(n_samples=100, n_features=100, random_state=42)

# Train PCA model
pca = PCA()
pca.fit(X)

# Plot Scree plot (Explained variance ratio)
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.show()


37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate performance
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Print the results
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


38. Train a PCA model and analyze the effect of different numbers of components on accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Analyze the effect of different numbers of PCA components
accuracies = []
components = [1, 2, 3, 4]

for n_components in components:
    # Apply PCA with n_components
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    # Train a KNN Classifier on transformed data
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train_pca, y_train)

    # Predict and calculate accuracy
    y_pred = knn.predict(X_test_pca)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Print the results
for n_components, accuracy in zip(components, accuracies):
    print(f"Accuracy with {n_components} PCA components: {accuracy}")


39. Train a KNN Classifier with different leaf_size values and compare accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

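# List of different leaf_size values (leaf_size affects tree build/query speed and memory,
# not which neighbors are found, so the accuracies are expected to match)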
leaf_sizes = [10, 20, 30, 40, 50]
accuracies = []

# Train KNN Classifier with different leaf_size values
for leaf_size in leaf_sizes:
    knn = KNeighborsClassifier(n_neighbors=3, leaf_size=leaf_size)
    knn.fit(X_train, y_train)

    # Predict and calculate accuracy
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Print the results
for leaf_size, accuracy in zip(leaf_sizes, accuracies):
    print(f"Accuracy with leaf_size={leaf_size}: {accuracy}")


40. Train a PCA model and visualize how data points are transformed before and after PCA

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Visualize data before PCA
plt.figure(figsize=(12, 6))

# Plot data before PCA (First two features)
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='viridis')
plt.title("Data Before PCA")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot data after PCA
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title("Data After PCA")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

plt.tight_layout()
plt.show()


41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))


42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Create a synthetic regression dataset
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define distance metrics
metrics = ['euclidean', 'manhattan']

# Initialize a dictionary to store errors
errors = {}

# Train KNN regressor with different distance metrics and calculate error
for metric in metrics:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    errors[metric] = mse

# Print the errors
for metric, error in errors.items():
    print(f"Mean Squared Error using {metric} distance: {error}")


43. Train a KNN Classifier and evaluate using ROC-AUC score

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model on the integer class labels
knn.fit(X_train, y_train)

# Predict class probabilities for the test set, shape (n_samples, n_classes)
y_pred_prob = knn.predict_proba(X_test)

# Calculate the one-vs-rest ROC-AUC score; multiclass labels are handled internally
roc_auc = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')

print(f"ROC-AUC score: {roc_auc}")


44. Train a PCA model and visualize the variance captured by each principal component

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load Iris dataset
data = load_iris()
X = data.data

# Train PCA model
pca = PCA()
pca.fit(X)

# Get the explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Plot the variance captured by each principal component
plt.figure(figsize=(8, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance)
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.title('Variance Captured by Each Principal Component')
plt.show()


45. Train a KNN Classifier and perform feature selection before training

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and testing sets first, so the test set does not influence feature selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Perform feature selection using SelectKBest, fitted on the training data only
selector = SelectKBest(f_classif, k=2)  # Select the top 2 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_selected, y_train)

# Evaluate performance
y_pred = knn.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy of KNN Classifier after Feature Selection: {accuracy}')


46. Train a PCA model and visualize the data reconstruction error after reducing dimensions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
data = load_iris()
X = data.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model
pca = PCA(n_components=2)  # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

# Reconstruct the data from the reduced dimensions
X_reconstructed = pca.inverse_transform(X_pca)

# Compute reconstruction error (Mean Squared Error)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)

# Plot original vs reconstructed data
plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.title("Original Data")
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=data.target, cmap='viridis')

plt.subplot(1, 2, 2)
plt.title("Reconstructed Data")
plt.scatter(X_reconstructed[:, 0], X_reconstructed[:, 1], c=data.target, cmap='viridis')

plt.show()

print(f'Reconstruction Error: {reconstruction_error}')


47. Train a KNN Classifier and visualize the decision boundary

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load Iris dataset; use only the first two features so the boundary can be drawn in 2D
data = load_iris()
X = data.data[:, :2]
y = data.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_scaled, y)

# Predict the class for every point on a grid covering the feature space
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot decision regions with the data points on top
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title("KNN Decision Boundary (K = 5)")
plt.xlabel("Sepal length (standardized)")
plt.ylabel("Sepal width (standardized)")
plt.show()


48. Train a PCA model and analyze the effect of different numbers of components on data variance.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
data = load_iris()
X = data.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model and analyze variance for different numbers of components
pca = PCA()
pca.fit(X_scaled)

# Plot the cumulative explained variance against the number of components
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title("Cumulative Explained Variance vs Number of Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.show()

# Print the cumulative explained variance for each component
print(f'Cumulative Explained Variance: {np.cumsum(pca.explained_variance_ratio_)}')
