
# 🧠 Complete Guide to KNN, SVM, and K-Means in Scikit-Learn
### For Aspiring Machine Learning Engineers (with generalizable concepts)

This notebook provides a practical and conceptual introduction to **KNN**, **SVM**, and **K-Means** — three cornerstone algorithms in machine learning.  
You’ll not only learn how to implement them in **Scikit-Learn**, but also how to think about them in a language-agnostic, engineering-focused way.


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.datasets import load_iris, make_classification, make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans

plt.style.use("seaborn-v0_8")
np.random.seed(42)



## 🔹 K-Nearest Neighbors (KNN)

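**Idea:** Predict a sample's class by majority vote among its *k* closest training points.  
There is no real training phase: `fit` only stores the training data (and may build a search index), so KNN is a **lazy learner** and the work happens at prediction time.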

**Pros:** Simple, interpretable, flexible.  
**Cons:** Slow on large data, sensitive to scaling and irrelevant features.
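
To make the idea concrete, here is a minimal from-scratch sketch of KNN classification in NumPy. It is an illustration only, not how Scikit-Learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each row of X_query by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbors' labels (ties go to the label seen first)
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Hypothetical toy data: two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_pred)
```

Every prediction scans the full training set, which is why plain KNN slows down as the data grows.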


In [None]:

iris = load_iris() # iris dataset of flowers with 4 features and 3 classes
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler() # standardize features so that no single feature dominates the distance computation
X_train = scaler.fit_transform(X_train) # fit the scaler on the training data only, then transform it to mean=0, std=1 per feature
X_test = scaler.transform(X_test) # reuse the training mean and std; the test set ends up close to, but not exactly, mean=0, std=1

knn = KNeighborsClassifier(n_neighbors=5) # k-nearest neighbors classifier with k=5
knn.fit(X_train, y_train) # fitting the model on training data
y_pred = knn.predict(X_test) # predicting the labels for test data

print("KNN Accuracy:", accuracy_score(y_test, y_pred)) # printing accuracy of the model
print(classification_report(y_test, y_pred)) # detailed classification report


### Plotting k values vs. accuracy to find the best k

In [None]:

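# Note: scoring each k on the test set is fine for building intuition,
# but choosing k this way leaks test information into model selection.
k_values = range(1, 21)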
scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

plt.plot(k_values, scores, marker='o')
plt.title("KNN Accuracy vs k")
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy")
plt.show()
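
The loop above scores each k on the test set, which is fine for intuition but makes the reported accuracy optimistic. A common alternative is to pick k by cross-validation on the training data only. The sketch below uses `GridSearchCV` with a `Pipeline` so that scaling is refit inside each fold; the 5-fold setting and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=42, stratify=y_cv
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": list(range(1, 21))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
print("Best k (cross-validated):", best_k)
print("Mean CV accuracy:", round(search.best_score_, 3))
# The test set is used once, after k has been chosen
print("Held-out test accuracy:", round(search.score(X_te, y_te), 3))
```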



## ⚔️ Support Vector Machines (SVM)

**Idea:** Find the optimal boundary (hyperplane) that maximizes the margin between classes.

**Key features:**
- Works for linear and nonlinear data (via **kernels**)
- Strong theoretical foundation
- Effective in high-dimensional feature spaces
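
To see what "margin" means in practice, the sketch below fits a linear SVM on two separable blobs and reads the margin width off the learned weights as `2 / ||w||`. The synthetic data and the large `C` (used here to approximate a hard margin) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a linear boundary exists
X_sep, y_sep = make_blobs(n_samples=100, centers=[[-3, -3], [3, 3]],
                          cluster_std=0.8, random_state=0)

# A large C penalizes margin violations heavily (approximately a hard margin)
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X_sep, y_sep)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors sit on the margin, where |w.x + b| is approximately 1
sv_scores = np.abs(clf.decision_function(clf.support_vectors_))

print("Number of support vectors:", len(clf.support_vectors_))
print("Margin width:", round(margin_width, 3))
print("|decision_function| at support vectors:", np.round(sv_scores, 3))
```

Only the support vectors determine the boundary; removing any other training point would leave the fitted hyperplane unchanged.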


In [None]:

# Linear-kernel SVM: the decision boundary is a hyperplane in the original feature space.
# The kernel is set explicitly because SVC defaults to 'rbf'; C=1.0 is the default regularization strength.
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_train, y_train) # fitting the model on training data
y_pred = svm_linear.predict(X_test) # predicting the labels for test data

print("SVM (Linear) Accuracy:", accuracy_score(y_test, y_pred)) # printing accuracy of the model
print(classification_report(y_test, y_pred)) # detailed classification report


In [None]:

# Non-linear SVM visualization (RBF Kernel)
X_blob, y_blob = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, n_clusters_per_class=1, class_sep=1.0, random_state=42)

# RBF-kernel SVM: gamma=0.7 controls how far the influence of a single training example reaches
# (larger gamma means a shorter reach and a more flexible boundary); C=1.0 is the default regularization strength.
svm_rbf = SVC(kernel='rbf', gamma=0.7, C=1.0)
svm_rbf.fit(X_blob, y_blob) # fitting the model on training data

# Build a mesh grid, i.e. a dense grid of points covering the feature space.
# Predicting the class of every grid point shows which region the model assigns to each class.
x_min, x_max = X_blob[:, 0].min() - 1, X_blob[:, 0].max() + 1
y_min, y_max = X_blob[:, 1].min() - 1, X_blob[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = svm_rbf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.3)
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_blob, cmap=plt.cm.coolwarm, edgecolors="k")
plt.title("SVM with RBF Kernel")
plt.show()
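
The effect of `gamma` is easier to judge with numbers than with a single plot. The sketch below refits the RBF SVM on the same kind of synthetic data for a few `gamma` values and compares training accuracy with 5-fold cross-validated accuracy. The three `gamma` values are arbitrary choices for illustration; a training score far above the cross-validated score usually signals overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_g, y_g = make_classification(n_samples=300, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               class_sep=1.0, random_state=42)

gamma_results = {}
for gamma in [0.01, 0.7, 100.0]:
    model = SVC(kernel="rbf", gamma=gamma, C=1.0)
    # Accuracy on the data the model was fitted on
    train_acc = model.fit(X_g, y_g).score(X_g, y_g)
    # Accuracy estimated on held-out folds
    cv_acc = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X_g, y_g, cv=5).mean()
    n_sv = int(model.n_support_.sum())
    gamma_results[gamma] = (train_acc, cv_acc, n_sv)
    print(f"gamma={gamma:<6} train={train_acc:.3f}  cv={cv_acc:.3f}  support vectors={n_sv}")
```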



## 🌀 K-Means Clustering

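**Idea:** Partition data into *k* clusters by minimizing the sum of squared distances from each point to its assigned cluster center (centroid). Scikit-Learn reports this quantity as `inertia_`.  
It is an unsupervised algorithm: no labels are required.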

**Steps:**
1. Initialize *k* centroids, either randomly or with k-means++ (Scikit-Learn's default).
2. Assign points to nearest centroid.
3. Recalculate centroids.
4. Repeat until convergence.
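
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration with plain random initialization, not Scikit-Learn's implementation; the function name `kmeans_lloyd`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two tight, well-separated groups
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
demo_centroids, demo_labels = kmeans_lloyd(X_demo, k=2)
print(demo_labels)
print(demo_centroids)
```

Because the result depends on the starting centroids, library implementations typically rerun the algorithm several times (the `n_init` parameter in Scikit-Learn) and keep the run with the lowest inertia.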

**Applications:** segmentation, image compression, anomaly detection.


In [None]:

X_kmeans, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) # k must be chosen up front; density-based algorithms such as DBSCAN infer the number of clusters instead
kmeans.fit(X_kmeans)
y_kmeans = kmeans.labels_

plt.scatter(X_kmeans[:, 0], X_kmeans[:, 1], c=y_kmeans, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X', s=200)
plt.title("K-Means Clustering Results")
plt.show()


In [None]:

# Elbow Method to choose k
inertias = []
k_values = range(1, 10)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_kmeans)
    inertias.append(model.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.title("Elbow Method for K Selection")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia (Sum of Squared Distances)")
plt.show()
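
The elbow in an inertia curve can be ambiguous, because inertia always decreases as k grows. One complementary check is the silhouette score, which lies between -1 and 1 and rewards tight, well-separated clusters. The sketch below regenerates the same blobs and scores k from 2 to 9 (the silhouette is undefined for a single cluster); treat the chosen range as an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_sil, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)

sil_scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_sil)
    # Mean silhouette over all samples: higher is better
    sil_scores[k] = silhouette_score(X_sil, labels)
    print(f"k={k}: silhouette={sil_scores[k]:.3f}")

best_k_sil = max(sil_scores, key=sil_scores.get)
print("k with the highest silhouette score:", best_k_sil)
```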



## 🧭 Summary

| Algorithm | Type | Supervised? | Core Idea | Common Use Cases |
|------------|------|-------------|------------|------------------|
| KNN | Instance-based | ✅ Yes | Predict by nearest neighbors | Classification, Regression |
| SVM | Margin-based | ✅ Yes | Find optimal separating boundary | Text, image, and biological data classification |
| K-Means | Centroid-based | ❌ No | Cluster points by proximity to centroids | Segmentation, Anomaly detection |

---

### Engineering Takeaways
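- **Scaling is critical** for KNN, SVM, and K-Means, because all three depend on distances between points.
- Learn the math behind distances and margins; it carries over to any framework or language.
- Visualize results and tune hyperparameters (such as `k`, `C`, and `gamma`) instead of relying on defaults.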
- Data preprocessing often matters more than model choice.
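
The scaling takeaway is easy to check empirically. The sketch below compares KNN with and without standardization using 5-fold cross-validation. It uses Scikit-Learn's wine dataset, which is not used elsewhere in this notebook and is chosen here only because its features have very different ranges.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Same classifier, with and without standardization
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(raw_knn, X_wine, y_wine, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X_wine, y_wine, cv=5).mean()

print(f"KNN without scaling: {raw_acc:.3f}")
print(f"KNN with scaling:    {scaled_acc:.3f}")
```

Without scaling, the features with the largest numeric range dominate the Euclidean distance, so the other features barely influence which neighbors are chosen.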
