# Lecture 13: Classification 
Description: Prof. Guttag introduces supervised learning with nearest neighbor classification using feature scaling and decision trees.

Instructor: John Guttag

## k-nearest neighbors
The k-nearest neighbors (k-NN) algorithm is a simple, intuitive, and widely used machine learning algorithm for classification and regression tasks. It is a form of instance-based learning: rather than building a general internal model, it stores the training instances and consults them directly when making a prediction.

### How k-NN Works

1. **Data Storage**:
   - The algorithm stores all the training data points. Each data point has features (independent variables) and a label (dependent variable).

2. **Choosing k**:
   - `k` is a user-defined constant that specifies the number of nearest neighbors to consider. It can be any positive integer, but small odd values such as 3, 5, or 7 are common; an odd `k` avoids tied votes in two-class problems.

3. **Distance Metric**:
   - A distance metric measures how far apart two data points are; a smaller distance means the points are more similar. The most common metric is Euclidean distance, but Manhattan distance, Minkowski distance (which generalizes both), or Hamming distance (for categorical or binary features) can also be used.

4. **Making Predictions**:
   - **For Classification**:
     - To classify a new data point, the algorithm calculates the distance from this point to all training data points.
     - It then selects the `k` nearest neighbors (data points with the smallest distances).
     - The new data point is assigned the most common class (majority vote) among its `k` nearest neighbors.
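
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own implementation: the function names `euclidean` and `knn_predict` and the toy data are assumptions made for the example.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 3: distance from the query to every stored training point.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 4: keep the k smallest distances and take the most common label.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Step 1: the "model" is just the stored training data (toy values).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

# Step 2: choose k, then predict.
print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))  # prints "a"
```

Sorting every distance is the simplest approach; for large datasets a partial selection or a spatial index would be a more realistic choice.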

### Advantages of k-Nearest Neighbors (k-NN)

1. **Simplicity**:
   - k-NN is easy to understand and implement. It has few hyperparameters (mainly `k` and the distance metric) and no model-building step.

2. **No Training Phase**:
   - Since k-NN is a lazy learning algorithm, it has no explicit training phase beyond storing the data. This makes it fast to set up, although the computational work is deferred to prediction time.

3. **Adaptability**:
   - k-NN can be used for both classification and regression tasks.

4. **No Assumptions about Data**:
   - k-NN makes no assumptions about the underlying data distribution, making it a non-parametric method.

5. **Flexibility with Distance Metrics**:
   - Various distance metrics can be employed, allowing customization based on the specific problem (e.g., Euclidean, Manhattan, Minkowski, Hamming distances).
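
As a rough illustration of how the metrics named above relate, Manhattan and Euclidean distance are the `p = 1` and `p = 2` cases of Minkowski distance, while Hamming distance counts mismatched positions. The helper names below are chosen for this sketch and are not taken from the lecture code.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Minkowski distance with p = 1 (sum of absolute differences)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p = 2 (straight-line distance)."""
    return minkowski(a, b, 2)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


u, v = (0.0, 0.0), (3.0, 4.0)
print(manhattan(u, v))                # 7.0
print(euclidean(u, v))                # 5.0
print(hamming("karolin", "kathrin"))  # 3
```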

### Disadvantages of k-Nearest Neighbors (k-NN)

1. **Computational Complexity**:
   - k-NN can be computationally expensive, especially for large datasets: a brute-force query computes the distance between the new data point and every stored point, so the cost of each prediction grows linearly with the size of the training set.

2. **Storage Requirement**:
   - The algorithm needs to store all the training data, which can be memory-intensive.

3. **Sensitive to Irrelevant Features**:
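   - k-NN's performance can degrade if the data contains irrelevant or redundant features, because every feature contributes to the distance. Feature selection helps remove uninformative features, and feature scaling keeps features with large numeric ranges from dominating the distance.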

4. **Curse of Dimensionality**:
   - As the number of features increases, the distance between data points becomes less informative (a phenomenon known as the curse of dimensionality). This can lead to decreased performance.

5. **Choosing the Right k**:
   - Selecting the optimal value of `k` can be tricky. A small `k` can lead to noisy predictions (overfitting), while a large `k` can smooth out the decision boundary too much (underfitting).

6. **Imbalanced Data**:
   - k-NN can perform poorly on imbalanced datasets because members of the more frequent class tend to dominate the vote among the `k` neighbors.
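
To make the point about feature scaling concrete, the sketch below uses made-up (age, income) rows and z-scaling (subtract the mean, divide by the standard deviation). The data and the helper names `z_scale`, `scale_features`, and `nearest_index` are assumptions for illustration only. Before scaling, income differences swamp age differences; after scaling, both features contribute comparably and the nearest neighbor changes.

```python
import statistics


def z_scale(column):
    """Rescale one feature column to mean 0 and standard deviation 1.

    Assumes the column is not constant (a zero standard deviation would
    cause a division by zero).
    """
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


def scale_features(rows):
    """Apply z-scaling to every feature (column) of a list of rows."""
    columns = list(zip(*rows))
    scaled_columns = [z_scale(column) for column in columns]
    return [tuple(row) for row in zip(*scaled_columns)]


def nearest_index(rows, query_index):
    """Index of the row closest (squared Euclidean) to rows[query_index]."""
    query = rows[query_index]
    best_distance, best_index = None, None
    for i, row in enumerate(rows):
        if i == query_index:
            continue
        distance = sum((a - b) ** 2 for a, b in zip(query, row))
        if best_distance is None or distance < best_distance:
            best_distance, best_index = distance, i
    return best_index


# Hypothetical (age in years, income in dollars) rows.
rows = [(25, 50000), (26, 58000), (60, 51000), (45, 90000)]

print(nearest_index(rows, 0))                  # 2: income dominates the raw distance
print(nearest_index(scale_features(rows), 0))  # 1: after scaling, age matters too
```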


```python
from sklearn.datasets import make_classification
from lecturer_code import random_splits, knn_classification

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
examples = list(zip(X, y))

# Run the k-NN classification with random splits
random_splits(examples, knn_classification, num_splits=10)
```


```text
Precision: 0.86
Recall: 0.76
Accuracy: 0.81
F1 Score: 0.81

(78.8, 13.0, 83.9, 24.3)
```
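
The returned tuple is not labeled in the output. Assuming it holds the average per-split counts of true positives, false positives, true negatives, and false negatives, in that order (an assumption, though one that reproduces the printed metrics), the four scores can be recomputed as follows. The function name `summarize` is chosen for this sketch and is not part of `lecturer_code`.

```python
def summarize(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion counts."""
    precision = tp / (tp + fp)                  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)                     # fraction of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # fraction of all predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return precision, recall, accuracy, f1


# Assumed ordering of the tuple above: (TP, FP, TN, FN).
precision, recall, accuracy, f1 = summarize(78.8, 13.0, 83.9, 24.3)
print(round(precision, 2), round(recall, 2), round(accuracy, 2), round(f1, 2))
# prints: 0.86 0.76 0.81 0.81
```

Under this reading the counts sum to 200, which would correspond to a held-out test set of 20% of the 1000 examples in each split.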