# K Nearest Neighbors

## Summary

**Keywords**:
- supervised learning
- classification
    - binary
    - multiclass
- **lazy learner** - does no work at training time beyond storing the training data; all computation is deferred to prediction
- **non-parametric** - does not make any assumptions about the underlying data distribution


### Assumptions

- Similar things exist in close proximity
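- Distances between examples are meaningful (the algorithm is distance-based)
    - Features with larger ranges dominate the distance, so scaling / standardization is important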
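
To illustrate why scaling matters, here is a minimal sketch in plain Python. The `people` data, the `euclidean` and `standardize` helpers, and the choice of z-score standardization are all assumptions made for this example, not something prescribed by these notes.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical examples: (age in years, income in dollars).
people = [(25, 50_000), (60, 51_000), (27, 56_000), (45, 40_000)]

# Raw distances from person 0: income differences swamp the age differences.
raw_to_1 = euclidean(people[0], people[1])  # 35 years older, $1,000 more
raw_to_2 = euclidean(people[0], people[2])  # 2 years older, $6,000 more

# After standardization both features contribute on a comparable scale.
scaled = standardize(people)
scaled_to_1 = euclidean(scaled[0], scaled[1])
scaled_to_2 = euclidean(scaled[0], scaled[2])

print(raw_to_1 < raw_to_2)        # on raw data, person 1 looks nearest
print(scaled_to_2 < scaled_to_1)  # after scaling, person 2 is nearest
```

On the raw data the 35-year age gap barely registers next to a $1,000 income gap, so the nearest neighbor changes once the features are standardized.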


### Pros

- The algorithm is simple and easy to implement.
- There is no need to train the model.
- There are few hyperparameters to tune (mainly k, plus the choice of distance metric), and tuning them is fairly straightforward.
- The algorithm is versatile. It can be used for classification, regression, and search.
- There are no assumptions about the underlying data distribution, so it's well-suited for non-linear data.

### Cons

- Prediction can be computationally expensive and memory-intensive, especially with a very large dataset, because the entire training set must be stored and searched.
- The algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases.
- It tends to be less accurate than more sophisticated algorithms.
- It is sensitive to irrelevant features and subject to the curse of dimensionality. Dimensionality reduction and careful feature selection are recommended.

### Common Use Cases

- Pattern recognition
- Intrusion / fraud detection
- Search and recommendations

## How It Works

### Algorithm

1. Initialize k to your chosen number of neighbors.
2. For each example in the data:
    1. Calculate the distance between the query example and the current example from the data.
    2. Add the distance and the index of the example to a collection.
3. Sort the collection of distances and indices by distance, in ascending order.
4. Pick the first k entries from the sorted collection.
5. Get the labels of the selected k entries.
6. If regression, return the mean of the k labels. If classification, return the mode of the k labels.
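
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function name `knn_predict`, the `mode` argument, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(features, labels, query, k, mode="classification"):
    """Predict the label of `query` from its k nearest stored examples."""
    # Step 2: distance from the query to every example, stored with its index.
    distances = [(euclidean(row, query), index) for index, row in enumerate(features)]
    # Step 3: sort ascending by distance (equal distances fall back to index order).
    distances.sort()
    # Steps 4-5: take the first k entries and look up their labels.
    nearest_labels = [labels[index] for _, index in distances[:k]]
    # Step 6: mean for regression, mode for classification.
    if mode == "regression":
        return mean(nearest_labels)
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2D.
features = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
classes = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(features, classes, (1.2, 1.4), k=3))                     # a
print(knn_predict(features, classes, (6.4, 6.4), k=3))                     # b
print(knn_predict(features, targets, (1.2, 1.4), k=3, mode="regression"))  # 2.0
```

Sorting every distance costs O(n log n) per query, which is one reason prediction gets slow on large datasets; spatial indexes or partial sorts are common alternatives.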

### Choosing the right k

- Run the KNN algorithm several times with different values of k, and choose the k that gives the fewest errors on data the model has not seen before (e.g. a held-out validation set), not just on the data it was given.
- Things to keep in mind:
    - As we decrease the value of k toward 1, our predictions become less stable. They are more sensitive to noise, and the chance of overfitting is higher.
    - Conversely, as we increase the value of k, our predictions become more stable due to majority voting / averaging, and thus more accurate (up to a certain point). We can often observe a smoother decision boundary.
    - Eventually, the number of errors starts to rise again. At that point we know we have pushed the value of k too far and the model is underfitting.
    - In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) between two labels, we usually make k an odd number to avoid ties.
        - With more than two classes, a common heuristic is to pick a k that is not a multiple of the number of classes. This reduces ties but does not eliminate them.
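
One way to compare values of k is to estimate the error on unseen data with leave-one-out cross-validation: predict each example from all the others and count the mistakes. The sketch below assumes that approach; the helper names, the candidate list, and the one-dimensional toy data (including a deliberately odd "b" point at 36.0 inside the "a" cluster) are made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(features, labels, query, k):
    """Majority vote among the k stored examples nearest to `query`."""
    nearest = sorted((euclidean(row, query), i) for i, row in enumerate(features))[:k]
    return Counter(labels[i] for _, i in nearest).most_common(1)[0][0]


def leave_one_out_error(features, labels, k):
    """Fraction of examples misclassified when each is predicted from the rest."""
    mistakes = 0
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_classify(rest_x, rest_y, features[i], k) != labels[i]:
            mistakes += 1
    return mistakes / len(features)


# Hypothetical 1D data: an "a" cluster, a "b" cluster, and one noisy "b" at 36.0.
features = [(10.0,), (20.0,), (30.0,), (42.0,), (51.0,),
            (70.0,), (81.0,), (90.0,), (102.0,), (111.0,),
            (36.0,)]
labels = ["a"] * 5 + ["b"] * 5 + ["b"]

candidates = [1, 3, 5, 9]
errors = {k: leave_one_out_error(features, labels, k) for k in candidates}
best_k = min(errors, key=errors.get)  # the first candidate wins if errors are equal

for k in candidates:
    print(k, round(errors[k], 3))
print("best k:", best_k)
```

On this toy data, k = 1 is thrown off by the noisy point, while k = 9 is so large that the bigger class outvotes the smaller one almost everywhere; a moderate k does best. On real data, k-fold cross-validation is usually cheaper than leave-one-out.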

## Improving the Model

- A simple way to select features is greedy forward selection: try the model with each feature individually, keep the best-performing one, then repeat with the remaining features, each time adding whichever feature improves performance the most.
- Weight votes so that closer neighbors get proportionally more voting power than farther ones.
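
As a sketch of weighted voting, the example below gives each neighbor a vote of 1 / distance. Inverse-distance weighting is only one possible scheme and is an assumption here, as are the function names, the small `eps` guard against division by zero, and the toy data.

```python
import math
from collections import Counter, defaultdict


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(features, query, k):
    """Return (distance, index) pairs for the k examples closest to `query`."""
    return sorted((euclidean(row, query), i) for i, row in enumerate(features))[:k]


def majority_vote(features, labels, query, k):
    """Plain KNN: every neighbor gets one vote."""
    nearest = k_nearest(features, query, k)
    return Counter(labels[i] for _, i in nearest).most_common(1)[0][0]


def weighted_vote(features, labels, query, k, eps=1e-9):
    """Weighted KNN: each neighbor votes with weight 1 / (distance + eps)."""
    votes = defaultdict(float)
    for distance, i in k_nearest(features, query, k):
        votes[labels[i]] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)


# Hypothetical 1D data: one very close "a" and two more distant "b" examples.
features = [(1.0,), (4.0,), (5.0,)]
labels = ["a", "b", "b"]
query = (2.0,)

print(majority_vote(features, labels, query, k=3))  # b (two votes against one)
print(weighted_vote(features, labels, query, k=3))  # a (1/1 outweighs 1/2 + 1/3)
```

Weighting also makes exact ties much less likely, since two classes rarely accumulate identical total weights.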