# KNN
K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based ("lazy") learning algorithm used for classification and regression. It makes predictions by finding the *k* training points (neighbors) closest to a new input and taking a majority vote of their labels (classification) or the average of their values (regression).

The core steps for making a prediction on a new data point are:

1. Calculate Distance: Measure the distance (e.g., Euclidean, Manhattan) from the new point to all points in the training set.
2. Find Neighbors: Identify the *k* points with the smallest distances.
3. Aggregate for Output:
   - For classification: Assign the most common class among the *k* neighbors.
   - For regression: Assign the average value of the *k* neighbors.
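
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `knn_predict`, its `task` argument, and the toy data are all made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the label or value of x_new from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 3: majority vote (classification) or mean (regression).
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (invented for illustration): two small, well-separated clusters.
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_predict(X_toy, labels, [1.2, 1.1], k=3)
predicted_value = knn_predict(X_toy, values, [1.2, 1.1], k=3, task="regression")
print(predicted_label, predicted_value)
```

Note that there is no training step beyond storing the data, which is what "lazy learner" refers to: all the work happens at prediction time.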


## Critical Parameters and Hyperparameter Tuning
The performance of KNN hinges on several key decisions.

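Choosing 'k' (Number of Neighbors): This is the most critical parameter. A small *k* (like 1) makes the model sensitive to noise and prone to overfitting, while a very large *k* oversmooths the decision boundary and underfits. For binary classification, an odd *k* avoids tied votes; with more than two classes, ties can still occur. The optimal *k* is data-dependent and is typically found using cross-validation, often by plotting validation error against *k* and looking for the point where it levels off (sometimes called the Elbow Method).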
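
One way to pick *k* by cross-validation is sketched below. It is an illustration under assumptions: the built-in iris dataset stands in for real data, the candidate range of odd values from 1 to 21 is arbitrary, and scaling is wrapped in a pipeline so each fold is scaled using only its own training portion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption); substitute your own X and y.
X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 22, 2):  # odd candidate values of k
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 cross-validation folds for this k.
    cv_scores[k] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"best k = {best_k}, mean CV accuracy = {cv_scores[best_k]:.3f}")
```

Plotting `cv_scores` against *k* gives the curve in which one would look for the point where accuracy stops improving.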

Selecting a Distance Metric: The choice defines "closeness." Common metrics include:

- Euclidean Distance: The straight-line distance (default for continuous features).
- Manhattan Distance: Sum of absolute differences (useful for grid-like data).
- Minkowski Distance: A generalized formula; setting p=2 gives Euclidean, p=1 gives Manhattan.
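
The relationship between the three metrics can be checked with a small sketch. The helper `minkowski_distance` and the two points are hypothetical, chosen so the results are easy to verify by hand.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```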

## Other Tuning Parameters

Weights: Neighbors can be weighted uniformly or by the inverse of their distance (weights='distance'), giving closer points more influence.
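
The effect of the `weights` setting is easiest to see on a case where the two options disagree. The sketch below uses a made-up one-dimensional dataset: one class-0 point sits very close to the query, while two class-1 points are farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (invented for illustration).
X_train = np.array([[0.0], [2.0], [2.1]])
y_train = np.array([0, 1, 1])
query = np.array([[0.1]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

# Uniform: each neighbor gets one vote, so class 1 wins two to one.
uniform_pred = uniform_knn.predict(query)[0]
# Distance: votes are weighted by 1/distance, so the point at 0.0
# (weight 1/0.1 = 10) outweighs the two distant class-1 points (about 1.03 combined).
distance_pred = distance_knn.predict(query)[0]
print(uniform_pred, distance_pred)
```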

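Algorithm: Tree-based methods like 'ball_tree' or 'kd_tree' can be faster than 'brute' force for larger datasets, especially when the number of features is small. All of them return the same neighbors; only the search cost differs. In scikit-learn the default, 'auto', chooses a method based on the training data.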
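
The `algorithm` setting changes how neighbors are searched for, not which neighbors are found. The sketch below checks this on synthetic data using scikit-learn's `NearestNeighbors` class; the data sizes and random seed are arbitrary assumptions, and no timing claims are made because speed depends on the dataset and hardware.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data (arbitrary sizes, fixed seed for repeatability).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
queries = rng.normal(size=(5, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    index = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    # kneighbors returns (distances, indices), each of shape (n_queries, n_neighbors).
    results[algorithm] = index.kneighbors(queries)

# The neighbor distances should agree across search methods up to rounding.
same_neighbors = all(
    np.allclose(results["brute"][0], results[name][0])
    for name in ("kd_tree", "ball_tree")
)
print(f"All algorithms agree: {same_neighbors}")
```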

## Best Practice
Use GridSearchCV or RandomizedSearchCV from scikit-learn to systematically test combinations of these parameters (e.g., n_neighbors, weights, metric) and find the best set via cross-validation.

In [None]:
# 1. Import Libraries & Load Data
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

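# Load your own data, for example:
# df = pd.read_csv('your_data.csv')
# X = df.drop('target_column', axis=1)
# y = df['target_column']

# X and y must be defined before the split below. As a stand-in (an assumption,
# not the original data), the built-in iris dataset lets this cell run as written.
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)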

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Feature Scaling (CRITICAL for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Hyperparameter Tuning with Grid Search
param_grid = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# 5. Evaluate Best Model
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Best Params: {grid_search.best_params_}, Test Accuracy: {accuracy:.2f}")
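
The workflow above scales the full training set before the grid search, so each cross-validation fold is scaled with statistics that include its own validation rows. A possible variant, sketched below, wraps the scaler and the classifier in a scikit-learn `Pipeline` so scaling is refit inside every fold. The iris dataset is a stand-in, the step names `scaler` and `knn` are arbitrary, and `stratify=y` is an added assumption rather than part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption); substitute your own X and y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "knn__n_neighbors": list(range(1, 31)),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # raw features; the pipeline scales inside each fold

test_accuracy = search.score(X_test, y_test)
print(f"Best Params: {search.best_params_}, Test Accuracy: {test_accuracy:.2f}")
```

With this layout the fitted `search` object can be applied to unscaled test data directly, since the scaler travels with the model.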

Use KNN when:

- You have a small to moderately sized dataset.
- Interpretability is important (explaining "nearest neighbors" is intuitive).
- Data has a non-linear pattern and you need a simple baseline model.

Avoid KNN when:

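- The dataset is very large: every prediction compares the new point against the stored training points, so prediction is slow and memory use is high.
- The dataset has many features (high-dimensional): distances become less informative as dimensions are added, a problem known as the curse of dimensionality.
- Prediction speed is a critical requirement.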
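
The difficulty with many features can be illustrated with a small experiment on synthetic data. The sketch below draws uniformly random points and compares the distance to the nearest point with the distance to the farthest one; the helper name, point count, and feature counts are arbitrary assumptions. As the number of features grows, the ratio tends toward 1, meaning the "nearest" neighbor is barely closer than any other point.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_features, n_points=1000):
    """Ratio of the smallest to the largest distance from a random query point."""
    points = rng.uniform(size=(n_points, n_features))
    query = rng.uniform(size=n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} features: nearest/farthest distance ratio = {ratio:.2f}")
```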

## Real-World Applications
Due to its intuitive logic, KNN is widely used in:

- Recommendation Systems: Finding users with similar tastes to suggest products or content.
- Pattern Recognition & Security: Detecting fraudulent credit card transactions by identifying anomalous patterns.
- Healthcare: Classifying medical diagnoses, such as predicting the risk of a disease based on similar patient records.
- Finance: Credit scoring, stock market forecasting, and customer profiling.