### K-Nearest Neighbors (KNN) Algorithm Steps

The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric, lazy learning algorithm used for classification and regression tasks. Here are the steps for a classification task:

1.  **Choose a value for K**: This is the number of nearest neighbors to consider. A small K makes the model sensitive to noise (it tends to overfit), while a large K can smooth the decision boundary so much that real structure is lost (it tends to underfit).
2.  **Calculate Distance**: For a new, unclassified data point, calculate its distance to every data point in the training dataset. Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance, which generalizes the other two.
3.  **Find K Nearest Neighbors**: Select the K data points from the training set that are closest (have the smallest distances) to the new data point.
4.  **Vote for Classification**: Count how many of these K neighbors belong to each class.
5.  **Assign Class**: Assign the new data point to the class that has the majority vote among the K nearest neighbors. In case of a tie, different tie-breaking rules can be applied (e.g., choose the class with the smallest average distance to the new point, or choose arbitrarily).
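
The distance metrics named in step 2 are closely related: Minkowski distance of order `p` reduces to Manhattan distance for `p=1` and to Euclidean distance for `p=2`. The sketch below illustrates this with NumPy; `minkowski_distance` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D points (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a, point_b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(9 + 16) = 5

print(manhattan, euclidean)  # 7.0 5.0
```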

KNN is a 'lazy' algorithm because it doesn't build a model during the training phase; instead, it stores the entire training dataset and performs computations only when a prediction is requested.
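
To make the "store now, compute later" behavior concrete, here is a minimal from-scratch sketch of the five steps. It is a toy illustration, not how scikit-learn implements KNN: the class name `LazyKNNClassifier`, the toy data, and the choice of Euclidean distance are all assumptions made for this example. Ties are broken by the smallest average distance, one of the rules mentioned above.

```python
from collections import Counter

import numpy as np


class LazyKNNClassifier:
    """Toy KNN classifier (hypothetical helper, not scikit-learn's implementation)."""

    def __init__(self, k=3):
        self.k = k  # Step 1: the number of neighbors is chosen up front.

    def fit(self, X, y):
        # "Training" only stores the data; nothing is learned here.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict_one(self, x):
        # Step 2: Euclidean distance from x to every stored training point.
        dists = np.sqrt(((self.X_ - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
        # Step 3: indices of the k smallest distances.
        nearest = np.argsort(dists)[: self.k]
        labels, near_dists = self.y_[nearest], dists[nearest]
        # Step 4: count the class labels among the neighbors.
        votes = Counter(labels.tolist())
        top = max(votes.values())
        tied = [label for label, count in votes.items() if count == top]
        # Step 5: majority class; ties go to the class with the smallest mean distance.
        return min(tied, key=lambda c: near_dists[labels == c].mean())

    def predict(self, X):
        return np.array([self.predict_one(x) for x in np.asarray(X, dtype=float)])


# Two well-separated toy clusters (made-up data).
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = LazyKNNClassifier(k=3).fit(X_toy, y_toy)
toy_predictions = toy_model.predict([[1.1, 1.0], [5.1, 5.0]])
print(toy_predictions)  # [0 1]
```

All of the distance work happens inside `predict_one`, which is why prediction cost grows with the size of the stored training set.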

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load a built-in dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

print("Features (X) head:")
display(X.head())
print("Target (y) head:")
display(y.head())
print("Target names:", iris.target_names)

Features (X) head:


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Target (y) head:


| | 0 |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


Target names: ['setosa' 'versicolor' 'virginica']


In [2]:
# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 105 samples
Testing set size: 45 samples


In [10]:
# 3. Initialize the KNN Classifier
knn_model = KNeighborsClassifier(n_neighbors=3)

# 4. Train the model
print("Training the KNN model...")
knn_model.fit(X_train, y_train)
print("Model training complete.")

Training the KNN model...
Model training complete.


In [11]:
# 5. Make predictions on the test set
y_pred = knn_model.predict(X_test)

# 6. Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of KNN model: {accuracy:.2f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy of KNN model: 1.00

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
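
A single 70/30 split gives only one accuracy estimate, and on a small, easy dataset like Iris a perfect score can depend on which samples land in the test set. Since the choice of K matters (step 1), one common way to compare candidate values is cross-validation. The sketch below is one possible approach, not part of the original walkthrough; the candidate K values and the 5-fold setting are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a few candidate K values.
cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5)
    cv_scores[k] = fold_scores.mean()

for k, score in cv_scores.items():
    print(f"K={k:2d}  mean CV accuracy={score:.3f}")

best_k = max(cv_scores, key=cv_scores.get)
print("Best K among those tried:", best_k)
```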



### K-Nearest Neighbors (KNN) Algorithm Steps for Regression

For regression tasks, the first three steps of KNN are the same as for classification; only the way the neighbors are combined into a prediction differs:

1.  **Choose a value for K**: This is the number of nearest neighbors to consider. As in classification, the choice of K controls how sensitive the model is to noise and how smooth its predictions are.
2.  **Calculate Distance**: For a new data point whose target value is unknown, calculate its distance to every data point in the training dataset. Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
3.  **Find K Nearest Neighbors**: Select the K data points from the training set that are closest (have the smallest distances) to the new data point.
4.  **Aggregate Values**: For regression, instead of voting for a class, the algorithm takes the average (or median) of the target values of these K nearest neighbors.
5.  **Assign Predicted Value**: The new data point is assigned this aggregated value as its prediction.
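
The regression steps above can be sketched in a few lines of NumPy. This is a toy illustration with made-up one-dimensional data; `knn_regress` is a hypothetical helper, and Euclidean distance is assumed.

```python
import numpy as np


def knn_regress(X_train, y_train, x_new, k=3, aggregate=np.mean):
    """Predict a value for x_new by aggregating the targets of its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Step 2: Euclidean distance from x_new to every training point.
    dists = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: aggregate the neighbors' targets into the prediction.
    return float(aggregate(y_train[nearest]))


X_small = [[1.0], [2.0], [3.0], [10.0]]
y_small = [10.0, 20.0, 30.0, 100.0]

pred_mean_k3 = knn_regress(X_small, y_small, [2.1], k=3)
pred_mean_k4 = knn_regress(X_small, y_small, [2.1], k=4)
pred_median_k4 = knn_regress(X_small, y_small, [2.1], k=4, aggregate=np.median)

print(pred_mean_k3)    # 20.0 (mean of 20, 30, 10)
print(pred_mean_k4)    # 40.0 (the distant point with target 100 pulls the mean up)
print(pred_median_k4)  # 25.0 (the median is less affected by that point)
```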

In [5]:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load a built-in dataset
diabetes = load_diabetes()
X_reg = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_reg = pd.Series(diabetes.target)

print("Features (X_reg) head:")
display(X_reg.head())
print("Target (y_reg) head:")
display(y_reg.head())

Features (X_reg) head:


| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


Target (y_reg) head:


| | 0 |
|---|---|
| 0 | 151.0 |
| 1 | 75.0 |
| 2 | 141.0 |
| 3 | 206.0 |
| 4 | 135.0 |


In [6]:
# 2. Split the dataset into training and testing sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

print(f"Training set size: {X_train_reg.shape[0]} samples")
print(f"Testing set size: {X_test_reg.shape[0]} samples")

Training set size: 309 samples
Testing set size: 133 samples


In [8]:
# 3. Initialize the KNN Regressor
knn_reg_model = KNeighborsRegressor(n_neighbors=5)

# 4. Train the model
print("Training the KNN regression model...")
knn_reg_model.fit(X_train_reg, y_train_reg)
print("Model training complete.")

Training the KNN regression model...
Model training complete.


In [9]:
# 5. Make predictions on the test set
y_pred_reg = knn_reg_model.predict(X_test_reg)

# 6. Evaluate performance
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Mean Squared Error (MSE) of KNN regression model: {mse:.2f}")
print(f"R-squared (R2) of KNN regression model: {r2:.2f}")

Mean Squared Error (MSE) of KNN regression model: 3222.12
R-squared (R2) of KNN regression model: 0.40
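
To help read these numbers: taking the square root of the MSE gives an RMSE of roughly 56.8 in the units of the target, and an R2 of 0.40 means the model explains about 40% of the variance in the test targets. The sketch below reproduces both metrics by hand and checks them against scikit-learn. The toy values are made up for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up toy targets and predictions.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 12.0])

# MSE: mean of the squared residuals. RMSE: its square root, in target units.
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.2f}  (sklearn: {mean_squared_error(y_true_toy, y_pred_toy):.2f})")
print(f"RMSE: {rmse_manual:.2f}")
print(f"R2:   {r2_manual:.2f}  (sklearn: {r2_score(y_true_toy, y_pred_toy):.2f})")
```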


### Predicting with the KNN Classifier

To predict the class of a new data point with `knn_model`, pass it as a 2D array of shape `(n_samples, n_features)`, even for a single sample (here, shape `(1, 4)`). The features must appear in the same order, and be measured in the same units, as the features used during training.

In [12]:
import numpy as np

# Example new data point for classification.
# Feature order must match training: sepal length, sepal width, petal length, petal width (cm).
new_flower_data = np.array([[5.0, 3.5, 1.3, 0.2]])  # values typical of setosa

# The model was fitted on a DataFrame, so wrap the array with the same column names.
# A bare array also works, but recent scikit-learn versions emit a feature-name warning.
new_flower_df = pd.DataFrame(new_flower_data, columns=X.columns)

# Predict the class of the new data point
predicted_class_index = knn_model.predict(new_flower_df)
predicted_class_name = iris.target_names[predicted_class_index[0]]

print(f"New flower data: {new_flower_data}")
print(f"Predicted class index: {predicted_class_index[0]}")
print(f"Predicted class name: {predicted_class_name}")


New flower data: [[5.  3.5 1.3 0.2]]
Predicted class index: 0
Predicted class name: setosa
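
Beyond the predicted label, it can be useful to see which neighbors produced it. The sketch below refits a classifier with the same settings as above (K=3, same split) on plain NumPy arrays so that it is self-contained, then uses `kneighbors` and `predict_proba` to inspect the vote. The variable names are specific to this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.3, random_state=42
)
clf = KNeighborsClassifier(n_neighbors=3).fit(Xd_train, yd_train)

flower = np.array([[5.0, 3.5, 1.3, 0.2]])

# Distances to, and training-set positions of, the 3 nearest neighbors.
distances, indices = clf.kneighbors(flower)
neighbor_labels = yd_train[indices[0]]

# With the default uniform weights, predict_proba is the share of votes per class.
vote_share = clf.predict_proba(flower)[0]

print("Neighbor distances:", np.round(distances[0], 3))
print("Neighbor classes:", list(iris_demo.target_names[neighbor_labels]))
for name, share in zip(iris_demo.target_names, vote_share):
    print(f"  {name}: {share:.2f}")
```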




### Predicting with the KNN Regressor

Similarly, for the `knn_reg_model`, provide the new data point as a 2D array with the same 10 features, in the same order, as the diabetes dataset used for training.

In [13]:
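# Example new data point for regression.
# The diabetes features are mean-centered and scaled, so realistic values are small and close to zero.
# These values are copied from row 1 of the dataset shown earlier (s5 differs in the last digit).
new_patient_data = np.array([[-0.001882, -0.044642, -0.051474, -0.026328, -0.008449, -0.019163, 0.074412, -0.039493, -0.068330, -0.092204]])

# Wrap the array with the training column names; a bare array also works,
# but recent scikit-learn versions emit a feature-name warning.
new_patient_df = pd.DataFrame(new_patient_data, columns=X_reg.columns)

# Predict the progression value for the new patient
predicted_progression = knn_reg_model.predict(new_patient_df)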

print(f"New patient data: {new_patient_data}")
print(f"Predicted disease progression: {predicted_progression[0]:.2f}")


New patient data: [[-0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412
  -0.039493 -0.06833  -0.092204]]
Predicted disease progression: 116.60
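
The regressor's output is simply the aggregated target of the nearest neighbors. As a check, the sketch below refits a regressor with the same settings as above (K=5, same split) on NumPy arrays and confirms that, with the default uniform weights, the prediction equals the mean target of the 5 neighbors returned by `kneighbors`. The variable names are specific to this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.3, random_state=42
)
reg = KNeighborsRegressor(n_neighbors=5).fit(Xr_train, yr_train)

query = Xr_test[:1]  # one test sample, kept 2D with shape (1, 10)

# Positions of the 5 nearest training samples, then the mean of their targets.
_, neighbor_idx = reg.kneighbors(query)
manual_prediction = yr_train[neighbor_idx[0]].mean()
sklearn_prediction = reg.predict(query)[0]

print(f"Mean of neighbor targets: {manual_prediction:.2f}")
print(f"KNeighborsRegressor:      {sklearn_prediction:.2f}")
```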


