Name : Marcel Zama <br> 
ID: C00260146 <br> 
Date: 25/02/2024

# k-Nearest Neighbors (kNN) model

## 1. Business Understanding

In this notebook, we use the standard k-Nearest Neighbors (kNN) algorithm for classification. kNN is a non-parametric, instance-based learning algorithm used for classification and regression tasks. It is easy to implement and understand, but it has a major drawback: prediction becomes significantly slower as the size of the dataset grows.

The k-Nearest Neighbors (kNN) algorithm is suitable for the following scenarios:<br><br>

1. Classification and Regression: kNN can be used for both classification and regression tasks. In classification, the algorithm assigns the majority class label among its k-nearest neighbors. In regression, it computes the average (or weighted average) of the target values of its k-nearest neighbors.<br>

2. Small to Medium-Sized Datasets: kNN is effective when dealing with small to medium-sized datasets. It does not require training a model beforehand, as it memorizes the entire dataset and uses it during inference. However, this can lead to high memory usage for large datasets.<br>

3. Non-Linear Data: kNN can handle non-linear data distributions, making it versatile in various domains. It can capture complex decision boundaries by using appropriate values of k and distance metrics.<br>

4. Simple to Implement: kNN is easy to implement and understand. It makes few assumptions and has only a small number of hyperparameters, mainly the number of neighbors k and the distance metric, although these still need to be chosen with care.<br>

5. Lazy Learning: kNN is an example of lazy learning, also called instance-based learning. It builds no explicit model during training; it stores the training data and defers all computation to prediction time, when it compares each new point against the stored examples.<br>

6. When Feature Importance is Not Known: kNN does not assume anything about the underlying data distribution or feature importance. With a standard distance metric every feature contributes equally to the distance, which suits cases where feature importance is unclear, provided the features are on comparable scales. Irrelevant features can still degrade the results.<br>

7. Anomaly Detection: kNN can be used for anomaly detection tasks, where abnormal data points are identified based on their proximity to neighboring data points. Anomalies are often points that are distant from their k-nearest neighbors.<br>

8. Recommendation Systems: In collaborative filtering for recommendation systems, kNN can be used to find similar users or items. For example, in movie recommendations, users who have rated movies similarly in the past may have similar tastes.<br>

9. Imbalanced Data: Plain majority voting tends to favor the prevalent class, so kNN needs care on imbalanced datasets. Adjusting the value of k or weighting votes by distance can improve predictions for minority classes.<br>

10. Low Computational Cost During Training: Training is cheap because kNN only stores the dataset. The cost moves to inference: with a brute-force search, the time to predict one point scales linearly with the number of stored training points, so prediction can be slow on large datasets.<br>

11. Human-Level Similarity: kNN can be used in applications where human-level similarity is important, such as image recognition based on visual similarity, recommending similar products based on user preferences, or finding similar documents based on content.<br>

In summary, k-Nearest Neighbors (kNN) is a versatile and simple algorithm suitable for small to medium-sized datasets, non-linear data distributions, and scenarios where feature importance is not well-defined. It is effective for classification, regression, anomaly detection, and recommendation systems, among others. However, it may not perform well on high-dimensional data or large datasets, because it must store the whole dataset and search it at prediction time.
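
The classification and regression rules described in item 1 can be sketched in a few lines of plain Python. This is only a minimal illustration of the idea, not the scikit-learn implementation used later in this notebook; the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy points are made up for the example.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to query (Euclidean)."""
    distances = [(math.dist(p, query), i) for i, p in enumerate(train_points)]
    distances.sort()
    return [i for _, i in distances[:k]]


def knn_classify(train_points, train_labels, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_targets, query, k=3):
    """Regression: unweighted mean of the k nearest neighbors' targets."""
    neighbors = k_nearest(train_points, query, k)
    return sum(train_targets[i] for i in neighbors) / k


# Toy data (hypothetical): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, labels, (1.1, 1.0)))   # query near the first cluster
print(knn_regress(points, targets, (5.0, 5.0)))   # query near the second cluster
```

Every prediction scans all stored points, which is why the cost of kNN sits at prediction time rather than training time.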

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Data Understanding

In [None]:
# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

In [None]:
# Preview the first rows of the features and the target
print("Data preview:")
print("X - Features (first 5 rows):")
print(pd.DataFrame(X, columns=cancer.feature_names).head())
print("\ny - Target (first 5 rows):")
print(pd.DataFrame(y, columns=["Target"]).head())

## 3. Data Preparation

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
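
Standardization matters for kNN because distances are dominated by whichever feature has the largest numeric range. The sketch below illustrates this with plain Python on four hypothetical rows (the values are invented, not taken from the dataset); `standardize` and `nearest_other` are illustrative helpers, not part of scikit-learn.

```python
import math
import statistics

# Hypothetical rows: (large-scale feature, small-scale feature).
rows = [(1000.0, 0.10), (1020.0, 0.90), (1100.0, 0.12), (2000.0, 0.50)]


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_other(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(rows[i], rows[query_index]))


raw_nearest = nearest_other(rows, 0)
scaled_nearest = nearest_other(standardize(rows), 0)
print("Nearest neighbor of row 0 on raw data:   ", raw_nearest)
print("Nearest neighbor of row 0 on scaled data:", scaled_nearest)
```

On the raw values the first feature decides the neighbor almost alone; after scaling, the row that is similar on both features becomes the closest one.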

## 4. Model Training

In [None]:
# Create a kNN classifier
k = 3  # Number of neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=k)

# Train the kNN model
knn_classifier.fit(X_train, y_train)
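
The value k = 3 above is a fixed choice. One common way to pick k, sketched here under assumptions, is to compare cross-validated accuracy over a few candidates on the training split only. The candidate list, the 5-fold setting, and the variable names (`X_raw`, `X_tr`, `cv_scores`, `best_k`) are illustrative; the sketch reloads the unscaled data so that it is self-contained and so that the pipeline can refit the scaler inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reload the raw (unscaled) data with the same split settings as above.
X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

candidate_ks = [1, 3, 5, 7, 9, 11]  # assumption: an arbitrary set of odd values
cv_scores = {}
for k_value in candidate_ks:
    # Scaling is part of the pipeline, so each fold is scaled using only
    # its own training portion.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k_value))
    cv_scores[k_value] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Mean 5-fold CV accuracy per k:", cv_scores)
print("Best k by CV accuracy:", best_k)
```

The test set is left untouched during this search, so it can still give an unbiased estimate in the evaluation step.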

## 5. Model Evaluation

In [None]:
# Make predictions on the testing set
y_pred = knn_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()
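
The numbers in the classification report can be read directly off the confusion matrix. The sketch below shows the arithmetic for a 2x2 matrix laid out as `confusion_matrix` returns it (rows are true labels, columns are predicted labels). The counts are hypothetical, not results from this notebook, and `binary_metrics` is an illustrative helper. In this dataset, label 1 is the benign class, so "positive" here means benign.

```python
def binary_metrics(conf_matrix):
    """Derive accuracy, precision and recall for class 1 from a 2x2 matrix.

    Layout: conf_matrix[true_label][predicted_label], so
    conf_matrix[0][1] counts false positives and conf_matrix[1][0]
    counts false negatives.
    """
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for illustration only.
example_matrix = [[40, 10], [5, 45]]
metrics = binary_metrics(example_matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The same function accepts the `conf_matrix` array computed above, since a 2x2 NumPy array unpacks row by row in the same way.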

To run this notebook, the following packages must be installed first:<br> 
*pip install numpy*<br> 
*pip install pandas*<br> 
*pip install scikit-learn*<br> 
*pip install matplotlib*<br> 
*pip install seaborn*<br> 