<a href="https://colab.research.google.com/github/SahandShabanloueii/ML/blob/main/simple_knn_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hi, this is a simple implementation of a K-Nearest Neighbors (KNN) classifier in Python using the scikit-learn library:**

# K-Nearest Neighbors (KNN) Classifier Overview

The K-Nearest Neighbors (KNN) algorithm is a versatile and widely used classifier in machine learning. It is a non-parametric method, well suited to pattern recognition and data mining tasks. KNN operates on the principle of proximity: it predicts the class of a point from the classes of the training points closest to it.

## How KNN Operates

- **Supervised Learning**: KNN uses labeled training data to predict the classification of new data points.
- **Non-Parametric Nature**: It makes no assumptions about the underlying data distribution.
- **Proximity-Based Predictions**: The algorithm identifies the 'k' closest training examples to the new data point, and the majority class among these neighbors determines the predicted class.
- **Distance Metric**: Euclidean distance is the usual choice (scikit-learn's default, Minkowski with `p=2`, is equivalent to it), but other metrics such as Manhattan or the general Minkowski distance can be used depending on the context.
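- **Lazy Learner**: KNN is a lazy learning algorithm: training mostly consists of storing the training dataset in memory, and the real computation is deferred until a prediction is requested.
- **Versatility**: It works directly on numerical features; categorical features need to be encoded or paired with a suitable distance metric first. With a small 'k' the prediction is sensitive to noisy points and outliers, and a larger 'k' smooths this out.
- **Scalability Concerns**: While simple and effective for small datasets, KNN's prediction time grows with the size of the training set, because each query requires distance calculations against the stored points.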
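
The proximity-based voting described in the list above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation; the function names `euclidean`, `manhattan`, and `knn_predict` and the toy points are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Lazy learning: nothing is precomputed; every stored point is compared to the query.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters labeled 0 and 1.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(points, labels, (0.05, 0.05), k=3)
near_one = knn_predict(points, labels, (1.0, 0.95), k=3, distance=manhattan)
print(near_origin, near_one)
```

Sorting every stored point for each query is what makes this approach slow on large datasets, which is the scalability concern noted in the list.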

# Here we import the tools we need for this project:

In [None]:
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Here we generate a dataset with 20K samples and 8 features, with the random state set to, for example, 34526:

In [None]:
X, Y = make_classification(n_samples=20000, n_features=8, random_state=34526)

As we can see, X is a 20K×8 matrix and Y is a vector of length 20K:

In [None]:
print(f"Shape of X: {X.shape}\nShape of Y: {Y.shape}")

Shape of X: (20000, 8)
Shape of Y: (20000,)


# Convert to pandas DataFrame for easier manipulation:

In [None]:
df = pd.DataFrame(X)
df['label'] = Y

In [None]:
# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.887405 | 0.077387 | 0.039529 | 1.757866 | -0.769964 | -0.763994 | -1.097254 | -0.640758 | 1 |
| 1 | -0.036560 | -0.003198 | 0.002435 | 0.870491 | -0.389032 | -0.782002 | -0.476001 | 0.336703 | 1 |
| 2 | -0.123003 | 0.269931 | 0.115519 | 0.948324 | -1.065840 | -0.945974 | -0.962114 | 0.971849 | 0 |
| 3 | -0.959928 | 0.154601 | 0.066439 | 0.607170 | -1.582343 | -0.556328 | -0.586434 | -0.349004 | 1 |
| 4 | 1.518117 | -1.456027 | -0.594384 | 1.545800 | -1.338545 | 1.299707 | 1.507575 | -0.010668 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 0.223135 | -0.880745 | -0.355405 | 1.893598 | 1.110408 | 0.930847 | 0.382059 | -1.059231 | 1 |
| 19996 | 1.079751 | -0.890832 | -0.363894 | 0.890974 | 0.083314 | 0.780984 | 0.952654 | 0.148024 | 1 |
| 19997 | 0.198326 | 0.516239 | 0.217922 | 1.116608 | 0.193100 | 1.766564 | -1.454714 | -0.466108 | 0 |
| 19998 | -1.216547 | 1.172866 | 0.481306 | -0.662218 | -0.043908 | -1.763031 | -1.536639 | 0.141922 | 0 |


# Now let's split the dataset into two parts: one as the training dataset and the other for testing the model:

In [None]:
# Split the dataset into training and test sets (80% train, 20% test)
train_size = int(0.8 * len(df))
train_set = df[:train_size]
test_set = df[train_size:]

In [None]:
print(f"Shape of the training_set: {train_set.shape}\nShape of the test_set: {test_set.shape}")

Shape of the training_set: (16000, 9)
Shape of the test_set: (4000, 9)
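
The manual shuffle-and-slice above can also be done in one call with scikit-learn's `train_test_split`. The sketch below is an alternative, not the method used in this notebook; it regenerates a smaller dataset (2,000 samples, an assumption to keep it quick), and the `_alt` variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Smaller dataset than the notebook's 20K samples, just to keep the sketch fast.
X_alt, Y_alt = make_classification(n_samples=2000, n_features=8, random_state=34526)

# test_size=0.2 gives the same 80/20 split; shuffling is on by default.
# stratify=Y_alt keeps the class proportions approximately equal in both parts.
X_train_alt, X_test_alt, Y_train_alt, Y_test_alt = train_test_split(
    X_alt, Y_alt, test_size=0.2, random_state=42, stratify=Y_alt
)

print(X_train_alt.shape, X_test_alt.shape)
```

This returns features and labels already separated, so no DataFrame round trip is needed.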


# Before beginning to train our model, we need to separate the features and labels of our datasets:

In [None]:
# Separate features and labels
X_train = train_set.drop('label', axis=1).values
Y_train = train_set['label'].values
X_test = test_set.drop('label', axis=1).values
Y_test = test_set['label'].values

In [None]:
print(f"--> For the training set:\n\tshape(X_train): {X_train.shape}\n\tShape(Y_train): {Y_train.shape}\n")
print(f"--> For the testing set:\n\tshape(X_test): {X_test.shape}\n\tShape(Y_test): {Y_test.shape}")

--> For the training set:
	shape(X_train): (16000, 8)
	Shape(Y_train): (16000,)

--> For the testing set:
	shape(X_test): (4000, 8)
	Shape(Y_test): (4000,)


# Nice, now we use the training dataset to train our KNN model:

## First, let's create our KNN model with sklearn:

In [None]:
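# Create a KNN model: brute-force neighbor search with Euclidean distance.
# Note: with n_neighbors=1, weights='distance' has no effect, since the single nearest neighbor decides the label.
knn = KNeighborsClassifier(n_neighbors=1, weights='distance', algorithm='brute', metric='euclidean')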
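
`n_neighbors=1` is only one possible choice. One common way to compare values of k is cross-validation on the training data. The sketch below assumes a smaller regenerated dataset and an arbitrary candidate list (1, 5, 15); none of this is part of the original notebook, and no particular result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical smaller dataset so the sketch runs quickly.
X_cv, Y_cv = make_classification(n_samples=2000, n_features=8, random_state=34526)

cv_scores = {}
for k in (1, 5, 15):  # candidate values chosen arbitrarily for illustration
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    # 5-fold cross-validation; each fold's score is classification accuracy.
    fold_scores = cross_val_score(model, X_cv, Y_cv, cv=5, scoring='accuracy')
    cv_scores[k] = fold_scores.mean()

for k, score in cv_scores.items():
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```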

## Now we need to fit (train) our model using the training data we prepared earlier:

In [None]:
# Train the model on X_train, Y_train, the 80% of the data split off for this purpose
knn.fit(X_train, Y_train)

## Good, let's test our fitted model using the test_set data:

In [None]:
# Test the model on X_test, the 20% of the data we split off earlier
Y_pred = knn.predict(X_test)

### So this is what the true labels for X_test look like:

In [None]:
Y_test

array([0, 0, 1, ..., 0, 0, 1])

### And this is how our KNN model predicts those labels given only X_test. The model doesn't know the actual labels for the X_test data and has to predict them from what it learned from the training data we gave it earlier:

In [None]:
Y_pred

array([0, 0, 1, ..., 0, 0, 1])

## Cool! The first and last few predictions match the true labels. Now let's calculate the accuracy and see how well the model performs on the whole test set:

In [None]:
# Calculating the accuracy of the model:
accuracy = accuracy_score(Y_test, Y_pred)
print(f'Accuracy of the model: {accuracy*100:.2f}%')

Accuracy of the model: 84.75%
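
For reference, the accuracy reported by `accuracy_score` is the fraction of predictions that match the true labels. A minimal plain-Python sketch of that calculation, using a hypothetical `manual_accuracy` helper and made-up label lists:

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels for illustration: 4 of the 5 predictions are correct.
example_true = [0, 0, 1, 1, 0]
example_pred = [0, 1, 1, 1, 0]
example_accuracy = manual_accuracy(example_true, example_pred)
print(f"Accuracy: {example_accuracy * 100:.2f}%")  # prints "Accuracy: 80.00%"
```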
