# Project Overview
This project explores a dataset with coded features to predict a binary outcome (Y or N). The primary objective was to build and evaluate a K-Nearest Neighbors (KNN) classification model. The work involved a complete machine learning pipeline, from initial data preparation to model evaluation.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Loading Data

In [2]:
df = pd.read_csv("Coded_Data.csv")
df

Unnamed: 0.1,Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10,Result
0,1,53.1,63.4,33.0,46.2,47.3,21.2,44.3,36.1,46.6,65.0,Y
1,2,36.9,54.7,31.1,50.5,56.0,38.9,39.4,56.8,33.0,78.8,N
2,3,41.9,65.5,53.5,52.3,92.5,43.2,94.9,64.7,50.8,67.9,N
3,4,71.7,75.6,37.9,50.5,69.2,52.5,82.3,77.3,80.8,60.9,Y
4,5,74.3,51.8,36.4,40.9,74.7,42.2,65.1,36.2,77.6,74.9,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,58.7,56.4,49.5,38.1,62.8,35.2,43.6,17.9,59.3,71.2,Y
996,997,33.4,52.1,54.7,48.5,85.7,76.2,61.6,40.0,50.8,87.8,N
997,998,66.0,53.6,45.4,56.1,54.6,53.1,22.6,21.8,48.8,73.2,Y
998,999,63.0,47.0,23.6,40.7,97.5,56.6,50.0,59.4,67.7,62.7,Y


**In our dataframe, the `Unnamed: 0` column just repeats the row index (shifted by one), so let's drop it. Passing `index_col=0` to `read_csv` would be an alternative.**

In [3]:
df = df.drop('Unnamed: 0',axis=1)
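Dropping the column after loading is one way to handle a saved index; `read_csv` can also consume it at load time. The sketch below compares the two on a tiny in-memory CSV. The three-row `csv_text` is a hypothetical stand-in for `Coded_Data.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Coded_Data.csv, with the same
# leading unnamed column that appears when a DataFrame index is saved.
csv_text = ",Cd_1,Cd_2,Result\n1,53.1,63.4,Y\n2,36.9,54.7,N\n3,41.9,65.5,N\n"

# Option A: read everything, then drop the redundant column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop("Unnamed: 0", axis=1)

# Option B: tell pandas up front that column 0 holds the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns), list(dropped.index))
print(list(indexed.columns), list(indexed.index))
```

Both options leave the same data columns. The difference is the row labels: option A keeps the default 0-based index, while option B uses the values stored in the file.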

In [4]:
df

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10,Result
0,53.1,63.4,33.0,46.2,47.3,21.2,44.3,36.1,46.6,65.0,Y
1,36.9,54.7,31.1,50.5,56.0,38.9,39.4,56.8,33.0,78.8,N
2,41.9,65.5,53.5,52.3,92.5,43.2,94.9,64.7,50.8,67.9,N
3,71.7,75.6,37.9,50.5,69.2,52.5,82.3,77.3,80.8,60.9,Y
4,74.3,51.8,36.4,40.9,74.7,42.2,65.1,36.2,77.6,74.9,Y
...,...,...,...,...,...,...,...,...,...,...,...
995,58.7,56.4,49.5,38.1,62.8,35.2,43.6,17.9,59.3,71.2,Y
996,33.4,52.1,54.7,48.5,85.7,76.2,61.6,40.0,50.8,87.8,N
997,66.0,53.6,45.4,56.1,54.6,53.1,22.6,21.8,48.8,73.2,Y
998,63.0,47.0,23.6,40.7,97.5,56.6,50.0,59.4,67.7,62.7,Y


**Great, this looks fine now!**

**Let's get an overview of the data with `info()`!**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Cd_1    1000 non-null   float64
 1   Cd_2    1000 non-null   float64
 2   Cd_3    1000 non-null   float64
 3   Cd_4    1000 non-null   float64
 4   Cd_5    1000 non-null   float64
 5   Cd_6    1000 non-null   float64
 6   Cd_7    1000 non-null   float64
 7   Cd_8    1000 non-null   float64
 8   Cd_9    1000 non-null   float64
 9   Cd_10   1000 non-null   float64
 10  Result  1000 non-null   object 
dtypes: float64(10), object(1)
memory usage: 86.1+ KB
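`info()` confirms there are no missing values, but it does not show how the two classes in `Result` are distributed, which matters when accuracy is read later on. A minimal sketch of that check follows. The six-label `result` series is a hypothetical stand-in for `df['Result']`.

```python
import pandas as pd

# Hypothetical labels standing in for df['Result'].
result = pd.Series(["Y", "N", "N", "Y", "Y", "Y"], name="Result")

# Absolute counts and relative shares per class.
counts = result.value_counts()
shares = result.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = shares.max()

print(counts)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any classifier should be compared against this baseline: an accuracy close to it would mean the model has learned little beyond the class frequencies.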


**Let's split the data so that we can fit the `scaler` to the features only**

In [6]:
features = df.select_dtypes(include = ['float64'])
features

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10
0,53.1,63.4,33.0,46.2,47.3,21.2,44.3,36.1,46.6,65.0
1,36.9,54.7,31.1,50.5,56.0,38.9,39.4,56.8,33.0,78.8
2,41.9,65.5,53.5,52.3,92.5,43.2,94.9,64.7,50.8,67.9
3,71.7,75.6,37.9,50.5,69.2,52.5,82.3,77.3,80.8,60.9
4,74.3,51.8,36.4,40.9,74.7,42.2,65.1,36.2,77.6,74.9
...,...,...,...,...,...,...,...,...,...,...
995,58.7,56.4,49.5,38.1,62.8,35.2,43.6,17.9,59.3,71.2
996,33.4,52.1,54.7,48.5,85.7,76.2,61.6,40.0,50.8,87.8
997,66.0,53.6,45.4,56.1,54.6,53.1,22.6,21.8,48.8,73.2
998,63.0,47.0,23.6,40.7,97.5,56.6,50.0,59.4,67.7,62.7


**Let's fit `scaler` to the features now!**

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
df_features_scaled = pd.DataFrame(features_scaled,columns = features.columns)
df_features_scaled

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10
0,-0.122525,0.187569,-0.911832,0.318653,-1.035516,-2.305940,-0.801865,-1.480070,-0.952562,-0.645366
1,-1.086028,-0.433403,-1.024151,0.624941,-0.445471,-1.153296,-1.131088,-0.200556,-1.826218,0.635103
2,-0.788651,0.337459,0.300034,0.753154,2.030007,-0.873275,2.597862,0.287761,-0.682756,-0.376282
3,0.983718,1.058359,-0.622166,0.624941,0.449771,-0.267648,1.751290,1.066595,1.244427,-1.025795
4,1.138354,-0.640394,-0.710839,-0.058864,0.822788,-0.938396,0.595651,-1.473889,1.038861,0.273232
...,...,...,...,...,...,...,...,...,...,...
995,0.210537,-0.312064,0.063573,-0.258307,0.015714,-1.394244,-0.848897,-2.605053,-0.136721,-0.070082
996,-1.294192,-0.618981,0.370973,0.482481,1.568822,1.275723,0.360492,-1.239002,-0.682756,1.470191
997,0.644708,-0.511917,-0.178801,1.023827,-0.540420,-0.228575,-2.259851,-2.363985,-0.811235,0.115493
998,0.466282,-0.983000,-1.467517,-0.073110,2.369113,-0.000651,-0.418892,-0.039844,0.402891,-0.858777


**The `scaler` object is now fitted to the features. `fit_transform()` above already returned the standardized values; calling `.transform()` on the fitted scaler applies the same centering and scaling (subtract each column's mean, divide by its standard deviation), so it should give identical numbers.<br>
Let's pass the features to `scaler.transform()` and store the standardized features in `scaled_features`!**

In [9]:
scaled_features = scaler.transform(features)

In [10]:
scaled_features

array([[-0.12252539,  0.1875694 , -0.91183199, ..., -1.48006982,
        -0.95256187, -0.64536551],
       [-1.08602779, -0.43340316, -1.02415132, ..., -0.20055606,
        -1.82621843,  0.6351032 ],
       [-0.78865051,  0.33745933,  0.30003449, ...,  0.28776079,
        -0.68275617, -0.3762815 ],
       ...,
       [ 0.64470801, -0.51191693, -0.17880055, ..., -2.36398512,
        -0.81123508,  0.11549271],
       [ 0.46628164, -0.98299956, -1.46751711, ..., -0.03984418,
         0.40289057, -0.85877696],
       [-0.39016495, -0.59756832, -1.43204784, ..., -0.56524838,
         0.33865112,  0.01342636]])

In [11]:
scaled_features.shape

(1000, 10)
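To make the standardization step concrete, the sketch below reproduces `StandardScaler` by hand on a small array and checks that `fit_transform()` and `fit()` followed by `transform()` agree. The array holds the first three rows of `Cd_1` and `Cd_2` from the table above, so the resulting values differ from the notebook's, which use statistics from all 1000 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three rows of Cd_1 and Cd_2, used here as a tiny example.
X = np.array([[53.1, 63.4],
              [36.9, 54.7],
              [41.9, 65.5]])

scaler = StandardScaler().fit(X)
scaled = scaler.transform(X)

# Manual standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# fit_transform is fit followed by transform on the same data.
same_as_fit_transform = np.allclose(StandardScaler().fit_transform(X), scaled)

print(np.allclose(scaled, manual), same_as_fit_transform)
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculations that KNN relies on.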

**`scaled_features` is a NumPy array. Let's convert it into a pandas `DataFrame`!<br>
We can use `df.columns[:-1]` (every column except `Result`) to get the column names.**

In [12]:
#Before Scaling
df

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10,Result
0,53.1,63.4,33.0,46.2,47.3,21.2,44.3,36.1,46.6,65.0,Y
1,36.9,54.7,31.1,50.5,56.0,38.9,39.4,56.8,33.0,78.8,N
2,41.9,65.5,53.5,52.3,92.5,43.2,94.9,64.7,50.8,67.9,N
3,71.7,75.6,37.9,50.5,69.2,52.5,82.3,77.3,80.8,60.9,Y
4,74.3,51.8,36.4,40.9,74.7,42.2,65.1,36.2,77.6,74.9,Y
...,...,...,...,...,...,...,...,...,...,...,...
995,58.7,56.4,49.5,38.1,62.8,35.2,43.6,17.9,59.3,71.2,Y
996,33.4,52.1,54.7,48.5,85.7,76.2,61.6,40.0,50.8,87.8,N
997,66.0,53.6,45.4,56.1,54.6,53.1,22.6,21.8,48.8,73.2,Y
998,63.0,47.0,23.6,40.7,97.5,56.6,50.0,59.4,67.7,62.7,Y


In [13]:
cols = df.columns[:-1]
cols

Index(['Cd_1', 'Cd_2', 'Cd_3', 'Cd_4', 'Cd_5', 'Cd_6', 'Cd_7', 'Cd_8', 'Cd_9',
       'Cd_10'],
      dtype='object')

In [14]:
#After Scaling
df_scaled_features = pd.DataFrame(data=scaled_features, columns=cols)
df_scaled_features

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10
0,-0.122525,0.187569,-0.911832,0.318653,-1.035516,-2.305940,-0.801865,-1.480070,-0.952562,-0.645366
1,-1.086028,-0.433403,-1.024151,0.624941,-0.445471,-1.153296,-1.131088,-0.200556,-1.826218,0.635103
2,-0.788651,0.337459,0.300034,0.753154,2.030007,-0.873275,2.597862,0.287761,-0.682756,-0.376282
3,0.983718,1.058359,-0.622166,0.624941,0.449771,-0.267648,1.751290,1.066595,1.244427,-1.025795
4,1.138354,-0.640394,-0.710839,-0.058864,0.822788,-0.938396,0.595651,-1.473889,1.038861,0.273232
...,...,...,...,...,...,...,...,...,...,...
995,0.210537,-0.312064,0.063573,-0.258307,0.015714,-1.394244,-0.848897,-2.605053,-0.136721,-0.070082
996,-1.294192,-0.618981,0.370973,0.482481,1.568822,1.275723,0.360492,-1.239002,-0.682756,1.470191
997,0.644708,-0.511917,-0.178801,1.023827,-0.540420,-0.228575,-2.259851,-2.363985,-0.811235,0.115493
998,0.466282,-0.983000,-1.467517,-0.073110,2.369113,-0.000651,-0.418892,-0.039844,0.402891,-0.858777


Our data is ready for the machine learning part now!

## Let's do the train/test split
We hold out 30% of the rows for testing and fix `random_state` so that the split is reproducible.

In [15]:
from sklearn.model_selection import train_test_split

x = df_features_scaled
y = df['Result']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=30)

X_train

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10
802,-0.306899,-0.069385,0.406442,-0.244061,0.890609,-0.515108,-1.023587,0.448473,0.396467,1.776391
434,0.704183,0.608689,0.696108,-2.010557,-0.689627,0.175176,0.508306,-0.188194,-1.344423,0.468086
900,1.721214,1.165423,-0.107862,-0.101602,-1.666254,1.288748,-0.647332,-0.188194,0.756208,-0.431954
137,-0.532906,-0.240688,-1.148294,-0.001880,0.687145,0.351003,0.555338,-0.855766,1.687680,-0.283494
413,-0.134420,-1.354156,0.235008,0.169071,-0.662499,0.149127,0.199240,-2.673046,1.263699,-0.552578
...,...,...,...,...,...,...,...,...,...,...
500,-1.306087,-1.225678,-0.462555,1.201901,0.171703,-0.384866,-0.929523,0.905883,-0.245928,1.544422
813,0.478177,-1.561146,-1.508898,0.653432,0.849916,0.259833,-0.586863,-0.466349,1.469265,-0.469069
941,1.150249,-0.383440,-1.520721,0.425497,1.033034,-1.257490,-0.418892,-0.416899,0.839719,-0.320609
421,0.085639,-1.960853,0.737489,-0.187078,1.046598,2.050665,1.280972,0.627728,-0.605669,0.746448


In [17]:
y_train

802    N
434    N
900    Y
137    Y
413    Y
      ..
500    Y
813    Y
941    Y
421    N
805    N
Name: Result, Length: 700, dtype: object

In [18]:
X_test

Unnamed: 0,Cd_1,Cd_2,Cd_3,Cd_4,Cd_5,Cd_6,Cd_7,Cd_8,Cd_9,Cd_10
923,0.091586,-0.754596,-0.592609,0.112087,-0.045325,0.884997,1.758009,-1.195733,1.302243,0.560873
921,0.840977,1.065496,1.222235,-1.682900,0.795659,-0.046236,-0.324829,-0.855766,-0.650636,0.291789
516,-1.460723,-0.190724,0.903012,1.344360,0.578631,-1.068638,-1.211714,1.635268,-0.779115,0.421692
87,0.662551,-1.696761,0.666550,0.475358,0.110664,-0.280672,-0.311391,-0.620879,1.494961,1.433076
879,0.157009,-0.490504,-0.787690,1.159163,0.436206,-0.209039,-0.129982,-0.194375,-0.136721,2.574364
...,...,...,...,...,...,...,...,...,...,...
857,2.101857,-0.383440,-0.675370,-1.269768,-0.825270,0.351003,-0.418892,0.887340,-1.106737,-1.814489
782,0.912348,-0.968724,-1.130559,0.567957,-1.564522,-1.146784,-0.257640,-1.782950,0.981046,-2.167082
598,1.245410,-0.397715,0.329592,-0.878005,-0.954130,-1.250977,-0.244203,-0.880491,1.103101,0.254674
93,2.054276,-1.232816,-0.888186,-0.906497,-0.574331,-0.567205,1.012219,-0.979391,-0.027514,-0.821662


In [19]:
y_test

923    Y
921    N
516    N
87     Y
879    Y
      ..
857    Y
782    Y
598    Y
93     Y
554    N
Name: Result, Length: 300, dtype: object
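The cells above fit the scaler on all 1000 rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows this on synthetic data; the random arrays and the `stratify=y` argument are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ten Cd_* features and the Y/N label.
rng = np.random.default_rng(0)
X = rng.normal(50, 15, size=(100, 10))
y = np.where(X[:, 0] > 50, "Y", "N")

# Split first; stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=30, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the test set stays unseen until evaluation. On a dataset of this size the effect on accuracy is likely small, but the habit avoids leakage in general.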

## KNN
Our goal is a model that can predict the class in `Result` for a new data point. We don't know which value of k will work best, so let's start with k = 1. <br>
We need to import `KNeighborsClassifier` from `sklearn.neighbors`.

In [25]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

In [26]:
# Same model with k = 2 neighbors
knn2 = KNeighborsClassifier(n_neighbors=2)
knn2.fit(X_train, y_train)

In [27]:
# Same model with k = 3 neighbors
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train, y_train)
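To show what `n_neighbors` controls, here is a hand-rolled sketch of the KNN prediction rule: find the k closest training points by Euclidean distance and take a majority vote. The `knn_predict` helper and the toy points are hypothetical. `KNeighborsClassifier` uses the same distance and uniform voting by default, but its internals and tie-breaking are not reproduced here.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: two 'N' points near the origin, three 'Y' points near (1, 1).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array(["N", "N", "Y", "Y", "Y"])
query = np.array([0.2, 0.1])

pred_k1 = knn_predict(X_toy, y_toy, query, k=1)
pred_k5 = knn_predict(X_toy, y_toy, query, k=5)
print(pred_k1, pred_k5)  # N Y
```

With k = 1 the query takes the label of its single closest point. With k = 5 all five points vote and the three more distant `Y` points win, which illustrates why the choice of k can change predictions.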

**Let's compare the test-set accuracy of the three models!**

In [29]:
score1 = knn.score(X_test, y_test)
print("Accuracy when 'k = 1':", round(score1*100,2),"%")

score2 = knn2.score(X_test, y_test)
print("Accuracy when 'k = 2':", round(score2*100,2),"%")

score3 = knn3.score(X_test, y_test)
print("Accuracy when 'k = 3':", round(score3*100,2),"%")


Accuracy when 'k = 1': 91.33 %
Accuracy when 'k = 2': 90.33 %
Accuracy when 'k = 3': 92.67 %
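Comparing k values on the test set works for a first look, but picking the best k that way lets the test set influence the model. One common alternative is to score each k with cross-validation on the training data only. The sketch below does this on synthetic data from `make_classification`. The dataset, the range of k values, and `cv=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Odd k values avoid tied votes in a two-class problem.
k_values = list(range(1, 16, 2))

mean_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print("Best k by cross-validated accuracy:", best_k)
```

The chosen k can then be refit on the full training set and scored once on the test set, which keeps the final accuracy an honest estimate.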


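# Overall Statement
A KNN classifier was trained on the standardized training data (700 rows) and evaluated on the held-out test set (300 rows). Accuracy was high for all three values of k tried: 91.33% for k = 1, 90.33% for k = 2, and 92.67% for k = 3. These figures come from a single train/test split, so the small differences between k values should be read with caution.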

This project demonstrates a fundamental understanding of supervised learning techniques and provides a solid foundation for further analysis and model improvement.