# KNN sample code to understand Git and GitHub functionality




## Loading the appropriate packages

We will import NumPy, pandas, and Plotly, along with several classes and helpers from scikit-learn. The KNN classifier itself is imported later, right before we build the model.

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

Let's turn off the scientific notation for floating point numbers.

In [2]:
np.set_printoptions(suppress=True)

## Loading and examining the data

We will load our data from a CSV file into a pandas `DataFrame` object.


In [3]:
df_30 = pd.read_csv('diabetes_ds.csv')

Let's take a look at the data:

In [4]:
df_30

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [6]:
#df_30.drop(['id'],axis=1,inplace=True)
df_30.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [7]:
df_30.loc[df_30.duplicated()]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
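
As a minimal sketch of what the duplicate check does, the toy frame below (hypothetical values, not the diabetes data) contains one repeated row. `duplicated()` marks the second occurrence, so `loc` returns only that row. An empty result, as in the output above, means that no row repeats.

```python
import pandas as pd

# Toy data (hypothetical values): row 2 repeats row 0.
toy = pd.DataFrame({
    "Glucose": [148, 85, 148],
    "BMI": [33.6, 26.6, 33.6],
})

# duplicated() flags every repeat of an earlier row (keep="first" by default).
mask = toy.duplicated()
dupes = toy.loc[mask]

print(mask.tolist())  # [False, False, True]
print(len(dupes))     # 1
```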


Next, we separate the features from the target. The feature matrix `X` holds every column except `Outcome`, and the vector `y` holds the `Outcome` labels. These labels already appear as 0s and 1s in the table above, so no `LabelEncoder` step is needed:

In [8]:
X = df_30.drop(columns=['Outcome'])
y = df_30.Outcome
print(X)
print(y)

     Pregnancies  Glucose  BloodPressure  ...   BMI  DiabetesPedigreeFunction  Age
0              6      148             72  ...  33.6                     0.627   50
1              1       85             66  ...  26.6                     0.351   31
2              8      183             64  ...  23.3                     0.672   32
3              1       89             66  ...  28.1                     0.167   21
4              0      137             40  ...  43.1                     2.288   33
..           ...      ...            ...  ...   ...                       ...  ...
763           10      101             76  ...  32.9                     0.171   63
764            2      122             70  ...  36.8                     0.340   27
765            5      121             72  ...  26.2                     0.245   30
766            1      126             60  ...  30.1                     0.349   47
767            1       93             70  ...  30.4                     0.315   23

[76

## Splitting data

Now, let's split our data into training and test sets. We don't need validation data in this example because we won't be doing model selection here. So, let's use 80% and 20% of the data for training and testing, respectively.

In [9]:
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=0)
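
To make the split sizes concrete, here is a small sketch that uses stand-in arrays with 768 rows (the row count implied by the index shown earlier). The names `X_demo` and `y_demo` are hypothetical and the values are arbitrary; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 768 rows; the values are arbitrary placeholders.
X_demo = np.arange(768 * 2).reshape(768, 2)
y_demo = np.arange(768) % 2

# Same arguments as the call above: 20% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(len(X_tr), len(X_te))  # 614 154
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same rows in each set.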

Let's build our KNN model.

In [11]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()
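
For intuition about what the classifier does at prediction time, the sketch below runs a k-nearest-neighbors vote by hand on toy 2-D points and compares it with scikit-learn. The data and the choice of `k = 3` are assumptions made for illustration, and the manual version ignores tie-breaking details.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points (hypothetical): two well-separated groups.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[0.5, 0.5]])

# Manual KNN with k=3: Euclidean distances, three closest points, majority vote.
distances = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(distances)[:3]
manual_pred = np.bincount(y_toy[nearest]).argmax()

# The scikit-learn classifier should agree on this toy example.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sk_pred = knn.predict(query)[0]

print(manual_pred, sk_pred)  # 0 0
```

Because the vote depends on raw Euclidean distance, features with large numeric ranges can dominate the neighbor search.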


## Assessing the performance

Let's check our accuracies next. First, the training accuracy. For that, we need the model's predictions on the training data. Predict `yhat_train` with `classifier` on `X_train`:

In [13]:
### begin your code here (1 line).
yhat_train = classifier.predict(X_train)
### end your code here.

Let's measure the accuracy:

In [14]:
accuracy_score(y_train, yhat_train)

0.7850162866449512
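
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on five hypothetical labels by comparing a manual mean against `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree.
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)

print(manual, library)  # 0.8 0.8
```

Assuming the training split holds 614 rows (80% of 768), the value above corresponds to 482 correct predictions out of 614.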

Let's check accuracy on the test data. Predict `yhat_test`:

In [16]:
### begin your code here (1 line).
yhat_test = classifier.predict(X_test)
### end your code here.
accuracy_score(y_test, yhat_test)

0.7532467532467533

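The model performs slightly better on the training data (about 0.785) than on the test data (about 0.753). A gap in this direction is expected, because the classifier has already seen the training points.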