# K-Nearest Neighbors
You should build a machine learning pipeline using a k-nearest neighbor model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Train and test a k-nearest neighbor model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection
import sklearn.compose
import sklearn.preprocessing
import sklearn.metrics
import sklearn.neighbors
import sklearn.svm

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/mnist.csv')
df.head()

,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
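
The brief also asks for data exploration. A minimal sketch of three quick checks (class balance, missing values, constant pixel columns) is given here on a tiny hypothetical frame; `demo` and its values are invented for illustration, and only the column layout (`id`, `class`, `pixelN`) follows the dataset above.

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented, only the layout matches.
demo = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106],
    'class': [5, 8, 5, 0, 3, 0],
    'pixel1': [0, 0, 0, 0, 0, 0],       # a pixel that is black in every image
    'pixel2': [0, 12, 255, 0, 0, 31],
})

# How many examples does each digit class have?
print(demo['class'].value_counts().sort_index())

# Total number of missing cells in the whole frame.
print(demo.isna().sum().sum())

# Pixel columns holding a single value carry no information for the model.
pixels = demo.drop(['id', 'class'], axis=1)
constant = pixels.columns[pixels.nunique() == 1].tolist()
print(constant)
```

On the real data the same calls can be run on `df` directly.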


In [3]:
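# Default split: 75% training, 25% test. The rows are shuffled and no
# random_state is set, so the split (and the accuracy below) varies between runs.
df_train, df_test = sklearn.model_selection.train_test_split(df)

print('df size: ', df.shape)
print('df_train size: ', df_train.shape)
print('df_test size: ', df_test.shape)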

df size:  (4000, 786)
df_train size:  (3000, 786)
df_test size:  (1000, 786)
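
Because the split above is not seeded, rerunning the notebook gives a different split each time. The sketch below shows one possible way to make it reproducible and class-balanced with the `random_state` and `stratify` parameters of `train_test_split`. The frame `demo` is a hypothetical stand-in with 10 balanced classes, not the real dataset.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection

# Hypothetical stand-in for the MNIST frame: 400 rows, 10 classes, 40 rows each.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'id': np.arange(400),
    'class': np.repeat(np.arange(10), 40),
    'pixel1': rng.integers(0, 256, size=400),
})

# random_state fixes the shuffle; stratify keeps the class proportions
# identical in the training and test parts.
demo_train, demo_test = sklearn.model_selection.train_test_split(
    demo, test_size=0.25, random_state=0, stratify=demo['class']
)

print(demo_train.shape, demo_test.shape)
print(demo_test['class'].value_counts().sort_index().tolist())
```

With 40 rows per class and a 25% test share, every class contributes 10 rows to the test set.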


In [4]:
# Separate the features from the target. 'id' is a row identifier, not a feature.
x_train = df_train.drop(['id', 'class'], axis=1)
y_train = df_train['class']

x_test = df_test.drop(['id', 'class'], axis=1)
y_test = df_test['class']

# Sanity-check the resulting shapes.
print('x_train size: ', x_train.shape)
print('y_train size: ', y_train.shape)

print('x_test size: ', x_test.shape)
print('y_test size: ', y_test.shape)

x_train size:  (3000, 784)
y_train size:  (3000,)
x_test size:  (1000, 784)
y_test size:  (1000,)


In [5]:
# All pixel columns are integer-valued, so they are all selected here.
numerical_attributes = x_train.select_dtypes(include=['int64']).columns.tolist()

ct = sklearn.compose.ColumnTransformer([
    ('standard_scaling', sklearn.preprocessing.StandardScaler(), numerical_attributes)
])

# Fit the scaler on the training set only, then apply it to both sets,
# so that no statistics from the test set leak into preprocessing.
ct.fit(x_train)
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)

print('x_train size: ', x_train.shape)
print('y_train size: ', y_train.shape)
print('x_test size: ', x_test.shape)
print('y_test size: ', y_test.shape)

x_train size:  (3000, 784)
y_train size:  (3000,)
x_test size:  (1000, 784)
y_test size:  (1000,)
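
Many border pixels in digit images are zero in every row, which raises the question of whether standardizing them divides by zero. The sketch below checks this on a toy matrix (`x_demo` is invented for illustration): `StandardScaler` sets the scale of a zero-variance column to 1, so the transformed column is all zeros rather than NaN.

```python
import numpy as np
import sklearn.preprocessing

# Toy matrix: the first column is constant, like an always-black border pixel.
x_demo = np.array([[0, 10], [0, 20], [0, 30]], dtype=float)

scaler = sklearn.preprocessing.StandardScaler().fit(x_demo)
x_scaled = scaler.transform(x_demo)

print(scaler.scale_)             # the constant column gets a scale of 1.0
print(x_scaled[:, 0])            # the constant column becomes all zeros
print(np.isnan(x_scaled).any())  # no NaN values are produced
```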


In [6]:
# Default hyperparameters: n_neighbors=5, weights='uniform',
# Minkowski distance with p=2 (Euclidean).
model = sklearn.neighbors.KNeighborsClassifier()
model.fit(x_train, y_train)
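
The brief asks to use the model's attributes and methods in practice. The following sketch exercises a few of them on a fitted `KNeighborsClassifier`. As an assumption made to keep the example self-contained, scikit-learn's bundled 8x8 digits dataset (`load_digits`) stands in for the MNIST CSV, and the `_demo` names are illustrative.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

# Assumption: the bundled 8x8 digits dataset stands in for the MNIST CSV.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo, random_state=0, stratify=y_demo
)

knn_demo = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
knn_demo.fit(x_tr, y_tr)

# Attributes that become available after fitting.
print(knn_demo.classes_)         # class labels seen during fit
print(knn_demo.n_features_in_)   # number of input features
print(knn_demo.n_samples_fit_)   # number of stored training samples

# kneighbors returns distances to, and indices of, the k nearest training rows.
distances, indices = knn_demo.kneighbors(x_te[:1])
print(distances.shape, indices.shape)

# predict_proba returns the share of neighbors voting for each class.
proba = knn_demo.predict_proba(x_te[:1])
print(proba.shape)

# score is a shortcut for the mean accuracy on the given data.
print(knn_demo.score(x_te, y_te))
```

The same calls apply to `model` once it has been fitted on the scaled MNIST features.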

In [7]:
# Predict the test labels and report the fraction classified correctly.
y_predicted = model.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)

print(accuracy)

0.896
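
The accuracy above comes from the default hyperparameters. One possible way to explore the main ones (`n_neighbors`, `weights`, and the Minkowski power `p`) is a cross-validated grid search, sketched below. As assumptions, the bundled 8x8 digits dataset again stands in for the MNIST CSV, and the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing

# Assumption: the bundled 8x8 digits dataset stands in for the MNIST CSV.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo, random_state=0, stratify=y_demo
)

# Scaling inside the pipeline is refit on each training fold,
# so the validation folds never influence the scaler.
pipe = sklearn.pipeline.Pipeline([
    ('standard_scaling', sklearn.preprocessing.StandardScaler()),
    ('knn', sklearn.neighbors.KNeighborsClassifier()),
])

# Illustrative grid; step names are prefixed with 'knn__' to reach the classifier.
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

search = sklearn.model_selection.GridSearchCV(pipe, param_grid, cv=3)
search.fit(x_tr, y_tr)

print(search.best_params_)        # best combination found by cross-validation
print(search.best_score_)         # its mean cross-validated accuracy
print(search.score(x_te, y_te))   # accuracy of the refit best model on held-out data
```

Selecting hyperparameters by cross-validation on the training set keeps the test set untouched until the final evaluation.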
