## KNN 

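<p>KNN regression is a non-parametric method that approximates the association between the independent variables and a continuous outcome by averaging the observations in the same neighbourhood. This notebook uses the classification counterpart, <code>KNeighborsClassifier</code>, which assigns each observation the most common class among its k nearest neighbours.</p>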
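The neighbourhood idea can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function `knn_predict` and the toy data are made up for this example, and plain Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    neighbour_targets = [train_y[i] for i in nearest]
    if task == "regression":
        # Regression: average the neighbours' outcomes
        return float(np.mean(neighbour_targets))
    # Classification: majority vote among the neighbours
    return Counter(neighbour_targets).most_common(1)[0][0]


# Toy data, made up for illustration
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = ["a", "a", "a", "b", "b", "b"]
toy_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_predict(toy_x, toy_labels, [0.2, 0.2], k=3)
value = knn_predict(toy_x, toy_values, [5.5, 5.5], k=3, task="regression")
print(label, value)
```

The only difference between the two tasks is the final step: a mean for regression, a majority vote for classification.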

1. **`import sklearn`**: Imports the scikit-learn library, a powerful toolkit for machine learning in Python.

2. **`from sklearn.utils import shuffle`**: Provides utilities to shuffle data, which helps in randomizing the dataset.

3. **`from sklearn.neighbors import KNeighborsClassifier`**: Imports the K-Nearest Neighbors algorithm for classification tasks.

4. **`import pandas as pd`**: Imports the pandas library, which is used for data manipulation and analysis.

5. **`import numpy as np`**: Imports the NumPy library, which provides support for large, multi-dimensional arrays and matrices.

6. **`from sklearn import linear_model, preprocessing`**: Imports modules for linear modeling and preprocessing of data.

7. **`from sklearn.metrics import confusion_matrix`**: Imports the function to compute the confusion matrix, which evaluates the performance of a classification algorithm.

8. **`from sklearn.metrics import f1_score`**: Imports the function to compute the F1 score, a metric that balances precision and recall.

9. **`from sklearn.metrics import accuracy_score`**: Imports the function to compute the accuracy score, which measures the proportion of correctly classified instances.


In [91]:
import sklearn 
from sklearn.utils import shuffle 
from sklearn.neighbors import KNeighborsClassifier 
import pandas as pd 
import numpy as np 
from sklearn import linear_model, preprocessing 
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import f1_score 
from sklearn.metrics import accuracy_score 

In [92]:

data = pd.read_csv("car.data")
print(data.head())


  buying  maint door persons lug_boot safety  class
0  vhigh  vhigh    2       2    small    low  unacc
1  vhigh  vhigh    2       2    small    med  unacc
2  vhigh  vhigh    2       2    small   high  unacc
3  vhigh  vhigh    2       2      med    low  unacc
4  vhigh  vhigh    2       2      med    med  unacc


In [93]:
# Initialise the LabelEncoder to convert categorical values to numeric values
le = preprocessing.LabelEncoder()

buying = le.fit_transform(list(data["buying"]))

maint = le.fit_transform(list(data["maint"]))

door = le.fit_transform(list(data["door"]))

persons = le.fit_transform(list(data["persons"]))

lug_boot = le.fit_transform(list(data["lug_boot"]))

safety = le.fit_transform(list(data["safety"]))

cls = le.fit_transform(list(data["class"]))
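`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, not in order of appearance, and a refitted encoder only remembers its most recent fit. The sketch below uses a short made-up list containing the four class labels to show how the mapping can be read back from `classes_`; the variable names are illustrative.

```python
from sklearn import preprocessing

# A short made-up list containing the four class labels of the car dataset
labels = ["unacc", "acc", "good", "vgood", "unacc"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_[i] is the original label that was mapped to integer i
class_names = list(encoder.classes_)
print(class_names)
print(list(codes))

# inverse_transform turns integer codes back into the original labels
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Reading the names from `classes_` or `inverse_transform` avoids maintaining a hand-written list whose order has to match the encoder.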


In [1]:
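from sklearn.model_selection import train_test_split

predict = "class"  # name of the target column

# Combine the encoded feature columns into one tuple per row
x = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(cls)

# Hold out 10% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

print(x_train, y_test)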

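Because `train_test_split` shuffles the rows randomly, each run of the cell produces a different split and therefore a slightly different accuracy. A minimal sketch of a reproducible, class-balanced split follows; the toy data and the choice of `random_state=0` are assumptions for illustration, while `random_state` and `stratify` are standard parameters of `train_test_split`.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the encoded features and labels (made up for illustration)
features = [(i % 4, i % 3) for i in range(40)]
targets = [0] * 30 + [1] * 10

# random_state fixes the shuffle; stratify keeps the class proportions in both parts
f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.1, random_state=0, stratify=targets
)

print(len(f_train), len(f_test))
```

Stratifying may matter for data like the car dataset, where some classes are rare and a purely random 10% sample can contain very few of them.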

In [96]:
# Initialise the K-Nearest Neighbors classifier with 9 neighbours and Euclidean distance.
# p=3 has no effect here: p sets the power of the Minkowski metric and is not used when metric='euclidean'.
model = KNeighborsClassifier(metric='euclidean', p=3, n_neighbors=9)

model.fit(x_train, y_train) 

# Evaluate the model's accuracy on the test data
acc = model.score(x_test, y_test) 

print(acc) 

0.9075144508670521
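The relationship between `metric` and `p` can be checked directly: Euclidean distance is the Minkowski distance with `p=2`. The sketch below fits both variants on synthetic data (made up for illustration, not the car dataset) and compares their predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
points = rng.integers(0, 4, size=(200, 6))
labels = (points.sum(axis=1) > 9).astype(int)

euclid = KNeighborsClassifier(n_neighbors=9, metric="euclidean").fit(points, labels)
mink_p2 = KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2).fit(points, labels)

# effective_metric_ reports the metric that is actually used after fitting
print(mink_p2.effective_metric_)

queries = rng.integers(0, 4, size=(20, 6))
same = np.array_equal(euclid.predict(queries), mink_p2.predict(queries))
print(same)
```

To actually use a power of 3, the metric would have to be `"minkowski"` with `p=3`, which would be a different model from the one trained in this notebook.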


In [97]:
y_pred = model.predict(x_test)
y_pred

array([2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 3, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 0, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 0, 2, 0,
       0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 0, 0, 2, 0,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2,
       2, 2, 2, 0, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 0, 0,
       0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [98]:
# Predict the class labels for the test data using the trained model
predicted = model.predict(x_test)

# Class names in the order LabelEncoder assigns codes (alphabetical): acc=0, good=1, unacc=2, vgood=3
names = ["acc", "good", "unacc", "vgood"]

# Print the predicted class, the encoded features, and the actual class for each test point
for i in range(len(predicted)):
    print("predicted: ", names[predicted[i]], "Data:", x_test[i], "actual: ", names[y_test[i]])

    # Distances to, and indices of, the 9 nearest training points for this test point
    n = model.kneighbors([x_test[i]], 9, True)

# After the loop, n holds the neighbours of the last test point only
print("N: ", n)


predicted:  unacc Data: (3, 2, 0, 0, 2, 1) actual:  unacc
predicted:  unacc Data: (1, 3, 3, 1, 1, 1) actual:  unacc
predicted:  unacc Data: (1, 1, 0, 0, 0, 2) actual:  unacc
predicted:  unacc Data: (1, 1, 1, 2, 0, 1) actual:  unacc
predicted:  unacc Data: (1, 2, 3, 1, 2, 2) actual:  acc
predicted:  acc Data: (0, 1, 2, 1, 2, 0) actual:  acc
predicted:  acc Data: (0, 2, 1, 1, 2, 0) actual:  acc
predicted:  unacc Data: (1, 0, 1, 2, 2, 2) actual:  acc
predicted:  unacc Data: (3, 2, 0, 1, 0, 2) actual:  acc
predicted:  unacc Data: (0, 2, 3, 1, 2, 1) actual:  unacc
predicted:  unacc Data: (3, 2, 3, 2, 0, 1) actual:  unacc
predicted:  unacc Data: (2, 3, 1, 2, 0, 1) actual:  unacc
predicted:  unacc Data: (3, 3, 3, 2, 0, 2) actual:  unacc
predicted:  unacc Data: (1, 3, 2, 2, 2, 1) actual:  unacc
predicted:  unacc Data: (2, 0, 1, 1, 1, 1) actual:  unacc
predicted:  unacc Data: (3, 3, 3, 2, 1, 1) actual:  unacc
predicted:  unacc Data: (0, 2, 0, 0, 0, 1) actual:  unacc
predicted:  acc Data: (3, 1, 1, 1, 1, 0
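The value returned by `kneighbors` is easier to read on a tiny example. The sketch below uses five made-up points on a line; with `return_distance=True` the call returns a pair of arrays, distances first and training-set indices second, each with one row per query point.

```python
from sklearn.neighbors import KNeighborsClassifier

# Five training points on a line, made up for illustration
train_points = [[0], [1], [2], [10], [11]]
train_labels = [0, 0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(train_points, train_labels)

# For each query row: distances to, and training-set indices of, the 3 nearest points
distances, indices = knn.kneighbors([[1.2]], n_neighbors=3, return_distance=True)
print(distances)
print(indices)
```

The indices refer to rows of the training data passed to `fit`, so they can be used to look up the neighbours' features and labels.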

In [99]:
# Compute the confusion matrix to evaluate the performance of the classification model
cm = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print(cm)

# Compute and print the accuracy score of the model based on the test data
print(accuracy_score(y_test, y_pred))

[[ 35   0  12   0]
 [  3   3   0   0]
 [  1   0 117   0]
 [  0   0   0   2]]
0.9075144508670521
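Accuracy alone hides how the rarer classes fare. As a sketch, the per-class recall, precision, and F1 can be derived from the confusion matrix printed above, where rows are actual classes and columns are predicted classes. The matrix is copied from the output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([
    [35, 0, 12, 0],
    [3, 3, 0, 0],
    [1, 0, 117, 0],
    [0, 0, 0, 2],
])

correct = np.diag(cm)                  # correctly classified count per class
accuracy = correct.sum() / cm.sum()    # overall share of correct predictions
recall = correct / cm.sum(axis=1)      # share of each actual class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()                   # unweighted mean of the per-class F1 scores

print(accuracy)
print(recall)
print(macro_f1)
```

The same macro-averaged figure should be obtainable in the notebook with the already imported `f1_score`, for example `f1_score(y_test, y_pred, average="macro")`.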


## Sources

<https://bookdown.org/tpinto_home/Regression-and-Classification/k-nearest-neighbours-regression.html#:~:text=KNN%20regression%20is%20a%20non,observations%20in%20the%20same%20neighbourhood.>