# Nearest Neighbor Classifiers

A **nearest neighbor** classifier determines the class label of a test instance from the labels of the training examples most similar to it. The justification for using nearest neighbors is best exemplified by the following saying: *“If it walks like a duck, quacks like a duck, and looks like a duck, then it’s probably a duck.”*

A nearest neighbor classifier represents each example as a data point in a **d-dimensional** space, where $d$ is the number of attributes.

Given a test instance, we compute its proximity to the training instances using a chosen proximity measure (for example, Euclidean distance).
The $k$-nearest neighbors of a given test instance $z$ are the $k$ training examples that are closest to $z$.
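
The definition above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: Euclidean distance is assumed as the proximity measure, and the function name `k_nearest` and the toy training set are hypothetical, not part of the dataset used later.

```python
import heapq
import math

def k_nearest(train, z, k):
    """Return the k (features, label) pairs in `train` that are closest to z."""
    # math.dist computes the Euclidean distance between two points
    return heapq.nsmallest(k, train, key=lambda ex: math.dist(ex[0], z))

# hypothetical toy training set: (attribute vector, class label)
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

neighbors = k_nearest(train, z=(1.2, 1.4), k=2)
print(neighbors)
```

Here the two closest training examples both carry label `"A"`, so any vote over them would assign `"A"` to the test point.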

<img src="knn-1.png">
<img src="knn-2.png">

# Algorithm

The algorithm computes the distance (or similarity) between each test instance $z = (x', y')$ and all the training examples $(x, y) \in D$ to determine its nearest neighbor list, $D_z$.
 
Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.
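
One such indexing structure is a k-d tree. The sketch below assumes scikit-learn's `KDTree` is available and uses synthetic random points (the names `points` and `queries` are illustrative). It builds the index once, queries it, and checks the result against a brute-force distance computation.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))   # synthetic training points
queries = rng.random((5, 3))     # synthetic test instances

# build the index once, then answer nearest-neighbor queries against it
tree = KDTree(points)
dist, idx = tree.query(queries, k=3)

# brute-force check: distance from every query to every training point
brute = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
brute_idx = np.argsort(brute, axis=1)[:, :3]

print(np.array_equal(idx, brute_idx))
```

The tree returns the same neighbors as the exhaustive search without comparing each query to every training point, which is where the savings come from on large training sets.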

<img src="knn-Algo.png">

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

$$
\text{Majority Voting:} \quad y' = \underset{v}{\operatorname{argmax}} \sum_{(x_i, y_i) \in D_z} I(v = y_i)
$$

where $v$ is a class label, $y_i$ is the class label for one of the nearest neighbors, and $I(\cdot)$ is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
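
As a minimal sketch of majority voting, the sum of indicator values for each label $v$ is simply a count of how often $v$ appears among the neighbors' labels. The helper name `majority_vote` is hypothetical, and ties are broken in favor of the label seen first, which is an assumption rather than part of the definition.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label v with the largest count of I(v == y_i)."""
    counts = Counter(neighbor_labels)
    # most_common(1) gives the (label, count) pair with the highest count
    return counts.most_common(1)[0][0]

print(majority_vote(["+", "-", "+", "+", "-"]))  # prints "+"
```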

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of $k$, as shown in the figure. One way to reduce this sensitivity is to weight the influence of each nearest neighbor $x_i$ according to its distance from the test instance:

$$w_i = \frac{1}{d(x', x_i)^2}$$

As a result, training examples that are located far away from $z$ have a weaker impact on the classification compared to those that are located close to $z$. Using the distance-weighted voting scheme, the class label can be determined as follows:

$$
\text{Distance-Weighted Voting:} \quad y' = \underset{v}{\operatorname{argmax}} \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i)
$$
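
The distance-weighted rule can be sketched as follows. Euclidean distance is assumed for $d$, the function name `weighted_vote` and the toy neighbors are hypothetical, and a neighbor at distance zero is assumed to decide the label outright, since its weight would otherwise be infinite.

```python
import math
from collections import defaultdict

def weighted_vote(neighbors, z):
    """neighbors: list of (x_i, y_i) pairs; z: the attribute vector x' of the test instance."""
    scores = defaultdict(float)
    for x_i, y_i in neighbors:
        d = math.dist(x_i, z)
        if d == 0:
            # assumption: an exact match decides the label (w_i would be infinite)
            return y_i
        scores[y_i] += 1.0 / d**2  # w_i = 1 / d(x', x_i)^2
    # argmax over class labels of the summed weights
    return max(scores, key=scores.get)

# one close "+" neighbor (distance 1) and two distant "-" neighbors (distance 3)
neighbors = [((1.0, 0.0), "+"), ((0.0, 3.0), "-"), ((3.0, 0.0), "-")]
print(weighted_vote(neighbors, z=(0.0, 0.0)))  # prints "+"
```

With these three neighbors, plain majority voting would pick `"-"` (two votes to one), whereas the weighted scores are 1 for `"+"` and 2/9 for `"-"`, so the single close neighbor wins.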

# Characteristics of Nearest Neighbor Classifiers 

# KNN in Practice

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("housing_price_dataset.csv")
data

| | SquareFeet | Bedrooms | Bathrooms | Neighborhood | YearBuilt | Price |
|---|---|---|---|---|---|---|
| 0 | 2126 | 4 | 1 | Rural | 1969 | 215355.283618 |
| 1 | 2459 | 3 | 2 | Rural | 1980 | 195014.221626 |
| 2 | 1860 | 2 | 1 | Suburb | 1970 | 306891.012076 |
| 3 | 2294 | 2 | 1 | Urban | 1996 | 206786.787153 |
| 4 | 2130 | 5 | 2 | Suburb | 2001 | 272436.239065 |
| ... | ... | ... | ... | ... | ... | ... |
| 49995 | 1282 | 5 | 3 | Rural | 1975 | 100080.865895 |
| 49996 | 2854 | 2 | 2 | Suburb | 1988 | 374507.656727 |
| 49997 | 2979 | 5 | 3 | Suburb | 1962 | 384110.555590 |
| 49998 | 2596 | 5 | 2 | Rural | 1984 | 380512.685957 |


In [2]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SquareFeet | 50000.0 | 2006.37468 | 575.513241 | 1000.0 | 1513.0 | 2007.0 | 2506.0 | 2999.0 |
| Bedrooms | 50000.0 | 3.4987 | 1.116326 | 2.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| Bathrooms | 50000.0 | 1.99542 | 0.815851 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| YearBuilt | 50000.0 | 1985.40442 | 20.719377 | 1950.0 | 1967.0 | 1985.0 | 2003.0 | 2021.0 |
| Price | 50000.0 | 224827.325151 | 76141.842966 | -36588.165397 | 169955.860225 | 225052.141166 | 279373.630052 | 492195.259972 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SquareFeet    50000 non-null  int64  
 1   Bedrooms      50000 non-null  int64  
 2   Bathrooms     50000 non-null  int64  
 3   Neighborhood  50000 non-null  object 
 4   YearBuilt     50000 non-null  int64  
 5   Price         50000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 2.3+ MB


In [4]:
num_inst, num_features = data.shape
# print the unique values of every column
for f in range(num_features):
    print(f, np.unique(data.iloc[:, f]))

0 [1000 1001 1002 ... 2997 2998 2999]
1 [2 3 4 5]
2 [1 2 3]
3 ['Rural' 'Suburb' 'Urban']
4 [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
 2020 2021]
5 [-36588.16539749 -28774.99802221 -24715.24248213 ... 476671.73326267
 482577.16340543 492195.25997202]


In [5]:
data["Neighborhood"].value_counts()

Neighborhood
Suburb    16721
Rural     16676
Urban     16603
Name: count, dtype: int64

In [6]:
# features: every column except the label
X = data.drop(columns=["Neighborhood"])

# label: the neighborhood type
y = data["Neighborhood"]

# split into training set (67%) and test set (33%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [7]:
X_train.info(), X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33500 entries, 23990 to 15795
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   SquareFeet  33500 non-null  int64  
 1   Bedrooms    33500 non-null  int64  
 2   Bathrooms   33500 non-null  int64  
 3   YearBuilt   33500 non-null  int64  
 4   Price       33500 non-null  float64
dtypes: float64(1), int64(4)
memory usage: 1.5 MB
<class 'pandas.core.frame.DataFrame'>
Index: 16500 entries, 33553 to 28203
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   SquareFeet  16500 non-null  int64  
 1   Bedrooms    16500 non-null  int64  
 2   Bathrooms   16500 non-null  int64  
 3   YearBuilt   16500 non-null  int64  
 4   Price       16500 non-null  float64
dtypes: float64(1), int64(4)
memory usage: 773.4 KB


(None, None)

In [8]:
from sklearn.preprocessing import LabelEncoder

def process_data_x(train, test):
    numerical_idx = ["SquareFeet", "Bedrooms", "Bathrooms", "YearBuilt", "Price"]

    # work on copies so the DataFrames passed in are left unchanged
    train, test = train.copy(), test.copy()

    # convert the integer features to float; "Price" (last entry) is already float
    # there are no NaN values in these features
    for col in numerical_idx[:4]:
        train[col] = train[col].astype(float)

    # --------------
    # process test

    # same conversion, applied to the test set's own columns
    for col in numerical_idx[:4]:
        test[col] = test[col].astype(float)

    return train, test

In [9]:
X_train_enc, X_test_enc = process_data_x(X_train, X_test)

In [10]:
X_train_enc

| | SquareFeet | Bedrooms | Bathrooms | YearBuilt | Price |
|---|---|---|---|---|---|
| 23990 | 2561.0 | 3.0 | 1.0 | 2019.0 | 365384.363339 |
| 8729 | 1064.0 | 5.0 | 1.0 | 2016.0 | 98914.614596 |
| 3451 | 2756.0 | 4.0 | 2.0 | 1967.0 | 265441.025324 |
| 2628 | 1731.0 | 4.0 | 2.0 | 2015.0 | 248259.953718 |
| 38352 | 2794.0 | 5.0 | 1.0 | 1992.0 | 286485.264621 |
| ... | ... | ... | ... | ... | ... |
| 11284 | 2166.0 | 5.0 | 3.0 | 1996.0 | 324396.846219 |
| 44732 | 2463.0 | 4.0 | 1.0 | 1953.0 | 319266.944411 |
| 38158 | 2812.0 | 4.0 | 2.0 | 2010.0 | 248092.662727 |
| 860 | 2188.0 | 3.0 | 1.0 | 1979.0 | 132414.177622 |
