# $K$-Nearest Neighbors

## What is Machine Learning?

**Learning:** devising a rule for making a decision based on inputs.

- Inputs: $\mathrm{x}$
- Decision: $y$
- Goal of Machine Learning: Estimate the rule $f$ from data $(\mathrm{x}_{i}, y_i)$

The decision $y$ is typically called the **target** or **label**.

## Two Types of Machine Learning Problems

Machine learning problems are grouped into two types, based on the type of $y$:

- **Regression**: The label $y$ is quantitative.
- **Classification**: The label $y$ is categorical. 

Note that the input feature $\mathrm{x}$ may be categorical, quantitative, or a mix of the two.

We will initially focus on regression problems.

## Predicting Wine Quality
Orley Ashenfelter, an economics professor, used summer temperature and winter rainfall to predict the price of Bordeaux wines.

In [34]:
! cd data && wget https://dlsun.github.io/pods/data/bordeaux.csv

import pandas as pd

df = pd.read_csv("data/bordeaux.csv", index_col="year")
df

| year | price | summer | har | sep | win | age |
|------|-------|--------|-----|------|-----|-----|
| 1952 | 37.0 | 17.1 | 160 | 14.3 | 600 | 40 |
| 1953 | 63.0 | 16.7 | 80 | 17.3 | 690 | 39 |
| 1955 | 45.0 | 17.1 | 130 | 16.8 | 502 | 37 |
| 1957 | 22.0 | 16.1 | 110 | 16.2 | 420 | 35 |
| 1958 | 18.0 | 16.4 | 187 | 19.1 | 582 | 34 |
| 1959 | 66.0 | 17.5 | 187 | 18.7 | 485 | 33 |
| 1960 | 14.0 | 16.4 | 290 | 15.8 | 763 | 32 |
| 1961 | 100.0 | 17.3 | 38 | 20.4 | 830 | 31 |
| 1962 | 33.0 | 16.3 | 52 | 17.2 | 697 | 30 |
| 1963 | 17.0 | 15.7 | 155 | 16.2 | 608 | 29 |


## Visualizing the Data

In [76]:
import plotly.express as px
import plotly.graph_objects as go

fig1 = px.scatter(df[~df["price"].isnull()], 
                  x="win", y="summer", color="price")

fig2 = px.scatter(df[df["price"].isnull()], 
                  x="win", y="summer", symbol_sequence=["circle-open"])

# build a one-row DataFrame for the vintage to highlight (transpose so 1986 is a row)
df_highlight = pd.DataFrame(df.loc[1986]).T

# add highlight to fig2
fig2.add_scatter(x=df_highlight['win'], 
                 y=df_highlight['summer'], 
                 mode='markers+text', 
                 marker=dict(size=10, color='red', symbol='star'),
                 text=["1986"],
                 textposition="top center",
                 name='1986')

go.Figure(data=fig1.data + fig2.data, layout=fig1.layout)


What would you predict is the quality of the 1986 wine?

<u>Insight:</u> The "closest" wines are low quality, so the 1986 vintage is probably low quality as well.

This is the intuition behind **$k$-nearest neighbors** regression.

Today: implementing $k$-nearest neighbors

## $K$-Nearest Neighbors

The data for which we know the label $y$ is called the **training data**.

The data for which we don't know $y$ (and want to predict it) is called the **test data**.

In [None]:
df_train = df.loc[:1980].copy()
df_test = df.loc[1981:].copy()
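
The split above relies on label-based slicing with `.loc`, which includes both endpoints. A minimal sketch on a made-up frame (the name `toy` and its prices are assumptions for illustration, not the Bordeaux data):

```python
import pandas as pd

# Toy frame indexed by year; the prices are made up for illustration.
toy = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index([1979, 1980, 1981, 1982], name="year"),
)

toy_train = toy.loc[:1980].copy()  # label slice: the row labeled 1980 is included
toy_test = toy.loc[1981:].copy()   # starts at the row labeled 1981

print(list(toy_train.index))  # [1979, 1980]
print(list(toy_test.index))   # [1981, 1982]
```

Because the two slices meet at consecutive labels, every row lands in exactly one of the two sets.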

The $k$-nearest neighbors algorithm for regression:

1. For each observation in the test data, find the $k$ "nearest" observations in the training data based on input features $\mathrm{x}$ (e.g., summer temperature and winter rainfall).
2. To predict the label $y$ (e.g., price) for the test observation, average the labels of these $k$ "nearest" training observations.
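
The two steps can be summarized in one formula. The symbols $N_k(\mathrm{x})$ and $\hat{y}$ are notation introduced here for illustration: $N_k(\mathrm{x})$ is the set of indices of the $k$ training observations nearest to the test input $\mathrm{x}$, and $\hat{y}(\mathrm{x})$ is the predicted label.

$$
\hat{y}(\mathrm{x}) = \frac{1}{k} \sum_{i \in N_k(\mathrm{x})} y_i
$$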

## $K$-Nearest Neighbors from Scratch

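Before computing distances, we should scale the variables. Winter rainfall takes values in the hundreds, while summer temperature varies by only a degree or two, so unscaled distances would be driven almost entirely by rainfall.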

In [None]:
X_train = df_train[["win", "summer"]]
y_train = df_train["price"]

# standardize the features
X_train_mean = X_train.mean()
X_train_std = X_train.std()
X_train_scaled = (X_train - X_train_mean) / X_train_std
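
To see why scaling matters, here is a small sketch using the `win` and `summer` values of the first ten vintages shown earlier. It standardizes with the statistics of these ten rows only, which is an assumption made for this example; the cell above uses the full training data.

```python
import numpy as np
import pandas as pd

# Winter rainfall and summer temperature for the first ten vintages shown earlier.
X = pd.DataFrame(
    {
        "win": [600, 690, 502, 420, 582, 485, 763, 830, 697, 608],
        "summer": [17.1, 16.7, 17.1, 16.1, 16.4, 17.5, 16.4, 17.3, 16.3, 15.7],
    },
    index=pd.Index(
        [1952, 1953, 1955, 1957, 1958, 1959, 1960, 1961, 1962, 1963], name="year"
    ),
)

# Raw distance between 1952 and 1953: the rainfall gap (90) swamps
# the temperature gap (0.4).
raw_diff = X.loc[1952] - X.loc[1953]
raw_dist = np.sqrt((raw_diff ** 2).sum())

# Standardize using these ten rows only (assumption for this sketch).
X_scaled = (X - X.mean()) / X.std()
scaled_diff = X_scaled.loc[1952] - X_scaled.loc[1953]

print(raw_dist)     # about 90.0, almost entirely from "win"
print(scaled_diff)  # after scaling, both features contribute comparably
```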

We should scale the test data in exactly the same way, using the mean and standard deviation of the training data.

In [None]:
X_test = df_test[["win", "summer"]]
X_test_scaled = (X_test - X_train_mean) / X_train_std
X_test_scaled

| year | win | summer |
|------|-----|--------|
| 1981 | -0.568896 | 0.812183 |
| 1982 | 0.802826 | 1.425581 |
| 1983 | 1.833554 | 1.425581 |
| 1984 | -0.134905 | 0.045437 |
| 1985 | 1.050821 | 0.505485 |
| 1986 | -0.3519 | -0.261262 |
| 1987 | -1.212132 | 0.812183 |
| 1988 | 1.54681 | 0.965533 |
| 1989 | -1.281881 | 3.265772 |
| 1990 | -1.088135 | 3.419122 |


Next, we calculate the (Euclidean) distances between the 1986 vintage in the test data and every vintage in the training data.
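
For reference, the Euclidean distance between two scaled inputs $\mathrm{x}$ and $\mathrm{x}'$ with $p$ features is given below. The symbols $d$, $p$, and $\mathrm{x}'$ are notation introduced here; in this example $p = 2$ (scaled winter rainfall and summer temperature).

$$
d(\mathrm{x}, \mathrm{x}') = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2}
$$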

In [None]:
import numpy as np

dists = np.sqrt(
    ((X_test_scaled.loc[1986] - X_train_scaled) ** 2).sum(axis=1)
)
dists

year
1952    1.259860
1953    1.159726
1955    1.314727
1957    1.149883
1958    0.212597
1959    1.936933
1960    1.557535
1961    2.575503
1962    1.038478
1963    0.983970
1964    1.976971
1965    1.412851
1966    2.007525
1967    1.180230
1968    0.395207
1969    0.320488
1970    0.765065
1971    0.772366
1972    2.004492
1973    1.898753
1974    0.085248
1975    0.922736
1976    2.288442
1977    2.269387
1978    1.729248
1979    1.203287
1980    0.474508
dtype: float64

Now we just need to sort these distances and take the first $k = 5$.

In [52]:
index_nearest = dists.sort_values().index[:5]
index_nearest

Index([1974, 1958, 1969, 1968, 1980], dtype='int64', name='year')
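
As an alternative sketch, pandas also offers `Series.nsmallest`, which returns the smallest values without an explicit sort. The example rebuilds the distances printed above as a standalone Series, so it runs on its own.

```python
import pandas as pd

# Distances from the scaled 1986 vintage to each training vintage,
# copied from the output above.
dists = pd.Series(
    {
        1952: 1.259860, 1953: 1.159726, 1955: 1.314727, 1957: 1.149883,
        1958: 0.212597, 1959: 1.936933, 1960: 1.557535, 1961: 2.575503,
        1962: 1.038478, 1963: 0.983970, 1964: 1.976971, 1965: 1.412851,
        1966: 2.007525, 1967: 1.180230, 1968: 0.395207, 1969: 0.320488,
        1970: 0.765065, 1971: 0.772366, 1972: 2.004492, 1973: 1.898753,
        1974: 0.085248, 1975: 0.922736, 1976: 2.288442, 1977: 2.269387,
        1978: 1.729248, 1979: 1.203287, 1980: 0.474508,
    }
)
dists.index.name = "year"

nearest = dists.nsmallest(5)  # the five smallest distances, in ascending order
print(list(nearest.index))    # [1974, 1958, 1969, 1968, 1980]
```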

Finally, to make a prediction, we average the labels $y$ of these $k = 5$ nearest vintages in the training data.

In [53]:
y_train[index_nearest].mean()

np.float64(13.2)

That’s \$13.20 for a bottle of wine, so the model predicts that 1986 is not a good vintage.

How do we do this for every vintage in the test set?

We can do this for every vintage in the test data by writing a `for` loop.

In [55]:
def calculate_knn_prediction(test_obs_scaled):
    # TODO: determine the k-nearest neighbors to test_obs_scaled
    # TODO: calculate the mean label of the k-nearest neighbors
    pass

for year in range(1981, 1992):
    print(year, calculate_knn_prediction(X_test_scaled.loc[year]))

1981 None
1982 None
1983 None
1984 None
1985 None
1986 None
1987 None
1988 None
1989 None
1990 None
1991 None


If we want the predictions in a Pandas object, we can use the `.apply()` method.

In [56]:
X_test_scaled.apply(calculate_knn_prediction, axis="columns")

year
1981    None
1982    None
1983    None
1984    None
1985    None
1986    None
1987    None
1988    None
1989    None
1990    None
1991    None
dtype: object
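
One possible way to fill in the two TODOs is sketched below. It reuses the steps from the 1986 example (distances, sort, average). The toy data, the years 1901 to 1906, and the `k=2` default are all assumptions made so the example is self-contained; they are not the Bordeaux data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's scaled features and labels (values are made up).
X_train_scaled = pd.DataFrame(
    {"win": [0.0, 1.0, 2.0, 3.0], "summer": [0.0, 0.0, 0.0, 0.0]},
    index=pd.Index([1901, 1902, 1903, 1904], name="year"),
)
y_train = pd.Series([10.0, 20.0, 30.0, 40.0], index=X_train_scaled.index)
X_test_scaled = pd.DataFrame(
    {"win": [0.4, 2.6], "summer": [0.0, 0.0]},
    index=pd.Index([1905, 1906], name="year"),
)

def calculate_knn_prediction(test_obs_scaled, k=2):
    # Euclidean distance from the test observation to every training observation
    dists = np.sqrt(((test_obs_scaled - X_train_scaled) ** 2).sum(axis=1))
    # indices of the k nearest training observations
    index_nearest = dists.sort_values().index[:k]
    # prediction: mean label of those k neighbors
    return y_train[index_nearest].mean()

preds = X_test_scaled.apply(calculate_knn_prediction, axis="columns")
print(preds)  # 1905 -> 15.0, 1906 -> 35.0
```

The function reads `X_train_scaled` and `y_train` from the enclosing scope, as the stub above implies; passing them as arguments would be a reasonable alternative design.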

## $K$-Nearest Neighbors in Scikit-Learn

Scikit-learn provides a built-in model `KNeighborsRegressor` that fits $k$-nearest neighbors regression models.

But first, we need to scale the training and test data.

In [57]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Scale the test data using a scaler that was fit to the training data!
X_test_scaled = scaler.transform(X_test)
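
`StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `.std()` used in the from-scratch version defaults to the sample standard deviation (`ddof=1`). The sketch below, on made-up numbers, suggests the two results differ only by a constant factor. That factor rescales every distance equally, so it should not change which neighbors are nearest.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values, used only to compare the two scaling conventions.
X = pd.DataFrame(
    {"win": [400.0, 500.0, 600.0, 700.0], "summer": [16.0, 16.5, 17.0, 17.5]}
)

by_hand = ((X - X.mean()) / X.std()).to_numpy()  # pandas default: ddof=1
by_sklearn = StandardScaler().fit_transform(X)   # scikit-learn: ddof=0

n = len(X)
ratio = by_sklearn / by_hand
print(ratio)  # every entry equals sqrt(n / (n - 1))
```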

Now, we fit $k$-nearest neighbors using `KNeighborsRegressor`.

In [58]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)
model.fit(X=X_train_scaled, y=y_train)
model.predict(X=X_test_scaled)

array([35.8, 54. , 52.2, 18.4, 35.6, 13.2, 37. , 51.4, 36.6, 36.6, 40.6])
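
To check which training observations drive a prediction, a fitted `KNeighborsRegressor` exposes a `kneighbors()` method that returns distances and row positions of the nearest neighbors. A minimal sketch on made-up one-feature data (`X_toy`, `y_toy`, and `toy_model` are hypothetical names):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data (made up): four training points with labels 10, 20, 30, 40.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

toy_model = KNeighborsRegressor(n_neighbors=2)
toy_model.fit(X=X_toy, y=y_toy)

X_new = np.array([[0.4]])
distances, indices = toy_model.kneighbors(X_new)  # neighbors of the new point
print(indices)                   # [[0 1]]: rows 0 and 1 of X_toy
print(toy_model.predict(X_new))  # [15.], the mean of labels 10 and 20
```

Note that `indices` holds row positions in the training array, not index labels such as years.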

## Pipelines in Scikit-Learn

In the code above, we had to be careful to standardize the training and test data in exactly the same way.

Machine learning models typically involve many more preprocessing steps.

Scikit-Learn's `Pipeline` allows us to chain steps together.


In [60]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5)
)

We can use this `Pipeline` like any other machine learning model.

In [61]:
pipeline.fit(X=X_train, y=y_train)
pipeline.predict(X=X_test)

array([35.8, 54. , 52.2, 18.4, 35.6, 13.2, 37. , 51.4, 36.6, 36.6, 40.6])
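
The matching outputs can be reproduced on synthetic data. The sketch below uses randomly generated features and labels (assumptions for illustration only) and compares the manual scale-then-fit route with the pipeline route.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data on very different scales, loosely mimicking rainfall and temperature.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[600.0, 17.0], scale=[120.0, 0.6], size=(20, 2))
y_tr = rng.uniform(10.0, 100.0, size=20)
X_te = rng.normal(loc=[600.0, 17.0], scale=[120.0, 0.6], size=(5, 2))

# Manual route: fit the scaler on the training data, then reuse it for the test data.
scaler = StandardScaler().fit(X_tr)
manual = KNeighborsRegressor(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
manual_pred = manual.predict(scaler.transform(X_te))

# Pipeline route: the same two steps, chained.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
pipe_pred = pipe.fit(X_tr, y_tr).predict(X_te)

print(np.allclose(manual_pred, pipe_pred))  # True
print(list(pipe.named_steps))  # ['standardscaler', 'kneighborsregressor']
```

`make_pipeline` names each step after its lowercased class name, which is how individual steps can be looked up through `named_steps`.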

## Was Ashenfelter Right?

Ashenfelter was wrong that 1986 would be a disappointing vintage!