# Classification and $K$-Nearest Neighbors

## Recap

So far, we've been looking at **regression** problems, where the label $y$ we are trying to predict is quantitative (e.g., price of wine).

Today, we switch to **classification** problems, where the label $y$ is categorical.

We will focus on the differences between regression and classification.

## Breast Tissue Classification

Electrical signals can be used to detect whether tissue is cancerous.

The goal is to determine whether a sample of breast tissue is:

1. connective tissue
2. adipose tissue
3. glandular tissue
4. carcinoma
5. fibro-adenoma
6. mastopathy

## Reading in the Data

In [1]:
! cd data && wget https://datasci112.stanford.edu/data/BreastTissue.csv
import pandas as pd
df = pd.read_csv("data/BreastTissue.csv")
df



| | Case # | Class | I0 | PA500 | HFS | DA | Area | A/DA | Max IP | DR | P |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | car | 524.794072 | 0.187448 | 0.032114 | 228.800228 | 6843.598481 | 29.910803 | 60.204880 | 220.737212 | 556.828334 |
| 1 | 2 | car | 330.000000 | 0.226893 | 0.265290 | 121.154201 | 3163.239472 | 26.109202 | 69.717361 | 99.084964 | 400.225776 |
| 2 | 3 | car | 551.879287 | 0.232478 | 0.063530 | 264.804935 | 11888.391827 | 44.894903 | 77.793297 | 253.785300 | 656.769449 |
| 3 | 4 | car | 380.000000 | 0.240855 | 0.286234 | 137.640111 | 5402.171180 | 39.248524 | 88.758446 | 105.198568 | 493.701814 |
| 4 | 5 | car | 362.831266 | 0.200713 | 0.244346 | 124.912559 | 3290.462446 | 26.342127 | 69.389389 | 103.866552 | 424.796503 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | 102 | adi | 2000.000000 | 0.106989 | 0.105418 | 520.222649 | 40087.920984 | 77.059161 | 204.090347 | 478.517223 | 2088.648870 |
| 102 | 103 | adi | 2600.000000 | 0.200538 | 0.208043 | 1063.441427 | 174480.476218 | 164.071543 | 418.687286 | 977.552367 | 2664.583623 |
| 103 | 104 | adi | 1600.000000 | 0.071908 | -0.066323 | 436.943603 | 12655.342135 | 28.963331 | 103.732704 | 432.129749 | 1475.371534 |
| 104 | 105 | adi | 2300.000000 | 0.045029 | 0.136834 | 185.446044 | 5086.292497 | 27.427344 | 178.691742 | 49.593290 | 2480.592151 |


We will focus on two features:

- $I_0$: impedivity at 0 kHz
- $PA_{500}$: phase angle at 500 kHz

## Visualizing the Data

In [None]:
# visualize I0 and PA500, colored by tissue class
import plotly.express as px

# mapping class names to more descriptive labels
df_plot = df.copy()
class_name_map = {
    'car': 'carcinoma',
    'fad': 'fibro-adenoma',
    'mas': 'mastopathy',
    'gla': 'glandular',
    'con': 'connective',
    'adi': 'adipose'
}
df_plot['Class'] = df_plot['Class'].replace(class_name_map)

# create scatter plot (Class is categorical, so no continuous color scale is needed)
fig = px.scatter(df_plot, x='I0', y='PA500', color='Class')
fig.update_layout(title='Scatter plot of I0 vs PA500', xaxis_title='I0', yaxis_title='PA500')
fig.show()

Consider a new sample, with an $I_0$ of $400$ and a $PA_{500}$ of $0.18$. <br />
What kind of tissue is it?

In [53]:
import numpy as np
import plotly.graph_objects as go

# 1. define the new sample point
data = df[['I0', 'PA500']]
sample = pd.DataFrame([
    {"I0": 400,
     "PA500": 0.18}
])

# 2. scale the data and sample
data_mean = data.mean()
data_std = data.std()
data_scaled = (data - data_mean) / data_std
sample_scaled = (sample - data_mean) / data_std

# 3. find the k nearest neighbors (k = 5) of the sample
# note: subtracting two DataFrames aligns on row labels instead of broadcasting,
# so select the single sample row as a Series with .loc[0];
# the Series then broadcasts across every row of data_scaled
dists = np.sqrt(
    ((sample_scaled.loc[0] - data_scaled) ** 2).sum(axis=1)
)
index_nearest = dists.sort_values().index[:5]

In [54]:
# 4. create scatter plot (Class is categorical, so no continuous color scale is needed)
fig = px.scatter(df_plot, x='I0', y='PA500', color='Class')
fig.update_layout(title='Scatter plot of I0 vs PA500', xaxis_title='I0', yaxis_title='PA500')

# 5. highlight the sample point
fig.add_trace(go.Scatter(x=sample['I0'], y=sample['PA500'], mode='markers', marker=dict(symbol='star', size=10, color='red'), name='sample'))

# 6. create line segments to the nearest neighbors
for i, neighbor in df.loc[index_nearest].iterrows():
    fig.add_shape(
        type='line',
        x0=sample['I0'].values[0], y0=sample['PA500'].values[0],
        x1=neighbor['I0'], y1=neighbor['PA500'],
        line=dict(color='grey', width=1, dash='solid')
    )
fig.show()

In [56]:
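# check the 5 nearest neighbors' data
# store the scaled Euclidean distances so they can be displayed alongside the features
df_plot['distance'] = dists
print(df_plot.loc[index_nearest, ['Class', 'I0', 'PA500', 'distance']])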


            Class          I0     PA500  distance
50     mastopathy  310.000000  0.174707  0.142135
17      carcinoma  300.000000  0.190066  0.197805
0       carcinoma  524.794072  0.187448  0.197958
20      carcinoma  500.000000  0.192684  0.227563
23  fibro-adenoma  245.000000  0.189019  0.244033


Of its 5 nearest neighbors in the training data: <br />
3 are carcinomas, 1 is fibro-adenoma, 1 is mastopathy, <br />
so our best guess is that it is a carcinoma.
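
The majority vote described above can be sketched in a few lines. This is a minimal illustration, not the scikit-learn implementation: the list `neighbor_classes` is typed in by hand from the table of the 5 nearest neighbors, and the variable names are chosen here for illustration.

```python
from collections import Counter

# Classes of the 5 nearest neighbors, copied by hand from the table above
neighbor_classes = [
    "mastopathy", "carcinoma", "carcinoma", "carcinoma", "fibro-adenoma",
]

# Count how many neighbors fall in each class
votes = Counter(neighbor_classes)

# The predicted class is the one with the most votes
predicted_class, n_votes = votes.most_common(1)[0]
print(predicted_class, n_votes)
```

With an odd $k$ and more than two classes, ties are still possible (for example 2-2-1), so a real implementation needs some tie-breaking rule; this sketch simply takes whatever `most_common` returns first.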

## $K$-Nearest Neighbors

In [57]:
X_train = df[["I0", "PA500"]]
y_train = df["Class"]
x_test = pd.Series({"I0": 400, "PA500": 0.18})
X_test = x_test.to_frame().T
X_test

| | I0 | PA500 |
|---|---|---|
| 0 | 400.0 | 0.18 |


Here is the code we used for $k$-nearest neighbor <u>regression</u>.

In [58]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, metric="euclidean")
)

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

TypeError: unsupported operand type(s) for /: 'str' and 'int'

What would need to change for classification?

In [59]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean")
)

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

array(['car'], dtype=object)

Instead of returning a single predicted class, we can ask it to return the predicted probabilities for each class.

In [60]:
pipeline.predict_proba(X_test)

array([[0. , 0.6, 0. , 0.2, 0. , 0.2]])

In [61]:
pipeline.classes_

array(['adi', 'car', 'con', 'fad', 'gla', 'mas'], dtype=object)

How did Scikit-Learn calculate these predicted probabilities?
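
One way to answer this: the pipeline above does not set `weights`, so `KNeighborsClassifier` uses its default uniform weighting, and each predicted probability is the fraction of the $k = 5$ neighbors that belong to that class. The sketch below reproduces the numbers by hand. It assumes the classifier found the same five neighbors as the manual search earlier (the output is consistent with that), and `neighbor_labels` is typed in here for illustration.

```python
# Class order reported by pipeline.classes_
classes = ["adi", "car", "con", "fad", "gla", "mas"]

# Labels of the 5 nearest neighbors (assumed to match the manual search above)
neighbor_labels = ["mas", "car", "car", "car", "fad"]

# Probability of each class = share of the neighbors with that label
k = len(neighbor_labels)
proba = [sum(label == c for label in neighbor_labels) / k for c in classes]
print(proba)
```

The three carcinoma neighbors give $3/5 = 0.6$, and the single fibro-adenoma and mastopathy neighbors give $1/5 = 0.2$ each, matching the `predict_proba` output.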

## Cross-Validation for Classification

Here is the code we used to cross-validate a <u>regression</u> model.

In [63]:
from sklearn.model_selection import cross_val_score
cross_val_score(
    pipeline, X_train, y_train,
    scoring="neg_mean_squared_error",
    cv=10
)


Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/home/karoness/workspace/DATASCI112/.venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 152, in __call__
    score = scorer._score(
            ^^^^^^^^^^^^^^
  File "/home/karoness/workspace/DATASCI112/.venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 408, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/karoness/workspace/DATASCI112/.venv/lib/python3.12/site-packages/sklearn/utils/_param_validation.py", line 218, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/karoness/workspace/DATASCI112/.venv/lib/python3.12/site-packages/sklearn/metrics/_regression.py", line 580, in mean_squared_error
    _check_reg_targets_with_floating_dtype(
  Fil

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

What would need to change for classification?

We need a different scoring method for classification. A simple one is **accuracy**:

$$
\mathsf{accuracy = \frac{\textsf{\# correct predictions}}{\textsf{\# predictions}}}
$$
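
As a small worked instance of this formula, the sketch below computes accuracy by hand. The labels in `y_true` and `y_pred` are hypothetical, made up for illustration, and are not results from the breast tissue data.

```python
# Hypothetical true and predicted labels, for illustration only
y_true = ["car", "car", "fad", "mas", "adi"]
y_pred = ["car", "fad", "fad", "mas", "con"]

# Count predictions that match the true label, then divide by the total
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)
print(n_correct, accuracy)
```

Here 3 of the 5 predictions are correct, so the accuracy is $3/5 = 0.6$. This is the quantity that `scoring="accuracy"` asks scikit-learn to compute on each held-out fold.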


In [66]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipeline, X_train, y_train,
    scoring="accuracy",
    cv=10
)
scores

array([0.63636364, 0.81818182, 0.45454545, 0.54545455, 0.63636364,
       0.54545455, 0.5       , 0.6       , 0.4       , 0.7       ])

As before, we can get an overall estimate of test accuracy by averaging the cross-validation accuracies:

In [67]:
scores.mean()

np.float64(0.5836363636363637)

Accuracy is not always the best measure of a classification model. <br />
We'll talk about some better measures next time.
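
To see why accuracy alone can mislead, consider a minimal sketch with deliberately imbalanced classes. The data here is hypothetical (95 benign samples and 5 carcinomas, not the breast tissue data), and the "model" simply predicts the majority class every time.

```python
# Hypothetical, deliberately imbalanced labels (not the breast tissue data)
y_true = ["benign"] * 95 + ["carcinoma"] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = ["benign"] * 100

# Accuracy looks high because most samples are benign
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Yet not a single carcinoma is detected
carcinomas_found = sum(
    t == "carcinoma" and p == "carcinoma" for t, p in zip(y_true, y_pred)
)
print(accuracy, carcinomas_found)
```

The accuracy is 0.95 even though the model misses every carcinoma, which is the case that matters most in this setting.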