**Supervised 5**

# K-Nearest Neighbors Classification

## Part 2. Pre-Processing Data

<br>



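This notebook continues the K-Nearest Neighbors (KNN) classification example, introducing two pre-processing steps: **Standard Scaler**, which puts features on a comparable scale so that no single feature dominates the distance calculation, and a **train-test split**, which lets us detect overfitting and evaluate the model properly.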

**(Recap) The KNN Idea:** To classify something new, look at its nearest neighbors and vote.
- If most of your neighbors are Category A, you're probably Category A too
- It's like saying "birds of a feather flock together"
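
The voting idea above can be sketched in a few lines of plain Python. This is a minimal illustration only: the helper `knn_predict` and the toy points are made up for this example, and it is not how scikit-learn implements KNN internally.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [math.dist(p, new_point) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy points (hypothetical, not the states data)
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9)]
toy_labels = ["A", "A", "A", "B", "B"]

toy_prediction = knn_predict(toy_points, toy_labels, (1.1, 1.0), k=3)
print(toy_prediction)
```

The new point sits among the three "A" points, so all three of its nearest neighbours vote "A".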

<br>

---

<br>

**Setup:** Import required libraries

In [93]:
import pandas as pd         # For data manipulation
import numpy as np          # For numerical operations
import altair as alt        # For plotting our results
import matplotlib.pyplot as plt     # For plotting with Matplotlib 

from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

---

## Our Data: US States Policy Outcomes

We'll use a simple question: Can we tell if a state is from the **South** or **Northeast** based on just two factors?
- Median income
- Firearm death rate


Import the data, filter to Southern and Northeastern states, then separate into X (features) and y (labels).

In [94]:
# 1. Import csv data from GitHub Raw URL
data = pd.read_csv('https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/charts/usa/data_USsocioEconomic.csv')
print(f"Columns: {data.columns.tolist()}\n")
print(f"Geographical Divisions: {data['GeographicDivision'].unique()}")

# 2. Filter to South and Northeast states
states = data[data['GeographicDivision'].isin(['South', 'Northeast'])].copy()

# 3. Select our features (X) and labels (y)
features = ['medIncome', 'DeathRate']
X = states[features] 
y = states['GeographicDivision']

Columns: ['State', 'StateInitials', 'Gini', 'DeathRate', 'Firearms_vs_avg', 'medIncome', 'Income_vs_med', 'ImprisonmentRate', 'PrisonRate', 'ImprisonmentRate.1', 'FirearmDeaths', 'GeographicDivision']

Geographical Divisions: ['South' 'West' 'Northeast' 'Midwest']


<br>
<br>
<br>

## Step 2: Standard scaling

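In the first KNN example notebook, we ran into an issue where the difference in feature scales (i.e. income values in the tens of thousands vs firearm death rate values in the tens) meant that we weren't really capturing the *nearest* neighbours: the distance between two states was determined almost entirely by income.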
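
To make the scale problem concrete, here is a small sketch using the raw `medIncome` and `DeathRate` values of three states from this dataset (the first three rows of `X`). The helper `euclidean_parts` is a made-up name for illustration.

```python
import math

# Raw (medIncome, DeathRate) values for three states in this dataset
state_a = (47221, 21.5)
state_b = (45907, 17.8)
state_c = (75923, 4.6)

def euclidean_parts(p, q):
    """Return the income gap, the death-rate gap, and the Euclidean distance."""
    income_gap = abs(p[0] - q[0])
    death_gap = abs(p[1] - q[1])
    return income_gap, death_gap, math.hypot(income_gap, death_gap)

income_gap, death_gap, distance = euclidean_parts(state_a, state_c)
print(f"Income gap: {income_gap}, death-rate gap: {death_gap:.1f}, distance: {distance:.3f}")
```

The overall distance is almost identical to the income gap on its own: a death-rate difference of about 17 per 100k barely registers next to an income difference of tens of thousands of dollars.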

In [95]:
# Instantiate StandardScaler and scale input features X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for easier handling
X_scaled_df = pd.DataFrame(X_scaled, columns=features, index=X.index)

print("\nFirst few rows (before scaling):\n", X[:3])
print("\nFirst 3 rows of scaled features:\n", X_scaled_df[:3])


First few rows (before scaling):
    medIncome  DeathRate
0      47221       21.5
3      45907       17.8
6      75923        4.6

First 3 rows of scaled features:
    medIncome  DeathRate
0  -0.946845   1.572602
3  -1.073455   0.900705
6   1.818712  -1.496333
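
Under the hood, standard scaling replaces each value with its z-score, z = (x - mean) / std, computed column by column. The sketch below checks this by hand against `StandardScaler` on a small made-up array (the numbers are illustrative, not taken from the states data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: an income-like column and a rate-like column
toy = np.array([[40000.0, 20.0],
                [50000.0, 15.0],
                [80000.0, 5.0]])

# Manual z-score per column (np.std uses the population formula, ddof=0, by default)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn
from_sklearn = StandardScaler().fit_transform(toy)

match = np.allclose(manual, from_sklearn)
print("Manual z-scores match StandardScaler:", match)
```

After scaling, each column has mean 0 and standard deviation 1, so both features are measured in "standard deviations from the average" rather than in dollars or deaths per 100k.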


<br>
<br>
<br>

## Step 3: Split data into **training** and **test sets**

**Why split the data?**
- We train the model on some data (training set)
- We test it on data it hasn't seen (test set)
- This tells us if the model actually *learned* patterns or just memorised
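
The "memorised" case can be illustrated with a deliberately extreme sketch: made-up random features with labels that have nothing to do with them. There is no pattern to learn, yet a K=1 model still scores perfectly on its own training data. All names and data here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 2))        # random features
y_noise = rng.integers(0, 2, size=200)     # labels unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=42
)

# With K=1, each training point is its own nearest neighbour
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_accuracy = memoriser.score(X_tr, y_tr)
test_accuracy = memoriser.score(X_te, y_te)
print(f"Training accuracy: {train_accuracy:.1%}")
print(f"Test accuracy: {test_accuracy:.1%}")
```

Only the test score gives an honest picture here: since the labels are pure noise, there is no reason for it to be much better than a coin flip.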

In [96]:
# Split the data (making sure to use scaled features): 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled_df, y, test_size=0.3, random_state=42        # Random state sets the seed for reproducibility (so we get the same split each time)
)

print(f"Training set: {len(X_train)} states")
print(f"Test set: {len(X_test)} states")
print("\nThe model will learn from the training states,")
print("then we'll see if it can correctly classify the test states")

Training set: 17 states
Test set: 8 states

The model will learn from the training states,
then we'll see if it can correctly classify the test states


<br>
<br>

## Step 4: Train KNN classifier

Fit the K-Nearest Neighbours classifier with K=3, on **training** data.

In [97]:
# Create and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)       #### NOTE: We **train** the model on the 'train' split of the data

print(f"Model fitted on {len(X_train)} states")

Model fitted on 17 states
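
One thing to be aware of: in this notebook the scaler was fitted on all states before splitting, so the test states influenced the mean and standard deviation used for scaling. A common alternative, sketched below on made-up data, is to wrap the scaler and the classifier in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The variable names and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Made-up stand-ins for an income-like and a rate-like feature
X_demo = np.column_stack([rng.normal(55000, 9000, 60), rng.normal(12, 4, 60)])
y_demo = np.where(X_demo[:, 0] > 55000, "Northeast", "South")

# Split the RAW features first...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# ...then let the pipeline fit the scaler on the training rows only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)

# The fitted scaler's means come from the training split alone
fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler means:", fitted_scaler.mean_)
print(f"Test accuracy: {pipe.score(X_te, y_te):.1%}")
```

When `pipe.predict` or `pipe.score` is called, the test rows are transformed with the training-set mean and standard deviation before reaching the classifier.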


<br>
<br>

## Step 5: Make predictions

Now let's see if our model can correctly classify the **test** states:

In [98]:
X_test

| | medIncome | DeathRate |
|---|---|---|
| 18 | -0.596598 | -0.824436 |
| 35 | -0.588215 | 1.227574 |
| 0 | -0.946845 | 1.572602 |
| 45 | 0.906045 | -0.134379 |
| 23 | -1.536726 | 1.282052 |
| 19 | 1.610298 | -0.170698 |
| 29 | 1.100392 | -1.332899 |
| 3 | -1.073455 | 0.900705 |


In [99]:
# Make predictions on the test set
predictions = knn.predict(X_test)       #### NOTE: We **predict** the labels for the 'test' split of the data (so our model hasn't seen the y-labels for these states yet)

# Compare predictions to reality
results = pd.DataFrame({
    'State': states.loc[X_test.index, 'State'].values,  # This gets the State names for the test set from our original dataframe
    'Actual Region': y_test.values,
    'Predicted Region': predictions,
    'Correct?': predictions == y_test.values    # True/False if prediction matches actual
})
print("\nPrediction results on test set:")
results


Prediction results on test set:


| | State | Actual Region | Predicted Region | Correct? |
|---|---|---|---|---|
| 0 | Maine | Northeast | South | False |
| 1 | Oklahoma | South | South | True |
| 2 | Alabama | South | South | True |
| 3 | Virginia | South | Northeast | False |
| 4 | Mississippi | South | South | True |
| 5 | Maryland | South | Northeast | False |
| 6 | New Jersey | Northeast | Northeast | True |
| 7 | Arkansas | South | South | True |
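
To see *why* a point receives a given prediction, we can ask the fitted model which neighbours voted. The sketch below uses `kneighbors` and `predict_proba` on a tiny made-up dataset in scaled units (the points and the query are hypothetical, not the states above).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up scaled points: three "South"-like and two "Northeast"-like
X_small = np.array([[-1.0, 1.2], [-0.8, 0.9], [-1.1, 1.0], [1.2, -1.0], [0.9, -0.7]])
y_small = np.array(["South", "South", "South", "Northeast", "Northeast"])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)

query = np.array([[0.5, -0.2]])             # a new point to classify
distances, indices = clf.kneighbors(query)  # the 3 nearest training points
neighbour_labels = y_small[indices[0]]      # how each neighbour "votes"
probabilities = clf.predict_proba(query)[0] # vote shares, ordered as clf.classes_

print("Neighbour labels:", neighbour_labels)
print("Classes:", clf.classes_)
print("Vote shares:", probabilities)
print("Prediction:", clf.predict(query)[0])
```

Here two of the three nearest neighbours are "Northeast", so the vote share is 2/3 to 1/3 and the prediction is "Northeast". The same calls on `knn` and `X_test` would show which training states voted for each test state.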


<br>

Calculate an accuracy score

In [36]:
# Calculate accuracy
accuracy = sum(results['Correct?']) / len(results)
print(f"\nAccuracy: {accuracy:.1%}")
print(f"We correctly classified {sum(results['Correct?'])} out of {len(results)} states")


Accuracy: 62.5%
We correctly classified 5 out of 8 states
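
scikit-learn also provides ready-made metrics. As a sketch, the snippet below re-enters the eight actual and predicted labels from the results table by hand (so it runs on its own) and computes the accuracy and a confusion matrix, which shows which kind of mistake the model makes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the results table, in the same row order
actual = ["Northeast", "South", "South", "South", "South", "South", "Northeast", "South"]
predicted = ["South", "South", "South", "Northeast", "South", "Northeast", "Northeast", "South"]

metric_accuracy = accuracy_score(actual, predicted)

# Rows = actual region, columns = predicted region, in the order given by `labels`
labels = ["Northeast", "South"]
cm = confusion_matrix(actual, predicted, labels=labels)

print(f"Accuracy: {metric_accuracy:.1%}")
print(cm)
```

Reading the matrix: of the two Northeast test states, one was classified correctly and one was labelled South; of the six Southern test states, four were correct and two were labelled Northeast.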


<br>
<br>
<br>

## Step 6: Visualise how KNN makes decisions


To visualise the effect of 'fixing' our method with the standard scaling, let's plot both non-scaled and scaled results. We will add a **decision boundary** to highlight the areas where points get assigned to either class outcome.

(This code gets a bit complicated, so don't worry if you don't follow every line. What we're doing is creating a dense grid of sample points across our data range, then using our model to predict whether each point would be assigned to South or Northeast. With predictions for every grid point, we can plot a heatmap, overlaid with our actual data points.)


In [None]:
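# First, let's train an UNSCALED model for comparison
# (fit on the same training states as the scaled model, so the comparison is like-for-like)
knn_unscaled = KNeighborsClassifier(n_neighbors=3)
knn_unscaled.fit(X.loc[X_train.index], y_train)  # Original unscaled features, training rows only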

# Create a mesh grid to show decision boundaries
# We'll create this in the ORIGINAL scale so it's interpretable
grid_resolution = 100  # Number of points along each axis
income_range = np.linspace(X['medIncome'].min() - 2000, X['medIncome'].max() + 2000, grid_resolution)
death_range = np.linspace(X['DeathRate'].min() - 2, X['DeathRate'].max() + 2, grid_resolution)
income_grid, death_grid = np.meshgrid(income_range, death_range)

# Flatten the grid to make predictions
grid_points = np.c_[income_grid.ravel(), death_grid.ravel()]

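# Get predictions for UNSCALED model
# (wrap the grid in a DataFrame so its column names match those used when fitting;
#  this avoids scikit-learn's "X does not have valid feature names" warning)
grid_df = pd.DataFrame(grid_points, columns=features)
grid_predictions_unscaled = knn_unscaled.predict(grid_df)

# Get predictions for SCALED model (need to scale the grid first!)
grid_points_scaled = pd.DataFrame(scaler.transform(grid_df), columns=features)  # Scale using the same fitted scaler
grid_predictions_scaled = knn.predict(grid_points_scaled)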

# Prepare data for plotting
# For unscaled visualisation
boundary_data_unscaled = pd.DataFrame({
    'medIncome': grid_points[:, 0],
    'DeathRate': grid_points[:, 1],
    'Prediction': grid_predictions_unscaled
})

# For scaled visualisation (but showing in original coordinates!)
boundary_data_scaled = pd.DataFrame({
    'medIncome': grid_points[:, 0],  # Original scale for display
    'DeathRate': grid_points[:, 1],  # Original scale for display
    'Prediction': grid_predictions_scaled
})

# Add a column to identify train vs test points
states_plot = states.copy()
states_plot['Dataset'] = 'Training'
states_plot.loc[X_test.index, 'Dataset'] = 'Test'



Now build a chart for each model: one trained on unscaled features and one trained on scaled features.

In [122]:
### Chart 1: UNSCALED Features Decision Boundary
base_unscaled = alt.Chart(boundary_data_unscaled).mark_rect().encode(
    x=alt.X('medIncome:Q').title('Median Income ($)').scale(nice=False).bin(maxbins=grid_resolution),
    y=alt.Y('DeathRate:Q', title='Firearm Death Rate (per 100k)').scale(nice=False).bin(maxbins=grid_resolution),
    color=alt.Color('Prediction:N').scale(domain=['South', 'Northeast'], range=['#ffcccb', '#add8e6']).legend(title='Decision Region')
)

points_unscaled = alt.Chart(states_plot).mark_point(size=100, filled=True, stroke='black', strokeWidth=1, opacity=1).encode(
    x='medIncome:Q',
    y='DeathRate:Q',
    color=alt.Color('GeographicDivision:N').scale(domain=['South', 'Northeast'], range=['#ff6b6b', '#4dabf7']).legend(title='Actual Region'),
    shape=alt.Shape('Dataset:N').scale(domain=['Training', 'Test'], range=['circle', 'square']).legend(title='Data Split'),
    tooltip=['State:N', 'GeographicDivision:N', 'medIncome:Q', 'DeathRate:Q', 'Dataset:N']
)

chart_unscaled = (base_unscaled + points_unscaled).properties(
    width=400,
    height=400,
    title='WITHOUT Feature Scaling, K=3'
).resolve_scale(color='independent')


### Chart 2: SCALED Features Decision Boundary
base_scaled = alt.Chart(boundary_data_scaled).mark_rect(opacity=1).encode(
    x=alt.X('medIncome:Q').title('Median Income ($)').scale(nice=False).bin(maxbins=grid_resolution),
    y=alt.Y('DeathRate:Q', title='Firearm Death Rate (per 100k)').scale(nice=False).bin(maxbins=grid_resolution),
    color=alt.Color('Prediction:N').scale(domain=['South', 'Northeast'], range=['#ffcccb', '#add8e6']).legend(None)
)

points_scaled = alt.Chart(states_plot).mark_point(size=100, filled=True, stroke='black', strokeWidth=1, opacity=1).encode(
    x='medIncome:Q',
    y='DeathRate:Q',
    color=alt.Color('GeographicDivision:N').scale(domain=['South', 'Northeast'], range=['#ff6b6b', '#4dabf7']).legend(None),
    shape=alt.Shape('Dataset:N').scale(domain=['Training', 'Test'], range=['circle', 'square']).legend(None),
    tooltip=['State:N', 'GeographicDivision:N', 'medIncome:Q', 'DeathRate:Q', 'Dataset:N']
)

chart_scaled = (base_scaled + points_scaled).properties(
    width=400,
    height=400,
    title='WITH Feature Scaling, K=3'
).resolve_scale(color='independent')

# Display side by side
(chart_unscaled | chart_scaled).resolve_scale(shape='independent')

<br>

**Reading this decision boundary chart:**
- Background shading: Light red = KNN predicts "South", Light blue = KNN predicts "Northeast"
- **Coloured circles**: Training data (what the model learned from)
- **Coloured squares**: Test data (what we're trying to classify)
- **Misclassifications** happen when a point's colour doesn't match its background.

**Key insight:** Notice how the boundary is "jagged" and follows the shape of the training data. This is characteristic of KNN - it makes local decisions based on nearby points rather than drawing a smooth line. This can be good (flexible) or bad (sensitive to noise) depending on our data.


<br>
<br>
<br>



**Comparing the two charts:**

**1. Without Scaling**:
   - Decision boundary is nearly **vertical**
   - The classifier essentially only considers income
   - Death rate has almost no influence on classification
   - This happens regardless of k value!

**2. With Scaling**:
   - Decision boundary becomes **diagonal**
   - Both features contribute to the classification
   - Creates more nuanced regional distinctions
   - Different k values create different complexity levels

<br>
<br>
<br>
<br>

---

<br>
<br>

## BONUS: Try different values of K

The number of neighbors (K) is important:
- Too small (K=1): Might be too sensitive to outliers
- Too large (K=all): Might be too general
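
The "too large" extreme can be shown with a small sketch on made-up data: when K equals the number of training points, every training point votes every time, so the model always predicts the majority class no matter where the new point is. The data and names below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(20, 2))                       # made-up features
y_toy = np.array(["South"] * 13 + ["Northeast"] * 7)   # imbalanced made-up labels

# K = all training points: every prediction is a vote of the whole training set
k_all = KNeighborsClassifier(n_neighbors=len(X_toy)).fit(X_toy, y_toy)

new_points = rng.normal(size=(50, 2))
unique_predictions = set(k_all.predict(new_points).tolist())
print(unique_predictions)
```

With 13 "South" votes against 7 "Northeast" votes for every query, the only label this model ever predicts is "South".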

Let's experiment:

In [133]:
# Try different values of K
k_values = range(1, 9)  # K from 1 to 8
results = []

for k in k_values:
    # Split the data: 70% for training, 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled_df, y, test_size=0.3, random_state=42
    )

    # Train model with this K
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    
    # Calculate accuracy
    accuracy = model.score(X_test, y_test)
    results.append({'K': k, 'Accuracy': accuracy})

results_df = pd.DataFrame(results)
results_df

| | K | Accuracy |
|---|---|---|
| 0 | 1 | 0.625 |
| 1 | 2 | 0.625 |
| 2 | 3 | 0.625 |
| 3 | 4 | 0.625 |
| 4 | 5 | 0.625 |
| 5 | 6 | 0.625 |
| 6 | 7 | 0.75 |
| 7 | 8 | 0.75 |


In [145]:
# Visualise the results
accuracy_chart = alt.Chart(results_df).mark_line(point=alt.OverlayMarkDef(size=80), size=2.5).encode(
    x=alt.X('K:O').title('Number of Neighbors (K)').axis(labelAngle=0, labelFontSize=14, ticks=True, labelPadding=5),
    y=alt.Y('Accuracy:Q').scale(padding=50, zero=False).axis(format='%', labelFontSize=12).title('Accuracy'),
    tooltip=[alt.Tooltip('K:O'), 
             alt.Tooltip('Accuracy:Q', format='.1%')]
).properties(
    width=400,
    height=300,
    title='How does K affect accuracy?'
)

accuracy_chart

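In our example analysis (with very few data points), it's normal to see the same accuracy for several different values of K. With only 8 test states, accuracy can only change in steps of 12.5 percentage points, so small changes in the model often don't change the score at all. As our dataset becomes larger, either through more features or more observations, we'd expect performance to vary more noticeably as K changes.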
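
A quick arithmetic sketch shows why the accuracy values above look so "chunky": with 8 test states there are only nine possible accuracy scores.

```python
n_test = 8  # number of states in the test set

# Every accuracy the model could possibly score on 8 test states
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]
print(possible_accuracies)

# The two accuracy values seen in the table correspond to these counts
correct_for_0625 = round(0.625 * n_test)
correct_for_075 = round(0.75 * n_test)
print(f"62.5% = {correct_for_0625} of {n_test} correct")
print(f"75.0% = {correct_for_075} of {n_test} correct")
```

The jump from 62.5% to 75% at K=7 therefore reflects a single additional test state being classified correctly.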

<br>

**Ideas to take this further:**
- Here we've only used two input features (firearm death rate and median income). Try including more features in the analysis.
- Expand the number of observations by finding data at a finer geographic level (for example, county-level data) or over time (for example, annual values for each state).