# $k$-NN Regression
### Foundations of Machine Learning
### `https://www.github.com/ds4e/knn`

## Introduction
- Last time we did $k$-nearest neighbor classification:
    - Supervised learning: Use $X$ to predict $y$
    - Classification: $y$ was categorical, a label
    - Performance: Confusion Matrix, Accuracy
    - Hyperparameter Selection: Train-test Split
- We make a few key changes:
    - Supervised learning: Use $X$ to predict $y$
    - **Regression: $y$ will be numeric, a number**
    - **Performance: Residuals, Mean Squared Error**
    - Hyperparameter Selection: Train-test Split

## Outline
1. Example
2. Regression
3. Residuals and MSE

# 1. Example

## Predicting Sales
- Last time, we predicted vehicle class from footprint and price
- Let's predict sales from price and mpg: How do `baseline mpg` and `baseline price` seem to drive `baseline sales`? 
- We care about this for marketing purposes (if we change our price, how many more cars do we sell?) as well as policy purposes (if we tax high carbon vehicles, how many fewer are sold?)

## Cleaning the Data
``` python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns

df = pd.read_csv('./data/cars_env.csv')
df.head()

# Drop extremely expensive cars:
q90 = np.quantile( df['baseline price'],.9) # Compute the .9 quantile
print(q90)
keep = df['baseline price'] < q90  # Logical condition asserting price < .9 quantile
df = df.loc[keep,:] # Use locator function to filter on a Boolean conditional
```

## Sales versus Price
``` python
# No hue variable here, so no palette is needed (seaborn ignores palette without hue)
sns.scatterplot(x=df['baseline price'], 
                y = df['baseline sales'],
                alpha=.3)
```

## Sales versus MPG
``` python
# No hue variable here, so no palette is needed (seaborn ignores palette without hue)
sns.scatterplot(x=df['baseline mpg'], 
                y = df['baseline sales'],
                alpha=.3)
```

## Prediction
- This is the picture we want to predict: the color/sales, explained by price and mpg:
```python

sns.scatterplot(x=df['baseline price'], 
                y = df['baseline mpg'],
                hue = np.log(df['baseline sales']),
                palette='crest',
                alpha=.6)
```

## Exercise
- Pick a data set in your groups
- Find a well-posed regression problem (two explanatory numeric variables, one numeric outcome variable)

# 2. Regression

## $k$-NN Regression
- Imagine we have covariates/features $X = [x_1, x_2, ..., x_N]$, consisting of $N$ observations of $L$ variables, and an observed numeric outcome/target variable $y = (y_1, ..., y_N)$
- Consider a new case $\hat{x} = (\hat{x}_1,...,\hat{x}_L)$. We want to make a guess of what **numeric value** it will likely take, $\hat{y}$
- The *$k$ Nearest Neighbor Regression Algorithm* is:
  1. Compute the distance from $\hat{x}$ to each observation $x_i$ in the dataset
  2. Find the $k$ "nearest neighbors" $x_1^*$, $x_2^*$, ..., $x_k^*$ to $\hat{x}$ in the data in terms of distance, with values $y_{1}^*$, $y_2^*$, ..., $y_k^*$
  3. Return the average of the neighbor values, 
  
  $$\hat{y}(\hat{x}) = \dfrac{y_1^* + y_2^* + ... + y_k^*}{k} = \frac{1}{k} \sum_{i=1}^k y_i^*$$
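
- The three steps above can be sketched from scratch in NumPy. This is a minimal illustration, assuming Euclidean distance; the function name `knn_regress` and the toy data are made up for this example:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    """Predict a numeric value for x_new by averaging its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from x_new to every observation x_i
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances (the nearest neighbors)
    nearest = np.argsort(dists)[:k]
    # Step 3: average the neighbors' outcome values
    return y_train[nearest].mean()

# Toy data (hypothetical): N = 5 observations of L = 2 variables
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([10.0, 12.0, 14.0, 100.0, 110.0])

# The three closest points to (0.2, 0.1) have outcomes 10, 12, and 14
y_hat_toy = knn_regress(X_toy, y_toy, x_new=[0.2, 0.1], k=3)
print(y_hat_toy)  # the average of 10, 12, and 14
```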

## Scikit-Learn
- To get the regressor, use `from sklearn.neighbors import KNeighborsRegressor`
- The workflow with `sklearn` is that you use it to
  1. Create an untrained model object with a fixed $k$: `model = KNeighborsRegressor(n_neighbors=k)` (for classification, it was `model = KNeighborsClassifier(n_neighbors=k)`)
  2. Fit that object to the data, $(X,y)$: `fitted_model = model.fit(X,y)`
  3. Use the fitted object to make predictions for new cases $\hat{x}$: `y_hat = fitted_model.predict(x_hat)` returns numeric predictions. Unlike the classifier, the regressor has no `predict_proba` method, since there are no class probabilities to report
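
- A small sketch of the create/fit/predict workflow on made-up data (the `_demo` names and values are illustrative only). Note that `predict` expects a 2D array with one row per new case:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: one feature, with the outcome equal to twice the feature
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

model_demo = KNeighborsRegressor(n_neighbors=2)  # 1. Untrained model with k = 2
fitted_demo = model_demo.fit(X_demo, y_demo)     # 2. Fit to the data (X, y)
x_hat_demo = np.array([[1.4]])                   # New case: shape (1 case, 1 feature)
y_hat_demo = fitted_demo.predict(x_hat_demo)     # 3. Predict a numeric value

# The two nearest neighbors of 1.4 are x = 1 and x = 2, with outcomes 2 and 4
print(y_hat_demo)
```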

## The Train/Test Split
- Remember, we don't want to pick hyperparameters or evaluate performance on the data on which the model is trained!
- We want to imagine how the model will perform in plausible situations which the model has not yet seen
- In the code that follows, we'll implement the train/test split straightaway, even though we're not picking the $k$ hyperparameter yet
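
- As a quick illustration of what the split does, here is a sketch on ten made-up observations (the `_all`, `_tr`, and `_te` names are hypothetical). With `test_size=.2`, two cases are held out, and no case appears in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations of two variables, with outcomes 0, 1, ..., 9
X_all = np.arange(20).reshape(10, 2)
y_all = np.arange(10)

# Hold out 20% of the cases; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=.2, random_state=100)

print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
print(sorted(y_te))            # the outcomes of the held-out cases
```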

## (1) Preamble, Read Data

``` python
# Preamble:
import numpy as np
import pandas as pd
import seaborn as sns # Used for the diagnostic plots later on
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

def minmax(x):
    u = (x-min(x))/(max(x)-min(x))
    return u

# Wrangle data:
df = pd.read_csv('file.csv') # load your data (replace with the path to your file)
df.head()
```

## (2) Normalize, Train-Test Split
``` python
y = df['target'] # Set our outcome/target
ctrl_list = ['var_1', 'var_2'] # List of control variables (replace with your column names)
x = df.loc[:, ctrl_list] # Set our covariates/features
u = x.apply(minmax) # Scale each variable to [0, 1] using the minmax function defined above

u_train, u_test, y_train, y_test = train_test_split(u,y,test_size=.2,random_state=100) 
```
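
- To see why scaling matters, consider a sketch with made-up prices and mpg values (the `_demo` names are illustrative). Before scaling, distances would be dominated by price, which is measured in the tens of thousands; after min-max scaling, both variables run from 0 to 1:

```python
import pandas as pd

def minmax(x):
    # Rescale a column so its smallest value is 0 and its largest is 1
    return (x - min(x)) / (max(x) - min(x))

# Made-up data on very different scales
x_demo = pd.DataFrame({'price': [10000.0, 20000.0, 30000.0],
                       'mpg': [20.0, 35.0, 50.0]})

u_demo = x_demo.apply(minmax)  # apply() runs minmax on each column separately
print(u_demo)
```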

## (3) Fit a kNN Regressor and Predict
``` python
# Set the number of neighbors (odd k is a habit from classification; averaging has no ties to break):
k = 5 
# Create a fitted model instance:
model = KNeighborsRegressor(n_neighbors = k) # Create a model instance
model = model.fit(u_train,y_train) # Fit the model
# Make predictions:
y_hat = model.predict(u_test) # Prediction
```

## (4) Visualize the Predictor
- Plotting the predicted against actual values provides a general, visual diagnostic for fit:
```python
sns.scatterplot(x=y_test, y=y_hat)
```

## Exercise
- For the scenario you picked out, run the $k$-NN regression.
- Plot your test and predicted values. Do they line up on the diagonal, or are there significant errors or patterns?
- Commit/push your results back to GitHub.

# 3. Residuals and MSE

## Residuals
- The signed difference between the true value and the predicted value is called the **residual**,
$$\underbrace{r_i}_{\text{Residual, error}} = \underbrace{y_i}_{\text{True}} - \underbrace{\hat{y}(x_i)}_{\text{predicted}}$$
- This is how far the predicted value was from the true outcome, or the error: positive when the model under-predicts, negative when it over-predicts
- These are like a "confusion matrix" for regression

``` python
import numpy as np
import matplotlib.pyplot as plt

y_true = np.asarray(y_test)  # Actual outcomes for the test set
y_hat = np.asarray(y_hat)    # Predictions for the same test cases

# Perfect prediction line
lo = np.min([y_true.min(), y_hat.min()])
hi = np.max([y_true.max(), y_hat.max()])
plt.plot([lo, hi], [lo, hi], linestyle='--', label='y = ŷ')

# Data points
plt.scatter(y_hat, y_true, label='(ŷ, y)')

# Residuals: vertical segments from the diagonal (ŷ_i) to the actual value y_i
for a, b in zip(y_hat, y_true):
    plt.vlines(a, a, b, color='red', linewidth=.5)

plt.xlabel("Predicted ŷ")
plt.ylabel("Actual y")
plt.title("Residuals (r = y − ŷ)")
plt.legend()
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
```
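
- Residuals can also be inspected numerically. A minimal sketch with made-up true and predicted values (the `_demo` names are hypothetical):

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0])  # Made-up true values
y_pred_demo = np.array([2.5, 5.0, 8.0])  # Made-up predictions

# r_i = y_i - y_hat_i, one residual per case
residuals_demo = y_true_demo - y_pred_demo

# Positive: the model under-predicted; zero: exact; negative: it over-predicted
print(residuals_demo)
```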

## Loss Function: Mean Squared Error
- Like accuracy for classification, we seek some kind of nice summary number for regression. How well did we do?
- The most common metric for this is **mean squared error**:
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2,
$$
or **root mean squared error**, 
$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}.
$$
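- RMSE is essentially the distance from the true values to the predicted ones, scaled by sample size: it is the Euclidean distance between the vectors $y$ and $\hat{y}$ divided by $\sqrt{n}$, and MSE is its square. Averaging over $n$ keeps the value comparable across sample sizes (as $n$ gets large, these values typically approach some fixed value)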
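
- A worked example of both formulas on four made-up cases (the `_ex` names are illustrative). The squared residuals are 1, 0, 4, and 0, so MSE is 5/4:

```python
import numpy as np

y_true_ex = np.array([3.0, 5.0, 7.0, 9.0])  # Made-up true values
y_pred_ex = np.array([2.0, 5.0, 9.0, 9.0])  # Made-up predictions
n_ex = len(y_true_ex)

mse_ex = np.mean((y_true_ex - y_pred_ex) ** 2)   # (1 + 0 + 4 + 0) / 4
rmse_ex = np.sqrt(mse_ex)                        # Square root of MSE

# RMSE equals the Euclidean distance between the two vectors, divided by sqrt(n)
dist_ex = np.linalg.norm(y_true_ex - y_pred_ex)
print(mse_ex, rmse_ex, dist_ex / np.sqrt(n_ex))
```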

## Mean Squared Error
- The `model.score(u_test, y_test)` value is **not** MSE or RMSE, but instead something called $R^2$ that we'll cover later
- To compute MSE or RMSE, you can write your own function:
``` python
def mse(y_test, y_hat):
    # Average of the squared residuals
    return np.mean( (np.asarray(y_test) - np.asarray(y_hat)) ** 2 )
```
or `from sklearn.metrics import mean_squared_error`
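
- A quick check, on made-up values, that the hand-written formula agrees with `mean_squared_error`, and that $R^2$ (computed here with `sklearn.metrics.r2_score`) is a different number on a different scale. The `_chk` names are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_chk = np.array([1.0, 2.0, 3.0, 4.0])  # Made-up true values
y_pred_chk = np.array([1.5, 2.0, 2.5, 4.0])  # Made-up predictions

mse_manual = np.sum((y_true_chk - y_pred_chk) ** 2) / len(y_true_chk)
mse_sklearn = mean_squared_error(y_true_chk, y_pred_chk)  # Same quantity
rmse_chk = np.sqrt(mse_sklearn)                           # RMSE from MSE
r2_chk = r2_score(y_true_chk, y_pred_chk)                 # Not MSE or RMSE

print(mse_manual, mse_sklearn, rmse_chk, r2_chk)
```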

## Picking $k$
- Again, we can pick $k$ by plotting MSE for a range of values of $k$, and finding the one with the lowest mean squared error:
``` python
k_grid = [ (2*k+1) for k in range(100) ] # Odd numbers from 1 to 199; each k must not exceed the number of training rows
mses = [] # List to save MSEs
for k in k_grid:
    model = KNeighborsRegressor(n_neighbors = k) # Create a model instance
    model = model.fit(u_train,y_train) # Fit the model
    y_hat = model.predict(u_test) # Predict values
    mses.append( mse(y_test, y_hat) ) # Compute and store MSE
sns.lineplot(x=k_grid, y=mses).set(ylabel='MSE', xlabel='Neighbors') # Plot MSE
```

## Picking $k$
- Since we're looking at the **minimum** of our error function, we want to find the lowest value that MSE takes over all $k$:
``` python
index_star = np.argmin( mses ) # Find minimizing index of mses
k_star = k_grid[index_star] # Find value of k at that index
```
- This gives us the optimal number of neighbors, based on a single train-test split
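
- A tiny sketch of how `np.argmin` behaves, using made-up MSE values (the `_demo` names are hypothetical). When two values of $k$ tie for the lowest MSE, `np.argmin` returns the first one, which here is the smaller $k$:

```python
import numpy as np

k_grid_demo = [1, 3, 5, 7]          # Candidate values of k
mses_demo = [4.0, 2.5, 2.5, 3.0]    # Made-up test MSEs, with a tie at k = 3 and k = 5

index_demo = np.argmin(mses_demo)   # Position of the first minimum
k_star_demo = k_grid_demo[index_demo]
print(index_demo, k_star_demo)
```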

## Training vs Test MSE
- Just to see the difference:
```python
k_grid = [ (2*k+1) for k in range(100)]
mses = []
mses_train = []
for k in k_grid:
    model = KNeighborsRegressor(n_neighbors = k) 
    model = model.fit(u_train,y_train) 
    y_hat = model.predict(u_test) 
    mses_train.append( mse(y_train, model.predict(u_train))) # Save training MSE
    mses.append( mse(y_test, y_hat) ) # Save test MSE
sns.lineplot(x=k_grid, y=mses, label = 'test MSE')
sns.lineplot(x=k_grid, y=mses_train, label = 'training MSE')
```
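
- One thing to look for in that plot is the left edge. A sketch on simulated data (the data-generating process and the `_sim` names are assumptions for illustration): with $k=1$ and no duplicated feature rows, each training point is its own nearest neighbor, so training MSE is zero even though test MSE is not:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# Simulated data: two features, a nonlinear signal, and random noise
rng = np.random.default_rng(0)
X_sim = rng.uniform(0, 1, size=(200, 2))
y_sim = np.sin(4 * X_sim[:, 0]) + X_sim[:, 1] + rng.normal(0, .3, size=200)

X_sim_tr, X_sim_te, y_sim_tr, y_sim_te = train_test_split(
    X_sim, y_sim, test_size=.2, random_state=100)

# Fit with a single neighbor and compare training and test error
model_1 = KNeighborsRegressor(n_neighbors=1).fit(X_sim_tr, y_sim_tr)
train_mse_1 = np.mean((y_sim_tr - model_1.predict(X_sim_tr)) ** 2)
test_mse_1 = np.mean((y_sim_te - model_1.predict(X_sim_te)) ** 2)

print(train_mse_1, test_mse_1)  # training error vanishes; test error does not
```

- This is why training MSE is a poor guide for choosing $k$: it always favors the smallest neighborhood.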

## Exercise
- For the scenario you picked out, determine the optimal number of neighbors using a train/test split.
- How does the MSE behave as you increase $k$? Is there a clear minimum?
- Commit/push your results back to GitHub.

## Conclusion
- We now have a complete data science loop:
    1. Wrangle data
    2. EDA and Visualization to explore relationships
    3. $k$-NN regression to predict numeric response variables, $k$-NN classification to predict categorical response variables
    4. Train/Test split for hyperparameter selection
- This is a good time to do a lab, and connect the loop from beginning to end