# 3.1.2 [Tuning KNN - normalizing distance, picking k](https://courses.thinkful.com/data-201v1/project/3.1.2)

### Drill:
Imagine you work at a credit card company and are trying to predict if people will pay their bills on time.
Data:
* everyone's purchases
* split into groceries, dining out, utilities & entertainment

What are some ways to use KNN to create this model?
What aspect of KNN would be useful?


My hypothesis would be that the greatest predictor of whether a person will pay their credit card on time would be whether they paid their credit card the previous month.



A few ways that this problem is particularly suited for KNN (and vice versa):
* Common metric of dollars. While one type of spending may be more indicative of irresponsible spending, the units are the same across the available data.
* All of the features are continuous but also bounded: there is a reasonable limit to what someone could spend in a month.
* Whatever model we create will be classifying based on how close the behavior profile matches. This is a situation where there is likely a strong pattern for when 
* I don't know how often people miss credit card payments. Does KNN do a better job handling rare occurrences or sampling imbalances than other models?


I would try a few models with different normalization techniques and transformations of the expense items to see which best captures the spending habits of either a month where a bill is not paid or a person who does not pay their bills.
1. If all we really know are the absolute amounts a person spent in these categories, I would calculate the percent of spend in each category to account for the wide variation in potential spending volume. A person could have relatively high spending and still be managing their finances responsibly: e.g., a person spending \$1,000 on entertainment a month looks very different if their total monthly spending is \$2,000 vs. \$10,000.
2. May want to pair dining-out & groceries since they are both food related. Or create a "fun" metric combining dining-out and entertainment. 
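
The two ideas above can be sketched in pandas. The spending figures and column names here are made up for illustration (they mirror the \$2,000 vs. \$10,000 example), so treat this as a rough sketch rather than a real feature pipeline.

```python
import pandas as pd

# Hypothetical monthly spending in dollars for two cardholders.
# Both spend $1,000 on entertainment, but their totals differ a lot.
spend = pd.DataFrame({
    'groceries':     [400, 3000],
    'dining_out':    [300, 3000],
    'utilities':     [300, 3000],
    'entertainment': [1000, 1000],
})

# Idea 1: convert absolute dollars to each category's share of total spend.
total = spend.sum(axis=1)
share = spend.div(total, axis=0)

# Idea 2: combined features built from the shares.
share['food'] = share['groceries'] + share['dining_out']
share['fun'] = share['dining_out'] + share['entertainment']

print(share[['entertainment', 'food', 'fun']])
```

Entertainment is half of the first person's spending but only a tenth of the second person's, even though the dollar amount is identical.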


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
%matplotlib inline

music = pd.DataFrame()
music['duration'] = [184, 134, 243, 186, 122, 197, 294, 382, 102, 264,
                     205, 110, 307, 110, 397, 153, 190, 192, 210, 403,
                     164, 198, 204, 253, 234, 190, 182, 401, 376, 102]
music['loudness'] = [18, 34, 43, 36, 22, 9, 29, 22, 10, 24,
                     20, 10, 17, 51, 7, 13, 19, 12, 21, 22,
                     16, 18, 4, 23, 34, 19, 14, 11, 37, 42]
music['jazz'] = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0,
                 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
                 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
```

# Tuning KNN
K-nearest neighbors (KNN) can be tuned in two major ways:
1. How we handle distance
2. How many neighbors we include

## Distance & Normalizing
* Euclidean distance assumes that all units are equal (e.g. 1 second away is the same as 1 unit of loudness away)
* In reality units are rarely equivalent
* This makes binary or categorical variables very difficult to include meaningfully in a KNN model
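
To see the unit problem concretely, here is a small sketch using the first two songs from the music data above. The `euclidean` helper is my own illustration, not part of any library.

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance between two points given as sequences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# First two songs as (duration, loudness): (184 s, 18) and (134 s, 34).
in_seconds = euclidean([184, 18], [134, 34])

# Same two songs, with duration expressed in minutes instead of seconds.
in_minutes = euclidean([184 / 60, 18], [134 / 60, 34])

print(round(in_seconds, 2))  # about 52.5, dominated by the 50-second gap
print(round(in_minutes, 2))  # about 16.02, dominated by the loudness gap
```

Nothing about the songs changed, only the unit of one feature, yet the distance and the feature that drives it are completely different.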

**Normalization** is used to make differing units of measurement more comparable. Two main techniques are effective with KNN:
1. Set the bounds of the data to 0 and 1, then **rescale** every variable to fall within those bounds. This is best if the data shows a linear relationship, such that scaling to a 0 to 1 range makes sense
2. **Z-scores**: calculate how far each observation is from the mean, expressed in number of standard deviations. This puts everything in terms of how "abnormal" each datapoint is
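
Both techniques can be sketched directly in pandas on the two music features. This recomputes the data so the block runs on its own. Using the population standard deviation (`ddof=0`) is my choice here, and the sample version would work just as well for this purpose.

```python
import pandas as pd

features = pd.DataFrame({
    'duration': [184, 134, 243, 186, 122, 197, 294, 382, 102, 264,
                 205, 110, 307, 110, 397, 153, 190, 192, 210, 403,
                 164, 198, 204, 253, 234, 190, 182, 401, 376, 102],
    'loudness': [18, 34, 43, 36, 22, 9, 29, 22, 10, 24,
                 20, 10, 17, 51, 7, 13, 19, 12, 21, 22,
                 16, 18, 4, 23, 34, 19, 14, 11, 37, 42],
})

# 1. Rescale: map each column so its minimum is 0 and its maximum is 1.
rescaled = (features - features.min()) / (features.max() - features.min())

# 2. Z-scores: distance from the column mean in standard deviations.
zscores = (features - features.mean()) / features.std(ddof=0)

print(rescaled.describe().loc[['min', 'max']])
print(zscores.describe().loc[['mean', 'std']])
```

In a real model the minimum, maximum, mean, and standard deviation should come from the training data only and then be applied to the test data.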

## Weighting (by distance?)
* Sometimes the $k$ nearest observations are not all similarly close to the test point. In this case it may help to weight by distance
* Functionally, this weights each neighbor by the inverse of its distance, so closer datapoints get a higher weight, in proportion to how much closer they are to the point being classified
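
A toy sketch of inverse-distance voting, with a made-up neighborhood of three songs. The `weighted_vote` helper is hypothetical and assumes no neighbor sits at distance 0. In scikit-learn the same idea is available through `weights='distance'` on `KNeighborsClassifier`.

```python
from collections import Counter

def weighted_vote(distances, labels):
    """Each neighbor votes for its label with weight 1 / distance."""
    totals = {}
    for d, label in zip(distances, labels):
        totals[label] = totals.get(label, 0.0) + 1.0 / d
    winner = max(totals, key=totals.get)
    return winner, totals

# Hypothetical k=3 neighborhood: one very close jazz song (label 1)
# and two more distant non-jazz songs (label 0).
distances = [1.0, 4.0, 5.0]
labels = [1, 0, 0]

majority = Counter(labels).most_common(1)[0][0]   # plain majority vote
winner, totals = weighted_vote(distances, labels)  # distance-weighted vote

print(majority, winner, totals)
```

The plain vote picks non-jazz (two votes to one), while the weighted vote picks jazz because the single close neighbor outweighs the two distant ones (1.0 vs. 0.25 + 0.2).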

## Choosing K
* Choosing $k$ is a tradeoff
    * the larger $k$ is, the more smoothed out your decision space will be, with more observations voting on each prediction
    * a smaller $k$ will pick up more subtle deviations, but these could be noise, leading to overfitting
* Weighting adds another dimension to the $k$ tradeoff

The best technique is to try multiple models and use k-fold cross-validation to see how each KNN model performs.
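
A rough sketch of that comparison with scikit-learn, assuming it is installed. The candidate values of $k$, the 5 folds, and the z-score scaling step are my own choices for illustration, not settings taken from the course material.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

music = pd.DataFrame({
    'duration': [184, 134, 243, 186, 122, 197, 294, 382, 102, 264,
                 205, 110, 307, 110, 397, 153, 190, 192, 210, 403,
                 164, 198, 204, 253, 234, 190, 182, 401, 376, 102],
    'loudness': [18, 34, 43, 36, 22, 9, 29, 22, 10, 24,
                 20, 10, 17, 51, 7, 13, 19, 12, 21, 22,
                 16, 18, 4, 23, 34, 19, 14, 11, 37, 42],
    'jazz':     [1, 0, 0, 0, 1, 1, 0, 1, 1, 0,
                 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
                 1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})
X = music[['duration', 'loudness']]
y = music['jazz']

# Try several k values with and without distance weighting.
results = {}
for k in (1, 3, 5, 9):
    for weights in ('uniform', 'distance'):
        # Scaling lives inside the pipeline so each fold is normalized
        # using only its own training portion.
        model = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=k, weights=weights),
        )
        scores = cross_val_score(model, X, y, cv=5)
        results[(k, weights)] = scores.mean()

best = max(results, key=results.get)
for setting, accuracy in sorted(results.items()):
    print(setting, round(accuracy, 3))
print('best:', best)
```

With only 30 songs the fold scores will be noisy, so small differences between settings should not be over-read.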