### 1. Model (KNN classification)
The model is a classification function:
$$ f: \mathbb{R}^{n} \rightarrow \{c_{1}, c_{2}, \ldots, c_{K}\} $$

### 2. Strategy
The probability of misclassification:
$$ P(Y \neq f(X)) = 1 - P(Y = f(X)) $$
Given an instance $x$, let $N_{k}(x)$ denote the set of its $k$ nearest training samples. If the region covered by $N_{k}(x)$ is assigned class $c_{j}$, then the misclassification rate is:
$$ \frac{1}{k} \sum_{x_{i} \in N_{k}(x)} I(c_{j} \neq y_{i}) $$
$$ = 1 - \frac{1}{k} \sum_{x_{i} \in N_{k}(x)} I(c_{j} = y_{i}) $$
The strategy is to minimize the misclassification rate, that is, to maximize
$$ \sum_{x_{i} \in N_{k}(x)} I(c_{j} = y_{i}) $$
This sum is largest when $c_{j}$ is the most common class among the samples in $N_{k}(x)$, which is exactly the majority voting rule.
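
The majority voting rule can be sketched in a few lines of Python. This is a minimal brute-force illustration, not a tuned implementation: the function name `knn_classify`, the toy data, and the use of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Return (predicted class, misclassification rate inside N_k(x))."""
    # Rank training indices by Euclidean distance to x (assumed metric).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    neighbor_labels = [train_y[i] for i in order[:k]]
    # Majority vote: pick the c_j that maximizes sum of I(c_j = y_i) over N_k(x).
    label, count = Counter(neighbor_labels).most_common(1)[0]
    # Fraction of neighbors that disagree with the chosen class: 1 - count / k.
    return label, 1 - count / k

# Hypothetical toy data: two small clusters in the plane.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9)]
train_y = ["a", "a", "a", "b", "b"]

label, err = knn_classify(train_X, train_y, (1.1, 1.0), k=3)
print(label, err)
```

With $k=3$ all three neighbors belong to class `a`, so the rate inside the neighborhood is 0. With $k=5$ the vote is 3 to 2 and the rate rises to $1 - 3/5$.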

### 3. Algorithm
![alt text](images/2.2.1.png)
![alt text](images/2.2.2.png)

### 4. KNN regression
- Algorithm:
  - Same as KNN classification (1): select the $k$ closest instances $x_{i1}, \ldots, x_{ik}$ and their target values $y_{i1}, \ldots, y_{ik}$;
  - predict the average of those values: $$ y = f(x) = \frac{1}{k} \sum_{j=1}^k y_{ij} $$
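
As a rough sketch of the averaging step, the snippet below predicts a value as the mean of the $k$ nearest targets. The name `knn_regress`, the one-dimensional toy data, and the Euclidean metric are assumptions for illustration only.

```python
import math

def knn_regress(train_X, train_y, x, k):
    """Predict the mean target value of the k nearest training instances."""
    # Rank training indices by Euclidean distance to x (assumed metric).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k

# Hypothetical toy data: the target equals the single feature.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 3.0, 10.0]

pred = knn_regress(train_X, train_y, (1.4,), k=3)
print(pred)  # mean of the targets at x = 1, 2 and 0
```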

### 5. $L_{p}$ or Minkowski distance 
$$ L_{p}(x_{i}, x_{j}) = \left( \sum_{l=1}^n |x_{i}^{(l)}-x_{j}^{(l)}|^{p} \right)^{\frac{1}{p}} $$
when $p=2$, we have Euclidean distance,
$$ L_{2}(x_{i}, x_{j}) = \left( \sum_{l=1}^n |x_{i}^{(l)}-x_{j}^{(l)}|^{2} \right)^{\frac{1}{2}} $$
when $p=1$, we have Manhattan distance,
$$ L_{1}(x_{i}, x_{j}) = \sum_{l=1}^n |x_{i}^{(l)}-x_{j}^{(l)}| $$
when $p=\infty$, we have the Chebyshev distance, the maximum absolute difference along any single coordinate,
$$ L_{\infty}(x_{i}, x_{j}) = \max_{l} |x_{i}^{(l)}-x_{j}^{(l)}| $$
<center class="half">
    <img src="images/2.2.3.png" width="200"/>  <img src="images/2.2.4.png" width="200"/>
</center>
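
The three special cases can be checked numerically with a small helper. This is an illustrative sketch; the function name `minkowski` and the sample points are assumptions, not part of the notes.

```python
import math

def minkowski(a, b, p):
    """L_p distance between two equal-length points; p may be math.inf."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference dominates.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(a, b, 1)           # Manhattan: 3 + 4
d2 = minkowski(a, b, 2)           # Euclidean: sqrt(9 + 16)
dinf = minkowski(a, b, math.inf)  # Chebyshev: max(3, 4)
print(d1, d2, dinf)
```

For the same pair of points the distance shrinks as $p$ grows, so the nearest neighbor of a point can change with the choice of $p$.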

### 6. The choice of $k$
- when $k$ is small, the approximation error decreases (only training instances in a small neighborhood affect the prediction), whereas the estimation error increases (the prediction is very sensitive to the nearby training instances); the model is more complex and more prone to overfitting;
- when $k$ is large, the approximation error increases (training instances from a large neighborhood affect the prediction), whereas the estimation error decreases; the model is simpler and more prone to underfitting;
- use cross-validation to pick the optimal $k$.
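
One way to carry out the cross-validation step is leave-one-out: predict each training point from all the others and count the mistakes for each candidate $k$. The sketch below assumes Euclidean distance, majority voting, and a made-up one-dimensional data set containing a single noisy label. The helper names `predict`, `loo_error` and `choose_k` are hypothetical.

```python
import math
from collections import Counter

def predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def loo_error(X, y, k):
    """Leave-one-out error rate: hold out each point and predict it from the rest."""
    mistakes = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if predict(rest_X, rest_y, X[i], k) != y[i]:
            mistakes += 1
    return mistakes / len(X)

def choose_k(X, y, candidates):
    """Return the candidate k with the lowest error (smallest k on ties)."""
    errors = {k: loo_error(X, y, k) for k in candidates}
    best_k = min(candidates, key=lambda k: (errors[k], k))
    return best_k, errors

# Hypothetical data: cluster "a" near 0, cluster "b" near 1,
# plus one noisy "b" label sitting inside cluster "a" at 0.15.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.15,), (1.0,), (1.1,), (1.2,), (1.3,)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

best_k, errors = choose_k(X, y, [1, 3, 5, 7])
print(best_k, errors)
```

On this toy set $k=1$ is misled by the noisy label (overfitting), while $k=7$ reaches into the other cluster (underfitting), so an intermediate $k$ gives the lowest error.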

### 8. K-D tree
![5](images/2.2.5.png)

<center class="half">
    <img src="images/2.2.6.png" width="200"/>  <img src="images/2.2.7.png" width="200"/>
</center>

![8](images/2.2.8.png)
![8](images/2.2.9.png)
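
A compact sketch of k-d tree construction and nearest-neighbor search is given below. It assumes the common variant that cycles through the axes by depth and splits at the median point. The figures above may use different details, and the names `Node`, `build` and `nearest` are chosen for illustration.

```python
import math

class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Build a k-d tree by splitting at the median along axis = depth mod n."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Descend first into the side of the split that contains the target.
    best = nearest(near, target, best)
    # Backtrack: visit the far side only if the splitting plane is closer
    # than the best point found so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(tree.point, nearest(tree, (3, 4.5)))
```

The backtracking test is what keeps the search exact. Without it the search would stop at the first leaf region it reaches and could return a point that is not the true nearest neighbor.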

### 9. How to make KNN fast
- dimensionality reduction;
- avoid comparing against every training example:
  - k-d trees (for low-dimensional data only), can miss neighbors when the search skips backtracking (approximate search);
  - inverted lists (for high-dimensional, discrete (sparse) data);
  - locality-sensitive hashing, can miss neighbors.
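
To illustrate the inverted-list idea, the sketch below indexes sparse examples by the features they contain and compares the query only against examples that share at least one feature with it. The set-of-features representation, the Jaccard distance, and all function names are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each feature to the ids of the examples that contain it."""
    index = defaultdict(list)
    for doc_id, features in enumerate(docs):
        for f in features:
            index[f].append(doc_id)
    return index

def candidates(index, query):
    """Ids of examples sharing at least one feature with the query."""
    ids = set()
    for f in query:
        ids.update(index.get(f, ()))
    return ids

def knn_sparse(docs, index, query, k):
    """k nearest candidates under Jaccard distance (an assumed metric)."""
    def jaccard_distance(a, b):
        return 1 - len(a & b) / len(a | b)
    cand = candidates(index, query)
    # Sort by distance, breaking ties by id so the result is deterministic.
    return sorted(cand, key=lambda i: (jaccard_distance(docs[i], query), i))[:k]

# Hypothetical sparse examples, each a set of discrete features.
docs = [{"cat", "pet"}, {"dog", "pet"}, {"car", "road"}, {"cat", "dog", "pet"}]
index = build_inverted_index(docs)
result = knn_sparse(docs, index, {"cat", "pet"}, k=2)
print(result)
```

Example 2 shares no feature with the query, so it is never compared. With sparse data most examples fall into this category, which is where the speed-up comes from.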

### 10. Pros and cons
- easy to understand and implement, with no prior assumptions about the data;
- need to handle missing data;
- sensitive to outliers;
- computationally expensive (slow; needs a lot of storage space).