In [1]:
!wget -nc -O img.png "https://camo.githubusercontent.com/af5b58e7980908cb5b46b6f7bd6ca48cd8eab026e31d942a199cfbaaba833637/68747470733a2f2f63662d636f75727365732d646174612e73332e75732e636c6f75642d6f626a6563742d73746f726167652e617070646f6d61696e2e636c6f75642f49424d446576656c6f706572536b696c6c734e6574776f726b2d4d4c30313031454e2d536b696c6c734e6574776f726b2f6c6162732f4d6f64756c65253230332f696d616765732f4b4e4e5f4469616772616d2e706e67"

File ‘img.png’ already there; not retrieving.


# K-Nearest Neighbors (KNN)

- In this method, we look at the K nearest neighbors of the target point and assign its label based on theirs (typically by majority vote).
- Choosing K: a low K is sensitive to noise and tends to over-fit; a high K makes the model too general. Try several values of K, evaluate each on a test set, and keep the one that performs best.
- KNN can also be used to predict a continuous target (regression).

<img src='./img.png'>
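
As a rough illustration of the points above, here is a minimal pure-Python sketch of KNN classification and of comparing several values of K on a held-out set. The toy data, the deliberately mislabeled "noise" point, and the helper names `knn_predict` and `accuracy_for_k` are all assumptions made up for this example; they are not part of the original notebook.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two clusters plus one mislabeled point near cluster B.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6), (5.5, 5.5)]
train_y = ["A", "A", "A", "B", "B", "B", "A"]  # last label is the noise point

test_X = [(2, 2), (5, 5), (1.5, 1.5), (6.5, 6.5)]
test_y = ["A", "B", "A", "B"]

def accuracy_for_k(k):
    """Fraction of test points classified correctly with the given k."""
    predictions = [knn_predict(train_X, train_y, x, k) for x in test_X]
    return sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)

# Try several values of K and compare them on the test set.
scores = {k: accuracy_for_k(k) for k in (1, 3, 7)}
print(scores)
```

On this toy data, K = 1 is misled by the noisy point, K = 7 uses every training point and always predicts the majority class, and K = 3 sits in between and classifies all four test points correctly.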

## Evaluation methods:
 - Jaccard index
 - F1-score
 - Log Loss
 <hr/>

## Jaccard Index
$ y $ : Actual Labels  
$ \hat{y} $ : Predicted Labels  
$ J (y , \hat{y}) = \frac{| y \cap  \hat{y}| }{ | y \cup \hat{y} | } = \frac { | y \cap  \hat{y} | } {|y| + |\hat{y}| - | y \cap  \hat{y}|}  $

$ y $ : [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]  
$ \hat{y} $ : [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  
  
There are 10 labels in total, and 8 of them are predicted correctly:  
$ J(y, \hat{y}) = \frac {8} {10 + 10 - 8} = \frac{8}{12} \approx 0.67 $  
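
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the formula as written in this section (matching positions count as the intersection); the variable names are chosen for illustration.

```python
y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Size of the intersection: positions where prediction and actual label agree.
matches = sum(t == p for t, p in zip(y, y_hat))

# |y| + |y_hat| - |intersection| is the size of the union.
jaccard = matches / (len(y) + len(y_hat) - matches)

print(matches, round(jaccard, 4))  # 8 0.6667
```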

`sklearn.metrics.jaccard_score` computes the score for each label and can return an average of them.
The averaging method is set with the `average` parameter:
- None, the scores for each class are returned.

- 'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

- 'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.

- 'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

- 'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.

- 'samples':
Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
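
To make the difference between the averaging options more concrete, the sketch below computes the per-class, macro and micro Jaccard scores by hand from true positive, false positive and false negative counts. The helper `class_counts` is a hypothetical name introduced here, and the code assumes every class appears at least once, so no denominator is zero.

```python
y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

counts = {label: class_counts(y, y_hat, label) for label in (0, 1)}

# Per-class score (what average=None reports): TP / (TP + FP + FN).
per_class = {label: tp / (tp + fp + fn) for label, (tp, fp, fn) in counts.items()}

# 'macro': unweighted mean of the per-class scores.
macro = sum(per_class.values()) / len(per_class)

# 'micro': pool the counts over all classes first, then apply the formula once.
tp_all = sum(c[0] for c in counts.values())
fp_all = sum(c[1] for c in counts.values())
fn_all = sum(c[2] for c in counts.values())
micro = tp_all / (tp_all + fp_all + fn_all)

print(per_class)        # class 0 -> 3/5, class 1 -> 5/7
print(round(macro, 4))  # 0.6571
print(round(micro, 4))  # 0.6667
```

With `average='binary'` and the positive class set to 1, only the class 1 score (5/7) would be reported.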

In [2]:
from sklearn.metrics import jaccard_score

y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

jaccard_score(y_true=y, y_pred=y_hat, average='micro')

0.6666666666666666

## F1-score
- TP (True Positive):
    - we predict True and the actual value is True
- TN (True Negative):
    - we predict False and the actual value is False
- FP (False Positive):
    - we predict True __but__ the actual value is False
- FN (False Negative):
    - we predict False __but__ the actual value is True

<small>
    Sometimes one kind of error matters more than the other. For example, we may not mind taking preventive measures against something that never happens (FP),
    but we do not want something unexpected to happen without warning (FN).
</small>

- Precision = TP / (TP + FP)  
- Recall = TP / (TP + FN)  
- F1-score = $ 2 \times \frac {Precision \times Recall}{Precision + Recall} $
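
As a worked example, the sketch below counts TP, FP and FN for the same labels used in the Jaccard section, treating 1 as the positive class, and combines precision and recall into their harmonic mean. Variable names are chosen for illustration only.

```python
y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, actually 1
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, actually 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, actually 1

precision = tp / (tp + fp)  # 5 / 7
recall = tp / (tp + fn)     # 5 / 5

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)    # 5 2 0
print(round(f1, 4))  # 0.8333
```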

In [3]:
from sklearn.metrics import f1_score

y = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

f1_score(y , y_hat)

0.8333333333333333

## Log Loss

Because the classifier outputs a categorical label, the label alone hides how confident each prediction was.
For example, imagine we put person1 and person2 in group 1, but we were very certain about person1 and not so certain about person2.
Log loss measures this error using the predicted probabilities instead of the labels.

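LogLoss = $ - \frac{1}{n} \sum_{i=1}^{n} \left( y_i \times \log(\hat{y}_i) + (1 - y_i) \times \log(1 - \hat{y}_i) \right) $  
where $ \hat{y}_i $ is here the predicted probability that sample $ i $ belongs to class 1.  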
  
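LogLoss $ \ge 0 $: it is 0 for perfect predictions and has no upper bound, since a confident wrong prediction can push it above 1.

A lower log loss means better predictions.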
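
The formula can be sketched directly with the standard `math` module. The helper name `log_loss_manual` and the two probabilities in the person1/person2 example are assumptions for illustration; the code also assumes every probability lies strictly between 0 and 1, so no logarithm of zero occurs.

```python
import math

def log_loss_manual(y_true, y_prob):
    """Mean binary log loss; y_prob holds predicted probabilities of class 1."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)

y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_prob = [0.3, 0.2, 0.1, 0.8, 0.9, 0.4, 0.7, 0.9, 0.2, 0.85]
loss = log_loss_manual(y_true, y_prob)
print(round(loss, 4))  # 0.2372

# Both people truly belong to group 1, but we are more certain about person1.
person1 = log_loss_manual([1], [0.95])  # hypothetical confident prediction
person2 = log_loss_manual([1], [0.55])  # hypothetical hesitant prediction
print(round(person1, 3), round(person2, 3))  # 0.051 0.598
```

Both people receive the same label, yet the less certain prediction contributes a much larger loss.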

In [12]:
from sklearn.metrics import log_loss


y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0.3, 0.2, 0.1, 0.8, 0.9, 0.4, 0.7, 0.9, 0.2, 0.85] 

log_loss(y_true, y_pred)

0.2372206642057339