# <font color= 'darkorchid'> Classification Memos </font> 

## <font color= 'mediumorchid'> Preprocessing </font> 

* **classes** : 
    * encode the target class as 0 or 1. <font color='red'> **1 = positive** </font>  
    * check the balance between classes: does each class have the same number of samples? *ex: the same number of sick and healthy patients*
    if the classes are imbalanced: consider another model, such as an SVM 

* **categorical features** : convert to binary 0/1 values or to ordinal values. Use ordinal values only when the categories have a meaningful order, so that the distance between the numbers makes sense.

* **continuous features** : apply a scaler such as `StandardScaler()` to center the values and scale them to unit variance 
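
The preprocessing steps above can be sketched in plain Python so the arithmetic stays visible. This is only an illustration: the labels, the ages, and the helper names `fit_scaler` and `transform` are hypothetical. In practice a scikit-learn scaler such as `StandardScaler` would do the centering and scaling.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical raw labels: encode the positive class ("sick") as 1.
labels = ["sick", "healthy", "sick", "healthy", "healthy"]
y = [1 if label == "sick" else 0 for label in labels]

# Class balance: how many samples fall in each class?
class_counts = Counter(y)
print(class_counts)

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Center on mu and divide by sigma."""
    return [(v - mu) / sigma for v in values]

# Hypothetical continuous feature: the statistics come from the training set only.
age_train = [30, 40, 50, 60, 70]
age_test = [50, 80]

mu, sigma = fit_scaler(age_train)
age_train_scaled = transform(age_train, mu, sigma)
age_test_scaled = transform(age_test, mu, sigma)  # reuse the training statistics

print(age_train_scaled)
print(age_test_scaled)
```

The test set is scaled with the mean and standard deviation learned on the training set, so no information from the test samples leaks into preprocessing.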

## <font color= 'mediumorchid'> Classification with the K-nearest neighbors (KNN) </font> 

1) At training time, it simply memorizes the features $X$ and classes $y$ of all the training samples. 

2) At test time, given the features of one sample $x'$, it identifies the $k$ training samples $x_i, i \in 1,\dots,k$ that are closest to $x'$ (in Euclidean distance) and assigns the class $y'$ that is most frequent among the k-nearest neighbor classes $y_i, i \in 1,\dots,k$.

So <font color='red'> **each test sample is assigned a probability, and KNN simply predicts 1 if this probability is higher than 0.5** </font>

*ex: the probability of having a heart disease is the proportion of the k-nearest training samples that have a heart disease.*
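
The two steps above can be sketched as a few lines of plain Python. This is a minimal illustration, not scikit-learn's `KNeighborsClassifier`: the function names and the toy data are hypothetical, and ties in distance are not handled specially.

```python
from math import dist

def knn_predict_proba(X_train, y_train, x_new, k):
    """Proportion of the k nearest training samples whose class is 1."""
    # Sort the training indices by Euclidean distance to x_new.
    order = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x_new))
    neighbor_classes = [y_train[i] for i in order[:k]]
    return sum(neighbor_classes) / k

def knn_predict(X_train, y_train, x_new, k, threshold=0.5):
    """Predict 1 when the neighbor proportion is higher than the threshold."""
    return int(knn_predict_proba(X_train, y_train, x_new, k) > threshold)

# Hypothetical, already scaled 2-D features; 1 = sick, 0 = healthy.
X_train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y_train = [0, 0, 0, 1, 1, 1]

x_new = (0.8, 0.9)
proba = knn_predict_proba(X_train, y_train, x_new, k=3)
label = knn_predict(X_train, y_train, x_new, k=3)
print(proba, label)
```

With `k=3` the three closest training samples are all sick, so the probability is 1.0 and the predicted class is 1. With `k=5` two healthy samples enter the neighborhood and the probability drops to 0.6.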

### <font color= 'plum'> **Confusion matrix** </font>
* *true positives* (TP) : the number of patients that have been correctly classified as having a disease 
* *true negatives* (TN) : the number of patients that have been correctly classified as not having a disease 
* *false positives* (FP) : the number of patients that have been incorrectly classified as having a disease 
* *false negatives* (FN) : the number of patients that have been incorrectly classified as not having a disease 

/!\ the <font color= 'blue' > true/false </font> tells whether <font color= 'blue' > the prediction matches the *true* class of the test sample </font>, whereas <font color= 'green'> the positive/negative </font> refers to <font color= 'green'> the *predicted* class </font> output by the classifier.

The confusion matrix gives these four numbers in the following format:

|  | predicted 0 | predicted 1 |
|--|--|--|
| **true 0** |TN|FP|
| **true 1** |FN|TP|

The goal: maximize TP and TN, which means decreasing FP and FN.

`confusion_matrix(y_test, y_test_pred)`             
*always: argument 1 gives the rows and argument 2 gives the columns of the confusion matrix* 
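
As a rough illustration of how the four counts are obtained, the sketch below tallies them by hand in the same layout as the table above (rows = true class, columns = predicted class). The function name `confusion_counts` and the labels are hypothetical; it is not the scikit-learn implementation.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows = true class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for true_class, predicted_class in zip(y_true, y_pred):
        matrix[true_class][predicted_class] += 1
    return matrix

# Hypothetical labels, 1 = sick.
y_test      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_test_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_test, y_test_pred)
print(tn, fp, fn, tp)
```

Here one healthy patient is flagged as sick (FP = 1) and one sick patient is missed (FN = 1), while the other eight samples sit on the diagonal.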

### <font color= 'plum'> **predict proba** </font>

`y_test_proba = knn_clf.predict_proba(X_test_scaled)`                                   
Results :                                                       
```python
array([[0.93333333, 0.06666667],
       [0.        , 1.        ],
       [0.86666667, 0.13333333],
       [0.33333333, 0.66666667],
       [0.2       , 0.8       ],
       [0.26666667, 0.73333333],
       [0.66666667, 0.33333333]])
```
Explanation : 
* left : probability of being in class 0 --> negative *not sick* 
* right : probability of being in class 1 --> positive *sick*  


### <font color= 'plum'> **threshold** </font>

As mentioned, each test sample is assigned a probability, and KNN simply predicts 1 if this probability is higher than 0.5. 
* if we increase the threshold : we usually increase the precision (and decrease the recall)
* if we decrease the threshold : we increase the recall (and usually decrease the precision)
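
To make the effect of the threshold concrete, the sketch below applies three thresholds to the class-1 column of the `predict_proba` output shown above. The helper name `apply_threshold` and the chosen thresholds are illustrative assumptions.

```python
# Class-1 (right) column of the predict_proba output shown above.
proba_class_1 = [0.06666667, 1.0, 0.13333333, 0.66666667, 0.8, 0.73333333, 0.33333333]

def apply_threshold(probas, threshold):
    """Predict 1 when the class-1 probability is higher than the threshold."""
    return [int(p > threshold) for p in probas]

predictions = {t: apply_threshold(proba_class_1, t) for t in (0.3, 0.5, 0.75)}
for threshold, y_pred in predictions.items():
    print(threshold, y_pred, "predicted positives:", sum(y_pred))
```

Lowering the threshold from 0.5 to 0.3 turns the last sample into a predicted positive (5 positives instead of 4), while raising it to 0.75 keeps only the 2 most confident positives.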

## <font color= 'mediumorchid'> Scoring of classification </font> 

### <font color='plum'> **Accuracy_score** </font>
         
`accuracy_score(y_test, y_test_pred)`

* Is calculated on the <font color='coral'> **predicted classes** </font>
* It measures how many observations, both positive and negative, were correctly classified.               
* Need to apply a certain threshold before computing it.   

So, when does it make sense to use it? 
* When your problem is balanced, using accuracy is usually a good start. 
* An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project.
* When every class is equally important to you.

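*ex : accuracy = 0.81 means that 81% of the test samples are correctly classified (the TP and TN on the diagonal of the confusion matrix). The rest is misclassified, but accuracy does not tell us whether as FP or FN.* 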
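
A small worked example of the formula accuracy = (TP + TN) / (TP + TN + FP + FN), and of why accuracy is mostly trusted on balanced problems. All counts below are hypothetical.

```python
def accuracy(tn, fp, fn, tp):
    """Share of samples on the diagonal of the confusion matrix."""
    return (tp + tn) / (tn + fp + fn + tp)

# Balanced problem: 50 healthy and 51 sick patients would also work; here 49 vs 51.
balanced = accuracy(tn=40, fp=9, fn=10, tp=41)

# Imbalanced problem: 95 healthy, 5 sick, and a model that always predicts 0.
always_negative = accuracy(tn=95, fp=0, fn=5, tp=0)

print(balanced)         # 81 of 100 samples are correct
print(always_negative)  # 95 of 100 samples are correct, yet no sick patient is found
```

The second model reaches a higher accuracy than the first one while missing every sick patient, which is why accuracy alone can mislead when the classes are imbalanced.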

### <font color='plum'> **Recall** </font>
Important when we want to **minimize the FN**, i.e. *people who are sick but predicted as not sick*.  
This comes at the expense of more FP (*people who are not sick but predicted as sick*), which decreases **precision**.  
This means we'd prefer **to have a higher Recall, at the cost of having a lower Precision**.  
And that implies **choosing a threshold that is lower than 0.5 for assigning the classes.**  
*i.e. : if the threshold is lower than 0.5, more test samples are going to be classified as 1 = sick people*

`recall_score(y_test, y_test_pred)`

* Is calculated on the <font color='coral'> **predicted classes** </font>
* **I don't want to miss any +**, even if that means flagging some samples that are not + (FP increases). I want to catch all the +, even if I detect some - as +. 
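
Recall is TP / (TP + FN). The sketch below, on hypothetical labels and probabilities, shows how a threshold below 0.5 removes the false negatives at the price of extra false positives. The helper name `recall_and_errors` is an assumption made for this example.

```python
def recall_and_errors(y_true, y_pred):
    """Return (recall, FN, FP) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), fn, fp

# Hypothetical test set: 4 sick (1) and 4 healthy (0) patients.
y_test = [1, 1, 1, 1, 0, 0, 0, 0]
y_test_proba = [0.9, 0.7, 0.4, 0.35, 0.6, 0.45, 0.4, 0.1]

pred_default = [int(p > 0.5) for p in y_test_proba]
pred_low = [int(p > 0.3) for p in y_test_proba]

default_result = recall_and_errors(y_test, pred_default)
low_result = recall_and_errors(y_test, pred_low)
print("threshold 0.5:", default_result)
print("threshold 0.3:", low_result)
```

At 0.5 two sick patients are missed (recall 0.5, 1 FP); at 0.3 none are missed (recall 1.0), but three healthy patients are now flagged as sick.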

### <font color='plum'> **Precision** </font>

`precision_score(y_test, y_test_pred)`

* Is calculated on the <font color='coral'> **predicted classes** </font>
* Of all the samples classified as +, what fraction is truly +? In other words, how much can I trust my model when it predicts +? 
* **I want to be really sure when I predict +** : do not predict + if the sample is not +, even if that means missing some samples that are truly + (FN increases) 
* *ex: pregnancy test : be 99.9% sure when the result is + !!* 
* **recall/precision trade-off** : they move in opposite directions 
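
Precision is TP / (TP + FP). The following sketch sweeps a few thresholds over hypothetical probabilities to show the trade-off described above. The helper name `precision_recall` and the data are illustrative only.

```python
def precision_recall(y_true, y_score, threshold):
    """Threshold the scores, then compute precision and recall for class 1."""
    y_pred = [int(s > threshold) for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test set: 4 sick (1) and 4 healthy (0) patients.
y_test = [1, 1, 1, 1, 0, 0, 0, 0]
y_test_proba = [0.9, 0.7, 0.4, 0.35, 0.6, 0.45, 0.4, 0.1]

results = {t: precision_recall(y_test, y_test_proba, t) for t in (0.3, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data precision rises from 0.57 to 0.67 to 1.00 as the threshold goes up, while recall falls from 1.00 to 0.50 to 0.25.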

### <font color='plum'> **F1_score** </font>

`f1_score(y_test, y_test_pred)`

* Is calculated on the <font color='coral'> **predicted classes** </font>
* compromise between precision and recall 
* computed as the harmonic mean of the two 
* beta (F-beta score, `fbeta_score`) : 
    * with the F1 score, we care equally about recall and precision; 
    * with the F2 score, recall is twice as important to us.
    * with 0<beta<1, we care more about precision, so the optimal threshold moves toward higher values 
    * when beta > 1, we care more about recall, so the optimal threshold moves toward lower values
    * when beta = 1, it is somewhere in the middle.
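
The role of beta can be checked with the usual formula F_beta = (1 + beta²) · P · R / (beta² · P + R). The two models below are hypothetical: one favors recall, the other favors precision.

```python
def fbeta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical models with mirrored precision/recall.
high_recall = {"precision": 0.5, "recall": 1.0}
high_precision = {"precision": 1.0, "recall": 0.5}

for beta in (0.5, 1, 2):
    score_r = fbeta(high_recall["precision"], high_recall["recall"], beta)
    score_p = fbeta(high_precision["precision"], high_precision["recall"], beta)
    print(f"beta={beta}: high-recall model={score_r:.3f}, high-precision model={score_p:.3f}")
```

F1 cannot separate the two models (both score about 0.667). F2 prefers the high-recall model (0.833 vs 0.556), and F0.5 prefers the high-precision one by the same margin.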


### <font color='plum'> **average precision score / PR AUC** </font>
`average_precision_score(y_test, y_score)`

* Is calculated on the <font color='mediumturquoise'> **prediction score/proba** </font>
* *y_score* : target scores; these can be probability estimates of the positive class, confidence values, or non-thresholded decision values
* takes into account the trade-off between precision and recall
* precision-recall curve : a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher the curve sits on the y-axis, the better your model's performance.
* you can calculate the area under the precision-recall curve (PR AUC) to get one number that describes model performance : the average precision (AP) 
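
One common way to summarize the precision-recall curve is the step-wise sum AP = Σ (R_n − R_{n−1}) · P_n, which, as far as I understand, is the definition scikit-learn documents for `average_precision_score`. The sketch below is a plain Python version of that sum, assuming all scores are distinct (ties are not handled). The function name and data are hypothetical.

```python
def average_precision(y_true, y_score):
    """Step-wise AP: mean of the precision reached at each positive sample.

    Assumes all scores are distinct; tied scores are not handled.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall increases by 1 / n_pos here; precision at this point is tp / rank.
            total += tp / rank
    return total / n_pos

# Hypothetical test set: 4 sick (1) and 4 healthy (0) patients, distinct scores.
y_test = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.35, 0.6, 0.45, 0.2, 0.1]

ap = average_precision(y_test, y_score)
print(round(ap, 3))
```

The scores are used directly, without picking a threshold: each positive contributes the precision obtained when the threshold is set just low enough to include it.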


### <font color='plum'> **ROC AUC score** </font>
`roc_auc_score(y_test, y_score)`

* Is calculated on the <font color='mediumturquoise'> **prediction score/proba** </font>
* ROC : a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). For every threshold, we calculate TPR and FPR and plot them on one chart.
* AUC : compute the area under the curve: **maximize the AUC = reach a high TPR while keeping the FP low** 
* threshold-independent score that takes the FP into account! 
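
The sketch below computes one (TPR, FPR) point of the ROC chart and then the AUC through its ranking interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. This pairwise loop is an illustration on hypothetical data, not how `roc_auc_score` is implemented.

```python
def tpr_fpr(y_true, y_score, threshold):
    """One point of the ROC chart for a given threshold."""
    y_pred = [int(s > threshold) for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg

def roc_auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical test set: 4 sick (1) and 4 healthy (0) patients.
y_test = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.35, 0.6, 0.45, 0.2, 0.1]

point = tpr_fpr(y_test, y_score, threshold=0.5)
auc = roc_auc_pairwise(y_test, y_score)
print(point, auc)
```

On this data the threshold 0.5 gives TPR = 0.5 and FPR = 0.25, and 12 of the 16 positive/negative pairs are ranked correctly, so the AUC is 0.75. Healthy patients with high scores are exactly what pulls the AUC down.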
