# Final modelling notebook

This is the final notebook, with all the models explained. We will take two different approaches:
- Using only the features extracted on **final_notebook_feature_extraction**
- Using the features extracted + features extracted with TFIDF transformation

We will use six types of machine learning models in total (one benchmark plus four other models in each approach), and on some of them we will apply GridSearchCV:
- Logistic regression as benchmark for the first approach
- Passive aggressive classifier as benchmark for the second approach
- K-nearest Neighbour
- Decision Tree
- Random Forest
- XGBoost

We will use the following metrics to evaluate the models:
- Classification accuracy
- Logarithmic loss
- Confusion matrix
- Recall or sensitivity
- Specificity
- Area under curve (AUC)
- F1 Score

## Index

- [1. Evaluation metrics explained](#1.-Evaluation-metrics-explained)
- [2. First approach: Only features extracted](#2.-First-approach:-Only-features-extracted)
    - Logistic regression as benchmark for the first approach
    - K-nearest Neighbour
    - Decision Tree
    - Random Forest
    - XGBoost
- 3. Second approach
    - Passive aggressive classifier as benchmark for the second approach
    - K-nearest Neighbour
    - Decision Tree
    - Random Forest
    - XGBoost
- 4. Choose the best model
- 5. Pickle the best model

## 1. Evaluation metrics explained

### Classification accuracy

Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions (True Positives + True Negatives) to the total number of input samples:

$$Accuracy = \frac{True Positives + True Negatives}{TotalExamples}$$
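As a minimal sketch of the formula above, accuracy can be computed by counting matching labels. The helper name `accuracy` and the toy labels are illustrative assumptions, not data from this project. In practice a library routine such as scikit-learn's `accuracy_score` would typically be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # A prediction is correct when it equals the true label (TP or TN).
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 correct predictions out of 8
```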

### Logarithmic loss
Logarithmic Loss, or Log Loss, works by penalising false classifications, and it penalises confident wrong predictions most heavily. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for every sample. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as below:

$$Logarithmic Loss = \frac{-1}{N}\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{M} y_{ij} \times \log(p_{ij})$$

Here $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.
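The formula above can be sketched in a few lines of plain Python. Because each sample belongs to exactly one class, the inner sum reduces to the log of the probability assigned to the true class. The function name `log_loss`, the clipping constant `eps`, and the toy probabilities are assumptions made for illustration.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0 .. M-1)
    probs:  list of per-sample probability lists, one entry per class
    """
    total = 0.0
    for label, sample_probs in zip(y_true, probs):
        # Clip to avoid log(0) when a probability is exactly 0 or 1.
        p_true = min(max(sample_probs[label], eps), 1 - eps)
        total += math.log(p_true)
    return -total / len(y_true)


# Toy example with N = 3 samples and M = 2 classes (invented values).
y_true = [0, 1, 1]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

loss = log_loss(y_true, probs)
print(round(loss, 4))

# A confident wrong prediction is penalised far more than a confident right one.
confident_right = log_loss([1], [[0.01, 0.99]])
confident_wrong = log_loss([1], [[0.99, 0.01]])
```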

### Confusion Matrix
A confusion matrix is a matrix representation of the prediction results of a binary test. It is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

![ConfusionMatrix](../imgs/ConfusionMatrix.png)

Important terms:

- True Positives : The cases in which we predicted YES and the actual output was also YES.
- True Negatives : The cases in which we predicted NO and the actual output was NO.
- False Positives : The cases in which we predicted YES and the actual output was NO.
- False Negatives : The cases in which we predicted NO and the actual output was YES.
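The four terms above can be counted directly from a pair of label lists. This is a minimal sketch: the helper `confusion_counts`, the choice of `1` as the positive (YES) label, and the toy data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted YES: correct if the actual label is also YES.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted NO: wrong if the actual label is YES.
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Toy labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

The four counts always add up to the number of samples, which is a quick sanity check when building the matrix by hand.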

### Recall or Sensitivity
Recall or Sensitivity gives us the True Positive Rate (TPR), which is defined as TP / (FN+TP). True Positive Rate corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points.

![ConfusionMatrixSensitivity](../imgs/ConfusionMatrixSensitivity.png)

$$True Positive Rate = \frac{True Positive}{False Negative + True Positive}$$
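A short sketch of the True Positive Rate computed from labels, assuming `1` marks the positive class. The function name `recall` and the toy labels are hypothetical.

```python
def recall(y_true, y_pred, positive=1):
    """True Positive Rate: TP / (FN + TP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp + fn == 0:
        # No positive samples at all: recall is undefined, return 0.0 by convention here.
        return 0.0
    return tp / (tp + fn)


# Four actual positives, three of which are found (invented values).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

tpr = recall(y_true, y_pred)
print(tpr)
```

Note that the false positive in the last position does not affect recall, since only the actual positives enter the formula.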
                                                                       
### Specificity
Specificity gives us the True Negative Rate (TNR), which is defined as TN / (FP+TN). True Negative Rate corresponds to the proportion of negative data points that are correctly considered as negative, with respect to all negative data points.

![ConfusionMatrixSpecifity](../imgs/ConfusionMatrixSpecifity.png)

$$True Negative Rate = \frac{True Negative}{True Negative + False Positive}$$
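The True Negative Rate mirrors recall, but over the actual negatives. The sketch below assumes `1` is the positive label and everything else is negative. The function name `specificity` and the toy labels are hypothetical.

```python
def specificity(y_true, y_pred, positive=1):
    """True Negative Rate: TN / (TN + FP)."""
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    if tn + fp == 0:
        # No negative samples at all: specificity is undefined, return 0.0 by convention here.
        return 0.0
    return tn / (tn + fp)


# Five actual negatives, one of which is wrongly flagged as positive (invented values).
y_true = [0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1, 1]

tnr = specificity(y_true, y_pred)
print(tnr)
```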


### Area under curve (AUC)
Area Under Curve (AUC) is one of the most widely used evaluation metrics for binary classification problems. The curve in question is the ROC curve. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.
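The ranking interpretation above can be turned into code directly: compare every positive example with every negative example and count how often the positive one receives the higher score. This is a brute-force sketch for small inputs, with ties counted as half a win. The function name `auc_by_ranking` and the toy scores are assumptions. For real datasets a library routine such as scikit-learn's `roc_auc_score` is the usual choice.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores, invented for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_ranking(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```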

### Precision
Precision is the number of correct positive results divided by the number of positive results predicted by the classifier.

$$Precision = \frac{True Positives}{True Positives + False Positives}$$
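As a small illustration of the formula above, precision can be computed from the True Positive and False Positive counts. The function name `precision` and the counts used are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive: TP / (TP + FP)."""
    if tp + fp == 0:
        # The classifier predicted no positives: precision is undefined, return 0.0 by convention here.
        return 0.0
    return tp / (tp + fp)


# Suppose the classifier flagged 10 samples as positive and 8 of them were right (invented counts).
prec = precision(tp=8, fp=2)
print(prec)
```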


### F1 Score
F1 Score is used to measure a test's accuracy. F1 Score is the harmonic mean of precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it predicts as positive really are positive), as well as how robust it is (it does not miss a significant number of instances).

High precision with low recall gives you an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model. Mathematically, it can be expressed as:

$$F1 = 2 \times \frac{1}{\frac{1}{precision} + \frac{1}{recall}}$$

F1 Score tries to find the balance between precision and recall.
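To see how the harmonic mean balances the two quantities, here is a minimal sketch. The function name `f1_from` and the precision and recall values are invented for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    # Algebraically the same as 2 * 1 / (1/precision + 1/recall).
    return 2 * precision * recall / (precision + recall)


# A classifier with good precision but mediocre recall (invented values).
f1 = f1_from(precision=0.8, recall=0.5)
arithmetic_mean = (0.8 + 0.5) / 2

print(round(f1, 4), arithmetic_mean)
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot reach a high F1 Score by excelling at precision alone or recall alone.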

## 2. First approach: Only features extracted