# Model Evaluation


evaluation methods differ depending on whether we have a classification or a regression model.  



## regression
- the goal is to build a model that accurately predicts an unknown case
- we usually split our data into train and test sets so we can detect **over-fitting**
- a cost function measures how well the hypothesis h(x,w) fits the training data.

methods to calculate error:
- <strong> MAE</strong>:
    - <strong> Mean Absolute Error</strong>
    - $ \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat y_{i}| $
    - it takes the absolute value of each error so negative and positive errors don't cancel each other.
    - Gives equal weight to all errors.
    - More robust to outliers than MSE because it doesn't heavily penalize large errors.

    <br/>

- <strong> MSE</strong>:
    - <strong> Mean Squared Error</strong> :
    - $ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $  
    - Measures the average squared differences between actual and predicted values.
    - Squaring the errors gives more weight to larger errors, making it sensitive to outliers.
    - Often used in optimization problems because of its differentiability.
    <br/>

- <strong> RMSE </strong>
    - <strong> Root Mean Squared Error</strong>
    - $ \sqrt{MSE} $
    - Derived from MSE, it represents the square root of the average squared differences between actual and predicted values.
    - Like MSE, it gives more weight to larger errors but is in the original unit of the data.
    - Commonly used when a metric in the original data unit is preferred.

    <br/>
- <strong> RAE </strong>
    - <strong> Relative Absolute Error</strong>
    - $ \frac {\frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat y_{i}|} {\frac{1}{n} \sum_{i=1}^{n} |y_{i} - \bar y|} $, where $ \bar y $ is the mean of the actual values
    - Measures the proportion of absolute prediction error relative to the absolute error of the mean model.
    - Provides a ratio of how well the model performs compared to a naive mean model.
    - A lower RAE indicates a better model.

    <br/>
- <strong> RSE </strong>
    - <strong> Relative Squared Error</strong>
    - $ \frac {\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat y_{i})^2} {\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \bar y)^2} $, where $ \bar y $ is the mean of the actual values
    - Similar to RAE but uses squared differences.
    - Measures the proportion of squared prediction error relative to the squared error of the mean model.
    - A lower RSE indicates a better model.
- <strong> R2 (Coefficient of Determination) </strong> 
    - $ R^2 = 1 - RSE   $
    - Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
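    - At most 1, where 1 indicates a perfect fit; 0 means the model is no better than always predicting the mean, and the value is negative when the model is worse than that.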
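
the formulas above can be sketched in plain Python as follows. this is only a minimal illustration: the function name `regression_metrics` and the toy data are made up for this example, and in practice `numpy` or `sklearn.metrics` would usually be used instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, RAE, RSE and R2 for two equal-length sequences."""
    n = len(y_true)
    y_mean = sum(y_true) / n  # prediction of the naive mean model

    abs_err = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    sq_err = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    abs_dev = sum(abs(y - y_mean) for y in y_true)
    sq_dev = sum((y - y_mean) ** 2 for y in y_true)

    mse = sq_err / n
    rse = sq_err / sq_dev  # the 1/n factors cancel out
    return {
        "MAE": abs_err / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "RAE": abs_err / abs_dev,  # the 1/n factors cancel out here too
        "RSE": rse,
        "R2": 1 - rse,
    }


# hypothetical toy data, not taken from any real data set
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

for this toy data the errors are 1, 0, -1 and -2, so MAE is 4/4 = 1 while MSE is 6/4 = 1.5: the single error of size 2 contributes 4 of the 6 squared units.
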



--- 

**tips**:
- squaring only amplifies errors whose magnitude is larger than 1; errors smaller than 1 become even smaller when squared
- numpy has `mean` and `abs` for computing these metrics by hand, but for $R^2$ you can use `sklearn.metrics.r2_score`
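
a tiny sketch of the first tip, using made-up error values on both sides of 1:

```python
# hypothetical error magnitudes, chosen to sit below, at, and above 1
errors = [0.1, 0.5, 1.0, 2.0, 10.0]
squared = [e ** 2 for e in errors]

for e, s in zip(errors, squared):
    trend = "shrinks" if s < e else "grows" if s > e else "unchanged"
    print(f"{e:>5} -> {s:>7.2f} ({trend})")
```
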



## classification

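- accuracy is not always a good metric, especially when the classes are heavily imbalanced
    - example: face detection, where almost all candidates are not faces
    - the accuracy of a classifier that always says 'no' is then 99.9999%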
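
the always-'no' case can be reproduced with a small sketch. the data here is hypothetical (one positive among a million samples) and was picked only so that the accuracy comes out at the figure quoted above:

```python
# hypothetical imbalanced data set: 1 positive among 1,000,000 samples
n_samples = 1_000_000
y_true = [0] * n_samples
y_true[0] = 1

# a "classifier" that always answers 'no' (class 0)
y_pred = [0] * n_samples

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_samples

# share of the actual positives that the classifier found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.4%}")
print(f"recall   = {recall:.0%}")
```

the accuracy looks excellent, yet the classifier never finds the one positive case, which is why the metrics below look at the positive class separately.
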

there are 3 main methods for evaluating a classifier:
1. F-1 score
    - precision and recall:
        - True positive: selected elements that are relevant
        - False positive: selected elements that are irrelevant
        - True negative: unselected elements that are irrelevant
        - False negative: unselected (missed) elements that are relevant

    <small>
        sometimes one of them matters more than the other: for example, we may not mind taking precautions against something that never happens (FP),
        but we do not want something unexpected to happen without warning (FN)
    </small>

    - Precision = TP / (TP + FP)  
    - Recall = TP / (TP + FN)  
    - F1-score = $ 2 \times \frac {Precision \times Recall}{Precision + Recall} $
    - this kind of mean is called the **harmonic mean**
2. Jaccard index:  
    $ y $ : Actual Labels  
    $ \hat{y} $ : Predicted Labels  
    $ J (y , \hat{y}) = \frac{| y \cap  \hat{y}| }{ | y \cup \hat{y} | } = \frac { | y \cap  \hat{y} | } {|y| + |\hat{y}| - | y \cap  \hat{y}|}  $

    $ y $ : [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]  
    $ \hat{y} $ : [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  
    
    there are 10 labels in total and we predict 8 of them correctly:  
    $ J(y, \hat{y}) = \frac {8} {10 + 10 - 8} = \frac{8}{12} \approx 0.67 $  

    `sklearn.metrics.jaccard_score` calculates the score for each label and can return their average.
    the averaging method is chosen with the `average` parameter:
    - None:
    the scores for each class are returned.

    - 'binary':
    Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

    - 'micro':
    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    - 'macro':
    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    - 'weighted':
    Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.

    - 'samples':
    Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).

3. log loss:

    a categorical label hides how confident the classifier was in each prediction.
    for example:
    imagine we put person1 and person2 in group 1, but we were very certain about person1 and not so certain about person2.
    log loss is computed from the predicted probabilities, so it can measure this difference

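    LogLoss = $ - \frac{1}{n} \sum_{i=1}^{n} (y_i \times \log(\hat{y}_i) + (1 - y_i) \times \log(1 - \hat{y}_i)) $  
    
    here $ \hat{y}_i $ is the predicted probability that sample $ i $ belongs to class 1.

    LogLoss $ \ge 0 $, with no upper bound: it is 0 for perfect predictions and grows without limit when a confident prediction is wrong

    a lower log loss means better predicted probabilities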
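
the three methods above can be sketched in plain Python. the helper names (`confusion_counts`, `f1_from_counts`, `jaccard_match_count`, `binary_log_loss`) are invented for this illustration and are not library functions. the labels are the ones from the Jaccard example, and the probabilities at the end are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN, treating `positive` as the relevant class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def jaccard_match_count(y_true, y_pred):
    """Jaccard index as counted in the worked example: matches over union size."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / (len(y_true) + len(y_pred) - matches)


def binary_log_loss(y_true, y_prob):
    """Binary log loss; y_prob holds probabilities of class 1, strictly between 0 and 1."""
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_prob)
    )
    return -total / len(y_true)


y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("F1:", f1_from_counts(tp, fp, fn))
print("Jaccard:", jaccard_match_count(y_true, y_pred))

# hypothetical probabilities: both predictions pick class 1, with different confidence
print("log loss (confident):", binary_log_loss([1], [0.95]))
print("log loss (uncertain):", binary_log_loss([1], [0.55]))
```

both probabilities in the last two lines lead to the same predicted label, but the less certain one gets the higher log loss. also note that `sklearn.metrics.jaccard_score` defines the per-label score as TP / (TP + FP + FN), so its default `'binary'` result for these labels (5/7) differs from the 8/12 counted here, while the `'micro'` average should agree with 8/12.
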