## 3. Evaluate and benchmark

In the following notebooks, we evaluate the predictions of all models by how well they estimate the **Main Pollen Season** (MPS). To obtain the MPS, we use two alternative formulas: a percentage-based approach and a threshold-based approach. These functions are used to compare the predicted MPS periods against the observed ones, and the results are summarized in a comparison table.

The evaluation generates a CSV file with metrics, which will serve as the basis for a comparison (benchmark).

Then we will compare the performance of all models and their variants using this evaluation CSV. 

### 3.1. Percentage formula

The **main pollen season** (MPS) is defined as the central 95% of the annual pollen count. Specifically, it begins on the day the cumulative count reaches 2.5% of the yearly total and ends on the day it reaches 97.5%. This method preserves most of the relevant data while excluding the early and late outliers, which helps improve the performance of machine learning models.
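
The percentage rule can be sketched in a few lines. This is a minimal illustration, not the notebooks' implementation: the function name `mps_percentage`, the plain-list input, and the choice to return day indices instead of dates are all assumptions made for the example.

```python
from itertools import accumulate


def mps_percentage(daily_counts, start_pct=0.025, end_pct=0.975):
    """Return (start_index, end_index) of the MPS, or None if there is no pollen.

    start_index is the first day on which the cumulative count reaches
    start_pct of the yearly total; end_index is the first day on which it
    reaches end_pct.
    """
    cumulative = list(accumulate(daily_counts))
    if not cumulative or cumulative[-1] <= 0:
        return None
    total = cumulative[-1]
    start = next(i for i, c in enumerate(cumulative) if c >= start_pct * total)
    end = next(i for i, c in enumerate(cumulative) if c >= end_pct * total)
    return start, end


# Toy series of ten daily counts (illustrative values, total = 100).
counts = [0, 1, 2, 10, 40, 30, 12, 3, 2, 0]
season = mps_percentage(counts)
print(season)
```

With real data, the indices would be mapped back to calendar dates of the year being evaluated.
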

### 3.2. Threshold formula

The **main pollen season** (MPS) can also be determined using a threshold-based method. Here the season is defined by two parameters: a minimum daily pollen count (threshold) and a required number of consecutive days meeting that threshold. Specifically, the MPS begins when at least three pollen grains are recorded on each of three consecutive days, and it ends when this condition is no longer met in the subsequent data.
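
The threshold rule leaves some room for interpretation, so the sketch below fixes one reading of it. It assumes the season starts on the first day of the first qualifying run and ends on the last day of the last qualifying run, that is, the last day after which the condition is never met again. The function name `mps_threshold` and the index-based return value are hypothetical; the notebooks may resolve the end of the season differently.

```python
def mps_threshold(daily_counts, threshold=3, consecutive_days=3):
    """Return (start_index, end_index) of the MPS, or None if no run qualifies.

    A run qualifies once `consecutive_days` days in a row have a count of at
    least `threshold`. Assumption: the season spans from the first day of the
    first qualifying run to the last day of the last qualifying run.
    """
    run = 0  # length of the current streak of days at or above the threshold
    start = end = None
    for i, count in enumerate(daily_counts):
        run = run + 1 if count >= threshold else 0
        if run >= consecutive_days:
            if start is None:
                start = i - consecutive_days + 1
            end = i
    return None if start is None else (start, end)


# Toy series (illustrative values): two qualifying runs, days 3-5 and 7-9.
counts = [0, 3, 1, 4, 5, 6, 2, 3, 3, 3, 1, 0]
season = mps_threshold(counts)
print(season)
```

Under this reading, the short dip on day 6 does not end the season, because a later run still satisfies the condition.
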

### 3.3. Evaluation CSV Example
Evaluating the predictions produces a table with the following columns:

| Column | Description |
|:--|:--|
| year | Year for which the MPS is forecast |
| model_name | Name of the model |
| uses_covariates | Whether covariates are used for predictions (True) or not (False) |
| train_size | Number of days with which the model is trained |
| pred_n | How many days before the target date the prediction is made (n) |
| start | Observed start date of the MPS |
| end | Observed end date of the MPS |
| est_start | Estimated start date of the MPS |
| est_end | Estimated end date of the MPS |
| start_dev | Deviation in days between the observed and the estimated start date |
| end_dev | Deviation in days between the observed and the estimated end date |
| duration | Observed MPS length in days |
| est_duration | Estimated MPS length in days |
| total_dev | Sum of start_dev and end_dev |
| duration_dev | Observed duration minus estimated duration (negative when the estimated season is longer) |

For example, with a ```HORIZON_SIZE``` of 1 week, a ```TRAIN_SIZE``` of 0 years (fitting), ```START_YEAR = 1999```, ```END_YEAR = 2010```, ```OFFSET_DAYS = -182```, no covariates, and the moirai.deterministic.small model, the evaluation looks like this:
```
year,model_name,uses_covariates,train_size,pred_n,start,end,est_start,est_end,start_dev,end_dev,duration,est_duration,total_dev,duration_dev
1999,moirai.deterministic.small,False,0,1,1999-01-13,1999-02-27,1998-08-11,1999-12-17,155,293,45,493,448,-448
1999,moirai.deterministic.small,False,0,2,1999-01-13,1999-02-27,1998-08-09,1999-12-18,157,294,45,496,451,-451
1999,moirai.deterministic.small,False,0,3,1999-01-13,1999-02-27,1998-09-21,1999-12-06,114,282,45,441,396,-396
1999,moirai.deterministic.small,False,0,4,1999-01-13,1999-02-27,1998-08-12,1999-12-18,154,294,45,493,448,-448
1999,moirai.deterministic.small,False,0,5,1999-01-13,1999-02-27,1998-09-28,1999-11-28,107,274,45,426,381,-381
1999,moirai.deterministic.small,False,0,6,1999-01-13,1999-02-27,1998-08-15,1999-12-13,151,289,45,485,440,-440
1999,moirai.deterministic.small,False,0,7,1999-01-13,1999-02-27,1998-08-17,1999-12-14,149,290,45,484,439,-439
...
2010,moirai.deterministic.small,False,0,1,2010-01-18,2010-03-05,2009-07-24,2010-12-09,178,279,46,503,457,-457
2010,moirai.deterministic.small,False,0,2,2010-01-18,2010-03-05,2009-08-01,2010-12-13,170,283,46,499,453,-453
2010,moirai.deterministic.small,False,0,3,2010-01-18,2010-03-05,2009-09-12,2010-11-16,128,256,46,430,384,-384
2010,moirai.deterministic.small,False,0,4,2010-01-18,2010-03-05,2009-07-20,2010-12-11,182,281,46,509,463,-463
2010,moirai.deterministic.small,False,0,5,2010-01-18,2010-03-05,2009-09-04,2010-11-15,136,255,46,437,391,-391
2010,moirai.deterministic.small,False,0,6,2010-01-18,2010-03-05,2009-08-01,2010-12-06,170,276,46,492,446,-446
2010,moirai.deterministic.small,False,0,7,2010-01-18,2010-03-05,2009-08-01,2010-12-04,170,274,46,490,444,-444
```
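
To make the derived columns concrete, the sketch below recomputes them for the first row of the example. The formulas are inferred from the sample rows, not taken from the notebooks' code: deviations are treated as absolute day differences and `duration_dev` as observed minus estimated duration. The function name `mps_metrics` is hypothetical.

```python
from datetime import date


def mps_metrics(start, end, est_start, est_end):
    """Derive the deviation columns from observed and estimated MPS dates.

    Assumption (inferred from the sample rows): start_dev and end_dev are
    absolute differences in days, and duration_dev = duration - est_duration.
    """
    start_dev = abs((est_start - start).days)
    end_dev = abs((est_end - end).days)
    duration = (end - start).days
    est_duration = (est_end - est_start).days
    return {
        "start_dev": start_dev,
        "end_dev": end_dev,
        "duration": duration,
        "est_duration": est_duration,
        "total_dev": start_dev + end_dev,
        "duration_dev": duration - est_duration,
    }


# First row of the example: year 1999, pred_n = 1.
row = mps_metrics(
    start=date(1999, 1, 13),
    end=date(1999, 2, 27),
    est_start=date(1998, 8, 11),
    est_end=date(1999, 12, 17),
)
print(row)
```

The result matches the values in that row: a 155-day start deviation, a 293-day end deviation, and an estimated season that is 448 days longer than the observed one.
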

The file name is built as follows:
```
import os

# The suffix records which MPS formula produced the evaluation.
suffix = "percentage" if apply_percentage else "threshold"
if apply_percentage:
    # Percentage formula: the last two fields are the cumulative cut-offs.
    file_path = os.path.join(evaluation_dir,
        f"mps_evaluation_{suffix}_{start_year}-{end_year}_{offset_days}_{input_size}_{horizon_size}_{start_pct}_{end_pct}.csv")
else:
    # Threshold formula: the last two fields are the threshold and the run length.
    file_path = os.path.join(evaluation_dir,
        f"mps_evaluation_{suffix}_{start_year}-{end_year}_{offset_days}_{input_size}_{horizon_size}_{threshold}_{consecutive_days}.csv")
```

For example, if ```evaluation_dir``` is *./outputs/evaluations/*, ```apply_percentage``` is *True*, ```start_pct``` is *0.025*, and ```end_pct``` is *0.975*, with ```start_year = 1999```, ```end_year = 2010```, ```offset_days = -182```, ```input_size = 365```, and ```horizon_size = 7```, then ```file_path``` will be *./outputs/evaluations/mps_evaluation_percentage_1999-2010_-182_365_7_0.025_0.975.csv*.

### 3.4. Benchmark the models

In the last notebook, we compare the performance of the models. The benchmark metrics are ```start_dev``` and ```end_dev```.

First, we extract the ```start_dev``` and ```end_dev``` metrics from the evaluation CSVs. We then draw a box-and-whisker plot and build a comparative table of model results based on Mann-Whitney U tests with Holm-Bonferroni correction, to compare how well each model recovers the MPS.
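
The statistical step can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the notebook's code: it uses the normal approximation for the Mann-Whitney U p-value (tie-corrected, no continuity correction), whereas a library routine such as `scipy.stats.mannwhitneyu` may use an exact method for small samples. The function names, the model names, and the `start_dev` samples are all made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of x, p-value). Both samples must be non-empty.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # U counts the pairs where x beats y; ties count as half a win.
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    # Tie-corrected variance of U under the null hypothesis.
    tie_term = sum(t**3 - t for t in Counter(list(x) + list(y)).values()) / (n * (n - 1))
    variance = max(0.0, n1 * n2 / 12 * ((n + 1) - tie_term))
    if variance == 0.0:
        return u1, 1.0  # every value is identical, so there is no evidence of a shift
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k + 1), then kept monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Made-up start_dev samples (days) for three hypothetical models.
start_dev_by_model = {
    "model_a": [12, 15, 9, 20, 14, 11, 17],
    "model_b": [30, 28, 35, 40, 33, 29, 31],
    "model_c": [13, 16, 10, 22, 15, 12, 18],
}
pairs = list(combinations(start_dev_by_model, 2))
raw_p = [mann_whitney_u(start_dev_by_model[a], start_dev_by_model[b])[1] for a, b in pairs]
adjusted_p = holm_bonferroni(raw_p)
for (a, b), p_adj in zip(pairs, adjusted_p):
    print(f"{a} vs {b}: Holm-adjusted p = {p_adj:.4f}")
```

Running every pairwise test first and adjusting afterwards keeps the family-wise error rate under control as more models and variants are added to the comparison. The same procedure applies to ```end_dev```.
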

➡️ **[Next notebook: 03-01_Evaluate_models](../notebooks/03-01_Evaluate_models.ipynb)**