# EVALUATION METRICS 

## GPT CLASSIFIER EVALUATION METRICS

### - ACCURACY  

Calculate the accuracy of the intent classifier.    
Accuracy measures the proportion of correctly classified cases out of the total number of cases evaluated.  
For each prediction we compare two lists: the test labels, e.g. ['flight'] or ['flight', 'airline'], and the model's  
response, e.g. ['flight', 'city', 'airline']. Each actual label that is present in the response counts as +1 correct prediction; each actual label that is missing counts as +1 incorrect prediction. 
The total number of predictions is calculated as matched_intent_count + mismatched_intent_count.   
Thus, the total number of predictions will be >= the number of test records, since the prediction count depends on the number of actual labels per prompt.  
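
The counting rule above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `intent_accuracy` and the `(actual, predicted)` record layout are assumptions made for the example.

```python
def intent_accuracy(records):
    """Label-level accuracy over (actual_labels, predicted_labels) pairs.

    Hypothetical helper: every actual label found in the model's response
    counts as matched, every actual label missing from it as mismatched.
    """
    matched_intent_count = 0
    mismatched_intent_count = 0
    for actual, predicted in records:
        predicted_set = set(predicted)
        for label in actual:
            if label in predicted_set:
                matched_intent_count += 1
            else:
                mismatched_intent_count += 1
    total = matched_intent_count + mismatched_intent_count
    return matched_intent_count / total if total else 0.0


records = [
    (["flight"], ["flight", "city", "airline"]),
    (["flight", "airline"], ["flight", "city", "airfare"]),
]
# Two test records, but three actual labels are scored (2 matched, 1 missed).
print(intent_accuracy(records))
```

Note that the denominator is the number of actual labels, not the number of test records, which is why the total can exceed the record count.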


### - PRECISION  

Calculate precision for each class.  
Precision is calculated as the fraction of instances   
correctly classified as belonging to a specific class out of all instances   
the model predicted to belong to that class (TP/(TP+FP)).  

```  
Example:  

actual_intents = ['flight', 'airfare']    
prediction = ['flight', 'airfare', 'flight_no']    

Here 'flight_no' gets +1 FP and     
classes `flight` & `airfare` earn +1 TP    
```   
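
A minimal sketch of the per-class precision count described above. The function name `per_class_precision` is hypothetical, and the sketch assumes each label appears at most once per prediction; classes that are never predicted are left out because their precision is undefined.

```python
from collections import Counter


def per_class_precision(records):
    """Per-class TP / (TP + FP) over (actual_labels, predicted_labels) pairs."""
    tp = Counter()
    fp = Counter()
    for actual, predicted in records:
        actual_set = set(actual)
        for label in set(predicted):
            if label in actual_set:
                tp[label] += 1  # predicted and actually present
            else:
                fp[label] += 1  # predicted but not among the actual intents
    classes = set(tp) | set(fp)
    return {c: tp[c] / (tp[c] + fp[c]) for c in classes}


precision = per_class_precision(
    [(["flight", "airfare"], ["flight", "airfare", "flight_no"])]
)
print(precision)
```

For the example above, `flight` and `airfare` each get precision 1.0 and `flight_no` gets 0.0.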

### - RECALL
 
Calculate recall for each class.    
Recall is calculated as the fraction of instances in a class that the model correctly classified    
out of all instances in that class (TP/(TP+FN)).  

```
Example:

actual_intents = ['flight', 'airfare']  
prediction = ['flight', 'city', 'flight_no']  

Here 'airfare' gets +1 FN and   
class `flight` earns +1 TP  
```
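
A minimal sketch of the per-class recall count described above. The function name `per_class_recall` is hypothetical; only classes that occur among the actual intents are reported, since recall is undefined for the others.

```python
from collections import Counter


def per_class_recall(records):
    """Per-class TP / (TP + FN) over (actual_labels, predicted_labels) pairs."""
    tp = Counter()
    fn = Counter()
    for actual, predicted in records:
        predicted_set = set(predicted)
        for label in set(actual):
            if label in predicted_set:
                tp[label] += 1  # actual intent recovered by the model
            else:
                fn[label] += 1  # actual intent missing from the response
    classes = set(tp) | set(fn)
    return {c: tp[c] / (tp[c] + fn[c]) for c in classes}


recall = per_class_recall(
    [(["flight", "airfare"], ["flight", "city", "flight_no"])]
)
print(recall)
```

For the example above, `flight` gets recall 1.0 and `airfare` gets 0.0; `city` and `flight_no` do not appear because they are not actual intents.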

### - CONFUSION MATRIX  

Calculate the confusion matrix.

For each prediction, all three predicted values, for example ['flight', 'flight_no', 'airport'], are counted against every actual test intent, such as [`flight_time`, `flight_no`].

```
Example:
For test case `show all flights and fares from denver to san francisco` where:

- actual_intents: ['flight', 'airfare']
- predicted intents: ['flight', 'flight_no', 'airfare']

We count all intents vs all predictions as `confusions` like this: 

- 'flight' -> 'flight'       +1
- 'flight' -> 'flight_no'    +1
- 'flight' -> 'airfare'     +1

- 'airfare' -> 'flight'         +1
- 'airfare' -> 'flight_no'      +1
- 'airfare' -> 'airfare'       +1

```

Here, even though `flight` and `airfare` are correct predictions, each of them is still paired with the extra prediction `flight_no`, giving the confusions (`flight`, `flight_no`) and (`airfare`, `flight_no`).


With this counting method, we capture not only the cases where the predictions match the actual classes but also   
the associations suggested by the model. This shows how frequently certain classes appear together    
and helps identify associations the model implicitly recognizes.  
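
The all-pairs counting can be sketched with a `Counter` keyed by `(actual, predicted)` tuples. This is an illustrative sketch; the function name `confusion_counts` and the record layout are assumptions, not the project's code.

```python
from collections import Counter
from itertools import product


def confusion_counts(records):
    """Count every (actual_intent, predicted_intent) pair per test case."""
    counts = Counter()
    for actual, predicted in records:
        # Cartesian product: each actual intent is paired with each prediction.
        counts.update(product(actual, predicted))
    return counts


counts = confusion_counts(
    [(["flight", "airfare"], ["flight", "flight_no", "airfare"])]
)
for (actual, predicted), n in sorted(counts.items()):
    print(f"{actual!r} -> {predicted!r}  +{n}")
```

A test case with two actual intents and three predictions therefore contributes six counts to the matrix.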


### Zero shot classifier 

Results:   

`gpt-3.5-turbo`  
`zero-shot`  
`model_unknown_targets:  {'day_name'}`  

**valid_res**: 845  -> times the model's response was in expected format  
**invalid_res**:  3 -> times the model's response was malformed (excluded)  
**general_accuracy**: 94%

- [zero-shot-predictions](../model_evaluation/zero-shot_test_results.csv)  
- [zero-shot-accuracy](../model_evaluation/zero-shot_accuracy.csv)  
- [zero-shot-precision](../model_evaluation/zero-shot_precision.csv) (per class)  
- [zero-shot-recall](../model_evaluation/zero-shot_recall.csv) (per class)  

**evaluation cost for the whole test dataset**

 - based on $0.0005/1k input tokens and $0.0015/1k output tokens pricing for `gpt-3.5-turbo`, 
 - our input was 848 (requests) * 212 tokens each (on avg) == 179.8k input tokens 
 - our expected output is fixed at 5-7 tokens depending on the tokenizer  

```
For example, for the output `[0, 3, 7]`:  
'[' and ']' would each be a token on their own.  
'0', '3', and '7' would each be a token on their own.  
Each ',' would also be a token on its own.  
So, in total, the tokenizer would break the string '[0, 3, 7]' into 7 tokens.  
```
- our output was 848 (requests) * 7 tokens each == 5.9k output tokens 
- 179.8 * 0.0005 = $0.09 for input costs  
- 5.9 * 0.0015 = $0.00885 for output costs   

So, evaluating the model with the `zero-shot` classifier on the complete test dataset costs approximately `$0.10`  
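
The cost arithmetic can be reproduced with a few lines of Python. This is a rough sketch: the function name `evaluation_cost` is hypothetical, the prices are the ones quoted in this section, and 7 output tokens per request is the upper bound assumed above.

```python
def evaluation_cost(requests, avg_input_tokens, output_tokens_per_request,
                    input_price_per_1k=0.0005, output_price_per_1k=0.0015):
    """Return (input_cost, output_cost, total_cost) in USD."""
    input_tokens = requests * avg_input_tokens
    output_tokens = requests * output_tokens_per_request
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost, output_cost, input_cost + output_cost


# Zero-shot run: 848 requests, about 212 input tokens each, 7 output tokens each.
input_cost, output_cost, total = evaluation_cost(848, 212, 7)
print(f"input ${input_cost:.4f}, output ${output_cost:.4f}, total ${total:.2f}")
```

The same helper can be reused for other prompt sizes by changing `avg_input_tokens`.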


### Few shot classifier 

Results:   

`gpt-3.5-turbo`  
`few-shot`  
`model_unknown_targets:  {'day_name'}`  

**valid_res**: 846  -> times the model's response was in expected format  
**invalid_res**:  2 -> times the model's response was malformed (excluded)  
**general_accuracy**: 95%

- [few-shot-predictions](../model_evaluation/few-shot_test_results.csv)  
- [few-shot-accuracy](../model_evaluation/few-shot_accuracy.csv)  
- [few-shot-precision](../model_evaluation/few-shot_precision.csv) (per class)  
- [few-shot-recall](../model_evaluation/few-shot_recall.csv) (per class)  


**evaluation cost for the whole test dataset**

 - based on $0.0005/1k input tokens and $0.0015/1k output tokens pricing for `gpt-3.5-turbo`, 
 - our input was 848 (requests) * 305 tokens each (on avg) == 258.6k input tokens 
 - our expected output is fixed at 5-7 tokens depending on the tokenizer  

```
For example, for the output `[0, 3, 7]`:  
'[' and ']' would each be a token on their own.  
'0', '3', and '7' would each be a token on their own.  
Each ',' would also be a token on its own.  
So, in total, the tokenizer would break the string '[0, 3, 7]' into 7 tokens.  
```
- our output was 848 (requests) * 7 tokens each == 5.9k output tokens 
- 258.6 * 0.0005 = $0.13 for input costs  
- 5.9 * 0.0015 = $0.00885 for output costs   

So, evaluating the model with the `few-shot` classifier on the complete test dataset costs approximately `$0.14`  