## Classes (targets)

Classes are defined from the training examples. The vast majority of training examples are linked to a single target label.
However, a small subset (~0.5%) is linked to two or more intents.

Examples:  

- (single-intent) find me a flight from atlanta to baltimore → **flight**
- (multi-intent) what are the flights and fares from atlanta to philadelphia → **flight+airfare**

Handling Multiple Intents:

**Standalone Class vs. Breakdown**: Although combined labels such as `flight+airfare` exist, `flight` and `airfare` appear far more often on their own. Breaking `flight+airfare` down into its component classes, rather than treating it as a standalone class, may therefore give a more nuanced classification: the model can still distinguish the individual intents even when they occasionally appear together.  

**Consideration of Data Distribution**: Because `flight` and `airfare` are far more prevalent individually than in combination, treating them as separate classes lets the model learn the features specific to each intent.
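The breakdown described above can be sketched as a small label parser. This is only an illustration: the helper names `split_intents` and `is_multi_intent` are hypothetical, and the `+` separator is assumed from the `flight+airfare` label format shown earlier.

```python
def split_intents(label: str, sep: str = "+") -> list[str]:
    """Split a raw target label into its individual intents."""
    return [part.strip() for part in label.split(sep) if part.strip()]


def is_multi_intent(label: str, sep: str = "+") -> bool:
    """Return True when a label combines two or more intents."""
    return len(split_intents(label, sep)) > 1


print(split_intents("flight"))            # ['flight']
print(split_intents("flight+airfare"))    # ['flight', 'airfare']
print(is_multi_intent("flight+airfare"))  # True
```

With this view, `flight+airfare` is not a class of its own but a pair of the two individual classes.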


## Training Data Sampling (Few-Shot Classifier)

To create a few-shot classifier, we will provide the model with a small number of examples per class, allowing it to gain a better understanding of the context associated with each intent. For example, given 10 unique classes and 1 example per class, we will provide the model with 10 examples, each representing one class.
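A minimal sketch of this per-class sampling is shown below. It assumes the training data is available as `(text, label)` pairs; the function name `sample_few_shot` and its parameters are hypothetical and are not taken from the repository.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k=1, seed=0):
    """Pick up to k examples per class from (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    shots = []
    for label in sorted(by_label):
        texts = by_label[label]
        # A class with fewer than k examples contributes all it has.
        for text in rng.sample(texts, min(k, len(texts))):
            shots.append((text, label))
    return shots


# Toy data: the second and third utterances are made up for illustration.
train = [
    ("find me a flight from atlanta to baltimore", "flight"),
    ("show me flights to denver", "flight"),
    ("how much is a ticket to boston", "airfare"),
]
few_shot = sample_few_shot(train, k=1)
print(len(few_shot))  # 2, one example for each of the two classes
```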

### Edge Case 1   

Training examples that are linked to more than one class.  
While these data points may offer insight into class correlations, they are too infrequent to be useful as few-shot examples: in the training dataset, prompts carry a single label (intent) 99.5% of the time. We therefore refrain from using multi-label examples (e.g., `flight+airfare`) as training examples, since they could lead the model to learn skewed class correlations.

If a class appears only in multi-label examples (e.g., `flight+weather`) or has no examples at all, the model is given only the description (name, unique_id) of that class.

For instance, if we passed the `flight+weather` example to the model, the very frequent class `flight` could become strongly correlated with the `weather` class. The model might then produce a false-positive `weather` prediction each time a flight-related request appears.
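One way to apply this rule is sketched below: multi-label examples are dropped from the few-shot pool, and every class left without a single-label example is flagged as description-only. The function `split_few_shot_pool`, the `+` separator, and the toy data are assumptions made for illustration, not the repository's implementation.

```python
def split_few_shot_pool(examples, all_classes, sep="+"):
    """Return (single_label_examples, description_only_classes)."""
    # Keep only examples whose label names exactly one intent.
    single = [(text, label) for text, label in examples if sep not in label]
    covered = {label for _, label in single}
    # Classes with no single-label example get a description only.
    description_only = sorted(set(all_classes) - covered)
    return single, description_only


# Toy data: the flight+weather utterance is made up for illustration.
pool = [
    ("find me a flight from atlanta to baltimore", "flight"),
    ("will the weather delay my flight to boston", "flight+weather"),
]
usable, description_only = split_few_shot_pool(
    pool, all_classes=["flight", "weather", "airfare"]
)
print(usable)            # only the single-label flight example
print(description_only)  # ['airfare', 'weather']
```

Here `weather` occurs only inside a combined label and `airfare` has no example at all, so both would be passed to the model by description alone.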


## Validation Data Sampling (all Classifiers)

### Edge Case 2    

Ensure a representative validation (test) sample regardless of `test_size`, and handle edge cases for a robust evaluation.  
The class definition for the validation dataset differs from the one in Edge Case 1: here we always want to include and evaluate multi-intent test cases.  
Continuing the example above, the validation dataset is designed to include `flight+weather` examples as well as individual `flight` and `weather` examples.  

1. Identifying Unseen Unique Classes:
    - Compares the unique intents in the evaluation dataset with the intents known to the model (`model_known_targets`).
    - Marks any intent that is present in the evaluation dataset but not known to the model as an unseen class. This applies to individual classes only; combined labels such as `flight+weather` are not treated as unseen.

2. Filtering Unseen Classes:
    - Removes data corresponding to unseen classes from the evaluation dataset.
    - Ensures compatibility between the evaluation dataset and the model's known intents.

3. Adjusting Test Size:
    - Creates a representative evaluation dataset by sampling from the original test dataset.
    - Ensures that each class in the dataset is equally represented in the sampled data where possible.
    - See `sample_evaluation_data` in [gpt_intent_classifier](../gpt_intent_classifier.py).
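The three steps above can be sketched as follows. This is not the actual `sample_evaluation_data` implementation: the function name `sample_evaluation_sketch`, its signature, the `(text, label)` input format, and the round-robin strategy for balancing classes are all assumptions.

```python
import random
from collections import defaultdict


def sample_evaluation_sketch(test_data, model_known_targets, test_size,
                             seed=0, sep="+"):
    """Filter unseen classes, then draw a class-balanced sample."""
    known = set(model_known_targets)
    # Step 1: only individual labels can be unseen; combined labels are kept.
    unseen = {label for _, label in test_data
              if sep not in label and label not in known}
    # Step 2: remove rows that belong to unseen classes.
    kept = [(text, label) for text, label in test_data if label not in unseen]

    # Step 3: round-robin over classes so each is equally represented
    # until test_size is reached or the data runs out.
    by_label = defaultdict(list)
    for row in kept:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    for rows in by_label.values():
        rng.shuffle(rows)

    labels = sorted(by_label)
    sample = []
    while len(sample) < test_size and any(by_label[l] for l in labels):
        for label in labels:
            if by_label[label] and len(sample) < test_size:
                sample.append(by_label[label].pop())
    return sample


# Toy data with placeholder utterances; "meal" is unknown to the model.
test_rows = [
    ("q1", "flight"), ("q2", "flight"), ("q3", "flight"),
    ("q4", "weather"), ("q5", "flight+weather"),
    ("q6", "meal"), ("q7", "meal"),
]
picked = sample_evaluation_sketch(test_rows, ["flight", "weather"], test_size=3)
print(sorted(label for _, label in picked))
```

With `test_size=3` each of the three remaining labels contributes one row; a larger `test_size` simply keeps cycling until the filtered data is exhausted.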


## Training Data and User Query Processing