# Chapter 7: Stage 5: Evaluation and Validation

## Steps Involved in Evaluating and Validating a Fine-Tuned Model

1. Set Up Evaluation Metrics
2. Interpret Training Loss Curve
3. Run Validation Loops
4. Monitor and Interpret Results
5. Hyperparameter Tuning and Adjustments

##  Setting Up Evaluation Metrics

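Cross-entropy is a key metric for evaluating LLMs during training or fine-tuning. Originating from information theory, it quantifies the difference between two probability distributions: for an LLM, the distribution the model predicts over the next token and the distribution observed in the data.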
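
As a minimal sketch of the definition, the cross-entropy between a target distribution `p` and a predicted distribution `q` is `-sum(p(x) * log q(x))`. The function name, the `eps` guard, and the four-token vocabulary below are illustrative assumptions, not part of any particular library.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) * log q(x), in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # eps guards against log(0) when the model assigns zero probability.
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

# One-hot target: the true next token is index 2 of a 4-token vocabulary.
target = [0.0, 0.0, 1.0, 0.0]
confident = [0.05, 0.05, 0.85, 0.05]   # most mass on the correct token
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform guess

print(cross_entropy(target, confident))  # smaller value
print(cross_entropy(target, uncertain))  # larger value, equal to log(4)
```

With a one-hot target the sum reduces to the negative log-probability of the correct token, which is why a confident, correct prediction yields a lower loss.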

### Importance of Cross-Entropy for LLM Training and Evaluation

Cross-entropy is crucial for training and fine-tuning LLMs. It serves as a loss function, guiding the model to produce high-quality predictions by minimising discrepancies between the predicted and actual data.

### Beyond Cross-Entropy: Advanced LLM Evaluation Metrics

+ Perplexity: Perplexity measures how well a probability distribution or model predicts a sample. In the context of LLMs, it evaluates the model’s uncertainty about the next word in a sequence. Lower perplexity indicates better performance, as the model is more confident in its predictions.

+ Factuality: Factuality assesses the accuracy of the information produced by the LLM. It is particularly important for applications where misinformation could have serious consequences. Higher factuality scores correlate with higher output quality.

+ LLM Uncertainty: LLM uncertainty is measured using log probability, helping to identify low-quality generations. Lower uncertainty indicates higher output quality. 

+ Prompt Perplexity: This metric evaluates how well the model understands the input prompt. 

+ Context Relevance: In retrieval-augmented generation (RAG) systems, context relevance measures how pertinent the retrieved context is to the user query. 

+ Completeness

+ Chunk Attribution and Utilisation: These metrics evaluate how effectively the retrieved chunks of information contribute to the final response.

+ Data Error Potential 

+ Safety Metrics
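
Perplexity and log-probability-based uncertainty in the list above can both be derived from per-token log probabilities. The sketch below assumes natural-log probabilities are already available for each token; the values are invented for illustration, and specific evaluation tools may define these metrics somewhat differently.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical natural-log probabilities assigned to each token of two sequences.
confident_generation = [math.log(0.9), math.log(0.8), math.log(0.95)]
hesitant_generation = [math.log(0.3), math.log(0.2), math.log(0.25)]

print(perplexity(confident_generation))  # close to 1: low uncertainty
print(perplexity(hesitant_generation))   # noticeably higher: high uncertainty
```

Applying the same calculation to the tokens of the input prompt, rather than the generated output, gives one plausible reading of prompt perplexity.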

##  Understanding the Training Loss Curve

 The training loss curve plots the loss value against training epochs and is essential for monitoring model
 performance.

### Interpreting Loss Curves

An ideal training loss curve shows a rapid decrease in loss during initial stages, followed by a gradual decline and eventual plateau. Specific patterns to look for include:

1. Underfitting: High loss value that does not decrease significantly over time, suggesting the model cannot learn the data.
2. Overfitting: Decreasing training loss with increasing validation loss, indicating the model memorises the training data.
3. Fluctuations: Significant variations may indicate a high learning rate or noisy gradients.
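
The three patterns can be turned into rough automated checks on per-epoch loss values. The following is only a heuristic sketch: the function name `diagnose` and the thresholds `high_loss` and `noise_ratio` are hypothetical and would need tuning for any real task.

```python
from statistics import mean, pstdev

def diagnose(train_loss, val_loss, high_loss=2.0, noise_ratio=0.15):
    """Return heuristic warnings for per-epoch loss curves.

    The thresholds are hypothetical defaults, not recommended values.
    """
    findings = []
    # Underfitting: loss is still high and has barely moved from its start.
    if train_loss[-1] > high_loss and train_loss[-1] > 0.9 * train_loss[0]:
        findings.append("underfitting")
    # Overfitting: training loss fell, but validation loss rose from its minimum.
    if train_loss[-1] < train_loss[0] and val_loss[-1] > 1.1 * min(val_loss):
        findings.append("overfitting")
    # Fluctuations: epoch-to-epoch changes are large relative to the mean loss.
    deltas = [b - a for a, b in zip(train_loss, train_loss[1:])]
    if deltas and pstdev(deltas) > noise_ratio * mean(train_loss):
        findings.append("fluctuating")
    return findings

print(diagnose([2.0, 1.5, 1.0, 0.6, 0.3], [2.1, 1.7, 1.5, 1.7, 2.0]))
```

In practice these checks complement, rather than replace, plotting both curves and inspecting them by eye.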

###  Avoiding Overfitting

 Techniques to prevent overfitting include:
 1. Regularisation: Adds a penalty term to the loss function to encourage smaller weights.
 2. Early Stopping: Stops training when validation performance no longer improves.
 3. Dropout: Randomly deactivates neurons during training to reduce sensitivity to noise.
 4. Cross-Validation: Splits data into multiple subsets for training and validation to assess model
 generalisation.
 5. Batch Normalisation: Normalises inputs to each layer during training to stabilise the learning
 process.
 6. Larger Datasets and Batch Sizes: Increasing the amount of diverse training data reduces overfitting, and larger batch sizes can help by giving more stable gradient estimates.
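
To make the first technique concrete, here is a minimal sketch of L2 regularisation, one common form of penalty term. The function names and the `weight_decay` value are assumptions chosen for illustration.

```python
def l2_penalty(weights, weight_decay=0.01):
    """L2 penalty: weight_decay times the sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)

def regularised_loss(data_loss, weights, weight_decay=0.01):
    """Total loss = data loss + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, weight_decay)

small_weights = [0.1, -0.2, 0.05]
large_weights = [3.0, -4.0, 2.5]

# Same data loss, but large weights are penalised more heavily.
print(regularised_loss(0.5, small_weights))
print(regularised_loss(0.5, large_weights))
```

Because the penalty grows with the squared magnitude of the weights, minimising the total loss pushes the optimiser towards smaller weights.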

###  Sources of Noisy Gradients

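 Noisy gradients arise because each update is estimated from a mini-batch rather than from the full dataset. Strategies to manage them include:

 1. Learning Rate Scheduling: Gradually decreasing the learning rate during training can reduce the impact of noisy gradients.
 2. Gradient Clipping: Setting a threshold for gradient values prevents large updates that can destabilise training.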
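
Both techniques can be sketched in a few lines of plain Python. The linear decay schedule and clipping by L2 norm shown here are just one common variant of each; the function names and default values are illustrative assumptions.

```python
import math

def linear_decay_lr(step, total_steps, base_lr=2e-4, final_lr=0.0):
    """Linearly interpolate from base_lr at step 0 to final_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr + (final_lr - base_lr) * progress

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads]

print(linear_decay_lr(50, 100))      # halfway through training
print(clip_by_norm([3.0, 4.0]))      # norm 5.0 rescaled to norm 1.0
```

Clipping by norm preserves the direction of the update while bounding its size; clipping each value independently is an alternative that does not preserve direction.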

## Running Validation Loops

 1. Split Data: Divide the dataset into training and validation sets.
 2. Initialise Validation: Evaluate the model on the validation set at the end of each epoch.
 3. Calculate Metrics: Compute relevant performance metrics, such as cross-entropy loss.
 4. Record Results: Log validation metrics for each epoch.
 5. Early Stopping: Optionally stop training if validation loss does not improve for a predefined
 number of epochs.
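
The steps above can be sketched as a small loop, assuming the data has already been split. Here `train_one_epoch` and `validate` are hypothetical callables supplied by the caller, and the validation losses are simulated rather than produced by a real model.

```python
def run_with_early_stopping(train_one_epoch, validate, max_epochs=20, patience=3):
    """Train until validation loss fails to improve for `patience` epochs.

    `train_one_epoch` and `validate` are hypothetical callables;
    `validate` returns the validation loss as a float.
    """
    history = []
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        val_loss = validate()              # evaluate at the end of each epoch
        history.append((epoch, val_loss))  # record results
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # early stopping
    return best_loss, history

# Simulated validation losses stand in for a real model here.
simulated = iter([1.9, 1.4, 1.1, 1.15, 1.2, 1.3, 1.0])
best, log = run_with_early_stopping(lambda: None, lambda: next(simulated), patience=3)
print(best, len(log))
```

A real implementation would typically also save a checkpoint whenever the best validation loss improves, so the best model can be restored after stopping.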

##  Monitoring and Interpreting Results

 1. Consistent Improvement: If both training and validation metrics improve and then plateau, the model is generalising well.
 2. Divergence: If training metrics improve while validation metrics deteriorate, the model is overfitting.
 3. Stability: Validation metrics that do not fluctuate significantly indicate stable training.

##  Hyperparameter Tuning and Other Adjustments

 1. Learning Rate: Determines the step size for updating model weights. A good starting point is
 2e-4, but this can vary.
 2. Batch Size: Larger batch sizes lead to more stable updates but require more memory.
 3. Number of Training Epochs: Balancing the number of epochs ensures the model learns sufficiently without overfitting or underfitting.
 4. Optimiser: Optimisers such as paged Adam optimise memory usage, which is advantageous for large models.
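
One simple way to tune the first three hyperparameters is an exhaustive grid search that keeps the configuration with the lowest validation loss. The sketch below assumes a caller-supplied `train_and_validate` function; `toy_run` is an invented stand-in for a real fine-tuning run, and the candidate values are illustrative only.

```python
from itertools import product

def grid_search(train_and_validate, learning_rates, batch_sizes, epoch_counts):
    """Try every combination and return (best_config, best_val_loss)."""
    best_config, best_loss = None, float("inf")
    for lr, bs, epochs in product(learning_rates, batch_sizes, epoch_counts):
        val_loss = train_and_validate(lr, bs, epochs)
        if val_loss < best_loss:
            best_config = {"learning_rate": lr, "batch_size": bs, "epochs": epochs}
            best_loss = val_loss
    return best_config, best_loss

# Toy stand-in for a real fine-tuning run; its shape is invented for the demo.
def toy_run(lr, bs, epochs):
    return abs(lr - 2e-4) * 1e3 + abs(bs - 16) / 100 + abs(epochs - 3) / 10

config, loss = grid_search(toy_run, [1e-4, 2e-4, 5e-4], [8, 16], [1, 3])
print(config, loss)
```

Grid search becomes expensive quickly because every combination requires a full training run, so in practice the grid is kept small or replaced by random or adaptive search.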