We need a better evaluation pipeline to quantify model performance and compare models with each other. Some ideas include:
- Evaluating on a dataset for which we already have ChatGPT reference answers, e.g. HC3.
- Using a fine-tuned reward model.
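A minimal sketch of the first idea: score each model answer against its ChatGPT reference with a simple token-overlap F1, so models can be ranked by average F1 over the dataset. The metric choice and the example strings below are illustrative assumptions, not part of HC3 itself; a reward model or a stronger metric (e.g. BERTScore) could replace `token_f1` in the same loop.

```python
# Sketch: reference-based scoring of model answers (hypothetical data).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def average_score(pairs: list[tuple[str, str]]) -> float:
    """Mean F1 over (prediction, reference) pairs; one number per model."""
    return sum(token_f1(p, r) for p, r in pairs) / len(pairs)

# Hypothetical (prediction, reference) pairs standing in for HC3 examples.
pairs = [
    ("paris is the capital of france", "the capital of france is paris"),
    ("berlin", "the capital of france is paris"),
]
print(round(average_score(pairs), 3))
```

Averaging a per-example score like this gives a single comparable number per model, which is the main thing the current pipeline lacks.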