We need a better evaluation pipeline to quantify model performance and compare models with each other. Some ideas include:
- Evaluating on a dataset for which we already have ChatGPT reference answers, e.g. HC3.
- Using a fine-tuned reward model.
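A minimal sketch of the first idea: score each model answer against its ChatGPT reference with a simple token-overlap F1, so models can be ranked by average F1 over the dataset. The metric choice and the example strings below are illustrative assumptions, not part of HC3 itself; a reward model or a stronger metric (e.g. BERTScore) could replace `token_f1` in the same loop.

```python
# Sketch: reference-based scoring of model answers (hypothetical data).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def average_score(pairs: list[tuple[str, str]]) -> float:
    """Mean F1 over (prediction, reference) pairs; one number per model."""
    return sum(token_f1(p, r) for p, r in pairs) / len(pairs)

# Hypothetical (prediction, reference) pairs standing in for HC3 examples.
pairs = [
    ("paris is the capital of france", "the capital of france is paris"),
    ("berlin", "the capital of france is paris"),
]
print(round(average_score(pairs), 3))
```

Averaging a per-example score like this gives a single comparable number per model, which is the main thing the current pipeline lacks.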