Hi! Is there a script to evaluate the questions and model outputs by complexity level, as presented in the paper?