I suspect it is easy to beat the scores you are reporting (and perhaps even approach 100%) with multi-turn and agentic systems such as LLMLingua-2 or GraphReader.
Aggregating such tricks, and understanding what it takes to reach acceptable performance with an LLM, seems important to anyone building a production system.
Would you consider accepting submissions of such agentic systems on your leaderboard? If so, it would also be interesting to report the total tokens consumed and the number of consecutive steps taken for each submission.