As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. We introduce UPHILL, a benchmark for Understanding Presuppositions for Health-related Inquiries to LLMs. It comprises 9,725 health-related queries with 5 different levels of presupposition. For each query, it includes the veracity of the (presupposed) claim, model responses, and entailment predictions indicating whether each model response agrees with the claim within the query.
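To make the per-query contents concrete, the following is a minimal sketch of how one record might be represented and summarized. The class name `UphillRecord`, all field names, the integer encoding of presupposition levels, the entailment label strings, and the helper `agreement_rate_by_level` are hypothetical and chosen for illustration only; they are not taken from the released data or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UphillRecord:
    """Hypothetical container for one benchmark entry (field names are assumptions)."""
    query: str                 # health-related query text
    presupposition_level: int  # assumed integer index for one of the 5 levels
    claim_veracity: str        # veracity label of the presupposed claim
    model_response: str        # response returned by the model
    entailment: str            # assumed labels: "agree", "disagree", "neutral"


def agreement_rate_by_level(records):
    """Return the fraction of responses labeled "agree" for each presupposition level."""
    totals = defaultdict(int)
    agrees = defaultdict(int)
    for record in records:
        totals[record.presupposition_level] += 1
        if record.entailment == "agree":
            agrees[record.presupposition_level] += 1
    return {level: agrees[level] / totals[level] for level in sorted(totals)}


if __name__ == "__main__":
    # Placeholder records for illustration; these are not real benchmark data.
    sample = [
        UphillRecord("placeholder query A", 0, "false", "placeholder response", "disagree"),
        UphillRecord("placeholder query B", 0, "false", "placeholder response", "agree"),
        UphillRecord("placeholder query C", 4, "false", "placeholder response", "agree"),
    ]
    print(agreement_rate_by_level(sample))
```

Grouping by presupposition level is one plausible way to examine how often responses agree with the claim as the presupposition changes; the actual analysis in the benchmark may differ.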
Our design and code are inspired by https://writing-assistant.github.io/.