Closed
Description
For the text-to-p sample, where the input is a paragraph of text, PromptPex test generation with the gpt-4o model produced tests of the following form:
ruleid, testid, testinput, expectedoutput, reasoning
1, 1, "The sky is blue. Birds are chirping.", "<p>The sky is <strong>blue</strong>. <em>Birds</em> are <em>chirping</em>.</p>", "Ensures HTML formatting is applied with correct usage of <strong> and <em> tags in sentences."
1, 2, "Today, it is raining. Tomorrow, it will be sunny.", "<p>Today, it is <strong>raining</strong>. Tomorrow, it will be <em>sunny</em>.</p>", "Verifies that HTML tags are properly used to format a paragraph and emphasize important words."
1, 3, "The cat sat on the mat. The dog lay beside.", "<p>The cat <strong>sat</strong> on the <em>mat</em>.</p> <p>The dog <strong>lay</strong> beside.</p>", "Validates that HTML formatting is properly applied with <strong> and <em> tags in different sentences."
While these test inputs are technically paragraphs, compliance fails because they are too short to meet the specified requirement that the output should have 3 paragraph tags: the expected outputs above contain only one or two. The solution should probably involve generating tests for one rule at a time, or even generating one test at a time, instead of generating all tests in a single call to the generation LLM.
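A minimal sketch of the one-test-per-call idea, paired with a check that flags expected outputs with too few paragraph tags. This is not PromptPex code: `generate_one` is a hypothetical callable standing in for a single call to the generation LLM, `generate_tests_per_rule` and `count_paragraph_tags` are illustrative names, and reading the requirement as "at least 3 `<p>` tags" is an assumption.

```python
import re
from typing import Callable, Dict, List

# Matches an opening <p> tag (with or without attributes), but not <pre>, <param>, etc.
P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)


def count_paragraph_tags(html: str) -> int:
    """Count the opening <p> tags in an expected output."""
    return len(P_OPEN.findall(html))


def generate_tests_per_rule(
    rules: List[str],
    generate_one: Callable[[str, int], Dict[str, str]],  # hypothetical: one LLM call per test
    tests_per_rule: int = 3,
    min_paragraphs: int = 3,  # assumption: the requirement means "at least 3"
) -> List[Dict[str, object]]:
    """Generate tests one at a time, rule by rule, and flag short outputs."""
    tests: List[Dict[str, object]] = []
    for rule_id, rule in enumerate(rules, start=1):
        for test_id in range(1, tests_per_rule + 1):
            raw = generate_one(rule, test_id)  # a single call yields a single test
            test: Dict[str, object] = {**raw, "ruleid": rule_id, "testid": test_id}
            # Record whether the expected output has enough paragraph tags.
            test["meets_paragraph_rule"] = (
                count_paragraph_tags(raw["expectedoutput"]) >= min_paragraphs
            )
            tests.append(test)
    return tests
```

Applied to the tests above, `count_paragraph_tags` finds one `<p>` tag in the expected output of test 1 and two in test 3, so both would be flagged. Tests flagged this way could be regenerated individually rather than rerunning the whole batch.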