Prompt Engineering Notes

Practical lessons on prompt engineering for code-generation datasets used to train LLMs.

This repo is a working notebook from my job as an LLM Code Trainer & Dataset Quality Reviewer at Revelo, where I design and audit coding tasks across Python, TypeScript, JavaScript, C, and C++. The tasks have to be hard enough to create a measurable performance gap between a stronger and a weaker model, while still being solvable — and crucially, the unit tests have to actually catch what the prompt asks for.

These notes are the patterns and failure modes I keep running into. Nothing here is theoretical; every entry comes from a real task that either broke in review or taught me something about how LLMs interpret code prompts.


Table of contents

  1. The prompt-test parity gap
  2. Why happy-path tests aren't enough
  3. Ambiguity is the silent killer
  4. Calibrating difficulty — making the gap real
  5. asyncio.create_task vs await — a failure pattern
  6. Branch coverage ≠ line coverage
  7. Encoding corruption in C tasks
  8. Over-specification kills the signal
  9. When automated QA lies
  10. The "implied requirement" trap

1. The prompt-test parity gap

The most common failure I audit. The prompt says one thing, the tests verify something slightly different, and the model has no way to know which one it's actually being graded on.

Example: the prompt asks for a function that "returns the sum of unique elements in a list," and the tests mirror sum(set(lst)). But the prompt never specified behavior for unhashable elements (like nested lists). A correct solution that handles unhashables gracefully can fail the tests, while a naive solution that crashes on them can pass.

Fix: Either the prompt explicitly restricts input types, or the tests must tolerate the edge cases. Both prompt and tests must agree on the contract, down to the edge case.

Rule of thumb: If I can write two correct solutions that disagree on some input, the prompt is under-specified.
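
A minimal sketch of that rule in action (function names are hypothetical): two defensible readings of "returns the sum of unique elements" that disagree on the same input.

from collections import Counter

def sum_unique_dedup(lst):
    # Reading 1: "unique" means deduplicated values
    return sum(set(lst))

def sum_unique_singletons(lst):
    # Reading 2: "unique" means values that appear exactly once
    return sum(x for x, count in Counter(lst).items() if count == 1)

sum_unique_dedup([1, 1, 2])       # 3
sum_unique_singletons([1, 1, 2])  # 2 -- both defensible, so the prompt is under-specified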


2. Why happy-path tests aren't enough

A test suite that only covers the "normal" input path creates a weak signal. Both a strong model and a weak model pass it, so the task doesn't differentiate them. The gap between models only shows up at the edges.

What actually separates models:

  • Empty inputs ([], "", None, 0)
  • Single-element inputs
  • Max-size inputs (boundary of stated constraints)
  • Duplicate entries
  • Negative numbers or Unicode strings when the prompt didn't forbid them
  • Unsorted vs sorted inputs
  • Inputs that trigger early returns

A good test suite for calibration looks like this: 2-3 happy-path tests, 5-7 edge cases, and 2-3 adversarial cases.
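
A minimal pytest sketch of that shape, written against a hypothetical sum_unique contract (deduplicated sum, TypeError on unhashables):

import pytest

def sum_unique(lst):
    return sum(set(lst))

# happy path
def test_happy_path():
    assert sum_unique([1, 2, 3]) == 6

# edge cases
@pytest.mark.parametrize("lst, expected", [
    ([], 0),           # empty input
    ([7], 7),          # single element
    ([1, 1, 1], 1),    # duplicates
    ([-1, 1, -1], 0),  # negatives
])
def test_edge_cases(lst, expected):
    assert sum_unique(lst) == expected

# adversarial
def test_adversarial_unhashable():
    with pytest.raises(TypeError):
        sum_unique([[1], 2])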


3. Ambiguity is the silent killer

If a prompt has two defensible interpretations, models split between them. The ones that picked the interpretation you didn't test fail for no good reason. This doesn't measure model capability — it measures lucky guessing.

Red flags to catch in review:

  • Vague verbs: "process," "handle," "manage" (does "process" mean filter? transform? validate?)
  • Unstated null behavior: what does the function return for invalid input? Raise? Return None? Return default?
  • Unclear mutation: does the function mutate the input or return a new value?
  • "Sort the list" — ascending or descending? Stable?
  • "Parse the string" — what format? What if it's malformed?

Fix: State the contract explicitly. For Python, show the function signature with type hints; for C, show the exact struct layout. Leave nothing to interpretation.
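
A sketch of what "state the contract" can look like in Python (a hypothetical task, not one of mine):

def top_k(scores: list[float], k: int) -> list[float]:
    """Return a NEW list of the k largest scores in descending order.

    - Does not mutate `scores`.
    - Equal scores keep their original relative order (stable).
    - Raises ValueError if k is negative.
    - Returns all scores (sorted) if k exceeds len(scores).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return sorted(scores, reverse=True)[:k]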


4. Calibrating difficulty — making the gap real

A task that both strong and weak models solve teaches the model nothing. A task that neither solves is noise. The sweet spot is where a stronger model consistently solves it and a weaker one fails in a specific, diagnosable way.

Techniques that reliably create gaps:

  • Non-obvious edge cases: off-by-one at boundaries, Unicode normalization, float precision.
  • Compositional complexity: 2-3 independent subproblems that must be combined correctly. Weaker models handle each in isolation but botch the composition.
  • State management: concurrency primitives, stateful iterators, reset behavior.
  • Constraint satisfaction: multiple constraints that interact (e.g., "O(n log n) time AND O(1) extra space").
  • Idiomatic vs naive: tasks where the naive solution passes correctness tests but is penalized by a performance test (sketched below).
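
The last technique is the easiest to sketch (hypothetical function names): a naive solution that passes every correctness test but cannot meet a timing budget.

import time

def has_duplicates_naive(lst):
    # O(n^2): correct on every input, so happy-path tests pass
    return any(lst.count(x) > 1 for x in lst)

def has_duplicates_idiomatic(lst):
    # O(n): what the performance test rewards
    return len(set(lst)) != len(lst)

def test_performance_budget():
    data = list(range(20_000))
    start = time.perf_counter()
    assert has_duplicates_idiomatic(data) is False
    # The O(n^2) version does roughly 4 * 10^8 comparisons here and blows the budget.
    assert time.perf_counter() - start < 0.5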

5. asyncio.create_task vs await — a failure pattern

A specific pattern I've seen models fail on.

import asyncio

async def main():
    # This creates a task that runs concurrently with the caller.
    # If not awaited or tracked, it can be garbage-collected mid-execution.
    task = asyncio.create_task(do_something())

    # This blocks until the coroutine completes.
    result = await do_something()

Weaker models often treat create_task and await as interchangeable. A well-designed prompt exploits this by requiring concurrent execution (so awaiting each coroutine sequentially is wrong) while also requiring results to be collected (so fire-and-forget create_task loses them).

Correct pattern: create_task + store references + await asyncio.gather(*tasks) or await task later.

Testing this requires measuring timing, not just correctness of output. asyncio.gather of 3 tasks that each sleep 1s should complete in ~1s, not 3s.
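
Putting both requirements together in a minimal sketch (do_something here is a stand-in coroutine):

import asyncio
import time

async def do_something(i):
    await asyncio.sleep(1)
    return i

async def main():
    # Create tasks, keep references, then gather the results.
    tasks = [asyncio.create_task(do_something(i)) for i in range(3)]
    return await asyncio.gather(*tasks)

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

assert results == [0, 1, 2]  # results are collected, not dropped
assert elapsed < 1.5         # concurrent: ~1s total, not ~3s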


6. Branch coverage ≠ line coverage

Line coverage hits 100% when every line runs. Branch coverage hits 100% only when every conditional branch (both True and False paths) runs. They're very different.

Example:

def f(x):
    if x > 0:
        return "positive"
    return "non-positive"

A test with only f(5) gives 100% line coverage (every line runs) but only 50% branch coverage (the implicit else path is never tested).
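
One added case closes the gap:

assert f(5) == "positive"      # exercises the True branch
assert f(0) == "non-positive"  # exercises the implicit else branch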

Models often generate tests with strong line coverage but weak branch coverage, creating false confidence. In review, I check branch coverage explicitly with tools like coverage run --branch for Python, gcov -b for C, or c8 --branches for JS/TS.


7. Encoding corruption in C tasks

Real bug I caught in review. A reference solution for a Merkle tree task passed all tests but produced incorrect hashes for specific inputs. Root cause: the solution read bytes assuming ASCII but didn't handle UTF-8 multi-byte sequences correctly. All tests used ASCII-only inputs, so the bug was invisible.

Lesson: If a task involves string manipulation, hashing, or byte-level work in C, the test suite must include at least one UTF-8 input. This is a common oversight because authors default to ASCII inputs when sketching tests.
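
The same class of bug is easy to reproduce in Python (a sketch, not the original C code): byte-level slicing that works on ASCII but splits a multi-byte UTF-8 sequence.

def first_n_bytes(s: str, n: int) -> str:
    # Buggy: slicing encoded bytes can cut a multi-byte sequence in half
    return s.encode("utf-8")[:n].decode("utf-8")

first_n_bytes("hello", 2)  # "he" -- every ASCII-only test passes
first_n_bytes("héllo", 2)  # UnicodeDecodeError: 'é' is two bytes; we kept only one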

Similar patterns: null terminators in binary data, endianness in network code, locale-dependent sorting.


8. Over-specification kills the signal

The opposite of ambiguity, but equally bad. A prompt that spells out every implementation detail reduces the task to transcription. The model isn't solving a problem; it's typing what you dictated.

Bad: "Write a function that takes a list, creates an empty dict, iterates through the list, and for each element increments its count in the dict..."

Good: "Write a function that returns a dict mapping each unique element to its count."

The good version tests whether the model can design a solution. The bad version tests whether it can follow instructions.

Heuristic: If the prompt reads like pseudocode, rewrite it as a contract (input → output behavior).


9. When automated QA lies

Automated coverage and test-running tools are imperfect. Some real issues:

  • Python coverage.py can miss coverage of code inside eval() or dynamically generated functions.
  • gcov on C can report coverage for inlined functions incorrectly.
  • Test runners sometimes report "passed" for tests that silently return before reaching any assertion (see the sketch below).
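
The last failure is worth a sketch (a hypothetical test, not from a real suite): most runners count this as a pass even though no assertion ever executes.

import os

def test_parses_config():
    if not os.path.exists("config.json"):
        return  # reported as "passed" -- the assertion below never runs
    data = open("config.json").read()
    assert "version" in data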

When automated QA says something green that feels wrong, I verify manually: run the code with debug prints, check the actual test output, read the raw coverage XML. I've caught several bugs where the tooling lied.

Rule: Treat automated QA as a first-pass filter, not ground truth. High-stakes decisions need manual verification.


10. The "implied requirement" trap

Humans carry context into a prompt that a model doesn't have. If I write "sort this list," I implicitly mean "stable sort, ascending, don't mutate the original." A model sees just "sort this list" and might pick any of those choices.
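
For example, both of these satisfy "sort this list," yet they give the caller different contracts (hypothetical names):

def sort_scores_inplace(scores):
    scores.sort()          # mutates the caller's list
    return scores

def sort_scores_pure(scores):
    return sorted(scores)  # leaves the input untouched

data = [3, 1, 2]
assert sort_scores_pure(data) == [1, 2, 3] and data == [3, 1, 2]
assert sort_scores_inplace(data) == [1, 2, 3] and data == [1, 2, 3]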

Common implied requirements that bite in review:

  • Idempotency (calling the function twice gives the same result)
  • Thread safety (if the function might run concurrently)
  • Input validation (raise vs return None vs return default)
  • Side effects (print? log? pure function?)
  • Memory cleanup (in C/C++: who frees the result?)

Fix: Audit the prompt as if you were an alien who has never seen similar code. Whatever assumption you catch yourself making, write it into the prompt.


Why this matters

Every failure mode above is something a well-designed task catches and a poorly-designed task hides. The quality of an LLM trained on code generation is bottlenecked by the quality of the prompts and tests it learns from. Bad tasks don't just waste reviewer time — they actively teach the model wrong patterns.

These notes will keep growing. If something here is useful to you, or if you see a mistake, open an issue.


Author: Adolfo Daniel Santos — LinkedIn | GitHub
