"The hottest new programming language is English" - Andrej Karpathy, 24 Jan 2023
Prompt engineering is the craft of writing input queries (prompts) that communicate effectively with AI models like ChatGPT. Think of it as writing instructions for a highly capable yet sometimes unpredictably dumb personal assistant.
This guide serves as a hands-on resource for developers and early adopters using large language models (LLMs). It goes beyond the usual one-off task prompts, focusing instead on processing large quantities of inputs via an API. When manual review of every output isn't feasible, it's critical to evaluate and manage the trade-offs between cost, speed, and output quality. That is why we emphasize the 'engineering' part of prompt engineering here.
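To make the API-scale setting concrete, here is a minimal batch-processing sketch (assuming the openai Python package and an API key in the environment; the classify_ticket task, prompt, and model choice are all illustrative):

```python
# A minimal batch-processing sketch (assumes the `openai` package and
# an OPENAI_API_KEY in the environment; prompt and model are illustrative).
from openai import OpenAI

client = OpenAI()

def classify_ticket(text: str) -> str:
    """Send one input through a fixed prompt and return the model's label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": "Classify the support ticket as: technical, billing, or general. Reply with the label only."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # near-deterministic output for repeatable batch runs
    )
    return response.choices[0].message.content.strip()

tickets = ["My invoice is wrong.", "The app crashes on startup."]
labels = [classify_ticket(t) for t in tickets]  # at scale: batch, rate-limit, retry
print(labels)
```

At real scale you would add rate limiting, retries, and caching; the point is that one fixed prompt is applied to many inputs, which is why its quality matters so much.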
Our aim with this guide is to organize links to key external resources, and give concise commentary to help you find what's relevant for your task.
If you want to contribute to this guide, please open an issue, send a PR, or email me at prompts@matthiasberth.com.
In the exploration phase of prompt engineering, the focus is on generating a range of candidate prompts that perform effectively on example inputs. This phase involves using a playground environment to experiment with various combinations of instructions, examples, and inputs, allowing for the identification and resolution of issues. Rapid iteration and drawing inspiration from existing prompts in the wild are key strategies during this phase.
- Prompt engineering guide from OpenAI
The guide covers "Six strategies for getting better results":
- Write clear instructions
- Provide reference text
- Split complex tasks into simpler subtasks
- Give the model time to "think"
- Use external tools
- Test changes systematically
Each comes with a set of tactics (like "Ask the model if it missed anything on previous passes"). The guide provides direct links to the OpenAI playground where you can try out examples.
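As an illustration of the quoted tactic, a follow-up turn can ask the model to check its own previous pass. A sketch, again assuming the openai package (the prompts paraphrase the guide's idea rather than quoting it):

```python
from openai import OpenAI

client = OpenAI()
document = "..."  # the reference text to extract from

messages = [
    {"role": "system", "content": "List all action items mentioned in the document."},
    {"role": "user", "content": document},
]
first_pass = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
messages.append({"role": "assistant", "content": first_pass.choices[0].message.content})

# The tactic: explicitly ask the model whether it missed anything earlier.
messages.append({"role": "user", "content": "Are there more action items you missed on previous passes?"})
second_pass = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(second_pass.choices[0].message.content)
```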
- Microsoft's Introduction to prompt engineering
The intro covers common techniques and best practices. The companion techniques article discusses chain-of-thought prompting and the influence of the temperature parameter, among other topics.
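Both ideas are easy to try from code. In this sketch (assuming the openai package), the "let's think step by step" suffix is the classic zero-shot chain-of-thought trigger, and a low temperature trades creativity for consistency:

```python
from openai import OpenAI

client = OpenAI()

question = "A crate holds 12 boxes of 8 widgets each. 30 widgets are defective. How many good widgets are there?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # Chain-of-thought prompting: ask for intermediate reasoning first.
        {"role": "user", "content": question + "\nLet's think step by step, then state the final answer."},
    ],
    temperature=0,  # low temperature: more deterministic, good for reasoning tasks
)
print(response.choices[0].message.content)
```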
- Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4
This research paper presents 26 guiding principles and evaluates their effectiveness across several models.
- CO-STAR framework
Suggests structuring the prompt as Context, Objective, Style, Tone, Audience, Response. This makes a lot of sense and helped the author win a competition. I'm still trying to track down original sources for the CO-STAR framework and that competition.
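The framework translates directly into a prompt template. A sketch (the section labels follow CO-STAR; the field contents are invented for illustration):

```python
# A CO-STAR prompt template sketch; field contents are illustrative.
costar_prompt = """\
# CONTEXT
You are writing for a SaaS company's release notes page.

# OBJECTIVE
Summarize the changelog below for end users.

# STYLE
Concise, plain language, no internal jargon.

# TONE
Friendly and professional.

# AUDIENCE
Non-technical customers of the product.

# RESPONSE
A bulleted list of at most five items.

Changelog:
{changelog}
"""

print(costar_prompt.format(changelog="- Fixed crash on login\n- Added CSV export"))
```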
- Many of the prompts in such collections are geared to everyday use, but some categories are relevant to the tasks covered here.
Prompt collections / Libraries
- LangChain Hub collects prompts in a variety of areas, e.g. Tagging, Summarization, Extraction.
- LangChain also has prompts baked into its code. For example, here is a set of prompts for checking the correctness of summarizations: langchain/chains/llm_summarization_checker/prompts. So you can look up a use case in the LangChain docs (Summarization) and locate the relevant code. The same approach works for frameworks that ship prompts with their code, e.g. langchain, Llamaindex.
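Pulling a prompt from LangChain Hub takes one call. A sketch, assuming the langchain and langchainhub packages are installed ("rlm/rag-prompt" is one of the publicly listed hub prompts; exact APIs vary across LangChain versions):

```python
from langchain import hub

# Pull a published prompt from LangChain Hub by its handle.
prompt = hub.pull("rlm/rag-prompt")  # a public RAG prompt on the hub

# Fill in its input variables to see the final prompt text.
print(prompt.format(context="Paris is the capital of France.",
                    question="What is the capital of France?"))
```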
Finding examples by task / use case
Know the general category of your task so you can search effectively for prompt examples, papers, and benchmark datasets. (A worked data-extraction example follows the list.)
- Data Extraction. Example: extracting product numbers and due dates from unstructured orders received via email. (google this)
- Sentiment Analysis. Example: analyzing customer feedback to determine sentiment towards a product or service. (google this)
- Chatbot Conversations. Example: developing chatbots for handling customer service inquiries. (google this)
- Text Classification. Example: categorizing support tickets into departments like technical, billing, or general inquiries. (google this)
- Named Entity Recognition (NER). Example: identifying company names in financial reports. (google this)
- Keyword Extraction. Example: extracting relevant keywords for SEO or document summarization. (google this)
- Language Translation. Example: translating business documents or communications between languages. (google this)
- Summarization. Example: generating concise summaries of long documents like business reports. (google this)
- Topic Modeling. Example: identifying main topics in customer feedback or a collection of articles. (google this)
- Spam Detection. Example: filtering out spam comments in a forum. (google this)
- Intent Recognition. Example: understanding the intent behind customer messages in chatbot interactions. (google this)
- Text Generation. Example: automatically generating text like product descriptions based on data inputs. (google this)
- Question Answering Systems. Example: building systems for answering customer questions in natural language. (google this)
- Emotion Detection. Example: identifying emotional states in text to understand customer sentiment. (google this)
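As the promised worked example for the first category, here is a data-extraction sketch (assuming the openai package; the field names and the email are invented for illustration):

```python
import json
from openai import OpenAI

client = OpenAI()

email = """Hi, please ship 5 units of part no. XK-42 to our warehouse.
We need them by March 14. Thanks, Dana"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for machine-readable output
    messages=[
        {"role": "system", "content": "Extract order data from the email. Reply as JSON with keys: product_number, due_date, quantity. Use null for missing fields."},
        {"role": "user", "content": email},
    ],
    temperature=0,
)
order = json.loads(response.choices[0].message.content)
print(order)  # e.g. {"product_number": "XK-42", "due_date": "March 14", "quantity": 5}
```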
- OpenAI Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
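Even without adopting the full framework, the core idea, scoring a fixed prompt against labeled examples, fits in a few lines. A homegrown sketch, not the Evals API (classify_ticket is the hypothetical function from the batch-processing sketch above, and the labeled set is invented):

```python
# A minimal exact-match eval: run a labeled set through the prompt, report accuracy.
labeled = [
    ("My invoice is wrong.", "billing"),
    ("The app crashes on startup.", "technical"),
    ("What are your office hours?", "general"),
]

hits = sum(classify_ticket(text) == expected for text, expected in labeled)
print(f"accuracy: {hits}/{len(labeled)} = {hits / len(labeled):.0%}")
```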
When all your prompt engineering efforts still don't give good enough results, you can try some alternatives:
- Use another model. If you haven't done so already, try a different model with roughly the same or better capabilities. Keep in mind that performance is determined by the combination of model and prompt, so you may want to iterate on your best prompt.
- Fine-tune an existing model. You can select examples from your current dataset, or create them by hand.
- Invest in better examples for a few-shot prompt. Think about providing more examples, more diverse examples, and both positive and negative examples (see the few-shot sketch after this list). If you're using RAG, try investing in the retrieval part of the pipeline.
- Use ensembles / mixtures of experts. Solve the same task with multiple different prompts or models, then consolidate the results with a majority vote or some other mechanism (see the majority-vote sketch after this list).
- Use automated methods to find a better prompt and/or better examples. For example, the DSPy paper reports performance improvements of 16-40% for its auto-optimized pipelines.
- Roll your own NLP solution. For some tasks you don't necessarily need a large language model; it's just much more convenient to use one. There is a wide array of more classical NLP methods that you may want to try, and you can still let LLMs help you generate enough labeled data.
- Pause. Seriously, sometimes it may be a viable approach to move on to the next promising application of LLMs. While you do that, something new may come up, like a price drop, a new and more advanced model, or a research breakthrough that makes it worth revisiting the task.
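For the few-shot option, the investment is mostly in the examples themselves. A sketch (assuming the openai package; the reviews are invented and deliberately include a positive-sounding non-complaint to mark the decision boundary):

```python
from openai import OpenAI

client = OpenAI()

# Few-shot prompt sketch: diverse worked examples embedded as prior turns.
messages = [
    {"role": "system", "content": "Decide whether the review expresses a complaint. Answer yes or no."},
    {"role": "user", "content": "The package arrived two weeks late."},
    {"role": "assistant", "content": "yes"},
    {"role": "user", "content": "Great value for the price!"},
    {"role": "assistant", "content": "no"},  # negative example: not a complaint
    {"role": "user", "content": "Works fine, though the manual is confusing."},
    {"role": "assistant", "content": "yes"},  # mixed review still counts
    {"role": "user", "content": "Shipping was fast and the box was undamaged."},
]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
print(response.choices[0].message.content)  # expected: "no"
```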
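And for the ensemble option, the simplest consolidation mechanism is a majority vote over several sampled answers, in the spirit of self-consistency. A sketch (ask_once is a hypothetical single-call helper; sampling with a nonzero temperature is what makes the vote meaningful):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask_once(question: str) -> str:
    """Hypothetical helper: one sampled answer to the same task."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question + "\nAnswer with a single word."}],
        temperature=0.8,  # diversity across samples is what the vote relies on
    )
    return response.choices[0].message.content.strip().lower()

answers = [ask_once("Is 'The package arrived two weeks late.' a complaint?") for _ in range(5)]
winner, votes = Counter(answers).most_common(1)[0]
print(f"majority answer: {winner} ({votes}/5 votes)")
```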