update README
EEElisa committed Oct 14, 2024
1 parent 0faaa74 commit 9f0ab13
Showing 1 changed file with 7 additions and 11 deletions.
18 changes: 7 additions & 11 deletions README.md
@@ -1,10 +1,10 @@
-### Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts
+### When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

-This is the repository for the paper: Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts
+This is the repository for the paper: When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

-Authors: Mingqian Zheng, Jiaxin Pei and David Jurgens
+Authors: Mingqian Zheng, Jiaxin Pei, Lajanugen Logeswaran, Moontae Lee, and David Jurgens

-Abstract: Prompting serves as the major way humans interact with Large Language Models (LLM). Commercial AI systems commonly define the role of the LLM in system prompts. For example, ChatGPT uses "You are a helpful assistant" as part of the default system prompt. But is ``a helpful assistant'' the best role for LLMs? In this study, we present a systematic evaluation of how social roles in system prompts affect model performance. We curate a list of 162 roles covering 6 types of interpersonal relationships and 8 types of occupations. Through extensive analysis of 3 popular LLMs and 2457 questions, we show that adding interpersonal roles in prompts consistently improves the models' performance over a range of questions. Moreover, while we find that using gender-neutral roles and specifying the role as the audience leads to better performances, predicting which role leads to the best performance remains a challenging task, and that frequency, similarity, and perplexity do not fully explain the effect of social roles on model performances. Our results can help inform the design of system prompts for AI systems.
+Abstract: Prompting serves as the major way humans interact with Large Language Models (LLM). Commercial AI systems commonly define the role of the LLM in system prompts. For example, ChatGPT uses ``You are a helpful assistant'' as part of its default system prompt. Despite current practices of adding personas to system prompts, it remains unclear how different personas affect a model's performance on objective tasks. In this study, we present a systematic evaluation of personas in system prompts. We curate a list of 162 roles covering 6 types of interpersonal relationships and 8 domains of expertise. Through extensive analysis of 4 popular families of LLMs and 2,410 factual questions, we demonstrate that adding personas in system prompts does not improve model performance across a range of questions compared to the control setting where no persona is added. Nevertheless, further analysis suggests that the gender, type, and domain of the persona can all influence the resulting prediction accuracies. We further experimented with a list of persona search strategies and found that, while aggregating results from the best persona for each question significantly improves prediction accuracy, automatically identifying the best persona is challenging, with predictions often performing no better than random selection. Overall, our findings suggest that while adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random.

The paper is available on [Arxiv](https://arxiv.org/abs/2311.10054).

@@ -18,20 +18,16 @@ The paper is available on [Arxiv](https://arxiv.org/abs/2311.10054).
├── scripts <- Scripts to run experiments and get information
| ├── classifier <- Train dataset classifier and role classifier
| ├── llm_inference_pipeline <- Run experiments on different prompt templates and roles across various models
| ├── llm_pick_role <- Prompt LLMs to pick the best role for a given question
| ├── lmppl_compute <- Get perplexity of pairs of prompt and question
| ├── vllm_inference_pipeline <- Run experiments on different prompt templates and roles across various models
| ├── lmppl-compute <- Get perplexity of pairs of prompt and question
| ├── ngram_frequency <- Get Google ngram frequency of role word
| ├── ppl_encoder_decoder_lm <- Get perplexity of encoder-decoder LLMs
| ├── similarity <- Compute similarity between question and prompt
| ├── threads_inference <- Run experiments on GPUs via threading
| ├── utilities <- Utility functions
├── analysis_notebooks <- Jupyter notebooks for data analysis and plotting
│ ├── classifier_training_data_process <- Pre-process data for classifier training
| ├── dataset_role_preparation <- Prepare data for experiments
| ├── gender_impact <- Gender impact analysis
| ├── role_performance_analysis <- Analysis of role differences
| ├── plot <- Plotting codes for paper figures
| ├── plot_utilities <- Utility functions for plotting
|
└── README.md