LLM-Evaluation

Papers

2022

  • (2022-09) News Summarization and Evaluation in the Era of GPT-3 paper

2023

  • (2023-01) How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection paper | project

  • (2023-01) Is ChatGPT A Good Translator? A Preliminary Study paper | code

    ❗ The authors evaluate on only 50 randomly selected sentences, since no API was available at the time.

  • (2023-01) Benchmarking Large Language Models for News Summarization paper

  • (2023-02) Is ChatGPT a General-Purpose Natural Language Processing Task Solver? paper

    ❗ No large-scale dataset evaluation and no few-shot in-context learning evaluation, since no API was available at the time.

  • (2023-02) ChatGPT: Jack of all trades, master of none paper

  • (2023-02) Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT paper

  • (2023-02) On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective paper

  • (2023-02) Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization paper

  • (2023-02) ChatGPT: potential, prospects, and limitations paper

  • (2023-03) How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks paper

  • (2023-03) ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks paper

  • (2023-03) Consistency Analysis of ChatGPT paper

  • (2023-03) Could a Large Language Model be Conscious? paper

  • (2023-03) Susceptibility to Influence of Large Language Models paper

  • (2023-03) A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models paper

  • (2023-03) Sparks of Artificial General Intelligence: Early experiments with GPT-4 paper

  • (2023-03) Is ChatGPT a Good NLG Evaluator? A Preliminary Study paper

  • (2023-04) Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation paper

  • (2023-04) Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study paper

  • (2023-04) Emergent and Predictable Memorization in Large Language Models paper

  • (2023-04) Why Does ChatGPT Fall Short in Answering Questions Faithfully? paper

  • (2023-04) Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness paper

  • (2023-04) Are Emergent Abilities of Large Language Models a Mirage? paper

  • (2023-10) Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators paper | code

Useful Resources