How is ChatGPT's behavior changing over time?, Lingjiao Chen+, N/A, arXiv'23 #887

AkihikoWatanabe · 2023-07-22T08:47:50Z

URL

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.

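Translation of the abstract: GPT-3.5 and GPT-4 are the most widely used large language model (LLM) services. However, when and how these models are updated is opaque. We therefore evaluated the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four different tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. The results show that the performance and behavior of GPT-3.5 and GPT-4 vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), whereas GPT-4 (June 2023) had a very low accuracy (2.4%) on the same questions. Interestingly, GPT-3.5 (June 2023) was far better on this task than GPT-3.5 (March 2023). GPT-4 was more reluctant to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, the findings show that the behavior of the same LLM service can change substantially in a relatively short period, underscoring the need to continuously monitor LLM quality.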

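Summary: GPT-3.5 and GPT-4 are large language model (LLM) services whose performance and behavior were found to vary over time. For example, GPT-4 (March 2023) was good at identifying prime numbers, but the June 2023 version scored much lower on the same questions. GPT-3.5, by contrast, improved on this task between March and June. In addition, GPT-4 became less willing to answer sensitive questions, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation. These results suggest the need to continuously monitor LLM quality.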
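
To make the "continuous monitoring" idea concrete, here is a minimal sketch of comparing snapshots of a service on a fixed prime-identification question set. This is not the paper's evaluation code. The names `accuracy`, `drift_report`, and the `threshold` parameter are hypothetical, and each snapshot is represented by a plain Python callable that stands in for a real model call.

```python
from math import isqrt
from typing import Callable, Dict, Iterable, List, Tuple


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def accuracy(answer_fn: Callable[[int], bool], numbers: Iterable[int]) -> float:
    """Fraction of numbers for which answer_fn agrees with is_prime."""
    numbers = list(numbers)
    correct = sum(answer_fn(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


def drift_report(
    snapshots: Dict[str, Callable[[int], bool]],
    numbers: Iterable[int],
    threshold: float = 0.1,  # hypothetical tolerance, not taken from the paper
) -> Tuple[Dict[str, float], List[str]]:
    """Score every snapshot on the same questions and flag those whose
    accuracy differs from the first (baseline) snapshot by more than threshold."""
    numbers = list(numbers)
    scores = {name: accuracy(fn, numbers) for name, fn in snapshots.items()}
    names = list(scores)
    baseline = scores[names[0]]
    flagged = [n for n in names[1:] if abs(scores[n] - baseline) > threshold]
    return scores, flagged


if __name__ == "__main__":
    # Stand-in answer functions (assumptions for illustration only);
    # in practice each would wrap a call to a dated model snapshot.
    snapshots = {
        "snapshot_a": is_prime,            # answers every question correctly
        "snapshot_b": lambda n: False,     # always answers "not prime"
    }
    questions = [2, 3, 5, 7, 11, 4, 6, 8, 9, 10]
    scores, flagged = drift_report(snapshots, questions)
    print(scores, flagged)
```

Keeping the question set fixed and re-running the report on a schedule is what turns a one-off comparison into monitoring. The threshold would need to be chosen per task.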

AkihikoWatanabe · 2023-07-22T08:48:44Z

If neither GPT-3.5 nor GPT-4 is frozen, results obtained with them cannot be reproduced, so they should not be used in research.

AkihikoWatanabe · 2023-07-22T08:49:46Z

Also, it would be intolerable to have performance on some tasks silently degraded without any notice.

AkihikoWatanabe added action_wanted Pocket labels Jul 22, 2023

AkihikoWatanabe changed the title あ How is ChatGPT's behavior changing over time?, Lingjiao Chen+, N/A, arXiv'23 Jul 22, 2023

AkihikoWatanabe added NLP ChatGPT Evaluation and removed action_wanted labels Oct 21, 2023