Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors, Tung Phung+, N/A, arXiv'23 #803

AkihikoWatanabe · 2023-07-11T14:02:46Z

URL

https://arxiv.org/abs/2306.17156

Affiliations

Tung Phung, N/A
Victor-Alexandru Pădurean, N/A
José Cambronero, N/A
Sumit Gulwani, N/A
Tobias Kohn, N/A
Rupak Majumdar, N/A
Adish Singla, N/A
Gustavo Soares, N/A

Abstract

Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited for several reasons, as they typically consider already outdated models or only specific scenario(s). Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions on developing techniques to improve the performance of these models.

Translation (by gpt-3.5-turbo)

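Generative AI and large language models hold great potential to improve computing education by powering next-generation educational technologies for introductory programming.
Recent work has studied these models in various scenarios relevant to programming education, but these studies are limited for several reasons.
They tend to consider models that are already outdated, or only specific scenarios.
As a result, there is a lack of systematic studies that benchmark state-of-the-art models across a comprehensive set of programming education scenarios.
In this work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors across a variety of scenarios.
Specifically, we evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert annotations.
The results show that GPT-4 substantially outperforms ChatGPT (based on GPT-3.5) and comes close to the performance of human tutors in several scenarios.
They also reveal settings where GPT-4 still struggles, pointing to interesting future directions for developing techniques to improve the performance of these models.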

Summary (by gpt-3.5-turbo)

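Generative AI and large language models hold great potential for improving programming education. However, prior studies are limited, and benchmarks of state-of-the-art models across a comprehensive set of programming education scenarios are lacking. This work evaluates two models, ChatGPT and GPT-4, and compares their performance with that of human tutors. The results show that GPT-4 substantially outperforms ChatGPT and comes close to human tutors in some scenarios. The results also point to interesting directions for improving GPT-4.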

AkihikoWatanabe · 2023-07-11T14:04:27Z

When GPT-4 and GPT-3.5 were evaluated in the context of programming education, GPT-4 outperformed GPT-3.5 and came close to human tutors.

AkihikoWatanabe added the Pocket label Jul 11, 2023

AkihikoWatanabe changed the title from あ to Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors, Tung Phung+, N/A, arXiv'23 Jul 11, 2023

AkihikoWatanabe added LanguageModel Education and removed Pocket labels Jul 11, 2023
