-
[TMLR2023]
The Stack
The Stack: 3 TB of permissively licensed source code., 2022.11
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries
-
[Link]
codeparrot-clean
The deduplicated version of the CodeParrot dataset, 2022.10
Loubna Ben Allal, Leandro von Werra, LIJia, Thomas Wolf, Zdar
-
[ICLR2023]
BigPython
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis., 2022.03
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong
-
[Link]
CodeParrot
full CodeParrot dataset, 2022.02
Lewis Tunstall, Leandro von Werra, Thomas Wolf
-
[MSR2013]
GHTorrent
The GHTorrent dataset and tool suite., 2013.05
Georgios Gousios
-
[Preprint]
The Stack v2
StarCoder 2 and The Stack v2: The Next Generation., 2024.02
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
-
[Link]
CodeAlpaca-20k
CodeAlpaca-20k dataset, 2023.10
Sahil Chaudhary
-
[Preprint]
CodeTextbook
Textbooks Are All You Need II: phi-1.5 technical report., 2023.09
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
-
[Preprint]
CommitPack, CommitPackFT
OctoPack: Instruction Tuning Code Large Language Models., 2023.08
Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre
-
[Link]
GitHub Code
GitHub Code dataset, 2022.10
Loubna Ben Allal, Leandro von Werra, LIJia, Thomas Wolf, Zdar
-
[NeurIPS2021]
CodeNet
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks., 2021.05
Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss
-
[NeurIPS2021]
APPS
Measuring Coding Challenge Competence With APPS., 2021.05
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt
-
[EMNLP2020]
PyMT5
PyMT5: multi-mode translation of natural language and Python code with transformers., 2020.10
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan
-
[EMNLP-IJCNLP2019]
JuICe
JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation., 2019.10
Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer
-
[Preprint]
CodeSearchNet
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search., 2019.09
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt
-
[Link]
BigQuery
Google BigQuery Public Datasets, 2016.06
Google