Paper Collection for Corpus

Unimodal Corpus

  1. [TMLR2023] The Stack The Stack: 3 TB of permissively licensed source code. arXiv, 2022.11

    Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

  2. [Link] codeparrot-clean the deduplicated version of the codeparrot, 2022.10

    Loubna Ben Allal, Leandro von Werra, Jia Li, Thomas Wolf, Zdar

  3. [ICLR2023] BigPython CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv, 2022.03

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong

  4. [Link] CodeParrot full CodeParrot dataset, 2022.02

    Lewis Tunstall, Leandro von Werra, Thomas Wolf

  5. [MSR2013] GHTorrent The GHTorrent dataset and tool suite. 2013.05

    Georgios Gousios

General Purpose/Bimodal Corpus

  1. [Preprint] StarCoder 2 and The Stack v2: The Next Generation. arXiv, 2024.02

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

  2. [Link] CodeAlpaca-20k CodeAlpaca-20k dataset, 2023.10

    Sahil Chaudhary

  3. [Preprint] CodeTextbook Textbooks Are All You Need II: phi-1.5 technical report. arXiv, 2023.09

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

  4. [Preprint] CommitPack, CommitPackFT OctoPack: Instruction Tuning Code Large Language Models. arXiv, 2023.08

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre

  5. [Link] GitHub Code GitHub Code dataset, 2022.10

    Loubna Ben Allal, Leandro von Werra, Jia Li, Thomas Wolf, Zdar

  6. [NeurIPS2021] CodeNet CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. arXiv, 2021.05

    Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

  7. [NeurIPS2021] APPS Measuring Coding Challenge Competence With APPS. arXiv, 2021.05

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

  8. [EMNLP2020] PyMT5 PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv, 2020.10

    Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan

  9. [EMNLP-IJCNLP2019] JuICe JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation. arXiv, 2019.10

    Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer

  10. [Preprint] CodeSearchNet CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv, 2019.09

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt

  11. [Link] BigQuery Google BigQuery Public Datasets, 2016.06

    Google