This repository contains all the code for collecting large scale amounts of code from GitHub.

CarperAI/Code-Pile

Code-Pile

This repository contains the processing scripts to scrape/process the code-pile dataset.

Table of Contents

  • Project Description
  • How to use the Code-Pile (todo)
  • How to Contribute
  • Additional Resources

Project Description

Check out the Code-Pile proposal.

The Code-Pile will be released in the same way as The Pile: as a folder of .jsonl.zst files (see lm-dataformat).
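As a rough illustration of what a .jsonl.zst shard holds, here is a minimal sketch of the JSON Lines layer only. It assumes each line is a JSON object with a `text` field and an optional `meta` field, which is how lm-dataformat is understood to lay out records. The helper names `write_jsonl` and `read_jsonl` are made up for this example, and the zstd compression layer is left out so the sketch needs only the standard library.

```python
import io
import json


def write_jsonl(stream, records):
    """Write (text, meta) pairs to a binary stream, one JSON object per line."""
    for text, meta in records:
        # Assumed record layout: {"text": ..., "meta": {...}}
        line = json.dumps({"text": text, "meta": meta})
        stream.write(line.encode("utf-8") + b"\n")


def read_jsonl(stream):
    """Yield (text, meta) pairs from a binary JSON Lines stream."""
    for raw in stream:
        raw = raw.strip()
        if raw:  # skip blank lines
            doc = json.loads(raw)
            yield doc["text"], doc.get("meta", {})


# Round trip through an in-memory buffer standing in for a shard file.
buffer = io.BytesIO()
write_jsonl(buffer, [("print('hello')", {"source": "example", "language": "python"})])
buffer.seek(0)
documents = list(read_jsonl(buffer))
```

For a real shard, the same functions could be pointed at a stream wrapped by a zstd compressor or decompressor, for instance from the third-party `zstandard` package, instead of the in-memory buffer.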

How to use the Code-Pile

The dataset is not finished yet. In the meantime, ask on Discord.

How to Contribute

Think about which code data would be most useful for the next generation of textual code models.

The most valuable dataset properties (use your own judgment) are:

  1. Open License
  2. Data quality
  3. Dataset size
  4. Data variance/variety/nicheness
  5. Ease of obtaining/processing

To add a new dataset, open an issue using the dataset-request template. Gather all the relevant information in that issue and use it to track progress.

Check whether code already exists or someone is already working on it (see Additional Resources):

  1. Eleuther's Pile V1 repos
  2. Ask on Carper #code-pile
  3. Ask on Eleuther
  4. Consult the linked spreadsheets below

Then implement it through the following steps:

  1. Fork this repo
  2. Use the working branch
  3. Read the shared classes in datasets.py and codepile.py
  4. Create an MVP/example for your dataset
  5. Create a pull request
  6. Keep building the domain-specific classes and repeat
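The MVP step above might look roughly like the following sketch. Every name here (`Document`, `HypotheticalDataset`, `ToySnippetDataset`, `download`, `process`, `run`) is an assumption made for illustration and is not taken from datasets.py or codepile.py; read those files for the actual shared classes and adapt accordingly.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Hypothetical record type: one text plus its metadata."""
    text: str
    meta: dict = field(default_factory=dict)


class HypotheticalDataset(ABC):
    """Illustrative base class only; the real shared classes may differ."""

    @abstractmethod
    def download(self) -> List[str]:
        """Fetch raw items from the data source."""

    @abstractmethod
    def process(self, raw_items: List[str]) -> Iterator[Document]:
        """Turn raw items into cleaned documents."""

    def run(self) -> List[Document]:
        # Download first, then process everything that was fetched.
        return list(self.process(self.download()))


class ToySnippetDataset(HypotheticalDataset):
    """Toy MVP: hard-coded snippets stand in for a real scraper."""

    def download(self) -> List[str]:
        return ["def add(a, b):\n    return a + b\n", "", "x = 1\n"]

    def process(self, raw_items: List[str]) -> Iterator[Document]:
        for index, item in enumerate(raw_items):
            if item.strip():  # drop empty items
                yield Document(text=item, meta={"source": "toy", "index": index})


docs = ToySnippetDataset().run()
```

Keeping download and processing as separate methods is one possible design; it lets a small example be reviewed in a pull request before the full scraper exists.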

Citation Placeholder:

@misc{Code-Pile,
  author = {},
  doi = {},
  month = {},
  title = {},
  url = {https://github.com/CarperAI/Code-Pile},
  version = {},
  year = {2022}
}

Additional Resources

Closely related projects:

Previous work:
