Python library for getting metadata from source code hosting tools


Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as: GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.

Getting Started

code.gov is a newly launched website of the US Federal Government that allows the public to access metadata about the government's custom-developed software. This site requires metadata to function, and this Python library can help with that!

To get started, you will need a GitHub personal access token to make requests to the GitHub API. Set it in your environment or shell rc file under the name GITHUB_API_TOKEN:


$ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc

Additionally, to perform the labor hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or Homebrew.
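For example (assuming npm or Homebrew is available on your system; `cloc` is the standard package name in both):

```
$ npm install -g cloc    # via npm
$ brew install cloc      # or via Homebrew on macOS
```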

Then to generate a code.json file for your agency, you will need a config.json file to coordinate the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!
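As a sketch, a minimal config file might look like the following (field names follow the options described below; the agency abbreviation, contact email, and org list are placeholder values):

```json
{
    "agency": "LLNL",
    "contact_email": "scraper@example.gov",
    "GitHub": [
        {
            "url": "https://github.com",
            "token": null,
            "public_only": true,
            "orgs": ["llnl"]
        }
    ]
}
```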

# Install Scraper
$ pip install -e .

# Run Scraper with your config file ``config.json``
$ scraper --config config.json

A full example of the resulting code.json file can be found here.
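For reference, a single release entry in a generated code.json file follows the code.gov metadata schema and looks roughly like this (values here are illustrative, not actual output):

```json
{
    "name": "scraper",
    "repositoryURL": "https://github.com/LLNL/scraper",
    "description": "Python library for getting metadata from source code hosting tools",
    "permissions": {
        "usageType": "openSource",
        "licenses": [{"name": "MIT"}]
    },
    "laborHours": 42,
    "tags": ["github"],
    "contact": {"email": "..."}
}
```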

Config File Options

The configuration file is a JSON file that specifies which repository platforms to pull projects from, as well as some settings that can be used to override incomplete or inaccurate data returned by the scraping.

The basic structure is:

    {
        "contact_email": "...", # Used when the contact email cannot be found otherwise

        "agency": "...",        # Your agency abbreviation here
        "organization": "...",  # The organization within the agency
        "permissions": { ... }, # Object containing default values for usageType and exemptionText

        # Platform configurations, described in more detail below
        "GitHub": [ ... ],
        "GitLab": [ ... ],
        "Bitbucket": [ ... ]
    }

GitHub

    "GitHub": [
        {
            "url": "https://github.com",    # GitHub.com or GitHub Enterprise URL to inventory
            "token": null,                  # Private token for accessing this GitHub instance
            "public_only": true,            # Only inventory public repositories

            "orgs": [ ... ],    # List of organizations to inventory
            "repos": [ ... ],   # List of single repositories to inventory
            "exclude": [ ... ]  # List of organizations / repositories to exclude from inventory
        }
    ]

GitLab

    "GitLab": [
        {
            "url": "https://gitlab.com",    # GitLab.com or hosted GitLab instance URL to inventory
            "token": null,                  # Private token for accessing this GitLab instance

            "orgs": [ ... ],    # List of groups to inventory
            "repos": [ ... ],   # List of single repositories to inventory
            "exclude": [ ... ]  # List of groups / repositories to exclude from inventory
        }
    ]

Bitbucket Server

    "Bitbucket": [
        {
            "url": "https://bitbucket.internal",    # Base URL for a Bitbucket Server instance
            "username": "",                         # Username to authenticate with
            "password": "",                         # Password to authenticate with

            "exclude": [ ... ]  # List of projects / repositories to exclude from inventory
        }
    ]


Scraper is released under an MIT license. For more details see the LICENSE file.