src-fingerprint

Introduction

src-fingerprint provides an easy way to extract git-related information (namely the shas of all files in a repository) from your hosted version control system.

Its main command, collect, gathers source code fingerprints from a version control system or a local repository. It supports three main VCS:

  • GitHub and GitHub Enterprise
  • GitLab CE and EE
  • Bitbucket
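
To make "file shas" concrete, here is a minimal sketch of how git derives the object id of a file's content. It assumes that the sha values collected by src-fingerprint are standard git blob ids in a SHA-1 repository, which the "type":"blob" field in the output suggests but which this README does not state. The function name git_blob_sha1 is hypothetical and is not part of src-fingerprint.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object id git computes for a blob holding `content`.

    Git hashes the header "blob <size>" plus a NUL byte, followed by the
    raw file bytes. Assumption: the repository uses SHA-1 object ids.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known id in every SHA-1 git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the id depends only on the file content, two identical files share the same sha regardless of the repository or path they live in.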

Installation

Using pre-compiled executables

macOS, using Homebrew

If you're using Homebrew you can add GitGuardian's tap and then install src-fingerprint. Just run the following commands:

brew tap gitguardian/tap
brew install src-fingerprint

Linux packages

Deb and RPM packages are available on Cloudsmith.

Setup instructions:

Windows

Open a PowerShell prompt and run this command:

iwr -useb https://raw.githubusercontent.com/GitGuardian/src-fingerprint/main/scripts/windows-installer.ps1 | iex

The script asks for the installation directory. To install silently, use these commands instead:

iwr -useb https://raw.githubusercontent.com/GitGuardian/src-fingerprint/main/scripts/windows-installer.ps1 -Outfile install.ps1
.\install.ps1 C:\Destination\Dir
rm install.ps1

Note that src-fingerprint requires Unix commands such as bash to be available, so it runs better from a "Git Bash" prompt.

Manual download

You can also download the archives directly from the releases page.

Installing from sources

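You need Go installed and GOBIN in your PATH. Once that is done, run the command:

go get -u github.com/gitguardian/src-fingerprint/cmd/src-fingerprint

On Go 1.18 and later, go get no longer builds or installs executables; use go install with the same package path and an explicit version suffix such as @latest instead.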

Generate My Token

GitHub

  1. Click on your profile picture at the top right of the screen. A dropdown menu will appear and you will be able to access your personal settings by clicking on Settings.
  2. On your profile, go to Developer Settings.
  3. Select Personal Access Tokens.
  4. Click on Generate a new token.
  5. Click the repo box. This is the only scope we need.
  6. Click on Generate token. The token is displayed only once, so make sure you store it in a safe place.

GitLab

  1. Click on your profile picture at the top right of the screen. A dropdown menu will appear and you will be able to access your personal settings by clicking on Preferences.
  2. In the left sidebar, click on Access Tokens.
  3. Click the read_api box. This is the only scope we need. You can set an expiration date for the token if you want more security.
  4. Click on Create personal token. The token is displayed only once, so make sure you store it in a safe place.

Collect my code fingerprints

General information

The output format is selected with the option --export-format, which accepts jsonl, json, gzip-jsonl and gzip-json.
The default format is gzip-jsonl, which minimizes the size of the output file.
The default output filepath is ./fingerprints.jsonl.gz. Use --output to override it.
By default, src-fingerprint processes at most 100 repositories per organization, which matters when collecting fingerprints for a large organization. Use --limit to change this limit; a limit of 0 processes all repositories of the organization. If multiple organizations are passed, the limit applies to each one independently.
There is no default timeout; set one with the option --timeout. Like the limit, the timeout applies to each source independently.
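
As an illustration of how these options combine, here is a small sketch that assembles a collect invocation from Python. It uses only the options named above and assumes they are passed after the collect subcommand, as --provider and --object are in the examples of this README. The helper names build_collect_command and run_collect are hypothetical, and --timeout is left out because its value format is not described here.

```python
import os
import shlex
import subprocess

EXPORT_FORMATS = {"jsonl", "json", "gzip-jsonl", "gzip-json"}


def build_collect_command(provider, objects, export_format="gzip-jsonl",
                          output="./fingerprints.jsonl.gz", limit=100):
    """Build the argument list for a `src-fingerprint collect` run."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {export_format}")
    command = ["src-fingerprint", "-v", "collect", "--provider", provider]
    for name in objects:
        command += ["--object", name]
    # Assumption: these options belong after the `collect` subcommand.
    command += ["--export-format", export_format,
                "--output", output,
                "--limit", str(limit)]
    return command


def run_collect(command, token):
    """Run the command with the token exported as VCS_TOKEN.

    Requires src-fingerprint to be installed and on the PATH.
    """
    env = dict(os.environ, VCS_TOKEN=token)
    return subprocess.run(command, env=env, check=True)


# Plain jsonl output, no repository limit, for one organization.
cmd = build_collect_command("github", ["ORG_1_NAME"], export_format="jsonl",
                            output="./fingerprints.jsonl", limit=0)
print(shlex.join(cmd))
```

Passing the token through the environment rather than on the command line keeps it out of the printed command and the shell history.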

Sample output

Here are a few lines of output in jsonl format:

{"repository_name":"src-fingerprint","private":false,"sha":"a0c16efce5e767f04ba0c6988d121147099a17df","type":"blob","filepath":".env.example","size":"31"}
{"repository_name":"src-fingerprint","private":false,"sha":"d425eb0f8af66203dbeef50c921ea5bff0f2acba","type":"blob","filepath":".github/workflows/tag.yml","size":"882"}
{"repository_name":"src-fingerprint","private":false,"sha":"c7f341033d78474b125dd56d8adaa3f0fc47faf2","type":"blob","filepath":".github/workflows/test.yml","size":"899"}
{"repository_name":"src-fingerprint","private":false,"sha":"f4409d88950abd4585d8938571864726533a7fa5","type":"blob","filepath":".gitignore","size":"356"}
{"repository_name":"src-fingerprint","private":false,"sha":"f733f951ace2e032c270d2f3cf79c2efb8187b5b","type":"blob","filepath":".gitlab-ci.yml","size":"85"}
{"repository_name":"src-fingerprint","private":false,"sha":"d17ae66a017477bc65a2f433bf23d551ffc6bd75","type":"blob","filepath":".golangci.yml","size":"1196"}
{"repository_name":"src-fingerprint","private":false,"sha":"ee08a617cfb1c63c1c55fa4cb15e8bac0095346f","type":"blob","filepath":".goreleaser.yml","size":"2127"}
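
Output in this shape is straightforward to post-process. The following sketch reads a jsonl or gzip-jsonl export and counts files and bytes per repository. It assumes every line carries the fields shown in the sample above, and the function names read_fingerprints and summarize are hypothetical.

```python
import gzip
import json


def read_fingerprints(path):
    """Yield one record (a dict) per non-empty line of a jsonl export."""
    # Pick the gzip reader for the default gzip-jsonl output.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def summarize(records):
    """Return {repository_name: {"files": count, "bytes": total_size}}."""
    summary = {}
    for record in records:
        stats = summary.setdefault(record["repository_name"],
                                   {"files": 0, "bytes": 0})
        stats["files"] += 1
        # "size" is serialized as a string in the sample output.
        stats["bytes"] += int(record["size"])
    return summary
```

For example, summarize(read_fingerprints("./fingerprints.jsonl.gz")) returns one entry per repository found in the default output file.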

Default behavior

By default, src-fingerprint excludes forked repositories from the fingerprints computation. For the GitHub provider, archived repositories and public repositories are also excluded by default. Use the flags --include-forked-repos, --include-archived-repos or --include-public-repos to change this behavior.
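
The default filtering described above can be modeled as a small decision function. This is only a sketch of the documented behavior, not the tool's implementation: the Repo class, the is_collected function and the parameter names are hypothetical, and it assumes the archived and public exclusions apply to the GitHub provider only, as the paragraph above states.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    name: str
    fork: bool = False
    archived: bool = False
    private: bool = True


def is_collected(repo, provider, include_forked=False,
                 include_archived=False, include_public=False):
    """Return True if the repository would be kept under the documented defaults."""
    # Forks are skipped for every provider unless explicitly included.
    if repo.fork and not include_forked:
        return False
    # Assumption: archived and public repositories are skipped for GitHub only.
    if provider == "github":
        if repo.archived and not include_archived:
            return False
        if not repo.private and not include_public:
            return False
    return True


print(is_collected(Repo("internal-api"), "github"))
print(is_collected(Repo("public-docs", private=False), "github"))
```

With the defaults, the private repository is kept and the public one is skipped; passing include_public=True mirrors the effect of --include-public-repos.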

All the following examples assume that the user can clone repositories using an HTTP URL with basic authentication. If this is not possible in the user's organization, src-fingerprint supports SSH cloning through the dedicated option --ssh-cloning. Note that this option is not the standard configuration of the tool but a workaround for this kind of edge case. In particular, it can cause issues when the permissions of the token used for API-based repository listing differ from those of the SSH keys used to clone the repositories.

GitHub

  1. Export all fingerprints from the private repositories of GitHub organizations to the default path ./fingerprints.jsonl.gz with logs:
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider github --object ORG_1_NAME --object ORG_2_NAME
  2. Export all fingerprints of every repository the user can access to the default path ./fingerprints.jsonl.gz with logs:
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider github --include-public-repos --include-forked-repos --include-archived-repos

GitLab

  1. Export all fingerprints from the private repositories of a GitLab group to the default path ./fingerprints.jsonl.gz with logs:
    Note: If you are targeting a self-hosted GitLab instance, use the option --provider-url to specify its URL, and do not forget to include the scheme.
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider gitlab --object "GitGuardian-dev-group"
  2. Export all fingerprints of every project the user can access to the default path ./fingerprints.jsonl.gz with logs:
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider gitlab --include-forked-repos

Bitbucket server (formerly Atlassian Stash)

  1. Export all fingerprints from a Bitbucket project containing private repositories to the default path ./fingerprints.jsonl.gz with logs:
    Note: If you are targeting a self-hosted Bitbucket instance, use the option --provider-url to specify its URL, and do not forget to include the scheme.
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider bitbucket --object "GitGuardian Project"
  2. Export all fingerprints of every repository the user can access to the default path ./fingerprints.jsonl.gz with logs:
env VCS_TOKEN="<token>" src-fingerprint -v collect --provider bitbucket

Repository

The repository provider processes individual repositories, each given as a git clone URL or a local path.

  1. SSH cloning
src-fingerprint collect -p repository -u 'git@github.com:GitGuardian/gg-shield.git'
  2. HTTP cloning with basic authentication
src-fingerprint collect -p repository -u 'https://user:password@github.com/GitGuardian/gg-shield.git'
  3. HTTP cloning without basic authentication
src-fingerprint collect -p repository -u 'https://github.com/GitGuardian/gg-shield.git'
  4. Repositories in multiple local directories
src-fingerprint collect -p repository -u /projects/gitlab/src-fingerprint -u /projects/gitlab/internal-api
  5. Repository in the current directory
src-fingerprint collect -p repository -u .

Performance and memory usage

By default, src-fingerprint processes objects (--object/-u) one at a time. When an object (e.g. a GitHub organization) contains multiple repositories, they are processed in parallel by multiple cloners. The number of cloners is configurable with --cloners; adding more cloners increases the memory usage of src-fingerprint.

When extracting fingerprints from multiple sources (e.g. with multiple --object values), use the option --pool to set the number of workers that process objects in parallel. Each worker has its own --cloners cloners, so be cautious when increasing both --cloners and --pool: memory usage may grow drastically.
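
Since each worker gets its own set of cloners, the number of cloners that can run at once grows as the product of the two options. The short sketch below shows that arithmetic. The function name peak_cloners and the sample values are hypothetical, and the actual memory used per cloner depends on the repositories being cloned.

```python
def peak_cloners(pool: int, cloners: int) -> int:
    """Upper bound on cloners running at once: one set of cloners per worker."""
    if pool < 1 or cloners < 1:
        raise ValueError("pool and cloners must both be at least 1")
    return pool * cloners


# Hypothetical settings, chosen only to show how quickly the product grows.
for pool, cloners in [(1, 4), (2, 4), (4, 8)]:
    print(f"--pool {pool} --cloners {cloners}: "
          f"up to {peak_cloners(pool, cloners)} concurrent cloners")
```

Doubling both options therefore quadruples the worst-case number of simultaneous clones, which is why raising them together deserves care.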

License

GitGuardian src-fingerprint is MIT licensed.