Operating

Since the crawler has a very simple execution cycle, crawler operators have only a few things they can do: start and stop the crawler's processing, queue new requests, and manage deadletters. The crawler service has a REST API that you can call from either the crawler command line or the crawler dashboard. For simplicity, here we will use the command line. The browser-based dashboard has more functionality and a richer user experience.

Having cloned the ghcrawler-cli repo, run the command line using

  node bin/cc -i [-s <server url>]
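
For example, if the crawler service is running on your own machine, you might start an interactive session with something like the following (the URL is only an illustrative assumption; use whatever address your service is actually listening on)

  node bin/cc -i -s http://localhost:3000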

Tokens

Before you can do anything meaningful you need to add some GitHub API tokens to the mix. Without tokens, GitHub will limit you to 60 API calls per hour, which is not enough to do anything interesting. See the GitHub documentation on how to get tokens.

Once you have tokens with enough permissions to access all the resources of interest, add them to the crawler either using the command line or via the CRAWLER_GITHUB_TOKENS Infrastructure settings.

From the command line set the tokens to use by running something like this

  > tokens 44323cb32f3ef#admin 098ad687ef654bc#public

where you replace the dummy token values with your own and set the traits that identify the permissions of each token. The general form of the tokens parameter on the command line is a space-separated list of <token>#<trait>[,<trait>]* where the possible traits are admin, public, and private. One token can have many traits. The crawler matches the request being processed with the available tokens based on the traits needed and offered. Note that in environment variables, the tokens in the list are delimited by semi-colons (;).
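
For example, the equivalent environment variable setting with the same dummy token values would look something like this in a Bash-style shell (adjust the syntax for your own shell or hosting environment)

  export CRAWLER_GITHUB_TOKENS="44323cb32f3ef#admin;098ad687ef654bc#public"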

Each token gives you 5000 requests/hour. That is enough for a modest set of repos, though the initial scan may take a while. Note that you should have at least one admin token if you hope to get information on GitHub collaborators, teams, and traffic. An admin token is one associated with an org owner and with all the relevant token scopes granted.

Starting and stopping

Once the crawler service is running, you can start it processing by setting its crawler/count property to a value >0. In the command line, simply do start [n] where n is the number of concurrent loops to run. Start out with one loop until you are comfortable that things are working. The crawler spends most of its time waiting on I/O to GitHub and to the queue and storage infrastructure, so you can bump up the value until, under normal circumstances, the crawler's Node process is maxing out one CPU core.

You can stop the crawler by setting the crawler/count property to 0 or using the command line's stop command. All currently running requests will complete and the crawler will report the termination of each loop previously started.
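
A typical command line session might look like the following, where the loop count of 4 is just an illustration to be tuned for your own setup

  > start 1
  > start 4
  > stop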

Recall that changing the crawler count for one crawler changes it for all. That means starting one starts them all and, similarly, stopping one stops them all.

You can tell that the crawler is running either by watching the dashboard or by observing the log messages written to the shell console.

Queuing requests

Queuing requests can be done from either the command line or the dashboard. All requests have three parts:

  • type -- The kind of processing to be done on the document. The values here are typical entity types found in the GitHub documentation, for example, commit or repo.
  • url -- The URL of the resource to process. Note that this must be the API URL, not the URL you would normally use to visit the resource in a browser. So https://api.github.com/repos/microsoft/ghcrawler is a valid resource to queue.
  • policy -- The policy to use when deciding whether or not to process the resource and if/how to process any resources referenced from here.

From the command line the simplest thing you can queue is an org

> queue contoso

or a repo

> queue microsoft/ghcrawler

These commands infer the type (based on whether there is a / in the given value) and use the default policy. From the command line you are currently not able to specify a policy or other types of entity to process. Using the dashboard you can specify both a policy and additional entity types by constructing a request object in JSON. For example, the following is equivalent to the second command line above.

{
  "type": "repo",
  "url": "https://api.github.com/repos/microsoft/ghcrawler",
  "policy": "default"
}

Using this you can queue users, teams, commits, ... and vary the policy. Most of the time however you should stick to queuing root level GitHub entities (e.g., org, repo, user, team). Other entities need more context and advanced understanding of how processing happens.
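
For instance, a request for a single user might look like the following. The user type string and the octocat login here are illustrative assumptions; check the crawler's processor documentation for the exact type names it accepts.

{
  "type": "user",
  "url": "https://api.github.com/users/octocat",
  "policy": "default"
}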

Similarly, for most cases the default policy is enough (that's why it's the default!). If you want to do fancy things like modify how the GitHub object graph is traversed and exclude various types of links, you can specify an Advanced Policy.

Deadletter management

The crawler is quite resilient to network and processing errors. Sometimes, however, it encounters requests that just cannot be processed: the entity has been deleted, GitHub is down, there is a bug, ... When an error occurs, the crawler requeues the request for retrying at a future time. Sometimes it varies its approach. For example, if a fetch error happens, it requeues the request indicating that a token with elevated permissions should be used. Requests will only be requeued so many times. If, in the end, the request cannot be processed, the crawler annotates it with failure information and puts it into the deadletter box. This allows you to examine the failures and optionally requeue or delete the failed requests.

The crawler service has a REST API for counting, listing, requeuing and deleting deadletters. This is exposed in both the crawler command line app and the browser-based dashboard. Your best bet for dealing with deadletters is to use the dashboard. There you can browse, filter, sort and view the deadletters, and then select and requeue or delete as desired. A good strategy is to go to the dashboard, select all and requeue. That will retry all deadletters. This typically clears out the ones that were transient failures. Wait for those to finish processing (watch the processing graphs). Then sort through the remaining deadletters.

The key properties to look at are:

  • extra -- This is extra context information that typically includes the reason for the failure.
  • meta -- This object often has the request status code from GitHub and a shortened version of the token used in the GitHub request. This is useful for tracking down permissions issues.
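
As a rough illustration, a deadletter entry might look something like the following. Only the extra and meta properties are described above; the other field names and values here are illustrative assumptions rather than the crawler's actual schema.

{
  "type": "repo",
  "url": "https://api.github.com/repos/contoso/retired-repo",
  "extra": {
    "reason": "fetch returned 404"
  },
  "meta": {
    "status": 404,
    "token": "44323cb"
  }
}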