This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Advanced traversal policies

Gene Hazan edited this page Mar 23, 2017 · 4 revisions

A crawler request has a policy that dictates how the request is processed. The policy is composed of a fetch behavior, a freshness behavior, and a visitor map. At a high level, fetch and freshness determine which resources to fetch, how to fetch them, and whether or not to process them, while the visitor map determines which resources to visit after the current resource. So, for example, a policy of

  originStorage+match and { owner: self }

applied to a GitHub org entity would try to fetch the org from the origin (GitHub) but only if the crawler's cached copy is different from the current GitHub copy. That's the originStorage part. It would then process the resource only if it is not fresh, that is, if the current content does NOT match what the crawler has previously seen. That's the match part. Finally, if the org is processed, the crawler uses the visitor map portion to queue up the org owner for further processing but does not traverse any of the owner's referenced resources (the { owner: self } part).

Let's break that down into more detail.

Fetch

The fetch behavior defines where to get the resource for processing as follows:

  • storageOnly -- Only use stored content. Skip this resource if we don't already have it
  • originStorage -- Origin rules. Consider storage first and use it if it matches origin. Otherwise, get content from origin
  • storageOriginIfMissing -- Storage rules. Only if content is missing from storage, get content from origin
  • mutables -- Use originStorage if the resource is deemed mutable, storageOriginIfMissing if immutable
  • originOnly -- Always get content from original source
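
The fetch behaviors above can be read as a small decision table. The following is a minimal sketch under stated assumptions, not the crawler's actual code: the function name `chooseSource` and the inputs `hasStored`, `storedMatchesOrigin`, and `isMutable` are hypothetical names introduced for illustration, and the sketch assumes the comparison with origin has already been made.

```javascript
// Hypothetical sketch: decide where content comes from for a given fetch behavior.
// Returns 'storage', 'origin', or 'skip'.
function chooseSource(fetch, state) {
  const { hasStored, storedMatchesOrigin, isMutable } = state;
  switch (fetch) {
    case 'storageOnly':
      // Only use stored content; skip the resource if we do not have it.
      return hasStored ? 'storage' : 'skip';
    case 'originStorage':
      // Use storage only when it matches origin; otherwise go to origin.
      return hasStored && storedMatchesOrigin ? 'storage' : 'origin';
    case 'storageOriginIfMissing':
      // Go to origin only when storage has nothing.
      return hasStored ? 'storage' : 'origin';
    case 'mutables':
      // Delegate based on whether the resource is deemed mutable.
      return chooseSource(isMutable ? 'originStorage' : 'storageOriginIfMissing', state);
    case 'originOnly':
      return 'origin';
    default:
      throw new Error(`Unknown fetch behavior: ${fetch}`);
  }
}
```

For example, under these assumptions an immutable resource that is already stored is read from storage with `mutables`, even if origin has since changed.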

Freshness

The freshness behavior defines how the age of the resource, relative to what we have seen/done before, factors into whether or not to process the resource.

  • always -- process the resource no matter what
  • match -- process the resource if origin and stored docs do NOT match
  • version -- process the resource if the current stored doc's processing version is behind current
  • matchOrVersion -- process the resource if stored and origin do not match or the stored processed version is out of date
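
The freshness rules above reduce to a boolean check. This is a rough sketch rather than the crawler's implementation: `shouldProcess`, `matchesOrigin`, `storedVersion`, and `currentVersion` are hypothetical names, and processing versions are assumed to be comparable numbers.

```javascript
// Hypothetical sketch: decide whether a fetched resource should be processed.
function shouldProcess(freshness, { matchesOrigin, storedVersion, currentVersion }) {
  // Assumption: a stored doc is out of date when its processing version is lower.
  const outOfDate = storedVersion < currentVersion;
  switch (freshness) {
    case 'always':
      return true;
    case 'match':
      return !matchesOrigin;
    case 'version':
      return outOfDate;
    case 'matchOrVersion':
      return !matchesOrigin || outOfDate;
    default:
      throw new Error(`Unknown freshness behavior: ${freshness}`);
  }
}
```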

Pre-defined combinations

For convenience, the crawler has a set of pre-defined fetch and freshness combinations that cover popular situations. The most popular combinations are listed here.

default -- Get anything that is out of date and assume that the immutables are indeed immutable.

  • fetch = mutables
  • freshness = match

reprocessAndUpdate -- Like default but also process a resource if it was last processed by an older version of the crawler. This is great for the equivalent of schema-update scenarios.

  • fetch = mutables
  • freshness = matchOrVersion

always -- Complete refresh. Go to origin for all content and process every encountered resource no matter what.

  • fetch = originOnly
  • freshness = always
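
Written as data, the three combinations listed above might look like the following. The `policies` table and the `getPolicy` helper are hypothetical names for illustration only; the crawler may represent and look up its pre-defined policies differently.

```javascript
// Hypothetical lookup table for the pre-defined combinations described above.
const policies = {
  default: { fetch: 'mutables', freshness: 'match' },
  reprocessAndUpdate: { fetch: 'mutables', freshness: 'matchOrVersion' },
  always: { fetch: 'originOnly', freshness: 'always' }
};

// Return a copy of the named combination, or throw if the name is not known.
function getPolicy(name) {
  if (!Object.prototype.hasOwnProperty.call(policies, name)) {
    throw new Error(`Unknown policy: ${name}`);
  }
  return { ...policies[name] };
}
```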

Visitor map

The visitor map is literally an object graph whose properties are named after the properties of the GitHub resources and whose values represent the referenced resources. So, for example, the following is a map for an org that includes its repos and their commits but nothing else.

{
  org: {
    repos: {
      commits: self
    }
  }
}

As the crawler works through requests, it simply follows the map. If a property is in the map, that resource is traversed. If not, the property is ignored. In many ways this is like GraphQL, and in the future we may merge this notion with the GitHub GraphQL approach.

A map also has a current location or a path. By default the path is / or the root of the map. As the crawler traverses, it adds to the path so it always knows where it is.
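
To make the map-and-path idea concrete, here is a small sketch of following a map from a given path. It is not the crawler's traversal code: `SELF` is a hypothetical stand-in for the `self` marker, and `getNode` and `nextSteps` are names invented for this illustration.

```javascript
// Hypothetical marker standing in for `self` in the map shown earlier.
const SELF = Symbol('self');

// The example map: an org, its repos, and their commits, but nothing else.
const map = { org: { repos: { commits: SELF } } };

// Return the node of the map at a path such as '/org/repos', or undefined if absent.
function getNode(map, path) {
  let node = map;
  for (const segment of path.split('/').filter(Boolean)) {
    if (node === null || typeof node !== 'object') return undefined;
    node = node[segment];
  }
  return node;
}

// List the properties to traverse from `path`, each with its extended path.
// Properties that are not in the map are simply never returned, so they are ignored.
function nextSteps(map, path = '/') {
  const node = getNode(map, path);
  if (node === undefined || node === null || typeof node !== 'object') return [];
  const base = path.endsWith('/') ? path : path + '/';
  return Object.keys(node).map(name => ({ name, path: base + name }));
}
```

Starting at the root, `nextSteps(map)` yields only `org`; at `/org/repos` it yields `commits` with the path `/org/repos/commits`, and at that leaf there is nothing further to traverse.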

Policy specs

Policy specs bring all this together in one concise form. While a policy can be written out as a JSON object, the spec form is much more convenient. For example, the following is a policy spec that uses the reprocessAndUpdate behavior on the repo map from the (mythical) minimal scenario starting at commits.

  reprocessAndUpdate:minimal/repo@/commits

The general form for a spec is as follows:

<policyName>[:mapSpec]
  mapSpec :: [scenario/]mapName[@p/a/t/h]

where

  • policyName identifies one of the well-known, pre-defined combinations of fetch and freshness behaviors
  • mapSpec optionally identifies the visitor map to use. If omitted, the map named after the request's type is used.
  • If supplied, the mapSpec identifies the map (within an optional scenario) and a path-based starting point in the map.
  • A scenario is a set of maps that go together to cover a particular situation. For example, say you wanted to only traverse people. You would define a scenario with maps to all the people in all the GitHub entities. Then you could crawl an org with that map and process all the people involved in the org.
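
The general form above can be split into its parts with a few string operations. This is a sketch under stated assumptions, not the crawler's own parser: `parsePolicySpec` and the shape of the returned object are hypothetical, and a `mapName` of `null` is used here to mean "fall back to the map named after the request's type".

```javascript
// Hypothetical parser for '<policyName>[:mapSpec]',
// where mapSpec is '[scenario/]mapName[@p/a/t/h]'.
function parsePolicySpec(spec) {
  const colon = spec.indexOf(':');
  const policyName = colon === -1 ? spec : spec.slice(0, colon);
  if (!policyName) throw new Error(`Missing policy name in spec: ${spec}`);

  // Defaults: no scenario, no explicit map, and the root path.
  const result = { policyName, scenario: null, mapName: null, path: '/' };
  if (colon === -1) return result;

  // Split off the optional '@path' starting point.
  const mapSpec = spec.slice(colon + 1);
  const at = mapSpec.indexOf('@');
  const mapPart = at === -1 ? mapSpec : mapSpec.slice(0, at);
  if (at !== -1) result.path = mapSpec.slice(at + 1) || '/';

  // Split the optional 'scenario/' prefix from the map name.
  const slash = mapPart.indexOf('/');
  if (slash === -1) {
    result.mapName = mapPart;
  } else {
    result.scenario = mapPart.slice(0, slash);
    result.mapName = mapPart.slice(slash + 1);
  }
  return result;
}
```

Applied to `reprocessAndUpdate:minimal/repo@/commits`, the sketch yields the policy name `reprocessAndUpdate`, the scenario `minimal`, the map `repo`, and the path `/commits`.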

This arrangement allows one to apply an overall policy (e.g., freshness and fetching) to a traversal of the object graph. The traversal is driven by the scenario, such as Initialization or Update, where the graph is cut differently to suit the need.
