Heritrix3 Useful Scripts

Useful Heritrix 3 (H3) scripts to run in the scripting console.

get a list of seeds from the seeds file on disk

//groovy
//this is not the original list of seeds. it's just the contents of the seeds file, filtered.
appCtx.getBean("seeds").textSource.file.readLines().findAll{l -> l =~ /^http/}.unique().each{seedStr ->
 rawOut.println(seedStr)
}

print out available variables from the scripting context

//groovy
this.binding.getVariables().each{ rawOut.println("${it.key}=\n ${it.value}\n") }

printProps(obj) and using appCtx.getData()

What the bash ls command is to the working directory, this method is to a Java object. It uses getProperties, the introspection shortcut provided by Groovy.

Putting printProps in appCtx.getData() means you don't have to include the whole printProps definition in later scripts, which helps keep them short and manageable. appCtx.getData() is a java.util.Map; there is more information about it from Groovy (detailed) and from IBM (concise).

//Groovy
appCtxData = appCtx.getData()
appCtxData.printProps = { rawOut, obj ->
  rawOut.println "#properties"
  // getProperties is a groovy introspective shortcut. it returns a map
  obj.properties.each{ prop ->
    // prop is a Map.Entry
    rawOut.println "\n"+ prop
    try{ // some things don't like you to get their class. ignore those.
      rawOut.println "TYPE: "+ prop.value.class.name
    }catch(Exception e){}
  }
  rawOut.println "\n\n#methods"
  try {
    obj.class.methods.each{ method ->
      rawOut.println "\n${method.name} ${method.parameterTypes}: ${method.returnType}"
    }
  }catch(Exception e){}
}

// above this line need not be included in later script console sessions
def printProps(x) { appCtx.getData().printProps(rawOut, x) }

// example: see what can be accessed on the frontier
printProps(job.crawlController.frontier)

changing a regex decide rule

It's a wise idea to pause the crawl while modifying collections it depends on.
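
Pausing can be done from the job page in the UI, or (a minimal sketch, assuming the standard crawlController bean) from the console itself:

//Groovy
job.crawlController.requestCrawlPause()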

//Groovy
pat = ~/your-regex-here/
dec = org.archive.modules.deciderules.DecideResult.REJECT
regexRuleObj = appCtx.getBean("scope").rules.find{ it.class == org.archive.modules.deciderules.MatchesListRegexDecideRule }
regexRuleObj.decision = dec
rawOut.println("decision: "+ regexRuleObj.decision)
regexRuleObj.regexList.add(pat)
rawOut.println("regexList: "+ regexRuleObj.regexList)

adding an exclusion surt

//Groovy
rule = appCtx.getBean("scope").rules.find{ rule ->
  rule.class == org.archive.modules.deciderules.surt.SurtPrefixedDecideRule &&
  rule.decision == org.archive.modules.deciderules.DecideResult.REJECT
}

theSurt = "http://(org,northcountrygazette," // ncg is cranky. avoid.
rawOut.print( "result of adding theSurt: ")
rawOut.println( rule.surtPrefixes.considerAsAddDirective(theSurt) )
rawOut.println()

//dump the list of surts excluded to check results
rule.surtPrefixes.each{ rawOut.println(it) }

Adding an exclusion surt for a list of URIs

rule = appCtx.getBean("scope").rules.find{ rule ->
  rule.class == org.archive.modules.deciderules.surt.SurtPrefixedDecideRule &&
  rule.decision == org.archive.modules.deciderules.DecideResult.REJECT
}

def stringList = [ "www.example.com", "example.net", "foo.org" ]


stringList.each() { rawOut.println( rule.surtPrefixes.considerAsAddDirective("${it}")) }
rule.surtPrefixes.each{ rawOut.println(it) }

take a gander at the decide rules

//Groovy
def printProps(obj){
  // getProperties is a groovy introspective shortcut. it returns a map
  obj.properties.each{ prop ->
    // prop is a Map.Entry
    rawOut.println "\n"+ prop
    try{ // some things don't like you to get their class. ignore those.
      rawOut.println "TYPE: "+ prop.value.class.name
    }catch(Exception e){}
  }
}
 
// loop through the rules
counter = 0
appCtx.getBean("scope").rules.each { rule ->
  rawOut.println("\n###############${counter++}\n")
  printProps( rule )
}

check your metadata

appCtx.getBean("metadata").keyedProperties.each{ k, v ->
  rawOut.println( k)
  rawOut.println(" $v\n")
}
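
Individual values can also be changed from the console. A minimal sketch, assuming the standard metadata bean and using operatorContactUrl as an example (property names depend on your configuration):

//Groovy
appCtx.getBean("metadata").operatorContactUrl = "http://example.com/crawl-operator"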

add a sheet forcing many queues into 'retired' state

// force-retire all .org queues
mgr = appCtx.getBean("sheetOverlaysManager")
mgr.putSheetOverlay("forceRetire","disposition.forceRetire",true)
mgr.addSurtAssociation("http://(org,","forceRetire")

create a sheet for forcing queue assignment, and associate two surts with it

mgr = appCtx.getBean("sheetOverlaysManager");
newSheetName = "urbanOrgAndTaxpolicycenterOrgSingleQueue"
mgr.putSheetOverlay(newSheetName, "queueAssignmentPolicy.forceQueueAssignment", "urbanorg_and_taxpolicycenterorg");
mgr.addSurtAssociation("http://(org,urban,", newSheetName);
mgr.addSurtAssociation("http://(org,taxpolicycenter,", newSheetName);

//check your results
mgr.sheetNamesBySurt.each{ rawOut.println(it) }
rawOut.println(mgr.sheetNamesBySurt.size())

add a decide rule sheet association

force queue assignment based on the hop path

mgr = appCtx.getBean("sheetOverlaysManager");
newSheetName = "speculativeSingleQueue"
dr = new org.archive.crawler.spring.DecideRuledSheetAssociation()
hpreg = new org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule()
hpreg.setRegex(~/.*X$/)
dr.setRules(hpreg)
dr.setTargetSheetNames([newSheetName])
dr.setBeanName("forceSpeculativeQueueAssociation")
mgr.putSheetOverlay(newSheetName, "queueAssignmentPolicy.forceQueueAssignment", "speculative-queue");
mgr.addRuleAssociation(dr)

similar xml:

 <bean class='org.archive.crawler.spring.DecideRuledSheetAssociation'>
  <property name='rules'>
    <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
     <property name="regex" value=".*X$" />
    </bean>
  </property>
  <property name='targetSheetNames'>
   <list>
    <value>speculativeSingleQueue</value>
   </list>
  </property>
 </bean>
 <bean id='speculativeSingleQueue' class='org.archive.spring.Sheet'>
  <property name='map'>
   <map>
    <entry key='queueAssignmentPolicy.forceQueueAssignment' value='speculative-queue' />
   </map>
  </property>
 </bean>

apply sheet (ignoreRobots) to a list of URIs as strings, taken from the seeds.txt file

//Groovy
sheetName = "ignoreRobots"
mgr = appCtx.getBean("sheetOverlaysManager")
 
//test that you got the name right
if ( ! mgr.sheetsByName.containsKey( sheetName ) ) {
 rawOut.println( "sheet $sheetName does not exist. your choices are:" )
 mgr.sheetsByName.keySet().each{ rawOut.println(it) }
 return;
}
 
//look for lines in the seeds.txt file starting with http
appCtx.getBean("seeds").textSource.file.readLines().findAll{ l ->
 l =~ /^http/
}.collect{ uriStr ->
 //turn the domain into a surt and remove www
 org.archive.util.SurtPrefixSet.prefixFromPlainForceHttp("http://"+ new org.apache.commons.httpclient.URI(uriStr).host).replaceAll( /www,$/, "" )
}.unique().each{ seedSurt ->
 rawOut.println("associating $seedSurt")
 try{
  //ignore robots on the domain
  mgr.addSurtAssociation( seedSurt, sheetName)
 } catch (Exception e) {
  println("caught $e on $seedSurt")
 }
}
//review the change
mgr.sheetNamesBySurt.each{ k, v -> rawOut.println("$k\n $v\n") }

reconsiderRetiredQueues

If frontier-related settings have changed (for instance, the budget), this can bring queues out of retirement.

appCtx.getBean("frontier").reconsiderRetiredQueues()

run arbitrary command on the machine (BE CAREFUL WITH THIS, OBVIOUSLY)

command = "ls";
proc = Runtime.getRuntime().exec(command);

stdout = new BufferedReader(new InputStreamReader(proc.getInputStream()));
while ((line = stdout.readLine()) != null) {
    rawOut.println("stdout: " + line);
}

stderr = new BufferedReader(new InputStreamReader(proc.getErrorStream()));
while ((line = stderr.readLine()) != null) {
    rawOut.println("stderr: " + line);
}

There are groovier ways to do it, starting with the basic:

rawOut.println( "ls".execute().text )

list pending urls

// groovy
// see org.archive.crawler.frontier.BdbMultipleWorkQueues.forAllPendingDo()

import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.OperationStatus;

MAX_URLS_TO_LIST = 1000

pendingUris = job.crawlController.frontier.pendingUris

rawOut.println "(this seems to be more of a ceiling) pendingUris.pendingUrisDB.count()=" + pendingUris.pendingUrisDB.count()
rawOut.println()

cursor = pendingUris.pendingUrisDB.openCursor(null, null);
key = new DatabaseEntry();
value = new DatabaseEntry();
count = 0;

while (cursor.getNext(key, value, null) == OperationStatus.SUCCESS && count < MAX_URLS_TO_LIST) {
    if (value.getData().length == 0) {
        continue;
    }
    curi = pendingUris.crawlUriBinding.entryToObject(value);
    rawOut.println curi
    count++
}
cursor.close(); 

rawOut.println()
rawOut.println count + " pending urls listed"

Garbage collector (GC) collection info

// beanshell
for (gcMxBean: java.lang.management.ManagementFactory.getGarbageCollectorMXBeans()) {
    rawOut.println(gcMxBean.getName() + " pools=" + java.util.Arrays.toString(gcMxBean.getMemoryPoolNames()) + " count=" + gcMxBean.getCollectionCount() + " time=" + gcMxBean.getCollectionTime());
}

Retrieving history for URI

//Groovy
uri="http://example.com/"
loadProcessor = appCtx.getBean("persistLoadProcessor") //this name depends on config
key = loadProcessor.persistKeyFor(uri)
history = loadProcessor.store.get(key)
history.get(org.archive.modules.recrawl.RecrawlAttributeConstants.A_FETCH_HISTORY).each{historyStr ->
    rawOut.println(historyStr)
}

dump surts

// beanshell

// permit access to protected variable surtPrefixes
setAccessibility(true);

// assumes SurtPrefixedDecideRule is second rule in scope; adjust number for nth rule
rawOut.print(appCtx.getBean("scope").rules.get(1).surtPrefixes);
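
The same thing in Groovy, finding the first SurtPrefixedDecideRule by class instead of assuming its position (mirroring the "adding an exclusion surt" snippet above):

//Groovy
rule = appCtx.getBean("scope").rules.find{ it.class == org.archive.modules.deciderules.surt.SurtPrefixedDecideRule }
rule.surtPrefixes.each{ rawOut.println(it) }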

Add cookie to running crawl

// Groovy
cookieStore = appCtx.getBean("cookieStore");

// Create a new Cookie with its Name and Value
epochSeconds = Long.parseLong("2094586238"); // Expiration in 2036
expirationDate = (epochSeconds >= 0 ? new Date(epochSeconds * 1000) : null);
cookie = new org.apache.http.impl.cookie.BasicClientCookie("COOKIE_NAME", "COOKIE_VALUE");
cookie.setDomain("COOKIE_DOMAIN");
cookie.setExpiryDate(expirationDate);
cookie.setSecure(true);
cookie.setPath("/");

rawOut.println(cookie);
cookieStore.addCookie(cookie);

// Print all cookies
cookies = appCtx.getBean("cookieStore").getCookies().toArray();
cookies.each{ rawOut.println("${it}\n") }

Delete urls matching regex from frontier

// groovy
// deleteURIs(queueRegex, uriRegex): the first regex selects which queues to look in, the second selects the URIs to delete
count = job.crawlController.frontier.deleteURIs(".*", "^http://de.wikipedia.org/.*")
rawOut.println count + " uris deleted from frontier"

Force a site to crawl through a proxy

//groovy
mgr = appCtx.getBean("sheetOverlaysManager");
newSheetName = "proxyFetch"
mgr.putSheetOverlay(newSheetName, "fetchHttp.httpProxyHost", "my.proxy.host"); //hostname or ip
mgr.putSheetOverlay(newSheetName, "fetchHttp.httpProxyPort", "8443"); //port

mgr.addSurtAssociation("http://(com,timgriffinforcongress,", newSheetName);

//check your results
mgr.sheetNamesBySurt.each{ rawOut.println(it) }
rawOut.println(mgr.sheetNamesBySurt.size())

Force wake all snoozed queues

//Groovy
 
countBefore = job.crawlController.frontier.getSnoozedCount()


job.crawlController.frontier.forceWakeQueues()
countAfter = job.crawlController.frontier.getSnoozedCount()

rawOut.println("Snoozed queues.")
rawOut.println(" - Before: " + countBefore)
rawOut.println(" - After: " + countAfter)

Add a DecideRule to scope rejecting the second speculative hop

Pause the crawl before doing this.

//Groovy

scope = appCtx.getBean("scope")
hpm = new org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule()
hpm.regex = ~/.*X.*X.*/
hpm.decision = org.archive.modules.deciderules.DecideResult.REJECT
rawOut.println scope.rules.add(hpm)
scope.rules.each{rawOut.println it}

Retire Queues Matching Regex

Set up a sheet association by regex, applying a small budget to matching domains.

//Groovy

dr = new org.archive.crawler.spring.DecideRuledSheetAssociation()
matchreg = new org.archive.modules.deciderules.MatchesRegexDecideRule()
matchreg.setRegex(~/^https?:\/\/([^\/])*((discount)|(cheap))[^\/]*.*$/)
dr.setRules(matchreg)
dr.setBeanName("cheap-discount-domain-small-budget")
dr.setTargetSheetNames(["smallBudget"])

//smallBudget bean exists by default
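
As written, the association above is only constructed; as in the hop-path example earlier, it presumably also needs to be registered with the sheetOverlaysManager before it takes effect:

//Groovy
mgr = appCtx.getBean("sheetOverlaysManager")
mgr.addRuleAssociation(dr)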

Determine which multi-machine crawler is responsible for given URI

Interpret the output as an index into the divert map.

//Groovy
import org.archive.modules.CrawlURI
import org.archive.net.UURIFactory
uris=['http://foo.com','http://bar.org']
mapper = appCtx.getBean("hashCrawlMapper")
uris.each{ uri -> rawOut.println(uri + ":" + mapper.map(new CrawlURI(UURIFactory.getInstance(uri)))) }

Print all sheet associations and all sheet properties

This enumerates all the surt-sheet associations, then all the sheets and their settings.

//Groovy
mgr = appCtx.getBean("sheetOverlaysManager")

//review the associations
rawOut.println("------------------");
rawOut.println("SHEET ASSOCIATIONS");
rawOut.println("------------------\n");
mgr.sheetNamesBySurt.each{ k, v -> 
    rawOut.println("$k\n $v\n")
}

// List the sheets:
rawOut.println("------")
rawOut.println("SHEETS")
rawOut.println("------\n")
mgr.getSheetsByName().each{ name, sheet ->
    rawOut.println("$name")
    sheet.getMap().each{ k, v -> rawOut.println("$k = $v") }
    rawOut.println("")
}
