You must configure your server to redirect traffic from search bots towards your Arachnid instance.
Arachnid works by inspecting a custom HTTP header, x-original-uri, and then requesting the URL given in that header from the configured hostname.
Optionally, Arachnid can save scraped pages to a folder of your choice, so that subsequent requests to the same resource are faster.
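As a rough illustration of the header-based flow described above, the sketch below shows how a Node.js front end might decide whether to forward a request to Arachnid. The bot pattern, the function name buildArachnidRequest, the request path, and the Arachnid address are all assumptions made for this example and are not part of Arachnid itself. Only the x-original-uri header name comes from this README.

```javascript
// Sketch only: this user-agent pattern is an assumption, not Arachnid's own list.
const BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider/i;

// Hypothetical helper. Returns options suitable for Node's http.request()
// when the visitor looks like a search bot, or null when the request
// should be served normally.
function buildArachnidRequest(userAgent, originalUri, arachnidHost, arachnidPort) {
  if (!BOT_PATTERN.test(userAgent || '')) {
    return null;
  }
  return {
    host: arachnidHost,
    port: arachnidPort,
    path: '/', // assumed path; check how your Arachnid instance is exposed
    method: 'GET',
    // Arachnid reads the page to render from this header.
    headers: { 'x-original-uri': originalUri }
  };
}
```

In practice the same redirect is usually done in the web server's own configuration. The point here is only that the original URI travels in the x-original-uri header.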
For more info, check out our blog post on Arachnid at the Clubjudge blog.
Arachnid expects a config.js file to be present in the project root. It ships with a config.js.example file with all available options. These are:
folder | String
The folder path where Arachnid should save scraped pages. Only has an effect if writeToFile is set to true.
host | String
The hostname to query URLs against. This value must be a valid URL for Arachnid to run correctly.
port | Number
The port where the service should run.
timeout | Number
The maximum time in ms that Arachnid should wait for your page to finish rendering before it returns the current HTML snapshot.
writeToFile | Boolean
Whether Arachnid should write scraped files to the disk. Works in tandem with the folder option.
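To make the options above concrete, here is a minimal sketch of what a config object might look like, along with a small check of the documented types. All values are placeholders, and validateConfig is a hypothetical helper that does not ship with Arachnid. The exact shape of config.js (for example, whether it uses module.exports) should be taken from config.js.example. The URL check assumes a modern Node.js runtime where the global URL class is available.

```javascript
// Placeholder values only; consult config.js.example for the real file layout.
const exampleConfig = {
  folder: '/tmp/arachnid-pages',   // where scraped pages are saved
  host: 'http://www.example.com',  // must be a valid URL
  port: 3000,                      // port the service listens on
  timeout: 5000,                   // max render wait, in milliseconds
  writeToFile: true                // enables saving to `folder`
};

// Hypothetical helper: returns a list of problems found in a config object
// (an empty list means the documented types all match).
function validateConfig(config) {
  const problems = [];
  if (typeof config.host !== 'string') {
    problems.push('host must be a string');
  } else {
    try {
      new URL(config.host); // throws on strings that are not valid URLs
    } catch (err) {
      problems.push('host must be a valid URL');
    }
  }
  if (typeof config.port !== 'number') problems.push('port must be a number');
  if (typeof config.timeout !== 'number') problems.push('timeout must be a number');
  if (typeof config.writeToFile !== 'boolean') problems.push('writeToFile must be a boolean');
  // folder only matters when pages are written to disk.
  if (config.writeToFile === true && typeof config.folder !== 'string') {
    problems.push('folder must be a string when writeToFile is true');
  }
  return problems;
}
```

Calling validateConfig(exampleConfig) returns an empty list, while a host without a scheme (such as www.example.com) is reported as an invalid URL.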
Install Node.js (v0.8.11 works fine).
npm install -g phantomjs
npm install -g forever
npm install
Starting Arachnid spins up an instance of bin/arachnid using Forever. Any CLI arguments are passed along to Forever.
Over time the folder where Arachnid saves its scraped pages can become too large. There's an included utility script that will clear this folder for you.
npm run-script pruneFiles
An example of how to set up this task to run regularly through cron would be:
min hour dayOfMonth month dayOfWeek cd /PATH/TO/ARACHNID/; /PATH/TO/NPM run-script pruneFiles