
AusDTO/disco_crawl


Overview

This application crawls all gov.au domains (excluding states and territories) and stores the resources found in an index. The intent is for this information to be used for a wide variety of applications that require an understanding across government. Analytics, search and alternative content presentations have already been considered in this context.

This project was first started in 2015 but was not actively pursued or developed for a few years, until late 2017. At that point the crawler was redeveloped for more effective horizontal scaling, which also meant moving from Node.js to Python in order to leverage functionality developed elsewhere.

The previous version is maintained in the archive-2015 branch.

Access

We are very open to creative use of the data created through this crawling exercise. If you would like access to the information/database feel free to contact us through an issue and we will do what we can to assist.

Architectural Components

These are the basic components of the crawler architecture.

  • Crawled pages storage: S3
  • Parsed results storage: Elasticsearch
  • Crawlers: EC2 boxes with access to S3 and SQS
  • Redis database to hold:
    • seen domains
    • finished domains
    • currently locked domains
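
As a rough illustration, this Redis state maps naturally onto two sets plus per-domain lock keys. The key names in the sketch below are assumptions for illustration only, not necessarily what the crawler actually uses.

    # Sketch of a possible Redis key layout; key names are illustrative only.
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Seen and finished domains fit naturally into Redis sets.
    r.sadd("domains:seen", "www.example.gov.au")
    r.sadd("domains:finished", "www.example.gov.au")

    # A locked domain can be a per-domain key with a TTL, so a crashed
    # crawler's lock expires on its own.
    r.set("lock:www.example.gov.au", "crawler-node-1", nx=True, ex=3600)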

Solution Overview

There is a single instance of a "steward" (crawlers-steward), which tasks crawler nodes (crawler-node) by putting the next domain URL onto an SQS queue.

Steward Function

  • selects the next website to crawl (based on the Redis database)
  • puts that website's domain name on the requests queue
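
A minimal sketch of that loop, assuming boto3, redis-py and the illustrative Redis key names from the sketch above; the queue URL and the candidate-selection logic are placeholders, not the project's actual code.

    import boto3
    import redis

    r = redis.Redis(decode_responses=True)
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/crawl-requests"  # placeholder

    def enqueue_next_domain(candidate_domains):
        """Pick the first domain that is not finished or locked and queue it."""
        for domain in candidate_domains:
            if r.sismember("domains:finished", domain):
                continue  # already crawled
            if r.exists(f"lock:{domain}"):
                continue  # another crawler currently owns it
            r.sadd("domains:seen", domain)
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=domain)
            return domain
        return None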

Crawler Node Function

Each crawler node:

  • Gets messages from the SQS queue
  • Starts a crawler process for that website, observing polite crawling limits
  • Ensures all required locks are set (so only a single crawler downloads any given domain)
  • Starts further crawler processes, for up to N websites per worker
  • For each newly crawled page, puts its content to S3 and the parsed metadata to Elastic
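
A rough sketch of that per-message flow, assuming boto3, redis-py and the official Elasticsearch client; the queue URL, bucket name, index name and the crawl_domain helper are hypothetical placeholders rather than the project's real code.

    import boto3
    import redis
    from elasticsearch import Elasticsearch

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    es = Elasticsearch("http://localhost:9200")
    r = redis.Redis(decode_responses=True)

    QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/crawl-requests"  # placeholder
    BUCKET = "disco-crawl-pages"  # placeholder

    def crawl_domain(domain):
        """Hypothetical stand-in for the real crawl: yields (url, html, metadata) tuples."""
        yield f"https://{domain}/", "<html></html>", {"source_url": f"https://{domain}/"}

    def work_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            domain = msg["Body"]
            # Take the per-domain lock so only a single crawler downloads this domain.
            if not r.set(f"lock:{domain}", "crawler-node-1", nx=True, ex=3600):
                continue  # someone else holds the lock; the message will reappear later
            for url, html, metadata in crawl_domain(domain):
                s3.put_object(Bucket=BUCKET, Key=url, Body=html.encode("utf-8"))
                es.index(index="webindex", document=metadata)
            r.sadd("domains:finished", domain)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])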

The saved metadata consists of:

  • Source URL
  • List of external HTTP links from this page
  • List of external domains
  • List of internal links, normalised ('a/b/../..' -> '/' and so on)
  • Short text for search (summarized page content, not the whole page), keywords
  • HTTP response details (time taken, status code, content metadata, etc.)
  • Other business data
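
An illustrative shape for one such document; the field names are assumptions based on the list above, not the real Elastic mapping.

    page_metadata = {
        "source_url": "https://www.example.gov.au/services/index.html",
        "external_links": ["https://www.abs.gov.au/statistics"],
        "external_domains": ["www.abs.gov.au"],
        "internal_links": ["/", "/services/"],  # already normalised
        "search_text": "Short, summarised page content used for search.",
        "keywords": ["services", "example"],
        "http": {"status_code": 200, "time_taken_ms": 142, "content_type": "text/html"},
    }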

Workers on each EC2 node may be scaled up until they keep the box busy (consuming 70-90% of memory). Memory is the limiting factor here.

The Result

So, after some time, we have a lot of:

  • data in our S3 bucket
  • SQS messages consumed
  • Elastic records created

Devops Information

The currently proposed installation workflow relies on AWS, but it can easily be adapted to any other devops environment.

Requirements:

  • An Elasticsearch service is running and its URL is available
  • An SQS queue has been created:
    • we have an IAM access key/secret to access it
    • we know the queue ARN
  • Some EC2 boxes have been created:
    • one for the steward
    • multiple (at least one) for crawlers
    • we have SSH access to them
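
A small, optional pre-flight check along the following lines can confirm these pieces are in place before deploying; the endpoints, queue URL and Redis host below are placeholders to substitute with your own values.

    import boto3
    import redis
    from elasticsearch import Elasticsearch

    # Elasticsearch service started and URL available
    es = Elasticsearch("http://your-elasticsearch-host:9200")
    assert es.ping(), "Elasticsearch is not reachable"

    # SQS queue created, reachable with the IAM access key/secret, and ARN known
    sqs = boto3.client("sqs")
    attrs = sqs.get_queue_attributes(
        QueueUrl="https://sqs.ap-southeast-2.amazonaws.com/123456789012/crawl-requests",
        AttributeNames=["QueueArn"],
    )
    print("Queue ARN:", attrs["Attributes"]["QueueArn"])

    # Redis reachable (not listed above, but used for the domain bookkeeping)
    r = redis.Redis(host="your-redis-host")
    assert r.ping(), "Redis is not reachable"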

Starting the crawlers

  1. Deploy crawler-node to the target EC2 box
  2. cp prod-crawler-node.env.sample webindex-crawler-node.env (if you rename that file, make sure the env_file value in docker-compose.yml points to the new name)
  3. docker-compose up to start the crawler in sync mode (logs are shown; Control-C kills it)
    • docker-compose up -d to start it in the background (no logs shown directly, and exiting the shell won't stop it)
    • docker-compose up -d --scale crawler=10 to run 10 instances
    • docker-compose down to stop/drop all instances
    • docker-compose up -d --scale crawler=5 to tune the instance count while instances are already running
    • docker-compose up -d --build if you have just changed the source code (otherwise the old image will be used)
    • docker-compose logs --tail=40 -f to see the logs from the containers

As a rule of thumb, run enough crawlers to use about 70% of the memory after an hour or two of crawling (an experimental value). If you start too many, they will overload the EC2 box; if too few, the instance won't be used enough.

Experimental values: 20-25 crawlers per 8 GB of memory (roughly 320-400 MB per crawler).

Starting the steward

The steward works exactly the same way as a crawler instance; only the .env file is named differently. Please ensure that both env files (the crawler's and the steward's) point to the same objects (SQS queue, Redis instance and database IDs).

Creating a fleet

  1. Run the steward
  2. Set up a single EC2 box with the crawler and make sure it works fine for several hours: it crawls data, doesn't use excessive memory (>90%), and so on
  3. Go to the AWS EC2 console and create a snapshot of that crawler EC2 box
  4. Create a spot request with the resulting snapshot AMI; instance type: the same as the source EC2 box; number: start from 3-5; everything else: default
  5. Submit the spot request, wait some time for the instances to be provisioned, and confirm in Kibana that more data is being crawled.

Backups

The following should be backed up:

  • Elastic database
  • S3 bucket
  • Redis database content (mostly 'seen domains')
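
As a sketch, the Redis and Elastic parts can be backed up programmatically; the snapshot repository name below is an assumption that would need to be registered in Elastic beforehand, while the S3 bucket is typically covered by bucket versioning or cross-region replication.

    import redis
    from elasticsearch import Elasticsearch

    # Trigger a Redis background save; dump.rdb on the Redis host then holds
    # the 'seen domains' / 'finished domains' sets.
    r = redis.Redis(host="your-redis-host")
    r.bgsave()

    # Take an Elasticsearch snapshot into a pre-registered repository (name assumed).
    es = Elasticsearch("http://your-elasticsearch-host:9200")
    es.snapshot.create(repository="webindex-backups", snapshot="webindex-daily",
                       wait_for_completion=False)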

About

Web Crawler To Support DTO Discovery Services
