
Crawler.js

A PhantomJS script that crawls AJAX sites using the #!/-syntax and generates static HTML snapshots in an _escaped_fragment_ directory.

The Problem

Search bots do not render modern HTML, CSS and JavaScript. Thus we need to give these bots a helping hand and generate static snapshots of our sites.

Google describes the technique at Making AJAX Applications Crawlable.

API

Run

phantomjs --load-images=no --web-security=no crawler.js http://example.com

where http://example.com is your AJAX-site, or

phantomjs --load-images=no --web-security=no crawler.js --ignore somePattern http://example.com

where somePattern is a RegExp; any URL matching it will be ignored.
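For example, assuming you want to skip everything under a hypothetical #!/admin section (the pattern and URL here are placeholders), you could run

phantomjs --load-images=no --web-security=no crawler.js --ignore '#!/admin' http://example.com

and crawler.js will leave out every URL matching that pattern.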

Features

Crawler.js will visit the page, wait for it to render, and then

  • store the visible HTML
  • find all #!/-links on your site
  • visit every link recursively

Crawler.js will only visit pages that are under the URL you supplied as the start argument to crawler.js.
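The crawler.js source is the authoritative implementation; purely as a sketch of the steps listed above, a minimal PhantomJS crawl could look roughly like this (the start URL, the render timeout, and the snapshot directory layout are illustrative assumptions, not necessarily what crawler.js does):

var webpage = require('webpage');
var fs = require('fs');

var startUrl = 'http://example.com';   // assumption: the start argument
var visited = {};

function crawl(url) {
  if (visited[url]) { return; }
  visited[url] = true;

  var page = webpage.create();
  page.open(url, function (status) {
    if (status !== 'success') { page.close(); return; }

    // Wait a bit so the AJAX application has time to render.
    setTimeout(function () {
      // 1. Store the visible HTML as a static snapshot
      //    (directory layout is an assumption for this sketch).
      var fragment = url.split('#!/')[1] || '';
      var dir = '_escaped_fragment_/' + fragment;
      fs.makeTree(dir);
      fs.write(dir + '/index.html', page.content, 'w');

      // 2. Find all #!/-links on the page.
      var links = page.evaluate(function () {
        return Array.prototype.map.call(
          document.querySelectorAll('a[href*="#!/"]'),
          function (a) { return a.href; });
      });
      page.close();

      // 3. Visit every link recursively, but only links inside the start URL.
      links.filter(function (href) { return href.indexOf(startUrl) === 0; })
           .forEach(crawl);
    }, 2000);
  });
}

crawl(startUrl);
// A real script would also track outstanding pages and call phantom.exit()
// once the queue is empty; that bookkeeping is omitted here.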

Configure Apache

Create a file named .htaccess in the same folder as your AJAX site (which is also where the escaped fragments should be stored) and add the following lines:

RewriteEngine On

RewriteCond %{QUERY_STRING} ^_escaped_fragment_=$
RewriteRule ^(.*)$ _escaped_fragment_$1/index.html? [L]

RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
RewriteRule ^(.*)$ _escaped_fragment_$1/%1/index.html? [L]

These rules rewrite every URL carrying the _escaped_fragment_ query parameter to the corresponding static snapshot.
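As an illustration of what the rules do (assuming the site is served from the document root), a request for

http://example.com/?_escaped_fragment_=about

matches the second condition with %1 = about and $1 empty, and is therefore rewritten to

_escaped_fragment_/about/index.html

so the bot is served the static snapshot instead of the empty AJAX shell.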


License

MIT (see LICENSE-MIT)
