# phantom-sitemap

Crawls a site, extracts its links, and returns a promise that resolves to either a sitemap or just a list of links.

If a URL contains a hashbang (#!) or the page carries the fragment meta tag, the HTML to parse is rendered by invoking PhantomJS, so JavaScript-generated content is included in the crawl.
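The detection might look roughly like the sketch below. `needsPhantom` is a hypothetical helper written for illustration; the actual check lives in crawl.js and may differ.

```js
// Hypothetical sketch: decide whether a page should be rendered with PhantomJS
// before its links are extracted (not the module's actual API).
function needsPhantom(url, html) {
    // hashbang URLs (#!) signal AJAX-driven content
    if (url.indexOf('#!') !== -1) return true;
    // <meta name="fragment" content="!"> marks pages meant to be crawled
    // via the _escaped_fragment_ convention
    return /<meta[^>]+name=["']fragment["'][^>]*content=["']!["']/i.test(html);
}
```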

```js
var defaultOptions = { maxDepth: 1,
                       maxFollow: 0,
                       verbose: false,
                       silent: false,
                       // timeout for a request (ms):
                       timeout: 60000,
                       // interval before trying again (ms):
                       retryTimeout: 10000,
                       retries: 3,
                       ignore: ['xls', 'png', 'jpg', 'js', 'css'],
                       include: ['pdf', 'doc'], // include other crawlable assets in the list
                       cacheDir: './cache',
                       sitemap: true,
                       out: 'sitemap.xml',
                       replaceHost: 'www.example.com'
                     };
```

Set `options.sitemap` to `false` to return just a list of links.
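For example, assuming the same call style as the test below (the exact shape of the resolved value is an assumption):

```js
// Assumption: with sitemap disabled, the promise resolves with a plain list of links
var listLinks = module.exports({ sitemap: false });
listLinks('http://localhost:9000').when(
    function (links) { console.log(links); },
    function (err) { console.log('ERROR', err); }
);
```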

```js
// Test
var crawl = module.exports(options);
crawl('http://localhost:9000').when(
    function (data) {
        console.log('RESULT:\n', data);
    },
    function (err) {
        console.log('ERROR', err);
    }
);
```
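With `sitemap` enabled, the crawl writes the file named by `out`. The result presumably follows the standard sitemaps.org format, roughly like the illustrative sample below (URLs are made up, using the default `replaceHost` as the host on the assumption that `replaceHost` rewrites the host of each discovered URL):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>
```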

node-crawler is used to crawl static pages (pages that do not need PhantomJS rendering).
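For reference, link extraction with node-crawler typically looks like the sketch below. This uses the callback API documented for the npm `crawler` package; the node-crawler.js bundled in this repo may expose a different interface.

```js
// Sketch using the npm "crawler" (node-crawler) package's documented API;
// the bundled node-crawler.js in this repo may differ.
var Crawler = require('crawler');

var c = new Crawler({
    maxConnections: 10,
    callback: function (error, res, done) {
        if (!error) {
            var $ = res.$;                         // server-side jQuery-like selector
            $('a').each(function () {
                console.log($(this).attr('href')); // collect every link on the page
            });
        }
        done();
    }
});

c.queue('http://localhost:9000');
```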

TODO: create html map
