An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.
The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.
- Basic parsing
- Recursive parsing
- String parsing
- Custom User-Agent string
- Proxy support
- XML (`.xml`)
- Compressed XML (`.xml.gz`)
- Robots.txt rule sheet (`robots.txt`)
- Line-separated text (disabled by default)
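All of the formats above go through the same `parse()` call. A minimal sketch, using hypothetical `example.com` URLs:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    // Hypothetical URLs - the same parse() call handles every supported format.
    $parser->parse('https://example.com/sitemap.xml');    // XML
    $parser->parse('https://example.com/sitemap.xml.gz'); // compressed XML
    $parser->parse('https://example.com/robots.txt');     // robots.txt rule sheet
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```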
- PHP 5.6 or 7.0+, alternatively HHVM
- PHP extensions:
The library is available via Composer. Add this to your `composer.json` file:
```json
{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}
```
Then run `composer update`.
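Alternatively, run `composer require vipnytt/sitemapparser` to add the dependency and install it in one step.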
Returns a list of URLs only.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('https://www.google.com/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
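Since `getURLs()` returns an array keyed by URL, a plain list of URLs can be extracted with `array_keys()`. A minimal sketch:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('https://www.google.com/sitemap.xml');
    // getURLs() is keyed by URL, so array_keys() yields the URL list.
    $urls = array_keys($parser->getURLs());
    echo count($urls) . ' URLs found';
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```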
Returns all available tags for both sitemaps and URLs.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
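Note that `<lastmod>`, `<changefreq>`, and `<priority>` are optional tags in the Sitemaps.org protocol, so a given URL may not carry all of them. A defensive variant of the loop above, under the assumption that absent tags are simply not set in the tag array:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        // Optional per the protocol; guard before reading (assumes absent tags are not set).
        echo 'LastMod: ' . (isset($tags['lastmod']) ? $tags['lastmod'] : 'n/a') . '<br>';
        echo 'Priority: ' . (isset($tags['priority']) ? $tags['priority'] : 'n/a') . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```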
Parses any sitemap detected while parsing, to get a complete list of URLs.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
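Once recursion finishes, the full result set is available in memory. A minimal sketch that writes every discovered URL to disk, one per line (the file name `urls.txt` is arbitrary):

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    // Write every discovered URL to a file, one per line.
    file_put_contents('urls.txt', implode(PHP_EOL, array_keys($parser->getURLs())));
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```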
Note: This is disabled by default to avoid false positives when XML is expected but plain text is fetched instead. To disable strict standards, pass `['strict' => false]` as the second constructor parameter.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
Even more examples are available in the `examples` directory.
Available configuration options, with their default values:
```php
$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
];
$parser = new SitemapParser('MyCustomUserAgent', $config);
```
If a User-Agent is also set via the GuzzleHttp request options, it takes the highest priority and replaces the User-Agent set in the first constructor parameter.
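The `guzzle` array is also where the proxy support mentioned above is configured, since it is passed straight through as GuzzleHttp request options. A minimal sketch using the standard Guzzle `proxy` and `timeout` options (both values here are placeholders):

```php
use vipnytt\SitemapParser;

$config = [
    'guzzle' => [
        // Standard GuzzleHttp request options; proxy address and timeout are placeholders.
        'proxy' => 'tcp://localhost:8125',
        'timeout' => 30,
    ],
];
$parser = new SitemapParser('MyCustomUserAgent', $config);
```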