An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.
The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.
- Basic parsing
- Recursive parsing
- String parsing
- Custom User-Agent string
- Proxy support
- XML (`.xml`)
- Compressed XML (`.xml.gz`)
- Robots.txt rule sheet (`robots.txt`)
- Line-separated text (disabled by default)
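All of the formats above go through the same `parse()` call. A minimal sketch, using hypothetical `example.com` URLs:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    // Hypothetical URLs - the same parse() call handles every supported format.
    $parser->parse('https://example.com/sitemap.xml');    // XML
    $parser->parse('https://example.com/sitemap.xml.gz'); // compressed XML
    $parser->parse('https://example.com/robots.txt');     // robots.txt rule sheet
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```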
- PHP 5.6 or 7.0+, alternatively HHVM
- PHP extensions:
The library is available via Composer. Add this to your `composer.json` file:
```json
{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}
```
Then run `composer update`.
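Alternatively, run `composer require vipnytt/sitemapparser` to add the dependency and install it in one step.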
Returns a list of URLs only.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('https://www.google.com/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
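Since `getURLs()` returns an array keyed by URL, a plain list of URLs can be extracted with `array_keys()`. A minimal sketch:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('https://www.google.com/sitemap.xml');
    // getURLs() is keyed by URL, so array_keys() yields the URL list.
    $urls = array_keys($parser->getURLs());
    echo count($urls) . ' URLs found';
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```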
Returns all available tags for both sitemaps and URLs.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
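Note that `<lastmod>`, `<changefreq>`, and `<priority>` are optional tags in the Sitemaps.org protocol, so a given URL may not carry all of them. A defensive variant of the loop above, under the assumption that absent tags are simply not set in the tag array:

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        // Optional per the protocol; guard before reading (assumes absent tags are not set).
        echo 'LastMod: ' . (isset($tags['lastmod']) ? $tags['lastmod'] : 'n/a') . '<br>';
        echo 'Priority: ' . (isset($tags['priority']) ? $tags['priority'] : 'n/a') . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```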
Parses any sitemap detected while parsing, to get a complete list of URLs.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
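Once recursion finishes, the full result set is available in memory. A minimal sketch that writes every discovered URL to disk, one per line (the file name `urls.txt` is arbitrary):

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    // Write every discovered URL to a file, one per line.
    file_put_contents('urls.txt', implode(PHP_EOL, array_keys($parser->getURLs())));
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```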
Note: This is disabled by default to avoid false positives when XML is expected but plain text is fetched instead. To disable strict standards, pass `['strict' => false]` as the second constructor parameter.
```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```
Even more examples are available in the `examples` directory.
Available configuration options, with their default values:
```php
$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
];
$parser = new SitemapParser('MyCustomUserAgent', $config);
```
If a User-Agent is also set via the GuzzleHttp request options, it takes the highest priority and replaces the User-Agent set in the first constructor parameter.
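The `guzzle` array is also where the proxy support mentioned above is configured, since it is passed straight through as GuzzleHttp request options. A minimal sketch using the standard Guzzle `proxy` and `timeout` options (both values here are placeholders):

```php
use vipnytt\SitemapParser;

$config = [
    'guzzle' => [
        // Standard GuzzleHttp request options; proxy address and timeout are placeholders.
        'proxy' => 'tcp://localhost:8125',
        'timeout' => 30,
    ],
];
$parser = new SitemapParser('MyCustomUserAgent', $config);
```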