mediacloud/ultimate-sitemap-parser

Website sitemap parser for Python 3.5+.

Features

Installation

pip install ultimate-sitemap-parser

Usage

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects representing the sitemap hierarchy found on the website; see the reference documentation for the AbstractSitemap subclasses.
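To make the shape of that hierarchy easier to picture, here is a minimal sketch that walks a sitemap tree and yields each sitemap's URL with its nesting depth. It assumes every sitemap object exposes a url attribute and that index-style sitemaps expose their children as sub_sitemaps; check both against the AbstractSitemap reference for your installed version. The helper iter_sitemap_urls() and the demo_tree stand-in objects are hypothetical and are not part of the library; they only let the sketch run without network access.

```python
from types import SimpleNamespace


def iter_sitemap_urls(sitemap, depth=0):
    """Yield (depth, url) for a sitemap and every nested sub-sitemap."""
    # Assumption: each sitemap object has a `url` attribute.
    yield depth, sitemap.url
    # Assumption: index sitemaps list their children in `sub_sitemaps`;
    # sitemaps without that attribute are treated as leaves.
    for child in getattr(sitemap, 'sub_sitemaps', None) or []:
        yield from iter_sitemap_urls(child, depth + 1)


# Hypothetical stand-in for the object returned by sitemap_tree_for_homepage().
demo_tree = SimpleNamespace(
    url='https://example.com/',
    sub_sitemaps=[
        SimpleNamespace(
            url='https://example.com/robots.txt',
            sub_sitemaps=[
                SimpleNamespace(url='https://example.com/sitemap.xml'),
            ],
        ),
    ],
)

for depth, url in iter_sitemap_urls(demo_tree):
    print('  ' * depth + url)
```

With a real tree, the same loop could be run over the value returned by sitemap_tree_for_homepage(), provided the attribute assumptions above hold.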

If you just want to list all the pages found in all of the website's sitemaps, use the all_pages() method:

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

The all_pages() method returns an iterator yielding SitemapPage objects; see the SitemapPage reference.
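Because all_pages() yields pages lazily, and the same page may be listed in more than one sitemap, it can be handy to collect de-duplicated URLs while stopping early. The sketch below shows one way to do that. It assumes each SitemapPage exposes its address as a url attribute; the helper unique_page_urls() and the sample_pages stand-ins are hypothetical and exist only so the example runs offline.

```python
from types import SimpleNamespace


def unique_page_urls(pages, limit=None):
    """Return page URLs in first-seen order, without duplicates.

    `pages` is any iterable of objects with a `url` attribute;
    `limit` optionally caps how many unique URLs are collected.
    """
    seen = set()
    urls = []
    for page in pages:
        # Stop before consuming more of the iterator than needed.
        if limit is not None and len(urls) >= limit:
            break
        url = page.url  # assumption: SitemapPage exposes `url`
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Hypothetical stand-ins for the objects yielded by tree.all_pages().
sample_pages = [
    SimpleNamespace(url='https://example.com/a'),
    SimpleNamespace(url='https://example.com/b'),
    SimpleNamespace(url='https://example.com/a'),
]

print(unique_page_urls(sample_pages))
```

Against a real site, the call would presumably look like unique_page_urls(tree.all_pages(), limit=100), which avoids holding every page object in memory at once.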