<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Advertools - Analyze website content using XML sitemap

**Tags:** #advertools #xml #sitemap #website #analyze #seo

**Author:** [Elias Dabbas](https://www.linkedin.com/in/eliasdabbas/)

**Description:** This notebook gives an overview of a website's content by analyzing and visualizing its XML sitemap. Reviewing the sitemap is also an important part of an SEO audit, since it can uncover issues that might affect the website.

**References:**
- [advertools Sitemaps](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
- [XML Sitemap](https://en.wikipedia.org/wiki/Sitemaps)
- [Sitemaps Protocol](https://www.sitemaps.org/)

## Input

### Import libraries

In [1]:
# Install each library only if it is not already available
try:
    import advertools as adv
except ImportError:
    !pip install advertools --user
    import advertools as adv
try:
    import adviz
except ImportError:
    !pip install adviz --user
    import adviz
from urllib.parse import urlsplit

### Setup Variables
- `sitemap_url`: URL of the sitemap to analyze. Normal and zipped formats are supported. It can be:
    * The URL of an XML sitemap
    * The URL of an XML sitemapindex
    * The URL of a robots.txt file
- `recursive`: If `sitemap_url` points to a sitemapindex, whether all sub-sitemaps should also be downloaded, parsed, and combined into one DataFrame.
- `max_workers`: Number of concurrent workers used to fetch the sitemaps.
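
As a quick sanity check before running the notebook, the kind of source a URL points to can be guessed from its path. This is only a heuristic sketch: the function name `describe_sitemap_source` and the example URLs are hypothetical, and it does not reflect how advertools itself detects the format.

```python
from urllib.parse import urlsplit


def describe_sitemap_source(url: str) -> str:
    """Guess the kind of sitemap source from the URL path alone (no network access)."""
    # urlsplit separates the query string, so "?view=1" does not affect the check
    path = urlsplit(url).path.lower()
    if path.endswith("robots.txt"):
        return "robots.txt"
    if path.endswith(".gz"):
        return "zipped sitemap"
    if path.endswith(".xml"):
        return "XML sitemap or sitemapindex"
    return "unknown"


# Hypothetical example URLs, not taken from this notebook
for example in (
    "https://example.com/robots.txt",
    "https://example.com/sitemap.xml.gz",
    "https://example.com/sitemap.xml?view=1",
):
    print(example, "->", describe_sitemap_source(example))
```

A result of `unknown` does not mean the URL is invalid; some sites serve sitemaps from paths without a recognizable extension.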

In [2]:
sitemap_url = "https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1"
recursive = True
max_workers = 8

## Model

### Analyze website content using XML sitemap
Download the sitemap(s) and parse them into a DataFrame:

In [3]:
sitemap = adv.sitemap_to_df(
    sitemap_url=sitemap_url,
    max_workers=max_workers,
    recursive=recursive
)
sitemap

2023-05-17 10:16:07,607 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1


| | loc | lastmod | priority | sitemap | sitemap_size_mb | download_date |
|---|---|---|---|---|---|---|
| 0 | https://www.naas.ai/ | 2023-05-16 06:16:57+00:00 | 1.0 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 0.000995 | 2023-05-17 08:16:07.611404+00:00 |
| 1 | https://www.naas.ai/free-forever | 2023-05-16 06:16:57+00:00 | 0.8 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 0.000995 | 2023-05-17 08:16:07.611404+00:00 |
| 2 | https://www.naas.ai/pricing | 2023-05-16 06:16:57+00:00 | 0.8 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 0.000995 | 2023-05-17 08:16:07.611404+00:00 |
| 3 | https://www.naas.ai/company | 2023-05-16 06:16:57+00:00 | 0.8 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 0.000995 | 2023-05-17 08:16:07.611404+00:00 |
| 4 | https://www.naas.ai/terms | 2023-05-16 06:16:57+00:00 | 0.8 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 0.000995 | 2023-05-17 08:16:07.611404+00:00 |


Split the URLs into their components for further analysis:

In [4]:
urldf = adv.url_to_df(sitemap['loc'])
urldf

| | url | scheme | netloc | path | query | fragment | dir_1 | last_dir |
|---|---|---|---|---|---|---|---|---|
| 0 | https://www.naas.ai/ | https | www.naas.ai | / | | | | |
| 1 | https://www.naas.ai/free-forever | https | www.naas.ai | /free-forever | | | free-forever | free-forever |
| 2 | https://www.naas.ai/pricing | https | www.naas.ai | /pricing | | | pricing | pricing |
| 3 | https://www.naas.ai/company | https | www.naas.ai | /company | | | company | company |
| 4 | https://www.naas.ai/terms | https | www.naas.ai | /terms | | | terms | terms |
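
To make the column layout above easier to read, here is a rough standard-library approximation of how a single URL maps to those columns. The helper `split_url` is hypothetical and is not how `adv.url_to_df` is implemented; it only illustrates why `dir_1`, `dir_2`, and so on appear only when a URL has that many path segments.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split one URL into components similar to the columns shown above."""
    parts = urlsplit(url)
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    # Non-empty path segments become dir_1, dir_2, ... in order
    dirs = [segment for segment in parts.path.split("/") if segment]
    for level, segment in enumerate(dirs, start=1):
        row[f"dir_{level}"] = segment
    if dirs:
        row["last_dir"] = dirs[-1]
    return row


print(split_url("https://www.naas.ai/pricing")["dir_1"])  # pricing
print("dir_1" in split_url("https://www.naas.ai/"))       # False
```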


## Output

### Display results

#### Errors

In [5]:
# Import display up front so that later cells can use it regardless of the branch taken here
from IPython.display import display

if 'errors' in sitemap:
    display(sitemap[sitemap['errors'].notnull()])
else:
    print('No errors found')

No errors found


#### Duplicated URLs

In [6]:
duplicated = sitemap[sitemap['loc'].duplicated()]
if not duplicated.empty:
    display(duplicated)
else:
    print('No duplicated URLs found')

No duplicated URLs found
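
Note that pandas' `duplicated()` with its default settings flags only the second and later occurrences of each value. When auditing, it can help to see every repeated URL together with how often it appears. A minimal standard-library sketch, with `find_duplicates` as a hypothetical helper name and example URLs that are not from this sitemap:

```python
from collections import Counter


def find_duplicates(urls):
    """Return a dict mapping each URL that appears more than once to its count."""
    counts = Counter(urls)
    return {url: n for url, n in counts.items() if n > 1}


# Hypothetical example list
example_urls = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/",
]
print(find_duplicates(example_urls))  # {'https://example.com/': 2}
```

In this notebook the function could be called as `find_duplicates(sitemap['loc'])`. Within pandas, `sitemap['loc'].duplicated(keep=False)` marks all occurrences rather than only the repeats.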


#### URL counts per sitemap and sitemap sizes

Each sitemap should contain a maximum of 50,000 URLs, and its uncompressed size should not exceed 50 MB.

URL counts:

In [7]:
adviz.value_counts_plus(sitemap['sitemap'], name='Sitemap URLs')

| | Sitemap URLs | count | cum. count | % | cum. % |
|---|---|---|---|---|---|
| 1 | https://www.xml-sitemaps.com/download/www.naas.ai-a2e8849ba/sitemap.xml?view=1 | 5 | 5 | 100.0% | 100.0% |


Sitemap sizes (MB):

In [8]:
sitemap['sitemap_size_mb'].describe().to_frame().T.style.format('{:,.2f}')

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sitemap_size_mb | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
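
The two limits stated in this section can also be checked programmatically instead of by reading the tables. The sketch below is one possible way to do it; `check_sitemap_limits` and the constant names are hypothetical, and it assumes the reported size is comparable to the protocol's 50 MB limit, which may not hold if the size was measured on a compressed file.

```python
from collections import Counter

MAX_URLS = 50_000   # maximum number of URLs per sitemap
MAX_SIZE_MB = 50    # maximum sitemap size in MB


def check_sitemap_limits(rows):
    """Check (sitemap_url, size_mb) pairs, one pair per URL entry.

    Returns a dict mapping each offending sitemap to a list of issues.
    """
    counts = Counter()
    sizes = {}
    for sitemap_url, size_mb in rows:
        counts[sitemap_url] += 1
        sizes[sitemap_url] = size_mb

    problems = {}
    for sitemap_url, n_urls in counts.items():
        issues = []
        if n_urls > MAX_URLS:
            issues.append(f"{n_urls:,} URLs (limit {MAX_URLS:,})")
        if sizes[sitemap_url] > MAX_SIZE_MB:
            issues.append(f"{sizes[sitemap_url]:.2f} MB (limit {MAX_SIZE_MB} MB)")
        if issues:
            problems[sitemap_url] = issues
    return problems


# Values matching the shape of this notebook's output: 5 URLs in a ~0.001 MB sitemap
rows = [("sitemap.xml", 0.000995)] * 5
print(check_sitemap_limits(rows))  # {}
```

With the DataFrame from this notebook, the input could be built as `zip(sitemap['sitemap'], sitemap['sitemap_size_mb'])`.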


#### Count unique values of URL components

In [9]:
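# dir_2, dir_3, ... exist only when at least one URL has that many path levels,
# so skip the columns that are missing instead of silencing every exception
for col in ['scheme', 'netloc', 'dir_1', 'dir_2', 'dir_3']:
    if col in urldf:
        display(adviz.value_counts_plus(urldf[col], name=col))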

| | scheme | count | cum. count | % | cum. % |
|---|---|---|---|---|---|
| 1 | https | 5 | 5 | 100.0% | 100.0% |


| | netloc | count | cum. count | % | cum. % |
|---|---|---|---|---|---|
| 1 | www.naas.ai | 5 | 5 | 100.0% | 100.0% |


| | dir_1 | count | cum. count | % | cum. % |
|---|---|---|---|---|---|
| 1 | | 1 | 1 | 20.0% | 20.0% |
| 2 | free-forever | 1 | 2 | 20.0% | 40.0% |
| 3 | pricing | 1 | 3 | 20.0% | 60.0% |
| 4 | company | 1 | 4 | 20.0% | 80.0% |
| 5 | terms | 1 | 5 | 20.0% | 100.0% |


#### Visualize the structure of the URLs

In [10]:
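# Host of the sitemap file itself; this differs from the site's own domain
# when the sitemap is hosted elsewhere, as with the example URL above
domain = urlsplit(sitemap_url).netloc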
adviz.url_structure(
    urldf['url'].fillna(''),
    items_per_level=30,
    domain=domain,
    height=750,
    title=f'URL Structure: {domain} XML sitemap'
)

KeyError: 'dir_2'
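
The `KeyError: 'dir_2'` is consistent with the URL table earlier in the notebook, where no URL has a second path level and so no `dir_2` column exists. Assuming that is the cause (the internals of `adviz.url_structure` are not shown here), one way to anticipate the problem is to measure how deep the URL paths go before plotting. The helper `max_dir_depth` is a hypothetical name:

```python
from urllib.parse import urlsplit


def max_dir_depth(urls):
    """Return the largest number of non-empty path segments among the URLs."""
    return max(
        (len([s for s in urlsplit(url).path.split("/") if s]) for url in urls),
        default=0,
    )


# The five URLs listed in this notebook's sitemap
urls = [
    "https://www.naas.ai/",
    "https://www.naas.ai/free-forever",
    "https://www.naas.ai/pricing",
    "https://www.naas.ai/company",
    "https://www.naas.ai/terms",
]
print(max_dir_depth(urls))  # 1
```

A depth below 2 would suggest skipping the structure chart for this sitemap, or trying a sitemap with deeper paths.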