<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Advertools - Audit robots txt and xml sitemap issues
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/Advertools/Advertools_Audit_robots_txt_and_xml_sitemap_issues.ipynb" target="_parent"><img src="https://naasai-public.s3.eu-west-3.amazonaws.com/Open_in_Naas_Lab.svg"/></a><br><br><a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=template-request.md&title=Tool+-+Action+of+the+notebook+">Template request</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=bug&template=bug_report.md&title=Advertools+-+Audit+robots+txt+and+xml+sitemap+issues:+Error+short+description">Bug report</a> | <a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/Naas/Naas_Start_data_product.ipynb" target="_parent">Generate Data Product</a>

**Tags:** #advertools #xml #sitemap #website #audit #seo #robots.txt #google

**Author:** [Elias Dabbas](https://www.linkedin.com/in/eliasdabbas/)

**Description:** This notebook helps you check if there are any conflicts between robots.txt rules and your XML sitemap.

* Are you disallowing URLs that you shouldn't?
* Test and make sure you don't publish new pages with such conflicts.
* Do this in bulk: for all URL/rule/user-agent combinations run all tests with one command.

**References:**
- [advertools robots.txt functions](https://advertools.readthedocs.io/en/master/advertools.robotstxt.html)
- [Google's robots reference](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)
- [advertools XML sitemaps](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)

## Input

### Install libraries
If you are running this notebook on Naas, uncomment and run the line below to install the libraries.

In [None]:
#!pip install advertools adviz pandas==1.5.3 --user

### Import libraries

In [9]:
import advertools as adv

### Setup Variables
- `robotstxt_url`: URL of the robots.txt file to convert to a `DataFrame`

In [13]:
robotstxt_url = "https://www.youtube.com/robots.txt"
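
If you only have a page URL, the robots.txt location can be derived from its scheme and host. The following is a minimal sketch using only the standard library; `robotstxt_url_for` is a hypothetical helper name and is not part of advertools.

```python
from urllib.parse import urlsplit, urlunsplit


def robotstxt_url_for(page_url):
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Expected an absolute URL, got: {page_url!r}")
    # robots.txt is served from the root of the scheme + host (+ port)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robotstxt_url_for("https://www.youtube.com/trends/about/"))
```

The result can be assigned to `robotstxt_url` in place of a hard-coded string.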

## Model

### Analyze potential robots.txt and XML conflicts

Getting the robots.txt file and converting it to a `DataFrame`.

In [14]:
robots_df = adv.robotstxt_to_df(robotstxt_url=robotstxt_url)
robots_df

2023-07-13 14:02:10,739 | INFO | robotstxt.py:381 | robotstxt_to_df | Getting: https://www.youtube.com/robots.txt


| | directive | content | robotstxt_last_modified | robotstxt_url | download_date |
|---|---|---|---|---|---|
| 0 | comment | robots.txt file for YouTube | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 1 | comment | Created in the distant future (the year 2000) ... | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 2 | comment | the robotic uprising of the mid 90's which wip... | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 3 | User-agent | Mediapartners-Google* | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 4 | Disallow | | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 5 | User-agent | * | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 6 | Disallow | /comment | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 7 | Disallow | /feeds/videos.xml | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 8 | Disallow | /get_video | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
| 9 | Disallow | /get_video_info | 2023-06-13 21:38:00+00:00 | https://www.youtube.com/robots.txt | 2023-07-13 12:02:10.779470+00:00 |
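
To see where the `directive` and `content` columns come from, here is a rough sketch of splitting robots.txt lines into pairs. It is not advertools' parsing code; the sample text and the helper name `robots_lines_to_rows` are made up for illustration.

```python
# Hypothetical robots.txt content, inlined so the example runs offline
SAMPLE_ROBOTS_TXT = """\
# robots.txt file for example.com
User-agent: *
Disallow: /comment
Disallow:

Sitemap: https://example.com/sitemap.xml
"""


def robots_lines_to_rows(text):
    """Split robots.txt text into (directive, content) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate rule groups
        if line.startswith("#"):
            rows.append(("comment", line.lstrip("#").strip()))
            continue
        # Split on the first colon only, so URLs in the content stay intact
        directive, _, content = line.partition(":")
        rows.append((directive.strip(), content.strip()))
    return rows


for directive, content in robots_lines_to_rows(SAMPLE_ROBOTS_TXT):
    print(f"{directive!r:14} {content!r}")
```

An empty `Disallow` value, as in row 4 of the YouTube output, means nothing is disallowed for that user agent.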


Get XML sitemap(s) and convert to a `DataFrame`.

In [15]:
sitemap = adv.sitemap_to_df(
    # the function will extract and combine all available sitemaps
    # in the robots.txt file
    robotstxt_url,
    max_workers=8,
    recursive=True)
sitemap

2023-07-13 14:02:15,771 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/sitemaps/misc.xml
2023-07-13 14:02:15,837 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://about.youtube/sitemap.xml
2023-07-13 14:02:15,848 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/trends/sitemap.xml
2023-07-13 14:02:15,872 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/ads/sitemap-old.xml
2023-07-13 14:02:15,904 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/jobs/sitemap.xml
2023-07-13 14:02:15,960 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/creators/sitemap.xml
2023-07-13 14:02:16,343 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/originals/guidelines/sitemap.xml
2023-07-13 14:02:16,345 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.youtube.com/ads/sitemap.xml
2023-07-13 14:02:16,353 | INFO | sitemaps.py:536 | sit

| | loc | sitemap | sitemap_last_modified | sitemap_size_mb | download_date | errors | lastmod |
|---|---|---|---|---|---|---|---|
| 0 | https://www.youtube.com/videomasthead/ | https://www.youtube.com/sitemaps/misc.xml | 2020-05-28 21:15:00+00:00 | 0.000213 | 2023-07-13 12:02:15.873105+00:00 | | NaT |
| 1 | https://www.youtube.com/trends/ | https://www.youtube.com/trends/sitemap.xml | NaT | 0.108909 | 2023-07-13 12:02:15.887329+00:00 | | NaT |
| 2 | https://www.youtube.com/trends/2021/ | https://www.youtube.com/trends/sitemap.xml | NaT | 0.108909 | 2023-07-13 12:02:15.887329+00:00 | | NaT |
| 3 | https://www.youtube.com/trends/about/ | https://www.youtube.com/trends/sitemap.xml | NaT | 0.108909 | 2023-07-13 12:02:15.887329+00:00 | | NaT |
| 4 | https://www.youtube.com/trends/ads-leaderboard/ | https://www.youtube.com/trends/sitemap.xml | NaT | 0.108909 | 2023-07-13 12:02:15.887329+00:00 | | NaT |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 240650 | https://www.youtube.com/product/E_g_11g4j377vp | https://www.youtube.com/product/sitemap-files/... | 2022-04-20 14:08:00+00:00 | 5.621672 | 2023-07-13 12:02:21.292386+00:00 | | 2022-04-20 00:00:00+00:00 |
| 240651 | https://www.youtube.com/product/E_g_11dzx59g99 | https://www.youtube.com/product/sitemap-files/... | 2022-04-20 14:08:00+00:00 | 5.621672 | 2023-07-13 12:02:21.292386+00:00 | | 2022-04-20 00:00:00+00:00 |
| 240652 | https://www.youtube.com/product/E_g_11lg3zxccn | https://www.youtube.com/product/sitemap-files/... | 2022-04-20 14:08:00+00:00 | 5.621672 | 2023-07-13 12:02:21.292386+00:00 | | 2022-04-20 00:00:00+00:00 |
| 240653 | https://www.youtube.com/product/E_g_11fmdg91lj | https://www.youtube.com/product/sitemap-files/... | 2022-04-20 14:08:00+00:00 | 5.621672 | 2023-07-13 12:02:21.292386+00:00 | | 2022-04-20 00:00:00+00:00 |
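
The `loc` and `lastmod` columns mirror the tags of each `<url>` entry in a sitemap. As an illustration only (this is not how `sitemap_to_df` is implemented), the sketch below reads those tags from a small, made-up sitemap with the standard library; `SITEMAP_XML` and `sitemap_rows` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, inlined so the example runs offline
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/trends/</loc>
    <lastmod>2022-04-20</lastmod>
  </url>
  <url>
    <loc>https://example.com/trends/about/</loc>
  </url>
</urlset>"""

# Sitemap tags live in this XML namespace, so lookups must be qualified
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_rows(xml_text):
    """Return one dict per <url> entry; missing tags become None."""
    root = ET.fromstring(xml_text)
    return [
        {
            "loc": url.findtext("sm:loc", default=None, namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
        }
        for url in root.findall("sm:url", NS)
    ]


for row in sitemap_rows(SITEMAP_XML):
    print(row)
```

Entries without a `<lastmod>` tag yield `None` here, which corresponds to the `NaT` values in the `lastmod` column above.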


#### Testing robots.txt
For all URL/user-agent combinations check if the URL is blocked.

In [17]:
user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
user_agents

3    Mediapartners-Google*
5                        *
Name: content, dtype: object
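
To make the idea of testing every URL/user-agent combination concrete, here is a small offline sketch built on Python's `urllib.robotparser`. This is not how advertools implements `robotstxt_test`: the standard-library parser matches paths by simple prefix and does not interpret `*` or `$` inside rules, so results can differ on real files. The rules, URLs, and the helper name `check_urls` are made up for illustration.

```python
from itertools import product
from urllib.robotparser import RobotFileParser

# Hypothetical rules: AdsBot may fetch everything, everyone else is restricted
ROBOTS_TXT = """\
User-agent: AdsBot
Disallow:

User-agent: *
Disallow: /comment
Disallow: /get_video
"""


def check_urls(robots_text, user_agents, urls):
    """Test every user-agent/URL pair against the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        {"user_agent": ua, "url_path": url, "can_fetch": parser.can_fetch(ua, url)}
        for ua, url in product(user_agents, urls)
    ]


report = check_urls(
    ROBOTS_TXT,
    user_agents=["*", "AdsBot"],
    urls=["https://example.com/trends/", "https://example.com/comment/123"],
)
for row in report:
    print(row)
```

The number of rows is the number of user agents multiplied by the number of URLs, which is why the report for a large sitemap grows quickly.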

Generate the robots.txt test report:

In [19]:
# Get user agents
user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
print(user_agents)

# Testing robots.txt
robots_report = adv.robotstxt_test(
    robotstxt_url=robotstxt_url,
    user_agents=user_agents,
    urls=sitemap['loc'].dropna()
)

print("Row fetched:", len(robots_report))
robots_report.head(5)

3    Mediapartners-Google*
5                        *
Name: content, dtype: object
Row fetched: 481308


| | robotstxt_url | user_agent | url_path | can_fetch |
|---|---|---|---|---|
| 0 | https://www.youtube.com/robots.txt | * | https://about.youtube/ | True |
| 1 | https://www.youtube.com/robots.txt | * | https://artists.youtube/ | True |
| 2 | https://www.youtube.com/robots.txt | * | https://artists.youtube/features/ | True |
| 3 | https://www.youtube.com/robots.txt | * | https://artists.youtube/foundry/ | True |
| 4 | https://www.youtube.com/robots.txt | * | https://artists.youtube/intl/de_ALL/ | True |
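
The sample rows above list hosts such as `about.youtube` and `artists.youtube`, while the rules come from `www.youtube.com`. According to Google's robots reference, a robots.txt file applies only to the protocol, host, and port it is served from, so results for other hosts may not reflect how crawlers treat those URLs. One possible way to narrow the test, sketched with a hypothetical helper `same_host_urls`, is to keep only URLs on the same scheme and host as the robots.txt file before testing:

```python
from urllib.parse import urlsplit


def same_host_urls(robotstxt_url, urls):
    """Keep only URLs served from the same scheme and host as the robots.txt file."""
    robots = urlsplit(robotstxt_url)
    origin = (robots.scheme, robots.netloc.lower())
    kept = []
    for url in urls:
        parts = urlsplit(url)
        if (parts.scheme, parts.netloc.lower()) == origin:
            kept.append(url)
    return kept


candidate_urls = [
    "https://about.youtube/",
    "https://www.youtube.com/trends/",
    "https://artists.youtube/features/",
]
print(same_host_urls("https://www.youtube.com/robots.txt", candidate_urls))
```

In the notebook, the filtered list could be passed as `urls` instead of the full `sitemap['loc']` column; URLs on other hosts would need to be tested against their own robots.txt files.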


Does the website have URLs listed in the XML sitemap that are also disallowed by its robots.txt?

This is not necessarily a problem, since a rule may disallow a URL for some user agents only, but it is worth checking.
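
To tell URLs that are blocked for every user agent apart from those blocked for only some, the report rows can be grouped by URL. The sketch below assumes the rows are plain dictionaries with the same keys as the report columns (`user_agent`, `url_path`, `can_fetch`); the sample data and the helper name `blocked_by_url` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical report rows, shaped like the columns of `robots_report`
rows = [
    {"user_agent": "*", "url_path": "https://example.com/comment/1", "can_fetch": False},
    {"user_agent": "AdsBot", "url_path": "https://example.com/comment/1", "can_fetch": True},
    {"user_agent": "*", "url_path": "https://example.com/private/", "can_fetch": False},
    {"user_agent": "AdsBot", "url_path": "https://example.com/private/", "can_fetch": False},
    {"user_agent": "*", "url_path": "https://example.com/trends/", "can_fetch": True},
    {"user_agent": "AdsBot", "url_path": "https://example.com/trends/", "can_fetch": True},
]


def blocked_by_url(report_rows):
    """Map each blocked URL to the set of user agents that cannot fetch it."""
    blocked = defaultdict(set)
    for row in report_rows:
        if not row["can_fetch"]:
            blocked[row["url_path"]].add(row["user_agent"])
    return dict(blocked)


conflicts = blocked_by_url(rows)
all_agents = {row["user_agent"] for row in rows}

# URLs that no tested user agent is allowed to fetch deserve the closest look
blocked_for_all = [url for url, agents in conflicts.items() if agents == all_agents]
blocked_for_some = [url for url, agents in conflicts.items() if agents != all_agents]

print("Blocked for all user agents:", blocked_for_all)
print("Blocked for some user agents:", blocked_for_some)
```

With a pandas report, `robots_report.to_dict("records")` should produce rows in this shape.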

## Output

Get the URLs that cannot be fetched

### Filter result

In [20]:
df_report = robots_report[~robots_report['can_fetch']].reset_index(drop=True)
print("Row fetched:", len(df_report))
df_report.head(5)

Row fetched: 0


| | robotstxt_url | user_agent | url_path | can_fetch |
|---|---|---|---|---|
