Possible to generate incompatible format with Python RobotFileParser #78

Open
willplaehn opened this issue May 16, 2017 · 1 comment
@willplaehn

I've been parsing my own robots.txt file with Python and found an interesting compatibility scenario:

If you create multiple Robot records with the same user-agent, they are written out with a blank line between them, and Python's RobotFileParser stops reading the record at that blank line, so the directives after it are ignored when the file is read back in. I'm looking at Robots v3 and Python 3.5. Is this something you'd want to change or document?

https://github.com/python/cpython/blob/3.5/Lib/urllib/robotparser.py

Example robots.txt generated:

User-agent: *
Disallow: /one

Disallow: /two

Host: example.com

The work-around is simple -- you create a single Robot record with both rules so that robots.txt has no blank line:

User-agent: *
Disallow: /one
Disallow: /two

Host: example.com

To reproduce:

from urllib.robotparser import RobotFileParser
robots = RobotFileParser('http://example.com/robots.txt')
robots.read()
# /two is listed as disallowed, but this returns True because the parser
# stopped reading the record at the blank line
robots.can_fetch(useragent='', url='/two')
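
For a self-contained reproduction that doesn't depend on fetching a live site, the two formats above can be passed straight to the stdlib parser's parse() method (paths and host are just the placeholders from the examples above):

from urllib.robotparser import RobotFileParser

generated = [
    'User-agent: *',
    'Disallow: /one',
    '',
    'Disallow: /two',
    '',
    'Host: example.com',
]
merged = [
    'User-agent: *',
    'Disallow: /one',
    'Disallow: /two',
    'Host: example.com',
]

robots = RobotFileParser()
robots.parse(generated)
print(robots.can_fetch('*', '/two'))  # True: the blank line ended the record, so /two was never read

robots = RobotFileParser()
robots.parse(merged)
print(robots.can_fetch('*', '/two'))  # False: both Disallow rules are applied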
@some1ataplace

It seems that when generating robots.txt with django-robots, creating multiple records for the same user agent puts a blank line between their directives in the generated file. Python's RobotFileParser treats a blank line as the end of a record, so any directives after it are silently ignored when the file is read back in.

To work around this issue, you can put all of the rules into a single record, as you mentioned. Alternatively, you can modify the get_robots_txt() function in django-robots so that the generated file is compatible with RobotFileParser, i.e. each Disallow directive on its own line with no blank lines inside the record:

def get_robots_txt(self):
    # One Disallow directive per line, with no blank lines inside the record.
    lines = ['User-agent: *']
    lines.extend('Disallow: {}'.format(path) for path in ['/one', '/two'])
    lines.append('Host: example.com')
    return '\n'.join(lines)
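
Whatever shape that change ends up taking, the output can be sanity-checked against the stdlib parser. In the sketch below, RobotsView is only a stand-in class so the snippet runs on its own; it is not an actual django-robots class:

from urllib.robotparser import RobotFileParser

class RobotsView:
    # Stand-in for wherever get_robots_txt() ends up living.
    def get_robots_txt(self):
        lines = ['User-agent: *']
        lines.extend('Disallow: {}'.format(path) for path in ['/one', '/two'])
        lines.append('Host: example.com')
        return '\n'.join(lines)

parser = RobotFileParser()
parser.parse(RobotsView().get_robots_txt().splitlines())
assert not parser.can_fetch('*', '/one')   # /one is blocked
assert not parser.can_fetch('*', '/two')   # /two is blocked too; nothing after it is lost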
