This repository has been archived by the owner on Dec 8, 2017. It is now read-only.

pa11y-crawl does not work with subdomains #3

Open
gemfarmer opened this issue Sep 29, 2016 · 5 comments

@gemfarmer

I've noticed that pa11y-crawl gives the following error when attempting to crawl a URL with a subdomain.

. is not an html document, skipping

For example, https://login.gov crawls successfully, but for https://useiti.doi.gov or https://18f.gsa.gov the crawler finds no valid HTML to scan.

When the same projects are crawled on localhost, they crawl properly.

This is a problem on Federalist URLs, because we end up seeing the following:

 ||   . is not an html document, skipping
 ||   federalist.18f.gov is not an html document, skipping
 ||   federalist.18f.gov/preview is not an html document, skipping
 ||   federalist.18f.gov/preview/18F is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov/master is not an html document, skipping
 ||   federalist.18f.gov/preview/18F/18f.gsa.gov/master/index.html is not an html document, skipping
@gemfarmer
Author

gemfarmer commented Sep 29, 2016

After reviewing this with @waldoj, it looks like this is not related to subdomains (that was a coincidence), but is likely related to how pa11y-crawl opts to use a sitemap if one is available. This isn't a problem when the project is run over localhost.

This is the likely offending line. It is possible that the sitemap URLs are being saved in $TEMP_DIR in a strange manner.
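
One way to test the sitemap theory is to look at which URLs the sitemap actually lists. The sketch below is only an illustration and is not part of pa11y-crawl: the function names `sitemap_urls` and `off_host_urls` and the sample sitemap are hypothetical. It also assumes, without confirmation, that a preview build might list URLs on a different host than the one being crawled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def off_host_urls(urls: list[str], expected_host: str) -> list[str]:
    """Return the URLs whose host differs from the host being crawled."""
    return [u for u in urls if urlparse(u).netloc != expected_host]


# Hypothetical sitemap for illustration only; not copied from a real site.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://18f.gsa.gov/</loc></url>
  <url><loc>https://18f.gsa.gov/about/</loc></url>
</urlset>"""

if __name__ == "__main__":
    urls = sitemap_urls(EXAMPLE)
    print(urls)
    # Any URL printed here points somewhere other than the crawled host.
    print(off_host_urls(urls, "federalist.18f.gov"))
```

If the second list is non-empty for a real sitemap, the crawler may be fetching pages from a different host than the one requested, which would be worth ruling out before looking at $TEMP_DIR itself.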

cc @stvnrlly

@stvnrlly stvnrlly self-assigned this Oct 4, 2016
@syndy1989

Hi,
I'm new to pa11y accessibility testing. I'm trying to use pa11y-crawl [URL] to find all HTML pages and run pa11y on each one, but I'm getting the error below. Am I missing anything? Any advice would be helpful. Thanks in advance.

C:\Windows\system32>pa11y-crawl nature.com
'bash' is not recognized as an internal or external command,
operable program or batch file.

@stvnrlly
Member

@syndy1989 Hi there!

As an initial matter, you should know that pa11y-crawl is both experimental and unsupported, which makes it pretty fragile. You may have better success with one of the more official pa11y options, such as the "webservice".

Regarding the error that you're seeing: it looks like you're running on Windows, while this tool currently works on macOS. I'm not that familiar with the Windows command line, but I don't believe it supports bash natively. If you're on Windows 10, there's now a way to create an Ubuntu Linux environment and use bash. That may allow you to use this tool (though, because it's unsupported, you may still have issues).
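
A quick way to confirm whether a shell environment can see bash at all is to look it up on the PATH. This is a minimal sketch using only Python's standard library. The helper name `find_bash` is hypothetical, and finding bash does not guarantee that pa11y-crawl will run.

```python
import shutil
import sys


def find_bash():
    """Return the path to a bash executable on PATH, or None if absent."""
    return shutil.which("bash")


if __name__ == "__main__":
    path = find_bash()
    if path is None:
        # Matches the situation behind "'bash' is not recognized ..." on Windows.
        sys.exit("bash was not found on PATH; a bash-based tool cannot start here")
    print(f"bash found at {path}")
```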

@syndy1989

@stvnrlly
Hi there, I'm actually using Windows Server 2012. I tried downloading Cygwin on Windows to run bash commands.
I've noticed that pa11y-crawl gives the following error when attempting to crawl a URL with a subdomain.

. is not an html document, skipping

Any advice on this would be helpful. Thanks in advance

@stvnrlly
Member

I'm afraid that I won't be able to help troubleshoot that issue. If we're able to spend time working on this project in the future, we may be able to fix the problem that caused this issue to be opened in the first place, which may help with what you're seeing.
