ignore www prefix #614

Open
jetnet opened this issue Jun 18, 2019 · 4 comments

@jetnet

jetnet commented Jun 18, 2019

We need to crawl many Internet sites and ran into an issue with the www prefix: some sites redirect to their domain without www, others the other way round.
Unfortunately, NC cannot handle this case in a general way (globally). We can normalize URLs by removing the www prefix, and if a site then redirects to www.some.site again, the collector follows, since it is configured to follow subdomains. But some sites are available with the www prefix only (e.g. https://www.pony.at/ does not work without www), so we would miss such sites again.
So I'm looking for a general solution to this problem.
Any ideas are very welcome! Thank you!

Common requirements for a crawler:

<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false" includeSubdomains="true">
@jetnet
Author

jetnet commented Jun 18, 2019

I'm not alone :) #596

@essiembre
Contributor

One solution could be to define two crawlers, one with the URL normalization to always use www and the other without the www. Then you would have to test each start URL to figure out under which one they belong.

If you do not know up front all the domains that will be crawled, it could get tricky for sure. We could make this a feature request, but I am not sure what solution could be generic enough, especially since www could technically be a subdomain that serves totally different content (even though I have never encountered such a site).

Maybe we could have a smart URL Normalizer where you can indicate your preference (www or not) and upon seeing a domain for the first time, if it does not suit your preference, it will first test if its alternate version exists before actually doing the normalization (making an extra call). I guess this could work as long as we can assume doing that test once per domain is valid for all URLs on that domain. An example:

  1. Let's say you prefer www
  2. The crawler encounters https://www.aaa.com/111.html, so it leaves it unchanged and remembers that domain to be OK.
  3. The crawler encounters https://aaa.com/222.html, it knows you prefer www and it knows it already exists, so it normalizes it to https://www.aaa.com/222.html.
  4. The crawler encounters https://bbb.com/333.html. It does not know if www exists for it so it makes an extra call to find out:
    • If it exists, it normalizes it to https://www.bbb.com/333.html and remembers it.
    • If it does not exist, it leaves it as is and remembers not to check again for that domain (never convert URLs on that domain to www).

Do you think an (optional) feature like that would do it?
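
The steps above can be sketched roughly as follows. This is only an illustration in Python, not Norconex code: the class name `WwwPreferenceNormalizer`, the `host_exists` callback and the `dns_host_exists` helper are all hypothetical, and "exists" is assumed here to mean "the host name resolves in DNS", which is weaker than checking that the site actually serves content there.

```python
import socket
from typing import Callable, Dict
from urllib.parse import urlsplit, urlunsplit


def dns_host_exists(host: str) -> bool:
    """Assumed 'exists' test: the host name resolves in DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


class WwwPreferenceNormalizer:
    """Hypothetical sketch; not part of any Norconex API."""

    def __init__(self, prefer_www: bool,
                 host_exists: Callable[[str], bool] = dns_host_exists):
        self.prefer_www = prefer_www
        self.host_exists = host_exists
        # bare domain -> True if URLs may be rewritten to the preferred form
        self._decisions: Dict[str, bool] = {}

    def normalize(self, url: str) -> str:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if not host:
            return url
        bare = host[4:] if host.startswith("www.") else host
        preferred = "www." + bare if self.prefer_www else bare
        if host == preferred:
            # Step 2: already in the preferred form, remember the domain as OK.
            self._decisions.setdefault(bare, True)
            return url
        if bare not in self._decisions:
            # Step 4: one extra check per domain, then remember the answer.
            self._decisions[bare] = self.host_exists(preferred)
        if not self._decisions[bare]:
            return url
        # Step 3: rewrite to the preferred host (user info, if any, is dropped).
        netloc = preferred if parts.port is None else f"{preferred}:{parts.port}"
        return urlunsplit((parts.scheme, netloc, parts.path,
                           parts.query, parts.fragment))
```

Passing the existence check in as a callback keeps the "one test per domain" assumption easy to swap out, for example for an HTTP HEAD request instead of a DNS lookup.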

@jetnet
Author

jetnet commented Jun 22, 2019

Maybe we could simplify the logic as follows.
I'd add two new options:

  • <startURLs includeWWW=[true|false] ...: works like includeSubdomains="true", but with www as the only subdomain allowed
    • alternatively, without a new parameter but with a new value: <startURLs includeSubdomains=[true|false|www]: I guess it's self-explanatory :)
  • <startURLs includeParentWWW=[true|false] ...: allows indexing the parent domain when the current domain starts with www
    • alternatively, a more general way: <startURLs includeParentDomains=[levelN|www] ...: where
      • levelN: the domain level up to which indexing is allowed if the crawler gets redirected to parent domains, up to the top level (level = 1)
      • www: allows redirects to the single parent domain when the current domain starts with www

I assume the includeParentDomains option would not be that easy to implement if it allows going up to the top-level domain. I'd be happy with includeParentDomains=[true|false|www], allowing the single parent domain only (for any domain, or only when the current one starts with www).

What do you think? Thanks a lot!
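
To make the proposed semantics concrete, here is a minimal sketch of the scope check these options would imply. It is written in Python purely for illustration; `in_scope` and its parameters are hypothetical names mirroring the proposed `includeSubdomains=[true|false|www]` and `includeParentDomains=[true|false|www]` values, and none of this exists in the collector. The guard against stepping up to a top-level domain is deliberately crude (it only requires a dot in the parent) and does not consult a public suffix list.

```python
from urllib.parse import urlsplit


def in_scope(start_url: str, candidate_url: str,
             include_subdomains: str = "false",
             include_parent_domains: str = "false") -> bool:
    """Hypothetical scope test for the proposed option values."""
    start = urlsplit(start_url).hostname or ""
    cand = urlsplit(candidate_url).hostname or ""
    if not start or not cand:
        return False
    if cand == start:
        return True
    # includeSubdomains="true": any subdomain of the start host.
    if include_subdomains == "true" and cand.endswith("." + start):
        return True
    # includeSubdomains="www": only the www subdomain of the start host.
    if include_subdomains == "www" and cand == "www." + start:
        return True
    # The single parent domain: the start host minus its first label.
    parent = start.split(".", 1)[1] if "." in start else ""
    if "." not in parent:
        return False  # crude guard: never step up to a bare TLD
    # includeParentDomains="www": parent allowed only from a www host.
    if include_parent_domains == "www":
        return start.startswith("www.") and cand == parent
    # includeParentDomains="true": parent allowed from any host.
    if include_parent_domains == "true":
        return cand == parent
    return False
```

For example, with a start URL of https://www.aaa.com/ and `include_parent_domains="www"`, a redirect to https://aaa.com/1 would stay in scope, while the same redirect from https://blog.aaa.com/ would not.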

@essiembre
Contributor

Plenty of good ideas. I just marked this as a feature request.
