
duplicate pages with www and non www (http and https) #596

Closed
HappyCustomers opened this issue Apr 22, 2019 · 4 comments

Comments

@HappyCustomers commented Apr 22, 2019

Dear Mr. Pascal Essiembre,

I have a few websites where redirection from non-WWW to WWW and from HTTP to HTTPS is not enabled. This results in duplicate content.

Scenario 1
For example, both of the URLs below get crawled and stored in the database, resulting in duplicate pages.

http://www.example.com/home.html
http://example.com/home.html

Scenario 2
In this scenario, both URLs serve the same page content; one is secure (HTTPS) and the other is not (HTTP).

https://xyz.com/aboutus.html
http://xyz.com/aboutus.html

How do I configure the collector to crawl only the WWW URL in scenario 1 and only the HTTPS URL in scenario 2?

I will be loading hundreds of URLs from a file. From each URL I strip www and submit it as http://<domain>/. If redirection to https or www is enabled, the collector fetches only the unique pages; otherwise it fetches both the www and non-www URLs, resulting in duplicate page content.

The canonical link detector is enabled (ignore is set to false):
<canonicalLinkDetector ignore="false" />

Thank You

@HappyCustomers (Author) commented Apr 25, 2019

Gentle reminder: could I please get a solution to this question?
Thank You

@essiembre (Contributor) commented:

You can use stayOnDomain and stayOnProtocol, or add reference filters, to make sure the crawler stays on URLs with https and www. If your website sometimes switches from one format to the other and you are afraid that may exclude otherwise valid links, you can use the GenericURLNormalizer instead, provided you know all pages are available with both www and https.

For example, adding addWWW and secureScheme to the default set of normalization rules would do it:

  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort,
        encodeNonURICharacters, addWWW, secureScheme
    </normalizations>
  </urlNormalizer>
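
If you go the filtering route instead, stayOnDomain and stayOnProtocol are set as attributes on the start URLs. A minimal sketch, assuming the HTTP Collector 2.x startURLs attributes (example.com is a placeholder):

  <!-- Sketch only: keeps the crawl on the same domain and scheme as the
       start URL, so http:// and non-www variants are not followed. -->
  <startURLs stayOnDomain="true" stayOnProtocol="true">
    <url>https://www.example.com/</url>
  </startURLs>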

@HappyCustomers (Author) commented:

Thanks for the advice. I have a quick question.

The Scenario is as follows:

I load 100 URLs from a file, removing www first; all the URLs will be http-only because of the sub-domain issue (#563).

The input URLs for all the websites will be in the following format:
http://xyz.com

There are four possible variants for each URL:

  1. http://xyz.com
  2. http://www.xyz.com
  3. https://xyz.com
  4. https://www.xyz.com

If the server has proper redirections, the collector extracts content from only one of these four URL combinations. If the server is not configured properly, the collector extracts more than one set, resulting in duplicate website content.

Adding stayOnDomain and stayOnProtocol will result in some URLs being excluded.

What is the impact on a website that has neither www nor https when I add addWWW and secureScheme? Will the collector crawl or exclude such websites?

Thank you

@essiembre (Contributor) commented:

If the same site has links that switch from one format to the other, but a given page does not always exist in both formats, then yes, you could miss some pages. Often, sites redirect all pages to a single format, or always offer both (the same site simply served up differently).

If you need more control, you may have to use reference filters instead. I invite you to try the normalizer first and see whether you actually get the problem you describe. There may be no such problems, or only a handful that reference filters can handle.
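
A reference filter along these lines could restrict a crawl to a single scheme/host combination. This is a hedged sketch using the collector-core RegexReferenceFilter; the regex and the xyz.com host are placeholders to adapt:

  <referenceFilters>
    <!-- Sketch only: keep https://www URLs for this site, reject all others. -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include">https://www\.xyz\.com/.*</filter>
  </referenceFilters>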
