
duplicate pages with www and non www (http and https) #596

Closed
HappyCustomers opened this issue Apr 22, 2019 · 4 comments

Comments

@HappyCustomers commented Apr 22, 2019

Dear Mr. Pascal Essiembre,

I have a few websites where redirection from non-WWW to WWW and from HTTP to HTTPS is not enabled. This results in duplicate content.

Scenario 1
For example, both of the URLs below get crawled and stored in the database, resulting in duplicate pages.

http://www.example.com/home.html
http://example.com/home.html

Scenario 2
In this scenario, both URLs serve the same page content; one is secure (HTTPS) and the other is not (HTTP).

https://xyz.com/aboutus.html
http://xyz.com/aboutus.html

How do I configure the collector to crawl only the WWW URL in scenario 1 and only the HTTPS URL in scenario 2?

I will be loading hundreds of URLs from a file. From each URL I strip www and submit it as http://<domain>/. If redirection to https or www is enabled, the collector fetches only the unique pages; otherwise it fetches both the www and non-www URLs, resulting in duplicate page content.

The canonical link detector is enabled (ignore is set to false):
<canonicalLinkDetector ignore="false" />

Thank You

@HappyCustomers (Author) commented Apr 25, 2019

Gentle reminder: could I please get a solution to this question?
Thank You

@essiembre (Contributor) commented:

You can use stayOnDomain and stayOnProtocol, or add reference filters, to make sure the crawler stays on URLs with https and www. If your website sometimes switches from one format to the other and you are afraid that may exclude otherwise valid links, you can use the GenericURLNormalizer instead, provided you know all pages are available with both www and https.

For example, adding addWWW and secureScheme to the default set of normalization rules would do it:

  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort,
        encodeNonURICharacters, addWWW, secureScheme
    </normalizations>
  </urlNormalizer>
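
If you go the filtering route instead, stayOnDomain and stayOnProtocol are set as attributes on the start URLs. A minimal sketch, assuming the HTTP Collector 2.x startURLs attributes (example.com is a placeholder):

  <!-- Sketch only: keeps the crawl on the same domain and scheme as the
       start URL, so http:// and non-www variants are not followed. -->
  <startURLs stayOnDomain="true" stayOnProtocol="true">
    <url>https://www.example.com/</url>
  </startURLs>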

@HappyCustomers (Author) commented:

Thanks for the advice. I have a quick question.

The Scenario is as follows:

I load 100 URLs from a file, removing www first; all the URLs will be http-only because of the sub-domain issue (#563).

The input URLs for all the websites will be in the following format:
http://xyz.com

There are four possible variants for each URL:

  1. http://xyz.com
  2. http://www.xyz.com
  3. https://xyz.com
  4. https://www.xyz.com

If the server has proper redirections, the collector extracts content from only one of these four URL combinations. If the server is not configured properly, the collector extracts more than one set, resulting in duplicate website content.

Adding stayOnDomain and stayOnProtocol will result in some URLs being excluded.

What is the impact on a website that has neither www nor https when I add addWWW and secureScheme? Will the collector crawl or exclude such websites?

Thank you

@essiembre (Contributor) commented:

If the same site has links that switch from one format to the other, but a given page does not always exist in both formats, then yes, you could miss some pages. Often, sites redirect all pages to a single format, or always offer both (the same site simply served up differently).

If you need more control, you may have to use reference filters instead. I invite you to try the normalizer first and see whether you actually get the problem you describe. There may be no such problems, or only a handful that reference filters can handle.
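
A reference filter along these lines could restrict a crawl to a single scheme/host combination. This is a hedged sketch using the collector-core RegexReferenceFilter; the regex and the xyz.com host are placeholders to adapt:

  <referenceFilters>
    <!-- Sketch only: keep https://www URLs for this site, reject all others. -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include">https://www\.xyz\.com/.*</filter>
  </referenceFilters>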
