duplicate pages with www and non www &( http and https) #596
Gentle reminder. May I please get a solution to this question?
You can use a URL normalizer. For example, adding:

```xml
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW, secureScheme
  </normalizations>
</urlNormalizer>
```
Thanks for the advice. I have a quick question. The scenario is as follows: I load 100 URLs from a file after removing `www`, so the input URLs for all the websites are in the format `http://<domain>/`. There are four possible variants for each URL (www/non-www crossed with http/https). If the server has proper redirections, the collector will extract only one set of content from these four URL combinations. If the server is not configured properly, the collector extracts more than one set, resulting in duplicate website content. If I add `addWWW` and `secureScheme`, what is the impact on a website that does not have a `www` or `https` version? Thank you
If the same site has links that go from one format to the other but each page does not always exist in both formats, then yes, you could miss some. Often sites will redirect all pages to a single format, or always offer both (the same site just served up differently). If you need more control, you may have to use reference filters instead. I invite you to try it first and see if you get the problem you describe. There might be no such problems, or only a handful that reference filters can handle.
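As a sketch of the reference-filter alternative mentioned above (this assumes the `RegexReferenceFilter` class from Norconex Collector Core; the regex is a hypothetical example you would adapt to your own domains), you could accept only the `https://www.` form of each URL so the other three variants are never crawled:

```xml
<referenceFilters>
  <!-- Hypothetical example: keep only references starting with https://www. -->
  <!-- Everything else (http, non-www) is rejected before download. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">
    https://www\..*
  </filter>
</referenceFilters>
```

Note that an include-only filter like this also rejects any off-site references, which is usually desirable when crawling a fixed list of sites.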
Dear Mr. Pascal Essiembre,
I have a few websites where redirection is not enabled from non-WWW to WWW and from http to https. This results in duplicate content.
Scenario 1
For example, both of the below URLs are crawled and stored in the database, resulting in duplicate pages.
Scenario 2
In this scenario, both URLs serve the same page content; one URL is secure (https) and the other is non-secure (http).
How do I configure the collector to crawl only the WWW URL in scenario 1 and only the https URL in scenario 2?
I will be loading hundreds of URLs from a file. From all the URLs I will be removing `www` and sending them as `http://<domain>/`. If redirection to https or www is enabled, the collector fetches the unique pages; otherwise the collector fetches both the www and non-www URLs, resulting in duplicate page content.

The canonical link detector's `ignore` attribute is set to false:

```xml
<canonicalLinkDetector ignore="false" />
```
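Putting the pieces together, here is a minimal sketch of the normalizer approach suggested elsewhere in this thread (only `addWWW` and `secureScheme` shown; element placement follows the usual HTTP Collector crawler configuration), which rewrites every `http://<domain>/` reference to `https://www.<domain>/` before the crawler checks for duplicates:

```xml
<!-- Sketch: normalize all four host/scheme variants to a single form. -->
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>addWWW, secureScheme</normalizations>
</urlNormalizer>
<!-- Keep honoring canonical links where sites declare them. -->
<canonicalLinkDetector ignore="false" />
```

This only helps when the `https://www.` variant actually serves the content; for sites that exist only as non-www or http-only, per-site reference filters may be needed instead.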
Thank You