Cannot crawl all URLs from a sitemap #758

Closed
peter-chan-hkmci opened this issue Jul 5, 2021 · 5 comments

@peter-chan-hkmci

My client is using version 2.8.2-SNAPSHOT and found that some URLs were not updated in the search engine.

For example: https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13

I checked that the crawler did not fetch this URL, even though it is included in the sitemap.

My client doesn't want to change the crawler program much.

Is there any workaround or hotfix for this version?

The config is below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="application-webcrawler-ec">
    <progressDir>./output/progress</progressDir>
    <logsDir>./output/logs</logsDir>

    <crawlers>

        <crawler id="webcrawler-ec_M2">

            <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
                <sitemap>https://store.acer.com/sitemaps/DE/sitemap.xml</sitemap>
            </startURLs>

            <workDir>./output</workDir>
            <maxDepth>1</maxDepth>
            <userAgent>gsa-crawler</userAgent>
            <sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory">
                <path>https://store.acer.com/</path>
            </sitemapResolverFactory>

            <numThreads>16</numThreads>
            <delay default="0" scope="thread" />

            <documentFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
                    ^https://store\.acer\.com/[^/-]+-[^/-]+/.*
                </filter>

                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
                    ^https://store\.acer\.com/[^/-]+-[^/-]+/$
                </filter>
            </documentFilters>

            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">^https://store\.acer\.com/[^/-]+-[^/-]+/.*</filter>
            </referenceFilters>

            <importer>
                <preParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
                        <restrictTo field="document.contentType">text/html</restrictTo>
                        <stripBetween>
                            <start><![CDATA[<!--googleoff: index-->]]></start>
                            <end><![CDATA[<!--googleon: index-->]]></end>
                        </stripBetween>
                    </transformer>
                </preParseHandlers>
                <postParseHandlers>
                    <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter" onMatch="exclude" fields="productPN" />

                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="noop">
                        <constant name="collection">ec</constant>
                        <constant name="language"></constant>
                        <constant name="country"></constant>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
                        <replace fromField="document.reference" toField="language" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$1</toValue>
                        </replace>
                        <replace fromField="document.reference" toField="country" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$2</toValue>
                        </replace>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference,document.contentType,collection,language,country,title,description,keywords,robots,viewport,sectionName,productPN,price,sq,productGroup,quickSpecs,productImage</fields>
                    </tagger>
                </postParseHandlers>
            </importer>

            <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
                <directory>./crawled</directory>
                <pretty>false</pretty>
                <docsPerFile>1000</docsPerFile>
                <compress>false</compress>
                <splitAddDelete>true</splitAddDelete>
            </committer>
        </crawler>
    </crawlers>

</httpcollector>
@essiembre
Contributor

There are a few reasons this can happen. For instance, maybe a URL did not get updated because the sitemap indicated it did not change since the previous crawl. What do the logs say about those URLs?
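
For reference, a sitemap <url> entry carrying such a change hint typically looks like this (a generic sketch; the lastmod and changefreq values are made up, not taken from your sitemap):

<url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
    <lastmod>2021-06-01</lastmod>
    <changefreq>daily</changefreq>
</url>

If the crawler's stored state already has that URL with a fetch date after the lastmod value, it may decide there is nothing new to re-process.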

@peter-chan-hkmci
Author

To rule out the last-modified date as the cause, I duplicated the application and cleared all of the caches, then ran the test again.
However, no luck :(

@essiembre
Contributor

What about the logs? Maybe increase the verbosity if you have to, and look for what happened to the missing URLs. With the proper log level, every URL encountered should have an entry in the logs.

@peter-chan-hkmci
Author

peter-chan-hkmci commented Jul 7, 2021

I tried to increase the verbosity by changing the loggers below to DEBUG:

# log4j.properties
log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG

But I still cannot find the missing URLs, such as https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13 (I searched for it using the keyword an517-51).
[screenshot: searching the crawler log for an517-51]

I confirm that the URL above is in the sitemap.
[screenshot: sitemap entry containing an517-51]

Attached is the full log:
webcrawler-ec_95_M2.log

@essiembre
Contributor

I was able to reproduce this with what you shared. It turns out that <image> tags in your sitemap were making the parser fail on <url> entries containing them. I fixed the sitemap parser and made a new snapshot release (v2.x). Please give it a try and confirm.
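
For illustration, the kind of entry that triggered the failure would look roughly like this, using the standard Google image sitemap extension (the image URL below is made up; xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" is declared on the <urlset> element):

<url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
    <image:image>
        <image:loc>https://store.acer.com/media/catalog/product/example-image.jpg</image:loc>
    </image:image>
</url>

Before the fix, <url> entries containing such child elements were the ones the parser failed on.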
