Cannot crawl all URLs from a sitemap #758

Closed
peter-chan-hkmci opened this issue Jul 5, 2021 · 5 comments

@peter-chan-hkmci

My client is using version 2.8.2-SNAPSHOT and found that some URLs were not updated in the search engine.

For example: https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13

I checked that the crawler did not fetch this URL, even though it is included in the sitemap.

My client doesn't want to change the crawler program much.

Is there any workaround or hotfix for this version?

The config is below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="application-webcrawler-ec">
    <progressDir>./output/progress</progressDir>
    <logsDir>./output/logs</logsDir>

    <crawlers>

        <crawler id="webcrawler-ec_M2">

            <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
                <sitemap>https://store.acer.com/sitemaps/DE/sitemap.xml</sitemap>
            </startURLs>

            <workDir>./output</workDir>
            <maxDepth>1</maxDepth>
            <userAgent>gsa-crawler</userAgent>
            <sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory">
                <path>https://store.acer.com/</path>
            </sitemapResolverFactory>

            <numThreads>16</numThreads>
            <delay default="0" scope="thread" />

            <documentFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
                    ^https://store\.acer\.com/[^/-]+-[^/-]+/.*
                </filter>

                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
                    ^https://store\.acer\.com/[^/-]+-[^/-]+/$
                </filter>
            </documentFilters>

            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">^https://store\.acer\.com/[^/-]+-[^/-]+/.*</filter>
            </referenceFilters>

            <importer>
                <preParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
                        <restrictTo field="document.contentType">text/html</restrictTo>
                        <stripBetween>
                            <start><![CDATA[<!--googleoff: index-->]]></start>
                            <end><![CDATA[<!--googleon: index-->]]></end>
                        </stripBetween>
                    </transformer>
                </preParseHandlers>
                <postParseHandlers>
                    <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter" onMatch="exclude" fields="productPN" />

                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="noop">
                        <constant name="collection">ec</constant>
                        <constant name="language"></constant>
                        <constant name="country"></constant>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
                        <replace fromField="document.reference" toField="language" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$1</toValue>
                        </replace>
                        <replace fromField="document.reference" toField="country" regex="true">
                            <fromValue><![CDATA[^https:\/\/store\.acer\.com\/([^/-]+)-([^/-]+)\/.*]]></fromValue>
                            <toValue>$2</toValue>
                        </replace>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference,document.contentType,collection,language,country,title,description,keywords,robots,viewport,sectionName,productPN,price,sq,productGroup,quickSpecs,productImage</fields>
                    </tagger>
                </postParseHandlers>
            </importer>

            <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
                <directory>./crawled</directory>
                <pretty>false</pretty>
                <docsPerFile>1000</docsPerFile>
                <compress>false</compress>
                <splitAddDelete>true</splitAddDelete>
            </committer>
        </crawler>
    </crawlers>

</httpcollector>
@essiembre
Contributor

There are a few reasons this can happen. For instance, maybe a URL did not get updated because the sitemap indicated it did not change since the previous crawl. What do the logs say about those URLs?
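
For reference, a sitemap <url> entry carrying such a change hint typically looks like this (a generic sketch; the lastmod and changefreq values are made up, not taken from your sitemap):

<url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
    <lastmod>2021-06-01</lastmod>
    <changefreq>daily</changefreq>
</url>

If the crawler's stored state already has that URL with a fetch date after the lastmod value, it may decide there is nothing new to re-process.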

@peter-chan-hkmci
Author

To rule out the last-modified date as the cause, I duplicated the application and cleared all of the caches, then ran the test again.
However, no luck :(

@essiembre
Contributor

What about the logs? Maybe increase the verbosity if you have to, and look for what happened to the missing URLs. With the proper log level, every URL encountered should have an entry in the logs.

@peter-chan-hkmci
Author

peter-chan-hkmci commented Jul 7, 2021

I tried to increase the verbosity by changing the loggers below to DEBUG:

# log4j.properties
log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG

But I still cannot find the missing URLs, such as https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13 (I searched for it using the keyword an517-51).
[screenshot: searching the crawler log for an517-51]

I confirm that the URL above is in the sitemap.
[screenshot: sitemap entry containing an517-51]

Attached is the full log:
webcrawler-ec_95_M2.log

@essiembre
Contributor

I was able to reproduce this with what you shared. It turns out that <image> tags in your sitemap were making the parser fail on <url> entries containing them. I fixed the sitemap parser and made a new snapshot release (v2.x). Please give it a try and confirm.
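
For illustration, the kind of entry that triggered the failure would look roughly like this, using the standard Google image sitemap extension (the image URL below is made up; xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" is declared on the <urlset> element):

<url>
    <loc>https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13</loc>
    <image:image>
        <image:loc>https://store.acer.com/media/catalog/product/example-image.jpg</image:loc>
    </image:image>
</url>

Before the fix, <url> entries containing such child elements were the ones the parser failed on.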
