Site Crawler incorrectly skips links when robots contains noindex #2989

cabal95 · 2018-05-14T22:23:13Z

Prerequisites

Put an X between the brackets on this line if you have done all of the following:
- Can you reproduce the problem on a fresh install or the demo site?
- Did you include your Rock version number and client culture setting?
- Did you perform a cursory search to see if your bug or enhancement is already reported?

Description

The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified. This is incorrect: links should be followed unless the robots meta tag specifies the nofollow option, or the link itself has a rel="nofollow" attribute.

I can submit a PR for this.

Suggested Action

Add support for the nofollow flag in the robots meta tag. At the same time, update the ParseLinks method to check each link for rel="nofollow" and, if it is found, skip that individual link.
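To make the per-link check concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Rock's C# ParseLinks method: `LinkCollector`, `parse_links`, and the `page_allows_follow` parameter are hypothetical names, and the sketch assumes the page-level decision (robots meta nofollow) has already been made by the caller.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attribute values can be None (e.g. <a href>), so normalize to "".
        attr_map = {name: (value or "") for name, value in attrs}
        href = attr_map.get("href", "").strip()
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="noopener nofollow".
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "nofollow" in rel_tokens:
            return  # skip only this individual link
        self.links.append(href)


def parse_links(html, page_allows_follow=True):
    """Return the links a crawler may follow from this page (hypothetical helper)."""
    if not page_allows_follow:
        # Page-level robots meta said nofollow: follow nothing.
        return []
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

The rel value is split into tokens rather than compared as a whole string, so a link with several rel values is still skipped when one of them is nofollow.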

Expected behavior:

I want to build a page that has links to other pages that should be indexed but would not normally be found during a site crawl (for example, event pages whose links only show up after clicking a PostBack button, which cannot be indexed).

Additionally, there are a few pages on the site that we don't want indexed because they are little more than menu/link-only pages.

Actual behavior:

These link-only pages are indexed because I cannot have the crawler follow links but not index the page itself.

Versions

Rock Version: 7.3
Client Culture Setting: en-US


jonedmiston · 2018-05-14T22:32:20Z

Can you provide a bit more info on your specific use case? I'm not sure I understand this line:

The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified.

Where is this specified?

cabal95 · 2018-05-14T22:43:19Z

Can you provide a bit more info on your specific use case?

We have about 300 calendar events, but only a handful will be indexed. The Events page (the one with the mini calendar and short descriptions of the events) only shows events for the current month unless the user clicks the little arrow to go to the next month. That navigation is done via PostBack, which is a JavaScript link, so the crawler cannot reach the other months. In fact, only about 25 of these events showed up in the index after crawling the site.

So a concrete example: come October, we add calendar events for our Christmas program in December. Unless these calendar items are linked from somewhere else (like the homepage), the site crawler will not find them until December rolls around and the Events page shows those items on the initial page load.

Where is this specified?

https://github.com/SparkDevNetwork/Rock/blob/develop/Rock/UniversalSearch/Crawler/Crawler.cs#L156

So for example, with a normal site crawler (e.g. Google), if I don't want that specific page to be indexed I would add <meta name="robots" content="noindex" />. That would mean "do not index the content on this page, but do follow any links". If I wanted the search engine/crawler to not index and not follow links, I would specify: <meta name="robots" content="noindex, nofollow" /> to indicate "do not index, and do not follow links".

Currently, the Rock Site Crawler will not follow links if noindex is specified, which is incorrect behavior.
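As a rough illustration of the semantics described above, the following Python sketch treats "index" and "follow" as two independent decisions read from the robots meta tag. It is not the code in Crawler.cs; `RobotsMetaParser` and `robots_policy` are hypothetical names. It also assumes the common convention that the `none` directive is shorthand for `noindex, nofollow`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None, so normalize to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # content is a comma-separated list, e.g. "noindex, nofollow".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (should_index, should_follow) for a page (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    # "none" is assumed to mean "noindex, nofollow".
    should_index = not ({"noindex", "none"} & found)
    should_follow = not ({"nofollow", "none"} & found)
    return should_index, should_follow
```

With this split, a link-only menu page marked noindex is left out of the index while its links are still crawled, which is the behavior requested in this issue.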

Edited some text for clarity of meaning

jonedmiston · 2018-05-14T22:56:04Z

OK, that makes more sense. I was thinking you might be talking about a robots.txt file. OK to PR.


jonedmiston assigned cabal95 May 14, 2018

cabal95 added a commit to cabal95/Rock that referenced this issue May 22, 2018

+ Updated Site Crawler to honor the <meta name="rebot" value="nofollow"> tag to indicate that links should not be followed. (Issue SparkDevNetwork#2989)

6675a40

cabal95 mentioned this issue May 22, 2018

+ Updated Site Crawler to honor the <meta name="rebot" value="nofollo… #3015

Merged

nairdo added the x-Fixed in v8.0 label May 26, 2018

cabal95 closed this as completed Jun 12, 2018

crayzd92 added this to the v8 milestone Mar 3, 2022
