Site Crawler incorrectly skips links when robots contains noindex #2989

Closed
1 task done
cabal95 opened this issue May 14, 2018 · 3 comments
Assignees
Labels
Priority: Low Affects a small number of Rock installations and will not be noticed by most users. Status: Confirmed It's clear what the subject of the issue is about, and what the resolution should be. Topic: Rock Internals Related to internal core stuff. Type: Bug Confirmed bugs or reports that are very likely to be bugs. x-Fixed in v8.0
Milestone

Comments

@cabal95
Member

cabal95 commented May 14, 2018

Prerequisites

  • Put an X between the brackets on this line if you have done all of the following:

Description

The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified. This is incorrect: links should be followed unless the robots meta tag specifies the nofollow option, or the link itself has a rel="nofollow" attribute.

I can submit a PR for this.

Suggested Action

Add support for the nofollow flag in the robots meta tag. At the same time, update the ParseLinks method to check each link for rel="nofollow" and, if found, skip that individual link.
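
As an illustration of the per-link check suggested above, here is a minimal sketch in Python. It is not Rock's code (the real ParseLinks method lives in the C# crawler); the names `LinkCollector` and `parse_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping rel="nofollow" links.

    Hypothetical helper for illustration only; not part of Rock.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attribute values can be None (e.g. <a href>), so normalize to "".
        attr_map = {name: (value or "") for name, value in attrs}
        href = attr_map.get("href", "").strip()
        # rel is a space-separated token list, e.g. rel="nofollow noopener".
        rel_tokens = attr_map.get("rel", "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def parse_links(html):
    """Return the hrefs of all links that a crawler may follow."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Treating rel as a token list rather than doing a plain substring match avoids skipping links whose rel value merely contains the text "nofollow" as part of another token.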

Expected behavior:

I want to build a page that has links to other pages that should be indexed but would not normally be found during a site crawl (for example, event pages whose links only show up after clicking a PostBack button, which the crawler cannot follow).

Additionally, there are a few pages on the site that we don't want indexed because they are little more than menu/link-only pages.

Actual behavior:

These link-only pages are indexed because I cannot have the crawler follow links but not index the page itself.

Versions

  • Rock Version: 7.3
  • Client Culture Setting: en-US
@cabal95 cabal95 added Type: Bug Confirmed bugs or reports that are very likely to be bugs. Status: Confirmed It's clear what the subject of the issue is about, and what the resolution should be. Priority: Low Affects a small number of Rock installations and will not be noticed by most users. Topic: Rock Internals Related to internal core stuff. labels May 14, 2018
@jonedmiston
Member

Can you provide a bit more info on your specific use case? I'm not sure I understand this line:

> The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified.

Where is this specified?

@cabal95
Member Author

cabal95 commented May 14, 2018

> Can you provide a bit more info on your specific use case?

We have about 300 calendar events, but only a handful will be indexed. The Events page (the one with the mini calendar and short descriptions of the events) only shows the events for the current month unless the user clicks the little arrow to go to the next month. That navigation is done via PostBack, which is a JavaScript link, so the crawler cannot reach this information. In fact, only about 25 of these events showed up in the index after crawling the site.

As a concrete example: come October, we add calendar events for our Christmas program in December. Unless these calendar items are linked from somewhere else (like the homepage), the site crawler will not find them until December rolls around and the Events page shows those items on the initial page load.

> Where is this specified?

https://github.com/SparkDevNetwork/Rock/blob/develop/Rock/UniversalSearch/Crawler/Crawler.cs#L156

So for example, with a normal site crawler (e.g. Google), if I don't want that specific page to be indexed I would add <meta name="robots" content="noindex" />. That would mean "do not index the content on this page, but do follow any links". If I wanted the search engine/crawler to not index and not follow links, I would specify: <meta name="robots" content="noindex, nofollow" /> to indicate "do not index, and do not follow links".

Currently, the Rock Site Crawler will not follow links if noindex is specified, which is incorrect behavior.
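
The semantics described above can be sketched as two independent decisions, one for indexing and one for following links. This is a minimal Python illustration, not the code in Crawler.cs; the names `RobotsMetaParser` and `robots_policy` are hypothetical. It also treats the `none` directive as shorthand for `noindex, nofollow`, which is how major search engines interpret it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration only; not part of Rock.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # content is a comma-separated list, e.g. "noindex, nofollow".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (should_index, should_follow) for a page.

    The two flags are decided separately: noindex alone must not
    stop the crawler from following the page's links.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    directives = parser.directives
    should_index = "noindex" not in directives and "none" not in directives
    should_follow = "nofollow" not in directives and "none" not in directives
    return should_index, should_follow
```

With this split, a link-only menu page marked `noindex` is left out of the index while the pages it links to are still crawled.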

Edited some text for clarity of meaning

@jonedmiston
Member

OK, that makes more sense. I was thinking you may be talking about a robots.txt file. OK to PR.

cabal95 added a commit to cabal95/Rock that referenced this issue May 22, 2018
…w"> tag to indicate that links should not be followed. (Issue SparkDevNetwork#2989)
@cabal95 cabal95 closed this as completed Jun 12, 2018
@crayzd92 crayzd92 added this to the v8 milestone Mar 3, 2022