Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sitemaps with more than 50,000 items #10321

Open
wants to merge 18 commits into
base: develop
Choose a base branch
from

Conversation

jeromeroucou
Copy link
Contributor

@jeromeroucou jeromeroucou commented Feb 14, 2024

What this PR does / why we need it:

Actual generation sitemap file doesn't take care the numbers of entries. But sitemap specification indicate "each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes)" (see sitemap.org)

Which issue(s) this PR closes:

Closes #8936

Special notes for your reviewer:

Sitemap generation is done by sitemapgen4j library, which has the advantage of simplifying the code. I also use the new java 8 date formatter DateTimeFormatter which is threadsafe.
I make this PR as a draft because issue have been moved in "SPRINT READY" in IQSS Dataverse Project.

Suggestions on how to test this:

A new Unit Test have been added. And if the code is lunch into a running Dataverse, you can also call the API (for ex with curl -X POST http://localhost:8080/api/admin/sitemap).

Is there a release notes update needed for this change?:

No, I don't thinks.

Additional documentation:

Installation configuration need to be updated with the new file name of the sitemap.

Preview at https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines

@coveralls
Copy link

coveralls commented Feb 14, 2024

Coverage Status

coverage: 20.682% (+0.1%) from 20.57%
when pulling f6b0438 on Recherche-Data-Gouv:8936_huge_sitemap
into 4f46d15 on IQSS:develop.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick review. Thanks for the pull request, @jeromeroucou ! ❤️

@@ -1745,6 +1745,8 @@ https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Proj

Once the sitemap has been generated and placed in the domain docroot directory, it will become available to the outside callers at <YOUR_SITE_URL>/sitemap/sitemap.xml; it will also be accessible at <YOUR_SITE_URL>/sitemap.xml (via a *pretty-faces* rewrite rule). Some search engines will be able to find it at this default location. Some, **including Google**, need to be **specifically instructed** to retrieve it.

On Dataverse installation with more than 50000 Dataverse collections or datasets, sitemap file name is ``sitemap_index.html``, not ``sitemap.xml``. Be aware in previous steps to use the correct file name corresponding to your installation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Why not use the library in all cases? And avoid having two code paths and two XML files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My note wasn't precise enough.

In sitemap specification, in the case where is more than 50000 URLs and/or the sitemap file is larger than 50MB, we have to use sitemap index file. The name of this file isn't clearly indicate, but in example there is indicate http://www.yoursite.com/sitemap_index.xml or https://example.com/public/sitemap_index.xml in Google developers documentation.

The library sitemapgen4j generate sitemap_index.xml name file by default in the case of sitemap index. You can see it here.

So in all cases the library is used, only the name of the sitemap file is change.

I'll have to clarify this in the documentation 👍

pom.xml Show resolved Hide resolved
@DS-INRA DS-INRA added this to 🚧 Dev by Recherche Data Gouv in Recherche Data Gouv (formerly Data INRAE) Feb 15, 2024
@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label Feb 28, 2024
@scolapasta
Copy link
Contributor

Hi @jeromeroucou @pdurbin is this set to be QAed? (or does it still need more review / follwoup for the review).

Also, can we change this from not being a draft?

@pdurbin pdurbin changed the title 8936 huge sitemap Support sitemaps with more than 50,000 items Feb 28, 2024
@pdurbin pdurbin added the Type: Feature a feature request label Feb 28, 2024
@poikilotherm
Copy link
Contributor

poikilotherm commented Feb 29, 2024

Just a general note here: the dependency added has not been updated since 2019 (no activity, stale PRs and issues) and uses old Java APIs. At least it doesn't introduce any other transitive dependencies. There are some forks with small patches.

My suggestion: let's create a fork at github.com/gdcc and have our own package at Maven Central. In case we need patches because of bugs or changes at search engines, we would be much quicker to get this done. It also allows us to modernize the Java code over time.

@jeromeroucou
Copy link
Contributor Author

Good catch @poikilotherm ! If there is possible to have a specific fork with patches, it could be nice, and I will be happy to change my PR with this dependency.

@pdurbin
Copy link
Member

pdurbin commented Feb 29, 2024

@poikilotherm not a bad suggestion. So we'd fork https://github.com/dfabulich/sitemapgen4j under https://github.com/gdcc

@scolapasta and others reading this, what do you think?

@scolapasta scolapasta self-assigned this Feb 29, 2024
@pdurbin
Copy link
Member

pdurbin commented Feb 29, 2024

I see a 👍 from @scolapasta and no objections in Slack so I forked it to https://github.com/gdcc/sitemapgen4j

Now what? Fix it up and release to Maven Central? Do we want a separate issue to track this? @jeromeroucou are you interested in helping?

@jeromeroucou
Copy link
Contributor Author

@pdurbin Yes I am interested. And I also think a separate issue on the forked project can be better to review patchs applications with PR

@pdurbin
Copy link
Member

pdurbin commented Mar 1, 2024

@jeromeroucou great thanks. @poikilotherm is already working away in this "modernize" branch, if you'd like to try building it: https://github.com/gdcc/sitemapgen4j/tree/modernize

@poikilotherm
Copy link
Contributor

He doesn't even have to. Just enable Maven Central Snaps and point to the new coordinates: io.gdcc:sitemapgen4j:2.0.0-SNAPSHOT

@poikilotherm
Copy link
Contributor

poikilotherm commented Mar 4, 2024

https://github.com/gdcc/sitemapgen4j has been released as v2.0.0 to Maven Central. (Might take another hour until available from all mirrors.)

Have fun! (If we need fixes, updated -SNAPSHOT versions will be auto-released to Central now, too. Hopefully will make things a lot easier!)

@poikilotherm
Copy link
Contributor

I have just released a version 2.1.0 that has again more optimizations. Most important: addressed loads of code smells and removed obsolete, no longer supported Google Sitemap Extensions.

@scolapasta
Copy link
Contributor

@jeromeroucou Is this all ready for QA now? (if so can you change it from draft - I jsut want to be sure there's not something else we're waiting for here)

@jeromeroucou
Copy link
Contributor Author

Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18

@scolapasta
Copy link
Contributor

Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18

Ah, OK, that makes sense. We'll keep an eye on for when you switch it out of draft state.

@poikilotherm
Copy link
Contributor

@jeromeroucou I merged your PR and a new SNAPSHOT version is available. Would you mind testing this in Dataverse again before I cut a release?

@jeromeroucou
Copy link
Contributor Author

@poikilotherm Thanks for the update ! On my laptop with the latest SNAPSHOT version, unit tests and sitemap generation with some dataverses and datasets work fine. When the new release will be published, I'll mention it on pom.xml file.

@jeromeroucou jeromeroucou marked this pull request as ready for review March 21, 2024 10:11
@jeromeroucou
Copy link
Contributor Author

@poikilotherm thanks for you work on the sitemapgen4j ! The 2.1.2 version is now used on pom.xml

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions about how this works. Please see my comments in the review. Thanks!

doc/sphinx-guides/source/installation/config.rst Outdated Show resolved Hide resolved
pom.xml Show resolved Hide resolved
@landreev landreev self-assigned this Mar 24, 2024
@landreev landreev self-requested a review March 24, 2024 20:00
jeromeroucou and others added 2 commits March 25, 2024 11:50
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
@@ -0,0 +1,3 @@
The sitemap file generation can handle more than 50,000 entries if needed with the [sitemapgen4j](https://github.com/gdcc/sitemapgen4j) library, maintained by the Global Dataverse Community Consortium.

In this case, the Dataverse Admin API `api/admin/sitemap` create a sitemap index file, called `sitemap_index.xml`, in place of the `sitemap.xml` file. See the config section of the Installation Guide for details.
Copy link
Contributor

@landreev landreev Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add just a couple of words here, along the lines of "... in place of the single sitemap.xml, which will be split into individual files sitemap1.xml, sitemap2.xml etc. as needed; these files will be listed in the main index file and placed in the same directory..." I obviously encourage you to try and explain this better and in a more clear way (hence "along the lines of ...")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @landreev.

Thanks you for the review. A update of release notes snippet and configuration page have been done, hoping to have taken your feedback.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments. Looks great otherwise. We have more than 50K public objects in our production, so I've been using a simple script to split the sitemap into segments. But it will help other instances to have it done automatically.

@landreev landreev removed their assignment Mar 28, 2024
@pdurbin
Copy link
Member

pdurbin commented Apr 18, 2024

@jeromeroucou I took a close look at the documentation and I'm suggesting a significant rewrite here:

If you like the changes, please feel free to merge. Otherwise, I'm happy to talk about it! 😄

@pdurbin
Copy link
Member

pdurbin commented Apr 18, 2024

I looked at the code and poked around with the new testHugeSiteMap test. I'm basically ready to approve this and move it to "ready for QA" but I'd like to see the doc PR I made above get merged first.

sitemap URL must have "/sitemap/" to be accessible in case of sitemap index file
@jeromeroucou
Copy link
Contributor Author

After discussion with @pdurbin, a new commit have been done to fix the sitemap URL in case of sitemap_index.xml file. A example below :

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap1.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap2.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
</sitemapindex>

In <loc> tag, if /sitemap/ isn't present, the location is invalid. There is no "pretty-faces" rewrite for these files (sitemap1.xml, ...), unlike sitemap.xml and sitemap_index.xml files.

I've also modified the unit tests to validate this modification, and I've also added some comments.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeromeroucou thanks for merging my docs pull request and for adding the pretty-config change. I haven't tested it specifically but I think this pull request is ready for QA so I'll move it along. Thanks!

@landreev landreev self-assigned this May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 10 A percentage of a sprint. 7 hours. Type: Feature a feature request
Projects
Status: QA ✅
Recherche Data Gouv (formerly Data IN...
  
🚧 Dev by Recherche Data Gouv
Development

Successfully merging this pull request may close these issues.

Handle more than 50,000 entries in the sitemap
6 participants