Merge pull request #10321 from Recherche-Data-Gouv/8936_huge_sitemap
Support sitemaps with more than 50,000 items
landreev authored May 8, 2024
2 parents d923f1c + f6b0438 commit 15fbbea
Showing 6 changed files with 257 additions and 157 deletions.
11 changes: 11 additions & 0 deletions doc/release-notes/8936-more-than-50000-entries-in-sitemap.md
@@ -0,0 +1,11 @@
Dataverse can now handle more than 50,000 items when generating sitemap files, splitting the content across multiple files to comply with the Sitemap protocol.

For details, see https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines, #8936, and #10321.

## Upgrade instructions

If your installation has more than 50,000 entries, you should re-submit your sitemap URL to Google or other search engines. The file in the URL will change from ``sitemap.xml`` to ``sitemap_index.xml``.

As explained at https://dataverse-guide--10321.org.readthedocs.build/en/10321/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines this is the command for regenerating your sitemap:

`curl -X POST http://localhost:8080/api/admin/sitemap`
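After regenerating, one way to confirm which file name to re-submit is to look at what was actually written to disk. The sketch below is a hypothetical helper, not part of Dataverse. It assumes the default Payara docroot location from the guide, and it assumes that when ``sitemap_index.xml`` is present it is the file to submit.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dataverse): report which sitemap file
# exists in the given directory, so you know which URL to submit.
# Assumption: if sitemap_index.xml is present, it is the file to submit.
sitemap_file_to_submit() {
    dir="$1"
    if [ -f "$dir/sitemap_index.xml" ]; then
        echo "sitemap_index.xml"
    elif [ -f "$dir/sitemap.xml" ]; then
        echo "sitemap.xml"
    else
        echo "no sitemap found in $dir" >&2
        return 1
    fi
}

# Example usage (default Payara location; adjust if you customized it):
# curl -X POST http://localhost:8080/api/admin/sitemap
# sitemap_file_to_submit /usr/local/payara6/glassfish/domains/domain1/docroot/sitemap
```

If an older ``sitemap.xml`` is still present next to a newer index file, this helper reports the index; check the file timestamps if in doubt.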
39 changes: 32 additions & 7 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2176,26 +2176,51 @@ If you are not fronting Payara with Apache you'll need to prevent Payara from se
Creating a Sitemap and Submitting it to Search Engines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Search engines have an easier time indexing content when you provide them a sitemap. The Dataverse Software sitemap includes URLs to all published Dataverse collections and all published datasets that are not harvested or deaccessioned.
Creating a Sitemap
##################

Search engines have an easier time indexing content when you provide them a sitemap. Dataverse can generate a sitemap that includes URLs to all published collections and all published datasets that are not harvested or deaccessioned.

Create or update your sitemap by adding the following curl command to cron to run nightly or as you see fit:

``curl -X POST http://localhost:8080/api/admin/sitemap``

This will create or update a file in the following location unless you have customized your installation directory for Payara:
On a Dataverse installation with many datasets, creating or updating the sitemap can take a while. You can check Payara's server.log file for the "BEGIN updateSiteMap" and "END updateSiteMap" lines to see when the process started and stopped, along with any errors logged in between.
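As a minimal sketch of that log check, the hypothetical helper below prints only the two marker lines from a log file. The marker strings come from the text above; the log path in the usage example is an assumption based on the default Payara layout and may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: print the updateSiteMap BEGIN/END marker lines
# found in the log file passed as the first argument.
sitemap_log_markers() {
    grep -E 'BEGIN updateSiteMap|END updateSiteMap' "$1"
}

# Example usage (assumed default log location; adjust as needed):
# sitemap_log_markers /usr/local/payara6/glassfish/domains/domain1/logs/server.log
```

A "BEGIN" line without a matching "END" line suggests the process is still running or stopped early, in which case the surrounding log lines are the place to look for errors.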

For compliance with the `Sitemap protocol <https://sitemaps.org/protocol.html>`_, the generated sitemap will either be a single file containing 50,000 items or fewer, or it will be split into multiple files.

Single Sitemap File
###################

If you have 50,000 items or fewer, a single sitemap will be generated in the following location (unless you have customized your installation directory for Payara):

``/usr/local/payara6/glassfish/domains/domain1/docroot/sitemap/sitemap.xml``

On Dataverse installation with many datasets, the creation or updating of the sitemap can take a while. You can check Payara's server.log file for "BEGIN updateSiteMap" and "END updateSiteMap" lines to know when the process started and stopped and any errors in between.
Once the sitemap has been generated in the location above, it will be served at ``/sitemap.xml`` like this: https://demo.dataverse.org/sitemap.xml

https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Project Demo site and yours should be similar.
Multiple Sitemap Files (Sitemap Index File)
###########################################

Once the sitemap has been generated and placed in the domain docroot directory, it will become available to the outside callers at <YOUR_SITE_URL>/sitemap/sitemap.xml; it will also be accessible at <YOUR_SITE_URL>/sitemap.xml (via a *pretty-faces* rewrite rule). Some search engines will be able to find it at this default location. Some, **including Google**, need to be **specifically instructed** to retrieve it.
According to the `Sitemaps.org protocol <https://www.sitemaps.org/protocol.html#index>`_, a sitemap file must have no more than 50,000 URLs and must be no larger than 50 MiB. When your installation has more URLs than a single file allows, a sitemap index file called ``sitemap_index.xml`` is created (instead of ``sitemap.xml``), as the protocol instructs; it references multiple sitemap files named ``sitemap1.xml``, ``sitemap2.xml``, etc. These referenced files are generated in the same directory as the other sitemap files (``domain1/docroot/sitemap``), and there will be as many as necessary to contain the URLs of the collections and datasets in your installation while respecting the limit of 50,000 URLs per file.
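To estimate how many referenced files to expect, the arithmetic is a ceiling division by the per-file limit. The hypothetical helper below is only a sketch: it assumes each file is filled up to the 50,000 URL limit before the next one is started, which the text above does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: number of sitemapN.xml files needed for a given
# number of URLs, at no more than 50,000 URLs per file (ceiling division).
# Assumption: each file is filled to the limit before the next is started.
MAX_URLS_PER_SITEMAP=50000

sitemap_file_count() {
    urls="$1"
    echo $(( (urls + MAX_URLS_PER_SITEMAP - 1) / MAX_URLS_PER_SITEMAP ))
}

# Example usage:
# sitemap_file_count 120000   # 3 files: sitemap1.xml, sitemap2.xml, sitemap3.xml
```

This only applies once an installation has more than 50,000 items; at or below that number a single ``sitemap.xml`` is generated instead.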

One way to submit your sitemap URL to Google is by using their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided; but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in on Google with the account that has the edit permission on your Google Analytics property; go to the search console and enter the root URL of your Dataverse installation, then choose Google Analytics as the authentication method. Once logged in, click on "Sitemaps" in the menu on the left. (todo: add a screenshot?) Consult `Google's "submit a sitemap" instructions`_ for more information; and/or similar instructions for other search engines.
If you have over 50,000 items, a sitemap index file will be generated in the following location (unless you have customized your installation directory for Payara):

.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668
``/usr/local/payara6/glassfish/domains/domain1/docroot/sitemap/sitemap_index.xml``

Once the sitemap has been generated in the location above, it will be served at ``/sitemap_index.xml`` like this: https://demo.dataverse.org/sitemap_index.xml

Note that the sitemap index file is also available at (for example) https://demo.dataverse.org/sitemap/sitemap_index.xml, and in that ``sitemap`` directory you will find the files it references, such as ``sitemap1.xml``, ``sitemap2.xml``, etc.

Submitting Your Sitemap to Search Engines
#########################################

Some search engines will be able to find your sitemap file at ``/sitemap.xml`` or ``/sitemap_index.xml``, but others, **including Google**, need to be **specifically instructed** to retrieve it.

As described above, Dataverse will automatically detect whether you need to create a single sitemap file or several files and generate them for you. However, when submitting your sitemap file to Google or other search engines, you must be careful to supply the correct file name (``sitemap.xml`` or ``sitemap_index.xml``) depending on your situation.

One way to submit your sitemap URL to Google is by using their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided; but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in on Google with the account that has the edit permission on your Google Analytics property; go to the Search Console and enter the root URL of your Dataverse installation, then choose Google Analytics as the authentication method. Once logged in, click on "Sitemaps" in the menu on the left. Consult `Google's "submit a sitemap" instructions`_ for more information.

.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668

Putting Your Dataverse Installation on the Map at dataverse.org
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6 changes: 6 additions & 0 deletions pom.xml
@@ -560,6 +560,12 @@
<artifactId>java-json-canonicalization</artifactId>
<version>1.1</version>
</dependency>
<!-- Sitemap generation -->
<dependency>
<groupId>io.gdcc</groupId>
<artifactId>sitemapgen4j</artifactId>
<version>2.1.2</version>
</dependency>
<dependency>
<groupId>edu.ucar</groupId>
<artifactId>cdm-core</artifactId>