
Reindex and Reingest

Jan Tomášek edited this page Nov 29, 2022 · 2 revisions

Changes to the ARCLib XML structure or to its Solr schema may require that an upgrade be followed by a reindex, or even a reingest, of the data stored in Archival Storage.

Reindex

Changes to the Solr schema of the ARCLib XML are likely to require a reindex.

  • Reindex means retrieving the ARCLib XML from Archival Storage and writing it to Solr (overwriting the current Solr record).
  • There is currently no support for per-producer or per-XML reindex; the whole repository must be reindexed. See Usage@Index for how to reindex.
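
The whole-repository nature of reindex can be pictured with a minimal Python sketch. It is only an illustration, not ARCLib code: the function name `reindex_all`, the in-memory list standing in for Archival Storage, and the dictionary standing in for Solr are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple


def reindex_all(storage: Iterable[Tuple[str, str]], index: Dict[str, str]) -> int:
    """Overwrite the index record of every stored XML and return how many were written.

    `storage` yields (xml_id, xml_text) pairs; both it and `index` are
    hypothetical stand-ins for Archival Storage and Solr.
    """
    count = 0
    for xml_id, xml_text in storage:
        # The existing record, if any, is replaced rather than merged.
        index[xml_id] = xml_text
        count += 1
    return count


# Hypothetical data: one stale record in the index, two XMLs in storage.
archival_storage = [("xml-1", "<xml>first</xml>"), ("xml-2", "<xml>second</xml>")]
solr_like_index = {"xml-1": "<xml>stale</xml>"}

processed = reindex_all(archival_storage, solr_like_index)
```

The loop has no filter by producer or by XML, which mirrors the limitation described above: every stored XML is visited and its record rewritten.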

Reingest

Reingest may be needed if the structure of the ARCLib XML changes.

  • Reingest means running the whole ingest process on a SIP that was already ingested in the past. ARCLib detects the reingest and adds a new AIP XML version to the existing AIP (instead of treating the SIP as a new package and duplicating its data). For this to work, the workflow definition used for the reingest must meet these conditions:
    • It should not contain the Duplicate SIP check task
    • The SIP profile used must lead ARCLib to find the same Authorial ID as the one found by the SIP profile used for the previous ingest
      • See Path to XML file with authorial ID and XPath to node with authorial ID at Usage@Sip profiles
  • There is no support for bulk reingest: the admin must place the packages into the transfer area and activate a standard ingest routine.
    • If the data must first be exported from Archival Storage, the export options are:
      • Export through ARCLib using Export routines
      • Export using Archival Storage HTTP API (backup endpoint)
      • Export by accessing the logical storage directly
    • In all cases, the format of the exported data differs from the format required for import:
      • The AIP XML is not needed in the transfer area
      • A .sums file must be present in the transfer area
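
The reingest detection described above, where a matching Authorial ID leads to a new XML version rather than a duplicate package, can be sketched roughly in Python. This is a simplified model under stated assumptions, not ARCLib's implementation: the `Aip` class, the `ingest` function, and the dictionary keyed by Authorial ID are hypothetical names introduced for the example, and the real matching rules may involve more than the Authorial ID alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Aip:
    """Hypothetical AIP model: one package with a growing list of XML versions."""
    authorial_id: str
    xml_versions: List[str] = field(default_factory=list)


def ingest(repository: Dict[str, Aip], authorial_id: str, arclib_xml: str) -> int:
    """Store the XML and return the XML version number that was created.

    If an AIP with the same Authorial ID already exists, the call is treated
    as a reingest and only a new XML version is appended.
    """
    aip = repository.get(authorial_id)
    if aip is None:
        # First ingest: a new package is created.
        aip = Aip(authorial_id)
        repository[authorial_id] = aip
    aip.xml_versions.append(arclib_xml)
    return len(aip.xml_versions)


repository: Dict[str, Aip] = {}
first = ingest(repository, "doc-001", "<xml>old structure</xml>")
second = ingest(repository, "doc-001", "<xml>new structure</xml>")  # reingest
other = ingest(repository, "doc-002", "<xml>new structure</xml>")   # different package
```

The sketch also shows why the SIP profile matters: if the profile used for the reingest resolved a different Authorial ID, the lookup would miss and the SIP would be stored as a new package.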

There are two cases in which the ARCLib XML structure may change:

A) Update of the source code related to ARCLib XML creation
See the release notes for consequences.

B) Change of the SIP profile or Workflow definition of a particular Producer
Reingest your data if you require all your ARCLib XMLs to have the latest format, or if you rely on searching in new or modified elements or attributes. Also bear in mind that ARCLib has two kinds of nodes:

  • Nodes which are mapped to fields explicitly defined in Solr schema (fine search)
  • Nodes which are indexed only as a part of the fulltext index covering the whole XML
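
The difference between the two kinds of nodes can be illustrated with a small Python sketch. It is a conceptual model only: the `FIELD_MAP` mapping, the `build_record` function, and the sample element names are assumptions made for this example and do not reflect the actual ARCLib Solr schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of index field names to paths in the XML.
# Only nodes listed here get their own searchable field.
FIELD_MAP = {"title": "./title"}


def build_record(xml_text: str) -> dict:
    """Build an index record with explicit fields plus one fulltext field."""
    root = ET.fromstring(xml_text)
    record = {}
    for field_name, path in FIELD_MAP.items():
        node = root.find(path)
        if node is not None and node.text:
            record[field_name] = node.text.strip()
    # Every text node ends up in the fulltext field, mapped or not.
    record["fulltext"] = " ".join(t.strip() for t in root.itertext() if t.strip())
    return record


record = build_record("<doc><title>Annual report</title><note>scanned 2022</note></doc>")
```

In this model the `note` element has no field of its own, so its content is reachable only through the fulltext field. A newly added element behaves the same way until it is mapped to an explicit field.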