Skip to content

Migration Guide: Digital Commons

Brandon Weigel edited this page May 8, 2017 · 17 revisions

Sample files

Refer to the files located in the bceln sample directory.

Overview

The Digital Commons migration performed for KORA (http://kora.kpu.ca) involved a combination of MIK toolchains:

The combination approach is favoured because:

  • Difficulty of batch exporting object files from a vendor-controlled system
  • Digital Commons's OAI-PMH feed provides a direct link to the file in the second <dc:identifier> element, making it easy to download files this way (this is the reason for the OAI toolchain's development)
  • Metadata provided by OAI-PMH is incomplete and difficult to transform accurately to MODS
  • Digital Commons provides its most complete metadata via Excel spreadsheet export, which can be readily converted to CSV

Step 1: Download all repository files

Using the OAI-PMH toolchain to download files. The configuration examples given on the OAI toolchain page will work.

Important: Be sure to use metadata_prefix = oai_dc, as this is the only one that includes the second <dc:identifier> element with the path to the downloadable file.

This sample config file will work as a basis for downloading all repository files: kora_oai.ini

This configuration will download both XML and files. You will need to split them apart and discard the XML files, as you will get better metadata from CSV. No config option exists yet to get just the files with no XML generated.

Downloaded file names

Example file name: oai_kora.kpu.ca_facultypub-1059.pdf

The downloaded filenames will be derived from several pieces of metadata:

  • oai_: indicating the file was obtained via the OAI feed.
  • kora.kpu.ca_: the Digital Commons site that it came from.
  • facultypub-: the Digital Commons "set" or "collection" - column heading "issue" in the spreadsheets provided by the software.
  • 1059: The object's manuscript number

The filenames will be important when tweaking your CSV files to fit the toolchain's design.

Step 2: Export spreadsheets from Digital Commons

From your Digital Commons site, as administrator, export the Excel files for each of your sets.

For each publication type:

  1. In the Manage Submissions screen for the series, choose Batch Revise Excel in the sidebar. You may select publications for export based on their status or State; for example, published items only.

  2. Click Generate. A spreadsheet of the metadata will appear at the top of the “Spreadsheet History” list. (Note: In publications containing 10,000 submissions or more, exports are broken up into multiple spreadsheets of 5,000 items each.)

  3. Click the Download link to complete the export and save the file.

  4. Download an Administrator Report for the same series. Tick all the relevant boxes, but primarily the Manuscript# field. More info at Digital Commons.

Step 3: Edit and optimize spreadsheets

Digital Commons exports metadata in a few ways that are not entirely compatible with MIK and MODS. A few cleanup processes will be required.

Preliminary edits

** Adding, merging and renaming columns, changing formats, etc. using Excel, OpenOffice, or Google Sheets **

Create the filename field

From the (spreadsheet with manuscript number), copy the Manuscript# column and insert it into your main metadata spreadsheet. Insert a new column next to that column and call it "filename".

In that field, enter the following formula: =IF(a2="","",(CONCATENATE("oai_[your_url]_",b2,"-",a2,".[file_extension]"))

Where b2 is the relevant cell in your "Issue" column (i.e. the name of the set) and a2 is in your "Manuscript#" column. The IF statement will create an empty cell in the case of no manuscript number - important if your set has a mix of objects with and without files.

Your results should look like the names of the files you downloaded via the OAI feed: e.g. oai_kora.kpu.ca_facultypub-1059.pdf.

Copy that line all the way down, then highlight the column, copy, and Paste Special -> values and number formats.

IMPORTANT DETAIL: Note that having a filename field is required by MIK, even if your objects have no files. If the cell is blank, an XML file will be generated anyway (depending on your configuration).

Authors/Names

Digital Commons puts each name and related information about those names in separate columns. Columns about a particular author share the same prefix (e.g. author1_fname, author1_mname, author1_email, etc.). There are two problematic fields that need adjusting:

  • author[x]_mname - Person's middle name
  • author[x]_is_corporate - TRUE if this is a corporate author

Middle name In MODS, middle names are considered part of the first name (<name type="given">). You will need to create a new column that merges the _fname and _mname columns.

Using Excel or Open Office, simply create a new column and use the CONCATENATE function:

=IF(B2="",A2,CONCATENATE(A2," ",B2))

where a2 and b2 are the relevant cell numbers. The IF function stops you from adding whitespace unnecessarily. Then:

  1. Copy down to all cells
  2. Change the column header to author[x]_fname (x depends on the name you're working with)
  3. Copy the whole column
  4. Paste special: Values and number formats
  5. Delete the original two columns (author[x]_fname, author[x]_mname).

Do this for each author.

Corporate names Filter the headings and find the _is_corporate headings. Look for any that are TRUE. Apply filters. If you find any, create a new column, called corporate_author. (Use this title exactly if using the example mappings file.) Cut your corporate authors and paste into the new column.

For multiple corporate authors in a single row, create a second column called corporate_author2. If you need more than two corporate_author columns, you will need to add them to the mappings file.

Geolocation fields

Digital Commons gives you three sets of geolocation fields: geolocate (text), latitude (numeric), and longitude (numeric). The example mapping puts geolocate into <subject><geographic>%value%</geographic></subject>. Alter this if desired.

Geographic coordinates are mapped to <subject><cartographics><coordinates>%value%</coordinates></cartographics></subject>. Because MODS puts both latitude and longitude into a single element, and because writing a Twig template to do this on the fly is too difficult, these columns need to be merged.

  1. Insert a new column called "coordinates".
  2. Concatenate the values of the latitude and longitude columns: =IF(B2="",A2,CONCATENATE(A2,", ",B2))
  3. Copy, paste special, delete, etc. as you did merging first and middle names.

Multi-value fields: Subjects, disciplines, etc.

In Digital Commons, if you want multiple values in one field, you must enter them manually with no specifically defined delimiter required. Submitters are most likely to have used either a comma or a semicolon. This could result in a mix of delimiters used. MIK requires consistency.

The sample files assume that the subjects column maps to uncontrolled author keywords (destined for <note type="author keyword">), and the disciplines column maps to controlled subjects (<subject><topic>%value%</topic></subject>). In either case, it also assumes that multiple values are delimited with a semicolon.

To make the delimiters consistent, simply highlight the columns, and execute a find and replace to turn all commas into semicolons. (Note that this is a blunt-object approach; some submitters may have used semicolons as delimiters and have commas within the values themselves.)

Publication date

You may find a set of numbers in the publication_date field that don't look like dates. These are in the Microsoft Date Format. You can convert these to strings easily enough in Excel by highlighting them and changing the number format to Date.

cc_license and distribution_license

Merge these into distribution_license and delete cc_license.

Further Refinements: Stripping HTML, changing dates to strings, and extracting more data with Open Refine

The following can be accomplished in Excel, but it's easier if you use Open Refine. Create a new Open Refine project based on your spreadsheet and then do the following steps.

Stripping HTML

Some fields, such as the Abstract, contain HTML markup which you will want to remove before transforming to MODS. To do so, edit the column and apply the following transform: replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')

Convert date to string

While submitters most likely entered a minimal date (y-m-d, y-m, or even year-only), Digital Commons turns these into a complete timestamp. This should be converted to a string, erasing the 00:00:00, before transforming to MODS.

Edit the column and apply the following: toString(toDate(value),"yyyy-MM-dd")

Extract DOI from preferred citation column

If some of the preferred citations contain DOIs, it's a good idea to extract them. To do so, find the preferred_citation column, and choose "Add column based on this column". Give it the heading "doi", and apply the following transform to extract the DOI from the end:

replace((partition(value,"doi: ",omitFragment)[2]),/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')

Note that the string "doi: " may be different depending on how your citations are structured.

Convert commas to semicolons

If you forgot this step when editing the spreadsheet directly, you can do so here with the following transform: value.replace(",",";")

Export as CSV

When all your transformations are complete, export as CSV.

However - you aren't done.

Final settings for CSV

MIK tends to fail if cells, or at least the header cells, are not enclosed in double quotes, and if the wrong character set is selected. Excel tends to do terrible things, which have to be fixed in more controllable software like Open Office.

The easiest way to do this is by opening your CSV file in Open Office (NOT EXCEL). Use the following settings:

  • Character set: Unicode (UTF-8)
  • Separator options: Separated by, Comma
  • Text delimiter: "
  • Other options: Quoted field as text

Save as, overwrite, and put a copy into your MIK directory.

Step 4: Set up your INI file

This sample config file will work as a basis for your metadata manipulation: kora.ini

Metadata mapping and manipulation

As with your input CSV, your mapping file also needs to enclose cells in double quotes. A sample mapping file is provided in the BC ELN samples directory. If you use this file and follow the steps in this guide, it should work without further tweaks.

The "null" entries are related to the InsertXmlFromTemplate[] manipulators referenced in the config file. These deal with the more complex elements, where different values from the CSV are nexted withing a set of MODS elements. This is accomplished using Twig templates.

Twig templates and how they work

A Twig template is an XML file that can be called by MIK and insert metadata values into the appropriate elements. This is required if different metadata values from your CSV need to be placed together under a single parent element, such as <name>.

Only one parent element can exist in a given template - hence the seven name[x].xml templates in the sample directory. To avoid empty MODS elements in the case of the fields being empty for a given record, IF statements are used liberally.

A simple example: origininfo.xml

Each record will have an <originInfo> element and a <dateIssued>, so this is called every time. But publisher and city are only in a few records. They are therefore enclosed in IF statements - if they exist, add them; if they don't, they're left out.

A more complex example: name7.xml

Although all templates are called, if the author7 fields aren't populated (i.e. there is no seventh author), this template does not do anything at all. And because each related field, like author7_email and department, could or could not be populated in a given record, each child element is enclosed in an IF statement to keep the generated MODS record clean.

Splitting repeated values

Fields that were multi-valued in the CSV file need to be split into separate MODS elements. For this we use the SplitRepeatedValues[] manipulator. Note that any element using SplitRepeatedValues needs to also be added to the list of repeatable_wrapper_elements[] under METADATA_PARSER, or they will all be put into the same parent element.

Clone this wiki locally