
Catalogue

Roddi Potter edited this page Sep 11, 2013 · 6 revisions

The Catalogue component contains a meaningful mapping between files and their locations.

API Documentation Project

The idea behind cataloguing is to scan the directories of an FTP (or SFTP) server and map the file locations (URLs) to Data Products. Data Products are defined within a Project object and carry additional attributes that can be searched on. Together, these attributes form the ontology that describes the data.

The following ERD shows the current object model for projects and data products. It is expected to grow over time to allow a greater range of definition capabilities.

ERD of CICSTART Catalogue schema

A Project is an arbitrary container used to group related data products. Because it is only a logical container, it could be anything, such as your username or a real project such as THEMIS. Because the project is arbitrary, one might consider mapping a VFS that holds the output of Macro jobs so that the output is automatically added to the catalogue. This mechanism could be used to auto-populate the catalogue with higher-level data products derived from more basic sources.

Data Products also contain the details required to map the files on a server host. The mapping occurs during a scan request. After the scan completes and the files are mapped, they can be accessed via the catalogue and file components. A typical use case is to search the catalogue for a URL and then pass that URL to the cache to retrieve the data. The catalogue may be searched by any attribute that associates the file with the data product: observatory(ies), instrument type(s), or a discriminator.

Future considerations...

A data product may also have a set of key=value pairs of descriptive attributes (available if requested). These could be used to narrow search results, or to provide additional information about the data product such as cadence or instrument settings during data collection. A data product could also be linked to a set of SPASE or other common ontology resources that help to describe it. Ontology-specific queries could be developed to aid in searching for data products.

Add a Project

Given a Project object

{
  "host": "space.augsburg.edu",
  "endDateBeanShell": "",
  "endDateRegex": "",
  "scanDirectories": [
    "processed/MACCS/IAGA2000"
  ],
  "excludesRegex": "string",
  "externalKey": {
    "value": "MACCS"
  },
  "rulesUrl": "http://space.augsburg.edu/maccs/datausepolicy.html",
  "observatories": {
    "description": "Pangnirtung, Nunavut, Canada",
    "longitude": "294.2",
    "latitude": "66.1",
    "externalKey": {
      "value": "PG"
    }
  },
  "name": "The Magnetometer Array for Cusp and Cleft Studies",
  "startDateBeanShell": "",
  "startDateRegex": "",
  "instrumentTypes": {
    "description": "Magnetometer",
    "externalKey": {
      "value": "Magnetometer"
    }
  },
  "dataProducts": {
    "metadataParserConfig": {
      "endDateBeanShell": "
        import org.joda.time.LocalDate;
        import org.joda.time.LocalTime;
        import org.joda.time.LocalDateTime;
        import org.joda.time.format.DateTimeFormat;

        LocalDateTime parse(String url, String regexResult) {
          LocalDate date = DateTimeFormat.forPattern(\"yyyyMMdd\").parseLocalDate(regexResult);
          LocalDateTime dateTime = date.toLocalDateTime(new LocalTime(23,59));
          return dateTime;
        }
      ",
      "endDateRegex": "\\d{8}",
      "startDateRegex": "\\d{8}",
      "startDateBeanShell": "
        import org.joda.time.LocalDate;
        import org.joda.time.LocalTime;
        import org.joda.time.LocalDateTime;
        import org.joda.time.format.DateTimeFormat;

        LocalDateTime parse(String url, String regexResult) {
          LocalDate date = DateTimeFormat.forPattern(\"yyyyMMdd\").parseLocalDate(regexResult);
          LocalDateTime dateTime = date.toLocalDateTime(LocalTime.MIDNIGHT);
          return dateTime;
        }
      ",
      "includesRegex": "^PGG.*sec$"
    },
    "description": "Pangnirtung Magnetometer 10 sec",
    "externalKey": {
      "value": "PG10SEC"
    }
  }
}

POST the request to the Catalogue component's Project resource. To add a Project, you must be logged in and have a valid session token. Assuming the Project object is stored in a file called project in the directory from which curl is run:

    curl -v -X POST -d @project -H "Content-Type: application/json" \
    -H "CICSTART.session: b3f8031e-f84e-4a8a-9ebe-d89f219ffa82" http://208.75.74.81/cicstart/api/catalogue/project

Expect a 201 response if the Project was successfully created.
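For callers who prefer a script to curl, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not the official client: the URL, header names and session token are copied from the curl example above, while the helper names `build_add_project_request` and `add_project` are hypothetical.

```python
import json
import urllib.request

# Values copied from the curl example above; replace them with your own.
CATALOGUE_PROJECT_URL = "http://208.75.74.81/cicstart/api/catalogue/project"
SESSION_TOKEN = "b3f8031e-f84e-4a8a-9ebe-d89f219ffa82"


def build_add_project_request(project, session_token, url=CATALOGUE_PROJECT_URL):
    """Build (but do not send) the POST request that adds a Project."""
    body = json.dumps(project).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "CICSTART.session": session_token,
        },
    )


def add_project(project, session_token):
    """Send the request and return the HTTP status code (201 expected on success)."""
    request = build_add_project_request(project, session_token)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```

Separating request construction from sending makes it possible to inspect the headers and body before anything goes over the network.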

Special notes about includes & excludes regex

The regular expression is matched against the entire URL of the file on the remote host, not just the filename.
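The practical effect of matching against the whole URL can be illustrated with a short Python sketch. Several things here are assumptions: the file URL is made up, the helper `is_included` is hypothetical, and the sketch searches for the pattern anywhere in the URL, which may differ from how the Catalogue actually applies it.

```python
import re

# Hypothetical URL of a file found during a scan; the file name is made up.
FILE_URL = "ftp://space.augsburg.edu/processed/MACCS/IAGA2000/PGG20130911.sec"


def is_included(url, includes_regex, excludes_regex=None):
    """Return True if the URL matches includes_regex and not excludes_regex.

    Assumption: patterns are searched for anywhere in the URL (re.search).
    """
    if re.search(includes_regex, url) is None:
        return False
    if excludes_regex and re.search(excludes_regex, url):
        return False
    return True


# A pattern anchored to the start of the file name does not match the full URL,
# because "^" refers to the start of the URL ("ftp://...") in this sketch.
filename_only = is_included(FILE_URL, r"^PGG.*sec$")

# A pattern that anchors on the last path segment does match.
url_aware = is_included(FILE_URL, r"/PGG[^/]*sec$")
```

If the Catalogue behaves like this sketch, patterns written with only the filename in mind need a leading path component, or must drop the `^` anchor, to match.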

Special notes about Start & End Date Beanshell code

The Beanshell script is the implementation of a parse method with the signature:

     public LocalDateTime parse(String url, String dateTimeResultingFromRegex);

Here dateTimeResultingFromRegex is normally the text matched by the regular expression for the start date or end date, although it does not have to be. An example Beanshell script (this one was used to parse SuperDARN file names to get the data start date from each filename):

import org.joda.time.LocalDate;
import org.joda.time.LocalTime;
import org.joda.time.LocalDateTime;
import org.joda.time.format.DateTimeFormat;
 
LocalDateTime parse(String url, String regexResult) { 
	LocalDate date = DateTimeFormat.forPattern("yyyyMMdd").parseLocalDate(regexResult);
	LocalDateTime dateTime = date.toLocalDateTime(new LocalTime(23,59));
	return dateTime;
}

Note that all types are actual Java objects and LocalDateTime is a JodaTime type. Most of the JVM classpath is available for use.
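To check a date regex and its parse logic before submitting a Project, the two steps can be prototyped outside the Catalogue. The Python sketch below mirrors the start and end date scripts from the Project example (midnight for the start, 23:59 for the end). It assumes the first regex match is the one handed to the script, and the function names and sample URL are hypothetical.

```python
import re
from datetime import datetime, time

# Same pattern as startDateRegex and endDateRegex in the Project example.
DATE_REGEX = re.compile(r"\d{8}")


def first_date_match(url):
    """Return the first run of eight digits in the URL, or None if there is none."""
    match = DATE_REGEX.search(url)
    return match.group(0) if match else None


def parse_start(url, regex_result):
    """Mirror of the start date script: the day begins at midnight."""
    day = datetime.strptime(regex_result, "%Y%m%d").date()
    return datetime.combine(day, time(0, 0))


def parse_end(url, regex_result):
    """Mirror of the end date script: the day ends at 23:59."""
    day = datetime.strptime(regex_result, "%Y%m%d").date()
    return datetime.combine(day, time(23, 59))
```

For a made-up URL ending in `PGG20130911.sec`, `first_date_match` yields `"20130911"`, which the two parse functions turn into the first and last minute of 11 September 2013.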

Scan a Project

The project must be set up, and the host configuration must already have been added to the host resource of the File component.

POST the request to the Catalogue component's Project resource. The request is asynchronous because the scan may take some time to complete. To determine whether it has completed, you can poll the .... find method (for now).

    curl -v -X POST -d @themis -H "Content-Type: application/json" \
    -H "CICSTART.session: b3f8031e-f84e-4a8a-9ebe-d89f219ffa82" http://208.75.74.81/cicstart/api/catalogue/project

Expect a response of 202 if the scan request was accepted by the server.
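Because the scan runs asynchronously, a client needs some form of polling loop after the 202 response. The following Python sketch shows one generic way to structure it. The helper `wait_until` is hypothetical, and the completion check is deliberately left to the caller, since the exact find call to poll is not specified here.

```python
import time


def wait_until(is_complete, interval_seconds=5.0, max_attempts=60, sleep=time.sleep):
    """Call is_complete() until it returns True or the attempts run out.

    is_complete is a caller-supplied function that performs the actual
    catalogue query and returns True once the scan results are visible.
    Returns True if completion was observed, False if polling gave up.
    """
    for attempt in range(max_attempts):
        if is_complete():
            return True
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return False
```

Passing `sleep` as a parameter keeps the loop easy to test, and bounding the number of attempts avoids waiting forever on a scan that never finishes.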
