Index metadata extracted from FITS files during ingest #111

Closed
eaquigley opened this issue Jul 9, 2014 · 3 comments

@eaquigley
Contributor


Author Name: Leonid Andreev (@landreev)
Original Redmine Issue: 3520, https://redmine.hmdc.harvard.edu/issues/3520
Original Date: 2014-02-03
Original Assignee: Kevin Condon


The metadata extracted from files on ingest (filenames/types/descriptions for all files; variable names/labels from subsettable files; headers from FITS files) must be indexed.

@eaquigley
Contributor Author


Original Redmine Comment
Author Name: Leonid Andreev (@landreev)
Original Date: 2014-05-02T16:47:02Z


Status update for this never-ending ticket:

I changed the subject of the ticket to specifically mention FITS. The ticket was originally opened for indexing of, and searching on, the other types of metadata produced by ingest as well. But searches on things like variable names/labels and file descriptions have been working for a while. So FITS is the only thing remaining.

FITS tasks:

  1. The ingest code needs to be modified to reflect the latest changes just made to the Astro. metadata block - some field names have been modified. This is trivial, just needs to be done.
  2. As of yesterday, Gus was still in the process of finalizing the extraction rules for FITS metadata that's used to populate dataset metadata fields. I'm going to assume that it's not crucial that it's all absolutely finalized; I will just finalize my implementation by syncing it with the latest version of his astro. metadata document. But this is something I will be finishing early next week, i.e., at the absolute last minute.
  3. More importantly, I have obtained an agreement to only store and index metadata values for which a dedicated place in the metadata block has been defined. This means that the 3.6 system of arbitrary fields and values attached to datafile objects (FileMetadataField and FileMetadataFieldValue) that's still in place can finally be dismantled; I just haven't gotten to it yet.
    The other good news here, besides no longer having to maintain this parallel, hackish system of metadata that's not part of the metadata blocks: indexing and searching should be all set without having to do anything custom, since we already have an automated system in place that both indexes all the fields defined by metadata blocks and makes all these fields searchable in the Solr schema.
  4. One thing in particular that we used to index, for which we now have no place in the metadata block, was the names of the columns from FITS tables. Gus has agreed that, rather than adding a place for these values in the block, we just won't do anything with them for beta. As we were talking about longer-term plans, he suggested that, rather than looking for places to store these as extra metadata, we just treat FITS tables as tabular data, as with Stata/SPSS, etc. Then these column names will become variables, and we already have a mechanism for that. This would be an interesting, and hopefully useful, thing to try... But we'll have to figure out a few things. For example, there may be multiple tables in a FITS file, and he doesn't think splitting those and treating them as separate tabular datafiles would be acceptable.
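
To illustrate task 3, here is a minimal sketch of keeping only the header values that have a dedicated place in the metadata block. It is not the actual ingest plugin code: the keyword-to-field mapping, the function name `select_block_fields`, and the sample values are all assumptions made for illustration. The real extraction rules are whatever the astro metadata document says.

```python
# Hypothetical mapping from FITS header keywords to metadata-block field names.
# The keywords are standard FITS keywords; the field names on the right are
# assumptions for this sketch, not the finalized extraction rules.
FITS_KEY_TO_FIELD = {
    "TELESCOP": "facility",
    "INSTRUME": "instrument",
    "OBJECT": "object",
}


def select_block_fields(header_cards):
    """Keep only header values that map to a defined metadata-block field.

    header_cards is an iterable of (keyword, value) pairs. Anything without
    a dedicated field is dropped, so it is neither stored nor indexed.
    """
    selected = {}
    for key, value in header_cards:
        field = FITS_KEY_TO_FIELD.get(key.strip().upper())
        if field is None:
            continue  # no dedicated place in the block: skip it
        selected.setdefault(field, []).append(value)
    return selected


# Made-up sample header cards. TTYPE1 is a table column name keyword,
# which (per task 4) has no place in the block and is dropped.
cards = [
    ("TELESCOP", "example-telescope"),
    ("INSTRUME", "example-instrument"),
    ("OBJECT", "example-object"),
    ("TTYPE1", "flux"),
]
fields = select_block_fields(cards)
```

With a filter like this, everything that survives is an ordinary metadata-block field, so the existing automated indexing would pick it up without any FITS-specific code.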

@eaquigley
Contributor Author


Original Redmine Comment
Author Name: Leonid Andreev (@landreev)
Original Date: 2014-05-07T16:01:59Z


Quick update:

There are some problems with FITS because of the validation that's now enforced on the metadata.
For example, the file SPITZER_S3_22893056_0002_0000_7_bcd.fits ingests fine (populated fields: type, instrument, facility, object);
but the file acisf01873N003_full_img2.fits fails to ingest with validation errors, because of multiple values for spatial.Coverage, which is a "non-multivalued field".
Some updates will be needed for this: either to the metadata block, or to the code of the validation or of the ingest plugin. I just need to talk about this with everybody involved.
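
For anyone trying to picture the failure, the check involved is roughly the following. This is only a sketch under assumptions: the `ALLOWS_MULTIPLE` table and the `validate` function are hypothetical, and only the non-multivalued status of spatial.Coverage comes from the error described above. Whether the other fields allow multiple values is assumed here.

```python
# Hypothetical field definitions: field name -> whether multiple values are allowed.
# Only the spatial.Coverage entry reflects the reported error; the rest are assumed.
ALLOWS_MULTIPLE = {
    "type": True,
    "instrument": True,
    "facility": True,
    "object": True,
    "spatial.Coverage": False,
}


def validate(extracted):
    """Return a list of validation error messages for extracted metadata.

    extracted maps a field name to the list of values found in the file.
    """
    errors = []
    for field, values in extracted.items():
        if field not in ALLOWS_MULTIPLE:
            errors.append(f"{field}: not defined in the metadata block")
        elif len(values) > 1 and not ALLOWS_MULTIPLE[field]:
            errors.append(
                f"{field}: {len(values)} values for a non-multivalued field"
            )
    return errors


# A file with one value per field passes; two coverage values do not.
ok_errors = validate({"instrument": ["example-instrument"], "spatial.Coverage": ["a"]})
bad_errors = validate({"spatial.Coverage": ["a", "b"]})
```

The two possible fixes mentioned above correspond to either flipping the field to multivalued in the block definition, or having the ingest plugin reduce the extracted values to a single one before validation runs.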

@eaquigley
Contributor Author


Original Redmine Comment
Author Name: Gustavo Durand (@scolapasta)
Original Date: 2014-05-28T19:27:14Z


This was delivered and tested. Gus is doing the continued testing for this. Separate tickets have been opened for specific issues.

This issue was closed.