Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

Closed
mheppler opened this issue May 26, 2015 · 24 comments

Comments

@mheppler
Contributor

commented May 26, 2015

As referenced in #2192, there are files in production that need friendly MIME Type labels.

From @landreev

The file in question is ./src/main/java/MimeTypeDisplay.properties

We should identify as many of these as possible, and give them friendlier display names than the one that @pdurbin found.

[Screenshot: documentation_and_metadata_-training_materials_dataverse-_2015-05-24_10 00 24]

@landreev

Contributor

commented Jun 1, 2015

Also, I believe we should extend this "friendly name" functionality to support wildcards, as in:

image/jpeg=JPEG Image
image/gif=GIF Image
image/bmp=Windows Bitmap Image
image/*=Graphic Image

I.e., we provide friendly names for the types we know about, and a generic name for an image of type image/blah-blah that's not specifically listed.
We can do the same with MS documents and other types of files, because we'll always be encountering file types we don't know about.
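The wildcard fallback proposed above could be sketched roughly as follows. This is only an illustration in Python, not the Dataverse implementation: the `DISPLAY_NAMES` dictionary and the `friendly_name` helper are hypothetical names, and the entries simply mirror the example lines in this comment.

```python
from typing import Dict, Optional

# Hypothetical stand-in for a few MimeTypeDisplay.properties entries,
# copied from the example in the comment above.
DISPLAY_NAMES: Dict[str, str] = {
    "image/jpeg": "JPEG Image",
    "image/gif": "GIF Image",
    "image/bmp": "Windows Bitmap Image",
    "image/*": "Graphic Image",
}


def friendly_name(mime_type: str, names: Dict[str, str] = DISPLAY_NAMES) -> Optional[str]:
    """Return the exact friendly name if listed, else the type's wildcard name."""
    key = mime_type.strip().lower()
    if key in names:
        # A specifically listed type always wins over the wildcard.
        return names[key]
    major, sep, _ = key.partition("/")
    if sep:
        # Fall back to the generic "<major>/*" entry, if one is defined.
        return names.get(major + "/*")
    return None


if __name__ == "__main__":
    print(friendly_name("image/gif"))
    print(friendly_name("image/blah-blah"))
```

With this lookup order, adding a new specific entry later never changes the label of types that are already listed; it only narrows what the wildcard catches.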

@scolapasta scolapasta added this to the Candidates for 4.0.3 milestone Jun 1, 2015

@scolapasta scolapasta modified the milestones: 4.2, Candidates for 4.2 Jul 15, 2015

@mheppler

Contributor Author

commented Aug 18, 2015

Currently, the File Type values that are delivered from MimeTypeFacets.properties are lower case (see attached). I suggest that we capitalize them.

[Screenshot: screen shot 2015-08-18 at 12 01 31 pm]

@pdurbin

Member

commented Sep 11, 2015

@scolapasta I'm passing this to you for a decision on what to do for 4.2.

@scolapasta scolapasta modified the milestones: Candidates for 4.3, 4.2 Sep 17, 2015

@mercecrosas mercecrosas modified the milestones: Candidates for 4.3, In Review Nov 30, 2015

@scolapasta scolapasta removed their assignment Jan 27, 2016

@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016

@mheppler

Contributor Author

commented Sep 7, 2016

Related to #3288 #3333 #3334 #3335

@mheppler mheppler changed the title from "Dataset - Friendly File MIME Type Display Names" to "Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names" Apr 1, 2019

landreev added a commit that referenced this issue Jun 3, 2019

@landreev

Contributor

commented Jun 4, 2019

The final word on the new version of Jhove: it works, aside from the new XML plugin, which has the problem above. That does not seem acceptable, so the plugin is going to be excluded from the configuration. The new version gives some modest gains in detecting the types of some previously unidentified files (mostly PNG images, text files, including the specific encoding used, gzip and web archive files; I can post the exact percentages in relation to the number of prod. files currently listed as unknown).
But improving this area should be an ongoing process; I will open a new issue for it. We will still need to decide whether we want to stick with Jhove (the only way to get more out of it would be to create more type-specific plugins for it ourselves), or adopt something else instead of, or on top of, it.

@landreev

Contributor

commented Jun 4, 2019

(That is to say, I'm choosing the manageable chunks/incremental improvements approach here, just so that we can close this issue and move forward)

landreev added a commit that referenced this issue Jun 4, 2019

Final reorganization of the code used to group files by type, for the search facets and default thumbnail icons.

(ref #2202)

@landreev landreev moved this from IQSS Team Dev 💻 to Code Review 🦁 in IQSS/dataverse Jun 4, 2019

@scolapasta scolapasta assigned pdurbin and landreev and unassigned landreev Jun 5, 2019

landreev added a commit that referenced this issue Jun 5, 2019

@landreev landreev removed their assignment Jun 5, 2019

@pdurbin pdurbin moved this from Code Review 🦁 to QA in IQSS/dataverse Jun 5, 2019

@pdurbin

Member

commented Jun 5, 2019

At standup I said I wanted to check whether I had documented the new file type redetect API endpoint I added (phew, done already), and I see that @landreev just pushed a release note in ef40804, which looks good. I just moved this to QA. I also looked at the recent code-related commits that @landreev made since I last touched the branch, and they all look good to me too.

@pdurbin pdurbin removed their assignment Jun 5, 2019

landreev added a commit that referenced this issue Jun 5, 2019

landreev added a commit that referenced this issue Jun 7, 2019

@kcondon kcondon moved this from QA to IQSS Team Dev 💻 in IQSS/dataverse Jun 7, 2019

@landreev

Contributor

commented Jun 7, 2019

Something I should've done earlier: notes on how to test and what to look for.
Things were improved in more than one area:

  1. The new API for re-identifying the types of files currently stored as unknown (MIME type: "application/octet-stream") in the database. The API is /api/files/<FILEID>/redetect. Until this API is actually run in prod., the number that appears as "Unknown" in the type facets will not change. This API cannot be tested on the vm5 copy of the database, since it needs to read the actual files, and we don't want to point vm5 to the prod. S3 bucket. But it can be tested on some select files.
  2. Better rules for classifying known MIME types for the type facets indexing. This part can be tested on vm5: a full reindex should affect the facet numbers, most notably:
    the misleading "Application" facet (30K files in prod. currently) should disappear completely;
    the "Zip" facet (8K in prod.) should go away, replaced by "Archive", showing a higher number (all compressed and archived formats will be indexed under this type);
    a new facet, "Code", should appear, with a sizeable number of files (20K+);
    a new facet, "Other", should appear, with a relatively small number of files (this is for the files previously indexed under the "Application" facet that haven't been reclassified under more informative groupings).
  3. More file types should have "friendly" type descriptions (as appear on the dataset and dataverse pages). See the diff on the MimeTypeDisplay.properties file.
  4. Jhove should do a better job identifying some file types. The recommended way of testing this is by uploading files via the API, to take the browser and the OS out of the picture. File types to try: png, gzipped. Changing/stripping the .png and .gz filename extensions would ensure that the type is identified by the contents, and not by the extension.
  5. The list of recognized filename extensions used to guess the content type has been extended. See the diff on MimeTypeDetectionByFileExtension.properties.
  6. It may be worth confirming that Mike's type-specific default thumbnails are still working properly, since the code that selects them has been reorganized as part of this PR too.
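For item 1, a tester might build the redetect call for a single select file roughly like this. It is a minimal sketch under stated assumptions: only the path /api/files/<FILEID>/redetect comes from the notes above; the POST method, the X-Dataverse-key token header, the server URL, the file id, and the helper name `build_redetect_request` are assumptions made for illustration and should be checked against the API guide.

```python
from urllib.request import Request


def build_redetect_request(server_url: str, file_id: int, api_token: str) -> Request:
    """Build (but do not send) a request to /api/files/<FILEID>/redetect."""
    # Only the path below is taken from the testing notes.
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/redetect"
    # Assumption: the endpoint is called with POST and authenticates via an
    # API token header; neither detail is stated in this thread.
    return Request(url, method="POST", headers={"X-Dataverse-key": api_token})


if __name__ == "__main__":
    # Hypothetical server, file id, and token, for illustration only.
    req = build_redetect_request("http://localhost:8080", 42, "xxxxxxxx")
    print(req.get_method(), req.full_url)
```

The request object can be passed to `urllib.request.urlopen` to actually run the redetection. Building it separately makes it easy to review the URL before touching an installation that holds real files.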

@landreev landreev moved this from IQSS Team Dev 💻 to QA in IQSS/dataverse Jun 7, 2019

@djbrooke djbrooke assigned kcondon and unassigned landreev Jun 10, 2019

landreev added a commit that referenced this issue Jun 11, 2019

@kcondon kcondon closed this in f95a627 Jun 11, 2019

kcondon added a commit that referenced this issue Jun 11, 2019

Merge pull request #5853 from IQSS/2202-file-type-facet-fix
 Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

IQSS/dataverse automation moved this from QA to Done 🚀 Jun 11, 2019

@mheppler

Contributor Author

commented Jun 11, 2019

Peeked at the icons mentioned in "6" and suggested tweaks for the data and archive icons. Put my random selection of 84 unknown files into dvn-build, and Data went from last (2 of 84) to first (18 of 84). The unknowns were still pretty high, but hopefully we will see greater gains in the full pool of 127,109 unknowns in production, since all 84 of those files were unknowns there originally.

[Screenshot: Screen Shot 2019-06-11 at 12 25 51 PM]
