Merge pull request #5853 from IQSS/2202-file-type-facet-fix
 Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202
kcondon committed Jun 11, 2019
2 parents 24d4209 + 7b7dbec commit 86bb329
Showing 47 changed files with 1,236 additions and 157 deletions.
2 changes: 1 addition & 1 deletion conf/docker-aio/run-test-suite.sh
@@ -8,4 +8,4 @@ fi

# Please note the "dataverse.test.baseurl" is set to run for "all-in-one" Docker environment.
# TODO: Rather than hard-coding the list of "IT" classes here, add a profile to pom.xml.
mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT -Ddataverse.test.baseurl=$dvurl
mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT,FileTypeDetectionIT -Ddataverse.test.baseurl=$dvurl
11 changes: 11 additions & 0 deletions conf/jhove/jhove.conf
@@ -40,4 +40,15 @@
<module>
<class>edu.harvard.hul.ois.jhove.module.Utf8Module</class>
</module>
<!-- New modules for application/gzip and application/warc: -->
<module>
<class>edu.harvard.hul.ois.jhove.module.GzipModule</class>
</module>
<module>
<class>edu.harvard.hul.ois.jhove.module.WarcModule</class>
</module>
<!-- A new 3rd-party module for image/png from mcgath.com: -->
<module>
<class>com.mcgath.jhove.module.PngModule</class>
</module>
</jhoveConfig>
5 changes: 5 additions & 0 deletions doc/release-notes/2202-improved-file-detection.md
@@ -0,0 +1,5 @@
Upgrade instructions:

This release adds a new version of JHOVE, the file type detection software. The new version requires an updated configuration file, ``jhove.conf``. Download the new configuration file from the Dataverse release page on GitHub, or from the source tree at https://github.com/IQSS/dataverse/blob/master/conf/jhove/jhove.conf, and place it in ``<GLASSFISH_DOMAIN_DIRECTORY>/config/``. For example: ``/usr/local/glassfish4/glassfish/domains/domain1/config/jhove.conf``.

**Important:** If your Glassfish installation directory is different from ``/usr/local/glassfish4``, make sure to edit the header of the config file to reflect the correct location.
5 changes: 5 additions & 0 deletions doc/sphinx-guides/source/admin/troubleshooting.rst
@@ -71,3 +71,8 @@ In real life production use, it may be possible to end up in a situation where s
(contrary to what the message suggests, there are no specific "details" anywhere in the stack trace that would explain what values violate which constraints)

To identify the specific invalid values in the affected datasets, or to check all the datasets in the Dataverse for constraint violations, see :ref:`Dataset Validation <dataset-validation-api>` in the :doc:`/api/native-api` section of the User Guide.

Many Files with a File Type of "Unknown", "Application", or "Binary"
--------------------------------------------------------------------

From the home page of a Dataverse installation you can get a count of files by file type by clicking "Files" and then scrolling down to "File Type". If you see a lot of files that are "Unknown", "Application", or "Binary", you can have Dataverse attempt to redetect the file type by using the :ref:`Redetect File Type <redetect-file-type>` API endpoint.
21 changes: 21 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -446,6 +446,8 @@ A more detailed "add" example using curl::

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@data.tsv' -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"true"}' "https://example.dataverse.edu/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

Please note that it's possible to "trick" Dataverse into giving a file a content type (MIME type) of your choosing. For example, you can make a text file be treated like a video file with ``-F 'file=@README.txt;type=video/mpeg4'``. If Dataverse does not properly detect a file type, specifying the content type via the API like this is a potential workaround.

Example python code to add a file. This may be run by changing these parameters in the sample code:

* ``dataverse_server`` - e.g. https://demo.dataverse.org
@@ -740,6 +742,25 @@ Note that this requires "superuser" credentials::

Note: at present, the API cannot be used on a file that's already successfully ingested as tabular.

.. _redetect-file-type:

Redetect File Type
~~~~~~~~~~~~~~~~~~

Dataverse uses a variety of methods for determining file types (MIME types or content types) and these methods (listed below) are updated periodically. If you have files that have an unknown file type, you can have Dataverse attempt to redetect the file type.

When using the curl command below, you can pass ``dryRun=true`` if you don't want any changes to be saved to the database. Change this to ``dryRun=false`` (or omit it) to save the change. In the example below, the file is identified by database id "42".

``export FILE_ID=42``

``curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/$FILE_ID/redetect?dryRun=true"``
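
The same request can be sketched in Python using only the standard library. This is a minimal sketch, not part of Dataverse: the helper name ``build_redetect_request`` and its parameters are hypothetical, and the function only builds the request so that it can be inspected before anything is sent.

```python
import urllib.request


def build_redetect_request(server_url, file_id, api_token, dry_run=True):
    """Build (but do not send) a POST request for the redetect endpoint.

    The helper name and parameters are hypothetical; only the URL shape and
    the X-Dataverse-key header come from the curl example.
    """
    query = "dryRun=true" if dry_run else "dryRun=false"
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/redetect?{query}"
    # An empty body with method="POST" mirrors "curl -X POST" without data.
    request = urllib.request.Request(url, data=b"", method="POST")
    request.add_header("X-Dataverse-key", api_token)
    return request


if __name__ == "__main__":
    req = build_redetect_request("https://demo.dataverse.org", 42, "xxxxxxxx")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Defaulting ``dry_run`` to ``True`` in the sketch is a deliberate choice so that a first run does not save changes to the database; pass ``dry_run=False`` once the reported types look right.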

Currently the following methods are used to detect file types:

- The file type detected by the browser (or sent via API).
- JHOVE: http://jhove.openpreservation.org
- As a last resort the file extension (e.g. ".ipynb") is used, based on the mappings defined in a file called ``MimeTypeDetectionByFileExtension.properties``.
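
The fallback idea behind the list above can be illustrated with a short Python sketch. It is not the Dataverse implementation: the function name, the ``UNINFORMATIVE`` set, the precedence between the first two methods, and the single entry in ``EXTENSION_TYPES`` (a stand-in for the properties file) are all assumptions made for illustration.

```python
import os

# Hypothetical stand-in for MimeTypeDetectionByFileExtension.properties.
EXTENSION_TYPES = {"ipynb": "application/x-ipynb+json"}

# Values treated as "no useful answer" in this sketch (an assumption).
UNINFORMATIVE = {None, "", "application/octet-stream"}


def pick_content_type(supplied_type, jhove_type, file_name,
                      extension_types=EXTENSION_TYPES):
    """Return the first informative type, falling back to the extension."""
    # Try the browser/API-supplied type, then the JHOVE result.
    for candidate in (supplied_type, jhove_type):
        if candidate not in UNINFORMATIVE:
            return candidate
    # Last resort: look up the file extension, without the leading dot.
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    return extension_types.get(extension, "application/octet-stream")


if __name__ == "__main__":
    print(pick_content_type(None, None, "analysis.ipynb"))
```

Keeping the extension lookup last matches the "last resort" wording in the list; the relative order of the other two methods in a real installation may differ from this sketch.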

Replacing Files
~~~~~~~~~~~~~~~

19 changes: 10 additions & 9 deletions pom.xml
@@ -34,6 +34,7 @@
<junit.platform.version>1.3.1</junit.platform.version>
<mockito.version>2.22.0</mockito.version>
<flyway.version>5.2.4</flyway.version>
<jhove.version>1.20.1</jhove.version>
<!--
Jacoco 0.8.2 seems to break Netbeans code coverage integration so we'll use 0.8.1 instead.
See https://github.com/jacoco/jacoco/issues/772 for discussion of how the XML changed.
@@ -376,19 +377,19 @@
<version>4.0.0</version>
</dependency>
<dependency>
<groupId>edu.harvard.hul.ois.jhove</groupId>
<artifactId>jhove</artifactId>
<version>1.11.0</version>
<groupId>org.openpreservation.jhove</groupId>
<artifactId>jhove-core</artifactId>
<version>${jhove.version}</version>
</dependency>
<dependency>
<groupId>edu.harvard.hul.ois.jhove</groupId>
<artifactId>jhove-module</artifactId>
<version>1.11.0</version>
<groupId>org.openpreservation.jhove</groupId>
<artifactId>jhove-modules</artifactId>
<version>${jhove.version}</version>
</dependency>
<dependency>
<groupId>edu.harvard.hul.ois.jhove</groupId>
<artifactId>jhove-handler</artifactId>
<version>1.11.0</version>
<groupId>org.openpreservation.jhove</groupId>
<artifactId>jhove-ext-modules</artifactId>
<version>${jhove.version}</version>
</dependency>
<!-- JAI (Java Advanced Imaging) jars: -->
<dependency>
64 changes: 12 additions & 52 deletions src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java
@@ -66,20 +66,6 @@ public class DataFileServiceBean implements java.io.Serializable {
@PersistenceContext(unitName = "VDCNet-ejbPU")
private EntityManager em;

// File type "classes" tags:

private static final String FILE_CLASS_AUDIO = "audio";
private static final String FILE_CLASS_CODE = "code";
private static final String FILE_CLASS_DOCUMENT = "document";
private static final String FILE_CLASS_ASTRO = "astro";
private static final String FILE_CLASS_IMAGE = "image";
private static final String FILE_CLASS_NETWORK = "network";
private static final String FILE_CLASS_GEO = "geodata";
private static final String FILE_CLASS_TABULAR = "tabular";
private static final String FILE_CLASS_VIDEO = "video";
private static final String FILE_CLASS_PACKAGE = "package";
private static final String FILE_CLASS_OTHER = "other";

// Assorted useful mime types:

// 3rd-party and/or proprietary tabular data formats that we know
@@ -1151,51 +1137,25 @@ public String getFileClassById (Long fileId) {
return null;
}

return getFileClass(file);
return getFileThumbnailClass(file);
}

public String getFileClass (DataFile file) {
if (isFileClassImage(file)) {
return FILE_CLASS_IMAGE;
}

if (isFileClassVideo(file)) {
return FILE_CLASS_VIDEO;
}

if (isFileClassAudio(file)) {
return FILE_CLASS_AUDIO;
}

if (isFileClassCode(file)) {
return FILE_CLASS_CODE;
}

if (isFileClassDocument(file)) {
return FILE_CLASS_DOCUMENT;
}

if (isFileClassAstro(file)) {
return FILE_CLASS_ASTRO;
}

if (isFileClassNetwork(file)) {
return FILE_CLASS_NETWORK;
}

if (isFileClassGeo(file)) {
return FILE_CLASS_GEO;
public String getFileThumbnailClass (DataFile file) {
// there's no solr search facet for "package files", but
// there is a special thumbnail icon:
if (isFileClassPackage(file)) {
return FileUtil.FILE_THUMBNAIL_CLASS_PACKAGE;
}

if (isFileClassTabularData(file)) {
return FILE_CLASS_TABULAR;
}
if (file != null) {
String fileTypeFacet = FileUtil.getFacetFileType(file);

if (isFileClassPackage(file)) {
return FILE_CLASS_PACKAGE;
if (fileTypeFacet != null && FileUtil.FILE_THUMBNAIL_CLASSES.containsKey(fileTypeFacet)) {
return FileUtil.FILE_THUMBNAIL_CLASSES.get(fileTypeFacet);
}
}

return FILE_CLASS_OTHER;
return FileUtil.FILE_THUMBNAIL_CLASS_OTHER;
}


2 changes: 2 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/Access.java
@@ -114,6 +114,8 @@
import static javax.ws.rs.core.Response.Status.BAD_REQUEST;
import javax.ws.rs.core.StreamingOutput;
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;

/*
Custom API exceptions [NOT YET IMPLEMENTED]
22 changes: 21 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/Files.java
@@ -25,6 +25,7 @@
import edu.harvard.iq.dataverse.engine.command.impl.DeleteMapLayerMetadataCommand;
import edu.harvard.iq.dataverse.engine.command.impl.GetDataFileCommand;
import edu.harvard.iq.dataverse.engine.command.impl.GetDraftFileMetadataIfAvailableCommand;
import edu.harvard.iq.dataverse.engine.command.impl.RedetectFileTypeCommand;
import edu.harvard.iq.dataverse.engine.command.impl.RestrictFileCommand;
import edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand;
import edu.harvard.iq.dataverse.engine.command.impl.UningestFileCommand;
@@ -38,6 +39,7 @@
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.StringUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
@@ -63,6 +65,7 @@
import org.glassfish.jersey.media.multipart.FormDataContentDisposition;
import org.glassfish.jersey.media.multipart.FormDataParam;
import java.util.List;
import javax.ws.rs.QueryParam;

@Path("files")
public class Files extends AbstractApiBean {
@@ -575,7 +578,24 @@ public Response reingest(@PathParam("id") String id) {
return ok("Datafile " + id + " queued for ingest");

}


@Path("{id}/redetect")
@POST
public Response redetectDatafile(@PathParam("id") String id, @QueryParam("dryRun") boolean dryRun) {
try {
DataFile dataFileIn = findDataFileOrDie(id);
String originalContentType = dataFileIn.getContentType();
DataFile dataFileOut = execCommand(new RedetectFileTypeCommand(createDataverseRequest(findUserOrDie()), dataFileIn, dryRun));
NullSafeJsonBuilder result = NullSafeJsonBuilder.jsonObjectBuilder()
.add("dryRun", dryRun)
.add("oldContentType", originalContentType)
.add("newContentType", dataFileOut.getContentType());
return ok(result);
} catch (WrappedResponse wr) {
return wr.getResponse();
}
}

/**
* Attempting to run metadata export, for all the formats for which we have
* metadata Exporters.