Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202 #5853

Merged
merged 32 commits into from Jun 11, 2019
Changes from 21 commits
5a19537
Added more file types to properties files to better categorize type f…
mheppler May 2, 2019
b8094d7
More file types to properties files to better categorize type facets …
mheppler May 2, 2019
82cea9a
application/zip was changed from "ZIP" to "Archive" #2202
pdurbin May 2, 2019
d9e7d71
upgrade jhove from 1.11.0 to 1.20.1, add tests #2202
pdurbin May 2, 2019
2fb20d7
fix intermittent build failure #2202
pdurbin May 3, 2019
6fc97db
More file types to properties files to better categorize type facets …
mheppler May 3, 2019
5aac551
detect Jupyter Notebooks based on .ipynb file extension #2202
pdurbin May 8, 2019
290968b
working jupyter notebook file
pdurbin May 10, 2019
7f3f0fe
move .ipynb to external properties file #2202
pdurbin May 10, 2019
357c24b
add override mime type test #2202
pdurbin May 13, 2019
9e95ede
document file detection workaround #2202
pdurbin May 15, 2019
698e8d0
stub out RedetectFileTypeCommand #2202
pdurbin May 16, 2019
be9b8d9
save new file type to database #2202
pdurbin May 16, 2019
9cbf5ee
better error handling #2202
pdurbin May 16, 2019
e4c6ed4
require edit dataset perm #2202
pdurbin May 16, 2019
4f91e9a
add docs for redetect file type API endpoint #2202
pdurbin May 16, 2019
f95a627
Merge branch 'develop' into 2202-file-type-facet-fix #2202
pdurbin May 16, 2019
2a39f8c
typo
pdurbin May 16, 2019
1f19851
more docs on file type detection # 2202
pdurbin May 20, 2019
26e80a8
Added more file types to mime properties to decrease unknowns [ref #2…
mheppler May 20, 2019
4c5d6a2
Added more unknown file mime types and extentions to properties [ref …
mheppler May 21, 2019
835daa4
Fixed typo in file mime type properties [ref #2202]
mheppler May 21, 2019
4a32285
added gzip and warc modules to the jhove configuration (#2202)
landreev Jun 3, 2019
c3ee340
some final cosmetic improvements for the friendly display types and t…
landreev Jun 3, 2019
edcfad3
extra code in the redetect type command, to read non-local files (#2202)
landreev Jun 3, 2019
1752b2a
Final reorganization of the code used to group files by type, for the…
landreev Jun 4, 2019
ef40804
release notes with the upgrade instructions for #2202.
landreev Jun 5, 2019
97155ab
an extra null check, if the page needs to run the method on a null fi…
landreev Jun 5, 2019
6bd2e18
Merge branch 'develop' into 2202-file-type-facet-fix
landreev Jun 5, 2019
2de1761
fixed a type check to be case-insensitive (#2202)
landreev Jun 7, 2019
5793094
fixed the name of a renamed method in the files fragment
landreev Jun 10, 2019
7b7dbec
better choice of default icons for "data" and "archive", per review b…
landreev Jun 11, 2019
2 changes: 1 addition & 1 deletion conf/docker-aio/run-test-suite.sh
@@ -8,4 +8,4 @@ fi

# Please note the "dataverse.test.baseurl" is set to run for "all-in-one" Docker environment.
# TODO: Rather than hard-coding the list of "IT" classes here, add a profile to pom.xml.
-mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT -Ddataverse.test.baseurl=$dvurl
+mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT,FileTypeDetectionIT -Ddataverse.test.baseurl=$dvurl
4 changes: 4 additions & 0 deletions conf/jhove/jhove.conf
@@ -40,4 +40,8 @@
<module>
<class>edu.harvard.hul.ois.jhove.module.Utf8Module</class>
</module>
<module>
<!-- https://github.com/openpreserve/jhove/blob/v1.20.1/jhove-ext-modules/src/main/java/com/mcgath/jhove/module/PngModule.java#L1 -->
<class>com.mcgath.jhove.module.PngModule</class>
</module>
</jhoveConfig>
5 changes: 5 additions & 0 deletions doc/sphinx-guides/source/admin/troubleshooting.rst
@@ -71,3 +71,8 @@ In real life production use, it may be possible to end up in a situation where s
(contrary to what the message suggests, there are no specific "details" anywhere in the stack trace that would explain what values violate which constraints)

To identify the specific invalid values in the affected datasets, or to check all the datasets in the Dataverse for constraint violations, see :ref:`Dataset Validation <dataset-validation-api>` in the :doc:`/api/native-api` section of the User Guide.

Many Files with a File Type of "Unknown", "Application", or "Binary"
--------------------------------------------------------------------

From the home page of a Dataverse installation you can get a count of files by file type by clicking "Files" and then scrolling down to "File Type". If you see a lot of files that are "Unknown", "Application", or "Binary" you can have Dataverse attempt to redetect the file type by using the :ref:`Redetect File Type <redetect-file-type>` API endpoint.
21 changes: 21 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -444,6 +444,8 @@ A more detailed "add" example using curl::

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@data.tsv' -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"true"}' "https://example.dataverse.edu/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

Please note that it's possible to "trick" Dataverse into giving a file a content type (MIME type) of your choosing. For example, you can make a text file be treated like a video file with ``-F 'file=@README.txt;type=video/mpeg4'``. If Dataverse does not properly detect a file type, specifying the content type via API like this is a potential workaround.

Example python code to add a file. This may be run by changing these parameters in the sample code:

* ``dataverse_server`` - e.g. https://demo.dataverse.org
@@ -738,6 +740,25 @@ Note that this requires "superuser" credentials::

Note: at present, the API cannot be used on a file that's already successfully ingested as tabular.

.. _redetect-file-type:

Redetect File Type
~~~~~~~~~~~~~~~~~~

Dataverse uses a variety of methods for determining file types (MIME types or content types) and these methods (listed below) are updated periodically. If you have files that have an unknown file type, you can have Dataverse attempt to redetect the file type.

When using the curl command below, you can pass ``dryRun=true`` if you don't want any changes to be saved to the database. Change this to ``dryRun=false`` (or omit it) to save the change. In the example below, the file is identified by database id "42".

``export FILE_ID=42``

``curl -H "X-Dataverse-key:$API_TOKEN" -X POST $SERVER_URL/api/files/$FILE_ID/redetect?dryRun=true``

Currently the following methods are used to detect file types:

- The file type detected by the browser (or sent via API).
- JHOVE: http://jhove.openpreservation.org
- As a last resort the file extension (e.g. ".ipynb") is used, defined in a file called ``MimeTypeDetectionByFileExtension.properties``.

Replacing Files
~~~~~~~~~~~~~~~

19 changes: 10 additions & 9 deletions pom.xml
@@ -34,6 +34,7 @@
<junit.platform.version>1.3.1</junit.platform.version>
<mockito.version>2.22.0</mockito.version>
<flyway.version>5.2.4</flyway.version>
<jhove.version>1.20.1</jhove.version>
<!--
Jacoco 0.8.2 seems to break Netbeans code coverage integration so we'll use 0.8.1 instead.
See https://github.com/jacoco/jacoco/issues/772 for discussion of how the XML changed.
@@ -376,19 +377,19 @@
<version>4.0.0</version>
</dependency>
<dependency>
-<groupId>edu.harvard.hul.ois.jhove</groupId>
-<artifactId>jhove</artifactId>
-<version>1.11.0</version>
+<groupId>org.openpreservation.jhove</groupId>
+<artifactId>jhove-core</artifactId>
+<version>${jhove.version}</version>
</dependency>
<dependency>
-<groupId>edu.harvard.hul.ois.jhove</groupId>
-<artifactId>jhove-module</artifactId>
-<version>1.11.0</version>
+<groupId>org.openpreservation.jhove</groupId>
+<artifactId>jhove-modules</artifactId>
+<version>${jhove.version}</version>
</dependency>
<dependency>
-<groupId>edu.harvard.hul.ois.jhove</groupId>
-<artifactId>jhove-handler</artifactId>
-<version>1.11.0</version>
+<groupId>org.openpreservation.jhove</groupId>
+<artifactId>jhove-ext-modules</artifactId>
+<version>${jhove.version}</version>
</dependency>
<!-- JAI (Java Advanced Imaging) jars: -->
<dependency>
22 changes: 21 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/Files.java
@@ -25,6 +25,7 @@
import edu.harvard.iq.dataverse.engine.command.impl.DeleteMapLayerMetadataCommand;
import edu.harvard.iq.dataverse.engine.command.impl.GetDataFileCommand;
import edu.harvard.iq.dataverse.engine.command.impl.GetDraftFileMetadataIfAvailableCommand;
import edu.harvard.iq.dataverse.engine.command.impl.RedetectFileTypeCommand;
import edu.harvard.iq.dataverse.engine.command.impl.RestrictFileCommand;
import edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand;
import edu.harvard.iq.dataverse.engine.command.impl.UningestFileCommand;
@@ -38,6 +39,7 @@
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.StringUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
@@ -63,6 +65,7 @@
import org.glassfish.jersey.media.multipart.FormDataContentDisposition;
import org.glassfish.jersey.media.multipart.FormDataParam;
import java.util.List;
import javax.ws.rs.QueryParam;

@Path("files")
public class Files extends AbstractApiBean {
@@ -575,7 +578,24 @@ public Response reingest(@PathParam("id") String id) {
return ok("Datafile " + id + " queued for ingest");

}


@Path("{id}/redetect")
@POST
public Response redetectDatafile(@PathParam("id") String id, @QueryParam("dryRun") boolean dryRun) {
try {
DataFile dataFileIn = findDataFileOrDie(id);
String originalContentType = dataFileIn.getContentType();
DataFile dataFileOut = execCommand(new RedetectFileTypeCommand(createDataverseRequest(findUserOrDie()), dataFileIn, dryRun));
NullSafeJsonBuilder result = NullSafeJsonBuilder.jsonObjectBuilder()
.add("dryRun", dryRun)
.add("oldContentType", originalContentType)
.add("newContentType", dataFileOut.getContentType());
return ok(result);
} catch (WrappedResponse wr) {
return wr.getResponse();
}
}

/**
* Attempting to run metadata export, for all the formats for which we have
* metadata Exporters.
@@ -0,0 +1,75 @@
package edu.harvard.iq.dataverse.engine.command.impl;

import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.authorization.Permission;
import edu.harvard.iq.dataverse.dataaccess.DataAccess;
import edu.harvard.iq.dataverse.engine.command.AbstractCommand;
import edu.harvard.iq.dataverse.engine.command.CommandContext;
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
import edu.harvard.iq.dataverse.engine.command.RequiredPermissions;
import edu.harvard.iq.dataverse.engine.command.exception.CommandException;
import edu.harvard.iq.dataverse.export.ExportException;
import edu.harvard.iq.dataverse.export.ExportService;
import edu.harvard.iq.dataverse.util.EjbUtil;
import edu.harvard.iq.dataverse.util.FileTypeDetection;
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.logging.Logger;
import javax.ejb.EJBException;

@RequiredPermissions(Permission.EditDataset)
public class RedetectFileTypeCommand extends AbstractCommand<DataFile> {

private static final Logger logger = Logger.getLogger(RedetectFileTypeCommand.class.getCanonicalName());

final DataFile fileToRedetect;
final boolean dryRun;

public RedetectFileTypeCommand(DataverseRequest dataveseRequest, DataFile dataFile, boolean dryRun) {
super(dataveseRequest, dataFile);
this.fileToRedetect = dataFile;
this.dryRun = dryRun;
}

@Override
public DataFile execute(CommandContext ctxt) throws CommandException {
DataFile filetoReturn = null;
Path path;
try {
// FIXME: Get this working with S3 and Swift.
path = DataAccess.getStorageIO(fileToRedetect).getFileSystemPath();
logger.fine("path: " + path);
File file = path.toFile();
String newlyDetectedContentType = FileTypeDetection.determineFileType(file);
fileToRedetect.setContentType(newlyDetectedContentType);
} catch (IOException ex) {
throw new CommandException("Exception while attempting to get the bytes of the file during file type redetection: " + ex.getLocalizedMessage(), this);
}
filetoReturn = fileToRedetect;
if (!dryRun) {
try {
filetoReturn = ctxt.files().save(fileToRedetect);
} catch (EJBException ex) {
throw new CommandException("Exception while attempting to save the new file type: " + EjbUtil.ejbExceptionToString(ex), this);
}
Dataset dataset = fileToRedetect.getOwner();
try {
boolean doNormalSolrDocCleanUp = true;
ctxt.index().indexDataset(dataset, doNormalSolrDocCleanUp);
} catch (Exception ex) {
logger.info("Exception while reindexing files during file type redetection: " + ex.getLocalizedMessage());
}
try {
ExportService instance = ExportService.getInstance(ctxt.settings());
instance.exportAllFormats(dataset);
} catch (ExportException ex) {
// Just like with indexing, a failure to export is not a fatal condition.
logger.info("Exception while exporting metadata files during file type redetection: " + ex.getLocalizedMessage());
}
}
return filetoReturn;
}

}
12 changes: 12 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/util/FileTypeDetection.java
@@ -0,0 +1,12 @@
package edu.harvard.iq.dataverse.util;

import java.io.File;
import java.io.IOException;

public class FileTypeDetection {

public static String determineFileType(File file) throws IOException {
return FileUtil.determineFileType(file, file.getName());
}

}
32 changes: 27 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
@@ -78,6 +78,7 @@
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import static edu.harvard.iq.dataverse.datasetutility.FileSizeChecker.bytesToHumanReadable;
import org.apache.commons.io.FilenameUtils;


/**
@@ -414,13 +415,34 @@ public static String determineFileType(File f, String fileName) throws IOExcepti
logger.fine("returning fileType "+fileType);
return fileType;
}

public static String determineFileTypeByExtension(String fileName) {
-logger.fine("Type by extension, for "+fileName+": "+MIME_TYPE_MAP.getContentType(fileName));
-return MIME_TYPE_MAP.getContentType(fileName);
+String mimetypesFileTypeMapResult = MIME_TYPE_MAP.getContentType(fileName);
+logger.fine("MimetypesFileTypeMap type by extension, for " + fileName + ": " + mimetypesFileTypeMapResult);
+if (mimetypesFileTypeMapResult != null) {
+if ("application/octet-stream".equals(mimetypesFileTypeMapResult)) {
+return lookupFileTypeFromPropertiesFile(fileName);
+} else {
+return mimetypesFileTypeMapResult;
+}
+} else {
+return null;
+}
}



+public static String lookupFileTypeFromPropertiesFile(String fileName) {
+String fileExtension = FilenameUtils.getExtension(fileName);
+String propertyFileName = "MimeTypeDetectionByFileExtension";
+String propertyFileNameOnDisk = propertyFileName + ".properties";
+try {
+logger.fine("checking " + propertyFileNameOnDisk + " for file extension " + fileExtension);
+return BundleUtil.getStringFromPropertyFile(fileExtension, propertyFileName);
+} catch (MissingResourceException ex) {
+logger.info(fileExtension + " is a file extension Dataverse doesn't know about. Consider adding it to the " + propertyFileNameOnDisk + " file.");
+return null;
+}
+}

/*
* Custom method for identifying FITS files:
* TODO:
15 changes: 10 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/util/JhoveFileType.java
@@ -19,10 +19,14 @@
*/
package edu.harvard.iq.dataverse.util;

-import edu.harvard.hul.ois.jhove.*;
-import java.io.*;
-import java.util.*;
-import static java.lang.System.*;
+import edu.harvard.hul.ois.jhove.App;
+import edu.harvard.hul.ois.jhove.JhoveBase;
+import edu.harvard.hul.ois.jhove.Module;
+import edu.harvard.hul.ois.jhove.RepInfo;
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Properties;
import java.util.logging.Logger;

/**
@@ -69,7 +73,8 @@ public RepInfo checkFileType(File file) {
try {
// initialize the application spec object
// name, release number, build date, usage, Copyright infor
-App jhoveApp = new App("Jhove", "1.11",
+// TODO: Should the release number come from pom.xml as we upgrade from 1.11.0 to 1.20.1?
+App jhoveApp = new App("Jhove", "1.20.1",
ORIGINAL_RELEASE_DATE, "Java JhoveFileType",
ORIGINAL_COPR_RIGHTS);

@@ -0,0 +1,31 @@
7z=application/x-7z-compressed
ado=application/x-stata-ado
dbf=application/dbf
dcm=application/dicom
docx=application/vnd.openxmlformats-officedocument.wordprocessingml.document
emf=application/x-emf
h5=application/x-h5
hdf=application/x-hdf
hdf5=application/x-hdf5
ipynb=application/x-ipynb+json
json=application/json
m=text/x-matlab
mat=application/matlab-mat
mp3=audio/mp3
nii=image/nii
nc=application/netcdf
ods=application/vnd.oasis.opendocument.spreadsheet
png=image/png
pptx=application/vnd.openxmlformats-officedocument.presentationml.presentation
prj=application/prj
py=text/x-python
rar=application/rar
sas=application/x-sas
sbn=application/sbn
sbx=application/sbx
shp=application/shp
shx=application/shx
smcl=application/x-stata-smcl
swc=application/x-swc
xz=application/x-xz
xlsx=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
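
For readers following the last-resort step, here is a minimal sketch of how the extension fallback in `determineFileTypeByExtension` appears to behave, based only on the hunk above. It is not the Dataverse implementation: the class name `ExtensionFallbackSketch`, the method names, and the in-memory map are hypothetical stand-ins. The map replaces the properties file, the `genericType` argument replaces the `MimetypesFileTypeMap` result, and the extension parsing is a simplification of `FilenameUtils.getExtension`.

```java
import java.util.Map;

// Illustrative sketch only; names here are hypothetical, not Dataverse code.
class ExtensionFallbackSketch {

    // Hypothetical in-memory stand-in for a few entries of the properties file.
    static final Map<String, String> BY_EXTENSION = Map.of(
            "ipynb", "application/x-ipynb+json",
            "7z", "application/x-7z-compressed",
            "py", "text/x-python");

    // Simplified extension parsing: the text after the last dot.
    // The PR itself uses Apache Commons IO FilenameUtils.getExtension.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // genericType stands in for the MimetypesFileTypeMap lookup result.
    static String typeByExtension(String genericType, String fileName) {
        if (genericType == null) {
            return null;
        }
        if (!"application/octet-stream".equals(genericType)) {
            return genericType; // a specific type was already found; keep it
        }
        // Generic binary type: consult the extension table; null if unknown.
        return BY_EXTENSION.get(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(typeByExtension("application/octet-stream", "analysis.ipynb"));
        System.out.println(typeByExtension("text/plain", "README.txt"));
        System.out.println(typeByExtension("application/octet-stream", "mystery.xyz"));
    }
}
```

The point of the sketch is the ordering: the extension table is consulted only when the earlier lookup yields the generic `application/octet-stream`, so adding an entry to the properties file should not override a more specific type that was already detected.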