diff --git a/conf/docker-aio/run-test-suite.sh b/conf/docker-aio/run-test-suite.sh index 5a584e39395..7ecc5009b0a 100755 --- a/conf/docker-aio/run-test-suite.sh +++ b/conf/docker-aio/run-test-suite.sh @@ -8,4 +8,4 @@ fi # Please note the "dataverse.test.baseurl" is set to run for "all-in-one" Docker environment. # TODO: Rather than hard-coding the list of "IT" classes here, add a profile to pom.xml. -mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT -Ddataverse.test.baseurl=$dvurl +mvn test -Dtest=DataversesIT,DatasetsIT,SwordIT,AdminIT,BuiltinUsersIT,UsersIT,UtilIT,ConfirmEmailIT,FileMetadataIT,FilesIT,SearchIT,InReviewWorkflowIT,HarvestingServerIT,MoveIT,MakeDataCountApiIT,FileTypeDetectionIT -Ddataverse.test.baseurl=$dvurl diff --git a/conf/jhove/jhove.conf b/conf/jhove/jhove.conf index 261a2e16988..17a7c5e0530 100644 --- a/conf/jhove/jhove.conf +++ b/conf/jhove/jhove.conf @@ -40,4 +40,15 @@ edu.harvard.hul.ois.jhove.module.Utf8Module + + + edu.harvard.hul.ois.jhove.module.GzipModule + + + edu.harvard.hul.ois.jhove.module.WarcModule + + + + com.mcgath.jhove.module.PngModule + diff --git a/doc/release-notes/2202-improved-file-detection.md b/doc/release-notes/2202-improved-file-detection.md new file mode 100644 index 00000000000..ba8b5e33f33 --- /dev/null +++ b/doc/release-notes/2202-improved-file-detection.md @@ -0,0 +1,5 @@ +Upgrade instructions: + +This release includes a new version of Jhove, the file type detection software, which requires an update to its configuration file, ``jhove.conf``. Download the new configuration file from the Dataverse release page on GitHub, or from the source tree at https://github.com/IQSS/dataverse/blob/master/conf/jhove/jhove.conf, and place it in ``/config/``. For example: ``/usr/local/glassfish4/glassfish/domains/domain1/config/jhove.conf``.
+ +**Important:** If your Glassfish installation directory is different from ``/usr/local/glassfish4``, make sure to edit the header of the config file to reflect the correct location. diff --git a/doc/sphinx-guides/source/admin/troubleshooting.rst b/doc/sphinx-guides/source/admin/troubleshooting.rst index 8cec4431947..3e8cfbfa62f 100644 --- a/doc/sphinx-guides/source/admin/troubleshooting.rst +++ b/doc/sphinx-guides/source/admin/troubleshooting.rst @@ -71,3 +71,8 @@ In real life production use, it may be possible to end up in a situation where s (contrary to what the message suggests, there are no specific "details" anywhere in the stack trace that would explain what values violate which constraints) To identify the specific invalid values in the affected datasets, or to check all the datasets in the Dataverse for constraint violations, see :ref:`Dataset Validation ` in the :doc:`/api/native-api` section of the User Guide. + +Many Files with a File Type of "Unknown", "Application", or "Binary" +-------------------------------------------------------------------- + +From the home page of a Dataverse installation you can get a count of files by file type by clicking "Files" and then scrolling down to "File Type". If you see a lot of files that are "Unknown", "Application", or "Binary", you can have Dataverse attempt to redetect the file type by using the :ref:`Redetect File Type ` API endpoint.
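The troubleshooting entry above points at the redetect endpoint for cleaning up files with unhelpful types. As a hedged sketch of how an admin might drive it in bulk (the server URL, API token, and file ids below are all placeholders, not values from this PR):

```shell
#!/bin/sh
# All values below are placeholders: substitute your own installation
# URL, a superuser API token, and the ids of the affected files.
SERVER_URL="https://demo.dataverse.org"
API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Build the redetect URL for one file id; dryRun=true previews the new
# content type without saving anything to the database.
redetect_url() {
  echo "$SERVER_URL/api/files/$1/redetect?dryRun=$2"
}

# Preview redetection for a few ids; drop the leading "echo" (and switch
# to dryRun=false) once the previewed results look right.
for id in 42 43 44; do
  echo curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$(redetect_url "$id" true)"
done
```

Previewing with ``dryRun=true`` first mirrors the behavior the API docs in this PR describe: nothing is written until ``dryRun`` is false or omitted.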
diff --git a/doc/sphinx-guides/source/api/native-api.rst b/doc/sphinx-guides/source/api/native-api.rst index 0287e0d1dff..8e3b2d81c22 100644 --- a/doc/sphinx-guides/source/api/native-api.rst +++ b/doc/sphinx-guides/source/api/native-api.rst @@ -446,6 +446,8 @@ A more detailed "add" example using curl:: curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@data.tsv' -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"true"}' "https://example.dataverse.edu/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" +Please note that it's possible to "trick" Dataverse into giving a file a content type (MIME type) of your choosing. For example, you can make a text file be treated like a video file with ``-F 'file=@README.txt;type=video/mpeg4'``. If Dataverse does not properly detect a file type, specifying the content type via API like this is a potential workaround. + Example python code to add a file. This may be run by changing these parameters in the sample code: * ``dataverse_server`` - e.g. https://demo.dataverse.org @@ -740,6 +742,25 @@ Note that this requires "superuser" credentials:: Note: at present, the API cannot be used on a file that's already successfully ingested as tabular. +.. _redetect-file-type: + +Redetect File Type +~~~~~~~~~~~~~~~~~~ + +Dataverse uses a variety of methods for determining file types (MIME types or content types) and these methods (listed below) are updated periodically. If you have files that have an unknown file type, you can have Dataverse attempt to redetect the file type. + +When using the curl command below, you can pass ``dryRun=true`` if you don't want any changes to be saved to the database. Change this to ``dryRun=false`` (or omit it) to save the change. In the example below, the file is identified by database id "42".
+ +``export FILE_ID=42`` + +``curl -H "X-Dataverse-key:$API_TOKEN" -X POST $SERVER_URL/api/files/$FILE_ID/redetect?dryRun=true`` + +Currently the following methods are used to detect file types: + +- The file type detected by the browser (or sent via API). +- JHOVE: http://jhove.openpreservation.org +- As a last resort the file extension (e.g. ".ipynb") is used, defined in a file called ``MimeTypeDetectionByFileExtension.properties``. + Replacing Files ~~~~~~~~~~~~~~~ diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar deleted file mode 100644 index 8d5509f4f20..00000000000 Binary files a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar and /dev/null differ diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.md5 deleted file mode 100644 index 9840dffe677..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.md5 +++ /dev/null @@ -1 +0,0 @@ -f9bb7a20a9d538819606ec1630d661fe diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.sha1 b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.sha1 deleted file mode 100644 index 8d2333a8c2b..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.jar.sha1 +++ /dev/null @@ -1 +0,0 @@ -37a9d8e464a57b90c04252f265572e5274beb605 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom deleted file mode 100644 index d22906ef787..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom +++ /dev/null @@ -1,8 +0,0 @@ - - - 4.0.0 - 
edu.harvard.hul.ois.jhove - jhove-handler - 1.11.0 - diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.md5 deleted file mode 100644 index e248ce1a5df..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.md5 +++ /dev/null @@ -1 +0,0 @@ -c2d1a458dc809cb3833f3b362a23ed79 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.sha1 b/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.sha1 deleted file mode 100644 index e3dba1303d9..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-handler/1.11.0/jhove-handler-1.11.0.pom.sha1 +++ /dev/null @@ -1 +0,0 @@ -0f195ee47691c7ee8611db63b6d5ee262c139129 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar deleted file mode 100644 index 1ba8229674c..00000000000 Binary files a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar and /dev/null differ diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.md5 deleted file mode 100644 index 5643f23cdf3..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.md5 +++ /dev/null @@ -1 +0,0 @@ -c3605bd6434ebeef82ef655d21075652 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.sha1 b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.sha1 deleted file mode 100644 index 38510b3afc3..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.jar.sha1 +++ /dev/null @@ -1 +0,0 @@ -d8dc496b4d408dd6a9ed7429e6fa4d1ce5f57403 
diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom deleted file mode 100644 index 464753f15e5..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom +++ /dev/null @@ -1,8 +0,0 @@ - - - 4.0.0 - edu.harvard.hul.ois.jhove - jhove-module - 1.11.0 - diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.md5 deleted file mode 100644 index 4d11568ae43..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.md5 +++ /dev/null @@ -1 +0,0 @@ -bcac19fbdf825c5e93e785413815b998 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.sha1 b/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.sha1 deleted file mode 100644 index 01ca799d4c1..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove-module/1.11.0/jhove-module-1.11.0.pom.sha1 +++ /dev/null @@ -1 +0,0 @@ -1f983c8cf895056f4d4efe7a717b8d73d5c6b091 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar deleted file mode 100644 index 8fc6078e64b..00000000000 Binary files a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar and /dev/null differ diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.md5 deleted file mode 100644 index f34f0d62da1..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.md5 +++ /dev/null @@ -1 +0,0 @@ -3f6f413fb54c5142f2e34837bb9369b4 diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.sha1 
b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.sha1 deleted file mode 100644 index 772766ad997..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.jar.sha1 +++ /dev/null @@ -1 +0,0 @@ -475409b6444aba6bdc96ce42431b6d601c7abe5f diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom deleted file mode 100644 index c0096eaf03d..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom +++ /dev/null @@ -1,8 +0,0 @@ - - - 4.0.0 - edu.harvard.hul.ois.jhove - jhove - 1.11.0 - diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.md5 b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.md5 deleted file mode 100644 index 433c1031bcd..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.md5 +++ /dev/null @@ -1 +0,0 @@ -7f9939585e369ad60ac1f8a99b2fa75f diff --git a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.sha1 b/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.sha1 deleted file mode 100644 index acfde074c96..00000000000 --- a/local_lib/edu/harvard/hul/ois/jhove/jhove/1.11.0/jhove-1.11.0.pom.sha1 +++ /dev/null @@ -1 +0,0 @@ -804fffb163526c6bea975038702ea90f24f89419 diff --git a/pom.xml b/pom.xml index 95ba816701d..e15958d221c 100644 --- a/pom.xml +++ b/pom.xml @@ -34,6 +34,7 @@ 1.3.1 2.22.0 5.2.4 + 1.20.1 diff --git a/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java index d7eda3cb948..a35bfb0df15 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java @@ -66,20 +66,6 @@ public class DataFileServiceBean implements java.io.Serializable { @PersistenceContext(unitName = "VDCNet-ejbPU") private EntityManager em; - // File type "classes" 
tags: - - private static final String FILE_CLASS_AUDIO = "audio"; - private static final String FILE_CLASS_CODE = "code"; - private static final String FILE_CLASS_DOCUMENT = "document"; - private static final String FILE_CLASS_ASTRO = "astro"; - private static final String FILE_CLASS_IMAGE = "image"; - private static final String FILE_CLASS_NETWORK = "network"; - private static final String FILE_CLASS_GEO = "geodata"; - private static final String FILE_CLASS_TABULAR = "tabular"; - private static final String FILE_CLASS_VIDEO = "video"; - private static final String FILE_CLASS_PACKAGE = "package"; - private static final String FILE_CLASS_OTHER = "other"; - // Assorted useful mime types: // 3rd-party and/or proprietary tabular data formasts that we know @@ -1151,51 +1137,25 @@ public String getFileClassById (Long fileId) { return null; } - return getFileClass(file); + return getFileThumbnailClass(file); } - public String getFileClass (DataFile file) { - if (isFileClassImage(file)) { - return FILE_CLASS_IMAGE; - } - - if (isFileClassVideo(file)) { - return FILE_CLASS_VIDEO; - } - - if (isFileClassAudio(file)) { - return FILE_CLASS_AUDIO; - } - - if (isFileClassCode(file)) { - return FILE_CLASS_CODE; - } - - if (isFileClassDocument(file)) { - return FILE_CLASS_DOCUMENT; - } - - if (isFileClassAstro(file)) { - return FILE_CLASS_ASTRO; - } - - if (isFileClassNetwork(file)) { - return FILE_CLASS_NETWORK; - } - - if (isFileClassGeo(file)) { - return FILE_CLASS_GEO; + public String getFileThumbnailClass (DataFile file) { + // there's no solr search facet for "package files", but + // there is a special thumbnail icon: + if (isFileClassPackage(file)) { + return FileUtil.FILE_THUMBNAIL_CLASS_PACKAGE; } - if (isFileClassTabularData(file)) { - return FILE_CLASS_TABULAR; - } + if (file != null) { + String fileTypeFacet = FileUtil.getFacetFileType(file); - if (isFileClassPackage(file)) { - return FILE_CLASS_PACKAGE; + if (fileTypeFacet != null && 
FileUtil.FILE_THUMBNAIL_CLASSES.containsKey(fileTypeFacet)) { + return FileUtil.FILE_THUMBNAIL_CLASSES.get(fileTypeFacet); + } } - return FILE_CLASS_OTHER; + return FileUtil.FILE_THUMBNAIL_CLASS_OTHER; } diff --git a/src/main/java/edu/harvard/iq/dataverse/api/Files.java b/src/main/java/edu/harvard/iq/dataverse/api/Files.java index 113332b345f..f304444a7f3 100644 --- a/src/main/java/edu/harvard/iq/dataverse/api/Files.java +++ b/src/main/java/edu/harvard/iq/dataverse/api/Files.java @@ -25,6 +25,7 @@ import edu.harvard.iq.dataverse.engine.command.impl.DeleteMapLayerMetadataCommand; import edu.harvard.iq.dataverse.engine.command.impl.GetDataFileCommand; import edu.harvard.iq.dataverse.engine.command.impl.GetDraftFileMetadataIfAvailableCommand; +import edu.harvard.iq.dataverse.engine.command.impl.RedetectFileTypeCommand; import edu.harvard.iq.dataverse.engine.command.impl.RestrictFileCommand; import edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand; import edu.harvard.iq.dataverse.engine.command.impl.UningestFileCommand; @@ -38,6 +39,7 @@ import edu.harvard.iq.dataverse.util.FileUtil; import edu.harvard.iq.dataverse.util.StringUtil; import edu.harvard.iq.dataverse.util.SystemConfig; +import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder; import java.io.InputStream; import java.util.ArrayList;
import java.util.Arrays; @@ -63,6 +65,7 @@ import org.glassfish.jersey.media.multipart.FormDataContentDisposition; import org.glassfish.jersey.media.multipart.FormDataParam; import java.util.List; +import javax.ws.rs.QueryParam; @Path("files") public class Files extends AbstractApiBean { @@ -575,7 +578,24 @@ public Response reingest(@PathParam("id") String id) { return ok("Datafile " + id + " queued for ingest"); } - + + @Path("{id}/redetect") + @POST + public Response redetectDatafile(@PathParam("id") String id, @QueryParam("dryRun") boolean dryRun) { + try { + DataFile dataFileIn = findDataFileOrDie(id); + String originalContentType = dataFileIn.getContentType(); + DataFile dataFileOut = execCommand(new RedetectFileTypeCommand(createDataverseRequest(findUserOrDie()), dataFileIn, dryRun)); + NullSafeJsonBuilder result = NullSafeJsonBuilder.jsonObjectBuilder() + .add("dryRun", dryRun) + .add("oldContentType", originalContentType) + .add("newContentType", dataFileOut.getContentType()); + return ok(result); + } catch (WrappedResponse wr) { + return wr.getResponse(); + } + } + /** * Attempting to run metadata export, for all the formats for which we have * metadata Exporters. 
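The ``redetect`` endpoint added to ``Files.java`` above wraps ``dryRun``, ``oldContentType``, and ``newContentType`` in the standard Dataverse JSON envelope. A small shell sketch of reading those fields back out of a response; the JSON body here is illustrative of the builder's field names, not captured from a real server:

```shell
#!/bin/sh
# Illustrative response body; a real one would come from e.g.:
#   curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/42/redetect?dryRun=true"
response='{"status":"OK","data":{"dryRun":true,"oldContentType":"application/octet-stream","newContentType":"image/png"}}'

# Crude extraction of a string-valued field without jq: strip everything
# up to and including '"<key>":"', then everything from the next quote on.
# (Only works for string values, so not for the boolean "dryRun".)
json_field() {
  rest=${2#*\"$1\":\"}
  echo "${rest%%\"*}"
}

echo "old content type: $(json_field oldContentType "$response")"
echo "new content type: $(json_field newContentType "$response")"
```

In practice a tool like jq is the sturdier choice; the parameter-expansion trick is just to keep the sketch dependency-free.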
diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/RedetectFileTypeCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/RedetectFileTypeCommand.java new file mode 100644 index 00000000000..0477a483783 --- /dev/null +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/RedetectFileTypeCommand.java @@ -0,0 +1,102 @@ +package edu.harvard.iq.dataverse.engine.command.impl; + +import edu.harvard.iq.dataverse.DataFile; +import edu.harvard.iq.dataverse.Dataset; +import edu.harvard.iq.dataverse.authorization.Permission; +import edu.harvard.iq.dataverse.dataaccess.StorageIO; +import edu.harvard.iq.dataverse.engine.command.AbstractCommand; +import edu.harvard.iq.dataverse.engine.command.CommandContext; +import edu.harvard.iq.dataverse.engine.command.DataverseRequest; +import edu.harvard.iq.dataverse.engine.command.RequiredPermissions; +import edu.harvard.iq.dataverse.engine.command.exception.CommandException; +import edu.harvard.iq.dataverse.export.ExportException; +import edu.harvard.iq.dataverse.export.ExportService; +import edu.harvard.iq.dataverse.util.EjbUtil; +import edu.harvard.iq.dataverse.util.FileTypeDetection; +import java.io.File; +import java.io.FileOutputStream; +import java.io.IOException; +import java.nio.channels.FileChannel; +import java.nio.channels.ReadableByteChannel; +import java.util.logging.Logger; +import javax.ejb.EJBException; + +@RequiredPermissions(Permission.EditDataset) +public class RedetectFileTypeCommand extends AbstractCommand { + + private static final Logger logger = Logger.getLogger(RedetectFileTypeCommand.class.getCanonicalName()); + + final DataFile fileToRedetect; + final boolean dryRun; + + public RedetectFileTypeCommand(DataverseRequest dataverseRequest, DataFile dataFile, boolean dryRun) { + super(dataverseRequest, dataFile); + this.fileToRedetect = dataFile; + this.dryRun = dryRun; + } + + @Override + public DataFile execute(CommandContext ctxt) throws CommandException { + DataFile
filetoReturn = null; + File tempFile = null; + File localFile; + + + try { + StorageIO storageIO; + + storageIO = fileToRedetect.getStorageIO(); + storageIO.open(); + + if (storageIO.isLocalFile()) { + localFile = storageIO.getFileSystemPath().toFile(); + } else { + // Need to create a temporary local file: + + ReadableByteChannel targetFileChannel = (ReadableByteChannel) storageIO.getReadChannel(); + tempFile = File.createTempFile("tempFileTypeCheck", ".tmp"); + FileChannel tempFileChannel = new FileOutputStream(tempFile).getChannel(); + tempFileChannel.transferFrom(targetFileChannel, 0, storageIO.getSize()); + // Close the channel (and its underlying stream) so the file descriptor is released: + tempFileChannel.close(); + + localFile = tempFile; + } + + logger.fine("target file: " + localFile); + String newlyDetectedContentType = FileTypeDetection.determineFileType(localFile); + fileToRedetect.setContentType(newlyDetectedContentType); + } catch (IOException ex) { + throw new CommandException("Exception while attempting to get the bytes of the file during file type redetection: " + ex.getLocalizedMessage(), this); + } finally { + // If we had to create a temp file, delete it now: + if (tempFile != null) { + tempFile.delete(); + } + } + + + filetoReturn = fileToRedetect; + if (!dryRun) { + try { + filetoReturn = ctxt.files().save(fileToRedetect); + } catch (EJBException ex) { + throw new CommandException("Exception while attempting to save the new file type: " + EjbUtil.ejbExceptionToString(ex), this); + } + Dataset dataset = fileToRedetect.getOwner(); + try { + boolean doNormalSolrDocCleanUp = true; + ctxt.index().indexDataset(dataset, doNormalSolrDocCleanUp); + } catch (Exception ex) { + logger.info("Exception while reindexing files during file type redetection: " + ex.getLocalizedMessage()); + } + try { + ExportService instance = ExportService.getInstance(ctxt.settings()); + instance.exportAllFormats(dataset); + } catch (ExportException ex) { + // Just like with indexing, a failure to export is not a fatal condition.
+ logger.info("Exception while exporting metadata files during file type redetection: " + ex.getLocalizedMessage()); + } + } + return filetoReturn; + } + +} diff --git a/src/main/java/edu/harvard/iq/dataverse/util/FileTypeDetection.java b/src/main/java/edu/harvard/iq/dataverse/util/FileTypeDetection.java new file mode 100644 index 00000000000..52515c00524 --- /dev/null +++ b/src/main/java/edu/harvard/iq/dataverse/util/FileTypeDetection.java @@ -0,0 +1,12 @@ +package edu.harvard.iq.dataverse.util; + +import java.io.File; +import java.io.IOException; + +public class FileTypeDetection { + + public static String determineFileType(File file) throws IOException { + return FileUtil.determineFileType(file, file.getName()); + } + +} diff --git a/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java b/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java index f4342f6ab7b..58f4dc223b7 100644 --- a/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java +++ b/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java @@ -78,6 +78,7 @@ import java.util.zip.ZipEntry; import java.util.zip.ZipInputStream; import static edu.harvard.iq.dataverse.datasetutility.FileSizeChecker.bytesToHumanReadable; +import org.apache.commons.io.FilenameUtils; /** @@ -147,6 +148,59 @@ public class FileUtil implements java.io.Serializable { public static final String MIME_TYPE_INGESTED_FILE = "text/tab-separated-values"; + // File type "thumbnail classes" tags: + + public static final String FILE_THUMBNAIL_CLASS_AUDIO = "audio"; + public static final String FILE_THUMBNAIL_CLASS_CODE = "code"; + public static final String FILE_THUMBNAIL_CLASS_DOCUMENT = "document"; + public static final String FILE_THUMBNAIL_CLASS_ASTRO = "astro"; + public static final String FILE_THUMBNAIL_CLASS_IMAGE = "image"; + public static final String FILE_THUMBNAIL_CLASS_NETWORK = "network"; + public static final String FILE_THUMBNAIL_CLASS_GEOSHAPE = "geodata"; + public static final String FILE_THUMBNAIL_CLASS_TABULAR 
= "tabular"; + public static final String FILE_THUMBNAIL_CLASS_VIDEO = "video"; + public static final String FILE_THUMBNAIL_CLASS_PACKAGE = "package"; + public static final String FILE_THUMBNAIL_CLASS_OTHER = "other"; + + // File type facets, as returned by the getFacetFileType() method in this utility: + + private static final String FILE_FACET_CLASS_ARCHIVE = "Archive"; + private static final String FILE_FACET_CLASS_AUDIO = "Audio"; + private static final String FILE_FACET_CLASS_CODE = "Code"; + private static final String FILE_FACET_CLASS_DATA = "Data"; + private static final String FILE_FACET_CLASS_DOCUMENT = "Document"; + private static final String FILE_FACET_CLASS_ASTRO = "FITS"; + private static final String FILE_FACET_CLASS_IMAGE = "Image"; + private static final String FILE_FACET_CLASS_NETWORK = "Network Data"; + private static final String FILE_FACET_CLASS_GEOSHAPE = "Shape"; + private static final String FILE_FACET_CLASS_TABULAR = "Tabular Data"; + private static final String FILE_FACET_CLASS_VIDEO = "Video"; + private static final String FILE_FACET_CLASS_TEXT = "Text"; + private static final String FILE_FACET_CLASS_OTHER = "Other"; + private static final String FILE_FACET_CLASS_UNKNOWN = "Unknown"; + + // The file type facets and type-specific thumbnail classes (above) are + // very similar, but not exactly 1:1; so the following map is for + // maintaining the relationship between the two: + + public static Map FILE_THUMBNAIL_CLASSES = new HashMap(); + + static { + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_VIDEO, FILE_THUMBNAIL_CLASS_VIDEO); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_AUDIO, FILE_THUMBNAIL_CLASS_AUDIO); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_CODE, FILE_THUMBNAIL_CLASS_CODE); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_DATA, FILE_THUMBNAIL_CLASS_TABULAR); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_NETWORK, FILE_THUMBNAIL_CLASS_NETWORK); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_ASTRO, FILE_THUMBNAIL_CLASS_ASTRO); + 
FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_IMAGE, FILE_THUMBNAIL_CLASS_IMAGE); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_DOCUMENT, FILE_THUMBNAIL_CLASS_DOCUMENT); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_GEOSHAPE, FILE_THUMBNAIL_CLASS_GEOSHAPE); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_TABULAR, FILE_THUMBNAIL_CLASS_TABULAR); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_TEXT, FILE_THUMBNAIL_CLASS_DOCUMENT); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_OTHER, FILE_THUMBNAIL_CLASS_OTHER); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_UNKNOWN, FILE_THUMBNAIL_CLASS_OTHER); + FILE_THUMBNAIL_CLASSES.put(FILE_FACET_CLASS_ARCHIVE, FILE_THUMBNAIL_CLASS_PACKAGE); + } /** * This string can be prepended to a Base64-encoded representation of a PNG @@ -233,11 +287,10 @@ public static String getFacetFileType(DataFile dataFile) { } catch (MissingResourceException e) { // if there's no defined "facet-friendly" form of this mime type // we'll truncate the available type by "/", e.g., all the - // unknown image/* types will become "image"; many other, quite - // different types will all become "application" this way - - // but it is probably still better than to tag them all as - // "uknown". - // -- L.A. 4.0 alpha 1 + // unknown image/* types will become "image". + // Since many other, quite different types would then all become + // "application" - we will use the facet "Other" for all the + // application/* types not specifically defined in the properties file. 
// // UPDATE, MH 4.9.2 // Since production is displaying both "tabulardata" and "Tabular Data" @@ -245,6 +298,9 @@ public static String getFacetFileType(DataFile dataFile) { // in order to capitalize all the unknown types that are not called // out in MimeTypeFacets.properties String typeClass = fileType.split("/")[0]; + if ("application".equalsIgnoreCase(typeClass)) { + return FILE_FACET_CLASS_OTHER; + } return Character.toUpperCase(typeClass.charAt(0)) + typeClass.substring(1); } } else { @@ -414,13 +470,34 @@ public static String determineFileType(File f, String fileName) throws IOExcepti logger.fine("returning fileType "+fileType); return fileType; } - + public static String determineFileTypeByExtension(String fileName) { - logger.fine("Type by extension, for "+fileName+": "+MIME_TYPE_MAP.getContentType(fileName)); - return MIME_TYPE_MAP.getContentType(fileName); + String mimetypesFileTypeMapResult = MIME_TYPE_MAP.getContentType(fileName); + logger.fine("MimetypesFileTypeMap type by extension, for " + fileName + ": " + mimetypesFileTypeMapResult); + if (mimetypesFileTypeMapResult != null) { + if ("application/octet-stream".equals(mimetypesFileTypeMapResult)) { + return lookupFileTypeFromPropertiesFile(fileName); + } else { + return mimetypesFileTypeMapResult; + } + } else { + return null; + } } - - + + public static String lookupFileTypeFromPropertiesFile(String fileName) { + String fileExtension = FilenameUtils.getExtension(fileName); + String propertyFileName = "MimeTypeDetectionByFileExtension"; + String propertyFileNameOnDisk = propertyFileName + ".properties"; + try { + logger.fine("checking " + propertyFileNameOnDisk + " for file extension " + fileExtension); + return BundleUtil.getStringFromPropertyFile(fileExtension, propertyFileName); + } catch (MissingResourceException ex) { + logger.info(fileExtension + " is a file extension Dataverse doesn't know about. 
Consider adding it to the " + propertyFileNameOnDisk + " file."); + return null; + } + } + /* * Custom method for identifying FITS files: * TODO: diff --git a/src/main/java/edu/harvard/iq/dataverse/util/JhoveFileType.java b/src/main/java/edu/harvard/iq/dataverse/util/JhoveFileType.java index 56400d87c41..8a4ed81bc5b 100644 --- a/src/main/java/edu/harvard/iq/dataverse/util/JhoveFileType.java +++ b/src/main/java/edu/harvard/iq/dataverse/util/JhoveFileType.java @@ -19,10 +19,14 @@ */ package edu.harvard.iq.dataverse.util; -import edu.harvard.hul.ois.jhove.*; -import java.io.*; -import java.util.*; -import static java.lang.System.*; +import edu.harvard.hul.ois.jhove.App; +import edu.harvard.hul.ois.jhove.JhoveBase; +import edu.harvard.hul.ois.jhove.Module; +import edu.harvard.hul.ois.jhove.RepInfo; +import java.io.File; +import java.io.IOException; +import java.util.Iterator; +import java.util.Properties; import java.util.logging.Logger; /** @@ -69,7 +73,8 @@ public RepInfo checkFileType(File file) { try { // initialize the application spec object // name, release number, build date, usage, Copyright infor - App jhoveApp = new App("Jhove", "1.11", + // TODO: Should the release number come from pom.xml as we upgrade from 1.11.0 to 1.20.1? 
+ App jhoveApp = new App("Jhove", "1.20.1", ORIGINAL_RELEASE_DATE, "Java JhoveFileType", ORIGINAL_COPR_RIGHTS); diff --git a/src/main/java/propertyFiles/MimeTypeDetectionByFileExtension.properties b/src/main/java/propertyFiles/MimeTypeDetectionByFileExtension.properties new file mode 100644 index 00000000000..7648138f20e --- /dev/null +++ b/src/main/java/propertyFiles/MimeTypeDetectionByFileExtension.properties @@ -0,0 +1,31 @@ +7z=application/x-7z-compressed +ado=application/x-stata-ado +dbf=application/dbf +dcm=application/dicom +docx=application/vnd.openxmlformats-officedocument.wordprocessingml.document +emf=application/x-emf +h5=application/x-h5 +hdf=application/x-hdf +hdf5=application/x-hdf5 +ipynb=application/x-ipynb+json +json=application/json +m=text/x-matlab +mat=application/matlab-mat +mp3=audio/mp3 +nii=image/nii +nc=application/netcdf +ods=application/vnd.oasis.opendocument.spreadsheet +png=image/png +pptx=application/vnd.openxmlformats-officedocument.presentationml.presentation +prj=application/prj +py=text/x-python +rar=application/rar +sas=application/x-sas +sbn=application/sbn +sbx=application/sbx +shp=application/shp +shx=application/shx +smcl=application/x-stata-smcl +swc=application/x-swc +xz=application/x-xz +xlsx=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet \ No newline at end of file diff --git a/src/main/java/propertyFiles/MimeTypeDisplay.properties b/src/main/java/propertyFiles/MimeTypeDisplay.properties index da2693c42b6..29407ccda40 100644 --- a/src/main/java/propertyFiles/MimeTypeDisplay.properties +++ b/src/main/java/propertyFiles/MimeTypeDisplay.properties @@ -1,51 +1,178 @@ # MimeTypeDisplay properties file -# User friendly names for displaying mime types. 
-# Documentation, Data, Archive files: +# User friendly names for displaying mime types +# Documentation application/pdf=Adobe PDF +image/pdf=Adobe PDF +text/pdf=Adobe PDF +application/x-pdf=Adobe PDF application/msword=MS Word -application/vnd.ms-excel=MS Excel -application/vnd.openxmlformats-officedocument.spreadsheetml.sheet=MS Excel (XLSX) -application/vnd.openxmlformats-officedocument.wordprocessingml.document=MS Word (docx) -application/zip=ZIP Archive +application/vnd.ms-excel=MS Excel Spreadsheet +application/vnd.openxmlformats-officedocument.spreadsheetml.sheet=MS Excel Spreadsheet +application/vnd.ms-powerpoint=MS Powerpoint +application/vnd.openxmlformats-officedocument.presentationml.presentation=MS Powerpoint +application/vnd.openxmlformats-officedocument.wordprocessingml.document=MS Word +application/vnd.oasis.opendocument.spreadsheet=OpenOffice Spreadsheet +# Text text/plain=Plain Text +text/html=HTML +application/x-tex=LaTeX +text/x-tex=LaTeX +text/markdown=Markdown Text +text/x-markdown=Markdown Text +text/x-r-markdown=R Markdown Text +application/rtf=Rich Text Format +text/rtf=Rich Text Format +text/richtext=Rich Text Format +text/turtle=Turtle RDF +application/xml=XML text/xml=XML +# Code +text/x-c=C++ Source +text/css=Cascading Style Sheet +text/javascript=Javascript Code +application/javascript=Javascript Code +application/x-javascript=Javascript Code +text/x-matlab=MATLAB Source Code +text/x-mathematica=Mathematica Input +text/php=PHP Source Code +text/x-python=Python Source Code +text/x-python-script=Python Source Code +text/x-r-source=R Source Code +application/x-sh=Shell Script +application/x-shellscript=Shell Script +application/x-sql=SQL Code +text/x-sql=SQL Code +application/x-swc=Shockwave Flash Component +application/x-msdownload=Windows Executable +application/x-ipynb+json=Jupyter Notebook +application/x-stata-ado=Stata Ado Script +application/x-stata-do=Stata Do Script +application/x-stata-dta=Stata Data Script 
+application/x-stata-smcl=Stata Markup and Control Language +text/x-stata-syntax=Stata Syntax +application/x-stata-syntax=Stata Syntax +text/x-spss-syntax=SPSS Syntax +application/x-spss-syntax=SPSS Syntax +application/x-spss-sps=SPSS Script Syntax +text/x-sas-syntax=SAS Syntax +application/x-sas-syntax=SAS Syntax +type/x-r-syntax=R Syntax +# Ingested Tabular Data text/tab-separated-values=Tab-Delimited -text/tsv=Tab-Delimited +# RawData +text/tsv=Tab-Separated Values +text/comma-separated-values=Comma Separated Values text/csv=Comma Separated Values text/x-fixed-field=Fixed Field Text Data application/x-rlang-transport=R Data -type/x-r-syntax=R Syntax application/x-R-2=R Binary application/x-stata=Stata Binary application/x-stata-6=Stata Binary application/x-stata-13=Stata 13 Binary application/x-stata-14=Stata 14 Binary application/x-stata-15=Stata 15 Binary -text/x-stata-syntax=Stata Syntax application/x-spss-por=SPSS Portable -application/x-spss-sav=SPSS SAV -text/x-spss-syntax=SPSS Syntax +application/x-spss-portable=SPSS Portable +application/x-spss-sav=SPSS Binary +application/x-sas=SAS application/x-sas-transport=SAS Transport application/x-sas-system=SAS System -text/x-sas-syntax=SAS Syntax +application/x-sas-data=SAS Data +application/x-sas-catalog=SAS Catalog +application/x-sas-log=SAS Log +application/x-sas-output=SAS Output +application/softgrid-do=Softgrid DTA Script application/x-dvn-csvspss-zip=CSV (w/SPSS card) application/x-dvn-tabddi-zip=TAB (w/DDI) +application/x-emf=Extended Metafile +application/x-h5=Hierarchical Data Format +application/x-hdf=Hierarchical Data Format +application/x-hdf5=Hierarchical Data Format +application/json=JSON +application/mathematica=Mathematica +application/matlab-mat=MATLAB Data +application/x-matlab-data=MATLAB Data +application/x-matlab-figure=MATLAB Figure +application/x-matlab-workspace=MATLAB Workspace +application/x-xfig=MATLAB Figure +application/x-msaccess=MS Access +application/netcdf=Network Common Data 
Form +application/x-netcdf=Network Common Data Form +application/vnd.lotus-notes=Notes Storage Facility +application/x-nsdstat=NSDstat +application/vnd.realvnc.bed=PLINK Binary +# FITS +image/fits=FITS application/fits=FITS -#Images files +# Shape +application/dbf=dBASE Table for ESRI Shapefile +application/dbase=dBASE Table for ESRI Shapefile +application/prj=ESRI Shapefile +application/sbn=ESRI Spatial Index +application/sbx=ESRI Spatial Index +application/shp=Shape +application/shx=Shape +application/zipped-shapefile=Shape +# Archive +application/zip=ZIP Archive +application/x-zip-compressed=ZIP Archive +application/vnd.antix.game-component=ATX Archive +application/x-bzip=Bzip Archive +application/x-bzip2=Bzip Archive +application/vnd.google-earth.kmz=Google Earth Archive +application/gzip=Gzip Archive +application/x-gzip=Gzip Archive +application/rar=RAR Archive +application/x-rar=RAR Archive +application/x-rar-compressed=RAR Archive +application/tar=TAR Archive +application/x-tar=TAR Archive +application/x-compressed-tar=TAR Archive +application/x-7z-compressed=7Z Archive +application/x-xz=XZ Archive +application/warc=Web Archive +# Image image/gif=GIF Image image/jpeg=JPEG Image +image/jp2=JPEG-2000 Image image/x-portable-bitmap=Bitmap Image image/x-portable-graymap=Graymap Image image/png=PNG Image image/x-portable-anymap=Anymap Image image/x-portable-pixmap=Pixmap Image +application/x-msmetafile=Enhanced Metafile +application/dicom=DICOM Image +image/dicom-rle=DICOM Image +image/nii=NIfTI Image image/cmu-raster=Raster Image image/x-rgb=RGB Image +image/svg+xml=SVG Image image/tiff=TIFF Image -image/x-xbitmap=XBitmap -image/x-xpixmap=XPixmap +image/bmp=Bitmap Image +image/x-xbitmap=Bitmap Image +image/RAW=Bitmap Image +image/x-xpixmap=Pixmap Image image/x-xwindowdump=X Windows Dump -# Network Data files +# Audio +audio/x-aiff=AIFF Audio +audio/mp3=MP3 Audio +audio/mpeg=MP3 Audio +audio/mp4=MPEG-4 Audio +audio/x-m4a=MPEG-4 Audio +audio/ogg=OGG Audio 
+audio/wav=Waveform Audio +audio/x-wav=Waveform Audio +audio/x-wave=Waveform Audio +# Video +video/avi=AVI Video +video/x-msvideo=AVI Video +video/mpeg=MPEG Video +video/mp4=MPEG-4 Video +video/x-m4v=MPEG-4 Video +video/ogg=OGG Video +video/quicktime=Quicktime Video +video/webm=WebM Video +# Network Data text/xml-graphml=GraphML Network Data # Other application/octet-stream=Unknown diff --git a/src/main/java/propertyFiles/MimeTypeFacets.properties b/src/main/java/propertyFiles/MimeTypeFacets.properties index 2acd2aa6168..54c5e01d317 100644 --- a/src/main/java/propertyFiles/MimeTypeFacets.properties +++ b/src/main/java/propertyFiles/MimeTypeFacets.properties @@ -1,60 +1,181 @@ # MimeTypeFacets properties file -# Defines "facetable" groups of files by mime type; -# For example, all image formats will be grouped under "image", etc. -# -# Documentation: +# Defines "facetable" groups of files by mime type +# Documentation application/pdf=Document +image/pdf=Document +text/pdf=Document +application/x-pdf=Document application/msword=Document application/vnd.ms-excel=Document application/vnd.openxmlformats-officedocument.spreadsheetml.sheet=Document +application/vnd.ms-powerpoint=Document +application/vnd.openxmlformats-officedocument.presentationml.presentation=Document application/vnd.openxmlformats-officedocument.wordprocessingml.document=Document -# Text: +application/vnd.oasis.opendocument.spreadsheet=Document +# Text text/plain=Text +text/html=Text +application/x-tex=Text +text/x-tex=Text +text/markdown=Text +text/x-markdown=Text +text/x-r-markdown=Text +application/rtf=Text +text/rtf=Text +text/richtext=Text +text/turtle=Text +application/xml=Text text/xml=Text +# Code +text/x-c=Code +text/css=Code +text/javascript=Code +application/javascript=Code +application/x-javascript=Code +text/x-matlab=Code +text/x-mathematica=Code +text/php=Code +text/x-python=Code +text/x-python-script=Code +text/x-r-source=Code +application/x-sh=Code +application/x-shellscript=Code 
+application/x-sql=Code +text/x-sql=Code +application/x-swc=Code +application/x-msdownload=Code +application/x-ipynb+json=Code +application/x-stata-do=Code +text/x-stata-syntax=Code +application/x-stata-syntax=Code +text/x-spss-syntax=Code +application/x-spss-syntax=Code +text/x-sas-syntax=Code +application/x-sas-syntax=Code +type/x-r-syntax=Code # Ingested text/tab-separated-values=Tabular Data - -# Data files: +# Data text/tsv=Data +text/comma-separated-values=Data text/csv=Data text/x-fixed-field=Data application/x-rlang-transport=Data -type/x-r-syntax=Data application/x-R-2=Data application/x-stata=Data application/x-stata-6=Data application/x-stata-13=Data application/x-stata-14=Data application/x-stata-15=Data -text/x-stata-syntax=Data +application/x-stata-ado=Data +application/x-stata-dta=Data +application/x-stata-smcl=Data application/x-spss-por=Data +application/x-spss-portable=Data application/x-spss-sav=Data -text/x-spss-syntax=Data +application/x-spss-sps=Data +application/x-sas=Data application/x-sas-transport=Data application/x-sas-system=Data -text/x-sas-syntax=Data +application/x-sas-data=Data +application/x-sas-catalog=Data +application/x-sas-log=Data +application/x-sas-output=Data +application/softgrid-do=Data application/x-dvn-csvspss-zip=Data application/x-dvn-tabddi-zip=Data +application/x-emf=Data +application/x-h5=Data +application/x-hdf=Data +application/x-hdf5=Data +application/json=Data +application/mathematica=Data +application/matlab-mat=Data +application/x-matlab-data=Data +application/x-matlab-figure=Data +application/x-matlab-workspace=Data +application/x-xfig=Data +application/x-msaccess=Data +application/netcdf=Data +application/x-netcdf=Data +application/vnd.lotus-notes=Data +application/x-nsdstat=Data +application/vnd.realvnc.bed=Data +# FITS +image/fits=FITS application/fits=FITS +# Shape +application/dbf=Shape +application/dbase=Shape +application/prj=Shape +application/sbn=Shape +application/sbx=Shape +application/shp=Shape 
+application/shx=Shape application/zipped-shapefile=Shape -# Archive files: -application/zip=ZIP -# Images files -# (should be safe to just split the mime type on "/" in "image/*" though...) +# Archive +application/zip=Archive +application/x-zip-compressed=Archive +application/vnd.antix.game-component=Archive +application/x-bzip=Archive +application/x-bzip2=Archive +application/vnd.google-earth.kmz=Archive +application/gzip=Archive +application/x-gzip=Archive +application/rar=Archive +application/x-rar=Archive +application/x-rar-compressed=Archive +application/tar=Archive +application/x-tar=Archive +application/x-compressed-tar=Archive +application/x-7z-compressed=Archive +application/x-xz=Archive +application/warc=Archive +# Image image/gif=Image image/jpeg=Image +image/jp2=Image image/x-portable-bitmap=Image image/x-portable-graymap=Image image/png=Image image/x-portable-anymap=Image image/x-portable-pixmap=Image +application/x-msmetafile=Image +application/dicom=Image +image/dicom-rle=Image +image/nii=Image image/cmu-raster=Image image/x-rgb=Image +image/svg+xml=Image image/tiff=Image +image/bmp=Image image/x-xbitmap=Image +image/RAW=Image image/x-xpixmap=Image image/x-xwindowdump=Image -# Network Data files +# (anything else that looks like image/* will also be indexed as facet type "Image") +# Audio +audio/x-aiff=Audio +audio/mp3=Audio +audio/mpeg=Audio +audio/mp4=Audio +audio/x-m4a=Audio +audio/ogg=Audio +audio/wav=Audio +audio/x-wav=Audio +audio/x-wave=Audio +# (anything else that looks like audio/* will also be indexed as facet type "Audio") +# Video +video/avi=Video +video/x-msvideo=Video +video/mpeg=Video +video/mp4=Video +video/x-m4v=Video +video/ogg=Video +video/quicktime=Video +video/webm=Video +# (anything else that looks like video/* will also be indexed as facet type "Video") +# Network Data text/xml-graphml=Network Data # Other application/octet-stream=Unknown diff --git a/src/main/webapp/editFilesFragment.xhtml
b/src/main/webapp/editFilesFragment.xhtml index a2db4bff7e9..eef6d5fac95 100644 --- a/src/main/webapp/editFilesFragment.xhtml +++ b/src/main/webapp/editFilesFragment.xhtml @@ -326,7 +326,7 @@ - + diff --git a/src/main/webapp/file-info-fragment.xhtml b/src/main/webapp/file-info-fragment.xhtml index 2add73eab3a..f6679543dce 100644 --- a/src/main/webapp/file-info-fragment.xhtml +++ b/src/main/webapp/file-info-fragment.xhtml @@ -14,7 +14,7 @@
- + diff --git a/src/main/webapp/file.xhtml b/src/main/webapp/file.xhtml index 26e0f766117..e7d1d8fd743 100644 --- a/src/main/webapp/file.xhtml +++ b/src/main/webapp/file.xhtml @@ -198,7 +198,7 @@
- +
diff --git a/src/main/webapp/filesFragment.xhtml b/src/main/webapp/filesFragment.xhtml index a74ef7ddbbf..c91fb368f97 100644 --- a/src/main/webapp/filesFragment.xhtml +++ b/src/main/webapp/filesFragment.xhtml @@ -370,7 +370,7 @@ - + @@ -631,7 +631,7 @@ - + diff --git a/src/main/webapp/search-include-fragment.xhtml b/src/main/webapp/search-include-fragment.xhtml index c07fd77ff7b..cdf0f1de1a7 100644 --- a/src/main/webapp/search-include-fragment.xhtml +++ b/src/main/webapp/search-include-fragment.xhtml @@ -581,7 +581,7 @@ diff --git a/src/test/java/edu/harvard/iq/dataverse/DataFileServiceBeanTest.java b/src/test/java/edu/harvard/iq/dataverse/DataFileServiceBeanTest.java index 92a1f6a6b17..136916cf449 100644 --- a/src/test/java/edu/harvard/iq/dataverse/DataFileServiceBeanTest.java +++ b/src/test/java/edu/harvard/iq/dataverse/DataFileServiceBeanTest.java @@ -186,8 +186,8 @@ public void testIsThumbnailSupportedForSize() throws Exception { */ @Test public void testGetFileClass() throws Exception { - assertEquals("other", dataFileServiceBean.getFileClass(fileWoContentType)); - assertEquals("other", dataFileServiceBean.getFileClass(fileWithBogusContentType)); + assertEquals("other", dataFileServiceBean.getFileThumbnailClass(fileWoContentType)); + assertEquals("other", dataFileServiceBean.getFileThumbnailClass(fileWithBogusContentType)); } /** diff --git a/src/test/java/edu/harvard/iq/dataverse/api/FileTypeDetectionIT.java b/src/test/java/edu/harvard/iq/dataverse/api/FileTypeDetectionIT.java new file mode 100644 index 00000000000..8e38a0da2f2 --- /dev/null +++ b/src/test/java/edu/harvard/iq/dataverse/api/FileTypeDetectionIT.java @@ -0,0 +1,207 @@ +package edu.harvard.iq.dataverse.api; + +import com.jayway.restassured.path.json.JsonPath; +import com.jayway.restassured.response.Response; +import javax.json.Json; +import javax.json.JsonObjectBuilder; +import static javax.ws.rs.core.Response.Status.CREATED; +import static javax.ws.rs.core.Response.Status.OK; +import static 
javax.ws.rs.core.Response.Status.UNAUTHORIZED; +import static org.hamcrest.CoreMatchers.equalTo; +import static org.hamcrest.CoreMatchers.nullValue; +import org.junit.Test; + +public class FileTypeDetectionIT { + + @Test + public void testOverrideMimeType() { + Response createUser = UtilIT.createRandomUser(); + createUser.prettyPrint(); + createUser.then().assertThat() + .statusCode(OK.getStatusCode()); + String username = UtilIT.getUsernameFromResponse(createUser); + String apiToken = UtilIT.getApiTokenFromResponse(createUser); + + Response createDataverseResponse = UtilIT.createRandomDataverse(apiToken); + createDataverseResponse.prettyPrint(); + createDataverseResponse.then().assertThat() + .statusCode(CREATED.getStatusCode()); + + String dataverseAlias = UtilIT.getAliasFromResponse(createDataverseResponse); + + Response createDataset = UtilIT.createRandomDatasetViaNativeApi(dataverseAlias, apiToken); + createDataset.prettyPrint(); + createDataset.then().assertThat() + .statusCode(CREATED.getStatusCode()); + + Integer datasetId = UtilIT.getDatasetIdFromResponse(createDataset); + + String readmeFile = "README.md"; + + JsonObjectBuilder readmeFileMetadata = Json.createObjectBuilder() + .add("description", "How to run the code on the data.") + .add("categories", Json.createArrayBuilder() + .add("Documentation") + ); + + // Markdown media type: https://tools.ietf.org/html/rfc7763 + String overrideMimeType = "text/markdown"; + Response addReadme = UtilIT.uploadFileViaNative(datasetId.toString(), readmeFile, readmeFileMetadata.build().toString(), overrideMimeType, apiToken); + addReadme.prettyPrint(); + addReadme.then().assertThat() + .body("data.files[0].categories[0]", equalTo("Documentation")) + .body("data.files[0].dataFile.contentType", equalTo("text/markdown")) + .body("data.files[0].dataFile.description", equalTo("How to run the code on the data.")) + .body("data.files[0].directoryLabel", nullValue()) + .body("data.files[0].dataFile.tags", nullValue()) + 
.body("data.files[0].dataFile.tabularTags", nullValue()) + .body("data.files[0].label", equalTo("README.md")) + // not sure why description appears in two places + .body("data.files[0].description", equalTo("How to run the code on the data.")) + .statusCode(OK.getStatusCode()); + + String jupyterNotebook = "src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb"; + + JsonObjectBuilder jupyterNotebookMetadata = Json.createObjectBuilder() + .add("description", "Jupyter Notebook showing IRC metrics.") + .add("directoryLabel", "code") + .add("categories", Json.createArrayBuilder() + .add("Code") + ); + + Response addCode = UtilIT.uploadFileViaNative(datasetId.toString(), jupyterNotebook, jupyterNotebookMetadata.build(), apiToken); + addCode.prettyPrint(); + addCode.then().assertThat() + .body("data.files[0].categories[0]", equalTo("Code")) + .body("data.files[0].dataFile.contentType", equalTo("application/x-ipynb+json")) + .body("data.files[0].dataFile.description", equalTo("Jupyter Notebook showing IRC metrics.")) + .body("data.files[0].directoryLabel", equalTo("code")) + .body("data.files[0].dataFile.tags", nullValue()) + .body("data.files[0].dataFile.tabularTags", nullValue()) + .body("data.files[0].label", equalTo("irc-metrics.ipynb")) + // not sure why description appears in two places + .body("data.files[0].description", equalTo("Jupyter Notebook showing IRC metrics.")) + .statusCode(OK.getStatusCode()); + + String tsvFile = "src/test/java/edu/harvard/iq/dataverse/util/irclog.tsv"; + + JsonObjectBuilder tsvFileMetadata = Json.createObjectBuilder() + .add("description", "TSV file of Dataverse IRC logs.") + .add("directoryLabel", "data") + .add("categories", Json.createArrayBuilder() + .add("Data") + ); + + Response addData = UtilIT.uploadFileViaNative(datasetId.toString(), tsvFile, tsvFileMetadata.build(), apiToken); + addData.prettyPrint(); + addData.then().assertThat() + .body("data.files[0].categories[0]", equalTo("Data")) + 
.body("data.files[0].dataFile.contentType", equalTo("text/tsv")) + .body("data.files[0].dataFile.description", equalTo("TSV file of Dataverse IRC logs.")) + .body("data.files[0].directoryLabel", equalTo("data")) + .body("data.files[0].dataFile.tags", nullValue()) + .body("data.files[0].dataFile.tabularTags", nullValue()) + .body("data.files[0].label", equalTo("irclog.tsv")) + // not sure why description appears in two places + .body("data.files[0].description", equalTo("TSV file of Dataverse IRC logs.")) + .statusCode(OK.getStatusCode()); + + } + + @Test + public void testRedetectMimeType() { + Response createUser = UtilIT.createRandomUser(); + createUser.prettyPrint(); + createUser.then().assertThat() + .statusCode(OK.getStatusCode()); + String username = UtilIT.getUsernameFromResponse(createUser); + String apiToken = UtilIT.getApiTokenFromResponse(createUser); + + Response createDataverseResponse = UtilIT.createRandomDataverse(apiToken); + createDataverseResponse.prettyPrint(); + createDataverseResponse.then().assertThat() + .statusCode(CREATED.getStatusCode()); + + String dataverseAlias = UtilIT.getAliasFromResponse(createDataverseResponse); + + Response createDataset = UtilIT.createRandomDatasetViaNativeApi(dataverseAlias, apiToken); + createDataset.prettyPrint(); + createDataset.then().assertThat() + .statusCode(CREATED.getStatusCode()); + + Integer datasetId = UtilIT.getDatasetIdFromResponse(createDataset); + + String filePath = "scripts/issues/1380/dvs.pdf"; + + JsonObjectBuilder readmeFileMetadata = Json.createObjectBuilder() + .add("description", "This is a PDF.") + .add("categories", Json.createArrayBuilder() + .add("Documentation") + ); + + /** + * We are overriding the MIME type here because even though Dataverse + * knows how to figure out what a PDF is we want to pretend it doesn't + * so that we can later try the "redetect file type" API. 
+ */ + String overrideMimeType = "foo/bar"; + Response addFileUnknownType = UtilIT.uploadFileViaNative(datasetId.toString(), filePath, readmeFileMetadata.build().toString(), overrideMimeType, apiToken); + addFileUnknownType.prettyPrint(); + addFileUnknownType.then().assertThat() + .statusCode(OK.getStatusCode()) + .body("data.files[0].categories[0]", equalTo("Documentation")) + .body("data.files[0].dataFile.contentType", equalTo("foo/bar")) + .body("data.files[0].dataFile.description", equalTo("This is a PDF.")) + .body("data.files[0].directoryLabel", nullValue()) + .body("data.files[0].dataFile.tags", nullValue()) + .body("data.files[0].dataFile.tabularTags", nullValue()) + .body("data.files[0].label", equalTo("dvs.pdf")) + // not sure why description appears in two places + .body("data.files[0].description", equalTo("This is a PDF.")); + + Long fileId = JsonPath.from(addFileUnknownType.asString()).getLong("data.files[0].dataFile.id"); + System.out.println("file id: " + fileId); + boolean dryRunTrue = true; + Response redetectDryRun = UtilIT.redetectFileType(fileId.toString(), dryRunTrue, apiToken); + redetectDryRun.prettyPrint(); + redetectDryRun.then().assertThat() + .statusCode(OK.getStatusCode()) + .body("data.dryRun", equalTo(true)) + .body("data.oldContentType", equalTo("foo/bar")) + .body("data.newContentType", equalTo("application/pdf")); + + Response createNoPrivsUser = UtilIT.createRandomUser(); + createNoPrivsUser.prettyPrint(); + createNoPrivsUser.then().assertThat() + .statusCode(OK.getStatusCode()); + String noPrivsUsername = UtilIT.getUsernameFromResponse(createNoPrivsUser); + String noPrivsApiToken = UtilIT.getApiTokenFromResponse(createNoPrivsUser); + + Response forbidden = UtilIT.redetectFileType(fileId.toString(), true, noPrivsApiToken); + forbidden.then().assertThat() + .statusCode(UNAUTHORIZED.getStatusCode()); + + Response noChange = UtilIT.nativeGet(datasetId, apiToken); + noChange.prettyPrint(); + noChange.then().assertThat() + 
.statusCode(OK.getStatusCode()) + .body("data.latestVersion.files[0].dataFile.contentType", equalTo("foo/bar")); + + boolean dryRunFalse = false; + Response redetectAndChange = UtilIT.redetectFileType(fileId.toString(), dryRunFalse, apiToken); + redetectAndChange.prettyPrint(); + redetectAndChange.then().assertThat() + .statusCode(OK.getStatusCode()) + .body("data.dryRun", equalTo(false)) + .body("data.oldContentType", equalTo("foo/bar")) + .body("data.newContentType", equalTo("application/pdf")); + + Response databaseChanged = UtilIT.nativeGet(datasetId, apiToken); + databaseChanged.prettyPrint(); + databaseChanged.then().assertThat() + .statusCode(OK.getStatusCode()) + .body("data.latestVersion.files[0].dataFile.contentType", equalTo("application/pdf")); + + } + +} diff --git a/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java b/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java index 4487a0553ae..f23d480632e 100644 --- a/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java +++ b/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java @@ -558,10 +558,15 @@ static Response uploadFileViaNative(String datasetId, String pathToFile, JsonObj } static Response uploadFileViaNative(String datasetId, String pathToFile, String jsonAsString, String apiToken) { + String nullMimeType = null; + return uploadFileViaNative(datasetId, pathToFile, jsonAsString, nullMimeType, apiToken); + } + + static Response uploadFileViaNative(String datasetId, String pathToFile, String jsonAsString, String mimeType, String apiToken) { RequestSpecification requestSpecification = given() .header(API_TOKEN_HTTP_HEADER, apiToken) .multiPart("datasetId", datasetId) - .multiPart("file", new File(pathToFile)); + .multiPart("file", new File(pathToFile), mimeType); if (jsonAsString != null) { requestSpecification.multiPart("jsonData", jsonAsString); } @@ -701,6 +706,12 @@ static Response testIngest(String fileName, String fileType) { .get("/api/ingest/test/file?fileName=" + fileName + 
"&fileType=" + fileType); } + static Response redetectFileType(String fileId, boolean dryRun, String apiToken) { + return given() + .header(API_TOKEN_HTTP_HEADER, apiToken) + .post("/api/files/" + fileId + "/redetect?dryRun=" + dryRun); + } + static Response getSwordAtomEntry(String persistentId, String apiToken) { Response response = given() .auth().basic(apiToken, EMPTY_STRING) @@ -822,7 +833,14 @@ public static Response deleteUser(String username) { .delete("/api/admin/authenticatedUsers/" + username + "/"); return deleteUserResponse; } - + + public static Response reingestFile(Long fileId, String apiToken) { + Response response = given() + .header(API_TOKEN_HTTP_HEADER, apiToken) + .post("/api/files/" + fileId + "/reingest"); + return response; + } + public static Response uningestFile(Long fileId, String apiToken) { Response uningestFileResponse = given() diff --git a/src/test/java/edu/harvard/iq/dataverse/util/BundleUtilTest.java b/src/test/java/edu/harvard/iq/dataverse/util/BundleUtilTest.java index c34ab81c7f5..8889d492829 100644 --- a/src/test/java/edu/harvard/iq/dataverse/util/BundleUtilTest.java +++ b/src/test/java/edu/harvard/iq/dataverse/util/BundleUtilTest.java @@ -74,7 +74,7 @@ public void testGetStringFromBundleWithArgumentsAndSpecificBundle() { @Test public void testStringFromPropertyFile() { - assertEquals("ZIP", BundleUtil.getStringFromPropertyFile("application/zip","MimeTypeFacets")); + assertEquals("Archive", BundleUtil.getStringFromPropertyFile("application/zip","MimeTypeFacets")); } //To assure that the MissingResourceException bubble up from this call diff --git a/src/test/java/edu/harvard/iq/dataverse/util/FileTypeDetectionTest.java b/src/test/java/edu/harvard/iq/dataverse/util/FileTypeDetectionTest.java new file mode 100644 index 00000000000..5d2b9b4d56a --- /dev/null +++ b/src/test/java/edu/harvard/iq/dataverse/util/FileTypeDetectionTest.java @@ -0,0 +1,42 @@ +package edu.harvard.iq.dataverse.util; + +import java.io.File; +import 
java.io.IOException; +import java.util.logging.Level; +import java.util.logging.Logger; +import org.apache.commons.io.FileUtils; +import org.junit.AfterClass; +import static org.junit.Assert.assertEquals; +import org.junit.BeforeClass; +import org.junit.Test; + +public class FileTypeDetectionTest { + + static String baseDirForConfigFiles = "/tmp"; + + @BeforeClass + public static void setUpClass() { + System.setProperty("com.sun.aas.instanceRoot", baseDirForConfigFiles); + String testFile1Src = "conf/jhove/jhove.conf"; + String testFile1Tmp = baseDirForConfigFiles + "/config/jhove.conf"; + try { + FileUtils.copyFile(new File(testFile1Src), new File(testFile1Tmp)); + } catch (IOException ex) { + Logger.getLogger(JhoveFileTypeTest.class.getName()).log(Level.SEVERE, null, ex); + } + } + + @AfterClass + public static void tearDownClass() { + // SiteMapUtilTest relies on com.sun.aas.instanceRoot being null. + System.clearProperty("com.sun.aas.instanceRoot"); + } + + @Test + public void testDetermineFileTypeJupyterNoteboook() throws Exception { + File file = new File("src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb"); + // https://jupyter.readthedocs.io/en/latest/reference/mimetype.html + assertEquals("application/x-ipynb+json", FileTypeDetection.determineFileType(file)); + } + +} diff --git a/src/test/java/edu/harvard/iq/dataverse/util/JhoveFileTypeTest.java b/src/test/java/edu/harvard/iq/dataverse/util/JhoveFileTypeTest.java new file mode 100644 index 00000000000..88a8d24c772 --- /dev/null +++ b/src/test/java/edu/harvard/iq/dataverse/util/JhoveFileTypeTest.java @@ -0,0 +1,90 @@ +package edu.harvard.iq.dataverse.util; + +import java.io.File; +import java.io.IOException; +import java.util.logging.Level; +import java.util.logging.Logger; +import org.apache.commons.io.FileUtils; +import org.junit.AfterClass; +import static org.junit.Assert.assertEquals; +import org.junit.BeforeClass; +import org.junit.Test; + +public class JhoveFileTypeTest { + + static 
JhoveFileType jhoveFileType;
+    static String baseDirForConfigFiles = "/tmp";
+    static File png;
+    static File gif;
+    static File jpg;
+    static File pdf;
+    static File zip;
+    static File xml;
+    static File html;
+    static File ico;
+    static File ipynb;
+
+    @BeforeClass
+    public static void setUpClass() {
+        System.setProperty("com.sun.aas.instanceRoot", baseDirForConfigFiles);
+        jhoveFileType = new JhoveFileType();
+        copyConfigIntoPlace();
+
+        png = new File("src/test/resources/images/coffeeshop.png");
+        gif = new File("src/main/webapp/resources/images/ajax-loading.gif");
+        jpg = new File("src/main/webapp/resources/images/dataverseproject_logo.jpg");
+        pdf = new File("scripts/issues/1380/dvs.pdf");
+        zip = new File("src/test/resources/doi-10-5072-fk2hyixmyv1.0.zip");
+        xml = new File("pom.xml");
+        html = new File("src/main/webapp/mydata_templates/mydata.html");
+        ico = new File("src/main/webapp/resources/images/fav/favicon.ico");
+        ipynb = new File("src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb");
+    }
+
+    @AfterClass
+    public static void tearDownClass() {
+        // SiteMapUtilTest relies on com.sun.aas.instanceRoot being null.
+        System.clearProperty("com.sun.aas.instanceRoot");
+    }
+
+    @Test
+    public void testGetFileMimeType() {
+        System.out.println("getFileMimeType");
+        // GOOD: figured it out. :)
+        assertEquals("image/png", jhoveFileType.getFileMimeType(png));
+        assertEquals("image/gif", jhoveFileType.getFileMimeType(gif));
+        assertEquals("image/jpeg", jhoveFileType.getFileMimeType(jpg));
+        assertEquals("application/pdf", jhoveFileType.getFileMimeType(pdf));
+        // BAD: couldn't figure it out. :(
+        assertEquals("application/octet-stream", jhoveFileType.getFileMimeType(zip));
+        assertEquals("application/octet-stream", jhoveFileType.getFileMimeType(ico));
+        // BAD: not very specific. :(
+        assertEquals("text/plain; charset=US-ASCII", jhoveFileType.getFileMimeType(xml));
+        assertEquals("text/plain; charset=US-ASCII", jhoveFileType.getFileMimeType(html));
+        assertEquals("text/plain; charset=US-ASCII", jhoveFileType.getFileMimeType(ipynb));
+    }
+
+    @Test
+    public void testCheckFileType() {
+        System.out.println("checkFileType");
+        jhoveFileType = new JhoveFileType();
+        assertEquals(543938, jhoveFileType.checkFileType(png).getSize());
+    }
+
+    @Test
+    public void testGetJhoveConfigFile() {
+        System.out.println("getJhoveConfigFile");
+        assertEquals(baseDirForConfigFiles + "/config/jhove.conf", JhoveFileType.getJhoveConfigFile());
+    }
+
+    private static void copyConfigIntoPlace() {
+        String testFile1Src = "conf/jhove/jhove.conf";
+        String testFile1Tmp = baseDirForConfigFiles + "/config/jhove.conf";
+        try {
+            FileUtils.copyFile(new File(testFile1Src), new File(testFile1Tmp));
+        } catch (IOException ex) {
+            Logger.getLogger(JhoveFileTypeTest.class.getName()).log(Level.SEVERE, null, ex);
+        }
+    }
+
+}
diff --git a/src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb b/src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb
new file mode 100644
index 00000000000..13088234fcb
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/util/irc-metrics.ipynb
@@ -0,0 +1,251 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Pandas version 0.22.0\n",
+      "Numpy version 1.13.3\n"
+     ]
+    }
+   ],
+   "source": [
+    "%matplotlib inline\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "pd.set_option('display.max_columns', 100)\n",
+    "\n",
+    "print('Pandas version ' + pd.__version__)\n",
+    "print('Numpy version ' + np.__version__)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = pd.read_table(\"irclog.tsv\", encoding = \"ISO-8859-1\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>channel</th>\n",
+       "      <th>day</th>\n",
+       "      <th>nick</th>\n",
+       "      <th>timestamp</th>\n",
+       "      <th>line</th>\n",
+       "      <th>spam</th>\n",
+       "      <th>in_summary</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1</td>\n",
+       "      <td>#dvn</td>\n",
+       "      <td>2012-12-08</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>1355005146</td>\n",
+       "      <td>iqlogbot joined #dvn</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2</td>\n",
+       "      <td>#dvn</td>\n",
+       "      <td>2012-12-08</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>1355005248</td>\n",
+       "      <td>Topic for #dvn is now http://thedata.org - The...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>3</td>\n",
+       "      <td>#dvn</td>\n",
+       "      <td>2012-12-08</td>\n",
+       "      <td>pdurbin</td>\n",
+       "      <td>1355005351</td>\n",
+       "      <td>hello! welcome to #dvn, an IRC channel on Free...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4</td>\n",
+       "      <td>#dvn</td>\n",
+       "      <td>2012-12-08</td>\n",
+       "      <td>pdurbin</td>\n",
+       "      <td>1355005459</td>\n",
+       "      <td>our website is http://thedata.org and we're st...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5</td>\n",
+       "      <td>#dvn</td>\n",
+       "      <td>2012-12-08</td>\n",
+       "      <td>pdurbin</td>\n",
+       "      <td>1355005517</td>\n",
+       "      <td>we call our project DVN for short :)</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   id channel         day     nick   timestamp  \\\n",
+       "0   1    #dvn  2012-12-08      NaN  1355005146   \n",
+       "1   2    #dvn  2012-12-08      NaN  1355005248   \n",
+       "2   3    #dvn  2012-12-08  pdurbin  1355005351   \n",
+       "3   4    #dvn  2012-12-08  pdurbin  1355005459   \n",
+       "4   5    #dvn  2012-12-08  pdurbin  1355005517   \n",
+       "\n",
+       "                                                line  spam  in_summary  \n",
+       "0                               iqlogbot joined #dvn     0           0  \n",
+       "1  Topic for #dvn is now http://thedata.org - The...     0           0  \n",
+       "2  hello! welcome to #dvn, an IRC channel on Free...     0           0  \n",
+       "3  our website is http://thedata.org and we're st...     0           0  \n",
+       "4               we call our project DVN for short :)     0           0  "
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 92847 entries, 0 to 92846\n",
+      "Data columns (total 8 columns):\n",
+      "id            92847 non-null int64\n",
+      "channel       92847 non-null object\n",
+      "day           92847 non-null object\n",
+      "nick          60116 non-null object\n",
+      "timestamp     92847 non-null int64\n",
+      "line          92845 non-null object\n",
+      "spam          92847 non-null int64\n",
+      "in_summary    92847 non-null int64\n",
+      "dtypes: int64(4), object(4)\n",
+      "memory usage: 5.7+ MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "data.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['id', 'channel', 'day', 'nick', 'timestamp', 'line', 'spam', 'in_summary']"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "list(data.columns)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "#dataverse    82587\n",
+       "#dvn          10260\n",
+       "Name: channel, dtype: int64"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data['channel'].value_counts()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/src/test/java/edu/harvard/iq/dataverse/util/irclog.tsv b/src/test/java/edu/harvard/iq/dataverse/util/irclog.tsv
new file mode 100644
index 00000000000..d0e22852965
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/util/irclog.tsv
@@ -0,0 +1,7 @@
+id	channel	day	nick	timestamp	line	spam	in_summary
+10261	#dataverse	2014-06-24		1403620825	iqlogbot joined #dataverse	0	0
+10262	#dataverse	2014-06-24		1403620825	Topic for #dataverse is now Dataverse: http://dataverse.org | logs at http://irclog.iq.harvard.edu/dataverse/today	0	0
+10263	#dataverse	2014-06-24	pdurbin	1403620846	hello world!	0	0
+10264	#dataverse	2014-06-24	pdurbin	1403620958	for over a year I've been gathering people in #dvn to talk about Dataverse Network but as a bit of a rebranding effort, we're shortening the name to just "Dataverse"	0	0
+10265	#dataverse	2014-06-24	pdurbin	1403621058	we even have a fancy new domain: http://dataverse.org :)	0	0
+10266	#dataverse	2014-06-24	pdurbin	1403621094	once I get everyone who's in the old #dvn channel to join this new #dataverse channel we'll shut the old one down	0	0