MediaConch Matroska Survey

Introduction

This repository contains a research corpus used in the development of MediaConch. Its collections hold MediaArea XML documents, each of which contains both a MediaInfo report and a MediaTrace report. Together, these reports document most of the structure of each Matroska file and list its significant characteristics.

Any XML file larger than 2 megabytes is compressed with gzip before being added to the repository, which reduces its size substantially. To read a gzip-compressed file without unpacking it first, gzcat is recommended, for example:

gzcat MatroskaFile.mkv_maxml.xml.gz | grep "<CompleteName>"
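Where a script is more convenient than a shell pipeline, the same lookup can be sketched in Python using only the standard library. This is a minimal sketch, not part of the repository: the function names `open_report` and `complete_names` are hypothetical, and the code assumes only that the report is well-formed XML whose `CompleteName` elements may or may not carry a namespace.

```python
import gzip
import xml.etree.ElementTree as ET


def open_report(path):
    """Open a report for binary reading, decompressing it when the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def complete_names(path):
    """Return the text of every CompleteName element found in a report."""
    names = []
    with open_report(path) as stream:
        # iterparse reads the document incrementally instead of loading it at once.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Drop any "{namespace}" prefix so the match works with or without one.
            if elem.tag.rsplit("}", 1)[-1] == "CompleteName":
                names.append((elem.text or "").strip())
            # Release the element's content once it has been handled.
            elem.clear()
    return names
```

For example, `complete_names("MatroskaFile.mkv_maxml.xml.gz")` would return the values that the gzcat and grep pipeline prints, without the surrounding tags.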

Because most of the structure of a Matroska file lies within its Cluster elements, a file with a "_nocluster" suffix may also appear. This XML report is the same as its neighboring XML report, except that all MediaTrace elements documenting the Cluster elements have been removed. This allows nearly every sample to be documented by an XML file under 2 MB in size.

For instance, the file at (1919) Das Cabinet des Dr. Caligari.mkv_maxml.xml.gz is a gzip-compressed MediaArea XML document containing both a MediaInfo and a MediaTrace report on a Matroska file of Das Cabinet des Dr. Caligari. The file called (1919) Das Cabinet des Dr. Caligari.mkv_maxml.xml_cluster.xml presents the same report but without the MediaTrace reporting on the Matroska Cluster elements.

Collections

archive_org

The archive.org collection consists of Matroska files identified in the public collections of archive.org. Within this collection, each subdirectory corresponds to an Internet Archive asset identifier, and each file within that subdirectory is named after the source file of that asset.
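The layout described above can be walked with a short Python sketch. It assumes a local checkout in which the collection lives in a directory named `archive_org` with exactly one level of subdirectories; the function name `index_collection` is hypothetical and not part of the repository.

```python
from pathlib import Path


def index_collection(root):
    """Map each asset identifier (subdirectory name) to the report file names inside it."""
    index = {}
    for asset_dir in sorted(Path(root).iterdir()):
        # Only subdirectories are treated as asset identifiers.
        if not asset_dir.is_dir():
            continue
        index[asset_dir.name] = sorted(
            entry.name for entry in asset_dir.iterdir() if entry.is_file()
        )
    return index
```

Calling `index_collection("archive_org")` from the repository root would then yield a dictionary keyed by asset identifier, which can be used to locate the reports for a given asset.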