CybOX 3.0: File Object Refactoring

Ivan Kirillov edited this page Feb 24, 2016 · 100 revisions

Issue Description

There are several existing issues around the current implementation of the File Object and its subclasses:

  1. It conflates generic properties of a file (e.g., hashes) with those that are specific to its representation on a file system or disk (e.g., file_name).
  2. The current file path related fields (e.g., file_path, full_path) overlap and are difficult to use consistently.
  3. There are certain fields (e.g., device_path) that may be specific to Windows and no other platforms.
  4. There currently exist many subclasses of the File Object, making it difficult to understand which Object should be used in which situation, and also leading to some semantic inconsistencies (e.g., should the Windows Executable File Object really be a subclass of the Windows File Object?).

File Object Requirements

Before delving into potential solutions for the issues highlighted above, it is useful to discuss the requirements around the File Object in CybOX, as we see them today. Based on our defined use cases, in an ideal scenario, the File Object should be able to capture the types of properties described here.

Existing Fields

The existing fields on the File Object and the corresponding requirement they address are given in the table below.

Field Applicable Requirement
is_packed 6. File metadata
is_masqueraded 6. File metadata
File_Name 2. Generic file system properties
File_Path 2. Generic file system properties
Device_Path 4. Operating-system specific properties
Full_Path 2. Generic file system properties
File_Extension 2. Generic file system properties
Size_In_Bytes 1. Generic file properties
Magic_Number 1. Generic file properties
File_Format 1. Generic file properties
Hashes 1. Generic file properties
Digital_Signatures 6. File metadata
Modified_Time 2. Generic file system properties
Accessed_Time 2. Generic file system properties
Created_Time 2. Generic file system properties
File_Attributes_List 4. Operating-system specific properties
Permissions 3. File system-specific properties
User_Owner 4. Operating-system specific properties
Packer_List 6. File metadata
Peak_Entropy 6. File metadata
Sym_Links 3. File system-specific properties
Byte_Runs 5. Generic file on disk properties
Extracted_Features 6. File metadata
Decryption_Key 6. File metadata
Compression_Method 6. File metadata
Compression_Version 6. File metadata
Compression_Comment 6. File metadata

Refactoring

Before delving into our ideas for refactoring the File Object, it's important to discuss the planned scope of this work. It stems from our CybOX 3.0 principles of simplicity and reduced ambiguity, and from the goal of having this release serve as a stable base for future CybOX releases. Accordingly, our focus is on:

  1. Cleaning up and fixing the aforementioned issues
  2. Having a stable base of generic file and filesystem properties useful for a wide variety of use cases
  3. Defining the necessary extension points for other types of properties

Generic vs. File system Properties

The biggest issue around the conflation of generic vs. file system properties in the current File Object is that it isn't always clear whether an instance of the Object represents an abstract file, or one stored on a particular file system. Accordingly, we feel that this can be solved with the following modifications:

  1. Having a clear delineation between generic and file-system specific properties
  2. Having separate fields for related, but not the same, properties
    1. File size vs. size on disk

File name/path related fields

As shown above, there are currently 5 fields associated with the name and/or path of a file:

Field Description
File_Name The File_Name field specifies the base name of the file (including an extension, if present).
File_Path The File_Path field specifies the relative or fully-qualified path to the file, not including the path to the device where the file system containing the file resides. Whether the path is relative or fully-qualified can be specified via the 'fully_qualified' attribute of this field. The File_Path field may include the name of the file; if so, it must not conflict with the File_Name field. If not, the File_Path field should contain the path of the directory containing the file, and should end with a terminating path separator ("\" or "/").
Device_Path The Device_Path field specifies the path to the physical device where the file system containing the file resides.
Full_Path The Full_Path field specifies the complete path to the file, including the device path. It should contain the contents that would otherwise be in the Device_Path and File_Path fields, and can be used in case the producer is unable or does not wish to separate the Device_Path and File_Path fields. If the Full_Path field is specified along with the File_Path and/or Device_Path fields, it must not conflict with either. The Full_Path field may include the name of the file; if so, it must not conflict with the File_Name field. If not, the File_Path field should contain the path of the directory containing the file, and should end with a terminating path separator ("\" or "/").
File_Extension The File_Extension field specifies the extension of the name of the file. The File_Extension field must not conflict with the ending of the File_Name field. The File_Extension field should not begin with a "." character, but may contain a "." character in the case of a compound file extension, such as "tar.gz".

As can be seen in their descriptions, these fields can all overlap in various ways (e.g., Full_Path can encompass File_Path, which subsequently can encompass File_Name), leading to highly complex logic in order for them to be used consistently by content producers. For example, one could have the following valid File Object instance:

<File_Name>abcd.dll</File_Name>
<File_Path>C:\Users\abcd.dll</File_Path>
<Device_Path>\Device\HardDiskVolume2\</Device_Path>
<Full_Path>\Device\HardDiskVolume2\Users\abcd.dll</Full_Path>
<File_Extension>dll</File_Extension>
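
To give a sense of the logic these overlapping fields demand, here is a minimal Python sketch of a consistency check. The function name `find_conflicts`, the case-insensitive comparison, and the exact rules are assumptions paraphrased from the field descriptions above; none of this is part of CybOX. It is also incomplete on purpose: it cannot compare File_Path against Full_Path, because one is rooted at a drive letter and the other at a device path.

```python
import ntpath


def find_conflicts(file_name=None, file_path=None, device_path=None,
                   full_path=None, file_extension=None):
    """Return a list of conflicts between the legacy name/path fields.

    Hypothetical helper: assumes Windows-style paths and compares
    case-insensitively.
    """
    conflicts = []

    # File_Extension must not conflict with the ending of File_Name.
    if file_name and file_extension:
        if not file_name.lower().endswith("." + file_extension.lower()):
            conflicts.append("File_Extension does not match File_Name")

    # File_Path and Full_Path may include the file name. A value ending in a
    # path separator is treated as a directory path and is not checked.
    for label, value in (("File_Path", file_path), ("Full_Path", full_path)):
        if file_name and value and not value.endswith(("\\", "/")):
            if ntpath.basename(value).lower() != file_name.lower():
                conflicts.append(label + " names a different file than File_Name")

    # Full_Path should begin with Device_Path when both are given.
    if full_path and device_path:
        if not full_path.lower().startswith(device_path.lower()):
            conflicts.append("Full_Path does not start with Device_Path")

    return conflicts
```

Applied to the instance above, the check reports no conflicts; changing File_Extension to "exe" would produce one.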

Accordingly, this also leads to CybOX consumers having to understand how to deal with each of these fields upon parsing. The primary argument in favor of this granular approach in the past has been that it gives the ability to key off and specify patterns against particular file name/path components, such as testing for the presence of a file with a particular extension. However, we feel that the benefits of this approach are significantly outweighed by the complexity that it brings, and that we can meet the majority of use cases by collapsing these fields into a single one:

Field Datatype Description
file_name FileName The name of the file, including its path and extension (if known).

To make it easier to deal with file names on different operating systems, we believe that it may make sense to have a special type that breaks up the file name/path into a list of delimited components:

FilePath

Field Datatype Description
delimiter string The delimiter used in the file name/path string.
components list A list of strings that represent the components of the file name/path string, when split using the delimiter specified in the 'delimiter' field. A value of 'null' at the end of the list specifies a directory.
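
As a rough illustration of this type, the Python sketch below splits a path string into the delimiter/components form and joins it back. The helper names `to_file_path` and `to_string` are hypothetical, and two behaviors are assumptions rather than anything the proposal defines: a trailing delimiter is taken to mean "directory" (recorded as a final None, i.e. JSON null), and an absolute POSIX path yields a leading empty string.

```python
def to_file_path(path, delimiter):
    """Split a path string into the notional FilePath structure.

    Assumes a non-empty delimiter. A trailing delimiter marks a directory,
    which is recorded as a final None (serialized as JSON null).
    """
    is_directory = path.endswith(delimiter)
    if is_directory:
        path = path[:-len(delimiter)]
    components = path.split(delimiter)
    if is_directory:
        components.append(None)
    return {"delimiter": delimiter, "components": components}


def to_string(file_path):
    """Rebuild the path string from a FilePath structure."""
    delimiter = file_path["delimiter"]
    components = list(file_path["components"])
    if components and components[-1] is None:
        # Directory: drop the marker and restore the trailing delimiter.
        return delimiter.join(components[:-1]) + delimiter
    return delimiter.join(components)
```

Round-tripping a value through both helpers returns the original string, which is the property a producer and a consumer would both rely on.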

If one wishes to test for a particular component of the file, such as its name or extension, it is still possible to do so via regular expression. However, generating and parsing File Object instance data is now vastly simplified and less ambiguous. Device-path related fields are discussed in the next section.
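
A minimal Python sketch of such a regular-expression test follows. The function name `path_matches` and the two patterns are illustrative assumptions, as is the choice of case-insensitive matching; the input is the delimiter/components structure described above.

```python
import re


def path_matches(file_path, pattern):
    """Apply a regular expression to the joined file name/path string."""
    # A trailing None (JSON null) only marks a directory; skip it when joining.
    parts = [c for c in file_path["components"] if c is not None]
    joined = file_path["delimiter"].join(parts)
    # Case-insensitive matching is an assumption that suits Windows paths.
    return re.search(pattern, joined, flags=re.IGNORECASE) is not None


sample = {"delimiter": "\\", "components": ["C:", "Users", "abcd.dll"]}

print(path_matches(sample, r"\.dll$"))               # extension test
print(path_matches(sample, r"(^|\\)abcd\.[^\\]*$"))  # base-name test
```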

Operating system-specific fields

This is primarily in relation to the Device_Path field - the concept of devices and corresponding device paths is one that is exclusive to Windows. Accordingly, it doesn't make sense to have it as part of a generic File Object. The most sensible solution is to have an extension point on the File Object for these types of properties, and to relegate such properties to this extension point.

Excessive subclassing

The following represents the full hierarchy of File-related Objects as they exist in CybOX:

  • File
    • Archive File
    • Image File
    • PDF File
    • Unix File
    • Windows File
      • Windows Executable File

From a modeling perspective, it is logical to build up a tree of classes and sub-classes that represent the taxonomy of files in the cyber domain. However, there are a number of issues that stem from this approach:

  • Cascading changes. Due to the use of sub-classing, any change made to the top-level File Object necessitates a new version of its sub-classes; this is also true for any intermediate classes and their children, such as the Windows File and Windows Executable File.
  • Object management. Given the current use of sub-classing and the wide variety of potential Files that we may want to characterize, any new File Objects would also need to be added as sub-classes. This makes maintaining and managing them difficult, due to the sheer number of Objects but also as a result of any cascading changes as described above.
  • Object usage. As a consumer, it's difficult to determine whether the File Object or one of its children should be used in different types of situations, unless there are fields specific to a particular Object that need to be used. For example, if one is characterizing a PDF File without knowledge of any PDF-specific properties of the file, one could use either the File Object or the PDF File Object - the resulting characterization would be identical.

As described below, the solution to this and some of the other aforementioned issues would be to have a single File Object that supports a number of context-specific extension points.

Extension points

As alluded to above, one of the most sensible ways to support the different properties (and associated domains and use cases) would be to have extension points for the more specific classes of file properties:

  • Operating system-specific properties
    • Windows-specific properties
    • Unix-specific properties
  • File system-specific properties
  • File on disk properties
  • File metadata
    • Generic file metadata
    • Image file metadata
    • Document file metadata
    • Executable binary file metadata

It's an open question as to whether we should specify default extensions for each of these classes of properties or not; one possibility is to take the fields from the existing sub-classes of the File Object (e.g., the Image File) and use them as a default extension.

Notional Implementation

Given the changes proposed above, the File Object in CybOX 3.0 would have the following notional implementation, consisting of a set of default properties that cover basic file properties and file system properties, with extension points for everything else. Note that this includes some other potential changes, such as the refactoring of the Hashes structure from CybOX common.

Refactored File Object

FileObject

The new top-level File Object.

Field Type Multiplicity Description
hashes HashListType 0-1 Cryptographic hashes of the file, such as MD5 and SHA1.
size int 0-1 The size of the file, in bytes.
format string 0-1 The format of the file, as specified (for instance) by the UNIX file command.
file-system-properties FileSystemProperties 0-1 The basic properties associated with the storage of the file on a file system.
extension FileExtension 0-N An extension point for specifying domain or context-specific file properties, such as those relating to operating systems, file systems, etc.

FileSystemProperties

A basic set of common file system properties.

Field Type Multiplicity Description
is-directory boolean 1 A required flag that indicates whether the file object instance represents a directory (if TRUE) or a file (if FALSE).
file-name string 0-1 The name of the file, including its extension (if known) but excluding its path. This field may be included ONLY IF the is-directory property is set to FALSE.
file-path FilePath 0-1 The path to the file on the file system, excluding its name and extension. If this field is included without the file-name field, the file object instance specifies a directory.
modified-time string 0-1 The date/time the file was last modified.
accessed-time string 0-1 The date/time the file was last accessed.
created-time string 0-1 The date/time the file was created.

FilePath

Field Datatype Description
delimiter string The delimiter used in the file path string.
components list A list of strings that represent the components of the file path string, when split using the delimiter specified in the 'delimiter' field.

ExtendedFileProperties

An abstract class that is sub-classed and overridden by the various default extensions.
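
The structures above can be sketched as Python dataclasses. This is only one possible reading of the tables, with several assumptions: attribute names use underscores (as in the JSON examples later on this page) rather than the hyphenated names in the tables, hashes are kept as plain dictionaries, and the extension point is modeled as a dictionary keyed by extension name instead of an abstract class hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FilePath:
    delimiter: str
    components: List[str]


@dataclass
class FileSystemProperties:
    is_directory: bool                       # required (multiplicity 1)
    file_name: Optional[str] = None
    file_path: Optional[FilePath] = None
    modified_time: Optional[str] = None
    accessed_time: Optional[str] = None
    created_time: Optional[str] = None

    def __post_init__(self):
        # file-name may be included only if is-directory is FALSE.
        if self.is_directory and self.file_name is not None:
            raise ValueError("file_name is not allowed when is_directory is true")


@dataclass
class FileObject:
    hashes: Optional[List[dict]] = None
    size: Optional[int] = None
    format: Optional[str] = None
    file_system_properties: Optional[FileSystemProperties] = None
    # Extension point (assumed shape): extension key -> extension properties.
    extended_properties: Dict[str, dict] = field(default_factory=dict)
```

Enforcing the file-name rule in `__post_init__` means an inconsistent instance fails at construction time rather than at serialization time; that placement is a design choice of this sketch, not something the proposal mandates.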

Default Extensions

The list of default extensions below is still TBD - the following are intended primarily to provide some examples of our current thinking and to help drive future discussion and development.

FileMetadataExtension

A default extension point for capturing general classes of file metadata. A sub-class of the ExtendedFileProperties class. MUST be specified using a key named "FileMetadataExtension".

Field Type Multiplicity Description
mime-type string 0-1 The MIME type name specified for the file, e.g., "msword". This value MUST be one of the values found in the IANA media type registry (http://www.iana.org/assignments/media-types/media-types.xhtml).
magic-number string 0-1 The hexadecimal constant ("magic number") associated with a specific file format that corresponds to the file, if applicable.
has-mismatch boolean 0-1 Indicates that there is a mismatch between one or more stated and reported properties of the file. For example, a mismatch between the MIME type of the file and its file extension.
mismatch-type FileMismatchEnum 0-N Specifies the type of file mismatch that was found. This field is required if the has-mismatch property is set to true.

FileMismatchEnum

Value Description
extension/type A mismatch between the MIME type reported for the file and its file extension. For example, if the reported MIME type (as captured in the mime-type property) for the file is 'vnd.microsoft.portable-executable' and the file extension (as captured in the file-name property) is 'txt'.
magic/extension A mismatch between the magic number reported for the file and its file extension. For example, if the reported magic number (as captured in the magic-number property) for the file is '25504446', indicating a PDF file, and the file extension (as captured in the file-name property) is 'txt'.
magic/type A mismatch between the reported MIME type and magic number for the file. For example, if the reported MIME type (as captured in the mime-type property) for the file is 'JPEG' and the reported magic number is '424D' (as captured in the magic-number property, indicating a bitmap file).
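
The three mismatch types can be derived mechanically once each property is mapped to a common notion of file kind. The Python sketch below does this with tiny lookup tables seeded only from the examples in the descriptions above. The tables, the kind labels, and the function names are all hypothetical; a real implementation would need far more complete mappings, and values it does not recognize are simply not compared.

```python
# Hypothetical lookup tables, seeded from the examples in this section.
MIME_KIND = {"vnd.microsoft.portable-executable": "pe", "pdf": "pdf", "jpeg": "jpeg"}
MAGIC_KIND = {"25504446": "pdf", "424D": "bmp"}
EXTENSION_KIND = {"txt": "text", "pdf": "pdf", "bmp": "bmp", "exe": "pe", "jpg": "jpeg"}


def find_mismatches(mime_type=None, magic_number=None, file_name=None):
    """Return the FileMismatchEnum values that apply to the given properties."""
    mime_kind = MIME_KIND.get(mime_type.lower()) if mime_type else None
    magic_kind = MAGIC_KIND.get(magic_number.upper()) if magic_number else None
    ext_kind = None
    if file_name and "." in file_name:
        ext_kind = EXTENSION_KIND.get(file_name.rsplit(".", 1)[-1].lower())

    found = []
    if mime_kind and ext_kind and mime_kind != ext_kind:
        found.append("extension/type")
    if magic_kind and ext_kind and magic_kind != ext_kind:
        found.append("magic/extension")
    if mime_kind and magic_kind and mime_kind != magic_kind:
        found.append("magic/type")
    return found


def mismatch_properties(**properties):
    """Build the has-mismatch / mismatch-type pair for FileMetadataExtension."""
    found = find_mismatches(**properties)
    result = {"has-mismatch": bool(found)}
    if found:
        # mismatch-type is required whenever has-mismatch is true.
        result["mismatch-type"] = found
    return result
```
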

UFSFileExtension

A default extension point for capturing properties specific to the storage of the file on the UFS file system and its derivatives. A sub-class of the ExtendedFileProperties class. MUST be specified using a key named "UFSFileExtension".

Field Type Multiplicity Description
inode string 0-1 The index node (inode) value assigned to the file (usually an integer).

NTFSFileExtension

A default extension point for capturing properties specific to the storage of the file on the NTFS file system. A sub-class of the ExtendedFileProperties class. MUST be specified using a key named "NTFSFileExtension".

Field Type Multiplicity Description
sid string 0-1 The security ID (SID) value assigned to the file.
alternate_data_stream AlternateDataStream 0-N An NTFS alternate data stream that exists for the file.

AlternateDataStream

Specifies the properties associated with NTFS alternate data streams. Contains the same properties as the existing StreamObjectType.

ImageFileExtension

A default extension point for capturing image file specific metadata that may be associated with the file. An extension of the ExtendedFileProperties class, which otherwise contains the same properties as the existing ImageFileObject. MUST be specified using a key named "ImageFileExtension".

PDFFileExtension

A default extension point for capturing PDF file specific metadata that may be associated with the file. An extension of the ExtendedFileProperties class, which otherwise contains the same properties as the existing PDFFileObject. MUST be specified using a key named "PDFFileExtension".

ArchiveFileExtension

A default extension point for capturing archive file specific metadata that may be associated with the file. An extension of the ExtendedFileProperties class, which otherwise contains the same properties as the existing ArchiveFileObject. MUST be specified using a key named "ArchiveFileExtension".

PEBinaryFileExtension

A default extension point for capturing PE-binary specific properties that may be associated with the file. An extension of the ExtendedFileProperties class, which otherwise contains the same properties as the existing WindowsExecutableFileObject. MUST be specified using a key named "PEBinaryFileExtension".

Examples

Here are some examples to help illustrate how the new File Object and its extended properties would be used.

Basic characterization

The majority (~80%) of file characterization involves capturing the most basic, identifying properties of a file.

{ 
  "type" : "file-object",
  "hashes" : [{"type":"hash",
               "hash_type":"md5",
               "hash_value":"3773a88f65a5e780c8dff9cdc3a056f3"}],
  "size" : 25537
}
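
For reference, an instance like the one above could be produced with a few lines of Python. This is a sketch only: the function name `characterize` is hypothetical, and the output layout simply mirrors the notional JSON shown here rather than any finalized serialization.

```python
import hashlib
import json


def characterize(data):
    """Build a basic file-object dictionary from a file's raw bytes."""
    return {
        "type": "file-object",
        "hashes": [{"type": "hash",
                    "hash_type": "md5",
                    "hash_value": hashlib.md5(data).hexdigest()}],
        "size": len(data),
    }


print(json.dumps(characterize(b"hello"), indent=2))
```

In practice the bytes would be read from disk, for example with `open(path, "rb").read()`, instead of being passed as a literal.
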

Basic file system properties

Besides the basic identifying properties, those relating to the storage of a file on a file system are also often captured, especially for use cases around malware characterization and digital forensics.

{ 
  "type":"file-object",
  "hashes" : [{"type" : "hash",
               "hash_type":"md5",
               "hash_value":"3773a88f65a5e780c8dff9cdc3a056f3"}],
  "size" : 25537,
  "file_system_properties":{"is_directory":false,
                            "file_name": "test.dll",
                            "file_path": {"delimiter":"\\", 
                                          "components":["C:","windows"]}}
}

Extended properties

One of the key advantages of this new structure is that it provides a great deal of flexibility for specifying the various extended properties, so that any combination of such properties can be captured. In this case, let's say we want to capture some properties of a PE binary file that is stored on an EXT3 file system.

{ 
  "type":"file-object",
  "hashes" : [{"type" : "hash",
               "hash_type" : "md5",
               "hash_value" : "3773a88f65a5e780c8dff9cdc3a056f3"}],
  "size" : 25537,
  "file_system_properties":{"is_directory":false,
                            "file_name": "foo.exe",
                            "file_path": {"delimiter":"/", 
                                          "components":["usr","tmp"]}},
  "extended_properties": {"FileMetadataExtension":
                           {"mime-type":"vnd.microsoft.portable-executable"},
                          "UFSFileExtension":
                           {"inode":"34483923"},
                          "PEBinaryFileExtension":
                           {"exports":[{"name":"foo_app"}]}}
}
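
On the consuming side, keying each extension by name makes it straightforward to pick out the extensions a tool understands and to set aside the rest. The Python sketch below assumes the "extended_properties" layout from the example above; the function name `split_extensions` is hypothetical, and the set of known keys is just the list of default extensions named on this page, which is itself still TBD.

```python
import json

# Default extension keys named on this page (the list is still TBD).
KNOWN_EXTENSIONS = {
    "FileMetadataExtension", "UFSFileExtension", "NTFSFileExtension",
    "ImageFileExtension", "PDFFileExtension", "ArchiveFileExtension",
    "PEBinaryFileExtension",
}


def split_extensions(file_object):
    """Separate recognized default extensions from unrecognized keys."""
    extended = file_object.get("extended_properties", {})
    known = {k: v for k, v in extended.items() if k in KNOWN_EXTENSIONS}
    unknown = sorted(k for k in extended if k not in KNOWN_EXTENSIONS)
    return known, unknown


document = json.loads("""
{
  "type": "file-object",
  "size": 25537,
  "extended_properties": {
    "UFSFileExtension": {"inode": "34483923"},
    "PEBinaryFileExtension": {"exports": [{"name": "foo_app"}]},
    "VendorSpecificExtension": {"note": "hypothetical, not a default"}
  }
}
""")

known, unknown = split_extensions(document)
print(sorted(known), unknown)
```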