Uploader Users Guide

kenisteward edited this page Aug 14, 2017 · 12 revisions

Description

The herd uploader application is a command line program that copies files or directories from a local file system to an S3 bucket and registers them with the herd Registry.

The JAR is built as part of the herd application suite in the dm-tools project.

Command Line Summary

java -jar dm-uploader.jar
  [-a <S3AccessKey>]
  [-p <S3SecretKey>]
  [-e <S3Endpoint>]
  -l <LocalDirPath>
  -m <ManifestFilePath>
  [-V]
  -H <RegServerHost>
  -P <RegServerPort>
  [-n <HttpProxyHost>]
  [-o <HttpProxyPort>]
  [-s true]
  [-u <username>]
  [-w <password>]
  [-t <MaxThreads>]
  [-r]
  [-R <MaxRetryAttempts>]
  [-D <RetryDelaySecs>]
  [-c <socketTimeout>]
  [-f]

Options

-a <arg>, --s3AccessKey <arg>

  • Required: No
  • Type: String

The S3 access key used to authenticate the user connecting to the S3 service. When specified, make sure the s3SecretKey is also specified.

-p <arg>, --s3SecretKey <arg>

  • Required: No
  • Type: String

The S3 secret key used to authenticate the user connecting to the S3 service. When specified, make sure the s3AccessKey is also specified.

-e <arg>, --s3Endpoint <arg>

  • Required: No
  • Type: String

The optional Amazon S3 endpoint to use when making S3 service calls.

-l <arg>, --localPath <arg>

  • Required: Yes
  • Type: String

The local path of the directory or file to upload.

-m <arg>, --manifestPath <arg>

  • Required: Yes
  • Type: String

Local path to the manifest file.

-V, --createNewVersion

  • Required: No

If set, a new business object data version will be created and registered. Otherwise, only an initial business object data version (version 0) is allowed to be registered.

-H <arg>, --regServerHost <arg>

  • Required: Yes
  • Type: String

Registration Server hostname.

-P <arg>, --regServerPort <arg>

  • Required: Yes
  • Type: Integer

Registration Server port.

-Y <arg>, --dmRegServerHost <arg>

  • Required: No
  • Type: String

DEPRECATED. Use regServerHost parameter.

-Z <arg>, --dmRegServerPort <arg>

  • Required: No
  • Type: String

DEPRECATED. Use regServerPort parameter.

-h, --help

  • Required: No

Display usage information and exit.

-v, --version

  • Required: No

Display version information and exit.

-n <arg>, --httpProxyHost <arg>

  • Required: No
  • Type: String

The hostname of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyPort is also specified when using this option.

-o <arg>, --httpProxyPort <arg>

  • Required: No
  • Type: Integer

The port number of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyHost is also specified when using this option.

-s <arg>, --ssl <arg>

  • Required: No
  • Type: Boolean
  • Default: False

If set to true, enables SSL (HTTPS) to communicate with the herd Registration Service. Otherwise, uses HTTP.

-u <arg>, --username <arg>

  • Required: Yes if --ssl is True
  • Type: String

The username used for HTTPS client authentication with the herd Registration Service.

Note: If the username contains spaces, enclose it in double quotes ("") so that it is parsed as a single argument.

-w <arg>, --password <arg>

  • Required: Yes if --ssl is True
  • Type: String

The password used for HTTPS client authentication with the herd Registration Service.

Note: To avoid complications with parsing the password, enclose it in double quotes ("").

-t <arg>, --maxThreads <arg>

  • Required: No
  • Type: Integer
  • Default: 10

The maximum number of threads to use during file transfers. If this argument isn't specified, the default is used. Amazon does a good job of determining how many threads to use, so setting this option is not recommended unless there is a specific need. Expect roughly 55Mbps of throughput per thread, and run the tool on a machine sized for the performance you require.
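As a rough sizing aid, the per-thread figure quoted for this option can be turned into a transfer-time estimate. The sketch below is not part of the uploader: `estimate_upload_seconds` is a hypothetical helper, "Mbps" is read as megabits per second, and throughput is assumed to scale linearly with the thread count, which this guide does not promise.

```python
PER_THREAD_MBPS = 55  # approximate per-thread throughput quoted in this guide


def estimate_upload_seconds(total_bytes: int, threads: int = 10) -> float:
    """Estimate transfer time, ASSUMING throughput scales linearly with threads."""
    if total_bytes < 0 or threads < 1:
        raise ValueError("total_bytes must be >= 0 and threads must be >= 1")
    bits = total_bytes * 8
    bits_per_second = threads * PER_THREAD_MBPS * 1_000_000
    return bits / bits_per_second


if __name__ == "__main__":
    # Estimate for 100 GB using the default of 10 threads.
    print(round(estimate_upload_seconds(100 * 10**9)), "seconds")
```

Treat the result as an optimistic lower bound; network limits and contention on the machine can make real transfers slower.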

-r, --rrs

  • Required: No

When specified, uploads into S3 will use reduced redundancy storage. This is a cheaper storage option and can be used when files aren't critical or don't need to be accessed long term.

-R <args>, --maxRetryAttempts <args>

  • Required: No
  • Type: Integer (min: 0, max: 10)
  • Default: 5

The maximum number of times the uploader retries the business object data registration before rolling back the upload.

-D <args>, --retryDelaySecs <args>

  • Required: No
  • Type: Integer (min: 0, max: 900)
  • Default: 120

The delay in seconds between the business object data registration retry attempts.
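Taken together, --maxRetryAttempts and --retryDelaySecs bound how long the uploader may wait on registration retries before giving up. The sketch below checks the documented ranges and computes that bound. It assumes one full delay per retry attempt, which is an interpretation rather than documented behavior, and `max_retry_wait_secs` is a hypothetical helper.

```python
def max_retry_wait_secs(max_retry_attempts: int = 5, retry_delay_secs: int = 120) -> int:
    """Upper bound on time spent waiting between registration retries.

    ASSUMPTION: the uploader waits retry_delay_secs once per retry attempt.
    The defaults and ranges come from the option descriptions in this guide.
    """
    if not 0 <= max_retry_attempts <= 10:
        raise ValueError("maxRetryAttempts must be between 0 and 10")
    if not 0 <= retry_delay_secs <= 900:
        raise ValueError("retryDelaySecs must be between 0 and 900")
    return max_retry_attempts * retry_delay_secs


if __name__ == "__main__":
    print(max_retry_wait_secs())       # documented defaults
    print(max_retry_wait_secs(3, 60))  # values used with -R 3 -D 60
```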

-c <args>, --socketTimeout <args>

  • Required: No
  • Type: Integer
  • Default: 50000
  • Release: 0.18.0

The socket timeout in milliseconds. 0 indicates no timeout.

-f, --force

  • Required: No

If set, allows upload to proceed when the latest version of the business object data is in UPLOADING state by invalidating it.

Returned Codes

The command line program returns zero when execution succeeds and non-zero when execution fails.
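A calling script can rely on this convention to decide whether an upload succeeded. The following is a minimal sketch, assuming the uploader is launched as a child process; `run_uploader` is a hypothetical wrapper, and the argument list passed to it is whatever command line you would otherwise type.

```python
import subprocess
import sys
from typing import Sequence


def run_uploader(argv: Sequence[str]) -> bool:
    """Run the given command and report success based on its exit code."""
    completed = subprocess.run(list(argv))
    if completed.returncode != 0:
        # Non-zero means the execution failed; details are in the console output.
        print(f"uploader failed with exit code {completed.returncode}", file=sys.stderr)
        return False
    return True
```

For example, `run_uploader(["java", "-jar", "dm-uploader.jar", "-h"])` would return True only if the process exits with zero.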

Logging

The uploader writes its output, including errors, to the console. Informational messages such as key program parameters and the total number of files/bytes copied are also logged.

Note that if an error occurs, an "ERROR" level logging message will be displayed, most likely with a stack trace. If a stack trace is logged, look at the lowest-level "Caused by" message to find the underlying problem, rather than the first error message, which typically only says what the application was unable to do. For example, the following error and stack trace could appear in the logs:

Oct-24-2014 17:06:52.815 [main] ERROR finra.dm.tools.uploader.UploaderApp.main - Error running Data Management Uploader.
java.lang.RuntimeException: Failed to list keys/objects with prefix "application_a/application_a/prc-meta/txt/exch-data/schm-v0/data-v0/application-a-prcsg-dt=2014-07-10" from bucket "myBucket".
        at org.finra.dm.dao.S3DaoImpl.listKeysMatchingKeyPrefix(S3DaoImpl.java:543)
        ...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: <request_id_goes_here>), S3 Extended Request ID: <extended_request_id_goes_here>
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:937)

In this case, the top-level exception shows that the Uploader wasn't able to list keys/objects, whereas the underlying "Caused by" exception gives the reason: the AWS Access Key Id does not exist (i.e. it is invalid).
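When uploader output is captured to a file, the lowest-level cause can be pulled out automatically. The sketch below is a hypothetical helper, not part of the uploader; it assumes the standard Java stack trace layout in which each nested cause starts a line with "Caused by:".

```python
from typing import Optional


def root_cause(log_text: str) -> Optional[str]:
    """Return the text of the last 'Caused by:' line, or None if there is none."""
    prefix = "Caused by:"
    cause = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Later matches overwrite earlier ones, so the deepest cause wins.
            cause = stripped[len(prefix):].strip()
    return cause
```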

NOTE: You might see the socket and HTTP exceptions below in the Uploader output. Those exceptions are safe to ignore, since they are typically handled seamlessly by the AWS Java SDK. Still, if you observe them, try reducing the number of threads used by the relevant Uploader instance and/or limiting the number of Uploader instances (parallel upload jobs) that you run on the same machine.

  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Timeout waiting for connection
    org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection

  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Broken pipe
    java.net.SocketException: Broken pipe

  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Socket is closed
    java.net.SocketException: Socket is closed

Manifest "Side-car" File

The manifest file, or "side-car" file, is used by a DataManagement component to register new data to the Data Registry. This subsection describes the specification for the manifest file, which includes required and optional fields.

The characteristics of the file should be:

  • Name: <manifest_file_name>.json
  • Type: Text
  • Encoding: UTF-8
  • Format: JSON
  • Case-sensitive: No

manifest definition

namespace

  • Required: Yes
  • Type: String
  • Case sensitive: No

Namespace to which the business object definition belongs.

businessObjectDefinitionName

  • Required: Yes
  • Type: String
  • Case sensitive: No

Business Object Definition Name

businessObjectFormatUsage

  • Required: Yes
  • Type: String
  • Case sensitive: No

Business Object Format Usage

businessObjectFormatFileType

  • Required: Yes
  • Type: String
  • Case sensitive: No

Business Object Format File Type

businessObjectFormatVersion

  • Required: Yes
  • Type: String
  • Case sensitive: No

Business Object Format Version

partitionKey

  • Required: Yes
  • Type: String
  • Case sensitive: No

Business Object Format Partition Key (e.g. tradeDate)

partitionValue

  • Required: Yes
  • Type: String
  • Case sensitive: Yes

Business Object Data Partition Value (dates typically in YYYY-MM-DD format)

subPartitionValues

  • Required: No
  • Type: Array
  • Property type: String
  • Case sensitive: Yes

Business Object Data Sub-Partition Values

storageName

  • Required: No
  • Type: String
  • Case sensitive: No
  • Default: S3_MANAGED

Name of storage to upload to.

tradeDate

  • Required: No
  • Type: String
  • Case sensitive: No

DEPRECATED. The trade date value in YYYYMMDD format.

files

  • Required: No
  • Type: Array
  • Property type: String
  • Case sensitive: Yes

DEPRECATED. The list of file names.

manifestFiles

  • Required: Yes
  • Type: Array
  • Property type: manifestFile

The list of file information.

attributes

  • Required: No
  • Type: Object
  • Property type: String
  • Case sensitive: Yes

The list of name/value pairs to be associated with the data.

businessObjectDataParents

  • Required: No
  • Type: Array
  • Property type: businessObjectDataParent

An optional list of business object data parents that were used/needed in the creation of this data.

manifestFile definition

fileName

  • Required: Yes
  • Type: String
  • Case sensitive: Yes

The file name of a manifest file.

rowCount

  • Required: No
  • Type: Integer

The row count of a manifest file.

businessObjectDataParent definition

namespace

  • Required: Yes
  • Type: String
  • Case sensitive: No

The namespace to which the parent's business object definition belongs.

businessObjectDefinitionName

  • Required: Yes
  • Type: String
  • Case sensitive: No

The name of the business object definition for a specific business object data parent.

businessObjectFormatUsage

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object format usage for a specific business object data parent.

businessObjectFormatFileType

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object format file type for a specific business object data parent.

businessObjectFormatVersion

  • Required: Yes
  • Type: Integer

The business object format version for a specific business object data parent.

partitionValue

  • Required: Yes
  • Type: String
  • Case sensitive: Yes

The partition value for a specific business object data parent.

subPartitionValues

  • Required: No
  • Type: Array
  • Property type: String
  • Case sensitive: Yes

The sub-partition values for a specific business object data parent.

businessObjectDataVersion

  • Required: Yes
  • Type: Integer

The business object data version for a specific business object data parent.

Manifest File Format

{
    "namespace": STRING,
    "businessObjectDefinitionName": STRING,
    "businessObjectFormatUsage": STRING,
    "businessObjectFormatFileType": STRING,
    "businessObjectFormatVersion": STRING,
    "partitionKey": STRING,
    "partitionValue": STRING,
    "subPartitionValues": [STRING, STRING, STRING, STRING],
    "storageName": STRING,
    "files": [ STRING, STRING, STRING, ... ],
    "manifestFiles" : [ {
        "fileName" : STRING,
        "rowCount" : NUMBER
    },
    ...
    ],
    "attributes": { STRING: STRING, STRING: STRING, ... },
    "businessObjectDataParents" : [ {
        "namespace": STRING,
        "businessObjectDefinitionName" : STRING,
        "businessObjectFormatUsage" : STRING,
        "businessObjectFormatFileType" : STRING,
        "businessObjectFormatVersion" : NUMBER,
        "partitionValue" : STRING,
        "subPartitionValues": [STRING, STRING, STRING, STRING],
        "businessObjectDataVersion" : NUMBER
    },
    ...
    ]
}
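A manifest in this format can be generated rather than written by hand. The following is a minimal sketch that fills in only the required top-level fields plus `manifestFiles`; `build_manifest` and its parameter names are hypothetical, and the format version is written as a string to match the top-level type given in the field list.

```python
import json
from typing import Iterable, Tuple


def build_manifest(
    namespace: str,
    definition_name: str,
    usage: str,
    file_type: str,
    format_version: int,
    partition_key: str,
    partition_value: str,
    files: Iterable[Tuple[str, int]],
) -> str:
    """Return manifest JSON text built from (fileName, rowCount) pairs."""
    manifest = {
        "namespace": namespace,
        "businessObjectDefinitionName": definition_name,
        "businessObjectFormatUsage": usage,
        "businessObjectFormatFileType": file_type,
        # The field list documents the top-level format version as a String.
        "businessObjectFormatVersion": str(format_version),
        "partitionKey": partition_key,
        "partitionValue": partition_value,
        "manifestFiles": [
            {"fileName": name, "rowCount": rows} for name, rows in files
        ],
    }
    return json.dumps(manifest, indent=4)
```

Writing the returned text to a `.json` file encoded as UTF-8 gives a file that can be passed to the `-m` option.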

Manifest File Example

The following is an example of a manifest file for NEW_ORDERS object source data for 2014-04-01.

/nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/manifest.json:

(using "manifestFiles")

{
    "namespace": "APPLICATION_A",
    "businessObjectDefinitionName": "NEW_ORDERS",
    "businessObjectFormatUsage": "PRC",
    "businessObjectFormatFileType": "TXT",
    "businessObjectFormatVersion": "2",
    "partitionKey": "PROCESS_DATE",
    "partitionValue": "2014-04-01",
    "manifestFiles" : [ {
        "fileName" : "testFile1.gz",
        "rowCount" : 1000
    }, {
        "fileName" : "testFile2.gz",
        "rowCount" : 2000
    } ],
    "attributes": {"name1": "value1", "name2": "value2"},
    "businessObjectDataParents" : [ {
        "businessObjectDefinitionName" : "NEW_ORDERS",
        "businessObjectFormatUsage" : "SRC",
        "businessObjectFormatFileType" : "TXT",
        "businessObjectFormatVersion" : 1,
        "partitionValue" : "2014-04-01",
        "businessObjectDataVersion" : 0
    } ]
}

(using deprecated "files")

{
    "namespace": "APPLICATION_A",
    "businessObjectDefinitionName": "NEW_ORDERS",
    "businessObjectFormatUsage": "PRC",
    "businessObjectFormatFileType": "TXT",
    "businessObjectFormatVersion": "2",
    "partitionKey": "PROCESS_DATE",
    "partitionValue": "2014-04-01",
    "files": ["testFile1.gz", "testFile2.gz"],
    "attributes": {"name1": "value1", "name2": "value2"},
    "businessObjectDataParents" : [ {
        "businessObjectDefinitionName" : "NEW_ORDERS",
        "businessObjectFormatUsage" : "SRC",
        "businessObjectFormatFileType" : "TXT",
        "businessObjectFormatVersion" : 1,
        "partitionValue" : "2014-04-01",
        "businessObjectDataVersion" : 0
    } ]
}
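Before running the uploader, a manifest like the ones above can be checked for the fields marked as required. The sketch below is a hypothetical pre-flight check, not the uploader's own validation. It assumes that the deprecated `files` list may stand in for `manifestFiles`, as the second example suggests.

```python
import json
from typing import List

REQUIRED_FIELDS = (
    "namespace",
    "businessObjectDefinitionName",
    "businessObjectFormatUsage",
    "businessObjectFormatFileType",
    "businessObjectFormatVersion",
    "partitionKey",
    "partitionValue",
)


def missing_manifest_fields(manifest: dict) -> List[str]:
    """Return the names of required fields that are absent from the manifest."""
    missing = [name for name in REQUIRED_FIELDS if name not in manifest]
    # ASSUMPTION: either "manifestFiles" or the deprecated "files" is enough.
    if "manifestFiles" not in manifest and "files" not in manifest:
        missing.append("manifestFiles")
    for index, entry in enumerate(manifest.get("manifestFiles", [])):
        if "fileName" not in entry:
            missing.append(f"manifestFiles[{index}].fileName")
    return missing


def check_manifest_file(path: str) -> List[str]:
    """Load a manifest file as UTF-8 JSON and list its missing required fields."""
    with open(path, encoding="utf-8") as handle:
        return missing_manifest_fields(json.load(handle))
```

An empty list means every required field is present; it does not guarantee that the herd Registry will accept the values.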

Usage Example

The below command uploads NEW_ORDERS 20140401 data to the "S3_MANAGED" DEV bucket and registers it with the herd Registry service.

java -jar dm-uploader-app.jar \
-a <accessKey> \
-p <secretKey> \
-e s3-external-1.amazonaws.com \
-l /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/ \
-m /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/manifest.json \
-V \
-H myHostname.us-east-1.elb.amazonaws.com \
-P 80 \
-P 80 \
-s true \
-u <username> \
-w <password> \
-n myProxyHostname \
-o 80 \
-R 3 \
-D 60
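When the command is launched from a script, the pairing rules described in the Options section can be checked before the process starts. The sketch below assembles an argument list for a subset of the options; `build_uploader_args` is a hypothetical helper, and the jar name is simply the one used in the command above.

```python
from typing import List, Optional


def build_uploader_args(
    local_path: str,
    manifest_path: str,
    host: str,
    port: int,
    *,
    jar: str = "dm-uploader-app.jar",
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    proxy_host: Optional[str] = None,
    proxy_port: Optional[int] = None,
    ssl: bool = False,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list, enforcing the documented option pairings."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("-a and -p must be specified together")
    if (proxy_host is None) != (proxy_port is None):
        raise ValueError("-n and -o must be specified together")
    if ssl and (username is None or password is None):
        raise ValueError("-u and -w are required when -s true is used")

    args = ["java", "-jar", jar, "-l", local_path, "-m", manifest_path,
            "-H", host, "-P", str(port)]
    if access_key is not None:
        args += ["-a", access_key, "-p", secret_key]
    if proxy_host is not None:
        args += ["-n", proxy_host, "-o", str(proxy_port)]
    if ssl:
        args += ["-s", "true", "-u", username, "-w", password]
    return args
```

Passing the resulting list to `subprocess.run` without a shell hands each value to the program as a single argument, so a username or password containing spaces needs no extra quoting.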

Environment/Security Access Details

  • Please make sure that the server where you run the Uploader can talk to the herd application server. That might require a new firewall rule to be set up.
  • Depending on your environment, you might need to provide values for the HTTP proxy parameters (i.e. the -n and -o parameters) in order for the uploader tool to communicate with AWS S3.
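
Whether the machine can reach the herd application server can be checked with a plain TCP connection attempt before running an upload. This is a minimal sketch that assumes a direct connection (no proxy) to the host and port you would pass as -H and -P; `can_reach` is a hypothetical helper.

```python
import socket


def can_reach(host: str, port: int, timeout_secs: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A False result points to a firewall or DNS problem rather than an uploader problem; a True result only shows that the port is open, not that the service behind it is healthy.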