Uploader Users Guide
The herd uploader application is a command-line program that copies files or directories from a local file system to an S3 bucket and registers them with the herd Registry.
The JAR is built as part of the herd application suite in the dm-tools project.
java -jar dm-uploader.jar
[-a <S3AccessKey>]
[-p <S3SecretKey>]
[-e <S3Endpoint>]
-l <LocalDirPath>
-m <ManifestFilePath>
[-V]
-H <RegServerHost>
-P <RegServerPort>
[-n <HttpProxyHost>]
[-o <HttpProxyPort>]
[-s true]
[-u <username>]
[-w <password>]
[-t <MaxThreads>]
[-r]
[-R <MaxRetryAttempts>]
[-D <RetryDelaySecs>]
[-c <socketTimeout>]
[-f]
- Required: No
- Type: String
The S3 access key used to authenticate the user connecting to the S3 service. When specified, make sure the s3SecretKey is also specified.
- Required: No
- Type: String
The S3 secret key used to authenticate the user connecting to the S3 service. When specified, make sure the s3AccessKey is also specified.
- Required: No
- Type: String
The optional Amazon S3 endpoint to use when making S3 service calls.
- Required: Yes
- Type: String
The local path of the directory or file to upload from or download into.
- Required: Yes
- Type: String
Local path to the manifest file.
- Required: No
If set, a new business object data version will be created and registered. Otherwise, only an initial business object data version (version 0) is allowed to be registered.
- Required: Yes
- Type: String
Registration Server hostname.
- Required: Yes
- Type: Integer
Registration Server port.
- Required: No
- Type: String
DEPRECATED. Use regServerHost parameter.
- Required: No
- Type: String
DEPRECATED. Use regServerPort parameter.
- Required: No
Display usage information and exit.
- Required: No
Display version information and exit.
- Required: No
- Type: String
The hostname of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyPort is also specified when using this option.
- Required: No
- Type: Integer
The port number of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyHost is also specified when using this option.
- Required: No
- Type: Boolean
- Default: False
If set to true, enables SSL (HTTPS) to communicate with the herd Registration Service. Otherwise, uses HTTP.
- Required: Yes if --ssl is True
- Type: String
The username used for HTTPS client authentication with the herd Registration Service.
Note: To avoid problems parsing a username that contains spaces, enclose the username in "" (double quotes).
- Required: Yes if --ssl is True
- Type: String
The password used for HTTPS client authentication with the herd Registration Service.
Note: To avoid problems parsing the password, enclose the password in "" (double quotes).
- Required: No
- Type: Integer
- Default: 10
The maximum number of threads to use during file transfers. If this argument isn't specified, the default is used. Amazon does a good job of determining how many threads to use, so setting this option is not recommended unless there is a specific need. Note that only about 55 Mbps of throughput is expected per thread, so run the tool on a machine sized for the required performance.
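As a rough sizing aid, the per-thread throughput figure can be turned into a transfer-time estimate. The helper below is a hypothetical sketch, not part of herd. It assumes throughput scales linearly with the thread count and that 1 Mbps means 10^6 bits per second; real transfers are also limited by network and disk, so treat the result as an optimistic lower bound.

```python
def estimate_transfer_seconds(total_bytes, threads=10, mbps_per_thread=55.0):
    """Optimistic transfer-time estimate, assuming linear scaling per thread."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    total_bits = total_bytes * 8
    # Aggregate throughput in bits per second across all threads.
    bits_per_second = threads * mbps_per_thread * 1_000_000
    return total_bits / bits_per_second
```

Under these assumptions, a 10 GB upload (10^10 bytes) with 10 threads works out to roughly 145 seconds.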
- Required: No
When specified, uploads into S3 will use reduced redundancy storage. This is a cheaper storage option and can be used when files aren't critical or don't need to be accessed long term.
- Required: No
- Type: Integer (min: 0, max: 10)
- Default: 5
The maximum number of business object data registration retry attempts that the uploader performs before rolling back the upload.
- Required: No
- Type: Integer (min: 0, max: 900)
- Default: 120
The delay in seconds between the business object data registration retry attempts.
- Required: No
- Type: Integer
- Default: 50000
- Release: 0.18.0
The socket timeout in milliseconds. 0 indicates no timeout.
- Required: No
If set, allows upload to proceed when the latest version of the business object data is in UPLOADING state by invalidating it.
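The pairing and range rules described for these parameters can be checked before the uploader is launched. The following is a minimal sketch of a hypothetical wrapper (the function name and keyword arguments are illustrative, not part of herd); it only uses the short options shown in the synopsis and the constraints stated above.

```python
def build_uploader_args(local_path, manifest_path, host, port, *,
                        access_key=None, secret_key=None,
                        proxy_host=None, proxy_port=None,
                        ssl=False, username=None, password=None,
                        max_retries=None, retry_delay_secs=None,
                        jar="dm-uploader.jar"):
    """Return an argv list for the uploader after checking the documented rules."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("-a and -p must be specified together")
    if (proxy_host is None) != (proxy_port is None):
        raise ValueError("-n and -o must be specified together")
    if ssl and (username is None or password is None):
        raise ValueError("-u and -w are required when -s true is used")
    if max_retries is not None and not 0 <= max_retries <= 10:
        raise ValueError("-R must be between 0 and 10")
    if retry_delay_secs is not None and not 0 <= retry_delay_secs <= 900:
        raise ValueError("-D must be between 0 and 900")

    # Required options first, then any optional ones that were supplied.
    args = ["java", "-jar", jar, "-l", local_path, "-m", manifest_path,
            "-H", host, "-P", str(port)]
    if ssl:
        args += ["-s", "true"]
    optional = [("-a", access_key), ("-p", secret_key),
                ("-n", proxy_host), ("-o", proxy_port),
                ("-u", username), ("-w", password),
                ("-R", max_retries), ("-D", retry_delay_secs)]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args
```

Because the result is a list of separate arguments rather than a single shell string, a username or password containing spaces is passed through as one argument without extra quoting.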
The command line program returns zero when execution succeeds and non-zero when execution fails.
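A calling script can rely on this exit status to decide whether the upload succeeded. The sketch below is a hypothetical wrapper that assumes the uploader is started as a child process from Python; the function name is illustrative.

```python
import subprocess
import sys

def run_uploader(argv):
    """Run the uploader command line; return True only on exit status zero."""
    completed = subprocess.run(argv)
    if completed.returncode != 0:
        # Any non-zero status means the execution failed.
        print(f"uploader failed with exit status {completed.returncode}",
              file=sys.stderr)
        return False
    return True
```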
The uploader displays output, including errors, on the console. Informational messages, such as key program parameters and the total number of files/bytes copied, are also logged.
Note that if an error occurs, an "ERROR" level log message is displayed, most likely with a stack trace. When a stack trace is logged, look at the lowest-level "Caused by" message to find the underlying cause; the first error message typically states only what the application was unable to do. For example, the following error and stack trace could appear in the logs:
Oct-24-2014 17:06:52.815 [main] ERROR finra.dm.tools.uploader.UploaderApp.main - Error running Data Management Uploader.
java.lang.RuntimeException: Failed to list keys/objects with prefix "application_a/application_a/prc-meta/txt/exch-data/schm-v0/data-v0/application-a-prcsg-dt=2014-07-10" from bucket "myBucket".
at org.finra.dm.dao.S3DaoImpl.listKeysMatchingKeyPrefix(S3DaoImpl.java:543)
...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: <request_id_goes_here>), S3 Extended Request ID: <extended_request_id_goes_here>
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:937)
In this case, the top-level exception shows that the Uploader wasn't able to list keys/objects, whereas the underlying "Caused by" exception gives the reason: the AWS Access Key Id does not exist (i.e. it is invalid).
NOTE: You might see the socket and HTTP exceptions below in the Uploader output. Those exceptions are safe to ignore, since they are typically handled seamlessly by the AWS Java SDK. Still, if you observe them, try reducing the number of threads used by the affected Uploader instance and/or limiting the number of Uploader instances (parallel upload jobs) that you run on the same machine.
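Finding the lowest-level "Caused by" message can be automated when logs are long. The following is a minimal sketch of a hypothetical helper that assumes the log follows the usual Java stack trace layout, in which the last "Caused by:" line is the deepest cause.

```python
def root_cause(log_text):
    """Return the last 'Caused by:' line of a stack trace, or None if absent."""
    causes = [line.strip() for line in log_text.splitlines()
              if line.lstrip().startswith("Caused by:")]
    # The deepest cause is printed last in a Java stack trace.
    return causes[-1] if causes else None
```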
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Timeout waiting for connection org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Broken pipe java.net.SocketException: Socket is closed
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Socket is closed java.net.SocketException: Broken pipe
The manifest file, or "side-car" file, is used by a DataManagement component to register new data to the Data Registry. This subsection describes the specification for the manifest file, which includes required and optional fields.
The characteristics of the file should be:
- Name: <manifest_file_name>.json
- Type: Text
- Encoding: UTF-8
- Format: JSON
- Case-sensitive: No
- Required: Yes
- Type: String
- Case sensitive: No
Namespace to which the business object definition belongs.
- Required: Yes
- Type: String
- Case sensitive: No
Business Object Definition Name
- Required: Yes
- Type: String
- Case sensitive: No
Business Object Format Usage
- Required: Yes
- Type: String
- Case sensitive: No
Business Object Format File Type
- Required: Yes
- Type: String
- Case sensitive: No
Business Object Format Version
- Required: Yes
- Type: String
- Case sensitive: No
Business Object Format Partition Key (e.g. tradeDate)
- Required: Yes
- Type: String
- Case sensitive: Yes
Business Object Data Partition Value (dates typically in YYYY-MM-DD format)
- Required: No
- Type: Array
- Property type: String
- Case sensitive: Yes
Business Object Data Sub-Partition Values
- Required: No
- Type: String
- Case sensitive: No
- Default: S3_MANAGED
Name of storage to upload to.
- Required: No
- Type: String
- Case sensitive: No
DEPRECATED. The trade date value in YYYYMMDD format.
- Required: No
- Type: Array
- Property type: String
- Case sensitive: Yes
DEPRECATED. The list of file names.
- Required: Yes
- Type: Array
- Property type: manifestFile
The list of file information.
- Required: No
- Type: Object
- Property type: String
- Case sensitive: Yes
The list of name/value pairs to be associated with the data.
- Required: No
- Type: Array
- Property type: businessObjectDataParent
An optional list of business object data parents that were used/needed in the creation of this data.
- Required: Yes
- Type: String
- Case sensitive: Yes
The file name of a manifest file.
- Required: No
- Type: Integer
The row count of a manifest file.
- Required: Yes
- Type: String
- Case sensitive: No
The namespace to which the parent's business object definition belongs.
- Required: Yes
- Type: String
- Case sensitive: No
The name of the business object definition for a specific business object data parent.
- Required: Yes
- Type: String
- Case sensitive: No
The business object format usage for a specific business object data parent.
- Required: Yes
- Type: String
- Case sensitive: No
The business object format file type for a specific business object data parent.
- Required: Yes
- Type: Integer
The business object format version for a specific business object data parent.
- Required: Yes
- Type: String
- Case sensitive: Yes
The partition value for a specific business object data parent.
- Required: No
- Type: Array
- Property type: String
- Case sensitive: Yes
The sub-partition values for a specific business object data parent.
- Required: Yes
- Type: Integer
The business object data version for a specific business object data parent.
{
"namespace": STRING,
"businessObjectDefinitionName": STRING,
"businessObjectFormatUsage": STRING,
"businessObjectFormatFileType": STRING,
"businessObjectFormatVersion": STRING,
"partitionKey": STRING,
"partitionValue": STRING,
"subPartitionValues": [STRING, STRING, STRING, STRING],
"storageName": STRING,
"files": [ STRING, STRING, STRING, ... ],
"manifestFiles" : [ {
"fileName" : STRING,
"rowCount" : NUMBER
},
...
],
"attributes": { STRING: STRING, STRING: STRING, ... },
"businessObjectDataParents" : [ {
"namespace": STRING,
"businessObjectDefinitionName" : STRING,
"businessObjectFormatUsage" : STRING,
"businessObjectFormatFileType" : STRING,
"businessObjectFormatVersion" : NUMBER,
"partitionValue" : STRING,
"subPartitionValues": [STRING, STRING, STRING, STRING],
"businessObjectDataVersion" : NUMBER
},
...
]
}
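A manifest can be checked against the required fields listed above before it is handed to the uploader. The sketch below is a hypothetical pre-flight check, not herd's own validation. It assumes exact key spelling as shown in the structure above, and it accepts either "manifestFiles" or the deprecated "files" list, which is an assumption about what the uploader tolerates.

```python
import json

# Top-level fields marked "Required: Yes" in the specification above.
REQUIRED_FIELDS = (
    "namespace",
    "businessObjectDefinitionName",
    "businessObjectFormatUsage",
    "businessObjectFormatFileType",
    "businessObjectFormatVersion",
    "partitionKey",
    "partitionValue",
)

def manifest_problems(manifest_text):
    """Return a list of human-readable problems found in a manifest JSON string."""
    manifest = json.loads(manifest_text)
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in manifest]
    files = manifest.get("manifestFiles")
    if not files and not manifest.get("files"):
        problems.append("neither manifestFiles nor deprecated files is present")
    for index, entry in enumerate(files or []):
        # fileName is required for each entry; rowCount is optional.
        if "fileName" not in entry:
            problems.append(f"manifestFiles[{index}] is missing fileName")
    return problems
```

An empty list means no problems were found by these checks; it does not guarantee that registration will succeed.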
Below is an example of a manifest file for NEW_ORDERS object source data for 2014-04-01.
/nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/manifest.json:
(using "manifestFiles")
{
"namespace": "APPLICATION_A",
"businessObjectDefinitionName": "NEW_ORDERS",
"businessObjectFormatUsage": "PRC",
"businessObjectFormatFileType": "TXT",
"businessObjectFormatVersion": "2",
"partitionKey": "PROCESS_DATE",
"partitionValue": "2014-04-01",
"manifestFiles" : [ {
"fileName" : "testFile1.gz",
"rowCount" : 1000
}, {
"fileName" : "testFile2.gz",
"rowCount" : 2000
} ],
"attributes": {"name1": "value1", "name2": "value2"},
"businessObjectDataParents" : [ {
"businessObjectDefinitionName" : "NEW_ORDERS",
"businessObjectFormatUsage" : "SRC",
"businessObjectFormatFileType" : "TXT",
"businessObjectFormatVersion" : 1,
"partitionValue" : "2014-04-01",
"businessObjectDataVersion" : 0
} ]
}
(using deprecated "files")
{
"namespace": "APPLICATION_A",
"businessObjectDefinitionName": "NEW_ORDERS",
"businessObjectFormatUsage": "PRC",
"businessObjectFormatFileType": "TXT",
"businessObjectFormatVersion": "2",
"partitionKey": "PROCESS_DATE",
"partitionValue": "2014-04-01",
"files": ["testFile1.gz", "testFile2.gz"],
"attributes": {"name1": "value1", "name2": "value2"},
"businessObjectDataParents" : [ {
"businessObjectDefinitionName" : "NEW_ORDERS",
"businessObjectFormatUsage" : "SRC",
"businessObjectFormatFileType" : "TXT",
"businessObjectFormatVersion" : 1,
"partitionValue" : "2014-04-01",
"businessObjectDataVersion" : 0
} ]
}
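Since "files" is deprecated in favor of "manifestFiles", an existing manifest can be converted from the second form to the first. The following is a minimal sketch of a hypothetical migration helper that works on an already parsed manifest; it omits "rowCount" because that field is optional and cannot be derived from the file name list.

```python
def migrate_files_field(manifest):
    """Return a copy of the manifest with deprecated 'files' turned into 'manifestFiles'."""
    migrated = dict(manifest)
    names = migrated.pop("files", None)
    # Keep an existing manifestFiles list untouched; otherwise build one from the names.
    if names and "manifestFiles" not in migrated:
        migrated["manifestFiles"] = [{"fileName": name} for name in names]
    return migrated
```

The input dictionary is left unchanged, so the original manifest can still be written out if needed.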
The command below uploads NEW_ORDERS 20140401 data to the "S3_MANAGED" DEV bucket and registers it with the herd Registry service.
java -jar dm-uploader-app.jar \
-a <accessKey> \
-p <secretKey> \
-e s3-external-1.amazonaws.com \
-l /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/ \
-m /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/manifest.json \
-V \
-H myHostname.us-east-1.elb.amazonaws.com \
-P 80 \
-s true \
-u <username> \
-w <password> \
-n myProxyHostname \
-o 80 \
-R 3 \
-D 60
- Make sure that the server where you run the Uploader can talk to the herd application server. That might require a new firewall rule to be set up.
- Depending on your environment, the uploader tool might need values for the HTTP proxy parameters (i.e. the -n and -o parameters) in order to communicate with AWS S3.