bigquery/src/main/java/com/google/cloud/hadoop/io/bigquery/CHANGES.txt

0.10.0 - 2016-11-07

  1. Update output configuration keys to conform to the format in
     BigQueryConfiguration and have BigQueryOutputConfiguration handle
     the output path resolution and configuration.


0.9.0 - 2016-11-04

  1. Added temporary-path resolution to BigQueryConfiguration to
     automatically default to the GCS-specified system bucket if neither the
     BigQuery-specific GCS bucket nor an explicit temporary GCS path for
     BigQuery is specified.
  2. Refactored the new IndirectBigQueryOutputFormat class hierarchy
     to require less boilerplate in format-specific implementations.
  3. Added support for federated BigQuery table outputs using the
     IndirectBigQueryOutputFormat.
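
  Illustration for item 1: a minimal sketch of the fallback order, not the
  connector's actual code. It uses a plain Map in place of a Hadoop
  Configuration. The class and method names are hypothetical, the
  'fs.gs.system.bucket' key name is an assumption about the GCS connector,
  and the path layout is only modeled on the export paths mentioned in the
  0.2.0 notes.

```java
import java.util.Map;

/** Illustrative only; names and path layout are assumptions, not connector code. */
class TempPathResolutionSketch {
  // Keys that appear elsewhere in these release notes.
  static final String TEMP_GCS_PATH_KEY = "mapred.bq.temp.gcs.path";
  static final String BQ_GCS_BUCKET_KEY = "mapred.bq.gcs.bucket";
  // Assumed name of the GCS connector's system-bucket key.
  static final String GCS_SYSTEM_BUCKET_KEY = "fs.gs.system.bucket";

  /** Resolves a temporary GCS path: explicit path, then BigQuery bucket, then system bucket. */
  static String resolveTemporaryPath(Map<String, String> conf, String jobId) {
    String explicitPath = conf.get(TEMP_GCS_PATH_KEY);
    if (explicitPath != null && !explicitPath.isEmpty()) {
      return explicitPath;
    }
    String bucket = conf.get(BQ_GCS_BUCKET_KEY);
    if (bucket == null || bucket.isEmpty()) {
      // Fall back to the system bucket configured for the GCS connector.
      bucket = conf.get(GCS_SYSTEM_BUCKET_KEY);
    }
    if (bucket == null || bucket.isEmpty()) {
      throw new IllegalStateException("No temporary path or bucket is configured");
    }
    // Hypothetical layout for the generated temporary directory.
    return "gs://" + bucket + "/hadoop/tmp/bigquery/" + jobId;
  }
}
```

  With only the system bucket set, resolveTemporaryPath(conf, jobId) in this
  sketch returns a path under that bucket; an explicit temporary path always
  wins.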


0.8.1 - 2016-10-10

  1. Misc minor refactoring.


0.8.0 - 2016-10-05

  1. Added new still-experimental versions of BigQuery output which buffer to
     GCS before issuing the BigQuery load jobs, rather than using temporary
     BigQuery tables.
  2. Updated Hadoop 2 dependency to 2.7.2.


0.7.9 - 2016-09-21

  1. Upgraded BigQuery library dependency to rev319.
  2. Added support for reading pre-existing federated Avro and JSON tables
     stored in Google Cloud Storage.
  3. Misc updates in credential acquisition on GCE.

0.7.8 - 2016-08-23

  1. Updated AbstractGoogleAsyncWriteChannel to always set the
     X-Goog-Upload-Desired-Chunk-Granularity header independently from the
     deprecated X-Goog-Upload-Max-Raw-Size; in general this improves
     performance of large uploads.


0.7.7 - 2016-07-29

  1. Misc updates in gcs-related library dependencies.


0.7.6 - 2016-07-15

  1. Changed RetryHttpInitializer to treat HTTP code 429 as a retriable
     error with exponential backoff instead of erroring out.
  2. Misc updates in gcs-related library dependencies.
  3. Added location support for temporary dataset, configured with the key
     'mapred.bq.output.location'. Default is 'US'.
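
  Illustration for item 1: a minimal sketch of treating HTTP 429 as retriable
  with exponential backoff. This is not RetryHttpInitializer's code; the
  class name, method names, and delay parameters are hypothetical, and
  treating 5XX responses as retriable is an assumption made here for
  completeness.

```java
/** Illustrative only; not the RetryHttpInitializer implementation. */
class RetryBackoffSketch {
  /** 429 (too many requests) is retried; retrying 5XX is an assumption of this sketch. */
  static boolean isRetriable(int statusCode) {
    return statusCode == 429 || (statusCode >= 500 && statusCode < 600);
  }

  /**
   * Delay before retry number {@code attempt} (0-based): the initial delay
   * doubled once per earlier attempt and capped at {@code maxMillis}.
   */
  static long backoffMillis(int attempt, long initialMillis, long maxMillis) {
    long delay = initialMillis;
    for (int i = 0; i < attempt && delay < maxMillis; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxMillis);
  }
}
```

  A caller would sleep for backoffMillis(attempt, ...) between attempts while
  isRetriable(status) holds, and give up after a fixed number of attempts.
  Production code usually adds random jitter to the delay, which is omitted
  here to keep the sketch deterministic.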


0.7.5 - 2016-03-24

  1. Misc updates in gcs-related library dependencies.


0.7.4 - 2016-02-02

  1. Fixes handling of sharded Avro splits where Avro files contain more than a
     single file split.


0.7.3 - 2015-11-12

  1. Fixes handling of 'unsharded' BigQuery exports.
  2. Shades Apache HTTP components that conflict with Hadoop dependencies.
  3. Misc updates in error detection logic.


0.7.2 - 2015-09-15

  1. Misc updates in gcs-related library dependencies.


0.7.1 - 2015-07-09

  1. Misc updates in library dependencies; google.api.version
     (com.google.http-client, com.google.api-client) updated from 1.19.0 to
     1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
     v1-rev35-1.20.0, and google-api-services-bigquery from v2-rev171-1.19.0
     to v2-rev217-1.20.0, and Guava from 17.0 to 18.0.


0.7.0 - 2015-05-27

  1. All logging converted to use slf4j instead of the previous
     org.apache.commons.logging.Log; removed the LogUtil wrapper which
     previously wrapped org.apache.commons.logging.Log.
  2. Added exponential-backoff automatic retries in waitForJobCompletion.
  3. Added support for running multiple concurrent jobs writing to the same
     output dataset by including the JobID as part of the temporary datasetId.


0.6.0 - 2015-02-26

  1. Fixed a bug in BigQueryOutputFormat which could occasionally cause extraneous
     records to be written due to low-level retries not being de-duplicated on the
     server side; temporary tables are now written with WRITE_TRUNCATE instead of
     WRITE_APPEND to get around the lack of de-duplication of low-level retries.
  2. Removed the BigQueryBatchedWriteChannel, which was previously the default
     because mapred.bq.output.async.write.enabled defaulted to 'false'. The
     key is now ignored and BigQueryAsyncWriteChannel is always used, as if
     the key were 'true'. The batched channel was fundamentally vulnerable to
     low-level retries causing extraneous records to be written. The async
     channel also improves performance and reduces the number of "load" jobs
     over the course of a mapreduce; that number is now equal to the number
     of reduce tasks, rather than the total output_size divided by the
     batch_size.
  3. All BigQuery job insertions now supply client-generated job_ids and
     handle 409 conflict as a duplicate job caused by low-level retries.
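
  Illustration for item 3: a minimal sketch of why client-generated job_ids
  make job insertion safe to retry. FakeJobService is a hypothetical in-memory
  stand-in for the BigQuery service, and all names below are illustrative
  rather than the connector's API. The stand-in rejects a repeated job_id
  with 409, and the caller treats that 409 as "the earlier attempt already
  created this job".

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Illustrative only; not the connector's job-insertion code. */
class JobIdDedupSketch {
  /** Hypothetical stand-in for the server: a job_id can be inserted only once. */
  static class FakeJobService {
    private final Set<String> jobIds = new HashSet<>();

    /** Returns 200 on first insert of a job_id and 409 on any repeat. */
    int insert(String jobId) {
      return jobIds.add(jobId) ? 200 : 409;
    }

    int jobCount() {
      return jobIds.size();
    }
  }

  /** The client picks the job_id up front so that a retry reuses the same id. */
  static String newJobId(String prefix) {
    return prefix + "_" + UUID.randomUUID();
  }

  /** Returns true if the job exists on the server after this call. */
  static boolean insertIdempotently(FakeJobService service, String jobId) {
    int status = service.insert(jobId);
    // 409 means a previous (possibly retried) request already created the job.
    return status == 200 || status == 409;
  }
}
```

  Calling insertIdempotently twice with the same job_id succeeds both times
  but leaves exactly one job on the stand-in service, which is the property
  that prevents duplicate loads.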


0.5.1 - 2015-01-22

  1. Added enforcement of maximum number of export shards (currently 500)
     when calculating splits for BigQueryInputFormat.
  2. Fixed a bug where BigQueryOutputCommitter.needsTaskCommit() incorrectly
     depended on a Bigquery.Tables.list() call; listing tables suffers
     "eventual consistency", so occasionally a task would erroneously
     fail to commit data.
  3. Removed extraneous table-deletion in BigQueryOutputCommitter.abortTask();
     cleanup occurs during job cleanup anyway, and this would incorrectly
     (but harmlessly) try to delete a nonexistent table for map tasks.


0.5.0 - 2014-12-16

  1. BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better
     reflect its nature as a gson-based format. A forwarding declaration
     was left in place to maintain compatibility.
  2. JsonTextBigQueryInputFormat was added to provide lines of JSON text as
     they appear in the BigQuery export.
  3. When using sharded BigQuery exports (the default), the keys will no
     longer be in increasing order per mapper. Instead, the keys will be
     as they are reported by the delegate RecordReader which is generally
     going to be the byte position within the current file. However, the
     sharded export creates many files per mapper so this position will
     appear to reset to 0 when we switch between files. The record reader's
     getProgress() will still report progress across the entire dataset that
     the record reader is responsible for.
  4. The BigQuery connector can now ingest Avro-based BigQuery exports. Using
     an Avro-based export should result in less data transferred between your
     MapReduce job and Google Cloud Storage and should require less CPU time
     to parse the data files. To use Avro, set the input format to
     AvroBigQueryInputFormat and update your map code to expect LongWritable
     keys and Avro GenericData.Record values.
  5. Hadoop 2 support was added for Java MapReduce. Streaming support for
     Hadoop 2 will be included in a future release.


0.4.5 - 2014-10-17

  1. Attempting to acquire an OAuth access token will now be retried when
     using .p12 or installed application (JWT) credentials if there is a
     recoverable error such as an HTTP 5XX response code or an IOException.


0.4.4 - 2014-09-18

  1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
     the existing hadoop.mapreduce.* implementations and delegating
     appropriately. This enables backwards-compatibility for some stacks which
     depend on the "old api" interfaces, including now being able to use
     the standard "hadoop-streaming.jar" to run binary mappers/reducers with
     the BigQuery connector. Note that in the absence of a blocking driver
     program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
     clean up the temporary exported files after a hadoop-streaming job if
     using the input connector. Extra cleanup is not necessary if only using
     the output connector in hadoop-streaming. The new top-level classes:

       com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
       com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat

     See the javadocs for the associated RecordReader/Writer, InputSplit, and
     OutputCommitter classes.


0.4.3 - 2014-08-07

  1. Added better validation to BigQueryUtils.getSchemaFromString used by
     BigQueryOutputFormat to throw descriptive IllegalArgumentExceptions
     instead of NullPointerExceptions for most types of malformed schemas.
  2. Fixed a bug in BigQueryUtils.getSchemaFromString to support 'repeated'
     fields inside of nested records; used to throw IllegalStateException
     if a nested record contained more than 1 inner field.


0.4.2 - 2014-06-05

  1. Misc updates in library dependencies.


0.4.1 - 2014-05-08

  1. Removed mapred.bq.output.num.records.batch in favor of
     mapred.bq.output.buffer.size.
  2. Misc updates in library dependencies.


0.4.0 - 2014-04-09

  1. Preview release of BigQuery connector.
  2. Added support for different projectIds owning the input/output tables in
     BigQuery vs the projectId performing the BigQuery jobs. Necessary for
     reading public tables like publicdata:samples.shakespeare. Distinguished by
     mapred.bq.input.project.id and mapred.bq.output.project.id.
  3. Deprecated mapred.bq.input.query and modified the WordCount sample to
     eliminate usage of the query.
  4. Added a new BigQueryOutputFormat mode which uses resumable uploads; can be
     enabled with mapred.bq.output.async.write.enabled.
  5. Jar file renamed to bigquery-connector-<version>.jar.


0.3.0 - 2014-03-21

  1. Added CHANGES.txt for release notes to be included in connector jarfile.
  2. Fixed a bug where different task attempts could collide (either through
     retries or speculative execution) by using full TaskAttemptID for temp
     tableIds in BigQueryOutputFormat.
  3. Expanded debug logging, especially in the OutputFormat and helpers.
  4. Modified BigQueryOutputCommitter to correctly use the CopyTable API and
     avoid incurring "query" costs in the output path.
  5. Eliminated the need to call BigQueryInputFormat.setInputs(Job) in main
     class; this is now automatically set if necessary inside getSplits.
  6. The field 'mapred.bq.temp.gcs.path' is now optional; auto-generated based
     on JobID in BigQueryInputFormat.getSplits from 'mapred.bq.gcs.bucket' if
     unspecified. No longer set in BigQueryConfiguration.configureBigQueryInput.
  7. Fixed BigQueryInputFormat to only delete the input BigQuery table if a
     query was actually run AND mapred.bq.query.results.table.delete is true;
     changed the latter's default value from 'true' to 'false'.
  8. Fixed a bug where JsonRecordReader.initialize was getting called twice,
     which resulted in extraneous GCS API calls.
  9. Implemented a new "sharded" export mode for the BigQueryInputFormat which
     allows the MapReduce to progress concurrently with the BigQuery export;
     RecordReaders will read files as they become available, or block if more
     files are expected but are not yet available. Toggled on/off using
     "mapred.bq.input.sharded.export.enable" = [true|false]. The number of
     export directories is based on "mapred.map.tasks", but may automatically
     choose a smaller number of shards if the BigQuery table is small.
     This mode is now the default export mode for BigQueryInputFormat.
  10. Changed credential configuration values to allow per-connector overrides.
     To use the metadata service, no extra configuration is required. To use a
     PKCS12 private key file, specify "mapred.bq.auth.service.account.email"
     and "mapred.bq.auth.service.account.keyfile". To use the installed app
     workflow, set "mapred.bq.service.account.enable" to "false" and
     "mapred.bq.auth.client.id", "mapred.bq.auth.client.secret" and
     "mapred.bq.auth.client.file" to appropriate values. Both PKCS12 and
     installed app workflows require the credential file to exist on all
     nodes and at the same path.
  11. Removed cleanup of things related to the BigQueryInputFormat from the
     BigQueryOutputCommitter so that Input/Output formats operate independently.
     Now, the main class must call BigQueryInputFormat.cleanupJob(JobContext)
     at the end of a job explicitly.


0.2.0 - 2014-02-12

  1. Compiled connector examples and scripts for running the sample mapreduce
     now available in bq-toolkit-0.2.0.tar.gz.
  2. Added low-level retries for transient server errors.
  3. Improved debug logging.
  4. Fixed a bug where the connector failed to perform the BigQuery export if
     DEFAULT_FS was set to 'hdfs'.
  5. GCS export paths no longer contain the jobIdentifier, and instead are
     written to gs://<bucket>/hadoop/tmp/bigquery/data-<datetime string>/data-*.json.