/
CHANGES.txt
274 lines (200 loc) · 12 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
0.10.0 - 2016-11-07
1. Update output configuration keys to conform to the format in
BigQueryConfiguration and have BigQueryOutputConfiguration handle
the output path resolution and configuration.
0.9.0 - 2016-11-04
1. Added temporary-path resolution to BigQueryConfiguration to
automatically default to the GCS-specified system bucket if the BigQuery
specific GCS bucket isn't specified and an explicit temporary GCS
path for BigQuery isn't specified.
2. Refactored the new IndirectBigQueryOutputFormat class hierarchy
to require less boilerplate in format-specific implementations.
3. Added support for federated BigQuery table outputs using the
IndirectBigQueryOutputFormat.
0.8.1 - 2016-10-10
1. Misc minor refactoring.
0.8.0 - 2016-10-05
1. Added new still-experimental versions of BigQuery output which buffer to
GCS before issuing the BigQuery load jobs, rather than using temporary
BigQuery tables.
2. Updated Hadoop 2 dependency to 2.7.2.
0.7.9 - 2016-09-21
1. Upgraded BigQuery library dependency to rev319.
2. Added support for reading pre-existing federated Avro and JSON tables
stored ing Google Cloud Storage.
3. Misc updates in credential acquisition on GCE.
0.7.8 - 2016-08-23
1. Updated AbstractGoogleAsyncWriteChannel to always set the
X-Goog-Upload-Desired-Chunk-Granularity header independently from the
deprecated X-Goog-Upload-Max-Raw-Size; in general this improves
performance of large uploads.
0.7.7 - 2016-07-29
1. Misc updates in gcs-related library dependencies.
0.7.6 - 2016-07-15
1. Changed RetryHttpInitializer to treat HTTP code 429 as a retriable
error with exponential backoff instead of erroring out.
2. Misc updates in gcs-related library dependencies.
3. Added location support for temporary dataset, configured with the key
'mapred.bq.output.location'. Default is 'US'.
0.7.5 - 2016-03-24
1. Misc updates in gcs-related library dependencies.
0.7.4 - 2016-02-02
1. Fixes handling of sharded avro splits where avro files contain more than a
single file split.
0.7.3 - 2015-11-12
1. Fixes handling of 'unsharded' BigQuery exports.
2. Shades Apache HTTP components that conflict with Hadoop dependencies.
3. Misc updates in error detection logic.
0.7.2 - 2015-09-15
1. Misc updates in gcs-related library dependencies.
0.7.1 - 2015-07-09
1. Misc updates in library dependencies; google.api.version
(com.google.http-client, com.google.api-client) updated from 1.19.0 to
1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
v1-rev35-1.20.0, and google-api-services-bigquery from v2-rev171-1.19.0
to v2-rev217-1.20.0, and Guava from 17.0 to 18.0.
0.7.0 - 2015-05-27
1. All logging converted to use slf4j instead of the previous
org.apache.commons.logging.Log; removed the LogUtil wrapper which
previously wrapped org.apache.commons.logging.Log.
2. Added exponential-backoff automatic retries in waitForJobCompletion.
3. Added support for running multiple concurrent jobs writing to the same
output dataset by including the JobID as part of the temporary datasetId.
0.6.0 - 2015-02-26
1. Fixed a bug in BigQueryOutputFormat which can occasionally cause extraneous
records to be written due to low-level retries not be de-duplicated on the
server side; temporary tables are now written with WRITE_TRUNCATE intead of
WRITE_APPEND to get around the lack of de-duplication of low-level retries.
2. Removed the BigQueryBatchedWriteChannel, which was previously the default
behavior and controlled by mapred.bq.output.async.write.enabled defaulting
to 'false'; default is now 'true' and the key is ignored; the batched
channel is fundamentally vulnerable to low-level retries causing extraneous
records to be written. BigQueryAsyncWriteChannel is now always used, which
improves performance and reduces the number of "load" jobs over the course
of a mapreduce; it is now equal to the number of reduce tasks, rather than
being equal to the total output_size divided by the batch_size.
3. All BigQuery job insertions now supply client-generated job_ids and
handle 409 conflict as a duplicate job caused by low-level retries.
0.5.1 - 2015-01-22
1. Added enforcement of maximum number of export shards (currently 500)
when calculating splits for BigQueryIntputFormat.
2. Fixed a bug where BigQueryOutputCommitter.needsTaskCommit() incorrectly
depended on a Bigquery.Tables.list() call; listing tables suffers
"eventual consistency", so occasionally a task would erroneously
fail to commit data.
3. Removed extraneous table-deletion in BigQueryOutputCommitter.abortTask();
cleanup occurs during job cleanup anyways, and this would incorrectly
(but harmlessly) try to delete a nonexistent table for map tasks.
0.5.0 - 2014-12-16
1. BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better
reflect its nature as a gson-based format. A forwarding declaration
was left in place to maintain compatibility.
2. JsonTextBigQueryInputFormat was added to provide lines of JSON text as
they appear in the BigQuery export.
3. When using sharded BigQuery exports (the default), the keys will no
longer be in increasing order per mapper. Instead, the keys will be
as they are reported by the delegate RecordReader which is generally
going to be the byte position within the current file. However, the
sharded export creates many files per mapper so this position will
appear to reset to 0 when we switch between files. The record reader's
getProgress() will still report progress across the entire dataset that
the record reader is responsible for.
4. The BigQuery connector can now ingest Avro based BigQuery exports. Using
and Avro-based export should result in less data transferred between your
MapReduce job and Google Cloud Storage and should require less CPU time
to parse the data files. To use Avro, set the input format to
AvroBigQueryInputFormat and update your map code to expect LongWritable
keys and Avro GenericData.Record values.
5. Hadoop 2 support was added for java MapReduce. Streaming support for
Hadoop 2 will be included in a future release.
0.4.5 - 2014-10-17
1. Attempting to acquire an OAuth access token will be now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
0.4.4 - 2014-09-18
1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
the existing hadoop.mapreduce.* implementations and delegating
appropriately. This enables backwards-compatability for some stacks which
depend on the "old api" interfaces, including now being able to use
the standard "hadoop-streaming.jar" to run binary mappers/reducers with
the BigQuery connector. Note that in the absence of a blocking driver
program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
clean up the temporary exported files after a hadoop-streaming job if
using the input connector. Extra cleanup is not necessary if only using
the output connector in hadoop-streaming. The new top-level classes:
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat
See the javadocs for the associated RecordReader/Writer, InputSplit, and
OutputCommitter classes.
0.4.3 - 2014-08-07
1. Added better validation to BigQueryUtils.getSchemaFromString used by
BigQueryOutputFormat to throw descriptive IllegalArgumentExceptions
instead of NullPointerExceptions for most types of malformed schemas.
2. Fixed a bug in BigQueryUtils.getSchemaFromString to support 'repeated'
fields inside of nested records; used to throw IllegalStateException
if a nested record contained more than 1 inner field.
0.4.2 - 2014-06-05
1. Misc updates in library dependencies.
0.4.1 - 2014-05-08
1. Removed mapred.bq.output.num.records.batch in favor of
mapred.bq.output.buffer.size.
2. Misc updates in library dependencies.
0.4.0 - 2014-04-09
1. Preview release of BigQuery connector.
2. Added support for different projectIds owning the input/output tables in
BigQuery vs the projectId performing the BigQuery jobs. Necessary for
reading public tables like publicdata:samples.shakespeare. Distinguished by
mapred.bq.input.project.id and mapred.bq.output.project.id.
3. Deprecated mapred.bq.input.query and modified the WordCount sample to
eliminate usage of the query.
4. Added a new BigQueryOutputFormat mode which uses resumable uploads; can be
enabled with mapred.bq.output.async.write.enabled.
5. Jar file renamed to bigquery-connector-<version>jar.
0.3.0 - 2014-03-21
1. Added CHANGES.txt for release notes to be included in connector jarfile.
2. Fixed a bug where different task attempts could collide (either through
retries or speculative execution) by using full TaskAttemptID for temp
tableIds in BigQueryOutputFormat.
3. Expanded debug logging, especially in the OutputFormat and helpers.
4. Modified BigQueryOutputCommitter to correctly use the CopyTable API and
avoid incurring "query" costs in the output path.
5. Eliminated the need to call BigQueryInputFormat.setInputs(Job) in main
class; this is now automatically set if necessary inside getSplits.
6. The field 'mapred.bq.temp.gcs.path' is now optional; auto-generated based
on JobID in BigQueryInputFormat.getSplits from 'mapred.bq.gcs.bucket' if
unspecified. No longer set in BigQueryConfiguration.configureBigQueryInput.
7. Fixed BigQueryInputFormat to only delete the input BigQuery table if a
query was actually run AND mapred.bq.query.results.table.delete is true;
changed the latter's default value from 'true' to 'false'.
8. Fixed a bug where JsonRecordReader.initialize was getting called twice,
which resulted in extraneous GCS API calls.
9. Implemented a new "sharded" export mode for the BigQueryInputFormat which
allows the MapReduce to progress concurrently with the BigQuery export;
RecordReaders will read files as they become available, or block if more
files are expected but are not yet available. Toggled on/off using
"mapred.bq.input.sharded.export.enable" = [true|false]. The number of
export directories is based on "mapred.map.tasks", but may automatically
choose a smaller number of shards if the BigQuery table is small.
This mode is now the default export mode for BigQueryInputFormat.
10.Changed credential configuration values to allow per connector overrides.
To use the metadata service, no extra configuration is required. To a use
PKCS12 private key file, specify "mapred.bq.auth.service.account.email"
and "mapred.bq.auth.service.account.keyfile". To use the installed app
workflow, set "mapred.bq.service.account.enable" to "false" and
"mapred.bq.auth.client.id", "mapred.bq.auth.client.secret" and
"mapred.bq.auth.client.file" to appropriate values. Both PKCS12 and
installed app workflows require the credential file to exist on all
nodes and at the same path.
11.Removed cleanup of things related to the BigQueryInputFormat from the
BigQueryOutputCommitter so that Input/Output formats operate independently.
Now, the main class must call BigQueryInputFormat.cleanupJob(JobContext)
at the end of a job explicitly.
0.2.0 - 2014-02-12
1. Compiled connector examples and scripts for running the sample mapreduce
now available in bq-toolkit-0.2.0.tar.gz.
2. Added low-level retries for transient server errors.
3. Improved debug logging.
4. Fixed a bug where the connector failed to perform the BigQuery export if
DEFAULT_FS was set to ‘hdfs’.
5. GCS export paths no longer contain the jobIdentifier, and instead are
written to gs://<bucket>/hadoop/tmp/bigquery/data-<datetime string>/data-*.json.