
Commit

prepare release 0.17.0
davidrabinowitz committed Jul 21, 2020
1 parent fb5e3d7 commit 3b61d8d
Showing 3 changed files with 32 additions and 14 deletions.
12 changes: 12 additions & 0 deletions CHANGES.md
@@ -1,5 +1,17 @@
# Release Notes

## 0.17.0 - 2020-07-15
* PR #201: [Structured streaming write](http://spark.apache.org/docs/2.4.5/structured-streaming-programming-guide.html#starting-streaming-queries)
is now supported (thanks @varundhussa)
* PR #202: Users now have the option to keep the data on GCS after writing to BigQuery (thanks @leoneuwald)
* PR #211: Enabled overwriting the data of a single date partition (see the sketch below)
* PR #198: Supporting columnar batch reads from Spark in the DataSource V2 implementation. **It is not ready for production use.**
* PR #192: Supporting `MATERIALIZED_VIEW` as table type
* Issue #197: Conditions on StructType fields are now handled by Spark and not the connector
* BigQuery API has been upgraded to version 1.116.3
* BigQuery Storage API has been upgraded to version 1.0.0
* Netty has been upgraded to version 4.1.48.Final (Fixing issue #200)
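
A minimal sketch of the single-partition overwrite from PR #211, assuming the `datePartition` option documented in the README; the `YYYYMMDD` format, `df`, bucket, and table names below are illustrative placeholders, not taken from this commit:

```scala
// Hedged sketch: overwrite only the 2020-07-15 partition of a date-partitioned table.
// Option names come from the README's properties table; all values are placeholders.
df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "some-bucket") // staging bucket for the indirect write
  .option("datePartition", "20200715")         // assumed YYYYMMDD format
  .mode("overwrite")                           // replaces only the targeted partition
  .save("dataset.table")
```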

## 0.16.1 - 2020-06-11
* PR #186: Fixed SparkBigQueryConnectorUserAgentProvider initialization bug

32 changes: 19 additions & 13 deletions README.md
@@ -76,8 +76,8 @@ repository. It can be used using the `--packages` option or the

| Scala version | Connector Artifact |
| --- | --- |
| Scala 2.11 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1` |
| Scala 2.12 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.16.1` |
| Scala 2.11 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0` |
| Scala 2.12 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.17.0` |

## Hello World Example

@@ -136,7 +136,10 @@ df.write
```
  .save("dataset.table")
```

When writing a streaming DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame. Note that a HDFS compatible [checkpoint location](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) (eg: path/to/HDFS/dir or gs://checkpointBucket/checkpointDir) must be specified.
When streaming a DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame.
Note that an HDFS-compatible
[checkpoint location](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
(e.g. `path/to/HDFS/dir` or `gs://checkpoint-bucket/checkpointDir`) must be specified.

```
df.writeStream
@@ -146,7 +149,7 @@ df.writeStream
.option("table", "dataset.table")
```
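
A fuller, hedged sketch of the streaming write outlined above; the lines collapsed in this diff are filled in here as assumptions, and the checkpoint and staging bucket names are placeholders:

```scala
// Sketch of a structured-streaming write to BigQuery. The format, checkpoint
// location, and staging bucket below are illustrative, not copied from this diff.
val query = df.writeStream
  .format("bigquery")
  .option("checkpointLocation", "gs://checkpoint-bucket/checkpointDir") // HDFS-compatible, required
  .option("temporaryGcsBucket", "some-bucket")                          // GCS staging for each batch
  .option("table", "dataset.table")
  .start()
```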

*Important:* The connector does not configure the GCS connector, in order to avoid conflict with another GCS connector, if exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).
**Important:** The connector does not configure the GCS connector itself, in order to avoid conflicts with any other GCS connector that may already be present. To use the write capabilities of the connector, please configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).
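
A hedged sketch of what that cluster-side setup can look like when the GCS connector is not already provided (Dataproc clusters ship with it preinstalled); the Hadoop property and class names below are the GCS connector's usual ones and should be verified against the linked documentation:

```scala
// Illustrative only: register the GCS connector's filesystem implementations on
// the session's Hadoop configuration. The gcs-connector jar itself must already
// be on the cluster's classpath, as described in the linked guide.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
```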

### Properties

@@ -220,15 +223,15 @@ The API Supports a number of options to configure the read
<td>Read</td>
</tr>
<tr valign="top">
<td><code>viewMaterializationProject</code>
<td><code>materializationProject</code>
</td>
<td>The project id where the materialized view is going to be created
<br/>(Optional. Defaults to view's project id)
</td>
<td>Read</td>
</tr>
<tr valign="top">
<td><code>viewMaterializationDataset</code>
<td><code>materializationDataset</code>
</td>
<td>The dataset where the materialized view is going to be created
<br/>(Optional. Defaults to the view's dataset; see the sketch after this table)
@@ -311,7 +314,6 @@ The API Supports a number of options to configure the read
</td>
<td>Write</td>
</tr>
<!--
<tr valign="top">
<td><code>datePartition</code>
</td>
@@ -325,7 +327,6 @@ The API Supports a number of options to configure the read
</td>
<td>Write</td>
</tr>
-->
<tr valign="top">
<td><code>partitionField</code>
</td>
@@ -385,6 +386,11 @@ The API Supports a number of options to configure the read
</tr>
</table>
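
A hedged sketch of a read that uses the renamed materialization options above, as referenced in the table; enabling view reads and the project, dataset, and view names here are assumptions, not part of this diff:

```scala
// Illustrative read of a BigQuery view: the connector materializes the view into
// the given project and dataset before reading. "viewsEnabled" and the names
// below are assumptions relative to the lines shown in this diff.
val viewDf = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationProject", "my-project") // defaults to the view's project id
  .option("materializationDataset", "my_dataset") // defaults to the view's dataset
  .load("dataset.view_name")
```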

Options can also be set outside the code, using the `--conf` parameter of `spark-submit` or the `--properties` parameter
of `gcloud dataproc jobs submit spark`. To use this, prepend the prefix `spark.datasource.bigquery.` to any of
the options; for example, `spark.conf.set("temporaryGcsBucket", "some-bucket")` can also be set as
`--conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket`.
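
The same setting expressed both ways, as a quick sketch (the bucket name is the placeholder used above):

```scala
// Inside the code the option name is used as-is:
spark.conf.set("temporaryGcsBucket", "some-bucket")

// Outside the code the spark.datasource.bigquery. prefix is added, e.g.:
//   spark-submit --conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket ...
//   gcloud dataproc jobs submit spark \
//     --properties spark.datasource.bigquery.temporaryGcsBucket=some-bucket ...
```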

### Data types

With the exception of `DATETIME` and `TIME`, all BigQuery data types map directly into the corresponding Spark SQL data type. Here are all of the mappings:
@@ -579,7 +585,7 @@ using the following code:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1")\
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0")\
.getOrCreate()
df = spark.read.format("bigquery")\
.load("dataset.table")
```
@@ -588,15 +594,15 @@ df = spark.read.format("bigquery")\
**Scala:**
```scala
val spark = SparkSession.builder
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1")
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0")
.getOrCreate()
val df = spark.read.format("bigquery")
.load("dataset.table")
```

If the Spark cluster is using Scala 2.12 (it's optional for Spark 2.4.x,
mandatory in 3.0.x), then the relevant package is
com.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:0.16.1. In
com.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:0.17.0. In
order to know which Scala version is used, please run the following code:

**Python:**
@@ -620,14 +626,14 @@ To include the connector in your project:
```xml
<dependency>
<groupId>com.google.cloud.spark</groupId>
<artifactId>spark-bigquery-with-dependencies_${scala.version}</artifactId>
<version>0.16.1</version>
<version>0.17.0</version>
</dependency>
```

### SBT

```sbt
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.16.1"
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.17.0"
```

## Building the Connector
2 changes: 1 addition & 1 deletion build.sbt
@@ -24,7 +24,7 @@ lazy val nettyTcnativeVersion = "2.0.29.Final"

lazy val commonSettings = Seq(
organization := "com.google.cloud.spark",
version := "0.16.2-SNAPSHOT",
version := "0.17.0",
scalaVersion := scala211Version,
crossScalaVersions := Seq(scala211Version, scala212Version)
)
