
External hive support in SnappySession #1220

Merged
merged 73 commits into master from external-hive-support
Jul 27, 2019

Conversation

@sumwale (Contributor) commented Dec 17, 2018

Changes proposed in this pull request

This adds support for the following components of Spark's hive session:

  1. A catalog that reads from the external hive metastore using an extra hive-enabled SparkSession.
  2. HiveSessionState from the hive-enabled SparkSession, which adds additional resolution rules
    and strategies for such hive-managed tables.
  3. Parser changes to delegate to the Spark parser for Hive DDL extensions. A special
    "CREATE TABLE ... USING hive" form is allowed that explicitly marks the table as using the
    hive provider (see the sketch after this list).

There are two user-level properties (see the sketch after this list):

  • The standard "spark.sql.catalogImplementation": when set to "hive", the external hive
    metastore is consulted in addition to the builtin catalog. The builtin catalog is searched
    first and the external one after, so on a name clash the builtin entry takes precedence.
    For writes, all tables using "hive" as the provider go to the external hive metastore
    while the rest use the builtin catalog.
  • "snappydata.sql.hiveCompatibility", which can be set to default/spark/full. When set to
    "spark" or "full", the default behaviour of "CREATE TABLE ..." without a USING provider,
    as well as any Hive DDL extensions, changes to create a hive table instead of a row table.
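A hedged sketch of the second property in action, reusing the hypothetical `snappy` session from the earlier sketch (the table name is illustrative; note that the commit list below later renames this property to snappydata.sql.hive.compatibility):

```scala
// Raise the compatibility level for this session.
snappy.sql("SET snappydata.sql.hiveCompatibility=spark")
// With "spark" (or "full") compatibility, a provider-less CREATE TABLE now
// creates a hive table instead of the usual row table.
snappy.sql("CREATE TABLE t1 (id INT)")
```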

A lazily instantiated hive-enabled SparkSession is kept inside SnappySessionState and is
consulted only when "spark.sql.catalogImplementation" is set to "hive" for the session.
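A minimal sketch of that lazy holder, with hypothetical names (the actual SnappySessionState members may differ, and hive classes must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

class HiveSessionHolder(catalogImpl: () => String) {
  // Built only on first use, so sessions that never enable hive pay no cost.
  private lazy val hiveSession: SparkSession =
    SparkSession.builder().enableHiveSupport().getOrCreate()

  // Referred to only when the session-level catalog implementation is "hive".
  def hiveSessionIfEnabled: Option[SparkSession] =
    if (catalogImpl() == "hive") Some(hiveSession) else None
}
```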

For 1), the list/get/create methods in SnappySessionCatalog have been overridden to read from
and write to the hive catalog after the snappy catalog when hive support is enabled on the session.
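A self-contained toy model of that lookup and write order (all names here are hypothetical; the real code overrides methods on SnappySessionCatalog):

```scala
case class TableId(schema: String, table: String)

// Maps stand in for the two catalogs; values are the table's provider.
class UnionedCatalog(
    builtin: Map[TableId, String],
    externalHive: Map[TableId, String],
    hiveEnabled: Boolean) {

  // Reads: builtin is searched first, so it wins on a name clash.
  def lookup(id: TableId): Option[String] =
    builtin.get(id).orElse(if (hiveEnabled) externalHive.get(id) else None)

  // Writes: only tables with the "hive" provider go to the external metastore.
  def writeTarget(provider: String): String =
    if (hiveEnabled && provider == "hive") "external-hive" else "builtin"
}
```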

For 2), wrapper Rule/Strategy classes have been added that wrap the extra rules/strategies from
the hive session and run them only if the property has been enabled on the SnappySession.
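A sketch of the wrapper idea for the rule side; the class name and the exact conditional check are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Applies the wrapped hive rule only when the session has the hive catalog
// enabled; otherwise the plan passes through unchanged.
class HiveConditionalRule(wrapped: Rule[LogicalPlan], session: SparkSession)
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (session.conf.get("spark.sql.catalogImplementation", "in-memory") == "hive") {
      wrapped(plan) // delegate to the rule taken from the hive session
    } else plan
}
```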

The code temporarily switches to the hive-enabled SparkSession when running hive
rules/strategies, some of which expect the internal sharedState/sessionState to be those of hive.
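The assumed shape of that temporary switch, as a small helper (not the exact PR code):

```scala
import org.apache.spark.sql.SparkSession

// Make the hive-enabled session the active one while the body runs, and
// always restore the original session afterwards, even on failure.
def withActiveSession[T](temp: SparkSession, original: SparkSession)(body: => T): T = {
  SparkSession.setActiveSession(temp)
  try body
  finally SparkSession.setActiveSession(original)
}
```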

Patch testing

precheckin and manual testing

ReleaseNotes.txt changes

Documentation for the new property and what it provides for users.

Other PRs

TIBCOSoftware/snappy-store#499
TIBCOSoftware/snappy-spark#164
https://github.com/SnappyDataInc/snappy-aqp/pull/178

Sumedh Wale added 22 commits November 28, 2018 22:07
The koloboke project has been dead and unmaintained for a couple of years now, so it has been
replaced with eclipse collections, though the latter are a bit slower for some operations
and also add significant bulk (~10M).
- also added an implicit retry for catalog stale exceptions in queries
- invalidate the entire connector cache on a create/drop/alter since the versions
  stored for other relations in RelationInfo will certainly be stale too
allow for the absence of baseTable in an external catalog table drop since it can be a temporary table
Sumedh Wale added 6 commits December 17, 2018 21:16
Allow for "gemfire" data source to make a catalog entry during create table execution
in its createRelation itself. It needs the creation to add new parameters to the options bag.

Fixed dependent handling to avoid duplicates.
Sumedh Wale added 6 commits July 20, 2019 19:54
Instead of changing the sessionState/sharedState inside SnappySession, switch the existing
active session to the SparkSession. This also fixes a failure in InsertIntoHiveTable that was
due to the state inside SnappySession having been switched back when makeCopy
of that plan is invoked.
@sumwale (Contributor, Author) commented Jul 22, 2019

Do you plan to add any unit tests, or is this mostly covered by hydra functional tests?

Porting the hive test suites from Spark to use SnappySession with external hive enabled. Hive-compatible DDL support as present in SparkSession has also been added to SnappySession in this PR. An additional property has been added to use the hive provider as the default when no provider is given in a CREATE TABLE (which creates a ROW table otherwise). The convention in CREATE TABLE is to use the external hive catalog for the hive provider and the in-built catalog otherwise (see the sketch below), so all of the hive suites from Spark should work as is.
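A hedged illustration of that convention, reusing the hypothetical `snappy` session from the earlier sketches (table names are illustrative):

```scala
// Provider-less CREATE TABLE: a builtin ROW table at the default
// compatibility level.
snappy.sql("CREATE TABLE plain_t (id INT)")

// PARTITIONED BY / STORED AS are hive DDL extensions, so the provider is
// taken as hive and the table lands in the external hive catalog.
snappy.sql(
  "CREATE TABLE part_t (id INT) PARTITIONED BY (d STRING) STORED AS PARQUET")
```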

Sumedh Wale added 18 commits July 22, 2019 16:32
instead rename hiveCompatible to snappydata.sql.hive.compatibility and use a tri-state
(default, spark, hive) to denote the level of compatibility to use. Specifically,
the spark and hive levels will use 'hive' as the default provider.
if spark.sql.sources.default is explicitly set then use the same in the SQL parser,
with the default as 'row' like before
instead honour Spark's "spark.sql.catalogImplementation" itself to make the
configuration identical to Spark's, with the difference that the "hive" implementation
in SnappySession actually refers to the union of the builtin and external catalogs

fixed a few precheckin failures
ExpressionSQLBuilderSuite -> SnappyExpressionSQLBuilderSuite
…wise not allowed by Spark

other fixes and cleanups
if hive-specific extensions are present in CREATE TABLE then always assume the provider
to be "hive" and pass the statement to the Spark parser
make the behaviour of "drop schema" and "drop database" identical: drop from both the
builtin and external catalogs, since "create schema" is identical to "create database";
also cleaned up the current schema/database setup
also improved CommandLineToolsSuite to not print failed output to the screen
@sumwale merged commit 542404c into master Jul 27, 2019
@sumwale deleted the external-hive-support branch July 27, 2019 04:57