
External hive support in SnappySession #1220

Merged
merged 73 commits into master from external-hive-support
Jul 27, 2019

Conversation

@sumwale (Contributor) commented Dec 17, 2018

Changes proposed in this pull request

This adds support for the following components of Spark's hive session:

  1. A catalog that reads from the external hive metastore using an extra hive-enabled SparkSession.
  2. HiveSessionState from the hive-enabled SparkSession, which adds additional resolution rules
    and strategies for such hive-managed tables.
  3. Parser changes to delegate to the Spark parser for Hive DDL extensions. A special
    "CREATE TABLE ... USING hive" form is allowed that explicitly marks the table as using the
    hive provider (see the sketch after this list).

There are two user-level properties (see the sketch after this list):

  • The standard "spark.sql.catalogImplementation": when set to "hive", the external hive
    metastore is consulted in addition to the builtin catalog. The builtin catalog is searched
    first and the external one after, so on a name clash the builtin entry takes precedence.
    For writes, all tables using "hive" as the provider go to the external hive metastore
    while the rest use the builtin catalog.
  • "snappydata.sql.hiveCompatibility", which can be set to default/spark/full. When set to
    "spark" or "full", the default behaviour of "CREATE TABLE ..." without a USING provider,
    as well as any Hive DDL extensions, changes to create a hive table instead of a row table.
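A hedged sketch of the second property in action, reusing the hypothetical `snappy` session from the earlier sketch (the table name is illustrative; note that the commit list below later renames this property to snappydata.sql.hive.compatibility):

```scala
// Raise the compatibility level for this session.
snappy.sql("SET snappydata.sql.hiveCompatibility=spark")
// With "spark" (or "full") compatibility, a provider-less CREATE TABLE now
// creates a hive table instead of the usual row table.
snappy.sql("CREATE TABLE t1 (id INT)")
```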

A lazily instantiated hive-enabled SparkSession is kept inside SnappySessionState and is
consulted only when "spark.sql.catalogImplementation" is set to "hive" for the session.
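A minimal sketch of that lazy holder, with hypothetical names (the actual SnappySessionState members may differ, and hive classes must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

class HiveSessionHolder(catalogImpl: () => String) {
  // Built only on first use, so sessions that never enable hive pay no cost.
  private lazy val hiveSession: SparkSession =
    SparkSession.builder().enableHiveSupport().getOrCreate()

  // Referred to only when the session-level catalog implementation is "hive".
  def hiveSessionIfEnabled: Option[SparkSession] =
    if (catalogImpl() == "hive") Some(hiveSession) else None
}
```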

For 1), the list/get/create methods in SnappySessionCatalog have been overridden to read from
and write to the hive catalog after the snappy catalog when hive support is enabled on the session.
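A self-contained toy model of that lookup and write order (all names here are hypothetical; the real code overrides methods on SnappySessionCatalog):

```scala
case class TableId(schema: String, table: String)

// Maps stand in for the two catalogs; values are the table's provider.
class UnionedCatalog(
    builtin: Map[TableId, String],
    externalHive: Map[TableId, String],
    hiveEnabled: Boolean) {

  // Reads: builtin is searched first, so it wins on a name clash.
  def lookup(id: TableId): Option[String] =
    builtin.get(id).orElse(if (hiveEnabled) externalHive.get(id) else None)

  // Writes: only tables with the "hive" provider go to the external metastore.
  def writeTarget(provider: String): String =
    if (hiveEnabled && provider == "hive") "external-hive" else "builtin"
}
```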

For 2), wrapper Rule/Strategy classes have been added that wrap the extra rules/strategies from
the hive session and run them only if the property has been enabled on the SnappySession.
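A sketch of the wrapper idea for the rule side; the class name and the exact conditional check are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Applies the wrapped hive rule only when the session has the hive catalog
// enabled; otherwise the plan passes through unchanged.
class HiveConditionalRule(wrapped: Rule[LogicalPlan], session: SparkSession)
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (session.conf.get("spark.sql.catalogImplementation", "in-memory") == "hive") {
      wrapped(plan) // delegate to the rule taken from the hive session
    } else plan
}
```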

The code temporarily switches to the hive-enabled SparkSession when running hive
rules/strategies, some of which expect the internal sharedState/sessionState to be those of hive.
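The assumed shape of that temporary switch, as a small helper (not the exact PR code):

```scala
import org.apache.spark.sql.SparkSession

// Make the hive-enabled session the active one while the body runs, and
// always restore the original session afterwards, even on failure.
def withActiveSession[T](temp: SparkSession, original: SparkSession)(body: => T): T = {
  SparkSession.setActiveSession(temp)
  try body
  finally SparkSession.setActiveSession(original)
}
```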

Patch testing

precheckin and manual testing

ReleaseNotes.txt changes

Documentation for the new property and what it provides for users.

Other PRs

TIBCOSoftware/snappy-store#499
TIBCOSoftware/snappy-spark#164
https://github.com/SnappyDataInc/snappy-aqp/pull/178

Sumedh Wale added 22 commits November 28, 2018 22:07
The koloboke project has been dead and unmaintained for a couple of years now, so it has been
replaced with eclipse collections, though the latter are a bit slower for some operations
and also add significant bulk (~10M).
- also added an implicit retry for catalog stale exceptions in queries
- invalidate the entire connector cache on a create/drop/alter since the versions
  stored for other relations in RelationInfo will certainly be stale too
allow for the absence of baseTable in an external catalog table drop since it can be a temporary table
Sumedh Wale added 6 commits December 17, 2018 21:16
Allow for "gemfire" data source to make a catalog entry during create table execution
in its createRelation itself. It needs the creation to add new parameters to the options bag.

Fixed dependent handling to avoid duplicates.
Sumedh Wale added 6 commits July 20, 2019 19:54
Instead of changing the sessionState/sharedState inside SnappySession, switch the existing
active session to the SparkSession. This also fixes a failure in InsertIntoHiveTable that was
due to the state inside SnappySession having been switched back when makeCopy
of that plan is invoked.
@sumwale (Contributor, Author) commented Jul 22, 2019

Do you plan to add any unit tests, or is this mostly covered by hydra functional tests?

Porting the hive test suites from Spark to use SnappySession with external hive enabled. Hive-compatible DDL support as present in SparkSession has also been added to SnappySession in this PR. An additional property has been added to use the hive provider as the default when no provider is given in a CREATE TABLE (which creates a ROW table otherwise). The convention in CREATE TABLE is to use the external hive catalog for the hive provider and the in-built catalog otherwise (see the sketch below), so all of the hive suites from Spark should work as is.
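A hedged illustration of that convention, reusing the hypothetical `snappy` session from the earlier sketches (table names are illustrative):

```scala
// Provider-less CREATE TABLE: a builtin ROW table at the default
// compatibility level.
snappy.sql("CREATE TABLE plain_t (id INT)")

// PARTITIONED BY / STORED AS are hive DDL extensions, so the provider is
// taken as hive and the table lands in the external hive catalog.
snappy.sql(
  "CREATE TABLE part_t (id INT) PARTITIONED BY (d STRING) STORED AS PARQUET")
```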

Sumedh Wale added 18 commits July 22, 2019 16:32
instead rename hiveCompatible to snappydata.sql.hive.compatibility and use a tri-state
(default, spark, hive) to denote the level of compatibility to use. Specifically,
the spark and hive levels will use 'hive' as the default provider.
if spark.sql.sources.default is explicitly set then use the same in the SQL parser,
with the default as 'row' like before
instead honour Spark's "spark.sql.catalogImplementation" itself to make the
configuration identical to Spark's, with the difference that the "hive" implementation
in SnappySession actually refers to the union of the builtin and external catalogs

fixed a few precheckin failures
ExpressionSQLBuilderSuite -> SnappyExpressionSQLBuilderSuite
…wise not allowed by Spark

other fixes and cleanups
if hive-specific extensions are present in CREATE TABLE then always assume the provider
to be "hive" and pass the statement to the Spark parser
make the behaviour of "drop schema" and "drop database" identical: drop from both the
builtin and external catalogs, since "create schema" is identical to "create database";
also cleaned up the current schema/database setup
also improved CommandLineToolsSuite to not print failed output to the screen
@sumwale merged commit 542404c into master Jul 27, 2019
@sumwale deleted the external-hive-support branch July 27, 2019 04:57