
Do filter push down when reading from views #160

Merged: 2 commits into GoogleCloudDataproc:master on Apr 24, 2020

Conversation

@diggerk (Contributor) commented Apr 23, 2020

Here's a fix for filter push down when reading from views. I've moved temp table creation to the moment buildScan is invoked. The type of defaultTableDefinition is changed to TableDefinition, and a separate method, getNumBytes, is introduced to calculate the size of the table. getNumBytes returns zero bytes for views, which should be fine: what matters is that the connector calculates the number of partitions after the temp table is generated, and by that time getNumBytes is called on the temp table, not the view.
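
The shape of that change, as a rough sketch (illustrative only; the trait LazyViewScan and the helper materialize are hypothetical names, not identifiers from this PR):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

// Illustrative: defer view materialization until buildScan, when the
// pushed-down filters are known, instead of creating the temp table
// when the relation is first constructed.
trait LazyViewScan extends PrunedFilteredScan { self: BaseRelation =>
  // Hypothetical helper: materializes a view into a temp table with the
  // filters applied, then scans the temp table rather than the view.
  def materialize(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    materialize(requiredColumns, filters) // temp table is created here
}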

@davidrabinowitz (Member)

/gcbrun

def getNumBytes(tableDefinition: TableDefinition): Long = {
  val tableType = tableDefinition.getType
  if (options.viewsEnabled && TableDefinition.Type.VIEW == tableType) {
    0 // views have no storage size; see the review comment below
  } else {
    tableDefinition.asInstanceOf[StandardTableDefinition].getNumBytes
  }
}

@davidrabinowitz (Member) commented:

Please replace the value with the default: sqlContext.conf.defaultSizeInBytes (taken from here). Based on the documentation, having a size of 0 may have side effects.
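
For context (background, not from the thread itself): Spark compares a relation's estimated sizeInBytes against spark.sql.autoBroadcastJoinThreshold when planning joins, so an estimate of 0 can make any view-backed relation eligible for a broadcast join regardless of its real size. A quick way to check the threshold:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Relations estimated below this threshold (10485760 bytes, i.e. 10 MB,
// by default) may be broadcast to every executor during a join.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))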

@diggerk replied:

sqlContext.conf is not accessible outside of org.apache.spark.sql, so I had to use sqlContext.sparkSession.sessionState.conf.defaultSizeInBytes instead.
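
For reference, a minimal sketch of what the revised method plausibly looks like after this change, assuming the surrounding relation exposes options and sqlContext as in the snippet above:

import com.google.cloud.bigquery.{StandardTableDefinition, TableDefinition}

def getNumBytes(tableDefinition: TableDefinition): Long = {
  if (options.viewsEnabled && TableDefinition.Type.VIEW == tableDefinition.getType) {
    // Spark's generic default size estimate (Long.MaxValue by default),
    // which errs on the side of never broadcasting the unmaterialized view.
    sqlContext.sparkSession.sessionState.conf.defaultSizeInBytes
  } else {
    tableDefinition.asInstanceOf[StandardTableDefinition].getNumBytes
  }
}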

@davidrabinowitz (Member)

Hi @diggerk, thanks for the contribution! Can you please have a look at the comment above?

Commit: Using 0 as the Spark relation size estimate may lead to Spark broadcasting a relation that potentially represents a large table

@ghost commented Apr 24, 2020

Hi @davidrabinowitz, thanks for the catch; I fixed that.

@davidrabinowitz (Member)

/gcbrun

@davidrabinowitz merged commit 1095dfd into GoogleCloudDataproc:master on Apr 24, 2020