
Adds implementation for supporting columnar batch reads from Spark. #198

Merged
merged 2 commits into GoogleCloudDataproc:master
Jul 7, 2020

Conversation

emkornfield
Collaborator

This bypasses most of the existing translation code for the following reasons:

  1. I think there might be a memory leak because the existing code doesn't close the allocator.
  2. This avoids continuously recopying the schema.

I didn't delete the old code because it appears the BigQueryRDD still relies on it partially.

I also couldn't find instructions on formatting/testing (I couldn't find explicit unit tests for the existing Arrow code; I'll update accordingly if pointers can be provided).
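
For context, here is a minimal sketch of the pattern this change describes, not the connector's actual code. It assumes Spark 2.4's DataSourceV2 `InputPartitionReader` API and Arrow's `ArrowStreamReader`, and it assumes the Arrow bytes of one BigQuery Storage read stream are available as a single `InputStream`; the class name `ArrowColumnarBatchReader` is hypothetical. It illustrates the two points above: `loadNextBatch()` reuses a single `VectorSchemaRoot`, so the schema is parsed once rather than recopied per batch, and `close()` releases both the stream reader and the allocator, the step whose omission would leak Arrow's off-heap buffers.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Illustrative only: streams Arrow record batches and exposes them to Spark as ColumnarBatches.
class ArrowColumnarBatchReader implements InputPartitionReader<ColumnarBatch> {
  private final BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
  private final ArrowStreamReader reader;
  private ColumnarBatch currentBatch;

  ArrowColumnarBatchReader(InputStream arrowStream) {
    // Assumed input: the Arrow IPC bytes of one BigQuery Storage read stream.
    this.reader = new ArrowStreamReader(arrowStream, allocator);
  }

  @Override
  public boolean next() throws IOException {
    // loadNextBatch() refills the reader's single VectorSchemaRoot in place,
    // so the Arrow schema is parsed once per stream instead of once per batch.
    if (!reader.loadNextBatch()) {
      return false;
    }
    VectorSchemaRoot root = reader.getVectorSchemaRoot();
    ColumnVector[] columns =
        root.getFieldVectors().stream()
            .map(ArrowColumnVector::new)
            .toArray(ColumnVector[]::new);
    currentBatch = new ColumnarBatch(columns);
    currentBatch.setNumRows(root.getRowCount());
    return true;
  }

  @Override
  public ColumnarBatch get() {
    return currentBatch;
  }

  @Override
  public void close() throws IOException {
    // Closing the reader and then the allocator frees the off-heap Arrow buffers;
    // skipping allocator.close() is the suspected leak mentioned above.
    reader.close();
    allocator.close();
  }
}
```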

@emkornfield emkornfield marked this pull request as draft June 27, 2020 00:17
@emkornfield
Collaborator Author

CC @Gaurangi94 @davidrabinowitz

@davidrabinowitz
Member

/gcbrun

@emkornfield
Collaborator Author

Looks like I don't have permission to see the build failure?

@davidrabinowitz
Member

/gcbrun

@davidrabinowitz
Member

Failed on a timeout, re-ran

@emkornfield
Collaborator Author

/gcbrun

@emkornfield
Collaborator Author

/gcbrun

@emkornfield
Collaborator Author

Looks like with a fix to the helper class it passes gcbrun? I had to force push due to the change that reformatted the entire code base.

@davidrabinowitz davidrabinowitz marked this pull request as ready for review July 7, 2020 05:08
@davidrabinowitz
Member

/gcbrun

@davidrabinowitz davidrabinowitz merged commit 22b41d3 into GoogleCloudDataproc:master Jul 7, 2020
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Jul 9, 2020
Adds implementation for supporting columnar batch reads from Spark. (GoogleCloudDataproc#198)

YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Jul 9, 2020
author Yuval Medina <ymed@google.com> 1592603880 +0000
committer Yuval Medina <ymed@google.com> 1594336084 +0000

Created ProtobufUtils.java in order to convert data and schema from Spark format into protobuf format for ingestion into BigQuery. Created Spark to BigQuery schema conversion suites in SchemaConverters.java. Created ProtobufUtilsTest.java and SchemaConverterTest.java to comprehensively test both classes. Translated Scala testing code in SchemaConvertersSuite.scala into Java and merged it with SchemaConverters.java.

Fixing SparkBigQueryConnectorUserAgentProvider initialization bug (GoogleCloudDataproc#186)

prepare release 0.16.1

prepare for next development iteration

Sectioned the schema converter file for easier readability. Added a Table creation method.

Wrote comprehensive tests to check YuvalSchemaConverters. Equality testing still needs improvement: assertEquals does not check for more than superficial equality, so if further testing is to be done without the help of logs, it would be useful to write an equality function for schemas.
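
As a concrete illustration of that last point, below is a hypothetical deep-equality helper of the kind the commit suggests, assuming the com.google.cloud.bigquery `Schema`/`Field`/`FieldList` API; the class and method names are illustrative, not code from this repository.

```java
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FieldList;
import com.google.cloud.bigquery.Schema;
import java.util.Objects;

// Hypothetical test helper: compares two BigQuery schemas field by field, recursing into
// RECORD sub-fields, instead of relying on assertEquals' top-level comparison.
final class SchemaAssertions {

  static boolean schemasEqual(Schema expected, Schema actual) {
    return fieldListsEqual(expected.getFields(), actual.getFields());
  }

  private static boolean fieldListsEqual(FieldList expected, FieldList actual) {
    if (expected == null || actual == null) {
      return expected == actual; // leaf fields have no sub-fields, so both may be null
    }
    if (expected.size() != actual.size()) {
      return false;
    }
    for (int i = 0; i < expected.size(); i++) {
      if (!fieldsEqual(expected.get(i), actual.get(i))) {
        return false;
      }
    }
    return true;
  }

  private static boolean fieldsEqual(Field expected, Field actual) {
    return Objects.equals(expected.getName(), actual.getName())
        && Objects.equals(expected.getType(), actual.getType())
        && Objects.equals(expected.getMode(), actual.getMode())
        && fieldListsEqual(expected.getSubFields(), actual.getSubFields());
  }
}
```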

Spark->BQ Schema working correctly. Blocked out Map functionality, as it is not supported. Made SchemaConverters, Schema-unit-tests more readable. Improved use of BigQuery library functions/iteration in SchemaConverters

Adding acceptance test on Dataproc (GoogleCloudDataproc#193)

In order to run the test: `sbt package acceptance:test`

Added support for materialized views (GoogleCloudDataproc#192)

Applying Google Java format on compile (GoogleCloudDataproc#203)

Created Spark-BigQuery schema converter and created BigQuery schema - ProtoSchema converter. Now awaiting comprehensive tests before merging with master.

Renamed SchemaConverters file, about to merge into David's SchemaConverters. Improved unit tests to check the toBigQueryColumn method instead of the more abstract toBigQuerySchema (in order to check that each data type is working correctly). Tackling the toProtoRows converter.

BigQuery->ProtoSchema converter is passing all unit tests.

Merged my (YuvalMedina) schema converters with David's (davidrabinowitz) SchemaConverters under spark.bigquery. Renamed my schema converters to SchemaConvertersDevelopment, in which I will continue working on a ProtoRows converter.

SchemaConvertersDevelopment is passing all tests on Spark -> Protobuf Descriptor conversion, even on nested structs. Unit tests need to be written to test actual row conversion (Spark values -> Protobuf values). Minor fixes to SchemaConverters.java: code needs to be smoothed out.

ProtoRows converter is passing 10 unit tests; the sparkRowToProtoRow test must be revised to confirm that ProtoRows conversion is fully working. All functions doing Spark InternalRow -> ProtoRow and BigQuery Schema -> ProtoSchema conversions were migrated from SchemaConverters.java to ProtoBufUtils.java. SchemaConverters.java now contains both the Spark -> BigQuery as well as the original BigQuery -> Spark conversions. ProtoBufUtilsTests.java was created to test the functions in ProtoBufUtils separately.

All conversion suites for Spark -> BigQuery, BigQuery -> ProtoSchema, and Spark rows -> ProtoRows are working correctly, and comprehensive tests were written. SchemaConvertersSuite.scala, which tests BigQuery -> Spark conversions, was translated into Java and merged with SchemaConvertersTests.java.

Cleaned up the SchemaConverter tests that were translated from Scala. Added a nesting-depth limit to Records created by the Spark->BigQuery converter.

Deleted unnecessary comments

Deleted a leftover TODO comment in SchemaConvertersTests

Deleted some unnecessary tests.

Last commit before write-support implementation

Made minor edits according to davidrab@'s comments.
Added license heading to all files that were created. Need to test if binary types are converted correctly to protobuf format.

Integrated all of DavidRab's suggestions

Adds implementation for supporting columnar batch reads from Spark. (GoogleCloudDataproc#198)

This bypasses most of the existing translation code for the following reasons:
1.  I think there might be a memory leak because the existing code doesn't close the allocator.
2.  This avoids continuously recopying the schema.

I didn't delete the old code because it appears the BigQueryRDD still relies on it partially.

I also couldn't find instructions on formatting/testing (I couldn't find explicit unit tests
for existing arrow code, I'll update accordingly if pointers can be provided).

Changed tests as well

Changed tests as well

Added functionality to support more complex Spark types (such as StructTypes within ArrayTypes) in SchemaConverters and ProtobufUtils. There are known issues with Timestamp conversion into BigQuery format when integrating with BigQuery Storage Write API.

Revert "Merge branch 'writesupport' of https://github.com/YuvalMedina/spark-bigquery-connector into writesupport"

This reverts commit 65294d8, reversing
changes made to 814a1bf.

Integrated David Rab's second round of suggestions.

Ran sbt build
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Jul 10, 2020
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Jul 23, 2020
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Jul 31, 2020
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Aug 10, 2020
YuvalMedina pushed a commit to YuvalMedina/spark-bigquery-connector that referenced this pull request Aug 13, 2020