Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added JDBC-integration #451

Merged
merged 63 commits into from
Oct 6, 2023
Merged

Added JDBC-integration #451

merged 63 commits into from
Oct 6, 2023

Conversation

zaleslaw
Copy link
Collaborator

@zaleslaw zaleslaw commented Sep 8, 2023

Enhanced SQL Database Integration via JDBC API

This PR marks a significant advancement in our project by introducing seamless integration with SQL databases through the JDBC API, addressing issues #212 and #124.

Key Features:

This enhancement empowers our framework to read data from SQL databases with remarkable flexibility, offering the following reading regimes:

  • Table-based Reading: Retrieve data from specific tables by name.
  • Database Exploration: Read all non-system tables within the database.
  • ResultSet Access: Directly access and read data from ResultSet objects.
  • Complex Query Results: Fetch results from intricate SQL queries, including joins between two or more tables.

Each reading regime allows for either utilizing a provided database connection or establishing a connection internally if the database configuration details are available. Furthermore, every method within these regimes offers two variants: with and without result limits.

Schema Information:

In addition to data retrieval, this enhancement provides the capability to obtain schema information for the results. You can retrieve the schema of individual tables, ResultSet objects, SQL queries, or even acquire schemas for all non-system tables within the database.

Supported Databases:

This implementation currently supports the following databases:

  • MySQL
  • MariaDB
  • SQLite
  • PostgreSQL
  • H2

Please note that while PostgreSQL is supported, certain data types may not be fully accommodated yet.

To gain a comprehensive understanding of this new functionality, I strongly recommend consulting the documentation.

Note: The plugin support is a work in progress and not yet complete; kindly disregard this aspect for now. Your feedback and contributions are highly valued as we continue refining this feature.

@zaleslaw zaleslaw added the enhancement New feature or request label Sep 14, 2023
@zaleslaw zaleslaw self-assigned this Sep 14, 2023
@zaleslaw zaleslaw linked an issue Sep 14, 2023 that may be closed by this pull request
@zaleslaw zaleslaw marked this pull request as ready for review September 15, 2023 06:13
@zaleslaw
Copy link
Collaborator Author

@koperagen @Jolanrensen @ermolenkodev please have a look to be familiar with the code/changes.

@Jolanrensen
Copy link
Collaborator

Awesome job! I'll leave some small comments here and there for now and play around with it a bit :)

@@ -372,6 +375,127 @@ class SchemaGeneratorPluginIntegrationTest : AbstractDataFramePluginIntegrationT
result.task(":build")?.outcome shouldBe TaskOutcome.SUCCESS
}

@Test
// TODO: test is broken
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a dataframe version mismatch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what do you mean?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the gradle plugin uses the bootstrap version of dataframe, which does not yet include your added functions (like readSqlTable etc.). I remember from working on OpenAPI and adding support to the Gradle plugin that I continuously had to publishToMavenLocal because the Gradle plugin cannot use the source files directly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen am I correct?

* @param [tableName] the name of the table to read data from.
* @return the DataFrame containing the data from the SQL table.
*/
public fun DataFrame.Companion.readSqlTable(dbConfig: DatabaseConfiguration, tableName: String): AnyFrame {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be possible to define the DatabaseConfiguration in a little DSL as well, instead of a class instance? like:

DataFrame.readSqlTable("tableName") {
    url = ""
    user = ""
    password = ""
}

Would be a bit more Dataframe-like

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data class is easier to reuse imo. Like, i tried to experiment with SQLite db: read multiple tables, made multiple readSqlTable calls. With data class i created a variable. With DSL it would be more tiresome, especially in notebook where there're no refactorings (?) like extract

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to reuse yes, but not easier to type when calling the function. Maybe we could both support a DatabaseConfiguration and just arguments in the function call?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, for now, a lot of functions, we tracked all the ideas here or I created a separate issues for that.

I want to give users ability to taste it with the status experimental to handle more cases

* @param [dbConfig] the database configuration to connect to the database, including URL, user, and password.
* @return a list of [AnyFrame] objects representing the non-system tables from the database.
*/
public fun DataFrame.Companion.readAllTables(dbConfig: DatabaseConfiguration): List<AnyFrame> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readAllSqlTables maybe? This is defined on the entire DataFrame.Companion scope

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this doesn't return the names of the tables right? may be better to return a map

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another suggestion for the future: better notebooks support. We can generate a type safe wrapper for this map. From user perspective it'll be:

val tables = DataFrame.readAllTables(...)

next cell (each table has schema):

tables.employee
tables.tracks
...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored API after proposed changes and found that for meta information better to add separate methods or use methods returning a data schema

When I changed it to map, it overcomplicated a path to the dataframes

But I also thought, that it's great to have a field name for dataframes and one global space of all created dataframes to search between them by name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One global space of all created dataframes would break the functional aspect of Kotlin DataFrame. They are immutable and any modification makes a new instance.
We could have a DataFrame Collection of some kind, but that would not be much different than a Map, or be able to name DataFrames.
Naming a DataFrame is not that far-fetched, since essentially, a DataFrame and ColumnGroup are the same and ColumnGroups can be named too. That said, currently variable names are used to refer to a DataFrame. There would be a mismatch when people would write: val dataFrameName = dataFrameOf(...) named "otherDataFrameName".

Maybe in the meantime, you could return a DataFrame containing a "name: ValueColumn" and "df: FrameColumn" column, which, especially in notebooks, would be a clear way for users to access multiple dataframes using a single one.

@PaulWoitaschek
Copy link

Thanks for the initiative!

We are using Trino:
https://trino.io/docs/current/client/jdbc.html

Would it be possible to make the DbType not sealed?
Then in the DatabaseConfiguration we could pass in a custom implementation of the DbType, which would superceed the automatic selection.

@zaleslaw
Copy link
Collaborator Author

@PaulWoitaschek thanks for the feedback, I will think about making nor sealed


override fun toColumnSchema(tableColumnMetadata: TableColumnMetadata): ColumnSchema {
return when (tableColumnMetadata.sqlType) {
"CHARACTER", "CHAR" -> ColumnSchema.Value(typeOf<String>())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nullability? If i understand correctly, all rs.get* methods might return nulls since SQL tables can store them. Thus, if your table has nulls and you use a generated schema, code will throw an NPE

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean all the column values with NULL or justa few NULLs in the column?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean all the column values with NULL or justa few NULLs in the column?

@zaleslaw zaleslaw merged commit ce45ed4 into Kotlin:master Oct 6, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JDBC Support
4 participants