
Delta lake / lakehouse support #64

Open
TissonMathew opened this issue Dec 31, 2020 · 20 comments
Labels
enhancement New feature or request

Comments

@TissonMathew

TissonMathew commented Dec 31, 2020

Any plans to support Delta Lake? Keep the CDM-specific manifests/metadata in ADLS Gen2 and the data in Delta. This also removes a lot of operational burden, including partitioning.

I like CDM's standard schemas and approaches, but operationalizing CDM data for interactive queries is costly (e.g. copying data into Cosmos DB, Azure Search, etc.). Delta's compute & storage optimizations make interactive queries cost-effective without sacrificing performance, e.g. for Power BI or a React app ...

CDM + Delta could be an excellent, cost-effective alternative to Snowflake.

For example:

Creates the CDM manifest and adds the entity to it in Delta format, with both physical and logical entity definitions:

    (df.write.format("com.microsoft.cdm")
      .option("storage", storageAccountName)
      .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
      .option("entity", "TestEntity")
      .option("format", "delta")
      .save())
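To make the manifest side of this proposal concrete, here is a minimal, hypothetical sketch (not part of the connector, and using a heavily simplified manifest shape; real default.manifest.cdm.json documents carry much more metadata) of how a reader might resolve an entity's data partition locations. Under the proposal, a partition location could point at a Delta table directory instead of CSV/Parquet files:

```python
import json

# Simplified, hypothetical CDM manifest for illustration only.
# The "location" here is imagined as a Delta table directory.
manifest_json = """
{
  "manifestName": "default",
  "entities": [
    {
      "type": "LocalEntity",
      "entityName": "TestEntity",
      "dataPartitions": [
        {"location": "implicitTest/TestEntity"}
      ]
    }
  ]
}
"""

def partition_locations(manifest_text):
    """Map each entity name to its list of data partition locations."""
    manifest = json.loads(manifest_text)
    return {
        entity["entityName"]: [p["location"] for p in entity.get("dataPartitions", [])]
        for entity in manifest.get("entities", [])
    }

locations = partition_locations(manifest_json)
print(locations["TestEntity"])  # -> ['implicitTest/TestEntity']
```

A reader that understood a `"format": "delta"` hint could hand each resolved location to a Delta client rather than enumerating individual data files.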

@TissonMathew
Author

We are currently in production with Delta and CDM. We had to work around a few things to make it work, but the perf & scale are incredible with Delta Lake (lakehouse architecture, both streaming/real-time and batch). The best of both worlds: CDM adds meaning to the data in Delta.

@stevenwilliamsmis

@TissonMathew would you mind sharing what you had to do to make this work with delta?

@SQLArchitect

SQLArchitect commented Feb 8, 2021 via email

@bissont
Contributor

bissont commented Feb 18, 2021

Hi @TissonMathew,

Yes, @euangms and I are also interested in your use case. Would you be interested in getting in on a call to discuss?

@TissonMathew
Author

@bissont Sure. Happy to connect!

@don4of4

don4of4 commented Mar 26, 2021

@TissonMathew do you mind describing your approach? Did you write your own connector?

@bitsofinfo

@TissonMathew please share?

@srichetar srichetar added the enhancement New feature or request label Jul 1, 2021
@colinrippeyfinarne

Hi @TissonMathew, I am currently building out an Azure Synapse PoC, and we are planning on using Delta Lake as our primary storage format. I would love to know how you've been able to get the ADLS folders and files containing the Delta Lake to "co-exist" with the CDM folders & files (if that is in fact what you've done).

Can you provide any details please?

@ralphke

ralphke commented Nov 22, 2021

Native CDM support for Delta files would be a great addition.
Is anybody actively working on this?

@drinkingbird

Hi @TissonMathew
I'm wondering what you achieve by writing Delta files using the CDM structure. While there is overlap in features (such as data typing in a data lake, partitioning, and history), they are quite different in implementation. The benefit of CDM is its accessibility and numerous readers: pretty much everything can read CSVs, so the barrier to entry for interacting with the data is lower, even with custom code.

A mixed approach would create additional metadata files to manage, and would limit readers to only those compatible with Delta; the only combined CDM + Delta reader/writer would be this one.

I feel it's quite simple to build a delta lake using the CDM as one of the sources, to which you can apply your own business's governance and requirements. In this case, your readers would not require this connector.

I'm sorry but I'm missing the advantage of this. Can you please elaborate?

@ralphke

ralphke commented Nov 23, 2021

@drinkingbird CDM serves a different purpose than Delta. One of the key aspects of CDM is the ability to express relationships between different Delta files, as well as rich descriptions of the context of each file and its columns as they exist within the Delta format. Having the well-documented industry-specific data models in the Synapse workbench would also allow easy editing and maintenance of CDM models. This is possible today with Parquet files, but not yet with Delta files.

@drinkingbird

@ralphke Excellent. Thank you for the response! That clears it up.

@don4of4

don4of4 commented Nov 23, 2021

We hear that Delta is the number one requested feature -- so I know there must be irons in the fire.

That said, Delta support for CDM/SyMS is a must-have for our company's big data plans on Azure Data Services.

Speaking on behalf of Hitachi Solutions: the current limitations around Delta impair serious use of directly exported Dynamics CE / FO data. We are finding that our need to ingest it into Delta (or a dedicated pool), to bring that data onto the same plane as the rest of the enterprise sources, materially erodes the value proposition of the feature.

Similarly, we can't seriously consider building models with the 3NF Industry Models without the performance we currently get from Delta. This is especially true given that the industry models must still be transformed/converted into Kimball/dimensional marts.

@bissont
Contributor

bissont commented Nov 23, 2021

I need to investigate, but it looks like there is native support for writing to the Delta format now:
https://github.com/delta-io/connectors/issues/85

@ralphke

ralphke commented Feb 7, 2022

I need to investigate, but it looks like there is native support for writing to the Delta format now: delta-io/connectors#85

It looks like the team is in a very early design phase for this connector.

@NitinSingh12

We are working on a PoC to integrate Delta Lake as our primary storage layer, but we are looking for options to read the Dynamics 365 data stored in ADLS. We want to use Autoloader functions to create entity tables and store them in Delta Lake.

Please help if you have any resources. thank you.

@NitinSingh12

@TissonMathew Hi, would you mind sharing your approach, please? Thank you.

@don4of4

don4of4 commented Jul 11, 2022

@NitinSingh12 My firm, Hitachi Solutions, has a commercial connector under final development and testing with a very large customer -- a trickle loader for Finance and Operations and Dataverse. You can reach out to me at dscott@hitachisolutions.com if you are interested.

@TissonMathew
Author

@NitinSingh12 please contact suresh.velga@skypointcloud.com

@NitinSingh12

@TissonMathew @don4of4 Thank you both for the info. However, my team was able to build a solution and perform CDC & data loads from M 365 with Autoloader functions into Delta format. Thank you.
