New content transfer module #14630

Draft

MikeAlhayek wants to merge 25 commits into main

Conversation

@MikeAlhayek
Member

MikeAlhayek commented Nov 4, 2023

This feature will enable the user to import content items into OC and export them out of OC.

Important: this PR is not yet ready to be reviewed. Let's hold off on any code review until it is ready, to avoid wasting time.
To do

  • Create a UI to show the list of ContentTransferEntry instances along with their status. Add at least a status filter to allow viewing the entries by status.
  • Create a background processor that processes ContentTransferEntry instances, committing 200 rows at a time. The idea is that if the tenant is restarted, we would lose no more than 200 already-processed records (see the sketch at the end of this description).
  • Add options for the import process so the user can control its behavior (should we validate content items? If something fails validation, should we ignore it or store it anyway?).
  • Replace EPPlus with a free library (DocumentFormat.OpenXml 3.0 may be a good choice).
  • Add import handlers for the remaining fields.
  • Add import handlers for other parts.
  • Document the module so users know how to use it.
  • Clean up the code.

Fixes #7865
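
To make the batching idea in the to-do list concrete, here is a minimal sketch of what the background processor could look like, assuming OrchardCore's IBackgroundTask extension point. The IContentTransferEntryStore service and the ContentTransferEntry shape below are hypothetical placeholders, not the actual types in this PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using OrchardCore.BackgroundTasks;

// Hypothetical store for the queued entries; the real PR defines its own services.
public interface IContentTransferEntryStore
{
    Task<ContentTransferEntry[]> GetNextBatchAsync(int batchSize);
    Task SaveAsync(ContentTransferEntry entry);
}

// Hypothetical entry shape used only for this sketch.
public class ContentTransferEntry
{
    public string EntryId { get; set; }
    public string Status { get; set; }
}

[BackgroundTask(Schedule = "* * * * *", Description = "Processes queued content transfer entries.")]
public class ContentTransferBackgroundTask : IBackgroundTask
{
    // Commit 200 rows at a time so a tenant restart loses at most one batch.
    private const int BatchSize = 200;

    public async Task DoWorkAsync(IServiceProvider serviceProvider, CancellationToken cancellationToken)
    {
        var store = serviceProvider.GetRequiredService<IContentTransferEntryStore>();
        var batch = await store.GetNextBatchAsync(BatchSize);

        foreach (var entry in batch)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // Map the file rows to content items and save them via IContentManager here.
            entry.Status = "Completed";
            await store.SaveAsync(entry);
        }
    }
}
```

Whatever shape the entries end up taking, committing after each batch keeps the loss after a tenant restart bounded, as described in the to-do above.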

@hishamco
Member

hishamco commented Nov 6, 2023

What about the deployment module?

@MikeAlhayek
Member Author

What about the deployment module?

The Deployment module imports/exports recipes. This module imports/exports content items using CSV or Excel files. I think having its own module is better.

@MichaelPetrinolis
Contributor

Hi @MikeAlhayek, I agree with having a flat-file import/export module, but I would like to reuse the recipes for importing/exporting the data. The flat-file import/export module should define multiple mappings between the file and the content type, and when importing it should convert the records to multiple recipe.json files for the Content step. A background job should execute the created recipe files and mark the job as done when all batches are handled.
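
For reference, a Content recipe step is shaped roughly like this (the content item values below are only illustrative):

```json
{
  "steps": [
    {
      "name": "Content",
      "data": [
        {
          "ContentType": "Article",
          "DisplayText": "Imported article",
          "Published": true,
          "Latest": true,
          "TitlePart": {
            "Title": "Imported article"
          }
        }
      ]
    }
  ]
}
```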

The same logic could apply to the export: an asynchronous background job gets the recipe.json data and converts the records to a flat file in batches, and when all the batches are complete, the job is marked as completed with a file available for download.

Another thing is the licensing of EPPlus: it is not MIT. It was LGPL until v4, and from v5 and up it is under the Polyform Noncommercial 1.0.0 license; see https://www.epplussoftware.com/en/LicenseOverview

PS: I have not read the code to see how you implemented the module; I suppose you reimplement the import/export between the database and the files.

@MikeAlhayek
Member Author

@MichaelPetrinolis Yes, the EPPlus license is a problem. Maybe we'll use a different library (not worried about it for now). Once everything is ready, it should be an easy task to switch libraries, assuming there are free libraries out there that can do the conversion.

Why do you suggest using recipes to import? The idea is to queue up files, and a background process will process the records in batches after validation is done. With this approach, we can monitor the process (like how many records failed, how many succeeded, and how many were processed in total). We could even download the failed records.

The problem with creating recipe files and trying to import them is that we can't tell which records failed, and there is no good way to provide feedback on the process. And if I am able to create the content item, at that point I don't know if there is a benefit to writing that object to a recipe over writing it to the database directly. Serializing and deserializing objects will add overhead. But if you think there are benefits, I would like to hear more.

@MichaelPetrinolis
Contributor

@MikeAlhayek my point is that we already have import logic in the content manager and a mechanism for importing files with Content recipe steps, which calls all the necessary handlers, reports back the validation errors, etc. What is missing is the glue between the import operation, the batch, and the external source id for each content item record that is to be imported. Then we could use this info to track the operation for each content item in the context of an import operation/batch. That way we could, for example, create audit trail events with the info or report back on the status. The logic of the new module would focus on the transformation from flat files and/or external JSON schemas to content items based on the content type definition, the creation of the batches, and a background job to handle the batches asynchronously.

Regarding the database, it is a seek operation to retrieve the original data based on the content item ids in the batch, plus the insert/update/delete commands to update the documents and the indexes. The serialization/deserialization process is something we can't avoid if we use recipes, but it takes place in the background: a producer job creates the batch files while another background consumer job performs the deserialization and import of the batches. And if the distributed features are enabled, the import could happen on multiple nodes. The bottleneck here would be the DB.

In any case, it would be nice to have an easy mechanism for importing external data into the system. Good job!

@MikeAlhayek
Member Author

@MichaelPetrinolis yes, I already use the content manager in my code. I still don't think we need the recipes here. Once I finish, I'll ping you so you can test drive it and provide your feedback. We can always change how we handle the last step (saving the records) later on if we agree that there is indeed a better way.

@MikeAlhayek
Member Author

@cocoke I updated the readme file in this PR with info on how to define a field, part, or content importer. The description above shows the remaining items. One thing I have not yet tried is importing a file and letting the background task process it, to ensure everything is working as expected.

Let me know which item you want to tackle so we are not stepping on each other's toes. But I think you should be able to branch off this branch and submit a PR into this branch, not main.


This pull request has merge conflicts. Please resolve those before requesting a review.

@hyzx86
Contributor

hyzx86 commented Jan 29, 2024

Hi @MikeAlhayek, I noticed today that ClosedXML/ClosedXML#2248 has been merged into their dev branch. Maybe we can use ClosedXML instead of EPPlus.
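
For comparison, reading rows with ClosedXML looks roughly like this; a minimal sketch where the file name and column layout are made up and not taken from this PR:

```csharp
using System;
using System.Linq;
using ClosedXML.Excel;

// Minimal read loop with ClosedXML; assumes the first row holds the headers.
using var workbook = new XLWorkbook("import.xlsx");
var worksheet = workbook.Worksheet(1);

foreach (var row in worksheet.RowsUsed().Skip(1))
{
    var displayText = row.Cell(1).GetString();
    var status = row.Cell(2).GetString();

    Console.WriteLine($"{displayText}: {status}");
}
```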

@MikeAlhayek
Member Author

@hyzx86 yes, that is the plan. If you like, feel free to help complete this PR. I have not had much time to finish it off.

@hyzx86
Contributor

hyzx86 commented Jan 30, 2024

Okay, that doesn't look like a problem. I'll try it when I have time.


This pull request has merge conflicts. Please resolve those before requesting a review.

@Piedone
Member

Piedone commented Mar 24, 2024

Is this something you'd like to revisit any time soon @MikeAlhayek or should we close?

If you'd like to continue, could you please explain the following?

  1. Why not extend the Deployment module to support multiple formats instead?
  2. Why do we need multiple formats, to begin with?

Ad 2: Excel and CSV are quite specific to a given use-case. We at Lombiq built quite a few such export/import features over the years, but none of them were generic content item export/imports, and all of them were only single-direction to/from a non-OC app (and never between two OC apps).

They were all specific to that app. E.g., upload a CSV from an external system (that knows nothing about OC and content items), and generate content items from it or otherwise save data to the DB (some values came directly from the CSV, some were calculated based on something else, some were transforms of the CSV data). Or export to Excel in a way that some other system will understand, which was unique to that system. Or even export specific items to PDF, using a PDF template.

I think that we can only provide a generic solution, but that will never be quite appropriate: people will need to customize it, and customize it in wildly different ways. Perhaps there is some lowest common denominator that's feasible to support, like some batch-processing feature that can handle arbitrarily large uploaded import files and their processing with providers that make no assumptions (not even whether you export from or import into content items). And the first usage example would be the Deployment module. However, this is rather abstract.

@MikeAlhayek
Member Author

Yes, at some point I would like to finish it. Many people showed interest in it. The idea is to be able to import/export data from/to Excel or CSV format. In many cases people want to be able to export data in Excel format for other systems, client data exports, or other reasons. Similarly, people may want to import external data into OC. This module will allow such a thing. This module is only for text data.

The Deployment module allows you to build a deployment plan made of steps and is not text only. One can add media as a deployment step. So it is a completely different concept. Not sure if you have an idea of how we could use the deployment module for this too.

@Piedone
Member

Piedone commented Mar 24, 2024

While with some formats you wouldn't be able to export Media, I think everything else that Deployment Plans provide would be necessary here too (like filtering what to export, unless you only want to allow a single content item or all of them).

@MikeAlhayek
Member Author

In this module, you export one content type at a time. The content type defines the file headers.
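
For example, a hypothetical Product content type with a title and a price field might produce a file along these lines (the column names are only illustrative):

```csv
ContentItemId,DisplayText,Price
4h52waybxc0f0v9v4rt0pcgkmf,Road Bike,799.99
```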

@Piedone
Member

Piedone commented Mar 24, 2024

I see.

@hyzx86
Contributor

hyzx86 commented Mar 25, 2024

Recently I've been learning Odoo, which has a data initialization/import mechanism; maybe we can refer to it.
It uses CSV/XML to import data.
I was wondering if I could use this module for that as well, since the JSON format can sometimes seem unintuitive.
Perhaps it's as simple as adding a deployment node of type ContentData and referencing a specific file, like the template file currently does:

https://github.com/odoo/odoo/blob/e8697f609372cd61b045c4ee2c7f0fcfb496f58a/odoo/addons/base/__manifest__.py#L29
https://github.com/odoo/odoo/blob/e8697f609372cd61b045c4ee2c7f0fcfb496f58a/odoo/addons/base/data/res.country.state.csv

@Piedone removed the notready label Apr 9, 2024
@Piedone
Member

Piedone commented May 16, 2024

Fixes #7865?

@MikeAlhayek
Member Author

Yes, it should. I've got to find time to complete this draft.

@hyzx86
Contributor

hyzx86 commented May 19, 2024

Hi @MikeAlhayek, I noticed that ClosedXML/ClosedXML#2248 has been merged into their dev branch; maybe we can use ClosedXML instead of EPPlus (https://github.com/ClosedXML/ClosedXML/pull/2248).

As of today, ClosedXML has not yet released a new official version.
Do we have any other options?

@MikeAlhayek
Member Author

@hyzx86 maybe https://www.nuget.org/packages/DocumentFormat.OpenXml. Are you planning to help finish up this PR?
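
If DocumentFormat.OpenXml is used, the reading side is lower level than EPPlus or ClosedXML, since cell values and shared strings have to be resolved by hand. A minimal sketch, with the file name made up and error handling omitted:

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Minimal read loop with the OpenXML SDK; shared strings are resolved manually.
using var document = SpreadsheetDocument.Open("import.xlsx", isEditable: false);
var workbookPart = document.WorkbookPart;
var sheet = workbookPart.Workbook.Descendants<Sheet>().First();
var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);
var sharedStrings = workbookPart.SharedStringTablePart?.SharedStringTable;

foreach (var row in worksheetPart.Worksheet.Descendants<Row>())
{
    foreach (var cell in row.Elements<Cell>())
    {
        var value = cell.CellValue?.Text;

        // Text cells usually reference the shared string table by index.
        if (cell.DataType?.Value == CellValues.SharedString && sharedStrings is not null && value is not null)
        {
            value = sharedStrings.ElementAt(int.Parse(value)).InnerText;
        }

        Console.Write($"{value}\t");
    }

    Console.WriteLine();
}
```

That verbosity is the main trade-off against ClosedXML's friendlier API.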

@hyzx86
Contributor

hyzx86 commented May 19, 2024

@hyzx86 maybe https://www.nuget.org/packages/DocumentFormat.OpenXml. Are you planning to help finish up this PR?

Oh. That might be another question about this #14941 🤣

@MikeAlhayek
Member Author

MikeAlhayek commented May 19, 2024

No strong opinion here. If there is a drawback to using 3, we can use 2. Maybe you should just drop support for ClosedXML and everyone is happy :). This module is not using ClosedXML, so it may be better than the functionality you are using.
