
Daffodil Processors now use NiFi Records, and most other CLI Options have been implemented #1

Merged (1 commit) on Aug 27, 2020

Conversation

andrewjc2000

More specifically, DaffodilParse can be configured with a RecordSetWriter component that takes a Daffodil Infoset in NiFi Record form and transforms it into a FlowFile. DaffodilUnparse can be configured with a RecordReader component that takes a Daffodil Infoset in some FlowFile form, transforms it into a NiFi Record, and then unparses this Record back into data.

In addition, all CLI Options except for the Root, Path, and Pre-Compiled Parser options are now available as Properties.

Member (@stevedlawrence) left a comment

Had a handful of comments, mainly surrounding usability questions/concerns. Looks really nice though. Even if changes are needed and you aren't able to get to them, this is a great amount of work that should not require too much effort to finalize.

Because it depends on a snapshot build of Daffodil, can you please modify the pull request to merge into the dev branch? That way, when it is ready to merge, we can make it clear that it shouldn't be used except by devs until 3.0.0 comes out and any remaining NiFi bugs are resolved.

.gitignore Outdated
.idea/*
target/*
nifi-daffodil-nar/target/
nifi-daffodil-processors/target/
Member

In general, I try to follow the standard that only intermediate build files should be listed in a repo's .gitignore file. For maven, I would expect that "target" should be sufficient. Files that are created by a user's specific development environment (such as .idea and .iml) should be handled some other way, such as with a global git core.excludesFile.

Author

Understood, I'll remove the rest.

unparses the infoset to the original file format
* DaffodilParse: Reads a FlowFile and parses the data into a NiFi Record, which is then converted into an Infoset by a NiFi RecordSetWriter component.
* DaffodilUnparse: Reads a FlowFile containing an infoset in some form, reads it using the correct NiFi RecordReader component and converts it into Records, and then unparses these Records to the original file format.

Member

Thoughts on keeping the DaffodilParse and DaffodilUnparse processors exactly the same as before? My concern here is that there are some odd edge cases due to how Records work in NiFi. In some cases, I can imagine we will want to avoid those, and just have Daffodil read/write the infoset directly, rather than going through Records.

We can still have the DaffodilRecordReader/Writer Controllers, which I think should be able to be used in any of the standard NiFi Processors that accept Record Readers/Writers.

This essentially gives users two options: one where they use Daffodil directly, which is limited to Daffodil's XML/JSON output but has no infoset quirks, and one where they use Daffodil via the new Record Readers/Writers, which may have some quirks to deal with.

Author

I'm not a fan of users having these Controllers because Controllers, unlike Processors, have to be configured globally. If we want to keep what we had before, I say we make 4 processors available: the 2 processors from before and the 2 new ones that use Records. I think this can be done with quite a bit of code reuse if we want to go this route.

Member

Isn't that how controllers work in NiFi? What is wrong with them being configured globally? We should follow NiFi conventions as closely as possible so people already familiar with using Records will be able to plug this new one in without having to learn something new.

Author (@andrewjc2000), Aug 5, 2020

I'm saying that by virtue of the fact that Controllers work this way, Processors seem like the more logical representation of the Daffodil tool. But if you don't think it's a problem that every instance of the Controller would be on the same schema at a given time (i.e. every controller would have to be set to parse JPEGs or PNGs; you couldn't have two going at the same time processing different things), then I suppose I can bring back the Controller solution.

Member

Are you sure that NiFi can't create different instances of Controllers with different properties? That seems like a necessary capability.

Member

To test, I created two SplitRecord processors; for each, I configured the RecordReader property, selected "Create new service", and selected "CSV Reader" for both. This results in two CSV Readers that can be configured independently. Seems like this should allow us to have multiple DaffodilReaderWriter controllers that are configured for different schemas.

Author

Oh, it would appear you are correct. You can even rename them in order to distinguish them. I apologize for my ignorance; I did not think they were as flexible as Processors.

When I tried to develop controllers before, it required some serious restructuring of the project for it to build properly, but I can try to see if I can get away with keeping the current structure.

Member

No problem. I know you have limited time left, so if there isn't enough time to make the change it's not a big deal, especially if there are any other updates you have planned. If you do run out of time, please just make sure to document where you're at, so that if someone picks up the remaining work they have some idea where to start off. Thanks for all your great work!

Author

I would definitely be open to going the controller route during my work day today and tomorrow. Would you like to see something in the end like the project containing the original 2 processors, only modified with the new CLI options, and then the 2 new controllers in the same project? I think that would be very realistic. If we do go this route, for the sake of code reuse, would you mind if I moved out every property into a shared sort of static class, so that way both the Processors and Controllers could use them? It would stink to have to define all of them the same way in 2 places.

Member

Yeah, I think the two original processors and two new controllers would probably be the ideal setup. That way people can use the native records (which are probably preferred) with the controllers, but can switch to the processors if they run into the various bugs/quirks you've discovered.

Some sort of shared class to reduce duplication is definitely a good idea 👍 I suspect many of the properties will be exactly the same.

## Processor Properties

Each Processor has a number of configurable properties intended to be analogous to the [CLI options](https://daffodil.apache.org/cli/) for the Daffodil tool. Here is a note about the __Stream__ option:
- __Stream Mode:__ This mode is disabled by default. When enabled, if there is leftover data after a parse, it does not error out; instead the parse is repeated, generating a Record per parse. If all parses are successful, a Set of Records is generated. When using this mode with the XML Reader and Writer components, the Writer component must be configured with a name for the Root Tag, and the Reader component must be configured with "Expect Records as Array" set to true.

## Build Instructions

Member

Please wrap these lines. Technically it doesn't matter in markdown files, but it makes it easier to read when viewing them as raw text, which is still common.

Author

Are you saying to split it up into multiple lines?

Member

Yep, just wrap everything at 80 or 100 characters or something reasonable.

README.md Outdated
## Processor Properties

Each Processor has a number of configurable properties intended to be analogous to the [CLI options](https://daffodil.apache.org/cli/) for the Daffodil tool. Here is a note about the __Stream__ option:
- __Stream Mode:__ This mode is disabled by default. When enabled, if there is leftover data after a parse, it does not error out; instead the parse is repeated, generating a Record per parse. If all parses are successful, a Set of Records is generated. When using this mode with the XML Reader and Writer components, the Writer component must be configured with a name for the Root Tag, and the Reader component must be configured with "Expect Records as Array" set to true.
Member

What happens if the settings aren't correct? Do things fail? Also, when streaming, what happens if the first parse succeeds but the second fails? Does it still create the first record while the second goes to a failure, or is the whole thing considered a failure?

Author

Yes, things fail if these settings aren't enabled. The XML components are pretty picky. For your second question, I currently have the whole thing failing, but I could easily change it.
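For context, a minimal sketch of the shape the XML components expect in stream mode (the element names here are made up for illustration, not taken from the project): the Writer wraps every per-parse record in the configured root tag, and the Reader then treats the children of that root as an array of records.

```xml
<!-- Hypothetical streamed infoset after three successful parses of a
     schema whose root element is "record". The wrapping <results> tag
     is the name configured as the Writer's root tag; the Reader needs
     "Expect Records as Array" set to true to unwrap it. -->
<results>
  <record><field>a</field></record>
  <record><field>b</field></record>
  <record><field>c</field></record>
</results>
```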

Member

As you have it is fine. This might be something that people would want to be configurable for a future enhancement, but don't worry about it for this PR. It would be good to document this to make it more clear what happens on failure.

<artifactId>incubator-daffodil</artifactId>
<version>92d2036e3d</version>
</dependency>
<!-- For most of the NIFI imports -->
Member

I would rather this just depend on 3.0.0-SNAPSHOT and require users to build and locally publish their own version of Daffodil.

jitpack looks like something that the Apache Software Foundation would frown upon. Code and packaged code must go through a pretty strict release process to verify that there are no licensing issues before being made publicly available, which this artifact skipped. This will need some clarification from the ASF IPMC whether this is okay, but I'd prefer that it just rely on a user-built snapshot and not use jitpack.

Author

I tried including this dependency and it didn't work:

<dependency>
    <groupId>org.apache.daffodil</groupId>
    <artifactId>daffodil-japi_2.12</artifactId>
    <version>3.0.0-SNAPSHOT</version>
</dependency>

Is there another special attribute needed because it is a snapshot version?

Member

This is correct, but then you'll need to run sbt publishM2 from the daffodil repo to publish to a local maven repo where maven can find the dependency.

Author

Ok, I see - should we add that step to the readme and then take it out once 3.0.0 is released?

Member

Yeah, that's probably the right approach for the dev branch, which presumably will also depend on a snapshot of daffodil.

Author

For some reason when I switched to this dependency I started getting the following build error:
[ERROR] C:\Users\achafos\Desktop\nifi-daffodil\nifi-daffodil-processors\src\main\java\com\tresys\nifi\processors\AbstractDaffodilProcessor.java:30:31: error: cannot access ValidationMode

Very bizarre, so for now I'll leave in what I had before, but I definitely would prefer to resolve this before the end of the review.

Member

Your latest changes build for me if I set the dependency to the one above and run sbt publishM2 from the daffodil repo. You might try clearing your maven cache (~/.m2), deleting all the target directories, and republishing the daffodil snapshot. Maybe the dependency switch caused some maven confusion or conflict.

Author

I deleted .m2, I deleted .ivy, I deleted .sbt, and it still resulted in the same problem. Now Unit Tests aren't running right either, very bizarre.

Author

I'll circle back to this as the last change to the project, since I need a working build environment in order to develop. Everything works if I keep the Jitpack dependency, as unprofessional/hacky as it is.

Member

Not a problem. This is low hanging fruit that can be easily fixed at some point in the future.

try (final RecordReader reader = readerFactory.createRecordReader(originalAttributes, inputStream, original.getSize(), getLogger())) {
// Get the first record and process it before we create the Record Writer. We do this so that if the Processor
// updates the Record's schema, we can provide an updated schema to the Record Writer. If there are no records,
// then we can simply create the Writer with the Reader's schema and begin & end the Record Set.
Member

I suspect making changes to a schema won't be acceptable to NiFi devs. The point of a schema is to properly describe data; changing the schema based on data doesn't really make any sense. It's perhaps okay for now, but ultimately NiFi needs to be modified to support the kinds of schemas that Daffodil describes.

Author

Yes, I entirely agree - this would be a non-issue if optional fields existed in NiFi RecordSchemas and components knew how to deal with them because of that fact.

recordCount.set(writeResult.getRecordCount());
}
return;
}
Member

It's not clear what is going on here. If reader.nextRecord returns null (i.e. we didn't read a record?) then we still write something out? What gets written in this case?

Author

nextRecord is essentially next if you think of a RecordReader component as an iterator. So, we keep going while there are valid records produced. For our case, the only time multiple records are produced in parsing is when stream mode is enabled, but that's not necessarily true for unparse.

By the way, I didn't write most of the code in this new onTrigger method - it's copied and pasted from https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/AbstractRecordProcessor.java . This is the standard way Records are created in NiFi.
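That iterator-style contract can be sketched with stub types (the classes below are simplified stand-ins for NiFi's RecordReader/RecordSetWriter, not the real API, so this compiles without NiFi on the classpath):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RecordLoopSketch {
    // Stand-in for a NiFi Record: just carries a value for illustration.
    interface Record { String value(); }

    // Stand-in reader: nextRecord() returns null once exhausted,
    // exactly like the iterator contract described above.
    static class StubReader {
        private final Deque<Record> records = new ArrayDeque<>();
        StubReader(String... values) {
            for (String v : values) records.add(() -> v);
        }
        Record nextRecord() { return records.poll(); }
    }

    // The standard copy loop: keep reading until nextRecord() is null,
    // writing each record out and counting how many were produced.
    static int copyAll(StubReader reader, StringBuilder writer) {
        int count = 0;
        Record record;
        while ((record = reader.nextRecord()) != null) {
            writer.append(record.value());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        int n = copyAll(new StubReader("a", "b", "c"), out);
        System.out.println(n + ":" + out);
    }
}
```

With stream mode enabled, each successful parse would contribute one record to this loop; with it disabled, the reader yields at most one.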

attributes.put(CoreAttributes.MIME_TYPE.key(), writer.getMimeType());
attributes.putAll(writeResult.getAttributes());
recordCount.set(writeResult.getRecordCount());
}
Member

Very similar hunk to above, and it's not clear to me what is going on. It seems to me we only need this second hunk, and this just becomes:

  writer.beginRecordSet();
  Record record;
  while ((record = reader.nextRecord()) != null) {
    writer.write(record);
  }
  final WriteResult writeResult = writer.finishRecordSet();
 ...

What am I missing?

if (content.toLowerCase().startsWith("<?xml")) {
content = content.replaceAll("<(\\S+)></(\\1)>", "<$1>\u200B</$1>");
outputStream.write(content.getBytes(StandardCharsets.UTF_8));
outputStream.flush();
Member

This only helps with XML infosets that we create with this processor. If an XML infoset comes from elsewhere, it's going to fail. So this is a hack that has potential performance implications, and it won't even fix all issues. I would rather just say this is a known issue with the XMLReader and that we can't support those until this is fixed.
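To make the hack under discussion concrete, here is the regex transformation in isolation: it inserts a zero-width space (U+200B) into empty elements so NiFi's XMLReader does not drop them. The sample infoset is made up for illustration; note the pattern only matches attribute-free elements, one of the gaps mentioned above.

```java
public class EmptyElementWorkaround {
    // Insert a zero-width space into empty XML elements, e.g. <a></a>
    // becomes <a>\u200B</a>, so downstream readers keep the field.
    // Elements with attributes are not matched, since \S+ cannot span
    // the space before an attribute.
    static String fillEmptyElements(String content) {
        return content.replaceAll("<(\\S+)></(\\1)>", "<$1>\u200B</$1>");
    }

    public static void main(String[] args) {
        String infoset = "<root><empty></empty><full>data</full></root>";
        // Prints the infoset with an invisible U+200B inside <empty>;
        // <full> is left untouched.
        System.out.println(fillEmptyElements(infoset));
    }
}
```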

this.currentSchema = originalSchema;
this.streamMode = streamMode;
}

Member

This isn't how I imagined this would be used. Can this RecordReader/Writer not be used with other Record processors? Can it only be used with DaffodilParse/Unparse? That feels incorrect. This RecordReader/Writer should probably have its own properties, similar to those of the DaffodilParse/Unparse processors. Then one can add a new DaffodilReader/Writer to some other Processor (e.g. QueryRecord, SplitRecord). Once added, one can then configure that DaffodilReader/Writer to set the schema/vars/tunables/etc. It seems that capability is missing and this can only be used with the Daffodil Processors. Is that a limitation of Records, or just a missing capability?

Author

Again, I was under the impression we were going with a Processor solution, not a Controller solution. A controller solution has the disadvantage of being configured globally. Every instance of the Controller would have to use the same schema. I'll have to refactor quite a bit to revert back to a Controller solution.

Member

I just want to do what is more familiar to NiFi people. I'm under the impression that anytime Records are involved a Controller is globally created and provided to certain processors that know how to handle records. Right now, the DaffodilRecordReader/Writers can't be used with those other processors like other Reader/Writers can.

If it's too big a change for the limited remainder of time you have, not a big deal, but I think for this to be merged into NiFi, which is the eventual goal, I suspect treating the Reader/Writers as a controller is necessary. I'm far from a NiFi expert, so I could be wrong.

There's certainly going to need to be some discussions with the NiFi devs at some point in the future about the XMLReader issues, optional fields, anonymous choices, etc., so perhaps the right approach to this controller vs processor question can be punted for now and figured out in that discussion as well.

@andrewjc2000 andrewjc2000 changed the base branch from master to dev August 6, 2020 01:06
Comment on lines 294 to 298
String actualName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(actualName)
.displayName((isTunable ? "Tunable Variable " : "External Variable ") + actualName)
.description("Value to configure a specific " + (isTunable ? "tunable variable" : "external variable") + " when parsing or unparsing")
Member

I just tested this in NiFi and something weird is going on with tunables in the NiFi GUI. When I add a tunable starting with +, it doesn't show up, so I can't edit its value. If I then add a variable, it adds correctly, but there's a space where the tunable was added. So it seems like the tunable is added, but something prevents it from showing, likely a bug in NiFi.

I'm wondering if there is an issue with changing the "actualName"? I made this modification and it at least allows me to add and edit the tunable.

Suggested change
String actualName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(actualName)
.displayName((isTunable ? "Tunable Variable " : "External Variable ") + actualName)
.description("Value to configure a specific " + (isTunable ? "tunable variable" : "external variable") + " when parsing or unparsing")
String displayName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(propertyDescriptorName)
.displayName((isTunable ? "Tunable: " : "External Variable: ") + displayName)
.description("Value to configure the " + displayName + " " + (isTunable ? "tunable" : "external variable") + " when parsing or unparsing")

NiFi doesn't seem to show the display name until after you close and reopen the configure processor properties, but it does seem to fix the GUI issues.

Member (@stevedlawrence) left a comment

Looks great! 👍 Just a couple minor comments about potential clean-up.

Please squash the commits into a single commit (include the cleanup changes if you have time to make them, if not, then don't worry about it) and this is good to merge.

Comment on lines 90 to 92
log("Infoset Root: {}", rootNode);
log("Final Result: {}", RecordUtil.printRecord(result, ""));
log("Current Schema: {}", RecordUtil.printRecordSchemaHelper(currentSchema, ""));
Member

Is this standard practice in record readers, to log this information? For large infosets/records, I imagine converting the schema and record to a string could cause unnecessary overhead and impact performance. I would suggest removing these logs.

Author

All of the logs that aren't explicitly getLogger().error() or logger.error() were solely for debugging purposes. I'll take them out.

InfosetNodeOutputter outputter = new InfosetNodeOutputter();
ParseResult result = dataProcessor.parse(dataInputStream, outputter);
if (result.isError()) {
throw new IllegalStateException("Failed to parse: " + result.getDiagnostics());
Member

I'm not sure this should be an IllegalStateException. It's perfectly reasonable for a parse to fail. I think this instead wants to be a MalformedRecordException. IllegalStateExceptions should probably never actually happen, and probably indicate there is a bug in the code. A parse failing isn't considered a bug.

Author

Yeah, I think I prefer the Exceptions you had before. I'll integrate them into my own code too, catching an Unchecked Exception never sat right with me.

/**
* Note that the Reader/Writer component always winds up at index 1; we choose which one to add based
* on if this is the Parse or Unparse Processor.
*/
Member

Is this comment correct? I think this was talking about older stuff?

Author

Good catch, I'll remove it.

* Plus, at least for the moment, we have to do a bit of filtering for the output of any XML Infoset
* due to the bug that currently exists on that processor.
*
*/
Member

I think this comment is no longer correct too? RecordReader/Writers are no longer used with the Processors.

Author

You are correct. I'll remove it.

logDiagnostics(pr);
throw new DaffodilProcessingException("Failed to parse");
DaffodilProperties.logDiagnostics(getLogger(), pr);
throw new IllegalStateException("Failed to parse");
Member

This here too is not an illegal state. It should be some other exception, maybe just IOException, since MalformedRecordException doesn't make sense in this context because this isn't dealing with records.

logDiagnostics(ur);
throw new DaffodilProcessingException("Failed to unparse");
DaffodilProperties.logDiagnostics(getLogger(), ur);
throw new IllegalStateException("Failed to unparse");
Member

Same here, not an IllegalState.

@@ -13,4 +13,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
com.tresys.nifi.processors.DaffodilParse
com.tresys.nifi.processors.DaffodilUnparse
com.tresys.nifi.processors.DaffodilUnparse
Member

Looks like many files are missing a newline at the end. In general, it's a good idea to include a trailing newline. This is usually a configuration setting in your editor.

Author

Which files should have a newline at the end - just configuration files/schemas, or all the code files as well? I know there is a precedent for this but I am not sure what it is

Member

Convention is that anything you use a text editor to modify should have a trailing newline: source files, configs, schemas, test files, etc. One exception is that sometimes a test would fail if an input had a newline, so a test might leave it off in that case. But we generally try to avoid those unless explicitly testing how newlines are parsed/unparsed.

… Records, Implemented nearly all CLI options for both existing Processors and new Controllers, and moved all shared properties and methods to new global file for Processors and Controllers to use
Member (@stevedlawrence)

Thanks for the contribution!

@stevedlawrence stevedlawrence merged commit 8795400 into TresysTechnology:dev Aug 27, 2020