
Daffodil Processors now use NiFi Records, and most other CLI Options have been implemented #1

Merged (1 commit) on Aug 27, 2020

Conversation

andrewjc2000

More specifically, DaffodilParse can be configured with a RecordSetWriter component that takes a Daffodil Infoset in NiFi Record form and transforms it into a FlowFile. DaffodilUnparse can be configured with a RecordReader component that takes a Daffodil Infoset in some FlowFile form, transforms it into a NiFi Record, and then unparses this Record back into data.

In addition, all CLI Options except for the Root, Path, and Pre-Compiled Parser options are now available as Properties.

Member (@stevedlawrence) left a comment

Had a handful of comments, mainly surrounding usability questions/concerns. Looks really nice though. Even if changes are needed and you aren't able to get to them, this is a great amount of work that should not require too much effort to finalize.

Because it depends on a snapshot build of Daffodil, can you please modify the pull request to merge into the dev branch? That way, when it is ready to merge, we can make it clear that it shouldn't be used except by devs until 3.0.0 comes out and any remaining NiFi bugs are resolved.

.gitignore Outdated
.idea/*
target/*
nifi-daffodil-nar/target/
nifi-daffodil-processors/target/
Member

In general, I try to follow the standard that only intermediate build files should be listed in a repo's .gitignore file. For maven, I would expect that "target" should be sufficient. Files that are created by a user's specific development environment (such as .idea and .iml) should be handled some other way, such as with a global git core.excludesFile.

Author

Understood, I'll remove the rest.

unparses the infoset to the original file format
* DaffodilParse: Reads a FlowFile and parses the data into a NiFi Record, which is then converted into an Infoset by a NiFi RecordSetWriter component.
* DaffodilUnparse: Reads a FlowFile containing an infoset in some form, reads it using the correct NiFi RecordReader component and converts it into Records, and then unparses these Records to the original file format.

Member

Thoughts on keeping the DaffodilParse and DaffodilUnparse processors exactly the same as before? My concern here is that there are some odd edge cases due to how Records work in NiFi. In some cases, I can imagine we will want to avoid those, and just have Daffodil read/write the infoset directly, rather than going through Records.

We can still have the DaffodilRecordReader/Writer Controllers, which I think should be able to be used in any of the standard NiFi Processors that accept Record Readers/Writers.

This essentially gives users two options: one where they use Daffodil directly, which is limited to Daffodil's XML/JSON output but has no infoset quirks, and one where they use Daffodil via the new Record Readers/Writers, which may have some quirks to deal with.

Author

I'm not a fan of users having these Controllers because Controllers, unlike Processors, have to be configured globally. If we want to keep what we had before, I say we make 4 processors available: the 2 processors from before and the 2 new ones that use Records. I think this can be done with quite a bit of code reuse if we want to go this route.

Member

Isn't that how controllers work in NiFi? What is wrong with them being configured globally? We should follow NiFi conventions as closely as possible so people already familiar with using Records will be able to plug this new one in without having to learn something new.

Author (@andrewjc2000), Aug 5, 2020

I'm saying that by virtue of the fact that Controllers work this way, Processors seem like the more logical representation of the Daffodil tool. But if you don't think it's a problem that every instance of the Controller would be on the same schema at a given time (i.e. every controller would have to be set to parse JPEGs or PNGs; you couldn't have two going at the same time processing different things), then I suppose I can bring back the Controller solution.

Member

Are you sure that NiFi can't create different instances of Controllers with different properties? That seems like a necessary capability.

Member

To test, I created two SplitRecord processors; for each, I configured the RecordReader property, selected "Create new service", and selected "CSV Reader" for both. This results in two CSV Readers that can be configured independently. Seems like this should allow us to have multiple DaffodilReaderWriter controllers that are configured for different schemas.

Author

Oh, it would appear you are correct. You can even rename them in order to distinguish them. I apologize for my ignorance; I did not think they were as flexible as Processors.

When I tried to develop controllers before, it required some serious restructuring of the project for it to build properly, but I can try to see if I can get away with keeping the current structure.

Member

No problem. I know you have limited time left, so if there isn't enough time to make the change it's not a big deal, especially if there are any other updates you have planned. If you do run out of time, please just make sure to document where you're at, so that if someone picks up the remaining work they have some idea where to start off. Thanks for all your great work!

Author

I would definitely be open to going the controller route during my work day today and tomorrow. Would you like to see something in the end like the project containing the original 2 processors, only modified with the new CLI options, and then the 2 new controllers in the same project? I think that would be very realistic. If we do go this route, for the sake of code reuse, would you mind if I moved out every property into a shared sort of static class, so that way both the Processors and Controllers could use them? It would stink to have to define all of them the same way in 2 places.

Member

Yeah, I think the two original processors and two new controllers would probably be the ideal setup. That way people can use the native records (which are probably preferred) with the controllers, but can switch to the processors if they run into the various bugs/quirks you've discovered.

Some sort of shared class to reduce duplication is definitely a good idea 👍 I suspect many of the properties will be exactly the same.

## Processor Properties

Each Processor has a number of configurable properties intended to be analogous to the [CLI options](https://daffodil.apache.org/cli/) for the Daffodil tool. Here is a note about the __Stream__ option:
- __Stream Mode:__ This mode is disabled by default. When enabled, if there is leftover data after a parse, it does not error out; instead the parse is repeated, generating a Record per parse. If all parses are successful, a Set of Records is generated. When using this mode with the XML Reader and Writer components, the Writer component must be configured with a name for the Root Tag, and the Reader component must be configured with "Expect Records as Array" set to true.

## Build Instructions

Member

Please wrap these lines. Technically it doesn't matter in markdown files, but it makes it easier to read when viewing them as raw text, which is still common.

Author

Are you saying to split it up into multiple lines?

Member

Yep, just wrap everything at 80 or 100 characters or something reasonable.

README.md Outdated
## Processor Properties

Each Processor has a number of configurable properties intended to be analogous to the [CLI options](https://daffodil.apache.org/cli/) for the Daffodil tool. Here is a note about the __Stream__ option:
- __Stream Mode:__ This mode is disabled by default. When enabled, if there is leftover data after a parse, it does not error out; instead the parse is repeated, generating a Record per parse. If all parses are successful, a Set of Records is generated. When using this mode with the XML Reader and Writer components, the Writer component must be configured with a name for the Root Tag, and the Reader component must be configured with "Expect Records as Array" set to true.
Member

What happens if the settings aren't correct? Do things fail? Also, when streaming, what happens if the first parse succeeds but the second fails? Does it still create the first record while the second goes to a failure, or is the whole thing considered a failure?

Author

Yes, things fail if these settings aren't enabled. The XML components are pretty picky. For your second question, I currently have the whole thing failing, but I could easily change it.
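For context, a minimal sketch of the shape the XML components expect in stream mode (the element names here are made up for illustration, not taken from the project): the Writer wraps every per-parse record in the configured root tag, and the Reader then treats the children of that root as an array of records.

```xml
<!-- Hypothetical streamed infoset after three successful parses of a
     schema whose root element is "record". The wrapping <results> tag
     is the name configured as the Writer's root tag; the Reader needs
     "Expect Records as Array" set to true to unwrap it. -->
<results>
  <record><field>a</field></record>
  <record><field>b</field></record>
  <record><field>c</field></record>
</results>
```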

Member

As you have it is fine. This might be something that people would want to be configurable for a future enhancement, but don't worry about it for this PR. It would be good to document this to make it more clear what happens on failure.

<artifactId>incubator-daffodil</artifactId>
<version>92d2036e3d</version>
</dependency>
<!-- For most of the NIFI imports -->
Member

I would rather this just depend on 3.0.0-SNAPSHOT and require users to build and locally publish their own version of Daffodil.

jitpack looks like something that the Apache Software Foundation would frown upon. Code and packaged code must go through a pretty strict release process to verify that there are no licensing issues before being made publicly available, which this artifact skipped. This will need some clarification from the ASF IPMC whether this is okay, but I'd prefer that it just rely on a user-built snapshot and not use jitpack.

Author

I tried including this dependency and it didn't work:

<dependency>
    <groupId>org.apache.daffodil</groupId>
    <artifactId>daffodil-japi_2.12</artifactId>
    <version>3.0.0-SNAPSHOT</version>
</dependency>

Is there another special attribute needed because it is a snapshot version?

Member

This is correct, but then you'll need to run sbt publishM2 from the daffodil repo to publish to a local maven repo where maven can find the dependency.

Author

Ok, I see - should we add that step to the readme and then take it out once 3.0.0 is released?

Member

Yeah, that's probably the right approach for the dev branch, which presumably will also depend on a snapshot of daffodil.

Author

For some reason when I switched to this dependency I started getting the following build error:
[ERROR] C:\Users\achafos\Desktop\nifi-daffodil\nifi-daffodil-processors\src\main\java\com\tresys\nifi\processors\AbstractDaffodilProcessor.java:30:31: error: cannot access ValidationMode

Very bizarre, so for now I'll leave in what I had before, but I definitely would prefer to resolve this before the end of the review.

Member

Your latest changes build for me if I set the dependency to the one above and run sbt publishM2 from the daffodil repo. You might try clearing your maven cache (~/.m2), deleting all the target directories, and republishing the daffodil snapshot. Maybe the dependency switch caused some maven confusion or conflict.

Author

I deleted .m2, I deleted .ivy, I deleted .sbt, and it still resulted in the same problem. Now Unit Tests aren't running right either, very bizarre.

Author

I'll circle back to this as the last change to the project, since I need a working build environment in order to develop. Everything works if I keep the Jitpack dependency, as unprofessional/hacky as it is.

Member

Not a problem. This is low hanging fruit that can be easily fixed at some point in the future.

try (final RecordReader reader = readerFactory.createRecordReader(originalAttributes, inputStream, original.getSize(), getLogger())) {
// Get the first record and process it before we create the Record Writer. We do this so that if the Processor
// updates the Record's schema, we can provide an updated schema to the Record Writer. If there are no records,
// then we can simply create the Writer with the Reader's schema and begin & end the Record Set.
Member

I suspect making changes to a schema won't be acceptable to NiFi devs. The point of a schema is to properly describe data; changing the schema based on data doesn't really make any sense. It's perhaps okay for now, but ultimately NiFi needs to be modified to support the kinds of schemas that Daffodil describes.

Author

Yes, I entirely agree - this would be a non-issue if optional fields existed in NiFi RecordSchemas and components knew how to deal with them because of that fact.

recordCount.set(writeResult.getRecordCount());
}
return;
}
Member

It's not clear what is going on here. If reader.nextRecord returns null (i.e. we didn't read a record?) then we still write something out? What gets written in this case?

Author

nextRecord is essentially next if you think of a RecordReader component as an iterator. So, we keep going while there are valid records produced. For our case, the only time multiple records are produced in parsing is when stream mode is enabled, but that's not necessarily true for unparse.

By the way, I didn't write most of the code in this new onTrigger method - it's copied and pasted from https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/AbstractRecordProcessor.java . This is the standard way Records are created in NiFi.
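That iterator-style contract can be sketched with stub types (the classes below are simplified stand-ins for NiFi's RecordReader/RecordSetWriter, not the real API, so this compiles without NiFi on the classpath):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RecordLoopSketch {
    // Stand-in for a NiFi Record: just carries a value for illustration.
    interface Record { String value(); }

    // Stand-in reader: nextRecord() returns null once exhausted,
    // exactly like the iterator contract described above.
    static class StubReader {
        private final Deque<Record> records = new ArrayDeque<>();
        StubReader(String... values) {
            for (String v : values) records.add(() -> v);
        }
        Record nextRecord() { return records.poll(); }
    }

    // The standard copy loop: keep reading until nextRecord() is null,
    // writing each record out and counting how many were produced.
    static int copyAll(StubReader reader, StringBuilder writer) {
        int count = 0;
        Record record;
        while ((record = reader.nextRecord()) != null) {
            writer.append(record.value());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        int n = copyAll(new StubReader("a", "b", "c"), out);
        System.out.println(n + ":" + out);
    }
}
```

With stream mode enabled, each successful parse would contribute one record to this loop; with it disabled, the reader yields at most one.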

attributes.put(CoreAttributes.MIME_TYPE.key(), writer.getMimeType());
attributes.putAll(writeResult.getAttributes());
recordCount.set(writeResult.getRecordCount());
}
Member

Very similar hunk to above, and it's not clear to me what is going on. It seems to me we only need this second hunk, and this just becomes:

  writer.beginRecordSet();
  Record record;
  while ((record = reader.nextRecord()) != null) {
    writer.write(record);
  }
  final WriteResult writeResult = writer.finishRecordSet();
 ...

What am I missing?

if (content.toLowerCase().startsWith("<?xml")) {
content = content.replaceAll("<(\\S+)></(\\1)>", "<$1>\u200B</$1>");
outputStream.write(content.getBytes(StandardCharsets.UTF_8));
outputStream.flush();
Member

This only helps with XML infosets that we create with this processor. If an XML infoset comes from elsewhere, it's going to fail. So this is a hack that has potential performance implications, and it won't even fix all issues. I would rather just say this is a known issue with the XMLReader and that we can't support those until this is fixed.
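To make the hack under discussion concrete, here is the regex transformation in isolation: it inserts a zero-width space (U+200B) into empty elements so NiFi's XMLReader does not drop them. The sample infoset is made up for illustration; note the pattern only matches attribute-free elements, one of the gaps mentioned above.

```java
public class EmptyElementWorkaround {
    // Insert a zero-width space into empty XML elements, e.g. <a></a>
    // becomes <a>\u200B</a>, so downstream readers keep the field.
    // Elements with attributes are not matched, since \S+ cannot span
    // the space before an attribute.
    static String fillEmptyElements(String content) {
        return content.replaceAll("<(\\S+)></(\\1)>", "<$1>\u200B</$1>");
    }

    public static void main(String[] args) {
        String infoset = "<root><empty></empty><full>data</full></root>";
        // Prints the infoset with an invisible U+200B inside <empty>;
        // <full> is left untouched.
        System.out.println(fillEmptyElements(infoset));
    }
}
```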

this.currentSchema = originalSchema;
this.streamMode = streamMode;
}

Member

This isn't how I imagined this would be used. Can this RecordReader/Writer not be used with other Record processors? Can it only be used with DaffodilParse/Unparse? That feels incorrect. This RecordReader/Writer should probably have its own properties, similar to those of the DaffodilParse/Unparse processors. Then one can add a new DaffodilReader/Writer to some other Processor (e.g. QueryRecord, SplitRecord). Once added, one can then configure that DaffodilReader/Writer to set the schema/vars/tunables/etc. It seems that capability is missing and this can only be used with the Daffodil Processors. Is that a limitation of Records, or just a missing capability?

Author

Again, I was under the impression we were going with a Processor solution, not a Controller solution. A controller solution has the disadvantage of being configured globally. Every instance of the Controller would have to use the same schema. I'll have to refactor quite a bit to revert back to a Controller solution.

Member

I just want to do what is more familiar to NiFi people. I'm under the impression that anytime Records are involved a Controller is globally created and provided to certain processors that know how to handle records. Right now, the DaffodilRecordReader/Writers can't be used with those other processors like other Reader/Writers can.

If it's too big a change for the limited remainder of time you have, not a big deal, but I think for this to be merged into NiFi, which is the eventual goal, I suspect treating the Reader/Writers as a controller is necessary. I'm far from a NiFi expert, so I could be wrong.

There's certainly going to need to be some discussions with the NiFi devs at some point in the future about the XMLReader issues, optional fields, anonymous choices, etc., so perhaps the right approach to this controller vs processor question can be punted for now and figured out in that discussion as well.

@andrewjc2000 andrewjc2000 changed the base branch from master to dev August 6, 2020 01:06
Comment on lines 294 to 298
String actualName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(actualName)
.displayName((isTunable ? "Tunable Variable " : "External Variable ") + actualName)
.description("Value to configure a specific " + (isTunable ? "tunable variable" : "external variable") + " when parsing or unparsing")
Member

I just tested this in NiFi and something weird is going on with tunables in the NiFi GUI. When I add a tunable starting with +, it doesn't show up, so I can't edit its value. If I then add a variable, it adds correctly, but there's a space where the tunable was added. So it seems like the tunable is added, but something prevents it from showing, likely a bug in NiFi.

I'm wondering if there is an issue with changing the "actualName"? I made this modification and it at least allows me to add and edit the tunable.

Suggested change
String actualName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(actualName)
.displayName((isTunable ? "Tunable Variable " : "External Variable ") + actualName)
.description("Value to configure a specific " + (isTunable ? "tunable variable" : "external variable") + " when parsing or unparsing")
String displayName = isTunable ? propertyDescriptorName.substring(1) : propertyDescriptorName;
return new PropertyDescriptor.Builder()
.name(propertyDescriptorName)
.displayName((isTunable ? "Tunable: " : "External Variable: ") + displayName)
.description("Value to configure the " + displayName + " " + (isTunable ? "tunable" : "external variable") + " when parsing or unparsing")

NiFi doesn't seem to show the display name until after you close and reopen the configure processor properties, but it does seem to fix the GUI issues.

Member (@stevedlawrence) left a comment

Looks great! 👍 Just a couple minor comments about potential clean-up.

Please squash the commits into a single commit (include the cleanup changes if you have time to make them, if not, then don't worry about it) and this is good to merge.

Comment on lines 90 to 92
log("Infoset Root: {}", rootNode);
log("Final Result: {}", RecordUtil.printRecord(result, ""));
log("Current Schema: {}", RecordUtil.printRecordSchemaHelper(currentSchema, ""));
Member

Is this standard practice in record readers, to log this information? For large infosets/records, I imagine converting the schema and record to a string could cause unnecessary overhead and impact performance. I would suggest removing these logs.

Author

All of the logs that aren't explicitly getLogger().error() or logger.error() were solely for debugging purposes. I'll take them out.

InfosetNodeOutputter outputter = new InfosetNodeOutputter();
ParseResult result = dataProcessor.parse(dataInputStream, outputter);
if (result.isError()) {
throw new IllegalStateException("Failed to parse: " + result.getDiagnostics());
Member

I'm not sure this should be an IllegalStateException. It's perfectly reasonable for a parse to fail. I think this instead wants to be a MalformedRecordException. IllegalStateExceptions should probably never actually happen, and probably indicate there is a bug in the code. A parse failing isn't considered a bug.

Author

Yeah, I think I prefer the Exceptions you had before. I'll integrate them into my own code too, catching an Unchecked Exception never sat right with me.

/**
* Note that the Reader/Writer component always winds up at index 1; we choose which one to add based
* on if this is the Parse or Unparse Processor.
*/
Member

Is this comment correct? I think this was talking about older stuff?

Author

Good catch, I'll remove it.

* Plus, at least for the moment, we have to do a bit of filtering for the output of any XML Infoset
* due to the bug that currently exists on that processor.
*
*/
Member

I think this comment is no longer correct too? RecordReader/Writers are no longer used with the Processors.

Author

You are correct. I'll remove it.

logDiagnostics(pr);
throw new DaffodilProcessingException("Failed to parse");
DaffodilProperties.logDiagnostics(getLogger(), pr);
throw new IllegalStateException("Failed to parse");
Member

This here too is not an illegal state. It should be some other exception, maybe just IOException, since MalformedRecordException doesn't make sense in this context because this isn't dealing with records.

logDiagnostics(ur);
throw new DaffodilProcessingException("Failed to unparse");
DaffodilProperties.logDiagnostics(getLogger(), ur);
throw new IllegalStateException("Failed to unparse");
Member

Same here, not an IllegalState.

@@ -13,4 +13,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
com.tresys.nifi.processors.DaffodilParse
com.tresys.nifi.processors.DaffodilUnparse
com.tresys.nifi.processors.DaffodilUnparse
Member

Looks like many files are missing a newline at the end. In general, it's a good idea to include a trailing newline. This is usually a configuration setting in your editor.

Author

Which files should have a newline at the end - just configuration files/schemas, or all the code files as well? I know there is a precedent for this but I am not sure what it is

Member

Convention is that anything you use a text editor to modify should have a trailing newline: source files, configs, schemas, test files, etc. One exception is that sometimes a test would fail if an input had a newline, so a test might leave it off in that case. But we generally try to avoid those unless explicitly testing how newlines are parsed/unparsed.

… Records, Implemented nearly all CLI options for both existing Processors and new Controllers, and moved all shared properties and methods to new global file for Processors and Controllers to use
Member (@stevedlawrence)

Thanks for the contribution!

@stevedlawrence stevedlawrence merged commit 8795400 into TresysTechnology:dev Aug 27, 2020