Conversation
- Introduced a new CLI entry point in UnfCli.java for processing input files and generating JSON reports.
- Updated README.md with CLI usage instructions and examples.
- Added unf6_schema.json to define the schema for UNF v6 calculation results.
- Created UnfCliTest.java to validate the CLI functionality with unit tests.
… metadataJson method
…mproved readability and consistency
- Exposed UnfCli.generateReport and CliOptions as public for programmatic usage.
- Added documentation for programmatic usage to README.md.
- Renamed doc/unf6.schema.json to doc/unf.schema.json and updated all references.
Pull request overview
Adds a Java command-line interface to generate UNF v6 reports as JSON (file- and dataset-level), plus schema/docs and test fixtures to validate outputs and support cross-implementation comparisons (e.g., with dartfx-unf Python).
Changes:
- Introduce `org.dataverse.unf.UnfCli` to compute UNFs for single files (line-based and CSV/TSV) and directories (dataset-level).
- Add JUnit tests and dartfx CSV/JSON fixtures for validating computed UNFs.
- Add JSON schema + documentation (README CLI usage, technical overview, contributor guide) and update `.gitignore`.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/main/java/org/dataverse/unf/UnfCli.java | New CLI + report generator (JSON output, tabular parsing, dataset aggregation). |
| src/test/java/org/dataverse/unf/UnfCliTest.java | Tests for CLI report generation on temp line/text and CSV inputs. |
| src/test/java/org/dataverse/unf/UnfDartfxTest.java | Tests validating known file-level UNFs against dartfx example CSVs. |
| src/test/resources/test/dartfx/101A.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101A.unf.json | Expected JSON report fixture for 101A. |
| src/test/resources/test/dartfx/101B.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101B.unf.json | Expected JSON report fixture for 101B. |
| src/test/resources/test/dartfx/101C.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101C.unf.json | Expected JSON report fixture for 101C. |
| src/test/resources/test/dartfx/101D.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101D.unf.json | Expected JSON report fixture for 101D. |
| doc/unf.schema.json | New JSON Schema documenting the UNF report shape. |
| doc/TECHNICAL_OVERVIEW.md | Technical architecture overview of the UNF implementation. |
| doc/CONTRIBUTOR_GUIDE.md | Contributor guidance emphasizing compatibility/stability and testing. |
| README.md | Adds CLI usage docs and programmatic usage examples. |
| .gitignore | Adds OS/editor/IDE ignores. |
```java
List<Path> files = Files.list(inputPath)
        .filter(Files::isRegularFile)
        .sorted(Comparator.comparing(Path::getFileName))
        .toList();
```
```java
    fileUnfs.add(entry.unf);
}
String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
```
```java
private static final String SOFTWARE_VERSION = "6.0.2-SNAPSHOT";
```

```java
+ "\"N\":7,"
+ "\"X\":128,"
+ "\"H\":128,"
```
```json
"version": {
  "type": "string"
}
}
```
```java
case "--has-header":
    options.hasHeader = Boolean.parseBoolean(requireValue(args, ++i, arg));
    break;
case "--column-types":
```
```java
if (Files.isDirectory(inputPath)) {
    List<FileResult> entries = new ArrayList<>();
    List<Path> files = Files.list(inputPath)
            .filter(Files::isRegularFile)
            .sorted(Comparator.comparing(Path::getFileName))
            .toList();

    if (files.isEmpty()) {
        throw new IllegalArgumentException("Input directory has no regular files: " + inputPath);
    }

    List<String> fileUnfs = new ArrayList<>();
    for (Path file : files) {
        FileResult entry = computeFileResult(file, options);
        entries.add(entry);
        fileUnfs.add(entry.unf);
    }
    String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
    resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
} else {
    FileResult fileResult = computeFileResult(inputPath, options);
    resultJson = fileResult.toJson();
}
```
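For reference, the directory branch quoted above can be exercised in isolation. The following is a self-contained sketch of just the listing-and-ordering step (class and method names here are illustrative, not from the PR); note that `Files.list` returns a stream that should be closed, hence the try-with-resources:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

public class DatasetListingSketch {
    // Illustrative sketch: list regular files, sort by file name for a
    // deterministic dataset order, and fail fast on an empty directory.
    static List<Path> regularFilesSorted(Path dir) throws Exception {
        try (var stream = Files.list(dir)) { // Files.list streams should be closed
            List<Path> files = stream
                    .filter(Files::isRegularFile)
                    .sorted(Comparator.comparing(Path::getFileName))
                    .toList();
            if (files.isEmpty()) {
                throw new IllegalArgumentException("Input directory has no regular files: " + dir);
            }
            return files;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("unf-cli-demo");
        Files.writeString(dir.resolve("b.csv"), "x\n1\n");
        Files.writeString(dir.resolve("a.csv"), "y\n2\n");
        for (Path p : regularFilesSorted(dir)) {
            System.out.println(p.getFileName()); // a.csv, then b.csv
        }
    }
}
```

Sorting by file name matters here: it makes the dataset-level UNF independent of filesystem enumeration order.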
```java
case DATETIME:
    if (datetimeFormat == null || datetimeFormat.isBlank()) {
        throw new IllegalArgumentException("--datetime-format is required for type datetime.");
    }
    String[] rows = values.toArray(new String[0]);
    String[] patterns = new String[rows.length];
    Arrays.fill(patterns, datetimeFormat);
    return UNFUtil.calculateUNF(rows, patterns);
default:
```
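The DATETIME branch replicates a single `--datetime-format` pattern across every row before handing the rows and pattern array to `calculateUNF`. A standalone sketch of that pattern-replication step, using only `java.time` rather than the UNF library, and assuming the format string is a `DateTimeFormatter`-compatible pattern:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;

public class DatetimePatternSketch {
    // Sketch: build the per-row pattern array the way the CLI snippet does,
    // then parse each row with its pattern to confirm the format applies.
    public static void main(String[] args) {
        String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; // stand-in for --datetime-format
        String[] rows = {"2024-01-15 10:30:00", "2024-02-20 08:05:59"};

        String[] patterns = new String[rows.length];
        Arrays.fill(patterns, datetimeFormat); // one shared pattern per row

        for (int i = 0; i < rows.length; i++) {
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern(patterns[i]);
            System.out.println(LocalDateTime.parse(rows[i], fmt));
        }
    }
}
```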
```java
Path tempFile = Files.createTempFile("unf-cli-string", ".txt");
Files.writeString(tempFile, "Hello World\nTesting 123\n", StandardCharsets.UTF_8);

UnfCli.CliOptions options = new UnfCli.CliOptions().withInput(tempFile.toString()).withType("string");
String json = UnfCli.generateReport(tempFile, options);

assertTrue(json.contains("\"unf_version\":\"6\""));
assertTrue(json.contains("\"type\":\"file\""));
assertTrue(json.contains("\"columns\""));
assertTrue(json.contains("\"unf\":\"UNF:6:r+FDbVC6fKdUjRS6ZIzP4w==\""));
}

@Test
void generateReport_csvFile_withTwoNumericColumns_returnsFileAndColumnUNFs() throws Exception {
    Path tempFile = Files.createTempFile("unf-cli-table", ".csv");
    Files.writeString(tempFile, "a,b\n6.6666666666666667,32\n75.216,2024\n", StandardCharsets.UTF_8);
```
4. Truncates to the most significant 128 bits.
5. Encodes as Base64 and prefixes with `UNF:<version>[:extensions]:`.

Current version string is `6` (`UnfDigest.currentVersion`).
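The truncate-and-encode steps can be sketched standalone. This assumes SHA-256 as the underlying hash (as UNF v6 uses) and hard-codes the `6` version prefix with no extensions; it is an illustrative sketch, not the library's actual `UnfDigest` code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;

public class UnfHeaderSketch {
    // Illustrative sketch: hash the normalized bytes, keep the most
    // significant 128 bits, Base64-encode, and add the version prefix.
    static String unfHeader(byte[] normalized) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(normalized);
        byte[] truncated = Arrays.copyOf(digest, 16); // 128 bits
        return "UNF:6:" + Base64.getEncoder().encodeToString(truncated);
    }

    public static void main(String[] args) throws Exception {
        String unf = unfHeader("1.23\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(unf);
    }
}
```

Note that the real signature depends on steps 1-3 (normalization) as well, so this sketch will not reproduce the library's values; it only illustrates the truncation and encoding.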
Looks like Copilot has a few good suggestions. Let me know if you need me to resolve.

@kulnor Thanks again for the PR. It properly identifies var1 and var3 as numeric, passes these vectors as Java arrays of the correct type to the UNFUtil proper, and gets the correct signatures, identical to the ones produced by Dataverse. From a very quick look, the issue is likely this and other similar methods: https://github.com/kulnor/UNF-dataverse/blob/master/src/main/java/org/dataverse/unf/UnfCli.java#L659-L666 - they just need to be adjusted so that they do not jump to conclusions prematurely. Overall, I am quite excited about having this interface added; it will give us a simple but potentially very useful standalone tool, as an extra way to calculate UNFs outside of Dataverse.

Glad this is useful and that it may lead to a patch for type detection. I can also keep the generated JSON aligned with the one produced by the Python package (it is now, but in case I adjust it). We're starting to find other use cases for UNF, so we hope this helps grow the adoption.
```java
private static boolean isLong(String value) {
    try {
        Long.parseLong(value);
        return true;
    } catch (NumberFormatException ex) {
        return false;
    }
}
```
It looks like we just need to modify this and similar methods to be less rigid: the fact that a value is an empty string should not be sufficient to conclude that the vector is NOT a Long, Int, etc. The final decision should only be made on the combined vector, based on the types of the non-empty values in it, as long as some are present.
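A sketch of that suggestion (class, enum, and method names here are hypothetical, not the PR's code): skip empty cells entirely and classify the column only from its non-empty values.

```java
import java.util.List;

public class TypeDetectionSketch {
    enum ColumnType { LONG, DOUBLE, STRING }

    // Hypothetical sketch of the reviewer's suggestion: empty cells are
    // missing values, not type evidence; decide from non-empty values only.
    static ColumnType detect(List<String> values) {
        boolean sawNonEmpty = false;
        boolean allLong = true;
        boolean allDouble = true;
        for (String v : values) {
            if (v == null || v.isBlank()) {
                continue; // skip missing values entirely
            }
            sawNonEmpty = true;
            if (allLong && !isLong(v)) allLong = false;
            if (allDouble && !isDouble(v)) allDouble = false;
        }
        if (!sawNonEmpty) return ColumnType.STRING; // all-missing: a policy choice
        if (allLong) return ColumnType.LONG;
        if (allDouble) return ColumnType.DOUBLE;
        return ColumnType.STRING;
    }

    static boolean isLong(String v) {
        try { Long.parseLong(v); return true; } catch (NumberFormatException ex) { return false; }
    }

    static boolean isDouble(String v) {
        try { Double.parseDouble(v); return true; } catch (NumberFormatException ex) { return false; }
    }
}
```

With this shape, `["1", "", "3"]` still classifies as a long column, which is the behavior the comment asks for.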
I will look at the Copilot suggestions carefully later on. (My own experience with it has been a mixed bag in terms of the quality of its advice.)
I have recently implemented a UNF Python package and, to compare/validate the outputs, added a command-line utility to this Dataverse / Java implementation:
It outputs a comprehensive JSON document that includes file- and variable-level UNFs, along with options for the algorithm and the tool (see the test directory for simple examples).
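For illustration, a file-level report of that shape might look like the following (field names are inferred from the test assertions quoted earlier in this thread, and the `unf` value shown is the one from the string-input test; the authoritative structure is doc/unf.schema.json):

```json
{
  "unf_version": "6",
  "type": "file",
  "unf": "UNF:6:r+FDbVC6fKdUjRS6ZIzP4w==",
  "columns": []
}
```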
Along the way, I have also generated a technical overview and a contributor guide. Note that this was mostly vibe-coded and should not affect the current code (it's an add-on).
FYI, I discussed this by email with Micah and Leo.