
Command Line Interface for UNF #10

Open
kulnor wants to merge 12 commits into IQSS:master from kulnor:master

Conversation

@kulnor kulnor commented Mar 17, 2026

I have recently implemented a UNF Python package and, to compare/validate the outputs, added a command-line utility to this Dataverse / Java implementation:

java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli --input <path> [options]

It outputs a comprehensive JSON document that includes file- and variable-level UNFs, along with options for the algorithm and the tool (see the test directory for simple examples).

Along the way, I have also generated a technical overview and a contributor guide. Note that this was mostly vibe-coded and should not affect the current code (it's an add-on).

FYI, I discussed this by email with Micah and Leo.

kulnor added 12 commits March 9, 2026 13:41
- Introduced a new CLI entry point in UnfCli.java for processing input files and generating JSON reports.
- Updated README.md with CLI usage instructions and examples.
- Added unf6_schema.json to define the schema for UNF v6 calculation results.
- Created UnfCliTest.java to validate the CLI functionality with unit tests.
- Exposed UnfCli.generateReport and CliOptions as public for programmatic usage.
- Added documentation for programmatic usage to README.md.
- Renamed doc/unf6.schema.json to doc/unf.schema.json and updated all references.
Copilot AI review requested due to automatic review settings March 17, 2026 17:30

Copilot AI left a comment

Pull request overview

Adds a Java command-line interface to generate UNF v6 reports as JSON (file- and dataset-level), plus schema/docs and test fixtures to validate outputs and support cross-implementation comparisons (e.g., with dartfx-unf Python).

Changes:

  • Introduce org.dataverse.unf.UnfCli to compute UNFs for single files (line-based and CSV/TSV) and directories (dataset-level).
  • Add JUnit tests and dartfx CSV/JSON fixtures for validating computed UNFs.
  • Add JSON schema + documentation (README CLI usage, technical overview, contributor guide) and update .gitignore.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| src/main/java/org/dataverse/unf/UnfCli.java | New CLI + report generator (JSON output, tabular parsing, dataset aggregation). |
| src/test/java/org/dataverse/unf/UnfCliTest.java | Tests for CLI report generation on temp line/text and CSV inputs. |
| src/test/java/org/dataverse/unf/UnfDartfxTest.java | Tests validating known file-level UNFs against dartfx example CSVs. |
| src/test/resources/test/dartfx/101A.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101A.unf.json | Expected JSON report fixture for 101A. |
| src/test/resources/test/dartfx/101B.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101B.unf.json | Expected JSON report fixture for 101B. |
| src/test/resources/test/dartfx/101C.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101C.unf.json | Expected JSON report fixture for 101C. |
| src/test/resources/test/dartfx/101D.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101D.unf.json | Expected JSON report fixture for 101D. |
| doc/unf.schema.json | New JSON Schema documenting the UNF report shape. |
| doc/TECHNICAL_OVERVIEW.md | Technical architecture overview of the UNF implementation. |
| doc/CONTRIBUTOR_GUIDE.md | Contributor guidance emphasizing compatibility/stability and testing. |
| README.md | Adds CLI usage docs and programmatic usage examples. |
| .gitignore | Adds OS/editor/IDE ignores. |

Comment on lines +74 to +77
List<Path> files = Files.list(inputPath)
.filter(Files::isRegularFile)
.sorted(Comparator.comparing(Path::getFileName))
.toList();
fileUnfs.add(entry.unf);
}
String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
Comment on lines +37 to +38
private static final String SOFTWARE_VERSION = "6.0.2-SNAPSHOT";

Comment on lines +287 to +289
+ "\"N\":7,"
+ "\"X\":128,"
+ "\"H\":128,"
"version": {
"type": "string"
}
}
Comment on lines +532 to +535
case "--has-header":
options.hasHeader = Boolean.parseBoolean(requireValue(args, ++i, arg));
break;
case "--column-types":
Comment on lines +72 to +94
if (Files.isDirectory(inputPath)) {
    List<FileResult> entries = new ArrayList<>();
    List<Path> files = Files.list(inputPath)
            .filter(Files::isRegularFile)
            .sorted(Comparator.comparing(Path::getFileName))
            .toList();

    if (files.isEmpty()) {
        throw new IllegalArgumentException("Input directory has no regular files: " + inputPath);
    }

    List<String> fileUnfs = new ArrayList<>();
    for (Path file : files) {
        FileResult entry = computeFileResult(file, options);
        entries.add(entry);
        fileUnfs.add(entry.unf);
    }
    String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
    resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
} else {
    FileResult fileResult = computeFileResult(inputPath, options);
    resultJson = fileResult.toJson();
}
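
A note for anyone adapting the directory branch above: `Files.list` returns a stream backed by an open directory handle, and its Javadoc recommends closing it with try-with-resources. The following is a minimal sketch of the same listing step written that way. The class and method names are hypothetical, and this is not code from the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectoryListingSketch {

    /**
     * Returns the regular files directly inside dir, sorted by file name.
     * The directory stream is closed when the try block exits.
     */
    public static List<Path> regularFilesSortedByName(Path dir) throws IOException {
        try (Stream<Path> stream = Files.list(dir)) {
            return stream
                    .filter(Files::isRegularFile)
                    .sorted(Comparator.comparing(Path::getFileName))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative input only: a throwaway directory with two files and a subdirectory.
        Path dir = Files.createTempDirectory("unf-listing-sketch");
        Files.writeString(dir.resolve("b.csv"), "x\n1\n");
        Files.writeString(dir.resolve("a.csv"), "x\n2\n");
        Files.createDirectory(dir.resolve("nested"));

        for (Path file : regularFilesSortedByName(dir)) {
            System.out.println(file.getFileName());
        }
    }
}
```

Sorting by file name keeps the order of the per-file UNFs deterministic before they are combined, which is what the `sorted(...)` call in the quoted snippet does as well.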
Comment on lines +233 to +241
case DATETIME:
    if (datetimeFormat == null || datetimeFormat.isBlank()) {
        throw new IllegalArgumentException("--datetime-format is required for type datetime.");
    }
    String[] rows = values.toArray(new String[0]);
    String[] patterns = new String[rows.length];
    Arrays.fill(patterns, datetimeFormat);
    return UNFUtil.calculateUNF(rows, patterns);
default:
Comment on lines +31 to +47
Path tempFile = Files.createTempFile("unf-cli-string", ".txt");
Files.writeString(tempFile, "Hello World\nTesting 123\n", StandardCharsets.UTF_8);

UnfCli.CliOptions options = new UnfCli.CliOptions().withInput(tempFile.toString()).withType("string");
String json = UnfCli.generateReport(tempFile, options);

assertTrue(json.contains("\"unf_version\":\"6\""));
assertTrue(json.contains("\"type\":\"file\""));
assertTrue(json.contains("\"columns\""));
assertTrue(json.contains("\"unf\":\"UNF:6:r+FDbVC6fKdUjRS6ZIzP4w==\""));
}

@Test
void generateReport_csvFile_withTwoNumericColumns_returnsFileAndColumnUNFs() throws Exception {
Path tempFile = Files.createTempFile("unf-cli-table", ".csv");
Files.writeString(tempFile, "a,b\n6.6666666666666667,32\n75.216,2024\n", StandardCharsets.UTF_8);

4. Truncates to the most significant 128 bits.
5. Encodes as Base64 and prefixes with `UNF:<version>[:extensions]:`.

Current version string is `6` (`UnfDigest.currentVersion`).
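
The last two steps quoted above (truncate to the most significant 128 bits, then Base64-encode and prefix) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, the earlier normalization steps are not shown in this excerpt and are skipped, SHA-256 is assumed as the hash only to produce a digest to work with, and the optional `[:extensions]` part of the prefix is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

public class UnfEncodingSketch {

    /**
     * Keeps the first 16 bytes (the most significant 128 bits) of a digest,
     * Base64-encodes them, and prefixes the result with "UNF:<version>:".
     */
    public static String encode(byte[] digest, String version) {
        if (digest.length < 16) {
            throw new IllegalArgumentException("Digest must be at least 128 bits long.");
        }
        byte[] truncated = Arrays.copyOf(digest, 16);
        return "UNF:" + version + ":" + Base64.getEncoder().encodeToString(truncated);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Placeholder input: a real UNF hashes the normalized data, not a raw string.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest("placeholder".getBytes(StandardCharsets.UTF_8));
        System.out.println(encode(digest, "6"));
    }
}
```

Sixteen bytes always encode to 24 Base64 characters ending in `==`, which matches the shape of the signature asserted in the test snippet above (`UNF:6:r+FDbVC6fKdUjRS6ZIzP4w==`).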
@pdurbin pdurbin moved this to Ready for Triage in IQSS Dataverse Project Mar 17, 2026
kulnor (Author) commented Mar 17, 2026

Looks like Copilot has a few good suggestions. Let me know if you need me to resolve them.

@landreev landreev self-assigned this Apr 2, 2026
landreev (Contributor) commented Apr 2, 2026

@kulnor Thanks again for the PR.
I'm experimenting with the CLI now. I'm encouraged to see that it's generally doing a good job detecting or guessing the data types of individual columns. For example, in the test case we discussed in the email thread:

var1,var2,var3
1,1,2
2,,2
3,3,3
4,4,4

It properly identifies var1 and var3 as numeric, passes these vectors as Java arrays of the correct type to UNFUtil proper, and gets the correct signatures, identical to the ones produced by Dataverse.
The problem with the second column, var2, should therefore be solvable, hopefully with some simple tweaks to the logic the CLI uses to make these educated guesses.

From a very quick look, the issue is likely in this and other similar methods: https://github.com/kulnor/UNF-dataverse/blob/master/src/main/java/org/dataverse/unf/UnfCli.java#L659-L666, which just need to be adjusted so that they do not jump to conclusions prematurely.

Overall, I am quite excited about having this interface added: it will give us a simple but potentially very useful standalone tool, as an extra way to calculate UNFs outside of Dataverse.

kulnor (Author) commented Apr 2, 2026

Glad this is useful and that it may lead to a patch for type detection. I can also keep the generated JSON aligned with the one produced by the Python package (it is aligned now, but I may adjust it later). We're starting to find other use cases for UNF, so we hope this helps grow adoption.

Comment on lines +659 to +666
private static boolean isLong(String value) {
    try {
        Long.parseLong(value);
        return true;
    } catch (NumberFormatException ex) {
        return false;
    }
}

landreev (Contributor) commented Apr 2, 2026

It looks like we just need to make this and similar methods less rigid: an empty string alone should not be sufficient to conclude that a value is NOT a Long, Int, etc. The final decision should be made only on the combined vector, based on the types of the non-empty values in it, as long as some are present.
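
One way to read this suggestion is sketched below: treat empty cells as missing, skip them while probing, and decide the column type only from the non-empty values. This is a hedged illustration, not the change made in the PR. The class, the `ColumnType` enum, and `inferType` are hypothetical names, and falling back to a string column when every cell is empty is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnTypeSketch {

    /** Hypothetical type labels; the CLI's own types may differ. */
    public enum ColumnType { LONG, DOUBLE, STRING }

    /** Empty or whitespace-only cells count as missing, not as evidence of a type. */
    public static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static boolean isLong(String value) {
        try {
            Long.parseLong(value.trim());
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    public static boolean isDouble(String value) {
        try {
            Double.parseDouble(value.trim());
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    /** Decides the type from the whole vector, looking only at non-missing cells. */
    public static ColumnType inferType(List<String> column) {
        boolean sawValue = false;
        boolean allLong = true;
        boolean allDouble = true;
        for (String cell : column) {
            if (isMissing(cell)) {
                continue; // a missing cell does not rule out any type
            }
            sawValue = true;
            allLong = allLong && isLong(cell);
            allDouble = allDouble && isDouble(cell);
        }
        if (!sawValue) {
            return ColumnType.STRING; // assumption: an all-missing column falls back to string
        }
        if (allLong) {
            return ColumnType.LONG;
        }
        return allDouble ? ColumnType.DOUBLE : ColumnType.STRING;
    }

    public static void main(String[] args) {
        // var2 from the test case in this thread: 1, (empty), 3, 4
        System.out.println(inferType(Arrays.asList("1", "", "3", "4")));
    }
}
```

With this approach the per-value probes stay strict, and the tolerance for empty cells lives in one place, the vector-level decision.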

landreev (Contributor) commented Apr 2, 2026

I will look at the Copilot suggestions carefully later on. (My own experience with it has been a mixed bag in terms of the quality of its advice.)
