Skip to content

Conversation

@jeltsch
Copy link
Collaborator

@jeltsch jeltsch commented Apr 14, 2025

This pull requests adds a script that helps in finding missing IO specializations, along with some example input and output.

When running the command scripts/find-missing-io-specialisations.sh src/**/*.hs on my Ubuntu 24.04 machine while being at the current main tip, 78a3231, I get the following output:

Database.LSMTree.Internal.Arena.newArenaManager
Database.LSMTree.Internal.Arena.withArena
Database.LSMTree.Internal.Arena.scrambleArena
Database.LSMTree.Internal.Arena.scrambleBlock
Database.LSMTree.Internal.Arena.withUnmanagedArena
Database.LSMTree.Internal.Arena.allocateFromArena
Database.LSMTree.Internal.Arena.newBlockWithFree
Database.LSMTree.Internal.BlobFile.readBlob
Database.LSMTree.Internal.BlobRef.releaseBlobRef
Database.LSMTree.Internal.BlobRef.readRawBlobRef
Database.LSMTree.Internal.ChecksumHandle.dropCache
Database.LSMTree.Internal.ChecksumHandle.closeHandle
Database.LSMTree.Internal.CRC32C.expectValidFile
Database.LSMTree.Internal.doesSnapshotDirExist
Database.LSMTree.Internal.IncomingRun.newIncomingSingleRun
Database.LSMTree.Internal.IncomingRun.newIncomingMergingRun
Database.LSMTree.Internal.IncomingRun.depositNominalCredits
Database.LSMTree.Internal.Merge.writeReaderEntry
Database.LSMTree.Internal.Merge.writeSerialisedEntry
Database.LSMTree.Internal.MergeSchedule.mkUnionCache
Database.LSMTree.Internal.MergeSchedule.duplicateUnionCache
Database.LSMTree.Internal.MergeSchedule.releaseUnionCache
Database.LSMTree.Internal.MergingRun.unsafeNew
Database.LSMTree.Internal.MergingRun.mergeType
Database.LSMTree.Internal.MergingRun.atomicReadCredits
Database.LSMTree.Internal.MergingRun.atomicModifyInt
Database.LSMTree.Internal.MergingRun.supplyChecked
Database.LSMTree.Internal.MergingRun.supplyCreditsRelative
Database.LSMTree.Internal.MergingRun.supplyCreditsAbsolute
Database.LSMTree.Internal.MergingTree.expectCompleted
Database.LSMTree.Internal.MergingTree.Lookup.mapMStrict
Database.LSMTree.Internal.RunReader.seekToDiskPage
Database.LSMTree.Internal.UniqCounter.newUniqCounter
Database.LSMTree.Internal.UniqCounter.incrUniqCounter
Database.LSMTree.Internal.Vector.mapMStrict
Database.LSMTree.Internal.Vector.imapMStrict
Database.LSMTree.Internal.Vector.forMStrict
Database.LSMTree.Internal.Vector.unsafeInsertWithMStrict

It should be checked that the same output arises on other systems, in particular such that do not use GNU sed.

@jeltsch jeltsch self-assigned this Apr 14, 2025
@jeltsch jeltsch added the enhancement New feature or request label Apr 14, 2025
@wenkokke
Copy link
Collaborator

As a preliminary note, it would be good to change the script to require no parameters (ie inline the source glob) and perhaps rename it to start with lint, for consistency.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 14, 2025

I‘d like to keep the flexibility of being able to pass arbitrarily chosen Haskell source code to the script but realize that it would be good to have a script that just checks the present source code (probably of the main library only). What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

@wenkokke
Copy link
Collaborator

What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

Sounds great! Another option would be to default to all Haskell files (or just src/**/*.hs) but allow the first argument to the script to override this.

@wenkokke
Copy link
Collaborator

I think it might be a good idea to add dinner comments to the sed script that explain the logic a bit, not per se on a command by command basis, but to explain the algorithm.

@wenkokke
Copy link
Collaborator

OTOH, it's sufficiently clear the script isn't malicious, so once there's a CI job that runs it on macOS and Windows, I'm happy to merge it under the understanding that it's not a complete guarantee on the absence of missing specialisations... which is true anyway, given the assumptions it makes.

@jorisdral
Copy link
Collaborator

Are there cases where the script returns false positives, and if not now, is it conceivable that it could do so in the future? If so, it might be good to think about how such false positives could be addressed, specially if it's a CI job we run that can hold up PRs

@wenkokke
Copy link
Collaborator

Are there cases where the script returns false positives, and if not now, is it conceivable that it could do so in the future? If so, it might be good to think about how such false positives could be addressed, specially if it's a CI job we run that can hold up PRs

@jeltsch already floated the idea of an allowlist for false positives.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

f0f80d6 disables reporting of operations with inline directives. The expected result of running scripts/find-missing-io-specialisations.sh src/**/*.hs is now as follows:

Database.LSMTree.Internal.Arena.newArenaManager
Database.LSMTree.Internal.Arena.scrambleArena
Database.LSMTree.Internal.Arena.scrambleBlock
Database.LSMTree.Internal.Arena.withUnmanagedArena
Database.LSMTree.Internal.Arena.allocateFromArena
Database.LSMTree.Internal.Arena.newBlockWithFree
Database.LSMTree.Internal.ChecksumHandle.dropCache
Database.LSMTree.Internal.ChecksumHandle.closeHandle
Database.LSMTree.Internal.CRC32C.expectValidFile
Database.LSMTree.Internal.doesSnapshotDirExist
Database.LSMTree.Internal.IncomingRun.depositNominalCredits
Database.LSMTree.Internal.MergeSchedule.mkUnionCache
Database.LSMTree.Internal.MergeSchedule.duplicateUnionCache
Database.LSMTree.Internal.MergeSchedule.releaseUnionCache
Database.LSMTree.Internal.MergingTree.expectCompleted
Database.LSMTree.Internal.MergingTree.Lookup.mapMStrict
Database.LSMTree.Internal.RunReader.seekToDiskPage

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

OTOH, it's sufficiently clear the script isn't malicious, so once there's a CI job that runs it on macOS and Windows, I'm happy to merge it under the understanding that it's not a complete guarantee on the absence of missing specialisations... which is true anyway, given the assumptions it makes.

Are you thinking here of a CI job that applies the script to the codebase to ensure that there are no missing specializations or one that tests the script using the data in scripts/find-missing-io-specialisations.test?

Beside that, do you think it is necessary to have CI run this script on Windows? After all, it’s not part of the software we want to ship in the end but a utility intended to help us (and perhaps future contributors). I mean, if supporting Windows, using some Unix compatibility layer, is easy, then I’m not against it; I just don’t want to spend much effort on making this script Windows-compatible (note that initially I intended this script to just assist me in a particular task and not even make it into the repository 🙂)

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

I think it might be a good idea to add dinner comments to the sed script that explain the logic a bit, not per se on a command by command basis, but to explain the algorithm.

Done in 325c20e.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 20, 2025

What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

Sounds great!

I’ve realized this in 539f06a. The wrapper script not only automatically picks the source files of the main library but also enables users to explicitly allow certain missing IO specializations.

@jeltsch jeltsch force-pushed the jeltsch/find-missing-io-specialisations branch 4 times, most recently from 130de00 to b018b5a Compare April 20, 2025 17:30
@wenkokke
Copy link
Collaborator

I've opened #697 which contains the changes in this PR, plus a few fixes.

  • The top-level script uses set -e which is not a POSIX shell feature and only works on Ubuntu because it lies and symlinks /bin/sh to /bin/bash.
  • The sed script does not work with macOS sed. Rather than fixing the sed script, I've altered the shell script to ask for gsed.

@jeltsch
Copy link
Collaborator Author

jeltsch commented May 14, 2025

I’ve added another three commits, which contain changes from #697 and variants thereof.

@jeltsch
Copy link
Collaborator Author

jeltsch commented May 14, 2025

The top-level script uses set -e which is not a POSIX shell feature and only works on Ubuntu because it lies and symlinks /bin/sh to /bin/bash.

Actually, set -e is a POSIX feature, at least according to the 2024 version of the standard.

Calling sh when sh is linked to bash doesn’t lead to a normal Bash session but one in POSIX compatibility mode. So “lying” is a bit of a strong accusation here (I guess nobody expects sh to be Stephen Bourne’s shell from 1979). 😉 Also note that some time ago Ubuntu changed the sh link to point to dash, Debian’s version of the Almquist Shell, a lean shell that actually lacks some POSIX features (at least the pipefail option).

@jeltsch
Copy link
Collaborator Author

jeltsch commented May 14, 2025

The sed script does not work with macOS sed. Rather than fixing the sed script, I've altered the shell script to ask for gsed.

This was apparently a good idea. As we discussed already in the meeting, there seems to be no easy way to compare strings with sed in a fully POSIX-compatible way, as POSIX doesn’t guarantee backreferences in regular expressions to work.

@jeltsch jeltsch force-pushed the jeltsch/find-missing-io-specialisations branch 2 times, most recently from 75fc37b to 92e2a8e Compare May 20, 2025 14:35
Copy link
Collaborator

@wenkokke wenkokke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Are you happy to merge this in the current state, @jeltsch?

@wenkokke wenkokke added this pull request to the merge queue May 28, 2025
@wenkokke wenkokke removed this pull request from the merge queue due to a manual request May 28, 2025
Copy link
Collaborator

@dcoutts dcoutts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks very impressive. I've never seen such a sophisticated sed script. All my sed scripts are one liners!

@jeltsch
Copy link
Collaborator Author

jeltsch commented May 29, 2025

LGTM. Are you happy to merge this in the current state, @jeltsch?

Yes. 🙂 I’ll rebase it in a minute and then initiate the merging.

@jeltsch jeltsch force-pushed the jeltsch/find-missing-io-specialisations branch from e05b1e4 to a5f12f9 Compare May 29, 2025 12:37
@jeltsch jeltsch enabled auto-merge May 29, 2025 12:37
@jeltsch
Copy link
Collaborator Author

jeltsch commented May 29, 2025

This all looks very impressive. I've never seen such a sophisticated sed script. All my sed scripts are one liners!

Glad that I managed to impress you. 😉

@jeltsch jeltsch added this pull request to the merge queue May 29, 2025
Merged via the queue into main with commit a3efabc May 29, 2025
30 checks passed
@jeltsch jeltsch deleted the jeltsch/find-missing-io-specialisations branch May 29, 2025 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants