@ericphanson
Member

@ericphanson ericphanson commented Oct 10, 2020

My attempt to fix #10 and close #273. There are three semi-arbitrarily chosen cutoffs that might need more tuning; if any one of them is hit, the package is flagged.

  1. DL distance is <= 1 (which would catch Websocket vs Websockets)
  2. A normalized DL distance catches long package names that differ by more than 1 edit but still only a few. I ended up going with a weird 5 + sqrt(max(len1, len2)) normalization because simply dividing by the length flagged long package names too often.
  3. Finally, there's the visual distance check, which can catch package names with more edits than the other checks allow if the edits are hard to distinguish visually, like Jill vs JiII (that's 2 edits of lowercase-ell to uppercase-eye, so the straight DL check doesn't catch it; the name is short, so the normalized one doesn't catch it either; but the letters look very similar, so the visual one does). I put a DL <= 2 guard on the calculation so we don't have to perform the expensive visual check too often.
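
The first two checks above can be sketched in plain Julia. This is a minimal illustration, not the code in this PR: `dl_distance`, `sqrt_normalized`, and `too_similar` are hypothetical names, the distance is the optimal-string-alignment variant of Damerau-Levenshtein, and the default cutoffs (1 and 0.25) are the ones quoted in the check messages later in this thread. The visual check is left out.

```julia
# Optimal-string-alignment variant of Damerau-Levenshtein distance:
# insertions, deletions, substitutions, and adjacent transpositions cost 1 each.
function dl_distance(s::AbstractString, t::AbstractString)
    a, b = collect(s), collect(t)
    m, n = length(a), length(b)
    # d[i+1, j+1] is the distance between the first i chars of a and the first j of b.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for i in 1:m, j in 1:n
        cost = a[i] == b[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,   # deletion
                          d[i+1, j] + 1,   # insertion
                          d[i, j] + cost)  # substitution or match
        if i > 1 && j > 1 && a[i] == b[j-1] && a[i-1] == b[j]
            d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + 1)  # adjacent transposition
        end
    end
    return d[m+1, n+1]
end

# Assumed form of check 2: divide by 5 + sqrt(max(len1, len2)).
sqrt_normalized(dist, s, t) = dist / (5 + sqrt(max(length(s), length(t))))

# Flag a pair if either the raw or the normalized distance is at or below its cutoff.
function too_similar(s, t; dl_cutoff = 1, normalized_cutoff = 0.25)
    dist = dl_distance(s, t)
    return dist <= dl_cutoff || sqrt_normalized(dist, s, t) <= normalized_cutoff
end
```

With these definitions, `too_similar("Websocket", "Websockets")` holds because the distance is 1, while `too_similar("Jill", "JiII")` does not: two substitutions give a distance of 2 and a normalized value of 2/7, which is the gap the visual check is meant to cover.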

I also added an ASCII check; I saw that's in the guidelines but doesn't appear to be implemented.

I added some unit tests but not an integration test (out of laziness / time constraints).

P.S. https://ericphanson.github.io/VisualStringDistances.jl/dev/packagenames/ has some short discussion of VisualStringDistances for this problem, and https://github.com/ericphanson/VisualStringDistances.jl/tree/master/scripts/packagenames has some messy/exploratory code for playing around with distances and cutoffs.


DL = Damerau–Levenshtein distance

@ericphanson
Member Author

ericphanson commented Oct 10, 2020

I went with a major version bump because this will flag packages that were previously allowed, but the public method signatures don't need to change, at least.

@DilumAluthge
Member

bors try

bors bot added a commit that referenced this pull request Oct 10, 2020
@bors
Contributor

bors bot commented Oct 10, 2020

try

Build failed:

  • continuous-integration/travis-ci/push

@DilumAluthge
Member

DilumAluthge commented Oct 10, 2020

> 1. DL distance is <= 1 (which would catch Websocket vs Websockets)

The actual example was Websocket vs WebSockets. So maybe the cutoff should be DL distance is <= 2?

Edit: I just saw Stefan already said this in his review above.

@DilumAluthge
Member

bors try

bors bot added a commit that referenced this pull request Oct 10, 2020
@StefanKarpinski

I think we should probably measure edit distance after lowercasing. In particular, since some file systems are (unfortunately) case insensitive, you can cause problems by registering something with the same name, capitalized differently, which could have a large edit distance.
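
To illustrate the point about case, here is a small sketch. The `mismatches` helper is hypothetical, and the pair of names is chosen only for illustration. The two names differ in two positions when compared case-sensitively, yet they are identical after lowercasing, which is exactly the situation that collides on a case-insensitive file system.

```julia
# Count positions at which two equal-length names differ (illustrative helper).
mismatches(s, t) = count(((a, b),) -> a != b, zip(s, t))

name1, name2 = "PGFPlots", "PgfPlots"   # illustrative pair differing only in case

mismatches(name1, name2)                 # 2 (G vs g, F vs f)
lowercase(name1) == lowercase(name2)     # true
```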

@fredrikekre
Member

> The actual example was Websocket vs WebSockets. So maybe the cutoff should be DL distance is <= 2?

I think we can be pretty conservative here; this is just for automatic merging after all. If the limits become a problem we can always lower them later on. What things would be caught with e.g. 3?

@KristofferC
Member

Would be good with a list of existing packages that are within this distance. PgfPlots vs PGFPlotsX for example.

@bors
Contributor

bors bot commented Oct 10, 2020

try

Build failed:

  • continuous-integration/travis-ci/push

@ericphanson
Member Author

ericphanson commented Oct 10, 2020

Thanks for all the quick feedback!

> I think we should probably measure edit distance after lowercasing.

Ah, good point, done.

> The actual example was Websocket vs WebSockets. So maybe the cutoff should be DL distance is <= 2?

Oops, my mistake. It would be caught with a limit of 1 plus lowercasing the names before comparison, so we don't actually need 2 for this, but could make that change anyway to be more conservative. I've left it at 1 for now.

> Would be good with a list of existing packages that are within this distance. PgfPlots vs PGFPlotsX for example.

For your example specifically,

  ("PGFPlots", "PGFPlotsX") => "Too similar to existing package name PGFPlotsX. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
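
As a rough check of the numbers in this message, assuming the sqrt-normalized distance is the DL distance divided by the 5 + sqrt(max(len1, len2)) factor from the PR description, with name lengths 8 and 9:

```latex
\frac{d_{\mathrm{DL}}}{5 + \sqrt{\max(\ell_1, \ell_2)}}
  = \frac{1}{5 + \sqrt{\max(8, 9)}}
  = \frac{1}{8}
  = 0.125
```

This agrees with the reported 0.13 up to rounding and is well below the 0.25 cutoff.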

So it would be caught (and differences in case don't matter now for the edit distances). There are 242 other clashes with the current settings:

julia> @time clashes = all_name_clashes(sort!(AutoMerge.get_all_package_names(expanduser("~/.julia/registries/General"))))
 95.335998 seconds (165.32 M allocations: 12.287 GiB, 2.16% gc time)

where I've defined

using RegistryCI
using RegistryCI.AutoMerge
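# Compare every pair of package names and collect the pairs that fail the
# distance check, keyed by (name1, name2) with the failure message as the value.
function all_name_clashes(packages; kwargs...)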
    n = length(packages)
    clashes = Dict{Tuple{String, String}, String}()
    for i = 1:n, j = i+1:n
        name1 = packages[i]
        name2 = packages[j]
        pass, message = AutoMerge.meets_distance_check(name1, tuple(name2); kwargs...)
        if !pass
            clashes[(name1, name2)] = message
        end
    end
    return clashes
end
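
Once `clashes` is built, the messages can be used to see which check fired. The sketch below uses three entries copied from the results in this comment and picks out the pairs flagged only by the visual distance. The name `visual_only` is hypothetical, and the string match relies on the message wording shown here.

```julia
# Three entries copied verbatim from the results in this comment.
clashes = Dict(
    ("Dhall", "Shell") => "Too similar to existing package name Shell. Normalized visual distance 1.80 is at or below cutoff 2.50.",
    ("Plots", "Pluto") => "Too similar to existing package name Pluto. Normalized visual distance 1.38 is at or below cutoff 2.50.",
    ("CSV", "uCSV") => "Too similar to existing package name uCSV. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25.",
)

# Pairs whose message mentions neither the plain nor the sqrt-normalized DL check.
visual_only = sort([pair for (pair, msg) in clashes if !occursin("Damerau-Levenshtein", msg)])
```

For these three entries, `visual_only` contains the `("Dhall", "Shell")` and `("Plots", "Pluto")` pairs, which are the kind of matches worth reviewing when tuning the visual cutoff.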
Full results
Dict{Tuple{String,String},String} with 243 entries:
  ("DBInterface", "ODEInterface") => "Too similar to existing package name ODEInterface. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Exfiltrator", "Infiltrator") => "Too similar to existing package name Infiltrator. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("SMM", "SOM") => "Too similar to existing package name SOM. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.82 is at or below cutoff 2.50."
  ("NaiveGAflux", "NaiveNASflux") => "Too similar to existing package name NaiveNASflux. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("LSL", "VSL") => "Too similar to existing package name VSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.49 is at or below cutoff 2.50."
  ("MPIReco", "MRIReco") => "Too similar to existing package name MRIReco. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 0.59 is at or below cutoff 2.50."
  ("UAParser", "URIParser") => "Too similar to existing package name URIParser. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("GAP", "GCP") => "Too similar to existing package name GCP. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.00 is at or below cutoff 2.50."
  ("AMD", "Amb") => "Too similar to existing package name Amb. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("JuLIP", "julia") => "Too similar to existing package name julia. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("MLJModels", "NLPModels") => "Too similar to existing package name NLPModels. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Hive", "Jive") => "Too similar to existing package name Jive. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("EchoviewEcs", "EchoviewEvr") => "Too similar to existing package name EchoviewEvr. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("CSV", "uCSV") => "Too similar to existing package name uCSV. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Bijections", "Bijectors") => "Too similar to existing package name Bijectors. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("IRTools", "IterTools") => "Too similar to existing package name IterTools. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("CUDA", "Cuba") => "Too similar to existing package name Cuba. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Conda", "Onda") => "Too similar to existing package name Onda. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("MLDatasets", "RDatasets") => "Too similar to existing package name RDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("StanSample", "StanSamples") => "Too similar to existing package name StanSamples. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25. Normalized visual distance 2.23 is at or below cutoff 2.50."
  ("Match", "Matcha") => "Too similar to existing package name Matcha. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("ExportAll", "ImportAll") => "Too similar to existing package name ImportAll. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("ITensors", "Tensors") => "Too similar to existing package name Tensors. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("NTFk", "NTNk") => "Too similar to existing package name NTNk. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 2.17 is at or below cutoff 2.50."
  ("StanModels", "StatsModels") => "Too similar to existing package name StatsModels. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Glo", "Glob") => "Too similar to existing package name Glob. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Dhall", "Shell") => "Too similar to existing package name Shell. Normalized visual distance 1.80 is at or below cutoff 2.50."
  ("BIDSTools", "BioTools") => "Too similar to existing package name BioTools. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("NFLTables", "Nullables") => "Too similar to existing package name Nullables. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("GracePlot", "GraphPlot") => "Too similar to existing package name GraphPlot. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 2.26 is at or below cutoff 2.50."
  ("MLDatasets", "NLIDatasets") => "Too similar to existing package name NLIDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Plots", "Pluto") => "Too similar to existing package name Pluto. Normalized visual distance 1.38 is at or below cutoff 2.50."
  ("Mocking", "Packing") => "Too similar to existing package name Packing. Normalized visual distance 2.20 is at or below cutoff 2.50."
  ("MIDI", "MIRT") => "Too similar to existing package name MIRT. Normalized visual distance 2.16 is at or below cutoff 2.50."
  ("Mads", "Mods") => "Too similar to existing package name Mods. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 0.70 is at or below cutoff 2.50."
  ("TreeView", "TreeViews") => "Too similar to existing package name TreeViews. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 2.25 is at or below cutoff 2.50."
  ("LSHFunctions", "LossFunctions") => "Too similar to existing package name LossFunctions. Sqrt-normalized Damerau-Levenshtein distance 0.23 is at or below cutoff 0.25."
  ("Media", "Modia") => "Too similar to existing package name Modia. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 0.55 is at or below cutoff 2.50."
  ("SCIP", "SciPy") => "Too similar to existing package name SciPy. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Calculus", "ZXCalculus") => "Too similar to existing package name ZXCalculus. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("MCMCChain", "MCMCChains") => "Too similar to existing package name MCMCChains. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25. Normalized visual distance 2.27 is at or below cutoff 2.50."
  ("GLMakie", "WGLMakie") => "Too similar to existing package name WGLMakie. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("CUDA", "CoDa") => "Too similar to existing package name CoDa. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 1.96 is at or below cutoff 2.50."
  ("HAML", "YAML") => "Too similar to existing package name YAML. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 2.30 is at or below cutoff 2.50."
  ("RSCG", "Rsvg") => "Too similar to existing package name Rsvg. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("CuArrays", "GPUArrays") => "Too similar to existing package name GPUArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("YAJL", "YAML") => "Too similar to existing package name YAML. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("COSMA", "COSMO") => "Too similar to existing package name COSMO. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 1.43 is at or below cutoff 2.50."
  ("ITensors", "NDTensors") => "Too similar to existing package name NDTensors. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Devito", "Pavito") => "Too similar to existing package name Pavito. Normalized visual distance 1.87 is at or below cutoff 2.50."
  ("PGFPlots", "PGFPlotsX") => "Too similar to existing package name PGFPlotsX. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("KernelDensity", "KernelDensitySJ") => "Too similar to existing package name KernelDensitySJ. Sqrt-normalized Damerau-Levenshtein distance 0.23 is at or below cutoff 0.25."
  ("BDF", "JDF") => "Too similar to existing package name JDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("Knet", "UNet") => "Too similar to existing package name UNet. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("UnROOT", "UpROOT") => "Too similar to existing package name UpROOT. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 0.63 is at or below cutoff 2.50."
  ("ROMEO", "RoME") => "Too similar to existing package name RoME. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Altro", "Attrs") => "Too similar to existing package name Attrs. Normalized visual distance 2.47 is at or below cutoff 2.50."
  ("HSL", "LSL") => "Too similar to existing package name LSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.14 is at or below cutoff 2.50."
  ("FileTrees", "Filetimes") => "Too similar to existing package name Filetimes. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("POMDPFiles", "POMDPXFiles") => "Too similar to existing package name POMDPXFiles. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25."
  ("SMM", "SPH") => "Too similar to existing package name SPH. Normalized visual distance 2.37 is at or below cutoff 2.50."
  ("Bcrypt", "Scrypt") => "Too similar to existing package name Scrypt. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 0.79 is at or below cutoff 2.50."
  ("NRRD", "Nord") => "Too similar to existing package name Nord. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("BDF", "ODE") => "Too similar to existing package name ODE. Normalized visual distance 1.32 is at or below cutoff 2.50."
  ("SCS", "WCS") => "Too similar to existing package name WCS. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.11 is at or below cutoff 2.50."
  ("Clp", "Glo") => "Too similar to existing package name Glo. Normalized visual distance 1.32 is at or below cutoff 2.50."
  ("NCDatasets", "NLIDatasets") => "Too similar to existing package name NLIDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("BandedMatrices", "PaddedMatrices") => "Too similar to existing package name PaddedMatrices. Sqrt-normalized Damerau-Levenshtein distance 0.23 is at or below cutoff 0.25. Normalized visual distance 1.76 is at or below cutoff 2.50."
  ("Bio", "Glo") => "Too similar to existing package name Glo. Normalized visual distance 1.13 is at or below cutoff 2.50."
  ("CAOS", "COBS") => "Too similar to existing package name COBS. Normalized visual distance 2.09 is at or below cutoff 2.50."
  ("CoDa", "Conda") => "Too similar to existing package name Conda. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("GSL", "HSL") => "Too similar to existing package name HSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.54 is at or below cutoff 2.50."
  ("UnitfulMR", "UnitfulUS") => "Too similar to existing package name UnitfulUS. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 2.33 is at or below cutoff 2.50."
  ("COBS", "ECOS") => "Too similar to existing package name ECOS. Normalized visual distance 2.11 is at or below cutoff 2.50."
  ("NMFk", "NTFk") => "Too similar to existing package name NTFk. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("MLDatasets", "NCDatasets") => "Too similar to existing package name NCDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 1.62 is at or below cutoff 2.50."
  ("MortarContact2D", "MortarContact2DAD") => "Too similar to existing package name MortarContact2DAD. Sqrt-normalized Damerau-Levenshtein distance 0.22 is at or below cutoff 0.25."
  ("GLM", "SOM") => "Too similar to existing package name SOM. Normalized visual distance 2.41 is at or below cutoff 2.50."
  ("AIControl", "DFControl") => "Too similar to existing package name DFControl. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("AMD", "AWS") => "Too similar to existing package name AWS. Normalized visual distance 2.13 is at or below cutoff 2.50."
  ("Stopping", "Strapping") => "Too similar to existing package name Strapping. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("FCA", "FDM") => "Too similar to existing package name FDM. Normalized visual distance 2.27 is at or below cutoff 2.50."
  ("JDBC", "ODBC") => "Too similar to existing package name ODBC. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 2.01 is at or below cutoff 2.50."
  ("GBIF", "GRIB") => "Too similar to existing package name GRIB. Normalized visual distance 2.23 is at or below cutoff 2.50."
  ("CSDP", "OSQP") => "Too similar to existing package name OSQP. Normalized visual distance 1.52 is at or below cutoff 2.50."
  ("Tar", "Taro") => "Too similar to existing package name Taro. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("GlobalApproximationValueIteration", "LocalApproximationValueIteration") => "Too similar to existing package name LocalApproximationValueIteration. Sqrt-normalized Damerau-Levenshtein distance 0.19 is at or below cutoff 0.25."
  ("IJulia", "julia") => "Too similar to existing package name julia. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("NCDatasets", "RDatasets") => "Too similar to existing package name RDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("LasIO", "Lasso") => "Too similar to existing package name Lasso. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Distributions", "DistributionsAD") => "Too similar to existing package name DistributionsAD. Sqrt-normalized Damerau-Levenshtein distance 0.23 is at or below cutoff 0.25."
  ("JuLIP", "Tulip") => "Too similar to existing package name Tulip. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("StrFs", "Strs") => "Too similar to existing package name Strs. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Gtk", "ITK") => "Too similar to existing package name ITK. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("XSim", "Xsum") => "Too similar to existing package name Xsum. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("BigArrays", "DimArrays") => "Too similar to existing package name DimArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("StanBase", "StatsBase") => "Too similar to existing package name StatsBase. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("TeXTable", "TexTables") => "Too similar to existing package name TexTables. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("BED", "PEG") => "Too similar to existing package name PEG. Normalized visual distance 2.00 is at or below cutoff 2.50."
  ("NES", "WCS") => "Too similar to existing package name WCS. Normalized visual distance 1.91 is at or below cutoff 2.50."
  ("MDInclude", "NBInclude") => "Too similar to existing package name NBInclude. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 1.40 is at or below cutoff 2.50."
  ("StatPlots", "StatsPlots") => "Too similar to existing package name StatsPlots. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25."
  ("ECC", "SCS") => "Too similar to existing package name SCS. Normalized visual distance 2.09 is at or below cutoff 2.50."
  ("COBRA", "COESA") => "Too similar to existing package name COESA. Normalized visual distance 2.08 is at or below cutoff 2.50."
  ("IPMeasures", "Measures") => "Too similar to existing package name Measures. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("EMIRT", "MIRT") => "Too similar to existing package name MIRT. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("GSL", "LSL") => "Too similar to existing package name LSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.83 is at or below cutoff 2.50."
  ("CDCS", "COBS") => "Too similar to existing package name COBS. Normalized visual distance 1.51 is at or below cutoff 2.50."
  ("AdvancedHMC", "AdvancedMH") => "Too similar to existing package name AdvancedMH. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("LCPsolve", "LapSolve") => "Too similar to existing package name LapSolve. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("JDF", "XDF") => "Too similar to existing package name XDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.19 is at or below cutoff 2.50."
  ("Chess", "Loess") => "Too similar to existing package name Loess. Normalized visual distance 2.05 is at or below cutoff 2.50."
  ("SHA", "SMM") => "Too similar to existing package name SMM. Normalized visual distance 1.87 is at or below cutoff 2.50."
  ("EDF", "JDF") => "Too similar to existing package name JDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.33 is at or below cutoff 2.50."
  ("MDDatasets", "NCDatasets") => "Too similar to existing package name NCDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 1.50 is at or below cutoff 2.50."
  ("OPFSampler", "PDSampler") => "Too similar to existing package name PDSampler. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("COBRA", "COSMA") => "Too similar to existing package name COSMA. Normalized visual distance 2.05 is at or below cutoff 2.50."
  ("HSL", "VSL") => "Too similar to existing package name VSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.20 is at or below cutoff 2.50."
  ("Nord", "Serd") => "Too similar to existing package name Serd. Normalized visual distance 2.48 is at or below cutoff 2.50."
  ("Fire", "Pipe") => "Too similar to existing package name Pipe. Normalized visual distance 1.57 is at or below cutoff 2.50."
  ("MPI", "NPZ") => "Too similar to existing package name NPZ. Normalized visual distance 2.25 is at or below cutoff 2.50."
  ("SDPA", "SOFA") => "Too similar to existing package name SOFA. Normalized visual distance 0.87 is at or below cutoff 2.50."
  ("ASDF", "CSDP") => "Too similar to existing package name CSDP. Normalized visual distance 2.26 is at or below cutoff 2.50."
  ("NDTensors", "Tensors") => "Too similar to existing package name Tensors. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("AES", "ASE") => "Too similar to existing package name ASE. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("BitFlags", "BitFloats") => "Too similar to existing package name BitFloats. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("CMPlot", "ImPlot") => "Too similar to existing package name ImPlot. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("ITK", "Tk") => "Too similar to existing package name Tk. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("Jets", "Jute") => "Too similar to existing package name Jute. Normalized visual distance 1.99 is at or below cutoff 2.50."
  ("HealthBase", "HealthMLBase") => "Too similar to existing package name HealthMLBase. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("JSON", "JSON2") => "Too similar to existing package name JSON2. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Ogg", "Org") => "Too similar to existing package name Org. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("CUDA", "CUDD") => "Too similar to existing package name CUDD. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 1.45 is at or below cutoff 2.50."
  ("AEMS", "AES") => "Too similar to existing package name AES. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("H3", "Z3") => "Too similar to existing package name Z3. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.16 is at or below cutoff 0.25."
  ("Hose", "Soss") => "Too similar to existing package name Soss. Normalized visual distance 2.47 is at or below cutoff 2.50."
  ("GDAL", "SEAL") => "Too similar to existing package name SEAL. Normalized visual distance 1.98 is at or below cutoff 2.50."
  ("GMT", "Git") => "Too similar to existing package name Git. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("JWAS", "JWTs") => "Too similar to existing package name JWTs. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("KDEstimation", "MEstimation") => "Too similar to existing package name MEstimation. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Persa", "Porta") => "Too similar to existing package name Porta. Normalized visual distance 2.26 is at or below cutoff 2.50."
  ("JLD", "JLD2") => "Too similar to existing package name JLD2. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("StanMCMCChain", "StanMCMCChains") => "Too similar to existing package name StanMCMCChains. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.11 is at or below cutoff 0.25. Normalized visual distance 2.12 is at or below cutoff 2.50."
  ("BeaData", "GeoData") => "Too similar to existing package name GeoData. Normalized visual distance 1.42 is at or below cutoff 2.50."
  ("BDF", "EDF") => "Too similar to existing package name EDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 0.84 is at or below cutoff 2.50."
  ("ACME", "AEMS") => "Too similar to existing package name AEMS. Normalized visual distance 2.31 is at or below cutoff 2.50."
  ("JWTs", "Jets") => "Too similar to existing package name Jets. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("SMC", "SMM") => "Too similar to existing package name SMM. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.33 is at or below cutoff 2.50."
  ("CRC", "SMC") => "Too similar to existing package name SMC. Normalized visual distance 2.10 is at or below cutoff 2.50."
  ("GCP", "SCS") => "Too similar to existing package name SCS. Normalized visual distance 2.35 is at or below cutoff 2.50."
  ("BAT", "GMT") => "Too similar to existing package name GMT. Normalized visual distance 2.16 is at or below cutoff 2.50."
  ("EDF", "XDF") => "Too similar to existing package name XDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.28 is at or below cutoff 2.50."
  ("GPUArrays", "GeoArrays") => "Too similar to existing package name GeoArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("AES", "NES") => "Too similar to existing package name NES. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.45 is at or below cutoff 2.50."
  ("DataArrays", "MetaArrays") => "Too similar to existing package name MetaArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 2.04 is at or below cutoff 2.50."
  ("AdvancedMH", "AdvancedVI") => "Too similar to existing package name AdvancedVI. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("QDates", "RDates") => "Too similar to existing package name RDates. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 1.78 is at or below cutoff 2.50."
  ("GLM", "Glo") => "Too similar to existing package name Glo. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("PosDefManifold", "PosDefManifoldML") => "Too similar to existing package name PosDefManifoldML. Sqrt-normalized Damerau-Levenshtein distance 0.22 is at or below cutoff 0.25."
  ("Spec", "Spot") => "Too similar to existing package name Spot. Normalized visual distance 2.40 is at or below cutoff 2.50."
  ("Debugger", "Rebugger") => "Too similar to existing package name Rebugger. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 1.36 is at or below cutoff 2.50."
  ("LiBr", "Libz") => "Too similar to existing package name Libz. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("HTTPClient", "SMTPClient") => "Too similar to existing package name SMTPClient. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Unitful", "UnitfulUS") => "Too similar to existing package name UnitfulUS. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("JSON2", "JSON3") => "Too similar to existing package name JSON3. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 0.75 is at or below cutoff 2.50."
  ("ForwardDiff", "ForwardDiff2") => "Too similar to existing package name ForwardDiff2. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25."
  ("LinearMaps", "LinearMapsAA") => "Too similar to existing package name LinearMapsAA. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("CRC", "Cbc") => "Too similar to existing package name Cbc. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("CAOS", "GAMS") => "Too similar to existing package name GAMS. Normalized visual distance 2.42 is at or below cutoff 2.50."
  ("FDM", "SOM") => "Too similar to existing package name SOM. Normalized visual distance 2.34 is at or below cutoff 2.50."
  ("BEAST", "FFAST") => "Too similar to existing package name FFAST. Normalized visual distance 1.95 is at or below cutoff 2.50."
  ("Blobs", "Blosc") => "Too similar to existing package name Blosc. Normalized visual distance 2.19 is at or below cutoff 2.50."
  ("ACME", "ADCME") => "Too similar to existing package name ADCME. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("GMAT", "MAT") => "Too similar to existing package name MAT. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Pandas", "Pandoc") => "Too similar to existing package name Pandoc. Normalized visual distance 1.36 is at or below cutoff 2.50."
  ("LasIO", "LazIO") => "Too similar to existing package name LazIO. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 1.16 is at or below cutoff 2.50."
  ("CSDP", "SDDP") => "Too similar to existing package name SDDP. Normalized visual distance 2.06 is at or below cutoff 2.50."
  ("Dolo", "YOLO") => "Too similar to existing package name YOLO. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("BTCParser", "BibParser") => "Too similar to existing package name BibParser. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25. Normalized visual distance 2.47 is at or below cutoff 2.50."
  ("CGAL", "GDAL") => "Too similar to existing package name GDAL. Normalized visual distance 1.57 is at or below cutoff 2.50."
  ("Bits", "LITS") => "Too similar to existing package name LITS. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Catlab", "MATLAB") => "Too similar to existing package name MATLAB. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("CodeCosts", "CodecZstd") => "Too similar to existing package name CodecZstd. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("StructuredGrids", "UnstructuredGrids") => "Too similar to existing package name UnstructuredGrids. Sqrt-normalized Damerau-Levenshtein distance 0.22 is at or below cutoff 0.25."
  ("ClimateBase", "ClimateEasy") => "Too similar to existing package name ClimateEasy. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("AMQPClient", "SMTPClient") => "Too similar to existing package name SMTPClient. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("BAT", "BBI") => "Too similar to existing package name BBI. Normalized visual distance 2.37 is at or below cutoff 2.50."
  ("SimplePlots", "SimpleRoots") => "Too similar to existing package name SimpleRoots. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("FTPClient", "HTTPClient") => "Too similar to existing package name HTTPClient. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("FixedEffectModels", "GLFixedEffectModels") => "Too similar to existing package name GLFixedEffectModels. Sqrt-normalized Damerau-Levenshtein distance 0.21 is at or below cutoff 0.25."
  ("Fire", "Jfire") => "Too similar to existing package name Jfire. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("CGAL", "SEAL") => "Too similar to existing package name SEAL. Normalized visual distance 2.09 is at or below cutoff 2.50."
  ("ResumableFunctions", "ReusableFunctions") => "Too similar to existing package name ReusableFunctions. Sqrt-normalized Damerau-Levenshtein distance 0.22 is at or below cutoff 0.25."
  ("MDDatasets", "MLDatasets") => "Too similar to existing package name MLDatasets. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25. Normalized visual distance 1.28 is at or below cutoff 2.50."
  ("Sass", "Soss") => "Too similar to existing package name Soss. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 0.70 is at or below cutoff 2.50."
  ("NMF", "NMFk") => "Too similar to existing package name NMFk. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("NumericIO", "Numerics") => "Too similar to existing package name Numerics. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Atmosphere", "ISAtmosphere") => "Too similar to existing package name ISAtmosphere. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Config", "Configs") => "Too similar to existing package name Configs. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25. Normalized visual distance 2.44 is at or below cutoff 2.50."
  ("LightGBM", "LightOSM") => "Too similar to existing package name LightOSM. Normalized visual distance 1.23 is at or below cutoff 2.50."
  ("MDDatasets", "RDatasets") => "Too similar to existing package name RDatasets. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Gtk", "Tk") => "Too similar to existing package name Tk. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("EDF", "ODE") => "Too similar to existing package name ODE. Normalized visual distance 2.09 is at or below cutoff 2.50."
  ("MIDI", "Mimi") => "Too similar to existing package name Mimi. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("ThreadPools", "ThreadTools") => "Too similar to existing package name ThreadTools. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.12 is at or below cutoff 0.25. Normalized visual distance 2.09 is at or below cutoff 2.50."
  ("JSON", "JSON3") => "Too similar to existing package name JSON3. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Clustering", "ClusteringGA") => "Too similar to existing package name ClusteringGA. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("CoDa", "Cuba") => "Too similar to existing package name Cuba. Normalized visual distance 1.70 is at or below cutoff 2.50."
  ("DimArrays", "SymArrays") => "Too similar to existing package name SymArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("DimArrays", "DiskArrays") => "Too similar to existing package name DiskArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("GLFW", "GLPK") => "Too similar to existing package name GLPK. Normalized visual distance 2.17 is at or below cutoff 2.50."
  ("Media", "Redis") => "Too similar to existing package name Redis. Normalized visual distance 2.14 is at or below cutoff 2.50."
  ("BLPData", "BlsData") => "Too similar to existing package name BlsData. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("Clustering", "DPClustering") => "Too similar to existing package name DPClustering. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("CAOS", "CDCS") => "Too similar to existing package name CDCS. Normalized visual distance 1.90 is at or below cutoff 2.50."
  ("MLBase", "MLJBase") => "Too similar to existing package name MLJBase. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("BAT", "MAT") => "Too similar to existing package name MAT. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.56 is at or below cutoff 2.50."
  ("DiffEqBase", "DiffEqBayes") => "Too similar to existing package name DiffEqBayes. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("Tar", "Tau") => "Too similar to existing package name Tau. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 1.67 is at or below cutoff 2.50."
  ("MeshArrays", "MetaArrays") => "Too similar to existing package name MetaArrays. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("DBFTables", "FWFTables") => "Too similar to existing package name FWFTables. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("DSP", "GCP") => "Too similar to existing package name GCP. Normalized visual distance 1.71 is at or below cutoff 2.50."
  ("Tau", "Yao") => "Too similar to existing package name Yao. Normalized visual distance 2.16 is at or below cutoff 2.50."
  ("CFITSIO", "FITSIO") => "Too similar to existing package name FITSIO. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("EzXML", "MzXML") => "Too similar to existing package name MzXML. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 2.07 is at or below cutoff 2.50."
  ("BDF", "XDF") => "Too similar to existing package name XDF. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.43 is at or below cutoff 2.50."
  ("LITS", "Lints") => "Too similar to existing package name Lints. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("Porta", "XPORTA") => "Too similar to existing package name XPORTA. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("DCCA", "ORCA") => "Too similar to existing package name ORCA. Normalized visual distance 1.99 is at or below cutoff 2.50."
  ("Stipple", "StippleUI") => "Too similar to existing package name StippleUI. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("Unitful", "UnitfulMR") => "Too similar to existing package name UnitfulMR. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("GLM", "Gym") => "Too similar to existing package name Gym. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("GMAT", "GMT") => "Too similar to existing package name GMT. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25."
  ("TOML", "TSML") => "Too similar to existing package name TSML. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 0.79 is at or below cutoff 2.50."
  ("BSON", "JSON") => "Too similar to existing package name JSON. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.14 is at or below cutoff 0.25. Normalized visual distance 2.45 is at or below cutoff 2.50."
  ("Cubature", "HCubature") => "Too similar to existing package name HCubature. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("SimpleRoots", "SimpleTools") => "Too similar to existing package name SimpleTools. Sqrt-normalized Damerau-Levenshtein distance 0.24 is at or below cutoff 0.25."
  ("SOM", "SPH") => "Too similar to existing package name SPH. Normalized visual distance 2.16 is at or below cutoff 2.50."
  ("Caching", "Packing") => "Too similar to existing package name Packing. Normalized visual distance 2.47 is at or below cutoff 2.50."
  ("GSL", "VSL") => "Too similar to existing package name VSL. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25."
  ("FTPClient", "SMTPClient") => "Too similar to existing package name SMTPClient. Sqrt-normalized Damerau-Levenshtein distance 0.25 is at or below cutoff 0.25."
  ("PANDA", "Pandas") => "Too similar to existing package name Pandas. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.13 is at or below cutoff 0.25."
  ("AES", "AWS") => "Too similar to existing package name AWS. Damerau-Levenshtein distance 1 is at or below cutoff 1. Sqrt-normalized Damerau-Levenshtein distance 0.15 is at or below cutoff 0.25. Normalized visual distance 2.30 is at or below cutoff 2.50."
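
The sqrt-normalized values in the messages above appear consistent with dividing the DL distance by `5 + sqrt(max(len1, len2))`, the normalization mentioned in the PR description. A minimal sketch of that arithmetic, assuming this reading is right (the name `sqrt_normalized` is hypothetical, not AutoMerge's API):

```julia
# Hypothetical helper: DL distance divided by 5 + sqrt(length of the longer name).
sqrt_normalized(dl::Integer, a::AbstractString, b::AbstractString) =
    dl / (5 + sqrt(max(length(a), length(b))))

# One edit between two 4-letter names: 1 / (5 + 2)
println(round(sqrt_normalized(1, "CUDA", "CUDD"); digits=2))           # 0.14
# Two edits between two 9-letter names: 2 / (5 + 3)
println(round(sqrt_normalized(2, "GPUArrays", "GeoArrays"); digits=2)) # 0.25
```

Under this reading, with a cutoff of 0.25 a single edit always trips the check (1/5 is already below 0.25 before the square-root term is added), and two edits trip it once the longer name has at least 9 characters.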

I think we can be pretty conservative here; this is just for automatic merging after all. If the limits become a problem we can always lower them later on. What things would be caught with e.g. 3?

Agreed that we can be conservative, although I think lowercasing already helps a lot. I think 3 may be too high, especially for short package names; any two four-letter packages would clash if they shared a single common letter (independently of case). Just checking for pairs with DL-distance <= 3, we get 15404 clashes (using `clashes_3 = all_name_clashes(sort!(AutoMerge.get_all_package_names(expanduser("~/.julia/registries/General"))), DL_cutoff=3, sqrt_normalized_DL_cutoff=-1, sqrt_normalized_vd_cutoff=-1)`). Doing the same with a cutoff of 2 instead of 3 gives 1592 clashes, which seems a bit better. I've left it at 1 for now, which results in 111 clashes (due to the non-length-normalized DL distance check alone).

@KristofferC
Member

So anything that is three letters and shares one letter will be marked?

I think we should ignore case when checking that a new package does not have the same name as an existing one in a different casing, but not when it comes to string distance. When you install something the case is important everywhere so an A and an a are just as different as two other letters.

@DilumAluthge
Member

DilumAluthge commented Oct 10, 2020

> any two four-letter packages would clash if they shared a single common letter (independently of case)

> So anything that is three letters and shares one letter will be marked?

Keep in mind, any new package with a name that is less than five letters long will require manual merging anyway. That's just the regular "new package name length" check that we already have.

@DilumAluthge
Member

The tests keep failing; will need to figure out why.

@DilumAluthge
Member

Also, can you rebase on master and squash?

@StefanKarpinski

How about this:

  • check if the visual string distance is too low
  • check if the edit distance of raw names is ≤ 2
  • check if the edit distance of lowercased names is ≤ 1

Keep in mind that this is just a trigger that requires manual review, not a block, so if there's some case where the name is reasonable but these "rules" aren't satisfied, that's still ok, we can just merge manually.
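
The two edit-distance rules in the list above can be sketched as follows. This is a rough illustration in plain Julia, not the PR's implementation: `dl_distance` is a hand-written restricted Damerau-Levenshtein (optimal string alignment) distance, `edit_distance_flags` is a hypothetical name, and the visual distance rule is left out.

```julia
# Restricted Damerau-Levenshtein (optimal string alignment) distance.
function dl_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m   # cost of deleting every character of a prefix of `a`
    d[1, :] .= 0:n   # cost of inserting every character of a prefix of `b`
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,   # deletion
                          d[i+1, j] + 1,   # insertion
                          d[i, j] + cost)  # substitution
        if i > 1 && j > 1 && s[i] == t[j-1] && s[i-1] == t[j]
            # transposition of two adjacent characters
            d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + 1)
        end
    end
    return d[m+1, n+1]
end

# Hypothetical helper applying the two proposed cutoffs to one pair of names.
function edit_distance_flags(new_name, existing_name)
    raw = dl_distance(new_name, existing_name)
    lower = dl_distance(lowercase(new_name), lowercase(existing_name))
    return (raw_close = raw <= 2, lowercase_close = lower <= 1)
end

println(edit_distance_flags("Websocket", "Websockets"))
println(edit_distance_flags("jump", "JuMP"))
```

The second call shows why the lowercased rule is separate: "jump" and "JuMP" are three raw edits apart, so only the lowercased comparison flags the pair.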

@ericphanson
Member Author

It would be great to have some more examples of things that should be flagged, especially long names (because I wonder about length normalization). If anyone wants to contribute pairs of names that are qualitatively “too close”, I can add them to the tests and try to tune the cutoffs a bit.

By not allowing Unicode, we are saved from some of the bad examples that URLs face, and from the cases in other software ecosystems where names differed only by a missing “-”, so there’s that at least.

@StefanKarpinski

Note that we don't have to get this right all at once: we can adjust the rules as we encounter more cases. Having some checks in place is already much, much better than having no checks in place. I do suspect that a single cutoff regardless of name length may not be right, but let's see how it goes.

@ericphanson
Member Author

ericphanson commented Oct 11, 2020

Ok, then I'll just apply those rules you mentioned and Dilum's review suggestions. I am also concerned about communication, i.e. when someone's new package PR isn't automerged due to these rules and they don't understand why. I'll try to add a bit more to the README to help with that.

My dream would be that, when the automerge check fails due to the visual distance, the automerge comment could include a gif showing the two names merging into each other, to show how close they look :). That should definitely be follow-up work though, if we even want to take on the hefty plotting dependencies that would be needed.

@ericphanson
Member Author

ericphanson commented Oct 11, 2020

I applied the review suggestions from @DilumAluthge and updated the name checks as @StefanKarpinski suggested. I also tweaked the short-circuiting and the resulting message so that we only short-circuit if the package name is in the registry. Otherwise, e.g. if you try "Flux" (with the ell), you get an error that it's too similar to "Mux", instead of the more appropriate error that it is already a name in the registry. Additionally, if the name is not in the registry but is too close, you get a numbered list of all the packages that it is too similar to. E.g. with "FIux" (uppercase eye), you get

julia> println(AutoMerge.meets_distance_check("FIux", all_pkg_names)[2])

Package name too similar to 3 existing packages.
1. Too similar to Mux. Damerau-Levenshtein distance 2 is at or below cutoff 2.
2. Too similar to FIB. Damerau-Levenshtein distance 2 is at or below cutoff 2.
3. Too similar to Flux. Damerau-Levenshtein distance 1 is at or below cutoff 2. Damerau-Levenshtein distance 1 between lowercased names is at or below cutoff 1. Normalized visual distance 0.46 is at or below cutoff 2.50.

I did not squash as @DilumAluthge requested because I thought it would make reviewing harder. But I can do that too when we're ready to merge. (I guess Bors can't squash merge yet? I see bors-ng/bors-ng#718 was merged though)

P.S. Not short-circuiting makes the (already quadratically scaling) all_name_clashes code take way longer, but for the linear sweep that CI does, I think it's fine and worth it, since it would be nice to be aware of all the nearby names to watch for, in case one wants to tweak the name and re-register.

@ericphanson
Member Author

Re:

> When you install something the case is important everywhere so an A and an a are just as different as two other letters.

I think they are not quite as different as two other letters; I think failing to press shift is a more common typing error than most two-letter swaps (though not all), so if we are worried about typo-squatting that can be a factor. And I think people tend to remember the letters more easily than the case, i.e. you might remember "jump" but not remember if it's capitalized as "Jump" or "JuMP". So from the perspective of routing people to the right package, I think it is relevant. I don't have evidence handy for these claims though.

@KristofferC
Member

> I think they are not quite as different as two other letters ...

Yeah, I guess. But then s is more similar to a than l (keyboard distance) :P

@ericphanson
Member Author

> Yeah, I guess. But then s is more similar to a than l (keyboard distance) :P

Agreed :). IMO the right solution for typo-related concerns is a weighted DL distance, where the weights are empirically determined from typos produced by QWERTY[1] typists (mentioned in #10 somewhere). But StringDistances.jl doesn't do weighted DL yet, and I haven't found a weight matrix yet. Actually these are reasons I hadn't submitted a PR sooner. My hope is that with the right combination of weighted DL for typos and visual distances for malicious websites giving example code with tricksy package names (ref #10 (comment)), one could have a very low false-positive rate and still a good false-negative rate, by somewhat precisely determining if a package name could cause trouble or not.

I see special casing lowercase letters as a half-step towards that world.

[1]: I am actually not a QWERTY typist (I use colemak), but I know we are a small minority, so I think special casing QWERTY would still be a good step.
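
To make the weighted-distance idea concrete, here is a minimal sketch, assuming a caller-supplied substitution cost. The names `weighted_distance` and `case_aware_cost` and the weights themselves are hypothetical; empirically derived typo weights, as described above, would replace them, and transpositions are left out for brevity.

```julia
# Weighted Levenshtein distance with a caller-supplied substitution cost.
# Insertions and deletions cost 1; transpositions are not handled.
function weighted_distance(a::AbstractString, b::AbstractString, subcost)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    d = zeros(Float64, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for i in 1:m, j in 1:n
        d[i+1, j+1] = min(d[i, j+1] + 1,                  # deletion
                          d[i+1, j] + 1,                  # insertion
                          d[i, j] + subcost(s[i], t[j]))  # weighted substitution
    end
    return d[m+1, n+1]
end

# Hypothetical weights: a case-only difference (a missed shift key) costs 0.5,
# any other substitution costs 1.
case_aware_cost(x::Char, y::Char) =
    x == y ? 0.0 : (lowercase(x) == lowercase(y) ? 0.5 : 1.0)

println(weighted_distance("jump", "JuMP", case_aware_cost))
println(weighted_distance("Flux", "Mux", case_aware_cost))
```

A keyboard-distance variant would only need a different `subcost` function, for example one that looks up a cost table keyed by pairs of characters.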

@ericphanson
Member Author

> The tests keep failing; will need to figure out why.

I'm not actually sure where the logs are; can anyone point me to them? Or do I just need to try and reproduce locally?

@StefanKarpinski

Can we change the wording from "too similar" to just "similar"? Otherwise people are going to take this automated feedback as telling them they strictly may not call something this—which I can guarantee will cause some people to get unhappy—whereas all we're doing is requiring a manual review if a name is similar enough to an existing package.

@ericphanson
Member Author

Can we change the wording from "too similar" to just "similar"? Otherwise people are going to take this automated feedback as telling them they strictly may not call something this—which I can guarantee will cause some people to get unhappy—whereas all we're doing is requiring a manual review if a name is similar enough to an existing package.

Good call, just made that change. By the way, in the README I also added a reminder that this is deliberately conservative guidance, not a requirement; let me know if that can be phrased better.

If this PR is merged, we should likely also update the General registry README.

bors bot added a commit that referenced this pull request Oct 11, 2020
@ericphanson
Member Author

bors try-
bors try

bors bot added a commit that referenced this pull request Oct 12, 2020
@ericphanson
Member Author

bors try-
bors try

bors bot added a commit that referenced this pull request Oct 12, 2020
@bors
Contributor

bors bot commented Oct 12, 2020

try

Build failed:

@ericphanson
Member Author

Aha, an interesting failure: https://travis-ci.com/github/JuliaRegistries/RegistryCI.jl/jobs/398081414#L403-L405

I think what happens is that we run the code on the branch of the registry with the update committed, so the new package name is always already in the registry.

I wonder if the case of exactly duplicate names is already covered by #255?

@DilumAluthge
Member

> I wonder if the case of exactly duplicate names is already covered by #255?

No. You could have two packages with exactly the same name but different paths.

@DilumAluthge
Member

@ericphanson In a new package PR, where are you getting the list of existing package names?

@DilumAluthge
Member

AutoMerge specifically clones the master branch of the registry. You should refer to that clone to get the list of existing package names.

@DilumAluthge
Member

In other words, AutoMerge has two copies of the registry:

  1. The PR branch
  2. The master branch

Make sure that you are using the correct copy of the registry for each task. For some tasks you need to be looking at the PR branch, and for other tasks you need to be looking at the master branch.
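
The difference between the two copies can be illustrated with a short sketch. It assumes each clone has a `Registry.toml` with a `[packages]` table whose entries carry a `name` field, as in the General registry, and a Julia version that ships the `TOML` standard library; `package_names` and `new_names` are hypothetical helpers, not AutoMerge functions.

```julia
using TOML  # standard library in Julia 1.6 and later

# Read package names from the `[packages]` table of a registry's Registry.toml.
function package_names(registry_path::AbstractString)
    registry = TOML.parsefile(joinpath(registry_path, "Registry.toml"))
    return sort!([entry["name"] for entry in values(registry["packages"])])
end

# Hypothetical helper: names present in the PR branch clone but not yet on master.
new_names(head_path, master_path) =
    setdiff(package_names(head_path), package_names(master_path))
```

Comparing a proposed name against `package_names` of the PR branch clone would always find the name itself, which matches the failure described earlier in the thread; comparing against the master clone avoids that.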

@ericphanson
Member Author

ah, thanks @DilumAluthge! I did not understand that, and was using registry_head as the path to the registry to read off the packages. I've swapped to registry_master, so hopefully that should fix things.

bors try

bors bot added a commit that referenced this pull request Oct 12, 2020
@bors
Contributor

bors bot commented Oct 12, 2020

try

Build succeeded:

@ericphanson
Member Author

This is good to go from my end :)

@fredrikekre
Member

bors r+

@fredrikekre
Member

bors r-

bors bot added a commit that referenced this pull request Oct 12, 2020
274: Add ASCII check, distance check, visual distance check r=fredrikekre a=ericphanson

My attempt to fix #10 and close #273. There are three semi-arbitrarily chosen cutoffs that might need more tuning, and if any are hit, then the package is flagged.

1. DL distance is <= 1 (which would catch Websocket vs Websockets)
2. A normalized DL distance to catch long package names with more than 1 edit but only a few. I ended up going with a weird `5 + sqrt(max(len1, len2))` normalization just because it seemed like just dividing by the length made long packages get flagged too much.
3. Finally, there's the visual distance check, which can catch package names with more edits than allowed by the other checks if the edits are hard to distinguish visually, like `Jill` vs `JiII` (that's 2 edits of lowercase-ell to uppercase-eye, so the straight DL doesn't catch, short name so the normalized one doesn't catch it, but very similar looking letters, so the visual one catches it). I put a `DL <= 2` guard on the calculation so we don't have to perform the expensive visual check too often.

I also added an ASCII check; I saw that's in the guidelines but doesn't appear to be implemented.

I added some unit tests but not an integration test (out of laziness / time constraints).


P.S. https://ericphanson.github.io/VisualStringDistances.jl/dev/packagenames/ has some short discussion of VisualStringDistances for this problem, and https://github.com/ericphanson/VisualStringDistances.jl/tree/master/scripts/packagenames has some messy/exploratory code for playing around with distances and cutoffs.

---

DL = Damerau–Levenshtein distance

Co-authored-by: ericphanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@bors
Contributor

bors bot commented Oct 12, 2020

Canceled.

@fredrikekre
Member

fredrikekre commented Oct 12, 2020

Does bors support squashing nowadays? Can you squash otherwise?

@ericphanson
Member Author

Ah right, it does not (I looked into it a bit, and you can configure it to always squash or not, but not per PR, and we don’t have the “always squash” option set). I’ll squash it now.

Bump version

Lowercase edit distances, clean up code

Fixes from review; update checks

Add more details about name checks to README

Include all clashes in error message

Cleanup code

Update Project.toml

Co-authored-by: Dilum Aluthge <dilum@aluthge.com>

Tweak wording

Fix order in comment

Add some logging in the tests for Travis

Remove outer testset

Restore outer testset, add more detailed logging around distance checks

Allow inlining

add unused keyword argument to fix call signature

Always check ascii names

Fix typo

Fix another typo

another

swap `registry_head` -> `registry_master`

`of` -> `for`
@fredrikekre
Member

bors r+

bors bot added a commit that referenced this pull request Oct 12, 2020
274: Add ASCII check, distance check, visual distance check r=fredrikekre a=ericphanson

Co-authored-by: ericphanson <5846501+ericphanson@users.noreply.github.com>
@bors
Contributor

bors bot commented Oct 12, 2020

Build failed:

  • continuous-integration/travis-ci/push

@ericphanson
Member Author

ericphanson commented Oct 12, 2020

Nightly is failing on commit 8c03bf7 (https://travis-ci.com/github/JuliaRegistries/RegistryCI.jl/jobs/398306656#L170) which has the latest Pkg version bump from JuliaLang/julia#37992, and it looks like something is going wrong with that. It seems we're using Pkg internals in

r = Pkg.Operations.load_package_data(Base.UUID, depsfile, versions) isa rtype
which are out of date. Maybe we can allow failures on nightly and merge this, since the failure is unrelated?

Development

Successfully merging this pull request may close these issues.

Include VisualStringDistances.jl for CI

require minimal edit distance for new package names

5 participants