Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

find(next|prev|last|first) for failed matches #36768

Open
vtjnash opened this issue Jul 22, 2020 · 8 comments
Open

find(next|prev|last|first) for failed matches #36768

vtjnash opened this issue Jul 22, 2020 · 8 comments
Labels
domain:search & find The find* family of functions kind:bug Indicates an unexpected problem or unintended behavior kind:minor change Marginal behavior change acceptable for a minor release

Comments

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 22, 2020

Sometimes find is returning nothing when the match fails, sometimes it returns 0:-1—this is a bit confusing. There seems to be some uses of the something function in base/strings/search.jl, where there probably shouldn't be any uses. Discovered while trying to test dcd4f3e.

julia> findnext("", "abc", 1)
1:0

julia> findnext("", "abc", 2)
2:1

julia> findnext("", "abc", 3)
3:2

julia> findnext("", "abc", 4)
4:3

julia> findnext("", "abc", 5)
0:-1

julia> findnext("", "abc", 6)
0:-1

julia> findnext("a", "abc", 5)
ERROR: BoundsError: attempt to access 3-codeunit String at index [5]

julia> findnext("a", "abc", 3)

julia> findprev("a", "abc", 2)
1:1

julia> findprev("a", "abc", 1)
1:1

julia> findprev("a", "abc", 0)

julia> findprev("a", "abc", -1)
ERROR: BoundsError: attempt to access 3-codeunit String at index [-1]
@JeffBezanson JeffBezanson added the domain:search & find The find* family of functions label Jul 23, 2020
@StefanKarpinski StefanKarpinski added kind:bug Indicates an unexpected problem or unintended behavior kind:minor change Marginal behavior change acceptable for a minor release labels Jul 24, 2020
@StefanKarpinski
Copy link
Sponsor Member

This should absolutely be consistent—indicating no match two different ways is definitely a bug. It may, however, be a bug that people depend on, so while this should certainly be fixed, it should only be changed in a minor release.

@sostock
Copy link
Contributor

sostock commented Mar 6, 2021

The behavior has changed on current master. However, it’s still not consistent:

julia> findnext("", "abc", 5)
ERROR: BoundsError: attempt to access 3-element Vector{UInt8} at index [5]
[...]

julia> findprev("", "abc", -100)
-100:-101

julia> findprev("", "abc", 100)
0:-1

I think the BoundsError is somewhat weird, since we don’t actually need to access any elements in the case of an empty needle. But what’s really inconsistent is the findprev behavior (regardless of whether we want to throw a BoundsError or not). IMO, findprev("", "abc", -100) should either return nothing or throw a BoundsError and findprev("", "abc", 100) should either return 4:3 or throw a BoundsError.

In addition, I think the behavior of findprev is wrong even for starts that are not out of bounds. I have opened an issue for that: #39940.

@bicycle1885
Copy link
Member

bicycle1885 commented Mar 7, 2021

EDIT: I misunderstood the behavior of findprev. The following code is NOT correct.

Here is a suggestion of reference implementation (of course, suboptimal in terms of performance):

function findnext(a, b, i)
    i = max(i, firstindex(b))
    j = i + ncodeunits(a) - 1
    while checkbounds(Bool, b, i:j)
        a == b[i:j] && return i:j
        i += 1; j += 1
    end
    return nothing
end

function findprev(a, b, j)
    j = min(j, lastindex(b))
    i = j - ncodeunits(a) + 1
    while checkbounds(Bool, b, i:j)
        a == b[i:j] && return i:j
        i -= 1; j -= 1
    end
    return nothing
end

findfirst(a, b) = findnext(a, b, firstindex(b))
findlast(a, b) = findprev(a, b, lastindex(b))

I think throwing an exception is never useful when searching a string pattern, and the minimal, self-evident behavior that can be explained with a simple reference implementation as above is most consistent and thus least confusing.

@sostock
Copy link
Contributor

sostock commented Mar 8, 2021

Can you explain why you think that throwing an exception is not useful?

@bicycle1885
Copy link
Member

Because I cannot think of a situation where throwing an exception is useful?

@bicycle1885
Copy link
Member

bicycle1885 commented Mar 9, 2021

I think some formal description of the behavior is needed in order to find a consistent solution. Let a and b be strings and i be an integer. Both r = findnext(a, b, i) and r = findprev(a, b, i) must satisfy a == b[r] if r is not nothing. Moreover, according to the description of their docstirngs, findnext must satisfy i ≤ first(r) and findprev must satisfy i ≥ last(r) if r is not nothing. This still doesn't uniquely identify the result because there may be multiple ranges satisfying these conditions. So, in addition to the previous conditions, findnext must return the leftmost range (if any) and findprev must return the rightmost range (if any).

Here are more formal definitions:

a, b: string
i: integer

R_n = { r ∈ UnitRange | a == b[r] and first(r) ≥ i }
R_p = { r ∈ UnitRange | a == b[r] and last(r)  ≤ i }

findnext(a, b, i) = min R_n if R_n is not empty; otherwise nothing.
findprev(a, b, i) = max R_p if R_p is not empty; otherwise nothing.

There is no need to throw an exception of BoundsError at all. I think it is a kind of implementation details leaking to users.

This is okay. But we need to consider corner cases where a is an empty string. If a is empty, R_n and R_p contains infinitely many elements because a == b[r] is always satisfied if r is an empty range, which implies findall(a, b) be an unbounded iterator. Therefore, we should impose another constraint on R_n and R_p. A sensible choise would be bounding ranges within the range of b. That is, the following two constraints should be imposed: first(r) ≤ lastindex(b) + 1 and firstindex(b) - 1 ≤ last(r). The peculiar +1 and -1 are due to empty strings: findnext("", "", 1) must be 1:0. It is easy prove that the intersection of R_n (R_p) and the bounded range set includes finitely many elements for finite-length strings.

So, with this constraint, we can modify the previous definitions as follows:

Bounding condition:
B = { r ∈ UnitRange | first(r) ≤ lastindex(b) + 1 and firstindex(b) - 1 ≤ last(r) }

Modified versions:
findnext(a, b, i) = min R_n ∩ B if R_n ∩ B is not empty; otherwise nothing.
findprev(a, b, i) = min R_p ∩ B if R_p ∩ B is not empty; otherwise nothing.

And the following is a reference implementation of the modified versions:

function findnext(a, b, i)
    if i > lastindex(b) + 1
        return nothing
    elseif i < firstindex(b)
        i = firstindex(b)
    elseif i  thisind(b, i)
        i = nextind(b, i)
    end
    while true
        startswith(b[i:end], a) && return i:i+lastindex(a)-1
        i > lastindex(b) && return nothing
        i = nextind(b, i)
    end
end

function findprev(a, b, i)
    if i < firstindex(b) - 1
        return nothing
    elseif i > lastindex(b)
        i = lastindex(b)
    elseif i  thisind(b, i)
        i = thisind(b, i)
    end
    while true
        endswith(b[begin:i], a) && return i-lastindex(a)+1:i
        i < firstindex(b) && return nothing
        i = prevind(b, i)
    end
end

findfirst(a, b) = findnext(a, b, firstindex(b))
findlast(a, b) = findprev(a, b, lastindex(b))

@bicycle1885
Copy link
Member

Another strange behavior:

julia> findnext("≠", "≠    ", 2)
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'≠', [4]=>' '
Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base ./strings/string.jl:12
 [2] findnext(pred::Base.Fix2{typeof(isequal), Char}, s::String, i::Int64)
   @ Base ./strings/search.jl:11
 [3] _searchindex(s::String, t::String, i::Int64)
   @ Base ./strings/search.jl:192
 [4] _search
   @ ./strings/search.jl:266 [inlined]
 [5] findnext(t::String, s::String, start::Int64)
   @ Base ./strings/search.jl:302
 [6] top-level scope
   @ REPL[3]:1

julia> findnext(" ≠", "≠    ", 2)  # nothing

@jariji
Copy link
Contributor

jariji commented Jun 17, 2023

Related issue #29565

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:search & find The find* family of functions kind:bug Indicates an unexpected problem or unintended behavior kind:minor change Marginal behavior change acceptable for a minor release
Projects
None yet
Development

No branches or pull requests

6 participants