Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitive wildacard range #2183

Closed
jahav opened this issue May 22, 2019 — with docs.microsoft.com · 3 comments

Comments

Projects
None yet
5 participants
Copy link

commented May 22, 2019 — with docs.microsoft.com

Beware that wildcard ranges are not behaving same as expanded range even in simple cases with collate use. The page notes that 'In range searches, the characters included in the range may vary depending on the sorting rules of the collation.', but it doesn't work properly even for a simple case:

-- db collation Czech_CI_AS
create table People (Id int identity, GivenName nvarchar(100));
insert into People (GivenName) values ('Adam'), ('Bob'), ('cecilia');

select * from People person where person.GivenName collate SQL_Latin1_General_CP1_CS_AS like '[A-C]%'; -- result: Adam, Bob, cecilia
select * from People person where person.GivenName collate SQL_Latin1_General_CP1_CS_AS like '[ABC]%'; -- result: Adam, Bob


Document Details

Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

@srutzky

This comment has been minimized.

Copy link
Contributor

commented May 22, 2019

Hello @jahav . Actually, this is behaving correctly. The issue is simply a misunderstanding of how it should work, and an incomplete test that doesn't show the complete behavior. To be clear, the pattern [ABC] is just a multi-character wildcard, it is not a "range". It works similar to an IN list in that it will try to match against 'A' OR 'B' OR 'C'.

The [A-C] pattern is a range that will try to match against anything between 'A' AND 'C' (inclusive). So the question is: what characters are between A and C? Well, due to using a case-sensitive collation, the order of the characters is:

  1. a
  2. A
  3. b
  4. B
  5. c
  6. C

As you can see, lowercase "a" is outside of that range, and lowercase "c" is inside of that range. Your test included "c", but not "a". If we add sample data for "a" and "C", then we will see this in action (the following example is based on your code at the top of this issue):

-- DROP TABLE dbo.People;
CREATE TABLE dbo.People (Id INT IDENTITY, GivenName NVARCHAR(100));
INSERT INTO dbo.People (GivenName) VALUES (N'Adam'), (N'Bob'), (N'cecilia'), (N'aaron'),
      (N'Charles');


SELECT *
FROM   dbo.People
ORDER BY [GivenName] COLLATE Latin1_General_100_CS_AS_SC;
-- result: aaron, Adam, Bob, cecilia, Charles

SELECT *
FROM   dbo.People person
WHERE  person.GivenName COLLATE Latin1_General_100_CS_AS_SC LIKE N'[A-C]%'
ORDER BY [GivenName] COLLATE Latin1_General_100_CS_AS_SC;
-- result: Adam, Bob, cecilia, Charles

SELECT *
FROM   dbo.People person
WHERE  person.GivenName COLLATE Latin1_General_100_CS_AS_SC LIKE N'[A-c]%'
ORDER BY [GivenName] COLLATE Latin1_General_100_CS_AS_SC;
-- result: Adam, Bob, cecilia

SELECT *
FROM   dbo.People person
WHERE  person.GivenName COLLATE Latin1_General_100_CS_AS_SC LIKE N'[a-c]%'
ORDER BY [GivenName] COLLATE Latin1_General_100_CS_AS_SC;
-- result: aaron, Adam, Bob, cecilia

Please note that:

  1. When using NVARCHAR data, it is best to always prefix string literals with an uppercase "N", even if all characters are standard ASCII.
  2. It's best to always specify a schema name--it cuts down on errors and increases performance.
  3. It's best to use the newest collation possible for the culture you need. The SQL_* collations are very out of date and are missing sort weights for many thousands of characters and hundreds of uppercase/lowercase mappings. Additionally, there is the following concern if using VARCHAR data with a SQL_* collation: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types

To learn more about collations, please visit: Collations Info

@jahav

This comment has been minimized.

Copy link
Author

commented May 23, 2019

Thank you very much for a thorough answer. I get it now.

@jahav jahav closed this May 23, 2019

@srutzky

This comment has been minimized.

Copy link
Contributor

commented May 23, 2019

@jahav You are quite welcome. Please keep in mind that when using case-sensitive collations, ordering is not guaranteed to be lowercase first. Some collations, at least most of the SQL_* collations (and only when using VARCHAR data), sort uppercase first:

SELECT tmp.col
FROM (VALUES ('a'), ('A'), ('b'), ('B')) tmp(col)
WHERE tmp.col > 'a' COLLATE SQL_Latin1_General_CP1_CS_AS
ORDER BY tmp.col COLLATE SQL_Latin1_General_CP1_CS_AS;
-- B, b

SELECT tmp.col
FROM (VALUES (N'a'), (N'A'), (N'b'), (N'B')) tmp(col)
WHERE tmp.col > N'a' COLLATE SQL_Latin1_General_CP1_CS_AS
ORDER BY tmp.col COLLATE SQL_Latin1_General_CP1_CS_AS;
-- A, b, B
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.