-
Notifications
You must be signed in to change notification settings - Fork 2
/
dbo.eNGrams8K.sql
144 lines (117 loc) · 5.95 KB
/
dbo.eNGrams8K.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
USE sqlDevToolboxAB;
GO
IF OBJECT_ID('dbo.eNGrams8k', 'IF') IS NOT NULL DROP FUNCTION dbo.eNGrams8k;
GO
CREATE FUNCTION dbo.eNGrams8k
(
@string varchar(8000), -- Input string
@n int -- requested token size
)
/****************************************************************************************
Purpose:
This is the memory optimized version of ngrams8K (see: https://goo.gl/T3DDiY).
Like ngrams8k, engrams8K is s character-level N-Grams function that outputs a contiguous
stream of @n-sized tokens based on an input string (@string). Accepts strings up to 8000
varchar characters long. For more information about N-Grams see: https://goo.gl/CYTvTS.
Compatibility:
SQL Server 2014+
Prerequisites:
eNGrams8K requires an "electric" (memory optimized) tally table.
Below is example code that will do this on SQL Server 2014+
----------------------------------------------------------------------------------------
-- add memory optimized filegroup
alter database <yourdb,,>
add filegroup <filegroup name,,> contains memory_optimized_data;
alter database <yourdb,,>
add file
(
name = '<filename,,>',
filename = '<folder,,>/<filename,,>'
) to filegroup sqlDevToolboxAB_mod;
alter database <yourdb,,>
set MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT = on;
-- create and populate eTally
if object_id('dbo.eTally') is not null drop table eTally;
create table dbo.eTally (n int not null unique nonclustered)
with(memory_optimized = on, durability = schema_only);
insert dbo.eTally(n)
select top(10000) row_number() over (order by (select 1))
from sys.all_columns a, sys.all_columns b;
----------------------------------------------------------------------------------------
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.eNGrams8K(@string,@n);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.eNGrams8K(s.SomeValue,@n) ng;
Parameters:
@string = The input string to split into tokens.
@n = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = varchar(8000); a @n-sized character-level N-Gram token
Developer Notes:
1. eNGrams8K is not case sensitive
2. eNGrams8K uses a memory optimized tally table and therefore can only run with a
serial execution plan.
3. ORDER BY is required in the function for it to work correctly.
4. When @n is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @n are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@n > 0 AND @n <= DATALENGTH(@string)) OR (@n IS NULL OR @string IS NULL)
5. eNGrams8K can also be used as a Tally Table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use eNGrams8K to split
it into unigrams then only return the position column. eNGrams8K will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
SELECT N = position
FROM dbo.eNGrams8K(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
DECLARE @maxN int = 100;
SELECT N = @maxN-position+1
FROM dbo.eNGrams8K(REPLICATE(0,@maxN),1)
ORDER BY position;
6. eNGrams8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.eNGrams8K('abcd',1); -- unigrams (@n=1)
SELECT position, token FROM dbo.eNGrams8K('abcd',2); -- bigrams (@n=2)
SELECT position, token FROM dbo.eNGrams8K('abcd',3); -- trigrams (@n=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string varchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.eNGrams8K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170819 - Initial Development - Alan Burstein (original developed 20140310)
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
--iTally(N) AS -- my cte Tally Table
--(
-- SELECT TOP(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@n,1)-1)),0)))
-- ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
-- FROM L1 a CROSS JOIN L1 b -- cartesian for 8100 rows (90^2)
--)
SELECT TOP(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@n,1)-1)),0)))
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string,CAST(N AS int),@n) -- the @n-Sized token
FROM dbo.eTally
WHERE @n > 0 AND @n <= DATALENGTH(@string) -- Protection against bad parameter values
ORDER BY N;
GO