-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathre-tutorial.lhs
318 lines (265 loc) · 11.5 KB
/
re-tutorial.lhs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
The regex Tutorial
==================
This tutorial is a self-testing literate Haskell programme introducing
the vanilla API of the [regex package](http://hs.regex.uk). There
are other tutorials for explaining the more specialist aspects of regex
and you can load them into into you Haskell REPL of choice: see the
[regex Tutorials page](http://tutorial.regex.uk) for details.
Language Pragmas
----------------
The first thing you will have to do is enable `QuasiQuotes` as regex
uses them to check that REs are well-formed at compile time.
\begin{code}
{-# LANGUAGE QuasiQuotes #-}
{-# OPTIONS_GHC -fno-warn-missing-signatures #-}
\end{code}
If you are trying out examples interactively at the ghci prompt then you
will need
```
:seti -XQuasiQuotes
```
Importing the API
-----------------
\begin{code}
module Main(main) where
\end{code}
*********************************************************
*
* WARNING: this is generated from pp-tutorial-master.lhs
*
*********************************************************
Before importing the `regex` API into your Haskell script you will need
to answer two questions:
1. Which flavour of REs do I need? If you need Posix REs then the `TDFA`
is for you, otherwise it is the PCRE back end, which is housed in
a seperate `regex-with-pcre` package.
2. Which Haskell type is being used for the text I need to match? This
can influence as, at the time of writing, the `PCRE` `regex` back end
[does not support the`Text` types](https://github.com/iconnect/regex/issues/58).
The import statement will in general look like this
```
import Text.RE.<back-end>.<text-type>
```
As we have no interest in Posix/PCRE distinctions or performance here,
we have chosen to work with the `TDFA` back end with `String` types.
\begin{code}
import TestKit
import Text.RE.TDFA.String
\end{code}
You could also import `Text.RE.TDFA` or `Text.RE.PCRE` to get an API
in which the operators are overloaded over all text types accepted by
each of these back ends: see the [Tools Tutorial](re-tutorial-tools.html)
for details.
Single `Match` with `?=~`
-------------------------
The regex API provides two matching operators: one for looking for the first
match in its search string and the other for finding all of the matches. The
first-match operator, `?=~`, yields the result of attempting to find the first
match.
```
(?=~) :: String -> RE -> Match String
```
The boolean `matched` function,
```
matched :: Match a -> Bool
```
can be used to test whether a match was found:
\begin{code}
evalme_SGL_01 = checkThis "evalme_SGL_01" (True) $ matched $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
To get the matched text use `matchText`,
```
matchedText :: Match a -> Maybe a
```
which returns `Nothing` if no match was found in the search string:
\begin{code}
evalme_SGL_02 = checkThis "evalme_SGL_02" (Just "2016-01-09") $ matchedText $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
\begin{code}
evalme_SGL_03 = checkThis "evalme_SGL_03" (Nothing) $ matchedText $ "2015-12-5" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
Multiple `Matches` with `*=~`
-----------------------------
Use `*=~` to locate all of the non-overlapping substrings that match a RE,
```
(*=~) :: String -> RE -> Matches String
anyMatches :: Matches a -> Bool
```
`anyMatches` can be used to determine if any matches were found
\begin{code}
evalme_MLT_01 = checkThis "evalme_MLT_01" (True) $ anyMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
and `countMatches` will tell us how many sub-strings matched:
\begin{code}
evalme_MLT_02 = checkThis "evalme_MLT_02" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
`matches` will return all of the matches.
```
matches :: Natches a -> [a]
```
\begin{code}
evalme_MLT_03 = checkThis "evalme_MLT_03" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
The `regex` Macros and Parsers
------------------------------
regex supports macros in regular expressions. There are a bunch of
standard macros that you can just use, and you can define your own.
RE macros are enclosed in `@{` ... '}'. By convention the macros in
the standard environment start with a '%'. `@{%date}` will match an
ISO 8601 date, this
\begin{code}
evalme_MAC_00 = checkThis "evalme_MAC_00" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|@{%date}|]
\end{code}
will pick out the two dates.
There are also parsing functions for analysing the matched text. The
`@{%string}` macro will match quoted strings (in which double quotes can be
escaped with backslashes in the usual way) and its companion `parseString`
function will extract the string that was being quoted, interpreting any
escaped double quotes:
\begin{code}
evalme_MAC_01 = checkThisWith convertMaybeTextList "evalme_MAC_01" ([Just "foo",Just "bar", Just "\""]) $ map parseString $ matches $ "\"foo\", \"bar\" and a quote \"\\\"\"" *=~ [re|@{%string}|]
\end{code}
See the [macro tables page](http://macros.regex.uk) for details of the standard macros and their parsers.
See the [testbench tutorial](re-tutorial-testbench.html) for more on how
you can develop, document and test RE macros with the regex test bench.
Search and Replace
------------------
If you need to edit a string then `SearchReplace` `[ed|` ... `|]`
templates can be used with `?=~/` to replace a single instance or
`*=~/` to replace all matching instances.
\begin{code}
evalme_SRP_00 = checkThis "evalme_SRP_00" ("0x0000: 40AA fab0") $ "0000 40AA fab0" ?=~/ [ed|${adr}([0-9A-Fa-f]{4}):?///0x${adr}:|]
\end{code}
\begin{code}
evalme_SRP_01 = checkThis "evalme_SRP_01" ("0x0000: 0x40AA 0xfab0") $ "0000: 40AA fab0" *=~/ [ed|[0-9A-Fa-f]{4}///0x$0|]
\end{code}
Specifying Options
------------------
By default regular expressions are of the multi-line case-sensitive
variety so this
\begin{code}
evalme_SOP_01 = checkThis "evalme_SOP_01" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [re|[0-9a-f]{2}$|]
\end{code}
will find 2 matches, the '$' anchor matching each of the newlines, but only
the first two lowercase hex numbers matching the RE. The case sensitivity
and multiline-ness can be controled by selecting alternative parsers.
+--------------------------+-------------+-----------+----------------+
| long name | short forms | multiline | case sensitive |
+==========================+=============+===========+================+
| reMultilineSensitive | reMS, re | yes | yes |
+--------------------------+-------------+-----------+----------------+
| reMultilineInsensitive | reMI | yes | no |
+--------------------------+-------------+-----------+----------------+
| reBlockSensitive | reBS | no | yes |
+--------------------------+-------------+-----------+----------------+
| reBlockInsensitive | reBI | no | no |
+--------------------------+-------------+-----------+----------------+
So while the default setup
\begin{code}
evalme_SOP_02 = checkThis "evalme_SOP_02" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineSensitive|[0-9a-f]{2}$|]
\end{code}
finds 2 matches, a case-insensitive RE
\begin{code}
evalme_SOP_03 = checkThis "evalme_SOP_03" (4) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineInsensitive|[0-9a-f]{2}$|]
\end{code}
finds 4 matches, while a non-multiline RE
\begin{code}
evalme_SOP_04 = checkThis "evalme_SOP_04" (0) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockSensitive|[0-9a-f]{2}$|]
\end{code}
finds no matches but a non-multiline, case-insensitive match
\begin{code}
evalme_SOP_05 = checkThis "evalme_SOP_05" (1) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockInsensitive|[0-9a-f]{2}$|]
\end{code}
finds the final match.
For the hard of typing the shortforms are available.
\begin{code}
evalme_SOP_06 = checkThis "evalme_SOP_06" (True) $ matched $ "SuperCaliFragilisticExpialidocious" ?=~ [reMI|supercalifragilisticexpialidocious|]
\end{code}
Compiling and Escaping
----------------------
It is possible to compile a dynamically aquired RE string at run-time using
`compileRegex`:
```
compileRegex :: (Functor m, Monad m) => String -> m RE
```
\begin{code}
evalme_CPL_01 = checkThis "evalme_CPL_01" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegex "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}
These will compile the RE using the default multiline, case-sensitive options,
but you can specify the options dynamically using `compileRegexWith`:
```
compileRegexWith :: (Functor m, Monad m) => SimpleREOptions -> String -> m RE
```
where `SimpleREOptions` is a simple enumerated type.
%include "Text/RE/REOptions.lhs" "^data SimpleREOptions"
\begin{code}
evalme_CPL_02 = checkThis "evalme_CPL_02" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegexWith MultilineSensitive "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}
If you need to compile `SearchReplace` templates for use with `?=~/` and
`*=~/` then the `compileSearchReplace` and `compileSearchReplaceWith`,
```
compileSearchReplace :: (Monad m, Functor m, IsRegex RE s) => String -> String -> m (SearchReplace RE s)
compileSearchReplaceWith :: (Monad m, Functor m, IsRegex RE s) => SimpleREOptions -> String -> String -> m (SearchReplace RE s)
```
work analagously to `compileRegex` and `compileRegexWith`, with the RE
and replacement template (either side of the '///' in the `[ed|...///...|]`
quasi quoters) being passed into these functions in two separate strings,
to compile to the `SearchReplace` type expected by the `?=~/` and `*=~/`
operators.
%include "Text/RE/ZeInternals/Types/SearchReplace.lhs" "^data SearchReplace"
The `escape` and `escapeWith` functions are special compilers that compile
a string into a RE that should match itself, which is assumed to be embedded
in a complex RE to be compiled.
```
escape :: (Functor m, Monad m) => (String->String) -> String -> m RE
```
The function pased in the first argument to `escape` takes the RE string
that will match the string passed in the second argument and yields the
RE to be compiled, which is returned from the parsing action.
\begin{code}
evalme_CPL_03 = checkThis "evalme_CPL_03" ("foobar") $ "fooe{0}bar" *=~/ SearchReplace (maybe (error "evalme_CPL_03") id $ escape id "e{0}") ""
\end{code}
The Classic regex-base Match Operators
--------------------------------------
The original `=~` and `=~~` match operators are still available for
those that have mastered them.
\begin{code}
evalme_CLC_01 = checkThis "evalme_CLC_01" (True ) $ ("bar" =~ [re|(foo|bar)|] :: Bool)
\end{code}
\begin{code}
evalme_CLC_02 = checkThis "evalme_CLC_02" (False) $ ("quux" =~ [re|(foo|bar)|] :: Bool)
\end{code}
\begin{code}
evalme_CLC_03 = checkThis "evalme_CLC_03" (2) $ ("foobar" =~ [re|(foo|bar)|] :: Int)
\end{code}
\begin{code}
evalme_CLC_04 = checkThis "evalme_CLC_04" (Nothing) $ ("foo" =~~ [re|bar|] :: Maybe String)
\end{code}
\begin{code}
main :: IO ()
main = runTheTests
[ evalme_CLC_04
, evalme_CLC_03
, evalme_CLC_02
, evalme_CLC_01
, evalme_CPL_03
, evalme_CPL_02
, evalme_CPL_01
, evalme_SOP_06
, evalme_SOP_05
, evalme_SOP_04
, evalme_SOP_03
, evalme_SOP_02
, evalme_SOP_01
, evalme_SRP_01
, evalme_SRP_00
, evalme_MAC_01
, evalme_MAC_00
, evalme_MLT_03
, evalme_MLT_02
, evalme_MLT_01
, evalme_SGL_03
, evalme_SGL_02
, evalme_SGL_01
]
\end{code}