<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>fa_lex</title>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Courier;
panose-1:2 7 4 9 2 2 5 2 4 4;}
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"MS Mincho";
panose-1:2 2 6 9 4 2 5 8 3 4;}
@font-face
{font-family:"\@MS Mincho";
panose-1:2 2 6 9 4 2 5 8 3 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{font-family:"Times New Roman";
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
@page Section1
{size:595.3pt 841.9pt;
margin:56.7pt 42.5pt 56.7pt 85.05pt;}
div.Section1
{page:Section1;}
/* List Definitions */
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
-->
</style>
</head>
<body lang=RU link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal align=right style='text-align:right'><span lang=EN-US>25 July,
2007</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal align=center style='text-align:center'><b><span lang=EN-US
style='font-size:24.0pt'>Lexical analyzer (fa_lex)</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:18.0pt'>Introduction</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The lexical analyzer (the lexer) takes a sequence of characters and
returns a sequence of tokens, where each token is a meaningful unit identified
by its type and its boundaries.</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Anything that is not meaningful, such as spaces or new-line symbols
in C++, is normally discarded. Tokens can neither overlap nor contain one
another; in other words, each character belongs to at most one token. The
definition of a token depends on the language, as the examples below
show:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>For C++:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Input: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>if(++i==0) { j = 0; }</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Output: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>if/OP (/LBR ++/OP i/VAR ==/OP
0/NUM )/RBR {/LCBR j/VAR =/OP 0/NUM ;/OP }/RCBR</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Where {</span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>OP, LBR, RBR, VAR, NUM, LCBR, RCBR}
</span><span lang=EN-US>is a possible set of token types for C++.</span></p>
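<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The idea of mapping characters to typed, non-overlapping tokens can
be sketched in a few lines of Python. This is only an illustration built on
Python's re module, not fa_lex itself; the rule table, the function names and
the treatment of "if" as an OP are assumptions chosen to reproduce the C++
token sequence above, with LBR for the opening bracket as in the token type
set.</span></p>

```python
import re

# Hypothetical rule table (tag, regexp); an illustration, not the fa_lex rule set.
RULES = [
    ("NUM", r"[0-9]+"),
    ("OP", r"if\b|\+\+|==|=|;"),
    ("VAR", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("LBR", r"\("),
    ("RBR", r"\)"),
    ("LCBR", r"\{"),
    ("RCBR", r"\}"),
]
MASTER = re.compile("|".join(f"(?P<{tag}>{rx})" for tag, rx in RULES))


def tokenize(text):
    """Return (tag, start, end) triples; characters matching no rule are discarded."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            pos += 1  # not meaningful (for example a space): skip it
            continue
        tokens.append((m.lastgroup, m.start(), m.end()))
        pos = m.end()  # continue after the token, so tokens never overlap
    return tokens


def render(text, tokens):
    """Format tokens in the text/TAG notation used in this document."""
    return " ".join(f"{text[s:e]}/{tag}" for tag, s, e in tokens)
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Each token is identified by its type and its boundaries (start and
end offsets), and every character belongs to at most one token because the scan
always resumes at the end of the previous match.</span></p>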
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>For English:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Input: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>Pierre Vinken, 61 years old, will
join the board as a nonexecutive director Nov.29.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Output: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>Pierre/WORD Vinken/WORD ,/PUNKT
61/CD years/WORD old/WORD ,/PUNKT will/WORD join/WORD the/WORD board/WORD
as/WORD a/WORD nonexecutive/WORD director/WORD Nov./WORD 29/CD ./EOS</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Where: {</span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>WORD, PUNKT, CD, EOS}</span><span
lang=EN-US> is a possible set of token types for English.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:18.0pt'>Grammar of
fa_lex rules</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The lexer uses rules to identify the boundaries and types of the
tokens. Each rule describes one token in a context. The rules are based on
character regular expressions.</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Each rule consists of an optional left context description, the token
description, an optional right context description and the token type. The token
description is enclosed in angle brackets. The left context, right context and
token descriptions are character regular expressions. However, context
descriptions must not be cyclic (i.e. must not accept strings of unbounded
length) and the token description must not be empty (i.e. must not accept a
string of zero length). All rules are combined by an "or" operator.
The following grammar in Backus-Naur form formally describes the syntax of the
lexer rules.</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>GRAMMAR</b> ::= <b>RULES</b></span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'> GRAMMAR</span></b><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'> ::= <b>FUNCTIONS</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>RULES</b> ::= <b>RULE</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>RULES</b> ::= <b>RULE</b>\n <b>RULES</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>RULE</b> ::= <b>CONDITION</b> --> <b>ACTION</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>CONDITION</b> ::= Regexp* < Regexp > Regexp*</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>CONDITION</b> ::= < Regexp > Regexp*</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>CONDITION</b> ::= Regexp* < Regexp ></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>CONDITION</b> ::= < Regexp ></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>ACTION</b> ::= Tag</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>ACTION</b> ::= _call <b>FUNCTION_NAMES</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>ACTION</b> ::= Tag _call <b>FUNCTION_NAMES</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>FUNCTION_NAMES</b> ::= _main</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>FUNCTION_NAMES</b> ::= FnName</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>FUNCTION_NAMES</b> ::= FnName <b>FUNCTION_NAMES</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'> FUNCTIONS </span></b><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>::=<b> FUNCTION</b></span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'> FUNCTIONS </span></b><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>::=<b> FUNCTION</b>\n<b> FUNCTIONS</b></span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<b>FUNCTION</b> ::= _function FnName\n <b>RULES</b>\n _end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt'> </span></p>
<p class=MsoNormal><span lang=EN-US> </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>Regexp</span><span lang=EN-US> --
non-empty character-based regular expression</span></p>
<p class=MsoNormal><span lang=EN-US> </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>Regexp*</span><span lang=EN-US> --
acyclic character-based regular expression</span></p>
<p class=MsoNormal><span lang=EN-US> </span><span lang=DE style='font-size:
10.0pt;font-family:Courier'>Tag</span><span lang=DE> -- tag name (token type
name)</span></p>
<p class=MsoNormal><span lang=DE> </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>FnName</span><span lang=EN-US> -- function name, can
be one of the tags or a new name</span></p>
<p class=MsoNormal><span lang=EN-US> _function -- a keyword indicating the
beginning of a function</span></p>
<p class=MsoNormal><span lang=EN-US> _end -- a keyword indicating the end of
a function</span></p>
<p class=MsoNormal><span lang=EN-US> _call -- a keyword indicating a function
call</span></p>
<p class=MsoNormal><span lang=EN-US> _main -- a special function name
referring to the main rule set</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Example 1:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [0-9]+ > --> NUM</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
[0-9] < [-+*/] > [-]?[0-9] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [[:alpha:]]+ > --> VAR</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
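<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The role of the left and right contexts in Example 1 can be
approximated in Python with lookbehind and lookahead assertions, which are
checked but not consumed as part of the token. This is a sketch under stated
assumptions, not the fa_lex matching algorithm: the rules are tried in order at
each position, [A-Za-z] stands in for [[:alpha:]], unmatched characters are
dropped, and the helper names are hypothetical.</span></p>

```python
import re

# (left context, token, right context, tag), transcribed from Example 1.
# [A-Za-z] is an ASCII approximation of [[:alpha:]].
RULES = [
    ("", r"[0-9]+", "", "NUM"),
    (r"[0-9]", r"[-+*/]", r"-?[0-9]", "OP"),
    ("", r"[A-Za-z]+", "", "VAR"),
]


def compile_rule(left, token, right, tag):
    """Wrap the contexts in lookaround so that only the token is consumed."""
    pattern = ""
    if left:
        pattern += f"(?<={left})"  # Python requires a fixed-width lookbehind
    pattern += f"(?:{token})"
    if right:
        pattern += f"(?={right})"
    return re.compile(pattern), tag


COMPILED = [compile_rule(*rule) for rule in RULES]


def lex(text):
    """Return (token text, tag) pairs, scanning left to right."""
    out, pos = [], 0
    while pos < len(text):
        for rx, tag in COMPILED:
            m = rx.match(text, pos)  # lookbehind still sees text before pos
            if m:
                out.append((m.group(), tag))
                pos = m.end()
                break
        else:
            pos += 1  # no rule applies at this position: discard the character
    return out
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>With these rules an operator is tagged OP only when a digit precedes
it and an optionally negated digit follows it, so in the input a+1*-2 only the
asterisk becomes an OP token.</span></p>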
<p class=MsoNormal><b><span lang=EN-US>Example 2:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'>
< ([A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153])+[+-] >
[0-9] --> WORD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'>
</span><span lang=IT style='font-size:8.0pt;font-family:Courier'><
([0]?[0-9]|(1)[0-9]|(2)[0-4])[:]([0-5][0-9])([:]([0-5][0-9]))? > [^0-9]
--> HHMM</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
< ([\x0024\x00A2-\x00A5\x09F2\x09F3\x0E3F\x20A0\x20A2\x20A3\x20A4\x20A6-\x20AF])[\x0020\t]*</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
((0)|[1-9][0-9]*)[\x0020\t]*((</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
(AED)|(ARP)|(ATS)|(AUD)|(BBD)|(BEF)|(BGL)|(BHD)|(BMD)|(BRR)|(BRL)|(BSD)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (CAD)|(CHF)|(CLP)|(CNY)|(CSK)|(CYP)|(DEM)|(DKK)|(DJF)|(DZD)|(EGP)|(ESP)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (EUR)|(FIM)|(FJD)|(FRF)|(GBP)|(GRD)|(HKD)|(HUF)|(IDR)|(IEP)|(ILS)|(INR)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (IQD)|(ISK)|(ITL)|(JMD)|(JOD)|(JPY)|(KRW)|(KWD)|(LBP)|(LUF)|(LYD)|(MAD)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (MRO)|(MXP)|(MYR)|(NLG)|(NOK)|(NZD)|(OMR)|(PHP)|(PKR)|(PLN)|(PTE)|(QAR)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (ROL)|(RUR)|(SAR)|(SDD)|(SEK)|(SGD)|(SKK)|(SOS)|(SYP)|(SUR)|(THB)|(TND)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
| (TRL)|(TRY)|(TTD)|(TWD)|(USD)|(VEB)|(XEC)|(YER)|(ZAR)|(ZMK)|(DM)|(FF)</span></p>
<p class=MsoNormal><span lang=IT style='font-size:8.0pt;font-family:Courier'>
</span><span lang=EN-US style='font-size:8.0pt;font-family:Courier'>|
(\x20AC((uro)|(URO))[s]?)</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'>
)) > ([^.,0-9A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153])
--> CURR</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>The following rules are incorrect:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:8.0pt;font-family:Courier'>
< [-+*/] > <span style='color:red'>[-]?[0-9]+ </span>--> OP</span><span
lang=EN-US> ; the context should be acyclic</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< <span style='color:red'>[-+*/]*</span> > [-]?[0-9] --> OP</span><span
lang=EN-US> ; the token description must not allow empty tokens</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
<span style='color:red'>[-]?[0-9]+ --> CD</span></span><span lang=EN-US>
; the token description must be enclosed in angle brackets</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>The following are equivalent rule-sets:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> 1. The</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [-]?[0-9] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> and</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [-][0-9] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [0-9] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> 2. The</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [[:alpha:]]|[[:digit:]] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> and</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [[:alpha:]] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [[:digit:]] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> 3. The</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [-]? --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> and</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-indent:6.0pt'><span lang=EN-US>4. The</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [^a] --> OP</span></p>
<p class=MsoNormal style='text-indent:6.0pt'><span lang=EN-US>and</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [-] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [^a] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-indent:6.0pt'><span lang=EN-US>5. The</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > . --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> and</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [^a] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/] > [^b] --> OP</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
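<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The five equivalences above can be spot-checked with a small Python
sketch. It assumes that a right context is satisfied when it matches a prefix
of the text following the token, and that several rules with the same token and
tag act as an "or" over their contexts. Python's re module is only a stand-in
here: it does not model the fa_lex anchors, [A-Za-z] and [0-9] approximate
[[:alpha:]] and [[:digit:]], and the sample strings are arbitrary.</span></p>

```python
import re


def accepts(contexts, following):
    """True if any right context in the set matches a prefix of the following text."""
    return any(re.match(c, following) for c in contexts)


# (single-rule form, split form) for equivalences 1 to 5, right contexts only.
PAIRS = [
    ([r"-?[0-9]"], [r"-[0-9]", r"[0-9]"]),
    ([r"[A-Za-z]|[0-9]"], [r"[A-Za-z]", r"[0-9]"]),
    ([r"-?"], [r""]),
    ([r"[^a]"], [r"-", r"[^a]"]),
    ([r"."], [r"[^a]", r"[^b]"]),
]

SAMPLES = ["", "-", "a", "b", "7", "-7", "x", "--", "7a", "-a"]


def all_equivalent():
    """Compare both forms of every pair on every sample string."""
    return all(
        accepts(left, s) == accepts(right, s)
        for left, right in PAIRS
        for s in SAMPLES
    )
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Agreement on a handful of samples is of course not a proof of
equivalence; it only makes the intent of each pair concrete.</span></p>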
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Description of Functions:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Functions are isolated named sets of rules in fa_lex syntax. </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>If the action of a rule </span><span lang=EN-US style='font-family:
Courier'>R </span><span lang=EN-US>contains the </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>_call</span><span lang=EN-US>
keyword followed by a function name </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>FnName</span><span lang=EN-US> or by the </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>_main</span><span
lang=EN-US> keyword, then each time </span><span lang=EN-US style='font-family:
Courier'>R</span><span lang=EN-US> extracts a token, the rule set </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>FnName</span><span
lang=EN-US> or </span><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'>_main</span><span lang=EN-US> is applied to the token span. If </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>_call</span><span
lang=EN-US> is followed by a single function name, that function's rule set
extracts all possible non-overlapping tokens from the span. If </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>_call</span><span
lang=EN-US> is followed by more than one function name, each corresponding
rule set extracts just one token, one after another in sequence, with the
exception of the main rule set </span><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>_main</span><span lang=EN-US>, which always extracts all
possible tokens.</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>As the formal grammar defines, there are three types of actions: a)
tag assignment, b) function call, c) tag assignment and function call. If the
action is a function call without tag assignment, then no token corresponding to
the span is extracted. In this case "fa_lex" returns whatever the called
function outputs, which may be nothing.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Functions are optional. They may be used for hierarchical token
extraction (such as a date as a whole with its day, month and year as parts),
and also for wide context description and conflict resolution.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Examples of Functions:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The following function </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>HY_WORD</span><span lang=EN-US> is called to split
a hyphenated word into segments. Input: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>out-of-date</span><span lang=EN-US>,
output: </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>out/WORD -/WORD of/WORD
-/WORD date/WORD</span><span lang=EN-US>. No nested tokens are created.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_function
HY_WORD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[A-Za-z]+ > --> WORD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[-] > --> WORD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[A-Za-z]+([-][A-Za-z]+)+ > --> _call HY_WORD</span></p>
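<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The behaviour of a call to a single function without a tag can be
imitated in Python as follows. This is a rough sketch rather than the fa_lex
implementation: the outer regular expression selects the span, the inner one
stands for both HY_WORD rules and extracts every non-overlapping token inside
that span, and the outer span itself is not emitted because the action assigns
no tag. The function name lex_hyphenated is hypothetical.</span></p>

```python
import re

# Token description of the calling rule (selects the span to hand over).
OUTER = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")
# Both rules of the HY_WORD function combined; each assigns the tag WORD.
HY_WORD = re.compile(r"[A-Za-z]+|-")


def lex_hyphenated(text):
    """Apply HY_WORD to every span matched by the calling rule."""
    tokens = []
    for outer in OUTER.finditer(text):
        # Action is "_call HY_WORD" with no tag: only the inner tokens are kept.
        for inner in HY_WORD.finditer(text, outer.start(), outer.end()):
            tokens.append((inner.group(), "WORD"))
    return tokens
```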
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>In the following example the tag </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>ACR</span><span lang=EN-US>
is assigned and the function </span><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>ACR</span><span lang=EN-US> is called (a function and a
tag may share the same name). Input: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>A.B.C.</span><span lang=EN-US>,
output: </span><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>A.B.C./ACR
A./WORD B./WORD C./WORD</span><span lang=EN-US>. </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_function
ACR</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[A-Z][.] > --> WORD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
([A-Z][.])+ > --> ACR _call ACR</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The following functions </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>DAY, MONTH, YEAR </span><span lang=EN-US>are called
in sequence, one after another; each of them extracts just one token within the
span selected by the calling rule. Input: </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>12/13/2006 13/12/2006</span><span
lang=EN-US>, output: </span><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>12/13/2006/DATE_US 12/MONTH 13/DAY 2006/YEAR 13/12/2006/DATE_EU
13/DAY 12/MONTH 2006/YEAR</span><span lang=EN-US>. When the
input token matches both the </span><span lang=EN-US style='font-size:10.0pt;
font-family:Courier'>DATE_US</span><span lang=EN-US> and </span><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'>DATE_EU</span><span
lang=EN-US> rules, fa_lex prefers the tag name with the smaller value, so
depending on the tagset definition </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>DATE_US</span><span lang=EN-US> may be preferred to
</span><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>DATE_EU</span><span
lang=EN-US> or vice versa (see Conflict resolution rules).</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_function
MONTH</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[0-9][0-9] > --> MONTH</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_function
DAY</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[0-9][0-9] > --> DAY</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_function
YEAR</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[0-9][0-9][0-9]?[0-9]? > --> YEAR</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>_end</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[01][0-9][/][0123][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_US _call
MONTH DAY YEAR</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'><
[0123][0-9][/][01][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_EU _call DAY
MONTH YEAR</span></p>
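<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>A call with several function names can be pictured with the
following Python sketch, in which each function takes exactly one token from
the span and the next function continues where the previous one stopped. It
covers only the DATE_US rule, ignores conflict resolution between DATE_US and
DATE_EU, and uses hypothetical helper names; it is an assumption about the
observable behaviour, not a description of how fa_lex is implemented.</span></p>

```python
import re

# One regexp per function, copied from the DAY, MONTH and YEAR rule sets.
FUNCTIONS = {
    "DAY": re.compile(r"[0-9][0-9]"),
    "MONTH": re.compile(r"[0-9][0-9]"),
    "YEAR": re.compile(r"[0-9][0-9][0-9]?[0-9]?"),
}

# Token description of the DATE_US rule.
DATE_US = re.compile(r"[01][0-9]/[0123][0-9]/[0-9][0-9][0-9]?[0-9]?")


def call_in_sequence(text, start, end, names):
    """Each named function extracts one token, left to right, inside [start, end)."""
    tokens, pos = [], start
    for name in names:
        m = FUNCTIONS[name].search(text, pos, end)
        if m is None:
            break
        tokens.append((m.group(), name))
        pos = m.end()  # the next function resumes after this token
    return tokens


def lex_date_us(text):
    """Emit the DATE_US token followed by the tokens of MONTH, DAY and YEAR."""
    m = DATE_US.search(text)
    if m is None:
        return []
    outer = [(m.group(), "DATE_US")]
    return outer + call_in_sequence(text, m.start(), m.end(), ["MONTH", "DAY", "YEAR"])
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Because the action both assigns a tag and calls functions, the
result contains the whole date as well as its parts, which is the hierarchical
extraction described earlier.</span></p>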
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Extra syntax notes:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>1. Blank characters have no meaning for fa_lex and are simply
ignored. To match any of these characters, use one of the following
constructions: </span><span lang=EN-US style='font-size:
10.0pt;font-family:Courier'>\t, \n, \r, \f, \v, [ ], [\t], [\n], [\r], [\f],
[\v], \x20, \x09, \x0D, \x0A, [\x20], [\x09], [\x0D], [\x0A], [[:blank:]],
[[:space:]] </span><span lang=EN-US>and so on</span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>.</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>2. Left (</span><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'>^</span><span lang=EN-US>) and right (</span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>$</span><span lang=EN-US>) anchors
are ordinary symbols for the lexer; they can be included in either context as
well as in the token definition. Where possible, including them in the token
definition is preferable. The any-symbol (i.e. </span><span lang=EN-US
style='font-family:Courier'>.</span><span lang=EN-US>) matches both
anchors; the negation of a character (e.g. </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>[^a]</span><span lang=EN-US>) also
matches either anchor.</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>3. Character classes:</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'> [:alnum:]
[:alpha:] [:lower:] [:xdigit:]</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'> [:digit:]
[:space:] [:upper:] [:print:]</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US style='font-size:10.0pt;font-family:Courier'> [:punct:]
[:blank:] [:cntrl:] [:graph:]</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>are defined as in the POSIX "C" locale and have to be extended
to the Unicode range, if necessary.</span></p>
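<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>For reference, the POSIX "C" locale membership of these classes can
be written out as plain ASCII sets. The following Python sketch is only an
illustration of the standard definitions; the names POSIX_C and matches_class
are hypothetical and are not part of fa_lex, and it is an assumption that
fa_lex follows these definitions exactly.</span></p>

```python
import string

# POSIX "C" locale character classes, written out as ASCII sets.
# POSIX_C and matches_class are illustrative names, not fa_lex identifiers.
POSIX_C = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "lower": set(string.ascii_lowercase),
    "upper": set(string.ascii_uppercase),
    "digit": set(string.digits),
    "xdigit": set(string.hexdigits),
    "space": set(" \t\n\r\f\v"),
    "blank": set(" \t"),
    "punct": set(string.punctuation),
    "cntrl": {chr(c) for c in range(0x20)} | {"\x7f"},
    "graph": {chr(c) for c in range(0x21, 0x7F)},
    "print": {chr(c) for c in range(0x20, 0x7F)},
}


def matches_class(ch, name):
    """Return True if the single character ch belongs to the class [:name:]."""
    return ch in POSIX_C[name]


print(matches_class("\t", "blank"))  # True
print(matches_class("\n", "blank"))  # False
print(matches_class("\n", "space"))  # True
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Every set above is limited to the ASCII range, which is why the
classes need to be extended when the input contains other Unicode characters.</span></p>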
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>4. See the POSIX 1003.2 standard for regular expressions for more
details on the regular expression syntax (<a
href="http://www.unusualresearch.com/regex/regexmanpage.htm">http://www.unusualresearch.com/regex/regexmanpage.htm</a>
).</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Compilation:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The lexer compiler fa_build_lex takes two input files: a rule-set
and a tagset. The tagset is a list of symbolic names of token types, each
associated with a numerical value. The tagset can be shared with other
modules, such as POS tagging in the case of natural-language analysis.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>Suppose the file </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>lex_rules.utf8</span><span
lang=EN-US> contains the following rule-set:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
[0-9] < [-+*/] > [-]?[0-9] --> PUNKT</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-]?[0-9]+ > --> CD</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
< [-+*/]+ > --> WORD</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>The </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>tagset.txt</span><span lang=EN-US>
contains the following tagset:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
CD 1</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
PUNKT 2</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
WORD 3</span></p>
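<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>A client program that needs the same name-to-value mapping could
read the tagset as follows. This is a minimal Python sketch, assuming each line
holds one symbolic name and one integer separated by whitespace, as in the
listing above; the function name parse_tagset is hypothetical, and any further
syntax fa_build_lex may accept in tagset files is not handled.</span></p>

```python
def parse_tagset(text):
    """Map each symbolic tag name to its numeric value.

    Assumes one "NAME VALUE" pair per line, separated by whitespace.
    """
    tags = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything that is not a pair
        name, value = parts
        tags[name] = int(value)
    return tags


tagset = parse_tagset("CD 1\nPUNKT 2\nWORD 3\n")
print(tagset)  # {'CD': 1, 'PUNKT': 2, 'WORD': 3}
```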
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>The following command will compile the rule-set:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
fa_build_lex --in=lex_rules.utf8 --out=lex_rules.dump \</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
--build-dump --tagset=tagset.txt</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The </span><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'>--build-dump</span><span lang=EN-US> parameter produces a memory-dump
representation of the compiled rule-set; without this parameter, the compiled
rule-set is stored in a textual representation. For a description of all
switches, type: </span><span lang=EN-US style='font-size:10.0pt;font-family:
Courier'>fa_build_lex --help</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US>If there are no compilation errors, the
output file </span><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>lex_rules.dump</span><span
lang=EN-US> is created.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:18.0pt'>Lexical
analysis</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>As noted earlier, lexical analysis is the process of converting
a sequence of characters into a sequence of tokens, where each token is a
meaningful unit identified by its type and its boundaries. Everything that is
not a token is ignored. The output tokens cannot overlap or include one
another; in other words, each character of the input text belongs to at most
one token. To guarantee this, fa_lex must be able to
prefer one match over another when more than one rule matches a given character
of the input text. This is handled by the conflict resolution rules (see
below).</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Conflict resolution for matching rules:</span></b></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>fa_lex applies the following criteria, in order, to select which rule
to execute when more than one rule matches the text:</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>1. The leftmost rule,</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>2. The rule with the longest span,</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>3. The rule with the smallest left context,</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>4. The rule with the smallest right context,</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>5. The rule with no tag assignment (just
function call)</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>6. The rule with the smaller tag value</span></p>
<p class=MsoNormal style='margin-top:0in;margin-right:44.75pt;margin-bottom:
0in;margin-left:27.0pt;margin-bottom:.0001pt;text-align:justify;text-justify:
inter-ideograph'><span lang=EN-US>7. The rule with the lexicographically smaller
list of function names (based on their values)</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Conflict resolution rules #1 and #2 are common to many lexical
analyzer implementations (including lex and flex). Rules #3 to #7 are
specific to fa_lex. Unlike in lex/flex, rule order plays no role in
conflict resolution in fa_lex, so it does not matter in which
order the rules are specified.</span></p>
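<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The seven criteria form a strict priority cascade: a later criterion
is consulted only when all earlier ones tie. One way to picture this is as a
sort key, sketched below in Python. This is an illustration, not the fa_lex
implementation. The Match record and its field names are hypothetical, and how
fa_lex measures the start position and the span when contexts are present is
not specified here, so the sketch simply takes them as given numbers.</span></p>

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Match:
    """One candidate rule match (hypothetical record, for illustration)."""
    start: int            # position where the match begins
    span: int             # length of the matched span
    left_ctx: int         # length of the left context
    right_ctx: int        # length of the right context
    tag: Optional[int]    # numeric tag value, None for a function-call-only rule
    funcs: Tuple[int, ...] = ()  # numeric values of the called functions


def priority(m):
    """Sort key mirroring criteria 1 to 7; the smallest key wins."""
    return (
        m.start,                            # 1. leftmost
        -m.span,                            # 2. longest span
        m.left_ctx,                         # 3. smallest left context
        m.right_ctx,                        # 4. smallest right context
        m.tag is not None,                  # 5. no tag assignment first
        m.tag if m.tag is not None else 0,  # 6. smaller tag value
        m.funcs,                            # 7. lexicographically smaller list
    )


def pick(candidates):
    """Return the candidate preferred by the cascade."""
    return min(candidates, key=priority)


short = Match(start=0, span=3, left_ctx=0, right_ctx=0, tag=1)
long_ctx = Match(start=0, span=5, left_ctx=1, right_ctx=0, tag=2)
long_plain = Match(start=0, span=5, left_ctx=0, right_ctx=0, tag=3)

print(pick([short, long_ctx]) == long_ctx)        # True: longer span wins
print(pick([long_ctx, long_plain]) == long_plain)  # True: smaller left context wins
```

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Because the key is built only from properties of the match itself,
the position of a rule in the rule-set file never enters the comparison.</span></p>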
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US>Runtime Execution:</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Lexical analysis of text can be performed by the stand-alone
program </span><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>fa_lex</span><span
lang=EN-US>. It takes two mandatory parameters, a compiled rule-set and a
tagset. It reads raw text from stdin or from an input file and prints
the extracted tokens to stdout or to an output file in the tagged-text format. The
output can be piped to programs such as </span><span lang=EN-US
style='font-size:10.0pt;font-family:Courier'>fa_ts2ps, fa_gcd, fa_ts2stat</span><span
lang=EN-US> or any other program that understands this format.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The following commands perform lexical analysis of the input text
using the grammar defined and compiled in the previous section:</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
$ echo 23452345+34534 | fa_lex --tagset=tagset.txt --stage=lex_rules.dump</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
> 23452345/CD +/PUNKT 34534/CD</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
$ echo 23452345+34534+ | fa_lex --tagset=tagset.txt --stage=lex_rules.dump</span></p>
<p class=MsoNormal><span lang=EN-US style='font-size:10.0pt;font-family:Courier'>
> 23452345/CD +/PUNKT 34534/CD +/WORD</span></p>
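<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>A downstream program can split such output back into token and tag
pairs. The Python sketch below assumes, based only on the examples above, that
items are separated by whitespace and that the tag follows the last slash of
each item; the function name parse_tagged is hypothetical, and tokens that
themselves contain whitespace are not handled.</span></p>

```python
def parse_tagged(line):
    """Split a tagged-text line into (token, tag) pairs.

    The tag is taken after the LAST slash, so a token that is itself
    a slash (written as "//PUNKT") is still split correctly.
    """
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


print(parse_tagged("23452345/CD +/PUNKT 34534/CD +/WORD"))
# [('23452345', 'CD'), ('+', 'PUNKT'), ('34534', 'CD'), ('+', 'WORD')]
```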
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:18.0pt'>FAQ</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:14.0pt'>1. Why use
fa_lex?</span></b></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>The fa_lex lexical analyzer does not require rule authors to write
C/C++ code or even to have a C/C++ compiler installed. In fa_lex, the rule-sets
are purely declarative. This allows authors (usually linguists) to focus on the
linguistic aspects of the problem and stay isolated from the actual
implementation. By design, the rules cannot contain hard-to-understand logic;
they are more independent of each other and thus easier to maintain than
in other lexer programs.</span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:14.0pt'>2. How is fa_lex
different from flex?</span></b></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Efficiency aspects:</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>In fa_lex, the compiled automata (i.e. the tokenization logic) are
separated from the client code. A C/C++ program, or even a C#, Ruby or Perl
program, can use the exact same tokenization automata.</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Depending on how the automaton structure is represented in
memory, the fa_lex approach allows trading speed against size for
the same tokenization.</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Unlike in flex, the result does not depend on rule order; there
is no such thing as rule priority. Conflicts are resolved based on the span,
the token size, and the token type only; see the <b>Conflict resolution for
matching rules</b> section.</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Due to the previous points, the fa_lex lexical analyzer is
smaller and faster than one based on flex. It is also possible
to gain more speed by using more space, or to use less space at the cost of
lower speed (i.e. speed/size balancing).</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>fa_lex compiles faster than flex (the automata creation
stage only). This is mainly due to two reasons: the different semantics of
rule actions, and better optimization for big grammars. The difference
can be significant: 2 minutes vs. 2 hours on the same machine for the same
grammar.</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>Authoring aspects:</span></p>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US>fa_lex has optional functions, which can be used to describe
complex contexts or to extract nested tokens.</span></li>
</ul>
<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
lang=EN-US> </span></p>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span