-
Notifications
You must be signed in to change notification settings - Fork 1
/
tutor9.txt
821 lines (576 loc) · 26.1 KB
/
tutor9.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
16 April 1989
Part IX: A TOP VIEW
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
In the previous installments, we have learned many of the
techniques required to build a full-blown compiler. We've done
both assignment statements (with Boolean and arithmetic
expressions), relational operators, and control constructs. We
still haven't addressed procedure or function calls, but even so
we could conceivably construct a mini-language without them.
I've always thought it would be fun to see just how small a
language one could build that would still be useful. We're
ALMOST in a position to do that now. The problem is: though we
know how to parse and translate the constructs, we still don't
know quite how to put them all together into a language.
In those earlier installments, the development of our programs
had a decidedly bottom-up flavor. In the case of expression
parsing, for example, we began with the very lowest level
constructs, the individual constants and variables, and worked
our way up to more complex expressions.
Most people regard the top-down design approach as being better
than the bottom-up one. I do too, but the way we did it
certainly seemed natural enough for the kinds of things we were
parsing.
You mustn't get the idea, though, that the incremental approach
that we've been using in all these tutorials is inherently
bottom-up. In this installment I'd like to show you that the
approach can work just as well when applied from the top down ...
maybe better. We'll consider languages such as C and Pascal, and
see how complete compilers can be built starting from the top.
In the next installment, we'll apply the same technique to build
a complete translator for a subset of the KISS language, which
I'll be calling TINY. But one of my goals for this series is
that you will not only be able to see how a compiler for TINY or
KISS works, but that you will also be able to design and build
compilers for your own languages. The C and Pascal examples will
help. One thing I'd like you to see is that the natural
structure of the compiler depends very much on the language being
translated, so the simplicity and ease of construction of the
compiler depends very much on letting the language set the
program structure.
It's a bit much to produce a full C or Pascal compiler here, and
we won't try. But we can flesh out the top levels far enough so
that you can see how it goes.
Let's get started.
THE TOP LEVEL
One of the biggest mistakes people make in a top-down design is
failing to start at the true top. They think they know what the
overall structure of the design should be, so they go ahead and
write it down.
Whenever I start a new design, I always like to do it at the
absolute beginning. In program design language (PDL), this top
level looks something like:
begin
solve the problem
end
OK, I grant you that this doesn't give much of a hint as to what
the next level is, but I like to write it down anyway, just to
give me that warm feeling that I am indeed starting at the top.
For our problem, the overall function of a compiler is to compile
a complete program. Any definition of the language, written in
BNF, begins here. What does the top level BNF look like? Well,
that depends quite a bit on the language to be translated. Let's
take a look at Pascal.
THE STRUCTURE OF PASCAL
Most texts for Pascal include a BNF or "railroad-track"
definition of the language. Here are the first few lines of one:
<program> ::= <program-header> <block> '.'
<program-header> ::= PROGRAM <ident>
<block> ::= <declarations> <statements>
We can write recognizers to deal with each of these elements,
just as we've done before. For each one, we'll use our familiar
single-character tokens to represent the input, then flesh things
out a little at a time. Let's begin with the first recognizer:
the program itself.
To translate this, we'll start with a fresh copy of the Cradle.
Since we're back to single-character names, we'll just use a 'p'
to stand for 'PROGRAM.'
To a fresh copy of the cradle, add the following code, and insert
a call to it from the main program:
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
var Name: char;
begin
Match('p'); { Handles program header part }
Name := GetName;
Prolog(Name);
Match('.');
Epilog(Name);
end;
{--------------------------------------------------------------}
The procedures Prolog and Epilog perform whatever is required to
let the program interface with the operating system, so that it
can execute as a program. Needless to say, this part will be
VERY OS-dependent. Remember, I've been emitting code for a 68000
running under the OS I use, which is SK*DOS. I realize most of
you are using PC's and would rather see something else, but I'm
in this thing too deep to change now!
Anyhow, SK*DOS is a particularly easy OS to interface to. Here
is the code for Prolog and Epilog:
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
EmitLn('WARMST EQU $A01E');
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog(Name: char);
begin
EmitLn('DC WARMST');
EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}
As usual, add this code and try out the "compiler." At this
point, there is only one legal input:
px. (where x is any single letter, the program name)
Well, as usual our first effort is rather unimpressive, but by
now I'm sure you know that things will get more interesting.
There is one important thing to note: THE OUTPUT IS A WORKING,
COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).
This is very important. The nice feature of the top-down
approach is that at any stage you can compile a subset of the
complete language and get a program that will run on the target
machine. From here on, then, we need only add features by
fleshing out the language constructs. It's all very similar to
what we've been doing all along, except that we're approaching it
from the other end.
FLESHING IT OUT
To flesh out the compiler, we only have to deal with language
features one by one. I like to start with a stub procedure that
does nothing, then add detail in incremental fashion. Let's
begin by processing a block, in accordance with its PDL above.
We can do this in two stages. First, add the null procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }
procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}
and modify Prog to read:
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
var Name: char;
begin
Match('p');
Name := GetName;
Prolog;
DoBlock(Name);
Match('.');
Epilog(Name);
end;
{--------------------------------------------------------------}
That certainly shouldn't change the behavior of the program, and
it doesn't. But now the definition of Prog is complete, and we
can proceed to flesh out DoBlock. That's done right from its BNF
definition:
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }
procedure DoBlock(Name: char);
begin
Declarations;
PostLabel(Name);
Statements;
end;
{--------------------------------------------------------------}
The procedure PostLabel was defined in the installment on
branches. Copy it into your cradle.
I probably need to explain the reason for inserting the label
where I have. It has to do with the operation of SK*DOS. Unlike
some OS's, SK*DOS allows the entry point to the main program to
be anywhere in the program. All you have to do is to give that
point a name. The call to PostLabel puts that name just before
the first executable statement in the main program. How does
SK*DOS know which of the many labels is the entry point, you ask?
It's the one that matches the END statement at the end of the
program.
OK, now we need stubs for the procedures Declarations and
Statements. Make them null procedures as we did before.
Does the program still run the same? Then we can move on to the
next stage.
DECLARATIONS
The BNF for Pascal declarations is:
<declarations> ::= ( <label list> |
<constant list> |
<type list> |
<variable list> |
<procedure> |
<function> )*
(Note that I'm using the more liberal definition used by Turbo
Pascal. In the standard Pascal definition, each of these parts
must be in a specific order relative to the rest.)
As usual, let's let a single character represent each of these
declaration types. The new form of Declarations is:
{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }
procedure Declarations;
begin
while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
case Look of
'l': Labels;
'c': Constants;
't': Types;
'v': Variables;
'p': DoProcedure;
'f': DoFunction;
end;
end;
{--------------------------------------------------------------}
Of course, we need stub procedures for each of these declaration
types. This time, they can't quite be null procedures, since
otherwise we'll end up with an infinite While loop. At the very
least, each recognizer must eat the character that invokes it.
Insert the following procedures:
{--------------------------------------------------------------}
{ Process Label Statement }
procedure Labels;
begin
Match('l');
end;
{--------------------------------------------------------------}
{ Process Const Statement }
procedure Constants;
begin
Match('c');
end;
{--------------------------------------------------------------}
{ Process Type Statement }
procedure Types;
begin
Match('t');
end;
{--------------------------------------------------------------}
{ Process Var Statement }
procedure Variables;
begin
Match('v');
end;
{--------------------------------------------------------------}
{ Process Procedure Definition }
procedure DoProcedure;
begin
Match('p');
end;
{--------------------------------------------------------------}
{ Process Function Definition }
procedure DoFunction;
begin
Match('f');
end;
{--------------------------------------------------------------}
Now try out the compiler with a few representative inputs. You
can mix the declarations any way you like, as long as the last
character in the program is'.' to indicate the end of the
program. Of course, none of the declarations actually declare
anything, so you don't need (and can't use) any characters other
than those standing for the keywords.
We can flesh out the statement part in a similar way. The BNF
for it is:
<statements> ::= <compound statement>
<compound statement> ::= BEGIN <statement>
(';' <statement>) END
Note that statements can begin with any identifier except END.
So the first stub form of procedure Statements is:
{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }
procedure Statements;
begin
Match('b');
while Look <> 'e' do
GetChar;
Match('e');
end;
{--------------------------------------------------------------}
At this point the compiler will accept any number of
declarations, followed by the BEGIN block of the main program.
This block itself can contain any characters at all (except an
END), but it must be present.
The simplest form of input is now
'pxbe.'
Try it. Also try some combinations of this. Make some
deliberate errors and see what happens.
At this point you should be beginning to see the drill. We begin
with a stub translator to process a program, then we flesh out
each procedure in turn, based upon its BNF definition. Just as
the lower-level BNF definitions add detail and elaborate upon the
higher-level ones, the lower-level recognizers will parse more
detail of the input program. When the last stub has been
expanded, the compiler will be complete. That's top-down
design/implementation in its purest form.
You might note that even though we've been adding procedures, the
output of the program hasn't changed. That's as it should be.
At these top levels there is no emitted code required. The
recognizers are functioning as just that: recognizers. They are
accepting input sentences, catching bad ones, and channeling good
input to the right places, so they are doing their job. If we
were to pursue this a bit longer, code would start to appear.
The next step in our expansion should probably be procedure
Statements. The Pascal definition is:
<statement> ::= <simple statement> | <structured statement>
<simple statement> ::= <assignment> | <procedure call> | null
<structured statement> ::= <compound statement> |
<if statement> |
<case statement> |
<while statement> |
<repeat statement> |
<for statement> |
<with statement>
These are starting to look familiar. As a matter of fact, you
have already gone through the process of parsing and generating
code for both assignment statements and control structures. This
is where the top level meets our bottom-up approach of previous
sessions. The constructs will be a little different from those
we've been using for KISS, but the differences are nothing you
can't handle.
I think you can get the picture now as to the procedure. We
begin with a complete BNF description of the language. Starting
at the top level, we code up the recognizer for that BNF
statement, using stubs for the next-level recognizers. Then we
flesh those lower-level statements out one by one.
As it happens, the definition of Pascal is very compatible with
the use of BNF, and BNF descriptions of the language abound.
Armed with such a description, you will find it fairly
straightforward to continue the process we've begun.
You might have a go at fleshing a few of these constructs out,
just to get a feel for it. I don't expect you to be able to
complete a Pascal compiler here ... there are too many things
such as procedures and types that we haven't addressed yet ...
but it might be helpful to try some of the more familiar ones.
It will do you good to see executable programs coming out the
other end.
If I'm going to address those issues that we haven't covered yet,
I'd rather do it in the context of KISS. We're not trying to
build a complete Pascal compiler just yet, so I'm going to stop
the expansion of Pascal here. Let's take a look at a very
different language.
THE STRUCTURE OF C
The C language is quite another matter, as you'll see. Texts on
C rarely include a BNF definition of the language. Probably
that's because the language is quite hard to write BNF for.
One reason I'm showing you these structures now is so that I can
impress upon you these two facts:
(1) The definition of the language drives the structure of the
compiler. What works for one language may be a disaster for
another. It's a very bad idea to try to force a given
structure upon the compiler. Rather, you should let the BNF
drive the structure, as we have done here.
(2) A language that is hard to write BNF for will probably be
hard to write a compiler for, as well. C is a popular
language, and it has a reputation for letting you do
virtually anything that is possible to do. Despite the
success of Small C, C is _NOT_ an easy language to parse.
A C program has less structure than its Pascal counterpart. At
the top level, everything in C is a static declaration, either of
data or of a function. We can capture this thought like this:
<program> ::= ( <global declaration> )*
<global declaration> ::= <data declaration> |
<function>
In Small C, functions can only have the default type int, which
is not declared. This makes the input easy to parse: the first
token is either "int," "char," or the name of a function. In
Small C, the preprocessor commands are also processed by the
compiler proper, so the syntax becomes:
<global declaration> ::= '#' <preprocessor command> |
'int' <data list> |
'char' <data list> |
<ident> <function body> |
Although we're really more interested in full C here, I'll show
you the code corresponding to this top-level structure for Small
C.
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
begin
while Look <> ^Z do begin
case Look of
'#': PreProc;
'i': IntDecl;
'c': CharDecl;
else DoFunction(Int);
end;
end;
end;
{--------------------------------------------------------------}
Note that I've had to use a ^Z to indicate the end of the source.
C has no keyword such as END or the '.' to otherwise indicate the
end.
With full C, things aren't even this easy. The problem comes
about because in full C, functions can also have types. So when
the compiler sees a keyword like "int," it still doesn't know
whether to expect a data declaration or a function definition.
Things get more complicated since the next token may not be a
name ... it may start with an '*' or '(', or combinations of the
two.
More specifically, the BNF for full C begins with:
<program> ::= ( <top-level decl> )*
<top-level decl> ::= <function def> | <data decl>
<data decl> ::= [<class>] <type> <decl-list>
<function def> ::= [<class>] [<type>] <function decl>
You can now see the problem: The first two parts of the
declarations for data and functions can be the same. Because of
the ambiguity in the grammar as written above, it's not a
suitable grammar for a recursive-descent parser. Can we
transform it into one that is suitable? Yes, with a little work.
Suppose we write it this way:
<top-level decl> ::= [<class>] <decl>
<decl> ::= <type> <typed decl> | <function decl>
<typed decl> ::= <data list> | <function decl>
We can build a parsing routine for the class and type
definitions, and have them store away their findings and go on,
without their ever having to "know" whether a function or a data
declaration is being processed.
To begin, key in the following version of the main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
while Look <> ^Z do begin
GetClass;
GetType;
TopDecl;
end;
end.
{--------------------------------------------------------------}
For the first round, just make the three procedures stubs that do
nothing _BUT_ call GetChar.
Does this program work? Well, it would be hard put NOT to, since
we're not really asking it to do anything. It's been said that a
C compiler will accept virtually any input without choking. It's
certainly true of THIS compiler, since in effect all it does is
to eat input characters until it finds a ^Z.
Next, let's make GetClass do something worthwhile. Declare the
global variable
var Class: char;
and change GetClass to do the following:
{--------------------------------------------------------------}
{ Get a Storage Class Specifier }
Procedure GetClass;
begin
if Look in ['a', 'x', 's'] then begin
Class := Look;
GetChar;
end
else Class := 'a';
end;
{--------------------------------------------------------------}
Here, I've used three single characters to represent the three
storage classes "auto," "extern," and "static." These are not
the only three possible classes ... there are also "register" and
"typedef," but this should give you the picture. Note that the
default class is "auto."
We can do a similar thing for types. Enter the following
procedure next:
{--------------------------------------------------------------}
{ Get a Type Specifier }
procedure GetType;
begin
Typ := ' ';
if Look = 'u' then begin
Sign := 'u';
Typ := 'i';
GetChar;
end
else Sign := 's';
if Look in ['i', 'l', 'c'] then begin
Typ := Look;
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that you must add two more global variables, Sign and Typ.
With these two procedures in place, the compiler will process the
class and type definitions and store away their findings. We can
now process the rest of the declaration.
We are by no means out of the woods yet, because there are still
many complexities just in the definition of the type, before we
even get to the actual data or function names. Let's pretend for
the moment that we have passed all those gates, and that the next
thing in the input stream is a name. If the name is followed by
a left paren, we have a function declaration. If not, we have at
least one data item, and possibly a list, each element of which
can have an initializer.
Insert the following version of TopDecl:
{--------------------------------------------------------------}
{ Process a Top-Level Declaration }
procedure TopDecl;
var Name: char;
begin
Name := Getname;
if Look = '(' then
DoFunc(Name)
else
DoData(Name);
end;
{--------------------------------------------------------------}
(Note that, since we have already read the name, we must pass it
along to the appropriate routine.)
Finally, add the two procedures DoFunc and DoData:
{--------------------------------------------------------------}
{ Process a Function Definition }
procedure DoFunc(n: char);
begin
Match('(');
Match(')');
Match('{');
Match('}');
if Typ = ' ' then Typ := 'i';
Writeln(Class, Sign, Typ, ' function ', n);
end;
{--------------------------------------------------------------}
{ Process a Data Declaration }
procedure DoData(n: char);
begin
if Typ = ' ' then Expected('Type declaration');
Writeln(Class, Sign, Typ, ' data ', n);
while Look = ',' do begin
Match(',');
n := GetName;
WriteLn(Class, Sign, Typ, ' data ', n);
end;
Match(';');
end;
{--------------------------------------------------------------}
Since we're still a long way from producing executable code, I
decided to just have these two routines tell us what they found.
OK, give this program a try. For data declarations, it's OK to
give a list separated by commas. We can't process initializers
as yet. We also can't process argument lists for the functions,
but the "(){}" characters should be there.
We're still a _VERY_ long way from having a C compiler, but what
we have is starting to process the right kinds of inputs, and is
recognizing both good and bad inputs. In the process, the
natural structure of the compiler is starting to take form.
Can we continue this until we have something that acts more like
a compiler. Of course we can. Should we? That's another matter.
I don't know about you, but I'm beginning to get dizzy, and we've
still got a long way to go to even get past the data
declarations.
At this point, I think you can see how the structure of the
compiler evolves from the language definition. The structures
we've seen for our two examples, Pascal and C, are as different
as night and day. Pascal was designed at least partly to be easy
to parse, and that's reflected in the compiler. In general, in
Pascal there is more structure and we have a better idea of what
kinds of constructs to expect at any point. In C, on the other
hand, the program is essentially a list of declarations,
terminated only by the end of file.
We could pursue both of these structures much farther, but
remember that our purpose here is not to build a Pascal or a C
compiler, but rather to study compilers in general. For those of
you who DO want to deal with Pascal or C, I hope I've given you
enough of a start so that you can take it from here (although
you'll soon need some of the stuff we still haven't covered yet,
such as typing and procedure calls). For the rest of you, stay
with me through the next installment. There, I'll be leading you
through the development of a complete compiler for TINY, a subset
of KISS.
See you then.
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************