This is a Pascal compiler built using Python 2.7
Enter the command: python run.py Inside the run.py file there are 4 examples
- array example
- assignment example
- for-loop example
- if-loop example
- while-loop example
A very simple form of a compiler/interpreter:
Source File ==> Scanner ==> Lexer ==> Parser ==> Interpreter/Code Generator
-
Source File: This is the program that is read by the interpreter/compiler. This is the text that needs to be compiled or interpreted.
-
Scanner: This is the first module in a compiler/interpreter/
- The job of a scanner is to read the source file, one character at a time.
- It also keeps track of which line number and character is currently being read.
- A typical scanner can be instructed to move backwards and forwards through the source file.
- Why do we need to move backwards?
- For now assume that each time the scanner is called:
- it returns the next character in the file
-
Lexer: This module serves to break up the source file into chunks(called tokens). It calls the scanner to get characters one at a time and organizes them into:
- tokens
- token types
cx = cy + 324; print "value of cx is ", cx;
A lexer would break it like this:
cx --> Identifier (variable) = --> Symbol (assignment operator) cy --> Identifier (variable) + --> Symbol (addition operator) 324 --> Numeric Constant (integer) ; --> Symbol (end of statement) print --> Identifier (keyword) "value of cx is " --> String Constant (string) , --> Symbol (string concatentation operator) cx --> Identifier (variable)
- The lexer calls the scanner to pass it one character at a time
- Then lexer groups them together(groups characters together) and identifies them up as tokens for the language parser (which is the next stage)
- SO basically TOKENS are characters grouped together
- The lexer also identifies the type of token:
- variable vs keyword
- assignment operator vs addition operator vs string concatentation operator etc
- Occasionally, the lexer has to tell the scanner to back up.
- Consider a language that has operators that may be more than one character long
- For example ! vs !=
- < vs <=
- '+' vs ++
- Consider a language that has operators that may be more than one character long
- If we assume that the lexer needs to determine whether the operator is a < or a <=, the lexer will request the scanner for another character.
- If the next character is a '=', it changes the token to "<=" and passes it to the parser
- If not, it tells the scanner to back up one character and hold it in the buffer, while it passes the '<' to the parser.
- Parser: This is the part of the compiler that really understand the syntax of the language
- It calls the lexer to get tokens and prcessess the tokens per the syntax of the language
- For example, taking the example from the lexer above, the hypothetical interaction between the lexer and paraser could go like this:
Parser: Give me the next token
Lexer : Next token is "cx" which is a variable.
Parser: Ok, I have "cx" as a declared integer variable. Give me next token
Lexer : Next token is "=", the assignment operator.
Parser: Ok, the program wants me to assign something to "cx". Next token
------> Lexer : The next token is "cy" which is a variable.
------> Parser: Ok, I know "cy" is an integer variable. Next token please
------> Lexer : The next token is '+', which is an addition operator.
------> Parser: Ok, so I need to add something to the value in "cy". Next token please.
--------------> Lexer: The next token is "324", which is an integer.
--------------> Parser: Ok, both "cy" and "324" are integers, so I can add them. Next token please:
--------------> Lexer: The next token is ";" which is end of statement.
------> Parser: Ok, I will evaluate "cy + 324" and get the answer
Parser: I'll take the answer from "cy + 324" and assign it to "cx"
In the section above, the indenting shows a subprocess that the parsers enters to evaluate "cy+324". This gives an idea about how the parser operates.
- Also note that the parser is checking types and syntax rules (for instance, it checked whether cy and 324 were both integer types before adding them).
- If the parser gets a token that it was not expecting, it will stop processing and complain to the user about an error.
- The scanner holds the current line number and character, so the Parser can inform the user approximately where the error occured.
- Interpreter/Code Generator: This is the part that actually takes the action that is specified by a program statement.
- In some bases, this is actually part of the parser(especially for interpreters)
- The parser interprets and takes action directly
- In other cases, the parser converts the statements into byte-code
- In the case of a compiler, it then hands them to the Code Generator to convert into machine code instructions
- If you want a compiler for a different CPU or architecture, all you have to do is put a new code generator unit to translate the byte code into machine code for the new CPU