## Parsing files with Python

A major component of the Chapter 6 Assembler project involves writing a program that will

* read a file,
* ignore white space and comments in the file,
* break each line, or command, in the file into mnemonics,
* determine how the mnemonics map into various binary commands,
* write the equivalent binary code to a second file.

Because you may not have worked with this specific problem in your CSCI151 class, or because you may not remember that much Python, I've prepared this short notebook for you to study.

### Opening and reading a file

First the `.asm` file has to be opened and read. This must be done with a Python script called `hasm.py` (hack assembler) that accepts a command line argument specifying the assembly file (`.asm`) that you want to convert to binary code (`.hack`). The following code opens a file specified by a command line argument and reads it line by line. Specific features in this code include:

* `sys.argv` is a list containing each term that was typed at the command line. The terms were separated by spaces at the command line.
* Error checking confirms that exactly one source (`.asm`) file is specified, by checking that the command line was `hasm.py filename`, and that the file exists and can be opened.
* The `with` keyword in Python is used, along with `try` and `except`, to do error handling.
* Control is turned over to a function, `Pass1`, where the actual processing will take place. Later, this function should probably be moved to another file for clarity.
* The `Pass1` function has been equipped with some basic comment processing and white space handling as follows:
    * `find('//')` is used to get the location of the comment string `//`. Python slicing is then used to eliminate the part of a line that follows the comment string.
    * `strip()` is used to remove leading and trailing white space from the resulting line.
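
The features listed above could be sketched roughly as follows. This is a minimal illustration rather than the course's reference solution: it assumes Hack comments begin with `//`, and the helper names `strip_line` and `main` are hypothetical. Only `Pass1` and `sys.argv` come from the description above.

```python
import sys


def strip_line(line):
    """Drop a trailing // comment and the surrounding white space."""
    position = line.find('//')      # -1 when the line has no comment
    if position != -1:
        line = line[:position]      # keep only the text before the comment
    return line.strip()


def Pass1(file):
    """Return the non-empty commands found in an open .asm file."""
    commands = []
    for line in file:
        command = strip_line(line)
        if command:                 # skip blank and comment-only lines
            commands.append(command)
    return commands


def main(argv):
    # Expect exactly: hasm.py filename
    if len(argv) != 2:
        print("usage: hasm.py filename.asm")
        return 1
    try:
        with open(argv[1]) as file:
            for command in Pass1(file):
                print(command)
    except OSError as error:
        print("Could not open", argv[1], "-", error)
        return 1
    return 0

# In hasm.py, the script would end with: sys.exit(main(sys.argv))
```

Keeping the argument checking in `main` and the line handling in `Pass1` makes it easy to move `Pass1` to another file later.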

@2
D=A
@3
D=D+A
@0
M=D


### Dictionaries
The main construct for managing translation of assembly to binary will be the dictionary, or symbol table. Let us demonstrate how these will be used with an example, a dictionary of symbols associated with ROM and RAM addresses.


In [1]:
symbols = {
         "R0" :  "0",
         "R1" :  "1",
         "R2" :  "2",
         "R3" :  "3",
         "R4" :  "4",
         "R5" :  "5",
         "R6" :  "6",
         "R7" :  "7",
         "R8" :  "8",
         "R9" :  "9",
         "R10" :  "10",
         "R11" :  "11",
         "R12" :  "12",
         "R13" :  "13",
         "R14" :  "14",
         "R15" :  "15",
         "SCREEN" : "16384",
         "KBD" :  "24576",
         "SP" : "0",
         "LCL" :  "1",
         "ARG" : "2",
         "THIS" : "3",
         "THAT" : "4"
}
 
# Test if an entry is present:
print("Is R3 in symbols? ","R3" in symbols)
print("Is i in symbols? ","i" in symbols)

# Add i to symbols at the next free RAM address:
symbol = "i"
next_RAM = 16
symbols[symbol] = str(next_RAM)  # stored as a string, like the other values
next_RAM += 1

# Print the value of "i"
print(symbol,symbols[symbol])

Is R3 in symbols?  True
Is i in symbols?  False
i 16


#### Other Dictionaries

For your convenience, here are the other dictionaries that will be used. They are a direct translation of the tables in the graphic at the bottom of this page into dictionaries, except:

* `comp_binary` is constructed to include the `a` bit, so each value is 7 bits long.
* the keys of `dest_binary` are alphabetized, so it does not include every permutation of the 2 and 3 letter combinations (for example, it has `DM` but not `MD`).
* `command_type` seemed a simple way to determine the command type from the first character of a command.

In [None]:
comp_binary = { '0' : '0101010', '1' : '0111111', '-1' :  '0111010', 'D' :  '0001100', 'A' : '0110000', 'M' : '1110000', '!D' : '0001101', '!A' : '0110001', '!M' : '1110001', '-D' : '0001111', '-A' : '0110011', '-M' : '1110011', 'D+1' : '0011111', '1+D' : '0011111', 'A+1' : '0110111', '1+A' : '0110111', 'M+1' : '1110111', '1+M' : '1110111', 'D-1' : '0001110', 'A-1' : '0110010', 'M-1' : '1110010', 'D+A' : '0000010', 'A+D' : '0000010', 'D+M' : '1000010', 'M+D' : '1000010', 'D-A' : '0010011', 'D-M' : '1010011', 'A-D' : '0000111', 'M-D' : '1000111', 'D&A' : '0000000', 'A&D' : '0000000', 'D&M' : '1000000', 'M&D' : '1000000', 'D|A' : '0010101', 'A|D' : '0010101', 'D|M' : '1010101', 'M|D' : '1010101' }

dest_binary = { 'null' : '000', 'M' : '001', 'D' : '010', 'DM' : '011','A' : '100', 'AM' : '101', 'AD' : '110', 'ADM' : '111'}

jump_binary = { 'null' : '000', 'JGT' : '001', 'JEQ' : '010', 'JGE' : '011', 'JLT' : '100', 'JNE' : '101', 'JLE' : '110', 'JMP' : '111'}

command_type = {'@':'A_COMMAND','(':'L_COMMAND','D':'C_COMMAND','A':'C_COMMAND','M':'C_COMMAND','0':'C_COMMAND','1':'C_COMMAND','-1':'C_COMMAND',"!":"C_COMMAND","-":"C_COMMAND"}
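
As an illustration of how a first-character table like `command_type` might be used, here is a small sketch. The dictionary is repeated so the example runs on its own, and the helper name `lookup_command_type` and the `'UNKNOWN'` fallback are assumptions made for this example, not part of the API.

```python
# Repeated here so the example is self-contained.
command_type = {'@': 'A_COMMAND', '(': 'L_COMMAND',
                'D': 'C_COMMAND', 'A': 'C_COMMAND', 'M': 'C_COMMAND',
                '0': 'C_COMMAND', '1': 'C_COMMAND',
                '!': 'C_COMMAND', '-': 'C_COMMAND'}


def lookup_command_type(command):
    """Classify a stripped, non-empty command by its first character."""
    return command_type.get(command[0], 'UNKNOWN')


for command in ['@16', '(LOOP)', 'D=D+A', '0;JMP', '-1', 'X=1']:
    print(command, lookup_command_type(command))
```

Because only the first character is examined, a single `'-'` key already covers both `-1` and `-D`.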



### The assembler API

So that we can all agree on what's to be done, and share our work, I am insisting that portions of the API defined in the book be upheld. The following provides them. Note I use `pass` to get something unimplemented to run without error. You'll need to replace that with your code.

In [40]:
def dest2bin(mnemonic):
    # returns the binary code for the destination part of a C-instruction
    # Note that the keys of dest_binary are alphabetized, so the letters of
    # the mnemonic are sorted before the lookup (e.g. 'MD' becomes 'DM').
    if mnemonic == "null":
        return dest_binary["null"]
    return dest_binary["".join(sorted(mnemonic))]

def comp2bin(mnemonic):
    # returns the binary code for the comp part of a C-instruction
    return comp_binary[mnemonic]

def jump2bin(mnemonic):
    # returns the binary code for the jump part of a C-instruction
    return jump_binary.get(mnemonic)
    
def commandType(command):
    # returns "A_COMMAND", "C_COMMAND", or "L_COMMAND"
    # depending on the contents of the 'command' string
    if command.startswith('@'):
        return 'A_COMMAND'
    elif command.startswith("("):
        return "L_COMMAND"
    else:
        return "C_COMMAND"
    # Alternative implementation, lookup first character in dictionary command_type

def getSymbol(command):
    # given an A_COMMAND or L_COMMAND type, returns the symbol as a string,
    # eg (XXX) returns 'XXX'
    # @sum returns 'sum'
    if command.startswith("@"): # A_COMMAND
        return command[1:]
    elif command.startswith("(") and command.endswith(")"):
        return command[1:-1]
    else:
        return 'Invalid command type'
    
    
def getDest(command):
    # return the dest mnemonic in the C-instruction 'command'
    ep = command.find("=")
    if ep == -1:
        return "null"
    return command[0:ep]

def getComp(command):
    # return the comp mnemonic in the C-instruction 'command'
    if commandType(command) == "C_COMMAND":
        cmd = command
        if "=" in cmd:
            cmd = cmd.partition("=")[2]
        return cmd.partition(";")[0]
    return "" #This is an error condition

def getJump(command):
    # return the jump mnemonic in the C-instruction 'command'
    if ';' in command:
        return command.split(';')[1]
    else:
        return 'null'
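
To see how the dest, comp, and jump pieces fit together, here is a self-contained sketch. It is one possible approach and not the required implementation: `split_c_instruction` and `encode_c` are hypothetical helpers, and the three small tables hold only the entries this example needs.

```python
# Tiny subsets of the translation tables, just enough for the examples below.
comp_subset = {'D+A': '0000010', '0': '0101010'}
dest_subset = {'null': '000', 'D': '010'}
jump_subset = {'null': '000', 'JMP': '111'}


def split_c_instruction(command):
    """Split 'dest=comp;jump' into (dest, comp, jump); missing parts are 'null'."""
    dest, equals, rest = command.partition('=')
    if not equals:              # no '=' present, so there is no dest field
        dest, rest = 'null', command
    comp, _, jump = rest.partition(';')
    return dest, comp, jump or 'null'


def encode_c(command):
    """Build the 16-bit string: 111, then comp (7 bits), dest (3), jump (3)."""
    dest, comp, jump = split_c_instruction(command)
    return '111' + comp_subset[comp] + dest_subset[dest] + jump_subset[jump]


print(split_c_instruction('D=D+A'))   # ('D', 'D+A', 'null')
print(encode_c('D=D+A'))              # 1110000010010000
print(encode_c('0;JMP'))              # 1110101010000111
```

`str.partition` always returns three parts, which avoids the index errors that `split` can raise when a separator is missing.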
    

### The format mini-language

Python has a `format` function that will be very useful for this and other assignments in this course. Essentially, you face the problem of writing a 16 bit binary number, given a decimal value. There are a number of ways of handling this, but `format` is probably the cleanest. See the following for an example; zero-padding the output string is left as an exercise for the student.


In [3]:
RAM_address = 13408
print(format(RAM_address,'b'))
# Above, the string 'b' creates a string representing RAM_address in
# binary format. You should research the format command to learn how to 
# make that string 16 bits long, and to 'pad' the places that aren't 
# needed to express RAM_address with zeros.

11010001100000
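
As a hint toward that research, the sketch below shows how a width and a fill character change the output of `format`. It deliberately uses a width of 8 and an arbitrary value; adapting it to 16 bits is still up to you.

```python
value = 5

plain = format(value, 'b')          # binary digits only, no padding
space_padded = format(value, '8b')  # width 8, right-aligned, padded with spaces
zero_padded = format(value, '08b')  # width 8, padded with zeros on the left

print(repr(plain))         # '101'
print(repr(space_padded))  # '     101'
print(repr(zero_padded))   # '00000101'
```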


### Additional Functions for Completion

We split this assignment into two sessions. The remaining functions that are required follow.

In [None]:
def processA(line,lineNo,nextRAM):
    # Convert an A-instruction line of assembly to a binary code that is
    # 0 followed by a 15 bit address. Will use the symbol table to look up
    # a symbol and replace it with a value. If the label is not in the symbol
    # table, add it with the correct RAM address (next in sequence)
    # Note: the format mini-language is helpful
    pass

def processC(line):
    # Convert a C-instruction line of code to the correct computation, destination,
    # and jump binary codes. These should be preceded by 111, which signifies the
    # C-instruction
    pass

def processL(line,lineNo):
    # When an L-instruction (label in the form (LABEL)) is encountered,
    # the label should be placed into the symbol table with the correct line number
    pass

def pass_1(file):
    # scan each line of file and find L_COMMANDS
    # place them in the symbol table with appropriate ROM numbers
    pass

def pass_2(file):
    # Scan file and write correct binary code to stdout.
    # Hint: file.seek(0) resets the file pointer for another pass
    pass

# parse sys.argv and check that it's in the format "hasm.py file.asm"
# open the file and pass it off to pass_1 and pass_2
# place any other error checking here.
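
The two-pass idea and the `file.seek(0)` hint can be illustrated with an in-memory file. This is only a sketch under simplifying assumptions (no comments, no variables), and the names `source`, `labels`, and `instructions` are made up for the example. Your `pass_1` and `pass_2` will need to do more.

```python
import io

# An in-memory stand-in for an opened .asm file.
source = io.StringIO("@2\nD=A\n(LOOP)\n@LOOP\n0;JMP\n")

# First pass: record the ROM address that each label refers to.
labels = {}
rom_address = 0
for line in source:
    command = line.strip()
    if command.startswith('('):
        labels[command[1:-1]] = rom_address  # label names the next instruction
    elif command:
        rom_address += 1                     # only real instructions occupy ROM

source.seek(0)  # rewind so the second pass starts at the first line again

# Second pass: visit the instructions again, now that all labels are known.
instructions = []
for line in source:
    command = line.strip()
    if command and not command.startswith('('):
        instructions.append(command)

print(labels)        # {'LOOP': 2}
print(instructions)  # ['@2', 'D=A', '@LOOP', '0;JMP']
```

Without the call to `seek(0)`, the second loop would see no lines at all, because the first loop has already read to the end of the file.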


### Bringing it all together

What's left? Well, implement all the functions specified, and then finish `pass_1` and `pass_2`. Other functions will have to work on globally defined dictionaries to keep track of symbols and manage the translation to binary. You'll also need to create and manage a `.hack` output file where the binary instructions are written. In addition to the functions mentioned in the API, and here, it's possible that you'll decide to write some helper functions, or even some classes. You're free to do as you like, provided the functions in the API are complete, and `pass_1` as well as `pass_2` are complete and well defined. A very important table of binary codes for the destination, computation, and jump portions of a C command appears below.

![Binary codes for various destination, computation, and jump portions of a C command](binary_codes.png)