
# 1) Integrating MASTODON with SST

Requires:
* SST-Core and SST-Elements (and their dependencies). These should be pre-installed and added to your `~/.bashrc`.
* MASTODON (compile the network-specific demo in the `network_demo` folder).

## 1.1) Simple Send and Receive

Notable Files:
* CPP wrapper for RACER (mainDir/src/sst)
* stat.\*.xml -> cycle-cycle performance
* data.\*.xml -> data for the nodes (inputs/outputs)

In this example, we instantiate two nodes that act as a sender (**Node0**) and receiver (**Node1**).

**Node0** computes a value, and sends the result to **Node1**, which subsequently computes a multiplication.

## 1.2) Modifying the Network Parameters

Modify the network parameters (*basic.py*) and observe the effects.

## 1.1) Simple Send and Receive

In [None]:
#Paths and imports

import os, sys
import subprocess 

networkDemoPath = os.getcwd()
networkDemoPath = networkDemoPath + "/../../network_demo" #Insert the path to the demo-code folder
os.chdir(networkDemoPath)
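#Note: each `!` command runs in its own subshell, so sourcing ~/.bashrc here does not
#change this kernel's environment. Launch Jupyter from a shell where SST is already on PATH.
!source ~/.bashrc
#-f keeps rm quiet when no output files exist yet
!!rm -f data.*.xml
!!rm -f stat.*.xml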

!!pwd #Output should be a path to the demo-code folder
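
The `rm` calls above shell out to clear old output files. If you prefer to stay in Python, the same cleanup can be sketched with `pathlib`. The helper name `remove_outputs` is hypothetical and not part of MASTODON; it only deletes files matching the two patterns used in this tutorial.

```python
from pathlib import Path


def remove_outputs(directory, patterns=("data.*.xml", "stat.*.xml")):
    """Delete files in `directory` matching the glob patterns; return removed names."""
    removed = []
    for pattern in patterns:
        for path in Path(directory).glob(pattern):
            path.unlink()              # remove the stale output file
            removed.append(path.name)
    return sorted(removed)
```

Calling `remove_outputs(networkDemoPath)` before a run would leave the folder in the same state as the two `rm` lines, and the return value shows what was actually deleted.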

In [None]:
#This should list the demo_{examples} and node files (configs and endpoint)
!!ls $networkDemoPath

In [None]:
configFile = os.path.join(networkDemoPath, "test.config")

In [None]:
%%writefile $configFile

// Tutorial config file 
// Use for debug, bookkeeping
// options: none, lane0, lane1, etc., controller, decoder, binary, uop_queue, inter_mover, intra_mover
show = none
// options: none, lane0, lane1, etc., controller, decoder, binary, uop_queue, inter_mover, intra_mover
record_stat = inter_mover, controller
// options: true, false
record_data = true
// input data source
data_source = none
// options: true, false
disable_backend = false

// Use for initialization
// number of idle cycles before a Node retires
cycle_max_idle = 20
// number of entries the instruction storage can hold
entry_bin_size = 3000
// number of cycles it takes the fetcher to fetch an instruction from instruction storage
cycle_bin_out_lat = 1
// number of cycles it takes for an instruction to be loaded into the instruction storage
cycle_bin_in_lat = 1

// Node configuration
//number of lanes
num_lane = 2
// number of reg. files per lane
num_regfile = 2
// number of bits per register
granularity = 8

// PUM-Enabling Technology
// options:
// - ReRAM: IDEAL, NMP, MAGIC, OSCAR
// - SRAM: IDEAL, NMP
// - DRAM: AMBIT_ROWCLONE
PUMtech map_style = MAGIC
// Number of rows and columns in a memory array
num_col = 64
num_row = 2
// Racer-Specific: Number of intermediate values or mask columns in a memory array
num_imm = 17
num_mask = 0
// Racer-Specific: number of entries per uop queue
entry_queue_size = 12

// for all registers, set the bits to 0 starting from random_bit_pos
random_bit_pos = 3

// type of data mover:
// options: peer to peer (P2P), serial (SERIAL)
cycle_mover_type = P2P
// latency to perform MEM_COPY
cycle_mover_COPY_lat = 16
// latency to establish connection between lanes before data transfer
cycle_mover_SETUP_lat = 1

// MPI related config
// Change these to see the effect of the MPI overheads
cycle_mpi_send_payload_generation_lat = 8
cycle_mpi_interupt_payload_generation_lat = 16
cycle_mpi_send_store_back_lat = 8
cycle_mpi_interupt_store_back_lat = 32
cycle_ping_skip_duration = 0
cycle_interupt_serve_wait = 1000
bit_payload_size = 128
bit_PING_size = 10
bit_ACK_size = 10
bit_DONE_size = 10


// PPA configurations from real chip extraction
process_name = Intel_16nm_SCIMBA
vector_length = 128
pJ_per_vec_primitive = 110
cm2_total_area = 0.0025
MHz_frequency = 133.33
watt_thermal_limit = 80

//Device Level Parameters
device_model_sim = false
memorisation = false
verilog_filename = RRAM_VTEAM
volt_MAGIC = -1.5
volt_SET = -1.7
volt_RESET = 0.4
volt_ISO_BL = -0.3
volt_ISO_WL = -0.3
second_cycle_time = 0.001
second_step_size = 10e-6
ohm_R = 2.925
farad_C = 1e-15
state_threshold = 0.5

// Playback configurations
entry_playback_buffer_size = 1024
num_max_active_regfile_per_lane = 1

// Hyper-threading configurations
num_smt_thread = 1
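
To double-check what was written to the config file, it can help to read the settings back as a dictionary. The sketch below is not MASTODON's own parser; it simply assumes the layout visible above (one `key = value` pair per line, with comment lines starting with `//` or `#`) and keeps every value as a string. The function name `parse_config` is hypothetical.

```python
def parse_config(text):
    """Parse `key = value` lines, skipping blanks and `//` or `#` comment lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//") or line.startswith("#"):
            continue                      # blank line or comment
        key, sep, value = line.partition("=")
        if not sep:
            continue                      # no '=' on this line, so not a setting
        settings[key.strip()] = value.strip()
    return settings


# Small sample in the same layout as test.config
sample = """
// number of lanes
num_lane = 2
record_stat = inter_mover, controller
"""
print(parse_config(sample))
# → {'num_lane': '2', 'record_stat': 'inter_mover, controller'}
```

With the notebook variables defined earlier, `parse_config(open(configFile).read())["num_lane"]` would return the lane count as the string `'2'`.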

In [None]:
#Bookkeeping: paths for the basic demo

basicExample = os.path.join(networkDemoPath,"demo_basic")
basicExamplePy = os.path.join(basicExample, "basic.py")
simpleSend = os.path.join(basicExample, "simple_send.rc")
simpleRecv = os.path.join(basicExample, "simple_recv.rc")
outNode0 = os.path.join(networkDemoPath,"data.0.xml")
outNode1 = os.path.join(networkDemoPath,"data.1.xml")
!!ls $basicExample

In [None]:
#Run this to see the 2-node network setup.

with open(basicExamplePy, 'r') as f:
    print(f.read())

In [None]:
#Run this to see the sender program.

with open(simpleSend, 'r') as f:
    print(f.read())

In [None]:
#Run this to see the receiver program.

with open(simpleRecv, 'r') as f:
    print(f.read())

In [None]:
#Execute
subprocess.run(["sst", basicExamplePy])
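
A bare `subprocess.run` call returns quietly even if `sst` is missing from the kernel's `PATH` or exits with an error. The following is a minimal sketch of a wrapper that surfaces both cases and returns the captured console output. The name `run_sst` and its `sst_binary` parameter are hypothetical conveniences, not part of SST or MASTODON.

```python
import shutil
import subprocess


def run_sst(config_path, sst_binary="sst"):
    """Run `sst_binary config_path` and return its stdout, raising on failure."""
    if shutil.which(sst_binary) is None:
        raise FileNotFoundError(
            f"'{sst_binary}' was not found on PATH; check the SST installation"
        )
    result = subprocess.run(
        [sst_binary, config_path], capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{sst_binary} exited with code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout
```

In this notebook, `print(run_sst(basicExamplePy))` would run the same simulation as the cell above while keeping the console output available for later inspection.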

In [None]:
#Take a look at the output file for Node0. (By default, we used lane 1, RF 1)

with open(outNode0, 'r') as n:
    print(n.read())

In [None]:
#Take a look at the output file for Node1. (By default, we used lane 1, RF 0)

with open(outNode1, 'r') as n:
    print(n.read())
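
Printing the raw files works for small runs. For a quick overview, the element tags can be counted instead. The schema of `data.*.xml` is not described in this tutorial, so the sketch below only assumes that each file is well-formed XML with a single root element; the sample document and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text):
    """Return how many times each element tag occurs in an XML document."""
    root = ET.fromstring(xml_text)   # raises ET.ParseError if not well-formed
    return Counter(element.tag for element in root.iter())


# Hypothetical sample; real MASTODON output may use different tags.
sample = "<node><lane id='1'><reg>3</reg><reg>4</reg></lane></node>"
print(count_tags(sample))
```

If `ET.ParseError` is raised on a real output file, the single-root assumption does not hold and the file is better read as plain text, as in the cells above.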

## 1.2) Modifying the Network Parameters

Now you try: make a "new" network based on the basic example. **Note:** we have already modified `newBasicFile` for demo purposes; however, feel free to explore how different parameters affect the simulation.

We can rerun the same example and observe the difference in execution time.
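
One way to compare the two runs is to capture SST's console output and extract the simulated time reported at the end. This sketch assumes the completion message has the form `Simulation is complete, simulated time: <value> <unit>`; if your SST version prints something different, adjust the pattern. The function name `simulated_seconds` is hypothetical.

```python
import re

# Unit suffixes assumed to appear in the completion message
_UNIT_TO_SECONDS = {"ps": 1e-12, "ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_PATTERN = re.compile(r"simulated time:\s*([0-9.eE+-]+)\s*([A-Za-z]+)")


def simulated_seconds(sst_output):
    """Return the reported simulated time in seconds, or None if not found."""
    match = _PATTERN.search(sst_output)
    if match is None:
        return None
    value, unit = match.groups()
    scale = _UNIT_TO_SECONDS.get(unit)
    if scale is None:
        return None                    # unit not in the assumed table
    return float(value) * scale
```

To use it, capture the output of each run, for example with `subprocess.run(["sst", basicExamplePy], capture_output=True, text=True).stdout`, pass that text to `simulated_seconds`, and compare the values for the original and modified networks.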

In [None]:
#Add the new basic test to your path (no need to change unless you've changed your existing paths between 1.1 and 1.2)
newBasicFile = os.path.join(basicExample, "new_basic.py")

In [None]:
%%writefile $newBasicFile

import sst
from sst.merlin import *
from node_endpoint import NODE_Endpoint_Generator

node_params = {
         "config_file" : "./test.config",
         "max_idle" : 1000,
         "num_peers" : 2
         }

# change the binary accordingly
binary_dict = {}
binary_dict[0]  = "./demo_basic/simple_send.rc"
binary_dict[1]  = "./demo_basic/simple_recv.rc"

#Change any merlin._params value
sst.merlin._params["flit_size"] = "16B"
sst.merlin._params["link_bw"] = "1.0GB/s" #We pre-changed this
sst.merlin._params["xbar_bw"] = "1.0GB/s" #We pre-changed this
sst.merlin._params["input_latency"] = "0.0ns"
sst.merlin._params["output_latency"] = "0.0ns"
sst.merlin._params["input_buf_size"] = "16.0KB"
sst.merlin._params["output_buf_size"] = "16.0KB"
sst.merlin._params["link_lat"] = "3200ps" #We pre-changed this

merlinmeshparams = {}
merlinmeshparams["mesh.shape"]="1x2"
merlinmeshparams["mesh.width"]="1x1"
merlinmeshparams["mesh.local_ports"]="1"
sst.merlin._params.update(merlinmeshparams)
topo = topoMesh()
topo.prepParams()

endPoint = NODE_Endpoint_Generator(node_params, binary_dict)
endPoint.prepParams()

topo.setEndPoint(endPoint)
topo.build()

In [None]:
subprocess.run(["sst", newBasicFile])
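
As a rough sanity check on what to expect, the time for one payload to cross a single link can be estimated from `bit_payload_size` in the config file and the `link_bw` and `link_lat` values in `new_basic.py`. This is a back-of-envelope sketch only: it takes 1 GB as 10^9 bytes and ignores flit headers, router and buffer delays, and the MPI overhead cycles, so the simulated numbers will differ.

```python
def one_hop_transfer_ns(payload_bits, link_bw_bytes_per_s, link_lat_ps):
    """Estimate serialization time plus link latency for one payload, in ns."""
    payload_bytes = payload_bits / 8
    serialization_ns = payload_bytes / link_bw_bytes_per_s * 1e9
    latency_ns = link_lat_ps / 1000
    return serialization_ns + latency_ns


# bit_payload_size = 128, link_bw = 1.0GB/s, link_lat = 3200ps
estimate = one_hop_transfer_ns(128, 1.0e9, 3200)
print(f"{estimate:.1f} ns")
# → 19.2 ns
```

The 16-byte payload takes about 16 ns to serialize at 1 GB/s, and the link latency adds 3.2 ns. Halving `link_bw` would roughly double the first term while leaving the second unchanged.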

In [None]:
#Take a look at the output file for Node0. (By default, we used lane 1, RF 1)

with open(outNode0, 'r') as n:
    print(n.read())

In [None]:
#Take a look at the output file for Node1. (By default, we used lane 1, RF 0)

with open(outNode1, 'r') as n:
    print(n.read())