Mark Handley edited this page May 2, 2023 · 8 revisions

Once you clone the repo, the main source code of htsim is located in the sim directory, with the following files being relevant:

  • network.h, network.cpp

    • Implementation of the base packet type which has a Route and attributes.
    • The route specifies the list of next hops for this packet, and has an index that specifies the current location. It is typically a series of alternating queues and pipes.
    • packet.sendOn increments the index, finds the nextHop and calls receivePacket on that object (of type PacketSink).
    • Packets have a destination address, but it is only used with ECMP routing; otherwise the packet is source routed and the route is determined at the source. When ECMP routing is used, the packet carries only the route to the next switch; once the switch determines the next hop, the packet takes on the route associated with that hop.
  • eventlist.h, eventlist.cpp

    • Implementation of the event loop that drives htsim. It offers two main functions: sourceIsPending (takes the absolute time at which the event fires) and sourceIsPendingRel (takes a time relative to now).
  • queue.h, queue.cpp

    • Implementation of basic queues.
    • Exposes three main functions:
      • receivePacket is called when a packet arrives at a queue. It enqueues the packet, or drops it if there is no buffer room left. If the queue was empty and the link is idle, the packet begins transmission via beginService.
      • beginService simulates serialization of the first packet in the queue; it registers an event that fires when the last byte of the packet has been serialized.
      • doNextEvent is called when that event fires.
    • Each queue has a remote endpoint, which is logically the other end of the wire. This is set when the topology is created and is used to enable PAUSE packet transmission.
  • pipe.h, pipe.cpp

    • Delays a packet by a specified time. Used to simulate wire and other latencies.
  • compositequeue.h, compositequeue.cpp

    • Implementation of a trimming queue.
  • queue_lossless_input.h, queue_lossless_input.cpp

    • Implementation of lossless input buffering. These queues only perform accounting and trigger PFC towards upstream switches.
    • The PFC thresholds are set as static class members and initialized to 0; code that uses the PFC thresholds must set these parameters before attempting to create a lossless queue.
  • switch.h, switch.cpp

    • Implementation of a switch abstraction, mainly a collection of queues (the ports).
  • tcp.h, tcp.cpp

    • Implementation of TCP NewReno endpoints. Traffic is unidirectional from TcpSrc to TcpSink (same for all protocols).
  • ndp.h, ndp.cpp

    • Implementation of NDP.
  • swift.h, swift.cpp

    • Implementation of Swift CC over a TCP-like transport protocol (starting point NewReno).
  • roce.h, roce.cpp

    • Implementation of vanilla Roce - rate-based, go-back-N protocol. No verbs API implementations provided.
  • cbr.h, cbr.cpp

    • Constant bitrate sender and sink; no reliability implemented.

All transports by default send continuously (backlogged sources) until the simulation ends; alternatively, each source can be configured with a predefined flowsize after which it will stop sending.
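The interplay between the event list, routes, queues and pipes described above can be illustrated with a small model. This is a minimal Python sketch, not htsim code: the class and method names mirror the ones in the text, but the constructor parameters, the nanosecond time unit, the Sink class and the exact behaviour of do_next_event (forward the packet, then serve the next one) are assumptions made for illustration.

```python
import heapq
from collections import deque
from itertools import count


class EventList:
    """Toy event loop: objects ask to be called back at a given time."""

    def __init__(self):
        self.now = 0                      # current simulated time, in ns
        self._heap = []
        self._seq = count()               # tie-breaker for equal timestamps

    def source_is_pending(self, source, when):       # absolute time
        heapq.heappush(self._heap, (when, next(self._seq), source))

    def source_is_pending_rel(self, source, delay):  # relative to now
        self.source_is_pending(source, self.now + delay)

    def run(self):
        while self._heap:
            self.now, _, source = heapq.heappop(self._heap)
            source.do_next_event()


class Packet:
    def __init__(self, size_bytes, route):
        self.size = size_bytes
        self.route = route                # list of hops (queue, pipe, sink)
        self.next_hop = 0                 # index of the current position

    def send_on(self):
        hop = self.route[self.next_hop]
        self.next_hop += 1
        hop.receive_packet(self)


class Queue:
    def __init__(self, events, gbps, max_bytes):
        self.events, self.gbps, self.max_bytes = events, gbps, max_bytes
        self.buffer = deque()
        self.queued_bytes = 0
        self.drops = 0

    def receive_packet(self, pkt):
        if self.queued_bytes + pkt.size > self.max_bytes:
            self.drops += 1               # no buffer room left: drop
            return
        self.buffer.append(pkt)
        self.queued_bytes += pkt.size
        if len(self.buffer) == 1:         # queue was empty, link is idle
            self.begin_service()

    def begin_service(self):
        # Serialization time of the head packet: bits / (bits per ns).
        head = self.buffer[0]
        self.events.source_is_pending_rel(self, head.size * 8 // self.gbps)

    def do_next_event(self):
        # Last byte serialized: forward the packet, then serve the next one.
        pkt = self.buffer.popleft()
        self.queued_bytes -= pkt.size
        pkt.send_on()
        if self.buffer:
            self.begin_service()


class Pipe:
    """Delays every packet by a fixed time (wire and other latencies)."""

    def __init__(self, events, delay_ns):
        self.events, self.delay_ns = events, delay_ns
        self.in_flight = deque()

    def receive_packet(self, pkt):
        self.in_flight.append(pkt)
        self.events.source_is_pending_rel(self, self.delay_ns)

    def do_next_event(self):
        self.in_flight.popleft().send_on()


class Sink:
    def __init__(self, events):
        self.events, self.arrivals = events, []

    def receive_packet(self, pkt):
        self.arrivals.append(self.events.now)


def demo():
    events = EventList()
    queue = Queue(events, gbps=10, max_bytes=3000)
    sink = Sink(events)
    route = [queue, Pipe(events, delay_ns=1000), sink]
    for _ in range(3):                    # the third packet overflows the queue
        Packet(1500, route).send_on()
    events.run()
    return sink.arrivals, queue.drops
```

In demo(), three 1500-byte packets are sent back to back into a 3000-byte queue on a 10 Gb/s link followed by a 1000 ns pipe: two are serialized one after the other and delayed by the pipe, and the third is dropped on arrival.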

Logging infrastructure

  • (Most) Logging is done in binary format to ensure logging does not slow down the experiments.
  • Some logging is done as program output, mostly for debugging purposes.
  • Parsing the logfile can be done using the parse_output executable in the sim directory.
  • Each object (source, queue, pipe) logs events if an appropriate logger is passed to it during creation. Logging can be really fine grained (i.e. per packet) or aggregate.
  • The most used loggers are the sinklogger, which uses sampling to report throughput at the receiver, and the queuelogger, which periodically logs queue stats.
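The idea of reporting throughput through sampling can be sketched as follows. This is an illustrative Python example only: the sample layout (time in microseconds, cumulative bytes received) and the function name are assumptions, not the binary record format used by the sinklogger.

```python
def sampled_throughput(samples):
    """Turn periodic samples into per-interval throughput.

    samples: list of (time_us, cumulative_bytes_received), in time order.
    Returns a list of (time_us, mbps), one entry per sampling interval.
    Bytes * 8 / microseconds gives bits per microsecond, i.e. Mb/s.
    """
    out = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        out.append((t1, (b1 - b0) * 8 / (t1 - t0)))
    return out


# Hypothetical samples: 125000 bytes per 100 us is 10 Gb/s, then the flow stops.
example = [(0, 0), (100, 125000), (200, 250000), (300, 250000)]
```

Calling sampled_throughput(example) yields one throughput value per interval, which is the kind of series a throughput-over-time plot needs.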

Experiments in htsim are created by writing a main_* file that creates the topology and the endpoints. The sim/tests directory contains some example main files; these are mostly used as unit tests for the specific transport implementations (e.g. dumbbell for TCP, NDP, Swift, etc).

The datacenter subdirectory contains the htsim components required to run datacenter experiments. The main additions from a functionality point of view are:

  • Datacenter topologies (FatTree, DragonFly, BCube, etc)

  • An implementation of a FatTreeSwitch that supports ECMP routing as well. Simulated switches are output buffered by default; in the case when the lossless_input queuetype is used, virtual input buffers are also used to govern the generation of PFC.

  • A connection matrix class which has methods that generate various traffic patterns including incast, permutations, random, all-to-all and many others.

    • The connection matrix can also load a traffic pattern from a given file, generated externally. A number of such files are available in the connection_matrices subdirectory together with python scripts to generate popular traffic patterns.
  • main_* files for the main transports including ndp, swift, tcp and roce. These main files are highly configurable via command line parameters that specify many aspects of the experiments to be performed:

    • Topology size (fully provisioned three tier FatTree is the only one supported and is used by default; support for DragonFly and DragonFly+ to be added soon).
    • (Egress) queue size (-q), size of cwnd (used in the case of ndp to specify the number of grants a sender has upon startup).
    • The queuetype to be used, which typically depends on the transport. Composite is the one with trimming; Random is the one used by TCP / Swift. Lossless_input is the one needed by Roce.
    • The routing strategy to be used, in the case of NDP (e.g. perm which means source routing, ecmp_host which means ECMP in the switches, ecmp_ar uses adaptive routing, etc).
    • The level of logging required. E.g. logsink will log throughput at the receivers.
    • The connection matrix (-tm file)
    • The simulation duration in us (e.g. -end 500)
    • Etc.
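Two of the traffic patterns named in the connection matrix item above, permutation and incast, can be sketched as lists of (source, destination) pairs. This is a hedged Python illustration of the patterns themselves; the function names are hypothetical, and it says nothing about the connection matrix class API or the on-disk format of the files in connection_matrices.

```python
import random


def permutation(n_nodes, seed=0):
    """Each node sends to exactly one other node and receives from exactly one."""
    if n_nodes < 2:
        raise ValueError("a permutation pattern needs at least two nodes")
    rng = random.Random(seed)
    while True:
        dsts = list(range(n_nodes))
        rng.shuffle(dsts)
        # Retry until no node is paired with itself.
        if all(src != dst for src, dst in enumerate(dsts)):
            return list(enumerate(dsts))


def incast(n_nodes, n_senders, receiver=0):
    """The first n_senders nodes other than the receiver all send to it."""
    senders = [n for n in range(n_nodes) if n != receiver][:n_senders]
    return [(src, receiver) for src in senders]
```

A permutation loads every host link with exactly one flow in each direction, while an incast concentrates all flows on a single receiver link.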

Output is both text-based (for debugging purposes and also for FCT reports) and binary; the binary log is written to logout.dat by default, unless -o file is passed as a parameter.

Using ../parse_output to parse the logfile:

  • ../parse_output logout.dat -ndp -show will show the average throughput for all ndp sinks that were configured to log. -tcp, -swift and -roce are supported too.
  • ../parse_output logout.dat -ascii will print the whole logfile in ASCII format; this can be used for postprocessing. E.g. ./parse.sh produces plots of instantaneous throughput (Y axis) against time (X axis).

ROCE

ROCE simulation-related files:

  • main_roce.cpp is the main file - look there to understand all parameters.
  • roce.cpp / roce.h are the source and destination implementations of go-back-n.
  • fat_tree_topology.c / .h is the topology implementation.
  • FatTreeSwitch (.c / .h) is the implementation of the switch model used for ECMP routing, adaptive routing, PFC, etc.

Running a ROCE experiment

  • ./htsim_roce -conns 2048 -nodes 8192 -tm connection_matrices/perm_8192n_2048c_2000000.cm -strat ecmp_ar -paths 1 -log sink -end 2000 -mtu 4000 -switch_latency 0.2 -hop_latency 0.8 -start_delta 10 -q 200 -ar_sticky_delta 4 -ar_method pqb -seed 13 &> out

Once this finishes, run:

  • grep Flow out to see flow completion times.
  • You can use your favourite processing workflow to plot results, etc.

To run a batch of experiments that vary the idle_timer for single path adaptive routing (called ar_sticky_delta), you can use scripts/run_roce_idle.sh as follows:

  • ./scripts/run_roce_idle.sh destination_directory

The destination directory must already exist and be empty when you run the command. The script runs a range of experiments with all AR strategies and idle timers exponentially spaced from 1 to 256, and writes the results (both binary and text outputs) to that directory.
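The shape of such a sweep can be sketched in Python. This is not the content of scripts/run_roce_idle.sh: the function names and output file naming are hypothetical, the two AR methods listed are just the ones that appear in the commands on this page, and passing -o to htsim_roce is assumed from the general description of the main files.

```python
# Fixed arguments, copied from the single-run example on this page.
BASE = [
    "./htsim_roce", "-conns", "2048", "-nodes", "8192",
    "-tm", "connection_matrices/perm_8192n_2048c_2000000.cm",
    "-strat", "ecmp_ar", "-paths", "1", "-log", "sink", "-end", "2000",
    "-mtu", "4000", "-switch_latency", "0.2", "-hop_latency", "0.8",
    "-start_delta", "10", "-q", "200", "-seed", "13",
]


def sticky_deltas(lo=1, hi=256):
    """Exponentially spaced idle timers: 1, 2, 4, ..., 256."""
    values, delta = [], lo
    while delta <= hi:
        values.append(delta)
        delta *= 2
    return values


def sweep_commands(out_dir, methods=("pqb", "qb")):
    """One argument list per (AR method, idle timer) combination."""
    commands = []
    for method in methods:
        for delta in sticky_deltas():
            commands.append(BASE + [
                "-ar_method", method,
                "-ar_sticky_delta", str(delta),
                "-o", f"{out_dir}/{method}_{delta}.dat",  # hypothetical naming
            ])
    return commands
```

Each returned list can be handed to a process launcher of your choice, with the text output redirected to a file next to the binary log.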

You can then run

  • ./scripts/parse_directory destination_directory to generate the PDFs and data required for them for median, 99pct, max, etc.
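The summary statistics mentioned above (median, 99th percentile, max) can be computed from a list of flow completion times as in the following sketch. It is an illustration, not what scripts/parse_directory does: the function name is hypothetical, the nearest-rank percentile definition is one choice among several, and extracting the completion times from the Flow lines of the text output is left out because their format is not described here.

```python
def fct_summary(fcts):
    """Median, 99th percentile and max of flow completion times.

    Uses the nearest-rank definition: the p-th percentile is the value at
    rank ceil(p * n / 100) in the sorted list (1-based).
    """
    ordered = sorted(fcts)
    if not ordered:
        raise ValueError("no flow completion times given")

    def percentile(p):
        rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p*n/100)
        return ordered[rank - 1]

    return {"median": percentile(50), "p99": percentile(99), "max": ordered[-1]}
```

For example, for 100 flows with completion times 1 to 100 (in any unit), this reports a median of 50, a 99th percentile of 99 and a max of 100.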

NDP/EQDS

To run an NDP experiment, a required argument is the strategy for packet-level multipathing, which can be perm (source routed, near-perfect), ecmp (virtual paths given with -paths X), ecmp_host (similar, but with actual switches doing the hashing) or ecmp_ar (adaptive routing).

Example to run with source routing:

  • ./htsim_ndp -nodes 16 -tm connection_matrices/perm_16n_16c.cm -cwnd 50 -strat perm -log sink -q 50 -end 1000

To see the average flow throughputs, use:

  • ../parse_output logout.dat -ndp -show

To run with per-packet adaptive routing, using queue size and bandwidth utilization:

  • ./htsim_ndp -nodes 16 -tm connection_matrices/perm_16n_16c.cm -cwnd 50 -strat ecmp_ar -ar_method qb -log sink -q 50
