Usage Guide

Niranjhana Narayanan edited this page Jun 8, 2020 · 4 revisions

To use TensorFI, users must first import the master TensorFI library in their programs (e.g., import TensorFI as ti). Then they need to create the fault injector with the code below, after the main graph of the session is constructed but before it is run (NOTE: they need not know precisely where these steps happen, as long as the code below is executed in between).

import TensorFI as ti
fi = ti.TensorFI(s)

where s is the main session. Optionally, the following parameters can be given:

  1. fiConf: This is the location of the fault injection config file. If not specified, it defaults to the file confFiles/default.yaml. Note that right now, we only support YAML configuration files, but this can change in the future. If no config file can be found, then it uses a default configuration for testing.

  2. logLevel: Whether to enable debugging and information logging for the fault injector. The default option is None. A value of 10 indicates debug, and 20 indicates info. These correspond to the standard Python logging levels.

  3. name: Each fault injector is optionally assigned a name string, which is used in debug logging. If no name is specified, then it is given the name "noName".

  4. disableInjections: This is a boolean variable, which if set to True, disables fault injections upon calling the run command. The default value is False, and hence injections are enabled as soon as the above line of code is encountered. But in some situations, it may be preferable to wait until later to perform the fault injections, e.g., when the user is using fiRun or fiLoop (later).

  5. fiPrefix: This is the prefix to attach to all fault injection nodes inserted by TensorFI, for easy identification in the graph (e.g., with TensorBoard). The default prefix is "fi_".
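
The two documented logLevel values match Python's logging constants, so the optional parameters can be written with symbolic names. The following is a minimal sketch: the dictionary name fi_kwargs and the injector name are illustrative, and the commented-out constructor call assumes that TensorFI accepts these parameters as keyword arguments, as the list above suggests.

```python
import logging

# Optional TensorFI parameters gathered in one place (names taken from the list above).
fi_kwargs = {
    "logLevel": logging.DEBUG,   # 10; use logging.INFO (20) for less detail
    "name": "convolutionFI",     # hypothetical injector name used in debug logs
    "disableInjections": True,   # instrument now, enable injections later
    "fiPrefix": "fi_",           # documented default prefix
}

# With TensorFI installed and a session `s` already constructed (not run here):
# fi = ti.TensorFI(s, **fi_kwargs)
```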

Once the above line is encountered, TensorFI runs the instrumentation phase, and when the session.run() command is launched, it runs the execution phase. From this point on, faults will be injected into the execution phase of the graph unless they were turned off using the disableInjections option above.

All fault injections will be logged to a log file. See Logging below for more details about the log format.

While this is all that is necessary to inject faults, often one wants more fine-grained control over the fault injection process and statistics reporting. To that end, TensorFI provides the following helper functions to developers:

1. turnOnInjections()/turnOffInjections():

These turn fault injections on and off respectively during the execution phase.

2. launch(numInjections, diffFunction, collectStats):

This method launches fault injections using the most recent arguments to the session.run function: the tensorList that was passed to it and the feed_dict parameters. It assumes that session.run has been executed at least once after the fault injector was initialized; otherwise, it won't work. If you are not sure about this, use the run method below. The launch method takes the following options:

  1. numInjections - The number of injections to be performed

  2. diffFunction - A function that takes a single argument and returns a single number to indicate the difference between the correct result and the faulty run's value that is passed as an argument. It is assumed that the function "remembers" the correct run's value from before. The difference value is used to update the statistics.

  3. collectStats - An optional parameter to collect the Injection statistics (Default is None). If set, statistics are collected using the specified collector. If the fileName is specified in the statistics collector, then its contents are written to the file when it is destructed.

  4. myID - An optional parameter that is used to identify the thread that launches the injections in the debug logs. It is set to 0 by default and used within pLaunch below.
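
As an illustration of a diffFunction that "remembers" the correct value, here is a minimal sketch for a classifier. The scores, the argmax helper, and the function names are made up for this example and are not part of TensorFI; the function returns 1 when the faulty run predicts a different class and 0 otherwise.

```python
def argmax(values):
    """Index of the largest entry in a flat list of scores."""
    return max(range(len(values)), key=lambda i: values[i])

# Output of the fault-free run, recorded before launching injections (made-up scores).
correct_scores = [0.1, 0.7, 0.2]
correct_label = argmax(correct_scores)

def diff_function(faulty_scores):
    """Return 1 if the faulty run predicts a different class, else 0."""
    return int(argmax(faulty_scores) != correct_label)
```

A function of this shape could be passed as the diffFunction argument, so that the accumulated difference counts misclassifications.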

3. run(numInjections, diffFunctionGen, tensorList, feed_dict, collectStats):

This is similar to the launch method except that it does not assume that session.run has been called before. In fact, it explicitly calls session.run with no faults injected, with the arguments tensorList and feed_dict. It then enables fault injection and performs numInjections injections. Note that the parameter diffFunctionGen differs from the diffFunction in launch as follows - the diffFunctionGen takes a single parameter as an argument (the correct result value), and generates a function corresponding to diffFunction.

In other words, it is a generator for diffFunction that remembers the correct result that is passed to it after the golden run execution.
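
The relationship between diffFunctionGen and diffFunction can be sketched with a closure. The toy_run helper below is a hypothetical stand-in for the flow described above (one golden run, then repeated faulty runs); it does not call TensorFI or TensorFlow, and the lambda "runs" are placeholders.

```python
def diff_function_gen(correct):
    """Take the golden-run result and return a diffFunction that remembers it."""
    def diff_function(faulty):
        return abs(correct - faulty)
    return diff_function

def toy_run(num_injections, diff_gen, golden_run, faulty_run):
    """Toy stand-in: one golden run, then num_injections faulty runs."""
    diff = diff_gen(golden_run())
    return [diff(faulty_run(i)) for i in range(num_injections)]

# Golden result is 10.0; the i-th "faulty" run is off by i.
diffs = toy_run(3, diff_function_gen, lambda: 10.0, lambda i: 10.0 + i)
```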

4. pLaunch(numberOfInjections, numberOfProcesses, diffFunction, collectStatsList, parallel, useProcesses, timeout)

This method is used to launch fault injection processes in parallel. The semantics are identical to launch except that it launches multiple threads or processes, each with the launch method. The arguments are:

  1. numberOfInjections - Same as numInjections in launch

  2. numberOfProcesses - Number of threads/processes to launch in parallel. We use the terms thread and process interchangeably.

    NOTE: The injections will be evenly divided among the threads except that the last thread will get more/less injections depending on the total number of injections.

  3. diffFunction - Same as the launch method's diffFunction

  4. collectStatsList - A list of FIStats objects that has at least as many elements as numberOfProcesses. Each process writes its statistics to a separate element of the list in order, so what you get is the statistics corresponding to that process. You need to call collateStats with the list passed to the pLaunch function after it's done in order to collate all the statistics in one FIStats object.

    NOTE: The behavior is unspecified if collectStatsList has objects of type other than FIStats.

  5. parallel - Boolean flag to decide if the processes/threads should be launched in parallel (default value is True). Setting it to False is useful for debugging purposes.

  6. useProcesses - Boolean flag to determine if processes or threads should be used for parallelism. Default value is False, which means threads are used by default. This is because TensorFlow doesn't play well with separate processes and hangs (FIXME in the future). However, because Python uses a Global Interpreter Lock (GIL), using threads may not result in true parallelism on a multi-core machine, as the threads will be serialized.

  7. timeout - An optional parameter specifying the maximum amount of time to wait for a thread. The default value is None, which means that there is no timeout. If it is specified, the main thread will wait for at most timeout seconds in the join call. NOTE: The behavior of a timed-out thread can be unstable, as the main thread will terminate the session and leave the timed-out thread hanging!
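
The division of work and the per-thread statistics slots can be sketched in plain Python. This is a toy model, not TensorFI's implementation: the floor-division split with the remainder going to the last worker is one assumed scheme consistent with the note above, and the dictionaries stand in for FIStats objects.

```python
import threading

def split_injections(total, workers):
    """Divide `total` injections across `workers`; the last worker absorbs the remainder."""
    share = total // workers
    counts = [share] * workers
    counts[-1] = total - share * (workers - 1)
    return counts

def worker(my_id, count, stats_list):
    # Stand-in for one launch() call: record how many injections this worker performed.
    stats_list[my_id]["injected"] += count

def toy_plaunch(total, workers, timeout=None):
    stats_list = [{"injected": 0} for _ in range(workers)]  # one slot per worker
    threads = []
    for my_id, count in enumerate(split_injections(total, workers)):
        t = threading.Thread(target=worker, args=(my_id, count, stats_list))
        threads.append(t)
        t.start()
    for t in threads:
        t.join(timeout)  # wait at most `timeout` seconds; None waits indefinitely
    return stats_list

def collate(stats_list):
    """Combine the per-worker slots into one summary, in the spirit of collateStats."""
    return {"injected": sum(s["injected"] for s in stats_list)}

per_worker = toy_plaunch(10, 3)
total = collate(per_worker)
```

Because each worker writes only to its own slot, no locking is needed until the results are collated after the joins.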

5. doneInjections:

This resets the session and replaces the monkey patched run with the original run method. No faults can be injected after this call. The purpose of this function is to allow TensorFlow to run at native speeds again after the injections are finished.
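
The idea of swapping a monkey-patched run for the original can be illustrated with a toy session class. Everything here (ToySession, instrument, done_injections, and the off-by-one "fault") is hypothetical and only mirrors the mechanism described above; it is not TensorFI's code.

```python
class ToySession(object):
    """Stand-in for a session object; not a TensorFlow class."""
    def run(self, value):
        return value

def instrument(session):
    """Replace session.run with a wrapper that corrupts the result; return the original."""
    original_run = session.run
    def faulty_run(value):
        return original_run(value) + 1  # pretend fault: off-by-one result
    session.run = faulty_run
    return original_run

def done_injections(session, original_run):
    """Restore the unpatched run method so later calls take the native path."""
    session.run = original_run

s = ToySession()
saved = instrument(s)
during = s.run(41)      # goes through the patched wrapper
done_injections(s, saved)
after = s.run(41)       # back to the original method
```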

Config File:

You can get the configuration parameters by calling getFIConfig(). This returns a FIConfig object (defined in fiConfig.py) whose fields can be used to read and modify the configuration values - this is useful for running experiments by repeatedly changing the config parameters in each run. Though the config file is read only once at the beginning, any changes to the FIConfig structure will be reflected in subsequent fault injection trials. The injectMap field of FIConfig keeps track of the probabilities for injecting into each instruction, referenced by the Enum Ops.

Each config file includes 6 sections:

  1. A fault type for scalars and a fault type for tensors

  2. Operations and a probability

  3. A seed

  4. A skip count

  5. Number of instances an operation appears in the program

  6. An inject mode

A sample file can be found in confFiles/default.yaml. Here is a detailed description of the 6 sections:

  1. The fault types for both scalars and tensors can be one of the following values:

    1.1. None - does not inject fault

    1.2. Rand - replace all the data items in the output of the target op with random values

    1.3. Zero - change the value into all zeros

    1.4. Rand-element - replace one of the data items in the output of the target op with a random value

    1.5. bitFlip-element - single bit-flip over one data item in the output of the target op

    1.6. bitFlip-tensor - single bit-flip over all data items in the output of the target op

    Example:

     ScalarFaultType: None
     TensorFaultType: bitFlip-element
    
  2. A full list of allowed operations is in fiConfig.py. One can change the Ops class in fiConfig.py to add new operations. Each operation is paired with a probability, which is the probability that a fault will be injected into that particular operation. These probabilities are used when InjectMode in section 6 is "errorRate".

    Example:

    - ALL = 1.0
    - LRN = 1.0
    - EQUAL = 1.0
    - SOFT-MAX = 1.0
    
  3. A seed is used to seed the randomness of the fault injection. Fault injection will be random if unspecified.

    Example:

    Seed: 1000
    
  4. The skip count is the number of operations to skip at the beginning of the program before faults start being injected.

    Example:

    SkipCount: 1
    
  5. Instances lists the same operations as section 2. Each operation is paired with the number of times it occurs in a given model. These counts are used when InjectMode in section 6 is "dynamicInstance". The sum of all these counts is used when InjectMode in section 6 is "oneFaultPerRun".

    Example:

    - CONV2D = 2
    - MAX-POOL = 2
    - RELU = 3
    - BIASADD = 2
    - RESHAPE = 1
    - ADD = 3
    - MATMUL = 3
    
  6. There are three inject modes:

    6.1. errorRate: Uses the error rate specified in section 2 to determine which operations are injected with faults

    6.2. dynamicInstance: Injects each operation type in the program once

    6.3. oneFaultPerRun: Perform one random injection per run

    Example:

    InjectMode: "dynamicInstance"
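
To make the fault types, the seed, and the errorRate decision more concrete, here is a rough sketch on a flat list of floats. It is an illustration under stated assumptions and not TensorFI's implementation: values are treated as 32-bit floats, random values are drawn from [0, 1), and the helper names flip_bit, apply_fault, and should_inject are invented for this example.

```python
import random
import struct

def flip_bit(value, bit):
    """Flip one bit (0-31) in the IEEE 754 single-precision encoding of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

def apply_fault(values, fault_type, rng):
    """Return a faulty copy of `values` for one of the fault types listed above."""
    out = list(values)
    if fault_type == "None":
        return out
    if fault_type == "Zero":
        return [0.0] * len(out)
    if fault_type == "Rand":
        return [rng.random() for _ in out]
    if fault_type == "bitFlip-tensor":
        return [flip_bit(v, rng.randrange(32)) for v in out]
    index = rng.randrange(len(out))  # element-level faults touch one item
    if fault_type == "Rand-element":
        out[index] = rng.random()
    elif fault_type == "bitFlip-element":
        out[index] = flip_bit(out[index], rng.randrange(32))
    else:
        raise ValueError("unsupported fault type: %s" % fault_type)
    return out

def should_inject(probability, rng):
    """errorRate-style decision: inject with the configured probability."""
    return rng.random() < probability

rng = random.Random(1000)  # same role as Seed: 1000 in the config example
faulty = apply_fault([1.0, 2.0, 3.0], "bitFlip-element", rng)
```

Creating the generator from a fixed seed makes the sequence of injection decisions and corrupted values repeatable across runs.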
    

Logging:

By default, the fault injection logs are written to the file "name-log" where name is the name of the fault injector. The logs specify

a. time of the injection (from experiment start time)

b. runCount of the injection (for each experiment)

c. count of the chosen operation (within an experiment)

d. Operation injected into (As specified in fiConfig.OpTypes)

e. Original value of the injected scalar or Tensor

f. Fault injected value of the injected scalar or Tensor

In case of multi-threaded injections, each thread creates its own log file. These are suffixed with the name of the thread. In particular, TensorFlow uses a lot of dummy threads, so you will see a lot of log files ending with "-dummyThread". Note that logs are always appended to the prior run, so you'll have to manually erase the log files if you want a clean slate. Each experiment can be identified from the experiment start time line written above. Unfortunately, at this time, there is no easy way to map TensorFlow thread names to the names of the threads launched in the fault injector.
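
Since logs are appended across runs, a small cleanup helper can be handy before a fresh experiment. The sketch below assumes that all log files for an injector start with "<name>-log" (the per-thread suffix format is a guess based on the description above); clear_fi_logs is a hypothetical helper, not part of TensorFI.

```python
import glob
import os

def clear_fi_logs(name, directory="."):
    """Delete '<name>-log' and any suffixed per-thread variants; return the removed paths."""
    pattern = os.path.join(directory, name + "-log*")
    removed = sorted(glob.glob(pattern))
    for path in removed:
        os.remove(path)
    return removed
```

For example, clear_fi_logs("noName") would remove the logs of an injector that was created without an explicit name, assuming the naming pattern above holds.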

FIXME: There's also a potential race condition in the runCount value logged to the fault log above, so users shouldn't rely on this value being correct.