# What is Software Optimization and Why Does it Matter?


**`Performance is a measure of how efficiently a system is performing its task. Typically, performance is measured in terms of the accuracy, inference speed, or energy efficiency of the system.`**

In our efforts to improve the performance of a system, there are generally two approaches we can take. The first is hardware optimization:

**`Hardware optimization`** 
* can be as simple as `changing to a new hardware platform` or 
* as complicated as `building custom hardware` specifically designed to increase the performance of a particular application.

And the second approach is software optimization:

**`Software optimization`** involves `making changes to your code or model to improve your application's performance.` As applied to Edge computing, this will involve techniques and `algorithms that reduce the computational complexity of the models`.
  * Use a new model that takes 30 sec for inference instead of a model that takes 35 sec
  * Refactor your image preprocessing code to reduce the time spent in image preprocessing
  * Changing the model precision saved us time when loading our model. `Reducing the precision can also reduce the size of our model file.` So while our FP16 model does not give much improvement in terms of model loading time, it takes only about half the storage space of the FP32 model. `This would be advantageous if memory were a constraint in our system`.
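
The storage saving from lower precision can be checked with simple arithmetic. The sketch below is only an illustration: the function name `model_size_mb` and the 25 million parameter count are assumptions, and it counts the weights alone, ignoring any metadata stored in a real model file.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Hypothetical model with 25 million parameters (an assumed figure).
NUM_PARAMS = 25_000_000

fp32_mb = model_size_mb(NUM_PARAMS, 32)
fp16_mb = model_size_mb(NUM_PARAMS, 16)

print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")
```

Halving the number of bits per weight halves the estimated size, which matches the FP16 versus FP32 observation above.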

#### Note:

One additional thing to keep in mind is that not all systems need to perform inference at the same rate. For example, as we noted in the video, a system that is performing inference on a video feed from a parking lot may be taking in a lower number of frames per second than one that is running inference on high-speed traffic. This is something you should consider when looking into whether software optimization would be beneficial.

# Types of Software Optimization

In most deep learning applications, loading and performing inference on a model takes up the most time. For this reason, many of the techniques that have been developed focus on **model optimization**.

Note: Unless otherwise specified, whenever we talk about software optimization in this course, we are referring to optimizing the **model and not the code**.

Broadly speaking, there are two ways to optimize our model, based on how the model is changed. We can:

**`Reduce the size of the model:`**
This will reduce the time it takes to load the model and perform inference by removing unimportant or redundant parameters from our network.
  * The model will load faster
  * The model will compile faster
  * Less space will be required for storing the model
  * There will be a reduction in the number of model parameters

**`Reduce the number of operations:`**
This will reduce the time taken to perform inference by reducing the number of operations or calculations needed to execute the network. This can be done by `using more efficient layers and by removing connections in between neurons in our model`.

* A larger "teacher" model trains a smaller model- **Knowledge distillation**

* High precision weights are converted to low precision weights- **Quantization**

* Model size is reduced by reducing the number of weights that need to be stored- **Model compression**

* Neurons or connections between neurons are removed in the model- **Model pruning**

* A computationally expensive layer is replaced with a computationally simple one- **Using efficient layers**
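
Two of these techniques are simple enough to sketch on a plain list of weights. This is a toy illustration only, not how any particular framework implements them: the function names, the pruning threshold, and the symmetric int8 scheme are all assumptions chosen to keep the example short.

```python
def prune_by_magnitude(weights, threshold):
    """Model pruning (toy version): zero out weights with a small magnitude."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize_int8(weights):
    """Quantization (toy version): map floats to integers in [-127, 127].

    Returns the integer weights and the scale needed to recover
    approximate float values (float ~= integer * scale).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


# Hypothetical weights for a single neuron.
weights = [0.50, -0.02, 0.25, 0.01, -1.00]

pruned = prune_by_magnitude(weights, threshold=0.05)
q_weights, scale = quantize_int8(weights)
restored = [q * scale for q in q_weights]

print("pruned:   ", pruned)
print("quantized:", q_weights)
print("restored: ", [round(w, 3) for w in restored])
```

Pruning leaves fewer non-zero weights to store and multiply, while quantization keeps every weight but stores it in fewer bits at the cost of a small rounding error.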


# Performance Metrics

**`A metric is a quantity or an attribute of a system that can be measured. A metric should help us infer useful information about a system.`**

In the case of an Edge AI system, we want to measure two kinds of performance:

**Software Performance:**

This is used to understand the properties of our model and application. Model accuracy is a good example of a metric used to measure software performance.

**Hardware Performance:** 

This is used to understand the properties of the device our model is running on. For instance, power consumption is a hardware metric that can be used to decide the size of battery our system will require.


* Energy table for 45nm CMOS process.

|Operation|Energy [pJ]|Relative Cost|
|-------|-------|-------|
|32 bit int ADD|0.1|1|
|32 bit float ADD|0.9|9|
|32 bit Register File|1|10|
|32 bit int MULT|3.1|31|
|32 bit float MULT|3.7|37|
|32 bit SRAM Cache|5|50|
|32 bit DRAM Memory|640|6400|

Adapted from Figure 1 of [ Learning both Weights and Connections for Efficient Neural Networks (Han et al., 2015).](https://arxiv.org/pdf/1506.02626.pdf)

QUESTION 

**From the table, what can you conclude about the energy taken to access memory vs. the energy taken to perform operations?**

answer:`The energy taken to access memory is far greater than the energy taken to perform operations`


`This is why we need to reduce the size of our model. A model with either fewer weights or smaller weight sizes (for instance 8 bit weights instead of 32 bits) will consume much less energy.`
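
To put a number on this, the sketch below uses the energy values from the table above to compare one DRAM access with one floating point multiply plus add. The dictionary keys and function names are made up for this example.

```python
# Energy per 32 bit operation in picojoules, copied from the table above
# (45nm CMOS process).
ENERGY_PJ = {
    "int_add": 0.1,
    "float_add": 0.9,
    "register_file": 1.0,
    "int_mult": 3.1,
    "float_mult": 3.7,
    "sram_cache": 5.0,
    "dram_memory": 640.0,
}


def float_mac_energy_pj() -> float:
    """Energy of one float multiply followed by one float add."""
    return ENERGY_PJ["float_mult"] + ENERGY_PJ["float_add"]


def dram_to_mac_ratio() -> float:
    """How many float multiply-adds cost as much as one DRAM access."""
    return ENERGY_PJ["dram_memory"] / float_mac_energy_pj()


print(f"One float multiply + add: {float_mac_energy_pj():.1f} pJ")
print(f"One DRAM access costs about {dram_to_mac_ratio():.0f}x as much")
```

With these figures, fetching a single 32 bit value from DRAM costs roughly 139 times the energy of the arithmetic performed on it, so every weight that does not have to be fetched is a meaningful saving.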

**Performance Metrics**
* **Inference time**
  * should be reduced to increase performance
* **Model size**
  * smaller models take less time and energy to load
* **Accuracy**
  * If power and cost are not a concern and the deployment environment is not resource constrained, a highly accurate model can be deployed even if the model is complex and its inference time is high.
  * should be high, but not at the cost of other metrics
* **Precision and recall** (in classification and recommendation)

**Recall**
ratio of true positives to all actual positive instances (true positives plus false negatives)

**Precision**
ratio of true positives to all predicted positive instances (true positives plus false positives)
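
Using the standard definitions, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. A minimal sketch, with hypothetical counts chosen only for illustration:

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts from a classifier's predictions.
TP, FP, FN = 80, 20, 40

p = precision(TP, FP)
r = recall(TP, FN)

print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here the model is right 80% of the time when it predicts a positive, but it finds only two thirds of the positives that exist.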

* **Latency and throughput**

### Latency and Throughput
Latency and throughput are closely related metrics but not the same thing.

**`Latency is the time taken from when an image is available for inference until the result for that image is produced`**.
  * Latency is the time it takes to generate output for a given input.
  * Latency is measured either in seconds or milliseconds
 


**`Throughput is the amount of data that is processed in a given interval of time.`**
  * Throughput is the number of inputs that can be processed in a given amount of time.
  * Throughput is measured in inputs per second or per minute (for example, frames per second).

|**Low latency** |**High throughput**|
|-----|-----------|
|One image at a time|Five images at a time|
|Inference time: 1 sec|Batch inference time: 2 sec|
|Latency: 1 sec|Latency: 2 sec|
|Predictions/min: batch size × (60 / latency) = 1 × (60/1)|Predictions/min: batch size × (60 / latency) = 5 × (60/2)|
|Throughput: 60 frames per minute (1 fps)|Throughput: 150 frames per minute (2.5 fps)|
|Latency = 1/Throughput|Latency = Batch size/Throughput|

* Choosing between higher throughput and lower latency depends on your application.
* High throughput is used to generate results for a large volume of data and is more useful when you need to analyse more data in a given time.
* For instance, if you have a single edge device analysing security camera footage from multiple cameras, low latency will not be as important as high throughput.
* If the edge device is controlling the steering wheel of a self-driving car, then processing data sequentially at a much faster rate is more important than processing a large amount of data.
* **Lower latency is more important for real-time applications**.
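
The figures in the comparison table earlier in this section (1 image in 1 sec versus 5 images in 2 sec) can be reproduced with a few lines of arithmetic. The function names here are illustrative assumptions.

```python
def throughput_per_min(batch_size: int, batch_latency_s: float) -> float:
    """Inputs processed per minute when each batch takes batch_latency_s."""
    return batch_size * 60 / batch_latency_s


def latency_s(batch_size: int, throughput_per_s: float) -> float:
    """Latency = batch size / throughput, with throughput in inputs per second."""
    return batch_size / throughput_per_s


single = throughput_per_min(batch_size=1, batch_latency_s=1.0)
batched = throughput_per_min(batch_size=5, batch_latency_s=2.0)

print(f"one image at a time:   {single:.0f} frames/min ({single / 60:.1f} fps)")
print(f"five images at a time: {batched:.0f} frames/min ({batched / 60:.1f} fps)")
print(f"batched latency:       {latency_s(5, batched / 60):.1f} sec")
```

Batching raises throughput from 60 to 150 frames per minute, but every image in the batch now waits 2 sec for its result instead of 1 sec.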

|Latency|Throughput|
|-------|-----|
|For quickly processing single data points|For processing large volume of data|
|Inference is performed as soon as data is available and compute resources are free|Data is stored until a batch is formed, then inference is performed|

# Some Other Performance Metrics


**System power**
* Optimize for longer operating time (for example, in areas with an unreliable power source)

**System size**
* Optimize for less volume
   * An FPGA can give much better latency and throughput, but it occupies more space and needs more power to run
   
**System cost**
* optimize for low deployment cost

**FLOP and MAC**

* **FLOP**: floating point operation
  * Any operation, such as a multiplication or an addition, that involves a floating point value is called a FLOP.
  * If the weights of a neural network are stored as float values, then running inference on that model requires FLOPs.
  * Performing FLOPs requires energy and time. By counting the number of FLOPs in your network, you can estimate the time it will take to execute your model.
  * The more FLOPs, the longer the time to execute.

* **MAC**: Multiply and Accumulate
  * A multiplication followed by an addition
  * MAC operations are typical in neural networks, where weights and activations are first multiplied and then added to other weight-activation products.
  * Performing a MAC operation involves two FLOPs: a multiply and an add
  * 1 MAC = 2 FLOPs


**FLOPs**

* One way to measure inference time for your model is `to calculate the total number of computations the model will have to perform`.
* And a common metric for measuring the number of computations is the FLOP.
* `Any operation on a float value is called a Floating Point Operation or FLOP`.
* This could be an addition, subtraction, division, multiplication, or any other operation that involves a floating point value.
* By calculating the total FLOPs (Floating Point Operations) in a model, `we can get a rough estimate of how computationally complex our model is and how long it might take to run an inference on that model.`
* The more FLOPs or Floating Point Operations a device can perform in a given time, the more computationally powerful that device will be.
* Hardware devices are generally rated for the number of Floating Point Operations they can perform in a second. This is known as FLOPS or Floating Point Operations per Second.
* Note that FLOPs (lowercase "s") is different than FLOPS (capital "S"). FLOPs is a quantity and represents the total number of floating point operations that a device needs to perform, whereas `FLOPS is a rate, and can tell us how many FLOPs a device can perform in a second.`
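
Dividing a quantity (FLOPs) by a rate (FLOPS) gives a time, which is how the rough estimate mentioned above can be made. The sketch below uses made-up numbers for both the model and the device, and it assumes the device sustains its rated FLOPS, which real hardware rarely does, so the result is best read as an optimistic lower bound.

```python
def estimated_inference_time_s(model_flops: float, device_flops_per_s: float) -> float:
    """Rough estimate: total operations divided by operations per second."""
    return model_flops / device_flops_per_s


# Hypothetical figures, not measurements:
MODEL_FLOPS = 4e9      # FLOPs needed for one inference
DEVICE_FLOPS = 100e9   # FLOPS the device is rated for

t = estimated_inference_time_s(MODEL_FLOPS, DEVICE_FLOPS)
print(f"estimated time per inference: {t * 1000:.0f} ms")
```

The estimate ignores memory access, data transfer and pre-processing, so measured inference time will usually be higher.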

**MACs**
* Computations in neural networks typically involve a multiplication and then an addition. For instance, in a dense layer, we multiply the activation of a neuron with the weight for that neuron connection and then add it to another similar product:

$Layer_{output} = W_{1} \cdot A_{1} + W_{2} \cdot A_{2} + \dots + W_{n} \cdot A_{n}$

* Since these operations involve performing a multiplication and then an addition, they are called Multiply-Accumulate operations or simply MACs.
* Since each MAC involves two operations (a multiplication and an addition), this means that we can generally think of one MAC as involving two FLOPs:
* 1 MAC ~ 2 FLOPs
* Actually, in some hardware (especially hardware that is optimized for running many MACs), the multiplication and addition are fused into a single operation. But for the purposes of this course, we'll assume the above (1 MAC = 2 FLOPs) is true when performing calculations.
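
As a worked example of the 1 MAC = 2 FLOPs convention, the sketch below counts the operations in a single dense layer. It assumes each output neuron performs one MAC per input, as in the equation above, and it ignores bias terms and activation functions. The layer sizes are hypothetical.

```python
def dense_layer_macs(num_inputs: int, num_outputs: int) -> int:
    """Each output neuron multiplies and accumulates every input once."""
    return num_inputs * num_outputs


def macs_to_flops(macs: int) -> int:
    """Convention used in these notes: 1 MAC = 2 FLOPs."""
    return 2 * macs


# Hypothetical dense layer: 1024 inputs, 10 outputs.
macs = dense_layer_macs(num_inputs=1024, num_outputs=10)
flops = macs_to_flops(macs)

print(f"MACs: {macs}, FLOPs: {flops}")
```

Summing this count over every layer gives the total FLOPs of the model.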

|OPTIMIZATION  |     METRIC|
|--------------|------------|
|A battery-powered edge system was using Bluetooth Low Energy for sharing log data. This was then removed as there was a wired interface that could do the same. This increased the system's on time.|system power|
|In an edge system, a larger USB-A interface was replaced with a smaller micro-USB interface.|system size|
|An Atom processor was replaced with multiple Neural Compute Sticks in an Edge Computing system|system cost|


`When you change one metric this will usually affect another metric, and often not in a good way. As a developer trying to optimize your model for the edge, you need to be wary of the side-effects of any changes you're making. The paper,` [An Analysis of Deep Neural Network Models for Practical Applications](https://arxiv.org/pdf/1605.07678.pdf) `compares and analyzes the relationships between the metrics we discussed and gives detailed information that can help you design efficient neural networks for the edge.`

# When do we do Software Optimization?
Before you actually do software optimization, `it is important to know when it will not give you the results you are looking for.`


* One of the easiest ways to check whether optimizing the model will help is `to compare the time it takes to perform inference` with the time taken by `other bottlenecks` in our application code, `like pre-processing`.
* If the pre-processing time is significantly more than the model inference time, then optimizing the model will not give much of a performance improvement.

If you want to try it for yourself, you can download the code used in the video [here](https://video.udacity-data.com/topher/2020/March/5e7b6c2c_profiling/profiling.py) or from the link at the end of this page.

Note: In the video, we will be using a line profiler. `A line profiler tells us the time it takes to run each line of code.` In particular we will be using [this](https://github.com/rkern/line_profiler) line profiler.
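
If a line profiler is not available, a coarser version of the same comparison can be made by timing each stage with the standard library. This is a minimal sketch and not the `profiling.py` script linked above: `preprocess` and `infer` are hypothetical stand-ins for your own pre-processing and inference calls, and `time_call` is a helper written for this example.

```python
import time


def time_call(fn, *args, repeats: int = 5):
    """Return fn's result and its average wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = (time.perf_counter() - start) / repeats
    return result, elapsed


# Hypothetical stand-ins for the real application stages.
def preprocess(image):
    return [pixel / 255.0 for pixel in image]


def infer(inputs):
    return sum(inputs)


image = list(range(256)) * 100

inputs, preprocess_s = time_call(preprocess, image)
_, inference_s = time_call(infer, inputs)

total_s = preprocess_s + inference_s
share = preprocess_s / total_s if total_s > 0 else 0.0

print(f"pre-processing: {preprocess_s * 1000:.3f} ms")
print(f"inference:      {inference_s * 1000:.3f} ms")
print(f"pre-processing share of total: {share:.0%}")
```

If pre-processing accounts for most of the total, optimizing the model will change little, and the pre-processing code is the better place to start.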

### Scenario-Based Software Optimization

* Remember that the main purpose of software optimization is to reduce the inference time of our model when we are executing our model on a device with limited computational resources. However, not every system needs to perform at the same rate, so whether you need to perform software optimization will depend a lot on the specific scenario.

* For instance, as we called out earlier in this lesson, if you are trying to read license plate numbers at a parking ticket kiosk, you don't need to have a system that performs inference every few milliseconds. Since cars generally stop for a few seconds at these kiosks, even if your system runs inference every second or two, your system will still function properly.

* On the other hand, if you were trying to read license plate numbers on a busy highway, your system would need to perform inference on frames very quickly—so in this scenario, software optimization would be necessary.
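
The two scenarios can be expressed as a simple budget check: the system keeps up only if one inference finishes within the time available per frame. The function name, the 0.5 sec inference time and the frame rates below are assumptions chosen to mirror the kiosk and highway examples.

```python
def meets_frame_budget(inference_time_s: float, required_fps: float) -> bool:
    """True if one inference fits within the time available per frame."""
    return inference_time_s <= 1.0 / required_fps


# Hypothetical model that takes 0.5 sec per inference.
INFERENCE_TIME_S = 0.5

kiosk_ok = meets_frame_budget(INFERENCE_TIME_S, required_fps=1)     # parking kiosk
highway_ok = meets_frame_budget(INFERENCE_TIME_S, required_fps=30)  # busy highway

print(f"kiosk:   {'fast enough' if kiosk_ok else 'needs optimization'}")
print(f"highway: {'fast enough' if highway_ok else 'needs optimization'}")
```

The same model is acceptable in the first scenario and too slow in the second, which is why the decision to optimize depends on the application.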

# Lesson Review


Here are the topics we covered in this lesson:

* What is Software Optimization?
* Why do we Need Software Optimization?
* Types of Software Optimization
* When to use Optimization Techniques
* Metrics to Measure Performance
* Other Metrics
  * Power
  * Cost
* When do we do Software Optimization?