# 6. Anomaly Detection

[index](../Index.ipynb) | [prev](./05.Forecasting.ipynb) | [next](./07.Conclusions.ipynb)

**Motivation**:

While initially the idea for this system was around object detection, and then forecasting, these two concepts have organically led me to the ultimate idea: <span style="background-color: yellow;">Can object detections be used to detect anomalies and trigger useful alerts to the users?</span>

This Chapter is an analysis of two methods, which I have identified as the most useful at the moment:
- based on unusually high count of objects in a given hour
- based on the content of the frames, which contain predicted objects

## 6.1. Anomalies estimated from event counts

An object detection can be flagged as anomalous when the number of events for its object class (like Person, Car, Cat, etc.) in a given hour is unusually high.

This kind of anomaly detection routine could run in real time whenever objects of a specific class are detected in the video stream. The system compares the number of objects already registered in the current hour (for example $3$) against a threshold (for example $9$) to determine whether the count should be classified as anomalous. If it is, the system could trigger an alert to the home owner.

These min and max thresholds could also be displayed in the forecast as two bars: one on the left and one on the right-hand side of the counts.

This task forms a univariate outlier detection problem, where the system learns the thresholds from the historical counts.

This section is focused around three individual ways of solving this problem:
- IQR
- Z-Score
- Probabilistic method

Just before diving into the details of each method, it is important to understand the distribution of counts. Again, for readability, all work in this section is based on the *Person* object, but the same method can be applied to any object class.

<p style="text-align: center; margin-bottom: 0;">Fig. 6.1. Count distribution - all hours</p>
<img src="../Resources/img/person-count-dist.png" style="width: 50%;"/>

Overall this data is heavily skewed towards 0's, but this is expected. During the night or when it's dark, the number of objects is $0$ as the camera is not night-vision. In other time intervals there is just not much activity happening.

The mean $\mu$ of this population is $1.16$ and standard deviation $\sigma$ is $1.95$. Taking a square root of $\sigma$ gives $1.08$, which is close to $\mu$ and it is one of the main characteristics of the *Poisson distribution*.

Next, looking at the distribution of $\mu$'s for each hour shows a more detailed picture, where bars represent mean averages and red dots are a square root of $\sigma$.

<p style="text-align: center; margin-bottom: 0;">Fig. 6.2. Count distribution by hour</p>
<img src="../Resources/img/count-dist-by-hour.png" style="width: 60%;"/>

This picture shows quite a large spread of means by hour, and it is convincing enough to focus on anomaly detection for individual hours separately.

And looking at the frequency of counts for 4PM shows quite heavy skewness towards the left side:

<p style="text-align: center; margin-bottom: 0;">Fig. 6.3. Count distribution at 4PM</p>
<img src="../Resources/img/person-count-dist-singlehour.png" style="width: 40%;"/>

**Note**

All code snippets, more in-depth commentary and code for plots created in this section can be found in the corresponding [Extra Notebook 5](../Notebooks/Extra.05.Person-AnomalyDetectionForHourlyCounts.ipynb).

### 6.1.1. IQR

There are many statistical tools to deal with this kind of challenge. Arguably the most popular and easy to understand is IQR (Interquartile Range).

This method is used in a very popular plotting technique for outlier identification, the boxplot (Tukey 1977). This method is simple to explain, well understood and often produces satisfactory results.

First, one would calculate an *Interquartile Range (IQR)* using the difference between the third and first quartile, where first quartile ($Q1$) and third quartile ($Q3$) are the medians of lower and upper halves of the data respectively:

$$IQR=Q3-Q1$$

Then the lower and upper bounds are calculated by subtracting $IQR*1.5$ from $Q1$ and adding it to $Q3$ respectively:

$$lowerBound=Q1-(IQR*1.5)$$

$$upperBound=Q3+(IQR*1.5)$$

And finally, values below the lower bound or above the upper bound are classified as outliers:

$$
\begin{equation}
  f(x)=\begin{cases}
    anomaly, & \text{if $x<lowerBound$ or $x>upperBound$}\\
    not-anomaly, & \text{otherwise}
  \end{cases}
\end{equation}
$$

, where $x$ is an individual count of objects in a single hour.
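The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the Extra Notebook: the helper names `iqr_fences` and `is_anomaly` and the sample counts are made up, and the quartiles are computed as the medians of the lower and upper halves of the data, as described above.

```python
from statistics import median


def iqr_fences(counts, k=1.5):
    """Return (lower, upper) fences; needs at least two values."""
    data = sorted(counts)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomaly(x, lower, upper):
    """A count is anomalous when it falls outside the fences."""
    return x < lower or x > upper


# Made-up hourly counts for a single hour of the day
counts = [0, 0, 0, 1, 1, 1, 2, 2, 3, 9]
lower, upper = iqr_fences(counts)
flagged = [x for x in counts if is_anomaly(x, lower, upper)]
print(lower, upper, flagged)
```

For these made-up counts the fences are $-3$ and $5$, so only the count of $9$ is flagged. Because counts cannot be negative, the lower fence never triggers here, which is one symptom of applying a symmetric rule to skewed data.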

Below is a boxplot for the count dataset by hour:

<p style="text-align: center; margin-bottom: 0;">Fig. 6.4. IQR analysis - boxplot</p>
<img src="../Resources/img/boxplot-by-hour.png" style="width: 70%;"/>

Looking at the graph, it is clearly flagging a lot of points. After calculating the percentage of points above and below the bounds, IQR method classifies $5\%$ of observations as anomalous.

The high percentage of anomalies is related to the fact that counts for each hour are heavily skewed, and as mentioned in the *Adjusted boxplot* work by Hubert et al., 2008, boxplots suffer from False Positives when heavy skewness exists.

### 6.1.2. Adjusted boxplot for skewed distributions

In a 2008 [paper](https://wis.kuleuven.be/stat/robust/papers/2008/adjboxplot-revision.pdf), M. Hubert et al. proposed an alternative to the IQR method: *An Adjusted Boxplot for Skewed Distributions*.

The procedure is quite similar to IQR, but introduces some additional steps:
- calculation of data skewness (the paper uses a robust measure called the medcouple, MC; the quartile-based expression below has the same form as the medcouple kernel and is a simpler, closely related skewness measure):

$$MC = \frac{(Q3 - Q2) - (Q2 - Q1)}{Q3 - Q1}$$

For the count dataset and Person object class, this measure is $0.33$.

Then, the lower and upper bounds are calculated as follows:

$$lowerBound=Q1-h_l(MC)IQR$$
$$upperBound=Q3+h_u(MC)IQR$$

, where:

$$h_l(MC)=1.5e^{aMC}$$
$$h_u(MC)=1.5e^{bMC}$$

The authors of the paper have tuned the constants to $a=-4$ and $b=3$, so that the fences mark about $0.7\%$ of observations as outliers.
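As a minimal sketch, the adjusted fences can be computed directly from the quartiles and the skewness value. The function name `adjusted_fences` and the quartile values are hypothetical; only $MC=0.33$ and the constants $a=-4$, $b=3$ come from the text. These constants are the ones used for right-skewed data ($MC \geq 0$), so the sketch rejects negative skewness rather than guessing at that case.

```python
import math


def adjusted_fences(q1, q3, mc, a=-4.0, b=3.0):
    """Adjusted boxplot fences for right-skewed data (mc >= 0)."""
    if mc < 0:
        raise ValueError("this sketch only covers mc >= 0")
    iqr = q3 - q1
    h_lower = 1.5 * math.exp(a * mc)   # shrinks the lower whisker
    h_upper = 1.5 * math.exp(b * mc)   # stretches the upper whisker
    return q1 - h_lower * iqr, q3 + h_upper * iqr


# Hypothetical quartiles; mc=0.33 is the skewness reported in the text
lower, upper = adjusted_fences(q1=0, q3=2, mc=0.33)
# With mc=0 the fences reduce to the standard 1.5 * IQR rule
plain_lower, plain_upper = adjusted_fences(q1=0, q3=2, mc=0.0)
print(round(lower, 2), round(upper, 2))
print(plain_lower, plain_upper)
```

With positive skewness the upper fence moves further out and the lower fence moves closer to $Q1$, which is why fewer high counts are flagged than with the plain IQR rule.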

Applying these calculations to the count dataset classifies $148$ observations as outliers, which represents $3.5\%$ of the dataset and still generates too many False Positives.

### 6.1.2. Z-Score

Z-Score determines how many standard deviations $\sigma$ a data point $x_i$ lies away from the mean $\mu$.

$$
zScore_i=(x_i-\mu)\div\sigma
$$

, where $x_i$ is an i-th data point, $\mu$ and $\sigma$ are a sample arithmetic mean and standard deviation respectively.
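A minimal sketch of Z-Score based flagging is shown below. The counts are made up, and the cut-off of three standard deviations is an assumption borrowed from the usual rule of thumb, not necessarily the value used in the Extra Notebook.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z-Score of every value, using the sample mean and standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(x - mu) / sigma for x in values]


# Made-up hourly counts: mostly zeros, a few ones and a single busy hour
counts = [0] * 20 + [1] * 5 + [12]
scores = z_scores(counts)
outliers = [x for x, z in zip(counts, scores) if abs(z) > 3]
print(outliers)
```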

Z-Score has quite interesting properties when applied to Normal distributions, where $99.7\%$ of the data points lie between +/- 3 $\sigma$'s.

In the case of the skewed count dataset it also performs quite well and identifies $75$ outliers, which represents $2\%$ of the dataset.

### 6.1.3. Probabilistic method

Probabilistic models utilise *Bayes' Theorem*, which follows from the definition of conditional probability:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Where:
- $P(A|B)$ is the posterior, meaning conditional probability of event $A$ given that $B$ is true
- $P(B|A)$ is the likelihood, also a conditional probability, of event $B$ occurring given $A$ is true
- $P(A)$ is the prior (information we already know about the data)
- $P(B)$ is the marginal probability of observing event $B$

There are many benefits from using probabilistic modelling. Some of them are included below:
- no assumptions made about the distribution of the data
- it allows us to provide prior information to the model about distributions
- it does not require a lot of data
- it gives us the predictions and the uncertainty about them

**Prior**

In probabilistic programming we use the prior information we already have (like the distribution of the outcome random variable), then we define the likelihood, which tells the library how to sample the probability space given the data, and then we perform an analysis of the posterior, which contains $N$ samples drawn from the distribution.

In relation to the count dataset, I have identified $2$ candidate distributions, which can be used as a prior in the model:

- Half Student T distribution with parameters $\sigma=1.0$ and $\nu=1.0$ and density function:

$$f(t)=\frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu \pi} \Gamma (\frac{\nu}{2})} (1 + \frac{t^2}{\nu})^{-\frac{\nu + 1}{2}}$$

, where $\nu$ is the number of degrees of freedom and $\Gamma$ is the gamma function.

- Gamma distribution with parameters $\alpha=1.5$ (shape) and $\beta=0.5$ (rate) and density function:

$$f(x;\alpha;\beta)=\frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$$

, where $x>0$, $\alpha,\beta > 0$ and $\Gamma(\alpha)$ is the gamma function

Below is the multi-plot with both distributions and the true dataset with counts between 1PM and 3PM:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.5. Priors selection</p>
<img src="../Resources/img/anomaly-det-priors.png" style="width: 70%;"/>

Based on the graph above, the Gamma distribution with $\alpha=1.8$ and $\beta=0.8$ seems to be more suitable for this data.

**Likelihood**

The next item we need is the likelihood function, which is used to estimate the counts for every hour, given the data $X$.

A suitable likelihood function will use the Poisson process.

Poisson is a discrete probability distribution, which is used when we need to model the number of events occurring in a time interval.

As per [Wiki page about Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), probability mass function of $X$ for $k=0,1,2,3, ...$ is given by:

$$f(k;\lambda) = Pr(X=k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

, where $\lambda>0$, *expected value* and *variance* are both equal to $\lambda$, e is Euler's number ($e=2.718...$) and $k!$ is the factorial of k.

The likelihood function for Poisson process is given by:

$$L(\lambda;x_1,...,x_n)=\prod^{n}_{j=1}exp(-\lambda)\frac{1}{x_j!}\lambda^{x_j}$$

As highlighted in the [online.stat.psu.edu article](https://online.stat.psu.edu/stat504/node/27/), likelihood is a tool for summarizing the data’s evidence about unknown parameters, and often (due to computational convenience), it is transformed into log-likelihood.

The log-likelihood for the Poisson process is given by:

$$l(\lambda;x_1,...,x_n)=-n \lambda - \sum^n_{j=1}ln(x_j!)+ln(\lambda)\sum^n_{j=1}x_j$$
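The log-likelihood above is easy to evaluate numerically. The sketch below uses made-up counts and a hypothetical helper `poisson_log_likelihood`; it relies on the identity $ln(x!)=ln\,\Gamma(x+1)$ and on the standard result that the sample mean maximises the Poisson log-likelihood.

```python
import math


def poisson_log_likelihood(lam, counts):
    """Poisson log-likelihood of a rate `lam` given observed counts."""
    n = len(counts)
    return (-n * lam
            - sum(math.lgamma(x + 1) for x in counts)   # ln(x!) = lgamma(x + 1)
            + math.log(lam) * sum(counts))


# Made-up counts for a single hour across several days
counts = [0, 1, 2, 1, 3, 0, 2, 1]
mle = sum(counts) / len(counts)   # the sample mean maximises the log-likelihood

for lam in (0.5, mle, 2.0):
    print(lam, round(poisson_log_likelihood(lam, counts), 3))
```

The printed value is highest for the middle rate, the sample mean, and drops on either side of it.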

Now coding up the solution is made very simple with libraries for Probabilistic Programming, like `PyMC3`:
- first define a Gamma prior (it is possible to have a list of $24$ priors - one for each hour)
- then define a list of Poisson likelihood functions (again, $1$ for each hour)
- finally, sample from the posterior and analyse results
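The PyMC3 model itself lives in the Extra Notebook. As a quick cross-check that needs no sampler, the Gamma prior is conjugate to the Poisson likelihood, so for a single hour the posterior of $\lambda$ is again a Gamma distribution with shape $\alpha+\sum x_j$ and rate $\beta+n$. The sketch below is a hand calculation under that assumption, not the notebook's model; the function name and the counts are made up, and the prior parameters are the $\alpha=1.8$, $\beta=0.8$ chosen above.

```python
import math


def gamma_poisson_posterior(counts, alpha=1.8, beta=0.8):
    """Closed-form Gamma posterior (shape, rate) for a Poisson rate."""
    return alpha + sum(counts), beta + len(counts)


# Made-up counts for a single hour across several days
counts_4pm = [0, 1, 2, 1, 3, 0, 2, 1]
alpha_post, beta_post = gamma_poisson_posterior(counts_4pm)

post_mean = alpha_post / beta_post            # mean of a Gamma(shape, rate)
post_sd = math.sqrt(alpha_post) / beta_post   # its standard deviation
print(round(post_mean, 3), round(post_sd, 3))
```

The posterior mean and standard deviation obtained this way should be close to what the sampler reports for the same hour, which makes it a cheap sanity check.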

Before going into the results, I would like to explain why sampling is used by PyMC3 and which sampling method is appropriate to which kind of data.

To estimate the values of the model's parameters, there are two paths to take:
- numerical optimization methods, which are usually fast and easy and return the maximum a posteriori (MAP) point estimate (`find_MAP` function in PyMC3)

The default optimization algorithm is BFGS (Broyden–Fletcher–Goldfarb–Shanno, Fletcher, Roger, 1987), but other functions from `scipy.optimize` are acceptable. The downside of this approach is that it often finds only local optima, and as advised in [PyMC3 documentation](https://docs.pymc.io/notebooks/getting_started.html), this method only exists for historical reasons. The second limitation is the lack of an uncertainty measure, as only a single value for each parameter is returned.

- sampling based optimization used for more complex scenarios (`sample` function in PyMC3)

This method is a recommended, simulation-based approach with a few algorithms suitable for different problems:
- Binary variables will be assigned to BinaryMetropolis
- Discrete variables will be assigned to Metropolis
- Continuous variables will be assigned to NUTS (No-U-Turn Sampler)

The `sample` function returns a `trace` object, which can be queried to obtain the samples for an individual parameter. The standard deviation of these values can be interpreted as the uncertainty.

The sampling process for the count dataset takes less than $30$ seconds.

Below are some of the most useful statistics (like mean, standard deviation, median and quantiles) for each hour, which can be easily generated by PyMC3 with the use of `pm.summary` method:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.6. PyMC3 - Posterior stats</p>
<img src="../Resources/img/posterior_stats.png" style="width: 90%;"/>

Next, one can take advantage of having multiple samples for the estimated rate $\lambda$, and generate $N$ counts for all these rates (using all sampled rates for a single hour embeds the uncertainty into the generated counts).

The probability density for 4PM can then be plotted, and questions can be asked about the probability of obtaining a count $K$:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.7. Probability density for 4PM</p>
<img src="../Resources/img/4pm-prob-dens.png" style="width: 60%;"/>

For example, if $K=8$, then there is a $1\%$ chance of seeing $8$ or more objects at 4PM.

Now, the final idea here is to define a percentage of observations, which need to be classified as anomalies, and use the approach from above, but this time apply it to all hours.

Once this is set to $0.001$, $61$ anomalous observations are detected, which represents $1.5\%$ of the dataset. The result is the max count fence for each hour, above which counts become anomalies. For example the fence for 4PM has been determined as $9.0$.
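The fence calculation can be sketched without PyMC3, given a set of sampled rates for one hour. In the sketch below the rate samples are a stand-in drawn from a Gamma distribution with hypothetical parameters (in the notebook they would come from the trace), the helper names are made up, and the Poisson draws use Knuth's simple method, which is adequate for small rates.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson count (Knuth's method, fine for small rates)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def count_fence(rate_samples, tail_prob, rng, draws_per_rate=10):
    """Empirical (1 - tail_prob) quantile of counts simulated from the rates."""
    counts = sorted(sample_poisson(lam, rng)
                    for lam in rate_samples
                    for _ in range(draws_per_rate))
    idx = min(math.ceil((1 - tail_prob) * len(counts)) - 1, len(counts) - 1)
    return counts[idx], counts


rng = random.Random(42)
# Stand-in for the posterior samples of one hour's rate (hypothetical values)
rate_samples = [rng.gammavariate(12.0, 1 / 9.0) for _ in range(2000)]

fence, simulated = count_fence(rate_samples, tail_prob=0.001, rng=rng)
tail = sum(c > fence for c in simulated) / len(simulated)
print(fence, tail)
```

Counts strictly above `fence` would be treated as anomalies for that hour. Repeating the calculation for each hour's rate samples gives one fence per hour.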

To visualise the results for all hours I have generated a plot, with a dashed line representing the threshold for anomalies. Red dots are anomalies, and their size corresponds to their magnitude:

<p style="text-align: center; margin-bottom: 0;">Fig. 6.8. Anomaly thresholds by hour</p>
<img src="../Resources/img/anomaly-thresholds.png" style="width: 80%;"/>

### 6.1.4. Summary

It is very clear that the Probabilistic Model is the most flexible and gives the most interesting opportunities based on the generated samples.

The percentage of anomalies is also fully under control, and can be made as strict, or as relaxed as one would like.

The next step in this section would be to actually embed the max count fences in the hourly forecast graph, so it is clear for the users of the system when to expect an alert for each hour, and why an alert has been triggered.

It would be also interesting to calculate the fences using the mean rates estimated by Gaussian Process in the Forecasting Chapter. This way the fence would be tailored to the specific scenario (like weekend, rainy day, morning time for example).

## 6.2. Anomalies estimated from camera frames content

The aim of this section is to investigate if patterns encoded in the raw images can identify anomalies.

Analysis below will be conducted with two goals in mind:
- There could be a process running in real time, which would tell the difference between the normal and unusual images, and based on that, it could send notifications to the users about suspicious activities
- On average, object detector identifies roughly $2000$ images each day. It is very tedious to scan through all of them every day. If there was a score, which could be used to sort images by the "most different ones", the manual process could be eliminated

In the *Deep Learning for Anomaly Detection: A Survey* paper (Chalapathy 2019), the author mentions that even though the fields of Anomaly Detection and Deep Learning have each been well explored by researchers individually, these two areas are not linked enough and there are a lot of opportunities to explore.

Finding anomalies based on the camera frames can be framed as an *unsupervised Machine Learning* problem, where a model is trained to learn to identify anomalies from the raw image content.

Using unsupervised learning has been chosen mainly for two reasons:
- there are over $600K$ images collected in the process without any labels
- anomaly detection usually deals with highly imbalanced datasets (where potentially less than $1\%$ represents anomalous images)

The usage of *Deep Learning* is mostly motivated by the fact, that Deep Neural Networks can efficiently deal with large scale datasets (i.e. image data) and they can use GPU to significantly boost parameter optimization speed (i.e. model training).

The idea, which will be explored in this Notebook, is to build a Deep Learning model, which learns from the training data to reconstruct the most common (non-anomalous) images. Then, when it comes across an anomaly, it reconstructs it with a high error. If this error is higher than some threshold, the image is classified as anomalous.

These types of models are called *Auto Encoders*. They have other use cases as well, like data compression or noise removal, but here an auto encoder is utilised as a Convolutional Neural Network to detect anomalies in the image dataset with detected objects.

<p style="text-align: center; margin-bottom: 0;">Fig. 6.9. Autoencoder diagram</p>
<img src="../Resources/img/ae.png" style="width: 50%;"/>

Please refer to the [Literature Review](./02.LiteratureReview.ipynb) chapter for more theoretical aspects around the auto encoders.

The difference from section 6.1. is that auto encoders will use the data for all object classes combined.

**Note:** All code snippets, more in-depth commentary and code for plots created in this section can be found in the corresponding [Extra Notebook 6](../Notebooks/Extra.06.RawImagesAnomalyDetectionTraining.ipynb).

### 6.2.1. Computer Vision for image pre-processing

Starting point for this section are all object detections created in [Forecasting Chapter, Section 5.1.](./05.Forecasting.ipynb):

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.10. Detections tabular data</p>
<img src="../Resources/img/fcst-detections.png" style="width: 60%;"/>

This dataset contains the folders and filenames of the collected images.

Raw images themselves can be very useful for many purposes (like forensic investigations, automated alerts, or simply historical value), but they need some form of pre-processing before they can be used for Machine Learning.

Computer Vision is a vast area of Artificial Intelligence, which provides a plethora of guidelines and solutions for image processing and image content analysis.

I have defined a function called `process_frame`, which performs morphological operations on images, with flags passed as function arguments selecting which operations to apply. It was partially inspired by a post on PyImageSearch (PyImageSearch 2017):
- convert to gray scale 
- apply Median blur
- apply Thresholding
- crop ROI using polyfill

Here is an example of an original image, and below are the morphological operations applied to it:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.11a. Original image</p>
<img src="../Resources/img/orig-image.png" style="width: 40%;"/>

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.11b. Pre-processed image</p>
<img src="../Resources/img/processed-image.png" style="width: 90%;"/>

As the dataframe at the start only contains a list of all images stored on the disk, the image content itself still needs to be loaded into memory.

The procedure of extracting the raw image content consists of the following steps:
- pre-allocate memory in two numpy arrays: one for the image data and one for corresponding filenames
- take a sample of images (folders and filenames) from the dataframe
- iterate through the sampled dataframe and for each record:
    - open an image from the disk
    - process image using `process_frame` function by using provided pre-processing parameters
    - add image data (as a numpy array) and corresponding filename to the numpy arrays

It makes sense to be extra careful about the resolution of the images. The higher the resolution, the more detail is available for the Deep Learning model, but it comes at a cost:

A single gray-scale $28x28$ pixel image generates a record with $784$ features, and an image with the size $112x112$ results in $12,544$ features.

Since the original images' shape is $720x1280x3$, they would contain $2,764,800$ features only for a single image!

Based on my experiments, increasing size to more than $608x608$ tends to cause issues with the GPU memory and it actually degrades the performance of the model, as it learns too much noise (like leaves, shadows, objects moved by the wind, small items, plants etc.)

Another issue with the high image quality is the speed of image pre-processing and model training: it is a difference between $2$ and $35$ minutes on a single model training, which means $33$ minutes of idle time.

For all the above reasons I am only considering image size $56x56$ in the rest of this Notebook. It is a good tradeoff between too small and too large dimensions. I am also choosing to use $10K$ images, as it is the lowest number of images that produces the best results.

Overall - the smaller the images and less of them, the faster one can iterate over different ideas and more avenues can be explored. Preprocessing steps above took $1$ minute and $40$ seconds.

**Note:** The statistics from using different numbers of images and resolutions are provided in the Conclusion section.

The next step is to add an additional dimension to the dataset, as the Neural Network expects each image to have the shape (height, width, depth). The depth dimension would hold the channels if we wanted to use color images.

Normalizing the data on a $0-1$ scale helps with training stability. Since image data is a `uint8` type, it can take values only between $0$ and $255$, so dividing all values by $255.0$ is the standard normalisation step for image data (data type becomes a `float32` type):

$$normalize(X)=X/255.0$$

And the last step here is to split the dataset into train and test splits. Unfortunately, due to slow training times, it is not advisable to run cross validation splits for Deep Neural Networks. I am choosing a $0.8 / 0.2$ split with a `random_state` parameter set to a constant value for reproducibility reasons.

The shape of the training data, which is now ready to be used in auto encoder Neural Network is $(8000, 56, 56, 1)$.
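The reshaping, normalisation and splitting steps can be sketched with NumPy alone. The array of random pixels below is a stand-in for the pre-processed frames and uses only $100$ images to keep the example light. The seeded permutation is a dependency-free substitute for a library splitting helper with a fixed `random_state`, so it is an assumption rather than the notebook's exact call.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the pre-processed frames: 100 gray-scale 56x56 images (uint8)
images = rng.integers(0, 256, size=(100, 56, 56), dtype=np.uint8)

# Add the depth (channel) axis and scale pixel values to the 0-1 range
X = np.expand_dims(images, axis=-1).astype("float32") / 255.0

# Shuffle with a fixed seed, then keep 80% for training and 20% for testing
order = np.random.default_rng(42).permutation(len(X))
split = int(0.8 * len(X))
X_train, X_test = X[order[:split]], X[order[split:]]

print(X_train.shape, X_test.shape, X_train.dtype)
```

With $10K$ images instead of $100$, the same steps give the $(8000, 56, 56, 1)$ training shape quoted above.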

### 6.2.2. Training with auto encoder

Below is a Convolutional auto encoder Neural Network architecture inspired by a blog entry on PyImageSearch (PyImageSearch 2020). It is built on top of [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) functional API.

$$
autoEncoder=decoder(encoder(X))
$$

- Encoder:
    - Input: X_train
    - Convolutional layer: $32$ and $64$ filters, each followed by Leaky ReLU and Batch Normalization:
        - changing the size of filters or adding/removing filters has decreased the performance
        - the difference between ReLU and Leaky ReLU activations is very small, but Leaky ReLU seems to be a little more robust. I have concluded that letting some activations be slightly negative makes a difference
    - Output: Latent (bottleneck) layer with $16$ nodes by default. Experiments with different sizes ($8$ and $32$) did not bring any improvements, and $8$ nodes decreased the performance by around $10\%$
- Decoder:
    - Input: Latent layer
    - Convolutional transpose layer: $32$ and $64$ filters, each followed by Leaky ReLU and Batch Normalization
    - Single CNN layer: this layer is used to recover the original depth
    - Activation: Sigmoid is used to make sure output values match the input range (between $0$ and $1$)
- Optimizer: Adam with learning rate $\alpha$ and decay $decay=\alpha \div nEpochs$
    - Switching optimizers to Stochastic Gradient Descent or RMS-Prop did not improve the model's performance
- Loss function: mean squared error is used as a simple and well understood error function, which will be used below to identify anomalies

I am leaving out the theory behind the Neural Network layers, as it would inflate this section a lot, it is constantly repeated across many papers and articles and would not make this section more useful. Hopefully the intuition, which is provided above is more valuable.

After many experiments with different model parameters, I have concluded that this architecture is quite optimal in its original shape:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.12. Auto Encoder summary</p>
<img src="../Resources/img/auto-encoder-summary.png" style="width: 45%;"/>

Given the CNN-based architecture above and $56x56$ image size, the model has almost $500K$ trainable parameters. This number goes up to $1.7M$ for $112x112$ images.

Model converges well without any symptoms of overfitting, and only requires $15$ epochs of training with $2s$ per epoch and ends with training loss $0.0135$ and validation loss $0.0167$:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.13. Auto Encoder loss</p>
<img src="../Resources/img/ae-loss.png" style="width: 50%;"/>

**Notes:**
- when $25K$ images are used for training, instead of $10K$, the curves are much smoother and converge after $30$ epochs
- it is important that the model generates some error and does not memorize the whole training set, as this kind of model would not be useful at all to detect anomalies

Below is a table, which can build a nice intuition around the number of epochs to converge and time it takes for each kind of sample size and image resolution:

<p style="text-align: center; margin-bottom: 5px;">Tbl. 6.1. Auto encoder - convergence statistics</p>

| Sample Size | Res.    | Sec. Per Epoch | Epochs to Converge |
|:------------|:-------:|---------------:|-------------------:|
| 10,000      | 56x56   | 2              | 10                 |
| 25,000      | 28x28   | 3              | 20                 |
| 25,000      | 56x56   | 6              | 30                 |
| 25,000      | 112x112 | 15             | 50                 |
| 50,000      | 28x28   | 5              | 50                 |
| 50,000      | 56x56   | 10             | 50                 |
| 50,000      | 112x112 | 28             | 50                 |

Obviously, the more samples and the higher the resolution, the longer it takes to train each epoch and the more epochs are required to converge.

### 6.2.3. Model evaluation on test-set

The test-set contains $2K$ $56x56$ preprocessed, grayscale images, normalized to $0-1$ range.

When predictions are generated using Keras `predict` method, their values are compared against the original images and errors are calculated using *mean squared error* statistic (any suitable error measurement can be actually utilised here).

Making predictions for all $2000$ test images and calculating errors takes only $0.25$ of a second.

Then a decision needs to be made about the percentage of the dataset, which should be classified as anomalous. For example let this percentage be $1\%$, which corresponds to the $0.99$ quantile of the errors.

Then it is simple to calculate a threshold for the error using the $99$th percentile, above which points are classified as anomalies. This threshold value is $0.0651$. Below is a histogram, which shows the distribution of errors with a red dashed line, representing the calculated threshold:
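A minimal sketch of the error and threshold calculation is shown below. The arrays are synthetic stand-ins: `reconstructions` plays the role of the output of the Keras `predict` call, and the helper name `reconstruction_errors` is made up for illustration.

```python
import numpy as np


def reconstruction_errors(originals, reconstructions):
    """Mean squared error per image, averaged over height, width and depth."""
    diff = originals - reconstructions
    return np.mean(diff ** 2, axis=(1, 2, 3))


# Stand-ins for the test images and for the output of the model's `predict`:
# image i is "reconstructed" with a constant offset of i / 200
originals = np.zeros((200, 56, 56, 1))
offsets = np.arange(200) / 200.0
reconstructions = originals + offsets[:, None, None, None]

errors = reconstruction_errors(originals, reconstructions)
threshold = np.quantile(errors, 0.99)   # 99th percentile of the errors
is_anomalous = errors > threshold
print(threshold, np.flatnonzero(is_anomalous))
```

In this synthetic setup only the two images with the largest offsets end up above the threshold, which is $1\%$ of the $200$ images.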

<p style="text-align: center; margin-bottom: 8px;">Fig. 6.14. Auto Encoder error distribution</p>
<img src="../Resources/img/ae-error-dist.png" style="width: 60%;"/>


It is now possible to see the images with the largest error (they would be classified as anomalous). An example is below, showing the original image (left), the preprocessed one (middle) and the one reconstructed by the auto encoder (right):

<p style="text-align: center; margin-bottom: 0;">Fig. 6.15. Auto encoder - anomalous image</p>
<img src="../Resources/img/ae-high-error.png" style="width: 80%;"/>


And the reverse process (looking at the lowest errors) shows the images, which are definitely not anomalies according to the auto encoder:

<p style="text-align: center; margin-bottom: 0;">Fig. 6.16. Auto encoder - normal image</p>
<img src="../Resources/img/ae-normal.png" style="width: 80%;"/>

It worked for the anomalous image (as there is a lot going on in that frame), but it is somewhat surprising for normal images. The first $8$ normal images are very dark and contain a lot of pixels with a value of zero, so reconstructing those is much easier for the model and therefore the error will be much lower as well.

So, what can be said about the output of this procedure?

I think it has the potential, but it most likely needs some improvements in the data collection stage, and perhaps more computer vision preprocessing steps (for example one of the images with highest error was flagged due to the dry patches of otherwise wet surface, and perhaps more sensitivity is required when it is dark).

Finally the training procedure needs to be carefully crafted. Instead of training the model on the whole data, a better idea would be to train on a few rolling months. This would help if the region of interest changes over time (people change their cars or plant new trees, or even move the camera to another location).

### 6.2.4. Model evaluation on hand-labeled data

The very last step in terms of auto-encoder analysis is to see how it behaves when it is fed with hand-picked images.

This procedure makes it possible to generate metrics, with which models can be compared analytically.

I have manually annotated 30 images:
- 15 as non-anomalous
- 15 as anomalous

**Procedure:**

The process is almost the same as error calculation in the previous step (I have defined a function called `test_anomalies` for this task):
- load images from the disk (non-anomalous and anomalous samples reside in their respective 0 and 1 folders)
- preprocess images and reshape data
- run prediction through autoencoder Keras model
- calculate mean squared errors
- establish if anomalies are found based on the previously calculated threshold
- show images (optionally)
- plot errors from autoencoder (optionally)
- return anomaly flags for each image in the folder

**Evaluation metrics:**

Now that the labels are available, standard classification metrics can be utilised to evaluate the model's performance (Stackabuse, 2020):

- Accuracy

Accuracy measures a percentage of correct predictions out of all predictions:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

where $TP$ are True Positives, $TN$ are True Negatives, $FP$ are False Positives and $FN$ are False Negatives

- Precision

Precision tends to be a good metric when the cost of False Positives is high. Out of all the predicted positive instances, how many were predicted correctly:

$$Precision=\frac{TP}{TP+FP}$$

- Recall

Contrary to Precision, Recall is a useful metric when the cost of False Negatives is high. Out of all the positive classes, how many instances were identified correctly:

$$Recall=\frac{TP}{TP+FN}$$

- F1 Score

F1 Score is a mixture of Precision and Recall in a single metric (when classifying 0's and 1's correctly are both equally important). F1 Measure is nothing but the harmonic mean of Precision and Recall:

$$F1=2*\frac{Precision*Recall}{Precision+Recall}$$

In the case of anomaly detection, accuracy tends to be a poor measure due to the dominance of the *Normal* observations. Precision is also not the most useful metric, as the cost of falsely classifying an observation as an anomaly tends to be less detrimental than not catching the anomaly at all.

As a result of the above statements, **Recall** will be the most important metric to observe and optimize for, while keeping an eye on the F1 Score to not sacrifice too much on the other type of errors.

Below is the plot showing previous error distribution with the red dashed line representing the anomaly threshold, and a set of short red and green lines showing anomalous and normal predictions respectively:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.17. Auto encoder - hand crafted images classification</p>
<img src="../Resources/img/au-hand-crafted-hist.png" style="width: 80%;"/>

The perfect score would put all green lines on the left side of the threshold (red dashed) line and all red short lines on the right hand side.

The plot shows $6$ anomalous images misclassified out of $15$.

Below are the metrics for the current model, trained previously:
- acc: $0.77$
- prec: $0.9$
- rec: $0.6$
- f1: $0.72$
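These figures can be reproduced from the confusion counts. The sketch below takes $TP=9$ and $FN=6$ from the $6$ missed anomalies out of $15$; $FP=1$ and $TN=14$ are not stated directly and are inferred here from the reported precision of $0.9$. The function name is made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# 15 anomalous images, 6 of them missed -> TP=9, FN=6 (from the text above).
# FP=1 and TN=14 are inferred from the reported precision, not stated directly.
acc, prec, rec, f1 = classification_metrics(tp=9, tn=14, fp=1, fn=6)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
```

The rounded values match the list above: $0.77$, $0.9$, $0.6$ and $0.72$.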

Most misclassified images can be explained by not enough variety against what is seen as a normal frame. Again, the best improvement would be to apply some computer vision to detect when it's dark and increase error sensitivity during nightly hours. Below is an example of an anomalous image, which has been classified as normal:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.18. Auto encoder - misclassification</p>
<img src="../Resources/img/ae-misclassification.png" style="width: 85%;"/>

### 6.2.5. Summary

For the reference, below is a table with the statistics collected for all types of models trained as part of this exercise (with the model names self explanatory), sorted by highest Recall value:

<p style="text-align: center; margin-bottom: 5px;">Fig. 6.19. Auto encoder - model selection - metrics</p>
<img src="../Resources/img/ae-all-metrics.png" style="width: 75%;"/>

## 6.3. Conclusion

It turned out that the collected data does display some anomalous signals, which can be exploited by statistics, machine learning and computer vision.

Both methods from sections 6.1. and 6.2. are only some examples of what can be done with object detection data from anomaly detection perspective and readers are certainly encouraged to think about their use cases.

In section 6.1. it was very easy to fall into a pitfall of *IQR* method, but since it is not suitable for skewed datasets - a much more flexible approach has been developed using probabilistic programming and Poisson distribution characteristics.

Then section 6.2. showed promising capabilities for real time image scoring using auto-encoders.

There are many other techniques, which I am planning to explore in the future, like using forecast generated in previous chapter in 6.1. and using variational auto encoder in 6.2.

Next [chapter](./07.Conclusions.ipynb) contains the final conclusion surrounding this research as a whole (including data collection, forecasting and anomaly detection) and more future opportunities.

[index](../Index.ipynb) | [prev](./05.Forecasting.ipynb) | [next](./07.Conclusions.ipynb)