# 4. Data Collection and Pre-Processing

[index](../Index.ipynb) | [prev](./03.SystemDesign.ipynb) | [next](./05.Forecasting.ipynb)

The aim of the data collection pipeline is to use an outside camera to **stream video** to a central machine inside the house and **run motion, object and outlier detections** on it.

This chapter is an in-depth analysis and an extension of the high-level diagram (Fig. 3.1.) from [Chapter 3](./03.SystemDesign.ipynb#fig3.5). The topic is quite complex, so it is broken into a few sections:
- physical layer:
    - choosing hardware
    - connectivity
    - picking location for a camera
    - redundancy
- logical (software) layer:
    - streaming video (called the client) and consuming video (called the server)
    - frame lifecycle (pre-processing image data, detecting motion and objects, forwarding video to other devices)
- results:
    - data representation
    - data volumes

## 4.1. Physical layer (hardware)

### 4.1.1. Choosing hardware

The first step in this layer is choosing the right hardware for the given task. The most challenging part was choosing the camera. In general, there appear to be two fundamentally different approaches:
- standalone (professional) camera
- mini computer with a camera module

My decision was motivated by the following factors:
- low costs
- online documentation and help
- flexibility in configurations
- ability to perform additional tasks on-device

Based on these factors, I decided that option 2 (a mini computer with a camera module) is the better choice, and ended up with a Raspberry Pi 3 mini-computer with an additional Pi-Camera module. The Raspberry Pi 3 has a 4-core ARM processor and 1 GB of RAM, so it is quite powerful considering its small form factor (3.54 x 2.36 x 0.79 inches).

The trade-off is that most standalone cameras offer better security and durability, and possibly night vision. This comes at a much higher cost, with lower flexibility and little online documentation.

For the server machine, I decided to use a local desktop PC (instead of a cloud option like AWS, Google Cloud or Azure), as I was not happy with my broadband speed at the time, and this way the data stays within the local network, which is a good security measure. Below is the configuration:
- Intel i5 6-core CPU
- 32 GB RAM
- Nvidia Geforce 11 GB GPU
- Storage:
    - 256 GB Nvme drive
    - 1 TB SSD drive

This kind of system can run smooth, real-time image pre-processing and object detection at 30+ FPS, as well as hourly, nightly and weekly scheduled tasks.

### 4.1.2. Connectivity

Initially the system was wireless, but this turned out to be very challenging. Because of a weak Wi-Fi signal and a low-quality router, packets were dropped, and logging in to the Raspberry Pi remotely was very slow, with occasional freezes lasting minutes.

As a result, a wired solution was chosen and the house was wired to allow near-noiseless communication between the camera and the server.

Below is a very high level network topology for the system:

<p style="text-align: center; margin-bottom: 0;">Fig. 4.1. Network Topology</p>
<img src="../Resources/img/network-topology.png" style="width: 70%;"/>

**Diagram Description**

Wiring a house can be a time-consuming and costly process if it was not wired from the start. The key is to do it without any significant (or visible) damage to the house, while making it easy to change and extend in the future if required.

In the diagram above, continuous lines represent wires, and dashed lines divide either floors or the areas inside and outside the house:

- Starting from box 1 (the router), the signal goes to the attic to a 1 Gb switch (box 2), which has 4 Power over Ethernet (PoE) ports
- From the switch, a network cable goes to the Raspberry Pi (box 3), which is located outside the house. The Raspberry Pi has an additional PoE module next to it, so it only requires a single cable for both power and network. The Pi does not support 1 Gb connections through its native RJ45 port, but 100 Mb is more than enough to transmit the video signal at a high rate
- The second connection from the switch goes through the wall from the attic into the office on the first floor, where the server PC (box 4) is located

It is also important to assign static addresses to the devices, and even give them hostnames, so they can be easily accessed from within the network.

### 4.1.3. Camera location

Optimal camera placement is not a trivial task and it should be driven by the purpose:
- is it a security camera?
- is it a camera used for an experiment?

While building a security system is not the main objective of my research, I decided to place the camera where a primary security camera is usually placed: in front of the house. This decision was made after many experiments with other locations (such as inside the house, behind a window).

The chosen place is certainly too low for a proper security camera, but it shelters the device against rain, wind and direct sunlight, which is often a problem.

The Raspberry Pi with the camera module is attached with strong double-sided tape to the porch roof at the front door:

<p style="text-align: center; margin-bottom: 0;">Fig. 4.1. Raspberry Pi Camera Placement</p>
<img src="../Resources/img/rpi-cam.png" style="float:left; width: 63%;"/>
<img src="../Resources/img/rpi-cam2.png" style="float:left; width: 36%;"/>
<div style="clear: both;"></div>

After many attempts at finding a good and flexible case for the Pi, I finally found one which allows a full 360° camera rotation along both axes (this can be seen in the image above - Fig. 4.1.).

Surprisingly, even during harsh storms, wind and rain, the camera mount did not move once.

Another challenge, which can occur at random, is occlusion caused by a natural event. This could be as simple as a leaf, but it can prevent the camera from registering anything for a long period of time. During the six months of data collection, only one such incident occurred, when a spider decided to adopt the Raspberry Pi case as its home.

Ideally the system should be able to detect a loss in image quality and alert the user that maintenance is needed.

### 4.1.4. Redundancy

I experienced three power outages, which meant a loss of data for three days. Having an alternative power source is critical if the system must always be online. This redundancy comes at an additional cost, and its absence is currently a limitation of this system.

It is also important that when a device comes back online, a software-level mechanism resumes streaming (or collecting) data. This is discussed later in this chapter; it is handled on both the client and the server by a program called Supervisor.

In case of a hardware failure on the server, a backup script runs every night to sync the images to another machine. In case of a hardware failure on the client, a spare Raspberry Pi is available (which can also be used as a test/development box).

## 4.2. Logical layer (software)

Below are the key software ingredients used in this project. Each of them already exists in the working system, and each can be further refined in terms of scalability, reliability, security and performance.

Not included below are two future components, which would run as part of this ecosystem:
- forecast update
- anomaly detection

### 4.2.1. Video streams

Depending on the requirements, there are many ways to broadcast video frames to other devices.

People often choose easy-to-set-up streaming protocols like *RTSP* if they just want to display a video (with audio and additional actions to play or pause the stream) in a video player like VLC. However, such a stream can be troublesome to capture if further image processing with OpenCV and Python is required. Also, RTSP streams the video without considering the receiving end.

If the audio component is not required, and full flexibility and customisation are important, then a message queue is a better option. [ImageZMQ](https://github.com/jeffbass/imagezmq) is a Python library for transporting OpenCV images over *ZMQ* - a peer-to-peer messaging system optimised for high performance, with an option to receive an acknowledgment signal from the receiver.

Here is a full cycle for a frame coming from the Pi-camera when ImageZMQ is used:

<p style="text-align: center; margin-bottom: 0;">Fig. 4.2. ZMQ Message Queue</p>
<img src="../Resources/img/zmq.png" style="width: 60%;"/>

The ACK signals are very useful, as there is no need to keep sending frames when no one is receiving them.

The drawback of this approach is that when the receiver stops receiving, the sending script on the streaming device must be restarted (or the connection re-initiated).

I have published two GitHub repositories with the code required to run a [client](https://github.com/Alchemication/iot-vstream-client), and a [server](https://github.com/Alchemication/iot-vstream-server).

The `client.py` and `server.py` scripts are registered in a Linux process control system called [Supervisor](http://supervisord.org/), which makes sure that they are always running (it restarts them after a system reboot or when the scripts terminate for any reason).

Here is the pseudo-code required to stream frames (client):
```python
import imagezmq
from imutils.video import VideoStream

sender = imagezmq.ImageSender(connect_to=ZMQ_SERVER_URL)  # connect to ZMQ server
vs = VideoStream(usePiCamera=True, resolution=RES_DIMS).start()  # initialize stream
while True:
    frame = vs.read()  # read frame from camera (numpy array)
    # ... optional processing (like compression, flipping image etc.)
    # send_image pairs with recv_image on the server (send_jpg expects a JPEG buffer)
    sender.send_image(MESSAGE_STRING, frame)  # waits for the server's reply (ACK)
```

And below is the pseudo-code for capturing the stream (server):
```python
import imagezmq

image_hub = imagezmq.ImageHub(open_port='tcp://*:5555')  # initialize ImageHub server
while True:
    (MESSAGE_STRING, frame) = image_hub.recv_image()  # receive frame (as numpy array)
    image_hub.send_reply(b'OK')  # send ACK signal to receive next frame
```

### 4.2.2. Frame lifecycle (including object detection)

This is the most complex part of the data collection, and the end-to-end process can take a long time to get right. Below is the data flow for each frame:

<p style="text-align: center; margin-bottom: 0;">Fig. 4.3. Frame lifecycle</p>
<img src="../Resources/img/obj-detection.png" style="width: 80%;"/>

**Diagram description**

Picking up from the previous pseudo-code for capturing the stream (server):

- Once the frame is collected from the ImageHub, it will be in full HD resolution: $1920×1080$ px (box 1)
- The frame needs to be resized (box 2), otherwise the next steps will be very slow. The choice of $608×608$ makes sense, as this is the resolution needed by Yolo (box 7)
- Before the background subtraction algorithm can be applied to detect motion, it needs to be instantiated (box 4, after the check in box 3) with optional hyper-parameters:

    ```python
    import cv2

    BG_SUB_HISTORY = 20      # length of the history (in frames)
    BG_SUB_THRESH = 30       # threshold on the squared Mahalanobis distance
    BG_SUB_SHADOWS = True    # detect and mark shadows
    subtractor = cv2.createBackgroundSubtractorMOG2(history=BG_SUB_HISTORY, varThreshold=BG_SUB_THRESH,
                                                    detectShadows=BG_SUB_SHADOWS)
    ```


- As discussed in detail in the [Literature Review chapter](./02.LiteratureReview.ipynb), the aim of this step is to use a static background to extract the moving foreground. Below are the parameter descriptions taken from the [OpenCV website](https://docs.opencv.org/master/de/de1/group__video__motion.html#ga2beb2dee7a073809ccec60f145b6b29c):
    - `history` - Length of the history
    - `varThreshold` - Threshold on the squared Mahalanobis distance between the pixel and the model to decide whether a pixel is well described by the background model. This parameter does not affect the background update
    - `detectShadows` - If true, the algorithm will detect shadows and mark them. It decreases the speed a bit, so if you do not need this feature, set the parameter to false
    - *Mahalanobis distance* is a multivariate distance metric that measures the distance between a point and a distribution (unlike Euclidean distance, which measures the distance between two points), and it is given by ([machinelearningplus 2019](https://www.machinelearningplus.com/statistics/mahalanobis-distance/)):
    
    $$
    D^2=(x-m)^T \cdot C^{-1} \cdot (x-m)
    $$
    
    where $x$ is the observation vector (a data point), $m$ is the vector of means of each feature, and $C^{-1}$ is the inverse of the covariance matrix of the features
    - The parameters above are mostly heuristics, and they depend on the location of the camera, the size of the objects and the type of movement to detect. It usually helps to record multiple videos from a particular location for calibration purposes.
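To make the role of `varThreshold` more concrete, here is a minimal sketch of the formula above. It assumes a diagonal covariance matrix (independent colour channels), which is a simplification made only for this example; the background means, variances and pixel values are hypothetical.

```python
def mahalanobis_sq(x, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))


VAR_THRESHOLD = 30  # same value as BG_SUB_THRESH

# Hypothetical background model for a single pixel (per-channel mean and variance)
BG_MEAN = (100.0, 100.0, 100.0)
BG_VAR = (25.0, 25.0, 25.0)

# Two hypothetical observations of that pixel in a new frame
pixels = [(110, 105, 100), (130, 100, 100)]

results = {pixel: mahalanobis_sq(pixel, BG_MEAN, BG_VAR) for pixel in pixels}
for pixel, d2 in results.items():
    # A pixel is well described by the background model when D^2 is below the threshold
    label = "background" if d2 < VAR_THRESHOLD else "foreground"
    print(f"{pixel}: D^2 = {d2:.1f} -> {label}")
```

The first pixel stays close to the background model ($D^2 = 5$), while the second one exceeds the threshold ($D^2 = 36$) and would be treated as foreground.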

- The background subtraction algorithm is then applied (box 5) using the following line of code:

```python
mask = subtractor.apply(frame_resized)
```

- The second part of the activity in box 5 is to scan through the `mask` generated above and verify if the detected contours meet the minimum criteria to become an object candidate:

```python
contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    if cv2.contourArea(cnt) >= MIN_OBJ_AREA:
        # object candidates found !
        break
```


- According to my study, contours with an area above $35$ (as measured by `cv2.contourArea`) are good candidates for further detection. Anything smaller is not a valid object and should be dropped. This parameter is another heuristic, which should be calibrated for each environment
- Once candidates are found, the frame is forwarded to the object detector (Yolo V2) to generate predictions for all objects (like Person, Car, Dog, Bike etc.) from the [COCO dataset](https://cocodataset.org/#home). This happens in box 7, while Yolo itself is instantiated in advance using the following code:

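```python
from darkflow.net.build import TFNet  # assumed: TFNet as provided by darkflow

options = {
    'model': 'cfg/yolov2.cfg',         # network configuration
    'load': 'weights/yolov2.weights',  # pre-trained weights
    'threshold': 0.40,                 # reject predictions below this confidence
    'gpu': 0.5                         # share of the GPU this process may use
}
tfnet = TFNet(options)
```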

- Yolo requires the weights, labels, a confidence threshold for predictions (anything below this value is rejected) and, optionally, the share of the GPU that this process may use ($0.5$ runs smoothly on an 11 GB GPU with $608×608$ images). Predictions are generated with a simple `tfnet.return_predict(frame)` call, which returns a list of objects with their confidence and x,y coordinates
- Also within box 7, the script needs to filter out all objects which are not tracked (like lamp, monitor etc.); only then can the final predictions be verified (box 8)
- If objects of interest are detected, the image is saved to the hard drive (box 9) in a folder corresponding to the date, with a filename made up of the time and the names of all objects detected in the frame
- Then, independently of the object detection process, the original image along with the predictions is pushed through the socket server (box 10) to the outside world, so a web application can connect to this socket at any time and receive a real-time stream with object detections. This is done with the following code snippet:

```python
import base64
import cv2

retval, buffer = cv2.imencode('.jpg', frame)  # compress the frame to JPEG
jpg_as_text = base64.b64encode(buffer)        # base64-encode for transport
socket.emit('EVENT_MESSAGE_TITLE', {
    'device': rpi_name,
    'img': jpg_as_text,
    'boxes': predictions,
    'res': stream_resolution
})
```

- The values sent over the socket make it possible to draw the boxes around the objects for any screen size (even on multiple devices)
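Two of the steps above, filtering out untracked objects (box 7) and redrawing boxes for a different screen size, can be sketched as follows. This is only an illustration: the prediction dictionaries are assumed to carry `label`, `confidence`, `topleft` and `bottomright` keys, and the tracked labels, helper names and display resolution are hypothetical.

```python
TRACKED_LABELS = {"person", "car", "dog", "bicycle"}  # hypothetical objects of interest


def filter_tracked(predictions, tracked=TRACKED_LABELS):
    """Keep only the predictions whose label is in the tracked set."""
    return [p for p in predictions if p["label"] in tracked]


def scale_box(pred, src_res, dst_res):
    """Rescale a bounding box from the stream resolution to a display resolution."""
    sx = dst_res[0] / src_res[0]  # horizontal scale factor
    sy = dst_res[1] / src_res[1]  # vertical scale factor
    return {
        "label": pred["label"],
        "topleft": {"x": round(pred["topleft"]["x"] * sx),
                    "y": round(pred["topleft"]["y"] * sy)},
        "bottomright": {"x": round(pred["bottomright"]["x"] * sx),
                        "y": round(pred["bottomright"]["y"] * sy)},
    }


# Hypothetical detector output for one 608x608 frame
predictions = [
    {"label": "person", "confidence": 0.81,
     "topleft": {"x": 152, "y": 304}, "bottomright": {"x": 304, "y": 608}},
    {"label": "tvmonitor", "confidence": 0.55,
     "topleft": {"x": 10, "y": 10}, "bottomright": {"x": 50, "y": 40}},
]

tracked = filter_tracked(predictions)
# A client with a 1216x912 canvas rescales the boxes using the 'res' value it received
boxes = [scale_box(p, src_res=(608, 608), dst_res=(1216, 912)) for p in tracked]
print(boxes)
```

Sending the stream resolution alongside the boxes keeps the server unaware of client screen sizes; each client applies its own scale factors.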

## 4.3. Results

The result of the data collection is a set of images with valid detections.

Images are organised into a folder hierarchy. The top level contains the streaming device name, which then contains directories corresponding to dates and image files inside:

<p style="text-align: center; margin-bottom: 0;">Fig. 4.4. Collected Images</p>
<img src="../Resources/img/saves-files-tree.png" style="width: 35%;"/>
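As an illustration of this hierarchy, here is a minimal sketch of how a save path could be built from the device name, the capture time and the detected objects. The root folder, device name and exact file-name pattern are hypothetical and may differ from the ones used in the actual script.

```python
from datetime import datetime
from pathlib import PurePosixPath


def build_image_path(root, device, timestamp, labels):
    """Return <root>/<device>/<YYYY-MM-DD>/<HH.MM.SS>_<labels>.jpg."""
    date_dir = timestamp.strftime("%Y-%m-%d")  # one folder per day
    time_part = timestamp.strftime("%H.%M.%S")  # capture time in the file name
    label_part = "-".join(sorted(labels))       # names of all detected objects
    return PurePosixPath(root) / device / date_dir / f"{time_part}_{label_part}.jpg"


path = build_image_path("images", "rpi-front",
                        datetime(2019, 9, 9, 7, 15, 42), ["person", "car"])
print(path)  # images/rpi-front/2019-09-09/07.15.42_car-person.jpg
```

Encoding the time and the object names in the file name means that simple counts per hour or per object type can be derived from a directory listing alone, without opening the images.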

The dataset collected for this research contains over 600K images captured between the 9th of September 2019 and the 2nd of March 2020. The size of a single 1080p image compressed to JPEG is ~300 kB, so the total size of the dataset is ~180 GB.

The average number of raw images captured per day is ~2,000.
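The storage estimate follows directly from the image count and the average file size. A short sketch of the arithmetic, using decimal units (1 kB = 1,000 bytes) and the rounded figures quoted above:

```python
NUM_IMAGES = 600_000  # "over 600K images"
AVG_IMAGE_KB = 300    # ~300 kB per 1080p JPEG

total_bytes = NUM_IMAGES * AVG_IMAGE_KB * 1_000
total_gb = total_bytes / 1_000_000_000
print(f"{total_gb:.0f} GB")  # 180 GB
```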

## 4.4. Conclusions

This notebook presented an overview of the data collection pipeline. Many details have been omitted, like error handling, the socket server implementation and running an infinite loop in a separate thread to capture the stream within a Flask app context, but all of them can be found in the attached script [app.py](../Scripts/app.py), which was used to gather the dataset.

Despite the amount of processing for each frame, the pipeline runs at 30 frames per second, providing a smooth experience for end users, who can observe real-time object detection in a web browser.

There are a number of improvements already identified in this process, which are left for the future iterations:
- include privacy mode, where boxes with detected people are blurred
- switch to Yolo V4 for increased detection accuracy and speed
- test image segmentation techniques to improve the bounding boxes approach
- add hourly forecast of expected objects for a day
- detect anomalies:
    - given number of objects in a current hour
    - given raw image content

The [next chapter](./05.Forecasting.ipynb) focuses on generating a forecast for the number of objects expected to appear each hour.

[index](../Index.ipynb) | [prev](./03.SystemDesign.ipynb) | [next](./05.Forecasting.ipynb)