# The Lab - Antwerp Maritime Academy

## Introduction
Our team was tasked with analyzing sensor measurements from the engine room and other parts of a container ship. The sensors measured the presence and/or quantity of chemical substances such as CO and NO2. The task was loosely defined, with one semi-clear objective: “gain insights from the measurements”. A few techniques immediately came to mind.

## Goal
The ultimate goal for this part is to predict certain chemical substances from a single reading of another substance, thus reducing the number of sensors needed. Given a prediction of these substances, the engine room can be assessed for signs of failure or a need for maintenance.

## Techniques
Not all techniques discussed here made it into the final product. Those that were left out were still researched, and for each of them there is a valid reason why we did not go through with the implementation. All of them contributed to the final product and the reasoning behind it.


## Researched but not implemented
### Big Data
The first problem we faced was data. We needed a lot of it to gain useful insights into the air quality of the engine room and other parts of the ship, which led us to big data. The idea was to set up a pipeline and process our data with technology like Hadoop and Google Cloud Dataflow. This would let us use clean data to generate clear, unpolluted graphs without worrying about data augmentation or biases. There were several reasons we did not go through with this. First, the connection to Google Cloud was a problem: we were not sure whether it was possible to connect to the cloud from a moving ship. Even if it was, the cost of being constantly connected was most likely not worth the extra benefit of feeding near-perfect data to our algorithms. The second concern was Hadoop. Hadoop is a great tool, just not for our use case. We were not going to feed the algorithms gigabytes of data per minute, so there was no valid reason to use something as heavy as Hadoop. That is why we did not implement these technologies: not because they would have hurt performance, but because they were simply not necessary and would have been a waste of time.

### Data pipeline
One aspect of the big data concept was the data pipeline, which builds upon the idea of a continuous data stream. Although the whole POC was not implemented, the use case for a data pipeline was clear. Data could flow straight into simple visualizations of the different chemical substances and be shown as a barometer, keeping the personnel onboard informed about potentially harmful situations caused by alarmingly high concentrations of certain substances over a certain amount of time. Later on, a teammate took over this idea and implemented it in their part of the project, where it is used to visualize real-time data.
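
The "high concentration for a certain amount of time" rule can be sketched as a small check on a stream of readings. This is only an illustration: the function name `sustained_alarm`, the threshold, and the window length are hypothetical and not taken from the project.

```python
from typing import Iterable


def sustained_alarm(readings: Iterable[float], threshold: float, min_samples: int) -> bool:
    """Return True if `min_samples` consecutive readings exceed `threshold`."""
    run = 0  # length of the current streak above the threshold
    for value in readings:
        run = run + 1 if value > threshold else 0
        if run >= min_samples:
            return True
    return False


# Hypothetical CO readings in ppm; threshold and window are made-up values.
co_ppm = [12.0, 31.0, 33.5, 36.0, 18.0]
print(sustained_alarm(co_ppm, threshold=30.0, min_samples=3))  # True
print(sustained_alarm(co_ppm, threshold=30.0, min_samples=4))  # False
```

Requiring several consecutive readings above the threshold avoids raising an alarm on a single noisy spike.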


## Used articles
- [How to Train Deep Neural Networks Over Data Streams](https://towardsdatascience.com/how-to-train-deep-neural-networks-over-data-streams-fdab15704e66)
- [Deep Neural Networks for Regression Problems](https://towardsdatascience.com/deep-neural-networks-for-regression-problems-81321897ca33)
- [How to Develop Multi-Output Regression Models with Python](https://machinelearningmastery.com/multi-output-regression-models-with-python/)
- [Deep Learning Models for Multi-Output Regression](https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/)
- [Multi-output Regression Example with Keras Sequential Model](https://www.datatechnotes.com/2019/12/multi-output-regression-example-with.html)
- [A Gentle Introduction to k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [Various Optimization Algorithms For Training Neural Network](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6)

## Implementation

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import RepeatedKFold
from statsmodels.tsa.stattools import grangercausalitytests
from scipy.stats import pearsonr
from numpy import mean
from numpy import std
import pandas as pd

## Data selection

The data is selected based on the needs of the network. We create one array with the input of the network and one with its output. Later we split the data with k-fold cross-validation to check that the model does not over-fit.

In [2]:
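# Load the engine-room measurements and drop rows with missing values.
df_with_dates = pd.read_csv("./csv/engine_room.csv").dropna().to_numpy()
# Keep the 13 sensor channels (T, RH, CO2, NO2, O3, NO, SO2, CO, H2S, TVOC, PM1, PM2.5, PM10).
df = df_with_dates[:, 2:15].astype('float32')

# CO (column 7) is the input; CO2, O3 and TVOC (columns 2, 4, 9) are the outputs.
df_co_in = df[:, 7].astype('float32')
df_co_out = df[:, (2, 4, 9)].astype('float32')

# PM1 (column 10) is the input; PM2.5 and PM10 (columns 11, 12) are the outputs.
df_pm_in = df[:, 10].astype('float32')
df_pm_out = df[:, (11, 12)].astype('float32')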

## Correlation test

To predict one thing from another, you need to know whether the two are related. This is where correlation comes into play, more precisely Pearson’s correlation. Its coefficient describes the strength of the linear relationship between two variables. If it is close to 1 or -1, the relationship is strong. We tested all correlations between the substances in our dataset and picked the strong ones (> 0.7 or < -0.7). This gave us confidence that our predictions would be close to the values we would get if we measured them. (Keep in mind that every pair is printed twice, since the loop visits each relation in both directions.)

In [None]:
elements_as_array = np.array(
    ['T', 'RH', 'CO2', 'NO2', 'O3', 'NO', 'SO2', 'CO', 'H2S', 'TVOC', 'PM1', 'PM2.5', 'PM10']
)

for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        if i != j:
            cor, _ = pearsonr(df[:, i], df[:, j])
            if cor < -0.7 or cor > 0.7:
                print(f'corr between element {elements_as_array[i]} and element {elements_as_array[j]}: {cor}')

There seems to be a correlation between some chemical substances. The goal now is to create two models: one that predicts **TVOC**, **O3** and **CO2** based on **CO** readings, and one that predicts **PM2.5** and **PM10** based on **PM1**. This is useful because we would then only need to install a **CO** and a **PM1** sensor to get relevant information about these components.
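
As an alternative to the nested loop, each pair can be reported only once by walking the upper triangle of the correlation matrix. The sketch below uses synthetic stand-in data and a hypothetical helper name, `strong_pairs`; it is not the code used in the project.

```python
import numpy as np


def strong_pairs(data, names, threshold=0.7):
    """Return (name_a, name_b, r) for each unordered column pair with |r| > threshold."""
    corr = np.corrcoef(data, rowvar=False)         # columns are variables
    rows, cols = np.triu_indices(len(names), k=1)  # upper triangle: each pair once
    return [
        (names[i], names[j], float(corr[i, j]))
        for i, j in zip(rows, cols)
        if abs(corr[i, j]) > threshold
    ]


# Synthetic stand-in data: 'B' follows 'A' closely, 'C' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])
for name_a, name_b, r in strong_pairs(demo, ['A', 'B', 'C']):
    print(f'corr between {name_a} and {name_b}: {r:.3f}')
```

`np.corrcoef` computes the same Pearson coefficient as `pearsonr`, but for all column pairs at once.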

## Causality test

A causality test confirms or rejects the hypothesis that a change in substance y is caused by substance x. The Granger test used here checks whether past values of x help to predict y. We can reject the null hypothesis (h0) when the p-value is lower than 0.05. This would suggest that substance x does indeed influence the value of substance y.

- h1: x-values do explain the variation in y
- h0: x-values do not explain the variation in y

In [None]:
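# Column indices: CO2=2, O3=4, CO=7, TVOC=9, PM1=10, PM2.5=11, PM10=12.
# grangercausalitytests checks whether the SECOND column Granger-causes the FIRST,
# so the predicted substance comes first and the candidate cause (CO or PM1) second.
co_and_tvoc = df[:, [9, 7]]
co_and_o3 = df[:, [4, 7]]
co_and_co2 = df[:, [2, 7]]
pm1_and_pm2 = df[:, [11, 10]]
pm1_and_pm10 = df[:, [12, 10]]

grangercausalitytests(co_and_tvoc, maxlag=3)
grangercausalitytests(co_and_o3, maxlag=3)
grangercausalitytests(co_and_co2, maxlag=3)
grangercausalitytests(pm1_and_pm2, maxlag=3)
grangercausalitytests(pm1_and_pm10, maxlag=3)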

Looking at the p-values, we can see that **CO** readings do help explain the changes in the other substances. This leads us to reject h0 in favour of h1. The same goes for **PM1** and its counterparts.

## Deep Learning
Deep learning is often used as a buzzword, but it is also a means to predict and classify. This inspired us to use a neural network to predict certain substances from the reading of a single substance. I implemented this because being able to predict the values of some substances means you do not have to install an extra sensor for them. This has several benefits. Firstly, it saves time installing and maintaining sensors, time that can be spent on other activities that may require more attention. Secondly, ships could carry fewer sensors onboard, which saves money. Furthermore, fewer sensors mean fewer parts that can break. This is why soft sensors are the way to go in my eyes.


## Setting up the neural network

This will be a multilayer perceptron model, used for our multi-output regression. We define a model with a single input, since each model receives the reading of one substance (`input_dim=1`). Then we add some hidden layers whose weights are learned during training; the number of hidden layers is the result of a lot of trial and error. The output consists of 3 neurons for model_co and 2 neurons for model_pm, one for each substance we want to predict.

## Multi-output regression
Linear regression is a relatively simple and well-known statistical model that assumes there is a linear relationship between a single input variable (x) and a single output variable (y). This is where things get tricky: in our dataset, predictions are made for multiple output variables based on one input variable. This is where multi-output regression comes in. The concept is very similar; one of the major differences is that we use a neural network, which allows us to predict multiple outputs from a single input. Of course, like any deep learning model, the network needs to be trained before any prediction can be made. After defining the data sets, the challenge was to find the right activation, loss, and optimizer. The best activation function turned out to be the relatively standard ReLU function.
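
To make the "one input, several outputs" idea concrete, here is a minimal linear baseline that fits one slope and one intercept per output with least squares. It is only a sketch on synthetic data, with hypothetical function names, and not the network used in this project; a baseline like this can be useful to judge whether the neural network actually adds something.

```python
import numpy as np


def fit_linear_multi_output(x, Y):
    """Least-squares fit of Y ~ slope * x + intercept, one (slope, intercept) per output column."""
    design = np.column_stack([x, np.ones_like(x)])     # shape (n_samples, 2)
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)  # shape (2, n_outputs)
    return coef


def predict_linear_multi_output(coef, x):
    """Predict all outputs for one or more input readings."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.column_stack([x, np.ones_like(x)]) @ coef  # shape (n_samples, n_outputs)


# Synthetic example: three outputs that depend linearly on one input.
x = np.linspace(0.0, 10.0, 50)
Y = np.column_stack([2.0 * x + 1.0, -0.5 * x + 4.0, 3.0 * x])
coef = fit_linear_multi_output(x, Y)
print(predict_linear_multi_output(coef, [4.0]))  # approximately 9, 2 and 12
```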

In [3]:
model_co = Sequential([
    Dense(20, input_dim=1, kernel_initializer='he_uniform', activation='relu'),
    Dense(10, activation='relu'),
    Dense(50, activation='relu'),
    Dense(3)
])

model_pm = Sequential([
    Dense(20, input_dim=1, kernel_initializer='he_uniform', activation='relu'),
    Dense(10, activation='relu'),
    Dense(5, activation='relu'),
    Dense(2)
])

model_co.compile(loss='mae', optimizer='adam')
model_pm.compile(loss='mae', optimizer='adam')

## Evaluating the model

Here we evaluate the models and report the mean absolute error (MAE) to see how well they did on the data we provided. The MAE is the average absolute difference between the network's predictions and the actual readings. For the evaluation, k-fold cross-validation is used, an algorithm that splits the data into k parts. If we split the data into 6 parts, it is called 6-fold. K-fold helps to detect over-fitting, a phenomenon where a network fits its training data well but fails to generalize. In each round, one part is held out for testing while the network trains on the remaining parts; the parts are rotated so that every part is used for testing once.
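
The rotation of the parts can be illustrated with plain NumPy. This is a simplified sketch of what a k-fold splitter does, with a hypothetical helper name `k_fold_indices`; the notebook itself relies on scikit-learn's `RepeatedKFold`.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    shuffled = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


for fold, (train_idx, test_idx) in enumerate(k_fold_indices(n_samples=12, k=6)):
    print(f'fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}')
```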

In [None]:
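results_co = list()
n_input, n_output = 1, 3
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

for train, test in cv.split(df):
    # Train on the training indices and evaluate on the held-out test indices.
    data_train, data_test = df_co_in[train], df_co_in[test]
    output_train, output_test = df_co_out[train], df_co_out[test]
    # Note: model_co keeps its weights between folds, so later folds are scored on
    # samples the network has already seen; rebuild it per fold for an unbiased estimate.
    model_co.fit(data_train, output_train, verbose=0, epochs=500)
    mae_co = model_co.evaluate(data_test, output_test, verbose=1)
    results_co.append(mae_co)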

In [None]:
results_pm = list()
n_input, n_output = 1, 2
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

for train, test in cv.split(df):
    # Train on the training indices and evaluate on the held-out test indices.
    data_train, data_test = df_pm_in[train], df_pm_in[test]
    output_train, output_test = df_pm_out[train], df_pm_out[test]
    # Note: model_pm keeps its weights between folds (see the CO loop above).
    model_pm.fit(data_train, output_train, verbose=0, epochs=500)
    mae_pm = model_pm.evaluate(data_test, output_test, verbose=1)
    results_pm.append(mae_pm)
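
Because `model_co` and `model_pm` are created once and then trained in every fold, their weights carry over from one fold to the next. A common way to avoid this is to build a fresh model inside the loop. The sketch below shows the pattern with a deliberately simple stand-in model (`MeanBaseline`, a hypothetical class) so that it runs quickly; in the notebook, `build_model` would be a function returning a newly compiled Keras network.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold


class MeanBaseline:
    """Stand-in model (not the network above): predicts the training mean of each output."""

    def fit(self, x, Y):
        self.mean_ = Y.mean(axis=0)
        return self

    def evaluate(self, x, Y):
        return float(np.mean(np.abs(Y - self.mean_)))  # MAE on the given data


def cross_validate(build_model, x, Y, n_splits=10, n_repeats=3, seed=1):
    """Train a freshly built model on each training split and score it on the held-out split."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(x):
        model = build_model()                  # new, untrained model per fold
        model.fit(x[train_idx], Y[train_idx])  # fit on the training indices only
        scores.append(model.evaluate(x[test_idx], Y[test_idx]))
    return scores


# Synthetic data, only to show that the loop runs end to end.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=100)
Y_demo = rng.normal(size=(100, 3))
scores = cross_validate(MeanBaseline, x_demo, Y_demo)
print(f'{len(scores)} folds, MAE mean {np.mean(scores):.3f}, std {np.std(scores):.3f}')
```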

## Quantification of the results

To get a better idea of the results, the mean and standard deviation of the MAE are calculated over all folds. The mean shows how far off the network was on average, and the standard deviation shows how much that error varies between folds. (Lower is better.)
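
As a small worked example with made-up numbers, the MAE is the mean of the absolute differences between actual and predicted values, and the per-fold MAE values are then summarised by their mean and standard deviation.

```python
import numpy as np

# Made-up readings and predictions for two samples and two outputs.
actual = np.array([[1.0, 2.0], [3.0, 4.0]])
predicted = np.array([[1.5, 2.0], [2.0, 4.5]])

mae = np.mean(np.abs(actual - predicted))  # (0.5 + 0.0 + 1.0 + 0.5) / 4
print(mae)  # 0.5

# Hypothetical per-fold MAE values, summarised the same way as results_co.
fold_maes = np.array([0.4, 0.5, 0.6])
print(fold_maes.mean(), fold_maes.std())
```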

In [None]:
print('(co model) mean absolute error: %.3f standard deviation: %.3f' % (mean(results_co), std(results_co)))
print('(pm model) mean absolute error: %.3f standard deviation: %.3f' % (mean(results_pm), std(results_pm)))

## Prediction

Now that the model is trained, the next logical step is to predict values. The following will display a prediction of **TVOC**, **O3** and **CO2** based on only **CO** values and **PM2.5** and **PM10** based on **PM1**.

In [None]:
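# Keras expects a 2-D array of shape (n_samples, n_features): one sample, one feature.
new_co_reading = np.array([[452.19]], dtype='float32')
prediction_co = model_co.predict(new_co_reading)
# prediction order: CO2, O3, TVOC
print(f'Predicted: {prediction_co}')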

In [None]:
# Same input shape as for the CO model: one sample, one feature.
new_pm_reading = np.array([[1.63]], dtype='float32')
prediction_pm = model_pm.predict(new_pm_reading)
# prediction order: PM2.5, PM10
print(f'Predicted: {prediction_pm}')

## Results
The predictions made by the regression model can be used to assess the state of the engine. Previous research shows that there is a causal link behind rising levels of CO, TVOC, and CO2. Given that the engine is a diesel engine, rising CO readings most likely mean that the engine needs servicing. Bad combustion can lead to the burning of lubrication oil, which creates several substances, including CO, TVOC, and others.

## Possible further additions
If the project had a longer lifespan, the first thing I would do is add gradient boosting, a method that is commonly used to improve performance on regression and classification problems. Secondly, I would feed more data to the model. This is a common thing to do with any machine learning model, and more data would likely drop the loss even further. Another interesting step would be to add more sensors for a while, to see if any more substances could be predicted using the same model. I would also reverse the model into a multi-input regression model, to predict substances that could become a safety concern to the crew and could be detected by combining the readings of different substances.

## Source references
- Brownlee, J. (2020, August 27). Deep Learning Models for Multi-Output Regression. Machine Learning Mastery. Retrieved February 18, 2022, from https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/
- Brownlee, J. (2021, April 26). How to Develop Multi-Output Regression Models with Python. Machine Learning Mastery. Retrieved February 18, 2022, from https://machinelearningmastery.com/multi-output-regression-models-with-python/
- Kozyrkov, C. (2021, December 15). What is correlation? - Towards Data Science. Medium. Retrieved February 13, 2022, from https://towardsdatascience.com/what-is-correlation-975ea899aaed
- Kozyrkov, C. (2021a, December 7). Statistical inference in one sentence - HackerNoon.com. Medium. Retrieved February 13, 2022, from https://medium.com/hackernoon/statistical-inference-in-one-sentence-33a4683a6424
- Brownlee, J. (2020a, August 20). How to Calculate Correlation Between Variables in Python. Machine Learning Mastery. Retrieved February 18, 2022, from https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/
- Brownlee, J. (2020a, August 15). Linear Regression for Machine Learning. Machine Learning Mastery. Retrieved February 19, 2022, from https://machinelearningmastery.com/linear-regression-for-machine-learning/
- Diesel and Gasoline Engine Exhausts. (n.d.). NCBI. Retrieved March 11, 2022, from https://www.ncbi.nlm.nih.gov/books/NBK531294/#:%7E:text=Incomplete%20combustion%20results%20in%20the,the%20fuel%20and%20lubricating%20oil.
