<a href="https://colab.research.google.com/github/carolsworld/ICS-test-bed-PoC/blob/anomalydetection/TRIST_AnomalyDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Demonstration of Using Datasets Generated from TRIST to Develop an Anomaly Detection Model

##Getting Started
Anomaly detection is a common technique for identifying abnormal or rare observations that differ significantly from the majority of the data in a dataset. It is applicable to detecting anomalous events in Internet of Things (IoT) systems and to many other real-life problems.

##Types of Anomaly Detection Models
An **unsupervised model** is used in this demonstration, meaning that the dataset is not labelled at any point while the anomaly detection model is trained.

The dataset generated from TRIST, which represents the normal operation of the water supply and storage system, is used. In other words, the model assumes that the majority of the instances are normal.

A typical workflow in PyCaret's unsupervised module consists of the following 6 steps, in this order:

**Setup** ➡️ **Create Model** ➡️ **Assign Labels** ➡️ **Analyze Model** ➡️ **Prediction** ➡️ **Save Model**

There are other types of anomaly detection model: supervised and semi-supervised.
* *Supervised:* A supervised model uses a dataset that specifies which records are normal and which are anomalous. Thus, it requires collecting sufficient abnormal data to train the detector.
* *Semi-supervised:* A semi-supervised model uses only normal data during training. It predicts whether a new data point is normal or anomalous based on the distribution of the data learned by the trained model.
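
To make the semi-supervised idea concrete, the following minimal sketch learns a mean and standard deviation from normal readings only, then flags a new reading whose z-score exceeds a threshold. The readings, the threshold of 3, and the helper names `fit_normal_profile` and `is_anomaly` are hypothetical and chosen for illustration; this is not the method used in the rest of this notebook.

```python
from statistics import mean, stdev

def fit_normal_profile(normal_values):
    """Learn the mean and standard deviation from normal data only."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, z_threshold=3.0):
    """Flag a value that lies more than z_threshold standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) / sigma > z_threshold

# Hypothetical tank-level readings recorded during normal operation
normal_levels = [0.50, 0.52, 0.49, 0.51, 0.48, 0.53, 0.50, 0.47]
profile = fit_normal_profile(normal_levels)

print(is_anomaly(0.51, profile))  # a reading close to the training data
print(is_anomaly(0.95, profile))  # a reading far outside the training range
```

No anomalous examples are needed to fit the profile, which is the defining property of the semi-supervised setting described above.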

## Anomaly Detection Algorithm
Data scientists have developed many anomaly detection algorithms over the past decades. For simplicity, we use PyCaret, an open-source low-code machine learning library, to perform end-to-end machine learning model management.

In [59]:
# Install the PyCaret library and check the version
!pip install -q pycaret
from pycaret.utils import version
print(version())

3.2.0


## Deployment of Machine Learning Model
Thanks to the developers of PyCaret, the library is deployment-ready and has been integrated with many business intelligence and analytics platforms such as Microsoft Power BI and Tableau, allowing both *citizen* and *experienced* data scientists to develop and run machine learning in production with ease. For details about PyCaret, please visit the [official website](https://pycaret.org/) and [documentation](https://pycaret.gitbook.io/docs/).


#Step 1: Data Preparation

In [60]:
# Upload a file from local machine to the Google Colab environment
from google.colab import files
import io
import pandas as pd

uploaded = files.upload()

# Take the first uploaded file (this assumes only one file was uploaded)
filename = next(iter(uploaded))
data = pd.read_csv(io.BytesIO(uploaded[filename]), parse_dates=['time'])
data['time'] = data['time'].dt.strftime('%Y-%m-%d %H:%M:%S')


Saving normal_1hr.csv to normal_1hr.csv
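
Outside Google Colab, `files.upload()` is not available. A minimal sketch of the same loading step with plain pandas follows. The in-memory CSV is a stand-in for `normal_1hr.csv`, built from the first rows printed later in this notebook, and the variable name `sample` is hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for normal_1hr.csv (rows copied from the head() output)
csv_text = """time,FIT101,MV101,LIT101,P101,FIT201,LIT301
2023-07-17 11:00:00,2.55,1,0.597593,0,0.0,0.951852
2023-07-17 11:00:01,2.55,1,0.644815,0,0.0,0.924074
2023-07-17 11:00:02,2.55,1,0.692037,0,0.0,0.896296
"""

# With a real file, pass its path instead of the StringIO object
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

print(sample.dtypes)
print(sample.shape)
```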


In [61]:
# Check the number of rows and columns in the dataframe
print("The number of rows and columns: ", data.shape)

The number of rows and columns:  (3595, 7)


In [62]:
# Print the first 5 rows of the dataframe
print("The first 5 rows of the dataset: \n", data.head())

The first 5 rows of the dataset: 
                   time  FIT101  MV101    LIT101  P101  FIT201    LIT301
0  2023-07-17 11:00:00    2.55      1  0.597593     0     0.0  0.951852
1  2023-07-17 11:00:01    2.55      1  0.644815     0     0.0  0.924074
2  2023-07-17 11:00:02    2.55      1  0.692037     0     0.0  0.896296
3  2023-07-17 11:00:03    2.55      1  0.739259     0     0.0  0.868519
4  2023-07-17 11:00:04    2.55      1  0.786481     0     0.0  0.840741


In [63]:
# Check the time span of the data
data['time'] = pd.to_datetime(data['time'])
time_span = data.time.max() - data.time.min()
print("Time span of the dataset:", time_span)

Time span of the dataset: 0 days 00:59:59
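
At one sample per second, a span of 00:59:59 would correspond to 3600 rows, whereas the dataframe holds 3595, which suggests that a few timestamps are missing. The sketch below shows one way to locate such gaps. The 1-second sampling interval is an assumption inferred from the first rows, and the small `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical 1 Hz series with the samples at 11:00:03 and 11:00:04 missing
demo = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-07-17 11:00:00", "2023-07-17 11:00:01", "2023-07-17 11:00:02",
        "2023-07-17 11:00:05", "2023-07-17 11:00:06",
    ]),
    "LIT101": [0.59, 0.64, 0.69, 0.82, 0.82],
})

expected_step = pd.Timedelta(seconds=1)  # assumed sampling interval

# Time elapsed since the previous row; the first row has no predecessor (NaT)
step = demo["time"].diff()

# Rows that follow a gap larger than the expected step
mask = step > expected_step
gaps = demo.loc[mask, ["time"]].assign(gap=step[mask])
print(gaps)

# Number of samples missing across all gaps
missing = int((step[mask] / expected_step - 1).sum())
print("Missing samples:", missing)
```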


In [64]:
# Plot a graph to see what the data looks like
import plotly.express as px
fig = px.line(data, x="time", y=['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301'], title='TRIST data', template = 'plotly_dark')
fig.show()

Plotting all the records for nearly 1 hour at once does not give a clear picture of the data pattern of the simulation. Thus, the time features are extracted for more accurate analysis.

In [65]:
# Algorithms cannot directly consume date or timestamp data, thus we will extract the features from the timestamp
# and will drop the actual timestamp column before training models.
# Set the timestamp column as the index of the dataframe
data.set_index('time', drop=True, inplace=True)

In [66]:
# Extract features from timestamp
data['hours'] = data.index.hour
data['minutes'] = data.index.minute
data['seconds'] = data.index.second

print("\nThe first 5 rows of the dataset: \n", data.head())

print("\nThe last 5 rows of the dataset: \n", data.tail())


The first 5 rows of the dataset: 
                      FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time                                                                          
2023-07-17 11:00:00    2.55      1  0.597593     0     0.0  0.951852     11   
2023-07-17 11:00:01    2.55      1  0.644815     0     0.0  0.924074     11   
2023-07-17 11:00:02    2.55      1  0.692037     0     0.0  0.896296     11   
2023-07-17 11:00:03    2.55      1  0.739259     0     0.0  0.868519     11   
2023-07-17 11:00:04    2.55      1  0.786481     0     0.0  0.840741     11   

                     minutes  seconds  
time                                   
2023-07-17 11:00:00        0        0  
2023-07-17 11:00:01        0        1  
2023-07-17 11:00:02        0        2  
2023-07-17 11:00:03        0        3  
2023-07-17 11:00:04        0        4  

The last 5 rows of the dataset: 
                      FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time            

Simply listing the data records does not provide much insight. The following plots are used to analyse the data pattern of the TRIST dataset.

In [67]:
# Plot a graph to see what the first 10 minutes of data look like
first_10_minutes = data.first('10T') # 'T' stands for minutes
fig = px.line(first_10_minutes, y=['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301'], title='TRIST data', template = 'plotly_dark')
fig.show()

In [68]:
# Plot a graph to see what the first 3 minutes of data look like
first_3_minutes = data.first('3T') # 'T' stands for minutes
fig = px.line(first_3_minutes, y=['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301'], title='TRIST data', template = 'plotly_dark')
fig.show()

In [69]:
# Plot a graph to see what the last 10 minutes of data look like
last_10_minutes = data.last('10T') # 'T' stands for minutes
fig = px.line(last_10_minutes, y=['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301'], title='TRIST data', template = 'plotly_dark')
fig.show()

In [70]:
# Plot a graph to see what the last 3 minutes of data look like
last_3_minutes = data.last('3T') # 'T' stands for minutes
fig = px.line(last_3_minutes, y=['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301'], title='TRIST data', template = 'plotly_dark')
fig.show()
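
`DataFrame.first` and `DataFrame.last` are deprecated since pandas 2.1, so the cells above may emit warnings or stop working on newer versions. The sketch below shows a boolean-mask equivalent on a hypothetical one-sample-per-second index; the frame `demo` and its `value` column are made up for illustration.

```python
import pandas as pd

# Hypothetical one-hour, one-sample-per-second index standing in for the TRIST data
index = pd.date_range("2023-07-17 11:00:00", periods=3600, freq=pd.Timedelta(seconds=1))
demo = pd.DataFrame({"value": range(3600)}, index=index)

window = pd.Timedelta(minutes=10)

# Rows within the first 10 minutes (start inclusive, end exclusive)
first_10_minutes = demo.loc[demo.index < demo.index[0] + window]

# Rows within the last 10 minutes (end inclusive, start exclusive)
last_10_minutes = demo.loc[demo.index > demo.index[-1] - window]

print(len(first_10_minutes), len(last_10_minutes))
```

The resulting frames can be passed to `px.line` in the same way as in the cells above.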

# Step 2: PyCaret Workflow

##**1. Setup**

In [72]:
# Setup PyCaret
# A session ID is set so that the results are reproducible.
from pycaret.anomaly import *
s = setup(data, session_id = 123)

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Original data shape | (3533, 9) |
| 2 | Transformed data shape | (3533, 9) |
| 3 | Numeric features | 9 |
| 4 | Preprocess | True |
| 5 | Imputation type | simple |
| 6 | Numeric imputation | mean |
| 7 | Categorical imputation | mode |
| 8 | CPU Jobs | -1 |
| 9 | Use GPU | False |


In [73]:
# Find out the models available in the anomaly detection module of PyCaret
models()

| ID | Name | Reference |
|---|---|---|
| abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
| cluster | Clustering-Based Local Outlier | pycaret.internal.patches.pyod.CBLOFForceToDouble |
| cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
| lof | Local Outlier Factor | pyod.models.lof.LOF |
| svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
| pca | Principal Component Analysis | pyod.models.pca.PCA |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |


##**2. Create Model**

In [74]:
# Train model with Isolation Forest as an example for demonstration
# fraction = 0.01 means that 1% of the records (e.g. 10 out of 1000) will be flagged as anomalies.
# Increasing the fraction increases the number of flagged records proportionally.
# Therefore, this value should be tuned to the specific use case and the level of risk acceptance.
iforest = create_model('iforest', fraction = 0.01)


In [75]:
# Find out more about the configuration of iforest
# 'contamination' is the same as the 'fraction' defined in the previous cell
print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.01,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)


##**3. Assign Labels**

In [76]:
# Two new columns are appended in the table:
# Column "Anomaly" with value 1 refers to anomaly, value 0 refers to normal
# Column "Anomaly_Score" provides the score calculated by the algorithm.
iforest_results = assign_model(iforest)
iforest_results.head()

| time | FIT101 | MV101 | LIT101 | P101 | FIT201 | LIT301 | hours | minutes | seconds | Anomaly | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-07-17 11:00:31 | 0.0 | 0 | 0.73037 | 1 | 2.45 | 0.806111 | 11 | 0 | 31 | 0 | -0.05757 |
| 2023-07-17 11:00:32 | 0.0 | 0 | 0.685 | 1 | 2.45 | 0.823704 | 11 | 0 | 32 | 0 | -0.072751 |
| 2023-07-17 11:00:33 | 0.0 | 0 | 0.63963 | 1 | 2.45 | 0.841296 | 11 | 0 | 33 | 0 | -0.079607 |
| 2023-07-17 11:00:34 | 0.0 | 0 | 0.594259 | 1 | 2.45 | 0.858889 | 11 | 0 | 34 | 0 | -0.073222 |
| 2023-07-17 11:00:35 | 0.0 | 0 | 0.548889 | 1 | 2.45 | 0.876481 | 11 | 0 | 35 | 0 | -0.077517 |


In [77]:
# Print the anomalies
print("Anomalies: \n", iforest_results[iforest_results['Anomaly'] == 1])

Anomalies: 
                      FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time                                                                          
2023-07-17 11:00:50    0.00      0  0.808889     0    0.00  0.833889     11   
2023-07-17 11:00:52    0.00      0  0.808889     0    0.00  0.778333     11   
2023-07-17 11:00:53    0.00      0  0.781667     1    2.45  0.768704     11   
2023-07-17 11:01:15    0.00      0  0.806481     1    0.00  0.791296     11   
2023-07-17 11:01:59    0.00      0  0.827778     0    0.00  0.812778     11   
2023-07-17 11:02:00    0.00      0  0.827778     0    0.00  0.785000     11   
2023-07-17 11:02:45    0.00      0  0.820741     0    0.00  0.784259     11   
2023-07-17 11:03:00    2.55      1  0.516111     0    0.00  1.008333     11   
2023-07-17 11:03:08    0.00      0  0.818333     0    0.00  0.791667     11   
2023-07-17 11:03:52    0.00      0  0.802222     0    0.00  0.827778     11   
2023-07-17 11:04:16    0.00      0  0.8

In [90]:
# Sort records by anomaly score
sorted_results = iforest_results.sort_values(by='Anomaly_Score', ascending=False)

# Display the sorted results
print(sorted_results)

                     FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time                                                                          
2023-07-17 11:02:00    0.00      0  0.827778     0    0.00  0.785000     11   
2023-07-17 11:59:03    2.55      0  0.826481     0    0.00  0.823519     11   
2023-07-17 11:59:04    0.00      0  0.826481     0    0.00  0.795741     11   
2023-07-17 11:18:54    0.00      0  0.835185     1    0.00  0.775926     11   
2023-07-17 11:55:02    0.00      0  0.803889     1    0.00  0.778148     11   
...                     ...    ...       ...   ...     ...       ...    ...   
2023-07-17 11:20:35    2.55      1  0.490556     1    2.45  0.950370     11   
2023-07-17 11:20:34    2.55      1  0.488704     1    2.45  0.932778     11   
2023-07-17 11:24:34    2.55      1  0.490185     1    2.45  0.931296     11   
2023-07-17 11:24:35    2.55      1  0.492037     1    2.45  0.948889     11   
2023-07-17 11:24:36    2.55      1  0.493889     1  
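
When only the highest-scoring records are of interest, `DataFrame.nlargest` returns them directly without sorting the whole frame first. The sketch below is a minimal illustration on five rows copied from the `assign_model` output above; the variable names `sample` and `top2` are hypothetical.

```python
import pandas as pd

# Scores copied from the first five rows of the assign_model output
sample = pd.DataFrame(
    {
        "Anomaly": [0, 0, 0, 0, 0],
        "Anomaly_Score": [-0.05757, -0.072751, -0.079607, -0.073222, -0.077517],
    },
    index=pd.to_datetime([
        "2023-07-17 11:00:31", "2023-07-17 11:00:32", "2023-07-17 11:00:33",
        "2023-07-17 11:00:34", "2023-07-17 11:00:35",
    ]),
)

# Keep the two records with the highest anomaly score
top2 = sample.nlargest(2, "Anomaly_Score")
print(top2)
```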

In [89]:
# Display the sorted anomalies results
print("Anomalies: \n", sorted_results[sorted_results['Anomaly'] == 1])

Anomalies: 
                      FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time                                                                          
2023-07-17 11:02:00    0.00      0  0.827778     0    0.00  0.785000     11   
2023-07-17 11:59:03    2.55      0  0.826481     0    0.00  0.823519     11   
2023-07-17 11:59:04    0.00      0  0.826481     0    0.00  0.795741     11   
2023-07-17 11:18:54    0.00      0  0.835185     1    0.00  0.775926     11   
2023-07-17 11:55:02    0.00      0  0.803889     1    0.00  0.778148     11   
2023-07-17 11:01:59    0.00      0  0.827778     0    0.00  0.812778     11   
2023-07-17 11:58:56    2.55      1  0.505370     0    0.00  1.012407     11   
2023-07-17 11:05:00    0.00      0  0.818889     0    0.00  0.780000     11   
2023-07-17 11:00:52    0.00      0  0.808889     0    0.00  0.778333     11   
2023-07-17 11:52:01    0.00      0  0.803704     1    0.00  0.786667     11   
2023-07-17 11:59:27    0.00      0  0.8

In [91]:
# Count the number of anomalies
number_of_anomalies = iforest_results['Anomaly'].sum()

print("Number of anomalies detected:", number_of_anomalies)

Number of anomalies detected: 36


In [92]:
# Check the number of rows and columns in the dataframe
print("The number of rows and columns: ", data.shape)

The number of rows and columns:  (3595, 9)


The number of anomalies detected is 36 because about 1% of the 3595 records in the dataset are flagged as anomalies. The 1% comes from the `fraction` parameter defined in '2. Create Model' of the PyCaret workflow.
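
The count of 36 can be reproduced with a percentile threshold. As far as we understand, the PyOD detectors wrapped by PyCaret flag the records whose score exceeds the `100 * (1 - contamination)` percentile of the training scores. The sketch below applies that rule to synthetic scores, which are an assumption for illustration only, so only the resulting count is meaningful and not the score values.

```python
import numpy as np

n_records = 3595   # rows in the dataframe
fraction = 0.01    # value passed to create_model

# Synthetic anomaly scores; higher means more anomalous
rng = np.random.default_rng(123)
scores = rng.normal(size=n_records)

# Threshold at the (1 - fraction) percentile of the scores
threshold = np.percentile(scores, 100 * (1 - fraction))

# Records strictly above the threshold are labelled 1 (anomaly)
labels = (scores > threshold).astype(int)

print("Number of anomalies:", labels.sum())
```

Because the threshold falls between two neighbouring scores, the count is rounded up from 35.95 to 36 rather than truncated to 35.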

##**4. Analyse Model**

In [78]:
# Plot anomalies
import plotly.graph_objects as go

# Columns to plot and the marker colour used for each column's anomalies
columns = ['FIT101', 'MV101', 'LIT101', 'P101', 'FIT201', 'LIT301']
marker_colours = ['yellow', 'white', 'pink', 'lightcyan', 'lightgoldenrodyellow', 'lightyellow']

# Plot value on y-axis and date on x-axis
fig = px.line(iforest_results, x=iforest_results.index, y=columns, title='TRIST Anomaly Detection', template = 'plotly_dark')

# Timestamps of the records flagged as anomalies
outliers = iforest_results[iforest_results['Anomaly'] == 1].index

# Overlay one marker trace per column at the anomalous timestamps
for column, colour in zip(columns, marker_colours):
    fig.add_trace(go.Scatter(x=outliers, y=iforest_results.loc[outliers, column], mode = 'markers',
                name = f'{column} Anomalies',
                marker=dict(color=colour,size=10)))

fig.show()

In [79]:
# tsne plot anomalies
plot_model(iforest, plot = 'tsne')

##**5. Predict Model**

In [80]:
# Predict on a dataset
# The predict_model function returns the input dataframe with the Anomaly and Anomaly_Score columns appended.
# predict_model is mainly useful for obtaining labels on unseen data (i.e. data that was not used to train the model);
# here the same `data` dataframe is reused for demonstration.
iforest_pred = predict_model(iforest, data=data)
iforest_pred

| time | FIT101 | MV101 | LIT101 | P101 | FIT201 | LIT301 | hours | minutes | seconds | Anomaly | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-07-17 11:00:00 | 2.55 | 1.0 | 0.597593 | 0.0 | 0.00 | 0.951852 | 11.0 | 0.0 | 0.0 | 0 | -0.015048 |
| 2023-07-17 11:00:01 | 2.55 | 1.0 | 0.644815 | 0.0 | 0.00 | 0.924074 | 11.0 | 0.0 | 1.0 | 0 | -0.018931 |
| 2023-07-17 11:00:02 | 2.55 | 1.0 | 0.692037 | 0.0 | 0.00 | 0.896296 | 11.0 | 0.0 | 2.0 | 0 | -0.021834 |
| 2023-07-17 11:00:03 | 2.55 | 1.0 | 0.739259 | 0.0 | 0.00 | 0.868519 | 11.0 | 0.0 | 3.0 | 0 | -0.022974 |
| 2023-07-17 11:00:04 | 2.55 | 1.0 | 0.786481 | 0.0 | 0.00 | 0.840741 | 11.0 | 0.0 | 4.0 | 0 | -0.012677 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-07-17 11:59:55 | 0.00 | 0.0 | 0.574444 | 1.0 | 2.45 | 0.873333 | 11.0 | 59.0 | 55.0 | 0 | -0.057710 |
| 2023-07-17 11:59:56 | 0.00 | 0.0 | 0.529074 | 1.0 | 2.45 | 0.887407 | 11.0 | 59.0 | 56.0 | 0 | -0.050198 |
| 2023-07-17 11:59:57 | 2.55 | 1.0 | 0.493148 | 1.0 | 2.45 | 0.905000 | 11.0 | 59.0 | 57.0 | 0 | -0.056831 |
| 2023-07-17 11:59:58 | 2.55 | 1.0 | 0.495000 | 1.0 | 2.45 | 0.922593 | 11.0 | 59.0 | 58.0 | 0 | -0.063352 |


In [81]:
# Print the predicted anomalies
print("Anomalies: \n", iforest_pred[iforest_pred['Anomaly'] == 1])

Anomalies: 
                      FIT101  MV101    LIT101  P101  FIT201    LIT301  hours  \
time                                                                          
2023-07-17 11:00:05    0.00    0.0  0.824259   0.0    0.00  0.812963   11.0   
2023-07-17 11:00:06    0.00    0.0  0.824259   0.0    0.00  0.785185   11.0   
2023-07-17 11:00:50    0.00    0.0  0.808889   0.0    0.00  0.833889   11.0   
2023-07-17 11:00:52    0.00    0.0  0.808889   0.0    0.00  0.778333   11.0   
2023-07-17 11:00:53    0.00    0.0  0.781667   1.0    2.45  0.768704   11.0   
2023-07-17 11:01:15    0.00    0.0  0.806481   1.0    0.00  0.791296   11.0   
2023-07-17 11:01:59    0.00    0.0  0.827778   0.0    0.00  0.812778   11.0   
2023-07-17 11:02:00    0.00    0.0  0.827778   0.0    0.00  0.785000   11.0   
2023-07-17 11:02:45    0.00    0.0  0.820741   0.0    0.00  0.784259   11.0   
2023-07-17 11:03:00    2.55    1.0  0.516111   0.0    0.00  1.008333   11.0   
2023-07-17 11:03:08    0.00    0.0  0.8
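
The anomaly listings from `assign_model` and `predict_model` do not start at the same timestamp (11:00:50 and 11:00:05 respectively). A minimal sketch for comparing the flagged timestamps follows. It uses only the first few timestamps copied from each listing; with the real results the two indexes would be taken from the flagged rows of `iforest_results` and `iforest_pred`. The variable names are hypothetical.

```python
import pandas as pd

# First few flagged timestamps copied from the assign_model listing
assigned_idx = pd.DatetimeIndex([
    "2023-07-17 11:00:50", "2023-07-17 11:00:52", "2023-07-17 11:00:53",
])

# First few flagged timestamps copied from the predict_model listing
predicted_idx = pd.DatetimeIndex([
    "2023-07-17 11:00:05", "2023-07-17 11:00:06",
    "2023-07-17 11:00:50", "2023-07-17 11:00:52", "2023-07-17 11:00:53",
])

# Timestamps flagged by predict_model only, and by both functions
only_predicted = predicted_idx.difference(assigned_idx)
flagged_by_both = predicted_idx.intersection(assigned_idx)

print("Only in predict_model:", list(only_predicted))
print("In both:", list(flagged_by_both))
```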

##**6. Save Model**

In [82]:
# Save pipeline into pickle file
save_model(iforest, 'iforest_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['FIT101', 'MV101', 'LIT101',
                                              'P101', 'FIT201', 'LIT301',
                                              'hours', 'minutes', 'seconds'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('trained_model',
                  IForest(behaviour='new', bootstrap=False, contamination=0.01,
     max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
     random_state=123, verbose=0))]),
 'iforest_pipeline.pkl')

##**7. Load Model**

In [83]:
# load pipeline
loaded_iforest_pipeline = load_model('iforest_pipeline')
loaded_iforest_pipeline

Transformation Pipeline and Model Successfully Loaded
