# Story
On an important mission to Mars, Red, Blue, Green, Pink, Orange, Yellow, Black, White, Purple, Brown, Cyan and Lime boarded the spaceship. On the way to Mars, the spaceship suffers a series of internal component failures in its navigation and flight systems. It was time for the crew to repair the issues. But there was one Imposter, one Betrayer, one Anomaly among them, and to identify who that is, a set of tasks has to be performed on the data to reveal the imposter in the dataset.

What this project really focuses on is the many real-world cases of abnormal data that pose a serious threat to businesses in IT, health and various other sectors. Even though cyber security teams are working hard to spot anomalous behaviour in transactions, the systems built on existing algorithms are not efficient enough to capture every anomaly. Millions are lost to cyber attacks. They affect not only business revenue but also the reputation of the firm and the trust in doing business with it.

## Where to find the Imposter ?
To find the imposter in our spaceship, we use the [Numenta Anomaly Benchmark (NAB)](https://www.kaggle.com/boltzmannbrain/nab), from which we take speed_6005, a dataset with 2500 rows of speed readings from a specific sensor in the spaceship.

**CSV name**: speed_6005.csv

The crew will explore this dataset with time-series visualizations and run analyses to detect the anomalous records and thereby capture the imposter. These are crucial records that can help identify suspicious speeds recorded by the sensor.

# Tasks
- Getting to Know the Data
> - Changing the Datatype of Timestamp
- Emergency Meetings
> - Overview of time series data
> - Histogram and Scatter on datetime
> - Which hour and day of the month we had high speed?
> - Behaviour during weekend
- Sus Pattern Via Visuals - RED SUS?
> - Let's Brainstorm What Happened
- Building Model to Trace Anomalies
> - IsolationForest
>  - (Extra Notes)
> - Prophet

# Libraries

In [None]:
import numpy as np
np.random.seed(1)
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import altair as alt
%matplotlib inline

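try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # legacy package name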
from sklearn.ensemble import IsolationForest

# Getting to Know the Data

In [None]:
imposter = pd.read_csv('../input/nab/realTraffic/realTraffic/speed_6005.csv')

print(f'speed_6005.csv: {imposter.shape}')
imposter.head()

According to dataset information, it has the following features :
- **timestamp**:
 - This is the date and time when the speed was recorded by the sensor
- **value**:
 - This is the speed recorded in the specific sensor

*Note that the speed (value) has no stated unit, and the metadata doesn't give any information on it either.*

In [None]:
imposter.info()

In [None]:
imposter.describe()

From the information we can identify that
- We don't have any null records in the dataset. BAM !
- timestamp column is an object data type. small bam!

## Changing the Datatype of Timestamp
Converting the timestamp lets us extract parts of it such as `year`, `month`, `day`, `hour` and `weekday`, which will reveal a lot of information from the data. We gotta look for all possible ways to find the imposter among us.

In [None]:
imposter['timestamp'] = pd.to_datetime(imposter['timestamp'])
imposter.info()

In [None]:
# Use the vectorized .dt accessor instead of a Python-level apply
imposter['year'] = imposter['timestamp'].dt.year
imposter['month'] = imposter['timestamp'].dt.month
imposter['day'] = imposter['timestamp'].dt.day
imposter['weekday'] = imposter['timestamp'].dt.weekday
imposter['hour'] = imposter['timestamp'].dt.hour

imposter = imposter[['timestamp', 'year', 'month', 'day', 'weekday', 'hour', 'value']]
imposter.rename(columns={'timestamp': 'Datetime'}, inplace=True)

# Weekday starts from Monday
print(f'{imposter.Datetime[0]} with weekday {imposter.weekday[0]} is {imposter.Datetime[0].strftime("%A")}.\n')

imposter.sample(5)

# Emergency Meetings
| Innocents | Statements |
| --- | ---------- |
| **Pink** | I was working on increasing speed during the weekends at medbay |
| **Orange** | I was working on increasing speed during holiday months such as December and January at admin |
| **Yellow** | I was working on increasing speed during the late night hours at storage |
| **Cyan** | I proposed a strategy to form a seasonality across 2015 for speed at shields |
| **Red** | I worked really hard during Sep 4- Sep 10 to fix our ship at reactor |

Let's put forth our work status and check each crewmate's statement by looking at the visualizations.

## Overview of time series data
Let's take the `Datetime` as x axis and plot the values and identify whether it has the characteristics of time series data and also check against **what they stated**.

In [None]:
fig = px.line(imposter, x='Datetime', y='value', title='Overview of our time series data')

fig.update_xaxes(rangeslider_visible=True,
                rangeselector=dict(
                buttons=[
                    dict(count=1, label='1m', step='month', stepmode='backward'),
                    dict(count=6, label='6m', step='month', stepmode='backward'),
                    dict(step='all')
                ]))
fig

**Discussions**:<br>
We don't have data for the whole of 2015; we only have about 17 days (Aug 31 to Sep 17), and it doesn't exhibit seasonality. We can reject our 4th assumption, and since Cyan said it's true, I feel Cyan is sus.<br>
And let's straight away reject the 2nd assumption, as we don't have any data to prove it. ORANGE IS SUS MAX.

| Innocents | Statements |
| --- | ---------- |
| **Pink** | I was working on increasing speed during the weekends at medbay |
| ~~**Orange**~~ | ~~I was working on increasing speed during holiday months such as December and January at admin~~ |
| **Yellow** | I was working on increasing speed during the late night hours at storage |
| ~~**Cyan**~~ | ~~I proposed a strategy to form a seasonality across 2015 for speed at shields~~ |
| **Red** | I worked really hard during Sep 4- Sep 10 to fix our ship at reactor |

**Actual Insights**:
- The speed across the 17 days looks **stationary** rather than seasonal
- The speed lies between 20 and 109, and **most values lie between 70 and 90** (see below)
- **The drop seen towards the end is huge** compared to the drop in the first days of September
- Even though the pattern repeats, you can visually see the **same speed during Sep 4 - Sep 8**. Could it be an ANOMALY?

In [None]:
sns.displot(imposter.value)

## Histogram and Scatter on datetime
Let's plot a combined chart. If you wanna find some imposters in our data, scatter and box plots are the best.

In [None]:
fig = px.histogram(imposter, x='Datetime', y='value', histfunc='avg', title='Histogram and Scatter on Date Axes')

fig.update_traces(xbins_size='M1')
fig.update_xaxes(showgrid=True, ticklabelmode='period', dtick='M1', tickformat='%b\n%Y')
fig.update_layout(bargap=0.1)
fig.add_trace(go.Scatter(mode='markers', x=imposter['Datetime'], y=imposter['value'], name='daily'))
fig

**Actual Insights**:
- We have **only one data point from Aug**, so we can't read anything into the left bar; the September bar shows an average speed of around 81.9
- We can see **3 points at the end of the Sep data** that look like outliers, but that doesn't mean they are anomalies
- We can also notice that **no speed was recorded in mid Sep**. Could it be a shutdown? That might even invite cyber attacks on our spaceship
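
The "no speed recorded" observation can also be checked without a chart by looking at the time between consecutive readings. Below is a minimal sketch on a made-up toy frame; the `find_gaps` helper and the one-hour `max_gap` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy series standing in for the sensor log: regular readings
# with one long silence in the middle.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2015-09-01 10:00", "2015-09-01 10:05", "2015-09-01 10:10",
        "2015-09-03 08:00", "2015-09-03 08:05",
    ]),
    "value": [80, 82, 79, 81, 83],
})

def find_gaps(df, time_col="Datetime", max_gap="1h"):
    """Return one row per silence longer than max_gap (start, end, length)."""
    ordered = df.sort_values(time_col)
    delta = ordered[time_col].diff()            # time since the previous reading
    is_gap = delta > pd.Timedelta(max_gap)      # the first row is NaT, so never a gap
    return pd.DataFrame({
        "gap_start": ordered[time_col].shift(1)[is_gap],
        "gap_end": ordered[time_col][is_gap],
        "length": delta[is_gap],
    }).reset_index(drop=True)

gaps = find_gaps(toy)
print(gaps)
```

On the real data the same call would be `find_gaps(imposter)`, with `max_gap` picked to match how often the sensor normally reports.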

## Which hour and day of the month we had high speed?
Let's use the altair library to plot a beautiful heatmap that helps us identify at which hour and on which day of the month the speed was higher.

In [None]:
alt.Chart(imposter).mark_rect().encode(alt.X('hour:O', title='hour of day'),
                                      alt.Y('day:O', title='date'),
                                      alt.Color('value:Q', title='speed'))

**Discussions**:<br>
We can clearly see that speeds are lower during late night hours than during morning hours, so our 3rd assumption is false, and since Yellow said it's true, I feel Yellow is sus.

| Innocents | Statements |
| --- | ---------- |
| **Pink** | I was working on increasing speed during the weekends at medbay |
| ~~**Orange**~~ | ~~I was working on increasing speed during holiday months such as December and January at admin~~ |
| ~~**Yellow**~~ | ~~I was working on increasing speed during the late night hours at storage~~ |
| ~~**Cyan**~~ | ~~I proposed a strategy to form a seasonality across 2015 for speed at shields~~ |
| **Red** | I worked really hard during Sep 4- Sep 10 to fix our ship at reactor |

**Actual Insights**:
- We can also see that the recordings run from Aug 31 6pm to Sep 17 6pm. On those two days, no speed was recorded during the remaining hours.
- We can also notice several sensor shutdowns between the hours. Are those anomalies? Let's find out!

## Behaviour during weekend
Let's check out our final assumption: whether the speed rises during weekends.

In [None]:
alt.Chart(imposter).mark_bar().encode(x = 'weekday:O',
                                     y = 'value:Q',
                                     color = alt.condition(alt.datum.weekday == 0,
                                                              alt.value('orange'),
                                                              alt.value('steelblue'))).properties(width=600)

**Discussions**:<br>
The highest speed was recorded on a weekend day, Saturday. But Thursday and Friday show low speeds, and so does Monday. We can accept our 1st assumption of a hike in speed during the weekend; to be more specific, it holds only for the start of the weekend, as the end of the weekend doesn't show much speed in the sensors.

| Innocents | Statements |
| --- | ---------- |
| **Pink** | ***I was working on increasing speed during the weekends at medbay*** |
| ~~**Orange**~~ | ~~I was working on increasing speed during holiday months such as December and January at admin~~ |
| ~~**Yellow**~~ | ~~I was working on increasing speed during the late night hours at storage~~ |
| ~~**Cyan**~~ | ~~I proposed a strategy to form a seasonality across 2015 for speed at shields~~ |
| **Red** | I worked really hard during Sep 4- Sep 10 to fix our ship at reactor |
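
Pink's weekend statement can also be cross-checked with numbers instead of bars. This is a minimal sketch on made-up readings (two per day for one week starting Monday 2015-09-07); the `toy` frame is hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical readings: two per day over one week starting Monday 2015-09-07.
toy = pd.DataFrame({
    "Datetime": pd.date_range("2015-09-07", periods=14, freq="12h"),
    "value": [70, 72, 71, 73, 75, 74, 68, 66, 65, 67, 90, 92, 78, 76],
})
toy["weekday"] = toy["Datetime"].dt.weekday   # Monday=0 ... Sunday=6

# Mean and count per weekday; the count shows how much data backs each mean.
by_day = toy.groupby("weekday")["value"].agg(["mean", "count"])
busiest = by_day["mean"].idxmax()

print(by_day)
print("weekday with the highest mean speed:", busiest)
```

Keeping the count next to the mean guards against a day looking fast only because it has very few readings. On the real data, replace `toy` with `imposter`.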

# Sus Pattern Via Visuals - RED SUS?
Let's look at the anomaly patterns that are visible to the naked eye. The points highlighted here may not be the anomalies flagged by an algorithm, since this step is based purely on visualizations.

In [None]:
fig = px.line(imposter, x='Datetime', y='value',
             title='SUS pattern-ANOMALY?',
             range_x=['2015-08-31 19:00:00', '2015-09-17 17:00:00'])

fig.update_layout(shapes=[dict(type = 'rect',
                              xref = 'x',
                              yref = 'paper',
                              x0 = '2015-09-05',
                              y0 = 0,
                              x1 = '2015-09-08 11:00:00', 
                              y1 = 1,
                              fillcolor = 'Red',
                              opacity = 0.5,
                              layer = 'below',
                              line_width = 0),
                          dict(type = 'rect',
                              xref = 'x',
                              yref = 'paper',
                              x0 = '2015-09-09',
                              y0 = 0,
                              x1 = '2015-09-10 09:00:00', 
                              y1 = 1,
                              fillcolor = 'Red',
                              opacity = 0.5,
                              layer = 'below',
                              line_width = 0),
                          dict(type = 'rect',
                              xref = 'x',
                              yref = 'paper',
                              x0 = '2015-09-16 20:00:00',
                              y0 = 0,
                              x1 = '2015-09-17 12:00:00', 
                              y1 = 1,
                              fillcolor = 'Red',
                              opacity = 0.5,
                              layer = 'below',
                              line_width = 0)],
                 annotations=[dict(x = '2015-09-08 11:00:00',
                                  y = 0.99,
                                  xref = 'x',
                                  yref = 'paper',
                                  showarrow = False,
                                  xanchor = 'right',
                                  text = 'SUS activity 1'),
                              dict(x = '2015-09-10 09:00:00',
                                  y = 0.99,
                                  xref = 'x',
                                  yref = 'paper',
                                  showarrow = False,
                                  xanchor = 'right',
                                  text = 'SUS activity 2'),
                              dict(x = '2015-09-17 12:00:00',
                                  y = 0.99,
                                  xref = 'x',
                                  yref = 'paper',
                                  showarrow = False,
                                  xanchor = 'right',
                                  text = 'SUS activity 3')])
fig

## Let's Brainstorm What Happened
**Suspicious activity 1**<br>
No speed was recorded during this period: the sensor got stuck. The imposter must have SABOTAGED the sensors to get into the spaceship without alerting anyone. The imposter is still among us and could possibly have entered during this time.

# Building Model to Trace Anomalies

## [IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)

- Return the anomaly score of each sample using the IsolationForest algorithm.

The IsolationForest ‘isolates’ observations by **randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature**.<br>
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.<br>
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.<br>
**Random partitioning produces noticeably shorter paths for anomalies**. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.

1. How Isolation Forest works<br>
![](https://miro.medium.com/max/3000/1*d-4xINDQHv0G82o2GUApJQ.png)<br>
The figure above shows an example of the procedure after four splits. In this case, there are only two features, `x` and `y`, and four observations to check. The first condition is the one that distinguishes a normal observation from an anomaly: if `x` is bigger than 120, the observation is an outlier and is coloured in red. Normal and anomalous data points can then be distinguished by average path length: **shorter paths indicate anomalies**, while **longer paths indicate normal observations**.

2. Anomaly Score<br>
![](https://miro.medium.com/max/1704/1*D78QLbcwXesymhquuofnOg.png)<br>
The isolation forest needs an Anomaly Score to have an idea of how anomalous a data point is. Its values lie between 0 and 1. The anomaly score is defined as:<br>
![](https://miro.medium.com/max/3000/1*GMWS-FkTTYWaRgOhKV_QCQ.png)<br>
where `E(h(x))` is the average of `h(x)`, which is the path length from the root node to the external node `x`, while `c(n)` is the average of `h(x)` given n and is used to normalize `h(x)`. There are three possible situations:
 - When the score of the observation is close to 1, the path length is very small, and then the data point is easily isolated. We have an **anomaly**.
 - When the score of that observation is smaller than 0.5, the path length is large, and then we have a **normal data point**.
 - If all the observations have an anomaly score around 0.5, then the **entire sample doesn’t have any anomaly**.
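
The three situations above can be reproduced with a few lines of plain Python. This is a minimal sketch assuming the standard definition from the original Isolation Forest paper, `s(x, n) = 2 ** (-E(h(x)) / c(n))` with `c(n) = 2 * H(n - 1) - 2 * (n - 1) / n`; the function names and the sub-sample size `n = 256` are choices made for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(n):
    """Normaliser: average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); closer to 1 means easier to isolate."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # hypothetical sub-sample size
for path in (1.0, c(n), 3 * c(n)):
    print(f"E(h(x)) = {path:6.2f} -> score = {anomaly_score(path, n):.3f}")
```

A very short average path gives a score close to 1, a path equal to `c(n)` gives exactly 0.5, and a much longer path pushes the score towards 0.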

### Algorithm Workflow

> Isolation forest is an **unsupervised learning algorithm** for anomaly detection that works on the principle of isolating anomalies, instead of the most common techniques of profiling normal points.
> 1. In the first stage, a training dataset is used to build iTrees as described in previous sections.
> 2. In the second stage, each instance in the test set is passed through the iTrees built in the previous stage, and a proper “`anomaly score`” is assigned to the instance. Once all the instances in the test set have been assigned an `anomaly score`, it is possible to mark as “anomaly” any point whose score is greater than a predefined `threshold`, which depends on the **domain** the analysis is being applied to.
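
The two stages can be sketched with scikit-learn on made-up data. The synthetic speeds and the `threshold = 0.7` cut-off are assumptions for illustration only. One detail worth knowing: `score_samples` returns the negated paper-style score, so the sign is flipped here to get "higher means more anomalous".

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical speeds: 300 ordinary readings around 80 plus three extreme ones.
normal = rng.normal(loc=80, scale=5, size=(300, 1))
extreme = np.array([[10.0], [150.0], [200.0]])
X = np.vstack([normal, extreme])

# Stage 1: build the trees.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Stage 2: score every point, then compare against a chosen threshold.
score = -forest.score_samples(X)   # paper-style score in (0, 1]
threshold = 0.7                    # assumed, domain-dependent cut-off
flagged = np.where(score > threshold)[0]

print("flagged rows:", flagged)
print("their scores:", score[flagged].round(3))
```

Calling `predict` does the same comparison internally, with the threshold derived from `contamination` instead of being picked by hand.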

In [None]:
model = IsolationForest(verbose=1)
model.fit(imposter[['value']])

### (Extra Notes)

In [None]:
print(f"type(imposter['value']) = {type(imposter['value'])}")
print(f"type(imposter[['value']]) = {type(imposter[['value']])}")

In [None]:
imposter['outliers'] = pd.Series(model.predict(imposter[['value']])).apply(lambda x: 'yes' if (x==-1) else 'no')
imposter.query('outliers=="yes"')

In [None]:
fig = px.scatter(imposter.reset_index(), x='Datetime', y='value',
                hover_data=['weekday'], color='outliers', title='NAB-speed outliers')
fig.update_xaxes(rangeslider_visible=True)
fig

Alright, I'll initialize the model with `contamination = 0.01`. The `contamination` rate can be fixed **as per the domain**. Since we've got only one imposter, I have set it very low.
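
To get a feel for what `contamination` does, the sketch below fits the same forest twice on made-up data and counts how many training points `predict` marks as `-1`. The synthetic speeds are an assumption; the point is only that the flagged share tracks the `contamination` value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(loc=80, scale=10, size=(1000, 1))   # hypothetical speeds

counts = {}
for contamination in (0.01, 0.05):
    model = IsolationForest(contamination=contamination, random_state=1).fit(X)
    # predict returns -1 for outliers and 1 for inliers
    counts[contamination] = int((model.predict(X) == -1).sum())
    print(f"contamination={contamination}: {counts[contamination]} of {len(X)} flagged")
```

So `contamination` is less a property of the data than a statement of how many points we are willing to call anomalous.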

In [None]:
model = IsolationForest(contamination=0.01, verbose=1)
model.fit(imposter[['value']])
imposter['outliers'] = pd.Series(model.predict(imposter[['value']])).apply(lambda x: 'yes' if (x==-1) else 'no')
fig = px.scatter(imposter.reset_index(), x='Datetime', y='value',
                hover_data=['weekday'], color='outliers', title='NAB-speed outliers')
fig.update_xaxes(rangeslider_visible=True)
fig

## [Prophet](https://pypi.org/project/fbprophet/)

**FBProphet: Automatic Forecasting Procedure**
> Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

![](https://miro.medium.com/max/5452/1*gkpBh6iZZBk5-fyZeOCK8A.jpeg)
The additive model above can **absorb the absence of seasonal effects by having s(t) = 0**, since that leaves the other terms of the equation unaffected when predicting future values of y(t). Unlike fixed and linear regression models such as Fama-French . . . **Prophet is a modular and non-linear regression model** that separates and recombines a single dataset of history. Feature engineering, where features explain a future value . . . or where factors drive a forecast, is removed . . . **leaving more room for the option of a variable Domain Knowledge** for a user of Prophet.

### Data Preprocessing

First let's rename the columns according to Prophet's conventions. The input to Prophet is always a dataframe with two columns: `ds` and `y`. The `ds` (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The `y` column must be numeric, and represents the measurement we wish to forecast.

In [None]:
imposter_prophet = imposter.copy()
imposter_prophet = imposter_prophet.reset_index()[['Datetime', 'value']].rename({
    'Datetime': 'ds', 'value': 'y'
}, axis='columns')

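Now let's create the model and set `changepoint_range` to 0.95. This parameter is the proportion of the history in which Prophet looks for trend changepoints (the default is 0.8), so here changepoints may be placed anywhere in the first 95% of the series. It is not a confidence level: the width of the uncertainty interval is controlled separately by `interval_width`, which is left at its default of 0.8 here. Allowing changepoints late in the series lets the trend follow recent shifts, which helps when we scrutinize the data for an anomaly.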

In [None]:
model = Prophet(changepoint_range=0.95)
model.fit(imposter_prophet)

We can also make future predictions from our data; this is the important part, where we get the lower and upper bounds of the interval. The helper method `Prophet.make_future_dataframe` gives us a suitable dataframe that extends into the future by a specified number of periods (100 hours in the cell below). By default it also includes the dates from the history, so we will see the model fit as well.

The predicted values may vary within these intervals; no prediction can be exact. Since `interval_width` is left at Prophet's default of 0.8, `yhat_lower` and `yhat_upper` bound an **80% uncertainty interval** around each prediction.

In [None]:
future = model.make_future_dataframe(periods=100, freq='H')
forecast = model.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
results = pd.concat([
    imposter_prophet.set_index('ds')['y'], forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']]
], axis=1)

results.head()

In [None]:
fig1 = model.plot(forecast)

We can see the future predicted values, and they follow the same seasonal patterns. Let's check the model components, which give us the trend pattern followed across time.

In [None]:
comp = model.plot_components(forecast)

We can confirm that **the speed was high on weekends**, and on the rest of the days it was dim, except Wednesday. An anomaly might fall on that day.

Also, if we take a closer look at the chart by time of day, **the speed was high from early morning till evening and then dipped around midnight**.

Considering the days, there was a steep increase in the initial recording days, followed by a steep decline from which the speed never rose again. Can we connect all the dots?

Before passing any judgements, let's **figure out the error of the predictions** and also **calculate the uncertainty as the difference between the upper and lower interval bounds**. That leaves us with the records whose error lies far beyond the interval, which is an unusual case; we can term those anomalies.

In [None]:
results['error'] = results['y'] - results['yhat']
results['uncertainity'] = results['yhat_upper'] - results['yhat_lower']
results[results['error'].abs() > 1.5*results['uncertainity']]

Now let's classify a record as an anomaly (`yes`) if its error lies beyond 1.5 times the `uncertainity`. The `threshold = 1.5` depends on the application we are working on. Here we are not concerned about values that land inside the uncertain range, but about what lies beyond it, which has to be classified as an anomaly.
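
The rule itself is easy to try out on a tiny table before touching the real forecast. This is a minimal vectorized sketch; the `toy` numbers and the `flag_anomalies` helper are made up for illustration, and only the column names match what Prophet returns.

```python
import numpy as np
import pandas as pd

# Hypothetical forecast table with the same column names Prophet produces.
toy = pd.DataFrame({
    "y":          [80.0, 82.0, 30.0, 79.0, 140.0],
    "yhat":       [81.0, 80.0, 80.0, 80.0, 82.0],
    "yhat_lower": [71.0, 70.0, 70.0, 70.0, 72.0],
    "yhat_upper": [91.0, 90.0, 90.0, 90.0, 92.0],
})

def flag_anomalies(df, factor=1.5):
    """Flag rows whose absolute error exceeds factor times the interval width."""
    error = df["y"] - df["yhat"]
    width = df["yhat_upper"] - df["yhat_lower"]
    return np.where(error.abs() > factor * width, "yes", "no")

toy["anomaly"] = flag_anomalies(toy)
print(toy)
```

Note how strict this is: with a 20-unit interval, a reading is flagged only when the forecast misses by more than 30, so points sitting just outside the interval still count as normal.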

In [None]:
results['anomaly'] = results.apply(lambda x: 'yes' if ( np.abs(x['error']) > 1.5*x['uncertainity'] ) else 'no', axis=1)

fig = px.scatter(results.reset_index(), x='ds', y='y',
                color='anomaly', title='NAB-speed outliers')
fig.update_xaxes(rangeslider_visible=True,
                rangeselector=dict(buttons=list([
                    dict(count=1, label='1y', step='year', stepmode='backward'),
                    dict(count=2, label='2y', step='year', stepmode='backward'),
                    dict(count=3, label='3y', step='year', stepmode='backward'),
                    dict(step='all')
                ])))
fig

Yes, we got the anomaly points, which were recorded on Sep 1 and Sep 17.

# Next: Ensemble/ Vote?
Ensemble methods are a machine learning technique that combines several base models in order to produce one optimal predictive model...

# References

- [Anomaly Detection With Isolation Forest](https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6)
- [Detecting anomalies using Isolation Trees: Practical Machine Learning](https://www.youtube.com/watch?v=smiu01pLosI)
- [【異常檢測】孤立森林（Isolation Forest）演算法簡介](https://codingnote.cc/zh-tw/p/177980/)
- [The Facebook Prophet Prediction Model and Product Analytics](https://foxworthy-8036.medium.com/the-facebook-prophet-prediction-model-and-product-analytics-a1db05fbe454)
- [Fortune-Telling with Python: An Intro to Facebook Prophet](https://www.youtube.com/watch?v=95-HMzxsghY)
- [Quick Start | Prophet - Facebook Open Source](https://facebook.github.io/prophet/docs/quick_start.html)
- [Facebook時間序列預測演算法Prophet的研究](https://iter01.com/10666.html)

This notebook was created based on this amazing notebook:
- [@benroshan](https://www.kaggle.com/benroshan)
 - [⚠️ Anomaly Detection 🚨 AMONG US](https://www.kaggle.com/benroshan/anomaly-detection-among-us)
 
Please go check it out and give applause to the author!