# 1. Business Understanding

## 1.1 Objective:

The primary objective is to reduce water waste and improve efficiency in water use across various appliances in a residential or commercial setting. By detecting leakages through anomalous water flow readings, the goal is to identify which appliances suffer from water leakage and at what times, and rectify these issues promptly, thereby conserving water, reducing costs associated with excessive water use, and minimizing potential damage to property due to unchecked leakages.

## 1.2 Problem statement:

Inefficient water usage and leakages in water appliances pose significant challenges, including increased operational costs and environmental impact. Despite the critical need for water conservation, current methods of leakage detection often rely on manual inspection or are not sufficiently sensitive to detect small leakages, leading to prolonged periods of water wastage. There is a need for an automated, data-driven approach to identify and quantify water leakages in real-time or near-real-time to address these challenges effectively.

## 1.3 Requirements

**Leakage detection:** Use the `adtk` library to analyse water flow data from each appliance and detect anomalies that could indicate leakages. This involves comparing current water usage with historical patterns to identify significant deviations that regular usage does not explain.

# 2. Data Understanding

After reviewing the available data, the business objectives were revisited and reformulated to ensure that they are realistic and achievable given the nature of the data. It was also assessed whether additional data transformations would be needed if the exact business objective could not be met with the data in its existing format. This matters because, even if the original objective itself is out of reach, the analysis should get as close to it as possible and still produce valuable results.

## 2.1 Data Collection

The data was acquired via **Fairwater** and was recorded in a one-person household flat in Naples, Italy. The dataset is also popularly known as the **WEUSEDTO** dataset, named after the research paper written on it. It was made available for analysis by Newcastle University as a task under the module **CSC8633 - Group Project in Data Science**.

### A. Library imports

In [None]:
from global_imports import *
warnings.filterwarnings('ignore')

### B. Data loading

In [None]:
def load_data():
    agg_df = pd.read_csv('../data/raw/aggregatedWholeHouse.csv')
    bidet_df = pd.read_csv('../data/raw/feedBidet.csv')
    dishwash_df = pd.read_csv('../data/raw/feedDishwasher.csv')
    kitchen_df = pd.read_csv('../data/raw/feedKitchenfaucet.csv')
    shower_df = pd.read_csv('../data/raw/feedShower.csv')
    toilet_df = pd.read_csv('../data/raw/feedToilet.csv')
    washbasin_df = pd.read_csv('../data/raw/feedWashbasin.csv')
    washmachine_df = pd.read_csv('../data/raw/feedWashingmachine.csv')

    bidet_df_interpolated = feather.read_dataframe('../data/processed/Bidet_interpolated.feather')
    kitchen_df_interpolated = feather.read_dataframe('../data/processed/Kitchen_interpolated.feather')
    shower_df_interpolated = feather.read_dataframe('../data/processed/Shower_interpolated.feather')
    washmachine_df_interpolated = feather.read_dataframe('../data/processed/Washing Machine_interpolated.feather')

    return agg_df, bidet_df, dishwash_df, kitchen_df, shower_df, toilet_df, washbasin_df, washmachine_df, bidet_df_interpolated, kitchen_df_interpolated, shower_df_interpolated, washmachine_df_interpolated


agg_df, bidet_df, dishwash_df, kitchen_df, shower_df, toilet_df, washbasin_df, washmachine_df, bidet_df_interpolated, kitchen_df_interpolated, shower_df_interpolated, washmachine_df_interpolated = load_data()

agg_df.rename(columns={'unix': 'StartTime', 'flow': 'aggFlow'}, inplace=True)
bidet_df.rename(columns={'Time': 'StartTime', 'Flow': 'bidetFlow'}, inplace=True)
dishwash_df.rename(columns={'Time': 'StartTime', 'Flow': 'dishFlow', 'EndTime': 'EndTime'}, inplace=True)
kitchen_df.rename(columns={'Time': 'StartTime', 'Flow': 'kitchenFlow'}, inplace=True)
shower_df.rename(columns={'Time': 'StartTime', 'Flow': 'showerFlow'}, inplace=True)
toilet_df.rename(columns={'Time': 'StartTime', 'Flow': 'toiletFlow', 'EndFlow': 'EndTime'}, inplace=True)
washbasin_df.rename(columns={'Time': 'StartTime', 'Flow': 'basinFlow'}, inplace=True)
washmachine_df.rename(columns={'Time': 'StartTime', 'Flow': 'machineFlow'}, inplace=True)

print(f'Data shapes :\n\n\t1. Aggregated data : {agg_df.shape}\t\t2. Bidet data : {bidet_df.shape}\n\n\t3. Dishwasher : {dishwash_df.shape}\t\t\t\t4. Kitchen : {kitchen_df.shape}\n\n\t5. Shower : {shower_df.shape}\t\t\t\t6. Toilet : {toilet_df.shape}\n\n\t7. Washbasin : {washbasin_df.shape}\t\t\t8. Washmachine : {washmachine_df.shape}\n\n\t9. Bidet interpolated data : {bidet_df_interpolated.shape}\t10. Kitchen interpolated data : {kitchen_df_interpolated.shape}\n\n\t11. Shower interpolated data : {shower_df_interpolated.shape}\t12. Washmachine interpolated data : {washmachine_df_interpolated.shape}')

Data shapes :

	1. Aggregated data : (166082, 2)		2. Bidet data : (121993, 2)

	3. Dishwasher : (53, 3)				4. Kitchen : (167065, 2)

	5. Shower : (170181, 2)				6. Toilet : (1188, 3)

	7. Washbasin : (179933, 2)			8. Washmachine : (12055, 2)

	9. Bidet interpolated data : (244607, 1)	10. Kitchen interpolated data : (377722, 1)

	11. Shower interpolated data : (283721, 1)	12. Washmachine interpolated data : (137431, 1)


## 2.2 Exploring the data

The data was recorded for 7 fixtures, each with its own characteristics. Most of these characteristics were covered in the previous cycles. For this cycle specifically, I checked for common timestamps across the fixtures to see whether they could be merged for multivariate anomaly detection. Unfortunately, the fixtures were recorded at different time points and share too few timestamps, so a merge would have produced a large number of missing data points. The idea of multivariate anomaly detection was therefore dropped.

### A. Common timestamp values across different datasets

In [None]:
def count_common_start_end_times_with_shape(*data):
    start_times, end_times, shapes = None, None, []

    # Iterate over each data file
    for df in data:
        shapes.append(df.shape)  # Store shape of current DataFrame

        # Extract StartTime column and find common values
        if 'StartTime' in df.columns:
            if start_times is None: start_times = set(df['StartTime'])
            else: start_times &= set(df['StartTime'])

        # Extract EndTime column and find common values
        if 'EndTime' in df.columns:
            if end_times is None: end_times = set(df['EndTime'])
            else: end_times &= set(df['EndTime'])

    # Count common StartTimes and EndTimes
    common_start_count = len(start_times) if start_times is not None else 0
    common_end_count = len(end_times) if end_times is not None else 0

    return common_start_count, common_end_count, shapes


# Pass the fixture DataFrames as arguments to the function
common_start_count, common_end_count, shapes = count_common_start_end_times_with_shape(bidet_df, kitchen_df, shower_df, washbasin_df, washmachine_df)
print("Common StartTimes count:", common_start_count)
print("Common EndTimes count:", common_end_count)
print("Shapes of data files:", shapes)

Common StartTimes count: 0
Common EndTimes count: 0
Shapes of data files: [(121993, 2), (167065, 2), (170181, 2), (179933, 2), (12055, 2)]
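
To illustrate why a merge is impractical when timestamps do not line up, here is a minimal sketch on made-up data. The two toy frames and their values are hypothetical and are not taken from the dataset; only the column names follow the renaming used above.

```python
import pandas as pd

# Hypothetical toy readings: two fixtures sampled at non-matching Unix times.
bidet = pd.DataFrame({"StartTime": [100, 105, 110], "bidetFlow": [0.0, 2.5, 0.0]})
shower = pd.DataFrame({"StartTime": [101, 106, 111], "showerFlow": [0.0, 7.0, 6.5]})

# Exact-match intersection of timestamps, the same idea as the function above.
common = set(bidet["StartTime"]) & set(shower["StartTime"])

# An outer merge keeps every timestamp from both frames.
merged = bidet.merge(shower, on="StartTime", how="outer").sort_values("StartTime")

# Share of flow cells that are missing after the merge.
missing_share = merged[["bidetFlow", "showerFlow"]].isna().to_numpy().mean()

print(len(common))     # number of shared timestamps
print(merged.shape)    # one row per timestamp in the union
print(missing_share)   # fraction of NaN flow cells
```

With no shared timestamps, every row of the merged frame has a value for only one fixture, so half of the flow cells in this toy case are missing.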


# 3. Data Preparation

This stage involved renaming the data columns to standardised names for easier pre-processing, interpolating possible missing values in the fixture datasets and converting Unix timestamps to datetime objects. This stage typically also involves removing outliers, but since this task does not involve machine learning, the outlier values were retained so as not to disturb the temporal pattern and order of the data.

### A. Datetime conversion

In [None]:
def convert_unix_to_datetime(dataframes):

    converted_dataframes = []

    for df in dataframes:
        # Make a copy of the dataframe to avoid modifying the original
        new_df = df.copy()

        # Check and convert 'StartTime' if it exists
        if 'StartTime' in new_df.columns: new_df['StartTime'] = pd.to_datetime(new_df['StartTime'], unit='s')

        # Check and convert 'EndTime' if it exists
        if 'EndTime' in new_df.columns: new_df['EndTime'] = pd.to_datetime(new_df['EndTime'], unit='s')

        converted_dataframes.append(new_df)

    return converted_dataframes


agg_df, bidet_df, dishwash_df, kitchen_df, shower_df, toilet_df, washbasin_df, washmachine_df = convert_unix_to_datetime([agg_df, bidet_df, dishwash_df, kitchen_df, shower_df, toilet_df, washbasin_df, washmachine_df])

### B. Data interpolation

In [None]:
def resampling_and_interpolation(df, data = ''):
    temp_dfs = []  # List to collect the temporary DataFrames

    for i in range(len(df) - 1):
        start, end = df.index[i], df.index[i + 1]
        time_diff = (end - start).total_seconds()

        if time_diff < 90:
            temp_df = df.iloc[i:i+2].resample('1S').interpolate(method='time')

        elif time_diff > 301:
            # Build a new index at 301-second (5 minutes and 1 second) steps across the gap
            new_index = pd.date_range(start=start + pd.Timedelta(seconds=301), end=end, freq='301S')
            # Reindex the df with the new index, filling missing values with 0
            # We'll use .iloc[:, [0]] to reference the first column regardless of its name
            temp_df = df.iloc[i:i+1, [0]].reindex(new_index, fill_value=0)
            temp_df.iloc[:, 0] = 0  # Set the first column to 0 for the new rows

        else : temp_df = df.iloc[i:i+2]

        temp_dfs.append(temp_df)

    resampled_df = pd.concat(temp_dfs)
    resampled_df = resampled_df[~resampled_df.index.duplicated(keep='first')]

    print(f"{data} data successfully interpolated, saving to .feather...")
    feather.write_dataframe(resampled_df, f'../data/processed/{data}_interpolated.feather')

    print(f"{data} data successfully saved as a .feather file, proceeding to save as a .csv...")
    resampled_df.to_csv(f'../data/processed/{data}_interpolated.csv')

    print(f"{data} data successfully saved as a .csv file!\n")

    return resampled_df

In [None]:
bidet_df_interpolated = resampling_and_interpolation(bidet_df.set_index('StartTime'), 'Bidet')
kitchen_df_interpolated = resampling_and_interpolation(kitchen_df.set_index('StartTime'), 'Kitchen')
shower_df_interpolated = resampling_and_interpolation(shower_df.set_index('StartTime'), 'Shower')
washbasin_df_interpolated = resampling_and_interpolation(washbasin_df.set_index('StartTime'), 'Wash Basin')
washing_machine_df_interpolated = resampling_and_interpolation(washmachine_df.set_index('StartTime'), 'Washing Machine')

Washing Machine data successfully interpolated, saving to .feather...
Washing Machine data successfully saved as a .feather file, proceeding to save as a .csv...
Washing Machine data successfully saved as a .csv file!
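
To make the first branch of `resampling_and_interpolation` concrete (gaps shorter than 90 seconds), here is a minimal sketch on two hypothetical readings four seconds apart. The timestamps and flow values are made up for illustration; the sketch only shows the upsample-then-interpolate step, not the full function.

```python
import pandas as pd

# Two hypothetical readings four seconds apart (values are made up).
idx = pd.to_datetime([1_570_000_000, 1_570_000_004], unit="s")
pair = pd.DataFrame({"bidetFlow": [0.0, 8.0]}, index=idx)

# Upsample to a 1-second grid, then fill the new rows linearly in time.
filled = pair.resample("1s").interpolate(method="time")

print(filled)
```

The result has one row per second between the two readings, with the intermediate flow values placed on the straight line joining them.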



# 4. Modelling

## 4.1 Exploratory Analysis

Here, I explored the auto-correlation and partial auto-correlation of the data and performed a seasonal decomposition to understand its temporal structure. Following this systematic approach helped uncover useful information about the nature of the data before proceeding to anomaly detection with the ADTK package.

### A. Auto-correlation and partial auto-correlation

#### A.1 ACF, PACF function

In [None]:
def acf_pacf_plots(time_series, data_type='Data', acf_lag=None, pacf_lag=None):

    if isinstance(time_series, pd.DataFrame): time_series = time_series.iloc[:, 1].values.squeeze()

    if acf_lag is None: acf_lag = min(len(time_series) - 1, 20)  # Default ACF lag
    if pacf_lag is None: pacf_lag = min(len(time_series) - 1, 20)  # Default PACF lag

    # Create subplots
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

    # Plot ACF
    plot_acf(time_series, ax=axes[0], lags=acf_lag)
    axes[0].set_title(f'Autocorrelation Plot (ACF) - {data_type} - {acf_lag} lags')

    # Plot PACF
    plot_pacf(time_series, ax=axes[1], lags=pacf_lag)
    axes[1].set_title(f'Partial Autocorrelation Plot (PACF) - {data_type} - {pacf_lag} lags')

    plt.tight_layout()

    plt.savefig(f'./visualisations/Anomaly Detection/ACF - PACF/ACF-PACF - {data_type} - {acf_lag, pacf_lag} acf-pacf lags.png')

    plt.show()
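
For readers who want the numbers behind an ACF plot, the sketch below computes the sample autocorrelation directly with NumPy on a hypothetical periodic signal. `sample_acf` is an illustrative helper written for this note, not a statsmodels function, and the signal is made up.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k = sum((x_t - m)(x_{t+k} - m)) / sum((x_t - m)^2)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    denom = np.sum(centred ** 2)
    return np.array([np.sum(centred[: len(x) - k] * centred[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical signal that repeats every 4 samples.
signal = np.tile([0.0, 1.0, 0.0, -1.0], 25)
r = sample_acf(signal, max_lag=4)

print(np.round(r, 3))
```

A signal with period 4 shows a strong negative correlation at lag 2 and a strong positive one at lag 4, which is the kind of pattern to look for in the fixture plots.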

#### A.2 Aggregate data

In [None]:
acf_pacf_plots(agg_df, data_type='Aggregate Data')

#### A.3 Bidet

In [None]:
acf_pacf_plots(bidet_df, data_type='Bidet Data')

#### A.4 Dishwasher

In [None]:
acf_pacf_plots(dishwash_df, data_type='Dishwasher Data')

#### A.5 Kitchen

In [None]:
acf_pacf_plots(kitchen_df, data_type='Kitchen Data')

#### A.6 Shower

In [None]:
acf_pacf_plots(shower_df, data_type='Shower Data')

#### A.7 Toilet

In [None]:
acf_pacf_plots(toilet_df, data_type='Toilet Data')

#### A.8 Wash Basin

In [None]:
acf_pacf_plots(washbasin_df, data_type='Wash Basin Data')

#### A.9 Washing Machine

In [None]:
acf_pacf_plots(washmachine_df, data_type='Washing Machine Data')

### B. Seasonal Decomposition

#### B.1 Decomposition function

In [None]:
def plot_seasonal_decompositions(time_series_datasets, data_type, plot_title='Seasonal Decomposition', periods=(10,), x_labels=None, y_labels=None):
    n = len(time_series_datasets)
    total_rows = n * 4  # 4 rows (Original, Trend, Seasonal, Residual) per dataset
    fig = make_subplots(rows=total_rows, cols=1, subplot_titles=[f"{dtype} - {component}" for dtype in data_type for component in ['Original', 'Trend', 'Seasonal', 'Residual']], vertical_spacing=0.06)
    colours = {'original': 'blue', 'trend': 'red', 'seasonal': 'green', 'residual': 'grey'}

    for i, data in enumerate(time_series_datasets, start=1):
        period = periods[i-1] if periods is not None else None
        decomposition = seasonal_decompose(data.iloc[:, 1].values, model='additive', period=period, extrapolate_trend='freq')

        components = [decomposition.observed, decomposition.trend, decomposition.seasonal, decomposition.resid]
        component_names = ['Original', 'Trend', 'Seasonal', 'Residual']

        for j, comp_data in enumerate(components, start=0):
            row_index = (i - 1) * 4 + j + 1
            fig.add_trace(go.Scatter(x=data.StartTime.values, y=comp_data, mode='lines', name=component_names[j], line=dict(color=colours[component_names[j].lower()]), showlegend=i==1), row=row_index, col=1)

            # Update x-axis labels only for the last component of each dataset
            if x_labels is not None and j == 3:  # Last component row
                fig.update_xaxes(title_text=x_labels[i-1], row=row_index, col=1)
            # Add y-axis label to the first component of each dataset with the dataset title
            if y_labels is not None and j == 0:  # First component row
                fig.update_yaxes(title_text=y_labels[i-1], row=row_index, col=1)

    fig.update_layout(height=400*total_rows, width=1400, title_text=f'{plot_title} - {periods[0]} periods', title_x=0.5, title_font=dict(size=24, color='black', family='Arial, bold'), legend_title='Components')

    pio.write_html(fig, f'./visualisations/Anomaly Detection/Seasonal Decompositions/{plot_title} - {periods[0]} periods.html')

    return fig

#### B.2 Aggregate data

In [None]:
agg_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[agg_df], data_type=['Aggregate'], periods=[2000], plot_title = 'Seasonal Decomposition of Aggregate Data')

#### B.3 Bidet

In [None]:
bidet_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[bidet_df], data_type=['Bidet'], periods=[2000], plot_title = 'Seasonal Decomposition of Bidet Data')

#### B.4 Dishwasher

In [None]:
dishwash_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[dishwash_df], data_type=['Dishwasher'], periods=[5], plot_title = 'Seasonal Decomposition of Dishwasher Data')

#### B.5 Kitchen

In [None]:
kitchen_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[kitchen_df], data_type=['Kitchen'], periods=[2000], plot_title = 'Seasonal Decomposition of Kitchen Data')

#### B.6 Shower

In [None]:
shower_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[shower_df], data_type=['Shower'], periods=[2000], plot_title = 'Seasonal Decomposition of Shower Data')

#### B.7 Toilet

In [None]:
toilet_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[toilet_df], data_type=['Toilet'], periods=[594], plot_title = 'Seasonal Decomposition of Toilet Data')

#### B.8 Wash Basin

In [None]:
basin_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[washbasin_df], data_type=['Wash Basin'], periods=[2000], plot_title = 'Seasonal Decomposition of Wash Basin Data')

#### B.9 Washing Machine

In [None]:
machine_decomp_plot = plot_seasonal_decompositions(time_series_datasets=[washmachine_df], data_type=['Washing Machine'], periods=[2000], plot_title = 'Seasonal Decomposition of Washing Machine Data')

## 4.2 ADTK modelling

Anomaly detection was performed for 5 fixtures: *bidet, kitchen faucet, shower, wash basin and washing machine*. The data for each fixture was resampled by *day* or *hour*, taking the sum of the values in each interval as the aggregate. This reduced the number of data points and the number of 0 values, which had been biasing the model heavily and causing it to label any non-zero value as an anomaly. I used 4 main detection methods, which are as follows:
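
As a rough illustration of why the aggregation helps, the sketch below builds a hypothetical per-second series that is almost entirely zero and compares the share of zeros before and after daily summation. The dates, durations and flow values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical per-second flow: three days of zeros with one short use per day.
idx = pd.date_range("2020-01-01", periods=3 * 86_400, freq="s")
flow = pd.Series(0.0, index=idx, name="flow")
for day in range(3):
    start = day * 86_400 + 8 * 3_600      # 08:00 on each day
    flow.iloc[start:start + 60] = 5.0     # one minute of use

# Daily totals, as used for the per-day samples.
daily = flow.resample("D").sum()

zero_share_raw = (flow == 0).mean()
zero_share_daily = (daily == 0).mean()

print(zero_share_raw, zero_share_daily)
```

In the raw series nearly every reading is zero, so any use looks unusual; after summation each day carries one meaningful total.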

### A. Inter Quartile Range

`InterQuartileRangeAD` is a widely used detector based on simple historical statistics, namely the interquartile range (IQR). A value is flagged as anomalous when it falls outside the range `[Q1−c×IQR, Q3+c×IQR]`, where **IQR=Q3−Q1** is the difference *between the 25% and 75% quantiles.*
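
The range can be worked through by hand. The sketch below is a plain NumPy calculation of the bounds on hypothetical daily totals; `iqr_bounds` is an illustrative helper, not part of ADTK, and ADTK's own quantile handling may differ in detail.

```python
import numpy as np

def iqr_bounds(values, c):
    """Return (lower, upper) = (Q1 - c*IQR, Q3 + c*IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - c * iqr, q3 + c * iqr

# Hypothetical daily totals; the last day is unusually high.
daily_totals = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 40.0])

lower, upper = iqr_bounds(daily_totals, c=1.5)
is_anomaly = (daily_totals < lower) | (daily_totals > upper)

print(lower, upper)
print(is_anomaly)
```

A larger `c` widens the range and flags fewer points, which is why different fixtures can call for different values of `c`.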

### B. Auto-regression

`AutoregressionAD` detects anomalous changes of autoregressive behavior in time series.
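
The idea can be sketched with a one-lag linear model: predict each value from the previous one and flag points whose prediction error is unusually large. This is a simplified stand-in written with NumPy under my own assumptions (a single lag, an IQR-style threshold with a factor of 3, made-up data); it is not ADTK's implementation.

```python
import numpy as np

# Hypothetical series alternating between 1 and 2, with one injected jump.
x = np.array([1.0 if i % 2 == 0 else 2.0 for i in range(40)])
x[20] = 10.0

# Fit x_t ~ a * x_{t-1} + b by least squares.
prev, curr = x[:-1], x[1:]
design = np.column_stack([prev, np.ones_like(prev)])
(a, b), *_ = np.linalg.lstsq(design, curr, rcond=None)

# Absolute prediction error for every t >= 1.
residual = np.abs(curr - (a * prev + b))

# Flag errors above Q3 + 3 * IQR of the residuals.
q1, q3 = np.percentile(residual, [25, 75])
threshold = q3 + 3.0 * (q3 - q1)
flagged = np.flatnonzero(residual > threshold) + 1  # +1 maps back to positions in x

print(flagged)
```

Only the position where the series breaks its usual alternation stands out, because that is the one value the previous reading fails to predict.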

### C. Persist

`PersistAD` compares each time series value with its previous values.
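
A minimal sketch of this comparison, assuming the reference is the median of the preceding window: the data, the window length of 3 and the variable names are hypothetical, and the code is a plain pandas illustration rather than ADTK's own pipeline.

```python
import pandas as pd

# Hypothetical daily totals with a single spike at position 8.
values = [10.0] * 8 + [30.0] + [10.0] * 8
s = pd.Series(values, index=pd.date_range("2020-01-01", periods=17, freq="D"))

window = 3
# Median of the `window` values immediately before each point.
previous_median = s.rolling(window).median().shift(1)

# How far each value sits above its recent past.
jump = s - previous_median

print(jump.idxmax(), jump.max())
```

The spike is the only day that departs from its recent past, so it is the only candidate for a positive-side anomaly.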

### D. Level Shift

`LevelShiftAD` detects shift of value level by tracking the difference between median values at two sliding time windows next to each other.
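
The two-window comparison can be sketched in plain pandas. `median_shift` is an illustrative helper and the data is made up; the sketch only mirrors the idea of differencing adjacent window medians and is not ADTK's implementation.

```python
import pandas as pd

def median_shift(series, window):
    """Absolute difference between the medians of the windows after and before each point."""
    rolled = series.rolling(window).median()
    left = rolled.shift(1)                 # median of the `window` points before t
    right = rolled.shift(-(window - 1))    # median of the `window` points starting at t
    return (right - left).abs()

idx = pd.date_range("2020-01-01", periods=20, freq="D")

# Hypothetical series with a lasting shift from 10 to 30 at position 10.
step = pd.Series([10.0] * 10 + [30.0] * 10, index=idx)

# Hypothetical series with a one-day spike at position 10 and no lasting shift.
spike = pd.Series([10.0] * 20, index=idx)
spike.iloc[10] = 30.0

step_shift = median_shift(step, window=3)
spike_shift = median_shift(spike, window=3)

print(step_shift.iloc[10], spike_shift.max())
```

The lasting shift produces a large difference between the two medians, while the single spike leaves both medians unchanged, so a median-based comparison does not react to it.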

### A. ADTK Anomaly Detection

#### A.1 Anomaly detector functions

##### A.1.1 Anomaly plot function

In [None]:
def anomaly_plot(s, anomalies, plot_title, plot_name, ad_type=None):

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=s.index, y=s.iloc[:, 0], mode='lines', name='Original Data', line=dict(color='blue')))
    fig.add_trace(go.Scatter(x=s.loc[anomalies.iloc[:, 0]].iloc[:, 0].index, y=s.loc[anomalies.iloc[:, 0]].iloc[:, 0].values, mode='markers', name='Anomaly', marker=dict(color='red', size=6)))

    if ad_type in ('Persist', 'Level Shift'):
        fig.add_trace(go.Scatter(x=[None], y=[None], mode='lines', name=f'{ad_type} Anomaly', marker=dict(color='orange'), line=dict(width=10, color='orange')))
        for anomaly_time in s.loc[anomalies.iloc[:, 0]].iloc[:, 0].index: fig.add_vrect(x0=anomaly_time, x1=anomaly_time, fillcolor='orange', opacity=0.5, line_width=5, line=dict(color='orange'))

    # Update plot layout
    fig.update_layout(title={'text': plot_title, 'y': 0.9, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
        title_font=dict(size=24, color='black', family='Arial, bold'),
        xaxis_title='Time', yaxis_title='Value',
        xaxis_title_font=dict(size=18, color='black', family='Arial, bold'),
        yaxis_title_font=dict(size=18, color='black', family='Arial, bold'),
        xaxis_rangeslider_visible=True
    )

    pio.write_html(fig, f'./visualisations/Anomaly Detection/Anomaly Plots/{plot_name}.html')

    return anomalies, s.loc[anomalies.iloc[:, 0]].iloc[:, 0], fig

##### A.1.2 InterQuartileRangeAD

`InterQuartileRangeAD` is a widely used detector based on simple historical statistics, namely the interquartile range (IQR). A value is flagged as anomalous when it falls outside the range `[Q1−c×IQR, Q3+c×IQR]`, where **IQR=Q3−Q1** is the difference *between the 25% and 75% quantiles.*

This detector is usually preferred to `QuantileAD` when only a tiny portion, or even none, of the training data is anomalous.

In [None]:
def ad_InterQuartileRangeAD(dataframe, c, plot_title='Anomaly Detector', plot_name='Anomaly Plot') :
    # s = validate_series(dataframe)
    iqr_model = InterQuartileRangeAD(c=c)
    anomalies = iqr_model.fit_detect(dataframe)

    return anomaly_plot(s=dataframe, anomalies=anomalies, plot_title=plot_title, plot_name=plot_name, ad_type='IQR_range'), iqr_model

##### A.1.3 AutoregressionAD

`AutoregressionAD` detects anomalous changes of autoregressive behavior in time series. Internally, it is implemented as a [pipenet](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#Pipenet) with transformers [Retrospect](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#Retrospect) and [RegressionResidual](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#RegressionResidual).

In [None]:
def ad_AutoregressionAD(dataframe, n_steps, step_size, c, plot_title='Anomaly Detector', plot_name='Anomaly Plot') :
    # s = validate_series(dataframe)
    auto_regg_model = AutoregressionAD(n_steps=n_steps, step_size=step_size, c=c)
    anomalies = auto_regg_model.fit_detect(dataframe)

    return anomaly_plot(s=dataframe[~anomalies.isna().any(axis=1)], anomalies=anomalies.dropna(), plot_title=plot_title, plot_name=plot_name, ad_type='auto_regg'), auto_regg_model

##### A.1.4 PersistAD

`PersistAD` compares each time series value with its previous values. Internally, it is implemented as a [pipenet](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#Pipenet) with transformer [DoubleRollingAggregate](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#DoubleRollingAggregate).

In [None]:
def ad_PersistAD(dataframe, persist_window=24, c=1.0, persist_side='positive', plot_title='Anomaly Detector', plot_name='Anomaly Plot') :
    # s = validate_series(dataframe)
    persist_model = PersistAD(window=persist_window, c=c, side=persist_side)
    anomalies = persist_model.fit_detect(dataframe)

    return anomaly_plot(s=dataframe[~anomalies.isna().any(axis=1)], anomalies=anomalies.dropna(), plot_title=plot_title, plot_name=plot_name, ad_type='Persist'), persist_model

##### A.1.5 LevelShiftAD

`LevelShiftAD` detects shift of value level by tracking the difference between median values at two sliding time windows next to each other. It is not sensitive to instantaneous spikes and could be a good choice if noisy outliers happen frequently. Internally, it is implemented as a [pipenet](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#Pipenet) with transformer [DoubleRollingAggregate](https://adtk.readthedocs.io/en/stable/notebooks/demo.html#DoubleRollingAggregate).

In [None]:
def ad_LevelShiftAD(dataframe, c=1.0, level_shift_side='both', level_shift_window=5, plot_title='Anomaly Detector', plot_name='Anomaly Plot') :
    level_shift_model = LevelShiftAD(c=c, side=level_shift_side, window=level_shift_window)
    anomalies = level_shift_model.fit_detect(dataframe)

    return anomaly_plot(s=dataframe[~anomalies.isna().any(axis=1)], anomalies=anomalies.dropna(), plot_title=plot_title, plot_name=plot_name, ad_type='Level Shift'), level_shift_model

#### A.2 Bidet

##### A.2.1 Per day samples - *InterQuartileRangeAD*

In [None]:
data = bidet_df_interpolated.resample('D').sum()
c = 1.5
plot_title = f'Bidet Data Anomalies <i>(per day samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = 'Bidet/Bidet Data Anomalies (per day samples - InterQuartileRangeAD)'

[bidet_D_IQR_ano, bidet_D_IQR_ano_val, bidet_D_IQR_ano_plot], bidet_D_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

bidet_D_IQR_ano_plot.show()

##### A.2.2 Per day samples - *AutoregressionAD*

In [None]:
data = bidet_df_interpolated.resample('D').sum()
n_steps = 10
step_size = 5
c = 3.5
plot_title = f'Bidet Data Anomalies <i>(per day samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = 'Bidet/Bidet Data Anomalies (per day samples - AutoregressionAD)'

[bidet_D_autoreg_ano, bidet_D_autoreg_ano_val, bidet_D_autoreg_ano_plot], bidet_D_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

bidet_D_autoreg_ano_plot.show()

##### A.2.3 Per day samples - *PersistAD*

In [None]:
data = bidet_df_interpolated.resample('D').sum()
data.loc['2020-02-14':, data.columns[0]] += 20000           # simulating data spike

c = 3
persist_window = 14
persist_side = 'positive'
plot_title = f'Bidet Data Anomalies <i>(per day samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = 'Bidet/Bidet Data Anomalies (per day samples - PersistAD)'

[bidet_D_persist_ano, bidet_D_persist_ano_val, bidet_D_persist_ano_plot], bidet_D_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

bidet_D_persist_ano_plot.show()

##### A.2.4 Per day samples - *LevelShiftAD*

In [None]:
data = bidet_df_interpolated.resample('D').sum()
data.loc['2019-10-29':'2020-02-14', data.columns[0]] += 20000

c = 1.5
level_shift_side = 'both'
level_shift_window = 14
plot_title = f'Bidet Data Anomalies <i>(per day samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = 'Bidet/Bidet Data Anomalies (per day samples - LevelShiftAD)'

[bidet_D_LevelShift_ano, bidet_D_LevelShift_ano_val, bidet_D_LevelShift_ano_plot], bidet_D_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

bidet_D_LevelShift_ano_plot.show()

#### A.3 Kitchen

##### A.3.1 Per day samples - *InterQuartileRangeAD*

In [None]:
data = kitchen_df_interpolated.resample('D').sum()
c = 4.0
plot_title = f'Kitchen Anomalies <i>(per day samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = 'Kitchen/Kitchen Anomalies (per day samples - InterQuartileRangeAD)'

[kitchen_D_IQR_ano, kitchen_D_IQR_ano_val, kitchen_D_IQR_ano_plot], kitchen_D_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

kitchen_D_IQR_ano_plot.show()

##### A.3.2 Per day samples - *AutoregressionAD*

In [None]:
data = kitchen_df_interpolated.resample('D').sum()
n_steps = 7
step_size = 3
c = 3.5
plot_title = f'Kitchen Anomalies <i>(per day samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = 'Kitchen/Kitchen Anomalies (per day samples - AutoregressionAD)'

[kitchen_D_autoreg_ano, kitchen_D_autoreg_ano_val, kitchen_D_autoreg_ano_plot], kitchen_D_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

kitchen_D_autoreg_ano_plot.show()

##### A.3.3 Per day samples - *PersistAD*

In [None]:
data = kitchen_df_interpolated.resample('D').sum()
data.loc['2020-02-14':, data.columns[0]] += 25000

c = 3
persist_window = 14
persist_side = 'positive'
plot_title = f'Kitchen Anomalies <i>(per day samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = 'Kitchen/Kitchen Anomalies (per day samples - PersistAD)'

[kitchen_D_persist_ano, kitchen_D_persist_ano_val, kitchen_D_persist_ano_plot], kitchen_D_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

kitchen_D_persist_ano_plot.show()

##### A.3.4 Per day samples - *LevelShiftAD*

In [None]:
data = kitchen_df_interpolated.resample('D').sum()
data.loc['2019-10-29':'2020-02-14', data.columns[0]] += 25000

c = 1.2
level_shift_side = 'both'
level_shift_window = 2
plot_title = f'Kitchen Anomalies <i>(per day samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = 'Kitchen/Kitchen Anomalies (per day samples - LevelShiftAD)'

[kitchen_D_LevelShift_ano, kitchen_D_LevelShift_ano_val, kitchen_D_LevelShift_ano_plot], kitchen_D_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

kitchen_D_LevelShift_ano_plot.show()

#### A.4 Shower

##### A.4.1 Per day samples - *InterQuartileRangeAD*

In [None]:
data = shower_df_interpolated.resample('D').sum()
c = 2.0
plot_title = f'Shower Anomalies <i>(per day samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = 'Shower/Shower Anomalies (per day samples - InterQuartileRangeAD)'

[shower_D_IQR_ano, shower_D_IQR_ano_val, shower_D_IQR_ano_plot], shower_D_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

shower_D_IQR_ano_plot.show()

##### A.4.2 Per day samples - *AutoregressionAD*

In [None]:
data = shower_df_interpolated.resample('D').sum()
n_steps = 5
step_size = 14
c = 1.5
plot_title = f'Shower Anomalies <i>(per day samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = 'Shower/Shower Anomalies (per day samples - AutoregressionAD)'

[shower_D_autoreg_ano, shower_D_autoreg_ano_val, shower_D_autoreg_ano_plot], shower_D_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

shower_D_autoreg_ano_plot.show()

##### A.4.3 Per day samples - *PersistAD*

In [None]:
data = shower_df_interpolated.resample('D').sum()
data.loc['2020-02-14':, data.columns[0]] += 45000

c = 1.5
persist_window = 15
persist_side = 'positive'
plot_title = f'Shower Anomalies <i>(per day samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = 'Shower/Shower Anomalies (per day samples - PersistAD)'

[shower_D_persist_ano, shower_D_persist_ano_val, shower_D_persist_ano_plot], shower_D_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

shower_D_persist_ano_plot.show()

##### A.4.4 Per day samples - *LevelShiftAD*

In [None]:
data = shower_df_interpolated.resample('D').sum()
data.loc['2019-10-29':'2020-02-14', data.columns[0]] += 55000

c = 1.5
level_shift_side = 'both'
level_shift_window = 1
plot_title = f'Shower Anomalies <i>(per day samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = 'Shower/Shower Anomalies (per day samples - LevelShiftAD)'

[shower_D_LevelShift_ano, shower_D_LevelShift_ano_val, shower_D_LevelShift_ano_plot], shower_D_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

shower_D_LevelShift_ano_plot.show()

#### A.5 Wash Basin

##### A.5.1 Per day samples - *InterQuartileRangeAD*

In [None]:
data = washbasin_df[(washbasin_df['StartTime'].dt.year != 1970)].set_index('StartTime').resample('D').sum()
c = 2.0
plot_title = f'Wash Basin Anomalies <i>(per day samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = 'Wash Basin/Wash Basin Anomalies (per day samples - InterQuartileRangeAD)'

[washbasin_D_IQR_ano, washbasin_D_IQR_ano_val, washbasin_D_IQR_ano_plot], washbasin_D_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

washbasin_D_IQR_ano_plot.show()

##### A.5.2 Per day samples - *AutoregressionAD*

In [None]:
data = washbasin_df[(washbasin_df['StartTime'].dt.year != 1970)].set_index('StartTime').resample('D').sum()
n_steps = 5
step_size = 14
c = 1.5
plot_title = f'Wash Basin Anomalies <i>(per day samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = 'Wash Basin/Wash Basin Anomalies (per day samples - AutoregressionAD)'

[washbasin_D_autoreg_ano, washbasin_D_autoreg_ano_val, washbasin_D_autoreg_ano_plot], washbasin_D_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

washbasin_D_autoreg_ano_plot.show()

##### A.5.3 Per day samples - *PersistAD*

In [None]:
data = washbasin_df[(washbasin_df['StartTime'].dt.year != 1970)].set_index('StartTime').resample('D').sum()
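# Simulate an anomaly: add 45000 to every daily total from 2020-02-14 onward,
# which creates a sudden jump for PersistAD to pick up.
data.loc['2020-02-14':, data.columns[0]] += 45000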

c = 1.5
persist_window = 15
persist_side = 'positive'
plot_title = f'Wash Basin Anomalies <i>(per day samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = 'Wash Basin/Wash Basin Anomalies (per day samples - PersistAD)'

[washbasin_D_persist_ano, washbasin_D_persist_ano_val, washbasin_D_persist_ano_plot], washbasin_D_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

washbasin_D_persist_ano_plot.show()

##### A.5.4 Per day samples - *LevelShiftAD*

In [None]:
data = washbasin_df[(washbasin_df['StartTime'].dt.year != 1970)].set_index('StartTime').resample('D').sum()
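# Simulate a level shift: add 55000 to every daily total between
# 2019-10-29 and 2020-02-14 (both ends inclusive).
data.loc['2019-10-29':'2020-02-14', data.columns[0]] += 55000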

c = 1.5
level_shift_side = 'both'
level_shift_window = 1
plot_title = f'Wash Basin Anomalies <i>(per day samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = 'Wash Basin/Wash Basin Anomalies (per day samples - LevelShiftAD)'

[washbasin_D_LevelShift_ano, washbasin_D_LevelShift_ano_val, washbasin_D_LevelShift_ano_plot], washbasin_D_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

washbasin_D_LevelShift_ano_plot.show()

#### A.6 Washing Machine

##### A.6.1 Per hour samples - *InterQuartileRangeAD*

In [None]:
data = washmachine_df_interpolated.resample('H').sum()
c = 3.5
plot_title = f'Washing Machine Anomalies <i>(per hour samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = 'Washing Machine/Washing Machine Anomalies (per hour samples - InterQuartileRangeAD)'

[washmachine_D_IQR_ano, washmachine_D_IQR_ano_val, washmachine_D_IQR_ano_plot], washmachine_D_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

washmachine_D_IQR_ano_plot.show()

##### A.6.2 Per hour samples - *AutoregressionAD*

In [None]:
data = washmachine_df_interpolated.resample('H').sum()
n_steps = 12
step_size = 6
c = 1.0
plot_title = f'Washing Machine Anomalies <i>(per hour samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = 'Washing Machine/Washing Machine Anomalies (per hour samples - AutoregressionAD)'

[washmachine_D_autoreg_ano, washmachine_D_autoreg_ano_val, washmachine_D_autoreg_ano_plot], washmachine_D_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

washmachine_D_autoreg_ano_plot.show()

##### A.6.3 Per hour samples - *PersistAD*

In [None]:
data = washmachine_df_interpolated.resample('H').sum()
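# Simulate an anomaly: add 45000 to every hourly total from 2020-02-14 onward,
# which creates a sudden jump for PersistAD to pick up.
data.loc['2020-02-14':, data.columns[0]] += 45000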

c = 3.5
persist_window = 5
persist_side = 'positive'
plot_title = f'Washing Machine Anomalies <i>(per hour samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = 'Washing Machine/Washing Machine Anomalies (per hour samples - PersistAD)'

[washmachine_D_persist_ano, washmachine_D_persist_ano_val, washmachine_D_persist_ano_plot], washmachine_D_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

washmachine_D_persist_ano_plot.show()

##### A.6.4 Per day samples - *LevelShiftAD*

In [None]:
data = washmachine_df_interpolated.resample('D').sum()
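# Simulate a level shift: add 55000 to every daily total between
# 2019-10-29 and 2020-02-14 (both ends inclusive).
data.loc['2019-10-29':'2020-02-14', data.columns[0]] += 55000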

c = 3.5
level_shift_side = 'both'
level_shift_window = 6
plot_title = f'Washing Machine Anomalies <i>(per day samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = 'Washing Machine/Washing Machine Anomalies (per day samples - LevelShiftAD)'

[washmachine_D_LevelShift_ano, washmachine_D_LevelShift_ano_val, washmachine_D_LevelShift_ano_plot], washmachine_D_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

washmachine_D_LevelShift_ano_plot.show()

# 5. Evaluation

Since no labelled anomaly data was available to compare the detector results against, the evaluation is based solely on visual inspection of the identified anomalies. The IQR detector mostly flags outliers as anomalies, which is expected given its statistical nature. The autoregressive detector flags a few more anomalies, based on the patterns and trends observed in the historical data, and unlike IQR, not all of them are outliers. The Persist detector, which specialises in sudden spikes and drops in the data sequence, detected the expected anomalies. We also simulated a spike by manipulating the data, and the detector identified it as an anomaly alongside the other spikes and drops. The LevelShift detector, which specialises in sustained shifts in the value level of the data, likewise identified a simulated level-shift event. Shown below are examples from the bidet data where the Persist and LevelShift detectors performed well.

![image.png](attachment:image.png)

![image.png](attachment:image.png)
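Because the simulated level shift was injected at known dates, the visual check can be complemented by a simple numeric one. The sketch below is illustrative only and is not part of the original analysis. It assumes the detector output has already been reduced to a plain list of flagged dates. The function `score_injected_shift`, the `tolerance_days` parameter, and the example `flagged_days` values are all hypothetical. It checks whether a flag falls near each boundary of the injected window and lists the remaining flags, which may be genuine anomalies or false positives.

```python
from datetime import date, timedelta


def score_injected_shift(flagged, start, end, tolerance_days=1):
    """Check whether flagged dates fall near the boundaries of an injected shift.

    flagged        -- iterable of dates the detector marked as anomalous
    start, end     -- first and last date of the injected window
    tolerance_days -- how far from a boundary a flag may be and still count
    """
    tol = timedelta(days=tolerance_days)
    near_start = [d for d in flagged if abs(d - start) <= tol]
    near_end = [d for d in flagged if abs(d - end) <= tol]
    matched = set(near_start) | set(near_end)
    return {
        "start_detected": bool(near_start),
        "end_detected": bool(near_end),
        # Flags unrelated to the injection: real anomalies or false positives.
        "other_flags": sorted(set(flagged) - matched),
    }


# Hypothetical detector output, for illustration only (not real results).
flagged_days = [date(2019, 10, 29), date(2019, 12, 25), date(2020, 2, 15)]

# Window boundaries taken from the simulated level shift used in this notebook.
result = score_injected_shift(flagged_days, date(2019, 10, 29), date(2020, 2, 14))
print(result)
```

A tolerance of one day is used here because the level returns to normal on the day after the last shifted sample. A wider detector window may call for a larger tolerance.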

# 6. Sample User Program

## 6.1 Inter Quartile Range

In [None]:
file_path = '../data/raw/aggregatedWholeHouse.csv'
data_name = ''
resampling_frequency = 'D'

c = 0.5

plot_title = f'{data_name} Data Anomalies <i>(per {resampling_frequency} samples - InterQuartileRangeAD(c={c}))</i>'
plot_name = f'{data_name} Data Anomalies (per {resampling_frequency} samples - InterQuartileRangeAD)'

data = pd.read_csv(file_path)
data.rename(columns={data.columns[0]: 'StartTime'}, inplace=True)
if 'StartTime' in data.columns: data['StartTime'] = pd.to_datetime(data['StartTime'], unit='s')
if 'EndTime' in data.columns: data['EndTime'] = pd.to_datetime(data['EndTime'], unit='s')
data = data.set_index('StartTime').resample(resampling_frequency).sum()


[data_IQR_ano, data_IQR_ano_val, data_IQR_ano_plot], data_IQR_model = ad_InterQuartileRangeAD(validate_series(data), c=c, plot_title=plot_title, plot_name=plot_name)

data_IQR_ano_plot.show()

## 6.2 Auto Regressor

In [None]:
file_path = '../data/raw/aggregatedWholeHouse.csv'
data_name = ''
resampling_frequency = 'D'

n_steps = 12
step_size = 6
c = 1.0

plot_title = f'{data_name} Data Anomalies <i>(per {resampling_frequency} samples - AutoregressionAD(n_steps={n_steps}, step_size={step_size}, c={c}))</i>'
plot_name = f'{data_name} Data Anomalies (per {resampling_frequency} samples - AutoregressionAD)'

data = pd.read_csv(file_path)
data.rename(columns={data.columns[0]: 'StartTime'}, inplace=True)
if 'StartTime' in data.columns: data['StartTime'] = pd.to_datetime(data['StartTime'], unit='s')
if 'EndTime' in data.columns: data['EndTime'] = pd.to_datetime(data['EndTime'], unit='s')
data = data.set_index('StartTime').resample(resampling_frequency).sum()

[data_autoreg_ano, data_autoreg_ano_val, data_autoreg_ano_plot], data_autoreg_model = ad_AutoregressionAD(validate_series(data), n_steps=n_steps, step_size=step_size, c=c, plot_title=plot_title, plot_name=plot_name)

data_autoreg_ano_plot.show()

## 6.3 Persist

In [None]:
file_path = '../data/raw/aggregatedWholeHouse.csv'
data_name = ''
resampling_frequency = 'D'

c = 1.0
persist_window = 14
persist_side = 'positive'

plot_title = f'{data_name} Data Anomalies <i>(per {resampling_frequency} samples - PersistAD(c={c}, side=\'{persist_side}\', window={persist_window}))</i>'
plot_name = f'{data_name} Data Anomalies (per {resampling_frequency} samples - PersistAD)'

data = pd.read_csv(file_path)
data.rename(columns={data.columns[0]: 'StartTime'}, inplace=True)
if 'StartTime' in data.columns: data['StartTime'] = pd.to_datetime(data['StartTime'], unit='s')
if 'EndTime' in data.columns: data['EndTime'] = pd.to_datetime(data['EndTime'], unit='s')
data = data.set_index('StartTime').resample(resampling_frequency).sum()

[data_persist_ano, data_persist_ano_val, data_persist_ano_plot], data_persist_model = ad_PersistAD(validate_series(data), persist_window=persist_window, c=c, persist_side=persist_side, plot_title=plot_title, plot_name=plot_name)

data_persist_ano_plot.show()

## 6.4 Level Shift

In [None]:
file_path = '../data/raw/aggregatedWholeHouse.csv'
data_name = ''
resampling_frequency = 'D'

c = 1.5
level_shift_window = 7
level_shift_side = 'both'

plot_title = f'{data_name} Data Anomalies <i>(per {resampling_frequency} samples - LevelShiftAD(c={c}, side=\'{level_shift_side}\', window={level_shift_window}))</i>'
plot_name = f'{data_name} Data Anomalies (per {resampling_frequency} samples - LevelShiftAD)'

data = pd.read_csv(file_path)
data.rename(columns={data.columns[0]: 'StartTime'}, inplace=True)
if 'StartTime' in data.columns: data['StartTime'] = pd.to_datetime(data['StartTime'], unit='s')
if 'EndTime' in data.columns: data['EndTime'] = pd.to_datetime(data['EndTime'], unit='s')
data = data.set_index('StartTime').resample(resampling_frequency).sum()

[data_LevelShift_ano, data_LevelShift_ano_val, data_LevelShift_ano_plot], data_LevelShift_model = ad_LevelShiftAD(validate_series(data), c, level_shift_side, level_shift_window, plot_title=plot_title, plot_name=plot_name)

data_LevelShift_ano_plot.show()