<img align="center" style="max-width: 1000px" src="banner.png">

<img align="right" style="max-width: 200px; height: auto" src="hsg_logo.png">

##  Lab 06 - "Autoencoder Neural Networks"

GSERM'21 course "Deep Learning: Fundamentals and Applications", University of St. Gallen

In the last lab we learned how to implement, train, and apply our first **Autoencoder Neural Network (AENN)** using a Python library named `PyTorch`. AENNs learn how to **encode** the input data into a low-dimensional representation. At the same time, the AENN learns how to **decode** the original data back from the encoded representation. The decoded data, usually referred to as the "reconstruction", should match the original input as closely as possible. In this lab, we aim to leverage that knowledge by applying it to a set of self-coding assignments.
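
As a quick refresher, the encode and decode steps can be sketched on random toy data. This is only a minimal illustration, not the lab's model: the sizes, the single-layer `toy_encoder` and `toy_decoder`, and the choice of mean squared error as reconstruction loss are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical sizes: 8 records with 10 features each, compressed to 3 dimensions
n_records, n_features, n_latent = 8, 10, 3
x = torch.rand(n_records, n_features)

# stand-in encoder and decoder, each a single linear layer (assumption)
toy_encoder = nn.Linear(n_features, n_latent)
toy_decoder = nn.Linear(n_latent, n_features)

z = toy_encoder(x)      # low-dimensional representation of the input
x_hat = toy_decoder(z)  # reconstruction of the input

# mean squared error between the input and its reconstruction
reconstruction_loss = nn.functional.mse_loss(x_hat, x)

print(z.shape, x_hat.shape, reconstruction_loss.item())
```

The reconstruction has the same shape as the input, while the encoded representation only keeps three values per record.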

Before we start let's watch a motivational video:

In [None]:
from IPython.display import YouTubeVideo
# GitHub Arctic Code Vault
# YouTubeVideo('fzI9FNjXQ0o', width=800, height=400)

As always, please don't hesitate to ask your questions during the lab, post them in our CANVAS (StudyNet) forum (https://learning.unisg.ch), or send us an email (using the course email).

## 1. Assignment Objectives:

Similar to today's lab session, after today's self-coding assignments you should be able to:

>1. Understand the **basic concepts, intuitions and major building blocks** of autoencoder neural networks.
>2. **Pre-process** categorical financial data to learn a model of its characteristics and patterns.
>3. Apply autoencoder neural networks to **detect anomalies** in large-scale financial data.
>4. **Interpret the detection results** of the network as well as its reconstruction loss.

## 2. Setup of the Jupyter Notebook Environment

As a next step, let's import the libraries needed throughout the lab:

In [None]:
import warnings
warnings.filterwarnings('ignore')

Similar to the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. We will mostly use the `PyTorch`, `NumPy`, `Pandas`, `Sklearn`, `Matplotlib`, `Seaborn`, and a few utility libraries throughout the lab:

In [None]:
# import python data science and utility libraries
import os, sys, itertools, urllib, io
import datetime as dt
import pandas as pd
import pandas_datareader as dr
import numpy as np

Import the Python machine / deep learning libraries:

In [None]:
# pytorch libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from torch.utils.data import dataloader

Import Python plotting libraries and set general plotting parameters:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [10, 5]
plt.rcParams['figure.dpi']= 150

Enable notebook matplotlib inline plotting:

In [None]:
%matplotlib inline

Create a structure of notebook sub-directories to store the data as well as the trained neural network models:

In [None]:
if not os.path.exists('./data'): os.makedirs('./data')  # create data directory
if not os.path.exists('./models'): os.makedirs('./models')  # create trained models directory

Set a random seed value to obtain reproducible results:

In [None]:
# init deterministic seed
seed_value = 1234
np.random.seed(seed_value) # set numpy seed
torch.manual_seed(seed_value); # set pytorch seed CPU
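
To see what the seed buys us, the following small sketch re-seeds both libraries and checks that the same random numbers are drawn again. The variable names are illustrative, and the check only covers CPU random number generation.

```python
import numpy as np
import torch

seed_value = 1234

# first draw after seeding
np.random.seed(seed_value)
torch.manual_seed(seed_value)
np_first, torch_first = np.random.rand(3), torch.rand(3)

# re-seed and draw again
np.random.seed(seed_value)
torch.manual_seed(seed_value)
np_second, torch_second = np.random.rand(3), torch.rand(3)

# both comparisons hold because the generators were reset to the same state
print(np.array_equal(np_first, np_second), torch.equal(torch_first, torch_second))
```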

## 3. Autoencoder Neural Networks (AENNs) Assignments

### 3.1 Dataset Download and Data Assessment

Nowadays, organizations accelerate the digitization and reconfiguration of business processes [4], affecting in particular Accounting Information Systems (AIS) or, more generally, Enterprise Resource Planning (ERP) systems.

Steadily, these systems collect vast quantities of electronic evidence at an almost atomic level. This observation holds in particular for the journal entries of an organization recorded in its general ledger and sub-ledger accounts. SAP, one of the most prominent ERP software providers, estimates that approx. 76% of the world's transaction revenue touches one of their systems [5].

The illustration in **Figure 1** depicts a hierarchical view of an Accounting Information System (AIS) recording process and journal entry information in designated database tables. In the context of fraud examinations, the data collected by such systems may contain valuable traces of a potential fraud scheme.

<img align="middle" style="max-width: 700px; height: auto" src="accounting.png">

**Figure 1:** Hierarchical view of an Accounting Information System (AIS) that records distinct layers of abstraction, namely (1) the business process information, (2) the accounting information as well as the (3) technical journal entry information in designated database tables.

In this section of the lab notebook, we will conduct a descriptive analysis of the lab's financial dataset. Furthermore, we will apply some necessary pre-processing steps to train a deep neural network. The lab is based on a derivation of the **"Synthetic Financial Dataset For Fraud Detection"** by Lopez-Rojas [6] available via the Kaggle predictive modeling and analytics competitions platform that can be obtained using the following link: https://www.kaggle.com/ntnu-testimon/paysim1.

Let's start loading the dataset and investigate its structure and attributes:

In [None]:
# load the dataset into the notebook kernel
url = 'https://raw.githubusercontent.com/HSG-AIML/LabGSERM/master/lab_06/data/fraud_dataset_v2.csv'
ori_dataset = pd.read_csv(url)

Let's also check the dataset dimensionalities for completeness: 

In [None]:
# inspect the dataset's dimensionalities
now = dt.datetime.utcnow().strftime("%Y.%m.%d-%H:%M:%S")
print('[LOG {}] transactional dataset of {} rows and {} columns retrieved.'.format(now, ori_dataset.shape[0], ori_dataset.shape[1]))

#### 3.1.1 Initial Data and Attribute Assessment

We augmented the dataset and renamed the attributes to mimic a real-world dataset that one usually observes in SAP-ERP systems as part of SAP's Finance and Cost controlling (FICO) module. 

The dataset contains a subset of seven categorical and two numerical attributes available in the FICO BKPF (containing the posted journal entry headers) and BSEG (containing the posted journal entry segments) tables. Below you will find a list of the individual attributes as well as a brief description of their respective semantics:

>- `BELNR`: the accounting document number,
>- `BUKRS`: the company code,
>- `BSCHL`: the posting key,
>- `HKONT`: the posted general ledger account,
>- `PRCTR`: the posted profit center,
>- `WAERS`: the currency key,
>- `KTOSL`: the general ledger account key,
>- `DMBTR`: the amount in the local currency,
>- `WRBTR`: the amount in the document currency.

Let's also have a closer look into the top 10 rows of the dataset:

In [None]:
# inspect top rows of dataset
ori_dataset.head(10) 

You may also have noticed the attribute `label` in the data. We will use this field throughout the lab to evaluate the quality of our trained models. The field describes the true nature of each transaction: either a **regular** transaction (denoted by `regular`) or an **anomaly** (denoted by `global` and `local`). Let's have a closer look at the distribution of the regular vs. anomalous transactions in the dataset:

In [None]:
# number of anomalies vs. regular transactions
ori_dataset.label.value_counts()

Ok, the statistic reveals that, similar to real-world scenarios, we are facing a highly "unbalanced" dataset. Overall, the dataset contains only a small fraction of **100 (0.018%)** anomalous transactions. These 100 anomalous entries comprise **70 (0.013%)** "global" anomalies and **30 (0.005%)** "local" anomalies, as introduced in section 1.2.
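
The class shares can also be computed directly with `value_counts(normalize=True)`. The sketch below does not touch the real dataset: it rebuilds a toy label column from the anomaly counts quoted above, and the number of regular entries (532,909) is an assumption derived from the total of 533,009 records reported later in this notebook.

```python
import pandas as pd

# toy label column rebuilt from the quoted counts (not the real dataset)
toy_label = pd.Series(['regular'] * 532909 + ['global'] * 70 + ['local'] * 30)

counts = toy_label.value_counts()
shares = toy_label.value_counts(normalize=True) * 100  # shares in percent

# one row per class with its absolute count and percentage share
print(pd.DataFrame({'count': counts, 'share_pct': shares.round(4)}))
```

On the real data, the same two calls can be applied to `ori_dataset.label` before the column is removed.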

In [None]:
# remove the "ground-truth" label information for the following steps of the lab
label = ori_dataset.pop('label')

#### 3.1.2 Pre-Processing of Categorical Transaction Attributes

From the initial data assessment above, we can observe that the majority of attributes recorded in AIS- and ERP-systems correspond to categorical (discrete) attribute values, e.g. the posting date, the general ledger account, the posting type, the currency. Let's have a more detailed look into the distribution of two dataset attributes, namely (1) the posting key `BSCHL` as well as (2) the general ledger account `HKONT`:

In [None]:
# prepare to plot posting key and general ledger account side by side
fig, ax = plt.subplots(1, 2)
fig.set_figwidth(20)

# plot the distribution of the posting key attribute
g = sns.countplot(x=ori_dataset['BSCHL'], ax=ax[0])

# set axis labels
g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_xlabel('BSCHL Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'Posting Key\' attribute values', fontsize=20)

# plot the distribution of the general ledger attribute
g = sns.countplot(x=ori_dataset['HKONT'], ax=ax[1])

# set axis labels
g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_xlabel('HKONT Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'General Ledger\' attribute values', fontsize=20);

Unfortunately, neural networks are, in general, not designed to be trained directly on categorical data and require numeric input attributes. One simple way to meet this requirement is by applying a technique referred to as **"one-hot" encoding**. Using this encoding technique, we will derive a numerical representation of each of the categorical attribute values. One-hot encoding creates a new binary column for each categorical attribute value present in the original data.

Let's have a look at the example shown in **Figure 2** below. The **categorical attribute "Receiver"** below contains the names "John", "Timur" and "Marco". We "one-hot" encode the names by creating a separate binary column for each possible name value observable in the "Receiver" column. Each transaction that contains the value "John" in the "Receiver" column is then encoded with 1.0 in the newly created "John" column and 0.0 in all other generated name columns.

<img align="middle" style="max-width: 600px; height: auto" src="encoding.png">

**Figure 2:** Exemplary one-hot encoding of the distinct `Receiver` attribute values into specific binary ("one-hot") columns. Thereby, each attribute value observable in the dataset results in a separate column. The column value `1.0` denotes the occurrence of the attribute value in the corresponding journal entry. In contrast, the column value `0.0` indicates the absence of the attribute value in the corresponding journal entry.
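
The encoding in Figure 2 can be reproduced on a toy data frame. This is a minimal sketch: the four toy transactions are made up for illustration, and `dtype=float` is passed so that the columns hold `1.0` and `0.0` rather than boolean values.

```python
import pandas as pd

# toy transactions using the receiver names from Figure 2 (made-up rows)
toy = pd.DataFrame({'Receiver': ['John', 'Timur', 'Marco', 'John']})

# one binary column per distinct name; dtype=float yields 1.0 / 0.0 values
toy_encoded = pd.get_dummies(toy['Receiver'], prefix='Receiver', dtype=float)

print(toy_encoded)
```

Each row contains exactly one `1.0`, located in the column of the name observed for that transaction.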

Using this technique, we will "one-hot" encode four of the categorical attributes in the original transactional dataset. This can be achieved using the `get_dummies()` function available in the Pandas data science library:

In [None]:
# select categorical attributes to be "one-hot" encoded
categorical_attr_names = ['KTOSL', 'PRCTR', 'BSCHL', 'HKONT']

# encode categorical attributes into a binary one-hot encoded representation 
ori_dataset_cat_processed = pd.get_dummies(ori_dataset[categorical_attr_names])

Finally, let's inspect the encoding of 10 sample transactions to see if the encoding was accomplished successfully:

In [None]:
# inspect encoded sample transactions
ori_dataset_cat_processed.head(10)

#### 3.1.3 Pre-Processing of Numerical Transaction Attributes

Let's now inspect the distributions of the two numerical attributes contained in the transactional dataset namely, the (1) local currency amount `DMBTR` and the (2) document currency amount `WRBTR`:

In [None]:
# plot the raw "DMBTR" as well as the "WRBTR" attribute value distribution
fig, ax = plt.subplots(1,2)
fig.set_figwidth(20)

# plot distribution of the local amount attribute
g = sns.distplot(ori_dataset['DMBTR'].tolist(), ax=ax[0])

# set axis labels
g.set_xlabel('DMBTR Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'Local Amount\' attribute values', fontsize=20)

# plot distribution of the document amount attribute
g = sns.distplot(ori_dataset['WRBTR'].tolist(), ax=ax[1])

# set axis labels
g.set_xlabel('WRBTR Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'Foreign Amount\' attribute values', fontsize=20);

As expected, the distributions of the amount values are **heavy-tailed** for both attributes. To approach a potential global minimum faster, it is good practice to scale and normalize the numerical input values. Therefore, we first log-scale both variables and then min-max normalize the scaled amounts to the interval [0, 1].

In [None]:
# select the 'DMBTR' and 'WRBTR' attribute
numeric_attr_names = ['DMBTR', 'WRBTR']

# add a small epsilon to eliminate zero values from data for log scaling
numeric_attr = ori_dataset[numeric_attr_names] + 1e-7

# log scale the 'DMBTR' and 'WRBTR' attribute values
numeric_attr = numeric_attr.apply(np.log)

# normalize all numeric attributes to the range [0,1]
ori_dataset_num_processed = (numeric_attr - numeric_attr.min()) / (numeric_attr.max() - numeric_attr.min())
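
The two steps above can be traced on a handful of toy amounts. The values below are made up for illustration; the transformation itself mirrors the cell above, i.e. a log-scaling with a small epsilon followed by the min-max normalization $(x - x_{min}) / (x_{max} - x_{min})$.

```python
import numpy as np
import pandas as pd

# toy amounts spanning several orders of magnitude, including a zero (made-up values)
toy_amounts = pd.DataFrame({'DMBTR': [0.0, 10.0, 1000.0, 1000000.0]})

# shift by a small epsilon so that log(0) is avoided, then log-scale
toy_log = np.log(toy_amounts + 1e-7)

# min-max normalize: (x - min) / (max - min)
toy_scaled = (toy_log - toy_log.min()) / (toy_log.max() - toy_log.min())

print(toy_scaled)
```

The smallest amount maps to 0 and the largest to 1. Note that a zero amount becomes log(1e-7), roughly -16.1, after the epsilon shift, so it stretches the lower end of the range considerably.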

Let's now visualize the log-scaled and min-max normalized distributions of both attributes:

In [None]:
# plot the log-scaled "DMBTR" as well as the "WRBTR" attribute value distribution
fig, ax = plt.subplots(1,2)
fig.set_figwidth(20)

# plot distribution of the local amount attribute
g = sns.distplot(ori_dataset_num_processed['DMBTR'].tolist(), ax=ax[0])

# set axis labels
g.set_xlabel('DMBTR Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'Local Amount\' attribute values', fontsize=20)

# plot distribution of the document amount attribute
g = sns.distplot(ori_dataset_num_processed['WRBTR'].tolist(), ax=ax[1])

# set axis labels
g.set_xlabel('WRBTR Value', fontsize=18)
g.set_ylabel('Value Count', fontsize=18)

# set plot title
g.set_title('Distribution of the \'Foreign Amount\' attribute values', fontsize=20);

#### 3.1.4 Merge Categorical and Numerical Transaction Attributes

Finally, we merge both pre-processed numerical and categorical attributes into a single dataset that we will use for training our deep autoencoder neural network (explained and implemented in the following section):

In [None]:
# merge categorical and numeric subsets
ori_subset_transformed = pd.concat([ori_dataset_cat_processed, ori_dataset_num_processed], axis = 1)

Now, let's again have a look at the dimensionality of the dataset after we applied the distinct pre-processing steps to the attributes:

In [None]:
# inspect final dimensions of pre-processed transactional data
ori_subset_transformed.shape

Ok, upon completion of all the pre-processing steps (excl. the exercises), we should end up with an encoded dataset consisting of a total of **533,009 records** (rows) and **384 encoded attributes** (columns). Let's keep the number of columns in mind since it will define the dimensionality of the input and output layer of our deep autoencoder network, which we will implement in the following section.
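
The column count follows directly from the data: one column per distinct categorical value plus one per numeric attribute. The sketch below checks this on a toy data frame with hypothetical attribute values; the same check could be applied to the lab data as a sanity test after adding further attributes.

```python
import pandas as pd

# toy data: two categorical attributes and one numeric attribute (hypothetical values)
toy = pd.DataFrame({
    'BSCHL': ['A1', 'A2', 'A1', 'A3'],
    'WAERS': ['C1', 'C1', 'C2', 'C1'],
    'DMBTR': [0.1, 0.5, 0.2, 0.9],
})
cat_cols, num_cols = ['BSCHL', 'WAERS'], ['DMBTR']

# one-hot encode the categorical part and append the numeric part
toy_transformed = pd.concat([pd.get_dummies(toy[cat_cols]), toy[num_cols]], axis=1)

# expected width: one column per distinct categorical value plus the numeric columns
expected_dim = int(toy[cat_cols].nunique().sum()) + len(num_cols)

print(toy_transformed.shape[1], expected_dim)
```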

### 3.2 Autoencoder Neural Network (AENN) Model Training and Evaluation

We recommend you to try the following exercises as part of the lab, based on the notebook you have seen:

**1. Add the two additional journal entry attributes `WAERS` and `BUKRS`.**

>Pre-process the journal entry data and learn an `AENN model` that also includes the attributes `WAERS` and `BUKRS`. To do so, (1) plot and inspect the distribution of the distinct values observable for both attributes, (2) encode the values of the attributes using the `get_dummies()` method provided by the Pandas library, and (3) merge your encoding results with the `ori_subset_transformed` data frame (upon successful merge the one-hot encoded dataset should encompass a total dimensionality of 618 instead of 384 columns). Ultimately, train an `AENN model` including both attributes and evaluate its anomaly detection performance.

In [None]:
#### Step 1. pre-process the journal entry data, including the WAERS and BUKRS attributes (steps above) #############

# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************

#### Step 2. define and init neural network architecture #############################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 3. define loss and training hyperparameters ################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 4. run model training ######################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 5. run model evaluation ####################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**2. Apply a `dropout` throughout the network training.**

>Set the `dropout` probability to `0.2` (20%) and re-start the training procedure. What impact do you observe in terms of training performance / reconstruction loss?
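
As a reminder of what a dropout layer does, the sketch below applies `nn.Dropout(p=0.2)` to a vector of ones. It only illustrates the layer's behaviour in training and evaluation mode and is not a solution to the exercise; where to place dropout inside the encoder and decoder is left to you.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

# training mode: roughly 20% of the activations are zeroed,
# the remaining ones are scaled by 1 / (1 - p) = 1.25
dropout.train()
y_train = dropout(x)

# evaluation mode: dropout leaves the input unchanged
dropout.eval()
y_eval = dropout(x)

zero_fraction = (y_train == 0).float().mean().item()
print(zero_fraction, y_train.max().item(), torch.equal(y_eval, x))
```

Since dropout is only active in training mode, remember to call `model.eval()` before computing reconstruction errors for the evaluation.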

In [None]:
#### Step 1. define and init neural network architecture, with dropout ###############################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 2. define loss, training hyperparameters and dataloader ####################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 3. run model training ######################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 4. run model evaluation ####################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**3. Train and evaluate a `shallow` autoencoder model.**

> The lab model architecture resulted in a good anomaly detection accuracy. Let's see how the reconstruction performance changes if **several of the hidden layers** are removed. First, adjust the encoder and decoder model definitions from the lab accordingly (you may want to use the code snippets shown below). Then, follow all the instructions for training from scratch.

In [None]:
# implementation of the shallow encoder network 
# containing only a single layer
class shallow_encoder(nn.Module):

    def __init__(self):

        super(shallow_encoder, self).__init__()

        # specify layer 1 - in 618, out 3
        self.encoder_L1 = nn.Linear(in_features=ori_subset_transformed.shape[1], out_features=3, bias=True) # add linearity 
        nn.init.xavier_uniform_(self.encoder_L1.weight) # init weights according to [9]
        self.encoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity according to [10]
        
    def forward(self, x):

        # define forward pass through the network
        x = self.encoder_R1(self.encoder_L1(x)) # don't apply dropout to the AE bottleneck

        return x

In [None]:
# implementation of the shallow decoder network 
# containing only a single layer
class shallow_decoder(nn.Module):

    def __init__(self):

        super(shallow_decoder, self).__init__()

        # specify layer 1 - in 3, out 618
        self.decoder_L1 = nn.Linear(in_features=3, out_features=ori_subset_transformed.shape[1], bias=True) # add linearity 
        nn.init.xavier_uniform_(self.decoder_L1.weight)  # init weights according to [9]
        self.decoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity according to [10]

    def forward(self, x):

        # define forward pass through the network
        x = self.decoder_R1(self.decoder_L1(x)) # don't apply dropout to the AE output
        
        return x
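
For the evaluation step, one common way to turn a reconstruction into an anomaly score is to compute the reconstruction error per record and rank the records by it. The sketch below shows the mechanics only: `toy_model` is an untrained stand-in with hypothetical sizes, the input is random, and mean squared error is an assumption made here, so the resulting ranking carries no meaning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical stand-ins: 5 records with 12 features, bottleneck of 3
x = torch.rand(5, 12)
toy_model = nn.Sequential(
    nn.Linear(12, 3), nn.LeakyReLU(negative_slope=0.4),  # shallow encoder
    nn.Linear(3, 12), nn.LeakyReLU(negative_slope=0.4),  # shallow decoder
)

# switch to evaluation mode and disable gradient tracking for scoring
toy_model.eval()
with torch.no_grad():
    x_hat = toy_model(x)

# reduction='none' keeps one error per element; average over the features of each record
per_record_error = nn.functional.mse_loss(x_hat, x, reduction='none').mean(dim=1)

# record indices sorted from highest to lowest reconstruction error
ranking = torch.argsort(per_record_error, descending=True)

print(per_record_error, ranking)
```

With a trained model, records at the top of such a ranking would be the first candidates to inspect as potential anomalies.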

In [None]:
#### Step 1. define and init neural network architecture #############################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 2. define loss, training hyperparameters and dataloader ####################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 3. run model training ######################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

#### Step 4. run model evaluation ####################################################################################

# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************