<style>
th {background-color:#55FF33;}
td {background-color:#00FFFF;}
</style>

<img align="right" style="max-width: 200px; height: auto" src="01_images/logo.png">

## Lab 02 - Statistical Audit Data Analytics

Audit Data Analytics in Python, University of St.Gallen (HSG), January 13th, 2020

The lab environment of the **"Audit Data Analytics Course"** at the University of St. Gallen (HSG) is based on Jupyter Notebooks (https://jupyter.org), which allow you to perform a variety of statistical evaluations and data analyses.

<img align="center" style="max-width: 900px; height: auto" src="01_images/banner.png">

In this lab, we will use Jupyter Notebook to implement and apply an initial **mathematical-statistical audit analysis procedure**, namely the Benford's Law analysis. To this end, we will implement the Benford distribution using the Python programming language. Furthermore, we will perform a Benford's Law analysis of the leading digits derived from the transaction amounts of a given population of financial transactions:

<img align="center" style="max-width: 800px; height: auto" src="01_images/benford.png">

As always, please don't hesitate to ask your questions either during the lab or by sending us an email via marco (dot) schreyer (at) unisg (dot) ch.

## Lab Objectives:

After today's lab, you should be able to:
    
> 1. Understand how to perform statistical data analysis using **Jupyter** and **Python**;
> 2. Use the **Pandas** library to target and analyze a variety of transactional data;
> 3. Use the **Matplotlib** library to create custom data visualizations;
> 4. Develop initial **more concrete ideas** for possible data analyses within your company or institution.

But before we start, let's watch a brief motivational video published in 2017 by **NVIDIA Inc.** as part of their GPU Technology Conference (GTC) on the revolution of data analytics driven by deep neural networks, referred to as "Deep Learning":

In [None]:
from IPython.display import YouTubeVideo
# NVIDIA: "The Deep Learning Revolution"
# YouTubeVideo('Dy0hJWltsyE', width=1024, height=576)

## Setup of the Jupyter Notebook Environment

Similar to the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. We will mostly use the `NumPy`, `Pandas`, `Matplotlib`, `Seaborn`, and a few utility libraries throughout the lab.

Let's import the `Pandas` and the `NumPy` libraries accordingly by executing the following `import` statements:

In [None]:
import pandas as pd
import numpy as np

In addition, we import a couple of `Python's` utility libraries:

In [None]:
import os # create, access, and manipulate data directories
import datetime as dt # create date and time stamps

We also import a set of `Python's` data access and import libraries: 

In [None]:
import io # open and access streams of data
import zipfile # zip and unzip data
import urllib.request # handle website requests (the submodule must be imported explicitly)

Finally, import the `Matplotlib` and `Seaborn` plotting libraries and set the general data visualization parameters:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

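# set global data visualization parameters
# note: newer Matplotlib versions renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn' if 'seaborn' in plt.style.available else 'seaborn-v0_8') # set the plotting style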
plt.rcParams['figure.figsize'] = [5, 3] # set the plot figure size
plt.rcParams['figure.dpi']= 150 # set the plotting resolution

Enable the "inline plotting" of visualizations within the current notebook:

In [None]:
%matplotlib inline

Create notebook folder structure to store the original data as well as the analysis results:

In [None]:
os.makedirs('./02_data', exist_ok=True)  # create the data directory if it does not exist
os.makedirs('./03_results', exist_ok=True)  # create the results directory if it does not exist

Filter and suppress potential library warnings, for example due to library enhancements: 

In [None]:
import warnings

# set the warning filter flag to ignore warnings
warnings.filterwarnings('ignore')
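
Ignoring every warning also hides deprecation notices that may matter for later cells. As a hedged alternative, the sketch below suppresses a single warning category, and only inside a `with` block. The function `noisy_function` is a hypothetical stand-in for a library call and is not part of the lab.

```python
import warnings

def noisy_function():
    # hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API will change", FutureWarning)
    return 42

# suppress only FutureWarning, and only within this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_function()

# record the warnings emitted by a second call to show that they still exist
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_function()

print(result, len(caught))
```

Whether a narrow or a global filter is preferable depends on how much console output the audit team is willing to review.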

## 1. Dataset Download and Data Import

The synthetic **PaySim** dataset simulates mobile money transactions based on real transactions extracted from one month of financial logs of a mobile financial service provider implemented in an African country. The original logs were provided by a multinational company that provided mobile financial services. At the time the data was published, the service provider operated in more than 14 countries worldwide.

The latest version of the dataset was published on the Kaggle Data Science Competitions website on April 3rd, 2017 by the Norwegian University of Science and Technology (NTNU).

In total, the **PaySim** dataset comprises a population of **6.3 million logged transactions**. Each transaction contains **nine different attributes (features)**. The attribute names and their respective semantic meanings are given below:

>- `Step:` Denotes the current hour of time. In total 744 hours (30 simulation days).
>- `Type:` Denotes the type of the transaction. In total 5 different transaction types.
>- `Amount:` Indicates the amount transferred in local currency.

>- `NameOrig:` Identifies the (anonymized) ID of the sender who ordered the transaction.
>- `OldBalanceOrg:` Denotes the initial balance of the sender's account before the transaction.
>- `NewBalanceOrg:` Indicates the new balance of the sender's account after the transaction.

>- `NameDest:` Denotes the (anonymized) ID of the recipient of the transaction.
>- `OldBalanceDest:` Denotes the initial account balance of the recipient before the transaction.
>- `NewBalanceDest:` Denotes the new balance of the recipient's account after the transaction has taken place.

In addition, each transaction is marked with the following **two additional flags**:

>- `isFraud:` Indicates actual "fraudulent" transactions.
>- `isFlaggedFraud:` Indicates fraudulent transactions detected by the system.

Further details of the dataset, as well as the dataset itself, can be obtained via the following publication:

*E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016* 

or the following website on Kaggle: https://www.kaggle.com/ntnu-testimon/paysim1.

### 1.1. Download the PaySim Dataset of Financial Transactions

Now let's download a partial extract of the previously described data set consisting of **2,770,409 logged transactions** into the notebook. To do this, we first define the path or URL of the transaction data to be imported: 

In [None]:
url = 'https://raw.githubusercontent.com/GitiHubi/courseACA/master/lab01/02_data/transactions.zip'

In a next step, we will open a request to read the data from the provided URL:

In [None]:
request = urllib.request.urlopen(url)

Furthermore, we will retrieve the ZIP archive of the data from the opened URL request:

In [None]:
datazip = zipfile.ZipFile(io.BytesIO(request.read()))

### 1.2. Import the PaySim Dataset as Pandas Dataframe

Finally, we will extract the `transactions.csv` file contained in the ZIP archive and read it as a comma-separated values (CSV) file into a `Pandas` dataframe:

In [None]:
# open and unzip the ZIP archive
csv_file = datazip.open('transactions.csv')

# read the csv data as pandas dataframe
data = pd.read_csv(csv_file)
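
The download and import steps above can be tried without network access. The following minimal sketch builds a tiny ZIP archive in memory and reads it back the same way. The rows and the names `datazip_demo` and `demo` are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# build a tiny ZIP archive in memory (made-up rows, for illustration only)
csv_text = "step,type,amount\n1,CASH_OUT,181.0\n1,TRANSFER,215310.3\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("transactions.csv", csv_text)

# reopen the archive from its raw bytes, as done with the downloaded data
datazip_demo = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))
print(datazip_demo.namelist())

# read the contained CSV file directly into a dataframe
with datazip_demo.open("transactions.csv") as csv_file:
    demo = pd.read_csv(csv_file)

print(demo.shape)
```

Wrapping the bytes in `io.BytesIO` gives `zipfile` the seekable file-like object it expects, so nothing has to be written to disk.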

Review the first 10 transactions (rows) of the data set:

In [None]:
data.head(10)

Review the last 10 transactions (rows) of the data set:

In [None]:
data.tail(10)

## 2. Assignment of Unique Transaction Identifiers

A unique **transaction identifier** marks each individual record of the dataset so that it can be identified unambiguously in the further analysis procedure. Such a unique identifier often consists of a sequence of values selected so that each row in the dataset has a unique identifying characteristic.

Let's now generate such a unique sequence of transaction identifiers using the following naming convention `ACA_ID_0000000`, `ACA_ID_0000001`,..., `ACA_ID_2770408`:

In [None]:
# create list of numeric values 0, 1, 2, ..., N - 1
ids = list(range(0, data.shape[0]))

# create list of unique transaction identifiers
keys = ['ACA_ID_' + str(e).zfill(7) for e in ids]

Subsequently, let's verify the first five created unique transaction identifiers:

In [None]:
keys[0:5]
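
As a small illustration of the zero-padding used above, the following sketch generates identifiers for a hypothetical population of 12 rows and checks that they are unique and of equal length. The names `n_rows` and `demo_keys` are assumptions made for this example.

```python
# hypothetical population size, chosen only for illustration
n_rows = 12

# zero-pad each running number to seven characters
demo_keys = ['ACA_ID_' + str(e).zfill(7) for e in range(n_rows)]

print(demo_keys[0], demo_keys[-1])

# every identifier is unique and has the same length
all_unique = len(set(demo_keys)) == len(demo_keys)
same_length = len({len(k) for k in demo_keys}) == 1
print(all_unique, same_length)
```

Because all identifiers have the same width, sorting them as text yields the same order as sorting the underlying numbers.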

OK, that looks as anticipated. Let's now add the unique transaction identifiers to the original dataset that we aim to investigate in the following. To do so, we will add a designated leading `AUDIT_ID` column to our dataframe using the `insert` statement available in the `Pandas` library:

In [None]:
data.insert(0, "AUDIT_ID", keys)

Let's verify if the `AUDIT_ID` column including the unique identifier was successfully created by inspecting the first 10 rows of the dataframe containing the transaction data:

In [None]:
data.head(10)

Again, let's also inspect the last 10 rows of the dataframe containing the transaction data:

In [None]:
data.tail(10)

Excellent, now that we have assigned a unique identifier to each row in our dataset, let's continue with the data preparation.

## 3. Data Preparation and Formatting

**Data preparation** denotes the cleaning and transformation of raw data prior to the actual processing and analysis. It is an important step before the actual data analysis and often involves reformatting data, correcting information, and combining datasets to enrich the data.

### 3.1. Extraction of "CASH_OUT" Transactions

Let's again, in a first step, extract all "CASH_OUT" transactions from the dataset. To achieve this we will use the data filter capabilities of the `Pandas` library:

In [None]:
# filter and extract the cash out transactions
# (copy() avoids a SettingWithCopyWarning when the extract is modified later)
transactions_cash_out = data[data["type"] == "CASH_OUT"].copy()
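
The filter above relies on a boolean mask. The following minimal sketch shows the mechanism on a toy dataframe with made-up values, and uses an explicit `copy()` as one way to decouple the extract from the original frame. The names `toy`, `mask`, and `toy_cash_out` are hypothetical.

```python
import pandas as pd

# toy data with made-up values
toy = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "CASH_OUT", "TRANSFER"],
    "amount": [181.0, 215310.3, 229133.9, 311685.9],
})

# boolean mask: True for each row whose type is "CASH_OUT"
mask = toy["type"] == "CASH_OUT"

# keep the matching rows; copy() decouples the result from the original frame
toy_cash_out = toy[mask].copy()

# changing the copy leaves the original untouched
toy_cash_out["amount"] = 0.0
print(toy["amount"].tolist())
print(toy_cash_out.index.tolist())
```

Note that the extract keeps the row labels of the original dataframe, which makes it easy to trace a filtered record back to its source row.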

### 3.2. Formatting of Data Attributes

Let's conduct a simple semantic formatting of the data attributes `isFraud` and `isFlaggedFraud` in order to improve their interpretability for a human auditor. To begin, let's review the current formatting by inspecting the first five rows of the transactional dataset:

In [None]:
transactions_cash_out.head(5)

It can be observed that the `isFraud` attribute is binary: the value `1` denotes a fraudulent transaction and the value `0` denotes a non-fraudulent transaction. In a next step, we will reformat those values accordingly in the dataset:

In [None]:
# filter for fraudulent transactions and replace the "isFraud" flag value
transactions_cash_out.loc[transactions_cash_out['isFraud'] == 1, 'isFraud'] = 'yes' # replace the value "1" with "yes"

# filter for non-fraudulent transactions and replace the "isFraud" flag value
transactions_cash_out.loc[transactions_cash_out['isFraud'] == 0, 'isFraud'] = 'no' # replace the value "0" with "no"

Let's spot check the performed replacement by the re-inspection of the first five rows:

In [None]:
transactions_cash_out.head(5)
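
As a hedged alternative to the two filter-and-replace statements, the flag values can also be translated through a lookup table. The sketch below uses a toy dataframe with made-up values and writes the labels to a separate, hypothetical column `isFraud_label`, so that the numeric flag remains available.

```python
import pandas as pd

# toy data with made-up flag values
toy = pd.DataFrame({"isFraud": [0, 1, 0, 0]})

# translate each flag through a lookup table
flag_labels = {1: "yes", 0: "no"}
toy["isFraud_label"] = toy["isFraud"].map(flag_labels)

print(toy)
```

Values that are missing from the lookup table become `NaN`, which makes unexpected flag values easy to spot.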

Let's now apply the same reformatting to the `isFlaggedFraud` attribute in the dataset. Remember, the value `1` denotes a transaction flagged as fraudulent and the value `0` denotes a transaction not flagged as fraudulent:

In [None]:
# filter for transactions flagged as fraudulent and replace the "isFlaggedFraud" flag value
transactions_cash_out.loc[transactions_cash_out['isFlaggedFraud'] == 1, 'isFlaggedFraud'] = 'yes' # replace the value "1" with "yes"

# filter for transactions flagged as non-fraudulent and replace the "isFlaggedFraud" flag value
transactions_cash_out.loc[transactions_cash_out['isFlaggedFraud'] == 0, 'isFlaggedFraud'] = 'no' # replace the value "0" with "no"

Let's again spot check the performed replacement by the re-inspection of the first five rows:

In [None]:
transactions_cash_out.head(5)

## 4. Mathematical-Statistical Audit Data Analytics

<img align="center" style="max-width: 800px; height: auto" src="01_images/analytics.png">

### 4.1. Analytics: Benford-Newcomb Analysis of the First Leading Digit

In a first step, let's create a Benford distribution reference table for each possible leading digit value. To do so, we will derive the probabilities $p(d)$ according to Benford for the individual leading digits as defined by:

$$ p(d) = \log_{10}(d+1) - \log_{10}(d), $$

where $d \in \{1, 2, ..., 9\}$ denotes a leading digit value. The digit $0$ is excluded, since it cannot be a leading digit and $\log_{10}(0)$ is undefined.

Source: "The Law of Anomalous Numbers", Benford F., Proceedings of the American Philosophical Society, Vol. 78, 1938, USA
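
As a quick plausibility check of the formula, the following sketch evaluates $p(d)$ for the digits 1 to 9 using only Python's standard library and confirms that the probabilities sum to one. The name `benford_probs` is an assumption made for this example.

```python
import math

# Benford probability of each leading digit d = 1, ..., 9
benford_probs = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

for d, p in benford_probs.items():
    print(f"d = {d}: p(d) = {p:.4f}")

# the sum telescopes to log10(10) - log10(1) = 1
total = sum(benford_probs.values())
print(f"sum of probabilities: {total:.4f}")
```

The probabilities decrease monotonically, so a leading digit of 1 is expected roughly six times as often as a leading digit of 9.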

#### 4.1.1 Create the Benford-Newcomb Probability Reference Table

Let's start by creating a `Pandas` dataframe that contains all the individual leading digits: 

In [None]:
benford_table = pd.DataFrame({"digit_1": range(1, 10)})

In a next step, we will derive the probability of observing a particular leading digit according to Benford and add the probability accordingly to the dataframe: 

In [None]:
benford_table["benford"] = (np.log10(benford_table["digit_1"] + 1)) - np.log10(benford_table["digit_1"])

Let's now inspect our created Benford probability reference table of the leading transaction amount digits:

In [None]:
benford_table

In addition, let's also compute and add 95% confidence intervals to the created Benford probability reference table. The bounds below use the factor $z=1.96$, which corresponds to a 95% interval under the normal approximation:

In [None]:
# determine the total number of cash out transactions
n = transactions_cash_out.shape[0]

# determine the upper bound of the 95% confidence interval
benford_table["benford_upp"] = benford_table["benford"] + 1.96 * np.sqrt((benford_table["benford"] * (1 - benford_table["benford"]))/n)

# determine the lower bound of the 95% confidence interval
benford_table["benford_low"] = benford_table["benford"] - 1.96 * np.sqrt((benford_table["benford"] * (1 - benford_table["benford"]))/n)

Next, let's verify the added lower and upper bounds of the confidence intervals:

In [None]:
benford_table
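
The bounds computed above follow the pattern $p \pm z \sqrt{p(1-p)/n}$, where the factor 1.96 used in the code corresponds to a 95% interval under the normal approximation. The following minimal sketch evaluates the formula for the leading digit 1 and contrasts it with a three-standard-error band. The helper `benford_bounds` and the population size `n` are hypothetical and not taken from the lab data.

```python
import math

def benford_bounds(d, n, z=1.96):
    """Lower and upper bound around the Benford probability of the digit(s) d."""
    p = math.log10(d + 1) - math.log10(d)
    # z times the standard error of a proportion estimated from n observations
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# n is a hypothetical population size, not taken from the lab data
n = 1_000_000

low_95, upp_95 = benford_bounds(1, n)          # z = 1.96, about 95%
low_3s, upp_3s = benford_bounds(1, n, z=3.0)   # three standard errors

print(f"95%:     [{low_95:.5f}, {upp_95:.5f}]")
print(f"3 sigma: [{low_3s:.5f}, {upp_3s:.5f}]")
```

With millions of transactions the intervals become very narrow, so even small deviations fall outside the bounds. Deviations should therefore be judged by their size as well as by their statistical significance.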

Finally, let's also visualize the expected first leading digit probability according to Benford:

In [None]:
# initialise the plot 
fig, ax = plt.subplots(figsize=(15, 5))

# plot the benford probabilities 
plt.plot(benford_table["digit_1"], benford_table["benford"], color="red")

# plot the benford probability density
plt.fill_between(benford_table["digit_1"], benford_table["benford"], color="red", alpha=0.1)

# add the axis labels
plt.ylabel("[Probability]", fontsize=12)
plt.xlabel("[Leading Digit]", fontsize=12)

# rotate x-axis tick labels
plt.xticks(rotation=0)

# add the plot title
plt.title("Benford-Newcomb Distribution - First Leading Digit", fontsize=12);

#### 4.1.2 Determine the Actual Probabilities of the Transaction Amounts Leading Digit

OK, now that we have prepared our reference table including the confidence intervals, let's focus on the leading digits of the "CASH_OUT" transactions. To do so, we will extract the leading digit of each transaction amount and add it as a separate column to the dataframe of "CASH_OUT" transactions:

In [None]:
transactions_cash_out["digit_1"] = transactions_cash_out["amount"].astype(str).str[0]

Let's verify the extracted leading digits based on the first 10 rows of the transactional dataset:

In [None]:
transactions_cash_out[["amount", "digit_1"]].head(10)
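
Note that slicing the string representation returns `0` for amounts below one (e.g. `0.5`), and a two-character slice of an amount below ten contains the decimal point (e.g. `7.` for `7.3`). As a hedged alternative, the sketch below extracts the first significant digits via Python's `decimal` module. The helper `leading_digits` is hypothetical and not part of the lab code.

```python
from decimal import Decimal

def leading_digits(amount, k=1):
    """Return the first k significant digits of a positive amount as an int.

    Returns None for zero or negative amounts, which have no leading digit
    in the sense of Benford's Law.
    """
    if amount <= 0:
        return None
    # Decimal stores the significant digits without leading zeros or the point
    digits = Decimal(str(amount)).as_tuple().digits
    # pad with zeros in case fewer than k digits are available
    padded = digits + (0,) * k
    return int("".join(str(digit) for digit in padded[:k]))

for amount in [181234.5, 7.3, 0.05, 0.0]:
    print(amount, leading_digits(amount, 1), leading_digits(amount, 2))
```

Such a helper could be applied to a dataframe column with `Series.apply`, at the price of being slower than the vectorized string slicing used in the lab.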

In a next step, let's determine the actual probability of observing a specific leading digit in the dataset of "CASH_OUT" transactions. Therefore, we will derive a list of all observable leading digits in the dataset:  

In [None]:
benford_analysis = pd.DataFrame({"digit_1": transactions_cash_out["digit_1"].value_counts().index.astype(np.int64).tolist()})

Next, we count the number of times a particular leading digit is evident in the "CASH_OUT" transactions:

In [None]:
benford_analysis["count"] = transactions_cash_out["digit_1"].value_counts().tolist()

Finally, we compute the probability of observing a particular leading digit in the "CASH_OUT" transactions:

In [None]:
benford_analysis["probability"] = benford_analysis["count"] / transactions_cash_out.shape[0]

Let’s now inspect and verify the derived probabilities:

In [None]:
benford_analysis
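
The three steps above rely on `value_counts` returning the digits in the same order each time. As a hedged alternative, the counts and probabilities can be combined via their shared index, as sketched below on made-up digits. The names `digits` and `demo_analysis` are assumptions made for this example.

```python
import pandas as pd

# made-up leading digits, for illustration only
digits = pd.Series(["1", "1", "2", "1", "3", "2", "9", "1"], name="digit_1")

# counts and relative frequencies share the same index (the digit)
counts = digits.value_counts()
shares = digits.value_counts(normalize=True)

# align both series on the digit and turn the index into a regular column
demo_analysis = (
    pd.DataFrame({"count": counts, "probability": shares})
    .rename_axis("digit_1")
    .reset_index()
    .astype({"digit_1": "int64"})
    .sort_values("digit_1", ignore_index=True)
)
print(demo_analysis)
```

Aligning on the index removes the dependence on the row order and yields a table that is sorted by digit.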

#### 4.1.3 Benford-Newcomb Analysis of the Transaction Amounts Leading Digit

To conclude the Benford analysis let's merge the initially created reference table of Benford probabilities with the actual observed probability of observing a particular leading digit. To achieve this we will use the `merge` function available in the `Pandas` library: 

In [None]:
analysis_result_single_leding_digit = benford_table.merge(benford_analysis, on="digit_1")

Now we are finally in the position to compare both probabilities (the expected probability according to Benford-Newcomb and the observed probability in the dataset) and detect potential deviations: 

In [None]:
analysis_result_single_leding_digit 
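
One way to make the comparison systematic is to flag each digit whose observed probability lies outside the confidence bounds. The following minimal sketch illustrates this on made-up observed counts that are not taken from the lab data. The column `outside` and the dataframe `result` are hypothetical names.

```python
import numpy as np
import pandas as pd

# expected probabilities according to Benford
result = pd.DataFrame({"digit_1": range(1, 10)})
result["benford"] = np.log10(result["digit_1"] + 1) - np.log10(result["digit_1"])

# made-up observed counts per leading digit (not taken from the lab data)
result["count"] = [3050, 1720, 1260, 990, 700, 680, 590, 520, 490]
n = result["count"].sum()
result["probability"] = result["count"] / n

# 95% bounds around the expected probability
half_width = 1.96 * np.sqrt(result["benford"] * (1 - result["benford"]) / n)
result["benford_low"] = result["benford"] - half_width
result["benford_upp"] = result["benford"] + half_width

# a digit is conspicuous if its observed probability leaves the interval
result["outside"] = ~result["probability"].between(result["benford_low"], result["benford_upp"])

print(result[["digit_1", "benford", "probability", "outside"]])
```

A flagged digit is a starting point for further audit procedures, not evidence of an irregularity in itself.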

Furthermore, let's also visually inspect the probability distribution expected by Benford-Newcomb and the observed probabilities available in the dataset of "CASH_OUT" transactions:

In [None]:
# initialise the plot 
fig, ax = plt.subplots(figsize=(15, 5))

# plot the benford probabilities 
plt.plot(analysis_result_single_leding_digit["digit_1"], analysis_result_single_leding_digit["benford"], color="red")

# plot the actual distribution of the first digit
plt.bar(analysis_result_single_leding_digit["digit_1"], analysis_result_single_leding_digit["probability"], color="green")

# plot the benford probability density
plt.fill_between(np.arange(1.0, 10.0, 1.0), analysis_result_single_leding_digit["benford"], color="red", alpha=0.1)

# add the axis labels
plt.ylabel("[Probability]", fontsize=12)
plt.xlabel("[Leading Digit]", fontsize=12)

# format the x-tick labels
plt.xticks(range(1,10), range(1,10))

# add the plot title
plt.title("Benford-Newcomb Analysis - First Leading Digit", fontsize=12);

### 4.2. Analytics: Benford-Newcomb Analysis of the First and Second Digits

#### 4.2.1 Create the Benford-Newcomb Probability Reference Table

Let’s start again by creating a `Pandas` dataframe that contains all possible combinations of the first and second leading transaction amount digits:

In [None]:
# the combinations of the first and second leading digit range from 10 to 99
benford_table = pd.DataFrame({"digit_2": range(10, 100)})

As before, we will derive the probability of observing a particular leading digit combination according to Benford. Afterwards, we will add the obtained probabilities to the dataframe:

In [None]:
benford_table["benford"] = (np.log10(benford_table["digit_2"] + 1)) - np.log10(benford_table["digit_2"])

Let’s now inspect the distinct rows of the created Benford probability reference table. The table contains all possible two leading digit combinations as well as their corresponding probability of occurrence according to Benford:

In [None]:
benford_table

In addition, let’s also compute and add 95% confidence intervals, using the factor $z=1.96$ of the normal approximation. We will add the upper and lower bounds of the determined confidence intervals to the created reference table of Benford probabilities:

In [None]:
# determine the total number of cash out transactions
n = transactions_cash_out.shape[0]

# determine the upper bound of the 95% confidence interval
benford_table["benford_upp"] = benford_table["benford"] + 1.96 * np.sqrt((benford_table["benford"] * (1 - benford_table["benford"]))/n)

# determine the lower bound of the 95% confidence interval
benford_table["benford_low"] = benford_table["benford"] - 1.96 * np.sqrt((benford_table["benford"] * (1 - benford_table["benford"]))/n)

Next, let’s verify the added lower and upper bounds of the confidence intervals:

In [None]:
benford_table

Finally, let’s also visualize the expected probabilities of the first and second leading digit combinations according to Benford:

In [None]:
# initialise the plot 
fig, ax = plt.subplots(figsize=(15, 5))

# plot the benford probabilities 
plt.plot(benford_table["digit_2"], benford_table["benford"], color="red")

# plot the benford probability density
plt.fill_between(benford_table["digit_2"], benford_table["benford"], color="red", alpha=0.1)

# add the axis labels
plt.ylabel("[Probability]", fontsize=12)
plt.xlabel("[Leading Digits]", fontsize=12)

# format the x-tick labels
plt.xticks(range(10, 100), range(10, 100))

# rotate x-axis tick labels
plt.xticks(rotation=90)

# format the x-axis limits
plt.xlim(10, 99)

# format the y-axis limits
plt.ylim(0.0, 0.05)

# add the plot title
plt.title("Benford-Newcomb Distribution - First and Second Leading Digit", fontsize=12);

#### 4.2.2 Determine the Actual Probabilities of the First and Second Leading Transaction Amounts Digits

OK, now that we have prepared our reference table including the confidence intervals, let’s focus on the two leading digits of the “CASH_OUT” transactions. To do so, we will extract both leading digits of each transaction amount and add them as a separate column to the dataframe of “CASH_OUT” transactions:

In [None]:
transactions_cash_out["digit_2"] = transactions_cash_out["amount"].astype(str).str[0] + transactions_cash_out["amount"].astype(str).str[1]

Let’s verify the extracted leading digits based on the first 10 rows of the transactional dataset:

In [None]:
transactions_cash_out[["amount", "digit_2"]].head(10)

In a next step, let’s determine the actual probability of observing a specific combination of leading digits in the dataset of “CASH_OUT” transactions. To do so, we will derive a list of all observable leading digit combinations in the dataset:

In [None]:
benford_analysis = pd.DataFrame({"digit_2": transactions_cash_out["digit_2"].value_counts().index.map(lambda t: t.replace('.', '')).astype(np.int64).tolist()})

Next, we count the number of times a particular combination of leading digits is evident in the “CASH_OUT” transactions:

In [None]:
benford_analysis["count"] = transactions_cash_out["digit_2"].value_counts().tolist()

Finally, we compute the probability of observing a particular combination of leading digits in the “CASH_OUT” transactions:

In [None]:
benford_analysis["probability"] = transactions_cash_out["digit_2"].value_counts(normalize=True).tolist()

Let’s now inspect and verify the derived probabilities:

In [None]:
benford_analysis

#### 4.2.3 Benford-Newcomb Analysis of the First and Second Leading Transaction Amounts Digits

To conclude the Benford-Newcomb analysis let’s merge the initially created reference table of Benford-Newcomb probabilities with the actual observed probability of observing a particular combination of leading digits. To achieve this we will again use the merge function available in the `Pandas` library:

In [None]:
analysis_result_double_leading_digits = benford_table.merge(benford_analysis, on="digit_2")

Now we are finally in the position to compare both probabilities (the expected probability according to Benford-Newcomb and the observed probability in the dataset) and detect potential deviations:

In [None]:
analysis_result_double_leading_digits 

Furthermore, let’s again also visually inspect the probability distribution expected by Benford-Newcomb and the observed probabilities available in the dataset of “CASH_OUT” transactions:

In [None]:
# initialise the plot 
fig, ax = plt.subplots(figsize=(15, 5))

# plot the benford probabilities 
plt.plot(analysis_result_double_leading_digits["digit_2"], analysis_result_double_leading_digits["benford"], color="red")

# plot the actual distribution of the first and second digit combinations
plt.bar(analysis_result_double_leading_digits["digit_2"], analysis_result_double_leading_digits["probability"], color="green")

# plot the benford probability density
plt.fill_between(analysis_result_double_leading_digits["digit_2"], analysis_result_double_leading_digits["benford"], color="red", alpha=0.1)

# add the axis labels
plt.ylabel("[Probability]", fontsize=12)
plt.xlabel("[Leading Digits]", fontsize=12)

# format the x-tick labels
plt.xticks(range(10, 100), range(10, 100))

# rotate x-axis tick labels
plt.xticks(rotation=90)

# format the x-axis limits
plt.xlim(9.0, 100.0)

# format the y-axis limits
plt.ylim(0.0, 0.05)

# add the plot title
plt.title("Benford-Newcomb Analysis - First and Second Leading Digit", fontsize=12);

### 4.3. Analytics: Investigation of Significant Probability Deviations

In a next step, let's investigate the combinations of leading transaction amount digits that correspond to the largest deviations from the Benford-Newcomb distribution. To do so, we compute the absolute difference (delta) between the Benford-Newcomb probability and the actual observed probability of each leading digit combination:

In [None]:
analysis_result_double_leading_digits["delta"] = np.abs(analysis_result_double_leading_digits["benford"] -  analysis_result_double_leading_digits["probability"])

Now, we are able to determine the leading digit combinations whose probabilities show a significant deviation. To achieve this, we will sort the dataframe by the computed delta using the `sort_values` function of the `Pandas` library:

In [None]:
analysis_result_double_leading_digits.sort_values(by=['delta'], ascending=False)
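
An absolute delta tends to favor digit combinations that occur frequently anyway. As a hedged complement, the sketch below adds a relative deviation and selects the largest deviations with `nlargest`. The observed probabilities are made up for illustration, and the column `delta_rel` is a hypothetical name.

```python
import pandas as pd

# rounded Benford probabilities and made-up observed probabilities
toy = pd.DataFrame({
    "digit_2": [10, 11, 18, 19],
    "benford": [0.0414, 0.0378, 0.0235, 0.0223],
    "probability": [0.0400, 0.0380, 0.0310, 0.0250],
})

# absolute deviation and deviation relative to the expected probability
toy["delta"] = (toy["benford"] - toy["probability"]).abs()
toy["delta_rel"] = toy["delta"] / toy["benford"]

# the two combinations with the largest absolute deviation
top = toy.nlargest(2, "delta")
print(top)
```

Which of the two measures is more suitable depends on the audit objective. The lab continues with the absolute delta.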

Next, we will visualize the obtained deviations:

In [None]:
# initialise the plot 
fig, ax = plt.subplots(figsize=(15, 5))

# plot the absolute deviation of each leading digit combination
plt.bar(analysis_result_double_leading_digits["digit_2"], analysis_result_double_leading_digits["delta"], color="darkviolet")

# add the axis labels
plt.ylabel("[Probability Deviation]", fontsize=12)
plt.xlabel("[Leading Digits]", fontsize=12)

# format the x-tick labels
plt.xticks(range(10, 100), range(10, 100))

# rotate x-axis tick labels
plt.xticks(rotation=90)

# format the x-axis limits
plt.xlim(9.5, 100.0)

# format the y-axis limits
plt.ylim(0.0, 0.01)

# add the plot title
plt.title("Benford-Newcomb Deviation Analysis - First and Second Leading Digit", fontsize=12);

Judging from the deviation analysis shown above, a significant difference can be observed for the digit combinations ranging from "16" to "22". Among these, the digit combination "18" exhibits the highest deviation.

In the following, we will therefore extract all "CASH_OUT" transactions that exhibit the digit combination of "18":

In [None]:
# set digit combination
digit = "18"

# filter corresponding cash out transactions
transactions_cash_out_18 = transactions_cash_out[transactions_cash_out["digit_2"] == digit]

Next, let's review the extracted transactions: 

In [None]:
transactions_cash_out_18.sort_values(by=['amount'], ascending=False)

Let's now inspect in detail the amounts of the extracted "CASH_OUT" transactions that exhibit the digit combination of "18":

In [None]:
# initialize the plot
fig, ax = plt.subplots(figsize=(15, 5))

# scatter plot of cash out transactions that exhibit a leading digit amount equal to 18
plt.scatter(transactions_cash_out_18.index, transactions_cash_out_18["amount"], color="darkviolet")

# plot unusual amount threshold
plt.axhline(y=1750000, color="r", linestyle="--", label="threshold")

# add labels of the x- and y-axis
plt.ylabel("[Amount]", fontsize=12)
plt.xlabel("[Transaction]", fontsize=12)

# format y-axis tick labels
ax.ticklabel_format(style='plain')

# hide x-ticks
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)

# add plot title
plt.title("Benford-Newcomb Deviation Analysis - First and Second Leading Digit: 18", fontsize=14);

OK, it seems that there is an unusual transaction amount pattern evident.

Let's apply a filter to determine all "CASH_OUT" transactions that correspond to a transaction amount **equal to or exceeding 1.75 million** in local currency:

In [None]:
# define the amount threshold
threshold = 1750000

# filter the cash-out transactions according to the amount threshold 
transactions_cash_out_18_large = transactions_cash_out_18[transactions_cash_out_18["amount"] >= threshold]

Let's do a sample-based review of the extracted transactions:

In [None]:
transactions_cash_out_18_large.head(20)

Finally, let's extract the filtered transactions into an Excel spreadsheet for further sample-based testing by the audit team. In a first step, we will create a time stamp of the data extract for audit trail purposes:

In [None]:
timestamp = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")

In a second step, we export the filtered transactions as an Excel file to the local filesystem:

In [None]:
# specify the filename of the excel spreadsheet
filename = str(timestamp) + " - ACA_001_benford_newcomb_18.xlsx"

# specify the target data directory of the excel spreadsheet
data_directory = os.path.join('./03_results', filename)

# extract the filtered transactions to excel
# (the former "encoding" argument is omitted since recent Pandas versions no longer accept it)
transactions_cash_out_18_large.to_excel(data_directory, header=True, index=False, sheet_name="Business_Partner_Amounts")
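
If no Excel writer is installed, a CSV export is a simple alternative. The following minimal sketch writes a toy extract with made-up values into a temporary directory and reads it back to confirm that no rows were lost. The file name and the variables `extract`, `target_path`, and `check` are assumptions made for this example.

```python
import datetime as dt
import os
import tempfile

import pandas as pd

# toy extract with made-up values
extract = pd.DataFrame({"AUDIT_ID": ["ACA_ID_0000003"], "amount": [1812345.67]})

# time stamp in UTC for the audit trail
timestamp = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")

# write into a temporary directory so the sketch leaves the lab folders untouched
target_dir = tempfile.mkdtemp()
target_path = os.path.join(target_dir, timestamp + " - demo_extract.csv")
extract.to_csv(target_path, index=False)

# read the file back to confirm that the extract is complete
check = pd.read_csv(target_path)
print(len(check) == len(extract))
```

Reading the file back is a cheap completeness check that can be documented as part of the audit trail.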

### Lab Assignments:

We recommend that you try the following exercises as part of the lab:

**1. Analyze the "CASH_OUT" transactions that have the leading digit combinations 15 and 16.**

> Analyze the approx. 2.2 million "CASH_OUT" transactions extracted during data preparation with regard to the leading digit combinations '15' and '16'. For this, please follow the procedure presented in section 4.3 of the notebook. Extract the individual transactions to a separate Excel or CSV file for downstream sample testing.

In [None]:
# ***************************************************
# INSERT YOUR CODE SOLUTION HERE
# ***************************************************

**2. Analyze the leading digit combinations of the "TRANSFER" transactions according to Benford-Newcomb.**

> Analyze the transaction amounts of the 532,909 "TRANSFER" transactions contained in the dataset according to the Benford-Newcomb law. In doing so, please follow the procedure presented in sections 4.1 and 4.2 of the notebook. Extract the individual transactions that correspond to deviations from the Benford-Newcomb law to a separate Excel or CSV file for downstream sample testing.

In [None]:
# ***************************************************
# INSERT YOUR CODE SOLUTION HERE
# ***************************************************

### Lab Summary:

In this lab, a step-by-step introduction to mathematical-statistical audit data analytics was presented, in particular the analysis of the leading digits of a population of financial transactions according to the Benford-Newcomb Law. The analysis procedure presented in this lab can be viewed as a starting point for more tailored and complex analytics.

You may want to execute the content of your lab outside of the Jupyter notebook environment, e.g. on a compute node or a server. The cell below converts the lab notebook into a standalone and executable Python script. Please note that to convert the notebook, you need to install Python's `NBConvert` library and its extensions:

In [None]:
# installing the nbconvert library (uncomment the following statements if needed)
# !pip3 install nbconvert
# !pip3 install jupyter_contrib_nbextensions

Let's now convert the Jupyter notebook into a plain Python script:

In [None]:
!jupyter nbconvert --to script aca_lab02.ipynb