# Analytical Approach - The Hypothesis

- **Given the quality of the dataset provided and the amount of information available, the most suitable approach to detecting anomalous transactions is a rule-based analysis.**


- **According to the provided Data Dictionary, the `BAT_NAME` column is an ID for a single journal posting, and all the transactions under a particular `BAT_NAME` ID must net to zero.**


- **Concretely, if the journal entries of a `BAT_NAME` ID are not balanced (i.e., do not net to zero), we can say that the journal posting (entry) is anomalous, or `not business as usual`.**


- **Hence, our approach is to sum all the `Amount` entries that belong to a particular `BAT_NAME` ID and extract all the journal postings (transactions) whose `Amount` does not net to zero.**
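
As a minimal sketch of the rule described above, the following uses a small invented dataframe (the batch names and amounts are hypothetical, not taken from the dataset) to show how summing `Amount` per `BAT_NAME` exposes a posting that does not net to zero.

```python
import pandas as pd

# Toy journal lines; batch names and amounts are invented for illustration.
toy = pd.DataFrame({
    "BAT_NAME": ["BAT_A", "BAT_A", "BAT_B", "BAT_B", "BAT_B"],
    "Amount":   [100.0, -100.0, 50.0, -20.0, -20.0],
})

# Net amount per journal posting.
net = toy.groupby("BAT_NAME")["Amount"].sum()

# Postings whose lines do not net to zero.
unbalanced = net[net != 0].index.tolist()
print(unbalanced)
```

Here `BAT_A` balances and `BAT_B` is left with a net of 10.0, so only `BAT_B` is reported.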

### Import the Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Read-In the Dataset and Select the Important Features

In [2]:
# Set low_memory=False so pandas infers each column's dtype from the whole file rather than chunk by chunk
data = pd.read_csv("JET_CSV_DATA/JET_CSV_DATA/JET_DATA_0.csv", low_memory=False)

In [3]:
# The list of important features
features = ['GL account','Journal','Account No','BAT_NAME','JNL_LNE','Period','SOURCE_USER','POST_USER','Amount']

In [4]:
# Extract the important features using the features-list created above
data = data[features]

### Drop Missing Values and reset the index

In [5]:
data = data.dropna().reset_index(drop=True)
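
Because the balance rule is applied per posting, dropping a single line may change the net amount of the posting it belonged to. The sketch below, on invented data (the missing `POST_USER` value is hypothetical), shows one way to count missing values per column and list the postings that would lose a line under `dropna()`.

```python
import pandas as pd

# Toy data with one missing POST_USER value; all values are invented.
toy = pd.DataFrame({
    "BAT_NAME": ["BAT_A", "BAT_A", "BAT_B", "BAT_B"],
    "POST_USER": ["POST_USER 60", None, "POST_USER 60", "POST_USER 60"],
    "Amount": [5.0, -5.0, 3.0, -3.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Postings that contain at least one row dropna() would remove.
incomplete_rows = toy.isna().any(axis=1)
affected_batches = toy.loc[incomplete_rows, "BAT_NAME"].unique().tolist()

# Net per posting before and after dropping incomplete rows.
net_before = toy.groupby("BAT_NAME")["Amount"].sum()
net_after = toy.dropna().groupby("BAT_NAME")["Amount"].sum()
print(affected_batches)
```

In this toy case `BAT_A` balances before the drop and nets to 5.0 afterwards, so it is worth checking whether a flagged posting appears in `affected_batches`.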

### Group the data by `BAT_NAME` and sum the `Amount` entries of each `BAT_NAME` ID

**Extract only the `BAT_NAME` and `Amount` columns**

In [6]:
new_data = data[['BAT_NAME','Amount']]

**Group by `BAT_NAME` and sum the `Amount` column**

In [7]:
Amount_sum = new_data.groupby('BAT_NAME',as_index=False, sort=False).sum()

In [8]:
Amount_sum.head()

Unnamed: 0,BAT_NAME,Amount
0,BAT_NAME 517750,0.0
1,BAT_NAME 514862,0.0
2,BAT_NAME 35108,-4.440892e-16
3,BAT_NAME 737556,0.0
4,BAT_NAME 735999,6.9e-08


### Using the `Amount_sum` dataframe, extract all entries whose `Amount` is not equal to zero (these indicate transactions that are not business as usual). Then use their corresponding `BAT_NAME` to get the full information about the transactions from the main dataset

**Extract entries with Amount not equal to zero**

In [9]:
amount_not_zero = Amount_sum[Amount_sum['Amount']!=0]
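
The `Amount_sum.head()` output includes sums such as `-4.440892e-16`, a magnitude typical of floating-point rounding rather than a real imbalance, and an exact `!= 0` test flags it. A possible refinement, sketched here, is to compare against a small tolerance. The threshold `TOL` and the last row of the toy frame are assumptions for illustration; a suitable tolerance depends on the precision of the amounts in the source system.

```python
import pandas as pd

# First three sums are copied from the Amount_sum.head() output;
# the last row is a hypothetical, clearly unbalanced posting.
sums = pd.DataFrame({
    "BAT_NAME": ["BAT_NAME 517750", "BAT_NAME 35108",
                 "BAT_NAME 735999", "BAT_NAME HYPOTHETICAL"],
    "Amount": [0.0, -4.440892e-16, 6.9e-08, 3.926548],
})

# Assumed tolerance; tune it to the precision of the ledger amounts.
TOL = 1e-6

# Exact comparison flags every non-zero sum, including rounding noise.
flagged_exact = sums[sums["Amount"] != 0]

# Tolerance-based comparison keeps only sums whose magnitude exceeds TOL.
flagged_tol = sums[sums["Amount"].abs() > TOL]
print(flagged_tol["BAT_NAME"].tolist())
```

With this assumed tolerance, the exact test flags three postings while the tolerance-based test flags only the hypothetical one.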

**Extract the corresponding `BAT_NAME` values and save them in a list**

In [10]:
BAT_NAME_LIST = list(amount_not_zero['BAT_NAME'])

**Use `BAT_NAME_LIST` to get the full information about the transactions from the main dataset. These are the transactions whose journal entries do not sum to zero**

In [11]:
anomaly_journal = data.loc[data['BAT_NAME'].isin(BAT_NAME_LIST)]
anomaly_journal

Unnamed: 0,GL account,Journal,Account No,BAT_NAME,JNL_LNE,Period,SOURCE_USER,POST_USER,Amount
6,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,3.0,Period 1,SOURCE_USER 61,POST_USER 60,0.994410
7,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,4.0,Period 1,SOURCE_USER 61,POST_USER 60,0.805815
8,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,5.0,Period 1,SOURCE_USER 61,POST_USER 60,1.769364
9,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,6.0,Period 1,SOURCE_USER 61,POST_USER 60,-1.769364
10,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,8.0,Period 1,SOURCE_USER 61,POST_USER 60,-0.805815
...,...,...,...,...,...,...,...,...,...
877680,GL account 629,Journal 1,Account No 24466,BAT_NAME 513075,6.0,Period 12,SOURCE_USER 61,POST_USER 60,0.278126
877682,GL account 629,Journal 1,Account No 24466,BAT_NAME 513101,18.0,Period 12,SOURCE_USER 61,POST_USER 60,1.666494
877684,GL account 629,Journal 1,Account No 24466,BAT_NAME 513121,5.0,Period 12,SOURCE_USER 61,POST_USER 60,0.004972
877688,GL account 629,Journal 1,Account No 24466,BAT_NAME 513063,7.0,Period 12,SOURCE_USER 61,POST_USER 60,0.160134


### Save the anomalous transactions in a new CSV file

In [12]:
anomaly_journal.to_csv('JET_DATA_0_anomalous.csv', index=False)

### Inspect the `anomaly_journal` dataframe to investigate what is wrong with the anomalous transactions 

In [14]:
anomaly_journal[anomaly_journal['BAT_NAME']=='BAT_NAME 35108'].reset_index(drop=True)

Unnamed: 0,GL account,Journal,Account No,BAT_NAME,JNL_LNE,Period,SOURCE_USER,POST_USER,Amount
0,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,3.0,Period 1,SOURCE_USER 61,POST_USER 60,0.99441
1,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,4.0,Period 1,SOURCE_USER 61,POST_USER 60,0.805815
2,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,5.0,Period 1,SOURCE_USER 61,POST_USER 60,1.769364
3,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,6.0,Period 1,SOURCE_USER 61,POST_USER 60,-1.769364
4,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,8.0,Period 1,SOURCE_USER 61,POST_USER 60,-0.805815
5,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,9.0,Period 1,SOURCE_USER 61,POST_USER 60,-0.99441
6,GL account 629,Journal 1,Account No 28235,BAT_NAME 35108,7.0,Period 1,SOURCE_USER 61,POST_USER 60,3.569589
7,GL account 779,Journal 1,Account No 28248,BAT_NAME 35108,11.0,Period 1,SOURCE_USER 61,POST_USER 60,-3.926548
8,GL account 539,Journal 1,Account No 28215,BAT_NAME 35108,2.0,Period 1,SOURCE_USER 61,POST_USER 60,0.356959
9,GL account 779,Journal 1,Account No 48169,BAT_NAME 35108,10.0,Period 1,SOURCE_USER 61,POST_USER 60,3.926548
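
When inspecting a flagged posting, a per-posting summary can be easier to read than the raw lines. The following sketch, on invented data, reports the line count, the totals of positive and negative amounts, and the net for each `BAT_NAME`. The column names `n_lines`, `positive_total`, `negative_total` and `net` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Toy journal lines; values are invented for illustration.
toy = pd.DataFrame({
    "BAT_NAME": ["BAT_A", "BAT_A", "BAT_A", "BAT_B", "BAT_B"],
    "JNL_LNE": [1, 2, 3, 1, 2],
    "Amount": [2.5, -1.0, -1.5, 4.0, -3.0],
})

# One row per posting with line count, signed totals and net amount.
summary = toy.groupby("BAT_NAME").agg(
    n_lines=("Amount", "size"),
    positive_total=("Amount", lambda s: s[s > 0].sum()),
    negative_total=("Amount", lambda s: s[s < 0].sum()),
    net=("Amount", "sum"),
)
print(summary)
```

A posting whose `positive_total` and `negative_total` do not cancel shows up directly in the `net` column.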


## Create a Function to automatically perform the anomaly detection for all the Data

In [21]:
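import os

def find_anomaly(path_list):
    """Write the unbalanced journal postings of each CSV file to '<name>_anomalous.csv'.

    Returns the list of output file paths.
    """
    features = ['GL account','Journal','Account No','BAT_NAME','JNL_LNE','Period','SOURCE_USER','POST_USER','Amount']
    anomaly_names = []

    for path in path_list:
        # Read the file, keep only the important features and drop incomplete rows
        data = pd.read_csv(path, low_memory=False)
        data = data[features].dropna().reset_index(drop=True)

        # Net Amount per journal posting (BAT_NAME)
        new_data = data[['BAT_NAME','Amount']]
        Amount_sum = new_data.groupby('BAT_NAME', as_index=False, sort=False).sum()
        amount_not_zero = Amount_sum[Amount_sum['Amount'] != 0]

        # All lines that belong to a posting that does not net to zero
        anomaly_journal = data.loc[data['BAT_NAME'].isin(list(amount_not_zero['BAT_NAME']))]

        # Strip only the final extension, so dots elsewhere in the path are preserved
        new_name = os.path.splitext(path)[0] + '_anomalous.csv'
        anomaly_names.append(new_name)
        anomaly_journal.to_csv(new_name, index=False)

    return anomaly_names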

In [23]:
anomalies = find_anomaly(path_list)

In [25]:
for anomaly in anomalies:
    
    data = pd.read_csv(anomaly, low_memory=False)
    display(data.head(10))

Unnamed: 0,GL account,Journal,Account No,BAT_NAME,JNL_LNE,Period,SOURCE_USER,POST_USER,Amount
0,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,3.0,Period 1,SOURCE_USER 61,POST_USER 60,0.99441
1,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,4.0,Period 1,SOURCE_USER 61,POST_USER 60,0.805815
2,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,5.0,Period 1,SOURCE_USER 61,POST_USER 60,1.769364
3,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,6.0,Period 1,SOURCE_USER 61,POST_USER 60,-1.769364
4,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,8.0,Period 1,SOURCE_USER 61,POST_USER 60,-0.805815
5,GL account 197,Journal 1,Account No 28428,BAT_NAME 35108,9.0,Period 1,SOURCE_USER 61,POST_USER 60,-0.99441
6,GL account 197,Journal 1,Account No 29363,BAT_NAME 735999,32.0,Period 10,SOURCE_USER 61,POST_USER 60,0.053492
7,GL account 197,Journal 1,Account No 29363,BAT_NAME 735999,33.0,Period 10,SOURCE_USER 61,POST_USER 60,0.039536
8,GL account 197,Journal 1,Account No 29363,BAT_NAME 735999,34.0,Period 10,SOURCE_USER 61,POST_USER 60,0.197167
9,GL account 197,Journal 1,Account No 29363,BAT_NAME 735999,35.0,Period 10,SOURCE_USER 61,POST_USER 60,0.51435


Unnamed: 0,GL account,Journal,Account No,BAT_NAME,JNL_LNE,Period,SOURCE_USER,POST_USER,Amount
0,GL account 197,Journal 1,Account No 28428,BAT_NAME 31459,5,Period 1,SOURCE_USER 61,POST_USER 60,1.50876
1,GL account 197,Journal 1,Account No 29363,BAT_NAME 742545,2,Period 11,SOURCE_USER 61,POST_USER 60,0.96012
2,GL account 197,Journal 1,Account No 29363,BAT_NAME 742545,3,Period 11,SOURCE_USER 61,POST_USER 60,0.272606
3,GL account 197,Journal 1,Account No 25889,BAT_NAME 838033,149,Period 1,SOURCE_USER 61,POST_USER 60,0.141481
4,GL account 197,Journal 1,Account No 25889,BAT_NAME 838033,189,Period 1,SOURCE_USER 61,POST_USER 60,0.032576
5,GL account 197,Journal 1,Account No 25840,BAT_NAME 838033,50,Period 2,SOURCE_USER 61,POST_USER 60,1.817713
6,GL account 197,Journal 1,Account No 25840,BAT_NAME 838033,51,Period 2,SOURCE_USER 61,POST_USER 60,0.458869
7,GL account 197,Journal 1,Account No 25840,BAT_NAME 838033,52,Period 2,SOURCE_USER 61,POST_USER 60,1.324966
8,GL account 197,Journal 1,Account No 25840,BAT_NAME 838033,53,Period 2,SOURCE_USER 61,POST_USER 60,0.427253
9,GL account 197,Journal 1,Account No 25840,BAT_NAME 838033,54,Period 2,SOURCE_USER 61,POST_USER 60,0.226314


### Testing the function

In [15]:
import os

In [16]:
path_list = os.listdir('JET_CSV_DATA/JET_CSV_DATA/')

In [18]:
new_path = [i for i in path_list if i.endswith('.csv')]

In [20]:
path_list

['JET_DATA_0.csv',
 'JET_DATA_5.csv',
 'JET_DATA_4.csv',
 'JET_DATA_6.csv',
 'JET_DATA_10.csv',
 'JET_DATA_7.csv',
 'JET_DATA_1.csv',
 'JET_DATA_9.csv',
 'JET_DATA_2.csv',
 '.ipynb_checkpoints',
 'JET_DATA_8.csv',
 'JET_DATA_3.csv']

In [19]:
new_path

['JET_DATA_0.csv',
 'JET_DATA_5.csv',
 'JET_DATA_4.csv',
 'JET_DATA_6.csv',
 'JET_DATA_10.csv',
 'JET_DATA_7.csv',
 'JET_DATA_1.csv',
 'JET_DATA_9.csv',
 'JET_DATA_2.csv',
 'JET_DATA_8.csv',
 'JET_DATA_3.csv']

In [22]:
path_list = ['JET_CSV_DATA/JET_CSV_DATA/JET_DATA_0.csv','JET_CSV_DATA/JET_CSV_DATA/JET_DATA_1.csv']
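
`os.listdir` returns bare file names, which is why the full paths are written out by hand here. A possible alternative, sketched below, joins the directory to each `.csv` name so that every file is picked up. The example runs against a temporary directory with invented file names so that it is self-contained; in the notebook the directory would be `'JET_CSV_DATA/JET_CSV_DATA/'`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Create a few empty placeholder files plus a checkpoint folder (all invented).
    for name in ["JET_DATA_0.csv", "JET_DATA_1.csv", "notes.txt"]:
        open(os.path.join(data_dir, name), "w").close()
    os.mkdir(os.path.join(data_dir, ".ipynb_checkpoints"))

    # Keep only CSV files and prefix each with its directory.
    csv_paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".csv")
    )
    csv_names = [os.path.basename(p) for p in csv_paths]

print(csv_names)
```

The resulting list of full paths can be passed directly to `find_anomaly`.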

In [27]:
find_anomaly(path_list)