This notebook tries to reproduce the paper [Scalable and Weakly Supervised Bank Transaction Classification](https://arxiv.org/abs/2305.18430), following the article [No Labels? No Problem! A Better Way to Classify Bank Transaction Data](https://medium.com/@echo_neath_ashtrees/no-labels-no-problem-a-better-way-to-classify-bank-transaction-data-73380ce20734).

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('../data/CSVData.csv')

In [2]:
data.head(10)

| | Date | Expense | Description | Balance |
|---|---|---|---|---|
| 0 | 06/04/2024 | -36.67 | Banme Braddon AU AUS Card xx0393 Value Date: 0... | 1992.35 |
| 1 | 06/04/2024 | -6.45 | COLES 4787 CANBERRA AU AUS Card xx0393 Value D... | 2029.02 |
| 2 | 06/04/2024 | -7.0 | Soul Origin Belconnen Belconnen AC AUS Card xx... | 2035.47 |
| 3 | 06/04/2024 | -4.9 | Soul Origin Belconnen Belconnen AC AUS Card xx... | 2042.47 |
| 4 | 06/04/2024 | -45.17 | Vodafone Australia North Sydney AU AUS Card xx... | 2047.37 |
| 5 | 05/04/2024 | 1377.22 | Salary HSCT PTY LTD PAY FOR 5/04/2024 | 2092.54 |
| 6 | 05/04/2024 | -17.8 | COLES 4787 CANBERRA AU AUS Card xx0393 Value D... | 715.32 |
| 7 | 04/04/2024 | 356.34 | Direct Credit 128594 FUEGO NERO PTY L PAY FOR ... | 733.12 |
| 8 | 04/04/2024 | -11.24 | COLES 4787 CANBERRA AU AUS Card xx0393 Value D... | 376.78 |
| 9 | 03/04/2024 | -16.04 | 1919 Lanzhou Beef Nood Canberra AC AUS Card xx... | 388.02 |


In [3]:
missing_value_cnt = data.isnull().sum()
missing_value_cnt

Date           0
Expense        0
Description    0
Balance        0
dtype: int64

The dataset exported from CommBank is quite clean: no column has missing values.

In [4]:
# the categoriser only needs the description text and the transaction amount
description = data[['Description', 'Expense']]
description

| | Description | Expense |
|---|---|---|
| 0 | Banme Braddon AU AUS Card xx0393 Value Date: 0... | -36.67 |
| 1 | COLES 4787 CANBERRA AU AUS Card xx0393 Value D... | -6.45 |
| 2 | Soul Origin Belconnen Belconnen AC AUS Card xx... | -7.00 |
| 3 | Soul Origin Belconnen Belconnen AC AUS Card xx... | -4.90 |
| 4 | Vodafone Australia North Sydney AU AUS Card xx... | -45.17 |
| ... | ... | ... |
| 265 | GUZMAN Y GOMEZ SURRY HILLS NS AUS Card xx0393 ... | -14.00 |
| 266 | Nespresso Australia BT Canberra AU AUS Card xx... | -48.20 |
| 267 | COLES 4787 CANBERRA AU AUS Card xx0393 Value D... | -23.20 |
| 268 | Sticky Beak Canberra AC AUS Card xx0393 Value ... | -8.09 |


### Step 1: NLP bank description text normalisation and grouping

In [5]:
# text normalisation
# work on an independent copy so the assignments below do not trigger SettingWithCopyWarning
description = description.copy()
# convert to lower case
description['Description'] = description['Description'].str.lower()
# remove digits
description['Description'] = description['Description'].str.replace(r'\d+', '', regex=True)
# remove punctuation (anything that is not a word character or whitespace)
description['Description'] = description['Description'].str.replace(r'[^\w\s]', '', regex=True)
# strip leading and trailing whitespace
description['Description'] = description['Description'].str.strip()

# remove English stop words (needs a one-off nltk.download('stopwords'))
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
description['Description'] = description['Description'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

# remove tokens that appear in most card transactions and say nothing about the merchant
# (tentative list: not sure all of these are useless, so it may be revisited)
useless = {'au', 'aus', 'card', 'xx', 'value', 'date'}
description['Description'] = description['Description'].apply(lambda x: ' '.join(word for word in x.split() if word not in useless))

description

| | Description | Expense |
|---|---|---|
| 0 | banme braddon | -36.67 |
| 1 | coles canberra | -6.45 |
| 2 | soul origin belconnen belconnen ac | -7.00 |
| 3 | soul origin belconnen belconnen ac | -4.90 |
| 4 | vodafone australia north sydney | -45.17 |
| ... | ... | ... |
| 265 | guzman gomez surry hills ns | -14.00 |
| 266 | nespresso australia bt canberra | -48.20 |
| 267 | coles canberra | -23.20 |
| 268 | sticky beak canberra ac | -8.09 |
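
The normalisation steps of this section can also be folded into one helper, so that the same cleaning is applied to any new description. This is only a minimal sketch, not the notebook's code: `normalise_description`, `STOP_WORDS` and `NOISE_TOKENS` are hypothetical names, the tiny stop-word set is a stand-in for NLTK's English list so the snippet runs without downloads, and the sample strings are made up in the style of the export.

```python
import re

# Stand-in for NLTK's English stop words (assumption: a handful is enough for this demo).
STOP_WORDS = {"for", "the", "a", "of", "and", "to"}
# Same tokens as the notebook's `useless` list.
NOISE_TOKENS = {"au", "aus", "card", "xx", "value", "date"}


def normalise_description(text: str) -> str:
    """Lower-case, drop digits and punctuation, then drop stop words and noise tokens."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS and t not in NOISE_TOKENS]
    return " ".join(tokens)


# Made-up examples in the style of the bank export.
print(normalise_description("COLES 4787 CANBERRA AU AUS Card xx0393 Value Date: 04/04/2024"))
print(normalise_description("Salary HSCT PTY LTD PAY FOR 5/04/2024"))
```

Because `split()` followed by `' '.join(...)` collapses repeated whitespace, the double spaces left behind when digits are removed do not survive into the cleaned text.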


Group by the normalised `Description` column and compute the maximum and median amount of each group (`amount_max` and `amount_median`).

In [8]:
# Split data into expenses and income
# (.copy() gives independent frames, so the assignment below raises no SettingWithCopyWarning)
expense_df = description[description['Expense'] < 0].copy()
income_df = description[description['Expense'] > 0].copy()
# reset index
expense_df.reset_index(drop=True, inplace=True)
income_df.reset_index(drop=True, inplace=True)
# store expenses as positive amounts
expense_df['Expense'] = expense_df['Expense'].abs()

expense_df, income_df

(                            Description  Expense
 0                         banme braddon    36.67
 1                        coles canberra     6.45
 2    soul origin belconnen belconnen ac     7.00
 3    soul origin belconnen belconnen ac     4.90
 4       vodafone australia north sydney    45.17
 ..                                  ...      ...
 234         guzman gomez surry hills ns    14.00
 235     nespresso australia bt canberra    48.20
 236                      coles canberra    23.20
 237             sticky beak canberra ac     8.09
 238              lucky duck canberra ac    31.56
 
 [239 rows x 2 columns],
                                  Description  Expense
 0                    salary hsct pty ltd pay  1377.22
 1         direct credit fuego nero pty l pay   356.34
 2         direct credit fuego nero pty l pay   232.21
 3                    salary hsct pty ltd pay  1582.89
 4                      transfer commbank app   300.00
 5                      transfer commbank a

In [9]:
expense_cal_df = expense_df.groupby('Description').agg({'Expense': ['max', 'median']}).reset_index()
expense_cal_df.columns = ['clean_text', 'amount_max', 'amount_median']
expense_cal_df

| | clean_text | amount_max | amount_median |
|---|---|---|---|
| 0 | access canb shopfront belconnen | 70.00 | 70.00 |
| 1 | act gov parking fees canberra | 3.00 | 2.13 |
| 2 | aesop south yarra south yarra | 70.00 | 67.50 |
| 3 | aga ovhc wollongong ns | 85.80 | 85.80 |
| 4 | aldi stores canberra canberra | 4.00 | 4.00 |
| ... | ... | ... | ... |
| 91 | vodafone australia north sydney | 45.17 | 45.17 |
| 92 | wl vue testing exam bloomington mn usa aud | 445.01 | 445.01 |
| 93 | woolworths batemans bay | 23.90 | 23.90 |
| 94 | yat bun tong braddon ac | 59.60 | 59.60 |
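
The per-description statistics can also be computed with pandas named aggregation, which produces flat column names directly instead of a two-level header that has to be overwritten. The sketch below uses a made-up toy frame; `toy` and `toy_stats` are hypothetical names and the amounts are illustrative only.

```python
import pandas as pd

# Toy data (made up) standing in for the cleaned expense table.
toy = pd.DataFrame(
    {
        "Description": ["coles canberra", "coles canberra", "coles canberra", "banme braddon"],
        "Expense": [6.45, 17.80, 11.24, 36.67],
    }
)

# Named aggregation: each keyword becomes an output column.
toy_stats = (
    toy.groupby("Description")["Expense"]
    .agg(amount_max="max", amount_median="median")
    .reset_index()
    .rename(columns={"Description": "clean_text"})
)
print(toy_stats)
```

Grouping on the cleaned text means every repeat visit to the same merchant collapses into one row, so the labelling functions in the next step see one example per merchant string rather than one per transaction.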


### Step 2: weak label generation

In [20]:
import re
from snorkel.labeling import labeling_function

# Each labelling function either votes MATCH or abstains
ABSTAIN = -1
MATCH = 1

@labeling_function()
def lf_heur_amount(x):
    # Sample labelling function using a heuristic on the aggregated amounts
    if x["amount_max"] >= 100 and x["amount_median"] >= 10:
        return MATCH
    return ABSTAIN

match_regexes = ["fee", "bank", "cash", "atm"]

@labeling_function()
def lf_regex_text(x):
    # Vote MATCH if any keyword occurs anywhere in the cleaned description
    for match in match_regexes:
        if re.search(match, x["clean_text"]):
            return MATCH
    return ABSTAIN

lfs = [lf_heur_amount, lf_regex_text]
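
One thing to keep in mind with `lf_regex_text`: `re.search` with a bare keyword matches anywhere in the string, so `fee` also fires on a word such as `coffee`. Whether that matters depends on the data. The sketch below compares the substring behaviour with a word-boundary variant; `matches_substring`, `matches_keyword` and `BOUNDED` are hypothetical names, the second and third strings are made up, and the stricter variant is not part of the notebook or the paper.

```python
import re

KEYWORDS = ["fee", "bank", "cash", "atm"]


def matches_substring(text: str) -> bool:
    """Substring matching, the same behaviour as lf_regex_text."""
    return any(re.search(k, text) for k in KEYWORDS)


# Hypothetical stricter variant: whole words only, with an optional plural "s".
BOUNDED = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")s?\b")


def matches_keyword(text: str) -> bool:
    return BOUNDED.search(text) is not None


for text in ["act gov parking fees canberra", "corner coffee braddon", "atm withdrawal"]:
    print(f"{text!r}: substring={matches_substring(text)}, bounded={matches_keyword(text)}")
```

Switching to the bounded pattern would change which rows the labelling function covers, so the coverage figures would need to be recomputed.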

In [23]:
from snorkel.labeling import PandasLFApplier, LFAnalysis
from snorkel.labeling.model.label_model import LabelModel

# Apply the LFs to the data; L_train has one row per description and one column per LF
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=expense_cal_df)

# Coverage = fraction of rows on which an LF does not abstain.
# Columns follow the order of `lfs`: "check_out" is lf_heur_amount, "check" is lf_regex_text.
coverage_check_out, coverage_check = (L_train != ABSTAIN).mean(axis=0)
print(f"check_out coverage: {coverage_check_out * 100:.1f}%")
print(f"check coverage: {coverage_check * 100:.1f}%")


100%|██████████| 96/96 [00:00<00:00, 35848.75it/s]

check_out coverage: 16.7%
check coverage: 7.3%
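
To make the coverage numbers concrete, here is a small sketch on a made-up label matrix. It computes the per-LF coverage in the same way as the cell above, plus the share of rows labelled by at least one LF and by both. `L_toy` and the derived names are hypothetical, and the values have nothing to do with the real data.

```python
import numpy as np

ABSTAIN = -1
MATCH = 1

# Toy label matrix (made up): one row per description, one column per labelling function.
L_toy = np.array(
    [
        [MATCH, ABSTAIN],
        [ABSTAIN, ABSTAIN],
        [MATCH, MATCH],
        [ABSTAIN, ABSTAIN],
    ]
)

voted = L_toy != ABSTAIN
coverage_per_lf = voted.mean(axis=0)       # share of rows each LF labels
total_coverage = voted.any(axis=1).mean()  # share of rows labelled by at least one LF
overlap = voted.all(axis=1).mean()         # share of rows labelled by both LFs

print(coverage_per_lf, total_coverage, overlap)
```

Snorkel's `LFAnalysis`, which the notebook imports but does not use yet, reports comparable statistics through `LFAnalysis(L=L_train, lfs=lfs).lf_summary()`.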





In [None]:
# Fit the label model and get the training labels
label_model = LabelModel(cardinality=2, verbose=True)  # assume binary classification
label_model.fit(L_train=L_train, n_epochs=500, log_freq=50, seed=123)
expense_cal_df["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

expense_cal_df
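
As a sanity check for the label model's output, a plain majority vote over the label matrix is a useful baseline. The sketch below is a simple stand-in written with NumPy, not Snorkel's `LabelModel`: `majority_vote` and `L_toy` are hypothetical, and the toy matrix includes a vote for class 0 purely for illustration. In this notebook both labelling functions can only return `MATCH` or `ABSTAIN`, so under majority vote every row would end up as either 1 or -1.

```python
import numpy as np

ABSTAIN = -1


def majority_vote(L: np.ndarray, cardinality: int = 2) -> np.ndarray:
    """Return the most-voted class per row, or ABSTAIN when there are no votes or a tie."""
    preds = np.full(L.shape[0], ABSTAIN)
    for i, row in enumerate(L):
        votes = row[row != ABSTAIN]
        if votes.size == 0:
            continue  # no labelling function fired on this row
        counts = np.bincount(votes, minlength=cardinality)
        winners = np.flatnonzero(counts == counts.max())
        if winners.size == 1:  # a unique winner; ties stay as ABSTAIN
            preds[i] = winners[0]
    return preds


# Toy label matrix (made up): rows are descriptions, columns are labelling functions.
L_toy = np.array([[1, -1], [-1, -1], [1, 1], [0, 1]])
print(majority_vote(L_toy))
```

Snorkel also ships a `MajorityLabelVoter` in `snorkel.labeling.model`, which could be used instead of this hand-written version when comparing against the fitted label model.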