This notebook attempts to reproduce the paper [Scalable and Weakly Supervised Bank Transaction Classification](https://arxiv.org/abs/2305.18430), following the article [No Labels? No Problem! A Better Way to Classify Bank Transaction Data](https://medium.com/@echo_neath_ashtrees/no-labels-no-problem-a-better-way-to-classify-bank-transaction-data-73380ce20734).

In [114]:
import numpy as np
import pandas as pd

data = pd.read_csv('../data/CSVData.csv')

In [115]:
data.head(10)

,Date,Expense,Description,Balance
0,06/04/2024,-36.67,Banme Braddon AU AUS Card xx0393 Value Date: 0...,1992.35
1,06/04/2024,-6.45,COLES 4787 CANBERRA AU AUS Card xx0393 Value D...,2029.02
2,06/04/2024,-7.0,Soul Origin Belconnen Belconnen AC AUS Card xx...,2035.47
3,06/04/2024,-4.9,Soul Origin Belconnen Belconnen AC AUS Card xx...,2042.47
4,06/04/2024,-45.17,Vodafone Australia North Sydney AU AUS Card xx...,2047.37
5,05/04/2024,1377.22,Salary HSCT PTY LTD PAY FOR 5/04/2024,2092.54
6,05/04/2024,-17.8,COLES 4787 CANBERRA AU AUS Card xx0393 Value D...,715.32
7,04/04/2024,356.34,Direct Credit 128594 FUEGO NERO PTY L PAY FOR ...,733.12
8,04/04/2024,-11.24,COLES 4787 CANBERRA AU AUS Card xx0393 Value D...,376.78
9,03/04/2024,-16.04,1919 Lanzhou Beef Nood Canberra AC AUS Card xx...,388.02


In [116]:
missing_value_cnt = data.isnull().sum()
missing_value_cnt

Date           0
Expense        0
Description    0
Balance        0
dtype: int64

The dataset from Commonwealth Bank (CommBank) is quite clean: no column has missing values.

In [117]:
# only need description data to train the categorizer
description = data['Description']
description

0      Banme Braddon AU AUS Card xx0393 Value Date: 0...
1      COLES 4787 CANBERRA AU AUS Card xx0393 Value D...
2      Soul Origin Belconnen Belconnen AC AUS Card xx...
3      Soul Origin Belconnen Belconnen AC AUS Card xx...
4      Vodafone Australia North Sydney AU AUS Card xx...
                             ...                        
265    GUZMAN Y GOMEZ SURRY HILLS NS AUS Card xx0393 ...
266    Nespresso Australia BT Canberra AU AUS Card xx...
267    COLES 4787 CANBERRA AU AUS Card xx0393 Value D...
268    Sticky Beak Canberra AC AUS Card xx0393 Value ...
269    Lucky Duck Canberra AC AUS Card xx0393 Value D...
Name: Description, Length: 270, dtype: object

### Step 1: NLP bank description text normalisation and grouping

In [118]:
# text normalisation
from nltk.corpus import stopwords  # requires a one-off nltk.download('stopwords')

# convert to lower case
description = description.str.lower()
# remove digits
description = description.str.replace(r'\d+', '', regex=True)
# remove punctuation: drop everything that is not a word character or whitespace
description = description.str.replace(r'[^\w\s]', '', regex=True)
# strip leading and trailing whitespace
description = description.str.strip()

# remove English stop words (a set makes the membership test fast)
stop = set(stopwords.words('english'))
description = description.apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

# remove bank-specific filler tokens
# not yet sure all of these are uninformative, so this step may be commented out later
useless = {'au', 'aus', 'card', 'xx', 'value', 'date'}
description = description.apply(lambda x: ' '.join(word for word in x.split() if word not in useless))

description

0                           banme braddon
1                          coles canberra
2      soul origin belconnen belconnen ac
3      soul origin belconnen belconnen ac
4         vodafone australia north sydney
                      ...                
265           guzman gomez surry hills ns
266       nespresso australia bt canberra
267                        coles canberra
268               sticky beak canberra ac
269                lucky duck canberra ac
Name: Description, Length: 270, dtype: object
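
The normalisation steps above can also be bundled into one plain-Python function, which makes them easier to test on single strings. This is only a sketch: `normalise_description` and `STOP_WORDS` are names made up for illustration, `STOP_WORDS` is a tiny stand-in for the much longer NLTK English list, and the date at the end of the Coles string is invented because the real description is truncated in the output above.

```python
import re

# Stand-in stop-word set (assumption): the notebook uses NLTK's English list,
# which is far longer. Swap in set(stopwords.words('english')) when available.
STOP_WORDS = {"y", "for", "the", "of", "and", "to"}

# Bank-specific filler tokens, copied from the normalisation cell.
USELESS_WORDS = {"au", "aus", "card", "xx", "value", "date"}


def normalise_description(text: str) -> str:
    """Apply the same steps as the pandas cell to a single description."""
    text = text.lower()                      # lower case
    text = re.sub(r"\d+", "", text)          # remove digits
    text = re.sub(r"[^\w\s]", "", text)      # remove punctuation
    tokens = [
        token
        for token in text.split()            # split() also collapses extra spaces
        if token not in STOP_WORDS and token not in USELESS_WORDS
    ]
    return " ".join(tokens)


examples = [
    "Salary HSCT PTY LTD PAY FOR 5/04/2024",
    # Hypothetical full string: the trailing date is invented for illustration.
    "COLES 4787 CANBERRA AU AUS Card xx0393 Value Date: 01/01/2024",
]
for raw in examples:
    print(normalise_description(raw))
```

With pandas, the same function could be applied to the raw column via `data['Description'].apply(normalise_description)`.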

In [133]:
# grouping
# convert Series to Dataframe
dsc_df = description.to_frame()
dsc_df.columns = ['Name']

# groupby name
dsc_group_df = dsc_df.groupby(['Name']).size().to_frame()
dsc_group_df.columns = ['Count']
dsc_group_df = dsc_group_df.sort_values('Count', ascending=False)
dsc_group_df

Name,Count
coles canberra,40
guzman gomez surry hills ns,26
transfer commbank app,19
soul origin belconnen belconnen ac,14
direct credit fuego nero pty l pay,12
...,...
hokka hokka canberra ac,1
fuego nero braddon ac,1
fresh juice bars pty l canberra,1
football aust merch worongary ql,1
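
The grouping step is an exact-match frequency count over the normalised names. As a quick, dependency-free cross-check, the same idea can be sketched with `collections.Counter`. The `names` list below is a small hand-picked sample of normalised names taken from the output above, not the full 270 rows, so its counts do not match the real table.

```python
from collections import Counter

# Hand-picked sample of normalised names (illustrative subset, not the real data).
names = [
    "coles canberra",
    "soul origin belconnen belconnen ac",
    "soul origin belconnen belconnen ac",
    "coles canberra",
    "coles canberra",
    "lucky duck canberra ac",
]

# Count identical names; most_common() sorts by count, highest first.
counts = Counter(names)
for name, count in counts.most_common():
    print(f"{name},{count}")
```

In pandas, `description.value_counts()` produces the same descending counts in a single call. Because the match is exact, the same merchant can still land in several groups when its description varies, as with `fuego nero braddon ac` and `direct credit fuego nero pty l pay` in the table above.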
