# Transaction Classification



## Challenge

For a given set of financial transactions, classify each one into one of seven revenue (transaction) categories.
In this challenge, a Multinomial Naive Bayes classifier is trained on the transaction word counts and the class category values. The categories are:
- Income
- Private (cash, deposit, donation, presents)
- Living (rent, additional flat expenses, ...)
- Standard of living (food, health, children, ...)
- Finance (credit, bank costs, insurances, savings)
- Traffic (public transport, gas stations, bike, car rent, ...)
- Leisure (hobby, sport, vacation, shopping, ...) 
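
As a minimal sketch of the idea, the cell below fits a Multinomial Naive Bayes model on word counts. The transaction texts and the two labels are hypothetical toy values, not rows from the data set, and the variable names are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy transactions (not taken from the data set)
texts = [
    "gehalt lohn arbeitgeber",
    "miete wohnung nebenkosten",
    "lohn gehalt bonus",
    "miete strom wohnung",
]
labels = ["income", "living", "income", "living"]

# Turn each text into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit the classifier on the count matrix and the class labels
clf = MultinomialNB()
clf.fit(X, labels)

# Classify an unseen text using the same vocabulary
prediction = clf.predict(vectorizer.transform(["gehalt bonus"]))
print(prediction)
```

Both words of the unseen text occur only in the "income" examples, so the model assigns it to that class.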


## Approach
Implement a Generic Naive Bayes Classifier

1. Clean and prepare the given data
2. Label the data and store it
3. Define the features you want to use
4. Prepare your features / transform them into a format you can work with
5. Train your model
6. Evaluate your model
7. Visualize your results
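
The training and evaluation steps in this list could be wired together roughly as follows. This is only a sketch under stated assumptions: the texts and labels are hypothetical, and the use of `make_pipeline` with a stratified split is one possible design, not necessarily the one used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already cleaned and labelled transaction texts
texts = [
    "gehalt lohn", "lohn bonus", "gehalt bonus", "lohn gehalt arbeitgeber",
    "miete wohnung", "miete strom", "wohnung nebenkosten", "miete nebenkosten strom",
]
labels = ["income"] * 4 + ["living"] * 4

# Hold out a stratified test set so both classes appear in it
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Feature extraction and classifier chained into one estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Evaluate on the held-out texts
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
print(report)
```

Chaining the vectorizer and the classifier ensures that the vocabulary is learned from the training split only.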


In [None]:
# Load necessary libraries for data processing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Imports for Classification 
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif, SelectKBest, f_classif
from sklearn.metrics import confusion_matrix, classification_report

## Data

The first step is to load the data into a pandas DataFrame
and get some basic information about the dataset.

In [None]:
# Load the data as pandas Dataframe structure
data = pd.read_csv("SAKI Exercise 1 - Transaction Classification - Data Set.csv",sep=';',index_col=0, encoding='utf8',header=0)

# Print summary of the column names and Data types 
print(' Columns  \t \t \t Data Types\n' + 20*'---'+'\n', data.dtypes)


#### Data Exploration

To get a better understanding of the data and its valuable features, some basic exploration is done.

In [None]:
# Print the first rows of the data as an overview
data.head()

In [None]:
print(' Columns  \t \t \t Unique Values \t \t \t Empty Values \n' + 40*'--')
for col in data.columns: 
    print("{0: <35} {1: <35} {2: <35}".format(col, len(data[col].unique()), data[col].isnull().sum()))

Explore the class frequencies to see whether there is a class imbalance.

In [None]:
# Print a summary of the class labels and their frequencies
print('\nClass Names \t  Class Frequencies\n' + 20*'--')
# value_counts() returns the counts indexed by label, so iterate over its items
for class_label, c in data.label.value_counts().items():
    print("{0: <20} {1: <20}".format(class_label, c))
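
Relative frequencies make an imbalance easier to judge than raw counts. The sketch below uses a hypothetical `labels` Series in place of the notebook's `data.label` column.

```python
import pandas as pd

# Hypothetical label column standing in for data.label
labels = pd.Series(
    ["income", "living", "living", "leisure", "living", "income"], name="label"
)

# Absolute counts and relative shares per class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Both Series are indexed by label, so they align into one table
summary = pd.DataFrame({"count": counts, "share": shares.round(2)})
print(summary)
```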

### Clean data

Auxiliary functions to standardize the text columns and their format. Invalid characters are removed to make the text consistent and avoid mistakes.

In [None]:
import string

from nltk.corpus import stopwords

# Preprocessing function
# NOTE: the remove_stopwords argument is currently unused
def preprocess_text_fields(text, remove_stopwords=True):
    # Replace each punctuation character with a space
    text = text.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
    # Convert text to lower case
    text = text.lower()
    return text

# def remove_special_characters_and_numbers(str):
#     str = re.sub('[\W_]+', '', str)
#     str = re.sub('[\d_]+', '', str)
#     return str;

def remove_special_characters(df_column): 
    df_column = df_column.str.lower() # Just for assurance
    df_column = df_column.str.replace('[^a-z-]', ' ', regex=True)  # the pattern is a regex, so request regex matching explicitly
    return df_column


def tokenize(df_column, language = 'german'):
    # Use a separate name so the imported `stopwords` module is not shadowed
    stop_words = set(stopwords.words(language))
    df_column = df_column.str.replace(r'\s+', ' ', regex=True)
    tokens  = df_column.apply(lambda l: ' '.join([word for word in l.split() if (word not in stop_words and len(word) > 2)]))
    return tokens

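import nltk  # word_tokenize requires the NLTK tokenizer data to be downloaded

def tokenize_text(text, language = 'german'): 
    stop_words = set(stopwords.words(language))
    # Build a new list instead of removing items from the list while iterating over it
    tokens = [t for t in nltk.word_tokenize(text) if t not in stop_words]
    # Join the tokens
    return " ".join(tokens)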
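
To illustrate what the cleaning helpers are meant to do, here is a self-contained sketch on two made-up strings. `clean_column` and the small `STOP_WORDS` set are hypothetical stand-ins: the notebook's own functions use NLTK's German stop word list instead.

```python
import pandas as pd

# Small illustrative stop word list (assumption; the notebook uses NLTK's list)
STOP_WORDS = {"der", "die", "das", "und", "von"}

def clean_column(column: pd.Series) -> pd.Series:
    # Lowercase, then replace everything except a-z and '-' with a space
    column = column.str.lower()
    column = column.str.replace(r"[^a-z-]", " ", regex=True)
    # Drop stop words and tokens of two characters or fewer
    return column.apply(
        lambda text: " ".join(
            w for w in text.split() if w not in STOP_WORDS and len(w) > 2
        )
    )

raw = pd.Series(["Miete und Nebenkosten 03/2016", "Lohn von der Firma GmbH"])
cleaned = clean_column(raw)
print(cleaned.tolist())
```

Note that the pattern `[^a-z-]` also replaces umlauts and digits with spaces, which may or may not be desired for German text.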


In [None]:
# Remove columns that have few distinct values and are therefore probably not meaningful for classification; keeping them could lead to overfitting
non_meaningul_columns = ["Auftragskonto",  "Valutadatum", "Kontonummer", "BLZ", "Waehrung"]
data = data.drop(columns=non_meaningul_columns)
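
Instead of listing such columns by hand, candidates could be found from the number of distinct values per column. The frame and the `max_unique` threshold below are hypothetical and only illustrate the idea; the column list above was chosen manually.

```python
import pandas as pd

# Hypothetical miniature of the transaction table
frame = pd.DataFrame({
    "Waehrung": ["EUR", "EUR", "EUR", "EUR"],
    "Buchungstext": ["miete", "lohn", "einkauf", "miete"],
    "Betrag": [-500.0, 2000.0, -35.5, -500.0],
})

# Number of distinct values per column
unique_counts = frame.nunique()

# Assumed threshold: columns with at most this many distinct values are dropped
max_unique = 1
low_cardinality = unique_counts[unique_counts <= max_unique].index.tolist()

reduced = frame.drop(columns=low_cardinality)
print(low_cardinality)
```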

In [None]:
## Clean and prepare the given data casting to proper Data Type

# Replace missing values with a numeric sentinel (-11111)
data = data.fillna(-11111)

# Convert the decimal comma in 'Betrag' to a decimal point and cast to float
data['Betrag'] = data['Betrag'].astype(str).str.replace(',', '.', regex=False).astype('float')

#Convert to String  Datatypes
# data['Kontonummer'] = data['Kontonummer'].astype('str')
# data['BLZ'] = data['BLZ'].astype('str')
#Convert to Date Time format
#data['Buchungstag'] = data.to_datetime(df['Buchungstag'])

# Convert the text fields to lowercase
data['Buchungstext'] = data['Buchungstext'].str.lower()
data['Beguenstigter/Zahlungspflichtiger'] = data['Beguenstigter/Zahlungspflichtiger'].str.lower()

# Store the Column Index
column_index = data.columns

# Select columns that are not necessary ( no causality) to the labels by removing them from the data
#delete_column_ind = ['Auftragskonto', 'Buchungstag', 'Valutadatum', 'Kontonummer', 'BLZ', 'Waehrung']
#data.drop(data[delete_column_ind], axis=1, inplace=True)
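
As an alternative to converting the amount column after loading, `pd.read_csv` accepts a `decimal` argument that parses decimal commas directly. The sketch below uses a small in-memory CSV with made-up values rather than the real data file.

```python
import io

import pandas as pd

# Made-up CSV snippet using ';' as separator and ',' as decimal mark
csv_text = "id;Betrag\n0;-12,50\n1;2000,00\n"

# decimal=',' tells the parser to read '-12,50' as the float -12.5
frame = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0, decimal=",")
print(frame.dtypes)
```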


## Features

In [None]:
data.describe(include="all")

In [None]:
from sklearn.model_selection import train_test_split
