### Generating training data for a multi-label text classification model using Faker and custom utilities
In this notebook, we load the sample input data provided by BDO and use its structure to generate a larger synthetic dataset with the same columns and data types. \
The example uses the `generator` and `utils` modules to create a pandas DataFrame of test data.

#### Setup Environment
1. Configure the `autoreload` extension for the current notebook so that previously imported modules are reloaded automatically when their source changes.
2. Import pandas and the local `generator` module. openpyxl must also be installed, since pandas uses it to read `.xlsx` files.


In [5]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
import pandas as pd
import generator
from typing import Callable


#### Load dataset
Load the sample data into pandas and check the data type of each column. The types indicate what kind of values to generate for each column of the test data.

In [7]:
path = './sample_data/sample_input.xlsx'
sample_df = pd.read_excel(path)
sample_df.dtypes

Factuurnr.                            int64
Naam                                 object
Deb.nr.                               int64
Omschrijving factuur                 object
Boekstuk                     datetime64[ns]
Vervalt                      datetime64[ns]
Factuur                             float64
Saldo                               float64
Factuurreferentie                   float64
Aanmaningen                           int64
Vervaldagen                           int64
Factuur Gbl.                          int64
Aut. bet./inc.                        int64
IBAN-nr.                            float64
Incassomachtiging vereist             int64
Dossier                               int64
Betaaltermijnen                       int64
Betaalwijze                         float64
Factuurstatus                        object
dtype: object
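
A `float64` dtype on a column such as `IBAN-nr.` is worth a second look, because pandas also reports `float64` for a column that contains missing values or is entirely empty. The sketch below uses a small made-up frame (not the BDO file) to show how a null count can tell the two cases apart; the column values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sample data: one fully populated column
# and one column that is entirely empty.
toy_df = pd.DataFrame({
    "Deb.nr.": [1001, 1002, 1003],
    "IBAN-nr.": [np.nan, np.nan, np.nan],
})

# An all-empty column is reported as float64, just like a real float column.
dtypes = toy_df.dtypes

# Counting missing values per column distinguishes the two cases.
null_counts = toy_df.isna().sum()
print(pd.DataFrame({"dtype": dtypes, "missing": null_counts}))
```

Running the same `isna().sum()` check on `sample_df` would show which columns carry real values to imitate and which are empty in the sample.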

#### Prepare Values and Structure for generating test data
Following the dtypes output above, we can:
1. Define how many rows the generated dataset should have (`NUM_ROWS`).
2. Define a utility function, `prepare_values`, that builds a list of values by calling a given function `num_prepared` times (`NUM_ROWS` by default).
3. Write a dictionary in which each key is a column name and each value is a function that uses the Faker library to generate values for that column. For more detail on these functions, see `generator.py`.

In [8]:
NUM_ROWS = 100

def prepare_values(function: Callable, num_prepared: int = NUM_ROWS) -> list:
    return [function() for _ in range(num_prepared)]
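
To see what `prepare_values` returns without depending on `generator.py`, here is a minimal sketch with a stand-in generator built on the standard library's `random` module. The function `fake_debnummer` and its value range are hypothetical and only for illustration; the real functions in `generator.py` use Faker.

```python
import random
from typing import Callable

NUM_ROWS = 100

def prepare_values(function: Callable, num_prepared: int = NUM_ROWS) -> list:
    # Call the zero-argument generator once per row and collect the results.
    return [function() for _ in range(num_prepared)]

def fake_debnummer() -> int:
    # Hypothetical stand-in: a random five-digit debtor number.
    return random.randint(10000, 99999)

values = prepare_values(fake_debnummer, num_prepared=5)
print(values)  # five random integers
```

Because the default for `num_prepared` is bound when the function is defined, changing `NUM_ROWS` afterwards has no effect until the definition is re-run. Passing `num_prepared` explicitly avoids that surprise.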

In [10]:
# Map each column name to the zero-argument function that generates its values.
# Several columns reuse generator.debnummer or generator.factuur
# rather than a dedicated function.
table_dict = {
    "Factuurnr.": generator.factuurnummer,
    "Naam": generator.name,
    "Deb.nr.": generator.debnummer,
    "Omschrijving factuur": generator.omschrijving,
    "Boekstuk": generator.boekstuk,
    "Vervalt": generator.vervalt,
    "Factuur": generator.factuur,
    "Saldo": generator.factuur,
    "Factuurreferentie": generator.debnummer,
    "Aanmaningen": generator.debnummer,
    "Vervaldagen": generator.debnummer,
    "Factuur Gbl.": generator.debnummer,
    "Aut. bet./inc.": generator.debnummer,
    "IBAN-nr.": generator.iban,
    "Incassomachtiging vereist": generator.debnummer,
    "Dossier": generator.debnummer,
    "Betaaltermijnen": generator.debnummer,
    "Betaalwijze": generator.debnummer,
    "Factuurstatus": generator.factuurstatus,
}

columns = list(table_dict.keys())
df = pd.DataFrame(columns=columns)

In [11]:
for key, function in table_dict.items():
    df[key] = prepare_values(function)
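
Filling an empty frame column by column works, but the same result can be built in one step by passing a dictionary of lists to the `DataFrame` constructor. A minimal sketch follows; the two generator functions and the status values are hypothetical stand-ins for those in `generator.py`.

```python
import random

import pandas as pd

NUM_ROWS = 100

# Hypothetical stand-ins for the Faker-backed functions in generator.py.
def fake_factuurnummer() -> int:
    return random.randint(100000, 999999)

def fake_factuurstatus() -> str:
    return random.choice(["Open", "Betaald"])

toy_table = {
    "Factuurnr.": fake_factuurnummer,
    "Factuurstatus": fake_factuurstatus,
}

# Build every column first, then construct the frame once.
toy_generated = pd.DataFrame(
    {
        column: [function() for _ in range(NUM_ROWS)]
        for column, function in toy_table.items()
    }
)
print(toy_generated.shape)  # (100, 2)
```

Constructing the frame once also lets pandas infer each column's dtype from its values instead of starting from empty `object` columns.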

In [14]:
gen_path = './generated_data/gen_data_v2.csv'
df.to_csv(gen_path, index=False)
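
Note that `to_csv` does not create missing directories, so the write fails if `./generated_data/` does not exist yet. One way to guard against that is sketched below with `pathlib`. It writes a made-up frame to a temporary directory so that it can be run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_out = pd.DataFrame({"Factuurnr.": [1, 2, 3]})  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "generated_data" / "gen_data_v2.csv"
    # Create the parent directory (and any missing ancestors) before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy_out.to_csv(out_path, index=False)
    # Read the file back to confirm the round trip.
    round_trip = pd.read_csv(out_path)

print(round_trip.shape)  # (3, 1)
```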