In [1]:
import pandas as pd
import os 
import re
from unidecode import unidecode

### Objective:
We need to put the data in a generic label-text format, as is standard practice for text classification.

If the same text entry has multiple labels, the text can be repeated on separate rows, one per label, for training.
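
As a minimal sketch of this row duplication, assuming a hypothetical column `labels` that holds a list of labels per text (not a column of the actual dataset), `DataFrame.explode` yields one row per text-label pair:

```python
import pandas as pd

# Toy data: the 'labels' column (a list per row) is a hypothetical layout,
# not a column of municip_faudit_plans.
multi = pd.DataFrame({
    'text': ['riduzione tariffe asilo nido', 'sentieri tematici'],
    'labels': [[26, 30], [11]],
})

# explode() repeats the text once for every label in its list.
long_format = (
    multi.explode('labels')
         .rename(columns={'labels': 'label'})
         .reset_index(drop=True)
)
long_format['label'] = long_format['label'].astype(int)
```

The first text ends up on two rows, one for each of its labels, which is the label-text layout described above.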

The ```municip_faudit_plans.csv``` dataset was created by unifying several sources that describe, with different parameters, the actions in the plans of Italian municipalities that have participated in and obtained a Family Audit certification. 

The file format may be one of the following: ```.csv```, ```.gzip```, ```.xlsx```, ```.json```, ```.feather```
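
Since the input may arrive in any of these formats, one possible approach is to pick the pandas reader from the file extension. The helper below is a hypothetical sketch (`load_table` and `READERS` are not part of the repository), and it assumes that ```.gzip``` denotes a gzip-compressed Parquet file, as in the cell that follows:

```python
from pathlib import Path
import pandas as pd

# Hypothetical mapping from extension to pandas reader.
# '.gzip' is assumed to be a gzip-compressed Parquet file.
READERS = {
    '.csv': pd.read_csv,
    '.gzip': pd.read_parquet,
    '.xlsx': pd.read_excel,
    '.json': pd.read_json,
    '.feather': pd.read_feather,
}

def load_table(path):
    """Load a table with the reader that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f'Unsupported file extension: {suffix!r}')
    return READERS[suffix](path)
```

Each reader may need extra arguments for a specific file (separator, sheet name, orientation), so this is only a starting point.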

In [2]:
df = pd.read_parquet('municip_faudit_plans.gzip')
org = pd.read_parquet('')

The data used for training contains: 

- ```ID_tassonomia```
- ```titolo```
- ```descrizione```
- ```obiettivo```

```ID_tassonomia``` is the category to predict. The full list of ```ID_tassonomia``` values can be found in [correspondence.csv](https://github.com/FluveFV/faudit-classifier/blob/main/src/correspondence.csv), along with their relationship to the other categories of the taxonomy.

The other fields (title, description, objective) are merged into a single text and pre-processed to remove non-ASCII characters. Here is an example function to do so:

In [88]:
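def formatter(og, c=None):
    """Build a DataFrame with one lowercase 'text' column (plus the optional columns in c)."""
    df = og.fillna('')
    r = {}
    tdo = []
    if c is not None:
        assert isinstance(c, list), 'The additional column(s) must be in a list'
        # Carry over any extra columns unchanged
        for el in c:
            r[el] = df[el]

    # Clean each field, then join title, description and objective with ' . '
    for t, d, o in zip(df.titolo, df.descrizione, df.obiettivo):
        t = ascificatore(t)
        d = ascificatore(d)
        o = ascificatore(o)
        tdo.append((t + ' . ' + d + ' . ' + o).lower())
    r['text'] = tdo

    return pd.DataFrame(r)

def ascificatore(s):
    # Replace runs of \r, \n, \t with a single space, skipping blank pieces
    s = ' '.join(piece for piece in re.split(r'[\r\n\t]+', s) if piece.strip())
    # Non-ASCII characters are dropped by encode(..., 'ignore') before unidecode sees the string
    return unidecode(s.encode('ascii', 'ignore').decode())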
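
To illustrate what the cleaning step does to a single string, here is a self-contained sketch that mimics `ascificatore` with the standard library only. The name `ascii_clean` and the `transliterate` flag are hypothetical. With `transliterate=False` it drops non-ASCII characters, like the encode/decode step above. With `transliterate=True` it first decomposes accented letters via `unicodedata.normalize('NFKD', ...)` so that the base letter survives; this is a swapped-in alternative, not the notebook's method.

```python
import re
import unicodedata

def ascii_clean(s, transliterate=False):
    # Collapse runs of \r, \n, \t into single spaces, skipping blank pieces.
    s = ' '.join(p for p in re.split(r'[\r\n\t]+', s) if p.strip())
    if transliterate:
        # NFKD splits an accented letter into base letter + combining mark;
        # the combining mark is then removed by the ASCII encoding below.
        s = unicodedata.normalize('NFKD', s)
    return s.encode('ascii', 'ignore').decode()

sample = 'Comunit\u00e0\n\tdi valle'  # toy string with an accented letter
dropped = ascii_clean(sample)
kept = ascii_clean(sample, transliterate=True)
```

Which variant is preferable depends on whether the tokenizer of the downstream model benefits from keeping the base letters.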

In [89]:
data = formatter(df)
data

Unnamed: 0,text
0,adesione al piano giovani di zona della comuni...
1,riduzione tariffe asilo nido dal 1 gennaio 201...
2,revisione parametri icef per servizio tagesmut...
3,agevolazione per lacquisto kit pannolini lavab...
4,abbattimento della quota di iscrizione al serv...
...,...
20303,fasciatoio . il bagno al pianterreno del munic...
20304,allattamento . all'interno della biblioteca co...
20305,sentieri . i cinque sentieri tematici che perm...
20306,sentieri . i cinque sentieri tematici che perm...


Then, we attach the label we want to predict. One can also train the model on 'macrocategoria' or 'field' instead: both are less granular versions of 'ID_tassonomia' and make for a simpler prediction task.
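
A minimal sketch of switching to a coarser label, assuming a correspondence table with columns `ID_tassonomia` and `macrocategoria`. The actual column names in correspondence.csv may differ, and the values below are invented:

```python
import pandas as pd

# Invented toy values; the real mapping lives in correspondence.csv and
# its column names are an assumption here.
correspondence = pd.DataFrame({
    'ID_tassonomia': [11, 26, 30],
    'macrocategoria': ['A', 'B', 'B'],
})
fine = pd.DataFrame({
    'text': ['adesione al piano giovani', 'riduzione tariffe asilo nido',
             'agevolazione kit pannolini'],
    'label': [11, 26, 30],
})

# Map each fine-grained label to its coarser category.
to_macro = correspondence.set_index('ID_tassonomia')['macrocategoria']
coarse = fine.assign(label=fine['label'].map(to_macro))
```

Labels missing from the mapping become `NaN` after `map`, so it is worth checking for them before training.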

In [90]:
data['label'] = df['ID_tassonomia']
data.drop_duplicates(inplace=True)
data.head()

Unnamed: 0,text,label
0,adesione al piano giovani di zona della comuni...,11
1,riduzione tariffe asilo nido dal 1 gennaio 201...,26
2,revisione parametri icef per servizio tagesmut...,26
3,agevolazione per lacquisto kit pannolini lavab...,30
4,abbattimento della quota di iscrizione al serv...,26


### Important note
Given the nature of the data, the classes are often very imbalanced. Since ```train.py``` automatically splits the data in a stratified manner, every class needs more than 3 occurrences, so that it can be represented in each of the train/val/test splits. 

Furthermore, the rarest classes are not expected to appear in future data, given how seldom municipalities use them. We therefore suggest keeping only classes with at least 30 observations in the processed data.
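
Before settling on a threshold, it can help to see how many classes and rows each candidate value would discard. The sketch below uses invented label counts and a hypothetical helper, `dropped_by_threshold`; the thresholds 3 and 30 are the two values discussed above.

```python
from collections import Counter

def dropped_by_threshold(labels, min_count):
    """Return (classes dropped, rows dropped) when classes with fewer
    than min_count occurrences are removed."""
    counts = Counter(labels)
    rare = {lab: n for lab, n in counts.items() if n < min_count}
    return len(rare), sum(rare.values())

# Invented toy labels: class 26 is frequent, 11 is small, 30 is very rare.
labels = [26] * 40 + [11] * 5 + [30] * 2

report = {t: dropped_by_threshold(labels, t) for t in (3, 30)}
```

On the real data, `labels` would be the `label` column of the processed DataFrame.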

In [91]:
before = data.shape[0]

In [93]:
counts = data['label'].value_counts()
c = counts[counts >= 30].index.tolist()  # classes with at least 30 observations
data = data[data.label.isin(c)]
after = data.shape[0]
print(f'{before-after} rows have been removed. Remaining classes: {data.label.unique().shape[0]}')

230 rows have been removed. Remaining classes: 56


In [33]:
os.makedirs('dataset_BERT', exist_ok=True)
data.to_parquet('dataset_BERT/addestramento.gzip', compression='gzip')  # or to_csv, to_excel, etc.

This is the pre-processing in its simplest version. We also checked for potential data leaks: here we mean texts on different rows that are not exactly identical, but similar enough that the training and test sets may end up sharing almost identical entries. 

When such near-duplicates are left in, accuracy often improves, but so does the risk of overfitting, since only about 30% of the entries in the original dataset may be truly unique. This percentage depends on how much the text entries differ from one another.
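
One way to look for such near-duplicates is a pairwise similarity check. The sketch below uses `difflib.SequenceMatcher` from the standard library with a hypothetical threshold of 0.9 on toy texts. It is not necessarily the check used for this dataset, and its quadratic cost makes it suitable only for small samples.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(texts, threshold=0.9):
    """Return index pairs whose similarity ratio is >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs

# Toy texts: the first two differ only by a trailing period.
texts = [
    'sentieri . i cinque sentieri tematici del comune',
    'sentieri . i cinque sentieri tematici del comune.',
    'riduzione tariffe asilo nido',
]
pairs = near_duplicates(texts)
```

Rows flagged this way can be kept in the same split, or reduced to a single representative, so that the test set does not contain text the model has effectively seen.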

A demo version of the data is available [here](https://github.com/FluveFV/faudit-classifier/blob/main/src/addestramento.gzip).