In [25]:
import pandas as pd
import os 
import re
from unidecode import unidecode

### Objective:
The goal is to put the dataset into a generic label-text format, as is standard practice for text classification.

If the same text entry has multiple labels, the text can be repeated on separate rows, one per label, for training.
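
A minimal sketch of that row duplication, assuming the labels of each entry are available as a list in a hypothetical `labels` column (the toy texts and label values are made up for illustration):

```python
import pandas as pd

# Toy data: the second entry carries two labels (values are invented).
raw = pd.DataFrame({
    "text": ["riduzione tariffe asilo nido", "piano giovani di zona"],
    "labels": [[26], [11, 30]],
})

# explode() repeats the text once per label, giving one (text, label) pair per row.
long_format = (
    raw.explode("labels")
       .rename(columns={"labels": "label"})
       .reset_index(drop=True)
)
long_format["label"] = long_format["label"].astype(int)
print(long_format)
```

Each row of `long_format` can then be treated as an independent training example.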

The ```municip_faudit_plans.csv``` dataset was created by unifying several sources that describe, with different parameters, the actions in the plans of Italian municipalities that have participated in and obtained a Family Audit certification. 

The file may be in one of the following formats: ```.csv```, ```.gzip```, ```.xlsx```, ```.json```, ```.feather```
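
One way to handle these formats uniformly is to pick the pandas reader from the file extension. The sketch below is only an illustration: `load_table` and `READERS` are hypothetical names, and `.gzip` is assumed to be a gzip-compressed Parquet file, as it is in this notebook.

```python
from pathlib import Path
import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# '.gzip' is assumed to be gzip-compressed Parquet, as in this notebook.
READERS = {
    ".csv": pd.read_csv,
    ".gzip": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".feather": pd.read_feather,
}

def load_table(path):
    """Load a table with the reader that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return READERS[suffix](path)
```

Some of these readers rely on optional dependencies (for example a Parquet or Excel engine), so they only work if the corresponding package is installed.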

In [6]:
df = pd.read_parquet('municip_faudit_plans.gzip')
df.columns

Index(['codice_macro', 'ID_piano', 'ID_azione', 'ID_tassonomia', 'titolo',
       'obiettivo', 'descrizione', 'assessorato', 'tipologia_partnership',
       'altre_organizzazioni_coinvolte', 'indicatore', 'azione',
       'codice_campo', 'numero_codice_campo', 'descrizione_codice_macro',
       'descrizione_codice_campo', 'ID_organizzazione', 'anno_compilazione',
       'premessa', 'valutazione_globale', 'status', 'comune', 'codice_istat',
       'dimensione', 'num_det_assegnazione', 'data_det_assegnazione',
       'numero_registro_family_trentino', 'num_det_revoca', 'data_det_revoca',
       'comune_breve'],
      dtype='object')

The data used for training contains: 

- ```ID_tassonomia```
- ```titolo```
- ```descrizione```
- ```obiettivo```

```ID_tassonomia``` is the category to predict. The full list of ```ID_tassonomia``` values can be found in [correspondence.csv](https://github.com/FluveFV/faudit-classifier/blob/main/src/correspondence.csv), along with their relationship to the other categories used in the taxonomy description.

The other elements (title, description, objective) are merged into a single text and pre-processed to remove non-ASCII characters. Here's an example function to do so:

In [17]:
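def formatter(og, c=None):
    """Build a DataFrame with one lower-cased 'text' column (plus optional columns `c`)."""
    df = og.fillna('')  # missing fields become empty strings
    r = {}
    tdo = []
    if c is not None:
        assert isinstance(c, list), 'The additional column(s) must be in a list'
        # carry the requested extra columns over unchanged
        for el in c:
            r[el] = df[el]

    # join title, description and objective into one string per row
    for t, d, o in zip(df.titolo, df.descrizione, df.obiettivo):
        t = ascificatore(t)
        d = ascificatore(d)
        o = ascificatore(o)
        tdo.append((t + ' . ' + d + ' . ' + o).lower())
    r['text'] = tdo

    return pd.DataFrame(r)

def ascificatore(s):
    # collapse line breaks and tabs into single spaces, skipping empty pieces
    joined = ' '.join(c for c in re.split(r'[\r\n\t]+', s) if c.strip())
    # non-ASCII characters are dropped here, before unidecode runs,
    # so unidecode has nothing left to transliterate
    return unidecode(joined.encode('ascii', 'ignore').decode())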

In [22]:
data = formatter(df)
data.head()

| | text |
|---|---|
| 0 | adesione al piano giovani di zona della comuni... |
| 1 | riduzione tariffe asilo nido dal 1 gennaio 201... |
| 2 | revisione parametri icef per servizio tagesmut... |
| 3 | agevolazione per lacquisto kit pannolini lavab... |
| 4 | abbattimento della quota di iscrizione al serv... |
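
A note on `ascificatore`: because `.encode('ascii', 'ignore')` runs before `unidecode`, accented letters are removed rather than transliterated. The sketch below illustrates the difference using the standard-library `unicodedata` module as a stand-in for `unidecode`; the helper names and the sample string are made up for illustration.

```python
import unicodedata

def drop_non_ascii(s):
    # Same effect as .encode('ascii', 'ignore').decode(): non-ASCII characters vanish.
    return s.encode("ascii", "ignore").decode()

def strip_accents(s):
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # dropping non-ASCII afterwards removes only the mark and keeps the base letter.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

sample = "attività per le famiglie"
print(drop_non_ascii(sample))   # attivit per le famiglie
print(strip_accents(sample))    # attivita per le famiglie
```

Whether dropping or transliterating is preferable depends on how the downstream tokenizer handles truncated words; the notebook as written uses the dropping behavior.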


Then, we attach the label we want to predict. One can also train the model with respect to 'macrocategoria' or 'field', which are less granular versions of 'ID_tassonomia' and make for a simpler prediction task.

In [24]:
data['label'] = df['ID_tassonomia']
data.head()

| | text | label |
|---|---|---|
| 0 | adesione al piano giovani di zona della comuni... | 11 |
| 1 | riduzione tariffe asilo nido dal 1 gennaio 201... | 26 |
| 2 | revisione parametri icef per servizio tagesmut... | 26 |
| 3 | agevolazione per lacquisto kit pannolini lavab... | 30 |
| 4 | abbattimento della quota di iscrizione al serv... | 26 |
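
Switching to a coarser target only changes which column is attached as the label. The sketch below uses toy data and a hypothetical helper, `with_label`. It assumes that the `codice_macro` column holds the macro-category, which the notebook does not confirm.

```python
import pandas as pd

# Toy stand-ins for `df` and `data`; all values are invented for illustration.
df_toy = pd.DataFrame({
    "ID_tassonomia": [11, 26, 26, 30],
    "codice_macro": [1, 2, 2, 2],  # assumed to hold the macro-category
})
data_toy = pd.DataFrame({"text": ["a", "b", "c", "d"]})

def with_label(texts, source, label_column):
    """Return a copy of `texts` with `source[label_column]` attached as 'label'."""
    out = texts.copy()
    out["label"] = source[label_column].to_numpy()
    return out

fine = with_label(data_toy, df_toy, "ID_tassonomia")
coarse = with_label(data_toy, df_toy, "codice_macro")
# The coarse target has fewer distinct classes than the fine one.
print(fine["label"].nunique(), coarse["label"].nunique())
```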


In [27]:
os.makedirs('dataset_BERT', exist_ok=True)
data.to_parquet('dataset_BERT/addestramento.gzip') #or csv, xlsx, etc.

This is the preprocessing in its simplest form. We also checked for potential data leaks. By a leak we mean here not texts that are exactly identical across rows, but texts similar enough that the training and test sets may end up sharing almost identical entries. 
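
The notebook does not show how this check was done. As one possible approach, the sketch below flags near-duplicate pairs with the standard-library `difflib.SequenceMatcher`. The function name, the `0.9` threshold, and the sample texts are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_pairs(texts, threshold=0.9):
    """Return (i, j, ratio) for every pair of texts whose similarity ratio >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs

texts = [
    "riduzione tariffe asilo nido dal 1 gennaio 2015",
    "riduzione tariffe asilo nido dal 1 gennaio 2016",
    "adesione al piano giovani di zona",
]
# Only the first two texts, which differ by a single character, are flagged.
print(near_duplicate_pairs(texts))
```

The pairwise comparison is quadratic in the number of texts, so a larger dataset would likely need a cheaper pre-filter.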

When such near-duplicates are kept, accuracy often improves, but so does the risk of overfitting, since only $\sim 30\%$ of the entries in the original dataset may be unique. This percentage depends on how different two texts must be in order to count as distinct.
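
To see how the share of unique entries moves with the similarity threshold, one can greedily keep a text only if it is not too similar to any text already kept. This is a rough sketch, not the procedure used for the figure above: `unique_share`, the thresholds, and the toy texts are assumptions.

```python
from difflib import SequenceMatcher

def unique_share(texts, threshold):
    """Fraction of texts kept after greedily dropping any text whose
    similarity to an already kept text is >= threshold."""
    kept = []
    for t in texts:
        if all(SequenceMatcher(None, t, k).ratio() < threshold for k in kept):
            kept.append(t)
    return len(kept) / len(texts)

texts = [
    "riduzione tariffe asilo nido dal 1 gennaio 2015",
    "riduzione tariffe asilo nido dal 1 gennaio 2016",
    "riduzione tariffe asilo nido",
    "adesione al piano giovani di zona",
]
# A lower threshold treats more texts as duplicates, so the unique share shrinks.
for threshold in (1.0, 0.9, 0.7):
    print(threshold, unique_share(texts, threshold))
```

The greedy result depends on the order of the texts, so it is best read as an estimate.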