In [92]:
import pandas as pd
import os 
import re
from unidecode import unidecode

### Objective:
Data needs to be put in a generic text-label(s) format. If one observation (entry) has multiple labels, each label must be saved in its own column, so the data needs as many label columns as the maximum number of labels an observation can have.

The ```companies_FA.csv``` dataset was created by unifying several sources that describe, with different parameters, the actions in the plans of companies that have participated in and obtained a Family Audit certification.

The file format may be one of the following: ```.csv```, ```.gzip```. Other file types may work but have not been tested.
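
A minimal loading sketch for the two tested formats. It assumes that a ```.gzip``` file is a gzip-compressed CSV, which the text above does not state; the helper name `load_dataset` is hypothetical. The compression is passed explicitly because pandas infers gzip from a ```.gz``` suffix.

```python
import os
import tempfile

import pandas as pd


def load_dataset(path):
    """Read a .csv file, or a .gzip file assumed to be a gzip-compressed CSV."""
    # Assumption: '.gzip' means gzip-compressed CSV, so the codec is set explicitly.
    compression = 'gzip' if path.lower().endswith('.gzip') else 'infer'
    return pd.read_csv(path, compression=compression)


# Round trip on a throwaway file to show the call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.gzip')
    toy = pd.DataFrame({'text': ['example of a sentence'], 'Label 0': [5]})
    toy.to_csv(path, index=False, compression='gzip')
    loaded = load_dataset(path)

print(loaded)
```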

The dataset *must* have a text column. Every other column is interpreted as a label if its name starts with "label". The script treats column names as case insensitive, so there is no difference if a column is named "Label", "laBEL", "LABEL", etc.
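
The naming rule can be sketched as follows. This is an illustration of the rule as described, not the code used by the script; `label_columns` is a hypothetical helper.

```python
import pandas as pd


def label_columns(df):
    """Return the columns whose name starts with 'label', ignoring case."""
    return [col for col in df.columns if str(col).lower().startswith('label')]


example = pd.DataFrame({
    'text': ['example of a sentence'],
    'Label 0': [5],
    'laBEL 1': [None],
    'Unnamed: 0': [0],   # not a label column: the name does not start with "label"
})

print(label_columns(example))
```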

In [120]:
df = pd.read_csv('companies_FA.csv')
df.head()

Unnamed: 0,text,Label 0,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7,Label 8,Label 9
0,telelavoro domiciliare/mobile . avviare uno st...,11630743.0,11630742.0,,,,,,,,
1,gruppo di lavoro interno . formalizzazione del...,11630762.0,,,,,,,,,
2,lavoro decentrato . e' prassi in essere la pos...,11630744.0,,,,,,,,,
3,rientro accompagnato dopo lunghe assenze . suc...,11630760.0,,,,,,,,,
4,supervisione delle politiche di conciliazione ...,11630753.0,,,,,,,,,


The data used for training contains: 

```text```

```Label 0```

```Label 1```

    ...

```Label 9```

The IDs in the label columns are the labels to predict. The dataset can contain NaN values. The order of the labels across the columns carries no meaning: the first one is not more important than the second, etc. There must be at least one label per observation. The labels do not have to be close to the text.
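
The "at least one label per observation" requirement can be checked before training. The sketch below assumes label columns are recognised by the case-insensitive "label" prefix; `rows_without_labels` is a hypothetical helper.

```python
import pandas as pd


def rows_without_labels(df):
    """Return the index values of rows whose label columns are all NaN."""
    label_cols = [c for c in df.columns if str(c).lower().startswith('label')]
    all_missing = df[label_cols].isna().all(axis=1)
    return df[all_missing].index.tolist()


toy = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'Label 0': [1.0, None, 3.0],
    'Label 1': [None, None, 2.0],
})

# Row 1 has no label in any column, so it violates the requirement.
print(rows_without_labels(toy))
```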

Valid examples of dataframes are:
|text|Label 0|Label 1|
|---|---|--|
|example of a sentence|5|NaN|

|text|Label 0|Label 1|Label 2|Label 3|
|---|---|---|---|---|
|example of a sentence|"Cat"|NaN|"Dog"|NaN|
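
A frame of this shape can be built from one list of labels per text. The sketch below is only an illustration under that assumption; `widen_labels` and its arguments are hypothetical names.

```python
import pandas as pd


def widen_labels(texts, labels_per_text):
    """Spread each text's labels over 'Label 0', 'Label 1', ... columns."""
    width = max(len(labels) for labels in labels_per_text)
    rows = []
    for text, labels in zip(texts, labels_per_text):
        # Pad shorter label lists so every row has the same number of columns.
        padded = list(labels) + [None] * (width - len(labels))
        rows.append([text] + padded)
    columns = ['text'] + [f'Label {i}' for i in range(width)]
    return pd.DataFrame(rows, columns=columns)


wide = widen_labels(
    ['first sentence', 'second sentence'],
    [['Cat', 'Dog'], ['Cat']],
)
print(wide)
```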

The first label column must always contain a label. In the ```train.py``` script this column is used to [stratify the split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of the observations into training, validation and test sets.

If you have multiple text columns (e.g. one representing the title of the document, one representing the description, etc.) you can merge them with the following formatter.

In [121]:
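def formatter(og, c):
    assert isinstance(c, list), 'The additional column(s) must be in a list'
    df = og.fillna('')  # fillna returns a copy, so the input frame is left untouched
    first, *rest = c    # unpack instead of c.pop(0), which would mutate the caller's list

    df['text'] = df[first]
    for column in rest:
        # Assign the result back; joined with a space so adjacent words do not fuse.
        df['text'] = df['text'] + ' ' + df[column]
    # For CSV formats, no text should have commas.
    df['text'] = df['text'].str.replace(',', '', regex=False)
    return df

def ascificatore(s):
    # Collapse line breaks and tabs into single spaces, dropping empty fragments.
    parts = [c for c in re.split(r'[\r\n\t]+', s) if c.strip()]
    # Transliterate first, so accented characters are converted rather than dropped.
    return unidecode(' '.join(parts)).encode('ascii', 'ignore').decode()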

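For the sake of the example, the formatter is applied to the single existing ```text``` column. With several text columns, list all of them in ```c``` and they will be merged into ```text```.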

In [122]:
data = formatter(df, c=['text'])  #Replace the column list with names of your columns of text.
data.head()

Unnamed: 0,text,Label 0,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7,Label 8,Label 9
0,telelavoro domiciliare/mobile . avviare uno st...,11630743.0,11630742.0,,,,,,,,
1,gruppo di lavoro interno . formalizzazione del...,11630762.0,,,,,,,,,
2,lavoro decentrato . e' prassi in essere la pos...,11630744.0,,,,,,,,,
3,rientro accompagnato dopo lunghe assenze . suc...,11630760.0,,,,,,,,,
4,supervisione delle politiche di conciliazione ...,11630753.0,,,,,,,,,


### Important mention
Given the nature of the data, the classes are often very imbalanced. Since ```train.py``` automatically splits in a stratified manner, every class needs at least 3 occurrences (one for each of the train/val/test splits).

Furthermore, such rare classes are not expected to appear in future data, given their extremely low usage by municipalities; we therefore suggest a minimum threshold of 30 observations per class for the processed data. The cell below applies only the hard minimum of 3.
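
The threshold can be made a parameter so that switching from 3 to the suggested 30 is a one-argument change. `drop_rare_classes` is a hypothetical helper, sketched on toy data.

```python
import pandas as pd


def drop_rare_classes(df, label_col='Label 0', min_count=30):
    """Keep only rows whose label in label_col occurs at least min_count times."""
    # Map every row to the frequency of its own class, then filter on it.
    per_row_count = df[label_col].map(df[label_col].value_counts())
    return df[per_row_count >= min_count]


toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'Label 0': [11630743.0, 11630743.0, 11630743.0, 11630762.0],
})

kept = drop_rare_classes(toy, min_count=3)
print(f'{len(toy) - len(kept)} rows have been removed.')
```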

In [123]:
before = data.shape[0]

In [124]:
counts = data['Label 0'].value_counts()
c = counts[counts >= 3].index.tolist()
data = data[data['Label 0'].isin(c)]
after = data.shape[0]
print(f'{before-after} rows have been removed. Remaining classes: {data["Label 0"].unique().shape[0]}')

72 rows have been removed. Remaining classes: 40


In [125]:
data.to_csv('addestramento_d.csv')

This is the preprocessing in its simplest version. We also checked for potential data leaks. Here a leak does not mean exactly identical text in different rows, but text similar enough that the training and test sets may share almost identical entries.

The accuracy often does improve, but so does the risk of overfitting the data, as only $\approx31\%$ of the entries of the original dataset are truly unique.
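
The check that was actually run is not shown in this notebook. As one possible way to flag near-identical rows, the sketch below compares every pair of texts with the standard library's `difflib.SequenceMatcher`; the function name and the 0.9 threshold are assumptions, and the pairwise loop is only practical for small datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(texts, threshold=0.9):
    """Return index pairs (i, j) whose texts have a similarity ratio >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs


texts = [
    'telelavoro domiciliare per i dipendenti',
    'telelavoro domiciliare per i dipendenti.',   # differs only by the final dot
    'gruppo di lavoro interno',
]

print(near_duplicate_pairs(texts))
```

Rows flagged this way can then be kept in the same split, or reduced to a single copy, so that near-identical texts do not end up on both sides of the train/test boundary.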

A demo-version of the data is uploaded [here]().