# Preprocessing the initial data

The raw data has many columns: which ones are useful?

The goal is to obtain the most informative features possible.

Three desired properties:
* the number of features is independent of the problem size
* the features are invariant to modifications that do not change the MIP (row / column permutation)
* **the values of the features are independent of the problem size**

This last point is the most important one for our features here. The other two already hold.
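
To illustrate the third property, here is a minimal sketch with made-up numbers (the two hypothetical problems and their sizes are assumptions, not taken from the data): a raw count grows with the number of columns, whereas the corresponding fraction stays comparable across problem sizes.

```python
import pandas as pd

# Two hypothetical problems of different sizes with the same proportion
# of conflicting columns (illustrative numbers only).
toy = pd.DataFrame({
    'number_cols_in_mp': [1000, 10000],
    'number_conflicting_columns': [50, 500],
})

# The raw count depends on the problem size; the fraction does not.
toy['fraction_conflicting_columns'] = (
    toy['number_conflicting_columns'] / toy['number_cols_in_mp']
)
print(toy)
```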

In [1]:
import pylab
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

file = 'data/second_samples/train/labels_NW_319_test_1_1_win_0'
pylab.rcParams['figure.figsize'] = (12, 5)
df = pd.read_csv(file, sep='\t')

df.head(10)

Unnamed: 0,node_number,parent_node_number,var_cost,frac_val,number_conflicting_columns,fraction_conflicting_columns,number_conflicting_columns_positive_value,fraction_conflicting_columns_positive_value,min_cost_conflicting_column,min_cost_conflicting_column_positive_value,number_cols_in_mp,dual_cost_min,dual_cost_max,dual_cost_avg,frac_pairing_tasks_fixed,nb_fractional_vars,nb_tasks,nb_pairing_tasks,value
0,1,0,584.0,0.977538,56,0.006554,4,0.000468,330.0,584.0,8544,-351.113,1183.12,349.276,0.818043,716,5,4,220197.254226
1,2,0,350.0,0.961992,61,0.00714,3,0.000351,350.0,350.0,8544,-2.0,849.543,349.333,0.818043,716,3,2,220046.973699
2,3,0,970.0,0.961992,153,0.017907,4,0.000468,493.0,962.0,8544,-691.84,1310.54,331.714,0.818043,716,7,6,220004.284337
3,4,0,1103.0,0.931345,285,0.033357,10,0.00117,597.0,616.0,8544,-251.65,1217.37,522.113,0.818043,716,6,5,219871.141033
4,5,0,253.0,0.920486,51,0.005969,4,0.000468,253.0,253.0,8544,-287.52,1046.52,252.333,0.818043,716,3,2,219731.483368
5,6,5,350.0,0.968328,22,0.002617,3,0.000357,350.0,350.0,8405,-2.0,911.503,349.333,0.818383,704,3,2,219608.777662
6,7,5,756.0,0.960549,182,0.021654,3,0.000357,594.0,756.0,8405,-364.5,1050.56,246.8,0.818383,704,5,4,219587.612981
7,8,5,757.0,0.938851,108,0.012849,4,0.000476,757.0,757.0,8405,-148.66,1361.78,366.167,0.818383,704,6,5,219542.605492
8,9,5,330.0,0.938851,66,0.007852,4,0.000476,330.0,330.0,8405,-218.564,1208.56,329.333,0.818383,704,3,2,219450.817028
9,10,5,1183.0,0.919687,165,0.019631,3,0.000357,646.667,1170.0,8405,-404.522,1633.47,589.727,0.818383,704,6,5,219415.897025


The normalization decisions are presented below. "Standard normalization" means subtracting the mean and dividing by the standard deviation.

In order:
* *node_number*: not a feature of interest for prediction
* *parent_node_number*: not a feature of interest for prediction
* *var_cost*: standard normalization
* *frac_val*: already in a suitable format
* *number_conflicting_columns*: already has a normalized equivalent (*fraction_conflicting_columns*)
* *fraction_conflicting_columns*: already in a suitable format
* *number_conflicting_columns_positive_value*: already has a normalized equivalent (*fraction_conflicting_columns_positive_value*)
* *fraction_conflicting_columns_positive_value*: already in a suitable format
* *min_cost_conflicting_column*: standard normalization using the mean and standard deviation of *var_cost*
* *min_cost_conflicting_column_positive_value*: standard normalization using the mean and standard deviation of *var_cost*
* *number_cols_in_mp*: standard normalization
* *dual_cost_min*: standard normalization using the statistics of the child nodes of the same parent
* *dual_cost_max*: standard normalization using the statistics of the child nodes of the same parent
* *dual_cost_avg*: standard normalization using the statistics of the child nodes of the same parent
* *frac_pairing_tasks_fixed*: already in a suitable format
* *nb_fractional_vars*: already has a normalized equivalent (*frac_pairing_tasks_fixed*)
* *nb_tasks*: already has an equivalent (*nb_pairing_tasks*)
* *nb_pairing_tasks*: standard normalization
* *value*: standard normalization, with the mean replaced by the value of the parent node (`normalize_y`); note that `normalize_df` below calls `normalize_y_minmax` instead

Features that have a normalized equivalent are dropped.
Is *number_cols_in_mp* equivalent to *frac_pairing_tasks_fixed*?
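
One way to probe a question like this is to check, on the loaded data, whether one column determines the other. The sketch below rebuilds four rows from the `df.head(10)` output above (a tiny sample, so the result is only indicative) and uses a hypothetical helper, `is_function_of`, which is not part of the notebook.

```python
import pandas as pd

# Rows copied from the df.head(10) output (rows 0, 1, 5 and 6).
sample = pd.DataFrame({
    'number_conflicting_columns': [56, 61, 22, 182],
    'fraction_conflicting_columns': [0.006554, 0.00714, 0.002617, 0.021654],
    'number_cols_in_mp': [8544, 8544, 8405, 8405],
    'frac_pairing_tasks_fixed': [0.818043, 0.818043, 0.818383, 0.818383],
})

def is_function_of(df, source, target):
    """Hypothetical helper: True if each value of `source` maps to a single value of `target`."""
    return bool((df.groupby(source)[target].nunique() <= 1).all())

# A count and its fraction: the fraction matches count / number_cols_in_mp
# up to the rounding of the printed values.
ratio = sample['number_conflicting_columns'] / sample['number_cols_in_mp']
max_gap = (ratio - sample['fraction_conflicting_columns']).abs().max()
print(max_gap)

# number_cols_in_mp vs frac_pairing_tasks_fixed: does each value of one
# column correspond to a single value of the other?
forward = is_function_of(sample, 'number_cols_in_mp', 'frac_pairing_tasks_fixed')
backward = is_function_of(sample, 'frac_pairing_tasks_fixed', 'number_cols_in_mp')
print(forward, backward)
```

On the full data, `True` in both directions would suggest that the two columns carry the same information; four rows cannot settle the question.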

In [2]:
def normalize_y(df, use_std=True, with_noise=True, remove_0_stds=False):
    """
    Normalize the value of each node.
    Two options:
     - y <- y / y_parent
     - y <- (y - y_parent) / std_children(y)

    std_children(y) is the standard deviation of `value` over the children
    of the current node's parent (the node and its siblings).

    Gaussian noise can optionally be added to the value column to avoid
    a standard deviation of 0. It might also help generalization
    (a kind of pseudo data augmentation)?

    Drops orphan nodes (whose parent node is not present in the df).
    Returns a new df; the input df is left untouched.
    """
    df = df.copy()  # Work on a copy so the caller's df is not modified
    if with_noise:
        df['value'] = df['value'] + np.random.normal(scale=50, size=len(df))

    if use_std:
        # Standard deviation of the children of each parent, aligned on the rows
        std = df.groupby('parent_node_number')['value'].transform('std')
        if remove_0_stds:
            df, std = df[std != 0], std[std != 0]

    # Node 0 does not exist in the data: drop the nodes whose parent is missing
    has_parent = df['parent_node_number'].isin(df['node_number'])
    new_df = df.loc[has_parent].copy()

    # parent_node -> parent_value
    parent_values = df.drop_duplicates('node_number').set_index('node_number')['value']
    y_parent = new_df['parent_node_number'].map(parent_values)

    if use_std:
        new_df['value'] = (new_df['value'] - y_parent) / std[has_parent]
    else:
        new_df['value'] = new_df['value'] / y_parent
    return new_df

def normalize_y_minmax(df):
    """
    y <- (y - y_min) / (y_max - y_min)
    """
    df = df.copy()
    y_min = df['value'].min()
    y_max = df['value'].max()
    df['value'] = (df['value'] - y_min) / (y_max - y_min)
    return df

def normalize_childs(df, columns):
    """
    Standard normalization, but the statistics are computed only over
    the children of each parent node.
    `columns` is the list of column names to normalize this way.

    x <- (x - mean_children(x)) / (std_children(x) + 1)

    The + 1 keeps the denominator nonzero when all children share the same value.
    Returns a new df.
    """
    new_df = df.copy()
    groups = df.groupby('parent_node_number')
    for column_name in columns:
        mean = groups[column_name].transform('mean')  # per-parent mean, aligned on the rows
        std = groups[column_name].transform('std')    # per-parent std, aligned on the rows
        new_df[column_name] = (df[column_name] - mean) / (std + 1)

    return new_df

def normalize_std(df, columns):
    """
    Standard normalization over the whole df.
    `columns` is the list of column names to normalize this way.

    x <- (x - mean(x)) / (std(x) + 1)

    The + 1 keeps the denominator nonzero for constant columns.
    Returns a new df.
    """
    new_df = df.copy()
    for column_name in columns:
        new_df[column_name] = (df[column_name] - df[column_name].mean()) / (df[column_name].std() + 1)
    return new_df
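
As a sanity check of the formula `y <- (y - y_parent) / std_children(y)`, here is a small standalone sketch on a made-up tree (node numbers and values are hypothetical). It recomputes the formula directly with `groupby(...).transform`, without noise, rather than calling the functions above.

```python
import pandas as pd

# Hypothetical tree: node 1 has parent 0, which is absent from the data;
# nodes 2 and 3 are the children of node 1.
tree = pd.DataFrame({
    'node_number':        [1, 2, 3],
    'parent_node_number': [0, 1, 1],
    'value':              [100.0, 90.0, 80.0],
})

# Value of each node's parent (NaN when the parent is not in the data).
parent_value = tree['parent_node_number'].map(
    tree.set_index('node_number')['value']
)
# Sample standard deviation of `value` among the children of the same parent.
std_children = tree.groupby('parent_node_number')['value'].transform('std')

tree['value_norm'] = (tree['value'] - parent_value) / std_children
normalized = tree.dropna(subset=['value_norm'])  # the orphan node 1 is dropped
print(normalized)
```

Nodes 2 and 3 sit 10 and 20 below their parent, and their sample standard deviation is sqrt(50), so the normalized values are -sqrt(2) and -2*sqrt(2).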

In [3]:
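def normalize_df(df):
    """Apply the normalization decisions listed above and return a new df."""
    # Counts that already have a normalized equivalent
    drop_columns = ['number_conflicting_columns', 'number_conflicting_columns_positive_value', 'nb_fractional_vars', 'nb_tasks']
    n_df = df.drop(columns=drop_columns)

    # Standard normalization over the whole file
    std_columns = ['var_cost', 'number_cols_in_mp', 'nb_pairing_tasks']
    n_df = normalize_std(n_df, std_columns)

    # Standard normalization over the children of each parent node
    childs_columns = ['dual_cost_min', 'dual_cost_max', 'dual_cost_avg']
    n_df = normalize_childs(n_df, childs_columns)

    # Cost columns are normalized with the mean and std of var_cost
    mean_c, std_c = df['var_cost'].mean(), df['var_cost'].std()
    max_c_columns = ['min_cost_conflicting_column', 'min_cost_conflicting_column_positive_value']
    for column_name in max_c_columns:
        n_df[column_name] = (df[column_name] - mean_c) / std_c

    # Target: min-max scaling of value over the whole file
    n_df = normalize_y_minmax(n_df)

    return n_df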

n_df = normalize_df(df)

## Result

In [4]:
n_df.head(10)

Unnamed: 0,node_number,parent_node_number,var_cost,frac_val,fraction_conflicting_columns,fraction_conflicting_columns_positive_value,min_cost_conflicting_column,min_cost_conflicting_column_positive_value,number_cols_in_mp,dual_cost_min,dual_cost_max,dual_cost_avg,frac_pairing_tasks_fixed,nb_pairing_tasks,value
0,1,0,-1.125763,0.977538,0.006554,0.000468,-1.635816,-1.128013,2.117031,-0.1377,0.342596,-0.117233,0.818043,-0.526819,1.0
1,2,0,-1.592647,0.961992,0.00714,0.000351,-1.595831,-1.595831,2.117031,1.264312,-1.509584,-0.116661,0.818043,-1.077975,0.894451
2,3,0,-0.355603,0.961992,0.017907,0.000468,-1.309943,-0.372307,2.117031,-1.506033,1.050093,-0.293538,0.818043,0.024337,0.864468
3,4,0,-0.090236,0.931345,0.033357,0.00117,-1.102023,-1.064038,2.117031,0.261736,0.532768,1.617873,0.818043,-0.251241,0.770956
4,5,0,-1.786185,0.920486,0.005969,0.000468,-1.789756,-1.789756,2.117031,0.117685,-0.415873,-1.090441,0.818043,-1.077975,0.672868
5,6,5,-1.592647,0.968328,0.002617,0.000357,-1.595831,-1.595831,2.054788,1.369676,-1.143645,-0.209113,0.818383,-1.077975,0.586686
6,7,5,-0.782583,0.960549,0.021654,0.000357,-1.108021,-0.784147,2.054788,-0.830676,-0.649253,-1.005021,0.818383,-0.526819,0.571821
7,8,5,-0.780587,0.938851,0.012849,0.000476,-0.782148,-0.782148,2.054788,0.479459,0.457233,-0.07844,0.818383,-0.251241,0.54021
8,9,5,-1.632552,0.938851,0.007852,0.000476,-1.635816,-1.635816,2.054788,0.055147,-0.087513,-0.364362,0.818383,-1.077975,0.475743
9,10,5,0.069382,0.919687,0.019631,0.000357,-1.002728,0.043531,2.054788,-1.073607,1.423178,1.656935,0.818383,-0.251241,0.451217


In [5]:
n_df.describe()

Unnamed: 0,node_number,parent_node_number,var_cost,frac_val,fraction_conflicting_columns,fraction_conflicting_columns_positive_value,min_cost_conflicting_column,min_cost_conflicting_column_positive_value,number_cols_in_mp,dual_cost_min,dual_cost_max,dual_cost_avg,frac_pairing_tasks_fixed,nb_pairing_tasks,value
count,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0
mean,388.0,382.686275,0.006471,0.801142,0.039297,0.002096,-1.236273,-0.585321,-0.013837,-5.514833e-18,-1.207458e-16,6.6178e-17,0.885399,0.003803,0.457798
std,220.980768,220.488346,0.996574,0.108604,0.034406,0.001577,0.427013,0.779588,0.987988,0.8934637,0.8937997,0.8814672,0.042699,0.724375,1.817736
min,6.0,5.0,-1.812123,0.5,0.000655,0.000189,-1.815746,-1.815746,-1.369908,-1.788852,-1.78885,-1.768291,0.818383,-1.077975,-7.07452
25%,197.0,195.0,-0.983602,0.734103,0.016774,0.000927,-1.595831,-1.199985,-0.742555,-0.7299802,-0.672,-0.6987429,0.848794,-0.526819,-0.581381
50%,388.0,381.0,0.167149,0.784483,0.03059,0.001684,-1.26596,-0.584225,-0.175653,0.4431134,-0.2557176,0.05516219,0.883112,0.024337,0.4935
75%,579.0,571.0,0.843533,0.89398,0.049918,0.002851,-1.005395,0.102508,0.481703,0.686753,0.7281807,0.6370826,0.921679,0.575492,1.6036
max,770.0,761.0,2.384851,0.997924,0.265522,0.01224,0.018541,1.305041,3.035895,1.788849,1.788852,1.74949,0.965511,1.126648,10.11857


## Normalizing all the data

We now normalize all the data and write the result to a dedicated folder.
The algorithms can then be trained on this processed data.

In [5]:
import os
from pathlib import Path

dossier_original = 'data/second_samples/train/'
dossier_cible = 'data/second_samples/normalized/train/'

Path(dossier_cible).mkdir(parents=True, exist_ok=True)  # Create the target folder if needed
files = [f for f in os.listdir(dossier_original) if f.startswith('labels')]
for f in files:
    path_original, path_cible = os.path.join(dossier_original, f), os.path.join(dossier_cible, f)
    df = pd.read_csv(path_original, sep='\t')
    if not df.empty:
        df = normalize_df(df)
        df.to_csv(path_cible, index=False)

print('Done !')

Done !
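
For the training step, the normalized files can be read back and concatenated. The sketch below is self-contained: it uses a temporary directory with two made-up files standing in for `dossier_cible`, and the helper `load_normalized` is hypothetical. Since `to_csv` above is called without `sep`, the normalized files are comma-separated, unlike the tab-separated originals.

```python
import os
import tempfile
import pandas as pd

def load_normalized(folder):
    """Hypothetical helper: concatenate every 'labels*' file found in `folder`."""
    files = sorted(f for f in os.listdir(folder) if f.startswith('labels'))
    frames = [pd.read_csv(os.path.join(folder, f)) for f in files]  # comma-separated
    return pd.concat(frames, ignore_index=True)

# Stand-in for the normalized folder, filled with made-up data.
with tempfile.TemporaryDirectory() as folder:
    for i in range(2):
        toy = pd.DataFrame({'var_cost': [0.1 * i, -0.2], 'value': [0.5, 1.0]})
        toy.to_csv(os.path.join(folder, f'labels_toy_{i}'), index=False)
    train_df = load_normalized(folder)

print(train_df.shape)
```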
