# Data Processing

Data processing is split into two parts: input processing first, then output processing.

Output processing consists mainly of:
1. Removing campaigns with missing values (NaN in the DataFrame)
2. Removing outlier values, such as negative or near-zero values recorded for campaigns that are not actually null

## Loading the output data

As usual, we first change the working directory to the root of the project. The working directory should be
something like 'xxx\Roll Wear Project'.

In [1]:
from utils_notebooks import move_current_path_up
move_current_path_up(n_times=2)

Working directory = P:\My Documents\Projets Programmation\Roll Wear Project
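
The source of `move_current_path_up` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch that assumes it simply moves the working directory up `n_times` levels and prints the result. The body and the return value are assumptions, not the project's actual implementation.

```python
import os
from pathlib import Path


def move_current_path_up(n_times: int = 1) -> Path:
    """Hypothetical stand-in for utils_notebooks.move_current_path_up."""
    target = Path.cwd()
    for _ in range(n_times):
        # At the filesystem root, .parent returns the root itself, so this never fails.
        target = target.parent
    os.chdir(target)
    print(f"Working directory = {target}")
    return target
```

Running the cell twice moves up twice as far, so a helper like this is best called once at the top of the notebook.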


We load the complete output data.

In [2]:
import pandas as pd

output_df: pd.DataFrame = pd.read_hdf('Data/notebooks_data/wear_center.h5', key='outputs')


## Removing missing values

First, we remove the campaigns containing NaN (Not a Number), which corresponds to missing values.

In [3]:
print('Removed campaigns:')
print(output_df[output_df.isna().any(axis=1)])

output_df.dropna(inplace=True)

Removed campaigns:
                  f6t       f6b       f7t       f7b
id_campaign                                        
3                 NaN       NaN       NaN       NaN
7                 NaN       NaN       NaN       NaN
9            0.187548       NaN       NaN       NaN
18           0.245452  0.291968  0.293903       NaN
25                NaN       NaN  0.030059  0.062325
...               ...       ...       ...       ...
365               NaN       NaN       NaN       NaN
366               NaN       NaN       NaN       NaN
382          0.159548  0.123258       NaN  0.090258
383          0.173032  0.162387  0.145194       NaN
390          0.102290  0.091710  0.093677       NaN

[69 rows x 4 columns]
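
To make the row mask used above easier to follow, here is a small sketch on made-up data. The column names match the notebook, but the values and the variable names `toy_df`, `has_nan`, `nan_per_column` and `clean_df` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the notebook; the values are made up.
toy_df = pd.DataFrame(
    {
        "f6t": [0.21, np.nan, 0.19],
        "f6b": [0.25, np.nan, np.nan],
        "f7t": [0.18, 0.03, 0.22],
        "f7b": [0.20, 0.06, 0.24],
    },
    index=pd.Index([1, 2, 3], name="id_campaign"),
)

# True for every campaign (row) with at least one missing measurement.
has_nan = toy_df.isna().any(axis=1)

# Number of missing values per column, to see which measurement is missing most often.
nan_per_column = toy_df.isna().sum()

# Same rows as toy_df.dropna(), but without modifying toy_df in place.
clean_df = toy_df[~has_nan]

print(nan_per_column)
print(clean_df)
```

Note that a single missing column is enough to discard the whole campaign, which is why partially filled rows such as campaign 18 appear in the list above.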


## Removing outliers

Next, we remove the campaigns with negative values, then those (manually) identified as having values that are too low.

In [4]:
negative_index = output_df[(output_df < 0).any(axis=1)].index

print('Removed campaigns:')
print(output_df.loc[negative_index])

output_df.drop(negative_index, inplace=True, errors='ignore')

Removed campaigns:
Empty DataFrame
Columns: [f6t, f6b, f7t, f7b]
Index: []


The following campaigns were found to have very low values and should be removed so that they do not disturb
the training. The next cell shows their values, so a campaign that looks normal can be kept by removing it
from the list.

If a campaign does not appear in the cell's output, it has already been removed by one of the
previous processing steps. (If the output is an empty DataFrame, all of them have already been removed.)

In [5]:
# Original list : [25, 56, 86, 75, 93, 103, 131, 188, 257, 271, 365]
null_camp = [25, 56, 86, 75, 93, 103, 131, 188, 257, 271, 365]

# Select the listed campaigns that are still present in output_df.
# (DataFrame.append was removed in pandas 2.0, so we filter on the index instead.)
tmp_df = output_df.loc[output_df.index.isin(null_camp)]

print(tmp_df)

output_df.drop(null_camp, inplace=True, errors='ignore')

                  f6t       f6b       f7t       f7b
id_campaign                                        
131          0.020613  0.039839  0.057129  0.057323
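
The list above was built by hand. As a possible complement, the sketch below flags candidate campaigns automatically so they can be reviewed before being added to such a list. The data is made up, and both the threshold `LOW_WEAR_THRESHOLD` and the rule "every column below the threshold" are assumptions for illustration, not values or criteria taken from the project.

```python
import pandas as pd

# Made-up values for illustration only.
toy_df = pd.DataFrame(
    {
        "f6t": [0.20, 0.02, 0.17, 0.25],
        "f6b": [0.22, 0.04, 0.16, 0.29],
        "f7t": [0.19, 0.06, 0.15, 0.29],
        "f7b": [0.21, 0.06, 0.09, 0.27],
    },
    index=pd.Index([10, 11, 12, 13], name="id_campaign"),
)

LOW_WEAR_THRESHOLD = 0.08  # hypothetical cut-off, not a value taken from the project

# Flag campaigns where every column is below the threshold.
suspect_index = toy_df[(toy_df < LOW_WEAR_THRESHOLD).all(axis=1)].index

print('Campaigns to review:')
print(toy_df.loc[suspect_index])
```

Using `.all(axis=1)` only flags campaigns that are low on every roll. Switching to `.any(axis=1)` would be stricter and flag a campaign as soon as one column is low. Either way, the result is a list of candidates to inspect, not a replacement for the manual check.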


## Saving the preprocessed outputs

In [6]:
output_df.to_hdf('Data/notebooks_data/wear_center_preprocessed.h5', key='outputs')

