# 03 - Building a dataset from the split raw job_events data

This notebook generates two CSV files:

- raw_merge_job_events_dataset.csv, which merges the raw dataset (one row per job id)

- raw_concat_job_events_dataset.csv, which concatenates the raw dataset (one row per tag)

Steps:

- split the payload column

- split the resulting sub-columns

- merge the sub-columns (each job has a single row that gathers the data of the 3 tags)

- concatenate the sub-columns (each job has several rows: job_start, job_preview, job_end)
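
As a minimal sketch of the difference between the two outputs, the toy example below merges and then concatenates three small per-tag frames. The frames and their values are made up for illustration; only the column names `jobId`, `totalCopies`, `path` and `jobState` are taken from the payloads displayed further down in this notebook.

```python
import pandas as pd

# Toy per-tag frames (values are illustrative assumptions, not real data)
started = pd.DataFrame({"jobId": ["1", "2"], "totalCopies": [6, 11]})
preview = pd.DataFrame({"jobId": ["1", "2"], "path": ["a.png", "b.png"]})
ended = pd.DataFrame({"jobId": ["1"], "jobState": ["SUCCESS"]})

# Merge: one row per job, with the columns of the three tags side by side
merged = (
    started
    .merge(preview, on="jobId", how="outer")
    .merge(ended, on="jobId", how="outer")
)

# Concatenation: one row per event, with a 'tag' column naming its origin
concatenated = pd.concat(
    [
        started.assign(tag="job-started"),
        preview.assign(tag="job-preview-ready"),
        ended.assign(tag="job-ended"),
    ],
    ignore_index=True,
)

print(merged)
print(concatenated)
```

In the merged frame, job "2" has no `jobState` because it has no job-ended event in the toy data; in the concatenated frame, each row only fills the columns of its own tag.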


# A) Imports

## Libraries

In [1]:
import os, json, ast
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

## Functions

In [2]:
# return a dataframe built from the payload of the events carrying a given tag
def payload_dataframe_by_tag(input_df, tag):
    # select the rows of the requested tag
    df = input_df.loc[input_df['tag'] == tag]
    # parse each JSON payload and expand its top-level keys into columns
    payload_df = df.payload.apply(lambda x : json.loads(x)).apply(pd.Series)
    # merge the two dataframes on their index
    tag_df = df.merge(payload_df,left_index=True, right_index=True)
    # drop the 'payload' and 'tag' columns
    tag_df.drop(['payload','tag'], axis=1, inplace=True)
    # reset the index to 0..n-1
    tag_df.reset_index(level=None, drop=True, inplace=True, col_level=0, col_fill='')
    return tag_df

In [3]:
# return the dataframe obtained by splitting one column
# col = column to split (its values are dicts or lists of dicts)
# df = source dataframe (expects the default 0..n-1 index)
# data = dict whose keys are the df columns to keep in the returned dataframe
def convert_col_to_df(col:str, df:pd.DataFrame, data:dict=None):

    # work on a copy of the input dict (empty if none was given) so that the
    # caller's dict is not mutated, and list the columns to carry over
    data = {} if data is None else {k: list(v) for k, v in data.items()}
    data_keys = list(data.keys())

    # convert str values into list/dict values
    if not isinstance(df[col].loc[0], (list, dict)):
        try :
            df[col] = df[col].apply(lambda x : json.loads(x))
        except (TypeError, ValueError):
            df[col] = df[col].fillna(0)

    # find the first non-empty list or dict value; it provides the keys
    # of the new columns
    first = None
    for value in df[col]:
        if isinstance(value, list) and len(value) > 0 :
            first = value[0]
            break
        if isinstance(value, dict) and len(value) > 0 :
            first = value
            break
    if first is None:
        raise ValueError(f"column '{col}' holds no non-empty list or dict")

    # create one empty list per key, named '<key>_<col>'
    for ck in first.keys() :
        data[ck+'_'+col] = []

    # iterate over the rows and store the values in the 'data' dictionary
    for i in df.index:
        values = df[col].loc[i]
        # a list yields one output row per element, a dict yields a single row
        if isinstance(values, dict):
            values = [values]
        if isinstance(values, list):
            for value in values :
                for k in value.keys():
                    data[k+'_'+col].append(value.get(k))
                for dk in data_keys:
                    data[dk].append(df[dk].loc[i])

    return pd.DataFrame(data)
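
The helper above walks the rows in Python. As a point of comparison only (this is not the method used in this notebook), pandas provides `pd.json_normalize`, which flattens nested dicts and lists of dicts. The sketch below uses two made-up records shaped like the `layout` and `iper` values shown later; note that it names columns `<parent>_<key>`, whereas the helper above produces `<key>_<parent>`.

```python
import pandas as pd

# Toy records shaped like the 'layout' and 'iper' payload fields (illustrative)
records = [
    {"id": 82917,
     "layout": {"speed": 313, "paperFormat": {"width": 329.99, "height": 483.02}},
     "iper": [{"id": "PRINT_ENGINE_1", "LED": 50}]},
    {"id": 82921,
     "layout": {"speed": 313, "paperFormat": {"width": 329.99, "height": 483.02}},
     "iper": [{"id": "PRINT_ENGINE_1", "LED": 50}]},
]

# Nested dicts: every level is flattened, keys are joined with '_'
layout_flat = pd.json_normalize(
    [{"id": r["id"], "layout": r["layout"]} for r in records], sep="_"
)

# Lists of dicts: one output row per list element, with the event id carried
# over as metadata; the prefix avoids a clash between the two 'id' fields
iper_flat = pd.json_normalize(
    records, record_path="iper", meta=["id"], record_prefix="iper_"
)

print(layout_flat.columns.tolist())
print(iper_flat.columns.tolist())
```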

## Data

In [4]:
# source path to the raw job events dataset
filename = 'job_events.csv'
path = '../data/raw/'
# target paths for the merged and concatenated raw job events datasets
save_csv_merge = '../data/jobs/raw_merge_job_events_dataset.csv'
save_csv_concat = '../data/jobs/raw_concat_job_events_dataset.csv'

In [5]:
# build the full path to the source csv file
job_events = os.path.join(path, filename)

# B) Dataframe

## a) Creation

In [6]:
# create a dataframe from the csv data file, sorted by reception time
job_events_df = pd.read_csv(job_events).sort_values(by='received_at')
job_events_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112105 entries, 0 to 112104
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id           112105 non-null  int64 
 1   payload      112105 non-null  object
 2   received_at  112105 non-null  object
 3   machine_id   112105 non-null  int64 
 4   tag          112105 non-null  object
dtypes: int64(2), object(3)
memory usage: 5.1+ MB


In [7]:
# reset the index after sorting
job_events_df.reset_index(level=None, drop=True, inplace=True, col_level=0, col_fill='')
job_events_df.head(5)

Unnamed: 0,id,payload,received_at,machine_id,tag
0,82917,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2022-02-22 09:43:18.114000,18,job-started
1,82918,"{""path"": ""D:/IMAGES/Standard/1504750#1/0000001...",2022-02-22 09:43:18.290000,18,job-preview-ready
2,82919,"{""jobId"": ""1645522997"", ""jobState"": ""SUCCESS"",...",2022-02-22 09:44:33.472000,18,job-ended
3,82921,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2022-02-22 09:45:01.297000,18,job-started
4,82922,"{""path"": ""D:/IMAGES/Standard/1504749#1/0000001...",2022-02-22 09:45:01.456000,18,job-preview-ready


In [8]:
# check that the id column holds no duplicate values
any(job_events_df.id.duplicated())

False

## b) Splitting the payload

The content of the payload depends on the tag, so the dataset is first subdivided by tag and the payload is then split within each subset.
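
To illustrate why the split is done per tag, here is a minimal sketch on three made-up events whose payloads carry different keys. The payload contents are toy values; only the tag names and the keys `jobId`, `totalCopies`, `path` and `jobState` appear in the real data.

```python
import json
import pandas as pd

# Three toy events, one per tag, each with a different payload structure
events = pd.DataFrame({
    "id": [1, 2, 3],
    "tag": ["job-started", "job-preview-ready", "job-ended"],
    "payload": [
        '{"jobId": "1645522997", "totalCopies": 6}',
        '{"jobId": "1645522997", "path": "D:/IMAGES/preview.png"}',
        '{"jobId": "1645522997", "jobState": "SUCCESS"}',
    ],
})

# Parsing every payload at once yields the union of all keys, mostly empty
all_at_once = pd.DataFrame(events["payload"].map(json.loads).tolist())

# Parsing per tag yields one compact frame per payload structure
frames_by_tag = {}
for tag, group in events.groupby("tag", sort=False):
    parsed = pd.DataFrame(group["payload"].map(json.loads).tolist(), index=group.index)
    frames_by_tag[tag] = group.drop(columns=["payload", "tag"]).join(parsed)

print(all_at_once)
print(frames_by_tag["job-ended"])
```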

In [9]:
# list of tags
job_events_df.tag.unique().tolist()

['job-started', 'job-preview-ready', 'job-ended']

In [10]:
job_events_df

Unnamed: 0,id,payload,received_at,machine_id,tag
0,82917,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2022-02-22 09:43:18.114000,18,job-started
1,82918,"{""path"": ""D:/IMAGES/Standard/1504750#1/0000001...",2022-02-22 09:43:18.290000,18,job-preview-ready
2,82919,"{""jobId"": ""1645522997"", ""jobState"": ""SUCCESS"",...",2022-02-22 09:44:33.472000,18,job-ended
3,82921,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2022-02-22 09:45:01.297000,18,job-started
4,82922,"{""path"": ""D:/IMAGES/Standard/1504749#1/0000001...",2022-02-22 09:45:01.456000,18,job-preview-ready
...,...,...,...,...,...
112100,873661,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2023-11-16 12:48:08.191000,18,job-started
112101,873662,"{""path"": ""D:/IMAGES/Standard/2118611#1/0000001...",2023-11-16 12:48:08.328000,18,job-preview-ready
112102,873666,"{""jobId"": ""1700138888"", ""jobState"": ""SUCCESS"",...",2023-11-16 12:51:59.168000,18,job-ended
112103,873671,"{""iper"": [{""id"": ""PRINT_ENGINE_1"", ""LED"": 50, ...",2023-11-16 12:53:08.804000,18,job-started


In [11]:
# build one dataframe of split payloads per tag
job_event_payload = job_events_df.drop(['machine_id','received_at'], axis=1).copy()
job_started_df = payload_dataframe_by_tag(input_df=job_event_payload, tag='job-started')
job_preview_df = payload_dataframe_by_tag(input_df=job_event_payload, tag='job-preview-ready')
job_ended_df = payload_dataframe_by_tag(input_df=job_event_payload, tag='job-ended')

In [12]:
# check that the jobId values are unique within each dataframe
print('job-started tag :', any(job_started_df.jobId.duplicated()), job_started_df.jobId.nunique())
print('job-preview tag :', any(job_preview_df.jobId.duplicated()), job_preview_df.jobId.nunique())
print('job-ended tag :', any(job_ended_df.jobId.duplicated()), job_ended_df.jobId.nunique())

job-started tag : False 37398
job-preview tag : False 37374
job-ended tag : False 37333
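
The counts printed above differ from one tag to the next, which suggests that some jobs did not emit all three events. One possible way to list such jobs, sketched here on made-up `jobId` values, is a left merge with the `indicator` option of `DataFrame.merge`; this check is an illustration and is not part of the original notebook.

```python
import pandas as pd

# Toy frames: job "b" was started but never ended (illustrative values)
started = pd.DataFrame({"jobId": ["a", "b", "c"]})
ended = pd.DataFrame({"jobId": ["a", "c"]})

# indicator=True adds a '_merge' column telling where each row was found
check = started.merge(ended, on="jobId", how="left", indicator=True)

# rows flagged 'left_only' exist in 'started' but have no match in 'ended'
missing_end = check.loc[check["_merge"] == "left_only", "jobId"].tolist()
print(missing_end)
```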


### a. Dataframe for the job-started tag

In [13]:
# preview the values
job_started_df.head(3)

Unnamed: 0,id,iper,user,ifoil,jobId,layout,memjet,octopus,irDryers,uvDryers,machineId,timestamp,totalCopies,remoteScanner,remoteScannerRegistration,jsonVersion
0,82917,"[{'id': 'PRINT_ENGINE_1', 'LED': 50, 'bars': [...","{'level': 'Operator', 'operator': 'User'}","[{'id': 'IFOIL_1', 'speed': 24, 'enabled': Tru...",1645522997,"{'speed': 313, 'pageLayout': 'LEFT', 'imageLay...",[],[],"[{'id': 'IR_DRYER_1', 'power': 20, 'enable': F...","[{'id': 'UV_DRYER_1', 'power': 70, 'enable': T...","{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:43:18.1166478Z,6,[],"[{'id': 'REGISTRATION_SCANNER_1', 'mode': 3, '...",
1,82921,"[{'id': 'PRINT_ENGINE_1', 'LED': 50, 'bars': [...","{'level': 'Operator', 'operator': 'User'}","[{'id': 'IFOIL_1', 'speed': 24, 'enabled': Tru...",1645523101,"{'speed': 313, 'pageLayout': 'LEFT', 'imageLay...",[],[],"[{'id': 'IR_DRYER_1', 'power': 20, 'enable': F...","[{'id': 'UV_DRYER_1', 'power': 70, 'enable': T...","{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:45:01.3041033Z,11,[],"[{'id': 'REGISTRATION_SCANNER_1', 'mode': 3, '...",
2,82928,"[{'id': 'PRINT_ENGINE_1', 'LED': 50, 'bars': [...","{'level': 'Operator', 'operator': 'User'}","[{'id': 'IFOIL_1', 'speed': 24, 'enabled': Tru...",1645523250,"{'speed': 313, 'pageLayout': 'LEFT', 'imageLay...",[],[],"[{'id': 'IR_DRYER_1', 'power': 20, 'enable': F...","[{'id': 'UV_DRYER_1', 'power': 70, 'enable': T...","{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:47:30.3197334Z,7,[],"[{'id': 'REGISTRATION_SCANNER_1', 'mode': 3, '...",


In [14]:
# drop the columns that hold no values:
job_started_df = job_started_df.drop(['memjet','octopus'], axis=1)

In [15]:
# list the columns whose values are lists or dicts and need to be split
job_started_col_to_split = []
for col in job_started_df.columns :
    if isinstance(job_started_df[col].loc[0], list) or isinstance(job_started_df[col].loc[0], dict):
        job_started_col_to_split.append(col)

job_started_col_to_split

['iper',
 'user',
 'ifoil',
 'layout',
 'irDryers',
 'uvDryers',
 'machineId',
 'remoteScanner',
 'remoteScannerRegistration']

In [16]:
col_to_drop = []

#### 1) Splitting the iper column

In [17]:
job_started_df.iper.loc[0]

[{'id': 'PRINT_ENGINE_1',
  'LED': 50,
  'bars': [1, 2],
  'drops': 4,
  'enable': True,
  'dithering': False,
  'deadPixelsOffset': 0}]

In [18]:
# split the column
iper = convert_col_to_df('iper', job_started_df, {'id':[]})
iper.head(3)

Unnamed: 0,id,id_iper,LED_iper,bars_iper,drops_iper,enable_iper,dithering_iper,deadPixelsOffset_iper
0,82917,PRINT_ENGINE_1,50,"[1, 2]",4,True,False,0
1,82921,PRINT_ENGINE_1,50,"[1, 2]",4,True,False,0
2,82928,PRINT_ENGINE_1,50,"[1, 2]",4,True,False,0


In [19]:
# drop a column
#iper = iper.drop(['bars_iper'], axis=1)

In [20]:
# count the unique values per column and collect the constant ones
iper_col_to_drop = []
for col in iper.drop(['bars_iper'], axis=1).columns:
    print(col, iper[col].nunique())
    if iper[col].nunique() <= 1 :
        iper_col_to_drop.append(col)
col_to_drop.append(iper_col_to_drop)
iper_col_to_drop

id 37398
id_iper 1
LED_iper 20
drops_iper 8
enable_iper 1
dithering_iper 2
deadPixelsOffset_iper 5


['id_iper', 'enable_iper']
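
The same constant-column check is repeated for each split column below. A minimal sketch of how it could be factored into a helper follows; the name `constant_columns` and its `skip` parameter are hypothetical, and the toy values are made up. Columns holding lists (such as `bars_iper`) have to be skipped because lists are not hashable and `nunique` raises a `TypeError` on them.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame, skip=()) -> list:
    """Return the columns of df that hold at most one distinct non-null value."""
    return [
        col for col in df.columns
        if col not in skip and df[col].nunique() <= 1
    ]

# Toy frame shaped like the split 'iper' dataframe (illustrative values)
toy = pd.DataFrame({
    "id": [82917, 82921, 82928],
    "id_iper": ["PRINT_ENGINE_1"] * 3,
    "LED_iper": [50, 50, 40],
    "bars_iper": [[1, 2], [1, 2], [1, 2]],
})

to_drop = constant_columns(toy, skip=["bars_iper"])
print(to_drop)
```

Since `nunique` ignores missing values by default, a column that is entirely null counts zero distinct values and is reported as well.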

#### 2) Splitting the user column

In [21]:
# preview the values
job_started_df.user.loc[0]

{'level': 'Operator', 'operator': 'User'}

In [22]:
# split the column
user = convert_col_to_df('user', job_started_df, {'id':[]})
user.head(2)

Unnamed: 0,id,level_user,operator_user
0,82917,Operator,User
1,82921,Operator,User


#### 3) Splitting the ifoil column

In [23]:
# preview the values
job_started_df.ifoil.loc[0]

[{'id': 'IFOIL_1',
  'speed': 24,
  'enabled': True,
  'irEnable': False,
  'optifoil': False,
  'vacuumIn': 100,
  'vacuumOut': 100,
  'stampAreas': [{'id': 1, 'end': 483, 'start': 1, 'height': 482},
   {'id': 2, 'end': 0, 'start': 0, 'height': 0},
   {'id': 3, 'end': 0, 'start': 0, 'height': 0},
   {'id': 4, 'end': 0, 'start': 0, 'height': 0},
   {'id': 5, 'end': 0, 'start': 0, 'height': 0},
   {'id': 6, 'end': 0, 'start': 0, 'height': 0}],
  'irTemperature': 0,
  'heater1Enabled': True,
  'speedTensionIn': -0.6,
  'speedTensionOut': 1,
  'backSidePressure': 0,
  'filmSensor1Enable': False,
  'filmSensor2Enable': False,
  'filmSensor3Enable': False,
  'frontSidePressure': 0,
  'heater1Temperature': 125,
  'deadZoneStampAreaBack': 0,
  'deadZoneStampAreaFront': 0}]

In [24]:
# split the column
ifoil = convert_col_to_df('ifoil', job_started_df, {'id':[]})
ifoil.head(2)

Unnamed: 0,id,id_ifoil,speed_ifoil,enabled_ifoil,irEnable_ifoil,optifoil_ifoil,vacuumIn_ifoil,vacuumOut_ifoil,stampAreas_ifoil,irTemperature_ifoil,...,speedTensionIn_ifoil,speedTensionOut_ifoil,backSidePressure_ifoil,filmSensor1Enable_ifoil,filmSensor2Enable_ifoil,filmSensor3Enable_ifoil,frontSidePressure_ifoil,heater1Temperature_ifoil,deadZoneStampAreaBack_ifoil,deadZoneStampAreaFront_ifoil
0,82917,IFOIL_1,24.0,True,False,False,100,100,"[{'id': 1, 'end': 483, 'start': 1, 'height': 4...",0,...,-0.6,1.0,0,False,False,False,0,125,0,0
1,82921,IFOIL_1,24.0,True,False,False,100,100,"[{'id': 1, 'end': 483, 'start': 1, 'height': 4...",0,...,-0.6,1.0,0,False,False,False,0,125,0,0


In [25]:
# count the unique values per column and collect the constant ones
ifoil_col_to_drop = []
for col in ifoil.drop(['stampAreas_ifoil'], axis=1).columns:
    print(col, ifoil[col].nunique())
    if ifoil[col].nunique() <= 1 :
        ifoil_col_to_drop.append(col)
col_to_drop.append(ifoil_col_to_drop)
ifoil_col_to_drop

id 37398
id_ifoil 1
speed_ifoil 32
enabled_ifoil 2
irEnable_ifoil 1
optifoil_ifoil 2
vacuumIn_ifoil 2
vacuumOut_ifoil 2
irTemperature_ifoil 1
heater1Enabled_ifoil 2
speedTensionIn_ifoil 10
speedTensionOut_ifoil 8
backSidePressure_ifoil 1
filmSensor1Enable_ifoil 1
filmSensor2Enable_ifoil 1
filmSensor3Enable_ifoil 1
frontSidePressure_ifoil 1
heater1Temperature_ifoil 44
deadZoneStampAreaBack_ifoil 1
deadZoneStampAreaFront_ifoil 1


['id_ifoil',
 'irEnable_ifoil',
 'irTemperature_ifoil',
 'backSidePressure_ifoil',
 'filmSensor1Enable_ifoil',
 'filmSensor2Enable_ifoil',
 'filmSensor3Enable_ifoil',
 'frontSidePressure_ifoil',
 'deadZoneStampAreaBack_ifoil',
 'deadZoneStampAreaFront_ifoil']

#### 4) Splitting the layout column

In [26]:
# preview the values
job_started_df.layout.loc[0]

{'speed': 313,
 'pageLayout': 'LEFT',
 'imageLayout': {'x': 1488,
  'y': -24,
  'flip': False,
  'assembled': False,
  'rotate180deg': False},
 'paperFormat': {'name': '', 'width': 329.99, 'height': 483.02},
 'paperThickness': 0}

In [27]:
# split the column
layout_df = convert_col_to_df('layout', job_started_df, {'id':[]})
layout_df.head(3)

Unnamed: 0,id,speed_layout,pageLayout_layout,imageLayout_layout,paperFormat_layout,paperThickness_layout
0,82917,313,LEFT,"{'x': 1488, 'y': -24, 'flip': False, 'assemble...","{'name': '', 'width': 329.99, 'height': 483.02}",0
1,82921,313,LEFT,"{'x': 1488, 'y': -24, 'flip': False, 'assemble...","{'name': '', 'width': 329.99, 'height': 483.02}",0
2,82928,313,LEFT,"{'x': 1488, 'y': -24, 'flip': False, 'assemble...","{'name': '', 'width': 329.99, 'height': 483.02}",0


In [28]:
# split the imageLayout sub-column
imageLayout_layout = convert_col_to_df('imageLayout_layout', layout_df, {'id':[]})
imageLayout_layout.head(2)

Unnamed: 0,id,x_imageLayout_layout,y_imageLayout_layout,flip_imageLayout_layout,assembled_imageLayout_layout,rotate180deg_imageLayout_layout
0,82917,1488,-24,False,False,False
1,82921,1488,-24,False,False,False


In [29]:
# split the paperFormat sub-column
paperFormat_layout = convert_col_to_df('paperFormat_layout', layout_df, {'id':[]})
paperFormat_layout.head(2)

Unnamed: 0,id,name_paperFormat_layout,width_paperFormat_layout,height_paperFormat_layout
0,82917,,329.99,483.02
1,82921,,329.99,483.02


In [30]:
# merge the split columns
merge_imageLayout_paperFormat = pd.merge(imageLayout_layout, paperFormat_layout, how='outer', on='id')
merge_layout = pd.merge(merge_imageLayout_paperFormat, layout_df, how='outer', on='id')
merge_layout = merge_layout.drop(['imageLayout_layout','paperFormat_layout'], axis=1)
merge_layout.head(3)

Unnamed: 0,id,x_imageLayout_layout,y_imageLayout_layout,flip_imageLayout_layout,assembled_imageLayout_layout,rotate180deg_imageLayout_layout,name_paperFormat_layout,width_paperFormat_layout,height_paperFormat_layout,speed_layout,pageLayout_layout,paperThickness_layout
0,82917,1488,-24,False,False,False,,329.99,483.02,313,LEFT,0
1,82921,1488,-24,False,False,False,,329.99,483.02,313,LEFT,0
2,82928,1488,-24,False,False,False,,329.99,483.02,313,LEFT,0


In [31]:
# count the unique values per column and collect the constant ones
layout_col_to_drop = []
for col in merge_layout.columns:
    print(col, merge_layout[col].nunique())
    if merge_layout[col].nunique() <= 1 :
        layout_col_to_drop.append(col)
col_to_drop.append(layout_col_to_drop)
layout_col_to_drop

id 37398
x_imageLayout_layout 67
y_imageLayout_layout 81
flip_imageLayout_layout 1
assembled_imageLayout_layout 1
rotate180deg_imageLayout_layout 1
name_paperFormat_layout 3
width_paperFormat_layout 14
height_paperFormat_layout 13
speed_layout 56
pageLayout_layout 1
paperThickness_layout 1


['flip_imageLayout_layout',
 'assembled_imageLayout_layout',
 'rotate180deg_imageLayout_layout',
 'pageLayout_layout',
 'paperThickness_layout']

#### 5) Splitting the irDryers column

In [32]:
job_started_df.irDryers.loc[0]

[{'id': 'IR_DRYER_1', 'power': 20, 'enable': False}]

In [33]:
# split the column
irDryers = convert_col_to_df('irDryers', job_started_df, {'id':[]})
irDryers.head(3)

Unnamed: 0,id,id_irDryers,power_irDryers,enable_irDryers
0,82917,IR_DRYER_1,20,False
1,82921,IR_DRYER_1,20,False
2,82928,IR_DRYER_1,20,False


In [34]:
# count the unique values per column and collect the constant ones
irDryers_col_to_drop = []
for col in irDryers.columns:
    print(col, irDryers[col].nunique())
    if irDryers[col].nunique() <= 1 :
        irDryers_col_to_drop.append(col)
col_to_drop.append(irDryers_col_to_drop)
irDryers_col_to_drop

id 37398
id_irDryers 1
power_irDryers 18
enable_irDryers 1


['id_irDryers', 'enable_irDryers']

#### 6) Splitting the uvDryers column

In [35]:
job_started_df.uvDryers.loc[0]

[{'id': 'UV_DRYER_1', 'power': 70, 'enable': True}]

In [36]:
# split the column
uvDryers = convert_col_to_df('uvDryers', job_started_df, {'id':[]})
uvDryers.head(3)

Unnamed: 0,id,id_uvDryers,power_uvDryers,enable_uvDryers
0,82917,UV_DRYER_1,70,True
1,82921,UV_DRYER_1,70,True
2,82928,UV_DRYER_1,70,True


In [37]:
# count the unique values per column and collect the constant ones
uvDryers_col_to_drop = []
for col in uvDryers.columns:
    print(col, uvDryers[col].nunique())
    if uvDryers[col].nunique() <= 1 :
        uvDryers_col_to_drop.append(col)
col_to_drop.append(uvDryers_col_to_drop)
uvDryers_col_to_drop

id 37398
id_uvDryers 1
power_uvDryers 35
enable_uvDryers 1


['id_uvDryers', 'enable_uvDryers']

#### 7) Splitting the remoteScannerRegistration column

In [38]:
job_started_df.remoteScannerRegistration.loc[0]

[{'id': 'REGISTRATION_SCANNER_1',
  'mode': 3,
  'gridMode': {'redScore': 1500,
   'descriptor': {'name': None, 'rows': 0, 'columns': 0, 'default': False}},
  'registration': {'topMargin': 0, 'leftMargin': 0},
  'troubleshoot': False,
  'cropmarksMode': {'redScore': 1500,
   'cropmark1': {'x': 0, 'y': 0, 'score': 0, 'valid': True},
   'cropmark2': {'x': 0, 'y': 0, 'score': 0, 'valid': True}},
  'manualLighting': {'enable': False,
   'platePoint': {'x': 0, 'y': 0, 'valid': False},
   'exposureTime': 0,
   'linearLightOn': 0,
   'coaxialLightOn': 0,
   'substratePoint': {'x': 0, 'y': 0, 'valid': False},
   'coaxialPowerLevel': 0},
  'fullScannerMode': {'redScore': 1500, 'blueScore': 24, 'greenScore': 25},
  'specialSubstrate': {'enable': False, 'paperEdge': 0}}]

In [39]:
# split the column
remoteScannerRegistration = convert_col_to_df('remoteScannerRegistration', job_started_df, {'id':[]})
remoteScannerRegistration.head(3)

Unnamed: 0,id,id_remoteScannerRegistration,mode_remoteScannerRegistration,gridMode_remoteScannerRegistration,registration_remoteScannerRegistration,troubleshoot_remoteScannerRegistration,cropmarksMode_remoteScannerRegistration,manualLighting_remoteScannerRegistration,fullScannerMode_remoteScannerRegistration,specialSubstrate_remoteScannerRegistration
0,82917,REGISTRATION_SCANNER_1,3,"{'redScore': 1500, 'descriptor': {'name': None...","{'topMargin': 0, 'leftMargin': 0}",False,"{'redScore': 1500, 'cropmark1': {'x': 0, 'y': ...","{'enable': False, 'platePoint': {'x': 0, 'y': ...","{'redScore': 1500, 'blueScore': 24, 'greenScor...","{'enable': False, 'paperEdge': 0}"
1,82921,REGISTRATION_SCANNER_1,3,"{'redScore': 1500, 'descriptor': {'name': None...","{'topMargin': 0, 'leftMargin': 0}",False,"{'redScore': 1500, 'cropmark1': {'x': 0, 'y': ...","{'enable': False, 'platePoint': {'x': 0, 'y': ...","{'redScore': 1500, 'blueScore': 24, 'greenScor...","{'enable': False, 'paperEdge': 0}"
2,82928,REGISTRATION_SCANNER_1,3,"{'redScore': 1500, 'descriptor': {'name': None...","{'topMargin': 0, 'leftMargin': 0}",False,"{'redScore': 1500, 'cropmark1': {'x': 0, 'y': ...","{'enable': False, 'platePoint': {'x': 0, 'y': ...","{'redScore': 1500, 'blueScore': 24, 'greenScor...","{'enable': False, 'paperEdge': 0}"


In [40]:
# list the sub-columns to split, i.e. those whose values are lists or dicts:
remoteScannerRegistration_col_to_split = []
for col in remoteScannerRegistration.columns:
    if isinstance(remoteScannerRegistration[col].loc[0], list) or isinstance(remoteScannerRegistration[col].loc[0], dict):
        remoteScannerRegistration_col_to_split.append(col)

remoteScannerRegistration_col_to_split

['gridMode_remoteScannerRegistration',
 'registration_remoteScannerRegistration',
 'cropmarksMode_remoteScannerRegistration',
 'manualLighting_remoteScannerRegistration',
 'fullScannerMode_remoteScannerRegistration',
 'specialSubstrate_remoteScannerRegistration']

##### remoteScannerRegistration > gridMode

In [41]:
# split the gridMode sub-column
gridMode = convert_col_to_df('gridMode_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
gridMode.head(3)

Unnamed: 0,id,redScore_gridMode_remoteScannerRegistration,descriptor_gridMode_remoteScannerRegistration
0,82917,1500,"{'name': None, 'rows': 0, 'columns': 0, 'defau..."
1,82921,1500,"{'name': None, 'rows': 0, 'columns': 0, 'defau..."
2,82928,1500,"{'name': None, 'rows': 0, 'columns': 0, 'defau..."


In [42]:
# split the nested descriptor sub-column
descriptor = convert_col_to_df('descriptor_gridMode_remoteScannerRegistration', gridMode, {'id':[]})
descriptor.head(3)

Unnamed: 0,id,name_descriptor_gridMode_remoteScannerRegistration,rows_descriptor_gridMode_remoteScannerRegistration,columns_descriptor_gridMode_remoteScannerRegistration,default_descriptor_gridMode_remoteScannerRegistration
0,82917,,0,0,False
1,82921,,0,0,False
2,82928,,0,0,False


In [43]:
# merge the split columns
merge_gridMode = pd.merge(gridMode, descriptor, how='outer', on='id')
merge_gridMode = merge_gridMode.drop(['descriptor_gridMode_remoteScannerRegistration'], axis=1)
merge_gridMode.head(2)

Unnamed: 0,id,redScore_gridMode_remoteScannerRegistration,name_descriptor_gridMode_remoteScannerRegistration,rows_descriptor_gridMode_remoteScannerRegistration,columns_descriptor_gridMode_remoteScannerRegistration,default_descriptor_gridMode_remoteScannerRegistration
0,82917,1500,,0,0,False
1,82921,1500,,0,0,False


##### remoteScannerRegistration > registration

In [44]:
# split the registration sub-column
registration = convert_col_to_df('registration_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
registration.head(3)

Unnamed: 0,id,topMargin_registration_remoteScannerRegistration,leftMargin_registration_remoteScannerRegistration
0,82917,0,0
1,82921,0,0
2,82928,0,0


##### remoteScannerRegistration > cropmarksMode

In [45]:
# split the cropmarksMode sub-column
cropmarksMode = convert_col_to_df('cropmarksMode_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
cropmarksMode.head(3)

Unnamed: 0,id,redScore_cropmarksMode_remoteScannerRegistration,cropmark1_cropmarksMode_remoteScannerRegistration,cropmark2_cropmarksMode_remoteScannerRegistration
0,82917,1500,"{'x': 0, 'y': 0, 'score': 0, 'valid': True}","{'x': 0, 'y': 0, 'score': 0, 'valid': True}"
1,82921,1500,"{'x': 0, 'y': 0, 'score': 0, 'valid': True}","{'x': 0, 'y': 0, 'score': 0, 'valid': True}"
2,82928,1500,"{'x': 0, 'y': 0, 'score': 0, 'valid': True}","{'x': 0, 'y': 0, 'score': 0, 'valid': True}"


In [46]:
cropmarksMode_1 = convert_col_to_df('cropmark1_cropmarksMode_remoteScannerRegistration', cropmarksMode, {'id':[]})
cropmarksMode_2 = convert_col_to_df('cropmark2_cropmarksMode_remoteScannerRegistration', cropmarksMode, {'id':[]})
merge_cropmarksModes = pd.merge(cropmarksMode_1, cropmarksMode_2, how='outer', on='id')
merge_cropmarksMode = pd.merge(cropmarksMode, merge_cropmarksModes, how='outer', on='id')
merge_cropmarksMode = merge_cropmarksMode.drop(['cropmark1_cropmarksMode_remoteScannerRegistration','cropmark2_cropmarksMode_remoteScannerRegistration'], axis=1)
merge_cropmarksMode.head(2)

Unnamed: 0,id,redScore_cropmarksMode_remoteScannerRegistration,x_cropmark1_cropmarksMode_remoteScannerRegistration,y_cropmark1_cropmarksMode_remoteScannerRegistration,score_cropmark1_cropmarksMode_remoteScannerRegistration,valid_cropmark1_cropmarksMode_remoteScannerRegistration,x_cropmark2_cropmarksMode_remoteScannerRegistration,y_cropmark2_cropmarksMode_remoteScannerRegistration,score_cropmark2_cropmarksMode_remoteScannerRegistration,valid_cropmark2_cropmarksMode_remoteScannerRegistration
0,82917,1500,0,0,0,True,0,0,0,True
1,82921,1500,0,0,0,True,0,0,0,True


##### remoteScannerRegistration > manualLighting

In [47]:
# split the manualLighting sub-column
manualLighting = convert_col_to_df('manualLighting_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
manualLighting.head(3)

Unnamed: 0,id,enable_manualLighting_remoteScannerRegistration,platePoint_manualLighting_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,linearLightOn_manualLighting_remoteScannerRegistration,coaxialLightOn_manualLighting_remoteScannerRegistration,substratePoint_manualLighting_remoteScannerRegistration,coaxialPowerLevel_manualLighting_remoteScannerRegistration
0,82917,False,"{'x': 0, 'y': 0, 'valid': False}",0,0,0,"{'x': 0, 'y': 0, 'valid': False}",0
1,82921,False,"{'x': 0, 'y': 0, 'valid': False}",0,0,0,"{'x': 0, 'y': 0, 'valid': False}",0
2,82928,False,"{'x': 0, 'y': 0, 'valid': False}",0,0,0,"{'x': 0, 'y': 0, 'valid': False}",0


In [48]:
platePoint = convert_col_to_df('platePoint_manualLighting_remoteScannerRegistration', manualLighting, {'id':[]})
substratePoint = convert_col_to_df('substratePoint_manualLighting_remoteScannerRegistration', manualLighting, {'id':[]})
merge_plate_substrate = pd.merge(platePoint, substratePoint, how='outer', on='id')
merge_manualLighting = pd.merge(manualLighting, merge_plate_substrate, how='outer', on='id')
merge_manualLighting = merge_manualLighting.drop(['platePoint_manualLighting_remoteScannerRegistration','substratePoint_manualLighting_remoteScannerRegistration'], axis=1)
merge_manualLighting.head(2)

Unnamed: 0,id,enable_manualLighting_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,linearLightOn_manualLighting_remoteScannerRegistration,coaxialLightOn_manualLighting_remoteScannerRegistration,coaxialPowerLevel_manualLighting_remoteScannerRegistration,x_platePoint_manualLighting_remoteScannerRegistration,y_platePoint_manualLighting_remoteScannerRegistration,valid_platePoint_manualLighting_remoteScannerRegistration,x_substratePoint_manualLighting_remoteScannerRegistration,y_substratePoint_manualLighting_remoteScannerRegistration,valid_substratePoint_manualLighting_remoteScannerRegistration
0,82917,False,0,0,0,0,0,0,False,0,0,False
1,82921,False,0,0,0,0,0,0,False,0,0,False


##### remoteScannerRegistration > fullScannerMode

In [49]:
# split the fullScannerMode sub-column
fullScannerMode = convert_col_to_df('fullScannerMode_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
fullScannerMode.head(3)

Unnamed: 0,id,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration
0,82917,1500,24,25
1,82921,1500,24,25
2,82928,1500,24,25


##### remoteScannerRegistration > specialSubstrate

In [50]:
# split the specialSubstrate sub-column
specialSubstrate = convert_col_to_df('specialSubstrate_remoteScannerRegistration', remoteScannerRegistration, {'id':[]})
specialSubstrate.head(3)

Unnamed: 0,id,enable_specialSubstrate_remoteScannerRegistration,paperEdge_specialSubstrate_remoteScannerRegistration
0,82917,False,0
1,82921,False,0
2,82928,False,0


##### Merging the remoteScannerRegistration sub-columns

In [51]:
merge_registration_gridMode_df = pd.merge(registration, merge_gridMode, how='outer', on='id')
merge_cropmarksMode_df = pd.merge(merge_registration_gridMode_df, merge_cropmarksMode, how='outer', on='id')
merge_manualLighting_df = pd.merge(merge_cropmarksMode_df, merge_manualLighting, how='outer', on='id')
merge_fullScannerMode_df = pd.merge(merge_manualLighting_df, fullScannerMode, how='outer', on='id')
merge_specialSubstrate_df = pd.merge(merge_fullScannerMode_df, specialSubstrate, how='outer', on='id')
merge_remoteScannerRegistration = pd.merge(merge_specialSubstrate_df, remoteScannerRegistration, how='outer', on='id')
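
The chain of pairwise merges above can also be written as a fold over a list of dataframes. The sketch below shows the idea with `functools.reduce` on three made-up frames that share an `id` column; the shortened column names and the values are illustrative, not the notebook's actual data.

```python
from functools import reduce

import pandas as pd

# Toy frames sharing the 'id' key (illustrative values)
registration = pd.DataFrame({"id": [82917, 82921], "topMargin": [0, 0]})
full_scanner = pd.DataFrame({"id": [82917, 82921], "redScore": [1500, 1500]})
special = pd.DataFrame({"id": [82917, 82921], "paperEdge": [0, 0]})

# Fold the list left to right: ((registration + full_scanner) + special)
merged = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="id"),
    [registration, full_scanner, special],
)
print(merged)
```

This yields the same result as the explicit chain while avoiding one intermediate variable per step.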


In [52]:
# drop the sub-columns that have been split
merge_remoteScannerRegistration = merge_remoteScannerRegistration.drop(remoteScannerRegistration_col_to_split, axis=1)
merge_remoteScannerRegistration.head(2)

Unnamed: 0,id,topMargin_registration_remoteScannerRegistration,leftMargin_registration_remoteScannerRegistration,redScore_gridMode_remoteScannerRegistration,name_descriptor_gridMode_remoteScannerRegistration,rows_descriptor_gridMode_remoteScannerRegistration,columns_descriptor_gridMode_remoteScannerRegistration,default_descriptor_gridMode_remoteScannerRegistration,redScore_cropmarksMode_remoteScannerRegistration,x_cropmark1_cropmarksMode_remoteScannerRegistration,...,y_substratePoint_manualLighting_remoteScannerRegistration,valid_substratePoint_manualLighting_remoteScannerRegistration,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,paperEdge_specialSubstrate_remoteScannerRegistration,id_remoteScannerRegistration,mode_remoteScannerRegistration,troubleshoot_remoteScannerRegistration
0,82917,0,0,1500,,0,0,False,1500,0,...,0,False,1500,24,25,False,0,REGISTRATION_SCANNER_1,3,False
1,82921,0,0,1500,,0,0,False,1500,0,...,0,False,1500,24,25,False,0,REGISTRATION_SCANNER_1,3,False


##### Dropping the columns that hold only null values or a single value

In [53]:
# count the unique values per column and collect the constant ones
remoteScannerRegistration_df_col_to_drop = []
for col in merge_remoteScannerRegistration.columns:
    #print(col, merge_remoteScannerRegistration[col].nunique())
    if merge_remoteScannerRegistration[col].nunique() <= 1 :
        remoteScannerRegistration_df_col_to_drop.append(col)

print('total number of columns:', merge_remoteScannerRegistration.shape[1])
print('number of columns to drop:', len(remoteScannerRegistration_df_col_to_drop))
remoteScannerRegistration_df_col_to_drop

total number of columns: 36
number of columns to drop: 23


['topMargin_registration_remoteScannerRegistration',
 'leftMargin_registration_remoteScannerRegistration',
 'name_descriptor_gridMode_remoteScannerRegistration',
 'rows_descriptor_gridMode_remoteScannerRegistration',
 'columns_descriptor_gridMode_remoteScannerRegistration',
 'default_descriptor_gridMode_remoteScannerRegistration',
 'score_cropmark1_cropmarksMode_remoteScannerRegistration',
 'valid_cropmark1_cropmarksMode_remoteScannerRegistration',
 'score_cropmark2_cropmarksMode_remoteScannerRegistration',
 'valid_cropmark2_cropmarksMode_remoteScannerRegistration',
 'enable_manualLighting_remoteScannerRegistration',
 'linearLightOn_manualLighting_remoteScannerRegistration',
 'coaxialLightOn_manualLighting_remoteScannerRegistration',
 'coaxialPowerLevel_manualLighting_remoteScannerRegistration',
 'x_platePoint_manualLighting_remoteScannerRegistration',
 'y_platePoint_manualLighting_remoteScannerRegistration',
 'valid_platePoint_manualLighting_remoteScannerRegistration',
 'x_substratePo

In [54]:
# drop the columns
merge_remoteScannerRegistration = merge_remoteScannerRegistration.drop(remoteScannerRegistration_df_col_to_drop, axis=1)
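
The loop used to find these columns can also be written without an explicit `for`. The following is a minimal sketch, not the notebook's own helper: `constant_columns` and the toy dataframe are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the columns holding at most one distinct non-null value."""
    # nunique() skips NaN by default, so an all-null column counts as 0
    counts = df.nunique()
    return counts[counts <= 1].index.tolist()

# hypothetical toy data, not taken from the notebook
toy_df = pd.DataFrame({
    'id': [1, 2, 3],
    'mode': [3, 3, 3],                  # a single value
    'name': [np.nan, np.nan, np.nan],   # only nulls
    'score': [1500, 1400, 1500],        # varies, so it is kept
})

to_drop = constant_columns(toy_df)
reduced_df = toy_df.drop(columns=to_drop)
print(to_drop)
```

Note that `nunique()` hashes the values, so a column whose cells are Python lists would raise a `TypeError`; such a column would need to be converted (for example to tuples or strings) first.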

#### 8) Merging the split columns of job_started_df

In [55]:
merge_iper_user = pd.merge(iper, user, how='outer', on='id')
merge_ifoil_df = pd.merge(merge_iper_user, ifoil, how='outer', on='id')
merge_layout_df = pd.merge(merge_ifoil_df, merge_layout, how='outer', on='id')
merge_irDryers_df = pd.merge(merge_layout_df, irDryers, how='outer', on='id')
merge_uvDryers_df = pd.merge(merge_irDryers_df, uvDryers, how='outer', on='id')
merge_remoteScannerRegistration_df = pd.merge(merge_uvDryers_df, merge_remoteScannerRegistration, how='outer', on='id')
merge_remoteScannerRegistration_df.head(2)

Unnamed: 0,id,id_iper,LED_iper,bars_iper,drops_iper,enable_iper,dithering_iper,deadPixelsOffset_iper,level_user,operator_user,...,x_cropmark1_cropmarksMode_remoteScannerRegistration,y_cropmark1_cropmarksMode_remoteScannerRegistration,x_cropmark2_cropmarksMode_remoteScannerRegistration,y_cropmark2_cropmarksMode_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,mode_remoteScannerRegistration
0,82917,PRINT_ENGINE_1,50,"[1, 2]",4,True,False,0,Operator,User,...,0,0,0,0,0,1500,24,25,False,3
1,82921,PRINT_ENGINE_1,50,"[1, 2]",4,True,False,0,Operator,User,...,0,0,0,0,0,1500,24,25,False,3
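
The chain of pairwise merges above follows one pattern: outer-merge the next dataframe onto the running result, always on `id`. As a sketch under that assumption, the same pattern can be folded with `functools.reduce`. The helper `merge_all` and the three toy dataframes are hypothetical stand-ins, not objects from the notebook.

```python
from functools import reduce
import pandas as pd

# hypothetical stand-ins for dataframes such as iper, user and ifoil
iper_toy = pd.DataFrame({'id': [1, 2], 'LED_iper': [50, 50]})
user_toy = pd.DataFrame({'id': [1, 2], 'level_user': ['Operator', 'Operator']})
ifoil_toy = pd.DataFrame({'id': [1, 3], 'speed_ifoil': [24.0, 30.0]})

def merge_all(frames, key='id'):
    """Outer-merge a list of dataframes on a shared key, left to right."""
    return reduce(
        lambda left, right: pd.merge(left, right, how='outer', on=key),
        frames,
    )

merged_toy = merge_all([iper_toy, user_toy, ifoil_toy])
print(merged_toy)
```

This avoids naming every intermediate result, at the cost of not being able to inspect each step individually.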


In [56]:
# drop the columns that contain a single unique value
for cols in col_to_drop :
    for col in cols :
        merge_remoteScannerRegistration_df = merge_remoteScannerRegistration_df.drop(col, axis=1)
merge_remoteScannerRegistration_df.head(2)

Unnamed: 0,id,LED_iper,bars_iper,drops_iper,dithering_iper,deadPixelsOffset_iper,level_user,operator_user,speed_ifoil,enabled_ifoil,...,x_cropmark1_cropmarksMode_remoteScannerRegistration,y_cropmark1_cropmarksMode_remoteScannerRegistration,x_cropmark2_cropmarksMode_remoteScannerRegistration,y_cropmark2_cropmarksMode_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,mode_remoteScannerRegistration
0,82917,50,"[1, 2]",4,False,0,Operator,User,24.0,True,...,0,0,0,0,0,1500,24,25,False,3
1,82921,50,"[1, 2]",4,False,0,Operator,User,24.0,True,...,0,0,0,0,0,1500,24,25,False,3


In [57]:
merge_job_started_df = pd.merge(job_started_df, merge_remoteScannerRegistration_df, how='outer', on='id')
merge_job_started_df = merge_job_started_df.drop(job_started_col_to_split, axis=1)
merge_job_started_df.head(3)

Unnamed: 0,id,jobId,timestamp,totalCopies,jsonVersion,LED_iper,bars_iper,drops_iper,dithering_iper,deadPixelsOffset_iper,...,x_cropmark1_cropmarksMode_remoteScannerRegistration,y_cropmark1_cropmarksMode_remoteScannerRegistration,x_cropmark2_cropmarksMode_remoteScannerRegistration,y_cropmark2_cropmarksMode_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,mode_remoteScannerRegistration
0,82917,1645522997,2022-02-22T09:43:18.1166478Z,6,,50,"[1, 2]",4,False,0,...,0,0,0,0,0,1500,24,25,False,3
1,82921,1645523101,2022-02-22T09:45:01.3041033Z,11,,50,"[1, 2]",4,False,0,...,0,0,0,0,0,1500,24,25,False,3
2,82928,1645523250,2022-02-22T09:47:30.3197334Z,7,,50,"[1, 2]",4,False,0,...,0,0,0,0,0,1500,24,25,False,3


In [58]:
merge_job_started_df.columns

Index(['id', 'jobId', 'timestamp', 'totalCopies', 'jsonVersion', 'LED_iper',
       'bars_iper', 'drops_iper', 'dithering_iper', 'deadPixelsOffset_iper',
       'level_user', 'operator_user', 'speed_ifoil', 'enabled_ifoil',
       'optifoil_ifoil', 'vacuumIn_ifoil', 'vacuumOut_ifoil',
       'stampAreas_ifoil', 'heater1Enabled_ifoil', 'speedTensionIn_ifoil',
       'speedTensionOut_ifoil', 'heater1Temperature_ifoil',
       'x_imageLayout_layout', 'y_imageLayout_layout',
       'name_paperFormat_layout', 'width_paperFormat_layout',
       'height_paperFormat_layout', 'speed_layout', 'power_irDryers',
       'power_uvDryers', 'redScore_gridMode_remoteScannerRegistration',
       'redScore_cropmarksMode_remoteScannerRegistration',
       'x_cropmark1_cropmarksMode_remoteScannerRegistration',
       'y_cropmark1_cropmarksMode_remoteScannerRegistration',
       'x_cropmark2_cropmarksMode_remoteScannerRegistration',
       'y_cropmark2_cropmarksMode_remoteScannerRegistration',
       'expos

### b. Dataframe for the job preview tag

In [59]:
# preview the data
job_preview_df.head(3)

Unnamed: 0,id,path,image,jobId,machineId,jsonVersion
0,82918,D:/IMAGES/Standard/1504750#1/0000001.tif,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTE...,1645522997,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",
1,82922,D:/IMAGES/Standard/1504749#1/0000001.tif,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTE...,1645523101,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",
2,82929,D:/IMAGES/Standard/1505959#1/0000001 V01.tif,/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBw...,1645523250,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",


In [60]:
# # display the image
# import PIL.Image as Image
# import io, base64
# byte_data = job_preview_df.image.loc[0]
# b = base64.b64decode(byte_data)
# img = Image.open(io.BytesIO(b))
# img.show()
# img_name = job_preview_df.path.loc[0].split('/')[-1]
# img.save(img_name)
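
Before decoding a whole image with PIL, the format can be checked from the first few base64 characters. This is a small sketch using only the standard library; `looks_like_jpeg` is a hypothetical helper, and the sample string is just the visible start of the `image` value printed above (the rest is truncated in the output).

```python
import base64

# first 32 characters of the image value shown in the output above
image_prefix = '/9j/4AAQSkZJRgABAQEASABIAAD/4gxY'

def looks_like_jpeg(image_b64: str) -> bool:
    """Check whether base64 data starts with the JPEG start-of-image bytes."""
    # 4 base64 characters decode to exactly 3 bytes
    header = base64.b64decode(image_b64[:4])
    return header == b'\xff\xd8\xff'

is_jpeg = looks_like_jpeg(image_prefix)
print(is_jpeg)
```

The prefix `/9j/` decodes to `FF D8 FF`, the JPEG marker, which suggests the stored previews are JPEG data even though the `path` column points to `.tif` files.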

In [61]:
# drop the machineId column
job_preview_df = job_preview_df.drop(['machineId'], axis=1)

These data are useful for displaying the job images but are not relevant for exploration or prediction.

### c. Dataframe for the job end tag

In [62]:
# preview the data
job_ended_df.head(3)

Unnamed: 0,id,jobId,jobState,machineId,timestamp,totalCopies,varnishConsumption,jsonVersion
0,82919,1645522997,SUCCESS,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:44:33.3894028Z,6,"[{'iperId': 'PRINT_ENGINE_1', 'operatorSideTan...",
1,82924,1645523101,SUCCESS,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:46:34.9290928Z,11,"[{'iperId': 'PRINT_ENGINE_1', 'operatorSideTan...",
2,82930,1645523250,SUCCESS,"{'type': 'JETvarnish 3D EVO', 'numMachine': 68...",2022-02-22T09:48:37.5548877Z,7,"[{'iperId': 'PRINT_ENGINE_1', 'operatorSideTan...",


In [63]:
# drop the machineId column
job_ended_df = job_ended_df.drop(['machineId'], axis=1)

The varnishConsumption column holds nested data, so it can be split into sub-columns.

#### 1. Splitting the 'varnishConsumption' column

In [64]:
job_ended_df.varnishConsumption.loc[0]

[{'iperId': 'PRINT_ENGINE_1',
  'operatorSideTanks': [{'id': '3D Tank',
    'position': 0,
    'consumption': 4.585923001800001}],
  'technicalSideTanks': []}]

In [65]:
varnishConsumption = convert_col_to_df('varnishConsumption', job_ended_df, {'id':[]})
varnishConsumption.head(2)

Unnamed: 0,id,iperId_varnishConsumption,operatorSideTanks_varnishConsumption,technicalSideTanks_varnishConsumption
0,82919,PRINT_ENGINE_1,"[{'id': '3D Tank', 'position': 0, 'consumption...",[]
1,82924,PRINT_ENGINE_1,"[{'id': '3D Tank', 'position': 0, 'consumption...",[]


##### Splitting the 'operatorSideTanks_varnishConsumption' column

In [66]:
operatorSideTanks = convert_col_to_df('operatorSideTanks_varnishConsumption', varnishConsumption, {'id':[]})
operatorSideTanks.head(3)

Unnamed: 0,id,id_operatorSideTanks_varnishConsumption,position_operatorSideTanks_varnishConsumption,consumption_operatorSideTanks_varnishConsumption
0,82919,3D Tank,0,4.585923
1,82924,3D Tank,0,2.917403
2,82930,3D Tank,0,0.423666


##### Dropping the 'technicalSideTanks_varnishConsumption' column

In [67]:
# the column contains no values
varnishConsumption = varnishConsumption.drop(['technicalSideTanks_varnishConsumption'], axis=1)

##### Merging the varnishConsumption columns

In [68]:
merge_varnishConsumption = pd.merge(varnishConsumption, operatorSideTanks, how='outer', on='id')
merge_varnishConsumption = merge_varnishConsumption.drop(['operatorSideTanks_varnishConsumption'], axis=1)
merge_varnishConsumption.head(3)

Unnamed: 0,id,iperId_varnishConsumption,id_operatorSideTanks_varnishConsumption,position_operatorSideTanks_varnishConsumption,consumption_operatorSideTanks_varnishConsumption
0,82919,PRINT_ENGINE_1,3D Tank,0,4.585923
1,82924,PRINT_ENGINE_1,3D Tank,0,2.917403
2,82930,PRINT_ENGINE_1,3D Tank,0,0.423666
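
The split-then-merge steps above can also be approximated in one pass with `pd.json_normalize`. This is a sketch of an alternative, not the notebook's `convert_col_to_df`; the toy dataframe reuses values from the outputs above, and `event_id` is a hypothetical name used to avoid clashing with the tank's own `id` field.

```python
import pandas as pd

# hypothetical rows shaped like job_ended_df
ended_toy = pd.DataFrame({
    'id': [82919, 82924],
    'varnishConsumption': [
        [{'iperId': 'PRINT_ENGINE_1',
          'operatorSideTanks': [{'id': '3D Tank', 'position': 0,
                                 'consumption': 4.585923}],
          'technicalSideTanks': []}],
        [{'iperId': 'PRINT_ENGINE_1',
          'operatorSideTanks': [{'id': '3D Tank', 'position': 0,
                                 'consumption': 2.917403}],
          'technicalSideTanks': []}],
    ],
})

# one record per print engine, tagged with the event id it came from
records = [
    {**engine, 'event_id': int(event_id)}
    for event_id, engines in zip(ended_toy['id'], ended_toy['varnishConsumption'])
    for engine in engines
]

# one output row per operator-side tank, with the engine fields repeated
flat_toy = pd.json_normalize(
    records,
    record_path='operatorSideTanks',
    meta=['event_id', 'iperId'],
    record_prefix='operatorSideTanks_',
)
print(flat_toy)
```

An engine whose `operatorSideTanks` list is empty produces no row here, so this variant behaves like an inner join on the tanks rather than the outer merge used in the notebook.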


#### 2. Merging the job_ended sub-columns

In [69]:
merge_job_ended_df = pd.merge(job_ended_df, merge_varnishConsumption, how='outer', on='id')
merge_job_ended_df = merge_job_ended_df.drop(['varnishConsumption'], axis=1)
merge_job_ended_df.head(3)

Unnamed: 0,id,jobId,jobState,timestamp,totalCopies,jsonVersion,iperId_varnishConsumption,id_operatorSideTanks_varnishConsumption,position_operatorSideTanks_varnishConsumption,consumption_operatorSideTanks_varnishConsumption
0,82919,1645522997,SUCCESS,2022-02-22T09:44:33.3894028Z,6,,PRINT_ENGINE_1,3D Tank,0,4.585923
1,82924,1645523101,SUCCESS,2022-02-22T09:46:34.9290928Z,11,,PRINT_ENGINE_1,3D Tank,0,2.917403
2,82930,1645523250,SUCCESS,2022-02-22T09:48:37.5548877Z,7,,PRINT_ENGINE_1,3D Tank,0,0.423666


In [70]:
# collect the columns that have at most one distinct value
merge_job_ended_df_col_to_drop = []
for col in merge_job_ended_df.columns:
    if merge_job_ended_df[col].nunique() <= 1 :
        merge_job_ended_df_col_to_drop.append(col)

print('total number of columns:', merge_job_ended_df.shape[1])
print('number of columns to drop:', len(merge_job_ended_df_col_to_drop))
merge_job_ended_df_col_to_drop

total number of columns: 10
number of columns to drop: 4


['jsonVersion',
 'iperId_varnishConsumption',
 'id_operatorSideTanks_varnishConsumption',
 'position_operatorSideTanks_varnishConsumption']

In [71]:
# drop the single-valued columns
merge_job_ended_df = merge_job_ended_df.drop(merge_job_ended_df_col_to_drop, axis=1)

## c) Building the final dataframe

### 1. Checks

In [72]:
# shapes of the payload dataframes before combining them
print('job_started shape :', merge_job_started_df.shape)
print('job_preview shape :', job_preview_df.shape)
print('job_ended shape :', merge_job_ended_df.shape)

job_started shape : (37398, 42)
job_preview shape : (37374, 5)
job_ended shape : (37333, 6)


In [73]:
# check the integrity of the jobId values
print(merge_job_started_df['jobId'].isin(job_started_df['jobId']).value_counts())
print(merge_job_ended_df['jobId'].isin(job_ended_df['jobId']).value_counts())

True    37398
Name: jobId, dtype: int64
True    37333
Name: jobId, dtype: int64


The merged dataframes still contain the same jobId values as the original dataframes, which had no duplicates.

They can therefore be merged on jobId.
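
The `isin` check confirms that no jobId was lost, but it does not by itself test for duplicates. A minimal sketch of an explicit check follows, using hypothetical toy dataframes (the `'ERROR'` state is invented for the example). It relies on `Series.is_unique` and on the `validate` parameter of `pd.merge`, which raises `MergeError` when the key relationship is violated.

```python
import pandas as pd
from pandas.errors import MergeError

# hypothetical toy frames; end_dup_toy has a duplicated jobId on purpose
start_toy = pd.DataFrame({'jobId': ['a', 'b', 'c'], 'totalCopies': [6, 11, 7]})
end_toy = pd.DataFrame({'jobId': ['a', 'b'], 'jobState': ['SUCCESS', 'SUCCESS']})
end_dup_toy = pd.DataFrame({'jobId': ['a', 'a'], 'jobState': ['SUCCESS', 'ERROR']})

# explicit duplicate check on the merge key
start_is_unique = start_toy['jobId'].is_unique
dup_is_unique = end_dup_toy['jobId'].is_unique

# validate='one_to_one' makes pandas raise if either side has duplicate keys
ok = pd.merge(start_toy, end_toy, how='outer', on='jobId', validate='one_to_one')

try:
    pd.merge(start_toy, end_dup_toy, how='outer', on='jobId',
             validate='one_to_one')
    merge_rejected = False
except MergeError:
    merge_rejected = True

print(start_is_unique, dup_is_unique, merge_rejected)
```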

### 2. By merging

In [74]:
# merge the start-tag and end-tag datasets on jobId
merge_start_end_df = pd.merge(
    merge_job_started_df.drop(['id'],axis=1), 
    merge_job_ended_df.drop(['id'],axis=1), 
    how='outer', 
    on='jobId',
    suffixes=['_start', '_end'])

In [75]:
merge_payload_df = pd.merge(
    merge_start_end_df, 
    job_preview_df.drop(['id'],axis=1), 
    how='outer', 
    on='jobId')

In [76]:
merge_payload_df.head(2)

Unnamed: 0,jobId,timestamp_start,totalCopies_start,jsonVersion_x,LED_iper,bars_iper,drops_iper,dithering_iper,deadPixelsOffset_iper,level_user,...,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,mode_remoteScannerRegistration,jobState,timestamp_end,totalCopies_end,consumption_operatorSideTanks_varnishConsumption,path,image,jsonVersion_y
0,1645522997,2022-02-22T09:43:18.1166478Z,6.0,,50.0,"[1, 2]",4.0,False,0.0,Operator,...,25.0,False,3.0,SUCCESS,2022-02-22T09:44:33.3894028Z,6.0,4.585923,D:/IMAGES/Standard/1504750#1/0000001.tif,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTE...,
1,1645523101,2022-02-22T09:45:01.3041033Z,11.0,,50.0,"[1, 2]",4.0,False,0.0,Operator,...,25.0,False,3.0,SUCCESS,2022-02-22T09:46:34.9290928Z,11.0,2.917403,D:/IMAGES/Standard/1504749#1/0000001.tif,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTE...,
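
In the output above, integer columns such as totalCopies appear as floats (`6.0`). That is what pandas does when an outer merge introduces missing values into an integer column. The sketch below, on hypothetical toy data, shows how `indicator=True` reports which side each row came from and how the nullable `Int64` dtype can restore integer values.

```python
import pandas as pd

# hypothetical toy frames with partially overlapping jobId values
left_toy = pd.DataFrame({'jobId': ['a', 'b', 'c'], 'totalCopies': [6, 11, 7]})
right_toy = pd.DataFrame({'jobId': ['b', 'c', 'd'], 'jobState': ['SUCCESS'] * 3})

# indicator=True adds a _merge column: left_only, right_only or both
merged_toy = pd.merge(left_toy, right_toy, how='outer', on='jobId',
                      indicator=True)
origin_counts = merged_toy['_merge'].value_counts().to_dict()

# jobId 'd' has no start row, so totalCopies was upcast to float64
copies_dtype = str(merged_toy['totalCopies'].dtype)

# the nullable integer dtype keeps integers and marks the gap as <NA>
copies_int = merged_toy['totalCopies'].astype('Int64')
print(origin_counts, copies_dtype)
```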


In [77]:
merge_payload_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37412 entries, 0 to 37411
Data columns (total 48 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   jobId                                                  37412 non-null  object 
 1   timestamp_start                                        37398 non-null  object 
 2   totalCopies_start                                      37398 non-null  float64
 3   jsonVersion_x                                          22919 non-null  float64
 4   LED_iper                                               37398 non-null  float64
 5   bars_iper                                              37398 non-null  object 
 6   drops_iper                                             37398 non-null  float64
 7   dithering_iper                                         37398 non-null  object 
 8   deadPixelsOffset_iper                         

#### Output csv

In [79]:
merge_payload_df.to_csv(save_csv_merge)

### 3. By concatenation

To keep a dataset with one row per tag instead, concatenate the start-tag and end-tag datasets.

In [80]:
# concatenate the start-tag and end-tag datasets
concat_job_events_df = pd.concat([merge_job_started_df,merge_job_ended_df])
concat_job_events_df.reset_index(drop=True, inplace=True)

In [81]:
print(concat_job_events_df.info())
concat_job_events_df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74731 entries, 0 to 74730
Data columns (total 44 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   id                                                     74731 non-null  int64  
 1   jobId                                                  74731 non-null  object 
 2   timestamp                                              74731 non-null  object 
 3   totalCopies                                            74731 non-null  int64  
 4   jsonVersion                                            22919 non-null  float64
 5   LED_iper                                               37398 non-null  float64
 6   bars_iper                                              37398 non-null  object 
 7   drops_iper                                             37398 non-null  float64
 8   dithering_iper                                

Unnamed: 0,id,jobId,timestamp,totalCopies,jsonVersion,LED_iper,bars_iper,drops_iper,dithering_iper,deadPixelsOffset_iper,...,x_cropmark2_cropmarksMode_remoteScannerRegistration,y_cropmark2_cropmarksMode_remoteScannerRegistration,exposureTime_manualLighting_remoteScannerRegistration,redScore_fullScannerMode_remoteScannerRegistration,blueScore_fullScannerMode_remoteScannerRegistration,greenScore_fullScannerMode_remoteScannerRegistration,enable_specialSubstrate_remoteScannerRegistration,mode_remoteScannerRegistration,jobState,consumption_operatorSideTanks_varnishConsumption
0,82917,1645522997,2022-02-22T09:43:18.1166478Z,6,,50.0,"[1, 2]",4.0,False,0.0,...,0.0,0.0,0.0,1500.0,24.0,25.0,False,3.0,,
1,82921,1645523101,2022-02-22T09:45:01.3041033Z,11,,50.0,"[1, 2]",4.0,False,0.0,...,0.0,0.0,0.0,1500.0,24.0,25.0,False,3.0,,
2,82928,1645523250,2022-02-22T09:47:30.3197334Z,7,,50.0,"[1, 2]",4.0,False,0.0,...,0.0,0.0,0.0,1500.0,24.0,25.0,False,3.0,,
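
After concatenation, the tag a row came from can only be inferred from which columns are empty. One option, sketched here on hypothetical toy data, is to add an explicit `tag` column before concatenating. The labels `job_start` and `job_end` are assumptions based on the tag names listed in the notebook introduction.

```python
import pandas as pd

# hypothetical stand-ins for merge_job_started_df and merge_job_ended_df
started_toy = pd.DataFrame({'jobId': ['a', 'b'], 'LED_iper': [50, 50]})
ended_toy = pd.DataFrame({'jobId': ['a', 'b'],
                          'jobState': ['SUCCESS', 'SUCCESS']})

# label each source, then stack; ignore_index=True renumbers rows from 0
tagged_toy = pd.concat(
    [started_toy.assign(tag='job_start'), ended_toy.assign(tag='job_end')],
    ignore_index=True,
)
print(tagged_toy)
```

Columns that exist in only one source are filled with NaN for the rows of the other source, as in the output above.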


#### Output csv

In [82]:
concat_job_events_df.to_csv(save_csv_concat)
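
By default `to_csv` also writes the index, which reappears as a column named `Unnamed: 0` when the file is read back with `read_csv`. If the index carries no information, passing `index=False` avoids it. The sketch below uses an in-memory buffer and hypothetical toy data instead of the notebook's `save_csv_merge` and `save_csv_concat` paths.

```python
import io
import pandas as pd

# hypothetical toy data
csv_toy = pd.DataFrame({'jobId': ['a', 'b'], 'totalCopies': [6, 11]})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
csv_toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
csv_toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```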