# Quick training

This notebook demonstrates how to load our data and train a network on it.

## Initialisation

First, we set the working directory to the root of the project folder so that all data files are accessible.

<b>After the following cell runs, the working directory should look like "...\Roll Wear Project".</b>

In [1]:
from utils_notebooks import move_current_path_up
move_current_path_up(n_times=2)

Working directory = P:\My Documents\Projets Programmation\Roll Wear Project
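
The implementation of `move_current_path_up` is not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch under the assumption that it simply changes the working directory to the n-th parent folder. The name `move_up` and the temporary folder layout are hypothetical and only serve the illustration.

```python
import os
import tempfile
from pathlib import Path


def move_up(n_times: int = 1) -> Path:
    """Hypothetical stand-in: change the working directory to its n-th parent."""
    target = Path.cwd()
    for _ in range(n_times):
        target = target.parent
    os.chdir(target)
    print(f"Working directory = {target}")
    return target


# Demonstration in a throwaway folder, so the real working directory is untouched
start = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "notebooks" / "training"
    nested.mkdir(parents=True)
    os.chdir(nested)
    new_cwd = move_up(n_times=2)
    landed_on_root = os.path.samefile(new_cwd, tmp)
    os.chdir(start)  # leave the temporary folder before it is deleted
```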


## Loading the inputs

We will load inputs and outputs. Inputs are all the strips, each identified by a unique identifier and its campaign number.
Within a campaign, the strips are ordered by their unique identifiers.
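
To make the ordering rule concrete, the sketch below sorts a toy table indexed by `(id_campaign, id_strip)` and numbers the strips within each campaign. The values and the `width` and `position` columns are invented for the illustration, and it assumes that a smaller identifier means an earlier strip.

```python
import pandas as pd

# Toy data: the index mimics (id_campaign, id_strip); the values are arbitrary
toy = pd.DataFrame(
    {"width": [1200, 1100, 1300, 1250]},
    index=pd.MultiIndex.from_tuples(
        [(2, 105), (1, 102), (1, 101), (2, 104)],
        names=["id_campaign", "id_strip"],
    ),
)

# Sorting the MultiIndex orders by campaign first, then by strip identifier
ordered = toy.sort_index()

# Rank of each strip inside its own campaign (0 = first strip)
ordered["position"] = ordered.groupby(level="id_campaign").cumcount()
print(ordered)
```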

In [2]:
import pandas as pd

def load_strips(excel_path: str) -> pd.DataFrame:
    """Load the strip and campaign input data from one Excel file.

    Merging reference: https://datacarpentry.org/python-ecology-lesson/05-merging-data/
    """

    print("Loading Input data from excel. About 2mn left")
    # Loading raw strips data
    strips_df: pd.DataFrame = pd.read_excel(io=excel_path, sheet_name='Strips_data', usecols='B, F:AP, AS:BN',
                                            index_col=[0, 1], header=2,  skiprows=[3])
    strips_df.index.names = ['id_campaign', 'id_strip']  # Renaming the indexes

    # Data processing.
    # 1. We extract the families as one_hot vector
    strips_df = pd.get_dummies(strips_df, prefix=['family'], columns=['STIP GRADE FAMILY'])
    # 2. Oil flow rate is considered as ON/OFF
    strips_df['F6 Oil Flow Rate, ml/min'] = (strips_df['F6 Oil Flow Rate, ml/min'] > 0).astype(int)
    strips_df.rename(columns={'F6 Oil Flow Rate, ml/min': 'F6 Oil Flow Rate, on/off'}, inplace=True)

    print("Loading Input data from excel. About 1mn left")
    # Loading campaigns data
    camp_df: pd.DataFrame = pd.read_excel(io=excel_path, sheet_name='Campaign_data', header=1, skiprows=[2], 
                                          usecols='A, C:E, J:M, N:Q, R:U', index_col=0)
    camp_df.index.names = ['id_campaign']

    # We transform the line up and supplier columns into one_hot vectors
    camp_df = pd.get_dummies(camp_df, prefix=['lineup'], columns=['LINE_UP'])
    camp_df = pd.get_dummies(camp_df, prefix=['supplier_f6t', 'supplier_f6b', 'supplier_f7t', 'supplier_f7b'],
                             columns=['F6 TOP SUPPLIER', 'F6 BOT SUPPLIER', 'F7 TOP SUPPLIER', 'F7 BOT SUPPLIER'])

    return strips_df.join(camp_df, how='inner')
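
The three processing steps in `load_strips` (one-hot encoding, on/off thresholding, inner join on the campaign identifier) can be tried on a toy example without the Excel file. The column names and values below are made up for the sketch; only the pandas calls mirror the function above.

```python
import pandas as pd

# Toy strips table, indexed like the real one by (id_campaign, id_strip)
strips = pd.DataFrame(
    {"grade": ["A", "B", "A"], "oil_flow_ml_min": [0.0, 35.0, 12.5]},
    index=pd.MultiIndex.from_tuples(
        [(1, 101), (1, 102), (2, 201)], names=["id_campaign", "id_strip"]
    ),
)
# Toy campaigns table: campaign 2 is missing on purpose, campaign 3 has no strips
campaigns = pd.DataFrame(
    {"line_up": ["X", "Y"]},
    index=pd.Index([1, 3], name="id_campaign"),
)

# 1. Categorical column -> one indicator column per category
strips = pd.get_dummies(strips, prefix=["family"], columns=["grade"])
# 2. Flow rate -> 1 if any oil is flowing, 0 otherwise
strips["oil_on"] = (strips.pop("oil_flow_ml_min") > 0).astype(int)
# 3. Same encoding for the campaign-level categorical column
campaigns = pd.get_dummies(campaigns, prefix=["lineup"], columns=["line_up"])

# The inner join matches on the shared 'id_campaign' level and keeps only
# campaigns present in both tables (here: campaign 1)
merged = strips.join(campaigns, how="inner")
print(merged)
```

Because the join is `inner`, strips whose campaign is absent from the campaign sheet are silently dropped, so comparing row counts before and after the join is a cheap sanity check.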

We now load the data from the Excel file and display summary statistics.

In [3]:
input_file_path = 'Data/RawData/WearDataForDatamining.xlsx'
input_df = load_strips(input_file_path)

input_df.describe()

Loading Input data from excel. About 2mn left
Loading Input data from excel. About 1mn left


| | STRIP HARDNESS INDICATOR | STRIP WIDTH | STRIP LENGTH F5 EXIT* | STRIP LENGTH F6 EXIT* | STRIP LENGTH F7 EXIT | STAND FORCE / WIDTH F6* | STAND FORCE / WIDTH F7* | BENDING FORCE F6 | BENDING FORCE F7 | SHIFTINGF6 | ... | supplier_f7t_Kubota ECC-CX2 Type | supplier_f7t_National ICON | supplier_f7t_Union Electric UK Apex Alloy | supplier_f7t_Villares Vindex VRP0313 | supplier_f7b_Akers National Micra X | supplier_f7b_ESW VANIS | supplier_f7b_Kubota ECC-CX2 Type | supplier_f7b_National ICON | supplier_f7b_Union Electric UK Apex Alloy | supplier_f7b_Villares Vindex VRP0313 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | ... | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 | 51710.0 |
| mean | 1.258308 | 1208.92133 | 413.662288 | 534.74417 | 626.509944 | 1.019051 | 0.934226 | 61.529216 | 53.023186 | -0.893406 | ... | 0.440456 | 0.126378 | 0.009901 | 0.417231 | 0.004699 | 0.006034 | 0.516206 | 0.052253 | 0.065461 | 0.355347 |
| std | 0.12299 | 201.493205 | 136.987169 | 189.763687 | 229.364312 | 0.148514 | 0.149128 | 18.386355 | 20.096913 | 26.851735 | ... | 0.496447 | 0.332278 | 0.099013 | 0.493106 | 0.068391 | 0.077443 | 0.499742 | 0.222539 | 0.24734 | 0.478623 |
| min | 1.0 | 702.0 | 0.0 | 0.0 | 102.5088 | 0.572803 | 0.51481 | 20.9086 | 8.625 | -75.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.1619 | 1060.0 | 326.816179 | 413.152315 | 479.718725 | 0.919116 | 0.834361 | 47.00535 | 37.75 | -20.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.2094 | 1231.0 | 400.431897 | 517.556895 | 606.6122 | 1.005484 | 0.915359 | 62.2725 | 49.04035 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.4035 | 1342.0 | 482.330176 | 628.388838 | 744.56725 | 1.10591 | 1.017933 | 75.245 | 65.105525 | 20.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| max | 1.6901 | 1613.0 | 947.526261 | 1259.695772 | 1489.7225 | 1.725392 | 1.71124 | 119.665 | 121.2436 | 75.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Loading output data

We do the same for the output data, which is simpler to load.

In [4]:
def load_wearcentre(excel_file):
    """ Load the data of one excel file of wear centre data """
    
    print("Loading Output data from excel. Takes about 1mn")

    # We read the data from the Excel file
    wearcenter_df: pd.DataFrame = pd.read_excel(io=excel_file, sheet_name='Feuil1', usecols="A:E", 
                                                header=2, skiprows=[3], index_col=0)

    # renaming columns
    wearcenter_df.rename(inplace=True, columns={'Usure F6 TOP': 'f6t', 'Usure F6 BOT': 'f6b',
                                                'Usure F7 TOP': 'f7t', 'Usure F7 BOT': 'f7b'})
    wearcenter_df.index.names = ['id_campaign']
    
    return wearcenter_df

In [5]:
output_file_path = 'Data/RawData/WearCentres.xlsx'
output_df = load_wearcentre(output_file_path)

output_df.describe()

Loading Output data from excel. Takes about 1mn


| | f6t | f6b | f7t | f7b |
|---|---|---|---|---|
| count | 348.0 | 353.0 | 347.0 | 346.0 |
| mean | 0.260685 | 0.244293 | 0.193591 | 0.264225 |
| std | 0.147254 | 0.118168 | 0.090392 | 0.116364 |
| min | 0.020613 | -0.043806 | 0.005613 | 0.014968 |
| 25% | 0.15471 | 0.149613 | 0.12979 | 0.177669 |
| 50% | 0.234516 | 0.235581 | 0.191226 | 0.26179 |
| 75% | 0.335524 | 0.315 | 0.253823 | 0.345766 |
| max | 0.763419 | 0.635581 | 0.466581 | 0.657387 |
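
The `count` row differs from roll to roll (346 to 353) because `describe()` only counts non-missing values, so some campaigns lack a wear measurement for some rolls. The sketch below shows one way to quantify and filter such gaps on a toy table; the values are invented, and dropping incomplete campaigns is only one possible policy, not necessarily the one used later in this project.

```python
import numpy as np
import pandas as pd

# Toy wear table with one missing measurement (values are arbitrary)
toy_out = pd.DataFrame(
    {"f6t": [0.21, np.nan, 0.30], "f6b": [0.19, 0.25, 0.28]},
    index=pd.Index([1, 2, 3], name="id_campaign"),
)

# Number of missing measurements per roll
missing_per_roll = toy_out.isna().sum()
print(missing_per_roll)

# Keep only the campaigns where every roll has a measurement
complete = toy_out.dropna()
print(complete)
```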


## Saving the data

We save both tables to a single .h5 file, which loads much faster than the Excel files in later sessions.

In [6]:
input_df.to_hdf('Data/notebooks_data/wear_center.h5', key='inputs') 
output_df.to_hdf('Data/notebooks_data/wear_center.h5', key='outputs') 

The following function loads the data back from the saved file.

In [7]:
def load_hdf(file_path):
    """ Load the input and output DataFrames saved under the 'inputs' and 'outputs' keys """
    input_from_hdf = pd.read_hdf(file_path, key='inputs')
    output_from_hdf = pd.read_hdf(file_path, key='outputs')
    
    return input_from_hdf, output_from_hdf
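
A quick way to gain confidence in the save and load steps is a round trip on toy data. The sketch below repeats `load_hdf` so that it runs on its own and writes to a temporary file rather than to `Data/notebooks_data`. HDF support in pandas relies on the optional PyTables package (`tables`), so the check is skipped when that package is not installed; the toy values and the `roundtrip_ok` flag are illustrative only.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def load_hdf(file_path):
    """Same logic as the notebook function, repeated so the sketch is standalone."""
    inputs = pd.read_hdf(file_path, key="inputs")
    outputs = pd.read_hdf(file_path, key="outputs")
    return inputs, outputs


# pandas needs the optional PyTables package to read and write HDF5 files
HAS_PYTABLES = importlib.util.find_spec("tables") is not None
roundtrip_ok = None

if HAS_PYTABLES:
    idx = pd.Index([1, 2], name="id_campaign")
    toy_in = pd.DataFrame({"width": [1200.0, 1100.0]}, index=idx)
    toy_out = pd.DataFrame({"f6t": [0.21, 0.30]}, index=idx)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.h5")
        # Two keys in one file, as in the cell above
        toy_in.to_hdf(path, key="inputs")
        toy_out.to_hdf(path, key="outputs")
        back_in, back_out = load_hdf(path)

    roundtrip_ok = back_in.equals(toy_in) and back_out.equals(toy_out)

print("round trip OK:", roundtrip_ok)
```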

## Full data profiling

Run this only in the browser view of the notebook; it does not work in an IDE.

In [8]:
# import pandas_profiling
# 
# pandas_profiling.ProfileReport(input_df)
# pandas_profiling.ProfileReport(output_df)
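
When the full report is not available, for example inside an IDE, a much lighter summary can be built with pandas alone. This is a swapped-in alternative, not what `pandas_profiling` produces; the helper name `quick_profile` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, missing and distinct counts."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # distinct non-missing values
        }
    )


# Toy data standing in for input_df or output_df
toy = pd.DataFrame({"f6t": [0.21, np.nan, 0.30], "f6b": [0.19, 0.25, 0.25]})
report = quick_profile(toy)
print(report)
```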
