# Data pre-processing and visualization

The first step in applying machine learning methods is to explore and visualize the available data. The goal is to understand the properties of the dataset and the relationships within it (among the data columns, and between the data columns and the labels).
This task is divided into several stages:
   - Files containing useful data are uploaded from the folder.
   - The size of the data is determined.
   - For each file, the data type of each column is checked against the expected type. This also shows whether the data is numerical, categorical or mixed.
   - In general, some cells may be empty or contain NaN values. A check for empty cells is therefore performed as an example, even though this kind of error is not expected in the sensor data.
   - The Pandas method `describe` is used to obtain a statistical summary.
   - Some basic plots are presented in order to describe the nature of the data from the sensors.
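
The stages listed above can be sketched on a small synthetic table. The column names and values below are made up for illustration and are not taken from the forge data; only standard Pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical two-sensor trace with one missing reading
toy_trace = pd.DataFrame({
    "Force [kN]": [0.0, 12.5, 13.1, np.nan],
    "Schlagzahl [1/min]": [0, 120, 120, 118],
})

n_rows, n_cols = toy_trace.shape              # size of the data
column_types = toy_trace.dtypes               # numerical vs. categorical columns
missing_per_column = toy_trace.isna().sum()   # NaN count per column
summary = toy_trace.describe()                # count, mean, std, quartiles

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(summary)
```

Note that `describe` reports `count` without the NaN entries, so a `count` lower than the number of rows is itself a hint that values are missing.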

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os,sys
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import h5py
from scipy import signal

### Files upload, data size and type

Uploading 81 files which correspond to 81 production parts.

In [None]:
scope_traces=[] #this will be the list of dataframes
file_names=os.listdir('Data/AFRC Radial Forge - Zenodoo Upload v3/Data/ScopeTraces')

Upload check

In [None]:
file_names[0]

Creating the list of dataframes for the uploaded files:

In [None]:
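for file in file_names:
    #os.path.join keeps the path portable (hard-coded backslashes only work on Windows)
    path=os.path.join('Data','AFRC Radial Forge - Zenodoo Upload v3','Data','ScopeTraces',file)
    df=pd.read_csv(path,encoding = 'unicode_escape')
    scope_traces.append(df)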

In [None]:
type(scope_traces[0]) #checking the type of element in the list (dataframe)

A total of 81 files (81 sensor cycles, i.e. 81 parts produced per day in the forging process) are analyzed. The first file contains 23328 rows (time samples) from 97 sensors. As shown later, the number of rows differs between files.

In [None]:
scope_traces[0].head()

In [None]:
scope_traces[0].shape #file size

For each dataframe, the type of the values in every column can be checked. All values are expected to be integers (*int*) or real numbers (*float64*) because they represent measurement data from the sensors. This step is demonstrated on one file.

In [None]:
scope_traces[0].dtypes

### Check for missing values

Function *check_empty* checks whether the file contains empty cells. Here, an empty cell is assumed to be a cell holding an empty string ("").

In [None]:
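def check_empty(trace):
    #np.object was removed in NumPy 1.24; the built-in object is equivalent
    empty_cell=(trace.astype(object) == '').any()
    for has_empty in empty_cell: #one boolean per column
        if has_empty:
            print("Empty_cell")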

In [None]:
for i in range(len(scope_traces)): #check for all of 81 files
    print(i, end="\r") 
    check_empty(scope_traces[i])
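
The check above only matches empty strings. A broader variant, sketched below under the assumption that NaN values and whitespace-only strings should also count as missing, returns the number of such cells per column. The helper name `count_missing` and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd

def count_missing(trace):
    """Count NaN cells and empty or whitespace-only string cells per column."""
    nan_cells = trace.isna()
    # Strings are stripped so that "" and " " are both treated as blank
    blank_cells = trace.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
    )
    return (nan_cells | blank_cells).sum()

toy_trace = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", " ", ""]})
missing = count_missing(toy_trace)
print(missing)
```

Returning counts instead of printing makes it easy to collect the result for all 81 files and to see which columns are affected.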

### Statistical summary

Based on previous notebooks and similar plots where changes in signals and peaks were observed, sensors are divided into two groups: forging and heating for each produced part. These sensors were chosen based on their importance for the heating or forging phases. It is important to notice that some sensors were dropped because they are not in use, or because they are described as auxiliary sensors. 

In [None]:
#select the forging-related columns for every part (no pre-allocation is needed, the list is built directly)

forging_sensors=[scope_traces[part][['Power [kW]', 'Force [kN]', 'A_ges_vibr','Schlagzahl [1/min]', 'RamRetract_ActSpd [rpm]',
       'A_ACTpos [mm]', 'L_ACTpos [mm]', 'R_ACTpos [mm]','SBA_ActPos [mm]', 'A_ACT_Force [kN]', 'DB_ACT_Force [kN]',
       'L_NOMpos [mm]', 'R_NOMpos [mm]', 'INDA_NOMpos [deg]','A_NOMpos [mm]', 'Frc_Volt','ForgingBox_Temp', 'L1.R_B41 [bar]', 'TMP_Ind_F [°C]','W2 Durchfluss [l]', 'W1 Durchfluss [l]']] for part in range(len(scope_traces))] 

In [None]:
#select the heating-related columns for every part (no pre-allocation is needed, the list is built directly)
heating_sensors=[scope_traces[part][['TMP_Ind_U1 [°C]','IP_ActPos [mm]', 'IP_NomPos']] for part in range(len(scope_traces))]

A statistical overview of the sensors is produced by the function *df_summary*. It is demonstrated for the first production part only; when it is run for other parts, the difference in sensor cycle length (the duration of the production of each part) can be seen in the `count` property. For some sensors the standard deviation is very high, and in some cases the median differs considerably from the mean. This suggests that it might be useful to extract the heating and forging phases instead of analyzing the signals for the whole production process.

In [None]:
def df_summary(part):
    print("Part", part.describe())

In [None]:
df_summary(forging_sensors[0])

In [None]:
df_summary(heating_sensors[0])

The statistical summary can also help to eliminate sensors with **low variance** and **zero variance**. Zero-variance features consist of a single repeated value. Low-variance features have mostly identical values and only a few unique ones.  
In addition to the statistical summary, function *plot_sensors* plots the signals from each sensor recorded during the whole production process of one part (which comprises three phases: heating, transfer and forging).

In [None]:
def plot_sensors(part):
    column_names=part.columns
    for column in column_names:
        fig = plt.figure(figsize=(4,4)) # define plot area
        #time axis in seconds (100 samples per second); integer arange avoids float-step length mismatches
        plt.plot(np.arange(part.shape[0])/100,part[column].values)
        plt.xlabel("Time, s")
        plt.ylabel(column)
        plt.show()

In [None]:
plot_sensors(forging_sensors[0])

In [None]:
plot_sensors(heating_sensors[0])

Pearson correlation coefficients between different sensors are calculated in the next cell. Correlation can have a big impact on feature selection: strongly correlated sensors should be detected and should not be used together in the same machine learning model. The correlation coefficients are calculated as mean values over the 81 parts.

In [None]:
def calculate_correlation(sensors):
    correlation_list=[sensors[i].corr().values for i in range(len(sensors))]
    correlation_array=np.array(correlation_list)
    print("Correlation array dimensions:",correlation_array.shape)
    corr_mean=np.mean(correlation_array,axis=0)
    print("Mean value of correlation coefficients for all sensors:")
    corr_mean_df=pd.DataFrame(corr_mean,columns=sensors[0].columns,index=sensors[0].columns)
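    #Styler.set_precision was removed in pandas 2.0; format(precision=2) is its replacement (pandas >= 1.3)
    return corr_mean_df.style.background_gradient(cmap='coolwarm').format(precision=2)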

In [None]:
calculate_correlation(forging_sensors) #mean values of corr coefficients for all parts

In [None]:
calculate_correlation(heating_sensors) #mean values of corr coefficients for all parts
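
Reading strongly correlated sensors off a colored table becomes tedious with many columns. The sketch below lists the pairs whose absolute correlation exceeds a threshold. It expects a plain correlation DataFrame (not the styled object returned above); the helper name `correlated_pairs` and the threshold of 0.9 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.9):
    """Return (sensor_a, sensor_b, r) for pairs with |r| >= threshold."""
    names = list(corr.columns)
    values = corr.to_numpy()
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):   # upper triangle only
            if abs(values[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(values[i, j])))
    return pairs

# Synthetic example: "b" is a scaled copy of "a", "c" is only weakly related
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [2.0, 4.0, 6.0, 8.0],
                    "c": [1.0, -1.0, 1.0, -1.0]})
pairs = correlated_pairs(toy.corr(), threshold=0.9)
print(pairs)
```

From each reported pair, one sensor can then be kept and the other dropped before model training.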

## Extraction of the heating phase and forging phase from the sensors' signals

The following analysis is based on the extraction of the heating and forging phase from the sensors. Then, all of the previous steps are repeated. 

Merge data from all parts into one Dataframe:

In [None]:
for index,df in enumerate(scope_traces):
    df['traceID'] = index+1

In [None]:
merged_data = pd.concat(scope_traces, ignore_index=True)
merged_data['Time [s]']=(merged_data.index.values)/100.0

Function *extract_heating_phases* treats a value of the sensor "$U_GH_HEATON_1 (U25S0).1" greater than zero as the start of the heating phase. When the value returns to zero, the heating phase is finished. This phase is extracted only for the heating sensors.

In [None]:
def extract_heating_phases(trace):
    heating_start=[]
    heating_stop=[]
    
    digital_sig_heating = trace['$U_GH_HEATON_1 (U25S0).1']>0
    heating_diff = digital_sig_heating.astype('int').diff()
    heating_start.append(heating_diff[heating_diff==1].index.values) # diff is +1 at the rising edge,
                                                                    #i.e. the sample where heating switches on
    heating_stop.append(heating_diff[heating_diff==-1].index.values) # diff is -1 at the falling edge,
                                                                    #i.e. the first sample after heating switches off
    heat_list=list(zip(heating_start[0], heating_stop[0]))
    heating_ph=[None]*(len(heat_list))
                  
    heating_ph=[trace[['TMP_Ind_U1 [°C]','IP_ActPos [mm]','IP_NomPos']][heat_list[i][0]:heat_list[i][1]] for i in range(len(heat_list))] 
        
    return heating_ph
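
The edge detection used in the function above can be checked on a toy signal. The values below are synthetic; the sketch only illustrates how `diff` on a 0/1 signal yields the start and stop indices.

```python
import pandas as pd

# Hypothetical on/off control signal with two "on" intervals
control = pd.Series([0, 0, 5, 5, 5, 0, 0, 7, 7, 0])

is_on = (control > 0).astype(int)
edges = is_on.diff()                          # +1 at switch-on, -1 at switch-off
starts = edges[edges == 1].index.to_list()    # first sample of each "on" interval
stops = edges[edges == -1].index.to_list()    # first sample after each interval

phases = list(zip(starts, stops))
print(phases)   # [(2, 5), (7, 9)]
```

One caveat of this approach: if the signal is already on at the first sample, or still on at the last one, the start and stop lists no longer line up and `zip` pairs them incorrectly or drops the last phase.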


Function *extract_forging_phases* uses the value of the sensor "Force [kN]" greater than zero as reference for the start of the forging phase.

In [None]:
def extract_forging_phases(trace):
    forging_start=[]
    forging_end=[]
    digital_sig_forge = trace['Force [kN]']>0
    forge_diff = digital_sig_forge.astype('int').diff()
    forging_start.append(forge_diff[forge_diff==1].index.values)
    forging_end.append(forge_diff[forge_diff==-1].index.values) 
                    
    forge_list=list(zip(forging_start[0], forging_end[0]))     
    forging_ph=[None]*(len(forge_list))
    forging_ph=[trace[['Power [kW]', 'Force [kN]', 'A_ges_vibr','Schlagzahl [1/min]', 'RamRetract_ActSpd [rpm]',
       'A_ACTpos [mm]', 'L_ACTpos [mm]', 'R_ACTpos [mm]','SBA_ActPos [mm]', 'A_ACT_Force [kN]', 'DB_ACT_Force [kN]',
       'L_NOMpos [mm]', 'R_NOMpos [mm]', 'INDA_NOMpos [deg]','A_NOMpos [mm]', 'Frc_Volt','ForgingBox_Temp', 'L1.R_B41 [bar]', 'TMP_Ind_F [°C]','W2 Durchfluss [l]', 'W1 Durchfluss [l]']][forge_list[i][0]:forge_list[i][1]] for i in range(len(forge_list))]
    return forging_ph

In [None]:
heating_ph=extract_heating_phases(merged_data)

In [None]:
forging_ph=extract_forging_phases(merged_data)

Minimum duration of heating and forging phase for all parts that were produced:

In [None]:
cycle_length_heat=[heating_ph[i].shape[0] for i in range(len(heating_ph))]
cycle_length_forge=[forging_ph[i].shape[0] for i in range(len(forging_ph))] #iterate over forging_ph, not heating_ph
min_length_heat=min(cycle_length_heat)
min_length_forge=min(cycle_length_forge)
print("Minimum length of the time signals for the heating phase is:",min_length_heat,"samples, found in cycle",cycle_length_heat.index(min_length_heat))
print("Minimum length of the time signals for the forging phase is:",min_length_forge,"samples, found in cycle",cycle_length_forge.index(min_length_forge))

Because the heating and forging phases differ in length from part to part, every phase is truncated to the minimum duration found above (10997 samples for heating and 5619 samples for forging). Then, the calculation of the statistical summary and of the correlation coefficients is repeated.

In [None]:
heating_ph_extracted=[None]*(len(heating_ph))
heating_ph_extracted=[heating_ph[x].iloc[:10997,:] for x in range(len(heating_ph))]

In [None]:
forging_ph_extracted=[None]*(len(heating_ph))
forging_ph_extracted=[forging_ph[x].iloc[:5619,:] for x in range(len(forging_ph))]

In [None]:
df_summary(heating_ph_extracted[10])

The statistical summary for the heating phase shows that the standard deviation of two of these sensors is close to zero. These sensors are therefore excluded from further analysis.

In [None]:
heating_sensors=[heating_ph_extracted[i].drop(columns=['IP_NomPos',"IP_ActPos [mm]"]) for i in range(len(heating_ph_extracted))]

In [None]:
df_summary(forging_ph_extracted[0])

The standard deviations of the sensors L1.R_B41 [bar] and ForgingBox_Temp are close to zero, so these sensors are eliminated.

In [None]:
#near-zero standard deviation
forging_sensors=[forging_ph_extracted[i].drop(columns=['L1.R_B41 [bar]',"ForgingBox_Temp"]) for i in range(len(forging_ph_extracted))]
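
The sensors above were picked by reading the summary table. As a rough sketch of how the same selection could be automated, the helper below flags columns whose standard deviation is tiny relative to their largest absolute value. The function name and the tolerance `rel_tol` are hypothetical choices, not values used in this notebook.

```python
import pandas as pd

def near_constant_columns(part, rel_tol=1e-3):
    """Columns whose std is below rel_tol times the largest absolute value."""
    std = part.std()
    scale = part.abs().max().replace(0, 1.0)   # avoid dividing by zero
    return list(std[(std / scale) < rel_tol].index)

toy = pd.DataFrame({"constant": [5.0, 5.0, 5.0, 5.0],
                    "almost": [100.0, 100.0, 100.0, 100.01],
                    "varying": [1.0, 4.0, 2.0, 8.0]})
flagged = near_constant_columns(toy)
print(flagged)   # ['constant', 'almost']
```

A relative tolerance is used so that sensors with very different units and magnitudes can share one threshold.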

In [None]:
calculate_correlation(forging_sensors) #mean values of corr coefficients for all parts

In [None]:
#strong correlation (the two near-constant sensors are listed again because the list is rebuilt from forging_ph_extracted)
forging_sensors=[forging_ph_extracted[i].drop(columns=['Power [kW]','R_NOMpos [mm]',"L_NOMpos [mm]", "R_ACTpos [mm]",'W1 Durchfluss [l]',"A_NOMpos [mm]",'L1.R_B41 [bar]',"ForgingBox_Temp"]) for i in range(len(forging_ph_extracted))]

In [None]:
calculate_correlation(forging_sensors)

In [None]:
plot_sensors(forging_sensors[0])

In [None]:
forging_sensors[0].shape

In [None]:
heating_sensors[0].shape

In [None]:
len(heating_sensors)

In [None]:
plot_sensors(heating_sensors[0])

Now, the data is arranged in a 3D array: the first dimension indexes the sensors, the second the cycles (81 cycles for the 81 parts produced) and the third the discrete time points (measurement values).

In [None]:
#split in dimensions 13x81x5619:
forging_sensors=[np.transpose(forging_sensors[i])for i in range(len(forging_sensors))] #81x13x5619
#define the array of sensors (13x81x5619):
split_forging_sensors=np.zeros((forging_sensors[0].shape[0],len(forging_sensors),forging_sensors[0].shape[1]))
#turn 81x13 into 13x81
for i in range(forging_sensors[0].shape[0]):
    for j in range(len(forging_sensors)):
        xx=np.asarray(forging_sensors[j])
        split_forging_sensors[i,j]=xx[i]
print("Dimensions of matrix:", split_forging_sensors.shape)

In [None]:
#split in dimensions 1x81x10997 (only one heating sensor remains after the drop above):
heating_sensors=[np.transpose(heating_sensors[i])for i in range(len(heating_sensors))] #81x1x10997
#define the array of sensors (1x81x10997):
split_heating_sensors=np.zeros((heating_sensors[0].shape[0],len(heating_sensors),heating_sensors[0].shape[1]))
#turn 81x1 into 1x81
for i in range(heating_sensors[0].shape[0]):
    for j in range(len(heating_sensors)):
        xx=np.asarray(heating_sensors[j])
        split_heating_sensors[i,j]=xx[i]
print("Dimensions of matrix:", split_heating_sensors.shape)
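
The nested loops above fill the array one row at a time. An equivalent and more compact way to get the same sensors x parts x samples layout is sketched below with `np.stack` and `transpose`, starting from untransposed dataframes (samples in rows, sensors in columns). The toy frames are synthetic stand-ins for the real parts.

```python
import numpy as np
import pandas as pd

# Three hypothetical parts, each with 4 samples of 2 sensors
parts = [pd.DataFrame({"s0": np.arange(4) + 10 * p,
                       "s1": np.arange(4) - 10 * p}) for p in range(3)]

# Stack to (parts, samples, sensors), then reorder to (sensors, parts, samples)
stacked = np.stack([part.to_numpy() for part in parts])
by_sensor = stacked.transpose(2, 0, 1)

print(by_sensor.shape)   # (2, 3, 4)
```

This only works when all parts have the same number of samples, which is why the phases are truncated to a common length first.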

In [None]:
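def plot_density_hist_forge(num_of_sensor,time):
    sns.set_style("whitegrid")
    values=split_forging_sensors[num_of_sensor,:,time] #one value per part at the chosen time sample
    #histplot and rugplot replace the deprecated sns.distplot (seaborn >= 0.11)
    sns.histplot(values, bins=5, kde=True, stat="density")
    sns.rugplot(values)
    plt.title("Histogram of"+" "+forging_sensors[0].index[num_of_sensor])
    plt.show()
#slider limits follow the array shape so that the largest index stays in range
interact(plot_density_hist_forge,num_of_sensor=widgets.IntSlider(min=0, max=split_forging_sensors.shape[0]-1, step=1),time=widgets.IntSlider(min=0, max=split_forging_sensors.shape[2]-1, step=1))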

In [None]:
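def plot_density_hist_heat(num_of_sensor,time):
    sns.set_style("whitegrid")
    values=split_heating_sensors[num_of_sensor,:,time] #one value per part at the chosen time sample
    #histplot and rugplot replace the deprecated sns.distplot (seaborn >= 0.11)
    sns.histplot(values, bins=5, kde=True, stat="density")
    sns.rugplot(values)
    plt.title("Histogram of"+" "+heating_sensors[0].index[num_of_sensor])
    plt.show()
#slider limits follow the array shape so that the largest index stays in range
interact(plot_density_hist_heat,num_of_sensor=widgets.IntSlider(min=0, max=split_heating_sensors.shape[0]-1, step=1),time=widgets.IntSlider(min=0, max=split_heating_sensors.shape[2]-1, step=1))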


For each sensor, a 20 s interval (2000 samples) around the midpoint of the phase is extracted and used in the next steps.
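
The slice limits used in the following cells can be derived from the phase length. The sketch below assumes a sampling rate of 100 Hz, which is inferred from the division by 100 used for the time axis in this notebook; the helper name `centered_window` is hypothetical.

```python
def centered_window(n_samples, window_s=20.0, fs=100.0):
    """Start/stop sample indices of a window centered on the midpoint."""
    width = int(round(window_s * fs))      # 20 s at 100 Hz -> 2000 samples
    start = n_samples // 2 - width // 2
    if start < 0 or start + width > n_samples:
        raise ValueError("window is longer than the signal")
    return start, start + width

forge_window = centered_window(5619)
heat_window = centered_window(10997)
print(forge_window)   # (1809, 3809)
print(heat_window)    # (4498, 6498)
```

These are the same limits as the hard-coded slices `1809:3809` and `4498:6498`.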

In [None]:
hf_forge = h5py.File('forge_sensors.h5', 'w')
hf_heat=h5py.File('heating_sensors.h5', 'w')

These sensors are saved separately for the use in the following notebooks.

In [None]:
for i in range(len(split_forging_sensors)): #one dataset per sensor, 20 s window around the midpoint
    hf_forge.create_dataset('Sensor'+str(i),data=split_forging_sensors[i][:,1809:3809])

In [None]:
for i in range(len(split_heating_sensors)): #one dataset per sensor, 20 s window around the midpoint
    hf_heat.create_dataset('Sensor'+str(i),data=split_heating_sensors[i][:,4498:6498])

In [None]:
hf_forge.close()

In [None]:
hf_heat.close()

Now, white noise $\epsilon$ is added to the observed part of the signals (20 s):

$$x_{n}(t) = x(t)+\epsilon$$


White noise has a normal distribution ${\mathcal {N}}(0 ,\sigma ^{2})$, where $\sigma ^{2}$ represents the variance. The process of adding white noise is as follows:
* Step 1: For each sensor, the mean value of the centered part of the signals is calculated.
* Step 2: The standard deviation $\sigma$ of the white noise is defined as 1% of the mean value calculated in Step 1.

This is done separately for the forging and heating sensors.

In [None]:
mean_value_forge= [split_forging_sensors[i][:,1809:3809].mean() for i in range(len(split_forging_sensors))]
white_noise_forge=[np.random.randn(2000)*0.01*(mean_value_forge[i]) for i in range(len(split_forging_sensors))]

mean_value_heat= [split_heating_sensors[i][:,4498:6498].mean() for i in range(len(split_heating_sensors))]
white_noise_heat=[np.random.randn(2000)*0.01*(mean_value_heat[i]) for i in range(len(split_heating_sensors))]
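
As a small sanity check of the two steps, the sketch below repeats them on a synthetic sensor with a seeded generator, so that the result is repeatable. The signal level of 50 and the seed are arbitrary; `np.random.default_rng` is used here instead of `np.random.randn`, which is a substitution, not what the cell above does.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # seed chosen only for repeatability

# Hypothetical sensor: 81 parts x 2000 samples around a level of 50
x = 50.0 + rng.normal(size=(81, 2000))

sigma = 0.01 * abs(x.mean())               # 1% of the mean, kept non-negative
eps = rng.normal(loc=0.0, scale=sigma, size=2000)
x_noisy = x + eps                          # the same noise row is added to every part

print(sigma, eps.std())
```

Because the noise vector has length 2000, broadcasting adds the same realization to all 81 parts, as in the cell above. Drawing noise of shape `(81, 2000)` instead would give every part its own realization.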

The white noise signal is plotted together with the original signal to check whether the chosen variance is appropriate for each sensor.

In [None]:
def plot_white_noise_forge(num_of_sensor,part):
    
    fig = plt.figure(figsize=(7,7)) # define plot area
    plt.plot(np.arange(0,20,0.01),white_noise_forge[num_of_sensor],label="white noise")
    plt.plot(np.arange(0,20,0.01),split_forging_sensors[num_of_sensor][part,1809:3809], label="original signal")
    plt.xlabel("Time, s")
    plt.ylabel(forging_sensors[0].index[num_of_sensor])
    plt.legend()
    plt.show()
interact(plot_white_noise_forge,num_of_sensor=widgets.IntSlider(min=0, max=split_forging_sensors.shape[0]-1, step=1),part=widgets.IntSlider(min=0, max=split_forging_sensors.shape[1]-1, step=1)) #limits follow the array shape


In [None]:
def plot_white_noise_heat(num_of_sensor,part):
    
    fig = plt.figure(figsize=(7,7)) # define plot area
    plt.plot(np.arange(0,20,0.01),white_noise_heat[num_of_sensor],label="white noise")
    plt.plot(np.arange(0,20,0.01),split_heating_sensors[num_of_sensor][part,4498:6498], label="original signal")
    plt.xlabel("Time, s")
    plt.ylabel(heating_sensors[0].index[num_of_sensor])
    plt.legend()
    plt.show()
interact(plot_white_noise_heat,num_of_sensor=widgets.IntSlider(min=0, max=split_heating_sensors.shape[0]-1, step=1),part=widgets.IntSlider(min=0, max=split_heating_sensors.shape[1]-1, step=1)) #limits follow the array shape


After inspecting the plots, the white noise signals are added to the original sensor signals.

In [None]:
#adding white noise
def add_white_noise(sensors,white_noise,start,end):
    sensors_with_noise=[sensors[i][:,start:end]+ white_noise[i] for i in range(len(sensors))]   
    return sensors_with_noise

In [None]:
sensors_with_noise_forge=add_white_noise(split_forging_sensors,white_noise_forge,1809,3809)
sensors_with_noise_heat=add_white_noise(split_heating_sensors,white_noise_heat,4498,6498)

In [None]:
def plot_signal_with_noise_forge(num_of_sensor,part):
    fig = plt.figure(figsize=(7,7)) # define plot area
    plt.plot(np.arange(0,20,0.01),sensors_with_noise_forge[num_of_sensor][part],label=" Signal with white noise")
    plt.plot(np.arange(0,20,0.01),split_forging_sensors[num_of_sensor][part,1809:3809], label="Original signal")
    plt.xlabel("Time, s")
    plt.ylabel(forging_sensors[0].index[num_of_sensor])
    plt.legend()
    plt.show()
interact(plot_signal_with_noise_forge,num_of_sensor=widgets.IntSlider(min=0, max=split_forging_sensors.shape[0]-1, step=1),part=widgets.IntSlider(min=0, max=split_forging_sensors.shape[1]-1, step=1)) #limits follow the array shape


In [None]:
def plot_signal_with_noise_heat(num_of_sensor,part):
    fig = plt.figure(figsize=(7,7)) # define plot area
    plt.plot(np.arange(0,20,0.01),sensors_with_noise_heat[num_of_sensor][part],label="Signal with white noise")
    plt.plot(np.arange(0,20,0.01),split_heating_sensors[num_of_sensor][part,4498:6498], label="Original signal")
    plt.xlabel("Time, s")
    plt.ylabel(heating_sensors[0].index[num_of_sensor])
    plt.legend()
    plt.show()
interact(plot_signal_with_noise_heat,num_of_sensor=widgets.IntSlider(min=0, max=split_heating_sensors.shape[0]-1, step=1),part=widgets.IntSlider(min=0, max=split_heating_sensors.shape[1]-1, step=1)) #limits follow the array shape


In [None]:
forging_sensors[0].index

In [None]:
time=20
n_of_sampling_points=2000
freq = np.fft.rfftfreq(n_of_sampling_points, float(time)/n_of_sampling_points)   # frequency axis
amp = np.fft.rfft(sensors_with_noise_forge[0][0])   

# END

In [None]:
m=sensors_with_noise_forge[0][0].size

In [None]:
w = np.fft.rfft(sensors_with_noise_forge[0][0] * signal.get_window("hamming", m))
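
To see what the windowed spectrum looks like for a known input, the sketch below applies the same steps to a synthetic 5 Hz tone sampled at an assumed 100 Hz. It uses `np.hamming` (a symmetric window) as a stand-in for `signal.get_window("hamming", m)`, which returns a periodic window by default, so the values differ slightly.

```python
import numpy as np

fs = 100.0                                  # assumed sampling rate in Hz
n = 2000                                    # 20 s of samples
t = np.arange(n) / fs
x = 3.0 * np.sin(2 * np.pi * 5.0 * t)       # synthetic 5 Hz tone, amplitude 3

window = np.hamming(n)                      # symmetric Hamming window from NumPy
spectrum = np.fft.rfft(x * window)
freq = np.fft.rfftfreq(n, d=1.0 / fs)

peak_hz = freq[np.argmax(np.abs(spectrum))]
# Dividing by the window sum compensates the window gain (factor 2: one-sided spectrum)
peak_amplitude = 2.0 * np.abs(spectrum).max() / window.sum()
print(peak_hz, peak_amplitude)
```

The window reduces spectral leakage at the cost of a wider main lobe, and the raw FFT magnitudes need the normalization shown above before they can be read as signal amplitudes.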

In [None]:
plt.plot(np.arange(2000),sensors_with_noise_forge[0][0] * signal.get_window("hamming", m))

In [None]:
time=20
n_of_sampling_points=2000
freq = np.fft.rfftfreq(n_of_sampling_points, float(time)/n_of_sampling_points)   # frequency axis

In [None]:
fig = plt.figure(figsize=(15,5))
plt.plot(freq, 20*np.log10(np.abs(w)))
plt.xlabel("Frequency (Hz)")  
plt.ylabel("Magnitude (dB)") #the plotted quantity is 20*log10 of the FFT magnitude

In [None]:
fig = plt.figure(figsize=(15,5))

plt.plot(freq, (np.abs(w)))
plt.xlabel("Frequency (Hz)")  
plt.ylabel("FFT magnitude") #unnormalized magnitude of the windowed forging signal

In [None]:
plt.plot(np.arange(0,20,0.01),sensors_with_noise_forge[0][0], label="Real values")
plt.ylabel(forging_sensors[0].index[0]) #name of the first forging sensor
plt.xlabel("Time (s)")

In [None]:
fig = plt.figure(figsize=(15,5))
plt.plot(freq,np.abs(amp))
plt.xlabel("Frequency (Hz)")  
plt.ylabel("FFT magnitude") #unnormalized magnitude of the forging signal

In [None]:
time=20
n_of_sampling_points=2000
freq = np.fft.rfftfreq(n_of_sampling_points, float(time)/n_of_sampling_points)   # frequency axis
amp = np.fft.rfft(sensors_with_noise_heat[0][0])   

In [None]:
plt.plot(np.arange(0,20,0.01),sensors_with_noise_heat[0][0], label="Real values")
plt.ylabel(heating_sensors[0].index[0]) #name of the first heating sensor
plt.xlabel("Time (s)")

In [None]:
fig = plt.figure(figsize=(15,5))
plt.plot(freq,np.abs(amp))
plt.xlabel("Frequency (Hz)")  
plt.ylabel("FFT magnitude") #unnormalized magnitude of the heating signal