#Dataset Preparation for Sequence Modeling

Data needs to be packed(= arranged) in sequences

NOTE: Sequence data of shape [no_of_files, SEQ_LEN, C, H, W] for all bearing data files doesn't fit in memory (RAM), so the sequences need to be packed at runtime, just before they are fed to the LSTM. (no_of_files = total no. of files present in a single combined pickle file (bearing1_1.pkz, bearing1_2.pkz, bearing2_1.pkz, bearing2_2.pkz, bearing3_1.pkz, bearing3_2.pkz), SEQ_LEN = length of each sequence, C = no. of channels, H = height of an image in pixels, W = width of an image in pixels)

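(runtime - Here, "at runtime" means while the program is running. The sequences are built on the fly as the data is read, instead of being computed and stored in advance. The Python interpreter (for example python.exe on Windows) compiles our Python code into bytecode, and the Python virtual machine then executes that bytecode. Bytecode is not read by the CPU directly.)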
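
As a rough illustration of packing sequences at runtime, the sketch below yields one sequence at a time from an array of feature images instead of storing every sequence up front. The function name `iter_sequences`, the small demo arrays, and the choice to pair each sequence with the label of its last image are assumptions made for this example, not something taken from the notebook.

```python
import numpy as np

SEQ_LEN = 5  # sequence length, as in the parameters of this notebook


def iter_sequences(x, y, seq_len=SEQ_LEN):
    """Yield (sequence, label) pairs one at a time instead of storing them all.

    x: array of shape (no_of_files, C, H, W)
    y: array of shape (no_of_files,)
    Pairing each window with the label of its last image is an assumption.
    """
    for start in range(len(x) - seq_len + 1):
        stop = start + seq_len
        yield x[start:stop], y[stop - 1]


# small stand-in arrays (the real feature images are 2 x 128 x 128)
x_demo = np.zeros((8, 2, 4, 4), dtype=np.float32)
y_demo = np.linspace(0.0, 1.0, 8)

sequences = list(iter_sequences(x_demo, y_demo))
print(len(sequences), sequences[0][0].shape)  # 4 (5, 2, 4, 4)

# memory needed if all windows of Bearing1_1 were stored up front as float64
n_windows = 2803 - SEQ_LEN + 1
bytes_needed = n_windows * SEQ_LEN * 2 * 128 * 128 * 8
print(round(bytes_needed / 1e9, 2), "GB")     # 3.67 GB
```

Each yielded sequence is a slice of the original array, so no extra copy of the images is made until a batch is actually assembled.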

References - (1) https://www.youtube.com/watch?v=CznICCPa63Q&t=35s  (2) https://www.allerin.com/blog/sequence-modeling-for-beginners

Sequence - one input sample that consists of multiple data points which depend on each other. A sequence is a series of inputs in which the next input depends on the previous inputs. The number of data points can vary (= change) from one input sample to another.

(sequence synonyms = order, series, consecutive, succession)

Sequence Modeling - Sequence modeling, put simply, is the process of generating a sequence of values by analyzing a series of input values. These input values could be time series data (time series data is data that varies (= alters, changes) over a period of time). Sequence modeling is the task of predicting what comes next. Unlike FNN and CNN, in sequence modeling the current output depends on the previous inputs, and the length of the input is not fixed.

(Recurrent Neural Networks (RNN) are a well-known example of sequence models. Sequence models are machine learning models that take sequences of data as input or produce them as output. Sequence data includes time-series data, text streams, audio clips, video clips, etc.)

(modeling synonyms = creating, designing, forming) 

Problems with naive (non-recurrent) approaches to sequence modeling:

(1) Can't model long-term dependencies 

(2) Don't preserve(= maintain, support) order of data points

(3) no parameter sharing

To model sequences, need:

(1) to deal with variable length sequences

(2) to maintain sequence order

(3) to keep track of(monitor) long-term dependencies

(Long-term dependency - (long-term = longer period), long-term dependencies are those problems for which the desired output depends on inputs presented at times far in the past )

(4) to share parameters across(= everywhere on) the sequence
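
The four requirements above can be seen in a very small recurrent network. The following is a minimal NumPy sketch of a plain RNN forward pass, with illustrative sizes and random weights chosen only for this example. It is not the model used later in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_SIZE, HIDDEN_SIZE = 3, 4  # illustrative sizes

# one set of weights, reused at every time step (parameter sharing)
W_xh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, INPUT_SIZE))
W_hh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, HIDDEN_SIZE))
b_h = np.zeros(HIDDEN_SIZE)


def rnn_forward(sequence):
    """Run a plain RNN over a sequence of shape (time_steps, INPUT_SIZE)."""
    h = np.zeros(HIDDEN_SIZE)      # hidden state carries information forward
    for x_t in sequence:           # inputs are consumed in order
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h


short_seq = rng.normal(size=(2, INPUT_SIZE))   # 2 time steps
long_seq = rng.normal(size=(7, INPUT_SIZE))    # 7 time steps
print(rnn_forward(short_seq).shape, rnn_forward(long_seq).shape)  # (4,) (4,)
```

The same function handles sequences of different lengths, the loop keeps the order of the inputs, and the hidden state `h` passes information from earlier steps to later ones. In practice a plain RNN still struggles with long-term dependencies, which is one reason LSTM networks are used.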

Note:

Our dataset is a time-series dataset. Sequence modeling is quite useful for time-series prediction. By applying sequence modeling to the time-series input data, the model can use the order and history of the data points, which can make the prediction more accurate.

Dataset preparation is also called dataset preprocessing. Machine learning and deep learning models require preprocessed data for training, so data preprocessing should be performed on the dataset used to build the prediction model. Data preprocessing cleans and organizes (= formats, arranges) the raw data to make it suitable for building and training machine learning and deep learning models. Raw data is data that has not yet been preprocessed or prepared, and a model usually cannot use it directly. Preprocessing the raw data also removes unwanted and faulty data (errors).

#Mounting Google Drive

In [None]:
'''
Mounting => Before the computer can use any kind of storage device (such as a hard drive or Google Drive), the operating system must make it
accessible through the computer's file system. This process is called mounting. We can only access files on mounted media.
Mounting a file system (here, Google Drive) attaches it to a directory (the mount point) and makes it available to the system. In simple words,
after mounting Google Drive at /content/drive, our code can access all the files present in Google Drive. A mounted drive is available to the
operating system as a file system, for reading, writing, or both.
'''
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#Importing the Required Libraries

In [None]:
'''
Python library = a Python library is a collection of modules (= Python files). It is a reusable (= able to be used again) chunk (= part, section,
block) of code that we include in our Python programs or projects to make the implementation easier and faster.

import = load a module (or names from a module) into the current program so that its functions and classes can be used

os module in python provides functions for interacting with the operating system. os module in Python provides functions for creating and removing a 
directory(folder), fetching its contents, os module used for changing and identifying the current directory, etc. Basically os module allows source code
to communicate (interact) with operating system.

numpy module allows us to work with numerical data. numpy provides an object called numpy array. numpy supports large multi-dimensional arrays & matrices. 
Basically numpy is a python library used for working with arrays. numpy used for arithmetic operations, statistical operations, bitwise operations, copying 
and viewing arrays, stacking, matrix operations, linear algebra, mathematical operations, searching, sorting, and counting.

pywt library used to perform wavelet transform (both Continuous Wavelet Transform (CWT) and Discrete Wavelet Transform (DWT)). This pywt library is a 
package of various wavelets of CWT and DWT (pywt library contains different wavelets of CWT and DWT) to perform wavelet transform.
(A wavelet is a wave-like oscillation with an amplitude(= the maximum displacement or distance moved by a point on a vibrating body or wave) that begins at zero, 
increases, and then decreases back to zero.)

pandas library is used for data manipulation and data analysis. pandas module works with the tabular data (i.e. data in rows and columns). Pandas provide a
2D table object called dataframe. pandas module offers data structures and operations for manipulating numerical tables and time series.

Pickle in Python is primarily used for serializing and deserializing a Python object structure. Serializing (pickling) is the process of converting a
Python object into a byte stream so that it can be stored in a file/database. In this project, pickle is used to store the data of all the files of a
directory (folder) in a single combined file (pickle (.pkz) file) for easy fetching and fast retrieval of data. pickle.dump() writes (= stores) an object
to a pickle file. pickle.load() reads a pickle file and rebuilds (= deserializes) the object

Matplotlib is the most popular plotting(=sketching, drawing, designing, outlining) library for Python. python library = Collection of modules 
(modules = python files). Pyplot is a Matplotlib module which provides a MATLAB-like interface. Each pyplot function makes some change to a figure. For
example, creates a figure, creates a plotting area in a figure, plots some lines in a plotting area,  decorates the plot with labels, etc. The various plots
we can utilize using Pyplot are Line Plot, Histogram, Scatter, 3D Plot, Image, Contour, Hexagonal binning plot and Polar. Can also import as  
"import matplotlib.pyplot as plt"

The as keyword gives the imported module an alias (a shorter name), for example np for numpy and pd for pandas
'''
import os
import numpy as np
import pywt
import pandas as pd
import pickle as pkl
from matplotlib import pyplot as plt

#Parameters or Required Variables

In [None]:
'''
DATA_POINTS_PER_FILE => Data points present in each file. In the dataset in every directory (folder), each file contains 2560 data points. Therefore we
created a variable DATA_POINTS_PER_FILE and assigned that variable with 2560

TIME_PER_REC => Time recorded for each vibration signal. we store each vibration recording time in TIME_PER_REC. In the dataset, every 10 seconds, 1
vibration recording of 0.1 seconds is collected. That means for every 10 seconds, one vibration (both horizontal and vertical) is recorded for 0.1 seconds
Therefore we created a variable TIME_PER_REC and assigned that variable with 0.1

SAMPLING_FREQ => Sampling frequency or sampling rate defines the number of samples (data points) per second. The vibration sampling frequency is 25.6 kHz => for 1
second we get 25600 data points. Every 10 seconds one vibration recording of 0.1 s is collected => therefore 25600 * 0.1 (i.e. 2560) data points in each file.
For example, the Bearing1_1 set has a total of 2803 files, each file with 2560 data points => 2803*2560 = 7175680 total data points

The sampling period (or sampling interval) is the time difference between two consecutive (= successive, subsequent) samples. It is the
inverse of the sampling frequency. For example: if the sampling frequency is 25600 Hz, the sampling period is 1/25600 = 0.000039 seconds (that means the
samples are spaced approximately 39 microseconds apart)

WIN_SIZE = 20 => size of the fixed averaging window. We take 20 data points at a time and replace them with their mean, so each file of 2560 data points
is reduced to 2560/20 = 128 values

We are using the Morlet wavelet in our code. Therefore we created a variable called WAVELET_TYPE and assigned 'morl' to it. 'morl' is the name that the pywt
library uses for the Morlet wavelet, so passing WAVELET_TYPE to pywt selects the Morlet wavelet. The Morlet wavelet is one of the most commonly used
Continuous Wavelet Transform (CWT) wavelets for representing a vibration signal in the time and frequency domains

VAL_SPLIT => fraction of the training data that is split off (= divided, separated) as validation data.
Since we assigned 0.1 to VAL_SPLIT, 10% of the total training data is separated and used as validation data

SEQ_LEN = 5 => length of each sequence, i.e. each sequence contains 5 consecutive feature images (one per data file)

'''
DATA_POINTS_PER_FILE = 2560
TIME_PER_REC = 0.1
SAMPLING_FREQ = 25600 # 25.6 kHz
SAMPLING_PERIOD = 1.0/SAMPLING_FREQ

WIN_SIZE = 20
WAVELET_TYPE = 'morl'

VAL_SPLIT = 0.1 

SEQ_LEN = 5 # sequence length
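
The parameter values are tied together by simple arithmetic. The short check below repeats the constants so that it runs on its own. The names `points_per_file`, `points_after_window` and `sampling_period_us` are introduced only for this example.

```python
SAMPLING_FREQ = 25600        # Hz
TIME_PER_REC = 0.1           # seconds per recording
DATA_POINTS_PER_FILE = 2560
WIN_SIZE = 20

# samples in one recording = sampling frequency * recording time
points_per_file = round(SAMPLING_FREQ * TIME_PER_REC)
print(points_per_file == DATA_POINTS_PER_FILE)      # True

# values left per file after averaging every WIN_SIZE points
points_after_window = DATA_POINTS_PER_FILE // WIN_SIZE
print(points_after_window)                          # 128

# sampling period in microseconds
sampling_period_us = 1.0 / SAMPLING_FREQ * 1e6
print(round(sampling_period_us, 2))                 # 39.06
```

The 128 values per file are what later gives the 128 columns of each feature image.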

#Helper Functions

Helper Function = A helper function is a function that performs part of the computation (= operation, calculation) of another function. Helper functions make our programs (= codes) easier to read by giving names to computations, and they let us reuse those computations. We write a helper function when we need the same functionality (= purpose, operation) in multiple places in the code: instead of defining that functionality many times, we put it in one helper function and call the function wherever it is required.

(1) Loading (= reading) pickle (.pkz) files (Bearing1_1, Bearing1_2, Bearing2_1, Bearing2_2, Bearing3_1, Bearing3_2)

(2) Applying CWT(Continuous Wavelet Transform)

(3) Data Normalization

In [None]:
'''
using def keyword, defined(=created) a function named load_df and passed pkz_file as an argument to the function

with keyword => automatically closes the file. Whenever we open a file with the open() function, the operating system allocates some resources to that
file. We should call close() to release those resources, otherwise data may not be written completely or too many files may stay open. It is easy to
forget close(), and the file would also stay open if an error occurs before close() is reached. So it is better to use the "with" keyword along with
open(), because "with" closes the file automatically when the block ends, even if an error occurs inside the block

The open() function opens a file in text format by default. To open a file in binary format, add 'b' to the mode parameter. Hence the "rb" mode opens the file 
in binary format for reading, while the "wb" mode opens the file in binary format for writing. (Note: there are 2 basic mode parameters (r = read mode,
w = write mode)). Unlike text files, binary files are not human-readable.

as = In a with statement, the as keyword binds the object returned by open() to a name. In the code, open(pkz_file, 'rb') returns a file object and
we name it f. f is the opened file, not the file path, and we read from the file through f.

df => Dataframe. The pickle (.pkz) file is created with the pickle dump() method and is read back with the pickle load() method.
We imported the pickle module as pkl in the code. Therefore pkl.dump() is used to create the pickle (.pkz) file and pkl.load() is
used to load (= read, deserialize) the pickle file. Here, f is the opened pickle file.

return keyword = The return keyword is used to exit (= come out from) a function and return a value. (return df - returns df and exits the load_df function.
                 Here df is the dataframe rebuilt from the pickle file data, so return df returns a dataframe, not the file itself) (A dataframe looks like
                 tabular data i.e. data present in rows and columns)
'''
def load_df(pkz_file):  #load data frame from pickle file
    with open(pkz_file, 'rb') as f:
        df=pkl.load(f)
    return df
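
To see dump() and load() working together, here is a small round trip that uses only the standard library. The dictionary `original` and the temporary path `tmp_path` are stand-ins made up for this example. The notebook itself pickles a pandas dataframe and stores it in Google Drive.

```python
import os
import pickle as pkl
import tempfile

# small stand-in object (the notebook pickles a pandas dataframe instead)
original = {'horiz accel': [0.552, 0.501], 'vert accel': [-0.146, -0.48]}

# hypothetical temporary path, used only for this example
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.pkz')

with open(tmp_path, 'wb') as f:   # 'wb' = write binary
    pkl.dump(original, f)

with open(tmp_path, 'rb') as f:   # 'rb' = read binary
    restored = pkl.load(f)

print(f.closed)               # True, the with block closed the file
print(restored == original)   # True
```

Note that pickle.load() can run arbitrary code while rebuilding an object, so it should only be used on pickle files that come from a trusted source.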

In [None]:
'''
Range of values = values present between a lower limit (included) and an upper limit (excluded).

with def keyword, created(= defined) a function called df_row_ind_to_data_range() function and passed an argument called ind to the function

DATA_POINTS_PER_FILE = 2560. If ind = 0, then df_row_ind_to_data_range(ind) returns the tuple (0, 2560), i.e. rows 0 to 2559. If ind = 1, then
it returns (2560, 5120), i.e. rows 2560 to 5119, and so on.
'''
def df_row_ind_to_data_range(ind):  #get range of values from data frame given file index
    return (DATA_POINTS_PER_FILE*ind, DATA_POINTS_PER_FILE*(ind+1))
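
A quick way to check the helper is to slice a plain list of row indices with the returned tuple. The list `rows` below is a stand-in for the rows of three data files and is not part of the notebook.

```python
DATA_POINTS_PER_FILE = 2560


def df_row_ind_to_data_range(ind):
    # same helper as above, repeated so that this example runs on its own
    return (DATA_POINTS_PER_FILE * ind, DATA_POINTS_PER_FILE * (ind + 1))


rows = list(range(3 * DATA_POINTS_PER_FILE))   # stand-in for 3 files of row indices

start, stop = df_row_ind_to_data_range(1)
chunk = rows[start:stop]                       # upper limit is excluded
print(start, stop, len(chunk), chunk[0], chunk[-1])   # 2560 5120 2560 2560 5119
```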

In [None]:
'''
basically, here we are doing signal processing and data normalization. Feature extraction is one part of signal processing, and signal processing
is one of the sub-concepts of data preprocessing. (Signal Processing = the analysis, interpretation and manipulation of signals. Processing
of signals includes storage, separation of data from noise (= unwanted random variations in the signal), and feature extraction (2D image features
extraction or time-frequency features extraction).)

using def keyword, created (= defined) extract_feature_image() function with ind and feature_name as parameters (function variables or arguments).
'horiz accel' is the default value of the feature_name parameter, used when no feature_name is passed

Here, data_range is a variable and we assigned the return value of df_row_ind_to_data_range(ind) to it. In other words, we call
the function df_row_ind_to_data_range(ind) and store the tuple (start, stop) that it returns in the data_range variable.

data = here, data is a variable in which we stored the values of the selected feature column (horizontal or vertical acceleration) for the rows in
data_range, e.g. for ind = 0, rows 0 (included) to 2560 (excluded) (0 to 2559).

data = here, data is a variable. np.array() = https://www.youtube.com/watch?v=NYPKbmE0H6E , np.mean() = https://www.youtube.com/watch?v=hSxslgMFQys ,
       range() = https://www.w3schools.com/python/ref_func_range.asp , here, in data variable we stored the average of the 1st 20 data points (0 to 19), then
       the average of the 2nd 20 data points (20 to 39), then the average of the 3rd 20 data points (40 to 59) and so on. Since we assigned
       WIN_SIZE = 20, we take 20 data points at a time. WIN_SIZE is the fixed size window. This reduces the 2560 data points of one file to 128 values

CWT => https://www.youtube.com/watch?v=F_QvT_8kOfc , https://www.weisang.com/en/documentation/timefreqspectrumalgorithmscwt_en/ 
(CWT = The Continuous Wavelet Transform (CWT) is used to decompose(= break up, separate) a signal into wavelets. The CWT is used to construct a time-frequency 
representation of a signal that offers very good time and frequency localization which helps in understanding more insights about time domain and frequency
domain. The CWT is an excellent tool for mapping the changing properties of non-stationary signals. The CWT is also an ideal tool for determining whether or not 
a signal is stationary in a global sense. )
(A wavelet is a wave-like oscillation with an amplitude (= magnitude, maximum displacement) that begins at zero, increases, and then decreases back to zero.
(Oscillation = movement back and forth in a regular rhythm))
(Different types of CWT Wavelets - https://pywavelets.readthedocs.io/en/latest/ref/cwt.html#:~:text=pywt.,data%20%3A%20array_like ..... there are various types
of CWT wavelets and we are using the Morlet wavelet among them)
(1D data or 1D array = 1D array contains elements only in one dimension. In other words, the shape of the numpy array should contain only one value in the tuple.
                       To create a one dimensional array (1D array) in numpy, you can use either of the array(), arange() or linspace() numpy functions.)
(Here, coef = variable holding the CWT coefficients. _ = throwaway variable. pywt.cwt() returns two values, the coefficients and the frequencies. In Python, it is
conventional (= usual, traditional, normal) to use underscore ( _ ) as the name for a value we do not need, here the frequencies.)
(pywt.cwt(data, np.linspace(1,128,128), WAVELET_TYPE) - this call performs the continuous wavelet transform (CWT) on 1D data. np.linspace(1,128,128) gives the
128 scales 1, 2, ..., 128. With 128 scales and 128 data points, coef is a 2D array of shape (128, 128).)
(np.linspace() = https://www.youtube.com/watch?v=NYPKbmE0H6E )

here, coef = variable , np.log2() = https://www.geeksforgeeks.org/numpy-log2-python/ , ** operator = power operator in python (example => 2 ** 5 = 2 power 5
= 32). np.log2() calculates the base-2 logarithm of every element of the input array (example => np.log2(8) = 3). coef**2 converts the coefficients to
power, and 0.001 is added so that the logarithm is never taken of zero.

Data Normalization => https://towardsdatascience.com/data-normalization-in-machine-learning-395fdec69d02 
Normalize or normalization = https://www.educative.io/edpresso/data-normalization-in-python
Rescaling = resizing, adjusting (Data normalization means rescaling all the data of a set into a particular range (say 0 to 1))
(coef - coef.min())/(coef.max() - coef.min()) - this is min-max normalization. min() returns the minimum value of the array and max() returns the maximum
                                                value, so after normalization the smallest value becomes 0 and the largest value becomes 1

return coef - return keyword is used to exit (= come out from) a function and return a value. return coef means returns coef and exits extract_feature_image()
              function
                                          
'''
#perform continuous wavelet transform (CWT) on a 1D signal and return a 2D feature image (converting 1D vibration signals into 2D CWT feature images).
#2D CWT feature images contain both time domain and frequency domain information. They are computed separately for horiz accel and vert accel.
#Note: this function reads the global dataframe df, so df must be loaded before the function is called.
def extract_feature_image(ind, feature_name='horiz accel'):   
    data_range = df_row_ind_to_data_range(ind)
    data = df[feature_name].values[data_range[0]:data_range[1]]
    # use window to process(= prepare, develop) 1D signal
    data = np.array([np.mean(data[i:i+WIN_SIZE]) for i in range(0, DATA_POINTS_PER_FILE, WIN_SIZE)])
    # perform CWT on 1D data(= 1D array)
    coef, _ = pywt.cwt(data, np.linspace(1,128,128), WAVELET_TYPE)
    # convert to power (coef**2) and apply base-2 logarithm; 0.001 avoids log2(0)
    coef = np.log2(coef**2+0.001)
    # normalize coef
    coef = (coef - coef.min())/(coef.max() - coef.min()) 
    return coef
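
The window averaging and the normalization steps can be tried without any vibration data. In the sketch below the synthetic `signal` and the random matrix `coef` are stand-ins invented for this example. In particular, the random matrix only takes the place of the CWT output and is not a real wavelet transform.

```python
import numpy as np

DATA_POINTS_PER_FILE = 2560
WIN_SIZE = 20

# synthetic 1D signal standing in for one file of vibration data
rng = np.random.default_rng(0)
signal = rng.normal(size=DATA_POINTS_PER_FILE)

# window averaging as in extract_feature_image()
averaged_loop = np.array([np.mean(signal[i:i + WIN_SIZE])
                          for i in range(0, DATA_POINTS_PER_FILE, WIN_SIZE)])
# same result with a reshape: one row per window, then the mean of each row
averaged_fast = signal.reshape(-1, WIN_SIZE).mean(axis=1)
print(averaged_loop.shape, np.allclose(averaged_loop, averaged_fast))  # (128,) True

# random matrix standing in for the CWT coefficients (NOT a real CWT)
coef = rng.normal(size=(128, 128))
coef = np.log2(coef ** 2 + 0.001)                         # log of power
coef = (coef - coef.min()) / (coef.max() - coef.min())    # min-max normalization
print(coef.min(), coef.max())                             # 0.0 1.0
```

The reshape version gives the same averages as the list comprehension and works here because 2560 is an exact multiple of 20.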

#Directory or Folder where pickle files(.pkz files) were stored in the Google Drive

In [None]:
'''
created main_dir variable, and assigned(= allocate, allot, set) the path where 6 bearing training dataset pickle files were saved in the pc or in the google 
drive.
'''
main_dir = '/content/drive/MyDrive/Colab Notebooks/'

#Bearing1_1

In [None]:
'''
In pkz_file variable, we stored the full path of bearing1_1.pkz (i.e. main_dir + file name). bearing1_1.pkz => is a pickle file.

df = load_df(pkz_file) - calling the load_df(pkz_file) function which we created (= defined) previously. load_df(pkz_file) reads the pickle file and
                         returns a dataframe (the dataframe is not the pickle file itself, it contains the pickle file data. A dataframe looks like tabular
                         data i.e. data present in rows and columns)
                         (pickle file (.pkz file) - here, the pickle file holds the data of all the files present in a single directory or folder. For
                         example, Bearing1_1 is a training (learning) data directory which contains 2803 data files, and bearing1_1.pkz is a single
                         combined file which contains the data of all 2803 data files of Bearing1_1)

df.head() => returns the 1st 5 rows of the dataframe (loaded from bearing1_1.pkz). In a notebook, the returned rows are displayed as the cell output.

'''
pkz_file = main_dir + 'bearing1_1.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,9,39,39,65664.0,0.552,-0.146
1,9,39,39,65703.0,0.501,-0.48
2,9,39,39,65742.0,0.138,0.435
3,9,39,39,65781.0,-0.423,0.24
4,9,39,39,65820.0,-0.802,0.02


In [None]:
'''
created no_of_rows variable to store the total no. of rows present in the dataframe (i.e. the total no. of data points of all the files combined in the
pickle file).

df.shape[0] => returns the total no. of rows (total data points in bearing1_1) of the dataframe

created no_of_files variable to store the total no. of files combined in the pickle file (bearing1_1.pkz). It is the total no. of rows divided by
DATA_POINTS_PER_FILE (2560)

printing (= displaying) no. of rows (total no. of data points) and no. of files (total no. of files) present in bearing1_1.pkz file.
'''
no_of_rows = df.shape[0]
no_of_files = int(no_of_rows / DATA_POINTS_PER_FILE)
print(no_of_rows, no_of_files)

7175680 2803
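
The file count only makes sense if the number of rows is an exact multiple of 2560. The small check below uses the row counts printed in this notebook and divmod() to confirm that no partial file is left over. The dictionary names are introduced for this example only.

```python
DATA_POINTS_PER_FILE = 2560

# row counts printed in this notebook for four of the bearings
rows_per_bearing = {
    'bearing1_1': 7175680,
    'bearing1_2': 2229760,
    'bearing2_1': 2332160,
    'bearing2_2': 2040320,
}

files_per_bearing = {}
for name, no_of_rows in rows_per_bearing.items():
    no_of_files, leftover = divmod(no_of_rows, DATA_POINTS_PER_FILE)
    assert leftover == 0, f'{name}: last file is incomplete'
    files_per_bearing[name] = no_of_files

print(files_per_bearing)
# {'bearing1_1': 2803, 'bearing1_2': 871, 'bearing2_1': 911, 'bearing2_2': 797}
```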


#extracting 2D feature images for each data file in Bearing1_1 and converting into numpy array

#storing the probability of failure

In [None]:
'''
created dictionary named data with keys x and y. x key referred to an empty list. y key also referred to an empty list

extracted(= derived, calculated, taken out)horiz accel features from file 0 to file 2802(2803 total files) and stored in coef_h
extracted(= derived, calculated, taken out)vert accel features from file 0 to file 2802(2803 total files) and stored in coef_v

created x_ variable to store a numpy array made of the horiz accel feature image and the vert accel feature image of one file. x_ has shape (2, 128, 128)
and is created for every file from file 0 to file 2802 (2803 total files)
created y_ variable to store the probability (= chance, likelihood, possibility) of failure. A probability is a number between
0 and 1. A probability of 0 means that the event will not happen. Here, the event is the failure. Hence if y_ is 0, there is
no machine failure and the machine is working properly. If y_ is 1, there is machine failure (complete machine breakdown). Since our data
is run-to-failure data, first there is no failure at all (i.e. 0) and eventually the machine fails over time (i.e. 1). According to our code,
0 => means no failure, 1 => means complete failure, values in between => increase linearly with the file index. y_ stores the values like this =>
0/2802, 1/2802, 2/2802,--------2802/2802

using append method added x_ variable values to the data dictionary with x key (i.e. data['x']). In data dictionary, x key represents an empty list
therefore values of x_ variable added at the end of x empty list.
using append method added y_ variable values to the end of data dictionary with y key (i.e. data['y']). In data dictionary, y key represents an empty list
therefore values of y_ variable added at the end of y empty list.

after appending, we converted the list data['x'] into a numpy array and stored it back in data['x'], and did the same for data['y']. Therefore, data['x']
becomes a numpy array and data['y'] becomes a numpy array.
array = collection of similar types of values

with assert keyword, if the condition is true, then nothing happens and the program continues. If the condition is false, then an AssertionError is raised
and the program stops. assert is used here as a sanity check on the shape of data['x'].
shape => https://www.w3schools.com/python/numpy/numpy_array_shape.asp (The shape of an array is the number of elements in each dimension.)

Here, no_of_files = 2803 (bearing1_1 contains total 2803 files), 2 = no. of channels (one for horiz accel, one for vert accel)
128, 128 = size of each feature image (i.e. 128 x 128 pixels) Pixel => short form for picture element
since the above condition is true, we printed no_of_files, shape of data['x'], and shape of data['y']
data['x'].shape gives = (no. of files, no. of channels, image height, image width) i.e. (2803, 2, 128, 128) as an output.
data['y'].shape gives = no. of failure probability values, i.e. (2803,) as an output.
shape in python displays output in tuple format.
data['x'] array contains 4 dimensions, and data['y'] array contains 1 dimension.
128 x 128 pixels => each feature image is a grid of 128 rows and 128 columns. Each pixel holds a normalized CWT value between 0 and 1
(a decimal number, not just 0 or 1). The 128 rows are the 128 wavelet scales and the 128 columns are the 128 window-averaged time steps

'''
data = {'x': [], 'y': []}
for i in range(0, no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])
    y_ = i/(no_of_files-1)
    data['x'].append(x_)
    data['y'].append(y_)
data['x']=np.array(data['x'])
data['y']=np.array(data['y'])

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

2803 (2803, 2, 128, 128) (2803,)
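
The labels produced by the loop are evenly spaced values from 0 to 1, so the same array can be obtained with np.linspace(). The sketch below checks this and also estimates how much memory the array data['x'] takes. The float64 element size is an assumption about the data type returned by the feature extraction.

```python
import numpy as np

no_of_files = 2803

# labels as computed in the loop above
y_loop = np.array([i / (no_of_files - 1) for i in range(no_of_files)])
# equivalent one-liner
y_linspace = np.linspace(0.0, 1.0, no_of_files)
print(np.allclose(y_loop, y_linspace))   # True

# size of data['x'] with shape (2803, 2, 128, 128)
n_values = no_of_files * 2 * 128 * 128
print(n_values * 8 / 1e6, 'MB as float64 (assumed dtype)')
print(n_values * 4 / 1e6, 'MB as float32')
```

Under that assumption the Bearing1_1 array alone takes roughly 735 MB, and converting it with astype(np.float32) before saving would halve that at the cost of some precision.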


#Saving as pickle files (.pkz files)

In [None]:
'''
bearing1_1_all_data.pkz - means bearing1_1_train_data.pkz + bearing1_1_val_data.pkz

The pickle (.pkz) file is created with the pickle dump() method and is read back with the pickle load() method.
We imported the pickle module as pkl in the code. Therefore pkl.dump() is used to create the pickle (.pkz) file and pkl.load() is used to load (= read,
deserialize) the pickle file.
pickle.dump() function => is used to store the object data (dump information) to the file. pickle.dump() function takes 3 arguments. The first argument is the
object that you want to store. The second argument is the file object you get by opening the desired file in write-binary (wb) mode. The third argument
(the pickle protocol) is optional.

created out_file variable to store the path of the output file. All the 2D CWT feature data extracted from bearing1_1.pkz is saved in this file.
file name => bearing1_1_all_data.pkz
main_dir variable => path where 6 bearing training dataset pickle files were saved in the pc or in the google drive. now, bearing1_1_all_data.pkz is also saved
                     in the same path

with keyword => automatically closes the file. Whenever we open a file with the open() function, the operating system allocates some resources to that
file. We should call close() to release those resources, otherwise data may not be written completely or too many files may stay open. It is easy to
forget close(), and the file would also stay open if an error occurs before close() is reached. So it is better to use the "with" keyword along with
open(), because "with" closes the file automatically when the block ends, even if an error occurs inside the block.

The open() function opens a file in text format by default. To open a file in binary format, add 'b' to the mode parameter. Hence the "wb" mode opens the file
in binary format for writing. Unlike text files, binary files are not human-readable.

as = In a with statement, the as keyword binds the object returned by open() to a name. In the below code, open(out_file, 'wb') returns a file object and
we name it f. f is the opened file, not the file path.

open() creates the file object "f", pkl.dump(data, f) writes (= stores) the data dictionary into "f", and the file is saved as bearing1_1_all_data.pkz in
the selected path (google drive)
'''
out_file = main_dir+'bearing1_1_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)
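
Since the saved file holds the training and validation data together, it has to be split later using VAL_SPLIT. How the split is actually done is not shown in this part of the notebook, so the following is only one possible sketch. The random shuffle, the fixed seed, the small stand-in arrays and the names `train` and `val` are all assumptions made for this example.

```python
import numpy as np

VAL_SPLIT = 0.1

# small stand-in for the saved dictionary (real images are 2 x 128 x 128)
n = 50
data = {'x': np.zeros((n, 2, 4, 4), dtype=np.float32),
        'y': np.linspace(0.0, 1.0, n)}

rng = np.random.default_rng(42)          # fixed seed so the split is repeatable
indices = rng.permutation(n)
n_val = int(n * VAL_SPLIT)               # 10% of the samples
val_idx, train_idx = indices[:n_val], indices[n_val:]

train = {'x': data['x'][train_idx], 'y': data['y'][train_idx]}
val = {'x': data['x'][val_idx], 'y': data['y'][val_idx]}
print(train['x'].shape, val['x'].shape)  # (45, 2, 4, 4) (5, 2, 4, 4)
```

For sequence models it may be preferable to split whole sequences rather than single images, so that the images inside one sequence stay in their original order.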

#Bearing1_2

implementation is same as Bearing1_1

(1) Extracting 2D feature images (both horiz accel feature images and vert accel feature images) from bearing1_2.pkz and converting into numpy array

(2) storing the failure probability value of each data file in a numpy array

(3) Saving as pickle files (.pkz files)

In [None]:
pkz_file = main_dir + 'bearing1_2.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,8,47,5,196910.0,0.05,-0.253
1,8,47,5,196950.0,0.165,-0.14
2,8,47,5,196990.0,0.125,0.542
3,8,47,5,197030.0,0.157,-0.261
4,8,47,5,197070.0,0.421,0.081


In [None]:
no_of_rows = df.shape[0]
no_of_files = int(no_of_rows / DATA_POINTS_PER_FILE)
print(no_of_rows, no_of_files)

2229760 871


In [None]:
data = {'x': [], 'y': []}
for i in range(0, no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])
    y_ = i/(no_of_files-1)
    data['x'].append(x_)
    data['y'].append(y_)
data['x']=np.array(data['x'])
data['y']=np.array(data['y'])

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

871 (871, 2, 128, 128) (871,)


In [None]:
out_file = main_dir+'bearing1_2_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)
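
Because the same cells are repeated for every bearing, the feature extraction loop could also be written once as a function. The sketch below is one possible way to do that. The names `build_dataset` and `fake_extract` are hypothetical, and `fake_extract` only stands in for extract_feature_image() so that the example runs without the bearing data.

```python
import numpy as np


def build_dataset(no_of_files, extract):
    """Build the x/y dictionary for one bearing.

    extract(i, feature_name) must return one 2D feature image for file i.
    """
    xs, ys = [], []
    for i in range(no_of_files):
        coef_h = extract(i, 'horiz accel')
        coef_v = extract(i, 'vert accel')
        xs.append(np.array([coef_h, coef_v]))   # shape (2, H, W)
        ys.append(i / (no_of_files - 1))        # 0 for first file, 1 for last
    return {'x': np.array(xs), 'y': np.array(ys)}


def fake_extract(i, feature_name):
    # stand-in for extract_feature_image(); returns a constant 128 x 128 image
    return np.full((128, 128), float(i))


data = build_dataset(4, fake_extract)
print(data['x'].shape, data['y'].shape)   # (4, 2, 128, 128) (4,)
```

Passing the extraction function as an argument also avoids relying on a global dataframe that has to be reloaded before each bearing is processed.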

#Bearing2_1

implementation is same as Bearing1_1, Bearing1_2

(1) Extracting 2D feature images (both horiz accel feature images and vert accel feature images) from bearing2_1.pkz and converting into numpy array

(2) storing the failure probability value of each data file in a numpy array

(3) Saving as pickle files (.pkz files)



In [None]:
pkz_file = main_dir + 'bearing2_1.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,8,14,15,884410.0,-0.391,0.011
1,8,14,15,884450.0,0.292,0.133
2,8,14,15,884490.0,0.596,0.024
3,8,14,15,884530.0,0.23,0.272
4,8,14,15,884570.0,-0.225,0.272


In [None]:
no_of_rows = df.shape[0]
no_of_files = int(no_of_rows / DATA_POINTS_PER_FILE)
print(no_of_rows, no_of_files)

2332160 911


In [None]:
data = {'x': [], 'y': []}
for i in range(0, no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])
    y_ = i/(no_of_files-1)
    data['x'].append(x_)
    data['y'].append(y_)
data['x']=np.array(data['x'])
data['y']=np.array(data['y'])

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

911 (911, 2, 128, 128) (911,)


In [None]:
out_file = main_dir+'bearing2_1_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)
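
A saved file can be checked by loading it back. The following is a minimal round-trip sketch, assuming pkl is the standard pickle module. It uses a small dummy dictionary and a temporary directory instead of main_dir, and the file name example_all_data.pkz is made up for illustration. Note that pkl.dump writes a plain, uncompressed pickle whatever the file extension is.

```python
import os
import pickle as pkl
import tempfile

import numpy as np

# Small dummy data with the same layout as the notebook's dictionary
data = {'x': np.zeros((3, 2, 128, 128), dtype=np.float32),
        'y': np.linspace(0.0, 1.0, 3)}

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, 'example_all_data.pkz')  # hypothetical file name
    with open(out_file, 'wb') as f:
        pkl.dump(data, f)
    with open(out_file, 'rb') as f:
        loaded = pkl.load(f)

print(loaded['x'].shape, loaded['y'].shape)
```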

#Bearing2_2

The implementation is the same as for Bearing1_1, Bearing1_2, and Bearing2_1.

(1) Extract the 2D feature images (one horiz accel image and one vert accel image per file) from bearing2_2.pkz and stack them into a NumPy array.

(2) Store the failure (fault) probability value of each file, y = i/(no_of_files-1), in a NumPy array. The label rises linearly from 0 for the first file to 1 for the last.

(3) Save the result as a pickle file (bearing2_2_all_data.pkz).

In [None]:
pkz_file = main_dir + 'bearing2_2.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,7,40,33,540660.0,0.038,0.29
1,7,40,33,540700.0,0.125,-0.104
2,7,40,33,540740.0,0.035,-0.314
3,7,40,33,540780.0,-0.092,0.2
4,7,40,33,540820.0,0.033,0.211


In [None]:
no_of_rows = df.shape[0]
no_of_files = no_of_rows // DATA_POINTS_PER_FILE  # integer division: whole files only
print(no_of_rows, no_of_files)

2040320 797


In [None]:
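# Build one sample per file: x is the 2-channel feature image, y is the failure-probability label
data = {'x': [], 'y': []}
for i in range(no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])  # shape (2, 128, 128): horizontal and vertical channels
    y_ = i / (no_of_files - 1)       # linear label: 0 for the first file, 1 for the last
    data['x'].append(x_)
    data['y'].append(y_)
# Convert the per-file lists to arrays
data['x'] = np.array(data['x'])
data['y'] = np.array(data['y'])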

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

797 (797, 2, 128, 128) (797,)


In [None]:
out_file = main_dir+'bearing2_2_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)

#Bearing3_1

The implementation is the same as for Bearing1_1, Bearing1_2, Bearing2_1, and Bearing2_2.

(1) Extract the 2D feature images (one horiz accel image and one vert accel image per file) from bearing3_1.pkz and stack them into a NumPy array.

(2) Store the failure (fault) probability value of each file, y = i/(no_of_files-1), in a NumPy array. The label rises linearly from 0 for the first file to 1 for the last.

(3) Save the result as a pickle file (bearing3_1_all_data.pkz).

In [None]:
pkz_file = main_dir + 'bearing3_1.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,9,10,39,118790.0,0.338,-0.263
1,9,10,39,118830.0,0.278,0.285
2,9,10,39,118870.0,0.143,0.59
3,9,10,39,118910.0,0.09,-0.193
4,9,10,39,118940.0,0.035,-0.109


In [None]:
no_of_rows = df.shape[0]
no_of_files = no_of_rows // DATA_POINTS_PER_FILE  # integer division: whole files only
print(no_of_rows, no_of_files)

1318400 515


In [None]:
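# Build one sample per file: x is the 2-channel feature image, y is the failure-probability label
data = {'x': [], 'y': []}
for i in range(no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])  # shape (2, 128, 128): horizontal and vertical channels
    y_ = i / (no_of_files - 1)       # linear label: 0 for the first file, 1 for the last
    data['x'].append(x_)
    data['y'].append(y_)
# Convert the per-file lists to arrays
data['x'] = np.array(data['x'])
data['y'] = np.array(data['y'])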

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

515 (515, 2, 128, 128) (515,)


In [None]:
out_file = main_dir+'bearing3_1_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)

#Bearing3_2

The implementation is the same as for Bearing1_1, Bearing1_2, Bearing2_1, Bearing2_2, and Bearing3_1.

(1) Extract the 2D feature images (one horiz accel image and one vert accel image per file) from bearing3_2.pkz and stack them into a NumPy array.

(2) Store the failure (fault) probability value of each file, y = i/(no_of_files-1), in a NumPy array. The label rises linearly from 0 for the first file to 1 for the last.

(3) Save the result as a pickle file (bearing3_2_all_data.pkz).

In [None]:
pkz_file = main_dir + 'bearing3_2.pkz'
df = load_df(pkz_file)
df.head()

Unnamed: 0,hour,minute,second,microsecond,horiz accel,vert accel
0,8,34,41,978160.0,-0.291,0.181
1,8,34,41,978200.0,0.146,0.185
2,8,34,41,978240.0,0.404,-0.159
3,8,34,41,978280.0,0.191,-0.179
4,8,34,41,978320.0,-0.18,0.072


In [None]:
no_of_rows = df.shape[0]
no_of_files = no_of_rows // DATA_POINTS_PER_FILE  # integer division: whole files only
print(no_of_rows, no_of_files)

4190720 1637


In [None]:
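# Build one sample per file: x is the 2-channel feature image, y is the failure-probability label
data = {'x': [], 'y': []}
for i in range(no_of_files):
    coef_h = extract_feature_image(i, feature_name='horiz accel')
    coef_v = extract_feature_image(i, feature_name='vert accel')
    x_ = np.array([coef_h, coef_v])  # shape (2, 128, 128): horizontal and vertical channels
    y_ = i / (no_of_files - 1)       # linear label: 0 for the first file, 1 for the last
    data['x'].append(x_)
    data['y'].append(y_)
# Convert the per-file lists to arrays
data['x'] = np.array(data['x'])
data['y'] = np.array(data['y'])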

assert data['x'].shape==(no_of_files, 2, 128, 128)
print(no_of_files, data['x'].shape, data['y'].shape)

1637 (1637, 2, 128, 128) (1637,)
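
A rough size estimate for the arrays built in this section helps to judge memory use. The sketch below assumes the feature images are stored as float64. The dtype returned by extract_feature_image is not shown here, so that is an assumption, and the real footprint halves if the images are float32. The file counts are the ones printed in this section.

```python
import numpy as np

# File counts printed in this section
files = {'bearing1_2': 871, 'bearing2_1': 911, 'bearing2_2': 797,
         'bearing3_1': 515, 'bearing3_2': 1637}

# One sample is a (2, 128, 128) array; float64 is an assumption
bytes_per_file = 2 * 128 * 128 * np.dtype(np.float64).itemsize

sizes_mib = {name: n * bytes_per_file / 2**20 for name, n in files.items()}
for name, mib in sizes_mib.items():
    print(f'{name}: {mib:.2f} MiB')

total_mib = sum(sizes_mib.values())
print(f'total: {total_mib:.2f} MiB')  # total: 1182.75 MiB
```

Each file contributes 0.25 MiB under this assumption, so the largest bearing here (bearing3_2, 1637 files) takes about 409 MiB before any sequences are formed.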


In [None]:
out_file = main_dir+'bearing3_2_all_data.pkz'
with open(out_file, 'wb') as f:
    pkl.dump(data, f)