PyTorch's Dataset and DataLoader classes make it easy to access data for machine learning and deep learning models. A lot of the effort in solving any machine learning problem goes into preparing (preprocessing) the data. PyTorch provides many tools that make data loading easy (and, in turn, make it easy to feed the data into the model) and that make the code more readable.

#Mounting Google Drive

In [None]:
'''
Mounting => Before a computer can use a storage device (such as a hard drive or Google Drive), the user or the operating system must make it
accessible through the computer's file system. This process is called mounting; only files on mounted media can be accessed.
Mounting a file system attaches it to a directory (the mount point) and makes it available to the system, for reading, writing, or both.
Here, drive.mount('/content/drive') attaches Google Drive to the directory /content/drive, so the notebook can access all the files stored in the
drive through that path.
'''
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#Importing the Required Libraries

In [None]:
'''
Python library = a collection of modules (Python files) containing reusable code that we can include in our programs or projects to make the
implementation easier and faster.

The os module provides functions for interacting with the operating system, such as creating and removing a directory (folder), listing its
contents, and identifying or changing the current working directory. In short, the os module lets source code communicate with the operating system.

pickle is used to serialize and deserialize Python object structures. Serializing (pickling) converts a Python object into a byte stream that can
be stored in a file or database; deserializing (unpickling) rebuilds the object from that byte stream. pickle.dump() writes an object to an open
file and so creates the pickle file; pickle.load() reads the object back from an open pickle file. In this notebook the pickle files are saved with
a .pkz extension, which gives fast retrieval of the already prepared data.

PyTorch is an open-source library for machine learning and deep learning. Its core package is torch, which provides a flexible N-dimensional array,
the Tensor. A Tensor supports basic routines for indexing, slicing, transposing, type casting, resizing, sharing storage, and cloning. It also
supports mathematical operations such as max, min, and sum, sampling from statistical distributions such as uniform, normal, and multinomial, and
BLAS operations such as the dot product, matrix-vector multiplication, and matrix-matrix multiplication. The torch package also provides convenience
functions, for example for serialization, that are used throughout its packages.

Code for processing data samples can get messy and hard to maintain; ideally, dataset code is decoupled (separated) from model training code for
better readability and modularity. PyTorch provides two data primitives, torch.utils.data.Dataset and torch.utils.data.DataLoader, that allow us to
use pre-loaded datasets as well as our own data.

torch.utils.data.Dataset is an abstract class representing a dataset; it stores the samples and their corresponding labels. A custom dataset should
inherit from Dataset and implement the following methods: (1) __init__, (2) __len__, (3) __getitem__. __init__ may also accept an optional
transform argument.

torch.utils.data.DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

numpy is a Python library for working with numerical data. Its central object is the numpy array, which supports large multi-dimensional arrays
and matrices. numpy is used for arithmetic, statistical, and bitwise operations, copying and viewing arrays, stacking, matrix operations and linear
algebra, searching, sorting, and counting.

The as keyword creates an alias (another name). For example, import pickle as pkl lets us refer to the pickle module as pkl.
'''
import os
import pickle as pkl
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np

#Helper Functions
Helper Function = A helper function is a function that performs part of the computation of another function. Helper functions make programs easier to read by giving names to computations, and they let us reuse those computations, just as functions do in general. We write a helper function when the same functionality is needed in multiple places in the code: instead of writing that functionality out each time, we put it in one helper function and call it wherever it is required.
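
As a minimal sketch of the idea (the function names and the 0 to 10 range are hypothetical and are not part of this notebook), one small helper can be reused by two other functions instead of repeating the same logic in both:

```python
def clamp(value, low, high):
    """Helper: restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def normalise_reading(raw, low=0.0, high=10.0):
    """Scale a raw reading to [0, 1], clamping out-of-range values first."""
    clamped = clamp(raw, low, high)  # reuse the helper
    return (clamped - low) / (high - low)


def describe_reading(raw, low=0.0, high=10.0):
    """Build a short text description of a reading."""
    clamped = clamp(raw, low, high)  # same helper, second caller
    return f"{clamped:.1f} (valid range {low:.1f} to {high:.1f})"


print(normalise_reading(5.0))
print(describe_reading(-3.0))
```

If the clamping rule ever changes, only clamp() has to be edited.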

#Loading Data from Pickle file

In [None]:
'''
#Load Data from Pickle File

The def keyword is used to define (create) a function or method in Python.

Using def, we define the function load_data_from_pfile with one parameter, file_path. An argument is the value passed to a function when it is called.
Here file_path is the full path of one pickle file, i.e. the directory /content/drive/MyDrive/Colab Notebooks/ followed by the file name.

with keyword => closes the file automatically when the block ends, even if an error is raised inside the block. Opening a file with open() acquires
operating system resources (a file handle), and those resources should be released by calling close(). If we forget to call close(), file handles
can leak and written data may not be flushed to disk, and such problems are hard to track down because the rest of the code looks correct. Using
the with keyword together with open() avoids this, because the file is always closed when the block finishes.

The open() function opens a file in text format by default. To open a file in binary format, add 'b' to the mode parameter. Hence the "rb" mode
opens the file in binary format for reading, while the "wb" mode opens the file in binary format for writing. (Note: the two most common modes are
r = read mode and w = write mode.) Unlike text files, binary files are not human-readable.

as = In a with statement, the as keyword binds the object returned by open() to a name. Here that name is pfile, so pfile is the open file object
(not the path), and the contents of the file at file_path are read through pfile.

A pickle (.pkz) file is created with pickle's dump() function and read back with pickle's load() function. We imported the pickle module as pkl,
so pkl.dump() writes a pickle file and pkl.load() reads one. pkl.load(pfile) reads the byte stream from pfile, which was opened in 'rb'
(read binary) mode, and rebuilds the stored Python object. The result is stored in the sample_data variable.

return keyword = exits a function and returns a value to the caller. return sample_data => exits the function and returns the value of
                 sample_data.

'''
def load_data_from_pfile(file_path):
    with open(file_path, 'rb') as pfile:
        sample_data = pkl.load(pfile)
    return sample_data
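
The round trip described above can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the real files: the dictionary contents, the small array sizes, and the file name demo_train_data.pkz are made up, and the data is written to a temporary directory rather than to Google Drive.

```python
import os
import pickle as pkl
import tempfile

import numpy as np


def load_data_from_pfile(file_path):
    # Same helper as above: open in binary read mode and unpickle.
    with open(file_path, 'rb') as pfile:
        return pkl.load(pfile)


# Hypothetical stand-in for one bearing file: 3 samples, 2 channels, 4 x 4 values.
fake_data = {'x': np.zeros((3, 2, 4, 4), dtype=np.float32),
             'y': np.array([0.1, 0.5, 0.9])}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_train_data.pkz')
    with open(demo_path, 'wb') as pfile:   # binary write mode
        pkl.dump(fake_data, pfile)         # serialize the dictionary
    file_was_closed = pfile.closed         # the with block closed the file
    restored = load_data_from_pfile(demo_path)

print(file_was_closed, restored['x'].shape, restored['y'].shape)
```

Note that pickle.load() can execute arbitrary code while unpickling, so it should only be used on files from a trusted source.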

#Pickle(.pkz) Files Path

In [None]:
'''
pkzfiles_path holds the path of the directory where the pickle (.pkz) files are saved (here, a Google Drive folder).
'''
pkzfiles_path = '/content/drive/MyDrive/Colab Notebooks/'

In [None]:
'''
This code cell is for illustration only.

sample_file = full path of an example file, built by joining pkzfiles_path and the file name bearing1_1_train_data.pkz

sample_data = the value returned by load_data_from_pfile(sample_file). An argument is the value passed to a function, inside the parentheses of
              the call. load_data_from_pfile() is the function we defined previously; it opens a pickle file and returns its contents. Here we
              pass sample_file as the argument, so the function reads bearing1_1_train_data.pkz and returns the data stored in it, a dictionary
              with the keys 'x' and 'y'.

The cell prints sample_data['x'].shape and sample_data['y'].shape.
shape => https://www.w3schools.com/python/numpy/numpy_array_shape.asp
The shape of a numpy array is a tuple giving the number of elements along each dimension. sample_data['x'].shape is (2523, 2, 128, 128): 2523
samples, 2 feature images per sample (the 1st is the horiz accel feature image, the 2nd is the vert accel feature image), and 128 x 128 values per
feature image. sample_data['y'].shape is (2523,): one failure probability per sample.
The sample_data['x'] array has 4 dimensions, and the sample_data['y'] array has 1 dimension.
128 x 128 => each feature image is a grid of 128 rows and 128 columns, with one numeric value per pixel.
'''
sample_file = pkzfiles_path + 'bearing1_1_train_data.pkz'
sample_data = load_data_from_pfile(sample_file)
print(sample_data['x'].shape, sample_data['y'].shape)

(2523, 2, 128, 128) (2523,)
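
To make the meaning of each dimension concrete, here is a small sketch on synthetic arrays. The arrays are placeholders filled with zeros and evenly spaced numbers (only 5 samples instead of 2523); only the layout matches the real data.

```python
import numpy as np

# Synthetic stand-ins with the same layout as sample_data['x'] and sample_data['y'].
x = np.zeros((5, 2, 128, 128), dtype=np.float32)
y = np.linspace(0.0, 1.0, num=5)

# Unpack the shape tuple into named dimensions.
n_samples, n_channels, height, width = x.shape

print(x.shape, x.ndim)   # 4 dimensions
print(y.shape, y.ndim)   # 1 dimension
print(x[0, 0].shape)     # horiz accel feature image of the first sample
print(x[0, 1].shape)     # vert accel feature image of the first sample
```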


#Dataset class

(Converting numpy array (nd array) into tensor)

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

torch.utils.data is the data loading module of PyTorch.

torch.utils.data.Dataset is an abstract class representing a dataset. A custom dataset is a dataset class that we write ourselves to fit our own data and requirements (here, the PHMDataset class defined below). A custom dataset should inherit from Dataset and implement the following methods:

(1) `__len__` so that len(dataset) returns the size of the dataset

(2) `__getitem__` to support indexing, so that dataset[i] returns the ith sample
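
A minimal sketch of this protocol on a toy dataset, before the real one is defined. The class name SquaresDataset and its contents are hypothetical; only the structure (inherit from Dataset, implement `__len__` and `__getitem__`) mirrors what the notebook does.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i holds x = i and y = i squared."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # len(dataset) returns the number of samples
        return self.n

    def __getitem__(self, i):
        # dataset[i] returns the ith sample as a dictionary of tensors
        if not 0 <= i < self.n:
            raise IndexError(i)
        return {'x': torch.tensor([float(i)]), 'y': torch.tensor([float(i * i)])}


toy = SquaresDataset(5)
print(len(toy))
print(toy[3])
```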

In [None]:
'''
Python classes and objects => https://www.w3schools.com/python/python_classes.asp
A class is like an object constructor, or a blueprint for creating objects. Objects of the class are used to access the class's methods and
properties (properties = variables, arrays, lists, etc.).

Python Inheritance => https://www.w3schools.com/python/python_inheritance.asp
Inheritance allows us to define a class that inherits all the methods and properties of another class.

The PHMDataset class is created with the class keyword. To create a class that inherits the functionality (methods and properties) of another
class, name the parent class in parentheses when defining the child class. Here, Dataset is the parent class and PHMDataset is the child class.
A parent class is the class being inherited from; it is also called the base class or superclass. A child class is the class that inherits from
another class; it is also called the derived class or subclass. class PHMDataset(Dataset) => PHMDataset inherits the functionality (methods and
properties) of the Dataset class.

Using the def keyword, we define the __init__() method. __init__() assigns values to the object's properties and performs any other setup that is
needed when the object is created. It is called automatically every time the class is used to create a new object. Because the child class defines
its own __init__(), the child's __init__() overrides (replaces) the parent's __init__(), which is then no longer called automatically.
The self parameter is a reference to the current instance (object) of the class and is used to access the variables that belong to that instance.
(Reference in Python = a name that refers to an object in memory; a Python program accesses data values through references.)
pfiles is a parameter of __init__() that holds the list of pickle file paths to load; it defaults to an empty list.

self.data = {'x': [], 'y': []} => creates a dictionary named data on the current object (self) with the keys 'x' and 'y', each initially mapped to
an empty list. Because self refers to the current PHMDataset object, the dictionary can be accessed anywhere in the class as self.data.
There are 6 bearing training pickle files in total, so a for loop processes one pickle file at a time.
_data is an ordinary variable name; the single leading underscore is a convention that marks it as a temporary, internal value. _data stores the
value returned by load_data_from_pfile(pfile), the function we defined previously, i.e. the contents of one pickle file.
The append() method adds a single item to the end of an existing list. It does not return a new list; it modifies the original list, whose length
increases by one.
self.data['x'].append(_data['x']) => _data['x'] is the array of horiz accel and vert accel feature values of one pickle file; append() adds that
array as one item at the end of the list self.data['x'].
self.data['y'].append(_data['y']) => _data['y'] is the array of failure probability values of the same pickle file; append() adds that array as one
item at the end of the list self.data['y'].
After the loop, self.data['x'] and self.data['y'] are lists that hold one array per pickle file.

numpy.concatenate() => https://www.tutorialspoint.com/numpy/numpy_concatenate.htm
Concatenation means joining. The concatenate() function joins two or more arrays along a specified axis; the arrays must have the same shape except
along that axis. General syntax => numpy.concatenate((a1, a2, ...), axis), where a1, a2, ... is the sequence of arrays to join and axis is the axis
along which they are joined (default 0).
self.data['x'] = np.concatenate(self.data['x']) => joins the feature arrays of all 6 bearing training pickle files along axis 0 (the sample axis).
self.data['x'] is now a single array with the horiz accel and vert accel feature values of every sample from all 6 pickle files.
self.data['y'] = np.concatenate(self.data['y']) => joins the failure probability arrays of all 6 bearing training pickle files into a single 1-D
array with the failure probability of every sample.
[:, np.newaxis] adds a new axis (dimension) of length 1, so self.data['y'] changes from shape (N,) to shape (N, 1). Each self.data['y'][i] is then
an array of shape (1,) rather than a bare scalar, which is what torch.from_numpy() needs in __getitem__().

__len__() is defined with the def keyword and takes self as its only parameter, so it can access the properties of the current object. It returns
the size of the dataset.
return self.data['x'].shape[0] => shape is a tuple with the number of elements along each dimension, and shape[0] is the number of elements along
dimension 0. For self.data['x'], dimension 0 indexes the samples, so shape[0] is the total number of training samples across all the loaded pickle
files (not the number of files).

__getitem__() is defined with the def keyword and takes self and i as parameters. Arguments are the values passed to a function inside the
parentheses of the call. __getitem__() supports indexing, so that dataset[i] returns the ith sample. self gives access to the properties of the
current object, and i is the index of the requested sample.
the function torch.from_numpy() => converts a numpy array into a tensor
numpy array => https://www.javatpoint.com/numpy-array, 
tensor => https://www.javatpoint.com/pytorch-tensors#:~:text=A%20tensor%20is%20an%20n,computation%20with%20strong%20GPU%20acceleration. 
tensor = torch.from_numpy(numpy_array) => general syntax to create a tensor from a numpy array
sample is a dictionary with the keys 'x' and 'y'. sample['x'] is the tensor made from self.data['x'][i], i.e. the horiz accel and vert accel feature
values of the ith sample. sample['y'] is the tensor made from self.data['y'][i], i.e. the failure probability of the ith sample.
return sample => __getitem__() returns the sample dictionary and exits the function

'''
class PHMDataset(Dataset):
    '''
    PHM IEEE 2012 Data Challenge Training data set (6 different Mechanical Bearings data)
    '''
    def __init__(self, pfiles=[]):
        self.data = {'x': [], 'y': []}
        for pfile in pfiles:
            _data = load_data_from_pfile(pfile)
            self.data['x'].append(_data['x'])
            self.data['y'].append(_data['y'])
        self.data['x'] = np.concatenate(self.data['x'])
        self.data['y'] = np.concatenate(self.data['y'])[:,np.newaxis]

    def __len__(self):
        return self.data['x'].shape[0]
    
    def __getitem__(self, i):
        sample = {'x': torch.from_numpy(self.data['x'][i]), 'y': torch.from_numpy(self.data['y'][i])}
        return sample
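
The array handling inside `__init__` can be sketched on small synthetic arrays. The sizes below (two "files" with 3 and 5 samples and 4 x 4 images) are made up to keep the example small; the real files hold 128 x 128 images.

```python
import numpy as np

# Two hypothetical per-file arrays: 3 samples and 5 samples.
x_parts = [np.zeros((3, 2, 4, 4)), np.ones((5, 2, 4, 4))]
y_parts = [np.zeros(3), np.ones(5)]

# Join along axis 0 (the sample axis); the other dimensions must match.
x_all = np.concatenate(x_parts)
y_flat = np.concatenate(y_parts)

# Add a trailing axis of length 1: shape (8,) becomes (8, 1).
y_all = y_flat[:, np.newaxis]

print(x_all.shape, y_flat.shape, y_all.shape)
print(np.ndim(y_flat[0]), y_all[0].shape)  # scalar vs. array of shape (1,)
```

Indexing the flat array gives a zero-dimensional scalar, while indexing the reshaped array gives a one-element array, which can be converted with torch.from_numpy().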

(Storing the paths of the 6 bearing training pickle files in a single list (train_pfiles))

In [None]:
'''
pkzfiles_path holds the path of the directory where the pickle (.pkz) files are saved (here, a Google Drive folder).

Python lists => https://www.w3schools.com/python/python_lists.asp (lists are used to store multiple items in a single variable)
train_pfiles is a list holding the full paths of the 6 bearing training pickle files (bearing1_1_train_data.pkz, bearing1_2_train_data.pkz,
bearing2_1_train_data.pkz, bearing2_2_train_data.pkz, bearing3_1_train_data.pkz, bearing3_2_train_data.pkz)

'''
pkzfiles_path = '/content/drive/MyDrive/Colab Notebooks/'
train_pfiles = [pkzfiles_path+'bearing1_1_train_data.pkz', pkzfiles_path+'bearing1_2_train_data.pkz',
                pkzfiles_path+'bearing2_1_train_data.pkz', pkzfiles_path+'bearing2_2_train_data.pkz',
                pkzfiles_path+'bearing3_1_train_data.pkz', pkzfiles_path+'bearing3_2_train_data.pkz']

(Total length (= total number of samples in train_dataset))

In [None]:
'''
train_dataset is an object of the PHMDataset class, created by passing the train_pfiles list as the pfiles argument. PHMDataset loads and joins
the data of all the training pickle files, so every training sample can be accessed through train_dataset. len(train_dataset) calls __len__() and
gives the total number of samples.
'''
train_dataset = PHMDataset(pfiles=train_pfiles)
print(len(train_dataset))

6783


#DataLoader

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

The following features are important when loading data for training models:
1. Batching data => splitting the dataset into smaller chunks (batches) that are fed to the model one at a time.

2. Shuffling data => rearranging the order of the samples. Shuffling matters because it changes which samples end up together in a batch and in which order the batches arrive, so the model's updates do not depend on the samples always being seen in the same order.

3. (Optionally) loading the data in parallel using multiprocessing workers.

torch.utils.data.DataLoader is an iterable that provides all these features.

(An iterator in Python is an object used to step through an iterable such as a list, tuple, dictionary, or set. An iterator is obtained by calling iter() on the iterable, and each call to next() returns the next item.)

(An iterable is an object that can be iterated (looped) over, for example a list, tuple, dictionary, or set. Passing an iterable to iter() returns an iterator, which produces the items one at a time through its __next__() method.)

(every iterator is also an iterable, but not every iterable is an iterator)
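
The difference between the two terms can be sketched in plain Python. The list of readings is a made-up example; iter() and next() are the built-in functions mentioned above.

```python
readings = [0.2, 0.5, 0.9]      # a list is an iterable

it = iter(readings)             # iter() returns an iterator over the list
first = next(it)                # each next() call returns the next item
second = next(it)
rest = list(it)                 # consumes whatever is left
exhausted = next(it, None)      # nothing left: the default value is returned

# An iterator is also an iterable: iter() on it returns the same object.
is_own_iterator = iter(it) is it

# A list is not its own iterator: every iter() call starts a fresh pass.
fresh = list(iter(readings))

print(first, second, rest, exhausted, is_own_iterator, fresh)
```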

Note: if a plain for loop is used to iterate over the dataset directly, these features (batching, shuffling, parallel loading) are not available and would have to be written by hand.

A sample dataloader (train_dataloader) that loads a batch of 4 samples at a time from train_dataset, without multiprocessing (i.e. without worker processes loading batches in parallel):

In [None]:
'''
DataLoader wraps train_dataset in an iterable that yields batches of samples
train_dataset is our training dataset; it contains the samples of all the training pickle files
batch_size => number of samples in each batch
shuffle => whether the data is shuffled (reordered) or not. shuffle is a boolean value: shuffle=True shuffles the data, shuffle=False keeps the
original order
num_workers => number of worker subprocesses used for loading the data; 0 means the data is loaded in the main process
'''
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=0)
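
The effect of batch_size and shuffle can be sketched on a toy dataset of 10 numbered samples. The toy tensors and the use of TensorDataset are assumptions made to keep the example self-contained; the notebook itself uses PHMDataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples, one feature each, labelled 0..9.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

# Without shuffling, samples arrive in order; the last batch is smaller.
ordered = DataLoader(toy_dataset, batch_size=4, shuffle=False, num_workers=0)
batch_sizes = [len(yb) for _, yb in ordered]

# With shuffling, the order changes but every sample still appears once per pass.
shuffled = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)
seen = torch.cat([yb for _, yb in shuffled])

print(len(ordered), batch_sizes)
print(sorted(seen.tolist()))
```

Because 10 is not a multiple of 4, the final batch holds only 2 samples; passing drop_last=True to DataLoader would discard it instead.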

In [None]:
'''
Displays the tensor sizes of the first 5 batches
for loop => https://www.w3schools.com/python/python_for_loops.asp
A for loop is used for iterating over an iterable (for example a list, a tuple, a dictionary, a set, or a string)
i is the loop counter supplied by enumerate(); it is the index of the current batch
batch = the current batch: a dictionary whose 'x' and 'y' entries hold 4 samples of the training dataset stacked together
the enumerate() function takes an iterable as input and returns an enumerate object that yields (counter, item) pairs, where the counter increases
by 1 for each item.
syntax of enumerate() => enumerate(iterable, start=0)
parameters of enumerate() => enumerate() takes 2 parameters. (1) iterable => any object that supports iteration (for example lists, tuples,
dictionaries, sets, and here the DataLoader), (2) start (optional) => the value from which the counter starts. If start is omitted, the counter
starts at zero (0).
The loop prints the index i, the size of batch['x'], and the size of batch['y'].
size() => a tensor method that returns the shape of the tensor, i.e. the number of elements along each dimension
batch['x'].size() => torch.Size([4, 2, 128, 128]): 4 samples, 2 feature images per sample (the 1st is the horiz accel feature image, the 2nd is
                     the vert accel feature image), and 128 x 128 values per feature image
batch['x'] has 4 dimensions
batch['y'].size() => torch.Size([4, 1]): one failure probability for each of the 4 samples
batch size = 4, which means each batch contains 4 samples.
The loop stops when i == 4, so the output covers 5 batches (indices 0 to 4), i.e. 20 samples
break => the break statement stops the loop before it has gone through all the items

'''
for i, batch in enumerate(train_dataloader):
    print(i, batch['x'].size(), batch['y'].size())
    if i==4:
        break

0 torch.Size([4, 2, 128, 128]) torch.Size([4, 1])
1 torch.Size([4, 2, 128, 128]) torch.Size([4, 1])
2 torch.Size([4, 2, 128, 128]) torch.Size([4, 1])
3 torch.Size([4, 2, 128, 128]) torch.Size([4, 1])
4 torch.Size([4, 2, 128, 128]) torch.Size([4, 1])
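
The counting behaviour of enumerate() and break can be checked without PyTorch. In this sketch the list of batch names is a made-up stand-in for the DataLoader, and itertools.islice is shown only as an alternative way to take the first few items; the notebook does not use it.

```python
from itertools import islice

# Hypothetical stand-in for a dataloader that would yield 8 batches.
batches = [f"batch_{k}" for k in range(8)]

seen = []
for i, batch in enumerate(batches):   # counter starts at 0 by default
    seen.append((i, batch))
    if i == 4:                        # indices 0..4 means 5 batches
        break

# start=1 makes the counter begin at 1 instead of 0.
one_based = list(enumerate(batches[:3], start=1))

# Alternative: take the first 5 items without an explicit break.
first_five = list(islice(batches, 5))

print(len(seen), one_based, first_five)
```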
