# Feature Engineering and Model Optimization

### Created by @YankunQiu and @JunLuo
#### 22 Dec 2017 - 10 Jan 2018

## Purpose of this document

This document describes the data mining and machine learning process of the LanceGuard project, together with the findings that matter most for solving its problems. It mainly covers the work done from 22 Dec 2017 to 10 Jan 2018.

The document is structured into five sections:

1. Research

2. Implementation

3. Evaluation and Analysis

4. Next Steps

5. References

## 1. Research

### (1) Time-series features: 

One of the main ideas in this project is to transform an unfamiliar problem into a familiar one. Since motion detection of lances has not appeared in any research we are aware of, it is difficult to come up with a brand-new way of doing it. Luckily, inspired by smart wristbands, popular wearable devices such as Fitbit and Mi Band, the problem can be recast as activity classification. Mannini and Sabatini's paper describes experiments using different features for this problem, along with their results (Mannini, A. and Sabatini, A.M., 2010).
![Previous Results](https://image.ibb.co/ian2HG/previous_results.png)


The paper also shows the conceptual scheme of a generic classification system with supervised learning.

 
 ![Flow Chart](https://image.ibb.co/jUtpWb/flow_chart.png)


Wavelet coefficients (Sekine, M., Tamura, T., Togawa, T. and Fukui, Y., 2000) are used in several papers and are worth researching in the future.



### (2) DNN

Since we are going to use TensorFlow, the most popular Python deep-learning library, the research on DNNs in this period is mostly based on the official TensorFlow documentation. Its website gives the following introduction:

"TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well."

![TensorFlow](https://avatars0.githubusercontent.com/u/15658638?s=200&v=4)

The concept of the deep neural network has existed for decades. However, it has only become popular over the past three years because of the growth in computational power. The structure of a neural network can be very deep, but according to a Quora answer, most problems can be solved with one or two layers. The idea is to keep the model simple while still performing well.

Also, different DNN structures are effective on different kinds of data.

The structures we can try in this project are:

1. CNN (Convolutional Neural Network)
2. RNN (Recurrent Neural Network)
3. LSTM (Long Short-Term Memory)


### (3) RNN and LSTM

An RNN can learn the relationship between time t and the earlier steps t-1 and t-2. Its neurons feed their outputs back into the network at the next time step. See Siraj's video (https://www.youtube.com/watch?v=cdLUzrjnlr4). This tutorial introduces the TensorFlow version of an RNN (https://github.com/llSourcell/How-to-Use-Tensorflow-for-Time-Series-Live-/blob/master/demo_full_notes.ipynb).

An LSTM (Long Short-Term Memory) neural network can learn what to remember and what to forget in sequence data. This tutorial introduces the structure of an LSTM and how to implement it without a library (https://github.com/llSourcell/LSTM_Networks/blob/master/LSTM%20Demo.ipynb).
![LSTM Structure](https://camo.githubusercontent.com/284f12768a57940bbd21c5e9746e5d4bf6f22fea/68747470733a2f2f7777772e7265736561726368676174652e6e65742f70726f66696c652f4d6f6873656e5f46617979617a2f7075626c69636174696f6e2f3330363337373037322f6669677572652f666967322f41533a33393830383238343931363533313440313437313932313735353538302f4669672d322d416e2d6578616d706c652d6f662d612d62617369632d4c53544d2d63656c6c2d6c6566742d616e642d612d62617369632d524e4e2d63656c6c2d72696768742d4669677572652e70706d)
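
To make the "what to remember and what to forget" idea concrete, here is a minimal NumPy sketch of one forward step of a standard LSTM cell. It is not the code from the linked tutorial; the layer sizes, the random weights, and the names `sigmoid` and `lstm_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    W has shape (4 * hidden, input + hidden); b has shape (4 * hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:hidden])               # forget gate: what to drop from the cell state
    i = sigmoid(z[hidden:2 * hidden])      # input gate: what new information to store
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: what to expose as hidden state
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes: 3 inputs (x, y, z acceleration) and 4 hidden units.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = rng.normal(size=(10, input_size))  # 10 made-up time steps
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, b)
```

The forget gate `f` scales the old cell state and the input gate `i` scales the new candidate, which is how the cell decides what to keep across time steps. Training (backpropagation through time) is left out of this sketch.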


### (4) Feature Engineering 

### (5) Feature Transformation

### (6) Hyper-parameter Optimization

Hyper-parameter optimization is an essential and time-consuming task for almost any machine learning algorithm. Its aim is to find the hyper-parameters that give the model the best performance on test data.

There are three main methods for hyper-parameter optimization: exhaustive search, grid search and random search (Bergstra, J. and Bengio, Y., 2012).

![Random Search vs Grid Search](https://image.ibb.co/mX7mBb/random_search.png)

Grid search is similar to exhaustive search, only more coarse-grained. It treats every hyper-parameter as equally important.

Random search, in contrast, draws each trial's hyper-parameters independently at random. Bergstra and Bengio argue that usually only a few hyper-parameters matter much, and that random search covers more distinct values of those important ones than a grid with the same number of trials.

Another point is that when different trials have nearly optimal validation means, it is not clear which test score to report, and a slightly different choice of hyper-parameter could have yielded a different test error.
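
The difference between the two strategies can be sketched with the standard library alone. The example below is a toy illustration, not our tuning code: `validation_score` is a made-up objective in which hyper-parameter `a` matters far more than `b`, and the value ranges and trial counts are assumptions.

```python
import itertools
import random

def validation_score(params):
    """Toy stand-in for a validation score: 'a' matters much more than 'b'."""
    return -(params["a"] - 0.3) ** 2 - 0.01 * (params["b"] - 0.7) ** 2

def grid_search(values_a, values_b):
    """Evaluate every combination on a fixed grid."""
    trials = [{"a": a, "b": b} for a, b in itertools.product(values_a, values_b)]
    return max(trials, key=validation_score), len(trials)

def random_search(n_trials, seed=0):
    """Sample each hyper-parameter independently and uniformly from [0, 1]."""
    rng = random.Random(seed)
    trials = [{"a": rng.random(), "b": rng.random()} for _ in range(n_trials)]
    return max(trials, key=validation_score), len(trials)

# Both strategies get the same budget of 9 trials.
grid_best, grid_n = grid_search([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
rand_best, rand_n = random_search(n_trials=9)

print("grid best:", grid_best, "score:", validation_score(grid_best))
print("random best:", rand_best, "score:", validation_score(rand_best))
```

With 9 trials the grid only ever tries 3 distinct values of `a`, while random search tries 9. Which strategy wins on a given run still depends on the seed and the objective.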

## 2. Implementation


### (1) Algorithm Independent Tasks (Both)
#### a. Feature extraction: Time series(Jun)

For feature extraction, the main method we use is the "rolling window", one of the most popular ways of dealing with time-series data. The basic operation is to take a sequence of n consecutive data points and extract features from this window.

The Python library pandas makes this convenient.
![Pandas](https://pandas.pydata.org/_static/pandas_logo.png)
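
A minimal sketch of this step with pandas `rolling` might look like the following. The random input data, the window length of 100, and the helper name `add_rolling_features` are assumptions for illustration; the column names follow the feature list in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical raw accelerometer samples; real data would come from the device log.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x", "y", "z"])

WINDOW = 100  # assumed window length, in data points

def add_rolling_features(df, window):
    """Append rolling mean and standard deviation columns for each axis."""
    out = df.copy()
    rolled = df[["x", "y", "z"]].rolling(window=window)
    mean = rolled.mean()
    std = rolled.std()
    for axis in ["x", "y", "z"]:
        out[f"Rolling_Mean_{axis}"] = mean[axis]
    for axis in ["x", "y", "z"]:
        out[f"Rolling_Std_{axis}"] = std[axis]
    # The first window - 1 rows have incomplete windows (NaN), so drop them.
    return out.dropna().reset_index(drop=True)

features = add_rolling_features(raw, WINDOW)
print(features.columns.tolist())
```

Each output row describes the window that ends at that sample, so the first `WINDOW - 1` samples produce no row.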

After adding the time-series features, our feature space expands from 3 to 9 features (not counting 'timeStamp' and 'label').

    ['timeStamp',

    'x', 'y', 'z',
    
    'Rolling_Mean_x','Rolling_Mean_y','Rolling_Mean_z',
    
    'Rolling_Std_x','Rolling_Std_y','Rolling_Std_z',
    
    'label']
 
    [1510837962239.0, 
    
    -0.9301766753196716, -0.19591960310935974, -0.14742934703826904, 
    
    -0.9847284739358084, -0.11343124378173212, -0.023986388131244374, 
    
    0.09911378051724326, 0.14364807808884125, 0.15149487064158973,
    
    0.0]
    
 
Notice that we did not add 'w' (the rotation angle of the device) to the feature space. A next step is to investigate the effect of 'w' and add it as a feature if needed.

This way of choosing features is called 'Heuristic Based Feature Selection'. It relies on the researchers' knowledge of the data and is the typical way of selecting features.

However, when using a deep neural network, a better way is to use the raw data directly as features. The hidden units in the network can automatically learn functions that summarize the most effective features from the raw data. This is also a point we want to study further.
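
One way to feed raw data to a network is to flatten each window of consecutive samples into a single feature vector. The sketch below assumes this layout; the helper `make_raw_windows`, the random data, and the window length of 350 (borrowed from the idea listed under Next Steps) are illustrative, not a finished design.

```python
import numpy as np

def make_raw_windows(samples, window):
    """Flatten each run of `window` consecutive (x, y, z) samples into one vector.

    The layout is x1..xN, y1..yN, z1..zN for a window of N samples.
    """
    samples = np.asarray(samples, dtype=float)
    n_windows = samples.shape[0] - window + 1
    rows = [samples[start:start + window].T.reshape(-1) for start in range(n_windows)]
    return np.stack(rows)

# Made-up data: 1000 samples of (x, y, z).
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))
X = make_raw_windows(samples, window=350)
print(X.shape)
```

Each row of `X` could then serve as one input example, with the label taken from the last sample of its window or from a majority vote; that choice is still open.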





#### b. Feature Transformation(Yankun)


#### c. GPU support on Windows 10: NVIDIA GeForce 750M(Jun)

We ran the same task on the CPU, on the GPU, and on both. The task is training and prediction with a 3-layer deep neural network for 1000 epochs. The running times are shown below:

CPU: 00:01:20

GPU: 00:01:15

CPU+GPU: 00:01:36


The GPU does not speed up the process much: only 5 seconds are saved on this task.

We believe the reasons for the small improvement are:

1. The NVIDIA GeForce 750M is a GPU with a compute capability of only 3, according to the NVIDIA website (https://developer.nvidia.com/cuda-gpus).

2. Performance can be improved by following the suggestions in TensorFlow's performance guide, which we can apply in future code (https://www.tensorflow.org/versions/r1.1/performance/performance_guide).

3. Combining the CPU and GPU may introduce delay from the memory exchange between them.


Configuring local GPU support helped us understand how to use GPUs with TensorFlow (how to deploy a new GPU, how to write code that suits GPU computation, and how to run computation tasks on single or multiple CPUs and GPUs). It also prepared us to work on the remote TeamViewer workstation, which uses an NVIDIA 1050 Ti.


### (2) Random Forest Model(Jun)
#### a. Parameter Tuning
Two hyper-parameters need to be optimized:

1. Length of the window
2. Number of the trees in the Random Forest

These parameters follow certain rules. Since we only have 9 features, we do not need many trees in the forest; testing 1 to 30 trees is sufficient.

We use a grid search to tune the window length. Since the time between any two adjacent data points is 32 ms, the window does not need to cover too long a period. The longest window we test is 1000 data points, which spans about 32 seconds.

The result of tuning Random Forest is shown in the pictures below. 

##### Tuning Number of Trees:

![Tuning Number of Trees](https://image.ibb.co/npkCmb/tree_num_tune.png)

The x-axis shows the number of trees, and the y-axis shows the accuracy of the algorithm. The accuracy converges at 10 trees.

##### Tuning Length of Window:
![Tuning Length of Window](https://image.ibb.co/d214XG/window_tune.png)

The x-axis shows the length of the window, and the y-axis shows the accuracy of the algorithm. The accuracy converges at a window length of 100.
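
A sweep over the number of trees can be sketched as below. This is not the script that produced the plots: the synthetic 9-feature data, the made-up labels, and the simple random train/test split are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9-feature data set; the labels here are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 9))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one forest per candidate size and record its test accuracy.
accuracy_by_trees = {}
for n_trees in range(1, 31):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    accuracy_by_trees[n_trees] = model.score(X_test, y_test)

best_n_trees = max(accuracy_by_trees, key=accuracy_by_trees.get)
print(best_n_trees, accuracy_by_trees[best_n_trees])
```

The same loop works for the window length: rebuild the rolling features for each candidate length, retrain, and record the accuracy.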


#### b. Computational Parallelization

Random Forest is an ensemble learning algorithm that can be parallelized: all trees can be trained at the same time. The Python library we use, scikit-learn, does not support GPU computation, but it still provides a way to speed up the algorithm. The official documentation describes this feature (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html):

"n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores."


Based on the number of cores in the training machine, we can set the 'n_jobs' parameter to parallelize the training process and make it more efficient.
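
As a rough sketch of how `n_jobs` is used, the snippet below trains the same forest serially and on all cores and times both runs. The synthetic data and the helper `fit_and_time` are assumptions; the 10 trees follow the tuning result above.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))
y = (X[:, 0] > 0).astype(int)

def fit_and_time(n_jobs):
    """Train a 10-tree forest with the given n_jobs and return (model, seconds)."""
    model = RandomForestClassifier(n_estimators=10, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start

serial_model, serial_seconds = fit_and_time(n_jobs=1)     # one core
parallel_model, parallel_seconds = fit_and_time(n_jobs=-1)  # all available cores

print(f"n_jobs=1: {serial_seconds:.3f}s, n_jobs=-1: {parallel_seconds:.3f}s")
```

On a data set this small the parallel run may not be faster, because starting the workers has its own overhead; the benefit should grow with more trees and more data.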

### (3) Artificial Neural Network(Jun)
#### a. Tensorflow

#### b. Softmax/ReLU Regression & Choosing Activation Functions

#### c. Deep Neural Network (Layer and Hidden Units)

#### d. Parameter Tuning

## 3. Evaluation and Analysis

(We mainly focus on Random Forest)

### (1) Effect of Feature Engineering(Both)

#### a. Time-series Feature vs Non-Time-series Feature

#### b. Scaling vs Non-Scaling

#### c. Rotation vs Non-Rotation

#### d. Effect of Parameter Tuning in Random Forest Model


### (2) Result After Feature Engineering and Parameter Tuning(Yankun)

#### a. Binary Classification

#### b. Multi-class Classification



## 4. Next Steps

#### a. Deep Learning Cont. (may use 350*3 features x1,x2...x350, y1, y2, y350)
#### b. More data, more classes
#### c. Experiment -> Real industrial condition
#### d. PCA
#### e. HMM
#### f. Tag the process features(drain, lift up...), add the tilt angle data into feature.

## 5. References

[1] Bergstra, J. and Bengio, Y., 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb), pp.281-305.

[2] Mannini, A. and Sabatini, A.M., 2010. Machine learning methods for classifying human physical activity from on-body accelerometers. Sensors, 10(2), pp.1154-1175.

[3] Sekine, M., Tamura, T., Togawa, T. and Fukui, Y., 2000. Classification of waist-acceleration signals in a continuous walking record. Medical engineering & physics, 22(4), pp.285-291.