# Data Normalization: Speed per Sensor

## Summary
This document shows how to normalize the original dataset. It is only used for the initial data analysis, so only a few features are selected: itapudid, max1stdetectwssc, max1stdetectwssd, max1stdetectwsse, max1stdetectwssf and eventtime. max1stdetectwssa and max1stdetectwssb are left out because of the sensors in those positions. NA readings are dropped per sensor, and all poweroffevents of an itapudid are removed when its speeds sum to 0 within a sensor pair (C/D or E/F). The final dataset has 4 columns: itapudid, speed, eventtime and sensor.

## Questions to be answered
- How to deal with NA values?
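
One way to weigh the options for this question is to compare dropping missing readings with filling them, on a small toy frame. The sketch below uses made-up values and only two of the sensor columns; it is an illustration of the trade-offs, not a statement of which option the analysis should adopt.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two sensors, some readings missing
toy = pd.DataFrame({
    'itapudid': ['A', 'A', 'B'],
    'max1stdetectwssc': [10.0, np.nan, 0.0],
    'max1stdetectwssd': [np.nan, 12.0, 0.0],
})
sensor_cols = ['max1stdetectwssc', 'max1stdetectwssd']

# Option 1: drop a row when any sensor is missing (can discard valid readings)
dropped_any = toy.dropna(subset=sensor_cols, how='any')

# Option 2: drop a row only when every sensor is missing
dropped_all = toy.dropna(subset=sensor_cols, how='all')

# Option 3: fill with 0 (a missing reading then looks like a measured speed of 0)
filled = toy.fillna({col: 0.0 for col in sensor_cols})

print(len(dropped_any), len(dropped_all), len(filled))
```

Dropping per sensor column, as done later in this notebook, avoids the loss in option 1 because each sensor is handled separately.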

### Imports
Imports should be grouped in the following order:
1. Magics

2. Alphabetical order
    
    A. standard library imports
    
    B. related 3rd party imports
    
    C. local application/library specific imports

In [1]:
# Magics
%matplotlib inline
# Uncomment the line below for interactive matplotlib plots
# %matplotlib notebook

# Reload modules before executing user code
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

# Show version information for dependency modules
# https://github.com/jrjohansson/version_information
%load_ext version_information
%version_information numpy, scipy, matplotlib, pandas

| Software | Version |
| --- | --- |
| Python | 3.5.2 64bit [MSC v.1900 64 bit (AMD64)] |
| IPython | 5.1.0 |
| OS | Windows 7 6.1.7601 SP1 |
| numpy | 1.11.1 |
| scipy | 0.18.1 |
| matplotlib | 1.5.3 |
| pandas | 0.18.1 |

Fri Dec 09 12:30:11 2016 W. Europe Standard Time


In [1]:
# Standard library
import math
import os
import sys
# sys.path.append('../src/')

# Third party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Local imports

In [2]:
# Customizations
sns.set()  # apply seaborn's default theme to matplotlib

# Any tweaks that normally go in matplotlibrc, etc., should be explicitly stated here
plt.rcParams['figure.figsize'] = (12, 8)
%config InlineBackend.figure_format = 'retina'

### Load data

#### References
The first dataset loaded is the original poweroffevent dataset.

In [3]:
df = pd.read_pickle('../data/wss') # Original dataset from the poweroffevent

In [4]:
# Select the features of interest (NA values are kept at this stage) and save the subset
df = df[['itapudid', 'max1stdetectwssc', 'max1stdetectwssd', 'max1stdetectwsse', 'max1stdetectwssf', 'eventtime']]
df.to_pickle('../data/wss_n1')

In [5]:
# Only one trailer has 4 wss, and it has far fewer poweroffevents than the others, so its poweroffevents are removed here
df = df[df.itapudid != '170540055001DC915C90E']
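
The trailer ID above is hardcoded. As a rough sketch of how such a trailer could be identified from the data, the example below counts, per itapudid, how many sensor columns carry any reading and how many poweroffevents exist. The toy values and the names `overview` and `four_sensor_ids` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): trailer 'T2' reports on four sensors, the others on two
toy = pd.DataFrame({
    'itapudid': ['T1', 'T1', 'T2', 'T3', 'T3', 'T3'],
    'max1stdetectwssc': [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    'max1stdetectwssd': [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    'max1stdetectwsse': [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0],
    'max1stdetectwssf': [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0],
})
sensor_cols = ['max1stdetectwssc', 'max1stdetectwssd',
               'max1stdetectwsse', 'max1stdetectwssf']

# Per trailer: number of sensor columns with at least one non-NA reading
has_data = toy[sensor_cols].notnull().groupby(toy['itapudid']).any()
n_sensors = has_data.sum(axis=1)

# Per trailer: number of poweroffevents (rows)
n_events = toy.groupby('itapudid').size()

overview = pd.DataFrame({'n_sensors': n_sensors, 'n_events': n_events})
four_sensor_ids = overview.index[overview['n_sensors'] == 4].tolist()
print(overview)
```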

In [10]:
# This function normalizes the original dataset into long format.
# It drops all NA values and returns a new dataset with one speed reading per row, labelled with its sensor position
def Data_Normalization_perSensor(df):
    # First cut the dataset into four pieces, representing 4 different sensor positions respectively
    df_c = df[['itapudid', 'max1stdetectwssc', 'eventtime']] # Speed values of sensor c
    df_c = df_c.dropna(axis=0, how='any') # Drop all NA values
    df_c['sensor'] = 'C'
    df_c.columns = ['itapudid', 'speed', 'eventtime', 'sensor']

    df_d = df[['itapudid', 'max1stdetectwssd', 'eventtime']] # Speed values of sensor d
    df_d = df_d.dropna(axis=0, how='any') # Drop all NA values
    df_d['sensor'] = 'D'
    df_d.columns = ['itapudid', 'speed', 'eventtime', 'sensor']

    df_e = df[['itapudid', 'max1stdetectwsse', 'eventtime']] # Speed values of sensor e
    df_e = df_e.dropna(axis=0, how='any') # Drop all NA values
    df_e['sensor'] = 'E'
    df_e.columns = ['itapudid', 'speed', 'eventtime', 'sensor']

    df_f = df[['itapudid', 'max1stdetectwssf', 'eventtime']] # Speed values of sensor f
    df_f = df_f.dropna(axis=0, how='any') # Drop all NA values
    df_f['sensor'] = 'F'
    df_f.columns = ['itapudid', 'speed', 'eventtime', 'sensor']
    
    df_cd = pd.concat([df_c, df_d]) # Combine the sensor positions C and D
    df_ef = pd.concat([df_e, df_f]) # Combine the sensor positions E and F
    
    df_cd = df_cd.sort_values(['eventtime','itapudid']) # Sort the records by eventtime, then itapudid
    df_ef = df_ef.sort_values(['eventtime','itapudid']) # Sort the records by eventtime, then itapudid
    
    # Remove all poweroffevents of any itapudid whose speeds sum to 0
    # Sum only the speed column; eventtime and sensor are not meaningful to sum
    df1 = df_cd.groupby('itapudid')['speed'].sum().reset_index()
    df2 = df_ef.groupby('itapudid')['speed'].sum().reset_index()
    # Select the itapudids whose summed speed equals 0
    k1 = df1[df1['speed'] == 0].itapudid
    k2 = df2[df2['speed'] == 0].itapudid
    # Remove the poweroffevents of the itapudids selected above
    df_cd = df_cd[~df_cd.itapudid.isin(k1)]
    df_ef = df_ef[~df_ef.itapudid.isin(k2)]
    # Combine the dataset df_cd and df_ef, and get a final complete dataset    
    dataFinal = pd.concat([df_cd, df_ef])
    dataFinal = dataFinal.sort_values(['eventtime','itapudid'])
    dataFinal = dataFinal.reset_index(drop = True)
    
    return dataFinal
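
The four near-identical blocks in the function above amount to a wide-to-long reshape. As a hedged alternative sketch, the same idea can be written with `pd.melt` and a grouped `transform`. The function name `normalize_per_sensor_melt` and the two mapping dictionaries are hypothetical, and the extra sort key `sensor` is an assumption added only to make the row order deterministic; the toy data is made up.

```python
import numpy as np
import pandas as pd

# Map each source column to its sensor label, and each sensor to its pair
SENSOR_COLUMNS = {
    'max1stdetectwssc': 'C',
    'max1stdetectwssd': 'D',
    'max1stdetectwsse': 'E',
    'max1stdetectwssf': 'F',
}
PAIR_OF_SENSOR = {'C': 'CD', 'D': 'CD', 'E': 'EF', 'F': 'EF'}


def normalize_per_sensor_melt(df):
    """Hypothetical alternative: reshape wide sensor columns to long format."""
    long_df = pd.melt(
        df,
        id_vars=['itapudid', 'eventtime'],
        value_vars=list(SENSOR_COLUMNS),
        var_name='sensor',
        value_name='speed',
    )
    long_df['sensor'] = long_df['sensor'].map(SENSOR_COLUMNS)
    # Drop rows with a missing id, speed or timestamp
    long_df = long_df.dropna(subset=['itapudid', 'speed', 'eventtime'])

    # Drop trailers whose speeds sum to 0 within a sensor pair (C/D or E/F)
    pair = long_df['sensor'].map(PAIR_OF_SENSOR)
    pair_sum = long_df.groupby(['itapudid', pair])['speed'].transform('sum')
    long_df = long_df[pair_sum != 0]

    long_df = long_df.sort_values(['eventtime', 'itapudid', 'sensor'])
    columns = ['itapudid', 'speed', 'eventtime', 'sensor']
    return long_df[columns].reset_index(drop=True)


# Toy data (made up): T2 has only zero speeds on C/D, so those rows are removed
toy = pd.DataFrame({
    'itapudid': ['T1', 'T1', 'T2'],
    'max1stdetectwssc': [5.0, np.nan, 0.0],
    'max1stdetectwssd': [6.0, 7.0, 0.0],
    'max1stdetectwsse': [np.nan, np.nan, 3.0],
    'max1stdetectwssf': [np.nan, np.nan, np.nan],
    'eventtime': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03']),
})
result = normalize_per_sensor_melt(toy)
print(result)
```

Adding a sensor then only requires a new dictionary entry rather than another copied block.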

In [11]:
dataNormalized = Data_Normalization_perSensor(df)
dataNormalized.to_pickle('../data/wss_n_c1')

In [12]:
dataNormalized

|  | itapudid | speed | eventtime | sensor |
| --- | --- | --- | --- | --- |
| 0 | 163540031001DC924C7B3 | 0.0 | 1970-01-01 00:00:00 | C |
| 1 | 163540031001DC924C7B3 | 0.0 | 1970-01-01 00:00:00 | D |
| 2 | 163540032001DC9248262 | 0.0 | 1970-01-01 00:00:00 | E |
| 3 | 163540032001DC9248262 | 0.0 | 1970-01-01 00:00:00 | F |
| 4 | 164320032001DC92C8F02 | 0.0 | 1970-01-01 00:00:00 | C |
| 5 | 164320032001DC92C8F02 | 0.0 | 1970-01-01 00:00:00 | D |
| 6 | 164320033001DC924D8F0 | 0.0 | 1970-01-01 00:00:00 | C |
| 7 | 164320033001DC924D8F0 | 0.0 | 1970-01-01 00:00:00 | D |
| 8 | 164750011001DC915CDB2 | 0.0 | 1970-01-01 00:00:00 | C |
| 9 | 164750011001DC915CDB2 | 0.0 | 1970-01-01 00:00:00 | D |
