## Scikit-learn Preprocessing

In this notebook, we'll use `sklearn.impute` to fill in missing values and `sklearn.preprocessing` to scale a numeric column. If you need to prepare data for machine learning or feature extraction, the [sklearn.preprocessing documentation](http://scikit-learn.org/stable/modules/preprocessing.html) covers these transformers with worked examples.

In [1]:
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
import pandas as pd
import numpy as np
from datetime import datetime

In [33]:
hvac = pd.read_csv('../Dataset/HVAC_with_nulls.csv')

## Checking Data Quality

In [34]:
hvac.dtypes

Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object

In [35]:
hvac.shape

(8000, 8)

In [36]:
hvac.head()

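|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|------|------|------------|------------|--------|-----------|------------|----|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | NaN | 68 | 3 | 20.0 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | NaN | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN |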
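`head()` only shows a handful of rows, so it can help to count the missing values per column as well. The sketch below does this on a toy frame built from the five preview rows above (a subset of the columns, not the real file); the names `toy`, `null_counts`, and `all_null_cols` are ours.

```python
import numpy as np
import pandas as pd

# Toy frame copied from the head() preview above (subset of columns)
toy = pd.DataFrame({
    'TargetTemp': [66.0, np.nan, 70.0, 67.0, 68.0],
    'SystemAge': [20.0, 20.0, 20.0, np.nan, 9.0],
    'BuildingID': [4, 17, 18, 15, 3],
    '10': [np.nan] * 5,
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Columns that contain no data at all in this sample
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()
print(all_null_cols)
```

On the real data, the same two lines would be `hvac.isna().sum()` and the matching filter against `len(hvac)`.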


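## Impute missing values with the mean

In [37]:
# Select the two numeric columns that contain gaps
hvac_numeric = hvac[['TargetTemp', 'SystemAge']]

In [38]:
# Preview the first 10 rows; note the NaN entries
hvac_numeric.head(10)

|   | TargetTemp | SystemAge |
|---|------------|-----------|
| 0 | 66.0 | 20.0 |
| 1 | NaN | 20.0 |
| 2 | 70.0 | 20.0 |
| 3 | 67.0 | NaN |
| 4 | 68.0 | 9.0 |
| 5 | 67.0 | 28.0 |
| 6 | 70.0 | 24.0 |
| 7 | NaN | 26.0 |
| 8 | 66.0 | 9.0 |
| 9 | 65.0 | 5.0 |


In [39]:
# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')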

In [40]:
# Learn the column means from the full frame
imp = imp.fit(hvac_numeric)

In [41]:
# Fill the gaps using the means learned above
transformed = imp.transform(hvac_numeric)

In [45]:
transformed

array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])
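To see where the fill values come from, here is a minimal sketch on a toy frame with hypothetical values (not rows from the HVAC file). After fitting, the imputer's `statistics_` attribute holds the per-column means it will substitute for each `NaN`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    'TargetTemp': [66.0, np.nan, 70.0, 68.0],
    'SystemAge': [20.0, 20.0, np.nan, 8.0],
})

toy_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_filled = toy_imp.fit_transform(toy)

# Per-column means learned during fit, computed from the non-missing values
print(toy_imp.statistics_)

# The result is a NumPy array with each NaN replaced by its column mean
print(toy_filled)
```

Note that the output is a plain array, not a DataFrame, which is why the notebook assigns the columns back by position.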

In [46]:
hvac['TargetTemp'], hvac['SystemAge'] = transformed[:,0], transformed[:,1]

In [47]:
hvac.head()

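|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|------|------|------------|------------|--------|-----------|------------|----|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.0 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | 15.386643 | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN |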


In [56]:
hvac['ActualTemp'].max()

80

## Scale temperature values

In [49]:
hvac['ScaledTemp'] = preprocessing.scale(hvac['ActualTemp'])

In [54]:
hvac

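|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 | ScaledTemp |
|---|------|------|------------|------------|--------|-----------|------------|----|------------|
| 0 | 6/1/13 | 0:00:01 | 66.000000 | 58 | 13 | 20.000000 | 4 | NaN | -1.293272 |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.000000 | 17 | NaN | 0.048732 |
| 2 | 6/3/13 | 2:00:01 | 70.000000 | 73 | 17 | 20.000000 | 18 | NaN | 0.719733 |
| 3 | 6/4/13 | 3:00:01 | 67.000000 | 63 | 2 | 15.386643 | 15 | NaN | -0.622270 |
| 4 | 6/5/13 | 4:00:01 | 68.000000 | 74 | 16 | 9.000000 | 3 | NaN | 0.853934 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 6/16/13 | 1:33:07 | 66.000000 | 58 | 17 | 18.000000 | 20 | NaN | -1.293272 |
| 7996 | 6/17/13 | 2:33:07 | 68.000000 | 72 | 17 | 27.000000 | 12 | NaN | 0.585533 |
| 7997 | 6/18/13 | 3:33:07 | 67.507735 | 69 | 10 | 4.000000 | 3 | NaN | 0.182932 |
| 7998 | 6/19/13 | 4:33:07 | 65.000000 | 63 | 7 | 23.000000 | 20 | NaN | -0.622270 |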
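`preprocessing.scale` standardizes a column to zero mean and unit variance. As a rough check of what that means, the sketch below reproduces it by hand on the first five `ActualTemp` values shown above. Because the mean and standard deviation here come from five rows rather than the full column, the numbers will not match `ScaledTemp`.

```python
import numpy as np
from sklearn import preprocessing

# First five ActualTemp values from the preview above
temps = np.array([58.0, 68.0, 73.0, 63.0, 74.0])

scaled = preprocessing.scale(temps)

# Same calculation by hand: subtract the mean, divide by the
# population standard deviation (NumPy's default, ddof=0)
manual = (temps - temps.mean()) / temps.std()

print(scaled)
print(np.allclose(scaled, manual))
```

A value of 0 after scaling means "exactly average", and each unit is one standard deviation away from that average.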


## Scale using a min-max scaler

In [28]:
min_max_scaler = preprocessing.MinMaxScaler()

In [29]:
temp_minmax = min_max_scaler.fit_transform(hvac[['ActualTemp']])

In [30]:
temp_minmax

array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])
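With its default settings, `MinMaxScaler` maps each value to `(x - min) / (max - min)`, so the column ends up in the range 0 to 1. The notebook reports a maximum `ActualTemp` of 80, and the outputs above (0.12 for 58, 0.52 for 68) are consistent with a minimum of 55; that minimum is our inference, not something the notebook prints. The sketch below checks the formula on a hypothetical sample containing both extremes.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample; 55 is an inferred minimum, 80 is the reported maximum
temps = np.array([[55.0], [58.0], [68.0], [73.0], [80.0]])

scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(temps)

# Same calculation by hand
manual = (temps - temps.min()) / (temps.max() - temps.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Unlike standardization, min-max scaling is sensitive to a single extreme value, since the minimum and maximum alone set the range.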

### Exercise: add the `temp_minmax` back to the dataframe as a new column

In [31]:
# %load ../solutions/preprocessing.py
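If you get stuck, the sketch below shows one possible approach on a toy frame. It is not the contents of the solution file, and the column name `TempMinMax` is our own choice. The key detail is that `fit_transform` returns a 2-D array with one column, which needs flattening before it is assigned to a single DataFrame column.

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame with hypothetical temperatures
toy = pd.DataFrame({'ActualTemp': [55, 58, 68, 73, 80]})

toy_scaled = preprocessing.MinMaxScaler().fit_transform(toy[['ActualTemp']])

# toy_scaled has shape (n_rows, 1); ravel() flattens it to 1-D
toy['TempMinMax'] = toy_scaled.ravel()

print(toy)
```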

