# Data preprocessing for machine learning

### Learning Objectives
* Understand the different approaches for data preprocessing in developing ML models
* Use Dataflow to perform data preprocessing steps

## Introduction

In the previous notebook we achieved an RMSE of **3.85**. Let's see if we can improve upon that by creating a data preprocessing pipeline in Cloud Dataflow.

Preprocessing data for a machine learning model involves both data engineering and feature engineering. Data engineering converts raw data into the prepared data the model needs. Feature engineering then takes that prepared data and creates the features the model expects. We have already seen various ways to engineer new features for a machine learning model and where those steps take place. We also have flexibility in where data preprocessing takes place: for example, in BigQuery, Cloud Dataflow, or TensorFlow. In this lab, we'll explore different data preprocessing strategies and see how they can be accomplished with Cloud Dataflow.

One way to categorize data preprocessing operations is by the granularity of the operation. Here, we will consider the following three types of operations:
1. Instance-level transformations
2. Full-pass transformations
3. Time-windowed aggregations

Cloud Dataflow can perform each of these types of operations and is particularly useful for computationally expensive operations, since it is an autoscaling service for batch and streaming data processing pipelines. We'll say a few words about each type below. For more information, have a look at this article about [data preprocessing for machine learning from Google Cloud](https://cloud.google.com/solutions/machine-learning/data-preprocessing-for-ml-with-tf-transform-pt1).

**1. Instance-level transformations**
These are transformations which take place during training and prediction, looking only at values from a single data point. For example, they might include clipping the value of a feature, polynomially expanding a feature, multiplying two features, or comparing two features to create a Boolean flag.

It is necessary to apply the same transformations at training time and at prediction time. Failure to do this results in training/serving skew and will negatively affect the performance of the model.
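As a minimal sketch of the four instance-level operations listed above, the function below transforms one data point at a time. The feature names `x1` and `x2` and the clipping bound `clip_max` are hypothetical placeholders, not names from this lab's dataset.

```python
from __future__ import division, print_function


def transform_instance(row, clip_max=10.0):
    """Apply instance-level transformations to a single data point.

    `row` is a dict with hypothetical numeric features 'x1' and 'x2'.
    Nothing here depends on any other row in the dataset.
    """
    return {
        'x1_clipped': min(max(row['x1'], 0.0), clip_max),  # clip to [0, clip_max]
        'x1_squared': row['x1'] ** 2,                      # polynomial expansion
        'x1_times_x2': row['x1'] * row['x2'],              # product of two features
        'x1_gt_x2': row['x1'] > row['x2'],                 # Boolean comparison flag
    }


features = transform_instance({'x1': 12.0, 'x2': 3.0})
print(features)
```

Because the function needs only the one row it is given, calling the same function at training time and at prediction time is one way to avoid training/serving skew.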

**2. Full-pass transformations**
These transformations require a full pass over the data during training, but are applied as instance-level operations during prediction. That is, during training you must analyze the entirety of the training data to compute quantities such as the maximum, minimum, mean, or variance, while at prediction time you need only use those values to rescale or normalize a single data point.

A good example to keep in mind is standard scaling (z-score normalization) of features for training. You need to compute the mean and standard deviation of that feature across the whole training data set, which is why it is called a full-pass transformation. At prediction time you use those previously computed values to normalize the new data point. Failure to do so results in training/serving skew.
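The two phases of z-score normalization can be sketched in plain Python as follows. The helper names `fit_scaler` and `scale` and the sample values are illustrative assumptions, and the sketch uses the population standard deviation.

```python
from __future__ import division, print_function
import math


def fit_scaler(values):
    """Full pass: compute the mean and standard deviation of all training values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    return mean, math.sqrt(variance)


def scale(value, mean, stddev):
    """Instance-level step: normalize one value with the stored statistics."""
    return (value - mean) / stddev


# Training time: the statistics come from the entire training set.
train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, stddev = fit_scaler(train_values)

# Prediction time: reuse the stored statistics; do not recompute them.
scaled_new_point = scale(9.0, mean, stddev)
print(mean, stddev, scaled_new_point)
```

The statistics returned by `fit_scaler` would need to be saved alongside the model so that the prediction path applies exactly the same scaling.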

**3. Time-windowed aggregations**
These transformations occur during training and at prediction time. They create a feature by aggregating real-time values over a temporal window. For example, if we wanted our model to estimate taxi trip time based on traffic metrics for the route over the last 5, 10, or 30 minutes, we would create a time window over which to aggregate those values.

At prediction time these aggregations have to be computed in real-time from a data stream.
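To make the idea concrete, here is a small sketch of a sliding-window mean in plain Python. It is not a Dataflow pipeline: the `SlidingWindowMean` class, the timestamps, and the speed readings are all hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from __future__ import division, print_function
from collections import deque


class SlidingWindowMean(object):
    """Mean of the values observed during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record a new event and return the mean over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that are at least `window_seconds` older than the newest one.
        cutoff = timestamp - self.window_seconds
        while self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


# Hypothetical traffic speed readings as (seconds, speed), with a 5-minute window.
window = SlidingWindowMean(window_seconds=300)
readings = [(0, 30.0), (120, 40.0), (240, 50.0), (400, 60.0)]
means = [window.add(ts, speed) for ts, speed in readings]
print(means)
```

By the time the last reading arrives at 400 seconds, the reading from time 0 has left the 5-minute window, so only the three most recent values contribute to the mean.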

## Set environment variables

Apache Beam only works in Python 2 at the moment, so we're going to switch to the Python 2 kernel. In the menu above, click the dropdown arrow and select python2. After that, run the following cell to make sure Beam is installed.

In [3]:
%%bash
conda update -n base -c defaults conda
source activate py2env
apt-get -y update
apt-get -y --allow-unauthenticated install python-pip
pip uninstall -y google-cloud-dataflow
conda install -y pytz 
pip install apache-beam[gcp]

Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:3 http://ftp.us.debian.org/debian testing InRelease [159 kB]
Ign:3 http://ftp.us.debian.org/debian testing InRelease
Get:4 http://ftp.us.debian.org/debian testing/main Sources [10.4 MB]
Ign:2 http://security.ubuntu.com/ubuntu xenial-security InRelease
Ign:1 http://archive.ubuntu.com/ubuntu xenial InRelease
Get:5 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [785 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Ign:6 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Get:7 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Ign:7 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:8 http://archive.ubuntu.com/ubuntu xenial/main 

Couldn't create tempfiles for splitting up /var/lib/apt/lists/partial/ftp.us.debian.org_debian_dists_testing_InRelease
Couldn't create tempfiles for splitting up /var/lib/apt/lists/partial/security.ubuntu.com_ubuntu_dists_xenial-security_InRelease
Couldn't create tempfiles for splitting up /var/lib/apt/lists/partial/archive.ubuntu.com_ubuntu_dists_xenial_InRelease
Couldn't create tempfiles for splitting up /var/lib/apt/lists/partial/archive.ubuntu.com_ubuntu_dists_xenial-updates_InRelease
Couldn't create tempfiles for splitting up /var/lib/apt/lists/partial/archive.ubuntu.com_ubuntu_dists_xenial-backports_InRelease
W: GPG error: http://ftp.us.debian.org/debian testing InRelease: Could not execute 'apt-key' to verify signature (is gnupg installed?)
W: The repository 'http://ftp.us.debian.org/debian testing InRelease' is not signed.
W: GPG error: http://security.ubuntu.com/ubuntu xenial-security InRelease: Could not execute 'apt-key' to verify signature (is gnupg installed?)
W: The repository

**After installing the libraries from the previous cell, be sure to restart your kernel.**

Next, we'll install the necessary libraries and set environment variables.

In [1]:
!pip freeze | grep tensorflow==1.12.0 || pip install tensorflow==1.12.0

tensorflow==1.12.0


In [1]:
import tensorflow as tf
import apache_beam as beam
import shutil
import os
print(tf.__version__)

  from ._conv import register_converters as _register_converters


ImportError: No module named apache_beam

In [None]:
PROJECT = 'munn-sandbox'    # CHANGE THIS
BUCKET = 'munn-bucket' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.

In [None]:
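# Export the settings so that later %%bash cells can read them.
# This cell is Python, so it must not start with the %%bash magic.
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

# Ensure the Cloud SDK uses the Python 2 environment
os.environ['CLOUDSDK_PYTHON'] = 'python2'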
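
Values written to `os.environ` are inherited by child processes, which is why shell cells run later can read them as shell variables. The following sketch illustrates that inheritance with a hypothetical variable `DEMO_BUCKET`, chosen so it does not overwrite any of the settings above.

```python
from __future__ import print_function
import os
import subprocess
import sys

# Hypothetical variable, used only for this demonstration.
os.environ['DEMO_BUCKET'] = 'my-demo-bucket'

# Start a child Python process; it inherits the parent's environment.
output = subprocess.check_output(
    [sys.executable, '-c', "import os; print(os.environ['DEMO_BUCKET'])"])
inherited_value = output.decode('utf-8').strip()
print(inherited_value)
```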