<h1> Preprocessing using Dataflow </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using Dataflow
</ol>
<p>
Pandas is fine for experimenting, but to operationalize your workflow it is better to do the preprocessing in Apache Beam. This also helps if you need to preprocess data in flight, since Apache Beam supports streaming as well.

In [5]:
# change these to try this notebook out
BUCKET = 'asl-ml-immersion-temp'
PROJECT = 'asl-ml-immersion'
REGION = 'us-central1'

In [6]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [7]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Creating gs://asl-ml-immersion-temp/...


<h2> Save the query from earlier </h2>

The data is natality data (a record of births in the US). My goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother. Later, we will split the data into training and eval datasets, using the hash of the year-month to decide which rows go where.
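
To make the year-month split concrete, here is a minimal sketch in plain Python 3. It is an illustration under stated assumptions, not what BigQuery runs: `hashlib.sha256` stands in for `FARM_FINGERPRINT` (so the hash values will differ from the `hashmonth` column), and the helper names `month_hash` and `assign_split` are hypothetical. The bucket rule mirrors the `MOD(ABS(hashmonth), 4)` test used in the Dataflow job later in this notebook.

```python
import hashlib

def month_hash(year, month):
    """Stand-in for FARM_FINGERPRINT(CONCAT(year, month)).

    Assumption: any stable hash works for illustration; this one is
    the first 8 bytes of SHA-256 read as a signed 64-bit integer.
    """
    text = '{}{}'.format(year, month)
    digest = hashlib.sha256(text.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big', signed=True)

def assign_split(year, month):
    # Buckets 0, 1, 2 go to train and bucket 3 goes to eval,
    # as in MOD(ABS(hashmonth), 4) < 3 versus = 3.
    return 'train' if abs(month_hash(year, month)) % 4 < 3 else 'eval'

# One assignment per year-month between 2001 and 2008.
splits = {(y, m): assign_split(y, m)
          for y in range(2001, 2009) for m in range(1, 13)}
```

Every row from a given year-month lands in the same split, and rerunning the code gives the same assignment each time, which is what makes the split repeatable.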

In [9]:
query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  ever_born,
  cigarette_use,
  alcohol_use,
  FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

In [10]:
import google.datalab.bigquery as bq
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()

Unnamed: 0,weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,ever_born,cigarette_use,alcohol_use,hashmonth
0,1.937863,True,16,2,1,24,False,1.0,,False,-2126480030009879160
1,3.615581,True,23,2,1,31,False,2.0,,False,-1403073183891835564
2,7.312733,False,27,2,1,31,True,3.0,,False,3572456083892950068
3,3.249614,True,39,2,2,31,True,5.0,,,1088037545023002395
4,4.047687,True,25,2,2,34,True,2.0,,False,-1403073183891835564


<h2> Create ML dataset using Dataflow </h2>
<p>
Let's use Cloud Dataflow to read in the BigQuery data and write it out as CSV files.

Instead of using Beam/Dataflow, I had two other options:
<ol>
<li> Read from BigQuery directly using TensorFlow. However, using CSV files gives us the advantage of shuffling during read. This is important for distributed training because some workers might be slower than others, and shuffling the data helps prevent the same data from being assigned to the slow workers.
<li> Use the BigQuery console (http://bigquery.cloud.google.com) to run a query and save the result as a CSV file. For larger datasets, you may have to select the option to "allow large results" and save the result into a CSV file on Google Cloud Storage. However, in this case I want to do some preprocessing (on the "race" column, and to clean up missing values for the mother's age, etc.). If I didn't need preprocessing, I could have used the web console. I also prefer scripting to running queries in a user interface, so I am using Cloud Dataflow for the preprocessing.
</ol>

<p>
Note that after you launch this, the notebook won't show you progress. Go to the Dataflow section of the GCP web console to monitor the running job. It took about <b>30 minutes</b> for me. If you wish to continue without doing this step, you can copy my preprocessed output:
<pre>
gsutil -m cp -r gs://asl-ml-immersion/babyweight/preproc gs://your-bucket/
</pre>

In [15]:
import apache_beam as beam
import datetime

def to_csv(rowdict):
    # pull columns from BQ and create a line
    import hashlib
    import copy
    CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use'.split(',')
    # modify opaque numeric race code into human-readable data
    races = dict(zip([1,2,3,4,5,6,7,18,28,39,48],
                     ['White', 'Black', 'American Indian', 'Chinese', 
                      'Japanese', 'Hawaiian', 'Filipino',
                      'Asian Indian', 'Korean', 'Samoan', 'Vietnamese']))
    result = copy.deepcopy(rowdict)
    if 'mother_race' in rowdict and rowdict['mother_race'] in races:
      result['mother_race'] = races[rowdict['mother_race']]
    else:
      result['mother_race'] = 'Unknown'
    
    # cleanup: write out only the data that we want to train on
    if result['weight_pounds'] > 0 and result['mother_age'] > 0 and result['gestation_weeks'] > 0 and result['plurality'] > 0:
      data = ','.join([str(result[k]) if k in result else 'None' for k in CSV_COLUMNS])
      key = hashlib.sha224(data).hexdigest()  # hash the columns to form a key (instead of using hashmonth, of which there are only 96 unique values)
      yield str('{},{}'.format(data, key))
  
def preprocess(query, in_test_mode):
  job_name = 'preprocess-babyweight-features' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
    
  if in_test_mode:
    print 'Launching local job ... hang on'
    OUTPUT_DIR = './preproc'
  else:
    print 'Launching Dataflow job {} ... hang on'.format(job_name)
    OUTPUT_DIR = 'gs://{0}/babyweight/preproc/'.format(BUCKET)
    
  options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': job_name,
    'project': PROJECT,
    'max_num_workers': 24,
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
  }
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  if in_test_mode:
    RUNNER = 'DirectRunner'
  else:
    RUNNER = 'DataflowRunner'
  p = beam.Pipeline(RUNNER, options=opts)
  
  if in_test_mode:
    query = query + ' LIMIT 100' 
  
  for step in ['train', 'eval']:
    if step == 'train':
      selquery = 'SELECT * FROM ({}) WHERE MOD(ABS(hashmonth),4) < 3'.format(query)
    else:
      selquery = 'SELECT * FROM ({}) WHERE MOD(ABS(hashmonth),4) = 3'.format(query)

    (p 
     | '{}_read'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
     | '{}_csv'.format(step) >> beam.FlatMap(to_csv)
     | '{}_out'.format(step) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{}.csv'.format(step))))
    )
 
  job = p.run()
  
preprocess(query, in_test_mode=False)

Launching Dataflow job preprocess-babyweight-features-170801-190710 ... hang on
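
To see what the CSV step emits for a single record, here is a small standalone sketch of the same logic in Python 3, runnable without Beam. It is an adaptation, not the job's code: the function name `row_to_csv_line` is hypothetical, the race mapping is abbreviated to three codes, missing values are rejected explicitly (in Python 3, comparing `None > 0` raises instead of returning False), and the string is encoded before hashing. The sample row is the first row of the `df.head()` output above.

```python
import hashlib

CSV_COLUMNS = ('weight_pounds,is_male,mother_age,mother_race,plurality,'
               'gestation_weeks,mother_married,cigarette_use,alcohol_use').split(',')
# Abbreviated subset of the race mapping, for illustration only.
RACES = {1: 'White', 2: 'Black', 3: 'American Indian'}

def row_to_csv_line(row):
    """Return 'col1,...,col9,key' for a usable row, or None if it is filtered out."""
    result = dict(row)
    result['mother_race'] = RACES.get(row.get('mother_race'), 'Unknown')
    required = ('weight_pounds', 'mother_age', 'gestation_weeks', 'plurality')
    # Drop rows whose required fields are missing or non-positive.
    if any(result.get(k) is None or result[k] <= 0 for k in required):
        return None
    data = ','.join(str(result.get(k)) for k in CSV_COLUMNS)
    # Hash the joined columns to form a per-row key.
    key = hashlib.sha224(data.encode('utf-8')).hexdigest()
    return '{},{}'.format(data, key)

sample = {
    'weight_pounds': 1.937863, 'is_male': True, 'mother_age': 16,
    'mother_race': 2, 'plurality': 1, 'gestation_weeks': 24,
    'mother_married': False, 'ever_born': 1.0, 'cigarette_use': None,
    'alcohol_use': False, 'hashmonth': -2126480030009879160,
}
line = row_to_csv_line(sample)
print(line)
```

The line has ten comma-separated fields: the nine training columns followed by the SHA-224 key. Note that `ever_born` and `hashmonth` are not written out, and a missing `cigarette_use` becomes the string `None`.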


In [17]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*

gs://asl-ml-immersion-temp/babyweight/preproc/eval.csv-00000-of-00014
gs://asl-ml-immersion-temp/babyweight/preproc/train.csv-00000-of-00037
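
Each output is sharded into files named with a `-SSSSS-of-NNNNN` suffix, so the listing above shows only the first shard of 14 eval files and 37 train files. As a rough sketch, a shard name can be taken apart like this; the helper `parse_shard` is hypothetical and assumes the five-digit suffix seen in the listing.

```python
import re

# Matches '<prefix>-<5-digit index>-of-<5-digit total>' at the end of a path.
SHARD_RE = re.compile(r'^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})$')

def parse_shard(path):
    """Split a sharded output path into (prefix, shard index, shard count)."""
    match = SHARD_RE.match(path)
    if match is None:
        raise ValueError('not a sharded file name: {}'.format(path))
    return match.group('prefix'), int(match.group('index')), int(match.group('total'))

prefix, index, total = parse_shard(
    'gs://asl-ml-immersion-temp/babyweight/preproc/train.csv-00000-of-00037')
```

When reading the data back, a wildcard such as `train.csv-*` picks up every shard of the training set.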


Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License