# Creating a Sampled Dataset

**Learning Objectives**
- Sample the natality dataset to create train/eval/test sets
- Preprocess the data in a Pandas DataFrame

## Introduction

In this notebook we'll read a sample of the natality data from BigQuery and preprocess it in a Pandas DataFrame.

In [1]:
PROJECT = 'munn-sandbox'  # Replace with your PROJECT
BUCKET = 'munn-bucket'  # Replace with your BUCKET
REGION = 'us-central1'            # Choose an available region for Cloud MLE 
TFVERSION = '1.12'                # TF version for CMLE to use

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

## Create ML datasets by sampling using BigQuery

We'll begin by sampling the BigQuery data to create smaller datasets.

In [5]:
# Create SQL query using natality data after the year 2000
query_string = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

The dataset contains only a limited number of year-month combinations, so let's see what the hashmonths are.

We'll query BigQuery again, this time grouping by hashmonth and counting the records in each group. Knowing how many records fall under each hashmonth lets us check that a split on hashmonth gives the train/eval/test percentages we want.

In [15]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

df = bq.query("SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" 
              + query_string + 
              ") GROUP BY hashmonth").to_dataframe()

print("There are {} unique hashmonths.".format(len(df)))
df.head()

There are 96 unique hashmonths.


| | hashmonth | num_babies |
|---|---|---|
| 0 | 6392072535155213407 | 323758 |
| 1 | 8387817883864991792 | 331629 |
| 2 | 328012383083104805 | 359891 |
| 3 | 9183605629983195042 | 329975 |
| 4 | 8391424625589759186 | 364497 |
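To see how the grouped counts translate into split percentages, one option is to bucket each hashmonth with `MOD(hashmonth, 4)` and sum `num_babies` per bucket. The sketch below does this for only the five rows printed by `df.head()` above, so the resulting fractions are illustrative and are not those of the full dataset. The names `sample`, `bucket`, `split`, and `fractions` are introduced here for illustration, and the bucket-to-split mapping (0 and 1 for train, 2 for eval, 3 for test) is the one used by the sampling queries in this notebook.

```python
import pandas as pd

# The five rows printed by df.head() above (not the full 96 hashmonths).
sample = pd.DataFrame({
    "hashmonth": [6392072535155213407, 8387817883864991792, 328012383083104805,
                  9183605629983195042, 8391424625589759186],
    "num_babies": [323758, 331629, 359891, 329975, 364497],
})

# Same rule as MOD(hashmonth, 4) in SQL: buckets 0 and 1 -> train, 2 -> eval, 3 -> test.
sample["bucket"] = sample["hashmonth"] % 4
sample["split"] = sample["bucket"].map({0: "train", 1: "train", 2: "eval", 3: "test"})

# Fraction of records that falls into each split.
fractions = sample.groupby("split")["num_babies"].sum() / sample["num_babies"].sum()
print(fractions)
```

Running the same three lines on the full `df` would show whether the roughly 50/25/25 split implied by the bucket rule holds for the actual record counts.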


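Here's a way to get a well-distributed portion of the data such that the train/eval/test sets do not overlap. Every record from a given year-month shares one hashmonth, so splitting on `MOD(hashmonth, 4)` sends all records from that month to the same set: buckets 0 and 1 go to training, bucket 2 to evaluation, and bucket 3 to test.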

In [35]:
# Added the RAND() so that we can now subsample from each of the hashmonths to get approximately the record counts we want
train_query = "SELECT * FROM (" + query_string + ") WHERE MOD(hashmonth, 4) < 2 AND RAND() < 0.0005"
eval_query = "SELECT * FROM (" + query_string + ") WHERE MOD(hashmonth, 4) = 2 AND RAND() < 0.0005"
test_query = "SELECT * FROM (" + query_string + ") WHERE MOD(hashmonth, 4) = 3 AND RAND() < 0.0005"

train_df = bq.query(train_query).to_dataframe()
eval_df = bq.query(eval_query).to_dataframe()
test_df = bq.query(test_query).to_dataframe()

print("There are {} examples in the train dataset.".format(len(train_df)))
print("There are {} examples in the validation dataset.".format(len(eval_df)))
print("There are {} examples in the test dataset.".format(len(test_df)))

There are 9713 examples in the train dataset.
There are 3602 examples in the validation dataset.
There are 3375 examples in the test dataset.
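The split rule in the queries above can be sketched in plain Python to show why the three sets cannot share a year-month. This is a minimal sketch under stated assumptions: `hashlib.sha256` is only a stand-in for BigQuery's `FARM_FINGERPRINT` (it produces different hash values, so the actual assignments differ from those in BigQuery), the helper names `hashmonth` and `assign_split` are hypothetical, and the years are assumed to run from 2001 through 2008, which is consistent with the 96 hashmonths reported earlier.

```python
import hashlib

def hashmonth(year, month):
    """Stable non-negative hash of (year, month).

    A stand-in for ABS(FARM_FINGERPRINT(...)); it is NOT the same hash function.
    """
    key = "{}{}".format(year, month).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") >> 1  # keep 63 bits, always >= 0

def assign_split(year, month):
    """Mirror the MOD(hashmonth, 4) rule from the queries above."""
    bucket = hashmonth(year, month) % 4
    if bucket < 2:
        return "train"
    return "eval" if bucket == 2 else "test"

# Every record from a given year-month gets the same split,
# so no year-month can appear in more than one set.
months = [(year, month) for year in range(2001, 2009) for month in range(1, 13)]
splits = {ym: assign_split(*ym) for ym in months}
print(len(months), "year-months assigned")
```

The `RAND() < 0.0005` term in the queries is a separate step: it thins each set down to a manageable size after the hash has already decided which set a record belongs to.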


## Preprocess data using Pandas

We'll perform a few preprocessing steps on the data in our dataset. Let's add extra rows to simulate the lack of an ultrasound. That is, we'll duplicate every row and, in the copy, set the `is_male` field to `Unknown`. Also, in those copies, if there is more than one child we'll change the `plurality` to `Multiple(2+)`. While we're at it, we'll also change the `plurality` column to be a string. We'll perform these operations below.

Let's start by examining the training dataset as is.

In [36]:
train_df.head()

| | weight_pounds | is_male | mother_age | plurality | gestation_weeks | hashmonth |
|---|---|---|---|---|---|---|
| 0 | 9.687112 | True | 27 | 1 | 40.0 | 5896567601480310696 |
| 1 | 5.81359 | True | 20 | 1 | 36.0 | 3545707052733304728 |
| 2 | 7.874912 | True | 33 | 1 | 38.0 | 3408502330831153141 |
| 3 | 7.352416 | True | 35 | 1 | 35.0 | 1403073183891835564 |
| 4 | 8.875811 | False | 35 | 1 | 40.0 | 260598435387740869 |


Also, notice that some important numeric fields are missing in some rows. The `count` row in `describe()` excludes missing data, so a column whose count falls below the 9713 rows in the training set has missing values.

In [37]:
train_df.describe()

| | weight_pounds | mother_age | plurality | gestation_weeks | hashmonth |
|---|---|---|---|---|---|
| count | 9702.0 | 9713.0 | 9713.0 | 9655.0 | 9713.0 |
| mean | 7.226279 | 27.445485 | 1.035725 | 38.585707 | 4.188515e+18 |
| std | 1.317331 | 6.165494 | 0.19587 | 2.533693 | 2.619672e+18 |
| min | 0.500449 | 14.0 | 1.0 | 18.0 | 2.605984e+17 |
| 25% | 6.563162 | 23.0 | 1.0 | 38.0 | 1.622638e+18 |
| 50% | 7.312733 | 27.0 | 1.0 | 39.0 | 3.765901e+18 |
| 75% | 8.000575 | 32.0 | 1.0 | 40.0 | 6.749419e+18 |
| max | 12.749333 | 48.0 | 4.0 | 47.0 | 8.668301e+18 |
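In the summary above, `weight_pounds` has a count of 9702 and `gestation_weeks` a count of 9655 against 9713 rows, which means 11 and 58 missing values respectively. A quick way to get those numbers directly is `isna().sum()`. The sketch below shows the relationship on a hypothetical four-row frame named `toy`; the values are made up for illustration and are not taken from the natality data.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with some of the numeric columns of train_df.
toy = pd.DataFrame({
    "weight_pounds": [7.5, np.nan, 6.1, 8.0],
    "mother_age": [27, 20, 33, 35],
    "gestation_weeks": [40.0, 36.0, np.nan, np.nan],
})

missing_per_column = toy.isna().sum()           # number of NaN values per column
non_null_counts = toy.describe().loc["count"]   # what describe() reports as "count"

print(missing_per_column)
print(non_null_counts)
```

For every column, the two numbers add up to the number of rows in the frame.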


It is always crucial to clean raw data before using it in machine learning, so we have a preprocessing step. We'll define a `preprocess` function below. Note that the mother's age is an input to our model, so users will have to provide the mother's age; otherwise, our service won't work. The features we use for our model were chosen because they are good predictors and because they are easy enough to collect.

In [38]:
import pandas as pd

def preprocess(df):
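  # Clean up data: keep only the rows we want to use for training.
  # A comparison with a missing value (NaN) is False, so these filters
  # also drop rows where a field is missing.
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0]

  # Work on an explicit copy so the column assignment below does not
  # write to a slice of the caller's dataframe.
  df = df.copy()

  # Modify plurality field to be a string
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'] = df['plurality'].replace(twins_etc)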
  
  # now create extra rows to simulate lack of ultrasound
  no_ultrasound = df.copy(deep=True)
  no_ultrasound.loc[no_ultrasound['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  no_ultrasound['is_male'] = 'Unknown'
  
  return pd.concat([df, no_ultrasound])
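To make the effect of `preprocess` concrete, the sketch below runs a compact restatement of the function on a hypothetical three-row frame named `toy` (one singleton, one pair of twins, and one row with a missing weight). The restatement is included only so the example is self-contained; it combines the four filters into one expression and assigns the result of `replace` back to the column, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    # Compact restatement of the function above, for a self-contained demo.
    df = df[(df.weight_pounds > 0) & (df.mother_age > 0)
            & (df.gestation_weeks > 0) & (df.plurality > 0)].copy()
    twins_etc = {1: 'Single(1)', 2: 'Twins(2)', 3: 'Triplets(3)',
                 4: 'Quadruplets(4)', 5: 'Quintuplets(5)'}
    df['plurality'] = df['plurality'].replace(twins_etc)

    # Extra rows that simulate the lack of an ultrasound.
    no_ultrasound = df.copy(deep=True)
    no_ultrasound.loc[no_ultrasound['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
    no_ultrasound['is_male'] = 'Unknown'
    return pd.concat([df, no_ultrasound])

# Hypothetical input: the third row has a missing weight and should be dropped.
toy = pd.DataFrame({
    "weight_pounds": [7.5, 5.3, np.nan],
    "is_male": [True, False, True],
    "mother_age": [27, 42, 30],
    "plurality": [1, 2, 1],
    "gestation_weeks": [40.0, 37.0, 39.0],
})

out = preprocess(toy)
print(out[["is_male", "plurality"]])
```

Two of the three rows survive the filters and each appears twice in the result: once with the detailed `plurality` label and the known sex, and once with `is_male` set to `Unknown` and twins collapsed to `Multiple(2+)`.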

Let's process the train/eval/test set and see a small sample of the training data after our preprocessing:

In [42]:
train_df = preprocess(train_df)
eval_df = preprocess(eval_df)
test_df = preprocess(test_df)

In [43]:
train_df.head()

| | weight_pounds | is_male | mother_age | plurality | gestation_weeks | hashmonth |
|---|---|---|---|---|---|---|
| 0 | 9.687112 | True | 27 | Single(1) | 40.0 | 5896567601480310696 |
| 1 | 5.81359 | True | 20 | Single(1) | 36.0 | 3545707052733304728 |
| 2 | 7.874912 | True | 33 | Single(1) | 38.0 | 3408502330831153141 |
| 3 | 7.352416 | True | 35 | Single(1) | 35.0 | 1403073183891835564 |
| 4 | 8.875811 | False | 35 | Single(1) | 40.0 | 260598435387740869 |


In [44]:
train_df.tail()

| | weight_pounds | is_male | mother_age | plurality | gestation_weeks | hashmonth |
|---|---|---|---|---|---|---|
| 9708 | 8.809672 | Unknown | 32 | Single(1) | 40.0 | 4329667052416032880 |
| 9709 | 6.812284 | Unknown | 16 | Single(1) | 40.0 | 1403073183891835564 |
| 9710 | 8.399612 | Unknown | 37 | Single(1) | 38.0 | 1443901198490054949 |
| 9711 | 7.438397 | Unknown | 25 | Single(1) | 39.0 | 5107972924983092617 |
| 9712 | 8.249698 | Unknown | 30 | Single(1) | 40.0 | 7420272703711713305 |


Let's look again at a summary of the dataset. Note that we only see numeric columns, so `plurality` does not show up.

In [45]:
train_df.describe()

| | weight_pounds | mother_age | gestation_weeks | hashmonth |
|---|---|---|---|---|
| count | 19292.0 | 19292.0 | 19292.0 | 19292.0 |
| mean | 7.226681 | 27.456666 | 38.589571 | 4.186708e+18 |
| std | 1.315203 | 6.16255 | 2.518875 | 2.618412e+18 |
| min | 0.500449 | 14.0 | 18.0 | 2.605984e+17 |
| 25% | 6.563162 | 23.0 | 38.0 | 1.622638e+18 |
| 50% | 7.312733 | 27.0 | 39.0 | 3.572456e+18 |
| 75% | 8.000575 | 32.0 | 40.0 | 6.544755e+18 |
| max | 12.749333 | 48.0 | 47.0 | 8.668301e+18 |
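If a summary of the string columns is also wanted, `describe` accepts `include="all"`. The sketch below contrasts the two calls on a hypothetical three-row frame named `toy`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
toy = pd.DataFrame({
    "weight_pounds": [7.5, 5.3, 8.1],
    "plurality": ["Single(1)", "Twins(2)", "Single(1)"],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # adds unique/top/freq for other columns

print(everything)
```

For the string column, the `unique`, `top`, and `freq` rows report the number of distinct values, the most common value, and how often it occurs.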


## Write to .csv files 

In the final version of the model, we want to read from files, not Pandas DataFrames, so we write the DataFrames out as CSV files. Reading from CSV files lets us shuffle the data as it is read. This is important for distributed training because some workers might be slower than others, and shuffling helps prevent the same data from always being assigned to the slow workers.

In [49]:
train_df.to_csv('train.csv', index=False, header=False)
eval_df.to_csv('eval.csv', index=False, header=False)
test_df.to_csv('test.csv', index=False, header=False)
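Because the files are written with `header=False` and `index=False`, anything that reads them back has to supply the column names itself. The sketch below is one way to do that with the standard library and then shuffle the records; it is an illustration and not the input pipeline used for training. The `COLUMNS` list follows the column order of the DataFrames above, the three records are copied from the `eval.csv` output shown later in this notebook, and `io.StringIO` stands in for an opened file.

```python
import csv
import io
import random

# Column order of the DataFrames above; the CSV files themselves carry no header.
COLUMNS = ["weight_pounds", "is_male", "mother_age",
           "plurality", "gestation_weeks", "hashmonth"]

# Three records in the format written by to_csv(..., index=False, header=False).
sample_csv = io.StringIO(
    "7.7712947355,False,17,Single(1),41.0,6691862025345277042\n"
    "8.000575487979999,True,27,Single(1),37.0,4979697502521811334\n"
    "5.3131405142,False,42,Twins(2),37.0,8599690069971956834\n"
)

# Attach the column names to each record.
rows = [dict(zip(COLUMNS, record)) for record in csv.reader(sample_csv)]

# Shuffle the records; the fixed seed only makes the demo repeatable.
rng = random.Random(0)
rng.shuffle(rows)

for row in rows:
    print(row["plurality"], row["weight_pounds"])
```

To run this against a real file, `sample_csv` could be replaced with `open('eval.csv', newline='')`.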

In [51]:
%%bash
wc -l *.csv

   7124 eval.csv
   6694 test.csv
  19292 train.csv
  33110 total
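These line counts can be cross-checked against the sampled row counts printed earlier. Since `preprocess` writes every surviving row twice, half of each line count is the number of rows that passed the `> 0` filters, and the difference from the sampled count is the number of rows dropped. The short sketch below does that arithmetic; the dictionary names are introduced here for illustration.

```python
# Row counts printed after sampling, and line counts reported by wc -l.
sampled = {"train": 9713, "eval": 3602, "test": 3375}
csv_lines = {"train": 19292, "eval": 7124, "test": 6694}

summary = {}
for name in sampled:
    kept = csv_lines[name] // 2        # preprocess() emits every kept row twice
    dropped = sampled[name] - kept     # rows removed by the > 0 filters
    summary[name] = (kept, dropped)
    print(name, "kept:", kept, "dropped:", dropped)

print("total lines:", sum(csv_lines.values()))
```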


In [53]:
%%bash
head *.csv

==> eval.csv <==
7.7712947355,False,17,Single(1),41.0,6691862025345277042
8.000575487979999,True,27,Single(1),37.0,4979697502521811334
9.93843877096,False,26,Single(1),41.0,7170969733900686954
7.5618555866,True,23,Single(1),38.0,7773938200482214258
6.9666074791999995,True,18,Single(1),38.0,1002950341933487066
5.3131405142,False,42,Twins(2),37.0,8599690069971956834
6.75055446244,True,18,Single(1),39.0,411066950820961322
6.6248909731,True,21,Single(1),35.0,8391424625589759186
6.3272669193999995,True,20,Single(1),38.0,7872612453343038854
7.12534030784,True,28,Single(1),38.0,1077881854928885650

==> test.csv <==
6.87621795178,False,19,Single(1),40.0,7146494315947640619
8.811876612139999,False,26,Single(1),40.0,6392072535155213407
8.18796841068,False,26,Single(1),38.0,8904940584331855459
8.24969784404,False,24,Single(1),39.0,74931465496927487
7.6941329438,False,36,Single(1),41.0,7146494315947640619
6.2501051276999995,True,16,Single(1),40.0,1088037545023002395
7.7602716223999995,True,26,Sing

In [54]:
%%bash
tail *.csv

==> eval.csv <==
7.31273323054,Unknown,25,Single(1),40.0,411066950820961322
7.4295782294,Unknown,32,Single(1),38.0,1002950341933487066
8.12623897732,Unknown,22,Single(1),42.0,1451354159195218418
6.2501051276999995,Unknown,25,Single(1),39.0,8599690069971956834
7.81318256528,Unknown,30,Single(1),38.0,8391424625589759186
8.0799419023,Unknown,33,Single(1),39.0,7872612453343038854
3.31354779786,Unknown,25,Single(1),30.0,3095933535584005890
8.37315671076,Unknown,30,Single(1),38.0,1002950341933487066
7.5618555866,Unknown,34,Single(1),38.0,7170969733900686954
8.68841774542,Unknown,29,Single(1),39.0,411066950820961322

==> test.csv <==
6.4992274837599995,Unknown,24,Single(1),37.0,2246942437170405963
9.7554550935,Unknown,22,Single(1),41.0,6782146986770280327
7.68751907594,Unknown,33,Single(1),37.0,1088037545023002395
6.6248909731,Unknown,21,Single(1),38.0,7517141034410775575
8.375361333379999,Unknown,31,Single(1),40.0,1569531340167098963
7.25100379718,Unknown,17,Single(1),34.0,108803754502300239

Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License