<h1> 2. Creating a sampled dataset </h1>

This notebook illustrates:
<ol>
<li> Sampling a BigQuery dataset to create datasets for ML
<li> Preprocessing with Pandas
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

<h2> Create ML dataset by sampling using BigQuery </h2>
<p>
Let's sample the BigQuery data to create smaller datasets.
</p>

In [4]:
import google.datalab.bigquery as bq
query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

There are only a limited number of years and months in the dataset. Let's see what the hashmonths are.

In [5]:
df = bq.Query("SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" + query + ") GROUP BY hashmonth").execute().result().to_dataframe()
print("There are {} unique hashmonths.".format(len(df)))
df.head()

There are 96 unique hashmonths.


Unnamed: 0,hashmonth,num_babies
0,-2126480030009879160,344357
1,8904940584331855459,344191
2,6691862025345277042,338820
3,-1525201076796226340,303664
4,5934265245228309013,324598


Here's a way to get a well distributed portion of the data in such a way that the test and train sets do not overlap:

In [19]:
trainQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) < 3 AND RAND() < 0.0005"
evalQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) = 3 AND RAND() < 0.0005"
traindf = bq.Query(trainQuery).execute().result().to_dataframe()
evaldf = bq.Query(evalQuery).execute().result().to_dataframe()
print("There are {} examples in the train dataset and {} in the eval dataset".format(len(traindf), len(evaldf)))

There are 15118 examples in the train dataset and 1565 in the eval dataset


<h2> Preprocess data using Pandas </h2>
<p>
Let's add extra rows to simulate the lack of ultrasound. In the process, we'll also change the plurality column to be a string.

In [20]:
traindf.head()

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,5.423372,True,13,1,37.0,124458947937444850
1,4.625298,True,14,1,34.0,3095933535584005890
2,6.437498,False,14,1,36.0,-2995620979373137889
3,7.374463,True,14,1,37.0,8387817883864991792
4,6.937947,True,14,1,39.0,-1305143018446161857


Also notice that there are some very important numeric fields that are missing in some rows (the count in Pandas doesn't count missing data)

In [21]:
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,plurality,gestation_weeks,hashmonth
count,15098.0,15118.0,15118.0,15012.0,15118.0
mean,7.244834,27.402037,1.033007,38.615441,-2.900248e+17
std,1.32477,6.207927,0.190488,2.552714,5.241513e+18
min,0.593043,13.0,1.0,17.0,-9.183606e+18
25%,6.563162,22.0,1.0,38.0,-5.107973e+18
50%,7.312733,27.0,1.0,39.0,-1.00295e+18
75%,8.062305,32.0,1.0,40.0,3.572456e+18
max,15.000252,54.0,4.0,47.0,8.59969e+18


In [22]:
import pandas as pd
def preprocess(df):
  # clean up data we don't want to train on
  # in other words, users will have to tell us the mother's age
  # otherwise, our ML service won't work.
  # these were chosen because they are such good predictors
  # and because these are easy enough to collect
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0]
  
  # modify plurality field to be a string
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'].replace(twins_etc, inplace=True)
  
  # now create extra rows to simulate lack of ultrasound
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])

traindf = preprocess(traindf)
evaldf = preprocess(traindf)

In [23]:
traindf.head()

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,5.423372,True,13,Single(1),37.0,124458947937444850
1,4.625298,True,14,Single(1),34.0,3095933535584005890
2,6.437498,False,14,Single(1),36.0,-2995620979373137889
3,7.374463,True,14,Single(1),37.0,8387817883864991792
4,6.937947,True,14,Single(1),39.0,-1305143018446161857


In [24]:
traindf.tail()

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
15113,4.332083,Unknown,47,Multiple(2+),34.0,-1195438672706281328
15114,6.750554,Unknown,49,Multiple(2+),38.0,-5107972924983092617
15115,5.577695,Unknown,49,Multiple(2+),34.0,-7146494315947640619
15116,7.62579,Unknown,50,Single(1),39.0,-4329667052416032880
15117,7.297301,Unknown,54,Single(1),38.0,-9068386407968572094


In [26]:
# describe only does numeric columns, so you won't see plurality
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,gestation_weeks,hashmonth
count,29990.0,29990.0,29990.0,29990.0
mean,7.243869,27.40967,38.624808,-2.798088e+17
std,1.325488,6.206579,2.52262,5.240128e+18
min,0.593043,13.0,17.0,-9.183606e+18
25%,6.563162,22.0,38.0,-5.107973e+18
50%,7.312733,27.0,39.0,-1.00295e+18
75%,8.062305,32.0,40.0,3.572456e+18
max,15.000252,54.0,47.0,8.59969e+18


<h2> Write out </h2>
<p>
In the final versions, we want to read from files, not Pandas dataframes. So, write the Pandas dataframes out as CSV files. 
Using CSV files gives us the advantage of shuffling during read. This is important for distributed training because some workers might be slower than others, and shuffling the data helps prevent the same data from being assigned to the slow workers.


In [27]:
traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)

In [29]:
%bash
wc -l *.csv
head *.csv
tail *.csv

  59980 eval.csv
  29990 train.csv
  89970 total
==> eval.csv <==
5.4233716452,True,13,Single(1),37.0,124458947937444850
4.62529825676,True,14,Single(1),34.0,3095933535584005890
6.4374980503999994,False,14,Single(1),36.0,-2995620979373137889
7.3744626639,True,14,Single(1),37.0,8387817883864991792
6.93794738514,True,14,Single(1),39.0,-1305143018446161857
5.43659938092,True,14,Single(1),38.0,7186614341837170520
8.5870051049,True,14,Single(1),40.0,-6141045177192779423
6.87621795178,True,14,Single(1),39.0,8599690069971956834
5.621787681,True,14,Single(1),40.0,7604198770453299557
7.1980928543,True,14,Single(1),41.0,6691862025345277042

==> train.csv <==
5.4233716452,True,13,Single(1),37.0,124458947937444850
4.62529825676,True,14,Single(1),34.0,3095933535584005890
6.4374980503999994,False,14,Single(1),36.0,-2995620979373137889
7.3744626639,True,14,Single(1),37.0,8387817883864991792
6.93794738514,True,14,Single(1),39.0,-1305143018446161857
5.43659938092,True,14,Single(1),38.0,7186614341837170

Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License