<h1> 2. Creating a sampled dataset </h1>

This notebook illustrates:
<ol>
<li> Sampling a BigQuery dataset to create datasets for ML
<li> Preprocessing with Pandas
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'asl-ml-immersion-temp'
PROJECT = 'asl-ml-immersion'
REGION = 'us-central1'

In [2]:
# Import os environment variables
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

Create the bucket if it doesn't already exist.

In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

<h2> Create ML dataset by sampling using BigQuery </h2>
<p>
Let's sample the BigQuery data to create smaller datasets.
</p>

In [3]:
# Create SQL query using natality data after the year 2000
import google.datalab.bigquery as bq
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  ever_born,
  cigarette_use,
  alcohol_use,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

There are only a limited number of years and months in the dataset. Let's see what the hashmonths are.

In [4]:
# Call BigQuery, grouping by hashmonth, to see how many records fall in each group.
# These counts let us choose a split that gives the train and evaluation percentages we want.
df = bq.Query("SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" + query + ") GROUP BY hashmonth").execute().result().to_dataframe()
print("There are {} unique hashmonths.".format(len(df)))
df.head()

There are 96 unique hashmonths.


Unnamed: 0,hashmonth,num_babies
0,7170969733900686954,331274
1,7146494315947640619,335327
2,1403073183891835564,351299
3,3545707052733304728,327823
4,1451354159195218418,334485
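The hashmonth column gives every record from a given year and month the same key, which is what makes a repeatable, non-overlapping split possible later. A minimal sketch of that idea in plain Python follows. It assumes `hashlib.sha256` as a stand-in for BigQuery's `FARM_FINGERPRINT` (so the numeric values will not match the ones above), and it assumes the post-2000 data spans 2001 through 2008, which is consistent with the 96 unique hashmonths reported but is not stated in this notebook. The helper name `hashmonth` is hypothetical.

```python
import hashlib

def hashmonth(year, month):
    """Deterministic non-negative 63-bit key for a (year, month) pair.

    Stand-in for ABS(FARM_FINGERPRINT(CONCAT(year, month))): sha256 is NOT
    the same function, so the values differ from BigQuery's.
    """
    key = "{}{}".format(year, month).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Take the first 8 bytes and clear the top bit so the result is non-negative
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF

# Assumed year range (2001-2008): 8 years x 12 months
keys = {(y, m): hashmonth(y, m) for y in range(2001, 2009) for m in range(1, 13)}

print(len(set(keys.values())))                   # number of distinct keys
print(hashmonth(2005, 7) == hashmonth(2005, 7))  # same input gives the same key
```

Because the key depends only on the year and month, rerunning the query always places a given month's records in the same group.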


In [5]:
# SQL query to try different moduli of the hashmonth until we get the data split we want for training and evaluation
query_moddedhashmonth ="""
SELECT
   A.moddedhashmonth
   ,A.Frequency
   ,B.Total
   ,IF(B.Total = 0, null, CAST(A.Frequency AS FLOAT64)/B.Total * 100.0) AS Percentage
FROM
  (SELECT  DISTINCT
     MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))), {0}) AS moddedhashmonth
     ,COUNT(*) AS Frequency
  FROM 
    publicdata.samples.natality
  WHERE
    year > 2000
  GROUP BY
    moddedhashmonth
  ) AS A,
  (SELECT
    COUNT(*) AS Total
  FROM 
    publicdata.samples.natality
  WHERE
    year > 2000
  ) AS B
ORDER BY
   A.moddedhashmonth
""".format(4) # Try different values here!

In [6]:
# Let's now see how our chosen modulus of the hashmonth split the data
df = bq.Query(query_moddedhashmonth + " LIMIT 100").execute().result().to_dataframe()
df

Unnamed: 0,moddedhashmonth,Frequency,Total,Percentage
0,0,10268859,33271914,30.863445
1,1,9134316,33271914,27.453533
2,2,7247485,33271914,21.782591
3,3,6621254,33271914,19.90043
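To turn the bucket frequencies into a split, add up the buckets assigned to each side. The sketch below uses the four frequencies from the table above and checks what share of the data lands in training if buckets 0 to 2 are used for training and bucket 3 for evaluation. The variable names are illustrative only.

```python
# Frequencies per MOD(hashmonth, 4) bucket, copied from the table above
frequency = {0: 10268859, 1: 9134316, 2: 7247485, 3: 6621254}
total = sum(frequency.values())

# Buckets 0-2 go to training, bucket 3 to evaluation
train_rows = sum(n for bucket, n in frequency.items() if bucket < 3)
eval_rows = frequency[3]

train_pct = 100.0 * train_rows / total
eval_pct = 100.0 * eval_rows / total
print("train: {:.1f}%  eval: {:.1f}%".format(train_pct, eval_pct))
# prints: train: 80.1%  eval: 19.9%
```

A modulus of 4 therefore gives roughly an 80/20 split, even though the individual buckets are not equally sized.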


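Here's a way to get a well-distributed portion of the data such that the train and eval sets do not overlap. Every record from a given year and month has the same hashmonth, so it lands in exactly one of the two sets: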

In [7]:
# RAND() subsamples the rows within each hashmonth so that we get approximately the record counts we want
trainQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) < 3 AND RAND() < 0.0005"
evalQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) = 3 AND RAND() < 0.0005"
traindf = bq.Query(trainQuery).execute().result().to_dataframe()
evaldf = bq.Query(evalQuery).execute().result().to_dataframe()
print("There are {} examples in the train dataset and {} in the eval dataset".format(len(traindf), len(evaldf)))

There are 13401 examples in the train dataset and 3302 in the eval dataset
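Because the split depends only on `MOD(hashmonth, 4)`, each hashmonth belongs to exactly one set, while `RAND() < 0.0005` only thins the rows within it. The sketch below applies the same rule in plain Python to three hashmonth values that appear in this notebook's outputs, and estimates the sample sizes from the bucket frequencies shown earlier. The function name `assign_split` is hypothetical.

```python
def assign_split(hashmonth, modulus=4, eval_bucket=3):
    """Return 'train' or 'eval' using the same rule as the queries above."""
    return "eval" if hashmonth % modulus == eval_bucket else "train"

# hashmonth values taken from the dataframe and CSV outputs in this notebook
print(assign_split(774501970389208065))   # appears in traindf.head()
print(assign_split(6637442812569910270))  # appears in traindf.tail()
print(assign_split(4740473290291881219))  # appears in eval.csv

# Rough expected sample sizes: rows in the buckets times the sampling rate
sample_rate = 0.0005
expected_train = (10268859 + 9134316 + 7247485) * sample_rate
expected_eval = 6621254 * sample_rate
print(round(expected_train), round(expected_eval))
```

The estimates (about 13325 and 3311) are close to the 13401 and 3302 rows actually returned. The exact counts change from run to run because `RAND()` is not seeded.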


<h2> Preprocess data using Pandas </h2>
<p>
Notice that the mother_race field holds numeric codes (1.0, 4.0, etc.). Let's replace them with text strings, since our final deployed service will ask for text names, not magic numbers.

In [8]:
# Let's look at a small sample of the training data
traindf.head()

Unnamed: 0,weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,ever_born,cigarette_use,alcohol_use,hashmonth
0,7.667677,False,25,1.0,1,36.0,False,4.0,,,774501970389208065
1,7.799955,True,25,1.0,1,38.0,False,1.0,,,774501970389208065
2,7.561856,True,31,1.0,1,39.0,True,1.0,,,774501970389208065
3,5.399121,True,29,1.0,1,36.0,True,2.0,,,774501970389208065
4,7.572879,True,20,1.0,1,40.0,True,1.0,,,774501970389208065


Also notice that some very important numeric fields are missing in some rows. The count row in Pandas excludes missing values, so any count below the total number of rows indicates missing data.

In [9]:
# Let's look at some of the statistics of the training data
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,mother_race,plurality,gestation_weeks,ever_born,hashmonth
count,13395.0,13401.0,9457.0,13401.0,13292.0,13338.0,13401.0
mean,7.233758,27.344153,2.8011,1.037236,38.621652,2.059679,4.396392e+18
std,1.319797,6.157611,9.494604,0.199331,2.566874,1.214632,2.796826e+18
min,0.500449,13.0,1.0,1.0,18.0,1.0,1.244589e+17
25%,6.563162,22.0,1.0,1.0,38.0,1.0,1.622638e+18
50%,7.312733,27.0,1.0,1.0,39.0,2.0,4.329667e+18
75%,8.062305,32.0,1.0,1.0,40.0,3.0,7.108882e+18
max,12.974204,50.0,78.0,4.0,47.0,14.0,9.183606e+18
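The `count` row reports only non-missing values, so a count below the number of rows signals missing data. A small sketch with a made-up dataframe (the values are invented for illustration, not taken from the natality data) shows how to read off the number of missing entries per column:

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries (invented values)
toy = pd.DataFrame({
    "weight_pounds": [7.5, np.nan, 6.1, 8.0],
    "gestation_weeks": [39.0, 40.0, np.nan, np.nan],
    "mother_age": [25, 31, 29, 20],
})

non_missing = toy.count()    # what describe() reports in its 'count' row
missing = toy.isna().sum()   # number of missing values per column
print(pd.DataFrame({"count": non_missing, "missing": missing}))
```

In the table above, for example, weight_pounds has a count of 13395 against 13401 rows, so 6 values are missing.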


In [10]:
# It is always crucial to clean raw data before using it in ML, so we have a preprocessing step
def preprocess(df):
    # Work on a copy so that the caller's dataframe is not modified in place
    df = df.copy()

    # Convert the opaque numeric race codes into human-readable names
    races = dict(zip([-1,1,2,3,4,5,6,7,18,28,39,48],
                   ['Unknown', 'White', 'Black', 'American Indian', 'Chinese', 
                    'Japanese', 'Hawaiian', 'Filipino',
                    'Asian Indian', 'Korean', 'Samoan', 'Vietnamese']))
    # Assign the result back instead of calling fillna/replace with inplace=True
    # on a column selection, which newer pandas versions may not propagate to df
    df['mother_race'] = df['mother_race'].fillna(-1).replace(races)

    # Remove unwanted columns
    del df['ever_born']

    # Drop rows that lack the fields we require (a missing value fails the > 0 test).
    # In other words, users will have to tell us the mother's age, gestation weeks
    # and plurality; otherwise, our ML service won't work.
    # These fields were chosen because they are such good predictors
    # and because they are easy enough to collect.
    df = df[df.weight_pounds > 0]
    df = df[df.mother_age > 0]
    df = df[df.gestation_weeks > 0]
    df = df[df.plurality > 0]

    return df
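The `> 0` filters in `preprocess` also drop rows where the field is missing, because any comparison with NaN evaluates to False. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Invented rows: the second has no gestation_weeks, the third has no weight
toy = pd.DataFrame({
    "weight_pounds": [7.5, 6.2, np.nan],
    "gestation_weeks": [39.0, np.nan, 40.0],
})

# NaN > 0 is False, so rows with a missing value fail the mask
kept = toy[(toy.weight_pounds > 0) & (toy.gestation_weeks > 0)]
print(kept)
print(len(toy) - len(kept), "rows dropped")
```

Only the first row survives here, which is why the row counts fall slightly after preprocessing.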

In [11]:
# Let's see a small sample of the training data now after our preprocessing
traindf = preprocess(traindf)
evaldf = preprocess(evaldf)
traindf.head()

Unnamed: 0,weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,hashmonth
0,7.667677,False,25,White,1,36.0,False,,,774501970389208065
1,7.799955,True,25,White,1,38.0,False,,,774501970389208065
2,7.561856,True,31,White,1,39.0,True,,,774501970389208065
3,5.399121,True,29,White,1,36.0,True,,,774501970389208065
4,7.572879,True,20,White,1,40.0,True,,,774501970389208065


In [12]:
# Let's also look at a sample of the tail
traindf.tail()

Unnamed: 0,weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,hashmonth
13396,5.125748,True,22,White,1,34.0,True,True,True,6637442812569910270
13397,5.996574,False,38,White,1,39.0,True,True,True,6637442812569910270
13398,5.749656,False,22,Black,1,35.0,False,True,True,6637442812569910270
13399,7.374463,True,19,White,1,42.0,False,True,True,6637442812569910270
13400,6.000983,False,26,White,1,38.0,True,True,True,6637442812569910270


In [13]:
# Let's see the training data statistics now after our preprocessing
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,plurality,gestation_weeks,hashmonth
count,13287.0,13287.0,13287.0,13287.0,13287.0
mean,7.234231,27.348235,1.037254,38.6258,4.396987e+18
std,1.31903,6.15427,0.199457,2.554766,2.794539e+18
min,0.500449,13.0,1.0,18.0,1.244589e+17
25%,6.563162,22.0,1.0,38.0,1.622638e+18
50%,7.312733,27.0,1.0,39.0,4.329667e+18
75%,8.062305,32.0,1.0,40.0,7.108882e+18
max,12.974204,50.0,4.0,47.0,9.183606e+18


<h2> Write out </h2>
<p>
In the final version, we want to read from files, not from Pandas dataframes, so write the dataframes out as CSV files.
Reading from CSV files lets us shuffle the data as it is read. This matters for distributed training: some workers might be slower than others, and shuffling helps prevent the same data from always being assigned to the slow workers.


In [32]:
# Write out both training and evaluation csvs
traindf.to_csv('train.csv', index = False, header = False)
evaldf.to_csv('eval.csv', index = False, header = False)
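Since the files are written with `header = False`, whatever reads them back has to supply the column names in the same order. A sketch of a round trip with Pandas follows. It uses an invented two-row dataframe and an in-memory buffer rather than the real `train.csv`, and the name `CSV_COLUMNS` is an assumption made for illustration. The column order matches the preprocessed dataframe shown above.

```python
import io
import pandas as pd

# Column order of the preprocessed dataframe (the file has no header row)
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age", "mother_race",
               "plurality", "gestation_weeks", "mother_married",
               "cigarette_use", "alcohol_use", "hashmonth"]

# Invented rows in the same layout as train.csv
toy = pd.DataFrame([
    [7.5, False, 25, "White", 1, 36.0, False, None, None, 774501970389208065],
    [6.0, True, 31, "Black", 1, 39.0, True, None, None, 774501970389208065],
], columns=CSV_COLUMNS)

buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=False)

# Read it back, supplying the names that the file itself does not carry
buffer.seek(0)
restored = pd.read_csv(buffer, header=None, names=CSV_COLUMNS)
print(restored.dtypes)
```

Keeping the column list in one place avoids silent misalignment if a column is later added to or removed from the preprocessing step.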

Check the files to make sure they were written correctly.

In [2]:
%%bash
wc -l *.csv
head *.csv
tail *.csv

   3264 eval.csv
  13079 train.csv
  16343 total
==> eval.csv <==
5.51155655,False,26,Filipino,1,40.0,True,,,4740473290291881219
5.06181353552,False,20,Unknown,1,36.0,True,,,4740473290291881219
1.1243575362,False,24,Unknown,1,24.0,False,,,4740473290291881219
8.377565956,True,33,Unknown,1,40.0,True,,,4740473290291881219
7.1870697412,False,21,Unknown,1,38.0,False,,,4740473290291881219
8.12623897732,False,36,Unknown,1,39.0,True,,,4740473290291881219
6.2280589015,False,30,Unknown,1,37.0,True,,,4740473290291881219
6.50804597424,True,14,White,1,39.0,False,,,4740473290291881219
6.29860682534,False,35,Unknown,1,38.0,True,,,4740473290291881219
6.2611282408,True,26,Unknown,1,37.0,False,,,4740473290291881219

==> train.csv <==
3.12395025254,False,34,White,1,33.0,True,,,774501970389208065
7.62578964258,True,28,White,1,39.0,True,,,774501970389208065
7.4846937949,True,22,White,1,38.0,False,,,774501970389208065
8.24969784404,False,37,White,1,40.0,True,,,774501970389208065
9.4027154743,False,29,White,

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License