<br />

<div style="text-align: center;">
<font size="7">Create sample dataset</font>
</div>
<br />
<div style="text-align: right;">
<font size="4">2020/11/11</font>
<br />
<font size="4">Ryutaro Hashimoto</font>
</div>

___

# Summary

- Prepare a sample dataset for training and validating the model, and upload it to the specified S3 bucket.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-a-Data-Set" data-toc-modified-id="Loading-a-Data-Set-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading a Data Set</a></span></li><li><span><a href="#Transform-according-to-the-SageMaker-format" data-toc-modified-id="Transform-according-to-the-SageMaker-format-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Transform according to the SageMaker format</a></span><ul class="toc-item"><li><span><a href="#Move-the-target-variable-to-the-leftmost-column" data-toc-modified-id="Move-the-target-variable-to-the-leftmost-column-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Move the target variable to the leftmost column</a></span></li><li><span><a href="#Split-into-training-data-and-validation-data" data-toc-modified-id="Split-into-training-data-and-validation-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Split into training data and validation data</a></span></li><li><span><a href="#Output-in-CSV-format" data-toc-modified-id="Output-in-CSV-format-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Output in CSV format</a></span></li></ul></li><li><span><a href="#Upload-the-created-dataset-to-Amazon-S3" data-toc-modified-id="Upload-the-created-dataset-to-Amazon-S3-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Upload the created dataset to Amazon S3</a></span></li></ul></div>

## Loading a Data Set

In [1]:
import pandas as pd
dataset = pd.read_csv('housing.csv')

In [2]:
print(dataset.shape)
dataset[:5]

(506, 13)


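|   | crim | zn | indus | chas | nox | age | rm | dis | rad | tax | ptratio | lstat | medv |
|---|------|----|-------|------|-----|-----|----|-----|-----|-----|---------|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 5.33 | 36.2 |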
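
Before transforming the data, it can help to confirm that the expected columns are present and that no values are missing. The following is a minimal sketch, not part of the original notebook: it reads a small inline stand-in (two rows and four columns taken from the output above) instead of `housing.csv`, and the `expected_columns` list is an assumption made for illustration.

```python
import io

import pandas as pd

# Stand-in for 'housing.csv': two rows and four columns taken from the output above
csv_text = "crim,zn,indus,medv\n0.00632,18.0,2.31,24.0\n0.02731,0.0,7.07,21.6\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Hypothetical list of columns we expect to find in the file
expected_columns = ["crim", "zn", "indus", "medv"]
missing_columns = [c for c in expected_columns if c not in sample.columns]

# Number of missing values per column
null_counts = sample.isnull().sum()

print(sample.shape)
print(missing_columns)
print(int(null_counts.sum()))
```

With the real file, replacing `io.StringIO(csv_text)` with `'housing.csv'` and extending `expected_columns` should give the same kind of check.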


## Transform according to the SageMaker format

### Move the target variable to the leftmost column
Note: For CSV input, Amazon SageMaker's built-in algorithms expect files with no header row and with the target variable in the first (leftmost) column.

In [3]:
# Move column 'medv' to front
dataset = pd.concat([dataset['medv'], dataset.drop(['medv'], axis=1)], axis=1)
dataset.head()

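|   | medv | crim | zn | indus | chas | nox | age | rm | dis | rad | tax | ptratio | lstat |
|---|------|------|----|-------|------|-----|-----|----|-----|-----|-----|---------|-------|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 5.33 |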
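
The same reordering can be wrapped in a small reusable function. This is only a sketch of one possible approach: `move_column_to_front` is a hypothetical helper name, and the three-column frame is a stand-in for the full dataset.

```python
import pandas as pd


def move_column_to_front(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new DataFrame with `column` first and the other columns in their original order."""
    if column not in df.columns:
        raise KeyError(f"column {column!r} not found")
    remaining = [c for c in df.columns if c != column]
    return df[[column] + remaining]


# Small stand-in frame using values from the first two rows above
sample = pd.DataFrame(
    {"crim": [0.00632, 0.02731], "zn": [18.0, 0.0], "medv": [24.0, 21.6]}
)
reordered = move_column_to_front(sample, "medv")
print(list(reordered.columns))
```

Selecting columns by an explicit list avoids the `concat`/`drop` pair and fails early with a clear error if the target column name is misspelled.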


### Split into training data and validation data

In [4]:
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows, chosen at random, as validation data
training_dataset, validation_dataset = train_test_split(dataset, test_size=0.1)

print(training_dataset.shape)
print(validation_dataset.shape)

(455, 13)
(51, 13)
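
Because the split above is random, the rows in each set change from run to run. If a repeatable split is wanted, `train_test_split` accepts a `random_state` argument. The sketch below assumes a seed of 42 (an arbitrary choice) and uses a synthetic 506-row frame as a stand-in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset above
frame = pd.DataFrame(
    {"medv": np.arange(506, dtype=float), "crim": np.arange(506, dtype=float)}
)

# Fixing random_state makes the shuffle, and therefore the split, repeatable
train_a, valid_a = train_test_split(frame, test_size=0.1, random_state=42)
train_b, valid_b = train_test_split(frame, test_size=0.1, random_state=42)

print(train_a.shape, valid_a.shape)
print(train_a.index.equals(train_b.index))
```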


### Output in CSV format

In [5]:
# Write without the index column or the header row
training_dataset.to_csv('training_dataset.csv', index=False, header=False)
validation_dataset.to_csv('validation_dataset.csv', index=False, header=False)
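
One way to confirm that a file written this way has no header row and keeps the target in the first column is to read it back. The following sketch writes a small stand-in frame to a temporary directory rather than touching the real output files; the frame contents are illustrative.

```python
import csv
import os
import tempfile

import pandas as pd

# Stand-in frame with the target ('medv') already in the leftmost column
sample = pd.DataFrame(
    {"medv": [24.0, 21.6], "crim": [0.00632, 0.02731], "zn": [18.0, 0.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(path, index=False, header=False)
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

# If a header had been written, float() would raise ValueError on the column names
first_row = [float(value) for value in rows[0]]
print(len(rows), first_row)
```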

## Upload the created dataset to Amazon S3

In [6]:
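import boto3

# Create an S3 client using the default AWS credentials and region
s3_client = boto3.client('s3')

# Local file to upload, destination bucket, and object key within the bucket
local_path = 'training_dataset.csv'
data_bucket_name = 'sagemaker-tutorial-hashimoto'
bucket_path = 'boston-housing/training_dataset.csv'

s3_client.upload_file(local_path, data_bucket_name, bucket_path)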

In [7]:
# Upload the validation data to the same bucket (reuses s3_client from the previous cell)
local_path = 'validation_dataset.csv'
data_bucket_name = 'sagemaker-tutorial-hashimoto'
bucket_path = 'boston-housing/validation_dataset.csv'

s3_client.upload_file(local_path, data_bucket_name, bucket_path)
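
The two upload cells differ only in the file name, so they could be folded into a single loop. The sketch below is one possible way to do that: `upload_datasets` and `RecordingClient` are hypothetical names introduced here, and the recording client is a stand-in that lets the helper be dry-run without AWS credentials. With a real client, pass `boto3.client('s3')` instead.

```python
import os


def upload_datasets(s3_client, bucket_name, prefix, local_paths):
    """Upload each local file to <prefix>/<file name> in the bucket and return the object keys."""
    uploaded_keys = []
    for local_path in local_paths:
        key = f"{prefix.rstrip('/')}/{os.path.basename(local_path)}"
        # Same positional call as in the cells above: file name, bucket, key
        s3_client.upload_file(local_path, bucket_name, key)
        uploaded_keys.append(key)
    return uploaded_keys


class RecordingClient:
    """Stand-in for an S3 client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


dry_run_client = RecordingClient()
keys = upload_datasets(
    dry_run_client,
    'sagemaker-tutorial-hashimoto',
    'boston-housing',
    ['training_dataset.csv', 'validation_dataset.csv'],
)
print(keys)
```

Passing the client in as an argument keeps the helper independent of how credentials are configured and makes it easy to check the generated keys before uploading anything.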

In [8]:
# End of File