To add this notebook to a git repository, see https://www.youtube.com/watch?v=cnS813vKmPk

In [2]:
import boto3
import uuid
import pathlib
import pickle
import pandas as pd
import sagemaker
from sklearn.model_selection import StratifiedKFold

Creating a class that stores parameters. This is done by subclassing dict and adding the ability to serialise the dictionary directly via pickle.

In [3]:
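class parameterdict(dict):
    """A dict that remembers a file name and can pickle itself to that file."""

    def __init__(self, file_name):
        super().__init__()
        self.file_name = file_name

    def store(self):
        # 'wb' overwrites the file so that load() always returns the latest snapshot
        # ('ab' would append, and pickle.load only reads the first object in the file)
        with open(self.file_name, 'wb') as dbfile:
            pickle.dump(self, dbfile)

    def load(self):
        with open(self.file_name, 'rb') as dbfile:
            db = pickle.load(dbfile)
        return db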
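
As a rough illustration of the store/load round trip, here is a minimal sketch. The class name `ParameterStore`, the temporary file and the example values are assumptions made for this illustration only. It pickles a plain `dict` copy and opens the file in `'wb'` mode, because in append mode `pickle.load` would return the oldest snapshot in the file rather than the newest.

```python
import pickle
import tempfile
from pathlib import Path


class ParameterStore(dict):
    """Hypothetical stand-in for the notebook's parameter class."""

    def __init__(self, file_name):
        super().__init__()
        self.file_name = str(file_name)

    def store(self):
        # 'wb' truncates the file, so it always holds exactly one snapshot.
        with open(self.file_name, "wb") as fh:
            # Pickle a plain dict copy so the file can be read back
            # even where this class is not importable.
            pickle.dump(dict(self), fh)

    def load(self):
        with open(self.file_name, "rb") as fh:
            return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    params = ParameterStore(Path(tmp) / "params.pkl")
    params["my_region"] = "eu-west-1"
    params.store()
    params["bucket_name"] = "example-bucket"  # made-up value
    params.store()  # second snapshot replaces the first
    restored = params.load()

print(restored)
```

The trade-off of pickling a plain copy is that `load()` returns an ordinary dictionary, which is usually enough when the goal is just to carry parameters over to another notebook.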

The default SageMaker IAM role can only access S3 buckets whose name contains 'sagemaker' or 'SageMaker', or that carry one of those two words as a tag. To avoid having to set up another IAM role with S3 permissions, we simply include 'sagemaker' in the bucket's prefix (the prefix is only the first part of the final bucket name).

In [4]:
s3 = boto3.resource('s3')
session = boto3.session.Session()

PARAMETERS = parameterdict('titanic_parameters.pkl')
PARAMETERS["my_region"] = session.region_name

Each S3 bucket name must be unique, not only within our own AWS account but across all of AWS. If the user specifies the bucket name directly, there is therefore a fair chance that the name has already been taken. This raises a BucketAlreadyExists error stating that 'The requested bucket name is not available'. To avoid this, a random suffix is appended to the prefix.

In [7]:
def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 characters long
    return ''.join([bucket_prefix, uuid.uuid4().hex[:20]])

PARAMETERS["bucket_name"] = create_bucket_name('s3.sagemaker.end2endtitanic.')
# Note: in us-east-1 the CreateBucketConfiguration argument must be omitted
bucket = s3.create_bucket(Bucket=PARAMETERS["bucket_name"],
                          CreateBucketConfiguration={'LocationConstraint': PARAMETERS["my_region"]})
bucket = s3.Bucket(PARAMETERS["bucket_name"])
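
The name generation can also be checked locally before calling AWS. The sketch below is an illustration only: the helper `make_bucket_name` and its regular expression are assumptions that encode the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit; no consecutive dots). It does not contact S3, so it cannot tell whether a name is already taken.

```python
import re
import uuid

# Basic S3 bucket naming rules, expressed as a pattern (3-63 characters overall).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix, suffix_len=20):
    """Append a random hex suffix to `prefix` and validate the result."""
    name = prefix + uuid.uuid4().hex[:suffix_len]
    if not _BUCKET_RE.match(name) or ".." in name:
        raise ValueError(f"invalid S3 bucket name: {name!r}")
    return name


name = make_bucket_name("s3.sagemaker.end2endtitanic.")
print(name, len(name))
```

With a 28-character prefix and a 20-character suffix the result is 48 characters long, comfortably inside the 63-character limit.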

In [8]:
PARAMETERS["bucket_name"]

's3.sagemaker.end2endtitanic.7ededbe5308646a3adbc'

In [5]:
#PARAMETERS["bucket_name"] = 's3.sagemaker.end2endtitanic.7ededbe5308646a3adbc'

Checking out the data

In [7]:
dataset_path = pathlib.Path('/home/ec2-user/SageMaker/Titanic/dataset')
X = pd.read_csv(dataset_path/'train.csv')
X.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Save original data (raw) into S3

In [70]:
data_partition = [['train', 'train.csv'], ['test', 'test.csv']]
file_sizes = 0
for i, (data_type, fn) in enumerate(data_partition):
    key = f"data/raw/{data_type}"
    data_path = f"s3://{bucket.name}/{key}"
    if data_type == 'test':
        PARAMETERS['raw_test_data_path'] = data_path
    # remember where this partition was uploaded
    data_partition[i].append(data_path)
    bucket.Object(key).upload_file(str(dataset_path/fn))
    file_sizes += (dataset_path/fn).stat().st_size
    print(f'Data stored on {data_partition[i][-1]}')

file_sizes = file_sizes / 10**9 #converting size to GigaBytes
PARAMETERS['data_set_size'] = file_sizes
print(f"The dataset has a size of {file_sizes}GB")


Data stored on s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/raw/train
Data stored on s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/raw/test
The dataset has a size of 8.9823e-05GB


In [40]:
survival_rate = sum(X.Survived) / len(X)
PARAMETERS['death_odds'] = (len(X) - sum(X.Survived)) / sum(X.Survived)
print(f'survival rate {survival_rate:.2%}')
print(f"death odds {PARAMETERS['death_odds']:.2}")

survival rate 38.38%
death odds 1.6
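
The two figures above follow from simple counts. The sketch below reproduces the arithmetic under the assumption that the training file has 891 rows of which 342 are survivors; these counts are not printed in the notebook, but they are consistent with the 38.38% and 1.6 shown. The function name `class_balance` is made up for the illustration. The negative-to-positive ratio is the value commonly suggested for XGBoost's `scale_pos_weight` on imbalanced binary problems.

```python
def class_balance(n_total, n_positive):
    """Return (positive rate, negative-to-positive odds)."""
    n_negative = n_total - n_positive
    return n_positive / n_total, n_negative / n_positive


# Assumed counts: 891 passengers, 342 survivors.
rate, odds = class_balance(891, 342)
print(f"survival rate {rate:.2%}")
print(f"death odds {odds:.2}")
```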


In [63]:
X_test = pd.read_csv(dataset_path/'test.csv')
X_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [65]:
def format4xgboost(df):
    header = set(df.columns.values.tolist())
    if "Survived" in header:
        sub_header = header - {'Survived'}
        header = ['Survived'] + sorted(list(sub_header))    
    else:
        header = sorted(list(header))
    df = df[header]
    return df

X = format4xgboost(X)
display(X.head())
X_test = format4xgboost(X_test)
display(X_test.head())

Unnamed: 0,Survived,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
0,0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,A/5 21171
1,1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,PC 17599
2,1,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,STON/O2. 3101282
3,1,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,113803
4,0,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,373450


Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
0,34.5,,Q,7.8292,"Kelly, Mr. James",0,892,3,male,0,330911
1,47.0,,S,7.0,"Wilkes, Mrs. James (Ellen Needs)",0,893,3,female,1,363272
2,62.0,,Q,9.6875,"Myles, Mr. Thomas Francis",0,894,2,male,0,240276
3,27.0,,S,8.6625,"Wirz, Mr. Albert",0,895,3,male,0,315154
4,22.0,,S,12.2875,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,896,3,female,1,3101298
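
The reordering matters because the built-in SageMaker XGBoost algorithm expects CSV training data with the label in the first column and no header row. As a small sketch of the ordering rule on its own, without pandas, the hypothetical helper below (not part of the notebook) works on a plain list of column names:

```python
def xgboost_column_order(columns, label="Survived"):
    """Label first (if present), remaining columns sorted alphabetically."""
    features = sorted(c for c in columns if c != label)
    return [label] + features if label in columns else features


raw_header = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
              "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
train_order = xgboost_column_order(raw_header)
test_order = xgboost_column_order([c for c in raw_header if c != "Survived"])
print(train_order)
print(test_order)
```

Sorting the feature columns the same way for both files keeps the train and test columns aligned, which is what a headerless CSV relies on.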


In [101]:
X.dtypes

Survived         int64
Age            float64
Cabin           object
Embarked        object
Fare           float64
Name            object
Parch            int64
PassengerId      int64
Pclass           int64
Sex             object
SibSp            int64
Ticket          object
dtype: object

In [223]:
"""
X = pd.read_csv('s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/train.csv'
                , names = ['Survived', "Age", "Cabin", "Embarked", "Fare", "Name"
                            , "Parch", "PassengerId", "Pclass", "Sex", "SibSp", "Ticket"])
X.head()
X_test = pd.read_csv('s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/test.csv'
                , names = [ "Age", "Cabin", "Embarked", "Fare", "Name"
                            , "Parch", "PassengerId", "Pclass", "Sex", "SibSp", "Ticket"])
X_test.head()
"""

In [210]:
print(f'X.Embarked values: {X.Embarked.unique()}; X_test.Embarked: {X_test.Embarked.unique()}')
print(f'X.Sex values: {X.Sex.unique()}; X_test.Sex: {X_test.Sex.unique()}')

X.Embarked values: ['S' 'C' 'Q' nan]; X_test.Embarked: ['Q' 'S' 'C']
X.Sex values: ['male' 'female']; X_test.Sex: ['male' 'female']


Performing label encoding on the categorical variables

In [219]:
from collections import defaultdict
from sklearn import preprocessing

# One LabelEncoder per column, created on first access
multi_encoder = defaultdict(preprocessing.LabelEncoder)

def fit_transform_labels(df, names, only_transform=False):
    # LabelEncoder cannot handle NaNs, so replace them with the string 'nan'
    categorical_cols = df.loc[:, names].fillna('nan')
    if only_transform:
        # reuse the encoders already fitted on the training data
        fitted_cols = categorical_cols.apply(lambda col: multi_encoder[col.name].transform(col))
    else:
        fitted_cols = categorical_cols.apply(lambda col: multi_encoder[col.name].fit_transform(col))
    return fitted_cols

X[['Embarked', 'Sex']] = fit_transform_labels(X, ['Embarked', 'Sex'])
X_test[['Embarked', 'Sex']] = fit_transform_labels(X_test, ['Embarked', 'Sex'], True)

To recover the original labels, apply inverse_transform with the same encoders:

In [220]:
X[['Embarked', 'Sex']].apply(lambda col: multi_encoder[col.name].inverse_transform(col)).head()

Unnamed: 0,Embarked,Sex
0,S,male
1,C,female
2,S,female
3,S,female
4,S,male
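
To make the encode/decode mechanics concrete without scikit-learn, here is a simplified sketch. `TinyLabelEncoder` is a hypothetical stand-in written for this illustration; it mimics the relevant behaviour of `LabelEncoder` (classes are sorted and numbered from 0), and the toy columns are made up. The pattern shown is the same as above: one encoder per column, fitted on the training data and only applied to the test data.

```python
from collections import defaultdict


class TinyLabelEncoder:
    """Simplified stand-in for a label encoder: sorted classes -> 0..n-1."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Raises KeyError for labels that were not seen during fit.
        return [self._index[v] for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

    def inverse_transform(self, codes):
        return [self.classes_[i] for i in codes]


# One encoder per column name, created on first access.
encoders = defaultdict(TinyLabelEncoder)

train = {"Embarked": ["S", "C", "Q", "nan"], "Sex": ["male", "female", "female", "male"]}
test = {"Embarked": ["Q", "S"], "Sex": ["male", "female"]}

train_codes = {col: encoders[col].fit_transform(vals) for col, vals in train.items()}
test_codes = {col: encoders[col].transform(vals) for col, vals in test.items()}
decoded = encoders["Embarked"].inverse_transform(train_codes["Embarked"])

print(train_codes)
print(test_codes)
print(decoded)
```

Because uppercase letters sort before lowercase ones, the `'nan'` placeholder receives the highest code in the `Embarked` column.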


In [222]:
import numpy as np

def convert_back_nan(col):
    try:
        # find the integer label that was assigned to the 'nan' placeholder
        nan_label = multi_encoder[col.name].transform(['nan'])[0]
    except ValueError:
        # this column never contained the placeholder
        return col
    # cast to float so that the column can hold np.nan
    new_col = col.astype(float)
    # replace that NaN label with np.nan
    new_col[new_col == nan_label] = np.nan
    return new_col

X[['Embarked', 'Sex']] = X[['Embarked', 'Sex']].apply(convert_back_nan)
X_test[['Embarked', 'Sex']] = X_test[['Embarked', 'Sex']].apply(convert_back_nan)
X.Embarked.unique()

array([ 2.,  0.,  1., nan])

Selecting features

In [227]:
X = X[['Survived', 'Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp']]
X_test = X_test[['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp']]

In [6]:
from sklearn.model_selection import train_test_split

X, X_val = train_test_split(X)

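
When called without arguments, `train_test_split` holds out 25% of the rows and rounds the held-out count up. The sketch below only reproduces that arithmetic; the helper `split_sizes` is hypothetical and the 891-row count is an assumption about the training file. The result agrees with the `668x7 matrix` reported in the training log further down.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Row counts (train, held-out) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_val = split_sizes(891)  # 891 rows is an assumption
print(n_train, n_val)
```

Since the split is random, passing a fixed `random_state` would be needed to obtain the same partition on every run.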

In [7]:
from io import StringIO # python3; python2: BytesIO 

data_partition = [['train', 'train.csv', X], ['validation', 'validation.csv', X_val], ['test', 'test.csv', X_test]]
for partition in data_partition:
    data_type, fn, df = partition
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, header=False, index=False)
    key = f'data/for_xgboost/{fn}'
    bucket.Object(key).put(Body=csv_buffer.getvalue())
    partition[-1] = f's3://{PARAMETERS["bucket_name"]}/{key}'
    print(f'{data_type} saved to {partition[-1]}')

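
The upload step writes each partition as CSV text without a header and without an index column, label first. As an illustration of the resulting file body, the sketch below builds the same kind of text with the standard `csv` module instead of pandas. The helper `rows_to_csv_body` is hypothetical, and the two rows are the first two training passengers after encoding and feature selection (columns Survived, Age, Embarked, Fare, Parch, Pclass, Sex, SibSp).

```python
import csv
from io import StringIO


def rows_to_csv_body(rows):
    """Serialise rows (label first) to CSV text with no header and no index."""
    buffer = StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = [
    [0, 22.0, 2, 7.25, 0, 3, 1, 1],
    [1, 38.0, 0, 71.2833, 0, 1, 0, 1],
]
body = rows_to_csv_body(rows)
print(body)
```

A string like `body` is what gets passed as `Body` when putting the object into the bucket.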

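REMOVE THIS CELL: it only rebuilds the S3 paths of the files uploaded earlier, without uploading anything, so that the notebook can continue when X is no longer in memory.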

In [9]:
data_partition = [['train', 'train.csv'], ['validation', 'validation.csv'], ['test', 'test.csv']]
for partition in data_partition:
    data_type, fn = partition
    key = f'data/for_xgboost/{fn}'
    partition.append(f's3://{PARAMETERS["bucket_name"]}/{key}') 
    print(f'{data_type} saved to {partition[-1]}')

train saved to s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/train.csv
validation saved to s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/validation.csv
test saved to s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/test.csv


In [34]:
PARAMETERS['data_partition'] = data_partition

Importing the xgboost image

In [12]:
# SageMaker Python SDK v1 API; in SDK v2 this helper was replaced by sagemaker.image_uris.retrieve
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(PARAMETERS["my_region"], 'xgboost')

Configuring the training job by setting the number of instances, the instance type and the size of the attached EBS volume

In [14]:
role = sagemaker.get_execution_role()
s3_output_location = "s3://{}/xgboost_model_sdk".format(PARAMETERS['bucket_name'])
xgb_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m4.xlarge',
                                         #file sizes times 1.2 to have some margin but we need to allocate at least 1GB
                                         train_volume_size = 1,#max( 1.2 * file_sizes, 1), 
                                         output_path=s3_output_location,
                                         sagemaker_session=sagemaker.Session())
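
The commented-out expression for the volume size can be written as a small helper. This is a sketch under assumptions: the function name `training_volume_gb` is made up, the 20% margin and the 1 GB floor come from the comment in the cell above, and the result is rounded up because the volume size is given in whole gigabytes.

```python
import math


def training_volume_gb(dataset_bytes, margin=1.2, minimum_gb=1):
    """EBS volume size in whole GB: dataset size plus a margin, at least minimum_gb."""
    size_gb = dataset_bytes / 10**9
    return max(math.ceil(margin * size_gb), minimum_gb)


small = training_volume_gb(89_823)       # the Titanic files, about 8.98e-05 GB
larger = training_volume_gb(3 * 10**9)   # a hypothetical 3 GB dataset
print(small, larger)
```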

In [16]:
#num_class = 2,
xgb_model.set_hyperparameters(max_depth = 5,
                              eta = .2,
                              gamma = 2,
                              scale_pos_weight = 1.6, #PARAMETERS['death_odds'],
                              silent = 0,
                              objective = 'binary:logistic',
                              num_round = 30)

The Amazon XGBoost algorithm requires that data be provided to it via named channels; here the 'train' and 'validation' channels are passed as a dictionary

In [17]:
data_channels = {}
for partition in data_partition[:2]:
    data_channels[partition[0]] = sagemaker.session.s3_input(partition[-1], content_type='text/csv')
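
Stripped of the SageMaker input objects, the channel mapping is just a dictionary from channel name to S3 location. The sketch below shows that shape with plain strings; the helper `build_channels` and the bucket name `example-bucket` are made up for the illustration. The test partition is left out because only the training and validation data are used during fitting.

```python
def build_channels(partitions, wanted=("train", "validation")):
    """Map channel name -> S3 URI for the partitions used during training."""
    return {name: uri for name, _file, uri in partitions if name in wanted}


partitions = [
    ["train", "train.csv", "s3://example-bucket/data/for_xgboost/train.csv"],
    ["validation", "validation.csv", "s3://example-bucket/data/for_xgboost/validation.csv"],
    ["test", "test.csv", "s3://example-bucket/data/for_xgboost/test.csv"],
]
channels = build_channels(partitions)
print(channels)
```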

In [18]:
xgb_model.fit(inputs=data_channels,  logs=True)

2019-07-15 18:56:15 Starting - Starting the training job...
2019-07-15 18:56:16 Starting - Launching requested ML instances...
2019-07-15 18:57:14 Starting - Preparing the instances for training.........
2019-07-15 18:58:38 Downloading - Downloading input data
2019-07-15 18:58:38 Training - Downloading the training image...
2019-07-15 18:59:08 Uploading - Uploading generated training model
2019-07-15 18:59:08 Completed - Training job completed

Arguments: train
[2019-07-15:18:58:57:INFO] Running standalone xgboost training.
[2019-07-15:18:58:57:INFO] File size need to be processed in the node: 0.02mb. Available memory size in the node: 8464.18mb
[2019-07-15:18:58:57:INFO] Determined delimiter of CSV input is ','
[18:58:57] S3DistributionType set as FullyReplicated
[18:58:57] 668x7 matrix with 4676 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-07-15:18:58:57:INFO] Determined delimiter o

IDEAL END OF FILE

In [25]:
print(xgb_model.model_data)

s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/xgboost_model_sdk/xgboost-2019-07-15-18-56-15-081/output/model.tar.gz


In [28]:
xgb_model2 = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m4.xlarge',
                                         #file sizes times 1.2 to have some margin but we need to allocate at least 1GB
                                         train_volume_size = 1, #max( 1.2 * file_sizes, 1), 
                                         output_path=s3_output_location,
                                         sagemaker_session=sagemaker.Session(),
                                         model_uri=xgb_model.model_data)

In [30]:
batch_input = 's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/test.csv' # The location of the test dataset

# The location to store the results of the batch transform job
batch_output = f"s3://{PARAMETERS['bucket_name']}/batch-inference" 

transformer = xgb_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)

transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()

........................................!


In [32]:
PARAMETERS['latest_training_job_name'] = xgb_model.latest_training_job.name

In [35]:
PARAMETERS

{'my_region': 'eu-west-1',
 'bucket_name': 's3.sagemaker.end2endtitanic.7ededbe5308646a3adbc',
 'latest_training_job_name': 'xgboost-2019-07-15-18-56-15-081',
 'data_partition': [['train',
   'train.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/train.csv'],
  ['validation',
   'validation.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/validation.csv'],
  ['test',
   'test.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/test.csv']]}

In [52]:
PARAMETERS.store()

In [53]:
PARAMETERS

{'my_region': 'eu-west-1',
 'bucket_name': 's3.sagemaker.end2endtitanic.7ededbe5308646a3adbc',
 'latest_training_job_name': 'xgboost-2019-07-15-18-56-15-081',
 'data_partition': [['train',
   'train.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/train.csv'],
  ['validation',
   'validation.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/validation.csv'],
  ['test',
   'test.csv',
   's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/for_xgboost/test.csv']],
 'raw_test_data_path': 's3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/data/raw/test',
 'data_set_size': 8.9823e-05,
 'death_odds': 1.6}

In [47]:
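tt = pd.read_csv('s3://s3.sagemaker.end2endtitanic.7ededbe5308646a3adbc/batch-inference/test.csv.out',names=['Survived'])
# PassengerId was dropped during feature selection; re-read it from the raw file (assumes the batch output keeps the row order of test.csv)
passenger_ids = pd.read_csv(dataset_path/'test.csv')['PassengerId']
output = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': (tt['Survived'] <= 0.5) + 0})  # NOTE: scores are P(Survived = 1), so `> 0.5` is the usual rule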

Unnamed: 0,Survived
0,1
1,1
2,1
3,1
4,0
5,1
6,0
7,1
8,0
9,1
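
With the `binary:logistic` objective, each line of the batch output is a probability for the positive class (here, Survived = 1). The sketch below shows the usual way to turn such scores into 0/1 labels. The helper names and the scores are made up for the illustration. Note that labelling scores above the threshold as 1 is the opposite of a `<= 0.5` rule, so it is worth checking which direction is intended before submitting predictions.

```python
def parse_scores(text):
    """One probability per non-empty line, as written by the batch job."""
    return [float(line) for line in text.splitlines() if line.strip()]


def to_labels(probabilities, threshold=0.5):
    """Label 1 when the positive-class probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


raw_output = "0.12\n0.91\n0.5\n0.73\n"  # made-up scores
scores = parse_scores(raw_output)
labels = to_labels(scores)
print(labels)
```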
