# Titanic Survivability Linear Learner in SageMaker

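## Prerequisites and Data <a class="anchor" id="pre_and_data"></a>
### Initialize SageMaker  <a class="anchor" id="initsagemaker"></a>
Load the Titanic training file from the S3 bucket into a pandas DataFrame. The training, validation, and test sets are split from this file in the next section.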

In [3]:
import sagemaker
from sagemaker import Session
bucket = 'ml-i6-breakingcode'
prefix = 'sagemaker/ebsco-titanic-survivabiity'

# Define IAM role
import re
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import urllib
import os
import sklearn.preprocessing as preprocessing
import seaborn as sns

role = get_execution_role()
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
training_set = pd.read_csv(data_location)
print("1. finished loading training set")

1. finished loading training set


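### Data Manipulation <a class="anchor" id="inspect_data"></a>
Remove the columns that do not affect the analysis, one-hot encode the categorical columns, and drop rows with missing values. Then shuffle the cleaned data and split it into training, validation, and test sets.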

In [4]:
# Drop identifier and free-text columns that are not used as features
clean_data = training_set.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1)
# One-hot encode the categorical columns
clean_data = pd.get_dummies(clean_data, columns=["Sex", "Embarked"])
# Drop any row that still contains a missing value
clean_data = clean_data.dropna(how='any', axis=0)
clean_data.head()

# Shuffle, then split 70% / 20% / 10% into training, validation, and test sets
train_data, validation_data, test_data = np.split(clean_data.sample(frac=1, random_state=1729), [int(0.7 * len(clean_data)), int(0.9 * len(clean_data))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
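
As a rough offline check of the cleaning and splitting steps above, the sketch below applies the same operations to a small made-up DataFrame and confirms the 70/20/10 split sizes. The rows are invented for illustration and are not taken from the Titanic data, and the helper names `clean` and `split_70_20_10` are hypothetical. It cuts the shuffled frame with `iloc` slices at the same two indices that `np.split` would use.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, one-hot encode categoricals, drop rows with NaN."""
    out = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"])
    out = pd.get_dummies(out, columns=["Sex", "Embarked"])
    return out.dropna(how="any", axis=0)


def split_70_20_10(df: pd.DataFrame, seed: int = 1729):
    """Shuffle the rows, then cut at 70% and 90% of the row count."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    a, b = int(0.7 * n), int(0.9 * n)
    return shuffled.iloc[:a], shuffled.iloc[a:b], shuffled.iloc[b:]


# Invented toy rows that reuse the column names of the Titanic training file.
toy = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Survived": [0, 1] * 10,
    "Pclass": [1, 2, 3, 3] * 5,
    "Name": ["n"] * 20,
    "Sex": ["male", "female"] * 10,
    "Age": [22.0, 38.0, np.nan, 35.0] * 5,  # every fourth row has no age
    "SibSp": [0] * 20,
    "Parch": [0] * 20,
    "Ticket": ["t"] * 20,
    "Fare": [7.25] * 20,
    "Cabin": [None] * 20,                   # dropped before dropna, so harmless
    "Embarked": ["S", "C", "Q", "S"] * 5,
})

cleaned = clean(toy)
train, validation, test = split_70_20_10(cleaned)
print(cleaned.shape, len(train), len(validation), len(test))
```

In this toy case 15 rows survive the `dropna`, so the cuts fall at rows 10 and 13. The label `Survived` ends up in the first column, which is consistent with the `label_column=0` that appears in the training log.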

Upload the training and validation sets to the S3 bucket.

In [5]:
import boto3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
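
`os.path.join` produces valid S3 keys on a Linux notebook instance, but S3 keys always use forward slashes, so the same call would insert backslashes on Windows. A minimal sketch of a platform-independent alternative, using the standard-library `posixpath` module; the helper name `s3_key` is hypothetical:

```python
import posixpath

prefix = "sagemaker/ebsco-titanic-survivabiity"  # same prefix as in the notebook


def s3_key(*parts: str) -> str:
    """Join key components with '/' regardless of the host operating system."""
    return posixpath.join(*parts)


train_key = s3_key(prefix, "train/train.csv")
validation_key = s3_key(prefix, "validation/validation.csv")
print(train_key)
print(validation_key)
```

Either key could then be passed to `Bucket(bucket).Object(...)` in place of the `os.path.join(...)` expression.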

## Standard linear model  <a class="anchor" id="train_linear_model"></a>

With the data preprocessed and stored in the format the algorithm expects, the next step is to train the model. The cell below uses SageMaker's built-in XGBoost container; more details on algorithm containers can be found in the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

In [6]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2019-07-03 15:02:18 Starting - Starting the training job...
2019-07-03 15:02:22 Starting - Launching requested ML instances......
2019-07-03 15:03:23 Starting - Preparing the instances for training......
2019-07-03 15:04:31 Downloading - Downloading input data...
2019-07-03 15:05:05 Training - Downloading the training image..
Arguments: train
[2019-07-03:15:05:24:INFO] Running standalone xgboost training.
[2019-07-03:15:05:24:INFO] File size need to be processed in the node: 0.02mb. Available memory size in the node: 8471.57mb
[2019-07-03:15:05:24:INFO] Determined delimiter of CSV input is ','
[15:05:24] S3DistributionType set as FullyReplicated
[15:05:24] 499x10 matrix with 4990 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-07-03:15:05:24:INFO] Determined delimiter of CSV input is ','
[15:05:24] S3DistributionType set as FullyReplicated
[15:05:24] 142x10 matrix with 

### Accuracy and Fairness of the model <a class="anchor" id="performance_linear_model"></a>
Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inferences) from the model dynamically.

In [7]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------------------!

In [9]:
from sagemaker.predictor import csv_serializer
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

In [10]:
def predict(data, rows=len(test_data)):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

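# Column 0 is the label, so send only the feature columns.
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0.
predictions = predict(test_data.values[:, 1:])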
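
The `predict` helper above splits the feature matrix into chunks, sends each chunk to the endpoint as CSV, and stitches the comma-separated responses back into one array. The sketch below exercises that same chunk-and-join logic offline. `FakePredictor` is a hypothetical stand-in for the deployed endpoint (it returns an arbitrary score per row, not a real XGBoost prediction), `predict_in_batches` is an illustrative name, and the random feature matrix is made up.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in for the endpoint: one score per row, as CSV bytes."""

    def predict(self, array: np.ndarray) -> bytes:
        # Arbitrary score: squash each row's mean into (0, 1) with a sigmoid.
        scores = 1.0 / (1.0 + np.exp(-array.mean(axis=1)))
        return ",".join(f"{s:.6f}" for s in scores).encode("utf-8")


def predict_in_batches(predictor, data: np.ndarray, rows: int = 500) -> np.ndarray:
    """Send `data` in chunks of roughly `rows` rows and concatenate the parsed scores."""
    n_chunks = int(data.shape[0] / float(rows) + 1)
    parts = []
    for chunk in np.array_split(data, n_chunks):
        if chunk.shape[0] == 0:
            continue  # array_split can yield empty chunks; skip them
        response = predictor.predict(chunk).decode("utf-8")
        parts.extend(float(x) for x in response.split(","))
    return np.array(parts)


# Made-up feature matrix: 73 rows and 10 feature columns.
features = np.random.default_rng(0).normal(size=(73, 10))
scores = predict_in_batches(FakePredictor(), features, rows=20)
print(scores.shape)
```

Collecting the parsed floats in a list avoids the leading-comma bookkeeping of the string-concatenation approach, while producing one score per input row in the original order.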



In [11]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

| actual \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 38 | 8 |
| 1 | 7 | 20 |


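Accuracy and precision calculation

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + FN + TP}$$

where TN, FP, FN, and TP are the true-negative, false-positive, false-negative, and true-positive counts from the table above.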

In [19]:
# (TN + TP) / (TN + TP + FP + FN), with FP + FN = 8 + 7 = 15
Accuracy = (38 + 20) / (38 + 20 + 15)
print(round(Accuracy, 4))

0.7945


In [12]:
import numpy as np
from sklearn.metrics import roc_auc_score

roc_auc_score(test_data['Survived'], predictions)

0.8474235104669886

In [14]:
from sklearn.metrics import f1_score

f1_score(test_data["Survived"], np.round(predictions))

0.7272727272727273

In [15]:
from sklearn.metrics import precision_score

precision_score(test_data["Survived"], np.round(predictions))

0.7142857142857143

In [16]:
from sklearn.metrics import recall_score

recall_score(test_data["Survived"], np.round(predictions))

0.7407407407407407
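
As a cross-check, the precision, recall, and F1 scores reported above can be recomputed by hand from the four confusion-matrix counts (TN = 38, FP = 8, FN = 7, TP = 20). The sketch below uses only plain Python arithmetic; the variable names are illustrative.

```python
tn, fp, fn, tp = 38, 8, 7, 20  # counts read from the crosstab of actual vs. predicted

accuracy = (tn + tp) / (tn + fp + fn + tp)          # share of all passengers classified correctly
precision = tp / (tp + fp)                          # share of predicted survivors who survived
recall = tp / (tp + fn)                             # share of actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy  = {accuracy:.4f}")   # 0.7945
print(f"precision = {precision:.4f}")  # 0.7143
print(f"recall    = {recall:.4f}")     # 0.7407
print(f"f1        = {f1:.4f}")         # 0.7273
```

The ROC AUC cannot be recovered this way, because it depends on the unrounded prediction scores rather than on the thresholded counts.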

## Clean-up
When you are finished with this notebook, uncomment and run the cell below. It deletes the hosted endpoint you created, so you are not charged for an instance left running.

In [None]:
#sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)