# Training the Model on SageMaker

In this notebook, I am going to train my model using AWS SageMaker. For this, I need to upload my dataset to an S3 bucket, so let's do that first.

## Uploading data

In [1]:
import sagemaker
import boto3

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the session's default S3 bucket (it is created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

In [3]:
# name of the directory of saved data
data_dir = 'dataset'

# set prefix, a descriptive name for a directory  
prefix = 'pneumonia_data'

In [None]:
# upload all data to S3
import os

train_location = sagemaker_session.upload_data(os.path.join(data_dir, 'train'), key_prefix=prefix+'/train')
validation_location = sagemaker_session.upload_data(os.path.join(data_dir, 'validation'), key_prefix=prefix+'/validation')
print("data upload complete")

data upload complete


In [5]:
# printing the location string will allow me to continue from here later
# delete when submitting the project
print("train_location:", train_location)
print("validation_location:", validation_location)

train_location: s3://sagemaker-us-east-1-595868480840/pneumonia_data/train
validation_location: s3://sagemaker-us-east-1-595868480840/pneumonia_data/validation


In [None]:
train_location = "s3://sagemaker-us-east-1-595868480840/pneumonia_data/train"
validation_location = "s3://sagemaker-us-east-1-595868480840/pneumonia_data/validation"


## Training

From the data exploration notebook, we know that the chest X-ray dataset contains ~1M images, which is relatively small for training a deep neural network. To tackle this problem, I will use transfer learning. First, I will initialize my model with a DenseNet121 that was pretrained on ImageNet, which is easily downloadable from torchvision's models library. Then I will replace its classification layer with a Linear layer that outputs a single value, followed by a Sigmoid activation. There is also a huge class imbalance in the dataset. To tackle this second problem, I will use a weighted binary cross-entropy loss.

My model is an implementation of the [CheXNet](https://arxiv.org/abs/1711.05225) paper.

First, I need to calculate the fractions of negative and positive samples in my data. These fractions are needed for the loss function.

In [6]:
import pandas as pd

In [7]:
train_df = pd.read_csv("train.csv")

total = train_df.shape[0]
pos = (train_df["Finding Labels"] == 1).sum()
neg = total - pos

pos_weight = neg/total
neg_weight = pos/total

print(f"pos-weight: {pos_weight: 0.4f}, neg-weight: {neg_weight: 0.4f}")

pos-weight:  0.9870, neg-weight:  0.0130
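
To show how these two weights enter the loss, here is a minimal sketch of a weighted binary cross-entropy in the spirit of the CheXNet paper, where the loss for one sample is `-(w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p))`. The class name `WeightedBCELoss` and the `eps` clamp are my assumptions for illustration; the actual `weighted_BCELoss` in `source_pytorch/model.py` may differ.

```python
import torch
from torch import nn


class WeightedBCELoss(nn.Module):
    """Hypothetical weighted BCE that expects probabilities (post-sigmoid)."""

    def __init__(self, pos_weight, neg_weight, eps=1e-7):
        super().__init__()
        self.pos_weight = pos_weight
        self.neg_weight = neg_weight
        self.eps = eps  # assumption: clamp to keep log() finite

    def forward(self, probs, targets):
        probs = probs.clamp(self.eps, 1 - self.eps)
        loss = -(self.pos_weight * targets * torch.log(probs)
                 + self.neg_weight * (1 - targets) * torch.log(1 - probs))
        return loss.mean()


# Weights taken from the cell output above.
criterion = WeightedBCELoss(pos_weight=0.9870, neg_weight=0.0130)
targets = torch.tensor([1.0, 0.0])  # one positive and one negative sample

# Predicting low for both samples misses the positive one.
missed_positive = criterion(torch.tensor([0.1, 0.1]), targets)
# Predicting high for both samples raises a false alarm on the negative one.
false_alarm = criterion(torch.tensor([0.9, 0.9]), targets)
print(missed_positive.item(), false_alarm.item())
```

With these weights, missing the rare positive sample is penalised far more heavily than a false alarm on a negative one, which is the point of weighting the loss.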


In the following cells, I will test the functions that I created in the `source_pytorch/model.py` script before training the estimator. This way, if there is an error or typo, I can go back to `model.py` and make the necessary changes. Once everything runs as intended, I will create the estimator object for training.

In [8]:
import torch
from torch import optim  # explicit import, since `optim` is used directly below
from source_pytorch.model import *

In [9]:
torch.__version__

'1.4.0'

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device {}.".format(device))

# Load the training data.
trainloader = train_data_loader(16, "dataset/train")

# Load the validation data.
validationloader = validation_data_loader(16, "dataset/validation")

# initiate model
model = densnet_pretrained().to(device)

# define an optimizer (only the classifier parameters are updated)
optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

# define a learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=1)

# define loss function
criterion = weighted_BCELoss(pos_weight, neg_weight)

# train the model
for epoch in range(3):
    print(f"{epoch+1}/{3}")
    model, training_loss = train(model, trainloader, criterion, optimizer, device, testing=True)
    validation_loss = validation(model, validationloader, criterion, device, testing=True)
    # display the loss values by multiplying by the dataloader length instead of dividing by it,
    # because testing=True runs only one batch
    print(" - training loss "+ str(training_loss*len(trainloader)) + " - val. loss " + str(validation_loss*len(validationloader)))
    print("Learning rate used " + str(optimizer.param_groups[0]['lr']))
    scheduler.step(validation_loss)

print("\nDone")

Using device cpu.
Get train data loader.
Get validation data loader.


Downloading: "https://download.pytorch.org/models/densenet121-a639ec97.pth" to /home/ec2-user/.cache/torch/checkpoints/densenet121-a639ec97.pth


1/3
 - training loss 0.011426616460084917 - val. loss 0.05097552016377449
Learning rate used 0.001
2/3
 - training loss 0.05622316151857376 - val. loss 0.012568872421979904
Learning rate used 0.001
3/3
 - training loss 0.009672258980572224 - val. loss 0.035417623817920685
Learning rate used 0.001

Done


The functions work as expected. So now I will create the PyTorch Estimator object.

In [11]:
from sagemaker.pytorch import PyTorch

output_path = 's3://{}/{}'.format(bucket, prefix)

estimator = PyTorch(entry_point='model.py',
                    source_dir='source_pytorch',
                    role=role,
                    framework_version='1.4',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'epochs': 5,
                        "batch-size": 32,
                        "pos-weight": pos_weight,
                        "neg-weight": neg_weight
                    })
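
SageMaker passes each entry of `hyperparameters` to the entry-point script as a command-line argument (for example `--batch-size 32`) and exposes the data channels and the model directory through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. A minimal sketch of how a training script might read them is shown below. The function `parse_args` and the local fallback paths are hypothetical and are not taken from `source_pytorch/model.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical argument parser for a SageMaker training entry point."""
    parser = argparse.ArgumentParser()

    # Hyperparameters, named exactly like the keys in the estimator above.
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--pos-weight", type=float, default=0.5)
    parser.add_argument("--neg-weight", type=float, default=0.5)

    # Locations provided by SageMaker; the fallbacks are assumptions for local runs.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train-dir",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "dataset/train"))
    parser.add_argument("--validation-dir",
                        default=os.environ.get("SM_CHANNEL_VALIDATION", "dataset/validation"))
    return parser.parse_args(argv)


# Simulate the arguments that the estimator above would pass to the script.
args = parse_args(["--epochs", "5", "--batch-size", "32",
                   "--pos-weight", "0.987", "--neg-weight", "0.013"])
print(args.epochs, args.batch_size, args.pos_weight, args.neg_weight)
```

Note that argparse turns the dashes into underscores, so `--batch-size` becomes `args.batch_size`.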

Now my estimator can be trained by calling the fit method.

In [12]:
estimator.fit({'train': train_location,
              'validation': validation_location})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-10-22 14:41:49 Starting - Starting the training job...
2020-10-22 14:41:52 Starting - Launching requested ML instances.........
2020-10-22 14:43:23 Starting - Preparing the instances for training.........
2020-10-22 14:44:56 Downloading - Downloading input data...............................................................
2020-10-22 14:55:49 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-10-22 14:55:50,886 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-10-22 14:55:50,911 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-10-22 14:55:50,916 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-10-22 14:56:52,173 sagemaker-containers INFO     Module default_user_module_name does not pr

 - training loss 0.01820211831675955 - val. loss 0.015809946657261913
Learning rate used 0.001
2/5
 - training loss 0.017852654449339585 - val. loss 0.01629522839561105
Learning rate used 0.001
3/5
 - training loss 0.017839569625667552 - val. loss 0.017286286156417596
Learning rate used 0.001
4/5
 - training loss 0.016848810568755247 - val. loss 0.015664986120536923
Learning rate used 0.0001
5/5
 - training loss 0.016814828571150624 - val. loss 0.015647853339711824
Learning rate used 0.0001
[2020-10-22 16:48:23.424 algo-1:42 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-10-22 16:48:23,933 sagemaker-containers INFO     Reporting training SUCCESS

2020-10-22 16:48:41 Uploading - Uploading generated training model
2020-10-22 16:48:41 Completed - Training job completed
Training seconds: 74
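
The log shows the learning rate dropping from 0.001 to 0.0001 at epoch 4. The sketch below replays the five logged validation losses through a simplified, pure-Python version of the plateau rule to show why. It assumes the training script uses the same `ReduceLROnPlateau(optimizer, 'min', patience=1)` setup as the local test above, together with PyTorch's default `factor=0.1` and relative `threshold=1e-4`; the function `simulate_reduce_on_plateau` is hypothetical and ignores cooldown and `min_lr`.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.1,
                               patience=1, threshold=1e-4):
    """Simplified replay of a 'min'-mode plateau scheduler (illustration only)."""
    best = float("inf")
    bad_epochs = 0
    lrs_used = []
    for loss in val_losses:
        lrs_used.append(lr)  # learning rate in effect during this epoch
        if loss < best * (1 - threshold):
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` epochs without improvement: shrink the rate.
            lr *= factor
            bad_epochs = 0
    return lrs_used


# Validation losses copied from the training log above.
logged_val_losses = [
    0.015809946657261913,
    0.01629522839561105,
    0.017286286156417596,
    0.015664986120536923,
    0.015647853339711824,
]
lrs = simulate_reduce_on_plateau(logged_val_losses)
print([round(value, 6) for value in lrs])
```

Epochs 2 and 3 both fail to improve on the epoch 1 validation loss, so the second consecutive bad epoch exceeds `patience=1` and the rate is reduced before epoch 4.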

## Evaluating

First, I will create a predictor object by deploying my estimator.

In [13]:
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(model_data=estimator.model_data,
                     role=role,
                     framework_version='1.4',
                     entry_point='model.py',
                     source_dir='source_pytorch')

# deploy your model to create a predictor
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-----------------!

I will create a "testloader" which will help me to go through all the test images and their labels using a for loop.

In [14]:
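from torchvision import datasets, transforms

# same ImageNet normalization statistics that the pretrained backbone expects
test_transform = transforms.Compose([transforms.CenterCrop(224),
                                     transforms.ToTensor(),
                                     transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                          std=[0.229, 0.224, 0.225])])
test_data = datasets.ImageFolder('dataset/test', transform=test_transform)
# batch_size=1 so that each request to the endpoint carries a single image
testloader = torch.utils.data.DataLoader(test_data, batch_size=1)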

Now I will go through all the test images and store their labels in the "labels" array and their predictions in the "preds" array. Once all the images are predicted, I can calculate the results.

In [17]:
import numpy as np

preds = np.array([])
labels = np.array([])

for image, im_label in testloader:
    im_pred = predictor.predict(image)
    preds = np.append(preds, im_pred.squeeze())
    labels = np.append(labels, im_label)

In [18]:
# convert the predictions to binary results
result = np.zeros(preds.shape)
th = 0.5
result[preds > th] = 1.

In [19]:
# calculate metrics
tp = np.logical_and(labels, result).sum()
fp = np.logical_and(1-labels, result).sum()
tn = np.logical_and(1-labels, 1-result).sum()
fn = np.logical_and(labels, 1-result).sum()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = 2 * (precision*recall)/(precision+recall)

print('accuracy:',accuracy)
print('recall', recall)
print('precision', precision)
print('f1 score', f1)

accuracy: 0.6389396709323584
recall 0.64
precision 0.020075282308657464
f1 score 0.0389294403892944


In [20]:
print("TP", tp)
print("TN", tn)
print("FP", fp)
print("FN", fn)

TP 16
TN 1382
FP 781
FN 9
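
As a quick cross-check, the same metrics can be recomputed directly from the four counts above. The helper `binary_metrics` is a hypothetical function written for this illustration, with guards against division by zero. The last lines compute, only for comparison, the accuracy that a classifier predicting "negative" for every image would reach on this test set.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Counts copied from the cell output above.
metrics = binary_metrics(tp=16, fp=781, tn=1382, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Accuracy of always predicting "negative": every true negative and
# every false positive above would then be classified correctly.
baseline_accuracy = (1382 + 781) / (16 + 781 + 1382 + 9)
print(f"all-negative baseline accuracy: {baseline_accuracy:.4f}")
```

With only 25 positive images out of 2188, accuracy alone says little here, so recall and precision are the more informative numbers.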


The model's performance is not close to the results reported in the paper because it was trained for only 5 epochs to limit the cost. To get better performance, the model needs to be trained for more epochs.

## Deleting Endpoint

In [21]:
predictor.delete_endpoint()