# 1. Introduction

## Finding a dataset

We want to find a dataset related to cancer research. Candidates include annotated CT scans or chest X-rays for lung cancer. We will use the LUNA16 dataset.

## Using ML Ops

We will use the AWS Free Tier for this project, which gives us access to the following services:
- S3: for storing the LUNA16 dataset
- SageMaker: for training our model
- EC2 Free Tier: for preprocessing and other smaller tasks
- CloudWatch: for monitoring logs

## First sign up on the AWS Free Tier

I had to sign up with my card information, as AWS will bill me if I go over the free tier limit. Let's hope that I don't. I will update this section if I do.

## The LUNA16 dataset

Link: https://luna16.grand-challenge.org/Download/

The dataset is approximately 70 GB, which is too large to handle locally on my laptop. Let's see if the dataset exists on a server somewhere where we can access it, e.g. by streaming it directly into AWS using wget or the AWS CLI, or by downloading the data directly to an S3 bucket. However, the Free Tier limit for storage in S3 buckets is 5 GB. The purpose of this repo is for me to learn ML Ops rather than to spend too much money, so let's see if we can do something else.

# 2. Preprocessing our data in EC2 

## Creating an EC2 instance to download the dataset

It seems like we can download and preprocess the data on an EC2 instance using the instance's own storage. I started by launching an EC2 instance that is "free tier eligible" and chose the "Amazon Linux 2023 AMI". I then chose the only free tier eligible instance type, which was "t3.micro", and allowed SSH traffic only from my IP. I wanted to add 100 GB of storage, but in the free tier you can only add up to 30 GB. The console displays storage in GiB, and 1 GB is approx. 0.93 GiB:

30*0.93 = 27.9

So let's add 27 GiB of storage to be on the safe side.
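The conversion above can be double-checked with a small shell helper. This is only a sketch: the function name `gb_to_gib` is made up for illustration, and it assumes `awk` is available (it normally is on Linux and macOS).

```shell
# Convert decimal gigabytes (10^9 bytes) to binary gibibytes (2^30 bytes).
# gb_to_gib is a hypothetical helper name, not an AWS tool.
gb_to_gib() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 1000000000 / 1073741824 }'
}

gb_to_gib 30   # prints 27.94
```

The exact value is slightly above the 27.9 we got with the rounded factor, so 27 GiB still stays under the limit.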

I then create a key pair because I don't have one. AWS stores the public key, and we download the private key (a .pem file) and store it on our computer. The private key allows us to securely SSH into our EC2 instance without needing a password.

We then move it to a directory where we want to store it. It is also good practice to restrict access to the private key so that only I, the owner, can read it.

chmod 400 {name_of_my_ec2_key}.pem
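The move and the permission change can be combined into one step. The sketch below assumes `~/.ssh` as the storage directory, which is just a common choice, and `install_key` is a hypothetical helper name.

```shell
# Move a downloaded key into a target directory and make it owner-read-only.
# install_key is a hypothetical helper; ~/.ssh is only a common default.
install_key() {
    key_file=$1
    target_dir=${2:-$HOME/.ssh}
    mkdir -p "$target_dir" || return 1
    mv "$key_file" "$target_dir/" || return 1
    installed="$target_dir/$(basename "$key_file")"
    chmod 400 "$installed" || return 1
    # Print the permission string so we can confirm it reads -r--------
    ls -l "$installed" | cut -c1-10
}
```

Calling it with the path of the downloaded .pem file should print `-r--------`, meaning only the owner can read the key.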

## Connecting to our EC2 instance 

ssh -i "{name_of_my_ec2_key}.pem" ec2-user@{EC2-PUBLIC-IP}

You can find the public IP by looking at the Public IPv4 Address column/section on your EC2 instance page.

   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'

This looks more like a painting than an error. Looks like it worked!

## Connecting a Jupyter notebook to our instance

I like working in notebooks so I will now create a jupyter notebook server on my EC2 instance.

Let's check if python3 and jupyter exist first by running:

python3 --version
jupyter --version

Seems like python3 exists but not jupyter, so let's install it by running:

pip3 install jupyter

Alright, got an error. pip3 doesn't exist, so let's install it using yum, which is the package manager for Amazon Linux.

sudo yum install -y python3-pip

Now we can install jupyter:

pip3 install jupyter

Note: We got two dependency conflicts here, which don't necessarily stop jupyter from working. I will continue without resolving them and ignore the warnings for now. A better practice in the future, however, would be to first create a python virtual environment and then install all required packages there, such as jupyter, without worrying about dependency conflicts with packages that exist globally.
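The virtual environment approach mentioned above could look roughly like this. It is a sketch rather than what I actually ran: the function names and the default path `~/jupyter-env` are assumptions, and it relies on the `venv` module being included with the system python3.

```shell
# have_cmd NAME: succeed if a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# setup_jupyter_env [DIR]: create a virtual environment and install jupyter
# into it. Function name and default directory are hypothetical.
setup_jupyter_env() {
    env_dir=${1:-$HOME/jupyter-env}
    have_cmd python3 || { echo "python3 not found" >&2; return 1; }
    python3 -m venv "$env_dir" || return 1
    # Activate the environment in the current shell.
    . "$env_dir/bin/activate"
    # pip inside the venv installs into the venv, not into the global packages.
    pip install jupyter
}
```

After running `setup_jupyter_env`, the environment stays active in that shell. In a new shell it needs to be activated again with `. ~/jupyter-env/bin/activate` before starting jupyter.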

## Getting started on the notebook server 

Let's start the server by running:

jupyter notebook

Since it is running on our EC2 instance and we want to access it locally, we need to forward the notebook's port (e.g. 8888) to our local machine through an SSH tunnel. To create this tunnel, we open a new terminal on our local machine and run:

ssh -i "{name_of_my_ec2_key}.pem" -L 8888:localhost:8888 ec2-user@{EC2-PUBLIC-IP}

Now that the SSH tunnel is set up, we can access the notebook server by visiting one of the URLs that were displayed after running "jupyter notebook".
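The two halves of this setup (server on the instance, tunnel on the local machine) can be wrapped in small helpers. This is a sketch under a few assumptions: `start_notebook`, `open_tunnel` and the `DRY_RUN` variable are names invented here, and the `-N` flag is added so that the second SSH session only forwards the port instead of opening a shell.

```shell
# On the EC2 instance: start Jupyter without trying to open a browser there.
start_notebook() {
    jupyter notebook --no-browser --port="${1:-8888}"
}

# On the local machine: forward a local port to the same port on the instance.
# open_tunnel and DRY_RUN are hypothetical names used for this sketch.
open_tunnel() {
    key=$1
    host=$2
    port=${3:-8888}
    # -N: do not run a remote command, just keep the port forward open.
    set -- ssh -i "$key" -N -L "$port:localhost:$port" "ec2-user@$host"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # show the command instead of running it
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 open_tunnel key.pem 203.0.113.7` prints the ssh command without connecting, which is handy for checking the key path and IP first.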

## Downloading the data from a jupyter notebook

So now we open a jupyter notebook file on the server where we want to download some of the LUNA16 data. 
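Because the volume is much smaller than the full dataset, it may be worth checking the free disk space before each download. The helper below is a sketch with a hypothetical name (`has_free_space`); it only reads the output of `df` and does not download anything.

```shell
# has_free_space DIR NEEDED_GIB: succeed if the filesystem holding DIR has at
# least NEEDED_GIB GiB available. Hypothetical helper for this sketch.
has_free_space() {
    dir=$1
    needed_gib=$2
    # df -P gives stable POSIX columns; -k reports sizes in 1024-byte blocks.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    # 1 GiB = 1024 * 1024 KiB; exit status 0 means "enough space".
    awk -v a="$avail_kib" -v n="$needed_gib" \
        'BEGIN { exit !(a >= n * 1024 * 1024) }'
}
```

In a notebook cell this can be called through `!` or `%%bash`, for example to only start a download when `has_free_space "$HOME" 10` succeeds.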

## Reconnecting to our EC2 instance after stopping it

Let's say we stopped working on our instance. It's a new day, and we now want to reconnect to the instance since we have exited it:

ssh -i "{name_of_my_ec2_key}.pem" ec2-user@{EC2-PUBLIC-IP}

## If you encounter issues with connecting to your EC2 instance

- Make sure it doesn't have to do with the public IP changing. If you e.g. stopped and started your instance, the IP could have changed (a plain reboot keeps the same IP), so use the most recent one.
- You may also need to edit your security group's inbound rules. Find the security group associated with your EC2 instance and edit the inbound rules. Set the type to SSH and the port to 22, and in the source add your IP followed by /32 to limit access to your specific IP. To find your IP, run "curl ifconfig.me" in the terminal. If this doesn't work, you can try setting the source to "My IP".
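Both checks can also be done from the local terminal with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured with credentials that are allowed to read and change EC2 settings. The function names are hypothetical, and the IPv4 check is deliberately loose (it does not verify that each number is at most 255).

```shell
# to_cidr IP: loosely validate a dotted IPv4 address and append /32,
# which restricts the rule to that single address.
to_cidr() {
    if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        printf '%s/32\n' "$1"
    else
        echo "not an IPv4 address: $1" >&2
        return 1
    fi
}

# current_public_ip INSTANCE_ID: ask AWS for the instance's current public IP.
current_public_ip() {
    aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# allow_ssh_from GROUP_ID IP: add an inbound rule for SSH (TCP port 22).
allow_ssh_from() {
    cidr=$(to_cidr "$2") || return 1
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" --protocol tcp --port 22 --cidr "$cidr"
}
```

A call could look like `allow_ssh_from {SECURITY-GROUP-ID} "$(curl -s ifconfig.me)"`, where the security group ID is a placeholder for your own.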

Now try to SSH in again

## Accessing the jupyter notebook

In the terminal: "jupyter notebook" 

## Create the SSH tunnel again from the notebooks port to our local machine

In a different terminal window: ssh -i "{name_of_my_ec2_key}.pem" -L 8888:localhost:8888 ec2-user@{EC2-PUBLIC-IP}

Now we can open one of the localhost links that were listed when we ran "jupyter notebook"

# 3. Saving our preprocessed data in S3

## Creating an S3 bucket

After preprocessing our data in our EC2 instance we are ready to create an S3 bucket where we can store this data. 

In the AWS console we search for S3 and click on "Create Bucket".

We need to enter a name for our bucket. It must be globally unique across all of AWS.

We also choose the region where we want our bucket to reside.

I left the rest of the options as defaults.
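The same bucket can be created from the command line with `aws s3 mb`. This is a sketch: `valid_bucket_name` and `create_bucket` are hypothetical helper names, and the check only covers the basic naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit). It cannot tell whether the name is already taken by someone else.

```shell
# valid_bucket_name NAME: check the basic S3 bucket naming rules.
valid_bucket_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# create_bucket NAME REGION: hypothetical wrapper around "aws s3 mb".
create_bucket() {
    valid_bucket_name "$1" || { echo "invalid bucket name: $1" >&2; return 1; }
    aws s3 mb "s3://$1" --region "$2"
}
```

If the name is already in use globally, `aws s3 mb` itself reports the error, so the name check is only a quick first filter.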

## Uploading data to S3 from our jupyter notebook in our EC2 instance

You can use the following command to upload something from your EC2 instance to your S3 bucket

aws s3 cp /path/to/local/file s3://your-bucket-name/path/in/bucket/
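For a whole directory of preprocessed files, `aws s3 sync` is an alternative to copying files one by one. A minimal sketch, where `upload_dir` is a hypothetical helper and the bucket and prefix are placeholders:

```shell
# upload_dir LOCAL_DIR BUCKET PREFIX: copy a whole directory to S3.
# upload_dir is a hypothetical helper; bucket and prefix are placeholders.
upload_dir() {
    [ -d "$1" ] || { echo "no such directory: $1" >&2; return 1; }
    # sync only uploads files that are missing or changed in the bucket.
    aws s3 sync "$1" "s3://$2/$3"
}
```

Adding `--dryrun` to the `aws s3 sync` call lists what would be transferred without uploading anything, which is useful given the 5 GB free tier limit.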

## Assigning AWS credentials to the EC2 instance

When running the above command to move data between EC2 and S3, you might run into problems related to not having the right credentials, e.g. a "NoCredentialsError". To fix this problem we can assign an IAM role to our EC2 instance. We go to the EC2 dashboard in our AWS console, select the instance, and click on Actions, then Security, and then Modify IAM role. You might need to create a new IAM role. Choose AWS Service as the trusted entity and EC2 as the service/use case. I attached the "AmazonS3FullAccess" permission, which covers both uploading and downloading. I set the role name to EC2_access_S3. Now we can go back to the Modify IAM role window and choose the new IAM role that we created. Our instance is now attached to this new IAM role, and we can go ahead and upload data from EC2 to S3.
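The console steps above have a rough AWS CLI equivalent. This is a sketch that I have not used for this project: it assumes it is run from a machine whose AWS credentials are allowed to manage IAM, and the function names are hypothetical. Unlike the console, the CLI needs the instance profile to be created explicitly, and a newly created profile may take a moment before it can be attached.

```shell
# write_ec2_trust_policy FILE: trust policy that lets EC2 assume the role.
write_ec2_trust_policy() {
    cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# attach_s3_role INSTANCE_ID [ROLE_NAME]: create the role, wrap it in an
# instance profile of the same name, and attach that profile to the instance.
attach_s3_role() {
    instance_id=$1
    role=${2:-EC2_access_S3}
    policy_file=$(mktemp) || return 1
    write_ec2_trust_policy "$policy_file"
    aws iam create-role --role-name "$role" \
        --assume-role-policy-document "file://$policy_file" &&
    aws iam attach-role-policy --role-name "$role" \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess &&
    aws iam create-instance-profile --instance-profile-name "$role" &&
    aws iam add-role-to-instance-profile \
        --instance-profile-name "$role" --role-name "$role" &&
    aws ec2 associate-iam-instance-profile --instance-id "$instance_id" \
        --iam-instance-profile "Name=$role"
    status=$?
    rm -f "$policy_file"
    return "$status"
}
```

Once the role is attached, running `aws sts get-caller-identity` on the instance should return the role's identity instead of a credentials error.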

# 4. Training our model with SageMaker

## Setting things up in the SageMaker console

We have now preprocessed our data in an EC2 instance and saved the data to an S3 bucket that we have created. Now it is time to train/finetune a model of our choice with this data using SageMaker. 

We first want to go to the SageMaker console in AWS. We find "Amazon SageMaker AI" and then click on "Notebooks" in the panel. Create a new notebook instance from here and choose an instance type covered under the free tier if you don't want to be charged. I looked at this article: https://www.cloudzero.com/blog/sagemaker-pricing/ and decided to go with the ml.t3.medium instance. I won't be able to use GPUs on this type of instance, which adds some limitations, but we can see how far we get on it for now, as we want this project to be completely free.

Our SageMaker notebook instance needs to have permissions to S3, so let's set the appropriate IAM role for this. Click on "Create an IAM role" and specify the S3 bucket it should have access to, or allow it access to any S3 bucket in your account.

Now we can open a new notebook by either clicking on "jupyter" or "jupyterlab" - I think the latter has a newer interface. I went with jupyterlab. Choose a notebook with your preferred kernel - I chose a tensorflow based one.

## If you have problems with accessing S3 from your SageMaker instance

You may need to add the action "Action": "s3:GetObject" to the JSON policy of your IAM role. Go to the configuration of your SageMaker notebook, open the IAM role you created (AmazonSageMaker-ExecutionRole) and ensure that it has the s3:GetObject permission. Otherwise you can add it manually.

If this doesn't work, you can try to add a permission and search for e.g. AmazonS3FullAccess. This worked for me.
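A narrower alternative to AmazonS3FullAccess would be an inline policy that only allows listing and reading one bucket. The sketch below is not what I used. The function names and the policy name `S3ReadOnlyForNotebook` are made up, and the role name has to be the full name of your SageMaker execution role as shown in the IAM console.

```shell
# write_s3_read_policy BUCKET FILE: write a read-only policy for one bucket.
write_s3_read_policy() {
    cat > "$2" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::$1"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::$1/*"
    }
  ]
}
EOF
}

# grant_s3_read ROLE_NAME BUCKET: attach the policy inline to the role.
# The policy name below is a hypothetical choice.
grant_s3_read() {
    policy_file=$(mktemp) || return 1
    write_s3_read_policy "$2" "$policy_file"
    aws iam put-role-policy --role-name "$1" \
        --policy-name S3ReadOnlyForNotebook \
        --policy-document "file://$policy_file"
    status=$?
    rm -f "$policy_file"
    return "$status"
}
```

Note that s3:ListBucket applies to the bucket itself while s3:GetObject applies to the objects inside it, which is why the two statements use different resource ARNs.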

## Getting back into our SageMaker instance after we have exited it

Go to the AWS management console. Click on Notebooks. Find your notebook and click on "Open jupyterlab". 

## Stopping notebook instance

Remember to stop the notebook instance to avoid going over the free limit of 250 hours per month on SageMaker
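Stopping can also be done from a terminal with the AWS CLI, which makes it harder to forget. A sketch, assuming the CLI is configured with permission to manage SageMaker. The function names and the `DRY_RUN` variable are hypothetical.

```shell
# list_running_notebooks: names of notebook instances that are still running.
list_running_notebooks() {
    aws sagemaker list-notebook-instances --status-equals InService \
        --query 'NotebookInstances[].NotebookInstanceName' --output text
}

# stop_notebook NAME: stop one notebook instance (DRY_RUN=1 only prints).
stop_notebook() {
    set -- aws sagemaker stop-notebook-instance --notebook-instance-name "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # show the command instead of running it
    else
        "$@"
    fi
}
```

Running `list_running_notebooks` at the end of the day is a quick way to confirm that nothing is still counting against the free hours.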