## Finding a dataset

We want to find a dataset related to cancer research. These could be annotated CT scans or chest X-rays for lung cancer. We will use the LUNA16 dataset.

## Using ML Ops

We will use the AWS Free Tier for this project, which gives us access to the following services:
- S3: for storing the LUNA16 dataset
- SageMaker: for training our model
- EC2: for preprocessing and other smaller tasks
- CloudWatch: for monitoring logs

## First sign up on the AWS Free Tier

I had to sign up with my card information, as AWS will bill me if I go over the free tier limits. Let's hope that I don't. I will update this section if I do.

## The LUNA16 dataset

Link: https://luna16.grand-challenge.org/Download/

The dataset is approximately 70 GB, which is too large to work with locally on my laptop. Let's see if the dataset exists on a server somewhere we can reach from within AWS, e.g. by streaming it with wget or the AWS CLI, or by downloading the data directly to an S3 bucket. However, the Free Tier limit for storage in S3 buckets is 5 GB. The purpose of this repo is for me to learn ML Ops rather than to spend too much money, so let's see if we can do something else.
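For reference, the streaming idea could look roughly like the sketch below. It is only an illustration under assumptions: the URL and bucket name in the usage comment are placeholders (not real LUNA16 locations), and it presumes wget and a configured AWS CLI are available on the machine.

```shell
# Stream a remote file straight into S3 without storing it on the local disk.
# $1 = source URL, $2 = destination such as s3://<bucket>/<key>
stream_to_s3() {
    # "wget -O -" writes the download to stdout;
    # "aws s3 cp - <dest>" reads the upload body from stdin.
    wget -q -O - "$1" | aws s3 cp - "$2"
}

# Usage (placeholder values):
# stream_to_s3 "https://example.org/subset0.zip" "s3://my-example-bucket/luna16/subset0.zip"
```

For streams larger than 50 GB the AWS CLI also expects an `--expected-size` argument, and anything above the 5 GB free tier limit would be billed.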

## Creating an EC2 instance to download the dataset

It seems like we can download and preprocess the data on an EC2 instance, using the storage attached to the instance. I started by launching an EC2 instance that is "free tier eligible" and chose the "Amazon Linux 2023 AMI". I then chose the only free tier eligible instance type, which was "t3.micro", and allowed only SSH traffic from my IP. I wanted to add 100 GB of storage, but in the free tier you can only add up to 30 GB. The storage units displayed in the console are GiB, and 1 GB is approximately 0.93 GiB.

30*0.93 = 27.9

So let's add 27 GiB of storage to be on the safe side.
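The same conversion can be double-checked in a terminal. This is a small sketch assuming awk is installed; `gb_to_gib` is just a helper name made up for this example.

```shell
# Convert decimal gigabytes (10^9 bytes) to binary gibibytes (2^30 bytes).
gb_to_gib() {
    awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 1000^3 / 1024^3 }'
}

gb_to_gib 30   # the 30 GB free tier limit; prints 27.9
```

Rounding down to 27 GiB keeps the volume below the limit whichever unit AWS uses for billing.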

I then create a key pair because I don't have one. AWS generates the key pair and keeps the public key, while the private key (a .pem file) is downloaded and stored on our computer. The private key allows us to securely SSH into our EC2 instance without needing a password.

We then move it to a directory where we want to store it. It is also good to set permissions that restrict access to the private key, so that only I, the owner, can read it.

chmod 400 {name_of_my_ec2_key}.pem
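To see what `chmod 400` does, here is a small sketch that runs it on a throwaway file. The directory and file name are stand-ins for this example, not the real key.

```shell
key_dir="$(mktemp -d)"             # stand-in for e.g. ~/.ssh
key_file="$key_dir/demo-key.pem"   # hypothetical key name
: > "$key_file"                    # empty placeholder file, not a real key

chmod 400 "$key_file"              # owner may read; nobody may write or execute

# Show the permission string; it should read -r--------
ls -l "$key_file" | cut -c1-10
```

If the key file is readable by other users, ssh refuses to use it and warns that the private key file is unprotected.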

## Connecting to our EC2 instance 

ssh -i "{name_of_my_ec2_key}.pem" ec2-user@{EC2-PUBLIC-IP}

You can find the public IP by looking at the Public IPv4 Address column/section on your EC2 instance page.

   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'

This looks more like a painting than an error. Looks like it worked!
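Once logged in, it can be worth confirming how much free space the root volume actually has before downloading anything. A minimal sketch, assuming the standard df and awk tools; `root_avail_gib` is a name invented for this example.

```shell
# Print the free space on the root filesystem in GiB.
root_avail_gib() {
    # -P: one line per filesystem, -k: sizes in 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk / | awk 'NR == 2 { printf "%.1f\n", $4 / 1024 / 1024 }'
}

echo "Free space on /: $(root_avail_gib) GiB"
```

The reported number will be somewhat lower than the 27 GiB volume size, because the operating system itself takes up part of the disk.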

## Connecting a Jupyter notebook to our instance

I like working in notebooks, so I will now set up a Jupyter notebook server on my EC2 instance.

Let's first check if python3 and jupyter exist by running:

python3 --version
jupyter --version
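An alternative way to run the same check for several tools at once is sketched below. It relies on the shell's `command -v` builtin; `have` is a helper name made up for this example.

```shell
# Return success if the given command is available on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

for tool in python3 pip3 jupyter; do
    if have "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not installed"
    fi
done
```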

Seems like python3 exists but not jupyter, so let's install it by running:

pip3 install jupyter

Alright, got an error. pip3 doesn't exist, so let's install it using yum, the package manager command for Amazon Linux (on Amazon Linux 2023 it points to dnf).

sudo yum install -y python3-pip

Now we can install jupyter:

pip3 install jupyter

Note: we got two dependency conflicts here, which don't necessarily stop the jupyter installation from working. I will continue without resolving them and ignore the warnings for now. A good practice in the future, however, would be to first create a Python virtual environment and install all required packages there, such as jupyter, without worrying about dependency conflicts with packages that exist globally.
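The virtual environment approach could look like the sketch below. It is not what I ran here. It assumes Python's built-in venv module is available on the instance, and the directory name is an arbitrary choice.

```shell
VENV_DIR="${VENV_DIR:-$HOME/jupyter-venv}"   # hypothetical location

python3 -m venv "$VENV_DIR"    # create the isolated environment
. "$VENV_DIR/bin/activate"     # activate it in the current shell

# From here on, "python" and "pip" refer to the environment, so jupyter
# can be installed without touching the global packages (needs network access):
# python -m pip install jupyter

command -v python              # shows which interpreter is now active
```

Running `deactivate` leaves the environment again, and deleting the directory removes it completely.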

## Getting started on the notebook server 

Let's start the server by running:

jupyter notebook

Since it is running on our EC2 instance and we want to access it locally, we need to forward the notebook's port (8888 by default) to our local machine through an SSH tunnel. To create this tunnel, we open a new terminal on the local machine and run:

ssh -i "{name_of_my_ec2_key}.pem" -L 8888:localhost:8888 ec2-user@{EC2-PUBLIC-IP}

Now that the SSH tunnel is set up, we can access the notebook server by visiting one of the URLs that were displayed after running "jupyter notebook".
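To avoid retyping the key path, user and port forward every time, the same settings can be kept in an SSH config file. The sketch below writes one to a scratch file. The host alias, IP address and key path are placeholders, and normally the entry would go into ~/.ssh/config instead.

```shell
ssh_cfg="$(mktemp)"   # scratch file for the example

cat > "$ssh_cfg" <<'EOF'
Host luna-ec2
    HostName 203.0.113.10
    User ec2-user
    IdentityFile ~/.ssh/my-ec2-key.pem
    LocalForward 8888 localhost:8888
EOF

# Open only the tunnel (-N runs no remote command) using that file:
# ssh -F "$ssh_cfg" -N luna-ec2
```

With the entry in ~/.ssh/config, `ssh luna-ec2` is enough to log in and forward port 8888 in one step.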


