# Cloud computing

------
### Learning objectives:

+ Introduce cloud computing infrastructure

+ Define the components and utility of a virtual machine

+ Discuss best practices for data storage on the cloud

+ Download data from the cloud to use in lessons 3-7


## What Is Cloud Computing?
-----

Cloud computing is the use of compute resources that you access remotely from your physical machine. These resources can be used for data storage or for data processing. In both cases, you are renting resources that are not connected to the keyboard and mouse you are using to access them. The cloud is usually made up of a set of computers that work together as a single system and offer more space and computing power than your local machine. The cloud compute system is made up of various nodes, pictured below as the larger rectangles. Each node is made up of multiple cores, the smaller boxes inside the rectangles, which can be accessed for data storage or data processing. Compute resources can be rented on a long-term basis for data storage (blue in the figure below) or on a shorter-term basis for data processing (red in the figure below).


<p align="center">
  <img src="images/cloudComputing.png" width="60%"/>
</p>

### Virtual Machines

Short-term access to cloud compute resources for data processing is sometimes referred to as a virtual machine (VM). For example, when you built this Jupyter notebook you requested the use of a VM to leverage Google Cloud compute resources. Each time you request a VM you will be asked to define the resources needed for the task you will perform, i.e., the amount of memory and the number of processors. You will need some familiarity with the compute resources your task requires to select these parameters.

Selecting more resources than needed will increase the cost of the process. Selecting fewer resources than needed will increase the processing time, and ultimately the cost of renting the machine, or may prevent the VM from completing the task at all. If insufficient resources are a known issue for a specific piece of software, this is generally noted in its manual, which is another good reason to consult the manual frequently. To build a VM you will need to indicate the amount of memory and the number of processors you would like to use.
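One way to build familiarity with resource requirements is to start from the machine a task already runs on. The following is a minimal sketch, not part of the original lesson, that reports the logical core count and total RAM of the current machine using only the Python standard library. The function name `describe_local_machine` is illustrative, and the memory lookup assumes a POSIX system such as Linux.

```python
import os


def describe_local_machine():
    """Return the logical core count and total RAM (in GB) of this machine."""
    # os.cpu_count() may return None if the count cannot be determined
    cores = os.cpu_count() or 1
    try:
        # Assumption: POSIX system. Total RAM = page size x number of physical pages.
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = round(total_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # memory lookup not supported on this platform
    return {"cores": cores, "ram_gb": ram_gb}


machine = describe_local_machine()
print(machine)
```

Comparing these numbers with how a task behaves locally gives a rough baseline for the memory and processor values to request when building a VM.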

#### Memory, Short Term Storage

Memory is data storage that is immediately accessible to your processor. The amount of memory a compute resource has limits how many processes you can have open at once. The more memory you have, the easier it is to multitask on your machine.

Because VMs are temporary instances that leverage compute resources for a defined period of time, all memory is short term and is cleared when the compute session ends. Write all results to a long-term storage repository before your VM instance ends, as any files generated during your session will be erased.
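As a hedged illustration of saving results before a session ends, the sketch below builds an `aws s3 cp --recursive` command that would copy a local results directory to an S3 bucket. The directory `results/`, the bucket `s3://my-project-bucket/results/`, and the function name `save_results` are hypothetical placeholders. The command is only executed when `run=True`, which assumes the AWS CLI is installed and you have write access to the bucket.

```python
import subprocess


def save_results(local_dir, s3_uri, run=False):
    """Build (and optionally run) an `aws s3 cp` command that uploads local_dir."""
    command = ["aws", "s3", "cp", "--recursive", local_dir, s3_uri]
    if run:
        # check=True raises CalledProcessError if the copy fails
        subprocess.run(command, check=True)
    return command


# Hypothetical paths: replace with your own results directory and bucket
command = save_results("results/", "s3://my-project-bucket/results/")
print(" ".join(command))
```

Leaving `run=False` prints the command without touching any bucket, which is a cheap way to check the paths before the real upload.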


#### What Is a Processor? How Many CPUs/GPUs Do I Need?

A processor is the part of your computer that executes tasks using the data stored in the machine's memory. It is sometimes referred to as the "brain" of the computer. Processors are made up of multiple *cores*. Each core performs a task and returns the answer to the processor, which enables parallelization of computing tasks.

Below is a toy example of how parallelization can speed up a compute task. Here we solve the problem **((5-3) x (10 + 6))/(2 x 2)**. Using parallelization with three cores (the table on the right) reduces the process from 5 steps to only 3 steps.


<table>
<tr><th>One Core (a) </th><th>Three cores (a,b,c)</th></tr>
<tr><td><table></table>

|Steps|Core (a)| 
|--|--|
|1|5 - 3|
|2|10 + 6|
|3|2 x 16|
|4|2 x 2|
|5|32/4|

</td><td>

|Steps|Core a|Core b|Core c| 
|--|--|--|--|
|1|5 - 3|10 + 6|2 x 2|
|2|2 x 16| | |
|3|32/4| | |

</td></tr> </table>
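The three-core schedule in the table can be sketched in Python with the standard library's `concurrent.futures` module. This is an illustration only, not part of the original lesson: the three workers stand in for cores a, b, and c. Note that Python threads do not give true CPU parallelism for work like this, so the sketch shows how the steps are scheduled rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul, sub

# Step 1: three independent operations, one per worker ("core")
with ThreadPoolExecutor(max_workers=3) as pool:
    core_a = pool.submit(sub, 5, 3)    # 5 - 3
    core_b = pool.submit(add, 10, 6)   # 10 + 6
    core_c = pool.submit(mul, 2, 2)    # 2 x 2
    a, b, c = core_a.result(), core_b.result(), core_c.result()

# Step 2: needs the results from cores a and b, so it cannot start earlier
numerator = a * b

# Step 3: needs the numerator and the result from core c
result = numerator / c
print(result)
```

Steps 2 and 3 each depend on earlier results, which is why extra cores cannot shorten the schedule below three steps.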


For this simple example we save only milliseconds by using parallelization, but when you're mapping millions of reads onto a reference genome, parallelization significantly speeds up the process. Most modern machines have a processor with 6 to 8 cores, and the number of cores dictates the amount of parallelization a task can utilize. Virtual machines, however, can provide access to 16 to 64 cores, so a process that takes days on your local machine can be completed in hours. As with memory, there is a limit to the utility of adding cores: if the task you are performing is not parallelizable, or is only minimally parallelizable, and you build a VM instance with more cores than you can use, you will pay for resources that cannot be leveraged.
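The limit on the benefit of extra cores can be made concrete with Amdahl's law, a standard model that is not part of the original lesson and is used here only as an illustration. It assumes a task splits into a fraction that parallelizes perfectly and a remainder that must run serially. The function name and the example fractions (0.5 and 0.95) are arbitrary choices.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is divided across cores
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Compare a half-parallel task with a mostly parallel one (illustrative fractions)
for cores in (1, 8, 16, 64):
    half = amdahl_speedup(0.5, cores)
    mostly = amdahl_speedup(0.95, cores)
    print(f"{cores:>2} cores: 50% parallel -> {half:.2f}x, 95% parallel -> {mostly:.2f}x")
```

Under this model a task that is only half parallelizable can never run more than twice as fast, however many cores the VM has.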

*An episode of The Magic School Bus gives a simple explanation of storage, memory (RAM), and CPUs: season 4, episode 11, "The Magic School Bus Gets Programmed".*


#### Compute Requirements for Common Tasks

1. **SRA data download** <br>
    time with 1 core ---- 312 seconds <br>
    time with 12 cores -- 176 seconds <br>
    RAM: 200MB per core <br>
    This task is bounded more by the network connection and time reading/writing to storage, so adding cores has only minimal benefit. <br>
     <br>
2. **Read mapping human RNAseq data (STAR)** <br>
    time with 1 core ---- 81 minutes <br>
    time with 12 cores -- 26 minutes <br>
    RAM: 40GB <br>
    This task scales well with adding cores.  Adding memory beyond the 40GB requirement doesn't affect performance. <br>
     <br>
3. **Read mapping E. coli RNAseq data (STAR)** <br>
    time with 1 core ---- 32 minutes <br>
    time with 12 cores -- 8 minutes <br>
    RAM: 4GB <br>
    This task scales well with adding cores.  Adding memory beyond the 4GB requirement doesn't affect performance. <br>
     <br>
4. **Variant calling E. coli (freeBayes)** <br>
    time with 1 core ---- 136 seconds <br>
    time with 12 cores -- 15 seconds (requires manual parallelization and concatenation) <br>
    RAM: < 1GB <br>
    This task has minimal memory requirements, and scales very well with adding cores, though the parallelization has to be done manually, as the software contains no multi-threaded option.
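The timings listed above can be turned into a speedup (time with 1 core divided by time with 12 cores) and a parallel efficiency (speedup divided by the number of cores). The sketch below uses only the numbers from the list. The task labels and the function name are illustrative, and because each ratio compares two times in the same unit, seconds and minutes can sit in the same table.

```python
CORES = 12

# (task, time with 1 core, time with 12 cores), copied from the list above
timings = [
    ("SRA data download", 312, 176),      # seconds
    ("STAR, human RNAseq", 81, 26),       # minutes
    ("STAR, E. coli RNAseq", 32, 8),      # minutes
    ("freeBayes, E. coli", 136, 15),      # seconds
]


def speedup_and_efficiency(time_one_core, time_many_cores, cores):
    """Return (speedup, efficiency), where efficiency = speedup / cores."""
    speedup = time_one_core / time_many_cores
    return speedup, speedup / cores


summary = {}
for task, t1, t12 in timings:
    speedup, efficiency = speedup_and_efficiency(t1, t12, CORES)
    summary[task] = (round(speedup, 2), round(efficiency, 2))
    print(f"{task:<22} speedup {speedup:5.2f}x  efficiency {efficiency:4.0%}")
```

The manually parallelized freeBayes run comes closest to the ideal 12x, while the SRA download gains less than 2x, consistent with it being limited by network and storage rather than by cores.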

## Amazon S3 Buckets, One Option for Long Term Data Storage
--------

There are several options for storing data long term. The first and most obvious is storing data on your local machine. However, genomic data are large, and running analyses locally is often not feasible without a very large local machine. Additionally, data should be backed up to another location to protect against hardware failure. Lastly, sharing data from your local machine can be tricky, especially with very large files. Another option is storing data in the cloud. With this option you have access to more storage space than is available on a personal computer, data are stored in multiple locations to protect against hardware failure, and data can be shared by providing a URL.

#### Data Organization

When using AWS cloud compute environments to analyze data, we suggest that you create an S3 bucket for each project. This S3 bucket might contain any raw data files, any code created to analyze the data, final results and figures, and, perhaps most importantly, a README file describing the contents of the S3 bucket and how each file is used or generated. The README file is a digital lab notebook and is useful for sharing data, writing up methods, and duplicating an analysis.
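A project layout like the one described can be prepared locally before it is uploaded to a bucket. The sketch below is one possible arrangement, not a prescribed one: the folder names `raw_data`, `code`, and `results`, the README wording, and the function name `init_project` are all hypothetical. It runs in a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# Hypothetical README skeleton; adapt the headings to your own project
README_TEMPLATE = """# {name}

## Contents

- raw_data/: unmodified input files and where they came from
- code/: scripts and notebooks used to analyze the data
- results/: final tables and figures, with the command that produced each
"""


def init_project(root):
    """Create the project folders and a README skeleton; return what was created."""
    root = Path(root)
    for subfolder in ("raw_data", "code", "results"):
        (root / subfolder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing lab notebook
        readme.write_text(README_TEMPLATE.format(name=root.name))
    return sorted(path.name for path in root.iterdir())


# Demonstrate in a throwaway directory
with tempfile.TemporaryDirectory() as scratch:
    layout = init_project(Path(scratch) / "my_project")
    print(layout)
```

Keeping the README at the top level means it is the first thing a collaborator sees when they list the bucket.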

Often there are files that need to be accessed across multiple projects, such as a reference genome or genome annotation file (discussed more in the next lesson). These files should be stored in their own S3 bucket, and access to that bucket can be granted for each project that requires them.


<p align="center">
  <img src="images/cloudStorage.png" width="50%"/>
</p>

You can find more information about creating S3 buckets, adding data to S3 buckets, and providing access to S3 buckets [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html).

#### Accessing An S3 Bucket

The data that you will need for the next several lessons are stored in an S3 bucket. Before you start working through the lessons you will need to copy the files from the S3 bucket to your notebook. We will use the `aws s3` command with the `cp` subcommand to do this. The `aws s3` suite of commands enables you to interact with and manage S3 buckets. You can run `aws s3 help` to learn about the subcommands and flags available to customize an `aws s3` command.

The syntax of the `aws s3 cp` command is:

`aws s3 cp SOURCE DESTINATION`

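Here the SOURCE is the `aws_research_workflow` folder in the S3 bucket called `nigms-sandbox`, which contains the data we would like to copy. The DESTINATION is a local path. Your current working directory can be represented with `.`; in the cell below we use `aws_research_workflow/` so that the files land in a directory of that name.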

At least one of the arguments must be an S3 location, indicated by a preceding `s3://`. Here our SOURCE is `s3://nigms-sandbox/aws_research_workflow/`.
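To see how an `s3://` location breaks down, the sketch below splits the SOURCE used in this lesson into its bucket name and the path inside the bucket (often called a prefix) using Python's standard `urllib.parse`. The helper name `split_s3_uri` is illustrative and is not part of the AWS tooling.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, prefix); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the part right after s3://, the rest is the path inside it
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://nigms-sandbox/aws_research_workflow/")
print("bucket:", bucket)
print("prefix:", prefix)
```

A local path such as `.` has no `s3://` scheme, so the helper rejects it, which mirrors how the command line distinguishes local paths from S3 locations.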


In [None]:
%%bash

# Copy the contents of the S3 bucket into your notebook
# The --recursive flag tells aws s3 cp to copy the directory and all of its contents
aws s3 cp --recursive s3://nigms-sandbox/aws_research_workflow/ aws_research_workflow/