# [Data Engineering using Amazon Web Services](https://www.udemy.com/course/data-engineering-using-aws-analytics-services/)

## Introduction

## [Docker](https://www.docker.com/resources/what-container/)

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

<div style="text-align:center"><img src="images/docker.png" /></div>

Container images become containers at runtime and in the case of Docker containers – images become containers when they run on Docker Engine. Available for both Linux and Windows-based applications, containerized software will always run the same, regardless of the infrastructure. Containers isolate software from its environment and ensure that it works uniformly despite differences for instance between development and staging.

### Example

First, make sure **Docker** is installed on your system (on Debian/Ubuntu, for example, with ***sudo apt-get install docker.io***), and that you are logged in to your account with ***docker login***. To check that everything works, run ***docker run hello-world***.

As an example, we've created a simple [Dockerfile](https://u.group/thinking/how-to-put-jupyter-notebooks-in-a-dockerfile/) in which we declare a sequence of operations. These operations are executed when someone builds the image on their machine. In this example, we update **apt**, **python3**, and **pip**. Then **pip** installs every package listed in **requirements.txt**, and **module.py** is executed to clean the file **raw_data.csv**. The final step is to open this notebook and load the data from **clean_data.csv** into a **DataFrame**.

<div style="text-align:center"><img src="images/dockerfile.png" /></div>
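The repository's own **module.py** is not reproduced in this document, so the following is only a hypothetical sketch of what such a cleaning step could look like. It assumes that "cleaning" means dropping incomplete and duplicate rows; the function name `clean_csv` and these rules are illustrative, and it uses only the standard library.

```python
import csv


def clean_csv(raw_path, clean_path):
    """Copy raw_path to clean_path, skipping incomplete and duplicate rows.

    Hypothetical stand-in for module.py; the real cleaning rules may differ.
    Returns the number of data rows written.
    """
    seen = set()
    written = 0
    with open(raw_path, newline="") as src, open(clean_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Strip surrounding whitespace; missing fields are read as None.
            values = tuple((row[field] or "").strip() for field in reader.fieldnames)
            # Skip rows with any empty field, and rows that were already written.
            if not all(values) or values in seen:
                continue
            seen.add(values)
            writer.writerow(dict(zip(reader.fieldnames, values)))
            written += 1
    return written
```

Inside the container this could be called as `clean_csv("raw_data.csv", "clean_data.csv")`, with the paths adjusted to wherever the Dockerfile copies the data.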

With the **Dockerfile** in place, run the build command, ***docker build -t username/project .***, which creates the image described in the **Dockerfile** in the local environment. One can push this new image to a registry with ***docker push username/project*** to share it with other developers, or ***pull*** it on another machine. Once the build succeeds, one can start a container from the image with ***docker run -p 8888:8888 username/project***.

The following cell should only work if executed inside the container, since the file **clean_data.csv** is only created when the image is built.

```
import pandas as pd

try:
    df = pd.read_csv('../data/clean_data.csv')
    print('The build was a success, and the file is available')
except FileNotFoundError:
    # clean_data.csv only exists once the image has been built
    print('You are not running this notebook inside the container, or the build was not a success.')

df
```

```
You are not running this notebook inside the container, or the build was not a success.

Unnamed: 0,name,age,job
0,Lucas,24,Professor
1,Pedro,28,Empresario
2,Miguel,21,Advogado
```


## AWS IAM Console

The first thing we should do once we open the **AWS Console** as the **root user** is to create a new **IAM Group**. This step is very intuitive on the **AWS Console**.

First, we create a group with Admin policies, or any policy of your choice. Second, we add a user to this group. This will be the user we will use to navigate **AWS** throughout the beginning of this course. **P.S. Remember to save the new user credentials into a .csv file!!**

To further understand **IAM Groups**, **IAM Roles**, **IAM policies**, and **IAM Users**, read the [AWS Identity and Access Management documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html).

### AWS IAM Structure

When thinking about IAM, there are two broad categories to consider: identities and permissions. 

<div style="text-align:center"><img src="images/iamstructure.png" /></div>

Identities refer to the various mechanisms that AWS provides to identify who is requesting a particular AWS action, for authenticating that person or entity, and for organizing similar entities into groups. Identities include users, groups, credentials, and roles. 

Permissions refer to what a particular identity is allowed to do in the AWS account. Permissions are managed by writing identity-based policies, which are collections of statements.

I highly advise you to read the original documentation [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). 


### Managing AWS IAM with the Command Line Interface (CLI)

You will see throughout this course that the **AWS CLI** is one of the most convenient ways to interact with any **AWS** service. For instance, let us list every user in our **AWS IAM Structure**.

```
! aws iam list-users --profile adminuser
```

```
{
    "Users": [
        {
            "Path": "/",
            "UserName": "AdminUser",
            "UserId": "AIDAREGJCUVZW77Z56TDT",
            "Arn": "arn:aws:iam::077731112307:user/AdminUser",
            "CreateDate": "2022-04-12T02:17:31Z"
        },
        {
            "Path": "/",
            "UserName": "S3FullAccessUser",
            "UserId": "AIDAREGJCUVZ2V4WH7JGH",
            "Arn": "arn:aws:iam::077731112307:user/S3FullAccessUser",
            "CreateDate": "2022-04-10T14:43:33Z",
            "PasswordLastUsed": "2022-04-10T20:06:52Z"
        }
    ]
}
```


Using the `list-*` subcommands, we can list roles, groups, and policies (for example, ```aws iam list-roles```). Furthermore, one can create or delete identities and manage the policies and permissions attached to them. To see all available subcommands, run ```aws iam help```.
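Because the CLI prints JSON, its output is easy to post-process in a script. The snippet below is a minimal sketch that extracts the user names from the `list-users` output shown above; the helper name `user_names` is illustrative, and the sample is a trimmed copy of that output.

```python
import json


def user_names(list_users_json):
    """Return the UserName of every entry in `aws iam list-users` JSON output."""
    payload = json.loads(list_users_json)
    return [user["UserName"] for user in payload.get("Users", [])]


# Trimmed copy of the output shown above (only two fields kept per user).
sample = """
{
    "Users": [
        {"UserName": "AdminUser", "CreateDate": "2022-04-12T02:17:31Z"},
        {"UserName": "S3FullAccessUser", "CreateDate": "2022-04-10T14:43:33Z"}
    ]
}
"""
print(user_names(sample))  # prints ['AdminUser', 'S3FullAccessUser']
```

In a real script, the string could come from running the CLI through `subprocess.run([...], capture_output=True, text=True)` and reading its `stdout`.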

## AWS Cloud9

A development environment is a place in **AWS Cloud9** where you store your project's files and where you run the tools to develop your applications. One can easily create a new **Cloud9** Environment attached to an **EC2** instance via the **AWS Console**. The **Cloud9 IDE** is a frontend to the newly created **EC2** instance. You can use it to quickly deploy applications since you are already inside the **AWS Environment**.

As soon as you open the **IDE**, go to the **source control** tab on the left, and clone the repository of your choice. A good place to start is this guide's own repository, ***[https://github.com/Corbanez97/data_engineering_aws.git](https://github.com/Corbanez97/data_engineering_aws.git)***.

### Jupyter Lab on Cloud9

Furthermore, it is possible to set up **Jupyter Lab** on this **EC2** instance. Once the **Cloud9 IDE** is open, run ***pip install jupyterlab***, along with the pip commands for any add-ons and themes you want. Then, in the terminal, execute ***jupyter lab --ip 0.0.0.0 --port 8890***.

<div style="text-align:center"><img src="images/cloud9jupyter.png" /></div>

Then, go to the **EC2** console to edit the instance's security group. Once you are in the inbound rules section, edit the rules so that the port used by **Jupyter Lab** is open. As you can see below, the last rule opens the previously described port 8890.

<div style="text-align:center"><img src="images/securitygroups_jupyter.png" /></div>

This will allow us to connect to the **Jupyter Lab** server running on the **EC2** instance using its **public IPv4 DNS** followed by a colon and the port number, for instance, ***[ec2-100-24-117-215.compute-1.amazonaws.com:8890](ec2-100-24-117-215.compute-1.amazonaws.com:8890)***. Just paste this address into the browser, and it will lead you to the **Jupyter Lab** hosted on the **EC2**! **(づ￣ ³￣)づ**

If everything worked out fine, you should by now be seeing this notebook via the **AWS EC2**!
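If the page does not load, a quick way to tell whether the security group rule is the problem is to test the port from your local machine. The helper below is a small sketch using only the standard library; the function name `port_is_open` is illustrative, and the host you pass in is your own instance's public DNS.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `port_is_open("<EC2 public IPv4 DNS>", 8890)` returning `False` suggests that either the inbound rule is missing or **Jupyter Lab** is not running on that port.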


## AWS EC2

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the Amazon Web Services (AWS) Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

### AWS EC2 Local connection

It is also possible to connect to the EC2 instance from our local machine using an SSH key. To create a key pair, run the following command:

```
corbanez@corbanez-H110M-H:~$ ssh-keygen
```
It will prompt for the file in which to save the key:

```
Generating public/private rsa key pair.
Enter file in which to save the key (/home/corbanez/.ssh/id_rsa):
```
Make sure you save the SSH key inside the *~/.ssh* folder; otherwise, applications that use the key may reject it because of the permissions of the folder it is stored in. Next, go to the *~/.ssh* folder to find the key you've just created.

```
corbanez@corbanez-H110M-H:~$ cd ~/.ssh
corbanez@corbanez-H110M-H:~/.ssh$ cat key_name.pub
```
The last command should return a MONSTROUS string of random characters. Copy it to the clipboard and open the **Cloud9 IDE**.

On the **IDE**, we must add the newly created public key to our **EC2's *authorized_keys*** file. To do so, open this file with **vi** and add the MONSTROUS string right at the end.

```
ITVCloud9User:~ $ vi ~/.ssh/authorized_keys  
```

Just to make sure everything turned out OK, run the command ```ITVCloud9User:~ $ tail -1 ~/.ssh/authorized_keys```. It should return, again, our MONSTROUS string of characters. If it does, we are good to go back to our local terminal.

Back at the terminal, we must execute the following command using our **EC2 instance's public IPv4 DNS**:

```
corbanez@corbanez-H110M-H:~$ ssh -i ~/.ssh/ivc9user ec2-user@<EC2 public IPv4 DNS>
```

And if everything is **Ok**, we should be able to see a new **EC2** connection! **(♥‿♥)**

<div style="text-align:center"><img src="images/ec2_local_connex.png" /></div>
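If the connection fails with a warning about the key file, one common cause is that **ssh** refuses private keys whose permissions are too open. The check below is a minimal sketch of how to verify this from Python; the helper name `key_is_private` is illustrative, and the path you pass in is whatever you chose when running **ssh-keygen**.

```python
import os
import stat


def key_is_private(path):
    """Return True if only the file's owner has any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # S_IRWXG and S_IRWXO cover every group and "other" permission bit.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For instance, `key_is_private(os.path.expanduser("~/.ssh/ivc9user"))` should return `True`; if it does not, `chmod 600` on the private key file usually resolves it.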

## AWS S3

### Introduction

**Amazon S3** or **Amazon Simple Storage Service** is a service offered by **Amazon Web Services (AWS)** that provides object storage through a web service interface. **Amazon S3** uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network. It can be employed to store any type of object, which allows for uses like storage for Internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.[[Wiki]](https://en.wikipedia.org/wiki/Amazon_S3) 

**S3 Buckets** are one of the pillars of our data lake. That is where the data will arrive, be transformed, and be archived. Therefore, understanding the available storage options is a great way to begin creating Data Lakes and Data Warehouses. Take a look at the documentation describing the different [**S3 Storage Classes**](https://aws.amazon.com/pt/s3/storage-classes/?nc=sn&loc=3).

<div style="text-align:center"><img src="images/s3_storage_classes.png" /></div>

### Creation and Setup

To create a **Bucket**, log in to the **AWS Console** as the **root user** and go to **S3 Services**. The path from here is very straightforward: simply follow every big flashy button, filling in the names and descriptions requested along the way. After creating the bucket, go to the **S3 Console** and create two folders (flashy buttons again): ***landing*** and ***raw***.

Once these two folders are set up, we should create a new **IAM Policy** with permission to list the bucket and full access to its objects. Follow the steps described on the **AWS IAM Console** to create a new policy. Once you find yourself in the **Policies Panel**, switch to the {}JSON mode, and paste the following code:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::<BUCKET_NAME>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
        }
    ]
}
```

Create this **Policy**, and attach it to the desired **IAM Group**. All users in this group should be able to list the bucket's contents and manipulate its objects (saving, deleting, and moving files inside the bucket).
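When the same policy is needed for several buckets, it can be convenient to generate the JSON instead of editing the placeholder by hand. The following is a small sketch that rebuilds the policy document above for a given bucket name; the function name `bucket_policy` is illustrative, and `supernovae` is simply the bucket name used in the examples later in this document.

```python
import json


def bucket_policy(bucket_name):
    """Build the policy document shown above for the given bucket name."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arn},
            # Object actions apply to the keys inside the bucket.
            {"Effect": "Allow", "Action": ["s3:*Object"], "Resource": f"{bucket_arn}/*"},
        ],
    }


print(json.dumps(bucket_policy("supernovae"), indent=4))
```

The printed JSON can be pasted into the {}JSON editor in place of the template with `<BUCKET_NAME>`.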

The next step is to configure these new credentials on our local machine using **AWS CLI**. This is simply done by running the command

```aws configure --profile <name_of_the_profile>```

and entering the **AWS Access Key ID** and **AWS Secret Access Key** from the **credentials.csv** file you saved when you created the user. Now, with the new profile, you can use the **AWS CLI** to list the folders inside any bucket for which the user has permissions. Just use the command

```aws s3 ls s3://<BUCKET_NAME> --profile <name_of_the_profile>```

Since we've created two folders, if everything is working properly, you should receive the following output.

<div style="text-align:center"><img src="images/s3_setup_awscli.png" /></div>
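To double-check which profiles are configured, you can inspect the credentials file that **aws configure** writes. The sketch below assumes the default location, `~/.aws/credentials`, and relies on that file being in INI format; the function name `list_profiles` is illustrative.

```python
import configparser
from pathlib import Path


def list_profiles(credentials_path=None):
    """Return the profile names found in an AWS credentials file."""
    if credentials_path is None:
        # Default location used by the AWS CLI (assumed not to be overridden).
        credentials_path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently ignored
    return parser.sections()
```

After running the configure command above, the profile name you chose should appear in the list returned by `list_profiles()`.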

### Cross-Region Replication

For extra safety, **AWS S3** can replicate a bucket across multiple regions. To have this feature in our data lake, we will create a new bucket with the same name as our current bucket plus a suffix (*copy*). Then, using an **IAM User** with **AmazonS3FullAccess**, we will set a **Replication Rule** on the main bucket. This rule specifies the prefix of the data that will be copied and the name of the bucket (```s3://<DUMP-BUCKET-NAME>```) that will receive our data. While creating the rule, you will be able to enable **Bucket Versioning** on the dump bucket, given you have the correct **IAM Role**.

If you have set up this feature correctly, every change made to the main bucket should be seen on the destination bucket.
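The same rule can also be expressed as data, which helps when it has to be recreated for several buckets. The following sketch builds a replication configuration in the shape accepted by Boto3's `put_bucket_replication`; the function name, the rule ID, and the choice not to replicate delete markers are assumptions made for illustration, and the code has not been run against a real account here.

```python
def replication_config(role_arn, destination_bucket, prefix=""):
    """Build a single-rule replication configuration for an S3 bucket.

    role_arn: ARN of the IAM Role that S3 assumes to copy the objects.
    destination_bucket: name of the bucket that receives the copies.
    prefix: only keys starting with this prefix are replicated ("" means all).
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-to-copy",  # illustrative rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Assumption: deletions in the main bucket are not propagated.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }
```

The resulting dictionary would be passed as the `ReplicationConfiguration` argument, together with `Bucket=<main bucket name>`. Versioning must already be enabled on both buckets for the call to succeed.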

### S3 Browsers

There are many **[S3 Browsers](https://chrome.google.com/webstore/detail/my-s3-browser/lgkbddebikceepncgppakonioaopmbkk)** available for free on the internet. They are a simple way to interact with our buckets just like we interact with folders on our machine.

To access the bucket you've just created, you will need to activate **Static site hosting**. To do so, go to the **Bucket Console**, open the properties tab, and scroll all the way down. The last panel will be disabled. Click edit and select the option to enable it. In the **Index Document** field, enter ```index.html```. Finally, hit **Save changes**.

On the **Static site hosting** panel, you will now see something like this:

<div style="text-align:center"><img src="images/static_site_hosting.png" /></div>

The link with the format ```http://<BUCKET>.s3-website-<region>.amazonaws.com``` is the **Bucket's Endpoint URL**. With this **URL** and your user credentials, you can browse your bucket just like a folder on your computer. Moreover, you can add, delete, and copy objects into and out of the bucket.
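As a quick illustration of that format, the helper below assembles the endpoint from a bucket name and a region. It is only a sketch: the function name is illustrative, the region `us-east-1` is an assumption, and some regions use a dot instead of a dash after `s3-website`, so the URL shown in the console remains the authoritative one.

```python
def website_endpoint(bucket, region):
    """Build the static-site endpoint URL in the dash-separated format shown above."""
    return f"http://{bucket}.s3-website-{region}.amazonaws.com"


print(website_endpoint("supernovae", "us-east-1"))
# prints http://supernovae.s3-website-us-east-1.amazonaws.com
```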

### S3 Bucket Versioning

Given a specific bucket, one can enable **AWS Versioning** capabilities. Versioning in **Amazon S3** is a means of keeping multiple variants of an object in the same bucket. You can use the **S3 Versioning** feature to preserve, retrieve, and restore every version of every object stored in your buckets. 

With versioning you can recover more easily from both unintended user actions and application failures. After versioning is enabled for a bucket, if **Amazon S3** receives multiple write requests for the same object simultaneously, it stores all of those objects. [[Doc]](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)

To apply this option to a bucket, go to your bucket properties and look at the first block on the console. Select **Edit** and enable **Versioning**. With **Bucket Versioning** enabled, we should create a **Lifecycle Rule**. Go to the bucket's management tab and create a rule to start using this feature.

These rules are a way to both protect and clean up our buckets, or to transition objects between **Storage Classes**. By setting the number of days after which an out-of-date object is deleted or transitioned, we make sure we are not spending too much money maintaining our storage. And by keeping a number of previous versions, we stay protected in case we lose our current data. [[Doc]](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
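To make the idea concrete, here is a sketch of one such rule written as the dictionary that Boto3's `put_bucket_lifecycle_configuration` accepts. The function name, the rule ID, the day counts (30 and 90), and the `STANDARD_IA` target class are all assumptions chosen for illustration, not the settings used in this course.

```python
def lifecycle_config(prefix="", transition_days=30, noncurrent_days=90):
    """Build a one-rule lifecycle configuration for a versioned bucket.

    Current objects move to a cheaper storage class after `transition_days`;
    versions that are no longer current are deleted after `noncurrent_days`.
    """
    return {
        "Rules": [
            {
                "ID": "archive-and-expire-old-versions",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" applies the rule to every key
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }
```

The dictionary would be passed as the `LifecycleConfiguration` argument, together with `Bucket=<bucket name>`; the same settings can be entered by hand in the console form.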

**P.S.:** Bucket Versioning will also be enabled on every other bucket, given that you have set up replication of your main storage.

### Managing AWS S3 using AWS Command Line Interface (AWS CLI)

One of the main ways to interact with **S3** is using **AWS CLI**. The command-line interface creates the possibility of scripting and algorithmically managing buckets and objects. See the [Documentation](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html) for deeper learning of different commands.

As an example, we can try listing buckets and objects with ```aws s3 ls s3://<BUCKET_NAME> --profile <profile_name>```.

```
! aws s3 ls s3://supernovae --profile s3user  # shell command running on Linux
```

```
                           PRE archive/
                           PRE ingestion/
                           PRE landing/
```
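Since scripting is the main attraction of the CLI, here is a minimal sketch of how the listing above could be turned into a Python list. The function name `parse_prefixes` is illustrative; it relies only on the `PRE <name>/` layout visible in the output.

```python
def parse_prefixes(ls_output):
    """Return the folder prefixes from the text printed by `aws s3 ls`."""
    prefixes = []
    for line in ls_output.splitlines():
        parts = line.split(None, 1)
        # Folder lines look like "PRE landing/"; object lines start with a date.
        if len(parts) == 2 and parts[0] == "PRE":
            prefixes.append(parts[1].strip())
    return prefixes


sample = """\
                           PRE archive/
                           PRE ingestion/
                           PRE landing/
"""
print(parse_prefixes(sample))  # prints ['archive/', 'ingestion/', 'landing/']
```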


As an example, let us create an *analytics* folder, which will be used in the future to store **.parquet** files for **AWS Athena**.

```
! mkdir -p analytics/test.parquet  # aws s3 does not copy empty directories to the bucket

# cp is the AWS CLI copy command; with the --recursive argument, every subdirectory is copied as well
! aws s3 cp analytics s3://supernovae/analytics --recursive --profile s3user
! aws s3 ls s3://supernovae --profile s3user  # list the new folders

! rm -r analytics
```

```
                           PRE analytics/
                           PRE archive/
                           PRE ingestion/
                           PRE landing/
```


### Interacting with S3 using AWS Python Software Development Kit (Boto3)