# [Data Engineering using Amazon Web Services](https://www.udemy.com/course/data-engineering-using-aws-analytics-services/)

## Introduction

## [Docker](https://www.docker.com/resources/what-container/)

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

<div style="text-align:center"><img src="images/docker.png" /></div>

Container images become containers at runtime and in the case of Docker containers – images become containers when they run on Docker Engine. Available for both Linux and Windows-based applications, containerized software will always run the same, regardless of the infrastructure. Containers isolate software from its environment and ensure that it works uniformly despite differences for instance between development and staging.

## Example

First, make sure **Docker** is installed in your system using ***sudo apt-get install docker***, and you are logged in to your account with ***docker login***. To make sure everything is working fine, execute the ***docker run hello-world*** command.

As an example, we've created a simple [Dockerfile](https://u.group/thinking/how-to-put-jupyter-notebooks-in-a-dockerfile/) on which we declare a sequence of operations to be followed. These operations will be executed once one builds the container in their machine. In this example, we will update **apt**, **python3**, and **pip**. Furthermore, **pip** will install all packages described in the **requirements.txt**, and execute **module.py**, which cleans the file **raw_data.csv**. The final step is to open this notebook file and make available every data from **clean_data.csv** as a **DataFrame**. 

<div style="text-align:center"><img src="images/dockerfile.png" /></div>

With the **Dockerfile** in place, run the build command, ***docker build -t username/project .***, which will create the image described on the **Dockerfile** in the local environment. One can push this new container to origin with ***docker push username/project*** to share with other developers or ***pull*** to another machine. Once the build succeeds, one can run the build with 
***docker run -p 8888:8888 username/project***.

The following kernell should only work if executed inside the container, since the file **clean_data.csv** will only be created once the image is build.

In [8]:
import pandas as pd

try:
    df = pd.read_csv('../data/clean_data.csv')
    print('The build was a success, and the file is available')
except:
    print('You are not running this notebook inside the container, or the build was not a success.')

df

You are not running this notebook inside the container, or the build was not a success.


Unnamed: 0,name,age,job
0,Lucas,24,Professor
1,Pedro,28,Empresario
2,Miguel,21,Advogado


## AWS IAM

## AWS Cloud9

A development environment is a place in **AWS Cloud9** where you store your project's files and where you run the tools to develop your applications. One can easily create a new **Cloud9** Environment attached to an **EC2** via the **AWS Console**. The **Cloud9 IDE** is a frontend to the newly created **EC2**. You can use it to quickly deploy applications since you are already inside the **AWS Environment**.

As soon as you open the **IDE**, go to the **sourcecontrol** tab on the left, and clone the repository of your choice. It is a good start to clone, yours truly, ***[https://github.com/Corbanez97/data_engineering_aws.git](https://github.com/Corbanez97/data_engineering_aws.git)***.

Furthermore, it is possible to set up **Docker** and **Jupyter Lab** in this **EC2** For this, you must run all desired pip commands (***pip install jupyterlab***,***pip install addons***, **pip install themes***) to install **Jupyter Lab** once the **Cloud9 IDE** is open. From that, on the terminal, execute the command ***jupyter lab --ip 0.0.0.0 --port 8890***.

<div style="text-align:center"><img src="images/cloud9jupyter.png" /></div>

Then, you should go **EC2's** console to edit its security group. Once you find yourself in the entry rules section, you edit these rules. As you can see, the last security group is set up to the previously described port 8890. 

<div style="text-align:center"><img src="images/securitygroups_jupyter.png" /></div>

This will allow us to connect to the EC2's local host using its public IPv4 DNS followed by "colon port number", i.e., ***[ec2-100-24-117-215.compute-1.amazonaws.com:8890](ec2-100-24-117-215.compute-1.amazonaws.com:8890)***. Just copy this address on the browser, and it will lead you to the Jupyter Lab hosted on the EC2! **(づ￣ ³￣)づ**

If everything worked out fine, you should by now be seeing this notebook via the AWS EC2!
