# Data Engineering Capstone Project


### Project Summary

The purpose of the data engineering capstone project is to combine the techniques learned throughout the program.
In this project, I have chosen to complete the project provided by Udacity: we are going to analyse immigration data for the US and enrich it with other information that helps us answer analytical questions.


To automate the different tasks, we are going to move the data to the AWS cloud and set up a data warehouse in Redshift, which will be fed periodically from S3 by jobs launched by Apache Airflow.

### Project Structure

The project follows these steps:

* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## 1. Scope the Project and Gather Data.

### 1.1. Identify and gather the data that will be used for the project.



For this project we are going to use the I94 data provided, which contains information about immigration into the US. This will be our main source of data. We are going to enrich it with US demographic data, global temperature data, airport data, etc.

### 1.2. Explain what end use cases the data will be prepared for.

We are going to create a data model in a data warehouse that allows us to extract analytical information about different aspects of immigration, such as:
* Visitors by country/airline.
* Effect of temperature/demographic patterns on the trend of visitors.

Other use cases that could follow once the data warehouse is complete are feeding other US systems that need this data, or developing machine learning algorithms for a tourist recommendation engine.

## 2. Explore and Assess the Data.

### 2.1. Overview.

We are going to run some exploratory analysis on the different data sources in order to understand them better. We are going to describe, clean and join those tables to check which data model best fits this use case.

We are going to analyse each data source separately and later see how to build a robust model, but first we import the Python libraries needed to execute the code.

**Note that this Jupyter notebook is a guide to what is going to be scripted separately. For example, the exploratory analysis in this notebook uses Pandas, while the production scripts implement the same logic in Spark.**

In [1]:
# Import of libraries for this exploratory analysis.
import os
import pandas as pd
from datetime import datetime

### 2.2. I94-Immigration- Data
As Wikipedia explains at https://en.wikipedia.org/wiki/Form_I-94,
the I-94 is a document that serves as proof of legal entry into the US.

More exactly, *Form I-94, the Arrival-Departure Record Card, is a form used by U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members). While the form is usually issued by CBP at ports of entry or deferred inspection sites, USCIS can issue an equivalent as part of the Form I-797A approval notice for a Form I-129 petition for an alien worker or a Form I-539 application for extension of stay or change of status (in the case that the alien is already in the United States).*

In our use case, we have all the I94 data for the year 2016 in the path /data/18-83510-I94-Data-2016, split by month and stored in sas7bdat format (the SAS binary dataset format). These 12 datasets have almost 41 million rows and 28 columns. To simplify, we have taken the data for January, which has 2,847,924 rows.
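The notebook only loads January, but the same approach extends to the twelve monthly files. The sketch below builds one path per month. It assumes that every file follows the naming pattern of the January file (`i94_jan16_sub.sas7bdat`), which is not confirmed here; `monthly_i94_paths` is a hypothetical helper name.

```python
# Month abbreviations as they appear in the January file name (i94_jan16_sub.sas7bdat).
MONTH_ABBREVIATIONS = ["jan", "feb", "mar", "apr", "may", "jun",
                       "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_i94_paths(base_dir, year):
    """Build one path per month, ASSUMING every file follows the January naming pattern."""
    two_digit_year = str(year)[-2:]
    return [f"{base_dir}/i94_{month}{two_digit_year}_sub.sas7bdat"
            for month in MONTH_ABBREVIATIONS]


paths = monthly_i94_paths("../../data/18-83510-I94-Data-2016", 2016)
print(paths[0])
```

Each path could then be passed to `pd.read_sas` (or to the Spark reader in the production script) inside a loop.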

In [2]:
immigration_fname = '../../data/18-83510-I94-Data-2016/i94_jan16_sub.sas7bdat'
df_immigration = pd.read_sas(immigration_fname, 'sas7bdat', encoding="ISO-8859-1")

In [3]:
df_immigration.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,7.0,2016.0,1.0,101.0,101.0,BOS,20465.0,1.0,MA,,...,,,1996.0,D/S,M,,LH,346608285.0,424,F1
1,8.0,2016.0,1.0,101.0,101.0,BOS,20465.0,1.0,MA,,...,,,1996.0,D/S,M,,LH,346627585.0,424,F1
2,9.0,2016.0,1.0,101.0,101.0,BOS,20469.0,1.0,CT,20480.0,...,,M,1999.0,07152016,F,,AF,381092385.0,338,B2
3,10.0,2016.0,1.0,101.0,101.0,BOS,20469.0,1.0,CT,20499.0,...,,M,1971.0,07152016,F,,AF,381087885.0,338,B2
4,11.0,2016.0,1.0,101.0,101.0,BOS,20469.0,1.0,CT,20499.0,...,,M,2004.0,07152016,M,,AF,381078685.0,338,B2


This table will be the fact table of our data model. All the dimensions will enrich its study and analysis.
To start the analysis, we check the datatypes of the columns and how many nulls each field contains.

In [4]:
df_immigration.dtypes

cicid       float64
i94yr       float64
i94mon      float64
i94cit      float64
i94res      float64
i94port      object
arrdate     float64
i94mode     float64
i94addr      object
depdate     float64
i94bir      float64
i94visa     float64
count       float64
dtadfile     object
visapost     object
occup        object
entdepa      object
entdepd      object
entdepu      object
matflag      object
biryear     float64
dtaddto      object
gender       object
insnum       object
airline      object
admnum      float64
fltno        object
visatype     object
dtype: object

In [5]:
df_immigration.describe()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,biryear,admnum
count,2847924.0,2847924.0,2847924.0,2847924.0,2847924.0,2847924.0,2847864.0,2325312.0,2846734.0,2847924.0,2847924.0,2846734.0,2847924.0
mean,3175194.0,2016.0,1.0,327.6385,326.6035,20469.13,1.06408,20481.55,37.54639,1.95163,1.0,1978.454,65094010000.0
std,1801695.0,0.0,0.0,209.2268,207.2825,8.767172,0.4720415,19.68092,16.94457,0.5625815,0.0,16.94457,23217020000.0
min,7.0,2016.0,1.0,101.0,101.0,20454.0,0.0,14991.0,-2.0,1.0,1.0,1905.0,0.0
25%,1663577.0,2016.0,1.0,148.0,135.0,20462.0,1.0,20470.0,24.0,2.0,1.0,1966.0,45065360000.0
50%,3313376.0,2016.0,1.0,245.0,245.0,20469.0,1.0,20480.0,36.0,2.0,1.0,1980.0,84176510000.0
75%,4692311.0,2016.0,1.0,574.0,528.0,20477.0,1.0,20488.0,50.0,2.0,1.0,1992.0,85640610000.0
max,6148395.0,2016.0,1.0,999.0,760.0,20484.0,9.0,31442.0,111.0,3.0,1.0,2018.0,99132710000.0


In [6]:
df_immigration.isnull().sum()

cicid             0
i94yr             0
i94mon            0
i94cit            0
i94res            0
i94port           0
arrdate           0
i94mode          60
i94addr      177076
depdate      522612
i94bir         1190
i94visa           0
count             0
dtadfile      90486
visapost    1386375
occup       2802355
entdepa          61
entdepd      521813
entdepu     2847880
matflag      521813
biryear        1190
dtaddto         707
gender       216929
insnum      2709236
airline       61279
admnum            0
fltno         12232
visatype          0
dtype: int64

It's important to notice that the PK of the table is *cicid* (which has no nulls) and that the field *i94port* (which holds the code of the airport) also has no nulls.

From the previous analysis we see that:

* Columns **insnum, matflag, entdepu, occup and visapost** can be dropped from the study because they contain too many nulls.

* After dropping the previous columns, the fields that seem useful for the fact table that will feed the analytical model are: **cicid** (primary key of the table), **i94yr** (year), **i94mon** (month), **i94cit** (origin country code), **i94res** (residence country code), **i94port** (code of the arrival port), **arrdate** (arrival date), **i94mode** (mode of transport), **depdate** (departure date), **i94visa** (reason for the trip), **gender** and **airline**.

* Several of these chosen columns do not have the right datatype, so we are going to cast them: **cicid**, **i94yr**, **i94mon**, **i94cit**, **i94res**, **i94mode** and **i94visa** are floats and should be integers; **arrdate** and **depdate** are floats and should be dates; **i94port**, **gender** and **airline** are already strings.

* Following the indications given in the lessons, **we will create a dimension table with every date** in this table and record several attributes of each date (day, year, month, week of the year, day of the week, etc.). Note that the fields **arrdate** and **depdate** hold the number of days elapsed between the 1st of January 1960 and the date they represent, so we have to convert them accordingly.

* We can engineer some additional feature columns that add functional value to the analysis.

* All these steps, and the corresponding saving of the tables into AWS S3 buckets, are done in the production script.
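As an illustration of the date handling described above, here is a minimal sketch that converts a SAS day count into a calendar date. It is written in plain Python so that it could be wrapped in a Spark UDF; the function name `sas_days_to_date` is hypothetical and is not taken from the production script.

```python
import math
from datetime import date, timedelta

# SAS stores dates as the number of days elapsed since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(days):
    """Convert a SAS day count (float, possibly NaN or None) into a date, or None if missing."""
    if days is None or (isinstance(days, float) and math.isnan(days)):
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20465.0))        # arrdate of the first row shown by head()
print(sas_days_to_date(float("nan")))   # missing depdate values stay missing
```

Missing values are returned as `None` rather than raising, since **depdate** is null for a large share of the rows.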


### 2.3. Temperature Data

A complete description of the dataset *GlobalLandTemperaturesByCity.csv* that has been provided can be found on Kaggle at https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data. The page explains how this dataset was built and the different sources that feed it: NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.

We are interested in studying the mean temperature by state over the last 20 years, so that we can join it to the demographic data by state. Since the temperature dataset only identifies cities, we get external data from https://simplemaps.com/data/world-cities to assign a state to each city.

In [7]:
worldcities = pd.read_csv('external_data/worldcities.csv')
worldcities.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519


As we said, we are only interested in US cities, so we filter the dataframe.

In [8]:
us_cities = worldcities[(worldcities['country'] == "United States")]

We get a dataframe with each city in the US and its corresponding state.

In [9]:
dict_city_state = us_cities.groupby(['city'], as_index=False).agg({'admin_name': "first"})

Now we load the world temperature dataset from Kaggle (as described before) and cast its *dt* column to have a datetime datatype.

In [10]:
df_temperatures = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')
df_temperatures['dt'] = pd.to_datetime(df_temperatures.dt)

In [11]:
df_temperatures.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


We select only the rows that correspond to the US and have AverageTemperature populated. Furthermore, since temperature trends have shifted due to climate change, we only keep data from 2000 onwards (roughly the last 20 years).

In [12]:
df_temp_us = df_temperatures[(df_temperatures['Country'] == "United States") & \
                             (df_temperatures['AverageTemperature'].notnull()) & \
                             (df_temperatures['dt'] > '2000-01-01')]

Now we enrich this dataframe with the state obtained before.

In [13]:
temp_df = pd.merge(df_temp_us, dict_city_state, left_on=['City'], right_on=['city'])

And we calculate the mean temperature by state.

In [14]:
temp_state = temp_df.groupby(['admin_name'], as_index=False).agg({'AverageTemperature': "mean", 'Latitude': "first",'Longitude': 'first'})

In [15]:
temp_state.head()

Unnamed: 0,admin_name,AverageTemperature,Latitude,Longitude
0,Alabama,17.884701,32.95N,87.13W
1,Alaska,-0.925859,61.88N,151.13W
2,Arizona,20.819293,32.95N,112.02W
3,Arkansas,17.415079,34.56N,79.78W
4,California,16.302102,32.95N,117.77W


Looking at the I94 immigration data, we can see that the state is a two-letter code. To be able to join the previous table, we need to map each state name to its code, which is why we load a table with this mapping.

In [16]:
states_code = pd.read_csv('external_data/statesabbr.csv')

In [17]:
states_code.head()

Unnamed: 0,State,Abbrev,Code
0,Alabama,Ala.,AL
1,Alaska,Alaska,AK
2,Arizona,Ariz.,AZ
3,Arkansas,Ark.,AR
4,California,Calif.,CA


In [18]:
df_states_temp = pd.merge(temp_state, \
                          states_code, \
                          left_on=['admin_name'], \
                          right_on=['State'])[['Code', 'AverageTemperature', 'Latitude', 'Longitude']]
                                             

In [19]:
# rename() with inplace=True returns None, so reassign the renamed dataframe instead.
df_states_temp = df_states_temp.rename(columns={'Code': 'State'})

In [20]:
df_states_temp.head()

Unnamed: 0,State,AverageTemperature,Latitude,Longitude
0,AL,17.884701,32.95N,87.13W
1,AK,-0.925859,61.88N,151.13W
2,AZ,20.819293,32.95N,112.02W
3,AR,17.415079,34.56N,79.78W
4,CA,16.302102,32.95N,117.77W


With the previous data, we will enrich the table with information about states. Furthermore, we would like to enrich the immigration data itself. That's why in the next cell we compute an aggregation of temperature by country. Unlike the state-level computation, this cell does not apply the temporal filter, so the average covers the whole history available in the dataset.

In [21]:
df_temperatures_country = df_temperatures.groupby(['Country'], as_index=False)\
                                         .agg({'AverageTemperature': "mean", 'Latitude': "first",'Longitude': 'first'})

In [22]:
df_temperatures_country.head()

Unnamed: 0,Country,AverageTemperature,Latitude,Longitude
0,Afghanistan,13.816497,36.17N,69.61E
1,Albania,15.525828,40.99N,19.17E
2,Algeria,17.763206,36.17N,3.98E
3,Angola,21.759716,12.05S,13.15E
4,Argentina,16.999216,39.38S,62.43W


This table will be used as a "Countries" dimension in our data model.
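The country-level cell above groups `df_temperatures` directly. If the same temporal filter as in the state-level computation is wanted, the dataframe can be filtered before grouping. The sketch below shows this on a tiny made-up sample with the same column names as the Kaggle file; the values are purely illustrative.

```python
import pandas as pd

# Tiny MADE-UP sample with the same columns as GlobalLandTemperaturesByCity.csv.
sample = pd.DataFrame({
    "dt": pd.to_datetime(["1990-01-01", "2005-01-01", "2010-01-01", "2005-01-01"]),
    "AverageTemperature": [1.0, 3.0, 5.0, None],
    "Country": ["Denmark", "Denmark", "Denmark", "Spain"],
})

# Keep only populated measurements taken after 2000-01-01, as in the state-level filter.
recent = sample[(sample["AverageTemperature"].notnull()) & (sample["dt"] > "2000-01-01")]

# Average the remaining measurements per country.
by_country = recent.groupby("Country", as_index=False).agg({"AverageTemperature": "mean"})
print(by_country)
```

In this sample the 1990 measurement and the empty Spanish row are excluded, so only Denmark remains in the result.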

### 2.4. Airport Data

As can be found in https://datahub.io/core/airport-codes#readme: *The airport codes may refer to either IATA airport code, a three-letter code which is used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport code which is a four letter code used by ATC systems and for airports that do not have an IATA airport code*.

In [23]:
airport = pd.read_csv("airport-codes_csv.csv")

In [24]:
airport.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [25]:
airport_df = airport[~airport['iata_code'].isnull()]
airport_df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
223,03N,small_airport,Utirik Airport,4.0,OC,MH,MH-UTI,Utirik Island,K03N,UTK,03N,"169.852005, 11.222"
440,07FA,small_airport,Ocean Reef Club Airport,8.0,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804"
594,0AK,small_airport,Pilot Station Airport,305.0,,US,US-AK,Pilot Station,,PQS,0AK,"-162.899994, 61.934601"
673,0CO2,small_airport,Crested Butte Airpark,8980.0,,US,US-CO,Crested Butte,0CO2,CSE,0CO2,"-106.928341, 38.851918"
1088,0TE7,small_airport,LBJ Ranch Airport,1515.0,,US,US-TX,Johnson City,0TE7,JCY,0TE7,"-98.62249755859999, 30.251800537100003"


In [26]:
airport_data = airport_df[['name', 'iso_country', 'iso_region', 'municipality', 'iata_code']]

In [27]:
airport_data.head()

Unnamed: 0,name,iso_country,iso_region,municipality,iata_code
223,Utirik Airport,MH,MH-UTI,Utirik Island,UTK
440,Ocean Reef Club Airport,US,US-FL,Key Largo,OCA
594,Pilot Station Airport,US,US-AK,Pilot Station,PQS
673,Crested Butte Airpark,US,US-CO,Crested Butte,CSE
1088,LBJ Ranch Airport,US,US-TX,Johnson City,JCY


### 2.5. USA City Demographic Data

This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000.

This data comes from the US Census Bureau's 2015 American Community Survey.

In [28]:
us_cities_demographics = pd.read_csv("us-cities-demographics.csv", sep=";")

To understand the data better, we pick one city and take a look at its rows.

In [29]:
us_cities_demographics[us_cities_demographics['City'] == "Miami"].head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
1420,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,Asian,4613
1783,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,Hispanic or Latino,319942
1784,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,Black or African-American,87331
2180,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,American Indian and Alaska Native,1571
2376,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,White,338232


We can see that all the fields for a given city hold the same values, except race and count. So we are going to pivot this table to have one row per city, with one column per race present in the data holding the count for that race.

This official dataset is going to be the source of information about states, so I have decided to group by state and aggregate accordingly. Once this is done, we can join the temperature information obtained previously by state.
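A minimal Pandas sketch of the pivot and the aggregation by state follows. The two Miami rows reuse values from the output above, while "Example City" and its numbers are made up for illustration. Only summed columns are shown; averages such as the median age would need a different aggregation.

```python
import pandas as pd

# The Miami rows come from the output above; "Example City" is a MADE-UP second city.
rows = pd.DataFrame({
    "City": ["Miami", "Miami", "Example City", "Example City"],
    "State Code": ["FL", "FL", "FL", "FL"],
    "Total Population": [440989, 440989, 100000, 100000],
    "Race": ["Asian", "White", "Asian", "White"],
    "Count": [4613, 338232, 1000, 60000],
})

# 1) One row per city with the fields that repeat across races.
city_level = rows.drop(columns=["Race", "Count"]).drop_duplicates()

# 2) One column per race, holding the count for that city.
race_counts = rows.pivot_table(index=["City", "State Code"], columns="Race",
                               values="Count", aggfunc="sum").reset_index()

# 3) Join both views and sum the counts by state.
per_city = city_level.merge(race_counts, on=["City", "State Code"])
per_state = per_city.groupby("State Code", as_index=False)[
    ["Total Population", "Asian", "White"]].sum()
print(per_state)
```

Dropping duplicates before summing matters: summing `Total Population` on the original table would count each city once per race.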

## 3. Define the Data Model
### 3.1 Conceptual Data Model

I have chosen to develop a star schema. It is shown in the following diagram:

![image info](./model.png)

Let's briefly explain the previous diagram:

* The I94 table is the fact table, containing the information that has been cleaned and selected from the original files.
* From its date fields, we have created a dimension called Date with different attributes of these dates, such as month, year or day of the week.
* A dimension table that specifies the average temperature by US state.
* A dimension table that specifies the average temperature by country.
* A dimension with the IATA codes of the different US airports.
* A table with demographic information for each US state, which is the "Demographic Dimension".

### 3.2 Mapping Out Data Pipelines

The different phases to reach this data model are the following:

#### 3.2.1 etl.py

I have structured this script in distinct parts that make the code much easier to read:

* Imports of libraries and modules.
* Loading of parameters and constants from the .cfg file.
* Definition of the ETL functions for the distinct sources, and of the UDFs. There are separate ETL functions for the i94 data, the temperature dimensions, the airport dimension and the demographic data. Each one loads its tables from the corresponding paths, processes them and writes them to a subfolder of the AWS S3 bucket that has been created (`arc-udacity-dataengineer-project-capstone`).
* Main: executes the different functions with the appropriate parameters.
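The parameter-loading step can be sketched with the standard `configparser` module. The section and key names below, and the `s3a://` prefix, are assumptions made for illustration; the real .cfg file used by `etl.py` may be organised differently.

```python
import configparser

# HYPOTHETICAL .cfg content; section and key names are illustrative only.
EXAMPLE_CFG = """
[S3]
OUTPUT_BUCKET = s3a://arc-udacity-dataengineer-project-capstone/

[PATHS]
I94_DATA = ../../data/18-83510-I94-Data-2016/
"""


def load_settings(cfg_text):
    """Parse .cfg text into a nested dict: {section: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # configparser lower-cases option names by default.
    return {section: dict(parser[section]) for section in parser.sections()}


settings = load_settings(EXAMPLE_CFG)
print(settings["S3"]["output_bucket"])
```

In the script itself, `parser.read("<file>.cfg")` would replace `read_string`, and the resulting values would be passed to each ETL function from the main block.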

#### 3.2.2 Apache Airflow.

Apache Airflow has been deployed on top of an M2.xlarge EC2 instance running Ubuntu 18.04. The configuration process followed to turn an EC2 instance into an Airflow server is well explained at https://medium.com/@abraham.pabbathi/airflow-on-aws-ec2-instance-with-ubuntu-aff8d3206171. Once that was done, I connected to the instance over SSH and placed the `airflow` folder, with all the code and dependencies, in `/home/airflow/`. Accessing the instance's IP address through the web browser, we turn on the `project_capstone_dend_dag` developed for this use case and trigger it.

Basically, the DAG creates the metadata structure for each table of the model in AWS Redshift, stages the data that was previously loaded into the S3 bucket into those tables, and performs some quality checks on the load process.
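The ordering of these three phases can be illustrated without Airflow itself, using the standard library's `graphlib` (Python 3.9+) to resolve a dependency graph. The task names are hypothetical and do not come from `project_capstone_dend_dag`; the real DAG declares the same kind of dependencies between Airflow operators.

```python
from graphlib import TopologicalSorter

# HYPOTHETICAL task names; each task maps to the set of tasks it depends on.
dependencies = {
    "create_tables": set(),
    "stage_i94_data": {"create_tables"},
    "stage_dimensions": {"create_tables"},
    "run_quality_checks": {"stage_i94_data", "stage_dimensions"},
}

# A valid execution order: every task appears after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two staging tasks do not depend on each other, so an orchestrator is free to run them in parallel once the tables exist.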


## 4. Run Pipelines to Model the Data
#### 4.1 Create the data model

Once the code has been written, we run the whole ETL by launching the Python script from a command line:

`>> python etl.py`

When the process is completed, the different tables of the model will be in the AWS S3 bucket we have configured. To stage the data into the corresponding AWS Redshift tables, we run the Apache Airflow DAG deployed on the Ubuntu EC2 instance configured earlier.

![image info](./DAG.png)

#### 4.2 Data Quality Checks
Data quality checks are performed by a dedicated step in the Apache Airflow DAG: DataQualityOperator.

In its corresponding script, we run the following two checks for each table in the model:

 * Tables have been created in Redshift (are not empty)
 * Each table has at least one record
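A minimal sketch of such a check is shown below. It is not the code of DataQualityOperator: `check_table_has_rows` and `run_query` are hypothetical names, and the Redshift connection is replaced by a stand-in function returning made-up counts so that the example runs on its own.

```python
def check_table_has_rows(run_query, table):
    """Raise ValueError if the count query returns nothing or the table has zero rows."""
    records = run_query(f"SELECT COUNT(*) FROM {table}")
    if not records or not records[0]:
        raise ValueError(f"Data quality check failed: {table} returned no results")
    if records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} contains 0 rows")
    return records[0][0]


# Stand-in for a Redshift connection: returns MADE-UP COUNT(*) results.
fake_counts = {"i94_data": 2847924, "states_temperature": 0}


def fake_run_query(sql):
    table = sql.split()[-1]
    return [(fake_counts[table],)]


print(check_table_has_rows(fake_run_query, "i94_data"))
```

Raising an exception makes the task fail, which is how an orchestrator would surface a failed check.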

#### 4.3 Data dictionary 

In this section we describe each field of the data model.

#### 4.3.1 Fact Table: i94_data

| FIELD | MEANING |
|-------------------|-----------------------------------------------|
| cicid | ID of each row in this fact table |
| i94yr | Year of completing the i94 form |
| i94mon | Month of completing the i94 form |
| i94cit | Code: Country of birth of the traveler |
| i94res | Code: Residence country of the traveler |
| i94port | Code: Arrival airport |
| arrdate | Arrival date |
| i94mode | Code: Kind of transport |
| depdate | Departure date |
| i94visa | Code: Reason for the travel |
| gender | Gender of the traveler |
| airline | Company with which the traveler arrives |
| airport_name | Description: Arrival airport |
| state_code | Code of the state of arrival |
| born_country | Description: Country of birth of the traveler |
| residence_country | Description: Residence country of the traveler |
| mode | Description: Kind of transport |
| visa | Description: Reason for the travel |

#### 4.3.2 Dimension Table: Demographic

| FIELD | MEANING |
|-----------------------------------|----------------------------------------------------------------|
| state | Name of the US state |
| median_age | Median age of the population in the state |
| male_population | Total number of men in the state |
| female_population | Total number of women in the state |
| total_population | Total number of people in the state |
| number_of_veterans | Total number of veterans in the state |
| foreign_borns | Total number of foreign-born residents in the state |
| average_household_size | Average household size in the state |
| state_code | Code of the US state |
| american_indian_and_alaska_native | Total number of American Indian and Alaska Native people in the state |
| asian | Total number of Asian people in the state |
| black_african_american | Total number of Black or African-American people in the state |
| hispanic_latino | Total number of Hispanic or Latino people in the state |
| white | Total number of white people in the state |

#### 4.3.3 Dimension Table: Date

| FIELD | MEANING |
|---------|--------------------------------------------|
| date | Date |
| day | Day of the corresponding date |
| month | Month of the corresponding date |
| year | Year of the corresponding date |
| week | Week of the year of the corresponding date |
| weekday | Day of the week of the corresponding date |
| yearday | Day of the year of the corresponding date |
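As an illustration, the fields of this dimension can be derived from a single date with the standard library. The conventions used here (ISO week numbers, weekday numbered from 1 for Monday) are assumptions; the production script may number weeks and weekdays differently.

```python
from datetime import date


def date_dimension_row(d):
    """Return the Date dimension fields for a single date."""
    return {
        "date": d.isoformat(),
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],        # ISO week number (assumed convention)
        "weekday": d.isoweekday(),         # 1 = Monday ... 7 = Sunday (assumed convention)
        "yearday": d.timetuple().tm_yday,  # 1 = 1st of January
    }


row = date_dimension_row(date(2016, 1, 12))
print(row)
```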

#### 4.3.4 Dimension Table: airports

| FIELD | MEANING |
|--------------|-----------------------------------|
| name | Name of the airport |
| iso_country | Country where the airport is located |
| municipality | City where the airport is located |
| iata_code | International code of the airport |
| state | State where the airport is located |


#### 4.3.5 Dimension Table: countries_temperature

| FIELD | MEANING |
|---------------------|-------------------------------------|
| country | Name of the country |
| average_temperature | Average temperature in that country |

#### 4.3.6 Dimension Table: states_temperature

| FIELD | MEANING |
|---------------------|-----------------------------------|
| state | Name of the US state |
| average_temperature | Average temperature in that state |

## 5. Complete Project Write Up

#### 5.1. Clearly state the rationale for the choice of tools and technologies for the project.

The solution has different parts. The first one is an ETL over the different datasets, which I have chosen to implement through Spark scripts. The output datasets are stored in AWS S3. An AWS EC2 instance has been launched and configured to run Apache Airflow, and the DAGs have been programmed to be executed on this machine. They move the files from S3 into Redshift and perform some checks on the data.

A possible improvement to this project would be uploading the initial data into other S3 buckets and deploying an EMR cluster on which to execute the spark-submit job.

* **Spark**: As its own documentation states at https://spark.apache.org/docs/latest/: *Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.*

* **AWS S3**: Amazon Simple Storage Service (S3) is a service offered by AWS that provides scalable object storage through a web service interface. AWS S3 can store any type of object, which allows for uses such as storage for Internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.

* **Amazon Redshift**: Amazon Redshift is a data warehouse product that forms part of AWS. It can handle large-scale data sets and database migrations. Redshift handles analytic workloads on large data sets using column-oriented storage.

* **Apache Airflow**: Apache Airflow is an open-source tool for orchestrating workflows and data processing pipelines. Tasks are defined in Python and combined into a Directed Acyclic Graph (DAG) that executes the desired logic.

* **AWS EC2**: Amazon Elastic Compute Cloud is a cloud computing service that allows us to easily deploy IaaS machines and configure them however we want. Apache Airflow has been deployed on top of an M2.xlarge EC2 instance running Ubuntu 18.04. The configuration process followed to turn an EC2 instance into an Apache Airflow server is well explained at https://medium.com/@abraham.pabbathi/airflow-on-aws-ec2-instance-with-ubuntu-aff8d3206171

#### 5.2. Propose how often the data should be updated and why.

Since the fact table of our data model is fed with monthly I94 data files, we can assume that the official department produces one batch of data each month. Updating once a month is therefore a reasonable criterion for refreshing the data model.

#### 5.3. Write a description of how you would approach the problem differently under the following scenarios:
##### 5.3.1. The data was increased by 100x.

If the data grew by this magnitude, the Spark script that performs the ETL should be run on a cluster with enough nodes. Scaling the solution would lead us to execute this part in the cloud. As explained during the course, AWS Elastic MapReduce (EMR) would be a good fit, since it lets the environment scale the number of EC2 node instances up or down depending on the workload.

##### 5.3.2. The data populates a dashboard that must be updated on a daily basis by 7am every day.

As I proposed in the solution, a data pipeline orchestrator such as Apache Airflow is well suited to scheduling tasks. It would synchronize the execution of the different tasks automatically, and the DAG could be scheduled daily, early enough for the load to finish before 7am.
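One way to reason about the schedule is to work backwards from the deadline. The sketch below derives a daily cron expression, a format that Airflow accepts for scheduling a DAG, from the deadline and an expected run time. The helper name and the two-hour run time are assumptions made for illustration, not measurements from this project.

```python
def daily_cron_for_deadline(deadline_hour, expected_runtime_hours):
    """Return a daily cron expression that starts early enough to meet the deadline."""
    start_hour = deadline_hour - expected_runtime_hours
    if not 0 <= start_hour <= 23:
        raise ValueError("The run does not fit in the same day before the deadline")
    # Cron fields: minute, hour, day of month, month, day of week.
    return f"0 {start_hour} * * *"


# ASSUMED run time of 2 hours; the dashboard deadline is 7am.
schedule = daily_cron_for_deadline(deadline_hour=7, expected_runtime_hours=2)
print(schedule)
```

The resulting expression could be used as the DAG's schedule, ideally combined with retries and failure alerts so that a failed run is noticed before the deadline.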

##### 5.3.3. The database needed to be accessed by 100+ people.

Since we have chosen AWS Redshift as our data warehouse solution, an increase in the number of people accessing the database would not be a problem. As with other AWS solutions, there are options to scale the number of nodes in our cluster depending on the workload.