# Project Title
### Data Engineering Capstone Project

#### Project Summary
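This project combines I94 immigration data with global city temperature data to build a database for analyzing whether temperature affects the destination cities of immigration.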

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import datetime
import re

import numpy as np
import pandas as pd
import parso

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, date_add, desc, udf
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType



In [3]:
import psycopg2
from sql_queries import immigration_table_insert, temperature_table_insert, i94port_table_insert

### Step 1: Scope the Project and Gather Data

#### Scope 
The data model consists of 2 dimension tables and 1 fact table. First, the immigration data is aggregated by city; second, the temperature data is aggregated by city. The two results are then merged on the city value to create the fact table. The final database is intended for analyzing whether temperature affects the destination cities of immigration.

#### Describe and Gather Data 
The I94 immigration data was gathered from the US National Tourism and Trade Office website. It is stored in SAS7BDAT, a binary database storage format.

The temperature data is a Kaggle dataset containing temperature information for cities around the world. It can be found at the link below.
https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

#### 1. Process of Immigration Data

In [4]:
# Read in the immigration data here
fname = './data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1")

In [5]:
raw_row_amount=len(df)

In [6]:
df.head(50)

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2
5,18.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MI,20555.0,...,,M,1959.0,09302016,,,AZ,92471040000.0,602.0,B1
6,19.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NJ,20558.0,...,,M,1953.0,09302016,,,AZ,92471400000.0,602.0,B2
7,20.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NJ,20558.0,...,,M,1959.0,09302016,,,AZ,92471610000.0,602.0,B2
8,21.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NY,20553.0,...,,M,1970.0,09302016,,,AZ,92470800000.0,602.0,B2
9,22.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NY,20562.0,...,,M,1968.0,09302016,,,AZ,92478490000.0,608.0,B1


In [7]:
# Read in the temperature data here
temp_data = './data/data2/GlobalLandTemperaturesByCity.csv'
df_temp_data = pd.read_csv(temp_data, sep=',')

In [8]:
df_temp_data.head(50)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
9,1744-08-01,,,Århus,Denmark,57.05N,10.33E


In [7]:
# Define Spark session
from pyspark.sql import SparkSession
spark2 = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [9]:
set(df_temp_data["Country"].values)

{'Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Benin',
 'Bolivia',
 'Bosnia And Herzegovina',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Congo',
 'Congo (Democratic Republic Of The)',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 "Côte D'Ivoire",
 'Denmark',
 'Djibouti',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Guinea',
 'Guinea Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jorda

In [10]:
df_country = df_temp_data[df_temp_data["Country"] == "Turkey"]

In [11]:
df_country.head(10)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
3239,1743-11-01,10.013,2.291,Çorlu,Turkey,40.99N,27.69E
3240,1743-12-01,,,Çorlu,Turkey,40.99N,27.69E
3241,1744-01-01,,,Çorlu,Turkey,40.99N,27.69E
3242,1744-02-01,,,Çorlu,Turkey,40.99N,27.69E
3243,1744-03-01,,,Çorlu,Turkey,40.99N,27.69E
3244,1744-04-01,13.685,2.162,Çorlu,Turkey,40.99N,27.69E
3245,1744-05-01,15.021,1.824,Çorlu,Turkey,40.99N,27.69E
3246,1744-06-01,19.663,1.701,Çorlu,Turkey,40.99N,27.69E
3247,1744-07-01,22.314,1.648,Çorlu,Turkey,40.99N,27.69E
3248,1744-08-01,,,Çorlu,Turkey,40.99N,27.69E


In [12]:
list(set(df_country.City))

['Edirne',
 'Tarsus',
 'Urfa',
 'Ankara',
 'Kayseri',
 'Van',
 'Ordu',
 'Manisa',
 'Alanya',
 'Siverek',
 'Aksaray',
 'Turhal',
 'Antakya',
 'Inegol',
 'Mersin',
 'Gebze',
 'Esenyurt',
 'Batman',
 'Nazilli',
 'Tekirdag',
 'Bursa',
 'Gaziantep',
 'Karaman',
 'Viransehir',
 'Eskisehir',
 'Tokat',
 'Konya',
 'Sivas',
 'Trabzon',
 'Siirt',
 'Afyonkarahisar',
 'Istanbul',
 'Denizli',
 'Antalya',
 'Izmir',
 'Adana',
 'Izmit',
 'Erzurum',
 'Isparta',
 'Çorlu',
 'Erzincan',
 'Zonguldak',
 'Samsun',
 'Kütahya',
 'Malatya',
 'Iskenderun',
 'Kahramanmaras',
 'Usak',
 'Çorum',
 'Turgutlu',
 'Osmaniye']

In [13]:
# Dictionary of valid i94port codes is created
re_obj = re.compile(r'\'(.*)\'.*\'(.*)\'')
i94port_valid = {}

with open('i94port.txt') as f:
    for data in f:
        match = re_obj.search(data)
        if match:  # skip lines that do not contain a quoted code/location pair
            i94port_valid[match[1]] = [match[2]]
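
The parsing step above depends on every line of `i94port.txt` matching the pattern; a line that does not (a blank line, for example) yields `None` from `search`. A minimal sketch of a more defensive parser follows. The sample lines and the helper name `parse_port_lines` are illustrative assumptions, since the actual file contents are not shown in this notebook.

```python
import re

# Matches lines of the assumed form:  'ALC' = 'ALCAN, AK   '
PORT_LINE = re.compile(r"'([^']*)'\s*=\s*'([^']*)'")


def parse_port_lines(lines):
    """Return {port_code: location} for lines that match, skipping the rest."""
    ports = {}
    for line in lines:
        match = PORT_LINE.search(line)
        if match is None:
            continue  # skip blank or malformed lines instead of failing
        code = match.group(1).strip()
        location = match.group(2).strip()
        ports[code] = location
    return ports


# Hypothetical sample lines, including one blank line
sample = [
    "   'ALC'\t=\t'ALCAN, AK             '\n",
    "\n",
    "   'ANC'\t=\t'ANCHORAGE, AK         '\n",
]
parsed = parse_port_lines(sample)
print(parsed)
```

Unlike the notebook's dictionary, this sketch stores each location as a plain string rather than a one-element list, and strips the padding at parse time.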
        

       


In [14]:
# Convert dictionary to list to save as a dataframe
pCode=[]
pLocation=[]
for key, val in i94port_valid.items():
    pCode.append(key)
    pLocation.append(val)

In [15]:
port_locations = [f[0].replace("'","").strip() for f in pLocation]

In [16]:
print(port_locations)

['ALCAN, AK', 'ANCHORAGE, AK', 'BAKER AAF - BAKER ISLAND, AK', 'DALTONS CACHE, AK', 'DEW STATION PT LAY DEW, AK', 'DUTCH HARBOR, AK', 'EAGLE, AK', 'FAIRBANKS, AK', 'HOMER, AK', 'HYDER, AK', 'JUNEAU, AK', 'KETCHIKAN, AK', 'KETCHIKAN, AK', 'MOSES POINT INTERMEDIATE, AK', 'NIKISKI, AK', 'NOM, AK', 'POKER CREEK, AK', 'PORT LIONS SPB, AK', 'SKAGWAY, AK', 'ST. PAUL ISLAND, AK', 'TOKEEN, AK', 'WRANGELL, AK', 'MADISON COUNTY - HUNTSVILLE, AL', 'MOBILE, AL', 'LITTLE ROCK, AR (BPS)', 'ROGERS ARPT, AR', 'DOUGLAS, AZ', 'LUKEVILLE, AZ', 'MARIPOSA AZ', 'NACO, AZ', 'NOGALES, AZ', 'PHOENIX, AZ', 'PORTAL, AZ', 'SAN LUIS, AZ', 'SASABE, AZ', 'TUCSON, AZ', 'YUMA, AZ', 'ANDRADE, CA', 'BURBANK, CA', 'CALEXICO, CA', 'CAMPO, CA', 'FRESNO, CA', 'IMPERIAL COUNTY, CA', 'LONG BEACH, CA', 'LOS ANGELES, CA', 'MEADOWS FIELD - BAKERSFIELD, CA', 'OAKLAND, CA', 'ONTARIO, CA', 'OTAY MESA, CA', 'PACIFIC, HWY. STATION, CA', 'PALM SPRINGS, CA', 'SACRAMENTO, CA', 'SALINAS, CA (BPS)', 'SAN DIEGO, CA', 'SAN FRANCISCO, CA', 'S

In [17]:
pCities = [f.split(",")[0] for f in port_locations]
print(pCities)

['ALCAN', 'ANCHORAGE', 'BAKER AAF - BAKER ISLAND', 'DALTONS CACHE', 'DEW STATION PT LAY DEW', 'DUTCH HARBOR', 'EAGLE', 'FAIRBANKS', 'HOMER', 'HYDER', 'JUNEAU', 'KETCHIKAN', 'KETCHIKAN', 'MOSES POINT INTERMEDIATE', 'NIKISKI', 'NOM', 'POKER CREEK', 'PORT LIONS SPB', 'SKAGWAY', 'ST. PAUL ISLAND', 'TOKEEN', 'WRANGELL', 'MADISON COUNTY - HUNTSVILLE', 'MOBILE', 'LITTLE ROCK', 'ROGERS ARPT', 'DOUGLAS', 'LUKEVILLE', 'MARIPOSA AZ', 'NACO', 'NOGALES', 'PHOENIX', 'PORTAL', 'SAN LUIS', 'SASABE', 'TUCSON', 'YUMA', 'ANDRADE', 'BURBANK', 'CALEXICO', 'CAMPO', 'FRESNO', 'IMPERIAL COUNTY', 'LONG BEACH', 'LOS ANGELES', 'MEADOWS FIELD - BAKERSFIELD', 'OAKLAND', 'ONTARIO', 'OTAY MESA', 'PACIFIC', 'PALM SPRINGS', 'SACRAMENTO', 'SALINAS', 'SAN DIEGO', 'SAN FRANCISCO', 'SAN JOSE', 'SAN LUIS OBISPO', 'SAN LUIS OBISPO', 'SAN PEDRO', 'SAN YSIDRO', 'SANTA ANA', 'STOCKTON', 'TECATE', 'TRAVIS-AFB', 'ARAPAHOE COUNTY', 'ASPEN', 'COLORADO SPRINGS', 'DENVER', 'LA PLATA - DURANGO', 'BRADLEY INTERNATIONAL', 'BRIDGEPORT',

In [18]:
pStates = [f.split(",")[-1] for f in port_locations]
print(pStates)

[' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AK', ' AL', ' AL', ' AR (BPS)', ' AR', ' AZ', ' AZ', 'MARIPOSA AZ', ' AZ', ' AZ', ' AZ', ' AZ', ' AZ', ' AZ', ' AZ', ' AZ', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA', ' CA (BPS)', ' CA', ' CA', ' CA', ' CA', ' CA (BPS)', ' CA', ' CA', ' CA', ' CA (BPS)', ' CA', ' CA', ' CO', ' CO #ARPT', ' CO', ' CO', ' CO', ' CT', ' CT', ' CT', ' CT', ' CT', ' CT', ' CT', 'WASHINGTON DC', ' DE', ' DE', ' DE', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL #ARPT', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' FL', ' GA', ' GA', ' GA', ' GA', ' GU', ' HI', ' HI', ' HI', ' HI', ' IA', ' IA', ' ID', ' ID', ' ID', ' ID', ' IL', ' IL', ' IL', ' IL', ' IL', ' IL', ' IN', ' IN', ' IN', ' IN', 
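
The output above shows that splitting on the last comma leaves a leading space (`' AK'`), trailing annotations (`' AR (BPS)'`, `' CO #ARPT'`), and whole strings where no comma exists (`'MARIPOSA AZ'`, `'WASHINGTON DC'`). One way to normalize these is sketched below, under the assumption that a state is always a two-letter upper-case code. `split_port_location` is a hypothetical helper, not part of the notebook.

```python
import re

STATE_CODE = re.compile(r"^[A-Z]{2}$")


def split_port_location(location):
    """Split 'CITY, ST' into (city, state); state is None when none is found."""
    location = location.strip()
    if "," in location:
        # Split on the last comma so commas inside the city name are kept.
        city, tail = location.rsplit(",", 1)
    else:
        # No comma: treat a trailing two-letter token as the state, if present.
        city, _, tail = location.rpartition(" ")
    tokens = tail.split()
    state = tokens[0] if tokens and STATE_CODE.match(tokens[0]) else None
    if state is None:
        return location, None
    return city.strip(), state


for text in ["ALCAN, AK", "LITTLE ROCK, AR (BPS)", "MARIPOSA AZ", "WASHINGTON DC"]:
    print(split_port_location(text))
```

Note that this keeps `'PACIFIC, HWY. STATION'` as the city for `'PACIFIC, HWY. STATION, CA'`, whereas splitting on the first comma yields `'PACIFIC'`. Entries without a recognizable state code come back unchanged with `None` as the state.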

In [19]:
a = {"port_code" : pCode, "port_city": pCities, "port_state": pStates}
df_port_locations = pd.DataFrame.from_dict(a, orient='index')
i94port_valid_df = df_port_locations.transpose()
i94port_valid_df.head(20)

Unnamed: 0,port_code,port_city,port_state
0,ALC,ALCAN,AK
1,ANC,ANCHORAGE,AK
2,BAR,BAKER AAF - BAKER ISLAND,AK
3,DAC,DALTONS CACHE,AK
4,PIZ,DEW STATION PT LAY DEW,AK
5,DTH,DUTCH HARBOR,AK
6,EGL,EAGLE,AK
7,FRB,FAIRBANKS,AK
8,HOM,HOMER,AK
9,HYD,HYDER,AK


In [20]:
# Keep only rows whose i94port value is a valid port code
new_df = df[df.i94port.isin(i94port_valid_df.port_code)]

In [21]:
processed_row_amount= len(new_df)
filtered_row_amount= raw_row_amount - processed_row_amount
print(f"Raw immigration data amount: {raw_row_amount}")
print(f"Final immigration data amount: {processed_row_amount}")
print(f"Number of rows cleaned: {filtered_row_amount}")

Raw immigration data amount: 3096313
Final immigration data amount: 3088544
Number of rows cleaned: 7769


In [22]:
new_df.head(20)

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199,B2
5,18.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MI,20555.0,...,,M,1959.0,09302016,,,AZ,92471040000.0,602,B1
6,19.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NJ,20558.0,...,,M,1953.0,09302016,,,AZ,92471400000.0,602,B2
7,20.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NJ,20558.0,...,,M,1959.0,09302016,,,AZ,92471610000.0,602,B2
8,21.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NY,20553.0,...,,M,1970.0,09302016,,,AZ,92470800000.0,602,B2
9,22.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NY,20562.0,...,,M,1968.0,09302016,,,AZ,92478490000.0,608,B1
10,23.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,NY,20671.0,...,,M,1964.0,09302016,,,TK,92501390000.0,1,B2


In [23]:
# Drop rows where AverageTemperature is null
df_temp_data = df_temp_data[df_temp_data.AverageTemperature.notnull()]

In [24]:
df_temp_data.head(20)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
10,1744-09-01,12.781,1.454,Århus,Denmark,57.05N,10.33E
11,1744-10-01,7.95,1.63,Århus,Denmark,57.05N,10.33E
12,1744-11-01,4.639,1.302,Århus,Denmark,57.05N,10.33E
13,1744-12-01,0.122,1.756,Århus,Denmark,57.05N,10.33E
14,1745-01-01,-1.333,1.642,Århus,Denmark,57.05N,10.33E


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The immigration data holds the core information, so it is kept as the fact table, and foreign keys connect the other tables to it. Its columns are:
- cicid     
- year     
- month    
- city     
- res      
- iport  
- arrdate  
- depdate  
- visa     
- addr 

The first dimension table is I94port. Its columns are shown below.
- port_code --foreign key
- port_city 
- port_state 

The second dimension table will be the temperature data.
- AverageTemperature 
- City 
- Country 
- Latitude 
- Longitude 
- iport -- foreign key
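
One possible way to express this model as table definitions is sketched below. The column types and key constraints are assumptions (the project's `create_table.py` and `sql_queries.py` are not shown here), and the in-memory `sqlite3` database is used only so the statements can be checked without a Postgres server.

```python
import sqlite3

# Hypothetical DDL: types and constraints are assumptions, not the project's actual schema.
CREATE_I94PORT = """
CREATE TABLE IF NOT EXISTS i94port (
    port_code  VARCHAR PRIMARY KEY,
    port_city  VARCHAR,
    port_state VARCHAR
)"""

CREATE_IMMIGRATION = """
CREATE TABLE IF NOT EXISTS immigration (
    cicid   INTEGER PRIMARY KEY,
    year    INTEGER,
    month   INTEGER,
    city    INTEGER,
    res     INTEGER,
    iport   VARCHAR REFERENCES i94port (port_code),
    arrdate INTEGER,
    depdate INTEGER,
    visa    VARCHAR,
    addr    VARCHAR
)"""

CREATE_TEMPERATURE = """
CREATE TABLE IF NOT EXISTS temperature (
    AverageTemperature FLOAT,
    City      VARCHAR,
    Country   VARCHAR,
    Latitude  VARCHAR,
    Longitude VARCHAR,
    iport     VARCHAR REFERENCES i94port (port_code)
)"""

conn = sqlite3.connect(":memory:")
for ddl in (CREATE_I94PORT, CREATE_IMMIGRATION, CREATE_TEMPERATURE):
    conn.execute(ddl)

# List the tables that were created
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

In this reading, `port_code` is the primary key of `i94port`, and the `iport` columns of the other two tables reference it.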


#### 3.2 Mapping Out Data Pipelines
As described in Step 2, data cleanup must be completed first. The pipeline steps are:
- Clean up and normalize the I94 data
- Clean up and normalize the temperature data
- Organize the i94port data
- Run the `create_table.py` file
- Join the temperature data with i94port
- Insert the data into the database

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [25]:
# Connect to the database
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

OperationalError: could not connect to server: Connection refused (0x0000274D/10061)
	Is the server running on host "127.0.0.1" and accepting
	TCP/IP connections on port 5432?


In [None]:
#Temporary in case
#conn.close()

In [92]:
immigration_final_df = new_df[['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'depdate', 'visatype', 'i94addr']]

In [None]:
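# Postgres does not accept NUL (0x00) in string literals and the data has missing values,
# so drop incomplete rows. Reassigning avoids modifying a slice of new_df in place.
immigration_final_df = immigration_final_df.dropna()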

In [95]:
for i, row in immigration_final_df.iterrows():
    cur.execute(immigration_table_insert,list(row.values))
    conn.commit()

IndexError: list index out of range

In [None]:
# Join the temperature data with i94port to bring in the port_code column.
# port_city is upper case (e.g. 'ANCHORAGE'), so match on an upper-cased copy of City.
ports = i94port_valid_df[['port_code', 'port_city']]
df_temp_data = pd.merge(df_temp_data.assign(city_key=df_temp_data['City'].str.upper()), ports, left_on='city_key', right_on='port_city', how='left').drop(['city_key', 'port_city'], axis=1)

In [None]:
# Clean up the null values from temperature dataframe
df_temp_data = df_temp_data[df_temp_data.port_code.notnull()]
df_temp_data.head(20)

In [97]:
#Final data structure for the database
temperature_final_df = df_temp_data[['AverageTemperature', 'City', 'Country', 'Latitude', 'Longitude', 'port_code']]

In [98]:
for i, row in temperature_final_df.iterrows():
    cur.execute(temperature_table_insert,list(row.values))
    conn.commit()

ProgrammingError: relation "users" does not exist
LINE 1:  INSERT INTO users (AverageTemperature, City, Country, Latit...
                     ^
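
The error above shows that `temperature_table_insert` targets a `users` table rather than `temperature`, and the earlier `IndexError` is consistent with an insert statement whose placeholder count differs from the number of values passed. Since `sql_queries.py` is not shown, the following is only a sketch of generating the statements from a column list so that the table name, columns, and `%s` placeholders stay in step. `build_insert` is a hypothetical helper.

```python
def build_insert(table, columns):
    """Build a parameterized INSERT with one %s placeholder per column."""
    # Names are interpolated into SQL, so accept plain identifiers only.
    for name in (table, *columns):
        if not name.isidentifier():
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


# Column names taken from temperature_final_df in this notebook
temperature_columns = [
    "AverageTemperature", "City", "Country", "Latitude", "Longitude", "port_code",
]
temperature_table_insert = build_insert("temperature", temperature_columns)
print(temperature_table_insert)
```

The values are still passed separately to `cur.execute`, so the driver handles quoting; only the identifiers are built as text.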


In [None]:
for i, row in i94port_valid_df.iterrows():
    cur.execute(i94port_table_insert,list(row.values))
    conn.commit()
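
The insert loops above commit after every row, which is slow for millions of rows. A minimal sketch of a batched alternative follows, using the DB-API `executemany` call with a single commit. It also converts `NaN` to `None` so that missing values arrive as SQL `NULL`. The helper names are hypothetical, and `sqlite3` (with `?` placeholders) stands in for psycopg2 (which uses `%s`) only so the sketch runs without a server.

```python
import math
import sqlite3


def clean_value(value):
    """Map float NaN to None so the driver sends SQL NULL."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def insert_rows(conn, query, rows):
    """Insert all rows in one transaction; roll back if any row fails."""
    cleaned = [tuple(clean_value(v) for v in row) for row in rows]
    cur = conn.cursor()
    try:
        cur.executemany(query, cleaned)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
    return len(cleaned)


# Stand-in database and a few illustrative rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE i94port (port_code TEXT, port_city TEXT, port_state TEXT)")
rows = [("ALC", "ALCAN", "AK"), ("ANC", "ANCHORAGE", "AK"), ("XXX", float("nan"), None)]
inserted = insert_rows(conn, "INSERT INTO i94port VALUES (?, ?, ?)", rows)
print(inserted)
```

With a DataFrame, the rows could come from `df.itertuples(index=False, name=None)`, which yields plain tuples.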

#### 4.2 Data Quality Checks
Run Quality Checks

In [None]:
# Final quality check: each table must contain at least one row.
# COUNT(*) always returns exactly one row, so read its value instead of cur.rowcount.
cur.execute("SELECT COUNT(*) FROM immigration")
if cur.fetchone()[0] < 1:
    print("Nothing has been transferred to immigration table.")

cur.execute("SELECT COUNT(*) FROM temperature")
if cur.fetchone()[0] < 1:
    print("Nothing has been transferred to temperature table.")

cur.execute("SELECT COUNT(*) FROM i94port")
if cur.fetchone()[0] < 1:
    print("Nothing has been transferred to i94port table.")

#### 4.3 Data dictionary 
#### Fact Table:

- i94yr: 4 digit year
- i94mon: numeric month
- i94cit: 3 digit code of the country of citizenship
- i94port: 3 character code of destination USA city
- arrdate: arrival date in the USA
- i94mode: 1 digit travel code
- depdate: departure date from the USA
- i94visa: reason for immigration
- AverageTemperature: average temperature of destination city

#### Dimension Table - I94 immigration data columns:

- i94yr: 4 digit year
- i94mon: numeric month
- i94cit: 3 digit code of the country of citizenship
- i94port: 3 character code of destination USA city
- arrdate: arrival date in the USA
- i94mode: 1 digit travel code
- depdate: departure date from the USA
- i94visa: reason for immigration

#### Dimension Table - temperature data columns:

- i94port: 3 character code of destination city
- AverageTemperature: average temperature
- City: city name
- Country: country name
- Latitude: latitude
- Longitude: longitude
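
`arrdate` and `depdate` appear in the raw data as numbers such as `20545.0`. SAS stores dates as a count of days since 1960-01-01, so they can be converted as sketched below. `sas_to_date` is a hypothetical helper, and returning `None` for missing values is an assumption about the desired behavior.

```python
import math
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS day count to a date; missing values become None."""
    if days is None or (isinstance(days, float) and math.isnan(days)):
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_to_date(20545.0))  # 2016-04-01
print(sas_to_date(20573.0))  # 2016-04-29
```

Both sample values fall in April 2016, which agrees with the `i94yr` and `i94mon` columns of the same rows.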

### Step 5: Complete Project Write Up
Clearly state the rationale for the choice of tools and technologies for the project.
- The immigration data is large (about 3.1 million rows for April 2016 alone) and is combined with the temperature data. Spark was chosen because it can process data of this size in a distributed way.

Propose how often the data should be updated and why.
- The I94 file used here covers a single month (April 2016) and the temperature data holds one record per city per month, so a monthly update matches the granularity of both sources.

#### Scenarios
Write a description of how you would approach the problem differently under the following scenarios.
- The data was increased by 100x.
  - Use Spark on EMR to process the data in a distributed way.
- The data populates a dashboard that must be updated on a daily basis by 7am every day.
  - Use Airflow and create a DAG to schedule and monitor the process.
- The database needed to be accessed by 100+ people.
  - Use Redshift, which can scale and be accessed by many people.
