# I94 Immigration Data Warehouse
### Data Engineering Capstone Project by Thelma Obirieze

#### Project Summary
The objective of this project is to design a data warehouse for I94 immigration data. This data is collected by the US National Tourism and Trade Office and contains details of all immigrants coming into the country and their ports of entry.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import re
import psycopg2
from collections import defaultdict
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from cleaning import clean_immigration_data, clean_city_data, clean_airport_data, clean_country_data
from check import check_data
from check import check_integrity

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = "/opt/conda/bin:/opt/spark-2.4.3-bin-hadoop2.7/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-8-openjdk-amd64/bin"
os.environ["SPARK_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.load('./sas_data')
print("Completed, ran successfully")

Completed, ran successfully


In [30]:
df_spark.head()

Row(cicid=5748517.0, i94yr=2016.0, i94mon=4.0, i94cit=245.0, i94res=438.0, i94port='LOS', arrdate=20574.0, i94mode=1.0, i94addr='CA', depdate=20582.0, i94bir=40.0, i94visa=1.0, count=1.0, dtadfile='20160430', visapost='SYD', occup=None, entdepa='G', entdepd='O', entdepu=None, matflag='M', biryear=1976.0, dtaddto='10292016', gender='F', insnum=None, airline='QF', admnum=94953870030.0, fltno='00011', visatype='B1')

### Step 1: Scope the Project and Gather Data

#### Scope 
In this project, I will pull data from the I94 immigration data store (sas_data) and create the fact and dimension tables of the data warehouse where this data will be stored.

In addition, I will be using the following data:
* US Cities Demographics data which contains data by US city, state, age, population, veteran status and race.
* Countries Data
* Airport codes data

With this data, I plan to create a data warehouse that the Business Analysts can use for further analysis. The team plans to analyze the number of immigrants that come into the US on a daily basis. They would like to know the immigrants' source country, the airport they landed at, the demographics of the city where that airport is located, and the type of visa they came in with.


#### Describe and Gather Data 


##### Data Source 1: I94 immigration data

The I94 immigration data is sourced from the US National Tourism and Trade Office and has the following structure:

In [29]:
# Read in the data here
df_immigration = df_spark
df_immigration.show(5)

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|5748517.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     CA|20582.0|  40.0|    1.0|  1.0|20160430|     SYD| null|      G|      O|   null|      M| 1976.0|10292016|     F|  null|     QF|9.495387003E10|00011|      B1|
|5748518.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     NV|20591.0|  32.0|    1.0|  

##### I94 Immigration Data Structure

The schema of the immigration data is shown below.

In [5]:
df_immigration.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### Data Source 2: US Cities Demographics data

The US Cities Demographics data comes from OpenDataSoft. I will be using the CSV format of the data for this project.

In [6]:
file_cities = './us-cities-demographics.csv'
df_cities = spark.read.format("csv").option("header", "true").option("delimiter", ";").load(file_cities)

In [7]:
df_cities.show()

+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+
|            City|         State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|                Race| Count|
+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+
|   Silver Spring|      Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|  Hispanic or Latino| 25924|
|          Quincy| Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|               White| 58723|
|          Hoover|       Alabama|      38.5|      

##### US Cities Data Structure

The schema of the US cities data is shown below.

In [8]:
df_cities.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Number of Veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)



#### Data Source 3: Airport Codes Data
In addition, I will be using the airport codes file as a dimension to show more information about the immigrants' port of entry.

In [9]:
file_ports = './airport-codes_csv.csv'
df_airports = spark.read.format("csv").option("header", "true").load(file_ports) 

In [10]:
df_airports.show()

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|-74.9336013793945...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|-151.695999146, 5...|
| 00AL|small_airport|        Epps Airpark|         820|       NA|         US|     

##### Airport codes Data Structure

In [11]:
df_airports.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



#### Data Source 4: Countries Codes Data

In [12]:
df_countries = spark.read.format("csv").option("header", "true").load('country-codes_csv.csv')  

In [13]:
df_countries.printSchema()

root
 |-- FIFA: string (nullable = true)
 |-- Dial: string (nullable = true)
 |-- ISO3166-1-Alpha-3: string (nullable = true)
 |-- MARC: string (nullable = true)
 |-- is_independent: string (nullable = true)
 |-- ISO3166-1-numeric: string (nullable = true)
 |-- GAUL: string (nullable = true)
 |-- FIPS: string (nullable = true)
 |-- WMO: string (nullable = true)
 |-- ISO3166-1-Alpha-2: string (nullable = true)
 |-- ITU: string (nullable = true)
 |-- IOC: string (nullable = true)
 |-- DS: string (nullable = true)
 |-- UNTERM Spanish Formal: string (nullable = true)
 |-- Global Code: string (nullable = true)
 |-- Intermediate Region Code: string (nullable = true)
 |-- official_name_fr: string (nullable = true)
 |-- UNTERM French Short: string (nullable = true)
 |-- ISO4217-currency_name: string (nullable = true)
 |-- Developed / Developing Countries: string (nullable = true)
 |-- UNTERM Russian Formal: string (nullable = true)
 |-- UNTERM English Short: string (nullable = true)
 |-- ISO42

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

#### Identifying data quality issues in each of the data source

In this step I will be looking out for:
* Duplicates
* Nulls
* Incorrect Data formats

##### Cleaning Immigration Data

In [14]:
df_immigration2 = clean_immigration_data(spark, df_immigration)

In [15]:
df_immigration2.show(5)

+---------+-----+------+------+------+---------+----------+---------+---------+--------+------+-------+--------+------------+--------------+
|    cicid|i94yr|i94mon|i94cit|i94res|port_code|state_code|visa_type|mode_type|visapost|gender|airline|visatype|arrival_date|departure_date|
+---------+-----+------+------+------+---------+----------+---------+---------+--------+------+-------+--------+------------+--------------+
|5748517.0| 2016|     4|   245|   438|      LOS|        CA|        1|        1|     SYD|     F|     QF|      B1|  2016-04-30|    2016-05-08|
|5748518.0| 2016|     4|   245|   438|      LOS|        NV|        1|        1|     SYD|     F|     VA|      B1|  2016-04-30|    2016-05-17|
|5748519.0| 2016|     4|   245|   438|      LOS|        WA|        1|        1|     SYD|     M|     DL|      B1|  2016-04-30|    2016-05-08|
|5748520.0| 2016|     4|   245|   438|      LOS|        WA|        1|        1|     SYD|     F|     DL|      B1|  2016-04-30|    2016-05-14|
|5748521.0| 2

In [16]:
df_immigration2.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: integer (nullable = true)
 |-- i94mon: integer (nullable = true)
 |-- i94cit: integer (nullable = true)
 |-- i94res: integer (nullable = true)
 |-- port_code: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- visa_type: integer (nullable = true)
 |-- mode_type: integer (nullable = true)
 |-- visapost: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- visatype: string (nullable = true)
 |-- arrival_date: date (nullable = true)
 |-- departure_date: date (nullable = true)
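
The cleaned output above shows `arrdate` 20574.0 becoming `arrival_date` 2016-04-30. This is consistent with SAS date values, which count days from 1960-01-01. The body of `clean_immigration_data` is not shown in this notebook, so the following is only a minimal sketch of such a conversion in plain Python; `sas_to_date` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_to_date(sas_days: Optional[float]) -> Optional[date]:
    """Convert a SAS day count (e.g. 20574.0) to a calendar date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20574.0))  # 2016-04-30
print(sas_to_date(20582.0))  # 2016-05-08
```

Both results match the `arrival_date` and `departure_date` of the first row shown above.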



##### Cleaning US Cities data

In [17]:
# Checking and dropping duplicates
Clean_df_cities = clean_city_data(spark, df_cities)

In [18]:
# Checking for Nulls
Clean_df_cities.show(5)

+----------------+-------------+----------+---------------+-----------------+----------------+------------+----------------------+----------+--------------------+-----+
|            City|        State|median_age|male_population|female_population|total_population|foreign_born|average_household_size|state_code|                Race|Count|
+----------------+-------------+----------+---------------+-----------------+----------------+------------+----------------------+----------+--------------------+-----+
|   Silver Spring|     Maryland|      33.8|          40601|            41862|           82463|       30908|                   2.6|        MD|  Hispanic or Latino|25924|
|          Quincy|Massachusetts|      41.0|          44129|            49500|           93629|       32935|                  2.39|        MA|               White|58723|
|          Hoover|      Alabama|      38.5|          38040|            46799|           84839|        8229|                  2.58|        AL|              

##### Cleaning airport data

In [19]:
# dropping records with nulls in iata_code column
clean_df_airports = clean_airport_data(spark, df_airports)

In [20]:
clean_df_airports.show(5)

+---------+--------------------+--------------+-----------+---------+----------+------------+--------+--------------------+----------+
|iata_code|                name|          type|iso_country|continent|iso_region|municipality|gps_code|           port_city|port_state|
+---------+--------------------+--------------+-----------+---------+----------+------------+--------+--------------------+----------+
|      ALC|Alicante Internat...| large_airport|         ES|       EU|      ES-V|    Alicante|    LEAL|               ALCAN|        AK|
|      ANC|Ted Stevens Ancho...| large_airport|         US|       NA|     US-AK|   Anchorage|    PANC|           ANCHORAGE|        AK|
|      BAR|Qionghai Bo'ao Ai...|medium_airport|         CN|       AS|     CN-46|    Qionghai|    ZJQH|BAKER AAF - BAKER...|        AK|
|      DAC|Hazrat Shahjalal ...| large_airport|         BD|       AS|      BD-3|       Dhaka|    VGHS|       DALTONS CACHE|        AK|
|      PIZ|Point Lay LRRS Ai...|medium_airport|        

##### Cleaning Countries data

In [21]:
Clean_df_countries = clean_country_data(spark, df_countries)

In [22]:
Clean_df_countries.show()

+----+-----------------+----------------+
|code|          country|         Capital|
+----+-----------------+----------------+
|  TW|           Taiwan|          Taipei|
|  AF|      Afghanistan|           Kabul|
|  AL|          Albania|          Tirana|
|  AG|          Algeria|         Algiers|
|  AQ|   American Samoa|       Pago Pago|
|  AN|          Andorra|Andorra la Vella|
|  AO|           Angola|          Luanda|
|  AV|         Anguilla|      The Valley|
|  AY|       Antarctica|            null|
|  AC|Antigua & Barbuda|      St. John's|
|  AR|        Argentina|    Buenos Aires|
|  AM|          Armenia|         Yerevan|
|  AA|            Aruba|      Oranjestad|
|  AS|        Australia|        Canberra|
|  AU|          Austria|          Vienna|
|  AJ|       Azerbaijan|            Baku|
|  BF|          Bahamas|          Nassau|
|  BA|          Bahrain|          Manama|
|  BG|       Bangladesh|           Dhaka|
|  BB|         Barbados|      Bridgetown|
+----+-----------------+----------

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model

For the purpose of this project, I will be using a star schema. I prefer a star schema because it:

* requires simpler queries/joins during reporting
* has simplified business reporting logic
* provides query performance gains
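
To illustrate the first point, here is a small sketch of the kind of query a star schema allows: a single join from the fact table to a dimension answers "arrivals per day and airport city". It uses an in-memory SQLite database purely for illustration (the project itself uses Spark and Parquet), the tables are cut down to a few columns, and the rows are made-up sample values, not project data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_airports (iata_code TEXT PRIMARY KEY, municipality TEXT);
    CREATE TABLE fact_immigration (
        cicid INTEGER PRIMARY KEY,
        port_code TEXT REFERENCES dim_airports (iata_code),
        visa_type INTEGER,
        arrival_date TEXT
    );
""")

# Illustrative rows only (assumed values, not taken from the real tables)
conn.executemany("INSERT INTO dim_airports VALUES (?, ?)",
                 [("ANC", "Anchorage"), ("HOM", "Homer")])
conn.executemany("INSERT INTO fact_immigration VALUES (?, ?, ?, ?)", [
    (1, "ANC", 1, "2016-04-30"),
    (2, "ANC", 2, "2016-04-30"),
    (3, "HOM", 2, "2016-04-30"),
])

# One join from the fact table to one dimension is enough for the report
rows = conn.execute("""
    SELECT f.arrival_date, a.municipality, COUNT(*) AS arrivals
    FROM fact_immigration AS f
    JOIN dim_airports AS a ON a.iata_code = f.port_code
    GROUP BY f.arrival_date, a.municipality
    ORDER BY f.arrival_date, a.municipality
""").fetchall()

for row in rows:
    print(row)
```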

The data model used for this project consists of the following tables:

### Dimension Tables
* dim_airports: This table stores detailed information about each airport. This includes the name of the airport, the IATA code, the city and the country.
    iata_code 
    name        
    type        
    local_code  
    coordinates  
    city         
    elevation_ft 
    continent    
    iso_country
    iso_region  
    municipality
    gps_code   
    
* dim_cities: This table stores demographic information about each of the US cities that immigrants are likely to arrive at. This includes the name of the city, its population, the state, etc.
    city                  
    state                 
    median_age             
    male_population        
    female_population    
    total_population     
    num_veterans          
    foreign_born         
    average_household_size 
    state_code            
    race                  
    count 
    
* dim_countries: This table stores country codes, the full name of each country and its capital.
    country_code                   
    country                
    capital             

### Fact table(s)
* fact_immigration: This table stores information about the immigrants that come into the US. This includes the country they departed from, the departure date, the arrival date and city, the type of visa they arrive with, etc.
    cicid   
    year    
    month   
    cit      
    res     
    port_code    
    state_code  
    visa_type    
    mode_type   
    visapost  
    gender    
    airline   
    visatype    
    arrival_date 
    departure_date 

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model


The data pipeline for this project will be built with Apache Airflow. The task dependencies should follow this format:

##### Start process and read the data from sources

* start_process >> read_immigration_data
* start_process >> read_city_data
* start_process >> read_airports_data
* start_process >> read_countries_data


##### Cleaning the data 
* read_immigration_data >> clean_immigration_data
* read_city_data >> clean_city_data
* read_airports_data >> clean_airports_data
* read_countries_data >> clean_countries_data


##### Transforming the data
* clean_immigration_data >> transform_immigration_data
* clean_city_data >> transform_city_data
* clean_airports_data >> transform_airports_data
* clean_countries_data >> transform_countries_data


##### Loading the transformed data to the data warehouse
* transform_immigration_data >> write_fact_immigration_parquet
* transform_city_data >> write_dim_city_parquet
* transform_airports_data >> write_dim_airports_parquet
* transform_countries_data >> write_dim_countries_parquet


##### Ending the process 
* write_dim_city_parquet >> end_process
* write_dim_airports_parquet >> end_process
* write_dim_countries_parquet >> end_process
* write_fact_immigration_parquet >> end_process
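
The dependencies above form a directed acyclic graph. As a quick sanity check outside Airflow, the sketch below encodes the same edges in plain Python and uses the standard library's `graphlib` (Python 3.9+) to confirm there are no cycles and to produce one valid execution order. It is not Airflow code; in an Airflow DAG the same edges would be declared with the `>>` operator between tasks, as written in the lists above.

```python
from graphlib import TopologicalSorter

sources = ["immigration", "city", "airports", "countries"]

# Parquet write tasks, named as in the loading step above
targets = {
    "immigration": "write_fact_immigration_parquet",
    "city": "write_dim_city_parquet",
    "airports": "write_dim_airports_parquet",
    "countries": "write_dim_countries_parquet",
}

# Map each task to the set of tasks it depends on
deps = {"start_process": set(), "end_process": set()}
for name in sources:
    chain = ["start_process", f"read_{name}_data", f"clean_{name}_data",
             f"transform_{name}_data", targets[name], "end_process"]
    for upstream, downstream in zip(chain, chain[1:]):
        deps.setdefault(downstream, set()).add(upstream)

# static_order() raises graphlib.CycleError if the edges contain a cycle
order = list(TopologicalSorter(deps).static_order())
print(order[0], "...", order[-1])
```

Because the four source branches never depend on each other, a scheduler is free to run them in parallel between `start_process` and `end_process`.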

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [23]:
Clean_df_cities.write.parquet("dim_city", mode='overwrite')
clean_df_airports.write.parquet("dim_airport", mode='overwrite')
Clean_df_countries.write.parquet("dim_country", mode='overwrite')
df_immigration2.write.parquet("fact_immigrations", mode='overwrite')

#### 4.2 Data Quality Checks
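For data quality checks, I verify that each table has the expected row count and that the columns joining the fact and dimension tables hold the same kind of data.

Run Quality Checks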

In [24]:
# make sure each table has the expected row count: 2891 expected for dim_city
check_data(spark, 'dim_city')

# make sure each table has the expected row count: 557 expected for dim_airport
check_data(spark, 'dim_airport')

# make sure each table has the expected row count: 250 expected for dim_country
check_data(spark, 'dim_country')

# make sure each table has the expected row count: 3096313 expected for fact_immigrations
check_data(spark, 'fact_immigrations')


+--------+
|count(1)|
+--------+
|    2891|
+--------+

None
+--------+
|count(1)|
+--------+
|     557|
+--------+

None
+--------+
|count(1)|
+--------+
|     250|
+--------+

None
+--------+
|count(1)|
+--------+
| 3096313|
+--------+

None
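
The body of `check_data` lives in `check.py` and is not shown in this notebook; judging from the output, it prints a row count per table. A minimal sketch of how the expected counts from the comments above could be asserted automatically is given below. `validate_counts` is a hypothetical helper, and the `actual` dictionary stands in for values that would come from something like `spark.read.parquet(table).count()`.

```python
EXPECTED_ROWS = {
    "dim_city": 2891,
    "dim_airport": 557,
    "dim_country": 250,
    "fact_immigrations": 3096313,
}


def validate_counts(actual, expected=EXPECTED_ROWS):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    for table, want in expected.items():
        got = actual.get(table)
        if got is None:
            problems.append(f"{table}: no row count available")
        elif got != want:
            problems.append(f"{table}: expected {want} rows, found {got}")
    return problems


# Counts as printed in the output above
actual = {
    "dim_city": 2891,
    "dim_airport": 557,
    "dim_country": 250,
    "fact_immigrations": 3096313,
}
print(validate_counts(actual))  # []
```

Returning a list of problems rather than raising on the first mismatch lets a single run report every failing table.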


In [31]:
# checking that all fact columns joined with the dimension tables have correct values.
dim_city = spark.read.parquet("dim_city/")
dim_airport = spark.read.parquet("dim_airport/")
dim_country = spark.read.parquet("dim_country/")
fact_immigrations = spark.read.parquet("fact_immigrations/")

In [47]:
## State code relationship between dim_city and fact_immigrations

dim_city.select('state_code').show(10)
fact_immigrations.select('state_code').show(10)

+----------+
|state_code|
+----------+
|        MD|
|        MA|
|        AL|
|        CA|
|        NJ|
|        IL|
|        AZ|
|        CA|
|        MO|
|        NC|
+----------+
only showing top 10 rows

+----------+
|state_code|
+----------+
|        CA|
|        NV|
|        WA|
|        WA|
|        WA|
|        HI|
|        HI|
|        HI|
|        FL|
|        CA|
+----------+
only showing top 10 rows



The output above shows that the state_code field in fact_immigrations and the state_code field in dim_city hold the same type of data.

In [48]:
# country code relationship between dim_airport and dim_country
dim_airport.select("iso_country").show(10)
dim_country.select("code").show(10)

+-----------+
|iso_country|
+-----------+
|         ES|
|         US|
|         CN|
|         BD|
|         US|
|         US|
|         ET|
|         AU|
|         US|
|         IN|
+-----------+
only showing top 10 rows

+----+
|code|
+----+
|  TW|
|  AF|
|  AL|
|  AG|
|  AQ|
|  AN|
|  AO|
|  AV|
|  AY|
|  AC|
+----+
only showing top 10 rows



This shows that the two columns hold the same kind of data.

In [58]:
# port_code relationship between dim_airport and fact_immigrations
dim_airport.select("iata_code").show(10)
fact_immigrations.select('port_code').show(10)

+---------+
|iata_code|
+---------+
|      ALC|
|      ANC|
|      BAR|
|      DAC|
|      PIZ|
|      DTH|
|      EGL|
|      FRB|
|      HOM|
|      HYD|
+---------+
only showing top 10 rows

+---------+
|port_code|
+---------+
|      LOS|
|      LOS|
|      LOS|
|      LOS|
|      LOS|
|      HHW|
|      HHW|
|      HHW|
|      HOU|
|      LOS|
+---------+
only showing top 10 rows



The two columns hold the same kind of data.
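
Comparing the first ten rows by eye shows only that the columns look alike. A stricter check asks whether every key in the fact table actually exists in the dimension table. The sketch below shows that logic in plain Python on the sample values printed above; `orphan_keys` is a hypothetical helper. On the full tables, the same idea could be expressed in Spark with a `left_anti` join of the fact table against the dimension.

```python
def orphan_keys(fact_keys, dim_keys):
    """Return the sorted fact keys that have no match among the dimension keys."""
    return sorted(set(fact_keys) - set(dim_keys))


# First ten values of each column, copied from the output above
iata_codes = ["ALC", "ANC", "BAR", "DAC", "PIZ", "DTH", "EGL", "FRB", "HOM", "HYD"]
port_codes = ["LOS", "LOS", "LOS", "LOS", "LOS", "HHW", "HHW", "HHW", "HOU", "LOS"]

print(orphan_keys(port_codes, iata_codes))  # ['HHW', 'HOU', 'LOS']
```

These ten-row samples are far too small to say anything about the full tables; the point is only the shape of the check.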

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.


### Airports 

* IDENT - Identification code
* TYPE - Type of the airport
* NAME - Name of the Airport
* ELEVATION_FT - Elevation above the sea level in feet
* CONTINENT - Continent code
* ISO_COUNTRY - Country code according to ISO **This has a foreign key relationship with code in the dim_country table
* ISO_REGION - Region code according to ISO
* MUNICIPALITY - Municipality where the airport is located
* GPS_CODE - GPS code
* IATA_CODE - Code of the airport assigned by International Air Transport Association  **Primary Key
* LOCAL_CODE - Local code of the airport
* COORDINATES - GPS coordinates - longitude and latitude  

### Cities
* STATE_CODE - Two-letter code of the state **Primary Key
* STATE - Name of the state
* MEDIAN_AGE - Median age in the state (estimation)
* AVERAGE_HOUSEHOLD_SIZE - Average number of people living in a household in the state (estimation)
* TOTAL_POPULATION - Number of citizens
* FEMALE_POPULATION - Number of female citizens
* MALE_POPULATION - Number of male citizens
* NUMBER_OF_VETERANS - Number of veteran citizens
* BLACK_OR_AFRICAN_AMERICAN - Number of citizens belonging to this ethnic group
* HISPANIC_OR_LATINO - Number of citizens belonging to this ethnic group
* ASIAN - Number of citizens belonging to this ethnic group
* AMERICAN_INDIAN_AND_ALASKA_NATIVE - Number of citizens belonging to this ethnic group
* WHITE - Number of citizens belonging to this ethnic group
* FOREIGN_BORN - Number of citizens born outside of US 
    
### Countries
* COUNTRY_CODE   - Country Code **Primary Key           
* COUNTRY      - Country name          
* CAPITAL     - Country's capital

### I94 Immigration 
* CICID - Record ID
* I94YR - 4 digit year
* I94MON - Numeric month
* I94CIT - Country of citizenship
* I94RES - Country of residence
* I94PORT - Airport of admittance into the USA  **This has a foreign key relationship with dim_airport
* ARRDATE - Arrival date in the USA
* I94MODE - Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
* I94ADDR - State of arrival **This has a foreign key relationship with state_code in the dim_city table
* DEPDATE - Departure date
* I94BIR - Age of the visitor
* I94VISA - Visa codes: (1 = Business; 2 = Pleasure; 3 = Student)
* DTADFILE - Character date field
* GENDER - Gender of the visitor
* VISAPOST - Department of State where the visa was issued
* FLTNO - Flight number of Airline used to arrive in U.S.
* VISATYPE - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
* arrival_year - Numeric year of the arrival (used for data partitioning)
* arrival_month - Numeric month of the arrival (used for data partitioning)
* arrival_day - Numeric day of the arrival (used for data partitioning)
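
The coded fields above can be turned into readable labels with simple lookup tables. The sketch below uses the `I94MODE` and `I94VISA` codes listed in this dictionary; `decode` is a hypothetical helper, and the `"Unknown"` fallback for unlisted or missing codes is an assumption.

```python
# Code-to-label mappings taken from the data dictionary above
I94MODE_LABELS = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}
I94VISA_LABELS = {1: "Business", 2: "Pleasure", 3: "Student"}


def decode(code, labels, default="Unknown"):
    """Map a numeric code (stored as a double in the raw data) to its label."""
    if code is None:
        return default
    return labels.get(int(code), default)


print(decode(1.0, I94MODE_LABELS))  # Air
print(decode(2.0, I94VISA_LABELS))  # Pleasure
```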

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project. I first tried running this project with pandas DataFrames, but because of the volume of data in the I94 immigration data file, processing took a very long time. I decided to use Apache Spark because it processed the data faster.
* Propose how often the data should be updated and why. I plan to update the data once daily. I chose this approach because it lets me load only finalized datasets into the data warehouse, and because the Analyst team is not interested in analyzing the current day's data.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x. If the data increased by 100x, I would consider scaling the whole pipeline horizontally by adding new nodes, or moving Spark to cluster mode using a cluster manager such as YARN.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day. This shouldn't be a challenge, as the entire process currently runs in less than 20 minutes.
 * The database needed to be accessed by 100+ people. Once the data is ready to be consumed, it would be stored in an Amazon Redshift cluster, a PostgreSQL-based data warehouse that easily supports multi-user access.