
### Data Engineering Capstone Project

#### Project Summary

This project builds an ETL pipeline for the I94 immigration, global land temperatures, and US demographics datasets to form an analytics database on immigration events.
A use case for this database is finding immigration patterns to the US. For example, we can answer questions such as:
Do people from countries with warmer or colder climates immigrate to the US in larger numbers?


## Import Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import configparser
import datetime as dt

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql import SQLContext
from pyspark.sql.functions import isnan, when, count, col, udf, dayofmonth, dayofweek, month, year, weekofyear
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import *

In [None]:
import plotly.plotly as py
import plotly.graph_objs as go
import requests
requests.packages.urllib3.disable_warnings()

import utilities
import etl_functions

import importlib
importlib.reload(utilities)
from utilities import visualize_missing_values, clean_immigration_data, clean_temperature_data
from utilities import clean_demographics_data, print

### Loading Configuration Data

In [None]:
config = configparser.ConfigParser()
config.read('config.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']
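
The cell above expects a `config.cfg` file with an `[AWS]` section holding the two keys. A minimal sketch of that layout, parsed here from an inline string rather than from disk; the values are placeholders and `load_aws_credentials` is a hypothetical helper name, not part of the project code.

```python
import configparser

# Assumed layout of config.cfg; the values are placeholders, not real keys.
SAMPLE_CONFIG = """
[AWS]
AWS_ACCESS_KEY_ID = your-access-key-id
AWS_SECRET_ACCESS_KEY = your-secret-access-key
"""


def load_aws_credentials(config_text):
    """Hypothetical helper: parse config text and return the two AWS settings."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Raises KeyError if the [AWS] section or either option is missing.
    aws = parser["AWS"]
    return {
        "AWS_ACCESS_KEY_ID": aws["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": aws["AWS_SECRET_ACCESS_KEY"],
    }


credentials = load_aws_credentials(SAMPLE_CONFIG)
```

Failing early with a `KeyError` when a key is absent is usually easier to debug than a Spark job that fails later on an S3 read.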

### Create a Spark Session

In [None]:
spark = SparkSession.builder.\
    config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11").\
    enableHiveSupport().getOrCreate()

# Step 1: Project scope and Gathering Data

## Project Scope 
<hr></hr>

To create the analytics database, the following steps will be carried out:
* Using Spark to load the data into dataframes.
* Exploratory data analysis of I94 immigration dataset to identify missing values and strategies for data cleaning.
* Exploratory data analysis of demographics dataset to identify missing values and strategies for data cleaning.
* Exploratory data analysis of global land temperatures by city dataset to identify missing values and strategies for data cleaning.
* Create dimension tables.
    * Create immigration calendar dimension table from I94 immigration dataset, linked to fact table with arrdate field.
    * Create country dimension table from the I94 immigration and the global temperatures dataset. The global land temperatures data was aggregated at country level. The table links to the fact table through the country of residence code allowing analysts to understand correlation between country of residence climate and immigration to US states. 
    * Create usa demographics dimension table from the us cities demographics data. This table links to the fact table through the state code field. 
    
* Create fact table from the clean I94 immigration dataset and the visa_type dimension.

<p>
The technology stack used in this project is <b>Amazon S3, Apache Spark </b>.
</p>

While the whole project has been implemented in this notebook, provision has been made to run the ETL on a Spark cluster through etl.py. The etl.py script reads data from S3 and creates fact and dimension tables through Spark that are loaded back into S3.

## Data Descriptions
---

#### I94 Immigration Data: Data Description 
<hr style="background-color: #b7d0e2;"/> 

This data comes from the US National Tourism and Trade Office. In the past, all foreign visitors to the U.S. arriving via air or sea were required to complete the paper Customs and Border Protection Form I-94 Arrival/Departure Record or Form I-94W Nonimmigrant Visa Waiver Arrival/Departure Record, and this dataset comes from these forms.

This dataset forms the core of the data warehouse. The customer repository holds a year's worth of data for 2016, divided by month. For this project the data is in a folder located at ../../data/18-83510-I94-Data-2016/. Each month's data is stored in the SAS binary database storage format <i>sas7bdat</i>. For this project we have chosen to work with data for the month of April. However, the data extraction, transformation and loading utility functions have been designed to work with any month's worth of data.

##### Loading I94 Immigration dataset

In [None]:
# read in the data
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
immigration_df =spark.read.format('com.github.saurfang.sas.spark').load(fname)

In [None]:
# display the first five records
immigration_df.limit(5).toPandas()

In [None]:
# counts the total number of records
print(immigration_df.count())

In [None]:
immigration_df.select("visapost").dropDuplicates().show(5)

<b><i>Immigration Data dictionary</i></b>

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">cicid</td><td class="tg-0pky">Unique record ID</td>
 <tr><td class="tg-0pky">i94yr</td><td class="tg-0pky">4 digit year</td>
 <tr><td class="tg-0pky">i94mon</td><td class="tg-0pky">Numeric month</td>
 <tr><td class="tg-0pky">i94cit</td><td class="tg-0pky">3 digit code for immigrant country of birth</td>
 <tr><td class="tg-0pky">i94res</td><td class="tg-0pky">3 digit code for immigrant country of residence </td>
 <tr><td class="tg-0pky">i94port</td><td class="tg-0pky">Port of admission</td>
 <tr><td class="tg-0pky">arrdate</td><td class="tg-0pky">Arrival Date in the USA</td>
 <tr><td class="tg-0pky">i94mode</td><td class="tg-0pky">Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)</td>
 <tr><td class="tg-0pky">i94addr</td><td class="tg-0pky">USA State of arrival</td>
 <tr><td class="tg-0pky">depdate</td><td class="tg-0pky">Departure Date from the USA</td>
 <tr><td class="tg-0pky">i94bir</td><td class="tg-0pky">Age of Respondent in Years</td>
 <tr><td class="tg-0pky">i94visa</td><td class="tg-0pky">Visa codes collapsed into three categories</td>
 <tr><td class="tg-0pky">count</td><td class="tg-0pky">Field used for summary statistics</td>
 <tr><td class="tg-0pky">dtadfile</td><td class="tg-0pky">Character Date Field - Date added to I-94 Files</td>
 <tr><td class="tg-0pky">visapost</td><td class="tg-0pky">Department of State where Visa was issued </td>
 <tr><td class="tg-0pky">occup</td><td class="tg-0pky">Occupation that will be performed in U.S</td>
 <tr><td class="tg-0pky">entdepa</td><td class="tg-0pky">Arrival Flag - admitted or paroled into the U.S.</td>
 <tr><td class="tg-0pky">entdepd</td><td class="tg-0pky">Departure Flag - Departed, lost I-94 or is deceased</td>
 <tr><td class="tg-0pky">entdepu</td><td class="tg-0pky">Update Flag - Either apprehended, overstayed, adjusted to perm residence</td>
 <tr><td class="tg-0pky">matflag</td><td class="tg-0pky">Match flag - Match of arrival and departure records</td>
 <tr><td class="tg-0pky">biryear</td><td class="tg-0pky">4 digit year of birth</td>
 <tr><td class="tg-0pky">dtaddto</td><td class="tg-0pky">Character Date Field - Date to which admitted to U.S. (allowed to stay until)</td>
 <tr><td class="tg-0pky">gender</td><td class="tg-0pky">Non-immigrant sex</td>
 <tr><td class="tg-0pky">insnum</td><td class="tg-0pky">INS number</td>
 <tr><td class="tg-0pky">airline</td><td class="tg-0pky">Airline used to arrive in U.S.</td>
 <tr><td class="tg-0pky">admnum</td><td class="tg-0pky">Admission Number</td>
 <tr><td class="tg-0pky">fltno</td><td class="tg-0pky">Flight number of Airline used to arrive in U.S.</td>
 <tr><td class="tg-0pky">visatype</td><td class="tg-0pky">Class of admission legally admitting the non-immigrant to temporarily stay in U.S.</td>
</table>

#### World Temperature Data: Data Description 

<hr style="background-color: #b7d0e2;"/> 
The World Temperature dataset represents global land temperatures by city.

##### Load World Temperature data

In [None]:
file_name = '../../data_samples/GlobalLandTempretureByCity_sample.csv'
temperature_df = spark.read.csv(file_name, header=True, inferSchema=True)

In [None]:
# Show the first five records
temperature_df.limit(5).toPandas()

<b><i>Data dictionary</i></b>

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">dt</td><td class="tg-0pky">Date</td>
 <tr><td class="tg-0pky">AverageTemperature</td><td class="tg-0pky">Global average land temperature in celsius</td>
 <tr><td class="tg-0pky">AverageTemperatureUncertainty</td><td class="tg-0pky">95% confidence interval around the average</td>
 <tr><td class="tg-0pky">City</td><td class="tg-0pky">Name of City</td>
 <tr><td class="tg-0pky">Country</td><td class="tg-0pky">Name of Country</td>
 <tr><td class="tg-0pky">Latitude</td><td class="tg-0pky">City Latitude</td>
 <tr><td class="tg-0pky">Longitude</td><td class="tg-0pky">City Longitude</td>
</table>

In [None]:
# The total number of records
print(temperature_df.count())

#### U.S. City Demographic Data: Data Description 
<hr style="background-color: #b7d0e2;"/> 

This data contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000.

##### Load the dataset

In [None]:
file_name = "../../data_samples/us-cities-demographics.csv"
demographics_df = spark.read.csv(file_name, inferSchema=True, header=True, sep=';')

In [None]:
# The first five records
demographics_df.limit(5).toPandas()

<b><i>Data dictionary</i></b>

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">City</td><td class="tg-0pky">City Name</td>
 <tr><td class="tg-0pky">State</td><td class="tg-0pky">US State where city is located</td>
 <tr><td class="tg-0pky">Median Age</td><td class="tg-0pky">Median age of the population</td>
 <tr><td class="tg-0pky">Male Population</td><td class="tg-0pky">Count of male population</td>
 <tr><td class="tg-0pky">Female Population</td><td class="tg-0pky">Count of female population</td>
 <tr><td class="tg-0pky">Total Population</td><td class="tg-0pky">Count of total population</td>
 <tr><td class="tg-0pky">Number of Veterans</td><td class="tg-0pky">Count of total Veterans</td>
 <tr><td class="tg-0pky">Foreign-born</td><td class="tg-0pky">Count of residents of the city that were not born in the US</td>
 <tr><td class="tg-0pky">Average Household Size</td><td class="tg-0pky">Average city household size</td>
 <tr><td class="tg-0pky">State Code</td><td class="tg-0pky">Code of the US state</td>
 <tr><td class="tg-0pky">Race</td><td class="tg-0pky">Respondent race</td>
 <tr><td class="tg-0pky">Count</td><td class="tg-0pky">Count of the city's residents per race</td>
</table>

In [None]:
# check the total number of records
print(demographics_df.count())

# Step 2: Explore and Assess the Data

#### Exploratory Data Analysis: Immigration data
---

In [None]:
# list all files in the customer repository
files = os.listdir('../../data/18-83510-I94-Data-2016/')
files

In [None]:
# lets see the dataframe schema
immigration_df.printSchema()

#### Visualize Missing Values
---

In [None]:
#Visualize missing values
utilities.visualize_missing_values_spark(immigration_df)

#### Data Cleaning steps: 

* Drop columns in which more than 90% of the values are missing (assumed threshold: 90%).
* Drop rows in which all values are missing. Such a row carries no information, so it does not need to be considered (assumed threshold: 100%).

#### Drop columns with significant missing values

In [None]:
# columns with over 90% missing values
cols = ['occup', 'entdepu','insnum']

# drop the columns that are mostly NaN
immigration_df_dropped_cols = immigration_df.drop(*cols)
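
The column list above is hard-coded from the missing-values chart. As a rough sketch of how the same selection could be derived from a threshold, the plain-Python example below works on rows represented as dictionaries; the function name and the sample records are made up for illustration, and on the real dataset the counts would come from a Spark aggregation instead.

```python
def columns_above_missing_threshold(rows, threshold=0.9):
    """Return the columns whose share of missing (None) values exceeds threshold."""
    if not rows:
        return []
    total = len(rows)
    flagged = []
    for column in rows[0].keys():
        # count the rows where this column has no value
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing / total > threshold:
            flagged.append(column)
    return flagged


# Made-up sample records, for illustration only.
sample_rows = [
    {"cicid": 1.0, "occup": None, "gender": "M"},
    {"cicid": 2.0, "occup": None, "gender": None},
    {"cicid": 3.0, "occup": None, "gender": "F"},
]

to_drop = columns_above_missing_threshold(sample_rows, threshold=0.9)
```

Here only `occup` is flagged, because it is missing in every sample row while `gender` is missing in one of three.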

In [None]:
# display the new schema
immigration_df_dropped_cols.printSchema()

In [None]:
# drop duplicate entries
immigration_df_dropped_cols = immigration_df_dropped_cols.dropDuplicates(['cicid'])

In [None]:
# get a count after dropping duplicates
print(immigration_df_dropped_cols.count())

* Data has no duplicates

In [None]:
# drop rows with missing values
immigration_df_dropped_cols = immigration_df_dropped_cols.dropna(how='all', subset=['cicid'])

In [None]:
# get a count after dropping rows with missing values
print(immigration_df_dropped_cols.count())

> Immigration data has no rows with missing record ID

In [None]:
# clean the immigration dataframe
immigration_cleansed_df = utilities.clean_spark_immigration_data(immigration_df)

## Exploratory Data Analysis: World Temperature Data
---

In [None]:
# print dataframe schema
temperature_df.printSchema()

#### Visualize Missing Values
---

In [None]:
# visualize missing values by columns 
temperature_df_new = temperature_df.withColumn("dt",col("dt").cast(StringType())) # convert dt column dataType to string
utilities.visualize_missing_values_spark(temperature_df_new)

#### Data Cleaning Steps

-  Drop all rows with missing average temperature
-  Drop duplicate rows 

In [None]:
# clean the data
new_temperature_df = utilities.clean_spark_temperature_data(temperature_df)
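
The body of `utilities.clean_spark_temperature_data` is not shown in this notebook. As a hedged sketch of the two cleaning steps listed above, the plain-Python function below drops records without an average temperature and then removes repeats. Treating a duplicate as a repeated (`dt`, `City`, `Country`) combination is an assumption, and the function name and sample records are made up.

```python
def clean_temperature_rows(rows):
    """Drop rows lacking AverageTemperature, then drop repeated (dt, City, Country) rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # step 1: skip records with a missing average temperature
        if row["AverageTemperature"] is None:
            continue
        # step 2: keep only the first record per (assumed) duplicate key
        key = (row["dt"], row["City"], row["Country"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up sample records, for illustration only.
sample_temperatures = [
    {"dt": "2000-01-01", "City": "A", "Country": "X", "AverageTemperature": 5.0},
    {"dt": "2000-01-01", "City": "A", "Country": "X", "AverageTemperature": 5.0},
    {"dt": "2000-02-01", "City": "A", "Country": "X", "AverageTemperature": None},
    {"dt": "2000-02-01", "City": "B", "Country": "Y", "AverageTemperature": 7.5},
]

cleaned_temperatures = clean_temperature_rows(sample_temperatures)
```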

## Exploratory Data Analysis: U.S. City Demographic Data 
---

In [None]:
# count the number of records in dataset
print(demographics_df.count())

In [None]:
# print dataframe schema
demographics_df.printSchema()

#### Visualize Missing Values
---

In [None]:
utilities.visualize_missing_values_spark(demographics_df)

In [None]:
# View columns with missing data
df_with_nulls = pd.DataFrame(data= demographics_df.toPandas().isnull().sum(), columns=['values'])
df_with_nulls = df_with_nulls.reset_index()
df_with_nulls.columns = ['cols', 'values']

# calculate % missing values
df_with_nulls['% missing values'] = 100*df_with_nulls['values']/demographics_df.count()
df_with_nulls[df_with_nulls['% missing values']>0]

#### Data Cleaning Steps

-  The demographics dataset has very few missing values
-  Drop duplicate rows 

In [None]:
# clean demographics data
demographics_final_df = utilities.clean_spark_demographics_data(demographics_df)

# Step 3: Defining the Data Model
#### 3.1 Data Model (Star schema)

![Database schema](images/data_model.png)

The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.

The US demographics dimension table comes from the demographics dataset and is linked to the immigration fact table at US state level. This dimension would allow analysts to get insights and KPIs into migration patterns into the US based on demographics as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One envisions a dashboard that could be designed based on the data model, with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at state level. 

The visa type dimension table comes from the immigration dataset and links to the immigration fact table through the visa_type_key. 

The immigration fact table data comes from the immigration dataset and contains keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table. 
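
To make the "do populous states attract more visitors" question concrete, here is a small plain-Python sketch of the join between the fact table and the demographics dimension on `state_code`. The function name, state codes, and numbers are made up for illustration. Because the demographics data holds one row per city and race, each city's total population is kept only once before summing.

```python
from collections import Counter


def arrivals_and_population_by_state(fact_rows, demographics_rows):
    """Count arrivals per state and attach the summed city populations."""
    arrivals = Counter(row["state_code"] for row in fact_rows)

    # One row per city and race: keep each city's total population once.
    city_population = {
        (row["state_code"], row["City"]): row["total_population"]
        for row in demographics_rows
    }
    population = Counter()
    for (state_code, _city), total in city_population.items():
        population[state_code] += total

    return {
        state: {"arrivals": count, "population": population.get(state, 0)}
        for state, count in arrivals.items()
    }


# Made-up toy rows standing in for the fact and dimension tables.
toy_facts = [
    {"record_id": 1, "state_code": "AA"},
    {"record_id": 2, "state_code": "AA"},
    {"record_id": 3, "state_code": "BB"},
]
toy_demographics = [
    {"state_code": "AA", "City": "P", "Race": "r1", "total_population": 300},
    {"state_code": "AA", "City": "P", "Race": "r2", "total_population": 300},
    {"state_code": "AA", "City": "Q", "Race": "r1", "total_population": 200},
    {"state_code": "BB", "City": "R", "Race": "r1", "total_population": 400},
]

summary = arrivals_and_population_by_state(toy_facts, toy_demographics)
```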


#### 3.2 Mapping Out Data Pipelines
The pipeline steps are as follows:

* Load the datasets
* Clean the I94 immigration data to create a Spark dataframe for each month
* Create the visa_type dimension table
* Create the calendar dimension table
* Extract and clean the global temperatures data
* Create the country dimension table
* Create the immigration fact table
* Load the demographics data
* Clean the demographics data
* Create the demographics dimension table

# Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

##### Create the immigration calendar dimension table
---

In [None]:
def create_immigration_calendar_dim(df, output_data):
    """This function creates an immigration calendar based on arrival date key
    
    :param df: spark dataframe of immigration events
    :param output_data: path to write dimension dataframe to
    :return: spark dataframe representing calendar dimension
    """
    # Convert arrival date in SAS format to datetime object
    get_datetime = udf(lambda x: (dt.datetime(1960, 1, 1).date() + dt.timedelta(x)).isoformat() if x else None)
    
    # create initial calendar df from arrdate column
    calendar_df = df.select(['arrdate']).withColumn("arrdate", get_datetime(df.arrdate)).distinct()
    
    # expand df by adding other calendar columns
    calendar_df = calendar_df.withColumn('arrival_day', dayofmonth('arrdate'))
    calendar_df = calendar_df.withColumn('arrival_week', weekofyear('arrdate'))
    calendar_df = calendar_df.withColumn('arrival_month', month('arrdate'))
    calendar_df = calendar_df.withColumn('arrival_year', year('arrdate'))
    calendar_df = calendar_df.withColumn('arrival_weekday', dayofweek('arrdate'))

    # create an id field in calendar df
    calendar_df = calendar_df.withColumn('id', monotonically_increasing_id())
    
    # write the calendar dimension to parquet file
    partition_columns = ['arrival_year', 'arrival_month', 'arrival_week']
    calendar_df.write.parquet(output_data + "immigration_calendar", partitionBy=partition_columns, mode="overwrite")
    
    return calendar_df

In [None]:
output_data = "tables/"

In [None]:
calendar_df = create_immigration_calendar_dim(immigration_cleansed_df, output_data)
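
The `arrdate` field stores dates as a count of days since 1960-01-01, the SAS date epoch. The sketch below reproduces the conversion and the derived calendar columns in plain Python so the logic can be checked without a Spark session. The helper names are hypothetical, and the weekday is shifted to match Spark's `dayofweek` convention (Sunday = 1, Saturday = 7).

```python
import datetime as dt

SAS_EPOCH = dt.date(1960, 1, 1)


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a date, or None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + dt.timedelta(days=int(sas_days))


def calendar_attributes(day):
    """Derive the calendar dimension columns for a single date."""
    _iso_year, iso_week, iso_weekday = day.isocalendar()
    return {
        "arrdate": day.isoformat(),
        "arrival_day": day.day,
        "arrival_week": iso_week,          # ISO 8601 week number
        "arrival_month": day.month,
        "arrival_year": day.year,
        # ISO weekday is Monday=1..Sunday=7; shift to Sunday=1..Saturday=7
        "arrival_weekday": iso_weekday % 7 + 1,
    }


calendar_row = calendar_attributes(sas_to_date(20545.0))
```

The value 20545 corresponds to 2016-04-01, a Friday in ISO week 13. Testing for `None` explicitly, rather than for truthiness, means a value of 0 still maps to 1960-01-01.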

##### Create the country dimension table
---

In [None]:
def create_country_dim_table(df, temp_df, output_data):
    """This function creates a country dimension from the immigration and global land temperatures data.
    
    :param df: spark dataframe of immigration events
    :param temp_df: spark dataframe of global land temperatures data.
    :param output_data: path to write dimension dataframe to
    :return: spark dataframe representing country dimension
    """
    # get the aggregated temperature data
    agg_temp = utilities.aggregate_temperature_data(temp_df).toPandas()
    # load the i94res to country mapping data
    mapping_codes = pd.read_csv('i94res.csv')
    
    @udf('string')
    def get_country_average_temperature(name):
        print("Processing: ", name)
        avg_temp = agg_temp[agg_temp['Country']==name]['average_temperature']
        
        if not avg_temp.empty:
            return str(avg_temp.iloc[0])
        
        return None
    
    @udf()
    def get_country_name(code):
        names = mapping_codes[mapping_codes['code']==code]['Name']
        
        # guard against codes that are missing from the mapping file
        if not names.empty:
            return names.iloc[0].title()
        return None
        
    # select and rename i94res column
    dim_df = df.select(['i94res']).distinct() \
                .withColumnRenamed('i94res', 'country_code')
    
    # create country_name column
    dim_df = dim_df.withColumn('country_name', get_country_name(dim_df.country_code))
    
    # create average_temperature column
    dim_df = dim_df.withColumn('average_temperature', get_country_average_temperature(dim_df.country_name))
    
    # write the dimension to a parquet file
    dim_df.write.parquet(output_data + "country", mode="overwrite")
    
    return dim_df

In [None]:
country_dim_f = create_country_dim_table(immigration_cleansed_df, new_temperature_df, output_data)

In [None]:
country_dim_f.show(5)
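
The country dimension is built in two lookups: residence code to country name, then country name to average temperature. The plain-Python sketch below mirrors that logic with dictionaries so the handling of unmatched codes is easy to see. All function names, codes, country names, and temperatures are made up for illustration, and the actual aggregation lives in `utilities.aggregate_temperature_data`, which is not shown here.

```python
from collections import defaultdict
from statistics import mean


def average_temperature_by_country(temperature_rows):
    """Aggregate city-level records to one average temperature per country."""
    grouped = defaultdict(list)
    for row in temperature_rows:
        grouped[row["Country"]].append(row["AverageTemperature"])
    return {country: mean(values) for country, values in grouped.items()}


def build_country_dimension(codes, code_to_name, avg_by_country):
    """Build one dimension row per distinct residence code."""
    dimension = []
    for code in sorted(set(codes)):
        name = code_to_name.get(code)
        name = name.title() if name else None
        dimension.append({
            "country_code": code,
            "country_name": name,
            # None when the code or the country has no match
            "average_temperature": avg_by_country.get(name),
        })
    return dimension


# Made-up inputs, for illustration only.
toy_temperatures = [
    {"Country": "Alpha", "City": "A1", "AverageTemperature": 10.0},
    {"Country": "Alpha", "City": "A2", "AverageTemperature": 20.0},
]
toy_mapping = {1: "ALPHA", 2: "BETA"}

country_dimension = build_country_dimension(
    [2, 1, 1, 3], toy_mapping, average_temperature_by_country(toy_temperatures)
)
```

On the real data, a join between the two Spark dataframes would express the same lookup without collecting the temperature data to the driver.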

##### Create the visa type dimension table
---

In [None]:
def create_visa_type_dimension_table(df, output_data):
    """This function creates a visa type dimension from the immigration data.
    
    :param df: spark dataframe of immigration events
    :param output_data: path to write dimension dataframe to
    :return: spark dataframe representing visa type dimension
    """
    # create visatype df from visatype column
    visatype_df = df.select(['visatype']).distinct()
    
    # add an id column
    visatype_df = visatype_df.withColumn('visa_type_key', monotonically_increasing_id())
    
    # write dimension to parquet file
    visatype_df.write.parquet(output_data + "visatype", mode="overwrite")
    
    return visatype_df

def get_visa_type_dimension(output_data):
    return spark.read.parquet(output_data + "visatype")

In [None]:
# test create visa_type dimension function
visatype_df = create_visa_type_dimension_table(immigration_cleansed_df, output_data)
visatype_df.show(n=5)
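
Note that `monotonically_increasing_id` guarantees unique, increasing values but not consecutive ones, so the generated keys can look arbitrary. If small, reproducible keys are preferred, one option is to number the sorted distinct visa types. The sketch below shows the idea in plain Python; the function name is hypothetical and the visa type values are only sample inputs.

```python
def build_visa_type_keys(visa_types):
    """Assign a consecutive integer key to each distinct, non-null visa type."""
    distinct = sorted({value for value in visa_types if value is not None})
    return {visa_type: key for key, visa_type in enumerate(distinct)}


# Sample values for illustration only.
visa_keys = build_visa_type_keys(["B2", "WT", "B1", "B2", None, "WT"])

# Looking up the key for a fact row; unknown types map to None.
key_for_b2 = visa_keys.get("B2")
key_for_unknown = visa_keys.get("ZZ")
```

Sorting before numbering makes the keys stable across runs as long as the set of visa types does not change.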

##### Create the demographics dimension table
---

In [None]:
def create_demographics_dimension_table(df, output_data):
    """This function creates a us demographics dimension table from the us cities demographics data.
    
    :param df: spark dataframe of us demographics survey data
    :param output_data: path to write dimension dataframe to
    :return: spark dataframe representing demographics dimension
    """
    dim_df = df.withColumnRenamed('Median Age','median_age') \
            .withColumnRenamed('Male Population', 'male_population') \
            .withColumnRenamed('Female Population', 'female_population') \
            .withColumnRenamed('Total Population', 'total_population') \
            .withColumnRenamed('Number of Veterans', 'number_of_veterans') \
            .withColumnRenamed('Foreign-born', 'foreign_born') \
            .withColumnRenamed('Average Household Size', 'average_household_size') \
            .withColumnRenamed('State Code', 'state_code')
    # lets add an id column
    dim_df = dim_df.withColumn('id', monotonically_increasing_id())
    
    # write dimension to parquet file
    dim_df.write.parquet(output_data + "demographics", mode="overwrite")
    
    return dim_df

In [None]:
demographics_dim_df = create_demographics_dimension_table(demographics_final_df, output_data)
demographics_dim_df.limit(5).toPandas()
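
The chain of `withColumnRenamed` calls follows one rule: lowercase the name and replace spaces or hyphens with underscores. A minimal sketch of that rule as a reusable function is shown below; `to_snake_case` is a hypothetical helper, and unlike the cell above it would also lowercase `City`, `State`, `Race`, and `Count`.

```python
import re


def to_snake_case(name):
    """Lowercase a column name and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


original_columns = [
    "Median Age", "Male Population", "Female Population", "Total Population",
    "Number of Veterans", "Foreign-born", "Average Household Size", "State Code",
]
renamed_columns = [to_snake_case(column) for column in original_columns]
```

Deriving the names from a rule avoids silent mismatches such as a rename that does nothing because the source column name was misspelled.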

##### Create the immigration fact table
---

In [None]:
def create_immigration_fact_table(df, output_data):
    """This function creates the immigration fact table from the immigration data and the visa type dimension.
    
    :param df: spark dataframe of immigration events
    :param output_data: path to read the visa type dimension from and write the fact dataframe to
    :return: spark dataframe representing immigration fact table
    """
    # get visa_type dimension
    dim_df = get_visa_type_dimension(output_data).toPandas()
    
    @udf('string')
    def get_visa_key(visa_type):
        """user defined function to get visa key
        
        :param visa_type: US non-immigrant visa type
        :return: corresponding visa key
        """
        key_series = dim_df[dim_df['visatype']==visa_type]['visa_type_key']
        
        if not key_series.empty:
            return str(key_series.iloc[0])
        
        return None
    
    # create a udf to convert arrival date in SAS format to datetime object
    get_datetime = udf(lambda x: (dt.datetime(1960, 1, 1).date() + dt.timedelta(x)).isoformat() if x else None)
    
    # rename columns to align with data model
    df = df.withColumnRenamed('cicid','record_id') \
            .withColumnRenamed('i94res', 'country_residence_code') \
            .withColumnRenamed('i94addr', 'state_code') 
    
    # create visa_type key
    df = df.withColumn('visa_type_key', get_visa_key('visatype'))
    
    # convert arrival date into datetime object
    df = df.withColumn("arrdate", get_datetime(df.arrdate))
    
    # write fact table to parquet file
    df.write.parquet(output_data + "immigration_fact", mode="overwrite")
    
    return df

In [None]:
immigration_fact_df = create_immigration_fact_table(immigration_cleansed_df, output_data)

In [None]:
def run_pipeline():
    """Placeholder for the end-to-end pipeline; the steps are outlined below."""
    # load data
    
    # run cleaning functions
    
    # create fact and dimension tables
    pass
    
    

#### 4.2 Data Quality Checks
The data quality checks ensure that the ETL has created fact and dimension tables with adequate records. 

In [None]:
table_dfs = {
    'immigration_fact': immigration_fact_df,
    'visa_type_dim': visatype_df,
    'calendar_dim': calendar_df,
    'usa_demographics_dim': demographics_dim_df,
    'country_dim': country_dim_f
}
for table_name, table_df in table_dfs.items():
    # quality check for table
    etl_functions.quality_checks(table_df, table_name)
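
The implementation of `etl_functions.quality_checks` is not shown in this notebook. As a hedged sketch of what a record-count check might look like, the function below accepts any object with a `count()` method, which includes Spark dataframes. The function name is hypothetical, and `FakeTable` is a stand-in used only to exercise the check without Spark.

```python
def check_has_records(table, table_name):
    """Raise ValueError if the table is empty; otherwise return its record count."""
    record_count = table.count()
    if record_count == 0:
        raise ValueError(
            f"Data quality check failed: {table_name} has no records"
        )
    return record_count


class FakeTable:
    """Stand-in for a Spark DataFrame, exposing only count()."""

    def __init__(self, n_rows):
        self._n_rows = n_rows

    def count(self):
        return self._n_rows


checked_count = check_has_records(FakeTable(3), "visa_type_dim")
```

Raising an exception rather than printing a message lets a scheduler or calling script stop the pipeline when a table comes out empty.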

#### 4.3 Data dictionary 

##### Fact Table - data dictionary

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">record_id</td><td class="tg-0pky">Unique record ID</td></tr>
 <tr><td class="tg-0pky">country_residence_code</td><td class="tg-0pky">3 digit code for immigrant country of residence </td></tr>    
 <tr><td class="tg-0pky">visa_type_key</td><td class="tg-0pky">A numerical key that links to the visa_type dimension table</td></tr>
 <tr><td class="tg-0pky">state_code</td><td class="tg-0pky">US state of arrival</td></tr>
 <tr><td class="tg-0pky">i94yr</td><td class="tg-0pky">4 digit year</td></tr>
 <tr><td class="tg-0pky">i94mon</td><td class="tg-0pky">Numeric month</td></tr>
 <tr><td class="tg-0pky">i94port</td><td class="tg-0pky">Port of admission</td></tr>
 <tr><td class="tg-0pky">arrdate</td><td class="tg-0pky">Arrival Date in the USA</td></tr>
 <tr><td class="tg-0pky">i94mode</td><td class="tg-0pky">Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)</td></tr>
 <tr><td class="tg-0pky">i94addr</td><td class="tg-0pky">USA State of arrival</td></tr>
 <tr><td class="tg-0pky">depdate</td><td class="tg-0pky">Departure Date from the USA</td></tr>
 <tr><td class="tg-0pky">i94bir</td><td class="tg-0pky">Age of Respondent in Years</td></tr>
 <tr><td class="tg-0pky">i94visa</td><td class="tg-0pky">Visa codes collapsed into three categories</td></tr>
 <tr><td class="tg-0pky">count</td><td class="tg-0pky">Field used for summary statistics</td></tr>
 <tr><td class="tg-0pky">dtadfile</td><td class="tg-0pky">Character Date Field - Date added to I-94 Files</td></tr>
 <tr><td class="tg-0pky">visapost</td><td class="tg-0pky">Department of State where Visa was issued </td></tr>
 <tr><td class="tg-0pky">occup</td><td class="tg-0pky">Occupation that will be performed in U.S</td></tr>
 <tr><td class="tg-0pky">entdepa</td><td class="tg-0pky">Arrival Flag - admitted or paroled into the U.S.</td></tr>
 <tr><td class="tg-0pky">entdepd</td><td class="tg-0pky">Departure Flag - Departed, lost I-94 or is deceased</td></tr>
 <tr><td class="tg-0pky">entdepu</td><td class="tg-0pky">Update Flag - Either apprehended, overstayed, adjusted to perm residence</td></tr>
 <tr><td class="tg-0pky">matflag</td><td class="tg-0pky">Match flag - Match of arrival and departure records</td></tr>
 <tr><td class="tg-0pky">biryear</td><td class="tg-0pky">4 digit year of birth</td></tr>
 <tr><td class="tg-0pky">dtaddto</td><td class="tg-0pky">Character Date Field - Date to which admitted to U.S. (allowed to stay until)</td></tr>
 <tr><td class="tg-0pky">gender</td><td class="tg-0pky">Non-immigrant sex</td></tr>
</table>

##### Country Dimension Table - data dictionary
<p>  
<i>The country code and country_name fields come from the labels description SAS file while the average_temperature data comes from the global land temperature by cities data.</i>
</p>
<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">country_code</td><td class="tg-0pky">Unique country code</td></tr>
 <tr><td class="tg-0pky">country_name</td><td class="tg-0pky">Name of country</td></tr>    
 <tr><td class="tg-0pky">average_temperature</td><td class="tg-0pky">Average temperature of country</td></tr>
</table>

##### Visa Type Dimension Table - data dictionary
<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">visa_type_key</td><td class="tg-0pky">Unique id for each visa type</td></tr>
 <tr><td class="tg-0pky">visatype</td><td class="tg-0pky">Name of visa type</td></tr>
</table>

##### Immigration Calendar Dimension Table - data dictionary
<p>
<i>The whole of this dataset comes from the immigration dataset.</i>
</p>
<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">id</td><td class="tg-0pky">Unique id</td></tr>
 <tr><td class="tg-0pky">arrdate</td><td class="tg-0pky">Arrival date into US</td></tr>    
 <tr><td class="tg-0pky">arrival_year</td><td class="tg-0pky">Arrival year into US</td></tr>
 <tr><td class="tg-0pky">arrival_month</td><td class="tg-0pky">Arrival Month</td></tr>
 <tr><td class="tg-0pky">arrival_day</td><td class="tg-0pky">Arrival Day</td></tr>
 <tr><td class="tg-0pky">arrival_week</td><td class="tg-0pky">Arrival Week</td></tr>
 <tr><td class="tg-0pky">arrival_weekday</td><td class="tg-0pky">Arrival WeekDay</td></tr>
</table>

##### US Demographics Dimension Table - data dictionary
<p>
<i>The whole of this dataset comes from the us cities demographics data.</i>
</p>

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Feature</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">id</td><td class="tg-0pky">Record id</td>
 <tr><td class="tg-0pky">state_code</td><td class="tg-0pky">US state code </td>
 <tr><td class="tg-0pky">City</td><td class="tg-0pky">City Name</td>
 <tr><td class="tg-0pky">State</td><td class="tg-0pky">US State where city is located</td>
 <tr><td class="tg-0pky">median_age</td><td class="tg-0pky">Median age of the population</td>
 <tr><td class="tg-0pky">male_population</td><td class="tg-0pky">Count of male population</td>
 <tr><td class="tg-0pky">female_population</td><td class="tg-0pky">Count of female population</td>
 <tr><td class="tg-0pky">total_population</td><td class="tg-0pky">Count of total population</td>
 <tr><td class="tg-0pky">number_of_veterans</td><td class="tg-0pky">Count of total Veterans</td>
 <tr><td class="tg-0pky">foreign_born</td><td class="tg-0pky">Count of residents of the city that were not born in the US</td>
 <tr><td class="tg-0pky">average_household_size</td><td class="tg-0pky">Average city household size</td>
 <tr><td class="tg-0pky">Race</td><td class="tg-0pky">Respondent race</td>
 <tr><td class="tg-0pky">Count</td><td class="tg-0pky">Count of the city's residents per race</td>
</table>

# Step 5: Complete Project Write Up
* Rationale for the choice of tools and technologies for the project
    * Apache Spark was used because:
        * It can handle multiple file formats and large amounts of data.
        * It offers a fast, unified analytics engine for big data.
        * It has easy-to-use APIs for operating on large datasets.
* Propose how often the data should be updated and why.
    * The current I94 immigration data is updated monthly, and hence the data will be updated monthly. 
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
        * Spark can handle the increase, but we would consider increasing the number of nodes in our cluster.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
        * In this scenario, Apache Airflow will be used to schedule and run data pipelines.
    * The database needed to be accessed by 100+ people.
        * In this scenario, we would move our analytics database into Amazon Redshift. 