# Data Engineering for Analysis on i94Immigration Data from US
### Udacity Data Engineering Capstone Project

#### Project Summary
In this project, I worked with four datasets from different sources, designed a Star Schema for them, and prepared the data for analysis of immigration to the USA.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession,Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

### Step 1: Scope the Project and Gather Data

#### Scope 
I start by exploring the raw datasets: loading them, checking their size, schema and columns, finding the connections between tables, and doing the necessary cleaning. I then design a Star Schema that fits the analytical purpose, select the relevant columns, and join them to create the fact and dimension tables. The data is processed mainly with PySpark, and the final tables are stored back to the cloud cluster as parquet files.

#### Describe and Gather Data 

The following datasets are included in the project. 

**I94 Immigration Data**: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. [This](https://travel.trade.gov/research/reports/i94/historical/2016.html) is where the data comes from.     
**World Temperature Data**: This dataset came from Kaggle. Read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).            
**U.S. City Demographic Data**: This data comes from OpenDataSoft. Read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).   

**Airport Code Table**: This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

In [2]:
# create the SparkSession
spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()

### Step 2: Explore and Assess the Data

#### 2.1 I94 Immigration Data

In [3]:
# load the I94 immigration data from the local SAS file
df_sas = spark.read.format('com.github.saurfang.sas.spark') \
            .load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
# one-off step: write the data to parquet
#df_sas.write.parquet("sas_data")
# read the parquet copy back in
df_sas = spark.read.parquet("sas_data")

In [4]:
# check columns
df_sas.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [5]:
# show first 5 rows
df_sas.show(n=5)

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|5748517.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     CA|20582.0|  40.0|    1.0|  1.0|20160430|     SYD| null|      G|      O|   null|      M| 1976.0|10292016|     F|  null|     QF|9.495387003E10|00011|      B1|
|5748518.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     NV|20591.0|  32.0|    1.0|  

In [6]:
# explore columns 
df_sas.select(col('airline'),col('i94port'),col('i94addr')).distinct().sort(df_sas.airline.desc()).show(truncate=False)

+-------+-------+-------+
|airline|i94port|i94addr|
+-------+-------+-------+
|ZZ     |MIA    |NC     |
|ZZ     |WAS    |NY     |
|ZZ     |SEA    |WA     |
|ZZ     |SFR    |DE     |
|ZZ     |NYC    |NJ     |
|ZZ     |ADW    |null   |
|ZZ     |SFR    |NY     |
|ZZ     |MIA    |FL     |
|ZZ     |BRO    |TX     |
|ZZ     |NYC    |CT     |
|ZZ     |ATL    |NY     |
|ZZ     |NEW    |HI     |
|ZZ     |ADW    |MD     |
|ZZ     |NYC    |NY     |
|ZZ     |HHW    |HI     |
|ZX     |WAS    |IN     |
|ZX     |FTL    |MN     |
|ZX     |TOR    |GA     |
|ZX     |FTL    |OH     |
|ZX     |HHW    |CT     |
+-------+-------+-------+
only showing top 20 rows



In [7]:
# count rows
df_sas.count()

3096313

In [8]:
# Get count of both null and missing values
df_sas.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sas.columns]).show()

+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+-------+-------+-------+-------+------+-------+-------+------+-----+--------+
|cicid|i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|  occup|entdepa|entdepd|entdepu|matflag|biryear|dtaddto|gender| insnum|airline|admnum|fltno|visatype|
+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+-------+-------+-------+-------+------+-------+-------+------+-----+--------+
|    0|    0|     0|     0|     0|      0|      0|    239| 152592| 142457|   802|      0|    0|       1| 1881250|3088187|    238| 138429|3095921| 138429|    802|    477|414269|2982605|  83627|     0|19549|       0|
+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+---
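
To make explicit what the expression above counts, here is a minimal plain-Python sketch of the same idea: a value is treated as missing when it is `None` or a floating-point NaN. The function name `count_missing` and the toy rows are hypothetical and only illustrate the logic; they are not taken from the dataset.

```python
import math

def count_missing(rows, columns):
    """Count, per column, the values that are None or a float NaN."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[c] += 1
    return counts

# Hypothetical toy rows, not taken from the dataset
rows = [
    {"i94mode": 1.0, "i94addr": "CA", "occup": None},
    {"i94mode": float("nan"), "i94addr": None, "occup": None},
    {"i94mode": 1.0, "i94addr": "NV", "occup": "STU"},
]
missing = count_missing(rows, ["i94mode", "i94addr", "occup"])
print(missing)
```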

In [9]:
# Drop duplicates (DataFrames are immutable, so the result must be reassigned)
df_sas = df_sas.dropDuplicates()
df_sas.count()

3096313

In [4]:
# convert SAS date to spark datetype, change column names and types, drop columns
df_sas = df_sas.withColumn("data_base_sas", to_date(lit("01/01/1960"), "MM/dd/yyyy")) \
            .withColumn("arrival_date", expr("date_add(data_base_sas, arrdate)")) \
            .withColumn("departure_date", expr("date_add(data_base_sas, depdate)")) \
            .drop("data_base_sas", "arrdate", "depdate") \
            .withColumn("cic_id",col("cicid").cast(IntegerType())).drop("cicid") \
            .withColumn("arrive_year",col('i94yr').cast(IntegerType())).drop("i94yr") \
            .withColumn("arrive_month",col('i94mon').cast(IntegerType())).drop("i94mon") \
            .withColumn("citizen_country",col('i94cit').cast(IntegerType())).drop("i94cit") \
            .withColumn("resident_country",col('i94res').cast(IntegerType())).drop("i94res") \
            .withColumn("age",col('i94bir').cast(IntegerType())).drop("i94bir") \
            .withColumn("birth_year",col('biryear').cast(IntegerType())).drop("biryear") \
            .withColumn("visa_class",col('i94visa').cast(IntegerType())).drop("i94visa") \
            .withColumn("mode",col('i94mode').cast(IntegerType())).drop("i94mode") \
            .withColumn("allowed_date", to_date("dtaddto", "MMddyyyy")) \
            .withColumnRenamed("i94port", "port") \
            .withColumnRenamed("i94addr","arrive_state") \
            .withColumnRenamed("visapost","visa_issue_state") \
            .withColumnRenamed("entdepa","arrive_flag") \
            .withColumnRenamed("entdepd","departure_flag") \
            .withColumnRenamed("matflag","match_flag") \
            .withColumnRenamed("entdepu","update_flag") \
            .withColumnRenamed("fltno","flight_num") \
            .withColumnRenamed("visatype","visa_type") \
            .withColumnRenamed("occup","occupation") \
            .drop('count','dtadfile','insnum','admnum','dtaddto')
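
The date conversion relies on SAS storing dates as the number of days since 1960-01-01. A minimal plain-Python check of that arithmetic, using the `arrdate` and `depdate` values of the first row shown earlier, might look like this. The helper name `sas_to_date` is hypothetical and is not part of the pipeline.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts days from this date

def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01); None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))

# arrdate and depdate of the first row shown earlier
print(sas_to_date(20574.0))  # 2016-04-30
print(sas_to_date(20582.0))  # 2016-05-08
```

Both results agree with the `arrival_date` and `departure_date` values that Spark produces for that record.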

In [7]:
df_sas.printSchema()

root
 |-- port: string (nullable = true)
 |-- arrive_state: string (nullable = true)
 |-- visa_issue_state: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- arrive_flag: string (nullable = true)
 |-- departure_flag: string (nullable = true)
 |-- update_flag: string (nullable = true)
 |-- match_flag: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- flight_num: string (nullable = true)
 |-- visa_type: string (nullable = true)
 |-- arrival_date: date (nullable = true)
 |-- departure_date: date (nullable = true)
 |-- cic_id: integer (nullable = true)
 |-- arrive_year: integer (nullable = true)
 |-- arrive_month: integer (nullable = true)
 |-- citizen_country: integer (nullable = true)
 |-- resident_country: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- visa_class: integer (nullable = true)
 |-- mode: integer (nullable = true)
 |-- allowed_date: da

In [13]:
# show first 5 rows
df_sas.show(n=5)

+----+------------+----------------+----------+-----------+--------------+-----------+----------+------+-------+----------+---------+------------+--------------+-------+-----------+------------+---------------+----------------+---+----------+----------+----+------------+
|port|arrive_state|visa_issue_state|occupation|arrive_flag|departure_flag|update_flag|match_flag|gender|airline|flight_num|visa_type|arrival_date|departure_date| cic_id|arrive_year|arrive_month|citizen_country|resident_country|age|birth_year|visa_class|mode|allowed_date|
+----+------------+----------------+----------+-----------+--------------+-----------+----------+------+-------+----------+---------+------------+--------------+-------+-----------+------------+---------------+----------------+---+----------+----------+----+------------+
| LOS|          CA|             SYD|      null|          G|             O|       null|         M|     F|     QF|     00011|       B1|  2016-04-30|    2016-05-08|5748517|       2016|   

In [14]:
# write to parquet, replacing the output directory if it already exists
df_sas.write.mode('overwrite').partitionBy('arrive_state').parquet('immigration')
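
When an output directory has to be cleared from plain Python rather than by Spark, two details matter: `os.rmdir` only removes empty directories, and `os.path` expects a plain path without the `file:` scheme. The sketch below demonstrates both on a throwaway directory. The directory layout and the placeholder file are made up for the demonstration.

```python
import os
import shutil
import tempfile

# Build a throwaway directory that mimics a non-empty Parquet output folder
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "immigration")
os.makedirs(os.path.join(out_dir, "arrive_state=CA"))
with open(os.path.join(out_dir, "arrive_state=CA", "part-00000.parquet"), "w") as f:
    f.write("placeholder")

# os.rmdir refuses to delete a directory that still has content
try:
    os.rmdir(out_dir)
    rmdir_failed = False
except OSError:
    rmdir_failed = True

# A "file:" prefix is a URI scheme, so os.path does not find the directory
found_with_scheme = os.path.exists("file:" + out_dir)

# shutil.rmtree deletes the directory together with everything inside it
shutil.rmtree(out_dir)
removed = not os.path.exists(out_dir)

shutil.rmtree(base)  # clean up the temporary parent directory
print(rmdir_failed, found_with_scheme, removed)
```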

#### 2.2 U.S. City Demographic Data

In [5]:
#load us-cities-demographics.csv
df_demo = spark.read.csv("us-cities-demographics.csv",header=True,sep=";")
df_demo.show(5)

+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|            City|        State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|                Race|Count|
+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|   Silver Spring|     Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|  Hispanic or Latino|25924|
|          Quincy|Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|               White|58723|
|          Hoover|      Alabama|      38.5|          38040| 

In [16]:
df_demo.printSchema()
df_demo.count()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Number of Veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)



2891

In [6]:
# change datatypes, format column names and delete duplicates
df_demo = df_demo.withColumn("median_age",col("Median Age").cast(FloatType())).drop("Median Age") \
    .withColumn("male_population",col("Male Population").cast(IntegerType())).drop("Male Population") \
    .withColumn("female_population",col("Female Population").cast(IntegerType())).drop("Female Population") \
    .withColumn("total_population",col("Total Population").cast(IntegerType())).drop("Total Population") \
    .withColumn("veterans_num",col("Number of Veterans").cast(IntegerType())).drop("Number of Veterans") \
    .withColumn("foreign_born_population",col("Foreign-born").cast(IntegerType())).drop("Foreign-born") \
    .withColumn("avg_household_size",col("Average Household Size").cast(FloatType())).drop("Average Household Size") \
    .withColumn("count",col("Count").cast(IntegerType())) \
    .withColumnRenamed("City", "city") \
    .withColumnRenamed("State", "state") \
    .withColumnRenamed("State Code", "state_code") \
    .withColumnRenamed("Race", "race") \
    .distinct()
                
df_demo.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- race: string (nullable = true)
 |-- count: integer (nullable = true)
 |-- median_age: float (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)
 |-- veterans_num: integer (nullable = true)
 |-- foreign_born_population: integer (nullable = true)
 |-- avg_household_size: float (nullable = true)



In [18]:
### Get count of both null and missing values
df_demo.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_demo.columns]).show()

+----+-----+----------+----+-----+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+
|city|state|state_code|race|count|median_age|male_population|female_population|total_population|veterans_num|foreign_born_population|avg_household_size|
+----+-----+----------+----+-----+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+
|   0|    0|         0|   0|    0|         0|              3|                3|               0|          13|                     13|                16|
+----+-----+----------+----+-----+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+



In [19]:
# show rows with null values in 'male_population' column
df_demo.select('city','state','race','count','male_population','female_population','veterans_num','foreign_born_population','avg_household_size') \
    .where(col('male_population').isNull()).show()

+------------+-------+--------------------+-----+---------------+-----------------+------------+-----------------------+------------------+
|        city|  state|                race|count|male_population|female_population|veterans_num|foreign_born_population|avg_household_size|
+------------+-------+--------------------+-----+---------------+-----------------+------------+-----------------------+------------------+
|The Villages|Florida|               White|72211|           null|             null|       15231|                   4034|              null|
|The Villages|Florida|Black or African-...|  331|           null|             null|       15231|                   4034|              null|
|The Villages|Florida|  Hispanic or Latino| 1066|           null|             null|       15231|                   4034|              null|
+------------+-------+--------------------+-----+---------------+-----------------+------------+-----------------------+------------------+



In [20]:
# check distinct values in 'race' column
df_demo.select('race').distinct().show(truncate = False)

+---------------------------------+
|race                             |
+---------------------------------+
|Black or African-American        |
|Hispanic or Latino               |
|White                            |
|Asian                            |
|American Indian and Alaska Native|
+---------------------------------+



In [7]:
# pivot the table so that each race's population becomes a separate column, then rename the columns
df_demo = df_demo.groupBy(col("city"),col("state"),col("median_age") \
                        ,col("male_population"),col("female_population") \
                        ,col("total_population"),col("veterans_num") \
                        ,col("foreign_born_population"),col("avg_household_size") \
                        ,col("state_code")) \
                    .pivot("race").agg(sum("count").cast("integer")) \
                    .fillna({"American Indian and Alaska Native": 0,
                     "Asian": 0,
                     "Black or African-American": 0,
                     "Hispanic or Latino": 0,
                     "White": 0}) \
                    .withColumnRenamed("American Indian and Alaska Native", "american_indian_alaska_native") \
                    .withColumnRenamed("Asian","asian") \
                    .withColumnRenamed("Black or African-American","african_american") \
                    .withColumnRenamed("Hispanic or Latino","hispanic_latino") \
                    .withColumnRenamed("White","white")
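
The pivot turns the long format (one row per city and race) into a wide format (one row per city, one column per race). A minimal plain-Python sketch of the same reshaping, using the three rows for The Villages shown earlier, is given below. It is only an illustration of the idea, not the Spark implementation.

```python
from collections import defaultdict

# Long format: one row per (city, race); counts taken from the rows shown earlier
long_rows = [
    ("The Villages", "Florida", "White", 72211),
    ("The Villages", "Florida", "Black or African-American", 331),
    ("The Villages", "Florida", "Hispanic or Latino", 1066),
]
races = [
    "American Indian and Alaska Native",
    "Asian",
    "Black or African-American",
    "Hispanic or Latino",
    "White",
]

# Wide format: one row per city, one column per race, 0 where a race has no row
wide = defaultdict(lambda: {race: 0 for race in races})
for city, state, race, n in long_rows:
    wide[(city, state)][race] += n

villages = wide[("The Villages", "Florida")]
print(villages)
```

Races without a row for a city end up as 0, which mirrors the `fillna` step in the cell above.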

In [22]:
df_demo.show(5)

+----------+--------------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+----------+-----------------------------+-----+----------------+---------------+------+
|      city|         state|median_age|male_population|female_population|total_population|veterans_num|foreign_born_population|avg_household_size|state_code|american_indian_alaska_native|asian|african_american|hispanic_latino| white|
+----------+--------------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+----------+-----------------------------+-----+----------------+---------------+------+
|   Vallejo|    California|      37.8|          58379|            62890|          121269|        8103|                  30592|              2.83|        CA|                         1671|34753|           26778|          28977| 56169|
|   Concord|North Carolina|      35.7|          42732|            44

In [23]:
# write to parquet, replacing the output directory if it already exists
df_demo.write.mode('overwrite').partitionBy('state','city').parquet('demography')

#### 2.3 Airport Code Table

In [8]:
# load airport-codes_csv.csv
df_airport = spark.read.csv("airport-codes_csv.csv",header=True,sep=",")
df_airport.show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|-74.9336013793945...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|-151.695999146, 5...|
| 00AL|small_airport|        Epps Airpark|         820|       NA|         US|     

In [17]:
# count distinct countries
df_airport.select('iso_country').distinct().count()

244

In [201]:
# select rows for US
df_airport.filter(df_airport.iso_country == "US").show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|-74.9336013793945...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|-151.695999146, 5...|
| 00AL|small_airport|        Epps Airpark|         820|       NA|         US|     

In [202]:
# count rows
df_airport.count()

55075

In [203]:
# count rows where country = US
#df_airport.select('iso_country' == 'US').distinct().count()
df_airport.select('iso_country').where("iso_country = 'US'").count()

22757

In [204]:
# count distinct iata_code values for US
df_airport.select('iata_code').filter(df_airport.iso_country == "US").distinct().count()

2015

In [205]:
# count Null iata_code for US
df_airport.where(col("iata_code").isNull()).filter(df_airport.iso_country == "US").count()

20738

In [9]:
# keep US rows whose iata_code is not null, "None", "NULL" or an empty string
# (the conditions are combined with &, because a row must pass every check)
df_airport = df_airport.where("iso_country = 'US'") \
                        .filter(col('iata_code').isNotNull() &
                                ~col('iata_code').contains('None') &
                                ~col('iata_code').contains('NULL') &
                                (col('iata_code') != ''))
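
The comment on this cell states the intent: a row should survive only if its `iata_code` is none of null, "None", "NULL" or an empty string. That is a logical AND of the individual checks; joined with OR, any value that passes a single check would be kept. A minimal plain-Python sketch of the intended predicate follows. The function name `keep_row` and the sample list are hypothetical; `OCA` and `PQS` are codes that appear in the output of this section.

```python
def keep_row(iata_code):
    """Return True only when the code passes every check (logical AND)."""
    return (
        iata_code is not None
        and "None" not in iata_code
        and "NULL" not in iata_code
        and iata_code != ""
    )

codes = ["OCA", None, "", "NULL", "None", "PQS"]
kept = [c for c in codes if keep_row(c)]
print(kept)  # ['OCA', 'PQS']
```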

In [207]:
df_airport.show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+-------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region| municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+-------------+--------+---------+----------+--------------------+
| 07FA|small_airport|Ocean Reef Club A...|           8|       NA|         US|     US-FL|    Key Largo|    07FA|      OCA|      07FA|-80.274803161621,...|
|  0AK|small_airport|Pilot Station Air...|         305|       NA|         US|     US-AK|Pilot Station|    null|      PQS|       0AK|-162.899994, 61.9...|
| 0CO2|small_airport|Crested Butte Air...|        8980|       NA|         US|     US-CO|Crested Butte|    0CO2|      CSE|      0CO2|-106.928341, 38.8...|
| 0TE7|small_airport|   LBJ Ranch Airport|        1515|       NA|         US

In [10]:
df_airport.count()

2019

In [21]:
# check "code" and city columns to find out the relationship
df_airport.select(col('iata_code'),col('iso_country'),col('iso_region'),col('municipality')) \
            .distinct().sort(df_airport.iata_code.desc()).show(truncate=False)

+---------+-----------+----------+------------------------------+
|iata_code|iso_country|iso_region|municipality                  |
+---------+-----------+----------+------------------------------+
|ZZV      |US         |US-OH     |Zanesville                    |
|ZPH      |US         |US-FL     |Zephyrhills                   |
|ZNC      |US         |US-AK     |Nyac                          |
|YUM      |US         |US-AZ     |Yuma                          |
|YNG      |US         |US-OH     |Youngstown/Warren             |
|YKN      |US         |US-SD     |Yankton                       |
|YKM      |US         |US-WA     |Yakima                        |
|YIP      |US         |US-MI     |Detroit                       |
|YAK      |US         |US-AK     |Yakutat                       |
|XSD      |US         |US-NV     |Tonopah                       |
|XPR      |US         |US-SD     |Pine Ridge                    |
|XNA      |US         |US-AR     |Fayetteville/Springdale/Rogers|
|XMD      

In [11]:
# split "coordinates" into separate longitude and latitude columns
split_col = split(df_airport['coordinates'], ',')
df_airport = df_airport.withColumn('longitude', split_col.getItem(0)) \
                        .withColumn('latitude', split_col.getItem(1)) \
                        .drop("coordinates")
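
The raw `coordinates` value has the form `longitude, latitude`, with a space after the comma. Splitting on `","` alone therefore leaves a leading space in the second part, and both parts remain strings. A minimal plain-Python sketch of a split that also trims and converts to numbers is shown below. The helper `split_coordinates` is hypothetical, and the sample string is reassembled from the longitude and latitude values that the output of this section shows for Pilot Station Airport (PQS).

```python
def split_coordinates(coordinates):
    """Split a 'longitude, latitude' string into two floats."""
    longitude, latitude = (part.strip() for part in coordinates.split(","))
    return float(longitude), float(latitude)

# Pilot Station Airport (PQS)
lon, lat = split_coordinates("-162.899994, 61.934601")
print(lon, lat)  # -162.899994 61.934601
```

In PySpark the same effect could presumably be obtained by applying `trim` and a cast to the two new columns; that step is not part of the notebook.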

In [211]:
# split "iso_region"
#split_region = split(df_airport['iso_region'],'-')
#df_airport = df_airport.withColumn('region',split_region.getItem(1)) \
#                        .drop('iso_region') \
 #                       .withColumnRenamed('iso_country','country') \
  #                      .withColumnRenamed('ident','id')

In [23]:
df_airport.show(3)

+-----+-------------+--------------------+------------+---------+-----------+----------+-------------+--------+---------+----------+----------------+----------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region| municipality|gps_code|iata_code|local_code|       longitude|        latitude|
+-----+-------------+--------------------+------------+---------+-----------+----------+-------------+--------+---------+----------+----------------+----------------+
| 07FA|small_airport|Ocean Reef Club A...|           8|       NA|         US|     US-FL|    Key Largo|    07FA|      OCA|      07FA|-80.274803161621| 25.325399398804|
|  0AK|small_airport|Pilot Station Air...|         305|       NA|         US|     US-AK|Pilot Station|    null|      PQS|       0AK|     -162.899994|       61.934601|
| 0CO2|small_airport|Crested Butte Air...|        8980|       NA|         US|     US-CO|Crested Butte|    0CO2|      CSE|      0CO2|     -106.928341|       38.851918

In [24]:
### Get count of both null and missing values
df_airport.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_airport.columns]).show()

+-----+----+----+------------+---------+-----------+----------+------------+--------+---------+----------+---------+--------+
|ident|type|name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|longitude|latitude|
+-----+----+----+------------+---------+-----------+----------+------------+--------+---------+----------+---------+--------+
|    0|   0|   0|          34|        0|          0|         0|           6|      81|        0|        50|        0|       0|
+-----+----+----+------------+---------+-----------+----------+------------+--------+---------+----------+---------+--------+



In [26]:
# check ident, gps_code, iata_code and local_code, and explore the relationship between those codes
df_airport.select('ident','gps_code','iata_code','local_code').orderBy(col('iata_code').desc()).show(10)

+-----+--------+---------+----------+
|ident|gps_code|iata_code|local_code|
+-----+--------+---------+----------+
| KZZV|    KZZV|      ZZV|       ZZV|
| KZPH|    KZPH|      ZPH|       ZPH|
|  ZNC|     ZNC|      ZNC|       ZNC|
| KNYL|    KNYL|      YUM|       NYL|
| KYNG|    KYNG|      YNG|       YNG|
| KYKN|    KYKN|      YKN|       YKN|
| KYKM|    KYKM|      YKM|       YKM|
| KYIP|    KYIP|      YIP|       YIP|
| PAYA|    PAYA|      YAK|       YAK|
| KTNX|    KTNX|      XSD|       TNX|
+-----+--------+---------+----------+
only showing top 10 rows



In [29]:
# write to parquet, replacing the output directory if it already exists
df_airport.write.mode('overwrite').partitionBy('iso_region').parquet('airport')

#### 2.4 World Temperature Data

In [12]:
# load world temperature data
#fname = '../../data2/GlobalLandTemperaturesByCity.csv'
df_temperature = spark.read.csv("../../data2/GlobalLandTemperaturesByCity.csv",header=True,sep=",")
df_temperature.show(5)

+----------+------------------+-----------------------------+-----+-------+--------+---------+
|        dt|AverageTemperature|AverageTemperatureUncertainty| City|Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|             6.068|           1.7369999999999999|Århus|Denmark|  57.05N|   10.33E|
|1743-12-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-01-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-02-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-03-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
only showing top 5 rows



In [13]:
# keep only the columns of interest (City, Country, Latitude and Longitude) for US cities
df_temperature = df_temperature.select('City','Country','Latitude','Longitude') \
                                .filter(df_temperature.Country == 'United States') \
                                .distinct()

In [46]:
df_temperature.printSchema()

root
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)



In [47]:
### Get count of both null and missing values
df_temperature.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_temperature.columns]).show()

+----+-------+--------+---------+
|City|Country|Latitude|Longitude|
+----+-------+--------+---------+
|   0|      0|       0|        0|
+----+-------+--------+---------+



In [14]:
# drop duplicated rows and change column names
df_temperature = df_temperature.dropDuplicates() \
                                .withColumnRenamed("City","city") \
                                .withColumnRenamed("Latitude","latitude") \
                                .withColumnRenamed("Longitude","longitude") \
                                .drop("Country")

In [38]:
df_temperature.show(5)

+----------------+--------+---------+
|            city|latitude|longitude|
+----------------+--------+---------+
|        Columbus|  32.95N|   85.21W|
|          Edison|  40.99N|   74.56W|
|         Miramar|  26.52N|   80.60W|
|Huntington Beach|  32.95N|  117.77W|
|      Washington|  39.38N|   76.99W|
+----------------+--------+---------+
only showing top 5 rows
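
The coordinates here carry a hemisphere suffix (`32.95N`, `85.21W`), whereas the airport table stores signed decimal degrees. If the two sources are ever compared or combined on coordinates, one format has to be converted into the other. The sketch below shows one possible conversion in plain Python. It is an assumption about a useful follow-up step, not something the notebook performs, and the helper name `to_signed_degrees` is hypothetical.

```python
def to_signed_degrees(value):
    """Convert '32.95N' / '85.21W' style strings to signed decimal degrees."""
    number, hemisphere = float(value[:-1]), value[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError(f"unexpected hemisphere suffix in {value!r}")
    # South and West are negative by convention
    return -number if hemisphere in "SW" else number

# Columbus, taken from the rows shown above
print(to_signed_degrees("32.95N"), to_signed_degrees("85.21W"))  # 32.95 -85.21
```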



In [50]:
# write to parquet, replacing the output directory if it already exists
df_temperature.write.mode('overwrite').partitionBy('city').parquet('coordinates')

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model    
My modeling approach is to keep closely related information together in one table and to reserve the most frequently requested information (from my perspective) for the fact table. This produces a lightweight fact table for the most common queries; when an analysis needs further information, joining a dimension table under the Star Schema is not costly. Ideally the dimension tables could be normalized further: a Snowflake Schema might be more appropriate from a data-integrity standpoint, but it would probably lead to more costly joins among tables. After weighing both options, I decided on a Star Schema. The fact and dimension tables look like this:

#### Fact table

__fact_immigration_record__:        
*__cic_id (PK)__, port (FK), arrival_date, arrive_year, arrive_month, departure_date, airline, flight_num, arrive_city (FK), arrive_state (FK), mode*

#### Dimension Tables   
1. __dim_immigrant__: *__cic_id (PK)__, age, occupation, gender, birth_year, citizen_country,resident_country*

2. __dim_city__: *city, state, state_code, longitude, latitude, median_age, avg_household_size, total_population,
male_population, female_population, veterans_num, foreign_born_population, american_indian_alaska_native, asian,african_american, hispanic_latino, white, __(city,state PK)__*

3. __dim_airport__: *__id (PK)__, type, name, elevation_ft, iso_region, municipality, gps_code, iata_code (FK, reference fact_port), local_code, longitude, latitude*

4. __dim_visa__: *__cic_id (PK)__, visa_type, visa_class, visa_issue_state, arrive_flag, departure_flag, update_flag, match_flag, allowed_date*
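
To illustrate how the fact table and a dimension table are meant to be combined, the sketch below builds two tiny tables in an in-memory SQLite database and joins them on `cic_id`. SQLite stands in for Spark here only to keep the example self-contained; the rows and the reduced column set are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_immigration_record (
        cic_id INTEGER PRIMARY KEY, port TEXT, arrive_state TEXT, mode INTEGER
    );
    CREATE TABLE dim_immigrant (
        cic_id INTEGER PRIMARY KEY, age INTEGER, gender TEXT
    );
    -- hypothetical sample rows
    INSERT INTO fact_immigration_record VALUES (1, 'MIA', 'FL', 1), (2, 'ATL', 'GA', 1);
    INSERT INTO dim_immigrant VALUES (1, 34, 'F'), (2, 51, 'M');
""")

# One join from the fact table to a dimension answers a typical question:
# how many arrivals per state and gender?
rows = conn.execute("""
    SELECT f.arrive_state, d.gender, COUNT(*) AS arrivals
    FROM fact_immigration_record AS f
    JOIN dim_immigrant AS d ON d.cic_id = f.cic_id
    GROUP BY f.arrive_state, d.gender
    ORDER BY f.arrive_state
""").fetchall()
print(rows)
conn.close()
```

Because both tables share `cic_id` as their key, the join is one-to-one and does not change the number of fact rows.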

#### 3.2 Mapping Out Data Pipelines
I followed a Data Lake approach throughout the project: instead of imposing a strict relational structure on the tables, I kept the data relatively flexible. For example, "null" values were kept in the fact table and only duplicated records were excluded. One of the most important likely uses of the data is to keep track of every immigrant, so dropping records with "null" values would make no sense. 

The steps necessary to pipeline the data into the chosen data model are:

1. Extract: load the datasets (sas_data, demography, coordinate, airport) from their sources (parquet files stored locally or in the cloud after the data cleaning step)    

2. Transform: select the target columns from each dataset and join them to compose the fact and dimension tables  

3. Load: write the final tables back to local or cloud storage as parquet files (I wrote the tables directly to the workspace storage provided by Udacity; running the pipeline on a cloud platform would require configuring the cloud infrastructure yourself)


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model

#### Create fact table
__fact_immigration_record__: *__cic_id (PK)__, port (FK), arrival_date, arrive_year, arrive_month, departure_date, airline, flight_num, arrive_state (FK), arrive_city (FK), mode*

In [68]:
# load fact table
df_airport_temp = df_airport.select('iata_code','municipality').filter(df_airport.municipality.isNotNull())
df_sas_temp = df_sas.select("cic_id","port","mode","arrival_date","arrive_year","arrive_month", \
                                        "departure_date","airline","flight_num","arrive_state").distinct()
cond = [df_sas_temp['port'] == df_airport_temp['iata_code']]
fact_immigration_record = df_sas_temp.join(df_airport_temp, cond,'left').drop('iata_code') \
                                    .select("cic_id","port","mode","arrival_date","arrive_year","arrive_month", \
                                        "departure_date","airline","flight_num","arrive_state","municipality") \
                                    .withColumnRenamed('municipality','arrive_city') \
                                    .distinct()

fact_immigration_record.show(5)
fact_immigration_record.count()

+------+----+----+------------+-----------+------------+--------------+-------+----------+------------+-----------+
|cic_id|port|mode|arrival_date|arrive_year|arrive_month|departure_date|airline|flight_num|arrive_state|arrive_city|
+------+----+----+------------+-----------+------------+--------------+-------+----------+------------+-----------+
|    75| ATL|   1|  2016-04-01|       2016|           4|    2016-04-14|     LH|     00444|          TN|    Atlanta|
|   250| NYC|   1|  2016-04-01|       2016|           4|    2016-04-08|     AA|     00101|          NY|       null|
|   434| MIA|   1|  2016-04-01|       2016|           4|    2016-04-06|     OS|     00097|          FL|      Miami|
|   439| MIA|   1|  2016-04-01|       2016|           4|    2016-04-09|     OS|     00097|          FL|      Miami|
|   454| MIA|   1|  2016-04-01|       2016|           4|    2016-04-16|     OS|     00097|          FL|      Miami|
+------+----+----+------------+-----------+------------+--------------+-------+----------+------------+-----------+

3096313
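
The left join above keeps every immigration record even when its `port` has no matching `iata_code` in the airport data, which is why `arrive_city` can be `null` (as in the `NYC` row). The following plain-Python sketch mimics that behavior on a handful of rows; the function name `left_join_city` and the lookup dictionary are illustrative assumptions.

```python
def left_join_city(records, airport_city):
    """Attach arrive_city to each record; unmatched ports get None (left join)."""
    return [
        {**rec, "arrive_city": airport_city.get(rec["port"])}
        for rec in records
    ]


# Sample rows modeled on the output above
records = [
    {"cic_id": 75, "port": "ATL"},
    {"cic_id": 250, "port": "NYC"},
    {"cic_id": 434, "port": "MIA"},
]
airport_city = {"ATL": "Atlanta", "MIA": "Miami"}  # iata_code -> municipality

joined = left_join_city(records, airport_city)
for row in joined:
    print(row)
```

Unlike a real join, a dictionary lookup cannot multiply rows when one code maps to several airports, so that case is worth verifying with the `cic_id` uniqueness check in section 4.2.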

In [83]:
# overwrite the output directory if it already exists and write to parquet
fact_immigration_record.write.mode("overwrite").partitionBy("arrive_state").parquet("US_immigration")

#### Create dim_immigrant
__dim_immigrant__:    
*__cic_id (PK)__,age,occupation,gender,birth_year,citizen_country,resident_country,arrive_state (partition key)*

In [71]:
# load dim_immigrant
dim_immigrant = df_sas.select("cic_id","age","birth_year","gender","occupation", \
                              "citizen_country","resident_country","arrive_state").distinct()

In [72]:
dim_immigrant.show(5)

+------+---+----------+------+----------+---------------+----------------+------------+
|cic_id|age|birth_year|gender|occupation|citizen_country|resident_country|arrive_state|
+------+---+----------+------+----------+---------------+----------------+------------+
|   194| 63|      1953|     M|      null|            103|             103|          CA|
|   632| 23|      1993|     M|      null|            103|             103|          TX|
|  1510| 68|      1948|     F|      null|            104|             104|          NY|
|  1627| 65|      1951|     F|      null|            104|             104|          FL|
|  1812| 45|      1971|     M|      null|            104|             104|          FL|
+------+---+----------+------+----------+---------------+----------------+------------+
only showing top 5 rows



In [73]:
# overwrite the output directory if it already exists and write to parquet
dim_immigrant.write.mode("overwrite").partitionBy("arrive_state").parquet("US_immigrant")

#### Create dim_city
__dim_city__: 
*city, state, state_code, longitude, latitude, median_age, avg_household_size, total_population,
male_population, female_population, veterans_num, foreign_born_population, american_indian_alaska_native, asian, african_american, hispanic_latino, white, __(city,state PK)__*


In [132]:
# load dim_city
city_list = set(fact_immigration_record.select('arrive_city').distinct().toPandas()['arrive_city'])
df_temp = df_temperature.selectExpr('city as temp_city','latitude','longitude')
dim_city = df_demo.join(df_temp,df_demo.city == df_temp.temp_city, 'left') \
                    .select('city','state','state_code','latitude','longitude', \
                           'median_age','male_population','female_population', \
                           'total_population','veterans_num','foreign_born_population', \
                           'avg_household_size','american_indian_alaska_native', \
                           'asian','african_american','hispanic_latino','white') \
                    .filter(df_demo.city.isin(city_list)).distinct()
dim_city.show(5)

+-----------+--------------+----------+--------+---------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+-----------------------------+-----+----------------+---------------+------+
|       city|         state|state_code|latitude|longitude|median_age|male_population|female_population|total_population|veterans_num|foreign_born_population|avg_household_size|american_indian_alaska_native|asian|african_american|hispanic_latino| white|
+-----------+--------------+----------+--------+---------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+-----------------------------+-----+----------------+---------------+------+
| Charleston|South Carolina|        SC|  32.95N|   79.47W|      35.0|          63956|            71568|          135524|        9368|                   5767|               2.4|                          633| 2773|           29998|           3

In [140]:
# overwrite the output directory if it already exists and write to parquet
dim_city.write.mode("overwrite").partitionBy("state_code").parquet("US_city")

#### Create dim_airport
__dim_airport__: 
*__id (PK)__, type, name, elevation_ft, continent, iso_region, municipality, gps_code, iata_code (FK, reference fact_port), local_code, longitude, latitude*

In [77]:
# select distinct port codes from df_sas
port_list = set(df_sas.select('port').filter(df_sas.port.isNotNull()).toPandas()['port'])
# load dim_airport reserving only the airport included in fact table
dim_airport = df_airport.select('ident', 'type', 'name', 'elevation_ft', 'continent',\
                                'municipality', 'gps_code', 'iata_code',\
                                'local_code', 'longitude', 'latitude', 'iso_region') \
                        .filter(df_airport.iata_code.isin(port_list)) \
                        .filter(df_airport.municipality.isNotNull()) \
                        .distinct()

In [78]:
dim_airport.show(5)
dim_airport.count()

+-----+--------------+--------------------+------------+---------+------------+--------+---------+----------+--------------+-------------------+----------+
|ident|          type|                name|elevation_ft|continent|municipality|gps_code|iata_code|local_code|     longitude|           latitude|iso_region|
+-----+--------------+--------------------+------------+---------+------------+--------+---------+----------+--------------+-------------------+----------+
| KFWA| large_airport|Fort Wayne Intern...|         814|       NA|  Fort Wayne|    KFWA|      FWA|       FWA|  -85.19509888|        40.97850037|     US-IN|
| KABQ| large_airport|Albuquerque Inter...|        5355|       NA| Albuquerque|    KABQ|      ABQ|       ABQ|   -106.609001|          35.040199|     US-NM|
| KPIR|medium_airport|Pierre Regional A...|        1744|       NA|      Pierre|    KPIR|      PIR|       PIR|  -100.2860031|        44.38270187|     US-SD|
| KFTK|medium_airport|Godman Army Air F...|         756|       N

132

In [79]:
# overwrite the output directory if it already exists and write to parquet
dim_airport.write.mode("overwrite").partitionBy("iso_region").parquet("US_airport")

#### Create dim_visa
__dim_visa__: *__cic_id (PK)__, visa_type, visa_class, visa_issue_state, arrive_flag, departure_flag, update_flag, match_flag, allowed_date*

In [100]:
#load dim_visa
dim_visa = df_sas.select('cic_id','visa_type','visa_class','visa_issue_state','arrive_flag', \
                         'departure_flag','update_flag','match_flag','allowed_date') \
                    .distinct()

In [101]:
dim_visa.show(3)

+------+---------+----------+----------------+-----------+--------------+-----------+----------+------------+
|cic_id|visa_type|visa_class|visa_issue_state|arrive_flag|departure_flag|update_flag|match_flag|allowed_date|
+------+---------+----------+----------------+-----------+--------------+-----------+----------+------------+
|    56|       WT|         2|            null|          G|             O|       null|         M|  2016-06-29|
|   157|       WT|         2|            null|          G|             O|       null|         M|  2016-06-29|
|   171|       WT|         2|            null|          G|             O|       null|         M|  2016-06-29|
+------+---------+----------+----------------+-----------+--------------+-----------+----------+------------+
only showing top 3 rows



In [111]:
# overwrite the output directory if it already exists and write to parquet
dim_visa.write.mode("overwrite").partitionBy("visa_type").parquet("US_visa")

#### 4.2 Data Quality Checks
Quality checks are performed to ensure the pipeline ran as expected. These included:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness

In [84]:
# load fact table, show rows and check schema
fact = spark.read.parquet("US_immigration")
fact.show(5)
fact.printSchema() # check if datatypes are valid

+------+----+----+------------+-----------+------------+--------------+-------+----------+-----------+------------+
|cic_id|port|mode|arrival_date|arrive_year|arrive_month|departure_date|airline|flight_num|arrive_city|arrive_state|
+------+----+----+------------+-----------+------------+--------------+-------+----------+-----------+------------+
|    69| ATL|   1|  2016-04-01|       2016|           4|    2016-04-16|     DL|     00131|    Atlanta|          FL|
|   883| NEW|   1|  2016-04-01|       2016|           4|    2016-04-16|     UA|     02067|New Orleans|          FL|
|  3719| NYC|   1|  2016-04-01|       2016|           4|    2016-04-02|     AA|     00065|       null|          FL|
|  4785| CLT|   1|  2016-04-01|       2016|           4|    2016-04-23|     AA|     00787|  Charlotte|          FL|
|  9082| MIA|   1|  2016-04-01|       2016|           4|    2016-04-11|     AF|     00090|      Miami|          FL|
+------+----+----+------------+-----------+------------+--------------+-------+----------+-----------+------------+

In [85]:
# count rows of fact table, row number matches the source table
fact.count()

3096313

In [89]:
# check if "cic_id" are unique
df1 = fact.groupBy("cic_id").count().filter("count > 1")
df1.show()


+------+-----+
|cic_id|count|
+------+-----+
+------+-----+
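
An empty result here means that no `cic_id` occurs more than once. The same idea can be wrapped into a reusable check; the sketch below mirrors the group, count and filter logic on a plain Python list so that it runs without a Spark session. The name `find_duplicate_keys` is hypothetical.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return a dict of key -> count for every key that appears more than once."""
    return {key: n for key, n in Counter(keys).items() if n > 1}


print(find_duplicate_keys([69, 883, 3719, 4785]))
print(find_duplicate_keys([69, 883, 69]))
```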



In [88]:
# load dim_immigrant, show rows, check schema and count rows
dim_immigrant = spark.read.parquet("US_immigrant")
dim_immigrant.show(5)
dim_immigrant.printSchema()
dim_immigrant.count() # row number should be equal to the fact table's row number

+------+---+----------+------+----------+---------------+----------------+------------+
|cic_id|age|birth_year|gender|occupation|citizen_country|resident_country|arrive_state|
+------+---+----------+------+----------+---------------+----------------+------------+
|  1804| 73|      1943|     F|      null|            104|             104|          FL|
|  1916| 36|      1980|     M|      null|            104|             104|          FL|
|  2175| 14|      2002|     F|      null|            105|             105|          FL|
|  3317| 39|      1977|     M|      null|            108|             108|          FL|
|  3718| 77|      1939|     F|      null|            108|             135|          FL|
+------+---+----------+------+----------+---------------+----------------+------------+
only showing top 5 rows

root
 |-- cic_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: strin

3096313

In [91]:
# load dim_city
dim_city = spark.read.parquet("US_city")
dim_city.show(5)
dim_city.printSchema() # check if the datatypes are proper
dim_city.count()

+-----------+--------------+--------+---------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+-----------------------------+------+----------------+---------------+-------+----------+
|       city|         state|latitude|longitude|median_age|male_population|female_population|total_population|veterans_num|foreign_born_population|avg_household_size|american_indian_alaska_native| asian|african_american|hispanic_latino|  white|state_code|
+-----------+--------------+--------+---------+----------+---------------+-----------------+----------------+------------+-----------------------+------------------+-----------------------------+------+----------------+---------------+-------+----------+
| Charleston|South Carolina|  32.95N|   79.47W|      35.0|          63956|            71568|          135524|        9368|                   5767|               2.4|                          633|  2773|           29998|           3929|

64

In [92]:
# the number of distinct cities in dim_city should be equal to or less than that in the fact table
fact.select('arrive_city').distinct().count()

126

In [102]:
# load dim_airport
dim_airport = spark.read.parquet("US_airport")
dim_airport.show(5)
dim_airport.printSchema() # check if the datatypes are proper
dim_airport.count()

+-----+--------------+--------------------+------------+---------+--------------------+--------+---------+----------+-------------------+-------------------+----------+
|ident|          type|                name|elevation_ft|continent|        municipality|gps_code|iata_code|local_code|          longitude|           latitude|iso_region|
+-----+--------------+--------------------+------------+---------+--------------------+--------+---------+----------+-------------------+-------------------+----------+
| KPIE|medium_airport|St Petersburg Cle...|          11|       NA|St Petersburg-Cle...|    KPIE|      PIE|       PIE|       -82.68740082|        27.91020012|     US-FL|
| KSRQ| large_airport|Sarasota Bradento...|          30|       NA|  Sarasota/Bradenton|    KSRQ|      SRQ|       SRQ| -82.55439758300781|  27.39539909362793|     US-FL|
| PANC| large_airport|Ted Stevens Ancho...|         152|       NA|           Anchorage|    PANC|      ANC|       ANC|-149.99600219726562| 61.17440032958984

132

In [104]:
# row count of dim_airport should be equal to or less than the number of distinct 'port' values in the fact table
fact.select('port').distinct().count()

299
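
The comparisons above only look at counts. A slightly stronger check is that every key in a dimension table also occurs in the fact table, and it can be useful to know which fact keys have no dimension row. A minimal sketch with Python sets follows; the function name `key_coverage` and the sample codes are assumptions for illustration.

```python
def key_coverage(fact_keys, dim_keys):
    """Compare the key sets of a fact column and a dimension column.

    Returns (orphans, unmatched): dimension keys missing from the fact table,
    and fact keys without a dimension row. None values are ignored.
    """
    fact = {k for k in fact_keys if k is not None}
    dim = {k for k in dim_keys if k is not None}
    return dim - fact, fact - dim


orphans, unmatched = key_coverage(
    fact_keys=["ATL", "MIA", "NYC", None, "MIA"],
    dim_keys=["ATL", "MIA"],
)
print(sorted(orphans))
print(sorted(unmatched))
```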

In [105]:
# load dim_visa
dim_visa = spark.read.parquet("US_visa")
dim_visa.show(5)
dim_visa.printSchema() # check if the datatypes are proper
dim_visa.count()  # should be equal to the row count of fact table

+------+----------+----------------+-----------+--------------+-----------+----------+------------+---------+
|cic_id|visa_class|visa_issue_state|arrive_flag|departure_flag|update_flag|match_flag|allowed_date|visa_type|
+------+----------+----------------+-----------+--------------+-----------+----------+------------+---------+
|     6|         2|            null|          T|          null|          U|      null|  2016-10-28|       B2|
|  2141|         2|             SOF|          G|             O|       null|         M|  2016-09-30|       B2|
|  2455|         2|             KRK|          G|             O|       null|         M|  2016-09-30|       B2|
|  2583|         2|             WRW|          G|             O|       null|         M|  2016-09-30|       B2|
|  3682|         2|             LUA|          G|             O|       null|         M|  2016-09-30|       B2|
+------+----------+----------------+-----------+--------------+-----------+----------+------------+---------+
only showing top 5 rows

3096313
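
Since `dim_immigrant` and `dim_visa` hold one row per `cic_id`, their row counts should equal that of the fact table. The sketch below turns this into an explicit check using the counts reported above; the helper `assert_equal_counts` is a hypothetical name.

```python
def assert_equal_counts(counts):
    """Raise ValueError unless all tables in `counts` have the same row count."""
    if len(set(counts.values())) > 1:
        raise ValueError(f"row counts differ: {counts}")


# Counts taken from the cells above
assert_equal_counts({
    "fact_immigration_record": 3096313,
    "dim_immigrant": 3096313,
    "dim_visa": 3096313,
})
print("row counts match")
```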

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### fact table 
Includes following columns with corresponding data types       
 |-- cic_id: integer (nullable = true)       id numbers of immigrants         
 |-- port: string (nullable = true)          three alphabetic identity codes of port through which the immigrants arrived             
 |-- mode: integer (nullable = true)         the way the immigrants arrived, 1 = 'Air', 2 = 'Sea', 3 = 'Land', 9 = 'Not reported'             
 |-- arrival_date: date (nullable = true)    the date the immigrants arrived in US            
 |-- arrive_year: integer (nullable = true)  the year the immigrants arrived in US            
 |-- arrive_month: integer (nullable = true) the month the immigrants arrived in US          
 |-- departure_date: date (nullable = true)  the date the immigrants leave US          
 |-- airline: string (nullable = true)       the name of airline the immigrants took        
 |-- flight_num: string (nullable = true)    the number of the flight the immigrants took       
 |-- arrive_city: string (nullable = true)   the city the immigrants first arrived in        
 |-- arrive_state: string (nullable = true)  the US state the immigrants first arrived in
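
The `mode` codes listed above can be decoded with a simple lookup. This is only a sketch: the names `MODE_LABELS` and `decode_mode` are hypothetical, and treating unknown or missing codes as 'Not reported' is an assumption.

```python
MODE_LABELS = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}


def decode_mode(code):
    """Translate an I94 mode code into its label; unknown or missing codes
    fall back to 'Not reported' (an assumption of this sketch)."""
    return MODE_LABELS.get(code, "Not reported")


print(decode_mode(1))
print(decode_mode(None))
```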

### dim_immigrant 
Dimension table of information on immigrants, including following columns       
 |-- cic_id: integer (nullable = true) unique identity of immigrants       
 |-- age: integer (nullable = true) age of immigrants        
 |-- birth_year: integer (nullable = true) birth year of immigrants        
 |-- gender: string (nullable = true) gender of immigrants       
 |-- occupation: string (nullable = true) occupation of immigrants       
 |-- citizen_country: integer (nullable = true) three digit number indicating the citizenship country of the immigrants, country names are included in I94_SAS_Lables_Description          
 |-- resident_country: integer (nullable = true) three digit number indicating the country of residence of the immigrants, country names are included in I94_SAS_Lables_Description           
 |-- arrive_state: string (nullable = true)  the US state the immigrants first arrived in, can be used as a foreign key together with cic_id when joining with the fact table     

### dim_city
Includes demography information of the cities that appeared in the fact table        
|-- city: string (nullable = true)    name of the city     
 |-- state: string (nullable = true)  state the city belongs to      
 |-- latitude: string (nullable = true)  coordinates latitude      
 |-- longitude: string (nullable = true)  coordinates longitude      
 |-- median_age: float (nullable = true)  median age of populations      
 |-- male_population: integer (nullable = true) number of male population       
 |-- female_population: integer (nullable = true) number of female population       
 |-- total_population: integer (nullable = true)  total number of population       
 |-- veterans_num: integer (nullable = true)  number of veterans      
 |-- foreign_born_population: integer (nullable = true) number of foreign_born population      
 |-- avg_household_size: float (nullable = true) average size of household        
 |-- american_indian_alaska_native: integer (nullable = true) number of population of the race as indicated in the column name               
 |-- asian: integer (nullable = true) number of population of the race as indicated in the column name        
 |-- african_american: integer (nullable = true) number of population of the race as indicated in the column name        
 |-- hispanic_latino: integer (nullable = true) number of population of the race as indicated in the column name        
 |-- white: integer (nullable = true)  number of population of the race as indicated in the column name        
 |-- state_code: string (nullable = true) state code

### dim_visa
Dimension table with information about visa of each immigrant     
 |-- cic_id: integer (nullable = true)  unique ids of immigrants       
 |-- visa_class: integer (nullable = true) one digit integer indicating the class of visa, 1 = Business, 2 = Pleasure, 3 = Student      
 |-- visa_issue_state: string (nullable = true) the code representing the place where visa was issued       
 |-- arrive_flag: string (nullable = true) Arrival Flag - admitted or paroled into the U.S.       
 |-- departure_flag: string (nullable = true) Departure Flag - Departed, lost I-94 or is deceased     
 |-- update_flag: string (nullable = true) Update Flag - Either apprehended, overstayed, adjusted to perm residence        
 |-- match_flag: string (nullable = true) Match flag - Match of arrival and departure records     
 |-- allowed_date: date (nullable = true) Date to which admitted to U.S. (allowed to stay until)       
 |-- visa_type: string (nullable = true)  Class of admission legally admitting the non-immigrant to temporarily stay in U.S. Detailed description is provided in the I94_SAS_Lables_Description       

### dim_airport
Dimension table of information about airport, including following columns 
 |-- ident: string (nullable = true)  ID of airport      
 |-- type: string (nullable = true)  type of airport     
 |-- name: string (nullable = true)  name of airport      
 |-- elevation_ft: string (nullable = true) elevation of airport     
 |-- continent: string (nullable = true)  continent of airport    
 |-- municipality: string (nullable = true)  city of airport    
 |-- gps_code: string (nullable = true) gps_code of airport      
 |-- iata_code: string (nullable = true) unique iata_code of airport      
 |-- local_code: string (nullable = true) local_code of airport    
 |-- longitude: string (nullable = true) coordinates longitude
 |-- latitude: string (nullable = true)   coordinates  latitude     
 |-- iso_region: string (nullable = true) iso_region of airport      

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.        
  In principle, the data should be updated every time a new immigrant is registered, or as often as consumer demand requires.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.       
  I would use Airflow or another workflow management tool to automate the ETL process. For storage, a scalable cloud-based data lake or an on-premise cluster would be appropriate. PySpark or another parallel processing framework would still be preferred.
 * The data populates a dashboard that must be updated daily by 7 am every day.      
 I would move the database to a cloud platform (AWS, for example), connect it to BI tools that work with that platform (Dash or Tableau, for example), and automate the entire data flow with a tool like Airflow.
 * The database needed to be accessed by 100+ people.     
 I would deploy the database on a cloud platform with high availability.