## Data Engineering Capstone Project

#### Project Summary

<font size=3>
Using several datasets provided by Udacity, build a dimensional star schema data model that can be used for simulated data analysis queries related to I94 tourist data as well as weather data (average temperature). The primary (large volume) dataset consists of 12 months of US Government I94 immigration data. This is supplemented with a "conformed" dataset containing average temperatures at the grain of year, month, state which is compatible with the year, month and destination state present in the I94 dataset. The project utilizes various tools covered in the course such as PostgreSQL, Pandas, Spark, etc.  
</font>
  
The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

<font size=2>
Note: because the notebook follows this step framework, the flow of data from source to target for each dataset is split across several steps and may be a little harder to follow.
</font>

### Step 0: Initialize Environment

###### Note: if this notebook is not being executed from the development workspace, remember to load the additional data files from more_data.tar.gz into the data/ directory.

In [1]:
# Do all imports and installs here
import os
import glob
import csv
from datetime import datetime

import numpy as np
import pandas as pd
import psycopg2
from pyspark.sql import SparkSession

In [2]:
# Note need to install pyarrow library prior to running this notebook. From command line execute:
# pip install pyarrow
import pyarrow

In [61]:
# set num_days which determines what subset of days from January, April, July, October of 2016 will be used.
# note that setting num_days = 2 will result in approximately 1M rows being loaded into the primary fact table i94_f.
# if num_days = 1 the entire notebook can execute in approx 15 minutes; num_days = 2 in approx 25 minutes...
num_days = 2

##### Set up the PostgreSQL database environment

In [4]:
# connect to default database
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

In [5]:
# be careful with this since it recreates the database environment...

# create capstone database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS capstone")
cur.execute("CREATE DATABASE capstone WITH ENCODING 'utf8' TEMPLATE template0")

# close connection to default database
conn.close()    

In [6]:
# connect to capstone database
conn = psycopg2.connect("host=127.0.0.1 dbname=capstone user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

###### Also enable SQL "magic" to be used as needed

In [7]:
%load_ext sql
%sql postgresql://student:student@127.0.0.1/capstone

'Connected: student@capstone'

###### Since the I94 dataset has many columns use the following option to enhance output formatting.

In [8]:
# The I94 dataset has many columns. This enhances the output formatting.
pd.set_option('display.max_columns', 40)
#pd.set_option('display.max_rows', 100)

### Step 1: Scope the Project and Gather Data

#### Scope 

- Using several datasets provided by Udacity, build a dimensional star schema data model that can be used for simulated data analysis queries related to I94 tourist data as well as weather data (average temperature). The project utilizes various tools covered in the course such as PostgreSQL, Pandas, Spark, etc.    
- The primary data source is the large I94 tourist dataset provided by Udacity for the project. It consists of tens of millions of rows of I94 forms gathered by US Customs at Ports of Entry during the 12 months of 2016. It is used to create the primary fact table of the resulting star schema. Note that only a subset of the data is loaded.
- This will be supplemented by the global temperature dataset also provided by Udacity. This will be used to create a secondary fact table of the resulting star schema. The keys of the two fact tables will be designed so that they are 'conforming' and can be used together (via joins) if necessary. Since they have very different import feeds they will not be combined into a single fact table.   
- In addition several small datasets have been extracted manually via text editor from the I94 Data Dictionary ```I94_SAS_Labels_Descriptions.SAS``` and will be used to create small dimensions for the resulting star schema.
- One additional dataset was obtained over the web and it contains a cross reference of US Zip Codes, cities and states vs. latitude and longitude. It is used to enhance the world temperature dataset by adding a state code column to it so that it can be matched/joined efficiently against the I94 fact table.
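
A minimal sketch of what "conforming" keys buy in practice: two fact tables that share state, year and month can be joined once the larger one is aggregated to that shared grain. In-memory SQLite stands in for PostgreSQL so the sketch runs without a server. The column names on `i94_f` (`arrival_yr`, `arrival_mon`, `visitors`) and all of the rows are assumptions made up for illustration; they are not the notebook's actual schema or data.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE i94_f (state_code TEXT, arrival_yr INT, arrival_mon INT, visitors INT);
CREATE TABLE i94_state_month_temp_f (state_code TEXT, measure_yr INT, measure_mon INT, avg_temp REAL);
""")

# Hypothetical rows, invented purely for illustration.
con.executemany("INSERT INTO i94_f VALUES (?,?,?,?)",
                [("CA", 2016, 1, 1), ("CA", 2016, 1, 1), ("NY", 2016, 7, 1)])
con.executemany("INSERT INTO i94_state_month_temp_f VALUES (?,?,?,?)",
                [("CA", 2016, 1, 12.5), ("NY", 2016, 7, 24.0)])

# Aggregate the large fact to the shared grain first, then join on the conformed keys.
rows = con.execute("""
SELECT v.state_code, v.arrival_yr, v.arrival_mon, v.visitors, t.avg_temp
  FROM (SELECT state_code, arrival_yr, arrival_mon, SUM(visitors) AS visitors
          FROM i94_f
         GROUP BY state_code, arrival_yr, arrival_mon) v
  JOIN i94_state_month_temp_f t
    ON t.state_code = v.state_code
   AND t.measure_yr = v.arrival_yr
   AND t.measure_mon = v.arrival_mon
 ORDER BY v.state_code
""").fetchall()
print(rows)
```

Aggregating before the join keeps the temperature row from being repeated once per I94 row, which is one reason the two feeds can stay in separate fact tables.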

##### Data Source files and locations

- Global Temperature Data - ```../../data2/GlobalLandTemperaturesByCity.csv```
- I94 immigration Data - ```../../data/18-83510-I94-Data-2016/<12 monthly files>```
- I94 State Codes - ```data/i94_states.csv```
- I94 Country Codes - ```data/i94_countries.csv```
- I94 Ports of Entry Codes - ```data/i94_ports_entry.csv```
- I94 Visa Codes - ```data/i94_visas.csv```
- I94 Mode Codes - ```data/i94_modes.csv```
- Zip Code Latitude and Longitude Data - ```data/us-zip-code-latitude-and-longitude.csv``` 

##### Tools
- The primary database tool is PostgreSQL. In an actual implementation the volume of data would lead towards a more scalable solution such as Redshift.
  - One reason I'm using PostgreSQL is that I've been charged by AWS for processing during several earlier projects. The choice is partially budget related.
  - However for purposes of the exercise I assume that it might be possible to use local PostgreSQL to prototype some elements of Redshift development.
  - In reality it may not be practical to use PostgreSQL for prototyping and local development of Redshift applications since in some ways Redshift is a subset of PostgreSQL (they forked years ago) and in other ways Redshift is a superset of PostgreSQL (since AWS has added many capabilities to the base).
- Secondary tools used for data exploration and ingestion include Pandas, Spark, Pyarrow, Parquet, etc.
  - I'm using Pandas for data exploration as well as limited scale data ingestion processing. 
  - The Pyarrow library is installed and used in order to obtain a specific capability to ingest Parquet format files into Pandas dataframes.
  
##### End Solution - Star Schema Data Mart containing facts for I94 data and US City Temperature data.

![Conceptual Model](conceptual_model.jpg)

#### Pre-process Global (City) Temperature Data Source
* Several datasets are in the form of simple csv files so load them into staging tables or final dimension tables if no transformations are required.

In [9]:
stg_city_temp_create = ("""
CREATE TABLE IF NOT EXISTS stg_city_temp(
    measure_dt DATE,                               -- dt
    avg_temp NUMERIC(8,3),                         -- AverageTemperature
    avg_temp_uncertainty NUMERIC(8,3),             -- AverageTemperatureUncertainty
    city VARCHAR,
    country VARCHAR,
    latitude VARCHAR,
    longitude VARCHAR)
""")
stg_city_temp_drop = "DROP TABLE IF EXISTS stg_city_temp"

In [10]:
cur.execute(stg_city_temp_drop)
cur.execute(stg_city_temp_create)

In [11]:
city_temp_fname = '../../data2/GlobalLandTemperaturesByCity.csv'
with open(city_temp_fname, 'r', encoding='utf-8') as f:
  next(f)   # skip the header row
  cur.copy_from(f, 'stg_city_temp', sep=',', null='')
conn.commit()

#### Pre-process the I94 tourism immigration dataset
* The source data is in a proprietary SAS format.
* First convert selected months of data (e.g. jan, apr, jul, oct) into Parquet format. 
* This is useful since the Parquet files load more quickly into pandas, which is an advantage when the notebook is run multiple times (i.e. during development).
* The following utility will use a parquet version of the data if it already exists or create it. 

In [12]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
                     config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11").\
                     enableHiveSupport().getOrCreate()

In [13]:
def xfer_sas_to_parq(month = 'jan'):
  """Utility function to transfer monthly I94 data from SAS format to parquet format. 
     Parameter is 3 char month code (e.g. 'jan', 'feb', etc.)"""
  sas_file = 'i94_' + month + '16_sub.sas7bdat'
  sas_path = '../../data/18-83510-I94-Data-2016/' + sas_file
  parq_path = 'sas_data_16' + month
    
  # if the parquet directory already exists then skip this process
  if os.path.isdir(parq_path):
    print(f'Target parquet directory {parq_path} for month "{month}" already exists so skip (re)loading it.')
    return
    
  # otherwise load the SAS file into a Spark dataframe and then write it back to Parquet format
  df_i94_spark = spark.read.format('com.github.saurfang.sas.spark').load(sas_path)
    
  print(f'Writing parquet directory {parq_path} for month "{month}".')
  df_i94_spark.write.parquet(parq_path)


In [14]:
# Transfer selected monthly i94 data from SAS format into more efficient Parquet format.
# This process only occurs if the transfer has not previously been performed.
# xfer_sas_to_parq() returns nothing, so there is no result to assign.

xfer_sas_to_parq('jan')

xfer_sas_to_parq('apr')

xfer_sas_to_parq('jul')

xfer_sas_to_parq('oct')

Target parquet directory sas_data_16jan for month "jan" already exists so skip (re)loading it.
Target parquet directory sas_data_16apr for month "apr" already exists so skip (re)loading it.
Target parquet directory sas_data_16jul for month "jul" already exists so skip (re)loading it.
Target parquet directory sas_data_16oct for month "oct" already exists so skip (re)loading it.


In [15]:
# Load selected monthly i94 data into pandas dataframes from parquet files

df_i94_jan_pd = pd.read_parquet('sas_data_16jan', engine='pyarrow')
print('df_i94_jan_pd has shape:', df_i94_jan_pd.shape)  

df_i94_apr_pd = pd.read_parquet('sas_data_16apr', engine='pyarrow')
print('df_i94_apr_pd has shape:', df_i94_apr_pd.shape)  

df_i94_jul_pd = pd.read_parquet('sas_data_16jul', engine='pyarrow')
print('df_i94_jul_pd has shape:', df_i94_jul_pd.shape)  

df_i94_oct_pd = pd.read_parquet('sas_data_16oct', engine='pyarrow')
print('df_i94_oct_pd has shape:', df_i94_oct_pd.shape)  

df_i94_jan_pd has shape: (2847924, 28)
df_i94_apr_pd has shape: (3096313, 28)
df_i94_jul_pd has shape: (4265031, 28)
df_i94_oct_pd has shape: (3649136, 28)


#### Process small static I94 dimensions harvested from data dictionary I94_SAS_Labels_Descriptions.SAS
* These small static dimensions can be processed directly into the target star schema table formats.
* However these datasets were manipulated manually via text editor to extract the data from the I94 data dictionary.
* This seems reasonable to do since these are static, one time only exercises.

In [16]:
i94_states_create = ("""
CREATE TABLE IF NOT EXISTS i94_states_d(
    state_code VARCHAR PRIMARY KEY,
    state_name VARCHAR)
""")

i94_states_drop = "DROP TABLE IF EXISTS i94_states_d"

cur.execute(i94_states_drop)
cur.execute(i94_states_create)

# Note that MP/Northern Mariana Islands, and OT/OTHER were added to the csv file manually.
i94_states_fname = 'data/i94_states.csv'
copy_command = """
COPY i94_states_d FROM STDIN WITH ( FORMAT csv, HEADER, DELIMITER ',' , NULL '' , QUOTE '''' )
"""
with open(i94_states_fname, 'r', ) as f:
  cur.copy_expert(copy_command, f)
conn.commit()  

In [17]:
i94_countries_create = ("""
CREATE TABLE IF NOT EXISTS i94_countries_d(
    country_code VARCHAR PRIMARY KEY,
    country_name VARCHAR)
""")

i94_countries_drop = "DROP TABLE IF EXISTS i94_countries_d"

cur.execute(i94_countries_drop)
cur.execute(i94_countries_create)

# need to use copy_expert() because it allows use of PostgreSQL COPY command quote option. 
# Single quotes are needed to encapsulate some commas in country name column.
i94_countries_fname = 'data/i94_countries.csv'
copy_command = """
COPY i94_countries_d FROM STDIN WITH ( FORMAT csv, HEADER, DELIMITER ',' , NULL '' , QUOTE '''' )
"""
with open(i94_countries_fname, 'r', ) as f:
  cur.copy_expert(copy_command, f)
conn.commit()  

In [18]:
i94_ports_entry_create = ("""
CREATE TABLE IF NOT EXISTS i94_ports_entry_d(
    port_code VARCHAR PRIMARY KEY,
    port_of_entry VARCHAR,
    state_code VARCHAR)
""")

i94_ports_entry_drop = "DROP TABLE IF EXISTS i94_ports_entry_d"

cur.execute(i94_ports_entry_drop)
cur.execute(i94_ports_entry_create)

# need to use copy_expert() because it allows use of PostgreSQL COPY command quote option. 
# Single quotes are needed to encapsulate some commas in port_of_entry column.
i94_ports_entry_fname = 'data/i94_ports_entry.csv'
copy_command = """
COPY i94_ports_entry_d FROM STDIN WITH ( FORMAT csv, HEADER, DELIMITER ',' , NULL '' , QUOTE '"' )
"""
with open(i94_ports_entry_fname, 'r', ) as f:
  cur.copy_expert(copy_command, f)
conn.commit()  

In [19]:
i94_visas_create = ("""
CREATE TABLE IF NOT EXISTS i94_visas_d(
    visa_code VARCHAR PRIMARY KEY,
    visa_name VARCHAR)
""")

i94_visas_drop = "DROP TABLE IF EXISTS i94_visas_d"

cur.execute(i94_visas_drop)
cur.execute(i94_visas_create)

i94_visas_fname = 'data/i94_visas.csv'
with open(i94_visas_fname, 'r', ) as f:
  next(f) # Skip the header row.
  cur.copy_from(f, 'i94_visas_d', sep=',')
conn.commit() 

In [20]:
i94_modes_create = ("""
CREATE TABLE IF NOT EXISTS i94_modes_d(
    mode_code VARCHAR PRIMARY KEY,
    mode_name VARCHAR)
""")

i94_modes_drop = "DROP TABLE IF EXISTS i94_modes_d"

cur.execute(i94_modes_drop)
cur.execute(i94_modes_create)

i94_modes_fname = 'data/i94_modes.csv'
with open(i94_modes_fname, 'r', ) as f:
  next(f) # Skip the header row.
  cur.copy_from(f, 'i94_modes_d', sep=',')
conn.commit()  

#### Predefine the I94 tourism immigration staging table

* Note that this dataset was examined extensively via Pandas and SQL to determine which columns could be defined as BIGINT vs INT vs NUMERIC vs VARCHAR.
* In addition there were many queries used to determine the characteristics of the contents of the various columns.

In [21]:
stg_i94_create = ("""
CREATE TABLE IF NOT EXISTS stg_i94(
    cicid BIGINT, -- PRIMARY KEY, 
    i94yr INT, 
    i94mon INT, 
    i94cit VARCHAR,    
    i94res VARCHAR,    
    i94port VARCHAR, 
    arrdate INT,          -- 19600101+
    i94mode VARCHAR, 
    i94addr VARCHAR,      -- declared destination state of tourist - this is a dirty column
    depdate INT,          -- 19600101+
    i94bir INT,           -- actually this is age
    i94visa VARCHAR,    
    count INT, 
    dtadfile VARCHAR,     -- YYYYMMDD  this is a clean column
    visapost VARCHAR, 
    occup VARCHAR, 
    entdepa VARCHAR, 
    entdepd VARCHAR, 
    entdepu VARCHAR, 
    matflag VARCHAR, 
    biryear INT, 
    dtaddto VARCHAR,      -- MMDDYYYY  this is a dirty column 
    gender VARCHAR, 
    insnum VARCHAR, 
    airline VARCHAR, 
    admnum VARCHAR, 
    fltno VARCHAR, 
    visatype VARCHAR)
""")

stg_i94_drop = "DROP TABLE IF EXISTS stg_i94"

In [22]:
cur.execute(stg_i94_drop)
cur.execute(stg_i94_create)

### Step 2: Explore and Assess the Data

#### Explore the Data
* A significant amount of data exploration was required to determine which datasets could be used and what cleaning and enhancements they might require. You can examine these notebooks for evidence of this activity: 
  * create_tables.ipynb
  * data_wrangling_pandas.ipynb 
  * data_wrangling_sql.ipynb
* Several other Udacity provided datasets such as Airport Codes and US Cities Demographics were explored. Based on that exploration it was decided not to use these datasets. 
  * In particular it was difficult to imagine semi-realistic, plausible queries that could be applied to the combination of I94 data with these datasets.
  * So it was decided to concentrate on the I94 data and the Global City Temperature datasets to create the target star schema data mart.

#### Explore City Temperature dataset
* In order for the City Temperature data to be compatible as a "conformed" fact table with the I94 fact table a state code needs to be added to the former since it only contains City Names and Latitude and Longitude. These three form a unique key for the source dataset.
* In order to add a State Code to the City Temperature dataset I determined that I would need a cross reference of City Name, Latitude, Longitude to US State Code.
* Originally I thought I could construct such a cross reference from the data contained in the Udacity provided Airport Codes dataset. It has the necessary columns. However after exploring that dataset I found that it didn't have enough geographical coverage of the US.
* The following is one example of the MANY data exploration exercises that were done to determine what data could be used. This query of the Global City Temperature dataset reveals that the latitude and longitude contained in the dataset do not pinpoint the cities. They are at the grain of a region and might point to a location more than 30 miles away from the city. In the output below, all four cities in the New York area share the same values for latitude and longitude. The commented-out filter can be used to run the same check for Bay Area cities.
* Based on queries like this it was determined that in order to have the city temperature data be compatible with the I94 dataset a better method would be needed to add a state column to this dataset.

In [23]:
%%sql 
select max(measure_dt) dt, round(avg(avg_temp),2) avg_temp, min(city) city, min(country) country, min(latitude) latitude, min(longitude) longitude
  from stg_city_temp 
 where country = 'United States'
   --and city in ('San Francisco', 'Oakland', 'Berkeley', 'Vallejo', 'Sacramento', 'San Jose')
   and city in ('New York', 'Jersey City', 'Newark', 'Yonkers')
 group by country, city
 order by latitude, longitude
 limit 6

 * postgresql://student:***@127.0.0.1/capstone
4 rows affected.


dt,avg_temp,city,country,latitude,longitude
2013-09-01,9.52,Newark,United States,40.99N,74.56W
2013-09-01,9.52,Jersey City,United States,40.99N,74.56W
2013-09-01,9.52,New York,United States,40.99N,74.56W
2013-09-01,9.52,Yonkers,United States,40.99N,74.56W


#### Process zip code vs latitude-longitude cross reference data

* In order to be able to match data from the city temperature dataset to the I94 dataset we need to be able to enhance the former to include state codes. In its raw form the city temp dataset has a pseudo primary key of country and city compounded with Latitude and Longitude but no state code.
* In order to enhance the city temperature dataset to include state codes in addition to city names we can use the latitude and longitude from this dataset in conjunction with a cross reference dataset which has city, state and latitude, longitude data elements.
* Note that the provided airports dataset does have these data elements and this was considered as a possible source for the required cross reference data but there is not enough geographical coverage of US cities in that dataset so it was ultimately discarded.
* An alternative dataset with zip code, state, city to latitude, longitude cross reference capability was obtained at:
https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/
* Data structure: Zip;City;State;Latitude;Longitude;Timezone;Daylight savings time flag;Geopoint

In [24]:
stg_zip_lat_lon_xref_create = ("""
CREATE TABLE IF NOT EXISTS stg_zip_lat_lon_xref(
    zip VARCHAR,
    city VARCHAR,
    state VARCHAR,
    latitude NUMERIC,
    longitude NUMERIC,
    timezone NUMERIC,
    dst_flag NUMERIC,
    geopoint VARCHAR)
""")

stg_zip_lat_lon_xref_drop = "DROP TABLE IF EXISTS stg_zip_lat_lon_xref"

cur.execute(stg_zip_lat_lon_xref_drop)
cur.execute(stg_zip_lat_lon_xref_create)

zip_lat_lon_fname = 'data/us-zip-code-latitude-and-longitude.csv'
with open(zip_lat_lon_fname, 'r', ) as f:
  next(f) # Skip the header row.
  cur.copy_from(f, 'stg_zip_lat_lon_xref', sep=';', null='')  # interpret empty string as NULL
conn.commit()  

In [25]:
%sql select count(*) from stg_zip_lat_lon_xref;

 * postgresql://student:***@127.0.0.1/capstone
1 rows affected.


count
43191


In [26]:
%sql select zip, city, state, latitude, longitude, timezone, dst_flag from stg_zip_lat_lon_xref where city = 'Sausalito' limit 5;

 * postgresql://student:***@127.0.0.1/capstone
2 rows affected.


zip,city,state,latitude,longitude,timezone,dst_flag
94966,Sausalito,CA,38.068036,-122.740988,-8,1
94965,Sausalito,CA,37.855527,-122.49949,-8,1


###### Perform one processing step which reduces the cross reference data down from zip code to city.

In [27]:
%%sql
drop table if exists stg_city_lat_lon_xref cascade;
create table stg_city_lat_lon_xref as 
select city, 
       state state_code, 
       round(avg(latitude),6) latitude, 
       round(avg(longitude),6) longitude, 
       min(timezone) timezone, 
       min(dst_flag) dst_flag, 
       count(*) zip_count
  from stg_zip_lat_lon_xref
 group by state, city
;  

 * postgresql://student:***@127.0.0.1/capstone
Done.
30346 rows affected.


[]

In [28]:
%sql select * from stg_city_lat_lon_xref where city = 'Sausalito' limit 5;

 * postgresql://student:***@127.0.0.1/capstone
1 rows affected.


city,state_code,latitude,longitude,timezone,dst_flag,zip_count
Sausalito,CA,37.961782,-122.620239,-8,1,2
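
The zip-to-city reduction above can be reproduced in plain Python as a sanity check. This is a small sketch, not the notebook's method: it uses the two Sausalito rows from the earlier query output, and `Decimal` with half-up rounding is an assumption chosen so that the result lines up with the six-decimal values shown by the SQL.

```python
from decimal import Decimal, ROUND_HALF_UP

# The two Sausalito zip-code rows shown in the earlier query output.
zip_rows = [
    ("94966", "Sausalito", "CA", Decimal("38.068036"), Decimal("-122.740988")),
    ("94965", "Sausalito", "CA", Decimal("37.855527"), Decimal("-122.49949")),
]

def reduce_to_city(rows):
    """Group zip rows by (state, city); average the coordinates and count zips."""
    groups = {}
    for _zip, city, state, lat, lon in rows:
        groups.setdefault((state, city), []).append((lat, lon))
    six_places = Decimal("0.000001")
    result = {}
    for key, coords in groups.items():
        n = len(coords)
        lat = (sum(c[0] for c in coords) / n).quantize(six_places, rounding=ROUND_HALF_UP)
        lon = (sum(c[1] for c in coords) / n).quantize(six_places, rounding=ROUND_HALF_UP)
        result[key] = (lat, lon, n)
    return result

city_rows = reduce_to_city(zip_rows)
print(city_rows[("CA", "Sausalito")])
```

The averaged point is the midpoint of the zip-code centroids, so for cities with few zip codes it is only a rough location; that is acceptable here because the later match uses a tolerance of two degrees.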


#### Cleanse the I94 dataframes

* Note that pandas imports the data by default with loosely defined datatypes: all numbers are float64; all strings are object. In addition np.nan/NaN values are used to represent null or missing data for both datatypes. NaN is not the same thing as SQL NULL, which is what PostgreSQL (like other SQL databases) uses for missing data, and NaN cannot be stored in the INT columns of the staging table.
  * Therefore it is necessary to remove the NaN values and replace with None/NULL prior to insert into PostgreSQL.
  * Use the pandas notna() utility for this purpose.
* Also take advantage of this step to utilize sampling to reduce the subset of data to be pulled from SAS/Pandas into PostgreSQL.
  * One way to do this is to use a pandas query filter to pick an equivalent, representative subset of days from each of the months. 
  * So use several days of data starting with the 14th day of each of four seasonally representative months (January, April, July, October).
  * Note that if 2 days of data are used then the total rows in the i94 fact table will be close to 1M which is a project requirement...
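
As a small illustration of the NaN-to-NULL step, here is a sketch that works on a single row tuple using only the standard library. The helper name `nan_to_none` and the sample row are hypothetical; the notebook itself does the equivalent on whole dataframes. psycopg2 adapts Python `None` to SQL NULL, which is why `None` is the target value.

```python
import math

def nan_to_none(row):
    """Return a copy of the row with float NaN values replaced by None."""
    return tuple(
        None if isinstance(v, float) and math.isnan(v) else v
        for v in row
    )

# Hypothetical row: a numeric id, a state code, a missing number, a visa type.
row = (5748517.0, "CA", float("nan"), "B2")
cleaned = nan_to_none(row)
print(cleaned)
```

The `isinstance` check matters because `math.isnan` raises a `TypeError` on strings, and string columns are mixed with numeric ones in every I94 row.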

In [29]:
# Note that num_days has been set above during Step 0. This is used in the next cells to slice/partition the source data.

In [30]:
def sample_slice(df_i94, start_date, num_days = 1, tag = None):
    """Utility function to create a sample slice of data from an I94 dataframe.
       Parameters: df_i94 - dataframe representing one month of I94 data 
                   start_date - start date for sample in character YYYY-MM-DD format
                   num_days - number of days of data to include"""
    if tag: print(f'Processing sample slice for {tag}')
    
    first_date = datetime.strptime('1960-01-01', "%Y-%m-%d")
    slice_date = datetime.strptime(start_date, "%Y-%m-%d")
    days_offset = (slice_date - first_date).days
    
    df_slice = df_i94.query('arrdate>=@days_offset & arrdate<(@days_offset + @num_days)')
    print('slice has shape:', df_slice.shape)
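    # cast to object first so that NaN in numeric columns can be replaced by None on any pandas version
    return df_slice.astype(object).where(pd.notna(df_slice), None)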
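
The day-offset arithmetic used by `sample_slice` can be checked in isolation. The sketch below assumes, as the staging table comments and the function above indicate, that `arrdate` counts days since 1960-01-01. The helper names are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)   # day 0 of the SAS date scale

def date_to_sas_days(d):
    """Number of days between the SAS epoch and the given date."""
    return (d - SAS_EPOCH).days

def sas_days_to_date(n):
    """Calendar date for a SAS day number (floats from pandas are truncated to int)."""
    return SAS_EPOCH + timedelta(days=int(n))

start = date_to_sas_days(date(2016, 1, 14))
print(start)                       # 20467
print(sas_days_to_date(20545.0))   # 2016-04-01
```

With `num_days = 2` the January slice therefore covers `arrdate` values 20467 and 20468.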

In [31]:
# obtain sample slices from four seasonally representative months

df_jan = sample_slice(df_i94_jan_pd, '2016-01-14', num_days, tag='16jan')
#display(df_jan.head(4))

df_apr = sample_slice(df_i94_apr_pd, '2016-04-14', num_days, tag='16apr')
#display(df_apr.head(4))

df_jul = sample_slice(df_i94_jul_pd, '2016-07-14', num_days, tag='16jul')
#display(df_jul.head(4))

df_oct = sample_slice(df_i94_oct_pd, '2016-10-14', num_days, tag='16oct')
#display(df_oct.head(4))

# ... or alternatively use the entire month but prepare to wait awhile for this to complete ...
#df_oct = df_i94_oct_pd.where(pd.notna(df_i94_oct_pd), None)  

Processing sample slice for 16jan
slice has shape: (92151, 28)
Processing sample slice for 16apr
slice has shape: (107557, 28)
Processing sample slice for 16jul
slice has shape: (140666, 28)
Processing sample slice for 16oct
slice has shape: (130102, 28)


#### Transfer the I94 data from Pandas dataframes to PostgreSQL staging table.
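* Utilize executemany() to load the data in chunks. Note that the psycopg2 documentation states that executemany() is currently no faster than calling execute() in a loop; the execute_batch() and execute_values() helpers in psycopg2.extras are the faster options if load time becomes a problem.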
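
The chunking idea on its own can be sketched as follows. This is an illustration under stated assumptions, not the loader used in this notebook: in-memory SQLite stands in for PostgreSQL (so the placeholders are `?` rather than psycopg2's `%s`), and the `stg_demo` table, the `chunks` helper and the rows are all hypothetical.

```python
import sqlite3
from itertools import islice

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_demo (cicid INTEGER, i94addr TEXT)")

rows = ((i, "CA") for i in range(25))   # hypothetical data
batch_sizes = []
for batch in chunks(rows, 10):
    con.executemany("INSERT INTO stg_demo VALUES (?, ?)", batch)
    batch_sizes.append(len(batch))
con.commit()

total = con.execute("SELECT COUNT(*) FROM stg_demo").fetchone()[0]
print(batch_sizes, total)
```

Separating the batching from the insert keeps memory use bounded by the chunk size and makes the final, partial chunk fall out naturally instead of needing a special case.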

In [32]:
def load_i94(df, max_rows = 999_999_999, chunk_size = 10_000, tag = None):
    """Utility function to efficiently load i94 data from a pandas dataframe into a PostgreSQL staging table. 
       Parameters: df - dataframe to load; max_rows - maximum row cutoff; chunk_size - number of rows to load in each call to executemany()
    """
    i94_table_insert = ("""
    INSERT INTO stg_i94(cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,
                        visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype) 
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
    --ON CONFLICT (cicid)
    --  DO NOTHING
    """)   
    
    if tag:
        print(f'Processing data for {tag}')

    tot_rows = len(df)
    buf = list()
    for i, row in enumerate(df.itertuples(index = False)):
        buf.append(row)
        if (i+1)%chunk_size==0 or (i+1) == max_rows or (i+1) == tot_rows:
            print(f'dumping buffer sized {len(buf)} at row {i}')
            try:   
                cur.executemany(i94_table_insert, buf)
            except Exception as e:
                # report the failing chunk, then move on to the next one
                print(e)
                print(i, row)
            buf = list()   # always start a fresh buffer so a failed chunk is not resent
        if (i+1)>=max_rows: break

In [33]:
# load sample data from four seasonally representative months into PostgreSQL staging table

load_i94(df_jan, tag='16jan')

load_i94(df_apr, tag='16apr')

load_i94(df_jul, tag='16jul')

load_i94(df_oct, tag='16oct')

Processing data for 16jan
dumping buffer sized 10000 at row 9999
dumping buffer sized 10000 at row 19999
dumping buffer sized 10000 at row 29999
dumping buffer sized 10000 at row 39999
dumping buffer sized 10000 at row 49999
dumping buffer sized 10000 at row 59999
dumping buffer sized 10000 at row 69999
dumping buffer sized 10000 at row 79999
dumping buffer sized 10000 at row 89999
dumping buffer sized 2151 at row 92150
Processing data for 16apr
dumping buffer sized 10000 at row 9999
dumping buffer sized 10000 at row 19999
dumping buffer sized 10000 at row 29999
dumping buffer sized 10000 at row 39999
dumping buffer sized 10000 at row 49999
dumping buffer sized 10000 at row 59999
dumping buffer sized 10000 at row 69999
dumping buffer sized 10000 at row 79999
dumping buffer sized 10000 at row 89999
dumping buffer sized 10000 at row 99999
dumping buffer sized 7557 at row 107556
Processing data for 16jul
dumping buffer sized 10000 at row 9999
dumping buffer sized 10000 at row 19999
dumpin

#### Explore I94 staging dataset - perform data integrity checks
* It appears that the i94addr column represents the ultimate destination state of the traveller within the US. There are a lot of bad code values in this column. This is mentioned in the data dictionary I94_SAS_Labels_Descriptions.SAS. The large majority of values in this column are US state and territory codes.
* In order to 'cleanse' this column use the following update which will backfill NULL or questionable codes by using the state code of the port of entry as a reasonable best guess for ultimate destination within US. An alternative strategy would be to ignore these rows and/or set the questionable values to NULL.
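
The backfill rule described above can be expressed as a small pure function. This is a sketch of the logic only, in Python rather than SQL; the function name, the subset of state codes and the port-to-state mapping are hypothetical stand-ins for `i94_states_d` and `i94_ports_entry_d`.

```python
def clean_dest_state(i94addr, i94port, valid_states, port_state):
    """Keep a valid destination state; otherwise fall back to the port of entry's state."""
    if i94addr in valid_states:
        return i94addr
    # Unknown port codes yield None, which would be stored as NULL.
    return port_state.get(i94port)

valid_states = {"CA", "NY", "FL"}          # hypothetical subset of state codes
port_state = {"SFR": "CA", "NYC": "NY"}    # hypothetical port-to-state lookup

print(clean_dest_state("FL", "NYC", valid_states, port_state))   # valid code is kept
print(clean_dest_state(None, "SFR", valid_states, port_state))   # NULL is backfilled
print(clean_dest_state("XX", "NYC", valid_states, port_state))   # bad code is backfilled
```

The trade-off is that a backfilled value is a guess: a traveller arriving in New York may be headed elsewhere, so setting questionable values to NULL instead keeps the column strictly factual at the cost of fewer usable rows.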

In [34]:
%%sql
select count(*), i94addr 
  from stg_i94 
 where i94addr is null 
    or i94addr not in (select state_code from i94_states_d) 
 group by i94addr 
 order by 1 desc limit 6;

 * postgresql://student:***@127.0.0.1/capstone
6 rows affected.


count,i94addr
20446,
1431,US
433,VQ
235,UN
140,HA
116,XX


* It appears that the dtadfile date is clean. However the dtaddto date is dirty. There is one semi-standard non-date string found in the data ('D/S').
* In addition there are malformed dates as well. These will be set to NULL below.
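
One way to express the "set malformed dates to NULL" rule is a parser that returns `None` for anything that is not a plausible MMDDYYYY date. This is a sketch, not the notebook's cleaning step: the function name is hypothetical, and the 2016 to 2019 year window is an assumption borrowed from the exploratory query in the next cell.

```python
from datetime import date

def parse_dtaddto(value, min_year=2016, max_year=2019):
    """Parse an MMDDYYYY string; return None for 'D/S', zeros, or malformed values."""
    if value is None or len(value) != 8 or not (value.isascii() and value.isdigit()):
        return None
    month, day, year = int(value[:2]), int(value[2:4]), int(value[4:])
    if not min_year <= year <= max_year:
        return None
    try:
        return date(year, month, day)
    except ValueError:          # e.g. month 13 or February 30
        return None

for raw in ("10282016", "D/S", "00000000", "`1132017", "12319999"):
    print(repr(raw), "->", parse_dtaddto(raw))
```

Only the first value parses; the others are the kinds of strings that appear in the query outputs below.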

In [35]:
%sql select count(*), dtaddto from stg_i94 where substr(dtaddto,5,4) not in ('2016', '2017', '2018', '2019') group by dtaddto;

 * postgresql://student:***@127.0.0.1/capstone
7 rows affected.


count,dtaddto
1098,00000000
1,04272020
1,11300002
1,09222021
16439,D/S
1,12319999
1,09132020


In [36]:
%sql select count(*), dtaddto from stg_i94 where dtaddto='00000000' or dtaddto not similar to '[0-9]*' group by dtaddto;

 * postgresql://student:***@127.0.0.1/capstone
3 rows affected.


count,dtaddto
1098,00000000
16439,D/S
1,`1132017


#### Cleaning Steps


##### Cleaning City Temperature dataset
* The City Temperature dataset has city names and fuzzy, inaccurate latitude and longitude data but does NOT have a US State code. In order to use this with the I94 dataset, which uses state codes, we need to enhance this dataset to have state codes in addition to city names. This is necessary because many city names are not unique and exist across multiple states. For example cities such as Portland, Springfield, etc. may be found in many US states.
* Filters are applied to the Global City Temperature dataset in this step.
  * Only data for the United States is selected.
  * Only data for one year (2012) is selected.
  * In addition, since this dataset as provided by Udacity does not include the year 2016 (all provided I94 data is from 2016), I have taken the liberty of 'converting' the 2012 data into simulated 2016 data by adding four years to each date. In the real world I'd go back to the source of the dataset to find data covering 2016.
  * Denormalized columns for year and month are added to make this similar to the I94 dataset.
  * The latitude and longitude data is converted to numeric form.
* Next create a final cross reference enabling view which uses the zip/city cross reference file processed earlier. It matches based on:
  * proximity of latitude and longitude between the city temp dataset and the zip city xref dataset  
  * ranking cities by size based on how many zip codes are present in a city which prioritizes matches with mid to large cities over small towns
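
The coordinate conversion mentioned in the list above turns strings such as `40.99N` and `74.56W` into signed numbers. A minimal Python sketch of the same idea follows; the function name is hypothetical, and unlike the SQL in the next cell (which treats anything other than N or E as negative) it rejects an unknown hemisphere letter.

```python
def parse_coord(text):
    """Convert '40.99N' / '74.56W' style strings to signed decimal degrees."""
    hemisphere = text[-1].upper()
    magnitude = float(text[:-1])
    if hemisphere in ("N", "E"):
        return magnitude
    if hemisphere in ("S", "W"):
        return -magnitude
    raise ValueError(f"unexpected hemisphere suffix in {text!r}")

print(parse_coord("40.99N"), parse_coord("74.56W"))
```

Signed degrees are what make the later proximity test a simple absolute difference.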

In [37]:
%%sql
drop table if exists stg_us_city_temp cascade;
create table stg_us_city_temp as 
select measure_dt + interval '4 year' measure_dt,     -- cheat a little to simulate the year 2016 in the data
       cast(extract(year from measure_dt) as int) + 4 measure_yr, 
       cast(extract(month from measure_dt) as int) measure_mon,
       avg_temp,
       avg_temp_uncertainty,
       city,
       cast('' as VARCHAR) as state_code,     -- placeholder
       md5(city || latitude || longitude) as city_key,  -- quick and dirty PK/FK for parent/child tables
       cast(substr(latitude, 1, length(latitude)-1) as numeric) * case substr(latitude, length(latitude)) when 'N' then 1 else -1 end as latitude,
       cast(substr(longitude, 1, length(longitude)-1) as numeric) * case substr(longitude, length(longitude)) when 'E' then 1 else -1 end as longitude
 from stg_city_temp
 where country = 'United States'
   and measure_dt between '2012-01-01' and '2012-12-31'
;

 * postgresql://student:***@127.0.0.1/capstone
Done.
3084 rows affected.


[]

###### Use the size of a city (based on number of zip codes within city boundaries) to break ties and assign states to cities.
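
The matching rule can be sketched in Python before reading the SQL view. This is an illustration under stated assumptions: the function name is hypothetical, and the cross-reference rows (coordinates and zip counts) are invented for the example rather than taken from the dataset. The two-degree tolerance mirrors the view below.

```python
def pick_state(city, lat, lon, xref, tol=2.0):
    """Return the state of the same-named city within `tol` degrees that has the most zip codes."""
    candidates = [
        r for r in xref
        if r["city"] == city
        and abs(r["latitude"] - lat) < tol
        and abs(r["longitude"] - lon) < tol
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["zip_count"])["state_code"]

# Invented rows for illustration only.
xref = [
    {"city": "Kansas City", "state_code": "MO", "latitude": 39.1, "longitude": -94.6, "zip_count": 50},
    {"city": "Kansas City", "state_code": "KS", "latitude": 39.1, "longitude": -94.7, "zip_count": 15},
    {"city": "Portland", "state_code": "OR", "latitude": 45.5, "longitude": -122.7, "zip_count": 40},
    {"city": "Portland", "state_code": "ME", "latitude": 43.7, "longitude": -70.3, "zip_count": 10},
]

print(pick_state("Kansas City", 39.38, -93.64, xref))   # both states are nearby; larger zip_count wins
print(pick_state("Portland", 45.81, -123.46, xref))     # only one candidate is within tolerance
print(pick_state("Miramar", 26.52, -80.6, xref))        # no candidate at all
```

Note one difference from the view: `max` returns a single row when zip counts tie, whereas the SQL keeps every row that equals the maximum.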

In [38]:
%%sql
drop view if exists stg_state_city_xref;
create view stg_state_city_xref as
select x.*
  from (select y.*,
               max(y.zip_count) over (partition by y.city_key) as zip_max, 
               count(*) over (partition by y.city_key) as dups
          from (select t.city, t.city_key, a.state_code, a.zip_count
                  from (select distinct city_key, city, latitude, longitude 
                          from stg_us_city_temp) t 
                  join stg_city_lat_lon_xref a on a.city=t.city and abs(a.latitude - t.latitude)<2.0 and abs(a.longitude - t.longitude)<2.0 
               ) y  
       ) x
 where x.zip_count=x.zip_max
;

 * postgresql://student:***@127.0.0.1/capstone
Done.
Done.


[]
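
As a rough illustration of what the view computes, the sketch below reproduces the matching rule in plain Python: same city name, latitude and longitude each within 2 degrees, then keep the candidate with the largest zip count. The function name `assign_states` and all sample rows are made up for illustration. One difference from the view: on an exact tie in zip count this sketch picks one state arbitrarily, whereas the view would return both rows.

```python
from collections import defaultdict


def assign_states(temp_cities, xref_rows, tolerance=2.0):
    """Return {city_key: state_code} using name + proximity + largest zip count."""
    candidates = defaultdict(list)
    for city_key, city, lat, lon in temp_cities:
        for x_city, state_code, x_lat, x_lon, zip_count in xref_rows:
            same_name = x_city == city
            nearby = abs(x_lat - lat) < tolerance and abs(x_lon - lon) < tolerance
            if same_name and nearby:
                candidates[city_key].append((zip_count, state_code))
    # max() on (zip_count, state_code) tuples keeps the largest city per key.
    return {key: max(rows)[1] for key, rows in candidates.items()}


# Hypothetical sample rows: (city_key, city, latitude, longitude)
temp_cities = [
    ("k1", "Springfield", 39.38, -89.48),
    ("k2", "Kansas City", 39.38, -93.64),
]
# Hypothetical sample rows: (city, state_code, latitude, longitude, zip_count)
xref_rows = [
    ("Springfield", "IL", 39.80, -89.64, 9),
    ("Springfield", "MO", 37.21, -93.29, 14),  # too far south to match k1
    ("Kansas City", "MO", 39.10, -94.58, 40),
    ("Kansas City", "KS", 39.11, -94.63, 12),
]

state_by_city = assign_states(temp_cities, xref_rows)
print(state_by_city)
```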

###### Now inject state codes from state city xref into city temp dimension

In [39]:
%%sql
update stg_us_city_temp t
   set state_code = (select x.state_code 
                         from stg_state_city_xref x 
                        where t.city_key = x.city_key
                          and x.zip_count>1)
;

 * postgresql://student:***@127.0.0.1/capstone
3084 rows affected.


[]

In [40]:
%sql select count(*), state_code from stg_us_city_temp group by state_code order by 1 desc limit 10;

 * postgresql://student:***@127.0.0.1/capstone
10 rows affected.


count,state_code
720,CA
312,TX
168,
168,FL
120,AZ
96,VA
84,MI
84,NC
84,IL
84,CO


In [41]:
# This shows cities from the city temp dataset that have no matches in the zip city state xref dataset
%sql select distinct city, latitude, longitude from stg_us_city_temp where state_code is null;

 * postgresql://student:***@127.0.0.1/capstone
14 rows affected.


city,latitude,longitude
Miramar,26.52,-80.6
Nuevo Laredo,28.13,-99.09
Paradise,36.17,-115.36
East Los Angeles,34.56,-118.7
Lakewood,39.38,-104.05
Windsor,42.59,-82.91
Coral Springs,26.52,-80.6
Lexington Fayette,37.78,-85.42
Sunrise Manor,36.17,-115.36
Thornton,39.38,-104.05


###### Finally create target "conformed" fact table for State Month Temperatures

In [42]:
%%sql
drop table if exists i94_state_month_temp_f;
create table i94_state_month_temp_f as
select state_code, 
       measure_dt, 
       measure_yr, 
       measure_mon, 
       round(avg(avg_temp),3) avg_temp
  from stg_us_city_temp
 where state_code is not null    -- ignore the ones that didn't match
 group by state_code, measure_dt, measure_yr, measure_mon
 order by state_code, measure_dt, measure_yr, measure_mon
;

 * postgresql://student:***@127.0.0.1/capstone
Done.
504 rows affected.


[]
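
For readers less familiar with SQL aggregation, here is a minimal Python sketch of the same roll-up: city-level temperatures are grouped by state, year and month, rows without a state code are skipped, and the mean is rounded to three decimals. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def state_month_averages(rows):
    """rows: iterable of (state_code, year, month, avg_temp) at city grain."""
    groups = defaultdict(list)
    for state_code, year, month, avg_temp in rows:
        if state_code is None:  # cities that did not match a state are ignored
            continue
        groups[(state_code, year, month)].append(avg_temp)
    return {key: round(mean(values), 3) for key, values in groups.items()}


# Hypothetical city-level rows
city_rows = [
    ("FL", 2016, 1, 16.0),
    ("FL", 2016, 1, 18.0),
    (None, 2016, 1, 5.0),
    ("TX", 2016, 1, 10.5),
]
state_rows = state_month_averages(city_rows)
print(state_rows)
```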

#### Perform cleaning on staged I94 dataset. 
* Clean up dtaddto and i94addr columns based on analysis above.

In [43]:
%%sql
update stg_i94
   set dtaddto = null
 where dtaddto='00000000' or dtaddto not similar to '[0-9]*'
; 

 * postgresql://student:***@127.0.0.1/capstone
17538 rows affected.


[]
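
The cleansing rule in the update above can be expressed as a small Python function. This is a sketch, not the notebook's code: `clean_dtaddto` is a hypothetical name and the sample values are illustrative. `re.fullmatch` is used because PostgreSQL's `SIMILAR TO` must match the entire string.

```python
import re


def clean_dtaddto(value):
    """Return None for placeholder or non-numeric values, else the value unchanged."""
    if value is None:
        return None
    # Mirrors: dtaddto = '00000000' OR dtaddto NOT SIMILAR TO '[0-9]*'
    if value == "00000000" or re.fullmatch(r"[0-9]*", value) is None:
        return None
    return value


samples = ["10282016", "00000000", "D/S", None]
cleaned = [clean_dtaddto(v) for v in samples]
print(cleaned)
```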

In [44]:
%%sql
update stg_i94 i
   set i94addr = (select state_code
                    from i94_ports_entry_d p
                   where i.i94port = p.port_code)
 where i.i94addr is null
    or i.i94addr not in (select state_code from i94_states_d)
;

 * postgresql://student:***@127.0.0.1/capstone
23890 rows affected.


[]
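
The fallback logic of this update can be sketched as follows: when the destination state is missing or not a known state code, substitute the state of the port of entry, which may itself be unknown. The function name is hypothetical; the port and state codes in the sample are taken from query results shown later in this notebook.

```python
def clean_i94addr(i94addr, i94port, valid_states, port_to_state):
    """Replace a missing or invalid destination state with the port's state."""
    if i94addr is None or i94addr not in valid_states:
        # .get() returns None when the port has no known state,
        # just as the SQL subquery yields NULL when nothing matches.
        return port_to_state.get(i94port)
    return i94addr


valid_states = {"HI", "FL", "NY"}
port_to_state = {"HHW": "HI", "MIA": "FL"}

print(clean_i94addr("FL", "MIA", valid_states, port_to_state))  # already valid
print(clean_i94addr(None, "HHW", valid_states, port_to_state))  # filled from port
```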

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model
* The target data model is a conventional star schema with the I94 data as the primary fact.
* There are standard dimensions for elements such as States, Countries, Ports of Entry, Visa Type, etc.
* In addition there is a secondary "conformed" fact table to contain average temperature data that can be matched against the I94 fact table. In practice this table behaves somewhat like a dimension table but it is a fact table in its own right if used in conjunction with the state dimension.

![Conceptual Model](conceptual_model.jpg)

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model
* Currently the 'pipeline' is in prototype form and embedded within a Jupyter Notebook.
* A production version would create pipeline logic for periodic update of the Global City Temperature dataset as well as the I94 dataset as these grow on a daily basis.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

#### Create I94 Fact table by transforming and loading dataset from I94 staging table.

In [45]:
# DDL for the I94 fact table; it is populated further below with an INSERT ... SELECT from stg_i94
i94_f_create = ("""
CREATE TABLE IF NOT EXISTS i94_f(
    cicid BIGINT, 
    i94yr INT, 
    i94mon INT, 
    i94cit VARCHAR,    
    i94res VARCHAR,    
    i94port VARCHAR, 
    arrdate DATE,
    i94mode VARCHAR, 
    i94addr VARCHAR, 
    depdate DATE,
    i94age INT,        
    i94visa VARCHAR,    
    count INT, 
    dtadfile DATE,
    visapost VARCHAR, 
    occup VARCHAR, 
    entdepa VARCHAR, 
    entdepd VARCHAR, 
    entdepu VARCHAR, 
    matflag VARCHAR, 
    biryear INT, 
    dtaddto DATE, 
    gender VARCHAR, 
    insnum VARCHAR, 
    airline VARCHAR, 
    admnum VARCHAR, 
    fltno VARCHAR, 
    visatype VARCHAR)
""")

i94_f_drop = "DROP TABLE IF EXISTS i94_f CASCADE"

In [46]:
cur.execute(i94_f_drop)
cur.execute(i94_f_create)

In [47]:
%%sql
insert into i94_f 
select cicid, 
       i94yr, 
       i94mon,  
       replace(i94cit, '.0', '') as i94cit, 
       replace(i94res, '.0', '') as i94res,
       i94port, 
       to_date('19600101','YYYYMMDD') + cast(arrdate as int) as arrdate,
       replace(i94mode,'.0', '') as i94mode,
       i94addr, 
       to_date('19600101','YYYYMMDD') + cast(depdate as int) as depdate,
       i94bir as i94age,
       replace(i94visa,'.0', '') as i94visa, 
       count, 
       to_date(dtadfile, 'YYYYMMDD') as dtadfile,
       visapost,
       occup,
       entdepa,
       entdepd,
       entdepu,
       matflag,
       biryear,
       to_date(dtaddto, 'MMDDYYYY') as dtaddto,   -- assumes column has been cleansed in staging table via prior update
       gender, 
       insnum, 
       airline, 
       replace(admnum, '.0', '') as admnum,  -- must come before fltno to match the column order of i94_f
       fltno, 
       visatype
  from stg_i94
;

 * postgresql://student:***@127.0.0.1/capstone
470476 rows affected.


[]
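
Two of the transformations in the load above are easy to check outside the database. The sketch below assumes, as the SQL does, that `arrdate` and `depdate` are day counts from 1960-01-01 (the SAS date epoch) and that code columns arrive as text with a trailing `.0`. The helper names are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS day count to a calendar date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


def strip_float_suffix(code):
    """Mirror replace(col, '.0', ''): removes every '.0' substring, not only a trailing one."""
    if code is None:
        return None
    return code.replace(".0", "")


print(sas_to_date(20467))
print(strip_float_suffix("209.0"))
```

The day counts used here are the four `arrdate` values that appear in the staging table, so the results can be compared with the sanity-check queries that follow.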

###### Perform sanity check queries to confirm data has been loaded for the proper dates.

In [48]:
%sql select arrdate, count(*) from stg_i94 group by arrdate order by arrdate;

 * postgresql://student:***@127.0.0.1/capstone
4 rows affected.


arrdate,count
20467,92151
20558,107557
20649,140666
20741,130102


In [49]:
%sql select arrdate, count(*) from i94_f group by arrdate order by arrdate;

 * postgresql://student:***@127.0.0.1/capstone
4 rows affected.


arrdate,count
2016-01-14,92151
2016-04-14,107557
2016-07-14,140666
2016-10-14,130102


###### Create convenience view on top of I94 fact table. This view instantiates all the joins to associated dimension tables.

In [50]:
%%sql
drop view if exists i94_v;
create view i94_v as
select i.cicid, 
       i.i94yr, 
       i.i94mon, 
       i.i94cit, c.country_name as cit_country, 
       i.i94res, r.country_name as res_country,
       i.i94port, p.port_of_entry, 
       i.arrdate,
       i.i94mode, m.mode_name,
       i.i94addr, s.state_name, 
       i.depdate,
       i.i94age,
       i.i94visa, v.visa_name,
       i.dtadfile,
       i.visapost,
       i.occup,
       i.entdepa,
       i.entdepd,
       i.entdepu,
       i.matflag,
       i.biryear,
       i.dtaddto,  
       i.gender, 
       i.insnum, 
       i.airline, 
       i.fltno, 
       i.admnum,  
       i.visatype
  from i94_f i 
  left outer join i94_ports_entry_d p on i.i94port = p.port_code   -- never null so use inner join?
  left outer join i94_visas_d v on i.i94visa = v.visa_code         -- never null so use inner join?
  left outer join i94_modes_d m on i.i94mode = m.mode_code
  left outer join i94_countries_d c on i.i94cit = c.country_code
  left outer join i94_countries_d r on i.i94res = r.country_code
  left outer join i94_states_d s on i.i94addr = s.state_code
;

 * postgresql://student:***@127.0.0.1/capstone
Done.
Done.


[]

#### 4.2 Data Quality Checks
 
Run Quality Checks

In [51]:
# make sure the row count from the base table is equivalent to the row count from the view
%sql select count(*) from i94_f union all select count(*) from i94_v;

 * postgresql://student:***@127.0.0.1/capstone
2 rows affected.


count
470476
470476
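
The counts match because each left outer join keeps every fact row exactly once, provided the dimension key is unique. A duplicated dimension key would inflate the view's row count. The sketch below illustrates that reasoning with made-up rows; both function names are hypothetical.

```python
from collections import Counter


def duplicate_keys(dimension_rows):
    """Return dimension codes that appear more than once."""
    counts = Counter(code for code, _name in dimension_rows)
    return sorted(code for code, n in counts.items() if n > 1)


def left_join_row_count(fact_keys, dimension_rows):
    """Number of rows a LEFT OUTER JOIN on the key would produce."""
    counts = Counter(code for code, _name in dimension_rows)
    total = 0
    for key in fact_keys:
        matches = counts[key] if key is not None else 0  # NULL never matches
        total += max(matches, 1)  # unmatched fact rows are still kept once
    return total


fact_keys = ["HI", "FL", "FL", None, "ZZ"]
clean_dim = [("HI", "HAWAII"), ("FL", "FLORIDA")]
dirty_dim = clean_dim + [("FL", "FLORIDA")]  # accidental duplicate

print(left_join_row_count(fact_keys, clean_dim), left_join_row_count(fact_keys, dirty_dim))
print(duplicate_keys(dirty_dim))
```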


##### The following queries demonstrate a base query against i94_f and then a similar query which joins to the i94_state_month_temp_f table.
- Note that the two queries do not match because the city temperature dataset is missing measures for Hawaii!!

In [52]:
%sql select round(avg(avg_temp),3) avg_temp, state_code from i94_state_month_temp_f where state_code in ('HI', 'FL') group by state_code;

 * postgresql://student:***@127.0.0.1/capstone
1 rows affected.


avg_temp,state_code
23.264,FL


In [53]:
%%sql 
select count(*), 
       i94yr, i94mon,
       i94port, port_of_entry, i94addr, state_name,
       i94visa, visa_name, i94mode, mode_name, 
       i94cit, cit_country, i94res, res_country
  from i94_v i 
 where 1=1
 group by i94yr, i94mon, 
          i94port, port_of_entry, i94addr, state_name,
          i94visa, visa_name, i94mode, mode_name, 
          i94cit, cit_country, i94res, res_country
 order by 1 desc
 limit 10;

 * postgresql://student:***@127.0.0.1/capstone
10 rows affected.


count,i94yr,i94mon,i94port,port_of_entry,i94addr,state_name,i94visa,visa_name,i94mode,mode_name,i94cit,cit_country,i94res,res_country
4370,2016,10,HHW,"HONOLULU, HI",HI,HAWAII,2,Pleasure,1,Air,209,JAPAN,209,JAPAN
4108,2016,7,HHW,"HONOLULU, HI",HI,HAWAII,2,Pleasure,1,Air,209,JAPAN,209,JAPAN
3620,2016,4,HHW,"HONOLULU, HI",HI,HAWAII,2,Pleasure,1,Air,209,JAPAN,209,JAPAN
3519,2016,1,HHW,"HONOLULU, HI",HI,HAWAII,2,Pleasure,1,Air,209,JAPAN,209,JAPAN
3205,2016,10,ORL,"ORLANDO, FL",FL,FLORIDA,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
2489,2016,1,MIA,"MIAMI, FL",FL,FLORIDA,2,Pleasure,1,Air,689,BRAZIL,689,BRAZIL
2443,2016,10,NYC,"NEW YORK, NY",NY,NEW YORK,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
2201,2016,7,LOS,"LOS ANGELES, CA",CA,CALIFORNIA,2,Pleasure,1,Air,245,"CHINA, PRC",245,"CHINA, PRC"
1984,2016,4,NYC,"NEW YORK, NY",NY,NEW YORK,2,Pleasure,1,Air,111,FRANCE,111,FRANCE
1926,2016,7,MIA,"MIAMI, FL",FL,FLORIDA,2,Pleasure,1,Air,687,ARGENTINA,687,ARGENTINA


In [54]:
%%sql 
select count(*), 
       i94yr, i94mon, t.avg_temp,
       i94port, port_of_entry, i94addr, state_name,
       i94visa, visa_name, i94mode, mode_name, 
       i94cit, cit_country, i94res, res_country
  from i94_v i
  join i94_state_month_temp_f t on i94yr=measure_yr and i94mon=measure_mon and i94addr=state_code  
 where 1=1
 group by i94yr, i94mon, avg_temp,
          i94port, port_of_entry, i94addr, state_name,
          i94visa, visa_name, i94mode, mode_name, 
          i94cit, cit_country, i94res, res_country
 order by 1 desc
 limit 10;

 * postgresql://student:***@127.0.0.1/capstone
10 rows affected.


count,i94yr,i94mon,avg_temp,i94port,port_of_entry,i94addr,state_name,i94visa,visa_name,i94mode,mode_name,i94cit,cit_country,i94res,res_country
3205,2016,10,24.356,ORL,"ORLANDO, FL",FL,FLORIDA,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
2489,2016,1,16.826,MIA,"MIAMI, FL",FL,FLORIDA,2,Pleasure,1,Air,689,BRAZIL,689,BRAZIL
2443,2016,10,11.48,NYC,"NEW YORK, NY",NY,NEW YORK,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
2201,2016,7,22.068,LOS,"LOS ANGELES, CA",CA,CALIFORNIA,2,Pleasure,1,Air,245,"CHINA, PRC",245,"CHINA, PRC"
1984,2016,4,8.162,NYC,"NEW YORK, NY",NY,NEW YORK,2,Pleasure,1,Air,111,FRANCE,111,FRANCE
1926,2016,7,28.175,MIA,"MIAMI, FL",FL,FLORIDA,2,Pleasure,1,Air,687,ARGENTINA,687,ARGENTINA
1857,2016,7,28.175,ORL,"ORLANDO, FL",FL,FLORIDA,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
1715,2016,10,24.356,MIA,"MIAMI, FL",FL,FLORIDA,2,Pleasure,1,Air,687,ARGENTINA,687,ARGENTINA
1671,2016,4,8.162,NYC,"NEW YORK, NY",NY,NEW YORK,2,Pleasure,1,Air,135,UNITED KINGDOM,135,UNITED KINGDOM
1646,2016,1,16.826,ORL,"ORLANDO, FL",FL,FLORIDA,2,Pleasure,1,Air,689,BRAZIL,689,BRAZIL
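
The Hawaii rows disappear because the join to the temperature table is an inner join, so fact rows with no matching state/month temperature are dropped. A simple anti-join check, sketched here in Python with a few rows copied from the results above, shows how many arrivals are lost per state. The function name is hypothetical.

```python
def arrivals_without_temperature(fact_rows, temp_rows):
    """Sum arrivals per state for (year, month, state) keys with no temperature row."""
    temp_keys = {(yr, mon, state) for yr, mon, state, _temp in temp_rows}
    missing = {}
    for yr, mon, state, arrivals in fact_rows:
        if (yr, mon, state) not in temp_keys:
            missing[state] = missing.get(state, 0) + arrivals
    return missing


# (year, month, state, arrivals), a subset of the base query output
fact_rows = [
    (2016, 10, "HI", 4370),
    (2016, 7, "HI", 4108),
    (2016, 10, "FL", 3205),
    (2016, 10, "NY", 2443),
]
# (year, month, state, avg_temp), a subset of the joined query output
temp_rows = [
    (2016, 10, "FL", 24.356),
    (2016, 10, "NY", 11.48),
]

dropped = arrivals_without_temperature(fact_rows, temp_rows)
print(dropped)
```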


##### Sample queries simulating analysis of tourism patterns related to months/weather...
* First enable the tablefunc extension, which provides PostgreSQL's rudimentary pivot table feature (crosstab).

In [55]:
%%sql 
drop extension if exists tablefunc;
create extension tablefunc;

 * postgresql://student:***@127.0.0.1/capstone
Done.
Done.


[]

In [56]:
%%sql 
drop view if exists captain_obvious;
create view captain_obvious as
select count(*)::int as tourists, 
       (case i94mon when 1 then 'JAN' when 4 then 'APR' when 7 then 'JUL' when 10 then 'OCT' end)::text as mon,
       i94yr, i94mon, 
       round(avg(t.avg_temp),1) avg_temp,
       i94addr, state_name::text
  from i94_v i
  join i94_state_month_temp_f t on i94yr=measure_yr and i94mon=measure_mon and i94addr=state_code  
 where i94visa='2'
   and i94addr in ('AK', 'CA', 'FL', 'NY', 'TX', 'AZ', 'CO', 'LA')
 group by i94yr, i94mon, 
          i94addr, state_name
 order by i94addr, i94yr, i94mon
;

 * postgresql://student:***@127.0.0.1/capstone
Done.
Done.


[]

In [57]:
%sql select * from captain_obvious limit 12;

 * postgresql://student:***@127.0.0.1/capstone
12 rows affected.


tourists,mon,i94yr,i94mon,avg_temp,i94addr,state_name
30,JAN,2016,1,-22.6,AK,ALASKA
38,APR,2016,4,0.4,AK,ALASKA
425,JUL,2016,7,10.7,AK,ALASKA
52,OCT,2016,10,-3.6,AK,ALASKA
397,JAN,2016,1,12.2,AZ,ARIZONA
507,APR,2016,4,20.7,AZ,ARIZONA
700,JUL,2016,7,31.2,AZ,ARIZONA
597,OCT,2016,10,22.5,AZ,ARIZONA
11305,JAN,2016,1,11.5,CA,CALIFORNIA
15038,APR,2016,4,14.7,CA,CALIFORNIA


In [58]:
%%sql
select * 
from crosstab (
'SELECT state_name, mon, avg_temp
   FROM captain_obvious
  ORDER BY state_name, i94mon'
) as result (state_name TEXT, JAN numeric, APR numeric, JUL numeric, OCT numeric)
;

 * postgresql://student:***@127.0.0.1/capstone
8 rows affected.


state_name,jan,apr,jul,oct
ALASKA,-22.6,0.4,10.7,-3.6
ARIZONA,12.2,20.7,31.2,22.5
CALIFORNIA,11.5,14.7,22.1,18.5
COLORADO,-0.1,10.1,23.3,8.2
FLORIDA,16.8,22.5,28.2,24.4
LOUISIANA,14.1,21.6,28.4,20.1
NEW YORK,-1.2,8.2,23.7,11.5
TEXAS,10.6,21.2,29.5,19.3


In [59]:
%%sql
select * 
from crosstab (
'SELECT state_name, mon, tourists
   FROM captain_obvious
  ORDER BY state_name, i94mon'
) as result (state_name TEXT, JAN INT, APR INT, JUL INT, OCT INT)
;

 * postgresql://student:***@127.0.0.1/capstone
8 rows affected.


state_name,jan,apr,jul,oct
ALASKA,30,38,425,52
ARIZONA,397,507,700,597
CALIFORNIA,11305,15038,24114,18290
COLORADO,662,395,1055,561
FLORIDA,21359,21423,26741,30558
LOUISIANA,286,667,600,463
NEW YORK,10510,19185,23335,24360
TEXAS,3007,3631,5116,4318
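
One caveat with the single-argument form of `crosstab`: it fills the output columns left to right in the order the rows arrive, so a state with a missing month would have its values shifted into the wrong column. That does not happen here because every state has all four months. The Python sketch below shows a pivot keyed explicitly on the month label, which leaves a gap instead of shifting. The function name is hypothetical, and the sample rows are copied from `captain_obvious`.

```python
MONTHS = ("JAN", "APR", "JUL", "OCT")


def pivot_by_month(rows, months=MONTHS):
    """rows: iterable of (state_name, mon, value); returns {state: {mon: value}}."""
    table = {}
    for state_name, mon, value in rows:
        # Every state starts with all months set to None, so gaps stay visible.
        table.setdefault(state_name, dict.fromkeys(months))[mon] = value
    return table


rows = [
    ("ALASKA", "JAN", 30), ("ALASKA", "APR", 38),
    ("ALASKA", "JUL", 425), ("ALASKA", "OCT", 52),
    ("ARIZONA", "JAN", 397), ("ARIZONA", "JUL", 700),  # APR and OCT omitted on purpose
]
pivot = pivot_by_month(rows)
print(pivot["ARIZONA"])
```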


##### From the above we arrive at this earth-shattering analysis:
1. More people visit Alaska, Arizona, and California in the summer.
2. That is true of Colorado as well, but it also draws a relatively large number of tourists in winter (skiing).
3. Florida has a high season in the fall (snowbirds).
4. Louisiana has a high season in spring (Mardi Gras).
5. New York is popular in the summer and fall.
6. Texas? Meh.
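
The peak month per state can be read off mechanically from the tourist counts in the crosstab above. The short sketch below does that with the figures copied from that table; `peak_month` is a hypothetical helper. Keep in mind that this run loaded only one arrival date per month, so the pattern is a thin sample rather than a full seasonal picture.

```python
TOURISTS = {
    "ALASKA": {"JAN": 30, "APR": 38, "JUL": 425, "OCT": 52},
    "ARIZONA": {"JAN": 397, "APR": 507, "JUL": 700, "OCT": 597},
    "CALIFORNIA": {"JAN": 11305, "APR": 15038, "JUL": 24114, "OCT": 18290},
    "COLORADO": {"JAN": 662, "APR": 395, "JUL": 1055, "OCT": 561},
    "FLORIDA": {"JAN": 21359, "APR": 21423, "JUL": 26741, "OCT": 30558},
    "LOUISIANA": {"JAN": 286, "APR": 667, "JUL": 600, "OCT": 463},
    "NEW YORK": {"JAN": 10510, "APR": 19185, "JUL": 23335, "OCT": 24360},
    "TEXAS": {"JAN": 3007, "APR": 3631, "JUL": 5116, "OCT": 4318},
}


def peak_month(counts):
    """Return the month label with the highest tourist count."""
    return max(counts, key=counts.get)


peaks = {state: peak_month(counts) for state, counts in TOURISTS.items()}
for state, month in peaks.items():
    print(f"{state}: {month}")
```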

#### 4.3 Data dictionary 

##### i94_f - I94 Fact table
* I94YR - 4 digit year 
* I94MON - Numeric month 
* I94CIT - Country of Citizenship  
* I94RES - Country of Residence 
* I94PORT - Port of entry code (joins to i94_ports_entry_d.port_code). The source code list includes both valid and invalid codes.
* ARRDATE - the Arrival Date in the USA. 
* I94MODE - Mode of arrival: 1 = 'Air'; 2 = 'Sea'; 3 = 'Land'; 9 = 'Not reported' 
* I94ADDR - Represents the ultimate destination US state of the traveller. There are a lot of invalid codes in the data.
* DEPDATE - the Departure Date from the USA. It is a SAS numeric date in the source data and is converted to a DATE in i94_f.
* I94AGE - Age of Respondent in Years (source column I94BIR)
* I94VISA - Visa codes collapsed into three categories: 1 = Business; 2 = Pleasure; 3 = Student 
* COUNT - Used for summary statistics 
* DTADFILE - Character Date Field - Date added to I-94 Files - CIC does not use 
* VISAPOST - Department of State where Visa was issued - CIC does not use 
* OCCUP - Occupation that will be performed in U.S. - CIC does not use 
* ENTDEPA - Arrival Flag - admitted or paroled into the U.S. - CIC does not use 
* ENTDEPD - Departure Flag - Departed, lost I-94 or is deceased - CIC does not use 
* ENTDEPU - Update Flag - Either apprehended, overstayed, adjusted to perm residence - CIC does not use 
* MATFLAG - Match flag - Match of arrival and departure records 
* BIRYEAR - 4 digit year of birth 
* DTADDTO - Character Date Field - Date to which admitted to U.S. (allowed to stay until) - CIC does not use 
* GENDER - Non-immigrant sex 
* INSNUM - INS number 
* AIRLINE - Airline used to arrive in U.S. 
* ADMNUM - Admission Number 
* FLTNO - Flight number of Airline used to arrive in U.S. 
* VISATYPE - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.

##### i94_states_d - States Dimension table
* state_code - state code
* state_name - state name

##### i94_countries_d - Countries Dimension table
* country_code - country code
* country_name - country name

##### i94_ports_entry_d - Ports of Entry Dimension Table
* port_code - port of entry code
* port_of_entry - port of entry name
* state_code - state code

##### i94_modes_d - I94 Modes Dimension table
* mode_code - mode code 
* mode_name - mode name (i.e. Air, Sea, Land)

##### i94_visas_d - I94 Visas Dimension table
* visa_code - visa code 
* visa_name - visa name (i.e. Business, Student, Pleasure)

##### i94_state_month_temp_f - State Month Temperature fact table (correlated fact)
* state_code - state code
* measure_dt - measurement date
* measure_yr - measurement year
* measure_mon - measurement month
* avg_temp - average temperature


### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
  * As described earlier, the primary tool PostgreSQL is a stand-in for Redshift.
* Propose how often the data should be updated and why.
  * This depends on what hypothetical end users would require. However it seems likely that both the I94 and Global City Temperature data should be updated consistently on a daily to weekly basis. The other datasets are primarily static in nature although there might be some occasional updates.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
    * The database would need to be upgraded to a tool that can scale to that degree, such as Redshift, Cassandra, or some combination of cloud-based Spark with Parquet files and/or a Hive-based relational repository.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * A scheduling/orchestration tool such as Airflow could be integrated into the data pipeline for loading I94 and/or City Temperature data.
  * The database needed to be accessed by 100+ people.
    * The same upgrade mentioned above for the data increase could help maintain performance under higher query traffic.
    * Most of the cloud-based tools can benefit from almost linear scaling by increasing the number and power of server cluster nodes.