# Study of Immigration Data in the United States
### Data Engineering Capstone Project

#### Project Summary

This is the capstone project for the Udacity Data Engineering Nanodegree program. The idea is to take multiple disparate data sources, clean the data, and process it through an ETL pipeline to produce a usable data set for analytics.

We will be looking at the immigration data for the U.S. ports (I94 immigration data). The plan is to enrich this data using the other data sources suggested, build an ETL pipeline to process the raw data and create a data warehouse which can be used for analytics. 


The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Import Libraries

In [51]:
# Do all imports and installs here
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import configparser
import datetime
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql import SQLContext
from pyspark.sql.functions import isnan, when, count, col, udf, dayofmonth, dayofweek, month, year, weekofyear
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import *
#import chart_studio.plotly as py
import psycopg2
from pyspark.sql import functions as F

#### Load Configuration Data

#### Create a Spark Session

In [40]:
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()

In [41]:
spark

## Step 1: Scope the Project and Gather Data

### Project Scope
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

In this project, we gather, assess, and clean the data, design a data model, and load the data into a data warehouse, where it can be analyzed to produce informative reports.

At a high level:

- Data is extracted from the immigration SAS data, partitioned by year, month, and day, and stored in a data lake on Amazon S3 as Parquet files.
- The partitioned data is loaded into Redshift into staging tables
- Design fact and dimension tables
- The staging data is combined with other staged data sources to produce the final fact and dimension records in the Redshift warehouse.
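
The partitioning in the first step can be pictured as a directory layout. The sketch below builds the Hive-style `column=value` prefix that Spark's `DataFrameWriter.partitionBy(...)` produces when writing Parquet. The bucket name `example-bucket`, the `immigration` prefix and the helper `partition_prefix` are placeholders for illustration, not the project's actual locations or code.

```python
import datetime


def partition_prefix(base, arrival_date):
    """Build a Hive-style partition prefix (year/month/day) for one arrival date."""
    return (
        f"{base.rstrip('/')}/"
        f"arrival_year={arrival_date.year}/"
        f"arrival_month={arrival_date.month}/"
        f"arrival_day={arrival_date.day}/"
    )


# Placeholder bucket and prefix, used for illustration only.
print(partition_prefix("s3a://example-bucket/immigration", datetime.date(2016, 4, 1)))
```

Partitioning by date keeps each load small and lets a downstream copy into the staging tables target a single day or month.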

Example questions we could explore with the final data set:

+ For a given port city, how many immigrants enter from which countries?
+ Is there any relationship between the average temperature of the country of origin and average temperature of the port city of entry?
+ Is there any relationship between the volume of travel and the number of entry ports (i.e., airports)?
+ What effect does temperature have on the volume of travellers?
+ What time of year or month sees more immigration for certain areas?
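
As a rough illustration of the first question, the sketch below counts arrivals per country of residence for one port. In the warehouse this would more likely be a `GROUP BY` query; the plain-Python version only shows the shape of the answer. The helper name `arrivals_by_country` is hypothetical, and the five `(port_of_admission, country_of_res)` pairs are the sample rows displayed in Step 2.

```python
from collections import Counter

# (port_of_admission, country_of_res) pairs from the five sample rows in Step 2.
sample_rows = [
    ("XXX", 692),
    ("ATL", 276),
    ("WAS", 101),
    ("NYC", 101),
    ("NYC", 101),
]


def arrivals_by_country(rows, port):
    """Count arrivals per country code for a single port of admission."""
    return Counter(country for p, country in rows if p == port)


print(arrivals_by_country(sample_rows, "NYC"))
```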

#### Technical Overview
The project uses the following technologies:

+ AWS S3 Storage : To store inputs & outputs
+ AWS Redshift as Data Warehouse for Analytics
+ Jupyter Notebooks
+ Python
+ Spark
+ Libraries : Pandas & Pyspark

### Project Datasets

**I94 Immigration Data (immigration)** 

The I94 data contains information about visitors to the US, collected via the I-94 form that all visitors must complete. It comes from the US National Travel and Tourism Office and includes details on incoming immigrants and their ports of entry. - [source](https://www.trade.gov/national-travel-and-tourism-office). 

It is provided in SAS7BDAT format which is a binary database storage format. It is created by Statistical Analysis System (SAS) software to store data.

The immigration data is partitioned into monthly SAS files. Each file is around 300 to 700 MB. The data provided represents 12 months of data for the year 2016. This is the bulk of the data used in the project.

A data dictionary ```I94_SAS_Labels_Descriptions.SAS``` is provided for the immigration data. In addition to descriptions of the various fields, the port and country codes used were listed in table format. Two csv files are extracted from the dictionary: 

* ```i94_countries.csv```: Table containing country codes used in the dataset.
* ```i94_ports.csv```: Table containing city codes used in the dataset.

These files will be used as a lookup when extracting the immigration data.

**World Temperature data (temperature)** 

This dataset comes from Kaggle. - [source](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

It contains average temperature data for countries and cities around the world between 1743-11-01 and 2013-09-01.

The data is stored in ```GlobalLandTemperaturesByCity.csv``` (508 MB).

**U.S. City Demographic Data (demographics)**

This data comes from OpenDataSoft and contains demographic data for U.S. cities and states, such as age, population, veteran status and race. - [source](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)

This data is placed in the data lake as a single CSV file (246 KB). It will be combined with port city data to provide ancillary demographic info for port cities.

**Airport Code Table (airports)**

This is a simple table of airport codes and corresponding cities. It comes from datahub.io. - [source](https://datahub.io/core/airport-codes#data)

This data is a single CSV file (5.8 KB). It provides additional information for airports and can be combined with the immigration port city info.


Below is a table of datasets used in the project:
<table>
<thead>
<tr>
<th>Source name</th>
<th>Filename</th>
<th>Format</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>I94 Immigration Sample Data</td>
<td>immigration_data_sample.csv</td>
<td>csv</td>
<td>This is a sample of the data from the US National Travel and Tourism Office.</td>
</tr>
<tr>
<td><a href="https://travel.trade.gov/research/reports/i94/historical/2016.html">I94 Immigration Data</a></td>
<td>data/18-83510-I94-Data-2016/i94_***16_sub.sas7bdat</td>
<td>SAS</td>
<td>This data comes from the US National Travel and Tourism Office.</td>
</tr>
<tr>
<td><a href="https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data">World Temperature Data</a></td>
<td>world_temperature.csv</td>
<td>csv</td>
<td>This dataset contains temperature data for various cities from the 1700s to 2013. This dataset came from Kaggle.</td>
</tr>
<tr>
<td><a href="https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/">U.S. City Demographic Data</a></td>
<td>us-cities-demographics.csv</td>
<td>csv</td>
<td>This dataset contains population details of all US cities and census-designated places, including gender &amp; race information. This data came from OpenDataSoft.</td>
</tr>
<tr>
<td><a href="https://datahub.io/core/airport-codes#data">Airport Codes</a></td>
<td>airport-codes_csv.csv</td>
<td>csv</td>
<td>This is a simple table of airport codes and corresponding cities.</td>
</tr>
<tr>
<td>I94_country</td>
<td>i94_countries.csv</td>
<td>csv</td>
<td>Shows corresponding i94 Country of Citizenship &amp; Country of Residence codes. Source : I94_SAS_Labels_Descriptions.SAS</td>
</tr>
<tr>
<td>I94_port</td>
<td>i94_ports.csv</td>
<td>csv</td>
<td>Shows US Port of Entry city names and their corresponding codes. Source : I94_SAS_Labels_Descriptions.SAS</td>
</tr>
</tbody>
</table>

## Step 2: Explore and Assess the Data

### 1. Immigration Data

In [4]:
# Read in the data as pd df here. Pay attention to encoding method, which is needed by read_sas.
#immigration_data_path = '..data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
#df_immigration = pd.read_sas(immigration_data_path, 'sas7bdat', encoding="ISO-8859-1")

In [5]:
# immigration_data_path = '../sas_data1/'
# df_immigration = pd.read_parquet(immigration_data_path,  'pyarrow')

In [42]:
immigration_data_path = '../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df_immigration_spark = spark.read.format('com.github.saurfang.sas.spark').load(immigration_data_path)

In [52]:
def convert_sas_date(days):
    """
    Converts SAS date stored as days since 1/1/1960 to datetime
    :param days: Days since 1/1/1960
    :return: datetime
    """
    if days is None:
        return None
    return datetime.date(1960, 1, 1) + datetime.timedelta(days=days)


def get_sas_day(days):
    """
    Converts SAS date stored as days since 1/1/1960 to day of month
    :param days: Days since 1/1/1960
    :return: Day of month value as integer
    """
    if days is None:
        return None
    return (datetime.date(1960, 1, 1) + datetime.timedelta(days=days)).day


def convert_i94mode(mode):
    """
    Converts i94 travel mode code to a description
    :param mode: int i94 mode as integer
    :return: i94 mode description
    """
    if mode == 1:
        return "Air"
    elif mode == 2:
        return "Sea"
    elif mode == 3:
        return "Land"
    else:
        return "Not Reported"


def convert_visa(visa):
    """
    Converts visa numeric code to description
    :param visa: numeric i94 visa code (1, 2 or 3)
    :return: Visa description: str
    """
    if visa is None:
        return "Not Reported"
    elif visa == 1:
        return "Business"
    elif visa == 2:
        return "Pleasure"
    elif visa == 3:
        return "Student"
    else:
        return "Not Reported"
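
As a quick sanity check of the SAS date logic, the sketch below converts two `arrdate` values from the sample rows by hand. `sas_days_to_date` is a hypothetical stand-alone mirror of `convert_sas_date`; it casts to `int` because Spark hands the column over as a double.

```python
import datetime

SAS_EPOCH = datetime.date(1960, 1, 1)  # SAS dates count days from this day


def sas_days_to_date(days):
    """Convert a SAS day count (days since 1960-01-01) to a date, passing None through."""
    if days is None:
        return None
    return SAS_EPOCH + datetime.timedelta(days=int(days))


print(sas_days_to_date(20545.0))  # 2016-04-01
print(sas_days_to_date(20573.0))  # 2016-04-29
```

Both results agree with the `arrival_day` values (1 and 29) that appear for those rows in the transformed data.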

In [53]:
convert_i94mode_udf = F.udf(convert_i94mode, StringType())
convert_sas_date_udf = F.udf(convert_sas_date, DateType())
convert_visa_udf = F.udf(convert_visa, StringType())
get_sas_day_udf = F.udf(get_sas_day, IntegerType())

In [54]:
df_immigration_spark.dtypes

[('cicid', 'double'),
 ('i94yr', 'double'),
 ('i94mon', 'double'),
 ('i94cit', 'double'),
 ('i94res', 'double'),
 ('i94port', 'string'),
 ('arrdate', 'double'),
 ('i94mode', 'double'),
 ('i94addr', 'string'),
 ('depdate', 'double'),
 ('i94bir', 'double'),
 ('i94visa', 'double'),
 ('count', 'double'),
 ('dtadfile', 'string'),
 ('visapost', 'string'),
 ('occup', 'string'),
 ('entdepa', 'string'),
 ('entdepd', 'string'),
 ('entdepu', 'string'),
 ('matflag', 'string'),
 ('biryear', 'double'),
 ('dtaddto', 'string'),
 ('gender', 'string'),
 ('insnum', 'string'),
 ('airline', 'string'),
 ('admnum', 'double'),
 ('fltno', 'string'),
 ('visatype', 'string'),
 ('arrival_date', 'date'),
 ('departure_date', 'date'),
 ('arrival_year', 'int'),
 ('arrival_month', 'int'),
 ('arrival_day', 'int'),
 ('age', 'int'),
 ('country_of_bir', 'int'),
 ('country_of_res', 'int'),
 ('port_of_admission', 'string'),
 ('birth_year', 'int'),
 ('mode', 'string'),
 ('visa_category', 'string')]

In [55]:
df_immigration_spark = df_immigration_spark\
        .withColumn('arrival_date', convert_sas_date_udf(df_immigration_spark['arrdate'])) \
        .withColumn('departure_date', convert_sas_date_udf(df_immigration_spark['depdate'])) \
        .withColumn('arrival_year', df_immigration_spark['i94yr'].cast(IntegerType())) \
        .withColumn('arrival_month', df_immigration_spark['i94mon'].cast(IntegerType())) \
        .withColumn('arrival_day', get_sas_day_udf(df_immigration_spark['arrdate'])) \
        .withColumn('age', df_immigration_spark['i94bir'].cast(IntegerType())) \
        .withColumn('country_of_bir', df_immigration_spark['i94cit'].cast(IntegerType())) \
        .withColumn('country_of_res', df_immigration_spark['i94res'].cast(IntegerType())) \
        .withColumn('port_of_admission', df_immigration_spark['i94port'].cast(StringType())) \
        .withColumn('birth_year', df_immigration_spark['biryear'].cast(IntegerType())) \
        .withColumn('mode', convert_i94mode_udf(df_immigration_spark['i94mode'])) \
        .withColumn('visa_category', convert_visa_udf(df_immigration_spark['i94visa']))

In [56]:
df_immigration_spark.dtypes

[('cicid', 'double'),
 ('i94yr', 'double'),
 ('i94mon', 'double'),
 ('i94cit', 'double'),
 ('i94res', 'double'),
 ('i94port', 'string'),
 ('arrdate', 'double'),
 ('i94mode', 'double'),
 ('i94addr', 'string'),
 ('depdate', 'double'),
 ('i94bir', 'double'),
 ('i94visa', 'double'),
 ('count', 'double'),
 ('dtadfile', 'string'),
 ('visapost', 'string'),
 ('occup', 'string'),
 ('entdepa', 'string'),
 ('entdepd', 'string'),
 ('entdepu', 'string'),
 ('matflag', 'string'),
 ('biryear', 'double'),
 ('dtaddto', 'string'),
 ('gender', 'string'),
 ('insnum', 'string'),
 ('airline', 'string'),
 ('admnum', 'double'),
 ('fltno', 'string'),
 ('visatype', 'string'),
 ('arrival_date', 'date'),
 ('departure_date', 'date'),
 ('arrival_year', 'int'),
 ('arrival_month', 'int'),
 ('arrival_day', 'int'),
 ('age', 'int'),
 ('country_of_bir', 'int'),
 ('country_of_res', 'int'),
 ('port_of_admission', 'string'),
 ('birth_year', 'int'),
 ('mode', 'string'),
 ('visa_category', 'string')]

In [60]:
df_immigration_spark.limit(5).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,arrival_year,arrival_month,arrival_day,age,country_of_bir,country_of_res,port_of_admission,birth_year,mode,visa_category
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,2016,4,29,37,692,692,XXX,1979,Not Reported,Pleasure
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,2016,4,7,25,254,276,ATL,1991,Air,Student
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,2016,4,1,55,101,101,WAS,1961,Air,Pleasure
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,2016,4,1,28,101,101,NYC,1988,Air,Pleasure
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,2016,4,1,4,101,101,NYC,2012,Air,Pleasure


In [71]:
immigration_table = df_immigration_spark.select(
        col('cicid').alias('id'),
        'gender',
        'arrival_date',
        'arrival_year',
        'arrival_month',
        'arrival_day',
        'age',
        'country_of_bir',
        'country_of_res',
        'port_of_admission',
        'birth_year',
        'mode',
        'visa_category',
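        'visatype')\
        .dropDuplicates()
# show() returns None, so call it separately to keep the DataFrame in immigration_table
immigration_table.show()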

+--------+------+------------+------------+-------------+-----------+---+--------------+--------------+-----------------+----------+----+-------------+--------+
|      id|gender|arrival_date|arrival_year|arrival_month|arrival_day|age|country_of_bir|country_of_res|port_of_admission|birth_year|mode|visa_category|visatype|
+--------+------+------------+------------+-------------+-----------+---+--------------+--------------+-----------------+----------+----+-------------+--------+
|155962.0|     F|  2016-04-01|        2016|            4|          1| 23|           582|           582|              NEW|      1993| Air|     Business|      B1|
| 28761.0|     M|  2016-04-01|        2016|            4|          1| 26|           135|           135|              BOS|      1990| Air|     Business|      B1|
|206411.0|     F|  2016-04-01|        2016|            4|          1| 26|           689|           689|              ATL|      1990| Air|     Business|      B1|
|206150.0|  null|  2016-04-01|    

In [6]:
df_immigration.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


####  Immigration Data dictionary: 

<table>
<thead>
<tr>
<th>Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr><td>cicid</td><td>Unique record ID</td></tr>
<tr><td>i94yr</td><td>4 digit year</td></tr>
<tr><td>i94mon</td><td>Numeric month</td></tr>
<tr><td>i94cit</td><td>3 digit code for immigrant country of citizenship</td></tr>
<tr><td>i94res</td><td>3 digit code for immigrant country of residence</td></tr>
<tr><td>i94port</td><td>Port of admission</td></tr>
<tr><td>arrdate</td><td>Arrival Date in the USA</td></tr>
<tr><td>i94mode</td><td>Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)</td></tr>
<tr><td>i94addr</td><td>USA State of arrival</td></tr>
<tr><td>depdate</td><td>Departure Date from the USA</td></tr>
<tr><td>i94bir</td><td>Age of Respondent in Years</td></tr>
<tr><td>i94visa</td><td>Visa codes collapsed into three categories</td></tr>
<tr><td>count</td><td>Field used for summary statistics</td></tr>
<tr><td>dtadfile</td><td>Character Date Field - Date added to I-94 Files</td></tr>
<tr><td>visapost</td><td>Department of State where Visa was issued</td></tr>
<tr><td>occup</td><td>Occupation that will be performed in U.S.</td></tr>
<tr><td>entdepa</td><td>Arrival Flag - admitted or paroled into the U.S.</td></tr>
<tr><td>entdepd</td><td>Departure Flag - Departed, lost I-94 or is deceased</td></tr>
<tr><td>entdepu</td><td>Update Flag - Either apprehended, overstayed, adjusted to perm residence</td></tr>
<tr><td>matflag</td><td>Match flag - Match of arrival and departure records</td></tr>
<tr><td>biryear</td><td>4 digit year of birth</td></tr>
<tr><td>dtaddto</td><td>Character Date Field - Date to which admitted to U.S. (allowed to stay until)</td></tr>
<tr><td>gender</td><td>Non-immigrant sex</td></tr>
<tr><td>insnum</td><td>INS number</td></tr>
<tr><td>airline</td><td>Airline used to arrive in U.S.</td></tr>
<tr><td>admnum</td><td>Admission Number</td></tr>
<tr><td>fltno</td><td>Flight number of Airline used to arrive in U.S.</td></tr>
<tr><td>visatype</td><td>Class of admission legally admitting the non-immigrant to temporarily stay in U.S.</td></tr>
</tbody>
</table>

In [7]:
df_immigration.shape

(3096313, 28)

In [8]:
df_immigration.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3096313 entries, 0 to 3096312
Data columns (total 28 columns):
 #   Column    Dtype  
---  ------    -----  
 0   cicid     float64
 1   i94yr     float64
 2   i94mon    float64
 3   i94cit    float64
 4   i94res    float64
 5   i94port   object 
 6   arrdate   float64
 7   i94mode   float64
 8   i94addr   object 
 9   depdate   float64
 10  i94bir    float64
 11  i94visa   float64
 12  count     float64
 13  dtadfile  object 
 14  visapost  object 
 15  occup     object 
 16  entdepa   object 
 17  entdepd   object 
 18  entdepu   object 
 19  matflag   object 
 20  biryear   float64
 21  dtaddto   object 
 22  gender    object 
 23  insnum    object 
 24  airline   object 
 25  admnum    float64
 26  fltno     object 
 27  visatype  object 
dtypes: float64(13), object(15)
memory usage: 661.4+ MB


In [9]:
# Without encoding method 
df_immigration.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


Read all the sas7bdat files, which are separated by month

In [11]:
df_immigration.isnull().sum()

cicid             0
i94yr             0
i94mon            0
i94cit            0
i94res            0
i94port           0
arrdate           0
i94mode         239
i94addr      152592
depdate      142457
i94bir          802
i94visa           0
count             0
dtadfile          1
visapost    1881250
occup       3088187
entdepa         238
entdepd      138429
entdepu     3095921
matflag      138429
biryear         802
dtaddto         477
gender       414269
insnum      2982605
airline       83627
admnum            0
fltno         19549
visatype          0
dtype: int64
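
To make the "over 90% missing" criterion concrete, the sketch below turns a few of the null counts above into fractions of the 3,096,313 rows. The helper `mostly_missing` and the 0.9 threshold parameter are illustrative names, not part of the notebook.

```python
TOTAL_ROWS = 3096313  # from df_immigration.shape

# Null counts copied from the isnull().sum() output for a few columns.
null_counts = {
    "visapost": 1881250,
    "occup": 3088187,
    "entdepu": 3095921,
    "gender": 414269,
    "insnum": 2982605,
}


def mostly_missing(counts, total, threshold=0.9):
    """Return the columns whose share of missing values exceeds the threshold."""
    return sorted(col for col, n in counts.items() if n / total > threshold)


print(mostly_missing(null_counts, TOTAL_ROWS))  # ['entdepu', 'insnum', 'occup']
```

Of these, `visapost` (about 61% missing) and `gender` (about 13% missing) stay below the threshold.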

In [21]:
# Drop columns that have over 90% missing values or carry no useful information
df_immigration.drop(columns = ['occup', 'entdepu','insnum', 'count', 'dtaddto', 'entdepd', 'admnum', 'matflag'])

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,dtadfile,visapost,entdepa,biryear,gender,airline,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,37.0,2.0,,,T,1979.0,,,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,20130811,SEO,G,1991.0,M,,00296,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,20160401,,T,1961.0,M,OS,93,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,28.0,2.0,20160401,,O,1988.0,,AA,00199,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,4.0,2.0,20160401,,O,2012.0,,AA,00199,B2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3096308,625229.0,2016.0,4.0,745.0,745.0,SYS,20547.0,3.0,CA,,36.0,2.0,20160403,,Z,1980.0,,,00066,B2
3096309,1972204.0,2016.0,4.0,745.0,745.0,SYS,20554.0,3.0,CA,20555.0,36.0,2.0,20160410,BLG,Z,1980.0,F,,00066,B2
3096310,4249448.0,2016.0,4.0,745.0,745.0,TEC,20566.0,3.0,VA,20588.0,23.0,2.0,20160422,BLG,Z,1993.0,F,,00651,B2
3096311,5658953.0,2016.0,4.0,748.0,748.0,NEW,20573.0,3.0,MN,,57.0,2.0,20160429,CLG,Z,1959.0,M,,LAND,B2


In [22]:
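# Create udf to convert SAS date to PySpark date 
@udf(StringType())
def convert_datetime(x):
    # `datetime` is imported as a module above, so date and timedelta must be qualified
    if x is not None:
        return (datetime.date(1960, 1, 1) + datetime.timedelta(days=x)).isoformat()
    return None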

In [23]:
# Use the Spark DataFrame here: dropDuplicates and withColumn are Spark methods, not pandas ones
df_immigration_cleaned = df_immigration_spark.dropDuplicates(['cicid'])
df_immigration_cleaned = df_immigration_cleaned.dropna(how="any", subset=["i94port", "i94addr", "gender"])
df_immigration_cleaned = df_immigration_cleaned.withColumn("arrdate", convert_datetime(df_immigration_cleaned.arrdate))
df_immigration_cleaned

#### Extract port info from I94_SAS_Labels_Descriptions.SAS
i94_ports.csv: Table containing city codes used in the dataset

In [15]:
# Create list of valid ports
i94_sas_label_descriptions_path = "../Data/I94_SAS_Labels_Descriptions.SAS"
def code_mapper(label):
    with open(i94_sas_label_descriptions_path) as f:
        sas_labels_content = f.read()

    sas_labels_content = sas_labels_content.replace('\t', '')

    label_content = sas_labels_content[sas_labels_content.index(label):]
    label_content = label_content[:label_content.index(';')].split('\n')
    label_content = [i.replace("'", "") for i in label_content]

    label_dict = [i.split('=') for i in label_content[1:]]
    label_dict = dict([i[0].strip(), i[1].strip()] for i in label_dict if len(i) == 2)

    return label_dict
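
To show what the parsing does without needing the labels file, here is a minimal sketch of the same slicing logic applied to an in-memory string. `parse_label_block` is a hypothetical variant of `code_mapper` that takes the text as an argument, and the sample text only imitates the layout of the labels file; it is not copied from it.

```python
def parse_label_block(text, label):
    """Return {code: description} for the block that starts at `label` and ends at ';'."""
    text = text.replace('\t', '')
    block = text[text.index(label):]
    block = block[:block.index(';')]
    mapping = {}
    # Skip the first line (the label itself) and keep only "code = 'value'" lines.
    for line in block.split('\n')[1:]:
        parts = line.replace("'", "").split('=')
        if len(parts) == 2:
            mapping[parts[0].strip()] = parts[1].strip()
    return mapping


# Illustrative text in the same "code = 'description'" layout.
sample = (
    "value i94model\n"
    "\t1 = 'Air'\n"
    "\t2 = 'Sea'\n"
    "\t3 = 'Land'\n"
    "\t9 = 'Not reported'\n"
    ";\n"
)

print(parse_label_block(sample, 'i94model'))
```

Passing the text in, rather than re-reading the file on every call, also makes the function easy to unit test.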

In [16]:
i94cntyl = code_mapper('i94cntyl') # '101': 'ALBANIA',
i94port = code_mapper('i94prtl')   # 'ORL': 'ORLANDO, FL'
i94mode = code_mapper('i94model')  # {'1': 'Air', '2': 'Sea', '3': 'Land', '9': 'Not reported'}
i94addr = code_mapper('i94addrl')  # 'AK': 'ALASKA'
i94visa = {'1.0':'Business', '2.0': 'Pleasure', '3.0' : 'Student'}

All invalid values are grouped together as 'INVALID ENTRY'.

In [18]:
i94cntyl = { k: (v if not re.match('^INVALID:|^Collapsed|^No Country Code', v) else 'INVALID ENTRY') 
                 for k, v in i94cntyl.items()}
#i94cntyl
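
The grouping rule can be checked on a handful of values. In this sketch `normalise_country` is a hypothetical helper wrapping the same regular expression; the codes `998` and `999` and their labels are made up for illustration, and only `'101': 'ALBANIA'` comes from the notebook.

```python
import re

# Same pattern as in the notebook: labels starting with one of these prefixes are invalid.
INVALID_PATTERN = re.compile(r'^INVALID:|^Collapsed|^No Country Code')


def normalise_country(name):
    """Map any invalid country label to the single value 'INVALID ENTRY'."""
    return 'INVALID ENTRY' if INVALID_PATTERN.match(name) else name


# Illustrative labels; 998 and 999 are hypothetical codes.
examples = {
    '101': 'ALBANIA',
    '998': 'INVALID: EXAMPLE',
    '999': 'No Country Code (example)',
}
cleaned = {code: normalise_country(name) for code, name in examples.items()}
print(cleaned)
```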

The state names in the map are formatted to match the state names used in the other datasets.

+ 'DIST. OF' is replaced with 'District of'
+ 'S.', 'N.', 'W.' are replaced with 'South', 'North', 'West'
+ all states are capitalized

In [19]:
def format_state(s):
    s = s.replace('DIST. OF', 'District of') \
         .replace('S.', 'South') \
         .replace('N.', 'North') \
         .replace('W.', 'West')
    return ' '.join([w.capitalize() if w != 'of' else w for w in s.split() ])

# format addr labels
i94addr = {k: format_state(v) for k, v in i94addr.items()}
# i94addr

In [21]:
i94port_split = {}
index = 0 
for k, v in i94port.items():
    if not re.match('^Collapsed|^No PORT Code', v):
        try:
            # extract state part from i94port
            # the state part contains the state and also other words
            state_part = v.rsplit(',', 1)[1]
            city_part = v.rsplit(',', 1)[0] 
            # create a set of all words in state part
            state_part_set = set(state_part.split())
            # if the state is valid (is in the set(i94addr.keys()), then retrieve state
            state = list(set(i94addr.keys()).intersection(state_part_set))[0]
            # add state to dict
            i94port_split[index] = [k, city_part, state]
        except IndexError:
            # no state is specified for Washington DC in labels so it is added here
            # 'MARIPOSA AZ' is not split by ","
            if v == 'WASHINGTON DC':
                i94port_split[index] = [k, 'WASHINGTON DC','DC']
            elif v == 'MARIPOSA AZ':
                i94port_split[index] = [k,  'MARIPOSA', 'AZ']
            else:
                i94port_split[index] = [k, 'INVALID ENTRY', 'INVALID ENTRY']
            
    else:
        i94port_split[index] = [k, 'INVALID ENTRY', 'INVALID ENTRY']
    index += 1
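
The core of the loop above is the split of a port label into a city and a state. The sketch below isolates that step in a hypothetical helper, `split_port`, and leaves out the hand-written special cases (`WASHINGTON DC`, `MARIPOSA AZ`). The two-element `states` set stands in for the keys of `i94addr`, and the second label is invented to show a trailing tag after the state.

```python
def split_port(label, valid_states):
    """Split 'CITY, ST ...' into (city, state); fall back to INVALID ENTRY."""
    invalid = ('INVALID ENTRY', 'INVALID ENTRY')
    if ',' not in label:
        return invalid
    # Split on the last comma so that city names containing commas stay intact.
    city, state_part = label.rsplit(',', 1)
    matches = valid_states.intersection(state_part.split())
    if not matches:
        return invalid
    return (city.strip(), sorted(matches)[0])


states = {'FL', 'AZ'}  # tiny stand-in for the keys of i94addr
print(split_port('ORLANDO, FL', states))
print(split_port('EXAMPLE CITY, AZ #ARPT', states))  # hypothetical label with a trailing tag
print(split_port('MARIPOSA AZ', states))
```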

In [22]:
df_i94cntyl = pd.DataFrame(i94cntyl.items(), columns=['country_code', 'country'])
#df_i94port = pd.DataFrame(i94port.items(), columns=['port_code', 'port'])
df_i94port = pd.DataFrame(i94port_split.values(), columns=['port_code', 'city', 'state'])
df_i94mode = pd.DataFrame(i94mode.items(), columns=['mode_num', 'mode'])
df_i94addr = pd.DataFrame(i94addr.items(), columns=['state_code', 'state'])
df_i94visa = pd.DataFrame(i94visa.items(), columns=['visa_type_num', 'visa_type'])

In [23]:
df_i94cntyl.to_csv('../Data/i94cntyl.csv', index=False, sep=',')
df_i94port.to_csv('../Data/i94port.csv', index=False, sep=',')
df_i94mode.to_csv('../Data/i94mode.csv', index=False, sep=',')
df_i94addr.to_csv('../Data/i94addr.csv', index=False, sep=',')
df_i94visa.to_csv('../Data/i94visa.csv', index=False, sep=',')

In [161]:
# df_i94port[['city', 'state_or_country']] = df_i94port["port"].str.split(',', expand=True)

### 2. World Temperature Data 

In [20]:
temperature_data_path = '../data/GlobalLandTemperaturesByCity.csv'

In [21]:
df_temperature = pd.read_csv(temperature_data_path)

In [22]:
df_temperature.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


### 3. US City Demographic Data

In [23]:
demographics_data_path = "../data/us-cities-demographics.csv"

In [24]:
df_demographics_spark = spark.read.csv(demographics_data_path, inferSchema=True, header=True, sep=';')

In [25]:
df_demographics_spark.limit(5).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402


### 4. Airport Data

In [15]:
df_airports = pd.read_csv('../data/airport-codes_csv.csv')

In [16]:
df_airports.columns

Index(['ident', 'type', 'name', 'elevation_ft', 'continent', 'iso_country',
       'iso_region', 'municipality', 'gps_code', 'iata_code', 'local_code',
       'coordinates'],
      dtype='object')

In [17]:
df_airports.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [162]:
df_airports['municipality'].nunique()

27133

In [163]:
df_airports['municipality'].count()

49399

In [170]:
df_airports.dtypes

ident            object
type             object
name             object
elevation_ft    float64
continent        object
iso_country      object
iso_region       object
municipality     object
gps_code         object
iata_code        object
local_code       object
coordinates      object
dtype: object

In [167]:
df_airports.isna().sum()

ident               0
type                0
name                0
elevation_ft     7006
continent       27719
iso_country       247
iso_region          0
municipality     5676
gps_code        14045
iata_code       45886
local_code      26389
coordinates         0
dtype: int64

### 2. World Temperature Data (continued)

In [27]:
df_temperature.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [29]:
df_temperature.shape

(8599212, 7)

In [28]:
df_temperature.isnull().sum()

dt                                    0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64

In [31]:
df_temperature.dtypes

dt                                object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                              object
Country                           object
Latitude                          object
Longitude                         object
dtype: object
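
Since `dt` is stored as a string and many readings are missing, one possible way to prepare the data is to average the available readings per city and calendar month. The sketch below does this in plain Python on the first rows of the file. Keying by month only is an assumption made here because the temperature data ends in 2013, before the 2016 immigration data; the function name `monthly_city_average` is hypothetical.

```python
from collections import defaultdict

# (dt, City, AverageTemperature) from the first rows of the file;
# None marks a missing reading.
rows = [
    ("1743-11-01", "Århus", 6.068),
    ("1743-12-01", "Århus", None),
    ("1744-01-01", "Århus", None),
]


def monthly_city_average(rows):
    """Average the non-missing temperatures per (city, calendar month)."""
    sums = defaultdict(lambda: [0.0, 0])  # (city, month) -> [running total, count]
    for dt, city, temp in rows:
        if temp is None:
            continue
        month = int(dt[5:7])  # dt is an ISO 'YYYY-MM-DD' string
        bucket = sums[(city, month)]
        bucket[0] += temp
        bucket[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


print(monthly_city_average(rows))
```

Months with no readings at all simply do not appear in the result, which keeps missing values out of later joins.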

## Step 3: Define the Data Model

##### Fact Table

immigration_df
    id
    state_code
    city_code
    date
    count

##### Dimension Tables
immigrant_df
    id
    gender
    age
    visa_type

city_df
    city_code
    state_code
    city_name
    median_age
    pct_male_pop
    pct_female_pop
    pct_veterans
    pct_foreign_born
    pct_native_american
    pct_asian
    pct_black
    pct_hispanic_or_latino
    pct_white
    total_pop
    lat
    long

monthly_city_temp_df
    city_code
    year
    month
    avg_temperature

time_df
    date
    dayofweek
    weekofyear
    month

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model
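
One such step is deriving the `time_df` columns from an arrival date. The sketch below does this with the standard library, following the conventions of Spark's `dayofweek` (1 = Sunday through 7 = Saturday) and `weekofyear` (ISO 8601 week number), both of which are imported at the top of the notebook. The function name `time_dimension_row` is hypothetical.

```python
import datetime


def time_dimension_row(d):
    """Build one time-dimension record for a date."""
    return {
        "date": d.isoformat(),
        # isoweekday() is 1 = Monday ... 7 = Sunday; shift to 1 = Sunday ... 7 = Saturday.
        "dayofweek": d.isoweekday() % 7 + 1,
        "weekofyear": d.isocalendar()[1],  # ISO 8601 week number
        "month": d.month,
    }


row = time_dimension_row(datetime.date(2016, 4, 1))
print(row)
```

For 2016-04-01, a Friday, this gives `dayofweek` 6, `weekofyear` 13 and `month` 4.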

## Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
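
A minimal sketch of two of the checks listed above, a count check and a unique-key check, written against plain lists of dictionaries so it runs without a database. The function names and the `ValueError` convention are assumptions; in the actual pipeline the same checks would run against the warehouse tables.

```python
def check_not_empty(table_name, rows):
    """Completeness check: fail if a table came out of the pipeline empty."""
    if len(rows) == 0:
        raise ValueError(f"{table_name}: no rows loaded")
    return len(rows)


def check_unique_key(table_name, rows, key):
    """Integrity check: the key column must be non-null and unique."""
    seen = set()
    for row in rows:
        value = row[key]
        if value is None:
            raise ValueError(f"{table_name}: null value in key column {key}")
        if value in seen:
            raise ValueError(f"{table_name}: duplicate key {value!r} in column {key}")
        seen.add(value)
    return True


# Ids taken from the first three sample rows of the immigration data.
immigration_rows = [{"id": 6}, {"id": 7}, {"id": 15}]
print(check_not_empty("immigration", immigration_rows))
print(check_unique_key("immigration", immigration_rows, "id"))
```

Raising an exception, rather than printing a warning, lets a scheduler mark the run as failed when a check does not pass.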

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

## Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.