# Immigration Data ETL
### Data Engineering Capstone Project

#### Project Summary
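The pipeline reads the immigration data from the workspace, transforms it into a fact table with airport, time, and status dimensions, and writes the result back in Parquet format.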

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* **Step 4: Run ETL to Model the Data**
* Step 5: Complete Project Write Up

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model

In Step 2 of this project, the data model was created in Jupyter Notebooks, using Apache Spark for scalable processing.

In Step 4 we created a fully working ETL script included in this workspace:

    Name of ETL Script:    etl.py

The script "etl.py" will (1) load the data from the workspace, (2) run transformations on the data, and (3) write the results back into the workspace in Parquet format.

Options:
* **Sampling:** Run "etl.py --sample-size 0.01" to process a 1% sample of the complete immigration data
    * You can choose other values and adjust the sample size
* **S3-Storing:** Run "etl.py --s3-store" to store the output on S3
    * Enter an S3 bucket link and AWS credentials in "dl.cfg", and "etl.py" will store the data in that bucket
* **Loading:** 
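
The two documented options could be parsed roughly as follows. This is a minimal sketch, not the actual code in "etl.py": the function name `parse_args`, the default sample size of 1.0 (the complete data set), and the range check are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two documented options; the defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="etl.py")
    parser.add_argument(
        "--sample-size",
        type=float,
        default=1.0,  # assumed default: process the complete data set
        help="fraction of the immigration data to process, e.g. 0.01 for 1%%",
    )
    parser.add_argument(
        "--s3-store",
        action="store_true",
        help="write the output to the S3 bucket configured in dl.cfg",
    )
    args = parser.parse_args(argv)
    if not 0.0 < args.sample_size <= 1.0:
        parser.error("--sample-size must be in the interval (0, 1]")
    return args


# Equivalent to: etl.py --sample-size 0.01 --s3-store
args = parse_args(["--sample-size", "0.01", "--s3-store"])
print(args.sample_size, args.s3_store)
```

Passing a list to `parse_args` keeps the sketch testable; when called without arguments it reads the real command line.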

Config file:
* Name: "dl.cfg"
* Defaults to workspace locations
* You may adjust the settings e.g. for storing data on S3
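
To illustrate how such a config file might be read, here is a small sketch using Python's `configparser`. The section names, key names, and paths shown are hypothetical placeholders; the real "dl.cfg" may be organized differently.

```python
import configparser

# Hypothetical dl.cfg content; the real section and key names may differ.
EXAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>

[PATHS]
INPUT_DATA = ./data/
OUTPUT_DATA = ./output/
S3_BUCKET = s3a://<your-bucket>/
"""

config = configparser.ConfigParser()
# A script would typically call config.read("dl.cfg") here instead.
config.read_string(EXAMPLE_CFG)


def output_location(config, s3_store):
    """Return the S3 bucket when S3 storing is requested, else the workspace path."""
    key = "S3_BUCKET" if s3_store else "OUTPUT_DATA"
    return config["PATHS"][key]


print(output_location(config, s3_store=False))
print(output_location(config, s3_store=True))
```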

Script location: "etl.py" (in this workspace root)

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

I implemented 2 data quality checks: `data_quality_check_01` and `data_quality_check_02`

Running `data_quality_check_01` first counts the rows in the dataframes before and after the data transformation. It then compares the number of duplicate entries before and after the transformation. For key columns, both counts should be identical before and after.
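
The logic of this check can be sketched without Spark. The example below works on plain lists of dictionaries rather than dataframes, and the helper names `count_duplicates` and `quality_check_counts` are illustrative assumptions, not the functions in "etl.py".

```python
from collections import Counter


def count_duplicates(rows, key_columns):
    """Count rows whose key-column values already occurred in another row."""
    seen = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sum(n - 1 for n in seen.values())


def quality_check_counts(before, after, key_columns):
    """Compare row counts and duplicate counts before and after a transformation."""
    report = {
        "rows_before": len(before),
        "rows_after": len(after),
        "duplicates_before": count_duplicates(before, key_columns),
        "duplicates_after": count_duplicates(after, key_columns),
    }
    report["passed"] = (
        report["rows_before"] == report["rows_after"]
        and report["duplicates_before"] == report["duplicates_after"]
    )
    return report


# Hypothetical sample: raw rows and the same rows after a type conversion.
raw = [{"cicid": 1, "age": "25"}, {"cicid": 2, "age": "31"}, {"cicid": 2, "age": "31"}]
clean = [{"cicid": r["cicid"], "age": int(r["age"])} for r in raw]

report = quality_check_counts(raw, clean, key_columns=["cicid"])
print(report)
```

A transformation that silently drops or multiplies rows would change one of the two counts and set `passed` to `False`.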

Running `data_quality_check_02` tries to join a given fact table with a dimension table. The function then reports how many rows cannot be matched between the two tables. This shows where dimension entries could be added to get more matches with the fact table.

This check was introduced when it was discovered that the immigration data contains airport codes that do not appear in the airport table (e.g. "NYC" for New York City, which is not a valid IATA code).
The script "etl.py" runs the check for this problem, but it could be reused on other tables as well.
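
The idea behind this check resembles an anti-join: keep the fact rows whose key has no partner in the dimension table. The sketch below shows it on plain Python lists; the function name `unmatched_keys` and the sample rows are assumptions for illustration only.

```python
from collections import Counter


def unmatched_keys(fact_rows, dim_rows, fact_key, dim_key):
    """Return fact key values without a dimension match, with their row counts."""
    known = {row[dim_key] for row in dim_rows}
    return Counter(
        row[fact_key]
        for row in fact_rows
        if row[fact_key] is not None and row[fact_key] not in known
    )


# Hypothetical sample rows; "NYC" is the non-IATA code mentioned above.
facts = [{"airport": "JFK"}, {"airport": "NYC"}, {"airport": "NYC"}, {"airport": "LAX"}]
airports = [{"iata_code": "JFK"}, {"iata_code": "LAX"}]

missing = unmatched_keys(facts, airports, fact_key="airport", dim_key="iata_code")
print(missing)
```

Returning the unmatched values together with their counts shows which missing dimension entries would recover the most fact rows.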

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

##### Immigration facts
    'cicid' - unique entry id
    'status' - status code, a foreign key for table "dim_status"
    'adm_number' - admission number, converted from string to bigint
    'transport_mode' - Mode of transportation
    'airport' - Airport short code e.g. for an airline traveller his/her arrival airport in the US, string value
    'state' - Short code of the airport's location state, string value derived from i94addr
    'arrival_dt' - from arrdate, converted from SAS date to ISO date
    'departure_dt' - from depdate, converted from SAS date to ISO date
    'airline' - from airline in source, string column
    'fltno' - from fltno in source, string column
    'visatype' - is the i94visa column, string column (Note: the column "visatype" in i94 data is not used)
    'age' - renamed i94bir column, int value
    'gender' - from gender column, string value
    'res_country' - derived from i94res, country codes of residence
    'cit_country' - derived from i94cit, country codes of citizenship
    'occup' - immigrant's occupation that he/she performs in the US, no transformation, string value
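
The SAS-to-ISO conversion used for 'arrival_dt' and 'departure_dt' relies on the fact that a SAS date value counts days since 1960-01-01. A minimal sketch, assuming the source columns hold numeric day counts (the function name `sas_to_iso` is hypothetical):

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_to_iso(sas_days):
    """Convert a SAS date value (days since 1960-01-01) to an ISO date string."""
    if sas_days is None:
        return None  # missing dates stay missing
    return (SAS_EPOCH + timedelta(days=int(sas_days))).isoformat()


print(sas_to_iso(20545.0))  # 2016-04-01
```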
    

##### Airport dimension
    'iata_code' - The airport's international aviation code
    'name' - Official airport name
    'municipality' - City where the airport is located
    'iso_region' - International region code
    'iso_country' - International country code
    'latitude' - airport geolocation
    'longitude' - airport geolocation

##### Time dimension
    'datestamp' - contains all unique values from columns arrdate and depdate, but transformed into an ISO formatted date
    'day_of_month' - number of the date's day within that month
    'day_of_year' - number of the date's day within the year
    'week' - calendar week number
    'month' - number of month
    'year' - year
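
The time dimension columns can all be derived from the date itself. The following sketch uses only the standard library; the function names are illustrative, and ISO 8601 week numbering is an assumption about what "calendar week number" means here.

```python
from datetime import date


def time_dimension_row(datestamp):
    """Derive the time dimension columns for a single date."""
    _, iso_week, _ = datestamp.isocalendar()
    return {
        "datestamp": datestamp.isoformat(),
        "day_of_month": datestamp.day,
        "day_of_year": datestamp.timetuple().tm_yday,
        "week": iso_week,  # assumption: ISO 8601 week numbering
        "month": datestamp.month,
        "year": datestamp.year,
    }


def build_time_dimension(arrival_dates, departure_dates):
    """Build one row per unique date found in either column, skipping missing values."""
    unique = sorted({d for d in [*arrival_dates, *departure_dates] if d is not None})
    return [time_dimension_row(d) for d in unique]


# Hypothetical sample dates.
arrivals = [date(2016, 4, 1), date(2016, 4, 1)]
departures = [date(2016, 4, 10), None]

time_dim = build_time_dimension(arrivals, departures)
print(time_dim[0])
```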

##### Status dimension
    'status_flag_id' - Individual key for each combination of flags
    'entdepa' - Arrival Flag
    'entdepd' - Departure Flag
    'entdepu' - Update Flag
    'matflag' - Match flag, indicates matching arrivals and departures
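
One way to obtain an individual key for each combination of flags is to collect the distinct combinations and number them. The sketch below is an assumption about how 'status_flag_id' could be generated, not the method used in "etl.py"; the function name and the sample flag values are hypothetical.

```python
FLAG_COLUMNS = ("entdepa", "entdepd", "entdepu", "matflag")


def build_status_dimension(rows, flag_columns=FLAG_COLUMNS):
    """Assign a sequential id to every distinct combination of flag values."""
    combos = sorted(
        {tuple(row.get(col) for col in flag_columns) for row in rows},
        # Sort missing flags as empty strings so the ordering is deterministic.
        key=lambda combo: tuple("" if value is None else value for value in combo),
    )
    return [
        {"status_flag_id": flag_id, **dict(zip(flag_columns, combo))}
        for flag_id, combo in enumerate(combos, start=1)
    ]


# Hypothetical sample rows with placeholder flag values.
rows = [
    {"entdepa": "G", "entdepd": "O", "entdepu": None, "matflag": "M"},
    {"entdepa": "G", "entdepd": "O", "entdepu": None, "matflag": "M"},
    {"entdepa": "T", "entdepd": None, "entdepu": None, "matflag": None},
]

status_dim = build_status_dimension(rows)
for entry in status_dim:
    print(entry)
```

The fact table can then look up 'status' by matching its four flag values against this dimension.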