# Immigration Data ETL
### Data Engineering Capstone Project

#### Project Summary

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## Step 3: Data model

### Step 3: Define the Data Model

Flyby Salad wants to analyze individual travellers' data on a large scale. The analysis dimensions are "time", "traveller status", "arrival location demographics" and "airport data".

In Step 2 we already identified ways to link the airport and demographic datasets to the immigration fact table.

This section presents the final data model with all selected columns.

#### 3.1 Conceptual Data Model
The data model consists of the following fact and dimension tables.

**Fact Table**
The immigration fact table will be named "immigration_facts" and consist of the following columns:
* cicid - Individual fact identifier for an i94 record (one for each individual being processed), Primary Key
* status - The status of immigration, FK for status dimension table
* adm_number - the admission number
* state - Code of U.S. state, FK to dimension table containing demographic data
* arrival_dt - Date of arrival, FK to time dimension table
* departure_dt - Date of departure, FK to time dimension table
* transport_mode - Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
* airport - IATA code of airport, FK for airport dimension table
* airline - Airline the immigrant arrived with
* ftlno - Flight number
* visatype - Visa codes with three categories: (1 = Business; 2 = Pleasure; 3 = Student)
* age - Immigrant's age
* gender - Immigrant's gender
* res_country - Country code of the immigrant's residence
* cit_country - Country code of the immigrant's citizenship
* occup - Immigrant's occupation


The remaining 13 columns will be dropped.
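
Two of the fact columns hold numeric codes whose meanings are listed above. As a minimal sketch, the mappings can be kept as plain dictionaries and applied during analysis. The helper name `decode` and the `"Unknown"` fallback label are illustrative assumptions, not part of the project code.

```python
# Code-to-label mappings copied from the column descriptions above.
TRANSPORT_MODES = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}
VISA_CATEGORIES = {1: "Business", 2: "Pleasure", 3: "Student"}


def decode(code, mapping, default="Unknown"):
    """Return the label for a numeric code.

    Accepts ints, floats and numeric strings (e.g. 1, 1.0, "1.0").
    Falls back to `default` (a hypothetical label) for missing or unmapped codes.
    """
    try:
        return mapping.get(int(float(code)), default)
    except (TypeError, ValueError, OverflowError):
        return default


print(decode(1, TRANSPORT_MODES))
print(decode("3.0", TRANSPORT_MODES))
print(decode(None, VISA_CATEGORIES))
```

Keeping the mappings in one place means the same labels can be reused in reports without storing the text in every fact row.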

**Dimension Table "dim_time"**
The time dimension table has the following columns:
* datestamp - ISO date string
* day_of_week - Day of the week for the given date
* week - Calendar week number
* month - Month number
* year - Year number
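
The time dimension columns can all be derived from the date string itself. The sketch below shows one way to do this with the Python standard library. It assumes that `day_of_week` is stored as an ISO weekday number (1 = Monday, 7 = Sunday) and that `week` is the ISO calendar week; the function name `build_time_row` is hypothetical.

```python
from datetime import date


def build_time_row(datestamp: str) -> dict:
    """Derive one dim_time row from an ISO date string such as "2016-04-01"."""
    d = date.fromisoformat(datestamp)
    return {
        "datestamp": d.isoformat(),
        "day_of_week": d.isoweekday(),   # assumption: 1 = Monday ... 7 = Sunday
        "week": d.isocalendar()[1],      # assumption: ISO calendar week number
        "month": d.month,
        "year": d.year,                  # calendar year, not ISO week-based year
    }


print(build_time_row("2016-04-01"))
```

Note that near the turn of the year the ISO week can belong to a different year than the calendar year (for example, 1 January 2016 falls in ISO week 53 of 2015), so queries that combine `week` and `year` should account for this.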


**Dimension Table "dim_airports"**
The airports dimension table contains the following data:
* iata_code - Individual airport code, Primary key
* name - Airport name
* municipality - City name of the airport
* iso_region - Code according to ISO code table
* iso_country - Country code
* latitude / longitude - Airport coordinates

**Dimension Table "dim_status"**
Status table columns:
* status_id - Short code of the immigration status
* arrived - Admitted or paroled into the U.S.
* departed - Departed, lost I-94 or is deceased
* updated - Either apprehended, overstayed, or adjusted to permanent residence
* matched - Match of arrival and departure records

**Dimension Table "dim_demographics"**
The demographics dimension table will be aggregated to the U.S. state level, since no city codes exist to reliably link it to the immigration data.
It will have the following columns:
* state_code - U.S. state code
* median_age - Median age of population
* male_population - Number of male residents
* female_population - Number of female residents
* total_population - Total number of residents
* number_of_veterans - Number of veterans within the population
* foreign_born - Number of foreign-born residents
* average_household_size - Average number of persons in a household
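
Aggregating city-level rows to the state level could look like the sketch below. Population counts are simply summed. How `median_age` and `average_household_size` are combined is not specified here, so the sketch assumes a population-weighted average, which only approximates a true state-level median. The function name `aggregate_by_state` and the input layout (one dictionary per city) are illustrative assumptions.

```python
from collections import defaultdict

# Columns that can be summed directly across cities.
SUM_COLUMNS = (
    "male_population",
    "female_population",
    "total_population",
    "number_of_veterans",
    "foreign_born",
)
# Columns combined as a population-weighted average (an assumption, see lead-in).
WEIGHTED_COLUMNS = ("median_age", "average_household_size")


def aggregate_by_state(city_rows):
    """Collapse city-level demographic rows into one row per state_code."""
    sums = defaultdict(lambda: defaultdict(int))
    weighted = defaultdict(lambda: defaultdict(float))
    for row in city_rows:
        state = row["state_code"]
        for col in SUM_COLUMNS:
            sums[state][col] += row[col]
        for col in WEIGHTED_COLUMNS:
            weighted[state][col] += row[col] * row["total_population"]

    result = []
    for state in sorted(sums):
        total = sums[state]["total_population"]
        out = {"state_code": state}
        for col in SUM_COLUMNS:
            out[col] = sums[state][col]
        for col in WEIGHTED_COLUMNS:
            # Guard against states whose cities report zero population.
            out[col] = weighted[state][col] / total if total else None
        result.append(out)
    return result
```

A dataframe-based pipeline would express the same idea as a group-by on the state code; the plain-Python version is used here only to keep the sketch dependency-free.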

**Data Model Diagram**

![](FlybySalad_DataModel.png)

#### 3.2 Mapping Out Data Pipelines
The following steps are necessary to pipeline the data into the chosen data model:

| Id | Step | Description |
|----|------|-------------|
| 01 | Import staging dataframes | Import immigration data, demographics data and airport information into the dataframes "imm_df", "dem_df" and "air_df" |
| 02 | Reduce to required columns | For each dataframe, exclude unused columns |
| 03 | Change data types | On existing columns change data types for numbers and dates |
| 04 | Fill fact table | |
| 05 | Fill dimension tables | |
| 06 | Export to Parquet | |
| 07 | Import to Redshift | |
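
Steps 02 and 03 can be illustrated on a single staging record. This is a minimal sketch rather than the pipeline's actual code. It assumes that the staging record already uses the fact table's column names, that numeric fields arrive as floats, and that dates arrive as SAS numeric dates (days since 1960-01-01). The helper names are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS numeric dates count days from this day

# Columns of "immigration_facts" as listed in section 3.1.
FACT_COLUMNS = [
    "cicid", "status", "adm_number", "state", "arrival_dt", "departure_dt",
    "transport_mode", "airport", "airline", "ftlno", "visatype", "age",
    "gender", "res_country", "cit_country", "occup",
]


def sas_to_iso(days):
    """Convert a SAS day count to an ISO date string; keep missing values as None."""
    if days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(days))).isoformat()


def reduce_columns(record, keep=FACT_COLUMNS):
    """Step 02: keep only the required columns (missing ones become None)."""
    return {col: record.get(col) for col in keep}


def cast_types(record):
    """Step 03: cast numeric columns to int and date columns to ISO strings."""
    out = dict(record)
    for col in ("cicid", "age", "transport_mode"):
        if out.get(col) is not None:
            out[col] = int(out[col])
    for col in ("arrival_dt", "departure_dt"):
        out[col] = sas_to_iso(out.get(col))
    return out


raw = {"cicid": 6.0, "age": 37.0, "transport_mode": 1.0,
       "arrival_dt": 20545.0, "departure_dt": None, "count": 1.0}
clean = cast_types(reduce_columns(raw))
print(clean["cicid"], clean["arrival_dt"], clean["departure_dt"])
```

Converting dates to ISO strings at this stage means the fact table's `arrival_dt` and `departure_dt` values line up directly with `datestamp` in the time dimension.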

**Diagram of created DAG Pipeline in Airflow**
