# Immigration Data Model
### Data Engineering Capstone Project

#### Project Summary
 - This project applies the tools and concepts learned in this course to design a data model and build a data pipeline. <br>
 - The main dataset covers immigration to the United States; supplementary datasets cover airport codes, U.S. city demographics, and temperature data.
 - First, the raw datasets were explored: which fields exist, what the quality of each field is, and how the data sources relate to each other.
 - Next, a conceptual data model was designed. A relational data model was selected for its flexibility in queries and analytics.
 - Within the relational data model, the data was normalized or denormalized as needed to clean it up and make it easier to query.
 - An ETL process was created to run everything in one command: fetching the raw data, cleaning it, and loading it into the selected database.
 - Data quality checks were then designed to make sure the ETL ran correctly.
 - After the ETL and data quality checks, the final tables are stored in AWS Redshift. This data model supports analysis and answers questions such as:
     - How many people enter and exit the US?
     - What do these people do for work, and how old are they?
     - What cities are they coming from, and where are they going?
     - Through which ports do they enter the US?
 - Finally, a brainstorming section discusses data solutions under more challenging scenarios.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [6]:
# Do all imports and installs here
import pandas as pd

### <font color=blue>Step 1: Scope the Project and Gather Data</font>

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? Etc.
 - I will build a snowflake schema data model in an AWS Redshift database (see Step 3).
 - I will use the data provided by Udacity; see the data description below.
 - My end solution will be an AWS Redshift database with fact and dimension tables.
 - I used Python, AWS S3, and Redshift.
    
#### Describe and Gather Data 
Four data sources were recommended; three were used.
 - I94 Immigration Data: This data comes from the US National Tourism and Trade Office. It is available in the workspace. It contains all information related to an entry, including the person, the port, etc.
 - World Temperature Data: This dataset came from Kaggle. <font color=red>As I don't see a relationship between temperature and the I94 data, I'm not using this data source.</font>
 - U.S. City Demographic Data: This data comes from OpenSoft. It is available in the workspace.
 - Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from DataHub (https://datahub.io/core/airport-codes).

### <font color=blue>Step 2: Explore and Assess the Data</font>
In this step, two tasks were performed:
 - Explore the data: Identify data quality issues, like missing values, duplicate data, etc.
 - Clean the data: Document the steps necessary to clean the data
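
The exploration notebooks are not reproduced here, so the following is only a minimal sketch of the kind of check meant by "missing values, duplicate data". The helper name `profile_frame` and the tiny sample frame are made up for illustration; they are not taken from the exploration notebooks.

```python
import pandas as pd


def profile_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(dropna=True),
    })


# Tiny made-up frame, only to show what the report looks like.
sample = pd.DataFrame({
    "cicid": [6.0, 7.0, 7.0, 15.0],
    "i94port": ["XXX", "ATL", "ATL", None],
})

report = profile_frame(sample)
duplicate_rows = int(sample.duplicated().sum())  # rows identical to an earlier row
print(report)
print("fully duplicated rows:", duplicate_rows)
```

The same function could be pointed at any of the raw frames loaded below to get a quick per-column overview before deciding on cleaning steps.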

### _I94 Immigration Data_
 - For the full data exploration, refer to the i94_Exploration notebook.
 - This section only shows what the data looks like.

# Spark is not working for now
from pyspark.sql import SparkSession

# Build a session that pulls in the spark-sas7bdat package to read SAS files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

# Spark is not working for now
# Write to parquet, then read it back
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

In [7]:
# Read in the data here
raw_i94 = pd.read_sas('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [9]:
raw_i94.shape

(3096313, 28)

In [10]:
raw_i94.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,b'XXX',20573.0,,,,...,b'U',,1979.0,b'10282016',,,,1897628000.0,,b'B2'
1,7.0,2016.0,4.0,254.0,276.0,b'ATL',20551.0,1.0,b'AL',,...,b'Y',,1991.0,b'D/S',b'M',,,3736796000.0,b'00296',b'F1'
2,15.0,2016.0,4.0,101.0,101.0,b'WAS',20545.0,1.0,b'MI',20691.0,...,,b'M',1961.0,b'09302016',b'M',,b'OS',666643200.0,b'93',b'B2'
3,16.0,2016.0,4.0,101.0,101.0,b'NYC',20545.0,1.0,b'MA',20567.0,...,,b'M',1988.0,b'09302016',,,b'AA',92468460000.0,b'00199',b'B2'
4,17.0,2016.0,4.0,101.0,101.0,b'NYC',20545.0,1.0,b'MA',20567.0,...,,b'M',2012.0,b'09302016',,,b'AA',92468460000.0,b'00199',b'B2'
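
The sample above shows two things that need cleaning before loading: text columns come back from `pd.read_sas` as bytes (`b'ATL'`), and `arrdate`/`depdate` are SAS numeric dates, which count days from 1960-01-01. The sketch below is one possible way to handle both; the helper names are made up for illustration and are not necessarily what etl.py does.

```python
import pandas as pd

SAS_EPOCH = pd.Timestamp("1960-01-01")  # SAS numeric dates count days from here


def decode_bytes(df: pd.DataFrame) -> pd.DataFrame:
    """Decode bytes values in object columns to str, leaving missing values alone."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
    return out


def sas_days_to_date(days: pd.Series) -> pd.Series:
    """Convert SAS day counts to datetimes; missing values become NaT."""
    return pd.to_timedelta(days, unit="D") + SAS_EPOCH


# Three values copied from the sample rows above.
sample = pd.DataFrame({
    "i94port": [b"XXX", b"ATL", b"WAS"],
    "arrdate": [20573.0, 20551.0, 20545.0],
    "depdate": [None, None, 20691.0],
})

clean = decode_bytes(sample)
clean["arrdate"] = sas_days_to_date(clean["arrdate"])
clean["depdate"] = sas_days_to_date(clean["depdate"])
print(clean)
```

Passing an `encoding` argument to `pd.read_sas` may be an alternative to decoding afterwards; I have not verified which encoding this file uses.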


### _U.S. City Demographic Data_
 - For the full data exploration, refer to the CitiesDemo_Exploration notebook.
 - This section only shows what the data looks like.

In [5]:
raw_citydemo = pd.read_csv('us-cities-demographics.csv',sep = ';')

In [6]:
raw_citydemo.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402
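
In the sample above, each row is one city and one race, so city-level fields such as `Total Population` would repeat once per race. One possible way to normalize this, shown only as a sketch, is to split the frame into a city-level table and a race-count table. The function name and the choice of `City` plus `State Code` as the key are my assumptions, and the second Newark row is a placeholder added to show the repetition.

```python
import pandas as pd

CITY_KEYS = ["City", "State Code"]  # assumed key; not confirmed by the data dictionary


def split_city_and_race(df: pd.DataFrame):
    """Split the raw frame into a city-level table and a race-count table."""
    city_cols = [c for c in df.columns if c not in ("Race", "Count")]
    cities = df[city_cols].drop_duplicates().reset_index(drop=True)
    race_counts = df[CITY_KEYS + ["Race", "Count"]].reset_index(drop=True)
    return cities, race_counts


sample = pd.DataFrame({
    "City": ["Newark", "Newark", "Quincy"],
    "State": ["New Jersey", "New Jersey", "Massachusetts"],
    "Total Population": [281913, 281913, 93629],
    "State Code": ["NJ", "NJ", "MA"],
    "Race": ["White", "Asian", "White"],
    "Count": [76402, 0, 58723],  # the 0 is a placeholder, not a real figure
})

cities, race_counts = split_city_and_race(sample)
print(cities)
print(race_counts)
```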


### _World Temperature Data_
 - For the full data exploration, refer to the WorldTemp_Exploration notebook.
 - This section only shows what the data looks like.
 - As I don't see the need to keep temperature in this data model, I will **NOT** use this data.

In [4]:
raw_worldtemp = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')

In [5]:
raw_worldtemp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


### _Airport Code Table_
 - For the full data exploration, refer to the Airport_Exploration notebook.
 - This section only shows what the data looks like.

In [32]:
from datapackage import Package

data_url = 'https://datahub.io/core/airport-codes/datapackage.json'

# Load the Data Package descriptor
package = Package(data_url)

# Read the resource at index 1 as CSV
raw_airportcode = pd.read_csv(package.resources[1].descriptor['path'])

In [33]:
raw_airportcode.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
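
The sample above suggests a few cleaning steps: `iso_region` embeds the state code (`US-PA`), and `coordinates` is a single string that, judging by the sample rows, holds longitude first and latitude second. The sketch below shows one way to handle this with pandas. The function name, the US-only filter, and the new column names are assumptions for illustration, and the non-US row in the sample is made up.

```python
import pandas as pd


def clean_airports(df: pd.DataFrame) -> pd.DataFrame:
    """Keep US rows, derive a state code, and split the coordinates string."""
    out = df[df["iso_country"] == "US"].copy()
    out["state_code"] = out["iso_region"].str.split("-").str[1]
    parts = out["coordinates"].str.split(",", expand=True)
    out["longitude"] = parts[0].str.strip().astype(float)
    out["latitude"] = parts[1].str.strip().astype(float)
    return out.drop(columns=["coordinates"]).reset_index(drop=True)


sample = pd.DataFrame({
    "ident": ["00A", "00AK", "ZZZZ"],            # ZZZZ is a made-up non-US row
    "iso_country": ["US", "US", "CA"],
    "iso_region": ["US-PA", "US-AK", "CA-ON"],
    "coordinates": [
        "-74.93360137939453, 40.07080078125",
        "-151.695999146, 59.94919968",
        "0.0, 0.0",
    ],
})

airports = clean_airports(sample)
print(airports)
```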


### <font color=blue>Step 3: Define the Data Model</font>
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model


 - This will be a relational data model. Compared to a non-relational data model, it provides the flexibility to run queries and aggregations on the fly, and a single query can access and join data from multiple tables.
 - I'm going for a snowflake schema: one main fact table holding the I94 data, 3 dimension tables that provide more information to the fact table, and 2 dimension tables that provide further information to those dimension tables.
<img src="datamodel.png">
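
To illustrate how a single query can walk from the fact table through a dimension to a sub-dimension, here is a small sketch. It uses an in-memory SQLite database as a stand-in for Redshift, and every table and column name (`fact_i94`, `dim_port`, `dim_state`) is hypothetical; the real names are in the diagram and the data dictionary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Redshift, for illustration only
conn.executescript("""
    -- Hypothetical tables: fact -> dimension -> sub-dimension
    CREATE TABLE dim_state (state_code TEXT PRIMARY KEY, state_name TEXT);
    CREATE TABLE dim_port (
        port_code TEXT PRIMARY KEY,
        state_code TEXT REFERENCES dim_state(state_code)
    );
    CREATE TABLE fact_i94 (
        cicid INTEGER PRIMARY KEY,
        port_code TEXT REFERENCES dim_port(port_code)
    );

    INSERT INTO dim_state VALUES ('GA', 'Georgia'), ('NY', 'New York');
    INSERT INTO dim_port VALUES ('ATL', 'GA'), ('NYC', 'NY');
    INSERT INTO fact_i94 VALUES (7, 'ATL'), (16, 'NYC'), (17, 'NYC');
""")

# Count arrivals per state by joining through the port dimension.
arrivals_by_state = conn.execute("""
    SELECT s.state_name, COUNT(*) AS arrivals
    FROM fact_i94 AS f
    JOIN dim_port AS p ON p.port_code = f.port_code
    JOIN dim_state AS s ON s.state_code = p.state_code
    GROUP BY s.state_name
    ORDER BY arrivals DESC
""").fetchall()
print(arrivals_by_state)
```

The extra join is the price of the snowflake layout; in exchange, state-level attributes are stored once instead of being repeated on every port row.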

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

 - Use an AWS Redshift database
 - Clean out the database
 - Load 1 fact table and 4 dimension tables from SAS/CSV/txt files
 - Do the necessary data cleaning
 - Create 1 dimension table from the fact table
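
As a rough sketch of the load step, assuming the cleaned files are staged in S3 as CSV and loaded with Redshift's `COPY` command: the helper below only builds the SQL text. The table name, bucket path, and IAM role ARN are placeholders, and the actual etl.py may load the data differently.

```python
def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for a CSV file that has a header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Placeholder values, not taken from the project.
copy_sql = build_copy_sql(
    table="staging_i94",
    s3_path="s3://my-bucket/i94/i94_apr16.csv",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-role",
)
print(copy_sql)
```

The resulting string would then be executed through whatever database connection the pipeline uses.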

### <font color=blue>Step 4: Run Pipelines to Model the Data</font>
#### 4.1 Create the data model
 - <font color=red>Refer to the datapipeline folder.</font>
 - etl.ipynb walks through drafting and running the ETL process step by step.
 - To run the ETL in production, cd into the datapipeline folder and run **python3 -W ignore etl.py**

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 

<font color=red>Refer to dataqualitycheck.ipynb in the datapipeline folder.</font>
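
The checks themselves live in dataqualitycheck.ipynb and are not reproduced here. As a minimal sketch of what a count check and a unique-key check could look like, the functions below work with any DB-API connection; the demo uses an in-memory SQLite database as a stand-in for Redshift, and the table names are hypothetical.

```python
import sqlite3


def check_not_empty(conn, table: str) -> int:
    """Raise if the table has no rows; otherwise return the row count."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    count = cur.fetchone()[0]
    if count == 0:
        raise ValueError(f"{table} is empty")
    return count


def check_unique_key(conn, table: str, key: str) -> None:
    """Raise if the key column contains NULLs or duplicate values."""
    cur = conn.cursor()
    # COUNT(DISTINCT ...) skips NULLs, so any NULL or duplicate makes the counts differ.
    cur.execute(f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}")
    total, distinct = cur.fetchone()
    if total != distinct:
        raise ValueError(f"{table}.{key} has NULL or duplicate values")


# Demo on made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_i94 (cicid INTEGER);
    INSERT INTO fact_i94 VALUES (6), (7), (15);
    CREATE TABLE fact_dup (cicid INTEGER);
    INSERT INTO fact_dup VALUES (6), (6);
""")

row_count = check_not_empty(conn, "fact_i94")
check_unique_key(conn, "fact_i94", "cicid")  # passes silently
try:
    check_unique_key(conn, "fact_dup", "cicid")
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
print(row_count, duplicate_detected)
```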

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

<font color = red> Columns with the same cell color can be linked together. </font>
<img src="datadictionary.png">

### <font color=blue>Step 5: Complete Project Write Up</font>
* The following technologies were used:
 * Python 3 - it is a straightforward and widely used language, so the code is easy to maintain and understand
 * S3 - acts as a bridge between Python 3 and Redshift
 * AWS Redshift - makes the data easy and flexible to query
 * More advanced technologies like Apache Airflow and EMR/Spark were not used because I ran into technical difficulties. Data engineering will not be my main field of study, so I preferred to finish faster with simpler technology.
* I'm only using 1 month of I94 data for the fact table, so the tables/database should be updated monthly.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x - I would first check whether increasing the number of nodes in the cluster helps. If not, I would switch to Spark and EMR for data processing, as the volume might exceed the capacity of a single machine.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day - I would use Apache Airflow to schedule a daily data update. If an API is needed, I would use a combination of Airflow + Spark + Apache Livy in an EMR cluster so that Spark commands can be passed through an API interface.
 * The database needed to be accessed by 100+ people - Redshift allows access for up to 500 users, so this would not be a problem.
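
For the monthly update mentioned above, a scheduled run would first need to work out which source file to load. The sketch below assumes that the other monthly files follow the same naming pattern as the April file used in this notebook (`i94_apr16_sub.sas7bdat`) and sit in the same directory; I have not verified this, and the function name is made up.

```python
from datetime import date

I94_DIR = "../../data/18-83510-I94-Data-2016"  # directory used for the April file
MONTH_ABBR = ("jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec")


def i94_file_for(month_start: date) -> str:
    """Return the SAS file path for a month, assuming the April naming pattern."""
    mon = MONTH_ABBR[month_start.month - 1]
    yy = f"{month_start.year % 100:02d}"
    return f"{I94_DIR}/i94_{mon}{yy}_sub.sas7bdat"


april_path = i94_file_for(date(2016, 4, 1))
print(april_path)
```

A scheduler such as Airflow could pass the run's month into a function like this so that each run loads only the new month.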