# US Tourism & Immigration Data Model
### Data Engineering Capstone Project

#### Project Summary
The US National Tourism and Trade Office gathers detailed data on every person arriving in the US who is not a citizen or lawful permanent resident, including their country of origin, age, purpose of travel, duration of stay, mode of transportation, and port of arrival.

This project attempts to combine the above data with information about temperature and demographics in US cities and the existing airports in each US state. For this purpose, an ETL pipeline is created using Spark which runs on an Amazon EMR cluster. The data are loaded in Spark and transformed into fact and dimension DataFrames that form a star schema. These DataFrames are then stored as .parquet files in an Amazon S3 Bucket.

### 1. Datasets Used

#### 1.1 I94 Immigration Data
This data comes from the US National Tourism and Trade Office. It contains information about international travelers to the US in 2016.

https://travel.trade.gov/research/reports/i94/historical/2016.html

The file immigration_data_sample.csv is only a sample of the complete dataset, which is split into .sas7bdat files, one for each month. A further file, I94_SAS_Labels_Descriptions.SAS, describes how to interpret the data in each column.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('immigration_data_sample.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT
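
The `arrdate` values in the sample (for example 20566.0) are day counts rather than calendar dates. The sketch below converts them, assuming they follow the SAS convention of counting days from 1960-01-01; this assumption should be checked against I94_SAS_Labels_Descriptions.SAS. The helper name `sas_to_date` is hypothetical and not part of the project code.

```python
from datetime import date, timedelta

# Assumption: SAS numeric dates count the days elapsed since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(sas_days):
    """Convert a SAS day count (e.g. 20566.0) to a date, or None if missing."""
    if sas_days is None or sas_days != sas_days:  # NaN is not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))

print(sas_to_date(20566.0))  # arrdate of the first sample row -> 2016-04-22
```

The result falls in April 2016, which agrees with the `i94yr` and `i94mon` values of the same row.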


#### 1.2 US City Demographics Data
This data comes from OpenDataSoft. It contains demographic data for various US cities.

https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

In [3]:
df = pd.read_csv('us-cities-demographics.csv', delimiter=';')

In [4]:
df.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [7]:
print('rows: {}, columns: {}'.format(*df.shape))

rows: 2891, columns: 12


How many null values exist in each column?

In [6]:
df.isna().sum()

City                       0
State                      0
Median Age                 0
Male Population            3
Female Population          3
Total Population           0
Number of Veterans        13
Foreign-born              13
Average Household Size    16
State Code                 0
Race                       0
Count                      0
dtype: int64

This dataset has few null values and is mostly clean. However, many cities appear in more than one row.

In [22]:
print('unique cities: {}'.format(df[['City','State']].drop_duplicates().shape[0]))

unique cities: 596
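
The gap between 2,891 rows and 596 unique cities appears to come from the `Race` and `Count` columns: each city is repeated once per race. Since the data model stores one column per race, one possible way to reshape the data is sketched below with pandas. The rows are toy data with hypothetical values, and the actual pipeline performs this step in Spark, so this is only an illustration of the idea.

```python
import pandas as pd

# Toy data (hypothetical values) in the same layout: one row per city and race.
long_df = pd.DataFrame({
    'City': ['Springfield', 'Springfield', 'Riverton'],
    'State Code': ['XX', 'XX', 'YY'],
    'Total Population': [1000, 1000, 500],
    'Race': ['White', 'Asian', 'White'],
    'Count': [700, 300, 500],
})

city_cols = ['City', 'State Code', 'Total Population']

# City-level attributes repeat on every row, so keep one copy per city.
cities = long_df[city_cols].drop_duplicates()

# Spread the Race/Count pairs into one column per race.
races = long_df.pivot_table(index=['City', 'State Code'], columns='Race',
                            values='Count', aggfunc='sum').reset_index()

# One row per city; races with no entry for a city are left as NaN.
wide_df = cities.merge(races, on=['City', 'State Code'], how='left')
print(wide_df)
```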


#### 1.3 Global Land Temperatures by City
This dataset comes from Kaggle. It contains average temperatures for cities around the globe.

https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

In [93]:
df = pd.read_csv('GlobalLandTemperaturesByCity.csv')

In [94]:
df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


From the cities' temperature data, only recent data (from 2000 onwards) from US cities will be used.

In [95]:
df = df[df['Country']=='United States']

In [96]:
df = df[df['dt'].apply(lambda date: int(date[:4]) >= 2000)]

How many null values exist in each column?

In [97]:
df.isna().sum()

dt                               0
AverageTemperature               1
AverageTemperatureUncertainty    1
City                             0
Country                          0
Latitude                         0
Longitude                        0
dtype: int64

Only temperature data for Anchorage in 09/2013 are missing.

In [98]:
df[df.isna().any(axis=1)]

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
287781,2013-09-01,,,Anchorage,United States,61.88N,151.13W


Do we have at least 1 measurement each month in every year?

In [107]:
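# One [year, months] pair for each year from 2000 to 2013
months_years = [[x, []] for x in range(2000, 2014)]

# dt is formatted as 'YYYY-MM-DD': index by offset from 2000, collect the month
for x in df['dt'].unique():
    months_years[int(x[:4]) - 2000][1].append(x[5:7])

for month_year in months_years:
    print('{}:{}'.format(month_year[0], month_year[1]))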

2000:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2001:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2002:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2003:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2004:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2005:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2006:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2007:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2008:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2009:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2010:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2011:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
2012:['01', '02', '03', '04', '05', '06', '07', '08', '09', '10'

Temperature data for October, November and December are missing for 2013 and therefore the whole year is excluded.
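
Reading the month lists by eye works for a handful of years. A more compact check, sketched here in plain Python, reports only the years that have gaps. The function name `missing_months` and the toy input are hypothetical; with the real data the input would be `df['dt'].unique()`.

```python
from collections import defaultdict

def missing_months(date_strings):
    """Map each incomplete year to its sorted list of missing months ('01'..'12')."""
    seen = defaultdict(set)
    for d in date_strings:  # dates formatted as 'YYYY-MM-DD'
        seen[int(d[:4])].add(d[5:7])
    all_months = {'{:02d}'.format(m) for m in range(1, 13)}
    return {year: sorted(all_months - months)
            for year, months in sorted(seen.items())
            if all_months - months}

# Toy input: a complete 2012 and a 2013 that stops in September.
sample = (['2012-{:02d}-01'.format(m) for m in range(1, 13)]
          + ['2013-{:02d}-01'.format(m) for m in range(1, 10)])
print(missing_months(sample))  # {2013: ['10', '11', '12']}
```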

In [57]:
df = df[df['dt'].apply(lambda x: int(x[:4]) < 2013)]

No null values remain, since the only one came from a measurement in 2013.

In [59]:
df.isna().sum()

dt                               0
AverageTemperature               0
AverageTemperatureUncertainty    0
City                             0
Country                          0
Latitude                         0
Longitude                        0
dtype: int64

Also, no duplicate data exists.

In [60]:
print('rows: {}, columns: {}'.format(*df.shape))

rows: 40092, columns: 7


In [63]:
print('rows: {}, columns: {}'.format(*df.drop_duplicates().shape))

rows: 40092, columns: 7


#### 1.4 Airport Codes
This dataset contains airports and their respective codes.

https://datahub.io/core/airport-codes#data

In [64]:
df = pd.read_csv('airport-codes_csv.csv')

In [65]:
df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


From this dataset we are only concerned with US airports.

In [66]:
df = df[df['iso_country']=='US']

Are there any duplicate entries?

In [67]:
print('rows: {}, columns: {}'.format(*df.shape))

rows: 22757, columns: 12


In [70]:
print('unique rows: {}'.format(len(df['ident'].unique())))

unique rows: 22757


Fortunately, all registered airports are unique.

What are the unique types of airports?

In [72]:
print(df['type'].unique())

['heliport' 'small_airport' 'closed' 'seaplane_base' 'balloonport'
 'medium_airport' 'large_airport']


Some airports are marked as closed and should be removed.

In [75]:
df = df[df.type != 'closed']

In [76]:
print('rows: {}, columns: {}'.format(*df.shape))

rows: 21431, columns: 12


Only the columns needed to count airports of each type per state are kept for the data model; the rest of the columns will not be used.
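
The data model later stores the number of airports of each type per state. A minimal pandas sketch of that aggregation follows. It uses toy rows with hypothetical values and assumes that the state code is the part of `iso_region` after the `US-` prefix. The actual pipeline runs in Spark and may derive the state differently.

```python
import pandas as pd

# Toy rows (hypothetical values) in the shape of the filtered airport data.
airports = pd.DataFrame({
    'ident': ['A1', 'A2', 'A3', 'A4'],
    'type': ['heliport', 'small_airport', 'small_airport', 'large_airport'],
    'iso_region': ['US-PA', 'US-PA', 'US-KS', 'US-KS'],
})

# Assumption: 'US-PA' -> 'PA', i.e. the state code follows the 'US-' prefix.
airports['state_code'] = airports['iso_region'].str.split('-').str[1]

# Count airports of each type per state; absent combinations become 0.
state_airports = pd.crosstab(airports['state_code'], airports['type']).reset_index()
print(state_airports)
```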

### 2. Data Model
#### 2.1 Conceptual Data Model
The data are transformed into a star schema consisting of one fact table and three dimension tables. A star schema is chosen because of its simplicity and its effectiveness for handling simple queries.

<img src="ERD.PNG">

#### 2.2 Data dictionary 
A brief description of the entries in each table is provided here.

- **us_immigration table**
    - arrival_year: year of arrival in the U.S.
    - arrival_month: month of arrival in the U.S.
    - arrival_date: date of arrival in the U.S.
    - departure_date: date of departure from the U.S.
    - port_city: port of arrival
    - port_state_code: state of arrival port
    - origin_country: immigrant's country of origin
    - residence_country: immigrant's residence country
    - birth_year: birth year of immigrant
    - age: age of immigrant
    - gender: gender of immigrant
    - admission_num: admission number
    - admission_date: date on which admission took place
    - admitted_until: date until which the immigrant is allowed to stay in the U.S.
    - visa_category: visa category
    - visa_type: visa type
    - state: state which the immigrant visited after arrival
    - state_code: U.S. state abbreviation
    - transportation_mode: mode of transportation used to arrive in the U.S.
    - airline: airline used to arrive in the U.S. (if transportation_mode was 'Air')
    - flight_num: flight number (if transportation_mode was 'Air')
 
 
- **us_cities_demographics table**
    - city: city's name
    - state_code: U.S. state abbreviation
    - state: U.S. state full name
    - total_population: city's total population
    - male_population: city's total male population
    - female_population: city's total female population
    - median_age: median age of city's residents
    - average_household_size: average household size in the city
    - number_of_veterans: # of veterans in the city
    - foreign_born: # of foreign-born residents
    - asian: # of Asian residents
    - white: # of White residents
    - american_indian_and_alaska_native: # of American Indian and Alaska Native residents
    - black_or_african_american: # of Black or African-American residents
    - hispanic_or_latino: # of Hispanic or Latino residents
    

- **us_cities_temperature table**
    - city: city's name
    - average_temperature: average yearly temperature
    - average_temperature_uncertainty: average yearly uncertainty in temperature measurement
    
    
- **state_airports table**
    - state_code: U.S. state abbreviation
    - balloonport: # of balloonports in state
    - heliport: # of heliports in state
    - large_airport: # of large airports in state
    - medium_airport: # of medium airports in state
    - seaplane_base: # of seaplane bases in state
    - small_airport: # of small airports in state
    - state: U.S. state full name
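
As an illustration of how the star schema could be queried, the sketch below counts arrivals per state in the fact table and enriches the result with an attribute from the `state_airports` dimension through the shared `state_code` key. The column names follow the data dictionary above; the rows are toy data with hypothetical values, and pandas is used only to keep the example small.

```python
import pandas as pd

# Toy fact rows (hypothetical values); column names follow the data dictionary.
us_immigration = pd.DataFrame({
    'admission_num': [1, 2, 3],
    'state_code': ['CA', 'CA', 'NY'],
    'visa_type': ['B2', 'WT', 'WT'],
})

# Toy dimension rows (hypothetical values).
state_airports = pd.DataFrame({
    'state_code': ['CA', 'NY'],
    'large_airport': [5, 3],
})

# Arrivals per state, joined to the dimension table on state_code.
arrivals = (us_immigration.groupby('state_code')
            .size()
            .reset_index(name='arrivals')
            .merge(state_airports, on='state_code', how='left'))
print(arrivals)
```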

### 3. Data Pipelines
The steps needed to load the data into the above data model are described in separate Jupyter notebooks (**us_immigration.ipynb**, **us_cities_demographics.ipynb**, **us_cities_temperature.ipynb**, **state_airports.ipynb**), one for each dataset. In contrast to the initial exploration of the datasets here, only Spark is used for the actual processing. It is also possible to perform the whole ETL process by running etl.py on an EMR cluster.

### 4. Addressing other scenarios


 * The data is increased by 100x.
 
 If the data is increased by 100x, it is advisable to add more nodes to the EMR cluster and possibly to choose an instance type with more CPUs, such as m5.4xlarge.
 
 
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 
 To address this scenario, the data pipeline should be scheduled to run using Airflow. It is advisable to update only the immigration dataset for the current month on a daily basis, since the other datasets remain unchanged.
 
 
 * The database needs to be accessed by 100+ people.
 
 The output .parquet files should be copied into a Redshift database.