# Analysis on tourists to the United States
### Data Engineering Capstone Project

#### Project Summary
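Prepare data for analyzing the travel behavior of tourists to the United States. The goal is flexible querying on a shared platform that everyone can use, which points to an OLAP workload on Amazon Redshift.
The data is updated in batches.
The dimensional model should be easy to understand and perform well.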

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [3]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as R, StructField as Fld, DoubleType as Dbl, StringType as Str, IntegerType as Int, DateType as Date, TimestampType as Timestamp
from datetime import datetime, timedelta
from pyspark.sql import types as T
import pyspark.sql.functions
from pyspark.sql.functions import udf
from pyspark.sql.functions import col
import os
import configparser

In [None]:
# create the config object and read the cfg file
config = configparser.ConfigParser()
config.read('dwh.cfg')
# access the AWS IAM user credentials from dwh.cfg using the config object
os.environ['AWS_ACCESS_KEY_ID']=config['AWS CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS CREDS']['AWS_SECRET_ACCESS_KEY']
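
For reference, the cell above expects `dwh.cfg` to contain an `AWS CREDS` section with the two keys it reads. The sketch below shows one layout that would satisfy it. The values are placeholders, and the helper name `load_aws_credentials` is hypothetical; it parses from a string so the example runs without the real file.

```python
import configparser

# Hypothetical contents of dwh.cfg; the values are placeholders, not real keys.
SAMPLE_CFG = """
[AWS CREDS]
AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = YOUR_SECRET_ACCESS_KEY
"""

def load_aws_credentials(cfg_text):
    """Return the two AWS credential values from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["AWS CREDS"]
    return {
        "AWS_ACCESS_KEY_ID": section["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": section["AWS_SECRET_ACCESS_KEY"],
    }

credentials = load_aws_credentials(SAMPLE_CFG)
print(sorted(credentials))
```

Keeping the credentials in a config file that stays out of version control avoids hard-coding them in the notebook.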

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

In [5]:
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.2").enableHiveSupport().getOrCreate()

In [6]:
df_immigration = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [8]:
	
df_immigration.printSchema()


In [None]:
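# convert a SAS date (number of days since 1960-01-01) to a Python date
def convert_datetime(x):
    try:
        start = datetime(1960, 1, 1)
        return start + timedelta(days=int(x))
    except (TypeError, ValueError, OverflowError):
        # missing or non-numeric values become null
        return None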
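
The conversion can be checked outside Spark with plain Python. This is a minimal sketch of the same idea, assuming the input may arrive as an int, a float, or a numeric string; the name `sas_days_to_date` is hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS stores dates as days since this day

def sas_days_to_date(value):
    """Convert a SAS day count (int, float, or numeric string) to a date, or None."""
    try:
        return SAS_EPOCH + timedelta(days=int(float(value)))
    except (TypeError, ValueError, OverflowError):
        return None

print(sas_days_to_date(20545.0))  # arrdate from the sample rows -> 2016-04-01
print(sas_days_to_date(None))     # missing value -> None
```

The arrival value 20545 from the sample rows maps to 1 April 2016, which agrees with `i94yr` 2016 and `i94mon` 4 in the same rows.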

In [None]:
# convert arrival and departure date to date
udf_datetime_from_sas = udf(lambda x: convert_datetime(x), T.DateType())
df_immigration = df_immigration.withColumn("arrival_date", udf_datetime_from_sas("arrdate")).withColumn("departure_date", udf_datetime_from_sas("depdate"))

In [None]:
df_immigration.show(5)

In [None]:
df_immigration.printSchema()

In [None]:
# make sure that cicid is not null since it is the primary key
# (cicid is numeric, so test for null instead of comparing against an empty string)
df_immigration2 = df_immigration.filter(df_immigration.cicid.isNotNull()).dropDuplicates()
df_immigration2.show(5)
df_immigration.show(5)

In [None]:
# load airport data
df_airport = spark.read.format("csv").option("header", True).load("airport-codes_csv.csv")
df_airport.show(5)

In [None]:
# define the demographics schema explicitly so that column types are not left to inference
demoSchema = R([
    Fld("city",Str()),
    Fld("state_name",Str()),
    Fld("median_age",Dbl()),
    Fld("male_population",Dbl()),
    Fld("female_population",Dbl()),
    Fld("total_population",Dbl()),
    Fld("number_of_veterans",Dbl()),
    Fld("foreign_born",Dbl()),
    Fld("avg_household_size",Dbl()),
    Fld("state_code",Str()),
    Fld("race",Str()),
    Fld("count",Dbl()),
])

In [None]:
# load demographics data
df_us_demograhics = spark.read.format("csv").option("header", True).option("delimiter", ";").schema(demoSchema).load("us-cities-demographics.csv")
df_us_demograhics.show(5)

In [11]:
# write the immigration data to parquet and read it back
df_immigration.write.mode("overwrite").parquet("sas_data")
df_immigration = spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
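
As an illustration of this kind of check, the sketch below counts missing values per column and duplicated key values on a few made-up rows shaped like the immigration sample. It uses plain Python rather than Spark so it can run anywhere; `profile_rows` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def profile_rows(rows, key):
    """Count missing values per column and duplicated key values in a list of dict rows."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = {k: n for k, n in key_counts.items() if n > 1}
    return dict(missing), duplicates

# Made-up rows shaped like the immigration sample
rows = [
    {"cicid": 6.0, "i94port": "XXX", "gender": None},
    {"cicid": 7.0, "i94port": "ATL", "gender": "M"},
    {"cicid": 7.0, "i94port": "ATL", "gender": "M"},
]
missing, duplicates = profile_rows(rows, key="cicid")
print(missing)     # columns with missing values
print(duplicates)  # key values that occur more than once
```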

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Cleaning tasks for the airport data
# select the columns needed for the airport table
df_airport = df_airport.select('iata_code', 'name', 'iso_country','iso_region','municipality','coordinates')
df_airport.head()


In [None]:
# filter blank iata_codes out since this column will be a primary key and drop duplicates
df_airport = df_airport.filter(df_airport.iata_code != '').dropDuplicates()
# filter to only US airports
df_airport = df_airport.filter(df_airport.iso_country == 'US').dropDuplicates()
df_airport.show(5)

In [None]:
# split iso_region in country and state
split_col = pyspark.sql.functions.split(df_airport['iso_region'], '-')
df_airport = df_airport.withColumn('country_code', split_col.getItem(0))
df_airport = df_airport.withColumn('state_code', split_col.getItem(1))
df_airport.show(5)

In [None]:
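# split coordinates into two columns
# NOTE: verify the order of the two values in the source file; if the field is
# stored as "longitude, latitude", swap the two getItem indices below
split_col = pyspark.sql.functions.split(df_airport['coordinates'], ', ')
df_airport = df_airport.withColumn('latitude', split_col.getItem(0))
df_airport = df_airport.withColumn('longitude', split_col.getItem(1))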
df_airport.show(5)

In [None]:
df_airport = df_airport.drop('coordinates')
df_airport = df_airport.drop('country_code')
df_airport.show(5)

In [None]:
# clean demographics
# select relevant columns
df_us_demograhics = df_us_demograhics.select('city', 'state_name', 'median_age','male_population','female_population','total_population', 'foreign_born', 'avg_household_size', 'state_code').dropDuplicates()
df_us_demograhics.show(5)

In [None]:
# ensure that city is not null since it is the primary key
df_us_demograhics = df_us_demograhics.filter(df_us_demograhics.city != '')
df_us_demograhics.show(5)

In [None]:
df_us_demograhics.printSchema()
#demo done

In [None]:
# clean immigration data
df_immigration.show(5)

In [None]:
# select only relevant columns
df_immigration = df_immigration.select('cicid', col("i94yr").alias("year"), 
                                       col("i94mon").alias("month"),
                                       col("i94cit").alias("city_code_origin"),
                                       col("i94res").alias("country_code_residence"),
                                       col("i94port").alias("city_code_destination"),
                                       col("arrival_date"),
                                       col("i94mode").alias("travel_code"),
                                       col("i94addr").alias("state_code_residence"),
                                       col("departure_date"),
                                       col("i94visa").alias("visa"),
                                       col("biryear").alias("birth_year"),
                                       col("gender"),
                                       col("airline")
                                      ).distinct()
df_immigration.show(5)

In [None]:
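def split_codes_to_dict(string, separator):
    """Parse lines of the form 'code<separator>description' into a dictionary."""
    dictionary = {}
    for line in string.split("\n"):
        line = line.strip()
        # skip blank lines and lines without the separator
        if separator not in line:
            continue
        # split into code and description at the first separator only
        code, description = line.split(separator, 1)
        # strip stray whitespace so keys like "0 " become "0"
        dictionary[code.strip()] = description.strip()
    return dictionary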
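
A more forgiving variant is sketched below. It assumes only that each line contains an `=` somewhere, and it trims whitespace and quotes on both sides, so irregular spacing such as `0  =  '...'` does not leak into the keys. The name `parse_code_labels` and the sample text are illustrative.

```python
def parse_code_labels(text, separator="="):
    """Parse lines like "1 = 'Air'" into {"1": "Air"}."""
    labels = {}
    for line in text.splitlines():
        code, found, label = line.partition(separator)
        if not found:
            continue  # blank line or line without a separator
        # trim whitespace, then surrounding quotes, then any space left inside them
        labels[code.strip()] = label.strip().strip("'").strip()
    return labels

sample = """
   1 = 'Air'
   2 = 'Sea'
     0  =  'INVALID: STATELESS'
   687 =  'ARGENTINA '
"""
print(parse_code_labels(sample))
```

Because `str.partition` splits at the first separator only, descriptions that contain further `=` characters would stay intact.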

In [None]:
# country codes and descriptions for country of citizenship and country of residence
#I94CIT & I94RES
country_codes= """
   582 =  'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)'
   236 =  'AFGHANISTAN'
   101 =  'ALBANIA'
   316 =  'ALGERIA'
   102 =  'ANDORRA'
   324 =  'ANGOLA'
   529 =  'ANGUILLA'
   518 =  'ANTIGUA-BARBUDA'
   687 =  'ARGENTINA '
   151 =  'ARMENIA'
   532 =  'ARUBA'
   438 =  'AUSTRALIA'
   103 =  'AUSTRIA'
   152 =  'AZERBAIJAN'
   512 =  'BAHAMAS'
   298 =  'BAHRAIN'
   274 =  'BANGLADESH'
   513 =  'BARBADOS'
   104 =  'BELGIUM'
   581 =  'BELIZE'
   386 =  'BENIN'
   509 =  'BERMUDA'
   153 =  'BELARUS'
   242 =  'BHUTAN'
   688 =  'BOLIVIA'
   717 =  'BONAIRE, ST EUSTATIUS, SABA' 
   164 =  'BOSNIA-HERZEGOVINA'
   336 =  'BOTSWANA'
   689 =  'BRAZIL'
   525 =  'BRITISH VIRGIN ISLANDS'
   217 =  'BRUNEI'
   105 =  'BULGARIA'
   393 =  'BURKINA FASO'
   243 =  'BURMA'
   375 =  'BURUNDI'
   310 =  'CAMEROON'
   326 =  'CAPE VERDE'
   526 =  'CAYMAN ISLANDS'
   383 =  'CENTRAL AFRICAN REPUBLIC'
   384 =  'CHAD'
   690 =  'CHILE'
   245 =  'CHINA, PRC'
   721 =  'CURACAO' 
   270 =  'CHRISTMAS ISLAND'
   271 =  'COCOS ISLANDS'
   691 =  'COLOMBIA'
   317 =  'COMOROS'
   385 =  'CONGO'
   467 =  'COOK ISLANDS'
   575 =  'COSTA RICA'
   165 =  'CROATIA'
   584 =  'CUBA'
   218 =  'CYPRUS'
   140 =  'CZECH REPUBLIC'
   723 =  'FAROE ISLANDS (PART OF DENMARK)'  
   108 =  'DENMARK'
   322 =  'DJIBOUTI'
   519 =  'DOMINICA'
   585 =  'DOMINICAN REPUBLIC'
   240 =  'EAST TIMOR'
   692 =  'ECUADOR'
   368 =  'EGYPT'
   576 =  'EL SALVADOR'
   399 =  'EQUATORIAL GUINEA'
   372 =  'ERITREA'
   109 =  'ESTONIA'
   369 =  'ETHIOPIA'
   604 =  'FALKLAND ISLANDS'
   413 =  'FIJI'
   110 =  'FINLAND'
   111 =  'FRANCE'
   601 =  'FRENCH GUIANA'
   411 =  'FRENCH POLYNESIA'
   387 =  'GABON'
   338 =  'GAMBIA'
   758 =  'GAZA STRIP' 
   154 =  'GEORGIA'
   112 =  'GERMANY'
   339 =  'GHANA'
   143 =  'GIBRALTAR'
   113 =  'GREECE'
   520 =  'GRENADA'
   507 =  'GUADELOUPE'
   577 =  'GUATEMALA'
   382 =  'GUINEA'
   327 =  'GUINEA-BISSAU'
   603 =  'GUYANA'
   586 =  'HAITI'
   726 =  'HEARD AND MCDONALD IS.'
   149 =  'HOLY SEE/VATICAN'
   528 =  'HONDURAS'
   206 =  'HONG KONG'
   114 =  'HUNGARY'
   115 =  'ICELAND'
   213 =  'INDIA'
   759 =  'INDIAN OCEAN AREAS (FRENCH)' 
   729 =  'INDIAN OCEAN TERRITORY' 
   204 =  'INDONESIA'
   249 =  'IRAN'
   250 =  'IRAQ'
   116 =  'IRELAND'
   251 =  'ISRAEL'
   117 =  'ITALY'
   388 =  'IVORY COAST'
   514 =  'JAMAICA'
   209 =  'JAPAN'
   253 =  'JORDAN'
   201 =  'KAMPUCHEA'
   155 =  'KAZAKHSTAN'
   340 =  'KENYA'
   414 =  'KIRIBATI'
   732 =  'KOSOVO' 
   272 =  'KUWAIT'
   156 =  'KYRGYZSTAN'
   203 =  'LAOS'
   118 =  'LATVIA'
   255 =  'LEBANON'
   335 =  'LESOTHO'
   370 =  'LIBERIA'
   381 =  'LIBYA'
   119 =  'LIECHTENSTEIN'
   120 =  'LITHUANIA'
   121 =  'LUXEMBOURG'
   214 =  'MACAU'
   167 =  'MACEDONIA'
   320 =  'MADAGASCAR'
   345 =  'MALAWI'
   273 =  'MALAYSIA'
   220 =  'MALDIVES'
   392 =  'MALI'
   145 =  'MALTA'
   472 =  'MARSHALL ISLANDS'
   511 =  'MARTINIQUE'
   389 =  'MAURITANIA'
   342 =  'MAURITIUS'
   760 =  'MAYOTTE (AFRICA - FRENCH)' 
   473 =  'MICRONESIA, FED. STATES OF'
   157 =  'MOLDOVA'
   122 =  'MONACO'
   299 =  'MONGOLIA'
   735 =  'MONTENEGRO' 
   521 =  'MONTSERRAT'
   332 =  'MOROCCO'
   329 =  'MOZAMBIQUE'
   371 =  'NAMIBIA'
   440 =  'NAURU'
   257 =  'NEPAL'
   123 =  'NETHERLANDS'
   508 =  'NETHERLANDS ANTILLES'
   409 =  'NEW CALEDONIA'
   464 =  'NEW ZEALAND'
   579 =  'NICARAGUA'
   390 =  'NIGER'
   343 =  'NIGERIA'
   470 =  'NIUE'
   275 =  'NORTH KOREA'
   124 =  'NORWAY'
   256 =  'OMAN'
   258 =  'PAKISTAN'
   474 =  'PALAU'
   743 =  'PALESTINE' 
   504 =  'PANAMA'
   441 =  'PAPUA NEW GUINEA'
   693 =  'PARAGUAY'
   694 =  'PERU'
   260 =  'PHILIPPINES'
   416 =  'PITCAIRN ISLANDS'
   107 =  'POLAND'
   126 =  'PORTUGAL'
   297 =  'QATAR'
   748 =  'REPUBLIC OF SOUTH SUDAN'
   321 =  'REUNION'
   127 =  'ROMANIA'
   158 =  'RUSSIA'
   376 =  'RWANDA'
   128 =  'SAN MARINO'
   330 =  'SAO TOME AND PRINCIPE'
   261 =  'SAUDI ARABIA'
   391 =  'SENEGAL'
   142 =  'SERBIA AND MONTENEGRO'
   745 =  'SERBIA' 
   347 =  'SEYCHELLES'
   348 =  'SIERRA LEONE'
   207 =  'SINGAPORE'
   141 =  'SLOVAKIA'
   166 =  'SLOVENIA'
   412 =  'SOLOMON ISLANDS'
   397 =  'SOMALIA'
   373 =  'SOUTH AFRICA'
   276 =  'SOUTH KOREA'
   129 =  'SPAIN'
   244 =  'SRI LANKA'
   346 =  'ST. HELENA'
   522 =  'ST. KITTS-NEVIS'
   523 =  'ST. LUCIA'
   502 =  'ST. PIERRE AND MIQUELON'
   524 =  'ST. VINCENT-GRENADINES'
   716 =  'SAINT BARTHELEMY' 
   736 =  'SAINT MARTIN' 
   749 =  'SAINT MAARTEN' 
   350 =  'SUDAN'
   602 =  'SURINAME'
   351 =  'SWAZILAND'
   130 =  'SWEDEN'
   131 =  'SWITZERLAND'
   262 =  'SYRIA'
   268 =  'TAIWAN'
   159 =  'TAJIKISTAN'
   353 =  'TANZANIA'
   263 =  'THAILAND'
   304 =  'TOGO'
   417 =  'TONGA'
   516 =  'TRINIDAD AND TOBAGO'
   323 =  'TUNISIA'
   264 =  'TURKEY'
   161 =  'TURKMENISTAN'
   527 =  'TURKS AND CAICOS ISLANDS'
   420 =  'TUVALU'
   352 =  'UGANDA'
   162 =  'UKRAINE'
   296 =  'UNITED ARAB EMIRATES'
   135 =  'UNITED KINGDOM'
   695 =  'URUGUAY'
   163 =  'UZBEKISTAN'
   410 =  'VANUATU'
   696 =  'VENEZUELA'
   266 =  'VIETNAM'
   469 =  'WALLIS AND FUTUNA ISLANDS'
   757 =  'WEST INDIES (FRENCH)' 
   333 =  'WESTERN SAHARA'
   465 =  'WESTERN SAMOA'
   216 =  'YEMEN'
   139 =  'YUGOSLAVIA'
   301 =  'ZAIRE'
   344 =  'ZAMBIA'
   315 =  'ZIMBABWE'
   403 =  'INVALID: AMERICAN SAMOA'
   712 =  'INVALID: ANTARCTICA' 
   700 =  'INVALID: BORN ON BOARD SHIP'
   719 =  'INVALID: BOUVET ISLAND (ANTARCTICA/NORWAY TERR.)'
   574 =  'INVALID: CANADA'
   720 =  'INVALID: CANTON AND ENDERBURY ISLS' 
   106 =  'INVALID: CZECHOSLOVAKIA'
   739 =  'INVALID: DRONNING MAUD LAND (ANTARCTICA-NORWAY)' 
   394 =  'INVALID: FRENCH SOUTHERN AND ANTARCTIC'
   501 =  'INVALID: GREENLAND'
   404 =  'INVALID: GUAM'
   730 =  'INVALID: INTERNATIONAL WATERS' 
   731 =  'INVALID: JOHNSON ISLAND' 
   471 =  'INVALID: MARIANA ISLANDS, NORTHERN'
   737 =  'INVALID: MIDWAY ISLANDS' 
   753 =  'INVALID: MINOR OUTLYING ISLANDS - USA'
   740 =  'INVALID: NEUTRAL ZONE (S. ARABIA/IRAQ)' 
   710 =  'INVALID: NON-QUOTA IMMIGRANT'
   505 =  'INVALID: PUERTO RICO'
    0  =  'INVALID: STATELESS'
   705 =  'INVALID: STATELESS'
   583 =  'INVALID: UNITED STATES'
   407 =  'INVALID: UNITED STATES'
   999 =  'INVALID: UNKNOWN'
   239 =  'INVALID: UNKNOWN COUNTRY'
   134 =  'INVALID: USSR'
   506 =  'INVALID: U.S. VIRGIN ISLANDS'
   755 =  'INVALID: WAKE ISLAND'  
   311 =  'Collapsed Tanzania (should not show)'
   741 =  'Collapsed Curacao (should not show)'
    54 =  'No Country Code (54)'
   100 =  'No Country Code (100)'
   187 =  'No Country Code (187)'
   190 =  'No Country Code (190)'
   200 =  'No Country Code (200)'
   219 =  'No Country Code (219)'
   238 =  'No Country Code (238)'
   277 =  'No Country Code (277)'
   293 =  'No Country Code (293)'
   300 =  'No Country Code (300)'
   319 =  'No Country Code (319)'
   365 =  'No Country Code (365)'
   395 =  'No Country Code (395)'
   400 =  'No Country Code (400)'
   485 =  'No Country Code (485)'
   503 =  'No Country Code (503)'
   589 =  'No Country Code (589)'
   592 =  'No Country Code (592)'
   791 =  'No Country Code (791)'
   849 =  'No Country Code (849)'
   914 =  'No Country Code (914)'
   944 =  'No Country Code (944)'
   996 =  'No Country Code (996)'"""

In [None]:
# remove quotes
country_codes = country_codes.replace("'",'')

In [None]:
data = split_codes_to_dict(country_codes, " =  ")
#print(data)
# a dict of scalar values cannot be passed to pd.DataFrame directly,
# so build the frame from (code, description) pairs
df = pd.DataFrame(list(data.items()), columns=["code", "country"])
#ddf = spark.createDataFrame(df)
#turn dict into data frame and write as parquet to s3
#https://kontext.tech/column/spark/366/convert-python-dictionary-list-to-pyspark-dataframe

In [None]:
df.head(5)

In [None]:
#travel code and description and remove quotes
travel_code = """
   1 = 'Air'
   2 = 'Sea'
   3 = 'Land'
   9 = 'Not reported'""".replace("'",'')

In [None]:
split_codes_to_dict(travel_code, " = ") 

In [None]:
visa_codes = """
   1 = 'Business'
   2 = 'Pleasure'
   3 = 'Student'
""".replace("'",'')

In [None]:
split_codes_to_dict(visa_codes, " = ") 
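
One detail to watch when joining these lookups to the immigration data: the sample rows show the codes as floats (for example `i94mode` is `1.0`), while the parsed dictionary keys are strings such as `"1"`. The sketch below normalizes a raw code before the lookup. The helper `lookup_label` and the `"Unknown"` fallback are assumptions for illustration.

```python
# Travel mode labels as listed in the cell above
TRAVEL_MODES = {"1": "Air", "2": "Sea", "3": "Land", "9": "Not reported"}

def lookup_label(code, labels, default="Unknown"):
    """Map a raw code such as 1.0 or "1" to its label; fall back to default."""
    try:
        key = str(int(float(code)))
    except (TypeError, ValueError, OverflowError):
        return default
    return labels.get(key, default)

print(lookup_label(1.0, TRAVEL_MODES))   # Air
print(lookup_label(None, TRAVEL_MODES))  # Unknown
```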

In [None]:
#Write to S3
#output_data = "s3a://data-engineering-tourists-to-us-analysis/"
output_data = "output/" #create your own bucket

In [None]:
df_us_demograhics.write.mode('overwrite').parquet(output_data + "us-demographics.parquet")

In [None]:
#output_data = "s3a://data-engineering-tourists-to-us-analysis/"
df_immigration.write.mode('overwrite').parquet(output_data + "us-immigration.parquet")


In [None]:
#output_data = 's3a://data-engineering-tourists-to-us-analysis/'
#output_data = "output/" #create your own bucket
df_airport.write.mode('overwrite').parquet(output_data + "airport.parquet")

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
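
The checks listed above can be sketched in plain Python as follows. This is a minimal illustration on made-up rows, not the notebook's final implementation; the function names are hypothetical, and in the pipeline the same logic would run against the Spark DataFrames or the Redshift tables.

```python
def check_not_empty(rows, table):
    """Count check: fail if the table has no rows."""
    if not rows:
        raise ValueError(f"{table}: no rows loaded")
    return len(rows)

def check_unique_key(rows, key, table):
    """Integrity check: the key must be present and unique in every row."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None or value == "":
            raise ValueError(f"{table}: missing {key}")
        if value in seen:
            raise ValueError(f"{table}: duplicate {key} {value!r}")
        seen.add(value)
    return len(seen)

def check_counts_match(source_count, target_count, table):
    """Completeness check: the target must hold as many rows as the source."""
    if source_count != target_count:
        raise ValueError(f"{table}: expected {source_count} rows, found {target_count}")
    return True

# Made-up airport rows for illustration
airports = [{"iata_code": "ATL"}, {"iata_code": "JFK"}]
print(check_not_empty(airports, "airport"))
print(check_unique_key(airports, "iata_code", "airport"))
print(check_counts_match(2, len(airports), "airport"))
```

Raising an exception on failure lets a batch run stop before incomplete data reaches the analysts.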
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
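
One lightweight way to keep the data dictionary next to the code is a Python mapping that is rendered as a table. The sketch below covers a few immigration columns, with source fields and code meanings taken from the cells above; the structure and the `to_markdown` helper are assumptions, and the remaining columns would be added in the same way.

```python
# Each entry: column name -> (source field, description)
IMMIGRATION_DICTIONARY = {
    "cicid": ("cicid", "Record identifier, used as the primary key"),
    "arrival_date": ("arrdate", "Arrival date, converted from a SAS day count"),
    "departure_date": ("depdate", "Departure date, converted from a SAS day count"),
    "travel_code": ("i94mode", "Mode of travel: 1 Air, 2 Sea, 3 Land, 9 Not reported"),
    "visa": ("i94visa", "Visa category: 1 Business, 2 Pleasure, 3 Student"),
}

def to_markdown(dictionary):
    """Render the dictionary as a markdown table, one row per column."""
    lines = ["| Column | Source field | Description |", "| --- | --- | --- |"]
    for column, (source, description) in dictionary.items():
        lines.append(f"| {column} | {source} | {description} |")
    return "\n".join(lines)

print(to_markdown(IMMIGRATION_DICTIONARY))
```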

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.