# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import configparser
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,isnan,when,count,avg
from pyspark.sql.types import TimestampType
from pyspark.sql import functions as F
import datetime

def sas_date_to_datetime(sas_date):
    '''
    Converts given SAS numeric date to datetime
    '''
    if sas_date is None:
        return None
    return str(datetime.date(1960, 1, 1) + datetime.timedelta(days=sas_date))

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

def generate_time_df(immigration_data):
    '''
    Returns a df with (timestamp,year,month,day,week,weekday) columns
    From given immigrations df with (SAS date column(s))
    '''
    # get non-null date columns first
    time_df = immigration_data.select(['arrival_date', 'departure_date']) \
            .filter(immigration_data['arrival_date'].isNotNull()) \
            .filter(immigration_data['departure_date'].isNotNull())

    # Start creating unified date time df
    arrival_df = time_df.select('arrival_date').dropDuplicates()
    departure_df = time_df.select('departure_date').dropDuplicates()
    unified_df = arrival_df.union(departure_df).dropDuplicates()

    reg_convert_sas_date = F.udf(lambda date: sas_date_to_datetime(date))

    # Apply sas date conversion function
    unified_df = unified_df.withColumn('arrivalDateAsDATE', reg_convert_sas_date(unified_df.arrival_date))

    # Add other columns
    unified_df = unified_df.withColumn('year', F.year('arrivalDateAsDATE'))
    unified_df = unified_df.withColumn('month', F.month('arrivalDateAsDATE'))
    unified_df = unified_df.withColumn('day', F.dayofmonth('arrivalDateAsDATE'))
    unified_df = unified_df.withColumn('week', F.weekofyear('arrivalDateAsDATE'))
    unified_df = unified_df.withColumn('weekday', F.dayofweek('arrivalDateAsDATE'))

    # Drop date string column since we no longer need it
    unified_df = unified_df.drop('arrivalDateAsDATE')

    # Rename 
    unified_df = unified_df.withColumnRenamed('arrival_date','timestamp')
    
    return unified_df

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

## Purpose
This project aims to build a single-source-of-truth database from the I-94 immigration records gathered throughout 2016.

### Considerations
* The unnamed column in the immigration data is dropped because no documentation is available for it

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

# 1) Clean and explore cities, ports and immigrations

## 1.1) Read Cities

In [2]:
# Explore us-cities-demographics.csv
# Read the semicolon-delimited CSV into a Spark DataFrame
city_df = spark.read.options(header="true",inferSchema="true",nullValue = "NULL",delimiter=";").csv('us-cities-demographics.csv')

# drop race and count columns
city_df = city_df.drop('Race','Count')

# Get non-null state code and city
city_df = city_df.filter(city_df['State Code'].isNotNull() & city_df['City'].isNotNull())

city_df.printSchema()
city_df.limit(10).toPandas()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)



Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ
5,Peoria,Illinois,33.1,56229,62432,118661,6634,7517,2.4,IL
6,Avondale,Arizona,29.1,38712,41971,80683,4815,8355,3.18,AZ
7,West Covina,California,39.8,51629,56860,108489,3800,37038,3.56,CA
8,O'Fallon,Missouri,36.0,41762,43270,85032,5783,3269,2.77,MO
9,High Point,North Carolina,35.5,51751,58077,109828,5204,16315,2.65,NC


## 1.2) Read Ports

In [3]:
# read
ports_df = spark.read.options(header="true",inferSchema="true",nullValue = "NULL").csv('airport-codes_csv.csv')

# Record count is 55075 initially
# Keep US records that have a non-null IATA code and are not closed
ports_df = ports_df.filter((ports_df.iata_code.isNotNull()) \
                            & (ports_df.iso_country == 'US') \
                            & (ports_df.type != 'closed') )

drop_cols = ('iso_country','gps_code','elevation_ft','local_code','coordinates')

ports_df = ports_df.drop(*drop_cols)

# Show some info
ports_df.printSchema()
ports_df.limit(10).toPandas()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- iata_code: string (nullable = true)



Unnamed: 0,ident,type,name,continent,iso_region,municipality,iata_code
0,07FA,small_airport,Ocean Reef Club Airport,,US-FL,Key Largo,OCA
1,0AK,small_airport,Pilot Station Airport,,US-AK,Pilot Station,PQS
2,0CO2,small_airport,Crested Butte Airpark,,US-CO,Crested Butte,CSE
3,0TE7,small_airport,LBJ Ranch Airport,,US-TX,Johnson City,JCY
4,13MA,small_airport,Metropolitan Airport,,US-MA,Palmer,PMX
5,13Z,seaplane_base,Loring Seaplane Base,,US-AK,Loring,WLR
6,16A,small_airport,Nunapitchuk Airport,,US-AK,Nunapitchuk,NUP
7,16K,seaplane_base,Port Alice Seaplane Base,,US-AK,Port Alice,PTC
8,19AK,small_airport,Icy Bay Airport,,US-AK,Icy Bay,ICY
9,19P,seaplane_base,Port Protection Seaplane Base,,US-AK,Port Protection,PPV


## 1.3) Combine cities and ports

In [4]:
# Join on city_df.City == ports_df.municipality
combined_df = city_df.join(ports_df, city_df.City == ports_df.municipality).dropDuplicates()
combined_df.printSchema()
combined_df.limit(20).toPandas()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- iata_code: string (nullable = true)



Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,ident,type,name,continent,iso_region,municipality,iata_code
0,Seattle,Washington,35.5,345659,338784,684443,29364,119840,2.13,WA,KBFI,large_airport,Boeing Field King County International Airport,,US-WA,Seattle,BFI
1,Springfield,Illinois,38.8,55639,62170,117809,7525,4264,2.22,IL,KVSF,small_airport,Hartness State (Springfield) Airport,,US-VT,Springfield,VSF
2,Santa Fe,New Mexico,44.1,40601,43511,84112,5083,13824,2.41,NM,KSAF,medium_airport,Santa Fe Municipal Airport,,US-NM,Santa Fe,SAF
3,Chattanooga,Tennessee,36.6,83640,92957,176597,10001,10599,2.4,TN,KCHA,large_airport,Lovell Field,,US-TN,Chattanooga,CHA
4,Madison,Wisconsin,30.7,122596,126360,248956,9707,30090,2.23,WI,KMBO,small_airport,Bruce Campbell Field,,US-MS,Madison,DXE
5,Omaha,Nebraska,34.2,218789,225098,443887,24503,48263,2.47,NE,KOFF,medium_airport,Offutt Air Force Base,,US-NE,Omaha,OFF
6,Portland,Oregon,36.7,313516,318671,632187,29940,86041,2.43,OR,KTTD,medium_airport,Portland Troutdale Airport,,US-OR,Portland,TTD
7,Los Angeles,California,35.0,1958998,2012898,3971896,85417,1485425,2.86,CA,KWHP,small_airport,Whiteman Airport,,US-CA,Los Angeles,WHP
8,Lafayette,Indiana,33.5,34313,36857,71170,5045,5697,2.19,IN,KLAF,medium_airport,Purdue University Airport,,US-IN,Lafayette,LAF
9,Shreveport,Louisiana,35.2,93138,103856,196994,14287,5658,2.53,LA,KSHV,medium_airport,Shreveport Regional Airport,,US-LA,Shreveport,SHV


### Cities are now matched to ports by city name
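
Matching on the city name alone can pair a city with an airport in a different state that happens to share the name: in the sample above, Springfield, Illinois is matched to an airport in US-VT, and Madison, Wisconsin to one in US-MS. A minimal sketch of a stricter match, written in plain Python rather than Spark, keys both sides on the pair (city, state code). The helper name `match_on_city_and_state` and the dict-based rows are assumptions for illustration, not part of the notebook's pipeline.

```python
def match_on_city_and_state(cities, ports):
    """Pair each city with ports that share both its name and its state code.

    Hypothetical helper for illustration: `cities` rows carry 'City' and
    'State Code'; `ports` rows carry 'municipality' and 'iso_region' ('US-XX').
    """
    ports_by_key = {}
    for port in ports:
        state = port["iso_region"].split("-")[1]  # 'US-WA' -> 'WA'
        ports_by_key.setdefault((port["municipality"], state), []).append(port)

    matches = []
    for city in cities:
        key = (city["City"], city["State Code"])
        for port in ports_by_key.get(key, []):
            matches.append((city["City"], city["State Code"], port["iata_code"]))
    return matches


# Rows taken from the outputs above
cities = [
    {"City": "Seattle", "State Code": "WA"},
    {"City": "Springfield", "State Code": "IL"},
]
ports = [
    {"municipality": "Seattle", "iso_region": "US-WA", "iata_code": "BFI"},
    {"municipality": "Springfield", "iso_region": "US-VT", "iata_code": "VSF"},
]

matched = match_on_city_and_state(cities, ports)
print(matched)  # the Vermont airport no longer matches Springfield, Illinois
```

In Spark, the same idea would be a join on two conditions: the city name and the state code derived from `iso_region`.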

## 1.4) Finalize ports and cities

### 1.4.1) Ports

In [5]:
# Extract the state part of 'iso_region' (e.g. 'US-FL' -> 'FL'), i.e. the second item after splitting on '-'
def split_iso_region(iso_region):
    return iso_region.split("-")[1]

# Register split function as a UDF
reg_split_iso_region = F.udf(split_iso_region)


final_ports_df = combined_df.select(['ident','name','municipality','type','iata_code','iso_region']) \
                            .withColumnRenamed('ident','port_id') \
                            .withColumnRenamed('municipality','city') 

# Apply split function to iso_region (withColumn keeps the column name, so no alias is needed)
final_ports_df = final_ports_df.withColumn('iso_region', reg_split_iso_region('iso_region'))
final_ports_df.printSchema()
#final_ports_df.limit(10).toPandas()

root
 |-- port_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- type: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- iso_region: string (nullable = true)
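
`split_iso_region` assumes that every value has the form `US-XX`. A defensive variant is sketched here in plain Python; the name `safe_split_iso_region` and the choice to return `None` for missing or malformed values are assumptions for illustration.

```python
def safe_split_iso_region(iso_region):
    """Return the second item after splitting on '-' ('US-FL' -> 'FL'), or None."""
    if iso_region is None:
        return None
    parts = iso_region.split("-")
    # A value without a '-' (or with an empty second part) has no state to extract
    return parts[1] if len(parts) > 1 and parts[1] else None


for value in ["US-FL", "US-AK", None, "US"]:
    print(value, "->", safe_split_iso_region(value))
```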



### 1.4.2) Cities

In [6]:
final_cities_df = combined_df.select(['City','State','Median Age','Male Population','Female Population','Total Population' \
                                      ,'Number of Veterans','Foreign-born','Average Household Size']) \
                            .withColumnRenamed('City','city') \
                            .withColumnRenamed('State','state') \
                            .withColumnRenamed('Median Age','median_age') \
                            .withColumnRenamed('Male Population','male_pop') \
                            .withColumnRenamed('Female Population','female_pop') \
                            .withColumnRenamed('Total Population','total_pop') \
                            .withColumnRenamed('Number of Veterans','veterans') \
                            .withColumnRenamed('Foreign-born','foreign-born') \
                            .withColumnRenamed('Average Household Size','avg_household_size')
final_cities_df.printSchema()
#final_cities_df.limit(10).toPandas()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- median_age: double (nullable = true)
 |-- male_pop: integer (nullable = true)
 |-- female_pop: integer (nullable = true)
 |-- total_pop: integer (nullable = true)
 |-- veterans: integer (nullable = true)
 |-- foreign-born: integer (nullable = true)
 |-- avg_household_size: double (nullable = true)



## 1.5) Immigrations

### Process the immigration data one monthly file at a time in a loop. Each iteration:
* Read data
* Repartition
* Drop duplicates, redundant columns, null records
* Finally,
    * JOIN ON conditions list = __[df.state == final_ports_df.iso_region, df.landing_port == final_ports_df.iata_code]__
    * joined_immig = df.join(final_ports_df, conditions)
    * Additionally, create time data using joined_immig (using arrdate and depdate columns)
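
The per-month loop described above can be sketched by building one input path per month. The path pattern is the one used in the next cell; the helper name `monthly_paths` and its parameters are assumptions for illustration.

```python
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


def monthly_paths(base_dir='../../data/18-83510-I94-Data-2016', year_suffix='16'):
    """Build the SAS input path for every month (hypothetical helper)."""
    return [f'{base_dir}/i94_{month}{year_suffix}_sub.sas7bdat' for month in MONTHS]


paths = monthly_paths()
for path in paths:
    # In the notebook, each path would be read, cleaned and joined in turn
    print(path)
```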

In [7]:
# These columns are either meaningless for the scope of the project or almost empty, so we are removing them
drop_cols = ("i94yr","i94mon","i94res","count","visapost","occup","entdepa","entdepd","entdepu","matflag" \
             ,"biryear","insnum","fltno","dtadfile","dtaddto","airline","admnum")

# Month name list
months = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']

#for month in months:
#immigration_data_path = f'../../data/18-83510-I94-Data-2016/i94_{month}16_sub.sas7bdat'
immigration_data_path = f'../../data/18-83510-I94-Data-2016/i94_feb16_sub.sas7bdat'

# Read data using month in input file name
df = spark.read.format('com.github.saurfang.sas.spark').load(immigration_data_path)
# repartition() returns a new DataFrame, so the result must be assigned back
df = df.repartition(6)
# Drop duplicates and null records, then drop the redundant columns
#print(f'Started cleaning of {month} data...')
#print(f"Record count BEFORE clean: {df.count()}, ", end="") # for debugging
# Filter before dropping: admnum and biryear are listed in drop_cols
df = df.dropDuplicates() \
        .filter(df.cicid.isNotNull()) \
        .filter(df.i94addr.isNotNull()) \
        .filter(df.admnum.isNotNull()) \
        .filter(df.biryear.isNotNull()) \
        .filter(df.i94port.isNotNull()) \
        .filter(df.arrdate.isNotNull()) \
        .filter(df.depdate.isNotNull()) \
        .filter(df.i94mode.isNotNull()) \
        .filter(df.gender.isNotNull()) \
        .drop(*drop_cols)

# Fix column names and types
df = df.withColumnRenamed('cicid','immigration_id') \
    .withColumnRenamed('i94cit','origin') \
    .withColumnRenamed('i94port','landing_port') \
    .withColumnRenamed('arrdate','arrival_date') \
    .withColumnRenamed('i94mode','arrival_mode') \
    .withColumnRenamed('i94addr','state') \
    .withColumnRenamed('depdate','departure_date') \
    .withColumnRenamed('i94bir','age') \
    .withColumnRenamed('i94visa','visa')

df = df.withColumn('immigration_id',df.immigration_id.cast('int'))

# Now generate time df from current immigration data using 'arrival_date' and 'departure_date' columns
time_df = generate_time_df(df)
time_df.printSchema()
time_df.limit(10).toPandas()

root
 |-- timestamp: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- week: integer (nullable = true)
 |-- weekday: integer (nullable = true)



Unnamed: 0,timestamp,year,month,day,week,weekday
0,20503.0,2016,2,19,7,6
1,20511.0,2016,2,27,8,7
2,20593.0,2016,5,19,20,5
3,20467.0,2016,1,14,2,5
4,18318.0,2010,2,25,8,5
5,20461.0,2016,1,8,1,6
6,20522.0,2016,3,9,10,4
7,20550.0,2016,4,6,14,4
8,20239.0,2015,5,31,22,1
9,20592.0,2016,5,18,20,4
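
The calendar columns above can be cross-checked without Spark. This sketch assumes, as `sas_date_to_datetime` does, that a SAS date counts days since 1960-01-01, and it mirrors Spark's conventions: `weekofyear` is the ISO week number and `dayofweek` runs from 1 (Sunday) to 7 (Saturday). The function name `sas_date_parts` is hypothetical.

```python
import datetime

SAS_EPOCH = datetime.date(1960, 1, 1)


def sas_date_parts(sas_date):
    """Return (year, month, day, week, weekday) for a SAS numeric date."""
    date = SAS_EPOCH + datetime.timedelta(days=int(sas_date))
    week = date.isocalendar()[1]          # ISO week number
    weekday = date.isoweekday() % 7 + 1   # 1 = Sunday ... 7 = Saturday
    return (date.year, date.month, date.day, week, weekday)


print(sas_date_parts(20503.0))  # (2016, 2, 19, 7, 6), the first row of the table above
```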


In [8]:
# Column names and types were already fixed in the previous cell:
# cicid was renamed to immigration_id and cast to int, and i94res was dropped.
# Inspect the cleaned DataFrame
df.printSchema()
df.show(2)

root
 |-- immigration_id: integer (nullable = true)
 |-- origin: double (nullable = true)
 |-- landing_port: string (nullable = true)
 |-- arrival_date: double (nullable = true)
 |-- arrival_mode: double (nullable = true)
 |-- state: string (nullable = true)
 |-- departure_date: double (nullable = true)
 |-- age: double (nullable = true)
 |-- visa: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- visatype: string (nullable = true)



# 2) Clean and explore GlobalLandandTemperaturesByCity
* Filter by 'United States'
* Drop redundant columns and rename the remaining ones
* Drop rows with null values
* Group by 'city' and calculate the average temperature and the average temperature uncertainty
* Resulting data frame can be written to S3

In [40]:
temps_df = spark.read.options(header="true",inferSchema="true",nullValue = "NULL").csv('../../data2/GlobalLandTemperaturesByCity.csv')

# Keep only United States records; other countries are outside the scope of this project
temps_df = temps_df.filter(temps_df.Country == 'United States')

# Drop redundant columns
temps_df = temps_df.drop("dt","Country","Latitude","Longitude")

# Keep only the columns we need and rename them
final_temperature_df = temps_df.select(
    col('City').alias('city'),
    col('AverageTemperature').alias('avg_temp'),
    col('AverageTemperatureUncertainty').alias('avg_temp_uncertainty'))

# Average both measures per city in a single aggregation
final_uni_df = final_temperature_df.groupBy('city').agg(
    avg('avg_temp').alias('avg_temp'),
    avg('avg_temp_uncertainty').alias('avg_temp_uncertainty'))
final_uni_df.printSchema()
final_uni_df.show(5)
final_uni_df.count()

root
 |-- city: string (nullable = true)
 |-- avg_temp: double (nullable = true)
 |-- avg_temp_uncertainty: double (nullable = true)

+-----------+------------------+--------------------+
|       city|          avg_temp|avg_temp_uncertainty|
+-----------+------------------+--------------------+
|  Worcester| 7.341440525809558|  1.3742648284706618|
| Charleston|18.696557871112546|  1.4356107726835539|
|     Corona| 16.12483712696008|  0.7674734446130481|
|Springfield|10.647931343609901|  1.3296092707991722|
|      Tempe|  21.0487690509584|  0.7654862085086479|
+-----------+------------------+--------------------+
only showing top 5 rows



248

## Immigrations skewness check

In [8]:
immig_feb_df.groupBy('i94visa').count().orderBy('count', ascending=False).show()
immig_feb_df.groupBy('i94cit').count().orderBy('count', ascending=False).show()
immig_feb_df.groupBy('i94res').count().orderBy('count', ascending=False).show()

+-------+-------+
|i94visa|  count|
+-------+-------+
|    2.0|2077101|
|    1.0| 446667|
|    3.0|  46775|
+-------+-------+

+------+------+
|i94cit| count|
+------+------+
| 135.0|278054|
| 209.0|237686|
| 245.0|216674|
| 254.0|151965|
| 582.0|147002|
| 689.0|122098|
| 148.0|120220|
| 111.0|110774|
| 438.0| 69380|
| 687.0| 66559|
| 117.0| 63484|
| 252.0| 62114|
| 213.0| 59423|
| 129.0| 45837|
| 691.0| 42396|
| 123.0| 41116|
| 130.0| 40593|
| 268.0| 37959|
| 690.0| 32880|
| 692.0| 32161|
+------+------+
only showing top 20 rows

+------+------+
|i94res| count|
+------+------+
| 209.0|301323|
| 135.0|281040|
| 245.0|210772|
| 582.0|150539|
| 276.0|150529|
| 689.0|127326|
| 112.0|118443|
| 111.0|107629|
| 687.0| 71000|
| 438.0| 70774|
| 213.0| 57103|
| 117.0| 52487|
| 691.0| 42417|
| 130.0| 40383|
| 123.0| 39625|
| 129.0| 38599|
| 268.0| 37622|
| 690.0| 33974|
| 692.0| 33090|
| 696.0| 32357|
+------+------+
only showing top 20 rows



In [27]:
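# i94visa (renamed to 'visa') looks skewed, so repartition on that column.
# Repartitioning by column without a count uses spark.sql.shuffle.partitions (200 by default)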
df.rdd.getNumPartitions()
partitioned_df = df.repartition('visa')
partitioned_df.rdd.getNumPartitions()

200
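
To put a number on the skew, the `i94visa` counts shown above can be turned into shares. This plain-Python sketch also illustrates why repartitioning on a column with three distinct values cannot fill more than three partitions, whatever the partition count. The variable names are illustrative.

```python
# Row counts per i94visa value, copied from the output above
visa_counts = {2.0: 2077101, 1.0: 446667, 3.0: 46775}

total = sum(visa_counts.values())
shares = {visa: count / total for visa, count in visa_counts.items()}

for visa, share in shares.items():
    print(f"visa {visa}: {share:.1%} of rows")

# Rows with the same key always land in the same partition, so the number of
# non-empty partitions is bounded by the number of distinct keys
num_partitions = 200
max_non_empty = min(len(visa_counts), num_partitions)
print(f"at most {max_non_empty} of {num_partitions} partitions hold data")
```

Under these assumptions, repartitioning on `visa` would place roughly 81% of the rows in a single partition, so a higher-cardinality key may spread the work more evenly.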

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [21]:
# Join conditions as a list
conditions = [df.state == final_ports_df.iso_region, df.landing_port == final_ports_df.iata_code]

joined_immig = df.join(final_ports_df, conditions)

joined_immig.printSchema()
joined_immig.limit(10).toPandas()
print(f'Joined immig count: {joined_immig.count()}')

root
 |-- immigration_id: integer (nullable = true)
 |-- origin: double (nullable = true)
 |-- landing_port: string (nullable = true)
 |-- arrival_date: double (nullable = true)
 |-- arrival_mode: double (nullable = true)
 |-- state: string (nullable = true)
 |-- departure_date: double (nullable = true)
 |-- age: double (nullable = true)
 |-- visa: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- visatype: string (nullable = true)
 |-- port_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- type: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- continent: string (nullable = true)



Unnamed: 0,immigration_id,origin,landing_port,arrival_date,arrival_mode,state,departure_date,age,visa,gender,visatype,port_id,name,city,type,iata_code,iso_region,continent
0,75354,107.0,BOS,20485.0,1.0,MA,20488.0,28.0,2.0,F,B2,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
1,81940,116.0,BOS,20485.0,1.0,MA,20486.0,40.0,1.0,M,WB,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
2,85982,123.0,BOS,20485.0,1.0,MA,20492.0,64.0,2.0,F,WT,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
3,126885,245.0,BOS,20485.0,1.0,MA,20491.0,35.0,2.0,F,B2,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
4,294117,213.0,BOS,20486.0,1.0,MA,20488.0,59.0,1.0,M,B1,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
5,304039,245.0,BOS,20486.0,1.0,MA,20490.0,26.0,2.0,F,B2,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
6,304212,245.0,BOS,20486.0,1.0,MA,20599.0,26.0,3.0,F,F1,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
7,307803,251.0,BOS,20486.0,1.0,MA,20500.0,66.0,2.0,M,B2,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
8,435173,148.0,BOS,20487.0,1.0,MA,20495.0,24.0,2.0,F,WT,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,
9,460876,245.0,BOS,20487.0,1.0,MA,20505.0,54.0,2.0,M,B2,KBOS,General Edward Lawrence Logan International Ai...,Boston,large_airport,BOS,MA,


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
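
A minimal sketch of two of the checks listed above, a count check and a unique-key check, written in plain Python over lists of dicts so that it runs without Spark. The function names and the sample rows are hypothetical; in the pipeline, equivalent assertions would run against the final tables.

```python
def check_not_empty(rows, table_name):
    """Count check: fail if the table has no rows."""
    if len(rows) == 0:
        raise ValueError(f"{table_name} is empty")
    return len(rows)


def check_unique_key(rows, key, table_name):
    """Integrity check: fail if any value of `key` is missing or repeated."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            raise ValueError(f"{table_name}: missing or duplicate {key}: {value!r}")
        seen.add(value)
    return len(seen)


# Hypothetical sample rows shaped like the ports table
sample_ports = [
    {"port_id": "KBOS", "iata_code": "BOS"},
    {"port_id": "KBFI", "iata_code": "BFI"},
]

row_count = check_not_empty(sample_ports, "ports")
key_count = check_unique_key(sample_ports, "port_id", "ports")
print(f"ports: {row_count} rows, {key_count} distinct port_id values")
```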

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.