# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import pandas as pd
import psycopg2
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *


### Step 1: Scope the Project and Gather Data

#### Scope 
The goal of the project is to enrich the US I94 immigration data with additional data (airport codes and city demographics) so that analysis of the immigration data yields more insight.


#### I94 Immigration Data
The data comes from the US National Tourism and Trade Office. This table serves as the fact table in this project. A data dictionary is included in the workspace.

In [3]:
# Read in the data here
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()


In [4]:
df_I94=spark.read.parquet("sas_data")

In [18]:
df_I94.limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,...,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,...,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1
2,5748519.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20582.0,...,,M,1987.0,10292016,M,,DL,94956410000.0,40,B1
3,5748520.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1987.0,10292016,F,,DL,94956450000.0,40,B1
4,5748521.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1988.0,10292016,M,,DL,94956390000.0,40,B1
5,5748522.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20579.0,...,,M,1959.0,10292016,M,,NZ,94981800000.0,10,B2
6,5748523.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1950.0,10292016,F,,NZ,94979690000.0,10,B2
7,5748524.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1975.0,10292016,F,,NZ,94979750000.0,10,B2
8,5748525.0,2016.0,4.0,245.0,464.0,HOU,20574.0,1.0,FL,20581.0,...,,M,1989.0,10292016,M,,NZ,94973250000.0,28,B2
9,5748526.0,2016.0,4.0,245.0,464.0,LOS,20574.0,1.0,CA,20581.0,...,,M,1990.0,10292016,F,,NZ,95013550000.0,2,B2


#### Data Dictionary
- cicid - float64 - ID that uniquely identifies one record in the dataset
- i94yr - float64 - 4 digit year
- i94mon - float64 - Numeric month
- i94cit - float64 - 3 digit code of the source country for immigration (country of citizenship)
- i94res - float64 - 3 digit code of the source country for immigration (country of residence)
- i94port - object - Port admitted through
- arrdate - float64 - Arrival date in the USA
- i94mode - float64 - Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
- i94addr - object - State of arrival
- depdate - float64 - Departure date
- i94bir - float64 - Age of Respondent in Years
- i94visa - float64 - Visa codes collapsed into three categories: (1 = Business; 2 = Pleasure; 3 = Student)
- count - float64 - Used for summary statistics
- dtadfile - object - Character Date Field
- visapost - object - Department of State post where the visa was issued
- occup - object - Occupation that will be performed in U.S.
- entdepa - object - Arrival Flag. Whether admitted or paroled into the US
- entdepd - object - Departure Flag. Whether departed, lost visa, or deceased
- entdepu - object - Update Flag. Update of visa, either apprehended, overstayed, or updated to PR
- matflag - object - Match flag
- biryear - float64 - 4 digit year of birth
- dtaddto - object - Character date field to when admitted in the US
- gender - object - Gender
- insnum - object - INS number
- airline - object - Airline used to arrive in U.S.
- admnum - float64 - Admission number, should be unique and not nullable
- fltno - object - Flight number of Airline used to arrive in U.S.
- visatype - object - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
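
The `arrdate` and `depdate` values in the sample (for example `20574.0`) look like SAS numeric dates, that is, day counts starting at 1960-01-01. This is an assumption based on the SAS origin of the file, although it is consistent with the April 2016 data. The sketch below shows one way such values and the `i94mode` codes from the dictionary could be decoded; the function names are illustrative and not part of the project code.

```python
from datetime import date, timedelta

# Assumption: arrdate/depdate are SAS dates (days since 1960-01-01).
SAS_EPOCH = date(1960, 1, 1)

# Code table copied from the data dictionary above.
I94_MODE = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}


def sas_to_date(days):
    """Convert a SAS day count (float or int) to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


def decode_mode(code):
    """Map an i94mode code to its label.

    Missing or unknown codes fall back to 'Not reported' (a design choice
    made for this sketch only).
    """
    if code is None:
        return "Not reported"
    return I94_MODE.get(int(code), "Not reported")


print(sas_to_date(20574.0))  # arrdate of the first sample row
print(decode_mode(1.0))
```

The same conversion could be done in Spark by adding the day count to the 1960-01-01 epoch date, which keeps the work distributed instead of collecting rows to the driver.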


#### U.S. City Demographic Data

This data comes from OpenDataSoft. The dataset contains population details of US cities and census-designated places.


In [5]:
df_city_demo=spark.read.csv("us-cities-demographics.csv",inferSchema=True,header=True,sep=';')

In [12]:
df_city_demo.limit(10).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402
5,Peoria,Illinois,33.1,56229,62432,118661,6634,7517,2.4,IL,American Indian and Alaska Native,1343
6,Avondale,Arizona,29.1,38712,41971,80683,4815,8355,3.18,AZ,Black or African-American,11592
7,West Covina,California,39.8,51629,56860,108489,3800,37038,3.56,CA,Asian,32716
8,O'Fallon,Missouri,36.0,41762,43270,85032,5783,3269,2.77,MO,Hispanic or Latino,2583
9,High Point,North Carolina,35.5,51751,58077,109828,5204,16315,2.65,NC,Asian,11060


#### Data Dictionary
- City - Name of the city
- State - US state of the city
- Median Age - The median of the age of the population
- Male Population - Number of the male population
- Female Population - Number of the female population
- Total Population - Number of the total population
- Number of Veterans - Number of veterans living in the city
- Foreign-born - Number of residents of the city who were born outside the United States
- Average Household Size - Average number of people per household in the city
- State Code - Code of the state of the city
- Race - Race class
- Count - Number of individuals of each race
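
As the sample rows show, the file holds one row per city and race, so the city-level columns repeat across rows. One possible way to get a single record per city is to pivot `Race` and `Count` into one structure per city. The sketch below illustrates the idea in plain Python; the rows are partly made up (marked in the comments), and `pivot_by_city` is a hypothetical helper, not the project's implementation.

```python
from collections import defaultdict

# Illustrative rows: (city, state_code, total_population, race, count).
# The first and third rows come from the sample output above; the second
# is made up to show two races for the same city.
rows = [
    ("Silver Spring", "MD", 82463, "Hispanic or Latino", 25924),
    ("Silver Spring", "MD", 82463, "White", 37000),  # hypothetical value
    ("Quincy", "MA", 93629, "White", 58723),
]


def pivot_by_city(rows):
    """Collapse one-row-per-race records into one record per (city, state)."""
    cities = defaultdict(lambda: {"races": {}})
    for city, state, total, race, count in rows:
        record = cities[(city, state)]
        record["total_population"] = total
        record["races"][race] = count
    return dict(cities)


pivoted = pivot_by_city(rows)
print(len(pivoted))  # number of distinct (city, state) pairs
```

The key is the pair (city, state) rather than the city name alone, since the same city name can occur in more than one state. In Spark, `groupBy(...).pivot("Race")` would be the analogous operation.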

#### Airport Codes
This is a simple table of airport codes and their corresponding cities.

In [6]:
df_airport=spark.read.csv("airport-codes_csv.csv",inferSchema=True,header=True)

In [18]:
df_airport.limit(10).toPandas()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
5,00AS,small_airport,Fulton Airport,1100,,US,US-OK,Alex,00AS,,00AS,"-97.8180194, 34.9428028"
6,00AZ,small_airport,Cordes Airport,3810,,US,US-AZ,Cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484"
7,00CA,small_airport,Goldstone /Gts/ Airport,3038,,US,US-CA,Barstow,00CA,,00CA,"-116.888000488, 35.350498199499995"
8,00CL,small_airport,Williams Ag Airport,87,,US,US-CA,Biggs,00CL,,00CL,"-121.763427, 39.427188"
9,00CN,heliport,Kitchen Creek Helibase Heliport,3350,,US,US-CA,Pine Valley,00CN,,00CN,"-116.4597417, 32.7273736"


#### Data Dictionary
- ident - Unique identifier
- type - Type of the airport
- name - Airport Name
- elevation_ft - Altitude of the airport
- continent - Continent
- iso_country - ISO code of the country of the airport
- iso_region - ISO code for the region of the airport
- municipality - City where the airport is located
- gps_code - GPS code of the airport
- iata_code - IATA code of the airport
- local_code - Local code of the airport
- coordinates - GPS coordinates of the airport
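
Two of these columns pack several values into one string: `iso_region` combines country and region (for example `US-PA`), and `coordinates` holds two numbers separated by a comma. In the sample rows the first number must be the longitude, since a value such as `-151.69` cannot be a latitude. A minimal sketch of splitting both fields, with hypothetical helper names:

```python
def split_iso_region(iso_region):
    """Split an iso_region such as 'US-PA' into (country, region)."""
    # partition() splits on the first hyphen only, so 'ZZ-U-A' -> ('ZZ', 'U-A')
    country, _, region = iso_region.partition("-")
    return country, region


def parse_coordinates(coordinates):
    """Parse a 'longitude, latitude' string into a pair of floats."""
    longitude, latitude = (float(part) for part in coordinates.split(","))
    return longitude, latitude


print(split_iso_region("US-PA"))
print(parse_coordinates("-74.93360137939453, 40.07080078125"))
```

For US airports the region part has the same two-letter form as the `State Code` column of the demographics data, which could make it a candidate join key; whether every value lines up would need to be checked against the data.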

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.





In [22]:

df_I94.describe().toPandas()

Unnamed: 0,summary,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,count,3096313.0,3096313.0,3096313.0,3096313.0,3096313.0,3096313,3096313.0,3096074.0,2943721,...,392,2957884,3095511.0,3095836,2682044,113708,3012686,3096313.0,3076764,3096313
1,mean,3078651.879075533,2016.0,4.0,304.9069344733559,303.28381949757664,,20559.84854179794,1.0736897761487614,51.652482269503544,...,,,1974.2323855415148,8291120.333841449,,4131.050016327899,59.477601493233784,70828850111.50484,1360.2463696420555,
2,stddev,1763278.0997499449,0.0,0.0,210.02688853063205,208.58321292789532,,8.777339475317723,0.5158963131657106,42.97906231370983,...,,,17.420260534589556,1656502.4244925722,,8821.743471773654,172.6333995206175,22154415947.558968,5852.676345633695,
3,min,6.0,2016.0,4.0,101.0,101.0,5KE,20545.0,1.0,..,...,U,M,1902.0,/ 183D,F,0,*FF,0.0,00000,B1
4,max,6102785.0,2016.0,4.0,999.0,760.0,YSL,20574.0,9.0,ZU,...,Y,M,2019.0,D/S,X,YM0167,ZZ,99915565930.0,ZZZ,WT


In [14]:
df_airport.describe().toPandas()

Unnamed: 0,summary,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,count,55075,55075,55075,48069.0,55075,55075,55075,49399,41030,9189,28686,55075
1,mean,2.3873375337777779E8,,,1240.7896773388254,,,,,2.1920446610204083E8,0.0,8.580556178571428E7,
2,stddev,9.492375382267495E8,,,1602.3634593484142,,,,,9.1123224377024E8,0.0,5.747026415216715E8,
3,min,00A,balloonport,"""""""Der Dingel"""" Airfield""",-1266.0,AF,AD,AD-04,'S Gravenvoeren,0000,-,-,"-0.004722000099718571, 9.425000190734863"
4,max,spgl,small_airport,Çá¸¾á¸á¸ á¸®á¸Ç{+91-9680118734} GiRLFRieNd...,22000.0,SA,ZZ,ZZ-U-A,Å½ocene,ZYYY,ZZV,ZZV,"99.9555969238, 8.47115039825"


In [9]:
# Rename the Count column to Ccount
df_city_demo = df_city_demo.withColumnRenamed("Count", "Ccount")
df_city_demo.describe().toPandas()

Unnamed: 0,summary,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Ccount
0,count,2891,2891,2891.0,2888.0,2888.0,2891.0,2878.0,2878.0,2875.0,2891,2891,2891.0
1,mean,,,35.49488066413016,97328.4262465374,101769.6308864266,198966.77931511588,9367.832522585128,40653.59867963864,2.742542608695655,,,48963.77447250087
2,stddev,,,4.401616730099886,216299.93692873296,231564.5725714828,447555.9296335903,13211.21992386408,155749.1036650984,0.4332910878973046,,,144385.58856460615
3,min,Abilene,Alabama,22.9,29281.0,27348.0,63215.0,416.0,861.0,2.0,AK,American Indian and Alaska Native,98.0
4,max,Yuma,Wisconsin,70.5,4081698.0,4468707.0,8550405.0,156961.0,3212500.0,4.98,WI,White,3835726.0


#### Summary of shapes of the datasets

In [12]:
print("City Demographics")
df_city_demo_count=df_city_demo.count()
print(f"Rows: {df_city_demo_count}")
print(f"Columns: {len(df_city_demo.columns)}")
print()
print("Airport Codes")
df_airport_count=df_airport.count()
print(f"Rows: {df_airport_count}")
print(f"Columns: {len(df_airport.columns)}")
print()
print("I94 Immigration")
df_I94_count=df_I94.count()
print(f"Rows: {df_I94_count}")
print(f"Columns: {len(df_I94.columns)}")

City Demographics
Rows: 2891
Columns: 12

Airport Codes
Rows: 55075
Columns: 12

I94 Immigration
Rows: 3096313
Columns: 28


#### Checking for Nulls

In [10]:
def countNulls(df):
    """Display the number of nulls for every column of df that contains any."""
    nulls_list = []
    for col_name in df.columns:
        # One Spark job per column: count the rows where this column is null
        number_nulls = df.filter(F.col(col_name).isNull()).count()
        if number_nulls > 0:
            nulls_list.append({"Column": col_name, "Nulls": number_nulls})
    if nulls_list:
        display(pd.DataFrame(nulls_list))
    else:
        print("No Nulls")



In [11]:
countNulls(df_airport)

Unnamed: 0,Column,Nulls
0,elevation_ft,7006
1,municipality,5676
2,gps_code,14045
3,iata_code,45886
4,local_code,26389


In [13]:
countNulls(df_city_demo)

Unnamed: 0,Column,Nulls
0,Male Population,3
1,Female Population,3
2,Number of Veterans,13
3,Foreign-born,13
4,Average Household Size,16


In [14]:
countNulls(df_I94)

Unnamed: 0,Column,Nulls
0,i94mode,239
1,i94addr,152592
2,depdate,142457
3,i94bir,802
4,dtadfile,1
5,visapost,1881250
6,occup,3088187
7,entdepa,238
8,entdepd,138429
9,entdepu,3095921
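
Raw null counts are easier to judge as a share of the 3,096,313 rows in the I94 data. The sketch below uses the counts listed above to flag columns that are mostly empty; the 50% threshold and the helper name `mostly_null` are assumptions made for illustration, not a rule taken from the project.

```python
TOTAL_ROWS = 3096313  # row count of df_I94 reported in this notebook

# Null counts copied from the output above.
null_counts = {
    "i94mode": 239, "i94addr": 152592, "depdate": 142457, "i94bir": 802,
    "dtadfile": 1, "visapost": 1881250, "occup": 3088187, "entdepa": 238,
    "entdepd": 138429, "entdepu": 3095921,
}


def mostly_null(null_counts, total_rows, threshold=0.5):
    """Return the columns whose share of nulls exceeds the threshold."""
    return [col for col, n in null_counts.items() if n / total_rows > threshold]


print(mostly_null(null_counts, TOTAL_ROWS))
```

Columns flagged this way are candidates for dropping during cleaning, since they carry little information for analysis.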


In [None]:
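def count_distinct_values(df, max_values=5):
    """Print the distinct-value count of each column and show up to max_values of them."""
    # Iterate over column names (not Column objects) so the labels print cleanly
    for column in df.columns:
        distinct_df = df.select(column).distinct()
        n_distinct = distinct_df.count()
        print("'{}' has {} distinct values".format(column, n_distinct))
        if n_distinct > max_values:
            print("-----------Listing up to {} distinct values---------".format(max_values))
        # Bring only the displayed sample to the driver, not every distinct value
        display(distinct_df.limit(max_values).toPandas())
        print("\n-----------------------------------------------------------------------\n")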


In [11]:
count_distinct_values(df_airport)

'Column<b'ident'>' has 55075 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,06IN
1,06VA
2,0LA0
3,0MD6
4,0OH7
5,0OK9
6,11KS
7,12PR
8,16KY
9,19OI



-----------------------------------------------------------------------

'Column<b'type'>' has 7 distinct values


Unnamed: 0,0
0,large_airport
1,balloonport
2,seaplane_base
3,heliport
4,closed
5,medium_airport
6,small_airport



-----------------------------------------------------------------------

'Column<b'name'>' has 52144 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,Mc Kenzie Bridge State Airport
1,Clarks Dream Strip
2,Whipoorwill Springs Airport
3,John C Lincoln Hospital Heliport
4,Trinity Medical Center West Heliport
5,Pankratz Airport
6,PVH Heliport
7,J & S Field
8,Gimlin Airport
9,Jorgensen - Stoller Airport



-----------------------------------------------------------------------

'Column<b'elevation_ft'>' has 5450 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,6620
1,6397
2,463
3,1580
4,4935
5,1645
6,2142
7,5300
8,148
9,496



-----------------------------------------------------------------------

'Column<b'continent'>' has 7 distinct values


Unnamed: 0,0
0,
1,SA
2,AS
3,AN
4,OC
5,EU
6,AF



-----------------------------------------------------------------------

'Column<b'iso_country'>' has 244 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,DZ
1,LT
2,MM
3,CI
4,TC
5,AZ
6,FI
7,SC
8,PM
9,UA



-----------------------------------------------------------------------

'Column<b'iso_region'>' has 2810 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,US-TN
1,AF-KAP
2,AO-CUS
3,BR-MT
4,CO-CAL
5,MZ-MPM
6,UG-409
7,DO-08
8,CO-CAQ
9,LA-LM



-----------------------------------------------------------------------

'Column<b'municipality'>' has 27134 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,Sandy Valley
1,Agawam
2,Aitkin
3,Pleasant Home
4,Tyler
5,Fairbanks
6,Worcester
7,Middlefield
8,Cheriton
9,Deerwood



-----------------------------------------------------------------------

'Column<b'gps_code'>' has 40851 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,06VA
1,0LA0
2,0MD6
3,0OH7
4,0OK9
5,12PR
6,16KY
7,19OI
8,19OR
9,1CL8



-----------------------------------------------------------------------

'Column<b'iata_code'>' has 9043 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,BZT
1,YUL
2,DWR
3,NWI
4,KLR
5,KMU
6,KGL
7,HYL
8,BGM
9,CNU



-----------------------------------------------------------------------

'Column<b'local_code'>' has 27437 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,06VA
1,0LA0
2,0MD6
3,0OH7
4,0OK9
5,12PR
6,16KY
7,19OI
8,19OR
9,1CL8



-----------------------------------------------------------------------

'Column<b'coordinates'>' has 54874 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,"-88.56900024414062, 40.1786003112793"
1,"-117.28399658203125, 48.84090042114258"
2,"-91.30760192871094, 44.00360107421875"
3,"-81.41940307617188, 28.610300064086914"
4,"-87.386496, 46.270198"
5,"-115.40399932861328, 47.66080093383789"
6,"-74.43990325927734, 40.752899169921875"
7,"-74.24539947509766, 42.99729919433594"
8,"-123.931392431, 46.1159964394"
9,"-91.526001, 34.973202"



-----------------------------------------------------------------------



In [12]:
count_distinct_values(df_city_demo)

'Column<b'City'>' has 567 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,Saint George
1,Worcester
2,Tyler
3,Springfield
4,Caguas
5,Charleston
6,Pasco
7,Corona
8,Tempe
9,North Las Vegas



-----------------------------------------------------------------------

'Column<b'State'>' has 49 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,Utah
1,Hawaii
2,Minnesota
3,Ohio
4,Arkansas
5,Oregon
6,Texas
7,North Dakota
8,Pennsylvania
9,Connecticut



-----------------------------------------------------------------------

'Column<b'Median Age'>' has 180 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,37.1
1,44.8
2,32.3
3,43.3
4,26.4
5,37.4
6,36.2
7,31.7
8,41.6
9,35.6



-----------------------------------------------------------------------

'Column<b'Male Population'>' has 594 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,70863
1,93948
2,62478
3,43588
4,54676
5,42926
6,32680
7,50792
8,34488
9,51629



-----------------------------------------------------------------------

'Column<b'Female Population'>' has 595 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,89574
1,76110
2,43270
3,35868
4,44205
5,38146
6,57754
7,35713
8,33337
9,33437



-----------------------------------------------------------------------

'Column<b'Total Population'>' has 594 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,100884
1,166624
2,559131
3,158300
4,193955
5,83307
6,68592
7,315685
8,192608
9,96098



-----------------------------------------------------------------------

'Column<b'Number of Veterans'>' has 578 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,4519
1,18051
2,7880
3,5156
4,10817
5,4219
6,1507
7,1990
8,2580
9,897



-----------------------------------------------------------------------

'Column<b'Foreign-born'>' has 588 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,11317
1,26755
2,23271
3,56640
4,20735
5,21558
6,19499
7,15003
8,18161
9,24630



-----------------------------------------------------------------------

'Column<b'Average Household Size'>' has 162 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,2.86
1,3.26
2,2.4
3,2.41
4,2.82
5,2.62
6,3.02
7,2.55
8,3.08
9,3.56



-----------------------------------------------------------------------

'Column<b'State Code'>' has 49 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,AZ
1,SC
2,LA
3,MN
4,NJ
5,DC
6,OR
7,VA
8,RI
9,KY



-----------------------------------------------------------------------

'Column<b'Race'>' has 5 distinct values


Unnamed: 0,0
0,Black or African-American
1,Hispanic or Latino
2,White
3,Asian
4,American Indian and Alaska Native



-----------------------------------------------------------------------

'Column<b'Count'>' has 2785 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,7240
1,23271
2,7340
3,496
4,6620
5,67294
6,51123
7,43688
8,6397
9,9427



-----------------------------------------------------------------------



In [13]:
count_distinct_values(df_I94)

'Column<b'cicid'>' has 3096313 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,5748877.0
1,5749231.0
2,5749277.0
3,5750090.0
4,5750830.0
5,5750892.0
6,5751462.0
7,5751686.0
8,5751739.0
9,5751758.0



-----------------------------------------------------------------------

'Column<b'i94yr'>' has 1 distinct values


Unnamed: 0,0
0,2016.0



-----------------------------------------------------------------------

'Column<b'i94mon'>' has 1 distinct values


Unnamed: 0,0
0,4.0



-----------------------------------------------------------------------

'Column<b'i94cit'>' has 243 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,299.0
1,692.0
2,769.0
3,576.0
4,147.0
5,735.0
6,311.0
7,524.0
8,206.0
9,389.0



-----------------------------------------------------------------------

'Column<b'i94res'>' has 229 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,692.0
1,299.0
2,576.0
3,735.0
4,206.0
5,524.0
6,389.0
7,390.0
8,249.0
9,329.0



-----------------------------------------------------------------------

'Column<b'i94port'>' has 299 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,FMY
1,BGM
2,HEL
3,DNS
4,MOR
5,FOK
6,HVR
7,SNA
8,PTK
9,CLG



-----------------------------------------------------------------------

'Column<b'arrdate'>' has 30 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,20550.0
1,20556.0
2,20553.0
3,20551.0
4,20565.0
5,20563.0
6,20546.0
7,20547.0
8,20554.0
9,20564.0



-----------------------------------------------------------------------

'Column<b'i94mode'>' has 5 distinct values


Unnamed: 0,0
0,
1,1.0
2,3.0
3,2.0
4,9.0



-----------------------------------------------------------------------

'Column<b'i94addr'>' has 458 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,.N
1,RG
2,YH
3,RF
4,CI
5,FT
6,TC
7,SC
8,AZ
9,IC



-----------------------------------------------------------------------

'Column<b'depdate'>' has 236 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,20593.0
1,20689.0
2,20673.0
3,20467.0
4,20652.0
5,20196.0
6,20705.0
7,20621.0
8,20682.0
9,20550.0



-----------------------------------------------------------------------

'Column<b'i94bir'>' has 113 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,8.0
1,67.0
2,70.0
3,69.0
4,0.0
5,7.0
6,108.0
7,88.0
8,49.0
9,101.0



-----------------------------------------------------------------------

'Column<b'i94visa'>' has 3 distinct values


Unnamed: 0,0
0,1.0
1,3.0
2,2.0



-----------------------------------------------------------------------

'Column<b'count'>' has 1 distinct values


Unnamed: 0,0
0,1.0



-----------------------------------------------------------------------

'Column<b'dtadfile'>' has 118 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,20160615
1,20160825
2,20160429
3,20160604
4,20160611
5,20160403
6,20160623
7,20160427
8,20160420
9,20160702



-----------------------------------------------------------------------

'Column<b'visapost'>' has 531 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,CRS
1,KGL
2,AKD
3,BGM
4,KAT
5,TRK
6,MOR
7,MDL
8,FRN
9,CJT



-----------------------------------------------------------------------

'Column<b'occup'>' has 112 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,PHA
1,REL
2,ENT
3,ACH
4,101
5,EMM
6,ULS
7,GEN
8,MTH
9,DVM



-----------------------------------------------------------------------

'Column<b'entdepa'>' has 14 distinct values


Unnamed: 0,0
0,K
1,F
2,
3,T
4,B
5,M
6,U
7,O
8,Z
9,A



-----------------------------------------------------------------------

'Column<b'entdepd'>' has 13 distinct values


Unnamed: 0,0
0,K
1,Q
2,
3,L
4,M
5,V
6,O
7,D
8,J
9,N



-----------------------------------------------------------------------

'Column<b'entdepu'>' has 3 distinct values


Unnamed: 0,0
0,
1,Y
2,U



-----------------------------------------------------------------------

'Column<b'matflag'>' has 2 distinct values


Unnamed: 0,0
0,
1,M



-----------------------------------------------------------------------

'Column<b'biryear'>' has 113 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,1988.0
1,1976.0
2,1951.0
3,1940.0
4,1928.0
5,1979.0
6,1953.0
7,1913.0
8,1987.0
9,1909.0



-----------------------------------------------------------------------

'Column<b'dtaddto'>' has 778 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,9282016
1,2282017
2,1102017
3,11162016
4,10202017
5,11292017
6,6152016
7,1192018
8,11142016
9,3152017



-----------------------------------------------------------------------

'Column<b'gender'>' has 5 distinct values


Unnamed: 0,0
0,F
1,
2,M
3,U
4,X



-----------------------------------------------------------------------

'Column<b'insnum'>' has 1914 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,5325
1,3414
2,7252
3,4032
4,3959
5,39645
6,39590
7,691
8,39458
9,4821



-----------------------------------------------------------------------

'Column<b'airline'>' has 535 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,MM
1,DZ
2,LT
3,CI
4,TC
5,YFA
6,447
7,AZ
8,FI
9,R0E



-----------------------------------------------------------------------

'Column<b'admnum'>' has 3075579 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,907737600.0
1,94947800000.0
2,59511120000.0
3,59519190000.0
4,59504510000.0
5,59560430000.0
6,59557650000.0
7,59548670000.0
8,59515270000.0
9,59517280000.0



-----------------------------------------------------------------------

'Column<b'fltno'>' has 7153 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,00456
1,00556
2,00530
3,2294
4,1159
5,XAGPR
6,03904
7,02053
8,03788
9,04319



-----------------------------------------------------------------------

'Column<b'visatype'>' has 17 distinct values
-----------Listing up to 15 distinct values---------


Unnamed: 0,0
0,F2
1,GMB
2,B2
3,F1
4,CPL
5,I1
6,WB
7,M1
8,B1
9,WT



-----------------------------------------------------------------------



#### Cleaning Steps
Document steps necessary to clean the data

In [7]:
# Performing cleaning tasks here


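# Count the nulls in each row, then tally how many rows share each null count.
# Sorting happens after the aggregation, since groupBy does not preserve a prior ordering.
df_airport.withColumn('nullsNum', sum(df_airport[col].isNull().cast('int') for col in df_airport.columns))\
.groupBy("nullsNum").count().orderBy(F.desc("nullsNum")).toPandas()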


Unnamed: 0,nullsNum,count
0,5,1508
1,4,5020
2,3,6479
3,2,12623
4,1,26699
5,0,2746


In [8]:
# Same tally for the I94 data: nulls per row, aggregated, then sorted
df_I94.withColumn('nullsNum', sum(df_I94[col].isNull().cast('int') for col in df_I94.columns))\
.groupBy("nullsNum").count().orderBy(F.desc("nullsNum")).toPandas()


Unnamed: 0,nullsNum,count
0,12,3
1,11,308
2,10,289
3,9,3721
4,8,17196
5,7,38935
6,6,106841
7,5,460335
8,4,1346942
9,3,1115984


In [9]:
df_I94.withColumn('nullsNum', sum(df_I94[col].isNull().cast('int') for col in df_I94.columns))\
.orderBy(F.desc("nullsNum")).limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype,nullsNum
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,,1979.0,10282016.0,,,,1897628000.0,,B2,12
1,6020025.0,2016.0,4.0,252.0,582.0,PSP,20553.0,9.0,,,...,,,7072016.0,U,,,34605310000.0,,WT,12
2,5973645.0,2016.0,4.0,252.0,209.0,SAI,20549.0,9.0,,,...,,,5192016.0,F,,,57551700000.0,,GMT,12
3,5901492.0,2016.0,4.0,582.0,582.0,XXX,20556.0,,,20575.0,...,M,1986.0,,,,,78081390000.0,,B2,11
4,5901490.0,2016.0,4.0,582.0,582.0,XXX,20554.0,,,20575.0,...,M,1986.0,,,,,84332940000.0,,B2,11
5,5901928.0,2016.0,4.0,112.0,112.0,HHW,20559.0,9.0,,,...,,,7142016.0,M,3517.0,,47114220000.0,,WT,11
6,5879643.0,2016.0,4.0,582.0,582.0,XXX,20555.0,,,20574.0,...,M,1951.0,,,,,89677700000.0,,B2,11
7,5901488.0,2016.0,4.0,582.0,582.0,XXX,20553.0,,,20574.0,...,M,1978.0,,,,,89095960000.0,,B2,11
8,5901491.0,2016.0,4.0,582.0,582.0,XXX,20554.0,,,20575.0,...,M,1989.0,,,,,89885530000.0,,B2,11
9,5901728.0,2016.0,4.0,111.0,111.0,CHM,20553.0,9.0,,,...,,,7072016.0,U,5522.0,,42348700000.0,,WT,11


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
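
Two of the checks listed above, a unique-key check and a source/count check, can be sketched in plain Python as follows. The rows and function names are hypothetical stand-ins; in the actual pipeline the same logic would run against the loaded tables.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the values of `key` that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def check_counts(source_count, target_count):
    """Completeness check: the target must be non-empty and match the source."""
    if target_count == 0:
        return "FAIL: target table is empty"
    if source_count != target_count:
        return f"FAIL: expected {source_count} rows, found {target_count}"
    return "OK"


# Hypothetical rows standing in for a loaded fact table.
fact_rows = [{"cicid": 1}, {"cicid": 2}, {"cicid": 2}]
print(duplicate_keys(fact_rows, "cicid"))
print(check_counts(3, 3))
```

Returning the offending key values, rather than a bare pass/fail flag, makes a failed check easier to investigate.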
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.