# Data Engineering Capstone Project

#### Project Summary
This project is designed to help the United States of America track, and run analytical queries on, the people who enter the country: immigrants.
It consists of a star schema with five dimension tables and one fact table. With it, the United States will be able to report on the number of people entering the country, where they come from, what visa type they hold, and so on. The data provided is large, so we use Apache Spark to perform these operations.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## Step 1: Scope the project and Gather Data

The project conducted for this capstone was provided by Udacity. It gives the student an opportunity to showcase the skills newly acquired in the nanodegree, and it serves as the final graded piece of work.



In [1]:
# Do all imports and installs here
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
import missingno as msno
pd.set_option('display.width',170, 'display.max_rows',200, 'display.max_columns',900)

### Step 1: Scope the Project and Gather Data

#### Scope 
I plan on providing this solution to the immigration authorities of the United States of America. Because of the "American Dream", America finds itself in a position where a lot of people enter the country. This solution would help the authorities run better analytical queries.

#### Describe and Gather Data 

* airport-codes_csv.csv: A table of airport codes with the full location details and coordinates of each airport.

* immigration_data_sample.csv: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace.

* U.S. City Demographic Data: This data comes from OpenSoft.

The two helper functions below, `Source_information` and `Source_Full_report`, support the analysis of how our data is stored.
They provide a wide variety of information, from a plain view of what the data looks like
to how many unique values each column has.

In [2]:
# Read in the data here
def Source_information(df):
    print ("\n\n---------------------")
    print ("Dataset INFORMATION")
    print ("---------------------")
    print ("Shape of data set:", df.shape, "\n")
    print ("Column Headers:", list(df.columns.values), "\n")
    print (df.dtypes)
    
def Source_Full_report(df):
    import re
    missing_values = []
    nonumeric_values = []

    print ("Dataset INFORMATION")
    print ("========================\n")

    for column in df:
        # Find all the unique feature values
        uniq = df[column].unique()
        print ("'{}' has {} unique values" .format(column,uniq.size))
        if (uniq.size > 10):
            print("~~Listing up to 10 unique values~~")
        print (uniq[0:10])
        print ("\n-----------------------------------------------------------------------\n")

        # Find features with missing values
        if (True in pd.isnull(uniq)):
            s = "{} has {} missing" .format(column, pd.isnull(df[column]).sum())
            missing_values.append(s)

        # Find features with non-numeric values (missing values are skipped)
        for value in uniq:
            if pd.isnull(value):
                continue
            if not re.search(r'(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(value)):
                nonumeric_values.append(column)
                break

    print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
    print ("Features with missing values:\n{}\n\n" .format(missing_values))
    print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
    print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
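
To see what the numeric pattern in `Source_Full_report` treats as numeric, the short sketch below applies the same regular expression to a few hand-picked strings. The sample values and the helper name `looks_numeric` are illustrative assumptions. Note that a signed value such as `-3.0` is not matched, so a column containing negative numbers would be reported as non-numeric.

```python
import re

# Same pattern as in Source_Full_report: unsigned digits with an optional decimal part.
NUMERIC_PATTERN = re.compile(r'(^\d+\.?\d*$)|(^\d*\.?\d+$)')

def looks_numeric(value):
    """Return True when str(value) is an unsigned integer or decimal."""
    return NUMERIC_PATTERN.search(str(value)) is not None

# Illustrative sample values (not taken from the full dataset).
samples = [6.0, "2016", ".5", "XXX", "-3.0", "1e5"]
results = {str(s): looks_numeric(s) for s in samples}
print(results)
```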

In [3]:
# The cell below loads the data stored in the sas_data folder using pandas

In [4]:
folder = 'sas_data'
df = pq.ParquetDataset(folder).read_pandas().to_pandas(split_blocks=True, self_destruct=True)

In [5]:
Source_information(df)



---------------------
Dataset INFORMATION
---------------------
Shape of data set: (3096313, 28) 

Column Headers: ['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline', 'admnum', 'fltno', 'visatype'] 

cicid       float64
i94yr       float64
i94mon      float64
i94cit      float64
i94res      float64
i94port      object
arrdate     float64
i94mode     float64
i94addr      object
depdate     float64
i94bir      float64
i94visa     float64
count       float64
dtadfile     object
visapost     object
occup        object
entdepa      object
entdepd      object
entdepu      object
matflag      object
biryear     float64
dtaddto      object
gender       object
insnum       object
airline      object
admnum      float64
fltno        object
visatype     object
dtype: object


In [6]:
df.describe()

| | cicid | i94yr | i94mon | i94cit | i94res | arrdate | i94mode | depdate | i94bir | i94visa | count | biryear | admnum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3096313.0 | 3096313.0 | 3096313.0 | 3096313.0 | 3096313.0 | 3096313.0 | 3096074.0 | 2953856.0 | 3095511.0 | 3096313.0 | 3096313.0 | 3095511.0 | 3096313.0 |
| mean | 3078652.0 | 2016.0 | 4.0 | 304.9069 | 303.2838 | 20559.85 | 1.07369 | 20573.95 | 41.76761 | 1.845393 | 1.0 | 1974.232 | 70828850000.0 |
| std | 1763278.0 | 0.0 | 0.0 | 210.0269 | 208.5832 | 8.777339 | 0.5158963 | 29.35697 | 17.42026 | 0.398391 | 0.0 | 17.42026 | 22154420000.0 |
| min | 6.0 | 2016.0 | 4.0 | 101.0 | 101.0 | 20545.0 | 1.0 | 15176.0 | -3.0 | 1.0 | 1.0 | 1902.0 | 0.0 |
| 25% | 1577790.0 | 2016.0 | 4.0 | 135.0 | 131.0 | 20552.0 | 1.0 | 20561.0 | 30.0 | 2.0 | 1.0 | 1962.0 | 56035230000.0 |
| 50% | 3103507.0 | 2016.0 | 4.0 | 213.0 | 213.0 | 20560.0 | 1.0 | 20570.0 | 41.0 | 2.0 | 1.0 | 1975.0 | 59360940000.0 |
| 75% | 4654341.0 | 2016.0 | 4.0 | 512.0 | 504.0 | 20567.0 | 1.0 | 20579.0 | 54.0 | 2.0 | 1.0 | 1986.0 | 93509870000.0 |
| max | 6102785.0 | 2016.0 | 4.0 | 999.0 | 760.0 | 20574.0 | 9.0 | 45427.0 | 114.0 | 3.0 | 1.0 | 2019.0 | 99915570000.0 |
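
The `arrdate` and `depdate` columns appear to hold SAS date values, which count days since 1 January 1960. A minimal sketch of the conversion, assuming the values are whole-day counts as in the summary above (the helper name `sas_to_date` is hypothetical):

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS dates count days from this day

def sas_to_date(sas_days):
    """Convert a SAS day count (int or float) to a calendar date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))

# Minimum and maximum arrdate reported by df.describe()
print(sas_to_date(20545.0))  # 2016-04-01
print(sas_to_date(20574.0))  # 2016-04-30
```

Both results fall in April 2016, which agrees with the single `i94yr` and `i94mon` values in the dataset.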


In [7]:

Source_Full_report(df)

Dataset INFORMATION

'cicid' has 3096313 unique values
~~Listing up to 10 unique values~~
[ 6.  7. 15. 16. 17. 18. 19. 20. 21. 22.]

-----------------------------------------------------------------------

'i94yr' has 1 unique values
[2016.]

-----------------------------------------------------------------------

'i94mon' has 1 unique values
[4.]

-----------------------------------------------------------------------

'i94cit' has 243 unique values
~~Listing up to 10 unique values~~
[692. 254. 101. 102. 103. 104. 105. 107. 108. 109.]

-----------------------------------------------------------------------

'i94res' has 229 unique values
~~Listing up to 10 unique values~~
[692. 276. 101. 110. 117. 112. 251. 102. 103. 104.]

-----------------------------------------------------------------------

'i94port' has 299 unique values
~~Listing up to 10 unique values~~
['XXX' 'ATL' 'WAS' 'NYC' 'TOR' 'BOS' 'HOU' 'MIA' 'CHI' 'LOS']

--------------------------------------------------------------

With data volumes this large, some of our columns may be mostly or entirely empty.
The expression below gives the percentage of missing values in each column.

In [8]:
(df.isnull().sum() / len(df))*100

cicid        0.000000
i94yr        0.000000
i94mon       0.000000
i94cit       0.000000
i94res       0.000000
i94port      0.000000
arrdate      0.000000
i94mode      0.007719
i94addr      4.928184
depdate      4.600859
i94bir       0.025902
i94visa      0.000000
count        0.000000
dtadfile     0.000032
visapost    60.757746
occup       99.737559
entdepa      0.007687
entdepd      4.470769
entdepu     99.987340
matflag      4.470769
biryear      0.025902
dtaddto      0.015405
gender      13.379429
insnum      96.327632
airline      2.700857
admnum       0.000000
fltno        0.631364
visatype     0.000000
dtype: float64
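
One way to act on these percentages is to flag the columns that are mostly empty. The sketch below does this in plain Python with the figures printed above; the 50% cut-off and the helper name `mostly_empty` are assumptions made for illustration, not values taken from the project.

```python
# Missing-value percentages copied from the output above (rounded to 2 d.p.;
# columns with no missing values are omitted for brevity).
missing_pct = {
    "i94mode": 0.01, "i94addr": 4.93, "depdate": 4.60, "i94bir": 0.03,
    "dtadfile": 0.00, "visapost": 60.76, "occup": 99.74, "entdepa": 0.01,
    "entdepd": 4.47, "entdepu": 99.99, "matflag": 4.47, "biryear": 0.03,
    "dtaddto": 0.02, "gender": 13.38, "insnum": 96.33, "airline": 2.70,
    "fltno": 0.63,
}

THRESHOLD = 50.0  # assumed cut-off, not a value from the project

def mostly_empty(percentages, threshold):
    """Return the column names whose missing percentage exceeds the threshold."""
    return sorted(col for col, pct in percentages.items() if pct > threshold)

print(mostly_empty(missing_pct, THRESHOLD))
# ['entdepu', 'insnum', 'occup', 'visapost']
```

On the full DataFrame the same filter can be written as a boolean mask on the percentage Series, for example `pct[pct > 50].index`.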

## Duplicate Check

In [6]:
df1 = df[['cicid','gender', 'i94addr', 'visapost']]
# Count rows per cicid; any count above 1 marks a duplicated key
size = df1.groupby('cicid').size().reset_index(name='n_rows')
size[size['n_rows'] > 1]        # DATAFRAME OF DUPLICATES

len(size[size['n_rows'] > 1])   # NUMBER OF DUPLICATES

0

# Airport Dataset

In [7]:
airport = pd.read_csv("airport-codes_csv.csv")

In [8]:
airport.head()

| | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00A | heliport | Total Rf Heliport | 11.0 | | US | US-PA | Bensalem | 00A | | 00A | -74.93360137939453, 40.07080078125 |
| 1 | 00AA | small_airport | Aero B Ranch Airport | 3435.0 | | US | US-KS | Leoti | 00AA | | 00AA | -101.473911, 38.704022 |
| 2 | 00AK | small_airport | Lowell Field | 450.0 | | US | US-AK | Anchor Point | 00AK | | 00AK | -151.695999146, 59.94919968 |
| 3 | 00AL | small_airport | Epps Airpark | 820.0 | | US | US-AL | Harvest | 00AL | | 00AL | -86.77030181884766, 34.86479949951172 |
| 4 | 00AR | closed | Newport Hospital & Clinic Heliport | 237.0 | | US | US-AR | Newport | | | | -91.254898, 35.6087 |


In [15]:
Source_information(airport)



---------------------
Dataset INFORMATION
---------------------
Shape of data set: (55075, 12) 

Column Headers: ['ident', 'type', 'name', 'elevation_ft', 'continent', 'iso_country', 'iso_region', 'municipality', 'gps_code', 'iata_code', 'local_code', 'coordinates'] 

ident            object
type             object
name             object
elevation_ft    float64
continent        object
iso_country      object
iso_region       object
municipality     object
gps_code         object
iata_code        object
local_code       object
coordinates      object
dtype: object


In [17]:
Source_Full_report(airport)

Dataset INFORMATION

'ident' has 55075 unique values
~~Listing up to 10 unique values~~
['00A' '00AA' '00AK' '00AL' '00AR' '00AS' '00AZ' '00CA' '00CL' '00CN']

-----------------------------------------------------------------------

'type' has 7 unique values
['heliport' 'small_airport' 'closed' 'seaplane_base' 'balloonport'
 'medium_airport' 'large_airport']

-----------------------------------------------------------------------

'name' has 52144 unique values
~~Listing up to 10 unique values~~
['Total Rf Heliport' 'Aero B Ranch Airport' 'Lowell Field' 'Epps Airpark'
 'Newport Hospital & Clinic Heliport' 'Fulton Airport' 'Cordes Airport'
 'Goldstone /Gts/ Airport' 'Williams Ag Airport'
 'Kitchen Creek Helibase Heliport']

-----------------------------------------------------------------------

'elevation_ft' has 5450 unique values
~~Listing up to 10 unique values~~
[  11. 3435.  450.  820.  237. 1100. 3810. 3038.   87. 3350.]

---------------------------------------------------------

In [9]:
temp_file_name = '../../data2/GlobalLandTemperaturesByCity.csv'
Global_temp = pd.read_csv(temp_file_name)

In [21]:
Global_temp.head()

| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1743-11-01 | 6.068 | 1.737 | Århus | Denmark | 57.05N | 10.33E |
| 1 | 1743-12-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 2 | 1744-01-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 3 | 1744-02-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 4 | 1744-03-01 | | | Århus | Denmark | 57.05N | 10.33E |


In [23]:
demographic = pd.read_csv("us-cities-demographics.csv", sep=";")

In [25]:
demographic.head()

| | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Silver Spring | Maryland | 33.8 | 40601.0 | 41862.0 | 82463 | 1562.0 | 30908.0 | 2.6 | MD | Hispanic or Latino | 25924 |
| 1 | Quincy | Massachusetts | 41.0 | 44129.0 | 49500.0 | 93629 | 4147.0 | 32935.0 | 2.39 | MA | White | 58723 |
| 2 | Hoover | Alabama | 38.5 | 38040.0 | 46799.0 | 84839 | 4819.0 | 8229.0 | 2.58 | AL | Asian | 4759 |
| 3 | Rancho Cucamonga | California | 34.5 | 88127.0 | 87105.0 | 175232 | 5821.0 | 33878.0 | 3.18 | CA | Black or African-American | 24437 |
| 4 | Newark | New Jersey | 34.6 | 138040.0 | 143873.0 | 281913 | 5829.0 | 86253.0 | 2.73 | NJ | White | 76402 |


In [27]:
Source_information(demographic)



---------------------
Dataset INFORMATION
---------------------
Shape of data set: (2891, 12) 

Column Headers: ['City', 'State', 'Median Age', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Foreign-born', 'Average Household Size', 'State Code', 'Race', 'Count'] 

City                       object
State                      object
Median Age                float64
Male Population           float64
Female Population         float64
Total Population            int64
Number of Veterans        float64
Foreign-born              float64
Average Household Size    float64
State Code                 object
Race                       object
Count                       int64
dtype: object


In [28]:
Source_Full_report(demographic)

Dataset INFORMATION

'City' has 567 unique values
~~Listing up to 10 unique values~~
['Silver Spring' 'Quincy' 'Hoover' 'Rancho Cucamonga' 'Newark' 'Peoria'
 'Avondale' 'West Covina' "O'Fallon" 'High Point']

-----------------------------------------------------------------------

'State' has 49 unique values
~~Listing up to 10 unique values~~
['Maryland' 'Massachusetts' 'Alabama' 'California' 'New Jersey' 'Illinois'
 'Arizona' 'Missouri' 'North Carolina' 'Pennsylvania']

-----------------------------------------------------------------------

'Median Age' has 180 unique values
~~Listing up to 10 unique values~~
[33.8 41.  38.5 34.5 34.6 33.1 29.1 39.8 36.  35.5]

-----------------------------------------------------------------------

'Male Population' has 594 unique values
~~Listing up to 10 unique values~~
[ 40601.  44129.  38040.  88127. 138040.  56229.  38712.  51629.  41762.
  51751.]

-----------------------------------------------------------------------

'Female Population' ha

In [1]:
	
from pyspark.sql import SparkSession
# Build a Spark session that can read SAS (.sas7bdat) files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [11]:
#write to parquet
df_spark.write.parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

This is found in the USA IMMIGRANT STAR SCHEMA image.
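
To illustrate the kind of query a star schema supports, here is a minimal sketch using an in-memory SQLite database. The table names, column names, and rows are hypothetical stand-ins, not the project's actual schema; only two of the dimensions are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension and fact tables, for illustration only.
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE dim_visa (visa_id INTEGER PRIMARY KEY, visa_type TEXT);
    CREATE TABLE fact_immigration (
        cicid INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES dim_country (country_id),
        visa_id INTEGER REFERENCES dim_visa (visa_id)
    );
    INSERT INTO dim_country VALUES (1, 'Country A'), (2, 'Country B');
    INSERT INTO dim_visa VALUES (1, 'B2'), (2, 'WT');
    INSERT INTO fact_immigration VALUES (6, 1, 1), (7, 1, 1), (15, 2, 2), (16, 1, 2);
""")

# Arrivals per country of origin and visa type: one join per dimension.
query = """
    SELECT c.country_name, v.visa_type, COUNT(*) AS arrivals
    FROM fact_immigration AS f
    JOIN dim_country AS c ON c.country_id = f.country_id
    JOIN dim_visa AS v ON v.visa_id = f.visa_id
    GROUP BY c.country_name, v.visa_type
    ORDER BY c.country_name, v.visa_type
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The fact table holds one row per arrival and only foreign keys into the dimensions, so each reporting question becomes a join followed by a `GROUP BY`.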

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

This is found in the PySpark Notebook.

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

This can also be found in the PySpark notebook.
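
The checks listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in the PySpark notebook; the function names and toy values are assumptions.

```python
def check_row_count(source_count, target_count):
    """Completeness check: the target must hold every source row."""
    if target_count != source_count:
        raise ValueError(
            f"row count mismatch: source={source_count}, target={target_count}"
        )

def check_unique_key(keys):
    """Integrity check: the key must be present and unique in every row."""
    keys = list(keys)
    if any(key is None for key in keys):
        raise ValueError("null key found")
    if len(set(keys)) != len(keys):
        raise ValueError("duplicate key found")

# Toy values standing in for counts and keys read from the real tables.
check_row_count(source_count=4, target_count=4)
check_unique_key([6, 7, 15, 16])
print("all quality checks passed")
```

Raising an exception on failure lets a scheduler mark the pipeline run as failed instead of silently loading incomplete data.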

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

This can be found in the DataDictionary.txt

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.
 
* If the data were increased by 100x, I would use Amazon EMR, because it offers cost-effective and fast performance for data transformation (ETL) workloads such as sorts, joins, and aggregations on large datasets.
* I would use EMR in conjunction with Apache Airflow to schedule when the analytics run and to notify multiple users of successes or failures.
* If this report is looked at every day, the pipeline should run daily. I suspect it will only be needed once a month, however, so I would schedule it to run at the beginning of every month.