# Data Engineering Capstone Project


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Do all imports and installs here
import sys, os
import pandas as pd
from pathlib import Path
from IPython import display as ICD

In [3]:
src_path: str = "../src"
sys.path.append(src_path)

In [4]:
from utils.io import process_config
from utils.spark import create_spark_session
from data.table_schemas import ON_LOAD_TABLES_SCHEMA


In [5]:
data_path: Path = Path("../data")
user_config, dl_config = (
    process_config(Path(os.getcwd()).parent.joinpath("_user.cfg")),
    process_config(Path(os.getcwd()).parent.joinpath("dwh.cfg")),
)
spark = create_spark_session(user_config, dl_config)

22/12/10 22:53:17 WARN Utils: Your hostname, uzi resolves to a loopback address: 127.0.1.1; using 192.168.1.181 instead (on interface wlp114s0)
22/12/10 22:53:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/uziel/miniconda3/envs/de_capstone/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/uziel/.ivy2/cache
The jars for the packages stored in: /home/uziel/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-35188972-8a4c-47eb-b363-d72bca30c6d7;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.1 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.901 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 148ms :: artifacts dl 8ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.901 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.1 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	-----------------------------------

22/12/10 22:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


---

## 1. Preview raw data


In [6]:
table_files = {
    "i94_immigration": [
        str(p)
        for p in data_path.joinpath("i94_immigration_data_2016").glob("*_2016.csv.bz2")
    ],
    "us_demographics": str(data_path.joinpath("us_cities_demographics.csv.bz2")),
    "airport_codes": str(data_path.joinpath("airport_codes.csv.bz2")),
    "world_temperature": str(
        data_path.joinpath("global_land_temperature_by_city.csv.bz2")
    ),
}


In [7]:
for table_name, table_schema in ON_LOAD_TABLES_SCHEMA.items():
    # Read each raw CSV with its predefined on-load schema
    table_df = spark.read.csv(
        table_files[table_name],
        schema=table_schema,
        header=True,
    )

    n_elem = table_df.count()
    # limit(5) keeps the preview in Spark until it is converted to pandas
    table_df_preview = table_df.limit(5).toPandas()

    print(f"First 5 rows of {table_name}:")
    ICD.display(table_df_preview)
    print(f"The full table contains a total of {n_elem} records\n")

                                                                                


First 5 rows of i94_immigration:


|  | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | depdate | ... | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2016.0 | 7.0 | 254.0 | 276.0 | LOS | 20636.0 | 1.0 | CA | 20640.0 | ... |  | M | 1978.0 | 9282016.0 | M |  | OZ | 63092900000.0 | 202 | WT |
| 1 | 2.0 | 2016.0 | 7.0 | 140.0 | 140.0 | NYC | 20636.0 | 1.0 | NY | 20657.0 | ... |  | M | 1971.0 | 9282016.0 | F |  | DL | 63092900000.0 | 9858 | WT |
| 2 | 3.0 | 2016.0 | 7.0 | 135.0 | 135.0 | ORL | 20636.0 | 1.0 | FL | 20657.0 | ... |  | M | 2006.0 | 9282016.0 | M |  | VS | 63092900000.0 | 71 | WT |
| 3 | 4.0 | 2016.0 | 7.0 | 124.0 | 124.0 | TAM | 20636.0 | 1.0 | FL | 20645.0 | ... |  | M | 1999.0 | 9282016.0 | M |  | LH | 63092900000.0 | 482 | WT |
| 4 | 5.0 | 2016.0 | 7.0 | 130.0 | 130.0 | LOS | 20636.0 | 1.0 | CA | 20662.0 | ... |  | M | 2015.0 | 9282016.0 | M |  | SU | 63092900000.0 | 106 | WT |


The full table contains a total of 40790529 records

First 5 rows of us_demographics:


|  | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Silver Spring | Maryland | 33.799999 | 40601.0 | 41862.0 | 82463 | 1562.0 | 30908.0 | 2.6 | MD | Hispanic or Latino | 25924 |
| 1 | Quincy | Massachusetts | 41.0 | 44129.0 | 49500.0 | 93629 | 4147.0 | 32935.0 | 2.39 | MA | White | 58723 |
| 2 | Hoover | Alabama | 38.5 | 38040.0 | 46799.0 | 84839 | 4819.0 | 8229.0 | 2.58 | AL | Asian | 4759 |
| 3 | Rancho Cucamonga | California | 34.5 | 88127.0 | 87105.0 | 175232 | 5821.0 | 33878.0 | 3.18 | CA | Black or African-American | 24437 |
| 4 | Newark | New Jersey | 34.599998 | 138040.0 | 143873.0 | 281913 | 5829.0 | 86253.0 | 2.73 | NJ | White | 76402 |


The full table contains a total of 2891 records



                                                                                

First 5 rows of airport_codes:


|  | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00A | heliport | Total Rf Heliport | 11.0 |  | US | US-PA | Bensalem | 00A |  | 00A | -74.93360137939453, 40.07080078125 |
| 1 | 00AA | small_airport | Aero B Ranch Airport | 3435.0 |  | US | US-KS | Leoti | 00AA |  | 00AA | -101.473911, 38.704022 |
| 2 | 00AK | small_airport | Lowell Field | 450.0 |  | US | US-AK | Anchor Point | 00AK |  | 00AK | -151.695999146, 59.94919968 |
| 3 | 00AL | small_airport | Epps Airpark | 820.0 |  | US | US-AL | Harvest | 00AL |  | 00AL | -86.77030181884766, 34.86479949951172 |
| 4 | 00AR | closed | Newport Hospital & Clinic Heliport | 237.0 |  | US | US-AR | Newport |  |  |  | -91.254898, 35.6087 |


The full table contains a total of 55075 records



                                                                                

First 5 rows of world_temperature:


|  | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1743-11-01 | 6.068 | 1.737 | Århus | Denmark | 57.05N | 10.33E |
| 1 | 1744-04-01 | 5.788 | 3.624 | Århus | Denmark | 57.05N | 10.33E |
| 2 | 1744-05-01 | 10.644 | 1.283 | Århus | Denmark | 57.05N | 10.33E |
| 3 | 1744-06-01 | 14.051 | 1.347 | Århus | Denmark | 57.05N | 10.33E |
| 4 | 1744-07-01 | 16.082001 | 1.396 | Århus | Denmark | 57.05N | 10.33E |


The full table contains a total of 8235082 records



### 1.1. I94 Immigration Data


In [None]:
df = pd.read_csv(data_path.joinpath(""))

In [None]:
table_name = "i94_immigration"
table_files = [
    str(p) for p in data_path.joinpath("i94_immigration_data_2016").glob("*_2016.csv.bz2")
]


In [None]:
table_df = spark.read.csv(
    table_files,
    # schema=ON_LOAD_TABLES_SCHEMA[table_name],
    header=True
)

table_df.take(5)

### 1.2. U.S. Demographics


In [None]:
table_name = "us_demographics"
table_file = str(data_path.joinpath("us_cities_demographics.csv.bz2"))


In [None]:
table_df = spark.read.csv(
    table_file,
    schema=ON_LOAD_TABLES_SCHEMA[table_name],
    header=True
)

n_elem = table_df.count()
table_df_preview = spark.createDataFrame(
    table_df.take(5), schema=ON_LOAD_TABLES_SCHEMA[table_name],
).toPandas()

print(f"First 5 rows of {table_name}:")
ICD.display(table_df_preview)
print(f"The full table contains a total of {n_elem} records")

In [None]:
temp_df = pd.read_csv(
    data_path.joinpath("global_land_temperature_by_city.csv.bz2"),
    index_col=0,
).dropna(subset=["AverageTemperature"])
print(temp_df.columns)
temp_df

In [None]:
temp_df[temp_df["Country"] == "United States"]

### 1.3. U.S. City Demographic Data


In [None]:
us_dem_df = pd.read_csv(data_path.joinpath("us_cities_demographics.csv.bz2"))
print(us_dem_df.columns)
us_dem_df

### 1.4. Airport Codes


In [None]:
airp_df = pd.read_csv(data_path.joinpath("airport_codes.csv.bz2"), index_col=0)
print(airp_df.columns)
airp_df

US Airports

In [None]:
airp_df[airp_df["iso_country"] == "US"]

---

## 2. Upload raw data to S3 buckets

### 2.1. I94 Immigration Data

In [None]:
# load raw data using Spark and write to S3
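
A minimal sketch of what this step could look like, assuming each raw table is written as Parquet under its own prefix. The bucket name, the `raw/` prefix, the choice of Parquet, and both helper names are assumptions made for illustration and are not taken from the project. The `s3a://` scheme is the one provided by the `hadoop-aws` package that the Spark session resolves at startup.

```python
def raw_table_uri(bucket: str, table_name: str, prefix: str = "raw") -> str:
    """Build the s3a:// URI for one raw table (hypothetical bucket layout)."""
    return f"s3a://{bucket}/{prefix.strip('/')}/{table_name}"


def upload_raw_table(spark, source_files, schema, target_uri: str) -> None:
    """Read raw CSV file(s) with an explicit schema and write them to S3 as Parquet."""
    table_df = spark.read.csv(source_files, schema=schema, header=True)
    # mode("overwrite") lets the cell be re-run without failing on existing output
    table_df.write.mode("overwrite").parquet(target_uri)


print(raw_table_uri("my-bucket", "airport_codes"))
```

With a live session, each table could then be uploaded with a call such as `upload_raw_table(spark, files, ON_LOAD_TABLES_SCHEMA[name], raw_table_uri("my-bucket", name))`, where `my-bucket` is a placeholder and `files` is the path or list of paths for that table.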

### 2.2. World Temperature Data

### 2.3. U.S. City Demographics

### 2.4. Airport Codes

---

### Step 2: Explore and Assess the Data

#### Explore the Data

Identify data quality issues, like missing values, duplicate data, etc.
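
As a starting point, the two issues named above (missing values and duplicate records) can be counted with a small helper. The sketch below uses plain Python on a list of dict rows so it runs without a Spark session; `summarize_quality` is a hypothetical helper, and the sample rows only imitate the shape of the `airport_codes` preview (the duplicated third row is made up to trigger the check).

```python
from collections import Counter


def summarize_quality(rows, key_fields):
    """Count missing values per column and duplicated keys in a list of dict rows."""
    missing = Counter()
    key_counts = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat None and empty strings as missing
            if value is None or value == "":
                missing[column] += 1
        key_counts[tuple(row.get(field) for field in key_fields)] += 1
    duplicated_keys = {key: n for key, n in key_counts.items() if n > 1}
    return {
        "n_rows": len(rows),
        "missing_per_column": dict(missing),
        "duplicated_keys": duplicated_keys,
    }


# Illustrative rows only; the third one is an invented duplicate
sample_rows = [
    {"ident": "00A", "iata_code": None, "municipality": "Bensalem"},
    {"ident": "00AA", "iata_code": None, "municipality": "Leoti"},
    {"ident": "00A", "iata_code": None, "municipality": "Bensalem"},
]
report = summarize_quality(sample_rows, key_fields=["ident"])
print(report)
```

On the full tables the same counts would more likely be computed with DataFrame aggregations, since collecting 40 million immigration records into Python is not practical.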

#### Cleaning Steps

Document steps necessary to clean the data


In [None]:
# Performing cleaning tasks here
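
A few cleaning steps suggested by the previews can be sketched as small, testable functions. This is a sketch under stated assumptions, not the project's actual cleaning code: it assumes `arrdate` and `depdate` are SAS day counts (days since 1960-01-01), which is consistent with the preview, where 20636 maps to 2016-07-01 and the row has `i94yr` 2016 and `i94mon` 7. It also assumes `dtaddto` encodes MMDDYYYY with the leading zero lost, and that `Latitude` and `Longitude` use an N/S/E/W suffix. All three function names are hypothetical.

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day


def sas_days_to_date(value):
    """Convert a SAS day count (e.g. 20636.0) to a date; None stays None."""
    if value is None:
        return None
    return SAS_EPOCH + timedelta(days=int(value))


def parse_mmddyyyy(value):
    """Parse a number like 9282016.0, assumed to mean MMDDYYYY; None stays None."""
    if value is None:
        return None
    text = str(int(value)).zfill(8)  # restore the leading zero lost in numeric form
    return datetime.strptime(text, "%m%d%Y").date()


def parse_coordinate(text):
    """Turn '57.05N' or '10.33E' style strings into signed decimal degrees."""
    magnitude, hemisphere = float(text[:-1]), text[-1].upper()
    return -magnitude if hemisphere in ("S", "W") else magnitude


print(sas_days_to_date(20636.0))
print(parse_mmddyyyy(9282016.0))
print(parse_coordinate("57.05N"), parse_coordinate("10.33E"))
```

Values that do not follow these assumed encodings would raise an error here, which is useful while exploring but would need explicit handling in a pipeline.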

---

### Step 3: Define the Data Model

#### 3.1 Conceptual Data Model

Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines

List the steps necessary to pipeline the data into the chosen data model


---

### Step 4: Run Pipelines to Model the Data

#### 4.1 Create the data model

Build the data pipelines to create the data model.


In [None]:
# Write code here

#### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

- Integrity constraints on the relational database (e.g., unique key, data type, etc.)
- Unit tests for the scripts to ensure they are doing the right thing
- Source/Count checks to ensure completeness
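
Two of the checks listed above, a unique-key constraint and a source/count comparison, can be sketched as follows. The function names are hypothetical, and the count check assumes that no rows are intentionally dropped between source and target; if cleaning removes records, the expected count would need to account for that.

```python
def check_unique_key(rows, key_fields):
    """Raise ValueError if any key component is None or a key appears twice."""
    seen = set()
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if any(part is None for part in key):
            raise ValueError(f"Null key component in {key}")
        if key in seen:
            raise ValueError(f"Duplicate key {key}")
        seen.add(key)


def check_row_count(source_count, target_count, table_name):
    """Raise ValueError if the target is empty or its count differs from the source."""
    if target_count == 0:
        raise ValueError(f"{table_name}: target table is empty")
    if target_count != source_count:
        raise ValueError(
            f"{table_name}: expected {source_count} rows, found {target_count}"
        )


# Passes silently: keys are unique and the counts match the 2891 previewed records
check_unique_key([{"ident": "00A"}, {"ident": "00AA"}], ["ident"])
check_row_count(2891, 2891, "us_demographics")
```

Raising an exception rather than printing a warning makes a failed check stop the pipeline run, so a broken load cannot go unnoticed.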

Run Quality Checks


In [None]:
# Perform quality checks here

#### 4.3 Data dictionary

Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.


---

### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
- Propose how often the data should be updated and why.
- Write a description of how you would approach the problem differently under the following scenarios:
  - The data was increased by 100x.
  - The data populates a dashboard that must be updated on a daily basis by 7am every day.
  - The database needed to be accessed by 100+ people.
