# [Project Title]

### Data Engineering Capstone Project

#### Project Summary

--describe your project at a high level--

The project follows these steps:

- 1: Scope the Project and Gather Data
- 2: Explore and Assess the Data
- 3: Define the Data Model
- 4: Run ETL to Model the Data
- 5: Complete Project Write Up


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Do all imports and installs here
import sys
import pandas as pd
from pathlib import Path

In [3]:
src_path: str = "../src"
sys.path.append(src_path)

In [4]:
from utils.io import extract_sas_map

In [5]:
data_path: Path = Path("../data")

---

### 1. Scope the Project and Gather Data


#### 1.1. Scope

_Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?_

**The goal of this project is to find possible correlations between immigrants' destinations, the demographics of the destination cities, the availability of airports in those destinations, and their average yearly temperature.**

**Execution plan:**

1. Data is uploaded to S3 buckets.
2. Data is extracted, transformed, and loaded (ETL) into a Redshift database through an Airflow DAG.
   1. The Airflow DAG uses operators that run Spark jobs on EMR, if that proves feasible (the setup seems somewhat complicated).
3. Analytics queries are run to answer questions.
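
The ordering implied by the execution plan above can be sketched without Airflow. The snippet below is only a standard-library stand-in, not an Airflow DAG: the task names are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) is used just to show that each step runs after the step it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not taken from the project code).
# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "upload_to_s3": set(),
    "spark_transform": {"upload_to_s3"},
    "load_to_redshift": {"spark_transform"},
    "run_analytics_queries": {"load_to_redshift"},
}


def execution_order(pipeline):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(" -> ".join(order))
```

In an actual Airflow DAG the same dependencies would be declared between operators; the dictionary here only makes the intended sequence explicit.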


...


#### 1.2. Describe and Gather Data

The following datasets are included in the project:

- **I94 Immigration Data**: This data comes from the US National Tourism and Trade Office. A data dictionary
- **World Temperature Data**: ...
- **U.S. City Demographic Data**: ...
- **Airport Code Table**: This is a simple table of airport codes and corresponding cities...


In [None]:
sas_file = data_path.joinpath(
    "i94_inmigration_data_2016/I94_SAS_Labels_Descriptions.SAS"
)

sas_file_content = sas_file.read_text().replace("\t", "")

In [None]:
import re

x = re.findall(r"value [^;]*", sas_file_content)
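
The `value [^;]*` pattern above isolates each label block of the SAS file. As a minimal sketch of what could come next, the helper below turns those blocks into code-to-label dictionaries. The function name `parse_value_blocks` is hypothetical (it is not the project's `extract_sas_map`), and the sample text is an assumed layout of one block after tab removal, not a verbatim excerpt of the file.

```python
import re

# Assumed layout of a single `value` block, with tabs already stripped.
SAMPLE = """
value i94model
1 = 'Air'
2 = 'Sea'
3 = 'Land'
9 = 'Not reported' ;
"""

# One `code = 'label'` pair per line; the code may or may not be quoted.
PAIR_PATTERN = re.compile(r"^\s*'?([^'=\n]+?)'?\s*=\s*'([^']*)'", re.MULTILINE)


def parse_value_blocks(text):
    """Map each `value <name>` block to a {code: label} dictionary."""
    blocks = {}
    for block in re.findall(r"value [^;]*", text):
        # The first line holds the block name, the rest holds the pairs.
        header, _, body = block.partition("\n")
        name = header.split()[1]
        blocks[name] = {
            code.strip(): label.strip()
            for code, label in PAIR_PATTERN.findall(body)
        }
    return blocks


labels = parse_value_blocks(SAMPLE)
print(labels["i94model"])
```

Codes are kept as strings so that numeric and text codes can be handled the same way; casting to integers is left to the caller.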

In [14]:
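file_path = data_path.joinpath("i94_inmigration_data_2016/data_sample.csv")
# Read the first column as the index so it is not duplicated on write;
# pandas infers bz2 compression from the ".bz2" extension.
pd.read_csv(file_path, index_col=0).to_csv(file_path.with_suffix(".csv.bz2"))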

In [None]:
sas_file_content.split(";")

In [None]:
i94mode

#### 1.3. Datasets preview


##### 1.3.1. I94 Immigration Data


In [None]:
i94_df = pd.read_csv(
    data_path.joinpath("i94_inmigration_data_2016").joinpath("data_sample.csv"),
    index_col=0,
)
print(i94_df.columns)
i94_df

##### 1.3.2. World Temperature Data


In [None]:
temp_df = pd.read_csv(
    data_path.joinpath("global_land_temperature_by_city.csv"),
    index_col=0,
).dropna(subset=["AverageTemperature"])
print(temp_df.columns)
temp_df

In [None]:
temp_df.value_counts(["City", "Country"])

##### 1.3.3. U.S. City Demographic Data


In [None]:
us_dem_df = pd.read_csv(data_path.joinpath("us_cities_demographics.csv"), sep=";")
print(us_dem_df.columns)
us_dem_df

##### 1.3.4. Airport Codes


In [None]:
airp_df = pd.read_csv(data_path.joinpath("airport_codes.csv"), index_col=0)
print(airp_df.columns)
airp_df

In [None]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config(
        "spark.jars.repositories", "https://repos.spark-packages.org/"
    )
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)

df_spark = spark.read.format("com.github.saurfang.sas.spark").load(
    "../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"
)

In [None]:
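# Write to parquet; overwrite so that re-running the cell does not fail
# when the "sas_data" directory already exists
df_spark.write.mode("overwrite").parquet("sas_data")
df_spark = spark.read.parquet("sas_data")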

### Step 2: Explore and Assess the Data

#### Explore the Data

Identify data quality issues, like missing values, duplicate data, etc.
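
As a minimal, dependency-free sketch of the two checks named above, the snippet below counts missing values per column and fully duplicated rows. The rows and column names are made up for illustration and do not come from the project datasets; `profile_rows` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; column names are hypothetical.
SAMPLE_CSV = """city,state,population
Austin,TX,931830
Austin,TX,931830
Boise,ID,
"""


def profile_rows(rows):
    """Return (missing values per column, number of duplicated rows)."""
    missing = Counter()
    seen = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat absent and blank fields as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
        seen[tuple(row.items())] += 1
    # Every occurrence after the first counts as a duplicate.
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return dict(missing), duplicates


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing, duplicates = profile_rows(rows)
print(missing, duplicates)
```

On the pandas DataFrames loaded earlier, `df.isna().sum()` and `df.duplicated().sum()` give the same two figures.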

#### Cleaning Steps

Document steps necessary to clean the data


In [None]:
# Performing cleaning tasks here

### Step 3: Define the Data Model

#### 3.1 Conceptual Data Model

Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines

List the steps necessary to pipeline the data into the chosen data model


### Step 4: Run Pipelines to Model the Data

#### 4.1 Create the data model

Build the data pipelines to create the data model.


In [None]:
# Write code here

#### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

- Integrity constraints on the relational database (e.g., unique key, data type, etc.)
- Unit tests for the scripts to ensure they are doing the right thing
- Source/Count checks to ensure completeness
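
A count check and a key-uniqueness check can be sketched as plain SQL queries. The example below uses an in-memory SQLite database purely as a stand-in for Redshift, with a hypothetical `airports` table and `ident` key column; `run_quality_checks` is an illustrative helper, not part of the project code. Query-based checks like these matter on Redshift because its primary key and unique constraints are informational and not enforced.

```python
import sqlite3


def run_quality_checks(conn, table, key, expected_rows):
    """Return a list of failure messages; an empty list means success.

    `table` and `key` are interpolated into SQL, so they must come from
    trusted code, never from user input.
    """
    failures = []
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total != expected_rows:
        failures.append(f"{table}: expected {expected_rows} rows, found {total}")

    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append(f"{table}.{key}: {nulls} NULL key value(s)")

    # COUNT(DISTINCT ...) ignores NULLs, so compare against non-NULL rows.
    distinct = conn.execute(
        f"SELECT COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()[0]
    duplicates = total - nulls - distinct
    if duplicates:
        failures.append(f"{table}.{key}: {duplicates} duplicate key value(s)")
    return failures


# Hypothetical table and rows, used only to exercise the checks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports (ident TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO airports VALUES (?, ?)",
    [("A1", "First"), ("A2", "Second"), ("A2", "Duplicate")],
)
failures = run_quality_checks(conn, "airports", "ident", expected_rows=3)
print(failures)
```

The expected row count would normally come from the source file or staging table, so that the check compares the source against the loaded target.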

Run Quality Checks


In [None]:
# Perform quality checks here

#### 4.3 Data dictionary

Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.


### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
- Propose how often the data should be updated and why.
- Write a description of how you would approach the problem differently under the following scenarios:
  - The data was increased by 100x.
  - The data populates a dashboard that must be updated on a daily basis by 7am every day.
  - The database needed to be accessed by 100+ people.
