# Pipeline for a data warehouse
### Data Engineering Capstone Project

#### Project Summary
This is the capstone project of the Udacity Data Engineering Nanodegree.
The aim of the project is to apply the skills learned during the course.

This project shows how to load data from four different data sources into Spark, transform it, apply quality checks, and store the result in a star schema so that it can be used for BI apps and ad hoc analysis.

With the resulting data model, the number of immigrants can be analyzed in relation to weather, location, and time. An example can be found in `capstone_sample_analysis.ipynb`.

The project follows these steps:
* _Step 1_: Scope the Project and Gather Data
* _Step 2_: Explore and Assess the Data
* _Step 3_: Define the Data Model
* _Step 4_: Run ETL to Model the Data
* _Step 5_: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope 
One of the acceptance criteria for this project is to handle at least 1 million rows and at least two different data sources and file formats, so we use the data sources provided by Udacity.
The scope of this project is to create a star schema source-of-truth database so that it can be used for BI apps and ad hoc analysis.

#### Describe and Gather Data 

The main dataset includes data about 
- _Immigration in the United States of America_ from [here](https://www.trade.gov/national-travel-and-tourism-office).

Supplementary datasets provided are:

- U.S. city _demographics_ from [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)

- _Temperature_ data from [here](https://www.trade.gov/national-travel-and-tourism-office)

- Data on _airport codes_ from [here](https://datahub.io/core/airport-codes#data)

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The data will be used for ad hoc queries and BI apps, so it is represented in a star schema.
The advantage of a star schema is that it is easy to query and easy to understand.
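
To illustrate what "easy to query" means here, the following minimal sketch joins a fact table to one dimension table and aggregates arrivals per port and month. It uses Python's built-in `sqlite3` as a stand-in for the parquet tables so that it runs without Spark. The table and column names follow this project's data model, but the sample rows are made up for illustration only.

```python
import sqlite3

# In-memory database as a stand-in for the parquet tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_immigration (cicid INTEGER, arrival_ts TEXT, i94port TEXT);
    CREATE TABLE dim_time (ts TEXT PRIMARY KEY, year INTEGER, month INTEGER);
    """
)

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO fact_immigration VALUES (?, ?, ?)",
    [
        (1, "2016-04-01 00:00:00", "NYC"),
        (2, "2016-04-01 00:00:00", "MIA"),
        (3, "2016-04-02 00:00:00", "NYC"),
    ],
)
conn.executemany(
    "INSERT INTO dim_time VALUES (?, ?, ?)",
    [
        ("2016-04-01 00:00:00", 2016, 4),
        ("2016-04-02 00:00:00", 2016, 4),
    ],
)

# One join from the fact table to a dimension is enough for a typical question.
arrivals_per_port = conn.execute(
    """
    SELECT f.i94port, t.year, t.month, COUNT(*) AS arrivals
    FROM fact_immigration AS f
    JOIN dim_time AS t ON f.arrival_ts = t.ts
    GROUP BY f.i94port, t.year, t.month
    ORDER BY f.i94port
    """
).fetchall()

print(arrivals_per_port)
```

Because every dimension hangs directly off the fact table, most analytical questions need only one join per dimension involved.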


![star schema](capstone_schema.png)


#### 3.2 Mapping Out Data Pipelines
_List the steps necessary to pipeline the data into the chosen data model._

To pipe the data into the chosen data model, we use Spark. This has the advantage that the transformations can be done in memory before the data is written to the tables.
1. Load only the needed columns from the immigration files into a Spark DataFrame.
2. Transform the arrival and departure dates to timestamps.
3. Create the fact_immigration table from the loaded immigration files and write it to parquet files.
4. Create the dim_immigrant_person table from the loaded immigration files and write it to parquet files.
5. Create the dim_time table from the transformed timestamps.
6. Load the weather data, keep only the records for the United States from 1960 onward, and save them as dim_weather.
7. Load the city data and write it to parquet files as dim_city.
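
The date transformation in step 2 can be sketched in plain Python. SAS numeric dates count days since 1960-01-01, so the conversion is a simple offset. The function name `sas_days_to_timestamp` is hypothetical and this is not necessarily how `etl_capstone` implements the step.

```python
from datetime import datetime, timedelta
from typing import Optional

# SAS stores dates as the number of days since 1960-01-01.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_days_to_timestamp(sas_days: Optional[float]) -> Optional[datetime]:
    """Convert a SAS numeric date (days since 1960-01-01) to a datetime.

    Returns None for missing values so that null departure dates stay null.
    """
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_days_to_timestamp(20545.0))  # 2016-04-01 00:00:00
print(sas_days_to_timestamp(None))     # None
```

In Spark, the same arithmetic could be applied to the `arrdate` and `depdate` columns, for example through a UDF or the built-in `date_add` function.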

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
_Build the data pipelines to create the data model._
The code can be found in `etl.py`.

    %load_ext autoreload
    %autoreload 2
    import pyspark
    from pyspark.sql import SparkSession

    # Create SparkSession
    spark = SparkSession.builder.appName("Capstone Project").getOrCreate()

Step 1: Load the needed immigration data from parquet files

    import etl_capstone

    df_immigration = etl_capstone.get_df_immigration(spark)
    df_immigration.printSchema()

Output:

    root
     |-- cicid: double (nullable = true)
     |-- arrdate: double (nullable = true)
     |-- depdate: double (nullable = true)
     |-- i94cit: double (nullable = true)
     |-- i94res: double (nullable = true)
     |-- i94port: string (nullable = true)
     |-- fltno: string (nullable = true)
     |-- biryear: double (nullable = true)
     |-- gender: string (nullable = true)
     |-- i94visa: double (nullable = true)
     |-- arrival_ts: string (nullable = true)
     |-- departure_ts: string (nullable = true)


Step 2: Create the fact table and write it to a parquet table

    etl_capstone.create_fact_immigrant(df_immigration)

Output:

    Info: Current DF contains 219268 rows
    Info: 7 columnns for current DF.
    root
     |-- cicid: double (nullable = true)
     |-- arrival_ts: string (nullable = true)
     |-- departure_ts: string (nullable = true)
     |-- i94cit: double (nullable = true)
     |-- i94res: double (nullable = true)
     |-- i94port: string (nullable = true)
     |-- fltno: string (nullable = true)

    Info: Immigration fact table written to: tables/fact_immigration


Step 3: Create date dimension table and write it to parquet

    etl_capstone.create_dim_date(df_immigration)

Output:

    root
     |-- ts: string (nullable = true)
     |-- date: string (nullable = true)
     |-- year: integer (nullable = true)
     |-- month: integer (nullable = true)
     |-- week: integer (nullable = true)
     |-- weekday: integer (nullable = true)
     |-- day: integer (nullable = true)
     |-- hour: integer (nullable = true)

    Info: Current DF contains 438536 rows
    Info: 8 columnns for current DF.
    Success: date dimension data validation

    2022-03-01 14:57:32 WARN  MemoryManager:115 - Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
    Scaling row group sizes to 96.54% for 7 writers
    2022-03-01 14:57:32 WARN  MemoryManager:115 - Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
    Scaling row group sizes to 84.47% for 8 writers

    Info: date dimension table written to: tables/dim_time


Step 4: Create city dimension table

    etl_capstone.create_dim_city(spark)

Output:

    root
     |-- city_name: string (nullable = true)
     |-- state: string (nullable = true)
     |-- median_age: double (nullable = true)
     |-- male_population: integer (nullable = true)
     |-- total_population: integer (nullable = true)
     |-- foreign_born: integer (nullable = true)
     |-- average_householdsize: double (nullable = true)
     |-- state_code: integer (nullable = true)
     |-- city_code: string (nullable = true)

    Info: Current DF contains 2891 rows
    Error: There are 9 instead of 8 for the current Dataframe
    Success: city dimension data validation
    2022-03-01 14:58:31 WARN  CSVDataSource:66 - Number of column in CSV header is not equal to number of fields in the schema:
     Header length: 12, schema size: 8
    CSV file: file:///Users/joebsbar/Documents/GitBarbara/data-engineering-nd/capstone-project/data/us-cities-demographics.csv
    Info: city dimension table written to: tables/dim_city


Step 5: Create weather dimension table

    etl_capstone.create_dim_weather(spark)

Output:

    Info: Current DF contains 134925 rows
    Error: There are 11 instead of 7 for the current Dataframe
    Success: temparature dimension data validation

    Info: temperature dimension table written to: tables/dim_temperature


Step 6: Create immigration person dimension table

    etl_capstone.create_dim_immigrant(df_immigration)

Output:

    Info: Current DF contains 219268 rows
    Info: 4 columnns for current DF.
    Success: immigrant person dimension data validation

    2022-03-01 14:59:13 WARN  MemoryManager:115 - Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
    Scaling row group sizes to 96.54% for 7 writers

    Info: immigrant dimension table written to: tables/dim_immigrant_person



#### 4.2 Data Quality Checks 

- We use _schemas_ when reading the data; a schema ensures that each column has the correct data type. See `schemas.py`.
- We use _dropDuplicates()_ to remove duplicate rows.
- We count the number of rows and columns of the output tables. See `utils.py`.
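
In the notebook output above, a column-count mismatch is printed as an error while the step still reports a successful validation. The sketch below shows one way a row and column check could fail loudly instead. It is not the code in `utils.py`: `check_shape`, `DataQualityError`, and `FakeFrame` are hypothetical names, and the only assumption about the input is that it exposes `count()` and `columns`, as a Spark DataFrame does.

```python
from typing import Any


class DataQualityError(Exception):
    """Raised when a table fails a row or column count check."""


def check_shape(df: Any, expected_columns: int, table_name: str) -> int:
    """Validate that df has rows and the expected number of columns.

    df is anything exposing count() and .columns, as a Spark DataFrame does.
    Returns the row count on success and raises DataQualityError otherwise.
    """
    row_count = df.count()
    if row_count == 0:
        raise DataQualityError(f"{table_name}: table is empty")
    column_count = len(df.columns)
    if column_count != expected_columns:
        raise DataQualityError(
            f"{table_name}: {column_count} columns instead of {expected_columns}"
        )
    return row_count


class FakeFrame:
    """Stand-in for a Spark DataFrame so the sketch runs without Spark."""

    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def count(self):
        return len(self._rows)


frame = FakeFrame(rows=[(1, "a"), (2, "b")], columns=["id", "name"])
print(check_shape(frame, expected_columns=2, table_name="demo"))  # 2
```

Raising an exception stops the pipeline before a table with an unexpected shape is written, which a printed message alone does not do.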


#### 4.3 Data dictionary 
_Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file._

#### FACT IMMIGRATION

| column name| description | type | data source | 
| --- | --- | --- | --- |
|cicid | unique number for an immigrant | integer, not nullable | sas_data immigration |
|arrival_ts | timestamp of the arrival date | timestamp | sas_data immigration: transformed from field "arrdate" (SAS numeric) |
|departure_ts | timestamp of the departure date | timestamp | sas_data immigration: transformed from field "depdate" (SAS numeric) |
|i94cit | 3-digit code of the traveler's country of origin (citizenship) | short, not nullable | sas_data immigration |
|i94res | 3-digit code of the country the traveler came from (residence) | short | sas_data immigration |
|i94port | 3-character code of the port of entry in the USA | string | sas_data immigration |
|fltno | flight number of the airline that arrived in the US | string | sas_data immigration |

#### DIM TIME

Datasource: all timestamps are taken from the arrival and departure dates of the fact_immigration table

| column name| description | type |
| --- | --- | --- | 
| ts | unix timestamp, not nullable | ts |
| date | date | string |
| year | year | integer |
| month | month | integer |
| week | week | integer |
| weekday | weekday | integer |
| day | day | integer |
| hour | hour | integer |
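
As a rough sketch of how one `dim_time` row could be derived from a timestamp string, the function below fills the columns of the printed `dim_time` schema. The name `time_dimension_row` is hypothetical, the `YYYY-MM-DD HH:MM:SS` input format is an assumption, and `week` and `weekday` follow the ISO 8601 convention (Monday = 1), which may differ from the convention used in `etl_capstone`.

```python
from datetime import datetime


def time_dimension_row(ts: str) -> dict:
    """Derive the dim_time columns from a 'YYYY-MM-DD HH:MM:SS' string.

    The input format is an assumption; week and weekday use ISO 8601 rules.
    """
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    iso = parsed.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        "ts": ts,
        "date": parsed.strftime("%Y-%m-%d"),
        "year": parsed.year,
        "month": parsed.month,
        "week": iso[1],
        "weekday": iso[2],
        "day": parsed.day,
        "hour": parsed.hour,
    }


row = time_dimension_row("2016-04-01 00:00:00")
print(row["week"], row["weekday"])  # 13 5
```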

#### DIM IMMIGRANT
Data Source:  sas_data  immigration

| column name| description | type |
| --- | --- | --- | 
|cicid|  unique number for the immigrants| int |
|biryear| year of birth| int |
|gender| gender of immigrant| string |
|i94visa| Visa codes collapsed into three categories: 1 = Business, 2 = Pleasure, 3 = Student |int |


#### DIM CITY

Datasource: https://public.opendatasoft.com/explore/dataset/us-cities-demographics

| column name| description | type | 
| --- | --- | --- |
|city_code| code for the city| string |
|city_name| name of the city| string |
|state | state | string |
|male_population | male population| int |
|female_population | female population| int |
|total_population | total population| int |
|foreign_born | number of foreign-born residents | int |
|average_householdsize | average number of people in a household | double |
| state_code | code of the state | int |


#### DIM TEMPERATURE

| column name| description | type |
| --- | --- | --- | 
|dt|date|date|
|AverageTemperature| average temperature |float|
|AverageTemperatureUncertainty| uncertainty of the average temperature |string|
|city_code| code for the city| string |
|City| name of the city | string|
|Country| name of the country | string|
|year| year | integer|
|month| month |integer|
|Longitude| longitude |integer|
|Latitude| latitude |integer|

### Step 5: Complete Project Write Up
1. _Clearly state the rationale for the choice of tools and technologies for the project._
    We chose Apache Spark because:
    - we are able to read in only the columns we need from parquet files
    - we are able to do data cleaning and transformation either with dataframes or sql syntax.
    - it provides an easy to use API
    - it can handle a lot of different file formats 
    - it can handle big amounts of data  <br/>
  

2. _Propose how often the data should be updated and why._
    - It depends on the amount of data and how often the data is updated. If the data is updated every week, it makes sense to run the pipeline every week. <br/>


3. _Write a description of how you would approach the problem differently under the following scenarios:_  <br/>

 3a. _The data was increased by 100x._<br/>
    - The more data there is, the more computing power is needed to process it. Adding more nodes to our cluster would be one way to handle the larger volume.


 3b. _The data populates a dashboard that must be updated on a daily basis by 7am every day._<br/>
    - For scheduled pipelines, a tool like [Airflow](https://airflow.apache.org/docs/) can be used. It has the advantage that it provides a web view, so that non-programmers can also check whether a pipeline ran successfully.


 3c. _The database needed to be accessed by 100+ people._<br/>
    - There are different cloud solutions available for this scenario. We could use [Databricks](https://databricks.com/), which can handle many simultaneous connections.
