# Capstone Project
### Data Engineering Capstone Project

#### Project Summary
The objective of this project is to build a data warehouse from several datasets, such as weather, travel, and tourism information.
The warehouse is intended to let business analysts identify trends, understand the seasonality of visits to the United States, and answer business questions.
The project applies lessons learned during the course, such as Spark and data modeling, among others.  


The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import pandas as pd
import os
from pyspark.sql import SparkSession

In [2]:
# Declare the dataset paths

project_dir = os.path.abspath("./../data")

full_path_immigration  = os.path.join(project_dir, "immigration_data_sample.csv")
full_path_temperature  = os.path.join(project_dir, "GlobalTemperature/GlobalLandTemperaturesByCity.csv")
full_path_demographics = os.path.join(project_dir, "us-cities-demographics.csv")
full_path_i94_sas      = os.path.join(project_dir, "I94_SAS_Labels_Descriptions.sas")

### Step 1: Scope the Project and Gather Data

#### Scope 
In this project, I explore US temperature, demographics, and tourism data and store it in a data warehouse using a star schema. Pandas and Spark are used to explore the datasets.

#### Describe and Gather Data 

- **I94 Immigration Data**: This data comes from the US National Tourism and Trade Office;
- **World Temperature Data**: This dataset comes from Kaggle. It contains average temperature information for countries and cities;
- **U.S. City Demographic Data**: This dataset presents information on each city's population, such as median age, population counts by gender, and number of foreign-born residents, among others.



### Step 2: Explore and Assess the Data
#### Explore the Data 

- The Pandas library was used to explore the data and understand the proposed model.
- The dimensional model was used to split the data into fact and dimension tables and to decide which data types to change.
- The PySpark library was used to read the SAS data and inspect its values and data types.


#### Cleaning Steps

- The I94_SAS_Labels_Descriptions.SAS file was inspected in order to build the lookup (cross-reference) tables that allow the tables in the data model to be related.
- The city and state columns in the demographics dataset were converted to uppercase.



#### Immigration data

**I94 Immigration Data**: This data comes from the US National Tourism and Trade Office.

In [3]:
df_immigration = pd.read_csv(full_path_immigration)

df_immigration.head(3)

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT


In [4]:
df_immigration.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1000 non-null   int64  
 1   cicid       1000 non-null   float64
 2   i94yr       1000 non-null   float64
 3   i94mon      1000 non-null   float64
 4   i94cit      1000 non-null   float64
 5   i94res      1000 non-null   float64
 6   i94port     1000 non-null   object 
 7   arrdate     1000 non-null   float64
 8   i94mode     1000 non-null   float64
 9   i94addr     941 non-null    object 
 10  depdate     951 non-null    float64
 11  i94bir      1000 non-null   float64
 12  i94visa     1000 non-null   float64
 13  count       1000 non-null   float64
 14  dtadfile    1000 non-null   int64  
 15  visapost    382 non-null    object 
 16  occup       4 non-null      object 
 17  entdepa     1000 non-null   object 
 18  entdepd     954 non-null    object 
 19  entdepu     0 non-null      

In [5]:
df_fact_immigration = df_immigration[['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94port', 'i94addr', 'arrdate', 
	'depdate', 'i94mode', 'i94visa']].copy()
df_fact_immigration.columns = ['cic_id', 'year', 'month', 'city_code', 'cod_port', 'cod_state', 'arrival_date', 
	'departure_date', 'mode', 'visa']
df_fact_immigration['country'] = 'United States'
df_fact_immigration.head(3)

Unnamed: 0,cic_id,year,month,city_code,cod_port,cod_state,arrival_date,departure_date,mode,visa,country
0,4084316.0,2016.0,4.0,209.0,HHW,HI,20566.0,20573.0,1.0,2.0,United States
1,4422636.0,2016.0,4.0,582.0,MCA,TX,20567.0,20568.0,1.0,2.0,United States
2,1195600.0,2016.0,4.0,148.0,OGG,FL,20551.0,20571.0,1.0,2.0,United States


In [6]:
df_dim_immigration_person = df_immigration[['cicid', 'i94cit', 'i94res', 'biryear', 'gender', 'insnum']].copy()
df_dim_immigration_person.columns = ['cic_id', 'citizen_country', 'residence_country', 'birth_year', 'gender', 'ins_num']
df_dim_immigration_person.head(5)

Unnamed: 0,cic_id,citizen_country,residence_country,birth_year,gender,ins_num
0,4084316.0,209.0,209.0,1955.0,F,
1,4422636.0,582.0,582.0,1990.0,M,
2,1195600.0,148.0,112.0,1940.0,M,
3,5291768.0,297.0,297.0,1991.0,M,
4,985523.0,111.0,111.0,1997.0,F,


In [7]:
df_dim_immigration_airline = df_immigration[['cicid', 'airline', 'admnum', 'fltno', 'visatype']].copy()
df_dim_immigration_airline.columns = ['cic_id', 'airline', 'admin_num', 'flight_number', 'visa_type']
df_dim_immigration_airline.head(5)

Unnamed: 0,cic_id,airline,admin_num,flight_number,visa_type
0,4084316.0,JL,56582670000.0,00782,WT
1,4422636.0,*GA,94362000000.0,XBLNG,B2
2,1195600.0,LH,55780470000.0,00464,WT
3,5291768.0,QR,94789700000.0,00739,B2
4,985523.0,,42322570000.0,LAND,WT


#### Temperature Dataset

**World Temperature Data**: This dataset comes from Kaggle. It contains average temperature information for countries and cities.

In [8]:
# Read the temperature dataset
df_temperature = pd.read_csv(full_path_temperature)

# Preview the temperature dataset
df_temperature.head(3)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E


In [9]:
df_temperature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   dt                             object 
 1   AverageTemperature             float64
 2   AverageTemperatureUncertainty  float64
 3   City                           object 
 4   Country                        object 
 5   Latitude                       object 
 6   Longitude                      object 
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [10]:
df_dim_temperature = df_temperature.copy()
df_dim_temperature.columns = ['measurement_date', 'average_temp', 'average_temperature_uncertainty', 'city', 'country','latitude', 'longitude']
df_dim_temperature.head(5)

Unnamed: 0,measurement_date,average_temp,average_temperature_uncertainty,city,country,latitude,longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [11]:
# Include the month and year of each measurement

df_dim_temperature['measurement_date'] = pd.to_datetime(df_dim_temperature['measurement_date'])
df_dim_temperature['measuremnt_year'] = df_dim_temperature['measurement_date'].dt.year
df_dim_temperature['measuremnt_month'] = df_dim_temperature['measurement_date'].dt.month
df_dim_temperature.head()

Unnamed: 0,measurement_date,average_temp,average_temperature_uncertainty,city,country,latitude,longitude,measuremnt_year,measuremnt_month
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E,1743,11
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E,1743,12
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E,1744,1
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E,1744,2
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E,1744,3


In [12]:
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

# df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

https://repos.spark-packages.org/ added as a remote repository with the name: repo-1
Ivy Default Cache set to: /Users/renatomeira/.ivy2/cache
The jars for the packages stored in: /Users/renatomeira/.ivy2/jars
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fc5046f8-cca7-418d-b159-d87d7c6222cf;1.0
	confs: [default]
	found saurfang#spark-sas7bdat;2.0.0-s_2.11 in spark-packages
	found com.epam#parso;2.0.8 in central
	found org.slf4j#slf4j-api;1.7.5 in central
	found org.apache.logging.log4j#log4j-api-scala_2.11;2.7 in central
	found org.scala-lang#scala-reflect;2.11.8 in central
:: resolution report :: resolve 126ms :: artifacts dl 5ms
	:: modules in use:
	com.epam#parso;2.0.8 from central in [default]
	org.apache.logging.log4j#log4j-api-scala_2.11;2.7 from central in [default]
	org.scala-lang#scala-reflect;2.11.8 from central in [default]
	org.slf4j#slf4j-api;1.7.5 from central in [default]
	saurfang#spark-sas7bdat;2.0.0-s_2

:: loading settings :: url = jar:file:/opt/homebrew/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
22/10/24 00:46:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [16]:
# Write to parquet
# df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet('../data/sas_data')
df_spark.limit(5).toPandas()

                                                                                

22/10/24 00:46:57 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,...,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,...,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1
2,5748519.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20582.0,...,,M,1987.0,10292016,M,,DL,94956410000.0,40,B1
3,5748520.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1987.0,10292016,F,,DL,94956450000.0,40,B1
4,5748521.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1988.0,10292016,M,,DL,94956390000.0,40,B1


In [17]:
df_spark.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### U.S. City Demographic Data

**U.S. City Demographic Data**: This dataset presents information on each city's population, such as median age, population counts by gender, and number of foreign-born residents, among others.

In [18]:
df_demographics = pd.read_csv(full_path_demographics, delimiter=";")
df_demographics.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [19]:
df_demographics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   City                    2891 non-null   object 
 1   State                   2891 non-null   object 
 2   Median Age              2891 non-null   float64
 3   Male Population         2888 non-null   float64
 4   Female Population       2888 non-null   float64
 5   Total Population        2891 non-null   int64  
 6   Number of Veterans      2878 non-null   float64
 7   Foreign-born            2878 non-null   float64
 8   Average Household Size  2875 non-null   float64
 9   State Code              2891 non-null   object 
 10  Race                    2891 non-null   object 
 11  Count                   2891 non-null   int64  
dtypes: float64(6), int64(2), object(4)
memory usage: 271.2+ KB


In [20]:
df_dim_demographics = df_demographics[['City', 'State', 'Median Age', 'Male Population', 'Female Population',
                                        'Total Population', 'Number of Veterans', 'Foreign-born',
                                        'Average Household Size', 'State Code', 'Race', 'Count']
                                   ].copy()
df_dim_demographics.columns = ['city', 'state','median_age', 'male_population', 'female_population', 'total_population', 
                                   'number_veterans', 'foreign_born', 'average_household_size', 'cod_state', 'race', 'count']
df_dim_demographics.head(3)

Unnamed: 0,city,state,median_age,male_population,female_population,total_population,number_veterans,foreign_born,average_household_size,cod_state,race,count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759


##### The I94_SAS_Labels_Descriptions.SAS file was inspected in order to build the lookup (cross-reference) tables that allow the tables in the data model to be related.

In [21]:
with open(full_path_i94_sas) as f:
    contents = f.readlines()

country_code = {}
for countries in contents[10:298]:
    pair = countries.split('=')
    code, country = pair[0].strip(), pair[1].strip().strip("'")
    country_code[code] = country

df_country_code = pd.DataFrame(list(country_code.items()), columns=['code', 'country'])
df_country_code.head(5)

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA


In [22]:
city_code = {}
for cities in contents[303:962]:
    pair = cities.split('=')
    code, city = pair[0].strip("\t").strip().strip("'"), pair[1].strip('\t').strip().strip("''")
    city_code[code] = city

df_city_code = pd.DataFrame(list(city_code.items()), columns=['code', 'city'])
df_city_code.head(5)

Unnamed: 0,code,city
0,ANC,"ANCHORAGE, AK"
1,BAR,"BAKER AAF - BAKER ISLAND, AK"
2,DAC,"DALTONS CACHE, AK"
3,PIZ,"DEW STATION PT LAY DEW, AK"
4,DTH,"DUTCH HARBOR, AK"


In [23]:
state_code = {}
for states in contents[982:1036]:
    pair = states.split('=')
    code, state = pair[0].strip('\t').strip("'"), pair[1].strip().strip("'")
    state_code[code] = state

df_state_code = pd.DataFrame(list(state_code.items()), columns=['code', 'state'])
df_state_code.head(5)

Unnamed: 0,code,state
0,AK,ALASKA
1,AZ,ARIZONA
2,AR,ARKANSAS
3,CA,CALIFORNIA
4,CO,COLORADO
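
The three cells above select each mapping by fixed line numbers (`contents[10:298]`, `contents[303:962]`, `contents[982:1036]`), which breaks if the label file gains or loses a line. As a possible alternative, the sketch below locates a block by its `value <name>` header instead. The helper `parse_sas_labels`, the regular expression, and the block names in the sample text are assumptions made for illustration; they are not taken from the project's `etl.py`, and the sample only imitates the layout of the label file.

```python
import re

# Matches lines such as "   236 =  'AFGHANISTAN'" or "\t'AK'='ALASKA'".
# The pattern is an assumption about the label file layout, not a full SAS parser.
LABEL_LINE = re.compile(r"^\s*'?([^'=\s]+)'?\s*=\s*'(.*?)\s*'")


def parse_sas_labels(lines, section_name):
    """Return {code: label} for the block that starts with `value <section_name>`."""
    labels = {}
    in_section = False
    for line in lines:
        tokens = line.strip().split()
        if len(tokens) >= 2 and tokens[0].lower() == "value":
            # A new "value <name>" block starts; remember whether it is the requested one
            in_section = tokens[1].lower() == section_name.lower()
            continue
        if not in_section:
            continue
        match = LABEL_LINE.match(line)
        if match:
            labels[match.group(1)] = match.group(2)
        if line.rstrip().endswith(";"):
            # A trailing semicolon closes the block
            in_section = False
    return labels


# Hypothetical excerpt that imitates the label file (not the real file contents)
sample = (
    "/* I94CIT & I94RES */\n"
    "  value i94cntyl\n"
    "   236 =  'AFGHANISTAN'\n"
    "   101 =  'ALBANIA'\n"
    ";\n"
    "value $i94prtl\n"
    "   'ANC'\t=\t'ANCHORAGE, AK         '\n"
    "   'BAR'\t=\t'BAKER AAF - BAKER ISLAND, AK'\n"
    ";\n"
    "value i94addrl\n"
    "\t'AK'='ALASKA'\n"
    "\t'AZ'='ARIZONA'\n"
    ";\n"
).splitlines()

countries = parse_sas_labels(sample, "i94cntyl")
ports = parse_sas_labels(sample, "$i94prtl")
states = parse_sas_labels(sample, "i94addrl")
print(countries, ports, states, sep="\n")
```

Each resulting dictionary can be turned into a lookup table the same way as above, for example `pd.DataFrame(list(states.items()), columns=['code', 'state'])`.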


##### The city and state columns in the demographics dataset were converted to uppercase.

In [24]:
df_dim_demographics['city'] = df_dim_demographics['city'].str.upper()
df_dim_demographics['state'] = df_dim_demographics['state'].str.upper()
df_dim_demographics.head(5)

Unnamed: 0,city,state,median_age,male_population,female_population,total_population,number_veterans,foreign_born,average_household_size,cod_state,race,count
0,SILVER SPRING,MARYLAND,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,QUINCY,MASSACHUSETTS,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,HOOVER,ALABAMA,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,RANCHO CUCAMONGA,CALIFORNIA,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,NEWARK,NEW JERSEY,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402
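
Uppercasing is meant to make the demographics rows match the `state_code` lookup. One way to verify that the match actually works is a left join with pandas' `indicator=True` option, which flags rows that found no partner. The sketch below uses tiny stand-in frames with illustrative values, not the real datasets.

```python
import pandas as pd

# Tiny stand-ins for df_dim_demographics and df_state_code (values are illustrative)
demographics = pd.DataFrame({
    "city": ["Silver Spring", "Quincy", "San Juan"],
    "state": ["Maryland", "Massachusetts", "Puerto Rico"],
})
state_code = pd.DataFrame({
    "code": ["MD", "MA"],
    "state": ["MARYLAND", "MASSACHUSETTS"],
})

# Same normalization as in the notebook cell above
demographics["state"] = demographics["state"].str.upper()

# A left join with indicator=True adds a "_merge" column that marks unmatched rows
joined = demographics.merge(state_code, on="state", how="left", indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "state"].unique().tolist()
print(unmatched)  # ['PUERTO RICO']
```

An empty `unmatched` list would indicate that every demographics row can be related to the lookup table.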


##### Convert SAS numeric dates to datetime

In [25]:
def _SAS_to_datetime(date):
    # SAS stores dates as the number of days since 1960-01-01
    return pd.to_timedelta(date, unit='D') + pd.Timestamp('1960-1-1')

In [26]:
df_fact_immigration['arrival_date'] = _SAS_to_datetime(df_fact_immigration['arrival_date'])
df_fact_immigration['departure_date'] = _SAS_to_datetime(df_fact_immigration['departure_date'])
df_fact_immigration.head(5)

Unnamed: 0,cic_id,year,month,city_code,cod_port,cod_state,arrival_date,departure_date,mode,visa,country
0,4084316.0,2016.0,4.0,209.0,HHW,HI,2016-04-22,2016-04-29,1.0,2.0,United States
1,4422636.0,2016.0,4.0,582.0,MCA,TX,2016-04-23,2016-04-24,1.0,2.0,United States
2,1195600.0,2016.0,4.0,148.0,OGG,FL,2016-04-07,2016-04-27,1.0,2.0,United States
3,5291768.0,2016.0,4.0,297.0,LOS,CA,2016-04-28,2016-05-07,1.0,2.0,United States
4,985523.0,2016.0,4.0,111.0,CHM,NY,2016-04-06,2016-04-09,3.0,2.0,United States
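
As a small worked example of the conversion, the sketch below applies the same arithmetic to the first row's values (`20566.0` and `20573.0`) plus one missing value, since `depdate` has only 951 non-null entries in the sample. The function name `sas_to_datetime` and the constant `SAS_EPOCH` are local to this example.

```python
import pandas as pd

# SAS counts days from 1 January 1960
SAS_EPOCH = pd.Timestamp("1960-01-01")


def sas_to_datetime(days):
    """Convert SAS day counts (scalar or Series) to pandas datetimes."""
    return pd.to_timedelta(days, unit="D") + SAS_EPOCH


# Arrival and departure of the first sample row, plus a missing departure date
raw = pd.Series([20566.0, 20573.0, None])
converted = sas_to_datetime(raw)
print(converted)
# 0   2016-04-22
# 1   2016-04-29
# 2          NaT
```

Missing day counts become `NaT` instead of raising an error, so rows without a departure date remain in the table.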


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The data model is a data warehouse built on a star schema. Data analysts and data scientists will use this model to better understand the business.

#### Star Schema

##### Dimension Tables:


**df_dim_immigration_person**

|COLUMN		|  TYPE	|
|---		|  ---		|
|cic_id			|  int 	| 
|citizen_country		|  int	| 
|residence_country		|  int	| 
|birth_year		|  int		| 
|gender		|  varchar(1)		| 
|ins_num		|  varchar		|

**df_dim_immigration_airline**

|COLUMN	|  TYPE  	|
| --- | -- |
|cic_id		|  int	|
|airline		|  varchar		|
|admin_num			|  float 	| 
|flight_number		|  varchar	| 
|visa_type		|  varchar		| 


**df_country_code**

|COLUMN | TYPE |
| --- | --- |
|code 		|  int		|
|country			|  varchar 	| 

**df_city_code**

|COLUMN | TYPE |
| --- | --- |
|code 		|  varchar		|
|city			|  varchar 	| 

**df_state_code**

|COLUMN | TYPE |
| --- | --- |
|code 		|  varchar		|
|state			|  varchar 	| 



##### Fact Tables:


**df_fact_immigration**


| COLUMN  		| TYPE  	|
|	---			|	---		|
|cic_id	|  int  	|
|year		|  int	|
|month		|  int		|
|city_code		|  int		|
|cod_port			|  varchar 	| 
|cod_state		|  varchar	| 
|arrival_date		|  timestamp	| 
|departure_date	|  timestamp		| 
|mode		|  int		| 
|visa		|  int		|
|country	|  varchar  	|



**df_fact_temperature**

| COLUMN  		| TYPE  	|
|	---			|	---		|
|measurement_date	|  timestamp  	|
|average_temp		|  float	|
|average_temperature_uncertainty		|  float		|
|city			|  varchar 	| 
|country		|  varchar	| 
|latitude		|  varchar	| 
|longitude		|  varchar	| 
|measuremnt_year		|  int	| 
|measuremnt_month		|  int		|


**df_fact_demographics**

|COLUMN | TYPE |
| --- | --- |
|city		|  varchar		|
|state			|  varchar 	| 
|median_age		|  float	| 
|male_population		|  float	| 
|female_population		|  float		| 
|total_population		|  float		| 
|number_veterans		|  float		|
|foreign_born	|  int  	|
|average_household_size		|  float	|
|cod_state		|  varchar		|
|race			|  varchar 	| 
|count		|  bigint	| 



#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model:

1. Extract all datasets from their different sources, in CSV and SAS formats. For example, these sources can be in a bucket or on a server;
2. Run the ETL: separate the data into fact and dimension tables, clean it, and adjust data types;
	1. Parse the I94_SAS_Labels_Descriptions.SAS file to get the auxiliary dimension tables: country_code, city_code, state_code;
	2. Transform city and state in the demographics data to upper case to match the city_code and state_code tables;
	3. Convert SAS numeric dates to datetime;
3. Store this information in the AWS Redshift service.
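
Step 2 above (separating fact and dimension columns and adjusting data types) could be sketched as follows. This is a minimal illustration, not the project's `etl.py`: the helper `select_and_cast`, the column maps, and the choice of pandas' nullable `Int64` dtype are assumptions, and the two input rows are shaped like the immigration sample with one missing value introduced on purpose.

```python
import pandas as pd

# Two illustrative rows shaped like the immigration sample
# (the missing biryear is introduced here to show how nulls are handled)
raw = pd.DataFrame({
    "cicid": [4084316.0, 4422636.0],
    "i94yr": [2016.0, 2016.0],
    "i94mon": [4.0, 4.0],
    "i94port": ["HHW", "MCA"],
    "i94addr": ["HI", "TX"],
    "gender": ["F", "M"],
    "biryear": [1955.0, None],
})

# Source column -> model column, one map per target table (hypothetical subset)
FACT_COLUMNS = {"cicid": "cic_id", "i94yr": "year", "i94mon": "month",
                "i94port": "cod_port", "i94addr": "cod_state"}
PERSON_COLUMNS = {"cicid": "cic_id", "biryear": "birth_year", "gender": "gender"}

# Columns the data model declares as int but the CSV delivers as float
INT_COLUMNS = ["cic_id", "year", "month", "birth_year"]


def select_and_cast(df, column_map):
    """Pick and rename the columns of one table, then cast integer columns."""
    out = df[list(column_map)].rename(columns=column_map)
    for col in INT_COLUMNS:
        if col in out.columns:
            # "Int64" is the nullable integer dtype, so missing values survive the cast
            out[col] = out[col].astype("Int64")
    return out


fact = select_and_cast(raw, FACT_COLUMNS)
person = select_and_cast(raw, PERSON_COLUMNS)
print(fact.dtypes)
print(person)
```

Casting before loading keeps identifiers such as `cic_id` from reaching the warehouse as floating-point values.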


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

1. Refer to [src/etl.py](capstone-project/src/etl.py)

#### 4.2 Data Quality Checks
Run Quality Checks

1. After the ETL process, no table may be empty.
2. The dimension tables must match the proposed model.

Refer to [data_quality_check.ipynb](capstone-project/src/data_quality.ipynb)
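
The two checks above can be expressed as a small function. The sketch below is only an illustration of the idea and is not the content of the referenced notebook; `check_table` is a hypothetical helper, and the frame is a two-row stand-in for `df_state_code`.

```python
import pandas as pd


def check_table(df, name, expected_columns):
    """Raise ValueError if the table is empty or its columns differ from the model."""
    if df.empty:
        raise ValueError(f"{name}: table has no rows")
    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    if missing or extra:
        raise ValueError(f"{name}: missing columns {missing}, unexpected columns {extra}")
    return len(df)


# Two-row stand-in for the state lookup table
df_state_code = pd.DataFrame({"code": ["AK", "AZ"], "state": ["ALASKA", "ARIZONA"]})
row_count = check_table(df_state_code, "df_state_code", ["code", "state"])
print(f"df_state_code passed with {row_count} rows")
```

Raising an exception rather than printing a warning makes a failed check stop the pipeline run.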

#### 4.3 Data dictionary 

Refer to [DataDictionary.md](capstone-project/DataDictionary.md)

### Step 5: Complete Project Write Up


##### Tools and Technologies

1. Python
2. AWS S3
3. Pandas
4. PySpark


## What if?

This section discusses strategies to deal with the following three key scenarios:

1. Data is increased 100x.

If the data increases by 100x, Spark can still handle the workload, but it needs to run on a distributed cluster; in this case we can use Amazon EMR.

2. Data pipeline is run on daily basis by 7 am every day.

In this case, we need a scheduling service that refreshes the data on time, so that the information consumed is accurate and up to date.
As we saw during the course, we can use Apache Airflow, which uses DAGs to define schedules.
There are also other services similar to Apache Airflow, such as Dagster and Luigi.

3. Database needs to be accessed by 100+ users simultaneously.

With cloud services, we can scale our data services both horizontally and vertically.
This means that when the system receives few requests, we can work with a few nodes or even one, and as the number of requests grows, we can request more resources, which is known as auto-scaling.


