This project delivers a data warehouse solution with high-grade data pipelines that are dynamic, built from reusable tasks, monitored, and easy to backfill. Data quality plays a big part when analyses are executed on top of the data warehouse, so tests run against the datasets after the ETL steps have been executed to catch any discrepancies.
Sparkify is a fictional popular digital media service created by Udacity, similar to Spotify or Pandora; many users use their services every day.
It has the following structure:
- Begin execution
- Load stage tables (`Stage_events` and `Stage_songs`): load the data from S3 buckets into staging tables in the Redshift cluster.
- Load data into the fact table (`Load_songplays_fact_table`): transform and load data from the staging tables into the fact table `songplays`.
- Load data into dimension tables: transform and load data into the dimension tables from the staging tables; the `time` table is loaded from the `songplays` table.
- Execute check operations (`Run_data_quality_checks`): perform count operations, defined in the file `etl.py`, to validate that the data has been inserted.
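The task ordering above can be sketched as a plain dependency graph (pure Python, no Airflow required; the dimension task names are assumptions mirroring the DAG structure, not copied from the project's DAG file):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Upstream dependencies for each task, mirroring the pipeline structure above.
dependencies = {
    "Stage_events": {"Begin_execution"},
    "Stage_songs": {"Begin_execution"},
    "Load_songplays_fact_table": {"Stage_events", "Stage_songs"},
    "Load_user_dim_table": {"Load_songplays_fact_table"},
    "Load_song_dim_table": {"Load_songplays_fact_table"},
    "Load_artist_dim_table": {"Load_songplays_fact_table"},
    "Load_time_dim_table": {"Load_songplays_fact_table"},
    "Run_data_quality_checks": {
        "Load_user_dim_table", "Load_song_dim_table",
        "Load_artist_dim_table", "Load_time_dim_table",
    },
}

# A valid execution order respecting every dependency.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

This only illustrates the ordering constraints; in the real DAG the same structure is expressed with Airflow's `>>` dependency operator.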
- Create an IAM Role
    - Go to the Amazon IAM console.
    - Choose Create role.
    - In the AWS Service group, choose `Redshift`.
    - Under Select your use case, choose `Redshift - Customizable`, and then Next: Permissions.
    - On the Attach permissions policies page, choose `AmazonS3ReadOnlyAccess`.
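The console steps above can also be scripted. A minimal sketch using the standard boto3 IAM API (the role name is an assumption; the client is passed in as a parameter so the trust policy can be inspected without AWS credentials):

```python
import json

def redshift_trust_policy() -> dict:
    """Trust policy allowing the Redshift service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def create_redshift_role(iam_client, role_name: str = "myRedshiftRole") -> None:
    """Create the role and attach S3 read-only access (boto3 IAM client)."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(redshift_trust_policy()),
        Description="Allows Redshift clusters to read from S3",
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )
```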
- Create a Security Group
    - Go to the Amazon EC2 console and, under Network and Security in the left navigation pane, select Security Groups.
    - Click on Create Security Group.
    - Enter a security group name and description.
    - Select the `Inbound` tab under Security group rules.
    - Click on Add Rule and enter the following values:
        - Type: Custom TCP Rule.
        - Protocol: TCP.
        - Port Range: 5439. The default port for Amazon Redshift is 5439, but your port might be different. See the note on determining your firewall rules on the earlier "AWS Setup Instructions" page in this lesson.
        - Source: select Custom IP, then type 0.0.0.0/0. Note: 0.0.0.0/0 opens the port to any IP address; alternatively, select a specific range to restrict access.
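The inbound rule above can also be expressed programmatically; a sketch using the standard boto3 EC2 `authorize_security_group_ingress` call (the client and group id are passed in, so the rule itself can be inspected without AWS credentials):

```python
def redshift_ingress_rule(port: int = 5439, cidr: str = "0.0.0.0/0") -> dict:
    """IpPermissions entry matching the inbound rule described above."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "Redshift access"}],
    }

def open_redshift_port(ec2_client, group_id: str) -> None:
    """Attach the rule to an existing security group (boto3 EC2 client)."""
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[redshift_ingress_rule()],
    )
```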
- Launch a Redshift Cluster
    - Sign in to the AWS Management Console and open the Amazon Redshift console.
    - On the Amazon Redshift Dashboard, choose Launch cluster.
    - On the Cluster details page, configure `Cluster identifier`, `Database name`, `Database port`, `Master user name` and `Master user password`, and select the Node type.
    - On the Additional Configuration page, enter the following values:
        - Available IAM roles: the IAM role just created.
        - VPC security groups: the security group created previously.
    - Choose Launch cluster.
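The same launch can be done with boto3's `create_cluster`; a sketch where the identifier, node type, and node count are example assumptions, and the fields mirror the console form above:

```python
def cluster_config(iam_role_arn: str, sg_id: str, password: str) -> dict:
    """Keyword arguments for boto3's redshift create_cluster, mirroring the
    console fields above. Identifier and node values are example assumptions."""
    return {
        "ClusterIdentifier": "redshift-cluster-1",
        "DBName": "dev",
        "Port": 5439,
        "MasterUsername": "awsuser",
        "MasterUserPassword": password,
        "NodeType": "dc2.large",
        "NumberOfNodes": 4,
        "ClusterType": "multi-node",
        "IamRoles": [iam_role_arn],           # "Available IAM roles" field
        "VpcSecurityGroupIds": [sg_id],       # "VPC security groups" field
    }

def launch_cluster(redshift_client, iam_role_arn: str, sg_id: str, password: str) -> None:
    """Launch the cluster (boto3 Redshift client)."""
    redshift_client.create_cluster(**cluster_config(iam_role_arn, sg_id, password))
```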
- Start Airflow: run the `/opt/airflow/start.sh` command to start the Airflow webserver and wait for it to be ready.
- Go to the Airflow UI
- Configure Connections (Click on the Admin tab and select Connections)
    - Configure AWS Connection
        - Conn Id: Enter `aws_credentials`.
        - Conn Type: Enter `Amazon Web Services`.
        - Login: Enter your Access key ID from the IAM User credentials.
        - Password: Enter your Secret access key from the IAM User credentials.
    - Configure Redshift Connection
        - Conn Id: Enter `redshift`.
        - Conn Type: Enter `Postgres`.
        - Host: Enter the endpoint of your Redshift cluster, excluding the port at the end.
        - Schema: Enter `dev`. This is the Redshift database you want to connect to.
        - Login: Enter `awsuser`.
        - Password: Enter the password you created when launching your Redshift cluster.
        - Port: Enter `5439`.
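As an alternative to the UI, Airflow also reads connections from `AIRFLOW_CONN_<CONN_ID>` environment variables in URI form. A sketch of the two URIs the steps above describe (the hostname in the example is hypothetical; secrets must be URL-encoded):

```python
from urllib.parse import quote

def redshift_conn_uri(host: str, password: str, user: str = "awsuser",
                      db: str = "dev", port: int = 5439) -> str:
    """URI for AIRFLOW_CONN_REDSHIFT; fields match the connection form above."""
    return f"postgres://{user}:{quote(password, safe='')}@{host}:{port}/{db}"

def aws_conn_uri(access_key: str, secret_key: str) -> str:
    """URI for AIRFLOW_CONN_AWS_CREDENTIALS (login/password only, no host)."""
    return f"aws://{quote(access_key, safe='')}:{quote(secret_key, safe='')}@"

# Hypothetical endpoint for illustration:
uri = redshift_conn_uri("mycluster.abc123.us-west-2.redshift.amazonaws.com", "p@ss")
```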
- Configure Variables (Click on the Admin tab and select Variables)
    - S3 bucket
        - Key: Enter `s3_bucket`.
        - Val: Enter `udacity-dend`.
    - AWS region
        - Key: Enter `aws_region`.
        - Val: Enter `us-west-2`.
    - Truncate tables
        - Key: Enter `truncate_tables`.
        - Val: Enter `True` if you expect the dimension tasks to use delete-load functionality, or `False` for append-only functionality.
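The three variables above can also be loaded in bulk from a JSON file; a sketch that writes a file in the format accepted by `airflow variables import <file>` (Airflow 2.x; Airflow 1.10 uses `airflow variables -i`):

```python
import json

# The three variables described in the steps above.
variables = {
    "s3_bucket": "udacity-dend",
    "aws_region": "us-west-2",
    "truncate_tables": "True",  # or "False" for append-only loads
}

with open("variables.json", "w") as f:
    json.dump(variables, f, indent=2)
```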
The environment needed for this project:
- Python 3.6
- Apache Airflow
- AWS account
- Check that you have a Redshift cluster; see the `Set up for Redshift cluster` section.
- Configure Airflow; see the `Airflow configuration` section.
- dags
    - create_tables.sql: SQL script that contains the creation of the Redshift tables.
    - udac_example_dag.py: Python script with the DAG definition.
- plugins
    - helpers
        - __init__.py: loads the helpers directory as a module to import SqlQueries.
        - sql_queries.py: SELECT statements used to populate the fact and dimension tables.
    - operators
        - __init__.py: loads the operators directory.
        - data_quality.py: DAG operator used for data quality checks.
        - load_dimension.py: DAG operator used to load data into dimension tables.
        - load_fact.py: DAG operator used to load data into the fact table.
        - stage_redshift.py: DAG operator used to load data into staging tables.
- images: folder that contains the images used in this file.
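The count-based check in `data_quality.py` can be illustrated with a framework-free sketch (the real operator subclasses Airflow's `BaseOperator` and uses a `PostgresHook`; here the hook is an injected duck-typed object so the logic runs on its own, and all names are assumptions):

```python
class DataQualityCheck:
    """Simplified version of the check: fail if a table returns no rows
    or a zero count."""

    def __init__(self, tables):
        self.tables = tables

    def execute(self, hook):
        # `hook` needs only a get_records(sql) method (like PostgresHook).
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0]:
                raise ValueError(f"Data quality check failed: {table} returned no results")
            if records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} contains 0 rows")

class FakeHook:
    """Stand-in hook for demonstration; returns canned counts."""
    def __init__(self, counts):
        self.counts = counts
    def get_records(self, sql):
        table = sql.split("FROM ")[1]
        return [[self.counts[table]]]

# Passes silently: both tables report rows.
DataQualityCheck(["songplays", "users"]).execute(FakeHook({"songplays": 10, "users": 5}))
```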
- Clone the GitHub repository: `git clone https://github.com/Erickramirez/Sparkify-Data-Pipelines-with-Airflow-S3-and-Redshift.git`
- Verify the prerequisites.
- In the Airflow UI, enable the DAG `redshift_dag`.
- Song data: `s3://udacity-dend/song_data`, a subset of real data from the Million Song Dataset, in JSON format:

```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```

- Log data: `s3://udacity-dend/log_data`, consisting of log files in JSON format generated by an event simulator based on the songs in the dataset above.
It is necessary to copy the JSON files in the S3 buckets into the staging tables.
After the staging tables are loaded, the data is loaded into the dimension and fact tables.
The table definitions are in `dags/create_tables.sql`.
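A sketch of the COPY statement the staging step issues (the real statement is built inside `stage_redshift.py` and its options may differ, e.g. the events load typically uses a JSONPaths file instead of `'auto'`; the role ARN below is hypothetical):

```python
def copy_statement(table: str, s3_path: str, iam_role_arn: str,
                   region: str = "us-west-2", json_option: str = "auto") -> str:
    """Redshift COPY from S3 in JSON format, authorized via an IAM role."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}'"
    )

sql = copy_statement("staging_songs", "s3://udacity-dend/song_data",
                     "arn:aws:iam::123456789012:role/myRedshiftRole")
print(sql)
```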
- Staging tables
    - staging_songs: contains the data from the Song data dataset.
    - staging_events: contains the data from the Log data dataset.
- Dimension tables
- users: users in the app - user_id, first_name, last_name, gender, level
- songs: songs in music database - song_id, title, artist_id, year, duration
- artists: artists in music database - artist_id, name, location, latitude, longitude
- time: timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
- Fact table
- songplays: records in event data associated with song plays i.e. records with page NextSong - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent