sparkify_redshift_dwh

Summary

This is a data engineering project aimed at getting practice with AWS Redshift, ETL pipelines, and infrastructure-as-code (IaC) techniques. It is composed of the following steps:

  • Using IaC, build the cloud data warehouse infrastructure for Sparkify, an imaginary music streaming app.
  • Build a data pipeline that extracts raw data from S3 buckets and stages it in Redshift.
  • Transform the extracted data into a set of dimensional tables for Sparkify's analytics team.

Directory

  1. cluster_launcher.py: Creates and deletes a Redshift cluster with four dc2.large nodes and S3 read-only access (see the boto3 sketch after this list).
  2. data_check.ipynb: Checks the data in the S3 buckets where the raw source data resides.
  3. example_data: Contains example files from the raw source data.
  4. log_path_json.json: Specifies the order of the keys in the JSON files for the raw log data, so Redshift's COPY command can map them to staging columns.
  5. sql_queries.py: Contains the SQL statements needed to create tables and insert data into sparkifydb.
  6. create_tables.py: Creates the sparkifydb star-schema tables and the staging tables needed for data insertion.
  7. etl.py: Copies raw data from S3 buckets into staging tables, then inserts it into the sparkifydb tables.
  8. example.env: An example of how your .env file should look if you want to clone and run this project yourself. This is the file where your AWS credentials and database passwords will reside.
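
For orientation, the provisioning in cluster_launcher.py boils down to a boto3 create_cluster call along these lines. This is a minimal sketch: the region, cluster identifier, and .env key names are assumptions, not the script's actual values.

import os
import boto3
from dotenv import load_dotenv

load_dotenv()  # pull credentials from .env

redshift = boto3.client(
    'redshift',
    region_name='us-east-1',  # assumed region
    aws_access_key_id=os.getenv('AWS_KEY'),
    aws_secret_access_key=os.getenv('AWS_SECRET'),
)

# ROLE_ARN points at an IAM role that grants Redshift
# read-only access to S3 (e.g. AmazonS3ReadOnlyAccess).
ROLE_ARN = os.getenv('ARN')

# Launch a 4-node dc2.large cluster, as the script does.
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    ClusterIdentifier='sparkify-cluster',  # assumed identifier
    DBName='sparkifydb',
    MasterUsername=os.getenv('DB_USER'),
    MasterUserPassword=os.getenv('DB_PASSWORD'),
    IamRoles=[ROLE_ARN],
)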

Quick Start

  1. Clone this repository.
git clone https://github.com/dunyaoguz/sparkify_redshift_dwh.git
cd sparkify_redshift_dwh
  2. Install dependencies.
pip install boto3
pip install psycopg2
pip install psycopg2-binary
pip install python-dotenv
  3. Create an AWS account if you don't already have one. Create an IAM user with admin access from the AWS console. Download your credentials.
  4. Add your AWS key and secret to your .env file, along with the master user name and password you want to use for your database.
  5. Run cluster_launcher.py.
python cluster_launcher.py
  6. Wait until "cluster is available" is printed in the terminal, then copy the ARN, host, and port printed there and add them to your .env file.
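
At this point your .env might look roughly like the following. The key names here are hypothetical; mirror whatever example.env actually uses.

AWS_KEY=<your-aws-access-key-id>
AWS_SECRET=<your-aws-secret-access-key>
DB_USER=<master-user-name>
DB_PASSWORD=<master-user-password>
# added after cluster_launcher.py prints them:
ARN=<iam-role-arn>
HOST=<cluster-endpoint>.redshift.amazonaws.com
PORT=5439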
  7. Run create_tables.py.
python create_tables.py
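
Internally, create_tables.py amounts to running the DDL from sql_queries.py over a psycopg2 connection, something like the sketch below. The drop_table_queries/create_table_queries names are assumptions about sql_queries.py's interface.

import os
import psycopg2
from dotenv import load_dotenv
# assumed interface: lists of DDL strings exported by sql_queries.py
from sql_queries import drop_table_queries, create_table_queries

load_dotenv()
conn = psycopg2.connect(
    host=os.getenv('HOST'),
    port=os.getenv('PORT'),
    dbname='sparkifydb',
    user=os.getenv('DB_USER'),
    password=os.getenv('DB_PASSWORD'),
)
cur = conn.cursor()

# drop stale tables first so the script is rerunnable, then recreate
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()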
  8. Run etl.py.
python etl.py
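
The staging step in etl.py comes down to Redshift COPY statements like the one sketched below; the bucket path is a placeholder, and the JSONPaths file (cf. log_path_json.json) tells COPY how to map the log JSON keys to staging columns.

import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()
conn = psycopg2.connect(
    host=os.getenv('HOST'), port=os.getenv('PORT'), dbname='sparkifydb',
    user=os.getenv('DB_USER'), password=os.getenv('DB_PASSWORD'),
)
cur = conn.cursor()

# Load raw JSON log files from S3 into a staging table.
# <source-bucket> and the staging table name are illustrative.
cur.execute("""
    COPY staging_events
    FROM 's3://<source-bucket>/log_data'
    IAM_ROLE '{}'
    JSON 's3://<source-bucket>/log_path_json.json';
""".format(os.getenv('ARN')))
conn.commit()
conn.close()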
  9. Congrats! You successfully created sparkifydb. Now go run some fun queries on your data! You can do this from the Redshift console query editor. See the Example Queries section for inspiration.

  10. Make sure to delete your Redshift cluster at the end to avoid unnecessary costs. Remember, you'll be charged $1 for each hour your cluster is live. To delete it, go back to cluster_launcher.py, comment out line 105 (create_cluster(ROLE_ARN)), uncomment lines 118 (reset()) and 119 (check_status('deleted')), then return to the terminal and run cluster_launcher.py again.

python cluster_launcher.py

Wait until "cluster is deleted" is printed in the terminal.
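
For reference, the deletion path reduces to a boto3 delete_cluster call like this (region and cluster identifier assumed, as before):

import os
import boto3
from dotenv import load_dotenv

load_dotenv()
redshift = boto3.client(
    'redshift',
    region_name='us-east-1',  # assumed region
    aws_access_key_id=os.getenv('AWS_KEY'),
    aws_secret_access_key=os.getenv('AWS_SECRET'),
)

# tear the cluster down without keeping a final snapshot
redshift.delete_cluster(
    ClusterIdentifier='sparkify-cluster',  # assumed identifier
    SkipFinalClusterSnapshot=True,
)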

Schema

(sparkifydb star-schema diagram: the songplays fact table and its dimension tables, e.g. users and artists as used in the queries below.)

Example Queries

Get the most streamed artists

SELECT a.name
, COUNT(DISTINCT s.songplay_id) AS no_listens
FROM songplays AS s
LEFT JOIN artists AS a
ON s.artist_id = a.artist_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

Get the most active users

SELECT u.firstname || ' ' || u.lastname AS user_name
, COUNT(DISTINCT s.songplay_id) AS no_listens
FROM songplays AS s
LEFT JOIN users AS u
ON s.user_id = u.user_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

Tech Stack

  • boto3
  • dotenv
  • os
  • pandas
  • psycopg2
  • json
