A startup called Sparkify wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The analytics team is particularly interested in understanding which songs users are listening to. This project creates a Redshift cluster with tables designed to optimize queries for song play analysis.
- create_tables.py drops and creates the tables. Run this file to reset the tables before each run of the ETL script.
- etl.py copies and processes files from an AWS S3 bucket and loads them into the tables.
- The SQL queries file contains all the SQL queries and is imported into the two files above.
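The drop-and-create reset described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the variable names, the `reset_tables` helper, and the `songs` columns are all hypothetical, and `sqlite3` stands in for Redshift only so the sketch runs locally.

```python
import sqlite3

# Hypothetical stand-in for the shared queries file: statements are kept as
# plain strings and grouped into lists so other scripts can import them.
songs_table_drop = "DROP TABLE IF EXISTS songs"
songs_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id TEXT PRIMARY KEY,
    title   TEXT
)
"""
drop_table_queries = [songs_table_drop]
create_table_queries = [songs_table_create]


def reset_tables(conn):
    """Drop every table, then recreate it empty."""
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
    conn.commit()


# sqlite3 is an assumption used in place of a Redshift connection.
conn = sqlite3.connect(":memory:")
reset_tables(conn)
conn.execute("INSERT INTO songs VALUES ('S1', 'Example Song')")
reset_tables(conn)  # the reset discards the row inserted above
row_count = conn.execute("SELECT COUNT(*) FROM songs").fetchone()[0]
```

Keeping the statements in one place means the reset script and the load script stay in agreement about table definitions.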
In Jupyter Lab, start a new command line interface by clicking "File" -> "New" -> "Terminal".
- Run the command python create_tables.py.
- Run the command python etl.py.
Remember to rerun create_tables.py to reset your tables each time before you run etl.py.
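The run order above could also be scripted. The sketch below assumes both scripts sit in the current working directory; the `run_pipeline` function and its `runner` parameter are hypothetical, added only so the order can be checked without executing anything.

```python
import subprocess
import sys

# Script names come from the steps above.
PIPELINE = ["create_tables.py", "etl.py"]


def run_pipeline(runner=subprocess.run):
    """Run the reset script, then the ETL script, stopping on failure."""
    commands = [[sys.executable, script] for script in PIPELINE]
    for command in commands:
        # check=True raises CalledProcessError if a script exits non-zero,
        # so etl.py never runs against tables that failed to reset.
        runner(command, check=True)
    return commands
```

Calling `run_pipeline()` with no arguments would execute both scripts with the current Python interpreter.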
The database tables use a star schema to optimize queries for song play analysis.
- songplays - records in the log data associated with song plays.
- users - users of the app.
- songs - songs in the music database.
- artists - artists in the music database.
- time - timestamps of records in songplays, broken down into specific units.
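To illustrate why this layout suits song play analysis, here is a small sketch of a query that joins songplays to one of the other tables. The column names and sample rows are hypothetical, only two of the five tables are shown, and `sqlite3` is used in place of Redshift so the example runs locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical columns: songplays holds one row per play and points at songs.
conn.executescript("""
CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    user_id     INTEGER,
    song_id     TEXT REFERENCES songs (song_id)
);
INSERT INTO songs VALUES ('S1', 'Song A'), ('S2', 'Song B');
INSERT INTO songplays (user_id, song_id)
VALUES (1, 'S1'), (2, 'S1'), (1, 'S2');
""")

# Count plays per song with a single join from songplays to songs.
top_songs = conn.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays AS sp
    JOIN songs AS s ON s.song_id = sp.song_id
    GROUP BY s.title
    ORDER BY plays DESC, s.title
""").fetchall()
```

Because each descriptive table joins directly to songplays, a question such as "which songs are played most" needs only one join.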