This project delivers a Data Lake solution. It builds an ETL pipeline that extracts Sparkify's data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This allows the analytics team to continue finding insights into what songs their users are listening to.
Sparkify is a fictional popular digital media service created by Udacity, similar to Spotify or Pandora; many users use its services every day.
The environment needed for this project:
- Python 3.6
- PySpark
- AWS account
- Apache Spark, using one of the following configurations:
  - Standalone installation (see the Apache Spark documentation)
  - Amazon EMR cluster
- etl.py: reads data from S3, processes that data using Spark, and writes the results back to S3.
- dl.cfg: configuration file that contains AWS credentials:
```
[CREDENTIALS]
AWS_ACCESS_KEY_ID = <your aws key>
AWS_SECRET_ACCESS_KEY = <your aws secret>
```
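As a sketch of how these credentials can be loaded, the file can be parsed with Python's `configparser` and exported as environment variables for Spark's S3 access. The section and key names follow the file above; the placeholder values written here are illustrative, and the exact loading code in etl.py may differ:

```python
import configparser
import os

# Write a sample dl.cfg with placeholder values so the sketch is self-contained.
with open("dl.cfg", "w") as f:
    f.write("[CREDENTIALS]\n"
            "AWS_ACCESS_KEY_ID = AKIAEXAMPLE\n"
            "AWS_SECRET_ACCESS_KEY = secretexample\n")

config = configparser.ConfigParser()
config.read("dl.cfg")

# Expose the credentials as environment variables so the s3a
# filesystem used by Spark can authenticate against S3.
os.environ["AWS_ACCESS_KEY_ID"] = config["CREDENTIALS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["CREDENTIALS"]["AWS_SECRET_ACCESS_KEY"]
```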
- images: folder that contains the images used in this file.
- Clone the GitHub repository:
```
git clone https://github.com/Erickramirez/Sparkify_Data-Lake-with-Apache-Spark.git
```
- Verify the prerequisites
- Execute the ETL process using the command:
```
python etl.py
```
- Song data:
s3://udacity-dend/song_data
It is a subset of real data from the Million Song Dataset; each file is in JSON format, for example:
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
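To illustrate how one song record maps onto the dimensional tables, here is a plain-Python sketch of the field selection; in etl.py the same selection is performed with Spark DataFrames:

```python
import json

# The example song record shown above.
record = json.loads("""{ "num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7",
  "artist_latitude": null, "artist_longitude": null, "artist_location": "",
  "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1",
  "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0 }""")

# Columns kept for the songs dimension table.
song_row = {k: record[k] for k in ("song_id", "title", "artist_id", "year", "duration")}

# Columns kept (and renamed) for the artists dimension table.
artist_row = {
    "artist_id": record["artist_id"],
    "name": record["artist_name"],
    "location": record["artist_location"],
    "latitude": record["artist_latitude"],
    "longitude": record["artist_longitude"],
}
```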
- Log data:
s3://udacity-dend/log_data
consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. It has the following structure:
It is necessary to copy the JSON files in the S3 buckets into staging tables. Once the staging tables are populated, the data is loaded into dimension and fact tables.
These table definitions are in etl.py.
- Dimension tables
- users: users in the app - user_id, first_name, last_name, gender, level
- songs: songs in music database - song_id, title, artist_id, year, duration
- artists: artists in music database - artist_id, name, location, latitude, longitude
- time: timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
- Fact table
- songplays: records in event data associated with song plays, i.e., records with page `NextSong` - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
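Several of the dimension tables above are built by selecting a few columns and removing duplicated rows. A minimal plain-Python sketch of that deduplication step follows; keeping the last occurrence per user_id (so the most recent `level` wins) is an assumption for illustration, and Spark's `dropDuplicates` performs the equivalent in etl.py:

```python
# Two log events for the same user: she upgraded from free to paid.
events = [
    {"user_id": 10, "first_name": "Sylvie", "last_name": "Cruz", "gender": "F", "level": "free"},
    {"user_id": 10, "first_name": "Sylvie", "last_name": "Cruz", "gender": "F", "level": "paid"},
]

# Deduplicate by user_id; later entries overwrite earlier ones in the dict,
# so the last occurrence is kept.
users = list({e["user_id"]: e for e in events}.values())
```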
Each of the five tables is written to parquet files in a separate analytics directory on S3, with each table in its own folder within the directory.
- Songs table files are partitioned by year and then artist.
- Time table files are partitioned by year and month.
- Songplays table files are partitioned by year and month.
The data is extracted from JSON files in S3 (s3a://udacity-dend/) for the datasets described above. It is then transformed into the database schema; this is performed with Spark using PySpark. Some of the transformations only select columns and remove duplicated data; in other cases it is necessary to modify the type. For instance, the `ts` field in the log-data dataset is a Unix timestamp, which needs to be converted to a date-time format. The generated files are in parquet format.
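The `ts` conversion can be sketched in plain Python. This assumes `ts` is stored as milliseconds since the Unix epoch; etl.py performs the equivalent conversion with Spark before deriving the columns of the time table:

```python
from datetime import datetime, timezone

def ts_to_datetime(ts_ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)

dt = ts_to_datetime(1541106106796)  # an illustrative ts value

# The time table columns are then derived from the converted value.
time_row = {"start_time": dt, "hour": dt.hour, "day": dt.day,
            "week": dt.isocalendar()[1], "month": dt.month,
            "year": dt.year, "weekday": dt.weekday()}
```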
The ETL for both datasets is implemented in the methods process_song_data and process_log_data in the file etl.py.