This project delivers a Data Lake solution. It builds an ETL pipeline that extracts Sparkify's data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This allows the analytics team to continue finding insights into what songs their users are listening to.
Sparkify is a fictional popular digital media service created by Udacity, similar to Spotify or Pandora; many users use its services every day.
The environment needed for this project:
- Python 3.6
- PySpark
- AWS account
- Apache Spark, using one of the following configurations:
  - Standalone installation (see the Apache Spark documentation)
  - Amazon EMR cluster
- etl.py: reads data from S3, processes that data using Spark, and writes the results back to S3.
- dl.cfg: configuration file that contains AWS credentials:
```
[CREDENTIALS]
AWS_ACCESS_KEY_ID = <your aws key>
AWS_SECRET_ACCESS_KEY = <your aws secret>
```
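As a sketch of how these credentials can be loaded, the file can be parsed with Python's `configparser` and exported as environment variables for Spark's S3 access. The section and key names follow the file above; the placeholder values written here are illustrative, and the exact loading code in etl.py may differ:

```python
import configparser
import os

# Write a sample dl.cfg with placeholder values so the sketch is self-contained.
with open("dl.cfg", "w") as f:
    f.write("[CREDENTIALS]\n"
            "AWS_ACCESS_KEY_ID = AKIAEXAMPLE\n"
            "AWS_SECRET_ACCESS_KEY = secretexample\n")

config = configparser.ConfigParser()
config.read("dl.cfg")

# Expose the credentials as environment variables so the s3a
# filesystem used by Spark can authenticate against S3.
os.environ["AWS_ACCESS_KEY_ID"] = config["CREDENTIALS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["CREDENTIALS"]["AWS_SECRET_ACCESS_KEY"]
```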
- images: folder that contains the images used in this file.
- Clone the GitHub repository:
```
git clone https://github.com/Erickramirez/Sparkify_Data-Lake-with-Apache-Spark.git
```
- Verify the prerequisites
- Execute the ETL process using the command:
```
python etl.py
```
- Song data:
s3://udacity-dend/song_data
It is a subset of real data from the Million Song Dataset; each file is in JSON format, for example:
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
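To illustrate how one song record maps onto the dimensional tables, here is a plain-Python sketch of the field selection; in etl.py the same selection is performed with Spark DataFrames:

```python
import json

# The example song record shown above.
record = json.loads("""{ "num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7",
  "artist_latitude": null, "artist_longitude": null, "artist_location": "",
  "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1",
  "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0 }""")

# Columns kept for the songs dimension table.
song_row = {k: record[k] for k in ("song_id", "title", "artist_id", "year", "duration")}

# Columns kept (and renamed) for the artists dimension table.
artist_row = {
    "artist_id": record["artist_id"],
    "name": record["artist_name"],
    "location": record["artist_location"],
    "latitude": record["artist_latitude"],
    "longitude": record["artist_longitude"],
}
```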
- Log data:
s3://udacity-dend/log_data
consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. It has the following structure:
It is necessary to copy the JSON files in the S3 buckets into staging tables. Once the staging tables are populated, the data is loaded into dimension and fact tables.
These table definitions are in etl.py.
- Dimension tables
- users: users in the app - user_id, first_name, last_name, gender, level
- songs: songs in music database - song_id, title, artist_id, year, duration
- artists: artists in music database - artist_id, name, location, latitude, longitude
- time: timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
- Fact table
- songplays: records in event data associated with song plays, i.e., records with page `NextSong` - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
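Several of the dimension tables above are built by selecting a few columns and removing duplicated rows. A minimal plain-Python sketch of that deduplication step follows; keeping the last occurrence per user_id (so the most recent `level` wins) is an assumption for illustration, and Spark's `dropDuplicates` performs the equivalent in etl.py:

```python
# Two log events for the same user: she upgraded from free to paid.
events = [
    {"user_id": 10, "first_name": "Sylvie", "last_name": "Cruz", "gender": "F", "level": "free"},
    {"user_id": 10, "first_name": "Sylvie", "last_name": "Cruz", "gender": "F", "level": "paid"},
]

# Deduplicate by user_id; later entries overwrite earlier ones in the dict,
# so the last occurrence is kept.
users = list({e["user_id"]: e for e in events}.values())
```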
Each of the five tables is written to parquet files in a separate analytics directory on S3, with each table in its own folder within the directory.
- Songs table files are partitioned by year and then artist.
- Time table files are partitioned by year and month.
- Songplays table files are partitioned by year and month.
The data is extracted from JSON files in S3 (s3a://udacity-dend/) for the datasets described above. It is then transformed into the database schema; this is performed with Spark using PySpark. Some of the transformations only select columns and remove duplicated data; in other cases it is necessary to modify the type. For instance, the `ts` field in the log-data dataset is a Unix timestamp, which needs to be converted to a date-time format. The generated files are in parquet format.
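The `ts` conversion can be sketched in plain Python. This assumes `ts` is stored as milliseconds since the Unix epoch; etl.py performs the equivalent conversion with Spark before deriving the columns of the time table:

```python
from datetime import datetime, timezone

def ts_to_datetime(ts_ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)

dt = ts_to_datetime(1541106106796)  # an illustrative ts value

# The time table columns are then derived from the converted value.
time_row = {"start_time": dt, "hour": dt.hour, "day": dt.day,
            "week": dt.isocalendar()[1], "month": dt.month,
            "year": dt.year, "weekday": dt.weekday()}
```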
The ETL for both datasets is implemented in the methods process_song_data and process_log_data in the file etl.py.