Project 4: Apache Spark & Data Lake

Summary

Preamble
Spark process
Project structure

Preamble

The source data comes from a public S3 bucket. The bucket contains information about songs, artists, and user actions (for example, which songs users are listening to). The objects in the bucket are JSON files. All processing is done with Apache Spark.

Spark process

The ETL job processes the song files first, then the log files. The song files are listed and iterated over, and the relevant information is written to the artists and song folders in Parquet format. The log files are filtered by the NextSong action. The resulting dataset is then processed to extract the date, time, year, and similar fields, and the records are written to the time, users, and songplays folders in Parquet format for analysis.
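The log-file step above can be illustrated with a minimal sketch. It uses plain Python rather than Spark so that it runs without a cluster, and it is not the code in etl.py. The field names `page` and `ts` (a timestamp in epoch milliseconds), the sample records, and the exact set of output fields are assumptions made for illustration.

```python
from datetime import datetime, timezone


def keep_next_song(events):
    """Keep only the events whose action is NextSong.

    Assumes the action is stored in a field named "page" (hypothetical).
    """
    return [e for e in events if e.get("page") == "NextSong"]


def time_fields(ts_ms):
    """Break an epoch-millisecond timestamp into date and time parts (UTC)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": dt.isoformat(),
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),
    }


# Hypothetical sample log records, not taken from the real bucket.
events = [
    {"page": "NextSong", "ts": 1541106106796, "userId": "8"},
    {"page": "Home", "ts": 1541106132796, "userId": "8"},
]

plays = keep_next_song(events)
time_rows = [time_fields(e["ts"]) for e in plays]
```

In a Spark job the same two steps would typically be a DataFrame filter followed by derived timestamp columns, with the result written out as Parquet.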

Project structure

This is the project structure. A bullet that starts with / denotes a folder:

/data - A folder that contains two zip files, helpful for data exploration
etl.py - The ETL engine built with Spark: data normalization and Parquet file writing.
dl.cfg - Configuration file that contains info about AWS credentials
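
As a rough sketch of how a script might read credentials from a file like dl.cfg, the example below parses an INI-style configuration with Python's standard `configparser`. The layout of dl.cfg is not shown in this README, so the `[AWS]` section name, the key names, and the placeholder values are all assumptions.

```python
import configparser

# Hypothetical dl.cfg contents; the real file's section and keys may differ.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = your-access-key-id
AWS_SECRET_ACCESS_KEY = your-secret-access-key
"""


def load_aws_credentials(cfg_text):
    """Parse INI-style text and return the assumed AWS credential keys."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["AWS"]  # assumed section name
    return {
        "AWS_ACCESS_KEY_ID": section["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": section["AWS_SECRET_ACCESS_KEY"],
    }


creds = load_aws_credentials(SAMPLE_CFG)
```

To read from the file itself, `parser.read("dl.cfg")` could replace `read_string`. Real credentials should be kept out of version control.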

FedericoSerini/DEND-Project-4-Apache-Spark-And-Data-Lake

Folders and files

Latest commit

History

Repository files navigation

Project 4: Apache Spark & Data Lake

Summary

Preamble

Spark process

Project structure

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages