The music company Sparkify generates CSV files that record song-play data from its app. The CSV files are aggregated and loaded into a Cassandra database, where each table is modelled around a single query/piece of analysis.
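For illustration, a minimal sketch of this query-per-table pattern using the Python cassandra-driver. The keyspace, table, and column names here are hypothetical, not the project's actual schema; the point is that the primary key is chosen to match the WHERE clause of the one query the table serves:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])       # assumes a local single-node cluster
session = cluster.connect("sparkify")  # keyspace name is an assumption

# One table per query: this table exists to answer
# "which songs were played during session X?"
session.execute("""
    CREATE TABLE IF NOT EXISTS song_plays_by_session (
        session_id      int,
        item_in_session int,
        artist          text,
        song            text,
        PRIMARY KEY ((session_id), item_in_session)
    )
""")

# The partition key matches the query's filter exactly.
rows = session.execute(
    "SELECT artist, song FROM song_plays_by_session WHERE session_id = %s",
    (338,),
)
for row in rows:
    print(row.artist, row.song)
```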
- Ensure python3 and Cassandra are installed
- Ensure you have a Cassandra cluster running (a connectivity check is sketched after this list)
- Run cass_etl_exec.sh -> this will create a virtual env, install the dependencies, and start the Jupyter server
- Run jupyter notebook dmiller_cass_sparkify_notebook.ipynb
- Execute all cells in the dmiller_cass_sparkify_project notebook
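Before launching the notebook, the cluster from the second step can be sanity-checked with a few lines of Python (a minimal sketch assuming a single local node on the default port 9042):

```python
from cassandra.cluster import Cluster

# Contact point and port are assumptions for a default local install.
cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect()
print("Connected to cluster:", cluster.metadata.cluster_name)
cluster.shutdown()
```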
Results should look something like this (without the warning message):
- Use cqlsh scripts to create the tables and run them via bash. This further separates the DB logic from the application logic and allows composite clustering keys, so rows can be ordered on multiple columns (see the first sketch after this list)
- Look into a single function for copying CSV files into the DB, interpolating the table name and CSV file path (see the second sketch after this list)
- Performance testing with more data, or on a low-memory VM/container, to identify bottlenecks
- More queries, and potentially also some data dashboards
- Unit and integration testing to provide documentation and ensure long-term code robustness (see the test sketch after this list)
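The cqlsh/bash idea from the first bullet would live in a .cql script run from the shell; for illustration only, here is the equivalent statement issued through the Python driver, showing a composite clustering key that orders rows by two columns (all names hypothetical):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sparkify")  # assumed keyspace

# Composite clustering key: within each user_id partition, rows are
# stored and returned ordered by (session_id, item_in_session).
session.execute("""
    CREATE TABLE IF NOT EXISTS song_plays_by_user (
        user_id         int,
        session_id      int,
        item_in_session int,
        artist          text,
        song            text,
        PRIMARY KEY ((user_id), session_id, item_in_session)
    ) WITH CLUSTERING ORDER BY (session_id ASC, item_in_session ASC)
""")
```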
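A sketch of the single load function from the second bullet. CQL cannot parameterise identifiers, so the table name and column list are interpolated into the statement, while the row values are bound safely through a prepared statement. The function, table, and file names below are assumptions:

```python
import csv
from cassandra.cluster import Cluster

def load_csv(session, table, columns, csv_path, transform=None):
    """Copy every data row of csv_path into `table` (assumes a header row)."""
    placeholders = ", ".join(["?"] * len(columns))
    insert = session.prepare(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    )
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            session.execute(insert, transform(row) if transform else row)

# Hypothetical usage: load the aggregated CSV into the session table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sparkify")
load_csv(
    session,
    "song_plays_by_session",
    ["session_id", "item_in_session", "artist", "song"],
    "event_data_aggregated.csv",  # file name is an assumption
    transform=lambda r: (int(r[0]), int(r[1]), r[2], r[3]),
)
```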
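For the testing bullet, a minimal pytest-style integration test against a running cluster (keyspace, table, and values are hypothetical):

```python
import pytest
from cassandra.cluster import Cluster

@pytest.fixture(scope="module")
def session():
    cluster = Cluster(["127.0.0.1"])  # assumes a local test cluster
    session = cluster.connect("sparkify")
    yield session
    cluster.shutdown()

def test_session_rows_come_back_ordered(session):
    # The clustering key should return items in item_in_session order.
    rows = session.execute(
        "SELECT item_in_session FROM song_plays_by_session "
        "WHERE session_id = %s",
        (338,),
    )
    items = [r.item_in_session for r in rows]
    assert items == sorted(items)
```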