Skip to content

Ioana-Postolache/SongplaysDatapipelines

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 

Repository files navigation

SongplaysDatapipelines

This project builds a datawarehouse through data pipelines that are started and monitored with Apache Airflow. The source data resides in S3 and needs to be processed in the data warehouse in Amazon Redshift.

The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.The data comes from two datasets that reside in S3. Below are the S3 links for each:

  • Song data: s3://udacity-dend/song_data - contains JSON metadata on the songs in the app
  • Log data: s3://udacity-dend/log_data - contains JSON logs on user activity on the app Log data json path: s3://udacity-dend/log_json_path.json

The data is loaded from S3 to staging tables on Redshift and SQL statements are executed to create the the tables below from these staging tables:

Fact Table

  • songplays - records in event data associated with song plays i.e. records with page NextSong; Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  • user - users in the app; Columns: user_id, first_name, last_name, gender, level
  • song - songs in music database; Columns: song_id, title, artist_id, year, duration
  • artist - artists in music database; Columns: artist_id, name, location, lattitude, longitude
  • time - timestamps of records in songplays broken down into specific units; Columns: start_time, hour, day, week, month, year, weekday

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

  • song_data/A/B/C/TRABCEI128F424C983.json
  • song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

  • {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Sample rows from the song files

+------------------+---------------+-----------------+----------------+----------------------------------------------------------------------------------------------+---------+---------+------------------+----------------------------------------------------+----+
|artist_id         |artist_latitude|artist_location  |artist_longitude|artist_name                                                                                   |duration |num_songs|song_id |title                                               |year|
+------------------+---------------+-----------------+----------------+----------------------------------------------------------------------------------------------+---------+---------+------------------+----------------------------------------------------+----+
|ARDR4AC1187FB371A1|null           |                 |null            |Montserrat Caballé;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti|511.16363|1        |SOBAYLL12A8C138AF9|Sono andati? Fingevo di dormire                     |0   |
|AREBBGV1187FB523D2|null           |Houston, TX      |null            |Mike Jones (Featuring CJ_ Mello & Lil' Bran)                                                  |173.66159|1        |SOOLYAZ12A6701F4A6|Laws Patrolling (Album Version)                     |0   |
|ARMAC4T1187FB3FA4C|40.82624       |Morris Plains, NJ|-74.47995       |The Dillinger Escape Plan                                                                     |207.77751|1        |SOBBUGU12A8C13E95D|Setting Fire to Sleeping Giants                     |2004|
|ARPBNLO1187FB3D52F|40.71455       |New York, NY     |-74.00712       |Tiny Tim                                                                                      |43.36281 |1        |SOAOIBZ12AB01815BE|I Hold Your Hand In Mine [Live At Royal Albert Hall]|2000|
|ARDNS031187B9924F0|32.67828       |Georgia          |-83.22295       |Tim Wilson                                                                                    |186.48771|1        |SONYPOM12A8C13B2D7|I Think My Wife Is Running Around On Me (Taco Hell) |2005|
+------------------+---------------+-----------------+----------------+----------------------------------------------------------------------------------------------+---------+---------+------------------+----------------------------------------------------+----+

Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.

The log files in the dataset are partitioned by year and month.

Sample rows from the log files

+-----------+---------+---------+------+-------------+--------+---------+-----+-------------------------------------+------+--------+-----------------+---------+---------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------+------+
|artist     |auth     |firstName|gender|itemInSession|lastName|length   |level|location                             |method|page    |registration     |sessionId|song           |status|ts           |userAgent                                                                                                                                |userId|
+-----------+---------+---------+------+-------------+--------+---------+-----+-------------------------------------+------+--------+-----------------+---------+---------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------+------+
|Harmonia   |Logged In|Ryan     |M     |0            |Smith   |655.77751|free |San Jose-Sunnyvale-Santa Clara, CA   |PUT   |NextSong|1.541016707796E12|583      |Sehr kosmisch  |200   |1542241826796|"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"|26    |
|The Prodigy|Logged In|Ryan     |M     |1            |Smith   |260.07465|free |San Jose-Sunnyvale-Santa Clara, CA   |PUT   |NextSong|1.541016707796E12|583      |The Big Gundown|200   |1542242481796|"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"|26    |
|Train      |Logged In|Ryan     |M     |2            |Smith   |205.45261|free |San Jose-Sunnyvale-Santa Clara, CA   |PUT   |NextSong|1.541016707796E12|583      |Marry Me       |200   |1542242741796|"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"|26    |
|null       |Logged In|Wyatt    |M     |0            |Scott   |null     |free |Eureka-Arcata-Fortuna, CA            |GET   |Home    |1.540872073796E12|563      |null           |200   |1542247071796|Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko                                                                     |9     |
|null       |Logged In|Austin   |M     |0            |Rosales |null     |free |New York-Newark-Jersey City, NY-NJ-PA|GET   |Home    |1.541059521796E12|521      |null           |200   |1542252577796|Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0                                                                        |12    |
+-----------+---------+---------+------+-------------+--------+---------+-----+-------------------------------------+------+--------+-----------------+---------+---------------+------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------+------+

Config file (dwh.cfg) structure

Below is the structure of the config file. Don't use any quotes for the KEY ID and ACCESS KEY:

[CLUSTER]
host = ""
db_name = ""
db_user = ""
db_password = ""
db_port = ""

How to run

Use Airflow's UI to configure your AWS credentials and connection to Redshift.
  • Click on the Admin tab and select Connections.
  • Under Connections, select Create.
  • On the create connection page, enter the following values:
AWS credentials
  • Conn Id: Enter aws_credentials.
  • Conn Type: Enter Amazon Web Services.
  • Login: Enter your Access key ID from the IAM User credentials you downloaded earlier.
  • Password: Enter your Secret access key from the IAM User credentials you downloaded earlier.
  • Once you've entered these values, select Save and Add Another.On the next create connection page, enter the following values:
Redshift
  • Conn Id: Enter redshift.
  • Conn Type: Enter Postgres.
  • Host: Enter the endpoint of your Redshift cluster, excluding the port at the end. You can find this by selecting your cluster in the Clusters page of the Amazon Redshift console. IMPORTANT: Make sure to NOT include the port at the end of the Redshift endpoint string.
  • Schema: Enter dev. This is the Redshift database you want to connect to.
  • Login: Enter awsuser.
  • Password: Enter the password you created when launching your Redshift cluster.
  • Port: Enter 5439.
  • Once you've entered these values, select Save.
Add the connection details to the dwh.cfg file in create_tables.py and then run create_tables/create_tables.py to create the tables in Redshift
Start the airflow server
/opt/airflow/start.sh
Turn on the dwh_dag DAG and monitor its execution

About

Data Pipelines with Airflow

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages