# Data Engineering Project 4: Data Lakes with Spark

## Introduction

A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. The data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app.

In this notebook we test an ETL pipeline that extracts a local copy of the data and processes it using Spark.


In [1]:
import os
import configparser
from pyspark.sql import SparkSession
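from pyspark.sql.types import StructType, StructField as Fld, DoubleType as Dbl, \
        StringType as Str, IntegerType as Int, DateType as Date, \
        LongType as Lgt, TimestampType as Tme
# Explicit imports instead of `*`; note that `round` here is Spark's column function, not the builtin.
from pyspark.sql.functions import col, round, hour, dayofmonth, weekofyear, month, year, \
        dayofweek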
from zipfile import ZipFile

The lines below are commented out because we are using a local copy of the data.

In [2]:
#config = configparser.ConfigParser()
#config.read_file(open('aws/credentials.cfg'))
#os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
#os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']
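For the S3 variant, the commented lines above read the credentials from an INI-style file with an `[AWS]` section. A minimal sketch of that pattern follows; the sample content and the placeholder values are hypothetical, and the config is parsed from a string so the example runs without a real credentials file.

```python
import configparser
import os

# Hypothetical stand-in for the contents of aws/credentials.cfg (placeholder values only).
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = EXAMPLE_KEY_ID
AWS_SECRET_ACCESS_KEY = EXAMPLE_SECRET
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

# Export both values as environment variables, as the commented lines above do.
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    os.environ[key] = config["AWS"][key]
```

With a real file, `config.read_string(...)` would be replaced by the `config.read_file(open(...))` call shown in the cell above.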

Unpack the zip files with the data and save the contents in the `data` folder.

`song-data.zip` has the following path structure: `song_data/*/*/*/*.json`

`log-data.zip` has the following path structure: `*.json`



In [3]:
with ZipFile('data/song-data.zip','r') as zip_file:
    zip_file.extractall('data/')
with ZipFile('data/log-data.zip','r') as zip_file:
    zip_file.extractall('data/log_data')
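To see how the nested path structure interacts with the glob pattern used later, here is a small self-contained sketch. It builds a tiny archive in a temporary directory and extracts it, assuming the archive unpacks to a `song_data/` folder (the folder name the read path later in this notebook expects). The file name `TRAAAAA.json` and its content are hypothetical.

```python
import glob
import json
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "song-data.zip")
    # Build a tiny archive that mimics the three-level nesting (hypothetical file).
    with ZipFile(archive, "w") as zf:
        zf.writestr("song_data/A/A/A/TRAAAAA.json", json.dumps({"song_id": "S1"}))

    target = os.path.join(tmp, "data")
    with ZipFile(archive, "r") as zf:
        zf.extractall(target)  # creates the target folder and subfolders as needed

    # The same wildcard pattern that is later passed to the Spark reader.
    matches = glob.glob(os.path.join(target, "song_data/*/*/*/*.json"))
    n_matches = len(matches)
    relative = [os.path.relpath(m, target) for m in matches]

print(n_matches, relative)
```

Each `*` in the pattern matches exactly one directory level, so a file that sits at a different depth would not be picked up.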

Create a Spark session.

In [4]:
spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

Create the schemas for the JSON files we are going to read: one for `song-data` and one for `log-data`.

In [5]:
songSchema = StructType([
    Fld("artist_id",       Str()),
    Fld("artist_latitude", Dbl()),
    Fld("artist_location", Str()),
    Fld("artist_longitude",Dbl()),
    Fld("artist_name",     Str()),
    Fld("duration",        Dbl()),
    Fld("num_songs",       Int()),
    Fld("song_id",         Str()),
    Fld("title",           Str()),
    Fld("year",            Int())
    ])

logschema = StructType([
    Fld('artist',          Str()),
    Fld('auth',            Str()),
    Fld('firstName',       Str()),
    Fld('gender',          Str()),
    Fld('itemInSession',   Int()),
    Fld('lastName',        Str()),
    Fld('length',          Dbl()),
    Fld('level',           Str()),
    Fld('location',        Str()),
    Fld('method',          Str()),
    Fld('page',            Str()),
    Fld('registration',    Dbl()),
    Fld('sessionId',       Int()),
    Fld('song',            Str()),
    Fld('status',          Int()),
    Fld('ts',              Lgt()),
    Fld('userAgent',       Str()),
    Fld('userId',          Int())
    ]) 
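A declared schema is only useful if it agrees with the raw data. The plain-Python sketch below (not Spark code) checks one record against expected types that mirror a subset of `logschema`. The helper `mismatched_fields` is hypothetical; the sample values are taken from the first log row printed later in this notebook.

```python
# Expected Python types for a subset of the log fields, mirroring logschema above.
EXPECTED_LOG_TYPES = {
    "artist": str,
    "itemInSession": int,
    "length": float,
    "sessionId": int,
    "ts": int,
    "userId": int,
}


def mismatched_fields(record, expected):
    """Return names of fields whose value is present but not of the expected type."""
    return [
        name
        for name, typ in expected.items()
        if record.get(name) is not None and not isinstance(record[name], typ)
    ]


# Values from the first log row shown later in this notebook.
sample = {
    "artist": "Harmonia",
    "itemInSession": 0,
    "length": 655.77751,
    "sessionId": 583,
    "ts": 1542241826796,
    "userId": "26",
}

bad = mismatched_fields(sample, EXPECTED_LOG_TYPES)
print(bad)  # ['userId']
```

The check flags `userId`, which is stored as a string in the log files while `logschema` declares it as an integer. That mismatch is worth resolving before applying `logschema` to the reader.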

We set the input paths. The S3 paths are commented out because we use the local copy of the data.

In [6]:
#uncomment if S3
#input_data = "s3a://udacity-dend/"
#song_data = os.path.join(input_data, "song_data/A/A/A/*.json")
#uncomment if local
input_data = "data/"
song_data = os.path.join(input_data, "song_data/*/*/*/*.json")
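Instead of commenting lines in and out, the choice between S3 and local input could be driven by a flag. The helper below, `build_song_path`, is a hypothetical sketch and not part of the original pipeline; it reuses the two paths from the cell above.

```python
import os


def build_song_path(use_s3: bool) -> str:
    """Return the glob pattern for the song files (hypothetical helper)."""
    if use_s3:
        # Only the A/A/A subset is read from S3, as in the commented lines above.
        return os.path.join("s3a://udacity-dend/", "song_data/A/A/A/*.json")
    # Local copy: all three directory levels are matched with wildcards.
    return os.path.join("data/", "song_data/*/*/*/*.json")


local_pattern = build_song_path(use_s3=False)
s3_pattern = build_song_path(use_s3=True)
print(local_pattern)
print(s3_pattern)
```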

Load the song data into a Spark data frame.

In [7]:
df_song = spark.read.json(song_data, schema=songSchema)

We print the schema and the first row to check that the data has been loaded correctly.

In [8]:
df_song.printSchema()
df_song.head()

root
 |-- artist_id: string (nullable = true)
 |-- artist_latitude: double (nullable = true)
 |-- artist_location: string (nullable = true)
 |-- artist_longitude: double (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- duration: double (nullable = true)
 |-- num_songs: integer (nullable = true)
 |-- song_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- year: integer (nullable = true)



Row(artist_id='ARDR4AC1187FB371A1', artist_latitude=None, artist_location='', artist_longitude=None, artist_name='Montserrat Caballé;Placido Domingo;Vicente Sardinero;Judith Blegen;Sherrill Milnes;Georg Solti', duration=511.16363, num_songs=1, song_id='SOBAYLL12A8C138AF9', title='Sono andati? Fingevo di dormire', year=0)

Next we load the `log-data` into a Spark data frame.

In [9]:
# get filepath to log data files
#log_data = os.path.join(input_data, "log_data/2018/11/*.json")
log_data = os.path.join(input_data, "log_data/*.json")

In [10]:
df_log = spark.read.json(log_data)

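Check that we have loaded the `log-data` correctly into the Spark data frame. Because no schema is passed to the reader, the types below are inferred by Spark and differ from `logschema` (for example, `sessionId` is a long and `userId` is a string).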

In [11]:
df_log.printSchema()
df_log.head()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: double (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



Row(artist='Harmonia', auth='Logged In', firstName='Ryan', gender='M', itemInSession=0, lastName='Smith', length=655.77751, level='free', location='San Jose-Sunnyvale-Santa Clara, CA', method='PUT', page='NextSong', registration=1541016707796.0, sessionId=583, song='Sehr kosmisch', status=200, ts=1542241826796, userAgent='"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"', userId='26')

We join both data frames, matching `artist_name` to `artist` and `title` to `song`. We also add a `start_time` column that holds the `ts` column (milliseconds since the epoch) converted to a timestamp. Later we can use the `start_time` column to create the time table.

In [12]:
df_log_join_song = df_log.join(df_song, (df_song.artist_name == df_log.artist) & (df_song.title == df_log.song))
df_log_join_song = df_log_join_song.withColumn( 'start_time', (round( col('ts')/1000 )).cast(Tme()) )
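The logic of the join and the timestamp conversion can be sketched in plain Python (not Spark code). The `ts` value is taken from the first log row shown above; the second log row and the `song_id` value are hypothetical. UTC is used here to keep the result deterministic, whereas Spark displays timestamps in the session time zone.

```python
from datetime import datetime, timezone

logs = [
    {"artist": "Harmonia", "song": "Sehr kosmisch", "ts": 1542241826796},
    {"artist": "Unknown Artist", "song": "No Match", "ts": 1542241900000},  # hypothetical row
]
songs = [
    # song_id is a made-up placeholder
    {"artist_name": "Harmonia", "title": "Sehr kosmisch", "song_id": "SONG_1"},
]

# Index the songs by the two join keys.
# Simplification: assumes each (artist_name, title) pair occurs only once.
song_index = {(s["artist_name"], s["title"]): s for s in songs}

joined = []
for log in logs:
    match = song_index.get((log["artist"], log["song"]))
    if match is None:
        continue  # inner join: log rows without a matching song are dropped
    row = {**log, **match}
    # ts is in milliseconds; divide by 1000 and round to whole seconds.
    row["start_time"] = datetime.fromtimestamp(round(log["ts"] / 1000), tz=timezone.utc)
    joined.append(row)

print(len(joined), joined[0]["start_time"])
```

Only the matched row survives the join, which is why every table derived from the joined data frame contains just the plays whose artist and title appear in the song data.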

In [13]:
df_log_join_song

DataFrame[artist: string, auth: string, firstName: string, gender: string, itemInSession: bigint, lastName: string, length: double, level: string, location: string, method: string, page: string, registration: double, sessionId: bigint, song: string, status: bigint, ts: bigint, userAgent: string, userId: string, artist_id: string, artist_latitude: double, artist_location: string, artist_longitude: double, artist_name: string, duration: double, num_songs: int, song_id: string, title: string, year: int, start_time: timestamp]

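Create the `songs_table`. Because it is selected from the joined data frame, it only contains songs that were matched in the log data, and a song appears once per matching log record.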

In [14]:
songs_table = df_log_join_song.select('song_id', 'title', 'artist_id', 'year', 'duration')

Check the `songs_table`

In [15]:
songs_table.head()

Row(song_id='SOZCTXZ12AB0182364', title='Setanta matins', artist_id='AR5KOSW1187FB35FF4', year=0, duration=269.58322)

Create the `user_table`. We use `alias` to rename the columns in accordance with the project instructions.

In [16]:
user_table = df_log_join_song.select(col('userId').alias('user_id'), 
                                     col('firstName').alias('first_name'), 
                                     col('lastName').alias('last_name'), 
                                     col('gender'), 
                                     col('level'))

Check the `user_table`

In [17]:
user_table.head()

Row(user_id='15', first_name='Lily', last_name='Koch', gender='F', level='paid')

Create the `artist_table`

In [18]:
artist_table = df_log_join_song.select(col('artist_id'), 
                                     col('artist_name').alias('name'), 
                                     col('artist_location').alias('location'), 
                                     col('artist_latitude').alias('latitude'),
                                     col('artist_longitude').alias('longitude'))

Check the `artist_table`

In [19]:
artist_table.head()

Row(artist_id='AR5KOSW1187FB35FF4', name='Elena', location='Dubai UAE', latitude=49.80388, longitude=15.47491)

Create the `time_table`. We use the PySpark date functions to extract the hour, day, week, month, year and weekday from the `start_time` timestamp. We also remove duplicate rows with `dropDuplicates`, so that each timestamp appears only once.

In [20]:
time_table = df_log_join_song.select(col('start_time'), 
                                     hour(df_log_join_song.start_time).alias('hour'), 
                                     dayofmonth(df_log_join_song.start_time).alias('day'), 
                                     weekofyear(df_log_join_song.start_time).alias('week'),
                                     month(df_log_join_song.start_time).alias('month'),
                                     year(df_log_join_song.start_time).alias('year'),
                                     dayofweek(df_log_join_song.start_time).alias('weekday')).dropDuplicates()
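As a cross-check of what each column means, the same fields can be derived in plain Python (not Spark code). This sketch assumes Spark's documented conventions: `weekofyear` returns the ISO 8601 week number and `dayofweek` counts from 1 for Sunday to 7 for Saturday. The helper `time_parts` is hypothetical; the timestamp is the one printed in the check below.

```python
from datetime import datetime


def time_parts(ts):
    """Derive the time_table fields from one timestamp (hypothetical helper)."""
    return {
        "start_time": ts,
        "hour": ts.hour,
        "day": ts.day,
        "week": ts.isocalendar()[1],         # ISO 8601 week number
        "month": ts.month,
        "year": ts.year,
        "weekday": ts.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
    }


# The same timestamp twice, to show the effect of removing duplicates.
stamps = [datetime(2018, 11, 21, 21, 56, 48), datetime(2018, 11, 21, 21, 56, 48)]
unique_rows = [time_parts(ts) for ts in sorted(set(stamps))]
print(unique_rows)
```

21 November 2018 was a Wednesday, which gives `weekday=4` under the Sunday-first numbering.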

Check the `time_table`.

In [21]:
time_table.head()

Row(start_time=datetime.datetime(2018, 11, 21, 21, 56, 48), hour=21, day=21, week=47, month=11, year=2018, weekday=4)

Create the `songplays_table`: keep only records with page `NextSong` and rename the columns.

In [22]:
# Filter first, while the `page` column is still part of the data frame, then select and rename.
songplays_table = df_log_join_song.filter(col('page') == 'NextSong')\
                                  .select(col('start_time'), 
                                          col('userId').alias('user_id'), 
                                          col('level'), 
                                          col('song_id'), 
                                          col('artist_id'),
                                          col('sessionId').alias('session_id'),
                                          col('location'),
                                          col('userAgent').alias('user_agent'))
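The filter-and-rename step can be illustrated in plain Python (not Spark code). The first sample row uses values from the `songplays_table` check below, with the user agent and timestamp shortened to placeholders; the second row, with page `Home`, is hypothetical. The names `KEEP`, `RENAMES` and `to_songplay` are introduced only for this sketch.

```python
# Columns to keep, and the new names for the ones that are renamed.
KEEP = ["start_time", "userId", "level", "song_id", "artist_id",
        "sessionId", "location", "userAgent"]
RENAMES = {"userId": "user_id", "sessionId": "session_id", "userAgent": "user_agent"}


def to_songplay(row):
    """Project a joined row onto the songplays columns (hypothetical helper)."""
    return {RENAMES.get(name, name): row[name] for name in KEEP}


first = {
    "start_time": "2018-11-21 21:56:48",  # placeholder string instead of a timestamp
    "userId": "15",
    "level": "paid",
    "song_id": "SOZCTXZ12AB0182364",
    "artist_id": "AR5KOSW1187FB35FF4",
    "sessionId": 818,
    "location": "Chicago-Naperville-Elgin, IL-IN-WI",
    "userAgent": "Mozilla/5.0",           # shortened placeholder
    "page": "NextSong",
}
second = dict(first, userId="99", page="Home")  # hypothetical non-play event

# Filter on `page` before projecting, because `page` is not among the kept columns.
songplays = [to_songplay(r) for r in (first, second) if r["page"] == "NextSong"]
print(songplays)
```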

Check `songplays_table`.

In [23]:
songplays_table.head()

Row(start_time=datetime.datetime(2018, 11, 21, 21, 56, 48), user_id='15', level='paid', song_id='SOZCTXZ12AB0182364', artist_id='AR5KOSW1187FB35FF4', session_id=818, location='Chicago-Naperville-Elgin, IL-IN-WI', user_agent='"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"')