# Data Exploration for Sparkify Data Model  
To come up with a proper model, I downloaded the (currently relatively small) dataset, stored it as CSV files and did some exploration.  

The targeted star schema is as follows:

**Fact Table: `songplays`**
- `session_id`
- `songplay_id`
- `start_time`
- `artist_id`
- `song_id`
- `user_id`
- `level`
- `location`
- `user_agent`

**Dimension Table: `time`**
- `start_time`
- `year`
- `month`
- `day`
- `hour`
- `week`
- `weekday`

**Dimension Table: `artists`**
- `artist_id`
- `name`
- `location`
- `latitude`
- `longitude`

**Dimension Table: `songs`**
- `song_id`
- `title`
- `artist_id`
- `year`
- `duration`

**Dimension Table: `users`**
- `user_id`
- `first_name`
- `last_name`
- `gender`
- `level`

The source data is stored in S3 buckets. The log data is stored in `s3://udacity-dend/log_data` and the song data is stored in `s3://udacity-dend/song_data`. Both are stored in JSON format. The log data contains information about the user activity on the Sparkify app. The song data contains additional information about the songs that are available in the Sparkify app.  

## Main Observations and Findings  

### Structure of the `log_data` files

The `log_data` files contain 8056 entries with the following fields:  

- Fields for later direct use:
    - Identifiers / keys:
        - `sessionId`: Session ID as an integer
        - `itemInSession`: Item in session as an integer
    - Timestamp:
        - `ts`: Timestamp as a long integer giving the number of milliseconds since the Unix epoch (1970-01-01 UTC)
    - Artist:
        - `artist`: Name of the Artist as a string
    - Song:
        - `song`: Song title as a string
    - User:
        - `userId`: User ID as a string
        - `firstName`: First name of the user as a string
        - `lastName`: Last name of the user as a string
        - `gender`: Gender of the user as a string being either "M" or "F"
        - `level`: Level of the user as a string being either "free" or "paid"
    - Other usage data:
        - `location`: Location of the user as a string
        - `userAgent`: User agent (browser) as a string
- Fields for later pre-processing:  
    - `auth`: Authentication status as a string being either "Logged In" or "Logged Out"  
    - `length`: Playback length of the song as a float  
- Other fields not used later:  
    - `method`: Method as a string being either "GET" or "PUT"  
    - `page`: Page as a string  
    - `registration`: Registration as a float  
    - `status`: Status message as an integer  

### Key findings regarding the `log_data` files:
- `sessionId` and `itemInSession` in combination can be used as primary key for the fact table.
- Relevant facts like song and artist information are only given when 
    - the user is not logged out, 
    - the length of the playback is not zero  
    This means we should pre-filter the data accordingly.
- With these filters, all other data is available / not missing.
- The combination of `userId`, `firstName`, `lastName`, `gender` and `level` is not unique as users change their `level` over time. As a consequence, I've decided to use `userId` as primary key for the users dimension table taking the latest available `level` into account. However, the `level` in the fact table reflects the `level` at the time of the song play.  
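
The last finding can be sketched on a few made-up rows. This is only an illustration under the assumption that the rows can be ordered by `ts`; the sample values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical log rows: user "15" upgrades from "free" to "paid" over time
sample_log = pd.DataFrame(
    {
        "userId": ["15", "15", "26"],
        "level": ["free", "paid", "free"],
        "ts": [1541100000000, 1541200000000, 1541150000000],
    }
)

# Keep the most recent row per user, i.e. the latest known level
latest_level = (
    sample_log
    .sort_values("ts")
    .drop_duplicates(subset="userId", keep="last")
    .set_index("userId")["level"]
    .to_dict()
)
print(latest_level)
```

The fact table keeps the `level` of each individual row, so the history of level changes is not lost by this reduction.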

### Structure of the `song_data` files

The `song_data` files contain 14896 entries with the following fields:

- Song related fields:
    - `song_id`: Song ID as a string with 18 characters
    - `title`: Song title as a string
    - `year`: Year of the song as an integer with, in general, four digits, but some being 0, meaning the year is unknown / missing
    - `duration`: Duration of the song as a float
- Artist related fields:
    - `artist_id`: Artist ID as a string with 18 characters
    - `artist_name`: Name of the artist as a string
    - `artist_location`: Location of the artist as a string with many missing values
    - `artist_latitude`: Latitude of the artist as a float with many missing values
    - `artist_longitude`: Longitude of the artist as a float with many missing values

### Key findings regarding the `song_data` files:

- One may be tempted to use this data to build the dimension tables `songs` and `artists`, however, this would be a bad idea as data quality in combination with the `log_data` files is not good. There are many songs and artists in the `log_data` files that are not available in the `song_data` files. 
- This means, we use the detail available in the `song_data` files only to enrich the data in the `log_data` files where possible.
- In addition to this, `artist_id`, `artist_name`, `artist_location`, `artist_latitude` and `artist_longitude` are not a unique set. There are artists with several locations and inconsistent geographical coordinates. When using this data to enrich the `log_data` files, we should be aware of this. Here, I've decided to use the first available entry for each artist.
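
A minimal sketch of this enrichment idea, using hypothetical artist names and locations: the log data defines which artists exist, and the song data only contributes details where a match is found.

```python
import pandas as pd

# Hypothetical artists as they appear in the log data
log_artists = pd.DataFrame({"name": ["Artist A", "Artist B"]})

# Hypothetical song_data rows: "Artist A" has two conflicting locations,
# "Artist B" does not occur at all
song_artists = pd.DataFrame(
    {
        "name": ["Artist A", "Artist A"],
        "location": ["London", "London, UK"],
    }
)

# Keep the first entry per artist, then enrich the log-based artists
first_entry = song_artists.drop_duplicates(subset="name", keep="first")
enriched = log_artists.merge(first_entry, on="name", how="left")
print(enriched)
```

The left join keeps every artist from the log data; artists without a match simply end up with a missing location.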

## Overall Conclusion for the Data Model

- Due to consistency issues, our main source is the `log_data` files.
- The `song_data` files are only used to enrich the data in the `log_data` files where possible.
- The `log_data` files are pre-filtered to only contain entries where the user is logged in and the length of the playing is not zero.
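
The pre-filter can be illustrated on a tiny, made-up frame. One detail worth noting, assuming missing lengths are read as `NaN`: the comparison `length > 0` is false for `NaN`, so rows without a length are removed as well.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one song play, one page view without a song, one logged-out event
sample_events = pd.DataFrame(
    {
        "auth": ["Logged In", "Logged In", "Logged Out"],
        "song": ["Song X", None, None],
        "length": [215.3, np.nan, np.nan],
    }
)

# NaN > 0 evaluates to False, so rows without a length are dropped too
filtered_events = sample_events[
    (sample_events["auth"] != "Logged Out") & (sample_events["length"] > 0)
]
print(filtered_events)
```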

## Exploration in Detail

### Loading the Data

In [None]:
# Load libraries
import csv
import json
import sqlite3
from typing import Tuple

import boto3
from dotenv import dotenv_values
import pandas as pd

In [None]:
# Set max rows to 30
pd.set_option('display.max_rows', 30)

# Get credentials from .env file
env = dotenv_values()

AWS_ACCESS_KEY_ID = env["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = env["AWS_SECRET_ACCESS_KEY"]

In [None]:
# Connect to S3
s3 = boto3.client(
    "s3",
    region_name="us-west-2",
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

In [None]:
# Function to split bucket string into name and prefix
def bucket_name_and_prefix_from_string(bucket_string: str) -> Tuple[str, str]:
    """Splits an S3 URI such as 's3://bucket/prefix' into bucket name and key prefix."""
    bucket_name = bucket_string.split("/")[2]
    prefix = f"{'/'.join(bucket_string.split('/')[3:])}/"
    return bucket_name, prefix
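
As a possible alternative to manual string splitting, the standard library can parse the URI. The helper name `split_s3_uri` is hypothetical and not part of the notebook; this is just a sketch of the same idea.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an S3 URI into bucket name and key prefix (hypothetical helper)."""
    parsed = urlparse(s3_uri)
    # netloc holds the bucket name, path holds the key prefix
    prefix = parsed.path.strip("/")
    return parsed.netloc, (f"{prefix}/" if prefix else "")


print(split_s3_uri("s3://udacity-dend/log_data"))  # ('udacity-dend', 'log_data/')
```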

In [None]:
# Load log data into a list
log_data = []

bucket_name, prefix = bucket_name_and_prefix_from_string('s3://udacity-dend/log_data')
bucket_name, prefix

paginator = s3.get_paginator("list_objects")
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

for page in page_iterator:
    for item in page["Contents"]:
        key = item["Key"]
        if key != prefix:
            object_body = s3.get_object(Bucket=bucket_name, Key=key)["Body"].read().decode("utf-8")
            for data in object_body.split("\n"):
                if data:
                    log_data.append(json.loads(data))

# Show length of log data
len(log_data)
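
The loop above treats each object as a sequence of JSON records, one per line. A minimal sketch of that parsing step on a made-up object body (the field values are hypothetical):

```python
import json

# Hypothetical object body with two JSON records and a trailing newline
object_body = '{"sessionId": 1, "song": "A"}\n{"sessionId": 2, "song": null}\n'

# splitlines() removes the line breaks; blank lines are skipped explicitly
records = [json.loads(line) for line in object_body.splitlines() if line.strip()]
print(len(records))
```

JSON `null` values become `None` in Python and later show up as missing values in the dataframe.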

In [None]:
# Show an example of the log data
log_data[4_000]

In [None]:
# Put log data into a dataframe and save to csv
all_log_data = pd.DataFrame(log_data)
all_log_data.to_csv("./data/project/log_data.csv", index=False)

In [None]:
# Load song data into a list
song_data = []

bucket_name, prefix = bucket_name_and_prefix_from_string('s3://udacity-dend/song_data')
bucket_name, prefix

paginator = s3.get_paginator("list_objects")
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

for page in page_iterator:
    for item in page["Contents"]:
        key = item["Key"]
        if key != prefix:
            object_body = s3.get_object(Bucket=bucket_name, Key=key)["Body"].read().decode("utf-8")
            for data in object_body.split("\n"):
                if data:
                    song_data.append(json.loads(data))

# Show length of song data
len(song_data)

In [None]:
# Show an example of the song data
song_data[4_000]

In [None]:
# Put song data into a dataframe and save to csv
all_song_data = pd.DataFrame(song_data)
all_song_data.to_csv("./data/project/song_data.csv", index=False)

At the center of the data model is the fact table `songplays`, which contains information about the user activity. So let's start here:

### Exploration of `log_data` files

In [None]:
all_log_data = pd.read_csv('./data/project/log_data.csv')
all_log_data.head()

In [None]:
all_log_data.describe(include="all").T.fillna("")

**QUESTION** - Are `sessionId` and `itemInSession` applicable primary keys for this table?

In [None]:
all_log_data[["sessionId", "itemInSession"]].drop_duplicates().shape[0] == all_log_data.shape[0]

**QUESTION** - What's the min and max length of the strings in the table?

In [None]:
for column in all_log_data.select_dtypes(include=['object']).columns:
    print(column, all_log_data[column].dropna().map(lambda x: len(str(x))).min(), all_log_data[column].dropna().map(lambda x: len(str(x))).max())

In [None]:
log_columns_required = [
    # Data for primary keys
    "sessionId", 
    "itemInSession", 
    # Data for timestamp
    "ts",
    # User related data
    "userId",
    "firstName", 
    "lastName",
    "gender",
    "level", # also usage related
    # Usage related data
    "location",
    "userAgent",
    # Song related data
    "song", 
    # Artist related data
    "artist",
]

log_data = all_log_data[log_columns_required]
log_data.head()

In [None]:
log_data.describe(include='all').T.fillna("")

**QUESTION** - Can we use `sessionId` and `itemInSession` as primary key for the fact table?

In [None]:
log_data[['sessionId', 'itemInSession']].drop_duplicates().shape[0] == log_data.shape[0]

**QUESTION** - What's the structure of missing data?

- Subquestion: Are `userId`, `firstName`, `lastName`, `gender`, `location` and `userAgent` always missing for the same rows?

In [None]:
log_data[["userId", "firstName", "lastName", "gender", "location", "userAgent"]].dropna().shape[0] == log_data["userId"].dropna().shape[0]

- Subquestion: Is there something special about the missing `userId` data?

In [None]:
all_log_data[all_log_data["userId"].isna()].drop(["userId", "firstName", "lastName", "gender", "location", "userAgent"], axis=1).head()

In [None]:
all_log_data[all_log_data["userId"].isna()].drop(["userId", "firstName", "lastName", "gender", "location", "userAgent"], axis=1).describe(include='all').T.fillna("")

- Subquestion: Do all events with a non-missing `userId` have `auth` == "Logged In"?

In [None]:
all_log_data.loc[log_data["userId"].notna(), "auth"].unique().tolist() == ["Logged In"]

**FINDING** - log_data with `auth` == "Logged Out" can be dropped as it doesn't contain any useful information for the data model.

**QUESTION** - Are `song` and `artist` always missing for the same rows?

In [None]:
log_data[["song", "artist"]].dropna().shape[0] == log_data["song"].dropna().shape[0]

**QUESTION** - Are there rows with a non-missing `song` where `userId` is missing?

In [None]:
log_data.loc[log_data["song"].notna() & log_data["userId"].isna()].shape[0] > 0

**QUESTION** - Is there something special about the missing `song` and `artist` data?

In [None]:
all_log_data.loc[all_log_data["song"].isna()].drop(["song", "artist"], axis=1).describe(include='all').T.fillna("")

**QUESTION** - Is `length` ever 0 when `song` is not missing?

In [None]:
(all_log_data.loc[all_log_data["song"].notna()].drop(["song", "artist"], axis=1)["length"] == 0).any() == False

**FINDING** - log_data with `length` == 0 can be dropped as it doesn't contain any useful information for the data model.

So, let's filter the data accordingly:

In [None]:
filtered_log_data = all_log_data.loc[(all_log_data["auth"] != "Logged Out") & (all_log_data["length"] > 0), log_columns_required]
filtered_log_data.head()

In [None]:
filtered_log_data.describe(include='all').T.fillna("")

**QUESTION** - Which time range does `ts` cover for the song plays? Is this consistent with the log_data file organisation covering the month of November 2018?

In [None]:
pd.to_datetime(filtered_log_data["ts"], unit='ms').describe(datetime_is_numeric=True)

**QUESTION** - In which locations are the songs played?

In [None]:
filtered_log_data["location"].value_counts()[:25]

**QUESTION** - Which user agents are used to play the songs?

In [None]:
filtered_log_data["userAgent"].value_counts()[:25]

#### Closer look at the user related data in `log_data`

Now, let's have a closer look at the user related data contained in the log data, namely: 
- `userId`, 
- `firstName`, 
- `lastName`, 
- `gender`, and 
- `level`

In [None]:
user_data = filtered_log_data[["userId", "firstName", "lastName", "gender", "level"]].drop_duplicates().sort_values("userId")
user_data["userId"] = user_data["userId"].astype(int).astype(str)
user_data.head()

In [None]:
user_data.describe(include='all').T.fillna("")

**QUESTION** - Where do duplicates in user_data come from?

In [None]:
user_data[user_data.duplicated(subset=["userId"], keep=False)].sort_values(["userId", "level"])

**FINDING** - The duplicates in user_data come from the fact that the user can change his/her subscription level.

#### Closer look at the song and artist related data in `log_data`

**QUESTION** - What's the structure of song related data in the log_data, namely `song` and `artist`?

In [None]:
song_from_filtered_log_data = filtered_log_data.loc[:, ["song", "artist"]].sort_values(["song", "artist"]).drop_duplicates()
song_from_filtered_log_data.head()

In [None]:
song_from_filtered_log_data.describe(include='all').T.fillna("")

**QUESTION** - What is the structure of the duplicate songs in the log_data?

In [None]:
song_from_filtered_log_data.loc[song_from_filtered_log_data["song"].duplicated(keep=False)]

**FINDING** - To identify the song in the log data, we need both `song` and `artist` from the log_data.

**QUESTION** - Is the filtering giving back valid data for the required fields?

In [None]:
(all_log_data.query("auth == 'Logged In' and length > 0")[["ts", "userId", "artist", "song"]].count() == 6820).all()

### Exploration of `song_data` files

In [None]:
all_song_data = pd.read_csv('./data/project/song_data.csv')
all_song_data.head()

In [None]:
all_song_data.describe(include='all').T.fillna("")

**QUESTION** - Are artist_id and song_id applicable primary keys for this table?

In [None]:
all_song_data[["song_id", "artist_id"]].drop_duplicates().shape[0] == all_song_data.shape[0]

**QUESTION** - What is the length of the strings in the song_data?

In [None]:
for column in all_song_data.select_dtypes("object"):
    print(column, all_song_data[column].dropna().map(lambda x: len(str(x))).min(), all_song_data[column].dropna().map(lambda x: len(str(x))).max())

From the requirements of the project, we know that we might need the following columns:
- `song_id`
- `title`
- `year`
- `duration`
- `artist_id`
- `artist_name`
- `artist_location`
- `artist_latitude`
- `artist_longitude`

This means, we can drop `num_songs` from the table.

So let's narrow the dataset down a bit:

In [None]:
columns_required = ['song_id', 'title', 'year', 'duration', 'artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']
song_data = all_song_data[columns_required]
song_data.head()

In [None]:
song_data.describe(include='all').T.fillna("")

**QUESTION** - Is song_id unique for title and artist_name?

In [None]:
song_data[["title", "artist_name"]].drop_duplicates().shape[0] == song_data.shape[0]

**QUESTION** - What's the reason for the duplicates?

In [None]:
song_data.loc[song_data.duplicated(subset=["title", "artist_name"], keep=False)].sort_values(["title", "artist_name"])

**FINDING** - title and artist_name are not unique in relation to song_id as there are ambiguous entries for the duration.

#### Closer look at the artist related data in `song_data`

In [None]:
artists_from_song_data = song_data[["artist_id", "artist_name", "artist_location", "artist_latitude", "artist_longitude"]].drop_duplicates()
artists_from_song_data.head()

**QUESTION** - Is each artist_id associated with exactly one artist_name?

In [None]:
artists_from_song_data[["artist_id", "artist_name"]].drop_duplicates().shape[0] == artists_from_song_data["artist_id"].drop_duplicates().shape[0]

**QUESTION** - What is the reason for non-unique artist_ids?

In [None]:
artists_from_song_data.loc[artists_from_song_data.duplicated(subset=["artist_name"], keep=False)].sort_values(["artist_name"]).iloc[:30]

**FINDING** - Non-unique artist_ids are due to the fact that there are artists with different locations and/or geographical coordinates, probably due to ambiguity in the data.

### Shared Information between `log_data` and `song_data`

In [None]:
artist_song_from_log = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].drop_duplicates().sort_values(["artist", "song"]).reset_index(drop=True).reset_index().rename(columns={"index": "from_log"})
artist_song_from_log

In [None]:
artist_song_from_songs = all_song_data[["artist_name", "title"]].drop_duplicates().sort_values(["artist_name", "title"]).reset_index(drop=True).reset_index().rename(columns={"artist_name": "artist", "title": "song", "index": "from_songs"})
artist_song_from_songs

In [None]:
artist_song_from_log.merge(artist_song_from_songs, on=["artist", "song"], how="right")

**QUESTION** - How many songs are in the log_data that are also in the song_data?

In [None]:
artist_song_from_log.merge(artist_song_from_songs, on=["artist", "song"], how="right")["from_log"].count()

**FINDING** - The overlap between the log_data and the song_data is pretty low.
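
One way to put a number on such an overlap is the `indicator` option of `DataFrame.merge`. The sketch below uses hypothetical artist and song pairs, not the real data, and the frame names are made up for the example.

```python
import pandas as pd

# Hypothetical (artist, song) pairs from both sources
pairs_from_log = pd.DataFrame({"artist": ["A", "B", "C"], "song": ["s1", "s2", "s3"]})
pairs_from_songs = pd.DataFrame({"artist": ["A", "D"], "song": ["s1", "s4"]})

# indicator=True adds a "_merge" column telling where each pair was found
overlap = pairs_from_log.merge(
    pairs_from_songs, on=["artist", "song"], how="left", indicator=True
)
share_matched = (overlap["_merge"] == "both").mean()
print(f"{share_matched:.0%} of the log pairs have a match in the song data")
```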

## Modeling Using Pandas

In [None]:
# Creating users_df
users_df = pd.DataFrame(
    columns=[
        "user_id",
        "first_name",
        "last_name",
        "gender",
        "level",
    ],
)

# Filling users_df
users_df = (
    all_log_data
    .query("(auth != 'Logged Out') & (length > 0)")
    [["userId", "firstName", "lastName", "gender", "level", "ts"]]
    .rename(columns={"userId": "user_id", "firstName": "first_name", "lastName": "last_name"})
    .sort_values("ts")
    .drop_duplicates(subset=["user_id", "first_name", "last_name", "gender"], keep="last")
    .drop("ts", axis=1)
    .reset_index(drop=True)
)

users_df["user_id"] = users_df["user_id"].astype(int)

# Showing users_df
users_df

In [None]:
# Creating time_df from the distinct timestamps of the filtered log data
start_times = pd.to_datetime(
    all_log_data.query("(auth != 'Logged Out') & (length > 0)")["ts"].drop_duplicates().sort_values(),
    unit="ms",
)

# Filling time_df; all parts are derived from the same datetime series
time_df = pd.DataFrame(
    {
        "start_time": start_times,
        "year": start_times.dt.year,  # calendar year, consistent with month and day
        "month": start_times.dt.month,
        "day": start_times.dt.day,
        "hour": start_times.dt.hour,
        "week": start_times.dt.isocalendar().week,
        "weekday": start_times.dt.weekday,
    }
)

# Showing time_df
time_df

In [None]:
# Creating artists_df
artists_df = pd.DataFrame(
    columns=[
        # "artist_id",
        "name",
        #"location",
        #"latitude",
        #"longitude",
    ],
)

# Filling artists_df
artists_df["name"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["artist"].drop_duplicates().sort_values()

additional_artist_info = (
    all_song_data
    [["artist_name", "artist_location", "artist_latitude", "artist_longitude"]]
    .rename(columns={"artist_name": "name", "artist_location": "location", "artist_latitude": "latitude", "artist_longitude": "longitude"})
    .sort_values(["name", "location", "latitude", "longitude"])
    .drop_duplicates(subset=["name"])
    .reset_index(drop=True)
)

artists_df = artists_df.merge(additional_artist_info, on="name", how="left")
artists_df = artists_df.reset_index().rename(columns={"index": "artist_id"})

# Showing artists_df
artists_df

In [None]:
# Creating songs_df
songs_df = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["song", "artist"]].drop_duplicates().sort_values(["song"]).rename(columns={"song": "title", "artist": "name"}).reset_index(drop=True).reset_index().rename(columns={"index": "song_id"})

songs_df = songs_df.merge(artists_df[["name", "artist_id"]], on="name", how="left")
songs_df

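# (title, artist_name) is not unique in song_data, so keep one entry per pair
# to avoid duplicated song_id values after the left join
songs_df = songs_df.merge(
    all_song_data[["artist_name", "title", "year", "duration"]]
    .drop_duplicates(subset=["artist_name", "title"])
    .rename(columns={"artist_name": "name"}), 
    on=["name", "title"], 
    how="left"
).drop("name", axis=1)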

# Showing songs_df
songs_df

In [None]:
helper_1_df = songs_df.merge(artists_df, on="artist_id", how="left")[["song_id", "title", "artist_id", "name"]].rename(columns={"name": "artist", "title": "song"})
helper_2_df = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].merge(helper_1_df, on=["artist", "song"], how="left").drop(["artist", "song"], axis=1)
helper_2_df

In [None]:
# Creating songplays_df
songplays_df = pd.DataFrame(
    columns=[
        "session_id",
        "songplay_id",
        "start_time",
        "song_id",
        "artist_id",
        "user_id",
        "location",
        "level",
        "user_agent",
    ],
)

# Filling songplays_df
songplays_df["session_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["sessionId"]
songplays_df["songplay_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["itemInSession"]
songplays_df["start_time"] = pd.to_datetime(all_log_data.query("(auth != 'Logged Out') & (length > 0)")["ts"], unit="ms")

helper_1_df = songs_df.merge(artists_df, on="artist_id", how="left")[["song_id", "title", "artist_id", "name"]].rename(columns={"name": "artist", "title": "song"})
helper_2_df = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].merge(helper_1_df, on=["artist", "song"], how="left").drop(["artist", "song"], axis=1)

songplays_df["song_id"] = helper_2_df["song_id"].values
songplays_df["artist_id"] = helper_2_df["artist_id"].values

songplays_df["user_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["userId"].astype(int)
songplays_df["location"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["location"]
songplays_df["level"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["level"]
songplays_df["user_agent"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["userAgent"]

songplays_df = songplays_df.reset_index(drop=True)

# Showing songplays_df
songplays_df

## Modeling Using SQLite

In [None]:
connection = sqlite3.connect("./data/project/sparkify.sqlite3")
cursor = connection.cursor()

In [None]:
# Delete staging tables
drop_log_data_table = "DROP TABLE IF EXISTS log_data;"
drop_song_data_table = "DROP TABLE IF EXISTS song_data;"

# Delete dimension tables
drop_time_table = "DROP TABLE IF EXISTS time;"
drop_users_table = "DROP TABLE IF EXISTS users;"
drop_songs_table = "DROP TABLE IF EXISTS songs;"
drop_artists_table = "DROP TABLE IF EXISTS artists;"

# Delete fact table
drop_songplays_table = "DROP TABLE IF EXISTS songplays;"

# Drop all tables
drop_tables = [
    drop_log_data_table,
    drop_song_data_table,
    drop_time_table,
    drop_users_table,
    drop_songs_table,
    drop_artists_table,
    drop_songplays_table,
]

In [None]:
# Create staging tables
create_log_data_table = """
CREATE TABLE IF NOT EXISTS log_data (
    artist          VARCHAR(200)    NULL,
    auth            VARCHAR(50)     NOT NULL,
    firstName       VARCHAR(50)     NULL,
    gender          CHAR(1)         NULL,
    itemInSession   INTEGER         NOT NULL,
    lastName        VARCHAR(50)     NULL,
    length          FLOAT           NULL,
    level           CHAR(4)         NOT NULL,
    location        VARCHAR(200)    NULL,
    method          VARCHAR(10)     NOT NULL,
    page            VARCHAR(50)     NOT NULL,
    registration    FLOAT           NULL,
    sessionId       INTEGER         NOT NULL,
    song            VARCHAR(200)    NULL,
    status          INTEGER         NOT NULL,
    ts              INTEGER         NOT NULL,
    userAgent       VARCHAR(200)    NULL,
    userId          INTEGER         NULL,
    PRIMARY KEY (sessionId, itemInSession)
);
"""

create_song_data_table = """
CREATE TABLE IF NOT EXISTS song_data (
    artist_id       VARCHAR(50)     NOT NULL,
    artist_latitude FLOAT           NULL,
    artist_location VARCHAR(200)    NULL,
    artist_longitude FLOAT          NULL,
    artist_name     VARCHAR(200)    NOT NULL,
    duration        FLOAT           NOT NULL,
    num_songs       INTEGER         NOT NULL,
    song_id         VARCHAR(50)     NOT NULL,
    title           VARCHAR(200)    NOT NULL,
    year            INTEGER         NOT NULL,
    PRIMARY KEY (artist_id, song_id)
);
"""

# Create dimension tables
create_time_table = """
CREATE TABLE IF NOT EXISTS time (
    start_time      TIMESTAMP       NOT NULL,
    year            INTEGER         NOT NULL,
    month           INTEGER         NOT NULL,
    day             INTEGER         NOT NULL,
    hour            INTEGER         NOT NULL,
    week            INTEGER         NOT NULL,
    weekday         INTEGER         NOT NULL,
    PRIMARY KEY (start_time)
);
"""

create_users_table = """
CREATE TABLE IF NOT EXISTS users (
    user_id         INTEGER         NOT NULL,
    first_name      VARCHAR(50)     NOT NULL,
    last_name       VARCHAR(50)     NOT NULL,
    gender          CHAR(1)         NOT NULL,
    level           CHAR(4)         NOT NULL,
    PRIMARY KEY (user_id)
);
"""

create_artists_table = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id       INTEGER         NOT NULL,
    name            VARCHAR(200)    NOT NULL,
    location        VARCHAR(200)    NULL,
    latitude        FLOAT           NULL,
    longitude       FLOAT           NULL,
    PRIMARY KEY (artist_id)
);
"""

create_songs_table = """
CREATE TABLE IF NOT EXISTS songs (
    song_id         INTEGER         NOT NULL,
    title           VARCHAR(200)    NOT NULL,
    artist_id       INTEGER         NOT NULL,
    year            INTEGER         NULL,
    duration        FLOAT           NULL,
    PRIMARY KEY (song_id),
    FOREIGN KEY (artist_id) REFERENCES artists (artist_id)
);
"""

# Create fact table
create_songplays_table = """
CREATE TABLE IF NOT EXISTS songplays (
    session_id      INTEGER         NOT NULL,
    songplay_id     INTEGER         NOT NULL,
    start_time      TIMESTAMP       NOT NULL,
    artist_id       INTEGER         NOT NULL,
    song_id         INTEGER         NOT NULL,
    user_id         INTEGER         NOT NULL,
    level           CHAR(4)         NOT NULL,
    location        VARCHAR(200)    NOT NULL,
    user_agent      VARCHAR(200)    NOT NULL,
    PRIMARY KEY (session_id, songplay_id),
    FOREIGN KEY (start_time) REFERENCES time (start_time),
    FOREIGN KEY (artist_id) REFERENCES artists (artist_id),
    FOREIGN KEY (song_id) REFERENCES songs (song_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);
"""

# Create all tables
create_tables = [
    create_log_data_table,
    create_song_data_table,
    create_time_table,
    create_users_table,
    create_artists_table,
    create_songs_table,
    create_songplays_table,
]

In [None]:
insert_log_data_table = """
INSERT INTO log_data (
    artist,
    auth,
    firstName,
    gender,
    itemInSession,
    lastName,
    length,
    level,
    location,
    method,
    page,
    registration,
    sessionId,
    song,
    status,
    ts,
    userAgent,
    userId
) VALUES (
    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,?, ?, ?, ?, ?, ?
);
"""

insert_song_data_table = """
INSERT INTO song_data (
    artist_id,
    artist_latitude,
    artist_location,
    artist_longitude,
    artist_name,
    duration,
    num_songs,
    song_id,
    title,
    year
) VALUES (
    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?
);
"""

# Insert tables
insert_tables = [
    insert_log_data_table,
    insert_song_data_table,
]

In [None]:
# Source data paths
log_data_path = "./data/project/log_data.csv"
song_data_path = "./data/project/song_data.csv"

# Data paths for staging tables
data_paths = [
    log_data_path,
    song_data_path,
]

# Drop staging tables
drop_staging_tables = [
    drop_log_data_table,
    drop_song_data_table,
]

# Create staging tables
create_staging_tables = [
    create_log_data_table,
    create_song_data_table,
]

# Insert staging tables
insert_staging_tables = [
    insert_log_data_table,
    insert_song_data_table,
]

# Drop all staging tables
for query in drop_staging_tables:
    cursor.execute(query)

# Create all staging tables
for query in create_staging_tables:
    cursor.execute(query)

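# Insert all staging tables
for i, query in enumerate(insert_staging_tables):
    with open(data_paths[i], "r", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            # Empty CSV fields represent missing values and become NULL
            data = [None if x == "" else x for x in row]
            cursor.execute(query, data)

# Persist the inserted rows in the database file
connection.commit()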
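
The same loading pattern can be tried out against an in-memory database. This is only a sketch: the table `demo_log`, its two columns and the CSV content are hypothetical, and `executemany` is used here as one possible way to send all rows in a single call.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; empty fields stand for missing values
csv_text = "artist,length\nArtist A,215.3\n,\n"

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE demo_log (artist TEXT NULL, length FLOAT NULL);")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip header
rows = [[None if value == "" else value for value in row] for row in reader]

# executemany sends all rows with one call; commit ends the transaction
demo_connection.executemany("INSERT INTO demo_log VALUES (?, ?);", rows)
demo_connection.commit()

null_count = demo_connection.execute(
    "SELECT COUNT(*) FROM demo_log WHERE artist IS NULL;"
).fetchone()[0]
print(null_count)
```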

In [None]:
# Function to get pandas dataframe from sql query
def get_df_from_sql(sql_query):
    cursor.execute(sql_query)
    # Passing the column names directly also works for empty result sets
    columns = [x[0] for x in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)
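
pandas also offers `read_sql_query`, which could serve the same purpose as the helper above. A minimal sketch against a hypothetical in-memory table `demo`:

```python
import sqlite3

import pandas as pd

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE demo (id INTEGER, name TEXT);")
demo_connection.execute("INSERT INTO demo VALUES (1, 'a'), (2, 'b');")

# read_sql_query returns the result set with column names taken from the query
demo_df = pd.read_sql_query("SELECT * FROM demo ORDER BY id;", demo_connection)
empty_df = pd.read_sql_query("SELECT * FROM demo WHERE id > 99;", demo_connection)
print(demo_df)
print(list(empty_df.columns))
```

Queries without matching rows still return a dataframe with the expected columns.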

In [None]:
log_data = get_df_from_sql("SELECT * FROM log_data;")
log_data.sort_values(by=["artist", "song"])

In [None]:
# Check if all data is inserted correctly
(
    (get_df_from_sql("SELECT * FROM log_data;").fillna("__NA__") == all_log_data.fillna("__NA__")).all().all() == True,
    (get_df_from_sql("SELECT * FROM song_data;").fillna("__NA__") == all_song_data.fillna("__NA__")).all().all() == True,
)

In [None]:
# Insert time query
insert_time_table = """
INSERT INTO 
    time
SELECT 
    DATETIME(ts / 1000, 'auto')                 AS start_time,
    strftime('%Y', DATETIME(ts / 1000, 'auto')) AS year,
    strftime('%m', DATETIME(ts / 1000, 'auto')) AS month,
    strftime('%d', DATETIME(ts / 1000, 'auto')) AS day,
    strftime('%H', DATETIME(ts / 1000, 'auto')) AS hour,
    strftime('%W', DATETIME(ts / 1000, 'auto')) AS week,
    strftime('%w', DATETIME(ts / 1000, 'auto')) AS weekday
FROM 
    log_data
WHERE
    auth = 'Logged In' AND 
    length > 0
GROUP BY
    start_time
;
"""

cursor.execute(drop_time_table)
cursor.execute(create_time_table)
cursor.execute(insert_time_table)

get_df_from_sql("SELECT * FROM time;")
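
When comparing this table with the pandas version, it may help to keep in mind that the two tools number weekdays differently. The sketch below uses one hypothetical timestamp and the `'unixepoch'` modifier, chosen here on the assumption that it is available in older SQLite builds than `'auto'`.

```python
import sqlite3

import pandas as pd

ts_ms = 1541106106796  # hypothetical timestamp in milliseconds (2018-11-01, a Thursday)

demo_connection = sqlite3.connect(":memory:")
sqlite_start, sqlite_weekday = demo_connection.execute(
    "SELECT DATETIME(?, 'unixepoch'), strftime('%w', ?, 'unixepoch');",
    (ts_ms // 1000, ts_ms // 1000),
).fetchone()

pandas_weekday = pd.to_datetime(ts_ms, unit="ms").weekday()

print(sqlite_start)    # 2018-11-01 21:01:46
print(sqlite_weekday)  # '4', SQLite's %w counts from Sunday = 0
print(pandas_weekday)  # 3, pandas counts from Monday = 0
```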

In [None]:
# Insert users query
insert_users_table = """
INSERT INTO
    users
SELECT
    user_id,
    first_name,
    last_name,
    gender,
    level
FROM (
    SELECT
        userId                                  AS user_id,
        firstName                               AS first_name,
        lastName                                AS last_name,
        gender,
        level,
        DATETIME(ts / 1000, 'auto')             AS time
    FROM
        log_data
    WHERE
        auth = 'Logged In' AND 
        length > 0
    )
GROUP BY
    user_id
HAVING 
    time = MAX(time)
;
"""

cursor.execute(drop_users_table)
cursor.execute(create_users_table)
cursor.execute(insert_users_table)

get_df_from_sql("SELECT * FROM users;")
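
The query above relies on how SQLite resolves non-aggregated columns in combination with `MAX()`. A more explicit variant, sketched here on a hypothetical table `demo_log` with made-up rows, ranks the rows per user with a window function and keeps the newest one.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE demo_log (userId INTEGER, level TEXT, ts INTEGER);")
demo_connection.executemany(
    "INSERT INTO demo_log VALUES (?, ?, ?);",
    [(15, "free", 1), (15, "paid", 2), (26, "free", 1)],
)

# Rank each user's rows from newest to oldest and keep only the newest one
latest_level_query = """
SELECT userId, level
FROM (
    SELECT
        userId,
        level,
        ROW_NUMBER() OVER (PARTITION BY userId ORDER BY ts DESC) AS rn
    FROM demo_log
)
WHERE rn = 1
ORDER BY userId;
"""
latest_levels = demo_connection.execute(latest_level_query).fetchall()
print(latest_levels)
```

This form also carries over to other SQL engines that support window functions.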

In [None]:
# Insert artists query
insert_artists_table = """
INSERT INTO
    artists
SELECT 
    ROW_NUMBER() OVER ()                        AS artist_id,
    name, 
    location, 
    latitude, 
    longitude
FROM 
    (
        SELECT DISTINCT 
            artist                              AS name
        FROM
            log_data
        WHERE
            log_data.auth = 'Logged In' AND 
            log_data.length > 0
    )
LEFT JOIN 
    (
        SELECT DISTINCT
            artist_name,
            artist_location                     AS location,
            artist_latitude                     AS latitude,
            artist_longitude                    AS longitude
        FROM
            song_data
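        GROUP BY
            artist_name
        -- NOTE: this ORDER BY only sorts the grouped result; it does not
        -- control which row SQLite keeps per artist_name
        ORDER BY
            location DESC,
            latitude DESC,
            longitude DESC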
    ) 
ON 
    name = artist_name
"""

cursor.execute(drop_artists_table)
cursor.execute(create_artists_table)
cursor.execute(insert_artists_table)

get_df_from_sql("SELECT * FROM artists;")
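
`ROW_NUMBER() OVER ()` numbers the rows in whatever order the engine happens to produce them, so the generated `artist_id` values may differ between runs. If reproducible ids are wanted, one option is to give the window an explicit ordering. A minimal sketch with a hypothetical `plays` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (artist TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?)",
    [("Muse",), ("Adele",), ("Muse",), ("Beck",)],
)

artists = conn.execute(
    """
    SELECT
        ROW_NUMBER() OVER (ORDER BY name) AS artist_id,
        name
    FROM (SELECT DISTINCT artist AS name FROM plays)
    ORDER BY artist_id
    """
).fetchall()
conn.close()

print(artists)  # [(1, 'Adele'), (2, 'Beck'), (3, 'Muse')]
```

The ids still shift when new artists are added, so this only makes repeated runs over the same data reproducible.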

In [None]:
# Insert songs query
insert_songs_table = """
INSERT INTO
    songs
SELECT
    ROW_NUMBER() OVER ()                        AS song_id,
    first_part.title,
    first_part.artist_id,
    second_part.year,
    second_part.duration
FROM 
(
    (
        (
            SELECT
                song AS title,
                artist
            FROM
                log_data
            WHERE
                auth = 'Logged In' AND
                length > 0
            GROUP BY
                title,
                artist
            ORDER BY
                title
        )
        LEFT JOIN (
            SELECT
                name,
                artist_id
            FROM
                artists
        )
        ON
            artist = name
    ) AS first_part
    LEFT JOIN (
        SELECT
            title,
            artist_name,
            year,
            duration
        FROM
            song_data
        WHERE
            year > 0
        GROUP BY
            title,
            artist_name
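        -- Keep the longest version per (title, artist_name); relies on
        -- SQLite taking the bare columns from the row holding MAX(duration)
        HAVING
            duration = MAX(duration)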
    ) AS second_part
    ON
        first_part.title = second_part.title AND
        first_part.artist = second_part.artist_name
)
"""

cursor.execute(drop_songs_table)
cursor.execute(create_songs_table)
cursor.execute(insert_songs_table)

get_df_from_sql("SELECT * FROM songs;")
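
Because the songs are taken from the logs and only enriched via `LEFT JOIN`, songs without a match in the song data keep `NULL` for `year` and `duration`. A quick way to see how many played songs actually receive metadata is to compare `COUNT(*)` with a count over a column of the joined table. The sketch below uses two small hypothetical tables, `played` and `metadata`, rather than the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE played (title TEXT, artist TEXT);
    CREATE TABLE metadata (title TEXT, artist_name TEXT, year INTEGER, duration REAL);
    INSERT INTO played VALUES ('Song A', 'Artist 1'), ('Song B', 'Artist 2');
    INSERT INTO metadata VALUES ('Song A', 'Artist 1', 2004, 201.5);
    """
)

# COUNT(metadata.title) skips the NULLs produced for unmatched rows.
total, with_metadata = conn.execute(
    """
    SELECT COUNT(*), COUNT(metadata.title)
    FROM played
    LEFT JOIN metadata
        ON played.title = metadata.title
        AND played.artist = metadata.artist_name
    """
).fetchone()
conn.close()

print(f"{with_metadata} of {total} played songs have metadata")
```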

In [None]:
# Preview the rows that will be inserted into songplays
query = """
SELECT
    raw_log_data.session_id,
    raw_log_data.item_in_session,
    raw_log_data.start_time,
    raw_log_data.artist,
    raw_artist_data.artist_id,
    raw_log_data.song,
    raw_song_data.song_id,
    raw_log_data.user_id,
    raw_log_data.level,
    raw_log_data.location,
    raw_log_data.user_agent
FROM
    (   
        SELECT
            sessionId                           AS session_id,
            itemInSession                       AS item_in_session,
            -- Convert epoch milliseconds to match the format of time.start_time
            DATETIME(ts / 1000, 'auto')         AS start_time,
            artist,
            song,
            userId                              AS user_id,
            level,
            location,
            userAgent                           AS user_agent
        FROM
            log_data
        WHERE
            auth = 'Logged In' AND
            length > 0
    )                                           AS raw_log_data
JOIN
    (
        SELECT
            artist_id,
            name
        FROM
            artists
    )                                           AS raw_artist_data
ON
    raw_log_data.artist = raw_artist_data.name
JOIN
    (
        SELECT
            song_id,
            title,
            artist_id
        FROM
            songs
    )                                           AS raw_song_data
ON
    raw_artist_data.artist_id = raw_song_data.artist_id AND
    raw_log_data.song = raw_song_data.title
"""

get_df_from_sql(query)
    

In [None]:
# Insert songplays query
insert_songplays_table = """
INSERT INTO
    songplays
SELECT
    raw_log_data.session_id,
    raw_log_data.item_in_session,
    raw_log_data.start_time,
    raw_artist_data.artist_id,
    raw_song_data.song_id,
    raw_log_data.user_id,
    raw_log_data.level,
    raw_log_data.location,
    raw_log_data.user_agent
FROM
    (   
        SELECT
            sessionId                           AS session_id,
            itemInSession                       AS item_in_session,
            -- Convert epoch milliseconds to match the format of time.start_time
            DATETIME(ts / 1000, 'auto')         AS start_time,
            artist,
            song,
            userId                              AS user_id,
            level,
            location,
            userAgent                           AS user_agent
        FROM
            log_data
        WHERE
            auth = 'Logged In' AND
            length > 0
    )                                           AS raw_log_data
JOIN
    (
        SELECT
            artist_id,
            name
        FROM
            artists
    )                                           AS raw_artist_data
ON
    raw_log_data.artist = raw_artist_data.name
JOIN
    (
        SELECT
            song_id,
            title,
            artist_id
        FROM
            songs
    )                                           AS raw_song_data
ON
    raw_artist_data.artist_id = raw_song_data.artist_id AND
    raw_log_data.song = raw_song_data.title
"""

cursor.execute(drop_songplays_table)
cursor.execute(create_songplays_table)
cursor.execute(insert_songplays_table)

get_df_from_sql("SELECT * FROM songplays;")
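
Since `start_time` links the fact table to the `time` dimension, both tables need to store it in the same representation. One way to check this is an anti-join that counts fact rows without a counterpart in `time`. The sketch below uses hypothetical toy tables, one row with a matching datetime string and one with a raw millisecond value, to show what such a check would report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time (start_time TEXT PRIMARY KEY);
    -- start_time has no declared type here, so values are stored as given
    CREATE TABLE songplays (songplay_id INTEGER, start_time);
    INSERT INTO time VALUES ('2018-11-15 00:30:26');
    INSERT INTO songplays VALUES
        (1, '2018-11-15 00:30:26'),  -- same format as in time
        (2, 1542241826796);          -- raw epoch milliseconds
    """
)

orphans = conn.execute(
    """
    SELECT COUNT(*)
    FROM songplays
    LEFT JOIN time
        ON songplays.start_time = time.start_time
    WHERE time.start_time IS NULL
    """
).fetchone()[0]
conn.close()

print(orphans)  # 1
```

A result of 0 on the real tables would indicate that every song play can be joined to its `time` row.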

In [None]:
connection.commit()

In [None]:
# Optional cleanup: uncomment to drop all tables again
#for query in drop_tables:
#    cursor.execute(query)
#    connection.commit()

In [None]:
connection.commit()
cursor.close()
connection.close()
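
The explicit commit and close calls can also be expressed with context managers, which keeps the cleanup in place even if a query raises. This is a sketch of that pattern on an in-memory database, not a change to the cells above. Note that using a `sqlite3` connection in a `with` block only manages the transaction (commit or rollback); closing the connection is handled here by `contextlib.closing`.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    # Commits if the block succeeds, rolls back if it raises.
    with conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# At this point the connection has been closed by closing().
print(count)  # 1
```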