# Data Exploration for Sparkify Data Model  
In order to come up with a proper model, I downloaded the (currently relatively small) dataset, stored it as CSV files and did some exploration.  

The targeted star schema is as follows:

**Fact Table: `songplays`**
| Column |
| ------ |
| session_id |
| songplay_id |
| start_time |
| artist_id |
| song_id |
| user_id |
| level |
| location |
| user_agent |

**Dimension Table: `time`**
| Column |
| ---------- |
| start_time |
| year |
| month |
| day |
| hour |
| week |
| weekday |

**Dimension Table: `artists`**
| Column |
| ---------- |
| artist_id |
| name |
| location |
| latitude |
| longitude |

**Dimension Table: `songs`**
| Column |
| ---------- |
| song_id |
| title |
| artist_id |
| year |
| duration |

**Dimension Table: `users`**
| Column |
| ---------- |
| user_id |
| first_name |
| last_name |
| gender |
| level |

The source data is stored in S3 buckets. The log data is stored in `s3://udacity-dend/log_data` and the song data is stored in `s3://udacity-dend/song_data`. Both are stored in JSON format. The log data contains information about user activity in the Sparkify app. The song data contains additional information about the songs that are available in the Sparkify app.  
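
To make the target schema above more tangible, here is a minimal sketch of two of its tables as SQLite DDL. The column types, the `NOT NULL` constraints, the foreign key and the composite primary key on `session_id` and `songplay_id` are my assumptions for illustration only; the schema itself only fixes the column names.

```python
import sqlite3

# Types, constraints and keys below are assumptions, not part of the schema above.
DDL = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    gender     TEXT,
    level      TEXT
);
CREATE TABLE songplays (
    session_id  INTEGER NOT NULL,
    songplay_id INTEGER NOT NULL,
    start_time  TEXT NOT NULL,
    artist_id   TEXT,
    song_id     TEXT,
    user_id     INTEGER NOT NULL REFERENCES users (user_id),
    level       TEXT,
    location    TEXT,
    user_agent  TEXT,
    PRIMARY KEY (session_id, songplay_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# List the columns of the fact table to confirm the layout.
songplays_columns = [row[1] for row in conn.execute("PRAGMA table_info(songplays)")]
print(songplays_columns)
```

An in-memory database is used here so that the sketch leaves nothing behind on disk.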

## Main Observations and Findings  

### Structure of the `log_data` files

The `log_data` files contain 8056 entries with the following fields:  

- Fields for later direct use:
    - Identifiers / keys:
        - `sessionId`: Session ID as an integer
        - `itemInSession`: Item in session as an integer
    - Timestamp:
        - `ts`: Timestamp as a long integer, the number of milliseconds since 1970-01-01 (Unix epoch)
    - Artist:
        - `artist`: Name of the Artist as a string
    - Song:
        - `song`: Song title as a string
    - User:
        - `userId`: User ID as a string
        - `firstName`: First name of the user as a string
        - `lastName`: Last name of the user as a string
        - `gender`: Gender of the user as a string being either "M" or "F"
        - `level`: Level of the user as a string being either "free" or "paid"
    - Other usage data:
        - `location`: Location of the user as a string
        - `userAgent`: User agent (browser) as a string
- Fields for later pre-processing:  
    - `auth`: Authentication status as a string being either "Logged In" or "Logged Out"  
    - `length`: Length of the playing of the songs as a float  
- Other fields not used later:  
    - `method`: Method as a string being either "GET" or "PUT"  
    - `page`: Page as a string  
    - `registration`: Registration as a float  
    - `status`: Status message as an integer  
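
As a small illustration of the `ts` field, the sketch below converts a millisecond timestamp into a UTC datetime and derives the parts needed for the `time` dimension. It only uses the standard library, and the helper name `ts_to_datetime` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ts_to_datetime(ts_ms: int) -> datetime:
    """Convert milliseconds since 1970-01-01 UTC into an aware datetime."""
    return EPOCH + timedelta(milliseconds=ts_ms)


# A timestamp value as it appears in the log_data sample.
start_time = ts_to_datetime(1541105830796)

time_parts = {
    "year": start_time.year,
    "month": start_time.month,
    "day": start_time.day,
    "hour": start_time.hour,
    "week": start_time.isocalendar()[1],  # ISO week number
    "weekday": start_time.weekday(),      # Monday=0 ... Sunday=6
}
print(start_time.isoformat(), time_parts)
```

Adding a `timedelta` to the epoch keeps the arithmetic in integers, which avoids rounding issues with the millisecond part.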

### Key findings regarding the `log_data` files:
- `sessionId` and `itemInSession` in combination can be used as the primary key for the fact table.
- Relevant facts such as song and artist information are only given when
    - the user is not logged out, and
    - the length of the play is not zero.  
    This means we should pre-filter the data accordingly.
- With these filters applied, no values are missing in the fields we need.
- The combination of `userId`, `firstName`, `lastName`, `gender` and `level` is not unique as users change their `level` over time. As a consequence, I've decided to use `userId` as primary key for the users dimension table taking the latest available `level` into account. However, the `level` in the fact table reflects the `level` at the time of the song play.  
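
The two findings about filtering and the latest `level` can be sketched on a handful of made-up events. The values below are invented for illustration; only the column names come from the log data.

```python
import pandas as pd

# Made-up events that mimic the relevant log_data columns.
events = pd.DataFrame(
    {
        "userId": [15, 8, 15, 15, None],
        "level": ["free", "free", "paid", "paid", "free"],
        "auth": ["Logged In", "Logged In", "Logged In", "Logged In", "Logged Out"],
        "length": [200.0, 246.3, 180.0, None, None],
        "ts": [1, 2, 3, 4, 5],
    }
)

# Keep only song plays: the user is not logged out and the play length is positive.
plays = events[(events["auth"] != "Logged Out") & (events["length"] > 0)].copy()
plays["userId"] = plays["userId"].astype(int)

# One row per user, carrying the level of the most recent play.
latest_level = (
    plays.sort_values("ts")
    .drop_duplicates(subset="userId", keep="last")
    .set_index("userId")["level"]
)
print(latest_level.to_dict())
```

The fact table would keep the `level` of each individual row in `plays`, while the users dimension would take the value from `latest_level`.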

### Structure of the `song_data` files

The `song_data` files contain 14896 entries with the following fields:

- Song related fields:
    - `song_id`: Song ID as a string with 18 characters
    - `title`: Song title as a string
    - `year`: Year of the song as an integer with, in general, four digits, but some being 0, meaning the year is unknown / missing
    - `duration`: Duration of the song as a float
- Artist related fields:
    - `artist_id`: Artist ID as a string with 18 characters
    - `artist_name`: Name of the artist as a string
    - `artist_location`: Location of the artist as a string with many missing values
    - `artist_latitude`: Latitude of the artist as a float with many missing values
    - `artist_longitude`: Longitude of the artist as a float with many missing values

### Key findings regarding the `song_data` files:

- One may be tempted to build the dimension tables `songs` and `artists` from this data. However, this would be a bad idea, because the `song_data` files match the `log_data` files poorly: many songs and artists in the `log_data` files are not available in the `song_data` files.
- This means we use the details in the `song_data` files only to enrich the data in the `log_data` files where possible.
- In addition, `artist_id`, `artist_name`, `artist_location`, `artist_latitude` and `artist_longitude` do not form a unique set. Some artists have several locations and inconsistent geographical coordinates. We should be aware of this when using the data to enrich the `log_data` files. Here, I've decided to use the first available entry for each artist.
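
Taking "the first available entry for each artist" could look roughly like the sketch below. It uses `drop_duplicates` with `keep="first"` on a few artist rows of the kind found in the song data; this is one possible reading of the rule, not necessarily the exact implementation used later.

```python
import pandas as pd

# Two rows for the same artist_id with different locations, plus one other artist.
artist_rows = pd.DataFrame(
    {
        "artist_id": ["ARIN7RA1187FB4CE8B", "ARIN7RA1187FB4CE8B", "ARR5XTD1187FB3C6BE"],
        "artist_name": ["59 Times the Pain", "59 Times the Pain", "Aimee Mann"],
        "artist_location": [None, "Simi Valley, California", "Los Angeles, CA"],
    }
)

# keep="first" retains the first row per artist_id in the current row order.
artists = (
    artist_rows
    .drop_duplicates(subset="artist_id", keep="first")
    .reset_index(drop=True)
)
print(artists)
```

Note that the retained row for the first artist has no location here, simply because it comes first. Sorting before de-duplicating would change which row wins.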

## Overall Conclusion for the Data Model

- Due to consistency issues, our main source is the `log_data` files.
- The `song_data` files are only used to enrich the data in the `log_data` files where possible.
- The `log_data` files are pre-filtered to only contain entries where the user is logged in and the length of the playing is not zero.
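
The enrichment step described above can be sketched as a left join from the song plays to a de-duplicated song lookup. The song and artist IDs below are placeholders I made up, and the join on artist name and song title is an assumption about how the two sources are matched.

```python
import pandas as pd

# Song plays from the filtered log data (artist and song as they appear in the logs).
plays = pd.DataFrame(
    {"artist": ["Des'ree", "Mr Oizo"], "song": ["You Gotta Be", "Flat 55"]}
)

# Hypothetical song_data excerpt; the IDs are placeholders, not real IDs.
songs = pd.DataFrame(
    {
        "artist_name": ["Des'ree", "Des'ree"],
        "title": ["You Gotta Be", "You Gotta Be"],
        "song_id": ["SO_PLACEHOLDER_1", "SO_PLACEHOLDER_2"],
        "artist_id": ["AR_PLACEHOLDER_1", "AR_PLACEHOLDER_1"],
    }
)

# De-duplicate the lookup on the join key so the left join cannot multiply plays.
lookup = songs.drop_duplicates(subset=["artist_name", "title"], keep="first")

enriched = plays.merge(
    lookup,
    how="left",
    left_on=["artist", "song"],
    right_on=["artist_name", "title"],
).drop(columns=["artist_name", "title"])
print(enriched)
```

Plays without a match keep missing values in `song_id` and `artist_id`, so no song play is lost by the enrichment.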

## Exploration in Detail

For the exploration let's use pandas:

In [1]:
import csv
import sqlite3

import pandas as pd


# Set max rows to 30
pd.set_option('display.max_rows', 30)

In the center of the data model is the fact table `songplays` containing information about the user activity. So let's start here:

### Exploration of `log_data` files

In [2]:
all_log_data = pd.read_csv('./data/project/log_data.csv')
all_log_data.head()

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919000000.0,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39.0
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540345000000.0,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8.0
2,Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540345000000.0,139,You Gotta Be,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8.0
3,,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1540345000000.0,139,,200,1541106132796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8.0
4,Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540345000000.0,139,Flat 55,200,1541106352796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8.0


In [3]:
all_log_data.describe(include="all").T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
artist,6820.0,3148.0,Coldplay,58.0,,,,,,,
auth,8056.0,2.0,Logged In,7770.0,,,,,,,
firstName,7770.0,85.0,Chloe,791.0,,,,,,,
gender,7770.0,2.0,F,5482.0,,,,,,,
itemInSession,8056.0,,,,21.198858,23.440699,0.0,3.0,13.0,33.0,127.0
lastName,7770.0,87.0,Cuevas,772.0,,,,,,,
length,6820.0,,,,247.032221,102.975921,15.85587,197.321998,232.972605,274.121992,2594.87302
level,8056.0,2.0,paid,6291.0,,,,,,,
location,7770.0,63.0,"San Francisco-Oakland-Hayward, CA",776.0,,,,,,,
method,8056.0,2.0,PUT,7021.0,,,,,,,


**QUESTION** - Are `sessionId` and `itemInSession` applicable primary keys for this table?

In [4]:
all_log_data[["sessionId", "itemInSession"]].drop_duplicates().shape[0] == all_log_data.shape[0]

True

**QUESTION** - What are the min and max lengths of the strings in the table?

In [5]:
for column in all_log_data.select_dtypes(include=['object']).columns:
    # Compute the string lengths once per column, ignoring missing values
    lengths = all_log_data[column].dropna().map(lambda x: len(str(x)))
    print(column, lengths.min(), lengths.max())

artist 2 89
auth 9 10
firstName 3 10
gender 1 1
lastName 3 9
level 4 4
location 10 46
method 3 3
page 4 16
song 1 151
userAgent 63 139


In [6]:
log_columns_required = [
    # Data for primary keys
    "sessionId", 
    "itemInSession", 
    # Data for timestamp
    "ts",
    # User related data
    "userId",
    "firstName", 
    "lastName",
    "gender",
    "level", # also usage related
    # Usage related data
    "location",
    "userAgent",
    # Song related data
    "song", 
    # Artist related data
    "artist",
]

log_data = all_log_data[log_columns_required]
log_data.head()

Unnamed: 0,sessionId,itemInSession,ts,userId,firstName,lastName,gender,level,location,userAgent,song,artist
0,38,0,1541105830796,39.0,Walter,Frye,M,free,"San Francisco-Oakland-Hayward, CA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",,
1,139,0,1541106106796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",,
2,139,1,1541106106796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",You Gotta Be,Des'ree
3,139,2,1541106132796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",,
4,139,3,1541106352796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Flat 55,Mr Oizo


In [7]:
log_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
sessionId,8056.0,,,,598.167577,285.313094,3.0,372.0,605.0,834.0,1114.0
itemInSession,8056.0,,,,21.198858,23.440699,0.0,3.0,13.0,33.0,127.0
ts,8056.0,,,,1542486261744.982,700316630.206308,1541105830796.0,1542022870546.0,1542467316296.0,1543063920796.0,1543607664796.0
userId,7770.0,,,,54.463964,28.168504,2.0,29.0,49.0,80.0,101.0
firstName,7770.0,85.0,Chloe,791.0,,,,,,,
lastName,7770.0,87.0,Cuevas,772.0,,,,,,,
gender,7770.0,2.0,F,5482.0,,,,,,,
level,8056.0,2.0,paid,6291.0,,,,,,,
location,7770.0,63.0,"San Francisco-Oakland-Hayward, CA",776.0,,,,,,,
userAgent,7770.0,40.0,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",1098.0,,,,,,,


**QUESTION** - Can we use `sessionId` and `itemInSession` as primary key for the fact table?

In [8]:
log_data[['sessionId', 'itemInSession']].drop_duplicates().shape[0] == log_data.shape[0]

True
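
When such a check returns `False`, it helps to see the offending rows. The following sketch uses `duplicated` with `keep=False` on made-up data to list every row that violates a candidate composite key.

```python
import pandas as pd

# Made-up events; the last two rows share the same (sessionId, itemInSession).
events = pd.DataFrame(
    {"sessionId": [38, 139, 139, 139], "itemInSession": [0, 0, 1, 1]}
)
key = ["sessionId", "itemInSession"]

# keep=False marks all members of a duplicate group, not only the repeats.
violations = events[events.duplicated(subset=key, keep=False)]
is_unique_key = violations.empty
print(is_unique_key)
print(violations)
```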

**QUESTION** - What's the structure of missing data?

- Subquestion: Are `userId`, `firstName`, `lastName`, `gender`, `location` and `userAgent` always missing for the same rows?

In [9]:
log_data[["userId", "firstName", "lastName", "gender", "location", "userAgent"]].dropna().shape[0] == log_data["userId"].dropna().shape[0]

True
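
A slightly stricter way to ask the same question is to test each row directly: either all of the columns are missing or none of them is. The sketch below shows this on made-up rows with a reduced column set.

```python
import pandas as pd

user_cols = ["userId", "firstName", "location"]

# Made-up rows: the second one has no user information at all.
events = pd.DataFrame(
    {
        "userId": [39.0, None, 8.0],
        "firstName": ["Walter", None, "Kaylee"],
        "location": [
            "San Francisco-Oakland-Hayward, CA",
            None,
            "Phoenix-Mesa-Scottsdale, AZ",
        ],
    }
)

missing = events[user_cols].isna()

# A row passes if all of its user columns are missing or none of them is.
all_or_none = missing.all(axis=1) | ~missing.any(axis=1)
missing_together = bool(all_or_none.all())
print(missing_together)
```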

- Subquestion: Is there something special about the missing `userId` data?

In [10]:
all_log_data[all_log_data["userId"].isna()].drop(["userId", "firstName", "lastName", "gender", "location", "userAgent"], axis=1).head()

Unnamed: 0,artist,auth,itemInSession,length,level,method,page,registration,sessionId,song,status,ts
186,,Logged Out,0,,free,PUT,Login,,52,,307,1541207073796
192,,Logged Out,0,,free,GET,Home,,18,,200,1541239749796
308,,Logged Out,3,,paid,GET,Home,,128,,200,1541310732796
309,,Logged Out,4,,paid,PUT,Login,,128,,307,1541310733796
387,,Logged Out,0,,paid,GET,Home,,175,,200,1541329386796


In [11]:
all_log_data[all_log_data["userId"].isna()].drop(["userId", "firstName", "lastName", "gender", "location", "userAgent"], axis=1).describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
artist,0.0,0.0,,,,,,,,,
auth,286.0,1.0,Logged Out,286.0,,,,,,,
itemInSession,286.0,,,,12.667832,22.472708,0.0,0.0,3.0,12.75,121.0
length,0.0,,,,,,,,,,
level,286.0,2.0,paid,198.0,,,,,,,
method,286.0,2.0,GET,194.0,,,,,,,
page,286.0,4.0,Home,171.0,,,,,,,
registration,0.0,,,,,,,,,,
sessionId,286.0,,,,631.77972,294.648784,18.0,437.0,666.0,871.0,1097.0
song,0.0,0.0,,,,,,,,,


- Subquestion: Do all events with a non-missing `userId` have `auth` == "Logged In"?

In [12]:
all_log_data.loc[all_log_data["userId"].notna(), "auth"].unique().tolist() == ["Logged In"]

True

**FINDING** - Rows in log_data with `auth` == "Logged Out" can be dropped, as they contain no user, song or artist information for the data model.

**QUESTION** - Are `song` and `artist` always missing for the same rows?

In [13]:
log_data[["song", "artist"]].dropna().shape[0] == log_data["song"].dropna().shape[0]

True

**QUESTION** - Does a non-missing `song` ever occur when `userId` is missing?

In [14]:
log_data.loc[log_data["song"].notna() & log_data["userId"].isna()].shape[0] > 0

False

**QUESTION** - Is there something special about the missing `song` and `artist` data?

In [15]:
all_log_data.loc[all_log_data["song"].isna()].drop(["song", "artist"], axis=1).describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
auth,1236.0,2.0,Logged In,950.0,,,,,,,
firstName,950.0,76.0,Chloe,88.0,,,,,,,
gender,950.0,2.0,F,595.0,,,,,,,
itemInSession,1236.0,,,,12.578479,21.471898,0.0,0.0,2.0,16.0,127.0
lastName,950.0,79.0,Cuevas,83.0,,,,,,,
length,0.0,,,,,,,,,,
level,1236.0,2.0,paid,700.0,,,,,,,
location,950.0,60.0,"San Francisco-Oakland-Hayward, CA",85.0,,,,,,,
method,1236.0,2.0,GET,1035.0,,,,,,,
page,1236.0,12.0,Home,806.0,,,,,,,


**QUESTION** - Is `length` always non-zero when `song` is not missing?

In [16]:
not (all_log_data.loc[all_log_data["song"].notna(), "length"] == 0).any()

True

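**FINDING** - Every row with a `song` has a non-zero `length`, while rows without a `song` have no `length` at all. Rows without a positive `length` can therefore be dropped, as they don't contain any useful information for the data model.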

So, let's filter the data accordingly:

In [17]:
filtered_log_data = all_log_data.loc[(all_log_data["auth"] != "Logged Out") & (all_log_data["length"] > 0), log_columns_required]
filtered_log_data.head()

Unnamed: 0,sessionId,itemInSession,ts,userId,firstName,lastName,gender,level,location,userAgent,song,artist
2,139,1,1541106106796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",You Gotta Be,Des'ree
4,139,3,1541106352796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Flat 55,Mr Oizo
5,139,4,1541106496796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Quem Quiser Encontrar O Amor,Tamba Trio
6,139,5,1541106673796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Eriatarka,The Mars Volta
7,139,6,1541107053796,8.0,Kaylee,Summers,F,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Becoming Insane,Infected Mushroom


In [18]:
filtered_log_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
sessionId,6820.0,,,,599.181818,284.953333,3.0,374.0,605.0,834.0,1114.0
itemInSession,6820.0,,,,22.761144,23.444636,0.0,4.0,15.0,35.0,126.0
ts,6820.0,,,,1542485482323.2727,700323587.164547,1541106106796.0,1542032421796.0,1542464933296.0,1543063552796.0,1543607664796.0
userId,6820.0,,,,54.681232,28.162734,2.0,29.0,49.0,80.0,101.0
firstName,6820.0,84.0,Chloe,703.0,,,,,,,
lastName,6820.0,86.0,Cuevas,689.0,,,,,,,
gender,6820.0,2.0,F,4887.0,,,,,,,
level,6820.0,2.0,paid,5591.0,,,,,,,
location,6820.0,63.0,"San Francisco-Oakland-Hayward, CA",691.0,,,,,,,
userAgent,6820.0,40.0,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",971.0,,,,,,,


**QUESTION** - At what times (according to `ts`) were the songs played? Is this consistent with the log_data file organisation, which covers November 2018?

In [19]:
pd.to_datetime(filtered_log_data["ts"], unit='ms').describe(datetime_is_numeric=True)

count                             6820
mean     2018-11-17 20:11:22.323272448
min         2018-11-01 21:01:46.796000
25%         2018-11-12 14:20:21.796000
50%         2018-11-17 14:28:53.296000
75%         2018-11-24 12:45:52.796000
max         2018-11-30 19:54:24.796000
Name: ts, dtype: object

**QUESTION** - In which locations were the songs played?

In [20]:
filtered_log_data["location"].value_counts()[:25]

San Francisco-Oakland-Hayward, CA          691
Portland-South Portland, ME                665
Lansing-East Lansing, MI                   557
Chicago-Naperville-Elgin, IL-IN-WI         475
Atlanta-Sandy Springs-Roswell, GA          456
Waterloo-Cedar Falls, IA                   397
Lake Havasu City-Kingman, AZ               321
Tampa-St. Petersburg-Clearwater, FL        307
San Jose-Sunnyvale-Santa Clara, CA         292
Sacramento--Roseville--Arden-Arcade, CA    270
New York-Newark-Jersey City, NY-NJ-PA      262
Janesville-Beloit, WI                      248
Birmingham-Hoover, AL                      223
Winston-Salem, NC                          213
Red Bluff, CA                              201
Marinette, WI-MI                           169
Augusta-Richmond County, GA-SC             140
Detroit-Warren-Dearborn, MI                 76
Houston-The Woodlands-Sugar Land, TX        66
New Orleans-Metairie, LA                    55
San Antonio-New Braunfels, TX               52
New Haven-Mil

**QUESTION** - Which user agents were used to play the songs?

In [21]:
filtered_log_data["userAgent"].value_counts()[:25]

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"                     971
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2"                        708
Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0                                                                              696
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"      577
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"                                     573
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0                                                              443
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"       

#### Closer look at the user related data in `log_data`

Now, let's have a closer look at the user related data contained in the log data, namely: 
- `userId`, 
- `firstName`, 
- `lastName`, 
- `gender`, and 
- `level`

In [22]:
user_data = filtered_log_data[["userId", "firstName", "lastName", "gender", "level"]].drop_duplicates().sort_values("userId")
user_data["userId"] = user_data["userId"].astype(int).astype(str)
user_data.head()


Unnamed: 0,userId,firstName,lastName,gender,level
944,2,Jizelle,Benjamin,F,free
183,3,Isaac,Valdez,M,free
2619,4,Alivia,Terrell,F,free
6601,5,Elijah,Davis,M,free
295,6,Cecilia,Owens,F,free


In [23]:
user_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq
userId,104,96,16,2
firstName,104,84,Jayden,4
lastName,104,86,Jones,4
gender,104,2,F,60
level,104,2,free,82


**QUESTION** - Where do duplicates in user_data come from?

In [24]:
user_data[user_data.duplicated(subset=["userId"], keep=False)].sort_values(["userId", "level"])

Unnamed: 0,userId,firstName,lastName,gender,level
5235,15,Lily,Koch,F,free
25,15,Lily,Koch,F,paid
361,16,Rylan,George,M,free
2687,16,Rylan,George,M,paid
876,29,Jacqueline,Lynch,F,free
1310,29,Jacqueline,Lynch,F,paid
1459,36,Matthew,Jones,M,free
1668,36,Matthew,Jones,M,paid
91,49,Chloe,Cuevas,F,free
2763,49,Chloe,Cuevas,F,paid


**FINDING** - The duplicates in user_data arise because users can change their subscription level.
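
Users who changed their level can also be found without inspecting duplicates by eye. As a sketch, the snippet below counts the distinct levels per user on a few rows shaped like `user_data`.

```python
import pandas as pd

# A few rows shaped like user_data (userId stored as string).
user_rows = pd.DataFrame(
    {
        "userId": ["15", "15", "16", "16", "2"],
        "level": ["free", "paid", "free", "paid", "free"],
    }
)

# Count distinct levels per user; more than one means the user switched.
levels_per_user = user_rows.groupby("userId")["level"].nunique()
switchers = levels_per_user[levels_per_user > 1].index.tolist()
print(switchers)
```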

#### Closer look at the song and artist related data in `log_data`

**QUESTION** - What's the structure of song related data in the log_data, namely `song` and `artist`?

In [25]:
song_from_filtered_log_data = filtered_log_data.loc[:, ["song", "artist"]].sort_values(["song", "artist"]).drop_duplicates()
song_from_filtered_log_data.head()

Unnamed: 0,song,artist
6096,I Will Not Reap Destruction,We Came As Romans
5863,#40,DAVE MATTHEWS BAND
4947,'Round Midnight,Denise Jannah
1021,'Till I Collapse,Eminem / Nate Dogg
3552,(Hon Vill Ha) Puls,Gyllene Tider


In [26]:
song_from_filtered_log_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq
song,5295,5189,Hello,6
artist,5295,3148,Coldplay,24


**QUESTION** - What is the structure of the duplicate songs in the log_data?

In [27]:
song_from_filtered_log_data.loc[song_from_filtered_log_data["song"].duplicated(keep=False)]

Unnamed: 0,song,artist
3060,Addicted,Amy Winehouse
5353,Addicted,Enrique Iglesias
7520,All My Friends,Amos Lee
5376,All My Friends,LCD Soundsystem
512,All My Life,Foo Fighters
...,...,...
751,You Know I'm No Good,Amy Winehouse
89,You Know I'm No Good,Arctic Monkeys
907,You're Not Alone,ATB
7321,You're Not Alone,Olive


**FINDING** - To identify the song in the log data, we need both `song` and `artist` from the log_data.

**QUESTION** - Does the filtering return complete data for the required fields?

In [28]:
(all_log_data.query("auth == 'Logged In' and length > 0")[["ts", "userId", "artist", "song"]].count() == 6820).all()

True

### Exploration of `song_data` files

In [29]:
all_song_data = pd.read_csv('./data/project/song_data.csv')
all_song_data.head()

Unnamed: 0,artist_id,artist_latitude,artist_location,artist_longitude,artist_name,duration,num_songs,song_id,title,year
0,ARJNIUY12298900C91,,,,Adelitas Way,213.9424,1,SOBLFFE12AF72AA5BA,Scream,2009
1,AR73AIO1187B9AD57B,37.77916,"San Francisco, CA",-122.42005,Western Addiction,118.07302,1,SOQPWCR12A6D4FB2A3,A Poor Recipe For Civic Cohesion,2005
2,ARMJAGH1187FB546F3,35.14968,"Memphis, TN",-90.04892,The Box Tops,148.03546,1,SOCIWDW12A8C13D406,Soul Deep,1969
3,AR9Q9YC1187FB5609B,,New Jersey,,Quest_ Pup_ Kevo,252.94322,1,SOFRDWL12A58A7CEF7,Hit Da Scene,0
4,ARSVTNL1187B992A91,51.50632,"London, England",-0.12714,Jonathan King,129.85424,1,SOEKAZG12AB018837E,I'll Slap Your Face (Entertainment USA Theme),2001


In [30]:
all_song_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
artist_id,14896.0,9553.0,ARYPTWE1187FB49D64,9.0,,,,,,,
artist_latitude,5277.0,,,,39.560972,15.657061,-45.8745,34.52865,40.71455,50.07908,69.65102
artist_location,8201.0,2083.0,"London, England",245.0,,,,,,,
artist_longitude,5277.0,,,,-56.685017,55.173199,-155.43414,-93.19547,-75.69189,-1.97406,175.47131
artist_name,14896.0,9936.0,Badly Drawn Boy,9.0,,,,,,,
duration,14896.0,,,,246.779398,110.005727,6.37342,186.14159,231.00036,284.525262,2709.2371
num_songs,14896.0,,,,1.0,0.0,1.0,1.0,1.0,1.0,1.0
song_id,14896.0,14896.0,SOBLFFE12AF72AA5BA,1.0,,,,,,,
title,14896.0,14402.0,Intro,25.0,,,,,,,
year,14896.0,,,,1360.512285,932.689191,0.0,0.0,1997.0,2005.0,2010.0


**QUESTION** - Are artist_id and song_id applicable primary keys for this table?

In [31]:
all_song_data[["song_id", "artist_id"]].drop_duplicates().shape[0] == all_song_data.shape[0]

True

**QUESTION** - What are the min and max lengths of the strings in the song_data?

In [32]:
for column in all_song_data.select_dtypes("object"):
    # Compute the string lengths once per column, ignoring missing values
    lengths = all_song_data[column].dropna().map(lambda x: len(str(x)))
    print(column, lengths.min(), lengths.max())

artist_id 18 18
artist_location 1 176
artist_name 1 177
song_id 18 18
title 1 173


From the requirements of the project, we know that we might need the following columns:
- `song_id`
- `title`
- `year`
- `duration`
- `artist_id`
- `artist_name`
- `artist_location`
- `artist_latitude`
- `artist_longitude`

This means, we can drop `num_songs` from the table.

So let's narrow the dataset down a bit:

In [33]:
columns_required = ['song_id', 'title', 'year', 'duration', 'artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']
song_data = all_song_data[columns_required]
song_data.head()

Unnamed: 0,song_id,title,year,duration,artist_id,artist_name,artist_location,artist_latitude,artist_longitude
0,SOBLFFE12AF72AA5BA,Scream,2009,213.9424,ARJNIUY12298900C91,Adelitas Way,,,
1,SOQPWCR12A6D4FB2A3,A Poor Recipe For Civic Cohesion,2005,118.07302,AR73AIO1187B9AD57B,Western Addiction,"San Francisco, CA",37.77916,-122.42005
2,SOCIWDW12A8C13D406,Soul Deep,1969,148.03546,ARMJAGH1187FB546F3,The Box Tops,"Memphis, TN",35.14968,-90.04892
3,SOFRDWL12A58A7CEF7,Hit Da Scene,0,252.94322,AR9Q9YC1187FB5609B,Quest_ Pup_ Kevo,New Jersey,,
4,SOEKAZG12AB018837E,I'll Slap Your Face (Entertainment USA Theme),2001,129.85424,ARSVTNL1187B992A91,Jonathan King,"London, England",51.50632,-0.12714


In [34]:
song_data.describe(include='all').T.fillna("")

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
song_id,14896.0,14896.0,SOBLFFE12AF72AA5BA,1.0,,,,,,,
title,14896.0,14402.0,Intro,25.0,,,,,,,
year,14896.0,,,,1360.512285,932.689191,0.0,0.0,1997.0,2005.0,2010.0
duration,14896.0,,,,246.779398,110.005727,6.37342,186.14159,231.00036,284.525262,2709.2371
artist_id,14896.0,9553.0,ARYPTWE1187FB49D64,9.0,,,,,,,
artist_name,14896.0,9936.0,Badly Drawn Boy,9.0,,,,,,,
artist_location,8201.0,2083.0,"London, England",245.0,,,,,,,
artist_latitude,5277.0,,,,39.560972,15.657061,-45.8745,34.52865,40.71455,50.07908,69.65102
artist_longitude,5277.0,,,,-56.685017,55.173199,-155.43414,-93.19547,-75.69189,-1.97406,175.47131


**QUESTION** - Does each combination of title and artist_name map to a single song_id?

In [35]:
song_data[["title", "artist_name"]].drop_duplicates().shape[0] == song_data.shape[0]

False

**QUESTION** - What's the reason for the duplicates?

In [36]:
song_data.loc[song_data.duplicated(subset=["title", "artist_name"], keep=False)].sort_values(["title", "artist_name"])

Unnamed: 0,song_id,title,year,duration,artist_id,artist_name,artist_location,artist_latitude,artist_longitude
172,SODDZAD12A6701DC4C,Commercial Reign,1990,283.76771,AR9AM2N1187B9AD2F1,Inspiral Carpets,"Manchester, England",53.4796,-2.24881
6474,SOBHMQL12A67ADE30A,Commercial Reign,1990,283.32363,AR9AM2N1187B9AD2F1,Inspiral Carpets,"Manchester, England",53.4796,-2.24881
1130,SOOULII12AB0182A1B,Day And Night,0,511.42485,AR6AKW41187FB5B046,Sonic Division,,,
5628,SOKWBCJ12AB0182A08,Day And Night,0,513.67138,AR6AKW41187FB5B046,Sonic Division,,,
378,SOZUNHU12A8C137BB7,Moto Perpetuo_ Op. 11_ No. 2,2001,223.13751,ARKDO731187B98E21B,Béla Fleck,,,
11697,SOPPYXD12A8C1316E6,Moto Perpetuo_ Op. 11_ No. 2,2001,218.98404,ARKDO731187B98E21B,Béla Fleck,,,
6453,SOQGLWB12AF72A632B,The Earth Will Shake,2005,329.7171,ARIMZQZ1187B9AD541,Thrice,"Orange, CA",,
7539,SOMHGMP12A6D4F5904,The Earth Will Shake,2005,268.17261,ARIMZQZ1187B9AD541,Thrice,"Orange, CA",,
9641,SOKJUZQ12AB0185E37,When I Grow Up,2009,556.06812,ARUYVDC12086C11D5C,Fever Ray,Stockholm,,
12297,SOIXAJN12AB0183EE3,When I Grow Up,2009,335.5424,ARUYVDC12086C11D5C,Fever Ray,Stockholm,,


**FINDING** - The combination of title and artist_name does not identify a single song_id: the same title by the same artist appears under several song_ids with different durations.

#### Closer look at the artist related data in `song_data`

In [37]:
artists_from_song_data = song_data[["artist_id", "artist_name", "artist_location", "artist_latitude", "artist_longitude"]].drop_duplicates()
artists_from_song_data.head()

Unnamed: 0,artist_id,artist_name,artist_location,artist_latitude,artist_longitude
0,ARJNIUY12298900C91,Adelitas Way,,,
1,AR73AIO1187B9AD57B,Western Addiction,"San Francisco, CA",37.77916,-122.42005
2,ARMJAGH1187FB546F3,The Box Tops,"Memphis, TN",35.14968,-90.04892
3,AR9Q9YC1187FB5609B,Quest_ Pup_ Kevo,New Jersey,,
4,ARSVTNL1187B992A91,Jonathan King,"London, England",51.50632,-0.12714


**QUESTION** - Does each artist_id map to exactly one artist_name?

In [38]:
artists_from_song_data[["artist_id", "artist_name"]].drop_duplicates().shape[0] == artists_from_song_data["artist_id"].drop_duplicates().shape[0]

False

**QUESTION** - What is the reason for the non-unique artist_ids?

In [39]:
artists_from_song_data.loc[artists_from_song_data.duplicated(subset=["artist_name"], keep=False)].sort_values(["artist_name"]).iloc[:30]

Unnamed: 0,artist_id,artist_name,artist_location,artist_latitude,artist_longitude
14345,ARIN7RA1187FB4CE8B,59 Times the Pain,"Simi Valley, California",34.28946,-118.71766
2688,ARIN7RA1187FB4CE8B,59 Times the Pain,,,
10830,ARR5XTD1187FB3C6BE,Aimee Mann,"Los Angeles, CA",,
2816,ARR5XTD1187FB3C6BE,Aimee Mann,,,
4350,AROBSO71187B995AF0,Ali Farka Toure_ Toumani Diabate,Kanau,27.57452,78.30813
12118,AR3ZGUC1187FB57721,Ali Farka Toure_ Toumani Diabate,,,
2820,ARFMT4W1187FB42FA8,Alison Krauss,"Decatur, IL",,
10320,ARF2SVO1187FB53E8F,Alison Krauss,,,
1173,ARF2SVO1187FB53E8F,Alison Krauss / Union Station,,,
14678,ARF2SVO1187FB53E8F,Alison Krauss / Union Station,"Decatur, IL",,


**FINDING** - The artist data is ambiguous: the same artist_id appears with different names, locations and/or geographical coordinates, and the same artist_name can carry different artist_ids.

### Shared Information between `log_data` and `song_data`

In [40]:
artist_song_from_log = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].drop_duplicates().sort_values(["artist", "song"]).reset_index(drop=True).reset_index().rename(columns={"index": "from_log"})
artist_song_from_log

Unnamed: 0,from_log,artist,song
0,0,!!!,Bend Over Beethoven
1,1,'N Sync/Phil Collins,Trashin' The Camp (Phil And 'N Sync Version)
2,2,+ / - {Plus/Minus},The Queen of Nothing
3,3,+44,Make You Smile
4,4,1 Mile North,Black Lines
...,...,...,...
5290,5290,tobyMac,Burn For You
5291,5291,zebrahead,Wake Me Up
5292,5292,zebrahead,With Legs Like That
5293,5293,ÃÂngeles del Infierno,Si TÃÂº No EstÃÂ¡s AquÃÂ­


In [41]:
artist_song_from_songs = all_song_data[["artist_name", "title"]].drop_duplicates().sort_values(["artist_name", "title"]).reset_index(drop=True).reset_index().rename(columns={"artist_name": "artist", "title": "song", "index": "from_songs"})
artist_song_from_songs

Unnamed: 0,from_songs,artist,song
0,0,!!!,Myth Takes
1,1,& And Oceans,Voyage: Lost Between Horizons: Eaten By The Di...
2,2,'68 Comeback,A Little Bitch (And A Little Bitch Better)
3,3,'68 Comeback,The Wall
4,4,'t Hof Van Commerce,Chance
...,...,...,...
14886,14886,µ-ziq,Something Else
14887,14887,Ágata,Conselho de mãe
14888,14888,Åge Aleksandersen,Fremmed Fugl
14889,14889,Étienne Daho,Le plaisir de perdre (live 1989)


In [42]:
artist_song_from_log.merge(artist_song_from_songs, on=["artist", "song"], how="right")

Unnamed: 0,from_log,artist,song,from_songs
0,,!!!,Myth Takes,0
1,,& And Oceans,Voyage: Lost Between Horizons: Eaten By The Di...,1
2,,'68 Comeback,A Little Bitch (And A Little Bitch Better),2
3,,'68 Comeback,The Wall,3
4,,'t Hof Van Commerce,Chance,4
...,...,...,...,...
14886,,µ-ziq,Something Else,14886
14887,,Ágata,Conselho de mãe,14887
14888,,Åge Aleksandersen,Fremmed Fugl,14888
14889,,Étienne Daho,Le plaisir de perdre (live 1989),14889


**QUESTION** - How many songs are in the log_data that are also in the song_data?

In [43]:
artist_song_from_log.merge(artist_song_from_songs, on=["artist", "song"], how="right")["from_log"].count()

217

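**FINDING** - The overlap between the log_data and the song_data is low: only 217 of the 5295 distinct artist/song pairs in the filtered log_data (about 4%) also occur in the song_data.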
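
The overlap can also be read off a single merge by using the `indicator` option of `merge`, which labels each row as `both` or `left_only` (among others). The sketch below uses made-up pairs; whether these pairs actually occur in the song data is not implied.

```python
import pandas as pd

# Made-up artist/song pairs standing in for the two sources.
log_pairs = pd.DataFrame(
    {
        "artist": ["!!!", "Des'ree", "Mr Oizo"],
        "song": ["Bend Over Beethoven", "You Gotta Be", "Flat 55"],
    }
)
song_pairs = pd.DataFrame(
    {"artist": ["!!!", "Des'ree"], "song": ["Myth Takes", "You Gotta Be"]}
)

# indicator=True adds a "_merge" column telling where each row was found.
merged = log_pairs.merge(song_pairs, on=["artist", "song"], how="left", indicator=True)

matched = int((merged["_merge"] == "both").sum())
share = matched / len(log_pairs)
print(matched, round(share, 3))
```

Using a left join from the log side gives the share of played songs that can be enriched, which is the number that matters for the fact table.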

## Modeling Using Pandas

In [44]:
# Creating users_df
users_df = pd.DataFrame(
    columns=[
        "user_id",
        "first_name",
        "last_name",
        "gender",
        "level",
    ],
)

# Filling users_df
users_df = (
    all_log_data
    .query("(auth != 'Logged Out') & (length > 0)")
    [["userId", "firstName", "lastName", "gender", "level", "ts"]]
    .rename(columns={"userId": "user_id", "firstName": "first_name", "lastName": "last_name"})
    .sort_values("ts")
    .drop_duplicates(subset=["user_id", "first_name", "last_name", "gender"], keep="last")
    .drop("ts", axis=1)
    .reset_index(drop=True)
)

users_df["user_id"] = users_df["user_id"].astype(int)

# Showing users_df
users_df

Unnamed: 0,user_id,first_name,last_name,gender,level
0,3,Isaac,Valdez,M,free
1,84,Shakira,Hunt,F,free
2,20,Aiden,Ramirez,M,paid
3,27,Carlos,Carter,M,free
4,59,Lily,Cooper,F,free
...,...,...,...,...,...
91,33,Bronson,Harris,M,free
92,91,Jayden,Bell,M,free
93,49,Chloe,Cuevas,F,paid
94,16,Rylan,George,M,paid


In [45]:
# Creating time_df
time_df = pd.DataFrame(
    columns=[
        
        "start_time",
        "year",
        "month",
        "day",
        "hour",
        "week",
        "weekday",
    ],
)

# Filling time_df: convert the distinct timestamps once and derive all parts from them
start_time = pd.to_datetime(
    all_log_data.query("(auth != 'Logged Out') & (length > 0)")["ts"].drop_duplicates().sort_values(),
    unit="ms",
)

time_df["start_time"] = start_time
time_df["year"] = start_time.dt.year  # calendar year; the ISO year can differ around New Year
time_df["month"] = start_time.dt.month
time_df["day"] = start_time.dt.day
time_df["hour"] = start_time.dt.hour
time_df["week"] = start_time.dt.isocalendar().week
time_df["weekday"] = start_time.dt.weekday  # Monday = 0

# Showing time_df
time_df

Unnamed: 0,start_time,year,month,day,hour,week,weekday
2,2018-11-01 21:01:46.796,2018,11,1,21,44,3
4,2018-11-01 21:05:52.796,2018,11,1,21,44,3
5,2018-11-01 21:08:16.796,2018,11,1,21,44,3
6,2018-11-01 21:11:13.796,2018,11,1,21,44,3
7,2018-11-01 21:17:33.796,2018,11,1,21,44,3
...,...,...,...,...,...,...,...
8050,2018-11-30 18:40:05.796,2018,11,30,18,48,4
8051,2018-11-30 18:44:36.796,2018,11,30,18,48,4
8052,2018-11-30 18:47:58.796,2018,11,30,18,48,4
8053,2018-11-30 18:51:24.796,2018,11,30,18,48,4


In [46]:
# Creating artists_df
artists_df = pd.DataFrame(
    columns=[
        # "artist_id",
        "name",
        #"location",
        #"latitude",
        #"longitude",
    ],
)

# Filling artists_df
artists_df["name"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["artist"].drop_duplicates().sort_values()

additional_artist_info = (
    all_song_data
    [["artist_name", "artist_location", "artist_latitude", "artist_longitude"]]
    .rename(columns={"artist_name": "name", "artist_location": "location", "artist_latitude": "latitude", "artist_longitude": "longitude"})
    .sort_values(["name", "location", "latitude", "longitude"])
    .drop_duplicates(subset=["name"])
    .reset_index(drop=True)
)

artists_df = artists_df.merge(additional_artist_info, on="name", how="left")
artists_df = artists_df.reset_index().rename(columns={"index": "artist_id"})

# Showing artists_df
artists_df

Unnamed: 0,artist_id,name,location,latitude,longitude
0,0,!!!,,,
1,1,'N Sync/Phil Collins,,,
2,2,+ / - {Plus/Minus},,,
3,3,+44,,,
4,4,1 Mile North,,,
...,...,...,...,...,...
3143,3143,the bird and the bee,"Los Angeles, CA",34.05349,-118.24532
3144,3144,tobyMac,Nashville,,
3145,3145,zebrahead,,,
3146,3146,ÃÂngeles del Infierno,,,


In [47]:
# Creating songs_df from the distinct (song, artist) pairs in the log data
songs_df = (
    all_log_data
    .query("(auth != 'Logged Out') & (length > 0)")
    [["song", "artist"]]
    .drop_duplicates()
    .sort_values(["song"])
    .rename(columns={"song": "title", "artist": "name"})
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "song_id"})
)

# Adding artist_id from artists_df
songs_df = songs_df.merge(artists_df[["name", "artist_id"]], on="name", how="left")

songs_df = songs_df.merge(
    all_song_data[["artist_name", "title", "year", "duration"]].rename(columns={"artist_name": "name"}), 
    on=["name", "title"], 
    how="left"
).drop("name", axis=1)

# Showing songs_df
songs_df

Unnamed: 0,song_id,title,artist_id,year,duration
0,0,I Will Not Reap Destruction,3058,,
1,1,#40,645,,
2,2,'Round Midnight,747,,
3,3,'Till I Collapse,881,,
4,4,(Hon Vill Ha) Puls,1141,,
...,...,...,...,...,...
5290,5290,s.Ada.Licht,2352,2007.0,176.97914
5291,5291,shimmer,370,,
5292,5292,the king of wishful thinking,1977,,
5293,5293,ÃÂ Noite,2015,,


In [48]:
helper_1_df = songs_df.merge(artists_df, on="artist_id", how="left")[["song_id", "title", "artist_id", "name"]].rename(columns={"name": "artist", "title": "song"})
helper_2_df = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].merge(helper_1_df, on=["artist", "song"], how="left").drop(["artist", "song"], axis=1)
helper_2_df

Unnamed: 0,song_id,artist_id
0,5227,753
1,1470,1904
2,3551,2597
3,1284,2758
4,398,1233
...,...,...
6815,4506,1009
6816,427,2914
6817,3707,28
6818,4437,119


In [49]:
# Creating songplays_df
songplays_df = pd.DataFrame(
    columns=[
        "session_id",
        "songplay_id",
        "start_time",
        "song_id",
        "artist_id",
        "user_id",
        "location",
        "level",
        "user_agent",
    ],
)

# Filling songplays_df
songplays_df["session_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["sessionId"]
songplays_df["songplay_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["itemInSession"]
songplays_df["start_time"] = pd.to_datetime(all_log_data.query("(auth != 'Logged Out') & (length > 0)")["ts"], unit="ms")

helper_1_df = songs_df.merge(artists_df, on="artist_id", how="left")[["song_id", "title", "artist_id", "name"]].rename(columns={"name": "artist", "title": "song"})
helper_2_df = all_log_data.query("(auth != 'Logged Out') & (length > 0)")[["artist", "song"]].merge(helper_1_df, on=["artist", "song"], how="left").drop(["artist", "song"], axis=1)

songplays_df["song_id"] = helper_2_df["song_id"].values
songplays_df["artist_id"] = helper_2_df["artist_id"].values

songplays_df["user_id"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["userId"].astype(int)
songplays_df["location"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["location"]
songplays_df["level"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["level"]
songplays_df["user_agent"] = all_log_data.query("(auth != 'Logged Out') & (length > 0)")["userAgent"]

songplays_df = songplays_df.reset_index(drop=True)

# Showing songplays_df
songplays_df

Unnamed: 0,session_id,songplay_id,start_time,song_id,artist_id,user_id,location,level,user_agent
0,139,1,2018-11-01 21:01:46.796,5227,753,8,"Phoenix-Mesa-Scottsdale, AZ",free,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
1,139,3,2018-11-01 21:05:52.796,1470,1904,8,"Phoenix-Mesa-Scottsdale, AZ",free,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
2,139,4,2018-11-01 21:08:16.796,3551,2597,8,"Phoenix-Mesa-Scottsdale, AZ",free,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
3,139,5,2018-11-01 21:11:13.796,1284,2758,8,"Phoenix-Mesa-Scottsdale, AZ",free,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
4,139,6,2018-11-01 21:17:33.796,398,1233,8,"Phoenix-Mesa-Scottsdale, AZ",free,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
...,...,...,...,...,...,...,...,...,...
6815,1076,57,2018-11-30 18:40:05.796,4506,1009,16,"Birmingham-Hoover, AL",paid,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6816,1076,58,2018-11-30 18:44:36.796,427,2914,16,"Birmingham-Hoover, AL",paid,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6817,1076,59,2018-11-30 18:47:58.796,3707,28,16,"Birmingham-Hoover, AL",paid,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6818,1076,60,2018-11-30 18:51:24.796,4437,119,16,"Birmingham-Hoover, AL",paid,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."


## Modeling Using SQLite

In [50]:
connection = sqlite3.connect("sparkify.sqlite3")
cursor = connection.cursor()

In [51]:
# Delete staging tables
drop_log_data_table = "DROP TABLE IF EXISTS log_data;"
drop_song_data_table = "DROP TABLE IF EXISTS song_data;"

# Delete dimension tables
drop_time_table = "DROP TABLE IF EXISTS time;"
drop_users_table = "DROP TABLE IF EXISTS users;"
drop_songs_table = "DROP TABLE IF EXISTS songs;"
drop_artists_table = "DROP TABLE IF EXISTS artists;"

# Delete fact table
drop_songplays_table = "DROP TABLE IF EXISTS songplays;"

# Drop all tables
drop_tables = [
    drop_log_data_table,
    drop_song_data_table,
    drop_time_table,
    drop_users_table,
    drop_songs_table,
    drop_artists_table,
    drop_songplays_table,
]

In [52]:
# Create staging tables
create_log_data_table = """
CREATE TABLE IF NOT EXISTS log_data (
    artist          VARCHAR(200)    NULL,
    auth            VARCHAR(50)     NOT NULL,
    firstName       VARCHAR(50)     NULL,
    gender          CHAR(1)         NULL,
    itemInSession   INTEGER         NOT NULL,
    lastName        VARCHAR(50)     NULL,
    length          FLOAT           NULL,
    level           CHAR(4)         NOT NULL,
    location        VARCHAR(200)    NULL,
    method          VARCHAR(10)     NOT NULL,
    page            VARCHAR(50)     NOT NULL,
    registration    FLOAT           NULL,
    sessionId       INTEGER         NOT NULL,
    song            VARCHAR(200)    NULL,
    status          INTEGER         NOT NULL,
    ts              INTEGER         NOT NULL,
    userAgent       VARCHAR(200)    NULL,
    userId          INTEGER         NULL,
    PRIMARY KEY (sessionId, itemInSession)
);
"""

create_song_data_table = """
CREATE TABLE IF NOT EXISTS song_data (
    artist_id       VARCHAR(50)     NOT NULL,
    artist_latitude FLOAT           NULL,
    artist_location VARCHAR(200)    NULL,
    artist_longitude FLOAT          NULL,
    artist_name     VARCHAR(200)    NOT NULL,
    duration        FLOAT           NOT NULL,
    num_songs       INTEGER         NOT NULL,
    song_id         VARCHAR(50)     NOT NULL,
    title           VARCHAR(200)    NOT NULL,
    year            INTEGER         NOT NULL,
    PRIMARY KEY (artist_id, song_id)
);
"""

# Create dimension tables
create_time_table = """
CREATE TABLE IF NOT EXISTS time (
    start_time      TIMESTAMP       NOT NULL,
    year            INTEGER         NOT NULL,
    month           INTEGER         NOT NULL,
    day             INTEGER         NOT NULL,
    hour            INTEGER         NOT NULL,
    week            INTEGER         NOT NULL,
    weekday         INTEGER         NOT NULL,
    PRIMARY KEY (start_time)
);
"""

create_users_table = """
CREATE TABLE IF NOT EXISTS users (
    user_id         INTEGER         NOT NULL,
    first_name      VARCHAR(50)     NOT NULL,
    last_name       VARCHAR(50)     NOT NULL,
    gender          CHAR(1)         NOT NULL,
    level           CHAR(4)         NOT NULL,
    PRIMARY KEY (user_id)
);
"""

create_artists_table = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id       INTEGER         NOT NULL,
    name            VARCHAR(200)    NOT NULL,
    location        VARCHAR(200)    NULL,
    latitude        FLOAT           NULL,
    longitude       FLOAT           NULL,
    PRIMARY KEY (artist_id)
);
"""

create_songs_table = """
CREATE TABLE IF NOT EXISTS songs (
    song_id         INTEGER         NOT NULL,
    title           VARCHAR(200)    NOT NULL,
    artist_id       INTEGER         NOT NULL,
    year            INTEGER         NULL,
    duration        FLOAT           NULL,
    PRIMARY KEY (song_id),
    FOREIGN KEY (artist_id) REFERENCES artists (artist_id)
);
"""

# Create fact table
create_songplays_table = """
CREATE TABLE IF NOT EXISTS songplays (
    session_id      INTEGER         NOT NULL,
    songplay_id     INTEGER         NOT NULL,
    start_time      TIMESTAMP       NOT NULL,
    artist_id       INTEGER         NOT NULL,
    song_id         INTEGER         NOT NULL,
    user_id         INTEGER         NOT NULL,
    level           CHAR(4)         NOT NULL,
    location        VARCHAR(200)    NOT NULL,
    user_agent      VARCHAR(200)    NOT NULL,
    PRIMARY KEY (session_id, songplay_id),
    FOREIGN KEY (start_time) REFERENCES time (start_time),
    FOREIGN KEY (artist_id) REFERENCES artists (artist_id),
    FOREIGN KEY (song_id) REFERENCES songs (song_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);
"""

# Create all tables
create_tables = [
    create_log_data_table,
    create_song_data_table,
    create_time_table,
    create_users_table,
    create_artists_table,
    create_songs_table,
    create_songplays_table,
]

In [53]:
insert_log_data_table = """
INSERT INTO log_data (
    artist,
    auth,
    firstName,
    gender,
    itemInSession,
    lastName,
    length,
    level,
    location,
    method,
    page,
    registration,
    sessionId,
    song,
    status,
    ts,
    userAgent,
    userId
) VALUES (
    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,?, ?, ?, ?, ?, ?
);
"""

insert_song_data_table = """
INSERT INTO song_data (
    artist_id,
    artist_latitude,
    artist_location,
    artist_longitude,
    artist_name,
    duration,
    num_songs,
    song_id,
    title,
    year
) VALUES (
    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?
);
"""

# Insert tables
insert_tables = [
    insert_log_data_table,
    insert_song_data_table,
]

In [54]:
# Source data paths
log_data_path = "./data/project/log_data.csv"
song_data_path = "./data/project/song_data.csv"

# Data paths for staging tables
data_paths = [
    log_data_path,
    song_data_path,
]

# Drop staging tables
drop_staging_tables = [
    drop_log_data_table,
    drop_song_data_table,
]

# Create staging tables
create_staging_tables = [
    create_log_data_table,
    create_song_data_table,
]

# Insert staging tables
insert_staging_tables = [
    insert_log_data_table,
    insert_song_data_table,
]

# Drop all staging tables
for query in drop_staging_tables:
    cursor.execute(query)

# Create all staging tables
for query in create_staging_tables:
    cursor.execute(query)

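# Insert all staging tables
for path, query in zip(data_paths, insert_staging_tables):
    # newline="" is what the csv module recommends for files passed to csv.reader
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            # Empty CSV fields are inserted as NULL
            data = [None if x == "" else x for x in row]
            cursor.execute(query, data)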
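
The row-by-row insert above can also be expressed with `executemany`, which takes any iterable of parameter sequences. The following is a minimal sketch under stated assumptions: the table `staging`, its two columns, and the inline CSV text are hypothetical and only stand in for the real staging tables and files.

```python
import csv
import io
import sqlite3

# Hypothetical two-column CSV; the second data row consists of empty fields only
csv_text = "artist,length\nArtist A,210.5\n,\n"


def rows_with_nulls(reader):
    """Yield CSV rows with empty strings replaced by None (inserted as NULL)."""
    for row in reader:
        yield [None if value == "" else value for value in row]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (artist VARCHAR(200) NULL, length FLOAT NULL);")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip header

# The connection as context manager commits on success and rolls back on error
with conn:
    conn.executemany("INSERT INTO staging VALUES (?, ?);", rows_with_nulls(reader))

rows = conn.execute("SELECT artist, length FROM staging ORDER BY rowid;").fetchall()
print(rows)
conn.close()
```

Because the `length` column has a floating-point type affinity, SQLite converts the text `"210.5"` to a number on insert, while the empty fields stay `NULL`.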

In [55]:
# Function to get pandas dataframe from sql query
def get_df_from_sql(sql_query):
    cursor.execute(sql_query)
    # Passing the column names to the constructor also works for empty result sets
    columns = [x[0] for x in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)

In [56]:
log_data = get_df_from_sql("SELECT * FROM log_data;")
log_data.sort_values(by=["artist", "song"])

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
3661,!!!,Logged In,Tegan,F,32,Levine,486.81751,paid,"Portland-South Portland, ME",PUT,NextSong,1.540794e+12,620,Bend Over Beethoven,200,1542367380796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",80.0
5559,'N Sync/Phil Collins,Logged In,Morris,M,1,Gilmore,143.64689,free,"Raleigh, NC",PUT,NextSong,1.540972e+12,351,Trashin' The Camp (Phil And 'N Sync Version),200,1542899043796,"""Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like...",23.0
7654,+ / - {Plus/Minus},Logged In,Tegan,F,29,Levine,318.98077,paid,"Portland-South Portland, ME",PUT,NextSong,1.540794e+12,1065,The Queen of Nothing,200,1543530451796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",80.0
7824,+ / - {Plus/Minus},Logged In,Matthew,M,11,Jones,318.98077,paid,"Janesville-Beloit, WI",PUT,NextSong,1.541063e+12,998,The Queen of Nothing,200,1543575245796,"""Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537....",36.0
1949,+44,Logged In,Ryan,M,1,Smith,224.57424,free,"San Jose-Sunnyvale-Santa Clara, CA",PUT,NextSong,1.541017e+12,472,Make You Smile,200,1541962092796,"""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...",26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7996,,Logged Out,,,9,,,free,,GET,Home,,1026,,200,1543596347796,,
8000,,Logged In,Rylan,M,32,George,,paid,"Birmingham-Hoover, AL",GET,Downgrade,1.541020e+12,1076,,200,1543596972796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",16.0
8005,,Logged In,Austin,M,0,Rosales,,free,"New York-Newark-Jersey City, NY-NJ-PA",GET,Home,1.541060e+12,1101,,200,1543598001796,Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20...,12.0
8047,,Logged In,Rylan,M,56,George,,paid,"Birmingham-Hoover, AL",GET,Help,1.541020e+12,1076,,200,1543602936796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",16.0


In [57]:
# Check if all data is inserted correctly (NULL and NaN are compared via a placeholder)
(
    (get_df_from_sql("SELECT * FROM log_data;").fillna("__NA__") == all_log_data.fillna("__NA__")).all().all(),
    (get_df_from_sql("SELECT * FROM song_data;").fillna("__NA__") == all_song_data.fillna("__NA__")).all().all(),
)

(True, True)

In [58]:
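# Insert time query
# Note: DATETIME() drops the milliseconds, and strftime('%w') counts Sunday as 0,
# whereas pandas' dt.weekday counts Monday as 0. The weekday column is therefore
# shifted by one compared to time_df above. '%W' is also not the ISO week in general,
# although both agree for the dates in this dataset.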
insert_time_table = """
INSERT INTO 
    time
SELECT 
    DATETIME(ts / 1000, 'auto') AS start_time,
    strftime('%Y', DATETIME(ts / 1000, 'auto')) AS year,
    strftime('%m', DATETIME(ts / 1000, 'auto')) AS month,
    strftime('%d', DATETIME(ts / 1000, 'auto')) AS day,
    strftime('%H', DATETIME(ts / 1000, 'auto')) AS hour,
    strftime('%W', DATETIME(ts / 1000, 'auto')) AS week,
    strftime('%w', DATETIME(ts / 1000, 'auto')) AS weekday
FROM 
    log_data
WHERE
    auth = 'Logged In' AND 
    length > 0
GROUP BY
    start_time
;
"""

cursor.execute(drop_time_table)
cursor.execute(create_time_table)
cursor.execute(insert_time_table)

get_df_from_sql("SELECT * FROM time;")

Unnamed: 0,start_time,year,month,day,hour,week,weekday
0,2018-11-01 21:01:46,2018,11,1,21,44,4
1,2018-11-01 21:05:52,2018,11,1,21,44,4
2,2018-11-01 21:08:16,2018,11,1,21,44,4
3,2018-11-01 21:11:13,2018,11,1,21,44,4
4,2018-11-01 21:17:33,2018,11,1,21,44,4
...,...,...,...,...,...,...,...
6808,2018-11-30 18:40:05,2018,11,30,18,48,5
6809,2018-11-30 18:44:36,2018,11,30,18,48,5
6810,2018-11-30 18:47:58,2018,11,30,18,48,5
6811,2018-11-30 18:51:24,2018,11,30,18,48,5
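
The `weekday` values differ between the pandas and the SQLite version of the `time` table (3 vs. 4 for 2018-11-01) because the two use different conventions. The following minimal sketch illustrates this for one timestamp. It uses the `'unixepoch'` modifier instead of `'auto'`, and the millisecond value is an assumption derived from the first `start_time` shown above.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed millisecond timestamp for 2018-11-01 21:01:46.796 UTC
ts_ms = 1541106106796

conn = sqlite3.connect(":memory:")
# '%w' yields the day of week as text with Sunday = 0
sqlite_weekday = int(
    conn.execute("SELECT strftime('%w', ? / 1000, 'unixepoch');", (ts_ms,)).fetchone()[0]
)
conn.close()

# datetime.weekday() uses Monday = 0, the same convention as pandas' dt.weekday
python_weekday = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).weekday()

# Shifting by six days (modulo 7) maps the SQLite value onto the Monday = 0 convention
converted = (sqlite_weekday + 6) % 7

print(sqlite_weekday, python_weekday, converted)
```

If both tables should agree, the same shift could be applied inside the SQL query, or the convention could simply be documented next to the `time` table.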


In [59]:
# Insert users query
insert_users_table = """
INSERT INTO
    users
SELECT
    user_id,
    first_name,
    last_name,
    gender,
    level
FROM (
    SELECT
        userId AS user_id,
        firstName AS first_name,
        lastName AS last_name,
        gender,
        level,
        DATETIME(ts / 1000, 'auto') AS time
    FROM
        log_data
    WHERE
        auth = 'Logged In' AND 
        length > 0
    )
GROUP BY
    user_id
HAVING 
    time = MAX(time)
;
"""

cursor.execute(drop_users_table)
cursor.execute(create_users_table)
cursor.execute(insert_users_table)

get_df_from_sql("SELECT * FROM users;")

Unnamed: 0,user_id,first_name,last_name,gender,level
0,2,Jizelle,Benjamin,F,free
1,3,Isaac,Valdez,M,free
2,4,Alivia,Terrell,F,free
3,5,Elijah,Davis,M,free
4,6,Cecilia,Owens,F,free
...,...,...,...,...,...
91,97,Kate,Harrell,F,paid
92,98,Jordyn,Powell,F,free
93,99,Ann,Banks,F,free
94,100,Adler,Barrera,M,free
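
The query above relies on how SQLite treats bare columns in combination with `MAX()`. A more explicit way to pick the latest `level` per user is a window function. The following is a minimal sketch with a hypothetical three-column `log_data` table and made-up rows; it assumes a SQLite version with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_data (userId INTEGER, level TEXT, ts INTEGER);")
conn.executemany(
    "INSERT INTO log_data VALUES (?, ?, ?);",
    [
        (1, "free", 100),  # user 1 starts on the free tier ...
        (1, "paid", 200),  # ... and upgrades later
        (2, "free", 150),
    ],
)

# Number each user's rows from newest to oldest and keep only the newest one
latest_levels = conn.execute(
    """
    SELECT userId, level
    FROM (
        SELECT
            userId,
            level,
            ROW_NUMBER() OVER (PARTITION BY userId ORDER BY ts DESC) AS rn
        FROM log_data
    )
    WHERE rn = 1
    ORDER BY userId;
    """
).fetchall()
conn.close()

print(latest_levels)
```

This mirrors the pandas approach of sorting by `ts` and keeping the last row per user.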


In [60]:
# Insert artists query
insert_artists_table = """
INSERT INTO
    artists
SELECT 
    ROW_NUMBER() OVER () AS artist_id,
    name, 
    location, 
    latitude, 
    longitude
FROM 
    (
        SELECT DISTINCT 
            artist as name
        FROM
            log_data
        WHERE
            log_data.auth = 'Logged In' AND 
            log_data.length > 0
    )
LEFT JOIN 
    (
        SELECT DISTINCT
            artist_name,
            artist_location AS location,
            artist_latitude AS latitude,
            artist_longitude AS longitude
        FROM
            song_data
        GROUP BY
            artist_name
        ORDER BY
            location DESC,
            latitude DESC,
            longitude DESC
    ) 
ON 
    name = artist_name
"""

cursor.execute(drop_artists_table)
cursor.execute(create_artists_table)
cursor.execute(insert_artists_table)

get_df_from_sql("SELECT * FROM artists;")

Unnamed: 0,artist_id,name,location,latitude,longitude
0,1,Des'ree,,,
1,2,Mr Oizo,,,
2,3,Tamba Trio,,,
3,4,The Mars Volta,"Long Beach, California",,
4,5,Infected Mushroom,,,
...,...,...,...,...,...
3143,3144,The Replacements,"Minneapolis, MN",,
3144,3145,Sarah McLachlan,"Halifax, Nova Scotia, Canada",,
3145,3146,Soul II Soul Featuring Caron Wheeler,,,
3146,3147,Timbiriche,,,
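
The `artist_id` values above differ from those in `artists_df`, because `ROW_NUMBER() OVER ()` numbers the rows in whatever order SQLite returns them and starts at 1, while the pandas version sorts by name and starts at 0. A minimal sketch of a reproducible numbering, using a hypothetical one-column `log_data` table with placeholder artist names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_data (artist TEXT);")
conn.executemany(
    "INSERT INTO log_data VALUES (?);",
    [("Artist B",), ("Artist A",), ("Artist B",), ("Artist C",)],
)

# Ordering inside the window makes the generated ids independent of the scan order
artist_ids = conn.execute(
    """
    SELECT
        ROW_NUMBER() OVER (ORDER BY name) AS artist_id,
        name
    FROM (SELECT DISTINCT artist AS name FROM log_data)
    ORDER BY artist_id;
    """
).fetchall()
conn.close()

print(artist_ids)
```

The ids are still surrogate keys that change when new artists are added, so this only makes repeated runs over the same data comparable.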


In [61]:
# Insert songs query
insert_songs_table = """
INSERT INTO
    songs
SELECT
    ROW_NUMBER() OVER () AS song_id,
    first_part.title,
    first_part.artist_id,
    second_part.year,
    second_part.duration
FROM 
(
    (
        (
            SELECT
                song AS title,
                artist
            FROM
                log_data
            WHERE
                auth = 'Logged In' AND
                length > 0
            GROUP BY
                title,
                artist
            ORDER BY
                title
        )
        LEFT JOIN (
            SELECT
                name,
                artist_id
            FROM
                artists
        )
        ON
            artist = name
    ) AS first_part
    LEFT JOIN (
        SELECT
            title,
            artist_name,
            year,
            duration
        FROM
            song_data
        WHERE
            year > 0
        GROUP BY
            title,
            artist_name
        HAVING
            duration = MAX(duration)
    ) AS second_part
    ON
        first_part.title = second_part.title AND
        first_part.artist = second_part.artist_name
)
"""

cursor.execute(drop_songs_table)
cursor.execute(create_songs_table)
cursor.execute(insert_songs_table)

get_df_from_sql("SELECT * FROM songs;")

Unnamed: 0,song_id,title,artist_id,year,duration
0,1,I Will Not Reap Destruction,2611,,
1,2,#40,895,,
2,3,'Round Midnight,2260,,
3,4,'Till I Collapse,674,,
4,5,(Hon Vill Ha) Puls,1793,,
...,...,...,...,...,...
5290,5291,s.Ada.Licht,412,2007.0,176.97914
5291,5292,shimmer,755,,
5292,5293,the king of wishful thinking,1307,,
5293,5294,ÃÂ Noite,1286,,


In [62]:
query = """
SELECT
    raw_log_data.session_id,
    raw_log_data.item_in_session,
    raw_log_data.start_time,
    raw_log_data.artist,
    raw_artist_data.artist_id,
    raw_log_data.song,
    raw_song_data.song_id,
    raw_log_data.user_id,
    raw_log_data.level,
    raw_log_data.location,
    raw_log_data.user_agent
FROM
    (   
        SELECT
            sessionId AS session_id,
            itemInSession AS item_in_session,
            ts AS start_time,
            artist,
            song,
            userId AS user_id,
            level,
            location,
            userAgent AS user_agent
        FROM
            log_data
        WHERE
            auth = 'Logged In' AND
            length > 0
    ) AS raw_log_data
JOIN
    (
        SELECT
            artist_id,
            name
        FROM
            artists
    ) AS raw_artist_data
ON
    raw_log_data.artist = raw_artist_data.name
JOIN
    (
        SELECT
            song_id,
            title,
            artist_id
        FROM
            songs
    ) AS raw_song_data
ON
    raw_artist_data.artist_id = raw_song_data.artist_id AND
    raw_log_data.song = raw_song_data.title
"""

get_df_from_sql(query)
    

Unnamed: 0,session_id,item_in_session,start_time,artist,artist_id,song,song_id,user_id,level,location,user_agent
0,692,31,1543069415796,We Came As Romans,2611,I Will Not Reap Destruction,1,73,paid,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
1,891,18,1543006953796,DAVE MATTHEWS BAND,895,#40,2,85,paid,"Red Bluff, CA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_..."
2,776,35,1542750304796,Denise Jannah,2260,'Round Midnight,3,85,paid,"Red Bluff, CA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_..."
3,315,1,1541536304796,Eminem / Nate Dogg,674,'Till I Collapse,4,16,free,"Birmingham-Hoover, AL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
4,648,18,1542401291796,Eminem / Nate Dogg,674,'Till I Collapse,4,49,paid,"San Francisco-Oakland-Hayward, CA",Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20...
...,...,...,...,...,...,...,...,...,...,...,...
6815,295,1,1541600987796,Booka Shade,755,shimmer,5292,29,free,"Atlanta-Sandy Springs-Roswell, GA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6816,827,1,1543193736796,New Found Glory,1307,the king of wishful thinking,5293,33,free,"Eugene, OR","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
6817,598,2,1542293123796,New Found Glory,1307,the king of wishful thinking,5293,50,free,"New Haven-Milford, CT","""Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebK..."
6818,486,32,1542121783796,O Rappa,1286,ÃÂ Noite,5294,29,paid,"Atlanta-Sandy Springs-Roswell, GA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."


In [63]:
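# Insert songplays query
# Note: start_time is stored here as the raw millisecond value of ts, while
# time.start_time holds DATETIME strings, so the two columns do not join as they stand.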
insert_songplays_table = """
INSERT INTO
    songplays
SELECT
    raw_log_data.session_id,
    raw_log_data.item_in_session,
    raw_log_data.start_time,
    raw_artist_data.artist_id,
    raw_song_data.song_id,
    raw_log_data.user_id,
    raw_log_data.level,
    raw_log_data.location,
    raw_log_data.user_agent
FROM
    (   
        SELECT
            sessionId AS session_id,
            itemInSession AS item_in_session,
            ts AS start_time,
            artist,
            song,
            userId AS user_id,
            level,
            location,
            userAgent AS user_agent
        FROM
            log_data
        WHERE
            auth = 'Logged In' AND
            length > 0
    ) AS raw_log_data
JOIN
    (
        SELECT
            artist_id,
            name
        FROM
            artists
    ) AS raw_artist_data
ON
    raw_log_data.artist = raw_artist_data.name
JOIN
    (
        SELECT
            song_id,
            title,
            artist_id
        FROM
            songs
    ) AS raw_song_data
ON
    raw_artist_data.artist_id = raw_song_data.artist_id AND
    raw_log_data.song = raw_song_data.title
"""

cursor.execute(drop_songplays_table)
cursor.execute(create_songplays_table)
cursor.execute(insert_songplays_table)

get_df_from_sql("SELECT * FROM songplays;")

Unnamed: 0,session_id,songplay_id,start_time,artist_id,song_id,user_id,level,location,user_agent
0,692,31,1543069415796,2611,1,73,paid,"Tampa-St. Petersburg-Clearwater, FL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
1,891,18,1543006953796,895,2,85,paid,"Red Bluff, CA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_..."
2,776,35,1542750304796,2260,3,85,paid,"Red Bluff, CA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_..."
3,315,1,1541536304796,674,4,16,free,"Birmingham-Hoover, AL","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
4,648,18,1542401291796,674,4,49,paid,"San Francisco-Oakland-Hayward, CA",Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20...
...,...,...,...,...,...,...,...,...,...
6815,295,1,1541600987796,755,5292,29,free,"Atlanta-Sandy Springs-Roswell, GA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6816,827,1,1543193736796,1307,5293,33,free,"Eugene, OR","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
6817,598,2,1542293123796,1307,5293,50,free,"New Haven-Milford, CT","""Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebK..."
6818,486,32,1542121783796,1286,5294,29,paid,"Atlanta-Sandy Springs-Roswell, GA","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."


In [64]:
for query in drop_tables:
    cursor.execute(query)

# One commit after all tables are dropped is sufficient
connection.commit()

In [65]:
connection.commit()
cursor.close()
connection.close()