# Spotify data analysis: A retrospective

Every December since 2016, Spotify users receive their "Spotify Wrapped": a compilation of data about their activity on the platform over the past year (top artists, top songs, top genres, etc.) that gives them a deep dive into their most memorable listening moments of the year.

Objectives:
* JSON: Import several JSON files into Python in an elegant way
* SQL:
  * Write SQL queries in Python
  * Demonstrate my SQL skills: window functions, joins, etc.
* PYTHON: Make a summary visual report in Python (hvplot)

## 1. Import the Spotify data (JSON files)

In [11]:
# Step 0: Import all the relevant Python packages
import pandas as pd
import json
import sqlite3

In [2]:
# Step 1 : Import the data
# To improve: a more elegant solution still needs to be set up
df0 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_0.json')
df1 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_1.json')
df2 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_2.json')
df3 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_3.json')
df4 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_4.json')
df5 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_5.json')
df6 = pd.read_json('C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/endsong_6.json')

In [17]:
# Step 2 : Remove private information (username, user agent, IP address)
to_delete = ['username', 'ip_addr_decrypted', 'user_agent_decrypted']
df0.drop(to_delete, axis=1, inplace=True)
df1.drop(to_delete, axis=1, inplace=True)
df2.drop(to_delete, axis=1, inplace=True)
df3.drop(to_delete, axis=1, inplace=True)
df4.drop(to_delete, axis=1, inplace=True)
df5.drop(to_delete, axis=1, inplace=True)
df6.drop(to_delete, axis=1, inplace=True)
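
The repeated calls in Steps 1 and 2 above can be collapsed into a single loop. The following is a minimal sketch rather than the code this notebook ran: the helper name `load_endsong_files` and its `folder` argument are hypothetical, and it assumes that every `endsong_*.json` file holds a JSON array of records.

```python
import glob
import os

import pandas as pd

PRIVATE_COLUMNS = ['username', 'ip_addr_decrypted', 'user_agent_decrypted']


def load_endsong_files(folder):
    """Hypothetical helper: read every endsong_*.json file in `folder` into a dict of dataframes."""
    frames = {}
    pattern = os.path.join(folder, 'endsong_*.json')
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_json(path)
        # errors='ignore' skips private columns that are absent from a file
        frame = frame.drop(columns=PRIVATE_COLUMNS, errors='ignore')
        # Key each dataframe by its file name without the extension, e.g. 'endsong_0'
        name = os.path.splitext(os.path.basename(path))[0]
        frames[name] = frame
    return frames
```

Calling `load_endsong_files(...)` on the data folder would return one cleaned dataframe per file, so adding an eighth export file would not require any new line of code.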

## 2. Set up the SQL connection

In [19]:
# Create the SQLITE3 connection
cnx = sqlite3.connect(':memory:')

# Transform the dfx dataframes into dfx SQL tables
df0.to_sql(name='df0', con=cnx)
df1.to_sql(name='df1', con=cnx)
df2.to_sql(name='df2', con=cnx)
df3.to_sql(name='df3', con=cnx)
df4.to_sql(name='df4', con=cnx)
df5.to_sql(name='df5', con=cnx)
df6.to_sql(name='df6', con=cnx)

2900
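
The seven `to_sql` calls above follow the same pattern, so they could also be written as a loop over a dictionary of dataframes. This is a sketch under assumptions: `demo_frames` and `demo_cnx` are hypothetical names, and the two small dataframes are illustrative stand-ins for `df0` ... `df6`.

```python
import sqlite3

import pandas as pd

# Illustrative stand-ins for df0 ... df6
demo_frames = {
    'df0': pd.DataFrame({'ts': ['2021-07-27T14:56:37Z'], 'ms_played': [168168]}),
    'df1': pd.DataFrame({'ts': ['2021-12-06T19:56:36Z'], 'ms_played': [6370]}),
}

demo_cnx = sqlite3.connect(':memory:')
for table_name, frame in demo_frames.items():
    # index=False leaves out the dataframe index column;
    # if_exists='replace' lets the cell be re-run without a "table already exists" error
    frame.to_sql(name=table_name, con=demo_cnx, index=False, if_exists='replace')

# List the tables that now exist in the in-memory database
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", demo_cnx
)
print(tables)
```

Using `index=False` is a design choice: it avoids the extra `index` column that appears in the query output further down.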

## 3. Glimpse of the data

In [20]:
# Print the first 10 rows to check that everything is fine
read_data = pd.read_sql('SELECT * FROM df0 LIMIT 10', cnx)
print(read_data)

   index                    ts                                   platform  \
0      0  2021-07-27T14:56:37Z               Windows 10 (10.0.19042; x64)   
1      1  2021-12-06T19:56:36Z  Android OS 7.0 API 24 (samsung, SM-G920F)   
2      2  2020-11-25T11:05:19Z               Windows 10 (10.0.18363; x64)   
3      3  2022-05-16T09:12:05Z   Android OS 12 API 31 (samsung, SM-G990B)   
4      4  2022-03-26T20:24:49Z   Android OS 12 API 31 (samsung, SM-G990B)   
5      5  2021-04-02T11:13:39Z  Android OS 7.0 API 24 (samsung, SM-G920F)   
6      6  2022-08-03T07:43:47Z   Android OS 12 API 31 (samsung, SM-G990B)   
7      7  2021-09-21T16:00:06Z  Android OS 7.0 API 24 (samsung, SM-G920F)   
8      8  2022-05-27T08:10:19Z               Windows 10 (10.0.18363; x64)   
9      9  2022-04-08T15:13:32Z               Windows 10 (10.0.19042; x64)   

   ms_played conn_country     master_metadata_track_name  \
0     168168           FR               One Summer's Day   
1       6370           FR       

## 4. Data transformation

In [None]:
# Union all -> 1 dataset only


Operations to perform:
* Add a data source column: df0, df1, etc.
* Combine the data -> UNION
* Transform some variables: milliseconds into seconds and minutes
* Rename the variables that have overly long names
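
The operations listed above could be combined in one generated `UNION ALL` query. What follows is only a sketch on toy data: `demo_cnx`, the two small tables, and the output names `source`, `track_name`, `seconds_played` and `minutes_played` are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

demo_cnx = sqlite3.connect(':memory:')

# Toy stand-ins for two of the dfx tables (made-up rows)
pd.DataFrame({
    'ms_played': [120000, 30000],
    'master_metadata_track_name': ['Track A', 'Track B'],
}).to_sql('df0', demo_cnx, index=False)
pd.DataFrame({
    'ms_played': [90000],
    'master_metadata_track_name': ['Track C'],
}).to_sql('df1', demo_cnx, index=False)

# One SELECT per table: tag the source, shorten a long name, convert milliseconds
select_template = """
SELECT '{table}' AS source,
       master_metadata_track_name AS track_name,
       ms_played / 1000.0 AS seconds_played,
       ms_played / 60000.0 AS minutes_played
FROM {table}
"""

table_names = ['df0', 'df1']
union_query = "\nUNION ALL\n".join(
    select_template.format(table=name) for name in table_names
)

combined = pd.read_sql(union_query, demo_cnx)
print(combined)
```

`UNION ALL` keeps duplicate rows, which matters here because the same track can legitimately be played several times; plain `UNION` would silently remove identical rows.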

In [10]:
read_data = pd.read_sql('select COUNT(*) AS nb_rows, MAX(ts) AS max_timestamp from df0', cnx)
print(read_data)

   nb_rows         max_timestamp
0    15851  2022-11-24T21:27:24Z
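
Beyond simple aggregates like the one above, the objectives mention window functions. The sketch below ranks artists by listening time with `RANK() OVER (...)`. It runs on made-up rows in a hypothetical `plays` table (the names `demo_cnx`, `plays`, `artist` and `artist_rank` are assumptions), and it needs SQLite 3.25 or later, the release that introduced window functions.

```python
import sqlite3

import pandas as pd

demo_cnx = sqlite3.connect(':memory:')

# Made-up listening records
plays = pd.DataFrame({
    'artist': ['Artist A', 'Artist A', 'Artist B', 'Artist C', 'Artist C', 'Artist C'],
    'ms_played': [200000, 100000, 250000, 50000, 60000, 40000],
})
plays.to_sql('plays', demo_cnx, index=False)

query = """
WITH totals AS (
    -- Total listening time per artist
    SELECT artist, SUM(ms_played) AS total_ms
    FROM plays
    GROUP BY artist
)
SELECT artist,
       total_ms / 60000.0 AS minutes_played,
       -- Window function: rank artists from most to least listened
       RANK() OVER (ORDER BY total_ms DESC) AS artist_rank
FROM totals
ORDER BY artist_rank
"""

ranking = pd.read_sql(query, demo_cnx)
print(ranking)
```

Computing the totals in a common table expression first keeps the window function separate from the `GROUP BY`, which makes the query easier to read and to extend (for example with `PARTITION BY` on a year column).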


In [None]:
# Step 2 : SQL connection
import sqlite3

# Set up the SQL connection
sql_connect = sqlite3.connect('spotify_data.db')

# Cursor used to execute SQL commands
cursor = sql_connect.cursor()

# Create the table if it does not exist yet, so that the SELECT below does not fail
cursor.execute('''CREATE TABLE IF NOT EXISTS spotify_data (
    ts timestamp,
    username text)''')

#cursor.execute("INSERT INTO spotify_data VALUES ('2022-07-17T06:07:42Z', 'v7x27nfjb2dri60b7jzl159rl')")

cursor.execute("SELECT * FROM spotify_data;")
print(cursor.fetchone())

sql_connect.commit()

sql_connect.close()

In [None]:
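# df0 lives in the in-memory database (cnx), not in spotify_data.db,
# and sql_connect was already closed in the previous cell
query = "SELECT * FROM df0;"
results = cnx.execute(query).fetchall()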

In [None]:

from sqlalchemy import create_engine

# Create database engine to manage connections
engine = create_engine("sqlite:///data.db")

# Load the entire df0 table by table name
# (assumes data.db already contains a df0 table)
df0_from_db = pd.read_sql("df0", engine)

In [None]:
# Check the dimensions of the dataframe
df.shape
#df.head(5)

In [None]:
# Test 2: get the list of JSON files in my folder
import os, json
import pandas as pd

path_to_json = 'C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files)

# Then I try to import each file one by one
# (note: df is overwritten on every iteration, so only the last file is kept)
for file in json_files:
    df = pd.read_json(path_to_json + file)

df

In [None]:
spotify_data = pd.DataFrame(columns=['ts', 'username', 'platform',
                                    'ms_played', 'conn_country', 'user_agent_decrypted',
                                     'master_metadata_track_name',
                                     'master_metadata_album_artist_name',
                                     'master_metadata_album_album_name',
                                     'spotify_track_uri',
                                     'episode_name',
                                     'episode_show_name',
                                     'spotify_episode_uri',
                                     'reason_start',
                                     'reason_end',
                                     'shuffle', 'skipped', 'offline', 'offline_timestamp'
                                    ])
spotify_data.head(5)
#print spotify_json['features'][0]['geometry']

In [None]:
# Attempt to import all the files in the folder
# Get the list of JSON file paths
import glob

path = 'C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData'
all_files = glob.glob(path + "/*.json")

all_files

In [None]:
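temp = pd.DataFrame()  # start from an empty dataframe
for file in all_files:
    # The endsong files are read without lines=True elsewhere in this notebook,
    # so they are treated as JSON arrays rather than JSON Lines
    data = pd.read_json(file)
    # DataFrame has no concat method: use the top-level pd.concat
    temp = pd.concat([temp, data], ignore_index=True)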

In [None]:
temp.head(5)

In [None]:
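import glob
import os

import pandas as pd

path_to_json = 'C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/'

json_pattern = os.path.join(path_to_json, '*.json')
file_list = glob.glob(json_pattern)

# DataFrame.append was removed in pandas 2.0: collect the frames and concatenate them once.
# lines=True is dropped because the files are JSON arrays, not JSON Lines.
frames = [pd.read_json(file) for file in file_list]
temp = pd.concat(frames, ignore_index=True)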

In [None]:
# Inspiration: an example that uses CSV files as the source

import pandas as pd
import glob
import os

globbed_files = glob.glob("*.csv") #creates a list of all csv files

data = [] # pd.concat takes a list of dataframes as an argument
for csv in globbed_files:
    frame = pd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)

bigframe = pd.concat(data, ignore_index=True) # don't want pandas to try to align row indexes
bigframe.to_csv("Pandas_output2.csv")

In [45]:
# This works
import pandas as pd
import glob
import os

path_to_json = 'C:/Users/margo/Documents/Documents/Formation/Github/Spotify/MyData/' 
json_pattern = os.path.join(path_to_json,'*.json')
globbed_files = glob.glob(json_pattern)

globbed_files

data = [] # pd.concat takes a list of dataframes as an argument
for json_file in globbed_files:  # json_file rather than json, so the json module is not shadowed
    frame = pd.read_json(json_file)
    frame['filename'] = os.path.basename(json_file)  # keep track of the source file
    data.append(frame)

bigframe = pd.concat(data, ignore_index=True) # don't want pandas to try to align row indexes
bigframe.to_csv("spotify_data.csv", sep = ';')



In [50]:
spotify_data = pd.read_csv("spotify_data.csv", sep = ';', index_col = 0)
spotify_data.head(5) 

| | ts | username | platform | ms_played | conn_country | ip_addr_decrypted | user_agent_decrypted | master_metadata_track_name | master_metadata_album_artist_name | master_metadata_album_album_name | ... | episode_show_name | spotify_episode_uri | reason_start | reason_end | shuffle | skipped | offline | offline_timestamp | incognito_mode | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-07-27T14:56:37Z | v7x27nfjb2dri60b7jzl159rl | Windows 10 (10.0.19042; x64) | 168168 | FR | 195.36.154.135 | unknown | One Summer's Day | Smyang Piano | One Summer's Day | ... | | | trackdone | trackdone | False | | 0.0 | 1627398000000.0 | False | endsong_0.json |
| 1 | 2021-12-06T19:56:36Z | v7x27nfjb2dri60b7jzl159rl | Android OS 7.0 API 24 (samsung, SM-G920F) | 6370 | FR | 88.170.227.77 | unknown | ONE SHOT | B.A.P | ONE SHOT | ... | | | clickrow | endplay | False | | 0.0 | 1638821000000.0 | False | endsong_0.json |
| 2 | 2020-11-25T11:05:19Z | v7x27nfjb2dri60b7jzl159rl | Windows 10 (10.0.18363; x64) | 184453 | FR | 212.195.100.232 | unknown | BTD (Before The Dawn) | INFINITE | Evolution | ... | | | trackdone | trackdone | False | | 0.0 | 1606302000000.0 | False | endsong_0.json |
| 3 | 2022-05-16T09:12:05Z | v7x27nfjb2dri60b7jzl159rl | Android OS 12 API 31 (samsung, SM-G990B) | 492 | FR | 88.170.227.77 | unknown | El Dorado | Thomas Bergersen | SkyWorld | ... | | | fwdbtn | fwdbtn | False | | 0.0 | 1652692000000.0 | False | endsong_0.json |
| 4 | 2022-03-26T20:24:49Z | v7x27nfjb2dri60b7jzl159rl | Android OS 12 API 31 (samsung, SM-G990B) | 355 | FR | 80.215.82.88 | unknown | Euphoria | BTS | Love Yourself 結 'Answer' | ... | | | fwdbtn | fwdbtn | False | | 0.0 | 1648326000000.0 | False | endsong_0.json |


In [31]:
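# data is a list of dataframes, so pd.DataFrame(data) stores each whole dataframe
# in a single cell (see the output); pd.concat(data) is the call that stacks them
dataframe = pd.DataFrame(data)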
dataframe.head(10)



| | 0 |
|---|---|
| 0 | ts ... |
| 1 | ts ... |
| 2 | ts ... |
| 3 | ts ... |
| 4 | ts ... |
| 5 | ts ... |
| 6 | ts u... |
| 7 | ts use... |
