# Spotify Streaming History: An End-to-End Data Analysis

This project is an end-to-end analysis of personal Spotify streaming history. The first phase uses **SQL** to clean and transform the raw export, engineer features into an analytical base table, and answer a first set of business questions. The second phase moves to **Python**, where **pandas** and **numpy** are used to answer further questions about listening habits and engagement. Finally, the insights are presented through charts created with **Plotly Express**.

![Spotify](/work/alexander-shatov-JlO3-oY5ZlQ-unsplash-scaled.jpg)

## Set Up

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
!pip install openpyxl


## Extraction

In [2]:
data_source = _dntk.execute_sql(
  'SELECT *\nFROM st_read(\'spotify_data_songs_.xlsx\')',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)
data_source

Unnamed: 0,spotify_track_uri,ts,platform,ms_played,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped
0,2J3n32GeLmMjwuAzyhcSNe,2013-07-08 02:44:34,web player,3185,"Say It, Just Say It",The Mowgli's,Waiting For The Dawn,autoplay,clickrow,FALSE,FALSE
1,1oHxIPqJyvAYHy0PVrDU98,2013-07-08 02:45:37,web player,61865,Drinking from the Bottle (feat. Tinie Tempah),Calvin Harris,18 Months,clickrow,clickrow,FALSE,FALSE
2,487OPlneJNni3NWC8SYqhW,2013-07-08 02:50:24,web player,285386,Born To Die,Lana Del Rey,Born To Die - The Paradise Edition,clickrow,unknown,FALSE,FALSE
3,5IyblF777jLZj1vGHG2UD3,2013-07-08 02:52:40,web player,134022,Off To The Races,Lana Del Rey,Born To Die - The Paradise Edition,trackdone,clickrow,FALSE,FALSE
4,0GgAAB0ZMllFhbNc3mAodO,2013-07-08 03:17:52,web player,0,Half Mast,Empire Of The Sun,Walking On A Dream,clickrow,nextbtn,FALSE,FALSE
...,...,...,...,...,...,...,...,...,...,...,...
149855,4Fz1WWr5o0OrlIcZxcyZtK,2024-12-15 23:06:19,android,1247,On The Way Home,John Mayer,Paradise Valley,fwdbtn,fwdbtn,TRUE,TRUE
149856,0qHMhBZqYb99yhX9BHcIkV,2024-12-15 23:06:21,android,1515,Magical Mystery Tour - Remastered 2009,The Beatles,Magical Mystery Tour,fwdbtn,fwdbtn,TRUE,TRUE
149857,0HHdujGjOZChTrl8lJWEIq,2024-12-15 23:06:22,android,1283,"Stop This Train - Live at the Nokia Theatre, L...",John Mayer,Where the Light Is: John Mayer Live In Los Ang...,fwdbtn,fwdbtn,TRUE,TRUE
149858,7peh6LUcdNPcMdrSH4JPsM,2024-12-15 23:06:23,android,1306,I Don't Trust Myself (With Loving You),John Mayer,Continuum,fwdbtn,fwdbtn,TRUE,TRUE


## Transformation

### Main query

In [3]:
spotify = _dntk.execute_sql(
  'select \n    spotify_track_uri\n    ,cast(ts as timestamp) as ts\n    ,platform\n    ,ms_played\n    ,track_name\n    ,artist_name\n    ,album_name\n    ,reason_start\n    ,reason_end\n    ,shuffle\n    ,skipped\n    ,round(ms_played / 1000 / 60, 1) as minutes_played\n    ,case \n        when skipped = \'TRUE\' then 1\n        when skipped = \'FALSE\' then 0 end as skipped_flag\n    ,lag(ts) over(partition by artist_name order by ts) as previous_ts_date\nfrom data_source\n',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)
spotify

Unnamed: 0,spotify_track_uri,ts,platform,ms_played,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,minutes_played,skipped_flag,previous_ts_date
0,06xNkl12TrxwGYrwTGlUUd,2022-02-22 01:54:29,android,169460,Main Title (The Notebook),Aaron Zigman,The Notebook (Original Motion Picture Soundtrack),trackdone,trackdone,FALSE,FALSE,2.8,0.0,NaT
1,0UmR5k7YxioGuU6V0XVPbZ,2023-01-02 20:43:23,android,184977,Blue Something,Agustín Amigó,Blue Something,trackdone,trackdone,TRUE,FALSE,3.1,0.0,NaT
2,0LbZg654PnwFu7lcj5aPtb,2023-01-03 21:44:32,android,193727,Manzanilla,Agustín Amigó,Swatches,trackdone,trackdone,TRUE,FALSE,3.2,0.0,2023-01-02 20:43:23
3,2jsQWRYISLjXA2x3LQMzH6,2023-01-24 18:34:17,android,168000,Days of Rain,Agustín Amigó,Days of Rain,trackdone,trackdone,TRUE,FALSE,2.8,0.0,2023-01-03 21:44:32
4,0LbZg654PnwFu7lcj5aPtb,2023-01-25 18:23:23,android,193727,Manzanilla,Agustín Amigó,Swatches,trackdone,trackdone,TRUE,FALSE,3.2,0.0,2023-01-24 18:34:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149855,27ebni0DfbT5Owz6W42HP8,2021-09-25 06:40:55,android,241560,Nunca Me Olvides,Yandel,Dangerous,trackdone,trackdone,TRUE,FALSE,4.0,0.0,2020-12-20 08:25:59
149856,1UulFvlRZPoCWwUBzk5ImI,2021-12-23 07:57:52,android,213520,Jaque Mate,Yandel,Legacy - De Líder a Leyenda Tour,trackdone,trackdone,FALSE,FALSE,3.6,0.0,2021-09-25 06:40:55
149857,2oiixB9QMIzhWaHGVlQx4g,2023-04-15 21:41:01,android,216148,Yandel 150,Yandel,Yandel 150,clickrow,trackdone,TRUE,FALSE,3.6,0.0,2021-12-23 07:57:52
149858,4FAKtPVycI4DxoOHC01YqD,2023-05-21 07:23:39,android,77915,Yandel 150,Yandel,Resistencia,trackdone,fwdbtn,FALSE,TRUE,1.3,1.0,2023-04-15 21:41:01
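
For readers who prefer to stay in pandas, the feature engineering in the main query can be approximated as shown here. This is a minimal sketch rather than the notebook's actual pipeline: the `build_base_table` helper and the `raw` frame are hypothetical, the three rows are copied from the output above, and it assumes `skipped` arrives as the strings `'TRUE'`/`'FALSE'`, as the SQL `case` expression suggests.

```python
import pandas as pd

# Three rows copied from the output above (only the columns the sketch needs).
raw = pd.DataFrame({
    "ts": ["2023-01-03 21:44:32", "2023-01-02 20:43:23", "2022-02-22 01:54:29"],
    "artist_name": ["Agustín Amigó", "Agustín Amigó", "Aaron Zigman"],
    "ms_played": [193727, 184977, 169460],
    "skipped": ["FALSE", "FALSE", "FALSE"],  # assumption: stored as strings
})


def build_base_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas counterpart of the main SQL query."""
    out = raw.copy()
    out["ts"] = pd.to_datetime(out["ts"])
    # Milliseconds -> minutes, rounded to one decimal like round(..., 1) in the SQL.
    out["minutes_played"] = (out["ms_played"] / 1000 / 60).round(1)
    # 'TRUE' -> 1, 'FALSE' -> 0; other values become NaN, like a CASE without ELSE.
    out["skipped_flag"] = out["skipped"].map({"TRUE": 1, "FALSE": 0})
    # lag(ts) over (partition by artist_name order by ts)
    out = out.sort_values(["artist_name", "ts"]).reset_index(drop=True)
    out["previous_ts_date"] = out.groupby("artist_name")["ts"].shift(1)
    return out


base = build_base_table(raw)
print(base[["artist_name", "ts", "minutes_played", "skipped_flag", "previous_ts_date"]])
```

Note that the lag is computed per artist, so `previous_ts_date` is the previous play of the same artist, not the previous play overall.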


### Business Question #1 (Artist Summary)

***Which artists are listened to the most, and what is the engagement level with their music, measured by total minutes and the number of times their songs are skipped?***

In [4]:
_dntk.execute_sql(
  'select \n    artist_name\n    ,count(*) as total_reproductions\n    ,sum(minutes_played) as total_minutes_played\n    ,sum(skipped_flag) as total_skips\nfrom spotify\ngroup by artist_name\nhaving count(*) > 1\norder by sum(minutes_played) desc',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,artist_name,total_reproductions,total_minutes_played,total_skips
0,The Beatles,13621,20079.4,388.0
1,The Killers,6878,17615.5,197.0
2,John Mayer,4855,12084.3,153.0
3,Bob Dylan,3814,9479.7,163.0
4,Paul McCartney,2697,5940.2,107.0
...,...,...,...,...
2259,Mariachi México de Pepe Villa,2,0.0,0.0
2260,The Wanted,2,0.0,2.0
2261,Joe Lovano,2,0.0,2.0
2262,Foreigner,2,0.0,2.0
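
The same artist summary can be sketched in pandas with named aggregations. The frame `toy_plays` and its values are made up for illustration; only the column names follow the base table.

```python
import pandas as pd

# Toy plays; column names follow the base table, values are invented.
toy_plays = pd.DataFrame({
    "artist_name": ["A", "A", "B", "C", "C", "C"],
    "minutes_played": [3.0, 2.5, 4.0, 1.0, 0.5, 2.0],
    "skipped_flag": [0, 1, 0, 1, 1, 0],
})

artist_totals = (
    toy_plays.groupby("artist_name")
    .agg(
        total_reproductions=("minutes_played", "size"),    # count(*)
        total_minutes_played=("minutes_played", "sum"),
        total_skips=("skipped_flag", "sum"),
    )
    .query("total_reproductions > 1")                       # having count(*) > 1
    .sort_values("total_minutes_played", ascending=False)   # order by ... desc
    .reset_index()
)
print(artist_totals)
```

Artist `B` has a single play and is dropped by the filter, mirroring the `having` clause.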


### Business Question #2 (Time Between Listens)

***What is the listening cadence or frequency for each artist, measured by the time elapsed between consecutive plays?***

In [5]:
_dntk.execute_sql(
  'select \n    artist_name\n    ,track_name\n    ,ts\n    ,previous_ts_date\n    ,round(JULIAN(ts) - JULIAN(previous_ts_date),2) as days_since_previous_play\n    ,ts - previous_ts_date as days_hours_since_previous_play\nfrom spotify',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,artist_name,track_name,ts,previous_ts_date,days_since_previous_play,days_hours_since_previous_play
0,Aaron Zigman,Main Title (The Notebook),2022-02-22 01:54:29,NaT,,NaT
1,Agustín Amigó,Blue Something,2023-01-02 20:43:23,NaT,,NaT
2,Agustín Amigó,Manzanilla,2023-01-03 21:44:32,2023-01-02 20:43:23,1.04,1 days 01:01:09
3,Agustín Amigó,Days of Rain,2023-01-24 18:34:17,2023-01-03 21:44:32,20.87,20 days 20:49:45
4,Agustín Amigó,Manzanilla,2023-01-25 18:23:23,2023-01-24 18:34:17,0.99,0 days 23:49:06
...,...,...,...,...,...,...
149855,Yandel,Nunca Me Olvides,2021-09-25 06:40:55,2020-12-20 08:25:59,278.93,278 days 22:14:56
149856,Yandel,Jaque Mate,2021-12-23 07:57:52,2021-09-25 06:40:55,89.05,89 days 01:16:57
149857,Yandel,Yandel 150,2023-04-15 21:41:01,2021-12-23 07:57:52,478.57,478 days 13:43:09
149858,Yandel,Yandel 150,2023-05-21 07:23:39,2023-04-15 21:41:01,35.40,35 days 09:42:38
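
As a cross-check, the elapsed time can also be computed in pandas by subtracting the two timestamp columns. The sketch below uses timestamps copied from the Agustín Amigó rows above; the frame name `gaps` is hypothetical.

```python
import pandas as pd

# Timestamps copied from the Agustín Amigó rows in the output above.
gaps = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2023-01-02 20:43:23", "2023-01-03 21:44:32", "2023-01-24 18:34:17"]
    ),
    "previous_ts_date": pd.to_datetime(
        [None, "2023-01-02 20:43:23", "2023-01-03 21:44:32"]
    ),
})

# Timestamp subtraction yields a Timedelta (NaT when there is no previous play).
elapsed = gaps["ts"] - gaps["previous_ts_date"]
gaps["days_hours_since_previous_play"] = elapsed
# Fractional days: total seconds divided by the number of seconds in a day.
gaps["days_since_previous_play"] = (elapsed.dt.total_seconds() / 86_400).round(2)
print(gaps)
```

The first play of an artist has no predecessor, so both derived columns are missing for that row, as in the SQL output.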


### Business Question #3

***What is the proportion of active listening (user-initiated plays) versus passive listening (autoplay)?***

The query below counts the tracks started by a direct user action versus those that started automatically because the previous track finished (`reason_start = 'trackdone'`, labeled `Autoplay` in the query).

In [6]:
df_3 = _dntk.execute_sql(
  'with count_per_start_type as (\nselect \n    track_name\n    ,case \n        when reason_start = \'trackdone\' then \'Autoplay\'\n        when reason_start in(\'fwdbtn\', \'backbtn\', \'playbtn\', \'nextbtn\', \'remote\', \'popup\') then \'user_action\'\n        when reason_start in(\'appload\',\'unknown\') then \'unknown\'\n        else \'other\' end as start_type\nfrom spotify\n)\n\nselect \n    start_type\n    ,count(track_name) as start_type_count\nfrom count_per_start_type\ngroup by start_type\norder by count(track_name) desc',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)
df_3

Unnamed: 0,start_type,start_type_count
0,Autoplay,76491
1,user_action,57853
2,other,11772
3,unknown,3744
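
The question asks for a proportion, while the query returns raw counts. A minimal sketch of the missing step, using the counts copied from the `df_3` output above (the frame name `start_counts` is hypothetical):

```python
import pandas as pd

# Counts copied from the df_3 output above.
start_counts = pd.DataFrame({
    "start_type": ["Autoplay", "user_action", "other", "unknown"],
    "start_type_count": [76491, 57853, 11772, 3744],
})

# Share of each start type as a percentage of all plays.
total = start_counts["start_type_count"].sum()
start_counts["share_pct"] = (100 * start_counts["start_type_count"] / total).round(1)
print(start_counts)
```

One caveat when reading the shares: `clickrow` is not in the query's `user_action` list, so those plays land in `other`. If `clickrow` represents a user click, the user-initiated share is understated.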


### Business Question #4

***For a given artist, which tracks are listened to for longer than that artist's average listening time?***

In [7]:
_dntk.execute_sql(
  'with avg_minutes as (\n    select \n    artist_name\n    ,track_name\n    ,minutes_played\n    ,avg(minutes_played) over(partition by artist_name) as artist_avg_minutes\nfrom spotify\n    )\n\nselect \n    artist_name\n    ,track_name\n    ,minutes_played\n    ,artist_avg_minutes\n    ,case\n        when minutes_played > artist_avg_minutes then \'above average\'\n        when minutes_played < artist_avg_minutes then \'below average\'\n        else \'Equal to average\' end as comparison\nfrom avg_minutes',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,artist_name,track_name,minutes_played,artist_avg_minutes,comparison
0,Aaron Zigman,Main Title (The Notebook),2.8,2.8,Equal to average
1,Agustín Amigó,Blue Something,3.1,3.1,Equal to average
2,Agustín Amigó,Manzanilla,3.2,3.1,above average
3,Agustín Amigó,Days of Rain,2.8,3.1,below average
4,Agustín Amigó,Manzanilla,3.2,3.1,above average
...,...,...,...,...,...
149855,Yandel,Nunca Me Olvides,4.0,2.0,above average
149856,Yandel,Jaque Mate,3.6,2.0,above average
149857,Yandel,Yandel 150,3.6,2.0,above average
149858,Yandel,Yandel 150,1.3,2.0,below average
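
The window average has a direct pandas counterpart in `groupby(...).transform("mean")`, which returns one value per row instead of one per group. The sketch below uses invented data (`toy` is a hypothetical frame) and adds the final filter that the business question implies.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "artist_name": ["A", "A", "A", "B"],
    "track_name": ["a1", "a2", "a3", "b1"],
    "minutes_played": [1.0, 2.0, 3.0, 4.0],
})

# avg(minutes_played) over (partition by artist_name)
toy["artist_avg_minutes"] = toy.groupby("artist_name")["minutes_played"].transform("mean")

toy["comparison"] = np.select(
    [
        toy["minutes_played"] > toy["artist_avg_minutes"],
        toy["minutes_played"] < toy["artist_avg_minutes"],
    ],
    ["above average", "below average"],
    default="Equal to average",
)

# The question asks only for plays longer than the artist's average.
above_average = toy[toy["comparison"] == "above average"]
print(above_average)
```

An artist with a single play always compares as equal to its own average, which explains rows such as the Aaron Zigman one in the output above.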


### Business Question #5:

***How do the reasons for starting a track compare to the reasons for ending it, in a single, unified view?***

In [8]:
_dntk.execute_sql(
  'with start_reasons as (\nselect \n    \'start_reason\' as category_type\n    ,reason_start as reason\n    ,count(track_name) as track_count\nfrom spotify\ngroup by category_type, reason_start\n),\n\nend_reasons as (\nselect \n    \'end_reason\' as category_type\n    ,reason_end as reason\n    ,count(track_name) as track_count\nfrom spotify\ngroup by category_type, reason_end\n)\n\nselect *\nfrom start_reasons\nunion all\nselect * \nfrom end_reasons ',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,category_type,reason,track_count
0,start_reason,fwdbtn,53692
1,start_reason,trackdone,76491
2,start_reason,backbtn,2202
3,start_reason,playbtn,1456
4,start_reason,appload,3721
5,start_reason,unknown,23
6,start_reason,remote,477
7,start_reason,,452
8,start_reason,nextbtn,21
9,start_reason,popup,5
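
A pandas take on the same unified view is to reshape the two reason columns into long form with `melt` and count once. This is a sketch on invented rows; `toy`, `long_form`, and `side_by_side` are hypothetical names. Passing `dropna=False` keeps missing reasons as their own group, which is how the SQL `group by` treats them.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_name": ["t1", "t2", "t3", "t4"],
    "reason_start": ["clickrow", "trackdone", "trackdone", "fwdbtn"],
    "reason_end": ["trackdone", "trackdone", "fwdbtn", "fwdbtn"],
})

# One row per (track, reason column): the equivalent of the UNION ALL.
long_form = toy.melt(
    id_vars="track_name",
    value_vars=["reason_start", "reason_end"],
    var_name="category_type",
    value_name="reason",
)

reason_counts = (
    long_form.groupby(["category_type", "reason"], dropna=False)["track_name"]
    .count()  # like count(track_name): ignores missing track names
    .reset_index(name="track_count")
)

# Optional: start and end counts next to each other, one row per reason.
side_by_side = (
    reason_counts.pivot(index="reason", columns="category_type", values="track_count")
    .fillna(0)
    .astype(int)
)
print(side_by_side)
```

The side-by-side layout makes it easier to see reasons that are common at the start of a play but rare at the end, and vice versa.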


### Business Question #5 (Time of Day)

***In which part of the day is the listening time per track longer, considering only periods with significant listening?***

In [9]:
_dntk.execute_sql(
  'with hour_day as (\nselect \n    extract(hour from ts) as hour_day\n    ,track_name\n    ,minutes_played\n    ,case \n        when extract(hour from ts) between 6 and 11 then \'Morning\'\n        when extract(hour from ts) between 12 and 17 then \'Afternoon\'\n        when extract(hour from ts) between 18 and 22 then \'Evening\'\n        else \'night\' end as time_of_day\nfrom spotify\n)\n\nselect\n    time_of_day\n    ,count(track_name) as track_count\n    ,round(avg(minutes_played),2) as avg_minutes_played\nfrom hour_day \ngroup by time_of_day\nhaving avg(minutes_played) > 1.5\norder by avg(minutes_played) desc',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,time_of_day,track_count,avg_minutes_played
0,Afternoon,24797,2.47
1,Evening,45272,2.23
2,night,61893,1.99
3,Morning,17898,1.94


### Business Question #6:

***What are the characteristics of the top 10% longest plays, specifically for tracks that were not skipped?***

In [10]:
ranked_tracks = _dntk.execute_sql(
  'with non_skipped_tracks as(\nSELECT *\n,ntile(10) over(order by minutes_played desc) as duration_decile\nfrom spotify\nwhere skipped_flag = 0\norder by ntile(10) over(order by minutes_played desc) asc, minutes_played desc\n)\n\nselect \n    artist_name\n    ,track_name\n    ,minutes_played\n    ,duration_decile\n    ,sum(minutes_played) over(order by ts asc) as cumulative_minutes\nfrom non_skipped_tracks \nwhere duration_decile = 1',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)
ranked_tracks

Unnamed: 0,artist_name,track_name,minutes_played,duration_decile,cumulative_minutes
0,Lana Del Rey,Born To Die,4.8,1,4.8
1,John Mayer,In Your Atmosphere - Live at the Nokia Theatre...,5.8,1,10.6
2,U2,I Still Haven't Found What I'm Looking For,4.6,1,15.2
3,Justin Timberlake,Mirrors,7.3,1,22.5
4,John Mayer,In Your Atmosphere - Live at the Nokia Theatre...,5.8,1,28.3
...,...,...,...,...,...
14165,My Chemical Romance,Welcome to the Black Parade,5.2,1,79808.7
14166,Eminem,Headlights,5.7,1,79814.4
14167,Pink Floyd,Dogs,17.1,1,79831.5
14168,Guy Clark,L.A. Freeway,5.0,1,79836.5
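
To make the `ntile(10)` step concrete, here is a small pandas sketch of the same idea on invented data. The helper `ntile_labels` is hypothetical; it follows the SQL `NTILE` rule that bucket sizes differ by at most one and the larger buckets come first.

```python
import numpy as np
import pandas as pd


def ntile_labels(n_rows: int, n_buckets: int) -> np.ndarray:
    """Bucket numbers (1-based) for rows that are already sorted."""
    base, extra = divmod(n_rows, n_buckets)
    # The first `extra` buckets receive one additional row each.
    sizes = [base + 1] * extra + [base] * (n_buckets - extra)
    return np.repeat(np.arange(1, n_buckets + 1), sizes)


toy = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=12, freq="D"),
    "minutes_played": [3.0, 9.5, 1.2, 4.4, 17.1, 2.2, 5.8, 0.9, 7.3, 2.8, 3.6, 4.0],
    "skipped_flag": [0] * 12,
})

# Rank non-skipped plays from longest to shortest and assign deciles.
ranked = toy[toy["skipped_flag"] == 0].sort_values("minutes_played", ascending=False).copy()
ranked["duration_decile"] = ntile_labels(len(ranked), 10)

# Keep the top decile and accumulate minutes in chronological order.
top_decile = ranked[ranked["duration_decile"] == 1].sort_values("ts").copy()
top_decile["cumulative_minutes"] = top_decile["minutes_played"].cumsum()
print(top_decile)
```

With 12 rows and 10 buckets, the first two buckets hold two rows each, so the top decile contains the two longest plays.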


### Business Question #7:

***How does the user's listening mode (shuffling vs. sequential play) affect their engagement (i.e., their tendency to skip songs)?***

In [11]:
_dntk.execute_sql(
  'with eng_type as \n(select *\n    ,case \n        when shuffle = TRUE then \'shuffle\'\n        else \'sequential\' end as listening_mode\n    ,case\n        when skipped = TRUE then \'skipped\'\n        else \'completed\' end as engagement_type\nfrom spotify\n)\n\nselect \n    listening_mode\n    ,engagement_type\n    ,count(track_name) as track_count\nfrom eng_type\ngroup by listening_mode, engagement_type\norder by listening_mode, engagement_type',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,listening_mode,engagement_type,track_count
0,sequential,completed,36854
1,sequential,skipped,1651
2,shuffle,completed,105156
3,shuffle,skipped,6199
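
Because shuffle plays outnumber sequential plays by roughly three to one, raw counts are hard to compare. A minimal sketch that turns the counts above into a skip rate per listening mode (the frame names are hypothetical; the numbers are copied from the output):

```python
import pandas as pd

# Counts copied from the output above.
mode_counts = pd.DataFrame({
    "listening_mode": ["sequential", "sequential", "shuffle", "shuffle"],
    "engagement_type": ["completed", "skipped", "completed", "skipped"],
    "track_count": [36854, 1651, 105156, 6199],
})

# One row per listening mode, one column per engagement type.
by_mode = mode_counts.pivot(
    index="listening_mode", columns="engagement_type", values="track_count"
)
by_mode["skip_rate_pct"] = (
    100 * by_mode["skipped"] / (by_mode["skipped"] + by_mode["completed"])
).round(1)
print(by_mode)
```

On these counts the skip rate comes out somewhat higher in shuffle mode than in sequential mode.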


### Business Question #8:

***How can we identify distinct listening sessions, and what is the average number of tracks played per session?***

In [12]:
_dntk.execute_sql(
  'with diff as \n(select \n    ts\n    ,previous_ts_date\n    ,(julian(ts) - julian(previous_ts_date)) *24 * 60 as time_diff\nfrom spotify),\n\nsessions as (\nselect * \n    ,case \n        when time_diff > 30 or time_diff is null then 1\n        else 0 end as is_new_session\nfrom diff\n        ),\n\nsession_ids as (\n    select * \n        ,sum(is_new_session) over(order by ts asc) as session_ids\n    from sessions \n)\n\nselect \n    session_ids\n    ,count(*) as track_count\nfrom session_ids\ngroup by session_ids',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,session_ids,track_count
0,58701.0,1
1,58704.0,26
2,58705.0,2
3,58706.0,1
4,58708.0,1
...,...,...
88300,58693.0,1
88301,58695.0,1
88302,58696.0,1
88303,58698.0,1
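
One thing to keep in mind when reading this result: the query reuses `previous_ts_date`, which the main query computes with `partition by artist_name`, so the gap is measured against the previous play of the same artist rather than the previous play overall. A session boundary is more commonly defined on the gap to the previous play of any artist. The sketch below shows that variant in pandas on invented timestamps, keeping the 30-minute threshold; all names in it are hypothetical.

```python
import pandas as pd

SESSION_GAP_MINUTES = 30  # same threshold as the SQL above

toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:04",
        "2024-01-01 10:50", "2024-01-01 10:53", "2024-01-01 10:57",
        "2024-01-02 09:00",
    ]),
    "minutes_played": [3.5, 4.0, 2.5, 3.0, 3.5, 1.0],
}).sort_values("ts").reset_index(drop=True)

# Gap to the previous play across all artists (global lag).
gap_minutes = toy["ts"].diff().dt.total_seconds() / 60

# A new session starts at the first play or after a gap above the threshold.
is_new_session = gap_minutes.isna() | (gap_minutes > SESSION_GAP_MINUTES)

# A running sum of the flags yields a session id per play.
toy["session_id"] = is_new_session.cumsum()

per_session = toy.groupby("session_id").agg(
    track_count=("ts", "size"),
    total_minutes_played=("minutes_played", "sum"),
)
avg_tracks_per_session = per_session["track_count"].mean()
print(per_session)
print(avg_tracks_per_session)
```

The final `mean()` is the step that answers the "average number of tracks per session" part of the question, which the query above leaves at the per-session level.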


### Business Question #9

***On which platform is the user more likely to skip songs, and what is the average listening duration on each?***

In [13]:
_dntk.execute_sql(
  'SELECT \n    platform\n    ,count(*) as total_plays\n    ,round(avg(skipped_flag),2) as skip_rate\n    ,round(avg(minutes_played),2)as avg_minutes_played\nfrom spotify\nGROUP BY platform\nORDER BY count(*) desc',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,platform,total_plays,skip_rate,avg_minutes_played
0,android,139821,0.05,2.08
1,cast to device,3898,0.0,3.11
2,iOS,3049,0.1,2.75
3,windows,1691,0.14,2.3
4,mac,1176,0.06,3.57
5,web player,225,0.0,1.86


### Business Question #10:

***How does listening behavior change between weekdays and weekends?***

In [14]:
_dntk.execute_sql(
  'with days as \n(SELECT \n    skipped_flag\n    ,minutes_played\n    ,case\n        when extract(dow from ts) in (0,6) then \'weekend\'\n        else \'weekday\' end as day_type\nfrom spotify)\n\nselect \n    day_type\n    ,count(*) as total_plays\n    ,round(avg(skipped_flag),2) as skip_rate\n    ,round(avg(minutes_played),2)as avg_minutes_played\nfrom days\ngroup by day_type\n',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,day_type,total_plays,skip_rate,avg_minutes_played
0,weekday,112189,0.05,2.15
1,weekend,37671,0.06,2.08
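
When porting this split to pandas, the day numbering changes. The SQL treats `0` and `6` as the weekend, which matches the PostgreSQL-style convention where `dow` runs from Sunday (0) to Saturday (6); that convention is assumed to apply to the engine behind these SQL cells. In pandas, `dt.dayofweek` runs from Monday (0) to Sunday (6), so the weekend is `5` and `6`. A minimal sketch with a hypothetical `days` frame:

```python
import numpy as np
import pandas as pd

days = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-12-15 23:06:19",  # a Sunday
        "2013-07-08 02:44:34",  # a Monday
        "2024-12-14 12:00:00",  # a Saturday
    ]),
})

# pandas: Monday=0 ... Sunday=6, so Saturday and Sunday are 5 and 6.
days["day_type"] = np.where(days["ts"].dt.dayofweek >= 5, "weekend", "weekday")
days["day_name"] = days["ts"].dt.day_name()  # handy for eyeballing the mapping
print(days)
```

Copying the SQL condition `in (0, 6)` into pandas unchanged would classify Monday as a weekend day and Saturday as a weekday.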


### Business Question #11

***What is the average length, in minutes, of a typical listening session?***

In [15]:
_dntk.execute_sql(
  'with track_time_diff as (\nselect \n    ts\n    ,minutes_played\n    ,lag(ts) over(order by ts asc) as previous_ts_date\n    ,(julian(ts) - julian(lag(ts) over(order by ts asc))) * 24 * 60 as minutes_since_last_play\nfrom spotify\n),\n\nsession_data as (\nselect \n    ts\n    ,minutes_played\n    ,case\n        when minutes_since_last_play > 30 or minutes_since_last_play is null then 1\n        else 0 end as is_new_session\n    ,sum(case\n        when minutes_since_last_play > 30 or minutes_since_last_play is null then 1\n        else 0 end) over (order by ts asc) as session_id\nfrom track_time_diff\n),\n\nfinal_table as (\nselect \n    session_id\n    ,sum(minutes_played) as total_minutes_played\nfrom session_data\ngroup by session_id\n)\n\nselect \n    avg(total_minutes_played) as avg_minutes_played\nfrom final_table',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)

Unnamed: 0,avg_minutes_played
0,36.019309


## Exploratory Data Analysis (Python)

### Business Question #1:

***In which part of the day is the listening time per track longer, considering only periods with significant listening?***

In [17]:
spotify['ts'] = pd.to_datetime(spotify['ts'])
spotify['hour'] = spotify['ts'].dt.hour
conditions = [
    (spotify['hour'] >= 6) & (spotify['hour'] <= 11),
    (spotify['hour'] >= 12) & (spotify['hour'] <= 17),
    (spotify['hour'] >= 18) & (spotify['hour'] <= 22)
]
choices = [
    'Morning', 'Afternoon','Evening'
]
spotify['time_of_day'] = np.select(conditions, choices, default='Night')

time_day_grouped = spotify.groupby('time_of_day')[['track_name','minutes_played']].agg({'track_name':'count','minutes_played':'mean'}).round(2).reset_index()
time_day_grouped = time_day_grouped.rename(columns ={'track_name':'track_count','minutes_played':'avg_minutes_played'})
time_day_grouped_filtered = time_day_grouped.loc[(time_day_grouped['time_of_day'] != 'Night') & (time_day_grouped['avg_minutes_played'] > 1.5)].sort_values('avg_minutes_played', ascending = False)
time_day_grouped_filtered

Unnamed: 0,time_of_day,track_count,avg_minutes_played
0,Afternoon,24797,2.47
1,Evening,45272,2.23
2,Morning,17898,1.94


***Let's visualize the average time played per time of day.***

In [18]:
fig = px.bar(time_day_grouped,x='time_of_day',y='avg_minutes_played')
fig.show()

***Now let's see the distribution of minutes played in a boxplot, per time of the day.***

In [19]:
fig = px.box(spotify,x='time_of_day',y='minutes_played')
fig.show()

### Business Question #2:

***We need to create a cleaned-up report of tracks that were meaningfully played (more than 30 seconds) for a database import. The report requires specific column names and data types.***

In [20]:
spotify_track_30_secs = spotify.loc[spotify['ms_played'] > 30000, ['track_name','artist_name','album_name','minutes_played','skipped_flag']]
spotify_track_30_secs = spotify_track_30_secs.rename(columns={'track_name':'track','artist_name':'artist','album_name':'album','minutes_played':'duration_mins','skipped_flag':'was_skipped'})
spotify_track_30_secs['was_skipped'] = spotify_track_30_secs['was_skipped'].astype(bool)
conditions = [
    spotify_track_30_secs['duration_mins'] <= 1,
    (spotify_track_30_secs['duration_mins'] > 1) & (spotify_track_30_secs['duration_mins'] <= 3),
    (spotify_track_30_secs['duration_mins'] > 3) & (spotify_track_30_secs['duration_mins'] <= 5),
    (spotify_track_30_secs['duration_mins'] > 5) & (spotify_track_30_secs['duration_mins'] <= 10)
]
choices = ['Less_1_min','1-3_mins','3-5_mins','5-10_mins']
spotify_track_30_secs['duration_category'] = np.select(conditions, choices, default='More_than_10')
spotify_track_30_secs

Unnamed: 0,track,artist,album,duration_mins,was_skipped,duration_category
0,Main Title (The Notebook),Aaron Zigman,The Notebook (Original Motion Picture Soundtrack),2.8,False,1-3_mins
1,Blue Something,Agustín Amigó,Blue Something,3.1,False,3-5_mins
2,Manzanilla,Agustín Amigó,Swatches,3.2,False,3-5_mins
3,Days of Rain,Agustín Amigó,Days of Rain,2.8,False,1-3_mins
4,Manzanilla,Agustín Amigó,Swatches,3.2,False,3-5_mins
...,...,...,...,...,...,...
149855,Nunca Me Olvides,Yandel,Dangerous,4.0,False,3-5_mins
149856,Jaque Mate,Yandel,Legacy - De Líder a Leyenda Tour,3.6,False,3-5_mins
149857,Yandel 150,Yandel,Yandel 150,3.6,False,3-5_mins
149858,Yandel 150,Yandel,Resistencia,1.3,True,1-3_mins
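
The duration buckets can also be expressed with `pd.cut`, which keeps the bin edges and labels in one place. This is an alternative sketch, not the cell's actual code; the `durations` series is invented. The default right-closed intervals match the `<=` comparisons used above.

```python
import numpy as np
import pandas as pd

durations = pd.Series([0.5, 1.0, 2.8, 3.1, 5.0, 7.3, 17.1], name="duration_mins")

duration_category = pd.cut(
    durations,
    # Right-closed intervals: (-inf, 1], (1, 3], (3, 5], (5, 10], (10, inf)
    bins=[-np.inf, 1, 3, 5, 10, np.inf],
    labels=["Less_1_min", "1-3_mins", "3-5_mins", "5-10_mins", "More_than_10"],
)
print(pd.concat([durations, duration_category.rename("duration_category")], axis=1))
```

One behavioral difference: a missing duration stays missing with `pd.cut`, whereas `np.select` would assign it the default label `More_than_10`.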


***Let's visualize the distribution of this data***

In [21]:
fig = px.histogram(spotify_track_30_secs,x='duration_mins', marginal='box')
fig.show()

### Business Question #3:

***How do listening habits, such as total time played and skip rate, differ for the top 5 most frequently played artists?***

In [22]:
top_5_artists = spotify.groupby(['artist_name'])['track_name'].count().reset_index().sort_values('track_name', ascending=False).head(5)
top_5_artists_list = top_5_artists['artist_name'].tolist() 
spotify_top_5 = spotify.loc[spotify['artist_name'].isin(top_5_artists_list)].copy()  # .copy() avoids SettingWithCopyWarning when adding 'Year'
spotify_top_5['Year'] = spotify_top_5['ts'].dt.to_period('Y').astype(str)
spotify_top_5_grouped = spotify_top_5.groupby(['artist_name','Year'])[['skipped_flag','minutes_played']].agg({'skipped_flag':'mean','minutes_played':'sum'}).reset_index()
spotify_top_5_grouped

Unnamed: 0,artist_name,Year,skipped_flag,minutes_played
0,Bob Dylan,2015,0.608696,45.4
1,Bob Dylan,2016,0.055556,220.5
2,Bob Dylan,2017,0.0,1369.1
3,Bob Dylan,2018,0.0,1421.0
4,Bob Dylan,2019,0.0,1615.2
5,Bob Dylan,2020,0.0,1564.9
6,Bob Dylan,2021,0.0,1016.1
7,Bob Dylan,2022,0.033058,542.1
8,Bob Dylan,2023,0.261494,1204.3
9,Bob Dylan,2024,0.298013,481.1


***This line chart tracks the evolution of listening habits by showing the total minutes played per year for the top 5 most frequent artists.***

In [23]:
fig = px.line(spotify_top_5_grouped,x='Year',y='minutes_played',color='artist_name')
fig.show()

***Checking the correlation between total minutes played per artist and Year.***

In [24]:
fig = px.scatter(spotify_top_5_grouped,x='Year',y='minutes_played',color='artist_name')
fig.show()

### Business Question #4:

***Is there a relationship between the variety of tracks a user listens to from an artist and the total time they spend listening to that artist? The summary below covers the top 10 artists by total minutes played.***

In [25]:
artist_summary = spotify.groupby('artist_name')[['track_name','minutes_played']].agg({'track_name':'nunique','minutes_played':'sum'}).reset_index().rename(columns={'track_name':'unique_tracks','minutes_played':'total_minutes_played'}).sort_values('total_minutes_played',ascending=False).head(10)
artist_summary

Unnamed: 0,artist_name,unique_tracks,total_minutes_played
3500,The Beatles,470,20079.4
3602,The Killers,155,17615.5
1773,John Mayer,115,12084.3
465,Bob Dylan,100,9479.7
2858,Paul McCartney,155,5940.2
1511,Howard Shore,98,5816.3
3708,The Strokes,49,5298.4
3684,The Rolling Stones,111,5122.4
2934,Pink Floyd,60,4334.8
2093,Led Zeppelin,94,4120.8


***Checking the correlation between number of unique tracks per artist and the total minutes played, for the top 10 artists.*** 

In [26]:
fig = px.scatter(artist_summary, x='unique_tracks', y='total_minutes_played', hover_data=['artist_name'], color='artist_name')
fig.show()
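
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r on the ten rows from the `artist_summary` output above, once with all artists and once without The Beatles, since a single artist with far more unique tracks can dominate the coefficient. The frame name `top10` is hypothetical.

```python
import pandas as pd

# Rows copied from the artist_summary output above.
top10 = pd.DataFrame({
    "artist_name": [
        "The Beatles", "The Killers", "John Mayer", "Bob Dylan", "Paul McCartney",
        "Howard Shore", "The Strokes", "The Rolling Stones", "Pink Floyd", "Led Zeppelin",
    ],
    "unique_tracks": [470, 155, 115, 100, 155, 98, 49, 111, 60, 94],
    "total_minutes_played": [
        20079.4, 17615.5, 12084.3, 9479.7, 5940.2,
        5816.3, 5298.4, 5122.4, 4334.8, 4120.8,
    ],
})

# Pearson correlation across all ten artists.
pearson_all = top10["unique_tracks"].corr(top10["total_minutes_played"])

# Same coefficient without the artist that has by far the most unique tracks.
rest = top10[top10["artist_name"] != "The Beatles"]
pearson_without_beatles = rest["unique_tracks"].corr(rest["total_minutes_played"])

print(round(pearson_all, 2), round(pearson_without_beatles, 2))
```

With only ten points, either figure is a description of this sample rather than evidence of a general relationship.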

### Business Question #5:

***Which albums encourage the most engaged listening? We want to identify the albums with the highest average listening duration per track, which suggests users don't skip them often.***

In [27]:
album_engagement = spotify.dropna(subset='album_name')
album_engagement = album_engagement.groupby('album_name')[['track_name','minutes_played']].agg({'track_name':'count','minutes_played':'mean'}).rename(columns={'track_name':'play_count','minutes_played':'avg_duration'})
album_engagement = album_engagement[album_engagement['play_count'] > 1].sort_values('avg_duration',ascending=False).head(10).reset_index().sort_values('avg_duration', ascending=True)
album_engagement

Unnamed: 0,album_name,play_count,avg_duration
9,Stardust: The Music Of Hoagy Carmichael,2,7.5
8,Mr. Music / Ring of Love,2,7.6
7,All the Greatest Hits Ever Made,2,7.7
6,Strauss: Die Fledermaus,2,7.7
5,Sofrito: Tropical Discotheque,5,8.08
4,Blue Train,5,8.44
3,Beatles to Bond and Bach,3,9.5
2,The Sound Of Belgium Vol. 3,4,10.75
1,Just Coolin',2,11.0
0,Tubular Bells,5,11.62


***Visually Identifying High-Engagement Albums by Average Listening Time (in Mins)***

In [28]:
fig = px.bar(album_engagement,
     x='avg_duration',
     y='album_name',
     title='Top 10 Albums by Average Listening Time per Play (Minutes)',
     text='avg_duration')
fig.show()

### Business Question #5 (Weekly Listening Pattern)

***What is the user's weekly listening pattern? We want to identify on which days of the week and at which hours listening is most intense.***

In [109]:
spotify['day_of_week'] = spotify['ts'].dt.day_name()
spotify['number_day_of_week'] = spotify['ts'].dt.day_of_week
spotify['day'] = spotify['number_day_of_week'].astype(str) + '.' + spotify['day_of_week']
spotify['hour_of_day'] = spotify['ts'].dt.hour
spotify_pivot = spotify.pivot_table(index='hour_of_day', columns='day', values='track_name', aggfunc='count').sort_values('hour_of_day', ascending=True)
spotify_pivot

hour_of_day,0.Monday,1.Tuesday,2.Wednesday,3.Thursday,4.Friday,5.Saturday,6.Sunday
0,1348,1673,1383,1516,1637,2030,1297
1,1157,1551,1574,1174,1270,1328,1341
2,1055,1311,1650,967,1333,1399,1314
3,1162,1181,1290,1101,1466,1131,1219
4,767,628,781,1088,1195,1048,848
5,1307,1025,979,930,1281,714,928
6,736,655,1341,1510,1242,963,922
7,418,614,506,517,787,782,788
8,297,251,234,168,301,456,605
9,93,186,132,273,173,335,503
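
The numeric prefix in the column names (`0.Monday`, `1.Tuesday`, ...) is there to force calendar order. An alternative sketch that keeps plain day names and orders the columns with `reindex` is shown below; the data is invented and the names `toy`, `day_order`, and `heat` are hypothetical.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-12-15 23:06:19", "2024-12-15 23:06:21",  # Sunday
        "2024-12-16 08:00:00",                          # Monday
        "2024-12-14 23:30:00",                          # Saturday
    ]),
    "track_name": ["a", "b", "c", "d"],
})
toy["day_of_week"] = toy["ts"].dt.day_name()
toy["hour_of_day"] = toy["ts"].dt.hour

heat = (
    toy.pivot_table(index="hour_of_day", columns="day_of_week",
                    values="track_name", aggfunc="count")
    .reindex(columns=day_order)   # calendar order; absent days become empty columns
    .reindex(range(24))           # one row per hour, even hours without plays
    .fillna(0)
    .astype(int)
)
print(heat.loc[[8, 23]])
```

Reindexing to all 24 hours and all 7 days gives the heatmap a complete grid, so quiet hours show up as zeros instead of being left out.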


***Listening pattern by hour of day and day of week (number of songs played).***

In [127]:
fig = px.imshow(
                spotify_pivot,
                labels=dict(x='Day',y='Hour',color='# Songs')
                )
fig.update_xaxes(side='top')
fig.show()

### Business Question #6

***Identifying the top 5 artists by play count and then summarizing their key engagement metrics.***

In [196]:
top_5_artists = spotify.groupby('artist_name')['track_name'].count().sort_values(ascending=False).head(5).reset_index()
top_5_artists_list = top_5_artists['artist_name'].tolist()
spotify_top_5 = spotify.loc[spotify['artist_name'].isin(top_5_artists_list)]
spotify_top_5_grouped = spotify_top_5.groupby('artist_name')[['minutes_played','skipped_flag']].agg(
    {'minutes_played':'sum','skipped_flag':'mean'}).rename(
        columns={'minutes_played':'total_minutes','skipped_flag':'skip_rate'}).sort_values(
            'total_minutes',ascending=True).reset_index()
spotify_top_5_grouped

Unnamed: 0,artist_name,total_minutes,skip_rate
0,Paul McCartney,5940.2,0.039674
1,Bob Dylan,9479.7,0.042737
2,John Mayer,12084.3,0.031514
3,The Killers,17615.5,0.028642
4,The Beatles,20079.4,0.028485


***Visualizing top 5 artists with highest number of minutes played***

In [208]:
fig = px.bar(spotify_top_5_grouped, 
            x='total_minutes',
            y='artist_name'
            ,title='Minutes Played for Top 5 Artists'
            ,text='total_minutes'
            )
fig.show()

In [29]:
spotify

Unnamed: 0,spotify_track_uri,ts,platform,ms_played,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,minutes_played,skipped_flag,previous_ts_date,hour,time_of_day
0,06xNkl12TrxwGYrwTGlUUd,2022-02-22 01:54:29,android,169460,Main Title (The Notebook),Aaron Zigman,The Notebook (Original Motion Picture Soundtrack),trackdone,trackdone,FALSE,FALSE,2.8,0.0,NaT,1,Night
1,0UmR5k7YxioGuU6V0XVPbZ,2023-01-02 20:43:23,android,184977,Blue Something,Agustín Amigó,Blue Something,trackdone,trackdone,TRUE,FALSE,3.1,0.0,NaT,20,Evening
2,0LbZg654PnwFu7lcj5aPtb,2023-01-03 21:44:32,android,193727,Manzanilla,Agustín Amigó,Swatches,trackdone,trackdone,TRUE,FALSE,3.2,0.0,2023-01-02 20:43:23,21,Evening
3,2jsQWRYISLjXA2x3LQMzH6,2023-01-24 18:34:17,android,168000,Days of Rain,Agustín Amigó,Days of Rain,trackdone,trackdone,TRUE,FALSE,2.8,0.0,2023-01-03 21:44:32,18,Evening
4,0LbZg654PnwFu7lcj5aPtb,2023-01-25 18:23:23,android,193727,Manzanilla,Agustín Amigó,Swatches,trackdone,trackdone,TRUE,FALSE,3.2,0.0,2023-01-24 18:34:17,18,Evening
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149855,27ebni0DfbT5Owz6W42HP8,2021-09-25 06:40:55,android,241560,Nunca Me Olvides,Yandel,Dangerous,trackdone,trackdone,TRUE,FALSE,4.0,0.0,2020-12-20 08:25:59,6,Morning
149856,1UulFvlRZPoCWwUBzk5ImI,2021-12-23 07:57:52,android,213520,Jaque Mate,Yandel,Legacy - De Líder a Leyenda Tour,trackdone,trackdone,FALSE,FALSE,3.6,0.0,2021-09-25 06:40:55,7,Morning
149857,2oiixB9QMIzhWaHGVlQx4g,2023-04-15 21:41:01,android,216148,Yandel 150,Yandel,Yandel 150,clickrow,trackdone,TRUE,FALSE,3.6,0.0,2021-12-23 07:57:52,21,Evening
149858,4FAKtPVycI4DxoOHC01YqD,2023-05-21 07:23:39,android,77915,Yandel 150,Yandel,Resistencia,trackdone,fwdbtn,FALSE,TRUE,1.3,1.0,2023-04-15 21:41:01,7,Morning

