## Maven Music Challenge

### Spotify Streaming History
This dataset contains one Spotify user's complete music streaming history, including timestamps; track, artist, and album names; and the reasons each track started and ended.

### Challenge Objective
Every December, millions of Spotify users look forward to their Spotify Wrapped – a personalized recap showcasing their listening habits over the past year.

Wrapped has become a social and cultural phenomenon, including breakdowns of listeners' most-streamed artists and tracks, total minutes listened, personalized playlists, and even video messages from artists to their top fans.

For the Maven Music Challenge, your task is to create your own version of Spotify Wrapped, using either your own downloaded streaming history or, if you aren't a Spotify user, the sample dataset provided.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
# read the dataset
data = pd.read_csv("Spotify_Streaming_History/spotify_history.csv")
data.shape

(149860, 11)

In [3]:
# let's look at a few records
data.head()

| | spotify_track_uri | ts | platform | ms_played | track_name | artist_name | album_name | reason_start | reason_end | shuffle | skipped |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2J3n32GeLmMjwuAzyhcSNe | 7/8/2013 2:44 | web player | 3185 | Say It, Just Say It | The Mowgli's | Waiting For The Dawn | autoplay | clickrow | False | False |
| 1 | 1oHxIPqJyvAYHy0PVrDU98 | 7/8/2013 2:45 | web player | 61865 | Drinking from the Bottle (feat. Tinie Tempah) | Calvin Harris | 18 Months | clickrow | clickrow | False | False |
| 2 | 487OPlneJNni3NWC8SYqhW | 7/8/2013 2:50 | web player | 285386 | Born To Die | Lana Del Rey | Born To Die - The Paradise Edition | clickrow | unknown | False | False |
| 3 | 5IyblF777jLZj1vGHG2UD3 | 7/8/2013 2:52 | web player | 134022 | Off To The Races | Lana Del Rey | Born To Die - The Paradise Edition | trackdone | clickrow | False | False |
| 4 | 0GgAAB0ZMllFhbNc3mAodO | 7/8/2013 3:17 | web player | 0 | Half Mast | Empire Of The Sun | Walking On A Dream | clickrow | nextbtn | False | False |


#### Data Dictionary

 - spotify_track_uri: Spotify URI that uniquely identifies each track in the form of "spotify:track:<base-62 string>"
 - ts: Timestamp indicating when the track stopped playing in UTC (Coordinated Universal Time)
 - platform: Platform used when streaming the track
 - ms_played: Number of milliseconds the stream was played
 - track_name: Name of the track
 - artist_name: Name of the artist
 - album_name: Name of the album
 - reason_start: Why the track started
 - reason_end: Why the track ended
 - shuffle: TRUE or FALSE depending on whether shuffle mode was used when playing the track
 - skipped: TRUE or FALSE depending on whether the user skipped to the next song
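
The sample rows shown earlier hold only the base-62 ID, while the dictionary describes the full `spotify:track:<base-62 string>` form. A minimal sketch of rebuilding the full URI, assuming the column stores bare IDs; the helper `to_full_uri` and the `full_uri` column are hypothetical names used only for illustration:

```python
import pandas as pd

# Toy frame with bare track IDs, copied from the sample rows.
sample = pd.DataFrame(
    {"spotify_track_uri": ["2J3n32GeLmMjwuAzyhcSNe", "1oHxIPqJyvAYHy0PVrDU98"]}
)

PREFIX = "spotify:track:"  # URI scheme stated in the data dictionary


def to_full_uri(track_id: str) -> str:
    """Prepend the URI prefix unless the value already carries it (hypothetical helper)."""
    return track_id if track_id.startswith(PREFIX) else PREFIX + track_id


# Assumption: the column holds bare IDs; values that already have the prefix pass through.
sample["full_uri"] = sample["spotify_track_uri"].map(to_full_uri)
print(sample["full_uri"].tolist())
```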


In [4]:
# check the datatypes
data.dtypes

spotify_track_uri    object
ts                   object
platform             object
ms_played             int64
track_name           object
artist_name          object
album_name           object
reason_start         object
reason_end           object
shuffle                bool
skipped                bool
dtype: object

In [5]:
# descriptive statistics
data.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| spotify_track_uri | 149860.0 | 16527.0 | 1BLOVHYYlH4JUHQGcpt75R | 207.0 | | | | | | | |
| ts | 149860.0 | 95738.0 | 7/27/2017 20:11 | 209.0 | | | | | | | |
| platform | 149860.0 | 6.0 | android | 139821.0 | | | | | | | |
| ms_played | 149860.0 | | | | 128316.635093 | 117840.060332 | 0.0 | 2795.0 | 138840.0 | 218507.0 | 1561125.0 |
| track_name | 149860.0 | 13839.0 | Ode To The Mets | 207.0 | | | | | | | |
| artist_name | 149860.0 | 4113.0 | The Beatles | 13621.0 | | | | | | | |
| album_name | 149860.0 | 7946.0 | The Beatles | 2063.0 | | | | | | | |
| reason_start | 149717.0 | 13.0 | trackdone | 76655.0 | | | | | | | |
| reason_end | 149743.0 | 15.0 | trackdone | 77194.0 | | | | | | | |
| shuffle | 149860.0 | 2.0 | True | 111583.0 | | | | | | | |


In [6]:
# convert the "ts" column from strings to datetime values
data['ts'] = pd.to_datetime(data['ts'])
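
The raw timestamps look like `7/8/2013 2:44`, which is ambiguous between month-first and day-first order; the value `7/27/2017 20:11` in the statistics above suggests month-first. As a hedged alternative to letting pandas infer the layout, the sketch below passes an explicit format and marks the result as UTC, as the data dictionary describes. The format string is an assumption based on the sample rows, and `utc=True` produces a timezone-aware column, unlike the call above:

```python
import pandas as pd

# Two raw timestamp strings in the layout seen in the sample rows.
raw = pd.Series(["7/8/2013 2:44", "7/27/2017 20:11"])

# Assumed layout: month/day/year hour:minute (24-hour clock).
# utc=True attaches the UTC timezone that the data dictionary mentions.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", utc=True)

print(parsed.dt.month.tolist())  # month component of each parsed timestamp
print(parsed.dtype)              # timezone-aware datetime dtype
```

With an explicit format, a value that does not match raises an error instead of being silently parsed in a different order.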

In [7]:
data.head()

| | spotify_track_uri | ts | platform | ms_played | track_name | artist_name | album_name | reason_start | reason_end | shuffle | skipped |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2J3n32GeLmMjwuAzyhcSNe | 2013-07-08 02:44:00 | web player | 3185 | Say It, Just Say It | The Mowgli's | Waiting For The Dawn | autoplay | clickrow | False | False |
| 1 | 1oHxIPqJyvAYHy0PVrDU98 | 2013-07-08 02:45:00 | web player | 61865 | Drinking from the Bottle (feat. Tinie Tempah) | Calvin Harris | 18 Months | clickrow | clickrow | False | False |
| 2 | 487OPlneJNni3NWC8SYqhW | 2013-07-08 02:50:00 | web player | 285386 | Born To Die | Lana Del Rey | Born To Die - The Paradise Edition | clickrow | unknown | False | False |
| 3 | 5IyblF777jLZj1vGHG2UD3 | 2013-07-08 02:52:00 | web player | 134022 | Off To The Races | Lana Del Rey | Born To Die - The Paradise Edition | trackdone | clickrow | False | False |
| 4 | 0GgAAB0ZMllFhbNc3mAodO | 2013-07-08 03:17:00 | web player | 0 | Half Mast | Empire Of The Sun | Walking On A Dream | clickrow | nextbtn | False | False |


In [8]:
# concise summary
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149860 entries, 0 to 149859
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   spotify_track_uri  149860 non-null  object        
 1   ts                 149860 non-null  datetime64[ns]
 2   platform           149860 non-null  object        
 3   ms_played          149860 non-null  int64         
 4   track_name         149860 non-null  object        
 5   artist_name        149860 non-null  object        
 6   album_name         149860 non-null  object        
 7   reason_start       149717 non-null  object        
 8   reason_end         149743 non-null  object        
 9   shuffle            149860 non-null  bool          
 10  skipped            149860 non-null  bool          
dtypes: bool(2), datetime64[ns](1), int64(1), object(7)
memory usage: 10.6+ MB


In [9]:
# check for duplicate records
data.duplicated().sum()

1782
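
Before dropping the 1,782 duplicate rows counted above, it can help to look at them. A minimal sketch on a toy frame (the rows are illustrative, not taken from the full dataset): `duplicated(keep=False)` flags every member of a duplicate group, whereas the default flags only the repeats after the first occurrence.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only).
toy = pd.DataFrame({
    "track_name": ["Born To Die", "Born To Die", "Half Mast"],
    "ts": ["7/8/2013 2:50", "7/8/2013 2:50", "7/8/2013 3:17"],
    "ms_played": [285386, 285386, 0],
})

# keep=False marks all rows in each duplicate group, so they can be inspected together.
dupes = toy[toy.duplicated(keep=False)]

# The default (keep="first") counts only the extra copies that would be dropped.
n_extra = int(toy.duplicated().sum())

# Drop the extra copies and rebuild a contiguous index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(dupes), n_extra, len(deduped))
```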

In [10]:
# drop exact duplicate rows, keeping the first occurrence of each
data = data.drop_duplicates()

In [11]:
# confirm that no duplicate records remain
data.duplicated().sum()

0

In [12]:
# check for null/missing values
data.isnull().sum()

spotify_track_uri      0
ts                     0
platform               0
ms_played              0
track_name             0
artist_name            0
album_name             0
reason_start         143
reason_end           117
shuffle                0
skipped                0
dtype: int64

In [13]:
# fill the missing values with reason "unknown"
data['reason_start'] = data['reason_start'].fillna("unknown")
data['reason_end'] = data['reason_end'].fillna("unknown")
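
The two `fillna` calls above can also be written as a single call that takes a column-to-value mapping. A minimal sketch on a toy frame (illustrative rows; `fill_values` is a hypothetical name). The label "unknown" is chosen because it already appears as a `reason_end` value in the sample rows:

```python
import pandas as pd

# Toy frame with gaps in the two reason columns (illustrative values only).
toy = pd.DataFrame({
    "reason_start": ["clickrow", None, "trackdone"],
    "reason_end": [None, "nextbtn", "trackdone"],
    "ms_played": [3185, 61865, 0],
})

# Map each column to its replacement; columns not listed are left untouched.
fill_values = {"reason_start": "unknown", "reason_end": "unknown"}
filled = toy.fillna(value=fill_values)

print(filled)
```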

In [14]:
# confirm that no null/missing values remain
data.isnull().sum()

spotify_track_uri    0
ts                   0
platform             0
ms_played            0
track_name           0
artist_name          0
album_name           0
reason_start         0
reason_end           0
shuffle              0
skipped              0
dtype: int64
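
With the data cleaned, the Wrapped-style figures named in the challenge objective (total minutes listened, most-streamed artists) can be derived from `ms_played`. The following is only a sketch on a toy frame built from the sample rows, not the notebook's actual analysis; the `minutes_played` column and the top-5 cutoff are assumptions made for illustration:

```python
import pandas as pd

# Toy frame using values from the first few sample rows.
toy = pd.DataFrame({
    "artist_name": ["Lana Del Rey", "Lana Del Rey", "Calvin Harris", "The Mowgli's"],
    "track_name": [
        "Born To Die",
        "Off To The Races",
        "Drinking from the Bottle (feat. Tinie Tempah)",
        "Say It, Just Say It",
    ],
    "ms_played": [285386, 134022, 61865, 3185],
})

# Convert milliseconds to minutes (60,000 ms per minute).
toy["minutes_played"] = toy["ms_played"] / 60_000

# Total listening time across all streams.
total_minutes = toy["minutes_played"].sum()

# Rank artists by total minutes played and keep the top five.
top_artists = (
    toy.groupby("artist_name")["minutes_played"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

print(round(total_minutes, 2))
print(top_artists)
```

Ranking by minutes rather than by play count keeps very short, skipped streams from inflating an artist's position.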