## Data Analysis and Visualization of Most Streamed Spotify Songs - 2023

#### **About Dataset**

- Dataset Title: Most Streamed Spotify Songs 2023
- Data Source: Kaggle
- Data Format: CSV
- Data Dimensions: 953 rows × 24 columns

| Field Name | Description |
| --- | --- |
| track_name | Name of the song |
| artist(s)_name | Name of the artist(s) of the song |
| artist_count | Number of artists contributing to the song |
| released_year | Year when the song was released |
| released_month | Month when the song was released |
| released_day | Day of the month when the song was released |
| in_spotify_playlists | Number of Spotify playlists the song is included in |
| in_spotify_charts | Presence and rank of the song on Spotify charts |
| streams | Total number of streams on Spotify |
| in_apple_playlists | Number of Apple Music playlists the song is included in |
| in_apple_charts | Presence and rank of the song on Apple Music charts |
| in_deezer_playlists | Number of Deezer playlists the song is included in |
| in_deezer_charts | Presence and rank of the song on Deezer charts |
| in_shazam_charts | Presence and rank of the song on Shazam charts |
| bpm | Beats per minute, a measure of song tempo |
| key | Key of the song |
| mode | Mode of the song (major or minor) |
| danceability_% | Percentage indicating how suitable the song is for dancing |
| valence_% | Positivity of the song's musical content |
| energy_% | Perceived energy level of the song |
| acousticness_% | Amount of acoustic sound in the song |
| instrumentalness_% | Amount of instrumental content in the song |
| liveness_% | Presence of live performance elements |
| speechiness_% | Amount of spoken words in the song |



In [91]:
#importing required packages
import altair as alt
import pandas as pd
import numpy as np  


In [92]:
#reading the csv file
df = pd.read_csv("spotify-2023.csv",encoding='iso-8859-1')

#### **A sneak peek of the Most Streamed Spotify Songs data set**

In [93]:
df.head()

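| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | ... | 125 | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | ... | 92 | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | ... | 138 | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | ... | 170 | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | ... | 144 | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 |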


#### **General info about the data columns**

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

#### **Goals and Tasks**

By applying visualization to this data, we can achieve several goals:

1.	**<p style="color:yellow;">Trend Identification:</p>** <p style="font-size:15px;">Visualizing data over time can help identify trends in song popularity, artist prominence, and genre preference.</p>
2.	**<p style="color:yellow;">Correlation Discovery:</p>** <p style="font-size:15px;">Scatter plots or heatmaps can help identify correlations between song features and their popularity.</p>
3. **<p style="color:yellow;">Comparative Analysis:</p>** <p style="font-size:15px;">Bar charts or pie charts can help compare the success of different artists, songs, or genres.</p>
4.	**<p style="color:yellow;">Geographical Patterns:</p>** <p style="font-size:15px;">Geographical heatmaps can show regional preferences for songs or artists.</p>


##### **I will be considering two goals among the above four:**

**<p style="color:yellow;">Task 1: Analyzing Song Popularity</p>**
-	Goal: Understand the popularity of songs to predict future trends.
-	Means: Analyze the number of streams, artist popularity, and song features.
-	Characteristics: Identify patterns and correlations in the data.
-	Target Data: Song name, artist, number of streams, song features (like danceability, energy, etc.).

**<p style="color:yellow;">Task 2: Feature Interaction Identification (Key/Mode based)</p>**
-	Goal: Understand the popularity of songs using key/mode preferences to tailor music recommendations.
-	Means: Analyze the interaction between song features based on the popularity of songs for different keys/modes.
-	Characteristics: Identify key (such as D, D#, E, F, etc.) preferences and trends.
-	Target Data: Song name, artist, number of streams, key, bpm, danceability, etc.

Workflow: These tasks can be performed on a weekly or monthly basis to understand trends.<br>
Roles: These tasks are typically executed by data analysts.


**<p style="color:yellow;">The image below shows the low-fidelity prototype for analyzing the top 5 songs based on streams</p>**<br>
(I have used Tableau to create the chart quickly)


![Alt text](image.png)

**<p style="color:yellow;">The image below shows the low-fidelity prototype for key/mode preferences and feature interactions</p>**<br>
(I have used Tableau to create the chart quickly)

![Alt text](image-2.png)

----------
#### Data Preprocessing

##### Columns present in the dataset:

In [95]:
#Analysing song popularity to understand the correlation between popularity (no. of streams) and other features to predict future trends

df.columns 

Index(['track_name', 'artist(s)_name', 'artist_count', 'released_year',
       'released_month', 'released_day', 'in_spotify_playlists',
       'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts',
       'in_deezer_playlists', 'in_deezer_charts', 'in_shazam_charts', 'bpm',
       'key', 'mode', 'danceability_%', 'valence_%', 'energy_%',
       'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%'],
      dtype='object')

##### The 'streams' column in the row below contains junk values, so this row can be dropped.

In [96]:
df[df['streams'].str.startswith("BPM110KeyAModeMajorDan")]

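| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 574 | Love Grows (Where My Rosemary Goes) | Edison Lighthouse | 1 | 1970 | 1 | 1 | 2877 | 0 | BPM110KeyAModeMajorDanceability53Valence75Ener... | 16 | ... | 110 | A | Major | 53 | 75 | 69 | 7 | 0 | 17 | 3 |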


In [97]:
df.drop(index=574,inplace=True)
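
The junk row above was located by its known prefix. A more general check, sketched here on a small made-up frame, is to coerce `streams` to numbers with `pd.to_numeric(errors="coerce")` and inspect whichever rows fail to parse. The toy rows and the names `toy`, `bad_rows` and `clean` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real file (assumption); the second value
# mimics the junk entry found in the 'streams' column.
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "streams": ["141381703", "BPM110KeyAModeMajorDan", "800840817"],
})

# Values that cannot be parsed as numbers become NaN instead of raising.
parsed = pd.to_numeric(toy["streams"], errors="coerce")

# Rows whose 'streams' value is not numeric.
bad_rows = toy[parsed.isna()]
print(bad_rows)

# Keep only the parsable rows and store the column as integers.
clean = toy[parsed.notna()].copy()
clean["streams"] = parsed[parsed.notna()].astype("int64")
print(clean.dtypes)
```

This approach would also catch junk values that do not share the prefix used in the cell above.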

In [98]:
#creating a dataframe with only required columns
df_song =df[['track_name', 'artist(s)_name','streams','danceability_%', 'valence_%', 'energy_%',
       'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%','key','mode']].copy()

##### Top 20 songs by total number of streams:

In [100]:
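# Assign the converted column directly so the dtype really becomes int64
# (a .loc[:, ...] assignment can keep the original object dtype in recent pandas versions)
df_song['streams'] = df_song['streams'].astype('int64')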
df_top20songs = df_song.nlargest(20, 'streams')
df_top20songs

| | track_name | artist(s)_name | streams | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% | key | mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 55 | Blinding Lights | The Weeknd | 3703895074 | 50 | 38 | 80 | 0 | 0 | 9 | 7 | C# | Major |
| 179 | Shape of You | Ed Sheeran | 3562543890 | 83 | 93 | 65 | 58 | 0 | 9 | 8 | C# | Minor |
| 86 | Someone You Loved | Lewis Capaldi | 2887241814 | 50 | 45 | 41 | 75 | 0 | 11 | 3 | C# | Major |
| 620 | Dance Monkey | Tones and I | 2864791672 | 82 | 54 | 59 | 69 | 0 | 18 | 10 | F# | Minor |
| 41 | Sunflower - Spider-Man: Into the Spider-Verse | Post Malone, Swae Lee | 2808096550 | 76 | 91 | 50 | 54 | 0 | 7 | 5 | D | Major |
| 162 | One Dance | Drake, WizKid, Kyla | 2713922350 | 77 | 36 | 63 | 1 | 0 | 36 | 5 | C# | Major |
| 84 | STAY (with Justin Bieber) | Justin Bieber, The Kid Laroi | 2665343922 | 59 | 48 | 76 | 4 | 0 | 10 | 5 | C# | Major |
| 140 | Believer | Imagine Dragons | 2594040133 | 77 | 74 | 78 | 4 | 0 | 23 | 11 | A# | Minor |
| 725 | Closer | The Chainsmokers, Halsey | 2591224264 | 75 | 64 | 52 | 41 | 0 | 11 | 3 | G# | Major |
| 48 | Starboy | The Weeknd, Daft Punk | 2565529693 | 68 | 49 | 59 | 16 | 0 | 13 | 28 | G | Major |


------
### **Interactive Visualizations using Altair**

##### <p style="color:lightblue;">Analysis of song popularity (in terms of number of streams) based on audio features of the song.</p>

<p style="color:yellow;">A summary of key elements of the design and justification:</p>
The visualization is a scatter plot showing the correlation between 'streams' and the other dimensions (audio features of the song), namely ['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']. A dropdown menu changes the x-axis dimension. The scatter plot for the top 20 songs (sorted by number of streams) lets us see whether there is a linear relationship between streams and each of these features, which could be used to predict future song trends.

In [101]:
# Audio features that can be placed on the x-axis
dimensions = ['danceability_%', 'valence_%', 'energy_%',
              'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']

# Dropdown bound to a parameter that holds the currently selected feature name
dropdown = alt.binding_select(options=dimensions, name="Select x-axis dimension")
x_axis_param = alt.param(value=dimensions[0], bind=dropdown)

# Scatter plot of streams against the selected feature
scatter = alt.Chart(df_top20songs).mark_circle(size=60, color="darkblue").encode(
    x=alt.X('x:Q').title(" "),
    y='streams:Q',
    tooltip=['streams', 'artist(s)_name', 'track_name'],
    opacity=alt.value(0.6)
).transform_calculate(
    # For each row, look up the column named by the dropdown selection
    x=f'datum[{x_axis_param.name}]'
).add_params(
    x_axis_param
).properties(width=600, height=300)

scatter.display()


##### **Inference/Findings:**
The number of streams and the audio features of the songs (['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']) don't seem to have a linear relationship for the top 20 songs. However, most of these songs have danceability_% and energy_% values toward the upper end of the x-axis scale, which suggests that higher energy and danceability go along with greater popularity.
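
The visual impression of "no linear relationship" can be cross-checked numerically. The sketch below computes the Pearson correlation between `streams` and a few features, using only the ten rows displayed in the top-songs table above rather than the full `df_top20songs` frame; the frame name `top` and the choice of three features are assumptions made for illustration.

```python
import pandas as pd

# The ten rows displayed in the top-songs table (a subset, not the full top 20).
top = pd.DataFrame({
    "streams": [3703895074, 3562543890, 2887241814, 2864791672, 2808096550,
                2713922350, 2665343922, 2594040133, 2591224264, 2565529693],
    "danceability_%": [50, 83, 50, 82, 76, 77, 59, 77, 75, 68],
    "energy_%": [80, 65, 41, 59, 50, 63, 76, 78, 52, 59],
    "speechiness_%": [7, 8, 3, 10, 5, 5, 5, 11, 3, 28],
})

# Pearson correlation of every feature with 'streams'; values near 0 mean
# little linear association, values near +1 or -1 a strong one.
corr = top.corr(method="pearson")["streams"].drop("streams")
print(corr.round(2))
```

With so few points, the coefficients are only a rough indication and are best read alongside the scatter plot.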


-------
##### **<p style="color:lightblue;">Analysis of key/mode preferences and feature interactions.</p>**

<p style="color:yellow;">A summary of key elements of the design and justification:</p>
The visualization is a scatter plot (for the top 20 songs) of 'streams' against the other dimensions (audio features of the song), namely ['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']. Each mark on the plot is colored by the key in which the song is composed. The aim is to get a general notion of how song popularity relates to key preference. A dropdown menu changes the x-axis dimension.


In [102]:
# Audio features that can be placed on the x-axis
dimensions = ['danceability_%', 'valence_%', 'energy_%',
              'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']

# Dropdown bound to a parameter that holds the currently selected feature name
dropdown = alt.binding_select(options=dimensions, name="Select x-axis dimension")
x_axis_param = alt.param(value=dimensions[0], bind=dropdown)

# Scatter plot of streams against the selected feature, colored by musical key
# (the color encoding takes precedence, so no fixed mark color is set)
scatter = alt.Chart(df_top20songs).mark_circle(size=60).encode(
    x=alt.X('x:Q').title(" "),
    y='streams:Q',
    color=alt.Color('key:N', scale=alt.Scale(scheme='category20')),
    tooltip=['streams', 'artist(s)_name', 'track_name', 'mode'],
    opacity=alt.value(0.6)
).transform_calculate(
    # For each row, look up the column named by the dropdown selection
    x=f'datum[{x_axis_param.name}]'
).add_params(
    x_axis_param
).properties(width=600, height=300).interactive()

scatter.display()


##### **Inference/Findings:**
Irrespective of the x-axis dimension, the songs composed in C# and D have the highest number of streams. This suggests that choosing C# or D as the key may partly increase a song's likelihood of becoming popular.
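
The key preference read off the chart can also be tabulated. As a minimal sketch, the snippet below groups by `key` and reports the song count and total streams per key, using only the ten rows displayed in the top-songs table earlier; the names `top` and `by_key` are assumptions for illustration.

```python
import pandas as pd

# Key and stream count for the ten rows displayed in the top-songs table.
top = pd.DataFrame({
    "key": ["C#", "C#", "C#", "F#", "D", "C#", "C#", "A#", "G#", "G"],
    "streams": [3703895074, 3562543890, 2887241814, 2864791672, 2808096550,
                2713922350, 2665343922, 2594040133, 2591224264, 2565529693],
})

# Number of songs and total streams per key, largest total first.
by_key = (
    top.groupby("key")["streams"]
       .agg(songs="count", total_streams="sum")
       .sort_values("total_streams", ascending=False)
)
print(by_key)
```

Reporting the count next to the total helps separate "this key has many songs in the sample" from "songs in this key have more streams each".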


--------
#### **Preliminary Analysis**

**Target Question:** How effective is the visualization in helping users understand song popularity trends and key preferences in relation to song popularity?

**Recruitment:** The evaluation involved recruiting a diverse group of participants, including data analysts and general users who are interested in music trends.

**Measures:** The measures used included:
- <sub> **Insight Depth:** This was measured through participant interviews and questionnaires. The insights participants gained from the visualization were judged to be of good depth and quality: for example, that it could help artists, music producers and performers produce songs with a higher percentage of danceability and energy, and a lower percentage of spoken words and live performance elements.</sub>
- <sub> **Use Cases:** Identified real-world use cases (such as important elements to consider during song composition) where the visualization could be applied. This was done through brainstorming sessions with participants.</sub>
- <sub> **Accuracy:** Evaluated how accurately the visualization represents the underlying data. This was measured by comparing participant interpretations with the actual data. All participants gave a consistent opinion about the insights they gained from the visualization. </sub>

**Approach:** A mixed-methods approach was used, combining quantitative methods (like accuracy measurement) with qualitative methods (like interviews and brainstorming sessions).

**Methods:** Participants were asked to interact with the visualization and answer a series of questions related to song popularity analysis, key/mode preferences and feature interactions. They were also asked to brainstorm potential use cases for the visualization.

**Success Criteria:** The visualization can be considered successful as it allowed users to accurately understand song popularity trends, key/mode preferences and feature interactions, provided deep insights, and has clear real-world use cases.

#### **Elements to consider in future iterations**

- Future iterations should consider the dimensions 'in_spotify_playlists', 'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts', 'in_deezer_playlists', 'in_deezer_charts' and 'in_shazam_charts' to uncover additional insights that can enhance the reach of songs on different platforms such as Apple Music, Deezer and Shazam.<br>

- Dimensions like 'released_year', 'released_month' and 'released_day' can be used to perform time series analysis and forecasting.
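
As a possible first step toward the time series idea, the three release columns can be combined into a single date column. The sketch below does this for the five rows shown by `df.head()` earlier; the names `releases`, `parts` and `per_year` are assumptions, and counting releases per year is just one simple aggregation chosen for illustration.

```python
import pandas as pd

# Release fields of the five rows shown by df.head() earlier in the notebook.
releases = pd.DataFrame({
    "track_name": ["Seven (feat. Latto) (Explicit Ver.)", "LALA", "vampire",
                   "Cruel Summer", "WHERE SHE GOES"],
    "released_year": [2023, 2023, 2023, 2019, 2023],
    "released_month": [7, 3, 6, 8, 5],
    "released_day": [14, 23, 30, 23, 18],
})

# pd.to_datetime can assemble dates from columns named year, month and day.
parts = releases[["released_year", "released_month", "released_day"]].rename(
    columns={"released_year": "year", "released_month": "month", "released_day": "day"}
)
releases["release_date"] = pd.to_datetime(parts)

# Count releases per calendar year as a starting point for a time series.
per_year = releases.groupby(releases["release_date"].dt.year).size()
print(per_year)
```

On the full dataset the same `release_date` column could serve as the time axis of an Altair chart or as the index for resampling.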




-------