### **Most Streamed Songs on Spotify 2024**

#### __1. Introduction__
- The objective of this project is to develop and deploy a web application to a cloud service using a dataset sourced from Kaggle.
- The web application will give the user interactive plots and graphics that present the findings of the exploratory data analysis (EDA).
- The dataset used for this analysis is "Most Streamed Spotify Songs 2024," created by Nidula Elgiriyewithana.
- Details on this dataset and download instructions can be found here:
    - https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024

#### __2. Approach__
- The dataset will be analyzed to understand the following:
    - Frequency Distribution of Release Years of Songs in Dataset
    - Relationship Between Spotify Streams and Spotify Playlist Reach
    - Relationship Between Spotify Streams and TikTok Posts
- The project will consist of the following stages:
    1. Data Preparation
    2. Data Analysis & Conclusions
    3. Streamlit Visualization Development
    4. Deployment of Streamlit Components to Render

#### __3. Initialization__
Import all relevant libraries and load the dataset.

In [None]:
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px

In [1]:
# Read in the dataset
spotify_data = pd.read_csv(
    r"C:\Users\sethc\github_projects\2024-Spotify-Top-Songs-Analysis\Most Streamed Spotify Songs 2024.csv", 
    encoding='ISO-8859-1'
)

# Display first few rows to check that the pull worked
# spotify_data.head()


#### __4. Data Preparation__
- Review the dataset for missing and duplicate information.
- Resolve any issues as needed.
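
A minimal sketch of that review, run on a small made-up frame rather than the real dataset (the `toy` frame and its values are assumptions for illustration only). `isna().sum()` counts missing values per column, and `duplicated()` flags rows that repeat an earlier row exactly.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Track": ["A", "B", "B", "C"],
        "Artist": ["x", "y", "y", None],
        "Spotify Streams": ["1,000", "2,500", "2,500", None],
    }
)

# Count missing values in each column
missing_per_column = toy.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

The same three calls could be pointed at `spotify_data` once it is loaded; whether dropping duplicates is appropriate depends on what the review finds.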

In [None]:
# Print the general/summary information about the dataset
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  TikTok Posts                3427 

- The critical columns (Track, Album Name, ISRC, All Time Rank, Track Score) do not appear to have any missing data.
    - Some Spotify streaming and TikTok engagement data is missing - these values will be replaced with 0.
- The column names have mixed case and contain whitespace - the names will be converted to lowercase and the whitespace will be replaced with underscores.
- The __`spotify_streams`__, __`spotify_playlist_reach`__, and __`tiktok_posts`__ columns are currently **object** datatypes - these will be converted to integers to allow numerical filtering.
- The focus of this analysis will be primarily on Spotify and TikTok engagement, so extraneous columns will be dropped.
- A __`release_year`__ column will be created to simplify the histogram creation process.
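
As an illustration of the type conversion described above, here is a minimal sketch on made-up values. The helper name `to_nullable_int` is hypothetical and is not part of the notebook. It uses `pd.to_numeric` with `errors="coerce"` so that blanks and unparseable strings become missing values before the cast to the nullable `Int64` dtype.

```python
import pandas as pd


def to_nullable_int(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn strings like '1,234' into nullable integers."""
    # Remove thousands separators and surrounding whitespace
    cleaned = column.str.replace(",", "", regex=False).str.strip()
    # errors="coerce" turns blanks and unparseable strings into missing values
    numeric = pd.to_numeric(cleaned, errors="coerce")
    # Int64 (capital I) is the pandas nullable integer dtype
    return numeric.astype("Int64")


# Made-up sample values, including a missing entry and a blank string
raw = pd.Series(["1,949,596,473", "218,621", None, " "])
converted = to_nullable_int(raw)

# Replace missing values with 0, as planned for the engagement columns
filled = converted.fillna(0)
print(filled)
```

Coercing first keeps a stray blank or malformed entry from raising during the cast; the trade-off is that bad values silently become missing, so it is worth checking how many were coerced.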

In [None]:
# Cleaning up the column names
spotify_data.columns = (
    spotify_data.columns.str.strip()  # remove leading/trailing spaces
    .str.lower()  # convert to lowercase
    .str.replace(r"[^\w\s]", "", regex=True)  # remove special characters
    .str.replace(r"\s+", "_", regex=True)  # replace whitespace with underscores
)

# Convert the comma-formatted count columns to nullable integers
spotify_data['spotify_streams'] = spotify_data['spotify_streams'].str.replace(',', '').astype('Int64')
spotify_data['spotify_playlist_reach'] = spotify_data['spotify_playlist_reach'].str.replace(',', '').astype('Int64')
spotify_data['tiktok_posts'] = spotify_data['tiktok_posts'].str.replace(',', '').astype('Int64')

# Replacing missing values with 0
spotify_data = spotify_data.replace(" ", pd.NA)
spotify_data = spotify_data.replace("", pd.NA)
spotify_data = spotify_data.fillna(0)

# print(spotify_data.columns)

# Remove extraneous columns
spotify_data = spotify_data.drop(
    columns=['spotify_playlist_count',
             'spotify_popularity',
             'tiktok_likes',
             'tiktok_views',
             'youtube_views',
             'youtube_likes',
             'youtube_playlist_reach',
             'apple_music_playlist_count',
             'airplay_spins',
             'siriusxm_spins',
             'deezer_playlist_count',
             'deezer_playlist_reach',
             'amazon_playlist_count',
             'pandora_streams',
             'pandora_track_stations',
             'soundcloud_streams',
             'shazam_counts',
             'tidal_popularity',
             'explicit_track']
)

# Create release_year column
spotify_data['release_date'] = pd.to_datetime(spotify_data['release_date'])
spotify_data['release_year'] = spotify_data['release_date'].dt.year

print(spotify_data.columns)

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_reach', 'tiktok_posts', 'release_year'],
      dtype='object')


#### __5. Data Analysis__

In [None]:
# Print the summary table below for reference
spotify_data.sample(10)

| | track | album_name | artist | release_date | isrc | all_time_rank | track_score | spotify_streams | spotify_playlist_reach | tiktok_posts | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3122 | Don't Go Insane | Don't Go Insane | DPR IAN | 2023-10-04 | QMBZ92331724 | 3100 | 24.8 | 76033359 | 8094289 | 217600 | 2023 |
| 1767 | I Took A Pill In Ibiza - Seeb Remix | I Took A Pill In Ibiza (Seeb Remix) | Mike Posner | 2015-07-24 | USUM71509342 | 1757 | 34.8 | 1949596473 | 75554079 | 86209 | 2015 |
| 3996 | Drift Night | Drift Night | Alfianie | 2022-07-30 | SGB502231047 | 3967 | 21.4 | 218621 | 18188 | 2101254 | 2022 |
| 3922 | shipping | Follow Your Nature | great area | 2022-12-15 | UKKTQ2200011 | 3902 | 21.6 | 130895 | 4467 | 0 | 2022 |
| 4533 | ARRANCARMELO | ARRANCARMELO | WOS | 2022-04-06 | UYB282206048 | 4504 | 19.6 | 223426639 | 8039699 | 83501 | 2022 |
| 2758 | Huntinï¿½ï¿½ï¿½ W | Might Delete Later | J. Cole | 2024-04-05 | USUG12402405 | 2747 | 26.8 | 36972292 | 4824184 | 0 | 2024 |
| 930 | Crawling | Papercuts | Linkin Park | 2024-04-12 | USWB11201319 | 927 | 50.1 | 570290910 | 525584 | 0 | 2024 |
| 3690 | Suano | Suano - Single | NTG | 2024-03-24 | TCAIA2470789 | 3656 | 22.4 | 15333864 | 35368637 | 121600 | 2024 |
| 1084 | ZEZE (feat. Travis Scott & Offset) | ZEZE (feat. Travis Scott & Offset) | Kodak Black | 2018-10-12 | USAT21811523 | 1080 | 46.0 | 1070685587 | 32147932 | 1386205 | 2018 |
| 421 | Never Lose Me | Never Lose Me - Single | Flo Milli | 2023-11-30 | USRC12303379 | 420 | 75.1 | 278132516 | 70573006 | 1200000 | 2023 |


##### __5.1 Histogram__

In [None]:
# Create plotly express histogram showing distribution of most streamed songs in 2024 by release year
fig = px.histogram(
    spotify_data,
    x='release_year',
    title='Distribution of 2024 Spotify Most Streamed Songs by Release Year',
    labels={'release_year': 'Release Year', 'count': 'Number of Songs'},
    opacity=0.8,
    nbins=35,
    template='plotly_dark'
)

fig.update_layout(
    yaxis_title='Number of Songs',
    bargap=0.1
)

fig.show()

**CONCLUSIONS**:
- Although the dataset covers the tracks with the most Spotify streams in 2024, the histogram above shows that the release years of those songs are spread over a wide range.
- The largest cluster of release years falls between 2020 and 2024, with the highest spike between 2022 and 2023.
- More interestingly, the remaining data, approximately one third of the songs, was released between 2010 and 2019, implying that some songs remain popular for quite some time.
- This is further supported by the outliers and smaller bars from 2005 and earlier, which include songs that are 25+ years old.
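
The shares quoted in these conclusions can be checked numerically instead of being read off the chart. The sketch below uses a made-up list of release years (not values from the dataset) and a hypothetical helper, `share_between`, to show the calculation. Applying it to the real `release_year` column is left as an assumption.

```python
import pandas as pd

# Made-up release years for illustration; not taken from the dataset.
years = pd.Series(
    [1998, 2012, 2015, 2019, 2020, 2021, 2022, 2022, 2023, 2024],
    name="release_year",
)


def share_between(series: pd.Series, start: int, end: int) -> float:
    """Hypothetical helper: fraction of values in the inclusive range [start, end]."""
    return float(series.between(start, end).mean())


share_2010s = share_between(years, 2010, 2019)
share_2020s = share_between(years, 2020, 2024)

# Songs that were at least 25 years old in 2024
share_25_plus = float((years <= 2024 - 25).mean())

print(f"2010-2019: {share_2010s:.0%}")
print(f"2020-2024: {share_2020s:.0%}")
print(f"25+ years old: {share_25_plus:.0%}")
```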

##### __5.2 Scatter Plots__

In [None]:
# Filter spotify_data to only include the songs with the top 10 track scores
top_10_songs = spotify_data.nlargest(10, 'track_score')

##### __5.2.1 Spotify Playlist Reach vs. Streams__

In [None]:
# Create scatter plot 1
scat1 = px.scatter(
    spotify_data,
    x='spotify_playlist_reach',
    y='spotify_streams',
    labels={'spotify_playlist_reach': 'Spotify Playlist Reach', 'spotify_streams': 'Spotify Streams'},
    title='Spotify Playlist Reach vs. Spotify Streams',
    template='plotly_dark',
    hover_name='track',
    hover_data={'artist': True, 'track_score': True},
)
scat1.update_traces(marker=dict(size=12, opacity=0.8))
scat1.show()

**CONCLUSIONS**:
- The scatter plot above shows a slight positive relationship between a song's Spotify streams and its Spotify playlist reach.
- A fair number of songs sit close to the y-axis of the plot - these have little playlist reach, possibly because they were only very recently released or are enjoying a short-term viral surge of attention.
- Conversely, some songs sit at the other end of the spectrum: their streams are not quite as high relative to their peers, but they are among the most well-represented in playlists, as is the case with "Espresso" by Sabrina Carpenter.
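
A "slight positive relationship" can be quantified with a correlation coefficient. The sketch below uses made-up numbers (the `toy` frame is an assumption, not data from the notebook) and computes both the Pearson correlation and the Spearman rank correlation. Spearman is obtained here as the Pearson correlation of the ranks, which is less sensitive to the extreme values that streaming counts tend to have.

```python
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
toy = pd.DataFrame(
    {
        "spotify_playlist_reach": [10, 20, 30, 40, 50],
        "spotify_streams": [100, 400, 300, 900, 1600],
    }
)

reach = toy["spotify_playlist_reach"]
streams = toy["spotify_streams"]

# Pearson correlation measures the linear relationship
pearson = reach.corr(streams)

# Spearman correlation is the Pearson correlation of the ranks
spearman = reach.rank().corr(streams.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A value near 0 would indicate little monotonic relationship and a value near 1 a strong positive one; where the real data falls between those is something the notebook does not report.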

##### __5.2.2 TikTok Posts vs. Spotify Streams__

In [None]:
# Create scatter plot 2
scat2 = px.scatter(
    spotify_data,
    x='tiktok_posts',
    y='spotify_streams',
    labels={'tiktok_posts': 'TikTok Posts', 'spotify_streams': 'Spotify Streams'},
    title='TikTok Posts vs. Streams',
    template='plotly_dark',
    hover_name='track',
    hover_data={'artist': True, 'track_score': True},
)
scat2.update_traces(marker=dict(size=12, opacity=0.8))
scat2.show()

**CONCLUSIONS**:
- There does not appear to be a very strong relationship between Spotify stream count and the number of TikTok posts that use a song.
- There is a large cluster between 0 and approximately 5M TikTok posts where Spotify stream counts range from 0 to over 4B.
- One would expect songs with 30M to 40M TikTok posts to show far greater Spotify streaming activity, but this does not appear to be the case.
- This may be due to user preferences for particular music platforms, generational differences in who uses newer apps such as TikTok, and other hidden factors that influence whether a listener who hears a song in a TikTok post goes on to open the Spotify app and repeatedly stream and share it.
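
One way to probe this observation further is to group songs into bands of TikTok post counts and compare the median streams per band. The sketch below does this on made-up numbers. The band edges, labels, and the `post_band` column are illustrative assumptions, not values taken from the analysis.

```python
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
toy = pd.DataFrame(
    {
        "tiktok_posts": [0, 1_000, 50_000, 200_000, 1_000_000,
                         5_000_000, 30_000_000, 40_000_000],
        "spotify_streams": [500_000_000, 200_000_000, 900_000_000, 300_000_000,
                            1_200_000_000, 400_000_000, 600_000_000, 300_000_000],
    }
)

# Illustrative band edges; each band includes its right edge
bins = [-1, 100_000, 5_000_000, float("inf")]
labels = ["<=100K", "100K-5M", ">5M"]
toy["post_band"] = pd.cut(toy["tiktok_posts"], bins=bins, labels=labels)

# Median streams per band; the median is robust to a few very large hits
median_streams = toy.groupby("post_band", observed=True)["spotify_streams"].median()

print(median_streams)
```

If the medians stay roughly flat across bands on the real data, that would support the conclusion that heavy TikTok use does not translate into proportionally more streams.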