### **Most Streamed Songs on Spotify 2024**

#### __1. Introduction__
- The objective of this project is to develop and deploy a web application to a cloud service using a dataset sourced from Kaggle.
- The web application will provide the user with interactive plots and graphics to share insights gleaned from the preliminary EDA.
- The dataset used for this analysis is "Most Streamed Spotify Songs 2024," created by Nidula Elgiriyewithana.
- Details on this dataset and download instructions can be found here:
    - https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024

#### __2. Approach__
- The dataset will be analyzed to understand the following:
    - Frequency Distribution of Release Years of Songs in the Dataset
    - Relationship Between Spotify Streams and Spotify Playlist Reach for Top 10 Tracks by Track Score
    - Relationship Between Spotify Streams and TikTok Posts for Top 10 Tracks by Track Score
- The project will consist of the following stages:
    1. Data Preparation
    2. Data Analysis
    3. Streamlit Visualization Development
    4. Deployment of Streamlit Components to Render
    5. Conclusion
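
The three questions above reduce to two pandas operations: counting songs per release year and selecting the ten highest-scoring tracks. The following is a minimal sketch on a hypothetical four-row frame (the `demo` data and its values are invented for illustration, and the snake_case column names assume the renaming performed later in Data Preparation). It is not the notebook's actual analysis code.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are illustrative only.
demo = pd.DataFrame(
    {
        "track": ["A", "B", "C", "D"],
        "track_score": [10.0, 50.0, 30.0, 20.0],
        "release_year": [2020, 2023, 2023, 2021],
        "spotify_streams": [100, 400, 300, 200],
        "tiktok_posts": [5, 40, 0, 10],
    }
)

# Frequency distribution of release years (the input to a histogram).
year_counts = demo["release_year"].value_counts().sort_index()

# Top-N tracks by track score; the project uses N = 10, the demo uses N = 2.
top_tracks = demo.nlargest(2, "track_score")

print(year_counts)
print(top_tracks[["track", "spotify_streams", "tiktok_posts"]])
```

The `top_tracks` frame is what a scatter plot of streams against playlist reach or TikTok posts would be drawn from.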

#### __3. Initialization__
Importing all relevant libraries and loading in the dataset.

In [143]:
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px

In [144]:
# Read in the dataset
spotify_data = pd.read_csv(
    r"C:\Users\sethc\github_projects\2024-Spotify-Top-Songs-Analysis\Most Streamed Spotify Songs 2024.csv", 
    encoding='ISO-8859-1'
)

# Display first few rows to check that the pull worked
# spotify_data.head()

#### __4. Data Preparation__
- Review dataset for missing and duplicate information.
- Resolve any issues as needed.
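
A quick way to quantify both problems is to count missing values per column and fully duplicated rows. The sketch below uses a hypothetical three-row frame (`demo` and its values are invented for illustration). Whether duplicates should be dropped in the real dataset is a judgment call that this example does not settle.

```python
import pandas as pd

# Hypothetical sample with one missing value and one exact duplicate row.
demo = pd.DataFrame(
    {
        "track": ["Song A", "Song B", "Song A"],
        "isrc": ["X1", "X2", "X1"],
        "spotify_streams": ["1,000", None, "1,000"],
    }
)

# Number of missing entries in each column.
missing_per_column = demo.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(demo.duplicated().sum())

# One possible resolution: keep the first occurrence of each row.
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```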

In [145]:
# Print the general/summary information about the dataset
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  TikTok Posts                3427 

- There does not appear to be any missing data from critical columns (Track, Album Name, ISRC, All Time Rank, Track Score).
    - There does appear to be some missing Spotify streaming and TikTok engagement data - this will be replaced with 0. 
- The column names have mixed case and contain whitespace - the names will be converted to lowercase and the whitespace will be replaced with underscores.
- The __`spotify_streams`__, __`spotify_playlist_reach`__, and __`tiktok_posts`__ columns are currently **object** datatypes - these will be converted to integers to allow numerical filtering.
- The focus of this analysis will be primarily on Spotify and TikTok engagement, so extraneous columns shall be dropped.
- A __`release_year`__ column shall be created to simplify the histogram creation process.

In [146]:
# Cleaning up the column names
spotify_data.columns = (
    spotify_data.columns.str.strip()  # Remove leading/trailing spaces
    .str.lower()  # Convert to lowercase
    .str.replace(r"[^\w\s]", "", regex=True)  # Remove special characters
    .str.replace(r"\s+", "_", regex=True)  # Convert to snake case
)

# Strip thousands separators and convert the stream, playlist reach, and TikTok post counts to nullable integers
spotify_data['spotify_streams'] = spotify_data['spotify_streams'].str.replace(',', '').astype('Int64')
spotify_data['spotify_playlist_reach'] = spotify_data['spotify_playlist_reach'].str.replace(',', '').astype('Int64')
spotify_data['tiktok_posts'] = spotify_data['tiktok_posts'].str.replace(',', '').astype('Int64')

# Replacing missing values with 0
spotify_data = spotify_data.replace(" ", pd.NA)
spotify_data = spotify_data.replace("", pd.NA)
spotify_data = spotify_data.fillna(0)

# print(spotify_data.columns)

# Remove extraneous columns
spotify_data = spotify_data.drop(
    columns=['spotify_playlist_count',
             'spotify_popularity',
             'tiktok_likes',
             'tiktok_views',
             'youtube_views',
             'youtube_likes',
             'youtube_playlist_reach',
             'apple_music_playlist_count',
             'airplay_spins',
             'siriusxm_spins',
             'deezer_playlist_count',
             'deezer_playlist_reach',
             'amazon_playlist_count',
             'pandora_streams',
             'pandora_track_stations',
             'soundcloud_streams',
             'shazam_counts',
             'tidal_popularity',
             'explicit_track']
)

# Create release_year column
spotify_data['release_date'] = pd.to_datetime(spotify_data['release_date'])
spotify_data['release_year'] = spotify_data['release_date'].dt.year

print(spotify_data.columns)

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_reach', 'tiktok_posts', 'release_year'],
      dtype='object')
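
The comma-stripping conversion used above can also be routed through `pd.to_numeric`, which turns any unparseable entry into a missing value instead of raising. This is a hedged alternative sketch rather than the notebook's method. The helper name `to_nullable_int` and the sample values are hypothetical.

```python
import pandas as pd


def to_nullable_int(column: pd.Series) -> pd.Series:
    """Convert strings such as '1,234' to pandas' nullable Int64 dtype."""
    stripped = column.str.replace(",", "", regex=False)
    # errors="coerce" maps anything that is not a number to NaN.
    numeric = pd.to_numeric(stripped, errors="coerce")
    return numeric.astype("Int64")


# Hypothetical sample values, including one missing entry.
raw = pd.Series(["1,234", None, "56"])
converted = to_nullable_int(raw)

# Same policy as the notebook: treat missing counts as 0.
filled = converted.fillna(0)
print(filled)
```

Because the result keeps the `Int64` dtype, numeric filters such as `filled > 100` work directly.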


#### __5. Data Analysis__

In [147]:
# Print the summary table below for reference
spotify_data.sample(10)

| | track | album_name | artist | release_date | isrc | all_time_rank | track_score | spotify_streams | spotify_playlist_reach | tiktok_posts | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2429 | Poker Face ( Remix ) | Poker Face ( Remix ) | Bora Media | 2023-10-17 | ES64E2323174 | 2410 | 29.0 | 1271094976 | 5 | 0 | 2023 |
| 3036 | UNA BALA | UNA BALA | Milo j | 2023-11-29 | UYB282384014 | 3026 | 25.2 | 94097213 | 4283676 | 0 | 2023 |
| 3480 | Bom Diggy | Bom Diggy - Single | Zack Knight | 2017-08-24 | QM6P41715585 | 3464 | 23.2 | 55362558 | 2598056 | 75620 | 2017 |
| 1639 | Pacas De Billetes | Pacas De Billetes | Natanael Cano | 2023-05-01 | QZ9QQ2300315 | 1634 | 36.2 | 199378491 | 7148639 | 15900 | 2023 |
| 59 | Unholy (feat. Kim Petras) | Unholy (feat. Kim Petras) | Sam Smith | 2022-09-22 | GBUM72205415 | 60 | 189.1 | 1556275789 | 95974138 | 2379787 | 2022 |
| 1006 | Lean Wit Me | Lean Wit Me | Juice WRLD | 2018-05-22 | USUG11800945 | 1001 | 47.9 | 880781313 | 29154307 | 28508 | 2018 |
| 3066 | Drankin N Smokin | Pluto x Baby Pluto | Future | 2020-11-13 | USAT22007323 | 3058 | 25.1 | 300694036 | 17871947 | 37293 | 2020 |
| 1827 | Mahiye Jinna Sohna | Mahiye Jinna Sohna | Darshan Raval | 2023-06-22 | INW262317735 | 1823 | 34.1 | 208410725 | 8924766 | 496900 | 2023 |
| 3564 | Feel Good Inc. | Demon Days | Gorillaz | 2005-05-23 | GBAYE0500172 | 3548 | 22.8 | 1407382497 | 72604627 | 337277 | 2005 |
| 2259 | ZITTI E BUONI | ZITTI E BUONI | Mï¿½ï¿½ne | 2021-03-03 | ITB002100112 | 2242 | 30.2 | 475549122 | 9975524 | 139573 | 2021 |


##### __5.1 Data Analysis__