# 1. Select Dataset

###### Selected the dataset "Most Streamed Spotify Songs 2024", which is available for download at the link below:
###### https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024
###### Ensured the dataset is sufficiently large and has a variety of features (columns) to analyze.
###### It has 4,600 rows and 29 columns in total.

# 2. Project Setup

###### Created a directory "Spotify-EDA" that contains all the required files.
###### Used a Jupyter Notebook for the analysis and named it "EDA_Spotify_Group3.ipynb" (this notebook).
###### Also added a ReadME.md file to the "Spotify-EDA" directory.

# 3. Data Import and Cleaning

#### i. Import the necessary libraries:

In [4]:
# Importing necessary libraries:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations and array handling
import matplotlib.pyplot as plt  # For creating static, animated, and interactive visualizations
import seaborn as sns  # For statistical data visualization based on matplotlib
import plotly.offline  # For interactive plots; the offline submodule is imported explicitly
# connected=True loads plotly.js from the Plotly CDN instead of embedding it in the notebook
plotly.offline.init_notebook_mode(connected=True)

#### ii. Load the dataset into the pandas data frame

In [6]:
# Import the os module for interacting with the operating system
import os
# Get the current working directory and store it in a variable
current_folder_path = os.getcwd()
print("Current folder path:", current_folder_path)

Current folder path: C:\Users\manju\Documents\Spotify-EDA


In [10]:
# Define the name of the file to be used
file_name = "Most Streamed Spotify Songs 2024.csv"
# Create the full file path by joining the current folder path with the file name
full_file_path = os.path.join(current_folder_path, file_name)
print("Full file path:", full_file_path)

Full file path: C:\Users\manju\Documents\Spotify-EDA\Most Streamed Spotify Songs 2024.csv


In [12]:
# Read the CSV file from the specified path into a pandas DataFrame
spotify_data = pd.read_csv(full_file_path, encoding='ISO-8859-1')
print("Spotify data frame information:")
spotify_data.info()

Spotify data frame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  T
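
The load step above passes `encoding='ISO-8859-1'` directly. As a minimal sketch of how that choice could be checked rather than assumed, the helper below tries UTF-8 first and falls back to ISO-8859-1 only if decoding fails. The function name `read_csv_with_fallback` and the toy file are hypothetical and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error  # Remember the failure and try the next encoding
    raise last_error


# Toy file (not the Kaggle data): one artist name written in ISO-8859-1
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    with open(toy_path, "w", encoding="ISO-8859-1", newline="") as handle:
        handle.write("Track,Artist\nSong A,Beyoncé\n")
    toy_frame, used_encoding = read_csv_with_fallback(toy_path)

print("Encoding used:", used_encoding)
print(toy_frame)
```

Reporting which encoding succeeded makes it easier to spot accented names that were decoded incorrectly.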

#### iii. Perform initial data inspection:

In [14]:
# shape of the data
data_shape = spotify_data.shape
print("Spotify data shape is:")
data_shape

Spotify data shape is:


(4600, 29)

In [16]:
# Check the data types
data_types = spotify_data.dtypes
print("Spotify data_types:")
data_types

Spotify data_types:


Track                          object
Album Name                     object
Artist                         object
Release Date                   object
ISRC                           object
All Time Rank                  object
Track Score                   float64
Spotify Streams                object
Spotify Playlist Count         object
Spotify Playlist Reach         object
Spotify Popularity            float64
YouTube Views                  object
YouTube Likes                  object
TikTok Posts                   object
TikTok Likes                   object
TikTok Views                   object
YouTube Playlist Reach         object
Apple Music Playlist Count    float64
AirPlay Spins                  object
SiriusXM Spins                 object
Deezer Playlist Count         float64
Deezer Playlist Reach          object
Amazon Playlist Count         float64
Pandora Streams                object
Pandora Track Stations         object
Soundcloud Streams             object
Shazam Count

In [28]:
# Check the summary statistics of numeric columns
numeric_stat = spotify_data.describe()
print("Numeric columns' statistics:\n")
numeric_stat

Numeric columns' statistics:



| | Track Score | Spotify Popularity | Apple Music Playlist Count | Deezer Playlist Count | Amazon Playlist Count | TIDAL Popularity | Explicit Track |
|---|---|---|---|---|---|---|---|
| count | 4600.0 | 3796.0 | 4039.0 | 3679.0 | 3545.0 | 0.0 | 4600.0 |
| mean | 41.844043 | 63.501581 | 54.60312 | 32.310954 | 25.348942 | NaN | 0.358913 |
| std | 38.543766 | 16.186438 | 71.61227 | 54.274538 | 25.989826 | NaN | 0.479734 |
| min | 19.4 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | 0.0 |
| 25% | 23.3 | 61.0 | 10.0 | 5.0 | 8.0 | NaN | 0.0 |
| 50% | 29.9 | 67.0 | 28.0 | 15.0 | 17.0 | NaN | 0.0 |
| 75% | 44.425 | 73.0 | 70.0 | 37.0 | 34.0 | NaN | 1.0 |
| max | 725.4 | 96.0 | 859.0 | 632.0 | 210.0 | NaN | 1.0 |


In [34]:
# Check the summary statistics of object columns
object_stat = spotify_data.describe(include=['object'])
print("Object columns Statistics:")
object_stat

Object columns Statistics:


| | Track | Album Name | Artist | Release Date | ISRC | All Time Rank | Spotify Streams | Spotify Playlist Count | Spotify Playlist Reach | YouTube Views | ... | TikTok Likes | TikTok Views | YouTube Playlist Reach | AirPlay Spins | SiriusXM Spins | Deezer Playlist Reach | Pandora Streams | Pandora Track Stations | Soundcloud Streams | Shazam Counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4600 | 4600 | 4595 | 4600 | 4600 | 4600 | 4487 | 4530 | 4528 | 4292 | ... | 3620 | 3619 | 3591 | 4102 | 2477 | 3672 | 3494 | 3332 | 1267 | 4023 |
| unique | 4370 | 4005 | 1999 | 1562 | 4598 | 4577 | 4425 | 4207 | 4478 | 4290 | ... | 3615 | 3616 | 3458 | 3267 | 689 | 3558 | 3491 | 2975 | 1265 | 4002 |
| top | Danza Kuduro - Cover | Un Verano Sin Ti | Drake | 1/1/2012 | USWL11700269 | 3441 | 1655575417 | 1 | 3 | 30913276 | ... | 14000 | 158504854 | 381728 | 1 | 1 | 1097 | 56972562 | 9 | 27 | 1 |
| freq | 13 | 20 | 63 | 38 | 2 | 2 | 4 | 46 | 8 | 2 | ... | 2 | 2 | 13 | 69 | 54 | 17 | 2 | 6 | 2 | 5 |
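
Many count columns appear in the object summary instead of the numeric one, presumably because the values are stored as text. A rough sketch of how such columns could be detected is shown below, assuming the only obstacle is a thousands separator. The helper `numeric_like_columns` and the toy frame are hypothetical.

```python
import pandas as pd

# Toy frame (not the Kaggle data) with counts stored as comma-formatted strings
toy = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Streams": ["1,200", "35", None],
    "Score": [41.8, 29.9, 23.3],
})


def numeric_like_columns(frame):
    """Return non-numeric columns whose non-missing values all parse as numbers once commas are removed."""
    found = []
    for column in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[column]):
            continue  # Already numeric, nothing to detect
        values = frame[column].dropna().astype(str).str.replace(",", "", regex=False)
        parsed = pd.to_numeric(values, errors="coerce")
        if len(values) > 0 and parsed.notna().all():
            found.append(column)
    return found


print(numeric_like_columns(toy))
```

A list produced this way could be compared with a hand-written list of columns to convert, so that no column is missed.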


#### iv. Identify and handle missing values

##### > Check for missing values

In [36]:
# Check for missing values
missing_values = spotify_data.isnull().sum()
print("missing_values\n")
print(missing_values)

missing_values

Track                            0
Album Name                       0
Artist                           5
Release Date                     0
ISRC                             0
All Time Rank                    0
Track Score                      0
Spotify Streams                113
Spotify Playlist Count          70
Spotify Playlist Reach          72
Spotify Popularity             804
YouTube Views                  308
YouTube Likes                  315
TikTok Posts                  1173
TikTok Likes                   980
TikTok Views                   981
YouTube Playlist Reach        1009
Apple Music Playlist Count     561
AirPlay Spins                  498
SiriusXM Spins                2123
Deezer Playlist Count          921
Deezer Playlist Reach          928
Amazon Playlist Count         1055
Pandora Streams               1106
Pandora Track Stations        1268
Soundcloud Streams            3333
Shazam Counts                  577
TIDAL Popularity              4600
Expl

##### > Drop rows that are missing basic data

In [53]:
spotify_data.shape

(4600, 29)

In [70]:
# Drop rows where 'Track', 'Album Name', and 'Artist' are all missing
spotify_data_filtered_rows = spotify_data.dropna(subset=['Track', 'Album Name', 'Artist'], how='all')
spotify_data_filtered_rows.shape

(4600, 27)

In [72]:
# Check for rows where all values in the 'Track' and 'Album Name' columns are missing
condition1 = spotify_data_filtered_rows[['Track', 'Album Name']].isnull().all(axis=1)

# Check for rows where all values in the 'Track' and 'Artist' columns are missing
condition2 = spotify_data_filtered_rows[['Track', 'Artist']].isnull().all(axis=1)

# Combine the conditions to identify rows where any of the above conditions are met
combined_condition = condition1 | condition2

# Filter out rows that meet any of the combined conditions to create a cleaned DataFrame
spotify_data_filtered_columns = spotify_data_filtered_rows[~combined_condition]
spotify_data_filtered_columns.shape

(4600, 27)
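
The row filter above relies on `how='all'`, which drops a row only when every listed column is missing. The toy example below (hypothetical data, not the Kaggle file) is a small sketch of how that differs from `how='any'`.

```python
import pandas as pd

# Row 0 is complete, row 1 has only an album name, row 2 has no key data at all
toy = pd.DataFrame({
    "Track": ["Song A", None, None],
    "Album Name": ["Album A", "Album B", None],
    "Artist": ["Artist A", None, None],
})
key_columns = ["Track", "Album Name", "Artist"]

# how="all": drop a row only if every key column is missing
kept_unless_all_missing = toy.dropna(subset=key_columns, how="all")
# how="any": drop a row if at least one key column is missing
kept_only_if_complete = toy.dropna(subset=key_columns, how="any")

print("Rows kept with how='all':", len(kept_unless_all_missing))
print("Rows kept with how='any':", len(kept_only_if_complete))
```

With `how='all'`, a row that has an artist gap but a valid track and album is kept, which matches the unchanged row count in the output above.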

##### > Drop columns that are missing most of their values

In [74]:
# Calculate the fraction of missing values for each column (0.0 to 1.0)
missing_percentage = spotify_data_filtered_columns.isnull().mean()
# Define a threshold for dropping columns (0.5 means more than 50% missing)
threshold = 0.5
# Identify columns where the missing fraction exceeds the threshold
columns_to_drop = missing_percentage[missing_percentage > threshold].index
print("columns_to_drop is: ", columns_to_drop)
# Drop the identified columns
spotify_data_filtered_columns = spotify_data_filtered_columns.drop(columns=columns_to_drop)
spotify_data_filtered_columns.shape

columns_to_drop is:  Index([], dtype='object')


(4600, 27)
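
The threshold rule can be sketched on a toy frame (hypothetical data, not the Kaggle file) to show its edge case: because the comparison is a strict `>`, a column that is exactly 50% missing is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    "Track": ["A", "B", "C", "D"],
    "Mostly Missing": [None, None, None, 1.0],  # 75% missing
    "Half Missing": [1.0, None, 2.0, None],     # exactly 50% missing
})

threshold = 0.5
# Mean of the boolean mask gives the fraction of missing values per column
missing_fraction = toy.isnull().mean()
columns_to_drop = missing_fraction[missing_fraction > threshold].index
trimmed = toy.drop(columns=columns_to_drop)

print("Dropped:", list(columns_to_drop))
print("Shape after drop:", trimmed.shape)
```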

##### > Get the count of missing values in each column

In [78]:
# Get the count of missing values in each column
missing_values = spotify_data_filtered_columns.isnull().sum()
# Filter columns with missing values
missing_values = missing_values[missing_values > 0]
# Get data types of columns with missing values
missing_data_types = spotify_data_filtered_columns.dtypes[missing_values.index]

# Combine missing values count and data types into a DataFrame
missing_info = pd.DataFrame({
    'Missing Values': missing_values,
    'Data Type': missing_data_types
})
# Display the DataFrame with missing values and their data types
print(missing_info)

                            Missing Values Data Type
Artist                                   5    object
Spotify Streams                        113    object
Spotify Playlist Count                  70    object
Spotify Playlist Reach                  72    object
Spotify Popularity                     804   float64
YouTube Views                          308    object
YouTube Likes                          315    object
TikTok Posts                          1173    object
TikTok Likes                           980    object
TikTok Views                           981    object
YouTube Playlist Reach                1009    object
Apple Music Playlist Count             561   float64
AirPlay Spins                          498    object
SiriusXM Spins                        2123    object
Deezer Playlist Count                  921   float64
Deezer Playlist Reach                  928    object
Amazon Playlist Count                 1055   float64
Pandora Streams                       1106    

##### > Define default values for missing fields based on data types

In [80]:
# Define default values for missing fields based on data types
default_values = {
    'Artist': 'Unknown',  # String columns can be filled with a placeholder
    'Spotify Streams': 0,  # Numeric columns can be filled with 0 or another appropriate value
    'Spotify Playlist Count': 0,
    'Spotify Playlist Reach': 0,
    'Spotify Popularity': 0,
    'YouTube Views': 0,
    'YouTube Likes': 0,
    'TikTok Posts': 0,
    'TikTok Likes': 0,
    'TikTok Views': 0,
    'YouTube Playlist Reach': 0,
    'Apple Music Playlist Count': 0,
    'AirPlay Spins': 0,
    'SiriusXM Spins': 0,
    'Deezer Playlist Count': 0,
    'Deezer Playlist Reach': 0,
    'Amazon Playlist Count': 0,
    'Pandora Streams': 0,
    'Pandora Track Stations': 0,
    'Shazam Counts': 0
}

# Fill missing values with the default values
spotify_filled_data = spotify_data_filtered_columns.fillna(default_values)

# Verify that there are no missing values left
print(spotify_filled_data.isnull().sum())
spotify_filled_data.shape

Track                         0
Album Name                    0
Artist                        0
Release Date                  0
ISRC                          0
All Time Rank                 0
Track Score                   0
Spotify Streams               0
Spotify Playlist Count        0
Spotify Playlist Reach        0
Spotify Popularity            0
YouTube Views                 0
YouTube Likes                 0
TikTok Posts                  0
TikTok Likes                  0
TikTok Views                  0
YouTube Playlist Reach        0
Apple Music Playlist Count    0
AirPlay Spins                 0
SiriusXM Spins                0
Deezer Playlist Count         0
Deezer Playlist Reach         0
Amazon Playlist Count         0
Pandora Streams               0
Pandora Track Stations        0
Shazam Counts                 0
Explicit Track                0
dtype: int64


(4600, 27)
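
Filling numeric gaps with 0 removes every missing value, but it also changes later statistics, because the zeros are counted as real observations. The sketch below uses a hypothetical toy column to illustrate the effect on a mean; whether 0 is a suitable placeholder depends on the analysis.

```python
import pandas as pd

# Toy column (not the Kaggle data): two known values and two missing ones
toy = pd.DataFrame({"Popularity": [60.0, 80.0, None, None]})

# pandas skips NaN by default, so only the two known values are averaged
mean_ignoring_missing = toy["Popularity"].mean()
# After filling with 0, all four rows contribute to the average
mean_after_zero_fill = toy["Popularity"].fillna(0).mean()

print("Mean ignoring missing values:", mean_ignoring_missing)
print("Mean after filling with 0:", mean_after_zero_fill)
```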

#### v. Remove duplicate rows

In [82]:
# Remove rows that are exact duplicates across all columns
spotify_filled_data = spotify_filled_data.drop_duplicates()
# Remove rows that share the same 'Track', 'Album Name', and 'Artist', keeping the first occurrence
spotify_filled_data = spotify_filled_data.drop_duplicates(subset=['Track', 'Album Name', 'Artist'], keep='first')
spotify_filled_data.shape

(4559, 27)
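
The two deduplication passes catch different cases. As a small sketch on hypothetical toy data, `duplicated()` can count how many rows each pass would remove before anything is dropped.

```python
import pandas as pd

# Toy frame (not the Kaggle data): the first two rows share track, album,
# and artist but differ in score, so they are not exact duplicates
toy = pd.DataFrame({
    "Track": ["Song A", "Song A", "Song B"],
    "Album Name": ["Album A", "Album A", "Album B"],
    "Artist": ["Artist A", "Artist A", "Artist B"],
    "Track Score": [50.0, 45.0, 30.0],
})
key_columns = ["Track", "Album Name", "Artist"]

exact_duplicates = int(toy.duplicated().sum())
key_duplicates = int(toy.duplicated(subset=key_columns).sum())
deduplicated = toy.drop_duplicates(subset=key_columns, keep="first")

print("Exact duplicate rows:", exact_duplicates)
print("Duplicates on key columns:", key_duplicates)
print(deduplicated)
```

Note that `keep='first'` retains whichever matching row appears first, so the row order at this point decides which score survives.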

#### vi. Convert the data format

In [217]:
spotify_filled_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4600 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4600 non-null   object 
 8   Spotify Playlist Count      4600 non-null   object 
 9   Spotify Playlist Reach      4600 non-null   object 
 10  Spotify Popularity          4600 non-null   float64
 11  YouTube Views               4600 non-null   object 
 12  YouTube Likes               4600 non-null   object 
 13  TikTok Posts                4600 

In [84]:
columns_to_convert = [
    'All Time Rank', 'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views',
    'YouTube Playlist Reach', 'AirPlay Spins', 'SiriusXM Spins', 'Deezer Playlist Reach',
    'Pandora Streams', 'Pandora Track Stations', 'Shazam Counts'
]

for column in columns_to_convert:
    # Remove thousands separators, turn empty strings into NaN, and convert to numeric.
    # Non-string entries (such as the 0 placeholders filled in earlier) become NaN under .str.replace;
    # all NaN values are then set to 0 before casting to the nullable Int64 type.
    spotify_filled_data[column] = pd.to_numeric(
        spotify_filled_data[column].str.replace(',', '', regex=False).replace('', np.nan),
        errors='coerce'
    ).fillna(0).astype('Int64')

# Convert the 'Release Date' column to datetime format
spotify_filled_data['Release Date'] = pd.to_datetime(spotify_filled_data['Release Date'], errors='coerce')

# Display the updated data types
print(spotify_filled_data.dtypes)

Track                                 object
Album Name                            object
Artist                                object
Release Date                  datetime64[ns]
ISRC                                  object
All Time Rank                          Int64
Track Score                          float64
Spotify Streams                        Int64
Spotify Playlist Count                 Int64
Spotify Playlist Reach                 Int64
Spotify Popularity                   float64
YouTube Views                          Int64
YouTube Likes                          Int64
TikTok Posts                           Int64
TikTok Likes                           Int64
TikTok Views                           Int64
YouTube Playlist Reach                 Int64
Apple Music Playlist Count           float64
AirPlay Spins                          Int64
SiriusXM Spins                         Int64
Deezer Playlist Count                float64
Deezer Playlist Reach                  Int64
Amazon Pla
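
The conversion step can be sketched on a toy frame (hypothetical values apart from the two taken from the summary output above). This sketch assumes the dates are in month/day/year order, as `1/1/2012` suggests but does not prove, and passes an explicit `format` so that pandas does not have to guess.

```python
import pandas as pd

toy = pd.DataFrame({
    "Spotify Streams": ["1,655,575,417", "", 0],  # string, empty string, 0 placeholder
    "Release Date": ["1/1/2012", "4/26/2024", "not a date"],
})

# Cast to str first so the integer placeholder survives as "0"
streams = toy["Spotify Streams"].astype(str).str.replace(",", "", regex=False)
# Unparseable values (the empty string here) become NaN, then 0
toy["Spotify Streams"] = pd.to_numeric(streams, errors="coerce").fillna(0).astype("Int64")

# Assumed month/day/year layout; values that do not match become NaT
toy["Release Date"] = pd.to_datetime(toy["Release Date"], format="%m/%d/%Y", errors="coerce")

print(toy.dtypes)
print(toy)
```

Counting the `NaT` values after conversion, for example with `toy["Release Date"].isna().sum()`, shows how many dates failed to parse.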

# 4. Exploratory Data Analysis (EDA)

# 5. Advanced Python Techniques

# 6. Insights and Conclusions

# 7. Documentation and Presentation

In [86]:
spotify_filled_data.to_csv(os.path.join(current_folder_path, "CleanedUp Most Streamed Spotify Songs 2024.csv"), index=False)
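
CSV files do not store data types, so the cleaned types have to be restored when the file is read back. The sketch below is a round trip on a hypothetical toy frame and temporary file, assuming the reader wants the datetime and nullable integer types back.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned frame (not the Kaggle data)
cleaned = pd.DataFrame({
    "Track": ["Song A", "Song B"],
    "Release Date": pd.to_datetime(["2012-01-01", "2024-04-26"]),
    "Spotify Streams": pd.array([1655575417, 0], dtype="Int64"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "cleaned_toy.csv")
    cleaned.to_csv(toy_path, index=False)
    # Restore the types explicitly; without these arguments the dates load as text
    reloaded = pd.read_csv(
        toy_path,
        parse_dates=["Release Date"],
        dtype={"Spotify Streams": "Int64"},
    )

print(reloaded.dtypes)
```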