<a href="https://colab.research.google.com/github/Abhijith-S-D/Spotify-EDA/blob/main/Spotify_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Exploratory Data Analysis (EDA) on the Spotify Dataset 🎹

**Python for Data Science Mini Project**

By,
* **Abhijith S D**
* **Arvind C R**
* **Ayyasamy S**
* **Gurumurthy Kalyanpur Viswanathaiah**
* **Manjunath**

## ✨ Introduction ✨
Welcome to the Exploratory Data Analysis (EDA) of the Spotify dataset! In this notebook, we will explore various aspects of the dataset to uncover insights and understand the underlying patterns.

## 📚 Libraries Used 🔧

### 🐍 Pandas
* **Purpose**: Data manipulation and analysis.
* **Usage**: To handle and process data, including reading and cleaning the dataset.

### 📊 Seaborn
* **Purpose**: Statistical data visualization.
* **Usage**: To create informative and attractive visualizations such as histograms, pair plots, and violin plots.

### 🖼️ Matplotlib
* **Purpose**: Plotting and visualization.
* **Usage**: For creating static, animated, and interactive visualizations, including histograms and plots with customized styles.

### 📈 Plotly Express
* **Purpose**: Interactive data visualization.
* **Usage**: To create interactive plots and visualizations that allow for deeper exploration of the data.

### ⚠️ Warnings
* **Purpose**: Manage warnings.
* **Usage**: To suppress warnings that may clutter the output.

In [12]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
import plotly.express as px
import warnings as w
w.filterwarnings('ignore')
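
The call above silences every warning for the rest of the session. As a hedged alternative (not what this notebook does), warnings can be suppressed only around a specific call with the standard-library `warnings.catch_warnings()` context manager. The `noisy` function below is a hypothetical stand-in for any call that emits a warning.

```python
import warnings


def noisy():
    """Hypothetical function that emits a warning before returning a value."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress DeprecationWarning only inside this block; the previous
# warning filters are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy()

print(result)
```

Scoping the filter this way keeps unrelated warnings visible, which can help when a later cell misbehaves.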

# 📋 Selected Dataset Overview and Preprocessing 🔍

## 📈 Data Setup: Linking Kaggle and Google Colab

### 📰 Read the Kaggle API token to interact with your Kaggle account

### ⚙️ Series of commands to set up the download

### ♿ Granting read/write access (if 401 Unauthorized)

### ✅ Sanity check: confirm Kaggle access

### ⬇️ Download the dataset

### 🤐 Unzip the dataset

In [14]:
from google.colab import files
files.upload()  # prompts for kaggle.json, the Kaggle API token
!ls -lha kaggle.json  # confirm the token file is present
!pip install -q kaggle  # install the kaggle package
!mkdir -p ~/.kaggle  # create the .kaggle folder where the key should be placed
!cp kaggle.json ~/.kaggle/  # copy the key into that folder
!pwd  # check the present working directory
!chmod 600 ~/.kaggle/kaggle.json  # restrict the key to owner read/write
!kaggle datasets list  # sanity check: list datasets to confirm API access
!kaggle datasets download -d nelgiriyewithana/most-streamed-spotify-songs-2024
!unzip most-streamed-spotify-songs-2024.zip

Saving kaggle.json to kaggle (2).json
-rw-r--r-- 1 root root 69 Aug 16 08:53 kaggle.json
/content
ref                                                            title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
-------------------------------------------------------------  -------------------------------------------------  ----  -------------------  -------------  ---------  ---------------  
muhammadehsan000/healthcare-dataset-2019-2024                  Healthcare Dataset (2019-2024)                      3MB  2024-08-09 17:52:25           2552         60  1.0              
muhammadehsan000/global-electric-vehicle-sales-data-2010-2024  Global Electric Vehicle Sales Data (2010-2024)     83KB  2024-08-09 16:39:22           1969         37  1.0              
emreksz/software-engineer-jobs-and-salaries-2024               Software Engineer Jobs & Salaries 2024             23KB  2024-08-12 00:08:03            900        
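
The `mkdir`, `cp`, and `chmod 600` steps above can also be sketched in plain Python, which may be convenient outside Colab. This is a minimal sketch under stated assumptions: `install_kaggle_token` is a hypothetical helper, the token written here is a placeholder rather than a real key, and the demonstration runs in a temporary directory instead of the real home folder. The permission check assumes a POSIX file system.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path: Path, home: Path) -> Path:
    """Copy the token to <home>/.kaggle/kaggle.json and make it owner-only."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`
    target = target_dir / "kaggle.json"
    shutil.copy(token_path, target)                # like `cp kaggle.json ~/.kaggle/`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return target


# Demonstration in a throwaway directory with a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    fake_token = tmp_path / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_token(fake_token, tmp_path / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(oct(mode))
```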

## 🗂️ Loading the Dataset 📥

We began by loading the [Most Streamed Spotify Songs 2024](https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024) dataset using `pandas`. This dataset contains information about various tracks, including metrics from different streaming platforms.

In [15]:
df = pd.read_csv('Most Streamed Spotify Songs 2024.csv', encoding='latin1')

## 🏷️ Initial Data Examination 👀
We reviewed the last few rows of the dataset to understand its structure and content. The dataset consists of 29 columns and 4600 rows (as `df.shape` confirms below), with diverse types of information such as track names, artists, release dates, and various streaming metrics.

In [16]:
pd.reset_option('display.max_columns')
df.tail(3)

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,189972685,72066,6704802,...,,1.0,74.0,6.0,84426740,28999.0,,1135998,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,145467020,14037,7387064,...,,,,7.0,6817840,,,448292,,0
4599,Me Acostumbre (feat. Bad Bunny),Me Acostumbre (feat. Bad Bunny),Arcï¿½ï¿½,4/11/2017,USB271700107,4593,19.4,255740653,32138,14066526,...,,4.0,127479.0,4.0,69006739,11320.0,,767006,,1
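
The artist name in the last row above is displayed as `Arcï¿½ï¿½`. One plausible explanation (an assumption, not something the notebook confirms) is that the CSV file already contains the UTF-8 bytes of the Unicode replacement character U+FFFD, and `encoding='latin1'` decodes each of those three bytes as a separate character. The sketch below reproduces that effect on a single replacement character.

```python
# U+FFFD is the Unicode replacement character; its UTF-8 encoding is three bytes.
replacement_bytes = "\ufffd".encode("utf-8")

# Latin-1 maps every byte to exactly one character, so three bytes
# become three visible characters.
as_latin1 = replacement_bytes.decode("latin1")

# Decoding the same bytes as UTF-8 yields the single replacement character.
as_utf8 = replacement_bytes.decode("utf-8")

print(repr(as_latin1), len(as_latin1))
print(repr(as_utf8), len(as_utf8))
```

If that assumption holds, switching the encoding would change how the characters are displayed but would not recover the original letters, because they were already lost before the file was written.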


In [17]:
df.shape

(4600, 29)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  TikTok Posts                3427 

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   float64
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4600 non-null   float64
 8   Spotify Playlist Count      4600 non-null   float64
 9   Spotify Playlist Reach      4600 non-null   float64
 10  Spotify Popularity          4600 non-null   float64
 11  YouTube Views               4600 non-null   float64
 12  YouTube Likes               4600 non-null   float64
 13  TikTok Posts                4600 

In [24]:
summary_statistics = df.describe()
print("Summary Statistics:")
summary_statistics

Summary Statistics:


Unnamed: 0,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,Spotify Popularity,YouTube Views,YouTube Likes,TikTok Posts,TikTok Likes,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,...,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,0.0,4600.0
mean,2290.678043,41.844043,436397100.0,58487.056304,22981390.0,52.402609,375826900.0,2729404.0,703509.2,88640790.0,...,138.574348,25.841739,1033699.0,19.535217,65069940.0,63653.49,4089647.0,2211905.0,,0.358913
std,1322.878312,38.543766,536279200.0,70961.508769,29596120.0,28.247818,685425100.0,4494884.0,2147581.0,489903500.0,...,426.777427,50.229711,3218861.0,25.181672,150818000.0,225446.6,18117280.0,5709079.0,,0.479734
min,1.0,19.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0
25%,1144.75,23.3,63406270.0,6021.25,4500124.0,45.0,28215740.0,288705.5,0.0,73155.25,...,0.0,1.0,2960.25,1.0,628.25,0.0,0.0,105887.8,,0.0
50%,2290.5,29.9,226376100.0,31211.0,12877650.0,65.0,128353100.0,1072084.0,64952.5,12703310.0,...,4.0,9.0,122329.5,11.0,2906567.0,2044.5,0.0,610960.5,,0.0
75%,3436.25,44.425,611356300.0,84908.25,29305260.0,71.0,420463700.0,3354867.0,459983.2,67218510.0,...,103.0,30.0,607699.8,28.0,46691380.0,23638.75,123709.8,2242540.0,,1.0
max,4998.0,725.4,4281469000.0,590392.0,262343400.0,96.0,16322760000.0,62311180.0,42900000.0,23474220000.0,...,7098.0,632.0,48197850.0,210.0,1463624000.0,3780513.0,319835900.0,219794500.0,,1.0


## 📜 Column Names 📑
The dataset includes columns like:
* 🎵 **Track**
* 📀 **Album Name**
* 🎤 **Artist**
* 📅 **Release Date**
* 🔢 **ISRC**
* 📈 **All Time Rank**
* ⭐ **Track Score**
* 🎧 **Spotify Streams**
* 📋 **Spotify Playlist Count**
* 🌐 **Spotify Playlist Reach**
* 🎼 **Spotify Popularity**
* 📹 **YouTube Views**
* 👍 **YouTube Likes**
* 🎥 **TikTok Posts**
* 💖 **TikTok Likes**
* 👁️ **TikTok Views**
* 🎵 **YouTube Playlist Reach**
* 🍏 **Apple Music Playlist Count**
* 📻 **AirPlay Spins**
* 📡 **SiriusXM Spins**
* 🎶 **Deezer Playlist Count**
* 🌍 **Deezer Playlist Reach**
* 📚 **Amazon Playlist Count**
* 🎙️ **Pandora Streams**
* 📻 **Pandora Track Stations**
* 🔊 **Soundcloud Streams**
* 🕵️ **Shazam Counts**
* 🎶 **TIDAL Popularity**
* 🔞 **Explicit Track**

In [25]:
df.columns

Index(['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track'],
      dtype='object')

## 📉 Missing Values Analysis ⚠️
We identified columns with missing values:
* 🎤 **Artist**: 5 missing values
* 🎧 **Spotify Streams**: 113 missing values
* 📋 **Spotify Playlist Count**: 70 missing values
* 🌐 **Spotify Playlist Reach**: 72 missing values
* 🎼 **Spotify Popularity**: 804 missing values
* 📹 **YouTube Views**: 308 missing values
* 👍 **YouTube Likes**: 315 missing values
* 🎥 **TikTok Posts**: 1173 missing values
* 💖 **TikTok Likes**: 980 missing values
* 👁️ **TikTok Views**: 981 missing values
* 🎵 **YouTube Playlist Reach**: 1009 missing values
* 🍏 **Apple Music Playlist Count**: 561 missing values
* 📻 **AirPlay Spins**: 498 missing values
* 📡 **SiriusXM Spins**: 2123 missing values
* 🎶 **Deezer Playlist Count**: 921 missing values
* 🌍 **Deezer Playlist Reach**: 928 missing values
* 📚 **Amazon Playlist Count**: 1055 missing values
* 🎙️ **Pandora Streams**: 1106 missing values
* 📻 **Pandora Track Stations**: 1268 missing values
* 🔊 **Soundcloud Streams**: 3333 missing values
* 🕵️ **Shazam Counts**: 577 missing values
* 🎶 **TIDAL Popularity**: 4600 missing values

In [None]:
df.isna().sum()

Unnamed: 0,0
Track,0
Album Name,0
Artist,5
Release Date,0
ISRC,0
All Time Rank,0
Track Score,0
Spotify Streams,113
Spotify Playlist Count,70
Spotify Playlist Reach,72
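
The raw counts from `df.isna().sum()` are easier to compare when paired with a percentage and sorted. The sketch below shows one way to do that; `missing_report` is a hypothetical helper, and the small `toy` frame is made-up data standing in for the real dataset.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that have at least one missing value, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "Track": ["a", "b", "c", "d"],
    "Artist": ["x", None, "y", "z"],
    "TIDAL Popularity": [None, None, None, None],
})

report = missing_report(toy)
print(report)
```

On the real data, the same helper could be called as `missing_report(df)`.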


## 🔄 Removing Duplicates 🔥
We checked for duplicate entries in the dataset:
* 🚨 **Duplicates Found**: 2
* 🗑️ **Duplicates Removed**: The dataset now contains 4598 rows.

In [None]:
df.duplicated().sum()

2

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

0
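
Before dropping duplicates it can be useful to look at the rows involved. The sketch below uses made-up data: `duplicated(keep=False)` flags every member of a duplicate group rather than only the later copies, and `reset_index(drop=True)` closes the gaps that `drop_duplicates` leaves in the index.

```python
import pandas as pd

# Made-up data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song A", "Song C"],
    "ISRC": ["X1", "X2", "X1", "X3"],
})

# keep=False marks all rows that belong to a duplicate group.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Drop the repeated rows and renumber the index from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```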

## 🔄 Cleaning Data and Filling NA Values 🔥
We cleaned the numeric columns and handled their NaN entries:
* 🧹 Stripped `$` and `,` characters and converted each column to `float`
* 🚨 Replaced NaN with 0

In [None]:
columns_to_clean = ['All Time Rank',
                    'Track Score',
                    'Spotify Streams',
                    'Spotify Playlist Count',
                    'Spotify Playlist Reach',
                    'Spotify Popularity',
                    'YouTube Views',
                    'YouTube Likes',
                    'TikTok Posts',
                    'TikTok Likes',
                    'TikTok Views',
                    'YouTube Playlist Reach',
                    'Apple Music Playlist Count',
                    'AirPlay Spins',
                    'SiriusXM Spins',
                    'Deezer Playlist Count',
                    'Deezer Playlist Reach',
                    'Amazon Playlist Count',
                    'Pandora Streams',
                    'Pandora Track Stations',
                    'Soundcloud Streams',
                    'Shazam Counts',
                    'Explicit Track']

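def clean_and_convert_fill_zero(column):
    # Strip '$' and ',' so values such as '1,234' can be parsed as floats.
    # The raw string avoids an invalid-escape warning for '\$'.
    df[column] = df[column].replace(r'[\$,]', '', regex=True).astype(float)
    # Replace any remaining missing values with 0.
    df[column] = df[column].fillna(0)

for column in columns_to_clean:
    clean_and_convert_fill_zero(column)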
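
As a hedged alternative to `.astype(float)`, which raises an error on any value that cannot be parsed, `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN. The sketch below uses made-up values (including a hypothetical `"n/a"` marker that is not known to occur in the dataset) and leaves the NaN values in place rather than filling them.

```python
import pandas as pd

# Made-up values: thousands separators, a missing entry, and a text marker.
raw = pd.Series(["1,234", "56", None, "n/a"])

# Remove the thousands separators as a literal (non-regex) replacement.
stripped = raw.str.replace(",", "", regex=False)

# Unparseable strings become NaN instead of raising an exception.
parsed = pd.to_numeric(stripped, errors="coerce")
print(parsed)
```

Keeping NaN instead of filling with 0 means that statistics such as the mean skip the missing entries, which is a different design choice from the one made in this notebook.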

## 📊 Descriptive Statistics 📈
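We computed descriptive statistics for the following columns. The means and standard deviations listed here appear to have been taken over non-missing values, before NaN was replaced with 0, so they are higher than the zero-filled figures in the table below (for example, Spotify Popularity has a mean of 63.50 here versus 52.40 in the table):
* ⭐ **Track Score**: Mean of 41.85, Standard Deviation of 38.55
* 🎼 **Spotify Popularity**: Mean of 63.50, Standard Deviation of 16.19
* 🍏 **Apple Music Playlist Count**: Mean of 54.61, Standard Deviation of 71.63
* 🎶 **Deezer Playlist Count**: Mean of 32.32, Standard Deviation of 54.29
* 📚 **Amazon Playlist Count**: Mean of 25.35, Standard Deviation of 25.99
* 🎶 **TIDAL Popularity**: Contains only missing values (a count of 0 in the table below), so this column is dropped in the next section.
* 🔞 **Explicit Track**: 0 or 1, indicating whether the track is explicit.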

In [26]:
df.describe()

Unnamed: 0,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,Spotify Popularity,YouTube Views,YouTube Likes,TikTok Posts,TikTok Likes,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,...,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,0.0,4600.0
mean,2290.678043,41.844043,436397100.0,58487.056304,22981390.0,52.402609,375826900.0,2729404.0,703509.2,88640790.0,...,138.574348,25.841739,1033699.0,19.535217,65069940.0,63653.49,4089647.0,2211905.0,,0.358913
std,1322.878312,38.543766,536279200.0,70961.508769,29596120.0,28.247818,685425100.0,4494884.0,2147581.0,489903500.0,...,426.777427,50.229711,3218861.0,25.181672,150818000.0,225446.6,18117280.0,5709079.0,,0.479734
min,1.0,19.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0
25%,1144.75,23.3,63406270.0,6021.25,4500124.0,45.0,28215740.0,288705.5,0.0,73155.25,...,0.0,1.0,2960.25,1.0,628.25,0.0,0.0,105887.8,,0.0
50%,2290.5,29.9,226376100.0,31211.0,12877650.0,65.0,128353100.0,1072084.0,64952.5,12703310.0,...,4.0,9.0,122329.5,11.0,2906567.0,2044.5,0.0,610960.5,,0.0
75%,3436.25,44.425,611356300.0,84908.25,29305260.0,71.0,420463700.0,3354867.0,459983.2,67218510.0,...,103.0,30.0,607699.8,28.0,46691380.0,23638.75,123709.8,2242540.0,,1.0
max,4998.0,725.4,4281469000.0,590392.0,262343400.0,96.0,16322760000.0,62311180.0,42900000.0,23474220000.0,...,7098.0,632.0,48197850.0,210.0,1463624000.0,3780513.0,319835900.0,219794500.0,,1.0
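
The relationship between a mean over non-missing values and a mean after zero-filling can be checked with figures reported in this notebook: filling `k` missing entries with 0 leaves the sum unchanged but divides it by all `n` rows, so the filled mean is the non-missing mean times `(n - k) / n`. The sketch below applies that to three columns, using the bullet-list means, the missing-value counts, and the table means. It assumes the bullet-list means were computed before the zero fill, which the notebook does not state explicitly.

```python
total_rows = 4600

# column: (mean over non-missing values, missing count, mean in the table)
reported = {
    "Spotify Popularity": (63.50, 804, 52.402609),
    "Deezer Playlist Count": (32.32, 921, 25.841739),
    "Amazon Playlist Count": (25.35, 1055, 19.535217),
}

rescaled = {}
for column, (mean_non_missing, n_missing, table_mean) in reported.items():
    non_missing = total_rows - n_missing
    # Zero-filled mean = non-missing mean scaled by the share of non-missing rows.
    rescaled[column] = mean_non_missing * non_missing / total_rows
    print(f"{column}: rescaled {rescaled[column]:.2f}, table {table_mean:.2f}")
```

The rescaled values land close to the table means, with small differences that are consistent with the rounding of the inputs to two decimals.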


## 🗑️ Dropping Unnecessary Columns 🧹
The **TIDAL Popularity** column was dropped from the dataset:
* ❌ **Reason**: This column contained only missing values and did not contribute to our analysis.

In [27]:
df.drop("TIDAL Popularity", axis=1, inplace=True)
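
Rather than naming the column by hand, columns that contain only missing values can be found and removed generically. The sketch below uses made-up data and `DataFrame.dropna(axis=1, how="all")`. Note that it only works while the missing values are still NaN, so it would not catch a column that has already been filled with 0.

```python
import pandas as pd

# Made-up data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "Track Score": [19.4, 25.0],
    "TIDAL Popularity": [float("nan"), float("nan")],
})

# List the columns in which every value is missing.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)

# Drop every column that is entirely NaN.
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```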