<a href="https://colab.research.google.com/github/Abhijith-S-D/Spotify-EDA/blob/main/Spotify_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Exploratory Data Analysis (EDA) on the Spotify Dataset 🎹

**Python for Data Science Mini Project**

By,
* **Abhijith S D**
* **Arvind C R**
* **Ayyasamy S**
* **Gurumurthy Kalyanpur Viswanathaiah**
* **Manjunath**

## ✨ Introduction ✨
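Welcome to the Exploratory Data Analysis (EDA) of the Spotify dataset! In this notebook, we explore the "Most Streamed Spotify Songs 2024" dataset from Kaggle: we load it, inspect its structure, check for missing values and duplicates, and summarize its numeric columns to uncover insights and underlying patterns.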

## 📚 Libraries Used 🔧

### 🐍 Pandas
* **Purpose**: Data manipulation and analysis.
* **Usage**: To handle and process data, including reading and cleaning the dataset.

In [7]:
import pandas as pd

### 📊 Seaborn
* **Purpose**: Statistical data visualization.
* **Usage**: To create informative and attractive visualizations such as histograms, pair plots, and violin plots.

In [None]:
import seaborn as sn

### 🖼️ Matplotlib
* **Purpose**: Plotting and visualization.
* **Usage**: For creating static, animated, and interactive visualizations, including histograms and plots with customized styles.

In [None]:
import matplotlib.pyplot as mp

### 📈 Plotly Express
* **Purpose**: Interactive data visualization.
* **Usage**: To create interactive plots and visualizations that allow for deeper exploration of the data.

In [None]:
import plotly.express as px

### ⚠️ Warnings
* **Purpose**: Manage warnings.
* **Usage**: To suppress warnings that may clutter the output.

In [None]:
import warnings as w
w.filterwarnings('ignore')
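
A global `filterwarnings('ignore')` also hides warnings that may matter later in the notebook. As a minimal sketch of a narrower alternative, the standard-library `warnings.catch_warnings()` context manager silences warnings only inside one block. The helper `noisy_sum` is hypothetical and exists only to trigger a warning.

```python
import warnings


def noisy_sum(values):
    """Hypothetical helper that emits a warning before returning a result."""
    warnings.warn("noisy_sum is only an illustration", UserWarning)
    return sum(values)


# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_total = noisy_sum([1, 2, 3])

# Record warnings in a second block to show that they are emitted again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_total = noisy_sum([1, 2, 3])

print(quiet_total, loud_total, len(caught))
```

This keeps the output clean around one noisy call while leaving warnings visible everywhere else.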

# 📋 Dataset Overview and Preprocessing 🔍

## 📈 Data Setup: Linking Kaggle and Google Colab

### 📰 Upload the Kaggle API token to interact with your Kaggle account

In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"abhinomega135","key":"<redacted>"}'}

### ⚙️ Commands to set up the Kaggle CLI for the download

In [2]:
!ls -lha kaggle.json
!pip install -q kaggle # installing the kaggle package
!mkdir -p ~/.kaggle # creating .kaggle folder where the key should be placed
!cp kaggle.json ~/.kaggle/ # copy the key into the folder
!pwd # checking the present working directory

-rw-r--r-- 1 root root 69 Aug 10 03:13 kaggle.json
/content


### ♿ Restricting the key file to owner read/write (if you get 401 Unauthorized)

In [3]:
!chmod 600 ~/.kaggle/kaggle.json
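
The shell steps above (create `~/.kaggle`, copy the token, `chmod 600`) can also be sketched in Python. This is an illustration only: `install_kaggle_token` is a hypothetical helper, the credentials are placeholders, and the demo writes to a temporary directory rather than the real home folder.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path: Path, home: Path) -> Path:
    """Copy a token file into <home>/.kaggle and make it owner read/write only."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = target_dir / "kaggle.json"
    target.write_bytes(token_path.read_bytes())    # like `cp`
    os.chmod(target, 0o600)                        # like `chmod 600`
    return target


# Demo in a throwaway directory with placeholder credentials (not real ones).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    fake_token = tmp_path / "kaggle.json"
    fake_token.write_text(json.dumps({"username": "example", "key": "placeholder"}))
    installed = install_kaggle_token(fake_token, home=tmp_path / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(oct(mode))
```

Mode `600` gives the file's owner read and write access and removes all access for other users, which is appropriate for a file that holds an API key.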

### ✅ Sanity check: confirm that Kaggle is accessible

In [4]:
!kaggle datasets list

ref                                                        title                                               size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
waqi786/heart-attack-dataset                               ❤️ Heart Attack Risk Factors Dataset                 9KB  2024-08-06 10:57:42           1254         29  1.0              
youssefismail20/olympic-games-1994-2024                    Olympic Games (1994-2024) 🏅🌍                        16KB  2024-08-08 12:56:36           1198         30  1.0              
myrios/cost-of-living-index-by-country-by-number-2024      Cost of Living Index by Country                      3KB  2024-07-19 06:25:42           2865         42  1.0              
muhammadehsan000/credit-card-transaction-records-dataset   Credit Card Transaction Records

### ⬇️ Download the dataset

In [5]:
!kaggle datasets download -d nelgiriyewithana/most-streamed-spotify-songs-2024

Dataset URL: https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024
License(s): CC-BY-SA-4.0
Downloading most-streamed-spotify-songs-2024.zip to /content
  0% 0.00/496k [00:00<?, ?B/s]
100% 496k/496k [00:00<00:00, 38.3MB/s]


### 🤐 Unzip Dataset

In [6]:
! unzip most-streamed-spotify-songs-2024.zip

Archive:  most-streamed-spotify-songs-2024.zip
  inflating: Most Streamed Spotify Songs 2024.csv  
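
Outside Colab, where the `unzip` command may not be available, the same step can be done with Python's standard `zipfile` module. The sketch below builds a tiny stand-in archive so that it runs without the real download; the file names in it are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "example.zip"  # stand-in for the downloaded archive

    # Build a tiny archive so the example runs without the real download.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("example data.csv", "Track,Artist\nSong A,Artist A\n")

    # List the members, then extract them into a subfolder.
    with zipfile.ZipFile(archive) as zf:
        members = zf.namelist()
        zf.extractall(tmp_path / "extracted")

    extracted_text = (tmp_path / "extracted" / "example data.csv").read_text()
    print(members)
```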


## 🗂️ Loading the Dataset 📥
We began by loading the Spotify dataset using `pandas`. This dataset contains information about various tracks, including metrics from different streaming platforms.

In [8]:
df = pd.read_csv('Most Streamed Spotify Songs 2024.csv', encoding='latin1')

## 🏷️ Initial Data Examination 👀
We reviewed the last few rows of the dataset to understand its structure and content. As loaded, the dataset has 29 columns and 4600 rows (4598 after duplicates are removed), with diverse types of information such as track names, artists, release dates, and various streaming metrics.

In [9]:
pd.reset_option('display.max_columns')
df.tail(3)

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,189972685,72066,6704802,...,,1.0,74.0,6.0,84426740,28999.0,,1135998,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,145467020,14037,7387064,...,,,,7.0,6817840,,,448292,,0
4599,Me Acostumbre (feat. Bad Bunny),Me Acostumbre (feat. Bad Bunny),Arcï¿½ï¿½,4/11/2017,USB271700107,4593,19.4,255740653,32138,14066526,...,,4.0,127479.0,4.0,69006739,11320.0,,767006,,1
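
The `ï¿½` sequences in the last row's Artist value match what appears when the UTF-8 bytes of the Unicode replacement character (U+FFFD) are decoded as Latin-1. The short check below demonstrates that mapping. It suggests, but does not prove, that the CSV already contained replacement characters, in which case changing `encoding` would not recover the original letters.

```python
# U+FFFD is the Unicode replacement character used for undecodable input.
replacement_bytes = "\ufffd".encode("utf-8")
print(replacement_bytes)

# Latin-1 maps every byte to exactly one character, so decoding never fails.
as_latin1 = replacement_bytes.decode("latin1")
print(as_latin1)

# Round-tripping gives back the replacement character, not an original letter.
recovered = as_latin1.encode("latin1").decode("utf-8")
print(recovered == "\ufffd")
```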


In [15]:
df.shape

(4598, 29)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4598 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4598 non-null   object 
 1   Album Name                  4598 non-null   object 
 2   Artist                      4593 non-null   object 
 3   Release Date                4598 non-null   object 
 4   ISRC                        4598 non-null   object 
 5   All Time Rank               4598 non-null   object 
 6   Track Score                 4598 non-null   float64
 7   Spotify Streams             4485 non-null   object 
 8   Spotify Playlist Count      4528 non-null   object 
 9   Spotify Playlist Reach      4526 non-null   object 
 10  Spotify Popularity          3794 non-null   float64
 11  YouTube Views               4290 non-null   object 
 12  YouTube Likes               4283 non-null   object 
 13  TikTok Posts                3425 non-n
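
`df.info()` reports count-like columns such as `Spotify Streams` as `object` rather than a numeric dtype, so they need converting before any arithmetic. The sketch below assumes the cause is thousands separators in the raw strings (for example `"1,234,567"`), which this notebook does not confirm. The toy frame and the `to_number` helper are illustrative only.

```python
import pandas as pd

# Toy frame; the comma-formatted strings are an assumption about the raw CSV.
toy = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Spotify Streams": ["1,234,567", "89,000", None],
})


def to_number(series: pd.Series) -> pd.Series:
    """Strip thousands separators and convert the result to a numeric dtype."""
    cleaned = series.str.replace(",", "", regex=False)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


toy["Spotify Streams"] = to_number(toy["Spotify Streams"])
print(toy.dtypes)
```

If the assumption holds, passing `thousands=','` to `pd.read_csv` is another way to get numeric columns at load time.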

## 📜 Column Names 📑
The dataset includes the following 29 columns:
* 🎵 **Track**
* 📀 **Album Name**
* 🎤 **Artist**
* 📅 **Release Date**
* 🔢 **ISRC**
* 📈 **All Time Rank**
* ⭐ **Track Score**
* 🎧 **Spotify Streams**
* 📋 **Spotify Playlist Count**
* 🌐 **Spotify Playlist Reach**
* 🎼 **Spotify Popularity**
* 📹 **YouTube Views**
* 👍 **YouTube Likes**
* 🎥 **TikTok Posts**
* 💖 **TikTok Likes**
* 👁️ **TikTok Views**
* 🎵 **YouTube Playlist Reach**
* 🍏 **Apple Music Playlist Count**
* 📻 **AirPlay Spins**
* 📡 **SiriusXM Spins**
* 🎶 **Deezer Playlist Count**
* 🌍 **Deezer Playlist Reach**
* 📚 **Amazon Playlist Count**
* 🎙️ **Pandora Streams**
* 📻 **Pandora Track Stations**
* 🔊 **Soundcloud Streams**
* 🕵️ **Shazam Counts**
* 🎶 **TIDAL Popularity**
* 🔞 **Explicit Track**

In [10]:
df.columns

Index(['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track'],
      dtype='object')

## 📉 Missing Values Analysis ⚠️
We identified the following columns with missing values (counted on the 4600 rows present before duplicates were removed):
* 🎤 **Artist**: 5 missing values
* 🎧 **Spotify Streams**: 113 missing values
* 📋 **Spotify Playlist Count**: 70 missing values
* 🌐 **Spotify Playlist Reach**: 72 missing values
* 🎼 **Spotify Popularity**: 804 missing values
* 📹 **YouTube Views**: 308 missing values
* 👍 **YouTube Likes**: 315 missing values
* 🎥 **TikTok Posts**: 1173 missing values
* 💖 **TikTok Likes**: 980 missing values
* 👁️ **TikTok Views**: 981 missing values
* 🎵 **YouTube Playlist Reach**: 1009 missing values
* 🍏 **Apple Music Playlist Count**: 561 missing values
* 📻 **AirPlay Spins**: 498 missing values
* 📡 **SiriusXM Spins**: 2123 missing values
* 🎶 **Deezer Playlist Count**: 921 missing values
* 🌍 **Deezer Playlist Reach**: 928 missing values
* 📚 **Amazon Playlist Count**: 1055 missing values
* 🎙️ **Pandora Streams**: 1106 missing values
* 📻 **Pandora Track Stations**: 1268 missing values
* 🔊 **Soundcloud Streams**: 3333 missing values
* 🕵️ **Shazam Counts**: 577 missing values
* 🎶 **TIDAL Popularity**: 4600 missing values

In [11]:
df.isna().sum()

Unnamed: 0,0
Track,0
Album Name,0
Artist,5
Release Date,0
ISRC,0
All Time Rank,0
Track Score,0
Spotify Streams,113
Spotify Playlist Count,70
Spotify Playlist Reach,72
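
Raw counts are easier to compare when shown next to the share of rows they represent. The sketch below builds such a summary on a toy frame; the column names and values are made up for illustration, and the same expressions should apply to `df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Track": ["A", "B", "C", "D"],
    "Artist": ["X", None, "Y", "Z"],
    "Streams": [10.0, None, None, 40.0],
})

missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```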


## 🔄 Removing Duplicates 🔥
We checked for duplicate entries in the dataset:
* 🚨 **Duplicates Found**: 2
* 🗑️ **Duplicates Removed**: The dataset now contains 4598 rows.

In [12]:
df.duplicated().sum()

2

In [13]:
df.drop_duplicates(inplace=True)

In [14]:
df.duplicated().sum()

0
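
`drop_duplicates` keeps the original index labels, which is why `df.info()` reports 4598 entries labelled 0 to 4599. The toy example below shows the same steps without `inplace` and, as an optional extra that the notebook does not do, renumbers the rows with `reset_index(drop=True)`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Track": ["A", "B", "A", "C"],
    "Artist": ["X", "Y", "X", "Z"],
})

# Count rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row, then renumber the index from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduped.shape, list(deduped.index))
```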

## 📊 Descriptive Statistics 📈
We computed descriptive statistics for the following columns:
* ⭐ **Track Score**: Mean of 41.85, Standard Deviation of 38.55
* 🎼 **Spotify Popularity**: Mean of 63.50, Standard Deviation of 16.19
* 🍏 **Apple Music Playlist Count**: Mean of 54.61, Standard Deviation of 71.63
* 🎶 **Deezer Playlist Count**: Mean of 32.32, Standard Deviation of 54.29
* 📚 **Amazon Playlist Count**: Mean of 25.35, Standard Deviation of 25.99
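* 🎶 **TIDAL Popularity**: Count of 0. Every value is missing, so no statistics are available and the column is dropped in the next step.
* 🔞 **Explicit Track**: Binary flag (0 or 1) indicating the explicit nature of the track. The mean of 0.36 means about 36% of tracks are marked explicit.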

In [17]:
df.describe()

Unnamed: 0,Track Score,Spotify Popularity,Apple Music Playlist Count,Deezer Playlist Count,Amazon Playlist Count,TIDAL Popularity,Explicit Track
count,4598.0,3794.0,4037.0,3677.0,3543.0,0.0,4598.0
mean,41.850892,63.498682,54.613574,32.32173,25.346034,,0.359069
std,38.550706,16.189952,71.628469,54.287051,25.993157,,0.47978
min,19.4,1.0,1.0,1.0,1.0,,0.0
25%,23.3,61.0,10.0,5.0,8.0,,0.0
50%,29.9,67.0,28.0,15.0,17.0,,0.0
75%,44.475,73.0,70.0,37.0,34.0,,1.0
max,725.4,96.0,859.0,632.0,210.0,,1.0
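
By default `describe()` summarizes only numeric columns, which is why the table above has 7 columns even though the dataset has 29. As a small sketch on toy data (the values are made up), the `exclude` parameter can be used to summarize the non-numeric columns instead.

```python
import pandas as pd

toy = pd.DataFrame({
    "Track": ["A", "B", "C", "D"],
    "Artist": ["X", "X", "Y", "Z"],
    "Track Score": [19.4, 23.3, 29.9, 725.4],
})

# Default: count, mean, std, min, quartiles, and max for numeric columns.
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top (most frequent value), and freq.
text_summary = toy.describe(exclude=["number"])

print(numeric_summary)
print(text_summary)
```

In the toy data a single large value pulls the mean of `Track Score` well above its median. The real column shows a similar gap: a mean of 41.85 against a median of 29.9, with a maximum of 725.4.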


## 🗑️ Dropping Unnecessary Columns 🧹
The **TIDAL Popularity** column was dropped from the dataset:
* ❌ **Reason**: This column contained only missing values and did not contribute to our analysis.

In [18]:
df.drop(columns="TIDAL Popularity", inplace=True)
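
Instead of naming the column, entirely empty columns can be found and dropped generically. The sketch below uses a toy frame (values are illustrative) and `dropna(axis="columns", how="all")`, which returns a new frame rather than modifying the original.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "Track": ["A", "B", "C"],
    "Explicit Track": [1, 0, 1],
    "TIDAL Popularity": [nan, nan, nan],
})

# Columns in which every value is missing.
empty_columns = toy.columns[toy.isna().all()].tolist()

# Drop every all-missing column without naming it.
trimmed = toy.dropna(axis="columns", how="all")
print(empty_columns, list(trimmed.columns))
```

Naming the column explicitly, as the notebook does, makes the intent clearer. The generic form is handier when several columns might be empty.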