# 🎵 **Spotify Songs Data Cleaning Process** 🎵

## 📝 **Objective**
🔹 Prepare the dataset for analysis by cleaning inconsistencies, handling missing values, and standardizing data.


In [1]:
import pandas as pd
import numpy as np

### Steps:
🔹 **Load Dataset**: Import and read the `spotify_songs_dataset.csv` file.

In [2]:

df = pd.read_csv("spotify_songs_dataset.csv")


In [3]:
# Display basic details
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_id           50000 non-null  object 
 1   song_title        50000 non-null  object 
 2   artist            50000 non-null  object 
 3   album             50000 non-null  object 
 4   genre             50000 non-null  object 
 5   release_date      50000 non-null  object 
 6   duration          45000 non-null  float64
 7   popularity        50000 non-null  int64  
 8   stream            50000 non-null  int64  
 9   language          47500 non-null  object 
 10  explicit_content  50000 non-null  object 
 11  label             50000 non-null  object 
 12  composer          50000 non-null  object 
 13  producer          50000 non-null  object 
 14  collaboration     15000 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 5.7+ MB


In [4]:
df.head()

| | song_id | song_title | artist | album | genre | release_date | duration | popularity | stream | language | explicit_content | label | composer | producer | collaboration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SP0001 | Space executive series. | Sydney Clark | What. | Electronic | 1997-11-08 | 282.0 | 42 | 35055874 | English | Yes | Def Jam | Amy Hatfield | Jeffrey Weaver | NaN |
| 1 | SP0002 | Price last painting. | Connor Peters DDS | Nature politics. | Electronic | 2015-05-10 | 127.0 | 50 | 9249527 | English | Yes | Universal Music | Jason Gregory | Kenneth White | NaN |
| 2 | SP0003 | Piece. | Anna Keith | Visit. | Pop | 2024-07-08 | NaN | 10 | 76669110 | English | Yes | Universal Music | Rachel Lopez | Jason Barnes | NaN |
| 3 | SP0004 | Power industry your. | Zachary Simpson | Behavior evening. | Hip-Hop | 2022-08-15 | 214.0 | 86 | 34732016 | English | No | Sony Music | Thomas Li | Mrs. Becky Palmer | NaN |
| 4 | SP0005 | Food animal second. | Christopher Mcgee | Front. | Pop | 2023-03-05 | 273.0 | 63 | 96649372 | English | Yes | Def Jam | Adam Wagner | Beverly Baker | NaN |


In [5]:
df.describe()

| | duration | popularity | stream |
|---|---|---|---|
| count | 45000.0 | 50000.0 | 50000.0 |
| mean | 239.659178 | 50.78344 | 50191830.0 |
| std | 50.136727 | 28.948749 | 28936240.0 |
| min | 33.0 | 1.0 | 1899.0 |
| 25% | 206.0 | 26.0 | 25233110.0 |
| 50% | 240.0 | 51.0 | 50421690.0 |
| 75% | 273.0 | 76.0 | 75190640.0 |
| max | 433.0 | 100.0 | 99999130.0 |


In [6]:
summary_text = """
## Statistical Summary of Spotify Songs Dataset

#### 1. Duration (seconds)
- Average song length is **~4 minutes** (mean ≈ 240 s), ranging from **33 sec to 7.2 min**.
- The shortest and longest songs fall outside the 1.5 × IQR fences (105.5 s to 373.5 s), indicating possible **outliers**.

#### 2. Popularity (0-100 Scale)
- Median popularity is **51**, with values ranging from **1 to 100**.
- High standard deviation (**28.95**) indicates **wide variation in song popularity**.

#### 3. Streams (Number of Plays)
- Average streams are **~50M**, ranging from as low as **1.9K** to nearly **100M**.
- Mean (**50.2M**) and median (**50.4M**) are nearly equal, so the distribution is **roughly symmetric** rather than skewed.

#### 🔹 Key Observations
- Potential **outliers** in duration; popularity and streams stay within their 1.5 × IQR fences.
- Stream count shows **no strong skew**: the quartiles (25.2M, 50.4M, 75.2M) are evenly spaced.
- **Data cleaning** (missing-value handling, outlier capping) may improve analysis accuracy.
"""

from IPython.display import display, Markdown
display(Markdown(summary_text))



## Statistical Summary of Spotify Songs Dataset

#### 1. Duration (seconds)
- Average song length is **~4 minutes** (mean ≈ 240 s), ranging from **33 sec to 7.2 min**.
- The shortest and longest songs fall outside the 1.5 × IQR fences (105.5 s to 373.5 s), indicating possible **outliers**.

#### 2. Popularity (0-100 Scale)
- Median popularity is **51**, with values ranging from **1 to 100**.
- High standard deviation (**28.95**) indicates **wide variation in song popularity**.

#### 3. Streams (Number of Plays)
- Average streams are **~50M**, ranging from as low as **1.9K** to nearly **100M**.
- Mean (**50.2M**) and median (**50.4M**) are nearly equal, so the distribution is **roughly symmetric** rather than skewed.

#### 🔹 Key Observations
- Potential **outliers** in duration; popularity and streams stay within their 1.5 × IQR fences.
- Stream count shows **no strong skew**: the quartiles (25.2M, 50.4M, 75.2M) are evenly spaced.
- **Data cleaning** (missing-value handling, outlier capping) may improve analysis accuracy.
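
Whether a column such as `stream` is skewed can be checked directly by comparing mean and median and by computing the sample skewness. The sketch below uses two synthetic series (not the Spotify data) purely to show what the two cases look like; the names `uniform_like` and `right_skewed` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; these are not values from the dataset.
rng = np.random.default_rng(0)
uniform_like = pd.Series(rng.uniform(0, 100_000_000, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=15, sigma=1.5, size=10_000))

for name, s in [("uniform_like", uniform_like), ("right_skewed", right_skewed)]:
    # A mean/median ratio near 1 and skewness near 0 suggest a symmetric distribution.
    ratio = s.mean() / s.median()
    print(f"{name}: mean/median = {ratio:.2f}, skew = {s.skew():.2f}")
```

On the real data, the same two numbers could be read from `df['stream'].mean() / df['stream'].median()` and `df['stream'].skew()`.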


**Convert Data Types**:
   - 🔹 Convert `release_date` to datetime format.
   - 🔹 Convert `explicit_content` to binary (0 for No, 1 for Yes).
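
As a minimal sketch of both conversions, the toy frame below (illustrative values, not rows from the dataset) shows how `errors='coerce'` turns an unparseable date into `NaT` and how `map` produces the 0/1 encoding.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
demo = pd.DataFrame({
    "release_date": ["1997-11-08", "2015-05-10", "not a date"],
    "explicit_content": ["Yes", "No", "Yes"],
})

# errors="coerce" replaces unparseable strings with NaT instead of raising.
demo["release_date"] = pd.to_datetime(demo["release_date"], errors="coerce")

# Map Yes/No to 1/0; any value missing from the dict would become NaN.
demo["explicit_content"] = demo["explicit_content"].map({"Yes": 1, "No": 0})

print(demo.dtypes)
print("Unparseable dates:", demo["release_date"].isna().sum())
```

Counting `NaT` values after the conversion is a cheap way to see how many dates were silently coerced.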

In [7]:

df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')


In [8]:
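# Disabled: explicit_content keeps its 'Yes'/'No' values in the outputs below.
# Uncomment the next line to apply the binary mapping described above.
# df['explicit_content'] = df['explicit_content'].map({'Yes': 1, 'No': 0})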

**Remove Duplicates**: 
🔹 Identify and remove duplicate rows.
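
It can help to count duplicates before dropping them, so the effect of the step is visible. A small sketch on a toy frame (illustrative values; `n_dupes` and `deduped` are names chosen for this example):

```python
import pandas as pd

# Toy frame with one fully duplicated row (illustrative values only).
demo = pd.DataFrame({
    "song_id": ["SP0001", "SP0002", "SP0002"],
    "song_title": ["a", "b", "b"],
})

# duplicated() marks rows that are identical to an earlier row.
n_dupes = int(demo.duplicated().sum())
deduped = demo.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} remain")
```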

In [9]:
df.drop_duplicates(inplace=True)

**Standardize Text Columns**: 
🔹 Convert all categorical text columns to lowercase and trim spaces.
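
A short sketch of the same normalization on a toy frame (illustrative values only) shows that differently cased or padded labels collapse to one value, while missing entries stay missing:

```python
import pandas as pd

# Toy frame with inconsistent casing, stray spaces, and one missing value.
demo = pd.DataFrame({
    "genre": ["  Pop", "pop ", "Hip-Hop"],
    "language": ["English", None, " english"],
})
text_cols = ["genre", "language"]

# The .str accessor skips missing values, so they remain missing afterwards.
demo[text_cols] = demo[text_cols].apply(lambda s: s.str.strip().str.lower())

print(demo["genre"].unique())
```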

In [10]:
text_cols = ['song_title', 'artist', 'album', 'genre', 'language', 'label', 'composer', 'producer']


In [11]:
df[text_cols] = df[text_cols].apply(lambda x: x.str.strip().str.lower())


**Handle Missing Values**:
   - 🔹 Fill missing `duration` values with the median.
   - 🔹 Fill missing `language` values with the mode.
   - 🔹 Drop the `collaboration` column due to excessive missing values.
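
The three steps above can be sketched on a toy frame (illustrative values only). Note that `fillna` returns a new Series, so the result has to be assigned back to the column for the imputation to persist.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in each of the three columns (illustrative values only).
demo = pd.DataFrame({
    "duration": [282.0, 127.0, np.nan, 214.0],
    "language": ["english", None, "english", "spanish"],
    "collaboration": [None, None, None, "x"],
})

# Assign the result back; calling fillna without assignment leaves demo unchanged.
demo["duration"] = demo["duration"].fillna(demo["duration"].median())
demo["language"] = demo["language"].fillna(demo["language"].mode()[0])
demo = demo.drop(columns=["collaboration"])

# Every remaining column should now report zero missing values.
print(demo.isna().sum())
```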

In [12]:
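# Preview median imputation. fillna returns a new Series and the result is not
# assigned here, so df['duration'] keeps its NaN values (see the isna() check below).
# To persist it: df['duration'] = df['duration'].fillna(df['duration'].median())
df['duration'].fillna(df['duration'].median())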


0        282.0
1        127.0
2        240.0
3        214.0
4        273.0
         ...  
49995    272.0
49996    355.0
49997    207.0
49998    266.0
49999    188.0
Name: duration, Length: 50000, dtype: float64

In [13]:
df['duration'].isna()

0        False
1        False
2         True
3        False
4        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Name: duration, Length: 50000, dtype: bool

In [14]:
# Preview only: the result is not assigned, so df['language'] keeps its missing values.
df['language'].fillna(df['language'].mode()[0])  # Fill missing with mode


0        english
1        english
2        english
3        english
4        english
          ...   
49995    spanish
49996     korean
49997    spanish
49998    english
49999    english
Name: language, Length: 50000, dtype: object

In [15]:
df.drop(columns=['collaboration'], inplace=True)  # Drop collaboration column due to excessive missing values


**Handle Outliers**
📊 Apply the **IQR method** to cap extreme values at the 1.5 × IQR fences (values are clipped, not dropped) in:  
🎵 `duration`, ⭐ `popularity`, and 📈 `stream`.
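
As a worked example of the fences, the sketch below uses five duration values taken from the `describe()` output above (min, quartiles, max). It swaps the notebook's `np.where` calls for `Series.clip`; `iqr_bounds` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Min, Q1, median, Q3, and max of duration from the describe() output.
duration = pd.Series([33.0, 206.0, 240.0, 273.0, 433.0])

lower, upper = iqr_bounds(duration)
capped = duration.clip(lower=lower, upper=upper)  # values outside the fences are pulled in
n_capped = int((capped != duration).sum())

print(f"fences: [{lower}, {upper}], values capped: {n_capped}")
```

With Q1 = 206 and Q3 = 273 the IQR is 67, so the fences are 206 − 100.5 = 105.5 and 273 + 100.5 = 373.5; both the 33 s and 433 s songs get capped.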

In [16]:
# Cap outliers at the IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
for col in ['duration', 'popularity', 'stream']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
    df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])


In [17]:
lower_bound

np.float64(-49703187.5)

In [18]:
upper_bound

np.float64(150126932.5)

In [19]:
df.head()

| | song_id | song_title | artist | album | genre | release_date | duration | popularity | stream | language | explicit_content | label | composer | producer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SP0001 | space executive series. | sydney clark | what. | electronic | 1997-11-08 | 282.0 | 42.0 | 35055874.0 | english | Yes | def jam | amy hatfield | jeffrey weaver |
| 1 | SP0002 | price last painting. | connor peters dds | nature politics. | electronic | 2015-05-10 | 127.0 | 50.0 | 9249527.0 | english | Yes | universal music | jason gregory | kenneth white |
| 2 | SP0003 | piece. | anna keith | visit. | pop | 2024-07-08 | NaN | 10.0 | 76669110.0 | english | Yes | universal music | rachel lopez | jason barnes |
| 3 | SP0004 | power industry your. | zachary simpson | behavior evening. | hip-hop | 2022-08-15 | 214.0 | 86.0 | 34732016.0 | english | No | sony music | thomas li | mrs. becky palmer |
| 4 | SP0005 | food animal second. | christopher mcgee | front. | pop | 2023-03-05 | 273.0 | 63.0 | 96649372.0 | english | Yes | def jam | adam wagner | beverly baker |


🎵 **Save Cleaned Data** 🎵
💾 Export the cleaned dataset as **`cleaned_spotify_songs.csv`**.
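
One thing to keep in mind when exporting: CSV stores dates as plain text, so the datetime dtype is lost unless the column is parsed again on reload. A minimal round-trip sketch, using an in-memory buffer as a stand-in for the file and illustrative values:

```python
import io
import pandas as pd

# Toy frame with a datetime column (illustrative values only).
demo = pd.DataFrame({
    "song_id": ["SP0001", "SP0002"],
    "release_date": pd.to_datetime(["1997-11-08", "2015-05-10"]),
})

buffer = io.StringIO()            # in-memory stand-in for the CSV file
demo.to_csv(buffer, index=False)  # index=False avoids writing an extra index column
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the file back.
reloaded = pd.read_csv(buffer, parse_dates=["release_date"])
print(reloaded.dtypes)
```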


In [22]:
df.to_csv("cleaned_spotify_songs.csv", index=False)



In [23]:
print("Data cleaning complete. Cleaned file saved as 'cleaned_spotify_songs.csv'.")

Data cleaning complete. Cleaned file saved as 'cleaned_spotify_songs.csv'.


