# Data Cleaning And Wrangling

>**Importing necessary libraries**
\
*<small>Pandas for data manipulation and numpy for numerical operations</small>*

In [2]:
import pandas as pd
import numpy as np

>**Loading Spotify dataset from the CSV file**
\
*<small>Using ISO-8859-1 encoding to handle character encoding issues</small>*

In [3]:
file_path = "spotify-2023.csv"
df = pd.read_csv(file_path, encoding='ISO-8859-1')
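
As a minimal sketch of why the `encoding` argument matters, the snippet below decodes a tiny in-memory sample (hypothetical contents, not the real dataset) by trying UTF-8 first and falling back to ISO-8859-1. The helper name `read_csv_with_fallback` is an assumption made for illustration and is not part of pandas.

```python
import io
import pandas as pd

def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "ISO-8859-1")):
    """Decode raw CSV bytes with the first encoding that works.

    Returns a (DataFrame, encoding_used) tuple.
    """
    for enc in encodings:
        try:
            text = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return pd.read_csv(io.StringIO(text)), enc
    raise ValueError("none of the candidate encodings could decode the data")

# Hypothetical sample: 'é' stored as the single byte 0xE9, which is invalid as UTF-8.
sample = "track_name,streams\nCafé,100\n".encode("ISO-8859-1")
sample_df, used = read_csv_with_fallback(sample)
print(used, sample_df.loc[0, "track_name"])
```

ISO-8859-1 maps every possible byte to a character, so it never raises a decoding error. That makes it a convenient fallback, but it also means text that was really stored in another encoding is read without complaint and shows up as garbled characters later.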

>**Selecting only the columns relevant for analysis**
\
*<small>Track details, release information, and key audio features</small>*

In [4]:
selected_columns = [
    'track_name',
    'artist(s)_name',
    'released_year',
    'released_month',
    'released_day',
    'streams',
    'danceability_%',
    'valence_%',
    'energy_%',
    'acousticness_%',
    'liveness_%',
    'speechiness_%'
]
df = df[selected_columns]
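
Selecting with a list raises a `KeyError` if any requested column is absent. A small sketch of checking the names first, using a hypothetical miniature frame (`demo`) rather than the Spotify data:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
demo = pd.DataFrame({
    "track_name": ["LALA"],
    "streams": ["133716286"],
    "bpm": [92],
})
wanted = ["track_name", "streams", "released_year"]

# Split the requested names into those that exist and those that do not.
missing = [c for c in wanted if c not in demo.columns]
present = [c for c in wanted if c in demo.columns]

# .copy() makes the subset an independent frame, so later column
# assignments are not flagged as writes to a slice of the original.
subset = demo[present].copy()
print(missing, list(subset.columns))
```

Whether a missing column should be skipped or treated as an error depends on the analysis; here it is only reported.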

>**Displaying a preview of the first few rows of the data for quick verification**

In [5]:
df.head()

| | track_name | artist(s)_name | released_year | released_month | released_day | streams | danceability_% | valence_% | energy_% | acousticness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2023 | 7 | 14 | 141381703 | 80 | 89 | 83 | 31 | 8 | 4 |
| 1 | LALA | Myke Towers | 2023 | 3 | 23 | 133716286 | 71 | 61 | 74 | 7 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 2023 | 6 | 30 | 140003974 | 51 | 32 | 53 | 17 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 2019 | 8 | 23 | 800840817 | 55 | 58 | 72 | 11 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 2023 | 5 | 18 | 303236322 | 65 | 23 | 80 | 14 | 11 | 6 |


>**Re-encoding text columns from Windows-1252 to UTF-8**
\
*<small>Re-encoding ensures consistency and compatibility: in the raw dataset, the first column at row 23 (index 21) contains garbled characters caused by encoding issues</small>*

In [5]:
print(df.loc[df.index[21],'track_name'])

I Can See You (Taylorï¿½ï¿½ï¿½s Version) (From The 


>**Selecting the columns that need re-encoding**

In [6]:
encoding_columns = [
    'track_name',
    'artist(s)_name',
]

>*<small>This can also be done by opening the CSV file in Notepad and saving it with UTF-8 encoding</small>*

In [7]:
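# Turn each string back into bytes, then decode those bytes as UTF-8.
# errors='ignore' silently drops any character that cannot be converted.
for column in encoding_columns:
    df[column] = df[column].str.encode('windows-1252', errors='ignore').str.decode('utf-8', errors='ignore')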

>**Preview of the re-encoded text**

In [8]:
print(df.loc[df.index[21],'track_name'])

I Can See You (Taylor���s Version) (From The 
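
The encode/decode round trip used above can be sketched on a single string. The sample strings below are illustrative assumptions, not rows read from the dataset. The sketch also hints at why the preview still shows `�`: the bytes `EF BF BD` (displayed as `ï¿½` when read as ISO-8859-1) are the UTF-8 encoding of U+FFFD, the Unicode replacement character, which suggests the original apostrophe was already lost before the file was saved.

```python
# A curly apostrophe (U+2019) is three bytes in UTF-8: E2 80 99.
original = "Taylor\u2019s Version"

# Decoding UTF-8 bytes with the wrong codec produces mojibake.
garbled = original.encode("utf-8").decode("windows-1252")

# Reversing the mistake: back to the same bytes, then decode them correctly.
repaired = garbled.encode("windows-1252").decode("utf-8")

# Assumed case: the file already stores the replacement character (EF BF BD).
lost = b"Taylor\xef\xbf\xbds".decode("ISO-8859-1")
still_lost = lost.encode("windows-1252").decode("utf-8")

print(repr(garbled))
print(repaired == original)
print(repr(still_lost))  # contains U+FFFD, the apostrophe is not recoverable
```

One caveat worth keeping in mind: ISO-8859-1 and Windows-1252 differ for bytes `0x80` to `0x9F`, so text read as ISO-8859-1 and re-encoded as Windows-1252 with `errors='ignore'` can lose characters in that range.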


> **Saving the DataFrame to a new CSV file** \
> *<small>The new CSV file is named "updated-spotify-2023.csv" and is written without the index</small>*


In [9]:
output_file_path = "updated-spotify-2023.csv"
df.to_csv(output_file_path, index=False, encoding='utf-8')
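
A quick way to gain confidence in the saved file is to read it back and compare it with the frame in memory. The sketch below does this with a hypothetical two-row frame and a temporary directory, so it does not touch `updated-spotify-2023.csv`:

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with a non-ASCII character to exercise the encoding.
demo = pd.DataFrame({
    "track_name": ["Café del Mar", "LALA"],
    "streams": [100, 200],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # index=False keeps the row index out of the file,
    # so no extra unnamed column appears on reload.
    demo.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

round_trip_ok = reloaded.equals(demo)
print(round_trip_ok, list(reloaded.columns))
```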

>**Displaying a preview of the cleaned data**

In [10]:
df.head()

| | track_name | artist(s)_name | released_year | released_month | released_day | streams | danceability_% | valence_% | energy_% | acousticness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2023 | 7 | 14 | 141381703 | 80 | 89 | 83 | 31 | 8 | 4 |
| 1 | LALA | Myke Towers | 2023 | 3 | 23 | 133716286 | 71 | 61 | 74 | 7 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 2023 | 6 | 30 | 140003974 | 51 | 32 | 53 | 17 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 2019 | 8 | 23 | 800840817 | 55 | 58 | 72 | 11 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 2023 | 5 | 18 | 303236322 | 65 | 23 | 80 | 14 | 11 | 6 |


># **Special-character encoding issues can also be fixed by opening the CSV file in Notepad and saving it with UTF-8 encoding**