# EDA Dataset Spotify

- This Jupyter notebook covers the full extraction, cleaning, and loading process for one of the project's two datasets: the Spotify dataset, which is read from a CSV file. Once all transformations are applied, the result is saved to a pickle file so that other notebooks can load it.
- The pickle library was chosen for its ease of use within Python and the Jupyter environment.


## Library Importation
- Pandas and Matplotlib: Pandas handles data loading, cleaning, and exploratory data analysis in an organized way; Matplotlib complements it with plotting for statistical and graphical analysis.
- Pickle: A Python standard-library module for serializing objects, which makes it easy to persist data and pass it between processes or notebooks.
- re: The standard-library module for regular expressions. It lets us find patterns and search for occurrences across records, which helps when cleaning fields with recurring unwanted characters that are hard to spot in a large dataset.

In [22]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import re

## Data Retrieval from CSV Path and Initial Exploration of Fields and Information in Our Dataset
- The ".info()" output shows 113999 non-null entries in the fields artists, album_name, and track_name, fewer than the total number of rows. This suggests that each of those fields is missing a value somewhere in the dataframe, most likely a NaN.

In [23]:
Spotify_data = '../Data/spotify_dataset.csv' 
spotify_dataset = pd.read_csv(Spotify_data, delimiter=',') 
print(spotify_dataset.head())
print(spotify_dataset.info())

   Unnamed: 0                track_id                 artists  \
0           0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   
1           1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   
2           2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   
3           3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   
4           4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   

                                          album_name  \
0                                             Comedy   
1                                   Ghost (Acoustic)   
2                                     To Begin Again   
3  Crazy Rich Asians (Original Motion Picture Sou...   
4                                            Hold On   

                   track_name  popularity  duration_ms  explicit  \
0                      Comedy          73       230666     False   
1            Ghost - Acoustic          55       149610     False   
2              To Begin Again          57       210826     False   



## Null Values: Verification of the Existence of NaN Values in Our Dataset

- We list whether NaN values exist in each field of the dataset and find 3 fields that contain NaN values: artists, album_name, and track_name.

In [24]:
nan_values = spotify_dataset.isna().any()

print("NaN Values per Column:")
print(nan_values)

NaN Values per Column:
Unnamed: 0          False
track_id            False
artists              True
album_name           True
track_name           True
popularity          False
duration_ms         False
explicit            False
danceability        False
energy              False
key                 False
loudness            False
mode                False
speechiness         False
acousticness        False
instrumentalness    False
liveness            False
valence             False
tempo               False
time_signature      False
track_genre         False
dtype: bool
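
The check above only reports whether a column contains any NaN. As a rough sketch on a hypothetical toy frame (the `toy` data is invented for illustration and is not the real dataset), `isna().sum()` gives the count per column, and a row-wise mask shows which records are affected:

```python
import pandas as pd

# Hypothetical toy frame standing in for spotify_dataset
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "artists": ["Artist A", None, "Artist C"],
    "album_name": ["Album A", None, "Album C"],
})

# Count missing values per column instead of returning True/False
nan_counts = toy.isna().sum()
print(nan_counts)

# Select the rows that contain at least one missing value
rows_with_nan = toy[toy.isna().any(axis=1)]
print(rows_with_nan)
```

Inspecting the affected rows before dropping them helps confirm that the missing values all sit in the same record.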


## Null Values: Removal of the Null Values Found in the Dataset and Subsequent Visualization of the Fields After the Change

In [25]:
spotify_dataset = spotify_dataset.dropna()
print(spotify_dataset.info())

<class 'pandas.core.frame.DataFrame'>
Index: 113999 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        113999 non-null  int64  
 1   track_id          113999 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        113999 non-null  int64  
 6   duration_ms       113999 non-null  int64  
 7   explicit          113999 non-null  bool   
 8   danceability      113999 non-null  float64
 9   energy            113999 non-null  float64
 10  key               113999 non-null  int64  
 11  loudness          113999 non-null  float64
 12  mode              113999 non-null  int64  
 13  speechiness       113999 non-null  float64
 14  acousticness      113999 non-null  float64
 15  instrumentalness  113999 non-null  float64
 16  liveness          113999 
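
The index line above still reads "0 to 113999" even though a row was dropped, because `dropna` keeps the original row labels. A minimal sketch on a hypothetical toy frame (invented data, not the real dataset) shows that behavior, along with the `subset` parameter, which restricts the check to chosen columns:

```python
import pandas as pd

# Hypothetical toy frame; row 1 has a missing artist
toy = pd.DataFrame({
    "artists": ["Artist A", None, "Artist C", "Artist D"],
    "track_name": ["Track A", "Track B", "Track C", "Track D"],
    "popularity": [73, 55, 57, 82],
})

# Drop rows only when one of the listed columns is missing
cleaned = toy.dropna(subset=["artists", "track_name"])
print(cleaned.index.tolist())  # original labels are kept, so label 1 is skipped

# Optional: renumber the rows so the index is contiguous again
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to reset the index is a design choice: keeping the original labels makes it possible to trace rows back to the CSV file.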

## Removal of the Unnamed Field: Field that Does Not Provide Additional Information or Utility in Our Analysis
- The Unnamed: 0 field was removed because it is redundant: it only duplicates the default pandas index values, which the dataframe already provides.

In [26]:
spotify_dataset = spotify_dataset.drop(columns=['Unnamed: 0'])
print(spotify_dataset.info())

<class 'pandas.core.frame.DataFrame'>
Index: 113999 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          113999 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        113999 non-null  int64  
 5   duration_ms       113999 non-null  int64  
 6   explicit          113999 non-null  bool   
 7   danceability      113999 non-null  float64
 8   energy            113999 non-null  float64
 9   key               113999 non-null  int64  
 10  loudness          113999 non-null  float64
 11  mode              113999 non-null  int64  
 12  speechiness       113999 non-null  float64
 13  acousticness      113999 non-null  float64
 14  instrumentalness  113999 non-null  float64
 15  liveness          113999 non-null  float64
 16  valence           113999 
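
As an alternative to dropping the column afterwards, the unnamed column can be consumed while reading the file. The sketch below uses a hypothetical in-memory CSV (the `csv_text` content is invented for illustration) to compare the default read with `index_col=0`:

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has no header and holds the row index
csv_text = ",track_id,popularity\n0,a1,73\n1,b2,55\n"

# Default read: the unnamed column appears as 'Unnamed: 0'
with_unnamed = pd.read_csv(io.StringIO(csv_text))
print(list(with_unnamed.columns))

# index_col=0 uses that column as the index, so there is nothing to drop
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(as_index.columns))
```

Both approaches end with the same data columns; the notebook's explicit `drop` keeps the removal visible as its own step.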

## Duplicate Rows: Verification of the Existence of Duplicate Values in Our Dataset

In [27]:
duplicates = spotify_dataset.duplicated()

num_duplicates = duplicates.sum()
print("Number of duplicate rows:", num_duplicates)



Number of duplicate rows: 450


## Duplicate Rows: Removal of the Duplicate Values Found in the Dataset

In [28]:
spotify_dataset = spotify_dataset.drop_duplicates()
print("New size of the DataFrame after removing duplicates:", spotify_dataset.shape)

New size of the DataFrame after removing duplicates: (113549, 20)
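
`duplicated()` with no arguments flags only rows that are identical in every column. If the same track can appear in several rows that differ in one field (for example `track_genre`; this is an assumption about the data, not something verified in this notebook), a `subset` check reveals those as well. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame: 'a1' appears three times, twice with identical values
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "a1", "b2"],
    "track_genre": ["acoustic", "acoustic", "pop", "rock"],
})

# Rows identical in every column (what the notebook counts)
full_dupes = toy.duplicated().sum()

# Rows whose track_id was already seen, regardless of the other columns
id_dupes = toy.duplicated(subset=["track_id"]).sum()
print(full_dupes, id_dupes)

# drop_duplicates() removes only the fully identical rows
deduped = toy.drop_duplicates()
print(deduped.shape)
```

Whether repeated `track_id` values should also be removed depends on the analysis; the notebook removes only fully identical rows.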


## Exploration of Unique Values: 'album_name' Field
- Exploring the dataset, we find records contaminated by special characters or numbers in fields that should contain only names: the artist name, the album name, and the track (song) name.
- In this example, we list all the unique values in the 'album_name' field. Many do not match what is expected in the dataset: names containing special characters such as "!", "#", and "(", records with meaningless numbers, and characters from other languages.

In [31]:
album_names_unique = spotify_dataset['album_name'].unique()
album_names_unique_sorted = sorted(album_names_unique)

for album_name in album_names_unique_sorted:
    print(album_name)

! ! ! ! ! Whispers ! ! ! ! !
! ! ! ! 300 Sounds of the Ocean ! ! ! !
! ! % > (( Shelter )) < % ! !
! !"#Reboot#"! !
!!!" Baby Sleep Aid Rain Sounds "!!!
!!" White Noise 10 Hours "!!
!!Puro Macanazo!!
!I'll Be Back!
" ! + restful + ! "
"1" (壹)
"4"
"A Divina Comédia Ou Ando Meio Desligado"
"Attack on Titan" Season 3 Original Soundtrack
"Fairy Tail" Original Soundtrack Vol.1
"Fairy Tail" Original Soundtrack Vol.3
"Forever King"
"GIGS" CASE OF BOφWY COMPLETE (Live From "Gigs" Case Of Boowy / 1987)
"GIGS" JUST A HERO TOUR 1986 (Live)
"GIGS" JUST A HERO TOUR 1986 NAKED (Live)
"I Could Go Anywhere But Again I Go With You"
"I Liked His Old Stuff Better"
"Jardim Eletrico"
"LED Spirals" (Extended Version) [From "John Wick"]
"Let's Rock"
"Mutantes E Seus Cometas No Pais Do Baurets"
"Mutantes"
"Strings and Stories of a Troubadour", Live in Odeon, Vienna 2011
"Sus Grandes Exitos"
"Was He Slow?" (Music From The Motion Picture Baby Driver)
"Watch out!"
"Weird Al" Yankovic
"和自己對話" 實驗專輯
# (Hashtag)
#00

## Filtering the Field to Only Include Records Containing Letters and No Special Characters That Contaminate the Dataset

In [32]:
filtered_df = spotify_dataset[spotify_dataset['album_name'].str.match(r'^[A-Z]') & spotify_dataset['album_name'].str.match(r'^[A-Za-z\s]*$')]
spotify_dataset = filtered_df
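
To make the effect of the two patterns concrete, the sketch below applies them to a few hypothetical names (the `names` series is invented for illustration). Note that the character class `A-Za-z` also rejects accented letters and digits, so some legitimate titles are dropped along with the dirty ones:

```python
import pandas as pd

# Hypothetical sample of names
names = pd.Series([
    "Comedy",             # kept
    "Ghost (Acoustic)",   # dropped: parentheses
    "A Divina Comédia",   # dropped: 'é' is outside A-Za-z
    "hold on",            # dropped: does not start with A-Z
    "1989",               # dropped: digits
])

# The same two conditions used in the notebook
starts_upper = names.str.match(r'^[A-Z]')
only_letters = names.str.match(r'^[A-Za-z\s]*$')
mask = starts_upper & only_letters
print(names[mask].tolist())

# A single pattern expressing both conditions at once
single = names.str.fullmatch(r'[A-Z][A-Za-z\s]*')
print(single.equals(mask))
```

The single-pattern form is only a possible simplification; it selects the same rows as the two-condition mask on this sample.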

## Verification of the Filter: Re-Inspecting the Unique Values of 'album_name'
- With the field filtered, we reuse the code that listed the dirty records to verify that the filter was applied successfully.
- To see the change in unique values correctly, run all the cells in order, from top to bottom, so that the cleaning steps are applied in sequence. Otherwise, both listings would show the same unique values with the filter already applied.

In [33]:
album_names_unique = spotify_dataset['album_name'].unique()
album_names_unique_sorted = sorted(album_names_unique)
for album_name in album_names_unique_sorted:
    print(album_name)

A Arte De Ivete Sangalo
A Arte De Jair Rodrigues
A Arte De Nando Reis
A Arte De Os Paralamas Do Sucesso
A Arte De Raul Seixas
A Arte De Zeca Pagodinho
A Arte Do Barulho
A Arte do Insulto
A B Y S S
A Bad Dream That Will Pass Away
A Banda Mais Bonita da Cidade
A Beautiful Lie
A Beautiful Place To Drown
A Bela e a Fera
A Blazing Grace
A Boca Dela
A Book Of Flying
A Brief Inquiry Into Online Relationships
A Business Proposal OST
A Cada Beijo
A Cara Metade das Vaquejadas
A Carolina Jubilee
A Casa Sou Eu
A Celtic Dream
A Certain Distance
A Chabuca
A Chapter of Accidents
A Christmas Duel
A Cocochito
A Collection
A Collection of Depravation
A Collection of Fleeting Moments and Daydreams
A Collection of Past Works
A Collection of Soothing Night Rain
A Comprehensive Guide to Moderne Rebellion
A Comunidade
A Conquista
A Country Boy Singing His Heart Out
A Crooked Road
A Date with Elvis
A Date with The Everly Brothers
A Dave Brubeck Christmas
A Day In The Life
A Day in a Life
A Day in the Life
A D


## We follow the same process with the fields 'artists' and 'track_name', which were also identified with dirty records
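
Since the same filter is repeated for each field, it could be wrapped in a small helper. The function name `keep_plain_names` and the toy data below are hypothetical, introduced only for this sketch:

```python
import pandas as pd

def keep_plain_names(df, column):
    """Keep rows whose `column` starts with A-Z and holds only letters and whitespace."""
    col = df[column]
    mask = col.str.match(r'^[A-Z]') & col.str.match(r'^[A-Za-z\s]*$')
    return df[mask]

# Hypothetical toy frame standing in for spotify_dataset
toy = pd.DataFrame({
    "artists": ["ABBA", "Ingrid Michaelson;ZAYN", "Absu"],
    "album_name": ["Gold", "To Begin Again", "#00"],
    "track_name": ["Waterloo", "To Begin Again", "Intro"],
})

# Apply the same filter to each name field in turn
cleaned = toy
for field in ["album_name", "artists", "track_name"]:
    cleaned = keep_plain_names(cleaned, field)
print(cleaned)
```

Passing the column name once per call means both conditions always test the same field.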

## Exploration and Removal of Dirty Unique Values: 'artists' Field

In [34]:
filtered_df = spotify_dataset[spotify_dataset['artists'].str.match(r'^[A-Z]') & spotify_dataset['artists'].str.match(r'^[A-Za-z\s]*$')]
spotify_dataset = filtered_df

In [35]:
artists_unique = spotify_dataset['artists'].unique()
artists_unique_sorted = sorted(artists_unique)
for artist in artists_unique_sorted:
    print(artist)

A Banda Mais Bonita da Cidade
A Day To Remember
A Feast for Kings
A Fine Frenzy
A Flock Of Seagulls
A Forest Mighty Black
A Great Big World
A Life Once Lost
A Mose
A R I Z O N A
A Rocket To The Moon
A Skylit Drive
A Tribe Called Quest
A Vida Music
A Wilhelm Scream
ABBA
ABC
ABC Kids
ACxDC
AFI
AFKAP
AGA
AHOSHI
AIRMANN
AJ Mitchell
AJJ
AJR
ALEX
ALEXANDRE APOSAN
ALMAR
AMAN
AMARIA BB
AMK
AMONGST THE ASHES
ANDY SVGE
ANNA
AP Dhillon
ARTBAT
ARTY
ASTN
ATB
ATLiens
ATTLAS
AVOID
Aamir Khan
Aaren Hughes
Aaron Carl
Aaron Espe
Aaron Karo
Aaron Keyes
Aaron Shust
Aaron Watson
Aaryan Shah
Ababeel
Abandoned
Abayomy Afrobeat Orquestra
Abbath
Abby Cates
Abdul Hannan
Abel Korzeniowski
Abel Pintos
Aberdeen
Aberola
Abfahrt Hinwil
Abhilasha Sinha
Aborted
Abraham Mateo
Absofacto
Absolute Valentine
Abstract Void
Absu
Accept
Acerina Y Su Danzonera
AcesToAces
Achillea
Acid Ghost
Acid Pauli
Acoustic Guitar Collective
Acoustic Syndicate
AcousticTrench
Acranius
Activator
Actress
Ad Dios
Ada
Adalberto Santiago
Adam Bey

## Exploration and Removal of Dirty Unique Values: 'track_name' Field

In [36]:
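# Note: the second condition tests 'artists', not 'track_name', so only the
# leading capital letter is enforced on track names (hence the punctuation and
# accented letters in the output below). Use 'track_name' in both conditions
# to apply the same filter as for the other fields.
filtered_df = spotify_dataset[spotify_dataset['track_name'].str.match(r'^[A-Z]') & spotify_dataset['artists'].str.match(r'^[A-Za-z\s]*$')]
spotify_dataset = filtered_df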

In [37]:
track_names_unique = spotify_dataset['track_name'].unique()
track_names_unique_sorted = sorted(track_names_unique)
for track_name in track_names_unique_sorted:
    print(track_name)

A Arte do Insulto
A Bad Dream
A Bad Dream That Will Pass Away
A Barrel Of Monkeys
A Basket of Kisses
A Batalha do Arcanjo
A Beautiful Distraction
A Beautiful Lie
A Beautiful Thing
A Bela e a Fera
A Beleza É Você Menina
A Better World Is Possible
A Bitter Rain
A Boca Dela
A Bridge Over Novocaine
A Broken Alphabet
A Burden
A Buried Sun
A Cada Beijo
A Carta - Ao Vivo
A Casa Caiu
A Casa ao Lado
A Casa é Sua
A Celebration For The Death Of Man... (2016 - Remaster)
A Celtic Dream
A City in Ruins
A Coming Race
A Comunidade Chora
A Cosy Bed for Ted
A Crooked Road
A Daisy Chain 4 Satan
A Day In The Life
A Day In The Life Of A Day
A Day Without a War
A Day in a Life
A Deathpact Most Imminent
A Demon's Fate
A Desolation Song (2016 - Remaster)
A Dios Le Pido
A Distant Shade of Green
A Don Agustín Bardi
A Donde Iras
A Donde Voy (Revamp)
A Dream Is a Wish Your Heart Makes (From "Cinderella")
A Dream's Frozen Reflection
A Drop In the Ocean
A Drop in the Ocean
A Drunk Can't Be a Man
A Dying God Coming 

## Visualization of Dataset After Processing 

In [38]:
print(spotify_dataset.head())
print(spotify_dataset.info())

                  track_id           artists            album_name  \
0   5SuOikwiRyPMVoIQDJUgSV       Gen Hoshino                Comedy   
4   5vjLSffimiIP26QG5WcN2K  Chord Overstreet               Hold On   
5   01MVOl9KtVTNfFiBU9I7dc      Tyrone Wells  Days I Will Remember   
9   7k9GuJYLp2AzqokyEdwEw2    Ross Copperman                Hunger   
10  4mzP5mHkRvGxdhdGdAH7EJ      Zack Tabudlo               Episode   

              track_name  popularity  duration_ms  explicit  danceability  \
0                 Comedy          73       230666     False         0.676   
4                Hold On          82       198853     False         0.618   
5   Days I Will Remember          58       214240     False         0.688   
9                 Hunger          56       205594     False         0.442   
10  Give Me Your Forever          74       244800     False         0.627   

    energy  key  loudness  mode  speechiness  acousticness  instrumentalness  \
0    0.461    1    -6.746     0     

## Saving Cleaned Dataset as a Pickle File

In [17]:
with open('spotify_dataset.pkl', 'wb') as f:
    pickle.dump(spotify_dataset, f)
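
To read the file back from another notebook, the mirror operation is `pickle.load` on a file opened in binary mode. The sketch below is a round trip on a hypothetical toy frame, written to a temporary path (an assumption made so that the example does not touch `spotify_dataset.pkl`):

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset
toy = pd.DataFrame({"track_name": ["Comedy", "Hold On"], "popularity": [73, 82]})

# Hypothetical temporary location for the pickle file
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")

# Save, as in the cell above
with open(path, "wb") as f:
    pickle.dump(toy, f)

# Load, as another notebook would
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored frame has the same values and dtypes as the original
print(restored.equals(toy))
```

`pd.read_pickle(path)` would load the same file in one call. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.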


## Unused Option, but Included in Case You Want to View the Cleaned Dataset in CSV Format

In [8]:
#Save the DataFrame to a CSV file
#spotify_dataset.to_csv("cleaned_spotify_dataset.csv", index=False)