# Exploration, pre-processing, and splitting the dataset

This notebook covers data exploration, preprocessing, and preparation of the `tcc_ceds_music.csv` dataset.

Structure:

- Inspecting the structure and size of the dataset

- Checking for and handling duplicates and missing values

- Filtering out irrelevant songs and columns

- Stratified splitting of the dataset into training, validation, and test sets based on genre distribution

- Saving the subsets to the folder `music_dataset_split/`


<br>
Inspiration for the stratified split:

- [Youtube video](https://www.youtube.com/watch?v=ixBbAZDS7TU)
- [ChatGPT - implementation help of the function](https://chatgpt.com/)

In [56]:
try:
    # Comment out if not using colab
    from google.colab import drive
    drive.mount('/content/drive')

    %cd "/content/drive/Othercomputers/Min MacBook Pro/semester_project_info371"
    #using_colab = True
except ImportError:
    print("Not using Google Colab")
    #using_colab = False

Not using Google Colab


In [57]:
import pandas as pd
import numpy as np

### Loading Dataset

In [58]:
df = pd.read_csv('tcc_ceds_music.csv')
df.head(5)

,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


### Exploring

In [59]:
print(f'Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}')

Number of rows: 28372, Number of columns: 31


In [60]:
df.isnull().sum()

Unnamed: 0                  0
artist_name                 0
track_name                  0
release_date                0
genre                       0
lyrics                      0
len                         0
dating                      0
violence                    0
world/life                  0
night/time                  0
shake the audience          0
family/gospel               0
romantic                    0
communication               0
obscene                     0
music                       0
movement/places             0
light/visual perceptions    0
family/spiritual            0
like/girls                  0
sadness                     0
feelings                    0
danceability                0
loudness                    0
acousticness                0
instrumentalness            0
valence                     0
energy                      0
topic                       0
age                         0
dtype: int64

Renaming `Unnamed: 0` to `id` and renumbering it sequentially from 1

In [61]:
df = df.rename(columns={'Unnamed: 0': 'id'})
df['id'] = range(1, len(df) + 1)
df

,id,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,1,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.137110,sadness,1.000000
1,2,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.647540,0.954819,0.000002,0.325021,0.263240,world/life,1.000000
2,3,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.002770,0.002770,0.002770,...,0.002770,0.225422,0.456298,0.585288,0.840361,0.000000,0.351814,0.139112,music,1.000000
3,4,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.775350,0.743736,romantic,1.000000
4,5,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.001350,0.001350,0.417772,...,0.068800,0.001350,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28367,28368,mack 10,10 million ways,2019,hip hop,cause fuck leave scar tick tock clock come kno...,78,0.001350,0.001350,0.001350,...,0.065664,0.001350,0.889527,0.759711,0.062549,0.000000,0.751649,0.695686,obscene,0.014286
28368,28369,m.o.p.,ante up (robbin hoodz theory),2019,hip hop,minks things chain ring braclets yap fame come...,67,0.001284,0.001284,0.035338,...,0.001284,0.001284,0.662082,0.789580,0.004607,0.000002,0.922712,0.797791,obscene,0.014286
28369,28370,nine,whutcha want?,2019,hip hop,get ban get ban stick crack relax plan attack ...,77,0.001504,0.154302,0.168988,...,0.001504,0.001504,0.663165,0.726970,0.104417,0.000001,0.838211,0.767761,obscene,0.014286
28370,28371,will smith,switch,2019,hip hop,check check yeah yeah hear thing call switch g...,67,0.001196,0.001196,0.001196,...,0.001196,0.001196,0.883028,0.786888,0.007027,0.000503,0.508450,0.885882,obscene,0.014286


Checking for duplicate songs, first by lyrics and then by `id`

In [62]:
duplicate_lyrics = df[df.duplicated(subset='lyrics', keep=False)]
print("Duplicate lyrics of songs:")
print(duplicate_lyrics)

Duplicate lyrics of songs:
Empty DataFrame
Columns: [id, artist_name, track_name, release_date, genre, lyrics, len, dating, violence, world/life, night/time, shake the audience, family/gospel, romantic, communication, obscene, music, movement/places, light/visual perceptions, family/spiritual, like/girls, sadness, feelings, danceability, loudness, acousticness, instrumentalness, valence, energy, topic, age]
Index: []

[0 rows x 31 columns]


In [63]:
duplicate_id = df[df.duplicated(subset='id', keep=False)]
print("Duplicate id of songs:")
print(duplicate_id)

Duplicate id of songs:
Empty DataFrame
Columns: [id, artist_name, track_name, release_date, genre, lyrics, len, dating, violence, world/life, night/time, shake the audience, family/gospel, romantic, communication, obscene, music, movement/places, light/visual perceptions, family/spiritual, like/girls, sadness, feelings, danceability, loudness, acousticness, instrumentalness, valence, energy, topic, age]
Index: []

[0 rows x 31 columns]
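
Both checks above rely on `keep=False`, which flags every member of a duplicate group rather than only the later repeats. The sketch below illustrates the difference on a small made-up frame (the rows are hypothetical, not taken from `tcc_ceds_music.csv`), and also shows that the same check can be run on an artist/title pair.

```python
import pandas as pd

# Hypothetical toy rows, used only to illustrate the behaviour of duplicated()
toy = pd.DataFrame({
    "artist_name": ["a", "a", "b", "c"],
    "track_name": ["x", "x", "y", "z"],
    "lyrics": ["la la", "la la", "do re", "mi fa"],
})

# keep=False marks every row that belongs to a duplicate group
all_dupes = toy[toy.duplicated(subset="lyrics", keep=False)]

# The default keep="first" marks only the repeats after the first occurrence
later_dupes = toy[toy.duplicated(subset="lyrics")]

# The same check on a combination of columns
pair_dupes = toy[toy.duplicated(subset=["artist_name", "track_name"], keep=False)]

print(len(all_dupes), len(later_dupes), len(pair_dupes))
```

With `keep=False`, an empty result means that no value occurs more than once, which is what the two outputs above show for `lyrics` and `id`.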


The different genres in our dataset

In [64]:
unike_genrer = df["genre"].unique()
print(unike_genrer)

['pop' 'country' 'blues' 'jazz' 'reggae' 'rock' 'hip hop']


Number of songs per genre

In [65]:
df['genre'].value_counts()

genre
pop        7042
country    5445
blues      4604
rock       4034
jazz       3845
reggae     2498
hip hop     904
Name: count, dtype: int64
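
The counts above are easier to judge as shares of the whole dataset. The sketch below re-enters the printed counts by hand and computes the percentage per genre and the ratio between the largest and smallest class; the variable names are illustrative only.

```python
import pandas as pd

# Genre counts copied from the value_counts() output above
counts = pd.Series(
    {"pop": 7042, "country": 5445, "blues": 4604, "rock": 4034,
     "jazz": 3845, "reggae": 2498, "hip hop": 904},
    name="count",
)

# Share of each genre in percent, rounded to one decimal
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# How many times larger the biggest genre is than the smallest
imbalance_ratio = counts.max() / counts.min()
print(f"largest / smallest genre: {imbalance_ratio:.2f}")
```

In the notebook itself, `df['genre'].value_counts(normalize=True)` should give the same shares directly. The imbalance between genres is the reason the split further down is stratified on `genre`.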

### Pre-processing

Removing songs with 15 words or fewer

In [66]:
nr_songs = df[df['len'] <= 15].shape[0]
print(nr_songs)

500


In [72]:
df_filtered = df[df['len'] > 15].reset_index(drop=True)
print(f'Number of rows: {df_filtered.shape[0]}, Number of columns: {df_filtered.shape[1]}')

Number of rows: 27872, Number of columns: 31
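
The filter is a boolean mask and its complement, so the removed and kept rows should always add up to the original row count (here 500 + 27872 = 28372). A minimal sketch on made-up lyrics follows; it assumes that `len` is a whitespace-separated word count, which the notebook does not verify, and it uses a threshold of 2 instead of 15 to keep the toy data small.

```python
import pandas as pd

# Hypothetical toy lyrics; the real notebook uses the precomputed 'len' column
toy = pd.DataFrame({"lyrics": ["one two three", "a b c d e", "x y"]})

# Assumption: word count = number of whitespace-separated tokens
toy["len"] = toy["lyrics"].str.split().str.len()

threshold = 2  # the notebook uses 15
is_short = toy["len"] <= threshold

# Keep only songs above the threshold and renumber the index
kept = toy[~is_short].reset_index(drop=True)

# Removed rows plus kept rows must equal the original row count
assert is_short.sum() + len(kept) == len(toy)
print(kept)
```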


Dropping columns that are not needed downstream, keeping only `id`, `release_date`, `genre`, `lyrics`, and `danceability`

In [73]:
columns_to_remove = [
    'artist_name', 'track_name', 'len', 'dating', 'violence', 'world/life', 'night/time',
    'shake the audience', 'family/gospel', 'romantic', 'communication', 'obscene', 'music',
    'movement/places', 'light/visual perceptions', 'family/spiritual',
    'like/girls', 'sadness', 'feelings', 'loudness',
    'acousticness', 'instrumentalness', 'valence', 'energy', 'topic', 'age'
]

df_filtered = df_filtered.drop(columns=columns_to_remove)
df_filtered.head()

,id,release_date,genre,lyrics,danceability
0,1,1950,pop,hold time feel break feel untrue convince spea...,0.357739
1,2,1950,pop,believe drop rain fall grow believe darkest ni...,0.331745
2,3,1950,pop,sweetheart send letter goodbye secret feel bet...,0.456298
3,4,1950,pop,kiss lips want stroll charm mambo chacha merin...,0.686992
4,5,1950,pop,till darling till matter know till dream live ...,0.291671


## Splitting into Train, Val and Test Sets

Stratified split (70% train, 15% validation, 15% test), preserving the genre distribution in each subset

In [75]:
from sklearn.model_selection import train_test_split
import os

In [76]:
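def stratified_split(df, stratify_col='genre', train_size=0.7, val_size=0.15, test_size=0.15, random_state=42):
    """Split df into train, val and test sets while preserving the distribution of stratify_col."""
    # The three fractions must cover the whole dataset
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")

    # First split into train and temp (val+test)
    train_df, temp_df = train_test_split(
        df,
        stratify=df[stratify_col],
        test_size=(1 - train_size),
        random_state=random_state
    )

    # Then split temp into val and test; the test share is relative to temp, not to df
    relative_test_size = test_size / (val_size + test_size)
    val_df, test_df = train_test_split(
        temp_df,
        stratify=temp_df[stratify_col],
        test_size=relative_test_size,
        random_state=random_state
    )

    # Reset the indices so each subset is numbered from 0
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True), test_df.reset_index(drop=True)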
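
The two-step idea in the function above can be tried on synthetic data before running it on the real dataset. The sketch below is a simplified stand-in rather than the notebook's function: the frame `demo` and its 60/30/10 genre mix are made up, and the fractions are passed as literals (0.3, then 0.5 of the held-out part).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical data, not from tcc_ceds_music.csv)
demo = pd.DataFrame({
    "id": range(1, 1001),
    "genre": ["pop"] * 600 + ["rock"] * 300 + ["hip hop"] * 100,
})

# Step 1: hold out 30% of the rows, stratified on genre
demo_train, demo_temp = train_test_split(
    demo, stratify=demo["genre"], test_size=0.3, random_state=42
)

# Step 2: split the held-out part in half into validation and test
demo_val, demo_test = train_test_split(
    demo_temp, stratify=demo_temp["genre"], test_size=0.5, random_state=42
)

# Each split should roughly preserve the 60/30/10 genre mix
for name, part in [("train", demo_train), ("val", demo_val), ("test", demo_test)]:
    mix = part["genre"].value_counts(normalize=True).round(2).to_dict()
    print(name, len(part), mix)

# No row may end up in more than one split
ids = [set(part["id"]) for part in (demo_train, demo_val, demo_test)]
assert not (ids[0] & ids[1]) and not (ids[0] & ids[2]) and not (ids[1] & ids[2])
```

The disjointness check at the end is cheap and worth repeating on the real splits, since overlapping rows between train and test would inflate evaluation scores.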

In [77]:
train_df, val_df, test_df = stratified_split(df_filtered, stratify_col='genre', train_size=0.7, val_size=0.15, test_size=0.15)

In [78]:
base_path = "music_dataset_split"
splits = {
    "Train": ("training_data.csv", train_df),
    "Val": ("validation_data.csv", val_df),
    "Test": ("test_data.csv", test_df)
}

In [79]:
for split_name, (file_name, split_df) in splits.items():
    path = os.path.join(base_path, split_name)
    os.makedirs(path, exist_ok=True)
    split_df.to_csv(os.path.join(path, file_name), index=False)

print("Stratified split complete and saved to 'music_dataset_split/'")

Stratified split complete and saved to 'music_dataset_split/'


In [84]:
df_train = pd.read_csv("music_dataset_split/Train/training_data.csv")
df_train['genre'].value_counts()

genre
pop        4870
country    3780
blues      3152
rock       2766
jazz       2582
reggae     1730
hip hop     630
Name: count, dtype: int64

In [85]:
df_test = pd.read_csv("music_dataset_split/Test/test_data.csv")
df_test['genre'].value_counts()

genre
pop        1044
country     810
blues       676
rock        592
jazz        553
reggae      371
hip hop     135
Name: count, dtype: int64

In [86]:
df_val = pd.read_csv("music_dataset_split/Val/validation_data.csv")
df_val['genre'].value_counts()

genre
pop        1044
country     810
blues       675
rock        593
jazz        553
reggae      371
hip hop     135
Name: count, dtype: int64
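
The three outputs above can be compared side by side to confirm that the split is stratified. The sketch below re-enters the printed counts by hand and computes the genre shares within each split, the largest difference in share for any genre, and the fraction of rows that landed in each split; the variable names are illustrative only.

```python
import pandas as pd

genres = ["pop", "country", "blues", "rock", "jazz", "reggae", "hip hop"]

# Counts copied from the three value_counts() outputs above
split_counts = pd.DataFrame(
    {
        "train": [4870, 3780, 3152, 2766, 2582, 1730, 630],
        "val": [1044, 810, 675, 593, 553, 371, 135],
        "test": [1044, 810, 676, 592, 553, 371, 135],
    },
    index=genres,
)

# Share of each genre within each split (every column sums to 1)
split_shares = split_counts / split_counts.sum()
print(split_shares.round(4))

# Largest difference in share for any single genre across the three splits
max_gap = (split_shares.max(axis=1) - split_shares.min(axis=1)).max()
print(f"largest share gap: {max_gap:.5f}")

# Fraction of all rows that ended up in each split
split_sizes = split_counts.sum() / split_counts.to_numpy().sum()
print(split_sizes.round(4))
```

The totals add up to the 27872 rows left after filtering, and the genre shares differ by well under 0.1 percentage points between splits, which is consistent with a 70/15/15 stratified split.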