## ETL Development

This file serves as a draft for designing and developing the project ETLs: transformations and preprocessing of data for modeling. The ETLs should be part of a runnable  
pipeline, both to encapsulate the model and to isolate the train and test environments (preventing statistics from leaking between them).

Includes:
- Transformation ETLs
- Encoding ETLs

ETLs must be part of the pipeline so that any statistics they rely on are calculated on the train dataset only.
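A minimal sketch of the idea, assuming scikit-learn's `Pipeline` with a `StandardScaler` standing in for a project ETL and synthetic data standing in for the real dataset (both are illustrative choices, not part of this project):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the project dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),   # stand-in for a statistics-based ETL
    ("model", LinearRegression()),
])

# fit() computes the scaler's mean and variance from X_train only.
pipeline.fit(X_train, y_train)

# predict() reuses the train statistics; nothing is re-estimated on X_test.
predictions = pipeline.predict(X_test)
train_mean = pipeline.named_steps["scale"].mean_
```

Because the scaler lives inside the pipeline, the test rows never contribute to the statistics used to transform them.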

In [9]:
import os
import sys

sys.dont_write_bytecode = True

import numpy as np
import pandas as pd

Loading the data:

In [10]:
DATA_DIR = './Data/'
FILE = 'spotify_tracks_kaggle_weekly.csv'

In [11]:
data = pd.read_csv(DATA_DIR + FILE)

Columns to be dropped:

- track_id : ID
- artwork_url : URL data, not intended for modelling purposes
- track_url : URL data, not intended for modelling purposes
- track_name : user-friendly ID, not intended for modelling purposes (it could be used in NLP analysis, but that is outside the scope of this project)

In [12]:
DROP_COLUMNS = ['track_id', 'artwork_url', 'track_url', 'track_name']

In [13]:
data = data.drop(DROP_COLUMNS, axis=1, errors='ignore')

In [14]:
data.head()

Unnamed: 0,artist_name,year,popularity,album_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,language
0,Anirudh Ravichander,2024,59,"Leo Das Entry (From ""Leo"")",0.0241,0.753,97297.0,0.97,0.0553,8.0,0.1,-5.994,0.0,0.103,110.997,4.0,0.459,Tamil
1,"Anirudh Ravichander, Pravin Mani, Vaishali Sri...",2024,47,AAO KILLELLE,0.0851,0.78,207369.0,0.793,0.0,10.0,0.0951,-5.674,0.0,0.0952,164.995,3.0,0.821,Tamil
2,"Anirudh Ravichander, Anivee, Alvin Bruno",2024,35,Mayakiriye Sirikiriye (Orchestral EDM),0.0311,0.457,82551.0,0.491,0.0,2.0,0.0831,-8.937,0.0,0.153,169.996,4.0,0.598,Tamil
3,"Anirudh Ravichander, Bharath Sankar, Kabilan, ...",2024,24,Scene Ah Scene Ah (Experimental EDM Mix),0.227,0.718,115831.0,0.63,0.000727,7.0,0.124,-11.104,1.0,0.445,169.996,4.0,0.362,Tamil
4,"Anirudh Ravichander, Benny Dayal, Leon James, ...",2024,22,Gundellonaa X I Am a Disco Dancer (Mashup),0.0153,0.689,129621.0,0.748,1e-06,7.0,0.345,-9.637,1.0,0.158,128.961,4.0,0.593,Tamil


### Encoding

Due to the nature of the dataset, several categorical columns (and some numerical ones) require a specific encoding method to convey their information correctly.

- Categorical data:
    - artist_name
    - album_name
    - language


- Numerical data:
    - key
    - mode

- Columns that can be one-hot encoded:
    - language
    - artist_name (possibly; the average artist popularity could be used instead)
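The two options above could be sketched as follows. `ArtistMeanPopularity` is a hypothetical name, the toy rows are invented for illustration, and the fallback to the global mean for unseen artists is an assumption rather than a decision made in this project:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ArtistMeanPopularity(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: replaces artist_name with the artist's mean train popularity."""

    def fit(self, X, y):
        # Learn per-artist means from the training rows only.
        target = pd.Series(list(y), index=X.index)
        self.artist_means_ = target.groupby(X["artist_name"]).mean()
        self.global_mean_ = target.mean()
        return self

    def transform(self, X):
        X_new = X.copy()
        # Assumption: artists unseen during fit get the global train mean.
        X_new["artist_popularity"] = (
            X_new["artist_name"].map(self.artist_means_).fillna(self.global_mean_)
        )
        return X_new.drop(columns="artist_name")


# Toy rows, invented for illustration only.
train = pd.DataFrame({
    "artist_name": ["A", "A", "B"],
    "language": ["Tamil", "Tamil", "Hindi"],
})
popularity = [60, 40, 30]
test = pd.DataFrame({"artist_name": ["B", "C"], "language": ["Hindi", "Tamil"]})

encoder = ArtistMeanPopularity().fit(train, popularity)
encoded_test = encoder.transform(test)

# One-hot encoding of the low-cardinality language column.
language_dummies = pd.get_dummies(train["language"], prefix="language")
```

Mean encoding keeps a single numeric column however many artists appear, whereas one-hot encoding `artist_name` would add one column per distinct artist.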

Defining the ETL model structure:

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin

In [16]:
class ETL(BaseEstimator, TransformerMixin):
    '''Custom ETL base class. To create an ETL, inherit from this class and override
       the transform method with the custom transformation; it must return the transformed X.'''

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X
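As an illustration of the inheritance pattern, here is a minimal sketch of a subclass. `DropColumns` is a hypothetical ETL, not one defined in this project, and the two-row frame is toy data; the base class is restated so the block runs on its own:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ETL(BaseEstimator, TransformerMixin):
    """Base class restated from above so the sketch is self-contained."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X


class DropColumns(ETL):
    """Hypothetical ETL that drops the given columns if they are present."""

    def __init__(self, columns=None):
        # Store the argument unchanged so BaseEstimator.get_params() works.
        self.columns = columns

    def transform(self, X, y=None):
        # DataFrame.drop returns a new frame; the input is left untouched.
        return X.drop(columns=self.columns or [], errors="ignore")


# Toy frame, invented for illustration only.
frame = pd.DataFrame({"track_id": ["a1", "b2"], "popularity": [59, 47]})

pipeline = Pipeline([
    ("drop_ids", DropColumns(columns=["track_id", "track_url"])),
])
result = pipeline.fit_transform(frame)
```

Only `transform` is overridden here because the step learns nothing from the data; an ETL that needs train statistics would also override `fit`.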

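Key column transformation: capturing cyclical data

The original mapping is linear, so the relationship between the keys is not represented correctly: for example, C (0) and B (11) are one semitone apart, yet numerically they are the farthest apart. According to the data model description,  
the mapping represents the keys as follows: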

$$
\begin{array}{|c|c|}
\hline
\textbf{Key} & \textbf{Pitch Class} \\ \hline
0 & C \\ \hline
1 & C\# / D\flat \\ \hline
2 & D \\ \hline
3 & D\# / E\flat \\ \hline
4 & E \\ \hline
5 & F \\ \hline
6 & F\# / G\flat \\ \hline
7 & G \\ \hline
8 & G\# / A\flat \\ \hline
9 & A \\ \hline
10 & A\# / B\flat \\ \hline
11 & B \\ \hline
\end{array}
$$

We first remap the original column to its position on the circle of fifths, then convert that ordinal position  
to an angle on the circle, defined as:

$\theta = \frac{2\pi}{12} \cdot k$

where $k \in \{0, \dots, 11\}$ is the circle-of-fifths position of the key. The encoding then consists of two columns defined as:

$x = \cos(\theta)$, $y = \sin(\theta)$
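A short worked example of these formulas, using only the standard library. The helper names are hypothetical, and the closed form `(7 * key) % 12` is used here as a compact equivalent of a lookup table for the circle-of-fifths position:

```python
import math


def circle_of_fifths_position(key):
    """Position on the circle of fifths for a chromatic pitch class 0-11."""
    # A fifth spans 7 semitones, and 7 is its own inverse modulo 12
    # (7 * 7 = 49 = 1 mod 12), so multiplying by 7 converts chromatic -> fifths.
    return (7 * key) % 12


def encode_key(key):
    """Return (x, y) on the unit circle for a chromatic key."""
    theta = 2 * math.pi / 12 * circle_of_fifths_position(key)
    return math.cos(theta), math.sin(theta)


def distance(key_a, key_b):
    """Euclidean distance between two keys in the encoded (x, y) space."""
    (ax, ay), (bx, by) = encode_key(key_a), encode_key(key_b)
    return math.hypot(ax - bx, ay - by)


# C (0) and G (7) are neighbours on the circle of fifths.
d_c_g = distance(0, 7)
# C (0) and F (5) are neighbours as well, across the wrap-around point.
d_c_f = distance(0, 5)
# C (0) and C# (1) are adjacent chromatically but distant harmonically.
d_c_csharp = distance(0, 1)
```

In the encoded space, C is equally close to G and to F, while C# lies much farther away, which is the harmonic relationship the linear 0-11 mapping fails to express.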

In [8]:
class CircleOfFifthsEncoding(ETL):

    TARGET_COLUMN = 'key'

    # Original mapping from the dataset (chromatic scale: 0-11).
    # This mapping is based on semitone intervals.
    chromatic_scale = {
        0: "C",
        1: "C#/Db",
        2: "D",
        3: "D#/Eb",
        4: "E",
        5: "F",
        6: "F#/Gb",
        7: "G",
        8: "G#/Ab",
        9: "A",
        10: "A#/Bb",
        11: "B"
    }

    # Correct ordinal for the circle of fifths.
    # This sequence reflects the harmonic relationships between keys:
    # each successive key is a fifth (7 semitones) away.
    circle_of_fifths_mapping = {
        0: 0,  # C
        7: 1,  # G
        2: 2,  # D
        9: 3,  # A
        4: 4,  # E
        11: 5, # B
        6: 6,  # F#/Gb
        1: 7,  # C#/Db
        8: 8,  # G#/Ab
        3: 9,  # D#/Eb
        10: 10, # A#/Bb
        5: 11  # F
    }


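    def transform(self, X, y=None):
        X_new = X.copy()

        # Remap chromatic keys (0-11) to positions on the circle of fifths.
        ordinal_remapping = X[self.TARGET_COLUMN].map(self.circle_of_fifths_mapping)

        # Convert each position to an angle on the circle.
        theta = (2 * np.pi / 12) * ordinal_remapping

        # Replace the original column with its two trigonometric components.
        # The results are assigned directly to avoid shadowing the y parameter.
        X_new = X_new.drop(self.TARGET_COLUMN, axis=1, errors='ignore')
        X_new['key_x'] = np.cos(theta)
        X_new['key_y'] = np.sin(theta)

        return X_new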