## 2. Preprocessing

All data is preprocessed in the following steps.
1. The data is split into a training set (80% of the data) and a test set (20% of the data).
2. Missing feature values are imputed with the mean. Missing target values are inferred from other available metadata.
3. Outliers are removed, and the data is centered and normalized. Target Y1 is binned per 10 listenings and target Y2 is binned per year.
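
Steps 2 and 3 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual implementation: the column names `feat_a` and `feat_b` and the values are hypothetical, and computing the imputation and scaling statistics on the training set only is an assumption. Outlier removal and target inference from metadata are left out because the text does not specify the rules used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features with gaps and a listening-count target Y1.
train = pd.DataFrame({
    "feat_a": [1.0, 2.0, np.nan, 4.0],
    "feat_b": [10.0, np.nan, 30.0, 50.0],
    "Y1": [3, 12, 27, 45],
})
test = pd.DataFrame({
    "feat_a": [np.nan, 3.0],
    "feat_b": [20.0, 40.0],
    "Y1": [8, 31],
})
feature_cols = ["feat_a", "feat_b"]

# Step 2: impute missing feature values with the column mean.
# Assumption: the mean is taken from the training set and reused for the test set.
train_mean = train[feature_cols].mean()
train_X = train[feature_cols].fillna(train_mean)
test_X = test[feature_cols].fillna(train_mean)

# Step 3: center and normalize, again with training-set statistics only.
mu = train_X.mean()
sigma = train_X.std(ddof=0)
train_Z = (train_X - mu) / sigma
test_Z = (test_X - mu) / sigma

# Step 3: bin Y1 per 10 listenings (0-9 -> bin 0, 10-19 -> bin 1, ...).
train_y1_bin = train["Y1"] // 10
test_y1_bin = test["Y1"] // 10
```

Reusing the training statistics for the test set keeps information from the test set out of the preprocessing.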

In addition to feature set f1, which contains all features, two more feature sets are created with PCA dimensionality reduction. For feature set f2, PCA is applied per column name group; for feature set f3, PCA is applied to all features together.
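
The difference between f2 and f3 can be sketched as below. This is an illustration under assumptions: the group names, the random data, the helper `pca_per_group`, and the number of components are all hypothetical, since the text does not state how many components are kept.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical standardized features: two column-name groups of three columns each.
groups = {
    "grp_a": ["grp_a_01", "grp_a_02", "grp_a_03"],
    "grp_b": ["grp_b_01", "grp_b_02", "grp_b_03"],
}
all_cols = [col for cols in groups.values() for col in cols]
X_train = pd.DataFrame(rng.normal(size=(50, 6)), columns=all_cols)
X_test = pd.DataFrame(rng.normal(size=(10, 6)), columns=all_cols)


def pca_per_group(train, test, groups, n_components):
    """Fit one PCA per column group on train, transform both sets, concatenate."""
    train_parts, test_parts = [], []
    for name, cols in groups.items():
        pca = PCA(n_components=n_components).fit(train[cols])
        names = [f"{name}_pc{i + 1}" for i in range(n_components)]
        train_parts.append(
            pd.DataFrame(pca.transform(train[cols]), index=train.index, columns=names)
        )
        test_parts.append(
            pd.DataFrame(pca.transform(test[cols]), index=test.index, columns=names)
        )
    return pd.concat(train_parts, axis=1), pd.concat(test_parts, axis=1)


# f2-style: a separate PCA for each column-name group (2 components is an assumption).
f2_train, f2_test = pca_per_group(X_train, X_test, groups, n_components=2)

# f3-style: a single PCA over all features at once (3 components is an assumption).
pca_all = PCA(n_components=3).fit(X_train)
f3_train = pd.DataFrame(pca_all.transform(X_train), index=X_train.index)
f3_test = pd.DataFrame(pca_all.transform(X_test), index=X_test.index)
```

In both variants the PCA is fitted on the training set only and then applied to the test set.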

### Setup

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set_theme()

In [None]:
class Feature_reader():
    """
    Each feature has a first name, optionally a second name, a statistic and a number.
    This class groups features by these aspects or by combinations of them.
    Each method returns a list of feature names or a list of lists of feature names.
    """

    def __init__(self, csv):
        self.fts = pd.read_csv(csv, dtype={'n':"string"})
        self.fts = self.fts.fillna('')

    def format(self, select):
        return select.apply(lambda x: '_'.join(x).replace('__', '_'), axis=1).tolist()

    def all(self):
        select = self.fts.copy()
        return self.format(select)

    def first(self):
        select = self.fts.copy()
        select = select.loc[select['n']=='01']
        return self.format(select)

    def kurtosis(self):
        select = self.fts.copy()
        select = select.loc[select['stat']=='kurtosis']
        return self.format(select)

    def mean(self):
        select = self.fts.copy()
        select = select.loc[select['stat']=='mean']
        return self.format(select)

    def per_nns(self):
        """
        Feature names grouped by (name1, name2) and then by stat:
        one list per (name1, name2) pair, holding one list of names per stat.
        """
        select = self.fts.copy()
        select = [[self.format(grp2) for idx2, grp2 in grp.groupby(by=['stat'])] for idx, grp in select.groupby(by=['name1', 'name2'])]
        return select

fts = Feature_reader('features.csv')
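
To show what the naming logic produces, here is a small standalone sketch. It assumes that `features.csv` has the columns `name1`, `name2`, `stat` and `n` in that order; the feature names and the helper `format_names` (a copy of the `format` method's logic) are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for features.csv, with missing values already replaced by ''.
fts_table = pd.DataFrame({
    "name1": ["chroma", "chroma", "mfcc", "mfcc"],
    "name2": ["cens", "cens", "", ""],
    "stat": ["mean", "kurtosis", "mean", "kurtosis"],
    "n": ["01", "01", "01", "02"],
})


def format_names(select):
    """Join the name parts with '_' and collapse the gap left by an empty name2."""
    return select.apply(lambda row: "_".join(row).replace("__", "_"), axis=1).tolist()


# All feature names, one per row.
all_names = format_names(fts_table)

# Only the features whose statistic is the mean.
mean_names = format_names(fts_table.loc[fts_table["stat"] == "mean"])

# Nested grouping: outer list per (name1, name2), inner lists per stat.
per_nns = [
    [format_names(by_stat) for _, by_stat in by_name.groupby("stat")]
    for _, by_name in fts_table.groupby(["name1", "name2"])
]

print(all_names)
```

A row with an empty `name2` first yields a double underscore, which the `replace` call collapses, so `mfcc`, `''`, `mean`, `01` becomes `mfcc_mean_01`.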

### 2.1 Train-test split

In [None]:
# Create the train and test sets and save them as CSV files.
# Takes about 1.5 minutes.

# import pandas as pd
# from sklearn.model_selection import train_test_split

# df_music = pd.read_csv('data/music_data.csv')
# df_meta = pd.read_csv('data/metadata.csv')

# train, test = train_test_split(df_music, test_size=0.2, random_state=1)

# train.to_csv('data/data_train.csv')
# test.to_csv('data/data_test.csv')
# df_meta.iloc[train.index].to_csv('data/metadata_train.csv', index=False)
# df_meta.iloc[test.index].to_csv('data/metadata_test.csv', index=False)

# TODO: should the train/test split be stratified to balance the categories?
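
If the open question above refers to keeping category proportions equal in both sets, one option is the `stratify` argument of `train_test_split`. The sketch below uses a hypothetical `category` column on toy data; which column would be stratified on in the real data is not stated in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with an imbalanced categorical column.
df = pd.DataFrame({
    "feature": range(20),
    "category": ["a"] * 15 + ["b"] * 5,
})

# stratify keeps the 75/25 category ratio in both the train and the test set.
train_strat, test_strat = train_test_split(
    df, test_size=0.2, random_state=1, stratify=df["category"]
)

print(train_strat["category"].value_counts())
print(test_strat["category"].value_counts())
```

Stratifying directly on a continuous target is not possible; it would first have to be binned.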


### 2.2 Missing data

### 2.3 Outliers

### 2.4 Feature sets