# Data processing

In this notebook, we will try to find good techniques to apply to our original data to improve the precision of our algorithms and to reduce the size of the data.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import random
from regressions import *
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [3]:
DATA_FOLDER = "data/"
X = np.load(DATA_FOLDER + "feature_mat_radial_compression.npy")
y = np.load(DATA_FOLDER + "CSD500-r_train-H_total.npy")

In [4]:
print("X: " + str(X.shape))
print("y: " + str(y.shape))

X: (30049, 15961)
y: (30049,)


In [5]:
x_df = pd.DataFrame(X)

In [6]:
x_df = x_df.drop_duplicates()
print("X: " + str(x_df.shape))

X: (30049, 15961)


The `split` function splits the data set into train and test sets. The first `perc` fraction of the rows goes to the train set and the rest to the test set; we use a 75/25 ratio.

In [7]:
def split(x_df, y, perc):
    # perc is the fraction of rows kept for training (it was previously ignored)
    train_set_size = int(x_df.shape[0] * perc)
    x_tr = x_df.head(train_set_size)
    x_te = x_df.tail(int(x_df.shape[0] - train_set_size))
    y_tr = y[: train_set_size]
    y_te = y[train_set_size :]
    return x_tr, y_tr, x_te, y_te
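
The split above keeps the rows in their original order. If the samples happen to be ordered in some way, a shuffled split may be preferable. The following is only a sketch on NumPy arrays; `shuffled_split`, `train_frac` and `seed` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

def shuffled_split(x, y, train_frac, seed=0):
    """Split x and y into train and test parts after a seeded shuffle."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(x.shape[0])      # random order of the row indices
    n_train = int(x.shape[0] * train_frac)
    tr, te = perm[:n_train], perm[n_train:]
    return x[tr], y[tr], x[te], y[te]

# Toy data: row i of x is [2i, 2i + 1] and y[i] = i, so pairs are easy to check.
x = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
x_tr, y_tr, x_te, y_te = shuffled_split(x, y, 0.75)
print(x_tr.shape, x_te.shape)
```

Fixing the seed keeps the split identical between runs, so scores stay comparable.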

We notice that a polynomial expansion really improves the results. Here the expansion is limited to adding a column of 1s, i.e. a constant (intercept) term.

In [8]:
def add_cte_col(df):
    df_tmp = df.copy()
    df_tmp[df_tmp.shape[1]] = pd.Series(np.ones(df_tmp.shape[0]), index=df_tmp.index)
    return df_tmp

Finally, we can create our test function, which adds the constant column, splits the data and performs ridge regression for a few values of lambda. Only the lowest error is kept.

In [9]:
def test_quality(x_df, y):
    x_df = add_cte_col(x_df)
    train_perc = 0.75  # fraction of the samples used for training
    x_train, y_train, x_test, y_test = split(x_df, y, train_perc)
    best = float("inf")
    # Try several regularization strengths and keep the lowest test RMSE
    for lambda_ in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]:
        err = rmse(y_test, x_test, ridge_regression(y_train, x_train, lambda_))
        best = min(best, err)
    return best

In [10]:
test_quality(x_df, y)

0.7373143070214498

We can see that a simple ridge regression with the constant column gives us an RMSE of about 0.74 (0.7373). We use this value as the reference for the upcoming tests of the techniques we try.
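
The helpers `rmse` and `ridge_regression` come from the `regressions` module, which is not shown in this notebook. As a rough idea of what they might look like, here is a minimal sketch of closed-form ridge regression on NumPy arrays. The argument order mirrors the calls above, but the exact scaling of the penalty is an assumption and the real module may differ.

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Closed-form ridge: solve (X^T X + lambda I) w = X^T y.

    Assumption: the penalty is plain lambda * I; the real module may scale it.
    """
    d = tx.shape[1]
    a = tx.T @ tx + lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def rmse(y, tx, w):
    """Root mean squared error of the linear prediction tx @ w."""
    e = y - tx @ w
    return np.sqrt(np.mean(e ** 2))

# Tiny synthetic check: y is an exact linear function of the features.
rng = np.random.default_rng(0)
tx = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
y = tx @ w_true
w_hat = ridge_regression(y, tx, 1e-9)
print(rmse(y, tx, w_hat))
```

With such a small penalty the recovered weights should be very close to the true ones on noise-free data.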

### Data augmentation

In our readings, we came across the claim that deep learning only gets better with more data, so we try to artificially generate some samples based on the ones we have. We use jittering, which consists in adding a little bit of noise to an existing sample, small enough that we can assume the target y stays the same. Concretely, for a chosen number of samples x_i, we create new samples by replacing x_ij with x_ij + 1% of the mean of the j-th feature over all samples. The function lets us choose how many samples to add and how many features to modify per sample.

In [11]:
# Create extra samples by shifting a random subset of features of existing ones
def add_jitter(df, y, perc_sample, perc_col):
    means = df.mean()
    ids = random.sample(range(df.shape[0]), int(df.shape[0]*perc_sample))
    new_samples = []
    for id_ in ids:
        new_sample = df.iloc[[id_]].copy()
        col = random.sample(range(df.shape[1]), int(df.shape[1]*perc_col))
        for j in col:
            new_sample[j] = new_sample[j] + 0.01*means[j]
        new_samples.append(new_sample)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df_tmp = pd.concat([df] + new_samples, ignore_index=True)
    y_tmp = np.append(y, [y[id_] for id_ in ids])
    return df_tmp, y_tmp

In [12]:
x_df_aug, y_aug = add_jitter(x_df, y, 0.01, 0.01)

In [None]:
test_quality(x_df_aug, y_aug)

We can see that adding 1% of samples and modifying 1% of the features lowers the RMSE to 0.72. This is a small improvement! Note that the sampling is random and unseeded, so the exact value varies between runs.
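
For reference, the same jittering idea can be written directly on NumPy arrays with a fixed seed, which makes the augmented set reproducible. This is only a sketch: `add_jitter_np`, `frac_samples`, `frac_cols`, `scale` and `seed` are hypothetical names, not part of the notebook.

```python
import numpy as np

def add_jitter_np(x, y, frac_samples, frac_cols, scale=0.01, seed=0):
    """Append copies of random rows with a few features shifted by scale * feature mean."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    n_new = int(n * frac_samples)
    n_cols = int(d * frac_cols)
    rows = rng.choice(n, size=n_new, replace=False)   # rows to duplicate
    new_x = x[rows].copy()
    offsets = scale * x.mean(axis=0)                  # 1% of each feature mean by default
    for k in range(n_new):
        cols = rng.choice(d, size=n_cols, replace=False)
        new_x[k, cols] += offsets[cols]
    return np.vstack([x, new_x]), np.concatenate([y, y[rows]])

x = np.arange(1.0, 41.0).reshape(10, 4)   # 10 samples, 4 features
y = np.arange(10.0)
x_aug, y_aug = add_jitter_np(x, y, frac_samples=0.5, frac_cols=0.5)
print(x_aug.shape, y_aug.shape)
```

The original rows are left untouched and each new row keeps the target of the row it was copied from.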

### Normalization

In [None]:
x_df=(x_df-x_df.mean())/x_df.std()
x_df = x_df.drop(15960, axis=1)
x_df

Normalization is an affine transformation of each feature and thus should not have much influence on the predictions. We choose to normalize not to improve the results but for smoothness, i.e. to keep all features on a comparable scale.
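
One general caveat with this kind of standardization: a constant column has a standard deviation of zero, so dividing by it produces NaN. The sketch below shows one possible guard on a NumPy array; `standardize` is a hypothetical helper, and `ddof=1` is chosen to match the default of pandas `std()`.

```python
import numpy as np

def standardize(x, ddof=1):
    """Center each column and scale it to unit standard deviation.

    Constant columns (std == 0) are only centered, which avoids a 0 / 0 division.
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=ddof)
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std

x = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])   # the second column is constant
z = standardize(x)
print(z)
```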

### Correlation

In [None]:
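# Compare each of the first 50 features with every later feature and
# mark the later one for removal when their correlation exceeds 0.95
to_rm = set()
for i in [i for i in x_df if i < 50]:
    for j in [j for j in x_df if j > i]:
        if x_df[i].corr(x_df[j]) > 0.95:
            to_rm.add(j)
x_df_uncorr = x_df.drop(list(to_rm), axis=1)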

In [None]:
print(x_df_uncorr.shape)
test_quality(x_df_uncorr, y)

These results are surprising: we expected the error to go down after removing some redundant features, but removing highly correlated features actually makes the results worse. We won't use it.
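
The nested loop over columns can be slow on a matrix this wide. As a sketch of an alternative, the correlations between the first few reference columns and all columns can be computed in one matrix product. `correlated_columns`, `n_ref` and `threshold` are hypothetical names; like the loop above, the sketch only flags strong positive correlations, and it assumes no column is constant.

```python
import numpy as np

def correlated_columns(x, n_ref, threshold=0.95):
    """Indices of columns whose correlation with an earlier reference column exceeds threshold."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)   # standardized columns (population std)
    corr = z[:, :n_ref].T @ z / x.shape[0]     # shape (n_ref, n_features)
    to_rm = set()
    for i in range(corr.shape[0]):
        later = np.arange(i + 1, x.shape[1])   # only compare with later columns
        to_rm.update(later[corr[i, later] > threshold].tolist())
    return sorted(to_rm)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 2 is almost a rescaled copy of column 0; column 3 is its negation.
x = np.column_stack([a, b, 2 * a + 1e-3 * rng.normal(size=200), -a])
removed = correlated_columns(x, n_ref=2)
print(removed)
```

In this toy example only the near-copy is flagged; the negated column has a correlation close to -1 and is kept, which is also how the notebook's criterion behaves.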

### PCA

In [None]:
rmses = []
ys = range(500, 5501, 500)  # numbers of components to try
for i in ys:
    pca = PCA(n_components=i, whiten=True)
    # PCA is unsupervised: fit_transform only needs the feature matrix
    principalComponents = pca.fit_transform(x_df)
    principalDf = pd.DataFrame(data=principalComponents,
                               columns=range(principalComponents.shape[1]))
    err = test_quality(principalDf, y)
    rmses.append(err)
    print(i, err)

In [None]:
plt.xlabel("Number of components")
plt.ylabel("RMSE")
plt.plot(ys, rmses)
plt.show()

The plotted error is relatively high when the number of components is reduced below 6000. We are not sure what conclusion to draw here: we expected results worse than with the full matrix, but not to that extent. One could argue that our `test_quality` function is not optimized at all, but from what we understand of the scientific side, 1000 features should be enough to get good predictions. For now, we simply pick the best of the results by default.
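
A complementary way to judge a number of components is to look at how much of the variance it retains. The sketch below computes the explained variance ratio from the singular values of the centered matrix, on synthetic data rather than on our feature matrix; scikit-learn exposes the same quantity as `pca.explained_variance_ratio_`. The helper name and the 99% threshold are assumptions chosen for illustration.

```python
import numpy as np

def explained_variance_ratio(x):
    """Fraction of the total variance carried by each principal component."""
    xc = x - x.mean(axis=0)
    s = np.linalg.svd(xc, compute_uv=False)   # singular values, in decreasing order
    var = s ** 2
    return var / var.sum()

# Synthetic data: 20 features driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
x = latent @ mixing + 0.01 * rng.normal(size=(300, 20))

cum = np.cumsum(explained_variance_ratio(x))
k = int(np.searchsorted(cum, 0.99) + 1)   # smallest number of components reaching 99%
print(k)
```

Retained variance says nothing about the target y, so it does not replace the RMSE comparison; it only indicates how much of X is kept.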

In [None]:
i_star = 3003
pca = PCA(n_components=i_star, whiten=True)
principalComponents = pca.fit_transform(x_df)
principalDf_star = pd.DataFrame(data=principalComponents,
                                columns=range(principalComponents.shape[1]))
test_quality(principalDf_star, y)

In [None]:
i_star = 3003
pca = PCA(n_components=i_star)  # same as above, without whitening
principalComponents_no_w = pca.fit_transform(x_df)
principalDf_star_no_w = pd.DataFrame(data=principalComponents_no_w,
                                     columns=range(principalComponents_no_w.shape[1]))
test_quality(principalDf_star_no_w, y)

Whitening does not improve anything; it even worsens the results.
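
To make explicit what whitening changes, the sketch below computes PCA scores with and without it using a plain SVD on synthetic data. With whitening, every component is rescaled to unit variance, so low-variance components end up on the same scale as the leading ones. `pca_scores` is a hypothetical helper, and whether this rescaling explains the worse score here is not something the notebook establishes.

```python
import numpy as np

def pca_scores(x, n_components, whiten=False):
    """Project x on its first principal components, optionally whitened."""
    xc = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    if whiten:
        # Divide each component by its standard deviation s / sqrt(n - 1)
        scores = scores / (s[:n_components] / np.sqrt(x.shape[0] - 1))
    return scores

rng = np.random.default_rng(0)
# Four features with very different scales
x = rng.normal(size=(100, 4)) * np.array([10.0, 5.0, 1.0, 0.1])

plain = pca_scores(x, 3)
white = pca_scores(x, 3, whiten=True)
print(plain.var(axis=0, ddof=1))
print(white.var(axis=0, ddof=1))
```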

In [None]:
# Cannot use MLE to choose n_components: "ValueError: math domain error" is a known bug, see https://github.com/scikit-learn/scikit-learn/issues/10217

It is a real disappointment not to be able to use the Maximum Likelihood Estimator.

### Preparing the data for Machine Learning

In [None]:
DATA_FOLDER = "data/"
X = np.load(DATA_FOLDER + "feature_mat_radial_compression.npy")
y = np.load(DATA_FOLDER + "CSD500-r_train-H_total.npy")
x_df = pd.DataFrame(X)

In [None]:
# Apply PCA
i_star = 4500
pca = PCA(n_components=i_star)
principalComponents = pca.fit_transform(x_df)
x_pca_df = pd.DataFrame(data=principalComponents,
                        columns=range(principalComponents.shape[1]))

In [None]:
#Add jitter
x_with_jitter_df, y_with_jitter = add_jitter(x_pca_df, y, 0.01, 0.01)

In [None]:
#Normalize
x_with_jitter_df=(x_with_jitter_df-x_with_jitter_df.mean())/x_with_jitter_df.std()

In [17]:
np.save("data/ML/x_train.npy", x_with_jitter_df)
np.save("data/ML/y.npy", y_with_jitter)

### Preparing the data for Deep Learning

In [None]:
DATA_FOLDER = "data/"
X = np.load(DATA_FOLDER + "feature_mat_radial_compression.npy")
y = np.load(DATA_FOLDER + "CSD500-r_train-H_total.npy")
x_df = pd.DataFrame(X)

In [None]:
# Apply PCA
i_star = 3004
pca = PCA(n_components=i_star)
principalComponents = pca.fit_transform(x_df)
x_pca_df = pd.DataFrame(data=principalComponents,
                        columns=range(principalComponents.shape[1]))

In [None]:
#Normalize
x_pca_df=(x_pca_df-x_pca_df.mean())/x_pca_df.std()

In [None]:
np.save("data/feature_mat_radial_compression_normalized_red.npy", x_pca_df)