# Preprocessing
This notebook explores the data to make better sense of it. Its outcome will be used to choose machine learning models and data augmentation methods, and it will also help in understanding the problem.

## Assumptions
- The long ellipse is given by r1, while the short ellipse is given by a2 and r2.
- 2a1 = 1, thus a1 = 0.5 and b1 = a1/r1 = 0.5/r1. 2a2 <= 1.
- The heterogeneous ellipses are split by f, the fraction that are short; 1-f is the fraction that are long.
- The ellipse ratio r, is given by the major and minor axis lengths such that r=a/b.
- The width of the ellipse in a given axis direction is given by 2a or 2b.
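
As a small worked example of the assumptions above, the sketch below recovers the semi-axes from a full width 2a and a ratio r = a/b. The helper name `axes_from_ratio` and the numbers are purely illustrative and are not taken from the data.

```python
def axes_from_ratio(width, ratio):
    """Return (a, b) for an ellipse with full width 2a = `width` and ratio r = a / b."""
    a = width / 2.0
    b = a / ratio
    return a, b

# Long ellipse: 2a1 = 1 is fixed, only r1 varies (r1 = 4 is an illustrative value).
a1, b1 = axes_from_ratio(width=1.0, ratio=4.0)
# Short ellipse: 2a2 <= 1 (0.5 and r2 = 2 are illustrative values).
a2, b2 = axes_from_ratio(width=0.5, ratio=2.0)
print(a1, b1, a2, b2)
```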

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()

Functions

In [None]:
def import_data(name='HybridEllipsePercolation.txt', sep1=" ", header1=None, shuffle=True):
    data = pd.read_csv(name, sep=sep1, header=header1)
    data.columns = ["r1", "2a2", "r2", "frac", "Nc", "Nc Std. Dev", "eta c" ]
    # data.reset_index(inplace=True)
    
    if shuffle:
        data = data.sample(frac=1).reset_index(drop=True)
    return data

def single_input_vs_output(dataset, input_column, output_column="eta c", plot=False, output=False):
    reduced_dataset = dataset.drop_duplicates(subset=input_column)
    # data_range = reduced_dataset[input_column].to_numpy()
    if output:
        print("{col} range is: {rng}    (output is {out})".format(col=input_column, rng=len(reduced_dataset), out=output_column))

    if plot:
        ax1 = reduced_dataset.plot.scatter( x=input_column,
                        y=output_column,
                        c='DarkBlue')
        return ax1
        # return sns.scatterplot(data=reduced_dataset, x=input_column, y=output_column)
    
    return reduced_dataset[[input_column, output_column]]

def split_data(dataset):
    train_dataset = dataset.sample(frac=0.6, random_state=0)
    valid_and_test_dataset = dataset.drop(train_dataset.index)
    test_dataset = valid_and_test_dataset.sample(frac=0.5, random_state=0)
    validation_dataset = valid_and_test_dataset.drop(test_dataset.index)
    return train_dataset, test_dataset, validation_dataset

def split_features_labels(data, label_column='eta c'):
    # Work on a copy so the caller's DataFrame keeps its label column.
    features = data.copy()
    labels = features.pop(label_column)
    return features, labels

Importing the data.

In [None]:
rawdata = import_data(shuffle=False)
dataset = rawdata.copy()

# remove irrelevant columns
# dataset.pop("Nc Std. Dev")

# check for missing values
dataset.isna().sum()
# drop missing values
dataset = dataset.dropna()

dataset.head()

In [None]:
dataset.describe().transpose()

From the statistics above, we can instantly see that the data is skewed: r1's mean is less than half its max, and the same goes for 2a2 and r2. We need to look at the distributions of the data, for example with frequency histograms. If the data is skewed, splitting may not be simple, and some data may have to be left out to prevent biases. We may revisit this later during tuning.
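
The mean-versus-max check above can be complemented with the sample skewness. The following is a minimal sketch on synthetic columns (an exponential and a uniform draw standing in for a skewed and a balanced input); the frame `demo` and its column names are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins, not the percolation data.
demo = pd.DataFrame({
    "skewed": rng.exponential(scale=2.0, size=5000),
    "uniform": rng.uniform(0.0, 1.0, size=5000),
})

summary = pd.DataFrame({
    "mean/max": demo.mean() / demo.max(),  # well below 0.5 hints at a long right tail
    "skewness": demo.skew(),               # close to 0 for a symmetric distribution
})
print(summary)
```

On the real data the same two expressions could be applied to `dataset[["r1", "2a2", "r2", "frac"]]`.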

Here we will extend the dataset to include the major and minor axis lengths 

In [None]:
dataset["a1"]=1/2
dataset["b1"]=dataset["a1"]/dataset["r1"]
dataset["a2"]=dataset["2a2"]/2
dataset["b2"]=dataset["a2"]/dataset["r2"]
dataset.head()

The next step is to check for repeating values. Printing the returned variable gives all of the values.

In [None]:
r1_range = single_input_vs_output(dataset, "r1")
r2_range = single_input_vs_output(dataset, "r2")
frac_range = single_input_vs_output(dataset, "frac")
a1_range = single_input_vs_output(dataset, "a1")
b1_range = single_input_vs_output(dataset, "b1")
a2_range = single_input_vs_output(dataset, "a2")
b2_range = single_input_vs_output(dataset, "b2")
_2a2_range = single_input_vs_output(dataset, "2a2")

dataset.nunique()

You can see from the above that a1 is constant. b2 has the most variation, with frac varying the least. The next step is to plot a histogram of how often each value occurs. The histogram plots will be presented as {r1, r2}, {a2, frac}.

We have to note the actual numbers for choosing the correct bin size, so viewing the list of present values is useful.

r1 -> 1000, r2 -> 1000, frac -> 100, a2 or 2a2 -> 1000
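
When a column only takes a handful of distinct values, counting samples per value can be easier to read than a histogram with many bins. A minimal sketch, assuming a toy `frac` series (the values are made up):

```python
import pandas as pd

# Hypothetical column with repeated discrete values, not the real data.
frac = pd.Series([0.1, 0.1, 0.5, 0.5, 0.5, 0.9, 0.95, 0.99], name="frac")

counts = frac.value_counts().sort_index()  # number of samples per distinct value
n_distinct = frac.nunique()                # how many distinct values the column takes
print(counts)
print("distinct values:", n_distinct)
```

Comparing `n_distinct` with the spacing of the values gives a rough guide for the bin count.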

In [None]:
# sns.distplot(dataset["frac"], kde=False, color='red', bins=100)
# plt.title('Frequency of Fraction', fontsize=14)
# plt.xlabel('Fraction', fontsize=10)
# plt.ylabel('Frequency', fontsize=10)

fig, axs = plt.subplots(ncols=2, figsize=(12,6))
sns.distplot(dataset["r1"], kde=False, color='red', bins=1000, ax=axs[0])
sns.distplot(dataset["r2"], kde=False, color='red', bins=1000, ax=axs[1])

This is not very readable, so we can convert the columns to strings and plot them as categories.

In [None]:
catagorical_dataset = dataset.copy()
catagorical_dataset['r1'] = catagorical_dataset['r1'].astype(str)
catagorical_dataset['r2'] = catagorical_dataset['r2'].astype(str)
# histplot is axes-level and accepts ax=; displot is figure-level and ignores it.
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
sns.histplot(catagorical_dataset, x="r1", shrink=.8, color='red', ax=axs[0])
sns.histplot(catagorical_dataset, x="r2", shrink=.8, color='red', ax=axs[1])

In [None]:
# histplot is axes-level and accepts ax=; displot is figure-level and ignores it.
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
catagorical_dataset['frac'] = catagorical_dataset['frac'].astype(str)
sns.histplot(catagorical_dataset, x="frac", shrink=.8, color='red', ax=axs[0])

catagorical_dataset['a2'] = catagorical_dataset['a2'].astype(str)
sns.histplot(catagorical_dataset, x="a2", shrink=.8, color='red', ax=axs[1])

From the plots above, we can see that r1, r2 are relatively consistent in the amount of data for each r value, but the data is skewed towards the smaller r values.

frac 0.99 has substantially fewer points than the rest. The frac class is also skewed towards the higher frac numbers, as the intervals decrease after 0.9 (including 0.95 and 0.99).

The a2 class (or 2a2) increases in steps of 0.05 between 0.05 and 0.5 (0.1 for 2a2). However, it also includes {0.005, 0.01, 0.025}, i.e. steps of +0.005, +0.015 and +0.025 up to 0.05. This means there are more points in the 0-0.05 range than in the other intervals.

This can be shown in the distribution plots.

In [None]:
print("r1 range {}".format(list(r1_range["r1"])))
print("r2 range {}".format(list(r2_range["r2"])))
print("frac range {}".format(list(frac_range["frac"])))
print("a2 range {}".format(list(a2_range["a2"])))

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
sns.distplot(dataset["r1"], color='red', bins=1000, ax=axs[0])
sns.distplot(dataset["r2"], color='red', bins=1000, ax=axs[1])

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
sns.distplot(dataset["frac"], color='red', bins=100, ax=axs[0])
sns.distplot(dataset["a2"], color='red', bins=100, ax=axs[1])

Before moving forward, we can look at the output or labels/targets. These are the 'eta c' and 'Nc' columns.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(12,6))
sns.distplot(dataset["eta c"], color='red', bins=100, ax=axs[0])
sns.distplot(dataset["Nc"], color='red', bins=100, ax=axs[1])

From this we can see that the target distributions are skewed towards the lower end of the spectrum. Rescaling 'Nc' between -1 and 1 will result in more dense data with larger outliers. Using 'eta c' as the output may be better. Before moving on, we should plot all the variables against one another (see Data.png).

In [None]:
#Takes long to plot -> see saved images.

# sns.pairplot(train_dataset[['r1', '2a2', 'r2', 'frac', 'Nc', 'eta c']], diag_kind='kde')

Pairplot of the data.

![Pairplot](Data.png)

Unfortunately, the data does not show any direct correlations, and a heat map is not very useful here. When looking at the data, it is useful to remember that the simulations used the inputs to predict the 'eta c' or 'Nc' at which percolation in the system will occur.

The range of each feature is different, and some of them are unbalanced. The features can be standardized to zero mean and unit variance. This is useful to prevent one feature from dominating the weight updates. We can account for the imbalance by omitting some data and/or by feature selection. However, let's try PCA first and see how the problem looks.

NOTE: The targets 'eta c' and 'Nc' have been scaled.

In [None]:
from sklearn.preprocessing import StandardScaler

scaled_dataset = dataset.copy()
scaled_dataset['r1'] = StandardScaler().fit_transform(dataset['r1'].values.reshape(-1,1))
scaled_dataset['r2'] = StandardScaler().fit_transform(dataset['r2'].values.reshape(-1,1))
scaled_dataset['frac'] = StandardScaler().fit_transform(dataset['frac'].values.reshape(-1,1))
# Standardizing '2a2' directly gives the same result as 'a2' (they differ only by a factor of 2).
scaled_dataset['2a2'] = StandardScaler().fit_transform(dataset['2a2'].values.reshape(-1,1))
scaled_dataset['eta c'] = StandardScaler().fit_transform(dataset['eta c'].values.reshape(-1,1))
scaled_dataset['Nc'] = StandardScaler().fit_transform(dataset['Nc'].values.reshape(-1,1))

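# account for added features
# NOTE: b1, a2 and b2 are recomputed from the standardized columns, so they are no
# longer physical axis lengths (they can be negative, or very large when a scaled ratio is near 0).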
scaled_dataset["a1"]=1/2
scaled_dataset["b1"]=scaled_dataset["a1"]/scaled_dataset["r1"]
scaled_dataset["a2"]=scaled_dataset["2a2"]/2
scaled_dataset["b2"]=scaled_dataset["a2"]/scaled_dataset["r2"]

scaled_dataset.describe().transpose()

In [None]:
scaled_dataset.head()
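
As an alternative to one scaler per column, a single `StandardScaler` can be fitted on several columns at once and kept around for the inverse transform. This is a minimal sketch on synthetic data; the frame `demo` only borrows the column names `r1` and `frac`, its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in; the values do not come from the percolation data.
demo = pd.DataFrame({
    "r1": rng.exponential(5.0, size=1000),
    "frac": rng.uniform(0.0, 1.0, size=1000),
})

scaler = StandardScaler()
scaled = demo.copy()
scaled[["r1", "frac"]] = scaler.fit_transform(demo[["r1", "frac"]])

# Zero mean and unit (population) standard deviation, but not bounded to [-1, 1].
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
print("max |r1| after scaling:", scaled["r1"].abs().max())

# The fitted scaler can undo the transform, e.g. to map values back to original units.
restored = scaler.inverse_transform(scaled[["r1", "frac"]])
```

Keeping the fitted `scaler` object matters if the same transform has to be applied to new data later.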

We can visualise the (original) features and targets again to understand the scaling.

In [None]:
# fig, axs = plt.subplots(ncols=2, figsize=(12,6))
# sns.distplot(scaled_dataset["eta c"], color='red', bins=100, ax=axs[0])
# sns.distplot(scaled_dataset["Nc"], color='red', bins=100, ax=axs[1])
# fig.savefig("scaled_labels.png")

# fig, axs = plt.subplots(ncols=2, figsize=(12,6))
# sns.distplot(scaled_dataset["r1"], color='red', bins=1000, ax=axs[0])
# sns.distplot(scaled_dataset["r2"], color='red', bins=1000, ax=axs[1])
# fig.savefig("scaled_r_features.png")

# fig, axs = plt.subplots(ncols=2, figsize=(12,6))
# sns.distplot(scaled_dataset["frac"], color='red', bins=100, ax=axs[0])
# sns.distplot(scaled_dataset["a2"], color='red', bins=100, ax=axs[1])
# fig.savefig("scaled_frac_a2.png")

The scaled data.

![scaled_r](scaled_r_features.png)
![scaled_frac](scaled_frac_a2.png)
![scaled_labels](scaled_labels.png)

Here you can clearly see that r1 and r2 have a bias with large outliers. frac and a2 have fairly uniform distributions and can be handled easily by omitting samples at the lower and higher ends. For now I do not think this is necessary, because the difference in the number of samples is not large and omitting samples will affect the other distributions. eta c still looks like a better label, with a smaller range, fewer outliers and less of a bias than Nc.

We can look at the covariance matrix of the data, but we know that the inputs were selected relatively independently.

In [None]:
scaled_dataset[['r1','r2','2a2','frac']].cov()
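
On standardized columns the covariance matrix is already close to the correlation matrix. On unscaled data the two differ, and `DataFrame.corr()` is usually easier to read. A minimal sketch with two independently drawn synthetic inputs (made-up values on deliberately different scales):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, independently drawn inputs; not the real data.
demo = pd.DataFrame({
    "r1": rng.uniform(1.0, 1000.0, size=2000),
    "frac": rng.uniform(0.0, 1.0, size=2000),
})

cov = demo.cov()    # entries depend on the scale of each column
corr = demo.corr()  # Pearson correlation, always within [-1, 1]
print(cov.round(3))
print(corr.round(3))
```

Off-diagonal correlations near zero are what independently chosen inputs would be expected to show.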

Now we will implement PCA to try to reduce the number of input features. The total number of unexpanded features is 4. With the expansion, we get 8. First we will try PCA on the 4, then the 8, and then maybe a selection of the features. For now the labels will be 'eta c', taken from scaled_dataset and therefore standardized, as noted above. In the future we can possibly use both 'eta c' and 'Nc', or their unscaled versions.

In [None]:
from sklearn.decomposition import PCA

features = scaled_dataset.drop(['eta c', 'Nc', 'Nc Std. Dev', 'a1', 'b1', 'a2', 'b2'], axis=1)
expanded_features = scaled_dataset.drop(['eta c', 'Nc', 'Nc Std. Dev',], axis=1)
labels = scaled_dataset['eta c']

components = 2
components_cols = ['PC 1', 'PC 2']
X = features
Y = labels

pca = PCA(n_components=components)
pca_data = pca.fit_transform(X.values)

principalDf = pd.DataFrame(data = pca_data, columns = components_cols)

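# Reset the label index so rows align with principalDf even if dropna() removed rows.
pcaDF = pd.concat([principalDf, Y.reset_index(drop=True)], axis = 1)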
pcaDF.head()

In [None]:
# fig, axs = plt.subplots(ncols=1, figsize=(12,6))
# sns.distplot(scaled_dataset["frac"], color='red', bins=100, ax=axs[0])
# sns.distplot(scaled_dataset["a2"], color='red', bins=100, ax=axs[1])
fig = sns.pairplot(pcaDF[['PC 1', 'PC 2', 'eta c']], diag_kind='kde')
fig.savefig("PCA.png")


The PCA data has been plotted in a pair plot to see how the new components relate to the output. 

![pca](PCA.png)
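
One way to judge how many components are worth keeping is the explained variance ratio of a fitted PCA. The sketch below uses four independent synthetic features as a stand-in for the standardized inputs; it is an illustration under that assumption, not a result for the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Four independent, standardized synthetic features (stand-in for r1, 2a2, r2, frac).
X = rng.standard_normal(size=(2000, 4))

pca = PCA(n_components=4).fit(X)
ratios = pca.explained_variance_ratio_  # fraction of total variance per component
print(ratios.round(3))
print(np.cumsum(ratios).round(3))       # cumulative variance retained
```

If the inputs are close to independent, each component carries a similar share of the variance, so dropping components would discard a comparable share of the information.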

An attempt with 1 component.

In [None]:
components = 1
components_cols = ['PC 1']
X = features
Y = labels

pca = PCA(n_components=components)
pca_data = pca.fit_transform(X.values)

principalDf = pd.DataFrame(data = pca_data, columns = components_cols)

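# Reset the label index so rows align with principalDf even if dropna() removed rows.
pcaDF = pd.concat([principalDf, Y.reset_index(drop=True)], axis = 1)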
pcaDF.head()

In [None]:
fig = sns.pairplot(pcaDF[['PC 1', 'eta c']], diag_kind='kde')
fig.savefig("OnePCA.png")

Lastly, we will store the dataframes to csv files to be used in other notebooks.

In [None]:
scaled_dataset.to_csv("NormalizedData")

In [None]:
pcaDF.to_csv("PCA")
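
Note that `to_csv` writes the row index as the first column by default, so the notebooks that read these files need to account for it. A minimal round-trip sketch using an in-memory buffer and a made-up two-row frame:

```python
import io
import pandas as pd

# Toy frame with the same column names as pcaDF; the values are made up.
demo = pd.DataFrame({"PC 1": [0.1, -0.2], "eta c": [1.5, 2.5]})

buffer = io.StringIO()
demo.to_csv(buffer)   # default index=True writes the row index as the first column
buffer.seek(0)

# index_col=0 tells pandas to treat that first column as the index again.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the reader would get an extra `Unnamed: 0` column; passing `index=False` to `to_csv` is the other common way to avoid it.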