# Synthetic Data Creation

This notebook is where we create the synthetic dataset to be used for our testing.

Summary of what we wish to achieve:

We aim to create a dataset from which we can build a linear regression or simple polynomial regression model without regularisation. The dataset needs a sufficient number of samples to form both a training set and a test set. Given the purpose of this experiment, we also require that some of the features are correlated with one another.

In [19]:
#Set master parameters.

num_of_features = 10
num_of_samples = 400
train_test_split = 0.25
num_of_correlated_features = 5

In [20]:
#Library imports
import numpy as np
import pandas as pd
import pickle

In [21]:
#colouring function: shade a cell when |val| >= thres (red if negative, green otherwise)
def color_thresh(val,thres=0.5):
    if abs(val)>=thres:
        if val<0:
            return 'background-color: red'
        else:
            return 'background-color: green'
    else:
        return ''

## Correlated features

The `color_thresh` function defined above will be useful later on, when we need to quickly inspect the correlations in our generated dataset.
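As a quick sanity check of the threshold rule, the sketch below calls `color_thresh` on a few hand-picked values. The function is restated compactly so the snippet runs on its own; the sample values are illustrative assumptions, not taken from the dataset.

```python
def color_thresh(val, thres=0.5):
    """Return a CSS background colour for values with |val| >= thres."""
    if abs(val) >= thres:
        return 'background-color: red' if val < 0 else 'background-color: green'
    return ''

# Illustrative values only (not from the generated data).
samples = [0.8, -0.8, 0.1, 0.5]
styles = [color_thresh(v) for v in samples]
for value, style in zip(samples, styles):
    print(f"{value:+.1f} -> {style!r}")
```

Note that the diagonal of a correlation matrix is always 1, so those cells will always be shaded green.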

In [22]:
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # unit variances, covariance of 0.8 between the two features

In [23]:
data_df = pd.DataFrame(np.random.multivariate_normal(mean,cov,num_of_samples),columns=['f1','f2'])

In [24]:
data_df.corr()

,f1,f2
f1,1.0,0.792928
f2,0.792928,1.0
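Because `np.random.multivariate_normal` draws from NumPy's global random state, the correlation shown above will change slightly on every run. If reproducibility matters, one option is a seeded generator. The sketch below assumes a seed of 0 (an arbitrary choice, not something this notebook uses) and the name `demo_df` is hypothetical; it checks that the sample correlation lands near the requested 0.8.

```python
import numpy as np
import pandas as pd

mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]
num_of_samples = 400

# Seed chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=num_of_samples),
    columns=['f1', 'f2'],
)

# Sample correlation between the two columns; should be close to 0.8.
sample_corr = demo_df['f1'].corr(demo_df['f2'])
print(round(sample_corr, 3))
```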


In [25]:
mean2 = [0, 0, 0]
cov2 = [[1, 0.9, 0], [0.9, 1, -0.9], [0,-0.9,1]]  # symmetric, but not positive semidefinite

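Note that a covariance matrix must be positive semidefinite, which the matrix above is not. However, for any real matrix $A$ the product $A A^{T}$ is positive semidefinite, so we can use that instead. Because `cov2` is symmetric, $A A^{T} = A^{2}$, which is what `np.matmul(cov2, cov2)` computes. The variances and correlations of the resulting matrix differ from the entries written in `cov2`.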
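The claim about positive semidefiniteness can be checked numerically. The following is a minimal sketch, assuming the usual eigenvalue test for symmetric matrices; the names `gram` and `implied_corr` are hypothetical. It also prints the correlations implied by the product matrix.

```python
import numpy as np

cov2 = np.array([[1, 0.9, 0], [0.9, 1, -0.9], [0, -0.9, 1]])

# A symmetric matrix is positive semidefinite iff all its eigenvalues are >= 0.
eig_cov2 = np.linalg.eigvalsh(cov2)
print(eig_cov2)  # one eigenvalue is negative

# A @ A.T is positive semidefinite for any real matrix A.
gram = cov2 @ cov2.T
eig_gram = np.linalg.eigvalsh(gram)
print(eig_gram)

# Correlations implied by the new covariance matrix:
# divide each entry by the product of the two standard deviations.
std = np.sqrt(np.diag(gram))
implied_corr = gram / np.outer(std, std)
print(implied_corr.round(3))
```

The implied correlations are roughly 0.83, -0.45 and -0.83, so the sampled features should show these values rather than the 0.9, 0 and -0.9 in `cov2`.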

In [26]:
np.matmul(cov2,cov2)

array([[ 1.81,  1.8 , -0.81],
       [ 1.8 ,  2.62, -1.8 ],
       [-0.81, -1.8 ,  1.81]])

In [33]:
data_df2 = pd.DataFrame(
    np.random.multivariate_normal(
        mean2,
        np.matmul(cov2,cov2),
        num_of_samples),
#    columns=['f3','f4','f5']
)

In [36]:
data_df2.corr()

,0,1,2
0,1.0,0.828612,-0.437695
1,0.828612,1.0,-0.81673
2,-0.437695,-0.81673,1.0


## Independent features

Now that the correlated features are created, we can create the remaining features and combine them all into one dataframe.

In [None]:
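#data_df2 has default integer column labels, so rename them to f3..f5 before combining
master_data_df = pd.concat([data_df,data_df2.rename(columns={0:'f3',1:'f4',2:'f5'})],axis=1)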

In [None]:
#create remaining (independent) features and append to master dataframe
for i in range(num_of_correlated_features+1,num_of_features+1):
    master_data_df['f'+str(i)] = np.random.normal(size=num_of_samples)

Validate that the final data matrix has the correlations we desire.

In [None]:
display(master_data_df.corr().style.applymap(color_thresh))

,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10
f1,1.0,0.823484,0.124231,0.087341,-0.050655,-0.029098,-0.068163,-0.004578,-0.077361,-0.008783
f2,0.823484,1.0,0.104556,0.081274,-0.057743,-0.005934,-0.031893,-0.012371,-0.061228,-0.086081
f3,0.124231,0.104556,1.0,0.79308,-0.36299,-0.016176,-0.088331,0.0926,-0.106142,0.013993
f4,0.087341,0.081274,0.79308,1.0,-0.805855,-0.006909,-0.109402,0.07894,-0.028553,-0.00583
f5,-0.050655,-0.057743,-0.36299,-0.805855,1.0,-0.02324,0.035654,-0.043511,-0.016874,0.010555
f6,-0.029098,-0.005934,-0.016176,-0.006909,-0.02324,1.0,0.007827,-0.039568,-0.070236,-0.006999
f7,-0.068163,-0.031893,-0.088331,-0.109402,0.035654,0.007827,1.0,-0.044487,0.064861,0.025359
f8,-0.004578,-0.012371,0.0926,0.07894,-0.043511,-0.039568,-0.044487,1.0,-0.033496,0.02769
f9,-0.077361,-0.061228,-0.106142,-0.028553,-0.016874,-0.070236,0.064861,-0.033496,1.0,0.039309
f10,-0.008783,-0.086081,0.013993,-0.00583,0.010555,-0.006999,0.025359,0.02769,0.039309,1.0
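The coloured table relies on visual inspection. As a complement, the strongly correlated pairs can be listed programmatically. The sketch below is one possible approach; `strong_pairs` is a hypothetical helper, and the three-column demo frame with seed 0 is an assumption made so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, thres=0.5):
    """List (col_a, col_b, corr) for distinct column pairs with |corr| >= thres."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, skipping the diagonal
            r = corr.loc[a, b]
            if abs(r) >= thres:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# Demo data: f1 and f2 correlated at about 0.8, f3 independent.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=400),
    columns=['f1', 'f2'],
)
demo['f3'] = rng.normal(size=400)

found = strong_pairs(demo)
print(found)
```

Applied to the correlation table above with the default threshold of 0.5, such a helper would be expected to report the pairs f1/f2, f3/f4 and f4/f5; the f3/f5 correlation of about -0.36 falls below the threshold.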


## Target variable creation

Now we just need to create the polynomial and make our target variable.

For now we shall create a linear target variable: a weighted sum of the features plus an intercept and Gaussian noise.

In [None]:
#create coefficients plus intercept (one weight per feature, intercept stored last)
coeffs = 10*np.random.random(num_of_features+1)
coeffs

array([4.31641462, 5.80856897, 7.61175604, 6.17681988, 3.00744034,
       8.0599153 , 7.69436059, 3.61556823, 1.8993543 , 8.72275618,
       0.42003053])

In [None]:
#combine them to produce the target variable and append it to the master dataframe
master_data_df['y'] = master_data_df.apply(lambda x: sum(x*coeffs[:-1])+coeffs[-1]+np.random.normal(0,5),axis=1)
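The cell above computes, for each row, $y = \sum_{i=1}^{10} \beta_i f_i + \beta_0 + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 5^2)$, where $\beta_0$ is the last entry of `coeffs`. The same target can be built without a row-wise `apply` by using a matrix product. The sketch below uses a small toy frame and made-up coefficients (`X`, `coeffs_demo` and the seed are assumptions for illustration) and checks that both forms agree when the same noise is added.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-ins for the feature frame and coefficients (illustrative only).
X = pd.DataFrame(rng.normal(size=(5, 3)), columns=['f1', 'f2', 'f3'])
coeffs_demo = np.array([2.0, -1.0, 0.5, 3.0])  # three weights, intercept last

# Draw the noise once so both computations can be compared.
noise = rng.normal(0, 5, size=len(X))

# Vectorised: feature matrix times weights, plus intercept, plus noise.
y_vectorised = X.to_numpy() @ coeffs_demo[:-1] + coeffs_demo[-1] + noise

# Row-wise, mirroring the notebook's lambda (noise added afterwards).
y_rowwise = X.apply(
    lambda row: sum(row * coeffs_demo[:-1]) + coeffs_demo[-1], axis=1
).to_numpy() + noise

print(np.allclose(y_vectorised, y_rowwise))
```

The vectorised form avoids a Python-level loop over rows, which matters little at 400 samples but scales better for larger datasets.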

In [None]:
master_data_df.head()

,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,y
0,0.068027,0.080662,0.800527,-0.070646,0.485276,-1.39334,1.073835,0.725156,-0.160102,-1.327064,-9.811862
1,-1.76067,-1.607289,-1.812927,-1.586756,-0.443165,-1.457748,1.123388,0.450613,0.318285,1.334552,-29.722292
2,0.27919,0.988703,-1.497472,-1.322321,0.13249,1.751996,-0.285369,0.588471,-3.343897,-0.914913,-13.128286
3,-0.558478,-0.455242,-0.998965,-1.585366,1.522368,-1.176115,-1.000632,-0.223499,0.905531,-0.088387,-33.966458
4,1.620397,1.008161,1.007725,1.816563,-1.888174,1.552265,-1.345767,1.43262,-0.001464,-1.118416,28.163941


## Data export 

We now export:
 - the created dataset, to be used for model creation
 - the parameters we used, for future reference

In [None]:
#data export
master_data_df.to_csv('synth1.csv')

In [None]:
#parameter export
params = {
    'features': num_of_features, 
    'samples': num_of_samples,
    'split': train_test_split,
    'corr_features': num_of_correlated_features,
    'mean':mean,
    'cov': cov,
    'mean2': mean2,
    'cov2': cov2,
    'coeffs': coeffs,
}

with open("params1.pkl", "wb") as output_file:
    pickle.dump(params, output_file)
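The exported files can be read back as sketched below. The sketch writes a small stand-in dataset to a temporary directory so that it runs on its own. The file names `synth_demo.csv` and `params_demo.pkl`, the toy data, the shuffle seed, and the reading of `split` as the test-set fraction are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-ins for the exported dataset and parameters (illustrative only).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(8, 3)), columns=['f1', 'f2', 'y'])
demo_params = {'samples': 8, 'split': 0.25}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'synth_demo.csv'
    pkl_path = Path(tmp) / 'params_demo.pkl'

    # Export in the same way as the notebook cells above.
    demo_df.to_csv(csv_path)
    with open(pkl_path, 'wb') as output_file:
        pickle.dump(demo_params, output_file)

    # to_csv wrote the row index as the first column, so read it back as the index.
    loaded_df = pd.read_csv(csv_path, index_col=0)
    with open(pkl_path, 'rb') as input_file:
        loaded_params = pickle.load(input_file)

# Assumption: 'split' is the fraction of rows held out for testing.
n_test = int(round(loaded_params['split'] * len(loaded_df)))
shuffled = loaded_df.sample(frac=1, random_state=0)
test_df = shuffled.iloc[:n_test]
train_df = shuffled.iloc[n_test:]

print(len(train_df), len(test_df))
```

Reading with `index_col=0` avoids an extra `Unnamed: 0` column appearing in the loaded frame.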