# Synthetic Data Creation

This notebook is where we create the synthetic dataset to be used for our testing.

Summary of what we wish to achieve:

We aim to create a dataset from which we can build a linear regression or simple polynomial regression model without regularisation. The dataset needs a sufficient number of samples so that we can form a training set and a test set. Given the purpose of this experiment, we also require that some of the features are correlated with one another.

In [144]:
#Set master parameters.

num_of_features = 10
num_of_samples = 400
train_test_split = 0.25
num_of_correlated_features = 5

In [121]:
#Library imports
import numpy as np
import pandas as pd
import pickle

In [122]:
#colouring function: highlight strong correlations (|val| >= thres), red if negative, green if positive
def color_thresh(val,thres=0.5):
    if abs(val)>=thres:
        if val<0:
            return 'background-color: red'
        else:
            return 'background-color: green'
    else:
        return ''
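
As a quick illustration of what the colouring function returns, the sketch below repeats its definition so that it runs on its own and calls it on three made-up correlation values (the values are illustrative assumptions, not taken from the dataset).

```python
def color_thresh(val, thres=0.5):
    # Return a CSS rule for strong correlations, or an empty string otherwise
    if abs(val) >= thres:
        if val < 0:
            return 'background-color: red'
        else:
            return 'background-color: green'
    else:
        return ''


# Hypothetical example values: strong positive, strong negative, weak
strong_pos = color_thresh(0.83)
strong_neg = color_thresh(-0.80)
weak = color_thresh(0.04)

for label, css in [('0.83', strong_pos), ('-0.80', strong_neg), ('0.04', weak)]:
    print(label, repr(css))
```

The empty string means "no styling", which is why weakly correlated cells stay uncoloured when the function is applied to a correlation matrix.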

The function above will be useful later on, when we need to quickly inspect the correlations in our generated dataset.

## Correlated features

In [123]:
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # unit variances with covariance 0.8, so f1 and f2 are correlated

In [124]:
data_df = pd.DataFrame(np.random.multivariate_normal(mean,cov,num_of_samples),columns=['f1','f2'])

In [125]:
data_df.corr()

,f1,f2
f1,1.0,0.833972
f2,0.833972,1.0
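
The sample correlation above differs slightly from 0.8 and will change on every run, because the notebook does not fix a random seed. A minimal sketch of a reproducible variant is shown below, assuming NumPy's `default_rng` generator is acceptable; the seed value and the names `rng`, `sample_df` and `sample_corr` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumption: any fixed seed will do; 0 is arbitrary
rng = np.random.default_rng(seed=0)

mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # unit variances, so the covariance 0.8 is also the correlation
num_of_samples = 400

# Draw the two correlated features from a bivariate normal distribution
sample_df = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=num_of_samples),
    columns=['f1', 'f2'],
)

# The sample correlation should be close to, but not exactly, 0.8
sample_corr = sample_df.corr().loc['f1', 'f2']
print(round(sample_corr, 3))
```

With a fixed seed the same rows are generated on every run, which makes the exported parameters easier to relate back to the exported data.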


In [126]:
mean2 = [0, 0, 0]
cov2 = [[1, 0.9, 0], [0.9, 1, -0.9], [0,-0.9,1]]  # not a valid covariance matrix as written: it is not positive semidefinite

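Note that a covariance matrix needs to be positive semidefinite, which the matrix above is not. However, we can use the matrix $A A^{T}$ instead, which is always positive semidefinite. Because `cov2` is symmetric, this is simply `cov2` multiplied by itself. The resulting correlations will differ from the entries of `cov2`.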
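
One way to check the positive-semidefiniteness claim is to look at the eigenvalues: a symmetric matrix is positive semidefinite exactly when none of them is negative. The sketch below does this for `cov2` and for its product with its own transpose, and also derives the correlations implied by that product. The names `eig_cov2`, `psd_cov`, `eig_psd` and `implied_corr` are illustrative, not part of the notebook.

```python
import numpy as np

cov2 = np.array([[1, 0.9, 0], [0.9, 1, -0.9], [0, -0.9, 1]])

# Eigenvalues of the symmetric matrix; a negative one means it is not PSD
eig_cov2 = np.linalg.eigvalsh(cov2)
print('eigenvalues of cov2:      ', eig_cov2)

# A @ A.T is always PSD; cov2 is symmetric, so this equals cov2 @ cov2
psd_cov = cov2 @ cov2.T
eig_psd = np.linalg.eigvalsh(psd_cov)
print('eigenvalues of cov2 @ cov2.T:', eig_psd)

# Convert the PSD covariance into the correlation matrix it implies
std = np.sqrt(np.diag(psd_cov))
implied_corr = psd_cov / np.outer(std, std)
print(implied_corr.round(3))
```

The implied correlations are roughly 0.83 between f3 and f4, -0.45 between f3 and f5, and -0.83 between f4 and f5, rather than the 0.9, 0 and -0.9 written in `cov2`. Sample correlations drawn from the product matrix should therefore land near those values.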

In [127]:
np.matmul(cov2,cov2)

array([[ 1.81,  1.8 , -0.81],
       [ 1.8 ,  2.62, -1.8 ],
       [-0.81, -1.8 ,  1.81]])

In [128]:
data_df2 = pd.DataFrame(np.random.multivariate_normal(mean2,np.matmul(cov2,cov2),num_of_samples),columns=['f3','f4','f5'])

In [129]:
data_df2.corr()

,f3,f4,f5
f3,1.0,0.818899,-0.392983
f4,0.818899,1.0,-0.796179
f5,-0.392983,-0.796179,1.0


## Independent features

Now that the correlated features are created, we can create the remaining features and combine them all into one dataframe.

In [130]:
master_data_df = pd.concat([data_df,data_df2],axis=1)

In [131]:
#create remaining (independent) features and append to master dataframe
for i in range(num_of_correlated_features+1,num_of_features+1):
    #draw a 1-D array so that each new column is a plain vector of samples
    master_data_df['f'+str(i)] = np.random.normal(size=num_of_samples)

Validate that the final data matrix has the correlations we desire.

In [132]:
#note: in pandas >= 2.1, Styler.applymap is deprecated in favour of Styler.map
display(master_data_df.corr().style.applymap(color_thresh))

,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10
f1,1.0,0.833972,0.04335,-0.014966,0.068611,0.01815,-0.110752,0.030587,-0.027945,0.065
f2,0.833972,1.0,0.013808,-0.020577,0.052441,0.031699,-0.073705,0.029227,-0.007349,0.021673
f3,0.04335,0.013808,1.0,0.818899,-0.392983,-0.027446,-0.03777,0.002404,0.010906,0.016226
f4,-0.014966,-0.020577,0.818899,1.0,-0.796179,-0.048924,-0.004918,-0.007366,0.018192,0.000103
f5,0.068611,0.052441,-0.392983,-0.796179,1.0,0.030613,-0.015641,0.014615,0.015438,0.004837
f6,0.01815,0.031699,-0.027446,-0.048924,0.030613,1.0,0.020155,0.042374,-0.026892,0.049871
f7,-0.110752,-0.073705,-0.03777,-0.004918,-0.015641,0.020155,1.0,-0.009034,-0.000532,0.054066
f8,0.030587,0.029227,0.002404,-0.007366,0.014615,0.042374,-0.009034,1.0,-0.05428,0.028782
f9,-0.027945,-0.007349,0.010906,0.018192,0.015438,-0.026892,-0.000532,-0.05428,1.0,0.018597
f10,0.065,0.021673,0.016226,0.000103,0.004837,0.049871,0.054066,0.028782,0.018597,1.0


## Target variable creation

Now we just need to define the polynomial and create our target variable.

For now, we shall create a linear target variable: a weighted sum of the features plus an intercept and Gaussian noise.

In [133]:
#create coefficients plus intercept (the last entry is the intercept)
coeffs = 10*np.random.random(num_of_features+1)
coeffs

array([0.8548219 , 9.45614306, 4.93156789, 2.74044746, 4.57208473,
       0.24207174, 7.16864783, 5.66669495, 8.47362646, 6.49470771,
       4.22684349])

In [134]:
#combine them to produce the target variable (with Gaussian noise, standard deviation 5) and append it to the master dataframe
master_data_df['y'] = master_data_df.apply(lambda x: sum(x*coeffs[:-1])+coeffs[-1]+np.random.normal(0,5),axis=1)
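
The row-wise `apply` above computes $y = \beta_0 + \sum_i \beta_i f_i + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 5^2)$, where the last entry of `coeffs` plays the role of $\beta_0$. The same quantity can be computed in one matrix product. The sketch below uses a small toy dataframe (sizes, seed and names such as `X`, `noise` and `y_vectorised` are assumptions for illustration) and draws the noise once so the two approaches can be compared directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # assumption: arbitrary fixed seed

# Toy stand-in for the feature matrix (hypothetical sizes)
n_samples, n_features = 5, 3
X = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=['f1', 'f2', 'f3'],
)

# Same layout as in the notebook: feature coefficients first, intercept last
coeffs = 10 * rng.random(n_features + 1)

# Draw the noise up front so both computations share it
noise = rng.normal(0, 5, size=n_samples)

# Vectorised: one matrix-vector product, plus intercept and noise
y_vectorised = X.to_numpy() @ coeffs[:-1] + coeffs[-1] + noise

# Row-wise equivalent, mirroring the apply call in the notebook
y_rowwise = X.apply(lambda row: sum(row * coeffs[:-1]) + coeffs[-1], axis=1) + noise

print(np.allclose(y_vectorised, y_rowwise))
```

For 400 rows either form is fast enough; the vectorised version mainly makes the linear model easier to read off from the code.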

In [135]:
master_data_df.head()

,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,y
0,-1.571577,-1.47001,0.435267,1.418019,-0.861812,0.362211,1.288429,0.766177,-1.858609,-0.169907,-15.961233
1,0.589387,1.389671,1.093053,1.056387,-0.195147,-1.061001,-0.435579,-1.398377,-0.093679,0.291864,10.64672
2,1.007882,0.611113,1.266938,0.265561,0.245825,-1.430773,-0.276978,-0.797279,2.1934,1.410456,43.823134
3,0.035006,0.069504,1.432853,1.04386,-0.307817,1.669126,0.970242,-1.274798,0.424221,-1.138066,12.617782
4,-0.057693,-0.241715,0.932813,2.657547,-2.284334,-0.130933,-0.288229,-0.159546,0.76191,-1.169418,-7.745875


## Data export 

We now export:
 - the created dataset, to be used for model creation
 - the parameters that we have used, for future reference

In [136]:
#data export
master_data_df.to_csv('synth1.csv')

In [145]:
#parameter export
params = {
    'features': num_of_features, 
    'samples': num_of_samples,
    'split': train_test_split,
    'corr_features': num_of_correlated_features,
    'mean':mean,
    'cov': cov,
    'mean2': mean2,
    'cov2': cov2,
    'coeffs': coeffs,
}

with open("params.pkl", "wb") as output_file:
    pickle.dump(params, output_file)
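
For reference, a downstream notebook could read the two exports back and use the stored split fraction to form the training and test sets. The sketch below is one possible way to do this, not the method used in this project: it works on a small toy dataframe written to a temporary directory, and the file names, sizes, seed and variable names are all illustrative assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # assumption: arbitrary fixed seed

# Toy stand-ins for the exported dataset and parameters
toy_df = pd.DataFrame(rng.normal(size=(8, 2)), columns=['f1', 'y'])
toy_params = {'split': 0.25, 'coeffs': np.array([1.0, 2.0])}

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'synth_demo.csv'
    pkl_path = Path(tmp_dir) / 'params_demo.pkl'

    # Export in the same way as the notebook
    toy_df.to_csv(csv_path)
    with open(pkl_path, 'wb') as output_file:
        pickle.dump(toy_params, output_file)

    # index_col=0 restores the row index written by to_csv
    loaded_df = pd.read_csv(csv_path, index_col=0)
    with open(pkl_path, 'rb') as input_file:
        loaded_params = pickle.load(input_file)

# Hold out the stored fraction of rows as the test set; the rest is training data
test_df = loaded_df.sample(frac=loaded_params['split'], random_state=0)
train_df = loaded_df.drop(test_df.index)

print(len(train_df), len(test_df))
```

Without `index_col=0`, pandas reads the saved row index back as an extra column named `Unnamed: 0`, which would then need to be dropped before modelling.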