# Synthetic Data Creation

This notebook is where we create the synthetic dataset to be used for our testing.

Summary of what we wish to achieve:

We aim to create a dataset from which we can build a linear or simple polynomial regression model without regularisation. The dataset needs a sufficient number of samples to form both a training set and a test set. Given the purpose of this experiment, we also require that some of the features are correlated with one another.

In [1]:
#Set master parameters.

num_of_features = 10
num_of_samples = 400
train_test_split = 0.75
num_of_correlated_features = 5
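
As a rough illustration of how these parameters could translate into a split, the sketch below turns `train_test_split` into row counts and shuffled index sets. The seed and the names `n_train`, `n_test`, `train_idx`, and `test_idx` are assumptions made for this example and do not appear elsewhere in the notebook.

```python
import numpy as np

# Values copied from the parameter cell above.
num_of_samples = 400
train_test_split = 0.75

# Number of rows that fall in each partition.
n_train = int(num_of_samples * train_test_split)
n_test = num_of_samples - n_train

# Shuffle the row indices once, then slice them into two disjoint sets.
# The seed is an arbitrary choice, used only to make the example repeatable.
rng = np.random.default_rng(0)
shuffled = rng.permutation(num_of_samples)
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]
```

With a DataFrame of features, `df.iloc[train_idx]` and `df.iloc[test_idx]` would then select the two partitions.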

In [5]:
#Library imports
import numpy as np
import pandas as pd

In [100]:
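# Colouring function for correlation tables
def color_thresh(val, thres=0.5):
    """Return a CSS background colour for values with |val| >= thres."""
    if abs(val) >= thres:
        # Red for strong negative values, green for strong positive values
        if val < 0:
            return 'background-color: red'
        else:
            return 'background-color: green'
    else:
        # Below the threshold: leave the cell unstyled
        return ''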

The function above will be useful later, when we need to quickly inspect the correlations in our generated dataset.

In [20]:
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # unit variances, covariance of 0.8 between f1 and f2

In [42]:
data_df = pd.DataFrame(np.random.multivariate_normal(mean,cov,num_of_samples),columns=['f1','f2'])

In [44]:
data_df.corr()

,f1,f2
f1,1.0,0.82794
f2,0.82794,1.0


In [65]:
mean2 = [0, 0, 0]
cov2 = [[1, 0.9, 0], [0.9, 1, -0.9], [0,-0.9,1]]  # symmetric, but not positive semidefinite

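Note that a covariance matrix must be positive semidefinite, which the matrix above is not. However, for any real matrix $A$ the product $AA^{T}$ is positive semidefinite, so we can use that instead. Because `cov2` is symmetric, $AA^{T}$ is simply `cov2` multiplied by itself. The product no longer has a unit diagonal, so the correlations it implies differ from the entries of `cov2`.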
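
One way to check the claim about positive semidefiniteness is to look at the eigenvalues: a symmetric matrix is positive semidefinite exactly when none of them is negative. The helper `is_psd` and its tolerance are assumptions introduced for this sketch and are not part of the notebook.

```python
import numpy as np

cov2 = np.array([[1, 0.9, 0], [0.9, 1, -0.9], [0, -0.9, 1]])

def is_psd(matrix, tol=1e-10):
    """Return True if a symmetric matrix has no eigenvalue below -tol."""
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh(matrix).min() >= -tol)

original_ok = is_psd(cov2)           # False: one eigenvalue is negative
product_ok = is_psd(cov2 @ cov2.T)   # True: the eigenvalues are squared
```

The small tolerance guards against eigenvalues that are zero in exact arithmetic but come out slightly negative in floating point.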

In [101]:
np.matmul(cov2,cov2)

array([[ 1.81,  1.8 , -0.81],
       [ 1.8 ,  2.62, -1.8 ],
       [-0.81, -1.8 ,  1.81]])
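
Since the diagonal of this product is not 1, its off-diagonal entries are covariances rather than correlations. The sketch below rescales it by the standard deviations to show the correlation matrix it implies; `cov_psd`, `std`, and `implied_corr` are names assumed for this example.

```python
import numpy as np

cov2 = np.array([[1, 0.9, 0], [0.9, 1, -0.9], [0, -0.9, 1]])
cov_psd = cov2 @ cov2.T

# Standard deviations are the square roots of the diagonal variances.
std = np.sqrt(np.diag(cov_psd))

# corr_ij = cov_ij / (std_i * std_j)
implied_corr = cov_psd / np.outer(std, std)
```

Sample correlations from a finite draw will scatter around `implied_corr` rather than around the entries of `cov2`.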

In [73]:
data_df2 = pd.DataFrame(np.random.multivariate_normal(mean2,np.matmul(cov2,cov2),num_of_samples),columns=['f3','f4','f5'])

In [74]:
data_df2.corr()

,f3,f4,f5
f3,1.0,0.761212,-0.296504
f4,0.761212,1.0,-0.781817
f5,-0.296504,-0.781817,1.0


In [91]:
master_data_df = pd.concat([data_df,data_df2],axis=1)

In [94]:
# Add the remaining features as independent standard normal draws
for i in range(num_of_correlated_features+1,num_of_features+1):
    master_data_df['f'+str(i)] = np.random.normal(size=num_of_samples)

In [99]:
display(master_data_df.corr().style.applymap(color_thresh))

,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10
f1,1.0,0.82794,-0.069792,0.07397,-0.083145,0.097433,0.093638,-0.054433,0.044291,-0.056656
f2,0.82794,1.0,-0.085871,0.057217,-0.084251,0.093114,0.103052,0.028197,0.03916,0.025193
f3,-0.069792,-0.085871,1.0,0.761212,-0.296504,-0.017036,-0.150161,0.0598,-0.086533,0.103383
f4,0.07397,0.057217,0.761212,1.0,-0.781817,0.067081,-0.064381,0.178381,-0.076477,0.105529
f5,-0.083145,-0.084251,-0.296504,-0.781817,1.0,-0.162931,-0.077269,-0.232824,0.097265,-0.135821
f6,0.097433,0.093114,-0.017036,0.067081,-0.162931,1.0,0.089055,0.127965,-0.042918,0.088007
f7,0.093638,0.103052,-0.150161,-0.064381,-0.077269,0.089055,1.0,0.00113,-0.040673,0.08132
f8,-0.054433,0.028197,0.0598,0.178381,-0.232824,0.127965,0.00113,1.0,-0.045765,0.018605
f9,0.044291,0.03916,-0.086533,-0.076477,0.097265,-0.042918,-0.040673,-0.045765,1.0,0.012362
f10,-0.056656,0.025193,0.103383,0.105529,-0.135821,0.088007,0.08132,0.018605,0.012362,1.0


Now we just need to create the polynomial and make our target variable.

For now, we shall create a linear target variable.
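
A minimal sketch of what a linear target could look like: a weighted sum of the features plus an intercept and Gaussian noise. The coefficients, intercept, noise level, seed, and the stand-in table `features_df` are all assumptions chosen for illustration; in the notebook the features would come from `master_data_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
num_of_samples, num_of_features = 400, 10

# Stand-in feature table with the same column names as master_data_df.
features_df = pd.DataFrame(
    rng.normal(size=(num_of_samples, num_of_features)),
    columns=[f"f{i}" for i in range(1, num_of_features + 1)],
)

# Hypothetical ground-truth parameters of the linear model.
true_coefs = rng.uniform(-2, 2, size=num_of_features)
intercept = 1.0
noise_std = 0.5

# target = X @ beta + intercept + noise
noise = rng.normal(scale=noise_std, size=num_of_samples)
target = features_df.to_numpy() @ true_coefs + intercept + noise

labelled_df = features_df.assign(target=target)
```

Keeping `true_coefs` around makes it possible to compare fitted regression coefficients against the values that generated the data.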