In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

# Making Synthetic Data

In this lecture we used a synthetic dataset created to visually illustrate several non-linear feature transformations. In this notebook, we explain how this dataset was constructed.

To illustrate how linear models can fit non-linear relationships, we will create a synthetic "toy" dataset built from periodic functions. In the real world, one might encounter these kinds of periodic non-linear relationships when modeling time series data (e.g., user activity throughout the day).

Here the true model will be:

$$
Y = \cos(X_0) + \sin(X_1) + \frac{1}{4}X_0 - \frac{1}{2} X_1 + \frac{1}{5}X_0  X_1 + 5
$$

To make the data slightly more realistic, we draw the features $X_0$ and $X_1$ uniformly at random from $[-5, 5)$ and add a small amount of Gaussian noise (standard deviation 0.2) to the response variable $Y$:

In [None]:
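n = 500      # number of samples
noise = 0.2  # standard deviation of the Gaussian noise added to Y
np.random.seed(42)  # fix the seed so the dataset is reproducible

# Features drawn uniformly from [-5, 5)
X = np.random.rand(n,2) * 10 - 5
# Noise-free response from the true model above
Y = np.cos(X[:,0]) + np.sin(X[:,1]) + 0.25*X[:,0] - 0.5*X[:,1] + 0.2*X[:,0]*X[:,1] + 5.
# Add zero-mean Gaussian noise to the response
Y = Y + noise * np.random.randn(n)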
synth_data = pd.DataFrame({"X0": X[:,0], "X1": X[:,1], "Y": Y})
synth_data.head()
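
As a sanity check on the construction, the sketch below regenerates the same data and fits ordinary least squares on one feature per term of the true model. If the data were built as described, the estimated coefficients should land close to the ones in the equation above. The names `Phi`, `theta_hat`, and `theta_true` are hypothetical and introduced only for this illustration, and `np.linalg.lstsq` is used here as a convenient solver, not necessarily the fitting method used in the lecture.

```python
import numpy as np

# Regenerate the synthetic data with the same seed and call order as above.
n = 500
noise = 0.2
np.random.seed(42)
X = np.random.rand(n, 2) * 10 - 5
Y = (np.cos(X[:, 0]) + np.sin(X[:, 1]) + 0.25 * X[:, 0] - 0.5 * X[:, 1]
     + 0.2 * X[:, 0] * X[:, 1] + 5.0)
Y = Y + noise * np.random.randn(n)

# Hypothetical design matrix: one column per term of the true model.
# The model is non-linear in X0 and X1 but linear in these features.
Phi = np.column_stack([
    np.cos(X[:, 0]),    # cos(X0)
    np.sin(X[:, 1]),    # sin(X1)
    X[:, 0],            # X0
    X[:, 1],            # X1
    X[:, 0] * X[:, 1],  # interaction term X0 * X1
    np.ones(n),         # intercept
])

# Ordinary least squares fit of Y on the transformed features.
theta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# Coefficients read off the true model, in the same column order as Phi.
theta_true = np.array([1.0, 1.0, 0.25, -0.5, 0.2, 5.0])

# Residual spread should be roughly the noise level used to build the data.
residual_std = np.std(Y - Phi @ theta_hat)

print("estimated:", np.round(theta_hat, 3))
print("true:     ", theta_true)
print("residual std:", round(float(residual_std), 3))
```

Because every term of the true model appears as a column of `Phi`, the only thing the fit cannot explain is the added noise, so the residual standard deviation should be near `noise`.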

We can visualize the data in three dimensions:

In [None]:
data_scatter = go.Scatter3d(x=synth_data["X0"], y=synth_data["X1"], z=synth_data["Y"], 
                            mode="markers",
                            marker=dict(size=2))
layout = dict(margin=dict(l=0, r=0, t=0, b=0), 
              height=600,
              scene = dict(xaxis_title='X0', yaxis_title='X1', zaxis_title='Y'))
go.Figure([data_scatter], layout)

In [None]:
synth_data.to_csv("data/synthetic_data.csv", index=False)