In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('dataset.csv')

In [3]:
x1 = (np.array([df['x1']])).T
x2 = np.array([df['x2']]).T
x3 = np.array([df['x3']]).T
X = np.array([df['x1'], df['x2'], df['x3']]).T
y = df['y'].values

In [4]:
## Question 1:
## Run a simple linear regression model to predict y from x1. 
## Report the linear model you found. 
## Predict the value of y for a new x1 values of 1 and 2 respectively.

from sklearn.linear_model import LinearRegression

# Ordinary least squares with an intercept term.
model = LinearRegression(fit_intercept=True)
model.fit(x1, y)
print(model.coef_, model.intercept_)

x1_new = np.array([[1], [2]]) 
model.predict(x1_new)

(array([-2.03833663]), 5.927948918061609)


array([3.88961228, 1.85127565])

### Question1 Quick Report:
If we use only ``x1`` as a feature for linear regression, the hypothesis is:
$$ h_{\theta}(x_1) = \theta_1 x_1 + \theta_0$$

Let $X_1 = [\,x_1 \;\; \mathbb{1}\,]$ be the $n \times 2$ design matrix whose second column of ones carries the intercept, and let $\theta = [\theta_1, \theta_0]^{T}$. The least-squares solution is then:
$$ \theta^{*} = (X_1^{T}X_1)^{-1}X_1^{T}\boldsymbol{y}$$

Fitting the linear regression model in ``sklearn`` gives:
$$\theta_1 = -2.0383 \\ \theta_0 = 5.9279$$

With this model, the predicted ``y`` is ``3.8896`` for ``x1 = 1`` and ``1.8513`` for ``x1 = 2``.
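
The closed-form solution can be checked numerically. The sketch below does not use `dataset.csv`; it generates synthetic data (the slope, intercept, and noise level are assumptions chosen for illustration) and solves the normal equations with a column of ones appended to the feature, so that the intercept is estimated alongside the slope.

```python
import numpy as np

# Synthetic stand-in for the dataset (assumed values, for illustration only).
rng = np.random.default_rng(0)
x1_demo = rng.uniform(-3, 3, size=(100, 1))
y_demo = 5.0 - 2.0 * x1_demo[:, 0] + rng.normal(scale=0.5, size=100)

# Design matrix [x1, 1]: the column of ones carries the intercept.
A = np.column_stack((x1_demo, np.ones(len(x1_demo))))

# Normal equations (A^T A) theta = A^T y, solved without an explicit inverse.
theta_demo = np.linalg.solve(A.T @ A, A.T @ y_demo)
slope, intercept = theta_demo

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Predictions at x1 = 1 and x1 = 2.
x_query = np.array([1.0, 2.0])
y_query = slope * x_query + intercept
print(slope, intercept, y_query)
```

Using `np.linalg.solve` rather than inverting $X_1^{T}X_1$ directly is usually the more numerically stable choice.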

In [5]:
## Question 2:
## Use cross-validation to predict generalization error, 
## while the error of a single data point (x1, y) from a model M is defined as (M(x1)−y)^2. 
## Describe how you did this.

# Fit on the first 70 rows only; the last 30 rows are held out for validation.
model = LinearRegression(fit_intercept=True)
model.fit(x1[:70], y[:70])
print(model.coef_, model.intercept_)

# Held-out validation set.
x1_val = x1[70:100]
y_val = y[70:100]
y_pred = model.predict(x1_val)

# Mean squared error over the 30 validation points.
gen_error = np.mean((y_pred - y_val)**2)
print(gen_error)

(array([-0.79724367]), 4.537549841542857)
15.682727756450314


### Question2 Quick Report:
The dataset is split into a training set (the first 70 rows) and a validation set (the last 30 rows), a 7:3 ratio. This is a single hold-out split rather than full k-fold cross-validation. Fitting the same model on the training set gives:

$$\theta_1 = -0.7972 \\ \theta_0 = 4.5375$$

These values differ markedly from the model obtained in Question 1.

Using this model to predict the validation set, the generalization error is estimated as:

$$I_S[h_{\theta}] = \frac{1}{30}\sum_{i=1}^{30}(h_{\theta}(x_{1i})-y_i)^2, \quad (x_{1i},y_i)\in V$$

where $V$ is the validation set. This gives:

$$I_S[h_{\theta}] = 15.6827$$

This corresponds to a root-mean-square error of about ``3.96``, comparable in size to the predictions in Question 1. It suggests that the model underfits, i.e. a linear model in ``x1`` is not enough to fit this dataset.
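
For comparison, a k-fold variant of the same error estimate could look like the sketch below. It is not the procedure used in this notebook: the helper `kfold_mse`, the choice of 5 shuffled folds, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def kfold_mse(x, y, k=5, seed=0):
    """Average squared error of a straight-line fit over k shuffled folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    fold_errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)
        # Fit y = theta_1 * x + theta_0 on the training folds.
        A_train = np.column_stack((x[train_idx], np.ones(len(train_idx))))
        theta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Squared error (M(x) - y)^2 averaged over the held-out fold.
        A_val = np.column_stack((x[val_idx], np.ones(len(val_idx))))
        fold_errors.append(np.mean((A_val @ theta - y[val_idx]) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data (assumed): linear signal plus unit-variance noise.
rng = np.random.default_rng(1)
x_demo = rng.uniform(-3, 3, size=200)
y_demo = 5.0 - 2.0 * x_demo + rng.normal(scale=1.0, size=200)
cv_error = kfold_mse(x_demo, y_demo, k=5)
print(cv_error)
```

Shuffling before splitting matters when the rows of a file are ordered in some way, because an unshuffled split can leave the training and validation sets covering different regions of the data.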

In [6]:
## Question 3:
## Predict y from x1 by constructing polynomial regression models with degree of 2, 3, and 4 respectively.
## Report polynomial models with the above three degrees. 
## With each of these models, predict the value of y for a new x1 values of 1 and 2 respectively.

poly_degree = 2

# Design matrix with columns x1^d, ..., x1^1, x1^0 (highest power first).
X1 = np.column_stack([x1**i for i in range(poly_degree, -1, -1)])

# Solve the normal equations (X1^T X1) theta = X1^T y.
theta = np.linalg.solve(X1.T.dot(X1), X1.T.dot(y))

# Feature vector for x1 = 1: every power of 1 is 1.
X_new1 = np.ones(poly_degree + 1)
y_new1 = X_new1.dot(theta)

# Feature vector for x1 = 2, in the same column order: 2^d, ..., 2^1, 2^0.
X_new2 = 2.0 ** np.arange(poly_degree, -1, -1)
y_new2 = X_new2.dot(theta)

print(theta)
print(y_new1)
print(y_new2)

[-1.00550258  1.98367369  3.27369294]
4.251864056472463
3.219030004977248


In [7]:
poly_degree = 3

# Design matrix with columns x1^d, ..., x1^1, x1^0 (highest power first).
X1 = np.column_stack([x1**i for i in range(poly_degree, -1, -1)])

# Solve the normal equations (X1^T X1) theta = X1^T y.
theta = np.linalg.solve(X1.T.dot(X1), X1.T.dot(y))

# Feature vector for x1 = 1: every power of 1 is 1.
X_new1 = np.ones(poly_degree + 1)
y_new1 = X_new1.dot(theta)

# Feature vector for x1 = 2, in the same column order: 2^d, ..., 2^1, 2^0.
X_new2 = 2.0 ** np.arange(poly_degree, -1, -1)
y_new2 = X_new2.dot(theta)

print(theta)
print(y_new1)
print(y_new2)

[-8.59910796e-04 -1.00034312e+00  1.97545996e+00  3.27636183e+00]
4.250618769472453
3.219030004977263


In [8]:
poly_degree = 4

# Design matrix with columns x1^d, ..., x1^1, x1^0 (highest power first).
X1 = np.column_stack([x1**i for i in range(poly_degree, -1, -1)])

# Solve the normal equations (X1^T X1) theta = X1^T y.
theta = np.linalg.solve(X1.T.dot(X1), X1.T.dot(y))

# Feature vector for x1 = 1: every power of 1 is 1.
X_new1 = np.ones(poly_degree + 1)
y_new1 = X_new1.dot(theta)

# Feature vector for x1 = 2, in the same column order: 2^d, ..., 2^1, 2^0.
X_new2 = 2.0 ** np.arange(poly_degree, -1, -1)
y_new2 = X_new2.dot(theta)

print(theta)
print(y_new1)
print(y_new2)

[-0.00383093  0.02978751 -1.0788899   2.0444677   3.26318505]
4.254719440387376
3.21356611535215


### Question3 Quick Report:
If we use only ``x1`` as a feature for polynomial regression with highest degree ``d``, the hypothesis can be written as:

$$ h_{\theta}(x_1) = \sum_{i=0}^{d}\theta_i x_1^i$$

The least-squares solution is:
$$ \theta^{*} = (X_1^{T}X_1)^{-1}X_1^{T}\boldsymbol{y}$$

where $X_1$ is an $n \times (d + 1)$ matrix (``n`` is the number of samples) whose columns are the element-wise powers of $x_1$, from the highest power down to the constant term:

$$
X_1 = 
\begin{bmatrix}
| & | & & | \\
x_1^{d} & x_1^{d-1} & \cdots & x_1^{0} \\
| & | & & | \\
\end{bmatrix}
$$

Solving for $\theta$ with ``d = 2,3,4`` gives the following coefficients, listed in the same order as the columns of $X_1$ (highest power first):

$$\theta = 
\begin{cases}
[-1.0055, 1.9837, 3.2737] & d = 2 \\
[-8.6\times 10^{-4}, -1.0003, 1.9755, 3.2764] & d = 3 \\
[-3.8\times 10^{-3}, 0.0298, -1.0789, 2.0445, 3.2632] & d = 4\\
\end{cases}
$$

The corresponding predictions of ``y`` at ``x1 = 1`` and ``x1 = 2`` are:

$$y = 
\begin{cases}
[4.2519, 3.2190] & d = 2 \\
[4.2506, 3.2190] & d = 3 \\
[4.2547, 3.2136] & d = 4\\
\end{cases}
$$

As the results show, the higher-degree models assign very small coefficients to the cubic and quartic terms and produce predictions close to those of the quadratic model. This suggests that the data can be described, though perhaps not fully, by a quadratic model in ``x1``.
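
The same pattern can be reproduced on data whose true form is known. The sketch below is an illustration on synthetic quadratic data (the coefficients and noise level are assumptions, and `fit_poly` is a hypothetical helper); it builds the design matrix with `np.vander`, which orders columns from the highest power down, matching the ordering used above.

```python
import numpy as np

# Synthetic quadratic data (coefficients assumed for illustration).
rng = np.random.default_rng(2)
x_demo = rng.uniform(-3, 3, size=200)
y_demo = -1.0 * x_demo**2 + 2.0 * x_demo + 3.0 + rng.normal(scale=0.3, size=200)

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; coefficients ordered highest power first."""
    V = np.vander(x, degree + 1)  # columns x^d, ..., x^1, x^0
    theta, *_ = np.linalg.lstsq(V, y, rcond=None)
    return theta

for d in (2, 3, 4):
    theta_d = fit_poly(x_demo, y_demo, d)
    # np.polyval expects the same highest-power-first ordering.
    print(d, theta_d, np.polyval(theta_d, [1.0, 2.0]))

theta_2 = fit_poly(x_demo, y_demo, 2)
theta_4 = fit_poly(x_demo, y_demo, 4)
```

When the underlying relationship is quadratic, the cubic and quartic coefficients of the higher-degree fits should come out close to zero, which is the behaviour observed in the results above.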

In [9]:
## Question 4:
## Run a simple linear regression model to predict y from X. 
## Report the linear model you found. 
## Predict the value of y for a new X values of (1, 1, 1), (1, 0, 4), and (3, 2, 1) respectively.

# Ordinary least squares on all three features, with an intercept term.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_, model.intercept_)

# New points, one row per (x1, x2, x3) triple.
X_new = np.array([[1,1,1], [1,0,4], [3,2,1]]) 

model.predict(X_new)

(array([-2.00371927,  0.53256334, -0.26560187]), 5.31416717245698)


array([3.57740937, 2.24804044, 0.10253417])

### Question4 Quick Report:
The hypothesis is:
$$ h_{\theta}(\boldsymbol{x}) = \theta^{T}\boldsymbol{x} $$

where $\boldsymbol{x}$ holds the three features followed by a constant 1 for the intercept, and we have ``n`` observations $(\boldsymbol{x}^{(1)},y_1), ..., (\boldsymbol{x}^{(n)},y_n)$.

The least-squares solution is:
$$ \theta^{*} = (X^{T}X)^{-1}X^{T}\boldsymbol{y}$$

where $X$ is an $n \times 4$ matrix whose columns are the feature vectors $x_1$, $x_2$, $x_3$ and a vector of ones:

$$
X = 
\begin{bmatrix}
| & | & | & | \\
x_1 & x_2 & x_3 & \mathbb{1} \\
| & | & | & | \\
\end{bmatrix}
$$

Fitting the linear regression model in ``sklearn`` gives:
$$\theta = [-2.0037, 0.5326, -0.2656, 5.3142]$$

With this model, the predicted ``y`` is ``3.5774``, ``2.2480`` and ``0.1025`` for the three given points, respectively.
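
The multi-feature normal equations can be sketched in the same way. The example below uses synthetic three-feature data (the weights, intercept, and noise level are assumptions) and a hypothetical helper `add_ones` that appends the constant column, so the last entry of the solution plays the role of the intercept.

```python
import numpy as np

# Synthetic three-feature data (weights and intercept assumed for illustration).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([-2.0, 0.5, -0.25])
y_demo = X_demo @ true_w + 5.0 + rng.normal(scale=0.1, size=100)

def add_ones(X):
    """Append a column of ones so the last coefficient acts as the intercept."""
    return np.column_stack((X, np.ones(len(X))))

# theta = [theta_1, theta_2, theta_3, theta_0]; one design-matrix row per observation.
A = add_ones(X_demo)
theta_demo = np.linalg.solve(A.T @ A, A.T @ y_demo)

# New points must be augmented the same way before predicting.
X_query = np.array([[1, 1, 1], [1, 0, 4], [3, 2, 1]], dtype=float)
y_query = add_ones(X_query) @ theta_demo
print(theta_demo, y_query)
```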

In [10]:
## Question 5:
## Use cross-validation to predict generalization error, 
## while the error of a single data point (X, y) from a model M is defined as (M(X)−y)^2. 
## Describe how you did this.

# Fit on the first 70 rows only; the last 30 rows are held out for validation.
model = LinearRegression(fit_intercept=True)
model.fit(X[:70], y[:70])
print(model.coef_, model.intercept_)

# Held-out validation set.
X_val = X[70:100]
y_val = y[70:100]
y_pred = model.predict(X_val)

# Mean squared error over the 30 validation points.
gen_error = np.mean((y_pred - y_val)**2)
print(gen_error)

(array([-0.78645836,  0.48430326, -0.30035386]), 4.106935599602342)
14.553030639713255


### Question5 Quick Report:
The dataset is split into a training set (the first 70 rows) and a validation set (the last 30 rows), a 7:3 ratio. As in Question 2, this is a single hold-out split. Fitting the same model on the training set gives:

$$\theta = [-0.7865,  0.4843, -0.3004, 4.1069]$$

This differs markedly from the model obtained in Question 4, mainly in the coefficient of ``x1`` and the intercept.

Using this model to predict the validation set, the generalization error is estimated as:

$$I_S[h_{\theta}] = \frac{1}{30}\sum_{i=1}^{30}(h_{\theta}(x_{i})-y_i)^2, \quad (x_{i},y_i)\in V$$

where $V$ is the validation set. This gives:

$$I_S[h_{\theta}] = 14.5530$$

This corresponds to a root-mean-square error of about ``3.81``, only slightly below the single-feature result of Question 2 (``15.6827``). It suggests that the model still underfits, i.e. a linear model in ``X`` is not enough to fit this dataset, perhaps partly because of the nonlinearity with respect to ``x1``.
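
One way to probe the nonlinearity hypothesis is to add a squared term for the first feature and compare hold-out errors. The sketch below does this on synthetic data only (the generating coefficients, the shuffled 140/60 split, and the helper `holdout_mse` are assumptions for illustration); it says nothing about the actual dataset until the same comparison is run on it.

```python
import numpy as np

# Synthetic data with a quadratic dependence on the first feature (assumed).
rng = np.random.default_rng(4)
X_demo = rng.uniform(-3, 3, size=(200, 3))
y_demo = (-1.0 * X_demo[:, 0]**2 + 2.0 * X_demo[:, 0]
          + 0.5 * X_demo[:, 1] - 0.3 * X_demo[:, 2] + 3.0
          + rng.normal(scale=0.3, size=200))

def holdout_mse(F, y, n_train, seed=0):
    """Fit least squares on a shuffled training part; return MSE on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    tr, va = idx[:n_train], idx[n_train:]
    A = np.column_stack((F, np.ones(len(F))))
    theta, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)
    return float(np.mean((A[va] @ theta - y[va]) ** 2))

mse_linear = holdout_mse(X_demo, y_demo, n_train=140)
# Same split, with the square of the first feature added as a fourth column.
F_quad = np.column_stack((X_demo, X_demo[:, 0] ** 2))
mse_quad = holdout_mse(F_quad, y_demo, n_train=140)
print(mse_linear, mse_quad)
```

If the validation error drops sharply once the squared term is included, that supports the view that the linear model was underfitting rather than suffering from noise alone.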