# Assignment 1

In this assignment, you will encounter both pen-and-paper exercises and coding tasks to be solved using Python 3 and the NumPy library. To complete each exercise, use the designated cell in this Jupyter notebook.

For the pen-and-paper exercises, you may submit either a typeset solution or a good-quality digitized version of your handwritten solution.

As for the Python exercises:

- Refrain from altering the provided code; simply fill in the missing portions as indicated.
- Do not use any additional libraries beyond those already included in the code.
- Make sure that the output of all code cells is visible in your submitted notebook. The evaluator will NOT execute your code before grading your submission.
   
Please identify the authors of this assignment in the cell below.

### Author 1: Name, UP number
### Author 2: Name, UP number
### Author 3: Name, UP number

## 1. Probability

Suppose you are participating in a quiz competition where each question comes with four answer choices, exactly one of which is correct. At a certain point, the quiz host asks a question for which you have no idea what the correct answer is. The host offers you a bonus: you pick two of the options, and the host eliminates one incorrect answer from among them. Let us label the four answers as $a$, $b$, $c$, and $d$.

### 1.1. Pen-and-paper questions

a) What is the probability that you get the correct answer if you answer at random before using the bonus?

**YOUR ANSWER HERE**

b) You ask the quiz host to use the bonus and eliminate one incorrect answer from options $a$ and $b$. You then choose whichever of $a$ or $b$ remains after the bonus. What is the probability of selecting the correct answer in this scenario? Show all your calculations.

**YOUR ANSWER HERE**

### 1.2. Computational simulation

Perform a computational simulation of the scenario described in question 1.1 b) and estimate the desired probability by completing the code below.

In [None]:
import random

NUM_EPISODES = 10000

num_wins = 0
for _ in range(NUM_EPISODES):
    correct_answer = random.choice(["a", "b", "c", "d"])

    # YOUR CODE HERE #


prob = num_wins / NUM_EPISODES
print(f"Estimated probability: {prob:.3f}")
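
The estimate above follows the usual Monte Carlo pattern: repeat a random experiment many times and report the fraction of runs in which the event of interest occurs. A minimal sketch of that pattern on an unrelated event (two fair dice summing to 7, whose exact probability is $1/6$) is given below. The function name `estimate_probability` and the fixed seed are assumptions made for illustration and are not part of the assignment.

```python
import random


def estimate_probability(num_episodes, seed=0):
    """Estimate P(two fair dice sum to 7) by simulation (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so the run is reproducible
    num_hits = 0
    for _ in range(num_episodes):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == 7:
            num_hits += 1
    return num_hits / num_episodes


estimate = estimate_probability(100_000)
print(f"Estimated probability: {estimate:.3f} (exact: {1 / 6:.3f})")
```

The estimate fluctuates around the exact value, and the fluctuation shrinks as the number of episodes grows.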

## 2. Linear regression

Consider the model $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 \sin(2\pi x)$ and a dataset $\{(x_i, y_i)\}_{i=1}^n$.

a) The optimal parameters for the linear regression problem can be obtained by solving
$$ \min_{\boldsymbol{\theta}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|_2^2 $$
for suitably defined matrices/vectors $\boldsymbol{\theta}$, $\boldsymbol{X}$, and $\boldsymbol{y}$. Provide explicit definitions for these matrices/vectors.

**YOUR ANSWER HERE**

b) Find the solution to this problem by completing the code below.

In [None]:
import numpy as np

arr = np.loadtxt("dataset_train.csv", delimiter=",", dtype=float)
inputs, targets = arr[:, 0], arr[:, 1]
theta = np.zeros(4)  # just to prevent the code from throwing an error before it's fully implemented

# YOUR CODE HERE


print("theta =", theta)
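
As a general illustration of the least-squares machinery, and not of the specific model in this exercise, the sketch below fits a hypothetical model $g(x) = w_0 + w_1 x + w_2 \cos(x)$ to synthetic, noise-free data. Each column of the design matrix holds one basis function evaluated at every input, and `np.linalg.lstsq` returns the minimizer of the squared residual norm. The model, the data, and all variable names here are assumptions chosen for the example.

```python
import numpy as np

# Synthetic data generated from known weights (hypothetical example).
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 3.0, size=50)
w_true = np.array([1.0, -2.0, 0.5])

# Design matrix: one column per basis function (1, x, cos(x)).
design = np.column_stack([np.ones_like(x_demo), x_demo, np.cos(x_demo)])
y_demo = design @ w_true

# Least-squares solution of min_w ||y_demo - design @ w||^2.
w_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
print("w_hat =", w_hat)
```

Because the synthetic targets contain no noise, the recovered weights should match `w_true` up to floating-point error.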

Let's plot the learned model together with the training data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# create a scatterplot for the data
sns.scatterplot(x=inputs, y=targets, color="blue", label='Data')

# compute the predictions of the model
x = np.linspace(0, 1, 1000)
y_pred = theta[0] + theta[1] * x + theta[2] * x**2 + theta[3] * np.sin(2*np.pi * x)

# create a lineplot for the predicted model
sns.lineplot(x=x, y=y_pred, color="orange", label='Prediction')

# label your axes and add a legend
plt.xlabel('x')
plt.ylabel('y')
plt.legend()

# show the plot
plt.show()

c) Consider the Ridge regression problem with regularization weight $\lambda$ in which the parameter $\theta_0$ is not regularized. Derive the closed-form solution to this problem analytically.

**YOUR ANSWER HERE**

d) Find the solution to the Ridge regression problem as defined in the previous exercise by completing the code below. The variable `reg` in the code corresponds to the regularization weight $\lambda$.

In [None]:
reg = 10.0  # regularization weight (DO NOT CHANGE)
arr = np.loadtxt("dataset_train.csv", delimiter=",", dtype=float)
inputs, targets = arr[:, 0], arr[:, 1]
theta_ridge = np.zeros(4)  # just to prevent the code from throwing an error before it's fully implemented

# YOUR CODE HERE


print("theta_ridge =", theta_ridge)
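
For comparison, textbook Ridge regression penalizes every coefficient, which gives the closed form $(\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$. The sketch below implements that standard variant on synthetic data. It is not the variant requested in this exercise, where $\theta_0$ is left unregularized. The helper `ridge_all_coefficients` and the toy data are assumptions made for illustration.

```python
import numpy as np


def ridge_all_coefficients(design, targets, lam):
    """Standard Ridge solution in which every coefficient is penalized."""
    num_features = design.shape[1]
    gram = design.T @ design + lam * np.eye(num_features)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram, design.T @ targets)


# Toy data (hypothetical): y = 3x + noise, with features (1, x).
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1.0, 1.0, size=30)
y_demo = 3.0 * x_demo + 0.1 * rng.standard_normal(30)
design = np.column_stack([np.ones_like(x_demo), x_demo])

w_ols = ridge_all_coefficients(design, y_demo, lam=0.0)
w_ridge = ridge_all_coefficients(design, y_demo, lam=10.0)
print("no regularization:", w_ols)
print("lambda = 10:      ", w_ridge)
```

With `lam=0.0` the function reduces to ordinary least squares, and a larger `lam` shrinks the coefficient vector toward zero.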

Let's plot the two models together with the training data.

In [None]:
# create a scatterplot for the data
sns.scatterplot(x=inputs, y=targets, color="blue", label='Data')

# compute the predictions of the two models
x = np.linspace(0, 1, 1000)
y_pred = theta[0] + theta[1] * x + theta[2] * x**2 + theta[3] * np.sin(2*np.pi * x)
y_pred_ridge = theta_ridge[0] + theta_ridge[1] * x + theta_ridge[2] * x**2 + theta_ridge[3] * np.sin(2*np.pi * x)

# create a lineplot for each of the predicted models
sns.lineplot(x=x, y=y_pred, color="orange", label='Prediction (no reg.)')
sns.lineplot(x=x, y=y_pred_ridge, color="green", label='Prediction (Ridge)')

# label your axes and add a legend
plt.xlabel('x')
plt.ylabel('y')
plt.legend()

# show the plot
plt.show()

e) Compute the mean-squared errors (MSE) of the two models on the training set and on a separate test set. Avoid for-loops; use NumPy vectorized operations instead.

In [None]:
# load the training data
arr = np.loadtxt("dataset_train.csv", delimiter=",", dtype=float)
inputs, targets = arr[:, 0], arr[:, 1]

# load the test data
arr = np.loadtxt("dataset_test.csv", delimiter=",", dtype=float)
inputs_test, targets_test = arr[:, 0], arr[:, 1]

# just to prevent the code from throwing an error before it's fully implemented
mse_train, mse_train_ridge = 0, 0
mse_test, mse_test_ridge = 0, 0

# YOUR CODE HERE


print("MSE in the training data:")
print(f"  Linear regression (no reg.): {mse_train:.5f}")
print(f"  Ridge regression (lambda = {reg}): {mse_train_ridge:.5f}\n")

print("MSE in the test data:")
print(f"  Linear regression (no reg.): {mse_test:.5f}")
print(f"  Ridge regression (lambda = {reg}): {mse_test_ridge:.5f}\n")
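
The mean-squared error of predictions $\hat{y}_i$ against targets $y_i$ is $\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2$, which NumPy can evaluate without an explicit loop. A minimal sketch on made-up arrays follows; the helper name `mean_squared_error` and the numbers are illustrative assumptions.

```python
import numpy as np


def mean_squared_error(predictions, targets):
    """Vectorized MSE: average of the element-wise squared differences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((predictions - targets) ** 2)


# Squared errors are 0, 0, and 4, so the mean is 4/3.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```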

f) Based on the errors obtained in the previous question, which of the two models would you prefer? Explain your answer.

**YOUR ANSWER HERE**