# MLT | OPPE | November 2 | IIC

## Instructions

- The duration of the exam is **2 hours**.
- There are **16 questions** in this OPPE out of 50 marks.
- All of them are NAT (numerical answer type) questions.
- Read each question in Colab, write the solution code in Colab, and then enter the answer in the portal.
- For most questions, the data is given in a cell called the **DATA CELL**. You have to run the data cell first before running the solution cell. Do not edit the data cell under any circumstances.
- After completing the exam you will have to do two things:
  - Click the submit button on the portal. If you do not submit on the portal, you will get zero marks.
  - Upload the colab as a `.ipynb` file using the form we have given. If you do not upload the file, you will get zero marks.

- Make sure that you run all the cells before the current cell you are working with. Then run the current cell. This can be done using Ctrl + F8. Just running the current cell repeatedly might cause a problem. Ctrl + F8 runs all the cells starting from the first one in sequence until the current cell. If this doesn't work for you, click on Runtime in the toolbar and click Run before.

- Note that some questions have random numbers generated with specific seed values. So, it is important that you run the cells in the sequence in which they are presented. For such questions, you will find the following message at the end of the cell:

```
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL
```

## Notation

- The data-matrix in all problems will be of shape $d \times n$, where $d$ is the number of features and $n$ is the number of data-points.
- If $\mathbf{x} = (1, 2, 3)$ is a vector, $1, 2$ and $3$ are termed the components of $\mathbf{x}$. The sum of the components of $\mathbf{x}$ is $6$.
- The norm of a vector $\mathbf{x}$ is the Euclidean norm ($L_2$) by default. This is the only norm used in this exam.
- All vectors will be represented as one-dimensional NumPy arrays. All matrices will be represented as two-dimensional NumPy arrays.

In [None]:
# RUN THIS CELL WITHOUT FAIL
# ONLY THEN PROCEED TO THE QUESTIONS
import numpy as np
import matplotlib.pyplot as plt

## Question-1 [3 marks]

Consider the curves

$y = x^{3} - 2x^{2} + x$ and $y = 0.4x^{2} - 0.1$.

Find the number of points at which these two curves intersect in the interval
$-1 \le x \le 1$.

**Answer: 2**


In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL
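
One possible way to count the intersections, given here as a minimal sketch: equating the two expressions gives the cubic $x^{3} - 2.4x^{2} + x + 0.1 = 0$, and its real roots inside $[-1, 1]$ can be counted with `np.roots`. The variable names below are illustrative and not part of the original notebook.

```python
import numpy as np

# x^3 - 2x^2 + x = 0.4x^2 - 0.1  is equivalent to  x^3 - 2.4x^2 + x + 0.1 = 0
coeffs = [1.0, -2.4, 1.0, 0.1]

roots = np.roots(coeffs)
# Keep only the (numerically) real roots
real_roots = roots[np.isclose(roots.imag, 0.0)].real
# Keep the roots that lie in the closed interval [-1, 1]
in_interval = real_roots[(real_roots >= -1) & (real_roots <= 1)]

count = len(in_interval)
print(count)
```

Plotting both curves with `matplotlib` over the same interval is a quick visual check on the count.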



## Question-2 [3 marks]

Matrix `M` is of shape `(n, n)`. Find the dot product of the $230^{th}$ row of $M^T$ and the $158^{th}$ column of $M^T$. Your answer should be an integer. Here, $M^T$ is the transpose of the matrix $M$.

<hr>

**NOTE**: When referring to rows and columns, we count from one and not zero. For example, consider the matrix $M$:

$$
M = \begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6\\
7 & 8 & 9
\end{bmatrix}
$$

The first row of $M$ is $[1, 2, 3]$. The second column of $M$ is $[2, 5, 8]$.

<hr>

The variable `M` is defined in the cell given below.

**Answer: 1159**


In [None]:
# DATA CELL
# DO NOT EDIT THIS CELL
rng = np.random.default_rng(seed = 1001)
n = rng.integers(100, 300)
M = rng.integers(0, 5, (n, n))

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL
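
A hedged sketch of the indexing involved: because the question counts from one while NumPy counts from zero, the $230^{th}$ row of $M^T$ is `M.T[229, :]` and the $158^{th}$ column of $M^T$ is `M.T[:, 157]`. The first three lines of code repeat the DATA CELL so that the block runs on its own; the names `row_230`, `col_158` and `answer` are illustrative.

```python
import numpy as np

# Copied from the DATA CELL so that this sketch is self-contained
rng = np.random.default_rng(seed=1001)
n = rng.integers(100, 300)
M = rng.integers(0, 5, (n, n))

MT = M.T
row_230 = MT[229, :]   # 230th row of M^T (one-based) -> index 229
col_158 = MT[:, 157]   # 158th column of M^T (one-based) -> index 157

answer = int(row_230 @ col_158)
print(answer)
```

Equivalently, the $230^{th}$ row of $M^T$ is the $230^{th}$ column of $M$, and the $158^{th}$ column of $M^T$ is the $158^{th}$ row of $M$.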



## Question-3 [4 marks]
$\mathbf{X}$ is a data-matrix. Normalize $\mathbf{X}$ and call the normalized matrix $\mathbf{Z}$. Each entry of $\mathbf{Z}$ is
$$Z_{ij} = \frac{X_{ij} - \mu_j}{\sigma_j},$$
where $\mu_j = \frac{1}{d} \sum_{i=1}^d X_{ij}$ and $\sigma_j = \sqrt{\frac{1}{d} \sum_{i=1}^d \bigl(X_{ij} - \mu_j\bigr)^2}$ are the mean and the population standard deviation of the $j$-th column.

Compute the product of all entries of the normalized matrix $\mathbf{Z}$. Enter the answer correct to three decimal places.

**Answer: 0.259, range: [0.255, 0.264]**

In [None]:
# DATA CELL
# DO NOT EDIT THIS
rng = np.random.default_rng(seed = 101)
d = rng.integers(3, 9)
X = rng.integers(5, 10, (d, 2))
d, n = X.shape

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL
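
A minimal sketch of the column-wise normalization defined above, assuming the population standard deviation corresponds to NumPy's default `ddof=0`. The data lines are copied from the DATA CELL; `mu`, `sigma`, `Z` and `product` are illustrative names.

```python
import numpy as np

# Copied from the DATA CELL so that this sketch is self-contained
rng = np.random.default_rng(seed=101)
d = rng.integers(3, 9)
X = rng.integers(5, 10, (d, 2))

mu = X.mean(axis=0)      # mean of each column
sigma = X.std(axis=0)    # population standard deviation (ddof=0) of each column
Z = (X - mu) / sigma     # broadcasting applies the column statistics row by row

product = np.prod(Z)     # product of all entries of Z
print(round(float(product), 3))
```

After normalization every column of `Z` has mean zero and standard deviation one, which is a useful sanity check before computing the product.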


## Common data for questions (4) to (6)

Consider a data-matrix $\mathbf{X}$. Mean center it and perform the standard PCA.

In [None]:
# DATA CELL
# DO NOT EDIT THIS
import numpy as np

rng = np.random.default_rng(seed = 1001)
n = rng.integers(100, 300)          # Random dimension between 100 and 300
X = rng.integers(0, 5, (n, n))
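
One way to approach questions (4) to (6) is sketched below. It assumes that $\mathbf{X}$ follows the $d \times n$ convention from the Notation section, that the covariance matrix is $\mathbf{C} = \frac{1}{n} \mathbf{X}_c \mathbf{X}_c^T$ for the mean-centered data $\mathbf{X}_c$, and that the variance along a PC is the corresponding eigenvalue of $\mathbf{C}$. The sign of an eigenvector returned by `np.linalg.eigh` is arbitrary, so the sign of the sum of the components of $\mathbf{w}_1$ may need to be flipped. All names other than `X` are illustrative.

```python
import numpy as np

# Copied from the DATA CELL so that this sketch is self-contained
rng = np.random.default_rng(seed=1001)
n = rng.integers(100, 300)
X = rng.integers(0, 5, (n, n))

d, n = X.shape                                # features x data-points
Xc = X - X.mean(axis=1, keepdims=True)        # center each feature (row)
C = Xc @ Xc.T / n                             # d x d covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                            # first PC (unit norm, sign arbitrary)
q4 = 100 * w1.sum()                           # question 4, up to sign
q5 = abs(eigvals[0] - eigvals[1])             # question 5
q6 = eigvals[2:].sum()                        # question 6

print(q4, q5, q6)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.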

## Question-4 [4 marks]

The first PC is $\mathbf{w}_1$. Let the sum of the components of $\mathbf{w}_1$ be $a$. Find the value of $100a$ and enter the nearest integer as your answer.

**Answer: 125**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL


## Question-5 [3 marks]

Find the absolute difference between the variances along the first and second PCs. Enter the nearest integer as your answer.

**Answer: 0**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Question-6 [2 marks]

Find the sum of the variances along all the PCs other than the first and second. Enter the nearest integer as your answer.

**Answer: 544**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL


## Common Data for questions (7) to (10)

Run the following cell to get the training and test dataset. The following variables are used in the cell:

`X_train` = Training dataset

`y_train` = label vector corresponding to training dataset

`X_test` = Test dataset

`y_test` = label vector corresponding to test dataset

**Note: The dimensions of `X_train` and `X_test` datasets are $n \times d$ instead of $d \times n$. So modify the equations accordingly.**

In [None]:
# DATA CELL
# DO NOT EDIT THIS

import numpy as np

rng = np.random.default_rng(seed=1001)
n = rng.integers(100, 300)
d = 5

X = rng.integers(0, 5, size=(n, d))
w_true = rng.normal(loc=0, scale=1, size=d)
noise = rng.normal(0, 0.5, size=n)
y = X @ w_true + noise
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X_train.shape)
print(y_train.shape)

(224, 5)
(224,)
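
A hedged sketch of the computations behind questions (7) to (10), assuming a model without a bias term so that there is one weight per feature. With the $n \times d$ layout noted above, the normal equations read $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$. The data lines are copied from the DATA CELL; the helper `rmse` and the result names are illustrative, and no output values are claimed here.

```python
import numpy as np

# Copied from the DATA CELL so that this sketch is self-contained
rng = np.random.default_rng(seed=1001)
n = rng.integers(100, 300)
d = 5
X = rng.integers(0, 5, size=(n, d))
w_true = rng.normal(loc=0, scale=1, size=d)
noise = rng.normal(0, 0.5, size=n)
y = X @ w_true + noise
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Normal equations for n x d data: (X^T X) w = X^T y
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

def rmse(X, y, w):
    """Root mean square error of the predictions X @ w against y."""
    return np.sqrt(np.mean((y - X @ w) ** 2))

num_weights = w.shape[0]                  # question 7
w_norm = np.linalg.norm(w)                # question 8
rmse_train = rmse(X_train, y_train, w)    # question 9
rmse_test = rmse(X_test, y_test, w)       # question 10

print(num_weights, w_norm, rmse_train, rmse_test)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse; `np.linalg.pinv(X_train) @ y_train` is an alternative when $\mathbf{X}^T \mathbf{X}$ might be singular.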


## Question-7 [4 marks]

If we learn a linear regression model on the training dataset, how many weights need to be learned by the model?

**Answer: 5**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL


## Question-8 [4 marks]

If $\mathbf{w}$ is the weight vector learnt using least squares linear regression (normal equation method), what is the Euclidean norm of $\mathbf{w}$? Enter your answer correct to two decimal places.

**Answer: 1.98, range: 1.95 to 2.01**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL


## Question-9 [4 marks]

Find the root mean square error on the training dataset using the model defined in question 7.

$$
\text{RMSE} = \sqrt{\dfrac{1}{n}\sum\limits_{i=1}^{n} (y_i - \widehat{y}_i)^2}
$$

Enter your answer correct to three decimal places.

**Answer: 3.403, range: 3.39 to 3.41**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL



## Question-10 [3 marks]
Find the root mean square error on the test dataset using the model defined in question $7$. Enter your answer correct to two decimal places.

**Answer: 1.51, range: 1.48 to 1.54**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL



## Common Data for questions (11) to (13)

Consider a dataset $\mathbf{X}$ for a clustering problem. Start with the initial means as $\boldsymbol{\mu}_1 = (2, 3, 5)$ and $\boldsymbol{\mu}_2 = (-3, -5, -7)$ and run K-means with $k = 2$.

**Note**: Set `mu_1` and `mu_2` as `np.float64` arrays. You can set `dtype = np.float64` while creating the array. Use `np.array?` if you are still unsure about this. This is important for your final answers to match the ones we have configured.

In [None]:
# DATA CELL
# DO NOT EDIT THIS
X = np.array([
    [5.1, -6.2, 4.5, -7.5, 2.2, -3.8, 6.5],
    [6.3, -5.5, 3.8, -6.8, 2.5, -4.0, 7.1],
    [4.7, -7.1, 5.0, -8.0, 1.8, -3.5, 6.8]
])
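
The K-means run described above can be sketched as follows, assuming the standard Lloyd's iterations (assign each point to its nearest mean, recompute the means, stop when the assignments no longer change) and treating each column of `X` as a data-point. The helper `assign` and the names `mu`, `z`, `q11`, `q12` and `q13` are illustrative.

```python
import numpy as np

# Copied from the DATA CELL: shape (d, n) = (3, 7), one data-point per column
X = np.array([
    [5.1, -6.2, 4.5, -7.5, 2.2, -3.8, 6.5],
    [6.3, -5.5, 3.8, -6.8, 2.5, -4.0, 7.1],
    [4.7, -7.1, 5.0, -8.0, 1.8, -3.5, 6.8]
])

# Initial means, one per row, as float64
mu = np.array([[2, 3, 5], [-3, -5, -7]], dtype=np.float64)

def assign(points, means):
    """Return the index of the nearest mean for each column of `points`."""
    diffs = points.T[:, None, :] - means[None, :, :]   # shape (m, k, d)
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

z = assign(X, mu)
while True:
    # Recompute each mean from the points currently assigned to it
    mu = np.array([X[:, z == k].mean(axis=1) for k in range(2)])
    z_new = assign(X, mu)
    if np.array_equal(z_new, z):    # converged: assignments are unchanged
        break
    z = z_new

q11 = int(round(float(np.linalg.norm(mu[0]))))   # norm of final mu_1
q12 = int(round(float(np.linalg.norm(mu[1]))))   # norm of final mu_2
q13 = int(assign(np.array([[0.0], [1.0], [2.0]]), mu)[0]) + 1   # cluster of (0, 1, 2)

print(q11, q12, q13)   # 8 10 1
```

The sketch does not handle empty clusters, which do not occur for this data and these initial means.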


## Question-11 [3 marks]

Find the norm of the final mean $\boldsymbol{\mu}_1$. Enter the nearest integer as your answer.

**Answer: 8**


In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Question-12 [3 marks]

Find the norm of the final mean $\boldsymbol{\mu}_2$. Enter the nearest integer as your answer.

**Answer: 10**


In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Question-13 [3 marks]

Find the cluster to which the data-point $(0, 1, 2)$ belongs. Enter $1$ if it is closer to $\boldsymbol{\mu}_1$ than $\boldsymbol{\mu}_2$ and $2$ otherwise.

**Answer: 1**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Common data for questions (14) to (16)

Consider the following dataset for a binary classification problem. The data-matrix $\mathbf{X}$ and the label vector $\mathbf{y}$ are given below.

In [None]:
# DATA CELL
# DO NOT EDIT THIS
X = np.array([
    [1, 2, 2, 3, -2, -1, -4, -3],
    [0, 1, -2, 0, 0, -2, 2, -1]
])
y = np.array(
    [-1, -1, -1, -1, 1, 1, 1, 1]
)
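
One way to approach questions (14) to (16) is sketched below. It assumes $\mathbf{Y} = \text{diag}(\mathbf{y})$ and $\mathbf{w}^{*} = \mathbf{XY}\boldsymbol{\alpha}^{*}$, and it solves the dual with projected gradient ascent, which is a solver chosen for this sketch rather than a method prescribed by the exam. Two conventions for counting support vectors are shown: points with $\alpha_i^{*} > 0$, and points that lie on the supporting hyperplanes, that is, $y_i \mathbf{w}^{*T} \mathbf{x}_i = 1$. For this data the first count is 1 and the second is 2, so the listed answer to question (16) agrees with the second convention. All names other than `X` and `y` are illustrative.

```python
import numpy as np

# Copied from the DATA CELL: X has shape (d, n), y has shape (n,)
X = np.array([
    [1, 2, 2, 3, -2, -1, -4, -3],
    [0, 1, -2, 0, 0, -2, 2, -1]
], dtype=np.float64)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=np.float64)

Y = np.diag(y)                        # assumption: Y = diag(y)
Q = Y.T @ X.T @ X @ Y                 # matrix of the quadratic term in the dual

# Projected gradient ascent on the dual objective (assumed solver choice)
eta = 1.0 / np.linalg.eigvalsh(Q).max()    # step size 1 / (largest eigenvalue)
alpha = np.zeros(X.shape[1])
for _ in range(20000):
    grad = 1.0 - Q @ alpha                       # gradient of the objective
    alpha = np.maximum(0.0, alpha + eta * grad)  # step, then project onto alpha >= 0

w = X @ Y @ alpha                     # optimal weight vector
margins = y * (X.T @ w)               # y_i * w^T x_i for every data-point

q14 = int(round(float(1 / alpha.sum())))
q15 = int(round(float(1 / w.sum())))
n_positive_alpha = int((alpha > 1e-6).sum())
n_on_margin = int(np.isclose(margins, 1.0, atol=1e-6).sum())

print(q14, q15, n_positive_alpha, n_on_margin)   # 1 -1 1 2
```

The optimum here is $\mathbf{w}^{*} = (-1, 0)$, which can be confirmed by hand: the first data-point forces $w_1 \leqslant -1$, and $(-1, 0)$ satisfies every other constraint.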

## Question-14 [3 marks]
In the context of hard-margin SVM, find the optimal $\boldsymbol{\alpha}^{*}$ by solving the following dual optimization problem.

$$
\underset{\boldsymbol{\alpha} \geqslant \mathbf{0}}{\max}\ \ \  \  \boldsymbol{\alpha}^{T}\mathbf{1} -\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{Y}^{T}\mathbf{X}^{T}\mathbf{XY} \boldsymbol{\alpha}
$$

If the sum of the components of $\boldsymbol{\alpha}^{*}$ is $s$, find $\cfrac{1}{s}$ and enter the nearest integer to $\cfrac{1}{s}$ as the answer.

**Answer: 1**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Question-15 [3 marks]

Find the optimal weight vector $\mathbf{w}^{*}$. If the sum of the components of $\mathbf{w}^{*}$ is $s$, find $\cfrac{1}{s}$ and enter the nearest integer to $\cfrac{1}{s}$ as the answer.

**Answer: -1**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL

## Question-16 [3 marks]

Find the total number of support vectors.

**Answer: 2**

In [None]:
# SOLUTION
# RUN THE DATA CELL BEFORE RUNNING THE SOLUTION CELL