**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles

Run the cell given below to create and plot two different two-dimensional datasets:

- The first dataset has four blobs that come from different Gaussian distributions. Assume that each blob is a cluster.
- The second dataset has two concentric circles centered at the origin. Assume that each circle is a cluster.

**Note:** In this in-class assignment and the next, you will work on these toy datasets to understand both the scikit-learn tools and the Expectation-Maximization (EM) algorithm better. In Homework Assignment 3, you will apply clustering to real-life datasets.

In [None]:
n_samples = 1000
n_components = 4

X_blob, true_labels_blob = make_blobs(n_samples=n_samples, centers=4, cluster_std=0.50, random_state=0)
X_circ, true_labels_circ = make_circles(n_samples=n_samples, noise=0.1, random_state=0, factor=0.4)

plt.scatter(X_blob[:,0],X_blob[:,1])
plt.grid()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

plt.scatter(X_circ[:,0],X_circ[:,1])
plt.grid()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

## 1)

Create and train two [K-Means](https://scikit-learn.org/1.5/modules/generated/sklearn.cluster.KMeans.html) models, one for each dataset. For both models, use `random_state=1` for reproducible initialization. Note that you need to pick an appropriate `n_clusters` for each dataset. Leave all other parameters at their default values. **(10 points)**
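
As a rough illustration of the scikit-learn workflow (not the assignment answer), the sketch below fits a K-Means model to a small made-up array. The names `X_demo` and `kmeans_demo` and the choice `n_clusters=2` are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data (not the assignment datasets): two well-separated groups.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters is chosen by the user; random_state fixes the initialization.
kmeans_demo = KMeans(n_clusters=2, random_state=1)
kmeans_demo.fit(X_demo)

# One row per cluster, one column per variable.
print(kmeans_demo.cluster_centers_.shape)

# Cluster labels assigned to the training observations.
print(kmeans_demo.labels_)
```

The integer values of the labels are arbitrary: K-Means only guarantees that observations in the same cluster share a label, not which number each cluster receives.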

## 2)

Using the trained models, obtain the predicted cluster labels. Plot both datasets again, only this time, color code the observations with the **predicted** labels.

**Note:** If you are not familiar with data visualization in Python, the lines in the cell given above should be helpful. Keep in mind that the `plt.scatter` function has a `c` parameter for color-coding.

**(10 points)**
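
The color-coding mentioned in the note can be sketched as follows. The arrays `points_demo` and `labels_demo` are made-up placeholders; in the assignment, the labels would come from the trained models.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical points and labels, for illustration only.
points_demo = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2]])
labels_demo = np.array([0, 0, 1, 1])

# Passing an integer array to `c` maps each label to a color.
scatter_demo = plt.scatter(points_demo[:, 0], points_demo[:, 1], c=labels_demo)
plt.grid()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
```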

## 3)

Does the K-Means model properly find the clusters in both datasets? Does it fail in any dataset? Why or why not? **Your explanation should include the (only) assumption behind a K-Means model for credit.** **(10 points)**

## 4)

In this question, you will write the two functions that implement the EM algorithm. You will bring them together in the next in-class assignment.

### a)

Define a function called `M_step`. It should take two inputs: (1) `X`, the variable matrix and (2) `cluster_preds`, the vector of predicted cluster labels.

For each cluster label, the function should calculate the centroid by averaging each variable over the observations assigned to that label.

The function should return the calculated centroid values for each cluster label, **as a two-dimensional numpy array**. The size of the array should be: **(number of cluster labels, number of variables).**


**You are not allowed to use loops/comprehensions.** The implementation should be vectorized. Consider the following steps:
- Concatenate X and the labels and convert the output to a DataFrame.
- [Group by](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) the last column.
- Return the group averages, converted back to a numpy array.

**(30 points)**
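
The three suggested steps can be illustrated on a small made-up example. This is only a sketch of the pandas calls involved, not a finished `M_step`; the variable names and column names (`x1`, `x2`, `label`) are assumptions chosen for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs (different from the assignment's test case).
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
labels_demo = np.array([0, 1, 0, 1])

# Step 1: append the labels as the last column and wrap in a DataFrame.
df_demo = pd.DataFrame(np.column_stack([X_demo, labels_demo]),
                       columns=["x1", "x2", "label"])

# Step 2: group by the label column.
grouped_demo = df_demo.groupby("label")

# Step 3: per-group column means, converted back to a NumPy array.
centroids_demo = grouped_demo.mean().to_numpy()
print(centroids_demo)
# [[3. 4.]
#  [5. 6.]]
```

The grouping column becomes the index of the result, so the returned array contains only the variable averages, with one row per label in sorted label order.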

In [None]:
toy_X = np.arange(1,37).reshape(9,4)
toy_labels = np.array([0,1,2,0,1,2,0,1,2])

M_step(toy_X, toy_labels)

# should return:
# array([[13., 14., 15., 16.],
#        [17., 18., 19., 20.],
#        [21., 22., 23., 24.]])

### b)

Define a function called `E_step`. It should take two inputs: (1) `X`, the variable matrix and (2) `centroids`, the matrix of predicted centroids.

For each observation, the function should calculate the cluster label by finding the closest centroid. The function should return the predicted cluster labels for each observation, **as a one-dimensional numpy array**. The size of the array should be: **(number of observations, )**.

**You are not allowed to use loops/comprehensions.** The implementation should be vectorized. Consider the following steps:
- Convert X to a DataFrame.
- To each row, [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) a function that finds the Euclidean distance to each row of `centroids` and returns the row index with the smallest distance. (You can implement this with a lambda function or define another function.)
- Convert the output to a numpy array.

**(40 points)**
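
The row-wise `apply` pattern from the steps above can be sketched on made-up data as follows. The helper `nearest_centroid` and the `_demo` arrays are hypothetical names for this illustration; this is one possible way to arrange the calls, not a finished `E_step`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs (different from the assignment's test case).
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
centroids_demo = np.array([[0.5, 0.5], [9.5, 9.5]])

def nearest_centroid(row):
    # `row` is one observation as a pandas Series.
    # Broadcasting subtracts it from every centroid row at once.
    distances = np.linalg.norm(centroids_demo - row.to_numpy(), axis=1)
    # Index of the centroid with the smallest Euclidean distance.
    return np.argmin(distances)

# axis=1 passes each row (observation) to the function.
labels_demo = pd.DataFrame(X_demo).apply(nearest_centroid, axis=1).to_numpy()
print(labels_demo)  # [0 0 1 1]
```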

In [None]:
toy_X = np.arange(1,37).reshape(9,4)
toy_centroids = np.array([[13., 14., 15., 16.],
                           [17., 18., 19., 20.],
                           [21., 22., 23., 24.]])

E_step(toy_X, toy_centroids)

# should return:
# array([0, 0, 0, 0, 1, 2, 2, 2, 2])