# Week 4 Discussion Section Notebook

This notebook helps you prepare for A2. We'll cover the key concepts and techniques you need to solve the problems in A2, using a mix of multiple-choice questions and coding exercises to reinforce the material.

In [None]:
# According to https://github.com/jmshea/jupyterquiz/issues/20
!python -m pip install -q jupyterquiz==2.7.0a1
from jupyterquiz import display_quiz

## Multiple Choice Questions: E and M steps for GMMs

Below are some multiple-choice questions to test your understanding of key concepts.


In [None]:
display_quiz("data/E_Step.json")

In [None]:
display_quiz("data/M_Step.json")

## Simple Coding Exercise: K-Means Clustering

In this exercise, you'll implement a basic version of the K-Means clustering algorithm. This will help you understand the mechanics of centroid initialization, assignment of points to clusters, and updating centroids.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

First, let's generate some synthetic data using `make_blobs`, which will create clusters of data points. We'll use this data to apply our K-Means implementation.

Complete the line below to use the `make_blobs` function to create 4 clusters (`centers=4`) of 300 points in total (`n_samples=300`), where the standard deviation within a cluster is 0.60 (`cluster_std=0.60`). Also, this function has a random state parameter that should be set to 0 (`random_state=0`).

In [None]:
X, _ = ... # complete this line using make_blobs (replace the ...)
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

We can print `X.shape` (the result should be `(300, 2)`) to confirm that there are 300 points, each with an x-coordinate and a y-coordinate.

In [None]:
X.shape

The first step in K-Means is to initialize the centroids. We'll randomly select K data points to serve as the initial centroids. In the function below, we create a variable called `indices` to hold the indices of the points we select. Here `k = 4`, so 4 indices are chosen, and each index is an integer in the range 0-299.

Complete the function below using `np.random.choice` to select `k` random indices. Make sure that the same index is **not** chosen twice (hint: there is a parameter called `replace` to help with that).
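
To see what sampling without replacement looks like on its own, here is a minimal sketch on toy numbers (the sizes 10 and 3 and the name `picked` are made up for illustration and are not part of A2). It only shows that `replace=False` yields distinct values.

```python
import numpy as np

# Toy example: pick 3 distinct positions out of 10 (values are illustrative).
np.random.seed(0)  # seeded only so the sketch is repeatable
picked = np.random.choice(10, size=3, replace=False)

print(picked)                          # three integers in the range 0-9
print(len(set(picked.tolist())) == 3)  # True: no index repeats
```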

In [None]:
def initialize_centroids(data, k):
    # Randomly select k data points as initial centroids
    
    indices = ... # complete this line using np.random.choice
    centroids = data[indices]
    return centroids

centroids = initialize_centroids(X, 4)

Once the centroids are initialized, each data point is assigned to the closest centroid. This forms the initial clusters.
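
The sketch below illustrates the broadcasting idea on a tiny hand-made example. The arrays `points` and `centers` are hypothetical toy inputs, and this is one possible way to lay out the computation, not necessarily the approach expected in A2.

```python
import numpy as np

# Hypothetical toy inputs: 3 points and 2 centers in 2D.
points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 1.0]])

# (3, 1, 2) - (1, 2, 2) broadcasts to (3, 2, 2): one difference vector per pair.
diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]

# Euclidean norm over the coordinate axis gives a (3, 2) distance table.
dists = np.linalg.norm(diffs, axis=2)

# For each point (row), the column index of the smallest distance.
nearest = dists.argmin(axis=1)
print(dists.shape)  # (3, 2)
print(nearest)      # [0 0 1]
```

Each row of `dists` corresponds to a point and each column to a center, which is why `argmin` is taken along `axis=1`.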

In [None]:
def assign_clusters(data, centroids):
    # Calculate the distances between each data point and the centroids
    # Hint: Use numpy broadcasting for efficient computation
    distances = ...  # Complete this line to compute the distances
    
    # Assign each data point to the closest centroid
    # Hint: Use argmin to find the index of the minimum distance
    clusters = ...  # Complete this line to assign clusters based on minimum distance
    
    return clusters

clusters = assign_clusters(X, centroids)

After assigning the data points to clusters, the next step is to update the centroids based on the mean of the points in each cluster.
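
As a small illustration of the "mean of the points in a cluster" step, the sketch below computes the mean for a single cluster label using a boolean mask. The arrays `points` and `labels` are hypothetical toy inputs chosen so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical toy inputs: 4 points with cluster labels 0 or 1.
points = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 14.0]])
labels = np.array([0, 0, 1, 1])

# Boolean mask selects the rows that belong to cluster 1.
members = points[labels == 1]

# Averaging over axis 0 gives one mean per coordinate.
center_1 = members.mean(axis=0)
print(center_1)  # [11. 12.]
```

Repeating this for every label and stacking the results gives one row per centroid. Note that if a cluster ends up with no points, the mean of an empty selection is `nan`, so that case may need separate handling.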

In [None]:
def update_centroids(data, clusters, k):
    # Complete this function to update the centroids of each cluster
    # Hint: take the mean of all the points in each cluster.
    # The resulting array should have a shape (4,2) since there are 4 centroids 
    pass

centroids = update_centroids(X, clusters, 4)

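These two steps, assigning clusters and updating centroids, are repeated until the centroids no longer change significantly, which indicates that the algorithm has converged. For simplicity, the function below runs a fixed number of iterations (`iters`) rather than checking for convergence.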
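
One possible way to make "no longer change significantly" concrete is to measure how far each centroid moved between two consecutive iterations. The helper `has_converged` and its tolerance `tol` below are hypothetical names and values chosen for illustration; they are not part of A2.

```python
import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-4):
    """Hypothetical helper: True if no centroid moved farther than tol."""
    # Distance each centroid moved between two consecutive iterations.
    shifts = np.linalg.norm(new_centroids - old_centroids, axis=1)
    return bool(np.all(shifts <= tol))

old = np.array([[0.0, 0.0], [5.0, 5.0]])
print(has_converged(old, old + 1e-6))  # True: movement is below tol
print(has_converged(old, old + 0.5))   # False: centroids still moving
```

A check like this could be used to leave the loop early, with the iteration count kept as an upper bound.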

In [None]:
def k_means_clustering(data, k, iters=10):
    centroids = initialize_centroids(data, k)
    for _ in range(iters):
        clusters = assign_clusters(data, centroids)
        centroids = update_centroids(data, clusters, k)
    
    return clusters, centroids

clusters, centroids = k_means_clustering(X, 4)

Lastly, we can visualize the resulting clusters from the K-Means algorithm.

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=clusters, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', alpha=0.5)
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

### Interpreting the results

Based on the plot above, what do you think about the results of this clustering algorithm? Does it work well, or are there any ways to improve it? In addition, think about the number of iterations used. What happens when you use a larger number of iterations, such as `iters = 100`?

_Type in your response here_

## Understanding Doc Tests

Doc tests are a convenient way to embed tests for your code within docstrings. You write example calls to a function together with their expected output, and Python's `doctest` module runs those examples and verifies that the actual output matches.

There are several doc tests in A2, and the image below shows an example.

![doc test image](data/doc_test.png)

What do you think this doc test is intended for? Answer the quiz question below.

In [None]:
display_quiz("data/Doc_Test.json")

Let's see an example below where we include a simple function with doc tests.

In [None]:
def add(a, b):
    '''
    Returns the sum of a and b
    
    >>> add(2, 3)
    5
    >>> add(-1, 1)
    0
    '''
    return a + b

import doctest
doctest.testmod(verbose=True)

Notice that when we run the doc tests with `doctest.testmod`, the examples in the docstring, `add(2, 3)` and `add(-1, 1)`, are executed and their outputs are compared against the expected values. For instance, we expect that adding 2 and 3 gives 5.

## Now it's your turn to make a doc test!

In this exercise, you'll implement doc tests for a basic Euclidean distance function, which is fundamental for K-Means clustering. The Euclidean distance between two points in a Euclidean plane is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
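
Written out for two points $(x_1, y_1)$ and $(x_2, y_2)$, the Pythagorean theorem gives the distance $d$ as:

$$
d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$

For example, for the points $(0, 0)$ and $(3, 4)$:

$$
d = \sqrt{(3 - 0)^2 + (4 - 0)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
$$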

### Task

1. Complete the `euclidean_distance` function by calculating the distance between two points `point1` and `point2`. Each point is represented as a tuple of coordinates (x, y).
2. Write at least two doc tests within the function's docstring:
   - A test where both points are the same, expecting a distance of `0.0`.
   - A test with points `(0, 0)` and `(3, 4)`, expecting a distance of `5.0` based on the 3-4-5 triangle.
   
Feel free to add more doc tests! Remember that you need to use the `>>>` notation (as shown in the previous example for the `add` function).
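
One detail worth keeping in mind when writing expected values: `doctest` compares the printed output as text, so a float result must be written as a float (`2.0`, not `2`). The sketch below uses a made-up function `half` to show this; running only one function's examples through `DocTestFinder` and `DocTestRunner` is just one way to do it for illustration.

```python
import doctest

def half(x):
    """
    Return x divided by 2 (hypothetical helper for illustration).

    Division with / always returns a float, so the expected
    output must be written as 2.0; writing 2 would fail.

    >>> half(4)
    2.0
    >>> half(3)
    1.5
    """
    return x / 2

# Collect and run only the examples found in half's docstring.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(half, "half", globs={"half": half}):
    runner.run(test)

results = runner.summarize(verbose=False)
print(results.failed, results.attempted)  # 0 2
```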

In [None]:
import numpy as np

def euclidean_distance(point1, point2):
    """
    Calculate the Euclidean distance between two points in 2D space.
    
    Args:
    - point1: A tuple (x1, y1) representing the first point.
    - point2: A tuple (x2, y2) representing the second point.
    
    Returns:
    - The Euclidean distance between the two points.
    
    Examples:
    ADD IN YOUR DOC TESTS HERE...
    
    """
    # Your code here
    pass

Once you finish writing the function and the doc tests, complete the cell below with one line of code to run your doc tests.

In [None]:
### Insert code here to run the doc tests