# Homework 08
This homework is based on the clustering lectures. Check the lecture notes and TA notes - they should help!

## Question 1
This question will walk you through creating your own `kmeans` function.

#### a) What are the steps of `kmeans`?
**Hint**: There are 4 steps/builder functions that you'll need.

1. Assign each data point to a cluster at random
2. Calculate the mean position of each cluster using the random assignment
3. Loop through the data points and assign each one to the closest cluster center
4. Repeat steps 2-3 until the centers stop moving significantly

#### b) Create the builder function for step 1.

In [1]:
library(tidyverse)

# Step 1: give every row a random cluster label between 1 and n_clusters.
random_assign <- function(df, n_clusters) {
  df$cluster <- sample(seq_len(n_clusters), nrow(df), replace = TRUE)
  df
}

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### c) Create the builder function for step 2.

In [2]:
# Step 2: one row per cluster holding the mean of every column.
centers_df <- function(df) {
  df %>%
    group_by(cluster) %>%
    summarise(across(everything(), mean), .groups = "drop") %>%
    arrange(cluster)
}

#### d) Create the builder function for step 3.
*Hint*: There are two ways to do this part - one is significantly more efficient than the other. You can do either.  

In [3]:
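# Step 3: move each row to the cluster whose center is nearest (Euclidean).
# This is the slower, row-by-row version.
assign_nearest_cluster <- function(df, centers) {
  time_cols <- df %>% select(-cluster) %>% names()
  df %>%
    rowwise() %>%
    mutate(
      cluster = {
        # Pull the current row's values once, then compare against every center.
        point <- c_across(all_of(time_cols))
        distances <- apply(centers[time_cols], 1, function(center) {
          sqrt(sum((point - center)^2))
        })
        centers$cluster[which.min(distances)]
      }
    ) %>%
    ungroup()
}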

#### e) Create the builder function for step 4.

In [4]:
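# Step 4: alternate reassignment and mean updates for a fixed number of iterations.
adjusted_centers_df <- function(df_adjusted, iterations) {
  new_centers <- centers_df(df_adjusted)
  # seq_len() skips the loop when iterations is 0, whereas 1:0 would run it twice.
  for (i in seq_len(iterations)) {
    df_adjusted <- assign_nearest_cluster(df_adjusted, new_centers)
    new_centers <- centers_df(df_adjusted)
  }
  list(data = df_adjusted, centers = new_centers)
}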

#### f) Combine them all into your own `kmeans` function.

In [5]:
nk_means <- function(df, n_clusters, iterations) {
  df <- random_assign(df, n_clusters)
  result <- adjusted_centers_df(df, iterations)
  list(
    labels = result$data$cluster,
    means = result$centers,
    data = result$data
  )
}

## Question 2
This is where we'll test your `kmeans` function.

#### a) Read in the `voltages_df.csv` data set.

In [6]:
voltages <- read_csv('voltages_df.csv')

Rows: 900 Columns: 250
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (250): 0, 1.00401606425703, 2.00803212851406, 3.01204819277108, 4.016064...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


#### b) Call your `kmeans` function with 3 clusters. Print the results with `results$labels` and `results$means`. 

In [7]:
results <- nk_means(voltages, 3, 5)
results$means
results$labels

cluster,0,1.00401606425703,2.00803212851406,3.01204819277108,4.01606425702811,5.02008032128514,6.02409638554217,7.0281124497992,8.03212851405623,⋯,240.963855421687,241.967871485944,242.971887550201,243.975903614458,244.979919678715,245.983935742972,246.987951807229,247.991967871486,248.995983935743,250
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-1.031463,1.3093239,1.1616772,0.9787498,0.6481497,-1.16861,-1.1196122,-1.0590962,-0.9943176,⋯,0.3364266,0.8337474,0.7125412,-0.2659209,-1.0409179,-1.0587745,-1.01359887,-0.96467777,-0.9151047,-0.8610245
2,-1.031463,0.9381238,0.7619864,0.3631543,-1.1179412,-1.051145,-0.9766807,-0.8694758,-0.6892375,⋯,-0.7900387,-0.8070676,-0.8182598,-0.8207339,-0.8132928,-0.7969549,-0.77567272,-0.75689256,-0.7496483,-0.7570393
3,-1.031463,1.2439759,1.0924697,0.900444,0.3011754,-1.159714,-1.1098127,-1.0685484,-1.0338649,⋯,-0.9107472,-0.8732292,-0.8234477,-0.7607812,-0.6682618,-0.3380864,-0.04693168,0.02820486,-0.41135,-0.8115784


#### c) Call R's `kmeans` function with 3 clusters. Print the results with `results$centers` and `results$cluster`.
*Hint*: Use the `as.matrix()` function to make the `voltages_df` data frame a matrix before calling `kmeans()`.

In [8]:
results <- voltages %>% as.matrix() %>% kmeans(3)
results$centers
results$cluster

Unnamed: 0,0,1.00401606425703,2.00803212851406,3.01204819277108,4.01606425702811,5.02008032128514,6.02409638554217,7.0281124497992,8.03212851405623,9.03614457831325,⋯,240.963855421687,241.967871485944,242.971887550201,243.975903614458,244.979919678715,245.983935742972,246.987951807229,247.991967871486,248.995983935743,250
1,-1.031463,0.9381238,0.7619864,0.3631543,-1.1179412,-1.051145,-0.9766807,-0.8694758,-0.6892375,-0.5661321,⋯,-0.7900387,-0.8070676,-0.8182598,-0.8207339,-0.8132928,-0.7969549,-0.77567272,-0.75689256,-0.7496483,-0.7570393
2,-1.031463,1.2439759,1.0924697,0.900444,0.3011754,-1.159714,-1.1098127,-1.0685484,-1.0338649,-1.0022396,⋯,-0.9107472,-0.8732292,-0.8234477,-0.7607812,-0.6682618,-0.3380864,-0.04693168,0.02820486,-0.41135,-0.8115784
3,-1.031463,1.3093239,1.1616772,0.9787498,0.6481497,-1.16861,-1.1196122,-1.0590962,-0.9943176,-0.9237437,⋯,0.3364266,0.8337474,0.7125412,-0.2659209,-1.0409179,-1.0587745,-1.01359887,-0.96467777,-0.9151047,-0.8610245


#### d) Are your labels/clusters the same? If not, why? Are your means the same?

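The labels/clusters aren't the same because the cluster numbering is arbitrary: both functions start from a random initialization, so the same group of rows can end up with a different label number on each run. The means are the same, just attached to different labels: cluster 1 from `nk_means` matches cluster 3 from R's `kmeans`, cluster 2 matches cluster 1, and cluster 3 matches cluster 2.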
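One way to check the "same means, different labels" claim is to match each center from one run to its nearest center in the other. The sketch below is only an illustration: it uses the first three non-constant columns of the two printed tables above, and the names `mine`, `builtin`, and `match_idx` are made up for this example.

```r
# First three non-constant columns of each printed centers table (rows = clusters).
mine <- rbind(c(1.3093239, 1.1616772, 0.9787498),
              c(0.9381238, 0.7619864, 0.3631543),
              c(1.2439759, 1.0924697, 0.9004440))
builtin <- rbind(c(0.9381238, 0.7619864, 0.3631543),
                 c(1.2439759, 1.0924697, 0.9004440),
                 c(1.3093239, 1.1616772, 0.9787498))

# For each row of `mine`, find the row of `builtin` with the smallest squared distance.
match_idx <- apply(mine, 1, function(m) which.min(colSums((t(builtin) - m)^2)))
match_idx                            # 3 1 2

# After reordering, the two sets of centers agree exactly on these columns.
max(abs(mine - builtin[match_idx, ]))  # 0
```

With the full tables, the same `match_idx` could be used to translate one set of labels into the other (for example `match_idx[labels]`) before comparing them.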

## Question 3
#### a) Explain the process of using a for loop to assign clusters for kmeans.

For a set number of iterations, we loop through each data point, calculate its distance to every centroid, and assign the point to the closest one. Once every point has been assigned, we recalculate the positions of the centroids and repeat the whole pass again and again until we have our clusters. So, you end up needing nested loops: an outer loop over the iterations, a loop over the data points, and a pass over the centroids for each point.
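A minimal base R sketch of the loop-based assignment step is shown here. It assumes the data and centroids are plain numeric matrices with one row per observation; `assign_loop`, `points`, and `centers` are hypothetical names for this example, not part of the homework code.

```r
# Loop-based assignment: visit every point, then every centroid for that point.
assign_loop <- function(points, centers) {
  labels <- integer(nrow(points))
  for (i in seq_len(nrow(points))) {
    best_dist <- Inf
    for (k in seq_len(nrow(centers))) {
      d <- sum((points[i, ] - centers[k, ])^2)  # squared Euclidean distance
      if (d < best_dist) {
        best_dist <- d
        labels[i] <- k
      }
    }
  }
  labels
}

# Toy data: two points near (0, 0) and two near (5, 5).
points  <- rbind(c(0, 0), c(0.2, 0.1), c(5, 5), c(5.1, 4.9))
centers <- rbind(c(0, 0), c(5, 5))
assign_loop(points, centers)  # 1 1 2 2
```

Squared distances are enough here because taking the square root does not change which centroid is closest.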

#### b) Explain the process of vectorizing the code to assign clusters for kmeans.

We would have two matrices, one for the data points and another for the cluster centroids. By using a function like `apply()`, or plain matrix arithmetic, we can calculate the Euclidean distance from every point in the first matrix to every centroid in the second by working on whole row vectors instead of single elements. From the resulting distances we find the minimum for each point, and the index of that minimum is the cluster the point is assigned to.
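The sketch below shows one possible vectorized version in base R. Instead of `apply()`, it relies on the identity ‖x − c‖² = ‖x‖² − 2x·c + ‖c‖² so that all point-to-centroid distances come out of a single matrix product. The names `assign_vectorized`, `points`, and `centers` are hypothetical, and both inputs are assumed to be numeric matrices with one row per observation.

```r
# Vectorized assignment: build the full n x k squared-distance matrix at once.
assign_vectorized <- function(points, centers) {
  sq_dist <- outer(rowSums(points^2), rowSums(centers^2), "+") -
    2 * points %*% t(centers)
  # Column index of the smallest distance in each row (max of the negated matrix).
  max.col(-sq_dist, ties.method = "first")
}

# Toy data: two points near (0, 0) and two near (5, 5).
points  <- rbind(c(0, 0), c(0.2, 0.1), c(5, 5), c(5.1, 4.9))
centers <- rbind(c(0, 0), c(5, 5))
assign_vectorized(points, centers)  # 1 1 2 2
```

No explicit loop over points or centroids appears in the R code; the looping happens inside `%*%`, `outer()`, and `max.col()`.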

#### c) State which (for loops or vectorizing) is more efficient and why.

Vectorizing is more efficient. A for loop makes the R interpreter go through each individual point-to-centroid comparison for assignment, and that could mean going through millions of elements one at a time. Vectorization lets us express the same work as a few operations on whole vectors or matrices, so the element-by-element looping runs in R's compiled internals and far fewer operations have to be processed by the R interpreter.

## Question 4
#### When does `kmeans` fail? What assumption does `kmeans` use that causes it to fail in this situation?

Kmeans fails when our clusters are not well characterized by their centroids. It assumes that clusters are spherical Gaussians with similar standard deviations. When they aren't (for example elongated or ring-shaped clusters, or clusters with very different spreads), Kmeans begins to distort the cluster boundaries and we get weird pockets of clusters.
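A small simulation can make this failure concrete. The sketch below is an illustration on made-up data, not part of the homework: it builds two concentric rings and runs R's `kmeans` with two centers. Because both rings share the same centroid, the fitted clusters tend to cut straight across the rings instead of separating inner from outer, so agreement with the true ring labels stays close to chance.

```r
set.seed(1)
n <- 200  # points per ring (arbitrary choice for this example)

# Two noisy concentric rings with radii 1 and 5.
theta  <- runif(2 * n, 0, 2 * pi)
radius <- rep(c(1, 5), each = n) + rnorm(2 * n, sd = 0.1)
rings  <- cbind(x = radius * cos(theta), y = radius * sin(theta))
truth  <- rep(1:2, each = n)

fit <- kmeans(rings, centers = 2, nstart = 10)

# Label numbers are arbitrary, so score the better of the two possible matchings.
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster != truth))
agreement  # expected to sit near 0.5 rather than near 1
```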

## Question 5
#### What assumption do Gaussian mixture models make?

A Gaussian mixture model assumes that the data is drawn from a mixture of N Gaussian distributions, each with its own mean, spread, and mixing weight, whose parameters are estimated from the data.
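The generative assumption can be sketched in a few lines of base R. Everything here is made up for illustration (the one-dimensional setting, the three components, and the names `weights`, `means`, `sds`, and `mixture_density`); it only shows what "drawn from N Gaussians" means, not how the parameters are fitted.

```r
set.seed(42)

# Hypothetical 1-D mixture with three components.
weights <- c(0.5, 0.3, 0.2)  # mixing proportions, must sum to 1
means   <- c(-4, 0, 5)
sds     <- c(1, 0.5, 2)

# Sampling: first pick a component for each point, then draw from that Gaussian.
n <- 1000
component <- sample(seq_along(weights), n, replace = TRUE, prob = weights)
x <- rnorm(n, mean = means[component], sd = sds[component])

# Mixture density: weighted sum of the component densities.
mixture_density <- function(x) {
  sapply(x, function(xi) sum(weights * dnorm(xi, mean = means, sd = sds)))
}

mixture_density(c(-4, 0, 5))
```

Fitting a Gaussian mixture model runs this story in reverse: given only `x`, it estimates the weights, means, and standard deviations.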

## Question 6
#### What assumption does spectral clustering make? Why does this help us?

Two points are more likely to be in the same cluster if they are close to one another. This helps us because we only need a notion of closeness between points, which allows us to work with more types of data (we aren't locked behind the condition of a vector space).

## Question 7
#### Define the gap statistic method. What do we use it for?

We use the gap statistic method to find the optimal number of clusters. For each value of K, the method compares the clustering of the real data to a clustering of reference data randomized over the same domain as the original data. We then compute the dispersion of the two clusterings and look at the difference (the gap). Our optimal number of clusters is the K with the largest gap.
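As a rough illustration, the `cluster` package (a recommended package that normally ships with R) provides `clusGap()` for this. The sketch below runs it on made-up data with three well separated blobs; the data, `K.max = 6`, and `B = 50` reference sets are arbitrary choices for the example. It picks the K with the largest gap, as described above; `cluster::maxSE()` offers more conservative selection rules that also account for the simulation standard error.

```r
library(cluster)  # assumed to be installed; provides clusGap()

set.seed(123)

# Three well separated 2-D blobs (made-up data).
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.3), rnorm(50, mean = 4, sd = 0.3))
)

# Gap statistic for K = 1..6, using kmeans as the clustering function.
gap <- clusGap(blobs, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 10)

gap_values <- gap$Tab[, "gap"]    # one gap value per K
best_k <- which.max(gap_values)   # K with the largest gap
best_k
```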