In [None]:
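#Question 1
library(tidyverse) # the helper functions below use %>%, mutate(), group_by(), summarize(), and tibble()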

#a What are the steps of kmeans?
#Step 1: Assign each point to one of the N clusters at random
#Step 2: Calculate the mean position of each cluster using the previous assignments
#Step 3: Loop through the points - assign each point to the cluster whose center is closest
#Step 4: Repeat steps 2 and 3 until the centers stop moving around (or a maximum number of iterations is reached)

#b Create the builder function for step 1.
label_randomly <- function(n_points, n_clusters){
  # Build labels 1..n_clusters in (nearly) equal numbers, then shuffle them
  sample(((1:n_points) %% n_clusters)+1, n_points, replace=FALSE)
}
#c Create the builder function for step 2.
get_cluster_means <- function(data, labels){
  data %>%
    mutate(label__ = labels) %>%
    group_by(label__) %>%
    summarize(across(everything(), mean), .groups = "drop") %>%
    arrange(label__)
}

#d Create the builder function for step 3.
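assign_cluster_fast <- function(data, means){
  data_matrix <- as.matrix(data)
  means_matrix <- as.matrix(means %>% dplyr::select(-label__))
  # Pair every data row with every cluster center: dii indexes data rows, mii indexes centers
  dii <- rep(1:nrow(data), each = nrow(means))
  mii <- rep(1:nrow(means), nrow(data))
  # drop = FALSE keeps these as matrices even when the data has a single column
  data_repped <- data_matrix[dii, , drop = FALSE]
  means_repped <- means_matrix[mii, , drop = FALSE]
  # Squared Euclidean distance for every (point, center) pair in one vectorized step
  diff_squared <- (data_repped - means_repped)^2
  all_distances <- rowSums(diff_squared)
  # For each point, keep the center with the smallest distance
  tibble(dii=dii, mii=mii, distance=all_distances) %>%
    group_by(dii) %>%
    arrange(distance) %>%
    filter(row_number()==1) %>%
    ungroup() %>%
    arrange(dii) %>%
    pull(mii)
}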

#e Create the builder function for step 4.
kmeans_done <- function(old_means, new_means, eps=1e-6){
  om <- as.matrix(old_means)
  nm <- as.matrix(new_means)
  # Mean Euclidean distance that the cluster centers moved in this iteration
  m <- mean(sqrt(rowSums((om - nm)^2)))
  m < eps
}

#f Combine them all into your own kmeans function
mykmeans <- function(data, n_clusters, eps=1e-6, max_it = 1000, verbose = FALSE){
  labels <- label_randomly(nrow(data), n_clusters)
  old_means <- get_cluster_means(data, labels)
  done <- FALSE
  it <- 0
  while(!done && it < max_it){
    labels <- assign_cluster_fast(data, old_means)
    new_means <- get_cluster_means(data, labels)
    if(kmeans_done(old_means, new_means, eps)){
      done <- TRUE
    } else {
      old_means <- new_means
      it <- it + 1
      if(verbose){
        cat(sprintf("%d\n", it))
      }
    }
  }
  list(labels=labels, means=new_means)
}

In [None]:
#Question 2

#a Read in the voltages_df.csv
library(tidyverse)
volt <- read_csv("voltages_df.csv") 

#b Call your kmeans function with 3 clusters. Print the results with results$labels and results$means
results <- mykmeans(volt, 3)
print(results$labels)
print(results$means)

#c Call R's kmeans function with 3 clusters. Print the results with results$labels and results$cluster
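r_results <- kmeans(as.matrix(volt), 3)
print(results$labels)      # labels from mykmeans, for side-by-side comparison
print(r_results$cluster)   # labels from R's kmeans
print(r_results$centers)   # cluster means from R's kmeans, needed for part d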
    
#d Are your labels/clusters the same? If not, why? Are your means the same?
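#Our label numbers are not the same, but that alone does not indicate a difference in the data: cluster numbers are arbitrary and depend on the random initialization, so the same group of points can be called cluster 1 by one function and cluster 3 by the other.
#Our means are slightly different, primarily due to differences between the functions: R's kmeans uses the Hartigan-Wong algorithm by default and starts from randomly chosen data points as centers, while mykmeans starts from random labels and stops once the centers move less than eps.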
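
#A minimal sketch of how two labelings can be compared when only the cluster numbers differ.
#labels_a and labels_b are made-up vectors for illustration, not output from voltages_df.csv.
labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c(3, 3, 1, 1, 2, 2)
# Cross-tabulate the two labelings: each row and column has exactly one
# non-zero cell when the groupings match up to renumbering
agreement <- table(labels_a, labels_b)
print(agreement)
same_partition <- all(rowSums(agreement > 0) == 1) && all(colSums(agreement > 0) == 1)
print(same_partition)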
                    


In [None]:
#Question 3

#a Explain the process of using a loop to assign clusters for kmeans
#A for loop goes through the data points one at a time. For each point, it calculates the distance to every cluster center and assigns the point to the nearest one.

#b Explain the process of vectorizing the code to assign clusters for kmeans
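#Vectorization replaces the explicit loop with operations on whole vectors or matrices, so a single operation is applied to many data points at once. In assign_cluster_fast, every point is paired with every cluster center, all squared distances are computed in one matrix operation, and the nearest center is then picked for each point.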

#c State which (for loops or vectorizing) is more efficient and why
#Vectorizing is more efficient because it does not use loops written in R; the looping happens inside compiled code, which avoids the overhead of interpreting every iteration.
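
#A minimal sketch of the loop-based approach from part a, for comparison with the vectorized version.
#assign_cluster_loop and the toy inputs are hypothetical names for illustration; the sketch assumes
#that `means` is a plain numeric matrix with one row per cluster center.
assign_cluster_loop <- function(data, means){
  data_matrix <- as.matrix(data)
  means_matrix <- as.matrix(means)
  labels <- integer(nrow(data_matrix))
  for(i in seq_len(nrow(data_matrix))){
    best_distance <- Inf
    for(j in seq_len(nrow(means_matrix))){
      # Squared Euclidean distance from point i to center j
      d <- sum((data_matrix[i, ] - means_matrix[j, ])^2)
      if(d < best_distance){
        best_distance <- d
        labels[i] <- j
      }
    }
  }
  labels
}
# Usage on four toy points and two toy centers
toy_points <- rbind(c(0, 0), c(0.2, 0.1), c(5, 5), c(5.1, 4.9))
toy_centers <- rbind(c(0, 0), c(5, 5))
toy_labels <- assign_cluster_loop(toy_points, toy_centers)
print(toy_labels)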

In [None]:
#Question 4

#When does kmeans fail? What assumption does kmeans use that causes it to fail in this situation?
#Kmeans can fail when the clusters are non-spherical, unequal in size, or contain outliers. Kmeans assumes that clusters are spherical, roughly equal in size, and free of outliers, because each point is simply assigned to the nearest center by Euclidean distance.
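
#A minimal sketch of one failure case, assuming simulated data: two concentric rings.
#The rings are not spherical blobs around separate centers, so kmeans tends to cut across both rings.
#All names here (rings, true_ring, ring_fit, ring_table) are made up for illustration.
set.seed(1)
n <- 200
angle <- runif(2 * n, 0, 2 * pi)
radius <- c(rep(1, n), rep(5, n)) + rnorm(2 * n, sd = 0.1)
rings <- cbind(x = radius * cos(angle), y = radius * sin(angle))
true_ring <- rep(1:2, each = n)   # 1 = inner ring, 2 = outer ring
ring_fit <- kmeans(rings, centers = 2, nstart = 10)
# Rows are the true rings, columns are the kmeans clusters
ring_table <- table(true_ring, ring_fit$cluster)
print(ring_table)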

In [None]:
#Question 5

# What assumptions do Gaussian mixture models make?
#They assume that the data is drawn from N Gaussian distributions whose individual parameters (mean, covariance, and mixing weight) are estimated from the data.
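
#A minimal sketch of that assumption in one dimension, using base R only and simulated data.
#Two Gaussians generate the data, and a hand-written EM loop estimates their parameters.
#The names (x, mu, sigma, weight, resp) and the starting values are assumptions for illustration.
set.seed(42)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 6, sd = 1))
mu <- c(min(x), max(x))        # crude starting means
sigma <- c(sd(x), sd(x))       # crude starting standard deviations
weight <- c(0.5, 0.5)          # starting mixing weights
for(it in 1:100){
  # E-step: how responsible is each Gaussian for each point?
  dens <- cbind(weight[1] * dnorm(x, mu[1], sigma[1]),
                weight[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)
  # M-step: re-estimate weights, means, and standard deviations
  n_k <- colSums(resp)
  weight <- n_k / length(x)
  mu <- colSums(resp * x) / n_k
  sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / n_k)
}
print(mu)
print(sigma)
print(weight)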



In [None]:
#Question 6

#What assumption does spectral clustering make? Why does this help us? 
#Spectral clustering assumes that the clusters are defined by their connectivity rather than their compactness. This helps us because it is not bound by the assumption that the clusters are spherical, so it can separate shapes such as rings or crescents that kmeans cannot.
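
#A minimal sketch of two-way spectral clustering on two concentric rings, using base R only.
#It builds a Gaussian affinity matrix, forms the unnormalized graph Laplacian, and splits the
#points by the sign of the eigenvector with the second-smallest eigenvalue (the Fiedler vector).
#The data, the kernel width sigma, and all names are assumptions chosen for illustration.
n <- 100
angle <- seq(0, 2 * pi, length.out = n + 1)[-1]
ring_points <- rbind(cbind(1 * cos(angle), 1 * sin(angle)),   # inner ring, radius 1
                     cbind(3 * cos(angle), 3 * sin(angle)))   # outer ring, radius 3
true_ring <- rep(1:2, each = n)
sigma <- 0.5
# Nearby points get an affinity close to 1, distant points close to 0
affinity <- exp(-as.matrix(dist(ring_points))^2 / (2 * sigma^2))
diag(affinity) <- 0
laplacian <- diag(rowSums(affinity)) - affinity
eig <- eigen(laplacian, symmetric = TRUE)
# eigen() sorts eigenvalues in decreasing order, so the second-smallest is second to last
fiedler <- eig$vectors[, 2 * n - 1]
spectral_labels <- ifelse(fiedler > 0, 1, 2)
print(table(true_ring, spectral_labels))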

In [None]:
#Question 7

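#Define the gap statistic method. What do we use it for?
#The gap statistic method determines the optimal number of clusters in a dataset. For each candidate number of clusters k, it compares the log of the within-cluster dispersion of the data with its expected value under a reference distribution that has no cluster structure (for example, uniformly scattered points). The gap is the difference between the two, and we prefer a k at which the gap is large.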
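
#A minimal sketch of the gap statistic in base R, assuming simulated data with three blobs.
#log_wk, uniform_reference, and gap_for_k are hypothetical helper names, and the reference
#distribution is assumed to be uniform over the bounding box of the data.
set.seed(7)
blob_centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
blobs <- do.call(rbind, lapply(1:3, function(i){
  cbind(rnorm(50, blob_centers[i, 1]), rnorm(50, blob_centers[i, 2]))
}))
# Log of the total within-cluster sum of squares for k clusters
log_wk <- function(x, k) log(kmeans(x, centers = k, nstart = 10)$tot.withinss)
# Reference data with no cluster structure: uniform within each column's range
uniform_reference <- function(x){
  apply(x, 2, function(column) runif(length(column), min(column), max(column)))
}
# Gap(k) = mean reference log(W_k) minus the observed log(W_k), over B reference sets
gap_for_k <- function(x, k, B = 20){
  reference_logs <- replicate(B, log_wk(uniform_reference(x), k))
  mean(reference_logs) - log_wk(x, k)
}
gaps <- sapply(1:5, function(k) gap_for_k(blobs, k))
print(round(gaps, 2))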