SameSizeClustering

A function for creating tight, same-sized clusters based on a 'distance trade' algorithm. It is intended for geographic applications, where the distance being minimized is computed from latitude and longitude, but it can be modified to use other distance dimensions.
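The distance calculation used by the function is not shown in this README. As a rough illustration of a latitude/longitude distance, here is a minimal sketch of the haversine (great-circle) formula. The name `haversine_km` and the Earth radius of 6371 km are assumptions made for this example, not part of the package.

```r
# Hypothetical helper (not from the repository): great-circle distance in km
# between points given as latitude/longitude in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Distance between the two distribution centers used in the example below
d_centers <- haversine_km(41.855970, -87.68684, 41.855970, -87.76684)
d_centers
```

The function is vectorized, so it can be applied to whole columns of coordinates at once. Any other distance function with the same shape could be swapped in.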

The base algorithm was developed by Wes Stevenson (http://statistical-research.com/page/3/). This version makes it more scalable and better optimized by first splitting the data into subgroups with kmeans. The initial split reduces the distance between points in the final clusters by greatly reducing the "worst case matching", in which the starting seeds are very far apart.
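The initial split can be pictured with base R's `kmeans()`. The snippet below is only a sketch of the idea: the toy points, the number of groups (`n_big`), and the `nstart` value are assumptions chosen for illustration, and the function itself may choose them differently.

```r
set.seed(42)  # kmeans() starts from random centers

# Toy latitude (y) / longitude (x) points; illustrative values only
pts <- data.frame(
  y = c(rnorm(50, 41.80, 0.02), rnorm(50, 41.95, 0.02)),
  x = c(rnorm(50, -87.65, 0.02), rnorm(50, -87.75, 0.02))
)

n_big <- 2  # hypothetical number of coarse groups
km <- kmeans(pts[, c("y", "x")], centers = n_big, nstart = 10)

# Label each point with its coarse group; same-size clustering would then
# run inside each group separately
pts$BigCluster <- km$cluster
table(pts$BigCluster)
```

Because each coarse group only contains nearby points, the seeds chosen inside a group cannot end up far apart.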

The initial split also makes it possible to parallelize the algorithm, although this is not currently supported.

The function returns a data frame with several new columns:

BigCluster -- the initial kmeans group the point was assigned to
subCluster -- the subcluster, within its initial kmeans group, that the point belongs to after optimization
clusterOriginal -- the subcluster, within its initial kmeans group, that the point was initially assigned to
distance.from.center -- the distance from the point to the center of its final, optimized cluster
distance.from.center_original -- the distance from the point to the center of its initial cluster
cluster_final -- the name of the final cluster, prefixed with a letter (for example a39) to preserve data integrity
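The example output further down suggests that cluster_final combines a letter for BigCluster with the subCluster number (1 and 39 give a39, 8 and 51 give h51). The following sketch reproduces that pattern. It is an inference from the output shown in this README, not the function's confirmed code.

```r
# Values taken from the example output in this README
BigCluster <- c(1, 8, 1)
subCluster <- c(39, 51, 26)

# letters is R's built-in vector "a".."z", so this covers at most 26 groups
cluster_final <- paste0(letters[BigCluster], subCluster)
cluster_final
# "a39" "h51" "a26"
```

The letter prefix keeps the label a character string, so two subclusters with the same number in different kmeans groups stay distinct.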

Example Usage

# Generate sample data: points drawn from two bivariate normal distributions
library(MASS)  # provides mvrnorm()

sample_data_distribution <- function(Mu1, Sigma1, Mu2, Sigma2, n1, n2) {
  # Columns are named y (latitude) and x (longitude)
  dist1 <- data.frame(mvrnorm(n = n1, Mu1, Sigma1))
  names(dist1) <- c("y", "x")
  dist2 <- data.frame(mvrnorm(n = n2, Mu2, Sigma2))
  names(dist2) <- c("y", "x")
  dist <- rbind(dist1, dist2)
  dist$UserID <- seq_len(nrow(dist))  # one ID per row
  dist
}

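# Generate sample data that roughly replicates the population distribution of Chicago
# Note: the second covariance matrix is not symmetric. mvrnorm() reads only the
# lower triangle, so the -0.08 entry has no effect on the samples.

sample_data_users_chicago <- sample_data_distribution(
  c(41.855970, -87.68684),
  matrix(c(0.01125, -0.005, -0.005, 0.002975), 2, 2),
  c(41.855970, -87.76684),
  matrix(c(0.003, 0, -0.08, 0.003), 2, 2),
  ceiling(0.7 * 2000),
  ceiling(0.3 * 2000)
)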

head(sample_data_users_chicago)
###        y         x UserID
### 41.75815 -87.64354      1
### 41.92560 -87.69849      2
### 41.73634 -87.61671      3


# Assumes SameSizeClustering() has already been loaded into the session
distance_clustered_data <- SameSizeClustering(sample_data_users_chicago)

head(distance_clustered_data)

###          y         x UserID BigCluster subCluster clusterOriginal distance.from.center
### 1 41.75815 -87.64354      1          1         39              26            0.2726409
### 2 41.92560 -87.69849      2          8         51              77            0.2513240
### 3 41.73634 -87.61671      3          1         26              36            0.1123567

###   distance.from.center_original cluster_final
### 1                     1.2549047           a39
### 2                     0.6990344           h51
### 3                     0.9942355           a26
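
Once the result is available, a quick check is to count the points in each final cluster and compare their sizes. The sketch below uses a small made-up stand-in for the returned data frame so that it runs on its own. Only the column names cluster_final and distance.from.center come from this README; the values and the variable names are hypothetical.

```r
# Stand-in for the data frame returned by SameSizeClustering().
# The values are made up; only the column names follow this README.
result <- data.frame(
  cluster_final = rep(c("a1", "a2", "b1"), each = 4),
  distance.from.center = c(0.21, 0.30, 0.12, 0.40,
                           0.52, 0.18, 0.33, 0.25,
                           0.10, 0.15, 0.22, 0.41)
)

# Number of points per final cluster
sizes <- table(result$cluster_final)
sizes

# Gap between the largest and smallest cluster (0 means exactly equal sizes)
size_gap <- diff(range(sizes))
size_gap

# Mean distance to the cluster center, per cluster
mean_dist <- aggregate(distance.from.center ~ cluster_final,
                       data = result, FUN = mean)
mean_dist
```

With real output, replace `result` with `distance_clustered_data`.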

terryneumann/SameSizeClustering
