k-means

K-means is an algorithm for cluster analysis (clustering). It is the process of partitioning a set of data into related groups / clusters. K-means clustering is useful for Data Mining and Business Intelligence.

Here is k-means in plain English: https://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/#2_k-means

What does it do

k-means will assign a number on each of your data records, indicating which groups it belongs to. You only need to provide the number of clusters/groups you want. After the calculations complete, each record will be assigned to one cluster.

The records in the same cluster are similar to each other. Records in different clusters are dissimilar.

Each row of your Excel data, should be a record/observation with one or more features. Each column is a feature in the observation.

As an example, here is a data set with the height and weight of 25,000 children in Hong Kong : http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html

Currently the script works only with numerical data.

How does it work

Enter your data in a new Excel worksheet.
Enter the name of the worksheet in cell C4, and the range of the data at C5
Enter the worksheet for the results to be placed at C6 (you can use the one where your data is)
Enter the cell were the result will be updated at C7
Number of groups in your data at C8
Click the button to start.

If you do not know the number of clusters/groups contained in your data, try different values for example 1 up to 10. Execute the script several times and observe the GAP figure. When GAP reaches its maximum value, you have selected the appropriate number of clusters. In the example IRIS data set the GAP will maximise when we select 3 clusters.

The original paper that describes the GAP calculation: https://web.stanford.edu/~hastie/Papers/gap.pdf

The results

The result is a number assigned on each record, that indicates the group/cluster the record belongs to.

The Result sheet contains information on the clusters, along with the cluster centers.

Performance

When the "Distance" value is minimized, it indicates the resulting accuracy is higher.

Execute the algorithm several times to find the best results.

The script will stop execution when the clusters are normalized or when the maximum iterations are reached (whichever comes first). You can increase the number of iterations for better results.

More info

This is a k-means clustering implementation in VBA. The script is based in the work of bquanttrading, https://quantmacro.wordpress.com/

It is implementing David Arthur and Sergei Vassilvitski k-means++ algorithm, which chooses the initial centroids. https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf

The example dataset in kmeans.xlsx is IRIS from UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
README.md		README.md
kmeans.vba		kmeans.vba
kmeans.xlsm		kmeans.xlsm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

k-means

What does it do

How does it work

The results

Performance

More info

About

Releases

Packages

Languages

SATOSHI-G/kmeans-excel

Folders and files

Latest commit

History

Repository files navigation

k-means

What does it do

How does it work

The results

Performance

More info

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages