
To perform a cluster analysis in R, the data should generally be prepared as follows:

1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall
that standardization consists of transforming the variables so that they have
mean zero and standard deviation one. Read more about data standardization
in chapter 3.
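
As a quick illustration of step 3, the sketch below standardizes one column by hand and compares it with what *scale()* produces. The variable names (`murder_z`, `scaled`) are only illustrative, and the built-in USArrests data set is used so that the snippet runs on its own.

```r
# USArrests ships with base R (datasets package), so nothing needs to be downloaded.
data("USArrests")

# Standardize one column by hand: subtract the mean, divide by the standard deviation.
murder <- USArrests$Murder
murder_z <- (murder - mean(murder)) / sd(murder)

# scale() applies the same transformation to every column at once.
scaled <- scale(USArrests)

# The two results should agree up to floating-point error.
all.equal(murder_z, as.numeric(scaled[, "Murder"]))
```

Note that both *sd()* and *scale()* use the sample standard deviation (denominator n - 1), which is why the two computations match.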

In [1]:
data("USArrests")  # Load the data set
df <- USArrests    # Use df as a shorter name

In [4]:
head(df)

State,Murder,Assault,UrbanPop,Rape
Alabama,13.2,236,58,21.2
Alaska,10.0,263,48,44.5
Arizona,8.1,294,80,31.0
Arkansas,8.8,190,50,19.5
California,9.0,276,91,40.6
Colorado,7.9,204,78,38.7


In [5]:
# To remove any missing value that might be present in the data, type this:
df <- na.omit(df)
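
USArrests contains no missing values, so *na.omit()* leaves it unchanged. To see the two options from the preparation list (remove or estimate), the sketch below introduces one artificial NA into a copy of the data. The copy `df_demo` and the mean-imputation strategy are assumptions made for illustration only; other estimation methods may suit real data better.

```r
data("USArrests")

# Work on a copy and introduce one artificial gap, for illustration only.
df_demo <- USArrests
df_demo["Alabama", "Murder"] <- NA

# Count missing values per column before deciding how to handle them.
colSums(is.na(df_demo))

# Option 1: drop every row that contains at least one NA.
df_dropped <- na.omit(df_demo)
nrow(df_dropped)

# Option 2: estimate the missing value, here with the mean of the observed values.
df_imputed <- df_demo
murder_mean <- mean(df_demo$Murder, na.rm = TRUE)
df_imputed$Murder[is.na(df_imputed$Murder)] <- murder_mean
anyNA(df_imputed)
```

Dropping rows is the simplest choice but discards whole observations; estimating keeps every row at the cost of introducing values that were never measured.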

Because we don’t want the clustering algorithm to depend on arbitrary variable
units, we start by scaling (standardizing) the data with the R function *scale()*:

In [6]:
df <- scale(df)

In [7]:
head(df)

State,Murder,Assault,UrbanPop,Rape
Alabama,1.24256408,0.7828393,-0.5209066,-0.003416473
Alaska,0.50786248,1.1068225,-1.2117642,2.484202941
Arizona,0.07163341,1.4788032,0.9989801,1.042878388
Arkansas,0.23234938,0.230868,-1.0735927,-0.184916602
California,0.27826823,1.2628144,1.7589234,2.067820292
Colorado,0.02571456,0.3988593,0.8608085,1.864967207
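
A short sanity check can confirm that the scaling behaved as expected. The sketch below rebuilds the scaled data from USArrests so that it is self-contained; the names `centers`, `scales` and `original` are illustrative.

```r
data("USArrests")
df <- scale(na.omit(USArrests))

# scale() returns a numeric matrix, not a data frame.
is.matrix(df)

# Each column should now have mean ~0 and standard deviation 1.
round(colMeans(df), 10)
apply(df, 2, sd)

# The original column means and standard deviations are stored as attributes,
# which makes it possible to map scaled values back to the original units.
centers <- attr(df, "scaled:center")
scales <- attr(df, "scaled:scale")
original <- sweep(sweep(df, 2, scales, "*"), 2, centers, "+")
all.equal(as.numeric(original[, "Murder"]), USArrests$Murder)
```

Because the result is a matrix, functions that expect a data frame may need `as.data.frame(df)` first.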


In [8]:
install.packages(c("cluster", "factoextra"))

Installing packages into ‘/home/cloves/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘zip’, ‘modeltools’, ‘DEoptimR’, ‘SparseM’, ‘MatrixModels’, ‘sp’, ‘openxlsx’, ‘minqa’, ‘nloptr’, ‘mclust’, ‘flexmix’, ‘prabclus’, ‘diptest’, ‘mvtnorm’, ‘robustbase’, ‘kernlab’, ‘trimcluster’, ‘carData’, ‘pbkrtest’, ‘quantreg’, ‘maptools’, ‘rio’, ‘lme4’, ‘fpc’, ‘viridis’, ‘car’, ‘ellipse’, ‘flashClust’, ‘leaps’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘ggrepel’



The two packages installed above serve different purposes:

• *cluster* for computing clustering algorithms.

• *factoextra* for ggplot2-based elegant visualization of clustering results.
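
Since installation can take a while and pulls in many dependencies, one possible approach is to install only the packages that are not yet available. The sketch below is an assumption about workflow rather than a required step; `pkgs` and `missing_pkgs` are illustrative names.

```r
pkgs <- c("cluster", "factoextra")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

# Install only what is missing; this needs a configured CRAN mirror and network access.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Once installed, the packages can be attached in each session with `library(cluster)` and `library(factoextra)`.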