# **K-Means Clustering in R**

## **1. K-Means Clustering in R of Mall customers**

### **Importing the dataset**

In [None]:
ds = read.csv('/content/Mall_Customers.csv')
head(ds)

In [None]:
# keep only columns 4 and 5 (Annual Income and Spending Score)
ds = ds[4:5]

### **Using the elbow method to find the optimal number of clusters**

In [None]:
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(ds, i)$withinss)
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')

### **Fitting K-Means to the dataset**

In [None]:
set.seed(29)
KM = kmeans(x = ds, centers = 5)
y_KM = KM$cluster

### **Visualizing the clusters**

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = FALSE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 1,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = TRUE,
         color = FALSE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 1,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 0,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

In [None]:
library(cluster)
clusplot(ds,
         y_KM,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = FALSE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')

## **2. K-Means Clustering in R of College data**

### **Importing the dataset**

In [None]:
college = read.csv('/content/College_Data')
head(college)

In [None]:
summary(college)

In [None]:
#find number of rows with missing values
sum(!complete.cases(college))

In [None]:
# encode Private as a factor (No -> 0, Yes -> 1)
college$Private = factor(college$Private,
                         levels = c('No', 'Yes'),
                         labels = c(0, 1))

In [None]:
head(college)

In [None]:
ds = college[3:19]

In [None]:
class(ds)

In [None]:
#remove rows with missing values
ds <- na.omit(ds)

In [None]:
#scale each variable to have a mean of 0 and sd of 1
ds_scaled = scale(ds)
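
As a quick sanity check on the standardization, the sketch below applies scale() to a small synthetic data frame and confirms that each column ends up with mean 0 and standard deviation 1. The column names and values are made up for illustration and are not taken from the college data. Note that scale() returns a matrix rather than a data frame.

```r
# Synthetic stand-in for two numeric columns (hypothetical names and values)
set.seed(1)
toy <- data.frame(apps = rnorm(50, mean = 3000, sd = 1500),
                  grad_rate = rnorm(50, mean = 65, sd = 17))

toy_scaled <- scale(toy)  # center each column, then divide by its sd

# After scaling, column means should be ~0 and standard deviations ~1
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(round(col_sds, 10))
```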

### **Elbow method to find the optimum number of clusters**
Look for an “elbow” where the within-cluster sum of squares (WCSS) stops dropping sharply and begins to level off. The number of clusters at that bend is usually a reasonable choice.

In [None]:
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(ds_scaled, i)$withinss)
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')
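
The loop above runs kmeans() from a single random start for each k, so the curve can depend on the seed. A minimal variation, sketched here on synthetic data (three made-up Gaussian blobs, not the college data), uses nstart to keep the best of several starts and reads the total directly from tot.withinss, which equals sum(withinss).

```r
set.seed(6)
# Three synthetic 2-D blobs (illustrative data only)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# WCSS for k = 1..6, keeping the best of 10 random starts per k
wcss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# tot.withinss is the sum of the per-cluster within sums of squares
fit <- kmeans(toy, centers = 3, nstart = 10)
same_total <- isTRUE(all.equal(fit$tot.withinss, sum(fit$withinss)))

plot(1:6, wcss, type = 'b',
     xlab = 'Number of clusters', ylab = 'WCSS')
```

With data like this, the curve should flatten after k = 3, which is the bend the elbow method looks for.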

### **Number of Clusters vs. the Total Within Sum of Squares**
Use the fviz_nbclust() function to plot the number of clusters against the total within-cluster sum of squares.

In [None]:
library(factoextra)
library(cluster)
fviz_nbclust(ds_scaled, kmeans, method = "wss")

It appears that there is a bit of an elbow or “bend” at k = 7 clusters.

In [None]:
fviz_nbclust(
  ds_scaled,
  FUNcluster = kmeans,
  method = "wss",
  diss = NULL,
  k.max = 10,
  nboot = 100,
  verbose = interactive(),
  barfill = "steelblue",
  barcolor = "steelblue",
  linecolor = "steelblue",
  print.summary = TRUE
)

### **Perform K-Means Clustering with Optimal K: Fitting K-Means to the dataset**

In [None]:
set.seed(29)
KM = kmeans(x = ds_scaled, centers = 7, nstart = 25)
y_KM = KM$cluster
y_KM

In [None]:
KM

From the results we can see that k-means produced 7 clusters of sizes 25, 150, 78, 103, 226, 167, and 28:
1. 25 colleges were assigned to the first cluster
2. 150 colleges were assigned to the second cluster
3. 78 colleges were assigned to the third cluster
4. 103 colleges were assigned to the fourth cluster
5. 226 colleges were assigned to the fifth cluster
6. 167 colleges were assigned to the sixth cluster
7. 28 colleges were assigned to the seventh cluster
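
The sizes listed above are reported in the size component of the fitted object, and they can be cross-checked by tabulating the assignment vector. A small sketch on synthetic data (not the college data), using only base R:

```r
set.seed(29)
# Two synthetic 2-D groups of 30 and 20 points (illustrative only)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 25)

print(fit$size)                  # observations per cluster
sizes_tab <- table(fit$cluster)  # same counts, derived from the assignments
print(sizes_tab)
```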

### **Visualization of clusters**
Visualize the clusters on a scatterplot whose axes are the first two principal components, using the fviz_cluster() function.

In [None]:
#plot results of final k-means model
fviz_cluster(KM, data = ds)
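
For intuition about what this kind of plot shows, the following sketch reproduces the idea by hand in base R: it runs prcomp() on standardized data and plots the first two principal-component scores, colored by cluster. It uses synthetic data and approximates the idea only; it is not a description of how fviz_cluster() works internally.

```r
set.seed(29)
# Synthetic 4-variable data with two groups of 30 rows (illustrative only)
toy <- rbind(matrix(rnorm(120, mean = 0), ncol = 4),
             matrix(rnorm(120, mean = 4), ncol = 4))
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Project onto principal components; the data is already centered and scaled
pca <- prcomp(toy_scaled)
scores <- pca$x[, 1:2]

plot(scores, col = fit$cluster, pch = 19,
     xlab = 'PC1', ylab = 'PC2',
     main = 'Clusters on the first two PCs')
```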

### **Mean of the variables in each cluster**
Use the aggregate() function to find the mean of the variables in each cluster

In [None]:
#find means of each cluster
aggregate(ds, by=list(cluster=KM$cluster), mean)

### **Append cluster assignments to the original dataset**
Append the cluster assignment of each college back to the original dataset.

In [None]:
final_data <- cbind(college, cluster = KM$cluster)
head(final_data,10)

In [None]:
tail(final_data,10)