# Clustering - Wine Quality Data

Clustering can lead to the discovery of new knowledge.

Clustering is a branch of ***Unsupervised Learning***.

**Theory for Clustering**
Basic requirements before we can say we have clusters:
* There must be some way to say that one observation is closer to observation A than to observation B.
* There must be some proximity measure or similarity measure between the data points of the dataset.
* Objects within one cluster should be as homogeneous as possible, and objects in different clusters should be as heterogeneous as possible.
* There must be a way to know that the clustering is good enough and no more clustering is required.
* Proximity measure.
* Goodness-of-fit function.
* Clustering must be effective, i.e. it should be complete and correct.
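
To make the idea of a proximity measure concrete, here is a minimal sketch using Euclidean distance from base R's `dist()`. The data frame `obs` and its values are made up for illustration and are not taken from the wine data.

```r
# Three hypothetical observations with two numeric features (made-up values)
obs <- data.frame(
  alcohol = c(9.4, 9.8, 12.5),
  pH      = c(3.51, 3.20, 3.10),
  row.names = c("X", "A", "B")
)

# Pairwise Euclidean distances: a smaller value means more similar
d <- as.matrix(dist(obs, method = "euclidean"))
print(round(d, 3))

# Is observation X closer to A than to B?
x_closer_to_a <- d["X", "A"] < d["X", "B"]
print(x_closer_to_a)
```

Euclidean distance is only one possible choice; it is used here because it is the measure that k-means relies on.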

**Requirements for a Good Clustering Algorithm**
* Scalable (works regardless of the size of the data).
* Should be able to deal with different types of data.
* Whatever the shape of the clusters, the algorithm should be able to handle the clustering.
* A good clustering solution will handle noise and outliers.
* Whatever order the data is fed into the algorithm, the clusters should always be the same.
* Interpretable.

All the above requirements describe an ideal scenario.
The algorithms we will be using are not able to fulfil all of them.

## Various techniques used

We are going to use the wine quality data provided to us, and will try to group the wines into clusters based on the similarities between them. 
- We will be making use of the K-means clustering technique. 
- To get good clustering results we will first do some data manipulation and data analysis. 
- We will also make use of various techniques to assess homogeneity and to find the number of clusters we should create to get the best results:
- Silhouette method.
- Hopkins statistic.
- Sum of squared errors (SSE).

and others.

In [None]:
#Loading Libraries
# Call libraries and read data
library(dplyr)          # For data manipulation, filtering etc
library(magrittr)       # For pipes
library(caret)          # For dummy variables, nearZeroVar(), findCorrelation()
library(ggplot2)        # For plotting
library(ggthemes)       # For a variety of plot-themes
library(gridExtra)      # Arranging ggplots in a grid
library(lattice)        # Required by caret
library(vegan)
library(NbClust)        # For determining the number of clusters
library(cluster)        # For silhouette(), clusplot(), pam()
library(factoextra)     # get_clust_tendency() assesses hopkins stat
library(clustertend)    # Another package for hopkins() function
library(data.table)     # For fread()
library(GGally)
library(ggcorrplot)     # For the correlogram
library(mclust)
library(fpc)

In [None]:
#Let's clear any previously stored variables and run garbage collection. This removes unwanted variables and frees up memory.
rm(list=ls());gc()

### Reading Data

To read the data from the CSV file we will be making use of **"read.csv"**.
We could also use **"fread"**, but this was causing issues with aes_string used in ggplot.
**Reason**: with fread the column names keep their spaces, whereas read.csv replaces spaces with '.'
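
The difference in column names can be reproduced without the wine file. The sketch below reads a tiny made-up CSV from a string; `csv_text` and its values are assumptions for illustration only.

```r
# A tiny in-memory CSV whose header contains spaces (made-up sample)
csv_text <- "fixed acidity,volatile acidity,quality\n7.4,0.70,5\n7.8,0.88,5"

# Default read.csv: check.names = TRUE turns spaces into '.'
df_default <- read.csv(text = csv_text)
print(names(df_default))

# check.names = FALSE keeps the header as it is in the file
df_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(df_raw))

# make.names() applies the same cleaning to any character vector
print(make.names(names(df_raw)))
```

So if you prefer a reader that keeps the spaces, passing the column names through `make.names()` afterwards should give the same syntactically valid names.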

In [None]:

#Reading the data set which will be used for clustering.
wineData <- read.csv("../input/winequality.csv")

## Understanding Data
We will run some queries and do some checks to see how good and complete the data is. 
Any data we use should be checked; this helps in determining what we need to do to make our analysis and prediction accurate.

In [None]:
options(scipen = 999)
set.seed(321)
#Let us start by looking at the structure of the data we are going to use.
str(wineData)

***Note:*** Let us note here the class types of all columns. We will clarify in subsequent steps why this is important.
Also you can see that we have '.' in the column names; that is because we have used "read.csv".

In [None]:
#Some more details related to data.
#How data looks
print("---Data preview---")
head(wineData)

#Number of rows to identify how big is the data we are dealing with
print("---Number of rows---")
nrow(wineData)

print("---Column names---")
#Names of all columns
names(wineData)

In [None]:
#Let us check if we have any NA values in our data.
#We should remove any NA or incomplete data.
#If FALSE there is no NA data in our data.frame
#If TRUE we will check each column for NA data.

print("----Checking for NA Data---")
anyNA(wineData)
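
If the check had returned TRUE, the next step would be to find out which columns are affected. A minimal sketch on a made-up data frame (`toy` is hypothetical, not the wine data):

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  pH      = c(3.5, NA, 3.1, 3.3),
  alcohol = c(9.4, 9.8, NA, NA),
  quality = c(5, 6, 5, 7)
)

# Any NA anywhere?
print(anyNA(toy))

# Number of NA values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows without any NA
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```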

In [None]:
#Checking values in Columns "Color", "Quality" and "Good".
#Might be helpful in subsequent iterations.

print("---Values in color column")
unique(wineData$color)

print("---Values in good column")
unique(wineData$good)

print("---Values in quality column")
unique(wineData$quality)

***Note-*** We can make use of above information to do some factor related computations. This can also be helpful in plotting graphs.

Based on the above values, let us use some graphs and see frequency distribution ( *histogram* ) of **Red** and **White** wines across different quality.

In [None]:
# Fill the bars by wine colour and draw one panel per colour
ggplot(wineData, aes(quality, fill=color)) +
    geom_histogram(binwidth = .5, col="black") +  
        facet_grid(color ~ .)+
        labs(title="Histogram Showing Quality of Wine", 
        subtitle="Wine Quality across Red and White colors of Wine") 

Let us plot graphs for all other columns, to see how the points are distributed across the different qualities.
We will make use of scatter plots.
We also show the distribution for Red and White wines.

(To make our life easier we will use a for loop, which avoids duplicate code.)

In [None]:
#Loop over the first 11 columns (the measured properties)
for(i in 1:11){
  print(paste("---Plot for---", colnames(wineData)[i]))

  #Overall distribution
  print(ggplot(wineData, aes_string("quality", colnames(wineData)[i]))+
        geom_count(col="tomato3", show.legend=FALSE)+
        theme_solarized())

  #Color wise scatter plot 
  print(ggplot(wineData, aes_string("quality", colnames(wineData)[i]))+
        geom_jitter(aes(col=as.factor(color))))
}


***Note-*** We wrap each plot in the *print()* function because ggplot objects created inside a loop are not drawn unless they are explicitly passed to *print()*.

By looking at the graphs we can see how the content of the wines varies with quality.
We can also see how the content differs between **Red** and **White** wine.

# Start of Clustering
We have gone through the data in hand and run some checks which have given us some clarity about the data, so we can start our clustering analysis.

For the clustering to be successful and good, we need to follow a few steps before we do the actual clustering.

## Objectives: 
###        a. Understanding data
###        b. Data pre-processing
####               i)   Creating dummy variables
####               ii)  Removing columns with zero-variance
####               iii) Scaling and centering data
####               iv)  Discover and remove highly correlated variables
###         c. Determining Number of Clusters.
###         d. Segmentation using k-means
###         e. Clustering Tendency of data

***Note*** The steps involved in data pre-processing depend on the data we are dealing with. That's why understanding the data is very important. 

We have already performed **Step: a. Understanding Data**.
Now let us make use of it and perform our analysis.

## b. Data Pre-processing.

Data pre-processing involves various steps. We will define the steps required as we proceed. All of them help in achieving good clustering.
We will also mention steps which are possible but might not be required here, depending on the data.

### Step 1 - Changing all column values to Numeric

Remember, previously I asked you to make a note of the class of each column. This is where we will make use of it. 
The goal here is to convert any and all columns to numerical values.

*K-means clustering can be performed only on numerical values.*

Based on the **str()** call earlier, we have seen that all our columns are of class num except *Quality, Good* (of int type) and *Color* (of factor type).

We will not make changes to the Color column as we will use it as is for plotting graphs.
And just to be careful we will convert int -> numeric.

In [None]:
#Making changes for converting data types

#Converting the 2 integer columns to numeric
wineData$quality <- as.numeric(wineData$quality)
wineData$good <- as.numeric(wineData$good)

#For ease, creating a data set without the color column (columns 1 to 13)
wineData_no_color <- wineData[1:13]

In [None]:
#Let us see the structure again
str(wineData_no_color)

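All columns are numeric and the color column has been left out, the way we want it.

***FYI***
*Just to add, this is how we can convert factors to numeric.
Replace the levels of the factor with numbers:*
*levels(wineData$color) <- c(0,1)*

*Now convert them to numerical values (going through as.character() so that we get the labels 0 and 1 rather than the internal factor codes 1 and 2):*
*wineData$color <- as.numeric(as.character(wineData$color))*

We have successfully converted the integer columns to numeric.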
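
One thing to watch out for when converting a factor: calling `as.numeric()` directly on it returns the internal level codes, not the level labels. A small sketch on a made-up `color` vector (not the wine data) shows the difference:

```r
# Hypothetical colour column stored as a factor
color <- factor(c("red", "white", "white", "red"))

# Relabel the levels as "0" and "1" (levels are sorted: red, white)
levels(color) <- c(0, 1)

# as.numeric() on a factor returns the internal codes 1, 2, ...
codes <- as.numeric(color)
print(codes)

# Going through character keeps the labels 0 and 1
values <- as.numeric(as.character(color))
print(values)
```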

### Step 2 - Try finding Near Zero Variance

In this step we will try to find columns which do not have any significant variation in their values, so that their variance makes no difference to the results.

We can remove those columns and make our computation faster.

*We will make use of a function from the **Caret** library for this.* 

In [None]:
#Will give us the column numbers which have insignificant variance.
nzv <- nearZeroVar(wineData)

print(paste("---Column number with----", nzv))


As we can see nothing is returned i.e we don't have any such column.
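
To give a feel for what such a check looks at, here is a simplified sketch in base R. It is not caret's actual code: `near_zero_var_sketch` is a hypothetical helper that mimics the two documented criteria of `nearZeroVar()` (the ratio of the most common value to the second most common, and the percentage of unique values) with caret's default cut-offs of 95/5 and 10.

```r
# Simplified re-implementation for illustration; not caret's actual code
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # a single value: zero variance
  freq_ratio     <- as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

set.seed(1)
toy <- data.frame(
  varied   = rnorm(200),                 # many distinct values
  constant = rep(1, 200),                # no variation at all
  almost   = c(rep(0, 197), 1, 1, 1)     # dominated by one value
)

flags <- sapply(toy, near_zero_var_sketch)
print(flags)
```

Columns flagged TRUE here are the kind of columns that would be candidates for removal.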


### Step 3 - Normalizing the Data i.e. Scaling and Centering

In this step we try to normalize the values of the columns so that columns with large values do not dominate columns with smaller values.
Columns with larger values can otherwise distort the distances and cause inconsistent clustering.

Let us identify the columns with large values.

There are 2 ways of doing it, but we will start with a technique which gives us all values in the range 0 to 1.

In [None]:
#Identifying columns with large data
head(wineData)

In [None]:
#By looking at the data we can say that columns 1, 4, 6, 7, 9, 11, 12 can cause issues; we will normalize them so that their values are in the 0 to 1 range.
print("---Normalizing Data----")
norm_data <- sapply(wineData[,c(1,4,6,7,9,11,12)], function(x) (x - min(x))/(max(x) - min(x)))

print("---Type of returned data")
class(norm_data)
                    
print("---Converting data from matrix to data.frame---")
norm_data <- data.frame(norm_data)    # norm_data is a 'matrix'

print("---Normalised data---")
head(norm_data)

Brief ***theory*** on what we have done here.

*(Each data point in a column minus the minimum value of that column) divided by (the maximum value of that column minus the minimum value of that column).*

The result will always be a value in the range 0 to 1.
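
As a quick worked example of this formula, here is a minimal sketch on four made-up alcohol values; `min_max` is just a named version of the anonymous function used in the cell above.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up alcohol values
alcohol <- c(8, 10, 11, 14)
scaled  <- min_max(alcohol)
print(scaled)
```

Note that a column in which every value is the same would give a division by zero here, which is one more reason to look for zero-variance columns first.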

Now let us bind the rest of data with normalized data and see the data.


In [None]:
#Binding the normalised data with other data
wineData_norm <- cbind(wineData[,c(2,3,5,8,10,13)],norm_data)
head(wineData_norm)

In [None]:
str(wineData_norm)

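Alternatively we can make use of the ***scale()*** function. By default it standardises each column (subtracts the mean and divides by the standard deviation), so the values are centred on 0 rather than limited to the 0 to 1 range.
Let us see how the data looks with scaling, so that we can compare the two.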

In [None]:
wineData_scaled <- scale(wineData_no_color)

head(wineData_scaled)

class(wineData_scaled)

In [None]:
#Converting to data.frame
wineData_scaled_df <- as.data.frame(wineData_scaled)
class(wineData_scaled_df)
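
To see how the two approaches differ, here is a small sketch on one made-up column (`sulfur` is hypothetical): `scale()` gives mean 0 and standard deviation 1, while min-max scaling gives values between 0 and 1.

```r
set.seed(42)
# Made-up feature on a large scale
x <- data.frame(sulfur = rnorm(100, mean = 120, sd = 40))

z  <- scale(x)                     # centre to mean 0, scale to sd 1
mm <- (x$sulfur - min(x$sulfur)) / (max(x$sulfur) - min(x$sulfur))

print(c(mean = mean(z), sd = sd(z)))    # approximately 0 and 1
print(range(z))                         # not confined to the 0 to 1 range
print(range(mm))                        # 0 to 1
```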

### Step 4 - Finding highly correlated columns.

In this step we will try to determine columns which are highly correlated with each other.
High correlation means that when the values of one column change, the values of the other column change in the same manner. 

For example: Income and Taxes.
As income goes up, taxes also go up, and vice versa.

Because such columns carry nearly the same information, we can remove one of the two correlated columns without much effect on our clustering results.
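
The income and taxes example can be sketched with made-up numbers. All values below are simulated for illustration; the 20% tax rate and the unrelated `shoe_size` variable are assumptions.

```r
set.seed(7)
# Made-up incomes; tax is roughly 20% of income plus noise
income <- runif(100, min = 20000, max = 120000)
tax    <- 0.2 * income + rnorm(100, sd = 1500)
# An unrelated variable for contrast
shoe_size <- rnorm(100, mean = 42, sd = 2)

cor_income_tax  <- cor(income, tax)
cor_income_shoe <- cor(income, shoe_size)
print(round(c(income_tax = cor_income_tax, income_shoe = cor_income_shoe), 2))
```

The first correlation should come out close to 1 and the second close to 0, which is the difference we are looking for between redundant and independent columns.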

To find correlation we will make use of both visual and data-based techniques.

In [None]:
#Lets find correlation using cor()
corr_norm <- round(cor(wineData_norm),1)
corr_norm

In [None]:
#Let us plot a Correlogram using above returned results.
ggcorrplot(corr_norm, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Wine Data", 
           ggtheme=theme_dark)

By looking at the above graph we can tell how each column is correlated with the other columns.
We can *visually analyze* that columns ***'Quality'*** and ***'Good'*** have the ***maximum*** correlation of ***0.8***.

It is up to us and our requirements to decide at what level of correlation a column should be removed from the clustering equation.

*Here we use a level of 0.7; anything above it can be removed.*

(For the data-based interpretation we will again make use of a function available in the *Caret* library.)

In [None]:
corr_scaled <- round(cor(wineData_scaled_df),1)

##Lets plot same for Scaled data
ggcorrplot(corr_scaled, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Wine Data", 
           ggtheme=theme_dark)

***Note:*** *Same results, i.e. we can use either one of them. We will make use of the normalised data.*

In [None]:
#Let us calculate it through data.
#Again we will make use of the Caret library.
a<-findCorrelation(corr_norm, cutoff = 0.7, verbose = TRUE)
#Returns an integer vector of column indices

print("--- Columns number---")
a

print("---Column name we want to remove---")
colnames(wineData_norm)[a]
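
If you also want to see which pairs are above the cut-off and not only which column to drop, a base R sketch like the following can help. The `toy` data frame is made up; `good` is derived from `quality` here only to produce a strongly correlated pair.

```r
set.seed(11)
# Made-up data: 'good' is derived from 'quality', 'pH' is independent
quality <- sample(3:9, 300, replace = TRUE)
toy <- data.frame(
  quality = quality,
  good    = as.numeric(quality >= 7),
  pH      = rnorm(300, mean = 3.2, sd = 0.15)
)

cm <- cor(toy)

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  cor  = cm[idx]
)
print(high_pairs)
```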

In [None]:
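#Let us remove column "Good" from our clustering equation.
#Removing Good column as high quality wine will be good
wineData_scaled_df$good <- NULL
#Also drop it from the normalised data, which is what we cluster on below
wineData_norm$good <- NULL
str(wineData_scaled_df)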

### End of Data Pre-Processing step.

## c. Determining Number of clusters.

K-means takes the number of clusters as one of its inputs.

The number of clusters which is best for us varies from data to data.
There are various techniques which can be used to determine the number of clusters best suited to the data.

We will try to make use of a few. We will also include graphical representations for visual analysis.

#### Method: 1 - Elbow method

The elbow method makes use of the ***Sum of Squared Errors*** or ***Within-cluster sum of squares (WSS)***.

It is nothing but the distance of each point in a cluster from its cluster center, squared and summed up over all points.

Visually, the elbow is the point in the graph from where the curve flattens out into a nearly straight line, i.e. increasing the number of clusters makes practically no further difference.

This technique calls k-means for a range of cluster counts, i.e. from 1 to a certain number, and records the WSS for each.
Then we can plot these values to see the behaviour.

*Ref: http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/*
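
To show that WSS really is just this sum of squared distances, the sketch below computes it by hand on made-up 2-D data and compares it with the `tot.withinss` value reported by `kmeans()`. The `toy` matrix is an assumption for illustration.

```r
set.seed(123)
# Made-up 2-D data with two well separated groups
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)

km_toy <- kmeans(toy, centers = 2, nstart = 10)

# WSS by hand: squared distance of every point to its own cluster centre
centres_per_point <- km_toy$centers[km_toy$cluster, ]
wss_manual <- sum((toy - centres_per_point)^2)

print(c(manual = wss_manual, kmeans = km_toy$tot.withinss))
```

Both numbers should agree, which is why `tot.withinss` can be used directly in the loop for the elbow plot.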

In [None]:
# Initialize total within sum of squares error: wss
wss <- numeric(10)

# For 1 to 10 cluster centers
for (i in 1:10) {
  km.out <- kmeans(wineData_norm, centers = i)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}
wss
# Plot total within sum of squares vs. number of clusters
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

#### Method: 2 - Silhouette plot

For this we will make use of fviz_nbclust() from the factoextra package. After trying NbClust multiple times it was found that a large data set can sometimes cause issues in getting outputs.
To tackle this issue we will take a sample of the data and do our analysis on it.

***Sampling Data***
I was not able to find many methods, but a few are
**sample(),
sample_frac() - dplyr,
createDataPartition() - caret**

We will make use of sample_frac()
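
For comparison, the same kind of 65% sample can be drawn with base R's `sample()`. This is a minimal sketch on a made-up data frame (`toy` stands in for the wine data):

```r
set.seed(321)
# Made-up data frame standing in for the wine data
toy <- data.frame(id = 1:100, value = rnorm(100))

# Draw 65% of the row indices without replacement
n_keep   <- floor(0.65 * nrow(toy))
keep_idx <- sample(nrow(toy), size = n_keep)

toy_sample <- toy[keep_idx, ]    # the sampled rows
toy_rest   <- toy[-keep_idx, ]   # the remaining rows

print(nrow(toy_sample))
print(nrow(toy_rest))
```

Keeping the indices around, as done here, also gives you the rows that were left out, which `sample_frac()` does not return directly.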

In [None]:
wine_train_data <- sample_frac(wineData_norm, 0.65)

head(wine_train_data)

str(wine_train_data)

nrow(wine_train_data)

In [None]:
#Let us plot the Silhouette plot using fviz_nbclust() from factoextra
fviz_nbclust(wine_train_data, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

There are more methods to find the number of clusters.
You can check out the reference link given under the elbow method.
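
To make clear what the silhouette method measures, here is a hand-written sketch of the average silhouette width in base R. `avg_silhouette` is a hypothetical helper written for this illustration (it is not the code used by factoextra or the cluster package), and the `toy` data is made up. For each point, a is the mean distance to the other points of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette is (b - a) / max(a, b).

```r
# Average silhouette width computed by hand (illustrative, needs the full distance matrix)
avg_silhouette <- function(x, cluster) {
  d  <- as.matrix(dist(x))
  ks <- sort(unique(cluster))
  s  <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) { s[i] <- 0; next }          # singleton cluster
    a <- sum(d[i, own]) / (sum(own) - 1)            # mean distance within own cluster
    b <- min(sapply(ks[ks != cluster[i]],           # mean distance to nearest other cluster
                    function(k) mean(d[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(99)
# Made-up data with three separated groups
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2)
)

# Average silhouette width for k = 2..5; the largest value suggests the best k
widths <- sapply(2:5, function(k) {
  avg_silhouette(toy, kmeans(toy, centers = k, nstart = 20)$cluster)
})
names(widths) <- 2:5
print(round(widths, 3))
best_k <- as.integer(names(which.max(widths)))
print(best_k)
```

Because it builds the full distance matrix, this sketch is only practical for small data, which is another reason to work on a sample.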

## d. Segmentation Using K-means

Let us do the clustering using the number of clusters we got from the above analysis and look at the output.

We are using 2 clusters for our analysis.

In [None]:
km <- kmeans(wineData_norm, 2, iter.max = 140 , algorithm="Lloyd", nstart=100)
km

In [None]:
#Structure of km
str(km)
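
Besides plotting, a quick way to read a k-means result is to look at the cluster sizes and centres and to cross-tabulate the clusters against a label we already know. The sketch below does this on made-up data; `toy` and its `color` labels are assumptions. For the wine data the analogous call would be something like `table(km$cluster, wineData$color)`.

```r
set.seed(2024)
# Made-up stand-in: two "colours" with different feature means
color <- rep(c("red", "white"), times = c(40, 60))
toy <- data.frame(
  sulfur = c(rnorm(40, mean = 0.2, sd = 0.05), rnorm(60, mean = 0.6, sd = 0.05)),
  acid   = c(rnorm(40, mean = 0.7, sd = 0.05), rnorm(60, mean = 0.3, sd = 0.05))
)

km_toy <- kmeans(toy, centers = 2, nstart = 25)

print(km_toy$size)       # number of points per cluster
print(km_toy$centers)    # cluster centres in feature space

# Cross-tabulate cluster membership against the known label
tab <- table(cluster = km_toy$cluster, color = color)
print(tab)
```

The cluster numbers themselves are arbitrary, so only the pattern in the table matters, not which cluster is called 1 or 2.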

Let us do visual analysis of data using results from K-mean. We will try to plot different graphs for better understanding.

In [None]:
# Cluster plot against the first 2 principal components
# (using the same normalised data that was passed to kmeans)
clusplot(wineData_norm, km$cluster, color=TRUE, shade=TRUE, 
         labels=2, lines=0)


In [None]:
fviz_cluster(km, data = wineData_norm,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())

In [None]:
fviz_cluster(list(data = wineData_norm, cluster = km$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

In [None]:
# PAM (partitioning around medoids) as an alternative to k-means
pam.res <- pam(wineData_norm, 2)
# Visualize
fviz_cluster(pam.res)

## e. Clustering tendency

Now let us try to determine whether our data has a tendency to form clusters or whether we are forcing it into clusters.
To check the clustering tendency of our data we will generate a random data set and compare the results for both.
We will make use of visual analysis for this.

In [None]:
# Random data generated from the wine data set (uniform within each column's range)
random_df <- apply(wineData_norm, 2, 
                function(x){runif(length(x), min(x), (max(x)))})
random_df <- as.data.frame(random_df)

head(random_df)

In [None]:
# Plot the wine data set
fviz_pca_ind(prcomp(wineData_scaled_df), title = "PCA - Wine data", 
             habillage = wineData$color,  palette = "jco",
             geom = "point", ggtheme = theme_solarized(),
             legend = "bottom")
# Plot the random df
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data", 
             geom = "point", ggtheme = theme_solarized())

Here we look at two R functions / packages to statistically evaluate clustering tendency by computing the Hopkins statistic:

**get_clust_tendency()** function [in factoextra package]. It returns the Hopkins statistic H. The result is a list containing two elements:
1. hopkins_stat
2. plot

**hopkins()** function [in clustertend package]. In the package versions used here it implements 1 - H, where H is the definition used by get_clust_tendency().

*library factoextra*

In [None]:
# Compute Hopkins statistic for Wine dataset
res <- get_clust_tendency(head(wine_train_data, 500), n = 499, graph = TRUE)
res$hopkins_stat
plot(res$plot)

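With the definition of H used here, the Hopkins statistic should be as small as possible (close to 0) for the data to be considered clusterable. Note that this convention depends on the implementation and package version: some report 1 - H instead, in which case values close to 1 indicate clusterable data, so check the documentation of the version you have installed.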

In [None]:
# Compute Hopkins statistic for random dataset
res <- get_clust_tendency(head(random_df, 500), n = 499, graph = TRUE)
res$hopkins_stat
plot(res$plot)

As you can see, the random data has a higher Hopkins value.

***Note:*** Hopkins was taking a lot of time, that's why I had to use a smaller data set.
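
For intuition about what the statistic compares, here is a rough base R sketch. It is not the code of factoextra or clustertend: `hopkins_sketch` is a hypothetical helper, the data is made up, and it uses the convention H = sum(u) / (sum(u) + sum(w)) without the dimension exponent that some definitions include. Under this particular convention the scale is the opposite of the one read above: values around 0.5 suggest uniformly spread data and values close to 1 suggest clusters.

```r
# Illustrative Hopkins statistic, using H = sum(u) / (sum(u) + sum(w)):
#   u = distance from a uniform random point to its nearest data point
#   w = distance from a sampled data point to its nearest other data point
hopkins_sketch <- function(x, m = 50) {
  x <- as.matrix(x)
  n <- nrow(x)
  nearest <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # w: sampled real points, nearest neighbour among the remaining real points
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  # u: uniform random points inside the bounding box of the data
  rand <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(rand, 1, function(p) nearest(p, x))

  sum(u) / (sum(u) + sum(w))
}

set.seed(5)
# Made-up data: two tight groups versus uniform noise
clustered <- rbind(
  matrix(rnorm(400, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(400, mean = 5, sd = 0.2), ncol = 2)
)
uniform <- matrix(runif(800), ncol = 2)

h_clustered <- hopkins_sketch(clustered)
h_uniform   <- hopkins_sketch(uniform)
print(round(c(clustered = h_clustered, uniform = h_uniform), 2))
```

Whichever convention a package uses, the useful comparison is the same as in the cells above: the real data should sit clearly further from the random baseline than random data does.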

#### Various issues faced and their solutions, which might be helpful for others.
1. Working with NbClust was very difficult; I ran into a lot of issues due to the heavy computation.
    I suggest taking a training data set of a small size, working on that, and then gradually increasing the size of the training data.
    
2. Techniques for sampling the dataset.
    I have included the techniques which I was able to find. Please comment if you have more suggestions for drawing random samples from the data.
    
3. Not all clustering techniques will give the same result, as they follow different approaches to calculate clusters.


### Thank you for checking the kernel. If you liked it then do vote-up and provide your valuable comments.
I will try to incorporate more clustering techniques in the future to compare their results.
If there is any particular topic you would like me to cover, or something I might have missed, please let me know in the comments.