Data Mining Assignment 4.Rmd

---
title: "Data Mining Assignment 4"
author: ""
date: ""
output:
  md_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# ECO 395M: Exercises 4    
# Group Members - Alina Khindanova, Anvit Sachdev, Shreya Kamble  
## Problem 1: Clustering and PCA   
Dataset: wine.csv     
```{r 1.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(ggpubr)
library(factoextra)
library(ggfortify)
```
```{r 1.reading data, echo=FALSE, message=FALSE, warning=FALSE}
wine <- read.csv("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/wine.csv")
wine=na.omit(wine)
X = wine[,-(12:13)]
```
To see whether the 11 chemical propeties are capable of distinuishing wines on the basis of their color and quality, we compute principal component analysis (PCA) to reduce the data into small dimensions. We also perform k-means clustering using k=2 (for colors) and k=10 (for quality) to make the following plots.  
  
A. Performing PCA and K-means clustering for distinguishing colors-   
  
PCA plot:-    
```{r PCA plot color, echo=FALSE, message=FALSE, warning=FALSE}
pca_res <- prcomp(X, scale. = TRUE)
autoplot(pca_res, data = wine, colour = 'color')
```
  
k-means clustering plot:-   
```{r 1.k means colors, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(123)
res.km <- kmeans(scale(X), 2, nstart = 25)
# K-means clusters showing the group of each individuals
```
```{r 1.pca colors, echo=FALSE, message=FALSE, warning=FALSE}
res.km$cluster[res.km$cluster==1]=3
res.pca <- prcomp(X,  scale = TRUE)
# Coordinates of individuals
ind.coord <- as.data.frame(get_pca_ind(res.pca)$coord)
# Add clusters obtained using the K-means algorithm
ind.coord$cluster <- factor(res.km$cluster)
#ind.coord$cluster[ind.coord$cluster=="1"]="3"
#ind.coord$cluster = (ind.coord$cluster+1)
# Add Species groups from the original data sett
ind.coord$color <- wine$color
#ind.coord$facidity=wine$fixed.acidity
#ind.coord$vacidity=wine$volatile.acidity
#ind.coord$sdioxide=wine$total.sulfur.dioxide
#ind.coord$cacid=wine$citric.acid
#ind.coord$density=wine$density
#ind.coord$totalso2=wine$
#summary(res.pca)
#res.pca
# Data inspection
#head(ind.coord)
```
```{r 1.eigenvalues colors, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
#head(eigenvalue)
```
```{r 1.colors classifcation plotting, echo=FALSE, message=FALSE, warning=FALSE}
ggscatter(
  ind.coord, x = "Dim.1", y = "Dim.2", 
  color = "cluster", palette = "npg", ellipse = TRUE, ellipse.type = "convex",
  shape = "color", size = 1.5,  legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
  ylab = paste0("Dim 2 (", variance.percent[2], "%")) +
  stat_mean(aes(color = cluster), size = 4)
```
  
We can see that the k-means clustering model does a better job at distinguishing between the red and white wines relative to PCA. This is likely because PCA compresses the features, and k-means clustering compresses the data points. Since, we're focusing on similarity between data points, so the observation is intuitive.    
  
B. Performing PCA and K-means clustering for distinguishing quality-   
  
PCA plot:-  
```{r 1.PCA plot quality, echo=FALSE, message=FALSE, warning=FALSE}
pca_res <- prcomp(X, scale. = TRUE)
autoplot(pca_res, data = wine, color = 'quality')
```
  
k-means clustering plot:-    
```{r 1.k means quality, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(123)
res.km <- kmeans(scale(X), 10, nstart = 25)
# K-means clusters showing the group of each individuals
```
```{r 1.pca quality, echo=FALSE, message=FALSE, warning=FALSE}
res.pca <- prcomp(X,  scale = TRUE)
# Coordinates of individuals
ind.coord <- as.data.frame(get_pca_ind(res.pca)$coord)
# Add clusters obtained using the K-means algorithm
ind.coord$cluster <- factor(res.km$cluster)
# Add Species groups from the original data sett
ind.coord$quality <- factor(wine$quality)
# Data inspection
#head(ind.coord)
```
```{r 1.eigenvalue quality, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
#head(eigenvalue)
```
```{r 1.quality classifcation plotting, echo=FALSE, message=FALSE, warning=FALSE}
ggscatter(
  ind.coord, x = "Dim.1", y = "Dim.2", 
  color = "cluster", palette = "npg", ellipse = TRUE, ellipse.type = "convex",
  shape = "quality", size = 1.5,  legend = "right", ggtheme = theme_bw(),
  xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
  ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
  stat_mean(aes(color = cluster), size = 4)+
  scale_shape_identity()
```
  
Based on analysis of the above graph, it appears that neither PCA nor k-means clustering method are effective in accurately differentiating between wines and different quality levels as there exists very strong overlaps across different qualitites of the wine.  

We can conclude that the unsupervised algorithm used was relatively able to distinguish between red and white wines. There are margin of errors, but the results still could be interpreted.  
  
Information about Dimensions and PCs:-  
  
On an extra note, we can see below the variance percent of each dimension (in sorted order)-     
```{r 1.variance percent, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
head(eigenvalue)
```
We can see that the six dimensions contribute to roughly 85.3% of the variance percent.  
The information about all the PCs is given below-  
```{r 1.PCS info, echo=FALSE, message=FALSE, warning=FALSE}
res.pca
```
Next, we determine how much each variable is represented in the first two dimensions. This quality of representation is called the Cos2 and corresponds to the square cosine. A low value of Cos2 means that the variable is not perfectly represented by that component. A high value of Cos2, on the other hand, means a good representation of the variable on that component.  
```{r 1.PCS contribution, echo=FALSE, message=FALSE, warning=FALSE}
fviz_cos2(res.pca, choice = "var", axes = 1:2)
```
  
## Problem 2: Market segmentation  
Dataset: social_marketing.csv   
```{r 2.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(ClusterR)
library(foreach)
library(mosaic)
library(ggplot2)
library(corrplot)
```
```{r 2.reading data, echo=FALSE, message=FALSE, warning=FALSE}
marketing = read.csv("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/social_marketing.csv")
```

In this problem we need to analyze data and identify any interesting market segments that appear to stand out in the social-media audience. We deleted some rows from the dataset that do not give any value to the analysis.  

First, we explore correlations between different categories.   

```{r, echo=FALSE, warning = FALSE, message = FALSE}
X = marketing[,-c(1,2,6,36,37)]
cor <- cor(X)
corrplot(cor, method = 'color', type = 'upper', tl.cex = 0.7)
```

By looking at the correlation matrix, we can notice some patterns of the data, and approximately define segments:   

1) The category 'health_nutrition' has strong positive correlation with 'personal_fitness', and they are both positively correlated with 'outdoors'. So there are people who lead healthy lifestyle and spend much time outdoors.  

2) The category 'online_gaming' is positively correlated with 'college_uni', so there are university students who play online games.  

3) The categories 'travel', 'politics' and 'computers' are positively correlated. These people like traveling, aware of political situation, and discuss computers.  

4) The categories 'fashion', 'beauty' and 'cooking'. These are people who are interested in fashion and beauty, they also like to discuss cooking.  

5) The categories 'sports_fandom', 'food', 'religion' and 'parenting' are positively correlated. That gives us a new segment of followers.  


Next, we would like to conduct a cluster analysis. We group observations to five clusters. Additionally, we conduct Principal Component Analysis to define the features of our clusters according to principal components. We will use four principal components in our analysis.  

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Center and scale the data
Z = scale(X, center=TRUE, scale=TRUE)

# Extract the centers and scales from the rescaled data (which are named attributes)
mu = attr(Z,"scaled:center")
sigma = attr(Z,"scaled:scale")

# kmeans++ with 6 clusters and 25 starts
clust = KMeans_rcpp(Z, clusters=5, num_init=50)


# PCA
pc4 = prcomp(Z, scale=TRUE, rank=5)
loadings = pc4$rotation
scores = as.data.frame(pc4$x)

Final =  cbind(Z,scores)


A = ggplot(Final) + 
  geom_point(aes(PC1, PC2, color=factor(clust$cluster)))+
  theme(legend.position = "none")

B = ggplot(Final) + 
  geom_point(aes(PC3, PC4, color=factor(clust$cluster)))+
  theme(legend.position = "none")

C = ggplot(Final) + 
  geom_point(aes(PC1, PC5, color=factor(clust$cluster)))+
  guides(color = guide_legend(title = "Clusters"))

library("cowplot")
ggdraw() +
  draw_plot(A, x = 0, y = .5, width = .5, height = .5) +
  draw_plot(B, x = .5, y = .5, width = .5, height = .5) +
  draw_plot(C, x = 0, y = 0, width = .65, height = .5)
```
  
As we can see from the graphs, the first cluster has mostly positive values of PC5. The second cluster has positive values of PC4, third cluster has positive values of PC3, fourth cluster does not have any features, fifth cluster has positive values of PC1 and PC2.  

By looking how principal components were formed in terms of categories, we can define four segments (they are almost the same as segment defined by looking on correlations):  

```{r, echo=FALSE, warning = FALSE, message = FALSE}
# Question 2: how are the individual PCs loaded on the original variables?
# The top words associated with each component
o1 = order(loadings[,1], decreasing=TRUE)
PC1 <- colnames(Z)[head(o1,5)]
#colnames(Z)[tail(o1,5)]

o2 = order(loadings[,2], decreasing=TRUE)
PC2 <- colnames(Z)[head(o2,5)]
#colnames(Z)[tail(o2,5)]

o3 = order(loadings[,3], decreasing=TRUE)
PC3 <- colnames(Z)[head(o3,5)]
#colnames(Z)[tail(o3,5)]

o4 = order(loadings[,4], decreasing=TRUE)
PC4 <- colnames(Z)[head(o4,5)]
#colnames(Z)[tail(o4,5)]

o5 = order(loadings[,5], decreasing=TRUE)
PC5 <- colnames(Z)[head(o5,5)]
#colnames(Z)[tail(o5,5)]

table <- cbind(PC1, PC2, PC3, PC4, PC5)
library(knitr)
knitr::kable(table)

```

1) There is a segment interested in religion, food, parenting and sports.  

2) There is a segment interested in healthy lifestyle: health nutrition, personal fitness and outdoor activities.  

3) There is a segment interested in politics, traveling and computers.  

4) There is a segment interested in beauty, fashion and cooking.  
  
## Problem 3: Association rules for grocery purchases    
Dataset: groceries.txt   
```{r 3.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(arulesViz)
library(arules)
library(tidyverse)
library(igraph)
```
```{r 3.reading data, echo=FALSE, message=FALSE, warning=FALSE}
trans1<-read.transactions("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/groceries.txt", format="basket", sep=",", skip=0)
```
We take a look at the dimensions of this data:  
```{r 3.checking dimensions, echo=FALSE, message=FALSE, warning=FALSE}
dim(trans1)
```
This means we have 9835 transactions and 169 distinct items.  
The top 10 most items in terms of abosolute frequencies are:-    
```{r 3.top 10 abosolute frequencies, echo=FALSE, message=FALSE, warning=FALSE}
itemFrequencyPlot(trans1, topN=10, type="absolute", main="Item Frequency")
```
   
The top 10 most items in terms of relative frequencies are:-    
```{r 3.top 10 relative frequencies, echo=FALSE, message=FALSE, warning=FALSE}
itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")
```
  
The next step is to analyze the rules using the A-Priori Algorithm. We picked the thresholds of support as 0.005, confidence as 0.1 and lift as 1. We chose the support threshold as 0.005 as we wanted to capture rules of only those items that have popularity of at least 0.5%. We chose the threshold of confidence as 0.1 as we are interested in only those (X,Y) pairs such that the Prob(Y|X) is at least 10%. We chose lift threshold to be 1 as it is the traditional threshold that implies which implies that item Y is likely to be bought if item X is bought.  
```{r 3.A-Priopri Algorithm, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
rules.trans1 = apriori(trans1, 
	parameter=list(support=.005, confidence=.1))
rules.trans1=subset(rules.trans1, subset = lift > 1)
```
Now we plot all the rules in (support, confidence) space.    
```{r 3.plotting in (support, confidence) space, echo=FALSE, message=FALSE, warning=FALSE}
plot(rules.trans1)
```
  
We can notice from the above graph that the high lift rules tend to have low support.  
Similarly, we plot all the rules in (support, lift) space.    
```{r 3.plotting in (support, lift) space, echo=FALSE, message=FALSE, warning=FALSE}
plot(rules.trans1, measure = c("support", "lift"), shading = "confidence")
```
  
The top 10 rules based on the confidence are-   
```{r 3.top 10 rules, echo=FALSE, message=FALSE, warning=FALSE}
subrules <- head(rules.trans1, n = 10, by = "confidence")
inspect(subrules)
```
We graphically visualize the top 10 rules by the following graphs-  
```{r 3.Visualization 1, echo=FALSE, message=FALSE, warning=FALSE}
playlists_graph=associations2igraph(subrules, associationsAsNodes = FALSE)
plot(playlists_graph)
```
```{r 3.Visualization 2, echo=FALSE, message=FALSE, warning=FALSE}
plot(subrules, method = "graph", 
     measure = "confidence", shading = "lift")
```
```{r 3.Visualization 3, echo=FALSE, message=FALSE, warning=FALSE}
plot(subrules, method="paracoord")
```
  
We can see from the above plots that fruits, vegetables and dairy products are the most common grocery items. We can see that if a person bought fruits, vegetables and dairy products; it is very likely for us to see them adding whole milk in his/her basket as well. We can also see that if a person bought citrus fruit, root vegetables, and whole milk; it is very likely for us to see them adding other vegetables in his/her basket as well.  
The above discoveries make perfect sense as these are very common grocery items and this behavior is often observed in the grocery stores.