-
Notifications
You must be signed in to change notification settings - Fork 0
/
Data Mining Assignment 4.Rmd
327 lines (273 loc) · 14.2 KB
/
Data Mining Assignment 4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
---
title: "Data Mining Assignment 4"
author: ""
date: ""
output:
md_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# ECO 395M: Exercises 4
# Group Members - Alina Khindanova, Anvit Sachdev, Shreya Kamble
## Problem 1: Clustering and PCA
Dataset: wine.csv
```{r 1.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(ggpubr)
library(factoextra)
library(ggfortify)
```
```{r 1.reading data, echo=FALSE, message=FALSE, warning=FALSE}
wine <- read.csv("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/wine.csv")
wine=na.omit(wine)
X = wine[,-(12:13)]
```
To see whether the 11 chemical propeties are capable of distinuishing wines on the basis of their color and quality, we compute principal component analysis (PCA) to reduce the data into small dimensions. We also perform k-means clustering using k=2 (for colors) and k=10 (for quality) to make the following plots.
A. Performing PCA and K-means clustering for distinguishing colors-
PCA plot:-
```{r PCA plot color, echo=FALSE, message=FALSE, warning=FALSE}
pca_res <- prcomp(X, scale. = TRUE)
autoplot(pca_res, data = wine, colour = 'color')
```
k-means clustering plot:-
```{r 1.k means colors, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(123)
res.km <- kmeans(scale(X), 2, nstart = 25)
# K-means clusters showing the group of each individuals
```
```{r 1.pca colors, echo=FALSE, message=FALSE, warning=FALSE}
res.km$cluster[res.km$cluster==1]=3
res.pca <- prcomp(X, scale = TRUE)
# Coordinates of individuals
ind.coord <- as.data.frame(get_pca_ind(res.pca)$coord)
# Add clusters obtained using the K-means algorithm
ind.coord$cluster <- factor(res.km$cluster)
#ind.coord$cluster[ind.coord$cluster=="1"]="3"
#ind.coord$cluster = (ind.coord$cluster+1)
# Add Species groups from the original data sett
ind.coord$color <- wine$color
#ind.coord$facidity=wine$fixed.acidity
#ind.coord$vacidity=wine$volatile.acidity
#ind.coord$sdioxide=wine$total.sulfur.dioxide
#ind.coord$cacid=wine$citric.acid
#ind.coord$density=wine$density
#ind.coord$totalso2=wine$
#summary(res.pca)
#res.pca
# Data inspection
#head(ind.coord)
```
```{r 1.eigenvalues colors, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
#head(eigenvalue)
```
```{r 1.colors classifcation plotting, echo=FALSE, message=FALSE, warning=FALSE}
ggscatter(
ind.coord, x = "Dim.1", y = "Dim.2",
color = "cluster", palette = "npg", ellipse = TRUE, ellipse.type = "convex",
shape = "color", size = 1.5, legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
ylab = paste0("Dim 2 (", variance.percent[2], "%")) +
stat_mean(aes(color = cluster), size = 4)
```
We can see that the k-means clustering model does a better job at distinguishing between the red and white wines relative to PCA. This is likely because PCA compresses the features, and k-means clustering compresses the data points. Since, we're focusing on similarity between data points, so the observation is intuitive.
B. Performing PCA and K-means clustering for distinguishing quality-
PCA plot:-
```{r 1.PCA plot quality, echo=FALSE, message=FALSE, warning=FALSE}
pca_res <- prcomp(X, scale. = TRUE)
autoplot(pca_res, data = wine, color = 'quality')
```
k-means clustering plot:-
```{r 1.k means quality, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(123)
res.km <- kmeans(scale(X), 10, nstart = 25)
# K-means clusters showing the group of each individuals
```
```{r 1.pca quality, echo=FALSE, message=FALSE, warning=FALSE}
res.pca <- prcomp(X, scale = TRUE)
# Coordinates of individuals
ind.coord <- as.data.frame(get_pca_ind(res.pca)$coord)
# Add clusters obtained using the K-means algorithm
ind.coord$cluster <- factor(res.km$cluster)
# Add Species groups from the original data sett
ind.coord$quality <- factor(wine$quality)
# Data inspection
#head(ind.coord)
```
```{r 1.eigenvalue quality, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
#head(eigenvalue)
```
```{r 1.quality classifcation plotting, echo=FALSE, message=FALSE, warning=FALSE}
ggscatter(
ind.coord, x = "Dim.1", y = "Dim.2",
color = "cluster", palette = "npg", ellipse = TRUE, ellipse.type = "convex",
shape = "quality", size = 1.5, legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
stat_mean(aes(color = cluster), size = 4)+
scale_shape_identity()
```
Based on analysis of the above graph, it appears that neither PCA nor k-means clustering method are effective in accurately differentiating between wines and different quality levels as there exists very strong overlaps across different qualitites of the wine.
We can conclude that the unsupervised algorithm used was relatively able to distinguish between red and white wines. There are margin of errors, but the results still could be interpreted.
Information about Dimensions and PCs:-
On an extra note, we can see below the variance percent of each dimension (in sorted order)-
```{r 1.variance percent, echo=FALSE, message=FALSE, warning=FALSE}
eigenvalue <- round(get_eigenvalue(res.pca), 1)
variance.percent <- eigenvalue$variance.percent
head(eigenvalue)
```
We can see that the six dimensions contribute to roughly 85.3% of the variance percent.
The information about all the PCs is given below-
```{r 1.PCS info, echo=FALSE, message=FALSE, warning=FALSE}
res.pca
```
Next, we determine how much each variable is represented in the first two dimensions. This quality of representation is called the Cos2 and corresponds to the square cosine. A low value of Cos2 means that the variable is not perfectly represented by that component. A high value of Cos2, on the other hand, means a good representation of the variable on that component.
```{r 1.PCS contribution, echo=FALSE, message=FALSE, warning=FALSE}
fviz_cos2(res.pca, choice = "var", axes = 1:2)
```
## Problem 2: Market segmentation
Dataset: social_marketing.csv
```{r 2.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(ClusterR)
library(foreach)
library(mosaic)
library(ggplot2)
library(corrplot)
```
```{r 2.reading data, echo=FALSE, message=FALSE, warning=FALSE}
marketing = read.csv("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/social_marketing.csv")
```
In this problem we need to analyze data and identify any interesting market segments that appear to stand out in the social-media audience. We deleted some rows from the dataset that do not give any value to the analysis.
First, we explore correlations between different categories.
```{r, echo=FALSE, warning = FALSE, message = FALSE}
X = marketing[,-c(1,2,6,36,37)]
cor <- cor(X)
corrplot(cor, method = 'color', type = 'upper', tl.cex = 0.7)
```
By looking at the correlation matrix, we can notice some patterns of the data, and approximately define segments:
1) The category 'health_nutrition' has strong positive correlation with 'personal_fitness', and they are both positively correlated with 'outdoors'. So there are people who lead healthy lifestyle and spend much time outdoors.
2) The category 'online_gaming' is positively correlated with 'college_uni', so there are university students who play online games.
3) The categories 'travel', 'politics' and 'computers' are positively correlated. These people like traveling, aware of political situation, and discuss computers.
4) The categories 'fashion', 'beauty' and 'cooking'. These are people who are interested in fashion and beauty, they also like to discuss cooking.
5) The categories 'sports_fandom', 'food', 'religion' and 'parenting' are positively correlated. That gives us a new segment of followers.
Next, we would like to conduct a cluster analysis. We group observations to five clusters. Additionally, we conduct Principal Component Analysis to define the features of our clusters according to principal components. We will use four principal components in our analysis.
```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Center and scale the data
Z = scale(X, center=TRUE, scale=TRUE)
# Extract the centers and scales from the rescaled data (which are named attributes)
mu = attr(Z,"scaled:center")
sigma = attr(Z,"scaled:scale")
# kmeans++ with 6 clusters and 25 starts
clust = KMeans_rcpp(Z, clusters=5, num_init=50)
# PCA
pc4 = prcomp(Z, scale=TRUE, rank=5)
loadings = pc4$rotation
scores = as.data.frame(pc4$x)
Final = cbind(Z,scores)
A = ggplot(Final) +
geom_point(aes(PC1, PC2, color=factor(clust$cluster)))+
theme(legend.position = "none")
B = ggplot(Final) +
geom_point(aes(PC3, PC4, color=factor(clust$cluster)))+
theme(legend.position = "none")
C = ggplot(Final) +
geom_point(aes(PC1, PC5, color=factor(clust$cluster)))+
guides(color = guide_legend(title = "Clusters"))
library("cowplot")
ggdraw() +
draw_plot(A, x = 0, y = .5, width = .5, height = .5) +
draw_plot(B, x = .5, y = .5, width = .5, height = .5) +
draw_plot(C, x = 0, y = 0, width = .65, height = .5)
```
As we can see from the graphs, the first cluster has mostly positive values of PC5. The second cluster has positive values of PC4, third cluster has positive values of PC3, fourth cluster does not have any features, fifth cluster has positive values of PC1 and PC2.
By looking how principal components were formed in terms of categories, we can define four segments (they are almost the same as segment defined by looking on correlations):
```{r, echo=FALSE, warning = FALSE, message = FALSE}
# Question 2: how are the individual PCs loaded on the original variables?
# The top words associated with each component
o1 = order(loadings[,1], decreasing=TRUE)
PC1 <- colnames(Z)[head(o1,5)]
#colnames(Z)[tail(o1,5)]
o2 = order(loadings[,2], decreasing=TRUE)
PC2 <- colnames(Z)[head(o2,5)]
#colnames(Z)[tail(o2,5)]
o3 = order(loadings[,3], decreasing=TRUE)
PC3 <- colnames(Z)[head(o3,5)]
#colnames(Z)[tail(o3,5)]
o4 = order(loadings[,4], decreasing=TRUE)
PC4 <- colnames(Z)[head(o4,5)]
#colnames(Z)[tail(o4,5)]
o5 = order(loadings[,5], decreasing=TRUE)
PC5 <- colnames(Z)[head(o5,5)]
#colnames(Z)[tail(o5,5)]
table <- cbind(PC1, PC2, PC3, PC4, PC5)
library(knitr)
knitr::kable(table)
```
1) There is a segment interested in religion, food, parenting and sports.
2) There is a segment interested in healthy lifestyle: health nutrition, personal fitness and outdoor activities.
3) There is a segment interested in politics, traveling and computers.
4) There is a segment interested in beauty, fashion and cooking.
## Problem 3: Association rules for grocery purchases
Dataset: groceries.txt
```{r 3.loading libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(arulesViz)
library(arules)
library(tidyverse)
library(igraph)
```
```{r 3.reading data, echo=FALSE, message=FALSE, warning=FALSE}
trans1<-read.transactions("https://raw.githubusercontent.com/jgscott/ECO395M/master/data/groceries.txt", format="basket", sep=",", skip=0)
```
We take a look at the dimensions of this data:
```{r 3.checking dimensions, echo=FALSE, message=FALSE, warning=FALSE}
dim(trans1)
```
This means we have 9835 transactions and 169 distinct items.
The top 10 most items in terms of abosolute frequencies are:-
```{r 3.top 10 abosolute frequencies, echo=FALSE, message=FALSE, warning=FALSE}
itemFrequencyPlot(trans1, topN=10, type="absolute", main="Item Frequency")
```
The top 10 most items in terms of relative frequencies are:-
```{r 3.top 10 relative frequencies, echo=FALSE, message=FALSE, warning=FALSE}
itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")
```
The next step is to analyze the rules using the A-Priori Algorithm. We picked the thresholds of support as 0.005, confidence as 0.1 and lift as 1. We chose the support threshold as 0.005 as we wanted to capture rules of only those items that have popularity of at least 0.5%. We chose the threshold of confidence as 0.1 as we are interested in only those (X,Y) pairs such that the Prob(Y|X) is at least 10%. We chose lift threshold to be 1 as it is the traditional threshold that implies which implies that item Y is likely to be bought if item X is bought.
```{r 3.A-Priopri Algorithm, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
rules.trans1 = apriori(trans1,
parameter=list(support=.005, confidence=.1))
rules.trans1=subset(rules.trans1, subset = lift > 1)
```
Now we plot all the rules in (support, confidence) space.
```{r 3.plotting in (support, confidence) space, echo=FALSE, message=FALSE, warning=FALSE}
plot(rules.trans1)
```
We can notice from the above graph that the high lift rules tend to have low support.
Similarly, we plot all the rules in (support, lift) space.
```{r 3.plotting in (support, lift) space, echo=FALSE, message=FALSE, warning=FALSE}
plot(rules.trans1, measure = c("support", "lift"), shading = "confidence")
```
The top 10 rules based on the confidence are-
```{r 3.top 10 rules, echo=FALSE, message=FALSE, warning=FALSE}
subrules <- head(rules.trans1, n = 10, by = "confidence")
inspect(subrules)
```
We graphically visualize the top 10 rules by the following graphs-
```{r 3.Visualization 1, echo=FALSE, message=FALSE, warning=FALSE}
playlists_graph=associations2igraph(subrules, associationsAsNodes = FALSE)
plot(playlists_graph)
```
```{r 3.Visualization 2, echo=FALSE, message=FALSE, warning=FALSE}
plot(subrules, method = "graph",
measure = "confidence", shading = "lift")
```
```{r 3.Visualization 3, echo=FALSE, message=FALSE, warning=FALSE}
plot(subrules, method="paracoord")
```
We can see from the above plots that fruits, vegetables and dairy products are the most common grocery items. We can see that if a person bought fruits, vegetables and dairy products; it is very likely for us to see them adding whole milk in his/her basket as well. We can also see that if a person bought citrus fruit, root vegetables, and whole milk; it is very likely for us to see them adding other vegetables in his/her basket as well.
The above discoveries make perfect sense as these are very common grocery items and this behavior is often observed in the grocery stores.