---
title: "Day Twelve: Multivariate"
subtitle: "SDS 192: Introduction to Data Science"
author: |
  Lindsay Poirier<br/>
  <span style = 'font-size: 70%;'>
  [Statistical & Data Sciences](http://www.smith.edu/sds), Smith College<br/>
  </span>
date: |
  Spring 2022<br/>
output: pdf_document
---
# Introduction
The goal of this lab is to provide you with practice in producing data visualizations that help to answer a research question. Topics that will be covered include:
1. Practice producing and interpreting univariate plots.
2. Practice producing and interpreting multivariate plots.
3. Practice reordering a categorical axis.
4. Practice labeling plots.
5. Practice in visualization aesthetics.
Today, we are prioritizing **joy**! If you are a regular Spotify user (Option 1), today's research question will be: How joyful are my Spotify playlists? If you are not a regular Spotify user (Option 2), today's research question will be: How joyful are popular Spotify playlists in my favorite music genre?
The music feature from Spotify's data that serves as a measure of joy is called *valence*. This is the description from their API documentation for valence:
> A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
(Pretty vague if you ask me, but today we'll go with it.)
# Step 1: Load packages
```{r setup}
# Load packages
library(tidyverse)
library(spotifyr)
```
# Step 2: Add your Spotify credentials
Copy the client ID and secret from your previous lab into the chunk below, and then run the code chunk.
```{r creds, include=FALSE}
id <- Sys.getenv("SPOTIFY_CLIENT_ID")
secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
Sys.setenv(SPOTIFY_CLIENT_ID = id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = secret)
access_token <- get_spotify_access_token()
```
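Before requesting a token, it can help to confirm that both credentials are actually available. The sketch below is only an illustration: `spotify_credentials_set()` is a hypothetical helper (it is not part of `spotifyr`), and it assumes the credentials live in the same two environment variables used in the chunk above.

```r
# Hypothetical helper, not part of spotifyr: returns TRUE only when both
# environment variables are set to non-empty strings.
spotify_credentials_set <- function() {
  id <- Sys.getenv("SPOTIFY_CLIENT_ID")
  secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
  nzchar(id) && nzchar(secret)
}

# Print a reminder instead of failing later with a less obvious error.
if (!spotify_credentials_set()) {
  message("Spotify credentials are missing; set them before requesting a token.")
}
```

If the reminder appears, set the two variables (for example with `Sys.setenv()`) and run the credentials chunk again.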
# Step 3, Option 1: Analyze your own Spotify data!
1. Navigate to your SDS 192 project in the Spotify developer account you created on Monday.
2. Click Edit Settings. Under the heading **Redirect URIs**, paste this URL: https://localhost:1410/ Then click Save. This allows us to authenticate our Spotify accounts through our local computers.
3. In the code chunk below, replace `FILL USER NAME HERE` with your Spotify username. This is the ID that appears in the upper right-hand corner when you log into your Spotify account (not your developer account). Run the code chunk. You will be redirected to a web browser window confirming authentication. Return to RStudio and run the code chunk again to load the data.
```{r getdata1, include=FALSE}
spotify_playlists <- get_user_audio_features(
username = "poiril",
authorization = get_spotify_access_token()
) %>%
select(-c(images,
track.available_markets,
track.artists,
track.album.artists,
track.album.available_markets,
track.album.images))
```
This will create a data frame with the songs your user account has stored in Spotify playlists.
# Step 3, Option 2: Analyze Spotify playlist data!
1. Navigate to Spotify.com and create an account.
2. Navigate to your SDS 192 project in the Spotify developer account you created on Monday.
3. Click Edit Settings. Under the heading **Redirect URIs**, paste this URL: https://localhost:1410/ Then click Save.
4. In the code chunk below, replace `FILL USER NAME HERE` with your Spotify username. This is the ID that appears in the upper right-hand corner when you log into your Spotify account (not your developer account).
5. Search Spotify for your favorite music genre and select three *playlists* from the search. Playlists may be a ways down in the search results.
6. When you click on a playlist, notice the URL in the navigation bar of your web browser. It should look something like spotify.com/playlist/LONG_STRING_OF_CHARACTERS. Copy the long string of characters at the end of the URL, and paste it into the function below where it says `FILL LONG STRING OF CHARACTERS FROM URL`. Repeat this for the other two playlists.
7. Run the code chunk below. You will be redirected to a web browser window confirming authentication. Return to RStudio and run the code chunk again to load the data.
```{r getdata2, include=FALSE}
spotify_playlists <- get_playlist_audio_features(
username = "poiril",
playlist_uris = c("6Zjz0tu37mciuxwASHLZWp",
"2tYBsYfEo7Lxi3CiVjI2L1",
"40NO1fdNj3ny6VseywtmZe"),
authorization = get_spotify_access_token()
) %>%
select(-c(track.artists,
track.available_markets,
track.album.artists,
track.album.available_markets,
track.album.images))
```
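Copying the playlist ID out of each URL by hand is easy to get wrong, so here is a minimal sketch of doing it with a regular expression. `extract_playlist_id()` is a hypothetical helper (not part of `spotifyr`), and the URL form and the `?si=example` query string are assumptions for illustration; the sketch assumes the ID is the run of letters and digits immediately after `/playlist/`.

```r
# Hypothetical helper: keep only the letters and digits that follow "/playlist/".
# If a URL does not contain "/playlist/", sub() returns it unchanged.
extract_playlist_id <- function(url) {
  sub("^.*/playlist/([A-Za-z0-9]+).*$", "\\1", url)
}

# Example URLs (assumed form), with and without a trailing query string.
playlist_urls <- c(
  "https://open.spotify.com/playlist/6Zjz0tu37mciuxwASHLZWp",
  "https://open.spotify.com/playlist/2tYBsYfEo7Lxi3CiVjI2L1?si=example"
)

playlist_ids <- extract_playlist_id(playlist_urls)
playlist_ids
# "6Zjz0tu37mciuxwASHLZWp" "2tYBsYfEo7Lxi3CiVjI2L1"
```

The resulting character vector could then be passed as `playlist_uris` in the chunk above.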
Today we will mostly be practicing plots you've already learned. However, we will learn one new skill: reordering a *categorical* axis based on numeric values. We will do this with the `reorder()` function.
The `reorder()` function has three arguments:
* the vector of categorical values to be reordered,
* a vector that will serve as the basis for reordering,
* and a function to determine how values will be reordered.
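To see what these three arguments do outside of a plot, here is a small sketch on made-up data (the key names and energy values are invented for illustration, not taken from Spotify). `reorder()` returns a factor whose levels are sorted from the lowest to the highest summary value.

```r
# Toy data: three songs in each of three keys.
key_name <- c("C", "C", "C", "G", "G", "G", "D", "D", "D")
energy   <- c(0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.5, 0.6, 0.4)

# Median energy per key: C = 0.8, G = 0.2, D = 0.5.
tapply(energy, key_name, median)

# Levels are ordered by ascending median energy.
reordered <- reorder(key_name, energy, median)
levels(reordered)
# "G" "D" "C"

# Negating the numeric vector reverses the order.
levels(reorder(key_name, -energy, median))
# "C" "D" "G"
```

Without `reorder()`, the keys would be plotted in alphabetical order (C, D, G).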
So let's say I created the following grouped boxplots visualizing the distribution of energy across key names, and I wanted to reorder the categorical axis so that the key name with the highest median energy appears first (at the top of the flipped axis), the key name with the lowest median energy appears last, and all other key names are ordered by their medians in between.
```{r}
spotify_playlists %>%
ggplot(aes(x = key_name, y = energy)) +
geom_boxplot() +
coord_flip() +
labs(title = "Distribution of Song Energy per Key Name in Spotify Wedding Playlists, 2022",
x = "Key Name",
y = "Energy")
```
I want to reorder my x-axis, so I will place the `reorder()` function around my x aesthetic, and assign the following arguments:
* the vector of categorical values to be reordered: `key_name`
* a vector that will serve as the basis for reordering: `energy`
* and a function to determine how values will be reordered: `median`
```{r}
spotify_playlists %>%
ggplot(aes(x = reorder(key_name, energy, median), y = energy)) +
geom_boxplot() +
coord_flip() +
labs(title = "Distribution of Song Energy per Key Name in Spotify Wedding Playlists, 2022",
x = "Key Name",
y = "Energy")
```
What about reordering based on the height of a bar plot?
```{r}
spotify_playlists %>%
ggplot(aes(x = key_name)) +
geom_bar(color = "white") +
coord_flip() +
labs(title = "Count of Songs per Key Name in Three Spotify Wedding Playlists, 2022",
x = "Key Name",
y = "Count of Songs")
```
Notice that here we don't have a separate vector to serve as the basis for reordering. Instead, we want to reorder based on the `length` of the *same vector* we want reordered.
I want to reorder my x-axis, so I will place the `reorder()` function around my x aesthetic, and assign the following arguments:
* the vector of categorical values to be reordered: `key_name`
* a vector that will serve as the basis for reordering: `key_name`
* and a function to determine how values will be reordered: `length`
```{r}
spotify_playlists %>%
ggplot(aes(x = reorder(key_name, key_name, length))) +
geom_bar(color = "white") +
coord_flip() +
labs(title = "Count of Songs per Key Name in Three Spotify Wedding Playlists, 2022",
x = "Key Name",
y = "Count of Songs")
```
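The same idea can be checked on a tiny made-up vector (the values below are invented for illustration). Passing the vector as its own basis with `length` orders the levels by how many times each value occurs, from fewest to most.

```r
# Toy data: "A" appears twice, "B" three times, "C" once.
key_name <- c("A", "A", "B", "B", "B", "C")
table(key_name)

# Levels are ordered by ascending count.
by_count <- reorder(key_name, key_name, length)
levels(by_count)
# "C" "A" "B"
```

In a flipped bar plot, this ordering puts the longest bar at the top.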
# Step Four: How many songs are in each playlist? Create a plot to visualize this, and order the results by the number of songs.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context.
```{r plot1}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = reorder(playlist_name, playlist_name, length))) +
geom_bar(color = "white") +
coord_flip() +
labs(title = "Count of Songs in Three Spotify Wedding Playlists, 2022",
x = "Playlist Name",
y = "Count of Songs")
```
# Step Five: What is the distribution of valence across all of the songs (in intervals of 0.1 valence)? Create a plot to visualize this.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context.
```{r plot2}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = valence)) +
geom_histogram(binwidth = 0.1, color = "white") +
coord_flip() +
labs(title = "Distribution of Valence of Songs in Spotify Wedding Playlists, 2022",
x = "Valence",
y = "Count of Songs")
```
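To make "intervals of 0.1 valence" concrete, the sketch below bins a few made-up valence values by hand with base R's `cut()`. The values are invented for illustration. The histogram counts the same kind of intervals, although I believe `geom_histogram()` centers its bins on multiples of the bin width by default, so adding `boundary = 0` is one way to line the bars up with the edges 0, 0.1, 0.2, and so on.

```r
# Made-up valence values between 0 and 1.
valence <- c(0.05, 0.12, 0.18, 0.33, 0.35, 0.39, 0.71, 0.95)

# Ten intervals of width 0.1 with edges at 0, 0.1, ..., 1.
bins <- cut(valence, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Number of songs falling in each interval, including empty intervals.
bin_counts <- table(bins)
as.vector(bin_counts)
# 1 2 0 3 0 0 0 1 0 1
```

Each count corresponds to the height of one bar in a histogram drawn over the same intervals.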
# Step Six: What is the distribution of valence across all of the songs (in intervals of 0.1 valence) *in each playlist*? Create a plot to visualize this.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context.
```{r plot3}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = valence)) +
geom_histogram(binwidth = 0.1, color = "white") +
coord_flip() +
labs(title = "Distribution of Valence of Songs in Spotify Wedding Playlists, 2022",
x = "Valence",
y = "Count of Songs") +
facet_wrap(vars(playlist_name))
# OR, if you have 8 or fewer playlists (the Dark2 palette has 8 colors)...
spotify_playlists %>%
ggplot(aes(x = valence, fill = playlist_name)) +
geom_histogram(binwidth = 0.1) +
coord_flip() +
labs(title = "Distribution of Valence of Songs in Spotify Wedding Playlists, 2022",
x = "Valence",
y = "Count of Songs") +
scale_fill_brewer(palette = "Dark2")
```
# Step Seven: What are differences in the summary statistics (max, min, median, etc.) of the valence of songs in each playlist? Create a plot to visualize this, and order the results by the median of valence.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context.
```{r plot4}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = reorder(playlist_name, valence, median), y = valence)) +
geom_boxplot() +
coord_flip() +
labs(title = "Distribution of Valence of Songs in Spotify Wedding Playlists, 2022",
x = "Playlist Name",
y = "Valence")
```
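The numbers a boxplot draws can also be computed directly, which is a useful check on how to read the plot. The sketch below uses a made-up data frame (playlist names and valence values are invented for illustration) and base R's `tapply()` to get the median and quartiles per playlist.

```r
# Toy data: five songs in each of two hypothetical playlists.
toy <- data.frame(
  playlist_name = rep(c("Playlist A", "Playlist B"), each = 5),
  valence = c(0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.2, 0.4, 0.6, 0.6)
)

# Median valence per playlist: the line inside each box.
medians <- tapply(toy$valence, toy$playlist_name, median)
medians

# Minimum, quartiles, and maximum per playlist.
quartiles <- tapply(toy$valence, toy$playlist_name, quantile)
quartiles[["Playlist A"]]
```

Here Playlist B has the lower median (0.4 versus 0.5), so `reorder(playlist_name, valence, median)` would place it first among the levels.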
# Step Eight: Do happier songs tend to be more danceable *in each playlist*? Create a plot to visualize this.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context. Also be sure to adjust your plot to address overplotting.
```{r plot5}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = valence, y = danceability)) +
geom_point(alpha = 0.3, size = 0.5) +
labs(title = "Relationship between Danceability and Valence of Songs in Spotify Wedding Playlists, 2022",
x = "Valence",
y = "Danceability") +
facet_wrap(vars(playlist_name)) +
geom_smooth(method = "lm")
```
> Note: You may wish to add a trend line to your plot with method="lm"!
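A trend line drawn with `method = "lm"` is a straight line fitted by linear regression. As a rough sketch of what that fit produces, the example below runs `lm()` on made-up data that lies exactly on a line (the values are invented for illustration), so the recovered intercept and slope are known in advance.

```r
# Toy data: danceability is constructed as 0.3 + 0.5 * valence.
toy <- data.frame(valence = c(0.1, 0.3, 0.5, 0.7, 0.9))
toy$danceability <- 0.3 + 0.5 * toy$valence

# Fit a straight line predicting danceability from valence.
fit <- lm(danceability ~ valence, data = toy)

# The intercept and slope of the fitted line.
coef(fit)
```

A positive slope means that songs with higher valence tend to have higher danceability. With facets, a separate line is fitted within each panel.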
# Step Nine: Do songs composed in the minor or major mode tend to be happier *in each playlist*? Create a plot to visualize this.
```{r plot6}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = mode_name, y = valence)) +
geom_boxplot(alpha = 0.3, size = 0.5) +
labs(title = "Comparison of Valence in Major/Minor Mode Songs in Spotify Wedding Playlists, 2022",
x = "Mode Name",
y = "Valence") +
facet_wrap(vars(playlist_name))
```
# Step Ten: Do happier songs tend to have a higher tempo across all playlists? What role might the song's mode play? Create a plot to visualize this.
Be sure to give it a descriptive title and labels covering all 5 essential components of data context. Also be sure to adjust your plot to address overplotting.
```{r plot7}
# Create plot here
spotify_playlists %>%
ggplot(aes(x = tempo, y = valence, col = mode_name)) +
geom_point(alpha = 0.3, size = 0.5) +
labs(title = "Relationship between Tempo, Valence, and Mode of Songs in Spotify Wedding Playlists, 2022",
x = "Tempo",
y = "Valence",
col = "Song Mode") +
facet_wrap(vars(playlist_name)) +
geom_smooth(method = "lm")
```
> Note: You may wish to add a trend line to your plot with method="lm"!
> What did you learn about joy across these playlists?