/
02-tidying.Rmd
298 lines (229 loc) · 8.72 KB
/
02-tidying.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
---
title: "Tidying"
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
```{r setup, cache = FALSE}
source(here::here("code", "setup.R"))
```
# Introduction {.unnumbered}
After the scraping process we have a the data about songs played during the
Requestival in a nice tabular format. Let's load it up and see what it looks
like.
```{r load}
requestival <- read_tsv(
PATHS$scraped,
col_types = cols(
.default = col_character()
)
)
```
This isn't too messy at the moment but there are some things we could do to tidy
it up. Let's work through each of the columns and see if we need to do anything
to them.
# File
The first column contains the name of the file the song was scraped from. The
file names have the form `requestival_X`, where `X` is a day in May 2020. We
don't really care about the file name but we **do** care about the day the songs
were played so let's take that part and create a new column.
```{r add-day}
requestival <- requestival %>%
mutate(Day = str_remove(File, "requestival_")) %>%
mutate(Day = paste0("2020-05-", Day))
requestival
```
# Time
The time each song was played is currently a string with the form `hh:mmpp`
(where `pp` is `am` or `pm`). It would be better to have this has a time object
so let's do that conversion.
```{r parse-time}
requestival <- requestival %>%
mutate(Time = parse_time(Time, "%I:%M%p"))
requestival
```
Now all our times are `hms` objects and that have the form `HH:MM:ss`. Our days
and times are now in nice formats let's combine them into a single datetime.
This is pretty easy to do, we just covert our day string to a `datetime` object
and then add on our time.
```{r datetime}
requestival <- requestival %>%
mutate(DateTime = as_datetime(Day, tz = "Europe/Berlin") + Time)
requestival
```
You might notice that the times are in the CEST timezone. This is because the
HTML pages were downloaded in Europe and the Triple J website is clever enough
to display times using your local timezone. It doesn't really make sense to use
European times for a dataset from an Australian radio station so let's convert
them to the Australian east coast timezone that Triple J broadcasts in.
```{r convert-tz}
requestival <- requestival %>%
mutate(DateTime = with_tz(DateTime, "Australia/Sydney"))
requestival
```
# Song
The song name column is made up of strings. This is probably what we want but
let's do a quick check to see if we have any weird characters left over from the
HTML.
```{r song-chars}
song_chars <- requestival$Song %>%
str_extract_all(boundary("character")) %>%
unlist() %>%
sort() %>%
unique()
```
These are the unique characters we have in the song names:
`r glue_collapse(song_chars, sep = ", ")`
Most of these letters in the standard Latin alphabet, numerals or standard
punctuation but there are also a few different kinds of white space, some
accented letters and what I think are Korean characters.
Let's make a table of these characters with their Unicode descriptions to try
and work out what they are.
```{r song-chars-table}
song_chars %>%
map_int(utf8ToInt) %>%
Unicode::u_char_info() %>%
mutate(Character = song_chars) %>%
select(Character, Code, Name)
```
I have quickly checked some of these and it looks like they are correctly part
of song names so I am going to leave them. I will replace the non-breaking
spaces with regular spaces though.
```{r song-spaces}
requestival <- requestival %>%
mutate(Song = str_replace(Song, "\u00a0", " "))
```
# Artist
We can do a similar thing to have a look at the characters in the artist field.
```{r artist-chars}
artist_chars <- requestival$Artist %>%
str_extract_all(boundary("character")) %>%
unlist() %>%
sort() %>%
unique()
artist_chars %>%
map_int(function(.x) {utf8ToInt(.x)[1]}) %>%
Unicode::u_char_info() %>%
mutate(Character = artist_chars) %>%
select(Character, Code, Name)
```
This looks ok so I don't think we need to do anything here.
# Release
Let's have a look at the characters in the release field as well.
```{r release-chars}
release_chars <- requestival$Release %>%
str_extract_all(boundary("character")) %>%
unlist() %>%
sort() %>%
unique()
release_chars %>%
map_int(function(.x) {utf8ToInt(.x)[1]}) %>%
Unicode::u_char_info() %>%
mutate(Character = release_chars) %>%
select(Character, Code, Name)
```
There are some more unusual characters here but they also look to be correct so
we will leave them for now.
# Links
The remaining fields that we scraped are links to places where you can listen
to each song. We probably don't want to mess with these too much but let's take
a look.
## YouTube
The YouTube URLs look something like this: `r requestival$YouTube[1]`
This is just a link a YouTube search results page for a query made up of the
artist and song name. It might be useful to be able to access the query part so
let's make that into a new column.
```{r youtube-query}
requestival <- requestival %>%
mutate(YouTubeQuery = str_remove(YouTube, "^.*="))
requestival
```
## Spotify
The Spotify URLs are similar: `r requestival$Spotify[1]`
We can make a similar column for the Spotify query.
```{r spotify-query}
requestival <- requestival %>%
mutate(SpotifyQuery = str_remove(Spotify, "^.*/results/"))
requestival
```
## Unearthed
The Unearthed links are a bit different: `r requestival$Unearthed[3]`
These are direct links to artist pages on the Unearthed website rather than
searches. There probably isn't much useful information in these but it would be
nice to know which artists are on Unearthed so let's make a boolean column to
represent this.
```{r is-unearthed}
requestival <- requestival %>%
mutate(IsUnearthed = !is.na(Unearthed))
requestival
```
# Selecting columns
Now that we have tidied up the individual columns we can tidy up the table as a
whole. For example the "File" field isn't much use now that we have extracted the
date information and the "Day", "Time" and "DateTime" columns are a bit
redundant. The times are also inconsistent because we have converted DateTime to
the Australian time zone but the others are still in European time. We might
also like to rearrange and rename some of the columns.
```{r select}
requestival <- requestival %>%
select(
DateTime,
Song,
Artist,
Release,
IsUnearthed,
UnearthedURL = Unearthed,
SpotifyQuery,
SpotifyURL = Spotify,
YouTubeQuery,
YouTubeURL = YouTube
) %>%
arrange(DateTime)
requestival
```
# Filtering
The final step in our tidying is to filter out some of the songs.
The Requestival officially runs between 6 am and 9 pm from Monday 25 May to
Sunday 31 May. On the final day it actually ends a bit earlier at 6 pm. We are
only interested in songs played during this period so we need to filter out
other times. There is probably a cleaner way to do this but I'm just going to
use a big ol' logical statement.
```{r filter}
tz <- "Australia/Sydney" # This is just to save space
times <- tribble(
~Start, ~End,
ymd_hm("2020-05-25 06:00", tz = tz), ymd_hm("2020-05-25 21:00", tz = tz),
ymd_hm("2020-05-26 06:00", tz = tz), ymd_hm("2020-05-26 21:00", tz = tz),
ymd_hm("2020-05-27 06:00", tz = tz), ymd_hm("2020-05-27 21:00", tz = tz),
ymd_hm("2020-05-28 06:00", tz = tz), ymd_hm("2020-05-28 21:00", tz = tz),
ymd_hm("2020-05-29 06:00", tz = tz), ymd_hm("2020-05-29 21:00", tz = tz),
ymd_hm("2020-05-30 06:00", tz = tz), ymd_hm("2020-05-30 21:00", tz = tz),
ymd_hm("2020-05-31 06:00", tz = tz), ymd_hm("2020-05-31 18:00", tz = tz)
)
requestival <- requestival %>%
filter(
(DateTime >= times$Start[1] & DateTime <= times$End[1]) |
(DateTime >= times$Start[2] & DateTime <= times$End[2]) |
(DateTime >= times$Start[3] & DateTime <= times$End[3]) |
(DateTime >= times$Start[4] & DateTime <= times$End[4]) |
(DateTime >= times$Start[5] & DateTime <= times$End[5]) |
(DateTime >= times$Start[6] & DateTime <= times$End[6]) |
(DateTime >= times$Start[7] & DateTime <= times$End[7])
)
```
It would be easy to miss this up let's make a quick plot of the play times to
check what songs we have kept.
```{r plot-times}
ggplot(requestival, aes(x = DateTime, y = 0)) +
geom_jitter(width = 0) +
geom_vline(data = times, aes(xintercept = Start), colour = "red") +
geom_vline(data = times, aes(xintercept = End), colour = "blue") +
theme_minimal()
```
That looks pretty good 🎉!
The final dataset has **`r ncol(requestival)`** columns and
**`r nrow(requestival)`** rows. Let's save it to use for future analysis.
```{r save}
write_tsv(requestival, PATHS$tidied)
requestival
```