-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-handle-missing.Rmd
432 lines (364 loc) · 22.8 KB
/
02-handle-missing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
# Handle missing data {#chapter-3}
The best way to handle missing data is first to visualize it to get a sense of what's going on.
## Simple but useful features
We're given several useful bits of information about the data on the [competition website](https://www.kaggle.com/competitions/spaceship-titanic/data) that we can apply to create some useful features. By features, I mean new variables that are derived from existing ones. For example, Cabin is comprised of information about the cabin number, the deck and the side and PassengerId contains the passenger group id.
I've created a function that derives these new variables from existing ones. It also creates features that count the number of unique categories of different variables by passenger group. Since the passenger group variable is the only one with no missing values, it makes sense to start the exploration of the others based on it. Questions that we could ask are: 'Is any passenger group travelling from different home planets?'
```{r useful-features}
useful_features <- function(x) {
x2 <- x %>%
mutate(PassengerGroup = str_split_i(PassengerId, "_", 1),
LastName = word(Name, -1),
Deck = str_split_i(Cabin, "/", 1),
CabinNumber = str_split_i(Cabin, "/", 2),
Side = str_split_i(Cabin, "/", 3),
TotalSpent = RoomService + FoodCourt + ShoppingMall + Spa + VRDeck,
PassengerCount = 1) %>% # I added this to use as a count variable in visualizations
group_by(PassengerGroup) %>%
add_count(PassengerGroup, name = "PassengerGroupSize") %>%
mutate(HomePlanetsPerGroup = n_distinct(HomePlanet, na.rm = TRUE),
DestinationsPerGroup = n_distinct(Destination, na.rm = TRUE),
CabinsPerGroup = n_distinct(Cabin, na.rm = TRUE),
TotalSpentPerGroup = sum(TotalSpent, na.rm = TRUE),
CryoSleepsPerGroup = n_distinct(CryoSleep, na.rm = TRUE),
VIPsPerGroup = n_distinct(VIP, na.rm = TRUE),
LastNamesPerGroup = n_distinct(LastName, na.rm = TRUE)) %>%
ungroup() %>%
mutate(across(.cols = c(HomePlanet, CryoSleep, Destination, VIP, Transported, Deck, Side, HomePlanetsPerGroup,
PassengerGroupSize, DestinationsPerGroup, CabinsPerGroup, CryoSleepsPerGroup, VIPsPerGroup,
LastNamesPerGroup, PassengerId),
.fns = as.factor)) %>%
mutate(across(.cols = c(CabinNumber, Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, PassengerGroup),
.fns = as.integer))
return(x2)
}
train2 <- useful_features(train)
```
## Closer look at missing values
A closer look at each variable with missing values gives us more insights. For example, every passenger group always starts from the same home planet. Figure \@ref(fig:missing-homeplanet) shows a sample of passenger groups and their home planets. This is one of the patterns of missingness I alluded to earlier and we can use this information to replace missing HomePlanet values for passengers belonging to a group with a known home planet. This will leave us only with passengers who are travelling alone.
```{r missing-homeplanet, fig.cap = "Sample of missing values for HomePlanet"}
train2 %>%
group_by(PassengerGroup) %>%
filter(any(is.na(HomePlanet))) %>%
ungroup() %>%
slice_sample(n = 50) %>%
ggplot(., mapping = aes(x = PassengerGroup, y = PassengerCount, fill = HomePlanet)) +
geom_col() +
labs(title = "Example missing HomePlanet", x = "Passenger group", y = "Number of passengers") +
scale_y_continuous(breaks = seq(0, 9000, 50)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
We can also see that groups with two or more travelers are never housed in the same cabins as other (non-solo) groups. In other words, a group with two or more passengers can be spread out across several cabins but never share those cabins with other groups of two or more passengers. Figure \@ref(fig:missing-cabin) shows a sample. We can use this information to replace missing cabin values.
```{r missing-cabin, warning = FALSE, fig.cap = "Sample of missing values for Cabin"}
train2 %>%
filter(PassengerGroupSize != 1 & Side == "S" & Deck == "G") %>%
slice_sample(n = 50) %>%
ggplot(., mapping = aes(x = as.factor(CabinNumber), y = PassengerCount, fill = PassengerGroup)) +
geom_col() +
labs(title = "Example missing CabinNumber", x = "Cabin number", y = "Number of passengers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
```
When we look at the new feature Deck, we can see that some decks seem dedicated to passengers from a specific home planet, see Figure \@ref(fig:missing-deck). This helps us replace even more missing values for HomePlanet.
```{r missing-cabin-group, warning = FALSE, fig.cap = "Relationsship between Cabin and passenger group."}
train2 %>%
ggplot(., mapping = aes(x = CabinNumber, y = as.integer(PassengerGroup), colour = Deck)) +
geom_point() +
labs(title = "CabinNumber against PassengerGroup", x = "Cabin number", y = "PassengerGroup") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
We see that the relationship between CabinNumber and PassengerGroup is linear so we can use this to replace CabinNumber where Deck is known.
```{r missing-deck, fig.cap = "Missing values for Deck"}
train2 %>%
group_by(Deck) %>%
filter(any(is.na(HomePlanet))) %>%
ungroup() %>%
ggplot(., mapping = aes(x = Deck, y = PassengerCount, fill = HomePlanet)) +
geom_col() +
labs(title = "Missing HomePlanet by Deck", x = "Deck", y = "Number of passengers") +
scale_y_continuous(breaks = seq(0, 9000, 200)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
We also see that there aren't any VIPs travelling from Earth so we can use this information to replace missing VIP-values where the home planet is known.
```{r missing-vip, fig.cap = "Missing values for VIP"}
train2 %>%
ggplot(., mapping = aes(x = HomePlanet, y = PassengerCount, fill = VIP)) +
geom_col() +
labs(title = "Missing VIP by HomePlanet", x = "HomePlanet", y = "Number of passengers") +
scale_y_continuous(breaks = seq(0, 9000, 200)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
We are also told that passengers in cryosleep are confined to their cabins and we must assume that this means they cannot spend any credits on amenities. In other words, we can replace missing amenities with zeroes for passengers in cryo sleep. The reverse is also true: if a passenger has spent credits, the passenger cannot be in cryo sleep.
One last insight we can get from missing data is that passengers of 12 years of age or under don't spend any credits. We can therefore use this to replace even more missing values for amenities.
```{r missing-age, warning = FALSE, fig.cap = "Missing values for ameneties"}
train2 %>%
ggplot(., mapping = aes(x = Age, y = TotalSpent)) +
geom_col() +
labs(title = "Missing Age", x = "Age of passengers", y = "TotalSpent") +
scale_y_continuous(breaks = seq(0, 500000, 50000)) +
scale_x_continuous(breaks = seq(0, 100, 2)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Replace data with manual rules
I've summed up all of these replacement rules in the function below.
```{r manual-imputation, fig.cap="Missing values after manual replacement."}
my_na_replace <- function(d) {
cabin_coef <- d %>%
nest(.by = Deck) %>%
filter(!is.na(Deck)) %>%
mutate(cabin_lm = map(.x = data, .f = \(df) lm(CabinNumber ~ PassengerGroup, data = df, na.action = na.exclude)),
my_tidy = map(.x = cabin_lm, .f = \(m) tidy(m))) %>%
select(Deck, my_tidy) %>%
unnest(my_tidy) %>%
select(Deck, term, estimate) %>%
pivot_wider(names_from = term, values_from = estimate) %>%
rename(Intercept = `(Intercept)`, Slope = PassengerGroup)
d2 <- d %>%
# Replace HomePlanet for passengers in groups where the homeplanet is known from the other passengers
group_by(PassengerGroup) %>%
fill(HomePlanet, .direction = "downup") %>%
# Replace Cabin by group cabin for groups with group count > 1. Update the Deck, CabinNumber and Side variables.
mutate(Cabin2 = Cabin) %>%
fill(data = ., Cabin2, .direction = "downup") %>%
ungroup() %>%
mutate(Cabin = if_else(is.na(Cabin) & PassengerGroupSize != 1, Cabin2, Cabin),
Deck = str_split_i(Cabin, "/", 1),
CabinNumber = as.integer(str_split_i(Cabin, "/", 2)),
Side = str_split_i(Cabin, "/", 3)) %>%
select(-Cabin2) %>%
# Replace remaining CabinNumber with linear relationship with group
left_join(., cabin_coef, by = "Deck") %>%
mutate(CabinNumber = if_else(is.na(CabinNumber), Intercept + Slope * PassengerGroup, CabinNumber)) %>%
select(-Intercept, -Slope) %>%
# Replace HomePlanet if the passenger is housed on a dedicated Deck
mutate(HomePlanet = if_else(is.na(HomePlanet) & Deck == "G", "Earth", HomePlanet),
HomePlanet = if_else(is.na(HomePlanet) & Deck %in% c("A", "B", "C"), "Europa", HomePlanet)) %>%
# Replace all of VIPs from Earth to FALSE
mutate(VIP = if_else(is.na(VIP) & HomePlanet == "Earth", "False", VIP)) %>%
# Replace amenities with zero if CryoSleep is TRUE or if Age <= 12
# Replace CryoSleep with FALSE if the passenger has spent credits
mutate(across(.cols = c(RoomService, FoodCourt, ShoppingMall, Spa, VRDeck),
.fns = ~ if_else(condition = CryoSleep == "True" | Age <= 12, true = 0, false = .x, missing = .x)),
CryoSleep = if_else(TotalSpent > 0 & is.na(CryoSleep), "False", CryoSleep))
return(d2)
}
train3 <- my_na_replace(train2)
train3 <- useful_features(train3)
plot_na_hclust(train3)
```
We see that we've managed to halv the amount of missing values for many variables with our manual replacements. The rest we can handle with imputation.
## Replace data with imputation
Most algorithms require no missing data at all and since we cannot find any more patterns for the missing data, we can use some imputation algorithms to do the rest. I've used two: `MissRanger` and `KNN`. Since KNN can be sensitive to the scale of the data and outliers, I've normalized the numeric data to avoid possible issues with the imputation.
Note that I've saved the results for some of these imputations since they can take some time to run.
### MissRanger - Chained Random Forests
```{r missranger-imputation, warning = FALSE}
rev_normalization <- function(v, rec) { # Custom function that will "unnormalise" numeric values inside mutate(across())
tidy_rec <- tidy(rec, number = 1)
v2 <- v * filter(tidy_rec, terms == cur_column() & statistic == "sd")$value +
filter(tidy_rec, terms == cur_column() & statistic == "mean")$value
v3 <- round(v2, 0)
return(v3)
}
# ranger_norm <- recipe(Transported ~ ., data = train3) %>%
# step_normalize(all_numeric_predictors()) %>%
# prep()
#
# train3_ranger <- ranger_norm %>%
# bake(new_data = NULL) %>%
# select(-c(Cabin, Name, LastName, PassengerCount))
#
# ranger <- missRanger(train3_ranger, formula = . ~ . -PassengerId, seed = 8584, verbose = 0)
#
# ranger_unnorm <- ranger %>%
# mutate(across(.cols = c(Age, CabinNumber, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck),
# .fns = ~ rev_normalization(.x, ranger_norm)))
#
# ranger2 <- train3 %>%
# select(PassengerId, Cabin, Name, LastName, PassengerCount) %>%
# left_join(ranger_unnorm, ., by = "PassengerId")
#
# save(ranger2, file = "Extra/MissRanger.RData")
load("Extra/MissRanger.RData")
```
The missRanger algorithm can impute all of the variables at the same time and there is an option to select an id variable that will be included in the data but not used for imputation. Note that I have excluded the Name variables to improve computation time.
### KNN - K-Nearest Neighbors
I use the `recipe`-package to set up the process of imputation where the `step_impute_knn`-function allows to specify which variables are to be used for the imputation and which are to be imputed.
```{r knn-imputation, warning = FALSE}
train3_for_knn <- train3 %>%
mutate(across(.cols = where(is.factor), .fns = as.character))
vars_to_impute <- c("HomePlanet", "CryoSleep", "Destination", "Age", "VIP", "RoomService", "FoodCourt", "ShoppingMall",
"Spa", "VRDeck", "Deck", "Side", "CabinNumber", "LastName")
vars_for_imputing <- c("HomePlanet", "CryoSleep", "Destination", "Age", "VIP", "RoomService", "FoodCourt",
"ShoppingMall", "Spa", "VRDeck", "PassengerGroup", "Deck", "Side", "CabinNumber",
"PassengerGroupSize", "DestinationsPerGroup", "CabinsPerGroup",
"CryoSleepsPerGroup", "VIPsPerGroup", "LastNamesPerGroup")
train3_noNA <- train3_for_knn[complete.cases(train3_for_knn),]
knn_impute_rec <- recipe(Transported ~ ., data = train3_noNA) %>%
step_normalize(Age, CabinNumber, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck) %>%
step_impute_knn(recipe = ., all_of(vars_to_impute), impute_with = imp_vars(all_of(vars_for_imputing)), neighbors = 5)
set.seed(8584)
knn_impute_prep <- knn_impute_rec %>% prep(strings_as_factors = FALSE)
set.seed(8584)
knn_impute_bake <- bake(knn_impute_prep, new_data = train3_for_knn)
knn_impute_res <- knn_impute_bake %>%
mutate(across(.cols = c(Age, CabinNumber, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck),
.fns = ~ rev_normalization(.x, knn_impute_prep)))
```
The KNN-algorithm is very fast and can impute multiple variables at the same time. It can also be relatively easily tuned by changing the number of neighbors the imputation is based on although I've used the default = 5.
### Comparison of imputation algorithms - numerical variables
Here I use the `skimr`-package to set up metrics to evaluate and do some data wrangling of the imputated variables to that they can be compared to the non-imputed ones.
```{r imputation-evaluation-num, fig.cap = "Comparisson of standard deviation for numerical values before and after imputation"}
my_skim <- skimr::skim_with(numeric = skimr::sfl(min = ~min(., na.rm = TRUE), median = ~median(., na.rm = TRUE),
mean = ~mean(., na.rm = TRUE), max = ~max(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)), append = FALSE)
no_imp_skim <- train3 %>%
select(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, CabinNumber) %>%
my_skim(.)
ranger_skim <- ranger2 %>%
select(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, CabinNumber) %>%
my_skim(.)
knn_skim <- knn_impute_res %>%
select(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, CabinNumber) %>%
my_skim(.)
sd_skim <- bind_cols("Variable" = no_imp_skim$skim_variable, "No imp" = no_imp_skim$numeric.sd,
"MissRanger" = ranger_skim$numeric.sd, "KNN" = knn_skim$numeric.sd) %>%
pivot_longer(., cols = !Variable, names_to = "Metric", values_to = "Standard_deviation")
mean_skim <- bind_cols("Variable" = no_imp_skim$skim_variable, "No imp" = no_imp_skim$numeric.mean,
"missRanger" = ranger_skim$numeric.mean, "KNN" = knn_skim$numeric.mean) %>%
pivot_longer(., cols = !Variable, names_to = "Metric", values_to = "Mean")
sd_skim %>%
filter(Metric != "No imp") %>%
ggplot(., mapping = aes(x = Metric, y = Standard_deviation)) +
geom_point() +
geom_hline(data = sd_skim %>% filter(Metric == "No imp"), aes(yintercept = Standard_deviation, colour = "orange")) +
lims(y = c(0, NA)) +
facet_wrap(~Variable, scales = "free") +
scale_colour_discrete(name = "Non-imputed", labels = "sd", type = "orange")
```
The standard deviation for our numeric variables isn't affected by the imputation, although this was expected since we only have 1% of missing data. The same is true for the mean, as can be seen in Figure \@ref(fig:imputation-evaluation-num2).
```{r imputation-evaluation-num2, fig.cap = "Comparisson of mean for numerical values before and after imputation"}
mean_skim %>%
filter(Metric != "No imp") %>%
ggplot(., mapping = aes(x = Metric, y = Mean)) +
facet_wrap(~Variable, scales = "free") +
geom_point() +
geom_hline(data = mean_skim %>% filter(Metric == "No imp"), aes(yintercept = Mean, colour = "green")) +
lims(y = c(0, NA)) +
scale_colour_discrete(name = "Non-imputed", labels = "mean", type = "green")
```
### Comparison of imputation algorithms - categorical variables
A comparison of proportions of the categorical variables also shows that the imputation seems reasonable and hasn't introduced any strange values like outliers. Below are two sample comparisons for HomePlanet and CryoSleep but I encourage you (if you follow along this code) to explore each categorical variable seperately.
```{r imputation-evaluation-home, fig.cap = "Imputed distribution for HomePlanet compared to no imputation"}
my_prop_plot <- function(df, v, t) {
g <- ggplot(data = df, mapping = aes(x = !!sym(t), fill = !!sym(v))) +
geom_bar() +
geom_text(aes(by = as.factor(!!sym(t))), stat = "prop", position = position_stack(vjust = 0.5)) +
labs(title = deparse(substitute(df)), x = t, y = "Number of passengers") +
scale_y_continuous(breaks = seq(0, 9000, 500)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
return(g)
}
h1 <- my_prop_plot(train3, "HomePlanet", "Transported")
h2 <- my_prop_plot(ranger2, "HomePlanet", "Transported")
h3 <- my_prop_plot(knn_impute_res, "HomePlanet", "Transported")
(h1 | h2 | h3) + plot_layout(guides = "collect", axis_titles = "collect")
```
```{r imputation-evaluation-cryo, fig.cap = "Imputed distribution for HomePlanet compared to no imputation"}
c1 <- my_prop_plot(train3, "CryoSleep", "Transported")
c2 <- my_prop_plot(ranger2, "CryoSleep", "Transported")
c3 <- my_prop_plot(knn_impute_res, "CryoSleep", "Transported")
(c1 | c2 | c3) + plot_layout(guides = "collect", axis_titles = "collect")
```
### Do the imputated values break any "rules"?
One last thing we must do before we accept the imputed values is to check whether the imputation has managed to adhere to the 'rules' we discovered earlier that we used for our manual replacement of NA-values. I use a modified version of the `useful_features`-function below to update the simple features.
```{r imputation-evaluation}
useful_features2 <- function(x) {
x2 <- x %>%
mutate(TotalSpent = RoomService + FoodCourt + ShoppingMall + Spa + VRDeck) %>%
group_by(PassengerGroup) %>%
add_count(PassengerGroup, name = "PassengerGroupSize") %>%
mutate(HomePlanetsPerGroup = n_distinct(HomePlanet, na.rm = TRUE),
DestinationsPerGroup = n_distinct(Destination, na.rm = TRUE),
CabinsPerGroup = n_distinct(Cabin, na.rm = TRUE),
TotalSpentPerGroup = sum(TotalSpent, na.rm = TRUE),
CryoSleepsPerGroup = n_distinct(CryoSleep, na.rm = TRUE),
VIPsPerGroup = n_distinct(VIP, na.rm = TRUE),
LastNamesPerGroup = n_distinct(LastName, na.rm = TRUE)) %>%
ungroup() %>%
mutate(across(.cols = c(HomePlanet, CryoSleep, Destination, VIP, Transported, Deck, Side, HomePlanetsPerGroup,
PassengerGroupSize, DestinationsPerGroup, CabinsPerGroup, CryoSleepsPerGroup, VIPsPerGroup,
LastNamesPerGroup, PassengerGroup, PassengerId),
.fns = as.factor)) %>%
mutate(across(.cols = c(CabinNumber, Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck),
.fns = as.integer))
return(x2)
}
check_knn <- useful_features2(knn_impute_res)
check_knn %>%
filter(HomePlanetsPerGroup != 1) %>%
select(PassengerId, HomePlanet, HomePlanetsPerGroup)
```
The imputation didn't seem to break the rule where every passenger group travels from the same planet, which is great.
```{r imputation-evaluation4}
check_knn %>%
filter(Age <= 12 & TotalSpent > 0) %>%
select(PassengerId, CryoSleep, Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, TotalSpent)
check_knn %>%
filter(CryoSleep == "True" & TotalSpent > 0) %>%
select(PassengerId, CryoSleep, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, TotalSpent)
```
The imputation didn't break the rule where passengers that are <=12 years old don't spend any credits but it did break the rule where passengers that are in cryo sleep can't spend credits.
```{r imputation-evaluation5}
check_knn %>%
filter(Deck %in% c("A", "B", "C") & HomePlanet != "Europa" | Deck == "G" & HomePlanet != "Earth") %>%
select(PassengerId, HomePlanet, Deck)
check_knn %>%
filter(VIP == "True" & HomePlanet == "Earth") %>%
select(PassengerId, HomePlanet, VIP)
```
Another rule that was broken regards the Deck variable where some passengers have been imputed as being housed on decks despite travelling from the 'wrong' homeplanet for that deck. We can deal with this in a general function that checks for all the 'rules' and adjusts.
```{r manual-imp-correction, fig.cap = "Missing values overview after imputation"}
fix_knn <- function(df) {
amenities_summary <- df %>%
filter(Age > 12 & TotalSpent > 0) %>%
summarise(mean_age = round(mean(Age), 0), .by = c(CryoSleep, Deck))
wrong_age <- df %>%
filter(Age <= 12 & TotalSpent > 0) %>%
left_join(., amenities_summary, by = c("CryoSleep", "Deck")) %>%
select(PassengerId, mean_age)
wrong_planet <- df %>%
filter(HomePlanetsPerGroup != 1) %>%
select(PassengerId, PassengerGroup, HomePlanet) %>%
group_by(PassengerGroup) %>%
mutate(HomePlanet_correct = first(HomePlanet)) %>%
ungroup() %>%
select(PassengerId, HomePlanet_correct)
wrong_deck <- df %>%
filter(Deck %in% c("A", "B", "C", "G")) %>%
mutate(Deck_correct = case_when(Deck %in% c("A", "B", "C") & HomePlanet == "Earth" ~ "G",
Deck %in% c("A", "B", "C") & HomePlanet == "Mars" ~ "F",
Deck == "G" & HomePlanet == "Mars" ~ "F",
Deck == "G" & HomePlanet == "Europa" ~ "C", .default = Deck)) %>%
select(PassengerId, Deck_correct)
res <- df %>%
left_join(., wrong_age, by = "PassengerId") %>%
left_join(., wrong_planet, by = "PassengerId") %>%
left_join(., wrong_deck, by = "PassengerId") %>%
mutate(Age = if_else(Age <= 12 & TotalSpent > 0, mean_age, Age),
RoomService = if_else(CryoSleep == "True", 0, RoomService),
FoodCourt = if_else(CryoSleep == "True", 0, FoodCourt),
ShoppingMall = if_else(CryoSleep == "True", 0, ShoppingMall),
Spa = if_else(CryoSleep == "True", 0, Spa),
VRDeck = if_else(CryoSleep == "True", 0, VRDeck),
HomePlanet = if_else(HomePlanetsPerGroup != 1, HomePlanet_correct, HomePlanet),
Deck_correct = coalesce(Deck_correct, Deck),
Deck = Deck_correct) %>%
select(-mean_age, -HomePlanet_correct, -Deck_correct)
return(res)
}
fixed_knn <- fix_knn(check_knn)
train4 <- useful_features2(fixed_knn)
plot_na_hclust(train4)
```
The only missing values are now in the variables that we won't use for modelling (we'll use features derived from them instead).
```{r remove-02, include=FALSE}
rm(list = setdiff(ls(), c("train2", "train3", "train4")))
```