make pdp and cp work with NA in data #120

Closed
hbaniecki opened this issue Jul 11, 2020 · 6 comments

Comments

@hbaniecki
Member

crossref ModelOriented/modelStudio#71

@hbaniecki hbaniecki added the invalid ❕ This doesn't seem right label Jul 11, 2020
@pbiecek pbiecek self-assigned this Jul 22, 2020
pbiecek added a commit that referenced this issue Jul 28, 2020
@pbiecek
Member

pbiecek commented Jul 28, 2020

I could not find a reproducible example.
@hbaniecki, would you check whether this is solved?

I've checked this with

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- partial_dependence(explain_titanic_glm,
                              N = 25, variables = c("age", "fare","sibsp"),
                              variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

@hbaniecki
Member Author

I guess that after the fix it works

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- partial_dependence(explain_titanic_glm,
                              N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

@pbiecek
Member

pbiecek commented Jul 28, 2020

thanks

@pbiecek pbiecek closed this as completed Jul 28, 2020
@p-schaefer

p-schaefer commented Mar 9, 2023

Hi there,

I'm wondering if there is some way of making conditional and accumulated dependence plots work with NAs? For example:

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- conditional_dependence(explain_titanic_glm,
                              N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

Thanks

@hbaniecki
Member Author

Hi, what is your goal? PD/ALE rely on estimating expected predictions with respect to the data distribution.

Did you consider removing observations without age (with NAs) from data to estimate the explanation of age?
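
To make the point about expected predictions concrete, here is a minimal base-R sketch. It does not use or reproduce the internals of ingredients; the toy data, the `lm` model, and the helper `pd_by_hand` are all hypothetical stand-ins. It only illustrates the textbook definition of partial dependence (average prediction with one variable fixed at a grid value) and what happens to that average when another variable contains NAs.

```r
# Hypothetical toy data standing in for titanic_imputed; not taken from the thread.
set.seed(120)
toy <- data.frame(age = runif(200, 1, 80), fare = runif(200, 0, 100))
toy$survived <- as.numeric(toy$fare / 100 - toy$age / 160 + rnorm(200, sd = 0.2) > 0.2)
model <- lm(survived ~ age + fare, data = toy)

# Blank out age for 50 rows.
toy_na <- toy
toy_na$age[sample(nrow(toy_na), 50)] <- NA

# Partial dependence by hand: set `variable` to each grid value for every row
# and average the predictions. This is an illustrative helper, not package code.
pd_by_hand <- function(model, data, variable, grid, na.rm = FALSE) {
  sapply(grid, function(value) {
    data[[variable]] <- value
    mean(predict(model, newdata = data), na.rm = na.rm)
  })
}

grid <- seq(0, 100, by = 25)

# Rows with missing age give NA predictions, so every average is NA.
pd_with_na <- pd_by_hand(model, toy_na, "fare", grid)

# Keeping only rows where age is observed gives finite averages.
pd_complete <- pd_by_hand(model, toy_na[!is.na(toy_na$age), ], "fare", grid)

print(pd_with_na)
print(pd_complete)
```

Passing `na.rm = TRUE` to the helper would give the same numbers as filtering the rows first, since the rows that produce NA predictions are exactly the rows with missing age. Whether a given model returns NA or raises an error on incomplete rows depends on its `predict` method.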

@p-schaefer

Sorry, this was a bad example. I was piggybacking on the example from this thread. In doing more testing with reasonable numbers of NAs, I see that conditional_dependence() does work with NAs:

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))

toNA <- sample(1:1000, 10)

titanic_imputed[toNA,] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- conditional_dependence(explain_titanic_glm,
                                  N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

Unfortunately, in my significantly larger and more complicated models, I'm running into issues related to missing values where the aggregated profiles aren't being calculated. When I impute the missing values, there are no issues. But I can't seem to recreate it with a simpler dataset/model. Do you know of any situations where aggregating profiles fails because of NAs? Unlike in my previous examples, there are no instances where an entire column is NA.
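
One way to narrow down a case like this is to check where the NAs sit and which rows the model cannot score, before any profiles are aggregated. The following is a rough base-R sketch under assumptions: `na_report` is a hypothetical helper, and the toy data and `lm` model are placeholders for the real dataset and model.

```r
# Hypothetical helper; the name and the shape of the result are illustrative only.
na_report <- function(data, predict_fun) {
  predictions <- predict_fun(data)
  list(
    na_per_column   = colSums(is.na(data)),          # NA count for each variable
    incomplete_rows = which(!complete.cases(data)),  # rows with at least one NA
    na_predictions  = which(is.na(predictions))      # rows the model scores as NA
  )
}

# Toy stand-in for a real dataset and model.
set.seed(120)
toy <- data.frame(age = runif(100, 1, 80), fare = runif(100, 0, 100))
toy$survived <- toy$fare / 100 - toy$age / 160 + rnorm(100, sd = 0.2)
model <- lm(survived ~ age + fare, data = toy)
toy$fare[c(3, 17)] <- NA

# With a DALEX explainer one would presumably pass explainer$data and
# function(d) explainer$predict_function(explainer$model, d); treat that
# as an assumption to verify against the DALEX documentation.
report <- na_report(toy[, c("age", "fare")],
                    function(d) predict(model, newdata = d))

print(report$na_per_column)
print(report$na_predictions)
```

If `na_predictions` is non-empty, averaging those predictions without removing NAs would yield NA, which may be one reason aggregated profiles come back empty; this is a guess about the symptom, not a confirmed diagnosis.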
