Skip to content
This repository has been archived by the owner on Feb 13, 2020. It is now read-only.

Faster way to loop through a data-frame? #490

Open
aramcb opened this issue Jun 11, 2017 · 12 comments
Open

Faster way to loop through a data-frame? #490

aramcb opened this issue Jun 11, 2017 · 12 comments

Comments

@aramcb
Copy link

aramcb commented Jun 11, 2017

Hello,

I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough to another list of valid time-stamps. For this I have a helper function.

What I've tried to do is loop through the entire data-frame using a for loop with this helper function. However....it's REALLY slow. Is there a better way to do this??

The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid-time stamps.

Which means every row of my data-frame involves going through 100s of elements in another list. 100^100 is too big.

Thank you!

df.zip

@ghost
Copy link

ghost commented Jun 11, 2017

Is there a reason you couldn't vectorize it using the ifelse() function?

Sent from my Google Pixel using FastHub

@aramcb
Copy link
Author

aramcb commented Jun 11, 2017

Hello @philstraforelli, Thank you for the suggestion!

Ifelse() is good for creating a list of yes/no based on an evaluation of a test_expression. I want to modify an entry in each row if it passes my criterion. How would I do that with ifelse() ?

The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid-time stamps.

Which means every row of my data-frame involves going through and cross-comparing the individual row's time-stamp with each of 100s of elements in another list.

@jennybc
Copy link
Member

jennybc commented Jun 11, 2017

This question is hard to answer w/o seeing a little example that actually runs and has the same structure as your problem, but on a really tiny scale. Agree with @philstraforelli that there is likely a vectorized solution but it's hard to say in the abstract.

@aramcb
Copy link
Author

aramcb commented Jun 11, 2017

@jennybc and @philstraforelli --> corrected example code below!
Hi @jennybc and @philstraforelli , thank you. I've tried to describe my problem below -- hopefully this will clarify my issue!

valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974) (really though, this list is hundreds of elements long)

actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 5))

df <- data.frame(strain, actual_times, valid_or_not)
So now I have the following data-frame:

strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL

If an strain's actual_time is within 1 second of an element in valid_times, then I would like to mark it as "valid" within the valid_or_not column.

To do this, I made a helper function:

valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}

Then I passed this helper function to a for loop and looped through every single row in the data-frame:

df$valid_or_not <- as.character(df$valid_or_not)

for (i in 1:nrow(df)) df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])

Since my REAL data frame (not just in this example) has hundreds of rows and my valid_times also has hundreds of rows, it is wayyyy too slow. If either of you could help me out with this, I would be eternally grateful.

Merci Bien
Aram

@ghost
Copy link

ghost commented Jun 11, 2017

Would the following work?

 mutate(df, within_1sec = ifelse(abs(valid_times - actual_times) <= 1, "valid", "invalid"))

It works with the toy example:

   strain actual_times within_1sec
  1  green        200.0     invalid
  2  green        210.0     invalid
  3  green        215.0     invalid
  4  green        220.5     invalid
  5  green        260.0       valid

@aramcb
Copy link
Author

aramcb commented Jun 11, 2017

Hello @philstraforelli.

If you add one more or one fewer entry to valid_times (i.e., the length is no longer 5), it doesn't work. (this is more realistic to my actual problem)
Error in mutate_impl(.data, dots) : wrong result size (6), expected 5 or 1 In addition: Warning message: In c(219.934, 229.996, 239.975, 249.935, 259.974, 270) - c(200, : longer object length is not a multiple of shorter object length

In my real problem, there are way more actual_times than valid_times (i.e., length(actual_times) > length(valid_times))

@ghost
Copy link

ghost commented Jun 12, 2017

I understand your problem better, sorry about earlier. How about this?

 df <- mutate(df, row_num = row_number()) #Need to assign an id value to each row for nesting

nested_df <- df %>% 
group_by(row_num) %>%
nest() %>% 
mutate(diff_times = map(actual_times, function(x) {abs(x - valid_times)})) %>%
unnest(diff_times) #We now have a data frame that went through all the elements of valid_times for each of the rows in the main df and calculated the absolute differences

result_df <- nested_df %>% 
left_join(df, by = "row_num") %>% 
filter(diff_times <= 1) %>% 
mutate(within_1sec = "valid") %>% 
select(row_num, within_1sec) #This df establishes which rows had a difference of 1 sec or less

(final_df <- left_join(df, result_df, by = "row_num")) #Joining this back to the main df

No idea how that impacts speed of computation though...

Sent from my Google Pixel using FastHub

@jennybc
Copy link
Member

jennybc commented Jun 12, 2017

Here's one way to approach this:

valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)

is_ok <- function(x, targets, tol) any(abs(x - targets) < tol)

vapply(actual_times, is_ok, logical(1), targets = valid_times, tol = 1)
#> [1] FALSE FALSE FALSE  TRUE  TRUE

The fuzzy join package might also be relevant: https://github.com/dgrtwo/fuzzyjoin#readme

@jennybc
Copy link
Member

jennybc commented Jun 12, 2017

And a tidyverse way:

library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)
map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1))
#> [1] FALSE FALSE FALSE  TRUE  TRUE

@aramcb
Copy link
Author

aramcb commented Jun 13, 2017

jennybc and @philstraforelli Thank you for the help! A lot of syntax/built-in fxns in there that I'm not familiar with -- but which I'm definitely going to be checking out now one by one.

So to finish the solution (provided above), I would then cbind the logical vector to my df as follows:
valid_or_not <- data.frame(map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1)))

final_df <- cbind(df, valid_or_not)

#> strain actual_times valid_or_not
1 green 200.0 FALSE
2 green 210.0 FALSE
3 green 215.0 FALSE
4 green 220.5 TRUE
5 green 260.0 TRUE
6 green 344.0 FALSE

Thanks

@aramcb
Copy link
Author

aramcb commented Jun 13, 2017

Also, if anyone reading this thread is interested, I asked this same question to someone else and they provided another solution that works (though the syntax is even more mysterious to me...I'm going to start with @jennybc's solution)
Reduce(|, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))

@jennybc
Copy link
Member

jennybc commented Jun 13, 2017

Glad this helped!

I would do this inside mutate().

library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)

df <- tibble(
  actual_times = c(200, 210, 215, 220.5, 260),
  some_other_var = "hi"
)
df %>% 
  mutate(time_ok = map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1)))
#> # A tibble: 5 x 3
#>   actual_times some_other_var time_ok
#>          <dbl>          <chr>   <lgl>
#> 1        200.0             hi   FALSE
#> 2        210.0             hi   FALSE
#> 3        215.0             hi   FALSE
#> 4        220.5             hi    TRUE
#> 5        260.0             hi    TRUE

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants