Faster way to loop through a data-frame? #490

aramcb · 2017-06-11T20:25:07Z

Hello,

I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough to another list of valid time-stamps. For this I have a helper function.

What I've tried to do is loop through the entire data-frame using a for loop with this helper function. However....it's REALLY slow. Is there a better way to do this??

The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid-time stamps.

Which means every row of my data-frame involves going through 100s of elements in another list. 100^100 is too big.

Thank you!

df.zip

ghost · 2017-06-11T20:33:44Z

Is there a reason you couldn't vectorize it using the ifelse() function?

Sent from my Google Pixel using FastHub

aramcb · 2017-06-11T21:00:28Z

Hello @philstraforelli, Thank you for the suggestion!

Ifelse() is good for creating a list of yes/no based on an evaluation of a test_expression. I want to modify an entry in each row if it passes my criterion. How would I do that with ifelse() ?

The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid-time stamps.

Which means every row of my data-frame involves going through and cross-comparing the individual row's time-stamp with each of 100s of elements in another list.

jennybc · 2017-06-11T21:32:40Z

This question is hard to answer w/o seeing a little example that actually runs and has the same structure as your problem, but on a really tiny scale. Agree with @philstraforelli that there is likely a vectorized solution but it's hard to say in the abstract.

aramcb · 2017-06-11T22:47:34Z

@jennybc and @philstraforelli --> corrected example code below!
Hi @jennybc and @philstraforelli , thank you. I've tried to describe my problem below -- hopefully this will clarify my issue!

valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974) (really though, this list is hundreds of elements long)

actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 5))

df <- data.frame(strain, actual_times, valid_or_not)
So now I have the following data-frame:

strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL

If an strain's actual_time is within 1 second of an element in valid_times, then I would like to mark it as "valid" within the valid_or_not column.

To do this, I made a helper function:

valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}

Then I passed this helper function to a for loop and looped through every single row in the data-frame:

df$valid_or_not <- as.character(df$valid_or_not)

for (i in 1:nrow(df)) df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])

Since my REAL data frame (not just in this example) has hundreds of rows and my valid_times also has hundreds of rows, it is wayyyy too slow. If either of you could help me out with this, I would be eternally grateful.

Merci Bien
Aram

ghost · 2017-06-11T23:24:56Z

Would the following work?

 mutate(df, within_1sec = ifelse(abs(valid_times - actual_times) <= 1, "valid", "invalid"))

It works with the toy example:

   strain actual_times within_1sec
  1  green        200.0     invalid
  2  green        210.0     invalid
  3  green        215.0     invalid
  4  green        220.5     invalid
  5  green        260.0       valid

aramcb · 2017-06-11T23:45:38Z

Hello @philstraforelli.

If you add one more or one fewer entry to valid_times (i.e., the length is no longer 5), it doesn't work. (this is more realistic to my actual problem)
Error in mutate_impl(.data, dots) : wrong result size (6), expected 5 or 1 In addition: Warning message: In c(219.934, 229.996, 239.975, 249.935, 259.974, 270) - c(200, : longer object length is not a multiple of shorter object length

In my real problem, there are way more actual_times than valid_times (i.e., length(actual_times) > length(valid_times))

ghost · 2017-06-12T00:57:43Z

I understand your problem better, sorry about earlier. How about this?

 df <- mutate(df, row_num = row_number()) #Need to assign an id value to each row for nesting

nested_df <- df %>% 
group_by(row_num) %>%
nest() %>% 
mutate(diff_times = map(actual_times, function(x) {abs(x - valid_times)})) %>%
unnest(diff_times) #We now have a data frame that went through all the elements of valid_times for each of the rows in the main df and calculated the absolute differences

result_df <- nested_df %>% 
left_join(df, by = "row_num") %>% 
filter(diff_times <= 1) %>% 
mutate(within_1sec = "valid") %>% 
select(row_num, within_1sec) #This df establishes which rows had a difference of 1 sec or less

(final_df <- left_join(df, result_df, by = "row_num")) #Joining this back to the main df

No idea how that impacts speed of computation though...

Sent from my Google Pixel using FastHub

jennybc · 2017-06-12T09:36:44Z

Here's one way to approach this:

valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)

is_ok <- function(x, targets, tol) any(abs(x - targets) < tol)

vapply(actual_times, is_ok, logical(1), targets = valid_times, tol = 1)
#> [1] FALSE FALSE FALSE  TRUE  TRUE

The fuzzy join package might also be relevant: https://github.com/dgrtwo/fuzzyjoin#readme

jennybc · 2017-06-12T09:40:40Z

And a tidyverse way:

library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)
map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1))
#> [1] FALSE FALSE FALSE  TRUE  TRUE

aramcb · 2017-06-13T03:57:50Z

jennybc and @philstraforelli Thank you for the help! A lot of syntax/built-in fxns in there that I'm not familiar with -- but which I'm definitely going to be checking out now one by one.

So to finish the solution (provided above), I would then cbind the logical vector to my df as follows:
valid_or_not <- data.frame(map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1)))

final_df <- cbind(df, valid_or_not)

#> strain actual_times valid_or_not
1 green 200.0 FALSE
2 green 210.0 FALSE
3 green 215.0 FALSE
4 green 220.5 TRUE
5 green 260.0 TRUE
6 green 344.0 FALSE

Thanks

aramcb · 2017-06-13T04:12:30Z

Also, if anyone reading this thread is interested, I asked this same question to someone else and they provided another solution that works (though the syntax is even more mysterious to me...I'm going to start with @jennybc's solution)
Reduce(|, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))

jennybc · 2017-06-13T05:21:09Z

Glad this helped!

I would do this inside mutate().

library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)

df <- tibble(
  actual_times = c(200, 210, 215, 220.5, 260),
  some_other_var = "hi"
)
df %>% 
  mutate(time_ok = map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1)))
#> # A tibble: 5 x 3
#>   actual_times some_other_var time_ok
#>          <dbl>          <chr>   <lgl>
#> 1        200.0             hi   FALSE
#> 2        210.0             hi   FALSE
#> 3        215.0             hi   FALSE
#> 4        220.5             hi    TRUE
#> 5        260.0             hi    TRUE

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster way to loop through a data-frame? #490

Faster way to loop through a data-frame? #490

aramcb commented Jun 11, 2017 •

edited

Loading

ghost commented Jun 11, 2017

aramcb commented Jun 11, 2017

jennybc commented Jun 11, 2017

aramcb commented Jun 11, 2017 •

edited

Loading

ghost commented Jun 11, 2017

aramcb commented Jun 11, 2017 •

edited

Loading

ghost commented Jun 12, 2017 •

edited by ghost

Loading

jennybc commented Jun 12, 2017

jennybc commented Jun 12, 2017

aramcb commented Jun 13, 2017 •

edited

Loading

aramcb commented Jun 13, 2017

jennybc commented Jun 13, 2017

Faster way to loop through a data-frame? #490

Faster way to loop through a data-frame? #490

Comments

aramcb commented Jun 11, 2017 • edited Loading

ghost commented Jun 11, 2017

aramcb commented Jun 11, 2017

jennybc commented Jun 11, 2017

aramcb commented Jun 11, 2017 • edited Loading

ghost commented Jun 11, 2017

aramcb commented Jun 11, 2017 • edited Loading

ghost commented Jun 12, 2017 • edited by ghost Loading

jennybc commented Jun 12, 2017

jennybc commented Jun 12, 2017

aramcb commented Jun 13, 2017 • edited Loading

aramcb commented Jun 13, 2017

jennybc commented Jun 13, 2017

aramcb commented Jun 11, 2017 •

edited

Loading

aramcb commented Jun 11, 2017 •

edited

Loading

aramcb commented Jun 11, 2017 •

edited

Loading

ghost commented Jun 12, 2017 •

edited by ghost

Loading

aramcb commented Jun 13, 2017 •

edited

Loading