Faster way to loop through a data-frame? #490
Comments
Is there a reason you couldn't vectorize it using the Sent from my Google Pixel using FastHub |
Hello @philstraforelli, Thank you for the suggestion! ifelse() is good for creating a vector of yes/no values based on evaluating a test expression. I want to modify an entry in each row if it passes my criterion. How would I do that with ifelse()? The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid time-stamps, which means every row of my data-frame has to be cross-compared against each of the 100s of elements in the other list. |
This question is hard to answer w/o seeing a little example that actually runs and has the same structure as your problem, but on a really tiny scale. Agree with @philstraforelli that there is likely a vectorized solution but it's hard to say in the abstract. |
@jennybc and @philstraforelli --> corrected example code below!
strain actual_times valid_or_not
If a strain's actual_time is within 1 second of an element in valid_times, then I would like to mark it as "valid" within the valid_or_not column. To do this, I made a helper function:
valid_or_not_fxn<- function(actual_time){
Then I passed this helper function to a for loop and looped through every single row in the data-frame:
Since my REAL data frame (not just in this example) has hundreds of rows and my valid_times also has hundreds of rows, it is wayyyy too slow. If either of you could help me out with this, I would be eternally grateful. Merci Bien |
Would the following work?
It works with the toy example:
|
Hello @philstraforelli. If you add one more or one fewer entry to valid_times (i.e., the length is no longer 5), it doesn't work. (this is more realistic to my actual problem) In my real problem, there are way more actual_times than valid_times (i.e., length(actual_times) > length(valid_times)) |
I understand your problem better, sorry about earlier. How about this?
No idea how that impacts speed of computation though... Sent from my Google Pixel using FastHub |
Here's one way to approach this:
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)
is_ok <- function(x, targets, tol) any(abs(x - targets) < tol)
vapply(actual_times, is_ok, logical(1), targets = valid_times, tol = 1)
#> [1] FALSE FALSE FALSE  TRUE  TRUE
The fuzzyjoin package might also be relevant: https://github.com/dgrtwo/fuzzyjoin#readme |
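For readers who want to avoid the per-element function call, the same check can be sketched in base R with `outer()`, which builds the full matrix of pairwise differences in one step. This is only an illustration using the toy vectors from the comment above; the names `close_enough` and `time_ok` are made up here, and the approach trades memory (one matrix cell per pair) for simplicity.

```r
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)

# One row per actual time, one column per valid time:
# TRUE where the two are less than 1 second apart.
close_enough <- abs(outer(actual_times, valid_times, "-")) < 1

# An actual time is OK if it is close to at least one valid time.
time_ok <- rowSums(close_enough) > 0
time_ok
#> [1] FALSE FALSE FALSE  TRUE  TRUE
```

With a few hundred entries on each side the matrix stays small, so this should be comfortable at the sizes described in the thread.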
And a tidyverse way:
library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)
map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1))
#> [1] FALSE FALSE FALSE TRUE TRUE |
@jennybc and @philstraforelli Thank you for the help! A lot of syntax/built-in fxns in there that I'm not familiar with, but which I'm definitely going to be checking out now one by one. So to finish the solution (provided above), I would then cbind the logical vector to my df as follows:
#> strain actual_times valid_or_not Thanks |
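The code for this step did not survive in the thread, so here is a minimal sketch of what attaching the result to the data frame could look like in base R. The `strain` labels and the "valid"/"not valid" wording are assumptions for illustration; only the column names and the toy times come from the discussion. It assigns the column directly rather than using `cbind()`, which is a swapped-in but equivalent way to add one column.

```r
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)

# Hypothetical data frame; the strain labels are invented for this example.
df <- data.frame(
  strain = c("s1", "s2", "s3", "s4", "s5"),
  actual_times = c(200, 210, 215, 220.5, 260),
  stringsAsFactors = FALSE
)

# Same helper as in the earlier comment.
is_ok <- function(x, targets, tol) any(abs(x - targets) < tol)
time_ok <- vapply(df$actual_times, is_ok, logical(1),
                  targets = valid_times, tol = 1)

# Turn the logical vector into the label column in one vectorized step.
df$valid_or_not <- ifelse(time_ok, "valid", "not valid")
```

This is also where `ifelse()` from the first suggestion fits: it converts the whole logical vector at once instead of row by row.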
Also, if anyone reading this thread is interested, I asked this same question to someone else and they provided another solution that works (though the syntax is even more mysterious to me...I'm going to start with @jennybc's solution) |
Glad this helped! I would do this inside
library(tidyverse)
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
df <- tibble(
actual_times = c(200, 210, 215, 220.5, 260),
some_other_var = "hi"
)
df %>%
mutate(time_ok = map_lgl(actual_times, ~ any(abs(.x - valid_times) < 1)))
#> # A tibble: 5 x 3
#> actual_times some_other_var time_ok
#> <dbl> <chr> <lgl>
#> 1 200.0 hi FALSE
#> 2 210.0 hi FALSE
#> 3 215.0 hi FALSE
#> 4 220.5 hi TRUE
#> 5 260.0 hi TRUE |
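If the two vectors ever grow well beyond a few hundred entries, one possible alternative (not proposed in the thread) is to sort the valid times and look up only the nearest neighbours with base R's `findInterval()`, so each actual time is compared against two candidates instead of all of them. The sketch below is an assumption-laden illustration on the same toy data; `padded`, `i`, and `nearest_gap` are names invented here, and it assumes all actual times are finite.

```r
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974)
actual_times <- c(200, 210, 215, 220.5, 260)

# Pad the sorted valid times with -Inf/Inf so every actual time
# has a neighbour on both sides.
padded <- c(-Inf, sort(valid_times), Inf)

# findInterval() gives i such that padded[i] <= x < padded[i + 1].
i <- findInterval(actual_times, padded)

# Distance to the closer of the two neighbouring valid times.
nearest_gap <- pmin(actual_times - padded[i], padded[i + 1] - actual_times)
time_ok <- nearest_gap < 1
time_ok
#> [1] FALSE FALSE FALSE  TRUE  TRUE
```

At the sizes mentioned in the question the simpler `map_lgl()` version is likely fast enough; this variant only matters if the all-pairs comparison becomes the bottleneck.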
Hello,
I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough to another list of valid time-stamps. For this I have a helper function.
What I've tried to do is loop through the entire data-frame using a for loop with this helper function. However....it's REALLY slow. Is there a better way to do this??
The issue is that I have 100s of rows in my data-frame and 100s of elements in my list of valid-time stamps.
Which means every row of my data-frame involves going through 100s of elements in another list, i.e. roughly 100 x 100 = 10,000 comparisons in total.
Thank you!
df.zip