read in an entire folder worth of files #398

scienceisfiction opened this Issue Nov 3, 2016 · 22 comments


scienceisfiction commented Nov 3, 2016

This may be getting a bit ahead of the schedule (perhaps this will be covered with Make?), but I'm wondering how to read in an entire folder's worth of .csv files and quickly combine them into one giant df, without having to read in each file individually and without writing a for loop. Is there a quick way to do this with some of the tools we've learned so far that I'm just not remembering right now? Is this something I can do with map(), and if so, are there some examples I can look at to figure out how to do this?

I'm also having trouble with my directories. My script is floating freely in the root of my directory as it should be, and if I use list.files("folder_name")[1] the first file title pops up no problem, but if I use read_csv(list.files("folder_name")[1]) I get an error message that the file doesn't exist in the current working directory. I can work around this by copying and pasting the first file into my root OR by writing out "folder_name/file_name.csv", but then I won't be able to automate reading in the entire folder, right? Is there another way I should be thinking about this?


jennybc commented Nov 3, 2016

Use full.names = TRUE inside list.files().
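
That way list.files() returns paths relative to your working directory instead of bare file names, so read_csv() can actually find the files. A quick illustration, reusing the hypothetical folder_name/file_name.csv from the post above:

list.files("folder_name")[1]
#> [1] "file_name.csv"                # bare name: read_csv() looks in the root and fails
list.files("folder_name", full.names = TRUE)[1]
#> [1] "folder_name/file_name.csv"    # a path read_csv() can use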


jennybc commented Nov 3, 2016

And then you can use readr::read_csv() inside purrr::map_df(). It's magical.
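
Putting the two together, a minimal sketch (with "folder_name" standing in for your actual directory):

library(tidyverse)  # attaches readr and purrr

# read each csv into a data frame; map_df() row-binds them into one
giant_df <- map_df(list.files("folder_name", full.names = TRUE), read_csv)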


NicoleOngUBC commented Nov 3, 2016

Hi Jenny,

Would you mind going through these steps during lecture today (or really soon)? This is probably the most important transitional step I need to get started with analyzing my data in R, but I'm not quite following how to read multiple files and folders (with systematic names) into one large dataframe.

p.s. I have multiple folders and multiple csv files in each folder.


scienceisfiction commented Nov 3, 2016

AMAZING. It's beautiful. I'd managed to put read_csv() in map() but was missing that last argument, full.names, to make the whole thing work.


scienceisfiction commented Nov 3, 2016

@supersonicole My test code for this looked like this:

cell_01_02 <- map_df(list.files("SfN wrangle data", full.names = TRUE), read_csv, col_names = FALSE)

samhinshaw commented Nov 3, 2016

Keep in mind you can also use regular expressions and subdirectories within list.files()!
For example, if you're in a directory with an input folder and want a specific filetype:

list.files(path = "input", pattern = "[0-9]+\\.txt$")

Separately, I'd recommend checking out purrr::safely(). So, for example, if you are reading in files with read_csv():

ReadMyDataFiles <- function(pattern, path = ".") {
  # wrapping read_csv() in safely() means one unreadable file won't abort the whole map
  map(list.files(path = path, pattern = pattern, full.names = TRUE), safely(read_csv))
}
cell_01_02 <- ReadMyDataFiles("\\.csv$", path = "SfN wrangle data")

For each file, this will give you a list of two elements:

  1. $result, the parsed data frame if the read succeeded, and
  2. $error, the error object if it failed (whichever doesn't apply is NULL).

While this is more complicated, it is supremely useful when mapping functions that take a long time for one reason or another -- computational power, API calls, etc. This way, if your function fails on one item in your supplied vector, the entire mapping won't fail. You can then examine your results and reshape them into the desired form via other functions supplied by purrr, such as transpose() and simplify_all().
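
For instance, a sketch of that cleanup step, continuing from the cell_01_02 object above:

out <- transpose(cell_01_02)      # one $result list and one $error list, one slot per file
successes <- compact(out$result)  # drop the NULLs left by files that failed
failures  <- compact(out$error)   # inspect these to see what went wrong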

Apologies if Jenny mentioned this in class, I didn't see it in the notes.


jennybc commented Nov 3, 2016

Thanks @samhinshaw! Yes safely() is extremely helpful when you don't want one error to cause complete failure.


NicoleOngUBC commented Nov 3, 2016

I could create a dataframe for one folder of csv files using the map_df() and list.files() code discussed above (also see below). Thanks! However, my csv files themselves do not contain id info (a pain!), so the dataframe, as is, does not tell me which subjects (and groups) the data comes from.

library(tidyverse)

setwd("/Users/supersonicole/Documents/Dropbox/STAT545/ong_nicole")

group1 <- map_df(list.files("group1", full.names = TRUE), 
                 read_csv, col_names = FALSE,
                 col_types = "ciddcc")

group1 <- set_names(group1, nm = c("duration", "peak_count", "freq_count", "bpm",
                                   "time_end", "comments"))

The group# is the name of the folder containing the csv files, while the subject# is the first 3 characters of the csv filename.

After a bit of hacking and trial-and-error, I've devised something that seems to work with the plyr package, but it requires that I change my working directory and group number in the script to extract multiple files from multiple folders (one folder at a time) and insert the subject and group variables manually. (Eventually the aim is to combine all the group data into one giant dataframe for further analysis.)

Any thoughts on whether this could be simplified? I really liked how simple the first two lines of code above were...

p.s. I've uploaded my csv folders and files here (see folders "group1" and "group2").
And my script is here.

library(tidyverse)
library(plyr)  # note: loading plyr after dplyr masks some dplyr verbs

setwd("/Users/supersonicole/Documents/Dropbox/STAT545/ong_nicole")

paths <- dir(path = "./group1", pattern = "\\.csv$")  # change group number HERE
names(paths) <- substr(basename(paths), 1, 3)

setwd("./group1") # change group number HERE

all <- ldply(paths, read.csv, header = FALSE, .id = "subj")

all <- set_names(all, nm = c("subj", "duration", "peak_count", "freq_count", "bpm",
                             "time_end", "comments"))

all$duration <- as.character(all$duration)
all$time_end <- as.character(all$time_end)
all$comments <- as.character(all$comments)

all <- all %>%
        mutate(group = "positive") %>%  # change group name HERE
        select(subj, group, everything())

all$group <- as.factor(all$group)

all <- as_tibble(all)

gp1_data <- all  # change group number HERE

samhinshaw commented Nov 4, 2016

Great questions!! What immediately jumps to mind is what you've already figured out using plyr: the .id argument. What's great about map_df() is that it can accept a .id argument as well.

There's just one problem... you need to have names for your IDs to be meaningful! This is an interesting and challenging problem, so I've concocted a solution. Unfortunately, it's not quite as elegant as the previous solution.

library(tidyverse)
library(rprojroot)
library(stringr)
rootDir <- rprojroot::find_rstudio_root_file()
grp1dir <- file.path(rootDir, "group1")
grp2dir <- file.path(rootDir, "group2")
setwd(rootDir)

# List the files in group1 dir
group1Files <- list.files(grp1dir, full.names = TRUE)

# name this list, gsub optional, can also use str_sub() for this
names(group1Files) <- list.files(grp1dir) %>% gsub(pattern = "\\.csv$", replacement = "")

# Map_df, but with meaningful IDs
group1 <- map_df(group1Files, read_csv, col_names = FALSE,
                 col_types = "ciddcc", .id = "subj")

# Name columns
group1 <- set_names(group1, nm = c("subj", "duration", "peak_count",
                                  "freq_count", "bpm",
                                  "time_end", "comments"))

# Check that our list of files was the same as the names we assigned
stopifnot(str_extract(group1Files, "s[0-9]{2}_PPG") == names(group1Files))

## Repeat for group 2
group2Files <- list.files(grp2dir, full.names = TRUE)
names(group2Files) <- list.files(grp2dir) %>% gsub(pattern = "\\.csv$", replacement = "")
group2 <- map_df(group2Files, read_csv, col_names = FALSE,
                 col_types = "ciddcc", .id = "subj")
group2 <- set_names(group2, nm = c("subj", "duration", "peak_count",
                                   "freq_count", "bpm",
                                   "time_end", "comments"))

stopifnot(str_extract(group2Files, "s[0-9]{2}_PPG") == names(group2Files))

# Now combine these data.frames maintaining their group identifications. We're 
# not performing a "join" because we're adding additional rows, but we can use 
# dplyr::bind_rows()

Master_list <- bind_rows("group1" = group1, "group2" = group2, .id = "group")

I also have to say here that I strongly recommend including column headers in your CSV files. It is potentially dangerous to have them missing!
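
If adding headers to the files themselves isn't an option, one alternative worth knowing about: read_csv()'s col_names argument also accepts a character vector of names for headerless files, which folds the set_names() step into the read. A sketch using the names from above:

group1 <- map_df(group1Files, read_csv,
                 col_names = c("duration", "peak_count", "freq_count",
                               "bpm", "time_end", "comments"),
                 col_types = "ciddcc", .id = "subj")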

@scienceisfiction

This comment has been minimized.

Copy link
Author

scienceisfiction commented Nov 4, 2016

This is such a great thread, because adding IDs for each file is something I need to do as well -- thanks for all the tips!



NicoleOngUBC commented Nov 4, 2016

@samhinshaw
I couldn't understand the code initially, but the MAGIC unravelled as I worked through it line by line. Amazing stuff. THANK YOU!!

Also, out of curiosity, how do you get colour into your code in these issue posts? It makes reading so much easier...


samhinshaw commented Nov 4, 2016

Happy to help :)
GitHub can color-code your code if you specify what language it's written in! I just found out about it recently myself. It's basically: open a fenced code block with three backticks plus the language name, put your code on the lines after that, and close with three backticks.
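
For example:

```r
group1 <- map_df(group1Files, read_csv)
```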

jennybc commented Nov 4, 2016

You can also use my (GitHub-only) package reprex to prepare little bits of code for posting here or on Stack Overflow. That's what I do!
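
Roughly, the workflow is (a sketch; the install line assumes the package lives at jennybc/reprex on GitHub):

# devtools::install_github("jennybc/reprex")
library(reprex)
# copy a small snippet of R code to the clipboard, then:
reprex()  # runs it and puts GitHub-ready markdown (code plus output) back on the clipboard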


andreagaede commented Nov 4, 2016

@samhinshaw Wait! I need your help too!

I'm trying to do something similar... I think. I have a directory of csv files (each file is data from a cell). There are no column names or cell ID info in the csv files. Essentially, I want to extract an ID from the file name and add it as a column to the data from that csv file. My next step is going to be to turn all the csv files into a single data frame, and I want to be able to group by cell ID.

I hope that is enough, but not too much info.
Anyway, the csv files have names like: ZF1602-trk1-1044-cell2.csv
I've uploaded my files here
I was initially just trying to do this for a single csv file, so I used this function:

extract_num_from_string <- function(s) {
  # split at "cell": the left piece holds the bird/track ids, the right the cell number
  s.split <- strsplit(s, "cell")

  # break the left piece on every non-digit, keep the first three runs of digits
  s.id <- as.numeric(unlist(strsplit(s.split[[1]][1], "[^[:digit:]]")))
  s.id <- s.id[!is.na(s.id)][1:3]

  # the first run of digits in the right piece is the cell number
  s.cell <- as.numeric(unlist(strsplit(s.split[[1]][2], "[^[:digit:]]")))
  s.cell <- s.cell[!is.na(s.cell)][1]
  return(c(s.id, s.cell))
}
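
For the example filename above, that works out to:

extract_num_from_string("ZF1602-trk1-1044-cell2.csv")
#> [1] 1602    1 1044    2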

This returns 4 values that I later merged using unite(), but I'm okay with that.

Is there a straightforward way to add columns containing these values to the data contained within the corresponding csv file? I was thinking that I would then use write_csv() to make a new file that now has this ID value, so that I could make a giant data frame with all cells using the following line of code:

all_cells <- map_df(list.files("2016 - CB speed working data", full.names = TRUE), read_csv, col_names = FALSE)

I'm trying to work through your previous post, so if you think that does what I'm asking about, let me know... I'm just not even sure if what I'm trying to do is conceptually correct.
Any help would be much appreciated!!!

Many thanks!!
Dre

PS I was going to upload the script I've been working on, but it has so many extraneous computations that I thought it might not be useful... I'm happy to provide it if everything above reads like gobbledy-gook though.


andreagaede commented Nov 4, 2016

@samhinshaw

I think I have something that works, and it took me significantly less time than I thought it would thanks to your code...

cell_directory <- "2016 - CB speed working data"

cell_files <- list.files(cell_directory, full.names = TRUE)

names(cell_files) <- list.files(cell_directory) %>%
  gsub(pattern = "\\.csv$", replacement = "")

all_cells <- map_df(cell_files, read_csv, col_names = FALSE,
                .id = "bird_id")

I would still be curious to hear any ideas for a smarter approach. Totally feel like I'm winging it right now!

Cheers,
Dre


jennybc commented Nov 4, 2016

@andreagaede That's starting to look about as good as it gets. The only slight improvement: you could define cell_files and set the names all at once with purrr::set_names(), which came up in class today. But it doesn't get much more concise than what you have.
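
Something like this, perhaps (the magrittr dot lets list.files() run just once, and basename() keeps the directory out of the names):

cell_files <- list.files(cell_directory, full.names = TRUE) %>%
  set_names(gsub("\\.csv$", "", basename(.)))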


andreagaede commented Nov 4, 2016

@jennybc Oh right! Thanks!


samhinshaw commented Nov 4, 2016

@andreagaede Looks great! Jenny makes a very important point, because it is risky to have two separate operations that depend on order smashed together without validation. Specifically, from the documentation for list.files(), we see that

The files are sorted in alphabetical order, on the full path if full.names = TRUE.

This can be dangerous if, for example, the sorting for list.files() was somehow different from list.files(full.names = TRUE). Or perhaps I went back and accidentally added code in between those two lines. Then my code below would produce incorrect results.

group1Files <- list.files(grp1dir, full.names = TRUE)
## Accidentally added code here
names(group1Files) <- list.files(grp1dir) %>% gsub(pattern = "\\.csv$", replacement = "")

So always be careful! It's safer to keep our list.files() calls as close to identical as possible, and to run them in the same expression. Here I'm using str_extract() instead of gsub() so we can use full.names = TRUE:

group1Files <- set_names(list.files(grp1dir, full.names = TRUE), 
                         str_extract(list.files(grp1dir, full.names = TRUE),
                                     "s[0-9]{2}_PPG"))

andreagaede commented Nov 4, 2016

@samhinshaw Thanks! I have what I hope is a quick question about that:
I have not played with regular expressions much...
list.files() makes a character vector with values that look like this:
"2016 - CB speed working data/ZF1607-trk1-1189-cell1.csv"

I want to extract: ZF1607-trk1-1189-cell1
and make that the name that I use as the ID

So my code looks like this:

cell_files1 <- set_names(list.files(cell_directory, full.names = TRUE), 
                         str_extract(list.files(cell_directory, full.names = TRUE),
                                     ""))

I just don't know what to put between the "" for the pattern. Can you offer any hints?

Thank you!


samhinshaw commented Nov 4, 2016

😁
When I have to match something like that, I'll usually open regex101 and a regex cheat sheet like this one side-by-side and test as I go. I'll start you off with this:
regex


andreagaede commented Nov 4, 2016

@samhinshaw Thank you!! I will start trial-and-error-ing!


jennybc commented Nov 4, 2016

@andreagaede I think you are also about to learn some lessons in file naming 😃. The more disciplined you are there (no spaces, no punctuation, zero funny stuff, deliberate use of delimiters), the easier these regexes are to write.

Helpful functions for manipulating file paths: dirname(), basename(), tools::file_path_sans_ext(x, compression = FALSE) and the other functions documented with it. You might want to go this route instead of regex.
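
For instance, with the path from earlier in this thread:

tools::file_path_sans_ext(basename("2016 - CB speed working data/ZF1607-trk1-1189-cell1.csv"))
#> [1] "ZF1607-trk1-1189-cell1"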
