Use disk.frame to read fwf with transformations #88

Closed
xiaodaigh opened this issue Jul 13, 2019 · 4 comments

@xiaodaigh (Collaborator)

Hello, I read about your package at UseR! 2019 and found it fantastic.
I would like to know whether the tidyr package works with disk.frame?
If not, do you plan to implement support for it in disk.frame?

I have a use case where your package would help a lot. I reported it to the vroom package repository (link).

Here is the case I reported:

My example is a peculiar case.

The Federal Revenue Service of Brazil publishes data as a single 10 GB file in fixed-width format (fwf), with several data.frames concatenated together.

So we have to read the file in parts (with read_lines_chunked()), process each chunk with a function executed via a callback (SideEffectChunkCallback), and then write the result to a CSV file or a DBMS.

We repeat this until the whole file has been read (or files, as there may be more than one).

I'll try to sketch an example:

library(readr)
library(tidyr)
library(tibble)
library(magrittr)

# Imagine a 10 GB data.frame in the example below
# Imagine reading this data.frame in pieces of 100,000 rows in each loop of `read_lines_chunked()`

dfs_fwf <- tibble::tibble(lines_fwf = c(
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "3zzxxkk",
  "3zzxxkk",
  "1aaabbbbccccc",
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "3zzxxkk",
  "2ddddddeeeeeeeffffffff",
  "1aaabbbbccccc"
))

# Imagine that this is part of the function executed via the callback (SideEffectChunkCallback)

dfs_fwf_index <- dfs_fwf %>%
                 tidyr::separate(lines_fwf,
                                 into = c("index", "col_dfs_fwf"),
                                 sep = c(1)
                                 )

df_fwf_1 <- dfs_fwf_index %>%
            dplyr::filter(index == 1) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_A", "col_B", "col_C"),
                            sep = c(3, 7, 12))

df_fwf_2 <- dfs_fwf_index %>%
            dplyr::filter(index == 2) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_D", "col_E", "col_F"),
                            sep = c(6, 13, 21))

df_fwf_3 <- dfs_fwf_index %>%
            dplyr::filter(index == 3) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_Z", "col_X", "col_K"),
                            sep = c(2, 4, 6))

readr::write_csv(df_fwf_1, "csv_df_fwf_1", append = TRUE)
readr::write_csv(df_fwf_2, "csv_df_fwf_2", append = TRUE)
readr::write_csv(df_fwf_3, "csv_df_fwf_3", append = TRUE)

Created on 2019-07-08 by the reprex package (v0.3.0)

I was able to develop code to handle this data thanks to the read_lines_chunked() function, which reads the file in parts, combined with a callback (SideEffectChunkCallback) to process each chunk and write the result to a CSV file or a DBMS.
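
A minimal sketch of that workflow, assuming a hypothetical file path and chunk size; the per-chunk work would be the tidyr::separate() transformations from the reprex above:

library(readr)

# hypothetical path standing in for one of the 10 GB fwf files
path_to_file <- "dados_receita_federal.fwf"

process_chunk <- function(x, pos) {
  # x is a character vector of raw fwf lines; pos is the position of the
  # first line of the chunk in the file. Apply the tidyr::separate()
  # transformations from the reprex here, then append each result, e.g.
  # readr::write_csv(df_fwf_1, "csv_df_fwf_1", append = TRUE)
}

readr::read_lines_chunked(
  path_to_file,
  callback = readr::SideEffectChunkCallback$new(process_chunk),
  chunk_size = 100000
)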

A function that has the same functionality in the vroom package would be very important.

Originally posted by @georgevbsantiago in #76 (comment)

@xiaodaigh (Collaborator, Author)

You can read the chunks as you have done and then use disk.frame::add_chunk to add each chunk to the disk.frame, chunk by chunk. Hope this helps.
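
A minimal sketch of that suggestion, assuming hypothetical file and folder names; the per-chunk parsing reuses the reprex above, here only for the index == 1 records:

library(disk.frame)
library(dplyr)
library(readr)

# hypothetical output folder holding the chunk files of the disk.frame
df <- disk.frame("df_fwf_1.df")

readr::read_lines_chunked("big_file.fwf", chunk_size = 100000,
  callback = function(x, pos) {
    chunk <- tibble::tibble(lines_fwf = x) %>%
      tidyr::separate(lines_fwf, into = c("index", "col_dfs_fwf"), sep = c(1)) %>%
      dplyr::filter(index == 1) %>%
      tidyr::separate(col_dfs_fwf, into = c("col_A", "col_B", "col_C"),
                      sep = c(3, 7, 12))
    disk.frame::add_chunk(df, chunk)  # appends this chunk to the disk.frame
  })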

@xiaodaigh (Collaborator, Author)

Do you have a link to the 10 GB files?

@georgevbsantiago commented Jul 13, 2019

Home: Link

Link to 1 of 20 files:
File_01
or
Mirror_File_01

Data Dictionary (on the main page): PDF File

The R function I developed to handle this data: Function

Thank you for your interest. This database is very important to Brazilian society, as we use it for academic studies, in the fight against corruption, and so on.

@xiaodaigh (Collaborator, Author)

There isn't enough easily digestible information for me to understand exactly what's going on, but I think you want to read a large file chunk by chunk using readr. You can do that, and to convert the results to a disk.frame you can simply use disk.frame::add_chunk, for example:

library(disk.frame)
library(readr)

df = disk.frame("some_where")
readr::read_lines_chunked(path_to_file, callback = function(chunk, pos) {
  # wrap the character vector of lines in a data.frame before adding it,
  # since add_chunk() expects a data.frame; this adds each chunk to the disk.frame
  add_chunk(df, data.frame(lines_fwf = chunk, stringsAsFactors = FALSE))
})

Please re-open if this doesn't answer your question.
