Use disk.frame to read fwf with transformations #88

Closed
xiaodaigh opened this issue Jul 13, 2019 · 4 comments

@xiaodaigh (Collaborator)

Hello, I read about your package at UseR! 2019 and found it fantastic.
I would like to know whether the tidyr package works with disk.frame?
If not, do you plan to implement support for it in disk.frame?

I have a use case where your package would help a lot. I reported it to the vroom package repository (link).

Here is the case I reported:

My example is a peculiar case.

The Federal Revenue Service of Brazil publishes data as a single 10 GB file in fixed-width format (fwf), with several data.frames concatenated together.

So we have to read the file in parts (with read_lines_chunked()), process each chunk with a function executed via a callback (SideEffectChunkCallback), and then write the result to a CSV file or a DBMS.

We repeat this until the whole file has been read (or files, as there may be more than one).

I'll try to sketch an example:

library(readr)
library(tidyr)
library(tibble)
library(magrittr)

# Imagine a 10 GB data.frame in the example below
# Imagine reading this data.frame in pieces of 100,000 rows in each loop of `read_lines_chunked()`

dfs_fwf <- tibble::tibble(lines_fwf = c(
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "3zzxxkk",
  "3zzxxkk",
  "1aaabbbbccccc",
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "1aaabbbbccccc",
  "2ddddddeeeeeeeffffffff",
  "3zzxxkk",
  "2ddddddeeeeeeeffffffff",
  "1aaabbbbccccc"
))

# Imagine that this is part of the function executed via the callback (SideEffectChunkCallback)

dfs_fwf_index <- dfs_fwf %>%
                 tidyr::separate(lines_fwf,
                                 into = c("index", "col_dfs_fwf"),
                                 sep = c(1)
                                 )

df_fwf_1 <- dfs_fwf_index %>%
            dplyr::filter(index == 1) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_A", "col_B", "col_C"),
                            sep = c(3, 7, 12))

df_fwf_2 <- dfs_fwf_index %>%
            dplyr::filter(index == 2) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_D", "col_E", "col_F"),
                            sep = c(6, 13, 21))

df_fwf_3 <- dfs_fwf_index %>%
            dplyr::filter(index == 3) %>%
            tidyr::separate(col_dfs_fwf,
                            into = c("col_Z", "col_X", "col_K"),
                            sep = c(2, 4, 6))

readr::write_csv(df_fwf_1, "csv_df_fwf_1", append = TRUE)
readr::write_csv(df_fwf_2, "csv_df_fwf_2", append = TRUE)
readr::write_csv(df_fwf_3, "csv_df_fwf_3", append = TRUE)

Created on 2019-07-08 by the reprex package (v0.3.0)

I was able to develop code to handle this data thanks to the read_lines_chunked() function, which reads the file in parts, combined with a callback (SideEffectChunkCallback) to process each chunk and write the result to a CSV file or a DBMS.
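
A minimal sketch of that workflow, assuming a hypothetical file path and chunk size; the per-chunk work would be the tidyr::separate() transformations from the reprex above:

library(readr)

# hypothetical path standing in for one of the 10 GB fwf files
path_to_file <- "dados_receita_federal.fwf"

process_chunk <- function(x, pos) {
  # x is a character vector of raw fwf lines; pos is the position of the
  # first line of the chunk in the file. Apply the tidyr::separate()
  # transformations from the reprex here, then append each result, e.g.
  # readr::write_csv(df_fwf_1, "csv_df_fwf_1", append = TRUE)
}

readr::read_lines_chunked(
  path_to_file,
  callback = readr::SideEffectChunkCallback$new(process_chunk),
  chunk_size = 100000
)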

A function that has the same functionality in the vroom package would be very important.

Originally posted by @georgevbsantiago in #76 (comment)

@xiaodaigh (Collaborator, Author)

You can read the chunks as you have done and then use disk.frame::add_chunk to add each chunk to the disk.frame, chunk by chunk. Hope this helps.
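
A minimal sketch of that suggestion, assuming hypothetical file and folder names; the per-chunk parsing reuses the reprex above, here only for the index == 1 records:

library(disk.frame)
library(dplyr)
library(readr)

# hypothetical output folder holding the chunk files of the disk.frame
df <- disk.frame("df_fwf_1.df")

readr::read_lines_chunked("big_file.fwf", chunk_size = 100000,
  callback = function(x, pos) {
    chunk <- tibble::tibble(lines_fwf = x) %>%
      tidyr::separate(lines_fwf, into = c("index", "col_dfs_fwf"), sep = c(1)) %>%
      dplyr::filter(index == 1) %>%
      tidyr::separate(col_dfs_fwf, into = c("col_A", "col_B", "col_C"),
                      sep = c(3, 7, 12))
    disk.frame::add_chunk(df, chunk)  # appends this chunk to the disk.frame
  })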

@xiaodaigh (Collaborator, Author)

Do you have a link to the 10 GB files?

@georgevbsantiago commented Jul 13, 2019

Home: Link

Link to 1 of 20 files:
File_01
or
Mirror_File_01

Data Dictionary (on the main page): PDF File

The R function I developed to handle this data: Function

Thank you for your interest. This database is very important to Brazilian society, as we use it for academic studies, in the fight against corruption, and so on.

@xiaodaigh (Collaborator, Author)

There isn't enough easily digestible information for me to understand exactly what's going on, but I think you want to read a large file chunk by chunk using readr. You can do that, and to convert the results to a disk.frame you can simply use disk.frame::add_chunk, for example:

library(disk.frame)
library(readr)

df = disk.frame("some_where")
readr::read_lines_chunked(path_to_file, callback = function(chunk, pos) {
  # wrap the character vector of lines in a data.frame before adding it,
  # since add_chunk() expects a data.frame; this adds each chunk to the disk.frame
  add_chunk(df, data.frame(lines_fwf = chunk, stringsAsFactors = FALSE))
})

Please re-open if this doesn't answer your question.
