Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for .rds files with readRDS function #33

Merged
merged 20 commits into from
Mar 14, 2018
Merged

Conversation

jsams
Copy link
Contributor

@jsams jsams commented Nov 21, 2017

I think this would more-or-less close #22 though it doesn't integrate with the load functionality

Note that the tests are ever so slightly different. This is on purpose. rds files do not name the objects they contain. I converted all the rda test files into named lists, and saved them as rds files. So, while the list itself does not contain a name, it contains objects which do have names, which is what you are accessing after conversion into the julia type. With convert=false, a bare R type is returned.

Copy link
Collaborator

@alyst alyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for your contribution! I think we would need to make one more iteration and change readRDS into load (+ necessary changes to FileIO) to ensure API uniformity.

src/RData.jl Outdated
@@ -87,4 +88,22 @@ end

load(s::Stream{format"RData"}; kwoptions...) = load(s, kwoptions)

function readRDS(f::AbstractString; kwoptions...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried to add RDS format to FileIO.jl? It should be straightforward and almost identical to what is there for RData (with the exception of "RDSX" header line). So readRDS() would change into load(s::File{format"RDataSingle"}/Stream{format"RDataSingle"}, ...) pair of methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've gotten things in the state you wished: factored out decompress, reduced the test set for the rds to just a data.frame but in binary compressed and decompressed and ascii formats, and duplicated the load interface and removed readRDS.

This all seems to work for me locally with local changes to FileIO to register the format. I couldn't figure out how to get the format to register in the RData package itself to make testing easier before the FileIO pull.

On that note...

If you know anything about the file formats, can you take a look at the detect_rdata_single function I'm proposing in my FileIO branch before requesting the pull? I don't know anything about the formats, and it seems that when the RDSX header line is removed, not much is left before the file format itself begins.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think we're almost there!

I couldn't figure out how to get the format to register in the RData package itself to make testing easier before the FileIO pull.

AFAIR, you can do FileIO.add_format() to add the format to the FileIO registry externally, but I'm not sure it really works since it happens after the package initialization.
Anyway, adding RDS support directly to FileIO is the right thing to do.
Your FileIO branch looks ok to me. Do you know whether ".rds" is the only extension or other alternatives (".RDS") also exist? If so, I would list them to add_format() call too.
For the FileIO PR to be merged you also need to add the test (see test/query.jl for how it was done for RData). As long as detect_rdata_single() works for the simple test file, that function is fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also mention this PR in NEWS.md?

src/RData.jl Outdated
function readRDS(f::AbstractString; kwoptions...)
io = open(f, "r")
try
gzipped = read(io, UInt8) == 0x1F && read(io, UInt8) == 0x8B # check GZip magic number
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplication of the code from load(format{"RData"}). I think it's a good idea to extract it into a separate function, especially since we are going to extend the supported compression formats. E.g.

# check whether the stream is compressed (Gzip) and return the decompressed stream
function decompress(io::IO)
    gzipped = read(io, UInt8) == 0x1F && read(io, UInt8) == 0x8B # check GZip magic number
    seekstart(io)
    if gzipped
        return GzipDecompressorStream(io) # if compressed, transcode gzipped stream
    else
        return io
    end
end

It should be called then within try block: io = decompress(io)


# for converting rda files to rds to test with readRDS
rdafiles = list.files("data/", pattern="*.rda", full.names=T)
for (rdafile in rdafiles) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to have the equivalents of RDA test files as RDS: the serialization format is the same and the code that reads these objects is also the same. So these tests would just test the same code with the same data.
I think it should be enough to test some moderately complex object (say, data.frame), but check that both binary (compressed and uncompressed) and ASCII versions of RDS format are supported.

test/RDS.jl Outdated
chr = ["ab", "c"],
factor = pool(["ab", "c"]),
cplx = Complex128[1.1+0.5im, 1.0im])
rdf = sexp2julia(load("$testdir/data/types.rds",convert=false))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to test both convert=false and convert=true (the default) modes. convert=true should return DataFrame directly

@jsams
Copy link
Contributor Author

jsams commented Nov 22, 2017

Ok, I've made a PR in FileIO, updated the news (though it's not clear how you see the versions moving forward, please modify to your taste), and had the tests also test the default convert=true condition.

test/RDS.jl Outdated
factor = pool(["ab", "c"]),
cplx = Complex128[1.1+0.5im, 1.0im])
rdf = sexp2julia(load("$testdir/data/types.rds",convert=false))
@test eltypes(rdf) == eltypes(df)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed that: I think before calling eltypes() or comparing with df it makes sense to do @test rdf isa DataFrame (for the other load() as well). If load goes completely wrong, it would help to diagnose it properly, otherwise @test eltype... will fail with MethodError.

@alyst
Copy link
Collaborator

alyst commented Nov 22, 2017

@jsams Thanks! Let's see what happens with FileIO PR. I will merge once that one is merged. Maybe by that time there would be some progress with POSIXct, then we can tag 0.3.0 with both these features.

@alyst
Copy link
Collaborator

alyst commented Dec 4, 2017

Your PR to FileIO is merged and is available in v0.6.0.
So after updating REQUIRES/NEWS.md and rebasing we should be ready to go.

@jsams
Copy link
Contributor Author

jsams commented Mar 13, 2018

I've brought this in line with the latest master. runtests throws some errors in the master, but my additions don't add any errors and all of my tests pass. Any reason to not pull?

@coveralls
Copy link

coveralls commented Mar 13, 2018

Coverage Status

Coverage increased (+0.9%) to 67.708% when pulling a5bb2b5 on jsams:rds into 4b3b3e8 on JuliaData:master.

@alyst alyst merged commit 04bf272 into JuliaData:master Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Also handle RDS files
3 participants