write(filename::AbstractString, data) #14546

Closed
wants to merge 4 commits into from

samoconnor
Contributor

Convenience function to write directly to a named file (like `readall(filename)`).

I'm doing a cleanup of my local stash of convenience functions and thought that this might be generally useful.
I seem to use this very often.

```julia
write("/tmp/foo", "hello")
readall("/tmp/foo")
"hello"
```

See also JuliaIO/GZip.jl#45, gzreadall(filename) and gzwrite(filename, data).

tkelman added the "needs tests" (Unit tests are required for this change) and "needs docs" (Documentation for this change is required) labels on Jan 3, 2016
@tkelman
Contributor

tkelman commented Jan 3, 2016

this also doesn't close the file handle when it's done. should use the do block form of open.

@samoconnor
Contributor Author

this also doesn't close the file handle when it's done. should use the do block form of open

??

open(io->write(io, data), filename, "w") === open(filename, "w") do io write(io,data) end

See iostream.jl:

function open(f::Function, args...)
    io = open(args...)
    try
        f(io)
    finally
        close(io)
    end
end
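
The equivalence described above can be checked with a minimal sketch. It assumes Julia 1.x; `path` is a throwaway temporary file used only for illustration.

```julia
path = tempname()

# Anonymous-function form.
n1 = open(io -> write(io, "hello"), path, "w")

# do-block form: the block is passed to `open` as its first argument.
n2 = open(path, "w") do io
    write(io, "hello")
end

# Both forms go through the try/finally wrapper, so the handle is closed
# even when the function throws.
handle = Ref{IO}()
try
    open(path, "w") do io
        handle[] = io
        error("simulated failure")
    end
catch
    # The simulated error is expected; ignore it.
end

closed_after_error = !isopen(handle[])
rm(path)
```

Both calls return the number of bytes written, and `closed_after_error` reports whether the stream was closed after the simulated failure.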

@@ -125,6 +125,9 @@ function write(s::IO, a::AbstractArray)
end
return nb
end
"""Write directly to a named file. Equivalent to `open(io->write(io,x), filename, "w")`."""
Member

I'd support passing several values as for write since it's easy. I'd also follow the docs for write and say:

    write(filename, x...)

Write the canonical binary representation of a value to file `filename`.
Returns the number of bytes written into the stream.
Equivalent to `open(io->write(io, x...), filename, "w")`.
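
A minimal sketch of the variadic form suggested here, assuming Julia 1.x. The name `writefile` is hypothetical and is used only to avoid redefining a Base method.

```julia
# Hypothetical helper: write several values to a file in one call.
# Returns the total number of bytes written, like `write(io, x...)`.
writefile(filename::AbstractString, x...) =
    open(io -> write(io, x...), filename, "w")

path = tempname()
nbytes = writefile(path, "hello", 0x0a, Int32(1))  # 5 + 1 + 4 bytes
size_on_disk = filesize(path)
rm(path)
```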

Contributor

agreed the formatting should have the signature, was doing it quickly on a phone

@tkelman
Contributor

tkelman commented Jan 3, 2016

sorry nevermind, was late. yeah the anonymous function is exactly equivalent to the do block form

@StefanKarpinski
Member

This seems to work ok for write but it doesn't pair well with print where this doesn't work. Do people really use write to write data to files this way often enough for this to matter?

@hayd
Member

hayd commented Jan 5, 2016

-1, IMO this is too magical. I think it'll be clearer to write it out each time:

write(open(fname, "w"), args...)

@StefanKarpinski
Member

I'm also inclined to feel that this is a bit too magical.

@samoconnor
Contributor Author

Same amount of magic as readall().
Why should there not be a simple function to write stuff to a file?
The whole open(), write(), close() thing originates from a time when almost all files were too big to fit in memory, and almost all file IO was done incrementally. There are still many files larger than RAM today for sure. But there are now a huge number of files for which it's best to read/write the whole file in one hit.

I'm tempted to suggest a magic ENV-like Dict for the filesystem where you could do:

rootdir = fsdir()
settings = rootdir["/etc/settings"]
...
rootdir["/etc/settings"] = settings

d = fsdir(pwd())
d["my file"] = "Hello"

@StefanKarpinski
Member

It's pretty common to want to read all of the contents of a file and return it. How common is it to want to open a file, write exactly one binary value to it and then close it again? Can you propose some use cases? The "hello" example isn't very compelling.

@JeffBezanson
Member

The aspect of this I'm most sympathetic to is that open(io->write(io, data), filename, "w") does feel a bit verbose. One alternative I'll just throw out there is tofile(filename,write)(data).
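
One way the curried alternative could look, as a sketch under assumptions: `tofile` is not a Base function, and the signature below is only a guess at the intent. Julia 1.x is assumed.

```julia
# Hypothetical: `tofile(filename, f)` returns a function that opens the file,
# applies `f(io, data...)`, and closes the file again.
tofile(filename::AbstractString, f::Function) =
    (data...) -> open(io -> f(io, data...), filename, "w")

path = tempname()
nbytes = tofile(path, write)("hello")  # binary write; returns bytes written
tofile(path, print)(1, " + ", 2)       # the same wrapper also works with print
text = read(path, String)
rm(path)
```

Because the output function is a parameter, this shape would pair with `print` as well as `write`.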

@StefanKarpinski
Member

Note that creating a Dict-like object that behaves the way you propose is pretty easy, @samoconnor.
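
A minimal sketch of such a Dict-like object, assuming Julia 1.x. `FSDir` and `fsdir` are illustrative names taken from the proposal above, not an existing API, and only a few of the Dict methods are shown.

```julia
# Hypothetical wrapper around a directory path.
struct FSDir
    root::String
end

fsdir(root::AbstractString = "/") = FSDir(root)

# d["name"] reads the whole file as a String.
Base.getindex(d::FSDir, name::AbstractString) =
    read(joinpath(d.root, name), String)

# d["name"] = data overwrites the file with `data`.
function Base.setindex!(d::FSDir, data, name::AbstractString)
    write(joinpath(d.root, name), data)
    return d
end

Base.haskey(d::FSDir, name::AbstractString) = isfile(joinpath(d.root, name))
Base.keys(d::FSDir) = readdir(d.root)

# Usage against a fresh temporary directory.
d = fsdir(mktempdir())
d["my file"] = "Hello"
contents = d["my file"]
```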

@samoconnor
Contributor Author

Hi @StefanKarpinski, yes it would be easy. I've been playing with similar interfaces for XML and ZIP... https://github.com/samoconnor/XMLDict.jl, https://github.com/samoconnor/ZIP.jl

@nalimilan
Member

I've been wondering what's your use case for that too while looking at ZIP.jl (which doesn't support creating an archive from files stored on disk). So far, I can't find any.

@samoconnor
Contributor Author

Hi @nalimilan,

doesn't support creating an archive from files stored on disk

To create an archive from files stored on disk with my ZipFile.jl fork you can do:

open_zip("foo.zip", "w") do z
    z["foo.csv"] = readall("foo.csv")
end

But, since you've mentioned it I just added this...

function create_zip(io::IO, files::Array)
    create_zip(io::IO, files, [open(readbytes, f) for f in files])
end

e.g.

create_zip("foo.zip", ["file1.csv", "file2.csv", "subdir/file3.csv"])

I think most times I've needed to do "files on disk" -> "zip on disk" I just shell out and call the "zip" program (my production code only ever has to run on OSX or Linux). But I can see why the above would be useful.

I've been wondering what's your use case for that too while looking at ZIP.jl So far, I can't find any.

It seems that the more code I write for cloud deployment, the less I touch disk files. Data tends to come from a queue, or S3, or a database API, or a HTTP connection...

A couple of recent examples are:

  • constructing an email message containing a zip archive of some processing output. The content of the .ZIP comes from an SQS queue and some S3 objects. The output is wrapped in a mime-multipart message, nothing ever goes to a disk file.
  • creating .ZIP archives of code to deploy to AWS Lambda. To deploy code to Lambda, you need to upload a .ZIP archive. My current AWSLambda.jl implementation does most of its zip wrangling in python, because at the time I found that ZipFile.jl didn't support updating a zip archive. This macro takes a julia function body, wraps it up with some serialisation/deserialisation code and turns it into a .ZIP file containing a .jl file which is then deployed to Lambda.

@StefanKarpinski
Member

I'm still not seeing what the use case for opening a file and writing a single binary value to it is...

@samoconnor
Contributor Author

@StefanKarpinski, I don't want to waste anyone's time here.

I'm doing a cleanup of my local stash of convenience functions and thought that this might be generally useful.

If it isn't generally useful I'll close the PR and move on.

I guess to me it is completely obvious why I want to write the content of a variable to a file, so I'm having trouble articulating the reason. I apologise if this goes on too long...

I've had a look through my code for places where I do open(f, "w") do io write(io, v)

I think there are two classes of use...

  1. One is in a production system that processes recorded data in stages. The architecture of the system is that each processing stage reads some files from a session directory, does some computation and writes some output files. There is surrounding infrastructure to join these stages together into workflows in the cloud. It seems quite common in this system to have a result in a variable and want to dump it to a file.

  2. The other case is places where Julia should be as good at gluing programs together as the shell. I've pasted some examples below.

Run gnu plot...

function gnuplot(cmd)
    open("$dir/$name.gnuplot", "w") do io
        write(io, cmd)
    end
    run(`gnuplot $dir/$name.gnuplot`)
end

(I have another version of this that pipes the command to gnuplot, but I often want to have the .gnuplot file left behind so I can tweak it by hand to adjust the plot without re-running the whole analysis.)

Search and replace in a file...

    f = key_path("info.txt")
    events = replace(readall(f), patient_id, anon_id, 1)
    open(f, "w") do io
        write(io, events)
    end

    vs

    write(f, replace(readall(f), patient_id, anon_id, 1))
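
For comparison, a sketch of the same one-liner in Julia 1.x spelling, where `read(f, String)` takes the place of `readall(f)` and `replace` takes a `Pair` plus a `count` keyword. The file contents and the two ID values are placeholders for illustration.

```julia
f = tempname()
write(f, "patient 1234 seen; patient 1234 discharged\n")

patient_id, anon_id = "1234", "anon"  # placeholder values

# The arguments are evaluated before `write` opens the file for writing,
# so the old contents are read in full before the file is truncated.
write(f, replace(read(f, String), patient_id => anon_id; count = 1))

result = read(f, String)
rm(f)
```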

If the xml is not identical after the reverse transform, run external diff tool...

    xmlb = dict_xml(xml_dict(xmla))
    if xmla != xmlb
        open("/tmp/a", "w") do io
            write(io, xmla)
        end
        open("/tmp/b", "w") do io
            write(io, xmlb)
        end
        run(`opendiff /tmp/a /tmp/b`)
    end
    @test xmla == xmlb

Use command line unzip to produce a filename => data Dict from zip_data...

function test_unzip(zip_data)
    z = tempname()
    try
        open(z, "w") do io
            write(io, zip_data)
        end
        [chomp(f) => readall(`unzip -qc $z $f`) for f in readlines(`unzip -Z1 $z`)]
    finally
        rm(z)
    end
end

Write files from archive to disk...

function unzip(archive, outputpath::AbstractString=pwd())
    for (filename, data) in open_zip(archive)
        filename = joinpath(outputpath, filename)
        mkpath(dirname(filename))
        open(filename, "w") do io
            write(io, data)
        end
    end
end

@StefanKarpinski
Member

Thanks for providing examples, that makes this much more compelling. Maybe a writeall function?

@mbauman
Member

mbauman commented Jan 7, 2016

I'm only tangentially following this, but it sounds a lot like FileIO.jl's load/save functions.

@samoconnor
Contributor Author

Thinking about naming... I've tried to do a quick review of current read* and write* naming conventions.
writeall is write[how much]. There is no precedent for that. For write*, there is only write[format].

| Function | Filename | Blocking |
|---|---|---|
| write | no | yes |
| *write[format]* | | |
| writecsv | yes | yes |
| writedlm | yes | yes |
| writemime | no | yes |

Looking at the read* functions below, it seems like it might make sense to:

  • rename readall to readstring.
  • rename readbytes to read.
  • add filename-as-1st-arg support everywhere

Function Type Filename Partial Non Blocking
read_[how much]_
read(io,T) T yes
readavailable Array{UInt8} yes yes
readuntil String yes
readline String yes
readall String yes
read_[as type]_
readbytes Array{UInt8}
readlines Array{String}
readcsv Array{T} yes
readdlm Array{T} yes
readdir Array{String} yes
readlink String yes
read_[and then]_
readchomp(x) = chomp(readall(x))

Related: BioJulia/Libz.jl#12 -- should probably be readgz and writegz (not gzread and gzwrite).

@StefanKarpinski
Member

Nice, I like the survey. not sure why the "blocking" column exists since it's always "yes".

@samoconnor
Contributor Author

why the "blocking" column since it's always "yes" ?

Because it doesn't seem to be well documented.

write(stream, x)
Write the canonical binary representation of a value to the given stream.
Returns the number of bytes written into the stream.

When I read this manual entry, the number of bytes as a return value made me suspicious that there might be some write() methods that do a partial write and return rather than blocking.

Also your suggestion of writeall made me think that maybe write == writesome.

I now take it that to your knowledge, write always blocks until the whole of the input has been written to the destination?

@StefanKarpinski
Member

Everything in Julia always blocks the task it's called from until it's done. Under the hood it's all non-blocking, but that's exposed to the programmer via task-level concurrency.
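
A small sketch of what task-level concurrency looks like in practice, assuming Julia 1.x. `run_overlapping` is a hypothetical helper, and `sleep` stands in for any blocking I/O call.

```julia
# Each task blocks on its own call, but the tasks overlap with one another.
function run_overlapping(n, delay)
    results = zeros(Int, n)
    @sync for i in 1:n
        @async begin
            sleep(delay)       # blocks only this task
            results[i] = i^2
        end
    end                        # @sync waits for every task started above
    return results
end

t0 = time()
results = run_overlapping(3, 0.2)
elapsed = time() - t0  # expected to be nearer `delay` than `n * delay`
```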

@samoconnor
Contributor Author

OK good.
I've been playing with tasks and @async over here: https://github.com/samoconnor/AsyncMap.jl
I think all-blocking APIs and tasks is absolutely the way to go.

Should readavailable be deprecated to encourage whoever is using it to use @async instead?

@StefanKarpinski
Member

Probably yes, but there was some annoying reason we needed it. But definitely off-topic here.

@nalimilan
Member

I also find the names readall and readbytes confusing, and I wanted to do this kind of survey. Could you open an issue about possible renames?

@samoconnor
Contributor Author

Could you open an issue about possible renames?

@nalimilan, done. #14608

@tkelman
Contributor

tkelman commented Jan 12, 2016

superseded by #14660?

@samoconnor
Contributor Author

superseded by #14660?

yes
