Documentation request #82

Closed
amellnik opened this Issue Oct 27, 2015 · 8 comments


amellnik commented Oct 27, 2015

A simple example of writing a small data set to a SQLite db and reading it back out would be appreciated. I would add it myself, but I can't figure it out after the DataStreams rework.

The non-streaming changes seem straightforward enough:

using SQLite, DataFrames, DataStreams
db = SQLite.DB("test.sqlite")
df = DataFrame(num = [1,2], str=["sdsd", "sdg"])
SQLite.drop!(db, "df", ifexists=true) 

but I can't figure out how to write even a simple DataFrame (or an AbstractArray plus arrays of column names and types) to the db. It looks like create and append have been migrated to SQLite.Sink(schema::Data.Schema,db::DB,tablename;temp::Bool=false,ifnotexists::Bool=true), and I'm stuck on creating a Schema to pass along. The source of DataStreams.jl suggests that I should be able to do this with something like

sch = Data.Schema([string(n) for n in names(df)],[eltype(df[n]) for n in names(df)], nrow(df))

but it looks like I'm not particularly close.

quinnj commented Oct 27, 2015

Thanks for taking the time to open an issue! I appreciate the feedback and use-case. Here's how I would approach it:
First, we'll want to define a mostly non-copying conversion from DataFrame to Data.Table:

function DataStreams.Data.Table(df::DataFrames.DataFrame)
    rows, cols = size(df)
    schema = DataStreams.Data.Schema(DataType[eltype(i) for i in df.columns], rows)
    data = NullableArrays.NullableVector[NullableArrays.NullableArray(x.data, convert(Vector{Bool}, x.na)) for x in df.columns]
    return DataStreams.Data.Table(schema, data, 0)
end

I'll work on getting this into the next tagged version of DataStreams.jl.

With that conversion, we can then do:

using DataFrames, SQLite
db = SQLite.DB()
df = DataFrame(num = [1,2], str=["sdsd", "sdg"])
dt = Data.Table(df)
sink = SQLite.Sink(dt, db, "df")
Data.stream!(dt, sink)
amellnik commented Oct 27, 2015

Thanks! I'm still trying to sort out a few things, and have two questions so far:

When I want to pull the data back out, I can make a source with

source = SQLite.Source(sink) # or 
source = SQLite.Source(sink, "select * from TableName")

and I know I use

Data.stream!(source, SomeDataTable) 

to get the results, but how would I construct a DataTable like this (and then convert to an AbstractArray or similar)?

Also, I noticed that query still works and now returns a NullableArray, but it's now in old_ui.jl. This is being deprecated, right?

amellnik commented Oct 29, 2015

Here's a gist which shows conversion from Data.Table to DataFrame while preserving types, but it's pretty hacky. In particular, for a column of type T, it needs zero(T) to be defined to put in placeholder values where items are null, since I'm using the constructor DataArray(vector of items, vector of isnull). The operative part is

DataFrame([DataArray(eltype(X.values)[x.isnull ? zero(eltype(X.values)) : x.value for x in X],
                     X.isnull) for X in dt2.data],
          Symbol[symbol(x) for x in DataStreams.Data.header(dt2)])

Does anyone have a better way to do this?

quinnj commented Oct 29, 2015

Aside from the one that's already in DataStreams?

DataFrames.DataFrame(dt::DataStreams.Data.Table) =
    DataFrame(convert(Vector{Any}, dt.data),
              Symbol[symbol(x) for x in DataStreams.Data.header(dt)])
amellnik commented Oct 29, 2015

This makes the column types things like NullableArrays.NullableArray{Int64,1}, and the element types Nullable{Int64}.

This isn't a problem in itself, but it's a serious issue for me since (IMHO) NullableArrays is in a very rough state at the moment. No code that currently works with DataFrames or DataArrays (including things like Gadfly, Mustache, Escher, and my entire code base) works with NullableArrays as is.

I understand the reasons for wanting to move to NullableArrays, but this will be incredibly involved, and I at least will need to be able to convert NullableArrays to DataArrays until the rest of the ecosystem catches up.

amellnik commented Oct 29, 2015

But perhaps I just need to pin to d86fcf6 and wait for everything to settle down, rather than trying to use the master while the interface changes.

quinnj commented Oct 29, 2015

Well, there are a few different considerations here:

  • Packages that use DataFrames would hopefully be "storage-agnostic", meaning it wouldn't matter if the columns were DataArray or NullableArray, it would "just work"; do you have actual use-cases where things are breaking because the columns are NullableArray?
  • Also, depending on other uses, NullableArray is going to be vastly superior from a type stability perspective, which ends up being my main use case. Get some data, do some calculations/transforms on individual columns, which means type stability is key and DataArrays screw you over. DataArrays are and have been dead to me for a while because of this.
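The type-stability point in the second bullet can be illustrated with a rough sketch (hypothetical function names, written against the DataArrays/NullableArrays APIs of the period, not code from this thread):

```julia
# With a DataArray, iteration can yield either an element of type T or
# the NA sentinel (NAtype), so the loop variable has a Union type:
function total_da(col)              # col::DataVector{Int64}
    s = 0
    for x in col                    # x is Int64 *or* NAtype
        isna(x) || (s += x)
    end
    return s
end

# With a NullableArray, iteration always yields a Nullable{Int64},
# a single concrete type, so the same loop is type stable:
function total_na(col)              # col::NullableVector{Int64}
    s = 0
    for x in col                    # x is always Nullable{Int64}
        isnull(x) || (s += get(x))
    end
    return s
end
```

In the first version the compiler has to box or branch on the element type at every iteration; in the second it can emit straight-line code over Nullable{Int64}.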

In the end, if you really need DataArrays, however, it's still totally trivial:

DataFrames.DataFrame(dt::DataStreams.Data.Table) =
    DataFrame(convert(Vector{Any}, DataArray[DataArray(x.values, x.isnull) for x in dt.data]),
              Symbol[symbol(x) for x in DataStreams.Data.header(dt)])
amellnik commented Oct 29, 2015

These packages are storage-agnostic, but they are not ready to handle elements being Nullable{Type} rather than Type:

using DataFrames, NullableArrays, Gadfly
dt = NullableArray{Any}[NullableArray(1:3, [false for i in 1:3]),NullableArray(1:3, [false for i in 1:3])]
df = DataFrame(convert(Vector{Any},dt),[:x,:y])
plot(df, x="x", y="y", Geom.point)

results in

MethodError: `isfinite` has no method matching isfinite(::Nullable{Any})

And while this runs

using Mustache
tpl = mt"""{{#:df}}
x:{{:x}} y:{{:y}}
{{/:df}}
"""
println(Mustache.render(tpl, df=df))

it produces

x:Nullable(1) y:Nullable(1)
x:Nullable(2) y:Nullable(2)
x:Nullable(3) y:Nullable(3)

due to the way show is defined for Nullables (I think). I could start rewriting all my code to move away from DataArrays (and things like isna(df[:col])) except when I'm handing things off to other modules, and then change things again once those other modules start playing nicely with Nullables, but sticking with DataArrays and some performance hit seems like a much better option for me at the moment.

Side thought: Perhaps there's some easy way to make a non-copying view into a DataFrame-on-NullableArray that has elements of type T rather than Nullable{T} for the case when there are no nulls. (It seems like most packages would work fine with no-null DataFrames with Arrays for columns, but I'm not certain).

Edit: Also, your DataFrame from Data.Table constructor is way better than what I was doing.
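For what it's worth, that side thought might be sketched like this, leaning on the values/isnull fields already used in this thread (assumed NullableArrays internals of the period; an untested sketch, not a real API):

```julia
# Hypothetical helper: if a NullableVector holds no nulls, hand back its
# underlying plain Vector without copying; otherwise fall back to a
# DataArray that can still represent the missing entries.
unwrap(col) = any(col.isnull) ? DataArray(col.values, col.isnull) : col.values

# Possible usage against a streamed Data.Table:
# df_plain = DataFrame(Any[unwrap(c) for c in dt.data],
#                      Symbol[symbol(x) for x in DataStreams.Data.header(dt)])
```

Columns with no nulls would then be plain Arrays with eltype T, which most downstream packages already handle.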

@amellnik amellnik closed this Nov 23, 2015
