
Blaze support #127

Open
scls19fr opened this issue Jul 31, 2015 · 3 comments

Comments

@scls19fr
Contributor

Hello,

@wavexx did excellent work providing Blaze support to gtabview;
see TabViewer/gtabview#10

It's now possible to connect to any database supported by SQLAlchemy and display a table (even a very long one) using a table URI: http://blaze.pydata.org/en/latest/uri.html
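To illustrate the idea of resolving a database table into a displayable DataFrame (this sketch uses the stdlib sqlite3 module and pandas directly rather than Blaze, and the table name "measurements" is made up):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(5)])

# Pull the table into a DataFrame, much as a viewer would after
# resolving a table URI
df = pd.read_sql_query("SELECT * FROM measurements", conn)
print(df.shape)  # (5, 2)
```

With Blaze the same table would instead be addressed by a URI string, so the viewer never needs to know which backend is behind it.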

Some other improvements have been exposed in #116
(especially about Pandas DataFrames with a MultiIndex)

It would be nice to have such Blaze support on the tabview side as well, because it would make it possible to display the content of very long tables over an SSH connection (for example).

Kind regards

@firecat53
Collaborator

I'll see what I can do. It should be combined with working on splitting out the common code from tabview and gtabview.

I was a little disappointed in the file:// behavior for large files. One very large file (~800M with a couple million rows, I believe) never opened at all in gtabview after working on it for about 5 minutes. A smaller file (380M) opened, but there was an almost 10-second lag each time it loaded a new section of the file. It also didn't work at all for a Latin-1 encoded file. I tried it with mysql tables, and of course most of my tables have a DECIMAL(10,2) data type... which isn't yet supported by odo (blaze/odo#206). Just a little frustrating that it wasn't handling the data I was throwing at it very well!

Scott

@scls19fr
Contributor Author

scls19fr commented Aug 6, 2015

About the DECIMAL issue: it's more a datashape issue than an odo issue, see blaze/datashape#118

I'll just add here some code to create a big random CSV file:

import pandas as pd
import numpy as np

# 4 million rows x 10 columns of random floats
(rows, cols) = (4000000, 10)
a = np.random.random((rows, cols))
df = pd.DataFrame(a)

filename = "big_random.csv"
df.to_csv(filename, index=False)

I tried both

  • gtabview file://big_random.csv
  • gtabview big_random.csv

and you are right, that's not usable with big files!

@wavexx
Member

wavexx commented Aug 6, 2015

AFAIK blaze is just reading the file in chunks somehow. It initially opens quicker, but then it's just as slow each time it requires a new chunk.
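That chunked-reading pattern can be sketched with pandas itself (read_csv's chunksize parameter; the in-memory CSV here stands in for a large file):

```python
import io
import pandas as pd

# A small stand-in for a big CSV file
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# Iterate in fixed-size chunks instead of loading everything at once;
# each chunk is a DataFrame of at most 4 rows
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_rows += len(chunk)
print(total_rows)  # 10
```

This keeps peak memory bounded to one chunk, but as noted above, each new chunk still costs a fresh parse, which is where the per-section lag comes from.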

And if I'm not mistaken, pandas' read_csv is just the csv module in disguise, without the little tweaks we added in tabview.

We could do much better than that by assuming files and sequential reads: i.e. read only n lines (exactly one chunk) when the file size is beyond a certain threshold. For files (not streams) we could do that both forward and in reverse to avoid allocating memory, at the expense of extra I/O.
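A minimal sketch of that threshold idea (the function name, window interface, and the 1 MiB threshold are all invented for illustration):

```python
import os
from itertools import islice

def read_window(path, start, n_lines, size_threshold=1 << 20):
    """Return up to n_lines lines starting at line `start`.

    Small files are read whole; beyond the size threshold we stream
    and slice instead, so memory stays bounded to one window of the
    file at the expense of re-reading on each forward seek.
    """
    with open(path, newline="") as f:
        if os.path.getsize(path) <= size_threshold:
            return f.readlines()[start:start + n_lines]
        return list(islice(f, start, start + n_lines))
```

Reverse windows would need an extra backward scan from a known offset, which is the extra I/O mentioned above.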
