This repository was archived by the owner on Dec 11, 2023. It is now read-only.

ctable takes 16 hours (and still running) saving to disk - a better way?? #379

@ghost

Description

I am taking 71.6 GB of pickle files, converting them to dataframes, creating ctables, and appending to my on-disk (rootdir) ctable.

The process did not finish, so I killed it after 16 hours.
My data has 2 columns: one float32 and one object (strings of roughly 200 characters).

My code:

import gc
import itertools

import numpy as np
import pandas as pd
import bcolz


def group(it, size):
    """Yield tuples of `size` items at a time from the iterable `it`."""
    it = iter(it)
    return iter(lambda: tuple(itertools.islice(it, size)), ())


def saving_bcolz():
    """This is the core save logic."""
    files = [... my data files ...]
    # The empty columns only define the dtypes of the on-disk ctable.
    cols = [np.zeros(0, dtype=dt) for dt in [np.dtype('float32'), np.dtype('object')]]
    ct = bcolz.ctable(cols, names=['score', 'all_cols'], rootdir='/home/dump/using_bcolz_new/')

    # Process the pickle files 10 at a time to limit memory use.
    for chunk in group(files, 10):
        df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
        ct_import = bcolz.ctable.fromdataframe(df, expectedlen=len(df))
        del df; gc.collect()
        ct.append(ct_import)
        del ct_import; gc.collect()
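
For clarity, `group` just slices an iterable into fixed-size tuples (the last tuple may be shorter), e.g.:

    list(group(range(7), 3))
    # [(0, 1, 2), (3, 4, 5), (6,)]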

bcolz 1.2.1
pandas 0.22

Is there a better way to have bcolz store the data?

Any reason for the slowness?
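
For reference, this is roughly the kind of alternative I had in mind: appending the column arrays straight to the on-disk ctable instead of building an intermediate in-memory ctable for every chunk. This is only an untested sketch; it assumes the pickled dataframes have columns named 'score' and 'all_cols', and the rootdir path is just a placeholder:

    import numpy as np
    import pandas as pd
    import bcolz

    def saving_bcolz_direct(files):
        # Empty columns only define the dtypes of the on-disk ctable.
        cols = [np.zeros(0, dtype='float32'), np.zeros(0, dtype='object')]
        ct = bcolz.ctable(cols, names=['score', 'all_cols'],
                          rootdir='/home/dump/using_bcolz_direct/', mode='w')
        for chunk in group(files, 10):
            df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
            # ctable.append also accepts a list of arrays, so the
            # intermediate fromdataframe() step can be skipped.
            ct.append([df['score'].values, df['all_cols'].values])
        ct.flush()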
