This repository was archived by the owner on Dec 11, 2023. It is now read-only.

ctable takes 16 hours (and still running) saving to disk - a better way?? #379

@ghost

Description

I am taking 71.6 GB of pickle files, converting them to dataframes, creating ctables, and appending to my on-disk (rootdir) ctable.

The process did not finish, so I killed it after 16 hours.
My data has 2 columns: one float32 and one object (strings of roughly 200 characters).

My code:

import gc
import itertools

import numpy as np
import pandas as pd
import bcolz


def group(it, size):
    """Yield tuples of `size` items at a time from the iterable `it`."""
    it = iter(it)
    return iter(lambda: tuple(itertools.islice(it, size)), ())


def saving_bcolz():
    """This is the core save logic."""
    files = [... my data files ...]
    # The empty columns only define the dtypes of the on-disk ctable.
    cols = [np.zeros(0, dtype=dt) for dt in [np.dtype('float32'), np.dtype('object')]]
    ct = bcolz.ctable(cols, names=['score', 'all_cols'], rootdir='/home/dump/using_bcolz_new/')

    # Process the pickle files 10 at a time to limit memory use.
    for chunk in group(files, 10):
        df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
        ct_import = bcolz.ctable.fromdataframe(df, expectedlen=len(df))
        del df; gc.collect()
        ct.append(ct_import)
        del ct_import; gc.collect()
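
For clarity, `group` just slices an iterable into fixed-size tuples (the last tuple may be shorter), e.g.:

    list(group(range(7), 3))
    # [(0, 1, 2), (3, 4, 5), (6,)]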

bcolz 1.2.1
pandas 0.22

Is there a better way to have bcolz store the data?

Any reason for the slowness?
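
For reference, this is roughly the kind of alternative I had in mind: appending the column arrays straight to the on-disk ctable instead of building an intermediate in-memory ctable for every chunk. This is only an untested sketch; it assumes the pickled dataframes have columns named 'score' and 'all_cols', and the rootdir path is just a placeholder:

    import numpy as np
    import pandas as pd
    import bcolz

    def saving_bcolz_direct(files):
        # Empty columns only define the dtypes of the on-disk ctable.
        cols = [np.zeros(0, dtype='float32'), np.zeros(0, dtype='object')]
        ct = bcolz.ctable(cols, names=['score', 'all_cols'],
                          rootdir='/home/dump/using_bcolz_direct/', mode='w')
        for chunk in group(files, 10):
            df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
            # ctable.append also accepts a list of arrays, so the
            # intermediate fromdataframe() step can be skipped.
            ct.append([df['score'].values, df['all_cols'].values])
        ct.flush()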
