I am taking 71.6 GB of pickle files, converting them to dataframes, creating ctables, and appending those to my rootdir ctable.
The process did not finish, so I killed it after 16 hours...
My data has 2 columns: a float32 and an object. The object strings are around 200 characters long.
My code:
import gc
import itertools

import bcolz
import numpy as np
import pandas as pd

def saving_bcolz():
    """ this is the core save logic """
    files = [... my data files ...]
    # empty columns of the right dtypes, just to establish the on-disk schema
    cols = [np.zeros(0, dtype=dt) for dt in [np.dtype('float32'), np.dtype('object')]]
    ct = bcolz.ctable(cols, ['score', 'all_cols'], rootdir='/home/dump/using_bcolz_new/')
    # read the pickles 10 at a time, build an in-memory ctable per batch, append it
    for chunk in group(files, 10):
        df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
        ct_import = bcolz.ctable.fromdataframe(df, expectedlen=len(df))
        del df; gc.collect()
        ct.append(ct_import)
        del ct_import; gc.collect()

def group(it, size):
    """ Yield successive size-sized tuples from the iterable `it`. """
    it = iter(it)
    return iter(lambda: tuple(itertools.islice(it, size)), ())
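For reference, group() just batches an iterable into fixed-size tuples, with a shorter last tuple for any remainder:

>>> list(group(range(5), 2))
[(0, 1), (2, 3), (4,)]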
bcolz 1.2.1
pandas 0.22
Is there a better way to have bcolz store the data?
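For example, if I understand ctable.append correctly, it also accepts a list of per-column NumPy arrays, so one variant I'm wondering about is skipping the intermediate in-memory ctable entirely. A minimal sketch of what I mean (untested, column names as above):

for chunk in group(files, 10):
    df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
    # append raw column arrays instead of building a ctable per batch
    ct.append([df['score'].values, df['all_cols'].values])
    del df; gc.collect()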