pyo3 integration and adding support for parquet output and s3 stores #69

qooba · 2022-08-24T23:29:28Z

I'd like to add a few features.

Integration with PyO3 bindings, which will make it possible to publish the library as a Python package and use it without spawning a subprocess.
Support for Parquet as an output format:
output_format="parquet"
Because writing to Parquet row by row is inefficient, an additional parameter will be required to write the output in chunks:
chunk_size=3000
Support for S3 as an input and output store.
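
To illustrate the idea behind chunk_size, here is a minimal sketch of how rows could be grouped into fixed-size batches before each write. It is not the implementation from this proposal: `iter_chunks`, `write_in_chunks`, and the `write_chunk` callback are hypothetical names, and the callback stands in for whatever actually writes one batch to Parquet.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(
    rows: Iterable[Row],
    chunk_size: int,
    write_chunk: Callable[[List[Row]], None],
) -> int:
    """Pass rows to write_chunk one batch at a time; return the batch count.

    write_chunk is a placeholder (an assumption of this sketch) for the
    function that persists a single batch, for example as one Parquet
    row group.
    """
    batches = 0
    for chunk in iter_chunks(rows, chunk_size):
        write_chunk(chunk)
        batches += 1
    return batches
```

With chunk_size=3000, the writer would be called once per 3000 embedding rows instead of once per row, and the last batch may be smaller than the others.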

Example usage:

import cleora

output_dir = 's3://output'
fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

cleora.run(
    input=[fb_cleora_input_clique_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=10,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="complex::reflexive::CliqueNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)


qooba mentioned this issue Aug 24, 2022

Pyo3 python bindings, support for parquets output and s3 input/output #70

Open
