Support data connectors for ingestion #154

rth · 2017-07-07T04:36:44Z

Currently, FreeDiscovery can ingest files stored on local disk. In real-world situations, it would be useful to be able to ingest data stored on other backends, such as an S3 bucket, a relational database (e.g. MySQL), or remote network storage.

Supporting each of these would require modifying the ingestion code under POST /api/v0/feature-extraction/{dsid}. Since we don't want to add additional dependencies to the project, a solution could be to add support for data connectors (implemented as external Python packages).

The development steps could be:

In FreeDiscovery, implement an API for external connectors.

For instance, currently, we can specify the files to ingest as,

 POST /api/v0/feature-extraction/{dsid}
 with
 json={'dataset_definition': [{'file_path': 'file_path1.txt'}, {'file_path': 'file_path2.txt'}, ..]}

With an external connector (e.g. the S3 one), the payload could be,

json={'dataset_definition': [{'file_path': 'file_path1.txt'}, {'file_path': 'file_path2.txt'}, ..],
      'connector': 'freediscovery_s3_connector',
      'connector_pars': 'some_parameters'  # a string with the S3 bucket name and authentication tokens
      }

which would internally call the freediscovery_s3_connector package that would use some documented API to load the files from S3.
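A minimal sketch of how that dispatch might look is given below. It is only an illustration under stated assumptions: the `load_documents(file_paths, connector_pars)` entry point, the `load_dataset` helper and the `ALLOWED_CONNECTORS` allow-list are hypothetical names, not an existing FreeDiscovery API, and an in-memory `demo_connector` module stands in for an installed package such as `freediscovery_s3_connector`.

```python
import importlib
import sys
import types

# Hypothetical allow-list: only connectors named here may be imported,
# so a request cannot trigger the import of an arbitrary module.
ALLOWED_CONNECTORS = {"demo_connector", "freediscovery_s3_connector"}


def load_dataset(payload):
    """Return {file_path: content as bytes} for the files in the payload."""
    paths = [item["file_path"] for item in payload["dataset_definition"]]
    connector_name = payload.get("connector")

    if connector_name is None:
        # No connector given: read the files from local disk.
        contents = {}
        for path in paths:
            with open(path, "rb") as fh:
                contents[path] = fh.read()
        return contents

    if connector_name not in ALLOWED_CONNECTORS:
        raise ValueError(f"Unknown connector: {connector_name!r}")

    # Assumed contract: the connector package exposes
    # load_documents(file_paths, connector_pars) -> {file_path: bytes}
    connector = importlib.import_module(connector_name)
    return connector.load_documents(paths, payload.get("connector_pars"))


def _demo_load_documents(file_paths, connector_pars):
    """Stand-in loader that fabricates content instead of calling S3."""
    return {path: f"{connector_pars}:{path}".encode() for path in file_paths}


# Register the stand-in connector so that import_module can find it.
demo_connector = types.ModuleType("demo_connector")
demo_connector.load_documents = _demo_load_documents
sys.modules["demo_connector"] = demo_connector

payload = {
    "dataset_definition": [
        {"file_path": "file_path1.txt"},
        {"file_path": "file_path2.txt"},
    ],
    "connector": "demo_connector",
    "connector_pars": "demo-bucket",
}
documents = load_dataset(payload)
```

Because the connector is resolved by name at request time, FreeDiscovery itself would not need to depend on any storage client library; only the connector package would.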

(optional) Implement some generic functions in FreeDiscovery to test connector packages.
Implement an actual connector as an example. For instance, the initial structure for the S3 connector can be found in https://github.com/FreeDiscovery/FreeDiscovery-S3-connector
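The generic test functions mentioned in the optional step could be sketched as follows. This assumes a hypothetical connector contract in which a connector exposes `load_documents(file_paths, connector_pars)` and returns a dict mapping each requested path to `bytes`; neither that contract nor the `check_connector` name comes from FreeDiscovery or from the S3 connector repository.

```python
import types


def check_connector(connector, sample_paths, connector_pars=None):
    """Return a list of contract violations; an empty list means success."""
    loader = getattr(connector, "load_documents", None)
    if not callable(loader):
        return ["connector does not define a callable load_documents"]

    result = loader(list(sample_paths), connector_pars)
    if not isinstance(result, dict):
        return ["load_documents must return a dict"]

    problems = []
    missing = set(sample_paths) - set(result)
    if missing:
        problems.append(f"missing documents: {sorted(missing)}")
    for path, content in result.items():
        if not isinstance(content, bytes):
            problems.append(f"content of {path!r} is not bytes")
    return problems


# Two stand-in connectors used only to exercise the checker.
good_connector = types.SimpleNamespace(
    load_documents=lambda paths, pars: {p: b"text" for p in paths}
)
bad_connector = types.SimpleNamespace(
    load_documents=lambda paths, pars: {p: "text" for p in paths[:1]}
)

good_report = check_connector(good_connector, ["a.txt", "b.txt"])
bad_report = check_connector(bad_connector, ["a.txt", "b.txt"])
```

Returning a list of problems rather than raising on the first one lets a connector author see every violation in a single run.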

Reference: Data connectors in ElasticSearch


rth · 2017-09-29T19:58:39Z

The support of remote data services in dask is probably the way to go here. Progressively migrating the document ingestion / vectorization to use dask in FD would solve this issue in addition to the parallelization/scalability issue (#152).

rth added the new feature label Jul 7, 2017

rth mentioned this issue Oct 11, 2018

How would I ingest jsonl? #187

Open
