Support data connectors for ingestion #154

Open
3 tasks
rth opened this issue Jul 7, 2017 · 1 comment

rth commented Jul 7, 2017

Currently, FreeDiscovery can ingest files stored on local disk. In real-world situations, it would be useful to be able to ingest data stored elsewhere, such as an S3 bucket, a relational database (e.g. MySQL), or remote network storage.

Each of these would require modifying the ingestion code behind POST /api/v0/feature-extraction/{dsid}. Since we don't want to add more dependencies to the project, one solution would be to support data connectors implemented as external Python packages.

The development steps could be:

  • In FreeDiscovery, implement an API for external connectors.

    For instance, currently, we can specify the files to ingest as,

     POST /api/v0/feature-extraction/{dsid}
     with
     json={'dataset_definition': [{'file_path': 'file_path1.txt'}, {'file_path': 'file_path2.txt'}, ..]}

    With an external connector (e.g. the S3 one), the payload could be,

    json={'dataset_definition': [{'file_path': 'file_path1.txt'}, {'file_path': 'file_path2.txt'}, ..],
          'connector': 'freediscovery_s3_connector',
          'connector_pars': 'some_parameters'  # a string with the S3 bucket name and authentication tokens
          }

    which would internally call the freediscovery_s3_connector package that would use some documented API to load the files from S3.

  • (optional) Implement some generic functions in FreeDiscovery to test connector packages.

  • Actually implement a connector as an example. For instance, the initial structure for the S3 connector can be found in https://github.com/FreeDiscovery/FreeDiscovery-S3-connector
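
As a rough sketch of how the first two steps could fit together (this is not an existing FreeDiscovery API): the server imports the package named in 'connector' and calls one agreed-upon function on it. The hook name iter_documents, the helpers load_connector, ingest and check_connector, and the in-memory fake_memory_connector module are all hypothetical names made up for illustration; only the payload keys come from the proposal above.

```python
import importlib
import sys
import types
from typing import Dict, List


def load_connector(name: str):
    """Import a connector package by name and check it exposes the hook.

    Assumption: every connector defines
    iter_documents(paths, connector_pars) -> iterable of (path, bytes).
    """
    module = importlib.import_module(name)
    if not callable(getattr(module, "iter_documents", None)):
        raise TypeError(f"connector {name!r} must define iter_documents()")
    return module


def ingest(payload: dict) -> Dict[str, bytes]:
    """Resolve each 'dataset_definition' entry to its raw content."""
    paths = [entry["file_path"] for entry in payload["dataset_definition"]]
    connector_name = payload.get("connector")
    if connector_name is None:
        # No connector given: fall back to reading from local disk.
        documents = {}
        for path in paths:
            with open(path, "rb") as fh:
                documents[path] = fh.read()
        return documents
    connector = load_connector(connector_name)
    pars = payload.get("connector_pars", "")
    return dict(connector.iter_documents(paths, pars))


def check_connector(name: str, paths: List[str], connector_pars: str = "") -> bool:
    """Generic conformance check that a connector package could be run against.

    Returns True if the connector yields (path, bytes) pairs for exactly
    the requested paths, in order.
    """
    connector = load_connector(name)
    seen = []
    for path, content in connector.iter_documents(paths, connector_pars):
        if not isinstance(content, bytes):
            return False
        seen.append(path)
    return seen == list(paths)


# A stand-in connector kept in memory so the sketch runs without S3 access.
_STORE = {"file_path1.txt": b"first document", "file_path2.txt": b"second document"}


def _iter_documents(paths, connector_pars):
    for path in paths:
        yield path, _STORE[path]


_fake = types.ModuleType("fake_memory_connector")
_fake.iter_documents = _iter_documents
sys.modules["fake_memory_connector"] = _fake  # makes it importable by name

payload = {
    "dataset_definition": [{"file_path": "file_path1.txt"},
                           {"file_path": "file_path2.txt"}],
    "connector": "fake_memory_connector",
    "connector_pars": "some_parameters",
}
documents = ingest(payload)
```

Keeping the contract down to a single function would mean a real connector such as the S3 one only has to ship that function, and FreeDiscovery itself would not need to depend on any storage client library.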

Reference: Data connectors in ElasticSearch

rth added the new feature label Jul 7, 2017

rth commented Sep 29, 2017

Dask's support for remote data services is probably the way to go here. Progressively migrating document ingestion / vectorization in FD to dask would solve this issue, as well as the parallelization/scalability issue (#152).
