Currently, FreeDiscovery can only ingest files stored on local disk. In real-world situations, it would be useful to be able to ingest data stored on other backends, such as an S3 bucket, a relational database (e.g. MySQL), or remote network storage.
Supporting each of these would require modifying the ingestion code under the `POST /api/v0/feature-extraction/{dsid}` endpoint. Since we don't want to add extra dependencies to the project, one solution could be to support data connectors implemented as external Python packages.

The development steps could be:
In FreeDiscovery, implement an API for external connectors.
For instance, we can currently specify the files to ingest as follows:
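A minimal sketch of the current payload, built from the `dataset_definition` format (the file names are placeholders):

```python
# Current local-disk ingestion payload: just a list of file paths
# (the file names are placeholders)
payload = {'dataset_definition': [{'file_path': 'file_path1.txt'},
                                  {'file_path': 'file_path2.txt'}]}
```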
with an external connector (e.g. the S3 one), the payload could be:
```python
json={'dataset_definition': [{'file_path': 'file_path1.txt'},
                             {'file_path': 'file_path2.txt'}, ...],
      'connector': 'freediscovery_s3_connector',
      # a string with the S3 bucket name and authentication tokens
      'connector_pars': 'some_parameters'}
```
which would internally call the `freediscovery_s3_connector` package, using some documented API to load the files from S3.
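As a rough sketch, the documented connector API could be as small as a single entry point per package. The `fetch()` name and its signature below are assumptions for illustration, not an existing FreeDiscovery interface:

```python
import os
import tempfile

# Hypothetical connector interface (an illustration, not an existing
# FreeDiscovery API): each connector package exposes a fetch() function
# that retrieves the requested files and returns their local paths.
def fetch(dataset_definition, connector_pars):
    """Download each remote file to a temporary folder; return local paths."""
    out_dir = tempfile.mkdtemp()
    local_paths = []
    for entry in dataset_definition:
        local_path = os.path.join(out_dir,
                                  os.path.basename(entry['file_path']))
        # A real S3 connector would download the object here, using the
        # bucket name and credentials carried in connector_pars.
        with open(local_path, 'wb') as fh:
            fh.write(b'')  # placeholder for the downloaded contents
        local_paths.append(local_path)
    return local_paths
```

FreeDiscovery would then ingest the returned local paths through its existing code path, keeping the connector package as the only component that knows about the remote backend.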
(Optional) Implement some generic functions in FreeDiscovery to test connector packages.
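Such a generic check might, as a sketch, simply verify that a candidate package exposes a callable entry point; the `fetch` convention assumed below is illustrative, not an existing FreeDiscovery requirement:

```python
import importlib

def validate_connector(package_name):
    """Check that a connector package exposes a callable ``fetch`` entry
    point (an illustrative convention, not an existing FreeDiscovery rule)."""
    module = importlib.import_module(package_name)
    fetch = getattr(module, 'fetch', None)
    if not callable(fetch):
        raise ValueError('%s does not provide a callable fetch() function'
                         % package_name)
    return True
```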
Actually implement a connector as an example. For instance, the initial structure for the S3 connector can be found at https://github.com/FreeDiscovery/FreeDiscovery-S3-connector

Reference: data connectors in Elasticsearch.

The support of remote data services in dask is probably the way to go here. Progressively migrating the document ingestion / vectorization in FreeDiscovery to use dask would solve this issue in addition to the parallelization/scalability issue (#152).
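As a sketch of the dask approach mentioned above, dask's byte-handling layer already resolves local globs and remote URLs through the same interface; the example below runs against local files it creates, and the S3 URL in the comments is a placeholder (reading from S3 would additionally require the `s3fs` package):

```python
import os
import tempfile

import dask
from dask.bytes import read_bytes

# read_bytes resolves local globs and remote URLs (e.g. 's3://bucket/*.txt'
# with s3fs installed) through one interface. For a self-contained run we
# create two local files and read them back lazily.
data_dir = tempfile.mkdtemp()
for name in ('a.txt', 'b.txt'):
    with open(os.path.join(data_dir, name), 'wb') as fh:
        fh.write(b'document text')

# With blocksize=None each file becomes a single lazy block; swapping the
# glob for an s3:// URL would read from a bucket instead of local disk.
sample, blocks = read_bytes(os.path.join(data_dir, '*.txt'), blocksize=None)
# `blocks` holds one list of dask.delayed byte chunks per matched file
texts = [dask.compute(*chunks)[0] for chunks in blocks]
```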