Description
Submitted by: Juarez Rudsatz (juarezr)
Is duplicated by CORE5818
Votes: 2
With little effort, Firebird could be extended to cover many big data processing scenarios.
Basically, big data processing is done in two ways:
- Batch: a program using a big data batch framework reads data from structured storage sources, converts it into a programming format like an object/struct (properties) or a dataset/dataframe (rows/columns), applies several transformations such as map, reduce, join, group by and filter, and writes the output to new structured storage.
- Streaming: a program using a streaming framework reads data from realtime or buffered sources and writes to other realtime/buffered destinations or to structured storage.
Batch frameworks commonly used are Hadoop, Spark, Pig and several others.
Streaming frameworks commonly used are Spark Streaming, Kafka, Amazon Kinesis, Amazon Kinesis Firehose, etc.
Structured sources can be database data accessed via JDBC, or files accessed from network drives, Hadoop HDFS, AWS S3 or Azure Storage filesystems.
Usually the processed data is consumed by:
a) exporting it directly to a spreadsheet (CSV) in an ad-hoc manner
b) uploading it to a database or data warehouse/BI infrastructure
c) storing it in a pre-summarized format in a structured source for further processing or analysis
Tools used for analysis in scenario c), besides batch frameworks, are Apache Hive, Amazon Athena and Amazon Redshift Spectrum.
They basically provide a mechanism to query files stored in structured sources like Amazon S3, using plain SQL or Pig languages.
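For illustration, a Hive/Athena-style external table over Parquet files in S3 looks roughly like this (the table name, columns and bucket path are made up for the example):

```sql
-- Hive/Athena-style external table over Parquet files in S3 (illustrative only)
CREATE EXTERNAL TABLE sales (
  sale_date   DATE,
  customer_id INT,
  amount      DECIMAL(18,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- Queried with plain SQL as if it were a regular table
SELECT customer_id, SUM(amount)
FROM sales
GROUP BY customer_id;
```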
Firebird could take a slice of this market by adding some basic support for this workflow.
To perform well in this scenario, Firebird should:
1) have very fast data injection/bulk insert, like the Amazon Redshift COPY command (Redshift is a columnar PostgreSQL derivative)
2) support the file formats commonly used in big data, such as CSV/TSV, Avro, Parquet, ORC, Grok, RCFile, RegexSerDe and SequenceFile
3) extend EXTERNAL FILE so these formats can be read from remote structured sources like those cited above
This could be done by adding a FORMAT clause to the existing CREATE TABLE ... EXTERNAL FILE command.
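A rough sketch of what that could look like; the FORMAT clause and the remote path are hypothetical extensions, not existing Firebird syntax:

```sql
-- Hypothetical syntax: the FORMAT clause and remote location are proposed
-- extensions, not features Firebird supports today
CREATE TABLE SALES_EXT EXTERNAL FILE 's3://my-bucket/sales/2018/*.parquet'
FORMAT PARQUET (
  SALE_DATE   DATE,
  CUSTOMER_ID INTEGER,
  AMOUNT      DECIMAL(18,2)
);

-- Fast bulk injection into a regular table could then reuse the external table
INSERT INTO SALES (SALE_DATE, CUSTOMER_ID, AMOUNT)
SELECT SALE_DATE, CUSTOMER_ID, AMOUNT FROM SALES_EXT;
```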
Most of these formats and filesystems have libraries that could be used to speed up development.
Likewise, one could start with the most used formats (CSV/TSV, Parquet, Avro) and the most used filesystems (AWS S3, Azure Storage).
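For the simplest starting point, a CSV file on a remote filesystem could look like this (again purely a hypothetical sketch; the DELIMITER/HEADER options and the Azure Blob Storage path are invented for illustration):

```sql
-- Hypothetical CSV variant with per-format options (invented for illustration)
CREATE TABLE EVENTS_EXT EXTERNAL FILE
  'https://myaccount.blob.core.windows.net/logs/events.csv'
FORMAT CSV DELIMITER ',' HEADER (
  EVENT_TIME TIMESTAMP,
  EVENT_NAME VARCHAR(64),
  USER_ID    BIGINT
);
```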