Spark and PySpark packages that extend Apache Spark with additional SQL functions, implemented as user-defined functions (UDFs). Available for both Scala and Python.
Objectives:
- "As a data scientist, I can get more done by using SQL functions"
- "As a data engineer, I can get real-time aggregations via SQL functions"
UDF Function | Description | Example | Status |
---|---|---|---|
last_k | Returns the last k occurrences | SELECT user_id, last_k(page_id, timestamp, 100, "unique") FROM dataset | In Development |
approx_topk | Returns the most frequent items using a fast approximation algorithm with limited memory | SELECT approx_topk(ip_address, 1000, "10MB") FROM dataset | Available |
approx_cond_topk | Returns the most frequent items conditioned on another item using a fast approximation algorithm with limited memory | NA | In Development |
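
To give a feel for what "most frequent items with limited memory" can mean, here is a minimal pure-Python sketch of the Misra-Gries heavy-hitters algorithm. This is an illustration only: the README does not say which algorithm approx_topk uses, and the function name `misra_gries_topk` and its `max_counters` parameter are hypothetical and not part of this package. The sketch bounds memory by a number of counters rather than by a size such as "10MB".

```python
def misra_gries_topk(items, max_counters):
    """Approximate the most frequent items using at most max_counters counters.

    Hypothetical helper for illustration; not the package's implementation.
    Reported counts never exceed the true counts, and any item occurring
    more than n / (max_counters + 1) times in a stream of n items is kept.
    """
    counters = {}
    for item in items:
        if item in counters:
            # Item is already tracked: count it.
            counters[item] += 1
        elif len(counters) < max_counters:
            # There is a free slot: start tracking the item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop the ones at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Most frequent first.
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)


# Two frequent values followed by a tail of rare ones.
stream = ["a"] * 50 + ["b"] * 30 + ["c", "d", "e", "f"] * 5
result = misra_gries_topk(stream, max_counters=3)
print(result)
```

The counts returned are lower bounds on the true frequencies, so a second pass over the data is needed if exact counts for the surviving candidates matter.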
- Browse the Jupyter notebooks.
- Clone the repository
git clone git@github.com:MLStream/mlstream-spark-udfs.git
- Run the demo (Linux and Mac only)
The following command starts a demo Jupyter server which is ready to use with local files.
./demo.sh
TODO
The project mlstream-spark-udfs is distributed in the hope that it will be useful and help you solve pressing problems. At the same time, it is still early days for mlstream-spark-udfs. It may contain many bugs, known or unknown; it may crash, force your computer to run out of memory, or produce erroneous results. Please carry out due diligence before using and deploying it in your organization. The developers of mlstream-spark-udfs, be they organizations or people, should not be held liable for any damages that result from running the code. The code is distributed under the Apache License, which should be consulted for warranties and liabilities. This disclaimer does not replace the license.
The code is distributed under the Apache License. Please check the source files in the repositories for the third-party libraries used. We also use source code derived from GoLang sort; please consult the GoLang LICENSE.
TODO
Please file an issue or contact hello@mlstream.com.