# apache-spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Here are 1,663 public repositories matching this topic...

treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data

go golang apache-spark aws-s3 google-cloud-storage data-engineering data-lake azure-storage data-version-control object-storage datalake hadoop-filesystem data-quality data-versioning azure-blob-storage apache-sparksql git-for-data lakefs datalakes

Updated Jun 2, 2024
Go

kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

kubernetes spark apache-spark kubernetes-operator kubernetes-controller kubernetes-crd google-cloud-dataproc

Updated Jun 2, 2024
Go

mlflow / mlflow

Open source platform for the machine learning lifecycle

machine-learning ai apache-spark ml model-management mlflow

Updated Jun 1, 2024
Python

yemrekarakas / yemrekarakas.github.io

A minimal, responsive and feature-rich Jekyll theme for my technical writing.

python java scala sql apache-spark bigdata apache-kafka dataengineering

Updated Jun 1, 2024
SCSS

geoHeil / awesome-tools

Curated list of awesome tools and libraries for specific domains

python data-science streaming big-data apache-spark

Updated Jun 1, 2024

exacaster / lighter

REST API for Apache Spark on K8S or YARN

spark apache-spark yarn jupyter k8s livy sparkmagic

Updated Jun 1, 2024
Java

CH6832 / exploratory-tech-studio

The Tech Canvas Experimenters Hub is an interdisciplinary repository for collaborative projects spanning various fields, such as hardware like Arduino UNO, financial engineering, machine learning, natural language processing, and the corresponding mathematical foundations for all fields.

python c java jenkins csv apache-spark hadoop cpp arduino-ide vscode visual-studio-code arduino-sketch arduino-uno-board intellij-idea pycharm-ide arduino-uno-r3 spyder-python-ide

Updated Jun 1, 2024
Python

SEED-VT / DeSQL

DeSQL is an interactive step-through debugging technique for DISC-backed SQL queries. This approach allows users to inspect constituent parts of a query and their corresponding intermediate data interactively, similar to watchpoints in gdb-like debuggers.

debugger apache-spark spark-sql

Updated May 31, 2024
Scala

datamechanics / delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

kubernetes cpu spark apache-spark monitoring dashboard memory delight spark-history-server spark-ui netapp-public

Updated May 31, 2024
Scala

aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation

windows linux ios xamarin apache-spark dotnet xbox dotnet-core dotnet-standard apache-parquet

Updated May 31, 2024
C#

G-Research / fasttrackml

Experiment tracking server focused on speed and scalability

visualization metadata data-science machine-learning ai apache-spark metrics tensorflow ml data-visualization pytorch tensorboard mlops mlflow mlflow-tracking-server experiment-tracking metadata-tracking

Updated May 31, 2024
Go

O2-Czech-Republic / proxima-platform

The Proxima platform.

apache-spark stream-processing iot-platform apache-beam apache-flink batch-processing analytical-platform unified-data-processing data-mesh

Updated May 30, 2024
Java

FIWARE / tutorials.Big-Data-Spark

📘 FIWARE 306: Real-time Processing of Context Data using Apache Spark

tutorial spark apache-spark fiware big-data-analytics fiware-cosmos orion-spark-connector

Updated May 31, 2024
Shell

data-tools / big-data-types

A library to transform Scala product types and schemas from different systems into other schemas. Any implemented type automatically gets methods to convert it into the rest of the types and vice versa. For example, a Spark schema can be transformed into a BigQuery table.

bigquery scala spark cassandra apache-spark circe typeclass typesafe schemas database-types typeclass-derivation bigquery-tables

Updated May 30, 2024
Scala

josephmachado / efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

apache-spark pyspark data-engineering minio data-pipeline pyspark-notebook

Updated May 29, 2024
Python

SANSA-Stack / SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

spark apache-spark rdf distributed-computing semantic-web flink apache-jena

Updated May 31, 2024
Scala

PastorGL / datacooker-etl

Data transformation framework for ETL processing with SQL-like syntax and GIS extensions, based on Apache Spark

sql apache-spark etl geospatial gis data-transformation geospatial-data schema-less etl-framework columnar-format

Updated May 29, 2024
Java

LucaCanali / Miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebooks examples for Spark, examples for Oracle and other DB systems.

database apache-spark performance-analysis performance-monitoring jupyter-notebooks performance-testing

Updated May 29, 2024
Jupyter Notebook

cheukhin1024 / Financial-Data-Project-in-Azure

Free High-Quality Financial Data in Azure

python database apache-spark azure sparkml financial-data azure-databricks mlflow delta-lake sckit-learn

Updated May 29, 2024
Python

microsoft / SynapseML

Simple and Distributed Machine Learning

Updated May 28, 2024
Scala

Created by Matei Zaharia

Released May 26, 2014

Followers: 417
Repository: apache/spark
Website: spark.apache.org

Related Topics