Create your first ETL Pipeline using Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.
Ref: https://docs.microsoft.com/en-us/azure/databricks/scenarios/what-is-azure-databricks
Azure Databricks comprises the complete set of open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes the following components:
Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.
MLlib: Machine Learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
Spark Core API: Includes support for R, SQL, Python, Scala, and Java.