2L-KnowledgeBase/learning-spark-101

Learning Spark 101

http://spark.apache.org

Introduction

What is Spark?

Currently, Spark is described as a unified analytics engine for large-scale data processing. (This definition comes from the developer community and is updated over time.)

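The "what is" definition tends to be adjusted as the open-source project evolves and as the PMC members reach agreement (usually at the next major release). Put simply, Spark is a distributed computing framework (engine), and many developers like to compare it with MapReduce.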

It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs (DAGs, directed acyclic graphs) for data analysis.

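The high-level APIs fall into two broad categories: RDD and Dataset (DataFrame; since Spark 2.0 the DataFrame API has been merged with Dataset). The latter is a second-level wrapper over the RDD API, which the community designed with reference to the API style of the well-known Python package pandas, making it friendlier to analysts.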
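To make the computation-graph idea concrete, here is a minimal pure-Python sketch of how an RDD-style API can record transformations lazily and run them only when an action is called. It is a toy model, not Spark's implementation: the class `LazyDataset` and its `lineage` method are hypothetical names invented for this illustration, and the chain it builds is linear, which is the simplest case of a DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class).

    Transformations only record a new node that points at its parent;
    nothing is computed until the collect() action is called.
    """

    def __init__(self, source=None, parent=None, op=None, fn=None):
        self._source = list(source) if source is not None else None
        self._parent = parent  # upstream node in the graph
        self._op = op          # "map" or "filter"; None for a source node
        self._fn = fn          # user function applied by this node

    # Transformations: return a new node, do no work yet.
    def map(self, fn):
        return LazyDataset(parent=self, op="map", fn=fn)

    def filter(self, fn):
        return LazyDataset(parent=self, op="filter", fn=fn)

    def lineage(self):
        """List the recorded operations from the source to this node."""
        if self._parent is None:
            return ["source"]
        return self._parent.lineage() + [self._op]

    def _iterate(self):
        # Walk the graph upstream and stream elements through each node.
        if self._parent is None:
            yield from self._source
        elif self._op == "map":
            for item in self._parent._iterate():
                yield self._fn(item)
        else:  # "filter"
            for item in self._parent._iterate():
                if self._fn(item):
                    yield item

    # Action: triggers the actual computation.
    def collect(self):
        return list(self._iterate())


calls = []  # records when the user function really runs


def square(x):
    calls.append(x)
    return x * x


odd_squares = LazyDataset(source=range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = list(calls)  # still empty: no action has run yet
result = odd_squares.collect()     # [1, 9, 25]
```

In PySpark the corresponding real calls would be `rdd.map(...).filter(...).collect()`; the sketch only mirrors the split between lazy transformations and an eager action.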

It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Besides writing a Spark application directly against the RDD/Dataset APIs mentioned above, Spark also provides the following built-in libraries for specific scenarios.

(Figure: Spark stack)

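Spark SQL is quite popular for building modern data warehouses. For the concept of a "modern data warehouse", see the talk given by Yucai Yu of eBay at QCon in 2018. Because MPP systems (e.g. Teradata/Netezza/Greenplum/..) are expensive, more and more companies have started migrating offline analytics from traditional data warehouses to modern ones (HDFS + Hive + Spark SQL).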

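The eBay DSS (Data Services and Solutions) team has given many talks about completing this migration, starting at Spark + AI Summit Europe (2018/10) and continuing at QCon, Spark + AI meetups in China, and elsewhere. Below are the recordings published on the Databricks YouTube channel (a VPN may be needed to watch them from mainland China).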

  1. Moving eBay’s Data Warehouse Over to Apache Spark (Kimberly Curtis & Brian Knauss)
  2. Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang and Lipeng Zhu (eBay)
  3. Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing - Keith Sun eBay
  4. Experience Of Optimizing Spark SQL When Migrating from MPP Database - Yucai Yu and Yuming Wang eBay

Spark Streaming: with the Dataset API unified since Spark 2.0, streaming jobs can be written with the Spark SQL APIs (this is Structured Streaming).

Get Started

Reference

Awesome Study Resources

Latest Spark documentation: http://spark.apache.org/docs/latest/. For historical versions, find here.

  1. Learning Spark (English / translated edition, chapters 2-9) @Notes
  2. Mastering Spark SQL
  3. Mastering Apache Spark
  4. Advanced Apache Spark Training - Sameer Farooqui (Databricks)

Spark use cases at PayPal (publicly disclosed only)

  1. SCaaS: Spark Compute as a Service at Paypal - Prabhu Kasinathan
  2. Merchant Churn Prediction Using SparkML at PayPal (Chetan Nadgire and Aniket Kulkarni)
  3. Graph Representation Learning to Prevent Payment Collusion Fraud (Venkatesh Ramanathan)
  4. PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
  • Known issues
  • Submit your PR

About

A separate notebook split out from my previous KB.
