Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.

This project contains sample programs for Spark written in the Scala language.

Topics Covered in Spark 2.1

Implementing a custom UDF, UDAF, and Partitioner using Spark 2.1 (see the sketch after this list)
Working with DataFrames (ComplexSchema, DropDuplicates, DatasetConversion, GroupingAndAggregation)
Working with Datasets
Working with Parquet files
Partitioning data by a specific column and storing it partition-wise
Loading data from a Cassandra table using Spark
Working with the Spark Catalog API to access Hive tables
Inserting data into Hive tables (managed and external) from Spark
Inserting data into partitioned Hive tables (managed and external) in Parquet format from Spark
Adding and listing partitions of a Hive table using Spark
CRUD operations on Cassandra using Spark
Reading/writing S3 buckets using Spark
Spark MongoDB integration
Adding Hive partitions by fetching data from Cassandra
Exporting/backing up Cassandra table data using Spark
Reading and writing data to Elasticsearch using Spark 2.x
Querying Elasticsearch data from Spark 2.x
Deleting data from Elasticsearch using a Spark DataFrame
Pushing Spark accumulator values as metrics to the Datadog API
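 
As a taste of the Spark 2.1 topics above, here is a minimal sketch of registering a custom UDF and writing a DataFrame partitioned by a specific column as Parquet. It is not taken from the repository code: the column names, sample data, and output path are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfAndPartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical SparkSession; app name and master are placeholders.
    val spark = SparkSession.builder()
      .appName("udf-and-partitioned-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data; the column names (name, country, amount) are assumptions.
    val sales = Seq(
      ("alice", "IN", 120.0),
      ("bob",   "US",  75.5)
    ).toDF("name", "country", "amount")

    // Custom UDF: upper-case a string column.
    val toUpperFn = (s: String) => if (s == null) null else s.toUpperCase
    spark.udf.register("toUpper", toUpperFn) // usable from Spark SQL
    val toUpper = udf(toUpperFn)             // usable from the DataFrame API

    val withUpper = sales.withColumn("name_upper", toUpper($"name"))

    // Partition the output by a specific column and store it partition-wise as Parquet.
    withUpper.write
      .partitionBy("country")
      .mode("overwrite")
      .parquet("/tmp/sales_by_country") // illustrative output path

    spark.stop()
  }
}
```

The same DataFrame could also be saved into a managed or external Hive table with `saveAsTable` or `insertInto`; the actual table names used in this project are not reproduced here.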

Topics Covered in Spark 1.5

Spark Transformations.
Spark to Cassandra connection and storage.
Spark to Cassandra CRUD operations.
Reading data from Cassandra using Spark Streaming (Cassandra as source).
Spark Kafka Integration.
Spark Streaming with Kafka.
Storing Spark Streaming data into HDFS.
Storing Spark Streaming data into Cassandra.
Spark DataFrame API (joining two DataFrames, sorting, wildcard search, orderBy, aggregations).
Spark SQL.
Spark HiveContext (loading ORC, text, and Parquet data from Hive tables).
Kafka Producer.
Kafka Consumer via Spark integration with Kafka.
Spark File Streaming.
Spark Socket Streaming.
Spark JDBC Connection.
Overcoming Scala case class limitations by using StructType (see the sketch after this list).
Working with CSV, JSON, XML, ORC, and Parquet data files in Spark.
Working with Avro and SequenceFiles in Spark.
Spark Joins.
Spark Window vs. Sliding Interval.
Spark Aggregations using the DataFrame API.
Writing a custom UDF and UDAF in Spark.
Storing data as text and Parquet files in HDFS.
Integrating Spark with MongoDB.
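 
For the Spark 1.5 topics, here is a minimal sketch of building a DataFrame with an explicit StructType schema (the usual way around case class limitations, such as the 22-field limit in Scala 2.10) and running a simple aggregation. The column names and sample rows are illustrative assumptions, not the project's data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object StructTypeSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.x style entry points; app name and master are placeholders.
    val conf = new SparkConf().setAppName("structtype-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Explicit schema instead of a case class.
    val schema = StructType(Seq(
      StructField("department", StringType, nullable = true),
      StructField("salary", DoubleType, nullable = true)
    ))

    // Sample rows; the data is an assumption for illustration only.
    val rows = sc.parallelize(Seq(
      Row("engineering", 95000.0),
      Row("engineering", 80000.0),
      Row("sales", 60000.0)
    ))

    val df = sqlContext.createDataFrame(rows, schema)

    // A simple DataFrame aggregation: average salary per department.
    df.groupBy("department").agg(avg("salary").as("avg_salary")).show()

    sc.stop()
  }
}
```

The same pattern extends to wide rows read from CSV, JSON, or JDBC sources, where defining the StructType by hand keeps the schema explicit and avoids case class size limits.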


Feel free to share any insights or constructive criticism. Cheers!
Happy Sparking!
