
Spark Examples

Some simple, kinda introductory projects based on Apache Spark, to be used as guides that make DataFrame data management look a little less weird or complex.

Preparations & Prerequisites

  • The latest stable version of Spark, or at least the one used here, 3.0.1 (in case you're old school).
  • A single-node setup is enough. You can also run the applications on a local cluster or in a cloud service, with the necessary changes to anything that needs to be parallelized, of course.
  • Of course, having (a somewhat recent version of) Scala (and Java) installed (oh, you're really old school).
  • The most casual and convenient way to run the projects is to import them into an IDE as shown here.

Projects

Each project comes with its very own input data (.csv, .tsv, or simple text files in the project folder, ready to be used or copied to HDFS), and its execution results are either stored as a single file in an /output directory or printed to the console.

The projects featured in this repo are:

Calculating the average price of houses for sale by zipcode.
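
A minimal sketch of the idea, assuming a spark-shell session (so `spark` and the `toDF` implicits are already in scope) and made-up column names (`zipcode`, `price`), since the actual input schema may differ:

```scala
import org.apache.spark.sql.functions.avg

// Hypothetical schema: one row per house listing
val houses = Seq(
  ("10001", 500000.0),
  ("10001", 650000.0),
  ("94105", 900000.0)
).toDF("zipcode", "price")

// Average listing price per zipcode
houses.groupBy("zipcode")
  .agg(avg("price").alias("avg_price"))
  .show()
```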

A typical "sum-it-up" example where for each bank we calculate the number and the sum of its transfers.

A typical case of finding the max recorded temperature for every city.
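
One more aggregation sketch, with assumed `city` and `temperature` columns:

```scala
import org.apache.spark.sql.functions.max

// Hypothetical schema: one row per temperature reading
val temps = Seq(
  ("Athens", 38.2),
  ("Athens", 41.0),
  ("Oslo",   22.5)
).toDF("city", "temperature")

// Highest recorded temperature per city
temps.groupBy("city")
  .agg(max("temperature").alias("max_temp"))
  .show()
```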

An interesting application working on Olympic Games stats to find the total gold, silver, and bronze medal wins of every athlete.
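
A sketch of how the totals could be summed up, assuming the medal counts already sit in separate columns (the real input layout may differ):

```scala
import org.apache.spark.sql.functions.sum

// Hypothetical schema: one row per athlete per event
val medals = Seq(
  ("Phelps", 2, 0, 1),
  ("Phelps", 3, 1, 0),
  ("Biles",  4, 1, 1)
).toDF("athlete", "gold", "silver", "bronze")

// Total medal wins of each kind, per athlete
medals.groupBy("athlete")
  .agg(sum("gold").alias("total_gold"),
       sum("silver").alias("total_silver"),
       sum("bronze").alias("total_bronze"))
  .show()
```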

Just a plain old normalization example for a bunch of students and their grades.
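
A min-max normalization sketch (one way to normalize; the project itself may scale differently), with assumed `student` and `grade` columns:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, max, min}

val grades = Seq(
  ("Alice", 72.0),
  ("Bob",   95.0),
  ("Carol", 60.0)
).toDF("student", "grade")

// Grab the global min and max, then scale every grade into [0, 1]
val Row(lo: Double, hi: Double) = grades.agg(min("grade"), max("grade")).head()
grades.withColumn("normalized", (col("grade") - lo) / (hi - lo)).show()
```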

Finding the oldest tree per city district. Child's play.
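
Child's play indeed, though keeping the whole row of the oldest tree takes a window function. A sketch with assumed `district`, `species`, and `planted` columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val trees = Seq(
  ("Downtown", "oak",  1923),
  ("Downtown", "elm",  1956),
  ("Harbor",   "pine", 1987)
).toDF("district", "species", "planted")

// Rank trees within each district by planting year; the earliest wins
val byAge = Window.partitionBy("district").orderBy(col("planted").asc)
trees.withColumn("rn", row_number().over(byAge))
  .filter(col("rn") === 1)
  .drop("rn")
  .show()
```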

The most challenging and abstract one. Every key character (A-E) has 3 numbers as values, two negative and one positive. We just calculate the score for every character based on the following expression: `character_score = pos / (-1 * (neg_1 + neg_2))`.
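
The expression translates to a one-liner once the three values are split into columns. A sketch assuming they already are (`pos`, `neg_1`, and `neg_2` are made-up names):

```scala
import org.apache.spark.sql.functions.col

val values = Seq(
  ("A", 10.0, -2.0, -3.0),
  ("B",  8.0, -1.0, -1.0)
).toDF("character", "pos", "neg_1", "neg_2")

// character_score = pos / (-1 * (neg_1 + neg_2))
values.withColumn("score", col("pos") / ((col("neg_1") + col("neg_2")) * -1))
  .show()
```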

A simple way to calculate the symmetric difference between the records of two files, based on each record's ID.
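
One way to get a symmetric difference is a pair of anti joins, sketched here with hypothetical `id` and `value` columns:

```scala
val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val right = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "value")

// Records whose ID shows up in only one of the two files
val onlyInLeft  = left.join(right, Seq("id"), "left_anti")
val onlyInRight = right.join(left, Seq("id"), "left_anti")

onlyInLeft.union(onlyInRight).show()
```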

Filtering out patients' records where their PatientCycleNum column is equal to 1 and their Counseling column is equal to No.
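
A filter sketch; the `PatientID` column and sample values are made up, while the other two column names come straight from the project description:

```scala
import org.apache.spark.sql.functions.col

val patients = Seq(
  ("p1", 1, "No"),
  ("p2", 1, "Yes"),
  ("p3", 2, "No")
).toDF("PatientID", "PatientCycleNum", "Counseling")

// Drop rows where PatientCycleNum == 1 AND Counseling == "No", keep the rest
patients.filter(!(col("PatientCycleNum") === 1 && col("Counseling") === "No"))
  .show()
```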

Reading a number of files with multiple lines and storing each of them as records in a DataFrame consisting of two columns, filename and content.
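
A sketch of one way to do this in spark-shell, using the built-in text source and its `wholetext` option (the input path is hypothetical):

```scala
import org.apache.spark.sql.functions.input_file_name

// wholetext makes each file a single record instead of one record per line
val docs = spark.read
  .option("wholetext", "true")
  .text("input/*.txt") // hypothetical path; point it at the actual input files
  .withColumn("filename", input_file_name())
  .withColumnRenamed("value", "content")
  .select("filename", "content")

docs.show(truncate = false)
```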

The most challenging one yet. Term frequency is calculated from 5 input documents. The goal is to find the document with the max TF for each word, as well as how many documents contain said word.
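
A rough sketch of the two aggregations, starting from an assumed (doc, word) tokenized form rather than the raw documents:

```scala
import org.apache.spark.sql.functions._

// Hypothetical starting point: one row per word occurrence per document
val tokens = Seq(
  ("doc1", "spark"), ("doc1", "spark"), ("doc1", "data"),
  ("doc2", "spark"), ("doc2", "data"),  ("doc2", "data")
).toDF("doc", "word")

// Term frequency of each word inside each document
val counts   = tokens.groupBy("doc", "word").count()
val docSizes = tokens.groupBy("doc").agg(count("*").alias("doc_size"))
val tf = counts.join(docSizes, "doc")
  .withColumn("tf", col("count") / col("doc_size"))

// Per word: the document holding the max TF, plus the document frequency.
// max over a struct compares by its first field (tf), so the doc rides along.
tf.groupBy("word")
  .agg(max(struct(col("tf"), col("doc"))).alias("best"),
       countDistinct("doc").alias("doc_freq"))
  .select(col("word"),
          col("best.doc").alias("max_tf_doc"),
          col("best.tf").alias("max_tf"),
          col("doc_freq"))
  .show()
```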

A simple merge of WordCount and TopN examples to find the 10 most used words in 5 input documents.
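
A compact sketch of the combination in spark-shell (the input path is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, desc, explode, lower, split}

// Each line of every input file becomes a row in the "value" column
val lines = spark.read.text("input/*.txt") // hypothetical path

lines.select(explode(split(lower(col("value")), "\\s+")).alias("word"))
  .filter(col("word") =!= "")     // drop empty tokens
  .groupBy("word").count()        // the WordCount part
  .orderBy(desc("count"))         // ...and the TopN part
  .limit(10)
  .show()
```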


Check out the equivalent Hadoop Examples here.
