<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Introduction to <span style="font-weight:bold; color:green">Spark</span></b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Word Count</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#1a">Java</a></li>
                <li><a href="#1b">Python</a></li>
                <li><a href="#1c">Scala</a></li>
            </ol>
        </li>
        <li><a href="#2">Average Rating Calculation</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Average ratings for each product</a></li>
                <li><a href="#2b">Average rating of all products</a></li>
                <li><a href="#2c">Filter items by their ratings</a></li>
                <li><a href="#2d">Average rating of product</a></li>
            </ol>
        </li>
        <li><a href="#3">Spark on AWS EMR</a></li>
        <li><a href="#4">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [None]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Word Count</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="1a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Java
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Preparing to run on the local Cloudera VM</b></p>

<p>Connect to your local Cloudera VM via SSH</p>

In [None]:
sudo ssh -p 2222 cloudera@127.0.0.1

<p>Create a directory to store data and spark files locally</p>

In [None]:
mkdir /home/cloudera/classes/spark

<p>Create a directory in HDFS for data</p>

In [None]:
hdfs dfs -mkdir -p /data/input

<p>Copy the 100-sample dataset from your local node to the Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/data/samples_100.json cloudera@127.0.0.1:/home/cloudera/classes/spark/

<p>Move the dataset to HDFS</p>

In [None]:
hdfs dfs -moveFromLocal /home/cloudera/classes/spark/samples_100.json hdfs:///data/input/

<p><b>Run and debug Spark code in IntelliJ IDEA</b></p>

<p>Java code for the word count example</p>

<p>JAVA 7</p>

<p>Add the following libraries to your project</p>

In [None]:
org.apache.spark:spark-core_2.10:1.6.0

In [None]:
org.json:json:20171018

In [None]:
package edu.classes.spark;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import org.json.JSONObject;

import java.util.Arrays;


class SplitReviewByWords implements FlatMapFunction<String, String> {
    public Iterable<String> call(String strReviewJSON) {

        JSONObject reviewJSON = new JSONObject(strReviewJSON);

        return Arrays.asList(reviewJSON.getString("reviewText").split(" "));
    }
}

class MapToWordTuple implements PairFunction<String, String, Integer> {

    public Tuple2<String, Integer> call(String word) {

        return new Tuple2<>(word, 1);
    }
}

class ReduceCountWords implements Function2<Integer, Integer, Integer> {

    public Integer call(Integer v1, Integer v2) {

        return v1 + v2;
    }

}

public class WordCount {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("SparkJavaWordCount").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> textFile = sc.textFile(args[0]);

        JavaRDD<String> words = textFile.flatMap(new SplitReviewByWords());

        JavaPairRDD<String, Integer> wordTuple = words.mapToPair(new MapToWordTuple());

        JavaPairRDD<String, Integer> wordCount = wordTuple.reduceByKey(new ReduceCountWords());

        //System.out.println(wordCount.collect());

        wordCount.saveAsTextFile(args[1]);

    }

}

<p>JAVA 8</p>

In [None]:
org.apache.spark:spark-core_2.11:2.3.0

In [None]:
org.json:json:20180130

In [None]:
package edu.classes.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.json.JSONObject;
import scala.Tuple2;

import java.util.Arrays;


public class WordCount {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("SparkJava8WordCount").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> textFile = sc.textFile(args[0]);

        JavaRDD<String> reviews = textFile.map(row -> new JSONObject(row).getString("reviewText"));

        JavaRDD<String> words = reviews.flatMap(review -> Arrays.asList(review.split(" ")).iterator());

        JavaPairRDD<String, Integer> wordTuple = words.mapToPair(word -> new Tuple2<>(word, 1));

        JavaPairRDD<String, Integer> wordCount = wordTuple.reduceByKey((x, y) -> x + y);

        //System.out.println(wordCount.collect());

        wordCount.saveAsTextFile(args[1]);

    }

}


<p>Short form</p>

In [None]:
JavaPairRDD<String, Integer> wordCount = textFile
        .map(row -> new JSONObject(row).getString("reviewText"))
        .flatMap(review -> Arrays.asList(review.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((x, y) -> x + y);

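<p>Build a jar file to run the word count Spark application on the local Cloudera VM. Before building, remove <span class="code-font">.setMaster("local[2]")</span> from the code above: a master set in <span class="code-font">SparkConf</span> takes precedence over the <span class="code-font">--master</span> option of <span class="code-font">spark-submit</span></p>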

<p>Copy the jar file to the Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/SparkWordCount.jar cloudera@127.0.0.1:/home/cloudera/classes/spark/

<p>Connect to the Cloudera VM via SSH and run the Spark Application</p>

In [None]:
spark-submit --master yarn --class edu.classes.spark.WordCount /home/cloudera/classes/spark/SparkWordCount.jar hdfs:///data/input/samples_100.json  hdfs:///data/output_spark_java_wordcount

<p>Print out the result</p>

In [None]:
hdfs dfs -cat /data/output_spark_java_wordcount/part-*

<p>Remove the output directory (required for re-running the application)</p>

In [None]:
hdfs dfs -rm -r /data/output_spark_java_wordcount

<a name="1b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Python
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1c">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
import sys
from pyspark import SparkContext, SparkConf
import json

conf = SparkConf().setAppName("SparkPythonWordCount")
sc = SparkContext(conf=conf)

textFile = sc.textFile(sys.argv[1])

def split_review(review_json_item):
    dict_review_item = json.loads(review_json_item)
    return dict_review_item["reviewText"].split(" ")

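# Pass the function itself so that it is called on every row
# (a lambda returning split_review would return the function object, not the words)
wordCount = textFile.flatMap(split_review) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda v1, v2: v1 + v2)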
        
wordCount.saveAsTextFile(sys.argv[2])
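
<p>To check what the three transformations compute before submitting the job, the same logic can be traced in plain Python without Spark. This is only a local sketch: the two sample records are hypothetical, and <span class="code-font">Counter</span> stands in for the <span class="code-font">map</span> and <span class="code-font">reduceByKey</span> steps.</p>

```python
import json
from collections import Counter

# Two hypothetical review records, one JSON object per line as in the dataset
sample_lines = [
    '{"asin": "A1", "overall": 5.0, "reviewText": "good good product"}',
    '{"asin": "B2", "overall": 3.0, "reviewText": "good enough"}',
]

def split_review(review_json_item):
    """Parse one JSON line and split its review text into words."""
    return json.loads(review_json_item)["reviewText"].split(" ")

# flatMap: each line yields several words, flattened into one sequence
words = [word for line in sample_lines for word in split_review(line)]

# map + reduceByKey: count the occurrences of each word
word_count = Counter(words)

print(sorted(word_count.items()))
```

<p>Splitting on a single space keeps punctuation attached to words and produces empty strings for repeated spaces, exactly as the Spark version does.</p>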

<a name="1c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Scala
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2">Next</a>
            </div>
        </div>
    </div>
</div>

<p>build.sbt</p>

In [None]:
name := "SparkScalaWordCount"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.json" % "json" % "20180130"

<p>Word count Scala code</p>

In [None]:
package edu.classes.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.json.JSONObject

object WordCount {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("SparkScalaWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile(args(0))

    val reviews = textFile.map(row => new JSONObject(row).getString("reviewText"))

    val words = reviews.flatMap(review => review.split(" "))

    val wordTuple = words.map(word => (word, 1))

    val wordCount = wordTuple.reduceByKey((x, y) => x + y)

    wordCount.take(5).foreach(println)

    // wordCount.saveAsTextFile(args(1))

  }

}

<p>Short form</p>

In [None]:
val wordCount = textFile
  .map(row => new JSONObject(row).getString("reviewText"))
  .flatMap(review => review.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Average Rating Calculation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Average ratings for each product
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run a Spark Application interactively in <span class="code-font code-key">pyspark</span></b></p>

<p>[OPTIONAL] Connect to your local Cloudera VM via SSH </p>

In [None]:
sudo ssh -p 2222 cloudera@127.0.0.1

<p>Copy the sample reviews from the local file system to HDFS (in a terminal)</p>

In [None]:
hdfs dfs -mkdir -p data/spark_rdd

In [None]:
hdfs dfs -copyFromLocal /YOUR_PATH/data/spark-rdd-intro/samples_100.json data/spark_rdd/samples_100.json

In [None]:
hdfs dfs -ls data/spark_rdd

<p>Start <span class="code-font code-key">pyspark</span> from the command line and execute the following commands in sequence</p>

In [None]:
file_path = "data/spark_rdd/samples_100.json"

In [None]:
# create RDD from the dataset in HDFS
rdd_review_100 = sc.textFile(file_path)

<p>Print out a single review</p>

In [None]:
rdd_review_100.take(1)

<p>Write a function that extracts the product ID and the rating from the JSON record of a review</p>

In [None]:
import json

def get_prod_rating(review_json_item):
    # Each line of the dataset is one JSON object
    dict_review_item = json.loads(review_json_item)
    return (dict_review_item["asin"], dict_review_item["overall"])

<p>Create an RDD of (product ID, rating) pairs, one pair per review</p>

In [None]:
rdd_prod_rating = rdd_review_100.map(lambda row: get_prod_rating(row))
rdd_prod_rating.take(2)

<p>Compute the average rating of each product</p>

In [None]:
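rdd_prod_rating.aggregateByKey(
    (0, 0),                                     # zero value: (sum of ratings, count)
    lambda x, value: (x[0] + value, x[1] + 1),  # fold a rating into a partition-local accumulator
    lambda x, y: (x[0] + y[0], x[1] + y[1])     # merge accumulators from different partitions
).mapValues(lambda x: x[0] / x[1]).collect()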
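
<p>The roles of the three arguments of <span class="code-font">aggregateByKey</span> can be traced by hand. The following plain-Python sketch imitates the computation on two hypothetical partitions; the data and the helper names (<span class="code-font">seq_op</span>, <span class="code-font">comb_op</span>, <span class="code-font">aggregate_partition</span>) are assumptions made for illustration, and Spark's real execution is distributed rather than a loop.</p>

```python
# Hypothetical (product ID, rating) pairs split across two partitions
partitions = [
    [("A1", 5.0), ("B2", 3.0), ("A1", 4.0)],
    [("A1", 3.0), ("B2", 4.0)],
]

zero_value = (0, 0)  # (sum of ratings, number of ratings)

def seq_op(acc, value):
    """Fold one rating into a partition-local accumulator."""
    return (acc[0] + value, acc[1] + 1)

def comb_op(acc1, acc2):
    """Merge the accumulators of two partitions."""
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

def aggregate_partition(pairs):
    """Apply seq_op per key inside a single partition."""
    result = {}
    for key, value in pairs:
        result[key] = seq_op(result.get(key, zero_value), value)
    return result

# Step 1: every partition is aggregated independently
partials = [aggregate_partition(p) for p in partitions]

# Step 2: partial results that share a key are merged with comb_op
merged = {}
for partial in partials:
    for key, acc in partial.items():
        merged[key] = comb_op(merged[key], acc) if key in merged else acc

# Step 3: mapValues turns each (sum, count) pair into an average
averages = {key: total / count for key, (total, count) in merged.items()}
print(averages)
```

<p>Carrying a (sum, count) pair instead of a running average is what makes the merge step correct: averages of partitions cannot be combined without knowing how many ratings each one covers.</p>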

<p><b>Run a Spark Application with <span class="code-font code-key">spark-submit</span></b></p>

<p>Create a Python file (<span class="code-font">avg_rating_by_prod.py</span>) with the following content</p>

In [None]:
import sys
from pyspark import SparkContext, SparkConf
import json

conf = SparkConf().setAppName("AvgRatingByProd")
sc = SparkContext(conf=conf)

rdd_review = sc.textFile(sys.argv[1])

def get_prod_rating(review_json_item):
    dict_review_item = json.loads(review_json_item)
    return (dict_review_item["asin"], dict_review_item["overall"])

rdd_prod_rating = rdd_review.map(lambda row: get_prod_rating(row))
rdd_avg_by_prod = rdd_prod_rating.aggregateByKey((0,0), 
                                        lambda x, value: (x[0] + value, x[1] + 1), 
                                        lambda x, y: (x[0] + y[0], x[1] + y[1])).mapValues(lambda x: x[0]/x[1])

rdd_avg_by_prod.saveAsTextFile(sys.argv[2])

<p>[OPTIONAL] Copy the Python file to the Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/avg_rating_by_prod.py cloudera@127.0.0.1:/home/cloudera/classes/spark/

<p>[OPTIONAL] Connect to the Cloudera VM via SSH</p>

<p>Run the Spark Application</p>

In [None]:
spark-submit --master yarn /YOUR_PATH/avg_rating_by_prod.py hdfs:///data/input/samples_100.json  hdfs:///data/output_spark_avg_rating_by_prod

<p>Check the output directory in HDFS</p>

In [None]:
hdfs dfs -ls /data/output_spark_avg_rating_by_prod/

<p>Print out the result</p>

In [None]:
hdfs dfs -cat /data/output_spark_avg_rating_by_prod/part-*

<p>Remove the output directory (required for re-running the application)</p>

In [None]:
hdfs dfs -rm -r /data/output_spark_avg_rating_by_prod

<p>[OPTIONAL] <b>Open Spark UI to monitor Spark applications</b></p>

In [None]:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_spark_history_server.html

<p>Add the following lines to <span class="code-font">/etc/spark/conf/spark-defaults.conf</span> in the Cloudera VM</p>

In [None]:
spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true

<p>[OPTIONAL] Use port forwarding from your local host to the local Cloudera VM to access the Spark History Server</p>

In [None]:
sudo ssh -N -f -L 9964:quickstart.cloudera:18088 cloudera@127.0.0.1 -p 2222

<p>Open the Spark History Server in a web browser</p>

<div class="code-block code-font"><a href="http://quickstart.cloudera:18088/">http://quickstart.cloudera:18088/</a></div>

<p>[OPTIONAL] or with port forwarding</p>

<div class="code-block code-font"><a href="http://localhost:9964">http://localhost:9964</a></div>

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Average rating of all products
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2c">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Create a Python file (<span class="code-font">avg_rating.py</span>) with the following content</p>

In [None]:
import sys
from pyspark import SparkContext, SparkConf
import json

conf = SparkConf().setAppName("AvgRating")
sc = SparkContext(conf=conf)

rdd_review = sc.textFile(sys.argv[1])

def get_prod_rating(review_json_item):
    rating = json.loads(review_json_item)["overall"]
    if isinstance(rating, float):
        return rating
    return None

rdd_prod_rating = rdd_review.map(lambda row: get_prod_rating(row)).filter(lambda rating: rating is not None)

rating_count = rdd_prod_rating.aggregate((0,0), 
                                       lambda x, value: (x[0] + value, x[1] + 1),
                                       lambda x, y: (x[0] + y[0], x[1] + y[1]))

avg_rating = sc.parallelize([rating_count[0] / rating_count[1]])

avg_rating.saveAsTextFile(sys.argv[2])
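
<p>The <span class="code-font">isinstance(rating, float)</span> check depends on how the <span class="code-font">json</span> module maps JSON numbers: a value written as <span class="code-font">5.0</span> is parsed as a float, while <span class="code-font">5</span> is parsed as an int and would be filtered out. The sketch below compares the check used in the script with a more lenient one; <span class="code-font">get_rating_lenient</span> and the sample lines are hypothetical and only illustrate the difference.</p>

```python
import json
from numbers import Real

def get_rating_strict(review_json_item):
    """Same check as in the script: keep float ratings only."""
    rating = json.loads(review_json_item)["overall"]
    return rating if isinstance(rating, float) else None

def get_rating_lenient(review_json_item):
    """Hypothetical variant: accept any real number except booleans."""
    rating = json.loads(review_json_item)["overall"]
    if isinstance(rating, Real) and not isinstance(rating, bool):
        return float(rating)
    return None

# Hypothetical records: a float rating, an integer rating, a non-numeric rating
lines = ['{"overall": 5.0}', '{"overall": 4}', '{"overall": "n/a"}']

strict = [get_rating_strict(line) for line in lines]
lenient = [get_rating_lenient(line) for line in lines]

print(strict)   # the integer rating is dropped
print(lenient)  # the integer rating is converted to float
```

<p>Whether the strict check is sufficient depends on the dataset: if every rating is serialized with a decimal point, both variants give the same result.</p>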

<p>Copy the Python file to the Cloudera VM and run it with the <span class="code-font code-key">spark-submit</span> command</p>

In [None]:
spark-submit --master yarn /home/cloudera/classes/spark/avg_rating.py hdfs:///data/input/samples_100.json  hdfs:///data/output_spark_avg_rating

<p>Print out the result</p>

In [None]:
hdfs dfs -cat /data/output_spark_avg_rating/part-*

<p>Remove the output directory (required for re-running the application)</p>

In [None]:
hdfs dfs -rm -r /data/output_spark_avg_rating

<a name="2c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Filter items by their ratings
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2d">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Create a Python file (<span class="code-font">filter_by_threshold.py</span>) with the following content</p>

In [None]:
import sys
from pyspark import SparkContext, SparkConf
import json

def check_and_return_rating_arg(rating):
    try:
        return float(rating) 
    except ValueError:
        return 0.0

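# Global variable: captured by filter_by_rating and shipped to the executors with it
rating_threshold = check_and_return_rating_arg(sys.argv[3])

# The SparkContext must be created before reading data (the application name is arbitrary)
conf = SparkConf().setAppName("FilterByRating")
sc = SparkContext(conf=conf)

rdd_review = sc.textFile(sys.argv[1])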

def filter_by_rating(review_json_item):
    rating = json.loads(review_json_item)["overall"]
    if isinstance(rating, float) and rating >= rating_threshold:
        return True
    return False

rdd_items = rdd_review.filter(lambda row: filter_by_rating(row))
rdd_items.saveAsTextFile(sys.argv[2])
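
<p>The threshold parsing and the filter predicate can be tried locally before submitting the job. This is a sketch in plain Python under stated assumptions: the sample records are hypothetical, and <span class="code-font">make_rating_filter</span> is a helper introduced here to avoid the global variable; it is not part of the script above.</p>

```python
import json

def check_and_return_rating_arg(rating):
    """Convert the command-line threshold to float; fall back to 0.0."""
    try:
        return float(rating)
    except ValueError:
        return 0.0

def make_rating_filter(threshold):
    """Build a predicate equivalent to filter_by_rating for a given threshold."""
    def predicate(review_json_item):
        rating = json.loads(review_json_item)["overall"]
        return isinstance(rating, float) and rating >= threshold
    return predicate

# Hypothetical review records, one JSON object per line
sample_lines = [
    '{"asin": "A1", "overall": 5.0}',
    '{"asin": "B2", "overall": 2.0}',
    '{"asin": "C3", "overall": 3.0}',
]

# "3" mimics the last argument of the spark-submit command
threshold = check_and_return_rating_arg("3")
kept = list(filter(make_rating_filter(threshold), sample_lines))
kept_ids = [json.loads(line)["asin"] for line in kept]
print(kept_ids)

# A value that cannot be parsed falls back to 0.0, so nothing is filtered out
fallback = check_and_return_rating_arg("abc")
print(fallback)
```

<p>Because an invalid threshold silently becomes 0.0, a mistyped argument returns every review with a float rating; raising an error instead may be preferable in a real job.</p>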

<p>Copy the Python file to the Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/filter_by_threshold.py cloudera@127.0.0.1:/home/cloudera/classes/spark/

<p>Connect to the Cloudera VM via SSH and run the Spark Application</p>

In [None]:
spark-submit --master yarn /home/cloudera/classes/spark/filter_by_threshold.py  hdfs:///data/input/samples_100.json  hdfs:///data/output_spark_filter_by_threshold 3

<p>Print out the result</p>

In [None]:
hdfs dfs -cat /data/output_spark_filter_by_threshold/part-*

<p>Remove the output directory (required for re-running the application)</p>

In [None]:
hdfs dfs -rm -r /data/output_spark_filter_by_threshold

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Spark on AWS EMR</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Deploy an EMR cluster with 3 worker instances and run the apps from the previous section</p></div>
</div>

<p>To launch a Spark EMR Cluster, use the following command</p>

In [None]:
%%bash
aws emr create-cluster \
    --name "Spark_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Spark Name=Zeppelin \
    --log-uri s3://YOUR_BUCKET/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=YOUR_KEYS,SubnetId="YOUR_SUBNET" \
    --configurations file:///YOUR_PATH/config/hdfs-config.json

In [None]:
# YOUR INSTRUCTIONS

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<a href="https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html">Spark Programming Guide</a><br>
<a href="https://spark.apache.org/docs/latest/submitting-applications.html">Submitting Applications</a><br>
<a href="https://hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-java/">Setting up a Spark Development Environment with Java</a><br>
<a href="https://hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-scala/">Setting up a Spark Development Environment with Scala</a>