<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Operations on <span style="font-weight:bold; color:green">Spark RDD</span> using Python</b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">Distributed dataset</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#1a">Transformations</a></li>
                <li><a href="#1b">Actions</a></li>
            </ol>
        </li>
        <li><a href="#2">Distributed dataset of (K, V) pairs</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Transformations</a></li>
                <li><a href="#2b">Actions</a></li>
            </ol>
        </li>
        <li><a href="#3">References</a></li>
    </ol>
</div>

<p>[OPTIONAL] <b>Environment Setup</b></p>

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/opt/cloudera/parcels/SPARK2/lib/spark2"
os.environ["PYSPARK_PYTHON"]="/opt/rh/rh-python36/root/usr/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/opt/rh/rh-python36/root/usr/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))

<p>Start the Spark context</p>

In [None]:
import pyspark

conf = pyspark.SparkConf() \
        .setAppName("basicOperationsRDDApp") \
        .setMaster("yarn") \
        .set("spark.submit.deployMode", "client")

sc = pyspark.SparkContext(conf=conf)

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Distributed dataset</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="1a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Transformations
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1b">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Create an <b><i>RDD</i></b> from an initial list of numbers and print it using the <b><i>collect</i></b> action</p>

In [None]:
# Initial list
data = [1, 2, 3, 4, 5]

In [None]:
# Create RDD - distributed data
data_rdd = sc.parallelize(data)
data_rdd

In [None]:
# Collect all RDD data on the Spark driver
data_rdd.collect()

<p>Create an <b><i>RDD</i></b> from a text file and print it using the <b><i>take</i></b> action</p>

In [None]:
# Path to a file in HDFS
file_path = "data/spark_rdd/samples_100.json"

In [None]:
# Create RDD
data_rdd = sc.textFile(file_path)
data_rdd

In [None]:
# Take 2 records from an RDD to the Spark driver
data_rdd.take(2)

<p><b><i>map</i></b></p>

In [None]:
data = [1, 2, 3, 4, 5]
data_rdd = sc.parallelize(data)

In [None]:
# Increment a value by 1
data_map_rdd = data_rdd.map(lambda x: x + 1)

In [None]:
# Collect data on the Spark driver
data_map_rdd.collect()

<p><b><i>flatMap</i></b></p>

In [None]:
# Create RDD
data_rdd = sc.textFile(file_path)

In [None]:
# Take 2 records from an RDD to the Spark driver
data_rdd.take(2)

In [None]:
data_map_rdd = data_rdd.map(lambda x: x.split(" "))
data_map_rdd.take(2)

In [None]:
data_flatmap_rdd = data_rdd.flatMap(lambda x: x.split(" "))
data_flatmap_rdd.take(2)
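
<p>The difference between the two cells above can be reproduced without Spark. The following is a minimal plain-Python sketch, not PySpark code: the two input strings are hypothetical stand-ins for records of the text file, and list operations stand in for the RDD transformations.</p>

```python
from itertools import chain

# Hypothetical records standing in for lines of the input file
lines = ["a b c", "d e"]

# map: exactly one output element per input element (here, a list of tokens)
mapped = [line.split(" ") for line in lines]

# flatMap: each input element yields zero or more outputs,
# and all outputs are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['a', 'b', 'c'], ['d', 'e']]
print(flat_mapped)  # ['a', 'b', 'c', 'd', 'e']
```

<p>So <b><i>take(2)</i></b> after <b><i>map</i></b> returns two lists of tokens, while after <b><i>flatMap</i></b> it returns two individual tokens.</p>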

<p><b>filter</b></p>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7]
data_rdd = sc.parallelize(data)
data_filter_rdd = data_rdd.filter(lambda x: x % 2 == 0)

data_filter_rdd.collect()

<p><b>sortBy</b></p>

In [None]:
data = ["f", "a", "h", "b", "c"]

data_rdd = sc.parallelize(data)

data_sortby_rdd = data_rdd.sortBy(lambda x: x, ascending=False, numPartitions=3)
data_sortby_rdd.collect()

<p><b>sample</b></p>

In [None]:
# Sample without replacement
data = [1, 2, 3, 4, 5, 6, 7]
data_rdd = sc.parallelize(data)
data_sample_rdd = data_rdd.sample(withReplacement=False, fraction=0.8)  # pass seed=<int> to make the sample reproducible

data_sample_rdd.collect()

In [None]:
# Sample with replacement
data_sample_repl_rdd = data_rdd.sample(withReplacement=True, fraction=0.8)  # pass seed=<int> to make the sample reproducible

data_sample_repl_rdd.collect()

<p><b>union</b></p>

In [None]:
data_1 = [1, 2, 3, 4]
data_2 = [3, 4, 5, 6]

data1_rdd = sc.parallelize(data_1)
data2_rdd = sc.parallelize(data_2)

data_union_rdd = data1_rdd.union(data2_rdd)

data_union_rdd.collect()

<p><b>intersection</b></p>

In [None]:
data_1 = [1, 2, 3, 4]
data_2 = [3, 4, 5, 6]

data1_rdd = sc.parallelize(data_1)
data2_rdd = sc.parallelize(data_2)

data_intersection_rdd = data1_rdd.intersection(data2_rdd)

data_intersection_rdd.collect()

<p><b>distinct</b></p>

In [None]:
data = [1, 2, 2, 4, 4, 6, 7]
data_rdd = sc.parallelize(data)
data_distinct_rdd = data_rdd.distinct()

data_distinct_rdd.collect()

<b>mapPartitions</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

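def funct(partition):
    # Gather all records of one partition into a single list
    part = list()
    for record in partition:
        part.append(record)
    # Return an iterable with one element: the list of the partition's records
    return [part]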

data_rdd = sc.parallelize(data, 3)
data_mappart_rdd = data_rdd.mapPartitions(funct)
data_mappart_rdd.collect()

<b>mapPartitionsWithIndex</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def funct(part_id, partition):
    for record in partition:
        yield part_id, record

data_rdd = sc.parallelize(data, 3)
data_mappart_rdd = data_rdd.mapPartitionsWithIndex(funct)
data_mappart_rdd.collect()

<b>cartesian</b>

In [None]:
data1 = [1, 2, 3, 4]
data2 = ["a", "b", "c", "d"]

data1_rdd = sc.parallelize(data1, 2)
data2_rdd = sc.parallelize(data2, 2)

data_cartesian_rdd = data1_rdd.cartesian(data2_rdd)

data_cartesian_rdd.collect()

<b>glom</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)
data_rdd.glom().collect()

<b>coalesce</b>

In [None]:
data_coarse_rdd = data_rdd.coalesce(2)
data_coarse_rdd.glom().collect()

<b>repartition</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 2)
data_rdd.glom().collect()

In [None]:
data_repart_incr_rdd = data_rdd.repartition(4)
data_repart_incr_rdd.glom().collect()

<a name="1b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Actions
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2">Next</a>
            </div>
        </div>
    </div>
</div>

<b>reduce</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_reduce = data_rdd.reduce(lambda x, y: x + y)
data_reduce

In [None]:
def summ(x, y):
    return x + y

data_reduce = data_rdd.reduce(summ)
data_reduce

<b>fold</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_fold = data_rdd.fold(0, lambda x, y: x + y)
data_fold

In [None]:
data_fold_10 = data_rdd.fold(10, lambda x, y: x + y)
data_fold_10
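
<p>The result with a zero value of 10 may look surprising. A plain-Python model of how <b><i>fold</i></b> applies the zero value (once per partition and once more when merging the partial results) explains it. This is a sketch, not PySpark code: <i>fold_model</i> is a hypothetical helper, and the partition layout is an assumption used only for illustration.</p>

```python
from functools import reduce

def fold_model(partitions, zero_value, op):
    """Plain-Python model of RDD.fold over an explicit list of partitions."""
    # Each partition is folded on its own, starting from the zero value
    partials = [reduce(op, part, zero_value) for part in partitions]
    # The partial results are folded again, starting from the zero value once more
    return reduce(op, partials, zero_value)

# Assumed layout of the nine numbers over 4 partitions (illustration only)
partitions = [[1, 2], [3, 4], [5, 6], [7, 8, 9]]

def add(x, y):
    return x + y

fold_0 = fold_model(partitions, 0, add)
fold_10 = fold_model(partitions, 10, add)

print(fold_0)   # 45
print(fold_10)  # 95 = 45 + 10 * (4 partitions + 1 final merge)
```

<p>For a sum the result depends only on the number of partitions, not on how the elements are distributed among them, which is why the zero value is normally the neutral element of the operation (0 for addition).</p>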

<b>count</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_count = data_rdd.count()
data_count

<p><b>countByValue</b></p>

In [None]:
pers_purchases = ["car", "hotel", "smartphone", "laptop", "car", "laptop", "laptop"]
pers_purchases_rdd = sc.parallelize(pers_purchases, 2)

count_value = pers_purchases_rdd.countByValue()
count_value

<b>first</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_first = data_rdd.first()
data_first

<b>take</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_take = data_rdd.take(5)
data_take

<b>takeSample</b>

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_rdd = sc.parallelize(data, 4)

data_take_sample = data_rdd.takeSample(withReplacement=False, num=5)  # pass seed=<int> to make the sample reproducible
data_take_sample

<b>takeOrdered</b>

In [None]:
data = [5, 2, 6, 4, 1, 3, 7, 9, 8]

data_rdd = sc.parallelize(data, 4)
data_take_desc_ordered = data_rdd.takeOrdered(num=4, key=lambda x: -x)
data_take_desc_ordered

In [None]:
data_take_asc_ordered = data_rdd.takeOrdered(num=4, key=lambda x: x)
data_take_asc_ordered

<b>aggregate</b>

In [None]:
data = [1, 2, 3, 4]
data_rdd = sc.parallelize(data, 2)

data_agg = data_rdd.aggregate((0, 0),
                              (lambda x, value: (x[0] + value, x[1] + 1)),
                              (lambda x, y: (x[0] + y[0], x[1] + y[1])))
data_agg

In [None]:
data_agg = data_rdd.aggregate((2, 0),
                              (lambda x, value: (x[0] + value, x[1] + 1)),
                              (lambda x, y: (x[0] + y[0], x[1] + y[1])))
data_agg
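
<p>The two <b><i>aggregate</i></b> calls above differ only in the zero value. The sketch below models the computation in plain Python to show where the extra amount comes from: the zero value seeds every partition and also the final merge. <i>aggregate_model</i> is a hypothetical helper, and the split of the data into two partitions is an assumption.</p>

```python
from functools import reduce

def aggregate_model(partitions, zero_value, seq_op, comb_op):
    """Plain-Python model of RDD.aggregate over an explicit list of partitions."""
    # seq_op folds the elements of each partition into an accumulator
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # comb_op merges the per-partition accumulators, again seeded with the zero value
    return reduce(comb_op, partials, zero_value)

# Assumed layout of [1, 2, 3, 4] over 2 partitions (illustration only)
partitions = [[1, 2], [3, 4]]

def seq_op(acc, value):
    # acc is a (sum, count) pair
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1])

agg_zero = aggregate_model(partitions, (0, 0), seq_op, comb_op)
agg_two = aggregate_model(partitions, (2, 0), seq_op, comb_op)

print(agg_zero)  # (10, 4)
print(agg_two)   # (16, 4): 10 + 2 * (2 partitions + 1 final merge)
```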

<b>saveAsTextFile</b>

In [None]:
# Output directory in HDFS; saveAsTextFile raises an error if it already exists
output_file_path = "data/spark_rdd/samples_100_split.json"

In [None]:
data_file_rdd = sc.textFile(file_path, 2)

data_map_rdd = data_file_rdd.flatMap(lambda x: x.split())

data_map_rdd.saveAsTextFile(output_file_path)

In [None]:
data_file_output_rdd = sc.textFile(output_file_path, 2)
data_file_output_rdd.take(2)

<p>Example: <b>WordCount</b></p>

In [None]:
data_map_pair_rdd = data_map_rdd.map(lambda x: (x, 1))
data_map_pair_rdd.take(5)

In [None]:
data_map_pair_reduce_rdd = data_map_pair_rdd.reduceByKey(lambda x1, x2: x1+x2)
data_map_pair_reduce_rdd.take(10)

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Distributed dataset of (K, V) pairs</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Transformations
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>

<b>groupByKey</b>

In [None]:
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), 
             (1, "laptop"), (2, "TV"), (2, "car"), 
             (3, "laptop"), (3, "laptop"), (3, "hotel")]

purchases_rdd = sc.parallelize(purchases, 2)

groupByKey_rdd = purchases_rdd.groupByKey()
groupByKey_rdd.collect()

In [None]:
[(k, list(v)) for k, v in groupByKey_rdd.collect()]

<b>reduceByKey</b>

In [None]:
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), 
             (1, "laptop"), (2, "TV"), (2, "car"), 
             (3, "laptop"), (3, "laptop"), (3, "hotel")]

purchases_rdd = sc.parallelize(purchases, 2)

reduce_key_rdd = purchases_rdd.reduceByKey(lambda x, y: x + " " + y)

reduce_key_rdd.collect()

<b>foldByKey</b>

In [None]:
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), 
             (1, "laptop"), (2, "TV"), (2, "car"), 
             (3, "laptop"), (3, "laptop"), (3, "hotel")]

purchases_rdd = sc.parallelize(purchases, 2)

fold_key_rdd = purchases_rdd.foldByKey("x", lambda x, y: x + " " + y)

fold_key_rdd.collect()
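
<p>The zero value "x" in the cell above is applied once per key in every partition that contains the key, so the output depends on how the records are partitioned. The following plain-Python sketch models this behaviour. <i>fold_by_key_model</i> is a hypothetical helper and both partition layouts are assumptions; in Spark, the order in which partial results are merged is additionally not guaranteed.</p>

```python
def fold_by_key_model(partitions, zero_value, func):
    """Plain-Python model of foldByKey over an explicit list of partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            # The first value of a key within a partition is combined with the zero value
            local[key] = func(local.get(key, zero_value), value)
        # Per-partition results for the same key are merged with the same function
        for key, acc in local.items():
            merged[key] = func(merged[key], acc) if key in merged else acc
    return merged

purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"),
             (1, "laptop"), (2, "TV"), (2, "car"),
             (3, "laptop"), (3, "laptop"), (3, "hotel")]

def join(x, y):
    return x + " " + y

# Assumed layout A: no key is split across partitions
keys_not_split = fold_by_key_model([purchases[:4], purchases[4:]], "x", join)
# Assumed layout B: key 2 appears in both partitions
key_2_split = fold_by_key_model([purchases[:5], purchases[5:]], "x", join)

print(keys_not_split[2])  # x TV car
print(key_2_split[2])     # x TV x car
```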

<b>distinct</b>

In [None]:
persons = [(1, "Ivanov"), (2, "Petrov"), (3, "Jamson"), (4, "Black"), (4, "Black")]
persons_rdd = sc.parallelize(persons, 2)

distinct_rdd = persons_rdd.distinct()
distinct_rdd.collect()

<b>keys</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)

data_keys_rdd = data_rdd.keys()
data_keys_rdd.collect()

<b>values</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)

data_values_rdd = data_rdd.values()
data_values_rdd.collect()

<b>mapValues</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)
data_mapValue_rdd = data_rdd.mapValues(lambda x: x + 10)
data_mapValue_rdd.collect()

<b>flatMapValues</b>

In [None]:
data = [("f", [2, 1]), ("a", [3,1]), ("h", [3,4,5]), ("b", [6]), ("c", [1])]

data_rdd = sc.parallelize(data)
data_mapValue_rdd = data_rdd.flatMapValues(lambda x: x)
data_mapValue_rdd.collect()

<b>join</b>

In [None]:
persons = [(1, "Ivanov"), (2, "Petrov"), (3, "Jamson"), (4, "Black")]
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), (1, "laptop"), (2, "TV"), 
             (2, "car"), (3, "laptop"), (3, "laptop"), (3, "hotel"), (5, "TV")]

persons_rdd = sc.parallelize(persons, 2)
purchases_rdd = sc.parallelize(purchases, 4)

join_rdd = persons_rdd.join(purchases_rdd, numPartitions=2)
join_rdd.collect()

In [None]:
join_left_rdd = persons_rdd.leftOuterJoin(purchases_rdd, numPartitions=2)
join_left_rdd.collect()

In [None]:
join_right_rdd = persons_rdd.rightOuterJoin(purchases_rdd, numPartitions=2)
join_right_rdd.collect()

<b>cogroup</b>

In [None]:
persons = [(1, "Ivanov"), (2, "Petrov"), (3, "Jamson"), (4, "Black")]
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), (1, "laptop"), (2, "TV"), 
             (2, "car"), (3, "laptop"), (3, "laptop"), (3, "hotel"), (5, "TV")]

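# Rebuild the RDDs from the lists above so that this cell does not depend on the join cell
persons_rdd = sc.parallelize(persons, 2)
purchases_rdd = sc.parallelize(purchases, 4)

cogroup_rdd = persons_rdd.cogroup(purchases_rdd, numPartitions=2)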

cogroup_rdd.collect()

In [None]:
[(k, [list(el) for el in v]) for k, v in cogroup_rdd.collect()]

<b>partitionBy</b>

In [None]:
purchases_price = [("car", 1), ("hotel", 2), ("smartphone", 2), ("laptop", 3), ("TV", 4), 
                   ("car", 2), ("laptop", 1), ("laptop", 3), ("hotel", 1)]
purchases_price_rdd = sc.parallelize(purchases_price, 2)

purchases_price_rdd.glom().collect()

In [None]:
part_rdd = purchases_price_rdd.partitionBy(2)
part_rdd.glom().collect()

<b>aggregateByKey</b>

In [None]:
pers_purchases = [("car", 1), ("hotel", 2), ("smartphone", 2), 
                  ("laptop", 3), ("TV", 4), ("car", 2), 
                  ("laptop", 1), ("laptop", 3), ("hotel", 1)]

pers_purchases_rdd = sc.parallelize(pers_purchases, 4)

agg_key_rdd = pers_purchases_rdd.aggregateByKey((0, 0), 
                                                (lambda x, value: (x[0] + value, x[1] + 1)), 
                                                (lambda x, y: (x[0] + y[0], x[1] + y[1])))
agg_key_rdd.collect()

<b>combineByKey</b>

In [None]:
purchases_price = [("car", 1.0), ("hotel", 2.0), ("smartphone", 2.0), 
                   ("laptop", 3.0), ("TV", 4.0), ("car", 2.0), 
                   ("laptop", 1.0), ("laptop", 3.0), ("hotel", 1.0)]

purchases_price_rdd = sc.parallelize(purchases_price, 4).persist()

purchases_price_rdd.glom().collect()

In [None]:
combine_key_rdd = purchases_price_rdd.combineByKey((lambda value: (value, 1)), 
                                                  (lambda x, value: (x[0] + value, x[1] + 1)), 
                                                  (lambda x, y: (x[0] + y[0], x[1] + y[1])))
combine_key_rdd.collect()

In [None]:
combine_key_rdd = purchases_price_rdd.combineByKey((lambda value: (value, 2)), 
                                                  (lambda x, value: (x[0] + value, x[1] + 1)), 
                                                  (lambda x, y: (x[0] + y[0], x[1] + y[1])))
combine_key_rdd.collect()
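
<p>In the second call the counter starts at 2 instead of 1. Because the first function (the combiner creator) runs once for each key in each partition where the key occurs, the inflated count grows with the number of partitions a key is spread over. The sketch below models this in plain Python. <i>combine_by_key_model</i> is a hypothetical helper and the partition layout is an assumption; the actual layout can be inspected with <b><i>glom</i></b> as in the cell above.</p>

```python
def combine_by_key_model(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python model of combineByKey over an explicit list of partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                # Key already seen in this partition: fold the value into its combiner
                local[key] = merge_value(local[key], value)
            else:
                # First occurrence of the key in this partition: create a combiner
                local[key] = create_combiner(value)
        # Merge the per-partition combiners of the same key
        for key, combiner in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged

# Assumed layout of the nine records over 4 partitions (illustration only)
partitions = [
    [("car", 1.0), ("hotel", 2.0)],
    [("smartphone", 2.0), ("laptop", 3.0)],
    [("TV", 4.0), ("car", 2.0)],
    [("laptop", 1.0), ("laptop", 3.0), ("hotel", 1.0)],
]

def merge_value(acc, value):
    # acc is a (sum, count) pair
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(a, b):
    return (a[0] + b[0], a[1] + b[1])

count_from_1 = combine_by_key_model(partitions, lambda v: (v, 1), merge_value, merge_combiners)
count_from_2 = combine_by_key_model(partitions, lambda v: (v, 2), merge_value, merge_combiners)

print(count_from_1["laptop"])  # (7.0, 3)
print(count_from_2["laptop"])  # (7.0, 5): one extra count per partition holding "laptop"
```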

<b>sortByKey</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)

data_sortbykey_rdd = data_rdd.sortByKey(ascending=True, numPartitions=3)
data_sortbykey_rdd.collect()

<b>sortBy</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)

data_sortby_rdd = data_rdd.sortBy(lambda x: x[0], ascending=True, numPartitions=3)
data_sortby_rdd.collect()

In [None]:
data_sortby_rdd = data_rdd.sortBy(lambda x: x[1], ascending=True, numPartitions=3)
data_sortby_rdd.collect()

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Actions
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#3">Next</a>
            </div>
        </div>
    </div>
</div>

<b>countByKey</b>

In [None]:
purchases = [(1, "car"), (1, "hotel"), (1, "smartphone"), 
             (1, "laptop"), (2, "TV"), (2, "car"), 
             (3, "laptop"), (3, "laptop"), (3, "hotel")]

purchases_rdd = sc.parallelize(purchases, 2)

count_key = purchases_rdd.countByKey()

count_key

<b>takeOrdered</b>

In [None]:
data = [("f", 2), ("a", 3), ("h", 5), ("b", 6), ("c", 1)]

data_rdd = sc.parallelize(data)
take_ordered = data_rdd.takeOrdered(5, key=lambda x: -x[1])
take_ordered

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a href="http://spark.apache.org/docs/latest/api/python/pyspark.html#module-pyspark">pyspark package</a><br>
<a href="http://spark.apache.org/docs/latest/programming-guide.html">Spark Programming Guide</a>