<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align = "center"> Spark Fundamentals I - Introduction to Spark </h1>
<h2 align = "center"> Python - Working with RDD operations </h2>
<br align = "left">

**Related free online courses:**  

Related courses can be found in the following learning paths:

- [Spark Fundamentals path](http://cocl.us/Spark_Fundamentals_Path)
- [Big Data Fundamentals path](http://cocl.us/Big_Data_Fundamentals_Path)

<img src = "http://spark.apache.org/images/spark-logo.png", height = 100, align = 'left'>

## Analyzing a log file

First, let's download the data that we will working with in this lab.

In [1]:
# download the data from the IBM server
# this may take ~30 seconds depending on your interent speed
!wget --quiet https://ibm.box.com/shared/static/j8skrriqeqw66f51iyz911zyqai64j2g.zip
print("Data Downloaded!")

Data Downloaded!


In [2]:
# unzip the folder's content into "resources" directory
# this may take ~30 seconds depending on your internet speed
!unzip -q -o -d /resources/jupyterlab/labs/BD0211EN/ j8skrriqeqw66f51iyz911zyqai64j2g.zip
print("Data Extracted!")

Data Extracted!


In [4]:
# list the extracted files
!ls -1 /resources/jupyterlab/labs/BD0211EN/LabData/

README.md
followers.txt
notebook.log
nyctaxi.csv
nyctaxi100.csv
nyctaxisub.csv
nycweather.csv
pom.xml
taxistreams.py
users.txt


Now, let's create an RDD by loading the log file that we analyze in the Scala version of this lab.

In [1]:
logFile = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/notebook.log")
type(logFile)

pyspark.rdd.RDD

### <span style="color: red">YOUR TURN:</span> 

#### In the cell below, filter out the lines that contains INFO

In [9]:
# WRITE YOUR CODE BELOW
info = logFile.filter(lambda line: 'INFO' in line)
type(info)
info.take(5)

['15/10/14 14:29:21 INFO SparkContext: Running Spark version 1.4.1',
 '15/10/14 14:29:22 INFO SecurityManager: Changing view acls to: notebook',
 '15/10/14 14:29:22 INFO SecurityManager: Changing modify acls to: notebook',
 '15/10/14 14:29:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(notebook); users with modify permissions: Set(notebook)',
 '15/10/14 14:29:23 INFO Slf4jLogger: Slf4jLogger started']

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info = logFile.filter(lambda line: "INFO" in line)
</textarea>

#### Count the lines:

In [14]:
# WRITE YOUR CODE BELOW
info.collect().__len__()
# info.count()

13438

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.count()
</textarea>

#### Count the lines with "spark" in it by combining transformation and action.

In [17]:
# WRITE YOUR CODE BELOW
logFile.filter(lambda line: 'spark' in line).count()

2238

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.filter(lambda line: "spark" in line).count()
</textarea>

#### Fetch those lines as an array of Strings

In [22]:
# WRITE YOUR CODE BELOW
# logFile.filter(lambda line: 'spark' in line).collect()
logFile.filter(lambda line: 'spark' in line).take(10)

["Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties",
 '15/10/14 14:29:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.17.0.22:53333]',
 "15/10/14 14:29:23 INFO Utils: Successfully started service 'sparkDriver' on port 53333.",
 '15/10/14 14:29:23 INFO DiskBlockManager: Created local directory at /tmp/spark-fe150378-7bad-42b6-876b-d14e2c193eb6/blockmgr-c142f2f1-ebb6-4612-945b-0a67d156230a',
 '15/10/14 14:29:23 INFO HttpFileServer: HTTP File server directory is /tmp/spark-fe150378-7bad-42b6-876b-d14e2c193eb6/httpd-ed3f4ab0-7218-48bc-9d8a-3981b1cfe574',
 "15/10/14 14:29:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35726.",
 "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties",
 '15/10/15 15:33:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.17.0.22:47412]',
 "15/10/15 15:33:42 INFO Util

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.filter(lambda line: "spark" in line).collect()
</textarea>

View the graph of an RDD using this command:

In [24]:
print(info.toDebugString())

b'(2) PythonRDD[8] at collect at <ipython-input-12-20a5af6a67d2>:2 []\n |  /resources/jupyterlab/labs/BD0211EN/LabData/notebook.log MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []\n |  /resources/jupyterlab/labs/BD0211EN/LabData/notebook.log HadoopRDD[6] at textFile at NativeMethodAccessorImpl.java:0 []'


## Joining RDDs

Next, you are going to create RDDs for the same README and the POM files that we used in the Scala version. 

In [25]:
readmeFile = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/README.md")
pomFile = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/pom.xml")

How many Spark keywords are in each file?

In [60]:
# Lines with spark keyword
# print(readmeFile.filter(lambda line: "Spark" in line).count())
# print(pomFile.filter(lambda line: "Spark" in line).count())
# readmeFile.filter(lambda line: "Spark" in line).collect()

In [61]:
# Find the occurence of Spark word
spark_line = readmeFile.filter(lambda line: "Spark" in line)
spark_word = spark_line.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
spark_word.sortByKey().lookup('Spark')

pom_line = pomFile.filter(lambda line: "Spark" in line)
pom_word = pom_line.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
pom_word.sortByKey().lookup('Spark')

[1]

Now do a WordCount on each RDD so that the results are (K,V) pairs of (word,count)

In [82]:
myReadmeCount = readmeFile.flatMap(lambda line: line.split()). \
            map(lambda word: (word, 1)). \
            reduceByKey(lambda a, b: a + b)
# myReadmeCount.collect()

myPomCount = readmeFile.flatMap(lambda line: line.split()). \
            map(lambda word: (word, 1)). \
            reduceByKey(lambda a, b: a + b)
# myReadmeCount.collect()

In [64]:
readmeCount = readmeFile.                    \
    flatMap(lambda line: line.split("   ")).   \
    map(lambda word: (word, 1)).             \
    reduceByKey(lambda a, b: a + b)
    
pomCount = pomFile.                          \
    flatMap(lambda line: line.split("   ")).   \
    map(lambda word: (word, 1)).            \
    reduceByKey(lambda a, b: a + b)

To see the array for either of them, just call the collect function on it.

In [80]:
print("Readme Count\n")
# print(readmeCount.collect())

Readme Count



In [79]:
print("Pom Count\n")
# print(pomCount.collect())

Pom Count



The join function combines the two datasets (K,V) and (K,W) together and get (K, (V,W)). Let's join these two counts together.

In [83]:
# joined = readmeCount.join(pomCount)
joined = myReadmeCount.join(myPomCount)

Print the value to the console

In [95]:
# joined.collect()
joined.take(10)

[('#', (1, 1)),
 ('Apache', (1, 1)),
 ('Spark', (14, 14)),
 ('is', (6, 6)),
 ('It', (2, 2)),
 ('provides', (1, 1)),
 ('high-level', (1, 1)),
 ('APIs', (1, 1)),
 ('in', (5, 5)),
 ('Scala,', (1, 1))]

Let's combine the values together to get the total count

In [104]:
# 2-D join table: add the 1st and 2nd value of tuple
# k[0] -> key | # k[1] -> value tuple | # k[1][0] and k[1][1] are two values of the value pair
# so to add valuees of the value pair -> k[1][0] + k[1][1]

joinedSum = joined.map(lambda k: (k[0], (k[1][0]+k[1][1])))
# joined.map(lambda k: k[1]).take(10)
joined.map(lambda k: k[1]).take(10)

[(1, 1),
 (1, 1),
 (14, 14),
 (6, 6),
 (2, 2),
 (1, 1),
 (1, 1),
 (1, 1),
 (5, 5),
 (1, 1)]

To check if it is correct, print the first five elements from the joined and the joinedSum RDD

In [101]:
print("Joined Individial\n")
print(joined.take(5))

print("\n\nJoined Sum\n")
print(joinedSum.take(5))

Joined Individial

[('#', (1, 1)), ('Apache', (1, 1)), ('Spark', (14, 14)), ('is', (6, 6)), ('It', (2, 2))]


Joined Sum

[('#', 2), ('Apache', 2), ('Spark', 28), ('is', 12), ('It', 4)]


## Shared variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

### Broadcast variables

Broadcast variables are useful for when you have a large dataset that you want to use across all the worker nodes. A read-only variable is cached on each machine rather than shipping a copy of it with tasks. Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage.


Read more here: [http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables](http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables)

Create a broadcast variable. Type in:

In [1]:
broadcastVar = sc.broadcast([1,2,3])

To get the value, type in:

In [2]:
broadcastVar.value

[1, 2, 3]

### Accumulators

Accumulators are variables that can only be added through an associative operation. It is used to implement counters and sum efficiently in parallel. Spark natively supports numeric type accumulators and standard mutable collections. Programmers can extend these for new types. Only the driver can read the values of the accumulators. The workers can only invoke it to increment the value.

Create the accumulator variable. Type in:

In [3]:
accum = sc.accumulator(0)

Next parallelize an array of four integers and run it through a loop to add each integer value to the accumulator variable. Type in:

In [4]:
rdd = sc.parallelize([1,2,3,4])
def f(x):
    global accum
    accum += x

Next, iterate through each element of the rdd and apply the function f on it:

In [5]:
rdd.foreach(f)

To get the current value of the accumulator variable, type in:

In [6]:
accum.value

10

You should get a value of 10.

This command can only be invoked on the driver side. The worker nodes can only increment the accumulator.


## Key-value pairs

You have already seen a bit about key-value pairs in the Joining RDD section.

Create a key-value pair of two characters. Type in:

In [7]:
pair = ('a', 'b')

To access the value of the first index use [0] and [1] method for the 2nd.

In [8]:
print(pair[0])
print(pair[1])

a
b


# This part is copied from Scala lab
## Sample Application
In this section, you will be using a subset of a data for taxi trips that will determine the top 10 medallion numbers based on the number of trips.

In [57]:
# taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/nyctaxi.csv")
taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/my_nytaxi_short.csv")

To view the five rows of content, invoke the take function. Type in:

In [58]:
taxi.take(5)

['29b3f4a30dea6688d4c289c9672cb996,1-ddfdec8050c7ef4dc694eeeda6c4625e,1/11/2013 22:03,4.07E+01,-7.40E+01,A93D1F7F8998FFB75EEF477EB6077516,68BC16A99E915E44ADA7E639B4DD5F59,2,1/11/2013 21:48,4.07E+01,-7.40E+01,1,,4.08E+00,900,VTS',
 '2a80cfaa425dcec0861e02ae44354500,1-b72234b58a7b0018a1ec5d2ea0797e32,1/11/2013 4:28,4.08E+01,-7.39E+01,64CE1B03FDE343BB8DFB512123A525A4,60150AA39B2F654ED6F0C3AF8174A48A,1,1/11/2013 4:07,4.07E+01,-7.40E+01,1,,8.53E+00,1260,VTS',
 '29b3f4a30dea6688d4c289c96758d87e,1-387ec30eac5abda89d2abefdf947b2c1,1/11/2013 22:02,4.07E+01,-7.40E+01,2D73B0C44F1699C67AB8AE322433BDB7,6F907BC9A85B7034C8418A24A0A75489,5,1/11/2013 21:46,4.08E+01,-7.40E+01,1,,3.01E+00,960,VTS',
 '2a80cfaa425dcec0861e02ae446226e4,1-aa8b16d6ae44ad906a46cc6581ffea50,1/11/2013 10:03,4.08E+01,-7.40E+01,E90018250F0A009433F03BD1E4A4CE53,1AFFD48CC07161DA651625B562FE4D06,5,1/11/2013 9:44,4.07E+01,-7.40E+01,1,,3.64E+00,1140,VTS',
 '29b3f4a30dea6688d4c289c9675a019c,1-dc8295eae03262a84370b8a6450eb38e,1/11/2013 2

##### Note that the first line is the headers. Normally, you would want to filter that out, but since it will not affect our results, we can leave it in.

##### To parse out the values, including the medallion numbers, you need to first create a new RDD by splitting the lines of the RDD using the comma as the delimiter. Type in:


In [62]:
taxiParse = taxi.map(lambda line: line.split(','))
# taxiParse.take(5)

### Problem Statement: Find no of trips a particular taxi took | See which taxi took the most trips

##### Now create the key-value pairs where the key is the medallion number and the value is 1. We use this model to later sum up all the keys to find out the number of trips a particular taxi took and in particular, will be able to see which taxi took the most trips. Map each of the medallions to the value of one. Type in:
Note: medallion is the 7th field ie at index 6

##### Below Creating 
##### 1) Map: tuple of (medallion no, (1, dist, duration))
##### 2) Reduce: tuple of (occurence, sum of dist travelled, sum of duration of trip)

In [117]:
# taxiMedKey = taxiParse.map(lambda x: (x[6], 1),)
taxiMedKey = taxiParse.map(lambda x: (x[6], (1, int(x[7]), int(x[14]))))
taxiMedKey.take(10)
# taxiMedRed = taxiMedKey.reduceByKey(lambda a , b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
# taxiMedRed.take(10)

[('68BC16A99E915E44ADA7E639B4DD5F59', (1, 2, 900)),
 ('60150AA39B2F654ED6F0C3AF8174A48A', (1, 1, 1260)),
 ('6F907BC9A85B7034C8418A24A0A75489', (1, 5, 960)),
 ('1AFFD48CC07161DA651625B562FE4D06', (1, 5, 1140)),
 ('8BF138EA0CF6FF83587993BECA6D6D59', (1, 1, 840)),
 ('23A8ED0AAA1936A28C652B80903B42FB', (1, 1, 1260)),
 ('42BC02EC8FC9719B5B8075C3029B9EE9', (1, 1, 780)),
 ('EB49CE1B3661EF6100CF9EA1B860932E', (1, 1, 900)),
 ('DDE6F0B0832FA5CCE8491924E360FB45', (1, 1, 900)),
 ('68BC16A99E915E44ADA7E639B4DD5F59', (1, 5, 900))]

[('68BC16A99E915E44ADA7E639B4DD5F59', '25'),
 ('6F907BC9A85B7034C8418A24A0A75489', ('5', '960')),
 ('1AFFD48CC07161DA651625B562FE4D06', ('5', '1140')),
 ('8BF138EA0CF6FF83587993BECA6D6D59', ('1', '840')),
 ('23A8ED0AAA1936A28C652B80903B42FB', ('1', '1260'))]

##### vals(6) corresponds to the column where the medallion key is located

##### Next use the reduceByKey function to count the number of occurrence for each key.

In [7]:
taxiMedCounts = taxiMedKey.reduceByKey(lambda a, b: a + b)
taxiMedCounts.take(5)

[('"DDE6F0B0832FA5CCE8491924E360FB45"', 235),
 ('"41DFF791640B9E2452F8FC0120C4A9F7"', 229),
 ('"BAE2E3D1E60161EB06CB10ACEE43F0E7"', 270),
 ('"28B0F9D039C707889FAB90A73377CDE4"', 90),
 ('"6D54E8A7CAC67485032E338D2322180B"', 56)]

##### Finally, the values are swapped so they can be ordered in descending order and the results are presented correctly.

In [12]:
taxiMedTop10 = taxiMedCounts.map(lambda x: (x[1], x[0])).top(10)
taxiMedTop10
for item in taxiMedTop10:
    print("Taxi Medallion {1} has {0} trips.".format(item[0], item[1]))

Taxi Medallion "FE4C521F3C1AC6F2598DEF00DDD43029" has 415 trips.
Taxi Medallion "F5BB809E7858A669C9A1E8A12A3CCF81" has 411 trips.
Taxi Medallion "8CE240F0796D072D5DCFE06A364FB5A0" has 406 trips.
Taxi Medallion "0310297769C8B049C0EA8E87C697F755" has 402 trips.
Taxi Medallion "B6585890F68EE02702F32DECDEABC2A8" has 399 trips.
Taxi Medallion "33955A2FCAF62C6E91A11AE97D96C99A" has 395 trips.
Taxi Medallion "4F7C132D3130970CFA892CC858F5ECB5" has 391 trips.
Taxi Medallion "78833E177D45E4BC520222FFBBAC5B77" has 383 trips.
Taxi Medallion "E097412FE23295A691BEEE56F28FB9E2" has 380 trips.
Taxi Medallion "C14289566BAAD9AEDD0751E5E9C73FBD" has 377 trips.


##### While each step above was processed one line at a time, you can just as well process everything on one line:
##### Find count of trips in two ways 1) reduceByKey on rddtuple(k,1) and 2) combineByKey on rddtuple(k,1)

In [1]:
taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/nyctaxi.csv")
# op1:
taxiMedCountsOneLine1 = taxi.map(lambda line: line.split(',')).map(lambda x: (x[6], 1)).reduceByKey(lambda a, b: a + b)

# op2:
taxiMedCountsOneLine2 = taxi.map(lambda line: line.split(',')).map(lambda x: (x[6], 1)).countByKey()
# taxiMedCountsOneLine

##### Run the same line as above to print the taxiMedCountsOneLine RDD.

In [2]:
# Note: When using combineByKey you can no longer swap as combineByKey is a action which returns collection:defaultdict
taxiMedCountsOneLine1.map(lambda x: (x[1], x[0])).top(5)

[(415, '"FE4C521F3C1AC6F2598DEF00DDD43029"'),
 (411, '"F5BB809E7858A669C9A1E8A12A3CCF81"'),
 (406, '"8CE240F0796D072D5DCFE06A364FB5A0"'),
 (402, '"0310297769C8B049C0EA8E87C697F755"'),
 (399, '"B6585890F68EE02702F32DECDEABC2A8"')]

##### Let's cache the taxiMedCountsOneLine to see the difference caching makes. Run it with the logs set to INFO and you can see the output of the time it takes to execute each line. First, let's cache the RDD


In [21]:
# taxiMedCountsOneLine.cache()
type(taxiMedCountsOneLine.cache())

pyspark.rdd.PipelinedRDD

##### Next, you have to invoke an action for it to actually cache the RDD. Note the time it takes here (either empirically using the INFO log or just notice the time it takes)

In [18]:
taxiMedCountsOneLine.count()

13464

##### Run it again to see the difference.

In [19]:
taxiMedCountsOneLine.count()

13464

### Remove Header from csv file --> While Loading into dataframe

In [3]:
# taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/nyctaxi.csv")
taxi = spark.read.option("header","true").csv("/resources/jupyterlab/labs/BD0211EN/LabData/nyctaxi.csv")
type(taxi)
type(taxi)
# for item in taxi.take(10):
#     print(item[6])
type(spark)

pyspark.sql.session.SparkSession

### Remove Header from csv file --> WHile creating a RDD

In [163]:
taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/nyctaxi.csv")
# taxi = sc.textFile("/resources/jupyterlab/labs/BD0211EN/LabData/my_nytaxi_short_wheader.csv")
# taxi.count()
headertag = taxi.first()
header = sc.parallelize([headertag])
data = taxi.subtract(header)
data.count()

2677600

In [77]:
# # one_through_9 = range(1,10)
one_through_9 = [1,2,3,4,5,6,7,8,9]
parallel = sc.parallelize(one_through_9, 3)
# parallel = sc.parallelize(one_through_9)
# def f(iterator): yield sum(iterator)
# def h(index, iterator): 
#     yield index, list(iterator)
def h(index, iterator):
    if index == 0:
        yield index, list(iterator)[1:]
    else:
        yield index, list(iterator)
    
# parallel.mapPartitions(f).collect()             # -> map partition (iterator)
parallel.mapPartitionsWithIndex(h).collect()      # -> map partition with index posn of partition. (index, iterator)         
# sc.defaultParallelism

[[2, 3], [4, 5, 6], [7, 8, 9]]

##### The bigger the dataset, the more noticeable the difference will be. In a sample file such as ours, the difference may be negligible.

### Summary
Having completed this exercise, you should now be able to describe Spark’s primary data abstraction, work with Resilient Distributed Dataset (RDD) operations, and utilize shared variables and key-value pairs.

This notebook is part of the free course on **Cognitive Class** called *Spark Fundamentals I*. If you accessed this notebook outside the course, you can take this free self-paced course, online by going to: http://cocl.us/Spark_Fundamentals_I

### About the Authors:  
Hi! It's [Alex Aklson](https://www.linkedin.com/in/aklson/), one of the authors of this notebook. I hope you found this lab educational! There is much more to learn about Spark but you are well on your way. Feel free to connect with me if you have any questions.
<hr>