

# installs Oracle Java 7; I already had openjdk installed for the Hadoop
# tutorial, so I skipped that Oracle step.

# install Scala
sudo mkdir -p /usr/local/src/scala
sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala
# append to ~/.bashrc:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

source ~/.bashrc
scala -version

# if you don't plan to use HDFS, the choice of 'built for Hadoop' version
# probably won't matter.
# The spark.tgz download is strangely enormous (~200 MB), why?
tar xvf spark*
cd spark-1.2.1-bin-hadoop2.4
bin/spark-shell # scala shell
bin/run-example SparkPi 10 # some example app
bin/pyspark # python shell

Using Spark via Scala:

The primary abstraction is the Resilient Distributed Dataset (RDD), a distributed collection of elements, typically (but not necessarily) stored in Hadoop's HDFS. Unlike Hadoop, Spark makes use of in-memory caching of RDDs for faster computation.
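To make the in-memory caching point concrete, here is a toy plain-Python sketch (my own illustration, not Spark's API or implementation): an uncached lazy dataset recomputes its whole transformation chain on every action, while a cached one materializes the result once and reuses it.

```python
# Toy sketch (NOT Spark) of lazy evaluation plus caching: a "dataset" that
# recomputes its transformation chain on every action unless cached.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute            # zero-arg function producing the data
        self._cached = None

    def map(self, f):
        # transformations are lazy: just wrap the upstream compute function
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def cache(self):
        self._cached = list(self._compute())  # materialize once, in memory
        self._compute = lambda: self._cached
        return self

    def collect(self):                     # an "action": actually runs the chain
        return list(self._compute())

calls = []
base = ToyRDD(lambda: calls.append(1) or [1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.collect(); doubled.collect()
assert len(calls) == 2                     # recomputed on each action
doubled.cache()
doubled.collect(); doubled.collect()
assert len(calls) == 3                     # computed once more for the cache, then reused
```

This is only the shape of the idea; real Spark additionally partitions the data across machines and tracks lineage for fault tolerance.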

First start the HDFS node, containing the same data as I used during the Hadoop tutorial:

cd hadoop...
bin/hdfs dfs -ls /user/kschubert/input # lists 2 files
bin/hdfs dfs -cat /user/kschubert/input/file01 # Hello World ...

Let Spark access these HDFS files:

val textFile = sc.textFile("hdfs://localhost:9000/user/kschubert/input/file01")
textFile.first() # Hello World
# BTW this 9000 is listed in
# grep -r 9000 etc/hadoop/
# etc/hadoop/core-site.xml:        <value>hdfs://localhost:9000</value>

# now execute the Hadoop WordCount example via Spark, code at
val textFile = sc.textFile("hdfs://localhost:9000/user/kschubert/input/file01")
# P.S.: if you specify a glob expression in the filename (such as ".../file*")
# then you can easily iterate over file01, file02 and so on, exactly like the
# Hadoop WordCount example did. You can also pass a directory name to
# textFile, in which case you iterate over all files in the dir. Not sure
# whether the computation is automatically distributed that way; I assume so.
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
# here _ + _ is Scala syntactic sugar for the closure (a, b) => a + b
# save the result to HDFS first, then print the output via shell:
counts.saveAsTextFile("hdfs://localhost:9000/user/kschubert/output/sparkout")
bin/hdfs dfs -cat /user/kschubert/output/sparkout/part-00000 # prints:
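The flatMap/map/reduceByKey chain above can be mimicked in plain Python (a sketch of the semantics only, not Spark's distributed execution) to check what ends up in the output file; the sample line matches file01 from the Hadoop tutorial:

```python
# Plain-Python mimic of the Scala wordcount chain (semantics only, no Spark):
# flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(_ + _)
def word_count(lines):
    pairs = [(w, 1) for line in lines for w in line.split(" ")]
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n   # (a, b) => a + b per key
    return counts

print(word_count(["Hello World Bye World"]))
# -> {'Hello': 1, 'World': 2, 'Bye': 1}
```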

Now instead of using the interactive Scala shell, do the same via a self-contained Spark app written in Scala:

# from
vim src/main/scala/SimpleApp.scala # paste (and put in abs path for
vim simple.sbt # paste
# install sbt, which is not part of the scala tgz?
# See
echo "deb /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-get update
sudo apt-get install sbt
# fails with
#E: Method https has died unexpectedly!
#E: Sub-process https received a segmentation fault.
# Is this a problem with openjdk? Just guessing.
sbt package # cannot execute without sbt installed :) TODO

Of course this code above only accesses file01, whereas the Hadoop sample iterated over all files in the hdfs dir (file01 and file02). How to do the same via Spark? TODO
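As a local sketch of the answer (plain Python, no Spark; the file02 content here is my assumption based on the standard Hadoop WordCount tutorial): a glob such as ".../file*" passed to textFile should pull both files into one dataset, and the aggregation over several inputs behaves like this:

```python
# Counting words across several input files, as a glob like ".../file*"
# passed to sc.textFile would (plain-Python sketch, no Spark).
import glob
import os
import tempfile

tmp = tempfile.mkdtemp()
# mimic the two HDFS input files from the Hadoop tutorial
# (file02's content is assumed from the standard WordCount example)
for name, text in [("file01", "Hello World Bye World"),
                   ("file02", "Hello Hadoop Goodbye Hadoop")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

counts = {}
for path in sorted(glob.glob(os.path.join(tmp, "file*"))):
    with open(path) as f:
        for word in f.read().split():
            counts[word] = counts.get(word, 0) + 1

print(counts)  # "Hello" is counted twice because it occurs in both files
```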

Using Spark from Python 2

Python 3 is still not supported?! Apparently PySpark uses Py4J for interop with the JVM. Let's try the interactive bin/pyspark first:

textFile = sc.textFile("hdfs://localhost:9000/user/kschubert/input/file01")
textFile.first() # prints hello world
counts = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)
print(counts) # just prints the RDD object, not very useful
counts.toDebugString() # prints the processing chain (lineage), pretty verbose
counts.saveAsTextFile("myrdd") # that's a dirname; content is myrdd/part-00000

So this is very similar to the Scala experience (with some funcs missing in Python still).

Now run standalone app, non-interactive: see

bin/pyspark kjell/

Running Spark apps on the cluster

Now how to submit a map-reduce script to an hdfs Hadoop/Spark cluster? I want to run the Hadoop equivalent with Spark. See "Running on a cluster". Does even an interactive pyspark shell map/reduce statement already parallelize across the cluster automatically? Judging by the docs, it does.

# from
# when you run bin/pyspark you get a sparkcontext sc right away.
# So run this here interactively:
wordcounts = sc.textFile('hdfs://localhost:9000/user/kschubert/isl') \
  .map( lambda x: x.lower()) \
  .flatMap(lambda x: x.split()) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(lambda x,y:x+y) \
  .map(lambda x:(x[1],x[0]))
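The final .map(lambda x: (x[1], x[0])) swaps each (word, count) pair to (count, word), presumably so the result can then be sorted by frequency. In plain Python (a sketch of the semantics, using made-up sample counts) the swap-and-sort looks like:

```python
# The swap step: (word, count) -> (count, word), which makes sorting by
# frequency a plain tuple sort on the first element. Sample counts are
# illustrative only.
counts = {"hello": 2, "world": 2, "bye": 1, "hadoop": 2, "goodbye": 1}
swapped = [(n, w) for w, n in counts.items()]
by_freq = sorted(swapped, reverse=True)   # most frequent first
print(by_freq[0][0])                      # the highest count
```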

Also see

$SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/ 5 1000
# optionally add --master spark://

# now to run the wordcount example that actually maps input (the pi example does not)
$SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/ hdfs://localhost:9000/user/kschubert/isl