# Basics
This example works with RDDs. Its sole purpose is to get to know the technology. You will probably not need RDDs in your project, because we will use Spark SQL, which is more flexible.

In [1]:
import datetime
#Make sure pyspark is installed as a package of your project.
from pyspark import SparkConf, SparkContext
import ConnectionConfig as cc
# This method sets up the environment variables for you. See EnvironmentSetup.py for more information.
cc.setupEnvironment()

Dynamically set JAVA_HOME: /Users/user/Library/Java/JavaVirtualMachines/temurin-21.0.2/Contents/Home



## Basic setup for using a Spark cluster

In [2]:

#1. Creating a configuration to add parameters.
conf = SparkConf().setAppName("firstJob").setMaster("local[*]").setIfMissing("spark.logLineage", "true")
#setMaster() defines the cluster on which the code runs.
#   If you submit a job on the cluster itself, the master is already set.
#   To run the code in your development environment, set the master to local[*]. Spark then starts a local cluster that uses the threads of your CPU.
# IMPORTANT: Never hard-code setMaster("local[*]") in your final code. In a real environment, overriding the default with local[*] makes the job run locally on the master node, and the cluster itself is not used to run the job.

#2. Create a SparkContext. The context is used to start a processing job on the cluster.
sc = SparkContext.getOrCreate(conf)
sc.uiWebUrl # Returns the Spark UI URL. Visit the link shown below for insight into the Spark jobs.

25/02/05 13:28:37 WARN Utils: Your hostname, MacBook-Pro-170.local resolves to a loopback address: 127.0.0.1; using 10.140.34.14 instead (on interface en0)
25/02/05 13:28:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/05 13:28:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'http://10.140.34.14:4040'
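One way to follow the advice about not hard-coding `setMaster("local[*]")` is to take the master from the environment and leave it unset otherwise. The sketch below is only an illustration under assumptions: the variable name `SPARK_MASTER_URL` and the helpers `spark_settings` and `build_conf` are hypothetical and are not part of Spark or of `ConnectionConfig`.

```python
import os


def spark_settings(app_name, environ=None):
    """Collect Spark settings, adding a master only when one is requested.

    SPARK_MASTER_URL is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    settings = {"spark.app.name": app_name}
    master = environ.get("SPARK_MASTER_URL")
    if master:
        # Development machine, e.g. SPARK_MASTER_URL=local[*]
        settings["spark.master"] = master
    # Otherwise spark.master stays unset, so the value supplied when the
    # job is submitted to the cluster is the one that applies.
    return settings


def build_conf(app_name, environ=None):
    """Turn the settings into a SparkConf (pyspark is imported only here)."""
    from pyspark import SparkConf

    return SparkConf().setAll(list(spark_settings(app_name, environ).items()))
```

On a development machine you would export `SPARK_MASTER_URL=local[*]` before starting the notebook and then call `SparkContext.getOrCreate(build_conf("firstJob"))`; on the cluster the variable is simply left undefined.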

## Running WordCount (the Hello World of distributed processing)

In [3]:
import datetime

# 1. Use the Spark context to read from an input. This is the starting point for the Spark execution engine. Reading a text file returns an RDD with one record per line.
# TODO: Use the correct path notation (e.g. file:///, gs://, hdfs://) and pass the path as a job argument instead of hard-coding it.
lines = sc.textFile("./FileStore/tables/shakespeare.txt")
# 2. The first transformation splits the lines into words.
words = lines.flatMap(lambda line: line.split(" "))
# 3. The second transformation creates key-value pairs: key = word, value = 1.
wordKv = words.map(lambda word: (word, 1))
# 4. This transformation aggregates by key, so every word is counted. Because records with the same key are spread over the cluster, a shuffle is needed (e.g. all records with key "romeo" are sent to one node). The execution engine starts a new stage for this step because of the shuffle.
wordCounts = wordKv.reduceByKey(lambda a, b: a + b)
# 5. This step is optional: it marks the result to be cached in memory.
# Caching is more efficient when several downstream RDDs build on wordCounts, because Spark then does not recompute everything from the beginning for each chain.
wordCounts.cache()
# 6. Save the counts to the output. Because this is an action, only now does the Spark execution engine execute all the preceding transformations. saveAsTextFile() returns None, so there is nothing useful to print.
wordCounts.saveAsTextFile("./output/words" + datetime.datetime.now().strftime("%m%d%Y%H%M%S"))
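To see what the transformations in the WordCount cell compute, it can help to model them in plain Python on a tiny input. This is a simplified sketch, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers, the two sample lines are made up (they are not taken from shakespeare.txt), and the "shuffle" is imitated by routing each pair to a list-based partition chosen by its key.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, records):
    """Apply func to every record and flatten the resulting lists."""
    return [item for record in records for item in func(record)]


def reduce_by_key(func, pairs, num_partitions=2):
    """Group pairs by key in per-key partitions, then reduce each group."""
    # "Shuffle": every pair goes to the partition selected by its key, so
    # all pairs with the same key end up in the same partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    # Reduce the values of each key inside its partition.
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = reduce(func, values)
    return result


lines = ["to be or not", "to be"]                        # made-up sample input
words = flat_map(lambda line: line.split(" "), lines)    # step 2
word_kv = [(word, 1) for word in words]                  # step 3
word_counts = reduce_by_key(lambda a, b: a + b, word_kv) # step 4
print(sorted(word_counts.items()))
# → [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The result does not depend on which partition a key lands in, which is why Spark is free to decide the placement itself during the shuffle.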


                                                                                

In [4]:

# Stops the spark context so you can build a new one.
sc.stop()
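The TODO in the WordCount cell asks for the input path to be passed as a job argument rather than hard-coded. A minimal sketch of that idea with the standard `argparse` module follows; the option names `--input` and `--output` and the helper `parse_job_args` are assumptions made for this illustration.

```python
import argparse


def parse_job_args(argv=None):
    """Parse the (hypothetical) --input and --output options of the job."""
    parser = argparse.ArgumentParser(description="WordCount job")
    parser.add_argument(
        "--input", required=True,
        help="input path, e.g. with a file:///, gs:// or hdfs:// prefix")
    parser.add_argument(
        "--output", required=True,
        help="output directory for the word counts")
    return parser.parse_args(argv)


# In a notebook, pass the arguments explicitly; in a submitted script,
# call parse_job_args() without arguments to read the command line.
args = parse_job_args([
    "--input", "./FileStore/tables/shakespeare.txt",
    "--output", "./output/words",
])
print(args.input, args.output)
```

The parsed values would then replace the literals, for example `sc.textFile(args.input)` and `wordCounts.saveAsTextFile(args.output + timestamp)`, where `timestamp` stands for the formatted date string used in the cell.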