# Transformation Operation Exercise

The `transform` operation (along with its variations like `transformWith`) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch in a data stream with another dataset is not directly exposed in the DStream API, but you can easily do it with `transform`. This makes it possible, for example, to do real-time data cleaning by joining the input data stream with precomputed spam information (possibly generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
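
To make the join-then-filter step concrete, the following is a minimal plain-Python sketch (no Spark required) that models a single batch as a list of `(word, count)` pairs and the spam information as a list of `(word, is_spam)` pairs. The helper names `inner_join` and `clean_batch`, the sample data, and the filter condition are illustrative assumptions; the snippet above leaves the actual filter unspecified.

```python
def inner_join(left, right):
    """Model of a key-value inner join: emit (key, (left_value, right_value))."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


def clean_batch(batch, spam_info):
    """Join one batch with the spam information, then keep non-spam words."""
    joined = inner_join(batch, spam_info)
    # Hypothetical filter: drop every word flagged as spam.
    return [(word, count) for word, (count, is_spam) in joined if not is_spam]


# Hypothetical sample data
spam_info = [("hello", False), ("winner", True), ("spark", False)]
batch = [("hello", 3), ("winner", 5), ("spark", 1), ("unknown", 2)]

cleaned = clean_batch(batch, spam_info)
print(cleaned)  # [('hello', 3), ('spark', 1)]
```

Because this models an inner join, the word `"unknown"` is dropped as well: it has no matching key in the spam information.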
Note that the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations; that is, the RDD operations, number of partitions, broadcast variables, etc. can be changed between batches.
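
One way to picture this is a plain-Python sketch (no Spark required) in which each batch is a list and the supplied function is called once per batch. The names `blocklist` and `drop_blocked` and the sample batches are hypothetical; the point is only that the function sees whatever the reference data looks like at the moment it is called.

```python
# Hypothetical reference data that can change while the stream is running
blocklist = {"spam"}


def drop_blocked(batch):
    """Per-batch function: reads the current blocklist every time it is called."""
    return [word for word in batch if word not in blocklist]


# Two identical batches arriving in consecutive batch intervals
batches = [["spam", "ham", "eggs"], ["spam", "ham", "eggs"]]

results = []
for index, batch in enumerate(batches):
    results.append(drop_blocked(batch))
    if index == 0:
        # Update the reference data between the first and second batch
        blocklist.add("eggs")

print(results)  # [['ham', 'eggs'], ['ham']]
```

The two input batches are identical, yet the outputs differ because the reference data changed between calls.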

What is the benefit of calling the function in every batch interval?


### Exercise

Suppose we have two RDDs that are combined into a DStream.

We would like to apply the `union()` function to this DStream and the RDD `commonRdd`.

In [1]:
import os
import findspark

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark import SparkConf

from apache_log_parser import ApacheAccessLog


os.environ['SPARK_HOME'] = '/Users/audioworkstation/Documents/WORKSPACE/LEARNING/spark_streaming_using_x/spark-3.5.0-bin-hadoop3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['PYSPARK_PYTHON'] = 'python'


findspark.init()
findspark.find()

conf = (SparkConf().setMaster('local[2]').setAppName('TextUpdater').set('spark.executor.memory', '2g'))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sparkContext=sc, batchDuration=3)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/15 15:19:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
rdd1 = ssc.sparkContext.parallelize([1,2,3])
rdd2 = ssc.sparkContext.parallelize([4,5,6])
rddQueue = [rdd1,rdd2]

In [3]:
# Creates a DStream from the RDDs above
numsDStream = ssc.queueStream(rddQueue)
plusOneDStream = numsDStream.map(lambda x : x+1)

In [4]:
commonRdd = ssc.sparkContext.parallelize([7,8,9])
# TODO: Use the transform function to apply the union function to the RDDs within plusOneDStream
# (numsDStream with 1 added to every element) and commonRdd, then print the resulting DStream
def perform_union(stream_rdd):
    # Called once per batch; appends the elements of commonRdd to the batch RDD
    return stream_rdd.union(commonRdd)

transformed_ds = plusOneDStream.transform(perform_union)
transformed_ds.pprint()



In [5]:
ssc.start() 
# ssc.awaitTermination()

In [6]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

                                                                                

-------------------------------------------
Time: 2023-11-15 15:19:06
-------------------------------------------
2
3
4
7
8
9

-------------------------------------------
Time: 2023-11-15 15:19:09
-------------------------------------------
5
6
7
7
8
9
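
The printed batches can be checked by hand with a small plain-Python sketch (no Spark required) that models each queued RDD as a list. The helper names `plus_one` and `union_with_common` are illustrative and are not part of the notebook.

```python
# The two queued RDDs and the common RDD from the exercise, modeled as lists
queued_batches = [[1, 2, 3], [4, 5, 6]]
common = [7, 8, 9]


def plus_one(batch):
    """Model of numsDStream.map(lambda x: x + 1) applied to one batch."""
    return [x + 1 for x in batch]


def union_with_common(batch):
    """Model of rdd.union(commonRdd): concatenation, duplicates are kept."""
    return batch + common


expected = [union_with_common(plus_one(batch)) for batch in queued_batches]
for batch in expected:
    print(batch)
# [2, 3, 4, 7, 8, 9]
# [5, 6, 7, 7, 8, 9]
```

The value 7 appears twice in the second batch because `union` does not remove duplicates.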



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
