# Transformation Operation Exercise

The `transform` operation (along with its variations like `transformWith`) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use `transform` to do this. For example, one can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
Note that the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches.
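
The per-batch behaviour described above can be sketched without a Spark installation. The block below is only a plain-Python model, not PySpark code: lists stand in for RDDs, and `transform_model`, `blocklist`, and `clean` are hypothetical names introduced for illustration. It assumes the reference data is updated on the driver between two batches.

```python
# Plain-Python model (no Spark needed): lists stand in for RDDs, and
# transform_model is a hypothetical stand-in for DStream.transform.

def transform_model(batches, func):
    """Call func once per batch, as transform does in every batch interval."""
    return [func(batch) for batch in batches]

# Hypothetical reference data that may be refreshed between batches.
blocklist = {"spam"}

def clean(batch):
    # blocklist is read at call time, so later batches see any updates.
    return [word for word in batch if word not in blocklist]

batches = [["hello", "offer", "spam"], ["hello", "offer", "spam"]]

first = transform_model(batches[:1], clean)
blocklist.add("offer")  # the reference data changes before the next batch
second = transform_model(batches[1:], clean)

print(first)   # [['hello', 'offer']]
print(second)  # [['hello']]
```

Because the function is evaluated again for each batch, the second batch is filtered with the updated reference data even though the stream definition itself never changed.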

What is the benefit of being able to apply time-varying RDD operations?


In [1]:
!pip install pyspark 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=5866213691406db234bd2a3b6ad8c1a2bb00ec8b42c1f0218b3ea1638eb4c297
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

### Exercise

Suppose we have two RDDs that are combined into a DStream.

We would like to apply the `union()` function to each RDD in this DStream and the RDD `commonRdd`.

In [None]:
'''
import findspark
# TODO: your path will likely differ from the one below. Change it to reflect your Spark installation.
findspark.init('/home/siddharth/spark-2.1.0-bin-hadoop2.7')
'''

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

In [3]:
conf = SparkConf().setMaster("local[2]").setAppName("StreamingTransformExample")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)

In [4]:
rdd1 = ssc.sparkContext.parallelize([1,2,3])
rdd2 = ssc.sparkContext.parallelize([4,5,6])
rddQueue = [rdd1,rdd2]

In [5]:
# Creates a DStream from the RDDs above
numsDStream = ssc.queueStream(rddQueue)
plusOneDStream = numsDStream.map(lambda x : x+1)
plusOneDStream.pprint()

In [6]:
commonRdd = ssc.sparkContext.parallelize([7,8,9])
# TODO: Use the transform function to apply the union function to the RDDs within numsDStream and elements of commonRdd
# and print the resulting DStream
combinedDStream = numsDStream.transform(lambda rdd: rdd.union(commonRdd))
combinedDStream.pprint()

In [7]:
ssc.start() 
# ssc.awaitTermination()

In [8]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

-------------------------------------------
Time: 2023-03-11 07:11:35
-------------------------------------------
5
6
7

-------------------------------------------
Time: 2023-03-11 07:11:35
-------------------------------------------
4
5
6
7
8
9

-------------------------------------------
Time: 2023-03-11 07:11:40
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 07:11:40
-------------------------------------------
7
8
9
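
To help read the output above, here is a plain-Python model of the exercise. It is a sketch rather than PySpark code: lists stand in for RDDs, `run_batches` is a hypothetical helper, and it assumes `queueStream` hands out one queued RDD per batch (its default `oneAtATime=True` behaviour) and produces empty batches once the queue is drained.

```python
# Plain-Python model of the exercise (no Spark needed): lists stand in for RDDs.
from collections import deque

def run_batches(queue, common, num_batches):
    """Model a queue stream that yields one RDD per batch, then empty batches."""
    pending = deque(queue)
    results = []
    for _ in range(num_batches):
        batch = pending.popleft() if pending else []  # empty once the queue drains
        plus_one = [x + 1 for x in batch]             # models map(lambda x: x + 1)
        combined = batch + common                     # models rdd.union(commonRdd)
        results.append((plus_one, combined))
    return results

results = run_batches([[1, 2, 3], [4, 5, 6]], [7, 8, 9], num_batches=3)
for plus_one, combined in results:
    print(plus_one, combined)
# [2, 3, 4] [1, 2, 3, 7, 8, 9]
# [5, 6, 7] [4, 5, 6, 7, 8, 9]
# [] [7, 8, 9]
```

Under this model, the captured output corresponds to the second and third rows: the batch at 07:11:35 consumed `rdd2`, and the batch at 07:11:40 found the queue empty, so `plusOneDStream` printed nothing while `combinedDStream` still printed the elements of `commonRdd`.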



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
