# reduceByKeyAndWindow transformation Demo

| Transformation        | Meaning           |
| -------------:|:-------------|
| **reduceByKeyAndWindow**(func, windowLength, slideInterval, [numTasks])     | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| **reduceByKeyAndWindow**(func, invFunc, windowLength, slideInterval, [numTasks])      | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.      |


### Demo

In [None]:
'''
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/siddharth/spark-2.1.0-bin-hadoop2.7')
'''

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.4/281.4 MB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=f71b5eba861eeebbcee747a461d551a01e875d5678ca4cdf8da4553942be13c5
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [None]:
sc = SparkContext(master="local[2]", appName="reduceByKeyAndWindowWordcount")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")

In [None]:
lines = ssc.socketTextStream("localhost", 7777)

In [None]:
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
pairs.reduceByKeyAndWindow(lambda x,y: (x + y), 30, 10).pprint()

In [None]:
ssc.start()

-------------------------------------------
Time: 2018-02-09 02:16:32
-------------------------------------------

-------------------------------------------
Time: 2018-02-09 02:16:33
-------------------------------------------

-------------------------------------------
Time: 2018-02-09 02:16:34
-------------------------------------------

-------------------------------------------
Time: 2018-02-09 02:16:35
-------------------------------------------

-------------------------------------------
Time: 2018-02-09 02:16:36
-------------------------------------------

-------------------------------------------
Time: 2018-02-09 02:16:37
-------------------------------------------
('hello', 1)

-------------------------------------------
Time: 2018-02-09 02:16:38
-------------------------------------------
('hello', 1)

-------------------------------------------
Time: 2018-02-09 02:16:39
-------------------------------------------
('hello', 2)

-----------------------------------------

In [None]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations