# updateStateByKey Demo

### updateStateByKey
The `updateStateByKey` operation lets you maintain arbitrary per-key state and continuously update it with new information. Using it takes two steps:
1. Define the state. The state can be of any data type.
2. Define the state update function. This function specifies how to compute the new state from the previous state and the new values from the input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in that batch. If the update function returns `None`, the key-value pair is removed from the state.
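The per-batch behaviour described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: `apply_batch` and `count_until_idle` are hypothetical helpers introduced only for this illustration, and the state is modelled as a plain dictionary.

```python
def apply_batch(state, batch, update_func):
    """Apply update_func to every key that has state or appears in this batch."""
    # Group the new values of this batch by key.
    new_values_by_key = {}
    for key, value in batch:
        new_values_by_key.setdefault(key, []).append(value)

    next_state = {}
    # Existing keys are visited even when the batch has no data for them.
    for key in set(state) | set(new_values_by_key):
        result = update_func(new_values_by_key.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            next_state[key] = result
    return next_state


def count_until_idle(new_values, previous):
    """Running count that drops a key as soon as a batch brings no new data."""
    if not new_values:
        return None
    return sum(new_values) + (previous or 0)


state = {}
state = apply_batch(state, [("a", 1), ("b", 1), ("a", 1)], count_until_idle)
print(state)  # a is counted twice, b once
state = apply_batch(state, [("a", 1)], count_until_idle)
print(state)  # a grows to 3; b had no new data, so it is removed
```

Here the update function is called for `b` in the second batch with an empty list of new values, and returning `None` is what removes it.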

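Note that `updateStateByKey` requires a checkpoint directory to be configured, because the state is carried over from one batch to the next. The demo below sets it with `ssc.checkpoint(...)`.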


### mapWithState
`mapWithState` is another stateful transformation. Unlike the Scala and Java APIs, the Python API for Spark does not provide `mapWithState`, so this demo focuses on `updateStateByKey`.

### Demo

In [None]:
'''
import findspark
# TODO: your Spark installation will likely be in a different location. Change the path below to match yours.
findspark.init('/home/siddharth/spark-2.1.0-bin-hadoop2.7')
'''

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=5af46e65d78f60f7af1f3a50e7c3f8a5b7a96009224514308b29249649156227
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [2]:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [3]:
sc = SparkContext()
ssc = StreamingContext(sc, 5)
ssc.checkpoint("checkpoint")

In [4]:
lines = ssc.socketTextStream("localhost", 9999)

In [5]:
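def updateFunc(new_values, last_sum):
    # new_values: counts for this key in the current batch; last_sum: previous total (None for a new key).
    return sum(new_values) + (last_sum or 0)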
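To see what this update function returns, it can be called directly with plain Python values. The sketch below repeats the definition so that it runs on its own; the variable names are only illustrative.

```python
def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)


# A key seen for the first time has no previous state, so last_sum is None.
first_seen = updateFunc([1, 1], None)
# A key with existing state but no new data in this batch keeps its total.
no_new_data = updateFunc([], 3)
# A key with existing state and one new occurrence.
more_data = updateFunc([1], 3)

print(first_seen, no_new_data, more_data)  # 2 3 4
```

Because this function never returns `None`, no key is ever dropped, so the state grows with every distinct word seen.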

In [6]:
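# Split each line into words, pair each word with 1, and keep a running count per word.
words = lines.flatMap(lambda line: line.split(" "))
running_counts = words.map(lambda word: (word, 1)).updateStateByKey(updateFunc)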
running_counts.pprint()

In [8]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 2023-03-11 08:25:35
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:25:40
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:25:45
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:25:50
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:25:55
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:26:00
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:26:05
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:26:10
-------------------------------------------

-------------------------------------------
Time: 2023-03-11 08:26:15
----------

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.9/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: ignored
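Every batch in the output above is empty, which is what one would expect when no text arrives on `localhost:9999`. A common way to feed the stream is to run `nc -lk 9999` in a terminal and type lines into it. Where that is not convenient, a small TCP server can play the same role. The following is a minimal sketch under that assumption: `serve_lines` is a hypothetical helper, not part of PySpark, and it serves a single client.

```python
import socket
import threading
import time


def serve_lines(lines, host="localhost", port=9999, delay=0.0):
    """Accept one client and send it each line, newline-terminated.

    Returns the bound port and the background thread doing the sending.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
                time.sleep(delay)  # optionally spread lines over several batches
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return bound_port, thread
```

As a usage sketch, calling `serve_lines(["hello world", "hello spark"], delay=5)` before `ssc.start()` should give the socket stream something to count. The sender runs in a daemon thread so that it does not block the notebook cell.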

## References
1. https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html