# countByValueAndWindow Transformation Exercise

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure illustrates this sliding window.


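One common window operation is described below. Every window operation takes two parameters, windowLength (the duration of the window) and slideInterval (the interval at which the window operation is performed). Both must be multiples of the batch interval of the source DStream.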

| Transformation        | Meaning           |
| -------------:|:-------------|
| **countByValueAndWindow**(windowLength, slideInterval, [numTasks])     | When called on a DStream, returns a new DStream of (value, count) pairs where each count is the frequency of that value within a sliding window. As with reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
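
To make the window semantics concrete, here is a minimal plain-Python sketch (no Spark required) of what a count-by-value over a sliding window computes. It assumes each batch is simply a list of values and that the window length and slide interval are measured in batches rather than seconds. The function name `count_by_value_and_window` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter

def count_by_value_and_window(batches, window_length, slide_interval):
    """Return a list of (end_batch, counts) pairs, one per slide boundary.

    batches: list of lists, one inner list per batch interval.
    window_length, slide_interval: sizes measured in number of batches.
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        # The window covers the last `window_length` batches ending at `end`.
        start = max(0, end - window_length)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append((end, dict(counts)))
    return results

batches = [["a", "b"], ["a"], ["c"], ["a", "c"]]
windows = count_by_value_and_window(batches, window_length=3, slide_interval=2)
for end, counts in windows:
    print(end, counts)
```

With a window of three batches that slides every two batches, the value `"b"` is counted in the first window but has dropped out of the second, because the batch that contained it is no longer inside the window.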

### Exercise

In [1]:
import os
import pathlib
import findspark

os.environ['SPARK_HOME'] = '/Users/audioworkstation/Documents/WORKSPACE/LEARNING/spark_streaming_using_x/spark-3.5.0-bin-hadoop3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['PYSPARK_PYTHON'] = 'python'


findspark.init()
findspark.find()

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark import SparkConf
from apache_log_parser import ApacheAccessLog

sc = SparkContext()
ssc = StreamingContext(sparkContext=sc, batchDuration=2)
ssc.checkpoint('checkpoint')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/17 10:25:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Create a DStream from text files.
# Note: Spark Streaming monitors this directory for newly added files,
# so start this program first, then copy the log file logs/access_log.log into the monitored directory.

curr = pathlib.Path().resolve()
logs_directory = str(curr / 'logs')
log_data = ssc.textFileStream(logs_directory)
access_log_dstream = log_data.map(ApacheAccessLog.parse_from_log_line).filter(lambda parsed_line: parsed_line is not None)
ip_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, 1)) 
ip_count = ip_dstream.reduceByKey(lambda x,y: x+y)
# ip_count.pprint(num = 30)
ip_bytes_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, parsed_line.content_size))
ip_bytes_sum_dstream = ip_bytes_dstream.reduceByKey(lambda x,y: x+y)
ip_bytes_request_count_dstream = ip_count.join(ip_bytes_sum_dstream)
# ip_bytes_request_count_dstream.pprint(num = 30)

In [3]:
####### TODO: use countByValueAndWindow() to get IP counts per window ###########
access_log = access_log_dstream.map(lambda pl: pl.ip)
access_log.countByValueAndWindow(
    windowDuration=20,
    slideDuration=10
).pprint()

####### Exercise End ##########################################################
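
Windowed counts like the one in this cell can also be maintained incrementally: add the counts of the values entering the window and subtract those leaving it. This is the idea behind the inverse-reduce form of `reduceByKeyAndWindow`, which requires the checkpoint directory configured earlier with `ssc.checkpoint`. The sketch below illustrates that idea in plain Python. It is not Spark's implementation, and the function name `slide_counts` and the sample addresses are hypothetical.

```python
from collections import Counter

def slide_counts(previous, entering, leaving):
    """Update window counts: add values entering the window, subtract values leaving it."""
    updated = Counter(previous)
    updated.update(entering)    # add counts for values in the new batches
    updated.subtract(leaving)   # remove counts for values in the expired batches
    # Drop values whose count fell to zero so they no longer appear in the output.
    return {value: n for value, n in updated.items() if n > 0}

window = {"10.0.0.1": 2, "10.0.0.2": 1}
window = slide_counts(
    window,
    entering=["10.0.0.3", "10.0.0.1"],
    leaving=["10.0.0.2", "10.0.0.1"],
)
print(window)
```

The incremental form touches only the batches that enter or leave the window, so the work per slide does not grow with the window length.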

In [None]:
ssc.start() 
# ssc.awaitTermination()

                                                                                

-------------------------------------------
Time: 2023-11-17 10:26:44
-------------------------------------------



Cannot parse logline: h194n2fls308o1033.telia.com - - [09/Mar/2004:13:49:05 -0800] "-" 408 -
                                                                                

-------------------------------------------
Time: 2023-11-17 10:26:54
-------------------------------------------
('80-219-148-207.dclient.hispeed.ch', 1)
('mmscrm07-2.sac.overture.com', 3)
('h24-70-56-49.ca.shawcable.net', 7)
('prxint-sxb2.e-i.net', 1)
('lj1027.inktomisearch.com', 2)
('pool-68-160-195-60.ny325.east.verizon.net', 5)
('lj1052.inktomisearch.com', 1)
('fw.kcm.org', 2)
('h24-71-236-129.ca.shawcable.net', 51)
('lj1123.inktomisearch.com', 2)
...



                                                                                

In [None]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations