# countByWindow Transformation Exercise


| Transformation        | Meaning           |
| -------------:|:-------------|
| **window**(windowLength, slideInterval)      | Return a new DStream which is computed based on windowed batches of the source DStream. |
| **countByWindow**(windowLength, slideInterval)     | Return a sliding window count of elements in the stream.     |
| **reduceByWindow**(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.     |
| **reduceByKeyAndWindow**(func, windowLength, slideInterval, [numTasks])     | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| **reduceByKeyAndWindow**(func, invFunc, windowLength, slideInterval, [numTasks])      | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.      |
| **countByValueAndWindow**(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.      |
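
The incremental variant of reduceByKeyAndWindow described in the table can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: it assumes each batch is a `Counter` of per-key counts, uses addition as `func` and subtraction as `invFunc`, and the helper names `full_window` and `incremental_window` are hypothetical.

```python
from collections import Counter


def full_window(batches, end, window_len):
    """Recompute a window from scratch by reducing every batch inside it."""
    total = Counter()
    for batch in batches[max(0, end - window_len):end]:
        total.update(batch)
    return total


def incremental_window(previous, entering, leaving):
    """Derive the next window from the previous one.

    Batches entering the window are added (the reduce function) and
    batches leaving it are subtracted (the inverse reduce function).
    """
    current = Counter(previous)
    for batch in entering:
        current.update(batch)
    for batch in leaving:
        current.subtract(batch)
    # Drop keys whose count fell to zero so both approaches compare equal.
    return Counter({key: value for key, value in current.items() if value != 0})


# Made-up per-batch key counts, for illustration only.
batches = [Counter(a=2), Counter(a=1, b=1), Counter(b=3), Counter(a=4)]

previous = full_window(batches, end=3, window_len=3)   # batches 0..2
current = incremental_window(previous, entering=[batches[3]], leaving=[batches[0]])
print(current == full_window(batches, end=4, window_len=3))
```

The incremental path touches only the batches that enter or leave the window, which is why it pays off when the window spans many batches.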

Explain the countByWindow transformation in depth and describe when you would use it.
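
As a starting point, the arithmetic behind a sliding window count can be modelled in plain Python. This is a sketch of the semantics only, not of how Spark computes it: `count_by_window` is a hypothetical helper, the list of per-batch record counts is made up, and the durations (2-second batches, 20-second window, 10-second slide) mirror the ones used in this exercise.

```python
def count_by_window(batch_sizes, batch_interval, window_duration, slide_duration):
    """Return (elapsed_seconds, count) for every window emitted over the batches."""
    if window_duration % batch_interval or slide_duration % batch_interval:
        raise ValueError("window and slide durations must be multiples of the batch interval")
    window_batches = window_duration // batch_interval
    slide_batches = slide_duration // batch_interval
    results = []
    # A window is emitted every `slide_batches` batches and covers the
    # most recent `window_batches` batches (fewer at the very start).
    for end in range(slide_batches, len(batch_sizes) + 1, slide_batches):
        start = max(0, end - window_batches)
        results.append((end * batch_interval, sum(batch_sizes[start:end])))
    return results


# Twenty 2-second batches; assume all 1545 records arrive in a single batch.
batch_sizes = [0] * 20
batch_sizes[7] = 1545

windows = count_by_window(batch_sizes, batch_interval=2, window_duration=20, slide_duration=10)
print(windows)  # [(10, 0), (20, 1545), (30, 1545), (40, 0)]
```

Because the window is twice as long as the slide, the same records are counted in two consecutive windows before they drop out. The sketch reports 0 for an empty window, whereas the recorded run at the end of this notebook prints nothing for those windows.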

### Exercise

In [None]:
import os
import sys
import re
import random
import pathlib
import findspark

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark import SparkConf
from apache_log_parser import ApacheAccessLog

os.environ['SPARK_HOME'] = '/Users/audioworkstation/Documents/WORKSPACE/LEARNING/spark_streaming_using_x/spark-3.5.0-bin-hadoop3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['PYSPARK_PYTHON'] = 'python'


findspark.init()
findspark.find()

random.seed(15)
conf = (SparkConf().setMaster('local[2]').setAppName('log processor').set('spark.executor.memory', '2g'))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sparkContext=sc, batchDuration=2)
ssc.checkpoint('checkpoint')
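
The `apache_log_parser` module imported above is a local helper that is not shown in this notebook. For readers without it, here is a hypothetical stand-in, assuming the logs use the Apache common log format and that only `ip` and `content_size` are needed downstream. The class name `SimpleAccessLog` and the regular expression are assumptions, not the original module's code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# host, identity, user, [timestamp], "method endpoint protocol", status, size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)'
)


@dataclass
class SimpleAccessLog:
    ip: str
    date_time: str
    method: str
    endpoint: str
    protocol: str
    response_code: int
    content_size: int

    @classmethod
    def parse_from_log_line(cls, line: str) -> Optional["SimpleAccessLog"]:
        """Parse one log line; return None if it does not match the pattern."""
        match = LOG_PATTERN.match(line)
        if match is None:
            print(f"Cannot parse logline: {line}")
            return None
        size = match.group(9)
        return cls(
            ip=match.group(1),
            date_time=match.group(4),
            method=match.group(5),
            endpoint=match.group(6),
            protocol=match.group(7),
            response_code=int(match.group(8)),
            # A "-" size means no body was sent; treat it as 0 bytes.
            content_size=int(size) if size.isdigit() else 0,
        )


# A made-up well-formed line, and the malformed line from the recorded output.
good = SimpleAccessLog.parse_from_log_line(
    '10.0.0.153 - - [09/Mar/2004:13:49:05 -0800] "GET /index.html HTTP/1.1" 200 1024'
)
bad = SimpleAccessLog.parse_from_log_line(
    'h194n2fls308o1033.telia.com - - [09/Mar/2004:13:49:05 -0800] "-" 408 -'
)
```

Returning `None` for unparseable lines is what lets the later `filter(lambda parsed_line: parsed_line is not None)` step discard them.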

In [None]:
# Create a DStream from text files.
# Note: Spark Streaming monitors this directory for newly added files.
# Start this program first, then copy the log file logs/access_log.log into the monitored directory.

curr = pathlib.Path().resolve()
logs_directory = str(curr / 'logs')
log_data = ssc.textFileStream(logs_directory)

access_log_dstream = log_data.map(ApacheAccessLog.parse_from_log_line).filter(lambda parsed_line: parsed_line is not None)
ip_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, 1)) 
ip_count = ip_dstream.reduceByKey(lambda x,y: x+y)
# ip_count.pprint(num = 30)


ip_bytes_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, parsed_line.content_size))
ip_bytes_sum_dstream = ip_bytes_dstream.reduceByKey(lambda x,y: x+y)
ip_bytes_request_count_dstream = ip_count.join(ip_bytes_sum_dstream)
# ip_bytes_request_count_dstream.pprint(num = 30)
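
To see what the reduceByKey and join steps in this cell produce for a single batch, the same logic can be mimicked with dictionaries. This is an illustrative sketch only: `reduce_by_key` and `join` are hypothetical plain-Python helpers, and the records are made up.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, like reduceByKey on one batch."""
    reduced = {}
    for key, value in pairs:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced


def join(left, right):
    """Inner join on key: keep keys present on both sides, pairing their values."""
    return {key: (left[key], right[key]) for key in left if key in right}


# Made-up (ip, content_size) records standing in for parsed log lines.
records = [("10.0.0.153", 500), ("64.242.88.10", 1200), ("10.0.0.153", 300)]

ip_count = reduce_by_key([(ip, 1) for ip, _ in records], lambda x, y: x + y)
ip_bytes = reduce_by_key(records, lambda x, y: x + y)
joined = join(ip_count, ip_bytes)
print(joined)  # {'10.0.0.153': (2, 800), '64.242.88.10': (1, 1200)}
```

Each joined value is a `(request_count, total_bytes)` tuple, which is the shape of the elements in `ip_bytes_request_count_dstream`.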

In [None]:
####### TODO: Windowed count operation using countByWindow() ###########

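# Sort ip_count in descending order to find the IPs with the most requests
sorted_by = ip_count.transform(lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False))
# sorted_by.pprint()

# Count the log records received in the last 20 seconds; a new count is emitted every 10 seconds.
# countByWindow is computed incrementally, so it relies on the checkpoint directory set via ssc.checkpoint() above.
access_log_dstream.countByWindow(windowDuration=20, slideDuration=10).pprint()
# Print the per-batch sorted IP counts that fall within the same 20-second window
sorted_by.window(windowDuration=20, slideDuration=10).pprint()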

####### Exercise End ##########################################################
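
One subtlety of the cell above: `window()` gathers the elements of every batch in the window but does not re-aggregate them, so an IP that appears in several batches would be listed once per batch. The plain-Python sketch below contrasts that with a per-key aggregation over the window. The helper names and batch contents are made up, and in PySpark the aggregated form would presumably be built with `reduceByKeyAndWindow` followed by a sort.

```python
from collections import Counter

# Made-up per-batch (ip, count) results, each already sorted within its batch.
batch_counts = [
    [("10.0.0.153", 3), ("64.242.88.10", 1)],
    [("64.242.88.10", 5), ("10.0.0.153", 2)],
]


def window_union(batches):
    """Mimic window(): collect the elements of every batch, unchanged."""
    return [pair for batch in batches for pair in batch]


def window_reduce_sorted(batches):
    """Sum counts per key across the window, then sort by count, descending."""
    totals = Counter()
    for batch in batches:
        for ip, count in batch:
            totals[ip] += count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


unioned = window_union(batch_counts)
aggregated = window_reduce_sorted(batch_counts)
print(unioned)     # each IP appears once per batch
print(aggregated)  # [('64.242.88.10', 6), ('10.0.0.153', 5)]
```

In the recorded run all records arrived in a single batch, so the windowed output happens to show one row per IP.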

In [None]:
ssc.start() 
# ssc.awaitTermination()

In [None]:
# ssc.stop(stopSparkContext=True, stopGraceFully=True)

                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:18
-------------------------------------------



                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:18
-------------------------------------------



Cannot parse logline: h194n2fls308o1033.telia.com - - [09/Mar/2004:13:49:05 -0800] "-" 408 -
Cannot parse logline: h194n2fls308o1033.telia.com - - [09/Mar/2004:13:49:05 -0800] "-" 408 -
                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:28
-------------------------------------------
1545



                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:28
-------------------------------------------
('64.242.88.10', 452)
('10.0.0.153', 270)
('h24-71-236-129.ca.shawcable.net', 51)
('cr020r01-3.sac.overture.com', 44)
('h24-70-69-74.ca.shawcable.net', 32)
('market-mail.panduit.com', 29)
('ts04-ip92.hevanet.com', 28)
('mail.geovariances.fr', 23)
('ip68-228-43-49.tc.ph.cox.net', 22)
('207.195.59.160', 20)
...



                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:38
-------------------------------------------
1545

-------------------------------------------
Time: 2023-11-16 11:46:38
-------------------------------------------
('64.242.88.10', 452)
('10.0.0.153', 270)
('h24-71-236-129.ca.shawcable.net', 51)
('cr020r01-3.sac.overture.com', 44)
('h24-70-69-74.ca.shawcable.net', 32)
('market-mail.panduit.com', 29)
('ts04-ip92.hevanet.com', 28)
('mail.geovariances.fr', 23)
('ip68-228-43-49.tc.ph.cox.net', 22)
('207.195.59.160', 20)
...



                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:48
-------------------------------------------



                                                                                

-------------------------------------------
Time: 2023-11-16 11:46:48
-------------------------------------------



                                                                                

(Every later window, through Time: 2023-11-16 11:51:18, prints the same empty output.)

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
2. https://github.com/jadianes/kdd-cup-99-spark