# ⚡ Real-Time Word Count Using Spark Streaming

Spark Streaming processes live data streams in small micro-batches, which gives near-real-time results.  
This example demonstrates **real-time word count** using a **socket stream**.

---

## ✅ Step 1: Start a Socket Text Stream

Run this in a separate terminal:

```bash
nc -lk 9999
```

> This command starts a simple TCP server listening on port **9999**.
> Type lines of text here; Spark will read them as they arrive.
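
Where `nc` is unavailable or its `-k` flag is unsupported, a small Python script can stand in for it. The sketch below is an assumption, not part of the original setup: the helper name `start_line_server` and its `delay` parameter are hypothetical, and unlike `nc` it sends a predefined list of lines to the first client that connects rather than reading what you type.

```python
import socket
import threading
import time


def start_line_server(lines, host="localhost", port=9999, delay=0.0):
    """Listen on host:port and send each line to the first client that connects.

    Returns the background thread and the port actually bound
    (passing port=0 asks the OS for any free port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def serve():
        with server:
            conn, _ = server.accept()  # blocks until a client (e.g. Spark) connects
            with conn:
                for line in lines:
                    # Spark's socket text stream reads newline-delimited UTF-8 text
                    conn.sendall((line + "\n").encode("utf-8"))
                    time.sleep(delay)  # optional pause between lines

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return thread, bound_port
```

Calling `start_line_server(["hello spark", "hello world"], delay=1.0)` and then `join()` on the returned thread would serve those two lines on port 9999, one second apart, once a client connects.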

---

## 🔹 Step 2: Spark Streaming Word Count Script

Create a Spark Streaming application (`WordCountStreaming.scala`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStreaming {
  def main(args: Array[String]): Unit = {

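    // local[*] runs Spark locally on all cores; the socket receiver occupies one thread,
    // so at least two threads are needed for any processing to happen
    val conf = new SparkConf().setAppName("RealTimeWordCount").setMaster("local[*]")
    // Group incoming data into 2-second batches
    val ssc = new StreamingContext(conf, Seconds(2))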

    // Connect to socket stream
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split lines into words
    val words = lines.flatMap(_.split(" "))

    // Map each word to (word, 1)
    val wordPairs = words.map(word => (word, 1))

    // Reduce by key (word count)
    val wordCounts = wordPairs.reduceByKey(_ + _)

    // Print results
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```
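
To see what each transformation contributes, the same pipeline can be traced in plain Python on a single batch. This is a sketch without Spark, meant only to mirror `flatMap`, `map`, and `reduceByKey`; the function name `count_words_in_batch` is made up for illustration.

```python
from collections import Counter
from itertools import chain


def count_words_in_batch(lines):
    """Mimic flatMap -> map -> reduceByKey for the lines of one batch."""
    # flatMap(_.split(" ")): every line becomes zero or more words
    words = list(chain.from_iterable(line.split(" ") for line in lines))
    # map(word => (word, 1)): pair each word with a count of one
    pairs = [(word, 1) for word in words]
    # reduceByKey(_ + _): add up the ones for each distinct word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


batch = ["hello spark", "hello world"]
print(count_words_in_batch(batch))
# prints {'hello': 2, 'spark': 1, 'world': 1}
```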

---

## ▶️ Step 3: Run the Spark Streaming App

```bash
spark-submit --class WordCountStreaming wordcountstreaming.jar
```

---

## 🔥 Live Output Example

If you type this in the `nc` terminal:

```
hello spark
hello world
```

If both lines arrive in the same 2-second batch, you will see output like:

```
-------------------------------------------
Time: 1768903202000 ms
-------------------------------------------
(hello,2)
(spark,1)
(world,1)
```
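
Note that `reduceByKey` on a DStream counts words within each batch only. The plain-Python sketch below, which assumes the two lines land in different batches, contrasts the per-batch counts with a running total kept across batches (the DStream API provides `updateStateByKey` for that kind of state). The variable names are illustrative.

```python
from collections import Counter

# One list of lines per 2-second batch (assumed arrival pattern)
batches = [["hello spark"], ["hello world"]]

per_batch = []             # what a per-batch count reports for each interval
running_total = Counter()  # state carried across batches

for lines in batches:
    batch_counts = Counter(word for line in lines for word in line.split(" "))
    per_batch.append(dict(batch_counts))
    running_total.update(batch_counts)

print(per_batch)
# prints [{'hello': 1, 'spark': 1}, {'hello': 1, 'world': 1}]
print(dict(running_total))
# prints {'hello': 2, 'spark': 1, 'world': 1}
```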


## PySpark

The following notebook cell simulates streaming in Colab without sockets: each predefined line is treated as one batch, and the word counts accumulate across batches.

```python
# ================================
# Real-Time Word Count Simulation in Colab
# Fully self-contained, no files or sockets
# ================================

# Notebook shell command; run it in a Colab/Jupyter cell
!pip install pyspark -q

import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Create Spark session
spark = SparkSession.builder.appName("ColabStreamingWordCount").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Predefined streaming data: each string acts as one incoming batch
stream_data = [
    "hello spark",
    "hello colab spark",
    "spark streaming in colab",
    "real time word count example",
    "spark spark hello"
]

# Define schema for cumulative DataFrame
# (LongType matches the type that groupBy().count() produces)
schema = StructType([
    StructField("word", StringType(), True),
    StructField("count", LongType(), True)
])

# Create empty cumulative DataFrame with schema
cumulative_counts = spark.createDataFrame([], schema)

# Process each "batch" to simulate streaming
for batch_num, line in enumerate(stream_data, 1):
    # Create DataFrame for current batch
    batch_df = spark.createDataFrame([(line,)], ["value"])

    # Split line into words, one row per word
    words_df = batch_df.select(explode(split(col("value"), " ")).alias("word"))

    # Count words in this batch
    batch_counts = words_df.groupBy("word").count()

    # Merge with cumulative counts
    if cumulative_counts.count() == 0:
        cumulative_counts = batch_counts
    else:
        cumulative_counts = cumulative_counts.union(batch_counts) \
            .groupBy("word") \
            .sum("count") \
            .withColumnRenamed("sum(count)", "count")

    # Display cumulative counts
    print(f"\n--- Batch {batch_num} ---")
    cumulative_counts.show()

    # Wait 2 seconds to simulate streaming
    time.sleep(2)
```


Output:

```
--- Batch 1 ---
+-----+-----+
| word|count|
+-----+-----+
|hello|    1|
|spark|    1|
+-----+-----+

--- Batch 2 ---
+-----+-----+
| word|count|
+-----+-----+
|hello|    2|
|spark|    2|
|colab|    1|
+-----+-----+

--- Batch 3 ---
+---------+-----+
|     word|count|
+---------+-----+
|    hello|    2|
|    spark|    3|
|    colab|    2|
|       in|    1|
|streaming|    1|
+---------+-----+

--- Batch 4 ---
+---------+-----+
|     word|count|
+---------+-----+
|       in|    1|
|    hello|    2|
|streaming|    1|
|    spark|    3|
|    colab|    2|
|  example|    1|
|    count|    1|
|     real|    1|
|     word|    1|
|     time|    1|
+---------+-----+

--- Batch 5 ---
+---------+-----+
|     word|count|
+---------+-----+
|  example|    1|
|       in|    1|
|    hello|    3|
|    count|    1|
|streaming|    1|
|     real|    1|
|    spark|    5|
|     word|    1|
|    colab|    2|
|     time|    1|
+---------+-----+
```

