<a href="https://colab.research.google.com/github/ShovalBenjer/Bigdata_Pyspark_Spark_Hadoop_Apache/blob/main/ex_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TL;DR
Collaborators: Shoval Benjer 319037404, Adir Amar 209017755

The Kafka-Spark pipeline consumed sentiment data from the sentiments topic, processing words and their sentiment scores in real time. The consumer exited once the shared stop event was set, and the producer threads terminated gracefully, as the output of the final cell shows. This demonstrates that the pipeline can stream data continuously and still shut down cleanly on request.

# **Setup**

**System Requirements:**

Operating System: Linux-based environment (recommended for compatibility) or Windows with WSL2.

Software: Python 3.8+, Java 8 (OpenJDK 8), Apache Spark, and Apache Kafka.

Libraries: pyspark and kafka-python (installed with pip); threading and json are part of the Python standard library.

**Environment Setup:**

**Google Colab is recommended for running the notebook.**

Local System: Ensure you have Apache Spark and Kafka installed with appropriate environment variables configured.


Description for Each Step:

      Install Java:
      This command installs the headless OpenJDK 8 JDK, a necessary dependency for running Apache Spark and Kafka. The -qq flag minimizes output during the installation process, and the remaining output is redirected to /dev/null.

      Download Apache Spark:
      Downloads Apache Spark version 3.5.0 with Hadoop 3 compatibility from the official Apache archives. Spark is a distributed computing framework essential for big data processing tasks.

      Verify the Spark Download:
      Lists the downloaded Spark tarball to confirm that the file has been successfully downloaded.

      Extract the Spark Archive:
      Unpacks the Spark tarball to make the Spark distribution files accessible for configuration and usage.

      Move Spark to the Local Directory:
      Moves the extracted Spark directory to /usr/local/spark, setting a standard location for Spark installation, simplifying environment variable configuration.

      Download Apache Kafka:
      Downloads Apache Kafka version 3.5.1 (Scala version 2.13), a distributed event-streaming platform commonly used for real-time data pipelines and streaming applications.

      Verify the Kafka Download:
      Lists the downloaded Kafka tarball to ensure successful file retrieval.

      Extract the Kafka Archive:
      Unpacks the Kafka tarball to access its binaries and configuration files.

      Move Kafka to the Local Directory:
      Moves the extracted Kafka directory to /usr/local/kafka for organized setup and easier configuration.

      Set Environment Variables:
      Sets JAVA_HOME and SPARK_HOME and appends the Spark and Kafka bin directories to PATH, so their executables can be called by name from later notebook cells. The changes apply to the notebook process and the shell commands it launches.

      Install Python Libraries:
      Installs pyspark for interacting with Spark using Python and kafka-python for Kafka integration within Python applications.

      Start Zookeeper:
      Launches Zookeeper, the coordination service that Kafka uses in this setup to store cluster metadata and coordinate brokers.

      Start Kafka Broker:
      Starts the Kafka broker service, which handles message queuing, storage, and distribution to clients in a publish-subscribe model.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!ls -l spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz
!mv spark-3.5.0-bin-hadoop3 /usr/local/spark
!wget -q https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
!ls -l kafka_2.13-3.5.1.tgz
!tar xf kafka_2.13-3.5.1.tgz
!mv kafka_2.13-3.5.1 /usr/local/kafka
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PATH"] += ":/usr/local/spark/bin"
os.environ["PATH"] += ":/usr/local/kafka/bin"
!pip install pyspark kafka-python

-rw-r--r-- 1 root root 400395283 Sep  9  2023 spark-3.5.0-bin-hadoop3.tgz
-rw-r--r-- 1 root root 106748875 Jul 21  2023 kafka_2.13-3.5.1.tgz


In [2]:
!nohup /usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties &
!nohup /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties &

nohup: appending output to 'nohup.out'
nohup: appending output to 'nohup.out'
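
Both services are started in the background, so a later cell can run before the broker accepts connections. The following is a minimal sketch of a readiness check, assuming the default ports of the stock configuration files (2181 for Zookeeper, 9092 for Kafka). `wait_for_port` is a hypothetical helper written for this notebook, not part of Kafka or kafka-python, and an open port only indicates that the process is listening, not that it has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 9092)` before the topic-creation cell would delay it until the broker port is open, or return False after 30 seconds.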


**Creating Kafka Topics for Data Streams**

    This step creates two Kafka topics, sentiments and text, each with one partition and a replication factor of 1, so that the word-score stream and the user-sentence stream stay separate.

In [3]:
!kafka-topics.sh --create --topic sentiments --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
!kafka-topics.sh --create --topic text --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1

Error while executing topic command : Topic 'sentiments' already exists.
[2025-01-02 10:57:13,482] ERROR org.apache.kafka.common.errors.TopicExistsException: Topic 'sentiments' already exists.
 (kafka.admin.TopicCommand$)
Error while executing topic command : Topic 'text' already exists.
[2025-01-02 10:57:15,241] ERROR org.apache.kafka.common.errors.TopicExistsException: Topic 'text' already exists.
 (kafka.admin.TopicCommand$)
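
The TopicExistsException messages above appear when the cell is re-run after the topics were already created. One way to make the step repeatable is to create only the topics that are missing. The sketch below assumes kafka-python's `KafkaAdminClient` and a broker on localhost:9092; `missing_topics` and `ensure_topics` are hypothetical helper names, not part of the original notebook.

```python
def missing_topics(wanted, existing):
    """Return the wanted topic names that do not exist yet, preserving order."""
    existing = set(existing)
    return [name for name in wanted if name not in existing]


def ensure_topics(wanted, bootstrap_servers="localhost:9092"):
    """Create only the missing topics (1 partition, replication factor 1)."""
    # Imported lazily so the helper can be defined before kafka-python is installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        to_create = missing_topics(wanted, admin.list_topics())
        if to_create:
            admin.create_topics(
                [NewTopic(name=name, num_partitions=1, replication_factor=1)
                 for name in to_create]
            )
        return to_create
    finally:
        admin.close()
```

With a running broker, `ensure_topics(["sentiments", "text"])` returns the names it created and an empty list on later runs.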


**Producer for Sentiments (producer_sentiments):**

    This function reads word-score pairs from a tab-separated sentiment file (e.g., AFINN-111.txt) and continuously sends them to the specified Kafka topic. It simulates a real-time stream by sending 100 pairs every 2 seconds and rotating through the dataset, so the file is replayed once its end is reached. Each message is a JSON object with a word field and a sentiment field, encoded as UTF-8.
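
The parsing and rotation logic can be tried without a broker. The sketch below isolates the two steps, assuming tab-separated lines of the form word, tab, integer score. `parse_afinn_line` and `rotating_batches` are hypothetical helper names, and the three sample lines are illustrative rather than read from the file.

```python
from itertools import islice


def parse_afinn_line(line):
    """Split one tab-separated line into (word, integer score)."""
    word, score = line.rstrip("\n").split("\t")
    return word, int(score)


def rotating_batches(items, batch_size):
    """Yield batches forever, moving each emitted batch to the end of the list."""
    items = list(items)
    while items:
        batch = items[:batch_size]
        yield batch
        items = items[batch_size:] + batch


# Illustrative sample lines (not loaded from AFINN-111.txt).
sample = ["abandon\t-2\n", "good\t3\n", "bad\t-3\n"]
pairs = [parse_afinn_line(line) for line in sample]
# Take the first three batches of size 2; the data wraps around after the end.
first_three = list(islice(rotating_batches(pairs, 2), 3))
```

Because each batch is appended back to the end of the list, the generator never runs out of data, which matches the replay behaviour of the producer.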

**Producer for Text (producer_text):**

    This function reads sentences typed at the console and sends each non-empty one to the specified Kafka topic as a UTF-8 encoded string. Because input() blocks, the stop event is only checked between sentences.

**Spark Kafka Consumer (spark_kafka_consumer):**

    This function consumes messages from two Kafka topics: one for sentiment data and another for user text input. It maintains a dictionary of word sentiments that is updated from the sentiment topic as messages arrive. For each sentence from the text topic, it uses Apache Spark to keep the words present in the dictionary and computes the Total Sentiment Level (TSL) as the mean of their scores; a sentence with no known words is reported with a TSL of 0. Errors raised while decoding or processing messages are caught and reported, and the loop exits once the shared stop event is set.
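
The TSL calculation itself is small enough to check by hand. The following plain-Python sketch mirrors the Spark filter and map steps, assuming whitespace-delimited, case-sensitive matching as in the consumer; `total_sentiment_level` is a hypothetical name and the scores in the example dictionary are illustrative.

```python
def total_sentiment_level(sentence, sentiment_dict):
    """Mean score of the words found in sentiment_dict; 0.0 if none are known."""
    scores = [sentiment_dict[word] for word in sentence.split() if word in sentiment_dict]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative scores; in the pipeline the dictionary is filled from the sentiments topic.
example_scores = {"good": 3, "bad": -3}
tsl = total_sentiment_level("good movie bad ending good", example_scores)
```

Here the known words contribute 3, -3, and 3, so the mean is 1.0. A capitalized word such as "Good" would not match, because no normalization is applied.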

In [4]:
from kafka import KafkaProducer
import json
import time
import threading

def producer_sentiments(file_path, topic, stop_event, bootstrap_servers='localhost:9092'):
    """
    Reads word-sentiment pairs from AFINN and sends them to the `sentiments` Kafka topic.
    Streams 100 word-sentiment pairs every 2 seconds.
    """
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode('utf-8')
    )
    with open(file_path, 'r') as f:
        # Each non-empty line is "<word>\t<score>"; skip blank lines.
        lines = [line for line in f if line.strip()]
    sentiment_data = [(line.split('\t')[0], int(line.split('\t')[1])) for line in lines]

    try:
        while not stop_event.is_set():
            batch = sentiment_data[:100]
            for word, sentiment in batch:
                producer.send(topic, {'word': word, 'sentiment': sentiment})
            sentiment_data = sentiment_data[100:] + batch
            # Pause 2 seconds, but wake immediately if a stop is requested.
            stop_event.wait(2)
    except Exception as e:
        print(f"[producer_sentiments] Error: {e}")
    finally:
        producer.close()
        print("[producer_sentiments] Exiting...")


def producer_text(topic, stop_event, bootstrap_servers='localhost:9092'):
    """
    Takes user input from the console and sends sentences to the `text` Kafka topic.
    """
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: v.encode('utf-8')
    )
    try:
        while not stop_event.is_set():
            # input() blocks, so stop_event is only re-checked after a line is entered.
            user_input = input("Enter a sentence to analyze (Ctrl+C to stop): ")
            if user_input.strip():
                producer.send(topic, user_input)
                print(f"[producer_text] Sent: {user_input}")
    except KeyboardInterrupt:
        print("\n[producer_text] Interrupted by user.")
    except Exception as e:
        print(f"[producer_text] Error: {e}")
    finally:
        producer.close()
        print("[producer_text] Exiting...")



In [5]:
import json

from kafka import KafkaConsumer
from pyspark.sql import SparkSession

def spark_kafka_consumer(bootstrap_servers='localhost:9092',
                         sentiment_topic='sentiments',
                         text_topic='text',
                         stop_event=None):
    """
    Consumes from the `sentiments` and `text` Kafka topics.
    Calculates and displays the Total Sentiment Level (TSL) for each new sentence in `text`.
    """
    spark = SparkSession.builder.appName("KafkaSparkConsumer").getOrCreate()
    sc = spark.sparkContext

    consumer = KafkaConsumer(
        sentiment_topic,
        text_topic,
        bootstrap_servers=bootstrap_servers,
        value_deserializer=lambda v: v.decode('utf-8'),
        auto_offset_reset='earliest',
        enable_auto_commit=True,
        # End iteration after 1 s without messages so the stop event is re-checked.
        consumer_timeout_ms=1000
    )

    sentiment_dict = {}

    def stop_requested():
        return stop_event is not None and stop_event.is_set()

    try:
        while not stop_requested():
            for message in consumer:
                if stop_requested():
                    break

                topic = message.topic
                value = message.value

                if topic == sentiment_topic:
                    # Update sentiment dictionary; skip malformed messages.
                    try:
                        sentiment_data = json.loads(value)
                        sentiment_dict[sentiment_data['word']] = sentiment_data['sentiment']
                    except (json.JSONDecodeError, KeyError, TypeError) as e:
                        print(f"[spark_kafka_consumer] Skipping bad sentiment message: {e}")

                elif topic == text_topic:
                    # Calculate TSL (mean score of known words) for the new sentence
                    words = value.split()
                    words_rdd = sc.parallelize(words)
                    known_sentiments_rdd = words_rdd.filter(lambda word: word in sentiment_dict)\
                                                    .map(lambda word: sentiment_dict[word])
                    known_sentiments = known_sentiments_rdd.collect()

                    if known_sentiments:
                        tsl = sum(known_sentiments) / len(known_sentiments)
                        print(f"[spark_kafka_consumer] TSL for \"{value}\": {tsl:.2f}")
                    else:
                        print(f"[spark_kafka_consumer] TSL for \"{value}\": 0 (No known words)")

        print("[spark_kafka_consumer] Stop event detected. Exiting loop.")

    except Exception as e:
        print(f"[spark_kafka_consumer] Error: {e}")
    finally:
        consumer.close()
        sc.stop()
        print("[spark_kafka_consumer] Exiting...")


In [6]:
stop_event = threading.Event()

# Start the sentiment producer
producer_sentiments_thread = threading.Thread(
    target=producer_sentiments,
    args=("AFINN-111.txt", "sentiments", stop_event),
    name="ProducerSentimentsThread"
)
producer_sentiments_thread.start()

# Start the text producer
producer_text_thread = threading.Thread(
    target=producer_text,
    args=("text", stop_event),
    name="ProducerTextThread"
)
producer_text_thread.start()

# Start the Spark Kafka consumer
consumer_thread = threading.Thread(
    target=spark_kafka_consumer,
    kwargs={
        "bootstrap_servers": "localhost:9092",
        "sentiment_topic": "sentiments",
        "text_topic": "text",
        "stop_event": stop_event
    },
    name="SparkKafkaConsumerThread"
)
consumer_thread.start()


In [7]:
stop_event.set()
producer_sentiments_thread.join()
producer_text_thread.join()
consumer_thread.join()
print("[main] All threads have been stopped. Exiting program.")

[producer_text] Exiting...
[producer_sentiments] Exiting...




[spark_kafka_consumer] Stop event detected. Exiting loop.
[spark_kafka_consumer] Exiting...
[main] All threads have been stopped. Exiting program.
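
Joining the threads returns promptly only if each loop re-checks the stop event often enough. The following is a minimal sketch of an interruptible periodic loop that pauses with `threading.Event.wait` instead of a fixed sleep; `run_periodically` is a hypothetical helper, and the action passed to it stands in for sending one batch.

```python
import threading


def run_periodically(action, interval, stop_event):
    """Call action() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        action()
        # Event.wait returns as soon as the event is set, so a long interval
        # does not delay shutdown the way a fixed sleep would.
        stop_event.wait(interval)
```

A thread running this loop with a 30-second interval still exits almost immediately after `stop_event.set()`. Calls that block without a timeout, such as `input()`, are not covered by this pattern.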
