<a href="https://colab.research.google.com/github/ShovalBenjer/Bigdata_Pyspark_Spark_Hadoop_Apache/blob/main/ex_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TL;DR
Collaborators: Shoval Benjer 319037404, Adir Amar 209017755

The Kafka-Spark pipeline consumed sentiment data from the sentiments topic and printed each word with its sentiment score as it arrived. After receiving 10 messages, the consumer stopped as configured (stop_after=10). The producer threads are daemon threads: the main program waits on them for a few seconds, and they end when the Python process exits. The run shows the consumer handling a data stream within a predefined message limit.

# **Setup**

**System Requirements:**

    Operating System: Linux-based environment (recommended for compatibility) or Windows with WSL2.

Software:

    Python 3.8+, Java 8 (OpenJDK 8), Apache Spark, and Apache Kafka.

Libraries:

    pyspark and kafka-python (installed with pip); threading, time, and json ship with the Python standard library.

Environment Setup:

**Google Colab is recommended for running the notebook.**

Local System: Ensure you have Apache Spark and Kafka installed with appropriate environment variables configured.


Description for Each Step:

      Install Java:
      This command installs the headless OpenJDK 8 JDK, which both Apache Spark and Kafka need in order to run. The -qq flag suppresses most of the installation output.

      Download Apache Spark:
      Downloads Apache Spark version 3.5.0 with Hadoop 3 compatibility from the official Apache archives. Spark is a distributed computing framework essential for big data processing tasks.

      Verify the Spark Download:
      Lists the downloaded Spark tarball to confirm that the file has been successfully downloaded.

      Extract the Spark Archive:
      Unpacks the Spark tarball to make the Spark distribution files accessible for configuration and usage.

      Move Spark to the Local Directory:
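      Moves the extracted Spark directory to /usr/local/spark, a standard install location that simplifies environment variable configuration. If /usr/local/spark already exists from an earlier run, mv does not replace it; the "Directory not empty" messages in the cell output come from such a re-run, and the earlier installation stays in place.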

      Download Apache Kafka:
      Downloads Apache Kafka version 3.5.1 (Scala version 2.13), a distributed event-streaming platform commonly used for real-time data pipelines and streaming applications.

      Verify the Kafka Download:
      Lists the downloaded Kafka tarball to ensure successful file retrieval.

      Extract the Kafka Archive:
      Unpacks the Kafka tarball to access its binaries and configuration files.

      Move Kafka to the Local Directory:
      Moves the extracted Kafka directory to /usr/local/kafka for organized setup and easier configuration.

      Set Environment Variables:
      Sets JAVA_HOME and SPARK_HOME and appends the Spark and Kafka bin directories to PATH, so that their command-line tools can be called by name from later cells.

      Install Python Libraries:
      Installs pyspark for interacting with Spark using Python and kafka-python for Kafka integration within Python applications.

      Start Zookeeper:
      Launches Zookeeper, the coordination service this Kafka setup uses to track broker metadata and configuration. It must be running before the Kafka broker starts.

      Start Kafka Broker:
      Starts the Kafka broker service, which handles message queuing, storage, and distribution to clients in a publish-subscribe model.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!ls -l spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz
!mv spark-3.5.0-bin-hadoop3 /usr/local/spark
!wget -q https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
!ls -l kafka_2.13-3.5.1.tgz
!tar xf kafka_2.13-3.5.1.tgz
!mv kafka_2.13-3.5.1 /usr/local/kafka
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PATH"] += ":/usr/local/spark/bin"
os.environ["PATH"] += ":/usr/local/kafka/bin"
!pip install pyspark kafka-python

-rw-r--r-- 1 root root 400395283 Sep  9  2023 spark-3.5.0-bin-hadoop3.tgz
mv: cannot move 'spark-3.5.0-bin-hadoop3' to '/usr/local/spark/spark-3.5.0-bin-hadoop3': Directory not empty
-rw-r--r-- 1 root root 106748875 Jul 21  2023 kafka_2.13-3.5.1.tgz
mv: cannot move 'kafka_2.13-3.5.1' to '/usr/local/kafka/kafka_2.13-3.5.1': Directory not empty


In [None]:
!nohup /usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties &
!nohup /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties &

nohup: appending output to 'nohup.out'
nohup: appending output to 'nohup.out'
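
Both services start in the background, so later cells can run before they are listening. A minimal sketch of a readiness check follows, assuming the default ports from the shipped configuration files (2181 for Zookeeper, 9092 for the broker). The helper name wait_for_port is hypothetical and not part of the notebook, and an open port only shows that the process is accepting connections, not that the broker is fully initialized.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: it only checks that something is listening on the port.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling wait_for_port("localhost", 2181) and then wait_for_port("localhost", 9092) before creating topics would be one way to use it.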


Creating Kafka Topics for Data Streams

    Creates two Kafka topics, sentiments and text, each with one partition and a replication factor of 1, so that sentiment scores and user sentences travel on separate streams.

In [None]:
!kafka-topics.sh --create --topic sentiments --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
!kafka-topics.sh --create --topic text --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1

**Producer for Sentiments (producer_sentiments):**

    This function reads sentiment data from a tab-separated file (e.g., AFINN-111.txt) and continuously sends word/score pairs to the specified Kafka topic. It simulates a real-time stream by sending a batch of 100 entries every 2 seconds and rotating through the dataset, so the file is replayed indefinitely. Each message is JSON-encoded before it is sent to the topic.
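
The parsing and rotation steps can be tried without a running broker. The sketch below uses two hypothetical helpers, parse_afinn_line and rotate_batch, which are not part of the notebook; the sample lines reuse words and scores that appear in the consumer output at the end of this notebook, and a batch size of 2 stands in for the producer's 100.

```python
def parse_afinn_line(line):
    """Split one 'word<TAB>score' line into (word, int score); None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return None
    try:
        return parts[0], int(parts[1])
    except ValueError:
        return None

def rotate_batch(data, batch_size):
    """Return (batch, rotated): the first batch_size items, and data with them moved to the end."""
    batch = data[:batch_size]
    return batch, data[batch_size:] + batch

# Sample lines in the AFINN layout, including one blank line that is skipped
lines = ["boosts\t1\n", "bore\t-2\n", "\n", "boring\t-3\n"]
entries = [e for e in map(parse_afinn_line, lines) if e is not None]

batch, entries = rotate_batch(entries, 2)
print(batch)    # [('boosts', 1), ('bore', -2)]
print(entries)  # [('boring', -3), ('boosts', 1), ('bore', -2)]
```

Because each batch is appended to the end of the list after it is sent, the producer never runs out of data and every entry is resent once per full cycle.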

**Producer for Text (producer_text):**

    This function allows users to input text sentences via the console and sends them to a specified Kafka topic in real time. It uses a Kafka producer to serialize the text and send it as a message. This enables dynamic user interaction and real-time data streaming for text analysis.

**Spark Kafka Consumer (spark_kafka_consumer):**

    This function consumes messages from two Kafka topics: one for sentiment data and another for user text input. It opens a Spark session, but in the code as written the messages are read with a kafka-python KafkaConsumer and processed in plain Python; no Spark operations are applied to them. The consumer maintains a dictionary of word sentiments and updates it from the sentiment topic as messages arrive. For each input sentence it looks up the words it knows and reports the Total Sentiment Level (TSL), computed here as the average score of those words. JSON decoding errors and other processing errors are caught and printed, and a configurable limit (stop_after) ends the loop after a set number of messages.
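
The dictionary update and the TSL calculation can be illustrated in isolation. This is a minimal sketch under a simplified matching rule (whitespace split, exact match); the helper names update_sentiments and total_sentiment_level are hypothetical, and the two JSON messages are copied from the consumer output shown at the end of this notebook.

```python
import json

def update_sentiments(sentiment_dict, raw_message):
    """Parse one JSON message from the sentiment topic and store its score."""
    record = json.loads(raw_message)
    sentiment_dict[record["word"]] = record["sentiment"]

def total_sentiment_level(sentence, sentiment_dict):
    """Average score of the words found in sentiment_dict, or None if none are known."""
    scores = [sentiment_dict[w] for w in sentence.split() if w in sentiment_dict]
    if not scores:
        return None
    return sum(scores) / len(scores)

scores = {}
for raw in ['{"word": "boosts", "sentiment": 1}', '{"word": "boring", "sentiment": -3}']:
    update_sentiments(scores, raw)

print(total_sentiment_level("this talk boosts morale but is boring", scores))  # -1.0
print(total_sentiment_level("nothing known here", scores))                      # None
```

Only words already received from the sentiment topic contribute, so the same sentence can score differently early in a run than it does once more of the dictionary has arrived.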

In [None]:
from kafka import KafkaProducer, KafkaConsumer
from pyspark.sql import SparkSession
import threading
import time
import json

def producer_sentiments(file_path, topic, bootstrap_servers='localhost:9092'):
    """
    Reads sentiment data from a file and sends it to the specified Kafka topic.

    Args:
        file_path (str): Path to the sentiment file (e.g., 'AFINN-111.txt').
        topic (str): Kafka topic to send the data to.
        bootstrap_servers (str): Kafka server address. Default is 'localhost:9092'.
    """
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode('utf-8')
    )
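    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Each line is "word<TAB>score"; skip blank or malformed lines
    sentiment_data = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 2:
            sentiment_data.append((parts[0], int(parts[1])))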
    while True:
        batch = sentiment_data[:100]
        for word, sentiment in batch:
            producer.send(topic, {'word': word, 'sentiment': sentiment})
        sentiment_data = sentiment_data[100:] + batch  # Rotate data
        time.sleep(2)

def producer_text(topic, bootstrap_servers='localhost:9092'):
    """
    Reads user input from the console and sends it to the specified Kafka topic.

    Args:
        topic (str): Kafka topic to send the data to.
        bootstrap_servers (str): Kafka server address. Default is 'localhost:9092'.
    """
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: v.encode('utf-8')
    )
    while True:
        user_input = input("Enter a sentence to analyze: ")
        producer.send(topic, user_input)
        print(f"Sent: {user_input}")

def spark_kafka_consumer(bootstrap_servers='localhost:9092', sentiment_topic='sentiments', text_topic='text', stop_after=10):
    """
    Consumes messages from Kafka topics and calculates the Total Sentiment Level (TSL) using Spark.

    Args:
        bootstrap_servers (str): Kafka server address. Default is 'localhost:9092'.
        sentiment_topic (str): Kafka topic for sentiment data.
        text_topic (str): Kafka topic for text data.
        stop_after (int): Number of messages to process before stopping. Default is 10.
    """
    # Start Spark session (KafkaConsumer and SparkSession are imported at the top of the cell)
    spark = SparkSession.builder.appName("KafkaSparkConsumer").getOrCreate()

    # Create Kafka consumer
    consumer = KafkaConsumer(
        sentiment_topic,
        text_topic,
        bootstrap_servers=bootstrap_servers,
        value_deserializer=lambda v: v.decode('utf-8'),  # Deserialize as string
    )

    # Dictionary to store sentiments
    sentiment_dict = {}
    message_count = 0

    for message in consumer:
        try:
            topic = message.topic
            value = message.value
            print(f"Received message from topic '{topic}': {value}")

            if topic == sentiment_topic:
                # Parse JSON for sentiment data
                sentiment_data = json.loads(value)
                sentiment_dict[sentiment_data['word']] = sentiment_data['sentiment']
            elif topic == text_topic:
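                # Process text data: lower-case and strip punctuation so "Boring!" matches "boring"
                words = [w.strip('.,!?;:"\'()') for w in value.lower().split()]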
                known_sentiments = [sentiment_dict[word] for word in words if word in sentiment_dict]
                if known_sentiments:
                    tsl = sum(known_sentiments) / len(known_sentiments)
                    print(f"TSL for '{value}': {tsl}")
                else:
                    print(f"TSL for '{value}': No known words in sentiment dictionary")
        except json.JSONDecodeError:
            print("Error decoding JSON, skipping message:", message.value)
        except Exception as e:
            print(f"Error processing message: {e}")

        message_count += 1
        if message_count >= stop_after:
            print("Processed maximum messages. Stopping consumer...")
            break

    # Release the consumer's network resources once the loop ends
    consumer.close()

**Main Function Execution (if __name__ == "__main__":):**

    Acts as the entry point for the application, coordinating the execution of Kafka producers and the Spark Kafka consumer. It orchestrates the threads for producers and ensures proper shutdown after processing is complete.

**Starting Kafka Producer for Sentiments (producer_sentiments_thread):**

    Initializes a separate thread to execute the producer_sentiments function, which streams sentiment data from a file (AFINN-111.txt) to the Kafka sentiments topic in real time. The thread runs as a daemon, ensuring it stops when the main program ends.

**Starting Kafka Producer for Text (producer_text_thread):**

    Initializes another daemon thread to execute the producer_text function, which allows users to input text and streams the sentences to the Kafka text topic in real time. This thread operates independently, enabling simultaneous interaction with the consumer.

**Running the Spark Kafka Consumer (spark_kafka_consumer):**

    Starts the consumer, which subscribes to both Kafka topics (sentiments and text), processes messages, and calculates the Total Sentiment Level (TSL) for user input. The consumer stops after a configurable number of messages (stop_after=10), which keeps the demo run bounded.

**Stopping Producer Threads:**

    After the consumer finishes, the main program calls join with a 5-second timeout on each producer thread. Because both producers loop forever, join returns after the timeout without stopping them; as daemon threads they end only when the Python process exits. In a notebook session they therefore keep running until the runtime is restarted.
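
One common way to make a looping producer actually stop is to pass it a threading.Event and check the event on every iteration. The sketch below is an illustration under stated assumptions, not the notebook's implementation: looping_producer is a hypothetical function, and its send argument stands in for a call such as producer.send(topic, ...).

```python
import threading
import time

def looping_producer(send, items, stop_event, interval=0.01):
    """Call send(item) for each item, cycling until stop_event is set."""
    index = 0
    while not stop_event.is_set():
        send(items[index % len(items)])
        index += 1
        # wait() returns early once the event is set, so shutdown is prompt
        stop_event.wait(interval)

sent = []  # collects what the stand-in "send" receives
stop_event = threading.Event()
worker = threading.Thread(
    target=looping_producer, args=(sent.append, ["a", "b"], stop_event), daemon=True
)
worker.start()

time.sleep(0.1)         # let the producer run briefly
stop_event.set()        # ask the loop to finish
worker.join(timeout=2)  # now join returns because the loop has exited
print(worker.is_alive())  # False
```

A producer that is blocked in input() would not notice the event until the next line is entered, so the text producer would need a different approach, such as a sentinel input that ends its loop.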



In [None]:
import threading
import time

# Main function to start producers and consumer
if __name__ == "__main__":
    # Path to the sentiment file
    sentiment_file = "AFINN-111.txt"

    # Start the Kafka producer for 'sentiments' topic
    producer_sentiments_thread = threading.Thread(
        target=producer_sentiments, args=(sentiment_file, "sentiments")
    )
    producer_sentiments_thread.daemon = True
    producer_sentiments_thread.start()

    # Start the Kafka producer for 'text' topic
    producer_text_thread = threading.Thread(
        target=producer_text, args=("text",)
    )
    producer_text_thread.daemon = True
    producer_text_thread.start()

    # Run the Kafka consumer for a limited number of messages
    spark_kafka_consumer(stop_after=10)

    # Wait up to 5 seconds per producer; join() does not stop these daemon threads
    print("Stopping producer threads...")
    producer_sentiments_thread.join(timeout=5)
    producer_text_thread.join(timeout=5)
    print("Producers stopped.")



Received message from topic 'sentiments': {"word": "boosts", "sentiment": 1}
Received message from topic 'sentiments': {"word": "bore", "sentiment": -2}
Received message from topic 'sentiments': {"word": "bored", "sentiment": -2}
Received message from topic 'sentiments': {"word": "boring", "sentiment": -3}
Received message from topic 'sentiments': {"word": "bother", "sentiment": -2}
Received message from topic 'sentiments': {"word": "bothered", "sentiment": -2}
Received message from topic 'sentiments': {"word": "bothers", "sentiment": -2}
Received message from topic 'sentiments': {"word": "bothersome", "sentiment": -2}
Received message from topic 'sentiments': {"word": "boycott", "sentiment": -2}
Received message from topic 'sentiments': {"word": "boycotted", "sentiment": -2}
Processed maximum messages. Stopping consumer...
Stopping producer threads...
Producers stopped.
