# 📝 PySpark Streaming Assignment: Real-Time Sentiment Analysis of Tweets

## 🎯 Objective
Use PySpark Structured Streaming to read live tweets from a socket or text stream, classify the sentiment of each tweet, and display the results in real time.

## 📦 Dataset Source
A simulated tweet stream: either type lines into Netcat (`nc`) or stream a text file over a socket, one tweet per line.
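If Netcat is not available, the simulated stream can also be produced from Python. The following is a minimal sketch using only the standard library; the names `open_server`, `stream_tweets`, `start_in_background`, and `SAMPLE_TWEETS` are hypothetical helpers made up for this illustration, and the host and port are assumed to match the values used in the code cell below.

```python
import socket
import threading
import time

# Hypothetical sample data; replace with lines read from your own text file.
SAMPLE_TWEETS = [
    "I love this great phone",
    "What a terrible, sad day",
    "Just landed in Paris",
]


def open_server(host="localhost", port=9999):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def stream_tweets(server, tweets, delay=1.0):
    """Accept one client and send it one newline-terminated tweet every `delay` seconds."""
    conn, _ = server.accept()
    with conn:
        for tweet in tweets:
            conn.sendall((tweet.strip() + "\n").encode("utf-8"))
            time.sleep(delay)


def start_in_background(server, tweets, delay=1.0):
    """Run stream_tweets in a daemon thread so a notebook cell does not block."""
    thread = threading.Thread(
        target=stream_tweets, args=(server, tweets, delay), daemon=True
    )
    thread.start()
    return thread
```

One way to use it is to call `start_in_background(open_server(), SAMPLE_TWEETS)` in one cell and then start the Spark query in the next, so the socket is already listening when Spark connects. The stream ends once the last tweet has been sent.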

## 🧪 Tasks
1. **Set Up Streaming Source**: Use SparkSession to read streaming data from a socket.
2. **Preprocess Tweets**: Remove punctuation, convert the text to lowercase, and tokenize it.
3. **Perform Sentiment Analysis**: Use a simple keyword-based approach to classify tweets as Positive, Negative, or Neutral.
4. **Display Results**: Show tweet text and its sentiment classification in the console.
5. **Save Results**: Write the output to a CSV file or an in-memory sink.
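The preprocessing in Task 2 can be prototyped in plain Python before it is wired into the stream. This is a rough sketch, not a prescribed solution: `preprocess` is a hypothetical helper name, and treating every character outside `\w` and whitespace as punctuation is an assumption you may want to change (for example, it keeps underscores and drops the `#` from hashtags).

```python
import re

# Matches any character that is not a letter, digit, underscore, or whitespace.
_PUNCTUATION = re.compile(r"[^\w\s]")


def preprocess(text):
    """Lowercase a tweet, replace punctuation with spaces, and split into tokens."""
    if text is None:
        # Streaming rows can be null; return no tokens instead of raising.
        return []
    cleaned = _PUNCTUATION.sub(" ", text.lower())
    return cleaned.split()
```

A function like this could then be wrapped with `udf` and an `ArrayType(StringType())` return type, or called from inside the sentiment function, so that keyword matching in Task 3 works on whole tokens.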

## 📚 Learning Outcomes
- Understand Spark Structured Streaming.
- Apply UDFs in streaming pipelines.
- Perform basic NLP tasks in real time.
- Handle socket-based data ingestion.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Start Spark session
spark = SparkSession.builder.appName("TweetSentimentStreaming").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read streaming data from socket (start the source first, e.g. `nc -lk 9999`)
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

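# Define sentiment analysis function
def get_sentiment(text):
    positive = {"good", "great", "happy", "love", "excellent"}
    negative = {"bad", "sad", "hate", "terrible", "poor"}
    # Streaming rows can be null; treat them as Neutral instead of raising
    if text is None:
        return "Neutral"
    # Lowercase, replace punctuation with spaces, then split into tokens
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    tokens = cleaned.split()
    # Count whole-word matches so "bad" does not match inside "badge"
    pos_count = sum(token in positive for token in tokens)
    neg_count = sum(token in negative for token in tokens)
    if pos_count > neg_count:
        return "Positive"
    elif neg_count > pos_count:
        return "Negative"
    else:
        return "Neutral"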

# Wrap the function as a Spark UDF that returns a string
sentiment_udf = udf(get_sentiment, StringType())

# Apply UDF to streaming data
sentiment_df = lines.withColumn("sentiment", sentiment_udf(col("value")))

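# Output to console; truncate=False prints the full tweet text instead of cutting it at 20 characters
query = (
    sentiment_df.writeStream.outputMode("append")
    .format("console")
    .option("truncate", False)
    .start()
)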
query.awaitTermination()