# Sentiment Analysis on Amazon Reviews/Divvy Trips Data

This notebook walks through sentiment analysis on one of two datasets:
1. The Amazon reviews dataset (as specified in the assignment)
2. Divvy Trips data (the Divvy_Trips_2020_Q1 file), adapted so it can be run through the same pipeline

Let's start by setting up our environment and exploring the data.

## 1. Setup and Dependencies

In [None]:
# Install required packages (pyspark is needed for the Spark cells below)
!pip install pyspark transformers torch scikit-learn matplotlib seaborn nltk pandas

## 2. Data Preparation

### Option 1: If you're using the Amazon reviews dataset

In [None]:
# Initialize Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AmazonReviewsSentimentAnalysis") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Define the path to your Amazon reviews dataset
amazon_data_path = "path/to/amazon_reviews.csv"

# Load data
amazon_df = spark.read.csv(amazon_data_path, header=True, inferSchema=True)

# Show the schema
amazon_df.printSchema()

# Show a sample of the data
amazon_df.show(5)

### Option 2: If you're using the Divvy Trips dataset

In [None]:
# Run the adapter script to convert Divvy Trips data
!python divvy_adapter.py --input "path/to/Divvy_Trips_2020_Q1.csv" --output "adapted_divvy_data.csv" --sample 10000

# Initialize Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DivvyTripsSentimentAnalysis") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Load the adapted Divvy data
divvy_df = spark.read.csv("adapted_divvy_data.csv", header=True, inferSchema=True)

# Show the schema
divvy_df.printSchema()

# Show a sample of the data
divvy_df.show(5)

## 3. Task 1: Sentiment Analysis

Let's apply sentiment analysis to the review texts using the pretrained Hugging Face pipeline.

In [None]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
from transformers import pipeline

# Choose the dataframe that matches your dataset
# df = amazon_df  # Use this line instead if you loaded the Amazon reviews
df = divvy_df     # Default: Divvy trips (comment out if using Amazon reviews)

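# Initialize the sentiment analysis pipeline (uses the library's default model)
sentiment_analyzer = pipeline("sentiment-analysis")

# Define UDF for sentiment analysis.
# Note: Spark serializes sentiment_analyzer together with the UDF and ships it to the workers.
@udf(StringType())
def analyze_sentiment(text):
    try:
        # Missing or empty text cannot be scored, so label it NEUTRAL
        if text is None or len(str(text).strip()) == 0:
            return "NEUTRAL"
        # truncation=True cuts inputs that exceed the model's maximum length
        result = sentiment_analyzer(str(text), truncation=True)
        return "POSITIVE" if result[0]['label'] == 'POSITIVE' else "NEGATIVE"
    except Exception as e:
        # This runs on a worker, so the message may land in the worker logs
        print(f"Error analyzing sentiment for text: {e}")
        return "NEUTRAL"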

# Apply sentiment analysis to the review text
result_df = df.withColumn("predicted_sentiment", analyze_sentiment(col("reviewText")))

# Show results
result_df.select("reviewText", "predicted_sentiment").show(10, truncate=50)
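
The fallback behaviour of the UDF above (NEUTRAL for empty text or a failed prediction) can be checked without Spark or a model download. The sketch below is a plain-Python illustration, not part of the pipeline: `label_text` and `stub_classifier` are hypothetical names, and the stub stands in for the real Hugging Face model with a simple keyword lookup.

```python
from typing import Callable, Dict, List, Optional


def label_text(text: Optional[str],
               classify: Callable[[str], List[Dict[str, str]]]) -> str:
    """Return POSITIVE, NEGATIVE, or NEUTRAL for one review text."""
    # Missing or whitespace-only text cannot be scored
    if text is None or not str(text).strip():
        return "NEUTRAL"
    try:
        result = classify(str(text))
        return "POSITIVE" if result[0]["label"] == "POSITIVE" else "NEGATIVE"
    except Exception:
        # The classifier failed on this row; fall back instead of crashing
        return "NEUTRAL"


def stub_classifier(text: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the real model: keyword lookup only."""
    if "crash" in text:
        raise RuntimeError("simulated model failure")
    label = "POSITIVE" if "great" in text.lower() else "NEGATIVE"
    return [{"label": label}]


samples = ["Great bike, smooth ride", "Broken seat", "   ", None, "crash"]
labels = [label_text(sample, stub_classifier) for sample in samples]
print(labels)
```

Passing the classifier in as an argument keeps the fallback logic testable on its own; in the notebook the same role is played by `sentiment_analyzer`.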

## 4. Task 2: Model Evaluation

Let's evaluate the sentiment analysis model by comparing predicted sentiment with true sentiment derived from ratings.

In [None]:
from pyspark.sql.functions import when
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns

# Convert ratings to true sentiment labels (POSITIVE if rating >= 3.0, else NEGATIVE).
# Note: 3-star reviews count as POSITIVE, and rows with a missing rating fall into NEGATIVE.
df_with_true = result_df.withColumn(
    "true_sentiment", 
    when(col("overall") >= 3.0, "POSITIVE").otherwise("NEGATIVE")
)

# Convert to Pandas for easier analysis
pd_df = df_with_true.select("predicted_sentiment", "true_sentiment").toPandas()

# Filter out any NEUTRAL predictions for evaluation purposes
pd_df = pd_df[pd_df['predicted_sentiment'] != "NEUTRAL"]

# Create confusion matrix
cm = confusion_matrix(
    pd_df['true_sentiment'], 
    pd_df['predicted_sentiment'],
    labels=["POSITIVE", "NEGATIVE"]
)

# Calculate precision and recall
precision = precision_score(
    pd_df['true_sentiment'], 
    pd_df['predicted_sentiment'],
    pos_label="POSITIVE"
)

recall = recall_score(
    pd_df['true_sentiment'], 
    pd_df['predicted_sentiment'],
    pos_label="POSITIVE"
)

# Display results
print(f"Confusion Matrix:\n{cm}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=["Predicted POSITIVE", "Predicted NEGATIVE"],
    yticklabels=["Actual POSITIVE", "Actual NEGATIVE"]
)
plt.title('Sentiment Analysis Confusion Matrix')
plt.tight_layout()
plt.show()
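
To make the precision and recall figures concrete, here is a minimal plain-Python sketch of the same calculation on a handful of made-up rows. The `rows` data and the `true_label` helper are illustrative assumptions; only the `>= 3.0` threshold and the POSITIVE/NEGATIVE labels come from the cell above.

```python
# Hypothetical (rating, predicted_sentiment) pairs, invented for illustration
rows = [
    (5.0, "POSITIVE"),
    (4.0, "POSITIVE"),
    (3.0, "NEGATIVE"),
    (2.0, "POSITIVE"),
    (1.0, "NEGATIVE"),
    (1.0, "NEGATIVE"),
]


def true_label(rating: float, threshold: float = 3.0) -> str:
    """Derive the reference label from a star rating."""
    return "POSITIVE" if rating >= threshold else "NEGATIVE"


tp = fn = fp = tn = 0
for rating, predicted in rows:
    actual = true_label(rating)
    if actual == "POSITIVE" and predicted == "POSITIVE":
        tp += 1  # correctly predicted positive
    elif actual == "POSITIVE" and predicted == "NEGATIVE":
        fn += 1  # missed positive
    elif actual == "NEGATIVE" and predicted == "POSITIVE":
        fp += 1  # false alarm
    else:
        tn += 1  # correctly predicted negative

# Same layout as the matrix above: rows are actual, columns are predicted,
# both ordered POSITIVE then NEGATIVE
matrix = [[tp, fn], [fp, tn]]

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp) if (tp + fp) else 0.0
# Recall: share of true positives that the model found
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(matrix)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
```

Changing `threshold` to 4.0 moves 3-star reviews into the NEGATIVE class, which is a quick way to see how sensitive the metrics are to that choice.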

## 5. Additional Analysis: Distribution of Sentiments

In [None]:
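# Collect only the two label columns for visualization (NEUTRAL predictions are kept here).
# Note: this re-runs the sentiment UDF; cache df_with_true first if that is too slow.
full_pd_df = df_with_true.select("true_sentiment", "predicted_sentiment").toPandas()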

# Plot the distribution of true vs predicted sentiment
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.countplot(data=full_pd_df, x='true_sentiment')
plt.title('Distribution of True Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.countplot(data=full_pd_df, x='predicted_sentiment')
plt.title('Distribution of Predicted Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')

plt.tight_layout()
plt.show()
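
The two plots can also be compared numerically. The sketch below, in plain Python with made-up labels, computes the share of each class; `label_shares` is a hypothetical helper. It also shows that the predicted side can contain a third class, NEUTRAL, which the true labels never do.

```python
from collections import Counter
from typing import Dict, List

# Made-up labels for illustration only
true_labels = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]
predicted_labels = ["POSITIVE", "NEGATIVE", "NEUTRAL", "NEGATIVE"]


def label_shares(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of rows that carry each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


true_shares = label_shares(true_labels)
predicted_shares = label_shares(predicted_labels)

print("True:", true_shares)
print("Predicted:", predicted_shares)
```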

## 6. Run the Complete Pipeline

Let's run the full pipeline using the main script.

In [None]:
# For Amazon reviews dataset
# !python main.py --input "path/to/amazon_reviews.csv" --output "output"

# For adapted Divvy trips dataset
!python main.py --input "adapted_divvy_data.csv" --output "output"

## 7. Examine the Results

Let's look at the outputs generated by the pipeline.

In [None]:
# Display confusion matrix
confusion_df = pd.read_csv("output/confusion_matrix.csv", index_col=0)
print("Confusion Matrix:")
print(confusion_df)
print("\n")

# Display metrics
with open("output/metrics.txt", "r") as f:
    metrics = f.read()
print("Metrics:")
print(metrics)

## 8. Conclusion

In this notebook, we've completed both tasks from the assignment:

1. We used a pretrained sentiment analysis pipeline to classify review texts as POSITIVE or NEGATIVE
2. We evaluated the model by comparing the predicted sentiment with true sentiment labels derived from ratings

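The approach follows a map-reduce pattern: Spark applies the sentiment model to each review independently (the map step), and the predictions are then aggregated into a confusion matrix, precision, and recall (the reduce step). Errors are handled per row: empty texts and failed predictions are labeled NEUTRAL and excluded from the evaluation, and each failure is printed with its error message.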
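
As a rough illustration of the map-reduce idea, the plain-Python sketch below counts (true, predicted) label pairs per partition and then merges the partial counts. It is not the code in `main.py`, whose internals are not shown here; the `partitions` data and the function names are assumptions made for the example.

```python
from collections import Counter
from functools import reduce
from typing import List, Tuple

Pair = Tuple[str, str]  # (true_sentiment, predicted_sentiment)

# Made-up data split into two partitions, as a cluster might hold it
partitions: List[List[Pair]] = [
    [("POSITIVE", "POSITIVE"), ("NEGATIVE", "POSITIVE")],
    [("POSITIVE", "POSITIVE"), ("NEGATIVE", "NEGATIVE"), ("POSITIVE", "NEGATIVE")],
]


def map_partition(pairs: List[Pair]) -> Counter:
    """Map step: count label pairs within a single partition."""
    return Counter(pairs)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: add two partial count tables together."""
    return left + right


partial_counts = [map_partition(pairs) for pairs in partitions]
totals = reduce(merge_counts, partial_counts, Counter())

for pair, count in sorted(totals.items()):
    print(pair, count)
```

Because each partition is counted independently and the merge is a simple addition, the partial results can be combined in any order, which is what makes the aggregation easy to distribute.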