<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/codeAlongs/WRIT20833_VADER_Sentiment_Analysis_F25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with VADER
## From Words to Emotions in Cultural Data

Welcome to sentiment analysis! Today we'll learn to analyze the **emotional tone** of cultural texts using VADER, a tool designed for social media and informal language.

### 🎯 What You'll Learn:
- **Install and use VADER** for sentiment analysis
- **Apply sentiment analysis** to your scraped cultural data
- **Interpret and visualize** emotional patterns in texts
- **Think critically** about automated emotion detection

### 🔗 Connection to Your Work:
This prepares you for **HW4-1**, where you'll analyze term frequency AND sentiment in your own dataset.

## Part 1: From Word Counting to Emotion Analysis

### Quick HW1 Refresher
In HW1, you counted frequent words and formed **first impressions** about different text types:
- Political documents had words like "shall," "constitution," "rights"
- Novels had character names and descriptive language
- You made predictions, then tested them by counting

### Today's Evolution
**Same prediction-testing process, new question:**
- HW1: "What words appear most often?"
- Today: "What emotions do these words express?"

Let's start with examples from cultural data like **YouTube comments** and **reviews**.

In [None]:
# Install and import VADER
!pip install vaderSentiment  # Download and install the VADER package

# Import the sentiment analysis tool
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create our sentiment analyzer object
analyzer = SentimentIntensityAnalyzer()

print("✅ VADER installed and ready!")

In [None]:
# Cultural text examples - predict the sentiment first!
cultural_texts = [
    "This museum exhibition was absolutely AMAZING!!!",
    "worst movie ever made, complete waste of time",
    "The book was okay, nothing special but not terrible either",
    "I love love LOVE this artist's work! So inspiring ❤️",
    "The concert was good but the sound quality sucked"
]

print("🤔 Before we run VADER, predict:")
print("Which texts are positive? Negative? Neutral?")
print("Which will have the STRONGEST emotion?")
print("\nNow let's test your predictions...")

In [ ]:
# Test VADER on our cultural examples
print("VADER Sentiment Analysis Results:")
print("=" * 50)

# Loop through each text example
for i, text in enumerate(cultural_texts, 1):  # enumerate gives us numbers starting at 1
    # Analyze the sentiment of this text
    scores = analyzer.polarity_scores(text)

    # Extract the compound score (our main number)
    compound = scores['compound']  # This ranges from -1 to 1

    # Print the results
    print(f"{i}. Text: {text}")
    print(f"   Compound Score: {compound:.3f}")  # .3f means 3 decimal places
    print(f"   Breakdown: {scores}")  # Shows all four scores
    print()  # Empty line for spacing
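
The compound score is a number between -1 and 1, so it helps to turn it into a coarse label before comparing it with your predictions. The sketch below uses the cutoffs of ±0.05 suggested in the VADER project documentation. The helper name `label_sentiment` and the sample scores are made up for illustration and are not VADER output.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores for illustration only, NOT real VADER output
for score in (0.84, -0.62, 0.02):
    print(f"{score:+.2f} -> {label_sentiment(score)}")
```

The threshold is a convention, not a law: for your own dataset you may decide that a wider neutral band makes more sense.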

In [None]:
# Install TextBlob for comparison
!pip install textblob  # Download TextBlob package
from textblob import TextBlob  # Import TextBlob tool

# Compare on a tricky example
test_text = "This movie is SO good!!! I can't even 😍"

# VADER analysis
vader_score = analyzer.polarity_scores(test_text)['compound']  # Get compound score

# TextBlob analysis
blob = TextBlob(test_text)  # Create TextBlob object
textblob_score = blob.sentiment.polarity  # Get polarity score

# Print comparison
print(f"Text: {test_text}")
print(f"VADER score: {vader_score:.3f}")     # VADER result
print(f"TextBlob score: {textblob_score:.3f}")  # TextBlob result
print("\n💡 VADER is better with informal language, caps, and emoticons!")

In [ ]:
# Create sample cultural data (like what you scraped)
import pandas as pd  # Needed for the DataFrame below

sample_reviews = {
    # List of Broadway show titles
    'title': ['Hamilton', 'Cats', 'The Lion King', 'Phantom of the Opera', 'Chicago',
              'Wicked', 'Les Misérables', 'Mamma Mia!', 'Book of Mormon', 'Dear Evan Hansen'],

    # Review text - this is what we'll analyze for sentiment
    'review_text': [
        "Absolutely brilliant musical! The hip-hop history was incredible and the cast was amazing.",
        "I really didn't understand what was happening. Weird costumes and confusing plot.",
        "Beautiful production with stunning visuals. The Circle of Life scene gave me chills!",
        "Classic for a reason. The phantom's voice was haunting and the staging was magnificent.",
        "Great dancing and catchy songs. Entertaining but not life-changing.",
        "Mind-blowing! Defying Gravity made me cry. Elphaba was perfect.",
        "Long but worth it. The barricade scene was emotionally devastating in the best way.",
        "Fun and energetic! ABBA songs are so catchy, left feeling happy and uplifted.",
        "Hilarious and inappropriate. Not for everyone but I laughed until my sides hurt.",
        "Heartbreaking and beautiful. Dealt with mental health in such a thoughtful way."
    ],

    # Numeric ratings (1-5 stars)
    'rating': [5, 2, 4, 5, 3, 5, 4, 4, 4, 5],

    # Show categories
    'category': ['Historical', 'Fantasy', 'Family', 'Classic', 'Musical',
                 'Fantasy', 'Historical', 'Musical', 'Comedy', 'Contemporary']
}

# Convert dictionary to DataFrame (like loading your CSV)
df = pd.DataFrame(sample_reviews)

print("🎭 Sample Broadway Reviews Dataset Loaded")
print(f"Dataset shape: {df.shape}")  # Shows (rows, columns)
df.head()  # Display first 5 rows

In [ ]:
# Simple text cleaning (less aggressive than HW1 word counting)
def clean_for_sentiment(text):
    """Clean text for sentiment analysis"""
    if pd.isna(text):  # Check if text is missing/NaN
        return ""       # Return empty string if missing
    # Keep punctuation and capitalization - VADER needs them!
    return str(text).strip()  # Convert to string and remove extra spaces

# Apply cleaning function to the review text column
df['clean_text'] = df['review_text'].apply(clean_for_sentiment)

print("✅ Text cleaning complete - kept punctuation for VADER")
print("\nSample cleaned text:")
print(df['clean_text'].iloc[0])  # Show first cleaned text
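
To see why the cleaning here is deliberately light, it can help to compare it with the kind of cleaning often used for word counting. The sketch below assumes a lowercase-and-strip-punctuation approach for counting; `clean_for_counting` and `light_clean` are hypothetical names and may not match what you actually wrote in HW1.

```python
import string


def clean_for_counting(text):
    """Aggressive cleaning (assumed HW1-style): lowercase and drop punctuation."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().strip()


def light_clean(text):
    """Light cleaning for sentiment: only trim surrounding whitespace."""
    return str(text).strip()


raw = "  This museum exhibition was absolutely AMAZING!!!  "
print(repr(clean_for_counting(raw)))  # caps and exclamation marks are gone
print(repr(light_clean(raw)))         # caps and exclamation marks survive
```

The aggressive version throws away exactly the cues (ALL CAPS, repeated exclamation marks) that VADER is designed to pick up.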

In [None]:
# Analyze one review first
sample_review = df['clean_text'].iloc[0]  # Get the first review
sample_scores = analyzer.polarity_scores(sample_review)  # Analyze it

print("Individual Review Analysis:")
print(f"Review: {sample_review}")
print(f"Compound score: {sample_scores['compound']:.3f}")  # Main score
print(f"Full breakdown: {sample_scores}")  # All four scores
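
The full breakdown is a dictionary with four keys: `neg`, `neu`, `pos`, and `compound`. According to the VADER documentation, the first three are proportions of the text that should add up to roughly 1, while `compound` is the single summary number. The sketch below shows one way to read such a dictionary; the helper `describe_scores` and the example values are hypothetical and were not produced by VADER.

```python
def describe_scores(scores):
    """Summarize a VADER-style score dictionary (hypothetical helper)."""
    proportions = {key: scores[key] for key in ("neg", "neu", "pos")}
    dominant = max(proportions, key=proportions.get)  # largest share of the text
    return {
        "dominant": dominant,
        "proportion_total": round(sum(proportions.values()), 3),
        "compound": scores["compound"],
    }


# Hand-written example values, NOT real VADER output
example = {"neg": 0.0, "neu": 0.55, "pos": 0.45, "compound": 0.9}
print(describe_scores(example))
```

Notice that a text can be mostly neutral words (`neu` is the largest share) and still have a strongly positive `compound` score.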

### Individual Analysis → Batch Processing

With one review analyzed, let's scale up to the whole dataset:

In [None]:
# Now process the entire dataset (batch processing)
def get_sentiment_score(text):
    """Get compound sentiment score for a text"""
    scores = analyzer.polarity_scores(text)
    return scores['compound']

# Apply to entire dataset
# .apply() runs our function on every row in the column
df['sentiment_score'] = df['clean_text'].apply(get_sentiment_score)

print("✅ Sentiment analysis complete for entire dataset!")
print("\nFirst few results:")
# Show title, sentiment score, and original rating for comparison
print(df[['title', 'sentiment_score', 'rating']].head())
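
Once every row has a score, a natural next step is to summarize by group, for example the average sentiment per show category. The sketch below does this in plain Python so the logic is visible; `mean_by_group` is a hypothetical helper and the scores are placeholders, not VADER results.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(groups, scores):
    """Average the scores that share the same group label."""
    buckets = defaultdict(list)
    for group, score in zip(groups, scores):
        buckets[group].append(score)
    return {group: round(mean(values), 3) for group, values in buckets.items()}


categories = ["Historical", "Fantasy", "Family", "Fantasy", "Historical"]
placeholder_scores = [0.9, -0.4, 0.8, 0.6, 0.3]  # made up, NOT VADER output

print(mean_by_group(categories, placeholder_scores))
```

In the notebook itself, `df.groupby('category')['sentiment_score'].mean()` should give the same kind of summary directly from the DataFrame.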

## Part 4: Interpreting and Visualizing Results

Let's explore what our sentiment analysis reveals:

In [None]:
# Create visualizations
import matplotlib.pyplot as plt  # Plotting library used below

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Sentiment distribution
axes[0].hist(df['sentiment_score'], bins=8, color='skyblue', alpha=0.7)
axes[0].set_title('Distribution of Sentiment Scores')
axes[0].set_xlabel('Sentiment Score (-1 to 1)')
axes[0].set_ylabel('Number of Reviews')
axes[0].axvline(0, color='red', linestyle='--', alpha=0.5, label='Neutral')
axes[0].legend()

# Sentiment vs. Rating scatter plot
axes[1].scatter(df['rating'], df['sentiment_score'], alpha=0.7, color='coral')
axes[1].set_title('Sentiment Score vs. Numeric Rating')
axes[1].set_xlabel('Star Rating (1-5)')
axes[1].set_ylabel('Sentiment Score')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 What patterns do you notice?")
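
The scatter plot invites a question: do higher star ratings go along with higher sentiment scores? One rough way to put a number on that is the Pearson correlation coefficient. The sketch below implements it by hand so each step is visible; the function name `pearson` and the scores are illustrative placeholders, not VADER output.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables move together, and how much each varies on its own
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_x * var_y)


ratings = [5, 2, 4, 5, 3]
placeholder_scores = [0.9, -0.5, 0.7, 0.8, 0.2]  # made up, NOT VADER output

print(f"r = {pearson(ratings, placeholder_scores):.2f}")
```

With the real DataFrame, `df['rating'].corr(df['sentiment_score'])` computes the same statistic. With only ten reviews, treat any value as a hint rather than a finding.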

## Part 5: Critical Analysis - When VADER Gets It Wrong

Let's examine where automated sentiment analysis might struggle:

In [None]:
# Test VADER on tricky cultural examples
tricky_texts = [
    "This show was so bad it was good",  # Irony
    "The ending was beautifully tragic",  # Mixed emotions
    "I literally died laughing",  # Hyperbole
    "It was... fine",  # Subtle negativity
    "Not the worst thing I've ever seen"  # Negated negative (litotes)
]

print("🤔 CHALLENGING CASES FOR SENTIMENT ANALYSIS")
print("=" * 50)

for text in tricky_texts:
    score = analyzer.polarity_scores(text)['compound']
    print(f"Text: '{text}'")
    print(f"VADER score: {score:.3f}")
    print("Your human judgment: _____")  # Students fill this in
    print()
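
Once you have written down your own judgments, you can list the cases where you and the tool disagree. The sketch below is one possible way to do that: it converts each compound score to a label with the conventional ±0.05 cutoffs and flags mismatches. The helper names, the records, and their scores are all placeholders, so swap in your real texts, VADER scores, and judgments.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def find_disagreements(records):
    """Return the records whose human label differs from the score-based label."""
    flagged = []
    for record in records:
        machine = label_sentiment(record["compound"])
        if machine != record["human"]:
            flagged.append({**record, "machine": machine})
    return flagged


# Placeholder records: the scores are made up, NOT real VADER output
records = [
    {"text": "example A", "compound": 0.60, "human": "positive"},
    {"text": "example B", "compound": 0.45, "human": "negative"},
    {"text": "example C", "compound": -0.30, "human": "neutral"},
]

for item in find_disagreements(records):
    print(f"{item['text']}: human={item['human']}, machine={item['machine']}")
```

The flagged rows are good candidates for your reflection notes: each one is a concrete case where the algorithm and a human reader part ways.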

### 💭 Notes for Final Reflection:

*Space for your thoughts during class - you'll use these for HW4-1:*

**Where VADER worked well:**
- 
- 

**Where VADER struggled:**
- 
- 

**Questions this raises about cultural texts:**
- 
- 

**Predictions for my own dataset:**
- 
- 

## Part 6: Preparing for HW4-1

You now have the skills for HW4-1! Let's review the complete workflow:

In [None]:
# Complete workflow summary
print("🎯 HW4-1 WORKFLOW CHECKLIST")
print("=" * 30)
print("✅ 1. Load your scraped CSV data")
print("✅ 2. Clean text (but keep punctuation for VADER)")
print("✅ 3. Make predictions about sentiment patterns")
print("✅ 4. Apply VADER to your dataset")
print("✅ 5. Create visualizations of sentiment patterns")
print("✅ 6. Compare predictions to actual results")
print("✅ 7. Analyze where VADER works/fails with your texts")
print("✅ 8. Reflect on insights and limitations")

print("\n🚀 You're ready for HW4-1!")
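
The load, clean, and score steps of the checklist can be sketched as a single function. This is a minimal outline under several assumptions: the column name `review_text`, the function `run_workflow`, and the stand-in scorer `toy_score` are all hypothetical, and the scorer is passed in as a parameter so the sketch runs even without VADER installed.

```python
import csv
import io


def run_workflow(csv_file, score_fn, text_column="review_text"):
    """Load rows from a CSV file object, lightly clean the text, and score it."""
    results = []
    for row in csv.DictReader(csv_file):
        # Light cleaning only: keep caps and punctuation for the scorer
        text = (row.get(text_column) or "").strip()
        results.append({"text": text, "compound": score_fn(text)})
    return results


def toy_score(text):
    """Stand-in scorer for illustration only; it is NOT how VADER works."""
    return 0.5 if "!" in text else 0.0


# In-memory CSV standing in for your scraped file
sample_csv = io.StringIO(
    "title,review_text\n"
    "Hamilton,Absolutely brilliant musical!\n"
    "Chicago,Great dancing and catchy songs.\n"
)

for result in run_workflow(sample_csv, toy_score):
    print(result)
```

In the notebook you would pass a real scorer instead, for example `lambda text: analyzer.polarity_scores(text)['compound']`, and open your own CSV file in place of the in-memory sample.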

### Looking Ahead: Topic Modeling Preview

**After HW4-1, you'll move to HW4-2** where you'll discover **hidden topics** in your text using machine learning. 

**Quick preview**: While sentiment analysis asks "What emotions?", topic modeling asks "What themes and subjects are hiding in this collection?"

The text preprocessing you're learning now will prepare you for that next step!

## Summary: From Words to Emotions

Today you learned to:

**Technical Skills:**
- ✅ Install and use VADER sentiment analysis
- ✅ Process individual texts and entire datasets
- ✅ Interpret compound sentiment scores
- ✅ Create meaningful visualizations of emotional patterns

**Critical Thinking:**
- ✅ Compare different sentiment analysis tools
- ✅ Recognize where automated analysis succeeds and fails
- ✅ Question the objectivity of algorithmic emotion detection
- ✅ Connect computational analysis to cultural research questions

**Research Skills:**
- ✅ Form predictions and test them systematically
- ✅ Scale analysis from individual examples to entire datasets
- ✅ Document insights for deeper reflection
- ✅ Prepare text data for multiple types of analysis

### 🎯 You're Ready for HW4-1!

Apply these skills to your own scraped cultural dataset and discover what emotional patterns emerge from your data. Remember: being surprised by your results is a sign of genuine learning, not failure!

**Next**: Use your scraped data to complete the term frequency and sentiment analysis assignment, then get ready for topic modeling in HW4-2.