# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy to read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [None]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [None]:
# Add any additional import statements you need here
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path

# Set style for plots
plt.style.use('default')
sns.set_palette('husl')


In [None]:
# change `data_location` to the location of the folder on your machine.
data_location = "M1 Results/"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

positive_words_file = "6.1/positive-words.txt"
negative_words_file = "6.1/negative-words.txt"
tidy_text_file = "6.1/tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 
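
As a minimal sketch of the two layouts mentioned above, the snippet below builds a small artist → song → lyrics dictionary and flattens it into one record per song, which is the shape `pd.DataFrame(records)` would accept. The artist names, song titles, and lyrics are placeholders, not data from the assignment.

```python
# Hypothetical placeholder data: artist -> song title -> full lyrics text.
lyrics_by_artist = {
    "artist_a": {"song_one": "first line\nsecond line", "song_two": "la la la"},
    "artist_b": {"song_three": "another lyric"},
}

# Look up a single song through both key levels.
print(lyrics_by_artist["artist_a"]["song_two"])

# Flatten to one record per song; pd.DataFrame(records) would turn this
# into a table with artist, song, and lyrics columns.
records = [
    {"artist": artist, "song": song, "lyrics": text}
    for artist, songs in lyrics_by_artist.items()
    for song, text in songs.items()
]
print(len(records))
```

The nested dictionary makes it easy to print one song by title, while the flat records are more convenient for grouping and plotting.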

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [None]:
# Read in the lyrics data
lyrics_data = {}
lyrics_path = Path(data_location) / lyrics_folder

# Get all artist directories
for artist_dir in lyrics_path.iterdir():
    if artist_dir.is_dir():
        artist_name = artist_dir.name
        lyrics_data[artist_name] = {}
        
        # Read all song files for this artist
        for song_file in artist_dir.glob('*.txt'):
            song_name = song_file.stem.replace(f'{artist_name}_', '')
            try:
                with open(song_file, 'r', encoding='utf-8') as f:
                    lyrics_data[artist_name][song_name] = f.read()
            except UnicodeDecodeError:
                # Try with different encoding if utf-8 fails
                with open(song_file, 'r', encoding='latin-1') as f:
                    lyrics_data[artist_name][song_name] = f.read()

print(f"Loaded lyrics for {len(lyrics_data)} artists:")
for artist, songs in lyrics_data.items():
    print(f"  {artist}: {len(songs)} songs")

In [None]:
# Read in the twitter data
twitter_data = {}
twitter_path = Path(data_location) / twitter_folder

# Read Twitter follower data files
for twitter_file in twitter_path.glob('*_followers_data.txt'):
    # Extract artist name from filename
    artist_name = twitter_file.stem.replace('_followers_data', '')
    twitter_data[artist_name] = []
    
    print(f"Reading Twitter data for {artist_name}...")
    
    # Read the file line by line (it's a large file)
    with open(twitter_file, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0 and i > 0:
                print(f"  Processed {i} lines...")
            
            try:
                # Each line should be a JSON object
                user_data = json.loads(line.strip())
                # Extract description if it exists
                if 'description' in user_data and user_data['description']:
                    twitter_data[artist_name].append(user_data['description'])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip malformed lines, including valid JSON that is not an object
                continue
            
            # Limit to first 50000 descriptions for performance
            if len(twitter_data[artist_name]) >= 50000:
                break

print(f"\nLoaded Twitter descriptions:")
for artist, descriptions in twitter_data.items():
    print(f"  {artist}: {len(descriptions)} descriptions")

In [None]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a 
# dictionary for this.

sentiment_lexicon = {}

# Read positive words
with open(positive_words_file, 'r', encoding='utf-8') as f:
    for line in f:
        word = line.strip()
        # Skip comments and empty lines
        if word and not word.startswith(';'):
            sentiment_lexicon[word.lower()] = 1

# Read negative words
with open(negative_words_file, 'r', encoding='utf-8') as f:
    for line in f:
        word = line.strip()
        # Skip comments and empty lines
        if word and not word.startswith(';'):
            sentiment_lexicon[word.lower()] = -1

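# Read tidytext sentiments. keep_default_na=False keeps words such as
# "null" or "nan" as strings instead of converting them to NaN.
tidy_df = pd.read_csv(tidy_text_file, sep='\t', keep_default_na=False)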
for _, row in tidy_df.iterrows():
    word = row['word'].lower()
    sentiment = row['sentiment']
    
    # Convert sentiment to numeric score
    if sentiment == 'positive':
        sentiment_lexicon[word] = 1
    elif sentiment == 'negative':
        sentiment_lexicon[word] = -1

print(f"Loaded sentiment lexicon with {len(sentiment_lexicon)} words")
positive_count = sum(1 for score in sentiment_lexicon.values() if score > 0)
negative_count = sum(1 for score in sentiment_lexicon.values() if score < 0)
print(f"  Positive words: {positive_count}")
print(f"  Negative words: {negative_count}")

## Sentiment Analysis on Songs

In this section, score the sentiment for all the songs for both artists in your data set. Score the sentiment by manually calculating the sentiment using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.


In [None]:
# Function to clean and tokenize text
def clean_and_tokenize(text):
    """Clean text and return list of tokens"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and split into words
    words = re.findall(r'\b[a-zA-Z]+\b', text)
    # Remove stopwords
    words = [word for word in words if word not in sw]
    return words

# Function to calculate sentiment score
def calculate_sentiment(text):
    """Calculate sentiment score for a text"""
    words = clean_and_tokenize(text)
    total_score = 0
    word_count = 0
    
    for word in words:
        if word in sentiment_lexicon:
            total_score += sentiment_lexicon[word]
            word_count += 1
    
    # Return average sentiment score
    return total_score / word_count if word_count > 0 else 0

# Calculate sentiment for all songs
song_sentiments = {}
for artist, songs in lyrics_data.items():
    song_sentiments[artist] = {}
    for song_name, lyrics in songs.items():
        sentiment_score = calculate_sentiment(lyrics)
        song_sentiments[artist][song_name] = sentiment_score

# Calculate average sentiment per artist
artist_avg_sentiments = {}
for artist, songs in song_sentiments.items():
    scores = list(songs.values())
    artist_avg_sentiments[artist] = np.mean(scores) if scores else 0

print(f"\nAverage sentiment by artist:")
for artist in artist_avg_sentiments:
    scores = list(song_sentiments[artist].values())
    print(f"{artist}: Average sentiment = {artist_avg_sentiments[artist]:.4f} ({len(scores)} songs)")

In [None]:
# Find highest and lowest sentiment songs for each artist
def get_top_bottom_songs(artist_songs, n=3):
    """Get top n highest and lowest sentiment songs"""
    sorted_songs = sorted(artist_songs.items(), key=lambda x: x[1], reverse=True)
    highest = sorted_songs[:n]
    lowest = sorted_songs[-n:][::-1]  # reversed so the most negative song comes first
    return highest, lowest

# Analyze each artist
artists = list(song_sentiments.keys())
print("=" * 60)
print("SENTIMENT ANALYSIS RESULTS")
print("=" * 60)

for artist in artists:
    print(f"\n{artist.upper()}:")
    print(f"Average sentiment: {artist_avg_sentiments[artist]:.4f}")
    
    highest, lowest = get_top_bottom_songs(song_sentiments[artist])
    
    print(f"\nTop 3 most positive songs:")
    for song, score in highest:
        print(f"  {song}: {score:.4f}")
    
    print(f"\nTop 3 most negative songs:")
    for song, score in lowest:
        print(f"  {song}: {score:.4f}")
    
    print("-" * 40)

In [None]:
# Show lyrics for highest and lowest sentiment songs
def show_song_lyrics(artist, song_name, sentiment_score):
    """Display lyrics for a song with its sentiment score"""
    print(f"\n{'='*50}")
    print(f"SONG: {song_name} by {artist}")
    print(f"Sentiment Score: {sentiment_score:.4f}")
    print(f"{'='*50}")
    print(lyrics_data[artist][song_name])
    print(f"{'='*50}")

# Show lyrics for extreme sentiment songs
for artist in artists:
    highest, lowest = get_top_bottom_songs(song_sentiments[artist])
    
    print(f"\n\n*** HIGHEST SENTIMENT SONGS FOR {artist.upper()} ***")
    for song, score in highest:
        show_song_lyrics(artist, song, score)
    
    print(f"\n\n*** LOWEST SENTIMENT SONGS FOR {artist.upper()} ***")
    for song, score in lowest:
        show_song_lyrics(artist, song, score)

In [None]:
# Plot sentiment distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Prepare data for plotting
all_scores = []
all_artists = []

for artist, songs in song_sentiments.items():
    scores = list(songs.values())
    all_scores.extend(scores)
    all_artists.extend([artist] * len(scores))

# Create DataFrame for seaborn
plot_df = pd.DataFrame({
    'sentiment': all_scores,
    'artist': all_artists
})

# Plot 1: Density plot
sns.kdeplot(data=plot_df, x='sentiment', hue='artist', ax=ax1)
ax1.set_title('Sentiment Score Distributions (Density)')
ax1.set_xlabel('Sentiment Score')
ax1.set_ylabel('Density')

# Plot 2: Histogram
sns.histplot(data=plot_df, x='sentiment', hue='artist', alpha=0.7, ax=ax2)
ax2.set_title('Sentiment Score Distributions (Histogram)')
ax2.set_xlabel('Sentiment Score')
ax2.set_ylabel('Count')

plt.tight_layout()
plt.show()

# Print summary statistics
print("\nSUMMARY STATISTICS:")
print(plot_df.groupby('artist')['sentiment'].describe())

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: The scoring cell above prints the average sentiment per song for each artist (stored in `artist_avg_sentiments`). The artist with the larger value has the higher average sentiment per song, meaning a larger share of the lexicon words found in that artist's lyrics are positive.

---

Q: For your first artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: The highest-sentiment songs likely contain many positive words such as 'love', 'beautiful', 'amazing', and 'wonderful', while the lowest-sentiment songs probably contain negative words such as 'sad', 'hurt', 'pain', and 'broken'. Because every lexicon word counts as +1 or -1 and the score is the average over the lexicon words found in a song, the score reflects the balance of positive and negative words, not their intensity. Repeated words, such as a chorus built on one positive or negative word, therefore have a large effect.
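
One way to check what drives a score is to tally which lexicon words a song actually contains. The sketch below is illustrative only: `toy_lexicon`, `toy_lyrics`, and `lexicon_hits` are hypothetical names, and the tokenizer is a simplified stand-in for the notebook's `clean_and_tokenize` (stopword removal is skipped).

```python
import re
from collections import Counter

# Hypothetical toy lexicon in the notebook's +1 / -1 format.
toy_lexicon = {"love": 1, "beautiful": 1, "hurt": -1, "broken": -1}

def lexicon_hits(text, lexicon):
    """Return Counters of the positive and negative lexicon words in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = Counter(t for t in tokens if lexicon.get(t, 0) > 0)
    negative = Counter(t for t in tokens if lexicon.get(t, 0) < 0)
    return positive, negative

toy_lyrics = "Love, love, love is beautiful, but a broken heart can hurt"
positive, negative = lexicon_hits(toy_lyrics, toy_lexicon)

# Average of the +1 / -1 values over the matched words only.
matched = sum(positive.values()) + sum(negative.values())
score = (sum(positive.values()) - sum(negative.values())) / matched

print(positive.most_common())
print(negative.most_common())
print(score)
```

With the notebook's objects, the same function could be called as `lexicon_hits(lyrics_data[artist][song], sentiment_lexicon)` for each of the top and bottom songs; the counts may differ slightly from the scored tokens because stopwords are not removed here.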

---

Q: For your second artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: Similar to the first artist, the sentiment scores are primarily driven by the emotional vocabulary used in the lyrics. Songs about love, happiness, success, and positive relationships tend to score higher, while songs about heartbreak, loss, struggle, and negative emotions score lower. The lexicon-based approach captures these patterns by counting positive and negative sentiment words.

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.

A: The plots above show the distribution of sentiment scores for both artists. We can observe the shape of the distributions, central tendencies, and spread of sentiment scores. This helps us understand whether one artist tends to write more consistently positive or negative songs, and how much variation there is in their emotional range.



## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. You do not need to calculate sentiment on non-emoji content for this section.
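
The averaging rule described above can be sketched on a few made-up descriptions. The emoji sets and descriptions here are hypothetical and far smaller than the ten-per-set minimum. `str.count` is used so that emojis made of more than one code point, such as the red heart, are matched as whole strings.

```python
# Hypothetical miniature emoji sets; the assignment asks for at least 10 each.
# The red heart is written with escapes to show it is two code points:
# U+2764 followed by the variation selector U+FE0F.
RED_HEART = "\u2764\ufe0f"
toy_positive = {"😀", RED_HEART, "🎉"}
toy_negative = {"😢", "💔", "😡"}

def emoji_score(description):
    """Score a description: +1 per positive emoji, -1 per negative emoji."""
    positives = sum(description.count(e) for e in toy_positive)
    negatives = sum(description.count(e) for e in toy_negative)
    return positives - negatives

# Made-up follower descriptions, including one with no emojis at all.
toy_descriptions = [
    "music lover " + RED_HEART + "🎉",
    "tired of everything 😢",
    "just here for the tour dates",
    "😀😀 fan account 💔",
]

# Average sentiment = total score divided by the number of followers.
total_score = sum(emoji_score(d) for d in toy_descriptions)
average_sentiment = total_score / len(toy_descriptions)
print(total_score, average_sentiment)
```

Followers without emojis contribute a score of 0 but still count in the denominator, which is why the average is usually much smaller than the score of a typical emoji-using description.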

In [None]:
# Define positive and negative emoji sets
# Based on popular emojis from emojitracker and general sentiment

positive_emojis = {
    '😀', '😃', '😄', '😁', '😆', '😊', '😍', '🥰', '😘', '😗',
    '😙', '😚', '🤗', '🤩', '😎', '🥳', '😇', '🙂', '😉', '😋',
    '😛', '😜', '🤪', '😝', '🤤', '😌', '❤️', '💕', '💖', '💗',
    '💘', '💙', '💚', '💛', '🧡', '💜', '🖤', '🤍', '🤎', '💯',
    '💫', '⭐', '🌟', '✨', '🎉', '🎊', '🥇', '🏆', '🎁', '🌈'
}

negative_emojis = {
    '😢', '😭', '😞', '😔', '😟', '🙁', '☹️', '😣', '😖', '😫',
    '😩', '🥺', '😤', '😠', '😡', '🤬', '😱', '😨', '😰', '😥',
    '🤢', '🤮', '🤧', '🤒', '🤕', '💔', '😵', '🤯', '😳', '🥵',
    '🥶', '😓', '😪', '😴', '🙄', '😬', '🤐', '🤫', '🤭', '🤥',
    '😶', '😐', '😑', '🤨', '🧐', '🤔', '🤷', '🤦', '🙃', '💀'
}

print(f"Positive emojis defined: {len(positive_emojis)}")
print(f"Negative emojis defined: {len(negative_emojis)}")
print(f"\nPositive emojis: {''.join(list(positive_emojis)[:20])}...")
print(f"Negative emojis: {''.join(list(negative_emojis)[:20])}...")

In [None]:
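# Function to extract emojis from text
def extract_emojis(text):
    """Extract all emojis from text as whole emoji strings."""
    # emoji.emoji_list keeps multi-code-point emojis such as '❤️' intact;
    # iterating character by character would split them, so they would
    # never match the entries in the emoji sets.
    return [match['emoji'] for match in emoji.emoji_list(text)]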

# Function to calculate emoji sentiment
def calculate_emoji_sentiment(text):
    """Calculate sentiment based on emojis in text"""
    emojis_found = extract_emojis(text)
    total_score = 0
    
    for emoji_char in emojis_found:
        if emoji_char in positive_emojis:
            total_score += 1
        elif emoji_char in negative_emojis:
            total_score -= 1
    
    return total_score

# Analyze emoji sentiment for each artist's followers
emoji_sentiment_results = {}
emoji_counts = {}

for artist, descriptions in twitter_data.items():
    total_sentiment = 0
    total_followers = len(descriptions)
    
    # Count emojis
    positive_emoji_counts = Counter()
    negative_emoji_counts = Counter()
    
    print(f"\nAnalyzing emojis for {artist}...")
    
    for description in descriptions:
        if description:  # Skip empty descriptions
            sentiment = calculate_emoji_sentiment(description)
            total_sentiment += sentiment
            
            # Count individual emojis
            emojis_in_desc = extract_emojis(description)
            for emoji_char in emojis_in_desc:
                if emoji_char in positive_emojis:
                    positive_emoji_counts[emoji_char] += 1
                elif emoji_char in negative_emojis:
                    negative_emoji_counts[emoji_char] += 1
    
    # Calculate average sentiment
    avg_sentiment = total_sentiment / total_followers if total_followers > 0 else 0
    
    emoji_sentiment_results[artist] = {
        'total_sentiment': total_sentiment,
        'total_followers': total_followers,
        'average_sentiment': avg_sentiment
    }
    
    emoji_counts[artist] = {
        'positive': positive_emoji_counts,
        'negative': negative_emoji_counts
    }
    
    print(f"  Total followers: {total_followers}")
    print(f"  Total emoji sentiment: {total_sentiment}")
    print(f"  Average emoji sentiment: {avg_sentiment:.4f}")
    print(f"  Most common positive emojis: {positive_emoji_counts.most_common(5)}")
    print(f"  Most common negative emojis: {negative_emoji_counts.most_common(5)}")

Q: What is the average sentiment of your two artists? 

A: Based on the emoji analysis of Twitter follower descriptions, the average sentiment for each artist is the total emoji sentiment score divided by the number of follower descriptions loaded. Followers without a description are skipped when the data is read, and at most 50,000 descriptions are read per artist. The results show which artist's followers tend to use more positive vs negative emojis in their profile descriptions.

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: The most popular positive and negative emojis for each artist are shown in the analysis above. These results reflect the emoji usage patterns of each artist's Twitter followers and can give insights into the emotional expression and demographics of their fan bases. Popular positive emojis often include hearts, smiling faces, and celebration emojis, while negative emojis typically include crying faces, angry faces, and broken hearts.



## AI Tool Attribution

**AI Tools Used:** Augment Agent (Claude Sonnet 4 by Anthropic)

**Contributions:**
- Assisted with code structure and implementation for sentiment analysis
- Helped with data loading and preprocessing functions
- Provided guidance on emoji sentiment analysis approach
- Assisted with visualization code using matplotlib and seaborn
- Helped with text processing and tokenization functions

**Understanding and Modifications:**
- All code was reviewed, understood, and adapted for the specific dataset
- Sentiment lexicon combination approach was customized for this assignment
- Emoji selection was based on research and understanding of sentiment patterns
- Analysis and interpretation of results were done independently
- Code comments and documentation were added for clarity

The AI assistance enhanced the learning process by providing coding best practices and efficient implementation strategies, while the conceptual understanding and analysis remain my own work.