# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy to read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [31]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

stop_words = stopwords.words("english")

In [47]:
# Add any additional import statements you need here

import glob
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
from nltk import download
import string
import matplotlib.pyplot as plt

download('opinion_lexicon')
download('punkt')
download('stopwords')

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/joseguarneros/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joseguarneros/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joseguarneros/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
# Emoji sets for scoring Twitter descriptions: +1 for positive, -1 for negative
positive_emojis = {'😊', '😍', '🥰', '😄', '😁', '👍', '❤️', '🔥', '✨', '🥳'}
negative_emojis = {'😢', '😡', '😭', '💔', '👎', '😠', '😞', '😩', '😤', '😔'}


In [51]:
import seaborn as sns

AttributeError: module 'matplotlib.cm' has no attribute 'register_cmap'

In [33]:
# change `data_location` to the location of the folder on your machine.
data_location = "M1 Results 2/"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

In [34]:
# Build full paths to the data folders.
# Note: the lyrics path below covers only the "cher" subfolder.
twitter_folder = os.path.join(data_location, "twitter")
lyrics_folder = os.path.join(data_location, "lyrics/cher")


def read_txts_to_df(folder_path):
    """Read every .txt file in folder_path into a DataFrame (filename, text)."""
    data = []
    txt_files = glob.glob(os.path.join(folder_path, "*.txt"))
    for filepath in txt_files:
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        filename = os.path.basename(filepath)
        data.append({"filename": filename, "text": content})
    return pd.DataFrame(data)

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 
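
The two-level dictionary described above could be built along the following lines. This is only a sketch: the function name `read_lyrics`, the one-subfolder-per-artist layout, and the convention that the first line of each file holds the quoted song title are assumptions for illustration, not part of the assignment.

```python
from collections import defaultdict
from pathlib import Path


def read_lyrics(lyrics_root):
    """Return {artist: {song_title: full_text}} for every .txt file found.

    Assumes one subfolder per artist and a quoted title on the first line
    of each file; falls back to the file stem when a file is empty.
    """
    lyrics = defaultdict(dict)
    for path in sorted(Path(lyrics_root).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8")
        lines = text.strip().splitlines()
        title = lines[0].strip('"') if lines else path.stem
        lyrics[path.parent.name][title] = text
    return dict(lyrics)
```

With this structure, `lyrics["cher"]["Stars"]` would return the full file contents for that song, which makes it easy to inspect individual titles later.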




In [35]:
# Read in the lyrics data
lyrics_df = read_txts_to_df(lyrics_folder)
lyrics_df.head()

Unnamed: 0,filename,text
0,cher_comeandstaywithme.txt,"""Come And Stay With Me""\n\n\n\nI'll send away ..."
1,cher_pirate.txt,"""Pirate""\n\n\n\nHe'll sail on with the summer ..."
2,cher_stars.txt,"""Stars""\n\n\n\nI was never one for saying what..."
3,cher_thesedays.txt,"""These Days""\n\n\n\nWell I've been out walking..."
4,cher_lovesohigh.txt,"""Love So High""\n\n\n\nEvery morning I would wa..."


In [36]:
# Read in the twitter data
twitter_df = read_txts_to_df(twitter_folder)
twitter_df.head()

Unnamed: 0,filename,text
0,cher_followers_data.txt,screen_name\tname\tid\tlocation\tfollowers_cou...
1,robynkonichiwa_followers_data.txt,screen_name\tname\tid\tlocation\tfollowers_cou...
2,cher_followers.txt,id\n35152213\n742153090850164742\n149646300645...
3,robynkonichiwa_followers.txt,id\n1424055675030806529\n1502717352575651840\n...


In [37]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a 
# dictionary for this.

positive_words = pd.DataFrame({'word': opinion_lexicon.positive(), 'score': 1})
negative_words = pd.DataFrame({'word': opinion_lexicon.negative(), 'score': -1})
sentiment_df = pd.concat([positive_words, negative_words], ignore_index=True)

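def tokenize_and_score(text, sentiment_df):
    """Return the summed lexicon score (+1/-1 per matched token) for text."""
    if not isinstance(text, str):
        return 0

    # Normalize: drop literal escape sequences, quotes, newlines, and punctuation
    text = re.sub(r'\\n|\\r|\\t', ' ', text)
    text = re.sub(r'["\']', '', text)
    text = text.replace('\n', ' ')
    text = text.strip().lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Keep alphabetic tokens that are not stop words
    # (`stop_words` is defined in the import cell)
    tokens = text.split()
    tokens = [token for token in tokens if token not in stop_words and token.isalpha()]

    # Left-join tokens against the lexicon; unmatched tokens score 0
    token_df = pd.DataFrame({'word': tokens})
    merged = token_df.merge(sentiment_df, on='word', how='left').fillna(0)

    return merged['score'].sum()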

## Sentiment Analysis on Songs

In this section, score the sentiment for all the songs for both artists in your data set. Score the sentiment by manually calculating the sentiment using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.
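
As a rough illustration of what "manually calculating" a lexicon score means, the sketch below sums +1 and -1 values over the tokens of a text using a plain dictionary lookup. The names `toy_lexicon`, `toy_stop_words`, and `score_text` are hypothetical, and the tiny word lists stand in for the full positive and negative lexicons.

```python
import string

# Tiny stand-in lexicon for illustration only; the assignment uses the full
# positive/negative word lists, mapped to +1 and -1 in the same way.
toy_lexicon = {"love": 1, "good": 1, "strong": 1, "crazy": -1, "wrong": -1}
toy_stop_words = {"the", "a", "is", "my", "me", "you"}


def score_text(text, lexicon, stop_words=frozenset()):
    """Sum lexicon scores over the alphabetic, non-stop-word tokens of text."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in cleaned.split() if t.isalpha() and t not in stop_words]
    return sum(lexicon.get(t, 0) for t in tokens)


print(score_text("My love does it good", toy_lexicon, toy_stop_words))  # 2
print(score_text("You drive me crazy", toy_lexicon, toy_stop_words))    # -1
```

Because the score is a raw sum, longer songs with many repeated sentiment words will tend to have larger absolute scores, which is worth keeping in mind when interpreting the results.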


In [38]:
# your code here
lyrics_df['sentiment_score'] = lyrics_df['text'].apply(lambda x: tokenize_and_score(x, sentiment_df))

In [41]:
lyrics_df['clean_text'] = lyrics_df['text'].apply(lambda x: ' '.join(x.lower().replace('\n', ' ').split()[:20]))
lyrics_df[['filename', 'clean_text', 'sentiment_score']].head()

Unnamed: 0,filename,clean_text,sentiment_score
0,cher_comeandstaywithme.txt,"""come and stay with me"" i'll send away all my ...",3.0
1,cher_pirate.txt,"""pirate"" he'll sail on with the summer wind th...",11.0
2,cher_stars.txt,"""stars"" i was never one for saying what i real...",-1.0
3,cher_thesedays.txt,"""these days"" well i've been out walking and i ...",1.0
4,cher_lovesohigh.txt,"""love so high"" every morning i would wake up a...",10.0


In [43]:
lyrics_df['artist'] = lyrics_df['filename'].str.extract(r'^([a-zA-Z]+)_')

### Question 1

In [44]:
artist_avg = lyrics_df.groupby('artist')['sentiment_score'].mean().sort_values(ascending=False)
print(artist_avg)

artist
cher    3.528481
Name: sentiment_score, dtype: float64


### Question 2/3

In [46]:
for artist in lyrics_df['artist'].unique():
    print(f"\n--- Top & Bottom Songs for {artist.title()} ---")

    artist_df = lyrics_df[lyrics_df['artist'] == artist]

    top_3 = artist_df.nlargest(3, 'sentiment_score')
    bottom_3 = artist_df.nsmallest(3, 'sentiment_score')

    for _, row in pd.concat([top_3, bottom_3]).iterrows():
        print(f"\nFilename: {row['filename']}")
        print(f"Sentiment Score: {row['sentiment_score']}")
        print(f"Lyrics (excerpt):\n{row['text'][:500]}...\n")



--- Top & Bottom Songs for Cher ---

Filename: cher_perfection.txt
Sentiment Score: 48.0
Lyrics (excerpt):
"Perfection"



Hush little Baby, gotta be strong
'Cause in this world we are born to fight
Be the best, prove them wrong
A winner's work is never done, reach the top, number one

Oh, perfection
You drive me crazy with perfection
I've worn my pride as my protection
Perfection, ohh

I was taught to be tough
That the best that you can be ain't enough
Crack the whip, sacrifice
But I found out paradise had a price

I didn't know it then, but oh I know it now
You gotta work as hard as love to make th...


Filename: cher_mylove.txt
Sentiment Score: 45.0
Lyrics (excerpt):
"My Love"



When I go away
I know my heart can stay with my love
It's understood
Everywhere with my love
My love does it good, whoa
My love, oh only my love
My love does it good

And when the cupboard's bare
I'll still find something there with my love
It's understood
Everywhere with my love
My love does it so good, w

### Question 4

In [52]:
plt.figure(figsize=(10, 6))
for artist in lyrics_df['artist'].unique():
    subset = lyrics_df[lyrics_df['artist'] == artist]
    # Histogram in plain matplotlib, since the seaborn import failed in this environment
    plt.hist(subset['sentiment_score'], bins=30, density=True, alpha=0.5, label=artist.title())

plt.title("Sentiment Score Distributions by Artist")
plt.xlabel("Sentiment Score")
plt.ylabel("Density")
plt.legend()
plt.show()

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: <!-- Your answer here -->

---

Q: For your first artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: For your second artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.




## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. You do not need to calculate sentiment on non-emoji content for this section.
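
The scoring rule described above (total emoji score divided by number of followers) might be sketched as follows. The function names and the shortened five-emoji sets are hypothetical and only for illustration; the assignment asks for at least 10 per set. This sketch checks one character at a time, so it assumes every emoji in the sets is a single code point. Emojis that are typically encoded as more than one code point, such as ❤️, would need a different matching approach.

```python
# Hypothetical, shortened example sets; single-code-point emojis only.
POSITIVE = {"😊", "😍", "✨", "👍", "🔥"}
NEGATIVE = {"😢", "😡", "💔", "👎", "😞"}


def emoji_score(description, positive=POSITIVE, negative=NEGATIVE):
    """Score one description: +1 per positive emoji, -1 per negative emoji."""
    if not isinstance(description, str):
        return 0  # missing descriptions contribute nothing
    return sum((ch in positive) - (ch in negative) for ch in description)


def average_emoji_sentiment(descriptions, positive=POSITIVE, negative=NEGATIVE):
    """Total emoji score divided by the number of followers (descriptions)."""
    descriptions = list(descriptions)
    if not descriptions:
        return 0.0
    total = sum(emoji_score(d, positive, negative) for d in descriptions)
    return total / len(descriptions)


sample = ["music lover ✨✨", "heartbroken 💔", "no emojis here", None]
print(average_emoji_sentiment(sample))  # (2 - 1 + 0 + 0) / 4 = 0.25
```

Scoring each follower's description separately, rather than the whole file at once, keeps the average on a per-follower scale and makes the two artists comparable despite their different follower counts.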

In [42]:
# your code here
twitter_df['sentiment_score'] = twitter_df['text'].apply(lambda x: tokenize_and_score(x, sentiment_df))
twitter_df['clean_text'] = twitter_df['text'].apply(lambda x: ' '.join(x.lower().replace('\n', ' ').split()[:20]))
twitter_df[['filename', 'clean_text', 'sentiment_score']].head()

Unnamed: 0,filename,clean_text,sentiment_score
0,cher_followers_data.txt,screen_name name id location followers_count f...,759118.0
1,robynkonichiwa_followers_data.txt,screen_name name id location followers_count f...,44810.0
2,cher_followers.txt,id 35152213 742153090850164742 149646300645197...,0.0
3,robynkonichiwa_followers.txt,id 1424055675030806529 1502717352575651840 150...,0.0


In [53]:
twitter_df['artist'] = twitter_df['filename'].str.extract(r'^([a-zA-Z]+)_')

### Question 1

In [54]:
artist_avg = twitter_df.groupby('artist')['sentiment_score'].mean()
print(artist_avg)

artist
cher              379559.0
robynkonichiwa     22405.0
Name: sentiment_score, dtype: float64


### Question 2

In [56]:
def extract_emojis(text):
    # Per-character check: multi-code-point emojis (e.g. ❤️) are not matched whole
    return [char for char in text if char in emoji.EMOJI_DATA]

twitter_df['emojis'] = twitter_df['text'].apply(extract_emojis)

emoji_counts = twitter_df.explode('emojis').groupby(['artist', 'emojis']).size().reset_index(name='count')

positive_popular = (
    emoji_counts[emoji_counts['emojis'].isin(positive_emojis)]
    .sort_values(['artist', 'count'], ascending=[True, False])
    .groupby('artist')
    .first()
)

negative_popular = (
    emoji_counts[emoji_counts['emojis'].isin(negative_emojis)]
    .sort_values(['artist', 'count'], ascending=[True, False])
    .groupby('artist')
    .first()
)

print("Most popular positive emojis per artist:\n", positive_popular)
print("\nMost popular negative emojis per artist:\n", negative_popular)


Most popular positive emojis per artist:
                emojis  count
artist                      
cher                ✨  45846
robynkonichiwa      ✨   3217

Most popular negative emojis per artist:
                emojis  count
artist                      
cher                💔   2001
robynkonichiwa      💔     72


Q: What is the average sentiment of your two artists? 

A: <!-- Your answer here --> 

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: <!-- Your answer here --> 

