# Week 8
## Overview

It's the last time we meet in class for exercises! And to celebrate this milestone, I've put together a very nice little set of exercises. And if you're behind, don't worry. The workload is low!

  - Part A: First, we play around with sentiment analysis
  - That's it!


# Part A: Sentiment analysis

Sentiment analysis is another highly useful technique which we'll use to make sense of the Wiki
data. Further, experience shows that it might well be very useful when you get to the project stage of the class.



> **Video Lecture**: Uncle Sune talks about sentiment and his own youthful adventures.



In [None]:
import nltk
from IPython.display import YouTubeVideo
YouTubeVideo("JuYcaYYlfrI",width=800, height=450)

In [None]:
# There's also this one from 2010 with young Sune's research
YouTubeVideo("hY0UCD5UiiY",width=800, height=450)

> Reading: [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752)
>


*Exercise*: Sentiment distribution.
>
> * Download the LabMT wordlist. It's available as supplementary material from [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) (Data Set S1). Describe briefly how the list was generated.
> * Based on the LabMT word list, write a function that calculates sentiment given a list of tokens (the tokens should be lower case, etc).
> * Iterate over the nodes in your network, tokenize each page, and calculate the sentiment of every single page. Now you have sentiment as a new nodal property.
> * Calculate the average sentiment across all the pages. Also calculate the median, variance, 25th percentile, 75th percentile.
> * Remember histograms? Create a histogram of all of the artists' associated page-sentiments. (And make it a nice histogram - use your histogram making skills from Week 2). Add the mean, median, etc. from above to your plot.
> * Who are the 10 artists with the happiest and saddest pages?

<div class="alert alert-block alert-info">
As long as you get the plots right, it's OK to use LLM help here.
</div>

*Exercise*: Community sentiment distribution. 
  
> * Last week we calculated the structural communities of the graph. For this exercise, we use those communities (just the 10 largest ones). Specifically, you should calculate the average sentiment of the nodes in each community to find a *community level sentiment*.
>   - Name each community by its three most connected bands. (Or feed the list of bands in each community and ask the LLM to come up with a good name for the community).
>   - What are the three happiest communities? 
>   - What are the three saddest communities?
>   - Do these results confirm what you can learn about each community by comparing to the genres, checking out the word-clouds for each community, and reading the wiki-pages? 
> * Compare the sentiment of the happiest and saddest communities to the overall (entire network) distribution of sentiment that you calculated in the previous exercise. Are the communities very different from the average? Or do you find the sentiment to be quite similar across all of the communities?

<div class="alert alert-block alert-info">
As above, feel free to go nuts with help from an LLM with this exercise for the technical parts. But try to answer the questions about interpreting the results with your own human brain.
</div>
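
As a starting point for the community exercise, here is a minimal sketch of the averaging step. It assumes you already have a list of communities (each a set of node names) and a dictionary mapping each node to its page sentiment. The names `community_sentiment`, `communities`, and `node_sentiment`, and the toy values, are made up for illustration and are not taken from last week's notebook.

```python
from statistics import mean

def community_sentiment(communities, node_sentiment, top_n=10):
    """Average the node sentiments in the top_n largest communities.

    Returns (sorted member list, mean sentiment) pairs, happiest first.
    Nodes without a sentiment value (missing or None) are ignored.
    """
    largest = sorted(communities, key=len, reverse=True)[:top_n]
    results = []
    for members in largest:
        scores = [node_sentiment[n] for n in members
                  if node_sentiment.get(n) is not None]
        if scores:
            results.append((sorted(members), mean(scores)))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Toy data for illustration only (hypothetical nodes and scores)
communities = [{"A", "B", "C"}, {"D", "E"}, {"F"}]
node_sentiment = {"A": 5.0, "B": 6.0, "C": 7.0, "D": 5.5, "E": 5.5, "F": None}

ranked = community_sentiment(communities, node_sentiment, top_n=2)
for members, score in ranked:
    print(members, round(score, 2))
```

The first entries of `ranked` are the happiest communities and the last entries are the saddest, which you can then compare with the network-wide mean.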

**Note**: Calculating sentiment may take a long time, so arm yourself with patience as your code runs (remember to check that it runs correctly before waiting patiently). The tips below may speed things up. And save your results somewhere, so you don't have to start over.

**Tips for speed**
* If you use NLTK's `FreqDist` before calculating the sentiment, you only have to look up the score once for every unique word, and afterwards you can take a weighted mean (each word's score weighted by its count).
* More tips for speeding up loops: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops
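
The first tip can be sketched roughly as follows. To keep the example self-contained it uses `collections.Counter`; `nltk.FreqDist` is a subclass of `Counter`, so it can be dropped in the same way. The function name `weighted_sentiment` and the toy scores are assumptions for illustration, not the real LabMT values.

```python
from collections import Counter

def weighted_sentiment(tokens, word_scores):
    """Mean score over all scored tokens, looking up each unique word only once."""
    counts = Counter(token.lower() for token in tokens)  # nltk.FreqDist(...) also works here
    total_score = 0.0
    total_count = 0
    for word, count in counts.items():
        score = word_scores.get(word)
        if score is not None:
            total_score += score * count  # weight the score by how often the word occurs
            total_count += count
    # None signals that no token was found in the word list
    return total_score / total_count if total_count else None

# Toy scores for illustration only (not the real LabMT values)
toy_scores = {"happy": 8.0, "sad": 2.0, "the": 5.0}
tokens = ["The", "happy", "happy", "sad", "unknownword"]

print(weighted_sentiment(tokens, toy_scores))
```

The result is the same as averaging the score of every scored token one by one, but the dictionary lookup happens once per unique word instead of once per token.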

In [None]:
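# Load the LabMT word list into a dict mapping word -> average happiness score
word_scores = {}
with open("labMIT-1.0.txt", 'r', encoding='utf-8') as f:
    next(f)  # skip the first line of the file
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 3:
            try:
                # assumes column 0 is the word and column 2 is the happiness average
                word_scores[parts[0].lower()] = float(parts[2])
            except ValueError:
                continue  # skip header or malformed rows with a non-numeric score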

In [None]:
word_scores

In [None]:
import nltk

nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
import statistics
import nltk


def calculate_sentiment(text):
    """Tokenize text and summarize the LabMT scores of the tokens found in word_scores."""
    tokens = nltk.word_tokenize(text)

    # Keep the score of every token that appears in the LabMT word list
    scores = [word_scores[t.lower()] for t in tokens if t.lower() in word_scores]

    result = {'scores': scores}

    if len(scores) >= 2:
        # quantiles(n=4) returns the three quartile cut points
        quartiles = statistics.quantiles(scores, n=4)
        result['mean'] = statistics.mean(scores)
        result['median'] = statistics.median(scores)
        result['variance'] = statistics.variance(scores)
        result['percentile_25'] = quartiles[0]
        result['percentile_75'] = quartiles[2]
    elif len(scores) == 1:
        # A single score: variance and quartiles are not defined, so fall back to simple values
        result['mean'] = scores[0]
        result['median'] = scores[0]
        result['variance'] = 0.0
        result['percentile_25'] = scores[0]
        result['percentile_75'] = scores[0]
    else:
        # No scored tokens on this page
        result['mean'] = None
        result['median'] = None
        result['variance'] = None
        result['percentile_25'] = None
        result['percentile_75'] = None

    return result

In [None]:
import json
import re
from pathlib import Path

def extract_wiki_text(data):
    pages = data["query"]["pages"]
    first_page = next(iter(pages.values()))
    return first_page["revisions"][0]["slots"]["main"]["*"]

# Load a saved Wikipedia API response (JSON) and return the page's wikitext with whitespace collapsed
def get_wiki_text(current_artist_file_name):
    file_path = Path(current_artist_file_name)
    with file_path.open("r", encoding="utf-8") as f:
        data = json.load(f)

    text = extract_wiki_text(data)

    text = re.sub(r"\s+", " ", text).strip()

    #words = re.findall(r"\b[\w’-]+\b", text, flags=re.UNICODE) #if we need a list of the words

    return text

In [None]:
get_wiki_text("wikipedia_pages/10cc.json")

In [None]:
import os

sentiment_scores_of_pages = {}

items = os.listdir("wikipedia_pages")
files = [f for f in items if os.path.isfile(os.path.join("wikipedia_pages", f))]
print(f"Found {len(files)} files:")
for file in files:
    print(file)  # progress indicator
    if file.endswith(".json"):
        text = get_wiki_text(os.path.join("wikipedia_pages", file))
        sentiment = calculate_sentiment(text)['mean']
        # Skip pages without any scored words so later statistics and sorting never see None
        if sentiment is not None:
            sentiment_scores_of_pages[file] = sentiment


In [None]:
sentiment_scores_of_pages
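
Since computing the scores can take a while, one option is to cache the resulting dictionary on disk and reload it in later sessions. The sketch below is only one way to do this; the helper `load_or_compute`, the file name `sentiment_scores.json`, and the placeholder score are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def load_or_compute(cache_path, compute):
    """Return the cached JSON result if it exists, otherwise compute and save it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    result = compute()
    with path.open("w", encoding="utf-8") as f:
        json.dump(result, f, indent=2)
    return result

# Demo with a stand-in for the slow sentiment loop
calls = []

def slow_computation():
    calls.append(1)             # record that the expensive step actually ran
    return {"10cc.json": 5.4}   # placeholder value, not a real result

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "sentiment_scores.json"
    first = load_or_compute(cache_file, slow_computation)   # computes and writes the file
    second = load_or_compute(cache_file, slow_computation)  # reads the file instead

print(first == second, len(calls))
```

In the notebook you would pass a function that runs the sentiment loop and point `cache_path` at a file next to the notebook rather than a temporary directory.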

In [None]:
data = list(sentiment_scores_of_pages.values())

In [None]:
import numpy as np
from matplotlib import pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7, color='skyblue')

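# Add labels and title
plt.xlabel('Mean LabMT sentiment of page', fontsize=12)
plt.ylabel('Number of pages', fontsize=12)
plt.title('Distribution of page sentiment', fontsize=14, fontweight='bold')

# Add grid for better readability
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Mark the summary statistics on the plot
mean_val = np.mean(data)
median_val = np.median(data)
p25, p75 = np.percentile(data, [25, 75])
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2f}')
plt.axvline(p25, color='orange', linestyle=':', linewidth=2, label=f'25th percentile: {p25:.2f}')
plt.axvline(p75, color='purple', linestyle=':', linewidth=2, label=f'75th percentile: {p75:.2f}')
plt.legend()
plt.show()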

In [None]:
# Get the 10 pages with the highest mean sentiment
happiest_10 = sorted(sentiment_scores_of_pages.items(), key=lambda x: x[1], reverse=True)[:10]

# Get the 10 pages with the lowest mean sentiment
saddest_10 = sorted(sentiment_scores_of_pages.items(), key=lambda x: x[1])[:10]

print("10 Happiest:")
for key, value in happiest_10:
    print(f"{key}: {value}")

print("\n10 Saddest:")
for key, value in saddest_10:
    print(f"{key}: {value}")