<span style="font-family:Helvetica; color:gray">These exercise materials provided to you as a student of Aalto University are protected by copyright. You are authorized to use these materials for your personal educational purposes, including completing the exercises and submitting them for grading. You are prohibited from reproducing, distributing, displaying, or sharing any portion of these materials in any form, including, but not limited to, posting on the internet or other forms of electronic communication. Aalto University reserves all rights in the exercise materials.</span>

## Exercise 4: Counting things and analysing text

In this exercise you will examine the data set in the previous round to gain insights about what people are discussing within the climate related discourse. More precisely you seek to find out whether climate activists often tend to be positive in their discussions, while deniers sometimes post negative content to disrupt the ongoing conversation. You will set out to test this hypothesis about the two user groups (activists and skeptics) and to examine the public opinion being voiced in the data set.

To get points from this exercise, you need to answer seven questions on A+. You will need to extract information from the data set, following the instructions given in this notebook. **Your implementation in this notebook will not be assessed; you only need to complete the implementation in order to answer the questions on A+.**

In [7]:
import json
from datetime import datetime, timedelta
from collections import Counter
import csv
import urllib.request
import re

import numpy as np
import pandas as pd
from scipy.special import softmax
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({'font.size': 13.5})

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import time
import random

### Data and sentiment analysis using a BERT model
The following snippet of code has been provided for you to initialize and save a pre-trained sentiment analysis model.
Function `sentiment_inference` finds the sentiment of all texts in the `data` argument and adds the sentiment score and label for each post in the data set.

In [8]:
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Save the model
model.save_pretrained(MODEL)

Downloading:   0%|          | 0.00/929 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
def sentiment_inference(data):
    senti_labels = [-1, 0, 1]  
    for post in data:
        text = post["text"]
        # Tokenize the text
        encoded_input = tokenizer(text, return_tensors='pt')

        # Run the model
        output = model(**encoded_input)

        # Score labels: -1 -> Negative; 0 -> Neutral; 1 -> Positive
        post['senti_score'] = softmax(output[0][0].detach().numpy()).tolist()
        post['senti_label'] = senti_labels[post['senti_score'].index(max(post['senti_score']))]
    
    return data

In [56]:
# Read the data set from file into a list of dictionaries.
# Make sure dataset.jsonl is in the same directory as the notebook.
all_data = []
with open("dataset.jsonl", 'r') as fdata:
    for post in fdata:
        all_data.append(json.loads(post))
print(f"There are {len(all_data)} many posts in the data set")

There are 131797 many posts in the data set


### Overall opinion
Your task is to identify the sentiment (positive, neutral, negative) for the posts in the data set. Namely, count the posts with each sentiment label. Once you have completed the `sentiment_counts` function, compute the sentiments and their counts in the second cell below and answer the corresponding question **4.2.1 Overall opinion** on A+.

In [61]:
import copy
# Your code here.
dataNew = copy.deepcopy(all_data)

for dataPoint in dataNew:
    dataPoint['date'] = datetime.strptime(dataPoint['date'], '%Y-%m-%dT%H:%M:%S.000Z')
    
# Fill in this function.
# For every post in the data, count how many posts have labels -1, +1 and 0.
# You can either return a dictionary or simply print.
# Your code here.
print(dataNew[0])

def sentiment_counts(data):
    print(f"Length of data: {len(data)}")
    dataFilter = sentiment_inference(data)
    sentimentsResult = {-1: 0, 0: 0, 1: 0}
    for post in dataFilter:
        sentimentsResult[post['senti_label']] += 1
    return sentimentsResult


{'user_id': 18224287041377074931, 'date': datetime.datetime(2024, 3, 1, 5, 22, 51), 'text': '@USER @USER @USER the fact that you think climate change is an actual emergency is a disgrace. #climatestrike #climatechangeisreal', 'id_orig': 18224287041377074931, 'repost': False}


In [19]:
# Infer the sentiments on a sampled data of size 500
# Count the sentiments using the sentiment_counts function
# The 500 sample size reduces runtime.
# If you are interested you can run the sentimenta analysis for the whole dataset
# to compare the results. Runtime is in order of 3 hours.



random.seed(42)
sample_data = random.sample(all_data, 500)
# You code here. 

sentimentsResult = sentiment_counts(sample_data)
print(sentimentsResult)

{-1: 262, 0: 207, 1: 31}


### Group size
We first identify the activists and skeptics in the data set by manually curating a seed set of hashtags for each user group: “#climatecrisis” and “#climatejustice” for activists; “#climatehoax” and “#globalwarminghoax” for skeptics. This has been done for you. Your task is to classify the users who post activist hashtags but do not post any skeptics hashtags as activists, and conversely for skeptics. Once you have populated the two lists and dictionary, answer the corresponding question **4.2.2 Group size** on A+.

In [65]:
# Seed activist and skeptic hashtags
activist_hashtags = {"#climatecrisis", "#climatejustice"}
skeptic_hashtags = {"#climatehoax", "#globalwarminghoax"}

# Populate these lists and dictionary.
per_user_hashtags = {}
activist_users = set()
skeptic_users = set()


# Your task is twofold:
# 1. First identify all the hashtags posted by each user by populating the per_user_hashtags dictionary.
# The keys are user IDs and the values a set of all used hashtags by the user ID.
# Your code here.

for post in all_data:
    allhashtagsByUser = set([token for token in post['text'].split() if token.startswith('#')])
    if post['user_id'] not in per_user_hashtags:
        per_user_hashtags[post['user_id']] = allhashtagsByUser
    else:
        per_user_hashtags[post['user_id']] = per_user_hashtags[post['user_id']].union(allhashtagsByUser)

#print(per_user_hashtags)
# 2. Then identify the users who post seed set of activist/skeptic hashtags at least once 
# AND do not post the other side's hashtags as the activist/skeptic users.
# Your code here.

for user_id in per_user_hashtags:
    hashTagsUser = per_user_hashtags[user_id]
    activistList = hashTagsUser.intersection(activist_hashtags)
    skepticList = hashTagsUser.intersection(skeptic_hashtags)
    if (len(activistList) != 0 and len(skepticList) == 0):
        activist_users.add(user_id)
    elif (len(activistList) == 0 and len(skepticList) != 0):
        skeptic_users.add(user_id)
    
print(f"There are {len(activist_users)} unique activist users")
print(f"There are {len(skeptic_users)} unique skeptic users")      

There are 727 unique activist users
There are 831 unique skeptic users


### Group activity
Your task is to identify which of the user groups activists or skeptics posted more in the data set.
Once you have populated the two lists, answer the corresponding question **4.2.3 Group activity** on A+.

In [66]:
activist_posts_all = []
skeptic_posts_all = []

# Identify all the posts made by activists/skeptics.
# Loop over all posts and check what group every user belong to, activists users or skeptic users.
# You can then add the post to the corresponding list, activist_posts_all or skeptic_posts_all
# Your code here.

for post in all_data:
    if post['user_id'] in activist_users:
        activist_posts_all.append(post)
    elif post['user_id'] in skeptic_users:
        skeptic_posts_all.append(post)
print(f"There are total {len(activist_posts_all)} activist posts on SMP")
print(f"There are total {len(skeptic_posts_all)} skeptics posts on SMP")

There are total 45422 activist posts on SMP
There are total 60370 skeptics posts on SMP


### Activists' and Skeptics' sentiment
Once you have computed the sentiment counts, answer the corresponding question **4.2.4 Activists' sentiment** and **4.2.5 Skeptics' sentiment** on A+.

**Please keep in mind that the sentiment analysis takes some time, ~10 minutes or so**.

In [67]:
# Count the posts containing at least one of the seed hashtags of one group and none from the other, for each group activist and skeptic.
# Loop through all posts, get the hashtags, and check whether there exists any of the seed hashtags for each group and none from the other group.
# You can then populate the lists in this cell accordingly, after which you will compute the sentiments on these posts.
activist_posts_seed_only = []
skeptic_posts_seed_only = []

for post in all_data:
    if post['user_id'] in activist_users:
        allhashtags = set([token for token in post['text'].split() if token.startswith('#')])
        if (allhashtags.intersection(activist_hashtags)) != 0:
            activist_posts_seed_only.append(post)
    elif post['user_id'] in skeptic_users:
        allhashtags = set([token for token in post['text'].split() if token.startswith('#')])
        if (allhashtags.intersection(skeptic_hashtags)) != 0:
            skeptic_posts_seed_only.append(post)

print(len(activist_posts_seed_only))
print(len(skeptic_posts_seed_only))
print(activist_posts_seed_only[0])
# Your code here.

45422
60370
{'user_id': 18224287041377074931, 'date': '2024-03-01T05:22:51.000Z', 'text': '@USER @USER @USER the fact that you think climate change is an actual emergency is a disgrace. #climatestrike #climatechangeisreal', 'id_orig': 18224287041377074931, 'repost': False}


In [63]:
# Compute the sentiment inference and counts for activist posts containing only the seed hashtags
start_time = time.time()

sentiment_activist = sentiment_counts(activist_posts_seed_only[1:])
print("sentiment of activists: ")
print(sentiment_activist)

# Your code here.

print(f"Sentiment inference on activists posts containing seed hashtags took {(time.time() - start_time)/60.0} minutes")

Length of data: 3937


KeyboardInterrupt: 

In [None]:
# Compute the sentiment inference and counts for skeptic posts containing only the seed hashtags
start_time = time.time()

sentiment_skeptic = sentiment_counts(skeptic_posts_seed_only)
print("sentiment of skeptics: ")
print(sentiment_skeptic)

# Your code here.

print(f"Sentiment inference on activists posts containing seed hashtags took {(time.time() - start_time)/60.0} minutes")

### Sentiment observations

Answer question **4.2.6 Sentiment observations** on A+.

### Hypothesis
Your task is to dentify the dominant sentiment for **ALL** the posts made by activists and skeptics, not just the ones containing the seed set of hashtags. Once you have done the sentiment counting, answer the question **4.2.7 Hypothesis** on A+.

In [68]:
# Compute the sentiment inference and count for all activist posts,
# using a random sample of 500 posts to reduce run time.
# If you are interested you can run the sentiment analysis for the whole data set
# to compare the results. Runtime is in order of 1-2 hours.
start_time = time.time()
random.seed(42)

activist_posts_seed_only = []
skeptic_posts_seed_only = []

for post in sample_data:
    if post['user_id'] in activist_users:
        activist_posts_seed_only.append(post)
    elif post['user_id'] in skeptic_users:
        skeptic_posts_seed_only.append(post)
            
sentiment_activist = sentiment_counts(activist_posts_seed_only[1:])
print("sentiment of activists: ")
print(sentiment_activist)


# Your code here.

print("Sentiment inference on all activists posts containing seed hashtags took", ((time.time() - start_time)/60.0))

Length of data: 180
sentiment of activists: 
{-1: 65, 0: 95, 1: 20}
Sentiment inference on all activists posts containing seed hashtags took 0.33747415939966835


In [69]:
# Compute the sentiment inference and count for all skeptics posts,
# using a random sample of 500 posts to reduce run time.
# If you are interested you can run the sentiment analysis for the whole data set
# to compare the results. Runtime is in order of 1-2 hours.
start_time = time.time()
random.seed(42)

sentiment_skeptic = sentiment_counts(skeptic_posts_seed_only)
print("sentiment of skeptics: ")
print(sentiment_skeptic)

# Your code here.

print("Sentiment inference on all skeptics posts containing seed hashtags took", ((time.time() - start_time)/60.0))

Length of data: 235
sentiment of skeptics: 
{-1: 165, 0: 65, 1: 5}
Sentiment inference on all skeptics posts containing seed hashtags took 0.40694276889165243
