# **Sentiment & Emotion Analysis**
***BAMP 2022 - MCT 4 - Ante Jelavic, Franziskus Perkhofer, Manuel Mencher, Melissa Ewering, Tim Ritzheimer***

*Short description*: This script builds on the data (tweets) collected in the script "DataCollectionTweetsPerUser". Its purpose is to analyze the sentiments and emotions of these tweets using Transformer-based deep neural networks. First, the input data is checked and cleaned; the output data is checked and cleaned in the same way after the analysis to prepare it for the script "VisualizationOfResults".

Since the data is not labeled, the analysis is based on pre-trained Transformer models from Hugging Face. More concretely, two models were used:


1.   siebert/sentiment-roberta-large-english (Heitmann et al. 2020)


*   Paper link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3489963
*   Huggingface link: https://huggingface.co/siebert/sentiment-roberta-large-english


> @article{heitmann2020,
  title={More than a feeling: Benchmarks for sentiment analysis accuracy},
  author={Heitmann, Mark and Siebert, Christian and Hartmann, Jochen and Schamp, Christina},
  journal={Available at SSRN 3489963},
  year={2020}
}

2.   j-hartmann/emotion-english-roberta-large (Hartmann, 2022)


*   Reference: Jochen Hartmann, "Emotion English DistilRoBERTa-base". https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/, 2022.


> @misc{hartmann2022emotionenglish,
  author={Hartmann, Jochen},
  title={Emotion English DistilRoBERTa-base},
  year={2022},
  howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}},
}


Both Transformer models are fine-tuned checkpoints of RoBERTa-large (Liu et al. 2019) - Paper link: https://arxiv.org/pdf/1907.11692.pdf. The respective script sections are largely based on the documentation and scripts provided by the authors. 

For performance reasons, this script was created and executed on Google Colab and then saved as an .ipynb and a .pdf file. All input and output files were therefore stored on the connected Google Drive account and transferred afterwards. Running this script locally may require adjustments to the dependencies and loaded packages.





In [1]:
# Loading packages & dependencies

# For dealing with json responses we receive from the API
import json
# For displaying the data after
import pandas as pd
# For saving the response data in CSV format
import csv
# For parsing the dates received from twitter in readable formats
import datetime
import dateutil.parser
import unicodedata
#To add wait time between requests
import time
#enable downloading output files from colab environment
from google.colab import files

# Data Preparation before Analysis

This part loads the data (tweets) and performs a simple analysis and some cleaning steps to prepare the dataset for sentiment and emotion analysis.


In [2]:
#Import data from csv and xls files stored on Google Drive

from google.colab import drive
drive.mount('/content/drive')


file_name_1 = "/content/drive/MyDrive/Colab Notebooks/Tweets.csv"
file_name_2 = "/content/drive/MyDrive/Colab Notebooks/Demographics.xlsx"

rawTweets = pd.read_csv(file_name_1) # Tweets.csv (see Output_Data)
demographics = pd.read_excel(file_name_2) #Demographics.xlsx (see Input_Data)

Mounted at /content/drive




In [3]:
# Check the number of tweets per user and per language and save to csv
pd.set_option('display.max_rows', 102)
pd.set_option('display.max_columns', 50)
crosstab = pd.crosstab(rawTweets['author_id'], rawTweets['lang'])
crosstab.to_csv("crosstab.csv") # (see Output_Data)
crosstab

author_id,am,ar,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fr,hi,ht,hu,in,is,it,iw,ja,lt,lv,ml,nl,no,pl,ps,pt,ro,ru,sl,sv,tl,tr,uk,und,ur,vi,zh
1294741.0,0,0,1,1,3,0,2,0,3193,4,0,0,0,0,4,0,3,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,29,0,0,0
5715682.0,0,0,0,0,0,1,1,0,3190,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,17,0,0,0
6612402.0,0,0,2,1,4,1,5,0,3012,9,3,1,0,0,12,0,3,2,9,0,5,0,0,0,2,0,4,4,0,0,2,3,0,0,2,26,0,0,118,0,0,0
6705042.0,0,0,0,0,0,0,3,0,3154,8,0,0,0,0,7,1,1,0,2,0,1,0,0,0,0,0,0,0,2,0,2,0,0,0,0,2,0,0,61,0,0,0
8161232.0,0,0,0,0,0,0,0,0,3223,1,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,2,0,0,16,0,0,0
9950972.0,0,0,0,2,1,0,0,0,764,2,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,30,0,0,0
12354830.0,0,0,4,0,1,0,9,0,2958,3,3,0,0,0,3,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,260,0,0,0
12455300.0,0,0,5,0,2,2,28,0,2856,16,7,1,0,0,53,0,2,1,4,0,3,0,0,1,1,0,3,1,2,0,2,3,0,1,0,5,2,0,224,0,0,0
14157130.0,0,0,0,0,0,0,0,0,3230,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,0,0,0
14680600.0,0,0,0,0,0,1,1,0,1253,15,4,0,0,8,3,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,16,0,0,0


Conclusion: 198430 tweets are in English, and every profile tweeted in English at least once. The three users with the fewest English tweets posted 13, 78 and 84. The user with the most English tweets posted approx. 3200 (possibly more, since 3200 is the maximum number of tweets per user that can be retrieved through the Twitter API). The second most frequent language tag is undefined ("und") with 10307 tweets, followed by German with 188 tweets. Because so few tweets are not in English, all non-English tweets were dropped to simplify the following analysis.
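
The figures in this conclusion can be read off the crosstab. The following is a minimal sketch on a small made-up frame (the name `toy_tweets` and all values are assumptions, not project data): column sums give the number of tweets per language, and `nsmallest` gives the users with the fewest English tweets.

```python
import pandas as pd

# Toy stand-in for rawTweets (hypothetical values, not the project data)
toy_tweets = pd.DataFrame({
    "author_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "lang":      ["en", "en", "und", "en", "de", "en", "en", "en", "und"],
})

toy_crosstab = pd.crosstab(toy_tweets["author_id"], toy_tweets["lang"])

# Total number of tweets per language, most frequent first
tweets_per_lang = toy_crosstab.sum(axis=0).sort_values(ascending=False)

# The two users with the fewest English tweets
fewest_english = toy_crosstab["en"].nsmallest(2)

# Users without a single English tweet (empty in this toy example)
no_english = toy_crosstab.index[toy_crosstab["en"] == 0].tolist()

print(tweets_per_lang)
print(fewest_english)
print(no_english)
```

Applied to the real data, the same calls would be run on `crosstab` instead of `toy_crosstab`.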

In [4]:
# drop all tweets not in english
enTweets = rawTweets.drop(rawTweets[rawTweets["lang"] != "en"].index)
len(enTweets)

198430

Analysis of hashtags (hashtags indicate that a post belongs to a certain topic)

In [5]:
# Function to extract the hashtags from text s
def extract_hash_tags(s):
  return set(part[1:] for part in s.split() if part.startswith('#'))
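
Splitting on whitespace only finds hashtags that start a whitespace-separated token. A hashtag glued to preceding text is missed (this appears to happen with `@InfrarougeMag!#Wi...` in the output of the next cell), and trailing punctuation stays attached. One possible alternative is a regular expression; the function name `extract_hash_tags_regex` and the pattern are assumptions for illustration, not part of the original script.

```python
import re

# Assumed pattern: '#' followed by one or more word characters (letters, digits, underscore)
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hash_tags_regex(s):
    """Return the set of hashtags in s, without the leading '#'."""
    return set(HASHTAG_PATTERN.findall(s))

def extract_hash_tags(s):
    # Original whitespace-based version, repeated here for comparison
    return set(part[1:] for part in s.split() if part.startswith('#'))

sample = "Great shoot!#Paris and #COVID19, stay safe"
print(extract_hash_tags(sample))        # whitespace-based: misses 'Paris', keeps the comma
print(extract_hash_tags_regex(sample))  # regex-based: finds both hashtags
```

Note that the regex would also pick up URL fragments such as `https://example.com/#section`, so URLs may need to be removed first.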

In [6]:
# Extract all hashtags and save them in a separate column
extracted_hashtags = []
for x in range(len(enTweets["text"])):
    hashtags = list(extract_hash_tags(enTweets["text"].iloc[x]))
    extracted_hashtags.append(hashtags)
    
enTweets["Hashtags"] = extracted_hashtags
enTweets.tail()

Unnamed: 0.1,Unnamed: 0,referenced_tweets,text,author_id,created_at,id,conversation_id,source,reply_settings,lang,public_metrics,in_reply_to_user_id,geo,withheld,Hashtags
211238,10,,Loved doing this shoot with @InfrarougeMag!#Wi...,262502487.0,2019-01-05T12:36:57.000Z,1081529988464168960,1081529988464168960,Hootsuite Inc.,everyone,en,"{'retweet_count': 48, 'reply_count': 33, 'like...",,,,[INFRAROUGE]
211239,11,,Have you been to see it yet?#TheEmperorOfParis...,262502487.0,2019-01-04T14:10:05.000Z,1081191038247714816,1081191038247714816,Hootsuite Inc.,everyone,en,"{'retweet_count': 18, 'reply_count': 20, 'like...",,,,[LEmpereurDeParis]
211240,12,,When you realise the holidays are over...#Holi...,262502487.0,2019-01-03T12:15:04.000Z,1080799703019806721,1080799703019806721,Hootsuite Inc.,everyone,en,"{'retweet_count': 56, 'reply_count': 44, 'like...",,,,[BackToWork]
211241,13,,Riding into #January like... @InfrarougeMag ht...,262502487.0,2019-01-02T13:50:05.000Z,1080461228886110209,1080461228886110209,Hootsuite Inc.,everyone,en,"{'retweet_count': 33, 'reply_count': 15, 'like...",,,,[January]
211242,14,,HAPPY NEW YEAR! 🎉 Who’s feeling like this afte...,262502487.0,2019-01-01T15:45:04.000Z,1080127776810766336,1080127776810766336,Hootsuite Inc.,everyone,en,"{'retweet_count': 43, 'reply_count': 54, 'like...",,,,[Hello2019]


In [7]:
# Add a unique ID to each tweet in the dataset
enTweets.insert(0, 'Unique_ID', range(0, len(enTweets)))

In [8]:
# Save a new table with each hashtag assigned to a Unique_ID (related to tweets)
hash_df = pd.DataFrame(columns=['Unique_ID', 'Hashtag'])
for x in range(len(enTweets["Unique_ID"])):
    for hashtag in enTweets["Hashtags"].iloc[x]:
        # x equals the tweet's Unique_ID because the IDs were assigned as 0..n-1
        hash_df.loc[len(hash_df)] = [x, hashtag]

hash_df.to_csv("Hashtags.csv") # see Output_Data
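
Appending one row at a time with `hash_df.loc[len(hash_df)]` becomes slow for roughly 200,000 tweets. A possible vectorized alternative, sketched here on toy data, uses `DataFrame.explode`; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for enTweets with the two relevant columns (hypothetical values)
toy = pd.DataFrame({
    "Unique_ID": [0, 1, 2],
    "Hashtags": [["COVID19", "NHS"], [], ["Brexit"]],
})

# One row per (tweet, hashtag) pair; tweets without hashtags yield NaN and are dropped
hash_df_fast = (
    toy[["Unique_ID", "Hashtags"]]
    .explode("Hashtags")
    .dropna(subset=["Hashtags"])
    .rename(columns={"Hashtags": "Hashtag"})
    .reset_index(drop=True)
)
print(hash_df_fast)
```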

In [9]:
hash_df['Hashtag'].value_counts()[:30] #show top 30 most frequent hashtags 

MakeHumanityGreatAgain    558
COVID19                   519
Ad                        432
TEAMSM                    432
VirginFamily              368
TheApprentice             361
TomorrowsPapersToday      323
JoinIn                    300
SistersInLaw              276
COP26                     224
ESG                       220
100bookshops              199
WayTooEarly               195
AfterLife                 190
ad                        176
Brexit                    161
StopBrexit                152
SuperNature               152
coronavirus               125
BorisJohnson              123
BlackLivesMatter          118
thecroonersessions        105
DOOH                       98
Peston                     97
PMQs                       97
MusicPlayedByHumans        96
100Bookshops               95
celebrityApprentice        95
NHS                        88
5GoldRings                 86
Name: Hashtag, dtype: int64

The following topic groups seem to be relevant for a large group of users and are therefore analyzed in more depth as examples:


*   COVID (e.g. hashtags: #COVID19, #coronavirus, #corona, ...)
*   Brexit (e.g. hashtags: #Brexit, #StopBrexit, ...)
*   ESG (e.g. hashtags: #COP26, #ESG, ...)

In the next steps, we analyze the hashtags and cluster different spellings and synonyms into these three groups. This is manual work; however, the other use case of this BAMP (topic modelling) shows a way to automate this task.

In [10]:
# Function to iterate over the hashtags and check for related searchterms in order to cluster hashtags to topic groups
def checkHashtag (hashtag_series, searchterm_series):
  hashtag_found = []
  for hashtag_list in hashtag_series:
    val = 0
    for hashtag in hashtag_list:
      for searchterm in searchterm_series:
        if searchterm.lower() in hashtag.lower():
          val = 1
    hashtag_found.append(val)
  return hashtag_found
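
Because `checkHashtag` matches search terms as substrings, it also flags hashtags that merely contain a term. The sketch below repeats the same logic in compact form (the name `check_hashtag_compact` and the sample hashtags are illustrative assumptions) and shows one such false positive.

```python
def check_hashtag_compact(hashtag_series, searchterms):
    """Return 1 for each hashtag list containing any search term as a substring, else 0."""
    terms = [t.lower() for t in searchterms]
    return [
        int(any(term in hashtag.lower() for hashtag in hashtag_list for term in terms))
        for hashtag_list in hashtag_series
    ]

sample_hashtags = [["COVID19"], ["StopBrexit", "NHS"], ["Coronation"], []]

# 'Coronation' is flagged because it contains the substring 'corona'
print(check_hashtag_compact(sample_hashtags, ["covid", "corona"]))  # [1, 0, 1, 0]
```

If such false positives matter, the flagged hashtags could be reviewed manually, or short terms could be matched against the full hashtag instead.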

In [11]:
# Search for topic groups by hashtag and save those into the tweets dataset
enTweets['covid_hashtags'] = checkHashtag(enTweets["Hashtags"], ['covid', 'corona', 'pandemic', 'biontech', 'moderna', 'lockdown'])
enTweets['brexit_hashtags'] = checkHashtag(enTweets["Hashtags"], ['brexit'])
enTweets['esg_hashtags'] = checkHashtag(enTweets["Hashtags"], ['esg', 'sustainability', 'COP26', 'climate', 'co2', 'humanrights', 'blacklives', 'racism', 'sexism'])

enTweets.to_csv("enTweetsNew.csv") # see Output_Data
enTweets.head()

Unnamed: 0.1,Unique_ID,Unnamed: 0,referenced_tweets,text,author_id,created_at,id,conversation_id,source,reply_settings,lang,public_metrics,in_reply_to_user_id,geo,withheld,Hashtags,covid_hashtags,brexit_hashtags,esg_hashtags
0,0,0,"[{'type': 'retweeted', 'id': '1476916508265783...","RT @TheElders: ""True peace is never won by dip...",8161232.0,2021-12-31T15:38:31.000Z,1476940842531229721,1476940842531229721,Twitter Web App,everyone,en,"{'retweet_count': 39, 'reply_count': 0, 'like_...",,,,[],0,0,0
1,1,1,,My thoughts on COVID and its effects on younge...,8161232.0,2021-12-27T10:00:18.000Z,1475406173557997569,1475406173557997569,Hootsuite Inc.,everyone,en,"{'retweet_count': 131, 'reply_count': 214, 'li...",,,,[],0,0,0
2,2,2,,"Thank you Arch for your love, life, laughter a...",8161232.0,2021-12-26T10:00:04.000Z,1475043727840366597,1475043727840366597,Hootsuite Inc.,everyone,en,"{'retweet_count': 144, 'reply_count': 51, 'lik...",,,,[],0,0,0
3,3,3,,I’m so sad that Archbishop Tutu has passed awa...,8161232.0,2021-12-26T08:09:55.000Z,1475016006603001860,1475016006603001860,Hootsuite Inc.,everyone,en,"{'retweet_count': 772, 'reply_count': 197, 'li...",,,,[],0,0,0
4,4,4,,Happy Christmas from my family to yours. https...,8161232.0,2021-12-25T09:23:01.000Z,1474672016800239618,1474672016800239618,Twitter Media Studio,everyone,en,"{'retweet_count': 81, 'reply_count': 112, 'lik...",,,,[],0,0,0


In [12]:
# Download output files to save locally
#files.download('Hashtags.csv')
#files.download('enTweetsNew.csv')

# Sentiment Analysis
This section will perform the sentiment analysis based on a pretrained transformer model.

In [13]:
# Import required packages and transformers library
import torch
import pandas as pd
import numpy as np
!pip install transformers 
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found exis

In [14]:
# Create class for data preparation
class SimpleDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}

In [15]:
# Load tokenizer and model, create trainer
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)


In [16]:
# Tokenize texts and create prediction data set
pred_texts = enTweets["text"].dropna().astype('str').tolist()
tokenized_texts = tokenizer(pred_texts,truncation=True,padding=True)
pred_dataset = SimpleDataset(tokenized_texts)

In [17]:
# Run predictions
predictions = trainer.predict(pred_dataset)

***** Running Prediction *****
  Num examples = 198430
  Batch size = 8


In [18]:
# Transform predictions to labels
preds = predictions.predictions.argmax(-1)
labels = pd.Series(preds).map(model.config.id2label)
scores = (np.exp(predictions[0])/np.exp(predictions[0]).sum(-1,keepdims=True)).max(1)
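
The `scores` line applies a softmax to the raw model outputs (logits) and keeps the highest class probability per tweet. A minimal sketch of the same computation in a numerically more stable form, assuming the logits arrive as a 2-D NumPy array (the helper name `softmax_scores` and the sample values are hypothetical):

```python
import numpy as np

def softmax_scores(logits):
    """Return (class probabilities, highest probability per row) for 2-D logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum leaves the result unchanged but avoids overflow in exp
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs, probs.max(axis=-1)

toy_logits = [[-3.0, 3.0], [2.0, -2.0], [0.0, 0.0]]
probs, top = softmax_scores(toy_logits)
print(probs.argmax(axis=-1))  # predicted class index per row
print(top)                    # confidence of the predicted class
```

In the notebook, `predictions.predictions` (equivalently `predictions[0]`) would take the place of `toy_logits`.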

In [19]:
# Create DataFrame with texts, predictions, labels, and scores
df_results = pd.DataFrame(list(zip(pred_texts,preds,labels,scores)), columns=['text','pred','label','score'])
df_results.insert(0, 'Unique_ID', range(0, len(df_results)))
df_results.head()

Unnamed: 0,Unique_ID,text,pred,label,score
0,0,"RT @TheElders: ""True peace is never won by dip...",1,POSITIVE,0.998705
1,1,My thoughts on COVID and its effects on younge...,1,POSITIVE,0.997372
2,2,"Thank you Arch for your love, life, laughter a...",1,POSITIVE,0.998721
3,3,I’m so sad that Archbishop Tutu has passed awa...,0,NEGATIVE,0.996188
4,4,Happy Christmas from my family to yours. https...,1,POSITIVE,0.998654


# Emotion Analysis

This section performs the emotion analysis based on a pretrained transformer model. As this model is very similar to the one used above for sentiment analysis, it reuses certain code sections and variables (for example `pred_dataset` and `pred_texts`). Therefore, please make sure to run the Sentiment Analysis section first and only afterwards the Emotion Analysis section.

In [20]:
# load tokenizer and model, create trainer
model_name = "j-hartmann/emotion-english-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)

All model checkpoint weights were used when initializing RobertaForSequenceClassification.

In [21]:
# Run predictions
predictions = trainer.predict(pred_dataset)

***** Running Prediction *****
  Num examples = 198430
  Batch size = 8


In [22]:
# Transform predictions to labels
preds = predictions.predictions.argmax(-1)
labels = pd.Series(preds).map(model.config.id2label)
scores = (np.exp(predictions[0])/np.exp(predictions[0]).sum(-1,keepdims=True)).max(1)

In [23]:
# Raw softmax scores for all emotion classes (one row per tweet)
temp = (np.exp(predictions[0])/np.exp(predictions[0]).sum(-1,keepdims=True))

In [24]:
# Containers for the per-emotion scores
anger = []
disgust = []
fear = []
joy = []
neutral = []
sadness = []
surprise = []

# extract scores (as many entries as exist in pred_texts)
for i in range(len(pred_texts)):
  anger.append(temp[i][0])
  disgust.append(temp[i][1])
  fear.append(temp[i][2])
  joy.append(temp[i][3])
  neutral.append(temp[i][4])
  sadness.append(temp[i][5])
  surprise.append(temp[i][6])
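
The loop above copies each column of `temp` into its own list and hard-codes the label order. A hedged alternative builds the columns in one step from a label mapping. Here `toy_id2label` and `toy_probs` are made-up stand-ins for `model.config.id2label` and `temp`, and the order shown simply mirrors the indices used in the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.config.id2label
toy_id2label = {0: "anger", 1: "disgust", 2: "fear", 3: "joy",
                4: "neutral", 5: "sadness", 6: "surprise"}

# Hypothetical stand-in for temp: one row of class probabilities per tweet
toy_probs = np.array([
    [0.05, 0.05, 0.10, 0.60, 0.10, 0.05, 0.05],
    [0.02, 0.01, 0.02, 0.03, 0.05, 0.85, 0.02],
])

# Column names taken from the mapping instead of being typed by hand
columns = [toy_id2label[i] for i in range(toy_probs.shape[1])]
emotion_scores = pd.DataFrame(toy_probs, columns=columns)

# Label of the highest-scoring emotion per row
emotion_scores["emotion_label"] = emotion_scores[columns].idxmax(axis=1)
print(emotion_scores)
```

Reading the names from the mapping keeps the columns correct even if a model orders its labels differently.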

In [25]:
# Create DataFrame with texts, predictions, labels, and scores
df_restuls_emotions = pd.DataFrame(list(zip(pred_texts,preds,labels,scores,  anger, disgust, fear, joy, neutral, sadness, surprise)), columns=['text','pred','emotion_label','score', 'anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise'])
df_restuls_emotions.insert(0, 'Unique_ID', range(0, len(df_restuls_emotions)))
df_restuls_emotions.head()

Unnamed: 0,Unique_ID,text,pred,emotion_label,score,anger,disgust,fear,joy,neutral,sadness,surprise
0,0,"RT @TheElders: ""True peace is never won by dip...",2,fear,0.318116,0.264952,0.000666,0.318116,0.06245,0.184452,0.137633,0.031732
1,1,My thoughts on COVID and its effects on younge...,4,neutral,0.845129,0.013473,0.011618,0.020255,0.011955,0.845129,0.061033,0.036537
2,2,"Thank you Arch for your love, life, laughter a...",3,joy,0.828227,0.004068,0.000749,0.001675,0.828227,0.111543,0.012796,0.040943
3,3,I’m so sad that Archbishop Tutu has passed awa...,5,sadness,0.960134,0.001699,0.00051,0.00323,0.010924,0.006881,0.960134,0.016623
4,4,Happy Christmas from my family to yours. https...,3,joy,0.74693,0.002606,0.000694,0.001642,0.74693,0.068365,0.023596,0.156166


# Data combination and storage

This section combines the results collected in the different data frames and merges all relevant data into one file, which serves as the input for the visualization of the results.
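
Because `text`, `pred` and `score` occur in more than one frame, a plain merge produces suffixed duplicates such as `text_x` and `pred_y`. One way to keep the result tidy, sketched on toy frames (all names and values below are hypothetical), is to merge only the new columns and let pandas verify through `validate` that `Unique_ID` is unique on both sides.

```python
import pandas as pd

# Toy stand-ins for the tweets, sentiment and emotion frames (hypothetical values)
toy_tweets = pd.DataFrame({"Unique_ID": [0, 1, 2], "text": ["a", "b", "c"]})
toy_sentiment = pd.DataFrame({"Unique_ID": [0, 1, 2], "text": ["a", "b", "c"],
                              "label": ["POSITIVE", "NEGATIVE", "POSITIVE"],
                              "score": [0.99, 0.97, 0.98]})
toy_emotion = pd.DataFrame({"Unique_ID": [0, 1, 2], "text": ["a", "b", "c"],
                            "emotion_label": ["joy", "sadness", "neutral"],
                            "score": [0.8, 0.9, 0.7]})

merged = (
    toy_tweets
    # Drop the duplicated text and give each score a descriptive name before merging
    .merge(toy_sentiment.drop(columns="text").rename(columns={"score": "sentiment_score"}),
           on="Unique_ID", validate="one_to_one")
    .merge(toy_emotion.drop(columns="text").rename(columns={"score": "emotion_score"}),
           on="Unique_ID", validate="one_to_one")
)
print(merged.columns.tolist())
```

With `validate="one_to_one"`, pandas raises a `MergeError` if a key is duplicated, which would catch a misaligned `Unique_ID` early.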



In [26]:
# Check that the results and the tweet data have the same length
print(len(enTweets))
print(len(df_results))
print(len(df_restuls_emotions))

198430
198430
198430


In [27]:
# Merge sentiment and emotion analysis results with tweets
merged_df1 = pd.merge(enTweets, df_results, on="Unique_ID") #Merge sentiment results with tweets data
merged_df2 = pd.merge(merged_df1, df_restuls_emotions, on="Unique_ID") #Merge emotion results with tweets + sentiment data
merged_df2.head()

Unnamed: 0.1,Unique_ID,Unnamed: 0,referenced_tweets,text_x,author_id,created_at,id,conversation_id,source,reply_settings,lang,public_metrics,in_reply_to_user_id,geo,withheld,Hashtags,covid_hashtags,brexit_hashtags,esg_hashtags,text_y,pred_x,label,score_x,text,pred_y,emotion_label,score_y,anger,disgust,fear,joy,neutral,sadness,surprise
0,0,0,"[{'type': 'retweeted', 'id': '1476916508265783...","RT @TheElders: ""True peace is never won by dip...",8161232.0,2021-12-31T15:38:31.000Z,1476940842531229721,1476940842531229721,Twitter Web App,everyone,en,"{'retweet_count': 39, 'reply_count': 0, 'like_...",,,,[],0,0,0,"RT @TheElders: ""True peace is never won by dip...",1,POSITIVE,0.998705,"RT @TheElders: ""True peace is never won by dip...",2,fear,0.318116,0.264952,0.000666,0.318116,0.06245,0.184452,0.137633,0.031732
1,1,1,,My thoughts on COVID and its effects on younge...,8161232.0,2021-12-27T10:00:18.000Z,1475406173557997569,1475406173557997569,Hootsuite Inc.,everyone,en,"{'retweet_count': 131, 'reply_count': 214, 'li...",,,,[],0,0,0,My thoughts on COVID and its effects on younge...,1,POSITIVE,0.997372,My thoughts on COVID and its effects on younge...,4,neutral,0.845129,0.013473,0.011618,0.020255,0.011955,0.845129,0.061033,0.036537
2,2,2,,"Thank you Arch for your love, life, laughter a...",8161232.0,2021-12-26T10:00:04.000Z,1475043727840366597,1475043727840366597,Hootsuite Inc.,everyone,en,"{'retweet_count': 144, 'reply_count': 51, 'lik...",,,,[],0,0,0,"Thank you Arch for your love, life, laughter a...",1,POSITIVE,0.998721,"Thank you Arch for your love, life, laughter a...",3,joy,0.828227,0.004068,0.000749,0.001675,0.828227,0.111543,0.012796,0.040943
3,3,3,,I’m so sad that Archbishop Tutu has passed awa...,8161232.0,2021-12-26T08:09:55.000Z,1475016006603001860,1475016006603001860,Hootsuite Inc.,everyone,en,"{'retweet_count': 772, 'reply_count': 197, 'li...",,,,[],0,0,0,I’m so sad that Archbishop Tutu has passed awa...,0,NEGATIVE,0.996188,I’m so sad that Archbishop Tutu has passed awa...,5,sadness,0.960134,0.001699,0.00051,0.00323,0.010924,0.006881,0.960134,0.016623
4,4,4,,Happy Christmas from my family to yours. https...,8161232.0,2021-12-25T09:23:01.000Z,1474672016800239618,1474672016800239618,Twitter Media Studio,everyone,en,"{'retweet_count': 81, 'reply_count': 112, 'lik...",,,,[],0,0,0,Happy Christmas from my family to yours. https...,1,POSITIVE,0.998654,Happy Christmas from my family to yours. https...,3,joy,0.74693,0.002606,0.000694,0.001642,0.74693,0.068365,0.023596,0.156166


In [28]:
# Merge demographics into above merged dataframe
demographics = demographics.rename(columns={"ID": "author_id"}) # rename to be able to merge
merged_df3 = pd.merge(merged_df2, demographics, on="author_id") # merge demographics into tweets + results data
merged_df3.tail()

Unnamed: 0.1,Unique_ID,Unnamed: 0,referenced_tweets,text_x,author_id,created_at,id,conversation_id,source,reply_settings,lang,public_metrics,in_reply_to_user_id,geo,withheld,Hashtags,covid_hashtags,brexit_hashtags,esg_hashtags,text_y,pred_x,label,score_x,text,pred_y,emotion_label,score_y,anger,disgust,fear,joy,neutral,sadness,surprise,Name,Profession,Age,Age_Group,Gender,Username
198425,198425,10,,Loved doing this shoot with @InfrarougeMag!#Wi...,262502487.0,2019-01-05T12:36:57.000Z,1081529988464168960,1081529988464168960,Hootsuite Inc.,everyone,en,"{'retweet_count': 48, 'reply_count': 33, 'like...",,,,[INFRAROUGE],0,0,0,Loved doing this shoot with @InfrarougeMag!#Wi...,1,POSITIVE,0.998841,Loved doing this shoot with @InfrarougeMag!#Wi...,2,fear,0.736871,0.024285,0.000586,0.736871,0.164261,0.029907,0.014784,0.029306,Olga Kurylenko,Model,42,40-49,w,@OlyaKurylenko
198426,198426,11,,Have you been to see it yet?#TheEmperorOfParis...,262502487.0,2019-01-04T14:10:05.000Z,1081191038247714816,1081191038247714816,Hootsuite Inc.,everyone,en,"{'retweet_count': 18, 'reply_count': 20, 'like...",,,,[LEmpereurDeParis],0,0,0,Have you been to see it yet?#TheEmperorOfParis...,1,POSITIVE,0.998488,Have you been to see it yet?#TheEmperorOfParis...,2,fear,0.750157,0.043652,0.001043,0.750157,0.010665,0.086694,0.053474,0.054314,Olga Kurylenko,Model,42,40-49,w,@OlyaKurylenko
198427,198427,12,,When you realise the holidays are over...#Holi...,262502487.0,2019-01-03T12:15:04.000Z,1080799703019806721,1080799703019806721,Hootsuite Inc.,everyone,en,"{'retweet_count': 56, 'reply_count': 44, 'like...",,,,[BackToWork],0,0,0,When you realise the holidays are over...#Holi...,1,POSITIVE,0.998353,When you realise the holidays are over...#Holi...,5,sadness,0.98366,0.002895,0.000163,0.006155,0.002504,0.001611,0.98366,0.003013,Olga Kurylenko,Model,42,40-49,w,@OlyaKurylenko
198428,198428,13,,Riding into #January like... @InfrarougeMag ht...,262502487.0,2019-01-02T13:50:05.000Z,1080461228886110209,1080461228886110209,Hootsuite Inc.,everyone,en,"{'retweet_count': 33, 'reply_count': 15, 'like...",,,,[January],0,0,0,Riding into #January like... @InfrarougeMag ht...,1,POSITIVE,0.998661,Riding into #January like... @InfrarougeMag ht...,2,fear,0.419508,0.057563,0.000886,0.419508,0.169405,0.175297,0.032972,0.144369,Olga Kurylenko,Model,42,40-49,w,@OlyaKurylenko
198429,198429,14,,HAPPY NEW YEAR! 🎉 Who’s feeling like this afte...,262502487.0,2019-01-01T15:45:04.000Z,1080127776810766336,1080127776810766336,Hootsuite Inc.,everyone,en,"{'retweet_count': 43, 'reply_count': 54, 'like...",,,,[Hello2019],0,0,0,HAPPY NEW YEAR! 🎉 Who’s feeling like this afte...,1,POSITIVE,0.998856,HAPPY NEW YEAR! 🎉 Who’s feeling like this afte...,3,joy,0.929404,0.002541,0.000355,0.002869,0.929404,0.008105,0.011832,0.044894,Olga Kurylenko,Model,42,40-49,w,@OlyaKurylenko


In [29]:
# Clean out unnecessary columns
merged_df3.drop(['Unnamed: 0','text','text_y'], axis=1, inplace=True)
merged_df3 = merged_df3.rename(columns={"text_x": "text"})
merged_df3.head()

Unnamed: 0,Unique_ID,referenced_tweets,text,author_id,created_at,id,conversation_id,source,reply_settings,lang,public_metrics,in_reply_to_user_id,geo,withheld,Hashtags,covid_hashtags,brexit_hashtags,esg_hashtags,pred_x,label,score_x,pred_y,emotion_label,score_y,anger,disgust,fear,joy,neutral,sadness,surprise,Name,Profession,Age,Age_Group,Gender,Username
0,0,"[{'type': 'retweeted', 'id': '1476916508265783...","RT @TheElders: ""True peace is never won by dip...",8161232.0,2021-12-31T15:38:31.000Z,1476940842531229721,1476940842531229721,Twitter Web App,everyone,en,"{'retweet_count': 39, 'reply_count': 0, 'like_...",,,,[],0,0,0,1,POSITIVE,0.998705,2,fear,0.318116,0.264952,0.000666,0.318116,0.06245,0.184452,0.137633,0.031732,Richard Branson,Tycoon,71,>70,m,@richardbranson
1,1,,My thoughts on COVID and its effects on younge...,8161232.0,2021-12-27T10:00:18.000Z,1475406173557997569,1475406173557997569,Hootsuite Inc.,everyone,en,"{'retweet_count': 131, 'reply_count': 214, 'li...",,,,[],0,0,0,1,POSITIVE,0.997372,4,neutral,0.845129,0.013473,0.011618,0.020255,0.011955,0.845129,0.061033,0.036537,Richard Branson,Tycoon,71,>70,m,@richardbranson
2,2,,"Thank you Arch for your love, life, laughter a...",8161232.0,2021-12-26T10:00:04.000Z,1475043727840366597,1475043727840366597,Hootsuite Inc.,everyone,en,"{'retweet_count': 144, 'reply_count': 51, 'lik...",,,,[],0,0,0,1,POSITIVE,0.998721,3,joy,0.828227,0.004068,0.000749,0.001675,0.828227,0.111543,0.012796,0.040943,Richard Branson,Tycoon,71,>70,m,@richardbranson
3,3,,I’m so sad that Archbishop Tutu has passed awa...,8161232.0,2021-12-26T08:09:55.000Z,1475016006603001860,1475016006603001860,Hootsuite Inc.,everyone,en,"{'retweet_count': 772, 'reply_count': 197, 'li...",,,,[],0,0,0,0,NEGATIVE,0.996188,5,sadness,0.960134,0.001699,0.00051,0.00323,0.010924,0.006881,0.960134,0.016623,Richard Branson,Tycoon,71,>70,m,@richardbranson
4,4,,Happy Christmas from my family to yours. https...,8161232.0,2021-12-25T09:23:01.000Z,1474672016800239618,1474672016800239618,Twitter Media Studio,everyone,en,"{'retweet_count': 81, 'reply_count': 112, 'lik...",,,,[],0,0,0,1,POSITIVE,0.998654,3,joy,0.74693,0.002606,0.000694,0.001642,0.74693,0.068365,0.023596,0.156166,Richard Branson,Tycoon,71,>70,m,@richardbranson


In [30]:
# Save to csv
merged_df3.to_csv("FinalResults.csv", index = False)
files.download('FinalResults.csv')

In [31]:
# Check that FinalResults and the tweet data have the same length
print(len(enTweets))
print(len(merged_df3))

198430
198430
